This Blog Has Moved! Visit chriswarbo.tk instead!: rdf

Showing posts with label rdf. Show all posts

Tuesday, 14 December 2010

Blog Dump 7: Reasoning & Proof

Code is formal mathematical notation, defined only from the axioms specified by the language. Compilers and interpreters are assumed to make-good these axioms based on the axioms of their own implementation. At some point the axioms become the specification of a machine, wherein the machine is assumed to make-good those axioms based on either further axioms of languages (if we're in an emulator), or through the laws of Physics (if we've got physical hardware). Any failure of these axioms results in undefined behaviour of the code. Thus, it is justified to formally study programs with very high-level axioms by assuming that they're true. We can check them later, of course, by studying the implementations. Thus we don't need, even through we can have if we want, a full proof of a program's execution right down to the metal. We just need to prove what it implements given the axioms of its language. Given such high-level axiomatic specification of a language, it is straightforward (if computationally expensive and undecidable) to compare specifications and programs in any relevant language.
The issue becomes what language to use for specifications and axioms. Here there is no point inventing yet another new language, and it would be best to use a declarative language. Standard Mathematical notation, such as propositional logic, Hoare logic, Lambda calculus, etc. would suffice.
An important issue is the structure of our types, as there are two ways we would like to exploit it. We would like to be able to hack our type system when convenient, which requires knowledge of type internals. For example, if we know that C's integers are stored as a list of 32 bits then we can use the individual bits via bit masks, bit-wise operations and even via decimal equivalents. If we treat types only as the concepts which they embody then we can use more high-level reasoning like "this is an integer", and therefore apply whatever we know about integers (including swapping the implementation for a different integer representation, which breaks any hacks being used).

Knowledge should be completely bottom-up. Specify as many facts (triples) as you can, and at some point higher-order patterns may fall out. If they don't, then we write more triples. The reasoning approach should be a 'destructive' one, based on filtering. For example, a 'constructive' approach would construct a path for inference like 'trees are plants and plants are living, therefore trees are living'; whilst a 'destructive' approach would describe the undefined concept of 'tree' by saying 'X is a tree, X is a plant, X is alive'. This is less reliable than the 'constructive' approach (which can also be used), but allows us to use learning like a human: a child sees a red ball and is told "red". As far as the child knows, red balls are "red" but the word "red" could describe its shape, its size, its colour, etc. These possibilities are narrowed down (filtered, destroyed) by describing a red car as "red", a red traffic light as "red", etc. With enough examples of "red", there are enough filters to narrow down what is meant. A red ball and red traffic light are both round, perhaps that is what "red" means? No, because a red car is not round, thus "red" is not the shape. In the tree example we can say 'trees are plants', which narrows down the scope of what trees might be. We can say 'trees are alive'.

The advantage is that we have an implied generality: we don't have to choose whether a particular object is one colour or another, we can say it's both. We can say that white wine is "white", which it is, without having to worry about breaking an ontology by 'misusing' the adjective 'white'. The lack of focus lets the system 'keep an open mind'. It can clarify things and verify its own inferences by asking questions, in order of the impact of the answer (ie. whether yes or no will make the biggest differentiation of the emerging ontology): 'Is a pineapple a tree?', 'Are plants alive?', etc.

If the system is going wrong for some reason, eg. asking a string of ridiculous inferences, and its own differentiation checker isn't stopping it, then the problem is simple: more triples need to be given explicitly.

Reusing URIs is a good idea, so we can have the system search out existing ontologies and instantiations, then ask things like: "are trees plants in the same way that cars are vehicles?". In that case we've said 'trees are plants' and it has found an ontology which says 'cars are vehicles' and is wondering if the relationship is the same, ie. if it should reuse the URI of the predicate. In a grahical way we could offer a multitude of possibly preexisting definitions, along with tick boxes to indicate which ones are similar.

Blog Dump 6: The Web of Code

The original World Wide Web consists of big lines of text, called pages, which, by virtue of some ill-defined, external set of rules, somehow manage to contain knowledge. The pages can point to other places on the Web for whatever reason, and to a person this is great. However, to a machine it's just a bunch of numbers with no discernable meaning.

Mixed in with this knowledge are various well-defined, machine-understandable properties which originally denoted the presentation of the text (eg. 'b' for a bold bit, 'i' for an italic bit, etc.) but which gradually changed to instead merely split it up in various ways (eg. 'em' for a bit that needs emphasising, 'strong' for a bit which is important, etc.). The knowledge in the text, however, is still lost to the machine processing it.

This means that a person has to be involved somewhere, requiring a load of unfortuate individuals to navigate through the shit user interface of the Web, reading everything they find, until they get the bit of knowledge they were after.

These days we have search engines which slightly lessen the burden, since we can jump to places which we, as people, have reason to believe are pretty close to our destination and thus shouldn't require much navigating from. Still, though, the machines don't know what's going on.

In the Web of Data we entirely throw away the concept of a page, since it's irrelevant to our quest for knowledge. Text can be confined to pages, but knowledge can't; knowledge is pervasive, interconnected, predicated, higher-order and in general we can't contain a description of anything to a single unit without reference to other things, the decriptions of which reference other things and we end up with our 'page' on one thing actually containing the entire sum of all knowledge. Since there's only one sum of all knowledge, we don't need to use a concept like 'page' which implies that there is more than one.

With the artificial limit of pages done away with we can put stuff anywhere, as long as we use unique names (Universal Resource Identifiers, URIs) to make sure we don't get confused about which bits are talking about what. Now we've got a machine-readable, distributed, worldwide database of knowledge: that's the Web of Data.

At this point many short-sighted people think that the next step is to rewrite the old Web on top of the Web of Data, so that both humans and machines can understand it and work in harmony. These people are, of course, sooooo 20th century.

Machines aren't intelligent (yet), so there's no way we could make a serious moral argument that they are our slaves. Therefore, why aren't they doing everything? There should be no relevant information kept back from the machine, and nothing it outputs should contain any new knowledge which can't already be found on the Web of Data. If we want to refine what it programatically generates, we should do so by adding new information to the Web of Data until it knows what we want, and thus nobody else need specify that data again.

To me, as a programmer, there is an obvious analogy to be made:

The original coding system consists of big lines of text, called files, which, by virtue of some well-defined, external set of rules, somehow manage to contain computation. The files can import other files in the system for whatever reason, and to a person this is great. However, to a machine it's just a bunch of calculations with no discernable meaning.

Mixed in with this computation are various well-defined, machine-understandable properties which originally denoted the representation of the data (eg. 'int' for a 32bit integer, 'double' for a 64bit rational, etc.) but which gradually changed to instead merely split it up in various ways (eg. 'class' for a bit that contains related parts, 'module' for a bit which is self-contained, etc.). The computation in the text, however, is still lost to the machine processing it.

This means that a person has to be involved somewhere, requiring a load of unfortuate individuals to navigate through the shit user interface of the system, reading everything they find, until they get the bit of calculation they were after.

These days we have search engines which slightly lessen the burden, since we can jump to places which we, as people, have reason to believe are pretty close to our destination and thus shouldn't require much navigating from. Still, though, the machines don't know what's going on.

In the Web of Code we entirely throw away the concept of a file, since it's irrelevant to our quest for computation. Text can be confined to files, but computation can't; computation is pervasive, interconnected, predicated, higher-order and in general we can't contain a serialisation of anything to a single unit without reference to other things, the serialisations of which reference other things and we end up with our 'file' on one thing actually containing the entire sum of all computation. Since there's only one sum of all computation, we don't need to use a concept like 'file' which implies that there is more than one.

With the artificial limit of files done away with we can put stuff anywhere, as long as we use unique names (Universal Resource Identifiers, URIs) to make sure we don't get confused about which bits are talking about what. Now we've got a machine-readable, distributed, worldwide database of computation: that's the Web of Code.

At this point many short-sighted people think that the next step is to rewrite the old coding system on top of the Web of Code, so that both humans and machines can understand it and work in harmony. These people are, of course, sooooo 20th century.

Machines aren't intelligent (yet), so there's no way we could make a serious moral argument that they are our slaves. Therefore, why aren't they doing everything? There should be no relevant information kept back from the machine, and nothing it outputs should contain any new calculation which can't already be found in the Web of Code. If we want to refine what it programatically generates, we should do so by adding new information to the Web of Code until it knows what we want, and thus nobody else need specify that process again.

What Is The Web of Code?

The Web of Code is code like any other. However, the operations it performs are not on memory, they are on things in Web of Code. Memory is just a cache. The Web of Code is as high-level and sparse as possible, describing only what it needs to and no more. If we want to alert the user then we alert the user, we do not want to display rectangles and render glyphs, so we do not display rectangles and render glyphs, these are low-level details which can be worked out through reasoning and search on the Web of Code.

Monday, 27 April 2009

Some Nice Things

I've not posted for a while, due to a mixture of an increasing workload, the ability to let off a constant barrage of my thoughts to Identi.ca rather than build them up into a blog post, and my constant disdain for Web-based apps.

So what do I want to blog about? Nothing particularly structured, just some stuff that I find interesting. Keep in mind though, that my definition of interesting includes the fact that 12cm optical discs have increased their storage capacity by 2 orders of magnitude in the 27 years from the CD to the BluRay, whilst in the same time frame the capacity of 3 1/2" hard drives has gone up 12 orders of magnitude. (I'm writing an essay on Optical Data Storage for a Physics module :) )

For those of you who may remember Deluxe Paint on AGA capable Amigas I can heartily recommend that you check out Grafx2, which seems to work on pretty much every OS and has recently been added to Debian, so you can install it by ticking "grafx2" in any package manager, it will be downloaded and installed along with everything it depends on :) Doesn't seem to do animation yet, as far as I can tell, which is a shame.

Also recently added to Debian is Closed World Model, cwm. This is pretty special, since it takes cutting edge computer knowledge representation as used by the Semantic Web, and makes it accessible via a tool similar to UNIX's (and of course GNU's) classic sed tool. For example, you can use a command like "cwm --rdf inputfile1.rdf inputfile2.rdf --n3 inputfile3.n --rdf --think --pipe > output.rdf" to take at all of the knowledge from the RDF files inputfile1.rdf, inputfile2.rdf and inputfile3.n (in RDF-XML and Notation3 formats), comparing the knowledge they contain, and dumping all of the new knowledge it can infer into the RDF-XML file output.rdf. For example, inputfile1.rdf could contain statements that Chris Warburton is a student, Chris Warburton has a website http://www.freewebs.com/chriswarbo and that Chris Warburton has a brother David Warburton. inputfile2.rdf could say that Brothers are related and that Brothers share a Mother. inputfile3.n could say that David Warburton has a blog at http://fun-chips.blogspot.com and David Warburton has a mother Cheryl Warburton. cwd would then combine these and the output file would contain deductions such as David Warburton is related to a student, http://www.freewebs.com/chriswarbo is run by a student and Chris Warburton has a mother Cheryl Warburton.

This is pretty cool, since it commoditises the previously tricky area of RDF access, allowing it to be scripted, for example in the backend of Web sites, in the same way that Imagemagick has done to images (eg. for thumbnailing).

Pretty cool. Anyway, it's getting late so I should get some sleep now.

I'm going to post some of my programming experiments soon, so look out for them :)

Pretty cool. Anyway, it's getting late so I should get some sleep now.

I'm going to post some of my programming experiments soon, so look out for them :)

Thursday, 11 September 2008

Model-View-Controller

First there was Last.fm's site redesign, now there's Facebook's. Both have been given a hard time, however I don't personally give a flying fuck about their website design. The *ONLY* useful thing from last.fm is the artist information. Fuck the personal statistics. Fuck the Shoutboxes. Fuck the Friends. The *ONLY* reason accounts exist is to prevent spamming which would skew the artist information database. So, do you want to see how last.fm REALLY looks? Well there are a few styles: text and XML. There is also an API (Application Programmer's Interface) available, but it requires an account to use so it can go and die in a bin.

If you are particularly clueless about how computers work then you might be thinking to yourself "those look like shit", and you would be right. However, structure is FAR more important than presentation, and those documents are well structured. The reason thinking along those lines is clueless is because such people have never seen a computer program, they've never pressed the View Source button in their Web browser and they've never taken apart any gadgets in their life. Those of you who have done at least one of those things will know that beauty is only skin deep. A website might look pretty, but the HTML the pages are made of look as ugly as sin. The HTML will have a decent structure though, which means a Web browser can be stuck in between the user and the HTML to make it look pretty.

What I am saying is that last.fm allow access to their database in a structured way, which allows applications (which are completely stupid and need to be told exactly what to do, hence the need for structure) to display the data in whatever the hell way the user wants. You don't need to use their website, since code for putting data into and getting data from last.fm exists inside every decent music player (Amarok, Banshee, Listen, etc.). Your choice of application is completely up to you, and you can keep using that application for as long as last.fm's web services maintain their current structure. If the structure or protocol or something changes, then that's not too bad for anyone using a well written piece of Free Software. If you're using a proprietary program to display it then you're knackered and I hope you've learned a valuable lesson.

Now let's look at the recent Facebook change. Can I access Facebook's database via a well defined and structured interface? Can I bollocks. That means I'm stuck with whatever the almighty Facebook deities bestow upon me, and I'd better pray that I like it because it's all they allow me. Thank fuck I don't use it.

For those of you who haven't realised it by now I am talking about the Semantic Web and the Model View Controller architecture. In the World Wide Web the stuff that gets passed around is HTML. That HTML can contain any data, is layed out in a certain way and the structure is very freeform, as long as it meets a few rules defined in the HTML and XHTML standards. The Semantic Web is different. In the Semantic Web the data is structured in a very specific way, in RDF triples to be precise. There is NO layout in the Semantic Web, since it is not about anything visual like pages, it is about *concepts* and *meaning*. Applications you use which can access the Semantic Web (there are as many as people can create, rather than the Web's singleton the Web Browser) can do whatever the hell they like with the structured information they receive. There's no arguments about the layout and look of a semantic Facebook because there's NO SUCH THING as the "layout" or "look" of anything on the Semantic Web. The layout and look are ENTIRELY up to the user and which program they decide to view it with. Some users may even view it with applications they access via the World Wide Web.

PS: I would just like to point out the first sentence of Wikipedia's World Wide Web article:

"The World Wide Web (commonly shortened to the Web) is a system of interlinked hypertext documents accessed via the Internet." Thank you.

Sunday, 7 September 2008

Services, integration, communication and standards

These days it is possible to do pretty much anything over the Internet. There's eBay, online banking, PayPal, Flickr, OpenStreetMap, OpenDesktop, Email, chat, forums, Wikis, Facebook, scrobbling, blogs, etc. The big problem, in my mind, is that of technological walled gardens.

A walled garden is an area which cannot be accessed without permission, usually by registering some form of account. Facebook is a classic example of a walled garden. Walls are useful, since having everyone's bank accounts available without restriction would be a problem to say the least. A technological walled garden would be an enclosed, restricted area which can only be accessed via certain technological means. Technological walled gardens are often simpler to implement than open systems, but often the reason the garden operator does this is because they see this as a way to run a dot-com or Web-2.0 business.

Let's take an example, Yahoo! Mail, Windows Live Mail and Gmail, which are all walled gardens in the classical sense, an account is needed and the login details must be provided in order to access anything. The first two, however, are also technological walled gardens: whilst mechanisms to send, retrieve, check and manage email have been around for decades, from "get my mail" (POP3) and "send my mail" (SMTP) to more sophisticated "synchronise this lot of email with that lot" (IMAP) and are well defined, standardised, understood and implemented, in order to access Yahoo! Mail or Windows Live Mail you still need to log in via their website because they don't use any of these standards. Gmail supports them, which is how I can use Evolution and Kmail to manage my Gmail account. Yahoo and Microsoft specifically disable them (I know Yahoo used to allow POP3 and SMTP access, when they stopped I moved away from them) with the reasoning that Evolution and Kmail don't display adverts, whereas their websites do. Here interoperability and standardisation desired by customers (if it wasn't used then there's no point disabling it, since nobody would be unexposed to adverts and the POP/SMTP/IMAP server load would be zero) is sacrificed in order to force adverts onto users who don't want them. This of course doesn't even touch upon the flexibility of using an email client (screen readers and other accessibility for the disabled, offline access, complete choice of interface (including Web), etc.).

That is the major reason why I refuse to join Facebook, MySpace, etc. I cannot buy or download Facebook's software and run it on my own machine,and even if I managed to write my own there would be no way to make it talk to Facebook's own servers. Since the entire point of Facebook is the stuff in their database, this would make my software useless. Hence Facebook have created a technological walled garden: If I joined Facebook then I would be sending any data I entered into a blackhole as far as accessing it on my terms is concerned.

Last.fm is better, since although their server/database software is a trade secret (as far as I can tell), the Audio Scrobbler system they use to gather data is completely documented and has many implementations, many of which are Free Software (including the official player). The contents of their database is also available, and doesn't even require an account (I have tried to think of ways to submit similar artists without an account, such as submitting two tracks at a time and building the data from that, but I suppose spam would be too much of a problem). Only the artist recommendations/similarity, tags and thingsare available, but that's the entire reason I use it, fuck the site with all of its Flash, poor social networking, confusing messaging system and stuff, that's all fluff around the periphery of the useful information. Essentially last.fm is like Gmail: I can't get the code which runs it, but I can get my data in and out in standardised ways which can be implemented with Free Software. I could make my own server which synchronises with with their database via the available protocols, and thus get out everything that I put in.

Now, the issue of synchronisation is interesting. How do you keep two things synchronised? There are a few different approaches and each has its place:

Known, unchanging, unmoving data

Here HTML can be used, ie. a web page. This is fine for people, and for applications needing that data it can simply be copied into the application once. An example would be an "about" page.

Unknown, unchanging, unmoving data

Here HTML can still be used, but since the data is not know beforehand it can be hard for an application to get anything useful from it. RDFa can be used inside the HTML to label each piece of information, thus an application only needs to be told what to find and it will look through the labels until it does, regardless of page structure, layout, etc. An example would be a scientific paper.

Changing data which is accessed once at a time

Here RSS or ATOM can be used. This allows changes to be specified in the file. ATOM is a standard, but RSS is a dialect of RDF which means labelled data is possible. An example would be a changelog.

Changing data which is accessed on every change

Here XMPP PubSub can be used. This means that there is no checking for updates since the source will push any changes out to subscribers when they are made. This doesn't use a file, it uses a stream. This is what my library is designed to accomplish. An example would be a blog.

Two-way communication and instruction execution

Here systems such as JOLIE can be used, overlaying protocols like SOAP. This can be used for dynamically generated data like database queries and searches, as well as for sending instructions such as "Approve Payment". An example would be a shop.

Notice that the first technology, HTML, is the only one which needs a Web browser to access. RDFa, ATOM and RSS are all structured in a way that allows applications to handle them directly, no human is needed and thus many more possibilities are available. XMPP can also be structured, since RDF, ATOM and RSS can all be sent over XMPP, allowing machines to handle the data, but doing so in an as-needed basis which makes things more scalable. JOLIE covers a range of protocols which are all inerently machine-focused, they are executing instructions. This might be a "buy this laptop" instruction when a button is pressed, or a "search for this" instruction when a query is entered.

These technologies allow data and communication to break free of implementation and visualisation. The next Facebook doesn't have to be a centralised Web site, it can be a distributed system with many servers run by different people interacting with each other to allow a scalable, unowned network, like the Web but made of pure information without the overhead of layout, styles, widgets or interfaces. This is called the Semantic Web. All of the visualisation, interface and style can be swappable and implemented where it is wanted, for instance as desktop applications on user's machines, or as Web sites running on various servers. There is no reason why, in an interconnected world, I should have to visit Facebook.com in order to interact with Facebook.

Except, of course, that Facebook wants to force unwanted adverts onto their customers.

This Blog Has Moved! Visit chriswarbo.tk instead!

Tuesday, 14 December 2010

Blog Dump 7: Reasoning & Proof

Blog Dump 6: The Web of Code

Monday, 27 April 2009

Some Nice Things

Thursday, 11 September 2008

Model-View-Controller

Sunday, 7 September 2008

Services, integration, communication and standards

Pictury Linkies

Nice Places

Blog Archive

My Ideas