Sunday 19 December 2010

More ocPortal Thoughts

I've just finished a 0.1 version of my Flattr addon for ocPortal, which means that I'm looking for a new project to hack on. I've created a list of ideas in Basket, which includes many of the things I mentioned in my last post. Since I've taken some time off over Christmas I thought I'd try and make a start on getting ocPortal integrated with the Federated Social Web framework that's emerging. This includes:

WebFinger support: WebFinger is an extension of the classic UNIX finger utility. It allows arbitrary data to be associated with an email address. There are 2 parts to this: ocPortal can take advantage of WebFinger data by populating profiles with existing public information, and it can produce WebFinger data by making public profile information available to services looking for it.

Salmon support: Salmon allows content to flow upstream (ie. towards the source). It means that, for example, a news article can be syndicated by many different sites, and comments from all of them will be sent upstream to the source. There's an issue here with comment spam, since we would like this to be automatic (ie. CAPTCHA wouldn't work), but once the protocol is supported we can keep it turned off until we work out a solution to spam.

PubSubHubbub: This is a publish-subscribe mechanism, similar in principle to XMPP PubSub, which I wrote a library for a couple of years ago. It allows arbitrary content to be sent downstream to everyone subscribed to a source; for example, comments on a news article can be sent to everyone who is syndicating it. Combined with Salmon, this allows content like comments on a republished article to be sent upstream to the original, which can then push them out to everyone who's republished that article (there's a rough sketch of a subscription request after this list).

OStatus support: This is a group of standards designed to interoperate with each other. It allows users on different sites to follow each other's activity, comment and converse. There's a really clear breakdown of how to make a site/application/service OStatus capable, and the nice part is that each step towards integration gives useful functionality. If we want to support the whole lot, this depends on WebFinger, Salmon and PubSubHubbub support.
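To make the PubSubHubbub item a little more concrete, here's a rough sketch of what a subscription request looks like under the 0.3 spec. It's just an HTTP POST of a few form fields; the hub, topic and callback URLs below are invented placeholders, not anything ocPortal actually uses:

import urllib
import urllib2

# Parameters defined by the PubSubHubbub 0.3 spec; the URLs are placeholders.
params = urllib.urlencode({
    'hub.mode': 'subscribe',
    'hub.topic': 'http://example.org/news/rss.xml',           # the feed we want pushed to us
    'hub.callback': 'http://example.org/push_callback.php',   # where the hub should POST updates
    'hub.verify': 'async',                                     # let the hub verify the subscription asynchronously
})
urllib2.urlopen('http://hub.example.org/', params)             # POST to the hub; it replies 202/204 on success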

There are some ocPortal hacks knocking around in Subversion to log user activity in a "Bob made a comment on Foo" way, so this could be piped into any social web structure I manage to put in place. Ideally, ocPortal should be able to pass the SWAT0 test.

If I manage to get some of the above done then I might have a go at turning ocPortal into a Diaspora "pod". From what I can tell, this also requires WebFinger and Salmon support, but the only documentation I can find for Diaspora's message-passing interface is pretty sparse. When I do get around to this I think I'll have to have a chat with some Diaspora devs.

Oh well, time to install Apache, MySQL, PHP and ocPortal on my XO :)

UPDATE: Holidays are over. I've done a lot of reading of specifications, and I've started hacking the "Salmonpress" plugin for WordPress into ocPortal. This also includes a skeletal use of WebFinger and, the main reason I chose it, "Magic Signatures". I've cleaned up the code to remove WordPress-specific stuff, mostly put ocPortal replacements in place, and have got rid of its stupid use of classes and static methods (otherwise known as functions, which is what I've turned them into). Haven't worked out the best way to get this working for all content types (or as much as possible) yet. Also, didn't need MySQL on my XO, I used ocPortal's built-in XML database :)

Tuesday 14 December 2010

Blog Dump 8: Building a Language

We want a high-level language. We want it to have objects. We want it to be dynamic (latent typing). We would like it to be functional. We want it to be pure; both in an object-oriented sense (everything is an object, there are no 'built-in types' or objects which differ in implementation from user-created objects) and in a functional sense (we don't want implied state or mutable values, we want referential transparency, we want functions which take one object and return one object, we want currying).

To do this we should use an Id object system:

The master object is defined as a reference to another object (eg. a pointer) and some (unknown) contents. Thus we can say the type of an object is:

type(object) := (reference × contents)

This object can be used as a prototype by a function we'll call clone:

clone :: object -> object
clone :: (reference × contents) -> (reference × contents)

The cloned object's reference points to the cloned-from object, whilst the contents of the object should be independent of any other. We can't initialise it with default values, since they're to be immutable and thus we'd never get a non-default object! We must therefore forget about this object's contents until they're set or accessed, which requires us to be lazy in some respect. We don't necessarily need to use lazy computation, delaying all computation until we know everything we need to know. We could deal with memory allocation and such if we knew the size of the object, or if we had functions in place to reallocate memory when the contents become more defined.

Every reference is an object, every reference points to an object and every object contains a reference. This gives a large number of objects, but we will not reduce this in the design phase.

To specify the behaviour of an object we look in its vtable, which is what the reference points to. A vtable is also an object, since we want a pure object system. Thus we have to make the root object somehow, ensuring that the mechanisms used to do so can be coerced into objects afterwards. This, once again, relies on a lazy approach to data:

# Get an 'empty' (ie. undefined) object
object = make_raw_object()
# Make another empty object
object_vtable = make_raw_object()
# Use this second object to describe the behaviour of 'object'
give_reference(object, ref_to(object_vtable))
# Now make a third empty object
vtable_vtable = make_raw_object()
# This third object describes the behaviour of behaviour objects
give_reference(object_vtable, ref_to(vtable_vtable))
# Since the third object is a behaviour, it describes itself too
give_reference(vtable_vtable, ref_to(vtable_vtable))

Now we have the Id object model: everything is an object, our objects have a reference to some behaviour (vtable) along with some contents. Each object has a behaviour (object has object_vtable, object_vtable has vtable_vtable and vtable_vtable has vtable_vtable). We have used functions 'make_raw_object', 'ref_to' and 'give_reference'. Not only are these impure (they are not in our object model, and are probably not even objects if we're using a low-level starting point), but they also break the object system by allowing new object hierarchies to be defined! Thus they should not exist in the final system, which is OK since they are not needed after we have bootstrapped the object system; they can live in a bootstrap compiler which can be thrown away once it's compiled a native compiler.
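As a sanity check of the bootstrap, here's a throwaway sketch in Python (obviously not the pure system being described, just an illustration): objects are modelled as mutable records with a reference and some contents, and the three bootstrap functions are implemented in the most naive way possible.

class RawObject(object):
    def __init__(self):
        self.reference = None   # filled in later by give_reference
        self.contents = None    # left undefined until it's set

def make_raw_object():
    return RawObject()

def ref_to(obj):
    return obj                  # in this toy model a reference is just the object itself

def give_reference(obj, ref):
    obj.reference = ref

# The bootstrap from the pseudocode above:
obj = make_raw_object()
obj_vtable = make_raw_object()
give_reference(obj, ref_to(obj_vtable))
vtable_vtable = make_raw_object()
give_reference(obj_vtable, ref_to(vtable_vtable))
give_reference(vtable_vtable, ref_to(vtable_vtable))

assert vtable_vtable.reference is vtable_vtable   # the loop at the top of the hierarchy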

Now we have a skeletal object hierarchy
object -> number -> integer -> natural
object_vtable -> num_vtable -> int_vtable -> nat_vtable

object -> vtable -> num_vtable -> int_vtable -> nat_vtable
object_vtable -> vtabl_vtabl -> vtabl_vtabl -> vtabl_vtabl -> vtabl_vtabl

object -> vtable -> vtable_vtable
object_vtable -> vtable_vtable -> vtable_vtable
Now we loop the hierarchy around so that the vtable's reference points to vtable_vtable:

This is where the post trails off. From a quick skim I'm pretty certain I've got Id right so far, although I've stopped short of mentioning anything directly functional (eg. function objects, continuations, etc.). Maybe I'll finish this off some time.

Blog Dump 7: Reasoning & Proof

Code is formal mathematical notation, defined only from the axioms specified by the language. Compilers and interpreters are assumed to make-good these axioms based on the axioms of their own implementation. At some point the axioms become the specification of a machine, wherein the machine is assumed to make-good those axioms based on either further axioms of languages (if we're in an emulator), or through the laws of Physics (if we've got physical hardware). Any failure of these axioms results in undefined behaviour of the code. Thus, it is justified to formally study programs with very high-level axioms by assuming that they're true. We can check them later, of course, by studying the implementations. Thus we don't need, even though we can have one if we want, a full proof of a program's execution right down to the metal. We just need to prove what it implements given the axioms of its language. Given such a high-level axiomatic specification of a language, it is straightforward (if computationally expensive and undecidable) to compare specifications and programs in any relevant language.
The issue becomes what language to use for specifications and axioms. Here there is no point inventing yet another new language, and it would be best to use a declarative language. Standard Mathematical notation, such as propositional logic, Hoare logic, Lambda calculus, etc. would suffice.
An important issue is the structure of our types, as there are two ways we would like to exploit it. We would like to be able to hack our type system when convenient, which requires knowledge of type internals. For example, if we know that C's integers are stored as a list of 32 bits then we can use the individual bits via bit masks, bit-wise operations and even via decimal equivalents. If we treat types only as the concepts which they embody then we can use more high-level reasoning like "this is an integer", and therefore apply whatever we know about integers (including swapping the implementation for a different integer representation, which breaks any hacks being used).
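As a quick illustration of the first, representation-level style (written in Python syntax for convenience, though the point applies to C-style integers; the values are arbitrary):

flags = 0b10110            # an integer treated as a row of bits
print flags & 0b00010      # mask out bit 1 (non-zero means it's set)
print flags | (1 << 3)     # switch bit 3 on
print (flags >> 2) & 1     # read bit 2 on its own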

Knowledge should be completely bottom-up. Specify as many facts (triples) as you can, and at some point higher-order patterns may fall out. If they don't, then we write more triples. The reasoning approach should be a 'destructive' one, based on filtering. For example, a 'constructive' approach would construct a path for inference like 'trees are plants and plants are living, therefore trees are living'; whilst a 'destructive' approach would describe the undefined concept of 'tree' by saying 'X is a tree, X is a plant, X is alive'. This is less reliable than the 'constructive' approach (which can also be used), but allows us to use learning like a human: a child sees a red ball and is told "red". As far as the child knows, red balls are "red" but the word "red" could describe its shape, its size, its colour, etc. These possibilities are narrowed down (filtered, destroyed) by describing a red car as "red", a red traffic light as "red", etc. With enough examples of "red", there are enough filters to narrow down what is meant. A red ball and red traffic light are both round, perhaps that is what "red" means? No, because a red car is not round, thus "red" is not the shape. In the tree example we can say 'trees are plants', which narrows down the scope of what trees might be. We can say 'trees are alive'.
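Here's a toy sketch of that filtering in code; the property names are invented purely for illustration:

# Each sighting of something called "red" gives a set of candidate meanings.
red_ball = set(['round', 'small', 'coloured_red'])
red_car = set(['metal', 'fast', 'coloured_red'])
red_light = set(['round', 'bright', 'coloured_red'])

# The 'destructive' step: only meanings consistent with every example survive.
candidates = red_ball & red_car & red_light
print candidates   # set(['coloured_red'])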

The advantage is that we have an implied generality: we don't have to choose whether a particular object is one colour or another, we can say it's both. We can say that white wine is "white", which it is, without having to worry about breaking an ontology by 'misusing' the adjective 'white'. The lack of focus lets the system 'keep an open mind'. It can clarify things and verify its own inferences by asking questions, in order of the impact of the answer (ie. whether yes or no will make the biggest differentiation of the emerging ontology): 'Is a pineapple a tree?', 'Are plants alive?', etc.

If the system is going wrong for some reason, eg. asking a string of ridiculous inferences, and its own differentiation checker isn't stopping it, then the problem is simple: more triples need to be given explicitly.

Reusing URIs is a good idea, so we can have the system search out existing ontologies and instantiations, then ask things like: "are trees plants in the same way that cars are vehicles?". In that case we've said 'trees are plants' and it has found an ontology which says 'cars are vehicles' and is wondering if the relationship is the same, ie. if it should reuse the URI of the predicate. In a graphical interface we could offer a multitude of possibly preexisting definitions, along with tick boxes to indicate which ones are similar.

Blog Dump 6: The Web of Code

The original World Wide Web consists of big lines of text, called pages, which, by virtue of some ill-defined, external set of rules, somehow manage to contain knowledge. The pages can point to other places on the Web for whatever reason, and to a person this is great. However, to a machine it's just a bunch of numbers with no discernible meaning.

Mixed in with this knowledge are various well-defined, machine-understandable properties which originally denoted the presentation of the text (eg. 'b' for a bold bit, 'i' for an italic bit, etc.) but which gradually changed to instead merely split it up in various ways (eg. 'em' for a bit that needs emphasising, 'strong' for a bit which is important, etc.). The knowledge in the text, however, is still lost to the machine processing it.

This means that a person has to be involved somewhere, requiring a load of unfortunate individuals to navigate through the shit user interface of the Web, reading everything they find, until they get the bit of knowledge they were after.

These days we have search engines which slightly lessen the burden, since we can jump to places which we, as people, have reason to believe are pretty close to our destination and thus shouldn't require much navigating from. Still, though, the machines don't know what's going on.

In the Web of Data we entirely throw away the concept of a page, since it's irrelevant to our quest for knowledge. Text can be confined to pages, but knowledge can't; knowledge is pervasive, interconnected, predicated, higher-order and in general we can't contain a description of anything to a single unit without reference to other things, the descriptions of which reference other things, and we end up with our 'page' on one thing actually containing the entire sum of all knowledge. Since there's only one sum of all knowledge, we don't need to use a concept like 'page' which implies that there is more than one.

With the artificial limit of pages done away with we can put stuff anywhere, as long as we use unique names (Universal Resource Identifiers, URIs) to make sure we don't get confused about which bits are talking about what. Now we've got a machine-readable, distributed, worldwide database of knowledge: that's the Web of Data.

At this point many short-sighted people think that the next step is to rewrite the old Web on top of the Web of Data, so that both humans and machines can understand it and work in harmony. These people are, of course, sooooo 20th century.

Machines aren't intelligent (yet), so there's no way we could make a serious moral argument that they are our slaves. Therefore, why aren't they doing everything? There should be no relevant information kept back from the machine, and nothing it outputs should contain any new knowledge which can't already be found on the Web of Data. If we want to refine what it programmatically generates, we should do so by adding new information to the Web of Data until it knows what we want, and thus nobody else need specify that data again.

To me, as a programmer, there is an obvious analogy to be made:

The original coding system consists of big lines of text, called files, which, by virtue of some well-defined, external set of rules, somehow manage to contain computation. The files can import other files in the system for whatever reason, and to a person this is great. However, to a machine it's just a bunch of calculations with no discernible meaning.

Mixed in with this computation are various well-defined, machine-understandable properties which originally denoted the representation of the data (eg. 'int' for a 32bit integer, 'double' for a 64bit rational, etc.) but which gradually changed to instead merely split it up in various ways (eg. 'class' for a bit that contains related parts, 'module' for a bit which is self-contained, etc.). The computation in the text, however, is still lost to the machine processing it.

This means that a person has to be involved somewhere, requiring a load of unfortunate individuals to navigate through the shit user interface of the system, reading everything they find, until they get the bit of calculation they were after.

These days we have search engines which slightly lessen the burden, since we can jump to places which we, as people, have reason to believe are pretty close to our destination and thus shouldn't require much navigating from. Still, though, the machines don't know what's going on.

In the Web of Code we entirely throw away the concept of a file, since it's irrelevant to our quest for computation. Text can be confined to files, but computation can't; computation is pervasive, interconnected, predicated, higher-order and in general we can't contain a serialisation of anything to a single unit without reference to other things, the serialisations of which reference other things and we end up with our 'file' on one thing actually containing the entire sum of all computation. Since there's only one sum of all computation, we don't need to use a concept like 'file' which implies that there is more than one.

With the artificial limit of files done away with we can put stuff anywhere, as long as we use unique names (Universal Resource Identifiers, URIs) to make sure we don't get confused about which bits are talking about what. Now we've got a machine-readable, distributed, worldwide database of computation: that's the Web of Code.

At this point many short-sighted people think that the next step is to rewrite the old coding system on top of the Web of Code, so that both humans and machines can understand it and work in harmony. These people are, of course, sooooo 20th century.

Machines aren't intelligent (yet), so there's no way we could make a serious moral argument that they are our slaves. Therefore, why aren't they doing everything? There should be no relevant information kept back from the machine, and nothing it outputs should contain any new calculation which can't already be found in the Web of Code. If we want to refine what it programmatically generates, we should do so by adding new information to the Web of Code until it knows what we want, and thus nobody else need specify that process again.

What Is The Web of Code?

The Web of Code is code like any other. However, the operations it performs are not on memory, they are on things in the Web of Code. Memory is just a cache. The Web of Code is as high-level and sparse as possible, describing only what it needs to and no more. If we want to alert the user then we alert the user; we do not want to display rectangles and render glyphs, so we do not display rectangles and render glyphs. These are low-level details which can be worked out through reasoning and search on the Web of Code.

Sunday 12 December 2010

ocPortal Ramblings

Since I've now been working with ocPortal long enough to get to know its internals pretty well, and which bits are a joy to use and which aren't, I thought I'd write down a few thoughts about what I think works badly, which areas would benefit from attention, what features I'd like to see and what may be possible in the future. Due to ocPortal's nature, as a layer on top of stubborn databases, flaky libraries and inconsistent languages, which needs to offer flashy, dynamic, user-editable coolness in a reliable way, this list will inevitably include low-level and high-level details. My bias is, of course, on the internals, but I am also a Web user so I care about the user-facing parts too :)

The Good (the majority of ocPortal)

Before I get into a rant about the less brilliant parts of ocPortal, I thought I'd make it absolutely clear that these are specific examples picked out because of their annoyingness; they aren't anything like a representative sample of ocPortal as a whole. ocPortal contains loads of really cool features, which make it a really nice platform to code on.

Some examples of this are the "require" systems, which allow you to say "require_code('a')", "require_lang('b')", "require_javascript('c')" and "require_css('d')". This will tell ocPortal that, before your code runs, the page should have access to the code in 'a.php', the language strings in 'b.ini', the Javascript in 'c.js' and the style information in 'd.css'. The major problem this solves is that it's often a bad idea to include some dependency "just in case" it's not been included yet, because this can cause a load of errors about conflicts where the dependency tries to overwrite a previously included version of itself. With ocPortal, these headaches never occur. A similar feature is the "extra_head" function, which allows things (for example, raw Javascript, metadata tags, etc.) to be written into the page's <head></head> section at any point during page generation, rather than having to specify them all at the start.

A related feature is ocPortal's fallback system. Code requested by "require_code" will be taken from the "sources_custom" folder if possible, and if there's nothing which matches then ocPortal falls back to the regular "sources" folder. The same happens for themes, where the "templates_custom" folder of a theme will be checked first, falling back to the "templates" folder, then if there is still no match the "templates_custom" folder of the default theme is checked, and finally the "templates" folder of the default theme is used. The same applies to the "css_custom" and "images_custom" folders. Languages also use fallbacks, going from "lang_custom" of the desired language, to "lang" for the desired language, to "lang_custom" for the default (currently English) and finally to the usual "lang" folder for the default language (English). This makes it very easy to specify, for example, tweaks to a theme (in the theme's *_custom folders), specify a theme's standard behaviour (the theme's non-custom folders), specify new functionality available to all themes (the *_custom folders of the default theme) and keep the original, known-to-work ocPortal defaults intact (the non-custom folders of the default theme). The same is true for the sources and sources_custom, although a little magic here allows you to specify only the additions/changes you want to make, rather than having to make a complete copy of the file first.
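For illustration, the lookup order works something like the sketch below (written in Python for brevity; the directory names are simplified and aren't necessarily ocPortal's exact layout or API):

import os.path

def find_template(filename, theme):
    # Check the most specific locations first, falling back to the defaults.
    search_order = [
        'themes/%s/templates_custom' % theme,
        'themes/%s/templates' % theme,
        'themes/default/templates_custom',
        'themes/default/templates',
    ]
    for directory in search_order:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate   # first match wins
    return None                # nothing found anywhere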

ocPortal's abstractions are also nice to work with, namely the database and forum drivers. They are rather leaky and can easily be abused (for example dumping raw SQL into the parameters of a nice abstract function), but as long as this is avoided on the whole, it can be a handy thing to keep in place when you're in a pickle and need a bit of a hacked solution. It's like the UNIX philosophy that if you take away people's ability to do stupid things, you also take away their ability to do clever things (otherwise known as "giving you enough rope to hang yourself" ;) ).

There's a lot more I could write about cool features of ocPortal, like the Tempcode templating language, the hooks system, the AJAX workflows, etc. but I am writing this to talk about what *can* be done, rather than what is *already* done. Let's move on.

The Bad (things which aren't the best way to solve the problems they're for)

I mentioned in the above section that I could have gone on about ocPortal's hooks being a good thing. They are. However, they are also "bad" in that there is the potential for them to be so much more. OK, a little background: a lot of systems in ocPortal aren't really suited to being written in one big chunk, the classic example being site-wide search which has to include bits from every other system. To handle these cases, ocPortal has a feature called "hooks"; instead of putting the desired behaviour in a few strategic places, we instead put a call to every one of the hooks we have for that system. So for example, rather than having a large chunk of search code which will look in the galleries, the downloads, the forums, the news, the catalogues, the users, and so on, we instead say "grab all of the search hooks and run each one". Then, in the "hooks/systems/search" folder of sources (or sources_custom) we can put a new file to deal with galleries, one to deal with downloads, and so on. Since the search says "all of the search hooks", rather than some specific list, we can add new hooks into the folder whenever we like and they'll be used everywhere that the built-in ones are.

Hooks rely heavily on a methodology called "metaprogramming", where a language manipulates some text, adding bits on, replacing bits, etc. just like a regular program, but the text it's manipulating is actually code written in that language, which then gets run. Metaprogramming is really cool, but can get quite confusing quite quickly, so it's usually reserved for when nothing else is available. ocPortal hooks are created through metaprogramming as PHP objects, which are then asked to perform their task. Now, the problem with hooks is that this metaprogramming magic must be invoked each time we want to use a hook or hooks, and there's a lot of duplication in the hooks' contents.

When we design a system we usually try to keep related things together, but we inevitably end up having some similar code split up into various disparate places. With hooks, to me, the problem is that the search hooks have a gallery search. The introspection hooks have gallery introspection. The configuration hooks have gallery configuration. And so on.

I think a better Object Oriented way to do this would be to generate large, amalgamated, Katamari-style objects to represent each part of ocPortal (galleries, downloads, catalogues, etc.) which a hook may want (each being defined via a hook, of course), and generate their methods from the current hooks (eg. gallery->search, downloads->search, downloads->introspection, etc.). This also makes it unnecessary to specify which hooks you want to instantiate before using them (an annoyance with the current model), as instead you can just ask ocPortal for a specific hook object (eg. 'search/galleries'), or all objects used by a set of hooks (eg. 'search'). This can be encapsulated behind a single function call like "require_object". Then the methods can be generated and called via proxy methods on the objects. For example if we ran "$galleries = require_object('search/galleries'); $galleries->search($params);" then the call to "search" is intercepted, a relevant hook is looked for and instantiated, then called with $params. That keeps down the overhead of having to generate objects with all hooks instantiated in them to start with.

Performing a complete search of the site would be as simple as "$site_objects = get_objects('search'); $results = array(); foreach ($site_objects as $obj) { $results = array_merge($results, $obj->search($params)); }". We could even recurse through all of the hooks, bunging the methods into the generated objects. For example if we were writing part of the gallery system then we might access our desired hooks via "$objects = get_objects(); $search_results = $objects['galleries']->search($params); $relevant_configs = $objects['galleries']->configs();". Here I've used the convention that "hooks/foo/bar.php" will be at the key "bar" in the objects array, and will have the method "foo".
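The interception trick itself is easy enough to sketch; here it is in Python rather than PHP, just to show the shape of the idea. 'load_hook' is a hypothetical loader standing in for the metaprogramming, and none of this is real ocPortal code:

class HookObject(object):
    """Stands in for one part of the site (galleries, downloads, etc.)."""
    def __init__(self, content_type):
        self.content_type = content_type

    def __getattr__(self, system):
        # Intercept e.g. galleries.search(...) and only then load the relevant
        # hook; load_hook is hypothetical.
        def proxy(*args, **kwargs):
            hook = load_hook(system, self.content_type)
            return hook.run(*args, **kwargs)
        return proxy

# galleries = HookObject('galleries')
# results = galleries.search(params)   # hooks/search/galleries is loaded on demand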

The Ugly (things which are unfortunate, and waiting to be fixed)

At the moment ocPortal has a few historic systems which aren't actually used. For example, there is infrastructure to create new configuration options. This is actually useless, however, since the configuration options are designated via hooks. Such implementation changes have left various cruft around, and this can be confusing for programmers as they wonder why calls to these things don't do as they expect.

Some parts of HTML and its kin are a little annoying, for which various workarounds can be used. Unfortunately, ocPortal uses Flash to work around some of these, for example the limitations of file uploading (a form must be submitted (ie. a new page loaded), no progress information is given, etc.). This is unfortunate because the only viable runtime for Flash is Adobe's (formerly Macromedia's) and this isn't Free Software. Another example is video and audio playing, where the pending HTML5 standard defines how to embed such things, but makes no requirements for the formats to use. The most used formats at the moment are: h.264, which gives high quality video and is supported by many proprietary vendors like Apple, but Free Software implementations of it are illegal in many countries due to patents; Ogg Theora, a medium quality codec (it used to be poor, but has improved a lot recently) which has been the format of choice for Free Software for years, and is thus easily supported by browsers like Firefox; and WebM, a new format pushed by Google, which is patent-free (as far as we know) and supported by Free Software browsers. Until this settles down, it is difficult to make an HTML5 video site which will work on the majority of machines, without keeping at least three copies of each video (which is an awful lot of space) and potentially infringing patent law when converting to h.264. This unfortunately makes Flash a more widespread option than the standard, for the moment at least. It is only a matter of time before these things get replaced, but I would like to see it sooner rather than later, especially since I refuse to comply with Adobe's licensing terms.

The Future (my wishlist and thoughts)

These are ideas for where ocPortal can go. They're not a roadmap, they're not a vision, they're just things I've noticed and thought about, along with explorations of what new possibilities we would have if we implemented any of these.

WebDAV: WebDAV is a filesystem which runs over HTTP. On UNIX systems we're used to accessing everything via a file descriptor, and interfacing with all of this via some address underneath /. More recently, thanks to FUSE, we've become used to the idea of mounting filesystems which transparently encrypt or compress their contents, ones which offer access to machines over a network, ones which auto-generate their contents from some source such as WikipediaFS, and so on. Now, spare a minute for those poor folk still stuck on legacy operating systems, which can't even handle ext2 by default. They can, however, mount WebDAV. That makes WebDAV, as a standard, a much better filesystem concept than an ocPortal-specific filesystem akin to WikipediaFS, which very few users would be able to use.

By having ocPortal available via a filesystem like WebDAV, we can remove the barrier which prevents UNIX awesomeness from being unleashed on it. With an ocPortal filesystem, auto-generated and managed by PHP server-side code rather than being a direct representation of some disk's contents, we can allow users new ways to interact with a site (mount a site with your login credentials and have access to all of the files you're allowed to view, organised neatly into folders, which are populated automatically, and offer several concurrent hierarchies to choose from (eg. "images/dave/october" and "images/october/dave" could both generate the same results)), we can allow administrators new ways to manage their site (copying CSVs directly into and out of the site, dragging RSS in to post news, etc.) and we allow developers new ways to customise a site (direct manipulation of a site's source with native programming tools, with the changesets being available directly to the site's PHP).

Another advantage to doing this is that we can cover the same ground as ownCloud, which tries to offer an online filesystem like Dropbox or UbuntuOne, but as Free Software (CPAL's good for this, although not quite AGPL). Users can store their desktop preferences there so they can be available on whatever machine they're using. We've already got a great Web frontend, just not the filesystem backend (so the opposite of ownCloud at the moment).

The point about handling changesets is what intrigues me the most. Since all writes to the site will be handled by PHP, we can do what we like with them. A nice way to handle them would be to have the whole site in a distributed version control system like Git, and have the changes saved as commits as they are made. This would let site admins roll back changes very easily, whilst also allowing changes to ocPortal sites to cross-pollinate as users pull changesets from each others' sites (if this is specifically enabled by the admin, of course). ocPortal would change from being broadcast from a restricted SVN repository to those sites which poll the central server, into being a true community of shared code, with no barriers to entry.

There would be no need to keep backups of things which are removed (like addons), since they can be restored from the site's revision history or pulled in from the ocPortal network. Indeed, the entire concept of addons can be restructured into Git changesets which, along with themes, can spread through the network without the need for a central repository. ocPortal.com would be a showcase of suggestions, rather than a key piece of infrastructure.

There are lots of services out there which would enhance various parts of ocPortal, even if they remain as separate servers. The obvious one is email. Forum threads should generate mailing lists and emailing the list should populate the correct forum (there are PHPBB mods which do this, but I've not tried them). All content, like blogs, news, galleries, etc. should be emailed out to those who want it (in the body if possible, or else as an attachment) whilst comments can be posted by sending a reply. Interacting with an ocPortal site should be, as far as possible, doable via email.

If a site is being hooked into an email server, then there should be the ability to auto-generate email accounts for users. This is less standardised, but some reference implementation could be written, eg. against SquirrelMail. This would only need to tie together the account credentials, it wouldn't need any interface work since a direct link to the mail frontend can be given.

The same goes for XMPP/Jabber. There is limited support in ocPortal at the moment for using XMPP in the chat rooms. I think this should be available as standard, such that every chat room is discoverable via XMPP service discovery on the site's URL. Going further, the site can offer content via Publish/Subscribe and allow all of the same kind of interactions that are possible via email.

A nice feature would be for ocPortal to seed its content via Bittorrent, offering Magnet links via the site and using the Distributed Hash Table to manage peers (rather than a tracker).

There is a lot of opportunity for ocPortal to sprinkle more metadata around sites. At the moment it uses Dublin Core plus a custom ontology to describe documents. The most obvious next step is to generate RDF for users' profiles, using ontologies like Friend of a Friend. There is the choice to use RDFa, which is scattered throughout the pages, but this would make templating much harder, so I think having separate RDF files for each thing we care to describe is enough. We should make sure that we're following the Linked Data recommendations of using HTTP redirects whenever a non-digital resource is requested (eg. the URI of a person is requested). Generic concepts common to all ocPortal sites, like "image", "video", etc., could be linked to a reference on DBPedia for that concept. Sites may have an RDF endpoint, if we find bots are taking too much bandwidth or aren't finding what we're giving out, but it's not too important. While we're discussing metadata, we can scrape incoming data for metadata, for example EXIF data in an uploaded image. Display this alongside regular ocPortal fields.

There should be a federation between ocPortal sites, such that accounts made on one site will work on another ocPortal site, if the admins turn on this option. This could be achieved via OpenID, ie. scrap ocPortal-specific users in favour of everyone being identified by OpenID. In which case we would like to have a mechanism to generate these for users who don't already have one, or who want another. There is some nice code floating about which used to run xmppid.net which allows an XMPP ID to be used as an OpenID, by asking for confirmation out-of-band via XMPP. If we have an XMPP server hooked up to the site, with accounts for each user, then this would be a nice solution.

Along with this, users on any ocPortal site should be able to message any user on another ocPortal site. This could be achieved via XMPP/email integration, and/or through some form of OStatus-style messaging.

We should allow sites to use client-side cryptography. Thus the site is merely a host; the user's browser does the encrypting via Javascript before uploading, and does the decrypting after downloading. This would be difficult if we use WebDAV.

If we had OpenCollaborationServices support then lots of desktop applications would be able to grab data from ocPortal sites. This is a bit of a moving target at the moment, with the only major implementations being on the various parts of opendesktop.org (eg. gnome-look.org, kde-apps.org, etc.), but this would give us a hook in to many existing desktop applications.

Would be nice to have a few more protocols coming in and out, under some abstraction (maybe an "activities" log). PSYC can be used for server-server communications (eg. syncing), Wave could be used for client-server. We might even look at the Gobby protocol if we wanted to allow concurrent editing.

More coherent Javascript management. Currently Javascript works on DOM elements in a procedural way. Would be nicer to have an Object Oriented approach, where the objects are whatever the hell we want them to be, and there may be some DOM objects associated with an object if we like; the objects should sort out all of the logic amongst themselves.

There is a lot of interest in distributed social networks like Diaspora, GNUSocial, StatusNet, etc. which are trying to be Free social networks or replacements for specific services. They have the ideas right, some have protocols, but none have any kind of decent sites coming out of their HTML generators. Could we usurp them by taking the existing awesomeness of ocPortal sites and making them support these protocols? Would be nice, in that we'd get publicity, we'd get new features, plus everyone would get ocPortal as a kind of uber-Diaspora/GNUSocial/StatusNet. CPAL terms are like AGPL, which helps the cause.

Not a specific criticism, but Tempcode is a bit daunting to use at first. Being a string-replacement system, it is all about metaprogramming. This is powerful, but do we want it to be a text replacement system, or do we maybe want a generative language? Do we want to work at the string level, or at the token level? What's wrong with XML? What's wrong with s-expressions? Would be nice to have the minimum amount of syntax possible, and no special-cases (we can break this symmetry at the function/directive level, which is interpreted by PHP anyway). s-expressions would be good for this. Similarly, if we want to overhaul templating, we could make quite a nice generative language for CSS with bugger all syntax and much more flexibility and extensibility. Would be a job to get working in a way that Web people would accept though. Could we make an XHTML-to-Tempcode convertor, to handle all of our legacy support? If we had a higher-level (eg. token-based) language then it would be a lot easier to make higher-level tools, like drag 'n' drop of elements/directives, live previews, etc.

Thursday 9 December 2010

Blog Dump 5: Some Statistics

Statistical models are well suited to computer simulations. The most interesting results can usually be found in random processes, so here's an example showing the St Petersburg Lottery:

#!/usr/bin/env python

import random


def flip():
    """Simulate a coin flip."""
    if random.randint(0, 1):  # Choose this statement or the next with equal chance
        return "Heads"        # Give a "Heads"
    else:
        return "Tails"        # Give a "Tails"


def lottery():
    """Run a St Petersburg Lottery, flipping a coin until it's tails. Each time it's heads
    our pot is doubled and added to the running total."""
    winnings = 1                   # Start with 1 pound
    pot = 1                        # The pot starts at 1 pound
    while flip() == "Heads":       # Loop until we get a Tails
        pot = pot * 2              # Double the pot
        winnings = winnings + pot  # Add it to our winnings
    return winnings                # When we've finished looping, return the result


def play(rounds):
    """Run the lottery 'rounds' times in a row."""
    cost_per_game = 2            # Deduct this from the wealth for each round
    wealth = 1                   # Start with 1 pound
    print "Rounds, Wealth"       # Column headings for our spreadsheet
    print "0,1"                  # Print the first results
    for x in range(rounds):      # Loop through each round
        wealth = wealth - cost_per_game + lottery()  # Deduct the cost and add the winnings to our wealth
        print str(x+1)+', '+str(wealth)              # Print the round number and our current wealth


play(1000)  # Play 1000 rounds


The scenario is that we start with £1 and pay a fixed sum (here £2) to play a lottery, in which a coin is tossed. If it's tails then we take the winnings, which start at £1, but if it's heads then the winnings go up by twice the last increase, so they become £1 + £2. If we get a tails next, we get the winnings, but if we get heads it goes up by double the last increase again, so the winnings become £1 + £2 + £4. This keeps going on until we get a tails, when we get the winnings so far. The remarkable thing about this game is that the amount you can expect to win is infinite (0.5*£1 + 0.25*£2 + 0.125*£4+... = 0.5 + 0.5 + 0.5 + ...), so that you should be willing to pay any finite amount to enter. In this case the entrance fee for the lottery is £2, but it could be £2,000,000 and it wouldn't make a difference.

We can see this from the graph below, where we run 1000 lotteries in a row, with 25 of these simulations running simultaneously (ie. 25,000 lotteries). The number of lotteries entered goes along the bottom, whilst the winnings go up the side. The 25 runs of the simulation have been superimposed on each other to better approximate what's going on (I can't be arsed working out the error bars for a blog post). At any point on the graph, the darkness is a rough estimation of the probability of owning this amount of money after this many rounds (we use the area under the curve since, if we have £50 then we also have every amount below £50).
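If you want to see the divergence without plotting anything, a quick empirical check (reusing the lottery() function from the listing above) is to average a large batch of winnings; the sample mean keeps creeping upwards as you add more samples, because the true expectation is infinite. The sample size here is arbitrary:

samples = [lottery() for i in range(100000)]   # 100,000 lotteries, purely for illustration
print sum(samples) / float(len(samples))       # the sample mean, which never settles down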



I've also made some simulations of the Monty Hall problem, random walks (intersecting and non-intersecting) and a few other statistical phenomena. I might post them if I get time. They can be the source of pretty patterns :)


Blog Dump 4: What Makes Me A Geek: Reason 0x54F6

I have only ever paid for one album in my life. The rest of my music is freely distributable online. The name of that album is 01011001, which is Y in ASCII.

Blog Dump 3: Duck Typing as Constraint Propagation

Another of my unfinished programming musings:

Personally, I really like Python. Python is quick to write, quite readable, but most of all it's dynamic. Many people get caught up on the syntax of languages, usually because that's the easiest part, however the syntax is just a way of expressing the semantics, which is what you actually mean when you code. Let's compare some C code to some Python code:

int square(int x) {
    return x*x;
}

int mynum = 5;
int mynum_squared = square(mynum);

Don't focus on the syntax, I'm keeping such things as simple as possible. This C code simply defines a function called "square". It says that square acts upon integers ("int") and gives back ("return"s) integers as a result; in this case the result is the input multiplied ("*") by itself. We then make an integer which we call "mynum" and set it equal to the number 5. We then make another integer which we call "mynum_squared" and we set it equal to the result of the function square acting upon mynum. Since code is just Maths we can simplify (or "run") this to show:

mynum_squared = square(x=mynum)

mynum_squared = {mynum*mynum}

mynum_squared = {5*5}

mynum_squared = 25

It's pretty straightforward, just like Maths in school. Let's see how we would write this if we were using Python:

def square(x):
    return x*x

mynum = 5
mynum_squared = square(mynum)

Let's read what this code is doing. We define a function called "square" which acts on something. It also "return"s something as a result. More specifically, what it returns is the input "*" with itself. We then give a name "mynum" to the number 5. We then define another name, which points to the result of the function square acting upon whatever mynum points to. Once again, code is just Maths so we can simplify this too:

mynum_squared = square(mynum)

mynum_squared = square(5)

mynum_squared = 5*5

In Python, the asterisk "*", rather than being an easier-to-write-on-a-keyboard multiplication sign like it is in C, is actually syntactic sugar. Syntactic sugar means that it is a shorthand way of writing something that's annoying to write out in full every time. Let's write out the full description of this code, rather than using the convenient shorthand:

mynum_squared = 5*5

mynum_squared = 5.__mul__(5)

Now what does this last version say? Well you may recognise the "(5)" as passing the number 5 to a function. The function we're running is called "__mul__", and the "." means that its definition is stored inside the number 5. So what it says is to get the function __mul__ which is stored in the number 5, give it a number 5 and make a pointer to the result called "mynum_squared". Since we can't see the definition here of what the function "__mul__" does, there's no way for us to simplify this code any more.
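You can check the desugaring at a Python prompt; the only wrinkle is that you need parentheses around the 5, since the parser otherwise reads "5." as the start of a floating point literal:

print 5 * 5            # 25
print (5).__mul__(5)   # 25, the same call written out long-hand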

So how does this massive difference in semantics manifest itself in the syntax of C and Python? In the C code, everything had a type, for example "int mynum = 5;". The name "mynum" cannot store anything that is not an integer*. This allows us to know exactly what the code will do before we run it (as long as we know what the input will be), since we know how everything behaves, although of course it may take us a while to work out exactly what it does (otherwise there's no point writing the program ;) ) and it may take forever (see Halting problem). So where are the types in Python code? Well they're there, but they're a bit redundant. Let's give our Python code some types in a C-style syntax:

object square(object x):
    return x*x

object mynum = 5
object mynum_squared = square(mynum)

As you can see, the type of everything is just "object"**, so there's not much point writing it out explicitly for everything. The vagueness of having the type of everything no more specific than object, or essentially "something", makes Python incredibly powerful, since we get a type system called "duck typing". Duck typing comes from the saying "if it walks like a duck, sounds like a duck and looks like a duck then it's a duck".

More formally, what this means is that in our "square" function we take in an object "x" then return "x*x" or, without the syntactic sugar, "x.__mul__(x)". In our C function we take in an int, if we want to use our square function on something else (for example a real number) we need to copy and paste our existing definition and replace all of the "int"s with something else (eg. "float"). In duck typing the type of the input is just 'anything which works', no more and no less.
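Here's duck typing in action: the same square function works on anything whose __mul__ accepts another value of the same kind, without us ever listing those types anywhere:

def square(x):
    return x * x

print square(5)        # 25
print square(2.5)      # 6.25
print square(3 + 4j)   # (-7+24j)
# square('ab') would raise a TypeError, since a string's __mul__ wants a
# number; strings are outside square's duck type.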

This is a bit tautological, ie. "square accepts anything which square accepts" but is nevertheless more specific than simply "an object", which is the largest upper bound we could have on the type. If we call the type consisting of all objects T, so that

forall o of type object, o is a member of T of type P(object)

We get, through a bastardisation of formal logic, a search over the space of all objects. "T" is the type of object we're looking for, so it is "of type type". At the moment we're just dumping every object we find into T, since we know it acts on objects (although we define it in reverse; we don't put objects into T, we say that T contains it, so that we get a timeless mathematical truth). Now our more specific version we can call T', such that:

forall o of type object, o is a member of T' of type P(object) if and only if square(o) is not a member of errors

As I said, this is a bit better, but what we would like to do is replace the use of square(o), since we'd like to get rid of the tautological aspect. We'd also like to use more easily defined concepts than a set of all errors, since if we tried to define the set of all errors we'd end up solving this type problem as part of that, so we've not made our lives easier yet. Let's get rid of those by further refining the type based on what we know about the function, so the more specific T'' is defined such that:

forall o of type object, o is a member of T'' of type P(object) if and only if there exists an f of type function where name(f) = '__mul__' and f is a member of o

Now that's gotten rid of the tautology nicely, but our replacement of 'the set of all errors' with 'o must have a function __mul__' is pretty vague since this is still true for objects which would give errors. Thus we can make another refinement and say T''' is defined as:

forall o of type object, o is a member of T''' of type P(object) if and only if there exists an f of type function where name(f) = '__mul__' and f is a member of o and there exists an x of type U where x is a member of arguments(f) and o is a member of U

This is getting a bit unwieldy, but formal logic has a tendency to do that (and of course I'm using English rather than Unicode symbols, both for convenience of typing and because I'd end up translating it to English anyway as an explanation). Essentially our search now contains some undefined bits, most importantly the type U which is the type of the arguments accepted by the function __mul__. We don't just want to make sure that x.__mul__ exists, we also want to make sure that x.__mul__(x) is valid, so we need to know the type of __mul__'s arguments and ensure that our object o is a member of that type.

Now we've nicely defined a type for square, namely T''', using all of the knowledge that we can get from the code without restricting it, but this definition itself is defined by another type U. Of course, we also haven't defined the type "function", but we can do that in a similar way if we know how Python works (a Python function f is an object for which f.__call__ exists), and we can define U if we have the source code for __mul__. Of course, it can become intractable to define types in this way for an entire system, so what we would like is a function which takes in function objects and returns type definitions on its arguments so that we can do this automatically. Of course such a function might never halt, for example if we have a recursive function like:

def recursive(x):
    return recursive(x)

attempting to inspect this in a naive way would result in endless attempts to find types for both, since it will accept anything and never halt (or in the case of real hardware, the stack depth will be exceeded and the program will exit with an error). Of course, as human programmers we know what this will do, so we could make a more sophisticated system for inspecting our code, which can notice such recursion and tell us that the type of x is just "object" and the return value of recursive(x) is "RuntimeError". This would be necessary to reason about the following perfectly valid Python:

def infinity(x):
    return infinity(x)

def buzz_lightyear():
    try:
        infinity(1)
        return "And beyond!"
    except RuntimeError:
        return "Falling with style."

The function "buzz_lightyear" tries to run the function "infinity", which breaks when it reaches the maximum recursion depth imposed by the Python interpreter and gives a RuntimeError indicating this. The line 'return "And beyond!"' is never reached, due to this error, and instead we GOTO the "except RuntimeError" line, which tells us what to do if we get a RuntimeError. The line 'return "Falling with style."' is run, so that this function always returns the text string "Falling with style.", and of course it doesn't accept any input. It is possible to reason about this function in a very simple way and figure out that it returns a text string (both of the "return" statements have a text string, regardless of what "infinity" returns). It's slightly harder to work out that it always returns "Falling with style.", but possible. Of course, we could write a nasty bit of code like:

def infinity(x, values):
    x.__class__.getValue = values[0]
    return infinity(x, [values[1], values[0]])

def buzz_lightyear():
    def value1():
        return "Hello world"
    def value2():
        return value2
    try:
        infinity(copyright, [value1, value2])
        return 3
    except RuntimeError:
        return copyright.__class__.getValue()

This is a complete bastard, and actually depends heavily on global variables which may have different defaults on different systems, and may be changed during the running of a program. What it does is the following: define a function called "infinity" which takes two objects, which we call "x" and "values". The first thing we do in this function is change the 'class' of x (the type of object it is) so that its function 'getValue' is the first element of our values argument. We then return the result of running the infinity function on x and a list object containing the first two elements of 'values' in reverse order. In other words, this will run itself over and over, each time swapping the 'getValue' function between those first two elements of 'values'. This will cause a RuntimeError, since it can't recurse infinitely, but we don't know what 'getValue' will be when this happens, just by looking at the code.

We then define our function 'buzz_lightyear' which takes no arguments. Inside buzz_lightyear we define two functions, 'value1' and 'value2'. The former takes no arguments and returns a string "Hello world"; the latter simply returns itself as an object (but without running itself recursively). With these in hand we proceed to attempt a call to 'infinity', giving it Python's built-in 'copyright' object and a list object containing our value1 and value2 functions. After running infinity we return the object 3 (which never happens, as a RuntimeError is encountered). What ends up running in the 'except' statement is whatever function the 'getValue' name is pointing to, which is one of value1 or value2, and we return the result. Here there are three possible return types, a number 3 (which never happens), a string "Hello world" or a function value2.

Reasoning about such a program in a general way is very tricky, for example the return type of buzz_lightyear could be defined as A where:

forall o of type object, o is a member of A of type P(object) if and only if o is a member of (function unioned with string unioned with number)

or as A' where:

forall o of type object, o is a member of A' of type P(object) if and only if o is a member of {F, "Hello world"} where F is of type P(function)

Doing any more than this would be beyond the scope of our code, so we don't really want to take it any further. We can use these types as constraints to work out the types of everything that depends on them, until we've exhausted the information contained directly in the code. The rest we can only work out as the code's running, when we know what exact values these objects have, and thus what type they are. As long as we're always working on lowering the upper bound on the types then we're not imposing any restrictions on the code which were not already present by virtue of Python's semantics. When it comes to runtime the introspection abilities of the language (the ability to know about itself) can be greatly enhanced this way.

Of course, there are other ways of doing type inference which give a lower limit to types, essentially turning pure object functions like our Python "square" function into ones with concrete types, ie. types built up from "atomic types" (or, as I call them, bloody annoying non-object crap). Essentially it takes C-style declarations and puts them in Python, for example Pyrex. These allow working with statically/concretely typed C code, but in a backwards way: rather than seeing these restricted C functions as an alternative execution path for Python objects and treating them accordingly (we don't care how it's implemented, just that it works), instead the concrete typing leaks out and we end up hitting artificial walls when we try to program, which are only there due to a leaky implementation (for example, in Python 2.x try to run "hello".foo = 10).

So what's the point of all this? Well firstly, code can only run when we know what type everything has. Treating everything as a generic "object" can slow down execution speed as it becomes very hard to automatically optimise the code. If we knew that something is only ever an integer then we can put number-specific code there to make it much faster, rather than keeping it generic. The simplest way to implement a dynamic language like Python ends up being with an interpreter and a virtual machine, a statically typed program which simulates a computer capable of reading and executing Python line by line***, whilst statically typed languages can instead be compiled (translated) into code which a real computer will understand, and are thus much faster. If we have type information then we can compile Python into machine language and see a massive speed boost whilst remaining duck typed. Since concrete type information can only ever be known at run time, such compilation would need to happen as the program's running: so-called Just In Time compilation.

Current approaches to JIT compiling successfully use type feedback, where a function is run in the regular way in the virtual machine, but the specific types it uses are remembered. A compiler then takes the function and the type information and compiles a machine-code version, so that next time the function is run with the same type of input it can use the machine-code version instead. This is nice, but requires all of the work to be done when it's needed, so a little push in the right direction from inferred types might reduce this overhead.
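As a toy illustration of the idea (nothing like a real JIT; entirely my own sketch): remember the argument types a function actually gets called with, and switch to a hand-written "specialised" version whenever one exists for those types.

specialised = {}

def register(*arg_types):
    # Register a specialised implementation for one combination of argument types
    def wrap(fast_version):
        specialised[arg_types] = fast_version
        return fast_version
    return wrap

def with_type_feedback(generic):
    seen = {}    # records which type combinations actually turn up at runtime
    def dispatch(*args):
        arg_types = tuple(type(a) for a in args)
        seen[arg_types] = seen.get(arg_types, 0) + 1
        fast = specialised.get(arg_types)
        return fast(*args) if fast is not None else generic(*args)
    return dispatch

@with_type_feedback
def square(x):
    return x * x    # generic path: works on anything with a '*'

@register(int)
def square_int(x):
    return x * x    # stands in for a compiled, integer-only version

Calling square(3) takes the specialised path, while square(2.5) falls back to the generic one; a real JIT would generate the specialised version itself once the recorded types settle down, rather than having us write it by hand.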

The most interesting use, as far as I'm concerned, is with automatically generating specifications. Given enough information about the code it becomes possible to know a lot about what it is capable of doing, without running it. This allows some code to become declarative, meaning that the description of the code specifies exactly what it does, and the code itself just becomes one particular way of implementing that specification. Comparing specifications (which is not possible in the general case, but we can still try opportunistically, eg. with a timeout) allows us to automatically replace the code (implementation) with alternatives that do the same thing. This is useful for building compilers, and if we want to make languages work together (although of course we just shift the problem, since we can have competing, incompatible specification languages).

Another useful feature of this specification-carrying-code has been implemented by Hesam Samimi in an extension to Java, where the programmer can supply a specification for each block of code, and the virtual machine can check whether the supplied code actually implements the specification or not. This makes debugging easier, as long as you can get your specifications right, but interestingly it can be input-dependent. For example, a function has a specification, and when that function is called the code is checked against the specification to see whether it matches, given the input. If it does then great, but if not then an alternative can be used. This makes handling of edge-cases and corner-cases much nicer. Here's a literal example of edge- and corner-case handling code: we define a function which takes four arguments, a "grid", a function called "process" and an x and a y coordinate. We want to run "process" on the square in the grid at position (x,y), along with every neighbouring square, and return a list of the results. This function looks like the following:

def my_function(grid, process, x, y):
    if x == 0 or x == 99 or y == 0 or y == 99:
        if 0 < x < 99:
            if y == 0:    # bottom edge
                return map(process, [grid[x-1][y], grid[x-1][y+1], grid[x][y], grid[x][y+1], grid[x+1][y], grid[x+1][y+1]])
            else:         # top edge
                return map(process, [grid[x-1][y-1], grid[x-1][y], grid[x][y-1], grid[x][y], grid[x+1][y-1], grid[x+1][y]])
        elif 0 < y < 99:
            if x == 0:    # left edge
                return map(process, [grid[x][y-1], grid[x][y], grid[x][y+1], grid[x+1][y-1], grid[x+1][y], grid[x+1][y+1]])
            else:         # right edge
                return map(process, [grid[x-1][y-1], grid[x-1][y], grid[x-1][y+1], grid[x][y-1], grid[x][y], grid[x][y+1]])
        else:
            if y == 0 and x == 0:      # bottom left corner
                return map(process, [grid[x][y], grid[x][y+1], grid[x+1][y], grid[x+1][y+1]])
            elif y == 0 and x == 99:   # bottom right corner
                return map(process, [grid[x-1][y], grid[x-1][y+1], grid[x][y], grid[x][y+1]])
            elif y == 99 and x == 0:   # top left corner
                return map(process, [grid[x][y-1], grid[x][y], grid[x+1][y-1], grid[x+1][y]])
            else:                      # top right corner
                return map(process, [grid[x-1][y-1], grid[x-1][y], grid[x][y-1], grid[x][y]])
    else:
        return map(process, [grid[x-1][y-1], grid[x-1][y], grid[x-1][y+1], grid[x][y-1], grid[x][y], grid[x][y+1], grid[x+1][y-1], grid[x+1][y], grid[x+1][y+1]])

This looks pretty messy, but what is it doing? Well, it's handling the missing neighbours: if a square is on the edge of the grid (here we say it's on the edge if its coordinate is 0 or 99) then it won't have any neighbours on one side. The first four "return" lines handle the bottom, top, left and right edges respectively. Likewise, if we're in a corner then we have no neighbours on two sides, so the next four "return" lines handle the bottom left, bottom right, top left and top right corners. The bottom line, the final "return" statement, will work for every square which isn't a corner or edge case, ie. one that is surrounded by neighbours. This type of code is really ugly, and very prone to mistakes like a plus or minus the wrong way around. A lot of my current Physics project requires code like this, except in three dimensions, where there are 26 neighbours to each cube: 6 face cases, 12 edge cases and 8 corner cases. Needless to say, I've often spent a whole day trying to track down a single misplaced minus sign. What makes specifications nice is that we can replace this whole lot with the following:

def my_function(grid, process, x, y):
    return map(process, [grid[x-1][y-1], grid[x-1][y], grid[x-1][y+1],
                         grid[x][y-1],   grid[x][y],   grid[x][y+1],
                         grid[x+1][y-1], grid[x+1][y], grid[x+1][y+1]])

Which is just that last line from before. This will work the majority of the time, but if we're on an edge or in a corner then it won't match the required specification and won't be used. In our 2D example there are four corner cases, four edge cases and one regular case. In a relevant 100x100 grid there are 4 corner squares, 392 edge squares and 9604 regular squares. That means, only counting return statements, 11% of our code is handling 96% of the data, 44% of the code is handling 3.9% of the data and 44% of the code is handling 0.04% of the data. The larger the input, the less significant the edges and corners become.

Corners completely handle the 1x1 and 2x2 grids, but then remain constant regardless of grid length. Edges come into play in the 3x3 grid, already as important as the corners, then increase linearly with the grid length. The regular, non-edge case also starts in the 3x3 grid, but goes up with the square of the grid length, until by the 7x7 grid it is more significant than the edges and corners combined. After that the edge cases aren't worth mentioning.
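Those growth rates are easy enough to check with a couple of lines of Python (a throwaway sketch of mine, valid for grids of side at least 3):

def case_counts(n):
    # Counts for an n-by-n grid with n >= 3 (smaller grids are all corners)
    return {'corners': 4, 'edges': 4 * (n - 2), 'regular': (n - 2) ** 2}

case_counts(100)   # {'corners': 4, 'edges': 392, 'regular': 9604}
case_counts(7)     # the first size where regular (25) beats edges + corners (24)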

Not only does getting rid of edge cases make the code far more readable and understandable, but the code becomes simpler for the machine to run. In our case the edge-case code does about as much work as a results verifier would, so there's probably no speedup achievable here, but if a costly algorithm is used by a function to handle every possibility, where the vast majority of cases could be fulfilled by a quick algorithm, then simplifying it in this way can give speed gains. For example, speeding up the non-edge case of our grid function on a 100x100 grid is worth 96/4 = 24 times as much as the same speed increase in the edge-case code. Thus if we double the speed of the regular case we can get away with an order of magnitude slow-down in the edge cases and still get an improvement.

The obvious question becomes: what happens if we get an edge case? If our code comes with specifications then we don't need to care about the edge cases, since an adequate function can be created to handle them automatically by looking through the specifications of everything else and building the required functionality out of the available components. In fact, we don't even need to write the regular function: that can be automatically generated too! Writing the function by hand just becomes an optimisation, with the regular case being the most obvious candidate for optimisation in this example. Since automatically synthesised code can be expensive to produce and inefficient to run, it is a good idea to do such optimisation, but not at all necessary. The specification language becomes the language to program in, and the computer does the rest (as long as sufficient components are written with implementations, or else no code can be generated).

So why am I writing this long and boring post? Well, firstly I am a great believer in the need for correctness proofs: code is just a fancy Mathematical notation, and without a proof of what it does it's no better than engineering or a social science. Rules of thumb (eg. design patterns) are all well and good for day-to-day tasks, but a rule-of-thumb mindset isn't going to produce anything truly new, truly insightful or truly robust: the second law of Thermodynamics is not "hot stuff tends to cool down and cold stuff tends to heat up", that's a rule of thumb; it is in fact proved from Atomic Theory using Statistical Mechanics, whilst Atomic Theory is proved from Quantum Mechanics, and Statistical Mechanics from Statistics and Probability Theory, which also underlie Quantum Theory along with Gauge Theory, Topology and so on. Whilst there can never be a complete proof of anything, we can narrow down the number of assumptions we have to make.

* In reality it can store anything, since everything in a computer is the same old binary, but since it will treat what you give it as an integer you'll get corrupted data unless your data behaves exactly like an integer, in which case you may as well use an integer :P Note that a program doesn't particularly care if your data is corrupt, since it will carry on using it, happily working its way down the execution path, all the while fucking up whatever you thought it was meant to be doing because you didn't understand what program you just wrote.

** Functions are actually a bit of a strange type, since we give them things to act on. Python functions are a kind of object, but when we run them we're doing a funny thing called Currying. For example, our "square" function in Python has the type object->object, since we give it an object and get an object back. If we run a function which makes a list out of two objects we have a type of object->object->object. We can give it an object, like 2 for example, and it gives back something of the type object->object, which is also a function, which in this case takes an object and gives out an object. If we give the object 78 to this new function then we would get out a list object [2, 78] (based on the definition we gave). In Python we can't directly Curry functions, ie. we can't turn an object->object->object into an object->object, we can only go straight from one end to the other, but since this is all Maths it doesn't stop us from thinking about it :)
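As an aside, the standard library's functools.partial lets us fake that intermediate object->object stage, even though it isn't real Currying; a quick sketch of my own:

from functools import partial

def make_list(a, b):
    return [a, b]                     # conceptually object -> object -> object

list_with_2 = partial(make_list, 2)   # plays the role of the object -> object stage
print(list_with_2(78))                # [2, 78]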

*** Actually the raw Python is turned into bytecode, a much simpler language which is faster to execute, and it is this which is interpreted in the virtual machine.


Blog Dump 2: The Universality of Search

Another unfinished post I thought I'd share:

Searching is a much-studied area of computer science. The reason for this is that, like the Travelling Salesman Problem, the problem of how to search can be applied to a massive number of problems if they are phrased in the right way. As a straightforward example (if you'll excuse the pun) let's look at navigation: How do I get from Sheffield Train Station to the Devonshire Cat pub? This can be modelled quite literally as a search problem if you take all of the roads heading away from the train station (this is called enumerating them) and for each one, see if it ends up at the Devonshire Cat. If none do then enumerate all of the roads you found coming off them and do the same thing. Keep going until you either find the Devonshire Cat, or you run out of roads.
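That road-by-road enumeration is exactly a breadth-first search; here's a minimal sketch of mine in Python, assuming the map is just a dictionary from each place to the places its roads lead to:

from collections import deque

def find_route(roads, start, goal):
    # Breadth-first search: try all one-road routes, then two-road routes, etc.
    queue = deque([[start]])
    visited = set([start])
    while queue:
        route = queue.popleft()
        if route[-1] == goal:
            return route                      # found it
        for next_place in roads.get(route[-1], []):
            if next_place not in visited:
                visited.add(next_place)
                queue.append(route + [next_place])
    return None                               # ran out of roads

Given a suitable 'roads' dictionary, find_route(roads, 'Sheffield Train Station', 'Devonshire Cat') gives back either a list of places to walk through or None if no route exists.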

Another example this can be applied to is deciding where to eat with a group of friends. Here you can either enumerate the venues and search through them for the best match with your group's preferences, or enumerate the preferences and search through those until the best venue is found.

The sheer generality of search is what makes it fascinating. Indeed, for any problem where you know the answer is in a certain set of possible answers, and you have some way to verify the answer (equivalent to rating each venue in the dining-out example), you can enumerate the possible answers and check each one. The problems with this approach are pretty obvious. Firstly, the set of possibilities could be unfathomably large (eg. trying to find the first sentence of this post: if we were to first enumerate 1-character-long sentences, then 2-character-long sentences and so on, we would only find it after looking through about 2,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 possibilities, which according to this site is pronounced "two quattuorvigintillion"). Then of course there are infinite sets like the natural numbers, where we can start counting through them but we've no idea how long it may take to find the answer we're after. And then there are non-recursively-enumerable sets, where not only can you not know how long it will take to look through, but you can never even find a way to go through every member of the set (for example, if we try to enumerate the Real numbers, eg. those with a decimal point, there will always be an infinity of numbers between those we decide to test which get skipped, no matter how we choose them).

Once again, it trails off before making any real point :P

Blog dump 1: Diet Python

I've got a few blog posts hanging around on my PC, so I thought I'd post them even though they're not really finished. Here's the first.

I've talked about this before, but have been doing some more work on Diet Python (available in my Python Decompiler on gitorious.org).

Diet Python is a sub-set of Python which is just as expressive. The PyPy project, for example, uses a subset of Python called RPython which is restricted to exclude the really dynamic stuff, and Google's Go language is essentially the same. They exist to keep the syntax and readability of Python, but by throwing away bits they make it much easier to implement. That's not the goal of Diet Python. In Diet Python we don't really care about readability, and syntax can be as awkward as we like as long as it's still valid Python. What Diet Python does is throw away the redundant bits of Python that can be replaced with other Python that does exactly the same thing.

As a simple example, a + b does the same thing as a.__add__(b), but if we're making a Python interpreter we have to handle both. Diet Python throws away the +, since it's less general than the function call, so that Diet Python implementations would only have to implement the function call, they can forget about the +. Diet Python has a 'compiler' which translates Python into Diet Python, so that we can write normal Python (eg. a + b) and then strip away the fat to get a Diet Python equivalent. This Diet Python, since it's a subset of Python, will also run fine in a standard Python interpreter.
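A few more translations in the same spirit (my own illustrations of Python's special methods, not output from the Diet Python compiler itself):

x = 2
y = 3
x + y        # becomes x.__add__(y)
[x, y][0]    # becomes [x, y].__getitem__(0)
-x           # becomes x.__neg__()
len([x, y])  # becomes [x, y].__len__()

In each case the right-hand form is plain Python too, so stripping the syntactic sugar doesn't take us outside the language.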

There are two approaches I'm taking with Diet Python. The first is to remain 100% compatible, so that there is no change between the input semantics and the output semantics, and both can run the same in a standard Python interpreter. The other is to make the output as 'pure' as possible, so that we may end up doing things which make no sense in standard Python but 'should' make sense: for example adding methods to True and False. In CPython that's not possible, since they're poorly implemented in C, and this would also change their API, but as a Python programmer it feels like this is a bug in CPython, and that there's no reason to stop doing it just because CPython's implementation gets in the way. That gives optional translations which, for example, change:

if a:
    print b
else:
    print c

to (a).__if__('print b', 'print c') which wouldn't work in CPython due to its True and False peculiarities.

Given enough of each type of translation, compatible and incompatible, I hope to be able to get Python code down to a bare minimum of syntax. I believe that to be message sends and strings.

The blog post kind of trails off there. Oh well, better that I'm writing code than blogs anyway ;)

Friday 3 September 2010

Odd SOCKS

As a follow-on from my last post I'll discuss the networking on my Fedora 11 machine. I'm not a massive networking geek; I know quite a bit, but mostly through trial-and-error as a step to getting something else working. For this reason I've never introduced something unnecessary into the mix for its own sake, and thus all I knew about SOCKS was that there are versions 4 and 5 in the wild, and that it's used by The Onion Router. Since I've always used TOR for HTTP, and used a privacy-filtering-type proxy like Privoxy to handle the HTTP<->SOCKS tunneling, I've never bothered learning more about it.

Now, my office machine is set up to route through a SOCKS proxy on a different machine on the ethernetwork. Web browsing is simply a matter of Firefox's Edit->Settings->Advanced->Network->Settings. However, Thunderbird (which has exactly the same connection changing interface as Firefox) can't connect through IMAP this way. This is due to IMAP not going over HTTP. The same applies to SVN, ping, etc. There is a solution though, which is to use connect-tunnel to run a local server for redirecting calls to any port through a HTTP proxy. Running a connect-tunnel command like "./connect-tunnel -P 192.168.1.1:1080 -T 10234:mysite.com:3690" will allow us to run a command like "svn co svn://localhost:10234" which will think it's checking out a subversion repository from the current machine on port 10234, when in fact it is being routed through a HTTP proxy on port 1080 of 192.168.1.1 to the remote machine at mysite.com on port 3690.

This is great, but the proxy isn't HTTP, it's SOCKS, so we need some way of tunneling this HTTP connection through SOCKS (so, in this example, we'd have a SOCKS proxy redirecting HTTP traffic which is encapsulating an SVN connection). This was fine for Firefox, which has built-in facilities to redirect over a SOCKS proxy, but not all applications do (especially if they don't even know they're being tunneled through HTTP in the first place!). This is where we use a "socksify" program, which intercepts an application's network connections and redirects them through the SOCKS proxy without the application which made them knowing about it. There are a few of these, for example Dante contains such functionality and is in Debian, but unfortunately it isn't in Fedora and I was after something I wouldn't have to compile (due to dependency hell). Fedora does contain tsocks, which is pretty simple to set up and use, and worked for some applications. It can be used either via "tsocks my_application_name" (eg. "tsocks evolution" to run the Evolution mail reader through SOCKS), or via adding its library to the LD_PRELOAD environment variable, which affects all subsequently executed applications (eg. put it in your .bash_profile). However I found it didn't work for some programs. I then tried proxychains, which I'm pretty happy with. It is called in a similar way to tsocks, eg. "proxychains evolution", and works in some cases where tsocks doesn't.

Aside from this I also recommend setting environment variables, such as "http_proxy" (which, once again, can be done via .bash_profile) and your desktop's settings (I found Gnome's to be useless here, since I didn't come across any application which actually paid attention to it, but KDE applications consistently adhere to this setting, even if not all of them wind up working properly). For Subversion the /etc/subversion/servers file is a good place to put HTTP proxy settings, and try using http:// for the repository address, as this may be configured to work as well as svn://.
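For example, the relevant part of /etc/subversion/servers looks something like this (the host and port here are placeholders, not the values from my setup; use whatever your network actually provides):

[global]
http-proxy-host = proxy.example.com
http-proxy-port = 8080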

Basically, try everything you can find, try them in combination and try layering/tunneling them. Even if you don't seem to need one method now, the next application you try may need it ;)

Freedom to Work

I've been messing with my new work computer, on which I had the option of running Windows Vista or Fedora 11, and thought I'd give a run-down of how I've been finding it, the issues I've come across, how I've fixed them and future tinkering opportunities. Needless to say, if you're looking for any information about fixing issues with Windows Vista then you're in the right place. As a first step you'll need to wipe over it with Fedora 11, then your computer will be usable and under your control so that you can follow the advice below if you'd like to.

Graphics
The graphics card is Nvidia, but isn't supported by the nv driver :( I tried nouveau, which starts up OK with the right resolution and a working cursor, but everything else looks like someone's taken a dump in the framebuffer. I've kept clear of Nvidia cards for years because they're anti-Freedom, but a consequence of this is that I'm not familiar with nouveau settings like I am with Intel, AMD and VIA. Rather than spend too long trying to grok nouveau I had to do the unthinkable: install Nvidia's binary blob :( Gives output at the native resolution of the widescreen monitor, unlike VESA, but obviously I'd like to purge this proprietary contaminant from the machine at some point. I think I'll do some research into nouveau driver options in my own time, then spend a few minutes each morning testing them at work, since it obviously requires restarting X over and over. If I can get it working I'll post the results here and maybe on some Wikis. It certainly can't be as awkward as getting my XO's screen to play nicely under Debian ;)

Codecs
There were only (supposedly) patent-free Free Software codecs installed on the machine, since that's how Fedora has to ship as it's from the "land of the free", but for those of us in less backwards jurisdictions this isn't particularly brilliant for testing videos and things. I'm not overly familiar with Fedora's package naming conventions so I installed Gstreamer's good, bad and ugly meta-packages, but a lot of guides don't make a distinction between non-Free codecs and patent-encumbered Free Software codecs, so I might've unwittingly pulled in some evilware. I'll take a closer look through at some point, but for now I'm sticking to MPlayer, since it's a far more powerful application than, eg. Totem, and is more useful from the commandline.

Previously-installed Malware
When I took over the machine it was full of proprietary junk. It's got some scarily dangerous repositories enabled like one from Adobe, one from Dropbox and, less evil (or so they say), one from Google. It's got some crappy Flash plugin installed from Adobe, but thankfully SWFDec overrides it. I'm not sure how this proprietary player's installed itself, so I haven't looked into how to undo it utterly. When I do I think I may have to keep an equivalent in some sort of sandbox anyway (eg. a UnionFS overlay) since I may have to test some sites which are infected with Flash (although, of course, I don't know how anyone could debug something for which they don't have full, irrevocable access to the source). As well as this it's also got Skype, which is a program that does a limited form of SIP/Jingle-style VoIP. However, not only is the client proprietary, the whole network it uses is a black box too! I'd like to purge this from existence, but unfortunately work does some communication via this obfuscated mess, so I can't for the moment because there's no interoperable replacement (did someone say "vendor lockin" AKA "the exact reason why this crock should never be used by anyone"?). Once again I think a sandbox might be required to stop it playing havoc with the system, or whatever it's meant to do (I don't know since I've not seen the code). As well as this is Dropbox, which I've never had the misfortune to use directly, but I've heard many horror stories from the Computer Science department about how prevalent the myth is that Dropbox is somehow a version control system. What it really seems to be is some sort of UPnP-aware WebDAV-esque remote filesystem, which appears to attempt peer-to-peer synchronisation and causes unending conflicts if it's used for anything more elaborate than the equivalent of an email conversation. I can't really see the point of it, since it's just a poor man's Git, but once again I don't know of any compatible alternative (LOCKIN ALERT! LOCKIN ALERT!) so I think I'll sandbox it too, at least the daemon.

It seems a shame to me that so much infrastructure has been built on such untrustworthy foundations, for a company which is apparently "Proud to be Open Source". Of course, the gatekeepers of these secret services will die at some point, and their code will either be freed or become useless, but hopefully we won't have to wait that long for someone to kill it. Who knows, maybe I could play a role in this now that I'm trying my hand at Web technologies?

The next post will cover networking, which occasionally throws up problems as I do new things, but what I've tried so far is worth sharing nonetheless.

Ciao.

Thursday 26 August 2010

IWF Crap

Seems the Internet Watch Foundation are at it again! Sub-domains of Webs.com are currently inaccessible to anyone in the UK whose ISP pays attention to the IWF. This means that I am being blocked from visiting my own Web site because it supposedly contains child porn (the only reason the IWF blocks sites). Hmm. You can visit it via its old address here and judge for yourself how much is illegal pornography, or you can visit it through a non-UK proxy. Annoyingly the www.tk servers won't let me change the redirect address of www.chriswarbo.tk. I think it may have expired :'( If you think this is ridiculous then do what I do and support the Open Rights Group :)

Saturday 26 June 2010

Bits & Bobs

The title of 'father of computing' is usually assigned to Alan Turing. Before him Babbage had sketched ideas for his Analytical Engine, but had not proved anything about its operation. Alonzo Church had created his Lambda Calculus formalism, but the computation (beta reduction) still had to be performed by hand. It was Turing who considered a physical machine for performing computation, studied the machine mathematically and then built some.

Turing's approach was to think of a mechanical Mathematician. He thought, what is it that Mathematicians actually do when they perform Mathematics? Well they sit in front of a piece of paper containing some Maths, for example it might be the statement of a problem like:

x+2*5y=12
y=-3
x?

Turing argued that a machine would find it easier to read such things if, rather than having a variety of notation styles, the symbols used could be written neatly instead; for example by writing one symbol in each square on some squared paper.

x + 2 * 5 y = 1 2
y = - 3
x ?

Next he said that we don't really need a 2D sheet of squared paper, we can do it in 1D if we have some symbol for a new line, like ";". Thus our Maths becomes:

x + 2 * 5 y = 1 2 ; y = - 3 ; x ? ;

Now, which symbols should we use? If we want to build a machine then we want as few as possible, to keep things simple. Turing noted that rather than giving everything a unique symbol, they could be given labels based on a number of simpler symbols, like writing "10" instead of "x" for example. Thus we can say "10" means "x", "11" means "y", "12" means "+", "13" means "*", "14" means "=", "15" means ";", "16" means "?" and "17" means "-". Now our problem looks like:

10 12 2 13 5 11 14 1 2 15 11 14 17 3 15 10 16 15

Now it's getting confusing: we've got things like "1 2" as well as "12", how can we tell the difference? We do it by making every number 8 digits long, filling any unused digits with 0. This gives

00000010 00000012 00000002 00000013 00000005 00000011 00000014 00000001 00000002 00000015 00000011 00000014 00000017 00000003 00000015 00000010 00000016 00000015

Now that everything is written using only numbers, all in a straight line, in groups of 8, it's looking much more likely that a machine will be able to read it mechanically, since it only needs to know how to read "0", "1", "2", "3", "4", "5", "6", "7", "8" and "9". In fact, we don't even need to use all of these, since it's entirely possible to write numbers with only "0" and "1", which is called binary (as opposed to decimal) Maths. So we can rewrite the above problem converting each 8-digit-long decimal number into an 8-bit-long binary number:

00001010 00001100 00000010 00001101 00000101 00001011 00001110 00000001 00000010 00001111 00001011 00001110 00010001 00000011 00001111 00001010 00010000 00001111

Of course, I'm also using spaces (" ") to make it clear where one number ends and the next begins. Since they're all 8 bits long I don't need to do this, so our final, simplified problem is written as:

000010100000110000000010000011010000010100001011000011100000000100000010000011110000101100001110000100010000001100001111000010100001000000001111
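Just to show there's nothing magic going on, here are a few lines of Python (my own sketch, obviously not anything of Turing's) which carry out the whole encoding above:

symbols = {'x': 10, 'y': 11, '+': 12, '*': 13, '=': 14, ';': 15, '?': 16, '-': 17,
           '0': 0, '1': 1, '2': 2, '3': 3, '4': 4,
           '5': 5, '6': 6, '7': 7, '8': 8, '9': 9}

def encode(maths):
    # Turn each symbol into its number, write that number as 8 binary digits,
    # then stick them all together into one long line
    return ''.join('{0:08b}'.format(symbols[s]) for s in maths.split())

print(encode('x + 2 * 5 y = 1 2 ; y = - 3 ; x ? ;'))

Running this should print exactly the long string of 0s and 1s above.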

Now our machine only needs to be able to tell the difference between 2 things, 0 and 1. Of course, we don't need to write them down on paper as 0 and 1, we can use anything that can be in one of two states. For example we can use the magnetism of a piece of metal, which can point in one of two directions (this is used in hard drives); we can use switches, which can be either on or off (this is used in RAM and USB drives); we can use the reflectivity of a surface, which either reflects or doesn't (this is used in CDs and DVDs), and so on.

Now that we have a mechanical device that reads our incredibly simplified Maths notation, what does it do? Turing argued that brains aren't magical things: people just follow certain rules that they've learned, or even things that they've read further back on the page. Thus people go forwards and backwards over the page, changing the contents of their brain as they read, and occasionally writing something down. Thus Turing's machine just has to go forwards and backwards over its zeros and ones, called "the tape", performing certain actions based on what it reads, and occasionally writing to the tape (setting a bit to be 0 or 1). The actions it takes define the machine, and Turing proved that some sets of instructions are able to mimic any other set of instructions. These types of machine are called Universal Turing Machines, and Turing proved all kinds of interesting things about them, for example that it's impossible to know for certain whether a given machine will ever run out of actions to take or not (known as the Halting Problem).
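To make that "trundling back and forth" concrete, here's a toy version in Python (entirely my own sketch, not Turing's notation): the machine is just a table of rules saying, for each state and symbol read, what to write, which way to move and which state to go to next.

def run(rules, tape, state='start'):
    # 'rules' maps (state, symbol read) to (symbol to write, move, next state)
    tape = dict(enumerate(tape))    # cells we haven't visited yet count as 0
    position = 0
    while state != 'halt':
        symbol = tape.get(position, 0)
        write, move, state = rules[(state, symbol)]
        tape[position] = write
        position += 1 if move == 'right' else -1
    return [tape[i] for i in sorted(tape)]

# A machine which blanks out a run of 1s, stopping at the first 0 it finds
rules = {('start', 1): (0, 'right', 'start'),
         ('start', 0): (0, 'right', 'halt')}
print(run(rules, [1, 1, 1, 0]))    # prints [0, 0, 0, 0]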

[Illustration of Turing's machine]
An illustration of Turing's machine is shown above, and Turing and Church showed that Universal Turing Machines are just as good at Maths as Church's Lambda Calculus, ie. they can solve exactly the same problems. They put forward the argument that many other things can solve those exact problems too, but that nothing exists which can solve more than those. This is known as the Church-Turing Thesis, and was recently given a formal proof, from axioms about what counts as computation, by Nachum Dershowitz and Yuri Gurevich. Thus Turing's simple little device, trundling back and forth over its tape, is actually the most powerful computer in the world. Of course, it's in joint-first-place with every other computer, since the machines we use every day are equivalent in power to a Universal Turing Machine; they are known as Turing Complete.