Tuesday, 25 May 2010

Symmetry in Languages

This post is about languages and grammars*, and I'll be using examples in a pseudo-BNF notation which looks like the following:

abc ::= ('foo' <bar>) | <baz> | <bar> <baz>

This reads as "An abc is defined as the letters 'foo' followed by a bar, or as a baz, or as a bar followed by a baz."

Consider the Newspeak language from George Orwell's 1984. Part of the grammar would look like the following:

word ::= ('un' <base>) | <base>
emphasised ::= 'plus' <word>
twiceemphasised ::= 'double' <emphasised>
valid ::= <word> | <emphasised> | <twiceemphasised>

Such a grammar allows us to, given the base word "good", say the valid words "good", "ungood", "plusgood", "plusungood", "doubleplusgood" and "doubleplusungood". This is the entire spectrum of "goodness" allowed by Newspeak, which is a purposefully restrictive language.

Now, what happens if we get rid of some of these rules? Let's try getting rid of 'twiceemphasised':

word ::= ('un' <base>) | <base>
emphasised ::= 'plus' <word>
valid ::= <word> | <emphasised>

We reduce the expressiveness, since we now only have one way to emphasise things, there's no level above "plusgood".

Now what if we get rid of "emphasised", but just call it another kind of "word"?

word ::= 'un' <base> | <base> | 'plus' <word>
valid ::= <word>

Now, the definition of "word" contains the use of a "word", which is recursive. What does this mean? Well "plusgood" is now a "word", and we're allowed to write 'plus' in front of "words", so we can say "plusplusgood", "plusplusplusgood" and any other number of new words. We can also say "plusplusungood" along with an infinite number of negatives.

Thus getting rid of rules completely (like "twiceemphasised") reduces what we can say, but by getting rid of distinctions and special cases (like the distinction between "word" and "emphasised") we get a whole new level of expressiveness. In fact, since we can say "plusplusgood" there isn't a need for "doubleplusgood" anymore, it's redundant, so there's no point putting it back in. Also, the idea of "valid" is also redundant, since it's now just the same as a "word", so we can get rid of that too, to reduce our entire grammar (minus the "base" vocabulary of words like "good") to:

word ::= 'un' <base> | <base> | 'plus' <word>

This grammar, even though it only contains two concepts, allows us to say infinitely more than the original grammar that contained five. We can even get rid of the distinction between "base" and "word" by writing it as:

word ::= 'un' <word> | <base> | 'plus' <word>

Now we can say "unungood", "unplusgood", "plusunununununplusplusunplusgood" and infinitely more new words.

Now, where am I going with this? Well, let's consider a bit of grammar for the C programming language:

simpletype ::= 'int' | 'float' | 'double' | 'char'
modifiedtype ::= <simpletype> | ('const' | 'long' | 'signed' | 'unsigned') <modifiedtype>
composite ::= <union> | <struct>
pointer ::= <type> '*'
type ::= <pointer> | <composite> | <modifiedtype>

This is a bit off, but will serve for our purposes. It says, quite simply, that we can make things that have type "int", "float", "double" or "char", or we can modify those types to get things like "unsigned long long int". We can have "unions" and "structs", which are built out of other types, and we can have pointers to any of them (including pointers to pointers (to pointers...)).

Now let's consider a similar thing in Java, which claims to be an "object oriented" language:

basetype ::= 'byte' | 'short' | 'int' | 'long' | 'float' | 'double' | 'boolean' | 'char'
type ::= <object> | <basetype>

Where anybody can create their own kind of "object", for example strings of text, key/value pairs, Web browsers, customers, etc. In the end though, everything in Java ends up being defined via a few things from "basetype".

If we look at the first Object Oriented language, Smalltalk, we can see why I said Java is "object oriented" in quotes:

object

That's it. EVERYTHING in Smalltalk is an object. If you want a number, that's fine since there are number objects. You want a true/false boolean? That's fine since there are true and false objects. No expressiveness has been lost by throwing away the complicated special-cases of languages like C, and by adding "base types" Java has gained no expressiveness over Smalltalk. In fact, by having more special cases we lose expressiveness. For example if we want to write, in pseudocode, a function for the square of something, we can do so in C like so:

int square(int num) {
return num*num;
}

char square(char num) {
return num*num;
}

float square(float num) {
return num*num;
}

double square(double num) {
return num*num;
}

We would also need to take into account the various modified versions, whilst structs and unions can't be squared in a general way. We're also forgetting here that C would complain about multiple function definitions, so we would have to call each function something different (eg. int square_int(int num)), and the obvious issue that some of these number types might not be able to store the result of the multiplication if it's too big (don't get me started on that bollocks). In Java we get a similar thing, although we're allowed multiple functions with the same name since they take different argument types:

int square(int num) {
return num*num;
}

float square(float num) {
return num*num;
}

byte square(byte num) {
return num*num;
}

long square(long num) {
return num*num;
}

short square(short num) {
return num*num;
}

char square(char num) {
return num*num;
}

boolean square(boolean num) {
return num*num;
}

double square(double num) {
return num*num;
}

Object square(Object num) {
return Object*Object;
}

The last case, of Object, won't work since Java has EVEN MORE special cases to do with operators like *, defining them for some kinds of objects and not others. Now, what would we write for Smalltalk?

[:num | ^ num * num]

That's it. Done. Works for everything (for which multiplication is valid).

The nice thing about using objects in this way is that everything stays symmetric, ie. you can treat everything in the same way. This lack of special cases and artificial distinctions means that it's much easier and more intuitive to make systems which handle everything thrown at them. As long as there's a sane fallback in the case of failure, we can leave this function (that one line) alone FOREVER and be confident that it will still work with everything we give it; we don't need to keep fiddling with it as more types are added to the language.

What's more, now that we've got this working function, we can stick it in a library and use it everywhere. We can do the same for branches, via the methods:

foo isTrue: [something run] isFalse: [somethingelse run]

That works for ANY foo, and the two code blocks given are of course examples. We can do the same for iteration:

foo do: [:x | something run using x]

That will work for ANY foo which is iterable. Everything works in the same way, 40 year old code can work with objects written last week, code written last week can work with objects defined 40 years ago.

If you're adding a feature to a language via extra syntax, ask yourself whether you're allowing the expression of something new which was not previously possible, or whether you're just breaking the symmetry between that feature and the entirety of the rest of the language.

To bring this back to where it started, I'm very interested in the development of Newspeak.

* More specifically formally defined languages and grammars, not broken grammars retrofitted to natural languages via the use of duct tape and insistence that the language's users aren't adhering to the 'standard' which is based around the user's use of it anyway.

2 comments:

Richard said...

Never thought of Newspeak as a 'formal' language. Great post!

Anonymous said...

Good fill someone in on and this mail helped me alot in my college assignement. Gratefulness you on your information.