Tuesday, 12 April 2011

Python Decompiler Performance

I knew the performance of Python Decompiler was rubbish, but I didn't realise how rubbish! Based on the code analysis I wrote about last time, I've put the node classes in order of frequency and reordered the comparisons made in "thing" to match this. That's all I've done, and it's given a 58% speed increase and a 47% RAM decrease compared to the alphabetical order! Running the tests file on itself now only takes 16 seconds and 69MB of RAM. If I compare it to before I had a go at optimising, back before it used the input stream as a stack, the current code takes 64% less time and 66% less RAM.


Code, as always, is on Gitorious :)

Saturday, 9 April 2011

Some Python Stats

I realised today that there's a really easy-to-fix bottleneck in my Python Abstract Syntax Tree decompiler. The decompiler looks for "things", and for each "thing" it checks against a list of alternatives to see what it might be (and hence how to decompile it). This list is ordered, so it checks against the first rule, if it doesn't match then it checks against the next rule and so on until it either finds a match, or it runs out of rules to check against (and throws an error).

This ordering of the checks, specifically ordered choice, is what defines Parsing Expression Grammars as a deterministic counterpart to Backus-Naur Form, along with greedy matching.

There are two quirks of my Python decompiler, however, that can be used to our advantage to make it faster really easily.

Firstly, if the choice were not ordered, as in Backus-Naur Form, there is only one instance of nondeterminism, which happens for deletion. Thus, as long as deletion comes first, we can actually rearrange the other rules however we like.

The second quirk is that I have implemented the rules pretty much in accordance with the classes of AST nodes used in Python's compiler module. This means that we can predict the chances of each match succeeding by simply parsing existing Python code and counting up the number of times each node occurs.

I've written a quick script to do this counting, which you can find here. To use it, you need to provide a list of Python files for it to check. I did this for my whole operating system by running the command 'find / -name "*.py" > ALL_PYTHON_FILES' which will (after some time and "permission denied" warnings) give you a list called "ALL_PYTHON_FILES". For me this list contains 25,087 files.

Whilst the node_counter.py script runs reasonably quickly, I keep getting a segmentation fault with around 23,700 files to go. For this reason, I've also given it an optional second argument, which is the number of files to use from the given list. These are picked at random from throughout the file, to prevent it grabbing everything from the same project and thus biasing the results.

The script outputs a spreadsheet which contains each type of node and the number of times it was found in the files it checked. I used Gnumeric to work out some percentages and collate the results from running it against 1 file, 10 files, 100 files and 1000 files, and generated the following (which you can find the source for here)

The green lines are the most accurate, since they're from the sample of 1000 files; the others are there to give an indication of the variance, because I couldn't be bothered to work out the standard deviation.

What this shows us is that we can't guess what kind of node a given "thing" will be, since none of these is above 0.5 (ie. 50%), but what it does show is that some types of node are far more common than others. We can see that Name nodes (which represent the "x" in "x = 5", for example) are used all over the place, such that around 30% of the nodes are Names. Getattr is probably the next-most-used node (which represents the ".foo" in "bar.foo = 8") and the frequency rapidly decreases until we get some very niche nodes like Ellipsis which don't occur at all. Note that some classes, like "Expr" and "Node" will always show zero since they're superclasses that aren't directly instantiated by the compiler.

So what does this mean for performance increases? It means that by putting the node types in order of frequency, we can make the most common cases match after only a few attempts, whilst sacrificing the performance of rarely used classes.

Since this script is part of the Python decompiler, it's Public Domain. If it's of any use to you, go ahead and take it :)

Friday, 8 April 2011

Dear World

Have you *STILL* not realised that everything you ever put online is up for grabs? Just because someone spouts some bullshit phrases like "privacy settings", "passwords", "secret questions", etc. doesn't change this fact. What we do know is:

1) It's not known if computer security is even possible in theory.

2) Even if computer security turns out to be possible, then implementing it correctly is still a difficult problem.

3) Even if computer security is possible, and someone manages to implement it correctly, you have no idea if they're lying unless you have full access to all parts of the code and environment.

4) Even if computer security is possible, and someone manages to implement it correctly, and this can be verified, then you sure as hell shouldn't give them any stuff you care about, because they'll put in in a system that's theoretically sound, verified, correct and proven to be able to stop you ever getting it back. In short, you'd be fucked.

Whilst all software fails the first test, and most software fails the second test, Dropbox also fails the third and fourth. Don't use it, or anything like it, for anything private. If you still want to upload stuff then there are plenty of Free Software programs for doing so, ie. they pass tests 3 and 4.

Yes, I know that banks, credit card companies and the like send transactions over the Internet, but that doesn't mean that doing so is secure. It means that doing so is cheap. So cheap, in fact, that they can afford to pay for insurance to cover all of the fraud that happens on the system every day.

Update: DropBox are now being investigated for false advertising. What's the false claim they made? That they're secure. Specifically, DropBox claimed that their employees cannot access any of the data stored in their system, which turns out to be complete bollocks. The investigation is being made since DropBox has provided a simple, quick, insecure system but presented it as a simple, quick, secure system; whilst companies attempting to create real secure systems, which inevitably end up less simple and quick than DropBox, have been left in the dust by DropBox.

The moral of the story is to run your own clustered filesystem, or if speed isn't your main concern then use FreeNet.

Abstraction Fail

Loading Javascript in a Web page. Pretty high-level stuff. Isn't it fun to live in a world without endianness, memory allocation, GOTO, compilation, packets, out-of-order message arrival, unreliable streams, network addresses, filesystems and all of that other low-level crap? We can float in our clouds high above it all :)

Except when Facebook Connect doesn't work in IE because of an OS bug in the networking stack. The network layer in Windows XP can throw away incoming data if it is a) "chunked" (ie. split into sections) and b) compressed. Since missing data causes the decompression to fail, this results in compressed data being sent to the browser, which subsequently ignores it as incomprehensible gibberish. Thus Facebook Connect didn't work. Bloody Microsoft >:(

If you've been hit by this, there's no solution I know of. As a workaround, you may be able to try swapping from HTTP to HTTPS or HTTPS to HTTP in the URL you're referencing, but of course that will only work if a) the server you're contacting supports HTTP and HTTPS (Facebook does, thankfully), b) the one you switch to doesn't trigger this bug as well and c) you don't give a crap about using SSL.

Saturday, 2 April 2011

PyMeta Stacks

As you may know if you've been following my blog, I've been working on a project for a few years called, rather glamorously, "python decompiler". I've released it into the Public Domain and you can find it on gitorious.

The idea is that compiling software (translating it from one form to another) is a mathematical operation, and that for every mathematical operation we would like to know its inverse. Version 2 of Python contains a compiler, which creates Abstract Syntax Trees (ASTs) out of Python code, but there is no reverse transformation to be found. That's what python decompiler does; it takes an AST and produces Python code that will compile into that AST, thus compile(decompile(compile(code))) is the same as compile(code). We can't quite make decompile(compile(code)) match the original code since the compiler throws away information ((((((like redundant brackets)))))).

So how does it work? I've used the PyMeta library, a Python implementation of the OMeta pattern matching system. This is applied to the AST, and pattern-matches against AST node objects. The only difficulty with doing this is that a OMeta is designed for pattern-matching against an iterable object, such as a string of text, a list of objects, etc. and I want to use it to pattern-match recursively inside objects, preferably without blowing those objects apart into a serialised form beforehand.

The original way I did this was to instantiate a new pattern matcher on every child object, so for example the Python code "1 + 2 * 3 / 4 - 5" will produce an AST a bit like "Sub(Add(Const(1),Div(Mul(Const(2),Const(3)),Const(4)),Const(5))". The classical way to pattern-match such a tree is to make a rule that matches "Sub(A,B)" where "A" and "B" are themselves other rules, and so on. This recursion of rules is great for digging out structures like the AST I've written out as text above, however in Python we've got a structured system of objects that a) doesn't need constructing (it's already structured) and b) has encapsulated innards, which makes pattern-matching against the structure tedious (although Python's introspection makes it reasonable, at least). Since Python is a dynamic language we would also like to avoid having to look at the inside of objects, since that would make the code less flexible than if it implies the contents. By making new pattern matchers, rather than defining new rules, we get arbitrary nesting ability and don't have to look inside objects (they look inside themselves, recursively). The drawback to this is that every single node in the AST needs its own pattern matcher, which takes ages to run (since OMeta pattern matchers are bootstrapped from the ground up in themselves) and eats up tons of RAM.

I received a couple of emails about python decompiler the other day, so it seems that people may be playing with it after all. To that end, I decided to have a bit of a think about improving its efficiency, and realised that I could get the required recursion if I turn the input stream into a stack. This is suprisingly easy to do, and straightforward to use (as long as you make sure the stack will be clean when you're finished, regardless of if your rules match or not!). We add the following function to a newly constructed grammar (which is a class in PyMeta):

def ins(self, val):
...self.input.data.insert(self.input.position+1, val)
...self.input.tl = None
...self.input.memo = {}
grammar.ins = ins

(I've written "..." to represent indentation. Silly blogger HTML editor.) 5 lines is all it takes, but now we can push on to a pattern matcher's input stream from within the grammar. We do this by calling Python code from within a rule, by wrapping it as !(code). Thus we just have to call !(self.ins(a)) and "a", whatever it is, will be the next thing on the input stream, so we can match against it. This turns the input stream into a stack, and makes OMeta recursive whilst only needing 1 pattern matcher object.

I've now converted python decompiler to use this form. I've run some quick tests by getting python decompiler to compile, decompile and recompile its own test file, and check the results for validity. The previous method managed, using the "time" utility and KDE's system monitor, take 44 seconds and 202MB of RAM. This is a completely stupid amount of resources for text processing, even in an interpreted language. After the change it now takes 38 seconds and 129MB RAM. That's not a great speed improvement (14%), but the memory usage is much better (a 36% reduction). I'm sure I can get this down even more with a little playing :)