Wednesday, 13 August 2008

Some ramblings about graphics, compilers and CPUs

This is probably very wrong, but whatever, it's *my* blog :P

I am following the work going on with Gallium3D as much as possible, since this seems like it should offer some pretty cool features when it becomes stable. It seems (from what I understand) to be an abstraction layer over the graphics card. Until now, drivers have been written for graphics cards to implement certain things like OpenGL, and anything the hardware doesn't support falls back to (much slower) software rendering (like Mesa for OpenGL), or, if that's not available, it just doesn't work.

Since the drivers are responsible for providing a lot of things as well as the basic hardware access, this results in a lot of duplicated work, with slightly different ways of doing the same things in each driver. For instance, every driver needs to implement a memory manager: the Intel driver's memory manager is called GEM and is Free Software, available for other drivers to use, but it's written in a very Intel-specific way and is therefore useless to other drivers. There is work going on to make a generic memory manager for the kernel and a generic way for graphics drivers to access the hardware (the Direct Rendering Infrastructure, working towards version 2), but that still leaves rather a lot of common ground in each driver.

Gallium3D is a very generic abstraction layer designed to sit in between the drivers and the actual programs and libraries being used. Gallium3D works in a very similar way to the actual silicon inside current graphics cards, so writing a driver to sit between the card and Gallium3D is pretty straightforward, because not much has to be translated from one representation to another (and DRI can still be used to do a lot of the background work). This makes Gallium3D act rather like a software graphics card, but it doesn't cause too much slowdown since a) it is so similar to real hardware and b) it uses the Low Level Virtual Machine (LLVM) system to gain Just-In-Time compilation and optimisation for its code*. Since Gallium3D is software it can run on any hardware with a suitable driver (which, as I said, is pretty painless to make), so libraries like OpenGL can be written to talk to Gallium3D and they will run on whatever hardware Gallium3D talks to (with software like Mesa handling anything the hardware can't). This means that writing a graphics library is a lot easier (you only have to write a state tracker for Gallium3D, everything else comes free after that) and thus more than the basic OpenGL-and-that's-it can be used.

As an example, WINE can run Windows programs on Linux, BSD, MacOS, etc. (and even on Windows :P ). However, a lot of Windows programs and especially games use Microsoft's DirectX system for 3D graphics. Normally the WINE team writes replacements for Microsoft's libraries which perform the same job, thus allowing Windows programs to run, but writing their own DirectX implementation would be a huge job on its own. Making a different one for every graphics card driver out there would be even worse, and relying on the driver's developers to do it like they currently do for OpenGL would make the drivers EVEN MORE bloated and complicated than they already are.

Thus the WINE team are writing a DirectX implementation, but instead of working with the graphics card directly they are writing it to use OpenGL (since OpenGL is already in the current drivers). This is pretty inefficient, however: DirectX operations need to be translated to OpenGL, then the OpenGL needs to be translated to however the graphics card works, and only then run. Since games are generally very resource intensive, not to mention that 3D is one of the hardest things for a computer to do anyway, it's far from ideal. With Gallium3D, however, the OpenGL translation can be skipped entirely and DirectX can be implemented straight on top of Gallium3D, which then gets sent very efficiently to the graphics card whilst still only needing to be written for one system. Likewise, other libraries can be accelerated through a graphics card without fear of some users not having a driver capable of it. Application developers could even write a custom library for their application and count on it being accelerated regardless of the card it runs on (for example, a proprietary program cannot count on drivers containing code to handle its custom, proprietary library, since that code can't be included by its very nature. It can, however, count on Gallium3D being installed, and thus can include one implementation which will be accelerated regardless of the driver).

* JIT compilation is pretty cool. There are historically two types of program: one is compiled (like C), which means the program is first turned from source code into machine code, and then this machine code runs when you tell it to. The second is interpreted (like Python), which means that instead of running directly on the computer, it runs inside another program called a "virtual machine" (which is usually compiled, although it can itself be interpreted, as long as at some point there is a compiled virtual machine talking to the real computer). Programs are more flexible than computer processors, which means that interpreted languages can have many nice features added and can usually be written more simply than compiled languages. Compiled programs usually run faster than interpreted programs, since they run directly, with no virtual machine acting as a middle-man.

Just-In-Time compilation, however, is like a mixture of the two. The source code is NOT compiled into a runnable program ahead of time, making it different from compiled languages. However, there is no need for a virtual machine to execute it, making it not an interpreted language either. What happens is very similar to an interpreted language, but instead of the virtual machine looking at what to do next and then telling the computer, in a JIT language the next bit of code gets compiled just before it needs to run and is then given to the real computer just like a compiled program.

At first glance it would seem like this might be faster than interpreting a language (again, no middle-man) but slightly slower than a compiled program (since with a compiled program all of the compiling is already done before you run the program). However, JIT-compiled languages can actually be FASTER than compiled programs!

This is because when a program is compiled, that's it: the finished code is what will be sent to the computer. Compiled programs must therefore be able to handle anything they are given, so a banking program must be able to handle everything from fractions of pennies up to billions and perhaps trillions of pounds. Such large numbers take up a lot of memory, and moving them around and performing calculations on them takes time. Compiled programs have to use a lot of memory for everything and accept these slow-downs just in case they are given a huge number. This means handling £1 as something like £0000000000001, just so the program can deal with a trillion pounds if so commanded. A JIT-compiled program, however, knows how much it is dealing with, so £1 can be stored as simply £1, whilst a thousand billion pounds can still be dealt with if it comes along by storing it as £1000000000000. This means JIT programs can use less memory for storing values, their calculations can be quicker as they don't need to deal with unneeded digits, and the program can speed up greatly due to fewer cache misses**.
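To make that a bit more concrete, here's a tiny C sketch of the idea (the variable names and sizes are made up for illustration, and of course a real JIT works on machine code rather than C source): the precompiled version always reserves a worst-case-sized number, while a version specialised after the actual value is known can get away with something much smaller.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Precompiled worst case: always reserve room for trillions of pennies. */
    int64_t balance_wide = 1;      /* 8 bytes just to hold the value 1 */

    /* What a JIT that has already seen the real value could use instead. */
    int8_t balance_narrow = 1;     /* 1 byte holds it just as well */

    printf("wide: %zu bytes, narrow: %zu bytes\n",
           sizeof balance_wide, sizeof balance_narrow);
    return 0;
}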

Another advantage of JIT compilation is that unneeded calculations can be skipped. For example, a program may need to add deposits to an account and take away withdrawals. In a precompiled program this will always take at least two calculations, either add the deposits to the total then take the withdrawals from the total, or take the withdrawals away from the deposits and add the result onto the total. In a JIT-compiled program this can be made more efficient, since if the withdrawal amount is zero then the compiler can skip one of the calculations, and likewise if the deposits are zero. If both are zero then both calculations can be skipped, and if they are both the same then the calculations can also be skipped. For instance, compare the following precompiled pseudo-code, where it has to work for any values, and the JIT pseudo-code which already knows the values since it is compiled whilst the program is running:

Precompiled:
add(total, deposit)
subtract(total, withdrawal)

OR

add(total, subtract(deposit, withdrawal))

JIT:
add(total, 15)
subtract(total, 15)

OR

add(total, subtract(15, 15))

The JIT compiler can skip both of those, since it knows they cancel each other out.
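Spelt out in real code, the same idea looks something like the C sketch below (the function names are made up for illustration, and a real JIT emits machine code rather than C): the generic version always does both calculations, while the version a JIT could produce once it has seen deposit = 15 and withdrawal = 15 collapses to nothing at all.

#include <stdio.h>

/* The generic, precompiled version: it must work for any values. */
static long update_generic(long total, long deposit, long withdrawal)
{
    total = total + deposit;      /* always performed */
    total = total - withdrawal;   /* always performed */
    return total;
}

/* What a JIT could effectively emit after seeing deposit == withdrawal == 15:
   the two operations cancel out, so no arithmetic is left at all. */
static long update_specialised(long total)
{
    return total;
}

int main(void)
{
    printf("%ld\n", update_generic(100, 15, 15));   /* prints 100 */
    printf("%ld\n", update_specialised(100));       /* prints 100, no arithmetic */
    return 0;
}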

JIT compilation is becoming more and more common, with JIT compilers being written for Java and Smalltalk, for example. There is even JIT support in EUAE (the E(nhanced/xtended/ggplant) U(NIX/biquitous) Amiga Emulator, the exact naming of which is open to interpretation). An emulator translates programs written for one computer into something which will run on a different type of computer. In EUAE's case this means running code written for the chips in Amiga computers (such as the Motorola 68020) on different chips like the Intel 386. This used to be done by treating the Amiga programs like an interpreted language, with the emulator acting as the virtual machine. With the JIT engine, however, the Amiga programs can be run directly on the different chip, with the "compilation" actually being a translation from one chip's instructions to another's.

A very promising project currently being worked on is the Low Level Virtual Machine (LLVM). Traditional compilers, like the GNU Compiler Collection, work by translating the given language into a common, internal language which the compiler can work with. Various optimisations can then be done on this internal representation before translating it into machine code for whatever computer is requested. LLVM, however, is slightly more clever. It performs the same translation and optimisation, but is not only able to translate this internal representation into machine code, it also has a virtual machine capable of interpreting it and is even able to JIT compile it, depending on the options chosen by the user. This means that, for example, C code, which is the classic example of a compiled language, can be fed into LLVM and made to run in a virtual machine or be JIT compiled. The same goes for Ada, Smalltalk and other languages which LLVM can handle. This means that LLVM could potentially (when bugs are fixed and GCC-specific assumptions in some programs are handled) make almost the whole of a computer system compile Just-In-Time and be optimised on-the-fly (not quite everything though, since something needs to start up the JIT compiler :P ). LLVM could even optimise itself by compiling itself. Likewise an entire system could be run in a virtual machine without the need for the likes of KVM or Virtual Box. The future looks pretty interesting!
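As a tiny illustration (assuming an LLVM toolchain that provides clang and lli; the exact commands vary between versions, and 2008-era LLVM used llvm-gcc as its front-end instead), the very same C source can be compiled ahead of time, or turned into LLVM's internal bitcode and then interpreted or JIT-compiled:

/* hello.c */
#include <stdio.h>

int main(void)
{
    puts("same C source: compiled, interpreted or JIT-compiled");
    return 0;
}

/* Compile to native code ahead of time:  clang hello.c -o hello
   Compile to LLVM bitcode instead:       clang -emit-llvm -c hello.c -o hello.bc
   Interpret or JIT-compile the bitcode:  lli hello.bc */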

** Computer processors can be as fast as you care to make them, but that doesn't matter if you can't feed them things to do at the same rate. The processor contains "registers" which hold the numbers it's dealing with; however, these registers can only contain one thing at a time, and thus their values have to keep changing for each operation. The place where numbers can actually be stored over a long time is the RAM (Random Access Memory): the processor's registers can get their next values from any point in the RAM and dump the results of calculations just performed into any other point in the RAM. However, this is a huge slowdown, so processors have a cache which is able to store a few things at a time, and the registers are connected to the cache rather than the RAM. This is a lot faster, since frequently used values can be stored in the cache and accessed really quickly (these are called "hits"). However, if something needed is not yet in the cache then it has to be fetched from the RAM (a "miss"), which a) takes time and b) means kicking something else out of the cache to make room for the new value.
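A classic little demonstration of this (just a sketch; the actual speed difference depends on the machine) is summing the same grid of numbers in two different orders. Walking through memory in order keeps reusing whatever the cache has just fetched (mostly hits), whilst jumping around with a big stride keeps throwing it away (mostly misses), even though the arithmetic is identical:

#include <stdio.h>

#define ROWS 2048
#define COLS 2048

static int grid[ROWS][COLS];    /* 16 MB of numbers, far bigger than any cache */

int main(void)
{
    long sum = 0;

    /* Cache-friendly: consecutive addresses, so each fetched chunk is reused. */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];

    /* Cache-hostile: jumps COLS numbers at a time, so the cache barely helps. */
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];

    printf("%ld\n", sum);
    return 0;
}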

There are various algorithms for trying to keep cache misses as low as possible, but an obvious way is to use small numbers. In a really simplified (and decimal rather than binary) example, let's say a cache can store ten digits; then I can store, for example, 5, 16, 3, 78, 4, 67 and 1 all at once, since that is 10 digits in total. That can't really be made any more efficient, so if a different number is needed (a miss occurs) then at least one of them has to be replaced. If a program is being overly cautious about the size of the numbers it is expecting, then it might use the cache inefficiently. For example, say the program is anticipating values of a few hundred: it might want to store the number 547, which would require three places in the cache. If the number it actually deals with turns out to be five, then it will store 005 in the cache, wasting two spaces, replacing numbers it didn't need to and thus increasing the likelihood of a miss. If there is a miss later and one or two of those three digits are chosen to be replaced, then the whole 005 gets kicked out rather than just the wasteful zeroes, meaning that any future calculation needing that 5 will result in a miss and the whole 005 will have to be brought in again from the RAM, causing the same inefficiencies again.

OK, back to revision :P
