Tuesday, August 25, 2009

PyPy gets a new compiler

Today, I merged the parser-compiler branch, which I have been working on over the summer. It contained a total rewrite of both PyPy's Python parser and AST compiler. PyPy's old parser was (in)famous internally for being complicated and slow (with many algorithmic complexities greater than O(n)). The new parser is a simple as I could make it LL(1) parser like CPython (though it doesn't share the hacks of CPython's parser).

The new compiler is based on the Abstract Syntax Trees (AST) that CPython 2.5 introduced instead of PyPy's old AST based on the compiler package's. This means that Python code running on PyPy will be able to use the same _ast interface as CPython. PyPy's _ast implementation supports AST features that CPython 2.6 added, including compiling modified AST to bytecode and executing it. In this rewrite, some more obscure compiler features were added, too. For example, jumps in bytecode can now be greater than 65535 bytes! (That's like an if statement with 7000 lines of code in the body.)

While the PyPy translation toolchain still has many obscure details and hacks, this merge completes the process of making the actual Python interpreter very clean. Hopefully, this will make adding new features much easier and make PyPy less frustrating to maintain as well as providing application level code with an improved AST interface!

Gothenburg JIT sprint report

Finally, we managed to squeeze in some time to write a report about what has been going on the mysterious JIT sprint in Gothenburg, Sweden. The main goals of the sprint were to lay down the groundwork for getting more JIT work going in the next months and get more of PyPy developers up to speed with the current state of the JIT. One of the elements was to get better stability of the JIT, moving it slowly from being a prototype to actually work nicely on larger programs.

The secret goal of the sprint was to seek more speed, which Anto and Carl Friedrich did even during the break day:

We spent the first two days improving test coverage of the x86 backend and the optimizer. Now we have 100% coverage with unittests (modulo figleaf bugs), which does not mean anything, but it's better than before.

Then we spent quite some time improving the optimizer passes, so now we generate far less code than before the sprint, because a lot of it is optimized away. On the interpreter side, we marked more objects (like code objects) as immutable, so that reading fields from them can be constant-folded.

Another important optimization that we did is to remove consecutive reading of the same fields from the same structure, if no code in between can change it.

Our JIT is a hybrid environment, where only hot loops of code are jitted and the rest stays being interpreted. We found out that the performance of the non-jitted part was suboptimal, because all accesses to python frames went through an extra layer of indirection. We removed this layer of indirection, in the case where the jit and the interpreter cannot access the same frame (which is the common case).

We also spent some time improving the performance of our x86 backend, by making it use more registers and by doing more advanced variable renaming at the end of loops. It seems that using more registerd is not as much of a win as we hoped, because modern day processors are much smarter than we thought.

The most mind bending part was finding why we loose performance by making the JIT see more of the interpreter. It took us two very frustrating days and 36 gray hairs to find out that from the JIT we call a different malloc function in the Boehm GC, which is by far slower than the version that we use from the interpreter. This meant that the more we jitted, the slower our code got, purely because of the mallocs.

Now that this is fixed, the world makes much more sense again.

A lot of the sprint's work is not directly measurable in the performance figures, but we did a lot of work that is necessary for performance to improve in the next weeks. After we have done a bit more work, we should be able to provide some performance figures for programs that are more realistic than just loops that count to ten millions (which are very fast already :).

Now we're going to enjoy a couple of days off to recover from the sprint.

Bästa hälsningar,
Carl Friedrich, fijal