Hi all,
Here are a few words about the JIT's "great speedup in compiling time" advertized on the PyPy 1.3 release (see the previous blog post). The exact meaning behind these words needs a fair bit of explanation, so here it is in case you are interested.
If you download a version of PyPy 1.3 that includes a JIT compiler, you get an executable that could be qualified as rather fat: it actually contains three interpreters. You have on the one hand the regular Python interpreter. It is here because it's not possible to JIT-compile every single piece of Python code you try to run; only the most executed loops are JIT-compiled. They are JIT-compiled with a tracing interpreter that operates one level down. This is the second interpreter. This tracing step is quite slow, but it's all right because it's only invoked on the most executed loops (on the order of 100 to 1000 times in total in a run of a Python script that takes anyway seconds or minutes to run).
So apart from the JIT compilation itself, we have two worlds in which the execution proceeds: either by regular interpretation, or by the execution of assembler code generated by the JIT compiler. And of course, we need to be able to switch from one world to the other quickly: during regular interpretation we have to detect if we already have generated assembler for this piece of code and if so, jump to it; and during execution of the assembler, when a "guard" fails, i.e. when we meet a path of execution for which we did not produce assembler, then we need to switch back to regular interpretation (or occasionally invoke the JIT compiler again).
Let us consider the cost of switching from one world to another. During regular interpretation, if we detect that we already have assembler corresponding to this Python loop, then we just jump to it instead of interpreting the Python loop. This is fairly cheap, as it involves just one fast extra check per Python loop. The reverse is harder because "guard" failures can occur at any point in time: it is possible that the bit of assembler that we already executed so far corresponds to running the first 4 Python opcodes of the loop and a half. The guard that failed just now is somewhere in the middle of interpreting that opcode -- say, multiplying these two Python objects.
It's almost impossible to just "jump" at the right place in the code of the regular interpreter -- how do you jump inside a regular function compiled in C, itself in a call chain, resuming execution of the function from somewhere in the middle?
So here is the important new bit in PyPy 1.3. Previously, what we would do is invoke the JIT compiler again in order to follow what needs to happen between the guard failure and the real end of the Python opcode. We would then throw away the trace generated, as the only purpose was to finish running the current opcode. We call this "blackhole interpretation". After the end of the Python opcode, we can jump to the regular interpreter easily.
Doing so was straightforward, but slow, in case it needs to be done very often (as in the case in some examples, but not all). In PyPy 1.3, this blackhole interpretation step has been redesigned as a time-critical component, and that's where the third interpreter comes from. It is an interpreter that works like the JIT compiler, but without the overhead of tracing (e.g. it does not need to box all values). It was designed from the ground up for the sole purpose of finishing the execution of the current Python opcode. The bytecode format that it interprets is also new, designed for that purpose, and the JIT compiler itself (the second interpreter) was adapted to it. The old bytecode format in PyPy 1.2 is gone (it was more suited for the JIT compiler, but less for blackhole interpretation).
In summary, it was a lot of changes in the most front-end-ish parts of the JIT compiler, even though it was mostly hidden changes. I hope that this longish blog post helped bring it a bit more to the light :-)