Hi all,
Here are a few words about the JIT's "great speedup in compiling
time" advertised with the PyPy 1.3 release (see the
previous blog post).
The exact meaning behind these words needs a fair bit of
explanation, so here it is in case you are interested.
If you download a version of PyPy 1.3 that includes a JIT
compiler, you get an executable that could be qualified as rather
fat: it actually contains three interpreters. You have on the
one hand the regular Python interpreter. It is here because it's
not possible to JIT-compile every single piece of Python code you
try to run; only the most frequently executed loops are JIT-compiled. They
are JIT-compiled with a tracing interpreter that operates one
level down. This is the second interpreter. This tracing step
is quite slow, but it's all right because it's only invoked on
the most frequently executed loops (on the order of 100 to 1000
times in total in a run of a Python script that in any case takes
seconds or minutes to run).
So apart from the JIT compilation itself, we have two worlds in
which the execution proceeds: either by regular interpretation,
or by the execution of assembler code generated by the JIT
compiler. And of course, we need to be able to switch from one
world to the other quickly: during regular interpretation we have
to detect if we already have generated assembler for this piece
of code and if so, jump to it; and during execution of the
assembler, when a "guard" fails, i.e. when we meet a path of
execution for which we did not produce assembler, then we need to
switch back to regular interpretation (or occasionally invoke the
JIT compiler again).
Let us consider the cost of switching from one world to another.
During regular interpretation, if we detect that we already have
assembler corresponding to this Python loop, then we just jump to
it instead of interpreting the Python loop. This is fairly
cheap, as it involves just one fast extra check per Python loop.
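As a minimal sketch (with hypothetical names and a made-up threshold; the real machinery in PyPy is of course much more involved), the check at a loop header might look like this: if assembler already exists for the loop, jump to it; otherwise interpret normally, counting executions until the loop is hot enough to hand to the tracing JIT.

```python
HOT_THRESHOLD = 1000  # assumption: trace a loop after this many iterations

compiled_loops = {}   # loop position -> "assembler" (here: a plain function)
counters = {}         # loop position -> execution count so far

def maybe_enter_jit(loop_pos, interpret_loop, compile_loop):
    """Called at each Python loop header."""
    asm = compiled_loops.get(loop_pos)
    if asm is not None:
        return asm()                  # the one cheap check, then "jump" to assembler
    counters[loop_pos] = counters.get(loop_pos, 0) + 1
    if counters[loop_pos] >= HOT_THRESHOLD:
        compiled_loops[loop_pos] = compile_loop()   # invoke the tracing JIT
    return interpret_loop()           # fall back to regular interpretation
```

The point of the sketch is that the common cases pay almost nothing: either a dictionary hit followed by a jump, or a counter increment followed by normal interpretation.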
The reverse is harder because "guard" failures can occur at any
point in time: it is possible that the bit of assembler that we
have executed so far corresponds to running the first 4 and a half
Python opcodes of the loop. The guard that failed just now
is somewhere in the middle of interpreting that opcode -- say,
multiplying these two Python objects.
It's almost impossible to just "jump" to the right place in the
code of the regular interpreter -- how do you jump inside a
regular function compiled in C, itself in a call chain, resuming
execution of the function from somewhere in the middle?
So here is the important new bit in PyPy 1.3. Previously, what
we would do is invoke the JIT compiler again in order to follow
what needs to happen between the guard failure and the real end
of the Python opcode. We would then throw away the trace
generated, as the only purpose was to finish running the current
opcode. We call this "blackhole interpretation". After the end
of the Python opcode, we can jump to the regular interpreter
easily.
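In pseudocode (with a hypothetical trace format and operation names, not PyPy's actual one), finishing the current opcode after a guard failure amounts to this: execute the operations from the failing guard onward, without recording anything, until a safe point at the end of the opcode is reached.

```python
def blackhole_finish_opcode(trace, start_index, env):
    """Execute trace operations from the failing guard up to the end of
    the current Python opcode, mutating env; nothing is recorded."""
    i = start_index
    while i < len(trace):
        op, *args = trace[i]
        if op == "end_of_opcode":
            return i          # safe point: hand control back to the
                              # regular Python interpreter here
        elif op == "int_mul":
            dst, a, b = args
            env[dst] = env[a] * env[b]
        elif op == "int_add":
            dst, a, b = args
            env[dst] = env[a] + env[b]
        i += 1
    return i
```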
Doing so was straightforward, but slow in cases where it needs to be
done very often (as is the case in some examples, but not all).
In PyPy 1.3, this blackhole interpretation step has been
redesigned as a time-critical component, and that's where the
third interpreter comes from. It is an interpreter that works
like the JIT compiler, but without the overhead of tracing (e.g.
it does not need to box all values). It was designed from the
ground up for the sole purpose of finishing the execution of the
current Python opcode. The bytecode format that it interprets is
also new, designed for that purpose, and the JIT compiler itself
(the second interpreter) was adapted to it.
The old bytecode format in PyPy 1.2 is gone
(it was more suited for the JIT compiler, but less for blackhole
interpretation).
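To illustrate the "does not need to box all values" point, here is a hypothetical contrast (not PyPy's actual code): the tracing interpreter wraps each value in a box object so that every operation performed on it can be recorded into the trace, while the blackhole interpreter can run the same operation directly on plain unboxed values.

```python
class BoxInt:
    """Tracing-time wrapper: carries a value and records each
    operation performed on it into the trace."""
    def __init__(self, value, record):
        self.value = value
        self.record = record
    def mul(self, other):
        self.record.append(("int_mul", self.value, other.value))
        return BoxInt(self.value * other.value, self.record)

def trace_mul(x, y):
    record = []
    res = BoxInt(x, record).mul(BoxInt(y, record))
    return res.value, record    # the result plus the recorded trace

def blackhole_mul(x, y):
    return x * y                # no boxes, no recording: just run it
```

The allocation and bookkeeping done by `BoxInt` is exactly the overhead that the redesigned blackhole interpreter avoids.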
In summary, this meant a lot of changes in the most front-end-ish
parts of the JIT compiler, even though they were mostly hidden
changes. I hope that this longish blog post helped bring them a
bit more into the light :-)