Recently, the jit-unroll-loops branch was merged. It implements the
idea described in
Using Escape Analysis Across Loop Boundaries for Specialization.
That post does only talk about virtuals, but the idea turned out
to be more far reaching. After the metainterpreter produces a trace,
several optimizations are applied to the trace before it is turned
into binary code. Removing allocations is only one of them. There are also
for instance
- Heap optimizations that removes memory accesses by reusing results
previously read from or written to the same location.
- Reusing of the results of pure operations if the same pure
operation is executed twice.
- Removal of redundant guards.
- ...
A lot of these optimizations are in one way or another removing
operations form the trace and/or reusing previous results. All of these
optimizations could benefit from being able to operate across loop
boundaries. Not only in the sense that operations operating on loop
invariants could be moved out of the loop entirely. But also that
results produced at the end of an iteration could be reused at the
beginning of the next even if there are no loop invariants involved.
This is achieved by unrolling the trace into two iterations, and
letting the optimizer work on this two-iteration-trace.
The optimizer will now be able to optimize the second iteration more than the
first since it can reuse results from the first iteration. The
optimized version of the first iteration we call the preamble and the
optimized version of the second iteration we call the loop. The
preamble will end with a jump to the loop, while the loop will end
with a jump to itself. This means that the preamble will be executed
once for the first iteration, the loop will be executed for all following
iterations.
Sqrt example
Here is an example of a Python implementation of sqrt using a fairly
simple algorithm
def sqrt(y, n=10000):
x = y / 2
while n > 0:
n -= 1
x = (x + y/x) / 2
return x
If it is called with sqrt(1234.0),
a fairly long trace is produced. From this trace
the optimizer creates
the
following preamble (Loop 1) and loop (Loop 0)
Looking at the preamble, it starts by making sure that it is not
currently being profiled, the guard
on i5, and that the function object have not been changed
since the trace was made, the guard on p3. Somewhat
intermixed with that, the
integer variable n is unboxed, by making sure p11
points to an integer object and reading out the integer value from
that object.
These operations are not needed in the
loop (and have been removed from it) as emitting the same guards again
would be redundant and n becomes a virtual before the
end of the preamble.
guard_value(i5, 0, descr=<Guard6>)
guard_nonnull_class(p11, ConstClass(W_IntObject), descr=<Guard7>)
guard_value(p3, ConstPtr(ptr15), descr=<Guard8>)
i16 = getfield_gc_pure(p11, descr=<W_IntObject.inst_intval>)
Next comes a test and a guard implementing the while statement
followed by the decrementing of
n. These operation appear
both in the preamble and in the loop
i18 = int_gt(i16, 0)
guard_true(i18, descr=<Guard9>)
i20 = int_sub(i16, 1)
After that the two floating point variables
x and
y
are unboxed. Again this is only needed in the preamble. Note how the
unboxed value of
y, called
f23, is passed unchanged
from the preamble to the loop in arguments of the jump
to allow it to be reused. It will not become a virtual
since it is never changed within the loop.
guard_nonnull_class(p12, 17652552, descr=<Guard10>)
guard_nonnull_class(p10, 17652552, descr=<Guard11>)
f23 = getfield_gc_pure(p10, descr=<W_FloatObject.inst_floatval>)
f24 = getfield_gc_pure(p12, descr=<W_FloatObject.inst_floatval>)
Following that is the actual calculations performed in the loop in
form of floating point operations (since the function was called with
a float argument). These appear in both the loop
and the preamble.
i26 = float_eq(f24, 0.000000)
guard_false(i26, descr=<Guard12>)
f27 = float_truediv(f23, f24)
f28 = float_add(f24, f27)
f30 = float_truediv(f28, 2.000000)
Finally there are some tests checking if a signal was received
(such as when the user presses ctrl-C) and thus should execute some
signal handler or if we need to hand over to another thread. This is
implemented with a counter that is decreased once every iteration. It
will go below zero after some specific number of iterations, tunable by
sys.setcheckinterval. The counter is read from and written to
some global location where it also can be made negative by a C-level
signal handler.
i32 = getfield_raw(32479328, descr=<pypysig_long_struct.c_value>)
i34 = int_sub(i32, 2)
setfield_raw(32479328, i34, descr=<pypysig_long_struct.c_value>)
i36 = int_lt(i34, 0)
guard_false(i36, descr=<Guard13>)
jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=<Loop0>)
Bridges
When a guard fails often enough, the meta-interpreter is started again
to produce a new trace starting at the failing guard. The tracing is
continued until a previously compiled loop is entered. This could
either be the the same loop that contains the failing guard
or some completely different loop. If it is the same loop, executing
the preamble again maybe be unnecessary.
It is preferable to end the bridge with a jump directly to
the loop. To achieve this the optimizer tries to produce
short
preambles that are inlined at the end of bridges allowing
them to jump directly to the loop. Inlining is better than jumping to
a common preamble because most of the inlined short preamble can
typically be removed again by the optimizer.
Creating such a short
preamble is however not always possible. Bridges jumping to loops for which
no short preamble can be generated have to end with a jump to the
full preamble instead.
The short preamble is created by comparing the operations in the
preamble with the operations in the loop. The
operations that are in the preamble but not in the loop
are moved to the short preamble whenever it is safe to move them to
the front of the operations remaining. In other words, the full preamble
is equivalent to the short preamble followed by one iteration of the
loop.
This much has currently been implemented. To give the full picture
here, there are two more features that
hopefully will be implemented in the near future.
The first is to replace the full preamble, used by the interpreter
when it reaches a compiled loop, with the short preamble.
This is currently not done and is probably not as straight forward as
it might first seem. The problem is where to resume interpreting on a
guard failure. However, implementing that should save some
memory. Not only
because the preamble will become smaller, but mainly because the
guards will appear either in the loop or in the preamble, but not
in both (as they do now). That means there will only be a single bridge and
not potentially two copies once the guards are traced.
The sqrt example above would with a short preamble result in a trace
like this
If it is executed long enough, the last guard will be traced to form a
bridge. The trace will inherit the virtuals from its parent. This can
be used to optimize away the part of the inlined short preamble
that deals with virtuals. The resulting bridge should look
something like
[p0, p1, p2, p3, p4, f5, i6]
i7 = force_token()
setfield_gc(p1, i7, descr=<PyFrame.vable_token>)
call_may_force(ConstClass(action_dispatcher), p0, p1, descr=<VoidCallDescr>)
guard_not_forced(, descr=<Guard19>)
guard_no_exception(, descr=<Guard20>)
guard_nonnull_class(p4, 17674024, descr=<Guard21>)
f52 = getfield_gc_pure(p4, descr=<W_FloatObject.inst_floatval>)
jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=<Loop0>)
Here the first paragraph comes from the traced bridge and the second
is what remains of the short preamble after optimization. The
box
p4 is
not a virtual (it contains a pointer to
y which is never
changed), and it is only virtuals
that the bridge inherit from it's parents. This is why the last two
operations currently cannot be removed.
Each time the short preamble is inlined, a new copy of each of the
guards in it is generated. Typically the short preamble is inlined in
several places and thus there will be several copies of each of those
guards.
If they fail often enough bridges
from them will be traced (as with all guards). But since there
typically are several copies of each guard the same bridge
will be generated in
several places. To prevent this, mini-bridges from the inlined guards
are produced already during the inlining. These mini-bridges contain
nothing but a jump to the preamble.
The mini-bridges needs the arguments of the preamble to be able
to jump to it. These arguments contain among other things, boxed
versions of the
variables x and y. Those variables are virtuals in
the loop, and have to be allocated. Currently those allocations
are placed in front of the inlined guard. Moving those allocations into
the mini-bridges is the second feature that
hopefully will be implemented in the near future.
After this feature is
implemented, the result should look something like
Multiple specialized versions
Floating point operations were generated in the trace above
because
sqrt was called with a float argument. If it is
instead called with an int argument, integer operations will be generated. The
somewhat more complex situations is when both int's and float's are
used as arguments. Then the jit need to generate multiple versions of
the same loop, specialized in different ways. The details, given
below, on how this is achieved is somewhat involved. For the casual
reader it would make perfect sense to skip to the next section here.
Consider the case when sqrt is first called with a float
argument (but with n small enough not to generate the
bridge). Then the trace shown above will be
generated. If sqrt is now called with an int argument, the
guard in the preamble testing that the type of the input object is float
will fail:
guard_nonnull_class(p12, 17652552, descr=<Guard10>)
It will fail every iteration, so soon enough a bridge will be
generated from this guard in the preamble. This guard will end with a
jump to the same loop, and the optimizer will try to inline
the short preamble at the end of it. This will however fail
since now there are two guards on
p12. One that makes sure it
is an int and and one that makes sure it is a float. The optimizer
will detect that the second guard will always fail and mark the bridge
as invalid. Invalid loops are not passed on to the backend for
compilation.
If a loop is detected to be invalid while inlining the short preamble,
the metainterpreter will continue to trace for yet another
iteration of the loop. This new trace can be compiled as above and
will produce a new loop with a new preamble that are now specialized
for int arguments instead of float arguments. The bridge that
previously became invalid will now be tried again. This time inlining
the short preamble of the new loop instead. This will produce a set of
traces connected like this
(click for some hairy details)
The height of the boxes is this figure represents how many instructions
they contain (presuming the missing features from the previous section
are implemented). Loop 0 is specialized for floats and it's preamble have
been split into two boxes at the failing guard. Loop 2 is specialized
for ints and is larger than Loop 0. This is mainly because the integer
division in python does not map to the integer division of the
machine, but have to be implemented with several instructions (integer
division in python truncates its result towards minus
infinity, while the the machine integer division truncates towards
0). Also the height of the bridge is about the same as the height of
Loop 2. This is because it contains a full iteration of the loop.
A More Advanced Example
Let's conclude with an example that is a bit more advanced, where this unrolling
approach actually outperforms the previous approach. Consider
making a
fixed-point
implementation of the square root using 16 bit's of decimals. This can be
done using the same implementation
of
sqrt but calling it with an object of a class representing
such fixed-point real numbers:
class Fix16(object):
def __init__(self, val, scale=True):
if isinstance(val, Fix16):
self.val = val.val
else:
if scale:
self.val = int(val * 2**16)
else:
self.val = val
def __add__(self, other):
return Fix16(self.val + Fix16(other).val, False)
def __sub__(self, other):
return Fix16(self.val - Fix16(other).val, False)
def __mul__(self, other):
return Fix16((self.val >> 8) * (Fix16(other).val >> 8), False)
def __div__(self, other):
return Fix16((self.val << 16) / Fix16(other).val, False)
Below is a table comparing the runtime of the sqrt function above with
different argument types on different python interpreters. Pypy 1.4.1
was released before the optimizations described in this post were in place
while they are in place in the
nightly
build from January 5,
denoted pypy in the table. There are also the running time for the same
algorithms implemented in C and compiled with "gcc -O3
-march=native". Tests were executed on a 2.53GHz Intel Core2
processor with n=100000000 iterations.
Comparing the integer versions with C may be considered a
bit unfair because of the more advanced integer division operator in
python. The left part of this table shows runtimes of sqrt in
a program containing a single call to sqrt (i.e. only a single
specialized version of the loop is needed). The right part shows the
runtime of sqrt when it has been called with a different
type of argument before.
| First call | | Second call |
| float | int | Fix16 | |
float | int | Fix16 |
cpython |
28.18 s |
22.13 s |
779.04 s |
|
28.07 s |
22.21 s |
767.03 s |
pypy 1.4.1 |
1.20 s |
6.49 s |
11.31 s |
|
1.20 s |
6.54 s |
11.23 s |
pypy |
1.20 s |
6.44 s |
6.78 s |
|
1.19 s |
6.26 s |
6.79 s |
gcc |
1.15 s |
1.82 s |
1.89 s |
|
1.15 s |
1.82 s |
1.89 s |
For this to work in the last case, when Fix16 is the argument type in
the second type,
the trace_limit had to be increased from its default value to prevent
the metainterpreter from aborting while tracing the second version of
the loop. Also sys.setcheckinterval(1000000) were used to prevent the
bridge from being generated. With the bridge the performance of the
last case is significantly worse. Maybe because the optimizer currently
fails to generate a short preamble for it. But the slowdown
seems too big for that to be the only explanation. Below are the runtimes
numbers with checkinterval set to its default value of 100:
| First call | | Second call |
| float | int | Fix16 | |
float | int | Fix16 |
cpython |
28.71 s |
22.09 s |
781.86 s |
|
28.28 s |
21.92 s |
761.59 s |
pypy 1.4.1 |
1.21 s |
6.48 s |
11.22 s |
|
1.72 s |
7.58 s |
12.18 s |
pypy |
1.21 s |
6.27 s |
7.22 s |
|
1.20 s |
6.29 s |
90.47 s |
Conclusions
Even though we are seeing speedups in a variety of different small
benchmarks, more complicated examples are not affected much by these
optimizations. It might partly be because larger examples have longer
and more complicated loops, and thus allowing optimizations to operate
across loop boundary will have a smaller relative effect. Another problem is
that with more complicated examples there will be more bridges, and bridges
are currently not handled very well (most of the time all virtuals are
forced at the end of the bridge as explained above). But moving those
forcings into the mini bridges should fix that.