Tuesday, January 11, 2011

Loop invariant code motion

Recently, the jit-unroll-loops branch was merged. It implements the idea described in Using Escape Analysis Across Loop Boundaries for Specialization. That post does only talk about virtuals, but the idea turned out to be more far reaching. After the metainterpreter produces a trace, several optimizations are applied to the trace before it is turned into binary code. Removing allocations is only one of them. There are also for instance
  • Heap optimizations that removes memory accesses by reusing results previously read from or written to the same location.
  • Reusing of the results of pure operations if the same pure operation is executed twice.
  • Removal of redundant guards.
  • ...
A lot of these optimizations are in one way or another removing operations form the trace and/or reusing previous results. All of these optimizations could benefit from being able to operate across loop boundaries. Not only in the sense that operations operating on loop invariants could be moved out of the loop entirely. But also that results produced at the end of an iteration could be reused at the beginning of the next even if there are no loop invariants involved.

This is achieved by unrolling the trace into two iterations, and letting the optimizer work on this two-iteration-trace. The optimizer will now be able to optimize the second iteration more than the first since it can reuse results from the first iteration. The optimized version of the first iteration we call the preamble and the optimized version of the second iteration we call the loop. The preamble will end with a jump to the loop, while the loop will end with a jump to itself. This means that the preamble will be executed once for the first iteration, the loop will be executed for all following iterations.

Sqrt example

Here is an example of a Python implementation of sqrt using a fairly simple algorithm

def sqrt(y, n=10000):
    x = y / 2
    while n > 0:
        n -= 1
        x = (x + y/x) / 2
    return x

If it is called with sqrt(1234.0), a fairly long trace is produced. From this trace the optimizer creates the following preamble (Loop 1) and loop (Loop 0)

Looking at the preamble, it starts by making sure that it is not currently being profiled, the guard on i5, and that the function object have not been changed since the trace was made, the guard on p3. Somewhat intermixed with that, the integer variable n is unboxed, by making sure p11 points to an integer object and reading out the integer value from that object. These operations are not needed in the loop (and have been removed from it) as emitting the same guards again would be redundant and n becomes a virtual before the end of the preamble.

        guard_value(i5, 0, descr=<Guard6>) 
        guard_nonnull_class(p11, ConstClass(W_IntObject), descr=<Guard7>) 
        guard_value(p3, ConstPtr(ptr15), descr=<Guard8>) 
        i16 = getfield_gc_pure(p11, descr=<W_IntObject.inst_intval>)
Next comes a test and a guard implementing the while statement followed by the decrementing of n. These operation appear both in the preamble and in the loop
        i18 = int_gt(i16, 0)
        guard_true(i18, descr=<Guard9>) 
        i20 = int_sub(i16, 1)
After that the two floating point variables x and y are unboxed. Again this is only needed in the preamble. Note how the unboxed value of y, called f23, is passed unchanged from the preamble to the loop in arguments of the jump to allow it to be reused. It will not become a virtual since it is never changed within the loop.
        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
        guard_nonnull_class(p10, 17652552, descr=<Guard11>) 
        f23 = getfield_gc_pure(p10, descr=<W_FloatObject.inst_floatval>)
        f24 = getfield_gc_pure(p12, descr=<W_FloatObject.inst_floatval>)
Following that is the actual calculations performed in the loop in form of floating point operations (since the function was called with a float argument). These appear in both the loop and the preamble.
        i26 = float_eq(f24, 0.000000)
        guard_false(i26, descr=<Guard12>) 
        f27 = float_truediv(f23, f24)
        f28 = float_add(f24, f27)
        f30 = float_truediv(f28, 2.000000)
Finally there are some tests checking if a signal was received (such as when the user presses ctrl-C) and thus should execute some signal handler or if we need to hand over to another thread. This is implemented with a counter that is decreased once every iteration. It will go below zero after some specific number of iterations, tunable by sys.setcheckinterval. The counter is read from and written to some global location where it also can be made negative by a C-level signal handler.
        i32 = getfield_raw(32479328, descr=<pypysig_long_struct.c_value>)
        i34 = int_sub(i32, 2)
        setfield_raw(32479328, i34, descr=<pypysig_long_struct.c_value>)
        i36 = int_lt(i34, 0)
        guard_false(i36, descr=<Guard13>) 
        jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=<Loop0>)


When a guard fails often enough, the meta-interpreter is started again to produce a new trace starting at the failing guard. The tracing is continued until a previously compiled loop is entered. This could either be the the same loop that contains the failing guard or some completely different loop. If it is the same loop, executing the preamble again maybe be unnecessary. It is preferable to end the bridge with a jump directly to the loop. To achieve this the optimizer tries to produce short preambles that are inlined at the end of bridges allowing them to jump directly to the loop. Inlining is better than jumping to a common preamble because most of the inlined short preamble can typically be removed again by the optimizer. Creating such a short preamble is however not always possible. Bridges jumping to loops for which no short preamble can be generated have to end with a jump to the full preamble instead.

The short preamble is created by comparing the operations in the preamble with the operations in the loop. The operations that are in the preamble but not in the loop are moved to the short preamble whenever it is safe to move them to the front of the operations remaining. In other words, the full preamble is equivalent to the short preamble followed by one iteration of the loop.

This much has currently been implemented. To give the full picture here, there are two more features that hopefully will be implemented in the near future. The first is to replace the full preamble, used by the interpreter when it reaches a compiled loop, with the short preamble. This is currently not done and is probably not as straight forward as it might first seem. The problem is where to resume interpreting on a guard failure. However, implementing that should save some memory. Not only because the preamble will become smaller, but mainly because the guards will appear either in the loop or in the preamble, but not in both (as they do now). That means there will only be a single bridge and not potentially two copies once the guards are traced.

The sqrt example above would with a short preamble result in a trace like this

If it is executed long enough, the last guard will be traced to form a bridge. The trace will inherit the virtuals from its parent. This can be used to optimize away the part of the inlined short preamble that deals with virtuals. The resulting bridge should look something like
    [p0, p1, p2, p3, p4, f5, i6]
    i7 = force_token()
    setfield_gc(p1, i7, descr=<PyFrame.vable_token>)
    call_may_force(ConstClass(action_dispatcher), p0, p1, descr=<VoidCallDescr>)
    guard_not_forced(, descr=<Guard19>) 
    guard_no_exception(, descr=<Guard20>) 

    guard_nonnull_class(p4, 17674024, descr=<Guard21>) 
    f52 = getfield_gc_pure(p4, descr=<W_FloatObject.inst_floatval>)
    jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=<Loop0>)
Here the first paragraph comes from the traced bridge and the second is what remains of the short preamble after optimization. The box p4 is not a virtual (it contains a pointer to y which is never changed), and it is only virtuals that the bridge inherit from it's parents. This is why the last two operations currently cannot be removed.

Each time the short preamble is inlined, a new copy of each of the guards in it is generated. Typically the short preamble is inlined in several places and thus there will be several copies of each of those guards. If they fail often enough bridges from them will be traced (as with all guards). But since there typically are several copies of each guard the same bridge will be generated in several places. To prevent this, mini-bridges from the inlined guards are produced already during the inlining. These mini-bridges contain nothing but a jump to the preamble.

The mini-bridges needs the arguments of the preamble to be able to jump to it. These arguments contain among other things, boxed versions of the variables x and y. Those variables are virtuals in the loop, and have to be allocated. Currently those allocations are placed in front of the inlined guard. Moving those allocations into the mini-bridges is the second feature that hopefully will be implemented in the near future. After this feature is implemented, the result should look something like

Multiple specialized versions

Floating point operations were generated in the trace above because sqrt was called with a float argument. If it is instead called with an int argument, integer operations will be generated. The somewhat more complex situations is when both int's and float's are used as arguments. Then the jit need to generate multiple versions of the same loop, specialized in different ways. The details, given below, on how this is achieved is somewhat involved. For the casual reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float argument (but with n small enough not to generate the bridge). Then the trace shown above will be generated. If sqrt is now called with an int argument, the guard in the preamble testing that the type of the input object is float will fail:

        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
It will fail every iteration, so soon enough a bridge will be generated from this guard in the preamble. This guard will end with a jump to the same loop, and the optimizer will try to inline the short preamble at the end of it. This will however fail since now there are two guards on p12. One that makes sure it is an int and and one that makes sure it is a float. The optimizer will detect that the second guard will always fail and mark the bridge as invalid. Invalid loops are not passed on to the backend for compilation.

If a loop is detected to be invalid while inlining the short preamble, the metainterpreter will continue to trace for yet another iteration of the loop. This new trace can be compiled as above and will produce a new loop with a new preamble that are now specialized for int arguments instead of float arguments. The bridge that previously became invalid will now be tried again. This time inlining the short preamble of the new loop instead. This will produce a set of traces connected like this

(click for some hairy details)

The height of the boxes is this figure represents how many instructions they contain (presuming the missing features from the previous section are implemented). Loop 0 is specialized for floats and it's preamble have been split into two boxes at the failing guard. Loop 2 is specialized for ints and is larger than Loop 0. This is mainly because the integer division in python does not map to the integer division of the machine, but have to be implemented with several instructions (integer division in python truncates its result towards minus infinity, while the the machine integer division truncates towards 0). Also the height of the bridge is about the same as the height of Loop 2. This is because it contains a full iteration of the loop.

A More Advanced Example

Let's conclude with an example that is a bit more advanced, where this unrolling approach actually outperforms the previous approach. Consider making a fixed-point implementation of the square root using 16 bit's of decimals. This can be done using the same implementation of sqrt but calling it with an object of a class representing such fixed-point real numbers:

class Fix16(object):
    def __init__(self, val, scale=True):
        if isinstance(val, Fix16):
            self.val = val.val
            if scale:
                self.val = int(val * 2**16)
                self.val = val

    def __add__(self, other):
        return  Fix16(self.val + Fix16(other).val, False)

    def __sub__(self, other):
        return  Fix16(self.val - Fix16(other).val, False)

    def __mul__(self, other):
        return  Fix16((self.val >> 8) * (Fix16(other).val >> 8), False)

    def __div__(self, other):
        return  Fix16((self.val << 16) / Fix16(other).val, False)

Below is a table comparing the runtime of the sqrt function above with different argument types on different python interpreters. Pypy 1.4.1 was released before the optimizations described in this post were in place while they are in place in the nightly build from January 5, denoted pypy in the table. There are also the running time for the same algorithms implemented in C and compiled with "gcc -O3 -march=native". Tests were executed on a 2.53GHz Intel Core2 processor with n=100000000 iterations. Comparing the integer versions with C may be considered a bit unfair because of the more advanced integer division operator in python. The left part of this table shows runtimes of sqrt in a program containing a single call to sqrt (i.e. only a single specialized version of the loop is needed). The right part shows the runtime of sqrt when it has been called with a different type of argument before.

First callSecond call
floatintFix16   floatintFix16
cpython 28.18 s 22.13 s 779.04 s 28.07 s 22.21 s 767.03 s
pypy 1.4.1 1.20 s 6.49 s 11.31 s 1.20 s 6.54 s 11.23 s
pypy 1.20 s 6.44 s 6.78 s 1.19 s 6.26 s 6.79 s
gcc 1.15 s 1.82 s 1.89 s 1.15 s 1.82 s 1.89 s

For this to work in the last case, when Fix16 is the argument type in the second type, the trace_limit had to be increased from its default value to prevent the metainterpreter from aborting while tracing the second version of the loop. Also sys.setcheckinterval(1000000) were used to prevent the bridge from being generated. With the bridge the performance of the last case is significantly worse. Maybe because the optimizer currently fails to generate a short preamble for it. But the slowdown seems too big for that to be the only explanation. Below are the runtimes numbers with checkinterval set to its default value of 100:

First callSecond call
floatintFix16   floatintFix16
cpython 28.71 s 22.09 s 781.86 s 28.28 s 21.92 s 761.59 s
pypy 1.4.1 1.21 s 6.48 s 11.22 s 1.72 s 7.58 s 12.18 s
pypy 1.21 s 6.27 s 7.22 s 1.20 s 6.29 s 90.47 s


Even though we are seeing speedups in a variety of different small benchmarks, more complicated examples are not affected much by these optimizations. It might partly be because larger examples have longer and more complicated loops, and thus allowing optimizations to operate across loop boundary will have a smaller relative effect. Another problem is that with more complicated examples there will be more bridges, and bridges are currently not handled very well (most of the time all virtuals are forced at the end of the bridge as explained above). But moving those forcings into the mini bridges should fix that.


Anonymous said...

Great post.

Eric said...

Do you think you could fix the pictures?
I only see black images with a exclamation marks.


Anonymous said...

Something has eaten the images. Please fix, if you can.