This is part 2 of a series on how to speed up an interpreter written with PyPy by adding JIT hints to the interpreter. Part 1 described how to control the extent of tracing. In this post I will describe how to add hints that influence the optimizer. If applied correctly these techniques can give really big speedups by pre-computing parts of what happens at runtime. On the other hand, if applied incorrectly they might lead to code bloat, thus making the resulting program actually slower.
Background
Before the trace is sent to the backend to produce actual machine code, it is optimized. The optimizer applies a number of techniques to remove operations or reduce their number: most of these are well-known compiler optimization techniques, with the difference that they are easier to apply in a tracing JIT because it only has to deal with linear traces. Among the techniques:
- constant folding
- common subexpression elimination
- allocation removal, as described in the paper that I recently presented at PEPM
- store/load propagation
- loop invariant code motion
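To see why linear traces make such optimizations straightforward, here is a toy Python sketch (not PyPy's actual optimizer; the constant_fold function and the tuple-based trace representation are made up for illustration) that constant-folds a trace in a single forward pass:

def constant_fold(trace, constants):
    # trace: list of (result, op, left, right) operations;
    # constants maps variable names to statically known values
    folded = []
    for result, op, left, right in trace:
        lval = constants.get(left, left)      # replace a variable by its value if known
        rval = constants.get(right, right)
        if isinstance(lval, int) and isinstance(rval, int):
            constants[result] = lval * rval if op == "mul" else lval + rval
        else:
            folded.append((result, op, lval, rval))
    return folded

# folding "v1 = x * 2; z = v1 + 1" once x is known to be 4:
print(constant_fold([("v1", "mul", "x", 2), ("z", "add", "v1", 1)], {"x": 4}))
# prints [] -- both operations disappear, and z is now known to be 9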
It turns out that in some places, if the interpreter author rewrites parts of the interpreter with these optimizations in mind, the traces produced by the optimizer can be vastly improved.
In this post I will describe two hints that allow the interpreter author to increase the optimization opportunities for constant folding. For constant folding to work, two conditions need to be met:
- the arguments of an operation actually need to all be constant, i.e. statically known by the optimizer
- the operation needs to be pure, i.e. always yield the same result given the same arguments.
The PyPy JIT generator automatically detects the majority of these conditions. However, for the cases in which the automatic detection does not work, the interpreter author can apply hints to improve the optimization opportunities. There is one kind of hint for each of the two conditions above.
Note: These hints are written by an interpreter developer and applied to the RPython source of the interpreter. Normal Python users will never see them.
Where Do All the Constants Come From
It is worth clarifying what a "constant" is in this context. A variable of the trace is said to be constant if its value is statically known by the optimizer.
The simplest example of constants are literal values. For example, if in the RPython source code we have a line like y = x + 1, the second operand will be a constant in the trace.
However, the optimizer can statically know the value of a variable even if it is not a constant in the original source code. For example, consider the following fragment of RPython code:
if x == 4:
    y = y + x
If the fragment is traced with x being 4, the following trace is produced:
guard(x == 4)
y = y + x
In the trace above, the value of x is statically known thanks to the guard. Remember that a guard is a runtime check. The above trace will run to completion when x == 4. If the check fails, execution of the trace is stopped and the interpreter continues to run.
There are cases in which it is useful to turn an arbitrary variable into a constant value. This process is called promotion and it is an old idea in partial evaluation (it's called "the trick" there). Promotion is also heavily used by Psyco and by all older versions of PyPy's JIT. Promotion is a technique that only works well in JIT compilers, in static compilers it is significantly less applicable.
Promotion is essentially a tool for trace specialization. In some places in the interpreter it would be very useful if a variable were constant, even though it could have different values in practice. In such a place, promotion is used. The typical reason to do that is if there is a lot of computation depending on the value of that variable.
Let's make this more concrete. If we trace a call to the following function:
def f1(x, y):
    z = x * 2 + 1
    return z + y
We get a trace that looks like this:
v1 = x * 2
z = v1 + 1
v2 = z + y
return(v2)
Observe how the first two operations could be constant-folded if the value of x were known. Let's assume that the value of x can vary, but does so rarely, i.e. only takes a few different values at runtime. If this is the case, we can add a hint to promote x, like this:
def f2(x, y):
    x = hint(x, promote=True)
    z = x * 2 + 1
    return z + y
The meaning of this hint is that the tracer should pretend that x is a constant in the code that follows. When the code is simply run, the hint function has no effect, as it just returns its first argument. When tracing, some extra work is done. Let's assume that this changed function is traced with the arguments 4 and 8. The trace will be the same, except for one operation at the beginning:
guard(x == 4)
v1 = x * 2
z = v1 + 1
v2 = z + y
return(v2)
The promotion is turned into a guard operation in the trace. The guard captures the value of x as it was at runtime. From the point of view of the optimizer, this guard is not any different than the one produced by the if statement in the example above. After the guard, the rest of the trace can assume that x is equal to 4, meaning that the optimizer will turn this trace into:
guard(x == 4)
v2 = 9 + y
return(v2)
Notice how the first two arithmetic operations were constant-folded. The hope is that the guard executes more quickly than the multiplication and the addition that were optimized away.
If this trace is executed with values of x other than 4, the guard will fail, and execution will continue in the interpreter. If the guard fails often enough, a new trace will be started from the guard. This other trace will capture a different value of x. If it is e.g. 2, then the optimized trace looks like this:
guard(x == 2)
v2 = 5 + y
return(v2)
This new trace will be attached to the guard instruction of the first trace. If x takes on even more values, a new trace will eventually be made for all of them, linking them into a chain. This is clearly not desirable, so we should promote only variables that don't vary much. However, adding a promotion hint will never produce wrong results. It might just lead to too much assembler code.
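To make this chaining concrete, here is a rough Python sketch of how the chained traces for f2 behave conceptually. This is illustrative only, not actual generated code, and fall_back_to_interpreter is a made-up stand-in:

def fall_back_to_interpreter(x, y):
    # stand-in for leaving machine code and continuing in the interpreter
    return (x * 2 + 1) + y

def f2_chained(x, y):
    if x == 4:                  # guard of the first optimized trace
        return 9 + y
    elif x == 2:                # guard of the bridge attached to the failing guard
        return 5 + y
    else:
        return fall_back_to_interpreter(x, y)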
Promoting integers, as in the examples above, is not used that often. However, the internals of dynamic language interpreters often have values that are variable but vary little in the context of parts of a user program. An example would be the types of variables in a user function. Even though in principle the argument to a Python function could be of any Python type, in practice the argument types tend not to vary much. Therefore it is possible to promote the types. In the next blog post I will give a complete example of how this works.
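In the meantime, here is a rough, hypothetical sketch of what promoting a type inside an interpreter could look like. W_Type, W_Object, call_method and the stand-in hint function are all made-up names for illustration, not actual PyPy code:

def hint(x, **flags):
    # stand-in: when the code is just run, the hint simply returns its first argument
    return x

class W_Type(object):               # made-up wrapper for a user-level type
    def __init__(self, methods):
        self.methods = methods

class W_Object(object):             # made-up wrapper for a user-level object
    def __init__(self, cls, value):
        self.cls = cls
        self.value = value

def call_method(w_obj, name, w_arg):
    # the type of the receiver rarely varies at a given call site, so promote it
    cls = hint(w_obj.cls, promote=True)
    # once cls is a constant, this lookup could also be folded away
    # (it would additionally need to be pure, as discussed in the next section)
    method = cls.methods[name]
    return method(w_obj, w_arg)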
Declaring New Pure Operations
In the last section we saw a way to turn arbitrary variables into constants. All pure operations on these constants can be constant-folded. This works great for constant folding of simple types, e.g. integers. Unfortunately, in the context of an interpreter for a dynamic language, most operations actually manipulate objects, not simple types. The operations on objects are often not pure and might even have side effects. If a field is read out of a constant reference to an object, the read cannot necessarily be folded away, because the object can be mutated. Therefore, another hint is needed.
As an example, take the following class:
class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def f(self, val):
        self.y = self.compute() + val

    def compute(self):
        return self.x * 2 + 1
Tracing the call a.f(10) on some instance a of A yields the following trace (note how the call to compute is inlined):
x = a.x
v1 = x * 2
v2 = v1 + 1
v3 = v2 + val
a.y = v3
In this case, adding a promote of self in the f method to get rid of the computation of the first few operations does not help. Even if a is a constant reference to an object, reading the x field does not necessarily always yield the same value. To solve this problem, there is another annotation, which lets the interpreter author communicate invariants to the optimizer. In this case, she could decide that the x field of instances of A is immutable, and therefore compute is a pure function. To communicate this, there is a purefunction decorator. If the code in compute should be constant-folded away, we would change the class as follows:
class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def f(self, val):
        self = hint(self, promote=True)
        self.y = self.compute() + val

    @purefunction
    def compute(self):
        return self.x * 2 + 1
Now the trace will look like this:
guard(a == 0xb73984a8)
v1 = compute(a)
v2 = v1 + val
a.y = v2
Here, 0xb73984a8 is the address of the instance of A that was used during tracing. The call to compute is not inlined, so that the optimizer has a chance to see it. Since the compute function is marked as pure, and its argument is a constant reference, the call will be removed by the optimizer. The final trace looks like this:
guard(a == 0xb73984a8)
v2 = 9 + val
a.y = v2
(assuming that the x field's value is 4).
On the one hand, the purefunction annotation is very powerful. It can be used to constant-fold arbitrary parts of the computation in the interpreter. However, the annotation also gives you ample opportunity to mess things up. If a function is annotated to be pure, but is not really, the optimizer can produce subtly wrong code. Therefore, a lot of care has to be taken when using this annotation.
Observably Pure Functions
Why can't we simply write an analysis that finds out that the x field of A instances is immutable, and deduces that compute is a pure function, since it only reads the x field and has no side effects? This might be possible in this particular case, but in practice the functions that are annotated with the purefunction decorator are usually more complex. The easiest example is a function that uses memoization to cache its results. If you analyze such a function, it looks as if it has side effects, because it changes the memoizing dictionary. However, because this side effect is not externally visible, the function is pure from the outside. This is a property that is not easily detectable by analysis. Therefore, the purity of such functions needs to be annotated.
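As a small sketch of this memoization case (the function itself is made up, and purefunction is stubbed out so the example runs as plain Python):

def purefunction(func):     # stand-in for the real decorator
    return func

_cache = {}

@purefunction
def parse_version(s):
    # externally pure: equal inputs always yield equal results, but an automatic
    # analysis would only see the side effect of writing to the _cache dictionary
    if s not in _cache:
        _cache[s] = tuple([int(part) for part in s.split(".")])
    return _cache[s]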
Immutable Fields
One of the most common cases of pure functions is reading immutable values out of objects. Since this is so common, we have special syntactic sugar for it. An RPython class can have a class attribute _immutable_fields_ set to a list of strings, listing the fields that cannot be changed. This is equivalent to using getters and annotating them with purefunction.
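For example, the class from the previous section could drop the decorator and instead declare the field immutable (a sketch based on the example above):

class A(object):
    _immutable_fields_ = ["x"]      # x is never modified after __init__

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def compute(self):
        # reads of self.x can now be constant-folded when self is a constant
        return self.x * 2 + 1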
Conclusion
In this blog post I explained two more hints that can be used in the source code of the interpreter. They are used to influence what the optimizer does with the trace. I realize the examples given here are a bit too small; in the next installment I will give a worked-out example that puts all the pieces together.
Comments
Again a very interesting post. I would like some clarification on one sentence:
"If x takes on even more values, a new trace will eventually be made for all of them, linking them into a chain."
Does it mean they are all tried in sequence, or is there some dispatch mechanism? If there isn't, wouldn't it be beneficial to have one in place (probably using a hash table of some sort) when there are more than a few values? Or is the number of "generated branches" never supposed to be large enough to make such an approach worthwhile?
@Gaëtan:
Right now it's always just a linear search, which is clearly not ideal, and we might very well fix this in the future. Currently we have the hope that in practice the number of values is always small, but we never measured it.