The question I was asked most often during my recent US trip was how exactly the hints work that interpreter authors can use to improve the execution speed of the programs running on their interpreters. Since those hints are not documented very well, I decided to write a series of blog posts about them. This is the first one.
Background
First, let's recap some basics: PyPy's approach to implementing dynamic languages is to write an interpreter for the language in RPython. This interpreter can be translated to C and then further to machine code. The interpreter consists of code in the form of a large number of generated C functions and some data. Similarly, the user program consists of functions in the language the interpreter executes.
As was explained in a blog post and a paper two years ago, PyPy's JIT is a meta-tracer. Since we want to re-use our tracer for a variety of languages, we don't trace the execution of the user program, but instead trace the execution of the interpreter that is running the program. This means that the traces don't contain the bytecodes of the language in question, but RPython-level operations that the interpreter did to execute the program.
On the other hand, the loops that are traced by the tracer are the loops in the user program. This means that the tracer stops tracing after one iteration of the loop in the user function that is being considered. At this point, it can have traced many iterations of the interpreter main loop.
Here's a diagram of this process:
On the left you see the levels of execution. The CPU executes the binary of PyPy's Python interpreter, which consists of RPython functions that have been compiled first to C, then to machine code. Some of these functions contain loops, others don't. The interpreter runs a Python program written by a programmer (the user). If the tracer is used, it traces operations on the level of the interpreter. However, the extent of the trace is determined by the loops in the user program.
How Far Should Tracing Go
When the tracer encounters a function call at the interpreter level, e.g. the interpreter main loop calling a helper function, it can do one of two things:
- it can trace into the helper function, effectively inlining it into the trace.
- it can not trace into the function and instead record a call to it as an operation in the trace. Such a call operation is sometimes called a residual call.
By default, the tracer will try to trace into the helper, because that gives the optimizer more information, allowing it to do a better job. This is particularly important for the allocation removal optimization: if a freshly allocated object is passed as an argument to a residual call, its allocation cannot be optimized away.
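To illustrate, here is a hypothetical RPython-style sketch (the class and function names are made up for this post, not taken from PyPy's source):

    class W_IntObject(object):
        def __init__(self, intval):
            self.intval = intval

    def add_ints(w_a, w_b):
        return W_IntObject(w_a.intval + w_b.intval)

    def interp_add(a, b):
        # Two fresh W_IntObject wrappers are allocated here. If add_ints
        # is traced into, the optimizer sees the allocations and the
        # field reads together, and can remove the wrappers entirely,
        # operating on unboxed integers instead. If add_ints is left as
        # a residual call, the fresh wrappers escape into the call and
        # their allocations must stay.
        w_result = add_ints(W_IntObject(a), W_IntObject(b))
        return w_result.intval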
There is a problem, however, if the helper function itself contains a loop. The tracer records the linear sequence of operations that are being executed. Thus when it encounters a loop on the interpreter level, it records all the operations of every iteration of that loop, with the net effect of unrolling it. The only place where the tracer stops and tries to close the trace is in the main loop of the interpreter. When the tracer encounters the main loop, it also checks whether the original user loop has been closed, and thus whether it can stop tracing.
For most helper functions in the interpreter that contain loops, fully unrolling does not make sense. If a loop is unrolled, the trace is specific to the number of iterations that was seen during tracing. If the trace is later executed with a different number of iterations, it will be left via a guard failure, which is inefficient. Therefore the default behaviour of the tracer is to never trace into a function on the interpreter level that contains a loop, but to trace into all non-looping helper functions.
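To make the problem concrete, here is a hypothetical looping helper together with a schematic of the trace that would be recorded for it (the trace syntax is only illustrative):

    def sum_list(lst):
        total = 0
        for i in range(len(lst)):
            total += lst[i]
        return total

    # Tracing a call like sum_list([1, 2, 3]) records the loop body once
    # per iteration, guarded on the length seen during tracing:
    #
    #   guard(len(lst) == 3)
    #   t1 = 0 + lst[0]
    #   t2 = t1 + lst[1]
    #   t3 = t2 + lst[2]
    #
    # A later call with a list of any other length fails the guard,
    # which is exactly the inefficiency described above.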
This default behaviour is essentially a heuristic, but one that usually makes sense. We want to produce just enough traces to make the resulting code efficient, but not more. Therefore we trace as much as possible (everything, by default) except for the functions containing loops, where tracing would produce code that is less general than it could be.
As an example of a helper with a loop, take string concatenation. It loops over the characters of both arguments and copies them over into the result string. It does not make sense to unroll the loops in this function. If we did, the resulting trace could only be used for strings of the lengths that were seen during tracing. In practice, the string lengths usually differ from run to run, meaning that the trace with unrolling is not run to completion in most cases.
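Such a concatenation helper might look roughly like this (a simplified sketch, not PyPy's actual implementation):

    def str_concat(s1, s2):
        # copy both arguments character by character into the result;
        # unrolling these loops would specialize the trace to the exact
        # lengths of s1 and s2 that were seen during tracing
        chars = ['\x00'] * (len(s1) + len(s2))
        for i in range(len(s1)):
            chars[i] = s1[i]
        for j in range(len(s2)):
            chars[len(s1) + j] = s2[j]
        return ''.join(chars)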
Influencing the Default Behaviour
Sometimes the default behaviour is not actually what is wanted. This is something the interpreter author has to decide, usually by looking at the traces that are produced and deciding that they should be improved. There are two ways in which the default is wrong:
- false negatives: a helper function that does contain a loop should nevertheless be traced into, unrolling the loop.
- false positives: a helper function that does not contain a loop is inlined into the trace, but the interpreter author decides that this is not helpful.
If the interpreter author finds false negatives or false positives, she can fix that by applying a hint to the tracer. These hints take the form of function decorators (which both live in the pypy.rlib.jit module). In the next two subsections I will describe these two function decorators and their use.
Unrolling Functions With Loops
The first decorator, used to fix false negatives, is the unroll_safe decorator. It tells the tracer to always trace into a function that has a loop, effectively unrolling the loop. This decorator should be used only if the loop in the helper function is expected to always run for the same number of iterations. That sounds like a strong restriction, but in practice it is less severe: the number of iterations only needs to be the same in the context from which the helper function is traced.
It is easiest to understand this condition via an example. Let's look at the BUILD_TUPLE bytecode in Python. It takes one argument, the length n of the tuple being built. The bytecode pops n arguments from the stack, turns them into a tuple and pushes that tuple on the stack. Thus the function that implements BUILD_TUPLE in PyPy's Python interpreter calls a helper popvalues which pops n values from the stack and returns them in a list. This helper is implemented with a loop and would thus not be traced into by default. The loop in the helper can run for very different numbers of iterations, because it is used in a variety of places. However, for every concrete BUILD_TUPLE bytecode, the argument will be constant. Therefore it is safe (and even necessary) to annotate popvalues with the unroll_safe decorator.
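A simplified sketch of such a helper with the hint applied (the real method in PyPy's interpreter differs in its details):

    from pypy.rlib.jit import unroll_safe

    @unroll_safe
    def popvalues(self, n):
        # n is a constant for each concrete BUILD_TUPLE bytecode, so the
        # loop runs the same number of iterations in every context this
        # helper is traced from
        values = [None] * n
        for i in range(n - 1, -1, -1):
            values[i] = self.popvalue()
        return values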
A different example is the implementation of the isinstance builtin. It is used to check whether an object a is an instance of a class B like this: isinstance(a, B). The second argument of the function can also be a tuple of classes to check whether an object is an instance of one of a number of classes: isinstance(a, (A, B, C, D)). To implement this second case, the implementation of isinstance contains a loop iterating over the elements of the tuple. The number of loop iterations can vary, but is usually fixed for each individual call site which typically just lists a few classes in the source code. Therefore it is also safe to annotate the implementation of isinstance with the unroll_safe decorator.
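Schematically, the tuple case could look like this (with a hypothetical is_instance_of helper standing in for the real type check):

    from pypy.rlib.jit import unroll_safe

    @unroll_safe
    def isinstance_tuple_case(w_obj, w_classes):
        # the tuple of classes is typically spelled out at the call
        # site, so its length is fixed in each context this is traced
        # from
        for w_cls in w_classes:
            if is_instance_of(w_obj, w_cls):  # hypothetical helper
                return True
        return False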
Preventing the Tracing of Functions
The second decorator dont_look_inside is used to fix false positives. It tells the JIT to never trace into the decorated function and just always produce a residual call instead. This decorator is in many ways less important than the unrolling one (except for a special situation that I will describe in a follow-up post). It is used if tracing into a function is not expected to yield any speed benefits, because the optimizer will not be able to improve it much. This is often the case if the called helper function does not contain any "dynamic" behaviour. In such a situation it is better to just leave the function call in the trace, because that produces less code.
An example would be the import mechanism in Python. It is very unlikely that any performance improvement can be had by turning part of it into assembler. Therefore we hide it from the tracer by annotating its helper functions with dont_look_inside.
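In code, applying the hint is as simple as decorating the relevant helper (the helper below is hypothetical, just to show the decorator's use):

    from pypy.rlib.jit import dont_look_inside

    @dont_look_inside
    def load_source(filename):
        # nothing in here has "dynamic" behaviour that the trace
        # optimizer could exploit, so a single residual call in the
        # trace is the better trade-off
        f = open(filename)
        try:
            return f.read()
        finally:
            f.close()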
Conclusion
In this post we discussed two hints that can be used to control precisely which parts of the interpreter should be meta-traced. If these hints are used carefully, they go a long way toward making the interpreter produce traces that contain exactly the interesting part of the execution, and calls to the functions that cannot be optimized by tracing techniques.
In the next part of this series I will discuss a different set of hints that can be used to strongly optimize traces.
Would it be possible (i.e. is the code amenable) to programmatically randomly sprinkle these decorators around and compare effects on speed (or on measurable trace quality)?
It would make JIT generation a bit more meta :)
Thanks for the very interesting post!
Sorry if the following questions are naive, but your post makes me wonder: if the functions which contain loops with a varying number of iterations are not traced at all, does that mean that no optimization at all is possible for those loops? Also, wouldn't it be possible to detect that there is a loop and produce a special kind of trace in that case which does not duplicate the body of the loop? I guess that if it was possible and useful, you'd have done it, so I guess the real question is: why doesn't this work?
@Victor: yes, there are probably ways to place some of the hints more automatically. However, you will always have to look at the traces and think about how to improve them, so we chose the pragmatic path and didn't do anything magic.
@Gaëtan: those are excellent questions!
Yes, functions in the interpreter with loops that we do not trace are not optimized at all. For most of these functions this is not a problem, e.g. string concatenation does not have much optimization potential anyway. However, there are some functions with loops (like the implementation of the map builtin) that would benefit from tracing, and we don't have a good general solution for that yet.
One of the ideas for a solution is indeed to try to start new traces in the interpreter functions with loops. We did not get around to playing with this yet, as there are not that many cases in the Python interpreter where this leads to a huge benefit.
I'm puzzled now. I fail to see why those loops "do not have much optimization potential". I can understand that it's hard to optimize them because of the trace problem, but I thought they would benefit from optimizations like any other code (e.g. avoiding boxing/unboxing temporary variables), especially since they are within a loop, so any gain is multiplied by the number of iterations.
ReplyDelete@Gaëtan:
is it possible that you are mixing up the two levels involved? The post talked only about functions in the interpreter, not about the functions in pure Python that a user of the interpreter might write. To clarify:
- All loops on the application level, i.e. in the program the user wrote, are traceable and will be traced if they are executed often enough.
- Some loops in the interpreter itself are not. Most of these loops do not do any boxing/unboxing, so they won't benefit from optimization. For some of the loops that would benefit we added some manual hacks to trace them anyway, e.g. for the implementation of "map". Some others still need to be improved, e.g. any, all, zip, ...
Carl, thanks for the post. The information is very helpful.
While I understand that special-casing to override the default tracing/not-tracing rules can help performance, I wonder how well the default heuristics perform. Do you have any ballpark estimate of the performance loss from turning off the special casing? And how many hints (related to whether to trace or unroll) did you have to introduce into PyPy?
Hi Peng,
Thanks :-). No, I didn't really do benchmarks yet; I plan to do so in the future (these blog posts will turn into a paper soonish).
There are about 20-30 unroll_safe hints and about equally many dont_look_inside hints. Some of them are really important, i.e. the speed would be abysmal without them. Most of them are in the bytecode dispatch area; they are cases that e.g. Jython would not have, because in Jython the Python-to-Java compiler takes care of them.
No, I wasn't confusing the two levels involved (if PyPy wasn't optimizing variable-length loops in user-level code, it wouldn't optimize much, I guess).
My point was more theoretical: I guess that, in theory, those loops would benefit from optimizations like any other part of the interpreter. Your answer leads me to believe that *in practice* this isn't an issue, because there are either not that many of them in the interpreter and/or they are not in speed-critical parts, and most of those that are important speed-wise have been taken care of manually in some way or another.
@Gaëtan: yes, that's a good interpretation. At some point we might still think about a more general solution for this problem, to get the remaining rare cases fixed, but for now we have a lot of the common ones covered.
@Gaëtan: untraceable interpreter-level loops don't need to be optimized by the JIT because they are aggressively optimized by the C compiler (remember that RPython is translated to C).