Friday, November 7, 2008

Porting the JIT to CLI (part 2)

In my previous post, we saw that PyPy JIT generator can produce huge speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of PyPy JIT, to see how it manages to do so. Note that this is a very high level overview of how the JIT works, and applies to all backends. Then, in the third post of this series, we will look closer at the CLI JIT backend, seeing how it works around some .NET limitations and how the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a JIT of our own: so, instead of passing nights writing a JIT, we pass years coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job, (perhaps this will be the subject of another blog post), but how the generated JIT works.

There are values that, if known at compile-time (i.e., when the JIT compiler runs), let the JIT to produce very efficient code. In a dynamic language, types are the primary example: for instance, suppose you are a compiler and you have to compile to following Python function:

def mysum(a):
  return a + 1

At compile time, you don't have any knowledge about the type of the parameter: it could be integer, float, an user defined object, etc. In this situation, the only safe choice is to emit code which does the usual, slow, full lookup to know how to perform the operations.

On the other hand, suppose that you knew in advance that the parameter is an integer: this time, you could emit code that exploits this extra knowledge, by performing directly a fast integer addition.

The idea behind PyPy JIT is that if you don't have enough knowledge to generate efficient code, you stop compiling and wait until you know exactly what you need. Concretely, you emit code that runs until the point where you stopped the compilation, then it triggers a special procedure that restarts the compiler. This time the JIT compiler knows everything you need, because you can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces something like this pseudo-code:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      default: continue_compilation(a_type, <position>);
  }
}

If you call mysum(41), the execution goes in the default branch of the switch, thus calling continue_compilation: its job is to restart the JIT compiler, which now can emit fast code because it knows the exact type of a; then, it modifies the original mysum_compiled function, in order to make it executing the newly generated code the next time it encounters an integer at that point:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      PyInteger: return new PyInteger(a.value+1); // fast path!
      default: continue_compilation(a_type, <position>);
  }
}

From now on, every time we call mysum with an integer argument, the JIT compiler is not called anymore and the fast path is directly executed; if we happen to call mysum with a float arguments, the switch goes again in the default branch, and the JIT compiler is started once more to produce fast code also for this case. What happens in practice is that compile-time and runtime are continuously intermixed, until the switches are stable enough and the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called flexswitch, and it's one of the most important concept of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate efficient code and which aren't? Unfortunately it can't, or at least our JIT generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by annotating the variables on which we want the JIT to stop until it knows the actual values; this is done by using particular hints, called promote and promote_class; variables annotated with such hints are said to be promoted. If something is promoted, a flexswitch is used to gain information about it, as seen in the last section.

For an example, let's look at an excerpt from main dispatch loop of the tlc virtual machine:

elif opcode == ADD:
  a, b = stack.pop(), stack.pop()
  hint(a, promote_class=True)
  hint(b, promote_class=True)
  stack.append(b.add(a))

This the implementation of the ADD opcode: first, it pops two values from the stack; then, it computes the result; finally, it push the result to the stack again. In between, both the classes of a and b have been promoted: this means that when the JIT emits the code for b.add(a), it knows exactly what is happening: if it sees that both are instances of the IntObj class, it inlines the method call and emits a fast integer addition instead.

Virtuals

The other important concept of the JIT is the presence of virtual structures, virtual lists, and virtual dictionaries. Again, I'm not going to explain in depth how they work, but only why they are so important for generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really need to do it, e.g. because they are being passed as an argument to some external function. Instead, we store all the informations we need as local variables; e.g., in the case of a virtual structure, we create as many local variables as the number of its fields: if the structure escapes the local scope, we force it to a real object, by allocating memory on the heap and initializing it after the current value of the local variables.

This technique allows the JIT to avoid the allocation of many temporary objects that hold intermediate results; consider for example the following Python loop:

result = 0
for i in range(N):
  result += i
return result

Without the JIT, at each iteration, a new int object is created and bound to the result variable, while the previous one is discarded and not needed anymore. By combining virtuals and promotion, the JIT can emit code that does the whole computation locally, and allocates a real object only at the end, when it escapes from the local scope because it is returned from the function.

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs emit multiple versions of each chunk of code: each version is specialized and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches) is almost always negligible compared to how much time you save by running the fast version instead of the more-general-but-slow one. Moreover, each specialized version knows the exact shape of the objects it's dealing with, so they can be virtualized to make the generated code even more efficient.

At the end, the actual code generation is done by one of the JIT backends: the backends exploit all the knowledge gathered by the previous steps to produce highly efficient code, but this will be the subject of the next blog post.

6 comments:

Anonymous said...

Wow... I love this approach. Keep up the great work and interesting posts!

Luis said...

This is a very clear and didactic explanation. Thanks!

Anonymous said...

can't wait for the next one

kbob said...

What does the flexswitch compile into? I'm guessing it would look like

t = type(obj);
if (t == int)
...
else if (t == float)
...
else
....

but maybe there's a better way (or maybe the answer is backend-dependent).

Antonio Cuni said...

@bob
your guess is right, how to implement the flexswitch is backend-dependent. This is the hardest part for .NET, as the flexswitch needs to grow dynamically (i.e., you have to add more case after the .NET method has already been compiled). It will be subject of the next blog post.

Unknown said...

It seems that an implication of the JIT way is that, by adopting a consistent habit of implementing type driven Generic Functions, the JIT could accomplish nearly all of the possible optimizations in a single pass. In other words, by definition, each type based variation of a Generic Function call can only be fired when data of that type is provided as a parameter.