Tuesday, November 4, 2008

Porting the JIT to CLI (part 1)

As the readers of this blog already know, I have been working on the CLI JIT backend for some months: last Friday, it reached an important milestone, as it is now able to produce huge speedups for a little dynamic language. To know how huge the speedup is, read on :-).

The goal of PyPy JIT generator is to take an interpreter and, with the help of few annotations, automatically generate a JIT compiler for it. In this post, we will talk about the tlc virtual machine: while tlc it is just a toy language, it contains some features that make it an interesting target for our JIT generator.

The tlc virtual machine

tlc is executed by a stack based, dynamically typed virtual machine (for those who knows a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e. lisp-like pairs of objects).

As the VM is very simple, it provides only few opcodes:

  • opcodes to manipulate the stack, like PUSH, POP, etc.
  • integer operations, like ADD, MUL, all the comparisons, etc.: these operations can only be applied to integers;
  • list operations, like CONS, CAR, CDR: these operations can only be applied to lists;
  • other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities with Python (though on a smaller scale, of course):

  1. it has to do type-checks at runtime before doing most of the operations;
  2. every time you do an arithmetic operation, it has to unbox the operand, do the computation, and the box the result again.

This means that even if you have a program which only uses integers, you are paying a lot of overhead.

To know more about this toy VM, look at its source code: the interesting bits are the classes used to represent objects, and the interp_eval function, which contains the main loop of the virtual machine. As you can see, the implementation is quite straightforward; all the hint calls you see are the special annotations needed by the JIT generator to produce better code.

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, checkout a fresh copy of the oo-jit branch:

$ svn co http://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM with the CLI backend and JIT enabled:

$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlc
...
lot of texts
...

If everything went OK, you now have a targettlc-cli executable, which accepts two arguments: the name of the file containing the tlc program we want to run, and an integer to be passed to it.

Luckily, in the same directory we have a factorial.tlc file that contains the bytecode for a function that -- guess? -- computes the factorial of a given integer; let's try it:

$ ./targettlc-cli factorial.tlc 5
Non jitted:    120 (0.009371 seconds)
Warmup jitted: 120 (0.208954 seconds)
Warmed jitted: 120 (0.000323999999999991 seconds)

Cool, it seems that the result was computed correcly :-). As you can see from the output, we ran the program three times:

  1. by plain interpretation, without any jitting;
  2. with the jit enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code;
  3. again with the jit enabled, but this time the compilation has already been done, so we are actually measuring how good is the code we produced.

So, it's time to run a benchmark: let's try to compute the factorial of a very big number; the result will be 0, because obviously after a while we overflow, but after all we are interested in the time spent, not in the result:

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (19.93247 seconds)
Warmup jitted: 0 (0.293229999999998 seconds)
Warmed jitted: 0 (0.0494239999999984 seconds)

$ python -c 'print 19.93247/0.0494239999999984'
403.295362577

And no, I didn't make any mistake in copying&pasting: the jitted version is really 400 times faster that the non jitted one!

Warning: my laptop seems to be not very well suited for benchmarks, as the results vary a lot from run to run; I've run the benchmarks a lot of times, and I got speedup factors up to 500 times, so your results may be different.

More benchmarks

It's also interesting to compare the result with a manual written C# version of the factorial, to see how good is code we produced; to get reasonable results, we need to compute a larger factorial, to let to code to run a bit more:

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0.980856 seconds)
Warmed jitted: 0 (0.769716 seconds)

$ mono factorial.exe 100000000
C#:            0 (0.153777 seconds)

$ python -c 'print 0.769716/0.153777'
5.00540392907

We know that the generated code is far from being optimal, but probably the factor of five is at least partially due to the fact that Mono's own JIT is optimized for C#-like code, and our code has a completely different shape.

All the benchmarks above were run under Linux, with Mono 1.9.1. Here are the results for the same benchmarks, but run with Microsoft CLR (on a different machine, so the absolute values are not comparable):

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (15,640625 seconds)
Warmup jitted: 0 (0,4375 seconds)
Warmed jitted: 0 (0,03125 seconds)

$ python -c 'print 15.640625/0.03125'
500.5

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0,90625 seconds)
Warmed jitted: 0 (0,515625 seconds)

$ ./factorial.exe 100000000
C#:            0 (0,34375 seconds)

$ python -c 'print 0.515625/0.34375'
1.5

The results are even better than before; this is probably thanks to CLR's JIT, that does a better job than Mono when faced to something which is different than the usual C#-like code.

Conclusions (for now)

This is a very important result, because it proves that PyPy's approach to JIT compilers can be applied effectively also to OO virtual machines; the result is even better than what I expected, because when generating code for .NET we have much less freedom than when generating assembly code, and I had to play some tricks to work around some .NET limitations.

Moreover, it worked at the first try :-). I tried to compile the tlc virtual machine as soon as all the related JIT tests were passing, and surprisingly everything worked just fine, even if it was the very first time I was trying to apply some features of the JIT to something bigger than a test: I think this is yet another prove that Test Driven Development just works!

Even if this is a major milestone, the CLI JIT backend is not yet completed: as a consequence it can't still be used for the full PyPy, but all the hardest problems should have been solved now.

Since a lot of readers asked for more technical details, especially about the JIT, I will try to soon write a second blog post explaining how the CLI backend works internally, with a brief look to the generated code to see how it looks like.

7 comments:

Anonymous said...

If you are benchmarking on Linux then watch out for CPU speed scaling. For example on Ubuntu by default the ondemand governor is used which runs the CPU at lowest possible speed and until there is any CPU demand at which point it runs at fastest. The time to switch varies (eg sometimes it can be instantaneous, other times a second or two, and other times not at all).

Make sure to use: cpufreq-set -g performance

That will run at maximum CPU speed the whole time.

Lucian said...

Woohoo! Can't wait for more (and the jvm counterpart).

Anonymous said...

I agree that this result is very important. For me and many others too i guess, a very good JIT is the most important missing part in python and other dynamic languages. Speed *is* important,

For integers, psycho also showed huge performance gains. But i think the real proof of pypy's approach would be to show similar results for floating point operations also...

Anonymous said...

awesome!

Anonymous said...

Keep it on!

Antonio Cuni said...

hi, thanks for your valuable comments.

Some notes:

- I know about cpufreq-set, but even setting the governor to performance doesn't help, the timings vary a lot between different runs. If someone knows a way to run reliable benchmarks, it would be very appreciated!

- I have plans to experiment also the JIT on the JVM: since HotSpot usually does a better job than CLR's JIT, it's possible/likely that the JVM is a better platform for our purposes. Also, the experimental Da Vinci Machine contains features that could be very useful for us. Unfortunately the PyPy non-JIT JVM backend is not as advanced as the CLI one, and it lacks some features that are really needed for writing a JIT backend.

- Float operations are already (mostly) supported by our JIT backends; I bet that if you add a FloatObj to the tlc interpreter, you will see huge speedups as well. However, the real point of PyPy's approach is that once finished it will optimize much more than ints and floats, including features that are currently not implemented by psyco (e.g. generators).

Ορέστης said...

Brilliant post! Keep us updated!