Tuesday, November 9, 2010

A snake which bites its tail: PyPy JITting itself

We have to admit: even if we have been writing for years about the fantastic speedups that the PyPy JIT gives, we, the PyPy developers, still don't use it for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves surely know that the process takes quite a long time to complete, about a hour on super-fast hardware and even more on average computers. Unfortunately, it happened that translate.py was a bad match for our JIT and thus ran much slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use of custom metaclasses, and until few weeks ago metaclasses disabled some of the central optimizations which make PyPy so fast. During the recent Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and re-enabled all the optimizations even in presence of metaclasses.

So, today we decided that it was time to benchmark again PyPy against itself. First, we tried to translate PyPy using CPython as usual, with the following command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and 12 GB of RAM, running a 32-bit Ubuntu):

$ python ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  252.0 s
[Timer] rtype_lltype                   ---  199.3 s
[Timer] pyjitpl_lltype                 ---  565.2 s
[Timer] backendopt_lltype              ---  217.4 s
[Timer] stackcheckinsertion_lltype     ---   26.8 s
[Timer] database_c                     ---  234.4 s
[Timer] source_c                       ---  480.7 s
[Timer] compile_c                      ---  258.4 s
[Timer] ===========================================
[Timer] Total:                         --- 2234.2 s

Then, we tried the same command line with PyPy (SVN revision 78903, x86-32 JIT backend, downloaded from the nightly build page):

$ pypy-c-78903 ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  165.3 s
[Timer] rtype_lltype                   ---  121.9 s
[Timer] pyjitpl_lltype                 ---  224.0 s
[Timer] backendopt_lltype              ---   72.1 s
[Timer] stackcheckinsertion_lltype     ---    7.0 s
[Timer] database_c                     ---  104.4 s
[Timer] source_c                       ---  167.9 s
[Timer] compile_c                      ---  320.3 s
[Timer] ===========================================
[Timer] Total:                         --- 1182.8 s

Yes, it's not a typo: PyPy is almost two times faster than CPython! Moreover, we can see that PyPy is faster in each of the individual steps apart compile_c, which consists in just a call to make to invoke gcc. The slowdown comes from the fact that the Makefile also contains a lot of calls to the trackgcroot.py script, which happens to perform badly on PyPy but we did not investigate why yet.

However, there is also a drawback: on this specific benchmark, PyPy consumes much more memory than CPython. The reason why the command line above contains --no-allworkingmodules is that if we include all the modules the translation crashes when it's complete at 99% because it consumes all the 4GB of memory which is addressable by a 32-bit process.

A partial explanation if that so far the assembler generated by the PyPy JIT is immortal, and the memory allocated for it is never reclaimed. This is clearly bad for a program like translate.py which is divided into several independent steps, and for which most of the code generated in each step could be safely be thrown away when it's completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and thus translating with all working modules is no longer an issue. This is the time taken with CPython (note that it does not make sense to compare with the 32-bit CPython translation above, because that one does not include all the modules):

$ python ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  782.7 s
[Timer] rtype_lltype                   ---  445.2 s
[Timer] pyjitpl_lltype                 ---  955.8 s
[Timer] backendopt_lltype              ---  457.0 s
[Timer] stackcheckinsertion_lltype     ---   63.0 s
[Timer] database_c                     ---  505.0 s
[Timer] source_c                       ---  939.4 s
[Timer] compile_c                      ---  465.1 s
[Timer] ===========================================
[Timer] Total:                         --- 4613.2 s

And this is for PyPy:

$ pypy-c-78924-64 ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  505.8 s
[Timer] rtype_lltype                   ---  279.4 s
[Timer] pyjitpl_lltype                 ---  338.2 s
[Timer] backendopt_lltype              ---  125.1 s
[Timer] stackcheckinsertion_lltype     ---   21.7 s
[Timer] database_c                     ---  187.9 s
[Timer] source_c                       ---  298.8 s
[Timer] compile_c                      ---  650.7 s
[Timer] ===========================================
[Timer] Total:                         --- 2407.6 s

The results are comparable with the 32-bit case: PyPy is still almost 2 times faster than CPython. And it also shows that our 64-bit JIT backend is as good as the 32-bit one. Again, the drawback is in the consumed memory: CPython used 2.3 GB while PyPy took 8.3 GB.

Overall, the results are impressive: we knew that PyPy can be good at optimizing small benchmarks and even middle-sized programs, but as far as we know this is the first example in which it heavily optimizes a huge, real world application. And, believe us, the PyPy translation toolchain is complex enough to contains all kinds of dirty tricks and black magic that make Python lovable and hard to optimize :-).

21 comments:

Victor said...

This is amazing, huge kudos to all PyPy developers!

Do these results include "Håkan's jit-unroll-loops branch" you mentioned in sprint report? When are we going to get a release containing these improvements? And do the nightly builds include them?

Carl Friedrich Bolz-Tereick said...

@Victor: No, Håkan's branch has not been merged. It still has some problems that we don't quite know how to solve.

The nightly builds include all other improvements though. We plan to do a release at some point soon.

Anonymous said...

This is great!

One question: A while back, after the GSoC project for 64-bit, there was an issue with asmgcc-64 such that the 64-bit GC was slower than it should be.

It appears from the performance described in this post, that that must be resolved now. Is that right?

Thanks,
Gary

Leonardo Santagada said...

There should be a way to not only throw away jit memory but somehow tell pypy to try to not use more than say 3gb of ram so it will not hit swap on 4gb machines.

Maciej Fijalkowski said...

@Gary yes, that is correct

Anonymous said...

Wow, cool work!

Eric van Riet Paap said...

Excellent.. congratulations!

ArneBab said...

Wow, looks great!

Many thanks for posting the benchmark – and for your relentless work on pypy!

One thing: Could you add tests comparing with programs converted to python3?

Antonio Cuni said...

@ArneBab: I'm not sure what you mean, but consider that at the moment PyPy does not support Python 3, so it does not make sense to compare against it.

Michael Foord said...

PyPy continues to get more and more impressive.

Armin Rigo said...

For reference, at some point (long ago) I tried to use Psyco to speed up translate.py on CPython; but i didn't make any difference -- I'm guessing it's because we have nested scope variables at a few critical points, which Psyco cannot optimize. Now I no longer have a need for that :-)

Anonymous said...

Very cool achievement. I'm curious however to know why compile_c section is slower. I thought it was mostly waiting on external programs to run and so should of been similar time cpython? Congratulations!

Antonio Cuni said...

@Anonymous: you are right when you say that compile_c mostly invokes gcc, but also a python script called trackgcroot.py.

The python script is run with the same interpreter using for translate.py (so pypy in this case), and it happens that it's slower than with cpython.

Anonymous said...

How come the 64 bit timings are so much worse than the 32 bit timings (both CPython and PyPy)?

Carl Friedrich Bolz-Tereick said...

@Anonymous: Because the 64bit version is translating all modules, which simply gives the translator a lot more to do. We cannot do that yet on 32bit due to memory problems.

Victor said...

@cfbolz Well, but you sure can run the 64bit version with the same module list as you did for 32bit... So if running the benchmark again in the same conditions isn't a lot of work, it'd provide yet another interesting data point ;)

adimasci said...

Nice work !

Anonymous said...

In other words: The pypy jit compiler leaks a massive amount of memory. Will you address this issue?

Maciej Fijalkowski said...

Technically it's not "leaking". And yes, we're trying to address this issue.

Tim Parkin said...

Yes I think the word you wanted was "uses" instead of "leaks". The latter implies unforseen problems and errors, the former implies that memory usage hasn't been addressed yet... Just to reiterate - PyPy currently *uses* more memory than CPython.

Tim Parkin said...

Oh and a huge congratulations for this achievement!!!