Thursday, October 15, 2009

First pypy-cli-jit benchmarks

As the readers of this blog already know, I've been working on porting the JIT to CLI/.NET for the last few months. Now that it's finally possible to get a working pypy-cli-jit, it's time to do some benchmarks.

Warning: as usual, all of this should be considered an alpha version: don't be surprised if you get a crash when trying to run pypy-cli-jit. Of course, things are improving very quickly, so it should become more and more stable as days pass.

For this round, I decided to run four benchmarks. Note that for all of them we run the main function once in advance, to let the JIT recognize the hot loops and emit the corresponding code. Thus, the reported results do not include the time spent by the JIT compiler itself, but give a good measure of how good the code generated by the JIT is. At this point in time, I know that the CLI JIT backend spends way too much time compiling stuff, but this issue will be fixed soon.

  • f1.py: this is the classic PyPy JIT benchmark. It is just a function that does some computationally intensive work with integers.
  • floatdemo.py: this is the same benchmark involving floating point numbers that has already been described in a previous blog post.
  • oodemo.py: this is just a microbenchmark doing object oriented stuff such as method calls and attribute access.
  • richards2.py: a modified version of the classic richards.py, with a warmup call before starting the real benchmark.
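The warm-up methodology described above (run the main function once so the JIT can compile the hot loops, then time a second run) can be sketched roughly as follows. This is only an illustration, not the actual harness used for these benchmarks: the names timed_after_warmup and f1_like are hypothetical, f1_like is just a stand-in integer loop and not the real f1.py, and time.perf_counter assumes a recent Python 3.

```python
import time


def timed_after_warmup(main, *args):
    """Call main once untimed (warm-up), then time a second call.

    Returns (result, elapsed_seconds) for the second, timed call only.
    """
    main(*args)                        # warm-up: lets a JIT find and compile hot loops
    start = time.perf_counter()        # assumes Python 3.3+
    result = main(*args)               # measured run
    elapsed = time.perf_counter() - start
    return result, elapsed


def f1_like(n):
    """Hypothetical stand-in workload: a tight loop doing integer arithmetic."""
    i = 0
    total = 0
    while i < n:
        total += (i * 2) % 7
        i += 1
    return total


if __name__ == "__main__":
    result, elapsed = timed_after_warmup(f1_like, 100000)
    print("result=%d elapsed=%.3fs" % (result, elapsed))
```

Because the first call is discarded, the measured time reflects the quality of the generated code rather than the cost of compiling it, which is the trade-off described above.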

The benchmarks were run on a Windows machine with an Intel Pentium Dual Core E5200 2.5GHz and 2GB RAM, both with .NET (CLR 2.0) and Mono 2.4.2.3.

Because of a known Mono bug, if you use a version older than 2.1 you need to pass the option -O=-branch to mono when running pypy-cli-jit, otherwise it will just loop forever.

For comparison, we also ran the same benchmarks with IronPython 2.0.1 and IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with Mono.

So, here are the results (expressed in seconds) with Microsoft CLR:

Benchmark    pypy-cli-jit   ipy 2.0.1   ipy 2.6   ipy 2.0.1 / pypy   ipy 2.6 / pypy
f1           0.028          0.145       0.136     5.18x              4.85x
floatdemo    0.671          0.765       0.812     1.14x              1.21x
oodemo       1.25           4.278       3.816     3.42x              3.05x
richards2    1228           442         670       0.36x              0.54x

And with Mono:

Benchmark    pypy-cli-jit   ipy 2.0.1   ipy 2.0.1 / pypy
f1           0.042          0.695       16.54x
floatdemo    0.781          1.218       1.55x
oodemo       1.703          9.501       5.31x
richards2    720            862         1.20x

These results are very interesting: under the CLR, we are between 5x faster and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower than IronPython 2.6. On the other hand, on Mono we are consistently faster than IronPython, up to 16x. It is also interesting to note that pypy-cli-jit runs faster on CLR than on Mono for all benchmarks except richards2.
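For readers who want to check the arithmetic, the ratio columns are simply the IronPython time divided by the pypy-cli-jit time, and an "Nx slower" figure is the inverse of a ratio below 1. Here is a small sketch using the CLR timings from the table above; the helper names speedup and describe are made up for this example, and the last digit may differ slightly from the table depending on rounding.

```python
def speedup(ipy_time, pypy_time):
    """Ratio ipy / pypy: above 1 means pypy-cli-jit is faster."""
    return ipy_time / pypy_time


def describe(ratio):
    """Turn a ratio into an 'Nx faster' / 'Nx slower' phrase."""
    if ratio >= 1.0:
        return "%.2fx faster" % ratio
    return "%.2fx slower" % (1.0 / ratio)


# Timings under Microsoft CLR, copied from the table above:
# benchmark -> (pypy-cli-jit, ipy 2.0.1, ipy 2.6)
clr_timings = {
    "f1": (0.028, 0.145, 0.136),
    "floatdemo": (0.671, 0.765, 0.812),
    "oodemo": (1.25, 4.278, 3.816),
    "richards2": (1228, 442, 670),
}

if __name__ == "__main__":
    for name in sorted(clr_timings):
        pypy, ipy201, ipy26 = clr_timings[name]
        print("%-10s vs ipy 2.0.1: %s, vs ipy 2.6: %s" % (
            name,
            describe(speedup(ipy201, pypy)),
            describe(speedup(ipy26, pypy)),
        ))
```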

I've not investigated yet, but I think that the culprit is the terrible behaviour of tail calls on CLR: as I already wrote in another blog post, tail calls are ~10x slower than normal calls on CLR, while being only ~2x slower than normal calls on Mono. richards2 is probably the benchmark that makes the most use of tail calls, which would explain why we get a much better result on Mono than on CLR.

The next step is probably to find an alternative implementation that does not use tail calls: this will probably also reduce the time spent by the JIT compiler itself, which is not reported in the numbers above but is so far surely too high to be acceptable. Stay tuned.

7 comments:

Michael Foord said...

Perhaps you should try another run with the .NET 4 beta. They have at least *mostly* fixed the terrible performance of tail calls there.

Anyway - interesting stuff, keep up the good work. What is the current state of .NET integration with pypy-cli?

Antonio Cuni said...

Oh, I didn't know about .NET 4 beta. Have you got any link that explains how they fixed the tail call stuff? I'll surely give it a try.

About the .NET integration: no news from this front. Nowadays I'm fully concentrated on the JIT because I need some (possibly good :-)) results for my PhD thesis. When pypy-cli-jit is super-fast, I'll try to make it also useful :-)

Michael Foord said...

Here's at least one link (with some references) on the tail call improvements in .NET 4:

http://extended64.com/blogs/news/archive/2009/05/10/tail-call-improvements-in-net-framework-4.aspx

Michael Foord said...

I'm also intrigued as to why you didn't benchmark IronPython 2.6 on Mono? I thought that on very recent versions of Mono you could build and run IronPython 2.6 fine now?

Michael Foord said...

Ah, I see now you say that it doesn't work. Hmmm... there are definitely folks who maintain a version that does work (perhaps needing Mono 2.4.3 which I guess is trunk?).

See the download previews here anyway: http://ironpython-urls.blogspot.com/2009/09/more-from-mono-moonlight-2-monodevelop.html

Anonymous said...

I wonder if this paper would be useful? It's a way to do continuations using the stack on .NET. Maybe you can use it to speed up tail calls?

http://www.cs.brown.edu/~sk/Publications/Papers/Published/pcmkf-cont-from-gen-stack-insp/

Antonio Cuni said...

@Michael: from the link you posted, it seems that tail call improvements in .NET 4 are only for x86_64, but my benchmarks were run on 32 bit, so I don't think it makes a difference. Anyway, I'll try to benchmark with .NET 4 soon, thanks for the suggestion.

@Anonymous: the paper is interesting, but I don't think it's usable for our purposes: throwing and catching exceptions is incredibly costly in .NET, so we cannot really use them too heavily. The fact that the paper says nothing about performance is also interesting :-)