Tuesday, October 6, 2009

PyPy's JIT now supports floats

Hello.

We've just merged branch which adds float support to x86 backend. This means that floating point operations are now super fast in PyPy's JIT. Let's have a look at example, provided by Alex Gaynor and stolen from Factor blog.

The original version of the benchmark, was definitely tuned for the performance needs of CPython.

For running this on PyPy, I changed to a bit simpler version of the program, and I'll explain a few changes that I did, which the reflect current limitations of PyPy's JIT. They're not very deep and they might be already gone while you're reading it:

  • Usage of __slots__. This is a bit ridiculous, but we spend quite a bit of time to speed up normal instances of new-style classes which are very fast, yet ones with __slots__ are slower. To be fixed soon.
  • Usage of reduce. This one is even more obscure, but reduce is not perceived as a thing producing loops in a program. Moving to a pure-Python version of reduce fixes the problem.
  • Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op when JITted (the same as reading local variable in C). However multiplication is simpler operation that power operation.

I also included the original Java benchmark. Please note that original java version is similar to my modified one (not the one specifically tuned for CPython)

The performance figures below (for n = 1 000 000), average of 10 runs:
  • CPython 2.6: 7.56s
  • CPython & psyco 2.6: 4.44s
  • PyPy: 1.63s
  • Java (JVM 1.6, client mode): 0.77s

and while JVM is much faster, it's very good that we can even compare :-)

Cheers
fijal

24 comments:

Anonymous said...

So it's much faster than Psyco and only about 2x slower than the JVM. That's impressive, as Python is much more dynamic!

Congrats and thanks for the regular updates, it's much appreciated.

Luis said...

Very exciting!
By the way, this result doesn't include the time to generate assembler. Right?

Anonymous said...

Great, you guys are heroes!

Btw, what's the next big hurdle to run real-world programs? Memory use? Threads?

Anonymous said...

Great job! I really appreciate your work.

@Luis: I think, it does include the assembler. I just compiled trunk and ran the modified benchmark on python 2.6 and pypy-c-jit. Best time of 10 runs:
Python 2.6.2: 0.911113977432
Pypy: 0.153664112091
So it's nearly 6x faster for me (including the time for generating the assembler, of course) - even much better than on the postet numbers...I don't know, if cpython was run with the unmodified version of the benchmark though.

William said...

I'd be interested to see the results for a much longer run (n = 10 000 000?).

Panos Laganakos said...

Wicked! Keep the sweetness coming :)

Cory said...

Very exciting. Thanks! These are nearing "holy crap" numbers.

<mindControl>siiiixty foooouuur biiiiit<mindControl>

:-)

illume said...

awesome! things are really starting to move along now :)

I tried the same little benchmark with the shedskin python to C++ compiler for comparison:

cpython2.5: 16.2770409584
cpython2.6: 12.2321541309
shedskin: 0.316256999969

Shedskin is 38.6 times faster than cpython2.6, and 51.4 times faster than cpython2.5... and to extrapolate from your numbers 3.9 times faster than the jvm.

Of course that doesn't include the time it takes to generate the C++ and then compile it with g++ (using the old 4.0.1 g++, not the latest 4.4). I also didn't include the python interpreter startup cost.

btw, I found map, reduce and filter all to be faster with pure python versions when using psyco too.

cu!

Maciej Fijalkowski said...

@illume

that's a bit unfair comparison, since shedskin is not python. you can compare RPython and shedskin though. RPython is sometimes faster than C even...

And also, yes, in PyPy or psyco time we include compilation time.

Cheers,
fijal

Luis said...

I'm still confussed.. if you post the average of 10 runs, and assembler is generated only in the first run, then this time is diluted. Shouldn't you compute the average of 10 runs, but excluding the first one? (that means, runing it 11 times and ignoring the first one?).

Anonymous said...

@Luis: no, I think fijal started the pypy-c interpreter 10 times, and each time it generates assembly (it's not cached afaik).

Luis said...

Well, no matter how they measure it, this is definitely within the "Holy Crap" range...

Maciej Fijalkowski said...

@Luis:

Maybe I should... I really run this 10 times while assembler was generated only during the first time. But also dilluting assembler generation time over runs is kind of real-life effect...

Baczek said...

how about including unladen swallow results?

Michael Allman said...

How come the pypy JIT is compiled AOT to C? I thought the idea of PyPy was to implement a python runtime in python? Why not run the JIT on a python runtime?

Awesome work. I wish the Ruby folk were as motivated...

Cheers.

Anonymous said...

I seem to recall grumblings from C++ programmers a few years ago when Java started supporting multi-core architecture, which made Java execution as fast or faster than C++ with much less development effort (for free with the Java interpreter vs hand-written C++ support).

If your testing machine is a multi-core/processor machine, it might be appropriate to say that PyPy is now as fast as C++ (without explicit multi-core support). Wow!

Armin Rigo said...

Michael: because our goal is to have a general framework, not a Python-centered solution. For example, the JIT generator works mostly out of the box with any other language that we implemented in RPython (which includes Smalltalk).

hihhu said...

Great work!

How large an effort would it be to have eg. Perl or Ruby working with this? Just out of curiosity, I'm trying to understand this project better.

Anonymous said...

In the correct original version of the benchmark there are two calls to sin(). A good compiler optimizes one of them away. A worse compiler don't. So it's more fair to put back the second sin in the Python code too.

Maciej Fijalkowski said...

@hihu:

It would be a bit easier than writing the interpreter in C, since RPython is much nicer. Also, you get JIT for almost free and decent GC for free. On the other hand, writing interpreters it's quite a bit of work on it's own.

@Anonymous:

Indeed, well, spotted, it would be more fair. However, there is no measurable difference (at least in pypy running time).

PS. We have weekends, too.

Cheers,
fijal

della said...

Would a Pypy implementation of Perl/Ruby/PHP mean that it would be possible to use libraries developed in one language for the other one? That would be very cool indeed.

And, for that matter, would that mean interoperability between python2 and python3 modules when the py3 interpreter will be done? :)

Maciej Fijalkowski said...

@della.

In general, that would not be that simple. You need to somehow map data types between interpreters in an unclear manner. For example, what would happen if you call Python2.x function passing argument that is py3k dict (which has different interface)?

Cheers,
fijal

della said...

One would imagine having different interfaces for the same objects when accessed from 2.x and 3.x code. Would that be difficult?

Of course, I understand mapping data structures between languages that have many more differences between them than py2 and py3 would definitely be more complex.

Anonymous said...

Not to rain on the parade, but Java's trig functions are very slow outside of -pi/2,pi/2 range to correct terrible fsin/fcos results on Intel x86.

See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

Your benchmark should include something to measure the error, or not use trig functions as a benchmark when comparing to Java.