PyPy Status Blog: PyPy's JIT now supports floats

Tuesday, October 6, 2009

Hello.

We've just merged a branch that adds float support to the x86 backend. This means that floating point operations are now super fast in PyPy's JIT. Let's have a look at an example, provided by Alex Gaynor and stolen from the Factor blog.

The original version of the benchmark was definitely tuned for the performance needs of CPython.

To run this on PyPy, I switched to a slightly simpler version of the program. I'll explain the few changes I made, which reflect current limitations of PyPy's JIT. They're not very deep, and they might already be gone by the time you read this:

Usage of __slots__. This is a bit ridiculous, but we spent quite a bit of time speeding up normal instances of new-style classes, which are now very fast, yet ones with __slots__ are slower. To be fixed soon.
Usage of reduce. This one is even more obscure, but the JIT does not perceive reduce as something that produces loops in a program. Moving to a pure-Python version of reduce fixes the problem.
Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op when JITted (the same as reading a local variable in C). However, multiplication is a simpler operation than the power operation.
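
As a rough sketch of what these three changes look like in code (the Point class and the names py_reduce and longest are made up for illustration and are not taken from the actual benchmark):

```python
import math


class Point(object):
    # Plain new-style class with no __slots__ declaration (first change).
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm(self):
        # x * x instead of x ** 2 (third change).
        return math.sqrt(self.x * self.x + self.y * self.y + self.z * self.z)


def py_reduce(func, iterable, initial):
    # Pure-Python stand-in for the builtin reduce (second change):
    # the loop is written out explicitly in Python code.
    acc = initial
    for item in iterable:
        acc = func(acc, item)
    return acc


def longest(points):
    # Hypothetical use: the largest norm among a list of points.
    return py_reduce(lambda best, p: max(best, p.norm()), points, 0.0)
```

Something like longest([Point(3.0, 4.0, 0.0), Point(1.0, 0.0, 0.0)]) then goes through the explicit loop rather than through the builtin.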

I also included the original Java benchmark. Please note that the original Java version is similar to my modified one (not the one specifically tuned for CPython).

The performance figures below are for n = 1 000 000, averaged over 10 runs:

CPython 2.6: 7.56s
CPython & psyco 2.6: 4.44s
PyPy: 1.63s
Java (JVM 1.6, client mode): 0.77s

And while the JVM is much faster, it's very good that we can even compare :-)
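
To put the figures above in relation to each other, here is a small sketch that expresses every timing as a multiple of PyPy's. The numbers are copied from the list above; the helper name relative_to is only for illustration.

```python
# Timings in seconds, copied from the figures above.
timings = {
    "CPython 2.6": 7.56,
    "CPython & psyco 2.6": 4.44,
    "PyPy": 1.63,
    "Java (JVM 1.6, client mode)": 0.77,
}


def relative_to(baseline, timings):
    # Divide every timing by the baseline's timing.
    base = timings[baseline]
    return dict((name, t / base) for name, t in timings.items())


ratios = relative_to("PyPy", timings)
for name in sorted(ratios, key=ratios.get, reverse=True):
    print("%-28s %.2fx PyPy's time" % (name, ratios[name]))
```

With these figures, CPython 2.6 takes roughly 4.64 times as long as PyPy, CPython & psyco roughly 2.72 times as long, and the JVM roughly 0.47 times, i.e. a bit under half of PyPy's time.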

Cheers
fijal

24 comments:

Anonymous said...: So it's much faster than Psyco and only about 2x slower than the JVM. That's impressive, as Python is much more dynamic!

Congrats and thanks for the regular updates, it's much appreciated.; October 6, 2009 at 7:26 PM
Luis said...: Very exciting!
By the way, this result doesn't include the time to generate assembler. Right?; October 6, 2009 at 7:31 PM
Anonymous said...: Great, you guys are heroes!

Btw, what's the next big hurdle to run real-world programs? Memory use? Threads?; October 6, 2009 at 7:37 PM
Anonymous said...: Great job! I really appreciate your work.

@Luis: I think it does include the assembler. I just compiled trunk and ran the modified benchmark on Python 2.6 and pypy-c-jit. Best time of 10 runs:
Python 2.6.2: 0.911113977432
Pypy: 0.153664112091
So it's nearly 6x faster for me (including the time for generating the assembler, of course) - even much better than the posted numbers... I don't know if CPython was run with the unmodified version of the benchmark, though.; October 6, 2009 at 7:47 PM
William said...: I'd be interested to see the results for a much longer run (n = 10 000 000?).; October 6, 2009 at 9:36 PM
Panos Laganakos said...: Wicked! Keep the sweetness coming :); October 6, 2009 at 9:55 PM
Unknown said...: Very exciting. Thanks! These are nearing "holy crap" numbers.

<mindControl>siiiixty foooouuur biiiiit<mindControl>

:-); October 7, 2009 at 5:15 AM
René Dudfield said...: awesome! things are really starting to move along now :)

I tried the same little benchmark with the shedskin python to C++ compiler for comparison:

cpython2.5: 16.2770409584
cpython2.6: 12.2321541309
shedskin: 0.316256999969

Shedskin is 38.6 times faster than cpython2.6, and 51.4 times faster than cpython2.5... and to extrapolate from your numbers 3.9 times faster than the jvm.

Of course that doesn't include the time it takes to generate the C++ and then compile it with g++ (using the old 4.0.1 g++, not the latest 4.4). I also didn't include the python interpreter startup cost.

btw, I found map, reduce and filter all to be faster with pure python versions when using psyco too.

cu!; October 7, 2009 at 1:35 PM
Maciej Fijalkowski said...: @illume

That's a bit of an unfair comparison, since shedskin is not Python. You can compare RPython and shedskin, though. RPython is sometimes even faster than C...

And also, yes, the PyPy and psyco timings include compilation time.

Cheers,
fijal; October 7, 2009 at 3:42 PM
Luis said...: I'm still confused... If you post the average of 10 runs, and assembler is generated only in the first run, then this time is diluted. Shouldn't you compute the average of 10 runs, but excluding the first one? (That means running it 11 times and ignoring the first one.); October 7, 2009 at 4:34 PM
Anonymous said...: @Luis: no, I think fijal started the pypy-c interpreter 10 times, and each time it generates assembly (it's not cached afaik).; October 7, 2009 at 8:31 PM
Luis said...: Well, no matter how they measure it, this is definitely within the "Holy Crap" range...; October 7, 2009 at 9:28 PM
Maciej Fijalkowski said...: @Luis:

Maybe I should... I really did run this 10 times, with assembler generated only during the first run. But diluting the assembler generation time over several runs is also kind of a real-life effect...; October 7, 2009 at 10:37 PM
Baczek said...: how about including unladen swallow results?; October 8, 2009 at 6:06 PM
Michael Allman said...: How come the pypy JIT is compiled AOT to C? I thought the idea of PyPy was to implement a python runtime in python? Why not run the JIT on a python runtime?

Awesome work. I wish the Ruby folk were as motivated...

Cheers.; October 8, 2009 at 8:26 PM
Anonymous said...: I seem to recall grumblings from C++ programmers a few years ago when Java started supporting multi-core architecture, which made Java execution as fast or faster than C++ with much less development effort (for free with the Java interpreter vs hand-written C++ support).

If your testing machine is a multi-core/processor machine, it might be appropriate to say that PyPy is now as fast as C++ (without explicit multi-core support). Wow!; October 8, 2009 at 8:32 PM
Armin Rigo said...: Michael: because our goal is to have a general framework, not a Python-centered solution. For example, the JIT generator works mostly out of the box with any other language that we implemented in RPython (which includes Smalltalk).; October 9, 2009 at 1:39 PM
hihhu said...: Great work!

How large an effort would it be to have eg. Perl or Ruby working with this? Just out of curiosity, I'm trying to understand this project better.; October 9, 2009 at 8:06 PM
Anonymous said...: In the correct original version of the benchmark there are two calls to sin(). A good compiler optimizes one of them away. A worse compiler doesn't. So it's more fair to put back the second sin in the Python code too.; October 9, 2009 at 10:23 PM
Maciej Fijalkowski said...: @hihhu:

It would be a bit easier than writing the interpreter in C, since RPython is much nicer. Also, you get a JIT almost for free and a decent GC for free. On the other hand, writing an interpreter is quite a bit of work on its own.

@Anonymous:

Indeed, well spotted, it would be more fair. However, there is no measurable difference (at least in PyPy running time).

PS. We have weekends, too.

Cheers,
fijal; October 11, 2009 at 9:37 PM
della said...: Would a Pypy implementation of Perl/Ruby/PHP mean that it would be possible to use libraries developed in one language for the other one? That would be very cool indeed.

And, for that matter, would that mean interoperability between python2 and python3 modules when the py3 interpreter will be done? :); October 12, 2009 at 10:48 AM
Maciej Fijalkowski said...: @della.

In general, that would not be that simple. You would need to somehow map data types between interpreters, and it's unclear how. For example, what would happen if you called a Python 2.x function, passing an argument that is a py3k dict (which has a different interface)?

Cheers,
fijal; October 12, 2009 at 5:33 PM
della said...: One would imagine having different interfaces for the same objects when accessed from 2.x and 3.x code. Would that be difficult?

Of course, I understand mapping data structures between languages that have many more differences between them than py2 and py3 would definitely be more complex.; October 13, 2009 at 11:34 AM
Anonymous said...: Not to rain on the parade, but Java's trig functions are very slow outside of -pi/2,pi/2 range to correct terrible fsin/fcos results on Intel x86.

See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

Your benchmark should include something to measure the error, or not use trig functions as a benchmark when comparing to Java.; November 2, 2009 at 7:08 PM