Tuesday, October 6, 2009

PyPy's JIT now supports floats

Hello.

We've just merged the branch that adds float support to the x86 backend. This means that floating-point operations are now super fast in PyPy's JIT. Let's have a look at an example, provided by Alex Gaynor and stolen from the Factor blog.
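The kernel of the program looks roughly like this; a from-memory sketch of its shape, with assumed names such as Point and benchmark, not a verbatim copy of Alex's code:

    # Sketch of the float benchmark (assumed names, Python 2 era):
    # build n points from sin/cos, normalize them, then fold them
    # into a single maximal point.
    from math import sin, cos, sqrt

    class Point(object):
        def __init__(self, i):
            self.x = x = sin(i)
            self.y = cos(i) * 3
            self.z = (x * x) / 2

        def normalize(self):
            norm = sqrt(self.x * self.x + self.y * self.y + self.z * self.z)
            self.x /= norm
            self.y /= norm
            self.z /= norm

        def maximize(self, other):
            self.x = max(self.x, other.x)
            self.y = max(self.y, other.y)
            self.z = max(self.z, other.z)
            return self

    def benchmark(n):
        points = [Point(i) for i in xrange(n)]
        for p in points:
            p.normalize()
        result = points[0]
        for p in points[1:]:
            result = result.maximize(p)
        return result

Nearly all of the work here is floating-point arithmetic, which is exactly what the new backend support speeds up.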

The original version of the benchmark was definitely tuned for the performance needs of CPython.

For running this on PyPy, I switched to a slightly simpler version of the program, and I'll explain a few changes that I made, which reflect current limitations of PyPy's JIT. They're not very deep and they might already be gone by the time you read this:

  • Usage of __slots__. This is a bit ridiculous, but we spent quite a bit of time speeding up normal instances of new-style classes, which are now very fast, while instances of classes with __slots__ are slower. To be fixed soon.
  • Usage of reduce. This one is even more obscure: the reduce builtin is not recognized by the JIT as something that produces a loop in the program. Moving to a pure-Python version of reduce fixes the problem (see the sketch after this list).
  • Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op once JITted (the same as reading a local variable in C). However, multiplication is a simpler operation than exponentiation.
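To make the last two items concrete, here is the kind of rewrite meant; a minimal sketch, and the helper name my_reduce is mine:

    # A pure-Python reduce: this loop is visible to the tracing JIT,
    # unlike the loop hidden inside the C-level reduce builtin.
    def my_reduce(function, sequence, initial):
        result = initial
        for item in sequence:
            result = function(result, item)
        return result

    # x * x instead of x ** 2: a plain float multiplication in the
    # generated assembler, instead of going through the pow machinery.
    def square(x):
        return x * x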

I also included the original Java benchmark. Please note that the original Java version is similar to my modified one (not the one specifically tuned for CPython).

The performance figures below (for n = 1 000 000) are the average of 10 runs:
  • CPython 2.6: 7.56s
  • CPython & psyco 2.6: 4.44s
  • PyPy: 1.63s
  • Java (JVM 1.6, client mode): 0.77s

and while the JVM is still much faster, it's very good that we can even compare :-)
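For reference, an average-of-10-runs figure like the ones above can be produced with a harness along these lines (a sketch, not the exact script used):

    import time

    def average_time(fn, runs=10):
        # Average wall-clock time of fn() over `runs` in-process runs.
        # Runs after the first reuse the assembler the JIT already
        # generated, so compilation time gets diluted across runs.
        total = 0.0
        for _ in range(runs):
            start = time.time()
            fn()
            total += time.time() - start
        return total / runs

    # Usage, with the benchmark sketched above:
    # print average_time(lambda: benchmark(1000000))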

Cheers
fijal

24 comments:

  1. So it's much faster than Psyco and only about 2x slower than the JVM. That's impressive, as Python is much more dynamic!

    Congrats and thanks for the regular updates, it's much appreciated.

  2. Very exciting!
    By the way, this result doesn't include the time to generate assembler. Right?

  3. Great, you guys are heroes!

    Btw, what's the next big hurdle to run real-world programs? Memory use? Threads?

  4. Great job! I really appreciate your work.

    @Luis: I think it does include the assembler. I just compiled trunk and ran the modified benchmark on Python 2.6 and pypy-c-jit. Best time of 10 runs:
    Python 2.6.2: 0.911113977432
    PyPy: 0.153664112091
    So it's nearly 6x faster for me (including the time for generating the assembler, of course), even much better than the posted numbers... I don't know if CPython was run with the unmodified version of the benchmark, though.

  5. I'd be interested to see the results for a much longer run (n = 10 000 000?).

  6. Wicked! Keep the sweetness coming :)

  7. Very exciting. Thanks! These are nearing "holy crap" numbers.

    <mindControl>siiiixty foooouuur biiiiit</mindControl>

    :-)

  8. awesome! things are really starting to move along now :)

    I tried the same little benchmark with the Shedskin Python-to-C++ compiler for comparison:

    cpython2.5: 16.2770409584
    cpython2.6: 12.2321541309
    shedskin: 0.316256999969

    Shedskin is 38.6 times faster than CPython 2.6, and 51.4 times faster than CPython 2.5... and, extrapolating from your numbers, 3.9 times faster than the JVM.

    Of course that doesn't include the time it takes to generate the C++ and then compile it with g++ (using the old 4.0.1 g++, not the latest 4.4). I also didn't include the Python interpreter startup cost.

    btw, I found map, reduce and filter all to be faster as pure-Python versions when using Psyco too.

    cu!

  9. @illume

    that's a bit of an unfair comparison, since Shedskin is not Python. You can compare RPython and Shedskin, though; RPython is sometimes even faster than C...

    And also, yes, the PyPy and Psyco times include compilation time.

    Cheers,
    fijal

  10. I'm still confused... if you post the average of 10 runs, and the assembler is generated only in the first run, then this time is diluted. Shouldn't you compute the average of 10 runs but exclude the first one? (That means running it 11 times and ignoring the first.)

  11. @Luis: no, I think fijal started the pypy-c interpreter 10 times, and each time it generates assembly (it's not cached afaik).

  12. Well, no matter how they measure it, this is definitely within the "Holy Crap" range...

  13. @Luis:

    Maybe I should... I really did run this 10 times, and the assembler was generated only the first time. But diluting the assembler-generation time over the runs is kind of a real-life effect...

  14. How about including Unladen Swallow results?

  15. How come the PyPy JIT is compiled ahead of time to C? I thought the idea of PyPy was to implement a Python runtime in Python. Why not run the JIT on a Python runtime?

    Awesome work. I wish the Ruby folk were as motivated...

    Cheers.

  16. I seem to recall grumbling from C++ programmers a few years ago when Java started supporting multi-core architectures, which made Java execution as fast as or faster than C++ with much less development effort (for free with the Java interpreter vs. hand-written C++ support).

    If your testing machine is a multi-core/processor machine, it might be appropriate to say that PyPy is now as fast as C++ (without explicit multi-core support). Wow!

  17. Michael: because our goal is to have a general framework, not a Python-centered solution. For example, the JIT generator works mostly out of the box with any other language that we implemented in RPython (which includes Smalltalk).

  18. Great work!

    How large an effort would it be to get, e.g., Perl or Ruby working with this? Just out of curiosity; I'm trying to understand this project better.

  19. In the correct original version of the benchmark there are two calls to sin(). A good compiler optimizes one of them away; a worse compiler doesn't. So it's more fair to put the second sin back into the Python code too.

  20. @hihu:

    It would be a bit easier than writing the interpreter in C, since RPython is much nicer. Also, you get a JIT almost for free and a decent GC for free. On the other hand, writing interpreters is quite a bit of work in its own right.

    @Anonymous:

    Indeed, well spotted; it would be more fair. However, there is no measurable difference (at least in PyPy's running time).

    PS. We have weekends, too.

    Cheers,
    fijal

  21. Would a PyPy implementation of Perl/Ruby/PHP mean that it would be possible to use libraries developed in one language from another one? That would be very cool indeed.

    And, for that matter, would that mean interoperability between Python 2 and Python 3 modules once the py3 interpreter is done? :)

  22. @della:

    In general, that would not be so simple: it's unclear how to map data types between the two interpreters. For example, what would happen if you call a Python 2.x function, passing it an argument that is a py3k dict (which has a different interface)?

    Cheers,
    fijal

  23. One would imagine having different interfaces for the same objects when accessed from 2.x and 3.x code. Would that be difficult?

    Of course, I understand mapping data structures between languages that have many more differences between them than py2 and py3 would definitely be more complex.

  24. Not to rain on the parade, but Java's trig functions are very slow outside the [-pi/2, pi/2] range, in order to correct the terrible fsin/fcos results on Intel x86.

    See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

    Your benchmark should include something to measure the error, or else not use trig functions when comparing to Java.

