Because PyPy will be presenting at the upcoming EuroSciPy conference, I have been playing recently with the idea of NumPy and PyPy integration. My idea is to integrate PyPy's JIT with NumPy or at least a very basic subset of it. Time constraints make it impossible to hand-write a JIT compiler that understands NumPy. But given PyPy's architecture we actually have a JIT generator, so we don't need to write one :-)
Our JIT has shown that it can speed up small arithmetic examples significantly. What happens with something like NumPy?
I wrote a very minimal subset of NumPy in RPython, called micronumpy (only single-dimension int arrays that can only get and set items), and a benchmark against it. The point of this benchmark is to compare the performance of a builtin function (numpy.minimum) against the equivalent hand-written function, written in pure Python and compiled by our JIT.
The goal is to prove that it is possible to write algorithms in Python instead of C without loss of efficiency. Sure, we can write some functions (like minimum in the following example), but there is a whole universe of other ufuncs which would be cool to have in Python instead, assuming this could be done without a huge loss in efficiency.
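A minimal sketch of such a hand-written minimum is below; `zeros` here is just a stand-in for an array constructor (in the benchmark it would produce a micronumpy array, but a plain list has the same get/set-item interface):

```python
def zeros(n):
    # Stand-in for an array constructor (illustrative, not the actual
    # benchmark code); a plain list supports the same item get/set.
    return [0] * n

def minimum(a, b):
    # Hand-written, pure-Python elementwise minimum. It only uses len()
    # and item reads/writes, which is all micronumpy arrays support.
    assert len(a) == len(b)
    result = zeros(len(a))
    for i in range(len(a)):
        if a[i] < b[i]:
            result[i] = a[i]
        else:
            result[i] = b[i]
    return result
```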
Here are the results. This is comparing PyPy svn revision 66303 in the pyjitpl5 branch against Python 2.6 with NumPy 1.2.1. The builtin numpy.minimum in PyPy is just a naive implementation in RPython, which is comparable to the speed of a naive implementation written in C (and thus a bit slower than the optimized version in NumPy):
| Implementation | Time |
| --- | --- |
| NumPy (builtin function) | 0.12s |
| PyPy's micronumpy (builtin function) | 0.28s |
| CPython (pure Python) | 11s |
| PyPy with JIT (pure Python) | 0.91s |
As we can see, PyPy's JIT is slower than NumPy's optimized C version, but still much faster than CPython (about 12x).
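For reference, the shape of such a benchmark is simple; here is a minimal, illustrative harness (the array sizes and repeat counts are assumptions, not the setup behind the numbers above):

```python
import time

def bench(fn, a, b, repeats=10):
    # Average wall-clock time of repeated calls to an elementwise
    # minimum implementation (numpy.minimum or the pure-Python loop).
    start = time.time()
    for _ in range(repeats):
        fn(a, b)
    return (time.time() - start) / repeats

# Example usage with the pure-Python minimum sketched earlier:
# a = list(range(1000000)); b = list(reversed(a))
# print(bench(minimum, a, b))
```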
Why is it slower? When you actually look at the generated assembler, it's pretty obvious that it's atrocious. There's a lot of speedup to be gained just from simple optimizations on the resulting assembler. There are also pretty obvious limitations, like the x86 backend not being able to emit opcodes for floats, or the lack of an x86_64 backend. Those limitations are not fundamental in any sense and should be relatively straightforward to overcome. Therefore it seems we can get C-level speeds for pure Python implementations of numeric algorithms using NumPy arrays in PyPy. I think it's an interesting prospect: Python has the potential of becoming less of a glue language and more of a real implementation language in the scientific field.
Cheers,
fijal
I have the feeling you are confessing PyPy's secret goal ;-).
A really efficient Python for science: THAT would be a real milestone for dynamic languages, and would start their era...
Very, very interesting.
ReplyDeleteSomething I missed though was a real naive C implementation. You state it is about as fast as "PyPy's micronumpy", but it would have been nice to post the numbers. Of course, the problem is that the code would be different (C, instead of Python), but still...
What would it take to get this really started? Some of our group would happily help here, if there is a sort of a guideline (a TODO list?) that tells what must be done (i.e. as a friend put it, we would be codemonkeys).
The difference in pure-Python speed is what is most interesting for me, as however much NumPy you use, sometimes important parts of the software still can't be easily vectorized (or at all). If PyPy can let me run compiled NumPy (or Cython) code glued with lightning-fast Python, this leaves me with almost no performance problems. Add to that the convenience of vectorization as a means of writing short, readable code, and it's a winning combination.
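(To illustrate the hard-to-vectorize case, here is a sketch of a recurrence where each output depends on the previous one, so no single elementwise ufunc call can replace the loop:)

```python
def smooth(x, alpha=0.5):
    # Exponential smoothing: out[i] depends on out[i-1], so this loop
    # cannot be expressed as one elementwise numpy ufunc application.
    out = [0.0] * len(x)
    if out:
        out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out
```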
Saying that implementing efficient code generation for floating-point code on x86 in your JIT is going to be straightforward is disingenuous.
ReplyDeleteHere's a project using corepy, runtime assembler to create a faster numpy:
ReplyDeletehttp://numcorepy.blogspot.com/
There's also projects like pycuda, and pygpu which generate numpy code to run on GPUs.
It gets many times than standard numpy.
pygame uses SDL blitters, and its own blitters - which are specialized array operations for images... these are many times faster than numpy in general, since they are hand-optimized assembler or very efficiently optimized C.
Remember that hand-optimized assembler can be 10x faster than even C, and that not all C code is equal.
So it seems that the PyPy-generated code could be made even faster.
What about applying PyPy to C-like languages such as CUDA or OpenCL?
cu,
@ilume
I think you're completely missing the point. These experiments are performed using pure-Python code that happens to operate on numpy arrays. Assembler is generated when the interpreter runs this code, so the starting point isn't even hand-written C. Corenumpy, on the other hand, is trying to speed up the numpy operations themselves (which is also a nice goal, but a completely different one).
Cheers,
fijal
Hi Maciej! Would you mind blogging an update on PyPy / C interfaces and NumPy?
I am extensively using NumPy / SciPy / NLopt (apart from the stuff I import from there, my code is mostly pure Python algorithms, which the interpreter spends most of its time working on).
The latest improvements in the PyPy JIT really sound as if they could magically and dramatically speed up my stuff...
I don't mind trying PyPy out in production if it yields significant speedups (and debugging why it doesn't otherwise), but I need access to C stuff from within Python.
Stay tuned, I'll blog about it when I have more results. The progress has been slow so far, but it might accelerate.
Hi! Thanks, can't wait for it... :-)