PyPy Status Blog: Numpy in PyPy - status and roadmap

Wednesday, May 4, 2011

Numpy in PyPy - status and roadmap

Hello.

NumPy integration is one of the single most requested features for PyPy. This post tries to describe where we are, what we plan (or what we don't plan), and how you can help.

Short version for the impatient: we are doing experiments, which show that PyPy+numpy can be faster and better than CPython+numpy. We have a plan on how to move forward, but at the moment there is lack of dedicated people or money to tackle it.

The slightly longer version

Integrating numpy in PyPy has been my pet project on an on-and-off (mostly off) basis over the past two years. There were some experiments, then a long pause, and then some more experiments which are documented below.

The general idea is not to use the existing CPython module, but to reimplement numpy in RPython (i.e. the language PyPy is implemented in), thus letting our JIT achieve extra speedups. The really cool thing about this part is that numpy will automatically benefit of any general JIT improvements, without any need of extra tweaking.

At the moment, there is branch called numpy-exp which contains a translatable version of a very minimal version of numpy in the module called micronumpy. Example benchmarks show the following:

	add	iterate
CPython 2.6.5 with numpy 1.3.0	0.260s (1x)	4.2 (1x)
PyPy numpy-exp @ 3a9d77b789e1	0.120s (2.2x)	0.087 (48x)

The add benchmark spends most of the time inside the + operator on arrays (doing a + a + a + a + a), , which in CPython is implemented in C. As you can see from the table above, the PyPy version is already ~2 times faster. (Although numexpr is still faster than PyPy, but we're working on it).

The exact way array addition is implemented is worth another blog post, but in short it lazily evaluates the expression and computes it at the end, avoiding intermediate results. This approach scales much better than numexpr and can lead to speeding up all the operations that you can perform on matrices.

The next obvious step to get even more speedups would be to extend the JIT to use SSE operations on x86 CPUs, which should speed it up by about additional 2x, as well as using multiple threads to do operations.

iterate is also interesting, but for entirely different reasons. On CPython it spends most of the time inside a Python loop; the PyPy version is ~48 times faster, because the JIT can optimize across the python/numpy boundary, showing the potential of this approach, users are not grossly penalized for writing their loops in Python.

The drawback of this approach is that we need to reimplement numpy in RPython, which takes time. A very rough estimate is that it would be possible to implement an useful subset of it (for some definition of useful) in a period of time comprised between one and three man-months.

It also seems that the result will be faster for most cases and the same speed as original numpy for other cases. The only problem is finding the dedicated persons willing to spend quite some time on this and however, I am willing to both mentor such a person and encourage him or her.

The good starting point for helping would be to look at what's already implemented in micronumpy modules and try extending it. Adding a - operator or adding integers would be an interesting start. Drop by on #pypy on irc.freenode.net or get in contact with developers via some other channel (such as the pypy-dev mailing list) if you want to help.

Another option would be to sponsor NumPy development. In case you're interested, please get in touch with us or leave your email in comments.

Cheers,
fijal