Tuesday, January 10, 2012

NumPyPy progress report - running benchmarks

Hello.

We're excited to let you know about some of the great progress we've made on NumPyPy, in both completeness and performance. In this blog entry we will mostly talk about performance and how much progress we have made so far.

Word of warning: this work is in progress -- we're maybe halfway to where we want to be, and there are many trivial and not-so-trivial optimizations still to be written. (For example, we haven't even started to implement important optimizations, like vectorization.)

Benchmark

We chose a Laplace equation solver, based on SciPy's PerformancePython wiki. Unfortunately, the different implementations on the wiki page accidentally use two different algorithms, which have different convergences and very different performance characteristics on modern computers. As a result, we implemented our own versions in both C and Python (with and without NumPy). The full source can be found in fijal's hack repo; all these benchmarks were performed at revision 18502dbbcdb3.

First, let me describe the various algorithms used. Note that some of them contain PyPy-specific hacks to work around limitations in the current implementation. These hacks will go away eventually and the performance will improve. Numerically, the algorithms used are identical; however, the exact data layout in memory differs between them.

A note about all the benchmarks: each was run only once, but the performance is very stable across runs.

Starting with the C version: it implements a trivial Laplace equation solver using two loops and double-reference memory (an array of int*). The double reference does not matter for performance, and the two algorithms are implemented in inline-laplace.c and laplace.c. They were both compiled with gcc 4.4.5 at -O3. The inline version modifies the array in place, while the non-inline version stores the results in a copy. That makes them converge at different rates, hence the different numbers of iterations.

A straightforward Python version of those is implemented in laplace.py as, respectively, inline_slow_time_step and slow_time_step. slow_2_time_step does the same thing, except it copies arrays in place instead of creating new copies. A rough sketch of the inner loop is shown below; the table that follows compares running PyPy against C.
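For readers who haven't looked at the PerformancePython page, here is a minimal sketch of what a slow_time_step-style pure-Python kernel does. It is not the exact code from fijal's repo (the real benchmarks differ in exact data layout, as noted above); the function name is made up, and the parameters dx2, dy2 and dnr_inv follow the conventions used in the NumPy snippets later in this post:

def slow_time_step_sketch(u, dx2, dy2, dnr_inv):
    # Compute the new grid into a fresh copy, so every interior cell becomes
    # a weighted average of its four neighbours from the previous step.
    new_u = [row[:] for row in u]
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            new_u[i][j] = ((u[i - 1][j] + u[i + 1][j]) * dy2 +
                           (u[i][j - 1] + u[i][j + 1]) * dx2) * dnr_inv
    return new_u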

benchmark            number of iterations   time per iteration
laplace C            219                    6.3ms
inline-laplace C     278                    20ms
slow python          219                    17ms
slow 2 python        219                    14ms
inline_slow python   278                    23.7ms

An important thing to notice is that the data dependency of the inline version causes a huge slowdown for the C versions. This is not a severe disadvantage for us though -- the brain-dead Python version takes longer anyway, and PyPy is not able to take advantage of the knowledge that the data is independent. The results are in the same ballpark as the C versions -- 15% to 170% slower -- but the algorithm one chooses matters more than the language. By comparison, the slow versions take about 5.75s per iteration on CPython 2.6 and are, by estimation, about 200x slower than the PyPy equivalent (I did not have the patience to measure a full run).

The next step is to use NumPy expressions. The first problem we ran into is that computing the error requires walking the entire array a second time. This is fairly inefficient in terms of cache access, so I took the liberty of computing the error only every 15 steps. This results in the convergence being rounded up to the nearest multiple of 15 iterations, but it speeds things up considerably. numeric_time_step takes the most brain-dead approach of replacing the array with itself, like this:

u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 +
                 (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv

We need 3 arrays here: one is an intermediate (PyPy needs only one intermediate for all of those subexpressions), one is a copy for computing the error, and one is the result. This works automatically, because in NumPy + or * creates an intermediate array, while NumPyPy avoids allocating the intermediate if possible.
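To make the every-15-steps error check described above concrete, here is a rough, hypothetical driver loop. It is not the code from fijal's repo: the name solve_sketch, the grid object with a u array and an in-place time_step() method, and the exact error formula and tolerance are all assumptions for illustration:

import numpy

def solve_sketch(grid, n_check=15, eps=1.0e-8):
    # Run n_check time steps between (expensive) error checks, so
    # convergence gets rounded up to a multiple of n_check iterations.
    iterations = 0
    while True:
        previous = grid.u.copy()      # snapshot used only for the error check
        for _ in range(n_check):
            grid.time_step()          # updates grid.u in place
            iterations += 1
        err = numpy.sqrt(numpy.sum((grid.u - previous) ** 2))
        if err < eps:
            return iterations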

numeric_2_time_step works in pretty much the same way:

src = self.u
self.u = src.copy()
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

except the copy is now explicit rather than implicit.

numeric_3_time_step does the same thing, but notice that one doesn't have to copy the entire array: it's enough to copy the border pieces and fill the rest with zeros:

src = self.u
self.u = numpy.zeros((self.nx, self.ny), 'd')
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

numeric_4_time_step is the one that tries hardest to resemble the C version. Instead of doing an array copy, it notices that one can simply alternate between two arrays, which is exactly what the C version does. The remove_invalidates call is a PyPy-specific hack that we hope to remove in the near future; in short, it promises "I don't have any unbuilt intermediates that depend on the value of the argument", which means one doesn't have to compute sub-expressions one is not actually using:

remove_invalidates(self.old_u)
remove_invalidates(self.u)
self.old_u[:,:] = self.u
src = self.old_u
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

This one is the most comparable to the C version.

numeric_5_time_step does the same thing, but notices that one doesn't have to copy the entire array: it's enough to just copy the edges. This is an optimization that was not done in the C version:

remove_invalidates(self.old_u)
remove_invalidates(self.u)
src = self.u
self.old_u, self.u = self.u, self.old_u
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

Let's look at the table of runs. As before: gcc 4.4.5 at -O3, PyPy nightly 7bb8b38d8563, on an x86-64 machine. All of the numeric methods run for 226 steps, slightly more than the 219 above, because convergence is rounded up to the next point at which the error is computed (every 15 steps).

benchmark   PyPy    CPython
numeric     21ms    35ms
numeric 2   14ms    37ms
numeric 3   13ms    29ms
numeric 4   11ms    31ms
numeric 5   9.3ms   21ms

We think these preliminary results are pretty good. They're not as fast as the C version (or as fast as we'd like them to be), but we're already much faster than NumPy on CPython -- almost always by more than 2x on this relatively real-world example. This is not the end, though; in fact, it's hardly the beginning! As we continue working, we hope to make even more use of the high-level information that we have. Looking at the assembler generated by gcc for this example, it's pretty clear we can outperform it, thanks to better aliasing information and hence better possibilities for vectorization. Stay tuned.

EDIT: fixed the benchmark name

EDIT2: added info that first table is about PyPy

Cheers, fijal

17 comments:

D said...

Nice to hear, but what we (NumPy users) really need is 2-dimensional matrices with basic arithmetic operations (+, -, /, *, sin, cos, pow etc.) and other related methods, e.g. min(array, axis), nanmax(array, axis), argmax(array, axis), nanargmin(array, axis) etc. While CPython software that depends on these operations works more or less fast, with PyPy it simply doesn't work at all. I hope that first of all you'll focus on this instead of speed improvements for single-dimensional arrays.
Regards, D.

Maciej Fijalkowski said...

It would be really cool if you tried before complaining. I think all of it works on a nightly build, except the axis argument, which is on a branch being worked on.

D said...

Also, IIRC NumPyPy still misses the linalg.solve method for solving systems of linear equations, which is highly important for lots of software. Connecting a sparse SLE solver (like umfpack or superlu from scipy.sparse) would also be very valuable.

Maciej Fijalkowski said...

We're working on it. Stay tuned

D said...

Maciej, nothing about 2-dimensional matrix implementations with related operations has been mentioned on the blog, so how was I supposed to know about it? I only installed and tried the stable PyPy 1.7, because I had tried building PyPy from source and found it damned hard, especially on my limited hardware (2 GB RAM).

Maciej Fijalkowski said...

Good point, we'll write a blog post about what has been implemented as well. Try a nightly.

Adam said...

A Laplace transform is something quite different to solving Laplace's equation with finite differences...

Maciej Fijalkowski said...

fixed, thanks

Anonymous said...

It may be nice to link to the nightly builds so that people can try this out :)

Chris LeBlanc said...

This is excellent! Great work; the potential of this project is very exciting. I've been quietly wishing for this since PyPy first started.

I use NumPy all the time, and any increase in performance makes a big difference. This is one of the main advantages of NumPyPy over NumPy, so it makes sense to focus on it.

There seems to be lots of complaining about missing features and such, but having a solid foundation to work from seems to be the most important thing. Missing features can be added down the line.

I remember reading a blog post last year about using transactional memory as a way of removing the GIL. If you could combine that with NumPyPy to run numerical tasks in parallel, that would make a lot of scientific programmers very happy. I don't know if this is feasible, but it sure would be nice.

Keep up the good work.

Maciej Fijalkowski said...

Hi Chris.

We have vague plans how to parallelize numpy expressions without even having to remove the GIL. That way you'll have workers that are able to perform (or help perform) numeric tasks, but the interpreter itself will still run in a single thread. The same goes for GPUs and MIC.

Anonymous said...

Nightly builds
http://buildbot.pypy.org/nightly/trunk

Anonymous said...

Please when you consider parallelizing things, do remember about leaving an explicit switch to turn it off!

I run my Python stuff on clusters through a queuing system and it will be VERY unhappy if single processes use more than one thread without informing the scheduler.

Anonymous said...

Hey, by the way, your progress on NumPy is amazing and highly appreciated.

Maciej Fijalkowski said...

@Anonymous of course, it's a given that we'll leave a switch to turn it off. It might not even be on by default; that's up for discussion.

Paul Harrison said...

Chris, if you haven't considered this already, it's sometimes possible to achieve parallelism with multiple processes using memory mapped files as numpy arrays. It's a bit awkward, but it can also make for an easier path to a computation that is resumable or can be run on a cluster.

GIL removal would be wonderful, but it's a pretty ambitious idea. Then again, these pypy folk seem able to deliver on some pretty amazing stuff.
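(To illustrate the approach Paul describes: below is a rough sketch of sharing a NumPy array between processes via a memory-mapped file. It is only an illustration; the file name, shape and per-chunk "work" are made up, and whether NumPyPy supported numpy.memmap at the time is a separate question.)

import numpy
from multiprocessing import Process

SHAPE = (1000, 1000)

def make_shared(path):
    # Create a file-backed array that several processes can map.
    arr = numpy.memmap(path, dtype='float64', mode='w+', shape=SHAPE)
    arr[:] = 0.0
    arr.flush()

def worker(path, start, stop):
    # Each process maps the same file and touches only its own rows.
    arr = numpy.memmap(path, dtype='float64', mode='r+', shape=SHAPE)
    arr[start:stop] += 1.0          # stand-in for real numerical work
    arr.flush()

if __name__ == '__main__':
    make_shared('grid.dat')
    half = SHAPE[0] // 2
    procs = [Process(target=worker, args=('grid.dat', 0, half)),
             Process(target=worker, args=('grid.dat', half, SHAPE[0]))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()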

Peter S said...

I am closely following these developments with NumPyPy and I just successfully tested the last nightly build, which I find very impressive!

For research purposes, the main thing we need is scipy.stats.ttest_1samp to work on PyPy. Is there an estimate of when scipypy will be available?