Hello.

We're excited to let you know about some of the great progress we've made on NumPyPy: both completeness and performance. In this blog entry we mostly will talk about performance and how much progress we have made so far.

**Word of warning:** this
work is in progress -- we're maybe half way to where we want to be and there are
many trivial and not so trivial optimizations to be written. (For example, we
haven't even started to implement important optimizations, like vectorization.)

# Benchmark

We chose a laplace equation solver, based on SciPy's PerformancePython wiki. Unfortunately, the different implementations on the wiki page accidentally use two different algorithms, which have different convergences, and very different performance characteristics on modern computers. As a result, we implemented our own versions in both C and Python (with and without NumPy). The full source can be found in fijal's hack repo, all these benchmarks were performed at revision 18502dbbcdb3.

First, let me describe various algorithms used. Note that some of them contain PyPy-specific hacks to work around limitations in the current implementation. These hacks will go away eventually and the performance will improve. Numerically the algorithms used are identical, however exact data layout in memory differs between them.

**A note about all the benchmarks:** they each were run once, but the
performance is very stable across runs.

Starting with the C version, it implements a trivial laplace transform
using two loops and double-reference memory (array of `int*`). The double
reference does not matter for performance and the two algorithms are
implemented in `inline-laplace.c` and `laplace.c`. They were both compiled
with `gcc 4.4.5` at `-O3`. The inline version modifies array in-place while the non-inline version stores results in a copy. That makes them converge at different rate, hence different number of iterations

A straightforward version of those in Python is implemented in `laplace.py`
using, respectively, `inline_slow_time_step` and `slow_time_step`.
`slow_2_time_step` does the same thing, except it copies arrays in-place
instead of creating new copies. Table below compares running PyPy against C:

bench | number of iterations | time per iteration |

laplace C | 219 | 6.3ms |

inline-laplace C | 278 | 20ms |

slow python | 219 | 17ms |

slow 2 python | 219 | 14ms |

inline_slow python | 278 | 23.7ms |

An important thing to notice is the data dependency of the inline
version causes a huge slowdown for the C versions. This is not a severe
disadvantage for us though -- the brain-dead Python version takes longer
and PyPy is not able to take advantage of the knowledge that the data is
independent. The results are in the same ballpark as the C versions --
**15% - 170%** slower, but the algorithm
one chooses matters more than the language. By comparison, the slow versions
take about **5.75s** each on CPython 2.6 per iteration and, by estimation,
are about **200x** slower than the PyPy equivalent, if I had the patience to
measure the full run.

The next step is to use NumPy expressions. The first problem we run into is
that computing the error requires walking the entire array a second time. This
is fairly inefficient in terms of cache access, so I took the liberty of
computing the errors every 15 steps. This results in the convergence being
rounded to the nearest 15 iterations, but speeds things up considerably.
`numeric_time_step` takes the most braindead approach of replacing the array
with itself, like this:

u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 + (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv

We need 3 arrays here -- one is an intermediate (PyPy only needs one, for all of
those subexpressions), one is a copy for computing the error, and one is the
result. This works automatically because in NumPy `+` or `*` creates an
intermediate, while NumPyPy avoids allocating the intermediate if possible.

`numeric_2_time_step` works in pretty much the same way:

src = self.u self.u = src.copy() self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 + (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

except the copy is now explicit rather than implicit.

`numeric_3_time_step` does the same thing, but notice one doesn't have to copy
the entire array, it's enough to copy the border pieces and fill rest with
zeros:

src = self.u self.u = numpy.zeros((self.nx, self.ny), 'd') self.u[0] = src[0] self.u[-1] = src[-1] self.u[:, 0] = src[:, 0] self.u[:, -1] = src[:, -1] self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 + (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

`numeric_4_time_step` is the one that tries hardest to resemble the C version.
Instead of doing an array copy, it actually notices that one can alternate
between two arrays. This is exactly what the C version does. The
`remove_invalidates` call is a PyPy specific hack - we hope to remove this
call in the near future, but, in short, it promises "I don't have any unbuilt
intermediates that depend on the value of the argument", which means one doesn't
have to compute sub-expressions one is not actually using:

remove_invalidates(self.old_u) remove_invalidates(self.u) self.old_u[:,:] = self.u src = self.old_u self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 + (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

This one is the most comparable to the C version.

`numeric_5_time_step` does the same thing, but notices one doesn't have to copy
the entire array, it's enough to just copy the edges. This is an optimization
that was not done in the C version:

remove_invalidates(self.old_u) remove_invalidates(self.u) src = self.u self.old_u, self.u = self.u, self.old_u self.u[0] = src[0] self.u[-1] = src[-1] self.u[:, 0] = src[:, 0] self.u[:, -1] = src[:, -1] self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 + (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

Let's look at the table of runs. As before, `gcc 4.4.5`, compiled at `-O3`,
and PyPy nightly 7bb8b38d8563, on an x86-64 machine. All of the numeric methods
run for 226 steps, slightly more than the 219, rounding to the next 15 when the
error is computed.

benchmark | PyPy | CPython |

numeric | 21ms | 35ms |

numeric 2 | 14ms | 37ms |

numeric 3 | 13ms | 29ms |

numeric 4 | 11ms | 31ms |

numeric 5 | 9.3ms | 21ms |

We think that these preliminary results are pretty good. They're not as fast as the C version (or as fast as we'd like them to be), but we're already much faster than NumPy on CPython -- almost always by more than 2x on this relatively real-world example. This is not the end, though. In fact, it's hardly the beginning! As we continue work, we hope to make even more use of the high level information that we have. Looking at the assembler generated by gcc for this example, it's pretty clear we can outperform it thanks to better aliasing information and hence better possibilities for vectorization. Stay tuned.

EDIT: fixed the benchmark name

EDIT2: added info that first table is about PyPy

Cheers, fijal

## 17 comments:

Nice to hear, but what we (numpy users) really need is 2-dimensional matrices with basic arithmetic operations (+, -, /, *, sin, cos, pow etc) and other related methods, e.g. min(array,axis), nanmax(array, axis), argmax(array,axis), nanargmin(array, axis) etc. While CPython soft dependent on these operations works more or less fast, with PyPy it mere doesn't work at all. I hope first of all you'll focus on it instead of speed improvement for single-dimensional arrays.

Regards, D.

It would be really cool if you try before complaining. I think all of it works on a nightly build, except the axis argument which is on a branch being worked on.

Also, IIRC NumPyPy still misses linalg.solve method for solving systems of linear equations, that is highly important for lots of soft. Connecting sparse SLE solver (like umfpack or superlu from scipy.sparse) also would be very essential.

We're working on it. Stay tuned

Maciej, anything about 2-dimensional matrix implementations with related operations haven't been mentioned in blog, so why I have to know about it? I only installed and tried stable PyPy 1.7, because I had tried building PyPy from sources and found it damned hard, especially for my limited hardware (2 GB RAM).

Good point, we'll write a blog post what has been implemented as well. Try nightly

A Laplace transform is something quite different to solving Laplace's equation with finite differences...

fixed, thanks

It may be nice to link to the nightly builds so that people can try this out :)

This is excellent! Great work, the potential of this project is very exciting. I was quietly wishing for this since pypy first started.

I use NumPy all the time, and any increase in performance makes a big difference. This is one of the main advantages of NumPyPy over NumPy, so it makes sense to focus on it.

There seems to be lots of complaining about missing features and such, but having a solid foundation to work from seems to be the most important thing. Missing features can be added down the line.

I remember reading a blog post last year about using transactional memory as a way of removing the GIL. If you could combine that with NumPyPy to run numerical tasks in parallel, that would make a lot of scientific programmers very happy. I don't know if this is feasible, but it sure would be nice.

Keep up the good work.

Hi Chris.

We have vague plans how to parallelize numpy expressions without even having to remove the GIL. That way you'll have workers that are able to perform (or help perform) numeric tasks, but the interpreter itself will still run in a single thread. The same goes for GPUs and MIC.

Nightly builds

http://buildbot.pypy.org/nightly/trunk

Please when you consider parallelizing things, do remember about leaving an explicit switch to turn it off!

I run my Python stuff on clusters through a queuing system and it will be VERY unhappy if single processes use more than one thread without informing the scheduler.

Hey, by the way, your progress on NumPy is amazing and highly appreciated.

@Anonymous of course, this is a given that we'll leave the switch to turn it off. It might be not even on by default, that's up for discussion

Chris, if you haven't considered this already, it's sometimes possible to achieve parallelism with multiple processes using memory mapped files as numpy arrays. It's a bit awkward, but it can also make for an easier path to a computation that is resumable or can be run on a cluster.

GIL removal would be wonderful, but it's a pretty ambitious idea. Then again, these pypy folk seem able to deliver on some pretty amazing stuff.

I am closely following these developments with numpypy and I just succesfully tested the last nightly build, which I find very impressive!

For research purposes, the main thing we need is scipy.stats.ttest_1samp to work on pypy. Is there an estimation on when scipypy will be available?

Post a Comment