PyPy Status Blog: PyPy IO improvements

Wednesday, November 5, 2014

PyPy IO improvements

Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.

The first branch was started by Wenzhu Man for her Google Summer of Code and finished by Maciej Fijałkowski and Armin Rigo. The PyPy GC works by allocating new objects in the young object area (the nursery), simply by incrementing a pointer. After each minor collection, the nursery has to be cleaned up. For simplicity, the GC used to do it by zeroing the whole nursery.
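To make the allocation scheme concrete, here is a toy model of a nursery in plain Python. It is only a sketch: the class `Nursery` and its method names are invented for illustration and are not the names used in the PyPy GC, which is not written against a `bytearray`. It shows allocation by bumping a pointer, and the two ways of resetting the nursery after a minor collection.

```python
class Nursery:
    """Toy model of a young-object area backed by a bytearray (hypothetical)."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.free = 0  # offset of the next free byte

    def allocate(self, nbytes):
        """Bump-pointer allocation: return the start offset, or None if full."""
        if self.free + nbytes > len(self.memory):
            return None  # a real GC would run a minor collection here
        start = self.free
        self.free += nbytes
        return start

    def reset_with_zeroing(self):
        """Clear the whole nursery, then rewind the pointer."""
        self.memory[:] = bytes(len(self.memory))
        self.free = 0

    def reset_without_zeroing(self):
        """Only rewind the pointer; stale bytes stay where they are."""
        self.free = 0


nursery = Nursery(64)
first = nursery.allocate(16)   # offset 0
second = nursery.allocate(16)  # offset 16
```

In this model, skipping the zeroing means that whoever allocates an object is responsible for initializing the fields that must start out cleared, instead of relying on the memory already being zero.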

This approach has bad effects on the cache, since you zero a large piece of memory at once and do unnecessary work for things that don't require zeroing, like large strings. We mitigated the first problem somewhat with incremental nursery zeroing, but this branch removes the zeroing completely, thus improving string handling and recursive code (since jitframes don't require zeroed memory either). I measured the effect on two examples: a recursive implementation of fibonacci, and gcbench, to measure GC performance.
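For readers who want to try something similar, a recursive fibonacci micro-benchmark could look roughly like the sketch below. This is not the exact script used for the measurements; the helper `time_fib` and the choice of argument are assumptions made for illustration.

```python
import time


def fib(n):
    """Naive recursive fibonacci; deliberately call-heavy."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


def time_fib(n):
    """Return (result, elapsed seconds) for a single call (hypothetical helper)."""
    start = time.perf_counter()
    result = fib(n)
    return result, time.perf_counter() - start
```

Calling `time_fib` with a larger argument, several times in a row, gives a rough idea of how much of the run is warmup.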

The results for fibonacci and gcbench are below (normalized to CPython 2.7). Benchmarks were run 50 times each (note that the big standard deviation comes mostly from the warmup at the beginning; the true figures are smaller):

benchmark	CPython	PyPy 2.4	PyPy non-zero
fibonacci	4.8+-0.15 (1.0x)	0.59+-0.07 (8.1x)	0.45+-0.07 (10.6x)
gcbench	22+-0.36 (1.0x)	1.34+-0.28 (16.4x)	1.02+-0.15 (21.6x)

The second branch was done by Gregor Wegberg for his master's thesis and finished by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from time to time moves objects in memory, meaning that their addresses can change. Therefore, if you want to pass pointers to some external C function (for example, write(2) or read(2)), you need to ensure that the objects they point to will not be moved by the GC (e.g. when running a different thread). PyPy up to 2.4 solves the problem by copying the data into or from a non-movable buffer, which is obviously inefficient. The branch introduces the concept of "pinning", which allows us to inform the GC that it is not allowed to move a certain object for a short period of time. This introduces a bit of extra complexity in the garbage collector, but improves the I/O performance quite drastically, because we no longer need the extra copy to and from the non-movable buffers.
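The idea of pinning can be illustrated with a toy moving heap in plain Python. This is a sketch under loud assumptions: `ToyMovingHeap`, `pin`, `unpin` and `collect` are hypothetical names, addresses are plain integers, and nothing here reflects the actual interface or implementation of the PyPy GC. It only shows that a collection relocates unpinned objects while pinned ones keep their address.

```python
class ToyMovingHeap:
    """Toy model of a moving collector; addresses are plain integers."""

    def __init__(self):
        self.objects = {}      # address -> payload
        self.pinned = set()    # addresses the collector must not move
        self.next_address = 0

    def allocate(self, payload):
        address = self.next_address
        self.next_address += 1
        self.objects[address] = payload
        return address

    def pin(self, address):
        self.pinned.add(address)

    def unpin(self, address):
        self.pinned.discard(address)

    def collect(self):
        """Move every unpinned object; return a map of old to new addresses."""
        moved = {}
        for address in list(self.objects):  # snapshot, since we mutate below
            if address in self.pinned:
                continue  # pinned objects keep their address
            new_address = self.next_address
            self.next_address += 1
            self.objects[new_address] = self.objects.pop(address)
            moved[address] = new_address
        return moved


heap = ToyMovingHeap()
buf = heap.allocate(b"data handed to C")
other = heap.allocate(b"ordinary object")
heap.pin(buf)            # safe to pass the address of buf to external code
moved = heap.collect()   # only the unpinned object is relocated
heap.unpin(buf)          # after the external call returns
```

In the model, the alternative to pinning would be to copy the payload of `buf` into storage that `collect` never touches before every external call, which is the extra copy the branch avoids.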

In this benchmark, which does I/O in a loop, we either write a number of bytes from a freshly allocated string into /dev/null or read a number of bytes from /dev/full. I'm showing the results for PyPy 2.4, PyPy with non-zero-nursery and PyPy with non-zero-nursery and object pinning. Those are wall times for cases using os.read/os.write and file.read/file.write, normalized against CPython 2.7.
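The write half of such a loop can be sketched as follows. This is an approximation rather than the benchmark script itself: the function name `bench_os_write` and its parameters are hypothetical, it uses `os.devnull` instead of a hard-coded /dev/null, and it is written with bytes so that it also runs on Python 3.

```python
import os
import time


def bench_os_write(num_writes, num_bytes, path=os.devnull):
    """Return the average seconds per os.write call of a fresh buffer."""
    fd = os.open(path, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for _ in range(num_writes):
            # A new buffer is allocated on every iteration on purpose,
            # so each write sees a freshly allocated object.
            os.write(fd, b" " * num_bytes)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / num_writes
```

Running it for a few buffer sizes, for example 100, 1000 and 10000 bytes, gives per-write timings that can be compared across interpreters.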

Benchmarks were done using PyPy 2.4 and revisions 85646d1d07fb for non-zero-nursery and 3d8fe96dc4d9 for non-zero-nursery and pinning. The benchmarks were run once, since the standard deviation was small.

The Y axis is speed normalized to CPython; higher is better.

What we can see is that os.read and os.write both improved greatly and now outperform CPython for each combination. File operations are a little more tricky, and while those branches improved the situation a bit, the improvement is not as drastic as in the os versions. It really should not be the case, and it showcases how our file buffering is inferior to CPython's. We plan on removing our own buffering and using FILE* in C in the near future, so we should outperform CPython on those too (since our allocations are cheaper).

If you look carefully at the benchmark, the write function is copied three times. This hack is intended to keep the JIT from overspecializing the assembler code, which happens because the buffering code was written way before the JIT was done. In fact, our buffering is hilariously bad, but if the stars align correctly it can be JIT-compiled to something that's not half bad. Try removing the hack and seeing how the performance of the last benchmark drops :-)

Again, this hack should be absolutely unnecessary once we remove our own buffering. Stay tuned for more.

Cheers,
fijal

Posted by Maciej Fijalkowski at 16:14

3 comments:

Yichao Yu said...: Sounds great!!!

Just wondering, will the pin-memory also improve the situation when passing strings/other buffers to C functions (e.g. via cffi)?; November 5, 2014 at 7:32 PM
Anonymous said...: Hey,

In your benchmark, the following loop:
for i in range(num):
    os.write(fd, " " * num2)

Is not hoisted out by CPython (whereas I guess PyPy does hoist it).
Which means that the buffer written is basically allocated/freed upon each loop.

If you want to measure pure I/O performance (so let's say a zero-copy setting), it should be hoisted manually out of the loop for CPython, like this:

payload = b" " * num2
for i in range(num):
    os.write(fd, payload)

Then, the results go from:

fwrite 100 bytes, 1.93us per write
fwrite 1000 bytes, 2.57us per write
fwrite 10000 bytes, 6.73us per write
file_write 100 bytes, 0.99us per write
file_write 1000 bytes, 1.68us per write
file_write 10000 bytes, 4.71us per write

to

fwrite 100 bytes, 1.38us per write
fwrite 1000 bytes, 1.48us per write
fwrite 10000 bytes, 1.38us per write
file_write 100 bytes, 0.65us per write
file_write 1000 bytes, 0.96us per write
file_write 10000 bytes, 2.32us per write

Also, might be worth trying with binary mode.

Anyway, keep up the great work!; November 5, 2014 at 10:54 PM
Maciej Fijalkowski said...: PyPy does not hoist the buffer allocation here. The benchmark specifically allocates and frees the buffer on every iteration, since we want the object written to be fresh (otherwise pinning is not needed), and we also think that writing a new object (as opposed to a constant buffer) is really the more common case. Yes, the allocation overhead gets measured too, but the point is that we wanted to measure the IO of fresh objects, not old ones; November 6, 2014 at 7:10 AM