Friday, November 28, 2014

September donations and thank you to the Python Software Foundation!

Hello everyone!

We would like to show you a short update on the PyPy funding. We gathered a total of $15,986 in the month of September and as per earlier agreement, the Python Software Foundation donated $10,000 to PyPy. We would like to thank everyone participating and the PSF in particular for supporting the PyPy project and making our work possible!

We've been working hard on the goals outlined in the funding proposals.

  • PyPy Python 3 support has been in beta for a while and it's already being used by many people, as seen per the number of reported bugs. We're currently supporting 3.2, planning on moving towards 3.4 in the future.
  • Software Transactional Memory has been a successful research project, with first real world results shown during the Warsaw sprint.
  • More detailed update on numpy will be published soon. A little spoiler is that we're planning on addressing matplotlib, scipy and the larger ecosystem to some extent. Stay tuned!

Again, thanks to everyone who donated and happy Thanksgiving to everyone on that side of the world!

Cheers,
fijal and the entire PyPy team


Monday, November 17, 2014

Tornado without a GIL on PyPy STM

This post is by Konstantin Lopuhin, who tried PyPy STM during the Warsaw sprint.

Python has a GIL, right? Not quite - PyPy STM is a python implementation without a GIL, so it can scale CPU-bound work to several cores. PyPy STM is developed by Armin Rigo and Remi Meier, and supported by community donations. You can read more about it in the docs.

Although PyPy STM is still a work in progress, in many cases it can already run CPU-bound code faster than regular PyPy, when using multiple cores. Here we will see how to slightly modify Tornado IO loop to use transaction module. This module is described in the docs and is really simple to use - please see an example there. An event loop of Tornado, or any other asynchronous web server, looks like this (with some simplifications):

while True:
    for callback in list(self._callbacks):
        self._run_callback(callback)
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        self._handle_event(fd, handler, events)

We get IO events, and run handlers for all of them, these handlers can also register new callbacks, which we run too. When using such a framework, it is very nice to have a guaranty that all handlers are run serially, so you do not have to put any locks. This is an ideal case for the transaction module - it gives us guaranties that things appear to be run serially, so in user code we do not need any locks. We just need to change the code above to something like:

while True:
    for callback in list(self._callbacks):
        transaction.add(                # added
            self._run_callback, callback)
    transaction.run()                   # added
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        transaction.add(                # added
            self._handle_event, fd, handler, events)
    transaction.run()                   # added

The actual commit is here, - we had to extract a little function to run the callback.

Part 1: a simple benchmark: primes

Now we need a simple benchmark, lets start with this - just calculate a list of primes up to the given number, and return it as JSON:

def is_prime(n):
    for i in xrange(2, n):
        if n % i == 0:
            return False
    return True

class MainHandler(tornado.web.RequestHandler):
    def get(self, num):
        num = int(num)
        primes = [n for n in xrange(2, num + 1) if is_prime(n)]
        self.write({'primes': primes})

We can benchmark it with siege:

siege -c 50 -t 20s http://localhost:8888/10000

But this does not scale. The CPU load is at 101-104 %, and we handle 30 % less request per second. The reason for the slowdown is STM overhead, which needs to keep track of all writes and reads in order to detect conflicts. And the reason for using only one core is, obviously, conflicts! Fortunately, we can see what this conflicts are, if we run code like this (here 4 is the number of cores to use):

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py to analyse this log. It lists the most expensive conflicts:

14.793s lost in aborts, 0.000s paused (1258x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
...

There are only three kinds of conflicts, they are described in stm source, Here we see that two threads call into external function to get current time, and we can not rollback any of them, so one of them must wait till the other transaction finishes. For now we can hack around this by disabling this timing - this is only needed for internal profiling in tornado.

If we do it, we get the following results (but see caveats below):

Impl. req/s
PyPy 2.4 14.4
CPython 2.7 3.2
PyPy-STM 1 9.3
PyPy-STM 2 16.4
PyPy-STM 3 20.4
PyPy STM 4 24.2
   

As we can see, in this benchmark PyPy STM using just two cores can beat regular PyPy! This is not linear scaling, there are still conflicts left, and this is a very simple example but still, it works!

But its not that simple yet :)

First, these are best-case numbers after long (much longer than for regular PyPy) warmup. Second, it can sometimes crash (although removing old pyc files fixes it). Third, benchmark meta-parameters are also tuned.

Here we get relatively good results only when there are a lot of concurrent clients - as a results, a lot of requests pile up, the server is not keeping with the load, and transaction module is busy with work running this piled up requests. If we decrease the number of concurrent clients, results get slightly worse. Another thing we can tune is how heavy is each request - again, if we ask primes up to a lower number, then less time is spent doing calculations, more time is spent in tornado, and results get much worse.

Besides the time.time() conflict described above, there are a lot of others. The bulk of time is lost in these two conflicts:

14.153s lost in aborts, 0.000s paused (270x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()

13.484s lost in aborts, 0.000s paused (130x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

The first one is presumably calling into some C function from stdlib, and we get the same conflict as for time.time() above, but is can be fixed on PyPy side, as we can be sure that computing sha1 is pure.

It is easy to hack around this one too, just removing etag support, but if we do it, performance is much worse, only slightly faster than regular PyPy, with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates in the transaction module, from this piece of code:

while True:
    self._do_it(self._grab_next_thing_to_do(tloc_pending),
                got_exception)
    counter[0] += 1

Comment by Armin: This is a conflict in the transaction module itself; ideally, it shouldn't have any, but in order to do that we might need a little bit of support from RPython or C code. So this is pending improvement.

Tornado modification used in this blog post is based on 3.2.dev2. As of now, the latest version is 4.0.2, and if we apply the same changes to this version, then we no longer get any scaling on this benchmark, and there are no conflicts that take any substantial time.

Comment by Armin: There are two possible reactions to a conflict. We can either abort one of the two threads, or (depending on the circumstances) just pause the current thread until the other one commits, after which the thread will likely be able to continue. The tool ``print_stm_log.py`` did not report conflicts that cause pauses. It has been fixed very recently. Chances are that on this test it would report long pauses and point to locations that cause them.

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses, it is interesting to see how it works on a more realistic application.

astar.py is a simple game where several players move on a map (represented as a list of lists of integers), build and destroy walls, and ask server to give them shortest paths between two points using A-star search, adopted from ActiveState recipie.

The benchmark bench_astar.py is simulating players, and tries to put the main load on A-star search, but also does some wall building and destruction. There are no locks around map modifications, as normal tornado is executing all callbacks serially, and we can keep this guaranty with atomic blocks of PyPy STM. This is also an example of a program that is not trivial to scale to multiple cores with separate processes (assuming more interesting shared state and logic).

This benchmark is very noisy due to randomness of client interactions (also it could be not linear), so just lower and upper bounds for number of requests are reported

Impl. req/s
PyPy 2.4 5 .. 7
CPython 2.7 0.5 .. 0.9
PyPy-STM 1 2 .. 4
PyPy STM 4 2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse and STM overhead is sometimes higher. The bulk of conflicts come from the transaction module (we have seen it above):

91.655s lost in aborts, 0.000s paused (249x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

Although it is definitely not ready for production use, you can already try to run things, report bugs, and see what is missing in user-facing tools and libraries.

Benchmarks setup:

Wednesday, November 5, 2014

PyPy IO improvements


Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.

The first branch was started by Wenzhu Man for her Google Summer of Code and finished by Maciej Fijałkowski and Armin Rigo. The PyPy GC works by allocating new objects in the young object area (the nursery), simply by incrementing a pointer. After each minor collection, the nursery has to be cleaned up. For simplicity, the GC used to do it by zeroing the whole nursery.

This approach has bad effects on the cache, since you zero a large piece of memory at once and do unnecessary work for things that don't require zeroing like large strings. We mitigated the first problem somewhat with incremental nursery zeroing, but this branch removes the zeroing completely, thus improving the string handling and recursive code (since jitframes don't requires zeroed memory either). I measured the effect on two examples: a recursive implementation of fibonacci and gcbench, to measure GC performance.

The results for fibonacci and gcbench are below (normalized to cpython 2.7). Benchmarks were run 50 times each (note that the big standard deviation comes mostly from the warmup at the beginning, true figures are smaller):

benchmark CPython PyPy 2.4 PyPy non-zero
fibonacci 4.8+-0.15 (1.0x) 0.59+-0.07 (8.1x) 0.45+-0.07 (10.6x)
gcbench 22+-0.36 (1.0x) 1.34+-0.28 (16.4x) 1.02+-0.15 (21.6x)

The second branch was done by Gregor Wegberg for his master thesis and finished by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from time to time moves the objects in memory, meaning that their address can change. Therefore, if you want to pass pointers to some external C function (for example, write(2) or read(2)), you need to ensure that the objects they are pointing to will not be moved by the GC (e.g. when running a different thread). PyPy up to 2.4 solves the problem by copying the data into or from a non-movable buffer, which is obviously inefficient. The branch introduce the concept of "pinning", which allows us to inform the GC that it is not allowed to move a certain object for a short period of time. This introduces a bit of extra complexity in the garbage collector, but improves the I/O performance quite drastically, because we no longer need the extra copy to and from the non-movable buffers.

In this benchmark, which does I/O in a loop, we either write a number of bytes from a freshly allocated string into /dev/null or read a number of bytes from /dev/full. I'm showing the results for PyPy 2.4, PyPy with non-zero-nursery and PyPy with non-zero-nursery and object pinning. Those are wall times for cases using os.read/os.write and file.read/file.write, normalized against CPython 2.7.

Benchmarks were done using PyPy 2.4 and revisions 85646d1d07fb for non-zero-nursery and 3d8fe96dc4d9 for non-zero-nursery and pinning. The benchmarks were run once, since the standard deviation was small.

The Y axis is speed, normalized to CPython, the more the better

What we can see is that os.read and os.write both improved greatly and outperforms CPython now for each combination. file operations are a little more tricky, and while those branches improved the situation a bit, the improvement is not as drastic as in os versions. It really should not be the case and it showcases how our file buffering is inferior to CPython. We plan on removing our own buffering and using FILE* in C in the near future, so we should outperform CPython on those too (since our allocations are cheaper). If you look carefully in the benchmark, the write function is copied three times. This hack is intended to avoid JIT overspecializing the assembler code, which happens because the buffering code was written way before the JIT was done. In fact, our buffering is hilariously bad, but if stars align correctly it can be JIT-compiled to something that's not half bad. Try removing the hack and seeing how the performance of the last benchmark drops :-) Again, this hack should be absolutely unnecessary once we remove our own buffering, stay tuned for more.

Cheers,
fijal