Saturday, July 5, 2014

PyPy-STM: first "interesting" release

Hi all,

PyPy-STM is now reaching a point where we can say it's good enough to be a GIL-less Python. (We don't guarantee there are no more bugs, so please report them :-) This is the first official STM release.

This corresponds roughly to PyPy 2.3 (not 2.3.1). It requires 64-bit Linux. More precisely, this release is built for Ubuntu 12.04 to 14.04; you can also rebuild it from source by getting the branch stmgc-c7. You need clang to compile, and you need a patched version of llvm.

This version's performance can reasonably be compared with a regular PyPy, where both include the JIT. Thanks for following the meandering progress of PyPy-STM over the past three years --- we're finally getting somewhere really interesting! We cannot thank all the contributors to the previous PyPy-STM money pot enough; they made this possible. And although this blog post focuses on the results from that period, I should of course remind you that we're running a second call for donations for future work, which I will briefly mention again later.

A recap of what we did to get there: around the start of the year we found a new model, a "redo-log"-based STM which uses a couple of hardware tricks to not require chasing pointers, giving it (in this context) exceptionally cheap read barriers. This idea was developed over the following months and (relatively) easily integrated with the JIT compiler. The most recent improvements on the Garbage Collection side are closing the gap with a regular PyPy (there is still a bit more to do there). There is some preliminary user documentation.

Today, the result of this is a PyPy-STM that is capable of running pure Python code on multiple threads in parallel, as the benchmarks below will show. A quick warning: this is only about pure Python code. So far we have not tried to optimize the case where most of the time is spent in external libraries, or even manipulating "raw" memory like array.array or numpy arrays. To some extent there is no point, because CPython's approach works well for this case, i.e. releasing the GIL around the long-running operations in C. Of course it would be nice if such cases worked as well in PyPy-STM --- which they do to some extent; but checking and optimizing that is future work.

As a starting point for our benchmarks, code that only uses one thread runs with a slow-down of between 1.2x and 3x: at worst, three times as slow; at best, only 20% slower than a regular PyPy. This worst case has been brought down (it used to be 10x) by recent work on "card marking", a useful GC technique that is also present in the regular PyPy (and about which I can't find any blog post; maybe we should write one :-) The main remaining issue is fork(), or any function that creates subprocesses: it works, but is very slow. To remind you of this fact, it prints a line to stderr when used.

Now the main part: when you run multithreaded code, it scales very nicely with two threads, and less than linearly but still not badly with three or four threads. Here is an artificial example:

    total = 0
    lst1 = ["foo"]
    for i in range(100000000):
        lst1.append(i)
        total += lst1.pop()

We run this code N times, once in each of N threads (full benchmark). Run times, best of three:

Number of threads   Regular PyPy (head)            PyPy-STM
N = 1               real 0.92s, user+sys 0.92s     real 1.34s, user+sys 1.34s
N = 2               real 1.77s, user+sys 1.74s     real 1.39s, user+sys 2.47s
N = 3               real 2.57s, user+sys 2.56s     real 1.58s, user+sys 4.106s
N = 4               real 3.38s, user+sys 3.38s     real 1.64s, user+sys 5.35s

(The "real" time is the wall clock time. The "user+sys" time is the recorded CPU time, which can be larger than the wall clock time if multiple CPUs run in parallel. This was run on a 4x2 cores machine. For direct comparison, avoid loops that are so trivial that the JIT can remove all allocations from them: right now PyPy-STM does not handle this case well. It has to force a dummy allocation in such loops, which makes minor collections occur much more frequently.)

Four threads is the limit so far: at most four threads can execute in parallel. Similarly, memory usage is limited to 2.5 GB of GC objects. Neither limitation is hard to lift, but raising the memory limit in particular requires fighting against more LLVM bugs. (Insert snarky remarks about LLVM here.)

Here are some measurements from more real-world benchmarks. This time the amount of work is fixed, and we parallelize it over T threads. The first benchmark simply runs translate.py on a trunk PyPy. The last three benchmarks are here.

Benchmark                                            PyPy 2.3 (PyPy head)   PyPy-STM, T=1   T=2             T=3             T=4
translate.py --no-allworkingmodules (annotation step)  184s (170s)          386s (2.10x)    n/a             n/a             n/a
multithread-richards, 5000 iterations                22.4s? 24.2s (16.8s)   52.5s (2.17x)   37.4s (1.55x)   25.9s (1.07x)   32.7s (1.35x)
mandelbrot, divided in 16-18 bands                   22.9s (18.2s)          27.5s (1.20x)   14.4s (0.63x)   10.3s (0.45x)   8.71s (0.38x)
btree                                                2.26s (2.00s)          2.01s (0.89x)   2.22s (0.98x)   2.14s (0.95x)   2.42s (1.07x)

This shows various cases that can occur:

  • The mandelbrot example runs with minimal overhead and very good parallelization. It divides the plane to compute into bands, and each of the T threads receives the same number of bands (see the sketch after this list).
  • Richards, a classical benchmark for PyPy (tweaked to run the iterations in multiple threads), is hard to beat on regular PyPy: we suspect that the difference is due to the fact that a lot of paths through the loops don't allocate, triggering the issue explained above. Moreover, the speed of Richards was recently improved dramatically again, in trunk.
  • The translation benchmark measures the time translate.py takes to run its first phase only, "annotation" (for now it consumes too much memory to run translate.py to the end). Moreover, the timing starts only after the large number of subprocesses spawned at the beginning (mostly gcc) have finished. This benchmark is not parallel, but we include it here for reference. The slow-down factor of 2.10x is still too much, but we have some idea about the reasons: most likely, again, the Garbage Collector, which is missing the regular PyPy's very fast small-object allocator for old objects. Also, translate.py is an example of an application that could, with reasonable effort, be made largely parallel in the future using atomic blocks.
  • Atomic blocks are also present in the btree benchmark. I'm not completely sure, but it seems that, in this case, the atomic blocks create too many conflicts between the threads for actual parallelization: the base time is very good, but running more threads does not help at all.
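
For illustration, here is a rough sketch of the band-splitting scheme described for the mandelbrot case above. This is only an assumption about the benchmark's structure: compute_band() is a hypothetical placeholder, not the actual benchmark code.

    import threading

    T = 4            # number of threads
    NUM_BANDS = 16   # the plane is divided into horizontal bands

    def compute_band(band_index):
        # Hypothetical placeholder: render one horizontal band here.
        pass

    def worker(band_indices):
        for band in band_indices:
            compute_band(band)

    # Deal out the bands so that each thread gets the same number of them.
    threads = [threading.Thread(target=worker, args=(range(i, NUM_BANDS, T),))
               for i in range(T)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()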

In summary, PyPy-STM already looks useful for running CPU-bound multithreaded applications. We are certainly still going to fight slow-downs, but it seems that there are cases where two threads are enough to outperform a regular PyPy, by a large margin. Please try it out on your own small examples!

And, at the same time, please don't attempt to retrofit threads into an existing large program just to benefit from PyPy-STM! Our goal is not to send everyone down the obscure route of multithreaded programming and its dark traps. We are finally going to shift our main focus onto phase 2 of our research (donations welcome): how to enable a better way of writing multi-core programs. The starting point is to fix and test atomic blocks. Then we will have to debug common causes of conflicts and fix them or work around them, and try to see how common frameworks like Twisted can be adapted.
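
For the curious, atomic blocks are already usable today. Here is a minimal sketch using the __pypy__.thread.atomic context manager (the same one that comes up in the comments below); the counter example itself is invented for illustration and only runs on pypy-stm:

    import threading
    from __pypy__.thread import atomic  # pypy-stm only; not in regular Python

    counter = [0]

    def worker():
        for _ in range(100000):
            # The whole block executes atomically: other threads can
            # never observe a partially-performed update.
            with atomic:
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print counter[0]  # expected: 400000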

Lots of work ahead, but lots of work behind too :-)

Armin (thanks to Remi as well for the work).

22 comments:

Unknown said...
This comment has been removed by the author.
Armin Rigo said...

You're just extracting and running the "bin/pypy"? It works for me on a very close configuration, Ubuntu 14.04 too...

Unknown said...
This comment has been removed by the author.
Armin Rigo said...

Yes. Sorry, it doesn't make sense to me. You need to debug with gdb, probably with an executable that has got the debugging symbols. You need to either build it yourself, or recompile the pregenerated sources from: http://cobra.cs.uni-duesseldorf.de/~buildmaster/misc/pypy-c-r72356-stm-jit-SOURCE.txz

Ernst Sjöstrand said...

If I try virtualenv I get:

    virtualenv stmtest -p Projekt/pypy-stm-2.3-linux64/bin/pypy
    Running virtualenv with interpreter Projekt/pypy-stm-2.3-linux64/bin/pypy
    [forking: for now, this operation can take some time]
    [forking: for now, this operation can take some time]
    New pypy executable in stmtest/bin/pypy
    [forking: for now, this operation can take some time]
    ERROR: The executable stmtest/bin/pypy is not functioning
    ERROR: It thinks sys.prefix is u'/home/ernst' (should be u'/home/ernst/stmtest')
    ERROR: virtualenv is not compatible with this system or executable

Armin Rigo said...

@Ernst: sorry, it works fine for me as well. I tried the pypy-stm provided here, both on a Ubuntu 12.04 and a Ubuntu 14.04 machine. Maybe you have a too old virtualenv? Does it work with regular PyPy?

Armin Rigo said...

Thanks to the author of the now-deleted comments, we could track and fix a bug that only shows up on some Linux systems. If pypy-stm systematically segfaults at start-up for you too, try the "2.3-r2" release (see update in the post itself).

Anonymous said...

This is exciting! One minor bug in the actual post: you can describe slowdown/speedup in two different ways, with total time as a percentage of original time, or with the time difference as a percentage of original time. You mention a 20% slowdown (clearly using the latter standard) and then a 300% slowdown, which you describe as 3x (suggesting that you use the former standard). To be consistent, you should either describe them as 120% and 300%, respectively (the former standard), or 20% and 200%, respectively (the latter standard).

Thanks!

Unknown said...

Hi again,

just to play around a little I've put together https://github.com/Tinche/stm-playground for myself.

I picked a generic CPU-bound problem (primality testing) and tried comparing multithreaded implementations in CPython 2.7, ordinary PyPy and PyPy-STM.

I figured this would be easily parallelizable (low conflicts) but it doesn't seem to be the case - I don't get all my cores pegged using the STM.

bench-threadpool.py, on my machine, gives about the same time for CPython and PyPy-STM, while ordinary PyPy totally smokes them both (even with the GIL :), one order of magnitude difference (20 sec vs 2 sec).

bench-threadpool-naive will crash the STM interpreter on my system. :)

Getting away from threads, CPython will actually beat PyPy in a multi-process scenario by a factor of 2, which I found surprising. CPython does indeed use up all my cores at 100% while dealing with a process pool, while PyPy won't even come close.

For the same workload, PyPy is actually faster running multithreaded with the GIL than multi-process, and fastest running with only 1 thread (expected, with the GIL only being overhead in this scenario).

Pim said...

This is good news. For many of my applications, an important feature in the next phase will be the optimization for [..] the built-in dictionary type, for which we would like accesses and writes using independent keys to be truly independent [..]. My applications are mostly server applications (Twisted-based and others) that store state information on sessions/transactions in a small number of dictionaries that can have hundreds or thousands of entries concurrently, and would be accessed constantly.

I'm glad I donated and plan to do so again in the future :-)

Armin Rigo said...

@Tin: I would tweak bench-queue.py to avoid a million inter-thread communications via the queue. For example, run 1000 check_primes instead of just 1 for every number received from the queue.

Armin Rigo said...

@Tin: ...no, I tried too and it doesn't seem to help. We'll need to look into this in more detail...

Unknown said...

@Armin I've pushed a version of bench-queue with a tweakable batch size and concurrency level. Doing the work in batches of, say, 1000 does indeed make it go faster with all implementations.

I've noticed pypy-stm runs have a large variance. It's not like I'm doing scientific measurements here, but for the queue test I'm getting runtimes from ~15 sec to ~27 sec, whereas, for example, ordinary PyPy stays in the range of 4.6 sec - 4.9 sec, and CPython ~22.5 sec - ~24.7 sec; again, relatively close. This is just something I noticed along the way, not the result of serious benchmarking in isolation.

Armin Rigo said...

Ooooof. Ok, I found out what is wrong in bench-queue. The issue is pretty technical, but basically if you add "with __pypy__.thread.atomic:" in the main top-level loop in worker(), then it gets vastly faster. On my machine it beats the real-time speed of a regular pypy. See http://bpaste.net/show/450553/

It clearly needs to be fixed...

Armin Rigo said...

Added an answer to the question "what about PyPy3?": https://pypy.readthedocs.org/en/latest/stm.html#python-3

Unknown said...

@Armin, cool! I've found that the thread pool version can be sped up ~2-3x by wrapping the contents of check_prime with 'atomic' too.

One more observation: with the atomic context manager, on PyPy-STM the queue implementation will beat the thread pool implementation (slightly), which isn't the case for CPython or ordinary PyPy.

geerk said...

This is exciting news! I think pypy is the future of python.

Canesin said...

If you guys did a facelift on the website like you did for HippyVM, I believe the project would gain a lot of momentum. It is unfortunate but true that most company managers would visit it and think it is not industrial quality if an employee comes saying that they should sponsor developing something in PyPy.

Anonymous said...

r2 still doesn't work for me (ubuntu 14.04, intel Core2 CPU T7400)
bash: ./pypy: cannot execute binary file: Exec format error

isomorph said...

This is a question for the guys developing PyPy... I am completely new to Python, so please bear with me.

Here is what I don't understand: it seems to me that you are reinventing the wheel, because doesn't the Oracle or Azul Systems JVM already provide a super-performant GC and JIT? Even STM is becoming available. And since Jython can run on the JVM, why do PyPy at all?

Wouldn't a JVM-compliant implementation of Python be more performant than PyPy or CPython?

Or am I missing something here?

Any pointers greatly appreciated. Thanks.

Armin Rigo said...

Having a JIT in the JVM is very different from having a JIT that can understand Python. For proof, the best (and only) implementation of Python on the JVM, Jython, is running at around CPython speed (generally a bit slower). I suspect that STM is similarly not designed for the purposes to which Jython would put it and would thus perform poorly. The only part that would probably work out of the box would be the GC. A more subtle argument against starting from the JVM is that of semantic mismatch. See for example http://www.stups.uni-duesseldorf.de/mediawiki/images/5/51/Pypy.pdf

isomorph said...

awesome! thanks a lot armin. :D