Wednesday, October 16, 2013

Update on STM

Hi all,

The sprint in London was a lot of fun and very fruitful. In the last update on STM, Armin was working on improving and specializing the automatic barrier placement. There is still a lot to do in that area, but that work is merged now. Specializing and improving barrier placement is still to be done for the JIT.

But that is not all. Right after the sprint, we were able to squeeze the last obvious bugs in the STM-JIT combination. However, the performance was nowhere near to what we want. So until now, we fixed some of the most obvious issues. Many come from RPython erring on the side of caution and e.g. making a transaction inevitable even if that is not strictly necessary, thereby limiting parallelism. Another problem came from increasing counters everytime a guard fails, which caused transactions to conflict on these counter updates. Since these counters do not have to be completely accurate, we update them non-transactionally now with a chance of small errors.

There are still many such performance issues of various complexity left to tackle: we are nowhere near done. So stay tuned or contribute :)


Now, since the JIT is all about performance, we want to at least show you some numbers that are indicative of things to come. Our set of STM benchmarks is very small unfortunately (something you can help us out with), so this is not representative of real-world performance. We tried to minimize the effect of JIT warm-up in the benchmark results.

The machine these benchmarks were executed on has 4 physical cores with Hyper-Threading (8 hardware threads).

Raytracer from stm-benchmarks: Render times in seconds for a 1024x1024 image:

Interpreter Base time: 1 thread 8 threads (speedup)
PyPy-2.1 2.47 2.56 (0.96x)
CPython 81.1 73.4 (1.1x)
PyPy-STM 50.2 10.8 (4.6x)

For comparison, disabling the JIT gives 148s on PyPy-2.1 and 87s on PyPy-STM (with 8 threads).

Richards from PyPy repository on the stmgc-c4 branch: Average time per iteration in milliseconds:

Interpreter Base time: 1 thread 8 threads (speedup)
PyPy-2.1 15.6 15.4 (1.01x)
CPython 239 237 (1.01x)
PyPy-STM 371 116 (3.2x)

For comparison, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on PyPy-STM.

Try it!

All this can be found in the PyPy repository on the stmgc-c4 branch. Try it for yourself, but keep in mind that this is still experimental with a lot of things yet to come. Only Linux x64 is supported right now, but contributions are welcome.

You can download a prebuilt binary from here: (Linux x64 Ubuntu >= 12.04). This was made at revision bafcb0cdff48.


What the numbers tell us is that PyPy-STM is, as expected, the only of the three interpreters where multithreading gives a large improvement in speed. What they also tell us is that, obviously, the result is not good enough yet: it still takes longer on a 8-threaded PyPy-STM than on a regular single-threaded PyPy-2.1. However, as you should know by now, we are good at promising speed and delivering it... years later :-)

But it has been two years already since PyPy-STM started, and this is our first preview of the JIT integration. Expect major improvements soon: with STM, the JIT generates code that is completely suboptimal in many cases (barriers, allocation, and more). Once we improve this, the performance of the STM-JITted code should come much closer to PyPy 2.1.


Remi & Armin


tobami said...

To see a multithreading speed up in a python interpreter is awesome!

For next update, I would suggest to do the benchmarking turning off hyperthreading and measuring 1, 2 and 4 threads. That would give a better picture of how the STM implementation scales with threads/cores.

Mak Sim said...

Guys you are doing great job!

Anonymous said...

STM | Société de transport de Montréal ?

Anonymous said...

STM stands for Software Transactional Memory and is a way to run multiple non-conflicting tasks at the same time and make it appear as if they had run in sequence.

LKRaider said...

A bit off-topic, but just came across this paper:

"Speculative Staging for Interpreter Optimization
-- we report that our optimization makes the CPython interpreter up to more than four times faster, where our interpreter closes the gap between and sometimes even outperforms PyPy's just-in-time compiler."