Sunday, August 18, 2013

Update on STM

Hi all,

A quick update on Software Transactional Memory. We are working on two fronts.

On the one hand, the integration of the "c4" C library with PyPy is done and works well, though it can still be improved. The "PyPy-STM" executable (without the JIT) seems to be stable, as far as it has been tested. It runs a simple benchmark like Richards with a 3.2x slow-down over a regular JIT-less PyPy.

The main source of this slow-down is the numerous "barriers" in the code --- checks needed almost everywhere to verify that a pointer to an object points to a recent enough version, and if not, to go to the most recent version. These barriers are inserted automatically during the translation; there is no need for us to manually put 42 million barriers in the source code of PyPy. But this automatic insertion uses a primitive algorithm right now, which usually ends up putting more barriers than the theoretical optimum. I (Armin) am trying to improve that --- and progressing: last week the slow-down was around 4.5x. This is done in the branch stmgc-static-barrier.
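Conceptually, a read barrier just hops along "newer version" links until it reaches the most recent copy of an object. Here is a toy Python sketch of that idea --- the class and field names are made up for illustration and are not the actual stmgc code:

```python
class Obj:
    """Toy STM-managed object: 'newer' points to a more recent
    version of this object, if one exists."""
    def __init__(self, value):
        self.value = value
        self.newer = None

def read_barrier(obj):
    # Follow the chain of newer versions (usually zero or one hop)
    # and return the most recent copy.
    while obj.newer is not None:
        obj = obj.newer
    return obj

old = Obj(1)
new = Obj(2)
old.newer = new               # 'old' is outdated
assert read_barrier(old) is new
assert read_barrier(new) is new
```

The cost in the real system comes from performing this check (usually a single compare-and-branch) before a great many object accesses, which is why reducing the number of inserted barriers matters so much.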

On the other hand, Remi is progressing on the JIT integration in the branch stmgc-c4. This has been working in simple cases for a couple of weeks now, but the resulting "PyPy-JIT-STM" often crashes. This is because, while the basics are not really hard, we keep hitting new issues that must be resolved.

The basic rule is that whenever the JIT is about to generate assembler corresponding to a load or a store in a GC object, it must first generate a bit of extra assembler that corresponds to the barrier we need. This works fine by now (but could benefit from the same kind of optimization described above, to reduce the number of barriers). The additional issues are all more subtle. I will describe the current one as an example: how to write constant pointers into the assembler.
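The rule above can be sketched as a toy code generator: every GC load or store is preceded by the matching barrier. The instruction names below are invented for illustration, not the real JIT backend's opcodes:

```python
def emit_gc_read(asm, reg):
    # Before reading a GC object, emit the read barrier that hops
    # to the most recent version of the object in 'reg'.
    asm.append(("read_barrier", reg))
    asm.append(("load", reg))

def emit_gc_write(asm, reg):
    # Before writing, emit the write barrier, which may have to
    # create a private copy of the object first.
    asm.append(("write_barrier", reg))
    asm.append(("store", reg))

asm = []
emit_gc_read(asm, "r1")
emit_gc_write(asm, "r2")
assert asm == [("read_barrier", "r1"), ("load", "r1"),
               ("write_barrier", "r2"), ("store", "r2")]
```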

Remember that the STM library classifies objects as either "public" or "protected/private". A "protected/private" object is one which has not been seen by another thread so far. This is essential as an optimization, because we know that no other thread will access our protected or private objects in parallel, and thus we are free to modify their content in place. By contrast, public objects are frozen, and to do any change, we first need to build a different (protected) copy of the object. See this blog post for more details.
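The public/protected distinction amounts to copy-on-write for public objects. Here is a minimal Python sketch of that behaviour, assuming made-up names (this is not the stmgc API):

```python
class STMObject:
    """Toy object: 'public' objects are frozen; protected/private
    ones may be mutated in place by their owning thread."""
    def __init__(self, value, public=False):
        self.value = value
        self.public = public

def stm_write(obj, value):
    if obj.public:
        # Public objects are frozen: build a protected copy
        # and do the change there.
        return STMObject(value, public=False)
    # Protected/private: no other thread can see it, so we are
    # free to modify it in place.
    obj.value = value
    return obj

priv = STMObject(10)
assert stm_write(priv, 11) is priv        # mutated in place
pub = STMObject(10, public=True)
copy = stm_write(pub, 11)
assert copy is not pub and pub.value == 10  # original left frozen
```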

So far so good, but the JIT will sometimes (actually often) hard-code constant pointers into the assembler it produces. For example, this is the case when the Python code being JITted creates an instance of a known class; the corresponding assembler produced by the JIT will reserve the memory for the instance and then write the constant type pointer in it. This type pointer is a GC object (in the simple model, it's the Python class object; in PyPy it's actually the "map" object, which is a different story).

The problem right now is that this constant pointer may point to a protected object. This is a problem because the same piece of assembler can later be executed by a different thread. If it is, then this different thread will create instances whose type pointer is bogus: it looks like a protected object, but is actually protected by a different thread. Any attempt to use this type pointer to change anything on the class itself will likely crash: the threads will all think they can safely change it in place. To fix this, we need to make sure we only write pointers to public objects in the assembler. This is a bit involved because we need to ensure that there is a public version of the object to start with.
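The fix just described could be sketched like this: before hard-coding a pointer into the assembler, obtain (or create) a public version of the object and embed that pointer instead. All names here are hypothetical, for illustration only:

```python
class STMObject:
    """Toy object with a link to its public version, if one exists."""
    def __init__(self, public=False):
        self.public = public
        self.public_version = None

def pointer_for_assembler(obj):
    # Only pointers to public objects may be hard-coded into
    # assembler, since any thread may execute that assembler later.
    if obj.public:
        return obj
    if obj.public_version is None:
        # Ensure a public version of the object exists.
        obj.public_version = STMObject(public=True)
    return obj.public_version

prot = STMObject(public=False)
p = pointer_for_assembler(prot)
assert p.public and p is not prot
assert pointer_for_assembler(prot) is p   # stable across calls
```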

When this is done, we will likely hit the next problem, and the next one; but at some point it should converge (hopefully!) and we'll give you our first PyPy-JIT-STM ready to try. Stay tuned :-)

A bientôt,

Armin.

8 comments:

Anonymous said...

*assembly

Unknown said...

Thanks for the update; glad it's coming together! I'm really looking forward to seeing how it stacks up once the JIT work is complete.

Do you think that it'll ever be possible to get better than a 2x slowdown for serial operations? Or is that the minimum possible? Naively, it makes sense that it'll never be as fast, but if 1.5x or lower were possible, that would be very exciting.

Also, is the end goal that you would have a module you import to "turn on" STM? Or would it always be a separate build of pypy, just like JIT/JIT-less?

Unknown said...
This comment has been removed by the author.
Armin Rigo said...

@Christopher: the slow-down we'll get is still unknown, but I fear it won't really go below 2x.

I see it mainly as a separate build: either you want to run all these barrier instructions everywhere (which gives the slow-down) or not. It could be possible in theory to have a version that has the barriers everywhere, but creates JIT-generated assembler that doesn't, and thus runs almost as fast as a regular PyPy as long as you don't "turn on" STM. We will see if that makes sense.

Armin Rigo said...

@Anonymous: ah, thanks :-) I think I now learned the difference between "assembler" and "assembly" in English, which was never quite clear to me. Note that in French the same word ("assembleur") is used for both.

Unknown said...

@Armin: Ah, I see. Well, from a user's perspective, what I mostly write in Python these days is either GUI applications (for which I've never been able to use pypy due to lack of bindings, but that's another issue entirely), or small services, for which pypy has provided a rather nice speed improvement.

In a perfect world, I'd be able to use pypy for both of these tasks, not using STM for my GUI applications, but turning it on for the services I write (well, once they reach a certain point where I'd gain something from concurrency).

I suspect having a separate build would make such a use-case awkward.

Also, my interest is a bit self-motivated; at work we currently use node.js for a lot of our services. Pypy compares decently for a lot of our tasks, but it's not 'clearly better'. Once STM is stable, however, several of our services that we've struggled to scale to multiple cores on node.js could be rewritten in pypy STM, and should scale much more easily. (Manual process management is painful!)

Again, if pypy STM were a separate build, we'd have to manage having both installed in the case where we have servers running services that need concurrency, or ones that work well enough with a very fast async implementation. Not impossible, just a bit awkward. :)

Either way, I'm pretty excited!

Unknown said...

Are there any plans or experiments going on related to Hardware Transactional Memory?

Armin Rigo said...

@Ignacio Hernandez: for HTM, our position is still as described last year in: http://morepypy.blogspot.com/2012/08/multicore-programming-in-pypy-and.html