Wednesday, June 5, 2013

STM on the drawing board

Hi all!

This is an update about the Software Transactional Memory subproject of PyPy. I have some good news of progress. Also, Remi Meier will likely help me this summer. He did various investigations with PyPy-STM for his Master's Thesis and contributed back a lot of ideas and some code. Welcome again Remi!

I am also sorry that it seems to advance so slowly. Beyond the usual excuses --- I was busy with other things, e.g. releasing PyPy 2.0 --- I would like to reassure people: I'm again working on it, and the financial contributions are still there and reserved for STM (almost half the money is left, a big thank you again if you contributed!).

The real reason for the apparent slowness, though, is that it is really a research project. It's possible to either have hard deadlines, or to follow various tracks and keep improving the basics, but not both at the same time.

During the past month where I have worked again on STM, I worked still on the second option; and I believe it was worth every second of it. Let me try to convince you :-)

The main blocker was that the STM subsystem, written in C, and the Garbage Collection (GC) subsystem, written in RPython, were getting harder and harder to coordinate. So what I did instead is to give up using RPython in favor of using only C for both. C is a good language for some things, which includes low-level programming where we must take care of delicate multithreading issues; RPython is not a good fit in that case, and wasn't designed to be.

I started a fresh Mercurial repo which is basically a stand-alone C library. This library (in heavy development right now!) gives any C program some functions to allocate and track GC-managed objects, and gives an actual STM+GC combination on these objects. It's possible (though rather verbose) to use it directly in C programs, like in a small example interpreter. Of course the eventual purpose is to link it with PyPy during translation to C, with all the verbose calls automatically generated.

Since I started this, bringing the GC closer to the STM, I kept finding new ways that the two might interact to improve the performance, maybe radically. Here is a summary of the current ideas.

When we run multiple threads, there are two common cases: one is to access (read and write) objects that have only been seen by the current thread; the other is to read objects seen by all threads, like in Python the modules/functions/classes, but not to write to them. Of course, writing to the same object from multiple threads occurs too, and it is handled correctly (that's the whole point), but it is a relatively rare case.

So each object is classified as "public" or "protected" (or "private", when they belong to the current transaction). Newly created objects, once they are no longer private, remain protected until they are read by a different thread. Now, the point is to use very different mechanisms for public and for protected objects. Public objects are visible by all threads, but read-only in memory; to change them, a copy must be made, and the changes are written to the copy (the "redolog" approach to STM). Protected objects, on the other hand, are modified in-place, with (if necessary) a copy of them being made for the sole purpose of a possible abort of the transaction (the "undolog" approach).

This is combined with a generational GC similar to PyPy's --- but here, each thread gets its own nursery and does its own "minor collections", independently of the others.

So objects are by default protected; when another thread tries to follow a pointer to them, then it is that other thread's job to carefully "steal" the object and turn it public (possibly making a copy of it if needed, e.g. if it was still a young object living in the original nursery).

The same object can exist temporarily in multiple versions: any number of public copies; at most one active protected copy; and optionally one private copy per thread (this is the copy as currently seen by the transaction in progress on that thread). The GC cleans up the unnecessary copies.

These ideas are variants and extensions of the same basic idea of keeping multiple copies with revision numbers to track them. Moreover, "read barriers" and "write barriers" are used by the C program calling into this library in order to be sure that it is accessing the right version of the object. In the currently investigated variant I believe it should be possible to have rather cheap read barriers, which would definitely be a major speed improvement over the previous variants. Actually, as far as I know, it would be a major improvement over most of the other existing STMs: in them, the typical read barrier involves following chains of pointers, and checking some dictionary to see if this thread has a modified local copy of the object. The difference with a read barrier that can resolve most cases in a few CPU cycles should be huge.

So, this is research :-) It is progressing, and at some point I'll be satisfied with it and stop rewriting everything; and then the actual integration into PyPy should be straightforward (there is already code to detect where the read and write barriers need to be inserted, where transactions can be split, etc.). Then there is support for the JIT to be written, and so on. But more about it later.

The purpose of this post was to give you some glimpses into what I'm working on right now. As usual, no plan for release yet. But you can look forward to seeing the C library progress. I'll probably also start soon some sample interpreter in C, to test the waters (likely a revival of duhton). If you know nothing about Python but all about the C-level multithreading issues, now is a good time to get involved :-)

Thanks for reading!

Armin

3 comments:

Paul Jaros said...

Thanks for the update. I was wondering since some time how the progress in STM has come along.
Good job also :)

Tuure Laurinolli said...

Do you have a description of the read and write barriers required somewhere? How does requiring a copy to be made of protected objects upon modification work with e.g. large arrays?

David said...

Check out John Carmack's brainstorm on an integrated STM+GC system concept which is sort of "globally phased compacting GC+STM". He doesn't use the term STM, but the concept is the same.

http://www.youtube.com/watch?v=1PhArSujR_A&feature=player_detailpage&t=1354