Monday, May 7, 2012

STM update: back to threads?

Hi again,

Here is another update on the status of Software Transactional Memory on PyPy.

Those of you who have been closely following this blog since last year know that, from the very first post about STM, I explored various design ideas about the API that we should get when programming in Python.

I have gone full circle, and I am now back where I started (with one important difference: a very rough but working implementation of pypy-stm).

What I realized is that the "thread" module is not that bad after all --- I mean, yes, it is a horribly low-level interface, but it is general enough to build various interesting things on top of it. What the "stm-thread" branch of PyPy contains is, basically, the regular "thread" module in which the GIL was replaced with STM. It gives multicore capabilities to any program based on multiple threads. (This is, so far, exactly the same idea as the one being investigated for Hardware Transactional Memory. It is also roughly what you would get if you managed to convince GCC 4.7 to compile CPython using STM.)

Now, while this might already be quite interesting to some people, here is how it relates to what I said previously: namely, that threads are bad, and that some new "transaction" module would be a better idea.

There is one new core functionality in the "stm-thread" branch: it is "thread.atomic", a context manager that can be used in a "with" statement (exact name subject to change). In terms of the GIL, it prevents the GIL from being released in the "with" block. In terms of STM, it prevents a "transaction break", which means that the whole "with" statement runs in one single transaction. (From the Python programmer's point of view, the net effect is the same.)

So far, no ground-breaking news. But what I missed previously is that this is enough to give multicore capabilities even to a program that is not using threads so far. It is possible to rewrite an equivalent of the old transaction module in a few pages of pure Python, using "thread.atomic". Something along the following lines: start N threads that each read from a Queue.Queue() the next job to do, and do it in a "with thread.atomic" block. The STM version of PyPy is then able to run these atomic blocks concurrently. The key point is that the slightly delicate handling of threads should be nicely hidden inside the new "transaction" module, and from the outside the observed behavior would be exactly as if the transactions we schedule were run serially.
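
Here is a rough sketch of such a module (purely illustrative Python 2 code: the add()/run() names and the sentinel-based shutdown are assumptions of this sketch, not a settled interface):

import thread, Queue

_pending = Queue.Queue()

def add(job, *args):
    # schedule a job to be run later as one atomic transaction
    _pending.put((job, args))

def _worker():
    while True:
        job, args = _pending.get()
        if job is None:             # sentinel value: stop this worker
            break
        with thread.atomic:         # the whole job runs as one transaction
            job(*args)
        _pending.task_done()

def run(num_threads=4):
    for _ in range(num_threads):
        thread.start_new_thread(_worker, ())
    _pending.join()                 # wait for all scheduled jobs to finish
    for _ in range(num_threads):
        _pending.put((None, None))  # tell every worker to exit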

The point I kept missing was that, yes, this sounds like nonsense, because it seems that we create N threads just to serialize their work again in "thread.atomic" sections. In fact, this would be nonsense in any model that "just" removes the GIL to let multiple threads run concurrently without crashing. Indeed, you have multiple threads, but their atomic blocks would again act as a sort of GIL: only one of them would run at a time. And this is indeed the simple model of execution that you get even with STM --- but not the model of performance. The performance with STM scales with the number of cores, as long as there is enough non-conflicting work to do.

So in summary, the complete circle back to the starting point is that threads might be a good low-level model. It lends itself naturally to, say, a kind of program in which the main thread polls file descriptors using select() or the Linux epoll(), and the work received is split among N other threads --- which is the kind of program you would naturally write in other languages that don't have a GIL, say Java. The other threads can then use "thread.atomic" blocks to protect sections of their work. The traditional Transactional Memory point of view is that you use such blocks to guard the short sections of code that communicate with other threads or modify global state, but nothing prevents you from using much larger sections: you should be able to scale them up to the size of a natural "unit of work", so that every unit is atomic. And then it's only a matter of design: you can tweak an existing module that does the thread pooling to add one "with thread.atomic"; or do it yourself from scratch; or (if the design is compatible enough) just plug in the proposed pure-Python "transaction" module. Or, if you feel like it, you can even use threads directly (but keep in mind that using threads too explicitly is not a composable abstraction, whereas higher-level designs typically are).
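
To make that structure concrete, here is a hedged sketch of such a program (the socket handling and the process() function are invented for the example; the point is only where the "with thread.atomic" block goes):

import select, socket, thread, Queue

jobs = Queue.Queue()

def process(data):
    # hypothetical application logic; this is the "unit of work"
    return data.upper()

def worker():
    while True:
        conn = jobs.get()
        data = conn.recv(4096)        # I/O happens outside the atomic block
        with thread.atomic:           # the whole unit of work is one transaction
            response = process(data)
        conn.sendall(response)
        conn.close()

def main(nworkers=4):
    for _ in range(nworkers):
        thread.start_new_thread(worker, ())
    server = socket.socket()
    server.bind(('', 8000))
    server.listen(5)
    while True:                       # main thread only polls and dispatches
        ready, _, _ = select.select([server], [], [])
        if ready:
            conn, addr = server.accept()
            jobs.put(conn)            # hand the work to one of the N threads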

At the end of the day, you can write or reuse programs whose global structure you are already familiar with, for example with a thread pool (that can be hidden in a library if you prefer), or any other structure with or without explicit threads. But you can do so without all the mess that comes with threads, like locks and deadlocks. From that angle it is really similar to Garbage Collection: e.g. the Boehm GC (now used by GCC itself) lets you write C code like you are used to, but forgetting all you had to learn about careful explicit memory management.

20 comments:

Benjamin said...

So I'm not sure if I fully grok STM, but my basic understanding of the workflow for a transaction is this:

1. Make a copy of whatever it is you're planning to use, ie, 'stuff'.
2. Do anything that doesn't have side effects (writing to memory/disk).
3. Acquire a lock & compare the state of the parts of 'stuff' you want to change to the current state.
4a. If 'stuff to write' is unchanged, write it and release lock.
4b. Otherwise, release lock and restart transaction.

With the context manager, how is 'stuff' determined? Does it record everything in locals()? That seems like it might be excessive. Would it make sense to expose 'stuff' to the programmer?

If you were to expose 'stuff' to the programmer, I'd think you'd want a new local context where the only variables available were those explicitly specified as 'stuff' (and builtins, etc) so as to avoid concurrency accidents. Something like:

with atomic(f, x, y, z, q) as f, x, y, z, q:
    z += f(x, y)
    y = x
    x = q.pop()

This would also help remind folks to keep their transactions small.

Furthermore, this could easily be transformed into a very useful (function) decorator that uses the function's arguments as the 'stuff'.

Am I missing something? Are my suggestions reasonable?

Unknown said...

This might give you some insight into another approach for passing messages (aka information) between threads, which might be GIL-friendly.

Frankier said...

@Benjamin:

My understanding is that STM uses this type of transaction: http://en.wikipedia.org/wiki/Optimistic_concurrency_control

Armin Rigo said...

@Benjamin: no, that's not reasonable at all in the context of large transactions. "Help remind folks to keep their transactions small" is precisely what I don't want: I want large transactions. This might be harder to do efficiently, it might be more conflict-prone, etc.; but what I don't want is the classical situation where you have to be very careful about keeping your transactions as small as possible, because that's just as hard and error-prone as using locks.

What I want is for "the average programmer" to not use the "thread" module at all, including "thread.atomic". This should be part of a library that does thread pooling and dispatching (large) transactions.

Kristján Valur said...

You know, of course, that stackless has an "atomic" property, and stacklesslib has a stacklesslib.utils.atomic context manager.

I recently modified stackless so that the "atomic" property also inhibited GIL release, so that inter-thread tasklet operations could be made safe.

On a whim I scoured the python archives and found that such a property had been proposed to cPython but rejected (unwisely imho) in favor of general locking.

Perhaps we can get them to reconsider?

Kristján Valur said...

Oh, and btw:
an "atomic" property in regular cPython (and stackless) of course only prevents preemptive release of the GIL. Any blocking IO calls will still cause a "co-operative" GIL release. For this reason, "atomic" cannot be replace generic locks completely.

How does this play with longer "transactions" in STM?

Armin Rigo said...

@Kris: ah, interesting. You did the same as what I attempted in my hack of CPython at https://bitbucket.org/arigo/cpython-withatomic . This didn't really work out, though, because the stdlib (including file objects) uses regular locks. A simple "print" in an atomic block could lead to deadlocks: the atomic block can block waiting for stdout's file lock to be released, but it does so without releasing the GIL. Now the lock would typically be released by another thread --- if only that thread could grab the GIL for a short while.

You can see the workaround I found in the last few commit messages of the above repository, but I'm not satisfied with it... In general I'm still unsure what the best way is. For now in pypy-stm I'm going to hack on a case-by-case basis to convert the locks to atomic sections.

Perhaps it is possible to do it semi-generically, e.g. convert all syntactically nested "with lock:" statements in the user code into "with atomic:" statements (similar to next year's Intel CPUs, which will have "lock elision" to help convert from lock-based to HTM programs). As far as I know, this idea doesn't work in all situations, e.g. if you acquire a lock in one thread and release it in another thread.

As far as I can tell, this issue is the main blocker preventing any further progress on the CPython side. It is certainly the reason I stopped pushing for it last year.

Armin Rigo said...

@Kris: ah, ok: you have a version of "atomic" that doesn't prevent the GIL from being released around I/O calls. This is different from the version described in this post, which is also what I assumed in my previous answer. In a "with atomic" block, the GIL is not released under any circumstance (equivalently, the whole "atomic" block runs as a single transaction), so that the programmer can assume that a "with atomic" block is truly atomic.

ArneBab said...

How would a code example look for thread.atomic?

Armin Rigo said...

@Arne: here is an example using thread.atomic directly. In your multithreaded application, at some point, you want to remove an item from list1 and add it to list2, knowing that list1 and list2 are also accessed by other threads. Then you write:

with thread.atomic:
    x = list1.pop()
    list2.append(x)

This is a classical STM example. What I'm pushing for is not that, though: it is for not writing multithreaded code in the first place. With the proper library code, you would only use the "transaction" module, along the lines of the sketch below. The library code would itself use thread.atomic, but you would not use it directly.
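
A sketch of what I mean (the add()/run() names follow the old transaction module and are still subject to change):

import transaction

results = [None] * 100

def do_stuff(i):
    # each call runs as its own transaction; distinct indices, so no conflicts
    results[i] = i * i

for i in range(100):
    transaction.add(do_stuff, i)
transaction.run()    # behaves exactly as if the calls had run one after another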

Kristján Valur said...

Yes, sorry for not being clear, Armin. But an "atomic" flag that inhibits involuntary thread switching is useful too, because it is a fast "lock" around all kinds of code:

with atomic:
    foo = foo + 1  # multi-threading-safe

without the overhead of real locks.
In our GIL world, real locks only benefit areas that incur thread-blocking operations such as IO.

Anyway, that is off-topic, I suppose :)

Kristján Valur said...

Of course, we cannot replace thread._Lock with an "atomic" equivalent, because it is a non-recursive entity, also used for such things as condition variables!

Not a very wise move, in retrospect.

Armin Rigo said...

@Kris: indeed. I found out a way that should in all cases either work or raise an exception if unsupported (and not unexpectedly deadlock).

The unsupported situation is: we are in a "with atomic" block trying to acquire a lock, and this lock is acquired already. In this case, there is nothing the interpreter can do automatically. It can only complain rather than deadlocking: no other thread is going to run in parallel to release the lock.

This should let the "common use case" work, which is locks used as scoped mutexes. Caveat: only as long as you use them either only in "with atomic" blocks --- because they appear to be fully serialized, so the mutex will never block --- or only outside "with atomic" blocks.

This leaves the case of mixed usage as unsupported, but I don't see how it could reasonably be supported.

So for now, pypy-stm will raise "oups deadlock" if you try to use "print" statements both inside and outside atomic blocks in parallel... that's the best I could come up with so far.

Anonymous said...

thanks for the article. might want to reword "This is so far exactly the idea same than the one being investigated for Hardware Transactional Memory.". :)

Ole Laursen said...

To expand slightly on what someone else commented, there was a talk not too long ago by some guys who found out that using queues to communicate between threads can be a pretty hefty bottleneck. They were using the JVM.

The talk is interesting because they actually measured the stuff they do and compared it with how it affects the CPU pipelines/caches. The queue discussion is around 32 minutes into the talk.

It's perhaps not relevant for pypy-stm at the moment, but it's definitely relevant for anyone interested in high-performance multithreaded code.

Dima Q said...

Good job, Armin!

This is exactly what Python needs, and if it turns out hard rather than insanely hard, all the better!

Jonas W. said...

I am not entirely sure about the concept that is being implemented in PyPy-stm or, rather, that is planned for a parallel PyPy in the future.

I think I am a pretty conservative programmer, and I actually dislike the idea of running code twice because of conflicts that could have been foreseen at development time ;). I still see the advantages STM brings regarding development time.

So I'm wondering about a point which was not entirely clear in your post. You're saying you don't want people to (be forced to?) write short transactions. However, in a project which is both CPU- and memory-intensive, could I still try to keep the thread.atomic sections as small as possible, to avoid unnecessary overhead but still get effective locks?

Armin Rigo said...

@Jonas: it is not always obvious at development time -- to say the least -- how to avoid all conflicts. Think about how hard it is to add automatic GC to C++ in a large project: it's messy, but you might get pretty far with just reference counting -- until the point where you lose because of cyclic references. If instead you had used a proper GC-managed language, the problem would just not exist. It's the same with Transactional Memory and conflicts: you can think harder and harder about using locks correctly, until your program becomes really messy; or you can give up and use TM, solving the issue instantly and letting you think again about your original problem.

Regarding the transaction size: with a good implementation, big transactions should not be slower than small transactions. The only potential drawback of having big transactions is that the risks of conflicts might increase (depending on your program).

Note that this question has a different answer for Python than for C, where code outside transactions runs faster than code within transactions. It is not so in Python. The reason is that transactions are always needed in Python: either explicitly, or implicitly in order to protect the interpreter structures (in replacement of the famous GIL).

Connelly Barnes said...

Is there any plan to add type declarations as some optional mode in PyPy, like Cython allows? Because PyPy can sometimes give some speed up, but when it doesn't it seems the alternative for the user is to go back to CPython + Cython.

ArneBab said...

@Armin: Looks nice!

But you’re right: The explicit transaction still looks nicer.

I think though, that both can nicely complement each other:

(1) The transaction is efficient for pushing out parts of the code from the main run to get it multithreaded (think “#pragma omp parallel for” from OpenMP).

(2) The thread.atomic is efficient for protecting stuff inside a threaded application. Also I like that I don’t have to explicitly state which variables I want to protect. And I like that it is not full locking: if I don’t actually get a conflict, other code still runs in parallel.

The first actually looks more interesting though, because it might be possible to make every for-loop run like this, as long as later runs are not dependent on the result of previous runs. This would require quite heavy runtime analysis, though.