Saturday, January 14, 2012

Transactional Memory (II)

Here is an update about the previous blog post about the Global Interpreter Lock (GIL). In 5 months, the point of view changed quite a bit.

Let me remind you that the GIL is the technique used in both CPython and PyPy to safely run multi-threaded programs: it is a global lock that prevents multiple threads from actually running at the same time. The reason to do that is that it would have disastrous effects in the interpreter if several threads access the same object concurrently --- to the point that in CPython even just manipulating the object's reference counter needs to be protected by the lock.

So far, the ultimate goal to enable true multi-CPU usage has been to remove the infamous GIL from the interpreter, so that multiple threads could actually run in parallel. It's a lot of work, but this has been done in Jython. The reason that it has not been done in CPython so far is that it's even more work: we would need to care not only about carefully adding fine-grained locks everywhere, but also about reference counting; and there are a lot more C extension modules that would need care, too. And we don't have locking primitives as performant as Java's, which have been hand-tuned since ages (e.g. to use help from the JIT compiler).

But we think we have a plan to implement a different model for using multiple cores. Believe it or not, this is better than just removing the GIL from PyPy. You might get to use all your cores without ever writing threads.

You would instead just use some event dispatcher, say from Twisted, from Stackless, or from your favorite GUI; or just write your own. From there, you (or someone else) would add some minimal extra code to the event dispatcher's source code, to exploit the new transactional features offered by PyPy. Then you would run your program on a special version of PyPy, and voilà: you get some form of automatic parallelization. Sounds magic, but the basic idea is simple: start handling multiple events in parallel, giving each one its own transaction. More about it later.

Threads or Events?

First, why would this be better than "just" removing the GIL? Because using threads can be a mess in any complex program. Some authors (e.g. Lee) have argued that the reason is that threads are fundamentally non-deterministic. This makes it very hard to reason about them. Basically the programmer needs to "trim" down the non-determinism (e.g. by adding locks, semaphores, etc.), and it's hard to be sure when he's got a sufficiently deterministic result, if only because he can't write exhaustive tests for it.

By contrast, consider a Twisted program. It's not a multi-threaded program, which means that it handles the "events" one after the other. The exact ordering of the events is not really deterministic, because they often correspond to external events; but that's the only source of non-determinism. The actual handling of each event occurs in a nicely deterministic way, and most importantly, not in parallel with the handling of other events. The same is true about other libraries like GUI toolkits, gevent, or Stackless.

(Of course the Twisted and the Stackless models, to cite only these two, are quite different from each other; but they have in common the fact that they are not multi-threaded, and based instead on "events" --- which in the Stackless case means running a tasklet from one switch() point to the next one.)

These two models --- threads or events --- are the two main models we have right now. The latter is more used in Python, because it is much simpler to use than the former, and the former doesn't give any benefit because of the GIL. A third model, which is the only one that gives multi-core benefits, is to use multiple processes, and do inter-process communication.

The problem

Consider the case of a big program that has arbitrary complicated dependencies. Even assuming a GIL-less Python, this is likely enough to prevent the programmer from even starting a multi-threaded rewrite, because it would require a huge mess of locks. He could also consider using multiple processes instead, but the result is annoying as well: the complicated dependencies translate into a huge mess of inter-process synchronization.

The problem can also be down-sized to very small programs, like the kind of hacks that you do and forget about. In this case, the dependencies might be simpler, but you still have to learn and use subtle locking patterns or a complex inter-process library, which is overkill for the purpose.

(This is similar to how explicit memory management is not very hard for small programs --- but still, nowadays a lot of people agree that automatic memory management is easier for programs of all sizes. I think the same will eventually be true for using multiple CPUs, but the correct solution will take time to mature, like garbage collectors did. This post is a step in hopefully the right direction :-))

Events in Transactions

Let me introduce the notion of independent events: two events are independent if they don't touch the same set of objects. In a multi-threaded world, it means that they can be executed in parallel without needing any lock to ensure correctness.

Events might also be mostly independent, i.e. they rarely access the same object concurrently. Of course, in a multi-threaded world we would still need locks to ensure correctness, but the point is that the locks are rarely causing pauses: lock contention is low.

Consider again the Twisted example I gave above. There are often several events pending in the dispatch queue (assuming the program is using 100% of our single usable CPU, otherwise the whole discussion is moot). The case I am interested in is the case in which these events are generally mostly independent, i.e. we expect few conflicts between them. However they don't have to be proved independent. In fact it is fine if they have arbitrary complicated dependencies as described above. The point is the expected common case. Imagine that you have a GIL-less Python and that you can, by a wave of your hand, have all the careful locking mess magically done. Then what I mean here is the case in which such a theoretical program would run mostly in parallel on multiple core, without waiting too often on the locks.

In this case, the solution I'm proposing is that with minimal tweaks in the event dispatch loop, we can handle multiple events on multiple threads, each in its own transaction. A transaction is basically a tentative execution of the corresponding piece of code: if we detect conflicts with other concurrently executing transactions, we abort the whole transaction and restart it from scratch.

By now, the fact that it can basically work should be clear: multiple transactions will only get into conflict when modifying the same data structures, which is the case where the magical wand above would have put locks. If the magical program could progress without too many locks, then the transactional program can progress without too many conflicts. In a way, you get even more than what the magical program can give you: each event is dispatched in its own transaction, which means that from each event's point of view, we have the illusion that nobody else is running concurrently. This is exactly what all existing Twisted-/Stackless-/etc.-based programs are assuming.

Note that this solution, without transactions, already exists in some other languages: for example, Erlang is all about independent events. This is the simple case where we can just run them on multiple cores, knowing by construction of the language that you can't get conflicts. Of course, it doesn't work for Python or for a lot of other languages. From that point of view, what I'm suggesting is merely that transactional memory could be a good model to cope with the risks of conflicts that come from not having a special-made language.

Not a perfect solution

Of course, transactional memory (TM) is not a perfect solution either. Right now, the biggest issue is the performance hit that comes from the software implementation (STM). In time, hardware support (HTM) is likely to show up and help mitigate the problem; but I won't deny the fact that in some cases, because it's simple enough and/or because you really need the top performance, TM is not the best solution.

Also, the explanations above are silent on what is a hard point for TM, namely system calls. The basic general solution is to suspend other transactions as soon as a transaction does its first system call, so that we are sure that the transaction will succeed. Of course this solution is far from optimal. Interestingly, it's possible to do better on a case-by-case basis: for example, by adding in-process buffers, we can improve the situation for sockets, by having recv() store in a buffer what is received so that it can be re-recv()-ed later if the transaction is aborted; similarly, send() or writes to log files can be delayed until we are sure that the transaction will commit.

From my point of view, the most important point is that the TM solution comes from the correct side of the "determinism" scale. With threads, you have to prune down non-determinism. With TM, you start from a mostly deterministic point, and if needed, you add non-determinism. The reason you would want to do so is to make the transactions shorter: shorter transactions have less risks of conflicts, and when there are conflicts, less things to redo. So making transactions shorter increases the parallelism that your program can achieve, while at the same time requiring more care.

In terms of an event-driven model, the equivalent would be to divide the response of a big processing event into several events that are handled one after the other: for example, the first event sets things up and fires the second event, which does the actual computation; and afterwards a third event writes the results back. As a result, the second event's transaction has little risks of getting aborted. On the other hand, the writing back needs to be aware of the fact that it's not in the same transaction as the original setting up, which means that other unrelated transactions may have run in-between.

One step towards the future?

These, and others, are the problems of the TM approach. They are "new" problems, too, in the sense that the existing ways of programming don't have these problems.

Still, as you have guessed, I think that it is overall a win, and possibly a big win --- a win that might be on the same scale for the age of multiple CPUs as automatic garbage collection was 20 years ago for the age of RAM size explosion.

Stay tuned for more!

--- Armin (and reviews by Antonio and Fijal)


UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

22 comments:

  1. Great article, great solution to a big problem...

    I am really looking forward to this :-)

    As an experiment I have developed Pyworks, which makes objects concurrent and methods asynchronious. But it makes little sense to do performance test on an multicore CPU because of the GIL.

    The code for Pyworks can be found at https://bitbucket.org/raindog/pyworks

    ReplyDelete
  2. > These two models --- threads or events --- are the two main models we have right now.

    Where does Go-style concurrency fit in?

    ReplyDelete
  3. If you go that road, you will certainly find out that Transactional Memory is much, much harder to get right than it looks like in today effectful/imperative languages. Sure, it looks wonderful on paper, but if your language doesn't help you control side-effects it will give you a very hard time.

    Currently, there is satisfying STM support in Haskell (because of its tight type-based control of side-effects) and Clojure (beacuse of its tight control on mutability), and it might be getting into Scala.

    I doubt Python can easily get such control, at least without an important reorganization of idiomatic practices and frameworks, that go beyond the "let's be event-driven" decision. Which makes your "this is going to work magically" story a bit hard to believe.

    There has been intense research on this topic for some decades now, and several attempts at getting it to work in current mainstream languages have mostly failed.

    See for example this long retrospective of the STM.NET effort at Microsoft Research, by Joe Duffy:
    A (brief) retrospective on transactional memory
    or this shorter blog post by Brian Hurt:
    The problem with STM: your languages still suck.

    I was a bit disappointed that you didn't cite any of the relevant literature in your post. It made me suspicious of "reiventing the wheel"...

    ReplyDelete
  4. One major use-case for multithreading involves a large, unchanging data structure which many threads access. I.e., the data structure is loaded by a parent task, then not modified again; a number of threads are then spawned to use it for calculations.

    In CPython, the GIL makes this impossible if only because the reference counters need to be protected. With Cython in threads, however, you can turn off the GIL and do some work on C-style data structures.

    I'm wondering whether the STM PyPy effort could have a very useful, and very early, benefit: simply enabling an unchanging data structure to be accessed by a number of processors via the kinds of events you describe. There wouldn't be a need for transactions, because the programmer would take responsibility for only sharing unchanging structures between simultaneously-executing events.

    But it seems like the basic requirements for this kind of facility might be met in in early stage of STM development. And a solution that allowed multiple processors to access large, unchanging structures would be very useful in certain applications. I know I have one in mind that I'm looking at CPython/Cython for, but I'd rather see if I could get the performance I need from PyPy.

    Just thought it was worth mentioning.

    ReplyDelete
  5. @Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

    ReplyDelete
  6. @Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.

    [python-csp] http://code.google.com/p/python-csp/wiki/Tutorial

    ReplyDelete
  7. @gasche: I know about Haskell, Clojure and Scala, and I just read the two blog posts you pointed to.

    I'm not talking about giving explicit TM to the end programmers. I'm instead considering TM as an internal, implementation-only feature. That makes it very similar to GCs.

    I know the points and issues of traditional TM systems, which are nicely reported by Joe Duffy in "A (brief) retrospective on transactional memory". These are of course perfectly valid issues, but I think they do not apply (or "not that much") in the particular context I'm talking about. For example, this includes the large sections about nested transactions, and about consistency between the transactional and non-transactional worlds (Weak or Strong Atomicity, The Privatization Problem). Even "Where is the Killer App?" is obvious in this case: any existing Twisted App is potentially a Killer App.

    Sorry for not including references to papers. I must admit I don't know any paper that describes a similar use case for TM.

    ReplyDelete
  8. The link to the previous blog post is broken. It should be: http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html

    ReplyDelete
  9. > @Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.

    Stackless (which PyPy enables) supports Go-style channels as well, no?

    http://www.stackless.com/wiki/Channels

    ReplyDelete
  10. Your idea could work for other easy to inject into points, such as loops, and comprehensions. Especially with much of the work in pypy already done for identifying information about loops.

    How does this compare to grand central dispatch and blocks? http://en.wikipedia.org/wiki/Grand_Central_Dispatch

    Events are a very good way to model concurrency, and are widely used. It is a great place to dispatch concurrency into parallelism.

    Closures/blocks provide a fairly decent way to get some of the protection of STM - and in many programs give you the 80% solution. For code that plays nicely and avoids mutable, or global data - this works. Luckily, a lot of event based code is already written in this way. As you say, they are "generally mostly independent".

    Making the bad cases a quick fail, like in JavaScript worker threads could be an ok option. As soon as someone tries to access global data(do a system call, access the DOM, or access data outside the closure even), the program would fail there. Then you could fix those cases, or "add non-determinism" as you say. I think I'd prefer fail fast here, rather than have to detect these problems, and have them silently pass by.

    You still have scheduling problems, and trying to figure out task size. As well, this does not solve lots of other problems. However, it is cool that it could be applied automatically, and probably 'safely'.

    Another random thought... you could probably mark chunks of code as 'pure' as your run through them, and if they do a system call or mutate global data mark them as 'unpure' and don't try them again.

    I very much look forward to reading your results as you implement more.

    ReplyDelete
  11. When Armin gets this excited I'd fasten my seatbelt and put my goggles on.

    Thank you for letting me be an (otherwise mostly silent) observer.

    Please keep shifting boundaries!

    - Eric

    ReplyDelete
  12. Update: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

    https://bitbucket.org/arigo/arigo/raw/default/hack/stm/transactionmodule/

    ReplyDelete
  13. @Armin: That transaction code looks very simple. It seems trivial to implement a map/mapReduce style function on top of your transaction module.

    It is a very similar API to worker pool APIs which many thread using programs use. The main difference is that you combine the join() in the run method. It seems that a threaded web server for example could use this? What would happen if each incoming request comes in, and is put into the transaction (and say the 10th request has an error)? Would it be better to use multiple transactions?

    Have you thought how thread local storage would work?

    ReplyDelete
  14. @notme: yes, a web server or anything can use this instead of using threads. It's of course missing a convincing select() or poll() version for that.

    The details haven't been thought out; right now an exception interrupts everything. In an STM model it's unclear if concurrent transactions should still be allowed to complete or not. Anyway the point is that exceptions should not really occur because precisely they interrupt everything --- you would typically add instead in every transaction code like "try: .. except: traceback.print_exc()".

    Thread local storage: what would be the point?

    ReplyDelete
  15. I also see no reason for Thread local memory.

    I like the idea of thinking about TM in the same line as GC. When you have GC the changes to the language is that you don't need to write free/dealloc.

    Having TM would mean that you don't have to write acquire_GIL

    ReplyDelete
  16. The devil's in the details.

    I'm not sure I buy your conclusions here. STM is not a panacea for solving concurrency issues, and it has some key limitations that limit its general applicability.

    On what granularity do you plan to have transactions? How do you know? Perhaps the VM will have enough knowledge of a given thread's activities to limit transactional overhead to only those structures in memory that are shared, but there still needs to be some indirection in case another thread hops in and starts making changes.

    Where do transactions start and end? In STMs I know, the in-transaction overhead for reading and writing data is *much* higher, since it needs to know if someone else has committed a transaction first and be able to roll back.

    Perhaps this is all intended to be hidden, and you never actually have "threads" that the user can see. But if you're going to parallelize, you'll have threads *somewhere* that are going to contend for resources. If they're going to contend for resources, even in an STM, they're going to have to check for contention, register their interest, and then you're back to the indirection overhead.

    Perhaps I'm not understand what your end goal is. You can't simply turn the world into a series of transactions unless you want every read and write to have transaction overhead or you have some clear way of limiting transaction overhead to only where it's needed. You cite Erlang...but Erlang deals with immutable objects, and there's far less need for anything like an STM. Others have mentioned Clojure...but again, Clojure is mostly immutable structures, and transactional overhead is limited to Refs, where you'll make single coarse-grained reads and writes.

    Am I missing the point? Are you not suggesting VM-wide STM, with the resulting transactional overhead for every read and write?

    ReplyDelete
  17. @Charles: Indeed, I am suggesting VM-wide STM, with the resulting transactional overhead for every read and write. I actually got such a VM yesterday (with no GC): it seems to be about 10x slower on a single thread.

    Note that even 10x slower is a plus if it scales to dozens of processors. But of course, a better point of view is that some years ago the regular pypy *was* 10x slower than CPython. It was a lot of efforts but we managed to make it only 1.5-2x slower. And this is all without counting the JIT. If STM bogs down to a generally-not-triggered read barrier before every read, then the performance impact could be well under 2x.

    Please note also that I don't care about Java-like performance where even loosing 10% of performance would be a disaster. If we end up with a pypy-tm that is 2x slower than a regular pypy, I would be quite happy, and I believe that there is a non-negligible fraction of the Python users that would be, too.

    On granularity: for now I'm going with the idea that the granularity is defined "naturally" in the source program as the amount of work done every time some central dispatch loop calls some code. There might be several dispatch loops in total, too. This is true in the cases I can think of: typical Twisted or Stackless programs, pypy's "translate.py", the richards benchmark, etc.

    Please look at http://paste.pocoo.org/show/539822/ for an example of what I'm talking about. It's a diff against the standard richards.py: it is a pure Python user program in which I added calls to the new 'transaction' module. At this level there is no hint of Transactional Memory.

    ReplyDelete
  18. @Gary Robinson: (off-topic:) for this kind of use case, you can use os.fork() after the immutable data is ready. It "kind of works" both in pypy and in cpython, although not really --- in cpython the reference counts are modified, causing the pages to get unshared between processes; and in pypy the garbage collector (GC) has the same effect, so far. It could be solved in pypy by more tweaks the GC.

    ReplyDelete
  19. @armin: @Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

    Armin, Stackless Python uses a model that at the API level is very similar to Go. Go borrows from the Bell Labs family of languages (i.e. Newsqueak). The fundamental idea is that message pasing is used to share information between threads/processes/coroutines. In this regard, Go is in the same camp as say, Erlang (although the messaging systems are different).


    What I think is interesting and workable for Python are efforts in languages like Polyphonic C# (see the paper "Scalable Join Patterns") and Concurrent/Parallel ML, where lock-free libraries and STM techniques are used under the hood to improve the efficiency of the messaging/synchronisation system. In this fashion, the programmer has a conceptually clean concurrency model and still can make the important decisions about how to partition the problem.

    Cheers,
    Andrew

    ReplyDelete
  20. @daniel@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python library

    I have looked at Python-CSP a long time ago. I recall it being verbose. However I use Stackless Python. And using PyPy's stackless.py, I implemented select() and join patterns. Sometimes I wish I had language support: they cut down on silly mistakes and make the code less verbose for simple cases. However what I have found is that the language can get in the way. For instance, in Go, one has to come up with hacks to do some simple like do a select on an arbitrary number of channels. Perhaps I am wrong but I suspect stuff like select()'s design was influenced by the fact Newsqueak was originally designed to make a windowing system easier to write. So one is monitoring only a handful of channels. In constrast, this is not the way Stackless Python programmes are written.

    Cheers,
    Andrew

    ReplyDelete
  21. A link to a group that did the same thing (thanks a lot Andrew for this link!):

    http://research.microsoft.com/en-us/projects/ame/

    In particular the May 2007 paper (HotOS) nicely summarizes exactly what I'm trying to say, and I think it is clearer than me, if I have to jugde from feedback :-)

    ReplyDelete
  22. Speaking as someone maintaining a large application that uses Twisted, this sounds great.

    ReplyDelete

See also PyPy's IRC channel: #pypy at freenode.net, or the pypy-dev mailing list.
If the blog post is old, it is pointless to ask questions here about it---you're unlikely to get an answer.