Comments on PyPy Status Blog: Transactional Memory (II)

Speaking as someone maintaining a large applicatio...

2012-02-27T18:57:35.221+01:00

Speaking as someone maintaining a large application that uses Twisted, this sounds great.

A link to a group that did the same thing (thanks ...

2012-02-01T21:39:56.083+01:00

A link to a group that did the same thing (thanks a lot Andrew for this link!):

http://research.microsoft.com/en-us/projects/ame/

In particular the May 2007 paper (HotOS) nicely summarizes exactly what I'm trying to say, and I think it is clearer than me, if I have to jugde from feedback :-)

@daniel@Armin, CSP may be built into Go, but IMO t...

2012-02-01T19:59:32.540+01:00

@daniel@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python library

I have looked at Python-CSP a long time ago. I recall it being verbose. However I use Stackless Python. And using PyPy's stackless.py, I implemented select() and join patterns. Sometimes I wish I had language support: they cut down on silly mistakes and make the code less verbose for simple cases. However what I have found is that the language can get in the way. For instance, in Go, one has to come up with hacks to do some simple like do a select on an arbitrary number of channels. Perhaps I am wrong but I suspect stuff like select()'s design was influenced by the fact Newsqueak was originally designed to make a windowing system easier to write. So one is monitoring only a handful of channels. In constrast, this is not the way Stackless Python programmes are written.

Cheers,
Andrew

@armin: @Anonymous: in the extract you cite I mean...

2012-02-01T19:43:36.103+01:00

@armin: @Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

Armin, Stackless Python uses a model that at the API level is very similar to Go. Go borrows from the Bell Labs family of languages (i.e. Newsqueak). The fundamental idea is that message pasing is used to share information between threads/processes/coroutines. In this regard, Go is in the same camp as say, Erlang (although the messaging systems are different).

What I think is interesting and workable for Python are efforts in languages like Polyphonic C# (see the paper "Scalable Join Patterns") and Concurrent/Parallel ML, where lock-free libraries and STM techniques are used under the hood to improve the efficiency of the messaging/synchronisation system. In this fashion, the programmer has a conceptually clean concurrency model and still can make the important decisions about how to partition the problem.

Cheers,
Andrew

@Gary Robinson: (off-topic:) for this kind of use ...

2012-01-31T18:13:36.789+01:00

@Gary Robinson: (off-topic:) for this kind of use case, you can use os.fork() after the immutable data is ready. It "kind of works" both in pypy and in cpython, although not really --- in cpython the reference counts are modified, causing the pages to get unshared between processes; and in pypy the garbage collector (GC) has the same effect, so far. It could be solved in pypy by more tweaks the GC.

@Charles: Indeed, I am suggesting VM-wide STM, wit...

2012-01-24T11:03:51.876+01:00

@Charles: Indeed, I am suggesting VM-wide STM, with the resulting transactional overhead for every read and write. I actually got such a VM yesterday (with no GC): it seems to be about 10x slower on a single thread.

Note that even 10x slower is a plus if it scales to dozens of processors. But of course, a better point of view is that some years ago the regular pypy *was* 10x slower than CPython. It was a lot of efforts but we managed to make it only 1.5-2x slower. And this is all without counting the JIT. If STM bogs down to a generally-not-triggered read barrier before every read, then the performance impact could be well under 2x.

Please note also that I don't care about Java-like performance where even loosing 10% of performance would be a disaster. If we end up with a pypy-tm that is 2x slower than a regular pypy, I would be quite happy, and I believe that there is a non-negligible fraction of the Python users that would be, too.

On granularity: for now I'm going with the idea that the granularity is defined "naturally" in the source program as the amount of work done every time some central dispatch loop calls some code. There might be several dispatch loops in total, too. This is true in the cases I can think of: typical Twisted or Stackless programs, pypy's "translate.py", the richards benchmark, etc.

Please look at http://paste.pocoo.org/show/539822/ for an example of what I'm talking about. It's a diff against the standard richards.py: it is a pure Python user program in which I added calls to the new 'transaction' module. At this level there is no hint of Transactional Memory.

The devil's in the details. I'm not sure ...

2012-01-24T05:22:00.884+01:00

The devil's in the details.

I'm not sure I buy your conclusions here. STM is not a panacea for solving concurrency issues, and it has some key limitations that limit its general applicability.

On what granularity do you plan to have transactions? How do you know? Perhaps the VM will have enough knowledge of a given thread's activities to limit transactional overhead to only those structures in memory that are shared, but there still needs to be some indirection in case another thread hops in and starts making changes.

Where do transactions start and end? In STMs I know, the in-transaction overhead for reading and writing data is *much* higher, since it needs to know if someone else has committed a transaction first and be able to roll back.

Perhaps this is all intended to be hidden, and you never actually have "threads" that the user can see. But if you're going to parallelize, you'll have threads *somewhere* that are going to contend for resources. If they're going to contend for resources, even in an STM, they're going to have to check for contention, register their interest, and then you're back to the indirection overhead.

Perhaps I'm not understand what your end goal is. You can't simply turn the world into a series of transactions unless you want every read and write to have transaction overhead or you have some clear way of limiting transaction overhead to only where it's needed. You cite Erlang...but Erlang deals with immutable objects, and there's far less need for anything like an STM. Others have mentioned Clojure...but again, Clojure is mostly immutable structures, and transactional overhead is limited to Refs, where you'll make single coarse-grained reads and writes.

Am I missing the point? Are you not suggesting VM-wide STM, with the resulting transactional overhead for every read and write?

I also see no reason for Thread local memory. I l...

2012-01-18T11:06:01.492+01:00

I also see no reason for Thread local memory.

I like the idea of thinking about TM in the same line as GC. When you have GC the changes to the language is that you don't need to write free/dealloc.

Having TM would mean that you don't have to write acquire_GIL

@notme: yes, a web server or anything can use this...

2012-01-16T13:55:58.961+01:00

@notme: yes, a web server or anything can use this instead of using threads. It's of course missing a convincing select() or poll() version for that.

The details haven't been thought out; right now an exception interrupts everything. In an STM model it's unclear if concurrent transactions should still be allowed to complete or not. Anyway the point is that exceptions should not really occur because precisely they interrupt everything --- you would typically add instead in every transaction code like "try: .. except: traceback.print_exc()".

Thread local storage: what would be the point?

@Armin: That transaction code looks very simple. ...

2012-01-16T13:11:41.330+01:00

@Armin: That transaction code looks very simple. It seems trivial to implement a map/mapReduce style function on top of your transaction module.

It is a very similar API to worker pool APIs which many thread using programs use. The main difference is that you combine the join() in the run method. It seems that a threaded web server for example could use this? What would happen if each incoming request comes in, and is put into the transaction (and say the 10th request has an error)? Would it be better to use multiple transactions?

Have you thought how thread local storage would work?

Update: please look at the tiny transaction module...

2012-01-16T11:08:46.467+01:00

Update: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

https://bitbucket.org/arigo/arigo/raw/default/hack/stm/transactionmodule/

When Armin gets this excited I'd fasten my sea...

2012-01-15T09:56:39.633+01:00

When Armin gets this excited I'd fasten my seatbelt and put my goggles on.

Thank you for letting me be an (otherwise mostly silent) observer.

Please keep shifting boundaries!

- Eric

Your idea could work for other easy to inject into...

2012-01-15T09:03:05.617+01:00

Your idea could work for other easy to inject into points, such as loops, and comprehensions. Especially with much of the work in pypy already done for identifying information about loops.

How does this compare to grand central dispatch and blocks? http://en.wikipedia.org/wiki/Grand_Central_Dispatch

Events are a very good way to model concurrency, and are widely used. It is a great place to dispatch concurrency into parallelism.

Closures/blocks provide a fairly decent way to get some of the protection of STM - and in many programs give you the 80% solution. For code that plays nicely and avoids mutable, or global data - this works. Luckily, a lot of event based code is already written in this way. As you say, they are "generally mostly independent".

Making the bad cases a quick fail, like in JavaScript worker threads could be an ok option. As soon as someone tries to access global data(do a system call, access the DOM, or access data outside the closure even), the program would fail there. Then you could fix those cases, or "add non-determinism" as you say. I think I'd prefer fail fast here, rather than have to detect these problems, and have them silently pass by.

You still have scheduling problems, and trying to figure out task size. As well, this does not solve lots of other problems. However, it is cool that it could be applied automatically, and probably 'safely'.

Another random thought... you could probably mark chunks of code as 'pure' as your run through them, and if they do a system call or mutate global data mark them as 'unpure' and don't try them again.

I very much look forward to reading your results as you implement more.

> @Armin, CSP may be built into Go, but IMO thi...

2012-01-15T08:24:40.099+01:00

> @Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.

Stackless (which PyPy enables) supports Go-style channels as well, no?

http://www.stackless.com/wiki/Channels

The link to the previous blog post is broken. It s...

2012-01-14T22:45:56.137+01:00

The link to the previous blog post is broken. It should be: http://morepypy.blogspot.com/2011/06/global-interpreter-lock-or-how-to-kill.html

@gasche: I know about Haskell, Clojure and Scala, ...

2012-01-14T22:11:38.768+01:00

@gasche: I know about Haskell, Clojure and Scala, and I just read the two blog posts you pointed to.

I'm not talking about giving explicit TM to the end programmers. I'm instead considering TM as an internal, implementation-only feature. That makes it very similar to GCs.

I know the points and issues of traditional TM systems, which are nicely reported by Joe Duffy in "A (brief) retrospective on transactional memory". These are of course perfectly valid issues, but I think they do not apply (or "not that much") in the particular context I'm talking about. For example, this includes the large sections about nested transactions, and about consistency between the transactional and non-transactional worlds (Weak or Strong Atomicity, The Privatization Problem). Even "Where is the Killer App?" is obvious in this case: any existing Twisted App is potentially a Killer App.

Sorry for not including references to papers. I must admit I don't know any paper that describes a similar use case for TM.

@Armin, CSP may be built into Go, but IMO this was...

2012-01-14T21:27:36.388+01:00

@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.

[python-csp] http://code.google.com/p/python-csp/wiki/Tutorial

@Anonymous: in the extract you cite I meant "...

2012-01-14T20:27:27.451+01:00

@Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

One major use-case for multithreading involves a l...

2012-01-14T17:57:36.111+01:00

One major use-case for multithreading involves a large, unchanging data structure which many threads access. I.e., the data structure is loaded by a parent task, then not modified again; a number of threads are then spawned to use it for calculations.

In CPython, the GIL makes this impossible if only because the reference counters need to be protected. With Cython in threads, however, you can turn off the GIL and do some work on C-style data structures.

I'm wondering whether the STM PyPy effort could have a very useful, and very early, benefit: simply enabling an unchanging data structure to be accessed by a number of processors via the kinds of events you describe. There wouldn't be a need for transactions, because the programmer would take responsibility for only sharing unchanging structures between simultaneously-executing events.

But it seems like the basic requirements for this kind of facility might be met in in early stage of STM development. And a solution that allowed multiple processors to access large, unchanging structures would be very useful in certain applications. I know I have one in mind that I'm looking at CPython/Cython for, but I'd rather see if I could get the performance I need from PyPy.

Just thought it was worth mentioning.

If you go that road, you will certainly find out t...

2012-01-14T17:50:26.747+01:00

If you go that road, you will certainly find out that Transactional Memory is much, much harder to get right than it looks like in today effectful/imperative languages. Sure, it looks wonderful on paper, but if your language doesn't help you control side-effects it will give you a very hard time.

Currently, there is satisfying STM support in Haskell (because of its tight type-based control of side-effects) and Clojure (beacuse of its tight control on mutability), and it might be getting into Scala.

I doubt Python can easily get such control, at least without an important reorganization of idiomatic practices and frameworks, that go beyond the "let's be event-driven" decision. Which makes your "this is going to work magically" story a bit hard to believe.

There has been intense research on this topic for some decades now, and several attempts at getting it to work in current mainstream languages have mostly failed.

See for example this long retrospective of the STM.NET effort at Microsoft Research, by Joe Duffy:
A (brief) retrospective on transactional memory
or this shorter blog post by Brian Hurt:
The problem with STM: your languages still suck.

I was a bit disappointed that you didn't cite any of the relevant literature in your post. It made me suspicious of "reiventing the wheel"...

> These two models --- threads or events --- ar...

2012-01-14T16:38:43.283+01:00

> These two models --- threads or events --- are the two main models we have right now.

Where does Go-style concurrency fit in?

Great article, great solution to a big problem... ...

2012-01-14T16:17:02.952+01:00

Great article, great solution to a big problem...

I am really looking forward to this :-)

As an experiment I have developed Pyworks, which makes objects concurrent and methods asynchronious. But it makes little sense to do performance test on an multicore CPU because of the GIL.

The code for Pyworks can be found at https://bitbucket.org/raindog/pyworks