Comments on PyPy Status Blog: "Global Interpreter Lock, or how to kill it"
(feed by Carl Friedrich Bolz-Tereick, last updated 2024-03-11)

shawn (2011-12-31):
Have you looked at all at "Worlds" as a simpler interface to STM?
http://www.vpri.org/pdf/tr2011001_final_worlds.pdf

Daniel Waterworth (2011-12-11):
I managed to write a Haskell STM implementation in a single morning. It may not be the most efficient implementation (I've found it to be about half the speed of the GHC implementation in the limited testing I've done), but it's really simple and only uses atomic CAS.
https://gist.github.com/1454995

Maciej Fijalkowski (2011-10-28):
@halfaleague Get in contact. pypy@sfconservancy.org is the right address for non-profit funding inquiries.

halfaleague (2011-10-28):
How can we fund this?

klaussfreire (2011-10-14):
I can see a use for STM in CPython too, though. Even though it seems not to be applicable, that need not be true.

I worked on making reference counting thread-friendly: when you have multiple threads reading a big data structure, CPython's reference counting turns all the reads into writes, which is awful for performance.

I wrote a patch to pack all those writes into the same memory page (i.e. reference pools, external reference counting), and was working on a patch for STM reference-count updates.

The thing with STM and reference counting is that many operations cancel out at the end of the transaction. When you just read objects while performing computations, you acquire a reference, work, then release it. In the end, STM here would remove the need to write to shared memory.

From working on that patch, I can tell that CPython could be made to use STM techniques. You already have thread-local storage at the VM level, macros handle almost all reference-counting operations, and it's all abstracted enough that it might be possible.

For reference counting, the only problem is that STM is way slower for single-threaded applications. WAY slower. For multithreaded code it pays off considerably, but the CPython developers are very strongly set on favouring single-threaded performance.
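klaussfreire's observation that most reference-count operations cancel out by the end of a transaction can be illustrated with a toy sketch (the class and names below are illustrative, not CPython's actual API): increfs and decrefs go into a thread-local delta buffer, and only nonzero net deltas are written to the shared counts at commit.

```python
from collections import defaultdict

class RefcountTransaction:
    """Toy model (not CPython): buffer incref/decref as thread-local
    deltas; only nonzero net deltas touch the shared counts at commit."""
    def __init__(self, shared_counts):
        self.shared = shared_counts     # dict: object name -> shared refcount
        self.deltas = defaultdict(int)  # thread-local delta buffer
        self.writes = 0                 # shared-memory writes actually performed

    def incref(self, name):
        self.deltas[name] += 1

    def decref(self, name):
        self.deltas[name] -= 1

    def commit(self):
        for name, delta in self.deltas.items():
            if delta != 0:              # cancelled pairs never reach shared memory
                self.shared[name] += delta
                self.writes += 1
        self.deltas.clear()

shared = {"big_structure": 1}
tx = RefcountTransaction(shared)
tx.incref("big_structure")  # a read access takes a reference...
tx.decref("big_structure")  # ...and releases it: net delta is zero
tx.commit()
print(shared["big_structure"], tx.writes)  # -> 1 0
```

A pure read access (incref, work, decref) produces a net delta of zero, so it never writes to shared memory, which is exactly the win described in the comment above.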
Anonymous (2011-07-29):
Also, what about the performance of the lazy commit method used in the post? Every transaction will create additional memory. Is that really efficient? IMHO this model is aimed at a very small number of use cases.

Armin Rigo (2011-07-24):
@Tuure Laurinolli: Yes, but PyPy has no refcounts. I was just discussing the pros and cons of the proposed locking solution for CPython (which is off-topic as far as this original blog post is concerned). I don't even want to think about STM for CPython :-)

For your second question: from the user's point of view, the semantics we would get with STM are automatically the same as with the GIL, which is why I like the approach.

Raymin (2011-07-17):
Each thread should have its own interpreter. Threads should interact like native OS threads, using the OS interaction methods wrapped by Python. I want to embed multiple interpreters in my C code! Please kill the GIL!!!

Tuure Laurinolli (2011-07-11):
@Armin Rigo: You'd need similar atomic instructions for an STM implementation too, although perhaps not as many. In any case they should be about as cheap as L1 cache writes unless there's contention, but then things are going to be slow anyway if you have contention. Of course you might have false sharing of objects etc. to muddle things up.

In any case, what sort of semantics would a GIL-free Python have in the multi-threaded case, compared to the current GIL-infested Python? Can each opcode be assumed to execute atomically?
Armin Rigo (2011-07-09):
@Skandalfo: Right, indeed. I don't know exactly the cost of such atomic operations. Maybe it's fine, but I fear that doing tons of increfs/decrefs all the time (as needed for refcounts in CPython's simple interpreter) has a significant cost.

Skandalfo (2011-07-06):
@Armin Rigo: If the problem with the RW-lock approach in CPython is just about reference-count updates and checks, perhaps those could be done via atomic primitives, as supported on most modern architectures. This is what boost::shared_ptr does, IIRC, to make its pointers thread-safe by default.

Anonymous (2011-07-06):
What is wrong with Jython's lock model? Java is a pretty efficient language, no? And there is also no need to acquire locks for objects that you can prove won't cause conflicts...

WhiteLynx (2011-07-06):
I love this idea.

Just musing on an implementation detail here, but isn't the "lazy" STM implementation's transaction system effectively just an in-memory implementation of copy-on-write (http://en.wikipedia.org/wiki/Copy-on-write) semantics? It might be interesting to look at other things that have used COW for inspiration (ZFS and btrfs come to mind). I like the idea that committing a transaction for a given object would just involve changing the object's address in memory to the modified copy.

Also, I'd be interested to see the read/write lock system get implemented, because it seems like it might be a better choice for programs that only use a couple of threads.
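WhiteLynx's copy-on-write reading of the lazy STM scheme can be sketched as a toy transaction (the names are illustrative, and real STM conflict detection is omitted): the first write to an object makes a private copy, and commit simply repoints the committed heap at the copies.

```python
import copy

class COWTransaction:
    """Toy copy-on-write transaction: reads see the committed objects,
    the first write makes a private copy, and commit swaps the private
    copies in (the "change the object's address" step)."""
    def __init__(self, heap):
        self.heap = heap   # name -> committed object
        self.local = {}    # name -> private modified copy

    def read(self, name):
        return self.local.get(name, self.heap[name])

    def write(self, name):
        if name not in self.local:
            # lazy copy: only made the first time the object is written
            self.local[name] = copy.deepcopy(self.heap[name])
        return self.local[name]

    def commit(self):
        # the committed heap now points at the modified copies
        self.heap.update(self.local)
        self.local.clear()

heap = {"lst": [1, 2, 3]}
tx = COWTransaction(heap)
tx.write("lst").append(4)  # mutates only the private copy
print(heap["lst"])         # -> [1, 2, 3]  (change not yet visible)
tx.commit()
print(heap["lst"])         # -> [1, 2, 3, 4]
```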
Armin Rigo (2011-07-06):
@Skandalfo: This cannot work with CPython, because of reference counting: every bytecode modifies reference counts, so it needs the "write" lock. But it could be a possible idea to consider for PyPy.

Skandalfo (2011-07-02):
There's an intermediate option between the GIL and the careful locking done by Jython, which I looked at some time ago to make Python more thread-friendly.

Just exchanging the GIL for a global readers-writer lock would allow Python to use much more concurrency. All Python code performing read-only operations on objects would run under the reader lock. For modifying built-in mutable objects, for things like the case involving both lists in the Jython example, or when calling into C modules, you would have to acquire the writer version of the lock.

Python threads would relinquish the reader lock every N opcodes, just as is done now for the GIL, and I guess acquisition of the writer lock should be given priority over the reader ones.

This approach should be simpler to implement than the transactional-memory approach, and it should be possible to bake it into CPython too. I think I remember reading a discussion about this somewhere, but it didn't seem to come to anything...

Jacob Hallén (2011-07-01):
@EmilK: I think that would be very uncool. You would allow the developer to introduce bugs that are extremely hard to locate. Parallel programs are quite difficult to get right to start with, and anyone who does not have a complete understanding of what constitutes a critical section will be very likely to make an error.
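The global readers-writer scheme Skandalfo describes a few comments up might look roughly like this (a minimal writer-priority sketch using threading.Condition; the RWLock class and the demo functions are illustrative, not CPython internals):

```python
import threading

class RWLock:
    """Minimal writer-priority readers-writer lock: many readers may
    hold it at once; a waiting writer blocks new readers (priority)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writers_waiting = 0
        self._writer_active = False

    def acquire_read(self):
        with self._cond:
            while self._writer_active or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            while self._writer_active or self._readers:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()

lock = RWLock()
total = 0

def reader():
    lock.acquire_read()   # read-only opcodes would run here
    _ = total
    lock.release_read()

def writer():
    global total
    lock.acquire_write()  # mutation or a C-module call would run here
    total += 1
    lock.release_write()

threads = [threading.Thread(target=f) for f in (reader, writer, reader, writer)]
for t in threads: t.start()
for t in threads: t.join()
print(total)  # -> 2
```

Relinquishing the reader lock every N opcodes, as the comment suggests, would correspond to periodically calling release_read() and acquire_read() in the interpreter loop.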
EmilK (2011-07-01):
It would be cool if the Python programmer could mark "uncritical" sections, so that the STM bookkeeping is disabled in those sections where the programmer knows there is no concurrency.

Armin Rigo (2011-06-30):
@⚛: To complete Ben's answer: yes, you are correct, but that's why the translation step "insert STM logic" is never going to be mandatory. You will get either a regular pypy-c-gil or a pypy-c-stm, as two different executables, and you will choose the one best suited for your particular program. I still expect pypy-c-gil to be the most used one, with pypy-c-stm an alternative that is only useful for people with massively multi-threaded programs.

Ben (2011-06-30):
@Michael Foord: STM can definitely be made as fine-grained as you like; some existing STM systems operate at the level of machine words. Given that this one will be operating at the interpreter level, I would guess that code working on different sections of the same object (or array) would be able to run in parallel, but it depends on how the tradeoffs play out.

Ben (2011-06-30):
@⚛: That's how parallelization goes. There are overheads, and the only way to make up for them is to hope you have enough parallel speedup. STM (and any approach to this problem based on fine-grained locking) works best when only a small known set of objects is shared between threads, and only those are synchronized; unfortunately that cannot be the case for a general GIL-removal proposal.

However, I think PyPy's JIT could help a little here. The escape analysis PyPy already does can also prove "this value cannot be accessed by another thread", which could be used to avoid logging some values, since they cannot conflict with parallel transactions. There are probably some more STM-specific optimizations the JIT could do as well.
Michael Foord (2011-06-30):
So if there is only one thread, will transactions be disabled?

I wonder how "fine-grained" transactions will be: if you have parallel operations working concurrently on a large array, do you think you will be able to allow threads to simultaneously modify different areas of the array?

Jan Ziak (atomsymbol) (2011-06-30):
I am not sure what to make of the solution (STM) to the GIL you proposed in the article. You are essentially suggesting to slow down all Python programs in PyPy by a factor of, say, 4, and hoping to recover the loss for a very small percentage of programs on an 8-core machine.

That can't be right. Please tell me I am dreaming... :)

Kevin Granade (2011-06-30):
To address the anonymous question near the start of the comments, one way to detect commit collisions is to copy a global generation counter at the start of your transaction, then compare your stored copy to the current generation counter at commit time (after taking a lock); if no one else has incremented the generation counter, you increment it and complete your operation.

So the transaction start does:

    self.generation = global.generation

And commit does:

    if lock(global.lock):
        if self.generation == global.generation:
            global.generation += 1
            unlock(global.lock)
            return True
        unlock(global.lock)
    return False

Armin Rigo (2011-06-30):
@tuomasjjrasanen: Yes, actually the first paper is from the 80s. But I think it's only from around 2003 or 2004 that research seriously started, in the sense that papers were produced regularly by several teams.
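Kevin Granade's pseudocode above translates almost directly into runnable Python (the class and method names are mine): the second transaction sees that the generation counter moved under it and must abort.

```python
import threading

class GenerationSTM:
    """Global generation counter: a transaction snapshots the counter at
    start; at commit it wins only if nobody else committed (bumped the
    counter) in the meantime."""
    def __init__(self):
        self.lock = threading.Lock()
        self.generation = 0

    def begin(self):
        return self.generation           # snapshot at transaction start

    def commit(self, start_generation):
        with self.lock:
            if start_generation == self.generation:
                self.generation += 1     # publish: invalidates concurrent transactions
                return True
            return False                 # someone committed first: caller must retry

stm = GenerationSTM()
t1 = stm.begin()
t2 = stm.begin()          # second transaction starts before t1 commits
ok1 = stm.commit(t1)
ok2 = stm.commit(t2)
print(ok1, ok2)           # -> True False
```

Note that this detects write collisions very coarsely: any successful commit anywhere aborts every other in-flight transaction, which is the trade-off of using a single global counter.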