Wednesday, June 29, 2011

Global Interpreter Lock, or how to kill it

People that listened to my (Armin Rigo) lightning talk at EuroPython know that suddenly, we have a plan to remove the Global Interpreter Lock --- the infamous GIL, the thing in CPython that prevents multiple threads from actually running in your Python code in parallel.

That's not actually new, because Jython has been doing it all along. Jython works by very carefully adding locks to all the mutable built-in types, and by relying on the underlying Java platform to be efficient about them (so that the result is faster than, say, very carefully adding similar locks in CPython). By "very carefully", I mean really really carefully; for example, 'dict1.update(dict2)' needs to lock both dict1 and dict2, but if you do it naively, then a parallel 'dict2.update(dict1)' might cause a deadlock.

All of PyPy, CPython and IronPython have a GIL. But for PyPy we are considering a quite different approach than Jython's, based on Software Transactional Memory. This is a recent development in computer science, and it gives a nicer solution than locking. Here is a short introduction to it.

Say you want to atomically pop an item from 'list1' and append it to 'list2':

def f(list1, list2):
    x = list1.pop()
    list2.append(x)

This is not safe in multithreaded cases (even with the GIL). Say that you call f(l1, l2) in thread 1 and f(l2, l1) in thread 2. What you want is that it has no effect at all (x is moved from one list to the other, then back). But what can occur is that instead the top of the two lists are swapped, depending on timing issues.

One way to fix it is with a global lock:

def f(list1, list2):
    global_lock.acquire()
    x = list1.pop()
    list2.append(x)
    global_lock.release()

A finer way to fix it is with locks that come with the lists:

def f(list1, list2):
    acquire_all_locks(list1.lock, list2.lock)
    x = list1.pop()
    list2.append(x)
    release_all_locks(list1.lock, list2.lock)

The second solution is a model for Jython's, while the first is a model for CPython's. Indeed, in CPython's interpreter, we acquire the GIL, then we do one bytecode (or actually a number of them, like 100), then we release the GIL; and then we proceed to the next bunch of 100.

Software Transactional Memory (STM) gives a third solution:

def f(list1, list2):
    while True:
        t = transaction()
        x = list1.pop(t)
        list2.append(t, x)
        if t.commit():
            break

In this solution, we make a transaction object and use it in all reads and writes we do to the lists. There are actually several different models, but let's focus on one of them. During a transaction, we don't actually change the global memory at all. Instead, we use the thread-local transaction object. We store in it which objects we read from, which objects we write to, and what values we write. It is only when the transaction reaches its end that we attempt to "commit" it. Committing might fail if other commits have occurred in between, creating inconsistencies; in that case, the transaction aborts and must restart from the beginning.

In the same way as the previous two solutions are models for CPython and Jython, the STM solution looks like it could be a model for PyPy in the future. In such a PyPy, the interpreter would start a transaction, do one or several bytecodes, and then end the transaction; and repeat. This is very similar to what is going on in CPython with the GIL. In particular, it means that it gives programmers all the same guarantees as the GIL does. The only difference is that it can actually run multiple threads in parallel, as long as their code does not interfere with each other. (In particular, if you need not just the GIL but actual locks in your existing multi-threaded program, then this will not magically remove the need for them. You might get an additional built-in module that exposes STM to your Python programs, if you prefer it over locks, but that's another question.)

Why not apply that idea to CPython? Because we would need to change everything everywhere. In the example above, you may have noted that I no longer call 'list1.pop()', but 'list1.pop(t)'; this is a way to tell that the implementation of all the methods needs to be changed, in order to do their work "transactionally". This means that instead of really changing the global memory in which the list is stored, it must instead record the change in the transation object. If our interpreter is written in C, as CPython is, then we need to write it explicitly everywhere. If it is written instead in a higher-level language, as PyPy is, then we can add this behavior as as set of translation rules, and apply them automatically wherever it is necessary. Moreover, it can be a translation-time option: you can either get the current "pypy" with a GIL, or a version with STM, which would be slower due to the extra bookkeeping. (How much slower? I have no clue, but as a wild guess, maybe between 2 and 5 times slower. That is fine if you have enough cores, as long as it scales nicely :-)

A final note: as STM research is very recent (it started around 2003), there are a number of variants around, and it's not clear yet which one is better in which cases. As far as I can tell, the approach described in "A Comprehensive Strategy for Contention Management in Software Transactional Memory" seems to be one possible state-of-the-art; it also seems to be "good enough for all cases".

So, when will it be done? I cannot say yet. It is still at the idea stage, but I think that it can work. How long would it take us to write it? Again no clue, but we are looking at many months rather than many days. This is the sort of thing that I would like to be able to work on full time after the Eurostars funding runs out on September 1. We are currently looking at ways to use crowdfunding to raise money so that I can do exactly that. Expect a blog post about that very soon. But this looks like a perfect candidate for crowdfunding -- there are at least thousands of you who would be willing to pay 10s of Euros to Kill the GIL. Now we only have to make this happen.

46 comments:

Michael Foord said...

If you concurrently run two transactions that interfere with each other - and they both restart on failure - isn't there a possibility that neither would ever complete? How would you mitigate against that? (Fallback to a global lock after a certain number of transaction failures perhaps?)

Anonymous said...

There's a thing that is not clear to me: how do you detect failures during commits?

jdhardy said...

IronPython doesn't have a GIL - it's the same as Jython.

Michael Foord said...

Plus transactions have to be scoped around code that is side-effect free (or you can guarantee containing the side-effects within the transaction). Why STM research was done in Haskell I guess. Anyway, it sounds like a hard problem. That's why Armin is interested I guess... :-)

Antonio Cuni said...

@michael: if two transactions conflict, you rollback only one of those, and from the external the effect is the same as having one locked by the GIL

About side effects: the plan is to close a transaction before a side effect operation and reopen a new one after it: this is what happens already with the GIL, which is released e.g. before I/O calls.

At least, this is how I understand it, and since I'm not Armin I might be wrong :-)

Michael Foord said...

@antonio
Ah, that makes sense. Thanks. :-)

Anonymous said...

This sounds like a great idea...

What happens when transaction interleaves together and fail? Both threads will still continue trying so to me this appears to be somewhat as efficient as locks. (Note I know nothing in this topic and would definitely like to learn more).

Sebastian Noack said...

I don't think that the primary reason STM is slower than the GIL, is the extra bookkeeping, but the fact that things need to be repeated. However, I could imagine, that STM still might yield better response times than acquiring locks, in some cases.

Sebastian Noack said...
This comment has been removed by the author.
xyproto said...

I can imagine the reason this is efficient is because code often work on different parts of memory in different threads.

ChrisW said...

Hmm, ZODB has this kind of optimistic transaction committing, it results in having to deal with ConflictErrors and slowness from retrying requests when they conflict amongst other pain. If that's the price for losing the GIL, I'll stick with the GIL, thanks...

Zemantic dreams said...

Ok, so where can we give a small contribution?




Andraz Tori, Zemanta

Richard said...

Have you read about Microsoft's abandoned attempt to bring STM to .NET? Have you considered the problems they had?

Jon Morgan said...

Interesting idea, but some questions:
1. What do C extensions do? (extensions designed for CPython that are using GIL methods). Would they still be able to be used, or would they have to be rewritten for PyPy?

2. What happens if repeatable operations are interleaved with operations that are not repeatable? (e.g. logging values to a file - we wouldn't want it to happen twice if there was a conflict, unless of course you are using that logging to trace what is happening...).

Ben said...

@Michael Foord: In state-of-the-art lazy[1] STM systems, the probability of two transactions continually causing each other to restart is minuscule. A transaction only causes another one to restart when it tries to commit. So when somebody restarts, it means that someone else has successfully committed.

[1] In "Lazy" STMs, transactions only get exclusive access to the things they're trying to write to for a very short window of time at the end. This means they have to record writes in a transaction log, as Armin described, because there might be many pending writes for the same object. An alternative design is "eager" STM, where transactions write directly and have to "undo" their writes if they get aborted. Eager systems look good on paper, but in my opinion they're not worth it. With eager STM, the runtime system has to be very carefully designed to avoid livelock (when the system hangs because some transactions constantly abort each other). Lazy STM is almost impossible to livelock in practice, because even if some transactions are highly conflicting at least one of them (almost always) has to commit.

Ben said...

Also, my honours project was implementing most of an STM system, and I've been a long time fan of (and sometime tinkerer with) PyPy, so I would be very interested in where this goes.

And I know this is extremely premature, but if there were enough money coming in for this project and the PyPy team were willing to include outside developers, I would absolutely love to put serious work into this.

Armin Rigo said...

@Richard: reading the web page you point out, Microsoft's STM attempt (like most others I'm aware of) seems to work at a different level: basically as a library for application programmers. I can go through all 4 points and show why they are not relevant in our context:

* any visible I/O (e.g. writing to a file or a log) is going to end the transaction and start the next one, just like the GIL is released and re-acquired around most calls to the C library's write() function

* the 2nd issue is moot, because STM will be an internal detail in PyPy, not a user-visible feature

* the 3nd issue he describes is about "update-in-place" STM, which I believe is not the best solution: we want instead to keep a local log of the changes, and apply them only at commit-time (as described e.g. in the paper I pointed out)

* the final issue is the lack of real successes with STM. Well, we can't do anything about that ahead of time :-)

Anonymous said...

One note on the lock-based example you gave, that locks list1 and then list2: It isn't free of deadlocks!

Having two threads call the function simultaneously with swapped args may cause a deadlock. See the bank account problem.

Groupdmt said...
This comment has been removed by a blog administrator.
Armin Rigo said...

@Anonymous: yes, I know it can deadlock. I have hidden the problem into some theoretical function acquire_all_locks(), which should somehow make sure that all locks are atomically acquired, in any order (which I think is possible by first sorting the locks according to their address in memory). I didn't want to put too much emphasis on the negative side of locks :-)

Armin Rigo said...

@Jon Morgan:

1. We would most probably still
have a GIL for the CPython C
extensions. Only one can run at a
time, but any number of PyPy
threads can run at the same time.
(This is because the CPython C
extensions never access PyPy's own
objects directly --- they cannot,
because PyPy's own objects can
move, and the C code is not
prepared for that.)

2. Logging to a file is done with a
call to a function like write().
In CPython and so far in PyPy, the
call to write() is preceded by
"release GIL" and followed by
"re-acquire GIL". In the STM PyPy,
it would be preceded by "end the
current transaction" and "start the
next transaction". This gives the
same behavior. But we may have to
think a bit harder about writes
that are buffered, because it seems
that if all threads write into the
same buffer then it will cause many
transaction conflicts.

Note however that we are talking
here about very short-lived
transactions. Even if you have 20
threads all writing to the same log
file, each thread is going to run
much more than 20 bytecodes between
any two writes to the log file.
You only get conflicts if two of
these threads are running the
write() call at the same time, and
such a conflict only causes one of
the threads to roll back and retry
the write(), not more.

Armin Rigo said...

@tuomasjjrasanen: yes, actually the first paper is from the 80's. But I think that it's only from around 2003 or 2004 that research seriously started, in the sense that papers were produced regularly, from several teams.

Kevin Granade said...

To address the anonymous question near the start of the comments, one way to detect commit collision is to copy a global generation counter at the start of your transaction, and then compare your stored copy to the current generation counter at commit time (after taking a lock), and if no one else has incremented the generation counter, you do so and complete your operation.

So transaction does:
self.generation = global.generation

And commit does:
if lock(global.lock):
if self.generation == global.generation:
global.generation += 1
return True
unlock(global.lock)
return False

Jan Ziak (atomsymbol) said...

I am not sure what to make out of the solution (=STM) to GIL you proposed in the article. You are essentially suggesting to slow down all Python programs in PyPy by a factor of, say, 4 and hope to recover the loss for a very small percentage of programs on an 8-core machine.

That can't be right. Please tell me I am dreaming ... :)

Michael Foord said...

So if there is only one thread transactions will be disabled?

I wonder how "fine grained" transactions will be: if you have parallel operations working concurrently on a large array do you think you will be able to allow threads to simultaneously modify different areas of the array?

Ben said...

@⚛: That's kind of how parallelization goes. There are overheads, and the only way to make up for them is to hope you have enough parallel speedup. STM (and any approach to this problem based on fine-grained locking) would work best if only a small known set of objects are shared between threads, and only those are synchronized, which unfortunately cannot be the case for a general GIL-removal proposal.

However I think PyPy's JIT could potentially help a little here. The escape analysis PyPy already does can also prove "this value cannot be accessed by another thread" and used to avoid logging some values, since they cannot conflict with parallel transactions. There are probably some more STM-specific optimizations the JIT could do as well.

Ben said...

@Michael Foord: STM definitely can be made as fine-grained as you like. Some existing STM systems operate at the level of machine words. Given that this one will be operating at the interpreter level, I would guess that code working on different sections of the same object (or array) would able to run in parallel, but I guess it depends on how the tradeoffs play out.

Armin Rigo said...

@⚛: to complete Ben's answer: yes, you are correct, but that's why the translation step "insert STM logic" is never going to be mandatory. You will get either a regular pypy-c-gil or a pypy-c-stm, as two different executables, and you will choose the one most suited for your particular program. I still expect pypy-c-gil to be the most used one, with pypy-c-stm an alternative that is only useful for people with massively multi-threaded programs.

EmilK said...

It would be cool, if the python programmer could mark "uncritical" sections, such that the stm book keeping is disabled for those sections where the programmer knows that there is no concurrency.

Jacob Hallén said...

@EmilK: I think that would be very uncool. You would allow the developer to introduce bugs that would be extremely hard to locate. Parallel programs are quite difficult to get right to start with, and anyone who does not have complete understanding of what constitutes a critical section will be very likely to make an error.

Skandalfo said...

There's an intermediate option between the GIL and the careful locking done by Jython, that I had a look at some time ago for making Python more thread friendly.

Just exchanging the GIL for a global readers-writer lock would allow Python to use way more concurrency. You would run all Python code under a reader lock for operations that were read-only on objects. For modifying built in mutable objects, or for things like the one involving both lists in the Jython example, or when calling into C modules, you would have to acquire the writer version of the lock.

Python threads would relinquish the reader lock each N opcodes, just like it's done now for the GIL, and I guess the acquisition of the writer lock should be given priority over the reader ones.

This approach should be simpler to implement than using the transactional memory approach, and it should be possible to bake it into CPython too. I think I remember having read some discussion about this somewhere, but it didn't seem to come to anything...

Armin Rigo said...

@Skandalfo: this cannot work with CPython, because of reference counting -- every bytecode modifies reference counts, so needs the "write" lock. But it could be a possible idea to consider in PyPy.

WhiteLynx said...

I love this idea.

Just musing on an implementation detail here, but isn't the "lazy" STM implementation's transaction system effectively just an in-memory implementation of copy-on-write semantics? It might be interesting to take a look at other things that have used COW for inspiration. (ZFS and btrfs come to mind) I like the idea that committing a transaction for a given object would just involve changing the object's address in memory to the modified copy.

Also, I'd be interested to see the read/write lock system get implemented, because it seems like it might be a better choice for programs that only use a couple threads.

Anonymous said...

What is wrong with Jython's lock model? Java is a pretty efficient language, no? And there is also no need to acquire locks for objects that you can prove won't cause conflicts...

Skandalfo said...

@Armin Rigo: If the problem for the RW-lock approach in CPython is just about reference count updates and checks, perhaps those could be done via atomic primitives, as supported on most modern architectures. This is what boost::shared_ptr does, IIRC, for the pointers to be thread-safe by default.

Armin Rigo said...

@Skandalfo: right, indeed. I don't know exactly the cost of such atomic operations. Maybe it's fine, but I fear that doing tons of increfs/decrefs all the time (as needed for refcounts in CPython's simple interpreter) has an important cost.

Tuure Laurinolli said...

@Armin Rigo

You'd need similar atomic instructions for an STM implementation too - although perhaps not as many? In any case they should be about as cheap as L1 cache writes unless there's contention, but then things are going to be slow in any case if you have contention. Of course you might have false sharing of objects etc. to muddle things up.

In any case, what sort of semantics would a GIL-free Python have in multi-threaded case, compared to current GIL-infested Python? Each opcode can assumed to execute atomically?

Anonymous said...

One thread have one interpreter.
Threads interactive like os native thread, use the os interactive method wrap by py.

I want to embed multi interpreter in my c code!

Please kill GIL!!!

Raymin said...

One thread have one interpreter.
Threads interactive like os native thread, use the os interactive method wrap by py.

I want to embed multi interpreter in my c code!

Please kill GIL!!!

Armin Rigo said...

@Tuure Laurinolli: yes, but PyPy has no refcounts. I was just discussing the pro/cons of the proposed locking solution on CPython (which is off-topic as far as this original blog post is concerned). I don't even want to think about STM for CPython :-)

For your second question, from the user's point of view, the semantics we would get with STM are automatically the same as with the GIL, which is why I like the approach.

Anonymous said...

Also, what about the performance if the lazy commit method used in the post? Every transaction will create additional memory? Is that really efficient, IMHO this model is aiming a very small number of use cases??

klaussfreire said...

I can see a use for STM in CPython, too, though. Even though it seems to be not applicable, it need not be true.

I worked on making the reference counting thread-friendly, in the sense that when you have multiple threads reading a big data structure, CPython's reference counting turns all the reads into writes, which is awful for performance.

I wrote a patch to pack all writes in the same memory page (ie, reference pools, external reference counting), and was working on a patch for STM reference count updates.

The thing with STM and reference counting, is that many operations cancel out at the end of the transaction. Like when you just read objects while performing computations, you acquire a reference, work, then release it.

In the end, STM here would remove the need to write to shared memory.

In the process of working on that patch, I can tell CPython can be made to use STM techniques. You have thread-local storage at the VM level already, macros handle almost all reference counting operations, it's all abstracted enough that it might be possible.

For reference counting, the only problem is that STM is way slower for single threaded applications. WAY slower. For multithreaded, it pays off considerably, but CPython guys are very strongly set in favouring single-threaded performance.

halfaleague said...

How can we fund this?

Maciej Fijalkowski said...

@halfaleague get in contact. pypy@sfconservancy.org is the right address for non-profit funding inquires.

Daniel Waterworth said...

I managed to write a Haskell STM implementation in a single morning. It may not be the most efficient implementation (I've found it to be about half the speed of the GHC implementation in the limited testing I've done), but it's really simple and only uses atomic CAS.

https://gist.github.com/1454995

shawn said...

have you looked at all at "Worlds" as a simpler interface to STM?

http://www.vpri.org/pdf/tr2011001_final_worlds.pdf