Comments on PyPy Status Blog: "Multicore Programming in PyPy and CPython"

Anonymous (2015-09-22 08:53):

Ammm... Jython 2.7.0!

All pure Python syntax using threading instantly goes multi-core! All you need to do is replace the 'p' with a 'j' in your command and voila!

;)
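A minimal sketch of what the commenter means, assuming a Jython 2.7 installation: the code below is ordinary `threading` code, and the only change is which interpreter launches it. Whether it actually scales is the commenter's claim; Jython has no GIL, so CPU-bound threads at least can run on separate cores.

```python
import threading

def count_down(n):
    # CPU-bound loop: CPython's GIL serializes threads running this,
    # while Jython can schedule them on separate cores
    while n > 0:
        n -= 1

threads = [threading.Thread(target=count_down, args=(10**7,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Run the same file with `python busy.py` versus `jython busy.py` (the file name is arbitrary): same source, different interpreter.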
Anonymous (2012-09-06 22:02):

I really believe that concurrency - like memory allocation, GC and safe arrays - should be done without the user thinking about it...

Languages like Erlang, ABCL and Concurrent Object Oriented C solve this quite elegantly.

Just make every object a "process" (thread/greenlet) and every return value a Future, and you are done :-)
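A toy sketch of that idea in Python, assuming `concurrent.futures` is available (Python 3.2+, or the `futures` backport): each wrapped object gets its own single worker thread as a stand-in for a "process", and every method call immediately returns a `Future`. `ActiveObject` and `Counter` are made-up names for illustration; real actor systems add mailboxes, supervision, and more.

```python
from concurrent.futures import ThreadPoolExecutor

class ActiveObject:
    """Wraps an object so that every method call runs on the object's
    own worker thread and returns a Future instead of a value."""
    def __init__(self, obj):
        self._obj = obj
        self._worker = ThreadPoolExecutor(max_workers=1)  # the "process"

    def __getattr__(self, name):
        method = getattr(self._obj, name)
        def call(*args, **kwargs):
            return self._worker.submit(method, *args, **kwargs)
        return call

class Counter:
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n
        return self.value

c = ActiveObject(Counter())
f = c.add(5)        # returns a Future immediately
print(f.result())   # blocks until the worker has processed the call -> 5
```

Since each wrapped object has a single worker, calls on one object are serialized like messages in an actor's mailbox; parallelism comes from having many objects.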
Ole Laursen (2012-09-06 16:04):

I'm a little late, but regarding the simple let's-do-the-loop-concurrently example: if pypy-stm ends up working out as hoped, would it be relatively easy for pypy to do it automatically, without having to use the parallel-loop thing explicitly?

I have a hunch the answer would be yes, but that the hard part is figuring out when it makes sense and how to do the split (each thread needs a good chunk to work on).

On the other hand, GCC has OpenMP, which does seem really convenient, and it looks like it has (or rather, an implementation of it would have to have) solved part of this problem.

Many years ago, I read about research in auto-parallelising compilers and it struck me as a really hard problem. But if you can just do some magic with the loops, perhaps it's an attainable goal?
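For contrast, here is what the explicit version looks like with the standard `multiprocessing` module; the `chunksize` argument is the manual answer to "each thread needs a good chunk to work on". This sketches the explicit style the commenter hopes pypy-stm could eventually make automatic; it is not something pypy-stm provides.

```python
from multiprocessing import Pool

def body(i):
    # the loop body; iterations must be independent for this to be safe
    return i * i

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per core by default
        # chunksize controls how big a piece of the loop each worker grabs
        results = pool.map(body, range(100000), chunksize=10000)
    print(sum(results))
```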
Armin Rigo (2012-08-19 12:58):

@Mark D.: I don't know if "a dozen memory modifications" comes from real work in the field or is just a guess. My own guess would be that Intel Haswell easily supports hundreds of modifications, possibly thousands. Moreover, the built-in cache coherency mechanisms should be used here too, in a way that scales with the cache size; this means they should not be "prohibitively expensive". Of course I know that in 0.1 seconds we do far more than thousands of writes, but I think that nothing strictly limits the progression of future processors in that respect.

The occurrence of conflicts in large transactions depends on two factors. First, "true conflicts", which are the hard problem, but which I think should be relatively deterministic and debuggable with new tools. Second, "false conflicts", where the HTM/STM mechanism detects a conflict when there is none. To handle large transactions, false conflicts should occur with a probability very, very close to 0% for each memory access. In pypy-stm it is 0%, but indeed, with HTM it depends on how close to 0% they can get. I have no data on that.

Mark D. (2012-08-16 05:23):

0.1 second transactions? With hardware transactional memory, the general idea is transactions about ten thousand times smaller. A dozen memory modifications, maybe.

It would be prohibitively expensive, hardware-wise, to implement conflict detection for transactions much larger than that, to say nothing of the rollbacks and re-executions that conflicts would require if such enormously large transactions were executed optimistically.

klaussfreire (2012-08-14 22:50):

Damn. And I thought I was being original. I can already spot a few key places where kernel-based support would be superior (not only raw performance, but also transparency), but in general, that's exactly what I was talking about, sans transaction retries.

Wim Lavrijsen (2012-08-14 21:18):

@klaussfreire: is this perhaps what you are looking for: http://plasma.cs.umass.edu/emery/grace

Cheers,
Wim

klaussfreire (2012-08-14 16:43):

@Armin, well, Python itself does know.

In the half-formed idea in my head, Python would use thread-local versions of the integer pool and the various free lists, and allocation of new objects would be served from an also-thread-local arena (while in a transaction).

Read-write access to shared objects, yes, would be a little bit unpredictable. That's why I was wondering how well (if at all) it would work for Python.

Armin Rigo (2012-08-14 10:12):

@klaussfreire: that approach is a cool hack but unlikely to work in practice in a language like Python, because the user doesn't control at all which objects end up on the same pages as which other objects. Even with the reference counts moved out of the way, I guess you'd have far too many spurious conflicts.

klaussfreire (2012-08-14 00:35):

Option 5: implement STM in the operating system. Linux already has COW for processes; imagine COW-MERGE for threads.

When you start transactional mode, all pages are marked read-only, thread-private and COW. When you commit, dirty pages are merged with the process's page maps, unless conflicts arise (the process already has dirty pages).

A simple versioning system and version checks would take care of conflict detection.

I just wonder how difficult it would be to design applications that can run on this model (conflicts at page level vs object level).

Thread-private allocation arenas are entirely possible, to keep new objects from creating conflicts all the time, so it would be a matter of making read-only use of objects really read-only, something I've done incrementally in patches already. Reference counts would have to be externalized (taken out of PyObject), for instance.
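A user-space toy of the fork/COW half of this idea (POSIX only): the child process runs the transaction body against a copy-on-write snapshot of the parent's memory and ships its result back over a pipe. The parts that would actually need kernel support, page-level conflict detection and merging dirty pages back, are exactly the parts this sketch leaves out.

```python
import os
import pickle

def run_on_snapshot(func, *args):
    """Run func(*args) in a forked child, which sees a copy-on-write
    snapshot of the parent's memory; return its (pickled) result.
    No conflict detection or page merging - only the COW part."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: works on the COW snapshot
        os.close(read_end)
        os.write(write_end, pickle.dumps(func(*args)))
        os._exit(0)                   # closing write_end signals EOF
    os.close(write_end)               # parent
    chunks = []
    while True:
        chunk = os.read(read_end, 65536)
        if not chunk:                 # EOF: child has exited
            break
        chunks.append(chunk)
    os.close(read_end)
    os.waitpid(pid, 0)
    return pickle.loads(b''.join(chunks))

print(run_on_snapshot(sum, range(10)))   # -> 45
```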
Anonymous (2012-08-12 09:26):

There are two key concurrency patterns to keep in mind when considering Armin's STM work:

1. Event-loop based applications that spend a lot of time idling, waiting for events.

2. Map-reduce style applications where only the reduce step is particularly prone to resource contention, while the map step is read-heavy (and thus hard to split amongst multiple processes).

For both of those use cases, splitting out multiple processes often won't pay off, due to either the serialisation overhead or the additional complexity needed to make serialisation possible at all.

Coarse-grained STM, however, should pay off handsomely in both of those scenarios: if the CPU-bound parts of the application are touching different data structures, or are only *reading* any shared data, with any writes being batched for later application, then the STM interaction can be built into the event loop or parallel execution framework.

Will STM help with threading use cases where multiple threads are simultaneously reading and writing the same data structure? No, it won't. However, such applications don't exploit multiple cores effectively even with free threading, because their *lock* contention will also be high.

As far as "just kill the GIL" goes, I've already written extensively on that topic: http://python-notes.boredomandlaziness.org/en/latest/python3/questions_and_answers.html#but-but-surely-fixing-the-gil-is-more-important-than-fixing-unicode
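A sketch of the second pattern with plain threads (all names invented for illustration): the map step only reads the shared data and batches its writes onto a queue, and a single reduce step applies them afterwards; under pypy-stm, that reduce step is where one coarse-grained atomic block would naturally go.

```python
import threading
from queue import Queue

DATA = {k: len(k) for k in ('alpha', 'beta', 'gamma', 'delta')}  # read-only during the map phase
pending = Queue()  # the map step batches writes here instead of mutating shared state

def mapper(keys):
    # read-heavy map step: only reads DATA, so threads don't contend on it
    for k in keys:
        pending.put((k, DATA[k] * 10))

chunks = [['alpha', 'beta'], ['gamma', 'delta']]
threads = [threading.Thread(target=mapper, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# reduce step: the only writer, run in one place after the maps finish
result = {}
while not pending.empty():
    key, value = pending.get()
    result[key] = value
print(result)
```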
Armin Rigo (2012-08-10 23:04):

@Benjamin: a user program might be optimized to reduce its memory usage, for example by carefully reusing objects instead of throwing them away, finding more memory-efficient constructs, and so on. But in many cases in Python you don't care too much. Similarly, I expect that it's possible to reduce the size of transactions by splitting them up carefully, hoping to get some extra performance. But most importantly, I'd like a system where the programmer doesn't have to care overmuch about that. It should still work reasonably well for *any* size, just like a reasonable GC should work for any heap size.

If I had to describe the main issue I have with HTM, it is that beyond some transaction size we lose all parallelism, because it has to fall back on the GIL.

Well, now that I think about it, it's the same with memory usage: if you grow past the RAM size, the program is suddenly swapping and performance becomes terrible. But RAM sizes are so far much more generous than maximum hardware transaction sizes.

Benjamin (2012-08-10 22:17):

@Armin I'd love to hear your thoughts (benefits, costs, entrenched ideas, etc.) on large vs small transactions at some point. Though I suspect that would be a post unto itself.

Unknown (2012-08-10 19:15):

Transactional programming is neat. So are goroutines and functional-style parallelism. On the other hand, I think that C and C++ (or at least C11 and C++11) get one thing completely right: they don't try to enforce any particular threading model. For some problems (like reference counts, as you mention), you really do want a different model. As long as other languages force me to choose a single model, my big projects will stay in C/C++.
Armin Rigo (2012-08-10 09:57):

@Anonymous: "So you may do all the book-keeping yourself in software, but then at commit time use HTM": I don't see how (or the point); can you be more explicit or post a link?

@Anonymous: I'm not saying that STM is the final solution to all problems. Some classes of problems have other solutions that work well so far, and I'm not proposing to change them. Big servers can naturally handle big loads just by having enough processes. What I'm describing instead is a pure language feature that may or may not help in particular cases --- and there are cases other than the one you describe where the situation is very different and multiprocessing doesn't help at all. Also, you have to realise that any argument of the form "we will never need feature X because we can work around it using hack Y" is bound to lose eventually: at least some people, in some cases, will need the clean feature X because the hack Y is too complicated to learn or use correctly.

@Benjamin: "atomic" actually means "not decomposable", not necessarily "as small as possible". This focus on the smallness of transactions is, IMO, an artefact of the last decade's research focus. In my posts I tend to focus on large transactions as a counterpoint: in the use cases I have in mind, there is no guarantee that all transactions will be small. Some of them may be, but others not, and this is a restriction. In things like "one iteration through this loop = one transaction", some of these iterations go away and do a lot of stuff.
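For concreteness, "one iteration through this loop = one transaction" refers to the interface sketched in the post above, with `transaction.add()` queueing calls and `transaction.run()` executing them in parallel as transactions. The sketch below assumes that experimental pypy-stm interface; it does not run on standard Python.

```python
import transaction   # experimental pypy-stm module proposed in the post

counts = {}

def tally(line):
    # one loop iteration = one transaction; some lines are short,
    # others may trigger a lot of work - sizes are not bounded
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

for line in open('input.txt'):
    transaction.add(tally, line)   # queue one transaction per iteration
transaction.run()                  # run them, re-serializing on conflict

print(counts)
```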
Benjamin (2012-08-10 08:27):

I get the overall goals and desires, and I think they are fabulous. However, one notion that seems counterintuitive to me is the desire for large atomic operations.

Aside from the nomenclature (atomic generally means smallest possible), my intuition is that STM would generally operate more efficiently, with fewer roll-backs, using small atomic operations and frequent commits. This leads me to assume there is some sort of significant overhead involved in the setup or teardown of the STM 'wrapper'.

From a broader perspective, I get that understanding interleaving is much easier with larger pieces, but larger pieces of code don't lend themselves to wide distribution across many cores the way small pieces do.

It seems, to me, that you're focusing heavily on the idea of linearly written code magically functioning in parallel, and neglecting the idea of simple, low-cost concurrency, which might have a much bigger short-term impact; and which, through use, may shed light on better frameworks for reducing the complexity inherent in concurrency.

Maciej Fijalkowski (2012-08-09 22:27):

@Anonymous: I welcome you to work out how to make the PyPy translation process parallel using any of the techniques you described.

Anonymous (2012-08-09 20:59):

template processing. lol.

Anonymous (2012-08-09 20:54):

Jesus Christ, why don't we all just spend 5 minutes fiddling with the multiprocessing module and learn how to partition execution and queues, like we partition sequences of statements into functions? So sick of GIL articles and the obsession with not learning how to divide up the work and communicate. In some ways, the need to recognize narrow channels, where relatively small amounts of data are channeled through relatively intense blocks of execution, and to create readable, explicit structure around those blocks might actually improve the comprehensibility of some code I've seen. Getting a little tired of seeing so much effort by excellent, essential, dedicated Python devs getting sucked up by users who won't get it.

I think users are driving this speed-for-free obsession way too far. If anything, bugs in a magical system are harder to find than understanding explicit structure, and explicit structure that's elegant is neither crufty nor slow. Eventually, no interpreter will save a bad programmer. Are we next going to enable the novice "Pythonista" to forego any knowledge of algorithms?

We -need- JIT on production systems to get response times down for template processing without micro-caching out the wazoo. These types of services are already parallel by nature of the servers, and usually I/O-bound except for the few slow parts. Cython already serves such an excellent role for C/C++ APIs AND speed AND optimizing existing Python code with minimal changes. A JIT PyPy playing well with Cython would make Python very generally uber. Users who actually get multiprocessing and can divide up the workflow won't want a slower implementation of any other kind. Getting a somewhat good solution for 'free' is not nearly as appealing as the additional headroom afforded by an incremental user cost (adding some strong typing or patching a function to work with pypy/py3k).
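For reference, the pattern this comment describes (partitioning work across processes, with queues as the narrow channels) looks roughly like this with the standard `multiprocessing` module; the squaring worker is just a placeholder for the "relatively intense block of execution".

```python
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # pull small inputs, push small outputs: the two queues are the
    # narrow channels around the intense block of execution
    for item in iter(inbox.get, None):
        outbox.put(item * item)

if __name__ == '__main__':
    inbox, outbox = Queue(), Queue()
    procs = [Process(target=worker, args=(inbox, outbox)) for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(100):
        inbox.put(i)
    for _ in procs:
        inbox.put(None)              # one stop sentinel per worker
    results = [outbox.get() for _ in range(100)]
    for p in procs:
        p.join()
    print(sum(results))
```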
Anonymous (2012-08-09 18:32):

What does 'just killing the damn GIL' mean without something like STM? Do you consider it acceptable for Python primitives not to be thread-safe?

If you intend to run on 64 cores, then what is the exact reason you need threading and can't use multiprocessing?

Nat Tuck (2012-08-09 17:37):

No. We really do want a GIL-free Python, even if that means we sometimes need to deal with locks.

Right now a high-end server can have 64 cores. That means that parallel Python code could run faster than serial C code.

STM and other high-level abstractions are neat, but they're no substitute for just killing the damn GIL.

Anonymous (2012-08-09 16:53):

With HTM you don't have to have a one-to-one mapping between your application transactions and the hardware interface. You can also have an STM that is implemented using HTM. So you may do all the book-keeping yourself in software, but then use HTM at commit time.

Armin Rigo (2012-08-09 13:55):

@John: I didn't foresee this development at the start of the year, so I don't know. It's a topic that would need to be discussed internally, likely with feedback from past donors.

Right now, of course, I'm finishing the basics of pypy-stm (working on the JIT now), and from there on there is a lot that can be done in pure Python, like libraries of better-suited data structures --- and generally gaining experience that would be needed anyway for the work on CPython.

JohnLenton (2012-08-09 13:29):

A question: does a "donate towards STM/AME in pypy" also count as a donation towards the CPython work? Getting the hooks into CPython to allow exploration and implementation of this seems at least as important as the pypy work. In fact, I think it's quite a bit more important.

Armin Rigo (2012-08-09 10:41):

This comment has been removed by the author.