Monday, August 13, 2012

CFFI release 0.3

Hi everybody,

We released CFFI 0.3. This is the first release that supports more than CPython 2.x :-)

  • CPython 2.6, 2.7, and 3.x are supported (3.3 definitely, but maybe 3.2 or earlier too)
  • PyPy trunk is supported.

In more details, the main news are:

  • support for PyPy. You need to get a trunk version of PyPy, which comes with the built-in module _cffi_backend to use with the CFFI release. For testing, you can download the Linux 32/64 versions of PyPy trunk. The OS/X and Windows versions of _cffi_backend are not tested at all so far, so probably don't work yet.
  • support for Python 3. It is unknown which exact version is required; probably 3.2 or even earlier, but we need 3.3 to run the tests. The 3.x version is not a separate source; it runs out of the same sources. Thanks Amaury for starting this port.
  • the main change in the API is that you need to use ffi.string(cdata) instead of str(cdata) or unicode(cdata). The motivation for this change was the Python 3 compatibility. If your Python 2 code used to contain str(<cdata 'char *'>), it would interpret the memory content as a null-terminated string; but on Python 3 it would just return a different string, namely "<cdata 'char *'>", and proceed without even a crash, which is bad. So ffi.string() solves it by always returning the memory content as an 8-bit string (which is a str in Python 2 and a bytes in Python 3).
  • other minor API changes are documented at (grep for version 0.3).

Upcoming work, to be done before release 1.0:

  • expose to the user the module cffi.model in a possibly refactored way, for people that don't like (or for some reason can't easily use) strings containing snippets of C declarations. We are thinking about refactoring it in such a way that it has a ctypes-compatible interface, to ease porting existing code from ctypes to cffi. Note that this would concern only the C type and function declarations, not all the rest of ctypes.
  • CFFI 1.0 will also have a corresponding PyPy release. We are thinking about calling it PyPy 2.0 and including the whole of CFFI (instead of just the _cffi_backend module like now). In other words it will support CFFI out of the box --- we want to push forward usage of CFFI in PyPy :-)


Armin Rigo and Maciej Fijałkowski

C++ objects in cppyy, part 1: Data Members

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Documentation and setup instructions are available here. Recent work has focused on STL, low-level buffers, and code quality, but also a lot on pythonizations for the CINT backend, which is mostly for High Energy Physics (HEP) use only. A previous posting walked through the high-level structure and organization of the module, where it was argued why it is necessary to write cppyy in RPython and generate bindings at run-time for the best performance. This posting details how access to C++ data structures is provided and is part of a series of 3 postings on C++ object representation in Python: the second posting will be about method dispatching, the third will tie up several odds and ends by showing how the choices presented here and in part 2 work together to make features such as auto-casting possible.

Wrapping Choices

Say we have a plain old data type (POD), which is the simplest possible data structure in C++. Like for example:

    struct A {
        int    m_i;
        double m_d;

What should such a POD look like when represented in Python? Let's start by looking at a Python data structure that is functionally similar, in that it also carries two public data members of the desired types. Something like this:

    class A(object):
        def __init__(self):
            self.m_i = 0
            self.m_d = 0.

Alright, now how to go about connecting this Python class with the former C++ POD? Or rather, how to connect instances of either. The exact memory layout of a Python A instance is up to Python, and likewise the layout of a C++ A instance is up to C++. Both layouts are implementation details of the underlying language, language implementation, language version, and the platform used. It should be no surprise then, that for example an int in C++ looks nothing like a PyIntObject, even though it is perfectly possible, in both cases, to point out in memory where the integer value is. The two representations can thus not make use of the same block of memory internally. However, the requirement is that the access to C++ from Python looks and feels natural in its use, not that the mapping is exact. Another requirement is that we want access to the actual object from both Python and C++. In practice, it is easier to provide natural access to C++ from Python than the other way around, because the choices of memory layout in C++ are far more restrictive: the memory layout defines the access, as the actual class definition is gone at run-time. The best choice then, is that the Python object will act as a proxy to the C++ object, with the actual data always being in C++.

From here it follows that if the m_i data member lives in C++, then Python needs some kind of helper to access it. Conveniently, since version 2.2, Python has a property construct that can take a getter and setter function that are called when the property is used in Python code, and present it to the programmer as if it were a data member. So we arrive at this (note how the property instance is a variable at the class level):

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = property(get_m_i, set_m_i)
        m_d = property(get_m_d, set_m_d)

The construct_new_A helper is not very interesting (the reflection layer can provide for it directly), and methods are a subject for part 2 of this posting, so focus on get_m_i and set_m_i. In order for the getter to work, the method needs to have access to the C++ instance for which the Python object is a proxy. On access, Python will call the getter function with the proxy instance for which it is called. The proxy has a _cppthis data member from which the C++ instance can be accessed (think of it as a pointer) and all is good, at least for m_i. The second data member m_d, however, requires some more work: it is located at some offset into _cppthis. This offset can be obtained from the reflection information, which lets the C++ compiler calculate it, so details such as byte padding are fully accounted for. Since the setter also needs the offset, and since both share some more details such as the containing class and type information of the data member, it is natural to create a custom property class. The getter and setter methods then become bound methods of an instance of that custom property, CPPDataMember, and there is one such instance per data member. Think of something along these lines:

    def make_datamember(cppclass, name):
        cppdm = cppyy.CPPDataMember(cppclass, name)
        return property(cppdm.get, cppdm.set)
where the make_datamember function replaces the call to property in the class definition above.

Now hold on a minute! Before it was argued that Python and C++ can not share the same underlying memory structure, because of choices internal to the language. But if on the Python side choices are being made by the developer of the language bindings, that is no longer a limitation. In other words, why not go through e.g. the Python extension API, and do this:

    struct A_pyproxy {
        int    m_i;
        double m_d;

Doing so would save on malloc overhead and remove a pointer indirection. There are some technical issues specific to PyPy for such a choice: there is no such thing as PyPyObject_HEAD and the layout of objects is not a given as that is decided only at translation time. But assume that those issues can be solved, and also accept that there is no problem in creating structure definitions like this at run-time, since the reflection layer can provide both the required size and access to the placement new operator (compare e.g. CPython's struct module). There is then still a more fundamental problem: it must be possible to take over ownership in Python from instances created in C++ and vice-versa. With a proxy scheme, that is trivial: just pass the pointer and do the necessary bookkeeping. With an embedded object, however, not every use case can be implemented: e.g. if an object is created in Python, passed to C++, and deleted in C++, it must have been allocated independently. The proxy approach is therefore still the best choice, although embedding objects may provide for optimizations in some use cases.


The next step, is to take a more complicated C++ class, one with inheritance (I'm leaving out details such as constructors etc., for brevity):

    class A {
        virtual ~A() {}
        int    m_i;
        double m_d;

    class B : public A {
        virtual ~B() {}
        int    m_j;

From the previous discussion, it should already be clear what this will look like in Python:

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = make_datamember('A', 'm_i')
        m_d = make_datamember('A', 'm_d')

    class B(A):
        def __init__(self):
            self._cppthis = construct_new_B()
        m_j = make_datamember('B', 'm_j')

There are some minor adjustments needed, however. For one, the offset of the m_i data member may be no longer zero: it is possible that a virtual function dispatch table (vtable) pointer is added at the beginning of A (an alternative is to have the vtable pointer at the end of the object). But if m_i is handled the same way as m_d, with the offset provided by the compiler, then the compiler will add the bits, if any, for the vtable pointer and all is still fine. A real problem could come in however, with a call of the m_i property on an instance of B: in that case, the _cppthis points to a B instance, whereas the getter/setter pair expect an A instance. In practice, this is usually not a problem: compilers will align A and B and calculate an offset for m_j from the start of A. Still, that is an implementation detail (even though it is one that can be determined at run-time and thus taken advantage of by the JIT), so it can not be relied upon. The m_i getter thus needs to take into account that it can be called with a derived type, and so it needs to add an additional offset. With that modification, the code looks something like this (as you would have guessed, this is getting more and more into pseudo-code territory, although it is conceptually close to the actual implementation in cppyy):

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + offset(self.__class__, A))

Which is a shame, really, because the offset between B and A is going to be zero most of the time in practice, and the JIT can not completely elide the offset calculation (as we will see later; it is easy enough to elide if self.__class__ is A, though). One possible solution is to repeat the properties for each derived class, i.e. to have a get_B_m_i etc., but that looks ugly on the Python side and anyway does not work in all cases: e.g. with multiple inheritance where there are data members with the same name in both bases, or if B itself has a public data member called m_i that shadows the one from A. The optimization then, is achieved by making B in charge of the offset calculations, by making offset a method of B, like so:

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + self.offset(A))

The insight is that by scanning the inheritance hierarchy of a derived class like B, you can know statically whether it may sometimes need offsets, or whether the offsets are always going to be zero. Hence, if the offsets are always zero, the method offset on B will simply return the literal 0 as its implementation, with the JIT taking care of the rest through inlining and constant folding. If the offset could be non-zero, then the method will perform an actual calculation, and it will let the JIT elide the call only if possible.

Multiple Virtual Inheritance

Next up would be multiple inheritance, but that is not very interesting: we already have the offset calculation between the actual and base class, which is all that is needed to resolve any multiple inheritance hierarchy. So, skip that and move on to multiple virtual inheritance. That that is going to be a tad more complicated will be clear if you show the following code snippet to any old C++ hand and see how they respond. Most likely you will be told: "Don't ever do that." But if code can be written, it will be written, and so for the sake of the argument, what would this look like in Python:

    class A {
        virtual ~A() {}
        int m_a;

    class B : public virtual A {
        virtual ~B() {}
        int m_b;

    class C : public virtual A {
        virtual ~C() {}
        int m_c;

    class D : public virtual B, public virtual C {
        virtual ~D() {}
        int m_d;

Actually, nothing changes from what we have seen so far: the scheme as laid out above is fully sufficient. For example, D would simply look like:

    class D(B, C):
        def __init__(self):
            self._cppthis = construct_new_D()
        m_d = make_datamember('D', 'm_d')

Point being, the only complication added by the multiple virtual inheritance, is that navigation of the C++ instance happens with pointers internal to the instance rather than with offsets. However, it is still a fixed offset from any location to any other location within the instance as its parts are laid out consecutively in memory (this is not a requirement, but it is the most efficient, so it is what is used in practice). But what you can not do, is determine the offset statically: you need a live (i.e. constructed) object for any offset calculations. In Python, everything is always done dynamically, so that is of itself not a limitation. Furthermore, self is already passed to the offset calculation (remember that this was done to put the calculation in the derived class, to optimize the common case of zero offset), thus a live C++ instance is there precisely when it is needed. The call to the offset calculation is hard to elide, since the instance will be passed to a C++ helper and so the most the JIT can do is guard on the instance's memory address, which is likely to change between traces. Instead, explicit caching is needed on the base and derived types, allowing the JIT to elide the lookup in the explicit cache.

Static Data Members and Global Variables

That, so far, covers all access to instance data members. Next up are static data members and global variables. A complication here is that a Python property needs to live on the class in order to work its magic. Otherwise, if you get the property, it will simply return the getter function, and if you set it, it will dissappear. The logical conclusion then, is that a property representing a static or global variable, needs to live on the class of the class, or the metaclass. If done directly though, that would mean that every static data member is available from every class, since all Python classes have the same metaclass, which is class type (and which is its own metaclass). To prevent that from happening and because type is actually immutable, each proxy class needs to have its own custom metaclass. Furthermore, since static data can also be accessed on the instance, the class, too, gets a property object for each static data member. Expressed in code, for a basic C++ class, this looks as follows:

    class A {
        static int s_i;

Paired with some Python code such as this, needed to expose the static variable both on the class and the instance level:

    meta_A = type(CppClassMeta, 'meta_A', [CPPMetaBase], {})
    meta_A.s_i = make_datamember('A', 's_i')

    class A(object):
        __metaclass__ = meta_A
        s_i = make_datamember('A', 's_i')

Inheritance adds no complications for the access of static data per se, but there is the issue that the metaclasses must follow the same hierarchy as the proxy classes, for the Python method resolution order (MRO) to work. In other words, there are two complete, parallel class hierarchies that map one-to-one: a hierarchy for the proxy classes and one for their metaclasses.

A parallel class hierarchy is used also in other highly dynamic, object-oriented environments, such as for example Smalltalk. In Smalltalk as well, class-level constructs, such as class methods and data members, are defined for the class in the metaclass. A metaclass hierarchy has further uses, such as lazy loading of nested classes and member templates (this would be coded up in the base class of all metaclasses: CPPMetaBase), and makes it possible to distribute these over different reflection libraries. With this in place, you can write Python codes like so:

    >>>> from cppyy.gbl import A
    >>>> a = A()
    >>>> a.s_i = 42
    >>>> print A.s_i == a.s_i
    >>>> # etc.

The implementation of the getter for s_i is a lot easier than for instance data: the static data lives at a fixed, global, address, so no offset calculations are needed. The same is done for global data or global data living in namespaces: namespaces are represented as Python classes, and global data are implemented as properties on them. The need for a metaclass is one of the reasons why it is easier for namespaces to be classes: module objects are too restrictive. And even though namespaces are not modules, you still can, with some limitations, import from them anyway.

It is common that global objects themselves are pointers, and therefore it is allowed that the stored _cppthis is not a pointer to a C++ object, but rather a pointer to a pointer to a C++ object. A double pointer, as it were. This way, if the C++ code updates the global pointer, it will automatically reflect on the Python side in the proxy. Likewise, if on the Python side the pointer gets set to a different variable, it is the pointer that gets updated, and this will be visible on the C++ side. In general, however, the same caveat as for normal Python code applies: in order to set a global object, it needs to be set within the scope of that global object. As an example, consider the following code for a C++ namespace NS with global variable g_a, which behaves the same as Python code for what concerns the visibility of changes to the global variable:

    >>>> from cppyy.gbl import NS, A
    >>>> from NS import g_a
    >>>> g_a = A(42)                     # does NOT update C++ side
    >>>> print NS.g_a.m_i
    13                                   # the old value happens to be 13
    >>>> NS.g_a = A(42)                  # does update C++ side
    >>>> print NS.g_a.m_i
    >>>> # etc.


That covers all there is to know about data member access of C++ classes in Python through a reflection layer! A few final notes: RPython does not support metaclasses, and so the construction of proxy classes (code like make_datamember above) happens in Python code instead. There is an overhead penalty of about 2x over pure RPython code associated with that, due to extra guards that get inserted by the JIT. A factor of 2 sounds like a lot, but the overhead is tiny to begin with, and 2x of tiny is still tiny and it's not easy to measure. The class definition of the custom property, CPPDataMember, is in RPython code, to be transparent to the JIT. The actual offset calculations are in the reflection layer. Having the proxy class creation in Python, with structural code in RPython, complicates matters if proxy classes need to be constructed on-demand. For example, if an instance of an as-of-yet unseen type is returned by a method. Explaining how that is solved is a topic of part 2, method calls, so stay tuned.

This posting laid out the reasoning behind the object representation of C++ objects in Python by cppyy for the purpose of data member access. It explained how the chosen representation of offsets gives rise to a very pythonic representation, which allows Python introspection tools to work as expected. It also explained some of the optimizations done for the benefit of the JIT. Next up are method calls, which will be described in part 2.

Thursday, August 9, 2012

Multicore Programming in PyPy and CPython

Hi all,

This is a short "position paper" kind of post about my view (Armin Rigo's) on the future of multicore programming in high-level languages. It is a summary of the keynote presentation at EuroPython. As I learned by talking with people afterwards, I am not a good enough speaker to manage to convey a deeper message in a 20-minutes talk. I will try instead to convey it in a 250-lines post...

This is about three points:

  1. We often hear about people wanting a version of Python running without the Global Interpreter Lock (GIL): a "GIL-less Python". But what we programmers really need is not just a GIL-less Python --- we need a higher-level way to write multithreaded programs than using directly threads and locks. One way is Automatic Mutual Exclusion (AME), which would give us an "AME Python".
  2. A good enough Software Transactional Memory (STM) system can be used as an internal tool to do that. This is what we are building into an "AME PyPy".
  3. The picture is darker for CPython, though there is a way too. The problem is that when we say STM, we think about either GCC 4.7's STM support, or Hardware Transactional Memory (HTM). However, both solutions are enough for a "GIL-less CPython", but not for "AME CPython", due to capacity limitations. For the latter, we need somehow to add some large-scale STM into the compiler.

Let me explain these points in more details.

GIL-less versus AME

The first point is in favor of the so-called Automatic Mutual Exclusion approach. The issue with using threads (in any language with or without a GIL) is that threads are fundamentally non-deterministic. In other words, the programs' behaviors are not reproductible at all, and worse, we cannot even reason about it --- it becomes quickly messy. We would have to consider all possible combinations of code paths and timings, and we cannot hope to write tests that cover all combinations. This fact is often documented as one of the main blockers towards writing successful multithreaded applications.

We need to solve this issue with a higher-level solution. Such solutions exist theoretically, and Automatic Mutual Exclusion (AME) is one of them. The idea of AME is that we divide the execution of each thread into a number of "atomic blocks". Each block is well-delimited and typically large. Each block runs atomically, as if it acquired a GIL for its whole duration. The trick is that internally we use Transactional Memory, which is a technique that lets the system run the atomic blocks from each thread in parallel, while giving the programmer the illusion that the blocks have been run in some global serialized order.

This doesn't magically solve all possible issues, but it helps a lot: it is far easier to reason in terms of a random ordering of large atomic blocks than in terms of a random ordering of lines of code --- not to mention the mess that multithreaded C is, where even a random ordering of instructions is not a sufficient model any more.

How do such atomic blocks look like? For example, a program might contain a loop over all keys of a dictionary, performing some "mostly-independent" work on each value. This is a typical example: each atomic block is one iteration through the loop. By using the technique described here, we can run the iterations in parallel (e.g. using a thread pool) but using AME to ensure that they appear to run serially.

In Python, we don't care about the order in which the loop iterations are done, because we are anyway iterating over the keys of a dictionary. So we get exactly the same effect as before: the iterations still run in some random order, but --- and that's the important point --- they appear to run in a global serialized order. In other words, we introduced parallelism, but only under the hood: from the programmer's point of view, his program still appears to run completely serially. Parallelisation as a theoretically invisible optimization... more about the "theoretically" in the next paragraph.

Note that randomness of order is not fundamental: they are techniques building on top of AME that can be used to force the order of the atomic blocks, if needed.

PyPy and STM/AME

Talking more precisely about PyPy: the current prototype pypy-stm is doing precisely this. In pypy-stm, the length of the atomic blocks is selected in one of two ways: either explicitly or automatically.

The automatic selection gives blocks corresponding to some small number of bytecodes, in which case we have merely a GIL-less Python: multiple threads will appear to run serially, with the execution randomly switching from one thread to another at bytecode boundaries, just like in CPython.

The explicit selection is closer to what was described in the previous section: someone --- the programmer or the author of some library that the programmer uses --- will explicitly put with thread.atomic: in the source, which delimitates an atomic block. For example, we can use it to build a library that can be used to iterate over the keys of a dictionary: instead of iterating over the dictionary directly, we would use some custom utility which gives the elements "in parallel". It would give them by using internally a pool of threads, but enclosing every handling of an element into such a with thread.atomic block.

This gives the nice illusion of a global serialized order, and thus gives us a well-behaving model of the program's behavior.

Restating this differently, the only semantical difference between pypy-stm and a regular PyPy or CPython is that it has thread.atomic, which is a context manager that gives the illusion of forcing the GIL to not be released during the execution of the corresponding block of code. Apart from this addition, they are apparently identical.

Of course they are only semantically identical if we ignore performance: pypy-stm uses multiple threads and can potentially benefit from that on multicore machines. The drawback is: when does it benefit, and how much? The answer to this question is not immediate. The programmer will usually have to detect and locate places that cause too many "conflicts" in the Transactional Memory sense. A conflict occurs when two atomic blocks write to the same location, or when A reads it, B writes it, but B finishes first and commits. A conflict causes the execution of one atomic block to be aborted and restarted, due to another block committing. Although the process is transparent, if it occurs more than occasionally, then it has a negative impact on performance.

There is no out-of-the-box perfect solution for solving all conflicts. What we will need is more tools to detect them and deal with them, data structures that are made aware of the risks of "internal" conflicts when externally there shouldn't be one, and so on. There is some work ahead.

The point here is that from the point of view of the final programmer, we gets conflicts that we should resolve --- but at any point, our program is correct, even if it may not be yet as efficient as it could be. This is the opposite of regular multithreading, where programs are efficient but not as correct as they could be. In other words, as we all know, we only have resources to do the easy 80% of the work and not the remaining hard 20%. So in this model we get a program that has 80% of the theoretical maximum of performance and it's fine. In the regular multithreading model we would instead only manage to remove 80% of the bugs, and we are left with obscure rare crashes.

CPython and HTM

Couldn't we do the same for CPython? The problem here is that pypy-stm is implemented as a transformation step during translation, which is not directly possible in CPython. Here are our options:

  • We could review and change the C code everywhere in CPython.
  • We use GCC 4.7, which supports some form of STM.
  • We wait until Intel's next generation of CPUs comes out ("Haswell") and use HTM.
  • We write our own C code transformation within a compiler (e.g. LLVM).

I will personally file the first solution in the "thanks but no thanks" category. If anything, it will give us another fork of CPython that will painfully struggle to keep not more than 3-4 versions behind, and then eventually die. It is very unlikely to be ever merged into the CPython trunk, because it would need changes everywhere. Not to mention that these changes would be very experimental: tomorrow we might figure out that different changes would have been better, and have to start from scratch again.

Let us turn instead to the next two solutions. Both of these solutions are geared toward small-scale transactions, but not long-running ones. For example, I have no clue how to give GCC rules about performing I/O in a transaction --- this seems not supported at all; and moreover looking at the STM library that is available so far to be linked with the compiled program, it assumes short transactions only. By contrast, when I say "long transaction" I mean transactions that can run for 0.1 seconds or more. To give you an idea, in 0.1 seconds a PyPy program allocates and frees on the order of ~50MB of memory.

Intel's Hardware Transactional Memory solution is both more flexible and comes with a stricter limit. In one word, the transaction boundaries are given by a pair of special CPU instructions that make the CPU enter or leave "transactional" mode. If the transaction aborts, the CPU cancels any change, rolls back to the "enter" instruction and causes this instruction to return an error code instead of re-entering transactional mode (a bit like a fork()). The software then detects the error code. Typically, if transactions are rarely cancelled, it is fine to fall back to a GIL-like solution just to redo these cancelled transactions.

About the implementation: this is done by recording all the changes that a transaction wants to do to the main memory, and keeping them invisible to other CPUs. This is "easily" achieved by keeping them inside this CPU's local cache; rolling back is then just a matter of discarding a part of this cache without committing it to memory. From this point of view, there is a lot to bet that we are actually talking about the regular per-core Level 1 and Level 2 caches --- so any transaction that cannot fully store its read and written data in the 64+256KB of the L1+L2 caches will abort.

So what does it mean? A Python interpreter overflows the L1 cache of the CPU very quickly: just creating new Python function frames takes a lot of memory (on the order of magnitude of 1/100 of the whole L1 cache). Adding a 256KB L2 cache into the picture helps, particularly because it is highly associative and thus avoids a lot of fake conflicts. However, as long as the HTM support is limited to L1+L2 caches, it is not going to be enough to run an "AME Python" with any sort of medium-to-long transaction. It can run a "GIL-less Python", though: just running a few hundred or even thousand bytecodes at a time should fit in the L1+L2 caches, for most bytecodes.

I would vaguely guess that it will take on the order of 10 years until CPU cache sizes grow enough for a CPU in HTM mode to actually be able to run 0.1-second transactions. (Of course in 10 years' time a lot of other things may occur too, including the whole Transactional Memory model being displaced by something else.)

Write your own STM for C

Let's discuss now the last option: if neither GCC 4.7 nor HTM are sufficient for an "AME CPython", then we might want to write our own C compiler patch (as either extra work on GCC 4.7, or an extra pass to LLVM, for example).

We would have to deal with the fact that we get low-level information, and somehow need to preserve interesting high-level bits through the compiler up to the point at which our pass runs: for example, whether the field we read is immutable or not. (This is important because some common objects are immutable, e.g. PyIntObject. Immutable reads don't need to be recorded, whereas reads of mutable data must be protected against other threads modifying them.) We can also have custom code to handle the reference counters: e.g. not consider it a conflict if multiple transactions have changed the same reference counter, but just resolve it automatically at commit time. We are also free to handle I/O in the way we want.

More generally, the advantage of this approach over both the current GCC 4.7 and over HTM is that we control the whole process. While this still looks like a lot of work, it looks doable. It would be possible to come up with a minimal patch of CPython that can be accepted into core without too much troubles (e.g. to mark immutable fields and tweak the refcounting macros), and keep all the cleverness inside the compiler extension.


I would assume that a programming model specific to PyPy and not applicable to CPython has little chances to catch on, as long as PyPy is not the main Python interpreter (which looks unlikely to change anytime soon). Thus as long as only PyPy has AME, it looks like it will not become the main model of multicore usage in Python. However, I can conclude with a more positive note than during the EuroPython conference: it is a lot of work, but there is a more-or-less reasonable way forward to have an AME version of CPython too.

In the meantime, pypy-stm is around the corner, and together with tools developed on top of it, it might become really useful and used. I hope that in the next few years this work will trigger enough motivation for CPython to follow the ideas.

Tuesday, August 7, 2012

NumPyPy non-progress report

Hello everyone.

Not much has happened in the past few months with numpypy development. A part of the reason was doing other stuff for me, a part of the reason was various unexpected visa-related admin, a part of the reason was EuroPython and a part was long-awaited holiday.

The thing that's maybe worth mentioning is that it does not mean the donations disappeared in the mist. PyPy developers are being paid to work on NumPyPy on an hourly basis - that means if I decide to take holidays or work on something else, the money is simply staying in the account until later.

Thanks again for all the donations, I hope to get back to this topic soon!