Sunday, September 9, 2018

The First 15 Years of PyPy — a Personal Retrospective

A few weeks ago I (=Carl Friedrich Bolz-Tereick) gave a keynote at ICOOOLPS in Amsterdam with the above title. I was very happy to have been given that opportunity, since a number of our papers have been published at ICOOOLPS, including the very first one I published when I'd just started my PhD. I decided to turn the talk manuscript into a (longish) blog post, to make it available to a wider audience. Note that this blog post describes my personal recollections and research, it is thus necessarily incomplete and coloured by my own experiences.

PyPy has turned 15 years old this year, so I decided that that's a good reason to dig into and talk about the history of the project so far. I'm going to do that using the lens of how performance developed over time, which is from something like 2000x slower than CPython, to roughly 7x faster. In this post I am going to present the history of the project, and also talk about some lessons that we learned.

The post does not make too many assumptions about any prior knowledge of what PyPy is, so if this is your first interaction with it, welcome! I have tried to sprinkle links to earlier blog posts and papers into the writing, in case you want to dive deeper into some of the topics.

As a disclaimer, in this post I am going to mostly focus on ideas, and not explain who had or implemented them. A huge amount of people contributed to the design, the implementation, the funding and the organization of PyPy over the years, and it would be impossible to do them all justice.

2003: Starting the Project

On the technical level PyPy is a Python interpreter written in Python, which is where the name comes from. It also has an automatically generated JIT compiler, but I'm going to introduce that gradually over the rest of the blog post, so let's not worry about it too much yet. On the social level PyPy is an interesting mixture of a open source project, that sometimes had research done in it.

The project got started in late 2002 and early 2003. To set the stage, at that point Python was a significantly less popular language than it is today. Python 2.2 was the version at the time, Python didn't even have a bool type yet.

In fall 2002 the PyPy project was started by a number of Python programmers on a mailing list who said something like (I am exaggerating somewhat) "Python is the greatest most wonderful most perfect language ever, we should use it for absolutely everything. Well, what aren't we using it for? The Python virtual machine itself is written in C, that's bad. Let's start a project to fix that."

Originally that project was called "minimal python", or "ptn", later gradually renamed to PyPy. Here's the mailing list post to announce the project more formally:

Minimal Python Discussion, Coding and Sprint

We announce a mailinglist dedicated to developing
a "Minimal Python" version.  Minimal means that
we want to have a very small C-core and as much
as possible (re)implemented in python itself.  This
includes (parts of) the VM-Code.

Why would that kind of project be useful? Originally it wasn't necessarily meant to be useful as a real implementation at all, it was more meant as a kind of executable explanation of how Python works, free of the low level details of CPython. But pretty soon there were then also plans for how the virtual machine (VM) could be bootstrapped to be runnable without an existing Python implementation, but I'll get to that further down.

2003: Implementing the Interpreter

In early 2003 a group of Python people met in Hildesheim (Germany) for the first of many week long development sprints, organized by Holger Krekel. During that week a group of people showed up and started working on the core interpreter. In May 2003 a second sprint was organized by Laura Creighton and Jacob Halén in Gothenburg (Sweden). And already at that sprint enough of the Python bytecodes and data structures were implemented to make it possible to run a program that computed how much money everybody had to pay for the food bills of the week. And everybody who's tried that for a large group of people knows that that’s an amazingly complex mathematical problem.

In the next two years, the project continued as a open source project with various contributors working on it in their free time, and meeting for the occasional sprint. In that time, the rest of the core interpreter and the core data types were implemented.

There's not going to be any other code in this post, but to give a bit of a flavor of what the Python interpreter at that time looked like, here's the implementation of the DUP_TOP bytecode after these first sprints. As you can see, it's in Python, obviously, and it has high level constructs such as method calls to do the stack manipulations:

def DUP_TOP(f):
    w_1 =

Here's the early code for integer addition:

def int_int_add(space, w_int1, w_int2):
    x = w_int1.intval
    y = w_int2.intval
        z = x + y
    except OverflowError:
        raise FailedToImplement(space.w_OverflowError,
                                space.wrap("integer addition"))
    return W_IntObject(space, z)

(the current implementations look slightly but not fundamentally different.)

Early organizational ideas

Some of the early organizational ideas of the project were as follows. Since the project was started on a sprint and people really liked that style of working PyPy continued to be developed on various subsequent sprints.

From early on there was a very heavy emphasis on testing. All the parts of the interpreter that were implemented had a very careful set of unit tests to make sure that they worked correctly. From early on, there was a continuous integration infrastructure, which grew over time (nowadays it is very natural for people to have automated tests, and the concept of green/red builds: but embracing this workflow in the early 2000s was not really mainstream yet, and it is probably one of the reasons behind PyPy's success).

At the sprints there was also an emphasis on doing pair programming to make sure that everybody understood the codebase equally. There was also a heavy emphasis on writing good code and on regularly doing refactorings to make sure that the codebase remained nice, clean and understandable. Those ideas followed from the early thoughts that PyPy would be a sort of readable explanation of the language.

There was also a pretty fundamental design decision made at the time. That was that the project should stay out of language design completely. Instead it would follow CPython's lead and behave exactly like that implementation in all cases. The project therefore committed to being almost quirk-to-quirk compatible and to implement even the more obscure (and partially unnecessary) corner cases of CPython.

All of these principles continue pretty much still today (There are a few places where we had to deviate from being completely compatible, they are documented here).

2004-2007: EU-Funding

While all this coding was going on it became clear pretty soon that the goals that various participants had for the project would be very hard to achieve with just open source volunteers working on the project in their spare time. Particularly also the sprints became expensive given that those were just volunteers doing this as a kind of weird hobby. Therefore a couple of people of the project got together to apply for an EU grant in the framework programme 6 to solve these money problems. In mid-2004 that application proved to be successful. And so the project got a grant of a 1.3 million Euro for two years to be able to employ some of the core developers and to make it possible for them work on the project full time. The EU grant went to seven small-to-medium companies and Uni Düsseldorf. The budget also contained money to fund sprints, both for the employed core devs as well as other open source contributors.

The EU project started in December 2004 and that was a fairly heavy change in pace for the project. Suddenly a lot of people were working full time on it, and the pace and the pressure picked up quite a lot. Originally it had been a leisurely project people worked on for fun. But afterwards people discovered that doing this kind of work full time becomes slightly less fun, particularly also if you have to fulfill the ambitious technical goals that the EU proposal contained. And the proposal indeed contained a bit everything to increase its chance of acceptance, such as aspect oriented programming, semantic web, logic programming, constraint programming, and so on. Unfortunately it turned out that those things then have to be implemented, which can be called the first thing we learned: if you promise something to the EU, you'll have to actually go do it (After the funding ended, a lot of these features were actually removed from the project again, at a cleanup sprint).

2005: Bootstrapping PyPy

So what were the actually useful things done as part of the EU project?

One of the most important goals that the EU project was meant to solve was the question of how to turn PyPy into an actually useful VM for Python. The bootstrapping plans were taken quite directly from Squeak, which is a Smalltalk VM written in a subset of Smalltalk called Slang, which can then be bootstrapped to C code. The plan for PyPy was to do something similar, to define a restricted subset of Python called RPython, restricted in such a way that it should be possible to statically compile RPython programs to C code. Then the Python interpreter should only use that subset, of course.

The main difference from the Squeak approach is that Slang, the subset of Squeak used there, is actually quite a low level language. In a way, you could almost describe it as C with Smalltalk syntax. RPython was really meant to be a much higher level language, much closer to Python, with full support for single inheritance classes, and most of Python's built-in data structures.

(BTW, you don’t have to understand any of the illustrations in this blog post, they are taken from talks and project reports we did over the years so they are of archaeological interest only and I don’t understand most of them myself.)

From 2005 on, work on the RPython type inference engine and C backend started in earnest, which was sort of co-developed with the RPython language definition and the PyPy Python interpreter. This is also roughly the time that I joined the project as a volunteer.

And at the second sprint I went to, in July 2005, two and a half years after the project got started, we managed to bootstrap the PyPy interpreter to C for the first time. When we ran the compiled program, it of course immediately segfaulted. The reason for that was that the C backend had turned characters into signed chars in C, while the rest of the infrastructure assumed that they were unsigned chars. After we fixed that, the second attempt worked and we managed to run an incredibly complex program, something like 6 * 7. That first bootstrapped version was really really slow, a couple of hundred times slower than CPython.

The bootstrapping process of RPython has a number of nice benefits, a big one being that a number of the properties of the generated virtual machine don't have to expressed in the interpreter. The biggest example of this is garbage collection. RPython is a garbage collected language, and the interpreter does not have to care much about GC in most cases. When the C source code is generated, a GC is automatically inserted. This is a source of great flexibility. Over time we experimented with a number of different GC approaches, from reference counting to Boehm to our current incremental generational collector. As an aside, for a long time we were also working on other backends to the RPython language and hoped to be able to target Java and .NET as well. Eventually we abandoned this strand of work, however.

RPython's Modularity Problems

Now we come to the first thing I would say we learned in the project, which is that the quality of tools we thought of as internal things still matters a lot. One of the biggest technical mistakes we've made in the project was that we designed RPython without any kind of story for modularity. There is no concept of modules in the language or any other way to break up programs into smaller components. We always thought that it would be ok for RPython to be a little bit crappy. It was meant to be this sort of internal language with not too many external users. And of course that turned out to be completely wrong later.

That lack of modularity led to various problems that persist until today. The biggest one is that there is no separate compilation for RPython programs at all! You always need to compile all the parts of your VM together, which leads to infamously bad compilation times.

Also by not considering the modularity question we were never forced to fix some internal structuring issues of the RPython compiler itself. Various layers of the compiler keep very badly defined and porous interfaces between them. This was made possible by being able to work with all the program information in one heap, making the compiler less approachable and maintainable than it maybe could be.

Of course this mistake just got more and more costly to fix over time, and so it means that so far nobody has actually done it. Not thinking more carefully about RPython's design, particularly its modularity story, is in my opinion the biggest technical mistake the project made.

2006: The Meta-JIT

After successfully bootstrapping the VM we did some fairly straightforward optimizations on the interpreter and the C backend and managed to reduce the slowdown versus CPython to something like 2-5 times slower. That's great! But of course not actually useful in practice. So where do we go from here?

One of the not so secret goals of Armin Rigo, one of the PyPy founders, was to use PyPy together with some advanced partial evaluation magic sauce to somehow automatically generate a JIT compiler from the interpreter. The goal was something like, "you write your interpreter in RPython, add a few annotations and then we give you a JIT for free for the language that that interpreter implements."

Where did the wish for that approach come from, why not just write a JIT for Python manually in the first place? Armin had actually done just that before he co-founded PyPy, in a project called Psyco. Psyco was an extension module for CPython that contained a method-based JIT compiler for Python code. And Psyco proved to be an amazingly frustrating compiler to write. There were two main reasons for that. The first reason was that Python is actually quite a complex language underneath its apparent simplicity. The second reason for the frustration was that Python was and is very much an alive language, that gains new features in the language core in every version. So every time a new Python version came out, Armin had to do fundamental changes and rewrites to Psyco, and he was getting pretty frustrated with it. So he hoped that that effort could be diminished by not writing the JIT for PyPy by hand at all. Instead, the goal was to generate a method-based JIT from the interpreter automatically. By taking the interpreter, and applying a kind of advanced transformation to it, that would turn it into a method-based JIT. And all that would still be translated into a C-based VM, of course.

Slide from Psyco presentation at EuroPython 2002

The First JIT Generator

From early 2006 on until the end of the EU project a lot of work went into writing such a JIT generator. The idea was to base it on runtime partial evaluation. Partial evaluation is an old idea in computer science. It's supposed to be a way to automatically turn interpreters for a language into a compiler for that same language. Since PyPy was trying to generate a JIT compiler, which is in any case necessary to get good performance for a dynamic language like Python, the partial evaluation was going to happen at runtime.

There are various ways to look at partial evaluation, but if you've never heard of it before, a simple way to view it is that it will compile a Python function by gluing together the implementations of the bytecodes of that function and optimizing the result.

The main new ideas of PyPy's partial-evaluation based JIT generator as opposed to earlier partial-evaluation approaches are the ideas of "promote" and the idea of "virtuals". Both of these techniques had already been present (in a slightly less general form) in Psyco, and the goal was to keep using them in PyPy. Both of these techniques also still remain in use today in PyPy. I'm going on a slight technical diversion now, to give a high level explanation of what those ideas are for.


One important ingredient of any JIT compiler is the ability to do runtime feedback. Runtime feedback is most commonly used to know something about which concrete types are used by a program in practice. Promote is basically a way to easily introduce runtime feedback into the JIT produced by the JIT generator. It's an annotation the implementer of a language can use to express their wish that specialization should happen at this point. This mechanism can be used to express all kinds of runtime feedback, moving values from the interpreter into the compiler, whether they be types or other things.


Virtuals are a very aggressive form of partial escape analysis. A dynamic language often puts a lot of pressure on the garbage collector, since most primitive types (like integers, floats and strings) are boxed in the heap, and new boxes are allocated all the time.

With the help of virtuals a very significant portion of all allocations in the generated machine code can be completely removed. Even if they can't be removed, often the allocation can be delayed or moved into an error path, or even into a deoptimization path, and thus disappear from the generated machine code completely.

This optimization really is the super-power of PyPy's optimizer, since it doesn't work only for primitive boxes but for any kind of object allocated on the heap with a predictable lifetime.

As an aside, while this kind of partial escape analysis is sort of new for object-oriented languages, it has actually existed in Prolog-based partial evaluation systems since the 80s, because it's just extremely natural there.

JIT Status 2007

So, back to our history. We're now in 2007, at the end of the EU project (you can find the EU-reports we wrote during the projects here). The EU project successfully finished, we survived the final review with the EU. So, what's the 2007 status of the JIT generator? It works kind of, it can be applied to PyPy. It produces a VM with a JIT that will turn Python code into machine code at runtime and run it. However, that machine code is not particularly fast. Also, it tends to generate many megabytes of machine code even for small Python programs. While it's always faster than PyPy without JIT, it's only sometimes faster than CPython, and most of the time Psyco still beats it. On the one hand, this is still an amazing achievement! It's arguably the biggest application of partial evaluation at this point in time! On the other hand, it was still quite disappointing in practice, particularly since some of us had believed at the time that it should have been possible to reach and then surpass the speed of Psyco with this approach.

2007: RSqueak and other languages

After the EU project ended we did all kinds of things. Like sleep for a month for example, and have the cleanup sprint that I already mentioned. We also had a slightly unusual sprint in Bern, with members of the Software Composition Group of Oscar Nierstrasz. As I wrote above, PyPy had been heavily influenced by Squeak Smalltalk, and that group is a heavy user of Squeak, so we wanted to see how to collaborate with them. At the beginning of the sprint, we decided together that the goal of that week should be to try to write a Squeak virtual machine in RPython, and at the end of the week we'd gotten surprisingly far with that goal. Basically most of the bytecodes and the Smalltalk object system worked, we had written an image loader and could run some benchmarks (during the sprint we also regularly updated a blog, the success of which led us to start the PyPy blog).

The development of the Squeak interpreter was very interesting for the project, because it was the first real step that moved RPython from being an implementation detail of PyPy to be a more interesting project in its own right. Basically a language to write interpreters in, with the eventual promise to get a JIT for that language almost for free. That Squeak implementation is now called RSqueak ("Research Squeak").

I'll not go into more details about any of the other language implementations in RPython in this post, but over the years we've had a large variety of language of them done by various people and groups, most of them as research vehicles, but also some as real language implementations. Some very cool research results came out of these efforts, here's a slightly outdated list of some of them.

The use of RPython for other languages complicated the PyPy narrative a lot, and in a way we never managed to recover the simplicity of the original project description "PyPy is Python in Python". Because now it's something like "we have this somewhat strange language, a subset of Python, that's called RPython, and it's good to write interpreters in. And if you do that, we'll give you a JIT for almost free. And also, we used that language to write a Python implementation, called PyPy.". It just doesn't roll off the tongue as nicely.

2008-2009: Four More JIT Generators

Back to the JIT. After writing the first JIT generator as part of the EU project, with somewhat mixed results, we actually wrote several more JIT generator prototypes with different architectures to try to solve some of the problems of the first approach. To give an impression of these prototypes, here’s a list of them.

  • The second JIT generator we started working on in 2008 behaved exactly like the first one, but had a meta-interpreter based architecture, to make it more flexible and easier to experiment with. The meta-interpreter was called the "rainbow interpreter", and in general the JIT is an area where we went somewhat overboard with borderline silly terminology, with notable occurrences of "timeshifter", "blackhole interpreter" etc.

  • The third JIT generator was an experiment based on the second one which changed compilation strategy. While the previous two had compiled many control flow paths of the currently compiled function eagerly, that third JIT was sort of maximally lazy and stopped compilation at every control flow split to avoid guessing which path would actually be useful later when executing the code. This was an attempt to reduce the problem of the first JIT generating way too much machine code. Only later, when execution went down one of the not yet compiled paths would it continue compiling more code. This gives an effect similar to that of lazy basic block versioning.

  • The fourth JIT generator was a pretty strange prototype, a runtime partial evaluator for Prolog, to experiment with various specialization trade-offs. It had an approach that we gave a not at all humble name, called "perfect specialization".

  • The fifth JIT generator is the one that we are still using today. Instead of generating a method-based JIT compiler from our interpreter we switched to generating a tracing JIT compiler. Tracing JIT compilers were sort of the latest fashion at the time, at least for a little while.

2009: Meta-Tracing

So, how did that tracing JIT generator work? A tracing JIT generates code by observing and logging the execution of the running program. This yields a straight-line trace of operations, which are then optimized and compiled into machine code. Of course most tracing systems mostly focus on tracing loops.

As we discovered, it's actually quite simple to apply a tracing JIT to a generic interpreter, by not tracing the execution of the user program directly, but by instead tracing the execution of the interpreter while it is running the user program (here's the paper we wrote about this approach).

So that's what we implemented. Of course we kept the two successful parts of the first JIT, promote and virtuals (both links go to the papers about these features in the meta-tracing context).

Why did we Abandon Partial Evaluation?

So one question I get sometimes asked when telling this story is, why did we think that tracing would work better than partial evaluation (PE)? One of the hardest parts of compilers in general and partial evaluation based systems in particular is the decision when and how much to inline, how much to specialize, as well as the decision when to split control flow paths. In the PE based JIT generator we never managed to control that question. Either the JIT would inline too much, leading to useless compilation of all kinds of unlikely error cases. Or it wouldn't inline enough, preventing necessary optimizations.

Meta tracing solves this problem with a hammer, it doesn't make particularly complex inlining decisions at all. It instead decides what to inline by precisely following what a real execution through the program is doing. Its inlining decisions are therefore very understandable and predictable, and it basically only has one heuristic based on whether the called function contains a loop or not: If the called function contains a loop, we'll never inline it, if it doesn't we always try to inline it. That predictability is actually what was the most helpful, since it makes it possible for interpreter authors to understand why the JIT did what it did and to actually influence its inlining decisions by changing the annotations in the interpreter source. It turns out that simple is better than complex.

2009-2011: The PyJIT Eurostars Project

While we were writing all these JIT prototypes, PyPy had sort of reverted back to being a volunteer-driven open source project (although some of us, like Antonio Cuni and I, had started working for universities and other project members had other sources of funding). But again, while we did the work it became clear that to get an actually working fast PyPy with generated JIT we would need actual funding again for the project. So we applied to the EU again, this time for a much smaller project with less money, in the Eurostars framework. We got a grant for three participants, merlinux, OpenEnd and Uni Düsseldorf, on the order of a bit more than half a million euro. That money was specifically for JIT development and JIT testing infrastructure.

Tracing JIT improvements

When writing the grant we had sat together at a sprint and discussed extensively and decided that we would not switch JIT generation approaches any more. We all liked the tracing approach well enough and thought it was promising. So instead we agreed to try in earnest to make the tracing JIT really practical. So in the Eurostars project we started with implementing sort of fairly standard JIT compiler optimizations for the meta-tracing JIT, such as:

  • constant folding

  • dead code elimination

  • loop invariant code motion (using LuaJIT's approach)

  • better heap optimizations

  • faster deoptimization (which is actually a bit of a mess in the meta-approach)

  • and dealing more efficiently with Python frames objects and the features of Python's debugging facilities


In 2010, to make sure that we wouldn't accidentally introduce speed regressions while working on the JIT, we implemented infrastructure to build PyPy and run our benchmarks nightly. Then, the website was implemented by Miquel Torres, a volunteer. The website shows the changes in benchmark performance compared to the previous n days. It didn't sound too important at first, but this was (and is) a fantastic tool, and an amazing motivator over the next years, to keep continually improving performance.

Continuous Integration

This actually leads me to something else that I'd say we learned, which is that continuous integration is really awesome, and completely transformative to have for a project. This is not a particularly surprising insight nowadays in the open source community, it's easy to set up continuous integration on Github using Travis or some other CI service. But I still see a lot of research projects that don't have tests, that don't use CI, so I wanted to mention it anyway. As I mentioned earlier in the post, PyPy has a quite serious testing culture, with unit tests written for new code, regression tests for all bugs, and integration tests using the CPython test suite. Those tests are run nightly on a number of architectures and operating systems.

Having all this kind of careful testing is of course necessary, since PyPy is really trying to be a Python implementation that people actually use, not just write papers about. But having all this infrastructure also had other benefits, for example it allows us to trust newcomers to the project very quickly. Basically after your first patch gets accepted, you immediately get commit rights to the PyPy repository. If you screw up, the tests (or the code reviews) are probably going to catch it, and that reduction to the barrier to contributing is just super great.

This concludes my advertisement for testing in this post.

2010: Implementing Python Objects with Maps

So, what else did we do in the Eurostars project, apart from adding traditional compiler optimizations to the tracing JIT and setting up CI infrastructure? Another strand of work, that went on sort of concurrently to the JIT generator improvements, were deep rewrites in the Python runtime, and the Python data structures. I am going to write about two exemplary ones here, maps and storage strategies.

The first such rewrite is fairly standard. Python instances are similar to Javascript objects, in that you can add arbitrary attributes to them at runtime. Originally Python instances were backed by a dictionary in PyPy, but of course in practice most instances of the same class have the same set of attribute names. Therefore we went and implemented Self style maps, which are often called hidden classes in the JS world to represent instances instead. This has two big benefits, it allows you to generate much better machine code for instance attribute access and makes instances use a lot less memory.

2011: Container Storage Strategies

Another important change in the PyPy runtime was rewriting the Python container data structures, such as lists, dictionaries and sets. A fairly straightforward observation about how those are used is that in a significant percentage of cases they contain type-homogeneous data. As an example it's quite common to have lists of only integers, or lists of only strings. So we changed the list, dict and set implementations to use something we called storage strategies. With storage strategies these data structures use a more efficient representations if they contain only primitives of the same type, such as ints, floats, strings. This makes it possible to store the values without boxing them in the underlying data structure. Therefore read and write access are much faster for such type homogeneous containers. Of course when later another data type gets added to such a list, the existing elements need to all be boxed at that point, which is expensive. But we did a study and found out that that happens quite rarely in practice. A lot of that work was done by Lukas Diekmann.

Deep Changes in the Runtime are Necessary

These two are just two examples for a number of fairly fundamental changes in the PyPy runtime and PyPy data structures, probably the two most important ones, but we did many others. That leads me to another thing we learned. If you want to generate good code for a complex dynamic language such as Python, it's actually not enough at all to have a good code generator and good compiler optimizations. That's not going to help you, if your runtime data-structures aren't in a shape where it's possible to generate efficient machine code to access them.

Maybe this is well known in the VM and research community. However it's the main mistake that in my opinion every other Python JIT effort has made in the last 10 years, where most projects said something along the lines of "we're not changing the existing CPython data structures at all, we'll just let LLVM inline enough C code of the runtime and then it will optimize all the overhead away". That never works very well.

JIT Status 2011

So, here we are at the end of the Eurostars project, what's the status of the JIT? Well, it seems this meta-tracing stuff really works! We finally started actually believing in it, when we reached the point in 2010 where self-hosting PyPy was actually faster than bootstrapping the VM on CPython. Speeding up the bootstrapping process is something that Psyco never managed at all, so we considered this a quite important achievement. At the end of Eurostars, we were about 4x faster than CPython on our set of benchmarks.

2012-2017: Engineering and Incremental Progress

2012 the Eurostars project was finished and PyPy reverted yet another time back to be an open source project. From then on, we've had a more diverse set of sources of funding: we received some crowd funding via the Software Freedom Conservancy and contracts of various sizes from companies to implement various specific features, often handled by Baroque Software. Over the next couple of years we revamped various parts of the VM. We improved the GC in major ways. We optimized the implementation of the JIT compiler to improve warmup times. We implemented backends for various CPU architectures (including PowerPC and s390x). We tried to reduce the number of performance cliffs and make the JIT useful in a broader set of cases.

Another strand of work was to push quite significantly to be more compatible with CPython, particularly the Python 3 line as well as extension module support. Other compatibility improvements we did was making sure that virtualenv works with PyPy, better support for distutils and setuptools and similar improvements. The continually improving performance as well better compatibility with the ecosystem tools led to the first few users of PyPy in industry.


Another very important strand of work that took a lot of effort in recent years was CPyExt. One of the main blockers of PyPy adoption had always been the fact that a lot of people need specific C-extension modules at least in some parts of their program, and telling them to reimplement everything in Python is just not a practical solution. Therefore we worked on CPyExt, an emulation layer to make it possible to run CPython C-extension modules in PyPy. Doing that was a very painful process, since the CPython extension API leaks a lot of CPython implementation details, so we had to painstakingly emulate all of these details to make it possible to run extensions. That this works at all remains completely amazing to me! But nowadays CPyExt is even getting quite good, a lot of the big numerical libraries such as Numpy and Pandas are now supported (for a while we had worked hard on a reimplementation of Numpy called NumPyPy, but eventually realized that it would never be complete and useful enough). However, calling CPyExt modules from PyPy can still be very slow, which makes it impractical for some applications that's why we are working on it.

Not thinking about C-extension module emulation earlier in the project history was a pretty bad strategic mistake. It had been clear for a long time that getting people to just stop using all their C-extension modules was never going to work, despite our efforts to give them alternatives, such as cffi. So we should have thought of a story for all the existing C-extension modules earlier in the project. Not starting CPyExt earlier was mostly a failure of our imagination (and maybe a too high pain threshold): We didn't believe this kind of emulation was going to be practical, until somebody went and tried it.

Python 3

Another main focus of the last couple of years has been to catch up with the CPython 3 line. Originally we had ignored Python 3 for a little bit too long, and were trailing several versions behind. In 2016 and 2017 we had a grant from the Mozilla open source support program of $200'000 to be able to catch up with Python 3.5. This work is now basically done, and we are starting to target CPython 3.6 and will have to look into 3.7 in the near future.

Incentives of OSS compared to Academia

So, what can be learned from those more recent years? One thing we can observe is that a lot of the engineering work we did in that time is not really science as such. A lot of the VM techniques we implemented are kind of well known, and catching up with new Python features is also not particularly deep researchy work. Of course this kind of work is obviously super necessary if you want people to use your VM, but it would be very hard to try to get research funding for it. PyPy managed quite well over its history to balance phases of more research oriented work, and more product oriented ones. But getting this balance somewhat right is not easy, and definitely also involves a lot of luck. And, as has been discussed a lot, it's actually very hard to find funding for open source work, both within and outside of academia.

Meta-Tracing really works!

Let me end with what, in my opinion, is the main positive technical result of PyPy the project. Which is that the whole idea of using a meta-tracing JIT can really work! Currently PyPy is about 7 times faster than CPython on a broad set of benchmarks. Also, one of the very early motivations for using a meta-jitting approach in PyPy, which was to not have to adapt the JIT to new versions of CPython proved to work: indeed we didn't have to change anything in the JIT infrastructure to support Python 3.

RPython has also worked and improved performance for a number of other languages. Some of these interpreters had wildly different architectures. AST-based interpreters, bytecode based, CPU emulators, really inefficient high-level ones that allocate continuation objects all the time, and so on. This shows that RPython also gives you a lot of freedom in deciding how you want to structure the interpreter and that it can be applied to languages of quite different paradigms.

I'll end with a list of the people that have contributed code to PyPy over its history, more than 350 of them. I'd like to thank all of them and the various roles they played. To the next 15 years!


A lot of people helped me with this blog post. Tim Felgentreff made me give the keynote, which lead me to start collecting the material. Samuele Pedroni gave essential early input when I just started planning the talk, and also gave feedback on the blog post. Maciej Fijałkowski gave me feedback on the post, in particular important insight about the more recent years of the project. Armin Rigo discussed the talk slides with me, and provided details about the early expectations about the first JIT's hoped-for performance. Antonio Cuni gave substantial feedback and many very helpful suggestions for the blog post. Michael Hudson-Doyle also fixed a number of mistakes in the post and rightfully complained about the lack of mention of the GC. Christian Tismer provided access to his copy of early Python-de mailing list posts. Matti Picus pointed out a number of things I had forgotten and fixed a huge number of typos and awkward English, including my absolute inability to put commas correctly. All remaining errors are of course my own.

update: fixed confusing wording in the maps section.

Wednesday, June 20, 2018

Repeating a Matrix Multiplication Benchmark

I watched the Hennessy & Patterson's Turing award lecture recently:

In it, there's a slide comparing the performance of various matrix multiplication implementations, using Python (presumably CPython) as a baseline and comparing that against various C implementations (I couldn't find the linked paper yet):

I expected the baseline speedup of switching from CPython to C to be higher and I also wanted to know what performance PyPy gets, so I did my own benchmarks. This is a problem that Python is completely unsuited for, so it should give very exaggerated results.

The usual disclaimers apply: All benchmarks are lies, benchmarking of synthetic workloads even more so. My implementation is really naive (though I did optimize it a little bit to help CPython), don't use any of this code for anything real. The benchmarks ran on my rather old Intel i5-3230M laptop under Ubuntu 17.10.

With that said, my results were as follows:

Implementation time speedup over CPython speedup over PyPy
CPython 512.588 ± 2.362 s 1 ×
PyPy 8.167 ± 0.007 s 62.761 ± 0.295 × 1 ×
'naive' C 2.164 ± 0.025 s 236.817 ± 2.918 × 3.773 ± 0.044 ×
NumPy 0.171 ± 0.002 s 2992.286 ± 42.308 × 47.678 ± 0.634 ×

This is running 1500x1500 matrix multiplications with (the same) random matrices. Every implementation is run 50 times in a fresh process. The results are averaged, the errors are bootstrapped 99% confidence intervals.

So indeed the speedup that I got of switching from CPython to C is quite a bit higher than 47x! PyPy is much better than CPython, but of course can't really compete against GCC. And then the real professionals (numpy/OpenBLAS) are in a whole 'nother league. The speedup of the AVX numbers in the slide above is even higher than my NumPy numbers, which I assume is the result of my old CPU with two cores, vs. the 18 core CPU with AVX support. Lesson confirmed: leave matrix multiplication to people who actually know what they are doing.

Friday, April 27, 2018

How to ignore the annoying Cython warnings in PyPy 6.0

If you install any Cython-based module in PyPy 6.0.0, it is very likely that you get a warning like this:
>>>> import numpy
/data/extra/pypy/6.0.0/site-packages/numpy/random/ UserWarning: __builtin__.type size changed, may indicate binary incompatibility. Expected 888, got 408
  from .mtrand import *
The TL;DR version is: the warning is a false alarm, and you can hide it by doing:
$ pypy -m pip install pypy-fix-cython-warning
The package does not contain any module, only a .pth file which installs a warning filter at startup.

Technical details

This happens because whenever Cython compiles a pyx file, it generates C code which does a sanity check on the C size of PyType_Type. PyPy versions up to 5.10 are buggy and report the incorrect size, so Cython includes a workaround to compare it with the incorrect value, when on PyPy.
PyPy 6 fixed the bug and now PyType_Type reports the correct size; however, Cython still tries to compare it with the old, buggy value, so it (wrongly) emits the warning.
Cython 0.28.2 includes a fix for it, so that C files generated by it no longer emit the warning. However, most packages are distributed with pre-cythonized C files. For example, include C files which were generated by Cython 0.26.1: if you compile it you still get the warning, even if you locally installed a newer version of Cython.

There is not much that we can do on the PyPy side, apart for waiting for all the Cython-based packages to do a new release which include C files generated by a newer Cython.  In the mean time, installing this module will silence the warning.

Thursday, April 26, 2018

PyPy2.7 and PyPy3.5 v6.0 dual release

The PyPy team is proud to release both PyPy2.7 v6.0 (an interpreter supporting Python 2.7 syntax), and a PyPy3.5 v6.0 (an interpreter supporting Python 3.5 syntax). The two releases are both based on much the same codebase, thus the dual release.
This release is a feature release following our previous 5.10 incremental release in late December 2017. Our C-API compatibility layer cpyext is now much faster (see the blog post) as well as more complete. We have made many other improvements in speed and CPython compatibility. Since the changes affect the included python development header files, all c-extension modules must be recompiled for this version.
Until we can work with downstream providers to distribute builds with PyPy, we have made packages for some common packages available as wheels. You may compile yourself using pip install --no-build-isolation <package>, the no-build-isolation is currently needed for pip v10.
First-time python users are often stumped by silly typos and omissions when getting started writing code. We have improved our parser to emit more friendly syntax errors, making PyPy not only faster but more friendly.
The GC now has hooks to gain more insights into its performance
The default Matplotlib TkAgg backend now works with PyPy, as do pygame and pygobject.
We updated the cffi module included in PyPy to version 1.11.5, and the cppyy backend to 0.6.0. Please use these to wrap your C and C++ code, respectively, for a JIT friendly experience.
As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.
The Windows PyPy3.5 release is still considered beta-quality. There are open issues with unicode handling especially around system calls and c-extensions.
The utf8 branch that changes internal representation of unicode to utf8 did not make it into the release, so there is still more goodness coming. We also began working on a Python3.6 implementation, help is welcome.
You can download the v6.0 releases here:
We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and CPython 3.5. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
The PyPy release supports:
  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux,
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux

What else is new?

PyPy 5.10 was released in Dec, 2017.
There are many incremental improvements to RPython and PyPy, the complete listing is here.
Please update, and continue to help us make PyPy better.

Cheers, The PyPy team

Tuesday, April 10, 2018

Improving SyntaxError in PyPy

For the last year, my halftime job has been to teach non-CS uni students to program in Python. While doing that, I have been trying to see what common stumbling blocks exist for novice programmers. There are many things that could be said here, but a common theme that emerges is hard-to-understand error messages. One source of such error messages, particularly when starting out, is SyntaxErrors.

PyPy's parser (mostly following the architecture of CPython) uses a regular-expression-based tokenizer with some cleverness to deal with indentation, and a simple LR(1) parser. Both of these components obviously produce errors for invalid syntax, but the messages are not very helpful. Often, the message is just "invalid syntax", without any hint of what exactly is wrong. In the last couple of weeks I have invested a little bit of effort to make them a tiny bit better. They will be part of the upcoming PyPy 6.0 release. Here are some examples of what changed.

Missing Characters

The first class of errors occurs when a token is missing, often there is only one valid token that the parser expects. This happens most commonly by leaving out the ':' after control flow statements (which is the syntax error I personally still make at least a few times a day). In such situations, the parser will now tell you which character it expected:

>>>> # before
>>>> if 1
  File "<stdin>", line 1
    if 1
SyntaxError: invalid syntax

>>>> # after
>>>> if 1
  File "<stdin>", line 1
    if 1
SyntaxError: invalid syntax (expected ':')

Another example of this feature:

>>>> # before
>>>> def f:
  File "<stdin>", line 1
    def f:
SyntaxError: invalid syntax

>>>> # after
>>>> def f:
  File "<stdin>", line 1
    def f:
SyntaxError: invalid syntax (expected '(')


Another source of errors are unmatched parentheses. Here, PyPy has always had slightly better error messages than CPython:

>>> # CPython
>>> )
  File "<stdin>", line 1
SyntaxError: invalid syntax

>>>> # PyPy
>>> )
  File "<stdin>", line 1
SyntaxError: unmatched ')'

The same is true for parentheses that are never closed (the call to eval is needed to get the error, otherwise the repl will just wait for more input):

>>> # CPython
>>> eval('(')
  File "<string>", line 1
SyntaxError: unexpected EOF while parsing

>>>> # PyPy
>>>> eval('(')
  File "<string>", line 1
SyntaxError: parenthesis is never closed

What I have now improved is the case of parentheses that are matched wrongly:

>>>> # before
>>>> (1,
.... 2,
.... ]
  File "<stdin>", line 3
SyntaxError: invalid syntax

>>>> # after
>>>> (1,
.... 2,
.... ]
  File "<stdin>", line 3
SyntaxError: closing parenthesis ']' does not match opening parenthesis '(' on line 1


Obviously these are just some very simple cases, and there is still a lot of room for improvement (one huge problem is that only a single SyntaxError is ever shown per parse attempt, but fixing that is rather hard).

If you have a favorite unhelpful SyntaxError message you love to hate, please tell us in the comments and we might try to improve it. Other kinds of non-informative error messages are also always welcome!

Tuesday, March 27, 2018

Leysin Winter Sprint 2018: review

Like every year, the PyPy developers and a couple of newcomers gathered in Leysin, Switzerland, to share their thoughts and contribute to the development of PyPy.

As always, we had interesting discussions about how we could improve PyPy, to make it the first choice for even more developers. We also made some progress with current issues, like compatibility with Python 3.6 and improving the performance of CPython extension modules, where we fixed a lot of bugs and gained new insights about where and how we could tweak PyPy.

We were very happy about the number of new people who joined us for the first time, and hope they enjoyed it as much as everyone else.


We worked on the following topics (and more!):
  • Introductions for newcomers
  • Python 3.5 and 3.6 improvements
  • CPyExt performance improvements and GC implementation
  • JIT: guard-compatible implementation
  • Pygame performance improvements
  • Unicode/UTF8 implementation
  • CFFI tutorial/overview rewrite
  • py3 test runners refactoring
  • RevDB improvements
The weather was really fine for most of the week, with only occasional snow and fog. We started our days with a short (and sometimes not so short) planning session and enjoyed our dinners in the great restaurants in the area. Some of us even started earlier and continued till late night. It was a relaxed, but also very productive atmosphere. On our break day on Wednesday, we enjoyed the great conditions and went skiing and hiking.


  • Arianna
  • Jean-Daniel
  • Stefan Beyer
  • Floris Bruynooghe
  • Antonio Cuni
  • René Dudfield
  • Manuel Jacob
  • Ronan Lamy
  • Remi Meier
  • Matti Picus
  • Armin Rigo
  • Alexander Schremmer
Leysin is easily reachable by Geneva Airport, so feel free to join us next time!


Saturday, January 13, 2018

PyPy 5.10.1 bugfix release for python 3.5

We have released a bug fix PyPy3.5-v5.10.1 due to the following issues:
  • Fix time.sleep(float('nan')) which would hang on Windows
  • Fix missing errno constants on Windows
  • Fix issue 2718 for the REPL on Linux
  • Fix an overflow in converting int secs to nanosecs (issue 2717 )
  • Using kwarg 'flag' to os.setxattr had no effect
  • Fix the winreg module for unicode entries in the registry on Windows
Note that many of these fixes are for our new beta version of PyPy3.5 on Windows. There may be more unicode problems in the Windows beta version, especially concerning directory- and file-names with non-ASCII characters.

On macOS, we recommend you wait for the Homebrew package to prevent issues with third-party packages. For other supported platforms our downloads are available now.
Thanks to those who reported the issues.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and CPython 3.5. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This PyPy 3.5 release supports:
  • x86 machines on most common operating systems (Linux 32/64 bits, macOS 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux,
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux
Please update, and continue to help us make PyPy better.
The PyPy Team

Monday, January 8, 2018

Leysin Winter sprint: 17-24 March 2018

The next PyPy sprint will be in Leysin, Switzerland, for the thirteenth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

(Note: this sprint is independent from the suggested April-May sprint in Poland.)

Goals and topics of the sprint

The list of topics is open, but here is our current list:

  • cffi tutorial/overview rewrite
  • py3 test runners are too complicated
  • make win32 builds green
  • make packaging more like cpython/portable builds
  • get CI builders for PyPy into mainstream projects (Numpy, Scipy, lxml, uwsgi)
  • get more of scientific stack working (tensorflow?)
  • cpyext performance improvements
  • General 3.5 and 3.6 improvements
  • JIT topics: guard-compatible, and the subsequent research project to save and reuse traces across processes
  • finish unicode-utf8
  • update, (web devs needed)

As usual, the main side goal is to have fun in winter sports :-) We can take a day off (for ski or anything else).

Exact times

Work days: starting March 18th (~noon), ending March 24th (~noon).

Please see announcement.txt for more information.

Monday, December 25, 2017

PyPy2.7 and PyPy3.5 v5.10 dual release

The PyPy team is proud to release both PyPy2.7 v5.10 (an interpreter supporting Python 2.7 syntax), and a final PyPy3.5 v5.10 (an interpreter for Python 3.5 syntax). The two releases are both based on much the same codebase, thus the dual release.

This release is an incremental release with very few new features, the main feature being the final PyPy3.5 release that works on linux and OS X with beta windows support. It also includes fixes for vmprof cooperation with greenlets.

Compared to 5.9, the 5.10 release contains mostly bugfixes and small improvements. We have in the pipeline big new features coming for PyPy 6.0 that did not make the release cut and should be available within the next couple months.

As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. As always, we strongly recommend updating.

There are quite a few important changes that are in the pipeline that did not make it into the 5.10 release. Most important are speed improvements to cpyext (which will make numpy and pandas a bit faster) and utf8 branch that changes internal representation of unicode to utf8, which should help especially the Python 3.5 version of PyPy.

This release concludes the Mozilla Open Source grant for having a compatible PyPy 3.5 release and we're very grateful for that. Of course, we will continue to improve PyPy 3.5 and probably move to 3.6 during the course of 2018.

You can download the v5.10 releases here:

We would like to thank our donors for the continued support of the PyPy project.

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython's JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and CPython 3.5. It's fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

The PyPy release supports:

  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux,
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux


  • improve ssl handling on windows for pypy3 (makes pip work)
  • improve unicode handling in various error reporters
  • fix vmprof cooperation with greenlets
  • fix some things in cpyext
  • test and document the cmp(nan, nan) == 0 behaviour
  • don't crash when calling sleep with inf or nan
  • fix bugs in _io module
  • inspect.isbuiltin() now returns True for functions implemented in C
  • allow the sequences future-import, docstring, future-import for CPython bug compatibility
  • Issue #2699: non-ascii messages in warnings
  • posix.lockf
  • fixes for FreeBSD platform
  • add .debug files, so builds contain debugging info, instead of being stripped
  • improvements to cppyy
  • issue #2677 copy pure c PyBuffer_{From,To}Contiguous from cpython
  • issue #2682, split firstword on any whitespace in sqlite3
  • ctypes: allow ptr[0] = foo when ptr is a pointer to struct
  • matplotlib will work with tkagg backend once matplotlib pr #9356 is merged
  • improvements to utf32 surrogate handling
  • cffi version bump to 1.11.2
Maciej Fijalkowski, Matti Picus and the whole PyPy team

Monday, October 30, 2017

How to make your code 80 times faster

I often hear people who are happy because PyPy makes their code 2 times faster or so. Here is a short personal story which shows PyPy can go well beyond that.

DISCLAIMER: this is not a silver bullet or a general recipe: it worked in this particular case, it might not work so well in other cases. But I think it is still an interesting technique. Moreover, the various steps and implementations are showed in the same order as I tried them during the development, so it is a real-life example of how to proceed when optimizing for PyPy.

Some months ago I played a bit with evolutionary algorithms: the ambitious plan was to automatically evolve a logic which could control a (simulated) quadcopter, i.e. a PID controller (spoiler: it doesn't fly).

The idea is to have an initial population of random creatures: at each generation, the ones with the best fitness survive and reproduce with small, random variations.

However, for the scope of this post, the actual task at hand is not so important, so let's jump straight to the code. To drive the quadcopter, a Creature has a run_step method which runs at each delta_t (full code):
class Creature(object):
    INPUTS = 2  # z_setpoint, current z position
    OUTPUTS = 1 # PWM for all 4 motors
    STATE_VARS = 1

    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars, ... outputs]
        self.state[self.STATE_VARS:] = inputs
        out_values =, self.state) + self.constant
        self.state[:self.STATE_VARS] = out_values[:self.STATE_VARS]
        outputs = out_values[self.STATE_VARS:]
        return outputs
  • inputs is a numpy array containing the desired setpoint and the current position on the Z axis;
  • outputs is a numpy array containing the thrust to give to the motors. To start easy, all the 4 motors are constrained to have the same thrust, so that the quadcopter only travels up and down the Z axis;
  • self.state contains arbitrary values of unknown size which are passed from one step to the next;
  • self.matrix and self.constant contains the actual logic. By putting the "right" values there, in theory we could get a perfectly tuned PID controller. These are randomly mutated between generations.
run_step is called at 100Hz (in the virtual time frame of the simulation). At each generation, we test 500 creatures for a total of 12 virtual seconds each. So, we have a total of 600,000 executions of run_step at each generation.

At first, I simply tried to run this code on CPython; here is the result:
$ python -m ev.main
Generation   1: ... [population = 500]  [12.06 secs]
Generation   2: ... [population = 500]  [6.13 secs]
Generation   3: ... [population = 500]  [6.11 secs]
Generation   4: ... [population = 500]  [6.09 secs]
Generation   5: ... [population = 500]  [6.18 secs]
Generation   6: ... [population = 500]  [6.26 secs]
Which means ~6.15 seconds/generation, excluding the first.

Then I tried with PyPy 5.9:
$ pypy -m ev.main
Generation   1: ... [population = 500]  [63.90 secs]
Generation   2: ... [population = 500]  [33.92 secs]
Generation   3: ... [population = 500]  [34.21 secs]
Generation   4: ... [population = 500]  [33.75 secs]
Ouch! We are ~5.5x slower than CPython. This was kind of expected: numpy is based on cpyext, which is infamously slow. (Actually, we are working on that and on the cpyext-avoid-roundtrip branch we are already faster than CPython, but this will be the subject of another blog post.)

So, let's try to avoid cpyext. The first obvious step is to use numpypy instead of numpy (actually, there is a hack to use just the micronumpy part). Let's see if the speed improves:
$ pypy -m ev.main   # using numpypy
Generation   1: ... [population = 500]  [5.60 secs]
Generation   2: ... [population = 500]  [2.90 secs]
Generation   3: ... [population = 500]  [2.78 secs]
Generation   4: ... [population = 500]  [2.69 secs]
Generation   5: ... [population = 500]  [2.72 secs]
Generation   6: ... [population = 500]  [2.73 secs]
So, ~2.7 seconds on average: this is 12x faster than PyPy+numpy, and more than 2x faster than the original CPython. At this point, most people would be happy and go tweeting how PyPy is great.

In general, when talking of CPython vs PyPy, I am rarely satified of a 2x speedup: I know that PyPy can do much better than this, especially if you write code which is specifically optimized for the JIT. For a real-life example, have a look at capnpy benchmarks, in which the PyPy version is ~15x faster than the heavily optimized CPython+Cython version (both have been written by me, and I tried hard to write the fastest code for both implementations).

So, let's try to do better. As usual, the first thing to do is to profile and see where we spend most of the time. Here is the vmprof profile. We spend a lot of time inside the internals of numpypy, and allocating tons of temporary arrays to store the results of the various operations.

Also, let's look at the jit traces and search for the function run: this is loop in which we spend most of the time, and it is composed of 1796 operations. The operations emitted for the line + self.constant are listed between lines 1217 and 1456. Here is the excerpt which calls; most of the ops are cheap, but at line 1232 we see a call to the RPython function descr_dot; by looking at the implementation we see that it creates a new W_NDimArray to store the result, which means it has to do a malloc():

The implementation of the + self.constant part is also interesting: contrary the former, the call to W_NDimArray.descr_add has been inlined by the JIT, so we have a better picture of what's happening; in particular, we can see the call to __0_alloc_with_del____ which allocates the W_NDimArray for the result, and the raw_malloc which allocates the actual array. Then we have a long list of 149 simple operations which set the fields of the resulting array, construct an iterator, and finally do a call_assembler: this is the actual logic to do the addition, which was JITtted indipendently; call_assembler is one of the operations to do JIT-to-JIT calls:

All of this is very suboptimal: in this particular case, we know that the shape of self.matrix is always (3, 2): so, we are doing an incredible amount of work, including calling malloc() twice for the temporary arrays, just to call two functions which ultimately do a total of 6 multiplications and 6 additions. Note also that this is not a fault of the JIT: CPython+numpy has to do the same amount of work, just hidden inside C calls.

One possible solution to this nonsense is a well known compiler optimization: loop unrolling. From the compiler point of view, unrolling the loop is always risky because if the matrix is too big you might end up emitting a huge blob of code, possibly uselss if the shape of the matrices change frequently: this is the main reason why the PyPy JIT does not even try to do it in this case.

However, we know that the matrix is small, and always of the same shape. So, let's unroll the loop manually:
class SpecializedCreature(Creature):

    def __init__(self, *args, **kwargs):
        Creature.__init__(self, *args, **kwargs)
        # store the data in a plain Python list = list(self.matrix.ravel()) + list(self.constant)
        self.data_state = [0.0]
        assert self.matrix.shape == (2, 3)
        assert len( == 8

    def run_step(self, inputs):
        # state: [state_vars ... inputs]
        # out_values: [state_vars, ... outputs]
        k0, k1, k2, q0, q1, q2, c0, c1 =
        s0 = self.data_state[0]
        z_sp, z = inputs
        # compute the output
        out0 = s0*k0 + z_sp*k1 + z*k2 + c0
        out1 = s0*q0 + z_sp*q1 + z*q2 + c1
        self.data_state[0] = out0
        outputs = [out1]
        return outputs
In the actual code there is also a sanity check which asserts that the computed output is the very same as the one returned by Creature.run_step.

So, let's try to see how it performs. First, with CPython:
$ python -m ev.main
Generation   1: ... [population = 500]  [7.61 secs]
Generation   2: ... [population = 500]  [3.96 secs]
Generation   3: ... [population = 500]  [3.79 secs]
Generation   4: ... [population = 500]  [3.74 secs]
Generation   5: ... [population = 500]  [3.84 secs]
Generation   6: ... [population = 500]  [3.69 secs]
This looks good: 60% faster than the original CPython+numpy implementation. Let's try on PyPy:
Generation   1: ... [population = 500]  [0.39 secs]
Generation   2: ... [population = 500]  [0.10 secs]
Generation   3: ... [population = 500]  [0.11 secs]
Generation   4: ... [population = 500]  [0.09 secs]
Generation   5: ... [population = 500]  [0.08 secs]
Generation   6: ... [population = 500]  [0.12 secs]
Generation   7: ... [population = 500]  [0.09 secs]
Generation   8: ... [population = 500]  [0.08 secs]
Generation   9: ... [population = 500]  [0.08 secs]
Generation  10: ... [population = 500]  [0.08 secs]
Generation  11: ... [population = 500]  [0.08 secs]
Generation  12: ... [population = 500]  [0.07 secs]
Generation  13: ... [population = 500]  [0.07 secs]
Generation  14: ... [population = 500]  [0.08 secs]
Generation  15: ... [population = 500]  [0.07 secs]
Yes, it's not an error. After a couple of generations, it stabilizes at around ~0.07-0.08 seconds per generation. This is around 80 (eighty) times faster than the original CPython+numpy implementation, and around 35-40x faster than the naive PyPy+numpypy one.

Let's look at the trace again: it no longer contains expensive calls, and certainly no more temporary malloc() s. The core of the logic is between lines 386-416, where we can see that it does fast C-level multiplications and additions: float_mul and float_add are translated straight into mulsd and addsd x86 instructions.

As I said before, this is a very particular example, and the techniques described here do not always apply: it is not realistic to expect an 80x speedup on arbitrary code, unfortunately. However, it clearly shows the potential of PyPy when it comes to high-speed computing. And most importantly, it's not a toy benchmark which was designed specifically to have good performance on PyPy: it's a real world example, albeit small.

You might be also interested in the talk I gave at last EuroPython, in which I talk about a similar topic: "The Joy of PyPy JIT: abstractions for free" (abstract, slides and video).

How to reproduce the results

$ git clone
$ cd evolvingcopter
$ {python,pypy} -m ev.main --no-specialized --no-numpypy
$ {python,pypy} -m ev.main --no-specialized
$ {python,pypy} -m ev.main