Saturday, January 29, 2011

A JIT Backend for ARM Processors

In the past few months, as part of my master's thesis, I have been developing an ARM backend for the PyPy JIT in the arm-backend branch. Currently, it is still work in progress: all integer and object operations are working, and support for floating point is also under development.
ARM processors are very widely used, being deployed in servers, some netbooks and mainly mobile devices such as phones and tablets. One of our goals is to be able to run PyPy on phones, especially on Android. Currently it is not yet possible to translate and compile PyPy for Android automatically, but there has been some work on using Android's NDK to compile PyPy's generated C code.
The JIT Backend targets the application profile of the ARMv7 instruction set architecture which is found for example in the Cortex-A8 processors used in many Android powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and test the backend we are using a BeagleBoard-xM which has a 1 GHz ARM Cortex-A8 and 512 MB of RAM running the ARM port of Ubuntu 10.10.
Currently on Linux it is possible to translate and cross-compile PyPy's Python interpreter, as well as other interpreters, with the ARM JIT backend enabled, using Scratchbox 2 to provide a build environment and the GNU ARM cross-compilation toolchain. So far the backend only supports the Boehm garbage collector, which does not produce the best results combined with the JIT; we plan to add support for the other GCs in the future, which should increase the performance of PyPy on ARM.
While we are still debugging the last issues with the backend, we can already run some simple benchmarks on Pyrolog, a Prolog interpreter written in RPython. Even using Boehm as the GC, the results look very promising. In the benchmarks we compare Pyrolog to SWI-Prolog, a Prolog interpreter written in C, which is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.
Benchmark            SWI-Prolog (ms)   Pyrolog (ms)    Speedup
iterate                         60.0            6.0     10.0
iterate_assert                 130.0            6.0     21.67
iterate_call                  3310.0            5.0    662.0
iterate_cut                     60.0          359.0      0.16713
iterate_exception             4950.0          346.0     14.306
iterate_failure                400.0          127.0      3.1496
iterate_findall                740.0        No res.
iterate_if                     140.0            6.0     23.333
The iterate_call benchmark, which constructs a predicate and calls it at runtime, shows a speedup of 662 times over SWI-Prolog and is an example where the JIT can show its strength. The Pyrolog interpreter and the JIT treat dynamically defined predicates like static ones and can generate optimized code in both cases, whereas SWI only compiles statically defined rules and has to fall back to interpretation for dynamic ones.
For simple benchmarks running on PyPy's Python interpreter we see some speedups over CPython, but we still need to debug the backend a bit more before we can show numbers on more complex benchmarks. So, stay tuned.

Friday, January 21, 2011

PyPy wants you!

If you ever considered contributing to PyPy, but never did so far, this is a good moment to start! :-)

Recently, we merged the fast-forward branch which brings Python 2.7 compatibility, with the plan of releasing a new version of PyPy as soon as all tests pass.

However, at the moment there are still quite a few failing tests because of new 2.7 features that have not been implemented yet: many of them are easy to fix, and fixing them is a good way to gain familiarity with the code base for those who are interested in it. Michael Foord wrote a little howto explaining the workflow for running lib-python tests.

Thus, if you are willing to join us in the effort of having a PyPy compatible with Python 2.7, probably the most sensible option is to come to the #PyPy IRC channel on Freenode, so we can coordinate with each other and avoid fixing the same test twice.

Moreover, if you are a student and are considering participating in the next Google Summer of Code, this is a good time to get into PyPy. You will have the opportunity to gain a good understanding of PyPy for when you decide what you would like to work on over the summer.

Tuesday, January 11, 2011

Loop invariant code motion

Recently, the jit-unroll-loops branch was merged. It implements the idea described in Using Escape Analysis Across Loop Boundaries for Specialization. That post only talks about virtuals, but the idea turned out to be more far-reaching. After the metainterpreter produces a trace, several optimizations are applied to it before it is turned into binary code. Removing allocations is only one of them. There are also, for instance,
  • Heap optimizations that remove memory accesses by reusing results previously read from or written to the same location.
  • Reuse of the results of pure operations if the same pure operation is executed twice.
  • Removal of redundant guards.
  • ...
A lot of these optimizations are in one way or another removing operations from the trace and/or reusing previous results. All of them could benefit from being able to operate across loop boundaries: not only in the sense that operations on loop invariants could be moved out of the loop entirely, but also in the sense that results produced at the end of an iteration could be reused at the beginning of the next, even if there are no loop invariants involved.
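
As a hedged, source-level illustration (the optimizations really work on traces, not on Python source, and this function is made up for this post), consider a loop in which a global lookup and an attribute read are invariant:

def weighted_count(counter, n):
    total = 0
    while n > 0:
        # the global lookup of 'abs' and the read of counter.step give the
        # same result in every iteration; after unrolling, both only need
        # to be performed in the preamble, not in the loop
        total += abs(counter.step)
        n -= 1
    return total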

This is achieved by unrolling the trace into two iterations and letting the optimizer work on this two-iteration trace. The optimizer will now be able to optimize the second iteration more than the first, since it can reuse results from the first iteration. We call the optimized version of the first iteration the preamble and the optimized version of the second iteration the loop. The preamble will end with a jump to the loop, while the loop will end with a jump to itself. This means that the preamble will be executed once, for the first iteration, while the loop will be executed for all following iterations.

Sqrt example

Here is an example of a Python implementation of sqrt using a fairly simple algorithm (essentially Newton's method):

def sqrt(y, n=10000):
    x = y / 2
    while n > 0:
        n -= 1
        x = (x + y/x) / 2   # refine the estimate by averaging x and y/x
    return x

If it is called with sqrt(1234.0), a fairly long trace is produced. From this trace the optimizer creates the following preamble (Loop 1) and loop (Loop 0)

Looking at the preamble, it starts by making sure that it is not currently being profiled (the guard on i5) and that the function object has not been changed since the trace was made (the guard on p3). Somewhat intermixed with that, the integer variable n is unboxed by making sure p11 points to an integer object and reading the integer value out of that object. These operations are not needed in the loop (and have been removed from it), as emitting the same guards again would be redundant and n becomes a virtual before the end of the preamble.

        guard_value(i5, 0, descr=<Guard6>) 
        guard_nonnull_class(p11, ConstClass(W_IntObject), descr=<Guard7>) 
        guard_value(p3, ConstPtr(ptr15), descr=<Guard8>) 
        i16 = getfield_gc_pure(p11, descr=<W_IntObject.inst_intval>)
Next comes a test and a guard implementing the while statement, followed by the decrementing of n. These operations appear both in the preamble and in the loop:
        i18 = int_gt(i16, 0)
        guard_true(i18, descr=<Guard9>) 
        i20 = int_sub(i16, 1)
After that, the two floating point variables x and y are unboxed. Again, this is only needed in the preamble. Note how the unboxed value of y, called f23, is passed unchanged from the preamble to the loop in the arguments of the jump, to allow it to be reused. It will not become a virtual, since it is never changed within the loop:
        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
        guard_nonnull_class(p10, 17652552, descr=<Guard11>) 
        f23 = getfield_gc_pure(p10, descr=<W_FloatObject.inst_floatval>)
        f24 = getfield_gc_pure(p12, descr=<W_FloatObject.inst_floatval>)
Following that are the actual calculations performed in the loop, in the form of floating point operations (since the function was called with a float argument). These appear in both the loop and the preamble:
        i26 = float_eq(f24, 0.000000)
        guard_false(i26, descr=<Guard12>) 
        f27 = float_truediv(f23, f24)
        f28 = float_add(f24, f27)
        f30 = float_truediv(f28, 2.000000)
Finally there are some tests checking whether a signal was received (such as when the user presses Ctrl-C), in which case some signal handler should be executed, or whether we need to hand over to another thread. This is implemented with a counter that is decreased once every iteration. It will go below zero after some specific number of iterations, tunable by sys.setcheckinterval. The counter is read from and written to some global location, where it can also be made negative by a C-level signal handler:
        i32 = getfield_raw(32479328, descr=<pypysig_long_struct.c_value>)
        i34 = int_sub(i32, 2)
        setfield_raw(32479328, i34, descr=<pypysig_long_struct.c_value>)
        i36 = int_lt(i34, 0)
        guard_false(i36, descr=<Guard13>) 
        jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=<Loop0>)

Bridges

When a guard fails often enough, the metainterpreter is started again to produce a new trace starting at the failing guard. The tracing is continued until a previously compiled loop is entered. This could either be the same loop that contains the failing guard or some completely different loop. If it is the same loop, executing the preamble again may be unnecessary, and it is preferable to end the bridge with a jump directly to the loop. To achieve this, the optimizer tries to produce short preambles that are inlined at the end of bridges, allowing them to jump directly to the loop. Inlining is better than jumping to a common preamble because most of the inlined short preamble can typically be removed again by the optimizer. Creating such a short preamble is however not always possible. Bridges jumping to loops for which no short preamble can be generated have to end with a jump to the full preamble instead.

The short preamble is created by comparing the operations in the preamble with the operations in the loop. The operations that are in the preamble but not in the loop are moved to the short preamble whenever it is safe to move them to the front of the operations remaining. In other words, the full preamble is equivalent to the short preamble followed by one iteration of the loop.
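
In pseudo-Python, the core of that idea looks roughly like this (a heavily simplified sketch; is_safe_to_prepend is a hypothetical stand-in for the dependency analysis the real optimizer performs):

def build_short_preamble(preamble_ops, loop_ops):
    short_preamble = []
    for op in preamble_ops:
        if op in loop_ops:
            continue  # appears in the loop as well, so it stays there
        if is_safe_to_prepend(op, short_preamble):  # hypothetical check
            short_preamble.append(op)
    return short_preamble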

This much has currently been implemented. To give the full picture, there are two more features that hopefully will be implemented in the near future. The first is to replace the full preamble, used by the interpreter when it reaches a compiled loop, with the short preamble. This is currently not done and is probably not as straightforward as it might first seem. The problem is where to resume interpreting on a guard failure. However, implementing it should save some memory. Not only because the preamble will become smaller, but mainly because the guards will appear either in the loop or in the preamble, but not in both (as they do now). That means there will only be a single bridge, and not potentially two copies, once the guards are traced.

With a short preamble, the sqrt example above would result in a trace like this:

If it is executed long enough, the last guard will be traced to form a bridge. The trace will inherit the virtuals from its parent. This can be used to optimize away the part of the inlined short preamble that deals with virtuals. The resulting bridge should look something like
    [p0, p1, p2, p3, p4, f5, i6]
    i7 = force_token()
    setfield_gc(p1, i7, descr=<PyFrame.vable_token>)
    call_may_force(ConstClass(action_dispatcher), p0, p1, descr=<VoidCallDescr>)
    guard_not_forced(, descr=<Guard19>) 
    guard_no_exception(, descr=<Guard20>) 

    guard_nonnull_class(p4, 17674024, descr=<Guard21>) 
    f52 = getfield_gc_pure(p4, descr=<W_FloatObject.inst_floatval>)
    jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=<Loop0>)
Here the first paragraph comes from the traced bridge and the second is what remains of the short preamble after optimization. The box p4 is not a virtual (it contains a pointer to y, which is never changed), and it is only virtuals that the bridge inherits from its parent. This is why the last two operations currently cannot be removed.

Each time the short preamble is inlined, a new copy of each of its guards is generated. Typically the short preamble is inlined in several places, and thus there will be several copies of each of those guards. If they fail often enough, bridges from them will be traced (as with all guards). But since there typically are several copies of each guard, the same bridge would be generated in several places. To prevent this, mini-bridges from the inlined guards are produced already during the inlining. These mini-bridges contain nothing but a jump to the preamble.

The mini-bridges need the arguments of the preamble to be able to jump to it. These arguments contain, among other things, boxed versions of the variables x and y. Those variables are virtuals in the loop and have to be allocated. Currently those allocations are placed in front of the inlined guard. Moving those allocations into the mini-bridges is the second feature that hopefully will be implemented in the near future. After this feature is implemented, the result should look something like this:

Multiple specialized versions

Floating point operations were generated in the trace above because sqrt was called with a float argument. If it is instead called with an int argument, integer operations will be generated. The somewhat more complex situation is when both ints and floats are used as arguments. Then the JIT needs to generate multiple versions of the same loop, specialized in different ways. The details, given below, of how this is achieved are somewhat involved; for the casual reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float argument (but with n small enough not to generate the bridge). Then the trace shown above will be generated. If sqrt is now called with an int argument, the guard in the preamble testing that the type of the input object is float will fail:

        guard_nonnull_class(p12, 17652552, descr=<Guard10>) 
It will fail every iteration, so soon enough a bridge will be generated from this guard in the preamble. This bridge will end with a jump to the same loop, and the optimizer will try to inline the short preamble at the end of it. This will however fail, since now there are two guards on p12: one that makes sure it is an int and one that makes sure it is a float. The optimizer will detect that the second guard will always fail and mark the bridge as invalid. Invalid loops are not passed on to the backend for compilation.

If a loop is detected to be invalid while inlining the short preamble, the metainterpreter will continue to trace for yet another iteration of the loop. This new trace can be compiled as above and will produce a new loop with a new preamble that is now specialized for int arguments instead of float arguments. The bridge that previously became invalid will now be tried again, this time inlining the short preamble of the new loop instead. This will produce a set of traces connected like this:

(click for some hairy details)

The height of the boxes in this figure represents how many instructions they contain (presuming the missing features from the previous section are implemented). Loop 0 is specialized for floats and its preamble has been split into two boxes at the failing guard. Loop 2 is specialized for ints and is larger than Loop 0. This is mainly because integer division in Python does not map to the integer division of the machine, but has to be implemented with several instructions (integer division in Python truncates its result towards minus infinity, while the machine integer division truncates towards 0). Also, the height of the bridge is about the same as the height of Loop 2. This is because it contains a full iteration of the loop.

A More Advanced Example

Let's conclude with an example that is a bit more advanced, where this unrolling approach actually outperforms the previous approach. Consider making a fixed-point implementation of the square root using 16 bits for the fractional part. This can be done using the same implementation of sqrt, but calling it with an object of a class representing such fixed-point real numbers:

class Fix16(object):
    # 16.16 fixed point: 'val' holds the number scaled by 2**16
    def __init__(self, val, scale=True):
        if isinstance(val, Fix16):
            self.val = val.val
        else:
            if scale:
                self.val = int(val * 2**16)
            else:
                self.val = val

    def __add__(self, other):
        return Fix16(self.val + Fix16(other).val, False)

    def __sub__(self, other):
        return Fix16(self.val - Fix16(other).val, False)

    def __mul__(self, other):
        # shifting each operand right by 8 keeps the product in 16.16 format
        return Fix16((self.val >> 8) * (Fix16(other).val >> 8), False)

    def __div__(self, other):
        # shifting the dividend left by 16 keeps the quotient in 16.16 format
        return Fix16((self.val << 16) / Fix16(other).val, False)
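
As a usage sketch (the expected value is sqrt(1234), about 35.128, stored in 16.16 fixed point):

result = sqrt(Fix16(1234.0))
print result.val / float(2 ** 16)   # prints roughly 35.128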

Below is a table comparing the runtime of the sqrt function above with different argument types on different Python interpreters. PyPy 1.4.1 was released before the optimizations described in this post were in place, while they are in place in the nightly build from January 5, denoted pypy in the table. There are also the running times for the same algorithms implemented in C and compiled with "gcc -O3 -march=native". Tests were executed on a 2.53 GHz Intel Core2 processor with n=100000000 iterations. Comparing the integer versions with C may be considered a bit unfair because of the more advanced integer division operator in Python. The left part of this table shows runtimes of sqrt in a program containing a single call to sqrt (i.e. only a single specialized version of the loop is needed). The right part shows the runtime of sqrt when it has been called with a different type of argument before.

              First call                      Second call
              float     int     Fix16        float     int     Fix16
cpython     28.18 s  22.13 s  779.04 s     28.07 s  22.21 s  767.03 s
pypy 1.4.1   1.20 s   6.49 s   11.31 s      1.20 s   6.54 s   11.23 s
pypy         1.20 s   6.44 s    6.78 s      1.19 s   6.26 s    6.79 s
gcc          1.15 s   1.82 s    1.89 s      1.15 s   1.82 s    1.89 s

For this to work in the last case, when Fix16 is the argument type of the second call, the trace_limit had to be increased from its default value to prevent the metainterpreter from aborting while tracing the second version of the loop. Also, sys.setcheckinterval(1000000) was used to prevent the bridge from being generated. With the bridge, the performance of the last case is significantly worse; maybe because the optimizer currently fails to generate a short preamble for it, but the slowdown seems too big for that to be the only explanation. Below are the runtime numbers with checkinterval set to its default value of 100:

              First call                      Second call
              float     int     Fix16        float     int     Fix16
cpython     28.71 s  22.09 s  781.86 s     28.28 s  21.92 s  761.59 s
pypy 1.4.1   1.21 s   6.48 s   11.22 s      1.72 s   7.58 s   12.18 s
pypy         1.21 s   6.27 s    7.22 s      1.20 s   6.29 s   90.47 s
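
As a side note, the settings described above amount to something like this (the trace_limit value and script name are illustrative; JIT parameters are passed on pypy's command line via --jit):

$ pypy --jit trace_limit=20000 fix16_sqrt_bench.py

with the benchmark itself calling

import sys
sys.setcheckinterval(1000000)  # prevents the bridge from being generated

for the first set of measurements.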

Conclusions

Even though we are seeing speedups in a variety of different small benchmarks, more complicated examples are not affected much by these optimizations. This might partly be because larger examples have longer and more complicated loops, so allowing optimizations to operate across loop boundaries has a smaller relative effect. Another problem is that with more complicated examples there will be more bridges, and bridges are currently not handled very well (most of the time all virtuals are forced at the end of the bridge, as explained above). But moving those forcings into the mini-bridges should fix that.

Wednesday, December 22, 2010

PyPy 1.4.1

Here is PyPy 1.4.1 :-)

Update: Win32 binaries available.

Enjoy!

Release announcement

We're pleased to announce the 1.4.1 release of PyPy. This release consolidates all the bug fixes that occurred since the previous release. To everyone that took the trouble to report them, we want to say thank you.

What is PyPy

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython. Note that it still only emulates Python 2.5 by default; the fast-forward branch with Python 2.7 support is slowly getting ready but will only be integrated in the next release.

In two words, the advantage of trying out PyPy instead of CPython (the default implementation of Python) is, for now, the performance. Not all programs are faster in PyPy, but we are confident that any CPU-intensive task will be much faster, at least if it runs for long enough (the JIT has a slow warm-up phase, which can take several seconds or even one minute on the largest programs).

Note again that we do support compiling and using C extension modules from CPython (pypy setup.py install). However, this is still an alpha feature, and the most complex modules typically fail for various reasons; others work (e.g. PIL) but take a serious performance hit. Also, for Mac OS X see below.

Please note also that PyPy's performance was optimized almost exclusively on Linux. It seems from some reports that on Windows as well as Mac OS X (probably for different reasons) the performance might be lower. We did not investigate much so far.

More highlights

  • We migrated to Mercurial (thanks to Ronny Pfannschmidt and Antonio Cuni for the effort) and moved to bitbucket. The new command to check out a copy of PyPy is:
    hg clone http://bitbucket.org/pypy/pypy

  • In long-running processes, the assembler generated by old JIT compilations is now freed. There should be no more leaks, however long the process runs.

  • Greatly improved the performance of the binascii module, and of hashlib.md5 and hashlib.sha.

  • Made sys.setrecursionlimit() a no-op. Instead, we rely purely on the built-in stack overflow detection mechanism, which also gives you a RuntimeError -- just not at some exact recursion level (a small example follows this list).

  • Fix argument processing (now e.g. pypy -OScpass works like it does on CPython --- if you have a clue what it does there :-) )

  • cpyext on Mac OS X: it still does not seem to work. I systematically get a segfault in dlopen(). Contributions welcome.

  • Fix two corner cases in the GC (one in minimark, one in asmgcc+JIT). This notably prevented pypy translate.py -Ojit from working on Windows, leading to crashes.

  • Fixed a corner case in the JIT's optimizer, leading to Fatal RPython error: AssertionError.

  • Added some missing built-in functions to the 'os' module.

  • Fix ctypes (it was not propagating keepalive information from c_void_p).
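
As a small illustration of the sys.setrecursionlimit() change above (the function is made up for this example):

def recurse(n=0):
    return recurse(n + 1)

try:
    recurse()
except RuntimeError:
    print "stack overflow detected, execution continues"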

Tuesday, December 14, 2010

PyPy migrates to Mercurial

The assiduous readers of this blog surely remember that during the last Düsseldorf sprint in October, we started the process for migrating our main development repository from Subversion to Mercurial. Today, after more than two months, the process has finally been completed :-).

The new official PyPy repository is hosted on BitBucket.

The migration has been painful because the SVN history of PyPy was a mess and none of the existing conversion tools could handle it correctly. This was partly because PyPy started when Subversion was still at version 0.9, when some best practices were still to be established, and partly because we probably managed to invent all the possible ways to do branches (and even some of the impossible ones: there is at least one commit which you cannot do with the plain SVN client; you have to speak to the server yourself :-)).

The actual conversion was possible thanks to the enormous work done by Ronny Pfannschmidt and his hackbeil tool. I would like to personally thank Ronny for his patience in handling all the various requests we made.

We hope that PyPy development becomes even more approachable now, at least from a version control point of view.

Friday, December 10, 2010

Oh, and btw: PyPy gets funding through "Eurostars"

There is a supporting reason why we made so many advances in the last year: funding through Eurostars, a European research funding program. The title of our proposal (accepted in 2009) is: "PYJIT - a fast and flexible toolkit for dynamic programming languages based on PyPy". And the participants are Open End AB, the Heinrich-Heine-Universität Düsseldorf (HHU), and merlinux GmbH.

It's not hard to guess what PYJIT is actually about, is it? Quoting: "The PYJIT project will deliver a fast and flexible Just-In-Time Compiler toolkit based on PyPy to the market of dynamic languages. Our main aim is to showcase our project's results for the Open Source language Python, providing unprecedented levels of flexibility and with speed hitherto only available using statically typed languages." (Details in German or in Swedish :-)

A subgoal is to improve our development and testing infrastructure, mainly showcased by Holger's recent py.test releases (py.test being the testing tool used by PyPy for its 16K tests) and by the speed.pypy.org infrastructure (a web app programmed by Miquel Torres on his own time).

The overall scope of this project is smaller than that of the previous EU project from 2004 to 2007. The persons that are (or were) getting money to work on PyPy are Samuele Pedroni (at Open End), Maciej Fijalkowski (as a subcontractor), Carl Friedrich Bolz, Armin Rigo, Antonio Cuni (all at HHU), and Holger Krekel (at merlinux) as well as Ronny Pfannschmidt (as a subcontractor).

The Eurostars funding lasts until August 2011. What comes afterwards? Well, for one, many of the currently funded people have done work without getting funding in previous years. This will probably continue. We also have non-funded people in the core group right now and we hope to enlarge it further. But of course there are still large tasks ahead which may greatly benefit from funding. We have set up a donation infrastructure and maybe we can win one or more larger organisations to provide higher or regular sums of money to fund future development work. Another possibility for companies is to pay PyPy developers to help and improve PyPy for their particular use cases.

And finally, your help, donations and suggestions are always welcome and overall we hope to convince more and more people it's worthwhile to invest into PyPy's future.

Wednesday, December 8, 2010

Leysin Winter sprint

Hi all,

The next sprint will be in Leysin, Switzerland, during the week of the 16th-22nd of January 2011.

Now that we have released 1.4, and plan to release 1.4.1 soon, the sprint will mainly be devoted to fixing issues reported by various users. Of course this does not prevent people from showing up with a more precise interest in mind.

As usual, the break day on the sprint will likely be a day of skiing :-)

Hoping to see you there.

Update: there are actually a number of branches that we want to polish and merge into trunk: at least fast-forward, jit-unroll-loops, arm-backend and jitypes2. For more details, see the announcement.

Wednesday, December 1, 2010

PyPy 1.4 release aftermath

A couple of days have passed since the announcement of the 1.4 release, and this is a short summary of what happened afterwards. Let's start with the numbers:

  • 16k visits to the release announcement on our blog
  • we don't have download statistics unfortunately
  • 10k visits to speed center
  • most traffic comes from referring sites, with reddit alone creating more than a third of our traffic

Not too bad for a project that doesn't have a well-established user base.

Lessons learned:

  • Releases are very important. They're still the major way projects communicate with the community, even if we have nightly builds that are mostly stable.
  • No segfaults were reported, no incompatibilities between JIT and normal interpretation. We think that proves (or at least provides a lot of experimental evidence) that our write-once-and-then-transform method is effective.
  • A lot of people complained about their favorite C extension module not working; we should have made it clearer that CPyExt is in an alpha state. Indeed, we would like to know which C extension modules do work :-).
  • Some people reported massive speedups, others reported slowdowns compared to CPython. Most of those slowdowns relate to modules being inefficient (or doing happy nonsense), like ctypes. This is expected, given that not all modules are even jitted (although having them jitted is usually a matter of a couple of minutes).
  • Nobody complained about a lack of some stdlib module. We implemented the ones which are used more often, but this makes us wonder if less used stdlib modules have any users at all.

In general, feedback has been overwhelmingly positive and we would like to thank everyone trying it (and especially those reporting problems).

Cheers,
fijal

We are not heroes, just very patient

Inspired by some of the comments to the release that said "You are heroes", I thought a bit about the longish history of PyPy and hunted around for some of the mailing list posts that started the project. Then I put all this information together into the following timeline:

There is also a larger version of the timeline. Try clicking on some of the events; the links usually go to the sprint descriptions. I also tried to find pictures for the sprints but succeeded for only half of them; if anybody still has some, I would be interested. It's kind of fun to browse around in some of the old sprint descriptions to see how PyPy evolved. Some of the current ideas have been around for a long time, some are new. In the descriptions of the releases I put estimates for the speed of each release.

Friday, November 26, 2010

PyPy 1.4: Ouroboros in practice

We're pleased to announce the 1.4 release of PyPy. This is a major breakthrough in our long journey, as PyPy 1.4 is the first PyPy release that can translate itself faster than CPython. Starting today, we are using PyPy more for our everyday development. So may you :) You can download it here:

http://pypy.org/download.html

What is PyPy

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython. It is fast (pypy 1.4 and cpython 2.6 comparison).

New Features

Among its new features, this release includes numerous performance improvements (which made fast self-hosting possible), a 64-bit JIT backend, as well as serious stabilization. As of now, we can consider the 32-bit and 64-bit linux versions of PyPy stable enough to run in production.

Numerous speed achievements are described on our blog. Normalized speed charts comparing pypy 1.4 and pypy 1.3 as well as pypy 1.4 and cpython 2.6 are available on the benchmark website. For the impatient: yes, we got a lot faster!

More highlights

  • PyPy's built-in Just-in-Time compiler is fully transparent and automatically generated; it now also has very reasonable memory requirements. The total memory used by a very complex and long-running process (translating PyPy itself) is within 1.5x to at most 2x the memory needed by CPython, for a speed-up of 2x.
  • More compact instances. All instances are as compact as if they had __slots__. This can give programs a big gain in memory. (In the example of translation above, we already have carefully placed __slots__, so there is no extra win.)
  • Virtualenv support: PyPy is now fully compatible with virtualenv; note that to use it, you need a recent version of virtualenv (>= 1.5). A small usage sketch follows this list.
  • Faster (and JITted) regular expressions - huge boost in speeding up the re module.
  • Other speed improvements, like JITted calls to functions like map().
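
The virtualenv usage mentioned above looks, for instance, like this (paths are illustrative):

$ virtualenv -p /path/to/pypy my-pypy-env
$ my-pypy-env/bin/python --version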

Cheers,
Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Armin Rigo and the PyPy team

Improving Memory Behaviour to Make Self-Hosted PyPy Translations Practical

In our previous blog post, we talked about how fast PyPy can translate itself compared to CPython. However, the price to pay for the 2x speedup was a huge amount of memory: actually, it was so huge that a standard -Ojit compilation could not be completed on 32-bit because it required more than the 4 GB of RAM that are addressable on that platform. On 64-bit, it consumed 8.3 GB of RAM instead of the 2.3 GB needed by CPython.

This behavior was mainly caused by the JIT, because at the time we wrote the blog post the generated assembler was kept alive forever, together with some big data structure needed to execute it.

In the past two weeks, Anto and Armin attacked the issue in the jit-free branch, which has recently been merged to trunk. The branch solves several issues. Its main idea is that if a loop has not been executed for a certain amount of time (controlled by the new loop_longevity JIT parameter) we consider it "old" and no longer needed, and thus we deallocate it.
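
For example, one could tune the parameter like this (the value is illustrative; JIT parameters are passed via the --jit command-line option):

$ pypy --jit loop_longevity=1000 yourprogram.py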

(In the process of doing this, we also discovered and fixed an oversight in the implementation of generators, which led to generators being freed only very slowly.)

To understand the freeing of loops some more, let's look at how many loops are actually created during a translation. The purple line in the following graph shows how many loops (and bridges) are alive at any point in time with an infinite longevity, which is equivalent to the situation we had before the jit-free branch. By contrast, the blue line shows the number of loops that you get in the current trunk: the difference is evident, as now we never have more than 10000 loops alive, while previously we got up to about 37000. The time on the X axis is expressed in "Giga Ticks", where a tick is the value read out of the Time Stamp Counter of the CPU.

The grey vertical bars represent the beginning of each phase of the translation:

  • annotate performs control flow graph construction and type inference.
  • rtype lowers the abstraction level of the control flow graphs with types to that of C.
  • pyjitpl constructs the JIT.
  • backendopt optimizes the control flow graphs.
  • stackcheckinsertion finds the places in the call graph that can overflow the C stack and inserts checks that raise an exception instead.
  • database_c produces a database of all the objects the C code will have to know about.
  • source_c produces the C source code.
  • compile_c calls the compiler to produce the executable.

You can nicely see how the number of alive graphs drops shortly after the beginning of each new phase.

Those two fixes, freeing loops and generators, improve the memory usage greatly: now, translating PyPy on PyPy on 32-bit consumes 2 GB of RAM, while on CPython it consumes 1.1 GB. This result can even be improved somewhat, because we are not actually freeing the assembler code itself, but only the large data structures around it; we can consider it as a residual memory leak of around 150 MB in this case. This will be fixed in the jit-free-asm branch.

The following graph shows the memory usage in more detail:

  • the blue line (cpython-scaled) shows the total amount of RAM that the OS allocates for CPython. Note that the X axis (the time) has been scaled down so that it spans as much as the PyPy one, to ease the comparison. Actually, CPython took more than twice as much time as PyPy to complete the translation.
  • the red line (VmRss) shows the total amount of RAM that the OS allocates for PyPy: it includes both the memory directly handled by our GC and the "raw memory" that we need to allocate for other tasks, such as the assembly code generated by the JIT
  • the brown line (gc-before) shows how much memory is used by the GC before each major collection
  • the yellow line (gc-after) shows how much memory is used by the GC after each major collection: this represents the amount of memory which is actually needed to hold our Python objects. The difference between gc-before and gc-after (the GC delta) is the amount of memory that the GC uses before triggering a new major collection.

By comparing gc-after and cpython-scaled, we can see that PyPy uses mostly the same amount of memory as CPython for storing the application objects (due to reference counting, the memory usage in CPython is always very close to the actually necessary memory). The extra memory used by PyPy is due to the GC delta, to the machine code generated by the JIT and probably to some other external effects (such as memory fragmentation).

Note that the GC delta can be set arbitrarily low (another recent addition -- the default value depends on the actual RAM on your computer; it probably works to translate if your computer has precisely 2 GB, because in this case the GC delta and thus the total memory usage will be somewhat lower than reported here), but the cost is to have more frequent major collections and thus a higher run-time overhead. The same is true for the memory needed by the JIT, which can be reduced by telling the JIT to compile less often or to discard old loops more frequently. As often happens in computer science, there is a trade-off between space and time, and currently for this particular example PyPy runs twice as fast as CPython by doubling the memory usage. We hope to improve even more on this trade-off.

On 64-bit, things are even better, as shown by the following graph:

The general shape of the lines is similar to the 32-bit graph. However, the relative difference to CPython is much better: we need about 3 GB of RAM, just 24% more than the 2.4 GB needed by CPython. And we are still more than 2x faster!

The memory saving is due (partly?) to the vtable ptr optimization, which is enabled by default on 64-bit because it has no speed penalty (see Unifying the vtable ptr with the GC header).

The net result of our work is that now translating PyPy on PyPy is practical and takes less than 30 minutes. It's impressive how quickly you get used to translation taking half the time -- now we cannot use CPython any more for that because it feels too slow :-).

Monday, November 15, 2010

Running large radio telescope software on top of PyPy and twisted

Hello.

As some of you already know, I've recently started working on a very large radio telescope at SKA South Africa. This telescope's operating software runs almost exclusively on Python (several high-throughput pieces are in C or CUDA or directly executed by FPGAs). Some cool telescope pictures:


(photos courtesy of SKA South Africa)

Most of the operation software uses the KatCP protocol to talk between devices. The currently used implementation is Open Source software with a custom home-built server and client. As part of the experiments, I've implemented a Twisted-based version and ran it on top of CPython and PyPy, for both the default implementation and the Twisted-based one, to see how those perform.

There are two testing scenarios: the first one tries to saturate the connection by setting up multiple sensors that report state every 10 ms; the second one measures the round-trip time between sending a request and receiving the response. Both numbers count requests per 0.2 s, so higher is better. On the X axis is the number of simultaneously connected clients.

All benchmark code is available in the KatCP repository.

The results are as follows:


As you can see, in general Twisted has a larger overhead for a single client but scales better as the number of clients increases. That is, I think, expected, since Twisted has extra layers of indirection. The round-trip degradation of Twisted has to be investigated, but for us scenario 1 is by far the more important one.

All across the board, PyPy performs much better than CPython for both Twisted and the home-made solution, which I think is a pretty good result.

Note: we didn't roll this setup into production yet, but there is a good chance that both Twisted and PyPy will be used in the near future.

Cheers, fijal

Saturday, November 13, 2010

Efficiently Implementing Python Objects With Maps

As could be foreseen from my Call for Memory Benchmarks post a while ago, I am currently working on improving the memory behaviour of PyPy's Python interpreter. In this blog post I want to describe the various kinds of data a Python instance can store. Then I want to describe how a branch that I worked on, and that was recently merged, implements the various features of instances in a very memory-efficient way.

Python's Object Model

All "normal" new-style Python instances (i.e. instances of subclasses of object without added declarations) store two (or possibly three) kinds of information.

Storing the Class

Every instance knows which class it belongs to. This information is accessible via the .__class__ attribute. It can also be changed to other (compatible enough) classes by writing to that attribute.

Instance Variables

Every instance also has an arbitrary number of attributes stored (also called instance variables). The instance variables used can vary per instance, which is not the case in most other class-based languages: traditionally (e.g. in Smalltalk or Java) the class describes the shape of its instances, which means that the set of admissible instance variable names is the same for all instances of a class.

In Python on the other hand, it is possible to add arbitrary attributes to an instance at any point. The instance behaves like a dictionary mapping attribute names (as strings) to the attribute values.

This is actually how CPython implements instances. Every instance has a reference to a dictionary that stores all the attributes of the instance. This dictionary can be reached via the .__dict__ attribute. To make things more fun, the dictionary can also be changed by writing to that attribute.
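
A quick interactive illustration of both points:

class A(object):
    pass

a = A()
a.x = 4
print a.__dict__         # prints: {'x': 4}
a.__dict__ = {'y': 6}    # replacing the dictionary wholesale also works
print a.y                # prints: 6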

Example

As an example, consider the following code:

class A(object):
    pass

instance1 = A()
instance1.x = 4
instance1.y = 6
instance1.z = -1

instance2 = A()
instance2.x = 1
instance2.y = 2
instance2.z = 3

These two instances would look something like this in memory:

(The picture glosses over a number of details, but it still shows the essential issues.)

This way of storing things is simple, but unfortunately rather inefficient. Most instances of the same class have the same shape, i.e. the same set of instance attribute names. That means that the key part of all the dictionaries is identical (shown grey here). Therefore storing that part repeatedly in all instances is a waste. In addition, dictionaries are themselves rather large. Since they are typically implemented as hashmaps, which must not be too full to be efficient, a dictionary will use something like 6 words on average per key.

Slots

Since normal instances are rather large, CPython 2.2 introduced slots, to make instances consume less memory. Slots are a way to fix the set of attributes an instance can have. This is achieved by adding a declaration to a class, like this:

class B(object):
    __slots__ = ["x", "y", "z"]

Now the instances of B can only have x, y and z as attributes and don't have a dictionary at all. Instead, the instances of B get allocated with enough size to hold exactly the number of instance variables that the class permits. This clearly saves a lot of memory over the dictionary approach, but has a number of disadvantages. It is obviously less flexible, as you cannot add additional instance variables to an instance if you happen to need to do that. It also introduces a set of rules and corner-cases that can be surprising sometimes (e.g. instances of a subclass of a class with slots that doesn't have a slots declaration will have a dict).

Using Maps for Memory-Efficient Instances

As we have seen in the diagram above, the dictionaries of instances of the same class tend to look very similar and share all the keys. The central idea to use less memory is to "factor out" the common parts of the instance dictionaries into a new object, called a "map" (because it is a guide to the landscape of the object, or something). After that factoring out, the representation of the instances above looks something like this:

Every instance now has a reference to its map, which describes what the instance looks like. The actual instance variables are stored in an array (called storage in the diagram). In the example here, the map describes that the instances have three attributes x, y and z. The numbers after the attributes are indexes into the storage array.

If somebody adds a new attribute to one of the instances, the map for that instance will be changed to another map that also contains the new attribute, and the storage will have to grow a field for the new attribute. The maps are immutable, immortal and reused as much as possible. This means that two instances of the same class with the same set of attributes will have the same map. This also means that the memory the map itself uses is not too important, because it will potentially be amortized over many instances.
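
To make the scheme concrete, here is a minimal sketch of map-based instances in plain Python (heavily simplified compared to PyPy's real implementation):

class Map(object):
    def __init__(self, indexes):
        self.attribute_indexes = indexes   # attribute name -> storage index
        self.transitions = {}              # attribute name -> successor map

    def index(self, name):
        return self.attribute_indexes.get(name, -1)

    def add_attribute(self, name):
        # cache the successor map, so that instances adding the same
        # attributes in the same order end up sharing maps
        if name not in self.transitions:
            indexes = dict(self.attribute_indexes)
            indexes[name] = len(indexes)
            self.transitions[name] = Map(indexes)
        return self.transitions[name]

EMPTY_MAP = Map({})

class Instance(object):
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []

    def getfield(self, name):
        index = self.map.index(name)
        if index < 0:
            raise AttributeError(name)
        return self.storage[index]

    def setfield(self, name, value):
        index = self.map.index(name)
        if index < 0:
            self.map = self.map.add_attribute(name)
            self.storage.append(value)
        else:
            self.storage[index] = value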

Note that using maps makes instances nearly as small as if the correct slots had been declared in the class. The only overhead needed is the indirection to the storage array, because you can get new instance variables, but not new slots.

The concept of a "map" that describes instances is kind of old and comes from the virtual machine for the Self programming language. The optimization was first described in 1989 in a paper by Chambers, Ungar and Lee with the title An Efficient Implementation of Self, a Dynamically-Typed Object-Oriented Language Based on Prototypes. A similar technique is used in Google's V8 JavaScript engine, where the maps are called hidden classes and in the Rhino JavaScript engine.

The rest of the post describes a number of further details that occur if instances are implemented using maps.

Supporting Dictionaries with Maps

The default instance representation with maps as shown above works without actually having a dictionary as part of each instance. If a dictionary is actually requested, by accessing the .__dict__ attribute, it needs to be created and cached. The dictionary is not a normal Python dictionary, but a thin wrapper around the object that forwards all operations to it. From the user's point of view it behaves like a normal dictionary though (it even has the correct type).

The dictionary needs to be cached, because accessing .__dict__ several times should always return the same dictionary. The caching happens by using a different map that knows about the dictionary and putting the dictionary into the storage array:

Things become really complex if the fake dict is used in strange ways. As long as the keys are strings, everything is fine. If somebody adds other keys to the dict, they cannot be represented by the map any more (which supports only attributes, i.e. string keys in the __dict__). If that happens, all the information of the instance will move into the fake dictionary, like this:

In this picture, the key -1 was added to the instance's dictionary. Since using the dictionary in arbitrary ways should be rare, we are fine with the additional time and memory that the approach takes.

Slots and Maps

Maps work perfectly together with slots, because the slots can just be stored in the storage array used by the maps as well (in practice there are some refinements to that scheme). This means that putting __slots__ on a class has mostly no effect, because the instance only stores the values of the attributes (and not the names), which is equivalent to the way slots are stored in CPython.

Implementation Details

In the diagrams above, I represented the maps as flat objects. In practice this is a bit more complex, because it needs to be efficient to go from one map to the next when new attributes are added. Thus the maps are organized in a tree.

The instances with their maps from above look a bit more like this in practice:

Every map just describes one attribute of the object, with a name and an index. Every map also has a back field that points to another map describing what the rest of the object looks like. This chain ends with a terminator, which also stores the class of the object.

The maps also contain the information necessary for making a new object of class A. Immediately after the new object has been created, its map is the terminator. If the x attribute is added, its map is changed to the second-lowest map, and so on. The blue arrows show the sequence of maps that the new object goes through when the attributes x, y, z are added.
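
A sketch of this chained representation (again simplified, and ignoring the transition caching shown earlier):

class Terminator(object):
    def __init__(self, cls):
        self.cls = cls                 # the terminator stores the class

    def length(self):
        return 0

    def find_index(self, name):
        return -1                      # attribute not found

class PlainAttribute(object):
    def __init__(self, name, back):
        self.name = name
        self.back = back               # map describing the rest of the object
        self.index = back.length()     # position in the storage array

    def length(self):
        return self.index + 1

    def find_index(self, name):
        if name == self.name:
            return self.index
        return self.back.find_index(name)   # walk the chain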

This representation of maps as chains of objects sounds very inefficient if an object has many attributes, since the whole chain has to be walked to find the index. This is true to some extent. The problem goes away in the presence of the JIT, which knows that the chain of maps is an immutable structure and will thus optimize away all the chain-walking. If the JIT is not used, there are a few caches that try to speed up the walking of this chain (similar to the method cache in CPython and PyPy).

Results

It's hard to compare the improvements of this optimization in a fair way, as the trade-offs are just very different. Just to give an impression: a million objects of the same class with three fields on a 32-bit system take:

without slots:

  • 182 MiB memory in CPython
  • 177 MiB memory in PyPy without maps
  • 40 MiB memory in PyPy with maps

with slots:

  • 45 MiB memory in CPython
  • 50 MiB memory in PyPy without maps
  • 40 MiB memory in PyPy with maps

Note how maps make the objects a bit more memory-efficient than CPython using slots. Also, using slots has no additional effect in PyPy.

Conclusion

Maps are a powerful approach to shrinking the memory used by many similar instances. I think they can be pushed even further (e.g. by adding information about the types of the attributes) and plan to do so in the following months. Details will be forthcoming.

Wednesday, November 10, 2010

Speeding up PyPy by donations

PyPy joins the Software Freedom Conservancy

Good news. PyPy is now a member of the Software Freedom Conservancy (SFC), see the SFC blog post. This allows us to manage non-profit monetary aspects of the project independently from a company or particular persons. So we can now officially receive donations, both from people preferring right or left sides (see the Donate buttons on our home page and our blog). You can use PayPal or Google Checkout; donations are tax-exempt in the USA and hopefully soon in Europe as well.

What's it going to get used for? For the immediate future we intend to use the donations to fund travel of core contributors to PyPy sprints who otherwise can't afford to come. So if you have no time but some money, you can help to encourage coding contributors to care for PyPy. If we end up with bigger sums we'll see and take suggestions. Money-spending decisions will be made by core PyPy people according to non-profit guidelines. And we'll post information from time to time about how much we got and where the money went.

If you have any questions regarding the SFC membership or donations, you may send email to sfc at pypy.org, which is monitored by Carl Friedrich Bolz, Jacob Hallen and Holger Krekel - the initial PyPy SFC representatives on behalf of the PyPy team. Many thanks go out to Bradley M. Kuhn for helping to implement the PyPy SFC membership.

cheers,

Holger & Carl Friedrich

Tuesday, November 9, 2010

A snake which bites its tail: PyPy JITting itself

We have to admit: even if we have been writing for years about the fantastic speedups that the PyPy JIT gives, we, the PyPy developers, still don't use it for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves surely know that the process takes quite a long time to complete, about an hour on super-fast hardware and even more on average computers. Unfortunately, it happened that translate.py was a bad match for our JIT and thus ran much slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use of custom metaclasses, and until a few weeks ago metaclasses disabled some of the central optimizations that make PyPy so fast. During the recent Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and re-enabled all the optimizations, even in the presence of metaclasses.

So, today we decided that it was time to benchmark PyPy against itself again. First, we tried to translate PyPy using CPython as usual, with the following command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and 12 GB of RAM, running 32-bit Ubuntu):

$ python ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  252.0 s
[Timer] rtype_lltype                   ---  199.3 s
[Timer] pyjitpl_lltype                 ---  565.2 s
[Timer] backendopt_lltype              ---  217.4 s
[Timer] stackcheckinsertion_lltype     ---   26.8 s
[Timer] database_c                     ---  234.4 s
[Timer] source_c                       ---  480.7 s
[Timer] compile_c                      ---  258.4 s
[Timer] ===========================================
[Timer] Total:                         --- 2234.2 s

Then, we tried the same command line with PyPy (SVN revision 78903, x86-32 JIT backend, downloaded from the nightly build page):

$ pypy-c-78903 ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  165.3 s
[Timer] rtype_lltype                   ---  121.9 s
[Timer] pyjitpl_lltype                 ---  224.0 s
[Timer] backendopt_lltype              ---   72.1 s
[Timer] stackcheckinsertion_lltype     ---    7.0 s
[Timer] database_c                     ---  104.4 s
[Timer] source_c                       ---  167.9 s
[Timer] compile_c                      ---  320.3 s
[Timer] ===========================================
[Timer] Total:                         --- 1182.8 s

Yes, it's not a typo: PyPy is almost two times faster than CPython! Moreover, we can see that PyPy is faster in each of the individual steps apart from compile_c, which consists of just a call to make to invoke gcc. The slowdown comes from the fact that the Makefile also contains a lot of calls to the trackgcroot.py script, which happens to perform badly on PyPy, but we have not investigated why yet.

However, there is also a drawback: on this specific benchmark, PyPy consumes much more memory than CPython. The reason why the command line above contains --no-allworkingmodules is that if we include all the modules, the translation crashes when it is 99% complete because it consumes all of the 4 GB of memory which is addressable by a 32-bit process.

A partial explanation is that so far the assembler generated by the PyPy JIT is immortal, and the memory allocated for it is never reclaimed. This is clearly bad for a program like translate.py, which is divided into several independent steps and for which most of the code generated in each step could safely be thrown away once the step is completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and thus translating with all working modules is no longer an issue. This is the time taken with CPython (note that it does not make sense to compare with the 32-bit CPython translation above, because that one does not include all the modules):

$ python ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  782.7 s
[Timer] rtype_lltype                   ---  445.2 s
[Timer] pyjitpl_lltype                 ---  955.8 s
[Timer] backendopt_lltype              ---  457.0 s
[Timer] stackcheckinsertion_lltype     ---   63.0 s
[Timer] database_c                     ---  505.0 s
[Timer] source_c                       ---  939.4 s
[Timer] compile_c                      ---  465.1 s
[Timer] ===========================================
[Timer] Total:                         --- 4613.2 s

And this is for PyPy:

$ pypy-c-78924-64 ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  505.8 s
[Timer] rtype_lltype                   ---  279.4 s
[Timer] pyjitpl_lltype                 ---  338.2 s
[Timer] backendopt_lltype              ---  125.1 s
[Timer] stackcheckinsertion_lltype     ---   21.7 s
[Timer] database_c                     ---  187.9 s
[Timer] source_c                       ---  298.8 s
[Timer] compile_c                      ---  650.7 s
[Timer] ===========================================
[Timer] Total:                         --- 2407.6 s

The results are comparable with the 32-bit case: PyPy is still almost 2 times faster than CPython. And it also shows that our 64-bit JIT backend is as good as the 32-bit one. Again, the drawback is in the consumed memory: CPython used 2.3 GB while PyPy took 8.3 GB.

Overall, the results are impressive: we knew that PyPy can be good at optimizing small benchmarks and even middle-sized programs, but as far as we know this is the first example in which it heavily optimizes a huge, real-world application. And, believe us, the PyPy translation toolchain is complex enough to contain all kinds of dirty tricks and black magic that make Python lovable and hard to optimize :-).

Sunday, October 31, 2010

Düsseldorf Sprint Report 2010

This year's installment of the yearly PyPy Düsseldorf Sprint is drawing to a close. As usual, we worked in the seminar room of the programming language group at the University of Düsseldorf. The sprint was different from previous ones in that we had fewer people than usual and many of them actually live in Düsseldorf all the time.

David spent the sprint working on the arm-backend branch, which is adding an ARM backend to the JIT. With the help of Armin he added support for bridges in the JIT and generally implemented missing operations, mostly for handling integers so far.

Ronny and Anto worked the whole week trying to come up with a scheme for importing PyPy's SVN history into a Mercurial repository without losing too much information. This is a non-trivial task, because PyPy's history is gnarly. We are nearly at revision 79000 and when we started using it, Subversion was at version 0.1. All possible and impossible ways to mangle and mistreat a Subversion repository have been applied to PyPy's repo, so most of the importing tools just give up. Ronny and Anto came up with a new plan and new helper scripts every day, only to then discover another corner case that they hadn't thought of. Now they might actually have a final plan (but they said that every day, so who knows?).

The branch history of PyPy's repository (every box is a branch)

Carl Friedrich and Lukas started working in earnest on memory benchmarks to understand the memory behaviour of Python code better. They have now implemented a generic memory benchmark runner and a simple analysis that walks all objects and collects size information about them. They also added some benchmarks that were proposed in the comments of the recent call for benchmarks. As soon as some results from that work are there, we will post about them.

There were also some minor tasks performed during the sprint. Armin implemented the _bisect module and the dict.popitem method in RPython. Armin and Carl Friedrich made the new memory-saving mapdict implementation more suitable to use without the JIT (blog post should come about that too, at some point). They also made classes with custom metaclasses a lot faster when the JIT is used.

The last three days of the sprint were spent working on Håkan's jit-unroll-loops branch. The branch is meant to move loop invariants out of the loop, using techniques very similar to what is described in the recent post on escape analysis across loop boundaries (see? it will soon stop being science fiction). Some of the ideas of this approach also come from LuaJIT, which also uses very aggressive loop-invariant code motion in its optimizer. Moving loop invariants outside of the loop is very useful, because many of the lookups that Python programs do in loops are loop invariants. An example is if you call a function in a loop: the global lookup can often be done only once.

This branch fundamentally changes some of the core assumptions of the JIT, so it is a huge amount of work to make it fit with all the other parts and to adapt all tests. That work is now nearly done, some failing tests remain. The next steps are to fix them and then do additional tests with the translated executable and look at the benchmarks.

Monday, October 25, 2010

The peace of green

No, we are not going to talk about the environment (i.e., the set of variables as printed by /usr/bin/env. What else? :-)).

After months in which we had a couple of tests failing every day, we finally managed to turn (almost) everything green today, at least on Linux. Enjoy this screenshot taken from the nightly build page:

As usual, the full buildbot results can be seen from the summary page.

cheers, Anto

Friday, October 22, 2010

PhD Thesis about PyPy's CLI JIT Backend

Hi all,

a few months ago I finished my PhD studies and now my thesis is available, just in case someone does not have anything better to do than read it :-).

The title of the thesis is High performance implementation of Python for CLI/.NET with JIT compiler generation for dynamic languages, and it is mainly based on my work on the CLI backend for the PyPy JIT (note that the CLI JIT backend is currently broken on trunk, but it still works in the cli-jit branch).

The thesis might be useful also for people that are not directly interested in the CLI JIT backend, as it also contains general information about the inner workings of PyPy which are independent from the backend: in particular, chapters 5 and 6 explain how the JIT frontend works.

Here is the summary of chapters:
  1. Introduction
  2. The problem
  3. Enter PyPy
  4. Characterization of the target platform
  5. Tracing JITs in a nutshell
  6. The PyPy JIT compiler generator
  7. The CLI JIT backend
  8. Benchmarks
  9. Conclusion and Future Work

cheers, Anto

Friday, October 1, 2010

Next PyPy sprint

Hi all,

The next PyPy sprint is scheduled for the end of the month, from the 25th to the 31st of October 2010. It will be done at the university of Düsseldorf, Germany, where three of us are working.

Please see this link for more information.