Tuesday, June 23, 2009

EuroPython

EuroPython is coming. We have two 30-minutes talks that we will present. In addition, the sprint takes place the 29th of June (there will be no-one from the team on the 28th of June), as well as on the 3rd and 4th of July.

JIT progress

In the last days I finally understood how to do virtualizables. Now the frame overhead is gone. This was done with the help of discussion with Samuele, porting ideas from PyPy's first JIT attempt.

This is of course work in progress, but it works in PyPy (modulo a few XXXs, but no bugs so far). The performance of the resulting code is quite good: even with Boehm (the GC that is easy to compile to but gives a slowish pypy-c), a long-running loop typically runs 50% faster than CPython. That's "baseline" speed, moreover: we will get better speed-ups by applying optimizations on the generated code. Doing so is in progress, but it suddenly became easier because that optimization phase no longer has to consider virtualizables -- they are now handled earlier.

Update:Virtualizables is basically a way to avoid frame overhead. The frame object is allocated and has a pointer, but the JIT is free to unpack it's fields (for example python level locals) and store them somewhere else (stack or registers). Each external (out of jit) access to frame managed by jit, needs to go via special accessors that can ask jit where those variables are.

Monday, June 15, 2009

News from the jit front

As usual, progress is going slower then predicted, but nevertheless, we're working hard to make some progress.

We recently managed to make our nice GCs cooperate with our JIT. This is one point from our detailed plan. As of now, we have a JIT with GCs and no optimizations. It already speeds up some things, while slowing down others. The main reason for this is that the JIT generates assembler which is kind of ok, but it does not do the same level of optimizations gcc would do.

So the current status of the JIT is that it can produce assembler out of executed python code (or any interpreter written in RPython actually), but the results are not high quality enough since we're missing optimizations.

The current plan, as of now, looks as follows:

  • Improve the handling of GCs in JIT with inlining of malloc-fast paths, that should speed up things by a constant, not too big factor.
  • Write a simplified python interpreter, which will be a base for experiments and to make sure that our JIT does correct things with regard to optimizations. That would work as mid-level integration test.
  • Think about ways to inline loop-less python functions into their parent's loop.
  • Get rid of frame overhead (by virtualizables)
  • Measure, write benchmarks, publish
  • Profit

Cheers,
fijal

Saturday, May 16, 2009

ICOOOLPS Submissions

Both of the papers that people from the PyPy team submitted to ICOOOLPS have been accepted. They are:

  • "Faster than C#: efficient implementation of dynamic languages on .NET" (pdf1) by Armin, Anto and Davide Ancona, who is Anto's Ph.D. advisor
  • "Tracing the Meta-Level: PyPy’s Tracing JIT Compiler" (pdf2) by Carl Friedrich, Armin, Anto and Maciek

(the pdfs are obviously the submitted versions, not the final ones).

This year ICOOOLPS (Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems) is being held on July the 6th at ECOOP 2009 in Genova, Italy. Other than these two papers, Anto and Carl Friedrich will also present a PyPy tutorial, on July the 7th.

Thursday, April 30, 2009

4 weeks of GDB

Hello.

So, according to our jit plan we're mostly done with point 1, that is to provide a JIT that compiles python code to assembler in the most horrible manner possible but doesn't break. That meant mostly 4 weeks of glaring at GDB and megabytess of assembler generated by C code generated from python code. The figure of 4 weeks proves that our approach is by far superior to the one of psyco, since Armin says it's "only 4 weeks" :-)

Right now, pypy compiled with JIT can run the whole CPython test suite without crashing, which means we're done with obvious bugs and the only ones waiting for us are really horrible. (Or they really don't exist. At least they should never be about obscure Python corner cases: they can only be in the 10'000 lines of relatively clear code that is our JIT generator.)

But... the fun thing is that we can actually concentrate on optimizations! So the next step is to provide a JIT that is correct *and* actually speeds up python. Stay tuned for more :-)

Cheers,
fijal, armin & benjamin

UPDATE: for those of you blessed with no knowledge of C, gdb stands for GNU debugger, a classic debugger for C. (It's also much more powerful than python debugger, pdb, which is kind of surprising).

Tuesday, April 28, 2009

1.1 final released

We just released PyPy 1.1 final. Not much changed since the beta, apart from some more fixed bugs. Have fun with it!

Tuesday, April 21, 2009

Roadmap for JIT

Hello.

First a disclaimer. This post is more about plans for future than current status. We usually try to write about things that we have done, because it's much much easier to promise things than to actually make it happen, but I think it's important enough to have some sort of roadmap.

In recent months we came to the point where the 5th generation of JIT prototype was working as nice or even a bit nicer than 1st one back in 2007. Someone might ask "so why did you spend all this time without going forward?". And indeed, we spend a lot of time moving sideways, but as posted, we also spent a lot of time doing some other things, which are important as well. The main advantage of current JIT incarnation is much much simpler than the first one. Even I can comprehend it, which is much of an improvement :-)

So, the prototype is working and gives very nice speedups in range of 20-30x over CPython. We're pretty confident this prototype will work and will produce fast python interpreter eventually. So we decided that now we'll work towards changing prototype into something stable and solid. This might sound easy, but in fact it's not. Having stable assembler backend and optimizations that keep semantics is not as easy as it might sound.

The current roadmap, as I see it, looks like as following:

  • Provide a JIT that does not speedup things, but produce assembler without optimizations turned on, that is correct and able to run CPython's library tests on a nightly basis.
  • Introduce simple optimizations, that should make above JIT a bit faster than CPython. With optimizations disabled JIT is producing incredibly dumb assembler, which is slower than correspoding C code, even with removal of interpretation overhead (which is not very surprising).
  • Backport optimizations from JIT prototype, one by one, keeping an eye on how they perform and making sure they don't break anything.
  • Create new optimizations, like speeding up attribute access.
  • Profit.

This way, we can hopefully provide a working JIT, which gives fast python interpreter, which is a bit harder than just a nice prototype.

Tell us what you think about this plan.

Cheers,
fijal & others.

Leysin Sprint Report

The Leysin sprint is nearing its end, as usual here is an attempt at a summary

of what we did.

Beautiful Leysin Landscape

Release Work

Large parts of the sprint were dedicated to fixing bugs. Since the easy bugs seem to have been fixed long ago, those were mostly very annoying and hard bugs. This work was supported by our buildbots, which we tried to get free of test-failures. This was worked on by nearly all participants of the sprint (Samuele, Armin, Anto, Niko, Anders, Christian, Carl Friedrich). One particularly annoying bug was the differences in the tracing events that PyPy produces (fixed by Anders, Samuele and Christian). Some details about larger tasks are in the sections below.

The work culminated in the beta released on Sunday.

Stackless

A large number of problems came from our stackless features, which do some advanced things and thus seem to contain advanced bugs. Samuele and Carl Friedrich spent some time fixing tasklet pickling and unpickling. This was achieved by supporting the (un)pickling of builtin code objects. In addition they fixed some bugs in the finalization of tasklets. This needs some care because the __del__ of a tasklet cannot run at arbitrary points in time, but only at safe points. This problem was a bit subtle to get right, and popped up nearly every morning of the sprint in form of a test failure.

Armin and Niko added a way to restrict the stack depth of the RPython-level stack. This can useful when using stackless, because if this is not there it is possible that you fill your whole heap with stack frames in the case of an infinite recursion. Then they went on to make stackless not segfault when threads are used at the same time, or if a callback from C library code is in progress. Instead you get a RuntimeError now, which is not good but better than a segfault.

Anto and Armin working on the JIT

Killing Features

During the sprint we discussed the fate of the LLVM and the JS backends. Both have not really been maintained for some time, and even partially untested (their tests were skipped). Also their usefulness appears to be limited. The JS backend is cool in principle, but has some serious limitations due to the fact that JavaScript is really a dynamic language, while RPython is rather static. This made it hard to use some features of JS from RPython, e.g. RPython does not support closures of any kind.

The LLVM backend had its own set of problems. For a long time it produced the fastest form of PyPy's Python interpreter, by first using the LLVM backend, applying the LLVM optimizations to the result, then using LLVM's C backend to produce C code, then apply GCC to the result :-). However, it is not clear that it is still useful to directly produce LLVM bitcode, since LLVM has rather good C frontends nowadays, with llvm-gcc and clang. It is likely that we will use LLVM in the future in our JIT (but that's another story, based on different code).

Therefore we decided to remove these two backends from SVN, which Samuele and Carl Friedrich did. They are not dead, only resting until somebody who is interested in maintaining them steps up.

Windows

One goal of the release is good Windows-support. Anders and Samuele set up a new windows buildbot which revealed a number of failures. Those were attacked by Anders, Samuele and Christian as well as by Amaury (who was not at the sprint, but thankfully did a lot of Windows work in the last months).

OS X

Christian with some help by Samuele tried to get translation working again under Mac OS X. This was a large mess, because of different behaviours of some POSIX functionality in Leopard. It is still possible to get the old behaviour back, but whether that was enabled or not depended on a number of factors such as which Python is used. Eventually they managed to successfully navigate that maze and produce something that almost works (there is still a problem remaining about OpenSSL).

Samuele and Carl Friedrich pretending to work on something

Documentation

The Friday of the sprint was declared to be a documentation day, where (nearly) no coding was allowed. This resulted in a newly structured and improved getting started document (done by Carl Friedrich, Samuele and some help of Niko) and a new document describing differences to CPython (Armin, Carl Friedrich) as well as various improvements to existing documents (everybody else). Armin undertook the Sisyphean task of listing all talks, paper and related stuff of the PyPy project.

Various Stuff

Java Backend Work

Niko and Anto worked on the JVM backend for a while. First they had to fix translation of the Python interpreter to Java. Then they tried to improve the performance of the Python interpreter when translated to Java. Mostly they did a lot of profiling to find performance bottlenecks. They managed to improve performance by 40% by overriding fillInStackTrace of the generated exception classes. Apart from that they found no simple-to-fix performance problems.

JIT Work

Armin gave a presentation about the current state of the JIT to the sprinters as well as Adrian Kuhn, Toon Verwaest and Camillo Bruni of the University of Bern who came to visit for one day. There was a bit of work on the JIT going on too; Armin and Anto tried to get closer to having a working JIT on top of the CLI.

Sunday, April 19, 2009

Beta for 1.1.0 released

Today we are releasing a beta of the upcoming PyPy 1.1 release. There are some Windows and OS X issues left that we would like to address between now and the final release but apart from this things should be working. We would appreciate feedback.

The PyPy development team.

PyPy 1.1: Compatibility & Consolidation

Welcome to the PyPy 1.1 release - the first release after the end of EU funding. This release focuses on making PyPy's Python interpreter more compatible with CPython (currently CPython 2.5) and on making the interpreter more stable and bug-free.

PyPy's Getting Started lives at:

http://codespeak.net/pypy/dist/pypy/doc/getting-started.html

Highlights of This Release

Other Changes

What is PyPy?

Technically, PyPy is both a Python interpreter implementation and an advanced compiler, or more precisely a framework for implementing dynamic languages and generating virtual machines for them.

The framework allows for alternative frontends and for alternative backends, currently C, Java and .NET. For our main target "C", we can "mix in" different garbage collectors and threading models, including micro-threads aka "Stackless". The inherent complexity that arises from this ambitious approach is mostly kept away from the Python interpreter implementation, our main frontend.

Socially, PyPy is a collaborative effort of many individuals working together in a distributed and sprint-driven way since 2003. PyPy would not have gotten as far as it has without the coding, feedback and general support from numerous people.

Have fun,

the PyPy release team, [in alphabetical order]

Amaury Forgeot d'Arc, Anders Hammerquist, Antonio Cuni, Armin Rigo, Carl Friedrich Bolz, Christian Tismer, Holger Krekel, Maciek Fijalkowski, Samuele Pedroni

and many others: http://codespeak.net/pypy/dist/pypy/doc/contributor.html

Wednesday, April 15, 2009

Leysin Sprint Started

The Leysin Sprint started today. The weather is great and the view is wonderful, as usual. Technically we are working on the remaining test failures of the nightly test runs and are generally trying to fix various long-postponed bugs. I will try to give more detailed reports as the sprint progresses.

Monday, April 6, 2009

Pycon videos are online

Hi.

We didn't yet write full pycon summary, but both of our talks are now online: PyPy status talk and python in a sandbox.

Update:

Slides are also available: PyPy status talk and Python in a sandbox.

Enjoy!
fijal & holger

Thursday, March 26, 2009

VM summit: nice to see friendly competition

So Google has launched the unladen swallow project with this first goal:

    Produce a version of Python at least 5x faster than CPython.

We discussed some details with Collin Winter, Jeffrey Yasskin and Thomas Wouters during the VM summit yesterday. We were a bit confused about usage of the term JIT, because as far as we understood, it's going to be upfront compilation into LLVM. In the past we have looked into LLVM – at one point PyPy extensively use it but it wasn't clear how we could make good use to it. They also consider changing to something else than LLVM. It's gonna be interesting to see how this works out.

It's good to see friendly competition, and we think that should take up the challenge and see if we can produce faster pickling, run 2to3 and Django faster than what they can come up with. We also talked to IronPython and Jython developers and all agreed that some common benchmarks would be good. And maybe do weekly press releases about small speed increases? :)

The idea of the VM summit here in Chicago was to bring together implementors of various virtual machine languages. There were members of the communities of IronPython, CPython, GemStone's MagLev, Rubinius, Mozilla's TraceMonkey, Parrot, Sun's Da Vinci Machine, Microsoft's DLR, Jython and JRuby. Everybody got to talk 5-10 minutes on their current status and challenges. It is clear that you cannot begin to cover the complexities and architectures of the involved projects. But that wasn't too much of a problem because the rest of the day everybody freely and dynamically grouped on their issues of choice. We established some more personal contacts, was great to chat with people like Andreas Gal from the University of California, Irvine, who have a very similar idea about the JIT that we have. Actually, we could probably haved mixed our two presentations and nobody would have actually noticed :-).

At the end of the presentation part, John Rose presented his slides. John is a Hotspot developer, and while not precisely a dynamic language implementor, he has a lot of experience in virtual machine implementation. It's very good to see the JVM being extended towards supporting dynamic-language specific things, in order to be something more than just a good platform for Java. We'll probably have some extra meetup with him the next days.

cheers,
holger and fijal

Thursday, March 12, 2009

PyPy talk at OpenBossa 09

Yesterday i gave my PyPy status/mobile perspectives at OpenBossa, Nokia's developer conference for embedded platforms in Brazil. Found it a bit of a tough task to do that in 50 minutes. I had some 50, later more developers attending the talk and was happy with the questions and the feedback. Guess it's a good sign if the number of people grows during a talk :) It was the first time i tried to work more with pictures and actually used some devianart photos from Marikaz to mark section transitions. I summarize/highlight some key points here in the post.

After intro and 2.5 compatibility status, i talked about our measurements of PyPy's Python on Nokia's N810 internet tablet. The best bit is that for almost all Python data structures PyPy has smaller memory representations than CPython. Particularly good are class instances which often score at 50% of CPython's sizes. Startup time is also often better and can be improved. On the bad side, PyPy's quite large base interpreter size and its bytecode execution is often worse. In the talk i also outline ideas for "perfect PYC files" for minimizing module import times and maximizing sharing across interpreter processes. I also briefly discussed the PyPy situation with extension modules and regarding C++ libs. Most of these ideas arose from sprint discussions last year. In the morning i also had some good talk with Stefan Seefeld about Boost Python and the new QT4 bindings. Maybe to use Boost Python is also a good opportunity - but PyPy does not currently have a C-level or C++ level API.

In subsequent lunch discussions people agreed that PyPy has three main interesting areas currently:

  • the Python Just-In-Time Compiler
  • a virtualized, sandboxed Python interpreter
  • an efficient Python interpreter for small devices

I think our upcoming 1.1 release will be a good point in time for many people to look some more into PyPy. I hope we are crossing the chasm soon. It's been a while since the project started :) Getting some more sponsoring to sustain and increase our current efforts probably wouldn't hurt.

Now i am off to spend my last day in Recife / Brazil, fly back to Germany in the evening and then spend time on preparing for Pycon 2009. And I guess i am going to enjoy some naturally cold air - at least my two jogging sessions at Brazillian beaches, at a sustained 30 degrees celsius, were tough. I guess i shouldn't complain, though :)

Was great meeting all the brazillian guys and the few women - just had breakfeast with Kate Alhola, kernel hacker and working on the new "Freemantle" graphical platform. Many thanks go to Marcio Marcedo and the Python team at INDT who invited me here. Hope to come again next year and eventually talk more about the Zone VM :)

If you are interested in some more not so pypy-specific bits about the conference and what i experienced, you might head over to my tetamap blog.

holger

Tuesday, March 10, 2009

Good news everyone!

A quick update from the JIT front. As of yesterday, we're now able to translate a highly-experimental Python interpreter that contains JIT. It mostly crashes immediately, mostly due to some unsupported operations in the assembler backend, but for a carefully crafted program, we're able to get massive speedups. For something as complex as:

  i = 0
  while i < 10000000:
   i = i + 1

our JIT is about 20x faster than CPython. That's still about 3x slower than Psyco, but looking at assembler code it's obvious that we can speed it up a lot. These are very good news, since we don't encode python semantics at all in the JIT. The JIT is automatically generated from the Python interpreter source code. This means we should be able to expand it to handle more complex python programs relatively quickly (interested assembler experts needed!).

This is actually the fifth incarnation of JIT that happened over the last two years. It's by far simpler and more promising than any of the previous approaches. Expect more details soon!

Cheers,
fijal

Thursday, March 5, 2009

JIT - a bit of look inside

The previous post about our JIT explained a bit from the 1000 km perspective how the tracing JIT would approach a language like Python.

I would like to step a bit inside and give a zoom to some of its features that are already working. While probably not the most innovative, I think it's very nice to look at the way we work with the JIT and what tools we use.

The main cool thing is that you can work on and try the JIT (including trying it on the Python interpreter!) without even generating a single bit of assembler. How? Let's start with something very simple. Let's take a simple interpreter for language X.

Language X has 3 opcodes: CO_INCREASE, CO_DECREASE and CO_JUMP_BACK_3. CO_INCREASE increase the accumulator by one, CO_DECREASE decrease it by one, CO_JUMP_BACK_3 jump 3 opcodes back, if the accumulator is smaller than 100 (this is only to maintain some halting conditions possible). The interpreter for language X looks like this::

    jitdriver = JitDriver(greens = ['i'], reds = ['res', 'a'])
    code = [CO_INCREASE, CO_INCREASE, CO_INCREASE,
            CO_JUMP_BACK_3, CO_INCREASE, CO_DECREASE]
            
    def add(res, a):
        return res + a

    def sub(res, a):
        return res - a

    def main_interpreter_loop(a):
        i = 0
        res = 0
        c = len(code)
        while i < c:
            jitdriver.jit_merge_point(res=res, i=i, a=a)
            elem = code[i]
            if elem == CO_INCREASE:
                res = add(res, a)
            elif elem == CO_DECREASE:
                res = sub(res, a)
            else:
                if res > 100:
                    pass
                else:
                    i = i - 3
                    jitdriver.can_enter_jit(res=res, i=i, a=a)
                    continue
            i = i + 1
        return res

All very simple code, expect the jitdriver hints, which instruct JIT how to behave (they are the equivalent of the ``add_to_position_key`` of last the blog post).

Let's look how this code is processed. This will also give a glance at how we work in this code. This particular piece can be found on a branch in pypy/jit/metainterp/test/test_loop.py and can be run with ./test_all.py jit/metainterp/test/test_loop.py -k test_example -s --view from pypy directory. The -s option lets you see the debugging output, while --view will show you some graphs. So, let's look at graphs in order:

And the same picture with a bit of zoom for the first block:

This is the call graph of an interpreter loop, nothing magic so far. This is an intermediate representation of translation toolchain input. If you look around you can follow how the opcodes are dispatched (with a chain of ifs) and helpers called. Next graph is very boring, because it's a bit lower level representation of the same thing (you exit with q or escape btw :).

When we exit the graph viewer, we can see the trace generated by interpreting this graph with a given bytecode (variable code in paste above). It's something like:


        [compiler] ENTER
        [runner:cpu]    call__4 [(''), * GCREF hidden, 0] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [9, 1] -> 10
        [runner:cpu]    int_add [0, 1] -> 1
        [runner:cpu]    int_lt [1, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 1] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [10, 1] -> 11
        [runner:cpu]    int_add [1, 1] -> 2
        [runner:cpu]    int_lt [2, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 2] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [11, 1] -> 12
        [runner:cpu]    int_add [2, 1] -> 3
        [runner:cpu]    int_lt [3, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 3] -> 1
        [runner:cpu]    int_eq [1, 0] -> False
        [runner:cpu]    int_eq [1, 2] -> False
        [runner:cpu]    int_gt [12, 100] -> False
        [runner:cpu]    int_sub [3, 3] -> 0
        [compiler] LEAVE

It's entering JIT, doing some primitive operations for bytecode dispatching and repeating the loop. Note that at the end of the interpreted loop (not to be confused with the interpreter loop), we see int_sub [3, 3] which resets the bytecode position to the beginning. At this time JIT (instructed by can_enter_jit hint) notices that all green variables are the same (here only i), hence we can compile the efficient loop from this point.

The loop contains 3 additions and a check (for i < 100), exactly the same as our interpreted program would do, but completely without interpretation overhead!

As you might have noticed, there is no assembler involved so far. All of this instruction execution is done directly, in pure python. In fact, the code for executing instructions is located in jit/backend/llgraph which directly interprets instructions. This is by far simpler (and easier to debug) than x86 assembler.

And this is basically it: the very simple interpreter and a jit for it. Of course we actually can generate assembler for that. Also the missing piece is optimizing the generated graphs. While for this example, by removing the interpretetation overhead, we're done, with more complex examples it's important to further optimize traces. Hopefully this and how we actually generate assembler will be topics for next blog posts.

Cheers,
fijal

Wednesday, March 4, 2009

PyPy on Mobiles, at OpenBossa

Next week i am going to give a talk on PyPy at OpenBossa, a developer conference on embedded platforms. I've written up a bit more of my background and why i find it very interesting to go there on my blog. Probably will mostly follow up there or on twitter and not much here on the PyPy blog because it's not all about PyPy. To summarize how i see it: i think there is great potential for Python and PyPy on mobiles and am thrilled to hear about what's going on currently and to discuss opportunities.

cheers, holger

Monday, March 2, 2009

Applying a Tracing JIT to an Interpreter

After I had failed once more to explain to someone on IRC what the idea behind the current JIT generator work of PyPy, I decided to just write a blog post to explain it. Here it is :-). The post turned out to be a bit long, so please bear with me.

The goal of the post is to give an understanding of how PyPy's JIT generator is going to work. To do this, I will look at what happens when you write an interpreter in Java and apply a completely normal tracing JIT to it (for this reason all the code examples will be in some sort of pseudo-Java). The resulting generated machine code is bad, so I will explain a way to fix the occurring problem.

The techniques I describe here are conceptually similar to what we are doing in PyPy. The details (as usual) are different. The reasons why I am trying to explain things in this way is that I can start from tracing JITs, which are a known existing technique.

To understand the following, it is helpful to already know a bit how a normal tracing JIT works. I will give a reminder of how it is working, but there also exist a couple of more thorough introductions on the web already. I also will leave out a lot of details about the more detailed workings of tracing JITs and only explain the things that are relevant to what I am trying to get to here.

Tracing JITs

Tracing JITs are an idea explored by the Dynamo project in the context of dynamic optimization of machine code at runtime. The techniques were then successfully applied to Java VMs and are now being used by Mozilla's TraceMonkey JavaScript VM. They are built on some basic assumptions:

  • programs spend most of their runtime in loops
  • several iterations of the same loop are likely to take similar code paths
  • the best way to gain information about the behaviour of a program is to observe it

The basic approach of a tracing JIT is to only generate machine code for commonly executed loops and to interpret the rest of the program. The code for those common loops however should be highly optimized, including aggressive inlining.

The generation of loops works as follows: At first, everything is interpreted. The interpreter does a bit of lightweight profiling to figure out which loops are run often. When a common loop is identified, the interpreter enters a special mode (called tracing mode). When in tracing mode, the interpreter records a history (the trace) of all the operations it executes, in addition to actually performing the operations. During tracing, the trace is repeatedly checked whether the interpreter is at a position in the program that it had seen earlier in the trace. If this happens, the trace recorded corresponds to a loop in the program that the tracing interpreter is running. At this point, this loop is turned into machine code by taking the trace and making machine code versions of all the operations in it.

This process assumes that the path through the loop that was traced is a "typical" example of possible paths (which is statistically likely). Of course it is possible that later another path through the loop is taken, therefore the machine code will contain guards, which check that the path is still the same. If during execution of the machine code a guard fails, the machine code is left and execution falls back to using interpretation (there are more complex mechanisms in place to still produce more code for the cases of guard failures, but they are of no importance for this post).

It is important to understand when the tracer considers a loop in the trace to be closed. This happens when the position key is the same as at an earlier point. The position key describes the position of the execution of the program, e.g. usually contains things like the function currently being executed and the program counter position of the tracing interpreter.

Let's look at a small example. Take the following code:

int sum_1_to_n(int n) {
    int result = 0;
    while (n >= 0) {
        result += n;
        n -= 1;
    }
    return result;
}

The tracing JIT will at one point trace the execution of the while loop in sum_1_to_n. The trace might look as follows:

guard_true(n >= 0);
result += n;
n -= 1;
<loop_back>

This trace will then be turned into machine code. Note that the machine code loop is by itself infinite and can only be left via a guard failure.

A slightly more complex example:

int f(int a, int b) {
    if (b % 46 == 41)
        return a - b;
    else
        return a + b;
}

int strange_sum(int n) {
    int result = 0;
    while (n >= 0) {
        result = f(result, n);
        n -= 1;
    }
    return result;
}

The trace of the loop in strange_sum would maybe look like this:

guard_true(n >= 0);
a = result;
b = n;
guard_false(b % 46 == 41);
result = a + b;
n -= 1;
<loop_back>

This would then be turned into machine code. Note how f was inlined into the loop and how the common else case was turned into machine code, while the other one is implemented via a guard failure.

Applying a Tracing JIT to an Interpreter

In the rest of the post we will explore what happens when the program that is being executed/compiled by the tracing JIT is itself a (bytecode) interpreter for another language.

A stylized bytecode interpreter for a simple programming language could look as follows:

W_Object interpret(String bytecode, ...) {
    Stack<W_Object> stack = new Stack<W_Object>();
    int pc = 0;
    while (true) { // bytecode dispatch loop
        char instruction = bytecode.charAt(pc);
        pc += 1;
        switch (instruction) {
            case ADD:
                W_Object arg2 = stack.pop();
                W_Object arg1 = stack.pop();
                stack.push(do_addition(arg1, arg2));
                break;
            case SUB:
                W_Object arg2 = stack.pop();
                W_Object arg1 = stack.pop();
                stack.push(do_substraction(arg1, arg2));
                break;
            case RETURN:
                return stack.pop();
            case JUMP_BACKWARD:
                pc -= (int)bytecode.charAt(pc);
                break;
            case LOAD_INTEGER:
                int value = (int)bytecode.charAt(pc);
                pc += 1;
                stack.push(new W_Integer(value));
                break;
            case PRINT:
                do_print(stack.pop());
                break;
            case DUP:
                stack.push(stack.peek());
                break;
            case JUMP_IF_TRUE:
                ...
            ...
        }
    }

If we apply a tracing JIT to this function, it will trace and compile the execution of one bytecode, because after one bytecode the bytecode dispatch loop is closed. E.g. it might trace and produce machine code for the execution of a SUB. (Sidenote: this interpret function is an example where one of the assumptions of a tracing JIT break down: two iterations of the bytecode dispatch loop are rarely going to follow the same code path, because usually two consecutive bytecodes encode different instructions).

The important bit to remember here is that the tracing JIT will produce a machine code loop that corresponds to the bytecode dispatch loop in the interpret function. Let's see how we can change that.

Improving the Generated Code

If we want to make use of the fact that the program that is being jitted is itself an interpreter, we need to change the tracing JIT a bit. To be more precise we add a way for the user of the tracing JIT to add information to the position key that the tracing JIT uses to decide when a loop is closed. This is done by a call to a magic function add_to_position_key. This allows the program writer to influence the tracing JIT's behaviour.

The semantics of add_to_position_key is as follows: The method itself does not do anything. It has an effect only when it is seen during tracing. If it is seen during tracing, the tracer adds the argument of the call to the position key that the tracer is using to find out whether a loop was closed or not.

In the example of the interpret function above, we would add a call to this function into the while loop as follows:

W_Object interpret(String bytecode, ...) {
    Stack stack = new Stack();
    int pc = 0;
    while (true) { // bytecode dispatch loop
        add_to_position_key(pc);
        add_to_position_key(bytecode);
        char instruction = bytecode.charAt(pc);
        pc += 1;
        switch (instruction) {
            case ADD:
    ...

When the modified tracing JIT traces now the interpret function executing a SUB, something interesting happens. When the bytecode loop is closed, the modified tracing JIT does not consider the trace to be a loop, because the value of pc has been increased by one, so the position key differs. Instead it continues to trace, effectively unrolling the bytecode dispatch loop of interpret.

The only way for a loop to be considered closed is if the pc variable has the same value a second time. This can only happen after a JUMP_BACKWARD instruction has been executed. A JUMP_BACKWARD instruction will only be in the bytecode when the bytecode represents a loop. This means that the modified tracing JIT will trace the interpret function and will only consider that the trace represents a loop when the bytecode itself represents a loop! Thus, a machine code loop will eventually be created that corresponds to the loop in the bytecode.

Let's look at at example. If we have a bytecode that corresponds to the following instructions:

pc |   instruction
---+---------------------
0  |  LOAD_INTEGER 0
2  |  DUP
3  |  PRINT
4  |  LOAD_INTEGER 1
6  |  ADD
7  |  JUMP_BACKWARD 6

This loop will print integers starting from 0 and going on from there. The modified tracing JIT will unroll the bytecode dispatch until it sees the JUMP_BACKWARD bytecode. After that bytecode the pc will be 2 again. Thus the earlier position key is repeated, which means that the loop will be closed. The produced machine code will do the equivalent of the following Java code:

...
guard_true(pc == 2)
guard_true(bytecode == "... correct bytecode string ...")
while (true) {
    instruction = bytecode.charAt(pc);
    pc += 1;
    guard_true(instruction == DUP);
    stack.push(stack.peek());

    instruction = bytecode.charAt(pc);
    pc += 1;
    guard_true(instruction == PRINT);
    do_print(stack.pop());

    instruction = bytecode.charAt(pc);
    pc += 1;
    guard_true(instruction == LOAD_INTEGER)
    value = (int)bytecode.charAt(pc);
    pc += 1
    stack.push(W_Integer(value))

    instruction = bytecode.charAt(pc);
    pc += 1;
    guard_true(instruction == ADD)
    arg2 = stack.pop()
    arg1 = stack.pop()
    stack.push(do_addition(arg1, arg2))

    instruction = bytecode.charAt(pc);
    pc += 1;
    guard_true(instruction == JUMP_BACKWARD)
    pc -= (int)bytecode.charAt(pc);
}

This is machine code that essentially does what the bytecode above did. Of course the code still remains some remnants of the interpreter (like the program counter manipulations, the stack handling, etc), which would have to be removed by some clever enough optimization step. If this were done, result would look a lot more natural.

Summary

If a tracing JIT is enhanced by a way to influence its loop-closing behaviour we can significantly improve its performance when the jitted program is itself an interpreter. The result is that in such a case the produced machine code will correspond to the functions that are being interpreted, not to the code of the interpreter itself.

Now, what does all this have to do with PyPy? What we are working on since a while is a sort of tracing JIT for RPython which allows to be customized with a function very similar to the add_to_position_key described above. This will make it possible to make the tracing JIT generate code that corresponds to the code that the interpreter interprets. For example, we would add a call to add_to_position_key to SPy, PyPy's Smalltalk VM. Then the tracing JIT will produce machine code for Smalltalk-level loops, with all the usual benefits of a tracing JIT (like inlining of intermediate methods, constant-folding, ...). This JIT differs from normal tracing JITs in that it also supports very powerful constant-folding and allocation-removal optimizations. Those optimizations will (hopefully) be the content of a later blog post.

The basics of this process have been working fine since quite a while. What the work currently focuses on is to improve the optimizers to remove not only the bytecode manipulation code, but also the stack handling, and a large number of other inefficiencies.

The next Leysin Winter Sprint

PyPy Leysin Winter Sprint (14-21th April 2009)

The next PyPy sprint will be in Leysin, Switzerland, for the sixth time. This sprint will take place immediately after Easter. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

  • The overall idea of the sprint is to continue working on making PyPy ready for general use. There are a few tasks left in there. In parallel, we will continue the work on the JIT, if there is general interest. And as usual, we are ready to add any other task -- please mention on the mailing list what you would like to work on; the list of task is not really fixed.
  • And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski until Sunday, the 19th; afterwards, the installations close. (There was quite a lot of snow this winter, so there should be some left even though it's relatively late in the season.)

For more information see the announcement.

Friday, February 13, 2009

Wroclaw 2009 sprint progress report

Hello.

We have just finished probably the smallest sprint ever in PyPy history. For most of the time it was just me and Armin pairing together.

We also had a chance to work a bit with people from the University, but there were definitely not enough core developers to organize the work in a reasonable manner. At some point we ended up having two pairs containing four people each.

Jakub and Bartosz (who were our gentle hosts) worked on getting PyPy's sandbox integrated with django. It's still just an example what you can do (ie you can do much more), but it's already interesting to look at. The code can be found in user dir. This server (not yet online anywhere, sorry) is able to run untrusted python code provided by user inside a fully configurable sandbox.

We also implemented missing peepholer optimizations from CPython, finding out that some peepholer tests were failing, just because PyPy is optimizing better :-)

The main part of the sprint was work on JIT (most notable the fifth generation of the JIT), which was moved from the obscure directory in Carl's user in svn (which contains branches these days!) into a PyPy branch. It's still very much work in progress and a lot of pen and paper or handwaving was involved, but we were able to implement a lot of basics in record time.

Right now we need a lot of rest after the exhaustive sprint, but after that, stay tuned for more information about progressing JIT!

Cheers,
fijal

Wednesday, January 14, 2009

Wroclaw 2009 PyPy sprint and talk

The next PyPy sprint will be held in Wrocław, Poland 7-14th February 2009. This is fully public sprint and all newcomers are welcomed. Preceeding the sprint there will be a talk at University of Technology in Wrocław held at 22nd of January.

For detailed info about the sprint, look here.

The talk will be a general, high-level overview about PyPy project. There is a very nice poster, made by Jakub Gustak and Bartosz Skowron (in polish):

Talk details:
  • Location: Politechnika Wrocławska, budynek C-13, sala 0.31
  • Date: 22nd January 2009, 19:00
  • Language: very likely polish, although talk can be as well in english if some non-polish native would show up.
Cheers,
fijal