Thursday, May 5, 2011

NumPy Follow up

Hi everyone. Since yesterday's blog post we've gotten a ton of feedback, so we want to clarify a few things, as well as share some of the progress we've made in just the 24 hours since the post.

Reusing the original NumPy

First, a lot of people have asked why we cannot just reuse the original NumPy through cpyext, our CPython C-API compatibility layer. We believe this is not the best approach, for a few reasons:

  1. cpyext is slow, and always will be slow. It has to emulate far too many details of the CPython object model that don't exist on PyPy (e.g., reference counting). Since people are using NumPy primarily for speed this would mean that even if we could have a working NumPy, no one would want to use it. Also, as soon as the execution crosses the cpyext boundary, it becomes invisible to the JIT, which means the JIT has to assume the worst and deoptimize stuff away.
  2. NumPy uses many obscure documented and undocumented details of the CPython C-API. Emulating these is often difficult or impossible (e.g. we can't fix accessing a struct field, as there's no function call for us to intercept).
  3. It's not much fun. Frankly, working on cpyext, debugging the crashes, and everything else that goes with it is not terribly fun, especially when you know that the end result will be slow. We've demonstrated we can build a much faster NumPy, in a way that's more fun, and given that the people working on this are volunteers, it's important to keep us motivated.

Finally, we are not proposing to rewrite the entirety of NumPy or, god forbid, BLAS, or any of the low-level stuff that operates on C-level arrays, only the parts that interface with Python code directly.

C bindings vs. CPython C-API

There are two distinct stories for C code; one is very nice, and the other not so much. The first is the case of arbitrary C code that isn't Python-related: things like libsqlite, libbz2, or any random C shared library on your system. PyPy will quite happily call into these, and bindings can be developed either at the RPython level (using rffi) or in pure Python, using ctypes. Writing bindings with ctypes has the advantage that they can run on every alternative Python implementation, such as Jython and IronPython. Moreover, once we merge the jittypes2 branch, ctypes calls will even be smoking fast.
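As a small illustration of the ctypes approach (using the standard C math library here; any shared library works the same way, and the fallback filename is an assumption for Linux):

```python
import ctypes
import ctypes.util

# Locate and load the C math library; the exact filename is
# platform-dependent, which is why we go through find_library.
libname = ctypes.util.find_library("m") or "libm.so.6"  # fallback for bare Linux
libm = ctypes.CDLL(libname)

# Declaring argument and result types keeps ctypes from guessing.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0
```

Because this uses no CPython C-API at all, the same script runs unmodified on CPython and PyPy.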

On the other hand there is the CPython C-extension API. This is a very specific API which CPython exposes, and PyPy tries to emulate. It will never be fast, because there is far too much overhead in all the emulation that needs to be done.

One of the reasons people write C extensions is for speed. Often, with PyPy you can just forget about C, write everything in pure Python, and let the JIT do its magic.

If the PyPy JIT alone isn't fast enough, or you just want to use existing C code, then it might make sense to split your C extension into two parts: one which doesn't touch the CPython C-API and thus can be loaded with ctypes and called from PyPy, and another which does the interfacing with Python for CPython (where it will be faster).

There are also libraries written in C to interface with existing C codebases where performance is not the largest goal. For these, the right solution is to try CPyExt; if it works, that's great, but if it fails, the solution is to rewrite using ctypes, which works on all Python VMs, not just CPython.

And finally there are rare cases where rewriting in RPython makes more sense; NumPy is one of the few examples, because we need to be able to give the JIT hints on how to appropriately vectorize all of the operations on an array. In general, writing in RPython is not necessary for almost any library. NumPy is something of a special case: it is so ubiquitous that every ounce of speed is valuable, and the way people use it leads to code structures where the JIT benefits enormously from extra hints and from the ability to manipulate memory directly, which is not possible from Python.

Progress

On a more positive note, after we published the last post, several new people came and contributed improvements to the numpy-exp branch. We would like to thank all of them:

  • nightless_night contributed: An implementation of __len__, fixed bounds checks on __getitem__ and __setitem__.
  • brentp contributed: Subtraction and division on NumPy arrays.
  • MostAwesomeDude contributed: Multiplication on NumPy arrays.
  • hodgestar contributed: Binary operations between floats and NumPy arrays.

Those last two were technically an outstanding branch we finally merged, but hopefully you get the picture. In addition there was some exciting work done by regular PyPy contributors. I hope it's clear that there's a place to jump in for people with any level of PyPy familiarity. If you're interested in contributing please stop by #pypy on irc.freenode.net, the pypy-dev mailing list, or send us pull requests on bitbucket.

Alex

Wednesday, May 4, 2011

Numpy in PyPy - status and roadmap

Hello.

NumPy integration is one of the single most requested features for PyPy. This post tries to describe where we are, what we plan (or what we don't plan), and how you can help.

Short version for the impatient: we are doing experiments, which show that PyPy+numpy can be faster and better than CPython+numpy. We have a plan on how to move forward, but at the moment we lack the dedicated people or money to tackle it.

The slightly longer version

Integrating numpy in PyPy has been my pet project on an on-and-off (mostly off) basis over the past two years. There were some experiments, then a long pause, and then some more experiments which are documented below.

The general idea is not to use the existing CPython module, but to reimplement numpy in RPython (i.e. the language PyPy is implemented in), thus letting our JIT achieve extra speedups. The really cool thing about this part is that numpy will automatically benefit from any general JIT improvements, without any extra tweaking.

At the moment, there is a branch called numpy-exp which contains a translatable, very minimal version of numpy in the module called micronumpy. Example benchmarks show the following:

                                  add             iterate
  CPython 2.6.5 with numpy 1.3.0  0.260s (1x)     4.2s (1x)
  PyPy numpy-exp @ 3a9d77b789e1   0.120s (2.2x)   0.087s (48x)

The add benchmark spends most of the time inside the + operator on arrays (doing a + a + a + a + a), which in CPython is implemented in C. As you can see from the table above, the PyPy version is already ~2 times faster. (numexpr is still faster than PyPy, but we're working on it.)

The exact way array addition is implemented is worth another blog post, but in short it lazily evaluates the expression and computes it at the end, avoiding intermediate results. This approach scales much better than numexpr and can lead to speeding up all the operations that you can perform on matrices.
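The lazy-evaluation idea can be sketched in plain Python (a toy illustration only; micronumpy's actual classes and names differ):

```python
class LazyArray(object):
    """Toy sketch of lazily evaluated array expressions."""
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Adding builds an expression node instead of a result array.
        return LazyAdd(self, other)

    def eval(self, i):
        return self.data[i]

    def force(self):
        # One fused loop over the whole expression tree: no intermediate
        # arrays are ever allocated for a + a + a.
        return [self.eval(i) for i in range(len(self.data))]

class LazyAdd(LazyArray):
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.data = left.data  # only used for the length

    def eval(self, i):
        return self.left.eval(i) + self.right.eval(i)

a = LazyArray([1.0, 2.0, 3.0])
print((a + a + a).force())  # [3.0, 6.0, 9.0]
```

The whole expression is computed element by element in a single pass when force() runs, which is what lets this approach scale better than allocating a temporary array per operation.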

The next obvious step to get even more speedups would be to extend the JIT to use SSE operations on x86 CPUs, which should speed it up by roughly an additional 2x, as well as using multiple threads to perform operations.

iterate is also interesting, but for entirely different reasons. On CPython it spends most of the time inside a Python loop; the PyPy version is ~48 times faster, because the JIT can optimize across the python/numpy boundary. This shows the potential of the approach: users are not grossly penalized for writing their loops in Python.

The drawback of this approach is that we need to reimplement numpy in RPython, which takes time. A very rough estimate is that it would be possible to implement a useful subset of it (for some definition of useful) within one to three man-months.

It also seems that the result will be faster in most cases, and the same speed as the original numpy in the others. The only problem is finding dedicated people willing to spend quite some time on this; however, I am willing to both mentor such a person and encourage him or her.

A good starting point for helping would be to look at what's already implemented in the micronumpy module and try extending it. Adding a - operator or adding integers would be an interesting start. Drop by on #pypy on irc.freenode.net or get in contact with developers via some other channel (such as the pypy-dev mailing list) if you want to help.

Another option would be to sponsor NumPy development. In case you're interested, please get in touch with us or leave your email in comments.

Cheers,
fijal

Saturday, April 30, 2011

PyPy 1.5 Released: Catching Up

We're pleased to announce the 1.5 release of PyPy. This release updates PyPy with the features of CPython 2.7.1, including the standard library. Thus all the features of CPython 2.6 and CPython 2.7 are now supported. It also contains additional performance improvements. You can download it here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.1. It's fast (pypy 1.5 and cpython 2.6.2 performance comparison) due to its integrated tracing JIT compiler.

This release includes the features of CPython 2.6 and 2.7. It also includes a large number of small improvements to the tracing JIT compiler. It supports Intel machines running Linux 32/64 or Mac OS X. Windows is beta (it roughly works but a lot of small issues have not been fixed so far). Windows 64 is not yet supported.

Numerous speed achievements are described on our blog. Normalized speed charts comparing pypy 1.5 and pypy 1.4 as well as pypy 1.5 and cpython 2.6.2 are available on our benchmark website. The speed improvement over 1.4 seems to be around 25% on average.

More highlights

  • The largest change in PyPy's tracing JIT is adding support for loop invariant code motion, which was mostly done by Håkan Ardö. This feature improves the performance of tight loops doing numerical calculations.
  • The CPython extension module API has been improved and now supports many more extensions. For information on which ones are supported, please refer to our compatibility wiki.
  • These changes make it possible to support Tkinter and IDLE.
  • The cProfile profiler is now working with the JIT. However, it skews the performance in unstudied ways. Therefore it is not yet usable to analyze subtle performance problems (the same is true for CPython of course).
  • There is an external fork which includes an RPython version of the postgresql bindings. However, there are no prebuilt binaries for this.
  • Our developer documentation was moved to Sphinx and cleaned up.
  • and many small things :-)
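The loop-invariant code motion mentioned above can be illustrated conceptually in plain Python (the JIT performs this on recorded traces, not on Python source):

```python
# Before: scale * 2 is recomputed on every iteration, even though it
# never changes inside the loop.
def scaled_sum(data, scale):
    total = 0
    for x in data:
        total += x * (scale * 2)
    return total

# After hoisting: the invariant computation moves out of the loop,
# which is effectively what the optimized trace does.
def scaled_sum_hoisted(data, scale):
    factor = scale * 2
    total = 0
    for x in data:
        total += x * factor
    return total

print(scaled_sum([1, 2, 3], 5) == scaled_sum_hoisted([1, 2, 3], 5))  # True
```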

Cheers,

Carl Friedrich Bolz, Laura Creighton, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Alex Gaynor, Armin Rigo and the PyPy team

Wednesday, April 20, 2011

Using Tkinter and IDLE with PyPy

We are pleased to announce that Tkinter, the GUI library based on TCL/TK, now works with PyPy.
Tkinter is composed of two parts:
  • _tkinter, a module written in C which interfaces with the TCL world
  • Tkinter, a pure Python package which wraps _tkinter to expose the pythonic API we are used to
The PyPy version of _tkinter reuses the C code found in CPython and compiles it through the PyPy C-API compatibility layer, cpyext. To make it work with PyPy, we had to modify it slightly, in order to remove the dependency on some API functions which are not supported by PyPy. In particular, we removed the dependency on the PyOS_InputHook variable, which allows a nice integration of Tkinter and the Python interactive prompt: the result is that, unlike CPython, in PyPy Tk windows created at the interactive prompt are not shown until we manually call the mainloop method. Apart from this inconvenience, all the rest works fine.
At the moment, _tkinter is not distributed with PyPy because our build system does not support automatic compilation of C extensions. Instead, it is necessary to install it manually, either directly from source or by easy_installing/pip installing tkinter-pypy from PyPI.
For everything to work correctly, you need a recent build of PyPy: the following is a step-by-step guide to install _tkinter in a PyPy nightly build for Linux 64 bit; for other architectures, look at the nightly build page:
$ wget http://buildbot.pypy.org/nightly/trunk/pypy-c-jit-43485-1615dfd7d8f1-linux64.tar.bz2

$ tar xfv pypy-c-jit-43485-1615dfd7d8f1-linux64.tar.bz2

$ cd pypy-c-jit-43485-1615dfd7d8f1-linux64/

$ wget http://peak.telecommunity.com/dist/ez_setup.py

$ ./bin/pypy ez_setup.py    # install setuptools

$ ./bin/easy_install tkinter-pypy
Once you complete the steps above, you can start using Tkinter from your python programs. In particular, you can use IDLE, the IDE which is part of the Python standard library. To start IDLE, type:
$ ./bin/pypy -m idlelib.idle
Have fun :-)

Wednesday, April 6, 2011

Tutorial Part 2: Adding a JIT

This is the second part of a tutorial written by Andrew Brown. The first part described how to write an interpreter with PyPy.

Adding JIT

Translating RPython to C is pretty cool, but one of the best features of PyPy is its ability to generate just-in-time compilers for your interpreter. That's right, from just a couple hints on how your interpreter is structured, PyPy will generate and include a JIT compiler that will, at runtime, translate the interpreted code of our BF language to machine code!

So what do we need to tell PyPy to make this happen? First it needs to know where the start of your bytecode evaluation loop is. This lets it keep track of instructions being executed in the target language (BF).

We also need to let it know what defines a particular execution frame. Since our language doesn't really have stack frames, this boils down to what's constant for the execution of a particular instruction, and what's not. These are called "green" and "red" variables, respectively.

Refer back to example2.py for the following.

In our main loop, there are four variables used: pc, program, bracket_map, and tape. Of those, pc, program, and bracket_map are all green variables: they define the execution of a particular instruction. If the JIT sees the same combination of green variables as before, it knows it has skipped back and must be executing a loop. The variable "tape" is our red variable; it's what's being manipulated by the execution.

So let's tell PyPy this info. Start by importing the JitDriver class and making an instance:

from pypy.rlib.jit import JitDriver
jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'],
        reds=['tape'])

And we add this line to the very top of the while loop in the mainloop function:

jitdriver.jit_merge_point(pc=pc, tape=tape, program=program,
        bracket_map=bracket_map)

We also need to define a JitPolicy. We're not doing anything fancy, so this is all we need somewhere in the file:

def jitpolicy(driver):
    from pypy.jit.codewriter.policy import JitPolicy
    return JitPolicy()

See this example at example3.py

Now try translating again, but with the flag --opt=jit:

$ python ./pypy/pypy/translator/goal/translate.py --opt=jit example3.py

It will take significantly longer to translate with JIT enabled, almost 8 minutes on my machine, and the resulting binary will be much larger. When it's done, try having it run the mandelbrot program again. A world of difference: 12 seconds now, compared to 45 seconds before!

Interestingly enough, you can see when the JIT compiler switches from interpreted to machine code with the mandelbrot example. The first few lines of output come out pretty fast, and then the program gets a boost of speed and gets even faster.

A bit about Tracing JIT Compilers

It's worth it at this point to read up on how tracing JIT compilers work. Here's a brief explanation: the interpreter usually runs your interpreter code as written. When it detects that a loop of code in the target language (BF) is executed often, that loop is considered "hot" and is marked to be traced. The next time that loop is entered, the interpreter is put in tracing mode, where every executed instruction is logged.

When the loop is finished, tracing stops. The trace of the loop is sent to an optimizer, and then to an assembler which outputs machine code. That machine code is then used for subsequent loop iterations.

This machine code is often optimized for the most common case, and depends on several assumptions about the code. Therefore, the machine code will contain guards, to validate those assumptions. If a guard check fails, the runtime falls back to regular interpreted mode.
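In Python terms, a guard behaves roughly like this (a conceptual sketch, not PyPy's actual machinery; the names are illustrative):

```python
def guarded_add(x, y, fall_back):
    # The trace was recorded while x and y were ints, so the compiled
    # code guards on that assumption before running the fast path.
    if not (isinstance(x, int) and isinstance(y, int)):
        return fall_back(x, y)  # guard failed: back to the interpreter
    return x + y                # fast path compiled from the trace

interp = lambda a, b: a + b     # stand-in for regular interpretation
print(guarded_add(2, 3, interp))       # 5
print(guarded_add("a", "b", interp))   # ab
```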

A good place to start for more information is http://en.wikipedia.org/wiki/Just-in-time_compilation

Debugging and Trace Logs

Can we do any better? How can we see what the JIT is doing? Let's do two things.

First, let's add a get_printable_location function, which is used during debug trace logging:

def get_location(pc, program, bracket_map):
    return "%s_%s_%s" % (
            program[:pc], program[pc], program[pc+1:]
            )
jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'], reds=['tape'],
        get_printable_location=get_location)

This function is passed in the green variables, and should return a string. Here, we're printing out the BF code, surrounding the currently executing instruction with underscores so we can see where it is.
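For instance, reproducing the function above and applying it to a tiny three-instruction program:

```python
def get_location(pc, program, bracket_map):
    return "%s_%s_%s" % (program[:pc], program[pc], program[pc+1:])

# The instruction currently at pc is wrapped in underscores:
print(get_location(1, ">+<", {}))  # >_+_<
```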

Download this as example4.py and translate it the same as example3.py.

Now let's run a test program (test.b, which just prints the letter "A" 15 or so times in a loop) with trace logging:

$ PYPYLOG=jit-log-opt:logfile ./example4-c test.b

Now take a look at the file "logfile". This file is quite hard to read, so here's my best shot at explaining it.

The file contains a log of every trace that was performed, and is essentially a glimpse at what instructions it's compiling to machine code for you. It's useful to see if there are unnecessary instructions or room for optimization.

Each trace starts with a line that looks like this:

[3c091099e7a4a7] {jit-log-opt-loop

and ends with a line like this:

[3c091099eae17d jit-log-opt-loop}

The next line tells you which loop number it is, and how many ops are in it. In my case, the first trace looks like this:

   1  [3c167c92b9118f] {jit-log-opt-loop
   2  # Loop 0 : loop with 26 ops
   3  [p0, p1, i2, i3]
   4  debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
   5  debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0)
   6  i4 = getarrayitem_gc(p1, i2, descr=<SignedArrayDescr>)
   7  i6 = int_add(i4, 1)
   8  setarrayitem_gc(p1, i2, i6, descr=<SignedArrayDescr>)
   9  debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0)
  10  debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0)
  11  i7 = getarrayitem_gc(p1, i3, descr=<SignedArrayDescr>)
  12  i9 = int_sub(i7, 1)
  13  setarrayitem_gc(p1, i3, i9, descr=<SignedArrayDescr>)
  14  debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0)
  15  i10 = int_is_true(i9)
  16  guard_true(i10, descr=<Guard2>) [p0]
  17  i14 = call(ConstClass(ll_dict_lookup__dicttablePtr_Signed_Signed), ConstPtr(ptr12), 90, 90, descr=<SignedCallDescr>)
  18  guard_no_exception(, descr=<Guard3>) [i14, p0]
  19  i16 = int_and(i14, -9223372036854775808)
  20  i17 = int_is_true(i16)
  21  guard_false(i17, descr=<Guard4>) [i14, p0]
  22  i19 = call(ConstClass(ll_get_value__dicttablePtr_Signed), ConstPtr(ptr12), i14, descr=<SignedCallDescr>)
  23  guard_no_exception(, descr=<Guard5>) [i19, p0]
  24  i21 = int_add(i19, 1)
  25  i23 = int_lt(i21, 114)
  26  guard_true(i23, descr=<Guard6>) [i21, p0]
  27  guard_value(i21, 86, descr=<Guard7>) [i21, p0]
  28  debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
  29  jump(p0, p1, i2, i3, descr=<Loop0>)
  30  [3c167c92bc6a15] jit-log-opt-loop}

I've trimmed the debug_merge_point lines a bit; they were really long.

So let's see what this does. This trace takes 4 parameters: 2 object pointers (p0 and p1) and 2 integers (i2 and i3). Looking at the debug lines, it seems to be tracing one iteration of this loop: "[>+<-]"

It starts executing the first operation on line 4, a ">", but immediately starts executing the next operation. The ">" emitted no instructions and looks like it was optimized out completely. This loop must always act on the same part of the tape; the tape pointer is constant for this trace, so an explicit advance operation is unnecessary.

Lines 5 to 8 are the instructions for the "+" operation. First it gets the array item from the array in pointer p1 at index i2 (line 6), adds 1 to it and stores it in i6 (line 7), and stores it back in the array (line 8).

Line 9 starts the "<" instruction, but it is another no-op. It seems that i2 and i3, passed into this routine as parameters, are the two tape positions used in this loop, already calculated. We can also deduce that p1 is the tape array. It's not clear what p0 is.

Lines 10 through 13 perform the "-" operation: get the array value (line 11), subtract (line 12) and set the array value (line 13).

Next, on line 14, we come to the "]" operation. Lines 15 and 16 check whether i9 is true (non-zero). Looking up, i9 is the array value that we just decremented and stored, now being checked as the loop condition, as expected (remember the definition of "]"). Line 16 is a guard: if the condition is not met, execution jumps somewhere else, in this case to the routine called <Guard2>, and it is passed one parameter: p0.

Assuming we pass the guard, lines 17 through 23 perform the dictionary lookup on bracket_map to find where the program counter should jump to. I'm not too familiar with what the instructions are actually doing, but it looks like there are two external calls and three guards. This seems quite expensive, especially since we know bracket_map will never change (though PyPy doesn't know that). We'll see below how to optimize this.

Line 24 increments the newly acquired instruction pointer. Lines 25 and 26 make sure it's less than the program's length.

Additionally, line 27 guards that i21, the incremented instruction pointer, is exactly 86. This is because it's about to jump to the beginning (line 29) and the instruction pointer being 86 is a precondition to this block.

Finally, the loop closes up at line 28 so the JIT can jump to loop body <Loop0> to handle that case (line 29), which is the beginning of the loop again. It passes in parameters (p0, p1, i2, i3).

Optimizing

As mentioned, every loop iteration does a dictionary lookup to find the corresponding matching bracket for the final jump. This is terribly inefficient: the jump target is not going to change from one iteration to the next. This information is constant and should be compiled in as such.

The problem is that the lookups are coming from a dictionary, and PyPy is treating it as opaque. It doesn't know that the dictionary isn't being modified, or that it won't return something different on each query.

What we need to do is provide another hint to the translation to say that the dictionary query is a pure function, that is, its output depends only on its inputs and the same inputs should always return the same output.

To do this, we use a provided function decorator pypy.rlib.jit.purefunction, and wrap the dictionary call in a decorated function:

from pypy.rlib.jit import purefunction

@purefunction
def get_matching_bracket(bracket_map, pc):
    return bracket_map[pc]

This version can be found at example5.py

Translate again with the JIT option and observe the speedup. Mandelbrot now takes only 6 seconds! (down from 12 seconds before this optimization)

Let's take a look at the trace from the same function:

[3c29fad7b792b0] {jit-log-opt-loop
# Loop 0 : loop with 15 ops
[p0, p1, i2, i3]
debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
debug_merge_point('+<[>[>_+_<-]>.[<+>-]<<-]++++++++++.', 0)
i4 = getarrayitem_gc(p1, i2, descr=<SignedArrayDescr>)
i6 = int_add(i4, 1)
setarrayitem_gc(p1, i2, i6, descr=<SignedArrayDescr>)
debug_merge_point('+<[>[>+_<_-]>.[<+>-]<<-]++++++++++.', 0)
debug_merge_point('+<[>[>+<_-_]>.[<+>-]<<-]++++++++++.', 0)
i7 = getarrayitem_gc(p1, i3, descr=<SignedArrayDescr>)
i9 = int_sub(i7, 1)
setarrayitem_gc(p1, i3, i9, descr=<SignedArrayDescr>)
debug_merge_point('+<[>[>+<-_]_>.[<+>-]<<-]++++++++++.', 0)
i10 = int_is_true(i9)
guard_true(i10, descr=<Guard2>) [p0]
debug_merge_point('+<[>[_>_+<-]>.[<+>-]<<-]++++++++++.', 0)
jump(p0, p1, i2, i3, descr=<Loop0>)
[3c29fad7ba32ec] jit-log-opt-loop}

Much better! Each loop iteration is an add, a subtract, two array loads, two array stores, and a guard on the exit condition. That's it! This code doesn't require any program counter manipulation.

I'm no expert on optimizations; this tip was suggested by Armin Rigo on the pypy-dev list. Carl Friedrich has a series of posts on how to optimize your interpreter that are also very useful: http://bit.ly/bundles/cfbolz/1

Final Words

I hope this has shown some of you what PyPy is all about other than a faster implementation of Python.

For those that would like to know more about how the process works, there are several academic papers explaining the process in detail that I recommend. In particular: Tracing the Meta-Level: PyPy's Tracing JIT Compiler.

See http://readthedocs.org/docs/pypy/en/latest/extradoc.html

Tuesday, April 5, 2011

Tutorial: Writing an Interpreter with PyPy, Part 1

This is a guest blog post written by Andrew Brown, with help from the PyPy developers on the pypy-dev mailing list.

This tutorial's master copy and supporting files live at https://bitbucket.org/brownan/pypy-tutorial/


When I first learned about the PyPy project, it took me a while to figure out exactly what it was about. For those that don't already know, it's two things:

  • A set of tools for implementing interpreters for interpreted languages
  • An implementation of Python using this toolchain

The second part is probably what most people think PyPy is, but this tutorial is not about their Python interpreter. It is about writing your own interpreter for your own language.

This is the project I undertook to help myself better understand how PyPy works and what it's all about.

This tutorial assumes you know very little about PyPy, how it works, and even what it's all about. I'm starting from the very beginning here.

What PyPy Does

Here's a brief overview of what PyPy can do. Let's say you want to write an interpreted language. This involves writing some kind of source code parser, a bytecode interpretation loop, and lots of standard library code.

That's quite a bit of work for moderately complicated languages, and there's a lot of low-level work involved. Writing the parser and compiler code usually isn't fun; that's why there are tools out there to generate parsers and compilers for you.

Even then, you still must worry about memory management in your interpreter, and you're going to be re-implementing a lot if you want data types like arbitrary precision integers, nice general hash tables, and such. It's enough to put someone off from implementing their idea for a language.

Wouldn't it be nice if you could write your language in an existing high-level language like, for example, Python? That sure would be ideal: you'd get all the advantages of a high-level language, like automatic memory management and rich data types, at your disposal. Oh, but an interpreted language interpreting another language would be slow, right? That's twice as much interpreting going on.

As you may have guessed, PyPy solves this problem. PyPy is a sophisticated toolchain for analyzing and translating your interpreter code to C code (or JVM or CLI). This process is called "translation", and it knows how to translate quite a lot of Python's syntax and standard libraries, but not everything. All you have to do is write your interpreter in RPython, a subset of the Python language carefully defined to allow this kind of analysis and translation, and PyPy will produce for you a very efficient interpreter.

Because efficient interpreters should not be hard to write.

The Language

The language I've chosen to implement is dead simple. The language runtime consists of a tape of integers, all initialized to zero, and a single pointer to one of the tape's cells. The language has 8 commands, described here:

  >   Moves the tape pointer one cell to the right
  <   Moves the tape pointer one cell to the left
  +   Increments the value of the cell underneath the pointer
  -   Decrements the value of the cell underneath the pointer
  [   If the cell under the current pointer is 0, skip to the instruction after the matching ]
  ]   Skip back to the matching [ (evaluating its condition)
  .   Print out a single byte to stdout from the cell under the pointer
  ,   Read in a single byte from stdin to the cell under the pointer

Any unrecognized bytes are ignored.

Some of you may recognize this language. I will be referring to it as BF.

One thing to notice is that the language is its own bytecode; there is no translation from source code to bytecode. This means that the language can be interpreted directly: the main eval loop of our interpreter will operate right on the source code. This simplifies the implementation quite a bit.

First Steps

Let's start out by writing a BF interpreter in plain old Python. The first step is sketching out an eval loop:

def mainloop(program):
    tape = Tape()
    pc = 0
    while pc < len(program):
        code = program[pc]

        if code == ">":
            tape.advance()
        elif code == "<":
            tape.devance()
        elif code == "+":
            tape.inc()
        elif code == "-":
            tape.dec()
        elif code == ".":
            sys.stdout.write(chr(tape.get()))
        elif code == ",":
            tape.set(ord(sys.stdin.read(1)))
        elif code == "[" and tape.get() == 0:
            # Skip forward to the matching ]
            pass
        elif code == "]" and tape.get() != 0:
            # Skip back to the matching [
            pass

        pc += 1

As you can see, a program counter (pc) holds the current instruction index. The first statement in the loop gets the instruction to execute, and then a compound if statement decides how to execute that instruction.

The implementation of [ and ] are left out here, but they should change the program counter to the value of the matching bracket. (The pc then gets incremented, so the condition is evaluated once when entering a loop, and once at the end of each iteration)

Here's the implementation of the Tape class, which holds the tape's values as well as the tape pointer:

class Tape(object):
    def __init__(self):
        self.thetape = [0]
        self.position = 0

    def get(self):
        return self.thetape[self.position]
    def set(self, val):
        self.thetape[self.position] = val
    def inc(self):
        self.thetape[self.position] += 1
    def dec(self):
        self.thetape[self.position] -= 1
    def advance(self):
        self.position += 1
        if len(self.thetape) <= self.position:
            self.thetape.append(0)
    def devance(self):
        self.position -= 1

As you can see, the tape expands as needed to the right, indefinitely. We should really add some error checking to make sure the pointer doesn't go negative, but I'm not worrying about that now.

Except for the omission of the "[" and "]" implementation, this code will work fine. However, if the program has a lot of comments, it will have to skip over them one byte at a time at runtime. So let's parse those out once and for all.

At the same time, we'll build a dictionary mapping between brackets, so that finding a matching bracket is just a single dictionary lookup. Here's how:

def parse(program):
    parsed = []
    bracket_map = {}
    leftstack = []

    pc = 0
    for char in program:
        if char in ('[', ']', '<', '>', '+', '-', ',', '.'):
            parsed.append(char)

            if char == '[':
                leftstack.append(pc)
            elif char == ']':
                left = leftstack.pop()
                right = pc
                bracket_map[left] = right
                bracket_map[right] = left
            pc += 1

    return "".join(parsed), bracket_map

This returns a string with all invalid instructions removed, and a dictionary mapping bracket indexes to their matching bracket index.
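As a quick sanity check, here's the function from above run on a tiny program (the surrounding comment text is stripped and the brackets are cross-linked by index):

```python
def parse(program):
    parsed = []
    bracket_map = {}
    leftstack = []

    pc = 0
    for char in program:
        if char in ('[', ']', '<', '>', '+', '-', ',', '.'):
            parsed.append(char)

            if char == '[':
                leftstack.append(pc)
            elif char == ']':
                left = leftstack.pop()
                bracket_map[left] = pc
                bracket_map[pc] = left
            pc += 1

    return "".join(parsed), bracket_map

# the comment characters aren't BF instructions, so they disappear
print(parse("clear the cell: [-]"))  # -> ('[-]', {0: 2, 2: 0})
```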

All we need is some glue code and we have a working BF interpreter:

def run(input):
    program, map = parse(input.read())
    mainloop(program, map)

if __name__ == "__main__":
    import sys
    run(open(sys.argv[1], 'r'))

If you're following along at home, you'll also need to change the signature of mainloop() and implement the bracket branches of the if statement. Here's the complete example: example1.py
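For reference, here is one way the changed mainloop can look once it takes the bracket map as a second argument. This is a sketch following the names used in this post; the details may differ slightly from the actual example1.py. The Tape class is repeated so the snippet is self-contained.

```python
import sys

class Tape(object):
    def __init__(self):
        self.thetape = [0]
        self.position = 0
    def get(self):
        return self.thetape[self.position]
    def set(self, val):
        self.thetape[self.position] = val
    def inc(self):
        self.thetape[self.position] += 1
    def dec(self):
        self.thetape[self.position] -= 1
    def advance(self):
        self.position += 1
        if len(self.thetape) <= self.position:
            self.thetape.append(0)
    def devance(self):
        self.position -= 1

def mainloop(program, bracket_map):
    pc = 0
    tape = Tape()
    while pc < len(program):
        code = program[pc]
        if code == ">":
            tape.advance()
        elif code == "<":
            tape.devance()
        elif code == "+":
            tape.inc()
        elif code == "-":
            tape.dec()
        elif code == ".":
            sys.stdout.write(chr(tape.get()))
        elif code == ",":
            tape.set(ord(sys.stdin.read(1)))
        elif code == "[" and tape.get() == 0:
            # skip forward to the instruction after the matching ]
            pc = bracket_map[pc]
        elif code == "]" and tape.get() != 0:
            # jump back to the matching [, re-evaluating the condition
            pc = bracket_map[pc]
        pc += 1
    return tape

# "++[->+++<]" is a tiny multiply loop: cell 0 ends at 0, cell 1 at 6
tape = mainloop("++[->+++<]", {2: 9, 9: 2})
print(tape.thetape)  # -> [0, 6]
```

Note that both bracket branches jump to the matching bracket itself and then fall through to the shared pc increment, which is why the loop condition is checked both on entry and after each iteration.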

At this point you can try it out to see that it works by running the interpreter under python, but be warned, it will be very slow on the more complex examples:

$ python example1.py 99bottles.b

You can find mandel.b and several other example programs (not written by me) in my repository.

PyPy Translation

But this is not about writing a BF interpreter, this is about PyPy. So what does it take to get PyPy to translate this into a super-fast executable?

As a side note, there are some simple examples in the pypy/translator/goal directory of the PyPy source tree that are helpful here. My starting point for learning this was the example "targetnopstandalone.py", a simple hello world for PyPy.

For our example, the module must define a name called "target" which returns the entry point. The translation process imports your module, looks for that name, and calls it; the function object returned is where the translation starts.

def run(fp):
    program_contents = ""
    while True:
        read = os.read(fp, 4096)
        if len(read) == 0:
            break
        program_contents += read
    os.close(fp)
    program, bm = parse(program_contents)
    mainloop(program, bm)

def entry_point(argv):
    try:
        filename = argv[1]
    except IndexError:
        print "You must supply a filename"
        return 1

    run(os.open(filename, os.O_RDONLY, 0777))
    return 0

def target(*args):
    return entry_point, None

if __name__ == "__main__":
    entry_point(sys.argv)

The entry_point function is passed the command line arguments when you run the resulting executable.

A few other things have changed here too. See the next section...

About RPython

Let's talk a bit about RPython at this point. PyPy can't translate arbitrary Python code because Python is a bit too dynamic. There are restrictions on what standard library functions and what syntax constructs one can use. I won't be going over all the restrictions, but for more information see http://readthedocs.org/docs/pypy/en/latest/coding-guide.html#restricted-python

In the example above, you'll see a few things have changed. I'm now using low level file descriptors with os.open and os.read instead of file objects. The implementations of "." and "," are similarly tweaked (not shown above). Those are the only changes to make to this code; the rest is simple enough for PyPy to digest.

That wasn't so hard, was it? I still get to use dictionaries, expandable lists, and even classes and objects! And if low level file descriptors are too low for you, there are some helpful abstractions in the rlib.streamio module included with PyPy's "RPython standard library."

For the example thus far, see example2.py

Translating

If you haven't already, check out the latest version of PyPy for yourself from its bitbucket.org repository:

$ hg clone https://bitbucket.org/pypy/pypy

(A recent revision is necessary because of a bugfix that makes my example possible)

The script to run is in "pypy/translator/goal/translate.py". Run this script, passing in our example module as an argument.

[A note added much later: this script has been moved to "rpython/bin/rpython".]

$ python ./pypy/pypy/translator/goal/translate.py example2.py

(You can use PyPy's python interpreter for extra speed, but it's not necessary)

PyPy will churn for a bit, drawing some nice looking fractals to your console while it works. It takes around 20 seconds on my machine.

The result from this is an executable binary that interprets BF programs. Included in my repository are some example BF programs, including a mandelbrot fractal generator, which takes about 45 seconds to run on my computer. Try it out:

$ ./example2-c mandel.b

Compare this to running the interpreter un-translated on top of python:

$ python example2.py mandel.b

Takes forever, doesn't it?

So there you have it. We've successfully written our own interpreter in RPython and translated it with the PyPy toolchain.


(more in the next blog post...)

Monday, April 4, 2011

PyPy Göteborg Post-Easter Sprint April 25 - May 1 2011

The next PyPy sprint will be in Gothenburg, Sweden. It is a public sprint, very suitable for newcomers. We'll focus on making the 1.5 release (if it hasn't already happened) and whatever interests the Sprint attendees.

Topics and goals

The main goal is to polish and release PyPy 1.5, supporting Python 2.7 as well as the last few months' improvements in the JIT (provided that it hasn't already happened). Other topics:

  • Going over our documentation, and classifying our docs in terms of mouldiness. Deciding what needs writing, and maybe writing it.
  • Helping people get their code running with PyPy
  • maybe work on EuroPython Training, and talks
  • Summer of Code preparation
  • speed.pypy.org
  • any other programming task is welcome too -- e.g. tweaking the Python or JavaScript interpreter, Stackless support, and so on.

Location

The sprint will be held in the apartment of Laura Creighton and Jacob Hallén which is at Götabergsgatan 22 in Gothenburg, Sweden. Here is a map. This is in central Gothenburg. It is between the tram stops of Vasaplatsen and Valand, (a distance of 4 blocks) where many lines call -- the 2, 3, 4, 5, 7, 10 and 13.

The cheapest option that is not too far away is probably to book accommodation at SGS Veckobostader. The Elite Park Avenyn Hotel is a luxury hotel just a few blocks away. There are scores of hotels a short walk from the sprint location, suitable for every budget, desire for luxury, and desire for the unusual. You could, for instance, stay on a boat. Options are too numerous to go into here. Just ask on the mailing list or on the blog.

Hours will be from 10:00 until people have had enough. It's a good idea to arrive a day before the sprint starts and leave a day later. In the middle of the sprint there usually is a break day and it's usually ok to take half-days off if you feel like it.

Good to Know

Sweden is not part of the Euro zone. One SEK (krona in singular, kronor in plural) is roughly 1/10th of a Euro (9.36 SEK to 1 Euro).

The venue is central in Gothenburg. There is a large selection of places to get food nearby, from edible-and-cheap to outstanding. We often cook meals together, so let us know if you have any food allergies, dislikes, or special requirements.

Sweden uses the same kind of plugs as Germany. 230V AC.

The Sprint will be held the week following Easter. This means, as always, that Gothcon will be taking place the weekend before (Easter weekend). Gothcon, now in its 35th year, is the largest European game players' conference. Some of you may be interested in arriving early for the board games. The conference site is only in Swedish, alas. You don't need to register in advance unless you are planning to host a tournament (and it's too late for that anyway).

Getting Here

If you are coming by train, you will arrive at the Central Station. It is about 12 blocks to the site from there, or you can take a tram.

There are two airports local to Göteborg: Landvetter (the main one) and Gothenburg City Airport (where some budget airlines fly). If you arrive at Landvetter, the airport bus stops right downtown at the Elite Park Avenyn Hotel (the second stop, 4 blocks from the sprint site) before ending at the Central Station. If you arrive at Gothenburg City Airport, take the bus to the end of the line and you will be at the Central Station.

You can also arrive by ferry, from either Kiel in Germany or Frederikshavn in Denmark.

Who's Coming?

If you'd like to come, please let us know when you will be arriving and leaving, as well as your interests. We'll keep a list of people, which you can also update yourself if you have bitbucket pypy commit rights.

Saturday, March 26, 2011

Controlling the Tracing of an Interpreter With Hints, Part 4: Benchmarks

This is part 4 and the final part of the series on how to speed up an interpreter written with PyPy by adding JIT hints to the interpreter. Part 1 described how to control the extent of tracing. Part 2 described how to influence the optimizer with promotion and pure functions. Part 3 described a simple object model and how it can be optimized by doing small rewrites. In this (short) post I present some benchmarks.

Benchmarks

For the benchmarks I ran a subset of the benchmarks on http://speed.pypy.org with CPython and four different executables of PyPy's Python interpreter (all with a JIT). The executables contain all combinations of enabling maps (which make instance attributes fast) and type versions (which makes method lookup fast).

  • pypy-slow: contains neither maps nor type versions.
  • pypy-map: contains maps but not type versions.
  • pypy-version: contains type versions but not maps.
  • pypy-full: contains both maps and type versions

The results are as follows:

The graph shows the speedup over CPython's numbers. The results are quite interesting. Maps by themselves do not speed up much over the bare JIT, whereas type versions alone improve on the JIT baseline in many cases. However, maps are not useless. In combination with type versions they add a nice improvement over just type versions in a number of benchmarks (most notably raytrace-simple and richards but also in crypto-pyaes, django and go).

It's clear that type versions can be arbitrarily effective. A method lookup on a class can be arbitrarily slow, if the inheritance hierarchy becomes deeper and deeper. The full lookup is replaced by one promotion if type versions are enabled.

Maps on the other hand always replace one dict lookup with one promotion. Since dict lookups are already very fast, this by itself does not lead to a gigantic improvement. Only in combination with type versions do they show their full potential.

Tuesday, March 22, 2011

A thank you to the PSF

This year's PyCon was an incredible time; several members of the PyPy team were there, and we'll be blogging more about our experiences in the coming days. However, we quickly wanted to extend a thank you to the Python Software Foundation (PSF).

As you may have heard, on Friday morning at PyCon Jesse Noller handed the PyPy team a check for $10,000, on behalf of the PSF. This was in recognition of our success over the past few years in bringing PyPy from a research project to a fast, compliant, production-ready Python implementation, and to allow us to continue our work on making it faster and more up-to-date with upstream version changes.

Beyond the large check, we're grateful for the endorsement this represents, not only of our work on PyPy, but also of all alternative Python VMs. The PSF has shifted its focus from representing just CPython to representing the Python language, regardless of its implementation, something we are very appreciative of.

From left to right, PyPy people present at PyCon 2011: Maciej Fijałkowski, Armin Rigo, Alex Gaynor, Laura Creighton and Jacob Hallén

Thank you, PSF.

Monday, March 21, 2011

Controlling the Tracing of an Interpreter With Hints, Part 3: Putting it All Together

This is part 3 of the series on how to speed up an interpreter written with PyPy by adding JIT hints to the interpreter. Part 1 described how to control the extent of tracing. Part 2 described how to influence the optimizer with promotion and pure functions. In this post I describe a worked-out example of a small object model for a dynamic language and how to make it efficient using the hints described in the previous posts.

A Simple Object Model

To implement a dynamic language efficiently, the operations on its objects need to be fast. Most dynamic languages have object models that are made by using dictionaries everywhere. Let's look at an example of how the JIT can be made to optimize such operations.

For the purpose of this blog post we will use a very simple and bare-bones object model that just supports very simple classes and instances, without any inheritance or any fancy features. The model has classes, which contain methods. Instances have a class. Instances have their own attributes. When looking up an attribute on an instance, the instance's own attributes are searched first. If the attribute is not found there, the class' attributes are searched.

To implement this object model, we could use the following RPython code as part of the interpreter source code:

class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}

    def instantiate(self):
        return Instance(self)

    def find_method(self, name):
        result = self.methods.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def change_method(self, name, value):
        self.methods[name] = value


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.attributes = {}

    def getfield(self, name):
        result = self.attributes.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def write_attribute(self, name, value):
        self.attributes[name] = value

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)

In this straightforward implementation the methods and attributes are just stored in dictionaries on the classes/instances. While this object model is very simple it already contains all the hard parts of Python's object model. Both instances and classes can have arbitrary fields, and they are changeable at any time. Moreover, instances can change their class after they have been created.
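To make the lookup rule concrete, here is the model in action (the class name "Point" and the values are invented for illustration; the classes are repeated from above so the snippet runs on its own):

```python
class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}

    def instantiate(self):
        return Instance(self)

    def find_method(self, name):
        result = self.methods.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def change_method(self, name, value):
        self.methods[name] = value


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.attributes = {}

    def getfield(self, name):
        result = self.attributes.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def write_attribute(self, name, value):
        self.attributes[name] = value

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)


Point = Class("Point")
Point.change_method("b", 41)   # "methods" can hold any value in this model
inst = Point.instantiate()
inst.write_attribute("a", 1)

# "a" is found on the instance; "b" falls back to the class
print(inst.getattr("a") + inst.getattr("b"))  # -> 42
```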

When using this object model in an interpreter, a huge amount of time will be spent doing lookups in these dictionaries. To make the language efficient using a tracing JIT, we need to find a way to get rid of these dictionary lookups somehow.

Let's assume we trace through code that sums three attributes, such as:

inst.getattr("a") + inst.getattr("b") + inst.getattr("c")

The trace could look like this:

# inst.getattr("a")
attributes1 = inst.attributes
result1 = dict.get(attributes1, "a")
guard(result1 is not None)

# inst.getattr("b")
attributes2 = inst.attributes
v1 = dict.get(attributes2, "b")
guard(v1 is None)
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2

# inst.getattr("c")
attributes3 = inst.attributes
v3 = dict.get(attributes3, "c")
guard(v3 is None)
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)

v4 = v2 + result3
return(v4)

In this example, the attribute a is found on the instance, but the attributes b and c are found on the class. The trace indeed contains five calls to dict.get, which is slow.

Making Instance Attributes Faster Using Maps

The first step in making getattr faster in our object model is to optimize away the dictionary lookups on the instances. The hints we have looked at in the two earlier blog posts don't seem to help with the current object model. There is no pure function to be seen, and the instance is not a candidate for promotion, because there tend to be many instances.

This is a common problem when trying to apply hints. Often, the interpreter needs a small rewrite to expose the pure functions and nearly-constant objects that are implicitly there. In the case of instance fields this rewrite is not entirely obvious. The basic idea is as follows. In theory instances can have arbitrary fields. In practice however many instances share their layout (i.e. their set of keys) with many other instances.

Therefore it makes sense to factor the layout information out of the instance implementation into a shared object. This shared layout object is called a map. Maps are an old idea that comes originally from the SELF language. They are also used by many JavaScript implementations such as V8. I've written about maps before, so I won't explain them fully again.

The rewritten Instance class using maps looks like this:

class Map(object):
    def __init__(self):
        self.attribute_indexes = {}
        self.other_maps = {}

    @purefunction
    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)

    @purefunction
    def new_map_with_additional_attribute(self, name):
        if name not in self.other_maps:
            newmap = Map()
            newmap.attribute_indexes.update(self.attribute_indexes)
            newmap.attribute_indexes[name] = len(self.attribute_indexes)
            self.other_maps[name] = newmap
        return self.other_maps[name]


EMPTY_MAP = Map()

class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.map = EMPTY_MAP
        self.storage = []

    def getfield(self, name):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            return self.storage[index]
        raise AttributeError(name)

    def write_attribute(self, name, value):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            self.storage[index] = value
            return
        self.map = map.new_map_with_additional_attribute(name)
        self.storage.append(value)

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)

Instances no longer use dictionaries to store their fields. Instead, they have a reference to a map, which maps field names to indexes into a storage list. The storage list contains the actual field values. The maps are shared between objects with the same layout. Therefore they have to be immutable, which means that their getindex method is a pure function. When a new attribute is added to an instance, a new map needs to be chosen, which is done with the new_map_with_additional_attribute method on the previous map. Now that we have introduced maps, it is safe to promote the map everywhere, because we assume that the number of different instance layouts is small.
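Outside of the translation toolchain, hint and purefunction do nothing, so we can stub them out and check in plain Python the property the whole scheme relies on: instances that gain the same attributes in the same order end up sharing a single map. This is a trimmed-down sketch (the cls field is dropped for brevity):

```python
def hint(x, **kwds):        # no-op stand-in for the JIT hint
    return x

def purefunction(f):        # no-op stand-in for the decorator
    return f

class Map(object):
    def __init__(self):
        self.attribute_indexes = {}
        self.other_maps = {}

    @purefunction
    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)

    @purefunction
    def new_map_with_additional_attribute(self, name):
        # transitions are memoized, so the same sequence of attribute
        # additions always leads to the same Map object
        if name not in self.other_maps:
            newmap = Map()
            newmap.attribute_indexes.update(self.attribute_indexes)
            newmap.attribute_indexes[name] = len(self.attribute_indexes)
            self.other_maps[name] = newmap
        return self.other_maps[name]

EMPTY_MAP = Map()

class Instance(object):
    def __init__(self):     # cls dropped for brevity
        self.map = EMPTY_MAP
        self.storage = []

    def write_attribute(self, name, value):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            self.storage[index] = value
            return
        self.map = map.new_map_with_additional_attribute(name)
        self.storage.append(value)

i1, i2 = Instance(), Instance()
for inst in (i1, i2):
    inst.write_attribute("x", 1)
    inst.write_attribute("y", 2)

print(i1.map is i2.map)          # -> True: one shared layout object
print(i1.map.attribute_indexes)  # -> {'x': 0, 'y': 1}
```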

With this changed instance implementation, the trace we had above changes to the following, where 0xb74af4a8 is the memory address of the Map instance that has been promoted:

# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
index1 = Map.getindex(map1, "a")
guard(index1 != -1)
storage1 = inst.storage
result1 = storage1[index1]

# inst.getattr("b")
map2 = inst.map
guard(map2 == 0xb74af4a8)
index2 = Map.getindex(map2, "b")
guard(index2 == -1)
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2

# inst.getattr("c")
map3 = inst.map
guard(map3 == 0xb74af4a8)
index3 = Map.getindex(map3, "c")
guard(index3 == -1)
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)

v4 = v2 + result3
return(v4)

The calls to Map.getindex can be optimized away, because they are calls to a pure function and they have constant arguments. That means that index1/2/3 are constant and the guards on them can be removed. All but the first guard on the map will be optimized away too, because the map cannot have changed in between. The optimized trace looks like this:

# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
storage1 = inst.storage
result1 = storage1[0]

# inst.getattr("b")
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2

# inst.getattr("c")
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)

v4 = v2 + result3
return(v4)

The index 0 that is used to read out of the storage array is the result of the constant-folded getindex call. This trace is already much better than the original one. Now we are down from five dictionary lookups to just two.

Versioning of Classes

Instances were optimized under the assumption that the total number of Instance layouts is small compared to the number of instances. For classes we will make an even stronger assumption. We simply assume that it is rare for classes to change at all. This is not totally reasonable (sometimes classes contain counters or similar things) but for this simple example it is good enough.

What we would really like is if the Class.find_method method were pure. But it cannot be, because it is always possible to change the class itself. Every time the class changes, find_method can potentially return a new value.

Therefore, we give every class a version number, which is increased every time a class gets changed (i.e., the content of the methods dictionary changes). This means that the result of methods.get() for a given (name, version) pair will always be the same, i.e. it is a pure operation. To help the JIT to detect this case, we factor it out in a helper method which is explicitly marked as @purefunction. The refactored Class looks like this:

class VersionTag(object):
    pass

class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}
        self.version = VersionTag()

    def find_method(self, name):
        self = hint(self, promote=True)
        version = hint(self.version, promote=True)
        result = self._find_method(name, version)
        if result is not None:
            return result
        raise AttributeError(name)

    @purefunction
    def _find_method(self, name, version):
        return self.methods.get(name)

    def change_method(self, name, value):
        self.methods[name] = value
        self.version = VersionTag()

What is interesting here is that _find_method takes the version argument but it does not use it at all. Its only purpose is to make the call pure (because when the version number changes, the result of the call might be different than the previous one).
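The effect of the version tag can be checked in plain Python: the version object changes identity exactly when the methods dictionary changes, so a (name, version) pair pins down one lookup result. A minimal sketch of just that mechanism:

```python
class VersionTag(object):
    pass

class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}
        self.version = VersionTag()

    def change_method(self, name, value):
        self.methods[name] = value
        # any change invalidates the old version; results cached
        # under the old VersionTag can never be wrongly reused
        self.version = VersionTag()

cls = Class("A")
v0 = cls.version
cls.change_method("f", 41)
print(cls.version is v0)  # -> False: ("f", v0) and ("f", new version) differ
```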

The trace with this new class implementation looks like this:

# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
index1 = Map.getindex(map1, "a")
guard(index1 != -1)
storage1 = inst.storage
result1 = storage1[index1]

# inst.getattr("b")
map2 = inst.map
guard(map2 == 0xb74af4a8)
index2 = Map.getindex(map2, "b")
guard(index2 == -1)
cls1 = inst.cls
guard(cls1 == 0xb7aaaaf8)
version1 = cls1.version
guard(version1 == 0xb7bbbb18)
result2 = Class._find_method(cls1, "b", version1)
guard(result2 is not None)
v2 = result1 + result2

# inst.getattr("c")
map3 = inst.map
guard(map3 == 0xb74af4a8)
index3 = Map.getindex(map3, "c")
guard(index3 == -1)
cls2 = inst.cls
guard(cls2 == 0xb7aaaaf8)
version2 = cls2.version
guard(version2 == 0xb7bbbb18)
result3 = Class._find_method(cls2, "c", version2)
guard(result3 is not None)

v4 = v2 + result3
return(v4)

The calls to Class._find_method can now be optimized away, also the promotion of the class and the version, except for the first one. The final optimized trace looks like this:

# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
storage1 = inst.storage
result1 = storage1[0]

# inst.getattr("b")
cls1 = inst.cls
guard(cls1 == 0xb7aaaaf8)
version1 = cls1.version
guard(version1 == 0xb7bbbb18)
v2 = result1 + 41

# inst.getattr("c")
v4 = v2 + 17
return(v4)

The constants 41 and 17 are the results of the constant-folding of the _find_method calls. This final trace is now very good. It no longer performs any dictionary lookups. Instead it contains several guards. The first guard checks that the map is still the same. This guard will fail if the same code is executed with an instance that has another layout. The second guard checks that the class of inst is still the same. It will fail if the trace is executed with an instance of another class. The third guard checks that the class did not change since the trace was produced. It will fail if somebody calls the change_method method on the class.

Real-World Considerations

The techniques used above for the simple object model are used for the object model of PyPy's Python interpreter too. Since Python's object model is considerably more complex, some additional work needs to be done.

The first problem that needs to be solved is that Python supports (multiple) inheritance. Therefore looking up a method in a class needs to consider the whole method resolution order. This makes the versioning of classes more complex. If a class is changed its version changes. At the same time, the versions of all the classes inheriting from it need to be changed as well, recursively. This makes class changes expensive, but they should be rare. On the other hand, a method lookup in a complex class hierarchy is as optimized in the trace as in our object model here.

A downside of the versioning of classes that we haven't yet fixed in PyPy is that some classes do change a lot. An example would be a class that keeps a counter of how many instances have been created so far. This is very slow right now, but we have ideas about how to fix it in the future.

Another optimization is that in practice the shape of an instance is correlated with its class. In our code above, we allow both to vary independently. In PyPy's Python interpreter we act somewhat more cleverly. The class of an instance is not stored on the instance itself, but on the map. This means that we get one fewer promotion (and thus one fewer guard) in the trace, because the class doesn't need to be promoted after the map has been.

More General Patterns

The techniques we used above to make instance and class lookups faster are applicable in more general cases than the one we developed them for. A more abstract view of maps is that of splitting a data-structure into a part that changes slowly, and a part that changes quickly. In the concrete example of maps we split the original dictionary into the map (the slow-changing part) and the storage array (the quick-changing part). All the computation on the slow-changing part can be constant-folded during tracing so that only the manipulation of the quick-changing part remains.

Similarly, versions can be used to constant-fold arbitrary functions of large data structures. The version needs to be updated carefully every time the result of this function can change. Therefore this is useful only if the data structure is expected to change slowly.
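The same pattern can be emulated outside of a JIT with an explicit cache keyed on the version (all names here are invented for illustration): every mutation bumps the version, so stale cached results are simply never looked up again.

```python
class VersionTag(object):
    pass

class Settings(object):
    """A large structure that is expected to change rarely."""
    def __init__(self):
        self.data = {}
        self.version = VersionTag()

    def set(self, key, value):
        self.data[key] = value
        self.version = VersionTag()  # bump the version on every change

_cache = {}

def lookup(settings, key):
    # With a tracing JIT, promoting settings.version makes this call
    # constant-foldable; here a cache keyed on (version, key) plays
    # the same role.
    tag = (settings.version, key)
    if tag not in _cache:
        _cache[tag] = settings.data.get(key)
    return _cache[tag]

s = Settings()
s.set("answer", 42)
print(lookup(s, "answer"))  # -> 42, cached under the current version
s.set("answer", 43)
print(lookup(s, "answer"))  # -> 43, the new version forces a cache miss
```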

Conclusion

In this post I showed how to use purefunction and promote to make a small but still relevant dynamic object model no longer use any dictionary lookups after tracing. Instead a number of guards are inserted into the trace to check whether the assumptions about the objects are still true. This makes operations on objects seriously faster. I plan to write another small post that shows the speed benefits for PyPy's Python interpreter for exactly these operations.