Monday, December 21, 2009

Accelerating PyPy development by funding

PyPy has recently made some great speed and memory progress towards providing the most efficient Python interpreter out there. We also just announced our plans for the pypy-1.2 release. Much of this is driven by personal commitment, by individuals and companies investing time and money. Now we'd appreciate some feedback and help regarding getting money into the PyPy project to help its core members (between 5 and 15 people, depending on how you count) sustain themselves. We see several options:

  • use a foundation structure and ask for tax-exempt donations to the project, its developers and infrastructure. We just got a letter from the Software Freedom Conservancy saying that they view our application favourably, so this option will hopefully become practical soon.
  • offer to implement certain features like a 64bit JIT-backend, Numpy for PyPy or a streamlined installation in exchange for money, contributed in small portions/donations. Do you imagine you or your company would sponsor PyPy on a small scale for efforts like this? Any other bits you'd like to see?
  • offer to implement larger scale tasks by contracting PyPy related companies, namely Open End and merlinux who have successfully done such contracts in the past. Please don't hesitate to contact holger@merlinux.eu and bea@openend.se if you want to start a conversation on this.
  • apply for public/state funding - in fact we are likely to get some funding through Eurostars, more on that separately. Such funding usually covers only 50-60% of actual employment and project costs, though, and is tied to research questions rather than to making PyPy a production-usable interpreter.

Anything else we should look out for?

cheers & thanks for any feedback, Maciej and Holger

Thursday, December 17, 2009

Planning a next release of PyPy

The PyPy core team is planning to make a new release before the next Pycon US.

The main target of the 1.2 release is packaging the good results we have achieved applying our current JIT compiler generator to our Python interpreter. Some of that progress has been chronicled in recent posts on the status blog. By releasing them in a relatively stable prototype we want to encourage people to try them with their own code and to gather feedback that way. By construction the JIT compiler should support all Python features; what may vary are the speedups achieved (in some cases the JIT may produce worse results than the PyPy interpreter, which we would like to know about) and the extra memory required by it.

For the 1.2 release we will focus on JIT stability first, less on improving non-strictly-JIT areas. The JIT should be good at many things, as shown by previous blog postings. We want the JIT compiler in the release to work well on 32-bit Intel Linux, with Mac OS X and Windows being secondary targets. Which compilation targets work will depend a bit on contributions.

In order to finalize the release we intend to have a concentrated effort ("virtual sprint") from the 22nd to the 29th of January. Coordination will happen as usual through the #pypy irc channel on freenode. Samuele Pedroni will take the role of release manager as he already did in the past.

Wednesday, December 2, 2009

Leysin Winter Sprint: postponed

Update: the sprint has been postponed to some later date.

The next PyPy sprint will probably still be in Leysin, Switzerland, for the seventh time.
      

Monday, November 30, 2009

Using CPython extension modules with PyPy, or: PyQt on PyPy

If you have ever wanted to use CPython extension modules on PyPy, we want to announce that there is a solution that should be compatible with quite a few of the available modules. It is neither new nor written by us, but it nevertheless works great with PyPy.

The trick is to use RPyC, a transparent, symmetric remote procedure call library written in Python. The idea is to start a CPython process that hosts the PyQt libraries and connect to it via TCP to send RPC commands to it.

I tried to run PyQt applications using it on PyPy and could get quite a bit of their functionality working. Remaining problems include regular segfaults of CPython because of PyQt-induced memory corruption, and bugs caused by classes like StandardButtons behaving incorrectly when it comes to arithmetic operations.

Some changes to RPyC were needed to support remote unbound __init__ methods, shallow call-by-value for list and dict types (PyQt4 methods want real lists and dicts as parameters), and callbacks to methods (all remote method objects are wrapped into small lambda functions to ease the call for PyQt4).

If you want to try RPyC to run the PyQt application of your choice, you just need to follow these steps. Please report your experience here in the blog comments or on our mailing list.

  1. Download RPyC from the RPyC download page.
  2. Download this patch and apply it to RPyC by running patch -p1 < rpyc-3.0.7-pyqt4-compat.patch in the RPyC directory.
  3. Install RPyC by running python setup.py install as root.
  4. Run the file rpyc/servers/classic_server.py using CPython.
  5. Execute your PyQt application on PyPy.

PyPy will automatically connect to CPython and use its PyQt libraries.

Note that this scheme works with nearly every extension library. Look at pypy/lib/sip.py to see how to add new libraries (you need to create such a file for every proxied extension module).
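For a rough idea of what such a proxy file does, here is a minimal sketch; it is not the actual pypy/lib/sip.py, and the connection details are assumptions for illustration only. It uses RPyC's classic mode to fetch the real extension module from the CPython server process and re-export its contents:

import rpyc

# connect to the classic_server.py started under CPython (assuming the
# default host and port)
_conn = rpyc.classic.connect("localhost")

# obtain the real extension module living in the CPython process
_remote_module = _conn.modules["sip"]

# re-export its public attributes, so that importing this file on PyPy
# transparently hands out the remote objects
for _name in dir(_remote_module):
    if not _name.startswith("_"):
        globals()[_name] = getattr(_remote_module, _name)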

Have fun with PyQt

Alexander Schremmer

Wednesday, November 18, 2009

Some benchmarking

Hello.

Recently, thanks to the surprisingly helpful Unhelpful, also known as Andrew Mahone, we have a decent, if slightly arbitrary, set of performance graphs. It contains a couple of benchmarks already seen on this blog as well as some taken from The Great Computer Language Benchmarks Game. These benchmarks don't even try to represent "real applications", as they're mostly small algorithmic benchmarks. Interpreters used:

  1. PyPy trunk, revision 69331 with --translation-backendopt-storesink, which is now on by default
  2. Unladen swallow trunk, r900
  3. CPython 2.6.2 release

Here are the graphs; the benchmarks and the runner script are available.

And zoomed in for all benchmarks except binary-trees and fannkuch.

As we can see, PyPy is generally somewhere between the same speed as CPython and 50x faster (f1int). The places where we're the same speed as CPython are places where we know we have problems - for example, generators are not sped up by the JIT and they require some work (although not as much by far as generators & Psyco :-). The glaring inefficiency is in the regex-dna benchmark. This one clearly demonstrates that our regular expression engine is really, really bad and urgently requires attention.

The cool thing here is that, although these benchmarks might not represent typical Python applications, they're not uninteresting. They show that algorithmic code does not need to be far slower in Python than in C, so using PyPy one need not worry about algorithmic code being dramatically slow. As many readers would agree, that kills yet another usage of C in our lives :-)

Cheers,
fijal

Friday, November 13, 2009

Düsseldorf Sprint Report

While the Düsseldorf sprint is winding down, we put our minds to the task of retelling our accomplishments. The sprint was mostly about improving the JIT and we managed to stick to that task (as much as we managed to stick to anything). The sprint was mostly filled with doing many small things.

Inlining

Carl Friedrich and Samuele started the sprint trying to tame the JIT's inlining. Until now, the JIT would try to inline everything in a loop (except other loops) which is what most tracing JITs actually do. This works great if the resulting trace is of reasonable length, but if not it would result in excessive memory consumption and code cache problems in the CPU. So far we just had a limit on the trace size, and we would abort tracing when the limit was reached. This would happen again and again for the same loop, which is not useful at all. The new approach introduced is to be more clever when tracing is aborted by marking the function with the largest contribution to the trace size as non-inlinable. The next time this loop is traced, it usually then gives a reasonably sized trace.

This gives a problem because now some functions that don't contain loops are not inlined, which means that assembler code is never generated for them. To remedy this problem we also make it possible to trace functions from their start (as opposed to just tracing loops). We do that only for functions that cannot be inlined (either because they contain loops or because they were marked as non-inlinable as described above).
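Here is a schematic sketch of that policy (this is not the actual JIT code; the names and the limit are made up for illustration). The point is simply that an aborted trace blames its biggest contributor, so the next attempt has a chance to stay under the limit:

TRACE_LIMIT = 10000          # illustrative value, not PyPy's real limit
non_inlinable = set()

def trace_loop(ops_contributed_per_function):
    """ops_contributed_per_function maps each inlined function to the number
    of trace operations it contributed during this tracing attempt."""
    total = sum(ops_contributed_per_function.values())
    if total > TRACE_LIMIT:
        # abort tracing and blame the largest contributor, so that it is
        # called (not inlined) the next time this loop is traced
        biggest = max(ops_contributed_per_function,
                      key=ops_contributed_per_function.get)
        non_inlinable.add(biggest)
        return None
    return "compiled trace with %d operations" % total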

The result of this is that the Python version of the telco decimal benchmark runs to completion without having to arbitrarily increase the trace length limit. It's also about 40% faster than running it on CPython. This is one of the first non-tiny programs that we speed up.

Reducing GC Pressure

Armin and Anto used some GC instrumentation to find places in pypy-c-jit that allocate a lot of memory. This is an endlessly surprising exercise, as usually we don't care too much about allocations of short-lived objects when writing RPython, as our GCs usually deal well with those. They found a few places where they could remove allocations, most importantly by making one of the classes that make up traces smaller.

Optimizing Chains of Guards

Carl Friedrich and Samuele started a simple optimization on the trace level that removes superfluous guards. A common pattern in a trace is to have stronger and stronger guards about the same object. As an example, often there is first a guard that an object is not None, later followed by a guard that it is exactly of a given class and then even later that it is a precise instance of that class. This is inefficient, as we can just check the most precise thing in the place of the first guard, saving us guards (which take memory, as they need resume data). Maciek, Armin and Anto later improved on that by introducing a new guard that checks for non-nullity and a specific class in one guard, which allows us to collapse more chains.
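As a toy illustration of the idea (not PyPy's actual optimizer code), think of guards as (value, kind) pairs in trace order. Since a guard for an exact instance implies the class guard and the non-None guard on the same value, only the strongest one per value needs to be kept, at the position of the first:

STRENGTH = {"nonnull": 0, "class": 1, "exact_instance": 2}

def collapse_guard_chain(guards):
    """guards: list of (value, kind) pairs in trace order."""
    strongest = {}     # value -> strongest kind seen so far
    order = []         # values in order of their first guard
    for value, kind in guards:
        if value not in strongest:
            order.append(value)
            strongest[value] = kind
        elif STRENGTH[kind] > STRENGTH[strongest[value]]:
            strongest[value] = kind
    return [(value, strongest[value]) for value in order]

# the three guards on p0 collapse into a single, most precise one
assert collapse_guard_chain([("p0", "nonnull"), ("p0", "class"),
                             ("p0", "exact_instance")]) == [("p0", "exact_instance")]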

Improving JIT and Exceptions

Armin and Maciek went on a multi-day quest to make the JIT and Python-level exceptions like each other more. So far, raising and catching exceptions would make the JIT generate code that has a certain amusement value, but is not really fast in any way. To improve the situation, they had to dig into the exception support in the Python interpreter, where they found various inefficiencies. They also had to rewrite the exceptions module to be in RPython (as opposed to just pure Python + an old hack). Another problem is that tracebacks give you access to interpreter frames. This forces the JIT to deoptimize things, as the JIT keeps some of the frame's content in CPU registers or on the CPU stack, which reflective access to frames prevents. Currently we try to improve the simple cases where the traceback is never actually accessed. This work is not completely finished, but some cases are already significantly faster.

Moving PyPy to use py.test 1.1

Holger worked on porting PyPy to use the newly released py.test 1.1. PyPy still uses some very old support code in its testing infrastructure, which makes this task a bit annoying. He also gave the other PyPy developers a demo of some of the newer py.test features and we discussed which of them we want to start using to improve our tests to make them shorter and clearer. One of the things we want to do eventually is to have fewer skipped tests than now.

Using a Simple Effect Analysis for the JIT

One of the optimizations the JIT does is caching fields that are read out of structures on the heap. This cache needs to be invalidated at certain points, for example when such a field is written to (as we don't track aliasing much). Another case is a call in the assembler, as the target function could arbitrarily change the heap. This of course is imprecise, since most functions don't actually change the whole heap, and we have an analysis that finds out which types of structs and arrays a function can mutate. During the sprint Carl Friedrich and Samuele integrated this analysis with the JIT, to help it invalidate caches less aggressively. Later Anto and Carl Friedrich also ported this support to the CLI version of the JIT.
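The following toy model (not the real implementation) shows how such an effect analysis helps: the field cache survives a call whenever the callee's effect information says it cannot write that field, and is only cleared when nothing is known about the callee:

class FieldCache(object):
    def __init__(self):
        self.known = {}                 # (object, fieldname) -> cached value

    def read_field(self, obj, fieldname, read_from_heap):
        key = (obj, fieldname)
        if key not in self.known:
            self.known[key] = read_from_heap(obj, fieldname)
        return self.known[key]

    def write_field(self, obj, fieldname, value):
        self.known[(obj, fieldname)] = value

    def after_call(self, fields_possibly_written):
        # fields_possibly_written is what the effect analysis knows about the
        # callee; None means "could touch anything", so drop the whole cache
        if fields_possibly_written is None:
            self.known.clear()
        else:
            for key in list(self.known):
                if key[1] in fields_possibly_written:
                    del self.known[key]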

Miscellaneous

Samuele (with some assistance from Carl Friedrich) set up a buildbot slave on a Mac Mini at the University. This should let us stabilize on Mac OS X. So far we still have a number of failing tests, but now we are in a position to sanely approach fixing them.

Anto improved the CLI backend to support the infrastructure for producing the profiling graphs Armin introduced.

The guinea-pigs that were put into Carl Friedrich's care have been fed (which was the most important sprint task anyway).

Samuele & Carl Friedrich

Friday, November 6, 2009

Düsseldorf Sprint Started

The Düsseldorf sprint starts today. Only Samuele and I are here so far, but that should change over the course of the day. We will mostly work on the JIT during this sprint, trying to make it a lot more practical. For that we need to decrease its memory requirements some more and make it use less aggressive inlining. We will post more as the sprint progresses.

Tuesday, November 3, 2009

PyPy on RuPy 2009

Hello.

It's maybe a bit late to announce, but there will be a PyPy talk at the RuPy conference this weekend in Poznan. More precisely, I'll be talking mostly about PyPy's JIT and how to use it. Unfortunately the talk is on Saturday, at 8:30 in the morning.

EDIT: The talk is online, together with examples.

Cheers,
fijal

Sunday, November 1, 2009

Logging and nice graphs

Hi all,

This week I worked on improving the system we use for logging. Well, it was not really a "system" but rather a pile of hacks to measure timings and counts in custom ways and display them. So now, we have a system :-)

The system in question was integrated in the code for the GC and the JIT, which are two independent components as far as the source is concerned. However, we can now display a unified view. Here is for example pypy-c-jit running pystone for (only) 5000 iterations:

The top long bar represents time. The bottom shows two summaries of the total time taken by the various components, and also plays the role of a legend to understand the colors at the top. Shades of red are the GC, shades of green are the JIT.

Here is another picture, this time on pypy-c-jit running 10 iterations of richards:

We have to look more closely at various examples, but a few things immediately show up. One thing is that the GC is put under large pressure by the jit-tracing, jit-optimize and (to a lesser extent) the jit-backend components. So large in fact that the GC takes at least 60-70% of the time there. We will have to do something about it at some point. The other thing is that on richards (and it's likely generally the case), the jit-blackhole component takes a lot of time. "Blackholing" is the operation of recovering from a guard failure in the generated assembler, and falling back to the interpreter. So this is also something we will need to improve.

That's it! The images were generated with the following commands:

PYPYLOG=/tmp/log pypy-c-jit richards.py
python pypy/tool/logparser.py draw-time /tmp/log --mainwidth=8000 --output=filename.png
EDIT: nowadays the command-line has changed to:
python rpython/tool/logparser.py draw-time /tmp/log --mainwidth=8000 filename.png

Friday, October 16, 2009

GC improvements

In the last week, I (Armin) have been taking some time off the JIT work to improve our GCs. More precisely, our GCs now take one or two words less for every object. This further reduces the memory usage of PyPy, as we will show at the end.

Background information: RPython object model

We first need to understand the RPython object model as implemented by our GCs and our C backend. (Note that the object model of the Python interpreter is built on top of that, but is more complicated -- e.g. Python-level objects are much more flexible than RPython objects.)

Consider these two RPython classes:

class A:
    def __init__(self, x):
        self.x = x
    def f(self):
        return self.x * 42

class B(A):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def f(self):
        return self.x + self.y

The instances of A and B look like this in memory (all cells are one word):

A instance:   [ GC header | vtable ptr of A | hash | x ]

B instance:   [ GC header | vtable ptr of B | hash | x | y ]

The first word, the GC header, describes the layout. It encodes on half a word the shape of the object, including where it contains further pointers, so that the GC can trace it. The other half contains GC flags (e.g. the mark bit of a mark-and-sweep GC).

The second word is used for method dispatch. It is similar to a C++ vtable pointer. It points to static data that is mostly a table of methods (as function pointers), containing e.g. the method f of the example.

The hash field is not necessarily there; it is only present in classes whose hash is ever taken in the RPython program (which includes being keys in a dictionary). It is an "identity hash": it works like object.__hash__() in Python, but it cannot just be the address of the object in case of a GC that moves objects around.

Finally, the x and y fields are, obviously, used to store the value of the fields. Note that instances of B can be used in places that expect a pointer to an instance of A.

Unifying the vtable ptr with the GC header

The first idea for saving a word in every object comes from the observation that both the vtable ptr and the GC header store information about the class of the object. Therefore it is natural to try to have only one of them. The problem is that we still need bits for the GC flags, so the field that we have to remove is the vtable pointer.

This means that method dispatch needs to be more clever: it cannot directly read the vtable ptr, but needs to compute it from the half-word of the GC header. Fortunately, this can be done with no extra instruction at the assembler level. Here is how things look in the end, assuming a 32-bit x86 machine (but note that as usual we just generate portable C).

The trick for achieving efficiency is that we store all vtables together in memory, and make sure that they don't take more than 256 KB in total (16 bits, plus 2 bits of alignment). Here is how the assembler code (produced by the normal C compiler, e.g. gcc) for calling a method looks. Before the change:

MOV EDX, [EAX + 4]               # load the vtable ptr from object EAX
MOV EDX, [EDX + method_offset]   # load the function pointer from the vtable
CALL EDX

Instead, we now have:

MOVZX EDX, [EAX]     # load the 16-bit part of the GC header from EAX
MOV EDX, [vtable_start + 4*EDX + method_offset]
CALL EDX

Note that the complex addressing scheme done by the second MOV is still just one instruction: the vtable_start and method_offset are constants, so they are combined. And as the vtables are anyway aligned at a word boundary, we can use 4*EDX to address them, giving us 256 KB instead of just 64 KB of vtables.
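In other words, the dispatch computes the vtable address from the 16-bit half-word instead of loading a full pointer. Here is a small Python model of that computation (the constants are invented for illustration; the real values are chosen by the translator):

VTABLE_START = 0x08000000      # invented address where all vtables are packed

def vtable_address(gc_header):
    type_id = gc_header & 0xFFFF          # the 16-bit part read by MOVZX
    return VTABLE_START + 4 * type_id     # word-aligned: 4*index covers 256 KB

def method_pointer(memory, obj_address, method_offset):
    """memory is a dict mapping addresses to words, standing in for RAM."""
    gc_header = memory[obj_address]
    return memory[vtable_address(gc_header) + method_offset]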

Optimizing the hash field

In PyPy's Python interpreter, all application-level objects are represented as an instance of some subclass of W_Root. Since all of these objects could potentially be stored in a dictionary by the application Python program, all these objects need a hash field. Of course, in practice, only a fraction of all objects in a Python program end up having their hash ever taken. Thus this field of W_Root is wasted memory most of the time.

(Up to now, we had a hack in place to save the hash field on a few classes like W_IntegerObject, but that meant that the Python expression "object.__hash__(42)" would raise a TypeError in PyPy.)

The solution we implemented now (done by some Java GCs, among others) is to add a hash field to an object when the (identity) hash of that object is actually taken. This means that we had to enhance our GCs to support this. When objects are allocated, we don't reserve any space for the hash:

object at 0x74B028:   [ ...00... | x | y ]

When the hash of an object is taken, we use its current memory address, and set a flag in the GC header saying that this particular object needs a hash:

object at 0x74B028:   [ ...01... | x | y ]

If the GC needs to move the object to another memory location, it will make the new version of the object bigger, i.e. it will also allocate space for the hash field:

object at 0x825F60:   [ ...11... | x | y | 0x74B028 ]

This hash field is immediately initialized with the old memory address, which is the hash value that we gave so far for the object. To not disturb the layout of the object, we always put the extra hash field at the end. Of course, once set, the hash value does not change even if the object needs to move again.
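A heavily simplified model of this scheme (illustrative only, not the GC's real code): the identity hash is the current address until the object moves, and a flagged object grows one extra word holding its old address when it is copied:

class ObjectHeader(object):
    def __init__(self, address):
        self.address = address
        self.hash_requested = False    # the flag bit in the GC header
        self.hash_field = None         # extra word, only present after a move

def identity_hash(obj):
    if obj.hash_field is not None:
        return obj.hash_field
    obj.hash_requested = True          # remember to materialize on the next move
    return obj.address

def move_object(obj, new_address):
    if obj.hash_requested and obj.hash_field is None:
        # allocate one extra word at the end of the copy and store the old
        # address there; the hash stays stable across any further moves
        obj.hash_field = obj.address
    obj.address = new_address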

Results

Running the following program on PyPy's Python interpreter with n=4000000:

class X(object):
    pass

def make_linked_list(n):
    a = None
    i = 0
    while i < n:
        b = X()
        b.next = a
        a = b
        i += 1

make_linked_list(4000000)

the two optimizations together save 32 MB of RAM (i.e. 8 bytes per object). The version of PyPy we measured this with was built as follows:

./translate.py --gcremovetypeptr targetpypystandalone --objspace-std-withsharingdict

The total amount of RAM used on a 32-bit Linux is 247 MB, completing in 10.3 seconds. On CPython, it consumes 684 MB and takes 89 seconds to complete... This nicely shows that our GCs are much faster at allocating objects, and that our objects can be much smaller than CPython's.

Armin Rigo & Carl Friedrich Bolz

Thursday, October 15, 2009

First pypy-cli-jit benchmarks

As the readers of this blog already know, I've been working on porting the JIT to CLI/.NET for the last months. Now that it's finally possible to get a working pypy-cli-jit, it's time to do some benchmarks.

Warning: as usual, all of this has to be considered an alpha version: don't be surprised if you get a crash when trying to run pypy-cli-jit. Of course, things are improving very quickly so it should become more and more stable as days pass.

For this time, I decided to run four benchmarks. Note that for all of them we run the main function once in advance, to let the JIT recognize the hot loops and emit the corresponding code. Thus, the results reported do not include the time spent by the JIT compiler itself, but give a good measure of how good the code generated by the JIT is. At this point in time, I know that the CLI JIT backend spends way too much time compiling stuff, but this issue will be fixed soon.

  • f1.py: this is the classic PyPy JIT benchmark. It is just a function that does some computational intensive work with integers.
  • floatdemo.py: this is the same benchmark involving floating point numbers that has already been described in a previous blog post.
  • oodemo.py: this is just a microbenchmark doing object oriented stuff such as method calls and attribute access.
  • richards2.py: a modified version of the classic richards.py, with a warmup call before starting the real benchmark.

The benchmarks were run on a Windows machine with an Intel Pentium Dual Core E5200 2.5GHz and 2GB RAM, both with .NET (CLR 2.0) and Mono 2.4.2.3.

Because of a known mono bug, if you use a version older than 2.1 you need to pass the option -O=-branch to mono when running pypy-cli-jit, else it will just loop forever.

For comparison, we also run the same benchmarks with IronPython 2.0.1 and IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with mono.

So, here are the results (expressed in seconds) with Microsoft CLR:

Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.6   ipy 2.0.1 / pypy   ipy 2.6 / pypy
f1            0.028          0.145       0.136     5.18x              4.85x
floatdemo     0.671          0.765       0.812     1.14x              1.21x
oodemo        1.25           4.278       3.816     3.42x              3.05x
richards2     1228           442         670       0.36x              0.54x

And with Mono:

Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.0.1 / pypy
f1            0.042          0.695       16.54x
floatdemo     0.781          1.218       1.55x
oodemo        1.703          9.501       5.31x
richards2     720            862         1.20x

These results are very interesting: under the CLR, we are between 5x faster and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower than IronPython 2.6. On the other hand, on Mono we are consistently faster than IronPython, up to 16x. It is also interesting to note that pypy-cli-jit runs faster on the CLR than on Mono for all benchmarks except richards2.

I've not investigated yet, but I think that the culprit is the terrible behaviour of tail calls on the CLR: as I already wrote in another blog post, tail calls are ~10x slower than normal calls on the CLR, while being only ~2x slower than normal calls on Mono. richards2 is probably the benchmark that makes the most use of tail calls, thus explaining why we have a much better result on Mono than on the CLR.

The next step is probably to find an alternative implementation that does not use tail calls: this will probably also improve the time spent by the JIT compiler itself, which is not reported in the numbers above but is so far surely too high to be acceptable. Stay tuned.

Tuesday, October 6, 2009

PyPy's JIT now supports floats

Hello.

We've just merged the branch which adds float support to the x86 backend. This means that floating point operations are now super fast in PyPy's JIT. Let's have a look at an example, provided by Alex Gaynor and stolen from the Factor blog.

The original version of the benchmark was definitely tuned for the performance needs of CPython.

For running this on PyPy, I changed it to a slightly simpler version of the program, and I'll explain a few changes that I made, which reflect the current limitations of PyPy's JIT. They're not very deep and they might already be gone by the time you're reading this (a sketch of these tweaks follows the list):

  • Usage of __slots__. This is a bit ridiculous, but we spent quite a bit of time speeding up normal instances of new-style classes, which are very fast, yet ones with __slots__ are slower. To be fixed soon.
  • Usage of reduce. This one is even more obscure, but reduce is not perceived as a thing producing loops in a program. Moving to a pure-Python version of reduce fixes the problem.
  • Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op when JITted (the same as reading a local variable in C). However, multiplication is a simpler operation than the power operation.
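Here is a hedged sketch of the three tweaks (the class and variable names are invented, not the actual benchmark code):

# 1. plain new-style class instead of one using __slots__
class Vector(object):
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

# 2. a pure-Python reduce, so the loop is visible to the JIT
def my_reduce(function, sequence, initial):
    result = initial
    for item in sequence:
        result = function(result, item)
    return result

# 3. x * x instead of x ** 2
def squared_norm(v):
    return v.x * v.x + v.y * v.y + v.z * v.z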

I also included the original Java benchmark. Please note that the original Java version is similar to my modified one (not to the one specifically tuned for CPython).

The performance figures below are for n = 1,000,000, averaged over 10 runs:
  • CPython 2.6: 7.56s
  • CPython & psyco 2.6: 4.44s
  • PyPy: 1.63s
  • Java (JVM 1.6, client mode): 0.77s

and while JVM is much faster, it's very good that we can even compare :-)

Cheers
fijal

Sunday, September 27, 2009

First results of the JIT

Hi all,

Just a quick note to tell you that we are progressing on the JIT front. Here are the running times of the richards benchmark on my laptop:

  • 8.18 seconds with CPython 2.5.2;
  • 2.61 seconds with pypy-c-jit (3x faster than CPython);
  • 1.04 seconds if you ignore the time spent making assembler (8x faster than CPython);
  • 1.59 seconds on Psyco, for reference (5x faster than CPython).

Yes, as this table shows, we are spending 1.57 seconds in the JIT support code. That's too much -- even ridiculously so -- for anything but a long-running process. We are working on that :-)

If you want to build your own pypy-c-jit (for x86-32 only for now):

  • you need a Subversion checkout of trunk;
  • run pypy/translator/goal/translate.py with the -Ojit option;
  • as usual, wait a long time (and be sure you have more than 1GB of RAM).

For now pypy-c-jit spews a lot of debugging output and there are a few known examples where it crashes. As we like to repeat, however, it's a complete JIT: apart from the crashes (the bugs are probably in the JIT support code), it supports the whole Python language from the start -- in the sense of doing correct things. Future work includes Python-specific improvements, e.g. tweaking the data structures used to store Python objects so that they are more JIT-friendly.

EDIT: Oh yes, fijal reminds me that CPython 2.6 is 30% faster than CPython 2.5 on this benchmark (which is mostly my "fault", as I extracted a small part of PyPy and submitted it as a patch to CPython that works particularly well for examples like richards). It does not fundamentally change the fact that we are way faster though.

Thursday, September 24, 2009

PyPy sprint in Düsseldorf, 6 Nov - 13 Nov

The next PyPy sprint will be held in the Computer Science department of Heinrich-Heine Universität Düsseldorf from the 6th to the 13th of November 2009. This is a fully public sprint, everyone is welcome to join us.

Topics and goals

At the sprint we intend to work on the JIT generator in PyPy and on applying it to PyPy's Python interpreter.

The precise work that will be done is not fixed, as we don't know in which state the JIT will be in November. However, possible areas of work might include:

  • tweaking the interpreter/objspace to be more JIT-friendly, e.g. instance implementation code, call code
  • if there is interest starting non x86-32 JIT backends
  • trying out existing software to find features where the optimizations of the JIT could be improved
  • improving our benchmarking infrastructure

We will give special priority to topics that "non-core" people find interesting (as long as they are somehow JIT-related).

For an introduction of how our JIT-generation process works, please refer to our blog:

http://morepypy.blogspot.com/2009/03/jit-bit-of-look-inside.html

There is also a more dense academic paper about the subject:

http://codespeak.net/svn/pypy/extradoc/talk/icooolps2009/bolz-tracing-jit-final.pdf

Location

The sprint will take place in a seminar room of the computer science department. It is in the building 25.12 of the university campus. For travel instructions see

http://stups.cs.uni-duesseldorf.de/anreise/esbahn.php

Registration

If you'd like to come, please subscribe to the pypy-sprint mailing list, drop a note about your interests and post any questions. More organisational information will be sent to that list. We'll keep a list of attendees, which we'll update (you can also do so yourself if you have codespeak commit rights).

Tuesday, August 25, 2009

PyPy gets a new compiler

Today, I merged the parser-compiler branch, which I have been working on over the summer. It contained a total rewrite of both PyPy's Python parser and its AST compiler. PyPy's old parser was (in)famous internally for being complicated and slow (with many algorithmic complexities greater than O(n)). The new parser is an LL(1) parser, as simple as I could make it, like CPython's (though it doesn't share the hacks of CPython's parser).

The new compiler is based on the Abstract Syntax Trees (AST) that CPython 2.5 introduced, instead of PyPy's old AST, which was based on the compiler package's. This means that Python code running on PyPy will be able to use the same _ast interface as CPython. PyPy's _ast implementation supports AST features that CPython 2.6 added, including compiling modified AST to bytecode and executing it. In this rewrite, some more obscure compiler features were added, too. For example, jumps in bytecode can now be greater than 65535 bytes! (That's like an if statement with 7000 lines of code in the body.)

While the PyPy translation toolchain still has many obscure details and hacks, this merge completes the process of making the actual Python interpreter very clean. Hopefully, this will make adding new features much easier and make PyPy less frustrating to maintain as well as providing application level code with an improved AST interface!

Gothenburg JIT sprint report

Finally, we managed to squeeze in some time to write a report about what has been going on at the mysterious JIT sprint in Gothenburg, Sweden. The main goals of the sprint were to lay down the groundwork for getting more JIT work going in the next months and to get more of the PyPy developers up to speed with the current state of the JIT. One of the elements was to get better stability of the JIT, moving it slowly from being a prototype to actually working nicely on larger programs.

The secret goal of the sprint was to seek more speed, which Anto and Carl Friedrich did even during the break day:

We spent the first two days improving test coverage of the x86 backend and the optimizer. Now we have 100% coverage with unittests (modulo figleaf bugs), which does not mean anything, but it's better than before.

Then we spent quite some time improving the optimizer passes, so now we generate far less code than before the sprint, because a lot of it is optimized away. On the interpreter side, we marked more objects (like code objects) as immutable, so that reading fields from them can be constant-folded.

Another important optimization that we did is to remove consecutive reads of the same field from the same structure, if no code in between can change it.

Our JIT is a hybrid environment, where only hot loops of code are jitted and the rest stays being interpreted. We found out that the performance of the non-jitted part was suboptimal, because all accesses to python frames went through an extra layer of indirection. We removed this layer of indirection, in the case where the jit and the interpreter cannot access the same frame (which is the common case).

We also spent some time improving the performance of our x86 backend, by making it use more registers and by doing more advanced variable renaming at the end of loops. It seems that using more registers is not as much of a win as we hoped, because modern-day processors are much smarter than we thought.

The most mind-bending part was finding out why we lose performance by making the JIT see more of the interpreter. It took us two very frustrating days and 36 gray hairs to find out that from the JIT we call a different malloc function in the Boehm GC, which is by far slower than the version that we use from the interpreter. This meant that the more we jitted, the slower our code got, purely because of the mallocs.

Now that this is fixed, the world makes much more sense again.

A lot of the sprint's work is not directly measurable in the performance figures, but we did a lot of work that is necessary for performance to improve in the next weeks. After we have done a bit more work, we should be able to provide some performance figures for programs that are more realistic than just loops that count to ten million (which are very fast already :).

Now we're going to enjoy a couple of days off to recover from the sprint.

Bästa hälsningar,
Carl Friedrich, fijal

Friday, July 17, 2009

PyPy numeric experiments

Because PyPy will be presenting at the upcoming EuroSciPy conference, I have been playing recently with the idea of NumPy and PyPy integration. My idea is to integrate PyPy's JIT with NumPy, or at least a very basic subset of it. Time constraints make it impossible to hand-write a JIT compiler that understands NumPy. But given PyPy's architecture we actually have a JIT generator, so we don't need to write one :-)

Our JIT has shown that it can speed up small arithmetic examples significantly. What happens with something like NumPy?

I wrote a very minimal subset of NumPy in RPython, called micronumpy (only single-dimension int arrays that can only get and set items), and a benchmark against it. The point of this benchmark is to compare the performance of a builtin function (numpy.minimum) against the equivalent hand-written function, written in pure Python and compiled by our JIT.

The goal is to prove that it is possible to write algorithms in Python instead of C without loss of efficiency. Sure, we can write some functions (like minimum in the following example), but there is a whole universe of other ufuncs which would be cool to have in Python instead, assuming this could be done without a huge loss in efficiency.
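For concreteness, here is a sketch of the kind of pure-Python counterpart being compared against numpy.minimum; this is illustrative only, not the exact benchmark code:

def py_minimum(a, b, out):
    # element-wise minimum over single-dimension arrays
    i = 0
    while i < len(a):
        out[i] = a[i] if a[i] < b[i] else b[i]
        i += 1
    return out

# plain lists stand in here for the micronumpy int arrays
a = range(100000)
b = list(reversed(a))
out = [0] * 100000
py_minimum(a, b, out)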

Here are the results. This is comparing PyPy svn revision 66303 in the pyjitpl5 branch against python 2.6 with NumPy 1.2.1. The builtin numpy.minimum in PyPy is just a naive implementation in RPython, which is comparable to the speed of a naive implementation written in C (and thus a bit slower than the optimized version in NumPy):

NumPy (builtin function):              0.12s
PyPy's micronumpy (builtin function):  0.28s
CPython (pure Python):                 11s
PyPy with JIT (pure Python):           0.91s

As we can see, PyPy's JIT is slower than NumPy's optimized C version, but still much faster than CPython (12x).

Why is it slower? When you actually look at the assembler, it's pretty obvious that it's atrocious. There's a lot of speedup to be gained just from doing simple optimizations on the resulting assembler. There are also pretty obvious limitations, like the x86 backend not being able to emit opcodes for floats, or the x86_64 backend not being there. Those limitations are not fundamental in any sense and can be relatively straightforward to overcome. Therefore it seems we can get C-level speeds for pure-Python implementations of numeric algorithms using NumPy arrays in PyPy. I think it's an interesting perspective that Python has the potential of becoming less of a glue language and more of a real implementation language in the scientific field.

Cheers,
fijal

Thursday, July 16, 2009

ECOOP 2009

Last week (from the 6th to the 10th of July) Anto, Armin and I (Carl Friedrich) were in the magnificent city of Genova, Italy, at the ECOOP conference. In this blog post I want to give a (necessarily personal) account of what we did there.

Workshop days: ICOOOLPS

The first two days of the conference were the workshop days. On Monday we attended the ICOOOLPS workshop, (see the programme of the workshop). We had gotten two papers accepted at the workshop (one about layering PyPy's JIT on top of the CLR and one about the basic idea of PyPy's tracing JIT) and thus gave two presentations at the workshop, one was given by Anto, the other by me. Both went reasonably well, we got some positive feedback.

Nearly all the other talks were rather interesting as well. I particularly liked the one by Hans Schippers, who presented a machine model built on delegation called delMDSOC. The model is meant to implement most features that a language would need to make it possible to separate cross-cutting concerns. In the talk at ICOOOLPS he presented an extension to the model that adds concurrency support, using a combination of actors and coroutines. He then showed that the concurrency mechanisms of Java, Salsa (an extension of Java adding actors) and Io can be mapped to this model.

Furthermore there were two interesting invited talks, one by Andreas Gal (Mozilla), and one by Cliff Click (Azul Systems). Andreas explained how TraceMonkey works. This was very useful for me, because his talk was just before mine and I could thus kill most of my introduction about tracing JIT compilers and have more time for the really interesting stuff :-). Cliff talked about implementing other languages on top of the JVM and some of the pitfalls in getting them to perform well.

All in all, ICOOOLPS was a very enjoyable workshop, also with many interesting discussions.

On Tuesday there were more workshops, but also the PyPy tutorial, so I only went to a few talks of the COP workshop and spent the rest of the morning preparing the tutorial (see next section).

Tutorial

On Tuesday afternoon we gave a PyPy Tutorial, as part of the ECOOP summer school. The first lesson we learned was that (as opposed to a community conference) people don't necessarily want to actually take their laptop out and try stuff. We gave a slow walk-through about the full life-cycle of development of a dynamic language interpreter using PyPy's tool-chain: Starting from writing your interpreter in RPython, testing it on top of CPython to translating it to C, .NET or Java to actually adding hints to get a JIT inserted.

There were about seven people attending the tutorial, a couple of whom were very interested and were asking questions and discussing. Some of the discussions were even very technical, e.g. one about the details of our type-inference algorithm for RPython and why we cannot do a bottom-up analysis but have to use forward-propagation instead.

Jan Vitek of Purdue University told of some of the problems of the OVM project, which is (among other things) a Java implementation in Java (OVM also wants to support implementing VMs for other languages with it, if I understood correctly). He said that the project has essentially gotten too large and complicated, which means that it is very hard for new people to get into the project. While PyPy doesn't have some of the problems of a full Java implementation (e.g. right now our concurrency support is minimal) I definitely think that some of these risks apply to PyPy as well and we should find ways to improve the situation in this regard. Channeling Samuele: Somewhere inside the large lumbering blob of PyPy there is an elegant core trying to get out.

Main Conference

From Wednesday till Friday the main conference was happening. Many of the talks were not all that interesting for me, being quite Java centric. One talk that I liked a lot was "Making Sense of Large Heaps", which was presented by Nick Mitchell (IBM). He presented a tool called "Yeti" that can be used to analyze large heaps of Java programs. The tool uses some clever algorithms and heuristics to summarize the heap usage of data structures in intelligent ways to make it easier to find possible memory-wasters in a program. Nick also gave Anto and me a demo of the tool, where we tried to apply it to pypy-jvm (we found out that a fifth of the static data in there belongs to the parser/compiler :-( ).

On each of the days of the conference there was a keynote. I missed the one by Simon Peyton-Jones on Wednesday about type classes in Haskell. On Thursday, David Ungar was awarded the Dahl-Nygaard-Prize for his work on the Self programming language. Subsequently he gave a really inspiring keynote with the title "Self and Self: Whys and Wherefores" where he recollected Self's history, both on a technical as well as on a social level. Parts of the talk were snippets from the movies Self: The Movie and Alternate Reality Kit, both of which I highly recommend.

The keynote on Friday was by Cliff Click with the title "Java on 1000 Cores: Tales of Hardware/Software Co-design". He described the custom CPU architecture that Azul Systems has developed to run Java server applications on hundreds of cores. The talk mostly talked about the hardware, which I found very interesting (but some people didn't care for too much). Azul's CPU is essentially 54 in-order RISC cores in a single processor. The cores have a lot of extensions that make it easier to run Java on them, e.g. hardware read- and write-barriers, hardware-transactional-memory and hardware escape-detection (!).

In addition to the talks, there is of course always the hallway track (or coffee track) which is the track where you stand in the hallway and discuss with people. As usual, this was the most interesting part of the conference. One of those talks was Anto and me giving a PyPy demo to David Ungar. We had a very interesting discussion about VM implementation in general and the sort of debugging tools you need to write in particular. He liked PyPy a lot, which makes me very happy. He also liked the fact that I have actually read most Self papers :-).

Tuesday, June 23, 2009

EuroPython

EuroPython is coming. We have two 30-minute talks that we will present. In addition, the sprint takes place on the 29th of June (there will be no one from the team on the 28th of June), as well as on the 3rd and 4th of July.

JIT progress

In the last days I finally understood how to do virtualizables. Now the frame overhead is gone. This was done with the help of discussion with Samuele, porting ideas from PyPy's first JIT attempt.

This is of course work in progress, but it works in PyPy (modulo a few XXXs, but no bugs so far). The performance of the resulting code is quite good: even with Boehm (the GC that is easy to compile to but gives a slowish pypy-c), a long-running loop typically runs 50% faster than CPython. That's "baseline" speed, moreover: we will get better speed-ups by applying optimizations on the generated code. Doing so is in progress, but it suddenly became easier because that optimization phase no longer has to consider virtualizables -- they are now handled earlier.

Update: Virtualizables are basically a way to avoid frame overhead. The frame object is allocated and has a pointer, but the JIT is free to unpack its fields (for example the Python-level locals) and store them somewhere else (on the stack or in registers). Each external (out-of-JIT) access to a frame managed by the JIT needs to go via special accessors that can ask the JIT where those variables are.
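To make the idea a bit more concrete, here is a much-simplified conceptual model (not the real interface; all names are invented): while the JIT owns the frame, the locals live in a stand-in for registers, and the accessor asks the JIT for them:

class VirtualizableFrame(object):
    def __init__(self):
        self.locals = {}           # the "real" storage in the frame object
        self.jit_owned = False
        self.jit_storage = {}      # stands in for CPU registers / stack slots

    def get_local(self, name):
        if self.jit_owned:
            # ask the JIT where the variable currently lives
            return self.jit_storage[name]
        return self.locals[name]

    def leave_jit(self):
        # write everything back before the interpreter touches the frame
        self.locals.update(self.jit_storage)
        self.jit_owned = False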

Tuesday, June 16, 2009

News from the jit front

As usual, progress is going slower than predicted, but nevertheless we're working hard to make some progress.

We recently managed to make our nice GCs cooperate with our JIT. This is one point from our detailed plan. As of now, we have a JIT with GCs and no optimizations. It already speeds up some things, while slowing down others. The main reason for this is that the JIT generates assembler which is kind of ok, but it does not do the same level of optimizations gcc would do.

So the current status of the JIT is that it can produce assembler out of executed Python code (or out of any interpreter written in RPython, actually), but the results are not of high enough quality yet, because we're missing optimizations.

The current plan, as of now, looks as follows:

  • Improve the handling of GCs in JIT with inlining of malloc-fast paths, that should speed up things by a constant, not too big factor.
  • Write a simplified python interpreter, which will be a base for experiments and to make sure that our JIT does correct things with regard to optimizations. That would work as mid-level integration test.
  • Think about ways to inline loop-less python functions into their parent's loop.
  • Get rid of frame overhead (by virtualizables)
  • Measure, write benchmarks, publish
  • Profit

Cheers,
fijal

Saturday, May 16, 2009

ICOOOLPS Submissions

Both of the papers that people from the PyPy team submitted to ICOOOLPS have been accepted. They are:

  • "Faster than C#: efficient implementation of dynamic languages on .NET" (pdf1) by Armin, Anto and Davide Ancona, who is Anto's Ph.D. advisor
  • "Tracing the Meta-Level: PyPy’s Tracing JIT Compiler" (pdf2) by Carl Friedrich, Armin, Anto and Maciek

(the pdfs are obviously the submitted versions, not the final ones).

This year ICOOOLPS (Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems) is being held on July the 6th at ECOOP 2009 in Genova, Italy. Other than these two papers, Anto and Carl Friedrich will also present a PyPy tutorial, on July the 7th.

Thursday, April 30, 2009

4 weeks of GDB

Hello.

So, according to our JIT plan we're mostly done with point 1, that is, to provide a JIT that compiles Python code to assembler in the most horrible manner possible but doesn't break. That meant mostly 4 weeks of staring at GDB and megabytes of assembler generated from C code generated from Python code. The figure of 4 weeks proves that our approach is by far superior to that of Psyco, since Armin says it's "only 4 weeks" :-)

Right now, pypy compiled with JIT can run the whole CPython test suite without crashing, which means we're done with obvious bugs and the only ones waiting for us are really horrible. (Or they really don't exist. At least they should never be about obscure Python corner cases: they can only be in the 10'000 lines of relatively clear code that is our JIT generator.)

But... the fun thing is that we can actually concentrate on optimizations! So the next step is to provide a JIT that is correct *and* actually speeds up python. Stay tuned for more :-)

Cheers,
fijal, armin & benjamin

UPDATE: for those of you blessed with no knowledge of C, gdb stands for GNU debugger, a classic debugger for C. (It's also much more powerful than python debugger, pdb, which is kind of surprising).

Tuesday, April 28, 2009

1.1 final released

We just released PyPy 1.1 final. Not much changed since the beta, apart from some more fixed bugs. Have fun with it!

Tuesday, April 21, 2009

Roadmap for JIT

Hello.

First a disclaimer. This post is more about plans for the future than about current status. We usually try to write about things that we have done, because it's much, much easier to promise things than to actually make them happen, but I think it's important enough to have some sort of roadmap.

In recent months we came to the point where the 5th generation of the JIT prototype was working as nicely as, or even a bit more nicely than, the 1st one back in 2007. Someone might ask "so why did you spend all this time without going forward?". And indeed, we spent a lot of time moving sideways, but as posted, we also spent a lot of time doing some other things, which are important as well. The main advantage of the current JIT incarnation is that it is much, much simpler than the first one. Even I can comprehend it, which is much of an improvement :-)

So, the prototype is working and gives very nice speedups, in the range of 20-30x over CPython. We're pretty confident this prototype will work and will eventually produce a fast Python interpreter. So we decided that now we'll work towards turning the prototype into something stable and solid. This might sound easy, but in fact it's not. Having a stable assembler backend and optimizations that preserve semantics is not as easy as it might sound.

The current roadmap, as I see it, looks as follows:

  • Provide a JIT that does not speed things up, but produces assembler without optimizations turned on, is correct, and is able to run CPython's library tests on a nightly basis.
  • Introduce simple optimizations that should make the above JIT a bit faster than CPython. With optimizations disabled the JIT produces incredibly dumb assembler, which is slower than the corresponding C code, even with the removal of interpretation overhead (which is not very surprising).
  • Backport optimizations from JIT prototype, one by one, keeping an eye on how they perform and making sure they don't break anything.
  • Create new optimizations, like speeding up attribute access.
  • Profit.

This way, we can hopefully provide a working JIT which gives a fast Python interpreter, which is a bit harder to achieve than just a nice prototype.

Tell us what you think about this plan.

Cheers,
fijal & others.

Leysin Sprint Report

The Leysin sprint is nearing its end; as usual, here is an attempt at a summary of what we did.

Beautiful Leysin Landscape

Release Work

Large parts of the sprint were dedicated to fixing bugs. Since the easy bugs seem to have been fixed long ago, those were mostly very annoying and hard bugs. This work was supported by our buildbots, which we tried to get free of test-failures. This was worked on by nearly all participants of the sprint (Samuele, Armin, Anto, Niko, Anders, Christian, Carl Friedrich). One particularly annoying bug was the differences in the tracing events that PyPy produces (fixed by Anders, Samuele and Christian). Some details about larger tasks are in the sections below.

The work culminated in the beta released on Sunday.

Stackless

A large number of problems came from our stackless features, which do some advanced things and thus seem to contain advanced bugs. Samuele and Carl Friedrich spent some time fixing tasklet pickling and unpickling. This was achieved by supporting the (un)pickling of builtin code objects. In addition they fixed some bugs in the finalization of tasklets. This needs some care because the __del__ of a tasklet cannot run at arbitrary points in time, but only at safe points. This problem was a bit subtle to get right, and popped up nearly every morning of the sprint in form of a test failure.

Armin and Niko added a way to restrict the stack depth of the RPython-level stack. This can be useful when using stackless, because without it, it is possible to fill your whole heap with stack frames in the case of an infinite recursion. Then they went on to make stackless not segfault when threads are used at the same time, or when a callback from C library code is in progress. Instead you get a RuntimeError now, which is not good but better than a segfault.

Anto and Armin working on the JIT

Killing Features

During the sprint we discussed the fate of the LLVM and the JS backends. Both have not really been maintained for some time, and even partially untested (their tests were skipped). Also their usefulness appears to be limited. The JS backend is cool in principle, but has some serious limitations due to the fact that JavaScript is really a dynamic language, while RPython is rather static. This made it hard to use some features of JS from RPython, e.g. RPython does not support closures of any kind.

The LLVM backend had its own set of problems. For a long time it produced the fastest form of PyPy's Python interpreter, by first using the LLVM backend, applying the LLVM optimizations to the result, then using LLVM's C backend to produce C code, then applying GCC to the result :-). However, it is not clear that it is still useful to directly produce LLVM bitcode, since LLVM has rather good C frontends nowadays, with llvm-gcc and clang. It is likely that we will use LLVM in the future in our JIT (but that's another story, based on different code).

Therefore we decided to remove these two backends from SVN, which Samuele and Carl Friedrich did. They are not dead, only resting until somebody who is interested in maintaining them steps up.

Windows

One goal of the release is good Windows-support. Anders and Samuele set up a new windows buildbot which revealed a number of failures. Those were attacked by Anders, Samuele and Christian as well as by Amaury (who was not at the sprint, but thankfully did a lot of Windows work in the last months).

OS X

Christian, with some help from Samuele, tried to get translation working again under Mac OS X. This was a large mess, because of the different behaviour of some POSIX functionality in Leopard. It is still possible to get the old behaviour back, but whether that was enabled or not depended on a number of factors, such as which Python is used. Eventually they managed to successfully navigate that maze and produce something that almost works (there is still a problem remaining with OpenSSL).

Samuele and Carl Friedrich pretending to work on something

Documentation

The Friday of the sprint was declared to be a documentation day, where (nearly) no coding was allowed. This resulted in a newly structured and improved getting-started document (done by Carl Friedrich and Samuele, with some help from Niko) and a new document describing differences from CPython (Armin, Carl Friedrich), as well as various improvements to existing documents (everybody else). Armin undertook the Sisyphean task of listing all talks, papers and related stuff of the PyPy project.

Various Stuff

Java Backend Work

Niko and Anto worked on the JVM backend for a while. First they had to fix translation of the Python interpreter to Java. Then they tried to improve the performance of the Python interpreter when translated to Java. Mostly they did a lot of profiling to find performance bottlenecks. They managed to improve performance by 40% by overriding fillInStackTrace of the generated exception classes. Apart from that they found no simple-to-fix performance problems.

JIT Work

Armin gave a presentation about the current state of the JIT to the sprinters as well as Adrian Kuhn, Toon Verwaest and Camillo Bruni of the University of Bern who came to visit for one day. There was a bit of work on the JIT going on too; Armin and Anto tried to get closer to having a working JIT on top of the CLI.

Sunday, April 19, 2009

Beta for 1.1.0 released

Today we are releasing a beta of the upcoming PyPy 1.1 release. There are some Windows and OS X issues left that we would like to address between now and the final release but apart from this things should be working. We would appreciate feedback.

The PyPy development team.

PyPy 1.1: Compatibility & Consolidation

Welcome to the PyPy 1.1 release - the first release after the end of EU funding. This release focuses on making PyPy's Python interpreter more compatible with CPython (currently CPython 2.5) and on making the interpreter more stable and bug-free.

PyPy's Getting Started lives at:

http://codespeak.net/pypy/dist/pypy/doc/getting-started.html

Highlights of This Release

Other Changes

What is PyPy?

Technically, PyPy is both a Python interpreter implementation and an advanced compiler, or more precisely a framework for implementing dynamic languages and generating virtual machines for them.

The framework allows for alternative frontends and for alternative backends, currently C, Java and .NET. For our main target "C", we can "mix in" different garbage collectors and threading models, including micro-threads aka "Stackless". The inherent complexity that arises from this ambitious approach is mostly kept away from the Python interpreter implementation, our main frontend.

Socially, PyPy is a collaborative effort of many individuals working together in a distributed and sprint-driven way since 2003. PyPy would not have gotten as far as it has without the coding, feedback and general support from numerous people.

Have fun,

the PyPy release team, [in alphabetical order]

Amaury Forgeot d'Arc, Anders Hammerquist, Antonio Cuni, Armin Rigo, Carl Friedrich Bolz, Christian Tismer, Holger Krekel, Maciek Fijalkowski, Samuele Pedroni

and many others: http://codespeak.net/pypy/dist/pypy/doc/contributor.html

Wednesday, April 15, 2009

Leysin Sprint Started

The Leysin Sprint started today. The weather is great and the view is wonderful, as usual. Technically we are working on the remaining test failures of the nightly test runs and are generally trying to fix various long-postponed bugs. I will try to give more detailed reports as the sprint progresses.

Tuesday, April 7, 2009

Pycon videos are online

Hi.

We haven't yet written a full Pycon summary, but both of our talks are now online: PyPy status talk and python in a sandbox.

Update:

Slides are also available: PyPy status talk and Python in a sandbox.

Enjoy!
fijal & holger

Thursday, March 26, 2009

VM summit: nice to see friendly competition

So Google has launched the unladen swallow project with this first goal:

    Produce a version of Python at least 5x faster than CPython.

We discussed some details with Collin Winter, Jeffrey Yasskin and Thomas Wouters during the VM summit yesterday. We were a bit confused about the usage of the term JIT, because as far as we understood it, it's going to be upfront compilation into LLVM. In the past we have looked into LLVM – at one point PyPy used it extensively, but it wasn't clear how we could make good use of it. They are also considering changing to something other than LLVM. It's going to be interesting to see how this works out.

It's good to see friendly competition, and we think we should take up the challenge and see if we can produce faster pickling, and run 2to3 and Django faster than what they can come up with. We also talked to IronPython and Jython developers, and all agreed that some common benchmarks would be good. And maybe do weekly press releases about small speed increases? :)

The idea of the VM summit here in Chicago was to bring together implementors of various virtual machine languages. There were members of the communities of IronPython, CPython, GemStone's MagLev, Rubinius, Mozilla's TraceMonkey, Parrot, Sun's Da Vinci Machine, Microsoft's DLR, Jython and JRuby. Everybody got to talk for 5-10 minutes about their current status and challenges. It is clear that in such a short time you cannot begin to cover the complexities and architectures of the involved projects. But that wasn't too much of a problem, because for the rest of the day everybody freely and dynamically grouped around their issues of choice. We established some more personal contacts, and it was great to chat with people like Andreas Gal from the University of California, Irvine, who has a very similar idea about the JIT to ours. Actually, we could probably have mixed our two presentations and nobody would have noticed :-).

At the end of the presentation part, John Rose presented his slides. John is a HotSpot developer, and while not precisely a dynamic language implementor, he has a lot of experience in virtual machine implementation. It's very good to see the JVM being extended towards supporting dynamic-language-specific things, in order to be something more than just a good platform for Java. We'll probably have some extra meetup with him in the next few days.

cheers,
holger and fijal

Thursday, March 12, 2009

PyPy talk at OpenBossa 09

Yesterday I gave my PyPy status/mobile perspectives talk at OpenBossa, Nokia's developer conference for embedded platforms in Brazil. I found it a bit of a tough task to do that in 50 minutes. I had some 50, later more, developers attending the talk and was happy with the questions and the feedback. I guess it's a good sign if the number of people grows during a talk :) It was the first time I tried to work more with pictures, and I actually used some deviantART photos from Marikaz to mark section transitions. I summarize/highlight some key points here in this post.

After the intro and 2.5 compatibility status, I talked about our measurements of PyPy's Python on Nokia's N810 internet tablet. The best bit is that for almost all Python data structures PyPy has smaller memory representations than CPython. Particularly good are class instances, which often come in at 50% of CPython's sizes. Startup time is also often better and can be improved further. On the bad side are PyPy's quite large base interpreter size and its bytecode execution, which is often slower. In the talk I also outlined ideas for "perfect PYC files" for minimizing module import times and maximizing sharing across interpreter processes. I also briefly discussed the PyPy situation with extension modules and regarding C++ libs. Most of these ideas arose from sprint discussions last year. In the morning I also had a good talk with Stefan Seefeld about Boost Python and the new QT4 bindings. Maybe using Boost Python is also a good opportunity - but PyPy does not currently have a C-level or C++-level API.

In subsequent lunch discussions people agreed that PyPy has three main interesting areas currently:

  • the Python Just-In-Time Compiler
  • a virtualized, sandboxed Python interpreter
  • an efficient Python interpreter for small devices

I think our upcoming 1.1 release will be a good point in time for many people to look some more into PyPy. I hope we are crossing the chasm soon. It's been a while since the project started :) Getting some more sponsoring to sustain and increase our current efforts probably wouldn't hurt.

Now I am off to spend my last day in Recife, Brazil, fly back to Germany in the evening and then spend time preparing for Pycon 2009. And I guess I am going to enjoy some naturally cold air - at least my two jogging sessions at Brazilian beaches, at a sustained 30 degrees Celsius, were tough. I guess I shouldn't complain, though :)

It was great meeting all the Brazilian guys and the few women - I just had breakfast with Kate Alhola, a kernel hacker working on the new "Freemantle" graphical platform. Many thanks go to Marcio Marcedo and the Python team at INDT who invited me here. Hope to come again next year and eventually talk more about the Zone VM :)

If you are interested in some more not-so-PyPy-specific bits about the conference and what I experienced, you might head over to my tetamap blog.

holger

Tuesday, March 10, 2009

Good news everyone!

A quick update from the JIT front. As of yesterday, we're now able to translate a highly experimental Python interpreter that contains a JIT. It mostly crashes immediately, largely due to some unsupported operations in the assembler backend, but for a carefully crafted program we're able to get massive speedups. For something as complex as:

  i = 0
  while i < 10000000:
      i = i + 1

our JIT is about 20x faster than CPython. That's still about 3x slower than Psyco, but looking at the assembler code it's obvious that we can speed it up a lot. This is very good news, since we don't encode Python semantics in the JIT at all. The JIT is automatically generated from the Python interpreter source code. This means we should be able to expand it to handle more complex Python programs relatively quickly (interested assembler experts needed!).

This is actually the fifth incarnation of the JIT over the last two years. It's by far simpler and more promising than any of the previous approaches. Expect more details soon!

Cheers,
fijal

Thursday, March 5, 2009

JIT - a bit of look inside

The previous post about our JIT explained a bit from the 1000 km perspective how the tracing JIT would approach a language like Python.

I would like to step inside a bit and zoom in on some of its features that are already working. While probably not the most innovative, I think it's very nice to look at the way we work on the JIT and what tools we use.

The main cool thing is that you can work on and try the JIT (including trying it on the Python interpreter!) without even generating a single bit of assembler. How? Let's start with something very simple. Let's take a simple interpreter for language X.

Language X has 3 opcodes: CO_INCREASE, CO_DECREASE and CO_JUMP_BACK_3. CO_INCREASE increases the accumulator by one, CO_DECREASE decreases it by one, and CO_JUMP_BACK_3 jumps 3 opcodes back as long as the accumulator has not exceeded 100 (this is only there to keep a halting condition possible). The interpreter for language X looks like this::

    # JitDriver lives in PyPy's RPython support library (pypy.rlib.jit at the
    # time of writing); the opcode values below are arbitrary small integers,
    # added here just to make the snippet self-contained.
    from pypy.rlib.jit import JitDriver

    CO_INCREASE, CO_DECREASE, CO_JUMP_BACK_3 = 0, 1, 2

    jitdriver = JitDriver(greens = ['i'], reds = ['res', 'a'])
    code = [CO_INCREASE, CO_INCREASE, CO_INCREASE,
            CO_JUMP_BACK_3, CO_INCREASE, CO_DECREASE]

    def add(res, a):
        return res + a

    def sub(res, a):
        return res - a

    def main_interpreter_loop(a):
        i = 0
        res = 0
        c = len(code)
        while i < c:
            jitdriver.jit_merge_point(res=res, i=i, a=a)
            elem = code[i]
            if elem == CO_INCREASE:
                res = add(res, a)
            elif elem == CO_DECREASE:
                res = sub(res, a)
            else:
                if res > 100:
                    pass
                else:
                    i = i - 3
                    jitdriver.can_enter_jit(res=res, i=i, a=a)
                    continue
            i = i + 1
        return res

All very simple code, except for the jitdriver hints, which instruct the JIT how to behave (they are the equivalent of the ``add_to_position_key`` of the last blog post).
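As a hypothetical smoke test (not part of the original example), the interpreter can also be run untranslated, as plain Python; the jitdriver hints should simply do nothing in that case. The expected value below is worked out from the bytecode above:

    # Each pass over the first four opcodes adds 3 to the accumulator until it
    # exceeds 100 (reaching 102); the trailing CO_INCREASE and CO_DECREASE
    # then cancel each other out.
    print main_interpreter_loop(1)     # expected result: 102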

Let's look at how this code is processed. This will also give a glimpse of how we work on this code. This particular piece can be found on a branch in pypy/jit/metainterp/test/test_loop.py and can be run with ./test_all.py jit/metainterp/test/test_loop.py -k test_example -s --view from the pypy directory. The -s option lets you see the debugging output, while --view will show you some graphs. So, let's look at the graphs in order:

And the same picture with a bit of zoom for the first block:

This is the call graph of the interpreter loop, nothing magic so far. This is the intermediate representation that the translation toolchain takes as input. If you look around you can follow how the opcodes are dispatched (with a chain of ifs) and the helpers called. The next graph is very boring, because it's a slightly lower-level representation of the same thing (you exit the viewer with q or Escape, btw :).

When we exit the graph viewer, we can see the trace generated by interpreting this graph with the given bytecode (the variable code in the paste above). It's something like:


        [compiler] ENTER
        [runner:cpu]    call__4 [(''), * GCREF hidden, 0] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [9, 1] -> 10
        [runner:cpu]    int_add [0, 1] -> 1
        [runner:cpu]    int_lt [1, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 1] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [10, 1] -> 11
        [runner:cpu]    int_add [1, 1] -> 2
        [runner:cpu]    int_lt [2, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 2] -> 0
        [runner:cpu]    int_eq [0, 0] -> True
        [runner:cpu]    int_add [11, 1] -> 12
        [runner:cpu]    int_add [2, 1] -> 3
        [runner:cpu]    int_lt [3, 6] -> True
        [runner:cpu]    call__4 [(''), * GCREF hidden, 3] -> 1
        [runner:cpu]    int_eq [1, 0] -> False
        [runner:cpu]    int_eq [1, 2] -> False
        [runner:cpu]    int_gt [12, 100] -> False
        [runner:cpu]    int_sub [3, 3] -> 0
        [compiler] LEAVE

It's entering the JIT, doing some primitive operations for bytecode dispatching and repeating the loop. Note that at the end of the interpreted loop (not to be confused with the interpreter loop), we see int_sub [3, 3], which resets the bytecode position to the beginning. At this point the JIT (instructed by the can_enter_jit hint) notices that all green variables are the same (here only i), hence we can compile an efficient loop from this point.

The loop contains 3 additions and a check (whether the accumulator has gone over 100), exactly the same as our interpreted program would do, but completely without interpretation overhead!
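To make that concrete, here is a rough sketch, written as plain Python rather than as actual JIT output, of what the compiled loop effectively does; it is an illustration derived from the trace above, not something produced by the tool:

    # Residual loop: three additions and the CO_JUMP_BACK_3 check, with all
    # bytecode dispatching removed.
    while True:
        res = res + a          # CO_INCREASE
        res = res + a          # CO_INCREASE
        res = res + a          # CO_INCREASE
        if res > 100:          # CO_JUMP_BACK_3's condition
            break              # leave the compiled loop once res exceeds 100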

As you might have noticed, there is no assembler involved so far. All of this instruction execution is done directly, in pure Python. In fact, the code for executing instructions is located in jit/backend/llgraph, which directly interprets the instructions. This is far simpler (and easier to debug) than x86 assembler.

And this is basically it: a very simple interpreter and a JIT for it. Of course we can actually generate assembler for that. The other missing piece is optimizing the generated graphs. While for this example we're done once the interpretation overhead is removed, with more complex examples it's important to further optimize the traces. Hopefully this, and how we actually generate assembler, will be topics for the next blog posts.

Cheers,
fijal