PyPy Status Blog: 2008

Friday, December 19, 2008

Pycon 2009

Hello.

Both of our PyPy talks has been accepted for Pycon US 2009. Although both are somehow related to PyPy, they're vastly different in topics, attitude and target audience.

The first one is a classic PyPy status talk - we'll mostly talk about our achievements from the last year (readers of this blog are aware of most, but not all :) as well as some general introduction and plans for the future.

The second one is about PyPy's sandboxing features. This is in my opinion a very underestimated feature, also by us, because it's not really well advertised or documented. The main purpose of the talk is to present to the general public how this works and how to use it. Hopefully we will get to work and publish about this a bit more ahead of Pycon already. Unlike Zope's Restricted Python, it provides you with the full python language, inside a fully virtualized sandbox, controlled from an external process by a custom security policy. Stay tuned for more :-)

See you at Pycon 2009!

Cheers,
fijal and holger

Posted by Maciej Fijalkowski at 15:30 3 Comments

Sunday, December 7, 2008

Porting the JIT to CLI (part 3)

In my two previous posts, we talked about the PyPy JIT generator, seeing that it can produce huge speedups and how its backend-independent frontend works.

In this post, we will look closer at the internals of the CLI JIT backend; in particular, we will see how we work around some serious limitations of the platform, and why these workarounds didn't have any serious impact on the performances of our toy virtual machine.

Graphs, blocks, links

One of the core aspect of PyPy translator is the concept of flow graph: a flow graph is a data structure that represents the code we are operating on. It is composed by a set of basic blocks, each block containing a sequence of operations; blocks are connected together by links, and each link can carry a variable number of arguments whose value is passed to the target block. In case a block contains more than one outgoing links, the one to follow is selected by looking at the value of a designated variable (the exitswitch), thus making possible to implement conditional jumps. To have a more complete description of the flow graphs model, check the documentation.

As we saw in the previous post, the generated JIT compiler makes heavy use of flexswitches to generate efficient code, continuously intermixing JIT-compile time and runtime.

In terms of graphs, we can think of a flexswitch as a special block whose links change over time. In particular, adding a new case to the flexswitch is equivalent to create a link whose target is a new block where the just generated code starts. Thus, the graphs grows over the time, as showed by the following images:

In the images above, the block containing the flexswitch is colored in cyan. In the first picture, there is only one block connected to the flexswitch: this block contains the code to restart the JIT compilation. The second picture shows the graph after the first case has been added: you can clearly see that a new block has been created and attached to the flexswitch. Finally, the third picture shows the graph after a while, with a lot of new blocks attached.

Translate graphs to CLI

Conceptually, the goal of the CLI JIT backend is to express these graphs in terms of CLI bytecode.

Translating the single block is easy, as it is just a list of sequential operation, and it's straightforward to map each operation to the equivalent CLI opcode or to a call to a helper method. Moreover, we need a way to express links between the various basic blocks: if the links are known in advance, render them is as easy as emitting a (potentially conditional) jump to the target block. Thus, we won't discuss this part in detail, as it is quite straightforward.

The hard part is how to implement flexswitches: at the time when we are emitting the code, some of the blocks of this growable graph don't even exist: how can we make a jump to a non existent block of code? For backends that emit assembly code, it is rather easy: when they need to add a new case to the flexswitch, they can just patch the existing code to insert a jump to a newly allocated area of the memory, where the new code is being generated in.

For CLI this approach is not feasible, as the VM will never allow us to modify existing code. Thus, we need to think of a different approach.

Graphs and methods

In .NET, the basic unit of compilation is the method: the only way to execute some bytecode is to wrap it into a method. Moreover, it is not possible to execute a method until it has been completed, and after this point it is no longer possible to add new code.

Because of all these constraints we cannot simply map each graph to its own method, since we saw that our graphs can grow after they have already been executed few times.

Hence, we need to distinguish between the two concepts:

a graph is the logical unit of code as seen by the JIT compiler: concretely, the CLI JIT backend renders it as one or more methods;

a method is a collection of basic blocks; each method has the so called parent graph, i.e. the graph its blocks logically belongs to.

The first method of a graph is called main method (which has nothing to do with the Main static methods found in .exe files); other methods are called children methods.

When we want to add a new case to the flexswitch, we create a method containing all the new code; then we wrap the method inside a delegate (the .NET equivalent of a function pointer) and pass it to the flexswitch, so that it can later invoke it.

The hard bit: non-local links

Using this approach, after a while the blocks of our original graph are scattered over a lot of different methods; however, there are no constraints about how these blocks can be linked together, so it happens to have links between blocks which are not in the same method. In the following, we will refer to them as non-local links.

If the non-local block we want to jump to happens to be at the beginning of its containing method, it is enough to invoke the method; but, what if we want to jump somewhere in the middle? What we really want is to produce a method which has multiple entry-points; again, doing it in assembly would be trivial, but the virtual machine does not provide any support for it, so we need a work around.

Each method in a graph is assigned an unique 16 bit method id; each block in a method is assigned a progressive 16 bit block number. From this two numbers, we can compute the block id as an unsigned integer, by storing the method id in the first 16 bits and the block number in the second 16 bits. By construction, the block id is guaranteed to be unique in the graph.

The following picture shows a graph composed of three methods; the id of each method is shown in red, while the block ids are shown in red (for the method id part) and black (for the block number part). The graph contains three non-local links; in particular, note the link between blocks 0x00020001 and 0x00010001 which connects two block that resides in different methods.

Every method contains a special dispatch block, (not shown in the picture above) whose goal is to jump to the specified block number inside the method itself. The first argument of a child method is always a block id; when the method starts, it immediately jumps to the dispatch block, and thus to the desired block.

For example, suppose to have a method which contains 3 blocks numbered 0, 1, 2; here is how its dispatch blocks looks like; for simplicity it is shown as C# code, but it is actually generated as IL bytecode:

// dispatch block
int methodid = (blockid & 0xFFFF0000) >> 16); // take the first 16 bits
int blocknum = blockid && 0x0000FFFF;         // take the second 16 bits

if (methodid != MY_METHOD_ID) {
// jump_to_unknown block
...
}

switch(blocknum) {
case 0:
goto block0;
case 1:
goto block1;
case 2:
goto block2;
default:
throw new Exception("Invalid block id");
}

Whenever we want to jump to a non-local block, it is enough to store the block id in the appropriate variable and jump to the dispatch block. If the block resides in a different method, the jump_to_unknown block is entered; this special block is implemented differently by the main method and the child methods, as we will see soon.

Each time a new method is added to the graph, we build a delegate for it, and store it in a special array called method_map; since we assign the method id sequentially starting from 0, we are sure that to fetch the method whose id is n we can simply load the n-th element of the array.

The jump_to_unknown block of the main method uses this array to select the right method, and calls it (FlexSwitchCase is the type of delegates for all children methods):

// jump_to_unknown block of the main method
FlexSwitchCase meth = method_map[methodid];
blockid = meth(blockid, ...); // execute the method
goto dispatch_block;

Each child method returns a block id specifying the next block to jump to; after its execution, we assign the return value to the blockid variable, and jump again to the dispatch block, which will jump again to the appropriate block.

Keeping this in mind, it is straightforward to implement the jump_to_unknown block of children methods: it is enough to return the target block id to the caller, and let its dispatch loop do the right thing. If the caller is also a child method, it will return it again, until we reach the dispatch loop of the main method, which will finally do the jump. In theory, we could implement things differently and jumping directly from a child method to another one, but in that case the call stack could grows indefinitely in case of a tight loop between two blocks residing in different methods.

To implement the dispatch block we can exploit the switch opcode of the CLI; if the .NET JIT is smart enough, it can render it using an indirect jump; overall, jumping to a non-local block consists of an indirect function call (by invoking the delegate) plus an indirect jump (by executing the switch opcode); even if this is more costly than a simple direct jump, we will see in the next section that this not the main source of overhead when following a non-local link.

Obviously, the slow dispatching logic is needed only when we want to jump to a non-local block; if the target block happens to reside in the same method as the current one, we can directly jump to it, completely removing the overhead.

Moreover, the dispatch blocks are emitted only if needed, i.e. if the parent graph contains at least one flexswitch; graphs without flexswitches are rendered in the obvious way, by making one method per graph.

The slow bit: passing arguments

Jumping to the correct block is not enough to follow a link: as we said before, each link carries a set of arguments to be passed from the source to the target block. As usual, passing arguments across local links is easy, as we can just use local variables to hold their values; on the other hand, non-local links make things more complex.

The only way to jump to a block is to invoke its containing method, so the first solution that comes to mind is to specify its input arguments as parameter of the method; however, each block has potentially a different number (and different types) of input arguments than every other block, so we need to think of something else.

An alternative solution could be to compute the union of the sets of input arguments of all the blocks in the method, and use this set as a signature for the method; this way, there would be enough space to specify the input arguments for every block we might want to jump to, each block ignoring the exceeding unused parameters.

Unfortunately, all the children methods must have the very same signature, as they are all called from the same calling site in the dispatch block of the main method. Since the union of the set of input arguments (and hence the computed signature) varies from method to method, this solution cannot work.

We might think to determine the signature by computing the union of input arguments of all blocks in the graph; this way, all the children methods would have the same signature. But as we said above, the graph grows new blocks at runtime, so we cannot determine in advance which set of input arguments we will need.

To solve the problem we need a way to pass a variable number of arguments without knowing in advance neither their number nor their types. Thus, we use an instance of this class:

public class InputArgs {
public int[] ints;
public float[] floats;
public object[] objs;
...
}

Since the fields are arrays, they can grow as needed to contain any number of arguments; arguments whose type is primitive are stored in the ints or floats array, depending on their type; arguments whose type is a reference type are stored in the objs array: it's up to each block to cast each argument back to the needed type.

This solution impose a huge overhead on both writing and reading arguments:

when writing, we need to make sure that the arrays are big enough to contains all the arguments we need; if not, we need to allocate a bigger array. Moreover, for each argument we store into the array the virtual machine performs a bound-check, even if we know the index will never be out of bounds (because we checked the size of the array in advance);

when reading, the same bound-check is performed for each argument read; moreover, for each value read from the objs array we need to insert a downcast.

To mitigate the performance drop, we avoid to allocate a new InputArgs object each time we do a non-local jump; instead, we preallocate one at the beginning of the main method, and reuse it all the time.

Our benchmarks show that passing arguments in arrays is about 10 times slower than passing them as real parameter of a method. Unfortunately, we couldn't come up with anything better.

Implement flexswitches

Now, we can exploit all this machinery to implement flexswitches, as this is our ultimate goal. As described above, the point is to be able to add new cases at runtime, each case represented as a delegate. Here is an excerpt of the C# class that implements a flexswitch that switches over an integer value:

public class IntLowLevelFlexSwitch:
{
public uint default_blockid = 0xFFFFFFFF;
public int numcases = 0;
public int[] values = new int[4];
public FlexSwitchCase[] cases = new FlexSwitchCase[4];

public void add_case(int value, FlexSwitchCase c)
{
...
}

public uint execute(int value, InputArgs args)
{
for(int i=0; i<numcases; i++)
if (values[i] == value) {
 return cases[i](0, args);
}
return default_blockid;
}
}

For each case, we store both the triggering value and the corresponding delegate; the add_case method takes care to append value and c to the values and cases arrays, respectively (and resize them if necessary). The interesting bit is the execute method: it takes a value and a set of input arguments to be passed across the link and jumps to the right block by performing a linear search in the values array.

As shown by previous sections, the first argument of a FlexSwitchCase is the block id to jump to; since when we go through a flexswitch we always want to jump to the first block of the method, we pass the special value 0 as a block id, which precisely means jump to the first block. This little optimization let us not to have to explicitly store the block id for the first block of all the cases.

The value returned by execute is the next block id to jump to; if the value is not found in the values array, we return the default_blockid, whose value has been set before by the JIT compiler; default_blockid usually points to a block containing code to restart the JIT compiler again; when the JIT compiler restarts, it emits more code for the missing case, then calls add_case on the flexswitch; from now on, the new blocks are wired into the existing graph, and we finally managed to implement growable graphs.

Performances

As we saw, implementing growable graphs for CLI is a pain, as the virtual machine offers very little support, so we need an incredible amount of workarounds. Moreover, the code generated is much worse than what an assembly backend could produce, and the cost of following a non-local link is very high compared to local links.

However, our first blog post showed that we still get very good performances; how is it possible?

As usual in computer science, most of the time of a running program in spent in a tiny fraction of the code; our benchmark is no exception, and the vast majority of the time is spent in the inner loop that multiplies numbers; the graph is built in such a way that all the blocks that are part of the inner loop reside in the same method, so that all links inside are local (and fast).

Flexswitches and non-local links play a key role to select the right specialized implementation of the inner loop, but once it is selected they are not executed anymore until we have finished the computation.

It is still unclear how things will look like when we will compile the full Python language instead of a toy one; depending on the code, it could be possible to have non-local links inside the inner loop, thus making performance much worse.

Alternative implementations

Before implementing the solution described here, we carefully studied a lot of possible alternatives, but all of them either didn't work because of a limitation of the virtual machine or they could work but with terrible performances.

In particular, in theory it is possible to implement non-local links using tail calls, by putting each block in its own method and doing a tail call instead of a jump; this would also solve the problem of how to pass arguments, as each method could have its own signature matching the input args of the block. I would like to explain this solution in a more detailed way as I think it's really elegant and nice, but since this post is already too long, I'll stop here :-).

In theory, if the .NET JIT were smart enough it could inline and optimize away the tail calls (or at least many of those) and give us very efficient code. However, one benchmark I wrote shows that tail calls are up to 10 times slower (!!!) than normal calls, thus making impractical to use them for our purposes.

Conclusion

Despite the complexity of the implementation, our result are extremely good; the speedup we got is impressive, and it proves that PyPy's approach to JIT compiler can work well also on top of object oriented virtual machines like .NET or the JVM.

Generating bytecode for those machine at runtime is not a new idea; Jython, IronPython, JRuby and other languages have been doing this for years. However, Jython and IronPython do only a simple "static" translation, which doesn't take advantage of the informations gathered at runtime to generate better, faster and specialized code. Recently, JRuby grew a new strategy to JIT-compile only hotspots, taking advantage of some informations gathered while interpreting the code; this is still a "one-shot" compilation, where the compiled code does not change over time.

To my knowledge, PyPy brings the first example of a language which implements a truly JIT compiler on top of the underlying JIT compiler of the virtual machine, emitting bytecode that changes and adapts over the time. If someone knows other languages doing that, I would really like to know more.

Being so innovative, the problem of this approach is that the current virtual machines are not designed to support it in a native way, and this forces us to put a lot of workarounds that slow down the generated code. The hope is that in the future the virtual machines will grow features that help us to generate such kind of code. The experimental Da Vinci VM seems to go in the right direction, so it is possible that in the future I will try to write a JIT backend for it.

At the moment, the CLI JIT backend is almost complete, and all the hardest problems seems to be solved; the next step is to fix all the remaining bugs and implement some minor feature that it's still missing, then try to apply it to the full Python language and see what is the outcome.

In my two previous posts, we talked about the PyPy JIT generator, seeing that it can produce huge speedups and how its backend-independent frontend works.

Graphs, blocks, links

As we saw in the previous post, the generated JIT compiler makes heavy use of flexswitches to generate efficient code, continuously intermixing JIT-compile time and runtime.

Translate graphs to CLI

Conceptually, the goal of the CLI JIT backend is to express these graphs in terms of CLI bytecode.

For CLI this approach is not feasible, as the VM will never allow us to modify existing code. Thus, we need to think of a different approach.

Graphs and methods

Because of all these constraints we cannot simply map each graph to its own method, since we saw that our graphs can grow after they have already been executed few times.

Hence, we need to distinguish between the two concepts:

a graph is the logical unit of code as seen by the JIT compiler: concretely, the CLI JIT backend renders it as one or more methods;

a method is a collection of basic blocks; each method has the so called parent graph, i.e. the graph its blocks logically belongs to.

The first method of a graph is called main method (which has nothing to do with the Main static methods found in .exe files); other methods are called children methods.

The hard bit: non-local links

// dispatch block
int methodid = (blockid & 0xFFFF0000) >> 16); // take the first 16 bits
int blocknum = blockid && 0x0000FFFF;         // take the second 16 bits

if (methodid != MY_METHOD_ID) {
// jump_to_unknown block
...
}

switch(blocknum) {
case 0:
goto block0;
case 1:
goto block1;
case 2:
goto block2;
default:
throw new Exception("Invalid block id");
}

The jump_to_unknown block of the main method uses this array to select the right method, and calls it (FlexSwitchCase is the type of delegates for all children methods):

// jump_to_unknown block of the main method
FlexSwitchCase meth = method_map[methodid];
blockid = meth(blockid, ...); // execute the method
goto dispatch_block;

The slow bit: passing arguments

To solve the problem we need a way to pass a variable number of arguments without knowing in advance neither their number nor their types. Thus, we use an instance of this class:

public class InputArgs {
public int[] ints;
public float[] floats;
public object[] objs;
...
}

This solution impose a huge overhead on both writing and reading arguments:

when writing, we need to make sure that the arrays are big enough to contains all the arguments we need; if not, we need to allocate a bigger array. Moreover, for each argument we store into the array the virtual machine performs a bound-check, even if we know the index will never be out of bounds (because we checked the size of the array in advance);

when reading, the same bound-check is performed for each argument read; moreover, for each value read from the objs array we need to insert a downcast.

Our benchmarks show that passing arguments in arrays is about 10 times slower than passing them as real parameter of a method. Unfortunately, we couldn't come up with anything better.

Implement flexswitches

public class IntLowLevelFlexSwitch:
{
public uint default_blockid = 0xFFFFFFFF;
public int numcases = 0;
public int[] values = new int[4];
public FlexSwitchCase[] cases = new FlexSwitchCase[4];

public void add_case(int value, FlexSwitchCase c)
{
...
}

public uint execute(int value, InputArgs args)
{
for(int i=0; i<numcases; i++)
if (values[i] == value) {
 return cases[i](0, args);
}
return default_blockid;
}
}

Performances

However, our first blog post showed that we still get very good performances; how is it possible?

Alternative implementations

Conclusion

Posted by Antonio Cuni at 14:39 11 Comments

Friday, November 7, 2008

Porting the JIT to CLI (part 2)

In my previous post, we saw that PyPy JIT generator can produce huge speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of PyPy JIT, to see how it manages to do so. Note that this is a very high level overview of how the JIT works, and applies to all backends. Then, in the third post of this series, we will look closer at the CLI JIT backend, seeing how it works around some .NET limitations and how the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a JIT of our own: so, instead of passing nights writing a JIT, we pass years coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job, (perhaps this will be the subject of another blog post), but how the generated JIT works.

There are values that, if known at compile-time (i.e., when the JIT compiler runs), let the JIT to produce very efficient code. In a dynamic language, types are the primary example: for instance, suppose you are a compiler and you have to compile to following Python function:

def mysum(a):
  return a + 1

At compile time, you don't have any knowledge about the type of the parameter: it could be integer, float, an user defined object, etc. In this situation, the only safe choice is to emit code which does the usual, slow, full lookup to know how to perform the operations.

On the other hand, suppose that you knew in advance that the parameter is an integer: this time, you could emit code that exploits this extra knowledge, by performing directly a fast integer addition.

The idea behind PyPy JIT is that if you don't have enough knowledge to generate efficient code, you stop compiling and wait until you know exactly what you need. Concretely, you emit code that runs until the point where you stopped the compilation, then it triggers a special procedure that restarts the compiler. This time the JIT compiler knows everything you need, because you can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces something like this pseudo-code:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      default: continue_compilation(a_type, <position>);
  }
}

If you call mysum(41), the execution goes in the default branch of the switch, thus calling continue_compilation: its job is to restart the JIT compiler, which now can emit fast code because it knows the exact type of a; then, it modifies the original mysum_compiled function, in order to make it executing the newly generated code the next time it encounters an integer at that point:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      PyInteger: return new PyInteger(a.value+1); // fast path!
      default: continue_compilation(a_type, <position>);
  }
}

From now on, every time we call mysum with an integer argument, the JIT compiler is not called anymore and the fast path is directly executed; if we happen to call mysum with a float arguments, the switch goes again in the default branch, and the JIT compiler is started once more to produce fast code also for this case. What happens in practice is that compile-time and runtime are continuously intermixed, until the switches are stable enough and the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called flexswitch, and it's one of the most important concept of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate efficient code and which aren't? Unfortunately it can't, or at least our JIT generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by annotating the variables on which we want the JIT to stop until it knows the actual values; this is done by using particular hints, called promote and promote_class; variables annotated with such hints are said to be promoted. If something is promoted, a flexswitch is used to gain information about it, as seen in the last section.

For an example, let's look at an excerpt from main dispatch loop of the tlc virtual machine:

elif opcode == ADD:
  a, b = stack.pop(), stack.pop()
  hint(a, promote_class=True)
  hint(b, promote_class=True)
  stack.append(b.add(a))

This the implementation of the ADD opcode: first, it pops two values from the stack; then, it computes the result; finally, it push the result to the stack again. In between, both the classes of a and b have been promoted: this means that when the JIT emits the code for b.add(a), it knows exactly what is happening: if it sees that both are instances of the IntObj class, it inlines the method call and emits a fast integer addition instead.

Virtuals

The other important concept of the JIT is the presence of virtual structures, virtual lists, and virtual dictionaries. Again, I'm not going to explain in depth how they work, but only why they are so important for generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really need to do it, e.g. because they are being passed as an argument to some external function. Instead, we store all the informations we need as local variables; e.g., in the case of a virtual structure, we create as many local variables as the number of its fields: if the structure escapes the local scope, we force it to a real object, by allocating memory on the heap and initializing it after the current value of the local variables.

This technique allows the JIT to avoid the allocation of many temporary objects that hold intermediate results; consider for example the following Python loop:

result = 0
for i in range(N):
  result += i
return result

Without the JIT, at each iteration, a new int object is created and bound to the result variable, while the previous one is discarded and not needed anymore. By combining virtuals and promotion, the JIT can emit code that does the whole computation locally, and allocates a real object only at the end, when it escapes from the local scope because it is returned from the function.

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs emit multiple versions of each chunk of code: each version is specialized and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches) is almost always negligible compared to how much time you save by running the fast version instead of the more-general-but-slow one. Moreover, each specialized version knows the exact shape of the objects it's dealing with, so they can be virtualized to make the generated code even more efficient.

At the end, the actual code generation is done by one of the JIT backends: the backends exploit all the knowledge gathered by the previous steps to produce highly efficient code, but this will be the subject of the next blog post.

In my previous post, we saw that PyPy JIT generator can produce huge speedups when applied to the tlc toy language.

PyPy JIT for dummies

I'm not going to explain how the JIT generator does its job, (perhaps this will be the subject of another blog post), but how the generated JIT works.

def mysum(a):
  return a + 1

Let's see an example: the first time the JIT compiles mysum, it produces something like this pseudo-code:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      default: continue_compilation(a_type, <position>);
  }
}

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      PyInteger: return new PyInteger(a.value+1); // fast path!
      default: continue_compilation(a_type, <position>);
  }
}

In PyPy jargon, this kind of "growable switch" is called flexswitch, and it's one of the most important concept of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate efficient code and which aren't? Unfortunately it can't, or at least our JIT generator is not smart enough at the moment.

For an example, let's look at an excerpt from main dispatch loop of the tlc virtual machine:

elif opcode == ADD:
  a, b = stack.pop(), stack.pop()
  hint(a, promote_class=True)
  hint(b, promote_class=True)
  stack.append(b.add(a))

Virtuals

This technique allows the JIT to avoid the allocation of many temporary objects that hold intermediate results; consider for example the following Python loop:

result = 0
for i in range(N):
  result += i
return result

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs emit multiple versions of each chunk of code: each version is specialized and optimized for one particular case.

Posted by Antonio Cuni at 12:03 6 Comments

Tuesday, November 4, 2008

Porting the JIT to CLI (part 1)

As the readers of this blog already know, I have been working on the CLI JIT backend for some months: last Friday, it reached an important milestone, as it is now able to produce huge speedups for a little dynamic language. To know how huge the speedup is, read on :-).

The goal of PyPy JIT generator is to take an interpreter and, with the help of few annotations, automatically generate a JIT compiler for it. In this post, we will talk about the tlc virtual machine: while tlc it is just a toy language, it contains some features that make it an interesting target for our JIT generator.

The `tlc` virtual machine

tlc is executed by a stack based, dynamically typed virtual machine (for those who knows a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e. lisp-like pairs of objects).

As the VM is very simple, it provides only few opcodes:

opcodes to manipulate the stack, like PUSH, POP, etc.

integer operations, like ADD, MUL, all the comparisons, etc.: these operations can only be applied to integers;

list operations, like CONS, CAR, CDR: these operations can only be applied to lists;

other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities with Python (though on a smaller scale, of course):

it has to do type-checks at runtime before doing most of the operations;

every time you do an arithmetic operation, it has to unbox the operand, do the computation, and the box the result again.

This means that even if you have a program which only uses integers, you are paying a lot of overhead.

To know more about this toy VM, look at its source code: the interesting bits are the classes used to represent objects, and the interp_eval function, which contains the main loop of the virtual machine. As you can see, the implementation is quite straightforward; all the hint calls you see are the special annotations needed by the JIT generator to produce better code.

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, checkout a fresh copy of the oo-jit branch:

$ svn co http://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM with the CLI backend and JIT enabled:

$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlc
...
lot of texts
...

If everything went OK, you now have a targettlc-cli executable, which accepts two arguments: the name of the file containing the tlc program we want to run, and an integer to be passed to it.

Luckily, in the same directory we have a factorial.tlc file that contains the bytecode for a function that -- guess? -- computes the factorial of a given integer; let's try it:

$ ./targettlc-cli factorial.tlc 5
Non jitted:    120 (0.009371 seconds)
Warmup jitted: 120 (0.208954 seconds)
Warmed jitted: 120 (0.000323999999999991 seconds)

Cool, it seems that the result was computed correcly :-). As you can see from the output, we ran the program three times:

by plain interpretation, without any jitting;

with the jit enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code;

again with the jit enabled, but this time the compilation has already been done, so we are actually measuring how good is the code we produced.

So, it's time to run a benchmark: let's try to compute the factorial of a very big number; the result will be 0, because obviously after a while we overflow, but after all we are interested in the time spent, not in the result:

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (19.93247 seconds)
Warmup jitted: 0 (0.293229999999998 seconds)
Warmed jitted: 0 (0.0494239999999984 seconds)

$ python -c 'print 19.93247/0.0494239999999984'
403.295362577

And no, I didn't make any mistake in copying&pasting: the jitted version is really 400 times faster that the non jitted one!

Warning: my laptop seems to be not very well suited for benchmarks, as the results vary a lot from run to run; I've run the benchmarks a lot of times, and I got speedup factors up to 500 times, so your results may be different.

More benchmarks

It's also interesting to compare the result with a manual written C# version of the factorial, to see how good is code we produced; to get reasonable results, we need to compute a larger factorial, to let to code to run a bit more:

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0.980856 seconds)
Warmed jitted: 0 (0.769716 seconds)

$ mono factorial.exe 100000000
C#:            0 (0.153777 seconds)

$ python -c 'print 0.769716/0.153777'
5.00540392907

We know that the generated code is far from being optimal, but probably the factor of five is at least partially due to the fact that Mono's own JIT is optimized for C#-like code, and our code has a completely different shape.

All the benchmarks above were run under Linux, with Mono 1.9.1. Here are the results for the same benchmarks, but run with Microsoft CLR (on a different machine, so the absolute values are not comparable):

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (15,640625 seconds)
Warmup jitted: 0 (0,4375 seconds)
Warmed jitted: 0 (0,03125 seconds)

$ python -c 'print 15.640625/0.03125'
500.5

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0,90625 seconds)
Warmed jitted: 0 (0,515625 seconds)

$ ./factorial.exe 100000000
C#:            0 (0,34375 seconds)

$ python -c 'print 0.515625/0.34375'
1.5

The results are even better than before; this is probably thanks to CLR's JIT, that does a better job than Mono when faced to something which is different than the usual C#-like code.

Conclusions (for now)

This is a very important result, because it proves that PyPy's approach to JIT compilers can be applied effectively also to OO virtual machines; the result is even better than what I expected, because when generating code for .NET we have much less freedom than when generating assembly code, and I had to play some tricks to work around some .NET limitations.

Moreover, it worked at the first try :-). I tried to compile the tlc virtual machine as soon as all the related JIT tests were passing, and surprisingly everything worked just fine, even if it was the very first time I was trying to apply some features of the JIT to something bigger than a test: I think this is yet another prove that Test Driven Development just works!

Even if this is a major milestone, the CLI JIT backend is not yet completed: as a consequence it can't still be used for the full PyPy, but all the hardest problems should have been solved now.

Since a lot of readers asked for more technical details, especially about the JIT, I will try to soon write a second blog post explaining how the CLI backend works internally, with a brief look to the generated code to see how it looks like.

The `tlc` virtual machine

tlc is executed by a stack based, dynamically typed virtual machine (for those who knows a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e. lisp-like pairs of objects).

As the VM is very simple, it provides only few opcodes:

opcodes to manipulate the stack, like PUSH, POP, etc.

integer operations, like ADD, MUL, all the comparisons, etc.: these operations can only be applied to integers;

list operations, like CONS, CAR, CDR: these operations can only be applied to lists;

other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities with Python (though on a smaller scale, of course):

it has to do type-checks at runtime before doing most of the operations;

every time you do an arithmetic operation, it has to unbox the operand, do the computation, and the box the result again.

This means that even if you have a program which only uses integers, you are paying a lot of overhead.

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, checkout a fresh copy of the oo-jit branch:

$ svn co http://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM with the CLI backend and JIT enabled:

$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlc
...
lot of texts
...

If everything went OK, you now have a targettlc-cli executable, which accepts two arguments: the name of the file containing the tlc program we want to run, and an integer to be passed to it.

Luckily, in the same directory we have a factorial.tlc file that contains the bytecode for a function that -- guess? -- computes the factorial of a given integer; let's try it:

$ ./targettlc-cli factorial.tlc 5
Non jitted:    120 (0.009371 seconds)
Warmup jitted: 120 (0.208954 seconds)
Warmed jitted: 120 (0.000323999999999991 seconds)

Cool, it seems that the result was computed correcly :-). As you can see from the output, we ran the program three times:

by plain interpretation, without any jitting;

with the jit enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code;

again with the jit enabled, but this time the compilation has already been done, so we are actually measuring how good is the code we produced.

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (19.93247 seconds)
Warmup jitted: 0 (0.293229999999998 seconds)
Warmed jitted: 0 (0.0494239999999984 seconds)

$ python -c 'print 19.93247/0.0494239999999984'
403.295362577

And no, I didn't make any mistake in copying&pasting: the jitted version is really 400 times faster that the non jitted one!

More benchmarks

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0.980856 seconds)
Warmed jitted: 0 (0.769716 seconds)

$ mono factorial.exe 100000000
C#:            0 (0.153777 seconds)

$ python -c 'print 0.769716/0.153777'
5.00540392907

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (15,640625 seconds)
Warmup jitted: 0 (0,4375 seconds)
Warmed jitted: 0 (0,03125 seconds)

$ python -c 'print 15.640625/0.03125'
500.5

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0,90625 seconds)
Warmed jitted: 0 (0,515625 seconds)

$ ./factorial.exe 100000000
C#:            0 (0,34375 seconds)

$ python -c 'print 0.515625/0.34375'
1.5

The results are even better than before; this is probably thanks to CLR's JIT, that does a better job than Mono when faced to something which is different than the usual C#-like code.

Conclusions (for now)

Even if this is a major milestone, the CLI JIT backend is not yet completed: as a consequence it can't still be used for the full PyPy, but all the hardest problems should have been solved now.

Posted by Antonio Cuni at 01:19 7 Comments

Sunday, November 2, 2008

One year PyPy Blog

Last Friday the PyPy Status Blog had its first anniversary. Yay! After not really buying into any of this new-fangled "blog" stuff for a long time we just bit the bullet and got started. Totally surprisingly it even worked. We posted 76 post in the last year, more than one per week. By now we have more than 800 subscribers (according to feedburner), which is quite cool for a rather niche blog.

To make our blog even more interesting, I would like to ask for some feedback via the comments:

Which posts did you like in particular?

What sort of posts would you be interested in getting more of?

Any other improvements we could make?

Posted by Carl Friedrich Bolz-Tereick at 18:03 18 Comments

Tuesday, October 14, 2008

Sprint Discussions: JIT Generator Planning

Background

Finally, the JIT post :-). First some background: Despite our plans at the end of the EU period, PyPy's Python interpreter didn't get a good and widely applicable JIT in the last year. The reason for that was that we discovered that although the basic idea to generate JIT compilers is good, the concrete prototype made during the EU period is basically flawed. It could have been pushed a bit farther, but would have run into deep troubles eventually. One of the problems would have been performance instability: change a seemingly unrelated bit in your source program, and the performance changes in unexpected ways, which is clearly not desirable. Another problem with that old approach is that too much assembler code is generated, leading to memory problems, and also that the generated assembler is bad in various ways, e.g. it is hard in that approach to do proper register allocation.

Therefore we decided that it would be worthless to pursue this direction much further. Instead we tried to research approaches to fixing the inherent problems. This research was largely done in Prolog and I eventually wrote my Master's thesis about it. From the Prolog work we got some good insights into what needs to be done and what sorts of techniques are needed. Also, it inspired Armin to do some more exploration on a small Python prototype which used the lessons learned from Prolog and also some additional ideas from tracing JITs. So far, however, the prototype is neither in RPython, nor much integrated with PyPy.

This research is not the only thing happening in the JIT-area. During the last year, Antonio Cuni was working on bringing the JIT to pypy-cli. This consisted mostly of writing a .NET backend for the old JIT-generator. Some further work is being done since August by John Witulski, who is writing an AMD64 backend for the JIT-generator for his Bachelor's thesis.

Where to go from there

During the sprint we discussed in which directions we should continue now. We plan to work quite a bit on the JIT in the coming months. Both Armin and Anto are in Düsseldorf for four months, and them and me plan to mostly work on the JIT (as well as giving a lecture on "Dynamic Programming Languages", trying to ensnare some more students).

The first step will be to experiment a bit more with Armin's prototype. So far it looks rather promising, but there are some unsolved issues that we need to look into first. The first issue is to think a bit about how to efficiently do profiling to compile only important code paths. The other large issue are so-called "virtualizables". Roughly speaking, they are the frame objects of the interpreter from which the JIT is generated. They need special treatment, because on the one hand it is important that they get optimized away to make the code fast, since the frames are accessed all the time for the local variables; on the other hand they should still be usable for introspection if code is around that is trying to look into them.

When this is done, the prototype needs to be ported to RPython, which is a non-trivial task, since it is rather dynamic so far (it is rather important that the unresolved issues are done before the porting, because once the prototype is in RPython, experimentation will be harder). The porting has the potential to be tedious, but in a sense it is "just work", as opposed to unclear research.

At this point it will become important to think about the backend interface. The interface that the old frontend used to produce assembler code won't be usable for the new approach, so things need to be rearranged slightly. Afterwards the backends will have more information and be invoked at a slightly higher level, which should allow them to produce better code.

When all this is done, the JIT generator will be in a rather good state and it should become possible (modulo a lot of details, of course), to use it on the Python interpreter.

Conclusion

I am intentionally not attaching any time estimates to this blog post. So far our time estimates have not been very accurate when it comes to the JIT, which only lead to disappointment when the JIT failed to materialize. We hope that we will progress in interesting ways in the next four months, but who knows. Note that we are really quite disappointed ourselves that it took so much longer than we planned and hoped. The reason for this is mostly that this work really is research and sometimes it is just hard to predict what sort of problems turn up. Partial evaluation (the basis for our JIT generator) is a 30 years old technique that was always just promising and never really successful, so the fact that we think we can solve its problems in a few years is very much hubris anyway :-). On the positive side, we think that we now know these problems much better than ever before and that we have a plan that has a chance to succeed.

Also we are still convinced that our approach has huge potential, despite the difficulties. If we manage to pull it off, it should be significantly simpler to support new language features in the JIT and also to get speedups on some rather interesting bits of the language. Some ideas we are having include generating a JIT for the regex engine or speed up ctypes-bindings to be nearly as fast as an extension module (or faster?). Also the JIT will be such that by construction the JIT-generated code behaves identical to the original code, which isn't always true for Psyco, for example.

Background

Where to go from there

When all this is done, the JIT generator will be in a rather good state and it should become possible (modulo a lot of details, of course), to use it on the Python interpreter.

Conclusion

Posted by Carl Friedrich Bolz-Tereick at 14:28 6 Comments

Sprint Discussions: C++ Library Bindings

At the beginning of this year, PyPy grew ctypes support, thanks to generous support by Google. This made it possible to interface with C libraries from our Python interpreter, something that was possible but rather tedious before. What we are lacking so far is a way to interface to large C++ libraries (like GUI libraries). During the sprint we had a brainstorming session about possible approaches for fixing this shortcoming.

For CPython there are a number of approaches in common use:

SIP, mainly used for PyQT

SWIG

Boost.Python

Those all have the property that they produce some code that is then compiled with a compiler to produce a CPython extension. The produced code also uses functions from CPython's C-API. This model is not simple to use for PyPy in its current state. Since PyPy generates C code automatically, a fixed C-level API does not exist (it is not unlikely that at one point in the future we might have to provide one, but not yet). At the moment, PyPy very much has a "Don't call us, we call you"-approach.

A very different approach is followed by the Reflex package, which is developed at CERN (which has an incredible amount of C++ libraries). It is not mainly intended for writing Python bindings for C++ libraries but instead provides reflection capabilities for C++. The idea is that for every C++ shared library, an additional shared library is produced, which allows together with Reflex to introspect properties of C++ classes, methods, etc. at runtime. These facilities are then used for writing a small generic CPython extension module, that allows CPython to use any C++ library for which this reflection information was generated.

This approach is a bit similar to the ctypes module, apart from the fact that ctypes does not use any reflection information, but the user has to specify the data structures that occur in the C code herself. This makes it sometimes rather burdensome to write cross-platform library bindings.

For PyPy the approach seems rather fitting: We would need to implement only the generic extension module and could then use any number of C++ libraries. Of course some more evaluation is needed (e.g. to find out whether there are any restrictions for the C++ code that the library can use and how bothersome it is to get this reflection information for a large library) but so far it seems promising.

For CPython there are a number of approaches in common use:

SIP, mainly used for PyQT

SWIG

Boost.Python

Posted by Carl Friedrich Bolz-Tereick at 14:03 11 Comments

Sunday, October 12, 2008

Sprint Discussions: Release Planning

One of the discussions that happened during the sprint was about how to approach the next PyPy release. There hasn't been a release since the end of the EU period, which is not an optimal situation. Therefore we plan to make a 1.1 release at the beginning of next year, ideally before Pycon US. We'd also like to move towards time-based releases. This will be greatly helped by the new buildbot infrastructure, which allows us to decide when the state of the codebase is stable enough to release.

Another goal of the release is to involve more people from the wider PyPy community by having bugdays and generally asking for more support. This will be particularly useful for bugs on platforms that no one of the core developers group is using.

Feature-wise the release will mostly contain CPython 2.5 language support, including some new extension modules (like ctypes, expat, sqlite). In addition we plan to make it easier to actually install and use the PyPy Python interpreter, which means some sort of proper installation procedure and supporting distutils on top of PyPy. Another part of the release will be support for fully sand-boxing an interpreter.

Additionally there were also a large number of improvements on several levels since the last release, like optimizations, faster oldstyle-classes, better GCs, correct finalization behaviour, lots and lots of bugfixes, better threading support (still with the GIL), some work on improving memory behaviour, ...

In contrast to our last release, we will focus mainly on PyPy's Python Intepreter and more particularly its C-version. There are also various experimental interpreters that PyPy contains, like for Prolog, Smalltalk, JavaScript and Scheme. We also don't intend to put the LLVM and Javascipt backends in the release, since they are essentially unmaintained and at least partially broken. If anybody is particularly interested in one of these components, please feel free to step up and take responsibility for them. Another thing that the release won't contain is a JIT. I plan to make another blog-post about this soon, stay tuned.

Posted by Carl Friedrich Bolz-Tereick at 18:10 1 Comments

Saturday, October 11, 2008

Düsseldorf Sprint Report Days 1-3

The Düsseldorf sprint is currently in full progress and this post will try to summarize what progress has been made in the last days. We are (again) sprinting at the STUPS group of the Düsseldorf University. You can find the sprint announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one hand, there are of course our own tests. But then we also have the CPython tests that should be run on top of pypy-c. Up to now we used a custom-made pile of hacks, held together by lots of duct-tape. It consisted of a variety of different machines running different things with different reporting solutions. Some of the old test-results can still be found on wyvern. Now we are moving to a buildbot based solution together with a custom reporter to have a view similar to the old one. Some details are not quite finished yet, but most of the things are already working rather well (currently all the results displayed are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the work done by our Summer-of-Code student, Bruno Gola, of adding CPython 2.5 features to PyPy's Python interpreter. While Bruno implemented most language features and imported the 2.5 stdlib into PyPy, a lot of details were still missing. In the last days nearly everybody worked on fixing small issues and failing stdlib tests. While doing that we tried to categorize some CPython tests as implementation dependant so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to reduce the memory behaviour of our Python interpreter. The idea is to make pypy-c a realistic option for use on embedded devices. By memory behaviour we mean both the dynamic memory usage (how much bytes does a dict or an instance take) as well as the size of the executable and details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data that a pypy-c executable contains. Our executables have the tendency to be rather large, both due to a lot of code and due to a large amount of static data. The analysis didn't give any really surprising results, the problem is mostly that we have a lot of static data originating from a bit everywhere in our program. Two big offenders are the unicodedata-module with about 750 KB of static data and the multimethod-tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc-removal. This is (for PyPy) a crucial optimization of the translation toolchain that performs escape analysis to find out which objects don't outlive the frame they were allocated in. Since RPython is garbage-collected we usually have a lot of allocations, so it is important to statically get rid of many of them. To successfully do that, some inlining is needed to give the analysis more context. This leads to the fact that we have rather aggressive inlining-settings to allow as much malloc-removal as possible. The new approach tries to inline functions only if this actually leads to the successful removal of a malloc operation. The code is not finished quite yet, so it remains to be seen how successful it will be.

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The idea is that it is better for memory-constrained-environments because it does not double the memory-requirements during collections. During the sprint Armin and Maciek worked on cleaning up the code a bit and then merging the branch. An interesting property of the mark-compact GC is that after a collection all the memory that is not currently used by the program is returned to the operating system. Right now the GC is not as fast as our more mature ones, but it probably will be the basis for future tweaking.

A small thing that was done by Alexander and Carl Friedrich to make objects smaller is to enable shared instance dictionaries also for instances of old-style classes. Before it worked only for instances of new-style classes. Shared instance dictionaries are a way to reduce the memory-usage of instances. In the optimal case, it gives the same memory-savings that __slots__ are giving, but without any behavioural changes. Conceptually it is very similar e.g. to the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8 is using (click on the link, there are nice graphics). The difference is that for them it is mostly a way to get faster attribute access, and PyPy is so far only using it form memory savings (but that might change in the future).

In parallel to all the other work, John Witulski worked tirelessly on advancing the AMD64-JIT-backend. John has the implementation of this backend as the topic of his Bachelor's thesis. He is progressing quite well (especially also considering that this is his first sizeable Python project ever), just sometimes being impaired by such annoyances as errors in the official Intel documentation. By now the backend is supporting many integer operations and control flow.

Memory Improvements

Posted by Carl Friedrich Bolz-Tereick at 10:11 2 Comments

Friday, October 10, 2008

Prolog-JIT Master's-Thesis Finished

As we already blogged, in the last half-year or so, Michael Leuschel, Armin and me did a lot of JIT generator work on a Prolog prototype. The idea was to experiment more quickly with some techniques than what would have been possible with RPython. These experiments were quite successful in themselves. With very little code we managed to get a JIT that is not doing too badly when compared to existing projects for Prolog.

This Prolog work was also the subject of my Master's thesis. I finished the thesis about two weeks ago (and since then have been mostly sleeping and then sprinting). The thesis should be self-contained when it comes to explaining the JIT concepts but needs knowledge of Prolog to be understandable.

Posted by Carl Friedrich Bolz-Tereick at 19:05 0 Comments

Saturday, September 20, 2008

PyPy/Python at the Maemo summit

Maciej and me visited the Maemo Summit in Berlin - a community meetup around Nokia's Linux based mobile platform. We spontaneously did a lightning talk about a first running pypy-c on Maemo and got nice feedback.

We also had a nice lunch with guys from the INDT in Brazil, including Marcio Marcedo and Marcelo Eduardo. It turns out that Python is used a lot on Maemo, for example the nice Canola UI is done with it. Will be interesting to see how this shapes up in relation to the iPhone and Android.

A lot of Nokia engineers were around and they announced that from October on they are going for weekly new releases of their SDK for the new Fremantle (Maemo-5) debian-based platform until the SDK becomes final - if we got this right.

Funnily enough, we met Marius Gedminas from the Programmers of Vilnius - he gave a lightning talk on his impressions as a community member. We think python programmers really should go much more to non-Python centric conferences.

The whole event took place at the C-Base - was a bit crammed in some of the sessions with something like 200 people attending.
cheers, Maciej and Holger

Posted by holger krekel at 14:58 0 Comments

Wednesday, September 17, 2008

Pycon UK, Javascript and the GIL

Just got back from Pycon UK 2008 - here are some impressions.

Both the keynote speakers Mark Shuttleworth (Canonical) and Ted Leung (Sun Microsystems) expressed their concerns about Javascript becoming so fast and prominent that it could displace Python in the future. They also highlighted the fact that Multi-core systems get cheaper and more popular also on desktop computers or notebooks. They challenged the community to advance Python implementations to exploit it. Question was up what PyPy can do here. As it stands, PyPy still uses the good old Global Interpreter Lock (GIL) but our approaches should indeed lend itself well to do experimentation with free threading.

During the 2-day conference we met many interesting people, most notably the guys from Resolver, among them William Reade who is working on IronClad -- which implements a fake python25.dll on top of IronPython. He presented some good results for Numpy in his lightning talk. This approach is surely something to follow closely and potentially use for PyPy.

We also had lunch and a couple of chats with Jacob Kaplan-Moss from Django fame - he is apparently up to try use PyPy's sandboxing features for one of his projects, cool!

Conference itself was well organized for the 230 attending people - although the venue might be a bit small for next year's EuroPython. Ah, and we gave three well attended talks, find the slides here:

cheers,
Holger, Maciej, Anto (associated through merlinux, btw)

Posted by Maciej Fijalkowski at 19:30 6 Comments

Friday, September 5, 2008

Düsseldorf PyPy sprint 5-13th October, 2008

The PyPy team is happy to announce the next sprint, which will take place in the Computer Science Department of the University of Düsseldorf, Germany. Sprinting will start on the 6th of October and go on till the 12th. Please arrive on the day before if you want to come.

Topics of the sprint will be aiming at a 1.1 release and to work on integrating PyPy better with small devices. Other topics are also welcome!

We will try to find a hotel with group rates, so if you are interested, please sign up soon! See the announcement for more details.

Posted by Carl Friedrich Bolz-Tereick at 14:25 0 Comments

Friday, August 22, 2008

pylib/py.test 0.9.2 released

PyPy and its 14638 automated tests use the py.test tool which is also used by many other projects. PyPy developers have actually driven and contributed a lot to its development. I just released version 0.9.2 of the py lib mainly fixing Windows issues and providing better packaging and integration with setuptools. It's usable completely independently from PyPy - "easy_install py" gives you the py.test command line. Of course you can run py.test on top of a translated PyPy version as well. Here is a quick summary of what the py lib provides besides py.test:

py.execnet: ad-hoc code distribution to SSH, Socket and local sub processes
py.magic.greenlet: micro-threads on standard CPython ("stackless-light") and PyPy
py.path: path abstractions over local and subversion files
py.code: dynamic code compile and traceback printing support
tested against Linux, Win32, OSX, works on python 2.3-2.6

Good general entry points for installation and documentation:

have fun, holger krekel

Posted by holger krekel at 13:32 2 Comments

Tuesday, August 19, 2008

New translation option: --opt

Hi all,

A few command-line options for translate.py have changed. Most interesting is that optimization levels are selected with the option --opt, or -O for short. This replaces --allopts, which was also called --faassen in reference to a person who is actually not involved in PyPy (so that was a bit of a strange joke). Also, --allworkingmodules is the default nowadays, and can be cancelled with --no-allworkingmodules. Threads are also included in --allworkingmodules now.

Examples:

translate.py (reasonable default, corresponds to --opt=2)
translate.py --opt=3 (best, maybe 10-20% faster)
translate.py --opt=1 (translation is faster and less RAM-hungry)

For more information, see:

Posted by Armin Rigo at 11:44 2 Comments

PyPy runs unmodified django 1.0 beta

This is just a quick update post to previous post - django folks commited all outstanding tickets and we are able to run unmodified django on top of pypy-c. Instructions how to do it are well explained on django wiki entry

enjoy,
fijal

Monday, July 14, 2008

Europython 2008 PyPy talks and sprint sum up

The EuroPython 2008 conference and sprints have finished - it certainly was a very eventful and successful conference for PyPy. And many very interesting non-PyPy talks as well. PyPy presentations are available online: PyPy status talk PyPy for the rest of us, PyPy behind the scenes. Armin and Maciej also did a well-attended talk about PyPy's garbage collection, but that was quite interactive, no slides.

The talks were all well visited and we got good questions. However, we still need to work on sorting out the "PyPy technology cloud" and how to present it to different audiences. Anyway, we are happy to hear feedback or questions about the talks!

After the conference there was a three-day PyPy sprint. Despite the fact that most PyPy core developers were zombies, we made good progress. Particularly our newcomers did very well. Here are some results:

itertools rewritten in RPython for performance by Jakub Gustak and Andrew Durdin
a new ctypes based dbm and hashlib module, both by Gasper Zejn with support from Henrik Vendelbo, they also got ctypes to nicely work on OSX. (sorry for lack of proper letters in names :)
implement builtin function call profiling by Stephan Diehl, Antonio and Armin.
running Pinax on top of pypy-c, by Henrik, Holger, Gasper.
Jim Baker started a _rawffi.py for Jython using JNA aiming to provide support to run PyPy's ctypes on top of Jython. When Jython gets this to run, PyPy's JVM backend should be able to use it. Talk about Code Reuse :)
oldstyle classes are now the default, this makes PyPy mimick very closely cpython's 2.5 object model.
Andrew started a port of the Malbolge interpreter written in Python to RPython (obviously the only missing link for PyPy to take over the world).
various cleanups (a new option "--lonepycfiles" helps with saner imports, remove int-float comparison shortcuts, ...)

At the end of the sprint we also discussed initial plans for a 1.1 release which we'd like to make happen this year. So we are generally looking forward to a busy rest of 2008 and luckily this starts by many of us taking a good vacation first :)

Cheers,
fijal & holger

itertools rewritten in RPython for performance by Jakub Gustak and Andrew Durdin
a new ctypes based dbm and hashlib module, both by Gasper Zejn with support from Henrik Vendelbo, they also got ctypes to nicely work on OSX. (sorry for lack of proper letters in names :)
implement builtin function call profiling by Stephan Diehl, Antonio and Armin.
running Pinax on top of pypy-c, by Henrik, Holger, Gasper.
Jim Baker started a _rawffi.py for Jython using JNA aiming to provide support to run PyPy's ctypes on top of Jython. When Jython gets this to run, PyPy's JVM backend should be able to use it. Talk about Code Reuse :)
oldstyle classes are now the default, this makes PyPy mimick very closely cpython's 2.5 object model.
Andrew started a port of the Malbolge interpreter written in Python to RPython (obviously the only missing link for PyPy to take over the world).
various cleanups (a new option "--lonepycfiles" helps with saner imports, remove int-float comparison shortcuts, ...)

Posted by Maciej Fijalkowski at 10:40 2 Comments

Sunday, July 13, 2008

Finding Bugs in PyPy with a Fuzzer

Last week I played a bit with Fusil, which is a fuzzing framework. The idea is to feed the interpreter code that calls the functions of a module with random values of various types as arguments in the hope that one hits an unchecked case. This is done until a problem is hit , the most common problem being a segfault. Victor Stinner, the author of Fusil, is a regular in the PyPy IRC channel and thankfully helped me getting started with Fusil. I used his project description for CPython as a starting point and tweaked it a bit. Reason is that PyPy is harder to segfault and so I tweaked Fusil to also count uncaught RPython-level exceptions as such a problem. (RPython has full exception support, and if an RPython-exception escapes to the top level, the Python interpreter aborts. One should not be able to exploit this but but for a user it is bad enough, because such exceptions cannot be caught from Python code.)

Using Fusil I found a number of cases where such exceptions happened (in some pickle support-code, in the expat parser, in the os and in the termios module) and also one or two segfaults (in the parser module, of all places). I fixed all these problems so that by now the fuzzer just runs for a very long time and only finds things that take too long (so they count as a way to do a DoS attack) like pow(12355123123L, 12351512123121L) or round(1, 1000000000) (the latter should probably be fixed). This probably just means that the fuzzer is not good enough, because there are certainly segfaults left in PyPy. However, the fact that it is rather hard to find them validates our approach of using a high-level memory-managed language for our interpreter. Victor tells me that it is rather easy to find segfaults in CPython this way, he already found quite some problems.

Posted by Carl Friedrich Bolz-Tereick at 17:53 4 Comments

Saturday, July 12, 2008

PyPy's Python runs Pinax / Django

During the EP2008 sprint we got Pinax running on top of PyPy. At our play1 server we have it running on top of pypy-c. Not that you'll notice many differences to the original site but that's the point, isn't it? ... Well, in fact i am too lazy to customize our play1 version now - i rather spent a nice evening with the other sprint guys :) Pinax integrates numerous reusable Django apps to take care of the things that many sites have in common. Many thanks particularly to Henrik Vendelbo who sorted out various Pinax and PyPy issues, and wrote up a nice DjangoAndPyPy wiki page describing the installation process. greetings from Vilnius (Lithunia), Holger

Posted by holger krekel at 19:10 0 Comments

Thursday, July 10, 2008

EP2008: PyPy meets Jython

One of the great events at EuroPython 2008 were our chats and meetings with the Jython and Sun people. The Jython people recently are pushing into releasing Python version 2.5 and they currently pursue many interesting sub projects. Coincidentally, PyPy also has tons of interesting areas and results :) So we eventually got into brainstorming a number of possible technical collab ideas. Further below is a first list as i wrote it down from our 10 people PyPy / Jython 30 minute close up meeting yesterday. It felt great to be able to talk to the Jython people this way - kudos to Sun for their clear commitments and open ways to go about things! I sense a genuine interest on fair collaboration with non-java developer communities. Seems like they are serious about not focusing on "Java this", "Java that" anymore but rather focus on the JVM platform. Good! And about language independent interest in ambitious technology. Even Better! I am tensed to see how things go from here. So here the list of technical collab ideas:

ctypes - try to create _rawffi module in Java for Jython, which will enable Jython to reuse our existing ctypes implementation (and have PyPy use the Jython-rawffi for its own for PyPy.JVM)
generally see to share work / (continue) collaborate regarding extension modules
Jython/PyPy (and eventually IronPython): document known differences to CPython, maybe in a PEP
Python Interpreter for Jython (in order to run CPython's .pyc files): re-use pypy's bytecode evaluator, implement a "Jython object space".
re-use rpython-extension modules for jython (e.g. SRE), by compiling them to Java and reusing as a native library.
collaborate on testing framework / benchmarking, have a common site to show test results
make py.test compatible with jython
come up with a set of "pure Python language" tests, which would gather and refactor tests from CPython, PyPy and Jython.
look into using java types / jython approaches for implementing free threading.
share knowledge regarding JIT / psyco

If you have any more ideas, comments or would like to join efforts, let us know! Cheers and thanks to Ted Leung, Frank Wierzbiki, Jim Baker and Tobias Ivarsson from Sun and Jython fame respectively, Holger

ctypes - try to create _rawffi module in Java for Jython, which will enable Jython to reuse our existing ctypes implementation (and have PyPy use the Jython-rawffi for its own for PyPy.JVM)
generally see to share work / (continue) collaborate regarding extension modules
Jython/PyPy (and eventually IronPython): document known differences to CPython, maybe in a PEP
Python Interpreter for Jython (in order to run CPython's .pyc files): re-use pypy's bytecode evaluator, implement a "Jython object space".
re-use rpython-extension modules for jython (e.g. SRE), by compiling them to Java and reusing as a native library.
collaborate on testing framework / benchmarking, have a common site to show test results
make py.test compatible with jython
come up with a set of "pure Python language" tests, which would gather and refactor tests from CPython, PyPy and Jython.
look into using java types / jython approaches for implementing free threading.
share knowledge regarding JIT / psyco

Posted by holger krekel at 09:29 3 Comments

Monday, July 7, 2008

PyPy at the EuroPython 2008

Greetings from Vilnius, Lithuania. There were already two pypy talks, one performed by Jacob Hallen PyPy for the rest of us and second by Maciej Fijalkowski PyPy status talk. The thing that we forgotten to tell is that PyPy sandboxing feature can also easily limit CPU and RAM usage as well as any other possible resource (like network transfer). For anyone who would like to join, there is a PyPy sprint after the conference.

Cheers,
arigo & fijal

Posted by Maciej Fijalkowski at 16:26 5 Comments

Monday, June 30, 2008

JIT in Prolog

Hi all,

Some news from the JIT front. Progress on the JIT has been low-profile in the past few months. No big results to announce yet, but we have played with some new ideas, and they are now documented as a draft research paper: Towards Just-In-Time Compilation and Specialisation of Prolog.

Prolog? Yes. To understand this slightly unusual choice of programming language, here is first some background about our JIT.

PyPy contains not a JIT but a JIT generator, which means that we only write an interpreter for a language (say, the complete Python language), and we get a JIT "for free". More precisely, it's not for free: we had to write the JIT generator, of course, as well as some amount of subtle generic support code. The JIT generator preprocesses the (complete Python) interpreter that we wrote and links the result with the generic support code; the result is a (complete Python) JIT.

The way that this works so far gives us a generated JIT that is very similar to Psyco in the way it works. But Psyco has issues (and so the current PyPy JITs have the same issues): it can sometimes produce too much machine code, e.g. by failing to notice that two versions of the machine code are close enough that they should really be one; and it can also sometimes fail in the opposite way, by making a single sub-efficient version of the machine code instead of several efficient specialized versions.

A few months ago we have chosen to experiment with improving this instead of finishing and polishing what we had so far. The choice was mostly because we were (and still are) busy finishing and polishing everything else in PyPy, so it was more fun to keep at least the JIT on the experimental side. Besides, PyPy is now getting to a rather good and complete state, and it is quite usable without the JIT already.

Anyway, enough excuses. Why is this about Prolog?

In PyPy, both the (complete Python) interpreter and the JIT support code are in RPython. Now RPython is not an extremely complicated language, but still, it is far from the top on a minimalism scale. In general, this is a good in practice (or at least I think so): it gives a reasonable balance because it is convenient to write interpreters in RPython, while not being so bloated that it makes our translation toolchain horribly complicated (e.g. writing garbage collectors for RPython - or even JIT generators - is reasonable). Still, it is not the best choice for early research-level experimentation.

So what we did instead recently is hand-write, in Prolog, a JIT that looks similar to what we would like to achieve for RPython with our JIT generator. This gave much quicker turnaround times than we were used to when we played around directly with RPython. We wrote tiny example interpreters in Prolog (of course not a complete Python interpreter). Self-inspection is trivial in Prolog, and generating Prolog code at runtime is very easy too. Moreover, many other issues are also easier in Prolog: for example, all data structures are immutable "terms". Other languages than Prolog would have worked, too, but it happens to be one that we (Carl Friderich, Michael Leuschel and myself) are familiar with -- not to mention that it's basically a nice small dynamic language.

Of course, all this is closely related to what we want to do in PyPy. The fundamental issues are the same. Indeed, in PyPy, the major goals of the JIT are to remove, first, the overhead of allocating objects all the time (e.g. integers), and second, the overhead of dynamic dispatch (e.g. finding out that it's integers we are adding). The equivalent goals in Prolog are, first, to avoid creating short-lived terms, and second, to remove the overhead of dispatch (typically, the dispatching to multiple clauses). If you are familiar with Prolog you can find more details about this in the paper. So far we already played with many possible solutions in the Prolog JIT, and the paper describes the most mature one; we have more experimentation in mind. The main point here is that these are mostly language-independent techniques (anything that works both in Prolog and in RPython has to be language-independent, right? :-)

In summary, besides the nice goal of speeding up Prolog, we are trying to focus our Prolog JIT on the issues and goals that have equivalents in the PyPy JIT generator. So in the end we are pretty convinced that it will give us something that we can backport to PyPy -- good ideas about what works and what doesn't, as well as some concrete algorithms.

Hi all,

Prolog? Yes. To understand this slightly unusual choice of programming language, here is first some background about our JIT.

Anyway, enough excuses. Why is this about Prolog?

Posted by Armin Rigo at 20:17 13 Comments

Friday, June 27, 2008

PyPy code swarm

Following the great success of code_swarm, I recently produced a video that shows the commit history of the PyPy project.

The video shows the commits under the dist/ and branch/ directories, which is where most of the development happens.

In the first part of the video, you can see clearly our sprint based approach: the video starts in February 2003, when the first PyPy sprint took place in Hildesheim: after a lot of initial activity, few commits happened in the next two months, until the second PyPy sprint, which took place in Gothenburg in late May 2003; around the minute 0:15, you can see the high commit rate due to the sprint.

The next two years follow more or less the same pattern: very high activity during sprints, followed by long pauses between them; the most interesting breaking point is located around the minute 01:55; it's January 2005, and when the EU project starts, the number of commits just explodes, as well as the number of people involved.

I also particularly appreciated minute 03:08 aka March 22, 2006: it's the date of my first commit to dist/, and my nickname magically appears; but of course I'm biased :-).

The soundtrack is NIN - Ghosts IV - 34: thanks to xoraxax for having added the music and uploaded the video.

PyPy Codeswarm from solse@trashymail.com on Vimeo.

Posted by Antonio Cuni at 12:18 5 Comments

Thursday, June 26, 2008

Funding of some recent progress by Google's Open Source Programs

As readers of this blog already know, PyPy development has recently focused on getting the code base to a more usable state. One of the most important parts of this work was creating an implementation of the ctypes module for PyPy, which provides a realistic way to interface with external libraries. The module is now fairly complete (if somewhat slow), and has generated a great deal of community interest. One of the main reasons this work progressed so well was that we received funding from Google's Open Source Programs Office. This is really fantastic for us, and we cannot thank Google and Guido enough for helping PyPy progress more rapidly than we could have with volunteer-only time!

This funding opportunity arose from the PyPy US road trip at the end of last year, which included a visit to Google. You can check out the video of the talk we gave during our visit. We wrapped up our day with discussions about the possibility of Google funding some PyPy work and soon after a we were at work on the proposal for improvements we'd submitted.

One nice side-effect of the funding is indeed that we can use some of the money for funding travels of contributors to our sprint meetings. The next scheduled Google funding proposal also aims at making our Python interpreter more usable and compliant with CPython. This will be done by trying to fully run Django on top of PyPy. With more efforts like this one we're hoping that PyPy can start to be used as a CPython replacement before the end of 2008.

Many thanks to the teams at merlinux and Open End for making this development possible, including Carl Friedrich Bolz, Antonio Cuni, Holger Krekel, Maciek Fijalkowski at merlinux, Samuele Pedroni and yours truly at Open End.

We always love to hear feedback from the community, and you can get the latest word on our development and let us know your thoughts here in the comments.

Bea Düring, Open End AB

PS: Thanks Carl Friedrich Bolz for drafting this post.

Posted by Samuele Pedroni at 12:41 9 Comments

Sunday, June 22, 2008

Pdb++ and rlcompleter_ng

When hacking on PyPy, I spend a lot of time inside pdb; thus, I tried to create a more comfortable environment where I can pass my nights :-).

As a result, I wrote two modules:

pdb.py, which extends the default behaviour of pdb, by adding some commands and some fancy features such as syntax highlight and powerful tab completion; pdb.py is meant to be placed somewhere in your PYTHONPATH, in order to override the default version of pdb.py shipped with the stdlib;

rlcompleter_ng.py, whose most important feature is the ability to show coloured completions depending on the type of the objects.

To find more informations about those modules and how to install them, have a look at their docstrings.

It's important to underline that these modules are not PyPy specific, and they work perfectly also on top of CPython.

Posted by Antonio Cuni at 15:48 9 Comments

Friday, June 20, 2008

Running Nevow on top of PyPy

Another episode of the "Running Real Application of top of PyPy" series:

Today's topic: Divmod's Nevow. Nevow (pronounced as the French "nouveau", or "noo-voh") is a web application construction kit written in Python. Which means it's just another web framework, but this time built on top of Twisted. While, due to some small problems we're not yet able to pass full Twisted test suite on top of pypy-c, Nevow seems to be simple enough to work perfectly (959 out of 960 unit tests passing, with the last one recognized as pointless and about to be deleted). Also, thanks to exarkun, Nevow now no longer relies on ugly details like refcounting.

As usual, translate pypy using:


translate.py --gc=hybrid --thread targetpypystandalone --faassen --allworkingmodules --oldstyle

Of course, obligatory to the series, screenshot:

This is Nevow's own test suite.

Cheers,
fijal

Posted by Maciej Fijalkowski at 01:44 1 Comments

Friday, December 19, 2008

Sunday, December 7, 2008

Graphs, blocks, links

Translate graphs to CLI

Graphs and methods

The hard bit: non-local links

The slow bit: passing arguments

Implement flexswitches

Performances

Alternative implementations

Conclusion

Graphs, blocks, links

Translate graphs to CLI

Graphs and methods

The hard bit: non-local links

The slow bit: passing arguments

Implement flexswitches

Performances

Alternative implementations

Conclusion

Friday, November 7, 2008

PyPy JIT for dummies

Promotion

Virtuals

Putting it all together

PyPy JIT for dummies

Promotion

Virtuals

Putting it all together

Tuesday, November 4, 2008

The tlc virtual machine

Let's JIT it!

More benchmarks

Conclusions (for now)

The tlc virtual machine

Let's JIT it!

More benchmarks

Conclusions (for now)

Sunday, November 2, 2008

Tuesday, October 14, 2008

Background

Where to go from there

Conclusion

Background

Where to go from there

Conclusion

Sunday, October 12, 2008

Saturday, October 11, 2008

Memory Improvements

Memory Improvements

Friday, October 10, 2008

Saturday, September 20, 2008

Wednesday, September 17, 2008

Friday, September 5, 2008

Friday, August 22, 2008

Tuesday, August 19, 2008

Monday, July 14, 2008

Sunday, July 13, 2008

Saturday, July 12, 2008

Thursday, July 10, 2008

Monday, July 7, 2008

Monday, June 30, 2008

Friday, June 27, 2008

Thursday, June 26, 2008

Sunday, June 22, 2008

Friday, June 20, 2008

The `tlc` virtual machine

The `tlc` virtual machine