Tuesday, January 31, 2012

A Simple Tracer for the Flow Graph Language

Part 2 of Comparing Partial Evaluation to Tracing

This is the second blog post in a series about comparing partial evaluation and tracing. In the first post of the series I introduced a small flow-graph language together with an interpreter for it. Then I showed a partial evaluator for the language. In this post I will show how a tracer for the same language works and how it relates to both execution and to partial evaluation. The code from this post can be found here: http://paste.pocoo.org/show/543542/

Tracing Execution

The idea of a tracer (both for the language described here and in general) is to perform completely normal interpretation, but at the same time keep a log of all the normal operations (i.e. non-control-flow operations) that were performed. This continues until the tracer reaches the code block where it started, at which point the trace corresponds to a closed loop. Then tracing stops and the last operation is replaced by a jump to the start. After tracing has ended, the trace can be executed, optionally after optimizing it.

To write a tracer, we start from the rules of the interpreter, rename the predicate to trace and add some extra arguments. Thus, the following rules in the interpreter:

interp(op1(ResultVar, Op, Arg, Rest), Env) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

interp(op2(ResultVar, Op, Arg1, Arg2, Rest), Env) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

become the following rules in the tracer:

trace(op1(ResultVar, Op, Arg, Rest), Env, op1(ResultVar, Op, Arg, T), TraceAnchor) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    trace(Rest, NEnv, T, TraceAnchor).

trace(op2(ResultVar, Op, Arg1, Arg2, Rest), Env, op2(ResultVar, Op, Arg1, Arg2, T), TraceAnchor) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    trace(Rest, NEnv, T, TraceAnchor).

Note how the bodies of the trace rules correspond exactly to the bodies of the interp rules; the only difference is the recursive call to trace. The meaning of the arguments of trace is as follows: the first and second argument are the operation currently executed and the environment, like in the interpreter. The argument after that is an output argument that collects the currently traced operation; in the example above it is exactly like the operation that was executed. TraceAnchor is additional information about the trace that is being built right now; most of the time it is just handed on to the recursive call of trace. We will see later what it contains.

The rule for print_and_stop is very simple, as execution (and therefore also tracing) simply stops there:

trace(print_and_stop(V), Env, print_and_stop(V), _) :-
    resolve(V, Env, Val),
    print(Val), nl.

What remains are the rules for the control operations jump and if. A trace linearizes one execution path; it contains no jumps. However, when a jump to the starting label is reached, tracing should stop. Therefore, the implementation of jump contains two cases:

trace(jump(L), Env, T, TraceAnchor) :-
    (TraceAnchor = traceanchor(L, FullTrace) ->
        T = loop,
        write(trace), nl, write(FullTrace), nl,
        do_optimize(FullTrace, OptTrace),
        write(opttrace), nl, write(OptTrace), nl,
        runtrace(OptTrace, Env, OptTrace)
    ;
        block(L, Block),
        trace(Block, Env, T, TraceAnchor)
    ).

Let's dissect this code in small increments. First, we see what TraceAnchor is. It is a term of the form traceanchor(StartLabel, FullTrace). StartLabel is a label in the program where tracing started (and where it should end as well, when the loop is closed). The argument FullTrace is an accumulator which contains the full trace that is being built right now.

The condition at the start of the rule checks whether the target label L is the same as the one stored in the trace anchor. If that is the case, we can stop tracing. The rest of the trace T is assigned the operation loop, which jumps back to the beginning of the trace. Afterwards we print and optimize the trace, then run it, using the FullTrace part of the traceanchor.

If the label we jump to is not the StartLabel we simply continue tracing without recording any operation. This part of the rule is again extremely similar to the interpretation of jump.

For now, we will not use any interesting optimizations, just return the unoptimized trace unchanged:

do_optimize(FullTrace, FullTrace).

The missing operation now is if. An if statement needs special treatment, because it is a point where control flow can diverge from the trace. The trace is linear, so it can only record one of the two possible paths. When executing the trace it is possible for the other path to be taken. Therefore we need to make sure that the same conditions that were true or false during tracing are still true or false during the execution of the trace. This is done with a guard operation, which checks for this condition. The following rule implements it:

trace(if(V, L1, L2), Env, T, TraceAnchor) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        L = L2, T = guard_false(V, [], L1, NT)
    ;
        L = L1, T = guard_true(V, [], L2, NT)
    ),
    trace(jump(L), Env, NT, TraceAnchor).

It is very similar to the interp rule of if. The rule inserts a guard_true into the trace if the condition is true, and a guard_false if the condition is false. The arguments of the guard are: the variable that is being guarded, an empty list (the reason for that will be explained in a later post), the label where execution needs to continue when the guard fails, and the rest of the trace.

Let's also add a small helper predicate that can be used to conveniently start tracing:

do_trace(L, Env) :-
    block(L, StartBlock),
    trace(StartBlock, Env, ProducedTrace, traceanchor(L, ProducedTrace)).

The predicate takes a label and an environment and executes the label with the given environment by first producing a trace, then executing the trace, and eventually jumping back to interpretation if a guard fails. It does this by reading the code at label L with the block predicate, and then calling trace with an unbound variable ProducedTrace to hold the trace and a trace anchor that contains the label where tracing started and the produced trace variable.

With this predicate and the tracing rules so far we can already trace the power implementation from the last blog post, just not yet execute the trace (which we will do in the next section):

?- do_trace(power_rec, [res/1, x/10, y/20]).
trace
op2(res,mul,var(res),var(x),op2(y,sub,var(y),const(1),guard_true(y,[],power_done,loop)))
opttrace
op2(res,mul,var(res),var(x),op2(y,sub,var(y),const(1),guard_true(y,[],power_done,loop)))
...

The computed trace is:

op2(res,mul,var(res),var(x),
op2(y,sub,var(y),const(1),
guard_true(y,[],power_done,
loop)))

which is exactly the content of the loop from power_rec. Note how the if is turned into a guard_true which jumps to power_done if the guard fails.

A real tracing system would need a way for the tracer to get started, e.g. by doing profiling in an interpreter and starting the tracer for labels that are jumped to often. Also, traces for the same label are usually cached in some way. These details are left out in this simple model.
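
To make the first point a bit more concrete, here is a hypothetical sketch (not part of the code linked above) of how such profiling could look in our model: a variant of the interpreter's jump rule counts how often each label is jumped to and hands control to do_trace once an arbitrary threshold is reached. The predicate name interp_profiling, the counter jumpcount/2 and the threshold are all made up for illustration; the remaining rules of interp_profiling would simply mirror those of interp.

:- dynamic(jumpcount/2).

% count how often we jump to L; start tracing once L becomes hot
interp_profiling(jump(L), Env) :-
    (retract(jumpcount(L, N)) -> N1 is N + 1 ; N1 = 1),
    assert(jumpcount(L, N1)),
    (N1 >= 20 ->
        do_trace(L, Env)
    ;
        block(L, Block),
        interp_profiling(Block, Env)
    ).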

Executing Traces

In a real tracing system, the traces would be turned into machine code and executed by the CPU. In our small model, we will simply write another interpreter for them. This interpreter is very simple and looks again very similar to interp.

runtrace(op1(ResultVar, Op, Arg, Rest), Env, TraceFromStart) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    runtrace(Rest, NEnv, TraceFromStart).

runtrace(op2(ResultVar, Op, Arg1, Arg2, Rest), Env, TraceFromStart) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    runtrace(Rest, NEnv, TraceFromStart).

These rules are completely equivalent to the interp rules for op1 and op2. runtrace needs an extra argument, TraceFromStart, which is always just handed over to the recursive call of runtrace.

When the end of the trace is reached and the loop statement is encountered, we simply start from the beginning:

runtrace(loop, Env, TraceFromStart) :-
    runtrace(TraceFromStart, Env, TraceFromStart).

The remaining question is what to do when encountering guards. In that case the guard condition needs to be checked. If the guard succeeds, executing the trace can continue. Otherwise the trace is aborted and the interpreter resumes execution:

runtrace(guard_true(V, ResumeVars, L, Rest), Env, TraceFromStart) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        resume_interp(Env, ResumeVars, L)
    ;
        runtrace(Rest, Env, TraceFromStart)
    ).

runtrace(guard_false(V, ResumeVars, L, Rest), Env, TraceFromStart) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        runtrace(Rest, Env, TraceFromStart)
    ;
        resume_interp(Env, ResumeVars, L)
    ).


resume_interp(Env, [], L) :-
    block(L, Block),
    interp(Block, Env).

Note how the execution is handed over to the interpreter at the label that was encoded as the third argument in the guard operation. What the ResumeVars are for we will see in a later post. For now we assume that it is always an empty list.

With this interpreter for traces we can now trace and then execute the example:

?- do_trace(power_rec, [res/1, x/10, y/20]).
trace
op2(res,mul,var(res),var(x),op2(y,sub,var(y),const(1),guard_true(y,[],power_done,loop)))
opttrace
op2(res,mul,var(res),var(x),op2(y,sub,var(y),const(1),guard_true(y,[],power_done,loop)))
100000000000000000000

Of course this example is not very exciting, because the trace looks more or less exactly like the original code. There will be more exciting examples in a later post.

Extension: Promotion

As it is, the tracer does not actually add much to the interpreter. It linearizes control flow, but nothing deeply advanced happens. In this section I will add a crucial but simple-to-implement extension to the control flow language that allows the tracer to do more interesting things. This extension is called promotion.

Promotion is basically a hint that the programmer can add to her control flow graph program. A promotion is an operation promote(V, L) that takes a variable V and a label L. When the interpreter runs this statement, it simply jumps to the label L and ignores the variable:

interp(promote(_, L), Env) :-
    interp(jump(L), Env).

However, the tracer does something much more interesting. For the tracer, the promote statement is a hint that it would be very useful to know the value of V and that the rest of the trace should keep that value as a constant. Therefore, when the tracer encounters a promotion, it inserts a special kind of guard called guard_value:

trace(promote(V, L), Env, guard_value(V, Val, [], L, T), TraceAnchor) :-
    lookup(V, Env, Val),
    trace(jump(L), Env, T, TraceAnchor).

The guard_value is an interesting operation, because it freezes the current value FVal of variable V into the trace. When the trace is executed, the guard checks that the current value of the variable and the frozen value are the same. If yes, execution continues; if not, the trace is aborted:

runtrace(guard_value(V, FVal, ResumeVars, L, Rest), Env, TraceFromStart) :-
    lookup(V, Env, Val),
    (Val == FVal ->
        runtrace(Rest, Env, TraceFromStart)
    ;
        resume_interp(Env, ResumeVars, L)
    ).

What can this operation be used for? It's a way to communicate to the tracer that variable V is not changing very often and that it is therefore useful to freeze the current value into the trace. This can be done even without knowing the value of V in advance.

Let's look at a (slightly contrived) example:

l:
    c = i >= 0
    if c goto b else goto l_done

l_done:
    print_and_stop(var(i))

b:
    promote(x, b2)

b2:
    x2 = x * 2
    x3 = x2 + 1
    i = i - x3
    goto l

Encoded in Prolog syntax:

block(l, op2(c, ge, var(i), const(0),
         if(c, b, l_done))).
block(l_done, print_and_stop(var(i))).

block(b, promote(x, b2)).
block(b2, op2(x2, mul, var(x), const(2),
          op2(x3, add, var(x2), const(1),
          op2(i, sub, var(i), var(x3),
          jump(l))))).

This is a simple loop that counts down in steps of x * 2 + 1, whatever x might be, until i >= 0 is no longer true. Assuming that x doesn't change often, it is worthwhile to promote it so that x * 2 + 1 can be constant-folded and does not have to be recomputed in every iteration. (With i starting at 100 and x at 5, the step is 11, so i ends up at -10, which is what the example below prints.) This is done with the promotion of x (of course optimizing this loop with loop invariant code motion would work as well, because x doesn't actually change during the loop).

To trace this, we can run the following query:

?- do_trace(b, [i/100, x/5]).
trace
guard_value(x,5,[],b2,op2(x2,mul,var(x),const(2),op2(x3,add,var(x2),const(1),op2(i,sub,var(i),var(x3),op2(c,ge,var(i),const(0),guard_true(c,[],l_done,loop))))))
opttrace
guard_value(x,5,[],b2,op2(x2,mul,var(x),const(2),op2(x3,add,var(x2),const(1),op2(i,sub,var(i),var(x3),op2(c,ge,var(i),const(0),guard_true(c,[],l_done,loop))))))
-10

Writing the trace in a more readable way:

guard_value(x,5,[],b2,
op2(x2,mul,var(x),const(2),
op2(x3,add,var(x2),const(1),
op2(i,sub,var(i),var(x3),
op2(c,ge,var(i),const(0),
guard_true(c,[],l_done,
loop))))))

After the guard_value the operations performed on x could be constant-folded away, because the guard ensures that x is 5 before execution continues. To actually do the constant-folding we would need some optimization component that optimizes traces. This will be done in the next blog post.
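
Just to give an idea of the expected result (a hand-written sketch, not the output of any real optimizer; the actual optimization pass is the topic of the next post), after constant-folding with x frozen to 5 the essential part of the trace could look something like this:

guard_value(x,5,[],b2,
op2(i,sub,var(i),const(11),
op2(c,ge,var(i),const(0),
guard_true(c,[],l_done,
loop))))

Here x * 2 + 1 has been computed once, at optimization time, giving the constant 11; whether the assignments to x2 and x3 are dropped entirely or kept in some other form is up to the optimizer of the next post.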

In this section I mostly talked about how promotion is realized in the tracer, not about what to use it for and how. Promotion is one of the most important ingredients responsible for the success of PyPy's tracing approach. How this works is discussed in detail in the paper "Runtime feedback in a meta-tracing JIT for efficient dynamic languages".

Conclusion

In this blog post we have seen a very minimalistic tracer and an interpreter for the produced traces. The tracer is very much like the original interpreter; in addition to executing the program it also keeps a log of the operations that were executed. Tracing stops when a loop is closed; then the trace can be optimized and run. Running a trace continues until a failing guard is hit. At that point, execution goes back to the normal interpreter (and stays there, in this very simple implementation).

I also presented an extension of tracing that makes it possible to add a hint called promote to the original program that tells the tracer to feed back a runtime value into the trace and freeze it there. This extension would be impossible to do in the partial evaluator from the last post, because partial evaluation is done strictly before run time, so if a variable isn't already known, its likely runtime value cannot be found out.

In the next post I will show how to optimize traces before executing them and how the optimizer for traces is related to partial evaluation.

Saturday, January 28, 2012

NumPyPy status update

Hello.

This is just a quick status update on the NumPy in PyPy project that very recently became my day job. I should give my thanks once again to Getco, Nate Lawson and other contributors who donated more than $40000 towards the goal.

Recently we (Alex Gaynor, Matti Picus and me) implemented a few interesting things that a lot of people use:

  • more ufuncs
  • most ufuncs now accept the axis parameter (except all and any)
  • fixed string representation of arrays, now it's identical to numpy (uses pretty much the same code)
  • ndarray.flat should be working correctly
  • ndarray.flatten, ndarray.ravel, ndarray.take
  • indexing arrays by boolean arrays of the same size
  • and various bugfixes.

We would also like to introduce the nightly report of numpy status. This is an automated tool that does package introspection. While it gives some idea of how much of numpy is implemented, it is by no means the authority. Your tests should be the authority. It won't report whether functions support all kinds of parameters (for example, masked arrays and the out parameter are completely unsupported) or whether the functions work at all. We also reserve the right to incorporate jokes on that website, so don't treat it too seriously overall :-)

Thanks, and stay tuned. We hope to post here regular updates on the progress.

Cheers,
fijal & the PyPy team

Friday, January 27, 2012

Py3k and Numpy First Stage: Thanks to all who Gave

Last year was quite successful for PyPy fundraising through the Software Freedom Conservancy, and Conservancy and PyPy are very excited to announce that enough was raised to begin the first stages on the Py3k and Numpy grant proposals.

As of the end of 2011, 135 different individuals gave to the Py3k campaign, and 114 to the Numpy campaign. We thank each of you who donated to help make this work possible. Meanwhile, if you haven't given to support these projects, we do hope you'll give generously now to help fund their second stages later this year!

We're also especially excited that a few donors gave particularly large donations to support this work; those big donations really filled in the gap to help us get started!

Specifically, we're pleased to announce that Google donated $35000 towards implementing Python 3 in PyPy. Google's general support of the Python community is well known, and their specific support of our grant proposal is much appreciated.

Meanwhile, Numpy was supported in part by contributions from Nate Lawson, Cantab Capital Partners, and Getco, as well as more than a hundred other contributors.

With these donations, combined with many others, we're now starting work on both projects. This week, the Conservancy signed contracts with Antonio Cuni and Benjamin Peterson to work towards the Stage 1.1 goals in the Py3k proposal (and is negotiating for another contractor as well), and with Maciej Fijałkowski to work towards the Stage 1 goals in the Numpy proposal.

In 2012, PyPy will continue regular sprint meetings, at which Py3K and Numpy efforts will certainly have a place. We have some limited funds to fund travels of contributors to those meetings.

We're very thankful for all who donated so far to support these efforts, and we hope that now that work has begun, even more donors will come forward to help us finish the job. In the meantime, watch for the commits showing up from these developers and other contributors in the PyPy repositories!

Cheers, The PyPy Team

Thursday, January 26, 2012

Comparing Partial Evaluation and Tracing, Part 1

As part of writing my PhD I am currently thinking about the relationship of PyPy's meta-tracing approach with various previous ideas to automatically get a (JIT-)compiler from only an interpreter of a language. One of the most-researched ideas along these lines is that of partial evaluation. Partial evaluation has basically the same goals as PyPy when it comes to compilers: write an interpreter, and get a compiler for free. The methods for reaching that goal are a bit different. In this series of blog posts, I am trying to explore the similarities and differences of partial evaluation and PyPy's meta-tracing.

A Flowgraph Language

To be able to clearly understand what "partial evaluation" is and what "meta-tracing" is I will show an "executable model" of both. To that end, I am defining a small imperative language and will then show what a partial evaluator and a tracer for that language look like. All this code will be implemented in Prolog. (Any pattern-matching functional language would do, but I happen to know Prolog best. Backtracking is not used, so you can read things simply as functional programs.) In this post I will start with the definition of the language, and a partial evaluator for it. The full code from this blog post can be found here: http://paste.pocoo.org/show/541004/

The language is conceptually similar to PyPy's flow graphs, but a bit more restricted. It does not have function calls, only labelled basic blocks that consist of a series of linearly executed operations, followed by a conditional or an unconditional jump. Every operation assigns a value to a variable; the value is computed by applying some operation to some arguments.

A simple program to raise x to the yth power in that language looks like this:

power:
    res = 1
    if y goto power_rec else goto power_done

power_rec:
    res = res * x
    y = y - 1
    if y goto power_rec else goto power_done

power_done:
    print_and_stop(res)

To represent the same program as Prolog data structures, we use the following Prolog code:

block(power, op1(res, same, const(1),
             if(y, power_rec, power_done))).
block(power_rec, op2(res, mul, var(res), var(x),
                 op2(y, sub, var(y), const(1),
                 if(y, power_rec, power_done)))).
block(power_done, print_and_stop(var(res))).

Every rule of block declares one block by first giving the label of the block, followed by the code. Code is a series of op1 or op2 statements terminated by a jump, an if or a print_and_stop. op1 statements are operations with one argument of the form op1(res_variable, operation_name, argument, next_statement). Arguments can be either variables in the form var(name) or constants in the form const(value).

To run programs in this flowgraph language, we first need some helper functionality. The first few helper functions are concerned with the handling of environments, the data structures the interpreter uses to map variable names occurring in the program to the variables' current values. In Python, dictionaries would be used for this purpose, but in Prolog we have to emulate these by lists of key/value pairs (not very efficient, but good enough):

lookup(X, [], _) :- throw(key_not_found(X)).
lookup(Key, [Key/Value | _], Value) :- !.
lookup(Key, [_ | Rest], Value) :- lookup(Key, Rest, Value).

write_env([], X, V, [X/V]).
write_env([Key/_ | Rest], Key, Value, [Key/Value | Rest]) :- !.
write_env([Pair | Rest], Key, Value, [Pair | NewRest]) :- write_env(Rest, Key, Value, NewRest).

remove_env([], _, []).
remove_env([Key/_ | Rest], Key, Rest) :- !.
remove_env([Pair | Rest], Key, [Pair | NewRest]) :- remove_env(Rest, Key, NewRest).

resolve(const(X), _, X).
resolve(var(X), Env, Y) :- lookup(X, Env, Y).

The implementation of these functions is not too important. The lookup function finds a key in an environment list, the write_env function adds a new key/value pair to an environment, and remove_env removes a key. The resolve function is used to take either a constant or a variable and return a value. If it's a constant, the value of that constant is returned; if it's a variable, it is looked up in the environment. Note how the last argument of lookup and resolve is actually a return value, which is the typical approach in Prolog.
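
To get a feeling for how these helpers behave, here is a small made-up console interaction (not from the paste linked above) that writes a binding for res into an environment and then looks up y in the result:

?- write_env([x/10, y/20], res, 1, Env), lookup(y, Env, V).
Env = [x/10, y/20, res/1],
V = 20.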

So far we have not specified what the primitive operations that can occur in the program actually mean. For that we define a do_op function which executes primitive operations:

do_op(same, X, X).
do_op(mul, X, Y, Z) :- Z is X * Y.
do_op(add, X, Y, Z) :- Z is X + Y.
do_op(sub, X, Y, Z) :- Z is X - Y.
do_op(eq, X, Y, Z) :- X == Y -> Z = 1; Z = 0.
do_op(ge, X, Y, Z) :- X >= Y -> Z = 1; Z = 0.
do_op(readlist, L, I, X) :- nth0(I, L, X).
do_op(Op, _, _, _) :- throw(missing_op(Op)).

Again the last argument is an output variable.

Now we can start executing simple operations. For that an interp predicate is defined. It takes as its first argument the operation to execute and as its second argument the current environment. E.g. to execute primitive operations with one or two arguments:

interp(op1(ResultVar, Op, Arg, Rest), Env) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

interp(op2(ResultVar, Op, Arg1, Arg2, Rest), Env) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

First the arguments are resolved into values. Afterwards the operation is executed, and the result is written back into the environment. Then interp is called on the rest of the program. Similarly easy are the unconditional jump and print_and_stop:

interp(jump(L), Env) :-
    block(L, Block),
    interp(Block, Env).


interp(print_and_stop(Arg), Env) :-
    resolve(Arg, Env, Val),
    print(Val), nl.

In the unconditional jump we simply get the target block and continue executing that. To execute print_and_stop we resolve the argument, print the value and then are done.

The conditional jump is only slightly more difficult:

interp(if(V, L1, L2), Env) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        block(L2, Block)
    ;
        block(L1, Block)
    ),
    interp(Block, Env).

First the variable is looked up in the environment. If the variable is zero, execution continues at the second block, otherwise it continues at the first block.

Given this interpreter, we can execute the above example program like this, on a Prolog console:

$ swipl -s cfglang.pl
?- block(power, Block), interp(Block, [x/10, y/10]).
10000000000

Partial Evaluation of the Flowgraph Language

Let's look at what a partial evaluator for this simple flowgraph language would look like. Partial evaluation (PE), also called specialization, is a program manipulation technique. PE takes an input program and transforms it into a (hopefully) simpler and faster output program. It does this by assuming that some variables in the input program are constants. All operations that act only on such constants can be folded away. All other operations need to remain in the output program (called residual program). Thus the partial evaluator proceeds much like an interpreter, except that it cannot actually execute some operations. Also, its output is not just a value, but also a list of remaining operations that could not be optimized away.

The partial evaluator cannot use normal environments, because, unlike the interpreter, it does not know the values of all variables. It will therefore work on partial environments, which store just the known variables. For these partial environments, some new helper functions are needed:

plookup(Key, [], var(Key)).
plookup(Key, [Key/Value | _], const(Value)) :- !.
plookup(Key, [_ | Rest], Value) :- plookup(Key, Rest, Value).

presolve(const(X), _, const(X)).
presolve(var(V), PEnv, X) :- plookup(V, PEnv, X).

The function plookup takes a variable and a partial environment and returns either const(Value) if the variable is found in the partial environment or var(Key) if it is not. Similarly, presolve is like resolve, except that it uses plookup instead of lookup.
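
As a quick illustration (again a made-up query, not from the original code), plookup returns a var term for a variable that is absent from the partial environment and a const term for one that is present:

?- plookup(x, [y/5], R1), plookup(y, [y/5], R2).
R1 = var(x),
R2 = const(5).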

With these helpers we can start writing a partial evaluator. The following two rules are where the main optimization in the form of constant folding happens. The idea is that when the partial evaluator sees an operation that involves only constant arguments, it can constant-fold the operation, otherwise it can't:

pe(op1(ResultVar, Op, Arg, Rest), PEnv, NewOp) :-
    presolve(Arg, PEnv, RArg),
    (RArg = const(C) ->
        do_op(Op, C, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op1(ResultVar, Op, RArg, RestResidual)
    ),
    pe(Rest, NEnv, RestResidual).

pe(op2(ResultVar, Op, Arg1, Arg2, Rest), PEnv, NewOp) :-
    presolve(Arg1, PEnv, RArg1),
    presolve(Arg2, PEnv, RArg2),
    (RArg1 = const(C1), RArg2 = const(C2) ->
        do_op(Op, C1, C2, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp

    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op2(ResultVar, Op, RArg1, RArg2, RestResidual)
    ),
    pe(Rest, NEnv, RestResidual).

The pe predicate takes the current operation and a partial environment, and potentially returns a new operation. To partially evaluate a simple operation, its arguments are looked up in the partial environment. If all the arguments are constants, the operation can be executed, and no new operation is produced. Otherwise, we need to produce a new residual operation which is exactly like the one currently looked at. Also, the result variable needs to be removed from the partial environment, because it was just overwritten by an unknown value.

The potentially generated residual operation is stored into the output argument NewOp. The output argument of the recursive call is the last argument of the newly created residual operation, which will then be filled by the recursive call. This is a typical approach in Prolog, but may look strange if you are not familiar with it.

Note how the first case of these two rules is just like interpretation. The second case doesn't really do anything, it just produces a residual operation. This relationship between normal evaluation and partial evaluation is very typical.

The unconditional jump and print_and_stop are not much more complex:

pe(jump(L), PEnv, jump(LR)) :-
    do_pe(L, PEnv, LR).

pe(print_and_stop(Arg), Env, print_and_stop(RArg)) :-
    presolve(Arg, Env, RArg).

To partially evaluate an unconditional jump we again produce a jump. The target label of that residual jump is computed by asking the partial evaluator to produce residual code for the label L with the given partial environment. print_and_stop is simply turned into a print_and_stop. We will see the code for do_pe soon.

Conditional jumps are more interesting:

pe(if(V, L1, L2), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        (C = 0 ->
            L = L2
        ;
            L = L1
        ),
        do_pe(L, PEnv, LR),
        NewOp = jump(LR)
    ;
        do_pe(L1, PEnv, L1R),
        do_pe(L2, PEnv, L2R),
        NewOp = if(V, L1R, L2R)
    ).

First we look up the value of the condition variable. If it is a constant, we can produce better code, because we know statically that only one path is reachable. Thus we produce code for that path, and then emit an unconditional jump there. If the condition variable is not known at partial evaluation time, we need to partially evaluate both paths and produce a conditional jump in the residual code.

This rule is the one that causes the partial evaluator to potentially do much more work than the interpreter, because after an if sometimes both paths need to be explored. In the worst case this process never stops, so a real partial evaluator would need to ensure somehow that it terminates. There are many algorithms for doing that, but I will ignore this problem here.

Now we need to understand what the do_pe predicate is doing. Its most important task is to make sure that we don't do the same work twice by memoizing code that was already partially evaluated in the past. For that it keeps a mapping from pairs of label and partial environment to the label of the residual code:

do_pe(L, PEnv, LR) :-
    (code_cache(L, PEnv, LR) ->
        true
    ;
        gensym(L, LR),
        assert(code_cache(L, PEnv, LR)),
        block(L, Code),
        pe(Code, PEnv, Residual),
        assert(block(LR, Residual))
    ).

If the code cache indicates that label L was already partially evaluated with partial environment PEnv, then the previously generated residual code label is returned in LR. Otherwise, a new label is generated with gensym, the code cache is informed of that new label with assert, then the block is partially evaluated and the residual code is added to the database.

For those who know partial evaluation terminology: This partial evaluator is a polyvariant online partial evaluator. "Polyvariant" means that for every label, several specialized versions of the block can be generated. "Online" means that no preprocessing is done before the partial evaluator runs.

Partial Evaluation Example

With this code we can look at the classical example of partial evaluation (it's probably the "Hello World" of partial evaluation). We can ask the partial evaluator to compute a power function, where the exponent y is a fixed number, e.g. 5, and the base x is unknown:

?- do_pe(power, [y/5], LR).
LR = power1.

To find out which code was produced, we can use listing:

?- listing(code_cache).
code_cache(power, [y/5], power1).
code_cache(power_rec, [y/5, res/1], power_rec1).
code_cache(power_rec, [y/4], power_rec2).
code_cache(power_rec, [y/3], power_rec3).
code_cache(power_rec, [y/2], power_rec4).
code_cache(power_rec, [y/1], power_rec5).
code_cache(power_done, [y/0], power_done1).

?- listing(block).
.... the block definition of the user program ....
block(power_done1, print_and_stop(var(res))).
block(power_rec5, op2(res, mul, var(res), var(x), jump(power_done1))).
block(power_rec4, op2(res, mul, var(res), var(x), jump(power_rec5))).
block(power_rec3, op2(res, mul, var(res), var(x), jump(power_rec4))).
block(power_rec2, op2(res, mul, var(res), var(x), jump(power_rec3))).
block(power_rec1, op2(res, mul, const(1), var(x), jump(power_rec2))).
block(power1, jump(power_rec1)).

The code_cache tells which residual labels correspond to which original labels under which partial environments. Thus, power1 contains the code of power under the assumption that y is 5. Looking at the block listing, the label power1 corresponds to code that simply multiplies res by x five times without using the variable y at all. The loop that was present in the original program has been fully unrolled, and the loop variable y has disappeared. Hopefully this is faster than the original program.
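
As a sanity check (an extra query, not part of the original post), the residual program can be run with the normal interpreter, supplying only a value for x; it computes x to the power of 5 without ever touching y:

?- block(power1, Block), interp(Block, [x/10]).
100000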

Conclusion

In this blog post we saw an interpreter for a simple flow graph language in Prolog, together with a partial evaluator for it. The partial evaluator essentially duplicates every rule of the interpreter. If all the arguments of the current operation are known, it acts like the interpreter, otherwise it simply copies the operation into the residual code.

Partial evaluation can be used for a variety of applications, but the most commonly cited one is that of applying it to an interpreter. To do that, the program that the interpreter runs is assumed to be constant by the partial evaluator. Thus a specialized version of the interpreter is produced that does not use the input program at all. That residual code can be seen as a compiled version of the input program.

In the next blog post in this series we will look at writing a simple tracer for the same flowgraph language.

Sunday, January 15, 2012

PyPy internship at NCAR

Hello, everyone

I would like to inform you that there is a very interesting opportunity for doing an internship at NCAR in the lovely town of Boulder, situated in the foothills of the Rocky Mountains. Before you read on, make sure you:

  • are a student of a US University, who is legally eligible to work in the US
  • are at least finishing your second year this year
  • apply before February 3rd.

The internship itself will focus on using PyPy (in some way) to provide a high performance numeric kernel for an atmospheric model, and measuring how fast we can go. This is very much in line with what the current effort on NumPy in PyPy is about. The internship will be mentored by Davide del Vento and I hope to have some influence over where it goes myself :-)

A few interesting links:

Feel free to contact Davide for details about the proposal and pypy-dev or me directly for details about PyPy.

Cheers, fijal

Saturday, January 14, 2012

Transactional Memory (II)

Here is an update about the previous blog post about the Global Interpreter Lock (GIL). In the five months since that post, the point of view has changed quite a bit.

Let me remind you that the GIL is the technique used in both CPython and PyPy to safely run multi-threaded programs: it is a global lock that prevents multiple threads from actually running at the same time. The reason to do that is that it would have disastrous effects in the interpreter if several threads access the same object concurrently --- to the point that in CPython even just manipulating the object's reference counter needs to be protected by the lock.

So far, the ultimate goal to enable true multi-CPU usage has been to remove the infamous GIL from the interpreter, so that multiple threads could actually run in parallel. It's a lot of work, but this has been done in Jython. The reason that it has not been done in CPython so far is that it's even more work: we would need to care not only about carefully adding fine-grained locks everywhere, but also about reference counting; and there are a lot more C extension modules that would need care, too. And we don't have locking primitives as performant as Java's, which have been hand-tuned for ages (e.g. to use help from the JIT compiler).

But we think we have a plan to implement a different model for using multiple cores. Believe it or not, this is better than just removing the GIL from PyPy. You might get to use all your cores without ever writing threads.

You would instead just use some event dispatcher, say from Twisted, from Stackless, or from your favorite GUI; or just write your own. From there, you (or someone else) would add some minimal extra code to the event dispatcher's source code, to exploit the new transactional features offered by PyPy. Then you would run your program on a special version of PyPy, and voilà: you get some form of automatic parallelization. Sounds magic, but the basic idea is simple: start handling multiple events in parallel, giving each one its own transaction. More about it later.

Threads or Events?

First, why would this be better than "just" removing the GIL? Because using threads can be a mess in any complex program. Some authors (e.g. Lee) have argued that the reason is that threads are fundamentally non-deterministic. This makes it very hard to reason about them. Basically the programmer needs to "trim" down the non-determinism (e.g. by adding locks, semaphores, etc.), and it's hard to be sure when he's got a sufficiently deterministic result, if only because he can't write exhaustive tests for it.

By contrast, consider a Twisted program. It's not a multi-threaded program, which means that it handles the "events" one after the other. The exact ordering of the events is not really deterministic, because they often correspond to external events; but that's the only source of non-determinism. The actual handling of each event occurs in a nicely deterministic way, and most importantly, not in parallel with the handling of other events. The same is true about other libraries like GUI toolkits, gevent, or Stackless.

(Of course the Twisted and the Stackless models, to cite only these two, are quite different from each other; but they have in common the fact that they are not multi-threaded, and based instead on "events" --- which in the Stackless case means running a tasklet from one switch() point to the next one.)

These two models --- threads or events --- are the two main models we have right now. The latter is more used in Python, because it is much simpler to use than the former, and the former doesn't give any benefit because of the GIL. A third model, which is the only one that gives multi-core benefits, is to use multiple processes, and do inter-process communication.

The problem

Consider the case of a big program that has arbitrary complicated dependencies. Even assuming a GIL-less Python, this is likely enough to prevent the programmer from even starting a multi-threaded rewrite, because it would require a huge mess of locks. He could also consider using multiple processes instead, but the result is annoying as well: the complicated dependencies translate into a huge mess of inter-process synchronization.

The problem can also be down-sized to very small programs, like the kind of hacks that you do and forget about. In this case, the dependencies might be simpler, but you still have to learn and use subtle locking patterns or a complex inter-process library, which is overkill for the purpose.

(This is similar to how explicit memory management is not very hard for small programs --- but still, nowadays a lot of people agree that automatic memory management is easier for programs of all sizes. I think the same will eventually be true for using multiple CPUs, but the correct solution will take time to mature, like garbage collectors did. This post is a step in hopefully the right direction :-))

Events in Transactions

Let me introduce the notion of independent events: two events are independent if they don't touch the same set of objects. In a multi-threaded world, it means that they can be executed in parallel without needing any lock to ensure correctness.

Events might also be mostly independent, i.e. they rarely access the same object concurrently. Of course, in a multi-threaded world we would still need locks to ensure correctness, but the point is that the locks are rarely causing pauses: lock contention is low.

Consider again the Twisted example I gave above. There are often several events pending in the dispatch queue (assuming the program is using 100% of our single usable CPU, otherwise the whole discussion is moot). The case I am interested in is the case in which these events are generally mostly independent, i.e. we expect few conflicts between them. However, they don't have to be proved independent. In fact it is fine if they have arbitrary complicated dependencies as described above. The point is the expected common case. Imagine that you have a GIL-less Python and that you can, by a wave of your hand, have all the careful locking mess magically done. Then what I mean here is the case in which such a theoretical program would run mostly in parallel on multiple cores, without waiting too often on the locks.

In this case, the solution I'm proposing is that with minimal tweaks in the event dispatch loop, we can handle multiple events on multiple threads, each in its own transaction. A transaction is basically a tentative execution of the corresponding piece of code: if we detect conflicts with other concurrently executing transactions, we abort the whole transaction and restart it from scratch.

By now, the fact that it can basically work should be clear: multiple transactions will only get into conflict when modifying the same data structures, which is the case where the magical wand above would have put locks. If the magical program could progress without too many locks, then the transactional program can progress without too many conflicts. In a way, you get even more than what the magical program can give you: each event is dispatched in its own transaction, which means that from each event's point of view, we have the illusion that nobody else is running concurrently. This is exactly what all existing Twisted-/Stackless-/etc.-based programs are assuming.

Note that this solution, without transactions, already exists in some other languages: for example, Erlang is all about independent events. This is the simple case where we can just run them on multiple cores, knowing by construction of the language that you can't get conflicts. Of course, it doesn't work for Python or for a lot of other languages. From that point of view, what I'm suggesting is merely that transactional memory could be a good model to cope with the risks of conflicts that come from not having a special-made language.

Not a perfect solution

Of course, transactional memory (TM) is not a perfect solution either. Right now, the biggest issue is the performance hit that comes from the software implementation (STM). In time, hardware support (HTM) is likely to show up and help mitigate the problem; but I won't deny the fact that in some cases, because it's simple enough and/or because you really need the top performance, TM is not the best solution.

Also, the explanations above are silent on what is a hard point for TM, namely system calls. The basic general solution is to suspend other transactions as soon as a transaction does its first system call, so that we are sure that the transaction will succeed. Of course this solution is far from optimal. Interestingly, it's possible to do better on a case-by-case basis: for example, by adding in-process buffers, we can improve the situation for sockets, by having recv() store in a buffer what is received so that it can be re-recv()-ed later if the transaction is aborted; similarly, send() or writes to log files can be delayed until we are sure that the transaction will commit.

From my point of view, the most important point is that the TM solution comes from the correct side of the "determinism" scale. With threads, you have to prune down non-determinism. With TM, you start from a mostly deterministic point, and if needed, you add non-determinism. The reason you would want to do so is to make the transactions shorter: shorter transactions have less risks of conflicts, and when there are conflicts, less things to redo. So making transactions shorter increases the parallelism that your program can achieve, while at the same time requiring more care.

In terms of an event-driven model, the equivalent would be to divide the response of a big processing event into several events that are handled one after the other: for example, the first event sets things up and fires the second event, which does the actual computation; and afterwards a third event writes the results back. As a result, the second event's transaction has little risks of getting aborted. On the other hand, the writing back needs to be aware of the fact that it's not in the same transaction as the original setting up, which means that other unrelated transactions may have run in-between.

One step towards the future?

These, and others, are the problems of the TM approach. They are "new" problems, too, in the sense that the existing ways of programming don't have these problems.

Still, as you have guessed, I think that it is overall a win, and possibly a big win --- a win that might be on the same scale for the age of multiple CPUs as automatic garbage collection was 20 years ago for the age of RAM size explosion.

Stay tuned for more!

--- Armin (and reviews by Antonio and Fijal)


UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.