PyPy Status Blog

Part 4 of Comparing Partial Evaluation to Tracing
This is the fourth and final blog post in a series about comparing partial evaluation and
tracing. We've come a long way: In the first post of the series I showed an interpreter for a small flow-graph
language together with a partial evaluator it. In the second post I showed how a tracer for
the same language works and how it relates to both execution and to partial
evaluation. The third post described an optimizer for traces.
In this final post we can compare and contrast the two different approaches of
tracing and partial evaluation by means of an example. The programs in the flow
chart language seen so far have been rather small, so I want to give an example
of a larger program: an interpreter for an extremely simple bytecode
instruction set. I will look at how the partial evaluator deals with that
interpreter, and
what the tracer does with it. The code for
that, as well as all the code of the series can be found here: http://paste.pocoo.org/show/550282/ (some small
additions have been made, such as a nicer way to print traces).
A Bytecode Interpreter
Writing programs in the flow graph language is painful, but I still want to give
an example that is a bit more interesting than the tiny ones that we've seen so
far. The example is an interpreter for the bytecode of a very trivial
register-based language. The language has four registers, one of which is an
accumulator on which all the actual operations are performed.
The opcodes of the language are:

jump_if_a, jumps to a target address when the accumulator is non-zero
mov_a_r0, mov_a_r1, mov_a_r2 move the value of the accumulator to
the respective register
mov_r0_a, mov_r1_a, mov_r2_a move the value of a register to
the accumulator
add_r0_to_a, add_r1_to_a, add_r2_to_a add the value of the
register to the accumulator
decr_a decrement the accumulator
return_a stop the program and print the accumulator

The interpreter has a main loop that reads the opcode at the current program
counter, does a (lengthy) dispatch to the right bytecode via a series of if
statements and then executes the right opcode. Afterwards the next opcode is
treated equivalently.
Here is a part of the source code in the flow graph language. As pseudocode:
bytecode_loop:
    opcode = bytecode[pc]
    pc = pc + 1
    c = opcode == 'jump_if_a'
    if c goto op_jump_if_a else goto not_jump_if_a

# select the right bytecode via a long series of if statements
not_jump_if_a:
    c = opcode == 'mov_a_r0'
    if y goto op_mov_a_r0 else goto not_mov_a_r0
not_mov_a_r0:
    c = opcode == 'mov_a_r0'
    if y goto op_mov_a_r1 else goto not_mov_a_r1
...

# bytecode implementations
op_mov_a_r0:
    r0 = a
    goto bytecode_loop

op_jump_if_a:
    c = a == 0
    target = bytecode[pc]
    pc += 1
    if c goto bytecode_loop else goto op_jump_if_a_jump

op_jump_if_a_jump:
    pc = target
    goto bytecode_loop
...

And actually working, as Prolog facts (the full implementation can be found at
the link above):
% bytecode dispatch loop
block(bytecode_loop,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(jump_if_a),
      if(c, op_jump_if_a, not_jump_if_a))))).

% select the right bytecode via a long series of if statements
block(not_jump_if_a,
      op2(c, eq, var(opcode), const(mov_a_r0),
      if(c, op_mov_a_r0, not_mov_a_r0))).
block(not_mov_a_r0,
      op2(c, eq, var(opcode), const(mov_a_r1),
      if(c, op_mov_a_r1, not_mov_a_r1))).
...

% bytecode implementations
block(op_jump_if_a,
      op2(c, eq, var(a), const(0),
      op2(target, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      if(c, bytecode_loop, op_jump_if_a_jump))))).
block(op_jump_if_a_jump,
      op1(pc, same, var(target),
      promote(bytecode, bytecode_loop))).
block(op_mov_a_r0,
      op1(r0, same, var(a), jump(bytecode_loop))).
...

The bytecode_loop block is the main dispatch loop. It reads an opcode out of the
bytecode list at the program counter position, then has a long series of if
statements that compares the current opcode to the various existing opcodes.
The full code of the interpreter can be found under the link above.
The bytecodes of the interpreter don't really permit hugely complex
programs, but it can be used to write a program that computes the square of a
number with the following program:
mov_a_r0     # r0 = a
mov_a_r1     # r1 = a
# 2:
mov_r0_a     # r0--
decr_a
mov_a_r0
mov_r2_a     # r2 += a
add_r1_to_a
mov_a_r2
mov_r0_a     # if r0!=0: goto 2
jump_if_a 2
mov_r2_a     # return r2
return_a

Partially Evaluating the Bytecode Interpreter
The partial evaluator from the first blog post can be easily used to partially
evaluate the bytecode interpreter. The static input is the bytecode for
computing the square and the initial program counter value, as given above. The
dynamic input are the content of the accumulator (the number to be squared).
This can be done as follows:
?- bytecode_square(B),
Env = [bytecode/B, pc/0],
do_pe(bytecode_loop, Env, Label),
REnv = [a/16, r0/0, r1/0, r2/0],
interp(jump(Label), REnv), listing(block).
256
:- dynamic block/2.

<lots of generated code>

The code that is generated by the partial evaluation process is somewhat hard to
read. It contains a lot of passages like this:
...
block(op_return_a1, print_and_stop(var(a))).
block(not_decr_a1, jump(op_return_a1)).
block(not_add_r2_to_a2, jump(not_decr_a1)).
block(not_add_r1_to_a2, jump(not_add_r2_to_a2)).
block(not_add_r0_to_a3, jump(not_add_r1_to_a2)).
block(not_mov_r2_a3, jump(not_add_r0_to_a3)).
block(not_mov_r1_a5, jump(not_mov_r2_a3)).
block(not_mov_r0_a5, jump(not_mov_r1_a5)).
block(not_mov_a_r27, jump(not_mov_r0_a5)).
block(not_mov_a_r18, jump(not_mov_a_r27)).
block(not_mov_a_r09, jump(not_mov_a_r18)).
block(not_jump_if_a11, jump(not_mov_a_r09)).
block(bytecode_loop12, jump(not_jump_if_a11)).
block(op_mov_r2_a2, op1(a, same, var(r2), jump(bytecode_loop12))).
...

I.e. lots of blocks that do nothing but jump to another block, interspersed with
some blocks that contain an actual operation. I cleaned the output up manually
and got something like the following (this sort of cleanup is something a good
partial evaluation system would do itself, after partial evaluation has
occurred):
block(bytecode_loop1,
    op1(r0, same, var(a),
    op1(r1, same, var(a),
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))))).

block(bytecode_loop11,
    op1(a, same, var(r2),
    print_and_stop(var(a))).

block(op_jump_if_a_jump1,
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))).

What do we see here? The partial evaluator has generated a block bytecode_loop1,
which corresponds to the initialization opcodes mov_a_r0 and mov_a_r1 together
with one iteration of the loop. Then it either jumps to a copy of the main loop
(label op_jump_if_a_jump1) or to block bytecode_loop11, which prints the result
and then stops. The residual code does exactly what the bytecode did: It
squares the accumulator then prints that. All the uses of the bytecode and
pc variable are gone.
Why did the partial evaluator produce two copies of the main loop that
look the same? The reason for that is that in the second copy, the additional
static information target = 2 is known, where target is a variable in
the interpreter source that stores the jump target, for very brief periods of
time. This additional static information does not have any effect on the
residual code, so the same code is uselessly generated twice. This is an
example of overspecialization.
Tracing the Interpreter
In this section we will look at what happens if we try to trace the interpreter.
The naive way of doing that yields traces that are not very useful, because they
abort after one iteration. We will look at a way of avoiding this problem. The
problems described in this section are at the core of the paper Tracing the
meta-level: PyPy's tracing JIT compiler (that paper uses a slightly more
advanced version of the bytecode interpreter as an example).
To trace the interpreter, it is useful to change the bytecode_loop block from above
to always promote the bytecode and the pc variables, because without
knowing them the trace produced is not really interesting. This is similar to
making these variables static in the partial evaluation example above:
block(bytecode_loop,
      promote(bytecode, bytecode_loop_promote_bytecode)).
block(bytecode_loop_promote_bytecode,
      promote(pc, bytecode_loop_promote_pc)).
block(bytecode_loop_promote_pc,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(0),
      if(c, op_jump_if_a, not_jump_if_a))))).
...

The rest of the interpreter stays unchanged.
To trace the interpreter we would start naively at the bytecode_loop label, because
that's the label in the interpreter that is jumped to most often (which a
profiler could establish easily). The following command can be used for that
(this output prints traces in a slightly more readable way than in previous blog
posts):
?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/2, a/A, r0/A, r1/A, r2/0],
        do_trace(bytecode_loop, Env).
trace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  loop

opttrace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(3))
  op1(opcode,same,const(mov_r0_a))
  op1(c,same,const(1))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/2, a/16, r0/16, r1/16, r2/0]

These traces are very short. They start with promoting the bytecode and the
pc, followed by the execution of the opcode mov_r0_a, which is the
one at position 2 in the given bytecode. Then they increment the pc and
loop back to the beginning. Looking at the optimized trace, it is clear that the
trace is essentially useless. It will run only for one iteration, because in the
second iteration the pc is 3, thus the guard_value at the beginning
will fail.
This problem can be solved by tracing more than just one iteration of the
bytecode dispatch loop, which is called meta-tracing. To get this behaviour, in
this simple example it is enough to start (and thus end) tracing at a different
label, op_jump_if_a_jump. This label is hit when the interpreter executes a
jump_if_a bytecode and the jump is taken. In a loop on the level of the
executed bytecode program there is one such jump. Thus tracing from this label,
a full loop in the bytecode program is traced, containing potentially many
iterations of the bytecode dispatch loop in the control flow graph language.
Doing that yields the following:
?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/11, a/A, r0/A, r1/A, r2/0, target/2],
        do_trace(op_jump_if_a_jump, Env).
trace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,3,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  ...
  lots of operations ommitted
  ...
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,9,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_true(c,[],not_jump_if_a)
  op2(c,eq,var(a),const(0))
  op2(target,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  guard_false(c,[],bytecode_loop)
  loop

opttrace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op2(a,sub,var(a),const(1))
  op1(r0,same,var(a))
  op1(a,same,var(r2))
  op2(a,add,var(a),var(r1))
  op1(r2,same,var(a))
  op1(a,same,var(r0))
  op2(c,eq,var(a),const(0))
  guard_false(c,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],pc/11,opcode/jump_if_a,target/2],bytecode_loop)
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(11))
  op1(opcode,same,const(jump_if_a))
  op1(target,same,const(2))
  op1(c,same,const(0))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/11, a/16, r0/16, r1/16, r2/0, target/2] .

That looks better. The trace corresponds to the interpreter running all the
bytecodes in the loop of the squaring function in the example bytecode above.
The optimized code starts with
two guards (checking that the bytecode is still the one for the squaring
function, checking that the pc is 2) and then only does the operations
that actually do the computation. No bytecode dispatching is performed, thus the
interpretation overhead is fully removed, apart from the two guard_value
operations at the beginning.
Many of the assignments in the trace are superfluous, e.g. all the copying back
and forth between registers r1, r1, r2 and accumulator a. This
could be easily solved by an even more intelligent optimization utilizing SSA
form.
Conclusion About the Interpreter
Both partial evaluation and meta-tracing can be used to transform the example
bytecode computing a square into a form that shows the essential computation
that is going on, without the interpretation overhead. The naive partial evaluator
produces lots of extra blocks that just jump around, which could be solved with
a post-processing step. The tracer by itself produces uselessly short traces,
but with a simple trick of starting the trace at a different point the results
become a lot better.
In a real meta-tracing system, the meta-tracer would need a way for the author
of the interpreter
to mark which bytecode corresponds to a backward jump. It would also need better
integration with the interpreter to start tracing automatically, as well as
cache the traces. Additionally, it would have to deal better with guards that fail a
lot, attaching new traces to the failing guards. However, all that is "just"
engineering on top of the ideas presented in this series of blog posts.
High-Level Conclusion
Some concluding high-level thoughts about the similarities of tracing and
partial evaluation: Tracing and partial evaluation try to tackle a similar
problem, that of automatically reducing the interpreter overhead, their
approaches are slightly different though.
Tracing is very close to normal evaluation, only keeping some extra information
in the process. But then, the optimizer that is used in a tracer
is again very similar in structure to a partial evaluator. The task of the
optimizer is much simpler though, because it does not need to deal with control
flow at all, just a linear list of operations.
So in a sense tracing is taking those parts of partial evaluation that work (the
"just evaluate those things that you can, and leave the others") and replacing
the parts that don't (controlling unfolding) by a much more pragmatic mechanism.
That mechanism observes actual execution runs of the program to choose control
flow paths that are typical. At the same time, the tracer's focus is on loops,
because they are where most programs spend significant amounts of time.
Another point of view of tracing is that it is a form of partial evaluation that
replaces the control components of a partial evaluator with an oracle (the
actual execution runs) that provide the information which paths to look at.
Already in the quite trivial interpreter here the effects of this are visible.
The simple partial evaluator over-specializes the loop and produces two
identical versions of it, that aren't different. The tracer doesn't, and it
also generates only code for the loop itself, not for the initialization
opcodes.
That's it for this series. To those that made it, thanks for following along.
Also thanks to Samuele and Sven, who consistently gave me good feedback on the
posts before I put them here.

PyPy Status Blog

Monday, February 13, 2012

A Larger Example for the Flow Graph Language

Part 4 of Comparing Partial Evaluation to Tracing

A Bytecode Interpreter

Partially Evaluating the Bytecode Interpreter

Tracing the Interpreter

Conclusion About the Interpreter

High-Level Conclusion

No comments:

Post a Comment