Part 4 of Comparing Partial Evaluation to Tracing

This is the fourth and final blog post in a series about comparing partial evaluation and tracing. We've come a long way: In the first post of the series I showed an interpreter for a small flow-graph language together with a partial evaluator for it. In the second post I showed how a tracer for the same language works and how it relates to both execution and to partial evaluation. The third post described an optimizer for traces.

In this final post we can compare and contrast the two different approaches of tracing and partial evaluation by means of an example. The programs in the flow chart language seen so far have been rather small, so I want to give an example of a larger program: an interpreter for an extremely simple bytecode instruction set. I will look at how the partial evaluator deals with that interpreter, and what the tracer does with it. The code for that, as well as all the code of this series, can be found here: http://paste.pocoo.org/show/550282/ (some small additions have been made, such as a nicer way to print traces).

A Bytecode Interpreter

Writing programs in the flow graph language is painful, but I still want to give an example that is a bit more interesting than the tiny ones that we've seen so far. The example is an interpreter for the bytecode of a very trivial register-based language. The language has four registers, one of which is an accumulator on which all the actual operations are performed.

The opcodes of the language are:

• jump_if_a jumps to a target address when the accumulator is non-zero
• mov_a_r0, mov_a_r1, mov_a_r2 move the value of the accumulator to the respective register
• mov_r0_a, mov_r1_a, mov_r2_a move the value of a register to the accumulator
• add_r0_to_a, add_r1_to_a, add_r2_to_a add the value of the respective register to the accumulator
• decr_a decrements the accumulator
• return_a stops the program and prints the accumulator

The interpreter has a main loop that reads the opcode at the current program counter, does a (lengthy) dispatch to the right bytecode via a series of if statements and then executes that opcode. Afterwards the next opcode is treated in the same way.
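
To make that structure concrete, here is a rough sketch of the interpreter in Python rather than the flow-graph language, so the details are only an approximation; in particular, encoding the jump_if_a operand as the next list element is an assumption of this sketch:

```python
def interp(bytecode, a):
    """Sketch of the interpreter: dispatch loop plus opcode implementations."""
    pc = 0
    r0 = r1 = r2 = 0
    while True:
        opcode = bytecode[pc]        # read the opcode at the program counter
        pc = pc + 1
        # lengthy dispatch via a series of if statements
        if opcode == 'jump_if_a':
            target = bytecode[pc]    # assumed operand encoding: next element
            pc = pc + 1
            if a != 0:
                pc = target
        elif opcode == 'mov_a_r0':
            r0 = a
        elif opcode == 'mov_a_r1':
            r1 = a
        elif opcode == 'mov_a_r2':
            r2 = a
        elif opcode == 'mov_r0_a':
            a = r0
        elif opcode == 'mov_r1_a':
            a = r1
        elif opcode == 'mov_r2_a':
            a = r2
        elif opcode == 'add_r0_to_a':
            a = a + r0
        elif opcode == 'add_r1_to_a':
            a = a + r1
        elif opcode == 'add_r2_to_a':
            a = a + r2
        elif opcode == 'decr_a':
            a = a - 1
        elif opcode == 'return_a':
            return a                 # "print" by returning the accumulator
        else:
            raise ValueError('unknown opcode: %s' % opcode)
```

The real interpreter expresses exactly this dispatch as a chain of blocks connected by if statements.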

Here is a part of the source code in the flow graph language. As pseudocode:

```bytecode_loop:
opcode = bytecode[pc]
pc = pc + 1
c = opcode == 'jump_if_a'
if c goto op_jump_if_a else goto not_jump_if_a

# select the right bytecode via a long series of if statements
not_jump_if_a:
c = opcode == 'mov_a_r0'
if c goto op_mov_a_r0 else goto not_mov_a_r0
not_mov_a_r0:
c = opcode == 'mov_a_r1'
if c goto op_mov_a_r1 else goto not_mov_a_r1
...

# bytecode implementations
op_mov_a_r0:
r0 = a
goto bytecode_loop

op_jump_if_a:
c = a == 0
target = bytecode[pc]
pc += 1
if c goto bytecode_loop else goto op_jump_if_a_jump

op_jump_if_a_jump:
pc = target
goto bytecode_loop
...
```

And actually working, as Prolog facts (the full implementation can be found at the link above):

```% bytecode dispatch loop
block(bytecode_loop,
op2(opcode, readlist, var(bytecode), var(pc),
op2(pc, add, var(pc), const(1),
op2(c, eq, var(opcode), const(jump_if_a),
if(c, op_jump_if_a, not_jump_if_a))))).

% select the right bytecode via a long series of if statements
block(not_jump_if_a,
op2(c, eq, var(opcode), const(mov_a_r0),
if(c, op_mov_a_r0, not_mov_a_r0))).
block(not_mov_a_r0,
op2(c, eq, var(opcode), const(mov_a_r1),
if(c, op_mov_a_r1, not_mov_a_r1))).
...

% bytecode implementations
block(op_jump_if_a,
op2(c, eq, var(a), const(0),
op2(target, readlist, var(bytecode), var(pc),
op2(pc, add, var(pc), const(1),
if(c, bytecode_loop, op_jump_if_a_jump))))).
block(op_jump_if_a_jump,
op1(pc, same, var(target),
jump(bytecode_loop))).
block(op_mov_a_r0,
op1(r0, same, var(a), jump(bytecode_loop))).
...
```

The bytecode_loop block is the main dispatch loop. It reads an opcode out of the bytecode list at the program counter position, then has a long series of if statements that compares the current opcode to the various existing opcodes. The full code of the interpreter can be found under the link above.

The bytecodes of the interpreter don't really permit hugely complex programs, but they can be used to write the following program, which computes the square of a number:

```mov_a_r0     # r0 = a
mov_a_r1     # r1 = a
# 2:
mov_r0_a     # r0--
decr_a
mov_a_r0
mov_r2_a     # r2 += a
add_r1_to_a
mov_a_r2
mov_r0_a     # if r0!=0: goto 2
jump_if_a 2
mov_r2_a     # return r2
return_a
```
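
Following the comments, the program squares its input by repeated addition. A direct Python transcription of the register-level behaviour (a sketch; it assumes the registers start at zero and the input is at least 1):

```python
def square(n):
    """Register-level transcription of the squaring bytecode above."""
    a = n
    r0 = a                          # mov_a_r0
    r1 = a                          # mov_a_r1
    r2 = 0                          # registers start at zero
    while True:                     # label 2
        a = r0; a = a - 1; r0 = a   # mov_r0_a, decr_a, mov_a_r0: r0 -= 1
        a = r2; a = a + r1; r2 = a  # mov_r2_a, add_r1_to_a, mov_a_r2: r2 += n
        a = r0                      # mov_r0_a
        if a == 0:                  # jump_if_a 2, with the condition inverted
            break
    a = r2                          # mov_r2_a
    return a                        # return_a
```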

Partially Evaluating the Bytecode Interpreter

The partial evaluator from the first blog post can be easily used to partially evaluate the bytecode interpreter. The static input is the bytecode for computing the square, as given above, together with the initial program counter value. The dynamic input is the content of the accumulator (the number to be squared). This can be done as follows:

```?- bytecode_square(B),
Env = [bytecode/B, pc/0],
do_pe(bytecode_loop, Env, Label),
REnv = [a/16, r0/0, r1/0, r2/0],
interp(jump(Label), REnv), listing(block).
256
:- dynamic block/2.

<lots of generated code>
```

The code that is generated by the partial evaluation process is somewhat hard to read. It contains a lot of passages like this:

```...
block(op_return_a1, print_and_stop(var(a))).
block(not_decr_a1, jump(op_return_a1)).
block(not_mov_r1_a5, jump(not_mov_r2_a3)).
block(not_mov_r0_a5, jump(not_mov_r1_a5)).
block(not_mov_a_r27, jump(not_mov_r0_a5)).
block(not_mov_a_r18, jump(not_mov_a_r27)).
block(not_mov_a_r09, jump(not_mov_a_r18)).
block(not_jump_if_a11, jump(not_mov_a_r09)).
block(bytecode_loop12, jump(not_jump_if_a11)).
block(op_mov_r2_a2, op1(a, same, var(r2), jump(bytecode_loop12))).
...
```

That is, lots of blocks that do nothing but jump to another block, interspersed with some blocks that contain an actual operation. I cleaned the output up manually and got something like the following (this sort of cleanup is something a good partial evaluation system would do itself, after partial evaluation has occurred):

```block(bytecode_loop1,
op1(r0, same, var(a),
op1(r1, same, var(a),
op1(a, same, var(r0),
op2(a, sub, var(a), const(1),
op1(r0, same, var(a),
op1(a, same, var(r2),
op2(a, add, var(a), var(r1),
op1(r2, same, var(a),
op1(a, same, var(r0),
op2(c, eq, var(a), const(0),
if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))))).

block(bytecode_loop11,
op1(a, same, var(r2),
print_and_stop(var(a)))).

block(op_jump_if_a_jump1,
op1(a, same, var(r0),
op2(a, sub, var(a), const(1),
op1(r0, same, var(a),
op1(a, same, var(r2),
op2(a, add, var(a), var(r1),
op1(r2, same, var(a),
op1(a, same, var(r0),
op2(c, eq, var(a), const(0),
if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))).
```

What do we see here? The partial evaluator has generated a block bytecode_loop1, which corresponds to the initialization opcodes mov_a_r0 and mov_a_r1 together with one iteration of the loop. Then it either jumps to a copy of the main loop (label op_jump_if_a_jump1) or to block bytecode_loop11, which prints the result and then stops. The residual code does exactly what the bytecode did: it squares the accumulator and then prints the result. All uses of the bytecode and pc variables are gone.

Why did the partial evaluator produce two copies of the main loop that look the same? The reason is that in the second copy, the additional static information target = 2 is known (target is a variable in the interpreter source that stores the jump target, and it holds a known value only for very brief periods of time). This additional static information does not have any effect on the residual code, so the same code is uselessly generated twice. This is an example of overspecialization.
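
The jump-threading cleanup alluded to above is simple to implement: follow every chain of jump-only blocks, redirect jumps past them, and drop the empty blocks. A sketch in Python (the blocks-as-(ops, target)-pairs representation is a simplification invented here, not the Prolog term encoding used in this series, and conditional exits are ignored):

```python
def thread_jumps(blocks):
    """Remove blocks that consist only of a jump to another block.

    Each block is a pair (ops, target): a list of operations followed by
    an unconditional jump to the target label.  Jump-only blocks have an
    empty operation list.
    """
    def resolve(label):
        seen = set()
        # follow chains of jump-only blocks; `seen` guards against cycles
        while label not in seen and not blocks[label][0]:
            seen.add(label)
            label = blocks[label][1]
        return label

    # rewrite every jump target and drop the now-unreachable empty blocks
    return {label: (ops, resolve(target))
            for label, (ops, target) in blocks.items() if ops}
```

Applied to output like the above, the chains of jump-only blocks collapse and only the blocks with real operations survive.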

Tracing the Interpreter

In this section we will look at what happens if we try to trace the interpreter. The naive way of doing that yields traces that are not very useful, because they abort after one iteration. We will look at a way of avoiding this problem. The problems described in this section are at the core of the paper Tracing the meta-level: PyPy's tracing JIT compiler (that paper uses a slightly more advanced version of the bytecode interpreter as an example).

To trace the interpreter, it is useful to change the bytecode_loop block from above to always promote the bytecode and the pc variables, because without knowing them the trace produced is not really interesting. This is similar to making these variables static in the partial evaluation example above:

```block(bytecode_loop,
promote(bytecode, bytecode_loop_promote_bytecode)).
block(bytecode_loop_promote_bytecode,
promote(pc, bytecode_loop_promote_pc)).
block(bytecode_loop_promote_pc,
op2(opcode, readlist, var(bytecode), var(pc),
op2(pc, add, var(pc), const(1),
op2(c, eq, var(opcode), const(jump_if_a),
if(c, op_jump_if_a, not_jump_if_a))))).
...
```

The rest of the interpreter stays unchanged.

To trace the interpreter we would start naively at the bytecode_loop label, because that's the label in the interpreter that is jumped to most often (which a profiler could establish easily). The following command can be used for that (this output prints traces in a slightly more readable way than in previous blog posts):

```?- bytecode_square(B),
A = 16, Env = [bytecode/B, pc/2, a/A, r0/A, r1/A, r2/0],
do_trace(bytecode_loop, Env).
trace
guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a|...],[],bytecode_loop_promote_bytecode)
guard_value(pc,2,[],bytecode_loop_promote_pc)
op2(opcode,readlist,var(bytecode),var(pc))
op2(pc,add,var(pc),const(1))
op2(c,eq,var(opcode),const(jump_if_a))
guard_false(c,[],op_jump_if_a)
op2(c,eq,var(opcode),const(mov_a_r0))
guard_false(c,[],op_mov_a_r0)
op2(c,eq,var(opcode),const(mov_a_r1))
guard_false(c,[],op_mov_a_r1)
op2(c,eq,var(opcode),const(mov_a_r2))
guard_false(c,[],op_mov_a_r2)
op2(c,eq,var(opcode),const(mov_r0_a))
guard_true(c,[],not_mov_r0_a)
op1(a,same,var(r0))
loop

opttrace
guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a|...],[],bytecode_loop_promote_bytecode)
guard_value(pc,2,[],bytecode_loop_promote_pc)
op1(a,same,var(r0))
op1(pc,same,const(3))
op1(opcode,same,const(mov_r0_a))
op1(c,same,const(1))
loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/2, a/16, r0/16, r1/16, r2/0]
```

These traces are very short. They start with promoting the bytecode and the pc, followed by the execution of the opcode mov_r0_a, which is the one at position 2 in the given bytecode. Then they increment the pc and loop back to the beginning. Looking at the optimized trace, it is clear that the trace is essentially useless. It will run only for one iteration, because in the second iteration the pc is 3, thus the guard_value at the beginning will fail.

This problem can be solved by tracing more than just one iteration of the bytecode dispatch loop, which is called meta-tracing. To get this behaviour, in this simple example it is enough to start (and thus end) tracing at a different label, op_jump_if_a_jump. This label is hit when the interpreter executes a jump_if_a bytecode and the jump is taken. In a loop on the level of the executed bytecode program there is one such jump. Thus tracing from this label, a full loop in the bytecode program is traced, containing potentially many iterations of the bytecode dispatch loop in the control flow graph language.

Doing that yields the following:

```?- bytecode_square(B),
A = 16, Env = [bytecode/B, pc/11, a/A, r0/A, r1/A, r2/0, target/2],
do_trace(op_jump_if_a_jump, Env).
trace
op1(pc,same,var(target))
guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a|...],[],bytecode_loop_promote_bytecode)
guard_value(pc,2,[],bytecode_loop_promote_pc)
op2(opcode,readlist,var(bytecode),var(pc))
op2(pc,add,var(pc),const(1))
op2(c,eq,var(opcode),const(jump_if_a))
guard_false(c,[],op_jump_if_a)
op2(c,eq,var(opcode),const(mov_a_r0))
guard_false(c,[],op_mov_a_r0)
op2(c,eq,var(opcode),const(mov_a_r1))
guard_false(c,[],op_mov_a_r1)
op2(c,eq,var(opcode),const(mov_a_r2))
guard_false(c,[],op_mov_a_r2)
op2(c,eq,var(opcode),const(mov_r0_a))
guard_true(c,[],not_mov_r0_a)
op1(a,same,var(r0))
guard_value(pc,3,[],bytecode_loop_promote_pc)
...
lots of operations omitted
...
guard_value(pc,9,[],bytecode_loop_promote_pc)
op2(opcode,readlist,var(bytecode),var(pc))
op2(pc,add,var(pc),const(1))
op2(c,eq,var(opcode),const(jump_if_a))
guard_true(c,[],not_jump_if_a)
op2(c,eq,var(a),const(0))
op2(target,readlist,var(bytecode),var(pc))
op2(pc,add,var(pc),const(1))
guard_false(c,[],bytecode_loop)
loop

opttrace
op1(pc,same,var(target))
guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a|...],[],bytecode_loop_promote_bytecode)
guard_value(pc,2,[],bytecode_loop_promote_pc)
op1(a,same,var(r0))
op2(a,sub,var(a),const(1))
op1(r0,same,var(a))
op1(a,same,var(r2))
op2(a,add,var(a),var(r1))
op1(r2,same,var(a))
op1(a,same,var(r0))
op2(c,eq,var(a),const(0))
op1(pc,same,const(11))
op1(opcode,same,const(jump_if_a))
op1(target,same,const(2))
op1(c,same,const(0))
loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/11, a/16, r0/16, r1/16, r2/0, target/2] .
```

That looks better. The trace corresponds to the interpreter running all the bytecodes in the loop of the squaring function in the example bytecode above. The optimized code starts with two guards (checking that the bytecode is still the one for the squaring function, checking that the pc is 2) and then only does the operations that actually do the computation. No bytecode dispatching is performed, thus the interpretation overhead is fully removed, apart from the two guard_value operations at the beginning.

Many of the assignments in the trace are superfluous, e.g. all the copying back and forth between the registers r0, r1, r2 and the accumulator a. This could easily be removed by an even more intelligent optimization, for example one utilizing SSA form.
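
Short of full SSA form, even a single forward pass over the linear trace removes most of these moves: track which variable each name is currently a copy of, rewrite argument reads accordingly, and materialize a pending copy only when its source is about to be overwritten. A Python sketch (the tuple encoding of trace operations is an assumption of this sketch):

```python
def propagate_copies(trace):
    """Forward copy propagation over a linear trace.

    Operations are tuples: ('same', dst, src) is a plain move, any other
    (op, dst, arg1, arg2) computes a fresh value for dst.  Returns the
    shortened trace plus the moves still pending at the end.
    """
    copy_of = {}   # var -> variable whose current value it shares
    result = []

    def read(x):
        # rewrite a read through the recorded copies; constants pass through
        return copy_of.get(x, x) if isinstance(x, str) else x

    def clobber(dst):
        # dst is about to change: materialize moves that still depend on it
        for var in [v for v, s in copy_of.items() if s == dst]:
            result.append(('same', var, dst))
            del copy_of[var]
        copy_of.pop(dst, None)

    for op in trace:
        if op[0] == 'same':
            _, dst, src = op
            src = read(src)
            clobber(dst)
            if src != dst:
                copy_of[dst] = src        # record the copy, emit nothing
        else:
            name, dst, a1, a2 = op
            new_op = (name, dst, read(a1), read(a2))
            clobber(dst)                  # old dst value becomes stale
            result.append(new_op)
    return result, copy_of
```

The copies still pending at the end play the same role as the constant assignments at the end of the optimized trace: they fix up the environment before the trace jumps back to its start.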

Both partial evaluation and meta-tracing can be used to transform the example bytecode computing a square into a form that shows the essential computation that is going on, without the interpretation overhead. The naive partial evaluator produces lots of extra blocks that just jump around, which could be solved with a post-processing step. The tracer by itself produces uselessly short traces, but with a simple trick of starting the trace at a different point the results become a lot better.

In a real meta-tracing system, the meta-tracer would need a way for the author of the interpreter to mark which bytecode corresponds to a backward jump. It would also need better integration with the interpreter to start tracing automatically, as well as cache the traces. Additionally, it would have to deal better with guards that fail a lot, attaching new traces to the failing guards. However, all that is "just" engineering on top of the ideas presented in this series of blog posts.

High-Level Conclusion

Some concluding high-level thoughts about the similarities of tracing and partial evaluation: both try to tackle the same problem, that of automatically reducing the interpreter overhead, though their approaches are slightly different.

Tracing is very close to normal evaluation, only keeping some extra information in the process. But then, the optimizer that is used in a tracer is again very similar in structure to a partial evaluator. The task of the optimizer is much simpler though, because it does not need to deal with control flow at all, just a linear list of operations.

So in a sense tracing takes those parts of partial evaluation that work (the "just evaluate what you can, and leave the rest" part) and replaces the parts that don't work well (controlling unfolding) with a much more pragmatic mechanism. That mechanism observes actual execution runs of the program to choose control flow paths that are typical. At the same time, the tracer's focus is on loops, because they are where most programs spend significant amounts of time.

Another way to view tracing is as a form of partial evaluation that replaces the control components of a partial evaluator with an oracle (the actual execution runs) that provides the information about which paths to look at.

Already in the quite trivial interpreter here the effects of this are visible. The simple partial evaluator over-specializes the loop and produces two identical versions of it; the tracer doesn't. The tracer also generates code only for the loop itself, not for the initialization opcodes.

That's it for this series. To those who made it this far, thanks for following along. Also thanks to Samuele and Sven, who consistently gave me good feedback on the posts before I put them here.