Tuesday, February 7, 2012

Optimizing Traces of the Flow Graph Language

Part 3 of Comparing Partial Evaluation to Tracing

This is the third blog post in a series about comparing partial evaluation and tracing. In the first post of the series I introduced a small flow-graph language together with an interpreter for it. Then I showed a partial evaluator for the language. In the second post of the series I showed how a tracer for the same language works and how it relates to both execution and to partial evaluation. Then I added support for promotion to that tracer.

In this post I will show how to optimize the traces that are produced by the tracer and compare the structure of the optimizer to that of partial evaluation.

The code from this post can be found here: http://paste.pocoo.org/show/547304/

Optimizing Traces

In the last post we saw how to produce a linear trace with guards by interpreting a control flow graph program in a special mode. A trace always ends with a loop statement, which jumps back to the beginning. The tracer simply logs the operations that are performed while interpreting, so the trace can contain superfluous operations. On the other hand, through promotions the trace also captures some of the runtime values, and decisions made on them, which can be exploited by optimization. An example of this is the trace produced by the promotion example from the last post:

op2(c,ge,var(i),const(0),
guard_true(c,[],l_done,
guard_value(x,5,[],b2,
op2(x2,mul,var(x),const(2),
op2(x3,add,var(x2),const(1),
op2(i,sub,var(i),var(x3),
loop))))))

After the guard_value(x, 5, ...) operation, x is known to be 5: If it isn't 5, execution falls back to the interpreter. Therefore, operations on x after the guard can be constant-folded. To do that sort of constant-folding, an extra optimization step is needed. That optimization step walks along the trace and remembers which variables are constants and what their values are, using a partial environment. The optimizer removes operations that have only constant arguments and leaves the others in the trace. This process is actually remarkably similar to partial evaluation: Some variables are known to be constants, operations with only constant arguments are optimized away, the rest remains.

The code for optimizing operations looks as follows:

optimize(op1(ResultVar, Op, Arg, Rest), PEnv, NewOp) :-
    presolve(Arg, PEnv, RArg),
    (RArg = const(C) ->
        % the argument is a known constant: do the operation
        % at optimization time and drop it from the trace
        do_op(Op, C, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        NewOp = RestResidual
    ;
        % otherwise the operation stays in the residual trace
        % and its result is no longer known to be constant
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op1(ResultVar, Op, RArg, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

optimize(op2(ResultVar, Op, Arg1, Arg2, Rest), PEnv, NewOp) :-
    presolve(Arg1, PEnv, RArg1),
    presolve(Arg2, PEnv, RArg2),
    (RArg1 = const(C1), RArg2 = const(C2) ->
        % both arguments are known constants: constant-fold
        do_op(Op, C1, C2, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        NewOp = RestResidual
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op2(ResultVar, Op, RArg1, RArg2, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

Just like partial evaluation! It even reuses the helper function presolve from the partial evaluator, as well as a partial environment PEnv. When the arguments of the operation are known constants in the partial environment, the operation can be executed at optimization time and removed from the trace. Otherwise, the operation has to stay in the output trace. The result variable (as in the partial evaluator) needs to be removed from the partial environment, because it was just overwritten by an unknown result.
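For reference, the reused helpers look roughly as follows (a sketch reconstructed from the earlier posts of the series, so the details may differ slightly from the code linked above): presolve resolves an argument to a constant where possible, plookup looks a variable up in the partial environment, and write_env and remove_env update it.

presolve(const(X), _, const(X)).
presolve(var(V), PEnv, X) :- plookup(V, PEnv, X).

% variables not in the partial environment stay symbolic
plookup(Name, [], var(Name)).
plookup(Name, [Name/Value | _], const(Value)) :- !.
plookup(Name, [_ | Rest], Value) :- plookup(Name, Rest, Value).

% add a binding, or update an existing one in place
write_env([], Name, Value, [Name/Value]).
write_env([Name/_ | Rest], Name, Value, [Name/Value | Rest]) :- !.
write_env([Pair | Rest], Name, Value, [Pair | NewRest]) :-
    write_env(Rest, Name, Value, NewRest).

% drop a binding, if present
remove_env([], _, []).
remove_env([Name/_ | Rest], Name, Rest) :- !.
remove_env([Pair | Rest], Name, [Pair | NewRest]) :-
    remove_env(Rest, Name, NewRest).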

Now we need to deal with guards in the trace.

optimize(guard_true(V, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        NewOp = RestResidual
    ;
        NewOp = guard_true(V, PEnv, L, RestResidual)
    ),
    optimize(Rest, PEnv, RestResidual).

optimize(guard_false(V, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        NewOp = RestResidual,
        NEnv = PEnv
    ;
        write_env(PEnv, V, 0, NEnv),
        NewOp = guard_false(V, PEnv, L, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

When the variable that is being guarded is already known to be a constant, we can remove the guard. Note that a guard on such a constant can never fail: The tracer recorded the operations while running with real values, therefore guards on values the optimizer discovers to be constant must have succeeded during tracing.

guard_false is slightly different from guard_true: after the former we know that the argument is actually 0. After guard_true we only know that it is not equal to zero, but not which precise value it has.
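As a small illustration, consider the following made-up query (it also relies on the loop clause shown further down, since every trace ends in loop). The guard_false on c makes the optimizer record c as 0, so the later use of c is replaced by const(0):

?- optimize(guard_false(c, [], l_done,
    op2(i, add, var(i), var(c),
    loop)), [], Out).
Out = guard_false(c, [], l_done, op2(i, add, var(i), const(0), op1(c, same, const(0), loop)))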

Another point to note in the optimization of guards is that the second argument of the guard operation, which was so far always just an empty list, is now replaced by the partial environment PEnv. I will discuss further down why this is needed.

Optimizing guard_value is very similar, except that it really gives precise information about the variable involved:

optimize(guard_value(V, C, [], L, Rest), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C1) ->
        NewOp = RestResidual,
        NEnv = PEnv
    ;
        write_env(PEnv, V, C, NEnv),
        NewOp = guard_value(V, C, PEnv, L, RestResidual)
    ),
    optimize(Rest, NEnv, RestResidual).

This operation is the main way the optimizer gains knowledge of constant variables, which it then exploits to constant-fold later operations. This is a chief difference from partial evaluation: There, the values of some variables are known from the start; when optimizing traces, no variable's value is known at the beginning. Knowledge about some variables is only gained later, through guards.

Now we are missing what happens with the loop statement. In principle, it is turned into a loop statement again. However, at the loop statement a few additional operations need to be emitted. The reason is that we optimized away operations, and thus assignments, whenever the result was a constant. That means the involved variable still potentially holds some older value. The next iteration of the loop would continue with this older value, which is obviously wrong. Therefore we need to emit some assignments before the loop statement, one per entry in the partial environment:

optimize(loop, PEnv, T) :-
    generate_assignments(PEnv, T).

generate_assignments([], loop).
generate_assignments([Var/Val | Tail], op1(Var, same, const(Val), T)) :-
    generate_assignments(Tail, T).

As an example of how generate_assignments works, when the partial environment is [x/5, y/10] the following assignments are generated:

?- generate_assignments([x/5, y/10], Out).
Out = op1(x, same, const(5), op1(y, same, const(10), loop)).

That's all the code of the optimizer. While the basic structure is quite similar to partial evaluation, it's a lot less complex as well. What made the partial evaluator hard was that it needed to deal with control flow statements and with making sure that code is reused if the same block is partially evaluated with the same constants. Here, all these complexities go away. The tracer has already removed all control flow and replaced it with guards and one loop operation at the end. Thus, the optimizer can simply do one pass over the operations, removing some (with some extra care around the loop statement).
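Before turning to a full example, here is a minimal made-up query showing a guard being removed outright. Once guard_value has established the value of x, the subsequent guard on x is a guaranteed success and disappears from the residual trace; only the regenerated assignment before loop remains:

?- optimize(guard_value(x, 0, [], b2,
    guard_false(x, [], l_done,
    loop)), [], Out).
Out = guard_value(x, 0, [], b2, op1(x, same, const(0), loop))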

With this machinery in place, we can optimize the trace from the promotion example of the last post:

?- optimize(
    guard_value(x,3,[],b2,
    op2(x2,mul,var(x),const(2),
    op2(x3,add,var(x2),const(1),
    op2(i,sub,var(i),var(x3),
    op2(c,ge,var(i),const(0),
    guard_true(c,[],l_done, loop)))))),
    [],
    LoopOut).
LoopOut = guard_value(x, 3, [], b2, op2(i, sub, var(i), const(7), op2(c, ge, var(i), const(0), guard_true(c, [x/3, x2/6, x3/7], l_done, op1(x, same, const(3), op1(x2, same, const(6), op1(x3, same, const(7), loop)))))))

More readably, the optimized version is:

guard_value(x, 3, [], b2,
op2(i, sub, var(i), const(7),
op2(c, ge, var(i), const(0),
guard_true(c, [x/3, x2/6, x3/7], l_done,
op1(x, same, const(3),
op1(x2, same, const(6),
op1(x3, same, const(7),
loop)))))))

As intended, the operations on x after the guard_value have all been removed. However, some additional assignments (to x, x2, x3) at the end have been generated as well. The assignments look superfluous, but the optimizer does not have enough information to easily recognize this. That can be fixed, but only at the cost of additional complexity. (A real system would transform the trace into static single assignment form to answer such questions.)

Resuming to the Interpreter

Why does the code above need to add the partial environment to the guards that cannot be optimized away? The reason is related to why we needed to generate assignments before the loop statement. The problem is that the optimizer removes assignments to variables when it knows the values of these variables. That means that when switching back from running the optimized trace to the interpreter, a number of variables are not updated in the environment, making the execution in the interpreter incorrect.

In the example above, this applies to the variables x2 and x3. If the second guard fails, they have not been assigned in the optimized trace. Therefore, the guard lists them and their (always constant) values.

When switching back, these assignments need to be made. Thus we need to adapt the resume_interp function from the last blog post as follows:

write_resumevars([], Env, Env).
write_resumevars([Key / Value | Rest], Env, NEnv) :-
    write_env(Env, Key, Value, Env1),
    write_resumevars(Rest, Env1, NEnv).

resume_interp(Env, ResumeVars, L) :-
    write_resumevars(ResumeVars, Env, NEnv),
    block(L, Block),
    interp(Block, NEnv).

On resuming, the ResumeVars (a former partial environment) are simply added back to the normal environment before going back to the interpreter.
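To make this concrete, suppose the guard_true in the optimized trace above fails, with the interpreter environment at that point being [x/3, i/ -4, c/0] (a made-up state, and assuming the write_env helper sketched earlier). Resuming merges the bindings attached to the guard back into the environment:

?- write_resumevars([x/3, x2/6, x3/7], [x/3, i/ -4, c/0], NEnv).
NEnv = [x/3, i/ -4, c/0, x2/6, x3/7]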

The data attached to guards, describing what needs to be done to resume to the interpreter when the guard fails, is often a very complex part of a tracing system. The data can become big, yet most guards never fail. Therefore, most real systems try hard to compress the attached data or to share it between subsequent guards.

Summary

In this post we have shown how to optimize traces by applying a variant of the partial evaluation principle: Perform all the operations that have only constant arguments, leave the others alone. However, optimizing traces is much simpler, because no control flow is involved. All the questions about control flow have already been solved by the tracing component.

In the next and final post of the series I will show a larger example of how tracing and partial evaluation can be used to optimize a small bytecode interpreter.
