Hello everyone!
This is the second part of the series on improvements to warmup time and memory consumption in the PyPy JIT. This post covers recent work on sharing guard resume data, which has been merged to trunk and will be part of the next official PyPy release. To understand what it does, let's start with the loop of a simple example:
```python
class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def call_method(self, z):
        return self.x + self.y + z

def f():
    s = 0
    for i in range(100000):
        a = A(i, 1 + i)
        s += a.call_method(i)

f()  # call it so the loop gets hot and the JIT traces it
```
At the entrance of the loop, we have the following set of operations:
```
guard(i5 == 4)
guard(p3 is null)
p27 = p2.co_cellvars
p28 = p2.co_freevars
guard_class(p17, 4316866008, descr=<Guard0x104295e08>)
p30 = p17.w_seq
guard_nonnull(p30, descr=<Guard0x104295db0>)
i31 = p17.index
p32 = p30.strategy
guard_class(p32, 4317041344, descr=<Guard0x104295d58>)
p34 = p30.lstorage
i35 = p34.item0
```
The above operations get executed at the entrance of the loop, each time we call f(). They ensure that all the optimizations done further down stay valid. As long as nothing out of the ordinary happens, these guards merely confirm that the world around us hasn't changed. However, if e.g. someone puts new methods on class A, any of the above guards might fail. Even though that's a very unlikely case, PyPy needs to track how to recover from such a situation. Each of those guard points needs to keep the full state of the optimizations performed, so that we can safely deoptimize and reenter the interpreter. This is vastly wasteful, since most of those guards never fail; hence some sharing between guards has been implemented.
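To make "resume data" a bit more concrete, here is a minimal sketch of what each guard conceptually has to remember. The class and attribute names below are hypothetical illustrations, not PyPy's actual data structures:

```python
# Hypothetical sketch of what resume data conceptually stores per guard;
# PyPy's real structures are far more compact and more involved.
class ResumeData(object):
    def __init__(self, bytecode_position, live_values):
        # where to continue executing in the interpreter after deoptimizing
        self.bytecode_position = bytecode_position
        # mapping from interpreter-level variables to values in the trace,
        # needed to rebuild the interpreter frame(s) on guard failure
        self.live_values = live_values
```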
We went a step further: when two guards are next to each other, or the operations in between them have no side effects, we can safely redo those operations or, simply put, resume at the previous guard. That means every now and again we execute a few extra operations, but not storing the extra info saves quite a bit of time and memory. This is similar to the approach LuaJIT takes, where it is called sparse snapshots.
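As an illustration, here is a toy version of that sharing rule. Everything here is an assumption for the sake of the sketch, not PyPy's real optimization pass: the TraceOp class, the is_guard/opname/resume_data names, and the set of side-effect-free operations are all made up.

```python
# Toy sketch: a guard reuses the previous guard's resume data whenever only
# side-effect-free operations occur in between, since those can be redone.
SIDE_EFFECT_FREE = {"int_add", "getfield_gc_pure"}  # assumed example set

class TraceOp(object):
    """Hypothetical trace operation; PyPy's real classes differ."""
    def __init__(self, opname, resume_data=None):
        self.opname = opname
        self.resume_data = resume_data

    def is_guard(self):
        return self.opname.startswith("guard")

def share_resume_data(trace):
    last_guard = None
    only_pure_since_guard = True
    for op in trace:
        if op.is_guard():
            if last_guard is not None and only_pure_since_guard:
                # share the previous guard's resume data instead of storing
                # a fresh copy; on failure, the pure ops are re-executed
                op.resume_data = last_guard.resume_data
            last_guard = op
            only_pure_since_guard = True
        elif op.opname not in SIDE_EFFECT_FREE:
            # a side-effecting operation: the next guard needs its own data
            only_pure_since_guard = False

# Usage: the second guard ends up pointing at the first guard's state.
trace = [TraceOp("guard_class", resume_data="state-1"),
         TraceOp("int_add"),
         TraceOp("guard_nonnull", resume_data="state-2")]
share_resume_data(trace)
assert trace[2].resume_data == "state-1"
```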
I've done some measurements on the annotation & rtyping steps of translating PyPy, which is a pretty memory-hungry program in which a fair bit of code gets JIT-compiled. I measured, respectively:
- total time the translation step took (annotating or rtyping)
- time spent tracing (total JIT time excluding backend time), measured at the end of rtyping
- memory the GC considers itself responsible for after the step. The real amount of memory consumed will always be larger, with the actual savings scaling by a factor in the 1.5-2x range
Here is the table:
| branch | annotation time | rtyping time | annotation memory | rtyping memory | tracing time |
|---|---|---|---|---|---|
| default | 317s | 454s | 707M | 1349M | 60s |
| sharing | 302s | 430s | 595M | 1070M | 51s |
| win | 4.8% | 5.5% | 19% | 26% | 17% |
Obviously, translating PyPy is an extreme example: the vast majority of code out there does not have nearly as many lines that get JIT-compiled. However, it's at the very least a good win for us :-)
We will continue to improve the warmup performance and keep you posted!
Cheers,
fijal
2 comments:
"when two guards are next to each other or the operations in between them don't have side effects, we can safely redo the operations or to simply put, resume in the previous guard"
Wait... "side effects", "redo"... Does this have synergies with STM?
A side-effect-free operation is one that does not have any side effects. This means that you can execute the operation again (e.g. reading a field or adding two numbers) and it will affect nothing but its result. As for redo: it has nothing to do with STM, but re-executing pure operations can sometimes be useful. In short, if you have i = a + b, you don't remember i, just a, b and the fact that i = a + b.
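To illustrate that last point, here is a tiny sketch of storing a recipe instead of a value and recomputing it only on deoptimization. The PureRecipe class and its names are made up for this example:

```python
# Made-up sketch: resume data records the recipe "i = a + b" rather than
# the value of i itself; i is recomputed only if the guard actually fails.
class PureRecipe(object):
    def __init__(self, opname, argnames):
        self.opname = opname
        self.argnames = argnames

    def recompute(self, env):
        a, b = [env[name] for name in self.argnames]
        if self.opname == "int_add":
            return a + b
        raise NotImplementedError(self.opname)

# Usage: the value 7 is rebuilt from a and b at deoptimization time.
recipe = PureRecipe("int_add", ["a", "b"])
assert recipe.recompute({"a": 3, "b": 4}) == 7
```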