This is part 3 of the series on how to speed up an interpreter written with PyPy by adding JIT hints to the interpreter. Part 1 described how to control the extent of tracing. Part 2 described how to influence the optimizer with promotion and pure functions. In this post I describe a worked-out example of a small object model for a dynamic language and how to make it efficient using the hints described in the previous posts.
A Simple Object Model
To implement a dynamic language efficiently, the operations on its objects need to be fast. Most dynamic languages have object models that are made by using dictionaries everywhere. Let's look at an example of how the JIT can be made to optimize such operations.
For the purpose of this blog post we will use a very simple and bare-bones object model that just supports very simple classes and instances, without any inheritance or any fancy features. The model has classes, which contain methods. Instances have a class. Instances have their own attributes. When looking up an attribute on an instance, the instance's attributes are searched. If the attribute is not found there, the class's attributes are searched.
To implement this object model, we could use the following RPython code as part of the interpreter source code:
class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}

    def instantiate(self):
        return Instance(self)

    def find_method(self, name):
        result = self.methods.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def change_method(self, name, value):
        self.methods[name] = value


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.attributes = {}

    def getfield(self, name):
        result = self.attributes.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def write_attribute(self, name, value):
        self.attributes[name] = value

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)
In this straightforward implementation the methods and attributes are just stored in dictionaries on the classes/instances. While this object model is very simple it already contains all the hard parts of Python's object model. Both instances and classes can have arbitrary fields, and they are changeable at any time. Moreover, instances can change their class after they have been created.
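As a quick check of the lookup order, here is a minimal plain-Python sketch that repeats the two classes and exercises them. The class names "Point" and "Other", the attribute names and the values are made up for illustration; nothing here depends on the JIT.

```python
class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}

    def instantiate(self):
        return Instance(self)

    def find_method(self, name):
        result = self.methods.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def change_method(self, name, value):
        self.methods[name] = value


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.attributes = {}

    def getfield(self, name):
        result = self.attributes.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def write_attribute(self, name, value):
        self.attributes[name] = value

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)


# Hypothetical example data, chosen only for illustration.
point_class = Class("Point")
point_class.change_method("b", 41)
inst = point_class.instantiate()
inst.write_attribute("a", 1)

found_on_instance = inst.getattr("a")  # stored on the instance
found_on_class = inst.getattr("b")     # falls back to the class

# An instance attribute shadows a class entry with the same name.
inst.write_attribute("b", 2)
shadowed = inst.getattr("b")

# Instances may switch to another class after creation.
other_class = Class("Other")
other_class.change_method("c", 17)
inst.cls = other_class
after_class_change = inst.getattr("c")
```

Every one of these lookups goes through at least one dictionary, which is exactly the cost the rest of the post tries to remove.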
When using this object model in an interpreter, a huge amount of time will be spent doing lookups in these dictionaries. To make the language efficient using a tracing JIT, we need to find a way to get rid of these dictionary lookups somehow.
Let's assume we trace through code that sums three attributes, such as:
inst.getattr("a") + inst.getattr("b") + inst.getattr("c")
The trace could look like this:
# inst.getattr("a")
attributes1 = inst.attributes
result1 = dict.get(attributes1, "a")
guard(result1 is not None)
# inst.getattr("b")
attributes2 = inst.attributes
v1 = dict.get(attributes2, "b")
guard(v1 is None)
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2
# inst.getattr("c")
attributes3 = inst.attributes
v3 = dict.get(attributes3, "c")
guard(v3 is None)
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)
v4 = v2 + result3
return(v4)
In this example, the attribute a is found on the instance, but the attributes b and c are found on the class. The trace indeed contains five calls to dict.get, which is slow.
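The count of five lookups can be reproduced outside the JIT with a small instrumented sketch. CountingDict is a hypothetical helper introduced only for this illustration; it counts calls to get() on the dictionaries of the naive object model.

```python
class CountingDict(dict):
    """Illustration aid (not part of the object model): counts get() calls."""
    lookups = 0  # one counter shared by all CountingDict objects

    def get(self, key, default=None):
        CountingDict.lookups += 1
        return dict.get(self, key, default)


class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = CountingDict()

    def find_method(self, name):
        result = self.methods.get(name)
        if result is not None:
            return result
        raise AttributeError(name)


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.attributes = CountingDict()

    def getfield(self, name):
        result = self.attributes.get(name)
        if result is not None:
            return result
        raise AttributeError(name)

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)


# Same situation as in the trace: a lives on the instance, b and c on the class.
cls = Class("Example")
cls.methods["b"] = 41
cls.methods["c"] = 17
inst = Instance(cls)
inst.attributes["a"] = 1

CountingDict.lookups = 0
total = inst.getattr("a") + inst.getattr("b") + inst.getattr("c")
lookups = CountingDict.lookups  # one for a, two each for b and c
```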
Making Instance Attributes Faster Using Maps
The first step in making getattr faster in our object model is to optimize away the dictionary lookups on the instances. The hints we have looked at in the two earlier blog posts don't seem to help with the current object model. There is no pure function to be seen, and the instance is not a candidate for promotion, because there tend to be many instances.
This is a common problem when trying to apply hints. Often, the interpreter needs a small rewrite to expose the pure functions and nearly-constant objects that are implicitly there. In the case of instance fields this rewrite is not entirely obvious. The basic idea is as follows. In theory instances can have arbitrary fields. In practice however many instances share their layout (i.e. their set of keys) with many other instances.
Therefore it makes sense to factor the layout information out of the instance implementation into a shared object. This shared layout object is called a map. Maps are an old idea that comes originally from the SELF language. They are also used by many JavaScript implementations such as V8. I've written about maps before, so I won't explain them fully again.
The rewritten Instance class using maps looks like this:
class Map(object):
    def __init__(self):
        self.attribute_indexes = {}
        self.other_maps = {}

    @purefunction
    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)

    @purefunction
    def new_map_with_additional_attribute(self, name):
        if name not in self.other_maps:
            newmap = Map()
            newmap.attribute_indexes.update(self.attribute_indexes)
            newmap.attribute_indexes[name] = len(self.attribute_indexes)
            self.other_maps[name] = newmap
        return self.other_maps[name]


EMPTY_MAP = Map()


class Instance(object):
    def __init__(self, cls):
        self.cls = cls
        self.map = EMPTY_MAP
        self.storage = []

    def getfield(self, name):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            return self.storage[index]
        raise AttributeError(name)

    def write_attribute(self, name, value):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            self.storage[index] = value
            return
        self.map = map.new_map_with_additional_attribute(name)
        self.storage.append(value)

    def getattr(self, name):
        try:
            return self.getfield(name)
        except AttributeError:
            return self.cls.find_method(name)
Instances no longer use dictionaries to store their fields. Instead, they have a reference to a map, which maps field names to indexes into a storage list. The storage list contains the actual field values. The maps are shared between objects with the same layout. Therefore they have to be immutable, which means that their getindex method is a pure function. When a new attribute is added to an instance, a new map needs to be chosen, which is done with the new_map_with_additional_attribute method on the previous map. Now that we have introduced maps, it is safe to promote the map everywhere, because we assume that the number of different instance layouts is small.
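To see the sharing in action, the following sketch runs the map code in plain Python. The purefunction and hint definitions here are no-op stand-ins that I assume are good enough outside the JIT (they are not the real RPython implementations), and the class lookup is left out to keep the example short.

```python
def purefunction(func):
    # Stand-in (assumption): outside the JIT the decorator changes nothing.
    return func


def hint(x, **kwds):
    # Stand-in (assumption): outside the JIT a hint just returns its argument.
    return x


class Map(object):
    def __init__(self):
        self.attribute_indexes = {}
        self.other_maps = {}

    @purefunction
    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)

    @purefunction
    def new_map_with_additional_attribute(self, name):
        if name not in self.other_maps:
            newmap = Map()
            newmap.attribute_indexes.update(self.attribute_indexes)
            newmap.attribute_indexes[name] = len(self.attribute_indexes)
            self.other_maps[name] = newmap
        return self.other_maps[name]


EMPTY_MAP = Map()


class Instance(object):
    # Simplified: the class reference is omitted in this sketch.
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []

    def getfield(self, name):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            return self.storage[index]
        raise AttributeError(name)

    def write_attribute(self, name, value):
        map = hint(self.map, promote=True)
        index = map.getindex(name)
        if index != -1:
            self.storage[index] = value
            return
        self.map = map.new_map_with_additional_attribute(name)
        self.storage.append(value)


p1 = Instance()
p1.write_attribute("x", 1)
p1.write_attribute("y", 2)
p2 = Instance()
p2.write_attribute("x", 10)
p2.write_attribute("y", 20)
p3 = Instance()
p3.write_attribute("y", 3)  # same names, different order
p3.write_attribute("x", 4)

shares_map = p1.map is p2.map         # same layout, same Map object
order_matters = p1.map is not p3.map  # different insertion order, different Map

map_before = p1.map
p1.write_attribute("x", 5)            # overwriting keeps the layout
map_unchanged = p1.map is map_before
```

Note that in this scheme the layout depends on the order in which attributes are written, so instances that are initialized by the same code path end up sharing a map.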
With this changed instance implementation, the trace we had above changes to the following, where 0xb74af4a8 is the memory address of the Map instance that has been promoted:
# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
index1 = Map.getindex(map1, "a")
guard(index1 != -1)
storage1 = inst.storage
result1 = storage1[index1]
# inst.getattr("b")
map2 = inst.map
guard(map2 == 0xb74af4a8)
index2 = Map.getindex(map2, "b")
guard(index2 == -1)
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2
# inst.getattr("c")
map3 = inst.map
guard(map3 == 0xb74af4a8)
index3 = Map.getindex(map3, "c")
guard(index3 == -1)
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)
v4 = v2 + result3
return(v4)
The calls to Map.getindex can be optimized away, because they are calls to a pure function and they have constant arguments. That means that index1/2/3 are constant and the guards on them can be removed. All but the first guard on the map will be optimized away too, because the map cannot have changed in between. The optimized trace looks like this:
# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
storage1 = inst.storage
result1 = storage1[0]
# inst.getattr("b")
cls1 = inst.cls
methods1 = cls1.methods
result2 = dict.get(methods1, "b")
guard(result2 is not None)
v2 = result1 + result2
# inst.getattr("c")
cls2 = inst.cls
methods2 = cls2.methods
result3 = dict.get(methods2, "c")
guard(result3 is not None)
v4 = v2 + result3
return(v4)
The index 0 that is used to read out of the storage array is the result of the constant-folded getindex call. This trace is already much better than the original one. Now we are down from five dictionary lookups to just two.
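The effect of constant-folding getindex can be imitated by hand. The sketch below is only an analogy for what the optimizer does: specialize_getfield and GuardFailed are hypothetical names, and a real JIT falls back to the interpreter when a guard fails instead of raising an exception.

```python
class GuardFailed(Exception):
    """Hypothetical signal that an assumption of the specialized code broke."""


class Map(object):
    def __init__(self, attribute_indexes):
        self.attribute_indexes = attribute_indexes

    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)


class Instance(object):
    def __init__(self, map, storage):
        self.map = map
        self.storage = storage


def specialize_getfield(seen_map, name):
    # "Trace time": the promoted map is a constant, so the pure getindex
    # call is evaluated once here and never again.
    index = seen_map.getindex(name)
    if index == -1:
        raise AttributeError(name)

    def fast_getfield(inst):
        if inst.map is not seen_map:  # guard(map1 == <promoted map>)
            raise GuardFailed()
        return inst.storage[index]    # index is a constant at this point

    return fast_getfield


layout = Map({"a": 0})
inst1 = Instance(layout, [5])
inst2 = Instance(layout, [7])
other = Instance(Map({"z": 0, "a": 1}), [0, 9])

fast_get_a = specialize_getfield(layout, "a")
value1 = fast_get_a(inst1)
value2 = fast_get_a(inst2)  # same layout, so the same fast path applies

try:
    fast_get_a(other)
    guard_failed = False
except GuardFailed:
    guard_failed = True     # another layout invalidates the assumption
```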
Versioning of Classes
Instances were optimized under the assumption that the total number of instance layouts is small compared to the number of instances. For classes we will make an even stronger assumption. We simply assume that it is rare for classes to change at all. This is not totally reasonable (sometimes classes contain counters or similar things) but for this simple example it is good enough.
What we would really like is if the Class.find_method method were pure. But it cannot be, because it is always possible to change the class itself. Every time the class changes, find_method can potentially return a new value.
Therefore, we give every class a version number, which is increased every time a class gets changed (i.e., the content of the methods dictionary changes). This means that the result of methods.get() for a given (name, version) pair will always be the same, i.e. it is a pure operation. To help the JIT to detect this case, we factor it out in a helper method which is explicitly marked as @purefunction. The refactored Class looks like this:
class VersionTag(object):
    pass


class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}
        self.version = VersionTag()

    def find_method(self, name):
        self = hint(self, promote=True)
        version = hint(self.version, promote=True)
        result = self._find_method(name, version)
        if result is not None:
            return result
        raise AttributeError(name)

    @purefunction
    def _find_method(self, name, version):
        return self.methods.get(name)

    def change_method(self, name, value):
        self.methods[name] = value
        self.version = VersionTag()
What is interesting here is that _find_method takes the version argument but it does not use it at all. Its only purpose is to make the call pure (because when the version number changes, the result of the call might be different than the previous one).
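Why the extra argument makes the call safe to fold can be shown with an explicit memo table. This is a sketch under the assumption that run-time memoization is a fair stand-in for the JIT's constant folding; _memo and stats are hypothetical names used only here.

```python
class VersionTag(object):
    pass


# Hypothetical memo table standing in for the JIT's constant folding.
_memo = {}
stats = {"real_lookups": 0}


class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}
        self.version = VersionTag()

    def find_method(self, name):
        result = self._find_method(name, self.version)
        if result is not None:
            return result
        raise AttributeError(name)

    def _find_method(self, name, version):
        # The result may be reused for as long as (class, name, version)
        # stays the same; a new version forces a fresh dictionary lookup.
        key = (self, name, version)
        if key not in _memo:
            stats["real_lookups"] += 1
            _memo[key] = self.methods.get(name)
        return _memo[key]

    def change_method(self, name, value):
        self.methods[name] = value
        self.version = VersionTag()


cls = Class("Example")
cls.change_method("b", 41)
first = cls.find_method("b")
second = cls.find_method("b")          # served from the memo table
lookups_before_change = stats["real_lookups"]

cls.change_method("b", 42)             # new VersionTag, old entry is never hit
third = cls.find_method("b")
lookups_after_change = stats["real_lookups"]
```

Without the version in the key, the second change_method call would leave a stale 41 in the table, which is the same mistake the JIT would make if find_method itself were marked pure.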
The trace with this new class implementation looks like this:
# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
index1 = Map.getindex(map1, "a")
guard(index1 != -1)
storage1 = inst.storage
result1 = storage1[index1]
# inst.getattr("b")
map2 = inst.map
guard(map2 == 0xb74af4a8)
index2 = Map.getindex(map2, "b")
guard(index2 == -1)
cls1 = inst.cls
guard(cls1 == 0xb7aaaaf8)
version1 = cls1.version
guard(version1 == 0xb7bbbb18)
result2 = Class._find_method(cls1, "b", version1)
guard(result2 is not None)
v2 = result1 + result2
# inst.getattr("c")
map3 = inst.map
guard(map3 == 0xb74af4a8)
index3 = Map.getindex(map3, "c")
guard(index3 == -1)
cls2 = inst.cls
guard(cls2 == 0xb7aaaaf8)
version2 = cls2.version
guard(version2 == 0xb7bbbb18)
result3 = Class._find_method(cls2, "c", version2)
guard(result3 is not None)
v4 = v2 + result3
return(v4)
The calls to Class._find_method can now be optimized away, and so can all promotions of the class and the version except the first ones. The final optimized trace looks like this:
# inst.getattr("a")
map1 = inst.map
guard(map1 == 0xb74af4a8)
storage1 = inst.storage
result1 = storage1[0]
# inst.getattr("b")
cls1 = inst.cls
guard(cls1 == 0xb7aaaaf8)
version1 = cls1.version
guard(version1 == 0xb7bbbb18)
v2 = result1 + 41
# inst.getattr("c")
v4 = v2 + 17
return(v4)
The constants 41 and 17 are the results of folding the _find_method calls. This final trace is now very good. It no longer performs any dictionary lookups. Instead it contains several guards. The first guard checks that the map is still the same. This guard will fail if the same code is executed with an instance that has another layout. The second guard checks that the class of inst is still the same. It will fail if the trace is executed with an instance of another class. The third guard checks that the class did not change since the trace was produced. It will fail if somebody calls the change_method method on the class.
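The three guards can be modelled in plain Python to see when each of them fires. This is a hand-written analogy rather than JIT output: build_trace, failed_guard and GuardFailed are hypothetical names, and the classes are cut down to the fields the trace touches.

```python
class GuardFailed(Exception):
    """Hypothetical stand-in for leaving the trace when a guard fails."""


class VersionTag(object):
    pass


class Map(object):
    def __init__(self, attribute_indexes):
        self.attribute_indexes = attribute_indexes


class Class(object):
    def __init__(self, methods):
        self.methods = dict(methods)
        self.version = VersionTag()

    def change_method(self, name, value):
        self.methods[name] = value
        self.version = VersionTag()


class Instance(object):
    def __init__(self, cls, map, storage):
        self.cls = cls
        self.map = map
        self.storage = storage


def build_trace(inst):
    # "Trace time": remember the promoted objects and fold the pure calls.
    seen_map = inst.map
    seen_cls = inst.cls
    seen_version = seen_cls.version
    index_a = seen_map.attribute_indexes["a"]  # folded Map.getindex
    const_b = seen_cls.methods["b"]            # folded Class._find_method
    const_c = seen_cls.methods["c"]

    def trace(inst):
        if inst.map is not seen_map:
            raise GuardFailed("map")
        result1 = inst.storage[index_a]
        cls1 = inst.cls
        if cls1 is not seen_cls:
            raise GuardFailed("class")
        if cls1.version is not seen_version:
            raise GuardFailed("version")
        return result1 + const_b + const_c

    return trace


def failed_guard(trace, inst):
    # Returns the name of the guard that failed, or None if none did.
    try:
        trace(inst)
    except GuardFailed as failure:
        return str(failure)
    return None


layout = Map({"a": 0})
cls = Class({"b": 41, "c": 17})
inst = Instance(cls, layout, [1])
trace = build_trace(inst)

result = trace(inst)
other_layout = failed_guard(trace, Instance(cls, Map({"z": 0, "a": 1}), [0, 1]))
other_class = failed_guard(trace, Instance(Class({"b": 1, "c": 2}), layout, [1]))
cls.change_method("b", 0)
changed_class = failed_guard(trace, inst)
```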
Real-World Considerations
The techniques used above for the simple object model are used for the object model of PyPy's Python interpreter too. Since Python's object model is considerably more complex, some additional work needs to be done.
The first problem that needs to be solved is that Python supports (multiple) inheritance. Therefore looking up a method in a class needs to consider the whole method resolution order. This makes the versioning of classes more complex. If a class is changed its version changes. At the same time, the versions of all the classes inheriting from it need to be changed as well, recursively. This makes class changes expensive, but they should be rare. On the other hand, a method lookup in a complex class hierarchy is as optimized in the trace as in our object model here.
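A minimal sketch of the recursive invalidation could look as follows. It is not PyPy's implementation: the bases and subclasses lists and the _bump_version helper are assumptions made for this example, and the lookup is a simple depth-first search instead of Python's real method resolution order.

```python
class VersionTag(object):
    pass


class Class(object):
    def __init__(self, name, bases=()):
        self.name = name
        self.methods = {}
        self.bases = list(bases)
        self.subclasses = []  # hypothetical back-references to subclasses
        self.version = VersionTag()
        for base in self.bases:
            base.subclasses.append(self)

    def find_method(self, name):
        # Simplified depth-first search, NOT Python's C3 linearization.
        if name in self.methods:
            return self.methods[name]
        for base in self.bases:
            try:
                return base.find_method(name)
            except AttributeError:
                pass
        raise AttributeError(name)

    def change_method(self, name, value):
        self.methods[name] = value
        self._bump_version()

    def _bump_version(self):
        # A change is visible in every subclass, so their versions change too.
        self.version = VersionTag()
        for sub in self.subclasses:
            sub._bump_version()


base = Class("Base")
child = Class("Child", [base])
grandchild = Class("GrandChild", [child])
unrelated = Class("Unrelated")

old = {c: c.version for c in (base, child, grandchild, unrelated)}
base.change_method("f", 1)

base_changed = base.version is not old[base]
child_changed = child.version is not old[child]
grandchild_changed = grandchild.version is not old[grandchild]
unrelated_untouched = unrelated.version is old[unrelated]
inherited = grandchild.find_method("f")

# Changing a subclass leaves the versions of its base classes alone.
base_version = base.version
child.change_method("g", 2)
base_untouched = base.version is base_version
```

With this scheme a trace only needs to guard on the version of the class it actually saw, no matter how deep the hierarchy is.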
A downside of the versioning of classes that we haven't yet fixed in PyPy is that some classes do change a lot. An example would be a class that keeps a counter of how many instances have been created so far. This is very slow right now, but we have ideas about how to fix it in the future.
Another optimization is that in practice the shape of an instance is correlated with its class. In our code above, we allow both to vary independently. In PyPy's Python interpreter we act somewhat more cleverly. The class of an instance is not stored on the instance itself, but on the map. This means that we get one fewer promotion (and thus one fewer guard) in the trace, because the class doesn't need to be promoted after the map has been.
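The idea of moving the class onto the map can be sketched on top of the toy object model. This only illustrates the principle and is not how PyPy's Python interpreter is actually written; in particular the empty_map field on Class is an assumption of this example.

```python
class Map(object):
    def __init__(self, cls):
        self.cls = cls  # the class now lives on the map
        self.attribute_indexes = {}
        self.other_maps = {}

    def getindex(self, name):
        return self.attribute_indexes.get(name, -1)

    def new_map_with_additional_attribute(self, name):
        if name not in self.other_maps:
            newmap = Map(self.cls)  # derived maps keep the same class
            newmap.attribute_indexes.update(self.attribute_indexes)
            newmap.attribute_indexes[name] = len(self.attribute_indexes)
            self.other_maps[name] = newmap
        return self.other_maps[name]


class Class(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}
        self.empty_map = Map(self)  # hypothetical: one root map per class


class Instance(object):
    def __init__(self, cls):
        self.map = cls.empty_map
        self.storage = []

    @property
    def cls(self):
        # No separate field: knowing the map implies knowing the class.
        return self.map.cls

    def write_attribute(self, name, value):
        index = self.map.getindex(name)
        if index != -1:
            self.storage[index] = value
            return
        self.map = self.map.new_map_with_additional_attribute(name)
        self.storage.append(value)


class_a = Class("A")
class_b = Class("B")
a1 = Instance(class_a)
a1.write_attribute("x", 1)
a2 = Instance(class_a)
a2.write_attribute("x", 2)
b1 = Instance(class_b)
b1.write_attribute("x", 3)

same_class_shares_map = a1.map is a2.map
different_class_different_map = a1.map is not b1.map
class_from_map = a1.cls is class_a and b1.cls is class_b
```

Because two instances with the same map necessarily have the same class, a single guard on the map covers both assumptions.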
More General Patterns
The techniques we used above to make instance and class lookups faster are applicable in more general cases than the one we developed them for. A more abstract view of maps is that of splitting a data-structure into a part that changes slowly, and a part that changes quickly. In the concrete example of maps we split the original dictionary into the map (the slow-changing part) and the storage array (the quick-changing part). All the computation on the slow-changing part can be constant-folded during tracing so that only the manipulation of the quick-changing part remains.
Similarly, versions can be used to constant-fold arbitrary functions of large data structures. The version needs to be updated carefully every time the result of this function can change. Therefore this is useful only if the data structure is expected to change slowly.
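As a small illustration of the version pattern outside of classes, here is a sketch of a slowly changing table whose expensive summary is recomputed only when the version changes. VersionedTable and its methods are hypothetical, and the explicit cache is a run-time stand-in for what the JIT would do by constant-folding a pure function of the promoted version.

```python
class VersionTag(object):
    pass


class VersionedTable(object):
    """Hypothetical slow-changing data structure with a version tag."""

    def __init__(self):
        self.entries = {}
        self.version = VersionTag()
        self._cached_total = None  # (version, result) of the last computation
        self.computations = 0      # counts real recomputations, for the demo

    def set(self, key, value):
        self.entries[key] = value
        # Every change that can affect total() must produce a new version.
        self.version = VersionTag()

    def total(self):
        cached = self._cached_total
        if cached is not None and cached[0] is self.version:
            return cached[1]
        self.computations += 1
        result = sum(self.entries.values())
        self._cached_total = (self.version, result)
        return result


table = VersionedTable()
table.set("a", 1)
table.set("b", 2)
first = table.total()
second = table.total()  # same version, no recomputation
computations_before = table.computations

table.set("c", 3)       # new version invalidates the cached result
third = table.total()
computations_after = table.computations
```

If set forgot to update the version, total would silently return stale results, which is why the version has to be updated carefully.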
Conclusion
In this post I showed how to use purefunction and promote to make a small but still relevant dynamic object model no longer use any dictionary lookups after tracing. Instead a number of guards are inserted into the trace to check whether the assumptions about the objects are still true. This makes operations on objects seriously faster. I plan to write another small post that shows the speed benefits for PyPy's Python interpreter for exactly these operations.
Very clever indeed.
I think an additional speedup can be achieved
by using a technique from Smalltalk interpreters: a method lookup cache.
The cache is organized so that function
cache(class, method) returns a pointer to the method.
The early Smalltalk implementors reported pretty spectacular speedups when this cache was implemented.
SO MUCH AWESOME.
@vadiml: the jit+version tags already acts as method lookup cache for jited code
it basically inlines lookup(class, method)
@RonnyPfannschmidt: thinking more about it
yes, you're right of course
I'm wondering about VersionTag(). The guard you've shown looks at its memory address. Doesn't PyPy use compacting garbage collectors? I seem to recall that from earlier posts about the cost of id().
Hmm. And now I think I know why twisted isn't any faster in pypy. I remember looking at the source a few years ago and being horrified to see that they were changing class methods during runtime. I guessed to avoid one layer of dispatch in state machines. Anyway, it's an "optimisation" that will hurt pypy.
@Marius: You are right. The trace is a bit simplified, in practice there is an indirection so that if the GC moves the object, the trace still works.
@Anonymous: can you find that place in twisted? would be very interesting to see. Also it probably means we should implement these ideas about making changing classes not quite so inefficient.