Monday, February 18, 2008

Python Finalizers Semantics, Part 1

Python's garbage collection semantics is very much historically grown and implementation-driven. Samuele Pedroni therefore likes to call it the "'there is no such thing as too much chocolate'-approach to GC semantics" :-). In this two-part post series I am going to talk about the semantics of finalization (__del__ methods) in CPython and PyPy.

The current behaviour is mostly all a consequence of the fact that CPython uses reference counting for garbage collection. The first consequence is that if several objects die at the same time, their finalizers are called in a so-called topological order, which is a feature that some GCs have that CPython offers by chance. This ensures, that in a __del__ method, all the attributes of the object didn't get their __del__ called yet. A simple example:

class B(object):
    def __init__(self, logfile):
        self.logfile = logfile
    def __del__(self):
        self.logfile.write("done doing stuff")
b = B(file("logfile.txt", "w"))

If the instance of B dies now, both it and the logfile are dead. They will get their __del__``s called and it's important that the file's ``__del__ gets called second, because otherwise the __del__ of B would try to write to a closed file.

The correct ordering happens completely automatically if you use reference counting: Setting b to None will decref the old value of b. This reduces the reference count of this instance to 0, so the finalizer will be called. After the __del__ has finished, this object will be freed and all the objects it points to decrefed as well, which decreases the reference count of the file to 0 and call its `` __del__`` as well, which closes the file.

The behaviour of PyPy's semispace and generational GCs wasn't very nice so far: it just called the finalizers in an essentially random order. Last week Armin came up with a somewhat complicated algorithm that solves this by emulating CPython's finalization order, which we subsequently implemented. So PyPy does what you expect now! The Boehm GC does a topological ordering by default, so it wasn't a problem there.

A small twist on the above is when there is a cycle of objects involving finalizers: In this case a topological ordering is not possible, so that CPython refuses to guess the finalization order and puts such cycles into gc.garbage. This would be very hard for PyPy to do, since our GC implementation is essentially independent from the Python interpreter. The same GCs work for our other interpreters after all too. Therefore we decided to break such a cycle at an arbitrary place, which doesn't sound too insane. The insane thing is for a Python program to create a cycle of objects with finalizers and depend on the order in which the finalizers are called. Don't do that :-) (After all, CPython wouldn't even call the finalizers in this case.)


SamB said...

The link to the "somewhat complicated algorithm" is a bit broken, but you can still get to it at the web archive.

Armin Rigo said...

Thanks, link updated.