Saturday, October 11, 2008

Düsseldorf Sprint Report Days 1-3

The Düsseldorf sprint is currently in full progress and this post will try to summarize what progress has been made in the last days. We are (again) sprinting at the STUPS group of the Düsseldorf University. You can find the sprint announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one hand, there are of course our own tests. But then we also have the CPython tests that should be run on top of pypy-c. Up to now we used a custom-made pile of hacks, held together by lots of duct-tape. It consisted of a variety of different machines running different things with different reporting solutions. Some of the old test-results can still be found on wyvern. Now we are moving to a buildbot based solution together with a custom reporter to have a view similar to the old one. Some details are not quite finished yet, but most of the things are already working rather well (currently all the results displayed are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the work done by our Summer-of-Code student, Bruno Gola, of adding CPython 2.5 features to PyPy's Python interpreter. While Bruno implemented most language features and imported the 2.5 stdlib into PyPy, a lot of details were still missing. In the last days nearly everybody worked on fixing small issues and failing stdlib tests. While doing that we tried to categorize some CPython tests as implementation dependant so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to reduce the memory behaviour of our Python interpreter. The idea is to make pypy-c a realistic option for use on embedded devices. By memory behaviour we mean both the dynamic memory usage (how much bytes does a dict or an instance take) as well as the size of the executable and details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data that a pypy-c executable contains. Our executables have the tendency to be rather large, both due to a lot of code and due to a large amount of static data. The analysis didn't give any really surprising results, the problem is mostly that we have a lot of static data originating from a bit everywhere in our program. Two big offenders are the unicodedata-module with about 750 KB of static data and the multimethod-tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc-removal. This is (for PyPy) a crucial optimization of the translation toolchain that performs escape analysis to find out which objects don't outlive the frame they were allocated in. Since RPython is garbage-collected we usually have a lot of allocations, so it is important to statically get rid of many of them. To successfully do that, some inlining is needed to give the analysis more context. This leads to the fact that we have rather aggressive inlining-settings to allow as much malloc-removal as possible. The new approach tries to inline functions only if this actually leads to the successful removal of a malloc operation. The code is not finished quite yet, so it remains to be seen how successful it will be.

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The idea is that it is better for memory-constrained-environments because it does not double the memory-requirements during collections. During the sprint Armin and Maciek worked on cleaning up the code a bit and then merging the branch. An interesting property of the mark-compact GC is that after a collection all the memory that is not currently used by the program is returned to the operating system. Right now the GC is not as fast as our more mature ones, but it probably will be the basis for future tweaking.

A small thing that was done by Alexander and Carl Friedrich to make objects smaller is to enable shared instance dictionaries also for instances of old-style classes. Before it worked only for instances of new-style classes. Shared instance dictionaries are a way to reduce the memory-usage of instances. In the optimal case, it gives the same memory-savings that __slots__ are giving, but without any behavioural changes. Conceptually it is very similar e.g. to the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8 is using (click on the link, there are nice graphics). The difference is that for them it is mostly a way to get faster attribute access, and PyPy is so far only using it form memory savings (but that might change in the future).

In parallel to all the other work, John Witulski worked tirelessly on advancing the AMD64-JIT-backend. John has the implementation of this backend as the topic of his Bachelor's thesis. He is progressing quite well (especially also considering that this is his first sizeable Python project ever), just sometimes being impaired by such annoyances as errors in the official Intel documentation. By now the backend is supporting many integer operations and control flow.

2 comments:

illume said...

Hello,

sounds like some fun sprinting :)

Have you considered mmap for some of those big memory users?

Especially for unicode stuff, which mostly won't be used for many applications, it should be a good win -- both for load time, and memory use.

Double plus extra combo win!!! for if you use multiple processes.

cu,

Paolo Giarrusso said...

I'm not sure of what you mean.
But modern operating systems (at least Linux) use on-demand loading of executables and libraries, so they never copy anything from an executable file to memory unless it is used.

In fact, process startup is implemented internally by mmap()ing the executable and libraries into the address space of the new process.

If you use multiple processes, it still works well - also data pages are shared, until some process writes to them.

For MMU-less devices, the above does not apply (and ucLinux allows Linux to run on them).
But in that case, I guess that no demand loading is available, and that mmap() copies the mapped data in memory - you need to explicitly swap in and out code segments (i.e. to use good old overlays), and no modern programming environment has direct support for them any more I guess.

You can still emulate overlays with advanced usage of linker scripts however - you put some code in a section, create variables containing the begin and end offset of that section in the linker script, and copy data in memory from that section; but I think that making relocations to that code work flawlessly is impossible, you need to always refer to the buffer containing loaded data.