Thursday, January 22, 2015

Faster, more memory efficient and more ordered dictionaries on PyPy

Hello everyone!

As of today, we merged the latest branch that brings better dictionaries to PyPy by default. The work is based on an idea by Raymond Hettinger on python-dev, with prior work done notably in Java.  It was done by Maciej Fijałkowski and Armin Rigo, with Laurence Tratt recently prodding us to finish it.  (Earlier work going in a similar direction include Alex Gaynor's work on ordered dicts in Topaz, which was also used in the Hippy VM.  Each of these pieces of work is itself based on the original dict implementation in RPython, whose origins fade in the Subversion prehistory of PyPy.)  Coincidentally, a very similar idea has been implemented in Zend PHP very recently. Zend implementation description.

This post covers the basics of design and implementation as well as some basic benchmarks.

Dictionaries are now ordered!

One surprising part is that the new design, besides being more memory efficient, is ordered by design: it preserves the insertion order.  This is not forbidden by the Python language, which allows any order.  It makes the collections.OrderedDict subclass much faster than before: it is now a thin subclass of dict.  Obviously, we recommend that any portable Python program continues to use OrderedDict when ordering is important.  Note that a non-portable program might rely on more: for example, a **keywords argument now receives the keywords in the same order as the one in which they were given in the call.  (Whether such a thing might be called a language design change or not is a bit borderline.)  The point is that Python programs that work on CPython or previous versions of PyPy should continue to work on PyPy.

There is one exception, though.  The iterators of the OrderedDict subclass are now working just like the ones of the dict builtin: they will raise RuntimeError when iterating if the dictionary was modified.  In the CPython design, the class OrderedDict explicitly doesn't worry about that, and instead you get some result that might range from correct to incorrect to crashes (i.e. random Python exceptions).

Original PyPy dictionary design

Originally, PyPy dictionaries, as well as CPython dictionaries are implemented as follows (simplified view):

struct dict {
   long num_items;
   dict_entry* items;   /* pointer to array */
}

struct dict_entry {
   long hash;
   PyObject* key;
   PyObject* value;
}

Where items is a sparse array, with 1/3 to 1/2 of the items being NULL. The average space occupied by a dictionary is 3 * WORD * 12/7 plus some small constant (the smallest dict has 8 entries, which is 8 * 3 * WORD + 2 * WORD = 26 WORDs).

New PyPy dictionary design

The new PyPy dictionary is split in two arrays:

struct dict {
    long num_items;
    variable_int *sparse_array;
    dict_entry* compact_array;
}

struct dict_entry {
    long hash;
    PyObject *key;
    PyObject *value;
}

Here, compact_array stores all the items in order of insertion, while sparse_array is a 1/2 to 2/3 full array of integers. The integers themselves are of the smallest size necessary for indexing the compact_array. So if compact_array has less than 256 items, then sparse_array will be made of bytes; if less than 2^16, it'll be two-byte integers; and so on.

This design saves quite a bit of memory. For example, on 64bit systems we can, but almost never, use indexing of more than 4 billion elements; and for small dicts, the extra sparse_array takes very little space.  For example a 100 element dict, would be on average for the original design on 64bit: 100 * 12/7 * WORD * 3 =~ 4100 bytes, while on new design it's 100 * 12/7 + 3 * WORD * 100 =~ 2600 bytes, quite a significant saving.

GC friendliness

The obvious benefit of having more compact dictionaries is an increased cache friendliness. In modern CPUs cache misses are much more costly than doing additional simple work, like having an additional level of (in-cache) indirection. Additionally, there is a GC benefit coming from it. When doing a minor collection, the GC has to visit all the GC fields in old objects that can point to young objects. In the case of large arrays, this can prove problematic since the array grows and with each minor collection we need to visit more and more GC pointers. In order to avoid it, large arrays in PyPy employ a technique called "card marking" where the GC only visits "cards" or subsets of arrays that were modified between collections. The problem with dictionaries was that by design modifications in a dictionary occur randomly, hence a lot of cards used to get invalidated. In the new design, however, new items are typically appended to the compact_array, hence invalidate much fewer cards --- which improves GC performance.  (The new sparse_array is an array of integers, so it does not suffer from the same problems.)

Deletion

Deleting entries from dictionaries is not very common, but important in a few use cases.  To preserve order, when we delete an entry, we mark the entry as removed but don't otherwise shuffle the remaining entries.  If we repeat this operation often enough, there will be a lot of removed entries in the (originally compact) array.  At this point, we need to do a "packing" operation, which moves all live entries to the start of the array (and then reindexes the sparse array, as the positions changed).  This works well, but there are use cases where previously no reindexing was ever needed, so it makes these cases a bit slower (for example when repeatedly adding and removing keys in equal number).

Benchmarks

The PyPy speed benchmarks show mostly small effect, see changes. The microbenchmarks that we did show large improvements on large and very large dictionaries (particularly, building dictionaries of at least a couple 100s of items is now twice faster) and break-even on small ones (between 20% slower and 20% faster depending very much on the usage patterns and sizes of dictionaries). The new dictionaries enable various optimization possibilities which we're going to explore in the near future.

Cheers,
fijal, arigo and the PyPy team


18 comments:

Unknown said...

This is outstanding work, PyPy team. Keep on keeping on!

Wilfred Hughes said...

Fantastic!

http://pypy.org/performance.html states that large dicts are a weakness of pypy -- is still the case overall, or is this work sufficient to favour pypy over cpython for large dict work in general?

John M. Camara said...

Wilfred - With the ordered dict changes that bullet item is no longer true.

EM Lazzarin said...

Awesome work and thanks. Pypy would be ahead of the game if PEP 468 were accepted.

JSZ said...

How is deleting an element implemented? It sounds like it would take O(n) work to remove an element from the middle of the compact array.

Armin Rigo said...

JSZ: the array gets holes. If a lot of items are deleted it can no longer be called "compact", but if it becomes too sparse it is recompacted and rehashed.

Anonymous said...

There are lots of things to like about this approach!

Did you find any problems with cache misses? With linear probing, the keys are accessed sequentially (cache friendly), but with this method the keys are accessed in random order.

Carl Friedrich Bolz-Tereick said...

@Anonymous: The old approach didn't use linear probing either, so in that regard nothing changed.

Anonymous said...

@carl - ah I see, thats interesting.

Well then, what about storing the hashes with the indices?
* Another chunk of memory saved. Only the lowest N bits need be stored that way instead of the full 64 bits. (Big assumption that rehashing on bit size change is ok)

* The nice thing is that the dense part (cache miss!) need only be accessed if the hash matches.

I think if I was doing this, I'd skip 8 bit indices and have 16 bit minimum so rehashing would be very rare.

Carl Friedrich Bolz-Tereick said...

two problems with that:

- since the hash functions can be written in python, recomputing a hash from a key is potentially expensive

- why would you want to throw away bits from the hash? comparing the full hashes as a first check to see whether equality has a chance to succeed is very useful. the equality function can again be written in python, so is potentially very slow.

Armin Rigo said...

@Anonymous: about starting at 16-bit instead of 8-bit: it doesn't give any benefit, because rehashing is needed anyway to grow the sparse table. As long as its size is at most 256, then there is no point in storing 16-bit numbers instead of 8-bit numbers. In theory we could store N-bit numbers for the optimal value of N (= 4, 6, 8, 10...) and pay only the cost of additional complexity for individual reads and writes, not for rehashing.

Anonymous said...

Ah indeed. I am thinking of implementing this in C++ which has coloured my thoughts somewhat. In my case, key equality checks are for the most part cheap. Thus the size/compute tradeoffs may be a bit different.

Thanks for your thoughts.

Dustin Boswell said...

Just curious, was there no slowdown from adding this extra level of indirection? For the case of accessing a random key from a cold dictionary, won't the lookup incur 2 cache misses now (one on each array), compared to just 1 for the original design?

Armin Rigo said...

@Durtin: there are certainly slow-downs in some cases. If the dictionary is cold, then indeed there is one extra cache miss. It seems to be quickly compensated, though, by the fact that if then you do a few more accesses to the same dict, you are likely to get less cache misses, simply because of the more compact layout. Also, the index array is often single bytes, so it can be fully in the cache very quickly.

Alhabshi3k said...

Thank you for improving pypy performance and features. Your project and method is promising in improvement weakness aspect of dynamic languages. At the same time, pypy should provide an simplicity of Python rather than diversity , where diversity is the reality but simplicity is the case.

Making dictionaries ordered by default is part of simplicity; in this effort I wish integrating the features of "defaultdict" as method and properties of the the default basic dictionary.

similar case , integrating "deque" features (as well ,method and properties) as part of pypy list datatype.

Usually I wonder why python team didn't integrate the features of these "collections" ( as they say "High-performance container datatypes" ) within original python basic datatype, as we all know , everything in Python is an Object. and I don't think it is a pythonic way to do things in diversity.

Anyhow , keep on your development and team spirit.

Armin Rigo said...

@Alhabshi3k: indeed, you're right in that "defaultdict" could be replaced with an alternate constructor of the regular dicts. I'm not sure why it is not so. For deques, it is maybe a question of performance, but particularly of underlying C-level memory layout: CPython can't easily add appendleft() and popleft() to regular lists while still keeping the same C API, notably PyList_GET_ITEM() and PySequence_Fast_ITEMS() --- though that is debatable.

We could support that in PyPy, but that is arguably more of a language change than just making dicts ordered with no new user-visible API.

Unknown said...

You say for 100 elements, the new design's compact array uses 3 * WORD * 100 memory, right? So no extra capacity whatsoever? Then what do you do when I insert another element? Allocate a new array with 3 * WORD * 101 memory and copy all data there (and write the new element at the end)? That would be highly inefficient. So I don't believe you're honest about the memory usage.

Armin Rigo said...

The actual items are stored in a list which, like a list object, is slightly overallocated. Maybe the text in the blog post missed that and it should add a "k": the average is "100 * 12/7 + 3 * WORD * 100 * k" for an average value of k around 17/16. That's around 2700 instead of 2600.