Tuesday, October 11, 2011

More Compact Lists with List Strategies

Since we come closer to merging the list-strategy branch I want to try to explain this memory optimization today.

Datatypes in PyPy are stored as W_<type>Objects (e.g. W_StringObject to represent strings, W_IntObject to represent ints). This is necessary due to the dynamic nature of Python. So the actual value (e.g. string, integer) is stored inside that box, resulting in an indirection. When having a large amount of such boxed objects, for example in a list, the wasted memory can become quite large.

If you have a closer look at such lists, you will see that in many of them only one type of data is stored and only few (and smaller) lists store mixed types. Another thing to observe is that those lists often won't change the types of the objects they contain at runtime very often. For instance a list of a million integers is very unlikely to suddenly get a string appended to it.

List Strategies

The goal of this work is to write an optimization that exploits this behaviour. Instead of wrapping all items in a list, we implement lists in a way that they are optimized for storing certain (primitive) datatypes. These implementations store the content of the list in unwrapped form, getting rid of the extra indirection and wrapper objects.

One approach would be to add a level of indirection, making each W_ListObject instance point to another object that stores the actual content. For this other object, several implementations would exist, for every datatype we want to store without wrapping it (as well as a general one that deals with arbitrary content). The data layout would look something like this:

This approach has the problem that we need two indirections to get to the data and that the implementation instances need memory themselves.

What we would like to do is to make the W_ListObject point to an RPython list directly, that contains either wrapped or unwrapped data. This plan has the problem that storing different unwrapped data is not directly possible in RPython.

To solve the problem, we use the rerased RPython library module. It allows us to erase the type of an object, in this case lists, and returns something similar to void-star in C, or Object in Java. This object is then stored on the W_ListObject in the field storage. If we want to work with the list, for example to append or delete items, we need to unerase the storage again.

Example for rerase:

storage = erase([1 ,2 ,3 ,4])
# storage is an opaque object that you can do nothing with
....
l = unerase(storage)
l.clear()

Now that we know how to make the W_ListObject point directly to wrapped or unwrapped data, we need to find out how to actually do any operations on this data. This can be accomplished by adding another field to our W_ListObject. This field points to a ListStrategy object. The actual implementation of W_ListObject is now deferred to those ListStrategy classes. For instance, a W_ListObject which holds only integers will use the IntegerListStrategy.

When the type of content is being changed, we need to change the used strategy as well as the storage in compatible ways. For example when we add a string to the list of integers we need to switch to the ObjectListStrategy and change the storage to be a list of wrapped objects. Thus the currently used strategy always knows what to do with what is currently in the storage.

As you can see, we now save one level of indirections by storing some of the data unwrapped. Of course each operation on a list needs to go via the strategy, but since we save one indirection for each element stored in that list and the Strategy classes are singletons, the benefits outweigh the costs.

Currently there are only strategies for integers and strings since many lists seem to have these datatypes. Other strategies i.e for floats and unicode strings are planned. We also implemented two special strategies for empty lists and range-lists. The EmptyListStrategy's storage is None. If objects are added to the list we just switch to the appropriate strategy (determined by the item's type). RangeListsStrategies do not store any items at all. Instead they only store values describing the range of the list, i.e. start, step and length. On any operations that changes the data of the list we switch to the IntegerStrategy.

A nice side-effect of storing unwrapped datatypes is that we can implement optimized methods for certain cases. For instance, since comparison of unwrapped integers is now much faster than comparison between arbitrary objects, we can rewrite the sorting methods for lists containing integers.

Microbenchmarks

Finally here is an early overview of the memory consumption of different Python implementations: CPython, PyPy and PyPy-list which uses list-strategies. To demonstrate how powerful list-strategies can be in the best case, we wrote benchmarks that create a list of integers, a list of strings and a range-list each with one million elements each and then reads out the heap size of the process as reported by the OS.

The results are as follows:

The savings on integers and strings in this ideal case are quite big.

The benchmark for range-lists is a little unfair, since in CPython one could accomplish the same memory behaviour using xrange. However, in PyPy users won't notice that internally the list does not store all items, making it still possible to use all list methods, such as append or delete.

Conclusion

We hope that list strategies bring memory savings for applications that use homogeneous lists of primitive types. Furthermore, operations on such lists tend to be somewhat faster as well. This also integrates well with the JIT. The list strategies optimizations will be merged to the PyPy's default branch at some point in the next months. An equivalent optimization for dictionaries has already been merged (and is part of PyPy 1.6), one for sets is coming in the future.

Lukas Diekmann and Carl Friedrich Bolz

16 comments:

  1. Nice.

    But isn't there a small change in semantics to do that? If a push a python int object onto a list and then pop it back off I'll have the exact same object. But if you unwrap the object and store it as a plain int and then repop it I don't have the exact same object. I've a got a new object.

    ReplyDelete
  2. It seems to be very nice.

    By the way, are object attributes optimized the same way? Objects of the same class can be expected to frequently store data of the same type in the same attribute.
    I've found a nearly-year-old post on maps ( http://morepypy.blogspot.com/2010/11/efficiently-implementing-python-objects.html ), but it does not mention attribute value types... has this idea been considered?

    ReplyDelete
  3. I can see float support presenting some interesting challenges being emblematic of a wider issue. It would be very easy for someone to have a list of "floats" but if they populated it with any literals, most likely they'll be integer literals, missing any of the float optimization.

    For most apps this won't be a problem but if someone is trying to optimize their application they might see this as a performance heisenbug. For example they write a hard coded list and it is slow, read it from a file and it is fast.

    One approach is for there to be a document on some website (that gets out of date) that lists PyPy micro-optimizations. Someone would then need continually audit their code against that list. This doesn't seem practical.

    I've seen posted some low level visualization tools. I'd be curious how practical it would be to have a higher level profiler tool integrate with the JIT to detect patterns like the list of mixed float/int situation to flag these micro-optimizations in a more automated fashion.

    ReplyDelete
  4. Winston: Indeed, very clever of you to notice :) However, we noticed as well, going forward integers (and other primitives) identity will be a function of their value, not the identity of their box. This means that for all ints `i is x` if and only if `i == x`. This also means that `id()` is now a function of value for primitives. Don't rely on that though! Just like we don't want people relying on `i is x` if `i == x and -100 < i < 200`, we don't want people relying on this either.

    Anonymous:

    Yes, this is definitely a consideration, I keep meaning to make time to work on this.

    ReplyDelete
  5. Well interesting, SpiderMonkey is considering to implement something like this, because NaN-boxing usually wastes a lot of memory.

    ReplyDelete
  6. @Ed I think float list can accomodate a limited set of integer values (those that can be represented correctly when interpreted as float) without any issue. You would then however need to tag which one is integer and which one is float, having to keep a bitmap. That's certainly possible, but a bit of a mess.

    ReplyDelete
  7. fijal: I think better than obscure hacks like a bitmap allowing integers as floats, perhaps it would be better just to eventually have logging of when you get fallbacks like that. For eventual integration with the jitviewer of course :)

    ReplyDelete
  8. A general runtime warning system that could say things like: "list of floats decaying to list of objects because of adding int", "two ints being compared via is", etc. might be useful. That could handle any number of situations with surprising semantics or performance.

    ReplyDelete
  9. This is very interesting. I have been thinking along somewhat similar lines for a while (but for performance reasons, rather than memory size), and so have already reviewed how I use lists in my own code. In my own programs, having non-uniform data types in a list is extremely rare. However, some lists are lists of lists (or tuples or dictionaries). The most common large lists however tend to be lists of strings.

    1) If I correctly understand your explanation of what you are doing, your "list strategies" are effectively marking uniform lists as being either of one of a few known basic types (e.g. IntegerListStrategy), or just a traditional list of objects. Is that correct?

    2) Do you think there are any meaningful performance optimsations which could be gained when the list type is known in advance?

    3) What about built-in functions such as all(), any(), len(), min(), max(), etc? Would they be able to make use of this to improve their performance?

    4) Would the underlying array data format be exposed for people who want to write extensions making direct use of it (e.g. for things like SIMD libraries)?

    5) Could this allow a list to be in shared memory and directly accessed by another program?

    6) Would the new list format be compatible with a "memoryview" (as str and bytearray are)?

    7) My own earlier thoughts had involved marking a list as being of a uniform or non-uniform data when the list is created or altered, and using optimised code for the expected type for uniform lists. One sticky point however was threading, as a different type could be appended in another thread, which means that the consuming function would have to somehow be aware of this. Would your concept have a problem with threading if appending a string to an integer list suddenly required changing the underlying list strategy while another thread was accessing the same list?

    8) Python 3.x seems to favour iterators over creating lists (e.g. map, filter, range are replaced by what used to be imap, ifilter, and xrange), and generators were introduced to complement list comprehensions in order to save memory. Does this have any implications for what you are doing?

    9) Could your list concept be applied by the CPython developers to CPython? This might help ensure that any subtle semantic issues which arise as a result apply equally to CPython, rather than having people call them "Pypy bugs".

    10) What about changing the Python language semantics to allow a user to specify that a list must be of a specific uniform type, and raising a type error if an element(s) of an unexpected type is added to the list? This is actually a language feature that I would like to have in order to catch errors without having to write code to examine each individual data element (as that can be slow and error prone in itself).

    11) Finally, why is there such a large start-up memory use in your micro-benchmarks when comparing Pypy-list to CPython? Is this just general overhead from Pypy itself, or is that due to something related to converting the list format to a particular "list strategy"?

    ReplyDelete
  10. Anonymous: Wow a lot of questions, I'll try to answer them :)

    1) Yes.

    2) Probably not, you get the most performance gains when you have a large list, and if it's large the very-very-very-small initial transition is amortized over many elements.

    3) Many of those are pure-python and will thus automatically gain these benefits, max() and min() unfortunately are not.

    4) Probably not, we don't expose this data in any other place nor do we have any APIs for it.

    5) I suppose in theory, again we have no API for it.

    6) No, it wouldn't be, since that's not a part of the list API. We don't define the language, we just implement it (faster).

    7) No, there's no problem with this, you simply need to lock (or whatever the equivilant in STM) is the list and do the modifications.

    8) No, I don't think it does.

    9) Yes, it could be applied to CPython with slightly more difficulty, and it would see the memory gains. However, it would see performance losses (as you do with teh array module on CPython) because it would need to box/unbox at every iteraction, whereas teh JIT is able to remove that.

    10) Propose it to python-ideas, we don't define the language.

    11) I can't understand the charts, so I can't answer this one.

    ReplyDelete
  11. Alex: I'm the anonymous with all the questions. Thank you for your detailed answers. I completely understand that there are side issues that you don't want to deal with at this time.

    As for the possible performance effects of the proposed new list data format if applied to CPython, doing the operation: "y = reduce(operator.add, x, 0)" where x is either a list or array of 1,000,000 integers does not seem to produce a measurable difference in speed for me (Python 2.6 on 64 bit Ubuntu). Any differences seem to go either way when the test is repeated, so they seem equivalent within the margin of error. An equivalent for loop yields the same result (except for being slower, of course).

    When extracting or replacing slices for lists and arrays (e.g. "y = x[i:i + 50]" and "x[i:i + 50] = y") within a for loop, the array version seems to be significantly *faster* than the list version for large slices (e.g. 50), and approximately the same for small slices (e.g. 5).

    Theoretically, yes the implementation with array should always be slower, but I can't seem to get that result when I attempt to measure it. Perhaps I'm doing something wrong, but it appears from the (admittedly minimal) testing that I have done that significant speed penalties for CPython cannot simply be assumed.

    I realize that ultimately this isn't a matter for the Pypy developers to concern themselves with, but should the question ever arise I don't think it can be dismissed out of hand.

    ReplyDelete
  12. Some additional thoughts to @Anonymous questions:

    3) What about built-in functions such as all(), any(), len(), min(), max(), etc? Would they be able to make use of this to improve their performance?

    len does not depend on the content of the list, so it does not win. all, any, min and max could be improved, yes.

    7) My own earlier thoughts had involved marking a list as being of a uniform or non-uniform data when the list is created or altered, and using optimised code for the expected type for uniform lists. One sticky point however was threading, as a different type could be appended in another thread, which means that the consuming function would have to somehow be aware of this. Would your concept have a problem with threading if appending a string to an integer list suddenly required changing the underlying list strategy while another thread was accessing the same list?

    The JIT does indeed produce special optimized code for the type of list it is currently observing, making operations faster. The fact that another thread could change the type of the list is not a problem, because we have a GIL and thus the JIT knows at which points another thread can run.

    10) What about changing the Python language semantics to allow a user to specify that a list must be of a specific uniform type, and raising a type error if an element(s) of an unexpected type is added to the list? This is actually a language feature that I would like to have in order to catch errors without having to write code to examine each individual data element (as that can be slow and error prone in itself).

    this already exists. it's called the array module.

    11) Finally, why is there such a large start-up memory use in your micro-benchmarks when comparing Pypy-list to CPython? Is this just general overhead from Pypy itself, or is that due to something related to converting the list format to a particular "list strategy"?

    The higher startup memory is also there in the PyPy without list strategies, so those have nothing to do with it.

    ReplyDelete
  13. I too had trouble understanding the chart. The vertical axis doesn't have negative numbers to represent a delta, just ignore the signs.

    The blue area is an algebraically positive area, representing the startup memory use. The yellow area represents the memory use delta after doing the 1e6 items list operations.

    ReplyDelete
  14. Re list of floats-and-ints: a fully compatible way is to use the NaN-tagging idea from SpiderMonkey, i.e. have a special encoding of NaN that is normally not used, and that still leaves 32 bits of extra information. We would then represent ints in the list as such a NaN-encoded float. (At least it works as long as the integer is not too large, on 64-bit platforms.)

    ReplyDelete
  15. Neat!

    Nice work people. I'm amazed it's so simple do to afterall, just switch type based on what the first element is. It must be a big boon for garbage collection, too?

    ReplyDelete
  16. The benchmark measures virtual memory (don't know on which architecture); measuring RSS would be more representative of the actual amount of RAM spent storing the data. Presumably it would also be more favourable to PyPy, since moving garbage collection doubles the amount of virtual memory.

    ReplyDelete

See also PyPy's IRC channel: #pypy at freenode.net, or the pypy-dev mailing list.
If the blog post is old, it is pointless to ask questions here about it---you're unlikely to get an answer.