Thursday, October 27, 2011

Speeding up JSON encoding in PyPy

Hi

Recently I put a bit of effort into speeding up JSON in PyPy. I started by writing a benchmark, which is admittedly not a very good one, but it's better than nothing (suggestions on how to improve it are welcome!).

For this particular benchmark, the numbers are as follows. Note that CPython by default uses the optimized C extension, while PyPy uses the pure Python one. PyPy trunk contains another pure Python version which has been optimized specifically for the PyPy JIT. Detailed optimizations are described later in this post.

The number reported is the time taken for the third run, when things are warmed up. Full session here.

CPython 2.6 22s
CPython 2.7 3.7s
CPython 2.7 no C extension 44s
PyPy 1.5 34s
PyPy 1.6 22s
PyPy trunk 3.3s

Lessons learned:

Expectations are high

A lot of performance-critical stuff in the Python world is already written in hand-optimized C. Writing C (especially when you interface with the CPython C API) is ugly and takes significant effort. This approach does not scale well when there is a lot of code to be written or when there is very tight coupling between the part to be rewritten and the rest of the code. Still, people expect PyPy to be better at "tasks", not just at running equivalent code, hence a comparison between the C extension and the pure Python version is sound. Fortunately it's possible to outperform the C extension, but it requires a bit of effort on the programmer's side as well.

Often the interface between the C and Python parts is ugly

This is very clear if you look at the json module as implemented in CPython's standard library. Not everything is in C (it would probably be just too much effort) and the interface to what is in C is guided by profiling, not by what kind of interface makes sense. This is especially evident when comparing CPython 2.6 to 2.7: just adapting the code to the interface with C made the Python version slower. Removing this clutter improves the readability a lot and improves PyPy's version a bit, although I don't have hard numbers.

JitViewer is crucial

In case you're fighting with PyPy's performance, jitviewer is worth a shot. While it's not completely trivial to understand what's going on, it'll definitely show you what kind of loops got compiled and how.

No nice and fast way to build strings in Python

PyPy has a custom thing called __pypy__.builders.StringBuilder. It has a few features that make it much easier to optimize than other ways of building strings, like str.join() or cStringIO (a small usage sketch follows the list below):

  • You can specify the initial size, which helps a lot if you can provide even a rough estimate of the size of the string (less copying)
  • Only append and build are allowed. While the string is being built you can't seek or do anything else. After it's built you can never append any more.
  • A Unicode version is available as well, as __pypy__.builders.UnicodeBuilder.
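
Here is a minimal usage sketch, assuming the PyPy-only __pypy__.builders module is importable; the size estimate and the fallback path are purely illustrative:

try:
    from __pypy__.builders import StringBuilder   # PyPy only
except ImportError:
    StringBuilder = None

def join_numbers(numbers):
    if StringBuilder is None:
        # portable fallback for CPython
        return ", ".join(str(n) for n in numbers)
    builder = StringBuilder(len(numbers) * 4)   # rough size estimate, less copying
    for i, n in enumerate(numbers):
        if i:
            builder.append(", ")
        builder.append(str(n))
    return builder.build()                      # after build(), no more appending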

Method calls are ok, immutable globals are ok

PyPy's JIT seems to be good enough for at least the simple cases. Calling methods on common infrastructure or loading globals (instead of rebinding them as locals) is fast enough and improves code readability.
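
As an illustrative sketch (the names are made up, not taken from the json module), under the JIT there is no need for the classic CPython trick of copying globals and bound methods into locals before a hot loop:

ESCAPE_DICT = {'"': '\\"', '\\': '\\\\'}   # hypothetical escape table, a module-level global

def escape(s):
    parts = []
    for c in s:
        # looking up the global dict and calling its get method inside the
        # loop is fine on PyPy; hoisting ESCAPE_DICT.get into a local first
        # would not buy much and only hurts readability
        parts.append(ESCAPE_DICT.get(c, c))
    return ''.join(parts)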

String copying is expensive

Edit: see the comment at the end

If you use re.sub, the current implementation will always create a copy of the string, even if there was no match to replace. If you know your regexp is simple, first check whether there is anything to replace at all. This is a pretty hard optimization to do automatically -- simply matching the regular expression can be too costly for it to make sense. In our particular example however, the regexp is really simple, just checking ranges of characters. It also seems that this is by far the fastest way to escape characters as of now.
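
A rough sketch of that trick (the pattern here is a made-up stand-in for the real escaping regexp):

import re

ESCAPE_RE = re.compile(r'[\\"\x00-\x1f]')   # hypothetical set of characters to escape

def escape_string(s, replace):
    if ESCAPE_RE.search(s) is None:
        return s                     # nothing to escape: avoid the copy re.sub would make
    return ESCAPE_RE.sub(replace, s)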

Generators are slower than they should be

I changed the entire thing to simply call builder.append instead of yielding to the main loop where it would be gathered. It is kind of a PyPy bug that using generators extensively is slower, but it's a bit hard to fix. Especially in cases where relatively little data is passed around (a few bytes), it makes sense to gather it first. If I were to implement an efficient version of iterencode, I would probably handle chunks of a predetermined size, about 1000 bytes, instead of yielding data every few bytes.
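
A rough sketch of what that chunking could look like (purely illustrative, not actual code from the module; the 1000-byte threshold is just the ballpark mentioned above):

def iterencode_chunked(fragments, chunk_size=1000):
    # gather small fragments and only yield once roughly chunk_size bytes
    # are ready, instead of yielding every few bytes
    buf = []
    buflen = 0
    for frag in fragments:
        buf.append(frag)
        buflen += len(frag)
        if buflen >= chunk_size:
            yield ''.join(buf)
            buf = []
            buflen = 0
    if buf:
        yield ''.join(buf)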

I must admit I worked around PyPy's performance bug

For obscure (although eventually fixable) reasons, this:

for c in s: # s is string
  del c

is faster than:

for c in s:
  pass

This is a PyPy performance bug and should be fixed, but on a different branch ;-)

PyPy's JIT is good

I was pretty surprised, but the JIT actually did make stuff work nicely. The changes were relatively minor and straightforward, once the module was cleaned up to a normal "pythonic" state. It is worth noting that it's possible to write code in Python and make it run really fast, but you have to be a bit careful. Again, jitviewer is your friend when determining why things are slow. I hope we can write more tools in the future that would guide people more automatically through potential performance pitfalls.

Cheers, fijal

Edit: I was wrong about re.sub. It just seems that the JIT handles match better than sub; this will be fixed soon.

Monday, October 17, 2011

PyPy Göteborg Post-Hallowe'en Sprint Nov 2nd - Nov 9th

The next PyPy sprint will be in Gothenburg, Sweden. It is a public sprint, suitable for newcomers. We'll focus on making a public kickoff for both the numpy/pypy integration project and the Py3k support project, as well as whatever interests the Sprint attendees. Since both of these projects are very new, there will be plenty of work suitable for newcomers to PyPy.

Other topics might include:

  • Helping people get their code running with PyPy
  • Work on an FSCons talk?
  • State of the STM Vinnova project (we will most likely, though not certainly, know by this date whether or not we are approved.)

Other Useful Dates

GothPyCon - Saturday Oct 29.

FSCONS Friday Nov 11 - Sunday Nov 12.

Location

The sprint will be held in the apartment of Laura Creighton and Jacob Hallén which is at Götabergsgatan 22 in Gothenburg, Sweden. Here is a map. This is in central Gothenburg. It is between the tram stops of Vasaplatsen and Valand, (a distance of 4 blocks) where many lines call -- the 2, 3, 4, 5, 7, 10 and 13.

Probably the cheapest option, and not too far away, is to book accommodation at SGS Veckobostader. The Elite Park Avenyn Hotel is a luxury hotel just a few blocks away. There are scores of hotels a short walk away from the sprint location, suitable for every budget, desire for luxury, and desire for the unusual. You could, for instance, stay on a boat. Options are too numerous to go into here. Just ask on the mailing list or on the blog.

Hours will be from 10:00 until people have had enough. It's a good idea to arrive a day before the sprint starts and leave a day later. In the middle of the sprint there usually is a break day and it's usually ok to take half-days off if you feel like it. Of course, many of you may be interested in sticking around for FSCons, held the weekend after the sprint.

Good to Know

Sweden is not part of the Euro zone. One SEK (krona in singular, kronor in plural) is roughly 1/10th of a Euro (9.36 SEK to 1 Euro).

The venue is central in Gothenburg. There is a large selection of places to get food nearby, from edible-and-cheap to outstanding. We often cook meals together, so let us know if you have any food allergies, dislikes, or special requirements.

Sweden uses the same kind of plugs as Germany. 230V AC.

Getting Here

If you are coming by train, you will arrive at the Central Station. It is about 12 blocks to the site from there, or you can take a tram.

There are two airports which are local to Göteborg: Landvetter (the main one) and Gothenburg City Airport (where some budget airlines fly). If you arrive at Landvetter, the airport bus stops right downtown at the Elite Park Avenyn Hotel (the second stop, 4 blocks from the sprint site) as well as at the end of the line, which is the Central Station. If you arrive at Gothenburg City Airport, take the bus to the end of the line; you will be at the Central Station.

You can also arrive by ferry, from either Kiel in Germany or Frederikshavn in Denmark.

Who's Coming?

If you'd like to come, please let us know when you will be arriving and leaving, as well as your interests. We'll keep a list of people which we'll update (and which you can update yourself if you have bitbucket pypy commit rights).

Wednesday, October 12, 2011

Numpy funding and status update

Hi everyone,

It's been a little while since we wrote about NumPy on PyPy, so we wanted to give everyone an update on what we've been up to, and what's up next for us.

We would also like to note that we're launching a funding campaign for NumPy support in PyPy. Details can be found on the donation page.

Some of the things that have happened since last we wrote are:

  • We added dtype support, meaning you can now create arrays of a bunch of different types, including bools, ints of various sizes, and floats.
  • More array methods and ufuncs, including things like comparison methods (==, >, etc.)
  • Support for more and more argument types, for example you can index by a tuple now (this only works with tuples of length one, since we only have single-dimension arrays thus far); see the small example after this list.
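
A small illustration of the features above (a rough sketch; the exact dtype spellings accepted at the time may have differed):

import numpy

ints = numpy.array([1, 2, 3], dtype=int)              # dtype support
bools = numpy.array([True, False, True], dtype=bool)
floats = numpy.array([1.5, 2.5, 3.5])

print floats > ints          # comparison ufuncs work elementwise
print floats[(1,)]           # indexing by a tuple of length one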

Some of the things we're working on at the moment:

  • More dtypes, including complex values and user-defined dtypes.
  • Subscripting arrays by other arrays as indices, and by bool arrays as masks.
  • Starting to reuse Python code from the original numpy.

Some of the things on the near horizon are:

  • Better support for scalar data. For example, did you know that numpy.array([True], dtype=bool)[0] doesn't return a bool object? Instead it returns a numpy.bool_.
  • Multi-dimensional array support.

If you're interested in helping out, we always love more contributors.

Alex, Maciej, Justin, and the whole PyPy team

Tuesday, October 11, 2011

More Compact Lists with List Strategies

Since we are getting closer to merging the list-strategies branch, I want to try to explain this memory optimization today.

Datatypes in PyPy are stored as W_<type>Objects (e.g. W_StringObject to represent strings, W_IntObject to represent ints). This is necessary due to the dynamic nature of Python. So the actual value (e.g. string, integer) is stored inside that box, resulting in an indirection. When there is a large number of such boxed objects, for example in a list, the wasted memory can become quite large.

If you have a closer look at such lists, you will see that in many of them only one type of data is stored, and only a few (and usually smaller) lists store mixed types. Another thing to observe is that those lists rarely change the types of the objects they contain at runtime. For instance, a list of a million integers is very unlikely to suddenly get a string appended to it.

List Strategies

The goal of this work is to write an optimization that exploits this behaviour. Instead of wrapping all items in a list, we implement lists in a way that optimizes them for storing certain (primitive) datatypes. These implementations store the content of the list in unwrapped form, getting rid of the extra indirection and the wrapper objects.

One approach would be to add a level of indirection, making each W_ListObject instance point to another object that stores the actual content. For this other object, several implementations would exist, one for every datatype we want to store without wrapping it (as well as a general one that deals with arbitrary content). The data layout would look something like this:

This approach has the problem that we need two indirections to get to the data and that the implementation instances need memory themselves.

What we would like to do instead is to make the W_ListObject point directly to an RPython list that contains either wrapped or unwrapped data. This plan has the problem that storing different kinds of unwrapped data is not directly possible in RPython.

To solve the problem, we use the rerased RPython library module. It allows us to erase the type of an object, in this case a list, and returns something similar to void* in C, or Object in Java. This object is then stored on the W_ListObject in the field storage. If we want to work with the list, for example to append or delete items, we need to unerase the storage again.

Example using rerased:

storage = erase([1, 2, 3, 4])
# storage is an opaque object that you can do nothing with
....
l = unerase(storage)
del l[:]   # now we have the real list back and can work with it
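
In actual RPython code, the erase and unerase functions are obtained as a matched pair, roughly like this (the exact module path inside the PyPy source is from memory and should be treated as an assumption):

from pypy.rlib.rerased import new_erasing_pair

# each user gets its own pair, so an erased object can only be unerased
# by the matching function
erase, unerase = new_erasing_pair("list_storage")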

Now that we know how to make the W_ListObject point directly to wrapped or unwrapped data, we need to find out how to actually do any operations on this data. This can be accomplished by adding another field to our W_ListObject. This field points to a ListStrategy object. The actual implementation of W_ListObject is now deferred to those ListStrategy classes. For instance, a W_ListObject which holds only integers will use the IntegerListStrategy.

When the type of the content changes, we need to change the strategy used as well as the storage, in compatible ways. For example, when we add a string to a list of integers, we need to switch to the ObjectListStrategy and change the storage to be a list of wrapped objects. Thus the currently used strategy always knows what is in the storage and what to do with it.
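
A much-simplified, hypothetical sketch of this scheme (the real classes in PyPy are RPython code and considerably more involved; W_IntObject, intval, unerase and switch_to_object_strategy are stand-ins for the real machinery):

class W_ListObject(object):
    def __init__(self, strategy, storage):
        self.strategy = strategy      # a ListStrategy singleton
        self.storage = storage        # erased list, wrapped or unwrapped

    def append(self, w_item):
        self.strategy.append(self, w_item)

class IntegerListStrategy(object):
    def append(self, w_list, w_item):
        if isinstance(w_item, W_IntObject):
            unerase(w_list.storage).append(w_item.intval)   # store the unwrapped int
        else:
            # the list is no longer homogeneous: switch strategy and storage
            w_list.switch_to_object_strategy()
            w_list.append(w_item)

class ObjectListStrategy(object):
    def append(self, w_list, w_item):
        unerase(w_list.storage).append(w_item)              # store the wrapped object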

As you can see, we now save one level of indirection by storing some of the data unwrapped. Of course each operation on a list needs to go via the strategy, but since we save one indirection for each element stored in that list and the strategy classes are singletons, the benefits outweigh the costs.

Currently there are only strategies for integers and strings, since many lists seem to have these datatypes. Other strategies, e.g. for floats and unicode strings, are planned. We also implemented two special strategies, for empty lists and for range-lists. The EmptyListStrategy's storage is None. If objects are added to the list, we just switch to the appropriate strategy (determined by the item's type). The RangeListStrategy does not store any items at all. Instead it only stores values describing the range of the list, i.e. start, step and length. On any operation that changes the data of the list we switch to the IntegerListStrategy.
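
Sketched in the same hypothetical style as above, a range strategy might look roughly like this:

class RangeListStrategy(object):
    def getitem(self, w_list, index):
        # no items are stored; the storage only holds (start, step, length)
        start, step, length = unerase(w_list.storage)
        return W_IntObject(start + index * step)

    def append(self, w_list, w_item):
        # any mutation forces a switch to a strategy that stores real items
        w_list.switch_to_integer_strategy()
        w_list.append(w_item)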

A nice side-effect of storing unwrapped datatypes is that we can implement optimized methods for certain cases. For instance, since comparison of unwrapped integers is now much faster than comparison between arbitrary objects, we can rewrite the sorting methods for lists containing integers.

Microbenchmarks

Finally, here is an early overview of the memory consumption of different Python implementations: CPython, PyPy, and PyPy-list, which uses list strategies. To demonstrate how powerful list strategies can be in the best case, we wrote benchmarks that create a list of integers, a list of strings and a range-list, each with one million elements, and then read out the heap size of the process as reported by the OS.

The results are as follows:

The savings on integers and strings in this ideal case are quite big.

The benchmark for range-lists is a little unfair, since in CPython one could accomplish the same memory behaviour using xrange. However, in PyPy users won't notice that internally the list does not store all items, so it is still possible to use all list methods, such as append or delete.

Conclusion

We hope that list strategies bring memory savings for applications that use homogeneous lists of primitive types. Furthermore, operations on such lists tend to be somewhat faster as well. This also integrates well with the JIT. The list strategies optimization will be merged into PyPy's default branch at some point in the next months. An equivalent optimization for dictionaries has already been merged (and is part of PyPy 1.6); one for sets is coming in the future.

Lukas Diekmann and Carl Friedrich Bolz