Thursday, April 18, 2019

PyPy 7.1.1 Bug Fix Release

The PyPy team is proud to release a bug-fix release version 7.1.1 of PyPy, which includes two different interpreters:
  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.
  • PyPy3.6-beta: the second official release of PyPy to support 3.6 features.
The interpreters are based on much the same codebase, thus the double release.

This bugfix fixes bugs related to large lists, dictionaries, and sets, some corner cases with unicode, and PEP 3118 memory views of ctype structures. It also fixes a few issues related to the ARM 32-bit backend. For the complete list see the changelog.

You can download the v7.1.1 releases here:

As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

The PyPy3.6 release is rapidly maturing, but is still considered beta-quality.

The PyPy team

Thursday, April 4, 2019

An RPython JIT for LPegs

The following is a guest post by Stefan Troost, he describes the work he did in his bachelor thesis:

In this project we have used the RPython infrastructure to generate an RPython JIT for a less-typical use-case: string pattern matching. The work in this project is based on Parsing Expression Grammars and LPeg, an implementation of PEGs designed to be used in Lua. In this post I will showcase some of the work that went into this project, explain PEGs in general and LPeg in particular, and show some benchmarking results.

Parsing Expression Grammars

Parsing Expression Grammas (PEGs) are a type of formal grammar similar to context-free grammars, with the main difference being that they are unambiguous. This is achieved by redefining the ambiguous choice operator of CFGs (usually noted as |) as an ordered choice operator. In practice this means that if a rule in a PEG presents a choice, a PEG parser should prioritize the leftmost choice. Practical uses include parsing and pattern-searching. In comparison to regular expressions PEGs stand out as being able to be parsed in linear time, being strictly more powerful than REs, as well as being arguably more readable.

LPeg

LPeg is an implementation of PEGs written in C to be used in the Lua programming language. A crucial detail of this implementation is that it parses high level function calls, translating them to bytecode, and interpreting that bytecode. Therefore, we are able to improve that implementation by replacing LPegs C-interpreter with an RPython JIT. I use a modified version of LPeg to parse PEGs and pass the generated Intermediate Representation, the LPeg bytecode, to my VM.

The LPeg Library

The LPeg Interpreter executes bytecodes created by parsing a string of commands using the LPeg library. Our JIT supports a subset of the LPeg library, with some of the more advanced or obscure features being left out. Note that this subset is still powerful enough to do things like parse JSON.

Operator Description
lpeg.P(string) Matches string literally
lpeg.P(n) Matches exactly n characters
lpeg.P(-n) Matches at most n characters
lpeg.S(string) Matches any character in string (Set)
lpeg.R(“xy”) Matches any character between x and y (Range)
pattern^n Matches at least n repetitions of pattern
pattern^-n Matches at most n repetitions of pattern
pattern1 * pattern2 Matches pattern1 followed by pattern2
pattern1 + pattern2 Matches pattern1 or pattern2 (ordered choice)
pattern1 - pattern2 Matches pattern1 if pattern2 does not match
-pattern Equivalent to ("" - pattern)

As a simple example, the pattern lpeg.P"ab"+lpeg.P"cd" would match either the string ab or the string cd.

To extract semantic information from a pattern, captures are needed. These are the following operations supported for capture creation.

Operation What it produces
lpeg.C(pattern) the match for patten plus all captures made by pattern
lpeg.Cp() the current position (matches the empty string)

(tables taken from the LPeg documentation)

These patterns are translated into bytecode by LPeg, at which point we are able to pass them into our own VM.

The VM

The state of the VM at any point is defined by the following variables:

  • PC: program counter indicating the current instruction
  • fail: an indicator that some match failed and the VM must backtrack
  • index: counter indicating the current character of the input string
  • stackentries: stack of return addresses and choice points
  • captures: stack of capture objects

The execution of bytecode manipulates the values of these variables in order to produce some output. How that works and what that output looks like will be explained now.

The Bytecode

For simplicity’s sake I will not go over every individual bytecode, but instead choose some that exemplify the core concepts of the bytecode set.

generic character matching bytecodes

  • any: Checks if there’s any characters left in the inputstring. If it succeeds it advances the index and PC by 1, if not the bytecode fails.

  • char c: Checks if there is another bytecode in the input and if that character is equal to c. Otherwise the bytecode fails.

  • set c1-c2: Checks if there is another bytecode in the input and if that character is between (including) c1 and c2. Otherwise the bytecode fails.

These bytecodes are the easiest to understand with very little impact on the VM. What it means for a bytecode to fail will be explained when we get to control flow bytecodes.

To get back to the example, the first half of the pattern lpeg.P"ab" could be compiled to the following bytecodes:

char a
char b

control flow bytecodes

  • jmp n: Sets PC to n, effectively jumping to the n’th bytecode. Has no defined failure case.

  • testchar c n: This is a lookahead bytecode. If the current character is equal to c it advances the PC but not the index. Otherwise it jumps to n.

  • call n: Puts a return address (the current PC + 1) on the stackentries stack and sets the PC to n. Has no defined failure case.

  • ret: Opposite of call. Removes the top value of the stackentries stack (if the string of bytecodes is valid this will always be a return address) and sets the PC to the removed value. Has no defined failure case.

  • choice n: Puts a choice point on the stackentries stack. Has no defined failure case.

  • commit n: Removes the top value of the stackentries stack (if the string of bytecodes is valid this will always be a choice point) and jumps to n. Has no defined failure case.

Using testchar we can implement the full pattern lpeg.P"ab"+lpeg.P"cd" with bytecode as follows:

testchar a -> L1
any
char b
end
any
L1: char c
char d
end

The any bytecode is needed because testchar does not consume a character from the input.

Failure Handling, Backtracking and Choice Points

A choice point consist of the VM’s current index and capturestack as well as a PC. This is not the VM’s PC at the time of creating the choicepoint, but rather the PC where we should continue trying to find matches when a failure occurs later.

Now that we have talked about choice points, we can talk about how the VM behaves in the fail state. If the VM is in the fail state, it removed entries from the stackentries stack until it finds a choice point. Then it backtracks by restoring the VM to the state defined by the choice point. If no choice point is found this way, no match was found in the string and the VM halts.

Using choice points we could implement the example lpeg.P"ab" + lpeg.P"cd" in bytecodes in a different way (LPEG uses the simpler way shown above, but for more complex patterns it can’t use the lookahead solution using testchar):

choice L1
char a
char b
commit
end
L1: char c
char d
end

Captures

Some patterns require the VM to produce more output than just “the pattern matched” or “the pattern did not match”. Imagine searching a document for an IPv4 address and all your program responded was “I found one”. In order to recieve additional information about our inputstring, captures are used.

The capture object

In my VM, two types of capture objects are supported, one of them being the position capture. It consists of a single index referencing the point in the inputstring where the object was created.

The other type of capture object is called simplecapture. It consists of an index and a size value, which are used to reference a substring of the inputstring. In addition, simplecaptures have a variable status indicating they are either open or full. If a simplecapture object is open, that means that its size is not yet determined, since the pattern we are capturing is of variable length.

Capture objects are created using the following bytecodes:

  • Fullcapture Position: Pushes a positioncapture object with the current index value to the capture stack.

  • Fullcapture Simple n: Pushes a simplecapture object with current index value and size=n to the capture stack.

  • Opencapture Simple: Pushes an open simplecapture object with current index value and undetermined size to the capture stack.

  • closecapture: Sets the top element of the capturestack to full and sets its size value using the difference between the current index and the index of the capture object.

The RPython Implementation

These, and many more bytecodes were implemented in an RPython-interpreter. By adding jit hints, we were able to generate an efficient JIT. We will now take a closer look at some implementations of bytecodes.

...
        elif instruction.name == "any":
            if index >= len(inputstring):
                fail = True
            else:
                pc += 1
                index += 1

...

The code for the any-bytecode is relatively straight-forward. It either advances the pc and index or sets the VM into the fail state, depending on whether the end of the inputstring has been reached or not.

...
        if instruction.name == "char":
            if index >= len(inputstring):
                fail = True
            elif instruction.character == inputstring[index]:
                pc += 1
                index += 1
            else:
                fail = True
...

The char-bytecode also looks as one would expect. If the VM’s string index is out of range or the character comparison fails, the VM is put into the fail state, otherwise the pc and index are advanced by 1. As you can see, the character we’re comparing the current inputstring to is stored in the instruction object (note that this code-example has been simplified for clarity, since the actual implementation includes a jit-optimization that allows the VM to execute multiple successive char-bytecodes at once).

...
        elif instruction.name == "jmp":
            pc = instruction.goto
...

The jmp-bytecode comes with a goto value which is a pc that we want execution to continue at.

...
        elif instruction.name == "choice":
            pc += 1
            choice_points = choice_points.push_choice_point(
                instruction.goto, index, captures)
...

As we can see here, the choice-bytecode puts a choice point onto the stack that may be backtracked to if the VM is in the fail-state. This choice point consists of a pc to jump to which is determined by the bytecode. But it also includes the current index and captures values at the time the choice point was created. An ongoing topic of jit optimization is which data structure is best suited to store choice points and return addresses. Besides naive implementations of stacks and single-linked lists, more case-specific structures are also being tested for performance.

Benchmarking Result

In order to find out how much it helps to JIT LPeg patterns we ran a small number of benchmarks. We used an otherwise idle Intel Core i5-2430M CPU with 3072 KiB of cache and 8 GiB of RAM, running with 2.40GHz. The machine was running Ubuntu 14.04 LTS, Lua 5.2.3 and we used GNU grep 2.16 as a point of comparison for one of the benchmarks. The benchmarks were run 100 times in a new process each. We measured the full runtime of the called process, including starting the process.

Now we will take a look at some plots generated by measuring the runtime of different iterations of my JIT compared to lua and using bootstrapping to generate a sampling distribution of mean values. The plots contain a few different variants of pypeg, only the one called "fullops" is important for this blog post, however.

This is the plot for a search pattern that searches a text file for valid URLs. As we can see, if the input file is as small as 100 kb, the benefits of JIT optimizations do not outweigh the time required to generate the machine code. As a result, all of our attempts perform significantly slower than LPeg.

This is the plot for the same search pattern on a larger input file. As we can see, for input files as small as 500 kb our VM already outperforms LPeg’s. An ongoing goal of continued development is to get this lower boundary as small as possible.

The benefits of a JIT compared to an Interpreter become more and more relevant for larger input files. Searching a file as large as 5 MB makes this fairly obvious and is exactly the behavior we expect.

This time we are looking at a different more complicated pattern, one that parses JSON used on a 50 kb input file. As expected, LPeg outperforms us, however, something unexpected happens as we increase the filesize.

Since LPeg has a defined maximum depth of 400 for the choicepoints and returnaddresses Stack, LPeg by default refuses to parse files as small as 100kb. This raises the question if LPeg was intended to be used for parsing. Until a way to increase LPeg’s maximum stack depth is found, no comparisons to LPeg can be performed at this scale. This has been a low priority in the past but may be addressed in the future.

To conclude, we see that at sufficiently high filesizes, our JIT outperforms the native LPeg-interpreter. This lower boundary is currently as low as 100kb in filesize.

Conclusion

Writing a JIT for PEG’s has proven itself to be a challenge worth pursuing, as the expected benefits of a JIT compared to an Interpreter have been achieved. Future goals include getting LPeg to be able to use parsing patterns on larger files, further increasing the performance of our JIT and comparing it to other well-known programs serving a similar purpose, like grep.

The prototype implementation that I described in this post can be found on Github (it's a bit of a hack in some places, though).

Sunday, March 24, 2019

PyPy v7.1 released; now uses utf-8 internally for unicode strings

The PyPy team is proud to release version 7.1.0 of PyPy, which includes two different interpreters:
  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7
  • PyPy3.6-beta: this is the second official release of PyPy to support 3.6 features, although it is still considered beta quality.
The interpreters are based on much the same codebase, thus the double release.

This release, coming fast on the heels of 7.0 in February, finally merges the internal refactoring of unicode representation as UTF-8. Removing the conversions from strings to unicode internally lead to a nice speed bump. We merged the utf-8 changes to the py3.5 branch (Python3.5.3) but will concentrate on 3.6 going forward.

We also improved the ability to use the buffer protocol with ctype structures and arrays.

The CFFI backend has been updated to version 1.12.2. We recommend using CFFI rather than c-extensions to interact with C, and cppyy for interacting with C++ code.
 You can download the v7.1 releases here:
We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work.

We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7, 3.6. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.
This PyPy release supports:
 
  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • big- and little-endian variants of PPC64 running Linux
  •  ARM32 although we do not supply downloadable binaries at this time
  • s390x running Linux

What else is new?

PyPy 7.0 was released in February, 2019. There are many incremental improvements to RPython and PyPy, for more information see the changelog.

Please update, and continue to help us make PyPy better.


Cheers, The PyPy team

Monday, February 11, 2019

PyPy v7.0.0: triple release of 2.7, 3.5 and 3.6-alpha


The PyPy team is proud to release the version 7.0.0 of PyPy, which includes three different interpreters:
  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7
  • PyPy3.5, which supports Python 3.5
  • PyPy3.6-alpha: this is the first official release of PyPy to support 3.6 features, although it is still considered alpha quality.
All the interpreters are based on much the same codebase, thus the triple release.
Until we can work with downstream providers to distribute builds with PyPy, we have made packages for some common packages available as wheels.
The GC hooks , which can be used to gain more insights into its performance, has been improved and it is now possible to manually manage the GC by using a combination of gc.disable and gc.collect_step. See the GC blog post.
We updated the cffi module included in PyPy to version 1.12, and the cppyy backend to 1.4. Please use these to wrap your C and C++ code, respectively, for a JIT friendly experience.
As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.
The PyPy3.6 release and the Windows PyPy3.5 release are still not production quality so your mileage may vary. There are open issues with incomplete compatibility and c-extension support.
The utf8 branch that changes internal representation of unicode to utf8 did not make it into the release, so there is still more goodness coming. You can download the v7.0 releases here:
http://pypy.org/download.html
We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython's JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7, 3.5 and 3.6. It's fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
The PyPy release supports:
  • x86 machines on most common operating systems (Linux 32/64 bits, Mac OS X 64 bits, Windows 32 bits, OpenBSD, FreeBSD)
  • big- and little-endian variants of PPC64 running Linux,
  • s390x running Linux
Unfortunately at the moment of writing our ARM buildbots are out of service, so for now we are not releasing any binary for the ARM architecture.

What else is new?

PyPy 6.0 was released in April, 2018. There are many incremental improvements to RPython and PyPy, the complete listing is here.

Please update, and continue to help us make PyPy better.


Cheers, The PyPy team

Saturday, February 9, 2019

Düsseldorf Sprint Report 2019

Hello everyone!

We are happy to report a successful and well attended sprint that is wrapping up in Düsseldorf, Germany. In the last week we had eighteen people sprinting at the Heinrich-Heine-Universität Düsseldorf on various topics.

Totally serious work going on here constantly.

A big chunk of the sprint was dedicated to various discussions, since we did not manage to gather the core developers in one room in quite a while. Discussion topics included:

  • Funding and general sustainability of open source.
  • Catching up with CPython 3.7/3.8 – we are planning to release 3.6 some time in the next few months and we will continue working on 3.7/3.8.
  • What to do with VMprof
  • How can we support Cython inside PyPy in a way that will be understood by the JIT, hence fast.
  • The future of supporting the numeric stack on pypy – we have made significant progress in the past few years and most of the numeric stack works out of the box, but deployment and performance remain problems. Improving on those problems remains a very important focus for PyPy as a project.
  • Using the presence of a CPython developer (Łukasz Langa) and a Graal Python developer (Tim Felgentreff) we discussed ways to collaborate in order to improve Python ecosystem across implementations.
  • Pierre-Yves David and Georges Racinet from octobus gave us an exciting demo on Heptapod, which adds mercurial support to gitlab.
  • Maciej and Armin gave demos of their current (non-PyPy-related) project VRSketch.

Visiting the Landschaftspark Duisburg Nord on the break day

Some highlights of the coding tasks worked on:

  • Aarch64 (ARM64) JIT backend work has been started, we are able to run the first test! Tobias Oberstein from Crossbar GmbH and Rodolph Perfetta from ARM joined the sprint to help kickstart the project.
  • The long running math-improvements branch that was started by Stian Andreassen got merged after bugfixes done by Alexander Schremmer. It should improve operations on large integers.
  • The arcane art of necromancy was used to revive long dormant regalloc branch started and nearly finished by Carl Friedrich Bolz-Tereick. The branch got merged and gives some modest speedups across the board.
  • Andrew Lawrence worked on MSI installer for PyPy on windows.
  • Łukasz worked on improving failing tests on the PyPy 3.6 branch. He knows very obscure details of CPython (e.g. how pickling works), hence we managed to progress very quickly.
  • Matti Picus set up a new benchmarking server for PyPy 3 branches.
  • The Utf8 branch, which changes the internal representation of unicode might be finally merged at some point very soon. We discussed and improved upon the last few blockers. It gives significant speedups in a lot of cases handling strings.
  • Zlib was missing couple methods, which were added by Ronan Lamy and Julian Berman.
  • Manuel Jacob fixed RevDB failures.
  • Antonio Cuni and Matti Picus worked on 7.0 release which should happen in a few days.

Now we are all quite exhausted, and are looking forward to catching up on sleep.

Best regards, Maciej Fijałkowski, Carl Friedrich Bolz-Tereick and the whole PyPy team.

Thursday, January 3, 2019

PyPy for low-latency systems

PyPy for low-latency systems

Recently I have merged the gc-disable branch, introducing a couple of features which are useful when you need to respond to certain events with the lowest possible latency. This work has been kindly sponsored by Gambit Research (which, by the way, is a very cool and geeky place where to work, in case you are interested). Note also that this is a very specialized use case, so these features might not be useful for the average PyPy user, unless you have the same problems as described here.

The PyPy VM manages memory using a generational, moving Garbage Collector. Periodically, the GC scans the whole heap to find unreachable objects and frees the corresponding memory. Although at a first look this strategy might sound expensive, in practice the total cost of memory management is far less than e.g. on CPython, which is based on reference counting. While maybe counter-intuitive, the main advantage of a non-refcount strategy is that allocation is very fast (especially compared to malloc-based allocators), and deallocation of objects which die young is basically for free. More information about the PyPy GC is available here.

As we said, the total cost of memory managment is less on PyPy than on CPython, and it's one of the reasons why PyPy is so fast. However, one big disadvantage is that while on CPython the cost of memory management is spread all over the execution of the program, on PyPy it is concentrated into GC runs, causing observable pauses which interrupt the execution of the user program.
To avoid excessively long pauses, the PyPy GC has been using an incremental strategy since 2013. The GC runs as a series of "steps", letting the user program to progress between each step.

The following chart shows the behavior of a real-world, long-running process:


The orange line shows the total memory used by the program, which increases linearly while the program progresses. Every ~5 minutes, the GC kicks in and the memory usage drops from ~5.2GB to ~2.8GB (this ratio is controlled by the PYPY_GC_MAJOR_COLLECT env variable).
The purple line shows aggregated data about the GC timing: the whole collection takes ~1400 individual steps over the course of ~1 minute: each point represent the maximum time a single step took during the past 10 seconds. Most steps take ~10-20 ms, although we see a horrible peak of ~100 ms towards the end. We have not investigated yet what it is caused by, but we suspect it is related to the deallocation of raw objects.

These multi-millesecond pauses are a problem for systems where it is important to respond to certain events with a latency which is both low and consistent. If the GC kicks in at the wrong time, it might causes unacceptable pauses during the collection cycle.

Let's look again at our real-world example. This is a system which continuously monitors an external stream; when a certain event occurs, we want to take an action. The following chart shows the maximum time it takes to complete one of such actions, aggregated every minute:


You can clearly see that the baseline response time is around ~20-30 ms. However, we can also see periodic spikes around ~50-100 ms, with peaks up to ~350-450 ms! After a bit of investigation, we concluded that most (although not all) of the spikes were caused by the GC kicking in at the wrong time.

The work I did in the gc-disable branch aims to fix this problem by introducing two new features to the gc module:
  • gc.disable(), which previously only inhibited the execution of finalizers without actually touching the GC, now disables the GC major collections. After a call to it, you will see the memory usage grow indefinitely.
  • gc.collect_step() is a new function which you can use to manually execute a single incremental GC collection step.
It is worth to specify that gc.disable() disables only the major collections, while minor collections still runs. Moreover, thanks to the JIT's virtuals, many objects with a short and predictable lifetime are not allocated at all. The end result is that most objects with short lifetime are still collected as usual, so the impact of gc.disable() on memory growth is not as bad as it could sound.

Combining these two functions, it is possible to take control of the GC to make sure it runs only when it is acceptable to do so. For an example of usage, you can look at the implementation of a custom GC inside pypytools. The peculiarity is that it also defines a "with nogc():" context manager which you can use to mark performance-critical sections where the GC is not allowed to run.

The following chart compares the behavior of the default PyPy GC and the new custom GC, after a careful placing of nogc() sections:


The yellow line is the same as before, while the purple line shows the new system: almost all spikes have gone, and the baseline performance is about 10% better. There is still one spike towards the end, but after some investigation we concluded that it was not caused by the GC.

Note that this does not mean that the whole program became magically faster: we simply moved the GC pauses in some other place which is not shown in the graph: in this specific use case this technique was useful because it allowed us to shift the GC work in places where pauses are more acceptable.

All in all, a pretty big success, I think. These functionalities are already available in the nightly builds of PyPy, and will be included in the next release: take this as a New Year present :)

Antonio Cuni and the PyPy team