Monday, October 14, 2019

PyPy v7.2 released

The PyPy team is proud to release the version 7.2.0 of PyPy, which includes two different interpreters:
  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.7 including the stdlib for CPython 2.7.13
  • PyPy3.6: which is an interpreter supporting the syntax and the features of Python 3.6, including the stdlib for CPython 3.6.9.
The interpreters are based on much the same codebase, thus the double release.

As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating. Many of the fixes are the direct result of end-user bug reports, so please continue reporting issues as they crop up.

You can download the v7.2 releases here:
With the support of Arm Holdings Ltd. and, this release supports the 64-bit aarch64 ARM architecture. More about the work and the performance data around this welcome development can be found in the blog post.

This release removes the “beta” tag from PyPy3.6. While there may still be some small corner-case incompatibilities (around the exact error messages in exceptions and the handling of faulty codec errorhandlers) we are happy with the quality of the 3.6 series and are looking forward to working on a Python 3.7 interpreter.

We updated our benchmark runner at to a more modern machine and updated the baseline python to CPython 2.7.11. Thanks to Baroque Software for maintaining the benchmark runner.

The CFFI-based _ssl module was backported to PyPy2.7 and updated to use cryptography version 2.7. Additionally, the _hashlib, and crypt (or _crypt on Python3) modules were converted to CFFI. This has two consequences: end users and packagers can more easily update these libraries for their platform by executing (cd lib_pypy; ../bin/pypy _* More significantly, since PyPy itself links to fewer system shared objects (DLLs), on platforms with a single runtime namespace like linux, different CFFI and c-extension modules can load different versions of the same shared object into PyPy without collision (issue 2617).

Until downstream providers begin to distribute c-extension builds with PyPy, we have made packages for some common packages available as wheels.

The CFFI backend has been updated to version 1.13.0. We recommend using CFFI rather than c-extensions to interact with C, and cppyy for interacting with C++ code.

Thanks to Anvil, we revived the PyPy Sandbox, (soon to be released) which allows total control over a Python interpreter’s interactions with the external world.

We implemented a new JSON decoder that is much faster, uses less memory, and uses a JIT-friendly specialized dictionary. More about that in the recent blog post

We would like to thank our donors for the continued support of the PyPy project. If PyPy is not quite good enough for your needs, we are available for direct consulting work.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better. Since the previous release, we have accepted contributions from 27 new contributors, so thanks for pitching in.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7, 3.6. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

This PyPy release supports:
  • x86 machines on most common operating systems (Linux 32/64 bit, Mac OS X 64-bit, Windows 32-bit, OpenBSD, FreeBSD)
  • big- and little-endian variants of PPC64 running Linux
  • s390x running Linux
  • 64-bit ARM machines running Linux
Unfortunately at the moment of writing our ARM buildbots are out of service, so for now we are not releasing any binary for the ARM architecture (32-bit), although PyPy does support ARM 32-bit processors.

What else is new?

PyPy 7.1 was released in March, 2019. There are many incremental improvements to RPython and PyPy, For more information about the 7.2.0 release, see the full changelog.

Please update, and continue to help us make PyPy better.

The PyPy team

Tuesday, October 8, 2019

PyPy's new JSON parser


In the last year or two I have worked on and off on making PyPy's JSON faster, particularly when parsing large JSON files. In this post I am going to document those techniques and measure their performance impact. Note that I am quite a lot more constrained in what optimizations I can apply here, compared to some of the much more advanced approaches like Mison, Sparser or SimdJSON because I don't want to change the json.loads API that Python programs expect, and because I don't want to only support CPUs with wide SIMD extensions. With a more expressive API, more optimizations would be possible.
There are a number of problems of working with huge JSON files: deserialization takes a long time on the one hand, and the resulting data structures often take a lot of memory (usually they can be many times bigger than the size of the file they originated from). Of course these problems are related, because allocating and initializing a big data structure takes longer than a smaller data structure. Therefore I always tried to attack both of these problems at the same time.
One common theme of the techniques I am describing is that of optimizing the parser for how JSON files are typically used, not how they could theoretically be used. This is a similar approach to the way dynamic languages are optimized more generally: most JITs will optimize for typical patterns of usage, at the cost of less common usage patterns, which might even become slower as a result of the optimizations.


The first technique I investigated is to use maps in the JSON parser. Maps, also called hidden classes or shapes, are a fairly common way to (generally, not just in the context of JSON parsing) optimize instances of classes in dynamic language VMs. Maps exploit the fact that while it is in theory possible to add arbitrary fields to an instance, in practice most instances of a class are going to have the same set of fields (or one of a small number of different sets). Since JSON dictionaries or objects often come from serialized instances of some kind, this property often holds in JSON files as well: dictionaries often have the same fields in the same order, within a JSON file.
This property can be exploited in two ways: on the one hand, it can be used to again store the deserialized dictionaries in a more memory efficient way by not using a hashmap in most cases, but instead splitting the dictionary into a shared description of the set of keys (the map) and an array of storage with the values. This makes the deserialized dictionaries smaller if the same set of keys is repeated a lot. This is completely transparent to the Python programmer, the dictionary will look completely normal to the Python program but its internal representation is different.
One downside of using maps is that sometimes files will contain many dictionaries that have unique key sets. Since maps themselves are quite large data structures and since dictionaries that use maps contain an extra level of indirection we want to fall back to using normal hashmaps to represent the dictionaries where that is the case. To prevent this we perform some statistics at runtime, how often every map (i.e. set of keys) is used in the file. For uncommonly used maps, the map is discarded and the dictionaries that used the map converted into using a regular hashmap.

Using Maps to Speed up Parsing

Another benefit of using maps to store deserialized dictionaries is that we can use them to speed up the parsing process itself. To see how this works, we need to understand maps a bit better. All the maps produced as a side-effect of parsing JSON form a tree. The tree root is a map that describes the object without any attributes. From every tree node we have a number of edges going to other nodes, each edge for a specific new attribute added:

This map tree is the result of parsing a file that has dictionaries with the keys a, b, c many times, the keys a, b, f less often, and also some objects with the keys x, y.
When parsing a dictionary we traverse this tree from the root, according to the keys that we see in the input file. While doing this, we potentially add new nodes, if we get key combinations that we have never seen before. The set of keys of a dictionary parsed so far are represented by the current tree node, while we can store the values into an array. We can use the tree of nodes to speed up parsing. A lot of the nodes only have one child, because after reading the first few keys of an object, the remaining ones are often uniquely determined in a given file. If we have only one child map node, we can speculatively parse the next key by doing a memcmp between the key that the map tree says is likely to come next and the characters that follow the ',' that started the next entry in the dictionary. If the memcmp returns true this means that the speculation paid off, and we can transition to the new map that the edge points to, and parse the corresponding value. If not, we fall back to general code that parses the string, handles escaping rules etc. This trick was explained to me by some V8 engineers, the same trick is supposedly used as part of the V8 JSON parser.
This scheme doesn't immediately work for map tree nodes that have more than one child. However, since we keep statistics anyway about how often each map is used as the map of a parsed dictionary, we can speculate that the most common map transition is taken more often than the others in the future, and use that as the speculated next node.
So for the example transition tree shown in the figure above the key speculation would succeed for objects with keys a, b, c. For objects with keys a, b, f the speculation would succeed for the first two keys, but not for the third key f. For objects with the keys x, y the speculation would fail for the first key x but succeed for the second key y.
For real-world datasets these transition trees can become a lot more complicated, for example here is a visualization of a part of the transition tree generated for parsing a New York Times dataset:

Caching Strings

A rather obvious observation we can use to improve performance of the parser is the fact that string values repeat a lot in most JSON files. For strings that are used as dictionary keys this is pretty obvious. However it happens also for strings that are used as values in dictionaries (or are stored in lists). We can use this fact to intern/memoize strings and save memory. This is an approach that many JSON parsers use, including CPython's. To do this, I keep a dictionary of strings that we have seen so far during parsing and look up new strings that are deserialized. If we have seen the string before, we can re-use the deserialized previous string. Right now I only consider utf-8 strings for caching that do not contain any escapes (whether stuff like \", \n or escaped unicode chars).
This simple approach works extremely well for dictionary keys, but needs a number of improvements to be a win in general. The first observation is that computing the hash to look up the string in the dictionary of strings we've seen so far is basically free. We can compute the hash while scanning the input for the end of the string we are currently deserializing. Computing the hash while scanning doesn't increase the time spent scanning much. This is not a new idea, I am sure many other parsers do the same thing (but CPython doesn't seem to).
Another improvement follows from the observation that inserting every single deserialized non-key string into a hashmap is too expensive. Instead, we insert strings into the cache more conservatively, by keeping a small ring buffer of hashes of recently deserialized strings. The hash is looked for in the ring buffer, and only if the hash is present we insert the string into the memoization hashmap. This has the effect of only inserting strings into the memoization hashmap that re-occur a second time not too far into the file. This seems to give a good trade-off between still re-using a lot of strings but keeping the time spent updating and the size of the memoization hashmap low.
Another twist is that in a lot of situations caching strings is not useful at all, because it will almost never succeed. Examples of this are UUIDs (which are unique), or the content of a tweet in a JSON file with many tweets (which is usually unique). However, in the same file it might be useful to cache e.g. the user name of the Twitter user, because many tweets from the same person could be in such a file. Therefore the usefulness of the string cache depends on which fields of objects we are deserializing the value off. Therefore we keep statistics per map field and disable string memoization per individual field if the cache hit rate falls below a certain threshold. This gives the best of both worlds: in the cases where string values repeat a lot in certain fields we use the cache to save time and memory. But for those fields that mostly contain unique strings we don't waste time looking up and adding strings in the memoization table. Strings outside of dictionaries are quite rare anyway, so we just always try to use the cache for them.
The following pseudocode sketches the code to deserialize a string in the input at a given position. The function also takes a map, which is the point in the map tree that we are currently deserializing a field off (if we are deserializing a string in another context, some kind of dummy map can be used there).

def deserialize_string(pos, input, map):
    # input is the input string, pos is the position of the starting " of
    # the string

    # find end of string, check whether it contains escape codes,
    # compute hash, all at the same time
    end, escapes, hash = find_end_of_string(pos + 1, input)
    if end == -1:
        raise ParseError
    if escapes:
        # need to be much more careful with escaping
        return deserialize_string_escapes(pos, input)
    # should we cache at all?
    if map.cache_disabled():
        return input[pos + 1:end]

    # if string is in cache, return it
    if hash in cache:
        map.cache_hit += 1
        return cache[hash]

    result = input[pos + 1:end]
    map.cache_miss += 1

    # if hash is in the ring buffer of recently seen hashes,
    # add the string to the cache
    if hash in ring_buffer:
        cache[hash] = result
    return result


To find out how much the various techniques help, I implemented a number of JSON parsers in PyPy with different combinations of the techniques enabled. I compared the numbers with the JSON parser of CPython 3.7.3 (simplejson), with ujson, with the JSON parser of Node 12.11.1 (V8) and with RapidJSON (in DOM mode).
I collected a number of medium-to-large JSON files to try the JSON parsers on:
  • Censys: A subset of the Censys port and protocol scan data for websites in the Alexa top million domains
  • Gharchive: Github activity from January 15-23, 2015 from Github Archive
  • Reddit: Reddit comments from May 2009
  • Rosie: The nested matches produced using the Rosie pattern language all.things pattern on a log file
  • Nytimes: Metadata of a collection of New York Times articles
  • Tpch: The TPC-H database benchmark's deals table as a JSON file
  • Twitter: A JSON export of the @pypyproject Twitter account data
  • Wikidata: A file storing a subset of the Wikidata fact dump from Nov 11, 2014
  • Yelp: A file of yelp businesses
Here are the file sizes of the benchmarks:
Benchmark File Size [MiB]
Censys 898.45
Gharchive 276.34
NYTimes 12.98
Reddit 931.65
Rosie 388.88
TPCH 173.86
Wikidata 119.75
Yelp 167.61
I measured the times of each benchmark with a number of variations of the improved PyPy algorithms:
  • PyPyBaseline: The PyPy JSON parser as it was before my work with JSON parsing started (PyPy version 5.8)
  • PyPyKeyStringCaching: Memoizing the key strings of dictionaries, but not the other strings in a json file, and not using maps to represent dictionaries (this is the JSON parser that PyPy has been shipping since version 5.9, in the benchmarks I used 7.1).
  • PyPyMapNoCache: Like PyPyKeyStringCaching, but using maps to represent dictionaries. This includes speculatively parsing the next key using memcmp, but does not use string caching of non-key strings.
  • PyPyFull: Like PyPyMapNoCache but uses a string cache for all strings, not just keys. This is equivalent to what will be released soon as part of PyPy 7.2
In addition to wall clock time of parsing, I also measured the increase in memory use of each implementation after the input string has been deserialized, i.e. the size of the in-memory representation of every JSON file.

Contributions of Individual Optimizations

Let's first look at the contributions of the individual optimizations to the overall performance and memory usage.

All the benchmarks were run 30 times in new processes, all the numbers are normalized to PyPyFull.
The biggest individual improvement to both parsing time and memory used comes from caching just the keys in parsed dictionaries. This is the optimization in PyPy's JSON parser that has been implemented for a while already. To understand why this optimization is so useful, let's look at some numbers about each benchmark, namely the number of total keys across all dictionaries in each file, as well as the number of unique keys. As we can see, for all benchmarks the number of unique keys is significantly smaller than the number of keys in total.
Benchmark Number of keys Number of unique keys
Censys 14 404 234 163
Gharchive 6 637 881 169
NYTimes 417 337 60
Reddit 25 226 397 21
Rosie 28 500 101 5
TPCH 6 700 000 45
Wikidata 6 235 088 1 602
Yelp 5 133 914 61
The next big jump in deserialization time and memory comes from introducing maps to represent deserialized dictionaries. With PyPyMapNoCache deserialization time goes down because it's much cheaper to walk the tree of maps and store all deserialized objects into an array of values than to build hashmaps with the same keys again and again. Memory use goes down for the same reason: it takes a lot less memory to store the shared structure of each set of keys in the map, as opposed to repeating it again and again in every hashmap.
We can look at some numbers about every benchmark again. The table shows how many map-based dictionaries are deserialized for every benchmark, and how many hashmap-backed dictionaries. We see that the number of hashmap-backed dictionaries is often zero, or at most a small percentage of all dictionaries in each benchmark. Yelp has the biggest number of hashmap-backed dictionaries. The reason for this is that the input file contains hashmaps that store combinations of various features of Yelp businesses, and a lot of these combinations are totally unique to a business. Therefore the heuristics determine that it's better to store these using hashmaps.
Benchmark Map Dicts Regular Dicts % Regular Dicts
Censys 4 049 235 1 042 0.03
Gharchive 955 301 0 0.00
NYTimes 80 393 0 0.00
Reddit 1 201 257 0 0.00
Rosie 6 248 966 0 0.00
TPCH 1 000 000 0 0.00
Wikidata 1 923 460 46 905 2.38
Yelp 443 140 52 051 10.51
We can also look at numbers about how often the memcmp-based speculative parsing of the next key of a given map succeeds. Looking at statistics about each benchmark, we can see that the speculation of what key we expect next pays off in a significant percentage of cases, between 63% for Wikidata where the dictionary structures are quite irregular, and 99% for Reddit, where all the dictionaries have the same set of keys.
Benchmark Number of Keys Map Transitions % Successful Speculation
Censys 14 404 234 14 403 243 65.79
Gharchive 6 637 881 6 637 881 86.71
NYTimes 417 337 417 337 79.85
Reddit 25 226 397 25 226 397 100.00
Rosie 28 500 101 28 500 101 90.37
TPCH 6 700 000 6 700 000 86.57
Wikidata 6 235 088 5 267 744 63.68
Yelp 5 133 914 4 593 980 90.43
geomean 82.04
General string caching is the most unclear optimization. On the one hand its impact on memory usage is quite substantial, leading to a 20% reduction for Gharchive and Reddit, up to a 2× improvement for Yelp. On the other hand, the effect on performance is less clear, since it even leads to a slowdown in Gharchive and Reddit, and generally only a small improvement. Choosing the right heuristic for when to disable the cache also has somewhat unclear effects and is definitely a topic worthy of further investigation.

Comparison against other JSON Decoders

To get a more general feeling of the performance and memory usage of the improved PyPy parser, we compare it against CPython's built-in json parser, ujson for CPython, Node's (V8) JSON parser and RapidJSON. For better context for the memory usage I also show the file size of the input files.
These benchmarks are not really an apples-to-apple comparison. All of the implementations use different in-memory representations of strings in the deserialized data-structure (Node uses two bytes per character in a string, in CPython it depends but 4 bytes on my machine), PyPyBaseline uses four bytes, PyPy and RapidJSON use utf-8). But it's still interesting to get some ballpark numbers. The results are as follows:

As we can see, PyPyFull handily beats CPython and ujson, with a geometric mean of the improvement of about 2.5×. The memory improvement can be even more extreme, with an improvement of over 4× against CPython/ujson in some cases (CPython gives better memory sizes, because its parser caches the keys of dictionaries as well). Node is often more than 50% slower, whereas RapidJSON beats us easily, by a factor of 2× on average.


While the speedup I managed to achieve over the course of this project is nice and I am certainly happy to beat both CPython and Node, I am ultimately still annoyed that RapidJSON manages to maintain such a clear lead over PyPyFull, and would like to get closer to it. One problem that PyPy suffers compared to RapidJSON is the overhead of garbage collection. Deserializing large JSON files is pretty much the worst case for the generational GC that PyPy uses, since none of the deserialized objects die young (and the GC expects that most objects do). That means that a lot of the deserialization time of PyPy is wasted allocating the resulting objects in the nursery, and then copying them into the old generation. Somehow, this should be done in better ways, but all my attempts to not have to do the copy did not seem to help much. So maybe more improvements are possible, if I can come up with more ideas.
On the memory side of things, Node/V8 is beating PyPy clearly which might indicate more general problems in how we represent Python objects in memory. On the other hand, I think it's cool that we are competitive with RapidJSON in terms of memory and often within 2× of the file size.
An effect that I didn't consider at all in this blog post is the fact that accessing the deserialized objects with constants strings is also faster than with regular dictionaries, due to them being represented with maps. More benchmarking work to do in the future!
If you have your own programs that run on PyPy and use the json parser a lot, please measure them on the new code and let me know whether you see any difference!

Wednesday, August 7, 2019

A second life for the Sandbox

Hi all,

Anvil is a UK-based company sponsoring one month of work to revive PyPy's "sandbox" mode and upgrade it to PyPy3. Thanks to them, sandboxing will be given a second life!

The sandboxed PyPy is a special version of PyPy that runs fully isolated. It gives a safe way to execute arbitrary Python programs (whole programs, not small bits of code inside your larger Python program). Such scripts can be fully untrusted, and they can try to do anything—there are no syntax-based restrictions, for example—but whatever they do, any communication with the external world is not actually done but delegated to the parent process. This is similar but much more flexible than Linux's Seccomp approach, and it is more lightweight than setting up a full virtual machine. It also works without operating system support.

However, during the course of the years the sandbox mode of PyPy has been mostly unmaintained and unsupported by the core developers, mostly because of a lack of interest by users and because it took too much effort to maintain it.

Now we have found that we have an actual user, Anvil. As far as I can tell they are still using a very old version of PyPy, the last one that supported sandboxing. This is where this contract comes from: the goal is to modernize sandboxing and port it to PyPy3.

Part of my motivation for accepting this work is that I may have found a way to tweak the protocol on the pipe between the sandboxed PyPy and the parent controller process. This should make the sandboxed PyPy more resilient against future developments and easier to maintain; at most, in the future some tweaks will be needed in the controller process but hopefully not deep inside the guts of the sandboxed PyPy. Among the advantages, such a more robust solution should mean that we can actually get a working sandboxed PyPy—or sandboxed PyPy3 or sandboxed version of any other interpreter written in RPython—with just an extra argument when calling rpython to translate this interpreter. If everything works as planned, sandboxing may be given a second life.

Armin Rigo

Thursday, July 25, 2019

PyPy JIT for Aarch64

Hello everyone.

We are pleased to announce the availability of the new PyPy for AArch64. This port brings PyPy's high-performance just-in-time compiler to the AArch64 platform, also known as 64-bit ARM. With the addition of AArch64, PyPy now supports a total of 6 architectures: x86 (32 & 64bit), ARM (32 & 64bit), PPC64, and s390x. The AArch64 work was funded by ARM Holdings Ltd. and

PyPy has a good record of boosting the performance of Python programs on the existing platforms. To show how well the new PyPy port performs, we compare the performance of PyPy against CPython on a set of benchmarks. As a point of comparison, we include the results of PyPy on x86_64.

Note, however, that the results presented here were measured on a Graviton A1 machine from AWS, which comes with a very serious word of warning: Graviton A1's are virtual machines, and, as such, they are not suitable for benchmarking. If someone has access to a beefy enough (16G) ARM64 server and is willing to give us access to it, we are happy to redo the benchmarks on a real machine. One major concern is that while a virtual CPU is 1-to-1 with a real CPU, it is not clear to us how CPU caches are shared across virtual CPUs. Also, note that by no means is this benchmark suite representative enough to average the results. Read the numbers individually per benchmark.

The following graph shows the speedups on AArch64 of PyPy (hg id 2417f925ce94) compared to CPython (2.7.15), as well as the speedups on a x86_64 Linux laptop comparing the most recent release, PyPy 7.1.1, to CPython 2.7.16.

In the majority of benchmarks, the speedups achieved on AArch64 match those achieved on the x86_64 laptop. Over CPython, PyPy on AArch64 achieves speedups between 0.6x to 44.9x. These speedups are comparable to x86_64, where the numbers are between 0.6x and 58.9x.

The next graph compares between the speedups achieved on AArch64 to the speedups achieved on x86_64, i.e., how great the speedup is on AArch64 vs. the same benchmark on x86_64. This comparison should give a rough idea about the quality of the generated code for the new platform.

Note that we see a large variance: There are generally three groups of benchmarks - those that run at more or less the same speed, those that run at 2x the speed, and those that run at 0.5x the speed of x86_64.

The variance and disparity are likely related to a variety of issues, mostly due to differences in architecture. What is however interesting is that, compared to measurements performed on older ARM boards, the branch predictor on the Graviton A1 machine appears to have improved. As a result, the speedups achieved by PyPy over CPython are smaller than on older ARM boards: sufficiently branchy code, like CPython itself, simply runs a lot faster. Hence, the advantage of the non-branchy code generated by PyPy's just-in-time compiler is smaller.

One takeaway here is that many possible improvements for PyPy have yet to be implemented. This is true for both of the above platforms, but probably more so for AArch64, which comes with a large number of CPU registers. The PyPy backend was written with x86 (the 32-bit variant) in mind, which has a really low number of registers. We think that we can improve in the area of emitting more modern machine code, which may have a higher impact on AArch64 than on x86_64. There is also a number of missing features in the AArch64 backend. These features are currently implemented as expensive function calls instead of inlined native instructions, something we intend to improve.


Maciej Fijalkowski, Armin Rigo and the PyPy team

Thursday, April 18, 2019

PyPy 7.1.1 Bug Fix Release

The PyPy team is proud to release a bug-fix release version 7.1.1 of PyPy, which includes two different interpreters:
  • PyPy2.7, which is an interpreter supporting the syntax and the features of Python 2.
  • PyPy3.6-beta: the second official release of PyPy to support 3.6 features.
The interpreters are based on much the same codebase, thus the double release.

This bugfix fixes bugs related to large lists, dictionaries, and sets, some corner cases with unicode, and PEP 3118 memory views of ctype structures. It also fixes a few issues related to the ARM 32-bit backend. For the complete list see the changelog.

You can download the v7.1.1 releases here:

As always, this release is 100% compatible with the previous one and fixed several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

The PyPy3.6 release is rapidly maturing, but is still considered beta-quality.

The PyPy team

Thursday, April 4, 2019

An RPython JIT for LPegs

The following is a guest post by Stefan Troost, he describes the work he did in his bachelor thesis:

In this project we have used the RPython infrastructure to generate an RPython JIT for a less-typical use-case: string pattern matching. The work in this project is based on Parsing Expression Grammars and LPeg, an implementation of PEGs designed to be used in Lua. In this post I will showcase some of the work that went into this project, explain PEGs in general and LPeg in particular, and show some benchmarking results.

Parsing Expression Grammars

Parsing Expression Grammas (PEGs) are a type of formal grammar similar to context-free grammars, with the main difference being that they are unambiguous. This is achieved by redefining the ambiguous choice operator of CFGs (usually noted as |) as an ordered choice operator. In practice this means that if a rule in a PEG presents a choice, a PEG parser should prioritize the leftmost choice. Practical uses include parsing and pattern-searching. In comparison to regular expressions PEGs stand out as being able to be parsed in linear time, being strictly more powerful than REs, as well as being arguably more readable.


LPeg is an implementation of PEGs written in C to be used in the Lua programming language. A crucial detail of this implementation is that it parses high level function calls, translating them to bytecode, and interpreting that bytecode. Therefore, we are able to improve that implementation by replacing LPegs C-interpreter with an RPython JIT. I use a modified version of LPeg to parse PEGs and pass the generated Intermediate Representation, the LPeg bytecode, to my VM.

The LPeg Library

The LPeg Interpreter executes bytecodes created by parsing a string of commands using the LPeg library. Our JIT supports a subset of the LPeg library, with some of the more advanced or obscure features being left out. Note that this subset is still powerful enough to do things like parse JSON.

Operator Description
lpeg.P(string) Matches string literally
lpeg.P(n) Matches exactly n characters
lpeg.P(-n) Matches at most n characters
lpeg.S(string) Matches any character in string (Set)
lpeg.R(“xy”) Matches any character between x and y (Range)
pattern^n Matches at least n repetitions of pattern
pattern^-n Matches at most n repetitions of pattern
pattern1 * pattern2 Matches pattern1 followed by pattern2
pattern1 + pattern2 Matches pattern1 or pattern2 (ordered choice)
pattern1 - pattern2 Matches pattern1 if pattern2 does not match
-pattern Equivalent to ("" - pattern)

As a simple example, the pattern lpeg.P"ab"+lpeg.P"cd" would match either the string ab or the string cd.

To extract semantic information from a pattern, captures are needed. These are the following operations supported for capture creation.

Operation What it produces
lpeg.C(pattern) the match for patten plus all captures made by pattern
lpeg.Cp() the current position (matches the empty string)

(tables taken from the LPeg documentation)

These patterns are translated into bytecode by LPeg, at which point we are able to pass them into our own VM.

The VM

The state of the VM at any point is defined by the following variables:

  • PC: program counter indicating the current instruction
  • fail: an indicator that some match failed and the VM must backtrack
  • index: counter indicating the current character of the input string
  • stackentries: stack of return addresses and choice points
  • captures: stack of capture objects

The execution of bytecode manipulates the values of these variables in order to produce some output. How that works and what that output looks like will be explained now.

The Bytecode

For simplicity’s sake I will not go over every individual bytecode, but instead choose some that exemplify the core concepts of the bytecode set.

generic character matching bytecodes

  • any: Checks if there’s any characters left in the inputstring. If it succeeds it advances the index and PC by 1, if not the bytecode fails.

  • char c: Checks if there is another bytecode in the input and if that character is equal to c. Otherwise the bytecode fails.

  • set c1-c2: Checks if there is another bytecode in the input and if that character is between (including) c1 and c2. Otherwise the bytecode fails.

These bytecodes are the easiest to understand with very little impact on the VM. What it means for a bytecode to fail will be explained when we get to control flow bytecodes.

To get back to the example, the first half of the pattern lpeg.P"ab" could be compiled to the following bytecodes:

char a
char b

control flow bytecodes

  • jmp n: Sets PC to n, effectively jumping to the n’th bytecode. Has no defined failure case.

  • testchar c n: This is a lookahead bytecode. If the current character is equal to c it advances the PC but not the index. Otherwise it jumps to n.

  • call n: Puts a return address (the current PC + 1) on the stackentries stack and sets the PC to n. Has no defined failure case.

  • ret: Opposite of call. Removes the top value of the stackentries stack (if the string of bytecodes is valid this will always be a return address) and sets the PC to the removed value. Has no defined failure case.

  • choice n: Puts a choice point on the stackentries stack. Has no defined failure case.

  • commit n: Removes the top value of the stackentries stack (if the string of bytecodes is valid this will always be a choice point) and jumps to n. Has no defined failure case.

Using testchar we can implement the full pattern lpeg.P"ab"+lpeg.P"cd" with bytecode as follows:

testchar a -> L1
char b
L1: char c
char d

The any bytecode is needed because testchar does not consume a character from the input.

Failure Handling, Backtracking and Choice Points

A choice point consist of the VM’s current index and capturestack as well as a PC. This is not the VM’s PC at the time of creating the choicepoint, but rather the PC where we should continue trying to find matches when a failure occurs later.

Now that we have talked about choice points, we can talk about how the VM behaves in the fail state. If the VM is in the fail state, it removed entries from the stackentries stack until it finds a choice point. Then it backtracks by restoring the VM to the state defined by the choice point. If no choice point is found this way, no match was found in the string and the VM halts.

Using choice points we could implement the example lpeg.P"ab" + lpeg.P"cd" in bytecodes in a different way (LPEG uses the simpler way shown above, but for more complex patterns it can’t use the lookahead solution using testchar):

choice L1
char a
char b
L1: char c
char d


Some patterns require the VM to produce more output than just “the pattern matched” or “the pattern did not match”. Imagine searching a document for an IPv4 address and all your program responded was “I found one”. In order to recieve additional information about our inputstring, captures are used.

The capture object

In my VM, two types of capture objects are supported, one of them being the position capture. It consists of a single index referencing the point in the inputstring where the object was created.

The other type of capture object is called simplecapture. It consists of an index and a size value, which are used to reference a substring of the inputstring. In addition, simplecaptures have a variable status indicating they are either open or full. If a simplecapture object is open, that means that its size is not yet determined, since the pattern we are capturing is of variable length.

Capture objects are created using the following bytecodes:

  • Fullcapture Position: Pushes a positioncapture object with the current index value to the capture stack.

  • Fullcapture Simple n: Pushes a simplecapture object with current index value and size=n to the capture stack.

  • Opencapture Simple: Pushes an open simplecapture object with current index value and undetermined size to the capture stack.

  • closecapture: Sets the top element of the capturestack to full and sets its size value using the difference between the current index and the index of the capture object.

The RPython Implementation

These, and many more bytecodes were implemented in an RPython-interpreter. By adding jit hints, we were able to generate an efficient JIT. We will now take a closer look at some implementations of bytecodes.

        elif == "any":
            if index >= len(inputstring):
                fail = True
                pc += 1
                index += 1


The code for the any-bytecode is relatively straight-forward. It either advances the pc and index or sets the VM into the fail state, depending on whether the end of the inputstring has been reached or not.

        if == "char":
            if index >= len(inputstring):
                fail = True
            elif instruction.character == inputstring[index]:
                pc += 1
                index += 1
                fail = True

The char-bytecode also looks as one would expect. If the VM’s string index is out of range or the character comparison fails, the VM is put into the fail state, otherwise the pc and index are advanced by 1. As you can see, the character we’re comparing the current inputstring to is stored in the instruction object (note that this code-example has been simplified for clarity, since the actual implementation includes a jit-optimization that allows the VM to execute multiple successive char-bytecodes at once).

        elif == "jmp":
            pc = instruction.goto

The jmp-bytecode comes with a goto value which is a pc that we want execution to continue at.

        elif == "choice":
            pc += 1
            choice_points = choice_points.push_choice_point(
                instruction.goto, index, captures)

As we can see here, the choice-bytecode puts a choice point onto the stack that may be backtracked to if the VM is in the fail-state. This choice point consists of a pc to jump to which is determined by the bytecode. But it also includes the current index and captures values at the time the choice point was created. An ongoing topic of jit optimization is which data structure is best suited to store choice points and return addresses. Besides naive implementations of stacks and single-linked lists, more case-specific structures are also being tested for performance.

Benchmarking Result

In order to find out how much it helps to JIT LPeg patterns we ran a small number of benchmarks. We used an otherwise idle Intel Core i5-2430M CPU with 3072 KiB of cache and 8 GiB of RAM, running with 2.40GHz. The machine was running Ubuntu 14.04 LTS, Lua 5.2.3 and we used GNU grep 2.16 as a point of comparison for one of the benchmarks. The benchmarks were run 100 times in a new process each. We measured the full runtime of the called process, including starting the process.

Now we will take a look at some plots generated by measuring the runtime of different iterations of my JIT compared to lua and using bootstrapping to generate a sampling distribution of mean values. The plots contain a few different variants of pypeg, only the one called "fullops" is important for this blog post, however.

This is the plot for a search pattern that searches a text file for valid URLs. As we can see, if the input file is as small as 100 kb, the benefits of JIT optimizations do not outweigh the time required to generate the machine code. As a result, all of our attempts perform significantly slower than LPeg.

This is the plot for the same search pattern on a larger input file. As we can see, for input files as small as 500 kb our VM already outperforms LPeg’s. An ongoing goal of continued development is to get this lower boundary as small as possible.

The benefits of a JIT compared to an Interpreter become more and more relevant for larger input files. Searching a file as large as 5 MB makes this fairly obvious and is exactly the behavior we expect.

This time we are looking at a different more complicated pattern, one that parses JSON used on a 50 kb input file. As expected, LPeg outperforms us, however, something unexpected happens as we increase the filesize.

Since LPeg has a defined maximum depth of 400 for the choicepoints and returnaddresses Stack, LPeg by default refuses to parse files as small as 100kb. This raises the question if LPeg was intended to be used for parsing. Until a way to increase LPeg’s maximum stack depth is found, no comparisons to LPeg can be performed at this scale. This has been a low priority in the past but may be addressed in the future.

To conclude, we see that at sufficiently high filesizes, our JIT outperforms the native LPeg-interpreter. This lower boundary is currently as low as 100kb in filesize.


Writing a JIT for PEG’s has proven itself to be a challenge worth pursuing, as the expected benefits of a JIT compared to an Interpreter have been achieved. Future goals include getting LPeg to be able to use parsing patterns on larger files, further increasing the performance of our JIT and comparing it to other well-known programs serving a similar purpose, like grep.

The prototype implementation that I described in this post can be found on Github (it's a bit of a hack in some places, though).