Tuesday, June 8, 2010

A JIT for Regular Expression Matching

This is part 2 of a series, see Part 1 for an introduction. In this post I want to describe how the JIT generator of the PyPy project can be used to turn the elegant but not particularly fast regular expression matcher from the first part into a rather fast implementation. In addition, I will show some speed measurements against various regular expression implementations.

Again, note the disclaimer: This technology could not easily be used to implement Python's re-module.

Example Expression and First Numbers

The regular expression I will use as an example in the rest of this paper is the expression (a|b)*a(a|b){20}a(a|b)*. It matches all strings that have two a with exactly 20 characters between them. This regular expression has the property that the corresponding DFA needs 2**(n+1) different states. As an input string, we use a random string (of varying lengths) that does not match the regular expression. I will give all results as number of chars matched per second. While this is not a particularly typical regular expression, it should still be possible to get some ballpark numbers for the speeds of various implementations – as we will see, the differences between implementations are huge anyway.

All the benchmarks were performed on my laptop, which has an Intel Core2 Duo P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM running Ubuntu Linux 10.04.

To get a feeling for the orders of magnitude involved, the CPython re module (which is implemented in C and quite optimized) can match 2'500'000 chars/s. Google's new re2 implementation still matches 550'000 chars/s. Google's implementation is slower, but their algorithm gives complexity and space guarantees similar to our implementation in the last blog post.

On the other end of the performance scale is the pure-Python code from the last blog post running on CPython. It can match only 12'200 chars/s and is thus 200 times slower than the re module.

Translating the Matcher

The code described in the last blog post is not only normal Python code, but also perfectly valid RPython code. Nothing particularly dynamic is going on in the code, thus it can be translated with PyPy's translation toolchain to C code. The resulting binary is considerably faster and can match 720'000 chars/s, 60 times faster than the untranslated version.

Another approach is to write equivalent versions of the algorithms in lower level languages. This has been done for C++ by Sebastian Fischer and for Java by Baltasar Trancón y Widemann. The algorithm is object-oriented enough to be mapped very closely to the respective languages. The C++ version is a little bit faster than the RPython version translated to C, at 750'000 chars/s. That's not very surprising, given their similarity. The Java version is more than twice as fast, with 1'920'000 chars/s. Apparently the Java JIT compiler is a lot better at optimizing the method calls in the algorithm or does some other optimizations. One reason for this could be that the Java JIT can assume that the classes it sees are all there are (and it will invalidate the generated machine code if more classes are loaded), whereas the C++ compiler needs to generate code that works even in the presence of more regular expression classes.

Generating a JIT

To get even more performance out of the RPython code, it is possible to generate a JIT for it with the help of the PyPy translation toolchain. To do this, the matching code needs to be extended somewhat by some hints that tell PyPy's JIT generator how this is to be done. The JIT generator can automatically produce a JIT compiler from an RPython interpreter of the source language. In our case, we view the regular expression matcher as an interpreter for regular expressions. Then the match function corresponds to the dispatch loop of a traditional interpreter.

Our regular expression matcher is a very peculiar interpreter. The matcher works by running exactly one loop (the one in match) as many times as the input string is long, irrespective of the "program", i.e. the particular regular expressions. In addition, within the loop there are no conditions (e.g. if statements) at all, it is just linear code. This makes it almost perfectly suited to the JIT generator, which produces a tracing JIT. A tracing JIT compiles the hot loops of a program (i.e. regular expression) and has to do extra work if there are conditions in the loop. In our case, there is exactly one loop per regular expression, without any condition.

JIT Hints

The hints that are needed for the match function of the last blog post can be seen here (the function is slightly rewritten, e.g. the JIT does only properly support a while loop as the main dispatch loop):

jitdriver = jit.JitDriver(reds=["i", "result", "s"], greens=["re"])

def match(re, s):
    if not s:
        return re.empty
    # shift a mark in from the left
    result = re.shift(s[0], 1)
    i = 1
    while i < len(s):
        jitdriver.can_enter_jit(i=i, result=result, s=s, re=re)
        jitdriver.jit_merge_point(i=i, result=result, s=s, re=re)
        # shift the internal marks around
        result = re.shift(s[i], 0)
        i += 1
    re.reset()
    return result

The jitdriver is an instance describing the data of the interpreter we are dealing with. The arguments to the constructor need to list all local variables of the dispatch loop. The local variables are classified into two classes, red ones and green ones. The green ones hold the objects that make up the program that the interpreter currently runs and which position in the program is currently being executed. In a typical bytecode interpreter, the bytecode object and the program counter would be green. In our case, the regular expression is the program, so it is green. The rest of the variables are red.

The green variables are treated specially by the JIT generator. At runtime, for a given value of the green variables, one piece of machine code will be generated. This piece of machine code can therefore assume that the value of the green variable is constant.

There are two additional hints, which are method calls on the jitdriver instance. The jit_merge_point method marks the beginning of the main interpreter loop. The can_enter_jit function marks the point where a loop in the user program can be closed, which in our case is trivial, it's just at the end of the interpreter loop (for technical reasons it is put at the beginning, because nothing must happen between the can_enter_jit and jit_merge_point invocations).

Those are the hints that the JIT generator needs to function at all. We added some additional hints, that give the JIT generator more information to work with. Those hints are immutability information, which means that certain instance fields can not be changed after the object has been constructed. Apart from the marked field, none of the fields of any of the Regex subclasses can change. For example for the Char class this is expressed in the following way:

class Char(Regex):
    _immutable_fields_ = ["c"]
    def __init__(self, c):
        ...

These hints allow the generated JIT to constant-fold reads out of the immutable fields in some situations.

Adaptions to the Original Code

In the introduction above I wrote that the code within the loop in match uses no conditions. It is indeed true that none of the _shift methods have an if statement or similar. However, there are some hidden conditions due to the fact that the and and or boolean operators are used, which are short-circuiting. Therefore the JIT-version of the code needs to be adapted to use the non-short-circuiting operators & and |.

JIT Example

To get an impression of how the generated machine code looks like, consider the regular expression (a|b)*. As regular expression objects this would be Repetition(Alternative(Char('a'), Char('b'))). The machine code in its intermediate, machine-independent form looks as follows (I have slightly cleaned it up and added comments for clarity):

# arguments of the loop
# i0 is i in the match function
# result0 is result in the match function
# s0 is s in the match function
[i0, result0, s0] # those are the arguments to the machine code
char = s0[i0] # read the character
# read the current mark:
i5 = ConstPtr(ptr_repetition).marked
i7 = char == 'a' # is the character equal to 'a'
i8 = i5 & i7
i10 = char == 'b' # is the character equal to 'b'
i11 = i5 & i10
# write new mark
ConstPtr(ptr_chara).marked = i8
i13 = i8 | i11
# write new mark
ConstPtr(ptr_charb).marked = i11
# write new mark
ConstPtr(ptr_alternative).marked = i13
# increment the index
i17 = i0 + 1
i18 = len(s0)
# write new mark
ConstPtr(ptr_repetition).marked = i13
# check that index is smaller than the length of the string
i19 = i17 < i18
if not i19:
    go back to normally running match
jump(i17, i13, s0) # start from the top again

The various ConstPtr(ptr_*) denote constant addresses of parts of the regular expression tree:

  • ptr_repetition is the Repetition
  • ptr_chara is Char('a')
  • ptr_charb is Char('b')
  • ptr_alternative is the Alternative

Essentially the machine code reads the next char out of the string, the current mark out of the Repetition and then performs some boolean operations on those, writing back the new marks. Note in particular how the generated machine code does not need to do any method calls to shift and _shift and that most field reads out of the regular expression classes have been optimized away, because the fields are immutable. Therefore the machine code does not need to deconstruct the tree of regular expression objects at all, it just knows where in memory the various parts of it are, and encodes that directly into the code.

Performance Results With JIT

With the regular expression matcher translated to C and with a generated JIT, the regular expression performance increases significantly. Our running example can match 16'500'000 chars/s, which is more than six times faster than the re module. This is not an entirely fair comparison, because the re module can give more information than just "matches" or "doesn't match", but it's still interesting to see. A more relevant comparison is that between the program with and without a JIT: Generating a JIT speeds the matcher up by more than 20 times.

Conclusion

So, what have we actually won? We translated the relatively simple and very slow regular expression matching algorithm from the last post to C and were thus able to speed it up significantly. The real win is gained by also generating a JIT for the matcher, which can be regarded as a simple interpreter. The resulting matcher is rather fast.

The lesson from these posts is not that you can or should write a practical and general regular expression module in this way – indeed, enhancing the algorithm to support more features of the re module would be a lot of work and it is also unclear what the speed results for more realistic regular expressions would be. However, it makes for a great case study of the JIT generator. It was relatively straightforward to generate a JIT for the regex matcher, and the speed results were great (Admittedly I know rather a lot about PyPy's JIT though). This approach is generalizable to many programs that are sufficiently "interpreter-like" (whatever that exactly means).

All the results that appeared at various points in this blog post can be seen here:

Implementation chars/s speedup over pure Python
Pure Python code 12'200 1
Python re module 2'500'000 205
Google's re2 implementation 550'000 45
RPython implementation translated to C 720'000 59
C++ implementation 750'000 61
Java implementation 1'920'000 157
RPython implementation with JIT 16'500'000 1352

Sources

All the source code can be found in my Subversion user directory on Codespeak.

Edit:

Armin is right (see first comment). I fixed the problem.

12 comments:

Armin Rigo said...

Warning: the first example is wrong: there should be no code executed between can_enter_jit() and jit_merge_point(). In this case, there is the exit condition of the loop. It needs to be rewritten as a "while True:" loop with a "break" just before can_enter_jit().

Carl Friedrich Bolz-Tereick said...

@Armin: Damn, you're right. I fixed the blog post.

Nelson Elhage said...

What happens if you don't replace and and or?

Without those changes, the modifications for JIT really are prety small --
mostly just some annotations in the main loop and at toplevel for each
class. With those changes, though, you need to potentially check the entire
codebase of your interpreter.

Pretty fun performance results, though.

Carl Friedrich Bolz-Tereick said...

@Nelson: If you don't change the "and" and "or" you get a lot of assembler code generated, and it's not particularly fast.

Note that this "and" and "or" business is quite specific to this particular example. Usually you can work more incrementally by generating a JIT, then looking at the produced assembler and then doing some small changes in the interpreter to improve parts of it. Each such change is usually localized to one part of the interpreter improves the performance of some language feature.

This example is not really large enough to show this way of working, though :-). Maybe at some point I should write a walk-through for some interpreter.

Kumo said...

Would it be possible to create a pypy or cpython extension module this way?

Jared Forsyth said...

Could you post your 'test runner' code? I'm running some tests (with your) code and getting drastically different numbers...

Carl Friedrich Bolz-Tereick said...

@jabapyth: there is no test runner code. I am simply running something like

genrandom 20 1000000 | time regex-c

What performance results are you getting? Are you sure that you translated jitregex.py with -Ojit? Otherwise the JIT is not put into the executable.

Maxim Yegorushkin said...

boost::regex is not mentioned. It's got both recursive and non-recursive implementations. And it is the base of the standard C++ TR1 regex. Would be interesting to stack it up against other results because it is

Maciej Fijalkowski said...

We can't possibly include all regex engines (even if we would like to). However, sources are out there and you can always rerun those benchmarks and see how it compares :-)

Cheers,
fijal

Nikhil said...
This comment has been removed by the author.
Nikhil said...

I'm not able to access the code on codespeak.net. Has the code been moved to some other place?

Maciej Fijalkowski said...

The code has been merged to PyPy since I think. Look up cfbolz repos on bitbucket though