PyPy Status Blog: An Efficient and Elegant Regular Expression Matcher in Python

Friday, May 21, 2010

An Efficient and Elegant Regular Expression Matcher in Python

Two weeks ago, I was at the Workshop Programmiersprachen und Rechenkonzepte, a yearly meeting of German programming language researchers. At the workshop, Frank Huch and Sebastian Fischer gave a really excellent talk about an elegant regular expression matcher written in Haskell. One design goal of the matcher was to run in time linear to the length of the input string (i.e. without backtracking) and linear in the size of the regular expression. The memory use should also only be linear in the regular expression.

During the workshop, some of the Haskell people and me then implemented the algorithm in (R)Python. Involved were Frank, Sebastian, Baltasar Trancón y Widemann, Bernd Braßel and Fabian Reck.

In this blog post I want to describe this implementation and show the code of it, because it is quite simple. In a later post I will show what optimizations PyPy can perform on this matcher and also do some benchmarks.

A Note on terminology: In the rest of the post "regular expression" is meant in the Computer Science sense, not in the POSIX sense. Most importantly, that means that back-references are not allowed.

Another note: This algorithm could not be used to implement PyPy's re module! So it won't help to speed up this currently rather slow implementation.

Implementing Regular Expression Matchers

There are two typical approaches to implement regular expression. A naive one is to use a back-tracking implementation, which can lead to exponential matching times given a sufficiently evil regular expression.

The other, more complex one, is to transform the regular expression into a non-deterministic finite automaton (NFA) and then transform the NFA into a deterministic finite automaton (DFA). A DFA can be used to efficiently match a string, the problem of this approach is that turning an NFA into a DFA can lead to exponentially large automatons.

Given this problem of potential memory explosion, a more sophisticated approach to matching is to not construct the DFA fully, but instead use the NFA for matching. This requires some care, because it is necessary to keep track of which set of states the automaton is in (it is not just one state, because the automaton is non-deterministic).

The algorithm described here is essentially equivalent to this approach, however it does not need an intermediate NFA and represents a state of a corresponding DFA as marked regular expression (represented as a tree of nodes). For many details about an alternative approach to implement regular expressions efficiently, see Russ Cox excellent article collection.

The Algorithm

In the algorithm the regular expression is represented as a tree of nodes. The leaves of the nodes can match exactly one character (or the epsilon node, which matches the empty string). The inner nodes of the tree combine other nodes in various ways, like alternative, sequence or repetition. Every node in the tree can potentially have a mark. The meaning of the mark is that a node is marked, if that sub-expression matches the string seen so far.

The basic approach of the algorithm is that for every character of the input string the regular expression tree is walked and a number of the nodes in the regular expression are marked. At the end of the string, if the top-level node is marked, the string matches, otherwise it does not. At the beginning of the string, one mark gets shifted into the regular expression from the top, and then the marks that are in the regex already are shifted around for every additional character.

Let's start looking at some code, and an example to make this clearer. The base class of all regular expression nodes is this:

class Regex(object):
    def __init__(self, empty):
        # empty denotes whether the regular expression
        # can match the empty string
        self.empty = empty
        # mark that is shifted through the regex
        self.marked = False

    def reset(self):
        """ reset all marks in the regular expression """
        self.marked = False

    def shift(self, c, mark):
        """ shift the mark from left to right, matching character c."""
        # _shift is implemented in the concrete classes
        marked = self._shift(c, mark)
        self.marked = marked
        return marked

The match function which checks whether a string matches a regex is:

def match(re, s):
    if not s:
        return re.empty
    # shift a mark in from the left
    result = re.shift(s[0], True)
    for c in s[1:]:
        # shift the internal marks around
        result = re.shift(c, False)
    re.reset()
    return result

The most important subclass of Regex is Char, which matches one concrete character:

class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c

Shifting the mark through Char is easy: a Char instance retains a mark that is shifted in when the current character is the same as that in the instance.

Another easy case is that of the empty regular expression Epsilon:

class Epsilon(Regex):
    def __init__(self):
        Regex.__init__(self, empty=True)

    def _shift(self, c, mark):
        return False

Epsilons never get a mark, but they can match the empty string.

Alternative

Now the more interesting cases remain. First we define an abstract base class Binary for the case of composite regular expressions with two children, and then the first subclass Alternative which matches if either of two regular expressions matches the string (usual regular expressions syntax a|b).

class Binary(Regex):
    def __init__(self, left, right, empty):
        Regex.__init__(self, empty)
        self.left = left
        self.right = right

    def reset(self):
        self.left.reset()
        self.right.reset()
        Regex.reset(self)

class Alternative(Binary):
    def __init__(self, left, right):
        empty = left.empty or right.empty
        Binary.__init__(self, left, right, empty)

    def _shift(self, c, mark):
        marked_left  = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right

An Alternative can match the empty string, if either of its children can. Similarly, shifting a mark into an Alternative shifts it into both its children. If either of the children are marked afterwards, the Alternative is marked too.

As an example, consider the regular expression a|b|c, which would be represented by the objects Alternative(Alternative(Char('a'), Char('b')), Char('c')). Matching the string "a" would lead to the following marks in the regular expression objects (green nodes are marked, white ones are unmarked):

At the start of the process, no node is marked. Then the first char is matched, which adds a mark to the Char('a') node, and the mark will propagate up the two Alternative nodes.

Repetition

The two remaining classes are slightly trickier. Repetition is used to match a regular expression any number of times (usual regular expressions syntax a*):

class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)
        self.re = re

    def _shift(self, c, mark):
        return self.re.shift(c, mark or self.marked)

    def reset(self):
        self.re.reset()
        Regex.reset(self)

A Repetition can always match the empty string. The mark is shifted into the child, but if the Repetition is already marked, this will be shifted into the child as well, because the Repetition could match a second time.

As an example, consider the regular expression (a|b|c)* matching the string abcbac:

For every character, one of the alternatives matches, thus the repetition matches as well.

Sequence

The only missing class is that for sequences of expressions, Sequence (usual regular expressions syntax ab):

class Sequence(Binary):
    def __init__(self, left, right):
        empty = left.empty and right.empty
        Binary.__init__(self, left, right, empty)

    def _shift(self, c, mark):
        old_marked_left = self.left.marked
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(
            c, old_marked_left or (mark and self.left.empty))
        return (marked_left and self.right.empty) or marked_right

A Sequence can be empty only if both its children are empty. The mark handling is a bit delicate. If a mark is shifted in, it will be shifted to the left child regular expression. If that left child is already marked before the shift, that mark is shifted to the right child. If the left child can match the empty string, the right child gets the mark shifted in as well.

The whole sequence matches (i.e. is marked), if the left child is marked after the shift and if the right child can match the empty string, or if the right child is marked.

Consider the regular expression abc matching the string abcd. For the first three characters, the marks wander from left to right, when the d is reached, the matching fails.

More Complex Example

As a more complex example, consider the expression ((abc)*|(abcd))(d|e) matching the string abcabcabcd.

Note how the two branches of the first alternative match the first abc in parallel, until it becomes clear that only the left alternative (abc)* can work.

Complexity

The match function above loops over the entire string without going back and forth. Each iteration goes over the whole tree every time. Thus the complexity of the algorithm is O(m*n) where m is the size of the regular expression and n is the length of the string.

Summary & Outlook

So, what have we achieved now? The code shown here can match regular expressions with the desired complexity. It is also not much code. By itself, the Python code shown above is not terribly efficient. In the next post I will show how the JIT generator can be used to make the simple matcher shown above really fast.