Tuesday, January 8, 2008

Visualizing a Python tokenizer

Armin and me have been working on PyPy's parser and bytecode compiler for the Python language in the last days. Armin implemented several bytecode optimizations that CPython has since a while whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done whereas the new parser is not very far yet.

What is done, however, is the Python tokenizer. It is implemented in the usual way, by using a set of regular expressions to generate a deterministic finite automaton (DFA). This automaton is then turned into a big function which does the actual tokenization. Of course the picture is not quite as simple for Python, because it is not possible to tokenize Python using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels) which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces to turn whitespace tokens into the proper indent and dedent tokens.

For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer is able to visualize interactively any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:

$ python pypy/bin/dotviewer.py tokenizer.dot

The following is a screenshot of the graphviewer:

For people who don't want do checkout PyPy I generated a (rather big) png for the DFA.

Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.

11 comments:

  1. Hi Robin,

    Yes, this is sort of cool already, but it doesn't really give you as much information as the grammar itself. It's more of an overview-like thing.

    Cheers,

    Carl Friedrich

    ReplyDelete
  2. Check out ANTLRWorks or ANTLRStudio for superb visualization of grammars.

    ReplyDelete
  3. Yeah, ANTLRWorks seems pretty nice in that respect. Thanks for the hint.

    ReplyDelete
  4. Hello!

    I was considering to use

    http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget

    for a toy language project for uni (Computer Engineering, IT) because I guess ANTLR(works) and its mailing list could help me a lot in understanding the very basics of 'grammar design'...

    Or at least I hope so! :-)

    However, I've also been lurking the PyPy ml for quite a while now and was considering the possibility to implement the whole (*toy*) interpreter in RPython, so to understand a bit more of PyPy's design by actually coding something simple which makes some use of it. :-)

    So, would you consider trying to port the current ANTLR's Python runtime to RPython a good way for me to start doing something with PyPy?

    Would you consider the thing interesting? I know this possibility had been discussed on IRC some times ago and it wasn't thought to be that useful at last, but maybe you discussed the thing some more since then and changed idea, I don't know...

    How would you rate the difficulty of such a task for a PyPy and ANTLR newbie, also? :-)

    I wouldn't try doing that right now, anyway, but maybe in March I should manage to get some spare time for it. In the meantime, I'd try to read as many docs and sources as possible...

    Tell me your opinion, please! :-)

    Cheers,
    Matteo

    PS: I'm writing here because you were discussing PyPy's Python lexer and somebody wrote about ANTLRworks, but if you think I'd better send this message to the list please just tell me and I'll do so! :-)

    ReplyDelete
  5. Hi Matteo,

    I would post to the pypy-dev mailing list instead, it is better to discuss such longish things there. Thanks for you interest!

    Cheers,

    Carl Friedrich

    ReplyDelete
  6. Does it use Graphviz's dot utility for rendering?

    ReplyDelete
  7. That's bad it cann't be used as a standalone product without the need to install Graphvis.

    ReplyDelete

See also PyPy's IRC channel: #pypy at freenode.net, or the pypy-dev mailing list.
If the blog post is old, it is pointless to ask questions here about it---you're unlikely to get an answer.