Tuesday, June 10, 2008

List comprehension implementation details

List comprehensions are a nice feature in Python. They are, however, just syntactic sugar for for loops. E.g. the following list comprehension:

def f(l):
    return [i ** 2 for i in l if i % 3 == 0]

is sugar for the following for loop:

def f(l):
    result = []
    for i in l:
        if i % 3 == 0:
            result.append(i ** 2)
    return result

The interesting bit about this is that list comprehensions are actually implemented in almost exactly this way. If one disassembles the two functions above one gets sort of similar bytecode for both (apart from some details, like the fact that the append in the list comprehension is done with a special LIST_APPEND bytecode).

Now, when doing this sort of expansion there are some classical problems: what name should the intermediate list get that is being built? (I said classical because this is indeed one of the problems of many macro systems). What CPython does is give the list the name _[1] (and _[2]... with nested list comprehensions). You can observe this behaviour with the following code:

$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> [dir() for i in [0]][0]
['_[1]', '__builtins__', '__doc__', '__name__', 'i']
>>> [[dir() for i in [0]][0] for j in [0]][0]
['_[1]', '_[2]', '__builtins__', '__doc__', '__name__', 'i', 'j']

That is a sort of nice decision, since you can not reach that name by any "normal" means. Of course you can confuse yourself in funny ways if you want:

>>> [locals()['_[1]'].extend([i, i + 1]) for i in range(10)]
[0, 1, None, 1, 2, None, 2, 3, None, 3, 4, None, 4, 5, None, 5, 6, None, 6, 7, None, 7, 8, None, 8, 9, None, 9, 10, None]

Now to the real reason why I am writing this blog post. PyPy's Python interpreter implements list comprehensions in more or less exactly the same way, with on tiny difference: the name of the variable:

$ pypy-c-53594-generation-allworking
Python 2.4.1 (pypy 1.0.0 build 53594) on linux2
Type "help", "copyright", "credits" or "license" for more information.
``the globe is our pony, the cosmos our real horse''
>>>> [dir() for i in [0]][0]
['$list0', '__builtins__', '__doc__', '__name__', 'i']

Now, that shouldn't really matter for anybody, should it? Turns out it does. The following way too clever code is apparently used a lot:

__all__ = [__name for __name in locals().keys() if not __name.startswith('_') '
               or __name == '_']

In PyPy this will give you a "$list0" in __all__, which will prevent the import of that module :-(. I guess I need to change the name to match CPython's.

Lesson learned: no detail is obscure enough to not have some code depending on it. Mostly problems on this level of obscurity are the things we are fixing in PyPy at the moment.

11 comments:

  1. In fairness, the clever code does not depend on the name looking as it actually does in CPython; the clever code merely expects that variables auto-created by Python internals will begin with an underscore. Which is far more reasonable than actually expecting the specific name "_[1]" (and, wow, you're right, that does look weird; you've shown me something I've never seen before about Python!) to turn up in the variable list.

    ReplyDelete
  2. Actually, that piece of code is looking to export only public identifiers, right? It's trying to exclude things prefixed with an underscore that are in the file scope.

    ReplyDelete
  3. I would have said "Lesson learned: when MIT hackers in the 1960's come up with some funny thing called GENSYM, it's not just because they're weird; it really does serve a purpose". But then I'm an asshole Lisp hacker. :-)

    ReplyDelete
  4. anonymous: Using gensym for getting the symbol wouldn't have helped in this case at all. The gensymmed symbol would still have showed up in the locals() dictionary. So depending on whether the gensym implementation returns symbols that start with an underscore or not the same bug would have occured.

    ReplyDelete
  5. Other languages have the capability/design/philosophy to make such implementation details totally unobservable.

    Haskell has list comprehensions which expand into normal code. These cannot expose implementation details or temporary names.

    ReplyDelete
  6. turingtest: I agree that that would be preferable, but it's sort of hard with the current interpreter design. Also, it's a pragmatic implementation in that the interpreter didn't have to change at all to add the list comps.

    ReplyDelete
  7. The code's not overly clever, it's ridiculous, because it exactly duplicates the effects of not having __all__ at all. From foo import * already won't import names prefaced with an underscore. Also from the google code search it looks like it's mostly used in Paste, most of the other hits are false positives.

    The "from foo import *" case (without __all__ defined) is a good enough reason to match the cpython naming, though, the useless code in Paste not withstanding.

    ReplyDelete
  8. carl: something like GENSYM would still help, since the symbol generated is not accessible from any package.

    That's difference between gensym and mktemp. However, I don't believe that python has the concept of uninterned symbols (someone who knows more about python could correct me).

    ReplyDelete
  9. arkanes: no, the "from foo import *" case isn't really changed by the different choice of symbols because the new variable is really only visible within the list comprehension and deleted afterwards. It doesn't leak (as opposed to the iteration variable).

    ReplyDelete
  10. arkanes: This is not the same as not having __all__ defined. __all__ would skip the function _() which is used to mark and translate strings with gettext. In other words, it is emulating the default no __all__ behavior and adding in _()

    Carl: doesn't the "$list0" get imported without the all? If not what keeps it from causing a problem normally? Could you not just delete the $list0 variable after assigning it to the LHS?

    ReplyDelete
  11. chris: yes, deleting this variable is exactly what PyPy does (and CPython as well). That's what I was trying to say in my last post.

    The bug with the __all__ only occurs because locals is called within the list comprehension. After the list comprehension is done there is no problem.

    ReplyDelete

See also PyPy's IRC channel: #pypy at freenode.net, or the pypy-dev mailing list.
If the blog post is old, it is pointless to ask questions here about it---you're unlikely to get an answer.