Thursday, June 28, 2012

EuroPython sprint

Hi all,

EuroPython is next week. We will actually be giving a presentation on Monday, in one of the plenary talks: PyPy: current status and GIL-less future. This is the first international PyPy keynote we give, as far as I know, but not the first keynote about PyPy [David Beazley's video] :-)

The other talks are PyPy JIT under the hood and to some extent Performance analysis tools for JITted VMs. This year we are also trying out a help desk. Finally, we will have the usual sprint after EuroPython on Saturday and Sunday.

See you soon!

Armin.

Monday, June 25, 2012

Architecture of Cppyy

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Work started about two years ago, with a follow-up sprint a year later. The module has now reached an acceptable level of maturity; initial documentation with setup instructions, as well as a list of the currently supported language features, is now available here. There is a sizable (non-PyPy) set of unit and application tests that is still being worked through, not all of them of general applicability, so development continues its current somewhat random walk towards full language coverage. However, if you find that cppyy by and large works for you except for certain specific features, feel free to ask for them to be given higher priority.

Cppyy handles bindings differently from what is typically found in other tools with a similar objective, so this update walks through some of these differences and explains why the choices were made the way they were.

The most visible difference is the viewpoint of the Python programmer interacting with the module. The two canonical ways of making Python part of a larger environment are to either embed or extend it. The latter is done with so-called extension modules, which are explicitly constructed to present themselves to the Python programmer very much like normal Python modules. In cppyy, however, the external C++ world is presented from a single entrance point, the global C++ namespace (in the form of the variable cppyy.gbl). Thus, instead of importing a package that contains your C++ classes, usage looks like this (assuming class MyClass in the global namespace):

>>>> import cppyy
>>>> m = cppyy.gbl.MyClass()
>>>> # etc.

This is more natural than it appears at first: C++ classes and functions are, once compiled, represented by unique linker symbols, so it makes sense to give them their own unique place on the Python side as well. This organization allows pythonizations of C++ classes to propagate from one codebase to another, ensures that all normal Python introspection (such as issubclass and isinstance) works as expected in all cases, and makes it possible to represent C++ constructs such as typedefs simply by Python references. Achieving this unified presentation would clearly require a lot of internal administration to track all C++ entities if they each lived in their own, pre-built extension modules. So instead, cppyy generates the C++ bindings at run-time, which brings us to the next difference.

Then again, that is not really a difference: when writing or generating a Python extension module, the result is some C code that consists of calls into Python, which then gets compiled. However, it is not the bindings themselves that are compiled; it is the code that creates the bindings that gets compiled. In other words, any generated or hand-written extension module does exactly what cppyy does, except that it is much more specific: the bound code is hard-wired with e.g. fixed strings and external function calls. The upshot is that in Python, where all objects are first-class and run-time constructs, there is no difference whatsoever between bindings generated at run-time, and bindings generated at ... well, run-time really. There is a difference in organization, though, which goes back to the first point of structuring the C++ class proxies in Python: given that a class will settle in a unique place once bound, instead of inside a module that has no meaning in the C++ world, it follows that it can also be uniquely located in the first place. In other words, cppyy can, and does, make use of a class loader to auto-load classes on demand.

If at this point this all reminds you a bit of ctypes, you would be quite right. In fact, internally cppyy makes heavy use of the RPython modules that form the guts of ctypes. The difficult part of ctypes, however, is the requirement to annotate functions and structures. That is not very pleasant in C, but in C++ there is a whole other level of complexity in that the C++ standard specifies many low-level details that are required for dispatching calls and understanding object layout as "implementation defined." Of course, in the case of open-source compilers, getting at those details is doable, but having to reverse-engineer closed-source compilers gets old rather quickly in more ways than one. More generally, these implementation-defined details prevent a clean interface into C++, i.e. one without a further dependency on the compiler, like the one that the CFFI module provides for C. Still, once internal pointers have been followed, offsets have been calculated, this objects have been provided, etc., etc., the final dispatch into binary C++ is no different from that into C, and cppyy will therefore be able to make use of CFFI internally, like it does with ctypes today. This is especially relevant in the CLang/LLVM world, where stub functions are done away with. To get the required low-level details, then, cppyy relies on a back-end, rather than getting them from the programmer, and this is where Reflex (together with the relevant C++ compiler) comes in, largely automating this tedious process.

There is nothing special about Reflex per se, other than that it is relatively lightweight, available, and has proven to be able to handle huge code bases. It was a known quantity when work on cppyy started, and given the number of moving parts in learning PyPy, that was a welcome relief. Reflex is based on gccxml, and can therefore handle pretty much any C or C++ code that you care to throw at it. It is also technically speaking obsolete as it will not support C++11, since gccxml won't, but its expected replacement, based on CLang/LLVM, is not quite there yet (we are looking at Q3 of this year). In cppyy, access to Reflex, or any back-end for that matter, is through a thin C API (see the schematic below): cppyy asks high level questions to the back-end, and receives low-level results, some of which are in the form of opaque handles. This ensures that cppyy is not tied to any specific back-end. In fact, currently it already supports another, CINT, but that back-end is of little interest outside of High Energy Physics (HEP). The Python side is always the same, however, so any Python code based on cppyy does not have to change if the back-end changes. To use the system, a back-end specific tool (genreflex for Reflex) is first run on a set of header files with a selection file for choosing the required classes. This produces a C++ file that must be compiled into a shared library, and a corresponding map file for the class loader. These shared libraries, with their map files alongside, can be put anywhere as long as they can be located through the standard paths for the dynamic loader. With that in place, the setup is ready, and the C++ classes are available to be used from cppyy.
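
To make the workflow concrete, here is a hedged sketch (the file and library names, the selection file, and the compiler flags are hypothetical and installation-dependent):

# Outside Python: generate and compile the Reflex dictionary.
#   $ genreflex MyClass.h --selection=selection.xml -o MyClass_rflx.cpp
#   $ g++ -shared -fPIC MyClass_rflx.cpp -o libMyClassDict.so -lReflex
# Then, from PyPy; the map file is picked up from alongside the library:
import cppyy
cppyy.load_reflection_info("libMyClassDict.so")
m = cppyy.gbl.MyClass()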

So far, nothing that has been described is specific to PyPy. In fact, most of the technologies described have been used for a long time on CPython already, so why the need for a new, PyPy-specific, module? To get to that, it is important to first understand how a call is mediated between Python and C++. In Python, there is the concept of a PyObject, which has a reference count, a pointer to a type object, and some payload. There are APIs to extract the low-level information from the payload for use in the C++ call, and to repackage any results from the call. This marshalling is where the bulk of the time is spent when dispatching. To be absolutely precise, most C++ extension module generators produce slow dispatches because they don't handle overloads efficiently, but even in there, they still spend most of their time in the marshalling code, albeit in calls that fail before trying the next overload. In PyPy, speed is gained by having the JIT unbox objects into the payload only, allowing it to become part of compiled traces. If the same marshalling APIs were used, the JIT is forced to rebox the payload, hand it over through the API, only to have it unboxed again by the binding. Doing so is dreadfully inefficient. The objective of cppyy, then, is to keep all code transparent to the JIT until the absolute last possible moment, i.e. the call into C++ itself, therefore allowing it to (more or less) directly pass the payload it already has, with an absolute minimal amount of extra work. In the extreme case when the binding is not to a call, but to a data member of an object (or to a global variable), the memory address is delivered to the JIT and this results in direct access with no overhead. Note the interplay: cppyy in PyPy does not work like a binding in the CPython sense that is a back-and-forth between the interpreter and the extension. Instead, it does its work by being transparent to the JIT, allowing the JIT to dissolve the binding. And with that, we have made a full circle: if to work well with the JIT, and in so doing achieve the best performance, you can not have marshalling or do any other API-based driving, then the concept of compiled extension modules is out, and the better solution is in run-time generated bindings.

That leaves one final point. What if you do want to present an extension module-like interface to the programmers who use your code? But of course, this is Python: everything consists of first-class objects, whose behavior can be changed on the fly. In CPython, you might hesitate to make such changes, as every overlay or indirection results in quite a bit of overhead. With PyPy, however, these layers are all optimized out of existence, making that a non-issue.
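
To illustrate, here is a minimal sketch of such an overlay, written in plain Python on top of cppyy (the facade class is hypothetical, not part of cppyy itself):

import cppyy

class ModuleFacade(object):
    """Present a C++ namespace as if it were a Python module."""
    def __init__(self, namespace):
        self._namespace = namespace
    def __getattr__(self, name):
        # extra indirection; under PyPy the JIT optimizes it away
        return getattr(self._namespace, name)

mymodule = ModuleFacade(cppyy.gbl)   # now use mymodule.MyClass, etc.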

This posting laid out the reasoning behind the organization of cppyy. A follow-up is planned, to explain how C++ objects are handled and represented internally.

Wim Lavrijsen

Monday, June 18, 2012

Release 0.1 of CFFI

Hi.

We're pleased to announce the first public release, 0.1, of CFFI, a way to call C from Python.
(This release does not support PyPy yet --- but we announce it here, as PyPy support is planned
for the next release :-)

The package is available on bitbucket as well as documented. You can also install it
straight from the python package index (pip).

The aim of this project is to provide a convenient and reliable way of calling C code from Python.
The interface is based on LuaJIT's FFI and follows a few principles:

  • The goal is to call C code from Python. You should be able to do so
    without learning a 3rd language: every alternative requires you to learn
    its own language (Cython, SWIG) or API (ctypes). So we tried to
    assume that you know Python and C and minimize the extra bits of API that
    you need to learn.
  • Keep all the Python-related logic in Python so that you don't need to
    write much C code (unlike CPython native C extensions).
  • Work either at the level of the ABI (Application Binary Interface)
    or the API (Application Programming Interface). Usually, C
    libraries have a specified C API but often not an ABI (e.g. they may
    document a "struct" as having at least these fields, but maybe more).
    (ctypes works at the ABI level, whereas Cython or native C extensions
    work at the API level.)
  • We try to be complete. For now some C99 constructs are not supported,
    but all C89 should be, including macros (and including macro "abuses",
    which you can manually wrap in saner-looking C functions).
  • We attempt to support both PyPy and CPython (although PyPy support is not
    complete yet) with a reasonable path for other Python implementations like
    IronPython and Jython.
  • Note that this project is not about embedding executable C code in
    Python, unlike Weave. This is about calling existing C libraries
    from Python.
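
As a minimal taste of the ABI level, here is a sketch of calling the C function printf (this assumes a Unix-like system, where dlopen(None) gives access to the standard C library):

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    int printf(const char *format, ...);   // copy-pasted from the man page
""")
C = ffi.dlopen(None)                # load the standard C library
arg = ffi.new("char[]", "world")    # allocate a new char[] and initialize it
C.printf("hi there, %s!\n", arg)    # call printf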

Status of the project

Consider this as a beta release. Creating CPython extensions is fully supported and the API should
be relatively stable; however, minor adjustments of the API are possible.

PyPy support is not yet done and this is a goal for the next release. There are vague plans to make this the
preferred way to call C from Python that can reliably work between PyPy and CPython.

Right now CFFI's verify() requires a C compiler and header files to be available at run-time.
This limitation will be lifted in the near future, and CFFI will then contain a way to cache the resulting binary.
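
For the API level, here is a small sketch of verify(); the one-line C source is a stand-in for a real header plus library:

from cffi import FFI

ffi = FFI()
ffi.cdef("int add_one(int x);")     # declare the API we want to use
lib = ffi.verify("int add_one(int x) { return x + 1; }")   # compiled on the fly
assert lib.add_one(41) == 42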

Cheers,

Armin Rigo and Maciej FijaƂkowski

Sunday, June 10, 2012

STM with threads

Hi all,

A quick update. The first version of pypy-stm based on regular threads is ready. With no JIT yet and a 4-or-5-times performance hit, it is not particularly fast, but I am happy that it turns out not to be much slower than the previous thread-less attempts. It is at least fast enough to run faster (in real time) than an equivalent no-STM PyPy, if fed with an eight-threaded program on an eight-core machine (provided, of course, you don't mind it eating all 8 cores' CPU power instead of just one :-).

You can download and play around with this binary for Linux 64. It
was made from the stm-thread branch of the PyPy repository (translate.py --stm -O2 targetpypystandalone.py). (Be sure
to put it where it can find its stdlib, e.g. by putting it inside the
directory from the official 1.9 release.)

This binary supports the thread module and runs without the GIL.
So, despite the factor-of-4 slow-down issue, it should be the fourth
complete Python interpreter in which we can reasonably claim to have
resolved the problem of the GIL. (The first one was Greg Stein's Python
1.4, re-explored here; the second one is Jython; the third one is
IronPython.) Unlike the previous three, it is also the first one to
offer full GIL semantics to the programmer, and additionally
thread.atomic (see below). I should also add that we're likely to
see in the next year a 5th such interpreter, too, based on Hardware
Transactional Memory (same approach as with STM, but using e.g.
Intel's HTM).

The binary I linked to above supports all built-in modules from PyPy, apart from signal, which is still being worked on (a bit annoying, because standard library modules like subprocess depend on it). The sys.get/setcheckinterval() functions can be used to tweak the frequency of the automatic commits. Additionally, it offers thread.atomic, described in the previous blog post as a way to create longer atomic sections (with the observable effect of preventing the "GIL" from being released during that time). A complete transaction.py module based on it is available from the sources.
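
To illustrate, here is a minimal sketch of thread.atomic in use, assuming the interface described above (the counter example itself is hypothetical):

import thread, threading

counter = [0]

def work():
    for i in range(10000):
        with thread.atomic:      # no transaction break inside this block
            counter[0] += 1      # so no explicit lock is needed

threads = [threading.Thread(target=work) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print counter[0]                 # always 80000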

The main missing features are:

  • the signal module;
  • the Garbage Collector, which does not do major collections so far, only
    minor ones;
  • and finally, the JIT, which needs some amount of integration to generate
    the correctly-tweaked assembler.

Have fun!

Armin.

Friday, June 8, 2012

PyPy 1.9 - Yard Wolf

We're pleased to announce the 1.9 release of PyPy. This release brings mostly
bugfixes, performance improvements, other small improvements and overall
progress on the numpypy effort.
It also brings an improved situation on Windows and OS X.

You can download the PyPy 1.9 release here:

http://pypy.org/download.html

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7. It's fast (pypy 1.9 and cpython 2.7.2 performance comparison)
due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64 or
Windows 32. Windows 64 work is still stalled; we would welcome a volunteer
to handle that.

Thanks to our donors

But first of all, we would like to say thank you to all the people who
donated some money to one of our four calls for donations.

Thank you all for proving that it is indeed possible for a small team of
programmers to get funded like that, at least for some
time. We want to include this thank you in the present release
announcement even though most of the work is not finished yet. More
precisely, neither Py3k nor STM is ready to make it into an official release
yet: people interested in them need to grab and (attempt to) translate
PyPy from the corresponding branches (respectively py3k and
stm-thread).

Highlights

  • This release still implements Python 2.7.2.
  • Many bugs were corrected for Windows 32 bit. This includes new
    functionality to test the validity of file descriptors, and
    correct handling of the calling conventions for ctypes. (Still not
    much progress on Win64.) A lot of work on this has been done by Matti Picus
    and Amaury Forgeot d'Arc.
  • Improvements in cpyext, our emulator for CPython C extension modules.
    For example PyOpenSSL should now work. We thank various people for help.
  • Sets now have strategies just like dictionaries. This means for example
    that a set containing only ints will be more compact (and faster).
  • A lot of progress on various aspects of numpypy. See the numpy-status
    page for the automatic report.
  • It is now possible to create and manipulate C-like structures using the
    PyPy-only _ffi module. The advantage over using e.g. ctypes is that
    _ffi is very JIT-friendly, and getting/setting of fields is translated
    to a few assembler instructions by the JIT. However, this is mostly intended
    as a low-level backend to be used by more user-friendly FFI packages, and
    the API might change in the future. Use it at your own risk.
  • The non-x86 backends for the JIT are progressing but are still not
    merged (ARMv7 and PPC64).
  • JIT hooks for inspecting the created assembler code have been improved.
    See JIT hooks documentation for details.
  • select.kqueue has been added (BSD).
  • Handling of keyword arguments has been drastically improved in the best-case
    scenario: proxy functions which simply forward *args and **kwargs
    to another function now perform much better with the JIT (see the sketch
    after this list).
  • List comprehensions have been improved.
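
A hedged sketch of the keyword-argument case mentioned above (the decorator and function names are hypothetical; the point is the *args/**kwargs forwarding inside proxy):

def log_call(func):
    def proxy(*args, **kwargs):
        print 'calling', func.__name__
        return func(*args, **kwargs)   # this forwarding is now much faster
    return proxy

@log_call
def add(a, b=0):
    return a + b

add(1, b=2)   # prints 'calling add' and returns 3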

JitViewer

There will be a corresponding 1.9 release of JitViewer which is guaranteed to work
with PyPy 1.9. See the JitViewer docs for details.

Cheers,
The PyPy Team

Tuesday, June 5, 2012

Py3k status update #4

This is the fourth status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

For various reasons, less work than usual has been done since the last status
update. However, some interesting things happened anyway.

As readers know, so far we spent most of the effort in fixing all PyPy's own
tests which started to fail for various py2/py3 differences. Most of them
failed for shallow reasons, e.g. syntactic changes or the int/long
unification. Others failed for subtle differences and needed a bit more care,
for example the fact that unbound methods are gone in Py3k.

The good news is that finally we are seeing the light at the end of the tunnel. Most of them have been fixed. For some other tests, we introduced the concept of "py3k-skipping": some optimizations and modules are indeed failing, but right now we are concentrating on completing the core language and so we are not interested in those. When the core language is done, we will be able to easily find and work on the py3k-skipped tests. In particular, for now we disabled the Int and String dict strategies, which are broken because of the usual int/long unification and str vs bytes. As for modules, for now _continuation (needed for stackless) and _multiprocessing do not work yet.

Another non-trivial feature we implemented is the proper cleanup of exception variables when we exit except blocks. This is a feature which touches lots of levels of PyPy, from the astcompiler down to the bytecode interpreter. It took two days of headaches, but in the end we made it :-).
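
For reference, this is the Python 3 semantics we implemented; the exception variable is deleted as soon as the except block is exited:

try:
    1 / 0
except ZeroDivisionError as e:
    pass
print(e)   # NameError in Python 3: 'e' was deleted when the block exited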

Additionally, Amaury did a lot of improvements to cpyext, which had been
broken since forever on this branch.

As for the next plans, now that things are starting to work and PyPy's own
tests mostly pass, we can finally start to run the compiled PyPy against
CPython's test suite. It is very likely that we will have tons of failures at
the beginning, but once we start to fix them one by one, a Py3k-compatible
PyPy will be closer and closer.

Monday, May 7, 2012

STM update: back to threads?

Hi again,

Here is another update on the status of Software Transactional Memory on PyPy.

Those of you who have been closely following this blog since last year know that, from the very first post about STM, I explored various design ideas about the API that we should get when programming in Python.

I have gone full circle, and now I am back where I started (with, as an important difference, a very roughly working implementation of pypy-stm).

What I realized is that the "thread" module is not that bad after all --- I mean, yes, it is a horribly low-level interface, but it is general enough to build various interesting things on top of it. What the "stm-thread" branch of PyPy contains is, basically, the regular "thread" module in which the GIL was replaced with STM. It gives multicore capabilities to any program based on multiple threads. (This is so far exactly the same idea as the one being investigated for Hardware Transactional Memory. It is roughly also what you would get if you managed to convince GCC 4.7 to compile CPython using STM.)

Now while this might already be quite interesting to some people, here is how it relates to all I said previously: namely, threads are bad, and some new "transaction" module would be a better idea.

There is one new core functionality in the "stm-thread" branch: it is "thread.atomic", a context manager that can be used in a "with" statement (exact name subject to change). In terms of the GIL, it prevents the GIL from being released in the "with" block. In terms of STM, it prevents a "transaction break", which means that the whole "with" statement runs in one single transaction. (From the Python programmer's point of view, the net effect is the same.)

So far, no ground-breaking news. But what I missed previously is that this is enough to give multicore capabilities even to a program that is not using threads so far. It is possible to rewrite an equivalent of the old transaction module in a few pages of pure Python, using "thread.atomic" (see the sketch below). Something along the following lines: start N threads that each read the next job to do from a Queue.Queue() and run it in a "with thread.atomic" block. The STM version of PyPy is then able to run these atomic blocks concurrently. The key point is that the slightly delicate handling of threads should be nicely hidden inside the new "transaction" module, and from outside the observed behavior would be exactly as if the transactions that we schedule were run serially.
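
Here is a hedged sketch of such a pure-Python module, assuming only the "thread.atomic" interface described above (the add/run names merely mirror the old transaction module):

import thread, threading, Queue

_queue = Queue.Queue()

def add(f, *args):
    _queue.put((f, args))        # schedule one transaction

def _worker():
    while True:
        f, args = _queue.get()
        with thread.atomic:      # run the whole job as a single transaction
            f(*args)
        _queue.task_done()

def run(num_threads=4):
    for i in range(num_threads):
        t = threading.Thread(target=_worker)
        t.daemon = True
        t.start()
    _queue.join()                # observably, as if the jobs ran serially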

The point I kept missing was that, yes, this sounds like nonsense, because it seems that we create N threads just to serialize their work again in "thread.atomic" sections. In fact this would be nonsense in any model that would "just" remove the GIL to let multiple threads run concurrently without crashing. Indeed, you have multiple threads, but their atomic blocks would be again a sort of GIL: only one of them would run at a time. And this is indeed the simple model of execution that you get even with STM --- but not the model of performance. The performance with STM scales with the number of cores, as long as there is enough non-conflicting work to do.

So in summary the complete circle back to the starting point is that threads might be a good low-level model. It lends itself naturally to, say, a kind of program in which the main thread polls file descriptors using select() or the Linux epoll(), and the work received is split among N other threads --- which is the kind of program you would naturally write in other languages that don't have a GIL, say Java. The other threads can then use "thread.atomic" blocks to protect sections of their work. The traditional Transactional Memory point of view is that you use such blocks to guard the short sections of code that communicate with other threads or modify global state, but nothing prevents you from using much larger sections: you should be able to scale them up to the size of a native "unit of work", so that every unit is naturally atomic. And then it's only a matter of design: you can tweak an existing module that does the thread pooling to add one "with thread.atomic"; or do it yourself from scratch; or (if the design is compatible enough) just plug in the proposed pure-Python "transaction" module. Or if you feel like it you can even use threads directly (but keep in mind that using threads too explicitly is not a composable abstraction, whereas higher-level designs typically are).

At the end of the day, you can write or reuse programs whose global structure you are already familiar with, for example with a thread pool (that can be hidden in a library if you prefer), or any other structure with or without explicit threads. But you can do so without all the mess that comes with threads, like locks and deadlocks. From that angle it is really similar to Garbage Collection: e.g. the Boehm GC (now used by GCC itself) lets you write C code like you are used to, while forgetting all you had to learn about careful explicit memory management.

Friday, April 27, 2012

STM update (and thanks everybody)

A short update on the Software Transactional Memory (STM) side. Let me remind you that the work is to add STM internally into PyPy, with the goal of letting the user's programs run on multiple cores after a minor adaptation. (The goal is not to expose STM to the user's program.) I will soon write some official documentation that explains in more detail exactly what you get. For now you can read the previous blog posts, and you can also find technical details in the call for donation itself; or directly look at how I adapted the examples linked to later in this post.

I have now reached the point where the basics seem to work. There is no integration with the JIT so far; moreover the integration with the Garbage Collection subsystem is not finished right now, but at least it is "not crashing in my simple tests and not leaking memory too quickly". (It means that it is never calling __del__ so far, although it releases memory; and when entering transactional mode or when going to the next transaction, all live objects become immortal. This should still let most not-too-long-running programs work.)

If you want to play with it, you can download this binary (you need to put it in a place with the paths lib-python and lib_pypy, for example inside the main directory from a regular nightly tarball or from a full checkout). This version was compiled for Linux x86 32-bit from the stm-gc branch on the 25th of April. It runs e.g. the modified version of richards. This branch could also be translated for Linux x86-64, but not for other OSes nor other CPUs for now.

The resulting pypy-stm exposes the same interface as the pure Python transaction module, which is an emulator (running on CPython or any version of PyPy) that can be used to play around and prepare your programs. See the comments in there. A difference is that the real pypy-stm doesn't support epoll right now, so it cannot be used yet to play with a branch of Twisted that was already adapted (thanks Jean-Paul Calderone); but that's coming soon. For now you can use it to get multi-core usage on purely computational programs.

I did for example adapt PyPy's own translate.py: see the tweak in rpython/rtyper.py. Lines 273-281 are all that I needed to add, and they are mostly a "simplification and parallelization" of the lines above. There are a few more places in the whole translate.py that could be similarly modified, but overall it is just that: a few places. I did not measure performance, but I checked that it is capable of using multiple cores in the RTyping step of translation, with --- as expected --- some still-reasonable number of conflicts, particularly at the beginning when shared data structures are still being built.

On a few smaller, more regular examples like richards, I did measure the performance. It is not great, even taking into account that it has no JIT so far. Running pypy-stm with one thread is roughly 5 times slower than running a regular PyPy with no JIT (it used to be better in previous versions, but they didn't have any GC; nevertheless, I need to investigate). However, it does seem to scale. At least, it scales roughly as expected on my 2-real-cores, 4-hyperthreaded-cores laptop (i.e. for N between 1 and 4, the N-threaded pypy-stm performs similarly to N independent pypy-stm's running one thread each).

And finally...

...a big thank you to everyone who contributed some money to support this! As you can see on the PyPy site, we have received more than $6700 so far in only 5 or 6 weeks. Thanks to that, my contract started last Monday, and I am now paid a small salary via the Software Freedom Conservancy (thanks Bradley M. Kuhn for organizational support from the SFC). Again, thank you everybody!

UPDATE: The performance regression was due to disabling an optimization, the method cache, which caused non-deterministic results --- the performance could vary by up to a factor of two. Today, as a workaround, I made the method cache transaction-local for now; it is only effective for transactions that run for long enough (maybe 0.1ms or 1ms), but at least it is there in this situation. In the version of richards presented above, the transactions are too short to make a difference (around 0.015ms).

Tuesday, April 17, 2012

NumPy on PyPy progress report

Hello.

A lot of things happened in March, like PyCon. I was also busy doing other things, so apologies for the late numpy status update.

However, a lot of things have happened and numpy continues to be one of the main points of entry for hacking on PyPy. Apologies to all the people whose patches I don't review in a timely manner, but seriously, you do a lot of work.

This list of changes is definitely not exhaustive, and I might be forgetting important contributions. In a loose order:

  • Matti Picus made the out parameter work for a lot of (but not all) functions.

  • We merged record dtypes support. The only missing dtypes left are complex (important), datetime (less important) and object (which will probably never be implemented because it makes very little sense and is a mess with moving GCs).

  • Taavi Burns and others implemented lots of details, including lots of ufuncs. On the completely unscientific measure of "implemented functions" on the numpypy status page, we're close to 50% of numpy working. In reality it might be more or less, but after complex dtypes we're getting very close to running real programs.

  • Bool indexing of arrays of the same size should work, leaving only arrays-of-ints indexing as the last missing element of fancy indexing.

  • I did some very early experiments on SSE. This work is seriously preliminary - in fact the only implemented operation is addition of float single-dimension numpy arrays. However, results are encouraging, given that our assembler generator is far from ideal:

     

                 Numpy   PyPy SSE   PyPy   GCC non-looped   GCC looped
    a+b          0.6s    0.3s       0.4s   0.3s             0.25s
    a+b+c        1.9s    0.35s      0.5s   0.7s             0.32s
    a+b+c+d+e    3.2s    0.36s      0.8s   1.7s             0.51s

    The benchmark repo is available. GCC was run with -O3, no further options specified. PyPy was run with default options, the SSE branch is under backend-vector-ops, but it's not working completely yet.

    One might argue that the C and the Python versions are not the same code - indeed they are not. It just shows one possible approach to writing numeric code.

The next step would be to continue implementing missing features such as:

  • specialised arrays, i.e. masked arrays and matrices
  • core modules such as fft, linalg, random.
  • numpy's testing framework

The future is hard to predict, but we're not far off!

Cheers,
fijal

UPDATE: Indeed, string and unicode dtypes are not supported yet. They're as important as the complex dtype.

Friday, April 13, 2012

PyCon 2012 wrap up

So, PyCon happened. This was the biggest PyCon ever and probably the biggest gathering of Python hackers ever.

From the PyPy perspective, a lot at PyCon was about PyPy. Listing things:

  • David Beazley presented an excellent keynote describing his experience diving head-first into PyPy and at least partly failing. He, however, did not fail to explain bits and pieces about PyPy's architecture. Video is available.
  • We gave tons of talks, including the tutorial, why pypy by example and pypy's JIT architecture
  • We had a giant influx of new committers, easily doubling the amount of pull requests ever created for PyPy. The main topics for newcomers were numpy and py3k, disproving what David said about PyPy being too hard to dive into ;)
  • Guido argued in his keynote that Python is not too slow. In the meantime, we're trying to prove him correct :-)

We would like to thank everyone who talked to us, shared ideas and especially those who participated in sprints - we're always happy to welcome newcomers!

I'm sure there are tons of things I forgot, but thank you all!

Cheers, fijal

Friday, April 6, 2012

Py3k status update #3

This is the third status update about my work on the py3k branch, which I can work on thanks to all of the people who donated to the py3k proposal.

A lot of work has been done during the last month: as usual, the list of changes is too big to be reported in a detailed way, so this is just a summary of what happened.

One of the most active areas was killing old and deprecated features. In particular, we killed support for the __cmp__ special method and its cousins, the cmp builtin function and the cmp keyword argument for list.sort() and sorted(). Killing is easy, but then you have to fix all the places which break because of this, including all the types which relied on __cmp__ to be comparable, fixing all the tests which tried to order objects which are no longer orderable, and implementing new behavior like forbidding calling hash() on objects which implement __eq__ but not __hash__ (see the example below).
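
The last point, as a small example of the new Python 3 behavior:

class Point(object):
    def __eq__(self, other):
        return isinstance(other, Point)
    # no __hash__ defined: in Python 3, defining __eq__ alone
    # makes instances unhashable

hash(Point())   # raises TypeError: unhashable type: 'Point'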

Among the other features, we killed lots of now-gone functions in the operator module, the builtins apply(), reduce() and buffer, and the os.* functions for dealing with temporary files, which have been deprecated in favour of the new tempfile module.

The other topic which can't miss in a py3k status update is, as usual, string-vs-unicode. At this round, we fixed bugs in string formatting (in particular to teach format() to always use unicode strings) and various corner cases about when calling the (possibly overridden) __str__ method on subclasses of str. Believe me, you don't want to know the precise rules :-).

Other features which we worked on and fixed tests include, but are not limited to, marshal, hashlib, zipimport, _socket and itertools, plus the habitual endless lists of tests which fail for shallow reasons such as the syntactic differences, int vs long, range() vs list(range()) etc. As a result, the number of failing tests dropped from 650 to 235: we are beginning to see the light at the end of the tunnel :-)

Benjamin finished implementing Python 3 syntax. Most of it was small cleanups and tweaks to be compatible with CPython such as making True and False keywords and preventing . . . (note spaces between dots) from being parsed as Ellipsis. Larger syntax additions included keyword only arguments and function annotations.

Finally, we did some RPython fixes, so that it is possible again to translate PyPy in the py3k branch. However, the resulting binary is a strange beast which mixes Python 2 and Python 3 semantics, so it is unusable for anything but showing friends how cool it is.

I would like to underline that I was not alone in doing all this work. In particular, a lot of people joined the PyPy sprint at Pycon and worked on the branch, as you can clearly see in this activity graph. I would like to thank all who helped!

cheers,
Antonio and Benjamin

Thursday, April 5, 2012

PyPy sprint in Leipzig, Germany (June 22-27)

The next PyPy sprint will be held --- for the first time in a while --- in a place where we haven't been so far: Leipzig, Germany, at the Python Academy's Teaching Center. It will take place from the 22nd to the 27th of June 2012, before EuroPython. Thanks to Mike MĂŒller for organizing it!

This is a fully public sprint; everyone is welcome to join us. All days are full sprint days, so it is recommended to arrive on the 21st and leave on the 28th.

Topics and goals

Open. Here are some goals:

  • numpy: progress towards completing the numpypy module; try to use it in real code
  • stm: progress on Transactional Memory; try out the transaction module on real code.
  • jit optimizations: there are a number of optimizations we can still try out or refactor.
  • work on various, more efficient data structures for the Python language. A good example would be lazy string slicing/concatenation or more efficient objects.
  • any other PyPy-related topic is fine too.

Grants

For students, we have the possibility to support some costs via PyPy funds. Additionally, we can support you applying for grants from the PSF and other sources.

Registration

If you'd like to come, please sign up either by announcing yourself on pypy-dev, or by directly adding yourself to the list of people. (We need to have a head count for the organization.) If you are new to the project please drop a note about your interests and post any questions.

More...

For more information, please see the sprint announcement.

Thursday, March 8, 2012

Call for donations for Software Transactional Memory

Hi all,

The Software Transactional Memory call for donations is up. From the proposal:

Previous attempts on Hardware Transactional Memory focused on parallelizing existing programs written using the thread or threading modules. However, as argued here, this may not be the most practical way to achieve real multithreading; it seems that better alternatives would offer good scalability too. Notably, Transactional Memory could benefit any event-based system that is written to dispatch events serially (Twisted-based, most GUI toolkits, Stackless, gevent, and so on). The events would internally be processed in parallel, while maintaining the illusion of serial execution, with all the corresponding benefits of safety. This should be possible with minimal changes to the event dispatchers. This approach has been described by the Automatic Mutual Exclusion work at Microsoft Research, but has not been implemented anywhere (to the best of our knowledge).

Note that, yes, this gives you both sides of the coin: you keep using your non-thread-based program (without worrying about locks and their drawbacks like deadlocks, races, and friends), and your programs benefit from all your cores.

In more detail, a low-level built-in module will provide the basics to start transactions in parallel; but this module will be only used internally in a tweaked version of, say, a Twisted reactor. Using this reactor will be enough for your existing Twisted-based programs to actually run on multiple cores. You, as a developer of the Twisted-based program, have only to care about improving the parallelizability of your program (e.g. by splitting time-consuming transactions into several parts; the exact rules will be published in detail once they are known).

The point is that your program is always correct, and can be tweaked to improve performance. This is the opposite of what explicit threads and locks give you: a performant program which you need to tweak to remove bugs. Arguably, this approach is the reason why you use Python in the first place :-)

Armin

Thursday, March 1, 2012

Py3k status update #2

This is the second status update about my work on the py3k branch, which I can work on thanks to all of the people who donated to the py3k proposal.

Since my previous status update, things have improved a lot: first of all, I fixed the syntax of many more tests, which were failing on the branch because they used constructs which are no longer valid in Python 3, such as u'' strings, the print statement or the old except Exception, e syntax. I have to say that this work is tedious and not very rewarding, but it has to be done anyway, so that the real failures can stand out.

Then, I spent most of the rest of the time by killing features which are present in Python 2 and are gone in Python 3.

Some of them were easy and mechanical: for example, I removed all the function attributes such as func_code and func_closure, which have been renamed to __code__ and __closure__, and then I had to find and fix all the places which still expected the old ones.

Some were trickier: I removed support for the cmp function and the __cmp__ special method, but this also meant that I had to fix a few types which relied on it to be comparable (for example, did you know that the cells contained in __closure__ are comparable?). At the same time, I also removed the old behavior which in Python 2 allowed us to compare arbitrary objects with <, > & co.: in Python 3 the only comparisons allowed between incompatible types are == and != (see the example below).
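
Concretely, the difference looks like this:

1 < "abc"    # Python 2: True (arbitrary, but consistent); Python 3: TypeError
1 == "abc"   # False in both: == and != stay allowed between any types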

Speaking of old special methods, __hex__ and __oct__ are gone as well (and I didn't even know about their existence before removing them :-))

But the most important breakthrough was the removal of the _file module, containing the implementation of the file type in Python 2, which is now gone since in Python 3 files are handled by the _io module. Killing the module was not straightforward, because some of the importing logic was tightly tied to the internal implementation of files, so it needed some refactoring. Finally, I had to fix the marshal module to correctly detect text files vs. byte files.

Among these things, I fixed tons of smaller issues here and there. As a result, there are many fewer failing tests than a few weeks ago. Obviously the number itself does not mean much, because sometimes fixing a single test takes hours, and at other times changing one line fixes tens of tests. But in the end, seeing it drop from 999 to 650 is always nice and rewarding :-).

The road for having a pypy3k is still long, but everything is going fine so far. Stay tuned for more updates!

cheers, Antonio

Thursday, February 16, 2012

Py3k status update

Thanks to all the people who donated to the py3k proposal, we managed to collect enough money to start to work on the first step. This is a quick summary of what I did since I began working on this.
First of all, many thanks to Amaury Forgeot d'Arc, who started the py3k branch months ago, and already implemented lots of features including e.g. switching to "unicode everywhere" and the int/long unification, making my job considerably easier :-)
I started to work on the branch at the last Leysin sprint together with Romain Guillebert, where we worked on various syntactical changes such as extended tuple unpacking and keyword-only arguments. Working on such features is a good way to learn about a lot of the layers which the PyPy Python interpreter is composed of, because often you have to touch the tokenizer, the parser, the ast builder, the compiler and finally the interpreter.
Then I worked on improving our test machinery in various ways, e.g. by optimizing the initialization phase of the object space created by tests, which considerably speeds up small test runs, and by adding the possibility to automatically run our tests against CPython 3, to ensure that we are not trying to fix a test which is meant to fail :-). I also set up our buildbot to run the py3k tests nightly, so that we can have an up-to-date overview of what is left to do.
Finally I started to look at all the tests in the interpreter/ directory, trying to unmangle the mess of failing tests. Lots of tests were failing because of simple syntax errors (e.g., by using the no longer valid except Exception, e syntax or the old print statement), others for slightly more complex reasons like unicode vs bytes or the now gone int/long distinction. Others were failing simply because they relied on new features, such as the new lexical exception handlers.
To give some numbers, at some point in January we had 1621 failing tests in the branch, while today we are under 1000 (to be exact: 999, and this is why I've waited until today to post the status update :-)).
Before ending this blog post, I would like to thank once again all the people who donated to PyPy, who let me do this wonderful job. That's all for now, I'll post more updates soon.
cheers, Antonio

Monday, February 13, 2012

A Larger Example for the Flow Graph Language

Part 4 of Comparing Partial Evaluation to Tracing

This is the fourth and final blog post in a series about comparing partial evaluation and tracing. We've come a long way: In the first post of the series I showed an interpreter for a small flow-graph language together with a partial evaluator for it. In the second post I showed how a tracer for the same language works and how it relates both to execution and to partial evaluation. The third post described an optimizer for traces.

In this final post we can compare and contrast the two different approaches of tracing and partial evaluation by means of an example. The programs in the flow chart language seen so far have been rather small, so I want to give an example of a larger program: an interpreter for an extremely simple bytecode instruction set. I will look at how the partial evaluator deals with that interpreter, and what the tracer does with it. The code for that, as well as all the code of this series, can be found here: http://paste.pocoo.org/show/550282/ (some small additions have been made, such as a nicer way to print traces).

A Bytecode Interpreter

Writing programs in the flow graph language is painful, but I still want to give an example that is a bit more interesting than the tiny ones that we've seen so far. The example is an interpreter for the bytecode of a very trivial register-based language. The language has four registers, one of which is an accumulator on which all the actual operations are performed.

The opcodes of the language are:

  • jump_if_a, jumps to a target address when the accumulator is non-zero
  • mov_a_r0, mov_a_r1, mov_a_r2 move the value of the accumulator to the respective register
  • mov_r0_a, mov_r1_a, mov_r2_a move the value of a register to the accumulator
  • add_r0_to_a, add_r1_to_a, add_r2_to_a add the value of the register to the accumulator
  • decr_a decrement the accumulator
  • return_a stop the program and print the accumulator

The interpreter has a main loop that reads the opcode at the current program counter, does a (lengthy) dispatch to the right bytecode via a series of if statements and then executes the right opcode. Afterwards the next opcode is treated equivalently.

Here is a part of the source code in the flow graph language. As pseudocode:

bytecode_loop:
    opcode = bytecode[pc]
    pc = pc + 1
    c = opcode == 'jump_if_a'
    if c goto op_jump_if_a else goto not_jump_if_a

# select the right bytecode via a long series of if statements
not_jump_if_a:
    c = opcode == 'mov_a_r0'
    if c goto op_mov_a_r0 else goto not_mov_a_r0
not_mov_a_r0:
    c = opcode == 'mov_a_r1'
    if c goto op_mov_a_r1 else goto not_mov_a_r1
...

# bytecode implementations
op_mov_a_r0:
    r0 = a
    goto bytecode_loop

op_jump_if_a:
    c = a == 0
    target = bytecode[pc]
    pc += 1
    if c goto bytecode_loop else goto op_jump_if_a_jump

op_jump_if_a_jump:
    pc = target
    goto bytecode_loop
...

And actually working, as Prolog facts (the full implementation can be found at the link above):

% bytecode dispatch loop
block(bytecode_loop,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(jump_if_a),
      if(c, op_jump_if_a, not_jump_if_a))))).

% select the right bytecode via a long series of if statements
block(not_jump_if_a,
      op2(c, eq, var(opcode), const(mov_a_r0),
      if(c, op_mov_a_r0, not_mov_a_r0))).
block(not_mov_a_r0,
      op2(c, eq, var(opcode), const(mov_a_r1),
      if(c, op_mov_a_r1, not_mov_a_r1))).
...

% bytecode implementations
block(op_jump_if_a,
      op2(c, eq, var(a), const(0),
      op2(target, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      if(c, bytecode_loop, op_jump_if_a_jump))))).
block(op_jump_if_a_jump,
      op1(pc, same, var(target),
      promote(bytecode, bytecode_loop))).
block(op_mov_a_r0,
      op1(r0, same, var(a), jump(bytecode_loop))).
...

The bytecode_loop block is the main dispatch loop. It reads an opcode out of the bytecode list at the program counter position, then has a long series of if statements that compares the current opcode to the various existing opcodes. The full code of the interpreter can be found under the link above.

The bytecodes of the interpreter don't really permit hugely complex programs, but they can be used to write a program that computes the square of a number, as follows:

mov_a_r0     # r0 = a
mov_a_r1     # r1 = a
# 2:
mov_r0_a     # r0--
decr_a
mov_a_r0
mov_r2_a     # r2 += a
add_r1_to_a
mov_a_r2
mov_r0_a     # if r0!=0: goto 2
jump_if_a 2
mov_r2_a     # return r2
return_a

Partially Evaluating the Bytecode Interpreter

The partial evaluator from the first blog post can be easily used to partially evaluate the bytecode interpreter. The static input is the bytecode for computing the square and the initial program counter value, as given above. The dynamic input is the content of the accumulator (the number to be squared). This can be done as follows:

?- bytecode_square(B),
Env = [bytecode/B, pc/0],
do_pe(bytecode_loop, Env, Label),
REnv = [a/16, r0/0, r1/0, r2/0],
interp(jump(Label), REnv), listing(block).
256
:- dynamic block/2.

<lots of generated code>

The code that is generated by the partial evaluation process is somewhat hard to read. It contains a lot of passages like this:

...
block(op_return_a1, print_and_stop(var(a))).
block(not_decr_a1, jump(op_return_a1)).
block(not_add_r2_to_a2, jump(not_decr_a1)).
block(not_add_r1_to_a2, jump(not_add_r2_to_a2)).
block(not_add_r0_to_a3, jump(not_add_r1_to_a2)).
block(not_mov_r2_a3, jump(not_add_r0_to_a3)).
block(not_mov_r1_a5, jump(not_mov_r2_a3)).
block(not_mov_r0_a5, jump(not_mov_r1_a5)).
block(not_mov_a_r27, jump(not_mov_r0_a5)).
block(not_mov_a_r18, jump(not_mov_a_r27)).
block(not_mov_a_r09, jump(not_mov_a_r18)).
block(not_jump_if_a11, jump(not_mov_a_r09)).
block(bytecode_loop12, jump(not_jump_if_a11)).
block(op_mov_r2_a2, op1(a, same, var(r2), jump(bytecode_loop12))).
...

I.e. lots of blocks that do nothing but jump to another block, interspersed with some blocks that contain an actual operation. I cleaned the output up manually and got something like the following (this sort of cleanup is something a good partial evaluation system would do itself, after partial evaluation has occurred):

block(bytecode_loop1,
    op1(r0, same, var(a),
    op1(r1, same, var(a),
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))))).

block(bytecode_loop11,
    op1(a, same, var(r2),
    print_and_stop(var(a))).

block(op_jump_if_a_jump1,
    op1(a, same, var(r0),
    op2(a, sub, var(a), const(1),
    op1(r0, same, var(a),
    op1(a, same, var(r2),
    op2(a, add, var(a), var(r1),
    op1(r2, same, var(a),
    op1(a, same, var(r0),
    op2(c, eq, var(a), const(0),
    if(c, bytecode_loop11, op_jump_if_a_jump1)))))))))).

What do we see here? The partial evaluator has generated a block bytecode_loop1, which corresponds to the initialization opcodes mov_a_r0 and mov_a_r1 together with one iteration of the loop. Then it either jumps to a copy of the main loop (label op_jump_if_a_jump1) or to block bytecode_loop11, which prints the result and then stops. The residual code does exactly what the bytecode did: It squares the accumulator then prints that. All the uses of the bytecode and pc variable are gone.

Why did the partial evaluator produce two copies of the main loop that look the same? The reason for that is that in the second copy, the additional static information target = 2 is known, where target is a variable in the interpreter source that stores the jump target for very brief periods of time. This additional static information does not have any effect on the residual code, so the same code is uselessly generated twice. This is an example of overspecialization.

Tracing the Interpreter

In this section we will look at what happens if we try to trace the interpreter. The naive way of doing that yields traces that are not very useful, because they abort after one iteration. We will look at a way of avoiding this problem. The problems described in this section are at the core of the paper Tracing the meta-level: PyPy's tracing JIT compiler (that paper uses a slightly more advanced version of the bytecode interpreter as an example).

To trace the interpreter, it is useful to change the bytecode_loop block from above to always promote the bytecode and the pc variables, because without knowing them the trace produced is not really interesting. This is similar to making these variables static in the partial evaluation example above:

block(bytecode_loop,
      promote(bytecode, bytecode_loop_promote_bytecode)).
block(bytecode_loop_promote_bytecode,
      promote(pc, bytecode_loop_promote_pc)).
block(bytecode_loop_promote_pc,
      op2(opcode, readlist, var(bytecode), var(pc),
      op2(pc, add, var(pc), const(1),
      op2(c, eq, var(opcode), const(jump_if_a),
      if(c, op_jump_if_a, not_jump_if_a))))).
...

The rest of the interpreter stays unchanged.

To trace the interpreter we would start naively at the bytecode_loop label, because that's the label in the interpreter that is jumped to most often (which a profiler could establish easily). The following command can be used for that (this output prints traces in a slightly more readable way than in previous blog posts):

?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/2, a/A, r0/A, r1/A, r2/0],
        do_trace(bytecode_loop, Env).
trace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  loop

opttrace
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(3))
  op1(opcode,same,const(mov_r0_a))
  op1(c,same,const(1))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/2, a/16, r0/16, r1/16, r2/0]

These traces are very short. They start with promoting the bytecode and the pc, followed by the execution of the opcode mov_r0_a, which is the one at position 2 in the given bytecode. Then they increment the pc and loop back to the beginning. Looking at the optimized trace, it is clear that the trace is essentially useless. It will run only for one iteration, because in the second iteration the pc is 3, thus the guard_value at the beginning will fail.

This problem can be solved by tracing more than just one iteration of the bytecode dispatch loop, which is called meta-tracing. To get this behaviour, in this simple example it is enough to start (and thus end) tracing at a different label, op_jump_if_a_jump. This label is hit when the interpreter executes a jump_if_a bytecode and the jump is taken. A loop at the level of the executed bytecode program contains one such backward jump. Thus, by tracing from this label, a full loop of the bytecode program is traced, containing potentially many iterations of the bytecode dispatch loop of the control flow graph language.
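Judging from the first two operations of the trace below, the op_jump_if_a_jump block presumably looks roughly like the following (a reconstruction from the trace, not a quote of the interpreter source): it copies the jump target into the pc and then promotes the bytecode once more, presumably so that a trace starting at this label immediately knows the concrete bytecode:

block(op_jump_if_a_jump,
      op1(pc, same, var(target),
      promote(bytecode, bytecode_loop))).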

Doing that yields the following:

?- bytecode_square(B),
        A = 16, Env = [bytecode/B, pc/11, a/A, r0/A, r1/A, r2/0, target/2],
        do_trace(op_jump_if_a_jump, Env).
trace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,2,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_false(c,[],op_jump_if_a)
  op2(c,eq,var(opcode),const(mov_a_r0))
  guard_false(c,[],op_mov_a_r0)
  op2(c,eq,var(opcode),const(mov_a_r1))
  guard_false(c,[],op_mov_a_r1)
  op2(c,eq,var(opcode),const(mov_a_r2))
  guard_false(c,[],op_mov_a_r2)
  op2(c,eq,var(opcode),const(mov_r0_a))
  guard_true(c,[],not_mov_r0_a)
  op1(a,same,var(r0))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,3,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  ...
  lots of operations omitted
  ...
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop_promote_bytecode)
  guard_value(pc,9,[],bytecode_loop_promote_pc)
  op2(opcode,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  op2(c,eq,var(opcode),const(jump_if_a))
  guard_true(c,[],not_jump_if_a)
  op2(c,eq,var(a),const(0))
  op2(target,readlist,var(bytecode),var(pc))
  op2(pc,add,var(pc),const(1))
  guard_false(c,[],bytecode_loop)
  loop

opttrace
  op1(pc,same,var(target))
  guard_value(bytecode,[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],[],bytecode_loop)
  guard_value(pc,2,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]],bytecode_loop_promote_pc)
  op1(a,same,var(r0))
  op2(a,sub,var(a),const(1))
  op1(r0,same,var(a))
  op1(a,same,var(r2))
  op2(a,add,var(a),var(r1))
  op1(r2,same,var(a))
  op1(a,same,var(r0))
  op2(c,eq,var(a),const(0))
  guard_false(c,[bytecode/[mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a],pc/11,opcode/jump_if_a,target/2],bytecode_loop)
  op1(bytecode,same,const([mov_a_r0,mov_a_r1,mov_r0_a,decr_a,mov_a_r0,mov_r2_a,add_r1_to_a,mov_a_r2,mov_r0_a,jump_if_a,2,mov_r2_a,return_a]))
  op1(pc,same,const(11))
  op1(opcode,same,const(jump_if_a))
  op1(target,same,const(2))
  op1(c,same,const(0))
  loop

256
B = [mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a, mov_a_r2, mov_r0_a|...],
A = 16,
Env = [bytecode/[mov_a_r0, mov_a_r1, mov_r0_a, decr_a, mov_a_r0, mov_r2_a, add_r1_to_a|...], pc/11, a/16, r0/16, r1/16, r2/0, target/2] .

That looks better. The trace corresponds to the interpreter running all the bytecodes in the loop of the squaring function in the example bytecode above. The optimized code starts with two guards (checking that the bytecode is still the one for the squaring function, and that the pc is 2) and then performs only the operations that actually do the computation. No bytecode dispatching occurs, so the interpretation overhead is fully removed, apart from the two guard_value operations at the beginning.

Many of the assignments in the trace are still superfluous, e.g. all the copying back and forth between the registers r0, r1, r2 and the accumulator a. These could easily be removed by a more intelligent optimizer, for example one that works on an SSA form of the trace.
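To illustrate, the computational core of the optimized trace above (the eight operations between the pc guard and the final guard_false) could, after copy propagation on an SSA form, collapse to roughly the following three operations. This is a hand-derived sketch in the same trace notation; a1, r2_1 and c1 are fresh SSA names invented here, with a1 standing for the new value shared by a and r0, and r2_1 for the new value of r2:

  op2(a1, sub, var(r0), const(1))     % decr_a
  op2(r2_1, add, var(r2), var(r1))    % mov_r2_a, add_r1_to_a, mov_a_r2 combined
  op2(c1, eq, var(a1), const(0))      % the jump_if_a condition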

Conclusion About the Interpreter

Both partial evaluation and meta-tracing can be used to transform the example bytecode computing a square into a form that shows the essential computation that is going on, without the interpretation overhead. The naive partial evaluator produces lots of extra blocks that simply jump around, which could be removed with a post-processing step. The tracer by itself produces uselessly short traces, but with the simple trick of starting the trace at a different label the results become a lot better.

In a real meta-tracing system, the meta-tracer would need a way for the author of the interpreter to mark which bytecodes correspond to backward jumps. It would also need better integration with the interpreter to start tracing automatically, as well as to cache the traces. Additionally, it would have to deal better with guards that fail a lot, by attaching new traces to the failing guards. However, all of that is "just" engineering on top of the ideas presented in this series of blog posts.
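For the "start tracing automatically" part, one common scheme is a hotness counter per trace anchor. A minimal sketch in the style of this post's Prolog could look like the following (hypothetical: the hotness/2 fact, maybe_trace/3 and the threshold are made up for illustration; do_trace/2 is the predicate used above):

:- dynamic hotness/2.

% count how often an anchor label is reached; once it is hot
% enough, start tracing there with do_trace/2.
maybe_trace(Label, Env, Threshold) :-
    (   retract(hotness(Label, N))
    ->  N1 is N + 1
    ;   N1 = 1
    ),
    assertz(hotness(Label, N1)),
    (   N1 >= Threshold
    ->  do_trace(Label, Env)
    ;   true    % not hot yet: keep interpreting normally (elided)
    ).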

High-Level Conclusion

Some concluding high-level thoughts about the similarities of tracing and partial evaluation: tracing and partial evaluation tackle a similar problem, that of automatically removing the interpretation overhead, but their approaches are slightly different.

Tracing stays very close to normal evaluation, merely recording some extra information in the process. The optimizer that is then applied to a trace is, in turn, very similar in structure to a partial evaluator. Its task is much simpler, though, because it does not need to deal with control flow at all, just with a linear list of operations.

So, in a sense, tracing takes the part of partial evaluation that works well (evaluate what you can, and residualize the rest) and replaces the part that doesn't (controlling unfolding) with a much more pragmatic mechanism. That mechanism observes actual execution runs of the program to choose the control flow paths that are typical. At the same time, the tracer's focus is on loops, because they are where most programs spend significant amounts of time.

Another way to view tracing is as a form of partial evaluation that replaces the control components of a partial evaluator with an oracle (the actual execution runs) that provides the information about which paths to look at.

The effects of this are already visible in the quite trivial interpreter used here. The simple partial evaluator over-specializes the loop, producing two versions of it that are in fact identical. The tracer doesn't, and it also generates code only for the loop itself, not for the initialization opcodes.

That's it for this series. To those who made it this far, thanks for following along. Thanks also to Samuele and Sven, who consistently gave me good feedback on the posts before I put them up here.