Tuesday, October 14, 2008

Sprint Discussions: C++ Library Bindings

At the beginning of this year, PyPy grew ctypes support, thanks to generous support by Google. This made it possible to interface with C libraries from our Python interpreter, something that was possible but rather tedious before. What we are lacking so far is a way to interface to large C++ libraries (like GUI libraries). During the sprint we had a brainstorming session about possible approaches for fixing this shortcoming.

For CPython there are a number of approaches in common use:

Those all have the property that they produce some code that is then compiled with a compiler to produce a CPython extension. The produced code also uses functions from CPython's C-API. This model is not simple to use for PyPy in its current state. Since PyPy generates C code automatically, a fixed C-level API does not exist (it is not unlikely that at one point in the future we might have to provide one, but not yet). At the moment, PyPy very much has a "Don't call us, we call you"-approach.

A very different approach is followed by the Reflex package, which is developed at CERN (which has an incredible amount of C++ libraries). It is not mainly intended for writing Python bindings for C++ libraries but instead provides reflection capabilities for C++. The idea is that for every C++ shared library, an additional shared library is produced, which allows together with Reflex to introspect properties of C++ classes, methods, etc. at runtime. These facilities are then used for writing a small generic CPython extension module, that allows CPython to use any C++ library for which this reflection information was generated.

This approach is a bit similar to the ctypes module, apart from the fact that ctypes does not use any reflection information, but the user has to specify the data structures that occur in the C code herself. This makes it sometimes rather burdensome to write cross-platform library bindings.

For PyPy the approach seems rather fitting: We would need to implement only the generic extension module and could then use any number of C++ libraries. Of course some more evaluation is needed (e.g. to find out whether there are any restrictions for the C++ code that the library can use and how bothersome it is to get this reflection information for a large library) but so far it seems promising.

11 comments:

Geoffrey Irving said...

I've done a fair amount of complicated Boost.Python wrapping, and also implemented a small replacement for it with most of the complexity removed. There are two main reasons why Boost.Python is so complicated:

1. It supports arbitrarily complex memory and sharing semantics on the C++ classes (and is runtime polymorphic on how the memory of wrapped objects is managed).

2. It supports arbitrary overloading of C++ functions.

If you remove those two generality requirements (by requiring that wrapped C++ objects are also PyObjects and banning overloading), it's possible to write very lightweight C++ bindings. Therefore, I think it's critical to factor the C/C++ API design so that as much of it as possible is writable in application level python on top of a small core that does the final C++ dispatch.

For example, if you wrap a C++ vector class with a bunch of overloads of operator+ in Boost.Python, each call to __add__ has to do a runtime search through all the overloads asking whether each one matches the arguments passed. Each such check does a runtime search through a table of converters. It would a terrible shame if that overhead isn't stripped by the JIT, which means it has to be in python.

I think a good test library for thinking about these issues is numpy, since it has some memory management complexity as well as internal overloading.

I could go on, but it'd probably be better to do that via email. :)

Geoffrey

illume said...

Please add a C API :)

Once that is done, it's lots easier to interface with the outside world.

For a lot of C++ apis I find it easy enough to write a C api on top of it.

In fact many C++ apis provide a C API. Since that makes it easier to work with different C++ compilers. As you probably know different C++ compilers mangle things differently.

It is possible to look at C++ code at runtime. You just need to be able to interpret the C++ symbols. I know someone did a prototype of this for vc6 on windows. He parsed the symbols, and then created the functions at run time with ctypes. However the approach is not portible between platforms, compilers, or even different versions of compilers. Of course this didn't allow you to use many of the C++ features, but only some.

If you look at how swig works, you will see it kind of generates a C API for many C++ things.


For libraries, it is custom to provide a C API. It just makes things easier.

bins said...

you might want to look at PyRoot [1,2,3] which is using the Reflex library to automatically wrap (and pythonize) the C++ libraries/types for which a Reflex dictionary has been (beforehand) generated.

theoretically any piece of C++ can be wrapped as Reflex is using gccxml[4] to extract informations from a library and to generat the dictionary library.

Using it in one of CERN's LHC experiment which makes heavy (ab)use of templates (Boost) I can say that we almost had basically no problem.
Usually the only problems we got were either at the gccxml level (resolution of typedef, default template arguments,...) or at the gccxml-to-reflex level (mainly naming conventions problems interfering with the autoloading of types at runtime)

Being a client of gccxml is a rather annoying as the development is... opaque.

I know the Reflex guys were investigating at some point to migrate to an LLVM version (with GCC as a frontend) to replace gccxml.

[1] http://wlav.web.cern.ch/wlav/pyroot/
[2] http://root.cern.ch/viewcvs/trunk/bindings/pyroot/
[3] http://www.scipy.org/PyRoot
[4] http://www.gccxml.org/

Richard Boulton said...

There's been some (small) discussion in the SWIG project of making an alternative output method which creates a simple C API for a C++ project, and wraps that with ctypes (generating the python side of the ctypes bindings, too). So far, this is purely theoretical, but all the pieces needed to do it are present in the SWIG source code. If reflex doesn't work out, this might be a reasonable alternative approach.

Maciej Fijalkowski said...

Wow. A lot of very informative posts. We'll definitely look to evaluate more what you all posted. Also, in case you want to discuss more, mailing list is usually better place for discussions. Feel free to send new ideas or more detailed info there.

Cheers,
fijal

Anonymous said...

fyi check out Elsa ( http://www.cubewano.org/oink ). It is much better than Reflex.

Carl Friedrich Bolz said...

illume: Adding a C-API is rather hard, and probably not on our todo list, unless somebody pays for it :-).

anonymous: From a quick glance I am not sure Elsa would really help. Yes, you can get use it to parse c++ headers and get information about it. But as far as I see it, you cannot use it to create shared libraries that can be used to dynamically construct classes and dynamically call methods on them. Besides, the idea is to have a solution that works on both CPython and PyPy. Reflex already has a way to bind C++ libraries to CPython, so we only need to do the PyPy part.

Anyway, if anybody is interested in more detailed discussions, we should all move to pypy-dev.

Roman Yakovenko said...

I also have done a fair amount of Boost.Python wrapping. I even created a code generator for it - Py++( www.language-binding.net).

The problem you want to solve is very complex. Exposing C++ code, as is, is not enough. You will have to create "bridge" between different concepts.

For example C++ and Python iterators. In C++, in order to get the iterator value, you have to dereference it. In Python, you just have it( value ).

This is just a single example, I have others.

If you continue with the project - I would like to be involved.

bins said...

Roman,

I haven't looked at the code of Boost.Python since a long time, but the way "we" do the pythonization of the iteration over STL sequences is rather simple.

when one writes:
foos = std.vector('FooKlass')()
# fill foos
# iterate
for foo in foos:
print foo.value()

what the PyROOT/Reflex layer is doing is looking at the dictionary for the std::vector(FooKlass), discovering that there is a pair of functions 'begin' and 'end' and it figures out one can create a python iterator from that pair.

anyways, as Maciej pointed it out, we could try to move this discussion here[1] or there[2]...

cheers,
sebastien.

[1] http://codespeak.net/pipermail/pypy-dev/2008q4/004847.html

[2] http://codespeak.net/pipermail/pypy-dev/2008q4/004843.html

Anonymous said...

Just to let you know, there is an upcoming paper in PythonPapers.org review on the topic of C++ wrapping in Python. Just watch out!

costantino said...

maybe it's worth also evaluate the library xrtti, a Reflex comparable library but without CERN and ROOT dependencies.