Saturday, January 29, 2011

A JIT Backend for ARM Processors

Over the past few months, I have been developing the ARM backend for the PyPy JIT as part of my master's thesis, in the arm-backend branch. It is still a work in progress: all integer and object operations are working, and support for floating point is under development.
ARM processors are very widely used, being deployed in servers, some netbooks and mainly mobile devices such as phones and tablets. One of our goals is to be able to run PyPy on phones, especially on Android. It is not yet possible to translate and compile PyPy for Android automatically, but there has been some work on using Android's NDK to compile PyPy's generated C code.
The JIT backend targets the application profile of the ARMv7 instruction set architecture, which is found, for example, in the Cortex-A8 processors used in many Android-powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and test the backend, we are using a BeagleBoard-xM, which has a 1 GHz ARM Cortex-A8 and 512 MB of RAM and runs the ARM port of Ubuntu 10.10.
Currently, on Linux, it is possible to translate and cross-compile PyPy's Python interpreter, as well as other interpreters, with the ARM JIT backend enabled, using Scratchbox 2 to provide a build environment and the GNU ARM cross-compilation toolchain. So far the backend only supports the Boehm garbage collector, which does not produce the best results when combined with the JIT. We plan to add support for the other GCs in the future, which should increase the performance of PyPy on ARM.
While we are still debugging the last issues with the backend, we can already run some simple benchmarks on Pyrolog, a Prolog interpreter written in RPython. Even with Boehm as the GC, the results look very promising. In the benchmarks we compare Pyrolog to SWI-Prolog, a Prolog interpreter written in C, which is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.
Benchmark           SWI-Prolog in ms   Pyrolog in ms   Speedup
iterate             60.0               6.0             10.0
iterate_assert      130.0              6.0             21.67
iterate_call        3310.0             5.0             662.0
iterate_cut         60.0               359.0           0.16713
iterate_exception   4950.0             346.0           14.306
iterate_failure     400.0              127.0           3.1496
iterate_findall     740.0              No res.
iterate_if          140.0              6.0             23.333
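The Speedup column is the SWI-Prolog time divided by the Pyrolog time, so values below 1 (as for iterate_cut) mean Pyrolog was slower. As a minimal sketch, the column can be recomputed from the timings above in a few lines of Python. The names `timings` and `speedup` are only illustrative and are not part of the pyrolog-bench repository.

```python
# Timings in milliseconds, copied from the table above as
# (SWI-Prolog, Pyrolog). None marks the benchmark for which
# Pyrolog produced no result.
timings = {
    "iterate": (60.0, 6.0),
    "iterate_assert": (130.0, 6.0),
    "iterate_call": (3310.0, 5.0),
    "iterate_cut": (60.0, 359.0),
    "iterate_exception": (4950.0, 346.0),
    "iterate_failure": (400.0, 127.0),
    "iterate_findall": (740.0, None),
    "iterate_if": (140.0, 6.0),
}


def speedup(swi_ms, pyrolog_ms):
    """Return the SWI-Prolog time divided by the Pyrolog time.

    Returns None when there is no Pyrolog result to compare against.
    """
    if pyrolog_ms is None:
        return None
    return swi_ms / pyrolog_ms


for name, (swi_ms, pyrolog_ms) in timings.items():
    ratio = speedup(swi_ms, pyrolog_ms)
    print(name, "no result" if ratio is None else round(ratio, 3))
```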
The iterate_call benchmark, which constructs a predicate and calls it at runtime, shows a speedup of 662 times over SWI-Prolog and is an example where the JIT can show its strength. The Pyrolog interpreter and the JIT treat dynamically defined predicates the same way as static ones and can generate optimized code in both cases, whereas SWI only compiles statically defined rules and has to fall back to interpretation for dynamic ones.
For simple benchmarks running on PyPy's Python interpreter we see some speedups over CPython, but we still need to debug the backend a bit more before we can show numbers on more complex benchmarks. So, stay tuned.

10 comments:

mwhudson said...

Awesome stuff. I have a panda board and another xm that's usually not doing much if you want to borrow some cycles :-)

When you support floats will you be aiming for hard float? It's the way of the future, I hear...

dbrodie said...

I am curious if you had any use for ThumbEE (or Jazelle RCT) to speed up?

David Schneider said...

@mwhudson: thanks it would be great to be able to test on more hardware.

For the float support we still need to investigate a bit, but if possible I would like to target hard floats.

@dbrodie: currently we are targeting the ARM state, so not at the moment.

Martijn Faassen said...

One would imagine conserving memory would be an important factor on mobile devices. Even though mobile devices have a growing amount of memory available, it will still be less than desktops for the foreseeable future. Memory pressure can create real slowdowns.

A JIT normally takes more memory, but on the other hand PyPy offers features to reduce usage of memory. Could you share some of your thinking on this?

Armin Rigo said...

Martijn: you are describing the situation as well as we (at least I) know it so far: while PyPy has in many cases a lower non-JIT memory usage, the JIT adds some overhead. But it seems to be within ~200MB on "pypy translate.py", which is kind of the extreme example in hugeness. So already on today's high-end boards with 1GB of RAM, it should easily fit. Moreover it can be tweaked, e.g. it's probably better on these systems to increase the threshold at which JITting starts (which also reduces the number of JITted code paths). So I think that the possibility is real.

Dan said...

Showing speedups over repetitive instructions (which caching & JIT are really good at) is irrelevant.

What happens when people use real benchmarks, like constraint-based solvers and non-iterative stuff (maybe take a look at the other benchmarks) ...

Prolog is a declarative language, not a sysadmin scripting language.

Also, the SWI implementation adds so many functionalities, it's like making a «Extract chars from an RDBMS vs Text files» benchmark.

Carl Friedrich Bolz said...

@Dan

Why are you so defensive? This benchmark is clearly not about how fast Pyrolog is, but how the ARM JIT backend performs, using trivial Prolog microbenchmarks, with SWI to give a number to compare against.

Pyrolog is a minimal Prolog implementation that is (at least so far) mostly an experiment to see how well PyPy's JIT technology can do on a non-imperative language. This paper contains more interesting benchmarks:

http://portal.acm.org/citation.cfm?id=1836102

jamu said...

Hi,
Is there a way to cross compile on a host machine (but not with scratch box) where I have tool chain and file system for the target?

Any instructions for building with arm back-end?

Cheers

David Schneider said...

@jamu: scratchbox 2 is currently the only option to cross-translate pypy for ARM. You can find some documentation about the cross translation at https://bitbucket.org/pypy/pypy/src/arm-backend-2/pypy/doc/arm.rst

vak said...

Sounds very cool, are there any updates?