Friday, February 4, 2011

PyPy faster than C on a carefully crafted example

Good day everyone.

Recent round of optimizations, especially loop invariant code motion has been very good for small to medium examples. There is work ongoing to make them scale to larger ones, however there are few examples worth showing how well they perform. This one following example, besides getting benefits from loop invariants, also shows a difference between static and dynamic compilation. In fact, after applying all the optimizations C does, only a JIT can use the extra bit of runtime information to run even faster.

The example is as follows. First Python. I create two files, x.py:

def add(a, b):
  return a + b

And y.py:

from x import add

def main():
    i = 0
    a = 0.0
    while i < 1000000000:
        a += 1.0
        add(a, a)
        i += 1

main()

For C, x.c:

double add(double a, double b)
{
  return a + b;
}

and y.c:

double add(double a, double b);

int main()
{
  int i = 0;
  double a = 0;
  while (i < 1000000000) {
    a += 1.0;
    add(a, a);
    i++;
  }
}

Results?

  • 1.97s - PyPy
  • 3.07s - C
Compilation options:
  • PyPy trunk (386ed41eae0c), running pypy-c y.py
  • C - gcc -O3 (GCC 4.4.5 shipped with Ubuntu Maverick)

Hence, PyPy 50% faster than C on this carefully crafted example. The reason is obvious - static compiler can't inline across file boundaries. In C, you can somehow circumvent that, however, it wouldn't anyway work with shared libraries. In Python however, even when the whole import system is completely dynamic, the JIT can dynamically find out what can be inlined. That example would work equally well for Java and other decent JITs, it's however good to see we work in the same space :-)

Cheers,
fijal

EDIT: Updated GCC version

15 Comments:

Anonymous said...

> The reason is obvious - static compiler can't inline across file boundaries.

That's what link-time optimizations are for, which where added to GCC in 2009; however, your point concerning shared libaries is valid...

Zeev said...

I added a printf("%f\n",a) to the end of the file so the compiler wouldn't optimize the whole thing away. On my Cure 2 Duo 2.33Ghz, I got for gcc -O3:

1000000000.000000

real 0m4.396s
user 0m4.386s
sys 0m0.007s

and for gcc -O3 -flto -fwhole-program:


1000000000.000000

real 0m1.312s
user 0m1.308s
sys 0m0.003s

Anonymous said...

Great work!

Now you just have to identify and remove dead code in your jit. Then you could remove the call to 'add' altogether.

Armin Rigo said...

In this strange example, in our JIT, the call to 'add' is indeed removed because of inlining, and then the addition that occurs in there is removed because of dead code elimination.

Maciej Fijalkowski said...

@Zeev yes, but C equivalent of Python import is indeed shared libraries, where -fwhole-program no longer works.

Maciej Fijalkowski said...

@Armin note that even when the result is accumulated (addition is not removed, although the call is still inlined), PyPy is still faster. Not as much though: 2.5s vs 3.0s

Anonymous said...

For completeness's sake, what's the output of `gcc --version` in your example?

klauss said...

Not to mention specialization: python's (and pypy's) add() can add pretty much anything - strings if you will.

The JIT will inline a specialized version particular to the call site, whereas C can only apply generalized optimizations.

Greg said...

Everyone knows Python runs faster than C...

By about 6 weeks.

Anonymous said...

There's another simple case where pypy could (in principle) do very much better than standard C: turn pow(x, i) into sqrt(x*x*x) if i == 3/2, and other reductions. In practice if you don't know what i is at compiletime you often bundle the simplifications into a function (at the cost of some ifs) but a JIT could do a very nice job on this automagically whenever i is fixed, which it usually is.

Anonymous said...

You wrote: "PyPy 50% faster than C on this carefully crafted example".

The truth is: PyPy is 35% faster than the C code (using C as the baseline), because it completes in 65% of the time required by the C version.

The C code takes 50% more time to execute (is slower by 50%, 1.5x slower) than the PyPy code (using PyPy as the baseline).

haypo said...

Test with gcc (Debian 20110126-0ubuntu1) 4.6.0 20110126 (experimental) [trunk revision 169285]: "/usr/lib/gcc-snapshot/bin/gcc [OPTIONS] x.c y.c -o x && time ./x". OPTIONS=-O0: 10.1s; OPTIONS=-O3: 9.1s; OPTIONS=-O3 -flto: 0.002s. Woops, 0.002 second? I checked: the result is correct :-) LTO rocks!

Maciej Fijalkowski said...

@haypo print the result so the loop don't get removed as dead code. Besides, the problem is really the fact that's -flto is unfair since python imports more resemble shared libraries than statically-compiled files.

Anonymous said...

In general, if you want to compare the performance of languages, you're actually supposed to try to write the *fastest* implementation in each language. Not just some arbitrary one.

In this example, the program has no output, so both implementations are crap and could be made a lot faster.

Come up with a program that has testable output, and see if someone can't comment with a C program that's faster than your python.

Anonymous said...

RIDICULOUS!