Good day everyone.
Recent round of optimizations, especially loop invariant code motion has been very good for small to medium examples. There is work ongoing to make them scale to larger ones, however there are few examples worth showing how well they perform. This one following example, besides getting benefits from loop invariants, also shows a difference between static and dynamic compilation. In fact, after applying all the optimizations C does, only a JIT can use the extra bit of runtime information to run even faster.
The example is as follows. First Python. I create two files, x.py:
def add(a, b): return a + b
And y.py:
from x import add
def main():
i = 0
a = 0.0
while i < 1000000000:
a += 1.0
add(a, a)
i += 1
main()
For C, x.c:
double add(double a, double b)
{
return a + b;
}
and y.c:
double add(double a, double b);
int main()
{
int i = 0;
double a = 0;
while (i < 1000000000) {
a += 1.0;
add(a, a);
i++;
}
}
Results?
- 1.97s - PyPy
- 3.07s - C
- PyPy trunk (386ed41eae0c), running pypy-c y.py
- C - gcc -O3 (GCC 4.4.5 shipped with Ubuntu Maverick)
Hence, PyPy 50% faster than C on this carefully crafted example. The reason is obvious - static compiler can't inline across file boundaries. In C, you can somehow circumvent that, however, it wouldn't anyway work with shared libraries. In Python however, even when the whole import system is completely dynamic, the JIT can dynamically find out what can be inlined. That example would work equally well for Java and other decent JITs, it's however good to see we work in the same space :-)
Cheers,
fijal
EDIT: Updated GCC version
15 Comments:
> The reason is obvious - static compiler can't inline across file boundaries.
That's what link-time optimizations are for, which where added to GCC in 2009; however, your point concerning shared libaries is valid...
I added a printf("%f\n",a) to the end of the file so the compiler wouldn't optimize the whole thing away. On my Cure 2 Duo 2.33Ghz, I got for gcc -O3:
1000000000.000000
real 0m4.396s
user 0m4.386s
sys 0m0.007s
and for gcc -O3 -flto -fwhole-program:
1000000000.000000
real 0m1.312s
user 0m1.308s
sys 0m0.003s
Great work!
Now you just have to identify and remove dead code in your jit. Then you could remove the call to 'add' altogether.
In this strange example, in our JIT, the call to 'add' is indeed removed because of inlining, and then the addition that occurs in there is removed because of dead code elimination.
@Zeev yes, but C equivalent of Python import is indeed shared libraries, where -fwhole-program no longer works.
@Armin note that even when the result is accumulated (addition is not removed, although the call is still inlined), PyPy is still faster. Not as much though: 2.5s vs 3.0s
For completeness's sake, what's the output of `gcc --version` in your example?
Not to mention specialization: python's (and pypy's) add() can add pretty much anything - strings if you will.
The JIT will inline a specialized version particular to the call site, whereas C can only apply generalized optimizations.
Everyone knows Python runs faster than C...
By about 6 weeks.
There's another simple case where pypy could (in principle) do very much better than standard C: turn pow(x, i) into sqrt(x*x*x) if i == 3/2, and other reductions. In practice if you don't know what i is at compiletime you often bundle the simplifications into a function (at the cost of some ifs) but a JIT could do a very nice job on this automagically whenever i is fixed, which it usually is.
You wrote: "PyPy 50% faster than C on this carefully crafted example".
The truth is: PyPy is 35% faster than the C code (using C as the baseline), because it completes in 65% of the time required by the C version.
The C code takes 50% more time to execute (is slower by 50%, 1.5x slower) than the PyPy code (using PyPy as the baseline).
Test with gcc (Debian 20110126-0ubuntu1) 4.6.0 20110126 (experimental) [trunk revision 169285]: "/usr/lib/gcc-snapshot/bin/gcc [OPTIONS] x.c y.c -o x && time ./x". OPTIONS=-O0: 10.1s; OPTIONS=-O3: 9.1s; OPTIONS=-O3 -flto: 0.002s. Woops, 0.002 second? I checked: the result is correct :-) LTO rocks!
@haypo print the result so the loop don't get removed as dead code. Besides, the problem is really the fact that's -flto is unfair since python imports more resemble shared libraries than statically-compiled files.
In general, if you want to compare the performance of languages, you're actually supposed to try to write the *fastest* implementation in each language. Not just some arbitrary one.
In this example, the program has no output, so both implementations are crap and could be made a lot faster.
Come up with a program that has testable output, and see if someone can't comment with a C program that's faster than your python.
RIDICULOUS!
Post a Comment