Good day everyone.
Recent round of optimizations, especially loop invariant code motion has been very good for small to medium examples. There is work ongoing to make them scale to larger ones, however there are few examples worth showing how well they perform. This one following example, besides getting benefits from loop invariants, also shows a difference between static and dynamic compilation. In fact, after applying all the optimizations C does, only a JIT can use the extra bit of runtime information to run even faster.
The example is as follows. First Python. I create two files, x.py:
def add(a, b): return a + b
And y.py:
from x import add def main(): i = 0 a = 0.0 while i < 1000000000: a += 1.0 add(a, a) i += 1 main()
For C, x.c:
double add(double a, double b) { return a + b; }
and y.c:
double add(double a, double b); int main() { int i = 0; double a = 0; while (i < 1000000000) { a += 1.0; add(a, a); i++; } }
Results?
- 1.97s - PyPy
- 3.07s - C
- PyPy trunk (386ed41eae0c), running pypy-c y.py
- C - gcc -O3 (GCC 4.4.5 shipped with Ubuntu Maverick)
Hence, PyPy 50% faster than C on this carefully crafted example. The reason is obvious - static compiler can't inline across file boundaries. In C, you can somehow circumvent that, however, it wouldn't anyway work with shared libraries. In Python however, even when the whole import system is completely dynamic, the JIT can dynamically find out what can be inlined. That example would work equally well for Java and other decent JITs, it's however good to see we work in the same space :-)
Cheers,
fijal
EDIT: Updated GCC version
> The reason is obvious - static compiler can't inline across file boundaries.
ReplyDeleteThat's what link-time optimizations are for, which where added to GCC in 2009; however, your point concerning shared libaries is valid...
I added a printf("%f\n",a) to the end of the file so the compiler wouldn't optimize the whole thing away. On my Cure 2 Duo 2.33Ghz, I got for gcc -O3:
ReplyDelete1000000000.000000
real 0m4.396s
user 0m4.386s
sys 0m0.007s
and for gcc -O3 -flto -fwhole-program:
1000000000.000000
real 0m1.312s
user 0m1.308s
sys 0m0.003s
Great work!
ReplyDeleteNow you just have to identify and remove dead code in your jit. Then you could remove the call to 'add' altogether.
In this strange example, in our JIT, the call to 'add' is indeed removed because of inlining, and then the addition that occurs in there is removed because of dead code elimination.
ReplyDelete@Zeev yes, but C equivalent of Python import is indeed shared libraries, where -fwhole-program no longer works.
ReplyDelete@Armin note that even when the result is accumulated (addition is not removed, although the call is still inlined), PyPy is still faster. Not as much though: 2.5s vs 3.0s
ReplyDeleteFor completeness's sake, what's the output of `gcc --version` in your example?
ReplyDeleteNot to mention specialization: python's (and pypy's) add() can add pretty much anything - strings if you will.
ReplyDeleteThe JIT will inline a specialized version particular to the call site, whereas C can only apply generalized optimizations.
Everyone knows Python runs faster than C...
ReplyDeleteBy about 6 weeks.
There's another simple case where pypy could (in principle) do very much better than standard C: turn pow(x, i) into sqrt(x*x*x) if i == 3/2, and other reductions. In practice if you don't know what i is at compiletime you often bundle the simplifications into a function (at the cost of some ifs) but a JIT could do a very nice job on this automagically whenever i is fixed, which it usually is.
ReplyDeleteYou wrote: "PyPy 50% faster than C on this carefully crafted example".
ReplyDeleteThe truth is: PyPy is 35% faster than the C code (using C as the baseline), because it completes in 65% of the time required by the C version.
The C code takes 50% more time to execute (is slower by 50%, 1.5x slower) than the PyPy code (using PyPy as the baseline).
Test with gcc (Debian 20110126-0ubuntu1) 4.6.0 20110126 (experimental) [trunk revision 169285]: "/usr/lib/gcc-snapshot/bin/gcc [OPTIONS] x.c y.c -o x && time ./x". OPTIONS=-O0: 10.1s; OPTIONS=-O3: 9.1s; OPTIONS=-O3 -flto: 0.002s. Woops, 0.002 second? I checked: the result is correct :-) LTO rocks!
ReplyDelete@haypo print the result so the loop don't get removed as dead code. Besides, the problem is really the fact that's -flto is unfair since python imports more resemble shared libraries than statically-compiled files.
ReplyDeleteIn general, if you want to compare the performance of languages, you're actually supposed to try to write the *fastest* implementation in each language. Not just some arbitrary one.
ReplyDeleteIn this example, the program has no output, so both implementations are crap and could be made a lot faster.
Come up with a program that has testable output, and see if someone can't comment with a C program that's faster than your python.
RIDICULOUS!
ReplyDeletePypy isn't faster than C, even on this example for multiple reasons:
ReplyDeleteFirst it's conceptual: C is almost as optimized as assembly (it's often referred to as a super assembler) so even if Pypy ends-up generating some assembly code, it has first to evaluate the runtime environment to figure out the type of variables and emit assembly code, and all this process is not free... so Pypy can only asymptotically reach the same level as C and assembly.
Second, the test is flawed: I did a slight modification that shouldn't change the results: I've inlined the add() in both python and C. Oh! surprise: Pypy keeps the same time whereas C is 4x faster than before (without inlining).
So to make it fair, we need to use the best capabilities of both languages:
- python: I'm sure the author provided the best python implementation, and the fact that inlining add() doesn't change results kinda proves this)
- C: when you inline the function you get:
[code]
static inline double add_double(double a, double b) {
return a + b;
}
int main()
{
unsigned int i;
double a = 0.0;
for (i = 0; i < N; i++) {
a += 1.0;
add_double(a, a);
}
printf("%f\n", a);
}
[/code]
Results:
C inlined: 1.10s
C: 3.98s
Pypy inlined: 3.30s
Pypy: 3.28s
Conclusion:
- When using the right C code, on the same example C is 3 times faster than Pypy.
- As demonstrated, the statement that Pypy is faster than C is simply biased by a not optimizsed C code.
@Eric This post is not trying to argue that Python is "better" or even faster than C. It is just pointing out that certain classes of optimizations (i.e. whole program optimizations) come naturally to the PyPy JIT.
ReplyDeleteThis is, of course, only one small facet of why a program runs fast. The author admits that it is a contrived example to illustrate the point.
Taking the point to an extreme, one could see a PyPy program run faster than a C program if the C program made many calls to simple shared libraries. For example, if one dynamically links a C stdlib into their program, and uses it heavily, the equivalent python code may conceivably run faster.
Please read the title of this article again: "PyPy faster than C on a carefully crafted example"
ReplyDeleteBased on a specific example or not it doesn't matter, I'm simply not comfortable with reading strong statement like this that are obvioulsy false to any serious computer scientist and misleading to beginners. It's false because it's the conclusion of a test which is biased.
The root of benchmarking is to get rid of any bias
In this case the obvious bias is that Pypy is optimized and C isn't (as demonstrated above with inline functions).
You can't transpose only what you want in real life and not the other: your argument that in real life the C could use external library hence be slower is valid, but then you have to compare with real life Python scripts which can't be as much optimized by Pypy as this crafted example. So in real life you get a C code that may be slowed down a bit by dynamic linking, and python scripts that are much slower because Pypy isn't ready to match C speed for everything (yet).
If you want to use a crafted Python example, you have to compare it to a crafted C example, so that you can compare apples with apples.
All that is methodology, that said JIT is quite powerful and it's impressive in itself to beat CPython by a large margin.
Eric: Your comments about "real life" are irrelevant - the post is about a specific, contrived example. I don't think anyone would argue that a high-level, garbage-collected language like python could ever beat out C in general - it's simply a demonstration that, in a very specific instance, equivalent code in python and C can run faster in python because of the JIT making optimizations that can't occur at compile time.
ReplyDeleteYou're assuming that python is faster even on this crafted example, but keep in mind that this comparison is biased because the C version isn't optimal.
ReplyDeleteyou're assuming that python is faster even on this crafted example, but keep in mind that this comparison is biased because the C version isn't optimal.
ReplyDeletestop feeding this troll
ReplyDeletepoint taken, but do update the article to take into account my remark: both the title and the conclusion of the "demonstration" are false, even on a contrived example as you barely can't find any C code that would be slower than the code generated by your JIT for the simple reason that C is really too close to assembly and that JIT adds an overhead.
ReplyDeleteHey Eric.
ReplyDeleteYour argument is incredibly flawed. You can compile faster version of assembler (or is C the fastest assembler ever?) if you try hard enough. Why not?
Please don't digress, what I say is simple:
ReplyDeleteThe article states that Pypy generates code faster than C on a crafted example.
I demonstrate there is a more optimized C code that the author's one, hence that the whole article is wrong... end of the story.
No, it's a reasonable piece of C. You don't inline your printf code, do you? dynamic linking is a thing that people use.
ReplyDeleteYou're right, people very often use dynamic linking. However the following is not a reasonable piece of Python code:
ReplyDeletedef add(a, b): return a + b
People rarely use that and more importantly they don't write a loop that calls it 1 billion times.
The point is that the reasoning spans two levels (hence is flawed/biased):
- in Python the author took a crafted piece of Python that is not meaningful in real life because it has the property to do what he wants at the Pypy level
- in C the author uses a very common mechanism that isn't fully optimized (not as much as Python/Ppy is optimized).
I know you will not agree since you're all proud that "Pypy is faster than C" (lol it's nonsense even on a "crafted example") but you have to compare apples with apples.
@Eric what you don't understand is the point of the article. The actual point is to demonstrate a nice property of PyPy JIT, which is able to generate fast code when it can. Comparing to C in this manner proves that PyPy's generated machine code is relevant with regard to speed.
ReplyDeleteOf course this example is fragile because it relies on suboptimal C code, but this serves only to prove the point about PyPy.
@Eric... Non sense.. Are you a ambassador for C ?
ReplyDeleteDo argue if you disagree, don't troll.
ReplyDeleteI think everything have been said already anyway.