Comments on PyPy Status Blog: PyPy is faster than C, again: string formatting

@Anonymous, "the C one doesn't print anyt...

2012-11-12T16:36:08.201+01:00

@Anonymous, "the C one doesn't print anything, either; sprintf just returns a string. printf is the one that prints." - Andrew Pendleton, this page, August 2, 2011 9:25 PM

So a loop that doesn't print in Python is comp...

2012-11-01T18:25:58.797+01:00

So a loop that doesn't print in Python is compared to a loop in C that does and that was compiled on one of the slowest C compilers out there.

YearOfTheLinuxDesktopIsAtHand(TM)

@⚛: I indeed do not have benchmarks for these clai...

2011-08-14T21:17:24.954+02:00

@⚛: I indeed do not have benchmarks for these claims, but GCC 4.6 indeed added some newer optimization techniques to its assortment. Maybe these may not have had a significant influence in said case but they might have somewhere else. I'm merely saying: you can't really compare the latest hot inventions with something that is surpassed (e.g. compare Java 7 to a program output by Visual C++ back form the VS 2003 IDE).

All by all, I'm not saying that Python sucks and don't want to sound like a fanboy (on the contrary, Linux uses a great deal of Python and if this could mean a major speedup, then why the hell not ;).

I guess I was pissed off because the written article sounds very much fanboyish and pro-Python (just look at the title alone).

@Anonymous: I agree with your other paragraphs, bu...

2011-08-14T19:34:10.673+02:00

@Anonymous: I agree with your other paragraphs, but not with the one where you wrote that "... OLDER version (4.5.x) of GCC whilst a newer version (4.6.x) is available with major improvements to the optimizer in general".

I am not sure, what "major improvements" in GCC 4.6 do you mean? Do you have benchmark numbers to back up your claim?

As far as well-written C code is concerned, in my opinion, there haven't been any "major improvements" in GCC for more than 5+ years. There have been improvements of a few percent in a limited number of cases - but nothing major.

Even LTO (link-time optimization (and lets hope it will be safe/stable to use when GCC 4.7 is released)) isn't a major boost. I haven't seen LTO being able to optimize calls to functions living in dynamic libraries (the bsearch(3) function would be a nice candidate). And I also haven't seen GCC's LTO being able to optimize calls to/within the Qt GUI library when painting pixels or lines onto the screen.

The main point of the PyPy article was that run-time optimizations in PyPy have a chance of surpassing GCC in certain cases.

Personally, I probably wouldn't willingly choose to work on a project like PyPy - since, err, I believe that hard-core JIT optimizations on a dynamically typed language like Python are generally a bad idea - but I am (in a positive way) eager to see what the PyPy team will be able to do in this field in the years to come.

Only a quick note OffTopic: in the python FAQ, one...

2011-08-12T18:07:12.055+02:00

Only a quick note OffTopic: in the python FAQ, one could update adding PyPy besides Psyco in the performance tips:

http://docs.python.org/faq/programming.html#my-program-is-too-slow-how-do-i-speed-it-up

@tt: Code is doing something else so it's not ...

2011-08-06T17:35:08.344+02:00

@tt: Code is doing something else so it's not the same.

@Antiplutocrat: Honestly, I expected a bit more ob...

2011-08-06T12:05:24.706+02:00

@Antiplutocrat:
Honestly, I expected a bit more objectivity from posters here. I am really disappointed that you compare me to "haters" (whoever that may be).

Your point about unroll-if-alt is absolutely valid and I myself have explicitly stated that I did not use that feature. At no point I have refuted that the original blog post was wrong - it is still very well possible that PyPy 1.6 is faster then C in this usage scenario. The main goal of my post was to make clear that the original benchmarks were flawed, as they grant the compiler too much space for unpredictable optimizations. I believe that my benchmark code produces more realistic results and I suggest that the authors of this blog entry re-run the benchmark using my code (or something similar, which controls for unpredictable optimizations).

The compare is good because both use standard lang...

2011-08-06T01:59:18.842+02:00

The compare is good because both use standard langauge fetatures to do the same thing, using a third part lib is not the same, then I have to code the same implant in RPython and people would still complain do to RPython often being faster then C regardless.

Python could have detected that the loop is not doing anything, but give that one value had a __str__ call it could've broken some code. Anyway, C compiler could also see that you didn't do anything with the value and optimalize it the same way.

@tt except one of the main points of the post was ...

2011-08-05T22:08:43.914+02:00

@tt except one of the main points of the post was that they had implemented a *new* feature (unroll-if-alt, I believe) that sped things up a bunch.

I'm not sure how much any comparison that *doesn't* use this new feature is worth ...

So many haters ! ;)

Well, I never said anything about writing super ef...

2011-08-05T21:03:15.698+02:00

Well, I never said anything about writing super efficient C code. Anyway, I don't see how you want to implement string formatting more efficiently - if we talk about general usage scenario. You can't really reuse the old string buffer, you basically have to allocate new one each time the string grows. Or pre-allocate a larger string buffer and do some substring copies (which will result in a much more complicated code). Anyway, the malloc() on OS X is very fast.

My point is: even this C code, which you call inefficient is around 6 times faster then pypy 1.5

@tt: This is a very inefficient C/C++ implementati...

2011-08-05T19:32:47.731+02:00

@tt: This is a very inefficient C/C++ implementation of the idea "make a loop grow a string by continuous appending and write the string to the file in the end". In addition, it appears to be an uncommon piece of C/C++ code.

I have now put a small, slightly more realistic be...

2011-08-05T17:15:39.746+02:00

I have now put a small, slightly more realistic benchmark. I used following code.

Python

def main():
x = ""
for i in xrange(50000):
x = "%s %d" % (x, i)
return x

x = main()

f = open("log.txt", "w")
f.write(x)
f.close()

C
#include
#include
#include

int main() {
int i;
char *x = malloc(0);
FILE *file;

*x = 0x00;

for (i = 0; i < 50000; i++) {
char *nx = malloc(strlen(x) + 16); // +16 bytes to be on the safe side

sprintf(nx, "%s %d", x, i);
free(x);
x = nx;
}

file = fopen("log1.txt","w");
fprintf(file, "%s", x);
fclose(file);
}

JavaScript (NodeJS)

var fs = require('fs');

String.prototype.format = function() {
var formatted = this;
for (var i = 0; i < arguments.length; i++) {
var regexp = new RegExp('\\{'+i+'\\}', 'gi');
formatted = formatted.replace(regexp, arguments[i]);
}
return formatted;
};

function main() {
var x = "";
for (var i = 0; i < 50000; i++)
x = "{0} {1}".format(x, i);
return(x)
}

x = main();
fs.writeFile('log.txt', x)

Note for JS example: I did not want to use the stuff like i + " " + i because it bypasses the format function call. Obviously, using the + operator the nodejs example would be much faster (but pypy probably as well).

Also, I used PyPy 1.5 as I did not find any precompiled PyPy 1.6 for OS X.

Results:

PyPy: real 0m13.307s
NodeJS: real 0m44.350s
C: real 0m1.812s

try in nodejs: var t = (new Date()).getTime(); f...

2011-08-05T12:28:47.516+02:00

try in nodejs:

var t = (new Date()).getTime();

function main() {
var x;
for (var i = 0; i < 10000000; i++)
x = i + " " + i;
return x;
}
x = main();

t = (new Date()).getTime() - t;
console.log(x + ", " + t);

This is a horribly flawed benchmark which illustra...

2011-08-05T11:05:28.835+02:00

This is a horribly flawed benchmark which illustrates absolutely nothing. First of all, an optimizing JIT should be (easily) able to detect that your inner loop has no side effects and optimize it away. Secondly, with code like that you should expect all kinds of weirds transformations by the compiler, hence - you can't be really sure what you are comparing here. As many here have pointed out, you should compare the output assembly.

Anyway, if you really want to do a benchmark like that, do it the right way. Make the loop grow a string by continuous appending and write the string to the file in the end (time the loop only). This way you will get accurate results which really compare the performance of two compilers performing the same task.

@Connelly yes, for some definition of working (bei...

2011-08-05T08:06:42.542+02:00

@Connelly yes, for some definition of working (being thought about). that's one reason why twisted_tcp is slower than other twisted benchmarks. We however welcome simple benchmarks as bugs on the issue tracker

Is string/IO performance in general being worked o...

2011-08-04T21:50:38.070+02:00

Is string/IO performance in general being worked on in Pypy? Last I looked Pypy showed it was faster than CPython in many cases on its benchmarks page, but for many string/IO intensive tasks I tried Pypy v1.5 on, it was slower.

@Anonymous: this branch, unroll-if-alt, will not b...

2011-08-04T09:47:57.943+02:00

@Anonymous: this branch, unroll-if-alt, will not be included in the release 1.6, which we're doing right now (it should be out any day now). It will only be included in the next release, which we hope to do soonish. It will also be in the nightly builds as soon as it is merged.

@Dave Kirby: There are two C programs there. One ...

2011-08-04T02:11:44.024+02:00

@Dave Kirby:

There are two C programs there. One on the stack, one with a malloc / free in the loop.

Which one is used for the faster claim?

You wrote: "We think this demonstrates the in...

2011-08-03T20:51:14.813+02:00

You wrote: "We think this demonstrates the incredible potential of dynamic compilation, ..."

I disagree. You tested a microbenchmark. Claims about compiler or language X winning over Y should be made after observing patterns in real programs. That is: execute or analyse real C programs which are making use of 'sprintf', record their use of 'sprintf', create a statistics out of the recorded data and then finally use the statistical distributions to create Python programs with a similar distribution of calls to '%'.

Trivial microbenchmarks can be deceiving.

PyPy does nothing 1.9 times faster than C.

2011-08-03T19:25:06.413+02:00

PyPy does nothing 1.9 times faster than C.

I'm following the progress on pypy since many ...

2011-08-03T16:45:24.015+02:00

I'm following the progress on pypy since many years and the potential is and has always been here. And boy, pypy has come a looong way.

You are my favorite open-source project and I am excited to see what will happen next. Go pypy-team, go!

@Anonymous: The C code shown does not do any mall...

2011-08-03T13:50:05.351+02:00

@Anonymous:

The C code shown does not do any malloc/free. The sprintf function formats the string into the char array x, which is allocated on the stack. It is highly unlikely that the sprintf function itself mallocs any memory.

The point here is not that the Python implementati...

2011-08-03T09:44:35.889+02:00

The point here is not that the Python implementation of formatting is better than the C standard library, but that dynamic optimisation can make a big difference. The first time the formatting operator is called its format string is parsed and assembly code for assembling the output generated. The next 999999 times that assembly code is used without doing the parsing step. Even if sprintf were defined locally, a static compiler can’t optimise away the parsing step, so that work is done redundantly every time around the loop.

In a language like Haskell something similar happens. A string formatting function in the style of sprintf would take a format string as a parameter and return a new function that formats its arguments according to that string. The new function corresponds to the specialized assembly code generated by PyPy’s JIT. I think if you wanted to give the static compiler the opportunity to do optimizations that PyPy does at runtime you would need to use a custom type rather than a string as the formatting spec. (NB my knowledge of functional-language implementation is 20 years out of date so take the above with a pinch of salt.)

What performance impact does the malloc/free produ...

2011-08-03T09:38:38.659+02:00

What performance impact does the malloc/free produce in the C code? AFAIK Python allocates memory in larger chunks from the operating system. Probably Python does not have to call malloc after initialization after it allocated the first chunk.

AFAIK each malloc/free crosses the boundaries between user-mode/kernel-mode.

So, IMHO you should compare the numbers of a C program which
does not allocate dynamic memory more than once and uses an internal memory management system.

These numbers would be interesting.

Have fun

"one time faster" is bad English.

2011-08-03T08:13:40.013+02:00

"one time faster" is bad English.