Tuesday, August 2, 2011

PyPy is faster than C, again: string formatting

String formatting is probably something you do just about every day in Python, and never think about. It's so easy, just "%d %d" % (i, i) and you're done. No thinking about how to size your result buffer, whether your output has an appropriate NULL byte at the end, or any other details. A C equivalent might be:

char x[44];
sprintf(x, "%d %d", i, i);

Note that we had to stop for a second and consider how big numbers might get and overestimate the size (44 = length of the biggest number on 64bit (20) + 1 for the sign * 2 + 1 (for the space) + 1 (NUL byte)), it took the authors of this post, fijal and alex, 3 tries to get the math right on this :-)

This is fine, except you can't even return x from this function, a more fair comparison might be:

char *x = malloc(44 * sizeof(char));
sprintf(x, "%d %d", i, i);

x is slightly overallocated in some situations, but that's fine.

But we're not here to just discuss the implementation of string formatting, we're here to discuss how blazing fast PyPy is at it, with the new unroll-if-alt branch. Given the Python code:

def main():
    for i in xrange(10000000):
        "%d %d" % (i, i)

main()

and the C code:

#include <stdio.h>
#include <stdlib.h>


int main() {
    int i = 0;
    char x[44];
    for (i = 0; i < 10000000; i++) {
        sprintf(x, "%d %d", i, i);
    }
}

Run under PyPy, at the head of the unroll-if-alt branch, and compiled with GCC 4.5.2 at -O4 (other optimization levels were tested, this produced the best performance). It took 0.85 seconds to execute under PyPy, and 1.63 seconds with the compiled binary. We think this demonstrates the incredible potential of dynamic compilation, GCC is unable to inline or unroll the sprintf call, because it sits inside of libc.

Benchmarking the C code:

#include <stdio.h>
#include <stdlib.h>


int main() {
    int i = 0;
    for (i = 0; i < 10000000; i++) {
        char *x = malloc(44 * sizeof(char));
        sprintf(x, "%d %d", i, i);
        free(x);
    }
}

Which as discussed above, is more comperable to the Python, gives a result of 1.96 seconds.

Summary of performance:

Platform GCC (stack) GCC (malloc) CPython PyPy (unroll-if-alt)
Time 1.63s 1.96s 10.2s 0.85s
relative to C 1x 0.83x 0.16x 1.9x

Overall PyPy is almost 2x faster. This is clearly win for dynamic compilation over static - the sprintf function lives in libc and so cannot be specializing over the constant string, which has to be parsed every time it's executed. In the case of PyPy, we specialize the assembler if we detect the left hand string of the modulo operator to be constant.

Cheers,
alex & fijal

52 comments:

salmon said...

What about '{0}'.format('pypy') ?
Is this also faster?

JoeHillen said...

Where can we see this "unroll-if-alt" branch?

Greg Haines said...

Are you sure the compiler isn't optimizing away the actual execution since you're not doing anything with the result?

Thomas Schilling said...

How are those two loops equivalent? You're not printing anything in the Python loop. I/O buffering etc. can eat quite a bit of runtime. It would also be nice to see what the particular improvements in this "unroll-if-alt" branch are.

Anonymous said...

How about doing something like that:
....
char p[5] = "%d %d"
//and then
sprintf(x, p, i,i);
....

?

Andrew Pendleton said...

@Thomas the C one doesn't print anything, either; sprintf just returns a string. printf is the one that prints.

hobbs said...

@Thomas the C one doesn't print anything either, so it sounds pretty equivalent to me.

Johan Tibell said...

This doesn't really have anything to do with dynamic compilation, but cross module optimization. There are static compilers, such as the Glasgow Haskell Compiler, that do this. If the compilation strategy depended on runtime data (e.g. measure hot spots), it would be dynamic compilation.

Sean said...

*yawn* If you want to see ridiculously fast string formatting, look at the Boost's Spirit library (specifically Karma). Small test case, but point well illustrated: http://www.boost.org/doc/libs/1_47_0/libs/spirit/doc/html/spirit/karma/performance_measurements/numeric_performance/int_performance.html Or look at Spirit's input parser for even integers: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html

Antonio Cuni said...

@JoeHillen: the unroll-if-alt branch is inside the main pypy repo on bitbucket (together with all the other branches).

@Greg: yes, we checked the generated code, it's not optimized away.

@anonymous: why it should be any faster? String literals in C are constants, it's not that you need to create a new one at each iteration

@Johan: note that the PyPy approach can generate code optimized for a formatting string loaded from a disk, or computed at runtime. No static compiler could do that.

Anonymous said...

What machine are you on that an int is 64 bits? Hardly anybody uses ILP64 or SILP64 data models ( http://en.wikipedia.org/wiki/64-bit#Specific_C-language_data_models ). Maybe a fourth try is in order? :P

Johan Tibell said...

Antonio, that is indeed neat.

Reid Kleckner said...

So when are you going to teach PyPy that the result of an unused string formatting can be deleted, and then delete the loop? ;)

I'm not sure how you'd get there from a tracing JIT, though. WIth Python, you still have to call all the formatting and stringification methods because they might have side effects. You only get to know that the entire operation is a no-op after you've inlined everything, but by then it will be at a low enough representation that it's hard to tell.

Anonymous said...

sizeof(char)==1. By definition. Argh.

PS: negative karma for lying headline

Anonymous said...

Check that you're not spending all your time in malloc/free(). Also use the return value from a failed snprintf(), plus 1, to size your output buffer.

Reid Kleckner said...

@Anonymous 2: Even if all the time were spent in malloc/free, PyPy has to dynamically allocate the string data structure, as well as provide a buffer to fill with the characters from the integers, since it has no way of knowing how much space will be needed (could be a custom integer class).

However, you're right that malloc and free are slow and a good gc system would have a faster allocator.

vsergeev said...

a quick tip to minimize the math in determining your sprintf buffer size for your experiment:
#include < stdint.h >
len = snprintf(NULL, 0, "%d %d", INT32_MIN, INT32_MIN);
will give you the string length required (not including null terminating byte) to fit the formatted string.

Similarly, %lld and INT64_MIN will do the trick (on the right platform) for 64-bit signed integers.

(not that I advocate fixed sized buffers for formatted strings based on min/max digit lengths for any real application)

Anonymous said...

You wrote:
and compiled with GCC 4.5.2 at -O4

Please read the manual of GCC. There you will see that every optimization level above 3 is handled as it would be 3. '-O4' is nothing else than '-O3'.

It is also known that optimizing with -O3 may lead to several problems at runtime (e.g. memory delays for short programs or memory allocation failure in larger programs).
That's why the recommended optimization level is '2' (or 's' for embedded systems) and not '3'.

Did you test with a realtime kernel?
How about the scheduler?

Maybe you should double check your test environment.

Anonymous said...

For all you complaining about test eviorment. Pypy would still have to do that internaly. If they should be truely comparable, then you need to also include snprintf inside the loop, making C even slower. Also, I doubt you will get 200% performance boost from scheduler change.

unroll-if-alt will be included in 1.6 right? Also when will 1.6 be released?

Thomas Schilling said...

@Andrew, @hobbs: Oh, sorry I overlooked the "s" in "sprintf". It would still be nice compare the generated machine code to explain the differences.

Whenever, someone claims language L1 implementation A is faster than language L2 implementation B there are obvious questions about (1) fairness of comparison, (2) what is being measured. In this case PyPy is specializing on the format string interpreter (does that require library annotations?) which a C compiler could do in principle here (but probably doesn't.) So, I'm always a bit suspicious when I see these kinds of comparisons.

@Johan: GHC's cross-module optimization often comes at the expense of binary compatibility. A JIT has a big advantage here.

illume said...

The python faster than C day has come! Congrats.

ps. Did you try it with (Link Time Optimization)LTO? that is with gcc the option: -flto ? Also, are you using PGO with gcc?

nekto0n said...

@salmon According to this commit new style formatting is supported too.

Someone correct me if I'm wrong.

Anonymous said...

I think that computation is not correct yet. IIRC, you only get 20 digits in an unsigned 64-bit quantity.

Worse, (again IIRC) sprintf is locale dependent. It may insert thousands separators.

Anonymous said...

This is not a good performance test because all printf function have high constant complexity, without looking at format string, check it

Strohan said...

wouldn't it be better to run your test with a more modern c++ library like cstring?

Anonymous said...

If 1.9x is "almost 2x faster", then what is "1x faster"?

Poposhka said...

post the Assembly code, map files and call graph or it didnt happen!!!!!!!!

Reinis I. said...

"one time faster" is bad English.

Anonymous said...

What performance impact does the malloc/free produce in the C code? AFAIK Python allocates memory in larger chunks from the operating system. Probably Python does not have to call malloc after initialization after it allocated the first chunk.

AFAIK each malloc/free crosses the boundaries between user-mode/kernel-mode.

So, IMHO you should compare the numbers of a C program which
does not allocate dynamic memory more than once and uses an internal memory management system.

These numbers would be interesting.

Have fun

Damian Cugley said...

The point here is not that the Python implementation of formatting is better than the C standard library, but that dynamic optimisation can make a big difference. The first time the formatting operator is called its format string is parsed and assembly code for assembling the output generated. The next 999999 times that assembly code is used without doing the parsing step. Even if sprintf were defined locally, a static compiler can’t optimise away the parsing step, so that work is done redundantly every time around the loop.

In a language like Haskell something similar happens. A string formatting function in the style of sprintf would take a format string as a parameter and return a new function that formats its arguments according to that string. The new function corresponds to the specialized assembly code generated by PyPy’s JIT. I think if you wanted to give the static compiler the opportunity to do optimizations that PyPy does at runtime you would need to use a custom type rather than a string as the formatting spec. (NB my knowledge of functional-language implementation is 20 years out of date so take the above with a pinch of salt.)

Dave Kirby said...

@Anonymous:

The C code shown does not do any malloc/free. The sprintf function formats the string into the char array x, which is allocated on the stack. It is highly unlikely that the sprintf function itself mallocs any memory.

Paul Jaros said...

I'm following the progress on pypy since many years and the potential is and has always been here. And boy, pypy has come a looong way.

You are my favorite open-source project and I am excited to see what will happen next. Go pypy-team, go!

Stepan Koltsov said...

PyPy does nothing 1.9 times faster than C.

said...

You wrote: "We think this demonstrates the incredible potential of dynamic compilation, ..."

I disagree. You tested a microbenchmark. Claims about compiler or language X winning over Y should be made after observing patterns in real programs. That is: execute or analyse real C programs which are making use of 'sprintf', record their use of 'sprintf', create a statistics out of the recorded data and then finally use the statistical distributions to create Python programs with a similar distribution of calls to '%'.

Trivial microbenchmarks can be deceiving.

sqlrob said...

@Dave Kirby:

There are two C programs there. One on the stack, one with a malloc / free in the loop.

Which one is used for the faster claim?

Armin Rigo said...

@Anonymous: this branch, unroll-if-alt, will not be included in the release 1.6, which we're doing right now (it should be out any day now). It will only be included in the next release, which we hope to do soonish. It will also be in the nightly builds as soon as it is merged.

Connelly Barnes said...

Is string/IO performance in general being worked on in Pypy? Last I looked Pypy showed it was faster than CPython in many cases on its benchmarks page, but for many string/IO intensive tasks I tried Pypy v1.5 on, it was slower.

Maciej Fijalkowski said...

@Connelly yes, for some definition of working (being thought about). that's one reason why twisted_tcp is slower than other twisted benchmarks. We however welcome simple benchmarks as bugs on the issue tracker

tt said...

This is a horribly flawed benchmark which illustrates absolutely nothing. First of all, an optimizing JIT should be (easily) able to detect that your inner loop has no side effects and optimize it away. Secondly, with code like that you should expect all kinds of weirds transformations by the compiler, hence - you can't be really sure what you are comparing here. As many here have pointed out, you should compare the output assembly.

Anyway, if you really want to do a benchmark like that, do it the right way. Make the loop grow a string by continuous appending and write the string to the file in the end (time the loop only). This way you will get accurate results which really compare the performance of two compilers performing the same task.

Anonymous said...

try in nodejs:

var t = (new Date()).getTime();

function main() {
var x;
for (var i = 0; i < 10000000; i++)
x = i + " " + i;
return x;
}
x = main();

t = (new Date()).getTime() - t;
console.log(x + ", " + t);

tt said...

I have now put a small, slightly more realistic benchmark. I used following code.

Python

def main():
x = ""
for i in xrange(50000):
x = "%s %d" % (x, i)
return x

x = main()

f = open("log.txt", "w")
f.write(x)
f.close()

C
#include
#include
#include


int main() {
int i;
char *x = malloc(0);
FILE *file;

*x = 0x00;

for (i = 0; i < 50000; i++) {
char *nx = malloc(strlen(x) + 16); // +16 bytes to be on the safe side

sprintf(nx, "%s %d", x, i);
free(x);
x = nx;
}

file = fopen("log1.txt","w");
fprintf(file, "%s", x);
fclose(file);
}

JavaScript (NodeJS)

var fs = require('fs');

String.prototype.format = function() {
var formatted = this;
for (var i = 0; i < arguments.length; i++) {
var regexp = new RegExp('\\{'+i+'\\}', 'gi');
formatted = formatted.replace(regexp, arguments[i]);
}
return formatted;
};


function main() {
var x = "";
for (var i = 0; i < 50000; i++)
x = "{0} {1}".format(x, i);
return(x)
}

x = main();
fs.writeFile('log.txt', x)


Note for JS example: I did not want to use the stuff like i + " " + i because it bypasses the format function call. Obviously, using the + operator the nodejs example would be much faster (but pypy probably as well).

Also, I used PyPy 1.5 as I did not find any precompiled PyPy 1.6 for OS X.

Results:

PyPy: real 0m13.307s
NodeJS: real 0m44.350s
C: real 0m1.812s

said...

@tt: This is a very inefficient C/C++ implementation of the idea "make a loop grow a string by continuous appending and write the string to the file in the end". In addition, it appears to be an uncommon piece of C/C++ code.

tt said...

Well, I never said anything about writing super efficient C code. Anyway, I don't see how you want to implement string formatting more efficiently - if we talk about general usage scenario. You can't really reuse the old string buffer, you basically have to allocate new one each time the string grows. Or pre-allocate a larger string buffer and do some substring copies (which will result in a much more complicated code). Anyway, the malloc() on OS X is very fast.

My point is: even this C code, which you call inefficient is around 6 times faster then pypy 1.5

Antiplutocrat said...

@tt except one of the main points of the post was that they had implemented a *new* feature (unroll-if-alt, I believe) that sped things up a bunch.

I'm not sure how much any comparison that *doesn't* use this new feature is worth ...

So many haters ! ;)

Anonymous said...

The compare is good because both use standard langauge fetatures to do the same thing, using a third part lib is not the same, then I have to code the same implant in RPython and people would still complain do to RPython often being faster then C regardless.

Python could have detected that the loop is not doing anything, but give that one value had a __str__ call it could've broken some code. Anyway, C compiler could also see that you didn't do anything with the value and optimalize it the same way.

tt said...

@Antiplutocrat:
Honestly, I expected a bit more objectivity from posters here. I am really disappointed that you compare me to "haters" (whoever that may be).

Your point about unroll-if-alt is absolutely valid and I myself have explicitly stated that I did not use that feature. At no point I have refuted that the original blog post was wrong - it is still very well possible that PyPy 1.6 is faster then C in this usage scenario. The main goal of my post was to make clear that the original benchmarks were flawed, as they grant the compiler too much space for unpredictable optimizations. I believe that my benchmark code produces more realistic results and I suggest that the authors of this blog entry re-run the benchmark using my code (or something similar, which controls for unpredictable optimizations).

Anonymous said...

@tt: Code is doing something else so it's not the same.

Anonymous said...

Only a quick note OffTopic: in the python FAQ, one could update adding PyPy besides Psyco in the performance tips:

http://docs.python.org/faq/programming.html#my-program-is-too-slow-how-do-i-speed-it-up

said...

@Anonymous: I agree with your other paragraphs, but not with the one where you wrote that "... OLDER version (4.5.x) of GCC whilst a newer version (4.6.x) is available with major improvements to the optimizer in general".

I am not sure, what "major improvements" in GCC 4.6 do you mean? Do you have benchmark numbers to back up your claim?

As far as well-written C code is concerned, in my opinion, there haven't been any "major improvements" in GCC for more than 5+ years. There have been improvements of a few percent in a limited number of cases - but nothing major.

Even LTO (link-time optimization (and lets hope it will be safe/stable to use when GCC 4.7 is released)) isn't a major boost. I haven't seen LTO being able to optimize calls to functions living in dynamic libraries (the bsearch(3) function would be a nice candidate). And I also haven't seen GCC's LTO being able to optimize calls to/within the Qt GUI library when painting pixels or lines onto the screen.

The main point of the PyPy article was that run-time optimizations in PyPy have a chance of surpassing GCC in certain cases.

Personally, I probably wouldn't willingly choose to work on a project like PyPy - since, err, I believe that hard-core JIT optimizations on a dynamically typed language like Python are generally a bad idea - but I am (in a positive way) eager to see what the PyPy team will be able to do in this field in the years to come.

Anonymous said...

@⚛: I indeed do not have benchmarks for these claims, but GCC 4.6 indeed added some newer optimization techniques to its assortment. Maybe these may not have had a significant influence in said case but they might have somewhere else. I'm merely saying: you can't really compare the latest hot inventions with something that is surpassed (e.g. compare Java 7 to a program output by Visual C++ back form the VS 2003 IDE).

All by all, I'm not saying that Python sucks and don't want to sound like a fanboy (on the contrary, Linux uses a great deal of Python and if this could mean a major speedup, then why the hell not ;).

I guess I was pissed off because the written article sounds very much fanboyish and pro-Python (just look at the title alone).

Anonymous said...

So a loop that doesn't print in Python is compared to a loop in C that does and that was compiled on one of the slowest C compilers out there.

YearOfTheLinuxDesktopIsAtHand(TM)

C. Timmerman said...

@Anonymous, "the C one doesn't print anything, either; sprintf just returns a string. printf is the one that prints." - Andrew Pendleton, this page, August 2, 2011 9:25 PM