Thursday, October 29, 2015

PyPy 4.0.0 Released - A Jit with SIMD Vectorization and More

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the PyPy python 2.7.10 compatible interpreter with a Just In Time compiler. We have improved warmup time and memory overhead used for tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in rpython, and increased functionality of numpy.
You can download the PyPy 4.0.0 release here:
We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering

Since the past release, PyPy 2.6.1, we decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.


Richard Plangger began work in March and continued over a Google Summer of Code to add a vectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run-time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release, since it is so new it is off by default. To enable the vectorization in built-in JIT drivers (like numpy ufuncs), add –jit vec=1, to enable all implemented vectorization add –jit vec_all=1
Benchmarks and a summary of this work appear here

Internal Refactoring: Warmup Time Improvement and Reduced Memory Usage

Maciej Fijalkowski and Armin Rigo refactored internals of Rpython that now allow PyPy to more efficiently use guards in jitted code. They also rewrote unrolling, leading to a warmup time improvement of 20% or so. The reduction in guards also means a reduction in the use of memory, also a savings of around 20%.


Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via –jit vec=1 on the command line) and let us know what is missing for your code.


While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved manangement of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolate const, restrict keywords from cdef to C code.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Other Highlights (since 2.6.1 release two months ago)

  • Bug Fixes
    • Applied OPENBSD downstream fixes
    • Fix a crash on non-linux when running more than 20 threads
    • In cffi, ffi.new_handle() is more cpython compliant
    • Accept unicode in functions inside the _curses cffi backend exactly like cpython
    • Fix a segfault in itertools.islice()
    • Use gcrootfinder=shadowstack by default, asmgcc on linux only
    • Fix ndarray.copy() for upstream compatability when copying non-contiguous arrays
    • Fix assumption that lltype.UniChar is unsigned
    • Fix a subtle bug with stacklets on shadowstack
    • Improve support for the cpython capi in cpyext (our capi compatibility layer). Fixing these issues inspired some thought about cpyext in general, stay tuned for more improvements
    • When loading dynamic libraries, in case of a certain loading error, retry loading the library assuming it is actually a linker script, like on Arch and Gentoo
    • Issues reported with our previous release were resolved after reports from users on our issue tracker at or on IRC at #pypy
  • New features:
    • Add an optimization pass to vectorize loops using x86 SIMD intrinsics.
    • Support __stdcall on Windows in CFFI
    • Improve debug logging when using PYPYLOG=???
    • Deal with platforms with no RAND_egd() in OpenSSL
  • Numpy:
    • Add support for ndarray.ctypes
    • Fast path for mixing numpy scalars and floats
    • Add support for creating Fortran-ordered ndarrays
    • Fix casting failures in linalg (by extending ufunc casting)
    • Recognize and disallow (for now) pickling of ndarrays with objects embedded in them
  • Performance improvements and refactorings:
    • Reuse hashed keys across dictionaries and sets
    • Refactor JIT interals to improve warmup time by 20% or so at the cost of a minor regression in JIT speed
    • Recognize patterns of common sequences in the JIT backends and optimize them
    • Make the garbage collecter more incremental over external_malloc() calls
    • Share guard resume data where possible which reduces memory usage
    • Fast path for zip(list, list)
    • Reduce the number of checks in the JIT for lst[a:]
    • Move the non-optimizable part of callbacks outside the JIT
    • Factor in field immutability when invalidating heap information
    • Unroll itertools.izip_longest() with two sequences
    • Minor optimizations after analyzing output from vmprof and trace logs
    • Remove many class attributes in rpython classes
    • Handle getfield_gc_pure* and getfield_gc_* uniformly in
    • Improve simple trace function performance by lazily calling fast2locals and locals2fast only if truly necessary
Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
The PyPy Team


Gerrit Slonzer said...

With the SIMD run-time detection implemented, has the --jit-backend option become redundant?

stryker said...

Will a similar release be coming for Python 3.5?

John M. Camara said...

@Gerrit, they are 2 different things. One is the option to say you are interested in the SIMD support and the other is a check if SIMD support is available in the HW if you are interested in using it. I'm sure once SIMD support has been used for some time it will eventually be enabled by default but since it is new and potential could have some unknown issues at this time you have to explicitly enable it at this time.

Niklas B said...

Awesome, can't wait to try it

Daniela Czajk said...

Well done, thx!

Travis Griggs said...

I keep watching the progress of PyPy with excitement. Cool things happening here. But I continue to be disappointed that it doesn't tip towards Python3. It's dead to me until that becomes the majority effort. :(

Carl Friedrich Bolz said...

The PyPy project contains a large plurality of interests. A lot of the people working on it are volunteers. So PyPy3 will happen if people within the project become interested in that part, or if new people with that interest join the project. At the moment, this seems not happening, which we can all be sad about. However, blaming anybody with differing interest for that situation feels a bit annoying to me.

Travis Griggs said...

Well said, I apologize for any whining tone. It was not my intent to blame or complain. It really was just meant as a lamentation. Thanks for all you do.

PeteVine said...

What happened to my comment? Surely the benchmark I was proposing is not censorable...

Maciej Fijalkowski said...

@PeteVine you posted a random executable from dropbox claiming to have pypy with x87 backend. PyPy does not have an x87 backend and this raises suspitions this was just some malware. Now if you want someone to compare one thing against some other thing, please link to sources and not random binaries so the person comparing can look themselves. Additionally you did not post a benchmark, just a link to the binary

PeteVine said...

Well, I was suggesting benchmarking the 32-bit backends to see how much difference SIMD makes - x87 means the standard fpu whereas the default uses SSE2. I know it's processor archaeology so you may have forgotten pypy even had it ;)

The ready-to-use pypy distro (built by me) was meant for anyone in possesion of a real set of benchmarks (not synthetic vector stuff) to be able to try it quickly.

And btw, you could have simply edited the dropbox link out. I'd already tested py3k using this backend and mentioned it in one of the issues on bitbucket so it's far from random.

@ all the people asking about pypy3 - you have the python 3.2 compatible pypy (py3k) at your disposal even now.

Armin Rigo said...

@PeteVine: to clarify, PyPy has no JIT backend emitting the old-style x87 fpu instructions. What you are posting is very likely a PyPy whose JIT doesn't support floats at all. It emits calls to already-compiled functions, like the one doing addition of float objects, instead of writing a SSE2 float addition on unboxed objects.

Instead, use the official PyPy and run it with vectorization turned on and off (as documented) on the same modern machine. This allows an apple-to-apple comparison.

PeteVine said...

Thanks for clarifying, I must have confused myself after seeing it was i486 compatible.

Are you saying the only difference between the backends I wanted to benchmark would boil down to jit-emitting performance and not actual pypy performance? (I must admit I tried this a while ago with fibonacci and there was no difference at all).

In other words, even before vectorization functionality was added, shouldn't it be possible to detect that the non-SSE2 backend is running on newer hardware and use the available SIMD? (e.g. for max. compatibility)

Armin Rigo said...

@PeteVine Sorry, I don't understand your questions. Why do you bring the JIT-emitting performance to the table? And why fibonacci (it's not a benchmark with floats at all)? And I don't get the last question either ("SIMD" = "vectorization").

To some people, merely dropping the word "SIMD" into a performance discussion makes them go "ooh nice" even if they don't have a clue what it is. I hope you're more knowledgeable than that and that I'm merely missing your point :-)

PeteVine said...

The last part should have been pretty clear as it was referring to the newly added –jit vec=1 so it's not me who's dropping SIMD here (shorthand for different instructions sets) as can be seen in the title of this blog post.

All this time I was merely interested in comparing the two 32-bit backends, that's all. One's using the i486/x87 instruction set regardless of any jit codes, the other is able take advantage of anything up to SSE2. The quick fibonacci test was all I did so you could have pointed me to a real set of benchmarks instead of throwing these little jabs :)

Carl Friedrich Bolz said...

@PeteVine: ok, there is a misunderstanding somewhere, I think. Let me try to clarify: PyPy's JIT has always used non-SIMD SSE2 instructions to implement floating point operations. We have a slow mode where only x87 instructions are used, but usually don't fall back to that, and it does not make sense to compare against that mode.

What the new release experimentally added is support for SIMD SSE instructions using autoparallelization when --jit vec=1 is given. This only works if your program uses numpy arrays or other simple list processing code. For details on that (and for benchmarks) it's probably best to read Richard Plangger's blog.

Does that make sense?

PeteVine said...

Great, love that explanation! :)

But please, I'd really like to see how much of a handicap the much-maligned non-SSE2 backend incurs. Could you recommend a set of python (not purely computational) benchmarks so I can put this peevee of mine to rest/test?

Anyways, @Armin Rigo is a great educator himself judging from his patient replies in the bugtracker! So yeah, kudos to you guys!

Carl Friedrich Bolz said...

If you want to try a proper performance evaluation, the official benchmark set is probably the right one:

However, none of these benchmarks are exercising the new autovectorization. If you're particularly interested in that part, use the benchmarks from Richard's blog.

NortonCommander4ever said...

Is there a readme on how to use these benchmarks somewhere? (preferably written with windows users in mind, if you know what I mean:))