Monday, April 18, 2016

PyPy Enterprise Edition

With the latest additions, PyPy's JIT now supports the Z architecture on Linux. The newest architecture revision (also known as s390x, or colloquially referred to as "big iron") is the 64-bit extension for IBM mainframes. Currently only Linux 64 bit is supported (not z/OS nor TPF).
This is the fourth assembler backend supported by PyPy in addition to x86 (32 and 64), ARM (32-bit only) and PPC64 (both little- and big-endian). It might seem that we kind of get a hang of new architectures. Thanks to IBM for funding this work!

History

When I went to university one lecture covered the prediction of Thomas Watson in 1943. His famous quote "I think there is a world market for maybe five computers ...", turned out not to be true.

However, even 70 years later, mainframes are used more often than you think. They back critical tasks requiring a high level of stability/security and offer high hardware and computational utilization rates by virtualization.

With the new PyPy JIT backend we are happy to present a fast Python virtual machine for mainframes and contribute more free software running on s390x.

Meta tracing

Even though the JIT backend has been tested on PyPy, it is not restricted to the Python programming language. Do you have a great idea for a DSL, or another language that should run on mainframes? Go ahead and just implement your interpreter using RPython.

How do I get a copy?

PyPy can be built using the usual instructions found here. As soon as the next PyPy version has been released we will provide binaries. Until then you can just grab a nightly here.We are currently busy to get the next version of PyPy ready, so an official release will be rolled out soon.

Comparing s390x to x86

The goal of this comparison is not to scientifically evaluate the benefits/disadvantages on s390x, but rather to see that PyPy's architecture delivers the same benefits as it does on other platforms. Similar to the comparison done for PPC I ran the benchmarks using the same setup. The first column is the speedup of the PyPy JIT VM compared to the speedup of a pure PyPy interpreter 1). Note that the s390x's OS was virtualized.

Label               x86     s390x      s390x (run 2)

ai                 13.7      12.4       11.9
bm_chameleon        8.5       6.3        6.8
bm_dulwich_log      5.1       5.0        5.1
bm_krakatau         5.5       2.0        2.0
bm_mako             8.4       5.8        5.9
bm_mdp              2.0       3.8        3.8
chaos              56.9      52.6       53.4
crypto_pyaes       62.5      64.2       64.2
deltablue           3.3       3.9        3.6
django             28.8      22.6       21.7
eparse              2.3       2.5        2.6
fannkuch            9.1       9.9       10.1
float              13.8      12.8       13.8
genshi_text        16.4      10.5       10.9
genshi_xml          8.2       7.9        8.2
go                  6.7       6.2       11.2
hexiom2            24.3      23.8       23.5
html5lib            5.4       5.8        5.7
json_bench         28.8      27.8       28.1
meteor-contest      5.1       4.2        4.4
nbody_modified     20.6      19.3       19.4
pidigits            1.0      -1.1       -1.0
pyflate-fast        9.0       8.7        8.5
pypy_interp         3.3     4.2        4.4
raytrace-simple    69.0     100.9       93.4
richards           94.1      96.6       84.3
rietveld            3.2       2.5        2.7
slowspitfire        2.8       3.3        4.2
spambayes           5.0       4.8        4.8
spectral-norm      41.9      39.8       42.6
spitfire            3.8       3.9        4.3
spitfire_cstringio 7.6       7.9        8.2
sympy_expand        2.9       1.8        1.8
sympy_integrate     4.3       3.9        4.0
sympy_str           1.5       1.3        1.3
sympy_sum           6.2       5.8        5.9
telco              61.2      48.5       54.8
twisted_iteration 55.5      41.9       43.8
twisted_names       8.2       9.3        9.7
twisted_pb         12.1      10.4       10.2
twisted_tcp         4.9       4.8        5.2

Geometric mean:    9.31      9.10       9.43

As you can see the benefits are comparable on both platforms.
Of course this is scientifically not good enough, but it shows a tendency. s390x can achieve the same results as you can get on x86.

Are you running your business application on a mainframe? We would love to get some feedback. Join us in IRC tell us if PyPy made your application faster!

plan_rich & the PyPy Team

1) PyPy revision for the benchmarks: 4b386bcfee54

Posted by Unknown at 11:13 0 Comments

Thursday, April 7, 2016

Warmup improvements: more efficient trace representation

Hello everyone.

I'm pleased to inform that we've finished another round of improvements to the warmup performance of PyPy. Before I go into details, I'll recap the achievements that we've done since we've started working on the warmup performance. I picked a random PyPy from November 2014 (which is definitely before we started the warmup work) and compared it with a recent one, after 5.0. The exact revisions are respectively ffce4c795283 and cfbb442ae368. First let's compare pure warmup benchmarks that can be found in our benchmarking suite. Out of those, pypy-graph-alloc-removal numbers should be taken with a grain of salt, since other work could have influenced the results. The rest of the benchmarks mentioned is bottlenecked purely by warmup times.

You can see how much your program spends in warmup running PYPYLOG=jit-summary:- pypy your-program.py under "tracing" and "backend" fields (in the first three lines). An example looks like that:

[e00c145a41] {jit-summary
Tracing:        71      0.053645 <- time spent tracing & optimizing
Backend:        71      0.028659 <- time spent compiling to assembler
TOTAL:                  0.252217 <- total run time of the program

The results of the benchmarks

benchmark	time - old	time - new	speedup	JIT time - old	JIT time - new
function_call	1.86	1.42	1.3x	1.12s	0.57s
function_call2	5.17s	2.73s	1.9x	4.2s	1.6s
bridges	2.77s	2.07s	1.3x	1.5s	0.8s
pypy-graph-alloc-removal	2.06s	1.65s	1.25x	1.25s	0.79s

As we can see, the overall warmup benchmarks got up to 90% faster with JIT time dropping by up to 2.5x. We have more optimizations in the pipeline, with an idea how to transfer some of the JIT gains into more of a total program runtime by jitting earlier and more eagerly.

Details of the last round of optimizations

Now the nitty gritty details - what did we actually do? I covered a lot of warmup improvements in the past blog posts so I'm going to focus on the last change, the jit-leaner-frontend branch. This last change is simple, instead of using pointers to store the "operations" objects created during tracing, we use a compact list of 16-bit integers (with 16bit pointers in between). On 64bit machine the memory wins are tremendous - the new representation is 4x more efficient to use 16bit pointers than full 64bit pointers. Additionally, the smaller representation has much better cache behavior and much less pointer chasing in memory. It also has a better defined lifespan, so we don't need to bother tracking them by the GC, which also saves quite a bit of time.

The change sounds simple, but the details in the underlaying data mean that everything in the JIT had to be changed which took quite a bit of effort :-)

Going into the future on the JIT front, we have an exciting set of optimizations, ranging from faster loops through faster warmup to using better code generation techniques and broadening the kind of program that PyPy speeds up. Stay tuned for the updates.

We would like to thank our commercial partners for making all of this possible. The work has been performed by baroquesoftware and would not be possible without support from people using PyPy in production. If your company uses PyPy and want it to do more or does not use PyPy but has performance problems with the Python installation, feel free to get in touch with me, trust me using PyPy ends up being a lot cheaper than rewriting everything in go :-)

Best regards,
Maciej Fijalkowski

Hello everyone.

You can see how much your program spends in warmup running PYPYLOG=jit-summary:- pypy your-program.py under "tracing" and "backend" fields (in the first three lines). An example looks like that:

[e00c145a41] {jit-summary
Tracing:        71      0.053645 <- time spent tracing & optimizing
Backend:        71      0.028659 <- time spent compiling to assembler
TOTAL:                  0.252217 <- total run time of the program

The results of the benchmarks

benchmark	time - old	time - new	speedup	JIT time - old	JIT time - new
function_call	1.86	1.42	1.3x	1.12s	0.57s
function_call2	5.17s	2.73s	1.9x	4.2s	1.6s
bridges	2.77s	2.07s	1.3x	1.5s	0.8s
pypy-graph-alloc-removal	2.06s	1.65s	1.25x	1.25s	0.79s

Details of the last round of optimizations

The change sounds simple, but the details in the underlaying data mean that everything in the JIT had to be changed which took quite a bit of effort :-)

Best regards,
Maciej Fijalkowski

Posted by Maciej Fijalkowski at 10:56 2 Comments

Saturday, March 19, 2016

PyPy 5.0.1 bugfix released

PyPy 5.0.1

We have released a bugfix for PyPy 5.0, after reports that the newly released lxml 3.6.0, which now supports PyPy 5.0 +, can crash on large files. Thanks to those who reported the crash. Please update, downloads are available at

pypy.org/download.html

The changes between PyPy 5.0 and 5.0.1 are only two bug fixes: one in cpyext, which fixes notably (but not only) lxml; and another for a corner case of the JIT.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, FreeBSD), newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux, and the big- and little-endian variants of PPC64 running Linux.

Please update, and continue to help us make PyPy better.

Cheers
The PyPy Team

Posted by mattip at 17:44 1 Comments

Thursday, March 10, 2016

PyPy 5.0 released

PyPy 5.0

We have released PyPy 5.0, about three months after PyPy 4.0.1. We encourage all users of PyPy to update to this version.

You can download the PyPy 5.0 release here:

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better.

Faster and Leaner

We continue to improve the warmup time and memory usage of JIT-related metadata. The exact effects depend vastly on the program you’re running and can range from insignificant to warmup being up to 30% faster and memory dropping by about 30%.

C-API Upgrade

We also merged a major upgrade to our C-API layer (cpyext), simplifying the interaction between c-level objects and PyPy interpreter level objects. As a result, lxml (prerelease) with its cython compiled component passes all tests on PyPy. The new cpyext is also much faster. This major refactoring will soon be followed by an expansion of our C-API compatibility.

Profiling with vmprof supported on more platforms

vmprof has been a go-to profiler for PyPy on linux for a few releases and we’re happy to announce that thanks to the cooperation with jetbrains, vmprof now works on Linux, OS X and Windows on both PyPy and CPython.

CFFI

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. PyPy 5.0 ships with cffi-1.5.2 which now allows embedding PyPy (or CPython) in a C program.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux, and 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Other Highlights (since 4.0.1 released in November 2015)

New features:
- Support embedding PyPy in a C-program via cffi and static callbacks in cffi.
  This deprecates the old method of embedding PyPy
- Refactor vmprof to work cross-operating-system, deprecate using buggy
  libunwind on Linux platforms. Vmprof even works on Windows now.
- Support more of the C-API type slots, like tp_getattro, and fix C-API
  macros, functions, and structs such as _PyLong_FromByteArray(),
  PyString_GET_SIZE, f_locals in PyFrameObject, Py_NAN, co_filename in
  PyCodeObject
- Use a more stable approach for allocating PyObjects in cpyext. (see
  blog post). Once the PyObject corresponding to a PyPy object is created,
  it stays around at the same location until the death of the PyPy object.
  Done with a little bit of custom GC support. It allows us to kill the
  notion of “borrowing” inside cpyext, reduces 4 dictionaries down to 1, and
  significantly simplifies the whole approach (which is why it is a new
  feature while technically a refactoring) and allows PyPy to support the
  populart lxml module (as of the next release) with no PyPy specific
  patches needed
- Make the default filesystem encoding ASCII, like CPython
- Use hypothesis in test creation, which is great for randomizing tests
Bug Fixes
- Backport always using os.urandom for uuid4 from cpython and fix the JIT as well
  (issue #2202)
- More completely support datetime, optimize timedelta creation
- Fix for issue #2185 which caused an inconsistent list of operations to be
  generated by the unroller, appeared in a complicated DJango app
- Fix an elusive issue with stacklets on shadowstack which showed up when
  forgetting stacklets without resuming them
- Fix entrypoint() which now acquires the GIL
- Fix direct_ffi_call() so failure does not bail out before setting CALL_MAY_FORCE
- Fix (de)pickling long values by simplifying the implementation
- Fix RPython rthread so that objects stored as threadlocal do not force minor
  GC collection and are kept alive automatically. This improves perfomance of
  short-running Python callbacks and prevents resetting such object between
  calls
- Support floats as parameters to itertools.isslice()
- Check for the existence of CODESET, ignoring it should have prevented PyPy
  from working on FreeBSD
- Fix for corner case (likely shown by Krakatau) for consecutive guards with
  interdependencies
- Fix applevel bare class method comparisons which should fix pretty printing
  in IPython
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
Numpy:
- Updates to numpy 1.10.2 (incompatibilities and not-implemented features
  still exist)
- Support dtype=((‘O’, spec)) union while disallowing record arrays with
  mixed object, non-object values
- Remove all traces of micronumpy from cpyext if –withoutmod-micronumpy option used
- Support indexing filtering with a boolean ndarray
- Support partition() as an app-level function, together with a cffi wrapper
  in pypy/numpy, this now provides partial support for partition()
Performance improvements:
- Optimize global lookups
- Improve the memory signature of numbering instances in the JIT. This should
  massively decrease the amount of memory consumed by the JIT, which is
  significant for most programs. Also compress the numberings using variable-
  size encoding
- Optimize string concatenation
- Use INT_LSHIFT instead of INT_MUL when possible
- Improve struct.unpack by casting directly from the underlying buffer.
  Unpacking floats and doubles is about 15 times faster, and integer types
  about 50% faster (on 64 bit integers). This was then subsequently
  improved further in optimizeopt.py.
- Optimize two-tuple lookups in mapdict, which improves warmup of instance
  variable access somewhat
- Reduce all guards from int_floordiv_ovf if one of the arguments is constant
- Identify permutations of attributes at instance creation, reducing the
  number of bridges created
- Greatly improve re.sub() performance
Internal refactorings:
- Refactor and improve exception analysis in the annotator
- Remove unnecessary special handling of space.wrap().
- Support list-resizing setslice operations in RPython
- Tweak the trace-too-long heuristic for multiple jit drivers
- Refactor bookkeeping (such a cool word - three double letters) in the
  annotater
- Refactor wrappers for OS functions from rtyper to rlib and simplify them
- Simplify backend loading instructions to only use four variants
- Simplify GIL handling in non-jitted code
- Refactor naming in optimizeopt
- Change GraphAnalyzer to use a more precise way to recognize external
  functions and fix null pointer handling, generally clean up external
  function handling
- Remove pure variants of getfield_gc_* operations from the JIT by
  determining purity while tracing
- Refactor databasing
- Simplify bootstrapping in cpyext
- Refactor rtyper debug code into python.rtyper.debug
- Seperate structmember.h from Python.h Also enhance creating api functions
  to specify which header file they appear in (previously only pypy_decl.h)
- Fix tokenizer to enforce universal newlines, needed for Python 3 support

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

PyPy 5.0

We have released PyPy 5.0, about three months after PyPy 4.0.1. We encourage all users of PyPy to update to this version.

You can download the PyPy 5.0 release here:

Faster and Leaner

C-API Upgrade

Profiling with vmprof supported on more platforms

CFFI

What is PyPy?

Other Highlights (since 4.0.1 released in November 2015)

New features:
- Support embedding PyPy in a C-program via cffi and static callbacks in cffi.
  This deprecates the old method of embedding PyPy
- Refactor vmprof to work cross-operating-system, deprecate using buggy
  libunwind on Linux platforms. Vmprof even works on Windows now.
- Support more of the C-API type slots, like tp_getattro, and fix C-API
  macros, functions, and structs such as _PyLong_FromByteArray(),
  PyString_GET_SIZE, f_locals in PyFrameObject, Py_NAN, co_filename in
  PyCodeObject
- Use a more stable approach for allocating PyObjects in cpyext. (see
  blog post). Once the PyObject corresponding to a PyPy object is created,
  it stays around at the same location until the death of the PyPy object.
  Done with a little bit of custom GC support. It allows us to kill the
  notion of “borrowing” inside cpyext, reduces 4 dictionaries down to 1, and
  significantly simplifies the whole approach (which is why it is a new
  feature while technically a refactoring) and allows PyPy to support the
  populart lxml module (as of the next release) with no PyPy specific
  patches needed
- Make the default filesystem encoding ASCII, like CPython
- Use hypothesis in test creation, which is great for randomizing tests
Bug Fixes
- Backport always using os.urandom for uuid4 from cpython and fix the JIT as well
  (issue #2202)
- More completely support datetime, optimize timedelta creation
- Fix for issue #2185 which caused an inconsistent list of operations to be
  generated by the unroller, appeared in a complicated DJango app
- Fix an elusive issue with stacklets on shadowstack which showed up when
  forgetting stacklets without resuming them
- Fix entrypoint() which now acquires the GIL
- Fix direct_ffi_call() so failure does not bail out before setting CALL_MAY_FORCE
- Fix (de)pickling long values by simplifying the implementation
- Fix RPython rthread so that objects stored as threadlocal do not force minor
  GC collection and are kept alive automatically. This improves perfomance of
  short-running Python callbacks and prevents resetting such object between
  calls
- Support floats as parameters to itertools.isslice()
- Check for the existence of CODESET, ignoring it should have prevented PyPy
  from working on FreeBSD
- Fix for corner case (likely shown by Krakatau) for consecutive guards with
  interdependencies
- Fix applevel bare class method comparisons which should fix pretty printing
  in IPython
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
Numpy:
- Updates to numpy 1.10.2 (incompatibilities and not-implemented features
  still exist)
- Support dtype=((‘O’, spec)) union while disallowing record arrays with
  mixed object, non-object values
- Remove all traces of micronumpy from cpyext if –withoutmod-micronumpy option used
- Support indexing filtering with a boolean ndarray
- Support partition() as an app-level function, together with a cffi wrapper
  in pypy/numpy, this now provides partial support for partition()
Performance improvements:
- Optimize global lookups
- Improve the memory signature of numbering instances in the JIT. This should
  massively decrease the amount of memory consumed by the JIT, which is
  significant for most programs. Also compress the numberings using variable-
  size encoding
- Optimize string concatenation
- Use INT_LSHIFT instead of INT_MUL when possible
- Improve struct.unpack by casting directly from the underlying buffer.
  Unpacking floats and doubles is about 15 times faster, and integer types
  about 50% faster (on 64 bit integers). This was then subsequently
  improved further in optimizeopt.py.
- Optimize two-tuple lookups in mapdict, which improves warmup of instance
  variable access somewhat
- Reduce all guards from int_floordiv_ovf if one of the arguments is constant
- Identify permutations of attributes at instance creation, reducing the
  number of bridges created
- Greatly improve re.sub() performance
Internal refactorings:
- Refactor and improve exception analysis in the annotator
- Remove unnecessary special handling of space.wrap().
- Support list-resizing setslice operations in RPython
- Tweak the trace-too-long heuristic for multiple jit drivers
- Refactor bookkeeping (such a cool word - three double letters) in the
  annotater
- Refactor wrappers for OS functions from rtyper to rlib and simplify them
- Simplify backend loading instructions to only use four variants
- Simplify GIL handling in non-jitted code
- Refactor naming in optimizeopt
- Change GraphAnalyzer to use a more precise way to recognize external
  functions and fix null pointer handling, generally clean up external
  function handling
- Remove pure variants of getfield_gc_* operations from the JIT by
  determining purity while tracing
- Refactor databasing
- Simplify bootstrapping in cpyext
- Refactor rtyper debug code into python.rtyper.debug
- Seperate structmember.h from Python.h Also enhance creating api functions
  to specify which header file they appear in (previously only pypy_decl.h)
- Fix tokenizer to enforce universal newlines, needed for Python 3 support

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Posted by mattip at 18:03 12 Comments

Thursday, February 25, 2016

C-API Support update

As you know, PyPy can emulate the CPython C API to some extent. In this post I will describe an important optimization that we merged to improve the performance and stability of the C-API emulation layer.

The C-API is implemented by passing around PyObject * pointers in the C code. The problem with providing the same interface with PyPy is that objects don't natively have the same PyObject * structure at all; and additionally their memory address can change. PyPy handles the difference by maintaining two sets of objects. More precisely, starting from a PyPy object, it can allocate on demand a PyObject structure and fill it with information that points back to the original PyPy objects; and conversely, starting from a C-level object, it can allocate a PyPy-level object and fill it with information in the opposite direction.

I have merged a rewrite of the interaction between C-API C-level objects and PyPy's interpreter level objects. This is mostly a simplification based on a small hack in our garbage collector. This hack makes the garbage collector aware of the reference-counted PyObject structures. When it considers a pair consisting of a PyPy object and a PyObject, it will always free either none or both of them at the same time. They both stay alive if either there is a regular GC reference to the PyPy object, or the reference counter in the PyObject is bigger than zero.

This gives a more stable result. Previously, a PyPy object might grow a corresponding PyObject, loose it (when its reference counter goes to zero), and later have another corresponding PyObject re-created at a different address. Now, once a link is created, it remains alive until both objects die.

The rewrite significantly simplifies our previous code (which used to be based on at least 4 different dictionaries), and should make using the C-API somewhat faster (though it is still slower than using pure python or cffi).

A side effect of this work is that now PyPy actually supports the upstream lxml package---which is is one of the most popular packages on PyPI. (Specifically, you need version 3.5.0 with this pull request to remove old PyPy-specific hacks that were not really working. See details.) At this point, we no longer recommend using the cffi-lxml alternative: although it may still be faster, it might be incomplete and old.

We are actively working on extending our C-API support, and hope to soon merge a branch to support more of the C-API functions (some numpy news coming!). Please try it out and let us know how it works for you.

Armin Rigo and the PyPy team

Posted by Armin Rigo at 16:54 4 Comments

Wednesday, January 6, 2016

Using CFFI for embedding

Introduction

CFFI has been a great success so far to call C libraries in your Python programs, in a way that is both simple and that works across CPython 2.x and 3.x and PyPy.

This post assumes that you know what CFFI is and how to use it in API mode (ffi.cdef(), ffi.set_source(), ffi.compile()). A quick overview can be found in this paragraph.

The major news of CFFI 1.4, released last december, was that you can now declare C functions with extern "Python" in the cdef(). These magic keywords make the function callable from C (where it is defined automatically), but calling it will call some Python code (which you attach with the @ffi.def_extern() decorator). This is useful because it gives a more straightforward, faster and libffi-independent way to write callbacks. For more details, see the documentation.

You are, in effect, declaring a static family of C functions which call Python code. The idea is to take pointers to them, and pass them around to other C functions, as callbacks. However, the idea of a set of C functions which call Python code opens another path: embedding Python code inside non-Python programs.

Embedding

Embedding is traditionally done using the CPython C API: from C code, you call Py_Initialize() and then some other functions like PyRun_SimpleString(). In the simple cases it is, indeed, simple enough; but it can become a complicated story if you throw in supporting application-dependent object types; and a messy story if you add correctly running on multiple threads, for example.

Moreover, this approach is specific to CPython (2.x or 3.x). It does not work at all on PyPy, which has its own very different, minimal embedding API.

The new-and-coming thing about CFFI 1.5, meant as replacement of the above solutions, is direct embedding support---with no fixed API at all. The idea is to write some Python script with a cdef() which declares a number of extern "Python" functions. When running the script, it creates the C source code and compiles it to a dynamically-linked library (.so on Linux). This is the same as in the regular API-mode usage. What is new is that these extern "Python" can now also be exported from the .so, in the C sense. You also give a bit of initialization-time Python code directly in the script, which will be compiled into the .so too.

This library can now be used directly from any C program (and it is still importable in Python). It exposes the C API of your choice, which you specified with the extern "Python" declarations. You can use it to make whatever custom API makes sense in your particular case. You can even directly make a "plug-in" for any program that supports them, just by exporting the API expected for such plugins.

Trying it out on CPython

This is still being finalized, but please try it out. You can see embedding.py directly online for a quick glance. Or see below the instructions on Linux with CPython 2.7 (CPython 3.x and non-Linux platforms are still a work in progress right now, but this should be quickly fixed):

get the branch static-callback-embedding of CFFI:

hg clone https://bitbucket.org/cffi/cffi
hg up static-callback-embedding

make the _cffi_backend.so:
```
python setup_base.py build_ext -f -i
```

run embedding.py in the demo directory:

cd demo
PYTHONPATH=.. python embedding.py

this produces _embedding_cffi.c. Run gcc to build it. On Linux:

gcc -shared -fPIC _embedding_cffi.c -o _embedding_cffi.so  \
    -lpython2.7 -I/usr/include/python2.7

try out the demo C program in embedding_test.c:

gcc embedding_test.c _embedding_cffi.so
PYTHONPATH=.. LD_LIBRARY_PATH=. ./a.out

Note that if you get ImportError: cffi extension module '_embedding_cffi' has unknown version 0x2701, it means that the _cffi_backend module loaded is a pre-installed one instead of the more recent one in "..". Be sure to use PYTHONPATH=.. for now. (Some installations manage to be confused enough to load the system-wide cffi even if another version is in the PYTHONPATH. I think a virtualenv can be used to work around this issue.)

Try it out on PyPy

Very similar steps can be followed on PyPy, but it requires the cffi-static-callback-embedding branch of PyPy, which you must first translate from sources. The difference is then that you need to adapt the first gcc command line: replace -lpython2.7 with -lpypy-c and to fix the -I path (and possibly add a -L path).

More details

How it works, more precisely, is by automatically initializing CPython/PyPy the first time any of the extern "Python" functions is called from the C program. This is done using locks in case of multi-threading, so several threads can concurrently do this "first call". This should work even if two different threads call the first time a function from two different embedded CFFI extensions that happen to be linked with the same program. Explicit initialization is never needed.

The custom initialization-time Python code you put in ffi.embedding_init_code() is executed at that time. If this code starts to be big, you can move it to independent modules or packages. Then the initialization-time Python code only needs to import them. In that case, you have to carefully set up sys.path if the modules are not installed in the usual Python way.

If the Python code is big and full of dependencies, a better alternative would be to use virtualenv. How to do that is not fully fleshed out so far. You can certainly run the whole program with the environment variables set up by the virtualenv's activate script first. There are probably other solutions that involve using gcc's -Wl,-rpath=\$ORIGIN/ or -Wl,-rpath=/fixed/path/ options to load a specific libpython or libypypy-c library. If you try it out and it doesn't work the way you would like, please complain :-)

Another point: right now this does not support CPython's notion of multiple subinterpreters. The logic creates a single global Python interpreter, and runs everything in that context. Maybe a future version would have an explicit API to do that — or maybe it should be the job of a 3rd-party extension module to provide a Python interface over the notion of subinterpreters...

More generally, any feedback is appreciated.

Have fun,

Armin

Introduction

CFFI has been a great success so far to call C libraries in your Python programs, in a way that is both simple and that works across CPython 2.x and 3.x and PyPy.

This post assumes that you know what CFFI is and how to use it in API mode (ffi.cdef(), ffi.set_source(), ffi.compile()). A quick overview can be found in this paragraph.

Embedding

Moreover, this approach is specific to CPython (2.x or 3.x). It does not work at all on PyPy, which has its own very different, minimal embedding API.

Trying it out on CPython

get the branch static-callback-embedding of CFFI:

hg clone https://bitbucket.org/cffi/cffi
hg up static-callback-embedding

make the _cffi_backend.so:
```
python setup_base.py build_ext -f -i
```

run embedding.py in the demo directory:

cd demo
PYTHONPATH=.. python embedding.py

this produces _embedding_cffi.c. Run gcc to build it. On Linux:

gcc -shared -fPIC _embedding_cffi.c -o _embedding_cffi.so  \
    -lpython2.7 -I/usr/include/python2.7

try out the demo C program in embedding_test.c:

gcc embedding_test.c _embedding_cffi.so
PYTHONPATH=.. LD_LIBRARY_PATH=. ./a.out

Try it out on PyPy

More details

More generally, any feedback is appreciated.

Have fun,

Armin

Posted by Armin Rigo at 12:17 7 Comments

Monday, January 4, 2016

Leysin Winter Sprint (20-27th February 2016)

The next PyPy sprint will be in Leysin, Switzerland, for the eleventh time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

The details depend on who is here and ready to work. The list of topics is mostly the same as last year (did PyPy became a mature project with only long-term goals?):

cpyext (CPython C API emulation layer): various speed and completeness topics
cleaning up the optimization step in the JIT, change the register allocation done by the JIT's backend, or more improvements to the warm-up time
finish vmprof - a statistical profiler for CPython and PyPy
Py3k (Python 3.x support), NumPyPy (the numpy module)
STM (Software Transaction Memory), notably: try to come up with benchmarks, and measure them carefully in order to test and improve the conflict reporting tools, and more generally to figure out how practical it is in large projects to avoid conflicts
And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.

Exact times

I have booked the week from Saturday 20 to Saturday 27. It is fine to leave either the 27 or the 28, or even stay a few more days on either side. The plan is to work full days between the 21 and the 27. You are of course allowed to show up for a part of that time only, too.

Location & Accomodation

Leysin, Switzerland, "same place as before". Let me refresh your memory: both the sprint venue and the lodging will be in a pair of chalets built specifically for bed & breakfast: http://www.ermina.ch/. The place has a good ADSL Internet connection with wireless installed. You can also arrange your own lodging elsewhere (as long as you are in Leysin, you cannot be more than a 15 minutes walk away from the sprint venue).

Please confirm that you are coming so that we can adjust the reservations as appropriate.

The options of rooms are a bit more limited than on previous years because the place for bed-and-breakfast is shrinking: what is guaranteed is only one double-bed room and a bigger room with 5-6 individual beds (the latter at 50-60 CHF per night, breakfast included). If there are more people that would prefer a single room, please contact me and we'll see what choices you have. There are a choice of hotels, many of them reasonably priced for Switzerland.

Please register by Mercurial:

or on the pypy-dev mailing list if you do not yet have check-in rights:

You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around, and at least one EU-format power strip.

The next PyPy sprint will be in Leysin, Switzerland, for the eleventh time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

The details depend on who is here and ready to work. The list of topics is mostly the same as last year (did PyPy became a mature project with only long-term goals?):

cpyext (CPython C API emulation layer): various speed and completeness topics
cleaning up the optimization step in the JIT, change the register allocation done by the JIT's backend, or more improvements to the warm-up time
finish vmprof - a statistical profiler for CPython and PyPy
Py3k (Python 3.x support), NumPyPy (the numpy module)
STM (Software Transaction Memory), notably: try to come up with benchmarks, and measure them carefully in order to test and improve the conflict reporting tools, and more generally to figure out how practical it is in large projects to avoid conflicts
And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.

Exact times

Location & Accomodation

Please confirm that you are coming so that we can adjust the reservations as appropriate.

Please register by Mercurial:

or on the pypy-dev mailing list if you do not yet have check-in rights:

You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around, and at least one EU-format power strip.

Posted by Armin Rigo at 13:24 0 Comments

Thursday, November 19, 2015

PyPy 4.0.1 released please update

PyPy 4.0.1

We have released PyPy 4.0.1, three weeks after PyPy 4.0.0. We have fixed a few critical bugs in the JIT compiled code, reported by users. We therefore encourage all users of PyPy to update to this version. There are a few minor enhancements in this version as well.

You can download the PyPy 4.0.1 release here:

CFFI update

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. PyPy 4.0.1 ships with cffi-1.3.1 with the improvements it brings.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux, and the big- and little-endian variants of ppc64 running Linux.

Other Highlights (since 4.0.0 released three weeks ago)

Bug Fixes
- Fix a bug when unrolling double loops in JITted code
- Fix multiple memory leaks in the ssl module, one of which affected CPython as well (thanks to Alex Gaynor for pointing those out)
- Use pkg-config to find ssl headers on OS-X
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
New features
- Internal cleanup of RPython class handling
- Support stackless and greenlets on PPC machines
- Improve debug logging in subprocesses: use PYPYLOG=jit:log.%d for example to have all subprocesses write the JIT log to a file called ‘log.%d’, with ‘%d’ replaced with the subprocess’ PID.
- Support PyOS_double_to_string in our cpyext capi compatibility layer
Numpy
- Improve support for __array_interface__
- Propagate most NAN mantissas through float16-float32-float64 conversions
Performance improvements and refactorings
- Improvements in slicing byte arrays
- Improvements in enumerate()
- Silence some warnings while translating

Please update, and continue to help us make PyPy better.

Cheers
The PyPy Team

PyPy 4.0.1

CFFI update

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. PyPy 4.0.1 ships with cffi-1.3.1 with the improvements it brings.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux, and the big- and little-endian variants of ppc64 running Linux.

Other Highlights (since 4.0.0 released three weeks ago)

Bug Fixes
- Fix a bug when unrolling double loops in JITted code
- Fix multiple memory leaks in the ssl module, one of which affected CPython as well (thanks to Alex Gaynor for pointing those out)
- Use pkg-config to find ssl headers on OS-X
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
New features
- Internal cleanup of RPython class handling
- Support stackless and greenlets on PPC machines
- Improve debug logging in subprocesses: use PYPYLOG=jit:log.%d for example to have all subprocesses write the JIT log to a file called ‘log.%d’, with ‘%d’ replaced with the subprocess’ PID.
- Support PyOS_double_to_string in our cpyext capi compatibility layer
Numpy
- Improve support for __array_interface__
- Propagate most NAN mantissas through float16-float32-float64 conversions
Performance improvements and refactorings
- Improvements in slicing byte arrays
- Improvements in enumerate()
- Silence some warnings while translating

Please update, and continue to help us make PyPy better.

Cheers
The PyPy Team

Posted by mattip at 22:01 8 Comments

Thursday, October 29, 2015

PyPy 4.0.0 Released - A Jit with SIMD Vectorization and More

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the PyPy python 2.7.10 compatible interpreter with a Just In Time compiler. We have improved warmup time and memory overhead used for tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in rpython, and increased functionality of numpy.
You can download the PyPy 4.0.0 release here:

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering

Since the past release, PyPy 2.6.1, we decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.

Vectorization

Richard Plangger began work in March and continued over a Google Summer of Code to add a vectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run-time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release, since it is so new it is off by default. To enable the vectorization in built-in JIT drivers (like numpy ufuncs), add –jit vec=1, to enable all implemented vectorization add –jit vec_all=1
Benchmarks and a summary of this work appear here

Internal Refactoring: Warmup Time Improvement and Reduced Memory Usage

Maciej Fijalkowski and Armin Rigo refactored internals of Rpython that now allow PyPy to more efficiently use guards in jitted code. They also rewrote unrolling, leading to a warmup time improvement of 20% or so. The reduction in guards also means a reduction in the use of memory, also a savings of around 20%.

Numpy

Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via –jit vec=1 on the command line) and let us know what is missing for your code.

CFFI

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved manangement of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolate const, restrict keywords from cdef to C code.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Other Highlights (since 2.6.1 release two months ago)

Bug Fixes
- Applied OPENBSD downstream fixes
- Fix a crash on non-linux when running more than 20 threads
- In cffi, ffi.new_handle() is more cpython compliant
- Accept unicode in functions inside the _curses cffi backend exactly like cpython
- Fix a segfault in itertools.islice()
- Use gcrootfinder=shadowstack by default, asmgcc on linux only
- Fix ndarray.copy() for upstream compatability when copying non-contiguous arrays
- Fix assumption that lltype.UniChar is unsigned
- Fix a subtle bug with stacklets on shadowstack
- Improve support for the cpython capi in cpyext (our capi compatibility layer). Fixing these issues inspired some thought about cpyext in general, stay tuned for more improvements
- When loading dynamic libraries, in case of a certain loading error, retry loading the library assuming it is actually a linker script, like on Arch and Gentoo
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
New features:
- Add an optimization pass to vectorize loops using x86 SIMD intrinsics.
- Support __stdcall on Windows in CFFI
- Improve debug logging when using PYPYLOG=???
- Deal with platforms with no RAND_egd() in OpenSSL
Numpy:
- Add support for ndarray.ctypes
- Fast path for mixing numpy scalars and floats
- Add support for creating Fortran-ordered ndarrays
- Fix casting failures in linalg (by extending ufunc casting)
- Recognize and disallow (for now) pickling of ndarrays with objects embedded in them
Performance improvements and refactorings:
- Reuse hashed keys across dictionaries and sets
- Refactor JIT interals to improve warmup time by 20% or so at the cost of a minor regression in JIT speed
- Recognize patterns of common sequences in the JIT backends and optimize them
- Make the garbage collecter more incremental over external_malloc() calls
- Share guard resume data where possible which reduces memory usage
- Fast path for zip(list, list)
- Reduce the number of checks in the JIT for lst[a:]
- Move the non-optimizable part of callbacks outside the JIT
- Factor in field immutability when invalidating heap information
- Unroll itertools.izip_longest() with two sequences
- Minor optimizations after analyzing output from vmprof and trace logs
- Remove many class attributes in rpython classes
- Handle getfield_gc_pure* and getfield_gc_* uniformly in heap.py
- Improve simple trace function performance by lazily calling fast2locals and locals2fast only if truly necessary

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

PyPy 4.0.0

New Version Numbering

Vectorization

Internal Refactoring: Warmup Time Improvement and Reduced Memory Usage

Numpy

CFFI

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Other Highlights (since 2.6.1 release two months ago)

Bug Fixes
- Applied OPENBSD downstream fixes
- Fix a crash on non-linux when running more than 20 threads
- In cffi, ffi.new_handle() is more cpython compliant
- Accept unicode in functions inside the _curses cffi backend exactly like cpython
- Fix a segfault in itertools.islice()
- Use gcrootfinder=shadowstack by default, asmgcc on linux only
- Fix ndarray.copy() for upstream compatability when copying non-contiguous arrays
- Fix assumption that lltype.UniChar is unsigned
- Fix a subtle bug with stacklets on shadowstack
- Improve support for the cpython capi in cpyext (our capi compatibility layer). Fixing these issues inspired some thought about cpyext in general, stay tuned for more improvements
- When loading dynamic libraries, in case of a certain loading error, retry loading the library assuming it is actually a linker script, like on Arch and Gentoo
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy
New features:
- Add an optimization pass to vectorize loops using x86 SIMD intrinsics.
- Support __stdcall on Windows in CFFI
- Improve debug logging when using PYPYLOG=???
- Deal with platforms with no RAND_egd() in OpenSSL
Numpy:
- Add support for ndarray.ctypes
- Fast path for mixing numpy scalars and floats
- Add support for creating Fortran-ordered ndarrays
- Fix casting failures in linalg (by extending ufunc casting)
- Recognize and disallow (for now) pickling of ndarrays with objects embedded in them
Performance improvements and refactorings:
- Reuse hashed keys across dictionaries and sets
- Refactor JIT interals to improve warmup time by 20% or so at the cost of a minor regression in JIT speed
- Recognize patterns of common sequences in the JIT backends and optimize them
- Make the garbage collecter more incremental over external_malloc() calls
- Share guard resume data where possible which reduces memory usage
- Fast path for zip(list, list)
- Reduce the number of checks in the JIT for lst[a:]
- Move the non-optimizable part of callbacks outside the JIT
- Factor in field immutability when invalidating heap information
- Unroll itertools.izip_longest() with two sequences
- Minor optimizations after analyzing output from vmprof and trace logs
- Remove many class attributes in rpython classes
- Handle getfield_gc_pure* and getfield_gc_* uniformly in heap.py
- Improve simple trace function performance by lazily calling fast2locals and locals2fast only if truly necessary

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Posted by mattip at 13:17 19 Comments

Tuesday, October 20, 2015

Automatic SIMD vectorization support in PyPy

Hi everyone,

it took some time to catch up with the JIT refacrtorings merged in this summer. But, (drums) we are happy to announce that:

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

The goal of this project was to increase the speed of numerical applications in both the NumPyPy library and for arbitrary Python programs. In PyPy we have focused a lot on improvements in the 'typical python workload', which usually involves object and string manipulations, mostly for web development. We're hoping with this work that we'll continue improving the other very important Python use case - numerics.

What it can do!

It targets numerics only. It will not execute object manipulations faster, but it is capable of enhancing common vector and matrix operations.
Good news is that it is not specifically targeted for the NumPy library and the PyPy virtual machine. Any interpreter (written in RPython) is able make use of the vectorization. For more information about that take a look here, or consult the documentation. For the time being it is not turn on by default, so be sure to enable it by specifying --jit vec=1 before running your program.

If your language (written in RPython) contains many array/matrix operations, you can easily integrate the optimization by adding the parameter 'vec=1' to the JitDriver.

NumPyPy Improvements

Let's take a look at the core functions of the NumPyPy library (*).
The following tests tests show the speedup of the core functions commonly used in Python code interfacing with NumPy, on CPython with NumPy, on the PyPy 2.6.1 relased several weeks ago, and on PyPy 15.11 to be released soon. Timeit was used to test the time needed to run the operation in the plot title on various vector (lower case) and square matrix (upper case) sizes displayed on the X axis. The Y axis shows the speedup compared to CPython 2.7.10. This means that higher is better.

In comparison to PyPy 2.6.1, the speedup greatly improved. The hardware support really strips down the runtime of the vector and matrix operations. There is another operation we would like to highlight: the dot product.
It is a very common operation in numerics and PyPy now (given a moderate sized matrix and vector) decreases the time spent in that operation. See for yourself:

These are nice improvements in the NumPyPy library and we got to a competitive level only making use of SSE4.1.

Future work

This is not the end of the road. The GSoC project showed that it is possible to implement this optimization in PyPy. There might be other improvements we can make to carry this further:

Check alignment at runtime to increase the memory throughput of the CPU
Support the AVX vector extension which (at least) doubles the size of the vector register
Handle each and every corner case in Python traces to enable it globally
Do not rely only on loading operations to trigger the analysis, there might be cases where combination of floating point values could be done in parallel

Cheers,
The PyPy Team

(*) The benchmark code can be found here it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

Hi everyone,

it took some time to catch up with the JIT refacrtorings merged in this summer. But, (drums) we are happy to announce that:

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

What it can do!

NumPyPy Improvements

These are nice improvements in the NumPyPy library and we got to a competitive level only making use of SSE4.1.

Future work

This is not the end of the road. The GSoC project showed that it is possible to implement this optimization in PyPy. There might be other improvements we can make to carry this further:

Check alignment at runtime to increase the memory throughput of the CPU
Support the AVX vector extension which (at least) doubles the size of the vector register
Handle each and every corner case in Python traces to enable it globally
Do not rely only on loading operations to trigger the analysis, there might be cases where combination of floating point values could be done in parallel

Cheers,
The PyPy Team

(*) The benchmark code can be found here it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

Posted by Unknown at 15:38 5 Comments

Friday, October 16, 2015

PowerPC backend for the JIT

Hi all,

PyPy's JIT now supports the 64-bit PowerPC architecture! This is the third architecture supported, in addition to x86 (32 and 64) and ARM (32-bit only). More precisely, we support Linux running the big- and the little-endian variants of ppc64. Thanks to IBM for funding this work!

The new JIT backend has been merged into "default". You should be able to translate PPC versions as usual directly on the machines. For the foreseeable future, I will compile and distribute binary versions corresponding to the official releases (for Fedora), but of course I'd welcome it if someone else could step in and do it. Also, it is unclear yet if we will run a buildbot.

To check that the result performs well, I logged in a ppc64le machine and ran the usual benchmark suite of PyPy (minus sqlitesynth: sqlite was not installed on that machine). I ran it twice at a difference of 12 hours, as an attempt to reduce risks caused by other users suddenly using the machine. The machine was overall relatively quiet. Of course, this is scientifically not good enough; it is what I could come up with given the limited resources.

Here are the results, where the numbers are speed-up factors between the non-jit and the jit version of PyPy. The first column is x86-64, for reference. The second and third columns are the two ppc64le runs. All are Linux. A few benchmarks are not reported here because the runner doesn't execute them on non-jit (however, apart from sqlitesynth, they all worked).

    ai                        13.7342        16.1659     14.9091
    bm_chameleon               8.5944         8.5858        8.66
    bm_dulwich_log             5.1256         5.4368      5.5928
    bm_krakatau                5.5201         2.3915      2.3452
    bm_mako                    8.4802         6.8937      6.9335
    bm_mdp                     2.0315         1.7162      1.9131
    chaos                     56.9705        57.2608     56.2374
    sphinx
    crypto_pyaes               62.505         80.149     79.7801
    deltablue                  3.3403         5.1199      4.7872
    django                    28.9829         23.206       23.47
    eparse                     2.3164         2.6281       2.589
    fannkuch                   9.1242        15.1768     11.3906
    float                     13.8145        17.2582     17.2451
    genshi_text               16.4608        13.9398     13.7998
    genshi_xml                 8.2782         8.0879      9.2315
    go                         6.7458        11.8226     15.4183
    hexiom2                   24.3612        34.7991     33.4734
    html5lib                   5.4515         5.5186       5.365
    json_bench                28.8774        29.5022     28.8897
    meteor-contest             5.1518         5.6567      5.7514
    nbody_modified            20.6138        22.5466     21.3992
    pidigits                   1.0118          1.022      1.0829
    pyflate-fast               9.0684        10.0168     10.3119
    pypy_interp                3.3977         3.9307      3.8798
    raytrace-simple           69.0114       108.8875    127.1518
    richards                  94.1863       118.1257    102.1906
    rietveld                   3.2421         3.0126      3.1592
    scimark_fft
    scimark_lu
    scimark_montecarlo
    scimark_sor
    scimark_sparsematmul
    slowspitfire               2.8539         3.3924      3.5541
    spambayes                  5.0646         6.3446       6.237
    spectral-norm             41.9148        42.1831     43.2913
    spitfire                   3.8788         4.8214       4.701
    spitfire_cstringio          7.606         9.1809      9.1691
    sqlitesynth
    sympy_expand               2.9537         2.0705      1.9299
    sympy_integrate            4.3805         4.3467      4.7052
    sympy_str                  1.5431         1.6248      1.5825
    sympy_sum                  6.2519          6.096      5.6643
    telco                     61.2416        54.7187     55.1705
    trans2_annotate
    trans2_rtype
    trans2_backendopt
    trans2_database
    trans2_source
    twisted_iteration         55.5019        51.5127     63.0592
    twisted_names              8.2262         9.0062      10.306
    twisted_pb                12.1134         13.644     12.1177
    twisted_tcp                4.9778          1.934      5.4931

    GEOMETRIC MEAN               9.31           9.70       10.01

The last line reports the geometric mean of each column. We see that the goal was reached: PyPy's JIT actually improves performance by a factor of around 9.7 to 10 times on ppc64le. By comparison, it "only" improves performance by a factor 9.3 on Intel x86-64. I don't know why, but I'd guess it mostly means that a non-jitted PyPy performs slightly better on Intel than it does on PowerPC.

Why is that? Actually, if we do the same comparison with an ARM column too, we also get higher numbers there than on Intel. When we discovered that a few years ago, we guessed that on ARM running the whole interpreter in PyPy takes up a lot of resources, e.g. of instruction cache, which the JIT's assembler doesn't need any more after the process is warmed up. And caches are much bigger on Intel. However, PowerPC is much closer to Intel, so this argument doesn't work for PowerPC. But there are other more subtle variants of it. Notably, Intel is doing crazy things about branch prediction, which likely helps a big interpreter---both the non-JITted PyPy and CPython, and both for the interpreter's main loop itself and for the numerous indirect branches that depend on the types of the objects. Maybe the PowerPC is as good as Intel, and so this argument doesn't work either. Another one would be: on PowerPC I did notice that gcc itself is not perfect at optimization. During development of this backend, I often looked at assembler produced by gcc, and there are a number of small inefficiencies there. All these are factors that slow down the non-JITted version of PyPy, but don't influence the speed of the assembler produced just-in-time.

Anyway, this is just guessing. The fact remains that PyPy can now be used on PowerPC machines. Have fun!

A bientôt,

Armin.

Hi all,

    ai                        13.7342        16.1659     14.9091
    bm_chameleon               8.5944         8.5858        8.66
    bm_dulwich_log             5.1256         5.4368      5.5928
    bm_krakatau                5.5201         2.3915      2.3452
    bm_mako                    8.4802         6.8937      6.9335
    bm_mdp                     2.0315         1.7162      1.9131
    chaos                     56.9705        57.2608     56.2374
    sphinx
    crypto_pyaes               62.505         80.149     79.7801
    deltablue                  3.3403         5.1199      4.7872
    django                    28.9829         23.206       23.47
    eparse                     2.3164         2.6281       2.589
    fannkuch                   9.1242        15.1768     11.3906
    float                     13.8145        17.2582     17.2451
    genshi_text               16.4608        13.9398     13.7998
    genshi_xml                 8.2782         8.0879      9.2315
    go                         6.7458        11.8226     15.4183
    hexiom2                   24.3612        34.7991     33.4734
    html5lib                   5.4515         5.5186       5.365
    json_bench                28.8774        29.5022     28.8897
    meteor-contest             5.1518         5.6567      5.7514
    nbody_modified            20.6138        22.5466     21.3992
    pidigits                   1.0118          1.022      1.0829
    pyflate-fast               9.0684        10.0168     10.3119
    pypy_interp                3.3977         3.9307      3.8798
    raytrace-simple           69.0114       108.8875    127.1518
    richards                  94.1863       118.1257    102.1906
    rietveld                   3.2421         3.0126      3.1592
    scimark_fft
    scimark_lu
    scimark_montecarlo
    scimark_sor
    scimark_sparsematmul
    slowspitfire               2.8539         3.3924      3.5541
    spambayes                  5.0646         6.3446       6.237
    spectral-norm             41.9148        42.1831     43.2913
    spitfire                   3.8788         4.8214       4.701
    spitfire_cstringio          7.606         9.1809      9.1691
    sqlitesynth
    sympy_expand               2.9537         2.0705      1.9299
    sympy_integrate            4.3805         4.3467      4.7052
    sympy_str                  1.5431         1.6248      1.5825
    sympy_sum                  6.2519          6.096      5.6643
    telco                     61.2416        54.7187     55.1705
    trans2_annotate
    trans2_rtype
    trans2_backendopt
    trans2_database
    trans2_source
    twisted_iteration         55.5019        51.5127     63.0592
    twisted_names              8.2262         9.0062      10.306
    twisted_pb                12.1134         13.644     12.1177
    twisted_tcp                4.9778          1.934      5.4931

    GEOMETRIC MEAN               9.31           9.70       10.01

Anyway, this is just guessing. The fact remains that PyPy can now be used on PowerPC machines. Have fun!

A bientôt,

Armin.

Posted by Armin Rigo at 18:08 0 Comments

Monday, October 5, 2015

PyPy memory and warmup improvements (2) - Sharing of Guards

Hello everyone!

This is the second part of the series of improvements in warmup time and memory consumption in the PyPy JIT. This post covers recent work on sharing guard resume data that was recently merged to trunk. It will be a part of the next official PyPy release. To understand what it does, let's start with a loop for a simple example:

class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def call_method(self, z):
        return self.x + self.y + z

def f():
    s = 0
    for i in range(100000):
        a = A(i, 1 + i)
        s += a.call_method(i)

At the entrance of the loop, we have the following set of operations:

guard(i5 == 4)
guard(p3 is null)
p27 = p2.co_cellvars
p28 = p2.co_freevars
guard_class(p17, 4316866008, descr=<Guard0x104295e08>)
p30 = p17.w_seq
guard_nonnull(p30, descr=<Guard0x104295db0>)
i31 = p17.index
p32 = p30.strategy
guard_class(p32, 4317041344, descr=<Guard0x104295d58>)
p34 = p30.lstorage
i35 = p34..item0

The above operations gets executed at the entrance, so each time we call f(). They ensure all the optimizations done below stay valid. Now, as long as nothing out of the ordinary happens, they only ensure that the world around us never changed. However, if e.g. someone puts new methods on class A, any of the above guards might fail. Despite the fact that it's a very unlikely case, PyPy needs to track how to recover from such a situation. Each of those points needs to keep the full state of the optimizations performed, so we can safely deoptimize them and reenter the interpreter. This is vastly wasteful since most of those guards never fail, hence some sharing between guards has been performed.

We went a step further - when two guards are next to each other or the operations in between them don't have side effects, we can safely redo the operations or to simply put, resume in the previous guard. That means every now and again we execute a few operations extra, but not storing extra info saves quite a bit of time and memory. This is similar to the approach that LuaJIT takes, which is called sparse snapshots.

I've done some measurements on annotating & rtyping translation of pypy, which is a pretty memory hungry program that compiles a fair bit. I measured, respectively:

total time the translation step took (annotating or rtyping)
time it took for tracing (that excludes backend time for the total JIT time) at the end of rtyping.
memory the GC feels responsible for after the step. The real amount of memory consumed will always be larger and the coefficient of savings is in 1.5-2x mark

Here is the table:

branch	time annotation	time rtyping	memory annotation	memory rtyping	tracing time
default	317s	454s	707M	1349M	60s
sharing	302s	430s	595M	1070M	51s
win	4.8%	5.5%	19%	26%	17%

Obviously pypy translation is an extreme example - the vast majority of the code out there does not have that many lines of code to be jitted. However, it's at the very least a good win for us :-)

We will continue to improve the warmup performance and keep you posted!

Cheers,
fijal

Hello everyone!

class A(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def call_method(self, z):
        return self.x + self.y + z

def f():
    s = 0
    for i in range(100000):
        a = A(i, 1 + i)
        s += a.call_method(i)

At the entrance of the loop, we have the following set of operations:

guard(i5 == 4)
guard(p3 is null)
p27 = p2.co_cellvars
p28 = p2.co_freevars
guard_class(p17, 4316866008, descr=<Guard0x104295e08>)
p30 = p17.w_seq
guard_nonnull(p30, descr=<Guard0x104295db0>)
i31 = p17.index
p32 = p30.strategy
guard_class(p32, 4317041344, descr=<Guard0x104295d58>)
p34 = p30.lstorage
i35 = p34..item0

I've done some measurements on annotating & rtyping translation of pypy, which is a pretty memory hungry program that compiles a fair bit. I measured, respectively:

total time the translation step took (annotating or rtyping)
time it took for tracing (that excludes backend time for the total JIT time) at the end of rtyping.
memory the GC feels responsible for after the step. The real amount of memory consumed will always be larger and the coefficient of savings is in 1.5-2x mark

Here is the table:

branch	time annotation	time rtyping	memory annotation	memory rtyping	tracing time
default	317s	454s	707M	1349M	60s
sharing	302s	430s	595M	1070M	51s
win	4.8%	5.5%	19%	26%	17%

Obviously pypy translation is an extreme example - the vast majority of the code out there does not have that many lines of code to be jitted. However, it's at the very least a good win for us :-)

We will continue to improve the warmup performance and keep you posted!

Cheers,
fijal

Posted by Maciej Fijalkowski at 11:31 2 Comments

Wednesday, September 9, 2015

PyPy warmup improvements

Hello everyone!

I'm very pleased to announce that we've just managed to merge the optresult branch. Under this cryptic name is the biggest JIT refactoring we've done in a couple years, mostly focused on the warmup time and memory impact of PyPy.

To understand why we did that, let's look back in time - back when we got the first working JIT prototype in 2009 we were focused exclusively on achieving peak performance with some consideration towards memory usage, but without serious consideration towards warmup time. This means we accumulated quite a bit of technical debt over time that we're trying, with difficulty, to address right now. This branch mostly does not affect the peak performance - it should however help you with short-living scripts, like test runs.

We identified warmup time to be one of the major pain points for pypy users, along with memory impact and compatibility issues with CPython C extension world. While we can't address all the issues at once, we're trying to address the first two in the work contributing to this blog post. I will write a separate article on the last item separately.

To see how much of a problem warmup is for your program, you can run your program with PYPYLOG=jit-summary:- environment variable set. This should show you something like this:

(pypy-optresult)fijal@hermann:~/src/botbot-web$ PYPYLOG=jit-summary:- python orm.py 1500
[d195a2fcecc] {jit-summary
Tracing:            781     2.924965
Backend:            737     0.722710
TOTAL:                      35.912011
ops:                1860596
recorded ops:       493138
  calls:            81022
guards:             131238
opt ops:            137263
opt guards:         35166
forcings:           4196
abort: trace too long:      22
abort: compiling:   0
abort: vable escape:        22
abort: bad loop:    0
abort: force quasi-immut:   0
nvirtuals:          183672
nvholes:            25797
nvreused:           116131
Total # of loops:   193
Total # of bridges: 575
Freed # of loops:   6
Freed # of bridges: 75
[d195a48de18] jit-summary}

This means that the total (wall clock) time was 35.9s, out of which we spent 2.9s tracing 781 loops and 0.72s compiling them. The remaining couple were aborted (trace too long is normal, vable escape means someone called sys._getframe() or equivalent). You can do the following things:

compare the numbers with pypy --jit off and see at which number of iterations pypy jit kicks in
play with the thresholds: pypy --jit threshold=500,function_threshold=400,trace_eagerness=50 was much better in this example. What this does is to lower the threshold for tracing loops from default of 1039 to 400, threshold for tracing functions from the start from 1619 to 500 and threshold for tracing bridges from 200 to 50. Bridges are "alternative paths" that JIT did not take that are being additionally traced. We believe in sane defaults, so we'll try to improve upon those numbers, but generally speaking there is no one-size fits all here.
if the tracing/backend time stays high, come and complain to us with benchmarks, we'll try to look at them

Warmup, as a number, is notoriously hard to measure. It's a combination of:

pypy running interpreter before jitting
pypy needing time to JIT the traces
additional memory allocations needed during tracing to accomodate bookkeeping data
exiting and entering assembler until there is enough coverage of assembler

We're working hard on making a better assesment at this number, stay tuned :-)

Speedups

Overall we measured about 50% speed improvement in the optimizer, which reduces the overall warmup time between 10% and 30%. The very obvious warmup benchmark got a speedup from 4.5s to 3.5s, almost 30% improvement. Obviously the speedups on benchmarks would vastly depend on how much warmup time is there in those benchmarks. We observed annotation of pypy to decreasing by about 30% and the overall translation time by about 7%, so your mileage may vary.

Of course, as usual with the large refactoring of a crucial piece of PyPy, there are expected to be bugs. We are going to wait for the default branch to stabilize so you should see warmup improvements in the next release. If you're not afraid to try, nightlies will already have them.

We're hoping to continue improving upon warmup time and memory impact in the future, stay tuned for improvements.

Technical details

The branch does "one" thing - it changes the underlying model of how operations are represented during tracing and optimizations. Let's consider a simple loop like:

[i0, i1]
i2 = int_add(i0, i1)
i3 = int_add(i2, 1)
i4 = int_is_true(i3)
guard_true(i4)
jump(i3, i2)

The original representation would allocate a Box for each of i0 - i4 and then store those boxes in instances of ResOperation. The list of such operations would then go to the optimizer. Those lists are big - we usually remove 90% of them during optimizations, but they can be a couple thousand elements. Overall, allocating those big lists takes a toll on warmup time, especially due to the GC pressure. The branch removes the existance of Box completely, instead using a link to ResOperation itself. So say in the above example, i2 would refer to its producer - i2 = int_add(i0, i1) with arguments getting special treatment.

That alone reduces the GC pressure slightly, but a reduced number of instances also lets us store references on them directly instead of going through expensive dictionaries, which were used to store optimizing information about the boxes.

Cheers!
fijal & arigo

Hello everyone!

To see how much of a problem warmup is for your program, you can run your program with PYPYLOG=jit-summary:- environment variable set. This should show you something like this:

(pypy-optresult)fijal@hermann:~/src/botbot-web$ PYPYLOG=jit-summary:- python orm.py 1500
[d195a2fcecc] {jit-summary
Tracing:            781     2.924965
Backend:            737     0.722710
TOTAL:                      35.912011
ops:                1860596
recorded ops:       493138
  calls:            81022
guards:             131238
opt ops:            137263
opt guards:         35166
forcings:           4196
abort: trace too long:      22
abort: compiling:   0
abort: vable escape:        22
abort: bad loop:    0
abort: force quasi-immut:   0
nvirtuals:          183672
nvholes:            25797
nvreused:           116131
Total # of loops:   193
Total # of bridges: 575
Freed # of loops:   6
Freed # of bridges: 75
[d195a48de18] jit-summary}

compare the numbers with pypy --jit off and see at which number of iterations pypy jit kicks in
play with the thresholds: pypy --jit threshold=500,function_threshold=400,trace_eagerness=50 was much better in this example. What this does is to lower the threshold for tracing loops from default of 1039 to 400, threshold for tracing functions from the start from 1619 to 500 and threshold for tracing bridges from 200 to 50. Bridges are "alternative paths" that JIT did not take that are being additionally traced. We believe in sane defaults, so we'll try to improve upon those numbers, but generally speaking there is no one-size fits all here.
if the tracing/backend time stays high, come and complain to us with benchmarks, we'll try to look at them

Warmup, as a number, is notoriously hard to measure. It's a combination of:

pypy running interpreter before jitting
pypy needing time to JIT the traces
additional memory allocations needed during tracing to accomodate bookkeeping data
exiting and entering assembler until there is enough coverage of assembler

We're working hard on making a better assesment at this number, stay tuned :-)

Speedups

We're hoping to continue improving upon warmup time and memory impact in the future, stay tuned for improvements.

Technical details

The branch does "one" thing - it changes the underlying model of how operations are represented during tracing and optimizations. Let's consider a simple loop like:

[i0, i1]
i2 = int_add(i0, i1)
i3 = int_add(i2, 1)
i4 = int_is_true(i3)
guard_true(i4)
jump(i3, i2)

Cheers!
fijal & arigo

Posted by Maciej Fijalkowski at 16:52 0 Comments

Monday, August 31, 2015

PyPy 2.6.1 released

PyPy 2.6.1

We’re pleased to announce PyPy 2.6.1, an update to PyPy 2.6.0 released June 1. We have fixed many issues, updated stdlib to 2.7.10, cffi to version 1.3, extended support for the new vmprof statistical profiler for multiple threads, and increased functionality of numpy.
You can download the PyPy 2.6.1 release here:

We would like to thank our donors for the continued support of the PyPy project, and our volunteers and contributors.

We would also like to encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

We also welcome developers of other dynamic languages to see what RPython can do for them.

Highlights

Bug Fixes
- Revive non-SSE2 support
- Fixes for detaching _io.Buffer*
- On Windows, close (and flush) all open sockets on exiting
- Drop support for ancient macOS v10.4 and before
- Clear up contention in the garbage collector between trace-me-later and pinning
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
New features:
- cffi was updated to version 1.3
- The python stdlib was updated to 2.7.10 from 2.7.9
- vmprof now supports multiple threads and OS X
- The translation process builds cffi import libraries for some stdlib packages, which should prevent confusion when package.py is not used
- better support for gdb debugging
- freebsd should be able to translate PyPy “out of the box” with no patches
Numpy:
- Better support for record dtypes, including the align keyword
- Implement casting and create output arrays accordingly (still missing some corner cases)
- Support creation of unicode ndarrays
- Better support ndarray.flags
- Support axis argument in more functions
- Refactor array indexing to support ellipses
- Allow the docstrings of built-in numpy objects to be set at run-time
- Support the buffered nditer creation keyword
Performance improvements:
- Delay recursive calls to make them non-recursive
- Skip loop unrolling if it compiles too much code
- Tweak the heapcache
- Add a list strategy for lists that store both floats and 32-bit integers. The latter are encoded as nonstandard NaNs. Benchmarks show that the speed of such lists is now very close to the speed of purely-int or purely-float lists.
- Simplify implementation of ffi.gc() to avoid most weakrefs
- Massively improve the performance of map() with more than one sequence argument

Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

PyPy 2.6.1

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

We also welcome developers of other dynamic languages to see what RPython can do for them.

Highlights

Bug Fixes
- Revive non-SSE2 support
- Fixes for detaching _io.Buffer*
- On Windows, close (and flush) all open sockets on exiting
- Drop support for ancient macOS v10.4 and before
- Clear up contention in the garbage collector between trace-me-later and pinning
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
New features:
- cffi was updated to version 1.3
- The python stdlib was updated to 2.7.10 from 2.7.9
- vmprof now supports multiple threads and OS X
- The translation process builds cffi import libraries for some stdlib packages, which should prevent confusion when package.py is not used
- better support for gdb debugging
- freebsd should be able to translate PyPy “out of the box” with no patches
Numpy:
- Better support for record dtypes, including the align keyword
- Implement casting and create output arrays accordingly (still missing some corner cases)
- Support creation of unicode ndarrays
- Better support ndarray.flags
- Support axis argument in more functions
- Refactor array indexing to support ellipses
- Allow the docstrings of built-in numpy objects to be set at run-time
- Support the buffered nditer creation keyword
Performance improvements:
- Delay recursive calls to make them non-recursive
- Skip loop unrolling if it compiles too much code
- Tweak the heapcache
- Add a list strategy for lists that store both floats and 32-bit integers. The latter are encoded as nonstandard NaNs. Benchmarks show that the speed of such lists is now very close to the speed of purely-int or purely-float lists.
- Simplify implementation of ffi.gc() to avoid most weakrefs
- Massively improve the performance of map() with more than one sequence argument

Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Posted by mattip at 17:52 3 Comments

Wednesday, June 17, 2015

PyPy and ijson - a guest blog post

This gem was posted in the ijson issue tracker after some discussion on #pypy, and Dav1dde kindly allowed us to repost it here:

"So, I was playing around with parsing huge JSON files (19GiB, testfile is ~520MiB) and wanted to try a sample code with PyPy, turns out, PyPy needed ~1:30-2:00 whereas CPython 2.7 needed ~13 seconds (the pure python implementation on both pythons was equivalent at ~8 minutes).

"Apparantly ctypes is really bad performance-wise, especially on PyPy. So I made a quick CFFI mockup: https://gist.github.com/Dav1dde/c509d472085f9374fc1d

Before:

CPython 2.7:
    python -m emfas.server size dumps/echoprint-dump-1.json
    11.89s user 0.36s system 98% cpu 12.390 total
PYPY:
    python -m emfas.server size dumps/echoprint-dump-1.json
    117.19s user 2.36s system 99% cpu 1:59.95 total

After (CFFI):

CPython 2.7:
     python jsonsize.py ../dumps/echoprint-dump-1.json
     8.63s user 0.28s system 99% cpu 8.945 total
PyPy:
     python jsonsize.py ../dumps/echoprint-dump-1.json
     4.04s user 0.34s system 99% cpu 4.392 total
"

Dav1dd goes into more detail in the issue itself, but we just want to emphasize a few significant points from this brief interchange:

His CFFI implementation is faster than the ctypes one even on CPython 2.7.
PyPy + CFFI is faster than CPython even when using C code to do the heavy parsing.

The PyPy Team

Posted by mattip at 20:45 4 Comments

Monday, June 1, 2015

PyPy 2.6.0 release

PyPy 2.6.0 - Cameo Charm

We’re pleased to announce PyPy 2.6.0, only two months after PyPy 2.5.1. We are particulary happy to update cffi to version 1.1, which makes the popular ctypes-alternative even easier to use, and to support the new vmprof statistical profiler.

You can download the PyPy 2.6.0 release here:

We would like to thank our donors for the continued support of the PyPy project, and for those who donate to our three sub-projects, as well as our volunteers and contributors.

Thanks also to Yury V. Zaytsev and David Wilson who recently started running nightly builds on Windows and MacOSX buildbots.

We’ve shown quite a bit of progress, but we’re slowly running out of funds. Please consider donating more, or even better convince your employer to donate, so we can finish those projects! The three sub-projects are:

Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version
STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python
NumPy which requires installation of our fork of upstream numpy, available on bitbucket

We would also like to encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better. Nine new people contributed since the last release, you too could be one of them.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

While we support 32 bit python on Windows, work on the native Windows 64 bit python is still stalling, we would welcome a volunteer to handle that. We also welcome developers with other operating systems or dynamic languages to see what RPython can do for them.

Highlights

Python compatibility:
- Improve support for TLS 1.1 and 1.2
- Windows downloads now package a pypyw.exe in addition to pypy.exe
- Support for the PYTHONOPTIMIZE environment variable (impacting builtin’s __debug__ property)
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
New features:
- Add preliminary support for a new lightweight statistical profiler vmprof, which has been designed to accomodate profiling JITted code
Numpy:
- Support for object dtype via a garbage collector hook
- Support for .can_cast and .min_scalar_type as well as beginning a refactoring of the internal casting rules
- Better support for subtypes, via the __array_interface__, __array_priority__, and __array_wrap__ methods (still a work-in-progress)
- Better support for ndarray.flags
Performance improvements:
- Slight improvement in frame sizes, improving some benchmarks
- Internal refactoring and cleanups leading to improved JIT performance
- Improved IO performance of zlib and bz2 modules
- We continue to improve the JIT’s optimizations. Our benchmark suite is now over 7 times faster than cpython

Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

PyPy 2.6.0 - Cameo Charm

You can download the PyPy 2.6.0 release here:

We would like to thank our donors for the continued support of the PyPy project, and for those who donate to our three sub-projects, as well as our volunteers and contributors.

Thanks also to Yury V. Zaytsev and David Wilson who recently started running nightly builds on Windows and MacOSX buildbots.

Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version
STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python
NumPy which requires installation of our fork of upstream numpy, available on bitbucket

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

Highlights

Python compatibility:
- Improve support for TLS 1.1 and 1.2
- Windows downloads now package a pypyw.exe in addition to pypy.exe
- Support for the PYTHONOPTIMIZE environment variable (impacting builtin’s __debug__ property)
- Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
New features:
- Add preliminary support for a new lightweight statistical profiler vmprof, which has been designed to accomodate profiling JITted code
Numpy:
- Support for object dtype via a garbage collector hook
- Support for .can_cast and .min_scalar_type as well as beginning a refactoring of the internal casting rules
- Better support for subtypes, via the __array_interface__, __array_priority__, and __array_wrap__ methods (still a work-in-progress)
- Better support for ndarray.flags
Performance improvements:
- Slight improvement in frame sizes, improving some benchmarks
- Internal refactoring and cleanups leading to improved JIT performance
- Improved IO performance of zlib and bz2 modules
- We continue to improve the JIT’s optimizations. Our benchmark suite is now over 7 times faster than cpython

Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Posted by mattip at 17:20 4 Comments

Thursday, May 21, 2015

CFFI 1.0.1 released

CFFI 1.0.1 final has now been released for CPython! CFFI is a (CPython and PyPy) module to interact with C code from Python.

The main news from CFFI 0.9 is the new way to build extension modules: the "out-of-line" mode, where you have a separate build script. When this script is executed, it produces the extension module. This comes with associated Setuptools support that fixes the headache of distributing your own CFFI-using packages. It also massively cuts down the import times.

Although this is a major new version, it should be fully backward-compatible: existing projects should continue to work, in what is now called the "in-line mode".

The documentation has been reorganized and split into a few pages. For more information about this new "out-of-line" mode, as well as more general information about what CFFI is and how to use it, read the Goals and proceed to the Overview.

Unlike the 1.0 beta 1 version (<- click for a motivated introduction), the final version also supports an out-of-line mode for projects using ffi.dlopen(), instead of only ffi.verify().

PyPy support: PyPy needs integrated support for efficient JITting, so you cannot install a different version of CFFI on top of an existing PyPy. You need to wait for the upcoming PyPy 2.6 to use CFFI 1.0---or get a nightly build.

My thanks again to the PSF (Python Software Foundation) for their financial support!

UPDATE:

Bug with the first example "ABI out-of-line": variadic functions (like printf, ending in a "..." argument) crash. Fixed in CFFI 1.0.2.

Posted by Armin Rigo at 11:55 3 Comments

Tuesday, May 5, 2015

CFFI 1.0 beta 1

Finally! CFFI 1.0 is almost ready. CFFI gives Python developers a convenient way to call external C libraries. Here "Python" == "CPython or PyPy", but this post is mostly about the CPython side of CFFI, as the PyPy version is not ready yet.

On CPython, you can download the version "1.0.0b1" either by looking for the cffi-1.0 branch in the repository, or by saying

pip install "cffi>=1.0.dev0"

(Until 1.0 final is ready, pip install cffi will still give you version 0.9.2.)

The main news: you can now explicitly generate and compile a CPython C extension module from a "build" script. Then in the rest of your program or library, you no longer need to import cffi at all. Instead, you simply say:

from _my_custom_module import ffi, lib

Then you use ffi and lib just like you did in your verify()-based project in CFFI 0.9.2. (The lib is what used to be the result of verify().) The details of how you use them should not have changed at all, so that the rest of your program should not need any update.

Benefits

This is a big step towards standard practices for making and distributing Python packages with C extension modules:

on the one hand, you need an explicit compilation step, triggered here by running the "build" script;
on the other hand, what you gain in return is better control over when and why the C compilation occurs, and more standard ways to write distutils- or setuptools-based setup.py files (see below).

Additionally, this completely removes one of the main drawbacks of using CFFI to interface with large C APIs: the start-up time. In some cases it could be extreme on slow machines (cases of 10-20 seconds on ARM boards occur commonly). Now, the import above is instantaneous.

In fact, none of the pure Python cffi package is needed any more at runtime (it needs only an internal extension module from CFFI, which can be installed by doing "pip install cffi-runtime" [*] if you only need that). The ffi object you get by the import above is of a completely different class written entirely in C. The two implementations might get merged in the future; for now they are independent, but give two compatible APIs. The differences are that some methods like cdef() and verify() and set_source() are omitted from the C version, because it is supposed to be a complete FFI already; and other methods like new(), which take as parameter a string describing a C type, are faster now because that string is parsed using a custom small-subset-of-C parser, written in C too.

In practice

CFFI 1.0 beta 1 was tested on CPython 2.7 and 3.3/3.4, on Linux and to some extent on Windows and OS/X. Its PyPy version is not ready yet, and the only docs available so far are those below.

This is beta software, so there might be bugs and details may change. We are interested in hearing any feedback (irc.freenode.net #pypy) or bug reports.

To use the new features, create a source file that is not imported by the rest of your project, in which you place (or move) the code to build the FFI object:

# foo_build.py
import cffi
ffi = cffi.FFI()

ffi.cdef("""
    int printf(const char *format, ...);
""")

ffi.set_source("_foo", """
    #include <stdio.h>
""")   # and other arguments like libraries=[...]

if __name__ == '__main__':
    ffi.compile()

The ffi.set_source() replaces the ffi.verify() of CFFI 0.9.2. Calling it attaches the given source code to the ffi object, but this call doesn't compile or return anything by itself. It may be placed above the ffi.cdef() if you prefer. Its first argument is the name of the C extension module that will be produced.

Actual compilation (including generating the complete C sources) occurs later, in one of two places: either in ffi.compile(), shown above, or indirectly from the setup.py, shown next.

If you directly execute the file foo_build.py above, it will generate a local file _foo.c and compile it to _foo.so (or the appropriate extension, like _foo.pyd on Windows). This is the extension module that can be used in the rest of your program by saying "from _foo import ffi, lib".

Distutils

If you want to distribute your program, you write a setup.py using either distutils or setuptools. Using setuptools is generally recommended nowdays, but using distutils is possible too. We show it first:

# setup.py
from distutils.core import setup
import foo_build

setup(
    name="example",
    version="0.1",
    py_modules=["example"],
    ext_modules=[foo_build.ffi.distutils_extension()],
)

This is similar to the CFFI 0.9.2 way. It only works if cffi was installed previously, because otherwise foo_build cannot be imported. The difference is that you use ffi.distutils_extension() instead of ffi.verifier.get_extension(), because there is no longer any verifier object if you use set_source().

Setuptools

The modern way is to write setup.py files based on setuptools, which can (among lots of other things) handle dependencies. It is what you normally get with pip install, too. Here is how you'd write it:

# setup.py
from setuptools import setup

setup(
    name="example",
    version="0.1",
    py_modules=["example"],
    setup_requires=["cffi>=1.0.dev0"],
    cffi_modules=["foo_build:ffi"],
    install_requires=["cffi-runtime"],    # see [*] below
)

Note that "cffi" is mentioned on three lines here:

the first time is in setup_requires, which means that cffi will be locally downloaded and used for the setup.
the second mention is a custom cffi_modules argument. This argument is handled by cffi as soon as it is locally downloaded. It should be a list of "module:ffi" strings, where the ffi part is the name of the global variable in that module.
the third mention is in install_requires. It means that in order to install this example package, "cffi-runtime" must also be installed. This is (or will be) a PyPI entry that only contains a trimmed down version of CFFI, one that does not include the pure Python "cffi" package and its dependencies. None of it is needed at runtime.

[*] NOTE: The "cffi-runtime" PyPI entry is not ready yet. For now, use "cffi>=1.0.dev0" instead. Considering PyPy, which has got a built-in "_cffi_backend" module, the "cffi-runtime" package could never be upgraded there; but it would still be nice if we were able to upgrade the "cffi" pure Python package on PyPy. This might require some extra care in writing the interaction code. We need to sort it out now...

Thanks

Special thanks go to the PSF (Python Software Foundation) for their financial support, without which this work---er... it might likely have occurred anyway, but at an unknown future date :-)

(For reference, the amount I asked for (and got) is equal to one month of what a Google Summer of Code student gets, for work that will take a bit longer than one month. At least I personally am running mostly on such money, and so I want to thank the PSF again for their contribution to CFFI---and while I'm at it, thanks to all other contributors to PyPy---for making this job more than an unpaid hobby on the side :-)

Armin Rigo

On CPython, you can download the version "1.0.0b1" either by looking for the cffi-1.0 branch in the repository, or by saying

pip install "cffi>=1.0.dev0"

(Until 1.0 final is ready, pip install cffi will still give you version 0.9.2.)

from _my_custom_module import ffi, lib

Benefits

This is a big step towards standard practices for making and distributing Python packages with C extension modules:

on the one hand, you need an explicit compilation step, triggered here by running the "build" script;
on the other hand, what you gain in return is better control over when and why the C compilation occurs, and more standard ways to write distutils- or setuptools-based setup.py files (see below).

In practice

CFFI 1.0 beta 1 was tested on CPython 2.7 and 3.3/3.4, on Linux and to some extent on Windows and OS/X. Its PyPy version is not ready yet, and the only docs available so far are those below.

This is beta software, so there might be bugs and details may change. We are interested in hearing any feedback (irc.freenode.net #pypy) or bug reports.

To use the new features, create a source file that is not imported by the rest of your project, in which you place (or move) the code to build the FFI object:

# foo_build.py
import cffi
ffi = cffi.FFI()

ffi.cdef("""
    int printf(const char *format, ...);
""")

ffi.set_source("_foo", """
    #include <stdio.h>
""")   # and other arguments like libraries=[...]

if __name__ == '__main__':
    ffi.compile()

Actual compilation (including generating the complete C sources) occurs later, in one of two places: either in ffi.compile(), shown above, or indirectly from the setup.py, shown next.

Distutils

# setup.py
from distutils.core import setup
import foo_build

setup(
    name="example",
    version="0.1",
    py_modules=["example"],
    ext_modules=[foo_build.ffi.distutils_extension()],
)

Setuptools

# setup.py
from setuptools import setup

setup(
    name="example",
    version="0.1",
    py_modules=["example"],
    setup_requires=["cffi>=1.0.dev0"],
    cffi_modules=["foo_build:ffi"],
    install_requires=["cffi-runtime"],    # see [*] below
)

Note that "cffi" is mentioned on three lines here:

the first time is in setup_requires, which means that cffi will be locally downloaded and used for the setup.
the second mention is a custom cffi_modules argument. This argument is handled by cffi as soon as it is locally downloaded. It should be a list of "module:ffi" strings, where the ffi part is the name of the global variable in that module.
the third mention is in install_requires. It means that in order to install this example package, "cffi-runtime" must also be installed. This is (or will be) a PyPI entry that only contains a trimmed down version of CFFI, one that does not include the pure Python "cffi" package and its dependencies. None of it is needed at runtime.

Thanks

Special thanks go to the PSF (Python Software Foundation) for their financial support, without which this work---er... it might likely have occurred anyway, but at an unknown future date :-)

Armin Rigo

Posted by Armin Rigo at 19:20 3 Comments