Friday, November 28, 2014

September donations and thank you to the Python Software Foundation!

Hello everyone!

We would like to show you a short update on the PyPy funding. We gathered a total of $15,986 in the month of September and as per earlier agreement, the Python Software Foundation donated $10,000 to PyPy. We would like to thank everyone participating and the PSF in particular for supporting the PyPy project and making our work possible!

We've been working hard on the goals outlined in the funding proposals.

PyPy Python 3 support has been in beta for a while and it's already being used by many people, as seen per the number of reported bugs. We're currently supporting 3.2, planning on moving towards 3.4 in the future.
Software Transactional Memory has been a successful research project, with first real world results shown during the Warsaw sprint.
More detailed update on numpy will be published soon. A little spoiler is that we're planning on addressing matplotlib, scipy and the larger ecosystem to some extent. Stay tuned!

Again, thanks to everyone who donated and happy Thanksgiving to everyone on that side of the world!

Cheers,
fijal and the entire PyPy team

Posted by Maciej Fijalkowski at 13:49 4 Comments

Monday, November 17, 2014

Tornado without a GIL on PyPy STM

This post is by Konstantin Lopuhin, who tried PyPy STM during the Warsaw sprint.

Python has a GIL, right? Not quite - PyPy STM is a python implementation without a GIL, so it can scale CPU-bound work to several cores. PyPy STM is developed by Armin Rigo and Remi Meier, and supported by community donations. You can read more about it in the docs.

Although PyPy STM is still a work in progress, in many cases it can already run CPU-bound code faster than regular PyPy, when using multiple cores. Here we will see how to slightly modify Tornado IO loop to use transaction module. This module is described in the docs and is really simple to use - please see an example there. An event loop of Tornado, or any other asynchronous web server, looks like this (with some simplifications):

while True:
    for callback in list(self._callbacks):
        self._run_callback(callback)
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        self._handle_event(fd, handler, events)

We get IO events, and run handlers for all of them, these handlers can also register new callbacks, which we run too. When using such a framework, it is very nice to have a guaranty that all handlers are run serially, so you do not have to put any locks. This is an ideal case for the transaction module - it gives us guaranties that things appear to be run serially, so in user code we do not need any locks. We just need to change the code above to something like:

while True:
    for callback in list(self._callbacks):
        transaction.add(                # added
            self._run_callback, callback)
    transaction.run()                   # added
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        transaction.add(                # added
            self._handle_event, fd, handler, events)
    transaction.run()                   # added

The actual commit is here, - we had to extract a little function to run the callback.

Part 1: a simple benchmark: primes

Now we need a simple benchmark, lets start with this - just calculate a list of primes up to the given number, and return it as JSON:

def is_prime(n):
    for i in xrange(2, n):
        if n % i == 0:
            return False
    return True

class MainHandler(tornado.web.RequestHandler):
    def get(self, num):
        num = int(num)
        primes = [n for n in xrange(2, num + 1) if is_prime(n)]
        self.write({'primes': primes})

We can benchmark it with siege:

siege -c 50 -t 20s http://localhost:8888/10000

But this does not scale. The CPU load is at 101-104 %, and we handle 30 % less request per second. The reason for the slowdown is STM overhead, which needs to keep track of all writes and reads in order to detect conflicts. And the reason for using only one core is, obviously, conflicts! Fortunately, we can see what this conflicts are, if we run code like this (here 4 is the number of cores to use):

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py to analyse this log. It lists the most expensive conflicts:

14.793s lost in aborts, 0.000s paused (1258x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
...

There are only three kinds of conflicts, they are described in stm source, Here we see that two threads call into external function to get current time, and we can not rollback any of them, so one of them must wait till the other transaction finishes. For now we can hack around this by disabling this timing - this is only needed for internal profiling in tornado.

If we do it, we get the following results (but see caveats below):

Impl.	req/s
PyPy 2.4	14.4
CPython 2.7	3.2
PyPy-STM 1	9.3
PyPy-STM 2	16.4
PyPy-STM 3	20.4
PyPy STM 4	24.2

As we can see, in this benchmark PyPy STM using just two cores can beat regular PyPy! This is not linear scaling, there are still conflicts left, and this is a very simple example but still, it works!

But its not that simple yet :)

First, these are best-case numbers after long (much longer than for regular PyPy) warmup. Second, it can sometimes crash (although removing old pyc files fixes it). Third, benchmark meta-parameters are also tuned.

Here we get relatively good results only when there are a lot of concurrent clients - as a results, a lot of requests pile up, the server is not keeping with the load, and transaction module is busy with work running this piled up requests. If we decrease the number of concurrent clients, results get slightly worse. Another thing we can tune is how heavy is each request - again, if we ask primes up to a lower number, then less time is spent doing calculations, more time is spent in tornado, and results get much worse.

Besides the time.time() conflict described above, there are a lot of others. The bulk of time is lost in these two conflicts:

14.153s lost in aborts, 0.000s paused (270x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()

13.484s lost in aborts, 0.000s paused (130x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

The first one is presumably calling into some C function from stdlib, and we get the same conflict as for time.time() above, but is can be fixed on PyPy side, as we can be sure that computing sha1 is pure.

It is easy to hack around this one too, just removing etag support, but if we do it, performance is much worse, only slightly faster than regular PyPy, with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates in the transaction module, from this piece of code:

while True:
    self._do_it(self._grab_next_thing_to_do(tloc_pending),
                got_exception)
    counter[0] += 1

Comment by Armin: This is a conflict in the transaction module itself; ideally, it shouldn't have any, but in order to do that we might need a little bit of support from RPython or C code. So this is pending improvement.

Tornado modification used in this blog post is based on 3.2.dev2. As of now, the latest version is 4.0.2, and if we apply the same changes to this version, then we no longer get any scaling on this benchmark, and there are no conflicts that take any substantial time.

Comment by Armin: There are two possible reactions to a conflict. We can either abort one of the two threads, or (depending on the circumstances) just pause the current thread until the other one commits, after which the thread will likely be able to continue. The tool ``print_stm_log.py`` did not report conflicts that cause pauses. It has been fixed very recently. Chances are that on this test it would report long pauses and point to locations that cause them.

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses, it is interesting to see how it works on a more realistic application.

astar.py is a simple game where several players move on a map (represented as a list of lists of integers), build and destroy walls, and ask server to give them shortest paths between two points using A-star search, adopted from ActiveState recipie.

The benchmark bench_astar.py is simulating players, and tries to put the main load on A-star search, but also does some wall building and destruction. There are no locks around map modifications, as normal tornado is executing all callbacks serially, and we can keep this guaranty with atomic blocks of PyPy STM. This is also an example of a program that is not trivial to scale to multiple cores with separate processes (assuming more interesting shared state and logic).

This benchmark is very noisy due to randomness of client interactions (also it could be not linear), so just lower and upper bounds for number of requests are reported

Impl.	req/s
PyPy 2.4	5 .. 7
CPython 2.7	0.5 .. 0.9
PyPy-STM 1	2 .. 4
PyPy STM 4	2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse and STM overhead is sometimes higher. The bulk of conflicts come from the transaction module (we have seen it above):

91.655s lost in aborts, 0.000s paused (249x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

Although it is definitely not ready for production use, you can already try to run things, report bugs, and see what is missing in user-facing tools and libraries.

Benchmarks setup:

Amazon c3.xlarge (4 cores) running Ubuntu 14.04
pypy-c-r74011-stm-jit for the primes benchmark (but it has more bugs than more recent versions), and pypy-c-r74378-74379-stm-jit for astar benchmark (put it inside pypy source checkout at 38c9afbd253c)
http://bitbucket.org/kostialopuhin/tornado-stm-bench at 65144cda7a1f

This post is by Konstantin Lopuhin, who tried PyPy STM during the Warsaw sprint.

while True:
    for callback in list(self._callbacks):
        self._run_callback(callback)
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        self._handle_event(fd, handler, events)

while True:
    for callback in list(self._callbacks):
        transaction.add(                # added
            self._run_callback, callback)
    transaction.run()                   # added
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        transaction.add(                # added
            self._handle_event, fd, handler, events)
    transaction.run()                   # added

The actual commit is here, - we had to extract a little function to run the callback.

Part 1: a simple benchmark: primes

Now we need a simple benchmark, lets start with this - just calculate a list of primes up to the given number, and return it as JSON:

def is_prime(n):
    for i in xrange(2, n):
        if n % i == 0:
            return False
    return True

class MainHandler(tornado.web.RequestHandler):
    def get(self, num):
        num = int(num)
        primes = [n for n in xrange(2, num + 1) if is_prime(n)]
        self.write({'primes': primes})

We can benchmark it with siege:

siege -c 50 -t 20s http://localhost:8888/10000

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py to analyse this log. It lists the most expensive conflicts:

14.793s lost in aborts, 0.000s paused (1258x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
...

If we do it, we get the following results (but see caveats below):

Impl.	req/s
PyPy 2.4	14.4
CPython 2.7	3.2
PyPy-STM 1	9.3
PyPy-STM 2	16.4
PyPy-STM 3	20.4
PyPy STM 4	24.2

But its not that simple yet :)

Besides the time.time() conflict described above, there are a lot of others. The bulk of time is lost in these two conflicts:

14.153s lost in aborts, 0.000s paused (270x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()

13.484s lost in aborts, 0.000s paused (130x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

It is easy to hack around this one too, just removing etag support, but if we do it, performance is much worse, only slightly faster than regular PyPy, with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates in the transaction module, from this piece of code:

while True:
    self._do_it(self._grab_next_thing_to_do(tloc_pending),
                got_exception)
    counter[0] += 1

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses, it is interesting to see how it works on a more realistic application.

This benchmark is very noisy due to randomness of client interactions (also it could be not linear), so just lower and upper bounds for number of requests are reported

Impl.	req/s
PyPy 2.4	5 .. 7
CPython 2.7	0.5 .. 0.9
PyPy-STM 1	2 .. 4
PyPy STM 4	2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse and STM overhead is sometimes higher. The bulk of conflicts come from the transaction module (we have seen it above):

91.655s lost in aborts, 0.000s paused (249x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)

Although it is definitely not ready for production use, you can already try to run things, report bugs, and see what is missing in user-facing tools and libraries.

Benchmarks setup:

Amazon c3.xlarge (4 cores) running Ubuntu 14.04
pypy-c-r74011-stm-jit for the primes benchmark (but it has more bugs than more recent versions), and pypy-c-r74378-74379-stm-jit for astar benchmark (put it inside pypy source checkout at 38c9afbd253c)
http://bitbucket.org/kostialopuhin/tornado-stm-bench at 65144cda7a1f

Posted by Armin Rigo at 15:05 5 Comments

Wednesday, November 5, 2014

PyPy IO improvements

Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.

The first branch was started by Wenzhu Man for her Google Summer of Code and finished by Maciej Fijałkowski and Armin Rigo. The PyPy GC works by allocating new objects in the young object area (the nursery), simply by incrementing a pointer. After each minor collection, the nursery has to be cleaned up. For simplicity, the GC used to do it by zeroing the whole nursery.

This approach has bad effects on the cache, since you zero a large piece of memory at once and do unnecessary work for things that don't require zeroing like large strings. We mitigated the first problem somewhat with incremental nursery zeroing, but this branch removes the zeroing completely, thus improving the string handling and recursive code (since jitframes don't requires zeroed memory either). I measured the effect on two examples: a recursive implementation of fibonacci and gcbench, to measure GC performance.

The results for fibonacci and gcbench are below (normalized to cpython 2.7). Benchmarks were run 50 times each (note that the big standard deviation comes mostly from the warmup at the beginning, true figures are smaller):

benchmark	CPython	PyPy 2.4	PyPy non-zero
fibonacci	4.8+-0.15 (1.0x)	0.59+-0.07 (8.1x)	0.45+-0.07 (10.6x)
gcbench	22+-0.36 (1.0x)	1.34+-0.28 (16.4x)	1.02+-0.15 (21.6x)

The second branch was done by Gregor Wegberg for his master thesis and finished by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from time to time moves the objects in memory, meaning that their address can change. Therefore, if you want to pass pointers to some external C function (for example, write(2) or read(2)), you need to ensure that the objects they are pointing to will not be moved by the GC (e.g. when running a different thread). PyPy up to 2.4 solves the problem by copying the data into or from a non-movable buffer, which is obviously inefficient. The branch introduce the concept of "pinning", which allows us to inform the GC that it is not allowed to move a certain object for a short period of time. This introduces a bit of extra complexity in the garbage collector, but improves the I/O performance quite drastically, because we no longer need the extra copy to and from the non-movable buffers.

In this benchmark, which does I/O in a loop, we either write a number of bytes from a freshly allocated string into /dev/null or read a number of bytes from /dev/full. I'm showing the results for PyPy 2.4, PyPy with non-zero-nursery and PyPy with non-zero-nursery and object pinning. Those are wall times for cases using os.read/os.write and file.read/file.write, normalized against CPython 2.7.

Benchmarks were done using PyPy 2.4 and revisions 85646d1d07fb for non-zero-nursery and 3d8fe96dc4d9 for non-zero-nursery and pinning. The benchmarks were run once, since the standard deviation was small.

The Y axis is speed, normalized to CPython, the more the better

What we can see is that os.read and os.write both improved greatly and outperforms CPython now for each combination. file operations are a little more tricky, and while those branches improved the situation a bit, the improvement is not as drastic as in os versions. It really should not be the case and it showcases how our file buffering is inferior to CPython. We plan on removing our own buffering and using FILE* in C in the near future, so we should outperform CPython on those too (since our allocations are cheaper). If you look carefully in the benchmark, the write function is copied three times. This hack is intended to avoid JIT overspecializing the assembler code, which happens because the buffering code was written way before the JIT was done. In fact, our buffering is hilariously bad, but if stars align correctly it can be JIT-compiled to something that's not half bad. Try removing the hack and seeing how the performance of the last benchmark drops :-) Again, this hack should be absolutely unnecessary once we remove our own buffering, stay tuned for more.

Cheers,
fijal

Hello everyone!

We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.

benchmark	CPython	PyPy 2.4	PyPy non-zero
fibonacci	4.8+-0.15 (1.0x)	0.59+-0.07 (8.1x)	0.45+-0.07 (10.6x)
gcbench	22+-0.36 (1.0x)	1.34+-0.28 (16.4x)	1.02+-0.15 (21.6x)

The Y axis is speed, normalized to CPython, the more the better

Cheers,
fijal

Posted by Maciej Fijalkowski at 16:14 3 Comments

Tuesday, October 21, 2014

PyPy3 2.4.0 released

We're pleased to announce the availability of PyPy3 2.4.0!

This release contains several bugfixes and enhancements. Among the user-facing improvements specific to PyPy3:

Better Windows compatibility, e.g. the nt module functions _getfinalpathname & _getfileinformation are now supported (the former is required for the popular pathlib library for example)
Various fsencode PEP 383 related fixes to the posix module (readlink, uname, ttyname and ctermid) and improved locale handling
Switched the default binary name on POSIX distributions from 'pypy' to 'pypy3' (which symlinks to to 'pypy3.2')
Fixed a couple different crashes related to parsing Python 3 source code

And improvements shared with the recent PyPy 2.4.0 release:

internal refactoring in string and GIL handling which led to significant speedups
improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
Many issues were resolved since the 2.3.1 release in June

You can download PyPy3 2.4.0 here http://pypy.org/download.html.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

Posted by Philip Jenvey at 19:02 4 Comments

Tuesday, October 14, 2014

Couchbase contribution to PyPy

Hello everyone!

We always offer to put on the blog info about our sponsors who donate substantial amounts of money. So far most people decided to stay anonymous, so this is the first blog post describing our sponsor and his relationship to PyPy, hopefully not the last. We'll also publish a full blog post about the PSF-matched fundraiser soon. This is a guest post by Brent Woodruff from Couchbase.

Couchbase is a leading NoSQL document database that provides a flexible data model, high performance, scalability, and high availability. Couchbase is a commercially supported open source project. Visit us at http://www.couchbase.com and http://github.com/couchbase.

Couchbase Inc. donated $2000.00, and employees of Couchbase personally contributed a disclosed additional $230.00, towards Pypy progress during the September funding drive. These funds will see a match from the Python Software Foundation.

Pypy is primarily used by Couchbase employees to perform product analysis and troubleshooting using internally developed tools. Every customer of Couchbase benefits from the use of Pypy; both due to the rapid development provided by Python, and the speed of the resulting tools provided by the Pypy JIT interpreter.

“PyPy is great - it gave us a 4x speedup in our CPU-intensive internal application over CPython” -Dave Rigby and Daniel Owen, Couchbase Engineers

Additionally, Couchbase has a preliminary CFFI based Couchbase client available for Pypy users.

Posted by Maciej Fijalkowski at 18:40 3 Comments

Monday, September 22, 2014

PyPy 2.4.0 released, 9 days left in funding drive

We're pleased to announce the availability of PyPy 2.4.0; faster, fewer bugs, and updated to the python 2.7.8 stdlib.

This release contains several bugfixes and enhancements. Among the user-facing improvements:

internal refactoring in string and GIL handling which led to significant speedups
improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
Many issues were resolved since the 2.3.1 release in June

You can download PyPy 2.4.0 here http://pypy.org/download.html.

We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

Posted by mattip at 20:12 1 Comments

Monday, September 8, 2014

PyPy 2.4-beta just in time for PSF's funding drive

We're pleased to announce the availability of PyPy 2.4-beta1; faster, fewer bugs, and updated to the python 2.7.8 stdlib.

This release contains several bugfixes and enhancements. Among the user-facing improvements:

internal refactoring in string and GIL handling which led to significant speedups
improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
Many issues were resolved since the 2.3.1 release in June

You can download the PyPy 2.4-beta1 release here http://pypy.org/download.html.

We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

News Flash from the beta release cycle:

Note that the beta release mistakenly identifies itself in sys.pypy_version_info as releaselevel=='final', please do not mistake this for a final version
The beta can hit a "Illegal instruction" exception in jitted code on ARMv6 processors like the RaspberryPi. This will be fixed for the release.

Posted by mattip at 21:55 5 Comments

Monday, September 1, 2014

Python Software Foundation Matching Donations this Month

We're extremely excited to announce that for the month of September, any amount
you donate to PyPy will be match (up to $10,000) by the Python Software
Foundation.

This includes any of our ongoing fundraisers: NumPyPy, STM, Python3, or our
general fundraising.

Here are some of the things your previous donations have helped accomplish:

Getting PyPy3 completed (currently 3.2, with 3.3 work underway)
New research and production engineering on STM for PyPy
Lots of progress on NumPy for PyPy
Significant performance improvements

You can see a preview of what's coming in our next 2.4 release in the draft
release notes.

Thank you to all the individuals and companies which have donated so far.

So please, donate today: http://pypy.org/

(Please be aware that the donation progress bars are not live updating, so
don't be afraid if your donation doesn't show up immediately).

Posted by Alex at 18:49 7 Comments

Saturday, August 9, 2014

A Field Test of Software Transactional Memory Using the RSqueak Smalltalk VM

Extending the Smalltalk RSqueakVM with STM

by Conrad Calmez, Hubert Hesse, Patrick Rein and Malte Swart supervised by Tim Felgentreff and Tobias Pape

Introduction

After pypy-stm we can announce that through the RSqueakVM (which used to be called SPyVM) a second VM implementation supports software transactional memory. RSqueakVM is a Smalltalk implementation based on the RPython toolchain. We have added STM support based on the STM tools from RPython (rstm). The benchmarks indicate that linear scale up is possible, however in some situations the STM overhead limits speedup.

The work was done as a master's project at the Software Architechture Group of Professor Robert Hirschfeld at at the Hasso Plattner Institut at the University of Potsdam. We - four students - worked about one and a half days per week for four months on the topic. The RSqueakVM was originally developped during a sprint at the University of Bern. When we started the project we were new to the topic of building VMs / interpreters.

We would like to thank Armin, Remi and the #pypy IRC channel who supported us over the course of our project. We also like to thank Toni Mattis and Eric Seckler, who have provided us with an initial code base.

Introduction to RSqueakVM

As the original Smalltalk implementation, the RSqueakVM executes a given Squeak Smalltalk image, containing the Smalltalk code and a snapshot of formerly created objects and active execution contexts. These execution contexts are scheduled inside the image (greenlets) and not mapped to OS threads. Thereby the non-STM RSqueakVM runs on only one OS thread.

Changes to RSqueakVM

The core adjustments to support STM were inside the VM and transparent from the view of a Smalltalk user. Additionally we added Smalltalk code to influence the behavior of the STM. As the RSqueakVM has run in one OS thread so far, we added the capability to start OS threads. Essentially, we added an additional way to launch a new Smalltalk execution context (thread). But in contrast to the original one this one creates a new native OS thread, not a Smalltalk internal green thread.

STM (with automatic transaction boundaries) already solves the problem of concurrent access on one value as this is protected by the STM transactions (to be more precise one instruction). But there are cases were the application relies on the fact that a bigger group of changes is executed either completely or not at all (atomic). Without further information transaction borders could be in the middle of such a set of atomic statements. rstm allows to aggregate multiple statements into one higher level transaction. To let the application mark the beginning and the end of these atomic blocks (high-level transactions), we added two more STM specific extensions to Smalltalk.

Benchmarks

RSqueak was executed in a single OS thread so far. rstm enables us to execute the VM using several OS threads. Using OS threads we expected a speed-up in benchmarks which use multiple threads. We measured this speed-up by using two benchmarks: a simple parallel summation where each thread sums up a predefined interval and an implementation of Mandelbrot where each thread computes a range of predefined lines.

To assess the speed-up, we used one RSqueakVM compiled with rstm enabled, but once running the benchmarks with OS threads and once with Smalltalk green threads. The workload always remained the same and only the number of threads increased. To assess the overhead imposed by the STM transformation we also ran the green threads version on an unmodified RSqueakVM. All VMs were translated with the JIT optimization and all benchmarks were run once before the measurement to warm up the JIT. As the JIT optimization is working it is likely to be adoped by VM creators (the baseline RSqueakVM did that) so that results with this optimization are more relevant in practice than those without it. We measured the execution time by getting the system time in Squeak. The results are:

Parallel Sum Ten Million

Benchmark Parallel Sum 10,000,000

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1	168.0 ms	240.0 ms	290.9 ms	0.70	0.83
2	167.0 ms	244.0 ms	246.1 ms	0.68	0.99
4	167.8 ms	240.7 ms	366.7 ms	0.70	0.66
8	168.1 ms	241.1 ms	757.0 ms	0.70	0.32
16	168.5 ms	244.5 ms	1460.0 ms	0.69	0.17

Parallel Sum One Billion

Benchmark Parallel Sum 1,000,000,000

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1	16831.0 ms	24111.0 ms	23346.0 ms	0.70	1.03
2	17059.9 ms	24229.4 ms	16102.1 ms	0.70	1.50
4	16959.9 ms	24365.6 ms	12099.5 ms	0.70	2.01
8	16758.4 ms	24228.1 ms	14076.9 ms	0.69	1.72
16	16748.7 ms	24266.6 ms	55502.9 ms	0.69	0.44

Mandelbrot Iterative

Benchmark Mandelbrot

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSqueak/STM OS Threads
1	724.0 ms	983.0 ms	1565.5 ms	0.74	0.63
2	780.5 ms	973.5 ms	5555.0 ms	0.80	0.18
4	781.0 ms	982.5 ms	20107.5 ms	0.79	0.05
8	779.5 ms	980.0 ms	113067.0 ms	0.80	0.01

Discussion of benchmark results

First of all, the ParallelSum benchmarks show that the parallelism is actually paying off, at least for sufficiently large embarrassingly parallel problems. Thus RSqueak can also benefit from rstm.

On the other hand, our Mandelbrot implementation shows the limits of our current rstm integration. We implemented two versions of the algorithm one using one low-level array and one using two nested collections. In both versions, one job only calculates a distinct range of rows and both lead to a slowdown. The summary of the state of rstm transactions shows that there are a lot of inevitable transactions (transactions which must be completed). One reason might be the interactions between the VM and its low-level extensions, so called plugins. We have to investigate this further.

Limitations

Although the current VM setup is working well enough to support our benchmarks, the VM still has limitations. First of all, as it is based on rstm, it has the current limitation of only running on 64-bit Linux.

Besides this, we also have two major limitations regarding the VM itself. First, the atomic interface exposed in Smalltalk is currently not working, when the VM is compiled using the just-in-time compiler transformation. Simple examples such as concurrent parallel sum work fine while more complex benchmarks such as chameneos fail. The reasons for this are currently beyond our understanding. Second, Smalltalk supports green threads, which are threads which are managed by the VM and are not mapped to OS threads. We currently support starting new Smalltalk threads as OS threads instead of starting them as green threads. However, existing threads in a Smalltalk image are not migrated to OS threads, but remain running as green threads.

Future work for STM in RSqueak

The work we presented showed interesting problems, we propose the following problem statements for further analysis:

Inevitable transactions in benchmarks. This looks like it could limit other applications too so it should be solved.
Collection implementation aware of STM: The current implementation of collections can cause a lot of STM collisions due to their internal memory structure. We believe it could bear potential for performance improvements, if we replace these collections in an STM enabled interpreter with implementations with less STM collisions. As already proposed by Remi Meier, bags, sets and lists are of particular interest.
Finally, we exposed STM through languages features such as the atomic method, which is provided through the VM. Originally, it was possible to model STM transactions barriers implicitly by using clever locks, now its exposed via the atomic keyword. From a language design point of view, the question arises whether this is a good solution and what features an stm-enabled interpreter must provide to the user in general? Of particular interest are for example, access to the transaction length and hints for transaction borders to and their performance impact.

Details for the technically inclined

Adjustments to the interpreter loop were minimal.
STM works on bytecode granularity that means, there is a implicit transaction border after every bytecode executed. Possible alternatives: only break transactions after certain bytecodes, break transactions on one abstraction layer above, e.g. object methods (setter, getter).
rstm calls were exposed using primtives (a way to expose native code in Smalltalk), this was mainly used for atomic.
Starting and stopping OS threads is exposed via primitives as well. Threads are started from within the interpreter.
For Smalltalk enabled STM code we currently have different image versions. However another way to add, load and replace code to the Smalltalk code base is required to make a switch between STM and non-STM code simple.

Details on the project setup

From a non-technical perspective, a problem we encountered was the huge roundtrip times (on our machines up to 600s, 900s with JIT enabled). This led to a tendency of bigger code changes ("Before we compile, let's also add this"), lost flow ("What where we doing before?") and different compiled interpreters in parallel testing ("How is this version different from the others?") As a consequence it was harder to test and correct errors. While this is not as much of a problem for other RPython VMs, RSqueakVM needs to execute the entire image, which makes running it untranslated even slower.

Summary

The benchmarks show that speed up is possible, but also that the STM overhead in some situations can eat up the speedup. The resulting STM-enabled VM still has some limitations: As rstm is currently only running on 64-bit Linux the RSqueakVM is doing so as well. Eventhough it is possible for us now to create new threads that map to OS threads within the VM, the migration of exiting Smalltalk threads keeps being problematic.

We showed that an existing VM code base can benefit of STM in terms of scaling up. Further it was relatively easy to enable STM support. This may also be valuable to VM developers considering to get STM support for their VMs.

Extending the Smalltalk RSqueakVM with STM

by Conrad Calmez, Hubert Hesse, Patrick Rein and Malte Swart supervised by Tim Felgentreff and Tobias Pape

Introduction

Introduction to RSqueakVM

Changes to RSqueakVM

Benchmarks

Parallel Sum Ten Million

Benchmark Parallel Sum 10,000,000

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1	168.0 ms	240.0 ms	290.9 ms	0.70	0.83
2	167.0 ms	244.0 ms	246.1 ms	0.68	0.99
4	167.8 ms	240.7 ms	366.7 ms	0.70	0.66
8	168.1 ms	241.1 ms	757.0 ms	0.70	0.32
16	168.5 ms	244.5 ms	1460.0 ms	0.69	0.17

Parallel Sum One Billion

Benchmark Parallel Sum 1,000,000,000

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1	16831.0 ms	24111.0 ms	23346.0 ms	0.70	1.03
2	17059.9 ms	24229.4 ms	16102.1 ms	0.70	1.50
4	16959.9 ms	24365.6 ms	12099.5 ms	0.70	2.01
8	16758.4 ms	24228.1 ms	14076.9 ms	0.69	1.72
16	16748.7 ms	24266.6 ms	55502.9 ms	0.69	0.44

Mandelbrot Iterative

Benchmark Mandelbrot

Thread Count	RSqueak green threads	RSqueak/STM green threads	RSqueak/STM OS threads	Slow down from RSqueak green threads to RSqueak/STM green threads	Speed up from RSqueak/STM green threads to RSqueak/STM OS Threads
1	724.0 ms	983.0 ms	1565.5 ms	0.74	0.63
2	780.5 ms	973.5 ms	5555.0 ms	0.80	0.18
4	781.0 ms	982.5 ms	20107.5 ms	0.79	0.05
8	779.5 ms	980.0 ms	113067.0 ms	0.80	0.01

Discussion of benchmark results

First of all, the ParallelSum benchmarks show that the parallelism is actually paying off, at least for sufficiently large embarrassingly parallel problems. Thus RSqueak can also benefit from rstm.

Limitations

Future work for STM in RSqueak

The work we presented showed interesting problems, we propose the following problem statements for further analysis:

Inevitable transactions in benchmarks. This looks like it could limit other applications too so it should be solved.
Collection implementation aware of STM: The current implementation of collections can cause a lot of STM collisions due to their internal memory structure. We believe it could bear potential for performance improvements, if we replace these collections in an STM enabled interpreter with implementations with less STM collisions. As already proposed by Remi Meier, bags, sets and lists are of particular interest.
Finally, we exposed STM through languages features such as the atomic method, which is provided through the VM. Originally, it was possible to model STM transactions barriers implicitly by using clever locks, now its exposed via the atomic keyword. From a language design point of view, the question arises whether this is a good solution and what features an stm-enabled interpreter must provide to the user in general? Of particular interest are for example, access to the transaction length and hints for transaction borders to and their performance impact.

Details for the technically inclined

Adjustments to the interpreter loop were minimal.
STM works on bytecode granularity that means, there is a implicit transaction border after every bytecode executed. Possible alternatives: only break transactions after certain bytecodes, break transactions on one abstraction layer above, e.g. object methods (setter, getter).
rstm calls were exposed using primtives (a way to expose native code in Smalltalk), this was mainly used for atomic.
Starting and stopping OS threads is exposed via primitives as well. Threads are started from within the interpreter.
For Smalltalk enabled STM code we currently have different image versions. However another way to add, load and replace code to the Smalltalk code base is required to make a switch between STM and non-STM code simple.

Details on the project setup

Summary

Posted by Carl Friedrich Bolz-Tereick at 14:15 5 Comments

Saturday, July 5, 2014

PyPy-STM: first "interesting" release

Hi all,

PyPy-STM is now reaching a point where we can say it's good enough to be a GIL-less Python. (We don't guarantee there are no more bugs, so please report them :-) The first official STM release:

pypy-stm-2.3-r2-linux64
(UPDATE: this is release r2, fixing a systematic segfault at start-up on some systems)

This corresponds roughly to PyPy 2.3 (not 2.3.1). It requires 64-bit Linux. More precisely, this release is built for Ubuntu 12.04 to 14.04; you can also rebuild it from source by getting the branch stmgc-c7. You need clang to compile, and you need a patched version of llvm.

This version's performance can reasonably be compared with a regular PyPy, where both include the JIT. Thanks for following the meandering progress of PyPy-STM over the past three years --- we're finally getting somewhere really interesting! We cannot thank enough all contributors to the previous PyPy-STM money pot that made this possible. And, although this blog post is focused on the results from that period of time, I have of course to remind you that we're running a second call for donation for future work, which I will briefly mention again later.

A recap of what we did to get there: around the start of the year we found a new model, a "redo-log"-based STM which uses a couple of hardware tricks to not require chasing pointers, giving it (in this context) exceptionally cheap read barriers. This idea was developed over the following months and (relatively) easily integrated with the JIT compiler. The most recent improvements on the Garbage Collection side are closing the gap with a regular PyPy (there is still a bit more to do there). There is some preliminary user documentation.

Today, the result of this is a PyPy-STM that is capable of running pure Python code on multiple threads in parallel, as we will show in the benchmarks that follow. A quick warning: this is only about pure Python code. We didn't try so far to optimize the case where most of the time is spent in external libraries, or even manipulating "raw" memory like array.array or numpy arrays. To some extent there is no point because the approach of CPython works well for this case, i.e. releasing the GIL around the long-running operations in C. Of course it would be nice if such cases worked as well in PyPy-STM --- which they do to some extent; but checking and optimizing that is future work.

As a starting point for our benchmarks, when running code that only uses one thread, we get a slow-down between 1.2 and 3: at worst, three times as slow; at best only 20% slower than a regular PyPy. This worst case has been brought down --it used to be 10x-- by recent work on "card marking", a useful GC technique that is also present in the regular PyPy (and about which I don't find any blog post; maybe we should write one :-) The main remaining issue is fork(), or any function that creates subprocesses: it works, but is very slow. To remind you of this fact, it prints a line to stderr when used.

Now the real main part: when you run multithreaded code, it scales very nicely with two threads, and less-than-linearly but still not badly with three or four threads. Here is an artificial example:

    total = 0
    lst1 = ["foo"]
    for i in range(100000000):
        lst1.append(i)
        total += lst1.pop()

We run this code N times, once in each of N threads (full benchmark). Run times, best of three:

Number of threads	Regular PyPy (head)	PyPy-STM
N = 1	real 0.92s user+sys 0.92s	real 1.34s user+sys 1.34s
N = 2	real 1.77s user+sys 1.74s	real 1.39s user+sys 2.47s
N = 3	real 2.57s user+sys 2.56s	real 1.58s user+sys 4.106s
N = 4	real 3.38s user+sys 3.38s	real 1.64s user+sys 5.35s

(The "real" time is the wall clock time. The "user+sys" time is the recorded CPU time, which can be larger than the wall clock time if multiple CPUs run in parallel. This was run on a 4x2 cores machine. For direct comparison, avoid loops that are so trivial that the JIT can remove all allocations from them: right now PyPy-STM does not handle this case well. It has to force a dummy allocation in such loops, which makes minor collections occur much more frequently.)

Four threads is the limit so far: only four threads can be executed in parallel. Similarly, the memory usage is limited to 2.5 GB of GC objects. These two limitations are not hard to increase, but at least increasing the memory limit requires fighting against more LLVM bugs. (Include here snark remarks about LLVM.)

Here are some measurements from more real-world benchmarks. This time, the amount of work is fixed and we parallelize it on T threads. The first benchmark is just running translate.py on a trunk PyPy. The last three benchmarks are here.

Benchmark	PyPy 2.3	(PyPy head)	PyPy-STM, T=1	T=2	T=3	T=4
`translate.py --no-allworkingmodules` (annotation step)	184s	(170s)	386s (2.10x)	n/a
multithread-richards 5000 iterations	24.2s	(16.8s)	52.5s (2.17x)	37.4s (1.55x)	25.9s (1.07x)	32.7s (1.35x)
mandelbrot divided in 16-18 bands	22.9s	(18.2s)	27.5s (1.20x)	14.4s (0.63x)	10.3s (0.45x)	8.71s (0.38x)
btree	2.26s	(2.00s)	2.01s (0.89x)	2.22s (0.98x)	2.14s (0.95x)	2.42s (1.07x)

This shows various cases that can occur:

The mandelbrot example runs with minimal overhead and very good parallelization. It's dividing the plane to compute in bands, and each of the T threads receives the same number of bands.
Richards, a classical benchmark for PyPy (tweaked to run the iterations in multiple threads), is hard to beat on regular PyPy: we suspect that the difference is due to the fact that a lot of paths through the loops don't allocate, triggering the issue already explained above. Moreover, the speed of Richards was again improved dramatically recently, in trunk.
The translation benchmark measures the time translate.py takes to run the first phase only, "annotation" (for now it consumes too much memory to run translate.py to the end). Moreover the timing starts only after the large number of subprocesses spawned at the beginning (mostly gcc). This benchmark is not parallel, but we include it for reference here. The slow-down factor of 2.1x is still too much, but we have some idea about the reasons: most likely, again the Garbage Collector, missing the regular PyPy's very fast small-object allocator for old objects. Also, translate.py is an example of application that could, with reasonable efforts, be made largely parallel in the future using atomic blocks.
Atomic blocks are also present in the btree benchmark. I'm not completely sure but it seems that, in this case, the atomic blocks create too many conflicts between the threads for actual parallization: the base time is very good, but running more threads does not help at all.

As a summary, PyPy-STM looks already useful to run CPU-bound multithreaded applications. We are certainly still going to fight slow-downs, but it seems that there are cases where 2 threads are enough to outperform a regular PyPy, by a large margin. Please try it out on your own small examples!

And, at the same time, please don't attempt to retrofit threads inside an existing large program just to benefit from PyPy-STM! Our goal is not to send everyone down the obscure route of multithreaded programming and its dark traps. We are going finally to shift our main focus on the phase 2 of our research (donations welcome): how to enable a better way of writing multi-core programs. The starting point is to fix and test atomic blocks. Then we will have to debug common causes of conflicts and fix them or work around them; and try to see how common frameworks like Twisted can be adapted.

Lots of work ahead, but lots of work behind too :-)

Armin (thanks Remi as well for the work).

Hi all,

PyPy-STM is now reaching a point where we can say it's good enough to be a GIL-less Python. (We don't guarantee there are no more bugs, so please report them :-) The first official STM release:

pypy-stm-2.3-r2-linux64
(UPDATE: this is release r2, fixing a systematic segfault at start-up on some systems)

Now the real main part: when you run multithreaded code, it scales very nicely with two threads, and less-than-linearly but still not badly with three or four threads. Here is an artificial example:

    total = 0
    lst1 = ["foo"]
    for i in range(100000000):
        lst1.append(i)
        total += lst1.pop()

We run this code N times, once in each of N threads (full benchmark). Run times, best of three:

Number of threads	Regular PyPy (head)	PyPy-STM
N = 1	real 0.92s user+sys 0.92s	real 1.34s user+sys 1.34s
N = 2	real 1.77s user+sys 1.74s	real 1.39s user+sys 2.47s
N = 3	real 2.57s user+sys 2.56s	real 1.58s user+sys 4.106s
N = 4	real 3.38s user+sys 3.38s	real 1.64s user+sys 5.35s

Benchmark	PyPy 2.3	(PyPy head)	PyPy-STM, T=1	T=2	T=3	T=4
`translate.py --no-allworkingmodules` (annotation step)	184s	(170s)	386s (2.10x)	n/a
multithread-richards 5000 iterations	24.2s	(16.8s)	52.5s (2.17x)	37.4s (1.55x)	25.9s (1.07x)	32.7s (1.35x)
mandelbrot divided in 16-18 bands	22.9s	(18.2s)	27.5s (1.20x)	14.4s (0.63x)	10.3s (0.45x)	8.71s (0.38x)
btree	2.26s	(2.00s)	2.01s (0.89x)	2.22s (0.98x)	2.14s (0.95x)	2.42s (1.07x)

This shows various cases that can occur:

The mandelbrot example runs with minimal overhead and very good parallelization. It's dividing the plane to compute in bands, and each of the T threads receives the same number of bands.
Richards, a classical benchmark for PyPy (tweaked to run the iterations in multiple threads), is hard to beat on regular PyPy: we suspect that the difference is due to the fact that a lot of paths through the loops don't allocate, triggering the issue already explained above. Moreover, the speed of Richards was again improved dramatically recently, in trunk.
The translation benchmark measures the time translate.py takes to run the first phase only, "annotation" (for now it consumes too much memory to run translate.py to the end). Moreover the timing starts only after the large number of subprocesses spawned at the beginning (mostly gcc). This benchmark is not parallel, but we include it for reference here. The slow-down factor of 2.1x is still too much, but we have some idea about the reasons: most likely, again the Garbage Collector, missing the regular PyPy's very fast small-object allocator for old objects. Also, translate.py is an example of application that could, with reasonable efforts, be made largely parallel in the future using atomic blocks.
Atomic blocks are also present in the btree benchmark. I'm not completely sure but it seems that, in this case, the atomic blocks create too many conflicts between the threads for actual parallization: the base time is very good, but running more threads does not help at all.

Lots of work ahead, but lots of work behind too :-)

Armin (thanks Remi as well for the work).

Posted by Armin Rigo at 10:37 20 Comments

Friday, June 20, 2014

PyPy3 2.3.1 - Fulcrum

We're pleased to announce the first stable release of PyPy3. PyPy3
targets Python 3 (3.2.5) compatibility.

We would like to thank all of the people who donated to the py3k proposal
for supporting the work that went into this.

You can download the PyPy3 2.3.1 release here:

http://pypy.org/download.html#pypy3-2-3-1

Highlights

The first stable release of PyPy3: support for Python 3!
The stdlib has been updated to Python 3.2.5
Additional support for the u'unicode' syntax (PEP 414) from Python 3.3
Updates from the default branch, such as incremental GC and various JIT
improvements
Resolved some notable JIT performance regressions from PyPy2:

Re-enabled the previously disabled collection (list/dict/set) strategies

Resolved performance of iteration over range objects

Resolved handling of Python 3's exception __context__ unnecessarily forcing
frame object overhead

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.6 or 3.2.5. It's fast due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows,
and OpenBSD,
as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

While we support 32 bit python on Windows, work on the native Windows 64
bit python is still stalling, we would welcome a volunteer
to handle that.

How to use PyPy?

We suggest using PyPy from a virtualenv. Once you have a virtualenv
installed, you can follow instructions from pypy documentation on how
to proceed. This document also covers other installation schemes.

Cheers,
the PyPy team

Posted by Philip Jenvey at 22:31 6 Comments

Sunday, June 8, 2014

PyPy 2.3.1 - Terrestrial Arthropod Trap Revisited

We're pleased to announce PyPy 2.3.1, a feature-and-bugfix improvement over our recent 2.3 release last month.

This release contains several bugfixes and enhancements among the user-facing improvements:

The built-in struct module was renamed to _struct, solving issues with IDLE and other modules
Support for compilation with gcc-4.9
A CFFI-based version of the gdbm module is now included in our binary bundle
Many issues were resolved since the 2.3 release on May 8

You can download the PyPy 2.3.1 release here:

http://pypy.org/download.html

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.3.1 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here.

Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!

Cheers, The PyPy Team

Posted by mattip at 00:14 0 Comments

Friday, May 9, 2014

PyPy 2.3 - Terrestrial Arthropod Trap

We’re pleased to announce PyPy 2.3, which targets version 2.7.6 of the Python language. This release updates the stdlib from 2.7.3, jumping directly to 2.7.6.

This release also contains several bugfixes and performance improvements, many generated by real users finding corner cases. CFFI has made it easier than ever to use existing C code with both cpython and PyPy, easing the transition for packages like cryptography, Pillow(Python Imaging Library [Fork]), a basic port of pygame-cffi, and others.

PyPy can now be embedded in a hosting application, for instance inside uWSGI

You can download the PyPy 2.3 release here:

http://pypy.org/download.html

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.3 and cpython 2.7.x performance comparison; note that cpython's speed has not changed since 2.7.2) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

We would like to thank our donors for the continued support of the PyPy project.

The complete release notice is here

Cheers, The PyPy Team

Posted by mattip at 09:38 6 Comments

Tuesday, April 15, 2014

NumPy on PyPy - Status Update

Work on NumPy on PyPy continued in March, though at a lighter pace than the previous few months. Progress was made on both compatibility and speed fronts. Several behavioral issues reported to the bug tracker were resolved. The most significant of these was probably the correction of casting to built-in Python types. Previously, int/long conversions of numpy scalars such as inf/nan/1e100 would return bogus results. Now, they raise or return values, as appropriate.

On the speed front, enhancements to the PyPy JIT were made to support virtualizing the raw_store/raw_load memory operations used in numpy arrays. Further work remains here in virtualizing the alloc_raw_storage when possible. This will allow scalars to have storages but still be virtualized when possible in loops.

Aside from continued work on compatibility/speed of existing code, we also hope to begin implementing the C-level components of other numpy modules such as mtrand, nditer, linalg, and so on. Several approaches could be taken to get C-level code in these modules working, ranging from reimplementing in RPython to interfacing with existing code with CFFI, if possible. The appropriate approach depends on many factors and will probably vary from module to module.

To try out PyPy + NumPy, grab a nightly PyPy and install our NumPy fork. Feel free to report comments/issues to IRC, our mailing list, or bug tracker. Thanks to the contributors to the NumPy on PyPy proposal for supporting this work.

Posted by Brian Kearns at 21:08 8 Comments

Wednesday, April 9, 2014

STM results and Second Call for Donations

Hi all,

We now have a preliminary version of PyPy-STM with the JIT, from the new STM documentation page. This PyPy-STM is still not quite useful, failing to top the performance of a regular PyPy by a small margin on most benchmarks, but it's definitely getting there :-) The overheads with the JIT are still a bit too high. (I've been tracking an obscure bug since days. It turned out to be a simple buffer overflow. But if anybody has a clue about why a hardware watchpoint in gdb, set on one of the garbled memory locations, fails to trigger but the memory ends up being modified anyway... and, it turns out, by just a regular pointer write... ideas welcome.)

But I go off-topic :-) The main point of this post is to announce the 2nd Call for Donation about STM. We achieved most of the goals laid out in the first call. We even largely overachieved them in terms of raw performance, even if there are many cases that are unreasonably slow for now. So, after the successful research, we are launching a second proposal about the development part of the project:

Polish PyPy-STM to get a consistently reasonable speed, 25%-40% slower than a regular JITted PyPy when running single-threaded code. Of course it is supposed to scale nicely as long as there are no user-visible conflicts.
Focus on developing the Python-facing interface: both internal things (e.g. do dictionaries need to be more TM-friendly in general?) as well as directly visible things (e.g. some profiler-like interface to explore common conflicts in a program).
Regular multithreaded code should benefit out of the box, but the final goal is to explore and tweak some existing non-multithreaded frameworks and improve their TM-friendliness. So existing programs using Twisted or Stackless, for example, should run on multiple cores without any major change.

See the full call for more details! I'd like to thank Remi Meier for getting involved. And a big thank you to everybody who contributed money on the first call. It took more time than anticipated, but it's there in good but rough shape. Now it needs a lot of polishing :-)

Armin

Hi all,

Polish PyPy-STM to get a consistently reasonable speed, 25%-40% slower than a regular JITted PyPy when running single-threaded code. Of course it is supposed to scale nicely as long as there are no user-visible conflicts.
Focus on developing the Python-facing interface: both internal things (e.g. do dictionaries need to be more TM-friendly in general?) as well as directly visible things (e.g. some profiler-like interface to explore common conflicts in a program).
Regular multithreaded code should benefit out of the box, but the final goal is to explore and tweak some existing non-multithreaded frameworks and improve their TM-friendliness. So existing programs using Twisted or Stackless, for example, should run on multiple cores without any major change.

Armin

Posted by Armin Rigo at 10:33 2 Comments

Wednesday, March 26, 2014

pygame_cffi: pygame on PyPy

The Raspberry Pi aims to be a low-cost educational tool that anyone can use to learn about electronics and programming. Python and pygame are included in the Pi's programming toolkit. And since last year, thanks in part to sponsorship from the Raspberry Pi Foundation, PyPy also works on the Pi (read more here).

With PyPy working on the Pi, game logic written in Python stands to gain an awesome performance boost. However, the original pygame is a Python C extension. This means it performs poorly on PyPy and negates any speedup in the Python parts of the game code.

One solution to making pygame games run faster on PyPy, and eventually on the Raspberry Pi, comes in the form of pygame_cffi. pygame_cffi uses CFFI to wrap the underlying SDL library instead of a C extension. A few months ago, the Raspberry Pi Foundation sponsored a Cape Town Python User Group hackathon to build a proof-of-concept pygame using CFFI. This hackathon was a success and it produced an early working version of pygame_cffi.

So for the last 5 weeks Raspberry Pi has been funding work on pygame_cffi. The goal was a complete implementation of the core modules. We also wanted benchmarks to illuminate performance differences between pygame_cffi on PyPy and pygame on CPython. We are happy to report that those goals were met. So without further ado, here's a rundown of what works.

Current functionality

Surfaces support all the usual flags for SDL and OpenGL rendering (more about OpenGL below).
The graphics-related modules color, display, font and image, and parts of draw and transform are mostly complete.
Events! No fastevent module yet, though.
Mouse and keyboard functionality, as provided by the mouse and key modules, is complete.
Sound functionality, as provided by the mixer and music modules, is complete.
Miscellaneous modules, cursors, rect, sprite and time are also complete.

Invention screenshot:

Mutable mamba screenshot:

With the above-mentioned functionality in place we could get 10+ of the pygame examples to work, and a number of PyWeek games. At the time of writing, if a game doesn't work it is most likely due to an unimplemented transform or draw function. That will be remedied soon.

Performance

In terms of performance, pygame_cffi on PyPy is showing a lot of promise. It beats pygame on CPython by a significant margin in our events processing and collision detection benchmarks, while blit and fill benchmarks perform similarly. The pygame examples we checked also perform better.

However, there is still work to be done to identify and eliminate bottlenecks. On the Raspberry Pi performance is markedly worse compared to pygame (barring collision detection). The PyWeek games we tested also performed slightly worse. Fortunately there is room for improvement in various places.

Invention & Mutable Mamba (x86)

Standard pygame examples (Raspberry Pi)

Here's a summary of some of the benchmarks. Relative speed refers to the frame rate obtained in pygame_cffi on PyPy relative to pygame on CPython.

Benchmark	Relative speed (pypy speedup)
Events (x86)	1.41
Events (Pi)	0.58
N² collision detection on 100 sprites (x86)	4.14
N² collision detection on 100 sprites (Pi)	1.01
Blit 100 surfaces (x86)	1.06
Blit 100 surfaces (Pi)	0.60
Invention (x86)	0.95
Mutable Mamba (x86)	0.72
stars example (x86)	1.95
stars example (Pi)	0.84

OpenGL

Some not-so-great news is that PyOpenGL performs poorly on PyPy since PyOpenGL uses ctypes. This translates into a nasty reduction in frame rate for games that use OpenGL surfaces. It might be worthwhile creating a CFFI-powered version of PyOpenGL as well.

Where to now?

Work on pygame_cffi is ongoing. Here are some things that are in the pipeline:

Get pygame_cffi on PyPy to a place where it is consistently faster than pygame on CPython.
Implement the remaining modules and functions, starting with draw and transform.
Improve test coverage.
Reduce the time it takes for CFFI to parse the cdef. This makes the initial pygame import slow.

If you want to contribute you can find pygame_cffi on Github. Feel free to find us on #pypy on freenode or post issues on github.

Cheers,
Rizmari Versfeld

Current functionality

Surfaces support all the usual flags for SDL and OpenGL rendering (more about OpenGL below).
The graphics-related modules color, display, font and image, and parts of draw and transform are mostly complete.
Events! No fastevent module yet, though.
Mouse and keyboard functionality, as provided by the mouse and key modules, is complete.
Sound functionality, as provided by the mixer and music modules, is complete.
Miscellaneous modules, cursors, rect, sprite and time are also complete.

Invention screenshot:

Mutable mamba screenshot:

Performance

Invention & Mutable Mamba (x86)

Standard pygame examples (Raspberry Pi)

Here's a summary of some of the benchmarks. Relative speed refers to the frame rate obtained in pygame_cffi on PyPy relative to pygame on CPython.

Benchmark	Relative speed (pypy speedup)
Events (x86)	1.41
Events (Pi)	0.58
N² collision detection on 100 sprites (x86)	4.14
N² collision detection on 100 sprites (Pi)	1.01
Blit 100 surfaces (x86)	1.06
Blit 100 surfaces (Pi)	0.60
Invention (x86)	0.95
Mutable Mamba (x86)	0.72
stars example (x86)	1.95
stars example (Pi)	0.84

OpenGL

Where to now?

Work on pygame_cffi is ongoing. Here are some things that are in the pipeline:

Get pygame_cffi on PyPy to a place where it is consistently faster than pygame on CPython.
Implement the remaining modules and functions, starting with draw and transform.
Improve test coverage.
Reduce the time it takes for CFFI to parse the cdef. This makes the initial pygame import slow.

If you want to contribute you can find pygame_cffi on Github. Feel free to find us on #pypy on freenode or post issues on github.

Cheers,
Rizmari Versfeld

Posted by Maciej Fijalkowski at 17:28 41 Comments

Saturday, March 15, 2014

STMGC-C7 with PyPy

Hi all,

Here is one of the first full PyPy's (edit: it was r69967+, but the general list of versions is currently here) compiled with the new StmGC-c7 library. It has no JIT so far, but it runs some small single-threaded benchmarks by taking around 40% more time than a corresponding non-STM, no-JIT version of PyPy. It scales --- up to two threads only, which is the hard-coded maximum so far in the c7 code. But the scaling looks perfect in these small benchmarks without conflict: starting two threads each running a copy of the benchmark takes almost exactly the same amount of total time, simply using two cores.

Feel free to try it! It is not actually useful so far, because it is limited to two cores and CPython is something like 2.5x faster. One of the important next steps is to re-enable the JIT. Based on our current understanding of the "40%" figure, we can probably reduce it with enough efforts; but also, the JIT should be able to easily produce machine code that suffers a bit less than the interpreter from these effects. This seems to mean that we're looking at 20%-ish slow-downs for the future PyPy-STM-JIT.

Interesting times :-)

For reference, this is what you get by downloading the PyPy binary linked above: a Linux 64 binary (Ubuntu 12.04) that should behave mostly like a regular PyPy. (One main missing feature is that destructors are never called.) It uses two cores, but obviously only if the Python program you run is multithreaded. The only new built-in feature is with __pypy__.thread.atomic: this gives you a way to enforce that a block of code runs "atomically", which means without any operation from any other thread randomly interleaved.

If you want to translate it yourself, you need a trunk version of clang with three patches applied. That's the number of bugs that we couldn't find workarounds for, not the total number of bugs we found by (ab)using the address_space feature...

Stay tuned for more!

Armin & Remi

Posted by Armin Rigo at 18:00 11 Comments

Monday, March 10, 2014

PyPy on uWSGI

Hello everyone

There is an interview with Roberto De Ioris (from uWSGI fame) about embedding PyPy in uWSGI that covers recent addition of a PyPy embedding interface using cffi and the experience with using it. Read The full interview

Cheers
fijal

Friday, March 7, 2014

NumPy on PyPy - Progress in February

More progress was made on the NumPy front in the past month. On the compatibility front, we now pass ~130 more tests from NumPy's suite since the end of January. Currently, we pass 2336 tests out of 3265 tests run, with many of the failures representing portions of NumPy that we don't plan to implement in the near future (object dtypes, unicode, etc). There are still some failures that do represent issues, such as special indexing cases and failures to respect subclassed ndarrays in return values, which we do plan to resolve. There are also some unimplemented components and ufuncs remaining which we hope to implement, such as nditer and mtrand. Overall, the most common array functionality should be working.

Additionally, I began to take a look at some of the loops generated by our code. One widely used loop is dot, and we were running about 5x slower than NumPy's C version. I was able to optimize the dot loop and also the general array iterator to get us to ~1.5x NumPy C time on dot operations of various sizes. Further progress in this area could be made by using CFFI to tie into BLAS libraries, when available. Also, work remains in examining traces generated for our other loops and checking for potential optimizations.

To try out PyPy + NumPy, grab a nightly PyPy and install our NumPy fork. Feel free to report comments/issues to IRC, our mailing list, or bug tracker. Thanks to the contributors to the NumPy on PyPy proposal for supporting this work.

Cheers,
Brian

Posted by Brian Kearns at 06:05 4 Comments

Tuesday, February 18, 2014

Py3k status update #13

This is the 13th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

We're just finishing up a cleanup of int/long types. This work helps the py3k
branch unify these types into the Python 3 int and restore JIT compilation of
machine sized integers.

This cleanup also removes multimethods from these types. PyPy has
historically used a clever implementation of multimethod dispatch for declaring
methods of the __builtin__ types in RPython.

This multimethod scheme provides some convenient features for doing this,
however we've come to the conclusion that it may be more trouble than it's
worth. A major problem of multimethods is that they generate a large amount of
stub methods which burden the already lengthy and memory hungry RPython
translation process. Also, their implementation and behavior can be somewhat
complicated/obscure.

The alternative to multimethods involves doing the work of the type checking
and dispatching rules in a more verbose, manual way. It's a little more work in
the end but less magical.

Recently, Manuel Jacob finished a large cleanup effort of the
unicode/string/bytearray types that also removed their multimethods. This work
also benefits the py3k branch: it'll help with future PEP 393 (or PEP 393
alternative) work. This effort was partly sponsored by Google's Summer of
Code: thanks Manuel and Google!

Now there's only a couple major pieces left in the multimethod removal (the
float/complex types and special marshaling code) and a few minor pieces that
should be relatively easy.

In conclusion, there's been some good progress made on py3k and multimethod
removal this winter, albeit a bit slower than we would have liked.

cheers,
Phil

Posted by Philip Jenvey at 03:33 1 Comments

Sunday, February 9, 2014

Rewrites of the STM core model -- again

Hi all,

A quick note about the Software Transactional Memory (STM) front.

Since the previous post, we believe we progressed a lot by discovering an alternative core model for software transactions. Why do I say "believe"? It's because it means again that we have to rewrite from scratch the C library handling STM. This is currently work in progress. Once this is done, we should be able to adapt the existing pypy-stm to run on top of it without much rewriting efforts; in fact it should simplify the difficult issues we ran into for the JIT. So while this is basically yet another restart similar to last June's, the difference is that the work that we have already put in the PyPy part (as opposed to the C library) remains.

You can read about the basic ideas of this new C library here. It is still STM-only, not HTM, but because it doesn't constantly move objects around in memory, it would be easier to adapt an HTM version. There are even potential ideas about a hybrid TM, like using HTM but only to speed up the commits. It is based on a Linux-only system call, remap_file_pages() (poll: who heard about it before? :-). As previously, the work is done by Remi Meier and myself.

Currently, the C library is incomplete, but early experiments show good results in running duhton, the interpreter for a minimal language created for the purpose of testing STM. Good results means we brough down the slow-downs from 60-80% (previous version) to around 15% (current version). This number measures the slow-down from the non-STM-enabled to the STM-enabled version, on one CPU core; of course, the idea is that the STM version scales up when using more than one core.

This means that we are looking forward to a result that is much better than originally predicted. The pypy-stm has chances to run at a one-thread speed that is only "n%" slower than the regular pypy-jit, for a value of "n" that is optimistically 15 --- but more likely some number around 25 or 50. This is seriously better than the original estimate, which was "between 2x and 5x". It would mean that using pypy-stm is quite worthwhile even with just two cores.

More updates later...

Armin

Hi all,

A quick note about the Software Transactional Memory (STM) front.

More updates later...

Armin

Posted by Armin Rigo at 23:16 6 Comments

Thursday, February 6, 2014

NumPy Status Update - December/January

Work continued on the NumPy + PyPy front steadily in December and more lightly in January. The continued focus was compatibility, targeting incorrect or unimplemented features that appeared in multiple NumPy test suite failures. We now pass ~2/3 of the NumPy test suite. The biggest improvements were made in these areas:

- Bugs in conversions of arrays/scalars to/from native types
- Fix cases where we would choose incorrect dtypes when initializing or computing results
- Improve handling of subclasses of ndarray through computations
- Support some optional arguments for array methods that are used in the pure-python part of NumPy
- Support additional attributes in arrays, array.flags, and dtypes
- Fix some indexing corner cases that arise in NumPy testing
- Implemented part of numpy.fft (cffti and cfftf)

Looking forward, we plan to continue improving the correctness of the existing implemented NumPy functionality, while also beginning to look at performance. The initial focus for performance will be to look at areas where we are significantly worse than CPython+NumPy. Those interested in trying these improvements out will need a PyPy nightly, and an install of the PyPy NumPy fork. Thanks again to the NumPy on PyPy donors for funding this work.

Posted by Brian Kearns at 20:06 5 Comments