Tuesday, December 10, 2013
NumPy Status Update - November
Improvements were made in these areas:
- Many missing/broken scalar functionalities were added/fixed. The scalar API should match up more closely with arrays now.
- Some missing dtype functionality was added (newbyteorder, hasobject, descr, etc)
- Support for optional arguments (axis, order) was added to some ndarray functions
- Fixed some corner cases for string/record types
Most of these improvements went onto trunk after 2.2 was split, so if you're interested in trying them out or running into problems on 2.2, try the nightly.
Thanks again to the NumPy on PyPy donors who make this continued progress possible.
Cheers,
Brian
Monday, December 9, 2013
PyGame CFFI
One of the RaspberryPi's goals is to be a fun toolkit for school children (and adults!) to learn programming and electronics with. Python and pygame are part of this toolkit. Recently the RaspberryPi Foundation funded part of the effort to port PyPy to the Pi -- making Python programs on the Pi faster!
Unfortunately, pygame is written as a Python C extension that wraps SDL, which means that the performance of pygame under PyPy remains mediocre. To fix this, pygame needs to be rewritten using cffi to wrap SDL instead.
RaspberryPi sponsored a CTPUG (Cape Town Python User Group) hackathon to put together a proof-of-concept pygame-cffi. The day was quite successful - we got a basic version of the bub'n'bros client working on pygame-cffi (and on PyPy). The results can be found on github with contributions from the five people present at the sprint.
While far from complete, the proof of concept does show that there are no major obstacles to porting pygame to cffi and that cffi is a great way to bind your Python package to C libraries.
Amazingly, we managed to have machines running all three major platforms (OS X, Linux and Windows) at the hackathon so the code runs on all of them!
We would like to thank the Praekelt foundation for providing the venue and The Raspberry Pi foundation for providing food and drinks!
Cheers,
Simon Cross, Jeremy Thurgood, Neil Muller, David Sharpe and fijal.
Saturday, November 30, 2013
PyPy Leysin Winter Sprint (11-19th January 2014)
The next PyPy sprint will be in Leysin, Switzerland, for the ninth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.
Goals and topics of the sprint
- Py3k: work towards supporting Python 3 in PyPy
- NumPyPy: work towards supporting the numpy module in PyPy
- STM: work towards supporting Software Transactional Memory
- And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for skiing.
Exact times
For a change, and as an attempt to simplify things, I specified the dates as 11-19 January 2014, where 11 and 19 are travel days. We will work full days between the 12th and the 18th. You are of course allowed to show up for a part of that time only, too.
Location & Accommodation
Leysin, Switzerland, "same place as before". Let me refresh your memory: both the sprint venue and the lodging will be in a very spacious pair of chalets built specifically for bed & breakfast: http://www.ermina.ch/. The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere (as long as you are in Leysin, you cannot be more than a 15-minute walk away from the sprint venue), but I definitely recommend lodging there too -- you won't find a better view anywhere else (though you probably won't get much worse ones easily, either :-)
Please confirm that you are coming so that we can adjust the reservations as appropriate. The rate so far has been around 60 CHF a night all included in 2-person rooms, with breakfast. There are larger rooms too (less expensive per person) and maybe the possibility to get a single room if you really want to.
Please register by Mercurial:
https://bitbucket.org/pypy/extradoc/ https://bitbucket.org/pypy/extradoc/raw/extradoc/sprintinfo/leysin-winter-2014
or on the pypy-dev mailing list if you do not yet have check-in rights:
http://mail.python.org/mailman/listinfo/pypy-dev
You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around -- bring an EU-format power strip if you have one.
Wednesday, November 27, 2013
PyPy 2.2.1 - Incrementalism.1
We're pleased to announce PyPy 2.2.1, which targets version 2.7.3 of the Python language. This is a bugfix release over 2.2.
You can download the PyPy 2.2.1 release here:
http://pypy.org/download.html
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.2 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows 32, or ARM (ARMv6 or ARMv7, with VFPv3).
Work on native Windows 64 support is still stalled; we would welcome a volunteer to handle that.
Highlights
This is a bugfix release. The most important bugs fixed are:
- an issue in sockets' reference counting emulation, showing up notably when using the ssl module and calling makefile().
- Tkinter support on Windows.
- If sys.maxunicode==65535 (on Windows and maybe OS/X), the json decoder incorrectly decoded surrogate pairs.
- some FreeBSD fixes.
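To make the surrogate-pair item above concrete: when sys.maxunicode == 65535, a character outside the Basic Multilingual Plane is written in JSON as two \uXXXX escapes, and the decoder has to combine them into one code point. The sketch below is written for Python 3 and only illustrates the arithmetic; the helper name combine_surrogates is made up for this example and is not part of the json module or of PyPy.

```python
import json

def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 is escaped in JSON as the pair D83D / DE00.
code_point = combine_surrogates(0xD83D, 0xDE00)

# A correct decoder turns the two escapes into that single character.
decoded = json.loads('"\\ud83d\\ude00"')
```

A decoder with the bug described above would produce something other than the single character chr(code_point) for this input.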
Note that CFFI 0.8.1 was released. Both versions 0.8 and 0.8.1 are compatible with both PyPy 2.2 and 2.2.1.
Cheers, Armin Rigo & everybody
Sunday, November 17, 2013
CFFI 0.8
Hi all,
CFFI 0.8 for CPython (2.6-3.x) has been released.
Quick download: pip install cffi --upgrade
Documentation: https://cffi.readthedocs.org/en/release-0.8/
What's new: a number of small fixes; ffi.getwinerror(); integrated support for C99 variable-sized structures; multi-thread safety.
--- Armin
Update: CFFI 0.8.1, with fixes on Python 3 on OS/X, and some FreeBSD fixes (thanks Tobias).
Friday, November 15, 2013
NumPy status update
The biggest change is that we shifted to using an external fork of numpy rather than a minimal numpypy module. The idea is that we will be able to reuse most of the upstream pure-python numpy components, replacing the C modules with appropriate RPython micronumpy pieces at the correct places in the module namespace.
The numpy fork should work just as well as the old numpypy for functionality that existed previously, and also include much new functionality from the pure-python numpy pieces that simply hadn't been imported yet in numpypy. However, this new functionality will not have been "hand picked" to only include pieces that work, so you may run into functionality that relies on unimplemented components (which should fail with user-level exceptions).
This setup also allows us to run the entire numpy test suite, which will help in directing future compatibility development. The recent PyPy release includes these changes, so download it and let us know how it works! And if you want to live on the edge, the nightly includes even more numpy progress made in November.
To install the fork, download the latest release, and then install numpy either separately with a virtualenv: pip install git+https://bitbucket.org/pypy/numpy.git; or directly: git clone https://bitbucket.org/pypy/numpy.git; cd numpy; pypy setup.py install.
EDIT: if you install numpy as root, you may need to also import it once as root before it works: sudo pypy -c 'import numpy'
Along with this change, progress was made in fixing internal micronumpy bugs and increasing compatibility:
- Fixed a bug with strings in record dtypes
- Fixed a bug where the multiplication of an ndarray with a Python int or float resulted in loss of the array's dtype
- Fixed several segfaults encountered in the numpy test suite (suite should run now without segfaulting)
We also began working on __array_prepare__ and __array_wrap__, which are necessary pieces for a working matplotlib module.
Cheers,
Romain and Brian
Thursday, November 14, 2013
PyPy 2.2 - Incrementalism
This release also contains several bugfixes and performance improvements.
You can download the PyPy 2.2 release here:
http://pypy.org/download.html
We would like to thank our donors for the continued support of the PyPy project. We showed quite a bit of progress on all three projects (see below) and we're slowly running out of funds. Please consider donating more so we can finish those projects! The three projects are:
- Py3k (supporting Python 3.x): the release PyPy3 2.2 is imminent.
- STM (software transactional memory): a preview will be released very soon, as soon as we fix a few bugs
- NumPy: the work done is included in the PyPy 2.2 release. More details below.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.2 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows 32, or ARM (ARMv6 or ARMv7, with VFPv3).
Work on native Windows 64 support is still stalled; we would welcome a volunteer to handle that.
Highlights
- Our Garbage Collector is now "incremental". It should avoid almost all pauses due to a major collection taking place. Previously, it would pause the program (rarely) to walk all live objects, which could take arbitrarily long if your process is using a whole lot of RAM. Now the same work is done in steps. This should make PyPy more responsive, e.g. in games. There are still other pauses, from the GC and the JIT, but they should be on the order of 5 milliseconds each.
- The JIT counters for hot code were never reset, which meant that a process running for long enough would eventually JIT-compile more and more rarely executed code. Not only is it useless to compile such code, but as more compiled code means more memory used, this gives the impression of a memory leak. This has been tentatively fixed by decreasing the counters from time to time.
- NumPy has been split: now PyPy only contains the core module, called _numpypy. The numpy module itself has been moved to https://bitbucket.org/pypy/numpy and numpypy disappeared. You need to install NumPy separately with a virtualenv: pip install git+https://bitbucket.org/pypy/numpy.git; or directly: git clone https://bitbucket.org/pypy/numpy.git; cd numpy; pypy setup.py install.
- non-inlined calls have less overhead
- Things that use sys.settrace are now JITted (like coverage)
- JSON decoding is now very fast (JSON encoding was already very fast)
- various buffer copying methods experience speedups (like list-of-ints to int[] buffer from cffi)
- We finally wrote (hopefully) all the missing os.xxx() functions, including os.startfile() on Windows and a handful of rare ones on Posix.
- numpy has a rudimentary C API that cooperates with cpyext
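The JIT counter item in the list above can be pictured with a toy model. This is only a sketch of the general idea (count executions, compile at a threshold, shrink all counters from time to time); the class name HotCounters, the threshold and the decay factor are all assumptions made for illustration, and none of this is PyPy's actual code.

```python
class HotCounters:
    """Toy model of per-location hotness counters with periodic decay."""

    def __init__(self, threshold, decay=0.5):
        self.threshold = threshold
        self.decay = decay
        self.counts = {}

    def tick(self, key):
        """Count one execution; return True once the location is hot."""
        self.counts[key] = self.counts.get(key, 0.0) + 1.0
        return self.counts[key] >= self.threshold

    def decay_all(self):
        """Shrink every counter so that rare code never accumulates."""
        for key in self.counts:
            self.counts[key] *= self.decay


counters = HotCounters(threshold=10)

# A rarely executed path: one execution between two decay rounds.
rare_became_hot = False
for _ in range(1000):
    rare_became_hot = rare_became_hot or counters.tick("rare_path")
    counters.decay_all()

# A real hot loop: many executions in a row, with no decay in between.
loop_became_hot = any(counters.tick("hot_loop") for _ in range(20))
```

Without decay_all(), the rare path would cross the threshold after ten executions no matter how far apart they are; with it, its counter stays below 2 and only the genuinely hot loop gets compiled in this model.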
Armin Rigo and Maciej Fijalkowski
Wednesday, November 13, 2013
Py3k status update #12
This is the 12th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.
Here's an update on the recent progress:
- Thank you to everyone who has provided initial feedback on the PyPy3 2.1 beta
1 release. We've gotten a number of bug reports, most of which have been
fixed.
- As usual, we're continually keeping up with changes from the default
branch. Oftentimes these merges come at a cost (conflicts and/or
reintegration of py3k changes) but occasionally we get goodies for free, such
as the recent JIT optimizations and incremental garbage collection.
- We've been focusing on re-optimizing Python 2 int sized (machine sized)
integers:
We have a couple of known, notable speed regressions in the PyPy3 beta release
vs regular PyPy. The major one being with Python 2.x int sized (or machine
sized) integers.
Python 3 drops the distinction between int and long types. CPython 3.x
accomplishes this by removing the old int type entirely and renaming the long
type to int. Initially, we've done the same for PyPy3 for the sake of
simplicity and getting everything working.
However PyPy's JIT is capable of heavily optimizing these machine sized integer
operations, so this came with a regression in performance in this area.
We're now in the process of solving this. Part of this work also involves some
house cleaning on these numeric types which also benefits the default branch.
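As a rough picture of what "machine sized" means here, the following Python 3 sketch separates results that fit in a signed machine word from those that need arbitrary precision. It is only an illustration of the distinction; the function names are invented for this example and say nothing about how PyPy3 implements it internally.

```python
import sys

def fits_machine_word(n):
    """True if n fits in a signed machine word on this platform."""
    return -sys.maxsize - 1 <= n <= sys.maxsize

def add_with_fast_path(a, b):
    """Add two ints and report which representation the result needs."""
    result = a + b
    if fits_machine_word(a) and fits_machine_word(b) and fits_machine_word(result):
        return result, "machine-int"   # the case a JIT can optimize heavily
    return result, "bignum"            # the general arbitrary-precision case

small = add_with_fast_path(1, 2)
large = add_with_fast_path(sys.maxsize, 1)
```

In Python 3 both results have type int; the point of the re-optimization work described above is that the first case should not pay the price of the second.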
cheers,
Phil
Saturday, October 26, 2013
Making coverage.py faster under PyPy
If you've ever tried to run your programs with coverage.py under PyPy,
you've probably experienced some incredible slowness. Take this simple
program:
    def f():
        return 1

    def main():
        i = 10000000
        while i:
            i -= f()

    main()
Running time coverage.py run test.py five times, and looking at the best
run, here's how PyPy 2.1 stacks up against CPython 2.7.5:
Python | Time | Normalized to CPython |
---|---|---|
CPython 2.7.5 | 3.879s | 1.0x |
PyPy 2.1 | 53.330s | 13.7x slower |
Totally ridiculous. I got turned onto this problem because on one of my
projects CPython takes about 1.5 minutes to run our test suite on the build
bot, but PyPy takes 8-10 minutes.
So I sat down to address it. And the results:
Python | Time | Normalized to CPython |
---|---|---|
CPython 2.7.5 | 3.879s | 1.0x |
PyPy 2.1 | 53.330s | 13.7x slower |
PyPy head | 1.433s | 2.7x faster |
Not bad.
Technical details
So how'd we do it? Previously, using sys.settrace() (which coverage.py
uses under the hood) disabled the JIT. Except it didn't just disable the JIT,
it did it in a particularly insidious way — the JIT had no idea it was being
disabled!
Instead, every time PyPy discovered that one of your functions was a hotspot,
it would start tracing to observe what the program was doing, and right when it
was about to finish, coverage would run and cause the JIT to abort. Tracing
is a slow process, it makes up for it by generating fast machine code at the
end, but tracing is still incredibly slow. But we never actually got to the
"generate fast machine code" stage. Instead we'd pay all the cost of tracing,
but then we'd abort, and reap none of the benefits.
To fix this, we adjusted some of the heuristics in the JIT, to better show it
how sys.settrace(<tracefunc>) works. Previously the JIT saw it as an opaque
function which gets the frame object, and couldn't tell whether or not it
messed with the frame object. Now we let the JIT look inside the
<tracefunc> function, so it's able to see that coverage.py isn't
messing with the frame in any weird ways, it's just reading the line number and
file path out of it.
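For readers who have not used sys.settrace() directly, here is a minimal sketch of the kind of trace function described above: it only reads the line number out of the frame and never modifies it. It is far simpler than what coverage.py really does, and the names record_lines and sample are made up for this example.

```python
import sys

def record_lines(func):
    """Run func() and return the line offsets executed inside it."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions line by line
        if event == "line":
            # Only *read* from the frame, as a coverage tool would.
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)
    return executed

def sample():
    x = 1
    if x > 5:
        x = 100
    return x

lines = record_lines(sample)
```

The offsets are relative to the def line of sample, so the body of the if (offset 3) is the one line that should be missing from the result.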
I asked several friends in the VM implementation and research field if they
were aware of any other research into making VMs stay fast when debugging tools
like coverage.py are running. No one I spoke to was aware of any (but I
didn't do a particularly exhaustive review of the literature, I just tweeted at
a few people), so I'm pleased to say that PyPy is quite possibly the first VM
to work on optimizing code in debugging mode! This is possible because of our
years spent investing in meta-tracing research.
Happy testing,
Alex
Wednesday, October 16, 2013
Update on STM
Hi all,
The sprint in London was a lot of fun and very fruitful. In the last update on STM, Armin was working on improving and specializing the automatic barrier placement. There is still a lot to do in that area, but that work is merged now. Specializing and improving barrier placement is still to be done for the JIT.
But that is not all. Right after the sprint, we were able to squash the last obvious bugs in the STM-JIT combination. However, the performance was nowhere near what we want. So far, we have fixed some of the most obvious issues. Many come from RPython erring on the side of caution and e.g. making a transaction inevitable even if that is not strictly necessary, thereby limiting parallelism. Another problem came from incrementing counters every time a guard fails, which caused transactions to conflict on these counter updates. Since these counters do not have to be completely accurate, we now update them non-transactionally, with a chance of small errors.
There are still many such performance issues of various complexity left to tackle: we are nowhere near done. So stay tuned or contribute :)
Performance
Now, since the JIT is all about performance, we want to at least show you some numbers that are indicative of things to come. Our set of STM benchmarks is very small unfortunately (something you can help us out with), so this is not representative of real-world performance. We tried to minimize the effect of JIT warm-up in the benchmark results.
The machine these benchmarks were executed on has 4 physical cores with Hyper-Threading (8 hardware threads).
Raytracer from stm-benchmarks: Render times in seconds for a 1024x1024 image:
Interpreter | Base time: 1 thread | 8 threads (speedup) |
---|---|---|
PyPy-2.1 | 2.47 | 2.56 (0.96x) |
CPython | 81.1 | 73.4 (1.1x) |
PyPy-STM | 50.2 | 10.8 (4.6x) |
For comparison, disabling the JIT gives 148s on PyPy-2.1 and 87s on PyPy-STM (with 8 threads).
Richards from PyPy repository on the stmgc-c4 branch: Average time per iteration in milliseconds:
Interpreter | Base time: 1 thread | 8 threads (speedup) |
---|---|---|
PyPy-2.1 | 15.6 | 15.4 (1.01x) |
CPython | 239 | 237 (1.01x) |
PyPy-STM | 371 | 116 (3.2x) |
For comparison, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on PyPy-STM.
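The speedup column in both tables is the 1-thread time divided by the 8-thread time. The short script below recomputes it from the numbers above; the variable names are ours, and nothing here is a new measurement.

```python
# (1-thread time, 8-thread time), copied from the two tables above.
raytracer_seconds = {
    "PyPy-2.1": (2.47, 2.56),
    "CPython": (81.1, 73.4),
    "PyPy-STM": (50.2, 10.8),
}
richards_ms = {
    "PyPy-2.1": (15.6, 15.4),
    "CPython": (239, 237),
    "PyPy-STM": (371, 116),
}

def speedup(base, threaded):
    """Speedup factor of the threaded run over the 1-thread run."""
    return base / threaded

raytracer_speedups = {k: speedup(*v) for k, v in raytracer_seconds.items()}
richards_speedups = {k: speedup(*v) for k, v in richards_ms.items()}

# How much slower 8-thread PyPy-STM still is than 1-thread PyPy-2.1
# on the raytracer.
stm_vs_pypy = raytracer_seconds["PyPy-STM"][1] / raytracer_seconds["PyPy-2.1"][0]

for name, value in raytracer_speedups.items():
    print("raytracer %-9s %.2fx" % (name, value))
```

The last ratio is the gap that still has to be closed: even with 8 threads, PyPy-STM takes more than four times as long as single-threaded PyPy-2.1 on this benchmark.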
Try it!
All this can be found in the PyPy repository on the stmgc-c4 branch. Try it for yourself, but keep in mind that this is still experimental with a lot of things yet to come. Only Linux x64 is supported right now, but contributions are welcome.
You can download a prebuilt binary from here: https://bitbucket.org/pypy/pypy/downloads/pypy-oct13-stm.tar.bz2 (Linux x64 Ubuntu >= 12.04). This was made at revision bafcb0cdff48.
Summary
What the numbers tell us is that PyPy-STM is, as expected, the only one of the three interpreters where multithreading gives a large improvement in speed. What they also tell us is that, obviously, the result is not good enough yet: it still takes longer on an 8-threaded PyPy-STM than on a regular single-threaded PyPy-2.1. However, as you should know by now, we are good at promising speed and delivering it... years later :-)
But it has been two years already since PyPy-STM started, and this is our first preview of the JIT integration. Expect major improvements soon: with STM, the JIT generates code that is completely suboptimal in many cases (barriers, allocation, and more). Once we improve this, the performance of the STM-JITted code should come much closer to PyPy 2.1.
Cheers
Remi & Armin
Tuesday, October 15, 2013
Incremental Garbage Collector in PyPy
Hello everyone.
We're pleased to announce that as of today, the default PyPy comes with a GC that has much smaller pauses than yesterday.
Let's start with explaining roughly what GC pauses are. In CPython each object has a reference count, which is incremented each time we create references and decremented each time we forget them. This means that objects are freed each time they become unreachable. That is only half of the story though. First note that when the last reference to a large tree of objects goes away, you have a pause: all the objects are freed. Your program is not progressing at all during this pause, and this pause's duration can be arbitrarily large. This occurs at deterministic times, though. But consider code like this:
    class A(object):
        pass

    a = A()
    b = A()
    a.item = b
    b.item = a
    del a
    del b
This creates a reference cycle. It means that while we deleted references to a and b from the current scope, they still have a reference count of 1, because they point to each other, even though the whole group has no references from the outside. CPython employs a cyclic garbage collector which is used to find such cycles. It walks over all objects in memory, starting from some known roots, such as type objects, variables on the stack, etc. This solves the problem, but can create noticeable, nondeterministic GC pauses as the heap becomes large and convoluted.
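The effect can be observed from Python itself. The sketch below assumes CPython (it relies on reference counting plus the gc module's cycle collector) and uses a weak reference as a probe; the function name is ours.

```python
import gc
import weakref

class A(object):
    pass

def cycle_survives_del():
    """Return (alive after del, alive after gc.collect()) for a cycle."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running on its own
    try:
        a = A()
        b = A()
        a.item = b
        b.item = a
        probe = weakref.ref(a)  # does not keep `a` alive
        del a
        del b
        alive_after_del = probe() is not None
        gc.collect()  # explicitly run the cycle finder
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect

result = cycle_survives_del()
```

On CPython, reference counting alone leaves the two objects alive after the del statements, and only the explicit collection frees them.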
PyPy essentially has only the cycle finder - it does not bother with reference counting, instead it walks live objects every now and then (this is a big simplification, PyPy's GC is much more complex than this). Although this might sound like a missing feature, it is really one of the reasons why PyPy is so fast, because at the end of the day the total time spent in managing the memory is lower in PyPy than in CPython. However, as a result, PyPy also has the problem of GC pauses.
To alleviate this problem, which is essential for applications like games, we started to work on incremental GC, which spreads the walking of objects and cleaning them across the execution time in smaller intervals. The work was sponsored by the Raspberry Pi foundation, started by Andrew Chambers and finished by Armin Rigo and Maciej Fijałkowski.
Benchmarks
Everyone loves benchmarks. We did not measure any significant speed difference on our quite extensive benchmark suite on speed.pypy.org. The main benchmark that we used for other comparisons was translating the topaz ruby interpreter using various versions of PyPy and CPython. The exact command was python <pypy-checkout>/bin/rpython -O2 --rtype targettopaz.py. Versions:
- topaz - dce3eef7b1910fc5600a4cd0afd6220543104823
- pypy source - defb5119e3c6
- pypy compiled with minimark (non-incremental GC) - d1a0c07b6586
- pypy compiled with incminimark (new, incremental GC) - 417a7117f8d7
- CPython - 2.7.3
The memory usage of CPython, PyPy with minimark and PyPy with incminimark is shown here. Note that this benchmark is quite bad for PyPy in general, the memory usage is higher and the amount of time taken is longer. This is due to the JIT warmup being both memory hungry and inefficient (see below). But first, the new GC is not worse than the old one.
EDIT: Red line is CPython, blue is incminimark (new), green is minimark (old)
The image was obtained by graphing the output of memusage.py.
However, the GC pauses are significantly smaller. For PyPy, the way to get GC pauses is to measure the time between start and stop while running with PYPYLOG=gc-collect:log pypy program.py; for CPython, the magic incantation is gc.set_debug(gc.DEBUG_STATS) and parsing the output. For what it's worth, the average and total for CPython, as well as the total number of events, are not directly comparable, since it only shows the cyclic collector, not the reference counts. The only comparable thing is the number of long pauses and their duration. In the table below, pause duration is sorted into 8 buckets, each meaning "below that or equal to the threshold". The output is generated using the gcanalyze tool.
CPython:
150.1ms | 300.2ms | 450.3ms | 600.5ms | 750.6ms | 900.7ms | 1050.8ms | 1200.9ms |
---|---|---|---|---|---|---|---|
5417 | 5 | 3 | 2 | 1 | 1 | 0 | 1 |
PyPy minimark (non-incremental GC):
216.4ms | 432.8ms | 649.2ms | 865.6ms | 1082.0ms | 1298.4ms | 1514.8ms | 1731.2ms |
---|---|---|---|---|---|---|---|
27 | 14 | 6 | 4 | 6 | 5 | 3 | 3 |
PyPy incminimark (new incremental GC):
15.7ms | 31.4ms | 47.1ms | 62.8ms | 78.6ms | 94.3ms | 110.0ms | 125.7ms |
---|---|---|---|---|---|---|---|
25512 | 122 | 4 | 1 | 0 | 0 | 0 | 2 |
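The thresholds in these tables look like eight equal steps up to the longest observed pause (for CPython, 1200.9 / 8 is about 150.1). Assuming that is how the buckets are formed, the bucketing can be sketched as follows. This is not the gcanalyze source, and the function name bucket_pauses and the sample data are made up for illustration.

```python
def bucket_pauses(pauses_ms, n_buckets=8):
    """Return (thresholds, counts).

    Bucket i counts the pauses that are <= thresholds[i] and greater
    than the previous threshold.
    """
    if not pauses_ms:
        return [], []
    longest = max(pauses_ms)
    step = longest / n_buckets
    # Use the exact maximum as the last threshold to avoid rounding issues.
    thresholds = [step * (i + 1) for i in range(n_buckets - 1)] + [longest]
    counts = [0] * n_buckets
    for pause in pauses_ms:
        for i, limit in enumerate(thresholds):
            if pause <= limit:
                counts[i] += 1
                break
    return thresholds, counts

# Hypothetical pause durations in milliseconds.
thresholds, counts = bucket_pauses([1.0, 2.0, 8.0, 16.0])
```

Because the thresholds scale with the longest pause, the three tables are not on the same scale: the last bucket for incminimark ends at roughly a tenth of the last bucket for CPython.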
As we can see, while there is still work to be done (the 100ms ones could be split among several steps), we did improve the situation quite drastically without any actual performance difference.
Note about the benchmark - we know it's a pretty extreme case of JIT warmup, we know we suck on it, we're working on it and we're not afraid of showing PyPy is not always the best ;-)
Nitty gritty details
Here are some nitty gritty details for people really interested in Garbage Collection. This was done as a patch to "minimark", our current GC, and called "incminimark" for now. The former is a generational stop-the-world GC. New objects are allocated "young", which means that they initially live in the "nursery", a special zone of a few MB of memory. When the nursery is full, a "minor collection" step moves the surviving objects out of the nursery. This can be done quickly (a few milliseconds) because we only need to walk through the young objects that survive, which is usually a small fraction of all young objects and far fewer than all the objects alive at this point. However, from time to time this minor collection is followed by a "major collection": in that step, we really need to walk all objects to classify which ones are still alive and which ones are now dead ("marking") and free the memory occupied by the dead ones ("sweeping"). You can read more details here.
This "major collection" is what gives the long GC pauses. To fix this problem we made the GC incremental: instead of running one complete major collection, we split its work into a variable number of pieces and run each piece after every minor collection for a while, until there are no more pieces. The pieces are each doing a fraction of marking, or a fraction of sweeping. This adds a few milliseconds after each of these minor collections, rather than requiring hundreds of milliseconds in one go.
The main issue is that splitting the major collections means that the main program is actually running between the pieces, and so it can change the pointers in the objects to point to other objects. This is not a problem for sweeping: dead objects will remain dead whatever the main program does. However, it is a problem for marking. Let us see why.
In terms of the incremental GC literature, objects are either "white", "gray" or "black". This is called tri-color marking. See for example this blog post about Rubinius, or this page about LuaJIT or the Wikipedia description. The objects start as "white" at the beginning of marking; become "gray" when they are found to be alive; and become "black" when they have been fully traversed. Marking proceeds by scanning gray objects for pointers to white objects. The white objects found are turned gray, and the gray objects scanned are turned black. When there are no more gray objects, the marking phase is complete: all remaining white objects are truly unreachable and can be freed (by the following sweeping phase).
In this model, the important part is that a black object can never point to a white object: if the latter remains white until the end, it will be freed, which is incorrect because the black object itself can still be reached. How do we ensure that the main program, running in the middle of marking, will not try to write a pointer to a white object into a black object? This requires a "write barrier", i.e. a piece of code that runs every time we set a pointer into an object or array. This piece of code checks if some (hopefully rare) condition is met, and calls a function if that is the case.
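As a toy illustration of tri-color marking done in small pieces, here is a sketch in plain Python. It models only the colors and the gray list; the class Obj, the function mark_incrementally and the budget parameter are invented for this example and are unrelated to the RPython implementation.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []      # outgoing pointers
        self.color = WHITE  # everything starts white

def mark_incrementally(roots, budget):
    """Generator: each step scans at most `budget` gray objects and
    yields the number of gray objects still pending."""
    gray = []
    for root in roots:
        if root.color == WHITE:
            root.color = GRAY
            gray.append(root)
    while gray:
        for _ in range(min(budget, len(gray))):
            obj = gray.pop()
            for child in obj.refs:
                if child.color == WHITE:
                    child.color = GRAY
                    gray.append(child)
            obj.color = BLACK  # fully traversed
        yield len(gray)

# a -> b -> c is reachable from the root `a`; `d` is garbage.
a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs.append(b)
b.refs.append(c)

steps = list(mark_incrementally([a], budget=1))
garbage = [o.name for o in (a, b, c, d) if o.color == WHITE]
```

Each value produced by the generator corresponds to one small piece of marking work; whatever is still white when the generator is exhausted is what a sweeping phase would free.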
The trick we used in PyPy is to consider minor collections as part of the whole, rather than focus only on major collections. The existing minimark GC had always used a write barrier of its own to do its job, like any generational GC. This existing write barrier is used to detect when an old object (outside the nursery) is modified to point to a young object (inside the nursery), which is essential information for minor collections. Actually, although this was the goal, the actual write barrier code is simpler: it just records all old objects into which we write any pointer --- to a young or old object. As we found out over time, doing so is not actually slower, and might actually be a performance improvement: for example, if the main program does a lot of writes into the same old object, we don't need to check over and over again if the written pointer points to a young object or not. We just record the old object in some list the first time, and that's it.
The trick is that this unmodified write barrier works for incminimark too. Imagine that we are in the middle of the marking phase, running the main program. The write barrier will record all old objects that are being modified. Then at the next minor collection, all surviving young objects will be moved out of the nursery. At this point, as we're about to continue running the major collection's marking phase, we simply add to the list of pending gray objects all the objects that we just considered --- both the objects listed as "old objects that are being modified", and the objects that we just moved out of the nursery. A fraction of the objects in the former list were black, so this means that they are turned back from black to gray. This technique implements nicely, if indirectly, what is called a "backward write barrier" in the literature. The backwardness refers to the color being changed in the opposite of the usual direction "white -> gray -> black", thus making more work for the GC. (This is as opposed to "forward write barrier", where we would also detect "black -> white" writes but turn the white object gray.)
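The following toy model shows why a barrier is needed at all and what "turning a black object back to gray" achieves. It is a simplification in one important respect: here the barrier re-grays the object immediately, whereas in incminimark the barrier only records the object and the re-graying happens at the next minor collection. All names (Obj, Marker, run) are made up for this sketch.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.color = WHITE

class Marker:
    def __init__(self, roots, use_barrier):
        self.gray = []
        self.use_barrier = use_barrier
        for root in roots:
            self._shade(root)

    def _shade(self, obj):
        if obj.color == WHITE:
            obj.color = GRAY
            self.gray.append(obj)

    def step(self):
        """Scan one gray object; return False when marking is finished."""
        if not self.gray:
            return False
        obj = self.gray.pop()
        for child in obj.refs:
            self._shade(child)
        obj.color = BLACK
        return True

    def write(self, target, new_ref):
        """Store a pointer, as the main program would between GC steps."""
        target.refs.append(new_ref)
        if self.use_barrier and target.color == BLACK:
            # Backward barrier: the written-to object goes back to gray.
            target.color = GRAY
            self.gray.append(target)

def run(use_barrier):
    root = Obj("root")
    fresh = Obj("fresh")
    marker = Marker([root], use_barrier)
    marker.step()              # root has been scanned and is now black
    marker.write(root, fresh)  # a black object now points to a white one
    while marker.step():
        pass
    return fresh.color

without_barrier = run(use_barrier=False)
with_barrier = run(use_barrier=True)
```

Without the barrier, the reachable object "fresh" is still white when marking ends and would be freed by mistake; with the barrier, "root" is scanned a second time and "fresh" ends up black.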
In summary, I realize that this description is less about how we turned minimark into incminimark, and more about how we differ from the standard way of making a GC incremental. What we really had to do to make incminimark was to write logic that says "if the major collection is in the middle of the marking phase, then add this object to the list of gray objects", and put it at a few places throughout minor collection. Then we simply split a major collection into increments, doing marking or sweeping of some (relatively arbitrary) number of objects before returning. That's why, after we found that the existing write barrier would do, it was not much actual work, and could be done without major changes. For example, not a single line from the JIT needed adaptation. All in all it was relatively painless work. ;-)
Cheers,
armin and fijal
Wednesday, September 25, 2013
Numpy Status Update
Thanks to the people who donated money to the numpy proposal, here is what I've been working on recently:
- Fixed conversion from a numpy complex number to a python complex number
- Implement the rint ufunc
- Make numpy.character usable as a dtype
- Fix ndarray(dtype=str).fill()
- Various fixes on boolean and fancy indexing
Cheers
Romain
Monday, September 23, 2013
PyCon South Africa & sprint
Hi all,
For those of you that happen to be from South Africa: don't miss PyCon ZA 2013, next October 3rd and 4th! Like last year, a few of us will be there. There will be the first talk about STM getting ready (a blog post about that should follow soon).
Moreover, general sprints will continue on the weekend (5th and 6th). Afterwards, Fijal will host a longer PyPy sprint (marathon?) with me until around the 21st. You are welcome to it as well! Write to the mailing list or to fijal directly (fijall at gmail.com), or simply in comments of this post.
--- Armin
Friday, August 30, 2013
Tuesday, August 27, 2013
NumPy road forward
Hello everyone.
This is the roadmap for the numpy effort in PyPy as discussed at the London sprint. First, the highest item on our priority list is to finish the low-level part of the numpy module. We will finish the RPython part of numpy and provide a pip installable numpypy repository that includes the pure python part of Numpy. This would contain the original Numpy with a few minor changes.
Second, we need to work on the JIT support that will make NumPy on PyPy faster. In detail:
- reenable the lazy loop evaluation
- optimize bridges, which depends on optimizer refactorings
- SSE support
On the compatibility front, there were some independent attempts at making the following work:
- f2py
- C API (in fact, PyArray_* API is partly present in the nightly builds of PyPy)
- matplotlib (both using PyArray_* API and embedding CPython runtime in PyPy)
- scipy
In order to make all of the above happen faster, it would be helpful to raise more funds. You can donate to PyPy's NumPy project on our website. Note that PyPy is a member of SFC which is a 501(c)(3) US non-profit, so donations from US companies can be tax-deducted.
Cheers,
fijal, arigo, ronan, rguillebert, anto and others
Tuesday, August 20, 2013
Preliminary London Demo Evening Agenda
We now have a preliminary agenda for the demo evening in London next week. It takes place on Tuesday, August 27 2013, 18:30-19:30 (BST) at King's College London, Strand. The preliminary agenda is as follows:
- Laurence Tratt: Welcome from the Software Development Team
- Carl Friedrich Bolz: A Short Introduction to PyPy
- Maciej Fijałkowski: Numpy on PyPy, Present State and Outlook
- Lukas Diekmann: Collection Strategies for Fast Containers in PyPy
- Armin Rigo: Software Transactional Memory for PyPy
- Edd Barrett: Unipycation: Combining Prolog and Python
All the talks are lightning talks. Afterwards there will be plenty of time for discussion.
There are still free spots; if you want to come, please register on the Eventbrite page. Hope to see you there!
Sunday, August 18, 2013
Update on STM
Hi all,
A quick update on Software Transactional Memory. We are working on two fronts.
On the one hand, the integration of the "c4" C library with PyPy is done and works well, but is still subject to improvements. The "PyPy-STM" executable (without the JIT) seems to be stable, as far as it has been tested. It runs a simple benchmark like Richards with a 3.2x slow-down over a regular JIT-less PyPy.
The main factor of this slow-down: the numerous "barriers" in the code --- checks that are needed a bit everywhere to verify that a pointer to an object points to a recent enough version, and if not, to go to the most recent version. These barriers are inserted automatically during the translation; there is no need for us to manually put 42 million barriers in the source code of PyPy. But this automatic insertion uses a primitive algorithm right now, which usually ends up putting more barriers than the theoretical optimum. I (Armin) am trying to improve that --- and progressing: last week the slow-down was around 4.5x. This is done in the branch stmgc-static-barrier.
On the other hand, Remi is progressing on the JIT integration in the branch stmgc-c4. This has been working in simple cases for a couple of weeks now, but the resulting "PyPy-JIT-STM" often crashes. This is because, while the basics are not really hard, we keep hitting new issues that must be resolved.
The basics are that whenever the JIT is about to generate assembler corresponding to a load or a store in a GC object, it must first generate a bit of extra assembler that corresponds to the barrier that we need. This works fine by now (but could benefit from the same kind of optimizations described above, to reduce the number of barriers). The additional issues are all more subtle. I will describe the current one as an example: it is how to write constant pointers inside the assembler.
Remember that the STM library classifies objects as either "public" or "protected/private". A "protected/private" object is one which has not been seen by another thread so far. This is essential as an optimization, because we know that no other thread will access our protected or private objects in parallel, and thus we are free to modify their content in place. By contrast, public objects are frozen, and to do any change, we first need to build a different (protected) copy of the object. See this blog post for more details.
So far so good, but the JIT will sometimes (actually often) hard-code constant pointers into the assembler it produces. For example, this is the case when the Python code being JITted creates an instance of a known class; the corresponding assembler produced by the JIT will reserve the memory for the instance and then write the constant type pointer in it. This type pointer is a GC object (in the simple model, it's the Python class object; in PyPy it's actually the "map" object, which is a different story).
The problem right now is that this constant pointer may point to a protected object. This is a problem because the same piece of assembler can later be executed by a different thread. If it does, then this different thread will create instances whose type pointer is bogus: looking like a protected object, but actually protected by a different thread. Any attempt to use this type pointer to change anything on the class itself will likely crash: the threads will all think they can safely change it in-place. To fix this, we need to make sure we only write pointers to public objects in the assembler. This is a bit involved because we need to ensure that there is a public version of the object to start with.
When this is done, we will likely hit the next problem, and the next one; but at some point it should converge (hopefully!) and we'll give you our first PyPy-JIT-STM ready to try. Stay tuned :-)
A bientôt,
Armin.
Thursday, August 8, 2013
NumPyPy Status Update
As expected, nditer is a lot of work. I'm going to pause my work on it for now and focus on simpler and more important things. Here is a list of what I implemented:
- Fixed a bug on 32 bit that made int32(123).dtype == dtype("int32") fail
- Fixed a bug on the pickling of array slices
- The external loop flag is implemented on the nditer class
- The c_index, f_index and multi_index flags are also implemented
- Add dtype("double") and dtype("str")
- C-style iteration is available for nditer
Romain Guillebert
Thursday, August 1, 2013
PyPy 2.1 - Considered ARMful
We're pleased to announce PyPy 2.1, which targets version 2.7.3 of the Python
language. This is the first release with official support for ARM processors in the JIT.
This release also contains several bugfixes and performance improvements.
You can download the PyPy 2.1 release here:
http://pypy.org/download.html
We would like to thank the Raspberry Pi Foundation for supporting the work
to finish PyPy's ARM support.
The first beta of PyPy3 2.1, targeting version 3 of the Python language, was
just released, more details can be found here.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.1 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. This release also supports ARM machines running Linux 32bit - anything with ARMv6 (like the Raspberry Pi) or ARMv7 (like the Beagleboard, Chromebook, Cubieboard, etc.) that supports VFPv3 should work. Both hard-float armhf/gnueabihf and soft-float armel/gnueabi builds are provided. The armhf builds for Raspbian are created using the Raspberry Pi custom cross-compilation toolchain based on gcc-arm-linux-gnueabihf and should work on ARMv6 and ARMv7 devices running Debian or Raspbian. The armel builds are built using the gcc-arm-linux-gnueabi toolchain provided by Ubuntu and currently target ARMv7.
Windows 64 work is still stalling, we would welcome a volunteer to handle that.
Highlights
- JIT support for ARM, architecture versions 6 and 7, hard- and soft-float ABI
- Stacklet support for ARM
- Support for os.statvfs and os.fstatvfs on unix systems
- Improved logging performance
- Faster sets for objects
- Interpreter improvements
- During packaging, compile the CFFI based TK extension
- Pickling of numpy arrays and dtypes
- Subarrays for numpy
- Bugfixes to numpy
- Bugfixes to cffi and ctypes
- Bugfixes to the x86 stacklet support
- Fixed issue 1533: fix an RPython-level OverflowError for space.float_w(w_big_long_number).
- Fixed issue 1552: GreenletExit should inherit from BaseException.
- Fixed issue 1537: numpypy __array_interface__
- Fixed issue 1238: Writing to an SSL socket in PyPy sometimes failed with a "bad write retry" message.
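As a small illustration of the os.statvfs and os.fstatvfs support listed above, the following sketch queries the free space on the filesystem that holds a given path. It uses only the standard os module and assumes a Unix system; the helper names free_bytes and free_bytes_fd are made up for this example.

```python
import os

def free_bytes(path="."):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)              # Unix only
    return st.f_bavail * st.f_frsize

def free_bytes_fd(path="."):
    """Same query, but through an open file descriptor with os.fstatvfs."""
    fd = os.open(path, os.O_RDONLY)
    try:
        st = os.fstatvfs(fd)
        return st.f_bavail * st.f_frsize
    finally:
        os.close(fd)

if __name__ == "__main__":
    print("free via statvfs:  %d bytes" % free_bytes("."))
    print("free via fstatvfs: %d bytes" % free_bytes_fd("."))
```

The two numbers can differ slightly between calls if other processes write to the disk in the meantime.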
Cheers,
David Schneider for the PyPy team.
Wednesday, July 31, 2013
PyPy Demo Evening in London, August 27, 2013
As promised in the London sprint announcement we are organising a PyPy demo evening during the London sprint on Tuesday, August 27 2013, 18:30-19:30 (BST). The description of the event is below. If you want to come, please register on the Eventbrite page.
PyPy is a fast Python VM. Maybe you've never used PyPy and want to find out what use it might be for you? Or you and your organisation have been using it and you want to find out more about how it works under the hood? If so, this demo session is for you!
Members of the PyPy team will give a series of lightning talks on PyPy: its benefits; how it works; research currently being undertaken to make it faster; and unusual uses it can be put to. Speakers will be available afterwards for informal discussions. This is the first time an event like this has been held in the UK, and is a unique opportunity to speak to core people. Speakers confirmed thus far include: Armin Rigo, Maciej Fijałkowski, Carl Friedrich Bolz, Lukas Diekmann, Laurence Tratt, Edd Barrett.
The venue for this talk is the Software Development Team, King's College London. The main entrance is on the Strand, from where the room for the event will be clearly signposted. Travel directions can be found at http://www.kcl.ac.uk/campuslife/campuses/directions/strand.aspx
If you have any questions about the event, please contact Laurence Tratt
Tuesday, July 30, 2013
PyPy3 2.1 beta 1
We're pleased to announce the first beta of the upcoming 2.1 release of
PyPy3. This is the first release of PyPy which targets Python 3 (3.2.3)
compatibility.
We would like to thank all of the people who donated to the py3k proposal
for supporting the work that went into this and future releases.
You can download the PyPy3 2.1 beta 1 release here:
http://pypy.org/download.html#pypy3-2-1-beta-1
Highlights
- The first release of PyPy3: support for Python 3, targeting CPython 3.2.3!
- There are some known issues including performance regressions (issues
#1540 & #1541) slated to be resolved before the final release.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.3 or 3.2.3. It's fast due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows
32. Also this release supports ARM machines running Linux 32bit - anything with
ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard,
Chromebook, Cubieboard, etc.) that supports VFPv3 should work.
Windows 64 work is still stalling and we would welcome a volunteer to handle
that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv
installed, you can follow instructions from pypy documentation on how
to proceed. This document also covers other installation schemes.
Cheers,
the PyPy team
Friday, July 26, 2013
PyPy 2.1 beta 2
We're pleased to announce the second beta of the upcoming 2.1 release of PyPy.
This beta adds one new feature to the 2.1 release and contains several bugfixes listed below.
You can download the PyPy 2.1 beta 2 release here:
http://pypy.org/download.html
Highlights
- Support for os.statvfs and os.fstatvfs on unix systems.
- Fixed issue 1533: fix an RPython-level OverflowError for space.float_w(w_big_long_number).
- Fixed issue 1552: GreenletExit should inherit from BaseException.
- Fixed issue 1537: numpypy __array_interface__
- Fixed issue 1238: Writing to an SSL socket in pypy sometimes failed with a "bad write retry" message.
- distutils: copy CPython's implementation of customize_compiler, don't call
split on environment variables, honour CFLAGS, CPPFLAGS, LDSHARED and
LDFLAGS.
- During packaging, compile the CFFI tk extension.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.3. It's fast due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows
32. Also this release supports ARM machines running Linux 32bit - anything with
ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard,
Chromebook, Cubieboard, etc.) that supports VFPv3 should work.
Windows 64 work is still stalling, we would welcome a volunteer
to handle that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv
installed, you can follow instructions from pypy documentation on how
to proceed. This document also covers other installation schemes.
Cheers,
The PyPy Team.
PyPy San Francisco Sprint July 27th 2013
The next PyPy sprint will be in San Francisco, California. It is a public
sprint, suitable for newcomers. It will run on Saturday July 27th.
Some possible things people will be hacking on the sprint:
- running your software on PyPy
- making your software fast on PyPy
- improving PyPy's JIT
- improving Twisted on PyPy
- any exciting stuff you can think of
If there are newcomers, we'll run an introduction to hacking on PyPy.
Location
The sprint will be held at the Rackspace Office:
620 Folsom St, Ste 100
The doors will open at 10AM and the sprint will run until 6PM.
Friday, July 19, 2013
PyPy London Sprint (August 26 - September 1 2013)
The next PyPy sprint will be in London, United Kingdom for the first time. This is a fully public sprint. PyPy sprints are a very good way to get into PyPy development and no prior PyPy knowledge is necessary.
Goals and topics of the sprint
For newcomers:
- bring your application/library and we'll help you port it to PyPy, benchmark and profile
- come and write your favorite missing numpy function
- help us work on developer tools like jitviewer
We'll also work on:
- refactoring the JIT optimizations
- STM and STM-related topics
- anything else attendees are interested in
Exact times
The work days should be August 26 - September 1 2013 (Monday-Sunday). The official plans are for people to arrive on the 26th, and to leave on the 2nd. There will be a break day in the middle. We'll typically start at 10:00 in the morning.
Location
The sprint will happen within a room of King's College's Strand Campus in Central London, UK. There are some travel instructions how to get there. We are being hosted by Laurence Tratt and the Software Development Team.
Demo Session
If you don't want to come to the full sprint, but still want to chat a bit, we are planning to have a demo session on Tuesday August 27. We will announce this separately on the blog. If you are interested, please leave a comment.
Registration
If you want to attend, please register by adding yourself to the "people.txt" file in Mercurial:
https://bitbucket.org/pypy/extradoc/ https://bitbucket.org/pypy/extradoc/raw/extradoc/sprintinfo/london-2013
or on the pypy-dev mailing list if you do not yet have check-in rights:
http://mail.python.org/mailman/listinfo/pypy-dev
Remember that you may need a (insert country here)-to-UK power adapter. Please note that the UK is not within the Schengen zone, so non-EU and non-Switzerland citizens may require a specific visa. Please check travel regulations. Also, the UK uses pound sterling (GBP).
Friday, July 12, 2013
Software Transactional Memory lisp experiments
As covered in the previous blog post, the STM subproject of PyPy has been back on the drawing board. The result of this experiment is an STM-aware garbage collector written in C. This is finished by now: thanks to Armin's and Remi's work, we have a fully functional garbage collector and an STM system that can be used from any C program with enough effort. Using it is more than a little mundane, since you have to insert write and read barriers by hand everywhere in your code that reads or writes to garbage collector controlled memory. In the PyPy integration, this manual work is done automatically by the STM transformation in the interpreter.
However, to experiment some more, we created a minimal lisp-like/scheme-like interpreter (called Duhton), that follows closely CPython's implementation strategy. For anyone familiar with CPython's source code, it should be pretty readable. This interpreter works like a normal and very basic lisp variant, however it comes with a transaction builtin, that lets you spawn transactions using the STM system. We implemented a few demos that let you play with the transaction system. All the demos are running without conflicts, which means there are no conflicting writes to global memory and hence the demos are very amenable to parallelization. They exercise:
- arithmetics - demo/many_sqare_roots.duh
- read-only access to globals - demo/trees.duh
- read-write access to local objects - demo/trees2.duh
The latter two are very similar to the classic gcbench. STM-aware Duhton can be found in the stmgc repo, while the STM-less Duhton, which uses refcounting, can be found in the duhton repo under the base branch.
Below are some benchmarks. Note that this is a bit like comparing apples to oranges, since the single-threaded Duhton uses a refcounting GC while the STM version uses a generational GC. Future PyPy benchmarks will compare more apples to apples. Moreover, none of the benchmarks has any conflicts. Time is the total time that the benchmark took (not the CPU time), and there was very little variation between consecutive runs (definitely below 5%).
benchmark | 1 thread (refcount) | 1 thread (stm) | 2 threads | 4 threads |
square | 1.9s | 3.5s | 1.8s | 0.9s |
trees | 0.6s | 1.0s | 0.54s | 0.28s |
trees2 | 1.4s | 2.2s | 1.1s | 0.57s |
As you can see, the slowdown for STM vs single thread is significant (1.8x, 1.7x, 1.6x respectively), but still lower than 2x. However, running on multiple threads parallelizes the problem almost perfectly.
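For readers who want to check the ratios, here is a short sketch that recomputes them from the table. The timings are copied from the table above; the dictionary layout and the ratios helper are just for this illustration.

```python
# Wall-clock times in seconds, copied from the benchmark table.
timings = {
    "square": {"refcount": 1.9, "stm1": 3.5, "stm2": 1.8, "stm4": 0.9},
    "trees":  {"refcount": 0.6, "stm1": 1.0, "stm2": 0.54, "stm4": 0.28},
    "trees2": {"refcount": 1.4, "stm1": 2.2, "stm2": 1.1, "stm4": 0.57},
}

def ratios(row):
    """Return (STM slowdown on 1 thread, speedup on 2 threads, speedup on 4)."""
    slowdown = row["stm1"] / row["refcount"]
    speedup2 = row["stm1"] / row["stm2"]
    speedup4 = row["stm1"] / row["stm4"]
    return slowdown, speedup2, speedup4

for name in sorted(timings):
    slowdown, speedup2, speedup4 = ratios(timings[name])
    print("%-7s slowdown %.1fx, 2 threads %.1fx, 4 threads %.1fx"
          % (name, slowdown, speedup2, speedup4))
```

Relative to the single-threaded STM run, the speedups come out at about 1.9x to 2.0x on two threads and about 3.6x to 3.9x on four.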
While this is a significant milestone, we hope the next blog post will cover a fully working STM-enabled PyPy, with JIT work ongoing.
Cheers,
fijal on behalf of Remi Meier and Armin Rigo
Thursday, July 11, 2013
PyPy 2.1 beta
We would like to thank the Raspberry Pi Foundation for supporting the work to finish PyPy's ARM support.
You can download the PyPy 2.1 beta release here:
http://pypy.org/download.html
Highlights
- Bugfixes to the ARM JIT backend, so that ARM is now an officially
supported processor architecture
- Stacklet support on ARM
- Interpreter improvements
- Various numpy improvements
- Bugfixes to cffi and ctypes
- Bugfixes to the stacklet support
- Improved logging performance
- Faster sets for objects
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.3. It's fast due to its integrated tracing JIT compiler. This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. Also this release supports ARM machines running Linux 32bit - anything with ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard, Chromebook, Cubieboard, etc.) that supports VFPv3 should work. Both hard-float armhf/gnueabihf and soft-float armel/gnueabi builds are provided. armhf builds for Raspbian are created using the Raspberry Pi custom cross-compilation toolchain based on gcc-arm-linux-gnueabihf and should work on ARMv6 and ARMv7 devices running Debian or Raspbian. armel builds are built using the gcc-arm-linux-gnueabi toolchain provided by Ubuntu and currently target ARMv7.
Windows 64 work is still stalling, we would welcome a volunteer to handle that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv installed, you can follow instructions from pypy documentation on how to proceed. This document also covers other installation schemes.
Cheers,
the PyPy team.
Thursday, July 4, 2013
EuroPython
Hi all,
A short note: if you're at EuroPython right now and wondering if PyPy is dead because you don't see the obviously expected talk about PyPy, don't worry. PyPy is still alive and kicking. The truth is two-fold: (1) we missed the talk deadline (duh!)... but as importantly, (2) for various reasons we chose not to travel to Florence this year after our trip to PyCon US. (Antonio Cuni is at Florence but doesn't have a talk about PyPy either.)
Armin
Wednesday, June 12, 2013
Py3k status update #11
This is the 11th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.
Here's some highlights of the progress made since the previous update:
- PyPy py3k now matches CPython 3's hash code for
int/float/complex/Decimal/Fraction
- Various outstanding unicode identifier related issues were
resolved. E.g. test_importlib/pep263/ucn/unicode all now fully pass. Various
usage of identifiers (in particular type and module names) have been fixed to
handle non-ascii names -- mostly around display of reprs and exception
messages.
- The unicodedata database has been upgraded to 6.0.0.
- Windows support has greatly improved, though it could still use some more
help (but so does the default branch to a certain degree).
- Probably the last of the parsing related bugs/features have been taken care
of.
- Of course various other smaller miscellaneous fixes
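One visible consequence of the numeric hash scheme mentioned in the first bullet is that, in Python 3, numbers that compare equal hash equally across int, float, complex, Decimal and Fraction. The sketch below checks this on any Python 3 interpreter; the same_hash helper is invented for the example and is not part of the py3k test suite.

```python
from decimal import Decimal
from fractions import Fraction

def same_hash(*values):
    """Return True if all the given values have the same hash()."""
    return len(set(hash(v) for v in values)) == 1

# Equal numbers of different types must land in the same dict/set slot.
print(same_hash(1, 1.0, complex(1, 0), Decimal(1), Fraction(1, 1)))
print(same_hash(0.5, Decimal("0.5"), Fraction(1, 2)))

# Which is why a lookup with a different numeric type still finds the key.
table = {Fraction(1, 2): "half"}
print(table[0.5])
```

Running the same script on CPython 3 and on a py3k build of PyPy is a simple, if far from exhaustive, way to compare the two.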
This leaves the branch with only about 5 outstanding failures of the stdlib test
suite:
test_float
1 failing test about containment of floats in collections.
test_memoryview
Various failures: requires some bytes/str changes among other things (Manuel
Jacob has made some progress on this on the py3k-memoryview branch)
test_multiprocessing
1 or more tests deadlock on some platforms
test_sys and test_threading
2 failing tests for the New GIL's new API
Probably the biggest feature left to tackle is the New GIL.
We're now pretty close to pushing an initial release. We had planned for one
around PyCon, but having missed that we've put some more effort into the branch
to provide a more fully-fledged initial release.
Thanks to the following for their contributions: Manuel Jacob, Amaury Forgeot
d'Arc, Karl Ramm, Jason Chu and Christian Hudon.
cheers,
Phil
Wednesday, June 5, 2013
STM on the drawing board
Hi all!
This is an update about the Software Transactional Memory subproject of PyPy. I have some good news of progress. Also, Remi Meier will likely help me this summer. He did various investigations with PyPy-STM for his Master's Thesis and contributed back a lot of ideas and some code. Welcome again Remi!
I am also sorry that it seems to advance so slowly. Beyond the usual excuses --- I was busy with other things, e.g. releasing PyPy 2.0 --- I would like to reassure people: I'm again working on it, and the financial contributions are still there and reserved for STM (almost half the money is left, a big thank you again if you contributed!).
The real reason for the apparent slowness, though, is that it is really a research project. It's possible to either have hard deadlines, or to follow various tracks and keep improving the basics, but not both at the same time.
During the past month, in which I have again worked on STM, I kept following the second option; and I believe it was worth every second of it. Let me try to convince you :-)
The main blocker was that the STM subsystem, written in C, and the Garbage Collection (GC) subsystem, written in RPython, were getting harder and harder to coordinate. So what I did instead is to give up using RPython in favor of using only C for both. C is a good language for some things, which includes low-level programming where we must take care of delicate multithreading issues; RPython is not a good fit in that case, and wasn't designed to be.
I started a fresh Mercurial repo which is basically a stand-alone C library. This library (in heavy development right now!) gives any C program some functions to allocate and track GC-managed objects, and gives an actual STM+GC combination on these objects. It's possible (though rather verbose) to use it directly in C programs, like in a small example interpreter. Of course the eventual purpose is to link it with PyPy during translation to C, with all the verbose calls automatically generated.
Since I started this, bringing the GC closer to the STM, I kept finding new ways that the two might interact to improve the performance, maybe radically. Here is a summary of the current ideas.
When we run multiple threads, there are two common cases: one is to access (read and write) objects that have only been seen by the current thread; the other is to read objects seen by all threads, like in Python the modules/functions/classes, but not to write to them. Of course, writing to the same object from multiple threads occurs too, and it is handled correctly (that's the whole point), but it is a relatively rare case.
So each object is classified as "public" or "protected" (or "private", when they belong to the current transaction). Newly created objects, once they are no longer private, remain protected until they are read by a different thread. Now, the point is to use very different mechanisms for public and for protected objects. Public objects are visible by all threads, but read-only in memory; to change them, a copy must be made, and the changes are written to the copy (the "redolog" approach to STM). Protected objects, on the other hand, are modified in-place, with (if necessary) a copy of them being made for the sole purpose of a possible abort of the transaction (the "undolog" approach).
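To make the difference between the two write strategies concrete, here is a toy, single-threaded model in Python. It is only a sketch: the real library is written in C and tracks versions with revision numbers and barriers, and every name below (Obj, Transaction, redo, undo) is invented for the illustration.

```python
class Obj(object):
    """Toy GC object: a dict of fields plus a public/protected flag."""
    def __init__(self, fields, public=False):
        self.fields = dict(fields)
        self.public = public


class Transaction(object):
    """Toy, single-threaded model of the two write strategies."""
    def __init__(self):
        self.redo = {}   # public object -> private copy with pending writes
        self.undo = {}   # protected object -> backup of its original fields

    def write(self, obj, name, value):
        if obj.public:
            # "redolog": leave the public version alone, write to a copy.
            if obj not in self.redo:
                self.redo[obj] = Obj(obj.fields)
            self.redo[obj].fields[name] = value
        else:
            # "undolog": keep one backup, then modify the object in place.
            self.undo.setdefault(obj, dict(obj.fields))
            obj.fields[name] = value

    def read(self, obj, name):
        # A transaction must see its own pending writes to public objects.
        return self.redo.get(obj, obj).fields[name]

    def commit(self):
        for obj, copy in self.redo.items():
            obj.fields = copy.fields   # stand-in for publishing a new version
        self.redo, self.undo = {}, {}

    def abort(self):
        for obj, backup in self.undo.items():
            obj.fields = backup        # roll protected objects back
        self.redo, self.undo = {}, {}  # pending copies are just dropped


shared = Obj({"x": 1}, public=True)    # visible to all threads
local = Obj({"y": 1})                  # protected: only seen by this thread
t = Transaction()
t.write(shared, "x", 2)
t.write(local, "y", 2)
print(shared.fields["x"], t.read(shared, "x"), local.fields["y"])
t.abort()
print(shared.fields["x"], local.fields["y"])
```

Before the abort, the public object still holds its old value while the protected one has already been changed in place; after the abort both hold their original values again. With the redo log an abort is free and a commit does the work, while with the undo log it is the other way around.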
This is combined with a generational GC similar to PyPy's --- but here, each thread gets its own nursery and does its own "minor collections", independently of the others.
So objects are by default protected; when another thread tries to follow a pointer to them, then it is that other thread's job to carefully "steal" the object and turn it public (possibly making a copy of it if needed, e.g. if it was still a young object living in the original nursery).
The same object can exist temporarily in multiple versions: any number of public copies; at most one active protected copy; and optionally one private copy per thread (this is the copy as currently seen by the transaction in progress on that thread). The GC cleans up the unnecessary copies.
These ideas are variants and extensions of the same basic idea of keeping multiple copies with revision numbers to track them. Moreover, "read barriers" and "write barriers" are used by the C program calling into this library in order to be sure that it is accessing the right version of the object. In the currently investigated variant I believe it should be possible to have rather cheap read barriers, which would definitely be a major speed improvement over the previous variants. Actually, as far as I know, it would be a major improvement over most of the other existing STMs: in them, the typical read barrier involves following chains of pointers, and checking some dictionary to see if this thread has a modified local copy of the object. The difference with a read barrier that can resolve most cases in a few CPU cycles should be huge.
So, this is research :-) It is progressing, and at some point I'll be satisfied with it and stop rewriting everything; and then the actual integration into PyPy should be straightforward (there is already code to detect where the read and write barriers need to be inserted, where transactions can be split, etc.). Then there is support for the JIT to be written, and so on. But more about it later.
The purpose of this post was to give you some glimpses into what I'm working on right now. As usual, no plan for release yet. But you can look forward to seeing the C library progress. I'll probably also start soon some sample interpreter in C, to test the waters (likely a revival of duhton). If you know nothing about Python but all about the C-level multithreading issues, now is a good time to get involved :-)
Thanks for reading!
Armin
Monday, June 3, 2013
NumPyPy status update
May was the first month I was paid to work on NumPyPy (thanks to all who donated!). Here is what I worked on during this period:
- It is now possible to use subarrays.
- It is now possible to pickle ndarrays (including those using subarrays), dtypes and scalars, the pickling protocol is the same as numpy's.
Cheers
Romain Guillebert
Tuesday, May 21, 2013
PyPy 2.0.2 - Fermi Panini
We're pleased to announce PyPy 2.0.2. This is a stable bugfix release over 2.0 and 2.0.1. You can download it here:
http://pypy.org/download.html
It fixes a crash in the JIT when calling external C functions (with ctypes/cffi) in a multithreaded context.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.0 and cpython 2.7.3 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. Support for ARM is progressing but not bug-free yet.
Highlights
This release contains only the fix described above. A crash (or wrong results) used to occur if all these conditions were true:
- your program is multithreaded;
- it runs on a single-core machine or a heavily-loaded multi-core one;
- it uses ctypes or cffi to issue external calls to C functions.
This was fixed in the branch emit-call-x86 (see the example file bug1.py).
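For illustration, here is the general shape of a program that meets the three conditions above. This is an independent sketch and not the bug1.py file; it assumes a Unix system where the C library can be loaded through ctypes, and the worker and run helpers are made up for the example.

```python
import ctypes
import ctypes.util
import threading

# Load the C library; find_library("c") is a common way to locate it on Unix.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

def worker(results, index, n):
    """Call the C function abs() n times and store the sum of the results."""
    total = 0
    for i in range(n):
        total += libc.abs(-i)          # external call to a C function
    results[index] = total

def run(num_threads=4, n=10000):
    """Run several workers in parallel and return their results."""
    results = [None] * num_threads
    threads = [threading.Thread(target=worker, args=(results, i, n))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run())
```

Every worker should report the same sum, n * (n - 1) / 2; differing or wrong results would be the kind of symptom described above.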
Cheers, arigo et al. for the PyPy team
Thursday, May 16, 2013
PyPy 2.0.1 - Bohr Smørrebrød
We're pleased to announce PyPy 2.0.1. This is a stable bugfix release over 2.0. You can download it here:
http://pypy.org/download.html
The fixes are mainly about fatal errors or crashes in our stdlib. See below for more details.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.0 and cpython 2.7.3 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. Support for ARM is progressing but not bug-free yet.
Highlights
- fix an occasional crash in the JIT that ends in RPython Fatal error: NotImplementedError.
- id(x) is now always a positive number (except on int/float/long/complex). This fixes an issue in _sqlite.py (mostly for 32-bit Linux).
- fix crashes of callback-from-C-functions (with cffi) when used together with Stackless features, on asmgcc (i.e. Linux only). Now gevent should work better.
- work around an eventlet issue with socket._decref_socketios().
Cheers, arigo et al. for the PyPy team
Saturday, May 11, 2013
Numpy Status Update
I have been working on NumPyPy since the end of April; here is a short update:
- I implemented pickling support on ndarrays and dtypes. It will be compatible with numpy's pickling protocol once the "numpypy" module is renamed to "numpy".
- I am now working on subarrays.
Thursday, May 9, 2013
PyPy 2.0 - Einstein Sandwich
We're pleased to announce PyPy 2.0. This is a stable release that brings a swath of bugfixes, small performance improvements and compatibility fixes. PyPy 2.0 is a big step for us and we hope in the future we'll be able to provide stable releases more often.
You can download the PyPy 2.0 release here:
http://pypy.org/download.html
The two biggest changes since PyPy 1.9 are:
- stackless is now supported including greenlets, which means eventlet and gevent should work (but read below about gevent)
- PyPy now contains release 0.6 of cffi as a builtin module, which is the preferred way of calling C from Python that works well on PyPy
If you're using PyPy for anything, it would help us immensely if you fill out the following survey: http://bit.ly/pypysurvey This is for the developers' eyes only and we will not make any information public without your agreement.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.0 and cpython 2.7.3 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. Windows 64 work is still stalling, we would welcome a volunteer to handle that. ARM support is on the way, as you can see from the recently released alpha for ARM.
Highlights
- Stackless including greenlets should work. For gevent, you need to check out pypycore and use the pypy-hacks branch of gevent.
- cffi is now a module included with PyPy. (cffi also exists for CPython; the two versions should be fully compatible.) It is the preferred way of calling C from Python that works on PyPy.
- Callbacks from C are now JITted, which means XML parsing is much faster.
- A lot of speed improvements in various language corners, most of them small, but speeding up some particular corners a lot.
- The JIT was refactored to emit machine code which manipulates a "frame" that lives on the heap rather than on the stack. This is what makes Stackless work, and it could bring another future speed-up (not done yet).
- A lot of stability issues fixed.
- Much of the numpypy array code was refactored, which resulted in the removal of lazy expression evaluation. On the other hand, we now have more complete dtype support and support more array attributes.
Cheers,
fijal, arigo and the PyPy team
Tuesday, May 7, 2013
PyPy 2.0 alpha for ARM
Hello.
We're pleased to announce an alpha release of PyPy 2.0 for ARM. This is mostly a technology preview, as we know the JIT is not yet stable enough for the full release. However, please try your stuff on ARM and report back.
This is the first release that supports a range of ARM devices - anything with ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard, Chromebook, Cubieboard, etc.) that supports VFPv3 should work. We provide builds with support for both ARM EABI variants: hard-float and, for some older operating systems, soft-float.
This release comes with a list of limitations. Consider it alpha quality, not suitable for production:
- stackless support is missing.
- the assembler produced is not always correct, but we successfully managed to run large parts of our extensive benchmark suite, so most stuff should work.
You can download the PyPy 2.0 alpha ARM release here (including a deb for raspbian):
http://pypy.org/download.html
Part of the work was sponsored by the Raspberry Pi foundation.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.3. It's fast due to its integrated tracing JIT compiler.
This release supports ARM machines running 32-bit Linux. Both hard-float armhf and soft-float armel builds are provided. The armhf builds are created using the Raspberry Pi custom cross-compilation toolchain based on gcc-arm-linux-gnueabihf and should work on ARMv6 and ARMv7 devices running at least Debian or Ubuntu. The armel builds are built using the gcc-arm-linux-gnueabi toolchain provided by Ubuntu and currently target ARMv7. If there is interest in other builds, such as gnueabi for ARMv6 or builds that do not require a VFP, let us know in the comments or on IRC.
Benchmarks
Everybody loves benchmarks. Here is a table of our benchmark suite (for ARM we don't provide it yet on http://speed.pypy.org, unfortunately).
This is a comparison of a Cortex A9 processor with 4M of cache and a Xeon W3580 with 8M of L3 cache. The set of benchmarks is the subset of what we run for http://speed.pypy.org that finishes in reasonable time. The ARM machine was provided by Calxeda. Columns are respectively:
- benchmark name
- PyPy speedup over CPython on ARM (Cortex A9)
- PyPy speedup over CPython on x86 (Xeon)
- speedup on Xeon vs Cortex A9, as measured on PyPy
- speedup on Xeon vs Cortex A9, as measured on CPython
- relative speedup (the PyPy-over-CPython speedup on x86 divided by the same speedup on ARM)
Benchmark | PyPy vs CPython (arm) | PyPy vs CPython (x86) | x86 vs arm (pypy) | x86 vs arm (cpython) | relative speedup |
ai | 3.61 | 3.16 | 7.70 | 8.82 | 0.87 |
bm_mako | 3.41 | 2.11 | 8.56 | 13.82 | 0.62 |
chaos | 21.82 | 17.80 | 6.93 | 8.50 | 0.82 |
crypto_pyaes | 22.53 | 19.48 | 6.53 | 7.56 | 0.86 |
django | 13.43 | 11.16 | 7.90 | 9.51 | 0.83 |
eparse | 1.43 | 1.17 | 6.61 | 8.12 | 0.81 |
fannkuch | 6.22 | 5.36 | 6.18 | 7.16 | 0.86 |
float | 5.22 | 6.00 | 9.68 | 8.43 | 1.15 |
go | 4.72 | 3.34 | 5.91 | 8.37 | 0.71 |
hexiom2 | 8.70 | 7.00 | 7.69 | 9.56 | 0.80 |
html5lib | 2.35 | 2.13 | 6.59 | 7.26 | 0.91 |
json_bench | 1.12 | 0.93 | 7.19 | 8.68 | 0.83 |
meteor-contest | 2.13 | 1.68 | 5.95 | 7.54 | 0.79 |
nbody_modified | 8.19 | 7.78 | 6.08 | 6.40 | 0.95 |
pidigits | 1.27 | 0.95 | 14.67 | 19.66 | 0.75 |
pyflate-fast | 3.30 | 3.57 | 10.64 | 9.84 | 1.08 |
raytrace-simple | 46.41 | 29.00 | 5.14 | 8.23 | 0.62 |
richards | 31.48 | 28.51 | 6.95 | 7.68 | 0.91 |
slowspitfire | 1.28 | 1.14 | 5.91 | 6.61 | 0.89 |
spambayes | 1.93 | 1.27 | 4.15 | 6.30 | 0.66 |
sphinx | 1.01 | 1.05 | 7.76 | 7.45 | 1.04 |
spitfire | 1.55 | 1.58 | 5.62 | 5.49 | 1.02 |
spitfire_cstringio | 9.61 | 5.74 | 5.43 | 9.09 | 0.60 |
sympy_expand | 1.42 | 0.97 | 3.86 | 5.66 | 0.68 |
sympy_integrate | 1.60 | 0.95 | 4.24 | 7.12 | 0.60 |
sympy_str | 0.72 | 0.48 | 3.68 | 5.56 | 0.66 |
sympy_sum | 1.99 | 1.19 | 3.83 | 6.38 | 0.60 |
telco | 14.28 | 9.36 | 3.94 | 6.02 | 0.66 |
twisted_iteration | 11.60 | 7.33 | 6.04 | 9.55 | 0.63 |
twisted_names | 3.68 | 2.83 | 5.01 | 6.50 | 0.77 |
twisted_pb | 4.94 | 3.02 | 5.10 | 8.34 | 0.61 |
It seems that the Cortex A9, while significantly slower than the Xeon, has higher slowdowns with a large interpreter (CPython) than with a JIT compiler (PyPy). This comes as a surprise to me, especially since our ARM assembler is not nearly as polished as our x86 assembler. As for the causes, various people mentioned the branch predictor, but I would not like to speculate without actually knowing.
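The last column of the table can be re-derived from the two PyPy-vs-CPython columns. The sketch below does this for a handful of rows copied from the table. The helper name `relative_speedup` is made up for this example, and small differences from the published figures are expected because the table itself is rounded to two decimals.

```python
# name: (PyPy vs CPython on ARM, PyPy vs CPython on x86, published relative speedup)
ROWS = {
    "ai": (3.61, 3.16, 0.87),
    "bm_mako": (3.41, 2.11, 0.62),
    "chaos": (21.82, 17.80, 0.82),
    "float": (5.22, 6.00, 1.15),
    "raytrace-simple": (46.41, 29.00, 0.62),
}


def relative_speedup(arm_speedup, x86_speedup):
    """How large the x86 speedup is relative to the ARM speedup."""
    return x86_speedup / arm_speedup


recomputed = {
    name: relative_speedup(arm, x86) for name, (arm, x86, _) in ROWS.items()
}

for name in sorted(ROWS):
    published = ROWS[name][2]
    print("%-16s published %.2f  recomputed %.3f" % (name, published, recomputed[name]))
```

A value below 1 means PyPy gains more over CPython on ARM than it does on x86, which is the case for most rows.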
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv installed, you can follow the instructions in the PyPy documentation on how to proceed. This document also covers other installation schemes.
We would not recommend using PyPy on ARM in production just yet; however, the day of a stable PyPy ARM release is not far off.
Cheers,
fijal, bivab, arigo and the whole PyPy team
Sunday, April 7, 2013
PyPy 2.0 beta 2 released
We're pleased to announce the 2.0 beta 2 release of PyPy. This is a major release and we're getting very close to 2.0 final; however, it includes quite a few new features that require further testing. Please test and report issues, so we can have a rock-solid 2.0 final. It also includes a performance regression of about 5% compared to 2.0 beta 1 that we hope to fix before 2.0 final. ARM support is not working yet, and we're working hard to make it happen before 2.0 final. The new major features are:
- The JIT now supports stackless features, that is, greenlets and stacklets. This means that the JIT can now optimize code that switches context. It enables running eventlet and gevent on PyPy (although gevent requires some special support that's not quite finished, read below).
- This is the first PyPy release that includes cffi as a core library. Version 0.6 comes included in the PyPy library. cffi has seen a lot of adoption among library authors and we believe it's the best way to wrap C libraries. You can see examples of cffi usage in _curses.py and _sqlite3.py in the PyPy source code.
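For readers who have not seen cffi before, here is a minimal sketch of calling a C function through it in ABI mode. The choice of `strlen` and the helper name `c_strlen` are only for illustration. The sketch assumes a POSIX system, where `ffi.dlopen(None)` loads the standard C library, and it returns `None` when cffi is not importable (on CPython it is a separate install).

```python
try:
    from cffi import FFI  # bundled with PyPy; a separate package on CPython
except ImportError:
    FFI = None


def c_strlen(data):
    """Return strlen(data) as computed by libc, or None if cffi is missing."""
    if FFI is None:
        return None
    ffi = FFI()
    # Declare the C function we want to call, using plain C syntax.
    ffi.cdef("size_t strlen(const char *s);")
    # Assumption: POSIX. dlopen(None) gives access to the standard C library.
    libc = ffi.dlopen(None)
    return libc.strlen(data)


print(c_strlen(b"hello"))
```

The declaration passed to `cdef` is ordinary C, so wrapping a library mostly means copying the relevant prototypes from its header.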
You can download the PyPy 2.0 beta 2 release here:
http://pypy.org/download.html
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.3. It's fast (pypy 2.0 beta 2 and cpython 2.7.3 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. It also supports ARM machines running Linux; however, this is disabled for the beta 2 release. Windows 64 work is still stalling; we would welcome a volunteer to handle that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv installed, you can follow the instructions in the PyPy documentation on how to proceed. This document also covers other installation schemes.
Highlights
- cffi is officially supported by PyPy. It comes included in the standard library; just use import cffi
- stackless support - eventlet just works, and gevent requires pypycore and the pypy-hacks branch of gevent (which mostly disables cython-based modules)
- callbacks from C are now much faster. pyexpat is about 3x faster, and cffi callbacks gain about the same
- __length_hint__ is implemented (PEP 424)
- a lot of numpy improvements
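To make the `__length_hint__` item concrete, here is a small sketch of an iterator that cannot support `len()` but can still report how many items remain. `Countdown` is a hypothetical class written for this example, and `operator.length_hint` is the public accessor that later Python versions (3.4 and up) provide for the PEP 424 protocol, so the snippet assumes such an interpreter.

```python
import operator


class Countdown(object):
    """Iterator yielding start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    def __length_hint__(self):
        # An estimate of the remaining items; consumers such as list()
        # can use it to presize their storage.
        return self.current


counter = Countdown(3)
print(operator.length_hint(counter))  # remaining items before iteration
print(list(counter))
print(operator.length_hint(counter))  # remaining items after exhaustion
```

The hint is advisory only: consumers must still work correctly if it is wrong, which is why it is kept separate from `__len__`.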
Improvements since 1.9
- JIT hooks are now a powerful tool to introspect the JITting process that PyPy performs
- various performance improvements compared to 1.9 and 2.0 beta 1
- operations on long objects are now as fast as in CPython (from roughly 2x slower)
- we now have special strategies for dicts, sets and lists that contain unicode strings, so such collections are now both faster and more compact.