Actually, it turns out that the PyPy JIT compiler produces code which is fast enough to do realtime video processing using two simple algorithms implemented by Håkan Ardö.
sobel.py implements a classical way of locating edges in images, the Sobel operator. It is an approximation of the magnitude of the image gradient. The processing time is spent on two convolutions between the image and 3x3 kernels.
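To make the description concrete, here is a minimal pure-Python sketch of the Sobel operator. It is not the code from sobel.py: the function names and the list-of-rows image representation are assumptions made for illustration. The kernels are applied without flipping, which only changes the sign of each derivative and not the resulting magnitude.

```python
import math

# 3x3 Sobel kernels for the horizontal and vertical derivatives.
KERNEL_X = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]
KERNEL_Y = [[-1, -2, -1],
            [0, 0, 0],
            [1, 2, 1]]


def apply_kernel(img, kernel, x, y):
    """Weighted sum of the 3x3 neighbourhood around (x, y)."""
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += kernel[dy + 1][dx + 1] * img[y + dy][x + dx]
    return total


def sobel_magnitude(img):
    """Gradient magnitude of a grayscale image given as a list of rows.

    Border pixels are left at 0 because their neighbourhood is incomplete.
    """
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = apply_kernel(img, KERNEL_X, x, y)
            gy = apply_kernel(img, KERNEL_Y, x, y)
            out[y][x] = math.sqrt(gx * gx + gy * gy)
    return out
```

Nested loops over every pixel like these are exactly the kind of code that is slow on an interpreter and that a tracing JIT can compile well.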
magnify.py implements a pixel coordinate transformation that rearranges the pixels in the image to form a magnifying effect in the center. It consists of a single loop over the pixels in the output image copying pixels from the input image.
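The coordinate transformation can be sketched as follows. This is an illustration, not the code from magnify.py: the mapping (a source radius of r*r/max_radius), the max_radius parameter, and the list-of-rows image are all assumptions chosen to keep the example short.

```python
import math


def magnify(img, max_radius):
    """Return a copy of img (a list of rows) magnified around its centre."""
    height, width = len(img), len(img[0])
    cx, cy = width // 2, height // 2
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            r = math.sqrt(dx * dx + dy * dy)
            if r < max_radius:
                # Pull the source coordinate towards the centre: scaling the
                # offset by r / max_radius gives a source radius of
                # r * r / max_radius, which enlarges the middle of the image.
                scale = r / max_radius
                src_x = cx + int(dx * scale)  # truncation picks a nearby pixel
                src_y = cy + int(dy * scale)
                out[y][x] = img[src_y][src_x]
    return out
```

Pixels outside the circle are copied unchanged, so only the centre of the image is distorted.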
You can try it yourself by downloading the appropriate demo:
- pypy-image-demo.tar.bz2: this archive contains only the source code, use this if you have PyPy already installed
- pypy-image-demo-full.tar.bz2: this archive contains both the source code and prebuilt PyPy binaries for linux 32 and 64 bits
To run the demo, you need to have mplayer installed on your system. The demo has been tested only on linux, it might (or not) work also on other systems:
$ pypy pypy-image-demo/sobel.py
$ pypy pypy-image-demo/magnify.py
By default, the two demos use an example AVI file. To have more fun, you can use your webcam by passing the appropriate mplayer parameters to the scripts, e.g.:
$ pypy demo/sobel.py tv://
By default, magnify.py uses nearest-neighbor interpolation. Adding the option -b switches to bilinear interpolation, which gives a smoother result:
$ pypy demo/magnify.py -b
There is only a single implementation of the algorithm in magnify.py. The two different interpolation methods are implemented by subclassing the class used to represent images and embedding the interpolation within the pixel access method. PyPy is able to achieve good performance with this kind of abstraction because it can inline the pixel access method and specialize the implementation of the algorithm. In C++, that kind of pixel access method would be virtual, and you would need to use templates to get the same effect without incurring runtime overhead.
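The subclassing approach might look roughly like the sketch below. The class and method names (Image, BilinearImage, pixel, sample_row) are hypothetical and are not taken from the demo; the point is only that the algorithm calls img[x, y] and the image class decides how a fractional coordinate is resolved.

```python
import math


class Image(object):
    """Grayscale image; pixel access rounds to the nearest neighbour."""

    def __init__(self, width, height, data):
        self.width = width
        self.height = height
        self.data = data  # flat list of pixel values, row by row

    def pixel(self, x, y):
        # Clamp integer coordinates to the image so edge lookups stay valid.
        x = min(max(x, 0), self.width - 1)
        y = min(max(y, 0), self.height - 1)
        return self.data[y * self.width + x]

    def __getitem__(self, pos):
        x, y = pos
        return self.pixel(int(math.floor(x + 0.5)), int(math.floor(y + 0.5)))


class BilinearImage(Image):
    """Same storage, but pixel access blends the four surrounding pixels."""

    def __getitem__(self, pos):
        x, y = pos
        x0, y0 = int(math.floor(x)), int(math.floor(y))
        fx, fy = x - x0, y - y0
        top = (1 - fx) * self.pixel(x0, y0) + fx * self.pixel(x0 + 1, y0)
        bottom = (1 - fx) * self.pixel(x0, y0 + 1) + fx * self.pixel(x0 + 1, y0 + 1)
        return (1 - fy) * top + fy * bottom


def sample_row(img, y, xs):
    """Stand-in for the algorithm: it only ever calls img[x, y]."""
    return [img[x, y] for x in xs]
```

The same sample_row function works unchanged with either class. A tracing JIT sees which __getitem__ is actually called inside the hot loop and can inline it, so the abstraction need not cost anything at run time.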
For sobel.py:
- PyPy: ~47.23 fps
- CPython: ~0.08 fps
For magnify.py:
- PyPy: ~26.92 fps
- CPython: ~1.78 fps
This means that on sobel.py, PyPy is 590 times faster. On magnify.py the difference is much less evident and the speedup is "only" 15x.
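The speedup figures follow directly from the measured frame rates. As a quick check (the helper name is made up for this example), the ratio of roughly 47 to 0.08 fps gives the 590x figure for sobel.py, and roughly 27 to 1.8 fps gives the 15x figure for magnify.py:

```python
def speedup(pypy_fps, cpython_fps):
    """Ratio of the two frame rates: how many times faster PyPy is."""
    return pypy_fps / cpython_fps


# Frame rates as reported in the post.
sobel_speedup = speedup(47.23, 0.08)
magnify_speedup = speedup(26.92, 1.78)
```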
It must be noted that this is an extreme example of what PyPy can do. In particular, you cannot expect (yet :-)) PyPy to be fast enough to run an arbitrary video processing algorithm in real time, but the demo still proves that PyPy has the potential to get there.
Pypy is awesome!
I have a n00b problem: On Mac OS X 10.5.8, the precompiled pypy binary crashes with this message:
dyld: Library not loaded: /usr/lib/libssl.0.9.8.dylib
What's up with this? Thanks, and sorry for being offtopic.
I saw this demo recently when Dan Roberts presented at Baypiggies. We broke into spontaneous applause when the pypy runtime ran at a watchable speed after cpython ran at less than 1 frame/second. Very impressive!
Anonymous, can you read?
"prebuilt PyPy binaries for linux 32 and 64 bits"
"The demo has been tested only on linux, it might (or not) work also on other systems"
Mac OS X is not Linux.
Perhaps add a comment to sobel.py explaining what "pypyjit.set_param(trace_limit=200000)" does?
The only change I'd like to see in this project is its name... Trying to gather news from twitter for example, makes me search amongst thousands of comments in japanese (pypy means "boobies" in japanese), other incomprehensible comments in malay and hundreds of music fans of Look-Ka PYPY (WTF??)
Other Anonymous: Yes, I can read. I should have given a bit more context, but I was offtopic anyway. My goal was not running the demo, my goal was running pypy. I used the OS X binary from pypy.org. For those who are really good at reading, this was probably clear from the fact that my binary only crashed at library loading time.
@Anonymous: most probably, the prebuilt PyPy for Mac OS X was built on a system different (older?) than yours.
For a quick workaround, you can try to do "ln -s /usr/lib/libssl-XXX.dylib /usr/lib/libssl.0.9.8.dylib". This should at least make it work, but of course it might break in case you actually use libssl.
The proper fix is to recompile PyPy yourself.
@schmichael
To avoid the potential problem of infinite tracing, the JIT bails out if it traces "too much", depending on the trace_limit.
In this case, the default trace_limit is not enough to fully optimize the whole algorithm, hence we need to help the JIT by telling it to trace a bit more than usual.
I agree that having to mess with the internal parameters of the JIT is suboptimal. I plan to address this issue in the coming weeks.
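For reference, the call discussed above can be wrapped so that the same script still runs on interpreters without the pypyjit module. The wrapper function is a hypothetical helper, not part of the demo; only the pypyjit.set_param(trace_limit=200000) call itself comes from sobel.py as quoted in this thread.

```python
def raise_trace_limit(limit=200000):
    """Raise the JIT trace limit when running on PyPy; do nothing elsewhere.

    Returns True if the parameter was set, False if pypyjit is unavailable.
    """
    try:
        import pypyjit  # only available on PyPy
    except ImportError:
        return False
    # Let the JIT record longer traces before it gives up on a loop.
    pypyjit.set_param(trace_limit=limit)
    return True
```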
How does it perform against python-opencv?
Antonio: Thanks for the quick reply. Unfortunately pypy can't be misled with the symlink hack: "Reason: Incompatible library version: pypy requires version 0.9.8 or later, but libssl.0.9.8.dylib provides version 0.9.7"
It seems like the prebuilt binary was created on 10.6, and it does not work on vanilla 10.5 systems. Not a big deal, but it is good to know.
Thanks for posting this. pypy is great. I'm trying to figure out how to write modules in RPython. I was sad that I missed the Baypiggies presentation.
Hello,
it's lovely that pypy can do this. This result is amazing, wonderful, and is very kittens. pypy is fast at running python code (*happy dance*).
But.
It also makes kittens cry when you compare to CPython in such a way.
The reality is that CPython users would do this using a library like numpy, opencv, pygame, scipy, pyopengl, freej (the list of real time video processing python libraries is very large, so I won't list them all here).
Of course python can do this task well, and has for more than 10 years.
This code does not take advantage of vectorization through efficient SIMD, multiple cores or graphics hardware, and isn't careful with reusing memory - so is not within an order of magnitude of the speed of CPython code with libraries doing real time video processing.
Anyone within the field would ask about using these features.
Another question they would ask is about pauses. How does the JIT affect pauses in animation? What are the rules for when the JIT warms up, and how can you tell when the code will start running fast? How does the GC affect pauses? Is there a way to turn off the GC, or to reuse memory in some way, so that the GC won't cause the program to fail? (Remember that in realtime a pause is a program failure.) Does the GC pool memory of similar size objects automatically? Does the GC work well with 256MB-1GB-16GB sized objects? In a 16GB system, can you use 15GB of objects, and then delete those objects to then use another 15GB of different objects? Or will the program swap, or fragment memory causing pauses?
Please don't make kittens cry. Be realistic with CPython comparisons.
At the moment the python implementation is not as elegant as a vector style implementation. A numpy/matlab/CUDA/OpenCL approach looks really nice for this type of code. One speed up might be to reuse memory, or act in place where possible. For example, not copying the image... unless the GC magically takes care of that for you.
@illume: More or less everyone knows that you can speed up your code by writing or using an extension library. Unfortunately this introduces a dependency on the library (for instance libssl mentioned in the comment thread) and it usually increases the complexity of your code.
Using PyPy you can solve computationally intensive problems in plain Python. Writing in Python saves development time. This is what the comparison is all about.
Hi @jacob: below is code which can run on multiple cores, with vectorised SIMD, or on a GPU if you like. You'll notice that it is way shorter and more elegant than the 'pure python' code.
def sobelEdgeDetect(im=DImage, p=Position):
    wX = outerproduct([1,2,1],[-1,0,1])
    wY = transpose(wX)
    Gx = convolve(wX,im,p)
    Gy = convolve(wY,im,p)
    return sqrt(Gx**2 + Gy**2)
If pypy is 5x slower than C, and SIMD is 5x faster than C... and using multiple cores is 8x faster than a single core you can see this python code is (5 * 5 * 8) 200x faster than the pypy code. This is just comparing CPU based code. Obviously GPU code for real time image processing is very fast compared to CPU based code.
Things like numpy, pyopengl etc. come packaged with various OSes, but choosing those dependencies compared to depending on pypy is a separate issue I guess (but many cpython packaged libraries are packaged for more platforms than pypy).
Of course using tested, and debugged existing code written in python will save you development time: for example using sobel written with the scipy library:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.filters.sobel.html
The fact is CPython is fast enough, more elegant, and will save you time for realtime image processing - unless you ignore the reality that people use CPython libraries for these tasks.
Finally the given code does not prove that the frames are all processed in realtime. They give an average time over all of the frames. Realtime video requires that you meet your target speed for every frame. It would need to be extended to measure each frame to make sure that each frame is within the required time budget.
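A per-frame measurement of the kind described here could be sketched as below, assuming Python 3 for time.perf_counter. The function names and the process callable are hypothetical and do not come from the demo.

```python
import time


def frame_times(process, frames):
    """Run process(frame) on every frame and return each duration in seconds."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        process(frame)
        durations.append(time.perf_counter() - start)
    return durations


def missed_frames(durations, budget):
    """Indices of frames whose processing time exceeded the budget."""
    return [i for i, d in enumerate(durations) if d > budget]
```

For example, durations of 0.01, 0.05 and 0.02 seconds average below a 0.04 second budget, yet missed_frames still reports the second frame as late, which is the distinction between average and per-frame speed made above.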
@illume: I think you completely missed the point of the blog post. This is not about "you should use pypy to do video processing", it's about "pypy runs pure python code very fast".
@Antonio Cuni, I'm saying the post reads like cpython can not do "realtime image processing in python" and that pypy can.
@illume:
This example shows pure python code and compares its execution time in cpython and pypy. Nothing else. Graphics code written in pure python that does not run dreadfully slowly had, to my knowledge, never been shown before.
If enough people understand the potential of this technique and put their time into it, we will hopefully come closer to your (5 * 5 * 8) acceleration in pypy, too.
I will for sure work on this.
SIMD instructions and multi-core support are things PyPy has the potential to support, given time and funding.
The typical optimization path here would be implementing the necessary numpy array operations for the algorithms described. I wonder how a proper numpy implementation would compare.
I think you are still missing the point of the post. It was not "use pure Python to write your video processing algos". That's of course nonsense, given the amount and quality of existing C extension modules to do that.
The point is that when you want to experiment with writing a new algorithm of any kind, it is now possible to do it in pure Python instead of, say, C code. If later your project needs to move past the experimentation phase, you will have to decide if you want to keep that Python code, rewrite it in C, or (if applicable) use SIMD instructions from Python or from C, or whatever.
The real point of this demo is to show that PyPy makes Python fast enough as an early experimentation platform for almost any kind of algorithm. If you can write in Python instead of in C, you'll save 50% of your time (random estimate); and then for the 5% of projects that go past the experimentation phase and where Python is not enough (other random estimate), spend more time learning other techniques and using them. The result is still in your favor, and it's only going to be more so as PyPy continues to improve.
I was hoping to experiment with this amazing demo on my Windows-based computers. Any advice for how I would start making the required changes?
Jacob
Unfortunately the server died :( I'm not sure where exactly the packaged demos are, but they can be run from:
https://bitbucket.org/pypy/extradoc/src/extradoc/talk/iwtc11/benchmarks/image
The python code for this now seems to be here:
https://bitbucket.org/pypy/extradoc/src/talk/dls2012/demo
The scripts can be found here:
https://bitbucket.org/pypy/extradoc/src/153804ce4fc3/talk/dls2012/demo