This gem was posted in the ijson issue tracker after some discussion on #pypy, and Dav1dde kindly allowed us to repost it here:
"So, I was playing around with parsing huge JSON files (19GiB, testfile is ~520MiB) and wanted to try a sample code with PyPy, turns out, PyPy needed ~1:30-2:00 whereas CPython 2.7 needed ~13 seconds (the pure python implementation on both pythons was equivalent at ~8 minutes).
"Apparently ctypes is really bad performance-wise, especially on PyPy. So I made a quick CFFI mockup: https://gist.github.com/Dav1dde/c509d472085f9374fc1d
Before:
CPython 2.7:
python -m emfas.server size dumps/echoprint-dump-1.json
11.89s user 0.36s system 98% cpu 12.390 total
PyPy:
python -m emfas.server size dumps/echoprint-dump-1.json
117.19s user 2.36s system 99% cpu 1:59.95 total
After (CFFI):
CPython 2.7:
python jsonsize.py ../dumps/echoprint-dump-1.json
8.63s user 0.28s system 99% cpu 8.945 total
PyPy:
python jsonsize.py ../dumps/echoprint-dump-1.json
4.04s user 0.34s system 99% cpu 4.392 total
"
Dav1dde goes into more detail in the issue itself, but we just want to emphasize a few significant points from this brief interchange:
- His CFFI implementation is faster than the ctypes one even on CPython 2.7 (8.9s versus 12.4s total).
- PyPy + CFFI (4.4s total) is faster than CPython with either binding, even though the heavy parsing is done in C code.
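To make the ctypes versus CFFI comparison concrete, here is a minimal sketch of calling the same C function through both bindings. It is not Dav1dde's code (that lives in the gist linked above) and it does not touch a JSON parser: `strlen` from the C library is only a stand-in, and the helper names `strlen_ctypes`, `strlen_cffi` and `time_calls` are made up for illustration. It assumes a POSIX system, where passing `None` to the loader gives access to the standard C library, and it falls back gracefully if the `cffi` package is not installed.

```python
import ctypes
import timeit

# ctypes: open the C library already loaded into this process (POSIX only)
# and declare the signature of strlen by hand.
libc_ctypes = ctypes.CDLL(None)
libc_ctypes.strlen.argtypes = [ctypes.c_char_p]
libc_ctypes.strlen.restype = ctypes.c_size_t


def strlen_ctypes(data):
    """Return the length of a bytes string via ctypes."""
    return libc_ctypes.strlen(data)


# CFFI (ABI mode): declare the function with a C prototype and open the
# same library. cffi is a third-party package on CPython, so guard the import.
try:
    from cffi import FFI
except ImportError:
    strlen_cffi = None
else:
    ffi = FFI()
    ffi.cdef("size_t strlen(const char *s);")
    libc_cffi = ffi.dlopen(None)

    def strlen_cffi(data):
        """Return the length of a bytes string via CFFI."""
        return libc_cffi.strlen(data)


def time_calls(func, data, number=100000):
    """Time `number` calls of func(data); returns elapsed seconds."""
    return timeit.timeit(lambda: func(data), number=number)
```

Running `time_calls(strlen_ctypes, b"hello")` and `time_calls(strlen_cffi, b"hello")` under both CPython and PyPy gives a rough feel for the per-call overhead of each binding. The absolute numbers depend on the machine and interpreter version, so treat them as indicative rather than as a reproduction of the timings quoted above.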