PyPy Status Blog

Hello everyone.
We are pleased to announce the availability of the new PyPy for AArch64. This
port brings PyPy's high-performance just-in-time compiler to the AArch64
platform, also known as 64-bit ARM. With the addition of AArch64, PyPy now
supports a total of six architectures: x86 (32-bit and 64-bit), ARM (32-bit and 64-bit), PPC64,
and s390x. The AArch64 work was funded by ARM Holdings Ltd. and Crossbar.io.
PyPy has a good record of boosting the performance of Python programs on the
existing platforms. To show how well the new PyPy port performs, we compare the
performance of PyPy against CPython on a set of benchmarks. As a point of
comparison, we include the results of PyPy on x86_64.
Note, however, that the results presented here were measured on a Graviton A1
machine from AWS, which comes with a very serious word of warning: Graviton A1
instances are virtual machines and, as such, are not suitable for benchmarking. If
someone has access to a beefy enough (16G) ARM64 server and is willing to give
us access to it, we are happy to redo the benchmarks on a real machine. One
major concern is that while a virtual CPU is 1-to-1 with a real CPU, it is not
clear to us how CPU caches are shared across virtual CPUs. Also, note that by no
means is this benchmark suite representative enough to average the results. Read
the numbers individually per benchmark.
The following graph shows the speedups on AArch64 of PyPy (hg id 2417f925ce94) compared to
CPython (2.7.15), as well as the speedups on an x86_64 Linux laptop
comparing the most recent release, PyPy 7.1.1, to CPython 2.7.16.

In the majority of benchmarks, the speedups achieved on AArch64 match those
achieved on the x86_64 laptop. Over CPython, PyPy on AArch64 achieves speedups
between 0.6x and 44.9x. These speedups are comparable to x86_64, where the
numbers are between 0.6x and 58.9x.
The next graph compares the speedups achieved on AArch64 to the speedups
achieved on x86_64, i.e., how great the speedup is on AArch64 vs. the same
benchmark on x86_64. This comparison should give a rough idea about the
quality of the generated code for the new platform.
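
To make the two comparisons concrete, here is a minimal sketch of how the numbers behind such graphs could be computed. The benchmark names and timings are made up for illustration and are not measurements from this post; the helper names `speedup` and `compare` are likewise hypothetical.

```python
# Hypothetical wall-clock timings in seconds. These are NOT results from
# this post; they only illustrate the arithmetic.
timings = {
    "bench_a": {
        "aarch64": {"cpython": 10.0, "pypy": 2.0},
        "x86_64": {"cpython": 8.0, "pypy": 1.0},
    },
    "bench_b": {
        "aarch64": {"cpython": 6.0, "pypy": 3.0},
        "x86_64": {"cpython": 4.0, "pypy": 4.0},
    },
}


def speedup(cpython_seconds, pypy_seconds):
    """Return how many times faster PyPy is than CPython (below 1 means slower)."""
    return cpython_seconds / pypy_seconds


def compare(timings):
    """Map each benchmark to (AArch64 speedup, x86_64 speedup, their ratio)."""
    results = {}
    for name, per_arch in timings.items():
        arm = speedup(per_arch["aarch64"]["cpython"], per_arch["aarch64"]["pypy"])
        x86 = speedup(per_arch["x86_64"]["cpython"], per_arch["x86_64"]["pypy"])
        # A ratio above 1 means PyPy gains more over CPython on AArch64
        # than it does on x86_64 for this benchmark.
        results[name] = (arm, x86, arm / x86)
    return results


# Report each benchmark on its own line rather than averaging across the suite.
for name, (arm, x86, ratio) in sorted(compare(timings).items()):
    print(f"{name}: AArch64 {arm:.1f}x, x86_64 {x86:.1f}x, ratio {ratio:.2f}")
```

The sketch deliberately prints one line per benchmark and computes no mean, in keeping with the warning above that the suite is not representative enough to average.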

Note that we see a large variance. There are generally three groups of
benchmarks: those that run at more or less the same speed, those that
run at 2x the speed, and those that run at 0.5x the speed of x86_64.
The variance and disparity are likely related to a variety of issues, mostly due
to differences in architecture. What is interesting, however, is that, compared
to measurements performed on older ARM boards, the branch predictor on the
Graviton A1 machine appears to have improved. As a result, the speedups achieved
by PyPy over CPython are smaller than on older ARM boards: sufficiently branchy
code, like CPython itself, simply runs a lot faster. Hence, the advantage
of the non-branchy code generated by PyPy's just-in-time compiler is smaller.
One takeaway here is that many possible improvements for PyPy have yet to be
implemented. This is true for both of the above platforms, but probably more so
for AArch64, which comes with a large number of CPU registers. The PyPy backend
was written with x86 (the 32-bit variant) in mind, which has a really low number
of registers. We think that we can improve in the area of emitting more modern
machine code, which may have a higher impact on AArch64 than on x86_64. There are
also a number of missing features in the AArch64 backend. These features are
currently implemented as expensive function calls instead of inlined native
instructions, something we intend to improve.
Best,
Maciej Fijalkowski, Armin Rigo and the PyPy team

Thursday, July 25, 2019

PyPy JIT for AArch64
