Saturday, August 9, 2014

A Field Test of Software Transactional Memory Using the RSqueak Smalltalk VM

Extending the Smalltalk RSqueakVM with STM

by Conrad Calmez, Hubert Hesse, Patrick Rein and Malte Swart supervised by Tim Felgentreff and Tobias Pape

Introduction

After pypy-stm we can announce that through the RSqueakVM (which used to be called SPyVM) a second VM implementation supports software transactional memory. RSqueakVM is a Smalltalk implementation based on the RPython toolchain. We have added STM support based on the STM tools from RPython (rstm). The benchmarks indicate that linear scale up is possible, however in some situations the STM overhead limits speedup.

The work was done as a master's project at the Software Architechture Group of Professor Robert Hirschfeld at at the Hasso Plattner Institut at the University of Potsdam. We - four students - worked about one and a half days per week for four months on the topic. The RSqueakVM was originally developped during a sprint at the University of Bern. When we started the project we were new to the topic of building VMs / interpreters.

We would like to thank Armin, Remi and the #pypy IRC channel who supported us over the course of our project. We also like to thank Toni Mattis and Eric Seckler, who have provided us with an initial code base.

Introduction to RSqueakVM

As the original Smalltalk implementation, the RSqueakVM executes a given Squeak Smalltalk image, containing the Smalltalk code and a snapshot of formerly created objects and active execution contexts. These execution contexts are scheduled inside the image (greenlets) and not mapped to OS threads. Thereby the non-STM RSqueakVM runs on only one OS thread.

Changes to RSqueakVM

The core adjustments to support STM were inside the VM and transparent from the view of a Smalltalk user. Additionally we added Smalltalk code to influence the behavior of the STM. As the RSqueakVM has run in one OS thread so far, we added the capability to start OS threads. Essentially, we added an additional way to launch a new Smalltalk execution context (thread). But in contrast to the original one this one creates a new native OS thread, not a Smalltalk internal green thread.

STM (with automatic transaction boundaries) already solves the problem of concurrent access on one value as this is protected by the STM transactions (to be more precise one instruction). But there are cases were the application relies on the fact that a bigger group of changes is executed either completely or not at all (atomic). Without further information transaction borders could be in the middle of such a set of atomic statements. rstm allows to aggregate multiple statements into one higher level transaction. To let the application mark the beginning and the end of these atomic blocks (high-level transactions), we added two more STM specific extensions to Smalltalk.

Benchmarks

RSqueak was executed in a single OS thread so far. rstm enables us to execute the VM using several OS threads. Using OS threads we expected a speed-up in benchmarks which use multiple threads. We measured this speed-up by using two benchmarks: a simple parallel summation where each thread sums up a predefined interval and an implementation of Mandelbrot where each thread computes a range of predefined lines.

To assess the speed-up, we used one RSqueakVM compiled with rstm enabled, but once running the benchmarks with OS threads and once with Smalltalk green threads. The workload always remained the same and only the number of threads increased. To assess the overhead imposed by the STM transformation we also ran the green threads version on an unmodified RSqueakVM. All VMs were translated with the JIT optimization and all benchmarks were run once before the measurement to warm up the JIT. As the JIT optimization is working it is likely to be adoped by VM creators (the baseline RSqueakVM did that) so that results with this optimization are more relevant in practice than those without it. We measured the execution time by getting the system time in Squeak. The results are:

Parallel Sum Ten Million

Benchmark Parallel Sum 10,000,000
Thread Count RSqueak green threads RSqueak/STM green threads RSqueak/STM OS threads Slow down from RSqueak green threads to RSqueak/STM green threads Speed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1 168.0 ms 240.0 ms 290.9 ms 0.70 0.83
2 167.0 ms 244.0 ms 246.1 ms 0.68 0.99
4 167.8 ms 240.7 ms 366.7 ms 0.70 0.66
8 168.1 ms 241.1 ms 757.0 ms 0.70 0.32
16 168.5 ms 244.5 ms 1460.0 ms 0.69 0.17

Parallel Sum One Billion

Benchmark Parallel Sum 1,000,000,000

Thread CountRSqueak green threadsRSqueak/STM green threadsRSqueak/STM OS threadsSlow down from RSqueak green threads to RSqueak/STM green threadsSpeed up from RSqueak/STM green threads to RSQueak/STM OS Threads
1 16831.0 ms 24111.0 ms 23346.0 ms 0.70 1.03
2 17059.9 ms 24229.4 ms 16102.1 ms 0.70 1.50
4 16959.9 ms 24365.6 ms 12099.5 ms 0.70 2.01
8 16758.4 ms 24228.1 ms 14076.9 ms 0.69 1.72
16 16748.7 ms 24266.6 ms 55502.9 ms 0.69 0.44

Mandelbrot Iterative

Benchmark Mandelbrot
Thread Count RSqueak green threads RSqueak/STM green threads RSqueak/STM OS threads Slow down from RSqueak green threads to RSqueak/STM green threads Speed up from RSqueak/STM green threads to RSqueak/STM OS Threads
1 724.0 ms 983.0 ms 1565.5 ms 0.74 0.63
2 780.5 ms 973.5 ms 5555.0 ms 0.80 0.18
4 781.0 ms 982.5 ms 20107.5 ms 0.79 0.05
8 779.5 ms 980.0 ms 113067.0 ms 0.80 0.01

Discussion of benchmark results

First of all, the ParallelSum benchmarks show that the parallelism is actually paying off, at least for sufficiently large embarrassingly parallel problems. Thus RSqueak can also benefit from rstm.

On the other hand, our Mandelbrot implementation shows the limits of our current rstm integration. We implemented two versions of the algorithm one using one low-level array and one using two nested collections. In both versions, one job only calculates a distinct range of rows and both lead to a slowdown. The summary of the state of rstm transactions shows that there are a lot of inevitable transactions (transactions which must be completed). One reason might be the interactions between the VM and its low-level extensions, so called plugins. We have to investigate this further.

Limitations

Although the current VM setup is working well enough to support our benchmarks, the VM still has limitations. First of all, as it is based on rstm, it has the current limitation of only running on 64-bit Linux.

Besides this, we also have two major limitations regarding the VM itself. First, the atomic interface exposed in Smalltalk is currently not working, when the VM is compiled using the just-in-time compiler transformation. Simple examples such as concurrent parallel sum work fine while more complex benchmarks such as chameneos fail. The reasons for this are currently beyond our understanding. Second, Smalltalk supports green threads, which are threads which are managed by the VM and are not mapped to OS threads. We currently support starting new Smalltalk threads as OS threads instead of starting them as green threads. However, existing threads in a Smalltalk image are not migrated to OS threads, but remain running as green threads.

Future work for STM in RSqueak

The work we presented showed interesting problems, we propose the following problem statements for further analysis:
  • Inevitable transactions in benchmarks. This looks like it could limit other applications too so it should be solved.
  • Collection implementation aware of STM: The current implementation of collections can cause a lot of STM collisions due to their internal memory structure. We believe it could bear potential for performance improvements, if we replace these collections in an STM enabled interpreter with implementations with less STM collisions. As already proposed by Remi Meier, bags, sets and lists are of particular interest.
  • Finally, we exposed STM through languages features such as the atomic method, which is provided through the VM. Originally, it was possible to model STM transactions barriers implicitly by using clever locks, now its exposed via the atomic keyword. From a language design point of view, the question arises whether this is a good solution and what features an stm-enabled interpreter must provide to the user in general? Of particular interest are for example, access to the transaction length and hints for transaction borders to and their performance impact.

    Details for the technically inclined

    • Adjustments to the interpreter loop were minimal.
    • STM works on bytecode granularity that means, there is a implicit transaction border after every bytecode executed. Possible alternatives: only break transactions after certain bytecodes, break transactions on one abstraction layer above, e.g. object methods (setter, getter).
    • rstm calls were exposed using primtives (a way to expose native code in Smalltalk), this was mainly used for atomic.
    • Starting and stopping OS threads is exposed via primitives as well. Threads are started from within the interpreter.
    • For Smalltalk enabled STM code we currently have different image versions. However another way to add, load and replace code to the Smalltalk code base is required to make a switch between STM and non-STM code simple.

      Details on the project setup

      From a non-technical perspective, a problem we encountered was the huge roundtrip times (on our machines up to 600s, 900s with JIT enabled). This led to a tendency of bigger code changes ("Before we compile, let's also add this"), lost flow ("What where we doing before?") and different compiled interpreters in parallel testing ("How is this version different from the others?") As a consequence it was harder to test and correct errors. While this is not as much of a problem for other RPython VMs, RSqueakVM needs to execute the entire image, which makes running it untranslated even slower.

      Summary

      The benchmarks show that speed up is possible, but also that the STM overhead in some situations can eat up the speedup. The resulting STM-enabled VM still has some limitations: As rstm is currently only running on 64-bit Linux the RSqueakVM is doing so as well. Eventhough it is possible for us now to create new threads that map to OS threads within the VM, the migration of exiting Smalltalk threads keeps being problematic.

      We showed that an existing VM code base can benefit of STM in terms of scaling up. Further it was relatively easy to enable STM support. This may also be valuable to VM developers considering to get STM support for their VMs.

      5 comments:

      Armin Rigo said...

      "We showed that an existing VM code base can benefit of STM in terms of scaling up." I dispute this conclusion: in the benchmarks, it seems that the non-STM version is scaling up well, even better than the STM+OS-threads version. But how can the non-STM version scale at all? It shouldn't: that's a property of RPython. And why is the STM+OS-threads version faster even with just 1 thread? I think you need to answer these questions first. Right now it screams "you are running buggy benchmarks" to me.

      Stefan Marr said...

      I concur with Armin, the conclusions are problematic in the light of the current numbers.

      Could you give some more details on the benchmarks? Can I find the Smalltalk code somewhere?

      Things that come to mind are details about the scheduler. In the RoarVM, that was also one of the issues (which we did not solve). The standard Squeak scheduling data structure remains unchanged I suppose? How does that interact with the STM, is it problematic that each STM thread updates this shared data structure during every scheduling operation?

      Also, more basic, are you making sure that the benchmark processes are running with highest priority (80, IIRC), to avoid interference with other processes in the image?

      On the language level, something that could also have an impact on the results is closures. How are they implemented? I suppose similar to the way the CogVM implements them? I suppose, you make sure that closures are not shared between processes?

      And finally, what kind of benchmark harness are you using? Did you have a look at SMark? (http://smalltalkhub.com/#!/~StefanMarr/SMark)
      We used that one for the RoarVM, and it provides various options to do different kind of benchmarks, including weak-scaling benchmarks, which I would find more appropriate for scalability tests. Weak-scaling means, you increase the problem size with the number of cores. That replicates the scenario where the problem itself is not really parallelizable, but you can solve more problems at the same time in parallel. It also makes sure that each process/thread does the identical operations (if setup correctly).

      Well, all those questions aside, interesting work :) Hope to read more soon ;)

      Patrick Rein said...

      You definitely hit a really weak spot in our report... Today we investigated the ParallelSum benchmark again. So far, we've found out that it was indeed partially a problem with the priority of the benchmark process. The preliminary benchmark results make more sense now and as soon as we have stable ones we will update them.

      I'll still try to address some of your questions right now. :)

      1. Benchmark code
      I've just wrapped up the current version of our benchmarks and put them in our repository. You can find the two Squeak4.5 images at the stmgc-c7 branch of the RSqueak Repository . You can find the benchmarks in the CPB package. The Squeak4.5stm image needs the RSqueak/STM VM.

      2. Scheduler data structures
      Yes, the scheduling data structure is completely unchanged. We have only added a new subclass of Process which overwrites fork and calls a different primitive. However, these Processes are not managed by the Smalltalk scheduler, so there should be no synchronization issues here.

      3. Interference of other processes:
      This is probably the source of the "speed-up" we observe on the normal RSqueakVM. With more threads we might get a bigger portion of the total runtime. So far, the benchmarks already ran in a VM mode which disables the Smalltalk GUI thread, however in the traces we found that the event handler is still scheduled every now and then. We've done it as you suggested, Stefan, and set the priority to 80 (or 79 to not mess up the timer interrupt handler).

      4. Benchmark harness
      We actually use SMark and also made sure the timing operations of RSqueak do their job correctly. However we are probably not using SMark at its full potential.

      Patrick Rein said...

      I've just updated the benchmarks. All benchmark processes are now running with the Smalltalk process priority of 79 (80 is the highest). The single-threaded VMs now show the expected behavior.

      Patrick Rein said...

      To further clarify on the Mandelbrot benchmarks: After a discussion with Stefan, I have changed the Mandelbrot implementation. Each job now only has private data and does not read or write in any shared data structure. Still the benchmark results remain the same and we can still observe a high proportion of inevitable transactions.

      As Armin pointed out, and which would be a next step, we would need to figure out which parts of the interpreter might cause systematic conflicts.