Comments on PyPy Status Blog: A Field Test of Software Transactional Memory Using the RSqueak Smalltalk VM

To further clarify on the Mandelbrot benchmarks: A...

2014-08-11T15:11:42.818+02:00

To further clarify the Mandelbrot benchmarks: After a discussion with Stefan, I have changed the Mandelbrot implementation. Each job now has only private data and does not read or write any shared data structure. Still, the benchmark results remain the same, and we can still observe a high proportion of inevitable transactions.

As Armin pointed out, the next step would be to figure out which parts of the interpreter might cause systematic conflicts.
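To illustrate what "only private data" means for the Mandelbrot jobs, here is a minimal sketch in Python. It is not the Smalltalk benchmark code from the CPB package; the function names, the row-per-job split, and the coordinate mapping are assumptions made for illustration. Each job builds and returns its own list and touches nothing shared, so any remaining conflicts would have to come from the interpreter rather than from the benchmark itself.

```python
from concurrent.futures import ThreadPoolExecutor


def mandelbrot_row(y, width, height, max_iter):
    """Compute one image row using only local variables (hypothetical job body)."""
    row = []  # private to this job; never visible to other jobs while it runs
    for x in range(width):
        # Map the pixel to a point in [-2, 1) x [-1, 1) of the complex plane.
        c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
        z = 0j
        n = 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        row.append(n)
    return row


def render(width, height, max_iter, workers):
    """Run one job per row; results are collected only after each job finishes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(mandelbrot_row, y, width, height, max_iter)
            for y in range(height)
        ]
        return [f.result() for f in futures]
```

Because the jobs are independent, the result should not depend on the number of workers, which makes for a simple sanity check before looking at timings.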

I've just updated the benchmarks. All benchmar...

2014-08-11T11:12:47.541+02:00

I've just updated the benchmarks. All benchmark processes are now running with the Smalltalk process priority of 79 (80 is the highest). The single-threaded VMs now show the expected behavior.

You definitely hit a really weak spot in our repor...

2014-08-10T21:13:27.802+02:00

You definitely hit a really weak spot in our report... Today we investigated the ParallelSum benchmark again. So far, we've found out that it was indeed partially a problem with the priority of the benchmark process. The preliminary benchmark results make more sense now and as soon as we have stable ones we will update them.

I'll still try to address some of your questions right now. :)

1. Benchmark code
I've just wrapped up the current version of our benchmarks and put them in our repository. You can find the two Squeak4.5 images on the stmgc-c7 branch of the RSqueak Repository. The benchmarks are in the CPB package. The Squeak4.5stm image needs the RSqueak/STM VM.

2. Scheduler data structures
Yes, the scheduling data structure is completely unchanged. We have only added a new subclass of Process which overrides fork and calls a different primitive. However, these Processes are not managed by the Smalltalk scheduler, so there should be no synchronization issues here.

3. Interference of other processes:
This is probably the source of the "speed-up" we observe on the normal RSqueakVM. With more threads we might get a bigger portion of the total runtime. The benchmarks already ran in a VM mode which disables the Smalltalk GUI thread; however, in the traces we found that the event handler is still scheduled every now and then. We've done as you suggested, Stefan, and set the priority to 80 (or 79 to not mess up the timer interrupt handler).

4. Benchmark harness
We actually use SMark and also made sure the timing operations of RSqueak do their job correctly. However, we are probably not using SMark to its full potential.

I concur with Armin, the conclusions are problemat...

2014-08-10T10:09:59.270+02:00

I concur with Armin, the conclusions are problematic in the light of the current numbers.

Could you give some more details on the benchmarks? Can I find the Smalltalk code somewhere?

Things that come to mind are details about the scheduler. In the RoarVM, that was also one of the issues (which we did not solve). The standard Squeak scheduling data structure remains unchanged, I suppose? How does that interact with the STM? Is it problematic that each STM thread updates this shared data structure during every scheduling operation?

Also, more basic, are you making sure that the benchmark processes are running with highest priority (80, IIRC), to avoid interference with other processes in the image?

On the language level, something that could also have an impact on the results is closures. How are they implemented? I suppose similar to the way the CogVM implements them? I suppose you make sure that closures are not shared between processes?

And finally, what kind of benchmark harness are you using? Did you have a look at SMark? (http://smalltalkhub.com/#!/~StefanMarr/SMark)
We used that one for the RoarVM, and it provides various options to do different kinds of benchmarks, including weak-scaling benchmarks, which I would find more appropriate for scalability tests. Weak scaling means that you increase the problem size with the number of cores. That replicates the scenario where the problem itself is not really parallelizable, but you can solve more problems at the same time in parallel. It also makes sure that each process/thread does the identical operations (if set up correctly).
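The difference between the two setups can be sketched in a few lines of Python. This is not SMark's API; the function names and the `mode` parameter are hypothetical, and the efficiency formulas are the usual textbook definitions rather than anything taken from the report.

```python
def per_worker_size(base_size, workers, mode):
    """Return the amount of work each worker gets (hypothetical helper)."""
    if mode == "strong":
        # Strong scaling: the total problem is fixed and split across workers.
        return base_size // workers
    if mode == "weak":
        # Weak scaling: every worker solves a full base-sized problem,
        # so the total work grows with the number of workers.
        return base_size
    raise ValueError(f"unknown scaling mode: {mode!r}")


def strong_scaling_efficiency(t_one, t_n, workers):
    """Speed-up t_one / t_n divided by the worker count; 1.0 is ideal."""
    return t_one / (workers * t_n)


def weak_scaling_efficiency(t_one, t_n):
    """With per-worker work held constant, ideal runtime stays at t_one."""
    return t_one / t_n
```

Under weak scaling, a runtime that stays flat as workers are added means the VM scales; a runtime that grows points at contention or serialization somewhere below the benchmark code.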

Well, all those questions aside, interesting work :) Hope to read more soon ;)

"We showed that an existing VM code base can ...

2014-08-09T15:10:31.035+02:00

"We showed that an existing VM code base can benefit of STM in terms of scaling up." I dispute this conclusion: in the benchmarks, it seems that the non-STM version is scaling up well, even better than the STM+OS-threads version. But how can the non-STM version scale at all? It shouldn't: that's a property of RPython. And why is the STM+OS-threads version faster even with just 1 thread? I think you need to answer these questions first. Right now it screams "you are running buggy benchmarks" to me.