Comments on PyPy Status Blog: "We need Software Transactional Memory"
(newest first)

Alex moner (2014-10-21 18:38):
In general, it is completely impossible to map every operation that must be atomic to a single processor instruction.

Anonymous (2012-07-05 23:42):
The high-level semantics that the Python VM provides through the GIL are perfect for most programs, and for most programmers' knowledge about concurrency.

What is the purpose of going after the GIL?

If it is just a performance boost on multiple cores, then a GIOL (global I/O lock) implemented in the VM, as the GIL is, should be considered. The VM could run several OS threads, blocking them on I/O and releasing the GIL.

If the purpose is to make concurrent programming easy and correct, it can be proven that it *is not possible*.

Yet there are alternatives that don't alter the language or its semantics and that can be explored.

Erlang-style message passing can be provided through object proxies implemented on top of or beneath the VM, so the threads/processes can even run on different computers.

In short, an Actor model is much preferable to a shared-memory one.

http://en.wikipedia.org/wiki/Actor_model
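The message-passing alternative described in the comment above can be sketched with nothing but the standard library: each actor is an ordinary OS process with a private heap, and the only communication is through queues. The counter actor and its message names below are made-up illustrations, not an existing proxy library.

```python
# A minimal actor-model sketch: each "actor" is a separate OS process with a
# private heap; the only sharing is explicit message passing over queues, so
# user code needs no locks and the GIL of one process never blocks another.
from multiprocessing import Process, Queue

def counter_actor(inbox: Queue, outbox: Queue):
    """A tiny actor: owns its state, reacts to messages, shares nothing."""
    count = 0
    while True:
        msg = inbox.get()
        if msg == "inc":
            count += 1
        elif msg == "get":
            outbox.put(count)
        elif msg == "stop":
            break

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=counter_actor, args=(inbox, outbox))
    p.start()
    for _ in range(1000):
        inbox.put("inc")
    inbox.put("get")
    print(outbox.get())   # -> 1000, with no user-level locking
    inbox.put("stop")
    p.join()
```

With object proxies layered on top, the queues could just as well carry messages between machines, which is the point made in the comment.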
Anonymous (2012-05-12 11:24):
@Mystilleef: agree 100%.

staila (2011-11-03 06:31):
We are actually working on implementing this directly into stailaOS (http://blog.staila.com/?p=81).

Armin Rigo (2011-09-21 20:10):
@squeaky_pl: thanks for the link. In some way researching this is ultimately doomed: either transactional memory doesn't work, or it does and in 5 or 10 years all CPUs will have good hardware support and will be able to run existing software like CPython with minor changes. :-)

squeaky_pl (2011-09-01 11:31):
Hardware transactional memory, anyone?
http://arstechnica.com/hardware/news/2011/08/ibms-new-transactional-memory-make-or-break-time-for-multithreaded-revolution.ars

Anonymous (2011-08-30 04:00):
Is there a branch where we can check this out?

Anonymous (2011-08-28 01:01):
I thought Erlang successfully solved this problem years ago? And I don't think anything scales better than it. So why aren't we just copying them? Message passing, where each thread or process shares absolutely nothing, is the sanest and safest way to do concurrent and multi-threaded programming. I mean, you don't even have to worry about locking! STM always seemed complicated to me.

Timo (2011-08-27 14:52):
⚛, you're missing a very important bit of the paper. In it, the authors say that the Rock hardware only holds 256 bytes of write-buffer content, while Riley and Zilles¹ determined that the average write-buffer size needed for transactions not to fail prematurely would be "less than 640 bytes", which is almost three times as much as Rock offers.

Thus, the big slowdown that the pystone benchmark experiences could be caused by the shortcomings of the TM built into Rock.

I do have to agree, though, that the "benchmarks" used in the paper are not very satisfactory. However, the magical "simple parallelization algorithm" you summon in your comment would break down quite easily as soon as the complexity of the situation increases by just a bit, would it not?

¹ I only briefly glanced over the paper, so if anyone read it more thoroughly, feel free to correct me.

Greg Wilson (2011-08-24 18:43):
Cool to see this happening. What's also cool is the result reported in Rossbach et al.'s study (http://www.neverworkintheory.org/?p=122): novices using STM did *better* on simple programming problems than students using traditional mechanisms, even though they thought they had done *worse*. "Baroque syntax" may be part of the problem; I'm sure the paper's authors would be happy to chat.

Jan Ziak (atomsymbol) (2011-08-24 18:20):
I just read "Tabba (2010)" (Tabba: Adding Concurrency in Python Using a Commercial Processor's Hardware Transactional Memory Support).

The article:

- Benchmark "iterate": this function makes no calls to other functions. The authors independently run 16 instances of the "iterate" function on a 16-core CPU using 16 threads. The speedup with respect to unmodified CPython is 7x. The slowdown with respect to 16 CPython processes is 2.2x.

- Benchmark "count": similar to "iterate". The speedup with respect to unmodified CPython is 4.5x. The slowdown with respect to 16 CPython processes is 3.5x.

- Benchmark "pystone": this function does call other functions. 16 instances of the "pystone" function are run on a 16-core CPU using 16 threads. The speedup with respect to unmodified CPython is 0.9x. The slowdown with respect to 16 CPython processes is 17x.

My analysis:

- iterate: the fact that N instances of this function can run in parallel without any interference can be determined easily; the algorithm to determine this is trivial. (Not to mention, the pointless loop in the function could be replaced by a NOP in a dead-code elimination pass.)

- count: same as "iterate".

- pystone: it is not trivial to determine whether multiple instances can run in parallel, so it would presumably run single-threaded.

- The article does *not* mention any real problem that was solved by TM in the case of "iterate", "count" or "pystone". That is logical, since there is no real problem to solve here: the benchmark functions can be trivially run in 16 CPython Linux processes, and anybody can do that (even your grandma).

My summary:

- In the case of the two functions for which it *can* be trivially determined that their instances run in parallel, the TM approach results in a 2x-3x slowdown compared to the most basic auto-parallelization algorithm.

- In the case of the function for which it *cannot* be trivially determined whether multiple instances can run in parallel, the TM approach running on 4-16 threads achieved 90% (a loss of 10%) of the speed of single-threaded CPython without TM. On 1 thread, the TM approach is 2.1x slower.

Bottom line:

Call me crazy, but my conclusion from this article is that TM (at least the TM approach from the article) is not working at all.
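For context on the numbers quoted above, the "iterate"-style comparison boils down to timing N independent CPU-bound loops run as threads versus as separate processes. A rough sketch of that kind of measurement (the loop body and counts here are illustrative, not the code used in the paper):

```python
# Rough sketch of the comparison discussed above: N independent CPU-bound
# loops run either as threads (serialized by the GIL on CPython) or as
# separate processes (true parallelism). Timings are illustrative only.
import time
from threading import Thread
from multiprocessing import Process

N = 16

def iterate(n=5_000_000):
    i = 0
    while i < n:
        i += 1

def run(worker_cls):
    workers = [worker_cls(target=iterate) for _ in range(N)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", run(Thread))   # GIL: roughly serial on CPython
    print("processes:", run(Process))  # scales with the available cores
```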
Armin Rigo (2011-08-24 17:50):
Of course, the sentence "it is OK for transactional memory to result in a 2x slowdown" was meant "on one thread". As soon as your program uses more than 2 threads, on a machine with more than 2 CPUs, you win.
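A quick back-of-the-envelope check of that claim, assuming a constant 2x per-thread overhead and ideal scaling across cores (real programs will scale less well):

```python
# With a 2x single-thread slowdown and ideal scaling, the speedup over
# serial (GIL-style) execution is simply threads / 2.
overhead = 2.0
for threads in (1, 2, 3, 4, 8):
    print(threads, "threads ->", threads / overhead, "x vs. serial")
# 1-2 threads at best break even; from 3 threads upward you win.
```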
Jan Ziak (atomsymbol) (2011-08-24 14:39):
I think there is some confusion here about what the underlying problem you are trying to solve is.

The underlying (fundamental) problem that transactional memory, as a method of replacing the GIL in Python, is trying to solve is automatic parallelization. That *is* hard.

Mediocre implementations of transactional memory are trivial to implement; almost anybody can do it. Of course, the performance will be horrible.

If we keep this underlying problem (automatic parallelization) in mind, it is clear and utterly obvious that *any* implementation of transactional memory which is slower than serial execution is simply missing the target. The target is, obviously, to run the program faster than serial execution; otherwise it would be totally pointless.

Based on this reasoning, it is an *obvious* conclusion that a transactional memory implementation simply cannot be allowed to result in lower performance than serial execution of the code. Allowing lower performance would be completely irrational.

We are humans, not animals. Rationality is our distinctive feature. We have to try to follow rationality.

In light of this, saying that "it is OK for transactional memory to result in a 2x slowdown" is irrational. I will write it one more time: accepting a 2x slowdown is irrational.

Now, it is crucial to note that there are various kinds of performance measurements, and it is OK to slow down one performance indicator while boosting another. For example, in a web-server environment it is OK to slow down the delivery of individual pages by a factor of 1.3 while boosting the number of requests per second by 2.3. That is *rational* and perfectly OK. A 3x developer-productivity boost would also be OK.

Following this example, if transactional memory is allowed to slow down the program (compared to serial execution) by 2x, a person who reasons rationally would immediately look for evidence of a greater-than-2x performance boost in another area of the program.

Omitting developer productivity, how are the PyPy developers going to deliver the *mandatory* greater-than-2x performance boost (in some area) without actually solving the underlying hard problems requiring hard-core code analysis?

If PyPy's transactional memory implementation were to serialize calls to the Linux kernel (because they are hard to emulate in user space), then this alone would prevent some programs from achieving the more-than-2x performance boost. This is because it is impossible to boost program performance (in some areas, given a particular performance indicator) unless the modified program is allowed to call kernel functions out of order or in parallel.

Note: I am *not* saying that PyPy should give up. I am just trying to note that you do not seem to know what you are doing. But I may be wrong.

Thomas Schilling (2011-08-24 13:01):
@Armin: Each in-memory process would use its own part of the heap, so no locking would be necessary except during message sending. You also don't need a 1-to-1 mapping of OS threads to processes: you could schedule N processes onto M OS threads (with M preferably chosen to match the number of CPU cores).

Of course, if you don't want a message-passing model (as you mentioned in another comment), then fine.

My argument is just that: STM is difficult to implement, difficult to make fast, and it still isn't that easy to use. A message-passing model is much easier to implement and easier to use for end users. (You can still get deadlocks, but you could provide libraries for standard communication patterns which you only have to get right once, like Erlang's OTP.)
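The N-processes-on-M-OS-threads idea in the comment above can be sketched with a fixed worker pool and one mailbox per lightweight process; the names and numbers below are illustrative only, not an existing API.

```python
# Sketch of scheduling N lightweight "processes" onto M OS threads: each task
# owns private state and drains only its own mailbox; the queues are the sole
# shared objects, so user code needs no explicit locks.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

N_PROCESSES, M_THREADS = 100, 4

def lightweight_process(pid: int, mailbox: Queue) -> int:
    total = 0                       # private state, never shared
    while not mailbox.empty():
        total += mailbox.get()
    return total

mailboxes = [Queue() for _ in range(N_PROCESSES)]
for box in mailboxes:
    for msg in range(10):           # pre-fill each mailbox with messages
        box.put(msg)

with ThreadPoolExecutor(max_workers=M_THREADS) as pool:
    results = list(pool.map(lightweight_process, range(N_PROCESSES), mailboxes))
print(sum(results))                 # 100 * (0 + 1 + ... + 9) = 4500
```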
Armin Rigo (2011-08-24 11:29):
@all: please come to the #pypy IRC channel on irc.freenode.net if you want to discuss this further.

Paul Harrison (2011-08-24 08:51):
Seems like it should be possible to guarantee performance not much worse than with a GIL.

Am I right in thinking there is a locked section where changes are written to memory? The execution before this is effectively just speculative computation to speed up the locked section. If it turns out there's an inconsistency, just execute the locked section as you would normally. If the speculative computation fails most of the time or is slow, switch to not doing it, and we are back to GIL performance.
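A sketch of the fallback scheme proposed in the comment above. The try_transaction hook is a hypothetical stand-in for the real STM machinery (it is stubbed out so the example runs); the point is only the control flow: try the block speculatively a few times, then fall back to a global lock and GIL-like behaviour.

```python
# Sketch of the proposed fallback: run a block speculatively and, if its
# transaction keeps getting invalidated, grab a global lock and run it
# non-speculatively (i.e. degrade to GIL-like serial execution).
import threading

global_lock = threading.Lock()      # the coarse "GIL" fallback
MAX_RETRIES = 3

def try_transaction(block):
    """Hypothetical STM hook: run `block` speculatively and report whether
    the transaction committed without conflicts. Stubbed out here so the
    sketch is self-contained; a real STM would validate read/write sets."""
    return True, block()

def run_block(block):
    for _ in range(MAX_RETRIES):
        ok, result = try_transaction(block)
        if ok:                      # speculation succeeded: keep the result
            return result
    with global_lock:               # too many conflicts: serialize, GIL-style
        return block()

print(run_block(lambda: 6 * 7))     # -> 42
```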
Anonymous (2011-08-24 08:39):
Funny to see how Python eats itself like an Ouroboros: wrong design decisions that made concurrency almost impossible; dirty hacks ("dirty" compared to, for example, Erlang's approach to SMP: almost linear scalability with the number of cores, at a 10-20% static overhead due to locks) that the PyPy team is attempting in order to solve problems introduced by Guido's ignorance; and a lot of Python "programmers" who don't understand what SMP is. Python is a ghetto, for real.

Anonymous (2011-08-23 23:43):
Hi,

I thought a bit about what you said about Jython. Mostly, I was thinking about a way to do this automatically instead of making it explicit.

I came up with this first draft: https://github.com/albertz/automatic_object_locking

This will obviously also be very slow, but it should be possible to optimize it well (similarly to STM). And I think it is much easier than STM.

-Albert

kost BebiX (2011-08-23 21:55):
Or just not open a transaction when there's only 1 thread?

kost BebiX (2011-08-23 21:54):
I know this might sound stupid, but is it possible to enable/disable STM on the fly? Like, to enable it only for the several threads involved.

Anonymous (2011-08-23 18:37):
My objection is that threads whose transactions end up being invalidated are stealing CPU time from other processes/threads.

STM is an "egoist" approach.

Armin Rigo (2011-08-23 18:34):
@nekto0n: nothing in particular. The two threads will run the calls in parallel, just like CPython, which calls the send() function without holding the GIL. What exactly occurs depends on the OS, not on the language.
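(The question this reply answers appears further down, since newer comments come first.) A small self-contained illustration of the point: both send() calls simply happen, and how the two byte streams interleave is up to the OS.

```python
# Two threads calling send() on the same socket: both calls go through, and
# the OS decides how the bytes interleave; neither STM nor CPython's GIL
# imposes any ordering at this level.
import socket
import threading

a, b = socket.socketpair()

def sender(tag: bytes):
    for _ in range(3):
        a.send(tag)

t1 = threading.Thread(target=sender, args=(b"x",))
t2 = threading.Thread(target=sender, args=(b"y",))
t1.start(); t2.start()
t1.join(); t2.join()
a.close()

print(b.recv(1024))   # e.g. b'xyxxyy' -- the ordering depends on the OS scheduler
b.close()
```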
Rodrigo Araújo (2011-08-23 18:33):
Some changes in my code: http://paste.pocoo.org/show/463085/

nekto0n (2011-08-23 18:12):
@armin: please describe what will happen if 2 threads call write() on a single socket object. What exactly should/will happen when the interpreter begins to dispatch the CALL bytecode?

I think it's the most questionable part of the STM approach.

Armin Rigo (2011-08-23 18:03):
@nekto0n: that's not really possible in general, because you need the return value of the syscall to decide what to do next, which normally means that you really have to do the syscall.
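To illustrate why the return value matters, here is an ordinary blocking-read loop: the very next line branches on what recv() returned, so the call cannot simply be buffered and replayed at commit time the way memory writes can. Nothing in the snippet is STM-specific.

```python
# Why a syscall inside a transaction usually has to be executed for real:
# the program immediately branches on its result, so the call cannot be
# deferred and replayed later like an ordinary buffered memory write.
def read_exactly(sock, n):
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))   # the result decides what happens next
        if not chunk:                      # EOF: stop, raise, reconnect, ...
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data
```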