Small sequential portions of an otherwise parallel algorithm can have huge effects on the overall running time when you try to scale up.
"parconc" explains this while discussing a parallel version of k-means, talks about how things like data granularity need to be fine-tuned for parallel algorithms, and provides some nice visualizations of what the CPUs are actually doing on a timeline: http://chimera.labs.oreilly.com/books/1230000000929/ch03.htm...
Overall I think multicore is a good tool to have in your toolbox, but it seems like it takes a lot of tuning and effort to get good rewards for the time invested.
This is certainly true for some applications, but when Amdahl's law was originally formulated, people made estimates based on e.g. 95% of an application being parallelizable, and thus fairly rapidly reaching a point of diminishing returns. In practice, however, there's often a simple reformulation of the problem that can result in a much higher percentage, and Amdahl's law becomes less of an obstacle than it would first appear. There are still many problems that scale more or less linearly with the number of processing units and are "trivially parallelizable".
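For reference, the usual statement of Amdahl's law: with a parallel fraction p and N processors, the speedup is bounded by

    S(N) = 1 / ((1 - p) + p / N)

so even at p = 0.95 the speedup can never exceed 1 / (1 - 0.95) = 20x no matter how many cores you throw at it, which is why reformulating the problem to push p closer to 1 matters so much.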
I wrote https://github.com/abemassry/crazip in Python and it was challenging to figure out how to do multiprocessing effectively; I'm not sure how this would run if implemented in other languages.
This article contains a perfect example of why I don't like to write or read comments in code. Comments are not compiled and thus allow for sloppy verbiage such as the following:
    # Exit the completed processes
    for p in processes:
        p.join()
The comment should read something more like: "Wait for all the subprocesses to exit" but is that really any more helpful than just reading the code and seeing that join is called on each subprocess and connecting the dots from there?
> but is that really any more helpful than just reading the code and seeing that join is called on each subprocess and connecting the dots from there?
It isn't if you have been programming with threads before; "join" is obvious then.
But if you haven't, and maybe you are a physicist trying to get something done in Python, "p.join()" could mean anything: "Join to what?", "Why is there no argument to join()?", "We are joining the data together here, like a list...", "It looks like we should be stopping the processes, but the method is not called 'stop()' so that's not it"...
That is the problem with this stuff being taught by someone who has been programming for a while: this kind of thing gets internalized and becomes obvious, but it is not obvious to a beginner.
Will use this as ammo next time someone complains about my minimalist approach to commenting.
I think comments need to be targeted at a specific audience. If you know that only professionals are going to work on your code then I think only high-level comments are probably OK.
Also, I would suggest that the physicist in your example hire a software practitioner [who can be expected to know 'join'] to write his software correctly (rather than trying to half-ass it himself).
Instead of calling the 'else' condition of the for loop a 'completion-else', just call it the 'nobreak' condition. Unlike 'completion-else', 'nobreak' immediately describes when it will be executed.
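To illustrate with a toy example of my own: the else block on a for loop runs only when the loop finishes without hitting a break, hence "nobreak":

    def find(needle, haystack):
        for item in haystack:
            if item == needle:
                print("found it")
                break
        else:
            # the "nobreak" case: runs only when the loop completed
            # without ever hitting break
            print("not found")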
If you're working on CPU-bound tasks with NumPy/SciPy using threads, then you have to think very hard to make sure most of the critical sections are hitting the NumPy calls in C, which release the GIL. It's not a very reliable way to program. The way the author describes is basically the only pure-Python way of achieving parallelism for this kind of problem.
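Roughly what that looks like in practice (a sketch of my own, assuming the matrices are large enough that the BLAS-backed np.dot dominates the runtime):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def worker(_):
        a = np.random.rand(2000, 2000)
        b = np.random.rand(2000, 2000)
        # np.dot on large float arrays runs in C/BLAS and releases the GIL,
        # so these threads can actually overlap on multiple cores
        return np.dot(a, b).trace()

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker, range(4)))

If the hot loop is pure Python instead of a NumPy call, the threads just serialize on the GIL and you get no speedup.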
If you're holding a global mutex every 100 instructions while context switching between CPU-bound tasks, then yes, the GIL does suck. There is a class of IO-bound problems where threading/evented models in Python can be used effectively, but that's not the class of problems the author is talking about here.
Considering the author's use case (mathematical modeling) and language (Python), threading and event-based models would have no real performance benefit.
Event-based models only shine when you are doing IO-bound tasks. They won't help you when you are chewing CPU.
Threading models in Python aren't attractive because of the GIL. If you are doing a parallel matrix operation you can only ever use one CPU because of the GIL. Not attractive.
STM is not a great fit for this kind of problem; there's no need for all the transaction machinery if the problem is embarrassingly parallel. In an ideal world, what you want is threads that just split up the work sections, like they do in C/C++/Haskell.
Just curious, but would it be possible to get around this by having multiple copies of the Python interpreter installed on your system? Maybe by changing some config strings so the GIL thinks it's something else.
ProcessPoolExecutor from the futures package essentially does this: it runs multiple instances of the interpreter and then distributes work to them.
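A minimal sketch of that pattern (concurrent.futures is in the Python 3 standard library; the separate "futures" package backports it to Python 2):

    from concurrent.futures import ProcessPoolExecutor

    def cpu_bound(n):
        # pure-Python CPU work that the GIL would serialize under threads
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # each worker is a separate interpreter process with its own GIL
        with ProcessPoolExecutor(max_workers=4) as pool:
            print(list(pool.map(cpu_bound, [10**6] * 4)))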