Good point. There is definitely a challenge in knowing when a single-threaded collection or stream operation may be preferable to the parallel option. When my colleague wrote his summary of Java 8 [1], he wrote:
> Returning to the concept of parallel streams, it's important to note that parallelism is not free. It's not free from a performance standpoint, and you can't simply swap out a sequential stream for a parallel one and expect the results to be identical without further thought. There are properties to consider about your stream, its operations, and the destination for its data before you can (or should) parallelize a stream. For instance: Does encounter order matter to me? Are my functions stateless? Is my stream large enough and are my operations complex enough to make parallelism worthwhile?
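To make the stateless-functions point concrete, here's a minimal sketch (the class name StatefulLambdaPitfall and the element count are my own, arbitrary choices): mutating shared state from a parallel stream is a data race, while handing the accumulation to a collector is safe.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class StatefulLambdaPitfall {
        public static void main(String[] args) {
            // Stateless and order-independent: safe to parallelize.
            long evens = IntStream.range(0, 1_000_000)
                                  .parallel()
                                  .filter(i -> i % 2 == 0)
                                  .count();
            System.out.println(evens);

            // BROKEN: mutating shared state from a parallel stream is a
            // data race; elements can be silently dropped or an
            // exception thrown, depending on timing.
            List<Integer> shared = new ArrayList<>();
            IntStream.range(0, 1_000_000).parallel().forEach(shared::add);

            // Safe: let the framework manage the intermediate state.
            List<Integer> collected = IntStream.range(0, 1_000_000)
                                               .parallel()
                                               .boxed()
                                               .collect(Collectors.toList());
            System.out.println(collected.size());
        }
    }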
The author of the linked InfoQ article (OP) raises that same dilemma, explaining how context-switching overhead can cancel out the advantage of splitting the work.
Abstracting it away with rough heuristics might be possible, but doing so with consistent success could be challenging. In other words, you could fall back to the serial algorithm for small collections and only go parallel when CPU contention at the start of the sort is low. But if the comparison operator is expensive, CPU contention is volatile, or you're operating on a stream of unknown length, the abstraction may choose poorly. Ultimately, I like the option to choose for myself, but like you, I wouldn't mind having a third option that defers the choice to some heuristic.
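For anyone who wants to feel out that crossover point on their own machine, a crude sketch (SortCrossover is a made-up name, the sizes are arbitrary, and this naive timing ignores JIT warmup; use a harness like JMH for real measurements):

    import java.util.Arrays;
    import java.util.Random;

    public class SortCrossover {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            for (int size : new int[] {1_000, 100_000, 10_000_000}) {
                int[] data = rnd.ints(size).toArray();

                int[] a = data.clone();
                long t0 = System.nanoTime();
                Arrays.sort(a);               // serial dual-pivot quicksort
                long serial = System.nanoTime() - t0;

                int[] b = data.clone();
                long t1 = System.nanoTime();
                Arrays.parallelSort(b);       // fork/join merge sort
                long parallel = System.nanoTime() - t1;

                System.out.printf("n=%,d serial=%dms parallel=%dms%n",
                        size, serial / 1_000_000, parallel / 1_000_000);
            }
        }
    }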
A single method taking an execution-strategy parameter might have been more flexible. Do I want to use the GPU if available? At what priority should it run: maximum performance, or more as a background task?
A GPU won't handle the Comparable interface at all, and the CPU<=>GPU memory transfer would be a real killer.
It might work on direct ByteBuffers only, but that's an entirely different subject.
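For context on why direct ByteBuffers keep coming up here: they live outside the normal Java heap, so a compacting GC never relocates them, which is the property that would make them plausible to hand off to a device. A minimal illustration (DirectBufferDemo is a made-up name and the buffer size is arbitrary; the GPU hand-off itself is hypothetical, not an existing API):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class DirectBufferDemo {
        public static void main(String[] args) {
            // allocateDirect reserves off-heap memory; its address is
            // stable for its lifetime, unlike objects on the Java heap.
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20)
                                       .order(ByteOrder.nativeOrder());
            buf.putInt(0, 42);
            System.out.println(buf.getInt(0));  // 42
            System.out.println(buf.isDirect()); // true
        }
    }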
Is that still true with the new unified memory architecture (hUMA) that AMD introduced?
I agree GPUs probably can't handle more complicated comparators well, but Arrays.sort() could at least be optimized for the primitive types.
hUMA is brand new with no actual support yet, and even for it to work properly, the CPU has to stall waiting for the GPU.
The GPU would have to be interruptible the same way the CPU is, so that if the GC decides to move the memory it can actually do so. Alternatively, the memory can be pinned, but pinning has its own side effects on the GC.
If the GPU is not on the same die, it will effectively have to copy the array, since the L1/L2 caches won't be accessible.
Arrays.sort(somePrimitive[]) would be too much of an edge case to optimize for. Overall it's a hard nut to crack. Java 8 streams and direct buffers, however, could be a good starting point for performing various operations via the GPU.
Disclaimer: I am really not well versed in GPU tech.
> This continual creation/termination/destruction of threads is done so often that I wonder where the idea came from. I presume some poisonous textbook is responsible. Sometimes it seems the whole of SO is riddled with threads that add two integers and then terminate, just so that the 'main' thread can wait with 'join'. God help us :(
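To spell out the complaint: the pattern being mocked spawns a thread for trivial work and joins it immediately, paying the full thread-creation cost for no benefit. The usual alternative is a long-lived pool; a minimal contrast (PoolVsThrowawayThread is a made-up name, assuming nothing beyond java.util.concurrent):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PoolVsThrowawayThread {
        public static void main(String[] args) throws Exception {
            // The anti-pattern: one thread, one trivial task, immediate join.
            Thread t = new Thread(() -> System.out.println(1 + 2));
            t.start();
            t.join();

            // The idiomatic version: submit tasks to a reusable pool.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            Future<Integer> sum = pool.submit(() -> 1 + 2);
            System.out.println(sum.get());
            pool.shutdown();
        }
    }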
I hope parallelSort() just calls sort() if the array is smaller than some threshold.
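For what it's worth, OpenJDK 8 does just that. A runnable paraphrase of the guard at the top of Arrays.parallelSort(int[]) (the constant's name and value of 1 << 13 are from the JDK 8 sources; ParallelSortFallback is a made-up wrapper for illustration):

    import java.util.Arrays;
    import java.util.concurrent.ForkJoinPool;

    public final class ParallelSortFallback {
        // Same value as java.util.Arrays.MIN_ARRAY_SORT_GRAN in JDK 8.
        static final int MIN_ARRAY_SORT_GRAN = 1 << 13;

        // Mirrors the check in Arrays.parallelSort(int[]): small arrays,
        // or a common pool with no real parallelism, take the
        // sequential dual-pivot quicksort path instead of fork/join.
        public static void sort(int[] a) {
            if (a.length <= MIN_ARRAY_SORT_GRAN
                    || ForkJoinPool.getCommonPoolParallelism() == 1) {
                Arrays.sort(a);
            } else {
                Arrays.parallelSort(a);
            }
        }
    }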
It reuses the ForkJoin common pool and, from what I can see, doesn't give you a way to specify the thread pool, the thread factory, or anything else. This is where implicit execution contexts in Scala really help, as much as people hate implicits. For Scala's parallel collections you have to use the tasksupport setter instead, but at least it's configurable.
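One commonly cited workaround on the Java side, which relies on undocumented behavior rather than a supported API: if you kick off the terminal operation from inside your own ForkJoinPool, the parallel stream's tasks run in that pool instead of the common one. A sketch (CustomPoolStream is a made-up name and the pool size of 4 is arbitrary):

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.IntStream;

    public class CustomPoolStream {
        public static void main(String[] args) {
            ForkJoinPool customPool = new ForkJoinPool(4);
            // Submitting the whole pipeline as a task makes its splits
            // execute in customPool rather than the common pool.
            long sum = customPool.submit(() ->
                    IntStream.range(0, 1_000_000)
                             .parallel()
                             .asLongStream()
                             .sum()
            ).join();
            customPool.shutdown();
            System.out.println(sum);
        }
    }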
Not an answer, but I was surprised to note that Erlang also requires you to explicitly choose the parallel version of a function. It might be interesting to see what Guy Steele did with this issue in Fortress.
If you did it often, I guess the JIT could work out what was best the last few times. Trial and error. :-)
No, I didn't. If you ever buy, even for an instant, the argument that a JIT is better for real-world application tuning than static code analysis, then you see my point.
Serious question: isn't this something the JVM could abstract away?