I think you just kind of proved my point: real-workload benchmarks would have to be much more attractive to offset the cost of supporting an architecture that isn't tier one in many languages / software libraries / projects.
Totally. That said, IBM isn't short on optimizing compiler folks. Open ecosystem support is a strategy to pick up small and mid-size buyers; it will be interesting to see if IBM gets there. I would look again next time we source hardware.
And, if IBM put someone internally on a properly vectorized Go compilation pathway for POWER8, I'd buy in a heartbeat, provided it ran some sort of Debian variant.
I don't think that Go, as people tend to use it, really benefits from vectorized code. Most of the people I interact with who use Go aren't writing numerical code; they're writing network servers and business logic for high-level web APIs. You might get minor speed-ups from a vectorized memcpy, but I can't see much else.
I imagine the most important CPU features for most Go code would be a good branch predictor and fast atomics / synchronization primitives.
If you're using Go for numerical processing I'd like to hear more about it, mostly because it's kind of a PITA.
Well, there are almost no really vectorizable primitives or functions in the core library, so I'm not surprised you don't run into people vectorizing much. And the Go dev team's compiler focus has been elsewhere this last year.
And, so far the Go team hasn't been able to interest Intel in doing the heavy lifting it might do for some other compilers.
So, better branch prediction and faster sync primitives would be great, not least because they would speed up channels in many cases, which would be cool. It would be nice to widen the use cases for channel-based communication significantly, but channels are just VERY slow if you want to use them at scale in a large application.
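If you want to put a number on it, a quick sketch along these lines with the standard testing package does the job (the mutex baseline and all the names are mine, just for illustration):

    package chanbench

    import (
        "sync"
        "testing"
    )

    // BenchmarkChannelSend measures an unbuffered channel send to a
    // draining goroutine; the cost is dominated by synchronization.
    func BenchmarkChannelSend(b *testing.B) {
        ch := make(chan int)
        go func() {
            for range ch {
            }
        }()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            ch <- i
        }
        close(ch)
    }

    // BenchmarkMutexIncrement guards a counter with a mutex, as a
    // rough baseline for a cheaper synchronization primitive.
    func BenchmarkMutexIncrement(b *testing.B) {
        var mu sync.Mutex
        n := 0
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            mu.Lock()
            n++
            mu.Unlock()
        }
        _ = n
    }

Run it with go test -bench=. and compare the ns/op columns.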
I am using Go for some large-scale numerical processing, although it's the sort with lots of logic attached, not just a giant matrix with some glue around it. It's kind of a PITA. We pick and choose some outside libraries, and spend a lot of time massaging the Go code for speed and bitching about the garbage collector. (Did you know that for i, _ := range is often 3-4x faster than for _, v := range? Do you know how awful code looks when it's written across four or five nested loops that use indices?)
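To make that concrete, here's roughly the contrast I mean (types and names invented for illustration):

    package gridsum

    // Cell is an invented element type, padded so that copying it
    // per iteration actually shows up in a profile.
    type Cell struct {
        Weight, Value float64
        Scratch       [14]float64
    }

    type Row struct{ Cells []Cell }

    // sumByValue is the readable form: range copies every Cell into v.
    func sumByValue(grid []Row) float64 {
        total := 0.0
        for _, row := range grid {
            for _, v := range row.Cells {
                total += v.Weight * v.Value
            }
        }
        return total
    }

    // sumByIndex skips the copies, but every access drags the full
    // index chain along, and it only gets worse as loops nest deeper.
    func sumByIndex(grid []Row) float64 {
        total := 0.0
        for i := range grid {
            for j := range grid[i].Cells {
                total += grid[i].Cells[j].Weight * grid[i].Cells[j].Value
            }
        }
        return total
    }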
But, the size of codebase our team can manage with Go is pretty great. We wouldn't be nearly so productive in many other cool (or .. experienced) languages once you add up the full life-cycle costs: innovation, enhancement, bug fixes, maintenance, and deployment. It's a win. I'd do it again in a heartbeat.
> So, better branch prediction and faster sync primitives would be great, not least because they would speed up channels in many cases, which would be cool. It would be nice to widen the use cases for channel-based communication significantly, but channels are just VERY slow if you want to use them at scale in a large application.
These operations are already pretty good on IA* processors, at least in comparison to the less mainstream architectures. Other architectures focus on bandwidth, on parallelism (but often without a great synchronization story), or on optimizing power usage. So I doubt that Go would benefit from moving to POWER.
Some choices the Go team made about how channels behave have limited their options for optimizing channels and increased the complexity of a lock-free implementation (I don't have the mailing list link handy). If you don't need all of those guarantees, you can pick an SPSC, SPMC, or MPMC queue implementation that might work better for your use case.
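For illustration only (this is not how the runtime implements channels), a minimal SPSC ring buffer in Go might look something like this; exactly one goroutine may push, exactly one may pop, and capacity has to be a power of two:

    package spsc

    import "sync/atomic"

    // Queue is a bounded single-producer/single-consumer ring buffer.
    // It gives up channel features (select, close, multiple senders
    // and receivers) in exchange for one atomic load and one atomic
    // store per operation.
    type Queue struct {
        head uint64 // read index, written only by the consumer
        tail uint64 // write index, written only by the producer
        mask uint64
        buf  []interface{}
    }

    // New returns a queue; capacity must be a power of two.
    func New(capacity uint64) *Queue {
        return &Queue{mask: capacity - 1, buf: make([]interface{}, capacity)}
    }

    // Push appends v and reports whether there was room.
    // Only a single goroutine may call Push.
    func (q *Queue) Push(v interface{}) bool {
        tail := atomic.LoadUint64(&q.tail)
        if tail-atomic.LoadUint64(&q.head) == uint64(len(q.buf)) {
            return false // full
        }
        q.buf[tail&q.mask] = v
        atomic.StoreUint64(&q.tail, tail+1) // publish the write
        return true
    }

    // Pop removes and returns the oldest element, if any.
    // Only a single goroutine may call Pop.
    func (q *Queue) Pop() (interface{}, bool) {
        head := atomic.LoadUint64(&q.head)
        if head == atomic.LoadUint64(&q.tail) {
            return nil, false // empty
        }
        v := q.buf[head&q.mask]
        atomic.StoreUint64(&q.head, head+1) // free the slot
        return v, true
    }

The point is that with one producer and one consumer each side only ever writes its own index, so you don't pay for the multi-sender semantics a channel has to support on every operation.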
> (Did you know that for i, _ := range is often 3-4x faster than for _, v := range? Do you know how awful code looks when it's written across four or five nested loops that use indices?)
Yes: in the second version you have to make a copy of v, and the larger v is, the larger the impact. The first version just references the array cell via a[i], so no copy is needed; the access is basically one indexed load. Maybe the optimizer could get smarter here, but I'm guessing it might break some language contract.
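A quick way to see the effect, with an invented, deliberately padded element type:

    package rangebench

    import "testing"

    // elem is an invented type, padded so the per-iteration copy
    // made by `for _, v := range` is measurable.
    type elem struct{ payload [64]byte }

    var data = make([]elem, 1<<16)

    var sink byte // keeps the compiler from eliding the loops

    func BenchmarkRangeIndex(b *testing.B) {
        for n := 0; n < b.N; n++ {
            for i := range data {
                sink += data[i].payload[0] // reads the slot in place
            }
        }
    }

    func BenchmarkRangeValue(b *testing.B) {
        for n := 0; n < b.N; n++ {
            for _, v := range data {
                sink += v.payload[0] // v is a fresh 64-byte copy
            }
        }
    }

The gap grows with the element size, since the index form only pays for the bytes it actually touches.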