It's not. It's not about efficiency. Compute-per-watt will certainly be better in other systems. This is about pushing a small system as fast as possible, because a small system is easier to program for. A few problems are 'embarrassingly parallel', but many pick up substantial overhead as parallelism increases, so running each core as fast as possible is a win for some problems.
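
To put rough numbers on that, here is a minimal sketch using Amdahl's law as the model. The function name `run_time`, the 80% parallel fraction, and the core counts and speeds are all made-up illustration values, not measurements of any real system. The model also ignores communication overhead that grows with core count, which would likely favor the small system even more.

```python
def run_time(parallel_fraction, cores, core_speed=1.0):
    """Relative run time under Amdahl's law (hypothetical helper).

    The serial part runs on one core; the parallel part is split
    evenly across `cores`. Everything scales with per-core speed.
    """
    serial_fraction = 1.0 - parallel_fraction
    return (serial_fraction + parallel_fraction / cores) / core_speed


# Assumed workload: 80% of the work parallelizes, 20% is serial.
many_slow = run_time(0.8, cores=16, core_speed=1.0)
few_fast = run_time(0.8, cores=4, core_speed=2.0)

# Embarrassingly parallel workload: 100% parallelizes.
many_slow_ideal = run_time(1.0, cores=16, core_speed=1.0)
few_fast_ideal = run_time(1.0, cores=4, core_speed=2.0)

print(f"80% parallel:  16 slow cores = {many_slow:.4f}, 4 fast cores = {few_fast:.4f}")
print(f"100% parallel: 16 slow cores = {many_slow_ideal:.4f}, 4 fast cores = {few_fast_ideal:.4f}")
```

With these assumed values, the four fast cores finish the 80%-parallel job sooner despite having half the aggregate throughput, while the sixteen slow cores win on the fully parallel job.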