Edit: correct me if I'm wrong, it doesn't appear that 'cluster by' avoids a costly shuffle first. I'd rather just push down to the database engine, when using a database engine.. group by worked fine on phoenix, so saying my code is the problem means it's really only a problem when using sparksql with the phoenix RDD driver.
Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.