You're trying to GROUP BY on a distributed data store; your code is the problem,...

jupiter90000 · on Feb 10, 2017

Edit: correct me if I'm wrong, but it doesn't appear that 'cluster by' avoids a costly shuffle first. When I'm using a database engine, I'd rather just push the work down to it. GROUP BY worked fine on Phoenix itself, so saying my code is the problem really means it's only a problem when using Spark SQL with the Phoenix RDD driver.
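
To make the push-down point concrete, here is a rough sketch. It assumes sqlite3 as a stand-in for the database engine (it is not Phoenix or Spark), and the `events` table, its columns, and both helper functions are made up for illustration. The first function lets the engine run the GROUP BY and returns one row per group; the second pulls every row out and groups on the client, which is roughly what happens when the aggregation is not pushed down.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the database engine (hypothetical table and data; not Phoenix).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10), ("b", 5), ("a", 7), ("c", 1), ("b", 2)],
)


def group_in_engine(conn):
    # Push-down: the engine aggregates, so only one row per group comes back.
    rows = conn.execute(
        "SELECT host, SUM(bytes) FROM events GROUP BY host"
    ).fetchall()
    return dict(rows)


def group_in_client(conn):
    # No push-down: every row is transferred, then grouped on the client side.
    totals = defaultdict(int)
    for host, nbytes in conn.execute("SELECT host, bytes FROM events"):
        totals[host] += nbytes
    return dict(totals)


pushed = group_in_engine(conn)
pulled = group_in_client(conn)
```

Both calls give the same totals; the difference is how many rows have to leave the engine before the grouping happens, which is where the cost shows up on a distributed store.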