
You're trying to GROUP BY on a distributed data store; your code is the problem, not Spark SQL. Use CLUSTER BY - it's the distributed sibling of GROUP BY.
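For example, in Spark SQL (a minimal sketch; the 'events' table and 'user_id'/'event_time' columns are made up):

    -- CLUSTER BY user_id is shorthand for DISTRIBUTE BY user_id SORT BY user_id:
    -- rows with the same user_id land in the same partition and are sorted within it
    SELECT user_id, event_time
    FROM events
    CLUSTER BY user_id;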

Query languages like HiveQL and Spark SQL were designed to look like SQL, but they aren't SQL.




Edit: correct me if I'm wrong, but it doesn't appear that CLUSTER BY avoids a costly shuffle either. I'd rather just push the work down to the database engine when one is available. GROUP BY worked fine on Phoenix directly, so saying my code is the problem really means it's only a problem when using Spark SQL with the Phoenix RDD driver.
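One way to express that pushdown is a JDBC temporary view whose dbtable is a subquery the database evaluates itself (a sketch only; the table, columns, and zookeeper quorum are invented, and I haven't verified that the Phoenix JDBC driver accepts a derived table here):

    CREATE TEMPORARY VIEW host_counts
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:phoenix:zk-host:2181',
      driver 'org.apache.phoenix.jdbc.PhoenixDriver',
      dbtable '(SELECT host, COUNT(*) AS cnt FROM web_stats GROUP BY host)'
    );

    -- the GROUP BY runs inside the database engine; Spark only reads the aggregated rows
    SELECT * FROM host_counts;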



