Hacker News new | past | comments | ask | show | jobs | submit login

If your code is littered with DISTINCT clauses either you are doing something wrong or the data design is poor.

DISTINCT leads to expensive SORT operations and effectively poor performance.




I'm genuinely curious. Would you mind elaborating or pointing me to another resource that goes into detail on why this is the case?


The poor design part or the performance part?

The use of DISTINCT everywhere smells of denormalized tables. Now if you are in data warehouse or reporting position, then this likely makes sense. I tend to work in transactional applications and keep the redundancy down to an absolute minimum. I abstract away some of the complicated queries with views, procs, and functions where allowable.

As for performance, different rdbms implement this differently, but the general query plan category for GROUP BY and DISTINCT is SORT. If you are querying something where you have multiple subqueries with DISTINCT and you and ordering the final results, you are adding extra sort operations to the query plan, hence hurting performance.

I tend to design everything to optimize read operations because there tends to be much more read than write in systems I work in. For me this means denormalized and heavily indexed where table/index scans are extremely rare. On smaller data sets, some rdbms always scan tables because it saves operations based on table statistics.

I hope this helps.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: