
This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?



Why would I use Hadoop for such a small number of rows…?


1 billion rows is small for Hadoop?


Anything that fits in RAM on one machine is easily too small for Hadoop. In those cases, Hadoop's overhead means it gets destroyed by a single beefy machine. The only time this might not be the case is when you're doing a crazy amount of computation relative to the data you have. (A rough JDK-only sketch of the single-machine approach is below this comment.)

Note that you can easily reach 1TB of RAM on (enterprise) commodity hardware now, and SSDs are pretty fast too.

Old but gold post from 2014: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
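
To make the single-machine point concrete, here is a rough sketch, not from the thread or from any actual challenge entry, of a JDK-only approach: a parallel stream over the input file, aggregating min/mean/max per station across all cores. It assumes the challenge's one-measurement-per-line `station;temperature` format; the file name `measurements.txt` and the class name are illustrative, and a competitive entry would use memory-mapping and hand-rolled parsing rather than String splitting.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.stream.Collectors;

    // Minimal single-machine sketch: JDK only, no external dependencies.
    // Assumes one "station;temperature" measurement per line (the 1BRC format).
    public class OneBrcSketch {

        // Running aggregate for one station.
        static final class Agg {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum = 0;
            long count = 0;

            void add(double v) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
                count++;
            }

            Agg merge(Agg other) {
                min = Math.min(min, other.min);
                max = Math.max(max, other.max);
                sum += other.sum;
                count += other.count;
                return this;
            }

            @Override
            public String toString() {
                return String.format("%.1f/%.1f/%.1f", min, sum / count, max);
            }
        }

        public static void main(String[] args) throws IOException {
            Path input = Path.of(args.length > 0 ? args[0] : "measurements.txt");

            // A parallel stream fans the per-line work out across all cores.
            try (var lines = Files.lines(input).parallel()) {
                Map<String, Agg> stats = lines.collect(Collectors.toConcurrentMap(
                        line -> line.substring(0, line.indexOf(';')),      // station name
                        line -> {                                          // aggregate for one value
                            Agg a = new Agg();
                            a.add(Double.parseDouble(line.substring(line.indexOf(';') + 1)));
                            return a;
                        },
                        Agg::merge));

                // TreeMap gives alphabetical station order for the printout.
                System.out.println(new TreeMap<>(stats));
            }
        }
    }

Even this naive version keeps a ~14GB file entirely on one box, which is the overhead-free baseline any Hadoop setup would have to beat.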



If it fits on one computer, it's not a Hadoop problem.


It fits on a dusty ten-year-old USB stick.


Sounds like an awk problem tbh.


A Hadoop submission may help people realize that. But since you only have one machine to work with, it should be obvious that you're not going to get any speed-up by dividing the work across machines.


A ~14GB file? It's on the small side for Hadoop.


> No external dependencies may be used



