
This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?



Why would I use Hadoop for such a small number of rows…?


1 billion rows is small for Hadoop?


Anything that fits in RAM on one machine is easily too small for Hadoop. In those cases, Hadoop's overhead means it gets destroyed by a single beefy machine. The only time this might not be the case is when you're doing a crazy amount of computation relative to the data you have. (A rough JDK-only sketch of the single-machine approach is below this comment.)

Note that you can easily reach 1TB of RAM on (enterprise) commodity hardware now, and SSDs are pretty fast too.

Old but gold post from 2014: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
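
To make the single-machine point concrete, here is a rough sketch, not from the thread or from any actual challenge entry, of a JDK-only approach: a parallel stream over the input file, aggregating min/mean/max per station across all cores. It assumes the challenge's one-measurement-per-line `station;temperature` format; the file name `measurements.txt` and the class name are illustrative, and a competitive entry would use memory-mapping and hand-rolled parsing rather than String splitting.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.stream.Collectors;

    // Minimal single-machine sketch: JDK only, no external dependencies.
    // Assumes one "station;temperature" measurement per line (the 1BRC format).
    public class OneBrcSketch {

        // Running aggregate for one station.
        static final class Agg {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum = 0;
            long count = 0;

            void add(double v) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
                count++;
            }

            Agg merge(Agg other) {
                min = Math.min(min, other.min);
                max = Math.max(max, other.max);
                sum += other.sum;
                count += other.count;
                return this;
            }

            @Override
            public String toString() {
                return String.format("%.1f/%.1f/%.1f", min, sum / count, max);
            }
        }

        public static void main(String[] args) throws IOException {
            Path input = Path.of(args.length > 0 ? args[0] : "measurements.txt");

            // A parallel stream fans the per-line work out across all cores.
            try (var lines = Files.lines(input).parallel()) {
                Map<String, Agg> stats = lines.collect(Collectors.toConcurrentMap(
                        line -> line.substring(0, line.indexOf(';')),      // station name
                        line -> {                                          // aggregate for one value
                            Agg a = new Agg();
                            a.add(Double.parseDouble(line.substring(line.indexOf(';') + 1)));
                            return a;
                        },
                        Agg::merge));

                // TreeMap gives alphabetical station order for the printout.
                System.out.println(new TreeMap<>(stats));
            }
        }
    }

Even this naive version keeps a ~14GB file entirely on one box, which is the overhead-free baseline any Hadoop setup would have to beat.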



If it fits on one computer, it's not a Hadoop problem.


It fits on a dusty ten-year-old USB stick.


Sounds like an awk problem tbh.


A Hadoop submission may help people realize that. But since you only have one machine to work with, it should be obvious that you're not going to get any speed-up by dividing the work across machines.


A ~14GB file? It's on the small side for Hadoop.


> No external dependencies may be used



