
I have no experience in data science. How much need is there for a distributed system? How many data points would it take to necessitate one? What if we had a billion data points? Would it be sufficient to crunch the data overnight on a fast system?



According to me (note zintinio's reply - he disagrees), the etymology of the term "data scientist" is as follows.

Back in the day, if you had small data you didn't need a data scientist. You needed a statistician. He'd do some shit in SAS/Python/etc, reading one CSV and writing another. Then the developers could run that on a beefy server with cron and push the CSV output someplace else.

At some point during the 2000s this stopped working - things became complicated and tangled enough that you couldn't just munge CSVs like this. You needed folks who understood the math well enough to come up with algorithms, and who also understood the computer science well enough to scale them out. These folks were termed "data scientists". Banks call them "quant developers".

Very often you don't need a distributed system, or anything that isn't trivially parallelizable. Most of the time when I make money, it's from a CSV under 100 GB - often it fits in RAM. Nowadays I don't think it's misleading to call yourself a "data scientist" if you can only handle such data sets.
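To put rough numbers on the "does it fit in RAM" question, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 8 bytes per value (a float64), the column counts, and the helper names `estimate_gib` and `streaming_mean` are all made up for this example, not anything from a particular library or from my actual workflow.

```python
import csv
import io


def estimate_gib(n_rows, n_numeric_cols, bytes_per_value=8):
    """Rough in-memory size of a dense numeric table, in GiB.

    Assumes every value is stored as a fixed-width number
    (8 bytes = float64 by default) and ignores any overhead.
    """
    return n_rows * n_numeric_cols * bytes_per_value / 2**30


def streaming_mean(lines, column):
    """Mean of one numeric column, read row by row in constant memory.

    `lines` is any iterable of CSV text lines with a header row,
    e.g. an open file object.
    """
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


if __name__ == "__main__":
    # A billion rows with 10 numeric columns: about 74.5 GiB as float64.
    print(round(estimate_gib(1_000_000_000, 10), 1))

    # Tiny in-memory stand-in for a real CSV file (hypothetical data).
    sample = io.StringIO("x,y\n1,2\n3,4\n")
    print(streaming_mean(sample, "x"))
```

So a billion data points with a handful of numeric columns is within reach of one big-memory server, and a single-pass aggregate like the mean never needs the whole file in memory at all. Whether it finishes overnight depends on what you compute, not just on the row count.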

I recently learned that Facebook calls their business analysts data scientists. I guess it's just sexier.



