Hacker News

It seems to me that Google Ngram isn't wrong: it's reporting statistics on the words it correctly identified in the corpus. The problem is the context of those statistics. You may somewhat confidently say the word "said" dips in usage at such and such time in the Google Books corpus. You can more confidently say it dips at such and such time for the subset of the corpus in which OCR correctly identified every instance of the word. But you can't make claims in a broader context, like "this word dipped in usage at such and such time," without sufficient data to support them.
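To make the distinction concrete, here's a small simulation with hypothetical numbers (the frequencies and miss rates are assumptions, not measured values). A word's true usage is held constant, but the OCR miss rate varies by era, e.g. old typefaces like the long "s" being misread more often. The reported frequency then dips even though actual usage never changed; the statistic is accurate for the recognized subset, misleading for the whole corpus:

```python
import random

random.seed(0)

TRUE_FREQ = 0.01            # true occurrences per token (assumed constant)
TOKENS_PER_DECADE = 1_000_000

def observed_freq(ocr_miss_rate: float) -> float:
    """Frequency of the word among tokens OCR recognized as that word."""
    true_count = int(TRUE_FREQ * TOKENS_PER_DECADE)
    # Each real occurrence survives OCR with probability (1 - miss rate).
    recognized = sum(1 for _ in range(true_count)
                     if random.random() > ocr_miss_rate)
    return recognized / TOKENS_PER_DECADE

# Hypothetical miss rates: worse for older typefaces.
for decade, miss in [("1790s", 0.40), ("1840s", 0.10), ("1990s", 0.01)]:
    print(decade, round(observed_freq(miss), 5))
```

The printed frequencies fall well below 0.01 for the early decades purely because of recognition error, which is exactly the kind of artifact that shows up as a "dip" in an Ngram plot.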



Just as "it depends" is a meme for economists, "need more data" is the galaxy-brain statistician meme.

Until you've solved the grand unified theory, you can never be fully confident in the completeness of your data or statistical inferences.

What's wrong is misleading the public away from this understanding.


And this is why sampling methodology is vastly more important than sample size when drawing inferences about a population.

Sample 1 million books from an academic corpus, and you'll turn up a very different linguistic corpus than you would by selecting the ten best-selling books from each decade of the 20th century.
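A toy simulation makes the point, with hypothetical numbers throughout (the genre mix and per-book counts are assumptions for illustration). A huge sample drawn from only one shelf estimates a word's average per-book count far worse than a small simple random sample of the whole population:

```python
import random

random.seed(42)

# Hypothetical population: 1,000,000 books. A colloquial word appears
# rarely in academic books, heavily in popular fiction.
academic = [("academic", random.gauss(2, 1)) for _ in range(900_000)]
popular = [("popular", random.gauss(20, 5)) for _ in range(100_000)]
books = academic + popular
true_mean = sum(count for _, count in books) / len(books)

# Large but biased sample: 500,000 books, all from the academic shelf.
biased = [count for _, count in random.sample(academic, 500_000)]
biased_mean = sum(biased) / len(biased)

# Small simple random sample: 1,000 books from the whole population.
srs = [count for _, count in random.sample(books, 1_000)]
srs_mean = sum(srs) / len(srs)

print(f"true {true_mean:.2f}  "
      f"biased n=500k {biased_mean:.2f}  "
      f"random n=1k {srs_mean:.2f}")
```

The biased estimate lands near 2 no matter how many books you add, while the 1,000-book random sample lands near the true value of roughly 3.8. More data from the wrong sampling frame just gives you a more precise answer to the wrong question.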



