
OP here. I learned about this while reading the "Data" lecture from Stanford's LLM course [1]. It's very interesting how it assesses the datasets used for GPT-2, GPT-3, etc., and how The Pile addresses their issues. The whole course is interesting!

[1] https://stanford-cs324.github.io/winter2022/lectures/data/




The Pile was also referenced in a post today about someone's tweets on “leaked” GPT-4 details:

https://news.ycombinator.com/item?id=36675934



