Those sizes refer to the data before processing and filtering. The actual training set was about 3 TB:

   The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens). 
Source: the paper, Section 10 (https://arxiv.org/pdf/2402.19173.pdf)
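
As a rough sanity check on those figures (a back-of-envelope sketch only; the implied bytes-per-token and retention ratios below are just derived from the quoted numbers, not stated in the paper, and assume decimal TB):

  # Implied bytes per token in the ~3 TB training set
  train_bytes = 3e12            # ~3 TB, per the quote
  tokens = 900e9                # 900B+ tokens
  print(train_bytes / tokens)   # ~3.3 bytes/token, plausible for a code tokenizer

  # Fraction of the 67.5 TB raw dump that survived cleaning/filtering/subsampling
  print(3 / 67.5)               # ~0.044, i.e. roughly 4-5% kept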


