Those sizes refer to the data before processing and filtering. The actual training set was about 3 TB:

   The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens). 
Source: the paper, Section 10 (https://arxiv.org/pdf/2402.19173.pdf)
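
As a rough sanity check on those figures (a back-of-envelope sketch only; the implied bytes-per-token and retention ratios below are just derived from the quoted numbers, not stated in the paper, and assume decimal TB):

  # Implied bytes per token in the ~3 TB training set
  train_bytes = 3e12            # ~3 TB, per the quote
  tokens = 900e9                # 900B+ tokens
  print(train_bytes / tokens)   # ~3.3 bytes/token, plausible for a code tokenizer

  # Fraction of the 67.5 TB raw dump that survived cleaning/filtering/subsampling
  print(3 / 67.5)               # ~0.044, i.e. roughly 4-5% kept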


