
It is not.

In related news, v2 of "The Stack" dataset was recently released:

> 3.28B unique files belonging to 104.2M github repositories were collected by traversing the Software Heritage 2023-09-06 graph dataset. Additional repository-level metadata was collected from GitHub Archive data up to 2023-09-14. The total uncompressed size of all files is 67.53TB. Near-deduplication was implemented in the pre-processing pipeline on top of exact deduplication.

V1 vs V2, deduplicated size and tokens:

V1: 2.9TB and 200B tokens

V2: 32.1TB and 900B tokens
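
If you want to poke at v2 without downloading tens of terabytes, streaming through the datasets library works. Rough sketch below, assuming you've accepted the gated-dataset terms on the Hub and are logged in; note the v2 rows carry repo metadata and Software Heritage blob IDs rather than the file contents themselves, which have to be fetched separately.

  # Rough sketch: stream rows from The Stack v2 instead of materializing it on disk.
  # Assumes `pip install datasets` and `huggingface-cli login` after accepting the
  # dataset terms; depending on the Hub layout you may want to point at a single
  # language subset rather than the full default config.
  from datasets import load_dataset

  ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

  for i, row in enumerate(ds):
      print(row)      # inspect the schema: repo metadata + Software Heritage IDs
      if i == 2:
          break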

I imagine we'll see some fairly powerful open coding models soon. The ones I'm looking at testing are:

dolphincoder-starcoder2-15b-iMat.GGUF

CodeFuse-DeepSeek-33B-iMat.GGUF

OpenCodeInterpreter-DS-33B-iMat.GGUF

starcoder2-15b-instruct-iMat.GGUF

more info

dataset https://huggingface.co/datasets/bigcode/the-stack-v2

gguf quants https://huggingface.co/dranger003
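
For the GGUF quants, llama-cpp-python is probably the easiest way to kick the tires. Minimal sketch; the model filename below is a placeholder for whichever quant you grab from that repo.

  # Minimal sketch: run a local GGUF quant with llama-cpp-python.
  # Assumes `pip install llama-cpp-python` and that a .gguf file from the quant
  # repo above has already been downloaded; the path/filename is a placeholder.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./starcoder2-15b-instruct-q4.gguf",  # placeholder filename
      n_ctx=4096,        # context window
      n_gpu_layers=-1,   # offload all layers to GPU if available; 0 for CPU-only
  )

  out = llm("Write a Python function that reverses a linked list.", max_tokens=256)
  print(out["choices"][0]["text"])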




Do you happen to know what the v2 dedup size is when compressed? 32.1TB is quite a bit, but if that compresses down to say 3-6TB, it would be much more manageable. Code has a lot of whitespace, repetition, and structure/predictability, so I imagine it would compress better than average text.
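
One way to get a ballpark answer is to compress a local sample of code and look at the ratio. Quick sketch, standard library only; the directory path and glob are placeholders.

  # Quick sketch: estimate a compression ratio from a local sample of source files.
  # zlib level 9 is roughly gzip -9; the sample directory is a placeholder.
  import pathlib, zlib

  sample = b"".join(p.read_bytes() for p in pathlib.Path("./some_repo").rglob("*.py"))
  packed = zlib.compress(sample, 9)
  print(f"{len(sample)} -> {len(packed)} bytes, ratio {len(sample) / len(packed):.1f}x")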


Those sizes refer to the data before filtering and subsampling. The actual training set was about 3 TB:

   The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens). 
Source: the paper, Section 10 (https://arxiv.org/pdf/2402.19173.pdf)
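
As a rough sanity check, 3 TB over 900B tokens works out to about 3.3 bytes per token, which is roughly what you'd expect from a BPE tokenizer over source code.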



