In related news, v2 of the "stack" dataset was recently released:
> 3.28B unique files belonging to 104.2M github repositories were collected by traversing the Software Heritage 2023-09-06 graph dataset. Additional repository-level metadata was collected from GitHub Archive data up to 2023-09-14. The total uncompressed size of all files is 67.53TB. Near-deduplication was implemented in the pre-processing pipeline on top of exact deduplication.
V1 vs. V2, deduplicated size and token count:
V1: 2.9 TB, 200B tokens
V2: 32.1 TB, 900B tokens
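The "near-deduplication on top of exact deduplication" step can be sketched roughly: drop exact duplicates by content hash, then drop files whose token shingles are too similar to something already kept. This is only an illustrative sketch; the shingle size and threshold are arbitrary, and the real pipeline uses scalable techniques such as MinHash rather than pairwise comparison:

```python
import hashlib


def shingles(text, k=5):
    """Set of k-token shingles used for near-duplicate comparison."""
    toks = text.split()
    return {tuple(toks[i:i + k]) for i in range(max(1, len(toks) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0


def dedup(files, threshold=0.7):
    """Exact dedup by SHA-256, then greedy near-dedup by Jaccard similarity."""
    seen = set()   # content hashes of files already kept
    kept = []      # (shingle set, text) pairs for near-dup checks
    for text in files:
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate
        seen.add(h)
        s = shingles(text)
        if any(jaccard(s, ks) >= threshold for ks, _ in kept):
            continue  # near duplicate of a kept file
        kept.append((s, text))
    return [t for _, t in kept]
```

Pairwise Jaccard is quadratic in the number of files, so at billions of files this is done with MinHash signatures and LSH bucketing instead; the logic above just shows what "near" means.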
Do you happen to know what the v2 dedup size is when compressed? 32.1TB is quite a bit, but if that compresses down to say 3-6TB, it would be much more manageable. Code has a lot of whitespace, repetition, and structure/predictability, so I imagine it would compress better than average text.
Those sizes refer to the data before processing and filtering. The actual training size was about 3 TB:
> The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens).
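On the compression question: the intuition that repetitive, boilerplate-heavy code compresses much better than unstructured text can be sanity-checked with zlib. The snippets and ratios below are purely illustrative, not measurements of the actual dataset:

```python
import random
import string
import zlib


def ratio(text):
    """Uncompressed-to-compressed size ratio under zlib level 9."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))


# Highly repetitive, boilerplate-heavy "code"
code = "def get_value(self):\n    return self._value\n\n" * 200

# Random text of the same length, which has little exploitable structure
random.seed(0)
noise = "".join(random.choice(string.ascii_lowercase + " ") for _ in range(len(code)))

print(f"repetitive code: {ratio(code):.1f}x, random text: {ratio(noise):.1f}x")
```

Real source code sits somewhere between these two extremes, but the repetition and predictability pull it well above typical prose, so the on-disk footprint of a code corpus is usually much smaller than its uncompressed size suggests.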
I imagine we'll see some fairly powerful open coding models soon. The ones I'm looking at testing are:
dolphincoder-starcoder2-15b-iMat.GGUF
CodeFuse-DeepSeek-33B-iMat.GGUF
OpenCodeInterpreter-DS-33B-iMat.GGUF
starcoder2-15b-instruct-iMat.GGUF
More info:
- dataset: https://huggingface.co/datasets/bigcode/the-stack-v2
- GGUF quants: https://huggingface.co/dranger003