
It is not.

In related news, v2 of "The Stack" dataset was recently released:

> 3.28B unique files belonging to 104.2M github repositories were collected by traversing the Software Heritage 2023-09-06 graph dataset. Additional repository-level metadata was collected from GitHub Archive data up to 2023-09-14. The total uncompressed size of all files is 67.53TB. Near-deduplication was implemented in the pre-processing pipeline on top of exact deduplication.

V1 vs V2, deduplicated size and tokens:

V1: 2.9TB and 200B tokens

V2: 32.1TB and 900B tokens
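
If you want to poke at v2 without downloading tens of terabytes, streaming through the datasets library works. Rough sketch below, assuming you've accepted the gated-dataset terms on the Hub and are logged in; note the v2 rows carry repo metadata and Software Heritage blob IDs rather than the file contents themselves, which have to be fetched separately.

  # Rough sketch: stream rows from The Stack v2 instead of materializing it on disk.
  # Assumes `pip install datasets` and `huggingface-cli login` after accepting the
  # dataset terms; depending on the Hub layout you may want to point at a single
  # language subset rather than the full default config.
  from datasets import load_dataset

  ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

  for i, row in enumerate(ds):
      print(row)      # inspect the schema: repo metadata + Software Heritage IDs
      if i == 2:
          break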

I imagine we'll see some fairly powerful open coding models soon. The ones I'm looking at testing are:

dolphincoder-starcoder2-15b-iMat.GGUF

CodeFuse-DeepSeek-33B-iMat.GGUF

OpenCodeInterpreter-DS-33B-iMat.GGUF

starcoder2-15b-instruct-iMat.GGUF

more info

dataset https://huggingface.co/datasets/bigcode/the-stack-v2

gguf quants https://huggingface.co/dranger003
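
For the GGUF quants, llama-cpp-python is probably the easiest way to kick the tires. Minimal sketch; the model filename below is a placeholder for whichever quant you grab from that repo.

  # Minimal sketch: run a local GGUF quant with llama-cpp-python.
  # Assumes `pip install llama-cpp-python` and that a .gguf file from the quant
  # repo above has already been downloaded; the path/filename is a placeholder.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./starcoder2-15b-instruct-q4.gguf",  # placeholder filename
      n_ctx=4096,        # context window
      n_gpu_layers=-1,   # offload all layers to GPU if available; 0 for CPU-only
  )

  out = llm("Write a Python function that reverses a linked list.", max_tokens=256)
  print(out["choices"][0]["text"])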




Do you happen to know what the v2 dedup size is when compressed? 32.1TB is quite a bit, but if that compresses down to say 3-6TB, it would be much more manageable. Code has a lot of whitespace, repetition, and structure/predictability, so I imagine it would compress better than average text.
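
One way to get a ballpark answer is to compress a local sample of code and look at the ratio. Quick sketch, standard library only; the directory path and glob are placeholders.

  # Quick sketch: estimate a compression ratio from a local sample of source files.
  # zlib level 9 is roughly gzip -9; the sample directory is a placeholder.
  import pathlib, zlib

  sample = b"".join(p.read_bytes() for p in pathlib.Path("./some_repo").rglob("*.py"))
  packed = zlib.compress(sample, 9)
  print(f"{len(sample)} -> {len(packed)} bytes, ratio {len(sample) / len(packed):.1f}x")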


Those sizes refer to the data before filtering and subsampling. The actual training set was about 3 TB:

   The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens). 
Source: the paper, Section 10 (https://arxiv.org/pdf/2402.19173.pdf)
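
As a rough sanity check, 3 TB over 900B tokens works out to about 3.3 bytes per token, which is roughly what you'd expect from a BPE tokenizer over source code.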



