Tbh I think these models will largely be trained on synthetic datasets in the future; they are mostly trained on garbage now. We have been doing opt-outs on these, and it has been interesting to see the quality differential (or lack thereof), e.g. removing books3 from StableLM Zephyr 3B: https://stability.wandb.io/stability-llm/stable-lm/reports/S...
Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output instead of on novel material from the comparatively very intelligent members of the human species? Would be interesting to hear your take on this.
There are approaches to getting the right type of augmented and generated data to feed these models; check out the QDAIF (quality-diversity through AI feedback) paper we worked on, for example. The gist is a search loop that uses an LLM both to mutate candidate texts and to judge their quality and diversity, keeping the best example per diversity niche; rough sketch below.
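A minimal sketch of that kind of loop in Python (generate_variant, judge_quality, and judge_diversity_bin are hypothetical placeholders for LLM calls, not the paper’s actual API):

    import random

    # Hypothetical stand-ins for the LLM calls; in a real QDAIF-style setup
    # these would prompt a model to mutate a text, score its quality, and
    # assign it a diversity descriptor.
    def generate_variant(text):
        return text + " (variant)"       # placeholder mutation

    def judge_quality(text):
        return random.random()           # placeholder quality score in [0, 1]

    def judge_diversity_bin(text, num_bins):
        return hash(text) % num_bins     # placeholder diversity descriptor

    def qd_generate(seed_texts, steps=1000, num_bins=10):
        # MAP-Elites-style archive: diversity bin -> (quality, text),
        # holding one elite per bin
        archive = {}

        def consider(text):
            b = judge_diversity_bin(text, num_bins)
            q = judge_quality(text)
            if b not in archive or q > archive[b][0]:
                archive[b] = (q, text)

        for text in seed_texts:
            consider(text)
        for _ in range(steps):
            _, parent = random.choice(list(archive.values()))
            consider(generate_variant(parent))   # mutate a random elite

        # the archive is now a diverse set of high-quality examples
        return [text for _, text in archive.values()]

The archive is the point: instead of keeping only the single best generation, you keep the best one in each niche, which is what stops the synthetic data from collapsing into one style.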
I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation, I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them, that’s a good datapoint.
What I mean is, it’s important to train a model both with and without books3, holding everything else fixed. That’s the only way to know whether any quality difference comes from books3 itself or from some artifact of the training process.
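Concretely, something like this (the mixture weights and the train() call are invented for illustration, not anyone’s actual config):

    # Two runs that share every setting except the dataset under test.
    base_mixture = {
        "common_crawl": 0.60,
        "code":         0.15,
        "wikipedia":    0.05,
        "books3":       0.20,
    }

    def without(mixture, key):
        # drop one source and renormalize the remaining weights to sum to 1
        rest = {k: v for k, v in mixture.items() if k != key}
        total = sum(rest.values())
        return {k: v / total for k, v in rest.items()}

    ablation_mixture = without(base_mixture, "books3")
    # train(base_mixture, seed=0) vs. train(ablation_mixture, seed=0):
    # same seed, same steps, same everything else, so any eval gap is
    # attributable to books3 rather than to training noise.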
One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, the model won’t be able to answer unless that knowledge is in the training data in some form. I’ve often wondered whether scraping the internet, where books get summarized, reviewed, and discussed, is enough, rather than training on the books directly.
But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, then show people outputs from both, blind, and ask which they prefer; see the sketch below.
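A minimal harness for that kind of blind A/B comparison (the model and rater callables are hypothetical stand-ins supplied by the caller):

    import random

    def preference_test(prompts, model_with, model_without, ask_rater):
        # Blind A/B: raters see two anonymized outputs in random order
        # and return 0, 1, or None (tie).
        wins = {"with_books3": 0, "without_books3": 0, "tie": 0}
        for prompt in prompts:
            pair = [("with_books3", model_with(prompt)),
                    ("without_books3", model_without(prompt))]
            random.shuffle(pair)          # hide which model is which
            choice = ask_rater(prompt, pair[0][1], pair[1][1])
            if choice is None:
                wins["tie"] += 1
            else:
                wins[pair[choice][0]] += 1
        return wins

Shuffling the pair per prompt matters: it keeps raters from learning positional habits and lets you attribute any preference gap to the data difference.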
It’s really tricky to get all of this right. But if there are more details on the pes2o ablations, I’d be curious to see them.