Tbh I think these models will largely be trained on synthetic datasets in the future; they are mostly trained on garbage now. We have been doing opt-outs on these, and it has been interesting to see the quality differential (or lack thereof), e.g. removing books3 from StableLM Zephyr 3B: https://stability.wandb.io/stability-llm/stable-lm/reports/S...
Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output instead of on novel material from the comparatively very intelligent members of the human species? Would be interesting to hear your take on this.
There are approaches to getting the right type of augmented and generated data to feed these models; check out the QDAIF (quality-diversity through AI feedback) paper we worked on, for example. The gist is a search loop that uses an LLM both to mutate candidate texts and to judge their quality and diversity, keeping the best example per diversity niche; rough sketch below.
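A minimal sketch of that kind of loop in Python (generate_variant, judge_quality, and judge_diversity_bin are hypothetical placeholders for LLM calls, not the paper’s actual API):

    import random

    # Hypothetical stand-ins for the LLM calls; in a real QDAIF-style setup
    # these would prompt a model to mutate a text, score its quality, and
    # assign it a diversity descriptor.
    def generate_variant(text):
        return text + " (variant)"       # placeholder mutation

    def judge_quality(text):
        return random.random()           # placeholder quality score in [0, 1]

    def judge_diversity_bin(text, num_bins):
        return hash(text) % num_bins     # placeholder diversity descriptor

    def qd_generate(seed_texts, steps=1000, num_bins=10):
        # MAP-Elites-style archive: diversity bin -> (quality, text),
        # holding one elite per bin
        archive = {}

        def consider(text):
            b = judge_diversity_bin(text, num_bins)
            q = judge_quality(text)
            if b not in archive or q > archive[b][0]:
                archive[b] = (q, text)

        for text in seed_texts:
            consider(text)
        for _ in range(steps):
            _, parent = random.choice(list(archive.values()))
            consider(generate_variant(parent))   # mutate a random elite

        # the archive is now a diverse set of high-quality examples
        return [text for _, text in archive.values()]

The archive is the point: instead of keeping only the single best generation, you keep the best one in each niche, which is what stops the synthetic data from collapsing into one style.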
I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation, I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them, that’s a good datapoint.
What I mean is, it’s important to train a model both with and without books3, holding everything else fixed. That’s the only way to know whether any quality difference comes from books3 itself or from some artifact of the training process.
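Concretely, something like this (the mixture weights and the train() call are invented for illustration, not anyone’s actual config):

    # Two runs that share every setting except the dataset under test.
    base_mixture = {
        "common_crawl": 0.60,
        "code":         0.15,
        "wikipedia":    0.05,
        "books3":       0.20,
    }

    def without(mixture, key):
        # drop one source and renormalize the remaining weights to sum to 1
        rest = {k: v for k, v in mixture.items() if k != key}
        total = sum(rest.values())
        return {k: v / total for k, v in rest.items()}

    ablation_mixture = without(base_mixture, "books3")
    # train(base_mixture, seed=0) vs. train(ablation_mixture, seed=0):
    # same seed, same steps, same everything else, so any eval gap is
    # attributable to books3 rather than to training noise.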
One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, the model won’t be able to answer unless that knowledge is in the training data in some form. I’ve often wondered whether scraping the internet, where books get summarized, reviewed, and discussed, is enough, rather than training on the books directly.
But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, then show people outputs from both, blind, and ask which they prefer; see the sketch below.
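A minimal harness for that kind of blind A/B comparison (the model and rater callables are hypothetical stand-ins supplied by the caller):

    import random

    def preference_test(prompts, model_with, model_without, ask_rater):
        # Blind A/B: raters see two anonymized outputs in random order
        # and return 0, 1, or None (tie).
        wins = {"with_books3": 0, "without_books3": 0, "tie": 0}
        for prompt in prompts:
            pair = [("with_books3", model_with(prompt)),
                    ("without_books3", model_without(prompt))]
            random.shuffle(pair)          # hide which model is which
            choice = ask_rater(prompt, pair[0][1], pair[1][1])
            if choice is None:
                wins["tie"] += 1
            else:
                wins[pair[choice][0]] += 1
        return wins

Shuffling the pair per prompt matters: it keeps raters from learning positional habits and lets you attribute any preference gap to the data difference.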
It’s really tricky to get all of this right. But if there are more details on the pes2o ablations, I’d be curious to see them.