I'm quite baffled by the fact that LLMs can generate a dataset used to train other LLMs. One would think that such a feedback loop would produce utter nonsense, but apparently it works.
If the correct labels in the original training set outweigh the incorrect ones, then it is possible to reduce the number of errors by relabeling with the trained model. If you can also identify which labels are likely to be incorrect and have humans focus on relabeling those, you have a way to improve the data efficiently.
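To make that concrete, here's a minimal sketch of the idea (my own illustration, not a specific published method; all names are hypothetical): train on the noisy labels, get out-of-fold predictions, and flag examples where a confident prediction disagrees with the given label as candidates for human review.

```python
# Minimal sketch of model-assisted label cleaning (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y_noisy, threshold=0.9):
    """Indices where a confident out-of-fold prediction disagrees with
    the given label; good candidates for human re-review."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities: each example is scored by a model that
    # never trained on its own (possibly wrong) label.
    probs = cross_val_predict(model, X, y_noisy, cv=5, method="predict_proba")
    predicted = probs.argmax(axis=1)   # assumes integer labels 0..K-1
    confidence = probs.max(axis=1)
    return np.where((predicted != y_noisy) & (confidence >= threshold))[0]

# Demo on synthetic data with 5% of labels deliberately flipped.
X, y_true = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y_true), size=100, replace=False)
y_noisy = y_true.copy()
y_noisy[flip] = 1 - y_noisy[flip]      # binary labels, so flip them

suspects = flag_suspect_labels(X, y_noisy)
print(f"{len(suspects)} flagged; "
      f"{np.isin(suspects, flip).mean():.0%} of flags are real errors")
```

Note that this only works under the condition stated above: if most labels are correct, the model learns the true signal despite the noise, so its confident disagreements are disproportionately the bad labels.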
I feel the same way about synthetic data. It seems intuitively wrong that you can get new insights or unlock new abilities from generated data that you couldn't get from the original data.
The new information comes from our choices in how we generate that data. We're not just blindly making synthetic data; we come up with clever ways to generate synthetic data that is hopefully high quality and can improve our models (and if it doesn't, we don't use it).
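A toy sketch of that generate-then-filter loop (purely illustrative; `generate_candidate` is a stand-in for sampling from an LLM, and the verifier is the "clever choice" that injects information the raw samples don't carry):

```python
# Illustrative generate-then-filter synthetic data pipeline.
import random

def generate_candidate(rng):
    """Hypothetical stand-in for sampling a (question, claimed_answer)
    pair from a model; here a noisy arithmetic 'model'."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    claimed = a + b + rng.choice([0, 0, 0, 1, -1])  # sometimes wrong
    return f"What is {a} + {b}?", (a, b), claimed

def verify(operands, claimed):
    """Programmatic check: only verified pairs enter the dataset."""
    a, b = operands
    return claimed == a + b

rng = random.Random(0)
dataset = []
for _ in range(1000):
    question, operands, claimed = generate_candidate(rng)
    if verify(operands, claimed):      # discard unverifiable samples
        dataset.append({"prompt": question, "answer": str(claimed)})

print(f"kept {len(dataset)} / 1000 verified examples")
```

The filter is where the leverage comes from: the raw samples contain errors, but the verification rule we chose decides what survives, and that rule is information the generator alone didn't have.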