I'm quite baffled by the fact that LLMs can generate a dataset used to train other LLMs. One would think that such a feedback loop would produce utter nonsense, but apparently it works.
If the correct labels in the original training set outweigh the incorrect ones, then it is possible to reduce the number of errors by relabeling with the trained model. If you can also identify which labels are likely to be incorrect and have humans focus on relabeling those, you have a way to improve the data efficiently.
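To make that concrete, here's a minimal sketch of the idea (my own illustration, not a specific published method; all names are hypothetical): train on the noisy labels, get out-of-fold predictions, and flag examples where a confident prediction disagrees with the given label as candidates for human review.

```python
# Minimal sketch of model-assisted label cleaning (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y_noisy, threshold=0.9):
    """Indices where a confident out-of-fold prediction disagrees with
    the given label; good candidates for human re-review."""
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities: each example is scored by a model that
    # never trained on its own (possibly wrong) label.
    probs = cross_val_predict(model, X, y_noisy, cv=5, method="predict_proba")
    predicted = probs.argmax(axis=1)   # assumes integer labels 0..K-1
    confidence = probs.max(axis=1)
    return np.where((predicted != y_noisy) & (confidence >= threshold))[0]

# Demo on synthetic data with 5% of labels deliberately flipped.
X, y_true = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y_true), size=100, replace=False)
y_noisy = y_true.copy()
y_noisy[flip] = 1 - y_noisy[flip]      # binary labels, so flip them

suspects = flag_suspect_labels(X, y_noisy)
print(f"{len(suspects)} flagged; "
      f"{np.isin(suspects, flip).mean():.0%} of flags are real errors")
```

Note that this only works under the condition stated above: if most labels are correct, the model learns the true signal despite the noise, so its confident disagreements are disproportionately the bad labels.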
I feel the same way about synthetic data. It seems intuitively wrong that you can get new insights or unlock new abilities from generated data that you couldn't get from the original data.
The new information comes from our choices in how we generate that data. We're not just blindly making synthetic data; we come up with clever ways to generate synthetic data that is hopefully high quality and can improve our models (and if it doesn't, we don't use it).
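A toy sketch of that generate-then-filter loop (purely illustrative; `generate_candidate` is a stand-in for sampling from an LLM, and the verifier is the "clever choice" that injects information the raw samples don't carry):

```python
# Illustrative generate-then-filter synthetic data pipeline.
import random

def generate_candidate(rng):
    """Hypothetical stand-in for sampling a (question, claimed_answer)
    pair from a model; here a noisy arithmetic 'model'."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    claimed = a + b + rng.choice([0, 0, 0, 1, -1])  # sometimes wrong
    return f"What is {a} + {b}?", (a, b), claimed

def verify(operands, claimed):
    """Programmatic check: only verified pairs enter the dataset."""
    a, b = operands
    return claimed == a + b

rng = random.Random(0)
dataset = []
for _ in range(1000):
    question, operands, claimed = generate_candidate(rng)
    if verify(operands, claimed):      # discard unverifiable samples
        dataset.append({"prompt": question, "answer": str(claimed)})

print(f"kept {len(dataset)} / 1000 verified examples")
```

The filter is where the leverage comes from: the raw samples contain errors, but the verification rule we chose decides what survives, and that rule is information the generator alone didn't have.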