The LLM miracle comes from the massive amount of text we can train it on; remove that advantage and LLMs become untenable. An idea I've had for a while is to do the opposite: generate nonsense text according to some complex formula, and have the AI learn to predict that. It can't possibly encode any facts, because there are none. Now show it English, and it will treat it just like any other sort of nonsense text it has gotten good at learning to interpret.
But the idea you describe is exactly what would make the LLM stop working. The "LLM miracle" comes from the fact that all that text is not random[0]. There is a lot of information encoded in which phrases, sentences, and paragraphs have been written (and how often), versus the vastly larger number of nearly identical texts that were not written. The "complex formula" used is... reality, as perceived and understood by people. LLMs pick up on that.
--
[0] - Well, most of it, anyway; I bet the training set contains some amount of purely random text, such as technical articles discussing RNGs and showcasing their output. Some amount of noise is unavoidable.
The idea would be to generate a false "reality" for the LLM to learn about. You would randomly generate a system of rules, use those rules to generate text, and then train the LLM to predict the text. The goal would be to get it to stop encoding reality proper in its weights, and focus on learning to pick up what a reality looks like very quickly from text.
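A minimal sketch of what that could look like, in Python (everything here is a made-up illustration: the "randomly generated system of rules" is modeled as a random context-free grammar, which is just one possible choice of formula):

    import random

    def random_grammar(n_nonterminals=5, n_terminals=20, rules_per_nt=3, seed=None):
        # Build a random "system of rules": each nonterminal gets a few
        # random right-hand sides over a small made-up vocabulary.
        rng = random.Random(seed)
        nonterminals = [f"N{i}" for i in range(n_nonterminals)]
        terminals = [f"t{i}" for i in range(n_terminals)]
        grammar = {}
        for nt in nonterminals:
            rules = []
            for _ in range(rules_per_nt):
                # Short random mix of terminals and nonterminals,
                # biased toward terminals so expansion tends to halt.
                rhs = [rng.choice(terminals) if rng.random() < 0.7
                       else rng.choice(nonterminals)
                       for _ in range(rng.randint(1, 4))]
                rules.append(rhs)
            grammar[nt] = rules
        return grammar, nonterminals[0]

    def sample(grammar, symbol, rng, depth=0, max_depth=12):
        # Terminals (and anything past the depth cutoff) are emitted as-is.
        if symbol not in grammar or depth > max_depth:
            return [symbol]
        out = []
        for s in rng.choice(grammar[symbol]):
            out.extend(sample(grammar, s, rng, depth + 1, max_depth))
        return out

    # Every new seed yields a new "false reality" to pretrain on:
    grammar, start = random_grammar(seed=42)
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sample(grammar, start, rng)))

Each pretraining corpus would be sampled from a fresh random grammar, so memorizing any one grammar is useless; the only transferable skill is inferring the rules quickly from context.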
Bonus points for one of the most delightfully creative ideas I’ve heard in some time. I don’t think it will work (the space of "not reality" is superexponentially larger than the space of "this describes reality") but I’m just happy to be thinking about nonstandard ML again.
(I’ve dubbed this sort of thing “nonstandard ML” since, like you, I have a fondness for thinking of unorthodox solutions that seem plausible.)
It will just learn your formula and won’t generalize to anything else. It would essentially have to unlearn the formula once you started training on English, so this would actually make training slower.