Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.
You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.
Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".
Then a third "independent" team uses that synthetic dataset to train a model.
The theory is that this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes and, with enough shaking, the knowledge falls through but the copyright doesn't.