Interesting, so they wouldn't want to disclose anything that shows they've illegally scraped research databases, for example (terms-of-service or copyright violations).
Won't this eventually come up in legal discovery when someone sues one of these firms for copyright infringement? They'd have to share their data in the discovery process to show that they haven't infringed.
Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.
You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.
Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".
Then a third "independent" team uses that synthetic dataset to train a model.
The theory is that this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes and, with enough shaking, the knowledge falls through but the copyright doesn't.