That’s a good point. Wouldn’t OpenR1 suffer from the same problem? Or does being...

michaelt · 2025-01-28T10:45:52

Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.

You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.

Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".

Then a third "independent" team uses that synthetic dataset to train a model.

The theory is that this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes and, with enough shaking, the knowledge falls through but the copyright doesn't.

svnt · 2025-01-28T11:33:21

Knowledge laundering