No company will ever disclose its training data, because doing so would open it up to endless liability.

maartenpi_ · 2025-01-28T09:49:51 1738057791

Exactly. Meta won't do it for the same reason. Consider the liability alone: imagine all the copyright lawsuits...

Secondly, for now the dataset itself is a major competitive advantage.

In a way it seems like a good thing that AI giants compete on methodology now.

fblp · 2025-01-28T20:29:13 1738096153

Interesting. So they wouldn't want to disclose anything showing that they've illegally scraped research databases, for example, in violation of terms of service or copyright.

Won't this eventually come up in legal discovery when someone sues one of these firms for copyright infringement? They'd have to share their data in the discovery process to show that they haven't infringed.

jackjeff · 2025-01-28T08:52:02 1738054322

That’s a good point. Wouldn’t OpenR1 suffer from the same problem? Or does being open somehow shield them from legal repercussions?

michaelt · 2025-01-28T10:45:52 1738061152

Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.

You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.

Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".

Then a third "independent" team uses that synthetic dataset to train a model.

The theory is this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes and with enough shaking, the knowledge falls through but the copyright doesn't.

svnt · 2025-01-28T11:33:21 1738064001

Knowledge laundering