Hacker News

No company will ever disclose its training data, because doing so would open it up to endless liability.



Exactly. Meta won't do it for the same reason. Liability alone: imagine all the copyright lawsuits...

Second, the dataset itself is, for now, a significant competitive advantage.

In a way it seems like a good thing that AI giants compete on methodology now.


Interesting, so they wouldn't want to disclose anything that shows they've illegally scraped research databases, for example (terms-of-service or copyright violations).

Won't this eventually come up in legal discovery when someone sues one of these firms for copyright infringement? They'd have to share their data in the discovery process to show that they haven't infringed.


That’s a good point. Wouldn’t OpenR1 suffer from the same problem? Or does being open somehow shield them from legal repercussions?


Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.

You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.

Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".

Then a third "independent" team uses that synthetic dataset to train a model.

The theory is that this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes, and with enough shaking the knowledge falls through but the copyright doesn't.
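The three-stage pipeline described above can be sketched as follows. This is a toy illustration only: all names and data are hypothetical, and real pipelines train neural models rather than lookup tables. It shows only the structure of the indirection, i.e. how the third model ends up with the knowledge without ever touching the original corpus.

```python
# Stage 1: train a model directly on the (copyrighted) source corpus.
def train(corpus: list[tuple[str, str]]) -> dict[str, str]:
    """Stand-in for model training: memorize question -> answer pairs."""
    return dict(corpus)

# Stage 2: a second, "independent" team samples the first model to build
# a synthetic dataset, which is then released as "public domain".
def generate_synthetic(model: dict[str, str]) -> list[tuple[str, str]]:
    """Stand-in for sampling: derive Q/A pairs from the model's outputs."""
    return [(q, a) for q, a in model.items()]

# Stage 3: a third team trains a fresh model only on the synthetic set.
pirated_textbooks = [("What does F = ma state?", "Newton's second law")]
model_a = train(pirated_textbooks)        # touches the pirated data
synthetic = generate_synthetic(model_a)   # the "public domain" release
model_b = train(synthetic)                # never saw the original corpus

print(model_b["What does F = ma state?"])  # prints "Newton's second law"
```

Whether any court would accept that the copyright was "shaken out" at stage 2 is, of course, exactly the open question the thread is debating.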


Knowledge laundering



