Not open source. Even if we accept model weights as source code, which is highly...

ronsor · 2024-11-05T20:34:40 1730838880

I will again ask the obligatory question: are model weights even copyrightable? And if not, does the "license" still matter?

parl_match · 2024-11-05T20:59:44 1730840384

I doubt there will be a satisfactory answer for a long time.

killjoywashere · 2024-11-05T22:37:17 1730846237

How's that NYTimes vs OpenAI lawsuit going? Last I can find is things are hung up in discovery: OpenAI has requested potentially a century of NYTimes reporters' notes.

https://news.bloomberglaw.com/ip-law/openais-aggressive-cour...

bdowling · 2024-11-05T22:55:29 1730847329

Half a century worth of reporters’ notes might be some valuable training data.

neilv · 2024-11-06T00:19:55 1730852395

> The AI company asked Judge Sidney H. Stein of the US District Court for the Southern District of New York to step in and compel the Times to produce reporters’ notes, interview memos, and other materials for each of the roughly 10 million contested articles the publication alleges were illegally plugged into the company’s AI models. OpenAI said it needs the material to suss out the copyrightability of the articles. The Times quickly fired back, calling the request absurd.

Can any lawyer on here defend OpenAI's request? Or is the article not characterizing it well in the quote?

warkdarrior · 2024-11-05T21:16:16 1730841376

(IANAL)

Model weights could be treated the same way phone books, encyclopedias, and other collections of data are treated. The copyright is over the collection itself, even if the individual items are not copyrightable.

TMWNN · 2024-11-05T21:18:12 1730841492

>phone books, encyclopedias, and other collections of data are treated

Encyclopedias are copyrightable. Phone books are not.

skissane · 2024-11-05T22:14:12 1730844852

> Encyclopedias are copyrightable. Phone books are not.

It depends on the jurisdiction. The US Supreme Court ruled that phone books are not copyrightable in the 1991 case Feist Publications, Inc., v. Rural Telephone Service Co. However, that is not the law in the UK, which generally follows the 1900 House of Lords decision Walter v Lane that found that mere "sweat of the brow" is enough to establish copyright – that case upheld a publisher's copyright on a book of speeches by politicians, purely on the grounds of the human effort involved in transcribing them.

Furthermore, under its 1996 Database Directive, the EU introduced the sui generis database right, which is a legally distinct form of intellectual property from copyright, but with many of the same features, protecting mere aggregations of information, including phone directories. The UK has retained this after Brexit. However, EU directives give member states discretion over the precise legal mechanism of their implementation, and the UK used that discretion to make database rights a subset of copyright – so, while in EU law they are a technically distinct type of IP from copyright, under UK law they are an application of copyright. EU law only requires database rights to have a term of 15 years.

Do not be surprised if in the next couple of years the EU comes out with an "AI Model Weights Directive" establishing a "sui generis AI model weights right". And I'm sure US Congress will be interested in following suit. I expect OpenAI / Meta / Google / Microsoft / etc will be lobbying for them to do so.

ronsor · 2024-11-05T21:24:45 1730841885

Encyclopedias may be collections of facts, but the writing is generally creative. Phone books are literally just facts. AI models are literally just facts.

roywiggins · 2024-11-05T21:33:41 1730842421

What if I train an AI model on exactly one copyrighted work and all it does is spit that work back out?

e.g., if I upload Marvels_Avengers.mkv.onnx and it reliably reproduces the original (after all, it's just a fact that the first byte of the original file is 0xF0, etc.)

bdowling · 2024-11-05T23:00:02 1730847602

A work that is “substantially similar” to a copyrighted work infringes that work, under US law, no matter how it was produced. (Note: Some exceptions apply and you have to read a lot of cases to get an idea of what courts find “substantially similar”.)

HWR_14 · 2024-11-06T03:48:55 1730864935

> no matter how it was produced

IIRC, this is wrong. Independent creation is a valid (but almost impossible to prove) defense in US copyright law.

This example is not an independent creation, but your reasoning seems wrong.

ronsor · 2024-11-05T21:35:49 1730842549

If the sole purpose of your model is to copy a work, then that's copyright infringement.

roywiggins · 2024-11-05T21:38:26 1730842706

Oh, in this case, the model can either reproduce the work exactly, or it can play tic-tac-toe depending on how you prompt it.

ronsor · 2024-11-05T21:41:30 1730842890

We can change "sole purpose" to "primary purpose", and I'd argue something that happens 50% of the time counts as a primary purpose.

margalabargala · 2024-11-05T22:37:42 1730846262

> AI models are literally just facts.

Are they, or are they collections of probabilities? If they are probabilities, and those probabilities change from model to model, that seems like they might be copyrightable.

If Google, OpenAI, Facebook, and Anthropic each train a model from scratch on an identical training corpus, they would wind up with four different models that had four differing sets of weights, because they digest and process the same input corpus differently.

That indicates to me that they are not a collection of facts.

ronsor · 2024-11-05T23:49:04 1730850544

The AI training algorithms are deterministic given the same dataset, same model architecture, and same set of hyperparameters. The main reasons the models would not be identical are differing random seeds and precision issues. The differences would not be due to any creative decisions.
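
To illustrate the point about seeds, here is a minimal sketch in plain Python. It is a toy linear fit with stochastic gradient descent, not real LLM training. The function name `train`, the dataset, and the hyperparameters are all made up for illustration. It only shows the seed effect; the hardware-level precision issues mentioned above are not modeled.

```python
import random

def train(seed, data, epochs=20, lr=0.05):
    """Fit y = w*x + b with plain SGD (hypothetical toy example).

    The seed controls both the initial weights and the shuffle order;
    everything else (data, architecture, hyperparameters) is fixed.
    """
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Same "corpus" for every run: points on the line y = 3x + 1.
data = [(i / 10, 3 * (i / 10) + 1) for i in range(10)]

run_a = train(seed=0, data=data)
run_b = train(seed=0, data=data)
run_c = train(seed=1, data=data)

print(run_a == run_b)  # same seed, same data: identical weights
print(run_a == run_c)  # different seed: the weights differ
```

The two runs with the same seed produce identical weights, while the run with a different seed ends up with different weights even though the data and the procedure are unchanged.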

PittleyDunkin · 2024-11-06T06:09:37 1730873377

Who gives a damn about copyright when this is clearly profiting off of someone else's work without compensation? Sometimes the law is inadequate and that's ok—the law just needs to change.

dplavery92 · 2024-11-05T21:43:57 1730843037

The title of Tencent's paper [0] as well as their homepage for the model [1] each use the term "Open-Source" in the title, so I think they are making the claim.

[0] https://arxiv.org/pdf/2411.02265 [1] https://llm.hunyuan.tencent.com/

vanguardanon · 2024-11-05T20:30:57 1730838657

What is the reason for restrictions in the EU? Is it due to some EU regulations?

ronsor · 2024-11-05T20:34:05 1730838845

Most likely yes. I don't think companies can be blamed for not wanting to subject themselves to EU regulations or uncertainty.

Edit: Also, if you don't want to follow or deal with EU law, you don't do business in the EU. People here regularly say if you do business in a country, you have to follow its laws. The opposite also applies.

troupo · 2024-11-05T20:39:43 1730839183

[flagged]

ronsor · 2024-11-05T20:42:57 1730839377

I will address both points:

1. No one is training on users' bank details, but if you're training on the whole Internet, it's hard to be sure if you've filtered out all PII, or even who is in there.

2. This isn't happening because no one has time for more time-wasting lawsuits.

troupo · 2024-11-05T20:49:49 1730839789

> No one is training on users' bank details, but if you're training on the whole Internet

Tencent has access to more than just bank accounts.

In the West there's Meta that this year opted everyone in their platform into training their AI.

> This isn't happening because no one has time for more time-wasting lawsuits.

No, this isn't happening because a) their models are, without fail, trained on material they shouldn't have willy-nilly access to and b) they want to pretend to be open source without being open source

bilbo0s · 2024-11-05T20:45:23 1730839523

??

Doesn't that mean if they used data created by, (or even the data of), anyone in the EU, that they would want to not release that model in the EU?

This sounds like "if an EU citizen created, or has data referenced, in any piece of the data you trained from then..."

Which, I mean, I can kind of see why US and Chinese companies prefer to just not release their models in the EU. How could a company ever make a guarantee satisfying those requirements? It would take a massive filtering effort.

em500 · 2024-11-05T21:18:16 1730841496

This seems to mirror the situation where US financial regulations (FATCA) are seen as such a hassle to deal with for foreign financial institutions that they'd prefer to just not accept US citizens as customers.

troupo · 2024-11-05T20:52:12 1730839932

> that they would want to not release that model in the EU

They don't release that model in the EU, that's correct

> This sounds like "if an EU citizen created, or has data referenced, in any piece of the data you trained from then..."

Yes, and that should be the default for any citizen of any country in the world.

Instead you have companies like Meta just opting everyone in to their AI training dataset.

> I can kind of see why US and Chinese companies prefer to just not release their models in the EU.

Companies having unfettered unrestricted access to any and all data they want is not such a good thing as you make it out to be

warkdarrior · 2024-11-05T21:26:51 1730842011

> > This sounds like "if an EU citizen created, or has data referenced, in any piece of the data you trained from then..."

> Yes, and that should be the default for any citizen of any country in the world.

This is a completely untenable policy. Each and every piece of data in the world can be traced to one or more citizens of some country. Actively getting permission for every item is not feasible for any company, no matter the scale of the company.

andyferris · 2024-11-05T22:09:04 1730844544

I think that’s kinda the point that is being made.

Technology-wise, it is clearly feasible to aggregate the data to train an LLM and to release a product on that.

It seems that some would argue that was never legally a feasible thing to do, based on the training data being impossible to use legally. So, it is the existence of many of these LLMs that is (legally) untenable.

Whether valid or not, the point may be moot because, like Uber, if the laws actually do forbid this use, they will change as necessary to accommodate the new technology. Too many “average voters” like using things such as ChatGPT and it’s not a hill politicians will be willing to die on.

troupo · 2024-11-06T07:20:31 1730877631

> Actively getting permission for every item is not feasible for any company, no matter the scale of the company.

There's a huge amount of data that:

- isn't personal data

- isn't copyrighted

- isn't otherwise protected

You could argue if that is enough data, but neither you nor corporations argue that. You just go for "every single scrap of data on the planet must be made accessible to supranational trillion-dollar corporations, without limits, now and forever"

blueblimp · 2024-11-05T20:43:21 1730839401

In Meta's case, the problem is that they had been given the go-ahead by the EU to train on certain data, and then after starting training, the EU changed its mind and told them to stop.

GaggiX · 2024-11-05T20:38:30 1730839110

They probably trained on data protected by privacy laws, similar to Meta.

karaterobot · 2024-11-05T20:37:54 1730839074

Hmm, in fairness I don't see where Tencent is claiming this is open source (at least in this repo; I haven't checked elsewhere). The title of the HN post does make the claim, and that may be controversial or simply incorrect.

swyx · 2024-11-05T21:33:03 1730842383

readme: https://github.com/Tencent/Tencent-Hunyuan-Large

> "By open-sourcing the Hunyuan-Large model"

kaliqt · 2024-11-05T20:35:48 1730838948

I agree; however, Meta is also guilty of this crime.

foooorsyth · 2024-11-05T20:44:11 1730839451

[flagged]

mrob · 2024-11-05T20:48:08 1730839688

The term "open source" had no significant use to refer to software before the Open Source Initiative started promoting it. Previously, it was only intelligence industry jargon, meaning "publicly available information", which includes software that fails your "can read the source code" test. "Source" was used in the journalistic sense, not as in "source code". The correct term for software that passes your test but does not meet the Open Source Definition is "source available".

kube-system · 2024-11-05T23:41:26 1730850086

The OSI made a huge mistake in choosing to use a non-trademarkable borrowed term as their own trade industry term. The original (and quite long standing) use to refer to publicly available texts is still widely used, and English isn't a prescriptive language outside of legal frameworks like trademark. This is why you really should pick a trademarkable name when you try to define trade marks.

HDThoreaun · 2024-11-05T21:31:56 1730842316

open source means the source code is openly available. That is it. Phrases that have intuitive meaning need to stop being co-opted.

mrob · 2024-11-05T21:37:59 1730842679

If that meaning is "intuitive", why was it not used before the Open Source Initiative introduced their definition? The competing uses are the ones co-opting an existing phrase.

foooorsyth · 2024-11-05T23:47:47 1730850467

It’s perfectly intuitive to anyone with a brain. Never heard of OSI but they seem just about as pedantic, neurotic, and annoying with language as FSF.

Open source = I can view the source code. That’s what it means, that’s what it has always meant, and that’s what it will always mean. Simple as.

DataDaemon · 2024-11-05T21:26:49 1730842009

Who cares about EU? They are destroying themselves.

the5avage · 2024-11-05T21:52:57 1730843577

Where would you go if you lived there (as a programmer interested in AI)? Just asking for a friend.

Mistletoe · 2024-11-05T22:11:30 1730844690

Ironically their policies are why I want to move there with my American dollars. I want to live somewhere that cares about my rights, not the rights of corporations.

CamperBob2 · 2024-11-05T22:31:52 1730845912

That's fine, but don't complain when you lose access to products and services that are widely available elsewhere.

In particular, restrictions on ML models will leave you without access to extremely powerful resources that are available to people in other countries, and to people in your own country who don't mind operating outside the law. Copyright maximalism is not, in fact, a good thing, and neither is overbearing nanny-statism. Both will ultimately disempower you.

bluefirebrand · 2024-11-05T23:19:56 1730848796

You have to realize that as an individual, you have no power anyways

It doesn't matter if an individual personally has access to ML models, because government and/or huge corporations will ensure that individuals cannot use them for anything that would threaten government or corporate interests

This unfettered explosion of ML growth is disempowering all of us. Those with power are not using these tools to augment us, they are hoping to replace us.

CamperBob2 · 2024-11-05T23:56:34 1730850994

> This unfettered explosion of ML growth is disempowering all of us.

Never mind that I've gotten things done with ChatGPT that would otherwise have taken much longer, or not gotten done at all. If this is what "disempowerment" feels like, bring it on.

Although the tech is nowhere near ready to make it happen, I would be very happy to be "replaced" by AI. I have better things to do than a robot's job. You probably do, too.