More

groceryheist · on Dec 28, 2023

The suit demonstrates instances where ChatGTP / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GTP outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: What about a model can (or can't) be "transformative"?

visarga · on Dec 28, 2023

> Just learn to recognize and punish plagiarism via RLHF.

This is not a RLHF problem. What I was expecting them to do is to keep a bloom filter of ngrams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validate against it. The model would only output at maximum n-1 words that look verbatim from the source.

But this will blow up in their face. Let's see:

- AI companies will start investing much more in content attribution

- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret

- Then people will start seeing a chilling effect on creativity

- We must also check NYT against all the other sources, not everything the write is original

groceryheist · on Dec 28, 2023

Maybe the bloom filter solution is enough, but I wonder.

- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.

- As n gets big, the bloom filter has to also.

If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.

visarga · on Dec 28, 2023

As n-gram length grows, we are still going to have the same number of ngrams, they go through a hashing function and indexed in the bloom filter as usual. The number of n-grams size n in a text is text_length - ngram_length + 1.

groceryheist · on Dec 28, 2023

The number of unique values in the bloom filter will go up ~exponentially with n. So to control the false positive rate the bloom filter has to grow.

visarga · on Dec 28, 2023

At large enough ngram size there would be very few collisions. You can take for example this text and try in Google with quotes, it won't find anything matching exactly.

I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.

dleeftink · on Dec 28, 2023

> it won't find anything matching exactly

This approach is probably inadequate. In my line of (NLP) research I find many things have been said exactly many, many times over.

You can try this out yourself by grouping and counting strings using the many publically available Bigquery corpora for various substring lengths and offsets, e.g. [0-16]; [0-32]; [0-64] substring lengths at different offsets.

groceryheist · on Dec 28, 2023

Yes and the fact that the number of unique phrases grows so quickly with n is why the bloom filter needs to grow so that hashed n-grams don't collide.

geysersam · on Dec 28, 2023

> if compensation is high enough

Who pays the compensation? If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

starttoaster · on Dec 28, 2023

> If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?

If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?

A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.

groceryheist · on Dec 28, 2023

The LLM users/middlemen pay. The user probably pays less than they would have to pay the author. The LMM provides information retrieval / discovery.

sideshowb · on Dec 28, 2023

I like the idea but seems like there would be big problems. Like detecting if a work is reworded. Or a large number of sources have all slightly influenced a small response - isn't that pretty much considered new knowledge?

Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.

mike_hearn · on Dec 28, 2023

I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.

Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.

Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.

So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.

That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.

But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.

cycomanic · on Dec 28, 2023

This is not how copyright works though. The verbatim quoting of articles is because when people brought up these questions initially the argument was that the NN doesn't really contain the training data or really just in an abstract, condensed way that does not constitute copying of the content.

This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS windows by e.g. zipping it together with some other OS on the same medium. Why should we let OpenAI get away with this?

mike_hearn · on Dec 28, 2023

Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.

> Why should we let OpenAI get away with this?

IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.

dyno12345 · on Dec 28, 2023

https://en.wikipedia.org/wiki/W-shingling

colechristensen · on Dec 28, 2023

I think NYT is going to win.

LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.

The second piece of evidence is this paper explained here https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... where instead of an LLM researchers used gzip compressed data as a model and it even beat trained LLMs.

AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and patented that querying that data is fair use.

There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.

Licensing will have to be worked out, actual laws and not just case law needs to be written. I have a lot of sympathy for lots of leeway for the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft sponsored openai.

z4y5f3 · 2023-12-29T08:36:44 1703839004

Unfortunately GZIP won't beat LLMs for text classification. The research you cited is just poorly done science that has been widely debunked. The original paper compared top-2 accuracy of GZIP with top-1 accuracy with BERT. The dataset also contains a lot of train/test data leakage. See this article for the rebuttal: https://kenschutte.com/gzip-knn-paper/ and this thread for a previous discussion on hackernews: https://news.ycombinator.com/item?id=36758433.

Further, the evidence presented by NYT in the lawsuit could be hard to reproduce. I tried multiple prompts on multiple versions of GPT-4 APIs but still could not get GPT-4 to reproduce NYT articles exactly. NYT might as well tried to let GPT-4 reproduce 100,000 articles and only found a few cases where GPT-4 actually recited the whole article. In that case OpenAI might as well be arguing that this is only a rare bug and avoid losing the lawsuit in a massive way.

jahewson · on Dec 28, 2023

Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.

fsckboy · on Dec 28, 2023

in the same way that machines are not able to claim copyright, they aren't allowed to claim other legal rights either, like "fair use".

The entity which owns ChatGPT is apparently maintaining a copy of the entirety of the New York Times archive within the ChatGPT knowledge base. That they extract some fair use snippets (they would claim) from it would still be fruit of a poisoned tree, no?

(disclaimer: I'm pro AI, anti copyright, especially anti elitist NY Times; but pro rule of law)

visarga · on Dec 28, 2023

There is another fix, but it will have to wait for GPT-5. They could reword articles, summarize in different words and analyze their contents, creating sufficiently different variants. The ideas would be kept, but original expression stripped. Then train GPT5 on this data. The model can't possibly regurgitate copyrighted content if they never saw it during training.

This can be further coupled with search - use GPT to look at multiple sources at once, and report. It's what humans do as well, we read the same news in different sources to get a more balanced take. Maybe they have contradictions, maybe they have inaccuracies, biases. We could keep that analysis for training models. This would also improve the training set.

colechristensen · on Dec 28, 2023

I think there is some point between fifty years ago and last week in which the copyright for the content of newspapers should be public domain. That part of copyright needs to be fixed.

Your creative work does deserve at least some period of exclusive rights for you. Definitely not so much that your grandchildren get to quibble about it well into retirement. But also whatever the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.

TeMPOraL · on Dec 28, 2023

> But also whatever the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.

Here's a thing though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.

If it were not for world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today - it's hard to overstate just how much more useful any copyrighted content is for society as LLM training data, than as whatever it was created for originally.

tsimionescu · on Dec 28, 2023

Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.

aspenmayer · on Dec 28, 2023

Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?

tsimionescu · on Dec 28, 2023

That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much cheaper than NYT, while paying exactly 0 to NYT for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete on price with OpenAI, for their own content.

Note that I don't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference from last year's archive vs today's news.

aspenmayer · on Dec 28, 2023

For the sake of argument, let’s say that OpenAI thought it had the rights to process the NYT articles and even display them in part, for the same reasons, fair use or otherwise, that Google can process articles and display snippets of same in its News product, and/or for the same reasons that Google can process books and display excerpts in its Books product. Just like Google in those cases, I would not be surprised to find Google/OpenAI on the receiving end of a lawsuit from rights holders claiming violations of their copyright or IP rights. However, I side with Google then and OpenAI now, as I find both use cases to be fair use, as the LinkedIn case has shown that scraping is fair use. NYT is crying foul because users/consumers of its content archive have derived unforeseen value from said archive and under fair use terms, so NYT has no way to compel OpenAI to negotiate a licensing deal under which they could extract value from OpenAI’s use of NYT data beyond the price paid by any other user of NYT content, whether it be unpaid fair use or fully paid use under license. It feels to me that NYT is engaging in both double-dipping and discriminatory pricing, because they can, and because they’re big mad that OpenAI is more successful than they are with less access to the same or even less NYT data.

furyofantares · on Dec 28, 2023

> Just learn to recognize and punish plagiarism via RLHF.

I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.

Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.

That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.

joe_the_user · 2023-12-28T20:31:11 1703795471

I don't think you could use RLHF to stop plagerism. RLHF can be used to teach what "angry response" is because you look at the text itself for qualities. A plagerized text doesn't have any special qualities aside from "existing already", which you can only determine by looking at the world.

One thing you might do is use a full-text search database of the entire training data. If part of ChatGPT response is directly copied, give it the assignment of "please paraphrase this" and substitute the paraphrase into the response. This might slow ChatGPT down a lot - but it might not, I think an LLM is actually more computationally expensive than a full-text search by a lot.

groceryheist · 2023-12-28T22:20:35 1703802035

I agree that this sketch comes closer to working in practice than simple RLHF. In my earlier comment I was imagining bringing in some auxiliary data like you describe to detect plagarism and then using RL to teach the model not to do it.

joe_the_user · 2023-12-29T00:15:17 1703808917

I was surprised that I came up with a plausible sounding method. I had thought on first blush that this was impossible but now it seems reasonable. You could still have various exfiltration methods like "give me the data with each word backwards" and I'm not sure where that would stand legally.

groceryheist · 2023-12-29T00:52:50 1703811170

Yes, of of the hard and interesting legal questions is if creating a possibility of such attacks constitutes a copyvio.

empiko · on Dec 28, 2023

I wouldn't say it is an unexpected behavior. I remember reading papers about this memorization behavior few years ago (e.g., [1] is from 2019 and I believe it is not the first paper about this). It should be expected from OpenAI to know that LMs can exhibit memorizing behavior even after seeing the sample only once.

[1] https://bair.berkeley.edu/blog/2019/08/13/memorization/

furyofantares · on Dec 28, 2023

My expectation is that it can't memorize most of its training data. I expect it to memorize some.

peyton · on Dec 28, 2023

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?

cornel_io · on Dec 28, 2023

If they could find a single person who in natural use (e.g. not as they were trying to gather data for this lawsuit) has ever actually used ChatGPT as a direct substitution for a NYT subscription, I'd support this lawsuit.

But nobody would do that, because ChatGPT is a really shitty way to read NYT articles (it's stale, it can't reliably reproduce them, etc.). All that is valuable about it is the way that it transforms and operates on that data in conjunction with all the other data that it has.

The real world use of ChatGPT is very transformative, even if you can trick it into behaving in ways that are not. If the courts act intelligently they should at least weigh that as part of their decision.

peyton · on Dec 28, 2023

It’s more of a thought experiment. Here’s another with more commercial applications:

Suppose I start a service called “EastlawAI” by downloading the Westlaw database and hiring a team of comedians to write very funny lawyer jokes.

I take Westlaw cases and lawyer jokes and feed them to my autoencoder. I also learn a mapping from user queries to decoder inputs.

I sell an API and advertise it to startups as capable of answering any legal question in a funny way. Another company comes along with an API to make the output less funny.

Have I created a competitor to Westlaw by copying Westlaw’s works for their original expressive purpose and exposing it as an intermediary? Or have I simply trained the world’s most informative lawyer joke generator that some of my customers happen to use for legal analysis by layering other tools atop my output?

Did I need to download Westlaw cases to make my lawyer joke generator? Are the jokes a fair-use smokescreen for repackaging commercially valuable copyrighted data? Does my joke generator impact Westlaw in the market? Depends, right?

whoopsie · on Dec 28, 2023

That’s nonsense piracy. I never intend to own a truck, so when I need to haul a little something I go to Home Depot and steal a Ford off the lot for an hour? What if I stole all your commits, plucked the hard lines out of the ceremony, and then launched an equivalent feature the same week as you did, but for a competing software company? Would you or your employer deserve to get paid for my use of the slice of your work that was specifically useful for me? Yeah, and then some extra for theft.

Zpalmtree · 2023-12-28T18:08:21 1703786901

awful comparison

hn_acker · on Dec 28, 2023

> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

To be clear, whether the use of the original work is transformative is one key consideration within one of the four prongs of fair use. The prong "purpose and character of the use" can be fulfilled by other conditions [1]. For example, using the original work within a classroom for education purposes is not transformative, but can fulfill the same "purpose and character of the use" prong. Whether the use is for profit and to which extent are other considerations within that prong. A profit purpose doesn't automatically fail the purpose prong, and a non-profit purpose doesn't automatically pass the purpose prong.

[1] https://en.wikipedia.org/wiki/Fair_use#1._Purpose_and_charac...

echelon · on Dec 28, 2023

> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

They transformed the weights.

Just like reading the article transforms yours.

As for verbatim reproduction, I'm pretty sure brains are capable of reproducing song lyrics, musical melodies, common symbols ("cool S"), and lots of other things verbatim too.

Those quotes from Dr. King's speech that you remember are copyrighted, you know?

JambalayaJim · on Dec 28, 2023

This comment is just blatant anthropomorphizing of ML models. You have no idea if reading an article “transforms weights” in a human mind, and regardless, they aren’t legally the same thing anyway.

stevenhuang · on Dec 28, 2023

Modern neuroscience does highly suggest this is essentially what's happening.

echelon · on Dec 28, 2023

> they aren’t legally the same thing anyway.

They should be.

JambalayaJim · 2023-12-28T18:29:11 1703788151

Why? A human being isn’t infinitely scalable; they’re just different. It’s the same thing as going to a movie theatre to watch a movie vs. recording it with a camera.

echelon · 2023-12-28T19:20:10 1703791210

A human churning butter, spinning cotton, or acting as a bank teller isn't infinitely scalable either. This is orthogonal to the point.

Times change. We're industrializing information creation and consumption (the latter is mostly here already), and we can't be stuck in the old copyright regime. It'll be useless in very short order.

All this road bump will do will give the giant megacorps time to ink deals, solidify their lead, and trounce open source. Twenty years on, the pace of content creation will be as rapid as thought itself and we'll kick ourselves for cementing their lead.

This is a transitional period between two wildly different worlds.

hn_acker · on Dec 28, 2023

> This is a strong claim that just downloading articles into training data is what violates the copyright. That GTP outputs verbatim copies is a red herring.

It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:

> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.

The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)

[1] https://en.wikipedia.org/wiki/Substantial_similarity#Substan...

dragonwriter · on Dec 28, 2023

The level of copying here is the copying into the training set, not the copying through use of the model.

Its true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the next use reaches back and renders that copying fair use, but while that's clearly the dominant position of the AI industry, and it definitely seems compatible with the Cobstitutional purpose of Fair Use (while currently statutory, the statutory provision is codification of Constitutional case law), it is a novel fair use argument.

hn_acker · on Dec 28, 2023

> The level of copying here is the copying into the training set, not the copying through use of the model.

NY Times is suing because of both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why? Because of the substantial similarity requirement. But first, I'll address the training set.

For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. Currently, no court has ruled that "analogous to time-shifting" is good enough for the time-shifting precedent to apply, but I think the difference is not significant. The same applies to companies. Companies are not literally people, but there isn't a reason for the time-shifting precedent to not apply to companies.

What about the articles that OpenAI obtained through illegal methods? Then the very act of obtaining those articles would be illegal. The training set contains those copies, so NY Times can sue to make OpenAI delete those copies and pay damages. But it's not trivially obvious that a GPT model is a copy of any works or contains copied expression of the any works in the training set; the weights that make up the model represent millions of works, it's not trivially obvious that the model contains something substantially similar to the expression in a work in the training set. Therefore, it's not trivially obvious that infringement with respect to the training set amounts to infringement with respect to the model made from the training set. If OpenAI obtained NY Times articles through illegal means, then making OpenAI delete the training set would be reasonable, but the model is a separate matter.

As long as the model doesn't contain copied expression and the weights can't be reversed into something substantially similar to expression in the existing works, then what matters is the output of the model.

If a user gives a prompt which contains no reference to an existing NY Times author, work, or a strongly associated characteristic/style, then do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing works, because the infringing works result from the user's prompts. If my premise is false, then my conclusion falls apart. But if my premise is true, then at most I would admit that OpenAI has a limited burden to prevent users from giving those prompts.

kromem · on Dec 28, 2023

This isn't an issue with training, it's an issue with usage.

Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.

But we shouldn't be preventing the development of a technology that in 99.99% of usecases isn't doing that and can used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.

The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?

Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?

bertil · on Dec 28, 2023

Adding an extra constraint of no copying verbatim from a very large and relevant corpus will be hard to guarantee without enormous databases of copyrighted content (which might not be legal to hold) and add an extra objective to a system with many often contradictory goals. I don’t think that’s the technology-sound solution or one in the interest of anyone involved. It’s much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that’s the best answer or adapt (translate, simplify, add context) when appropriate.

I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).

pants2 · on Dec 28, 2023

What if OpenAI were to first summarize or transform the content before training on it? Then the LLM has never actually seen copyrighted content and couldn't produce an exact copy.

bertil · 2023-12-29T13:37:09 1703857029

You are assuming a lossy compression. Stylistic guidelines and personal habits of beat journalists suggest you might not, depending on how detailed the LLM is. The complaint has many quotes that are long verbatim sections.

joe_the_user · 2023-12-28T20:17:42 1703794662

Any lawsuit makes all the claims it can and demands every sort of relief it might plausibly have. That's not to say that's how it should be (it can have awful results), just to say that's what to expect (and hope courts only considers the reasonable claim - "stop freely sharing our data" and avoids ridiculous/anti-fair-use claim "you can't even store our data").

The thing about you claim, "Just learn to recognize and punish plagiarism via RLHF" is that we've had an endless series of prompt exploits as well as unprompted leakage and these demonstrate that an LLM just doesn't have fixed border between its training data and its output. This will it basically impossible for OpenAI to say "we can logically guarantee ChatGPT won't serve your data freely to anyone".

intrasight · on Dec 28, 2023

Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).

namlem · on Dec 28, 2023

You can't, but there are some people who can quickly memorize entire pages of written text.

intrasight · on Dec 28, 2023

That's why I qualified with "at least most"

anankaie · 2023-12-31T00:51:45 1703983905

And most ML models cannot do that either. This is not a good analogy.

JumpCrisscross · on Dec 28, 2023

> Just learn to recognize and punish plagiarism via RLHF

OpenAI has created a $100bn company on this transfer. The Times may have an interest in a material fraction of that wealth.

vidarh · on Dec 28, 2023

The NYT is also worth a tiny fraction of that. If it looks like they might get anywhere, it might be better for OpenAI to buy them

afavour · on Dec 28, 2023

That would require NYT being willing to sell, which historically they have not been.

vidarh · on Dec 28, 2023

I just looked up the share structure; didn't realise the publicly traded shares only appoints 1/3 of the board. Still their second best option is start buying up competitors and going ahead with purging NYT from their training set. That might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space and the moment OpenAI or other LLM providers own major publishers so they don't need to depend on scraping, they lose any leverage they currently have.

JumpCrisscross · on Dec 28, 2023

> might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space

The Times almost certainly wants its own LLM. I could see them striking a consortium agreement with other newspapers more easily than OpenAI.

dom96 · 2023-12-28T19:52:15 1703793135

If it looks like they might get anywhere, then lots of companies will also be able to get there, OpenAI can't buy all of them.

vidarh · 2023-12-29T13:15:18 1703855718

They won't need to. Most don't have enough money to survive a prolonged round of lawsuits, and the potential damages are limited. The only real leverage is taking their models out of circulation and cutting their training set and that leverage only exist for the large publishers.

cmiles74 · on Dec 28, 2023

OMG! Or they could just license the content. I suspect that would be both easier and less expensive. ;-)

vidarh · on Dec 28, 2023

I'm not convinced it's a given it will. If it becomes necessary to license, owning the large publishers will be leverage and allow locking competitors out unless you have a portfolio to cross license.

OpenAI alone has a market cap that'd allow it to buy about as large a proportion of publishers of newspapers and books as they'd be allowed before competition watchdogs will start refusing consent.

Put another way:

If I was a VC with deep pockets investing in AI at this point, I'd hedge by starting to buy strategic stakes in media companies.

amadeuspagel · on Dec 28, 2023

> The suit demonstrates instances where ChatGTP / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

Isn't that in tension with the basic idea of an LLM of predicting the next token? How do you achieve that while never getting close enough to plagiarism?

spacecadet · on Dec 28, 2023

Transformations are happening. Maybe if the output is verbatim afterwards, than that says something about the outputs originality all along... or am I a troll?

dathery · on Dec 28, 2023

They're talking about transformative with regard to copyright law where it is an important part of determining fair use, not the dictionary definition you're using here.

I can't take NY Times articles, translate them into Spanish, and then sell the translations under fair use, even though clearly I've transformed the original article content.

jarrell_mark · on Dec 28, 2023

Anything + 2 and then minus two is back to the original thing. This says more about the transformations than the source material.

spacecadet · on Dec 28, 2023

I know, I was trying to be funny, but hey- this community...

groceryheist · on Nov 11, 2023

Thanks for pointing this out. It would be nice to have a public domain or creative-commons licensed image to use instead.

pyinstallwoes · on Nov 13, 2023

If only there was evidence indeed

groceryheist · on Nov 4, 2023

Let's float all these scarce and valuable resources on a barge in international waters and claim to be a sovereign state. Very nice way to organize donations to pirates!

groceryheist · on Aug 24, 2023

Intractable. Only the simplical complex of thought provides the optimal balance of expressiveness and constraint.

sorokod · on Aug 24, 2023

Well ok, maybe only for distilling the essence of whole networks of thoughts

groceryheist · on Aug 8, 2023

Cory Doctorow

bjelkeman-again · on Aug 8, 2023

I thought it was older than last year, but I must be mixing up presentations I saw recently by Cory. https://en.wiktionary.org/wiki/enshittification

ydant · on Aug 9, 2023

It feels older to me, too. It definitely gained a lot of momentum recently around Reddit's recent API pricing changes, but Google Trends shows it having a lot of interest in 2004 and then sporadic interest over the years.

https://trends.google.com/trends/explore?date=all&geo=US&q=e...

I don't use Google Trends enough to know how to interpret that data, so maybe there's some quirk of Google Trends that makes it look more significant back then.

It would be awesome if some service had a similar feature to Google Ngram Viewer, but for content on the internet instead of books. Search interest isn't quite the same as how much it's being written in content.

groceryheist · on June 16, 2023

Can't you just follow them all from a 3rd server, even a mastodon server?

Semaphor · on June 16, 2023

Yeah, forgot to mention that not exactly unimportant part, I’m on neither myself and can use them just fine -.- edited it in.

mikae1 · on June 16, 2023

You can.

groceryheist · on March 29, 2023

You've never checked a receipt on your way out of the store to make sure you weren't charged for anything you didn't purchase?

vuln · on March 29, 2023

With the invention of self checkout I never get charged for an item I didn’t purchase. I have never encountered that issue with a cashier either.

Hell I cannot remember a time when I was _charged for something I didn’t purchase_.

I have been charged MORE than advertised but paying attention while ringing up the item and escalating to the cashier or manager typically solves the problem. I have also been charged less and notified the cashier of the issue.

Why would you check your receipt _after_ you left the store? Now you have to drive back, go inside, go to customer service and wait.

Seems way easier to pay attention and scan the receipt when it is handed to you.

kube-system · on March 29, 2023

I check the receipt when I'm charged. Same as with Amazon Fresh.

SargeDebian · on March 29, 2023

Every time my groceries arrive I'm somewhat clueless as to what I ordered, though past me always seems to have made good choices. Checking my recipt after 50 hours would effectively be useless.

kube-system · on March 29, 2023

I suspect 50 hours is an extreme. Last time it took me 2 hours to get my receipt.

groceryheist · on Nov 8, 2022

If true, (is newsweek a reliable source for Iran coverage?) this would be an immensely risky and possibly stupid move. If they will kill you for protesting what deters you from terrorism and assassination?

tptacek · on Nov 8, 2022

Newsweek isn't a reliable source for any coverage. It essentially went under in 2013 and its brand assets were acquired by a company with deep ties to "The Community", a Korean religious sect (parallels to UPI and the Unification Church). Regardless of how you feel about "The Community", the Newsweek you're seeing today has essentially no relationship to the historical Newsweek.

It's always worth checking the bylines on Newsweek articles and trying to trace back where the writers came from. For awhile, most of the Newsweek stories I saw were written by people with SEO backgrounds, not journalists. (You might reasonably be fine with this, too, but the branding is misleading).

CamperBob2 · on Nov 8, 2022

I think my first clue was when they "outed" Satoshi Nakamoto as a completely-random retired Asian guy in SoCal. Guess that happened after the acquisition by the Korean sect. I hadn't heard about that.

cpleppert · on Nov 9, 2022

This isn't true. Newsweek is reporting an unverified tweet which is from an account focused on the conflict in Ukraine. The tweet itself seems to be referring to the letter on sunday signed by 227 lawmakers urging the death penalty.

chatterhead · on Nov 8, 2022

The fact that they will likely take it out on your family, friends and anyone suspected of knowing about your actions. Not to mention the shit that happens before they kill you.

Man is a predator ape. Don't get it twisted.

groceryheist · on Nov 8, 2022

Mass executions lead to only 2 outcomes: total fear or total war.

groceryheist · on Nov 1, 2022

> There is literally nothing that happens if I know what’s going on there.

> Nothing at all. It distracts me from my family, friends, work, and myself.

> I have no interesting insights or original thoughts about it, and if I did,

> absolutely no one would care. So why read it?

Do you vote or participate in politics in any way or do events in the news directly affect your life? One reason it's rational to read the news: If you don't read the news, then you're at risk of lacking important contextual knowledge about society and social norms. It's pragmatic to keep track of these conversations to avoid embarrassment and to not behave in ways that might offend others. I find it useful to be aware of contentious issues on which I might have a side or a stake.

gernb · on Nov 1, 2022

> > I tend to feel disturbed by how little it presents as reporting, and instead, how much it appears to have an agenda.

> If you don't read the news, then you're at risk of lacking important contextual knowledge about society and social norms.

Unfortunately there's very little news that doesn't have an agenda. So I'm not getting the news. I'm getting someone's opinions with various facts spun in a way so as to try to get me to vote one way or another. Is that a better way to be informed on how to vote?

unfoldedCravat · on Nov 1, 2022

As Mark Twain once said "If you don’t read the newspaper, you are uninformed. If you do read the newspaper, you are misinformed."

I guess it's up to the individual to choose what they prefer.

mdp2021 · on Nov 1, 2022

You may note that it is substantially symmetrical to

"To be, or not to be ...: Whether 'tis nobler ... to sleep No more ... [or] in the mind to suffer The slings and arrows of outrageous fortune"

I.e.: if you want to participate, then some degree of effort is involved.

shaky-carrousel · on Nov 1, 2022

I'm too old to feel embarrassed over what other people think about me.

And I already gave up the game of trying not to offend people. I say what I think, and if it happens that I offend someone, I'll consider it when it happens and apologize or ignore it, depending on if it makes sense to me.

joshspankit · on Nov 1, 2022

> If you don't read the news, then you're at risk of lacking important contextual knowledge about society and social norms.

It’s been at least a decade since news has been a reflection of social norms (I’m in Canada). It feels more and more disconnected and like they’re either pushing agendas or simply filling airtime so they can serve the right amount of ads.

steve_adams_86 · on Nov 2, 2022

This is what I find most troubling. It really does seem to be a spectacle that’s meant to capture attention, only rooted in reality insofar that it’s based off of real-world topics we all recognize and only ostensibly care about.

steve_adams_86 · on Nov 2, 2022

A lot of news is extremely hard to avoid and I’ll inevitably be exposed to it. If it seems important like the types of topics you’re describing, I’ll carve out some time to find out if it’s actually important.

Most of it is a spectacle and my actions and point of view are entirely irrelevant to it. Frankly, I don’t believe a lot of the news is even rooted in reality. To have a point of view on it, or to have feelings about it, would be to lend legitimacy to something I sincerely believe is part of a bizarre, performative, ephemeral blob of social invention. I don’t mean to sound absurd or like a conspiracy theorist at all; I really can’t see how much of what’s in the media is remotely objective or connected to any kind of shared reality that is practical and sane.

As far as being embarrassed, I find I can get by with simply listening to people and letting them get their ideas across. They like to express themselves and I’m content to listen. I typically won’t be embarrassed because no one really cares what I think or what I think I know, too. No one asks unless I say something in the first place.

I think we’re encouraged to believe it’s some other way, but if you keep to yourself you’ll find very quickly that indeed, almost no one cares. Even here on hacker news, the vast majority of readers don’t care about what I wrote. It’s irrelevant to them, and that’s okay. It’s probably a good thing.

The only people in my tiny corner of all of this who have any remote concerns about me are right here, near me, in my life. The rest is mostly a spectacle. If it can effect me, the chances that I can effect it in any meaningful way are infinitely tiny.

I do vote, too, but again… I’m not convinced it means what people think it means, or that it leads to things I hope it will. On the other hand, by working on local concerns I can actually make a difference. The changes we can make like restoring a watershed to improve wildlife diversity, reduce erosion, improve runoff capacity, and so on can be a real model for improving these features of other municipalities. That can matter many times more than a vote for the leader of my country; it can lead to real data that people turn to improve the places they live in. It seems totally uninteresting and pointless compared to the notion of democracy and national governance, but from my point of view, it’s far more pragmatic and connected to reality.

I agree with you overall. I want to be aware of contentious issues on which I might have a side or a stake too. I think the difference between us is only that I think the scope of these issues is extremely small; there’s very little I come across in my day where it truly matters at all what I believe, and any stake I may have is almost exclusively relevant to me alone.

GoblinSlayer · on Nov 1, 2022

There's a story how an ancient egyptian priest complained about contemporary youth being disrespectful to elderly. Social and economic circumstances differ, but it feels like the theory of ethics was mostly the same.

ohCh6zos · on Nov 1, 2022

Interestingly if you everyone went dark news driven social norms would have to disappear.

groceryheist · on Oct 10, 2022

Sensationalist title to a good news story: there was no intent-to-treat effect, but a low conversion rate. The ETT is a pretty impressive 50% mortality reduction. Also study is limited by a short 10 year follow-up period but there will be a 15 year follow up.

So I'll still get my colonoscopy, thanks

JamesBarney · on Oct 10, 2022

When everyone invited for a colonoscopy is compared to the control group 42% showed up and got a colonoscopy 18% fewer people got colon cancer same number of people died of colon cancer

When everyone who got a colonoscopy is compared to the control group 30% fewer got colon cancer 50% fewer people died of colon cancer

This data makes me think the mortality reduction benefit is bogus, but the cancer prevention benefit is real, and probably greater than 18%, maybe closer to 36-40%. If the colon cancer mortality benefit was real you'd see some reduction in the intention to treat group, and it'd be smaller than the cancer prevention effect. (The most aggressive cancers tend to be harder cancers to catch in time because they most so quickly, so most cancer screening tests will prevent more cancers than deaths)

rippercushions · on Oct 10, 2022

Where are you getting "50% mortality reduction" from? The article says the reduction was zero:

> After 10 years, the researchers found that the participants who were invited to colonoscopy had an 18% reduction in colon cancer risk but were no less likely to die from colon cancer than those who were never invited to screening.

tylermenezes · on Oct 10, 2022

> When the investigators compared just the 42% of participants in the invited group who actually showed up for a colonoscopy to the control group, they saw about a 30% reduction in colon cancer risk and a 50% reduction in colon cancer death

Someone · on Oct 10, 2022

I don’t see how that says much about the usefulness of colonoscopy.

The people in the treatment group who didn’t show up must have had a 36% increase in colon cancer death (not explicitly stated in the article, but can be derived from the numbers. You need 42% × 50% + 58% × ? = 1), and (solving 42% × 70% + 58% × ? = 1) a 21% increase in colon cancer risk.

Something must have made them different from the control group. Maybe, they didn’t show up because they already were being treated for colon cancer?

gwd · on Oct 10, 2022

Right, so to properly isolate the effect of the colonoscopy on mortality, we need to do a randomized trial with just people like you -- people who would definitely get a colonoscopy if invited -- to see what the effect is. Because you're also more likely to act on early warning signs and get a cancer diagnosis in time to treat the cancer rather than die.

robertlagrant · on Oct 10, 2022

> Because you're also more likely to act on early warning signs and get a cancer diagnosis in time to treat the cancer rather than die.

This seems to imply we should discount all treatments because people who choose to get treatment are more likely to get better, coincidentally by the same amount as the treatment's efficacy.

gwd · on Oct 10, 2022

That's not what it means at all.

The question here is a question that the medical establishment is trying to answer: namely, "What can we, as the medical establishment do, to reduce deaths from colon cancer?" For a long time, the answer has been, "Invite people to take colonscopies". What the data here appears to show is that the action, "Invite people to take colonoscopies" doesn't actually reduce deaths from colon cancer. If the medical establishment wants to actually reduce deaths from colon cancer, they'll need to figure out something else.

I guess I do agree that the headline is likely to be counterproductive. What the data might show is that the most effective thing you as an individual can do is to be the kind of person who takes colonoscopies when invited. The unfortunate effect it might have is to make more people into the kind of people who don't take colonoscopies when invited.

robertlagrant · on Oct 10, 2022

> I guess I do agree that the headline is likely to be counterproductive. What the data might show is that the most effective thing you as an individual can do is to be the kind of person who takes colonoscopies when invited. The unfortunate effect it might have is to make more people into the kind of people who don't take colonoscopies when invited.

Agreed. Except the most effective thing you can do as an individual is have a colonoscopy, not be the sort of person who would have one :-)

gwd · on Oct 10, 2022

> Except the most effective thing you can do as an individual is have a colonoscopy, not be the sort of person who would have one :-)

So we have two hypotheses here:

1. Having the colonoscopy is the thing that reduces deaths from colon cancer

2. Having the colonoscopy correlates to some other factor, X; and it's actually X which reduces the deaths from colon cancer.

X, for instance, could be a high willingness / ability to see the doctor when you experience early symptoms of colon cancer. That is, the more willing you are to go to the doctor when you start to have early symptoms of colon cancer, the more likely you are to survive it; and the more willing/able you are to go to the doctor when you have early symptoms of colon cancer, the more willing/able you are to have a colonoscopy.

What evidence do you have to believe that #1 is true, rather than #2?

Because if #1 is the case, the medical system should push hard on colonoscopies. But if #2 is the case, pushing colonoscopies might be a red herring. In fact, it might be counterproductive -- I've heard that colonoscopies are unpleasant; if you pressure people who don't like doctors into having a colonoscopy, and they have a terrible experience, then when they experience early symptoms of colon cancer, they may be more likely to procrastinate to avoid having another one. Rather, if #2 is the case, the medical system should try find out what can be done to make people more willing / able to get early medical care.

vidarh · on Oct 10, 2022

No, but you need to account for it when considering efficacy, because people who choose to follow through on treatments may also be taking other steps that may also improve their outcomes, and it can be hard to tease out because the mere act of telling them they should have a screening might change other behaviours even without being prompted by doctors in a way that's recorded.

E.g. it's a reasonable hypothesis that patients who are more motivated to show up might also be more motivated to look up possible causes and what other steps they can take to improve their chances.

In other words, it's reasonable to expect people who comply to potentially get better at a higher rate than the efficacy of a single treatment, and teasing out how much of this effect is due to the intervention itself and how much is due to changed behaviour due to the referral is hard.