
Unlearning is described as: "process aims to erase specific knowledge from LLMs while preserving as much model utility as possible."

i.e. We know that our model would be useless without your source. So we will take the useful part of your source and obfuscate the rest so that we can charge our users for utility provided by you without having to pay you anything.




> We know that our model would be useless without your source. So we will take the useful part of your source and obfuscate the rest so that we can charge our users for utility provided by you without having to pay you anything.

Isn't this basically the entirety of the latest AI craze? They basically took a public good - the information available on the Internet - and hid behind some thin veneer of "we are not stealing, we just trained an AI on the information" and then they sell it. Note, I'm intentionally not writing "free information available on the Internet", because information is not free. Someone has to pay (in time or money) to generate it and host it. They might have provided it gratis to the public, but nobody asked them if an AI can come along, harvest it all and regurgitate it without a hint of reference to the original source.

Much of that information is not even free in the monetary sense, it is supported by ads. The AI will not only not click through the ads, it won't even generate repeat traffic as once the information is harvested, there's no need to access the source anymore.

If you really think about it, it's a brilliant business model. It's a perfect theft, where the affected group is too diffuse and uncoordinated, it's extremely difficult to prove anything anyway, and the "thieves" are flush with investment capital so they sleep well at night.

LLMs undoubtedly have great utility as a research tool and I'm not at all against them. I think they (or a model similar in objectives) are the next step in accessing the knowledge humanity has amassed. However, there's a distinct danger that they will simply suck their sources dry and leave the internet itself even more of a wasteland than it has already become. I have no illusions: AI companies will simply regress to the lowest-cost solution of not giving anything back to whoever created the information in the first place. The fact that they are cutting off the branch that they are sitting on is irrelevant for them, because the current crop of owners will be long gone with their billions by the time the branch snaps.


From my probably naive perspective, there seem to be at least two major sources of value that generative AI provides:

1. Understanding the world, for example by creating a statistical model of entire languages, as languages are already a model of reality.

2. Recapitulating (stealing) specific instances of information in ways that people often don't find acceptable. Grabbing a news article without permission, and providing that to your paying users without paying for the work. Recreating trademarked characters or the style of a particular living artist, without compensation. Deepfake porn.

The first seems generally valuable to society as a whole and a morally (IANAL) legitimate creative transformation, even of copyrighted work.

The second use seems exactly as you describe.

Societies could navigate this by encouraging and promoting the first use, and criminalizing or removing the ability to be paid from the second.

Of course, what is happening is that groups of economic interests will use their resources and clout to advocate for both, or against both.


I agree for the most part that 2 is what most people find unacceptable, not 1.

The problem is that, like any general intelligence (e.g. humans), any sufficiently generalized model capable of 1 will also necessarily be capable of 2, regardless of whether it's trained on copyrighted material or not. How do you make an AI model that's capable of summarizing Wikipedia articles but not news articles? Or that's capable of generating consistent images of my original character from a reference photo but not images of Mickey Mouse from the same? This is achievable only by restricting software freedom; by taking measures to prevent users from "running the program as they wish" and from "studying the source code and making changes".


I'll note that the way we have typically enforced restrictions on the behavior of general intelligences in the past (before AI) is to pass laws and enforce punishments if the laws are broken. Not to try to somehow take away people's ability to break the law in the first place, because that would require unacceptably onerous restrictions on human freedom.

I think the same principle applies to AI. Trying to make it impossible for people to use AI to break the law is a lost cause, only achievable by unacceptably onerous restrictions on human freedom. Instead, we should do what we've always done: make certain actions illegal and punish those who do them anyway in violation of the law. Maybe new laws might be required for that in some cases (e.g. deepfake porn) but for the most part I think the laws we already have on the books are sufficient, maybe with minor tweaks.


That all sounds great until you're dealing with deepfakes that come from a country without an extradition treaty?


Not really that different from other forms of illegal content coming from countries without an extradition treaty. (Piracy, scam calls, CP, etc.) Trying to stop it by imposing onerous restrictions on your own citizens isn't likely to be effective.


Imagine consultants had to cite sources and pay out every time they referenced knowledge gained from reading a research paper or working at a former employer.

I can understand the need to prevent verbatim copying of data. But that is a problem solved on the output side of LLMs, not on the data input for training.

It is completely legal for someone to pay me to summarize the news for them every morning. I can't help but feel that knee-jerk regulation is going to be ultimately bad for everyone.
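To make the "output side" idea above concrete, here is a minimal sketch of one possible check: measure how many word n-grams of a generated text also appear verbatim in a known source. The function names, the 5-word window, and the 0.5 threshold are all assumptions chosen for illustration, not a description of how any LLM vendor actually filters output.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, source: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the source."""
    output_grams = word_ngrams(output, n)
    if not output_grams:
        # Output shorter than n words: nothing to compare.
        return 0.0
    shared = output_grams & word_ngrams(source, n)
    return len(shared) / len(output_grams)


def looks_verbatim(output: str, source: str, n: int = 5,
                   threshold: float = 0.5) -> bool:
    """Hypothetical policy: flag output that mostly repeats the source."""
    return verbatim_overlap(output, source, n) >= threshold


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    copied = "the quick brown fox jumps over the lazy dog near the river bank"
    paraphrase = "a fast fox leaped over a sleepy dog by the water"
    print(looks_verbatim(copied, source))
    print(looks_verbatim(paraphrase, source))
```

A check like this only catches near-exact copying against sources you already hold; it says nothing about paraphrase, style imitation, or attribution.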


I think, at one point in time, it was also completely legal to break into computer networks because there were no laws against it.


I would summarize your points as:

We need to create a whole new body of law for enforcing copyright protections in the age of AI.

Does the AI adequately attribute its sources? Does it paraphrase in acceptable ways or just repeat large swathes of text from its corpus with minimal changes?

The laws should force any LLMs not yet capable of complying with these requirements off the Internet until they can comply.


I hear what you're saying, and I'm not saying some of it doesn't have merit. The following is meant as an open philosophical discussion.

On the topic of 'the information isn't free' I'm curious if you have the same opinion of encyclopedia companies. You must admit there are at least some parallels in that they also consolidate a large amount of information that was 'generated' by others.

Or how about the information you and I have gained from books and the internet? Sure we might 'pay' for it once by buying a book or seeing some ad, but then we might use that information to make thousands of dollars through employment without ever going back to buy another copy of that book. An even more 'egregious' example could be teachers. They're literally taking the knowledge of others, 'regurgitating' it to our children for money, and 'not giving anything back to whoever created the information in the first place'.

> there's a distinct danger that they will simply suck their sources dry and leave the internet itself even more of a wasteland than it has already become

Maybe. There's the whole AGI/ASI argument here in that they/we might not _need_ humans to create information in the same way we don't need human-calculators any more.

Barring that, though, I do hear what you're saying about a lowering of the value of creating 'new internet information'. Personally I can't see it affecting my internet use that much, as there are basically two categories my internet information gathering falls into:

1. I want to know something, give me the short quick answer. This category is already full of sites that are just trying to hack the search algos to show their version of copy-pasted info. I don't really care which I go to and if AI kills their business, oh well.

2. I want to follow a personality. This category is where I have bloggers/youtubers/etc in RSS feeds and the like. I want to hear what they're saying because I find them and the topics interesting. I can't see this being replaced by AI any time soon.


> Or how about the information you and I have gained from books and the internet? Sure we might 'pay' for it once by buying a book

We've never as a society needed such a concept before, but publishing a book has always come with the implicit license that people who buy the book are allowed to both read the book and learn from the knowledge inside. Authors didn't write books about facts they didn't want people to learn.

But we now have a new situation where authors who never needed to specify this in a terms-of-use are realizing that they want to allow humans to learn from their work, but not machines. Since this hasn't ever been necessary before it's a huge grey area, and ML companies are riding around claiming they have license to learn to reproduce art styles just like any human would, ignoring whether the artist would have allowed one but not the other if given the chance to specify.

It's not that different from when photocopiers and tape recorder technology made it easy to copy documents or music, say from the radio, and we needed to grapple with the idea that broadcasting music might come with license to make personal recordings but not allow someone to replay those recordings for commercial use. It wasn't a concept that was necessary to have.

Now with AI, the copy is not exact, but neither was it with a tape recorder.


You raise some great points and I agree that we are on tricky ideological ground. I'll try to provide sensible counter-arguments to your encyclopaedia and teacher examples, and hopefully not fall into strawmen (please do object if I do):

1. First there's the motivation or intent. Teachers want to earn a living, but their purpose in some sense and (hopefully) their main intent is that of education. I argue that teachers should be paid handsomely, but I also argue that their motivation is rarely to maximize profits. This is contrary to the bog standard Silicon Valley AI company, who are clearly showing that they have zero scruples about breaking past promises for those sweet dollar signs.

2. My second point actually builds a bit on the first: both encyclopaedias and teachers tend to quote the source and they want their audience to expand their research horizon and reach for other sources. They don't just regurgitate information, they'll tend to show the reader where they got the information from and where to go for more and neither the teachers nor the books mind if the audience reaches for other teachers and books. LLMs and generative models are/will be/have been capable of this I'm sure, but it is not in their creators' interest to enhance or market this capability. The more the users are walled in, the better. They want a captive audience who only stays in the world of one AI model provider.

3. Scale. Never before has the reuse (I'm trying to avoid using the word theft) of content produced by others been conducted on such an industrial scale. The entire business model of LLMs and generative models has been to take information created by masses of humans and reproduce it. They seem to have zero qualms taking all the work of professional and amateur artists and feeding it into a statistical model that trivializes replication and reproduction. You could argue that humans do this as well, but I feel scale matters here. The same way that a kitchen knife can be used to murder someone, but with a machinegun you can mow down masses of people. Please excuse the morbid example, but I'm trying to drive a point: if we make a certain thing extremely easy, people will do it, and likely do it on a mass scale. You could argue that this is progress, but is all progress inherently beneficial?

There's value in these models, so we should use them. But I feel we are rapidly hurtling towards a walled garden corporate dystopia in so many areas of our society. Industries which tended to have negative impact on our lives (waste, tobacco, alcohol, drugs) have become heavily regulated and we have paid for these regulations in blood. Will we have to pay the same blood price for the harmful industries of the new age?


Interesting counter-points. Thank you for taking the time to post them.

I don't think I have anything useful to add without giving the issue more thought. Your reply definitely adds new dimensions for me to think about.


Humans do the same thing. Typically in a more narrowed fashion, they read and study and learn from a variety of sources, many of which are not "free" and they become experts on a subject. They can then sell that expertise to others willing to pay for it.

LLMs just do this on a bigger scale, and not as well.


I agree, but that doesn't make it good - or perhaps even acceptable. To quote myself answering another commenter:

> Never before has the reuse (I'm trying to avoid using the word theft) of content produced by others been conducted on such an industrial scale. The entire business model of LLMs and generative models has been to take information created by masses of humans and reproduce it. They seem to have zero qualms taking all the work of professional and amateur artists and feeding it into a statistical model that trivializes replication and reproduction. You could argue that humans do this as well, but I feel scale matters here. The same way that a kitchen knife can be used to murder someone, but with a machinegun you can mow down masses of people. Please excuse the morbid example, but I'm trying to drive a point: if we make a certain thing extremely easy, people will do it, and likely do it on a mass scale. You could argue that this is progress, but is all progress inherently beneficial?


I agree that scale changes the nature of what's going on, but I'm not sure it follows that the scaled-up variant is bad. I think models like GPT3 and Sonnet which are intended for "general purpose intelligence" are fine. Same with Copilot and Phind for coding. They contain copyrighted knowledge but not by necessity, and their purpose is not to reproduce copyrighted materials.

Training a diffusion model on a specific artist's work with the intent to reproduce their style I think obviously lives on the side of wrong. While it's true a human could do the same thing, there is a much stronger case that the model itself is a derivative work.

I think the courts will be able to identify cases where models are "laundering copyright" as separate from cases where copyrighted material is being used to accomplish a secondary goal like image editing. Taking a step back, this is in some way what copyright is for: you get protections on your work in exchange for making it part of the public body of knowledge to be used for things you might not have intended.


You raise very good points, and I agree that scale is not necessarily bad, in fact it can be a source of much good. Scale simply increases the frequency and thus likelihood of things, whether good or bad.

I'm sure that big players will be able to assert their rights with their armies of lawyers. Just like the music and movie industries have after the rise of file sharing.

My worry is perhaps more subtle: I'd argue that generative AI draws much more from the masses of small content creators and they will not be able to assert their rights. In some sense, if people pirate the next blockbuster movie, the producers might only make 1 billion instead of 1.1 (and piracy has never been proven to actually impact sales), but if all content starts being consumed via massive and centralized anonymizers, the masses of people who made the internet what it is will eventually disappear. So the scale on which these tools can hoover up information and reproduce it is unprecedented, as is the fact that nobody important seems to be thinking about how to make sure that we can keep actual humans motivated to keep generating the content that actually feeds the AI.

It's one of those cursed things: the long term interest of the AI companies is that humans will keep feeding the beast with more information, but the short term interest is to capture their audience and do everything possible to keep them inside the walled garden of a single AI provider. It is not in the companies' interest for people to step outside and go straight to the painter/writer/moviemaker, because at that point the AI is no longer needed.


> They basically took a public good ... and then they sell it

I think what they sell is more fairly characterized as "hosted inference to a big pretrained model" with perhaps also some optimism that their stuff will improve in the background. The only substantial moat these companies have is their ability to pay for the compute to train contemporary generative models. The public good remains a public good for all to profit from, small-scale or large.

> Someone has to pay ... but nobody asked them if an AI can come along, harvest it all and regurgitate it without a hint of reference to the original source.

Practically speaking, we don't actually need to centralize content to pay for hosting it. People just do it because it makes money. The price of time required to create some work distributed among viewers feels like a vague philosophical argument to me, especially when those works are merely being dispassionately observed by math objects. Currently the price appears to be "whatever I feel morally obliged to and/or can get away with".

> It's a perfect theft

...if it is legally theft to begin with, and not simply fair use. To me the current methods of training e.g. LLMs feel inherently transformative, like a massive partial hash of the internet that you can query. Even if it is ruled as theft in the future, large AI companies will only be further advantaged as they're presently buying off the people that will actually be able to sue them.


I think it's fair and reasonable to assume that the AI companies will at some point start licensing their source content. Through gov/legal oversight or not remains to be seen, but OpenAI are already beginning to do so:

https://searchengineland.com/openais-growing-list-of-partner...


Google has been using unlicensed source content for its search snippets for 20 years, and it seems to be doing fine with it (with the exception of a few news publishers).


The idea with internet search was to get people to find the source of the information they were searching for. As a matter of fact, a lot of information indexing was requested at the source. Google did respect the bargain for a while, until they started to obfuscate getting to the source with AMP and their info snippets directly in the search, bypassing redirecting to the source. Then they started not displaying all that info at all, not even on the nth page of search results. The broth has been getting sour for a while now. Some people never wanted crawlers indexing, and there were numerous discussions about how those robots.txt files were ignored.

So what I see here is a historical trend of broken bargains, which is more or less digital theft.
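For readers unfamiliar with the robots.txt side of that bargain, here is a minimal sketch of what honoring it looks like for a well-behaved crawler, using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the example.com URLs, and the rules themselves are made up for illustration; nothing here describes how any particular search or AI crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is kept out of /private/,
# every other crawler is allowed everywhere.
EXAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/private/page"))
    print(allowed(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/public/page"))
    print(allowed(EXAMPLE_ROBOTS, "OtherBot", "https://example.com/private/page"))
```

The point of the sketch is that the check is entirely voluntary: the file expresses the site owner's wishes, and it is up to the crawler to call something like `allowed()` before fetching.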


Thanks for the link, I appreciate it. I suppose the issue is that this just further enshittifies the internet into a small handful of walled gardens. Big players get their payday, because they could feasibly sue OpenAI and cause them enough headaches. But the vast amount of content on the internet was not built by a small handful of media companies, but rather by masses of small creators. It is their work that OpenAI is profiting from and I have yet to see a credible suggestion on how they will compensate them.


The likely and rather sad outcome of all this is that small creators stop publishing, because what is the point if they think their work is going to be regurgitated by some AI for $20/month?


I thought it was even worse than that: learning any of the corpus verbatim would actually reduce model utility.


Yes, although how close to verbatim is debatable. For example, there are questions you'd ask that other people have asked many times before, and for which you'd like the exact answer (e.g. when does daylight saving time end?).


> that make it forget specific facts. These are often meant to satisfy copyright claims

Facts are not copyrightable.

To quote copyright.gov: “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.”


What is a fact without expression? It's not clear what interpretation would be necessary for the quoted statement to make sense.


Wouldn't that be stuffed in the prompt anyways? No reason for the LLM to learn that.


It really depends on which part of the corpus, though. I do expect my LM to be able to reproduce culturally important citations, for example.


That's not entirely true. Retraining is very expensive. If you can train on a very large dataset including proprietary knowledge and then cheaply postprocess the model to forget things, you save retraining for every variation.



