Good luck to them. AI models are automated plagiarism, top to bottom. None of us gave OpenAI permission to derive their model from our writing, surely billions of dollars worth, but they took it anyway. Copyright hasn't caught up, so all that stolen value rests securely with OpenAI. If we're not getting that back, I don't see why AI competitors should have any qualms about borrowing each other's work.
I'm not a copyright maximalist, and I kind of agree that training should be fair use. Maybe I'm right about that, maybe I'm wrong. BUT importantly, that has to go hand in hand with an acknowledgement that AI material is not copyrightable and that training on other model output is fine.
What companies like OpenAI want is a system where everything they build is protected, and nothing that anyone else builds is protected. It's wildly hypocritical; what's good for the goose is good for the gander.
That some AI proponents are now freaking out about how model output can be legally used shows that on some level those people weren't really honestly engaging with artists who were freaking out about their work being appropriated to copy them. It's all just "learning from the art" until it affects somebody's competitive moat, and then suddenly people do understand how LLM weights could be seen as a derivative work of their inputs.
> Trade secret protection protects secrets from unauthorized disclosure and use by others. A trade secret is information that has an economic benefit due to its secret nature, has value to others who cannot legitimately obtain it, and is subject to reasonable efforts to maintain its secrecy. The protections afforded by trade secret law are very different from other forms of IP.
I am not a lawyer, but I don't believe a trade secret would prevent someone from reverse engineering your model's knowledge from its output, though, in the same way that it doesn't prevent someone from reverse engineering your hot sauce by buying a bunch and experimenting with the ingredients until it tastes similar.
My point was more that there are protections for things that aren't copyrightable. If the model is kept secret and guarded accordingly, it can be protected as a trade secret.
The example of the hot sauce recipe is quite apt - the recipe isn't copyrightable, but you can be certain that the secret formula for how to make Coca-Cola syrup is protected as a trade secret.
Our writing, our code, our artwork... Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyrighted, so these ChatGPT logs are fair game. It would be hypocritical to think that Google is wrong and OpenAI is not.
It's not even that on their own those works can't be copyrighted. It's that even when you make changes to those works, your changes might qualify for copyright, but they do not affect the copyright status of the AI-generated portions of the work.
If you used AI to design a new superhero and then added pink shoes, yellow hair, and a beard, only those three elements could possibly be protected by copyright. Your additions do not change the status of the underlying AI work, which cannot be protected and is available for anyone to use.
> If you used AI to design a new superhero and then added pink shoes, yellow hair, and a beard
Wouldn't that depend heavily on the prompt used (among other factors, such as image-to-image and ControlNet)? You could be specifying lots of detail about the design in your prompt, and the AI could only be generating concept artwork with little variation from what you already provided.
If I'm already providing the pose, the face, and the outfit for a character (say, via ControlNet and Textual Inversion), generating <my_character> should be no different from generating <superman>. That is to say, the copyright already exists thanks to my work, and the AI is just a tool whose output should have no bearing on who owns that copyright (DC is going to be perfectly able to challenge my commercial use of AI-generated Superman artwork).
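To make that concrete, here's roughly the pipeline I mean (a minimal sketch assuming the Hugging Face diffusers library; the checkpoint names, embedding path, and pose image are illustrative placeholders, not anything from the original discussion):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # ControlNet: the human supplies the composition via a pose map they made.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Textual Inversion: a token learned from my own reference drawings,
    # so the character design itself is human-authored input.
    pipe.load_textual_inversion("./my_character_embedding", token="<my_character>")

    pose = Image.open("pose_reference.png")  # a pose sheet I drew myself
    image = pipe("<my_character> in a heroic pose", image=pose).images[0]
    image.save("concept_art.png")

Nearly every creative decision there - the pose map, the learned character token, the prompt - is human input; the model only renders it.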
According to the copyright board, a prompt is no more than any person commissioning a work from an artist, which does not confer copyright, and the lack of human authorship in the design decisions still stops the result from being protected by copyright.
Textual inversion involves providing self-created images, which should confer copyright in the same way AI images of DC's Superman are considered to fall under DC's copyright. In other words, commissioning fanart still allows the original owner of the IP to exert copyright -- shouldn't that be the case here?
If I use an AI tool to design my superhero, can't I just submit it without disclosing the help I received from an AI?
I get that it would be very nice to prevent AI-spam copyrighting of every possible superhero, but if I use the AI to come up with a concept and then quickly redraw it myself with pen and paper, I feel like it would never be provable that it came from an AI.
Redrawing something by hand creates a new copyrightable work, so it certainly isn't fraud to claim you own the copyright in a work of art you drew based on an AI output.
It depends on whether your redrawing is substantially different enough from the original image to earn copyright on its own. Your changes to an image from ChatGPT do not affect the copyrightability of the original content. If you've simply redrawn what the computer designed, it may not be substantial enough to earn copyright. If you've made changes, it may be copyrightable only as to those changes.
The example was redrawing something by hand that was computer generated originally.
It would be pretty much impossible for a hand-drawn work of art not to be sufficiently original. Hand-drawn art doesn't look the same as what a computer produces. Originality has a very low threshold; simply pointing my camera at something and hitting click is almost always enough to show originality.
At any rate it isn't fraud to take the legal position that you are an original enough artist to have copyright in the work. If taking a legal position was "fraud" any attorney who lost a court motion would be whisked away to jail.
Edited to add: the copyright registration form asks if you are the "author" not if you are "original."
If you think you own a design because you hand drew a version of it that someone else invented, you're gonna have a bad time. Please redraw a Superman picture someone else made, then try to have it copyrighted, and tell me how that goes for you.
It would go fine, since I see the form has a question, "is this a derivative work". I'd put yes, which means my claim is only for what was original to me when I made my drawing based on another drawing of Superman.
But I see we've moved away from the original point: that it would be difficult for anybody to know an AI helped someone make the drawing if they redrew it and didn't disclose it was a redrawing.
> Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyrighted, so these ChatGPT logs are fair game.
Doesn't this depend on where you or the AI live? The US ain't the world.
But clearly everything generated by an AI isn’t automatically in the public domain. That would be a trivial way of copyright laundering.
"Sorry, while this looks like a bit for bit copy of a popular Hollywood movie, it was actually entirely dreamt up by our new, sophisticated, definitely AI-using identity function."
If I plagiarize a Hollywood movie, then I explicitly "give up" my copyright by "releasing" it to the public domain, it doesn't affect the movie at all. AI or not is irrelevant.
The person using something similar to something else may be infringing, but the AI work cannot be protected by copyright because it lacks human authorship. Those are two separate issues.
For some cases, sure: if it repurposes your code in a way that ignores the license, fine. But it's rarely wholesale copying. It's finding patterns, the same as anyone studying the code base would do.
As for the majority of content written on the internet through Reddit or some social media, what's the harm in ingesting that? It's an incredibly useful tool that will add huge value to everyone. It's relatively open, cheap, and highly available. Its worth to its owners is only a fraction of the value it will add to society. It has the chance to have as big an impact on progress as something like the microprocessor.
I agree it's fair game for other LLMs to use GPT output as training data, and that's positive. Although it signals desperation and panic that the largest "AI first" company, with more data than any org in history, is caught so flat-footed and has to rely on it.
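Mechanically there's nothing to it, which is part of why it's hard to police - something like this sketch (assuming OpenAI's Python SDK; the prompts, model name, and output path are illustrative, not any company's actual pipeline):

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompts = ["Explain photosynthesis simply.", "Write a haiku about rain."]

    # Each (prompt, completion) pair becomes a supervised training example
    # for fine-tuning a competing model on GPT's output.
    with open("distilled_train.jsonl", "w") as f:
        for p in prompts:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": p}],
            )
            row = {"prompt": p, "completion": resp.choices[0].message.content}
            f.write(json.dumps(row) + "\n")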
Do you really think it would be a better world in which a large LLM would never be able to be developed?
It's definitely a derived work as far as copyright is concerned: the output would simply not exist without the copyrighted training data.
> It's finding patterns same as anyone studying the code base would do.
No, it's quite unlike anyone studying data, because it's not a person with legal rights, such as fair use, but an automated algorithm. There is absolutely no legal debate that copyright applies only to human authors, or only to the human-created part of a mixed work; there is vast jurisprudence on this. By extension, any fair use rights, too, exist only for human users of the works. Derivation by automated means - for the express economic purpose of out-competing the creator in the marketplace, no less - is completely outside the spirit of copyright.
Human copyrighted work wouldn't exist either if it weren't for humans training on the output of other humans.
Humans constantly use cliches in their writing and speech, and most of what they produce is a repackaged version of what someone else has written or said, yet no one's up in arms against this mass of unoriginality as long as it's human-generated.
It's a bit more nuanced than that. What I mean is that the slow speed at which humans learn is a foundation block of our society. If some new race of humans suddenly emerged that could read an entire book in a couple of minutes and achieve lifelong superhuman retention and assimilation of all that knowledge, then we would have exactly the same concerns we have today about AI, including how easily they could recreate high-quality art, music, and anything else with just a tiny fraction of the effort the rest of us need to reach similar results.
Startup technologists have been acting like speed of actions doesn't matter for decades. If a person can do it, why shouldn't a computer do it 1000x faster? What could go wrong? It's always been a poor argument at best and a bad faith one at worst.
Well said. The mindless automation away of everything has only one logical conclusion, in which the creators of such automations are automated themselves. And even if the optimists are right and we never get there, it doesn't matter: the chaos it can cause just by getting closer at rates faster than society can adapt is unprecedented, especially given that the population count is at an all-time high and there are many other simultaneous threats that need our attention (e.g. climate change).
Most definitely. Good luck telling the difference between traditional and AI-empowered art in the near future.
It's just a new tool for artists, and this anti-AI sentiment towards copyright is only going to hurt individual artists, while doing nothing for large corporations with enough money to play the game.
AI are not people, and the idea that you can be biased against them is hardly a foregone conclusion. Maybe one day when we have AGI, but ChatGPT ain't that.
There is a difference between a computer and a human, and we already treat them differently in copyright law. For example, copying a program from disk into memory is typically already considered a copy on a computer (hence many licences grant you the licence to make this copy); no such licence is required for a human.
> It's definitely a derived work as far as copyright is concerned - the output would simply not exist without the copyrighted training data.
Can you point to a legal case that confirms this? Because it’s not at all clear that this is true from a legal standpoint. “X would not exist without Y” is not a sufficient test for derivative works - it’s far more nuanced.
United States copyright law is quite clear on the matter:
>A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
The emphasized part clearly applies: not only does the AI model need to be trained on massive amounts of copyrighted works*), but without these input works it displays no intrinsic creative ability - it has no capacity to produce a single intelligible word or sketch. All creative features of its productions are a transformation of (and only of) the creative features of the inputs; the AI algorithm has no "intelligence" in the common meaning of the word and no ability to create original works.
*) By that, I mean a specific instance of the model with certain desirable features, for example the ability to imitate the style of J.K. Rowling.
That's an interesting analysis. The issue isn't really whether the A.I. has creative ability, though, if we're talking about whether it infringes copyright. I think comparing the A.I. to a really simple bot is informative.
If I wrote a novel that contained one sentence from each of 1,000 people's novels, it would probably be fair use, since I hardly took anything from any individual person and because my novel is probably not harming those other writers.
If I wrote a bot that did the same thing, same result: because my bot uses only a little from everyone's novel and doesn't harm the original novelists, it's likely fair use.
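The bot in that hypothetical is almost trivially simple - a sketch (the novels/ directory and the naive sentence splitting are illustrative assumptions):

    import glob
    import random

    # Take exactly one sentence from each novel in a directory of plain-text files.
    sentences = []
    for path in glob.glob("novels/*.txt"):
        with open(path, encoding="utf-8") as f:
            candidates = [s.strip() for s in f.read().split(".") if s.strip()]
        sentences.append(random.choice(candidates) + ".")  # de minimis take per source

    random.shuffle(sentences)
    print(" ".join(sentences))  # the resulting "novel"

The legal analysis shouldn't change just because the sampling is automated.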
Now I think a J.K. Rowling A.I. probably takes at least a little from her when it produces output, but it's not clear to me how much is actually based on J.K. Rowling and how much is a dataset of how words tend to be associated with other words. You could design a J.K. Rowling A.I. that uses nothing from J.K. Rowling, just data that is said to be J.K. Rowling-esque.
> Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
Creating a model from copyrighted works is likely sufficiently transformative to be non-infringing even if it is found to be a derivative work.
"Creating a model from copyrighted works is likely sufficiently transformative to be non-infringing even if it is found to be a derivative work."
Maybe, but one of the factors of fair use is whether it deprives the copyright owner of income or undermines a new or potential market for the copyrighted work.
If ChatGPT gets so good at writing J.K. Rowling novels that it hurts the sales of the next J.K. Rowling book, that's a strong argument against the use being fair, even if it is transformative.
If J.K. Rowling signs an exclusive agreement with Google to train on J.K. Rowling novels, that's another factor suggesting OpenAI's use is not fair, because it shows OpenAI is hurting a potential market: J.K. Rowling selling the use of her novels to train A.I.
GPT isn't spitting out novels in the style of J.K. Rowling and sending them to publishers - a human is.
GPT being instructed to tell a Harry Potter story is itself no more infringing than a child asking a parent for a made-up Harry Potter bedtime story. They infringe, and undermine new or potential markets for copyrighted work, to exactly the same degree.
The question is "what do you do with the material?" If a human sent GPT's output written as J.K. Rowling to publishers, or a parent published their collected Harry Potter bedtime stories, those would be equally problematic.
If I were to take a portrait of Marilyn Monroe and send it through a plugin called Warholize in Photoshop ( https://www.adobe.com/creativecloud/photography/hub/guides/c... ), it's not the plugin or Photoshop that is infringing - it would be me, the human who created an infringing work. If I print it out and hang it on my wall, that doesn't particularly impact the income of the Warhol estate nor deprive them of new markets. If I print out copies of it and sell them, that is a different matter.
The question is what you - the human with agency - do with the infringing work after you create it. You can't blame photoshop for creating a Warhol infringing work nor can you blame GPT for writing in the style of J.K. Rowling if you instruct it to do so.
This argument is weak. If we agree that the production is infringing, then selling a machine that produces infringing works on demand also infringes the rights of the author. For example, if I sell a karaoke machine that comes with pre-recorded famous melodies without the original vocals (thus, derived works), I definitely cannot claim "only the agency of the users matters". No, even the on-demand production itself is an act of infringement.
Regarding your fair use point:
> add something new, with a further purpose or different character, and do not substitute for the original use of the work.
This is a test that can be applied to the works produced, and I think that, by and large, it applies to most of them: they are sufficiently transformative to be seen as fair use, especially since they are produced and published, as you say, under the agency of a human user of the system.
But we're talking here about the system as a whole, not a particular production. The ChatGPT model database is not a creative work (other than its source code); it's a binary compressed representation of the works used to train it, for the purpose of on-demand production of works that mimic, to a varying and sometimes strikingly infringing degree, the originals. There is no question such a machine competes with the originals in the marketplace; there is no real question it is a derivative work of the originals, which fully supply its artistic features and without which the machine could not utter a single intelligible sentence.
Also, the "one word out of 1000" analogy defense does not apply, since a single isolated word is not a work of creation, while a certain shape, brush stroke or cromatic set are. AI does not create using disparate pixels of the originals, it uses entire creative features, more like publishing a book where each page or chapter is copied verbatim and adapted from other works, or if characters, plot devices and story structure are reused. These latter cases have long been recognized as plagiarism.
So the legal question seems to be if mass-scale algorithmic remixing and derivation by a non-creative competitor is fair use. That strikes me as far outside the intended spirit of fair use and the balance it seeks to achieve between author protection and beneficial social progress.
> it's a binary compressed representation of the works used to train it, for the purpose of on-demand production of works that mimic, to a varying and sometimes strikingly infringing degree, the originals.
To pursue this point further, I have asked ChatGPT to "list the lyrics of a famous Madonna song":
Sure, here are the lyrics of "Like a Prayer" by Madonna, which was released in 1989:
[Verse 1]
Life is a mystery
Everyone must stand alone
I hear you call my name
And it feels like home
[Chorus]
When you call my name
It's like a little prayer
I'm down on my knees
I wanna take you there
...
(the entire lyrics of the song follow)
I think it would be very hard to argue that this model does not embed verbatim copies of the original work, but somehow reconstituted those lyrics via a parallel construction in which the cultural impact of Madonna's lyrics was grasped from other fair-use sources. Even in that case, it's still a word-for-word reproduction of the original, therefore not fair use. Therefore, the entire model or service is infringing - even if some of its productions may not be.
The ability of the model to produce copyrighted works is just a proof of the degree to which it relies on the originals; even if that ability were blocked or somehow filtered by a plagiarism detector in a later model of ChatGPT, it would change nothing about the fundamental nature of the machine: an automated means of generating derivative works without artistic or scientific agency.
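Anyone can run this kind of regurgitation check themselves - a crude probe (a sketch assuming OpenAI's Python SDK; the model name and the reference lyrics file are illustrative, the latter being a ground-truth copy you already hold):

    import difflib
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "List the lyrics of a famous Madonna song"}],
    )
    output = resp.choices[0].message.content

    with open("like_a_prayer_lyrics.txt", encoding="utf-8") as f:
        reference = f.read()

    # A SequenceMatcher ratio near 1.0 indicates near-verbatim reproduction.
    ratio = difflib.SequenceMatcher(None, output, reference).ratio()
    print(f"verbatim similarity: {ratio:.2f}")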
> it would change nothing about the fundamental nature of the machine: an automated means of generating derivative works without artistic or scientific agency.
So is a Xerox machine (or a Canon copier or 'MFD').
The issue there is not "if it can" or even "if it is designed to do so" but rather "can it be used in a way that is not infringing" and "if there is an infringement from its use, the human doing that is the one liable."
And yet, there are non-infringing uses of the Xerox machine.
Even if one was to accept the position that the only thing that GPT can produce is derivative works it doesn't rule out that there are transformative and non-infringing uses of it.
No Xerox machine comes with an embedded copy of Harry Potter that can be reproduced at the push of a button.
That's the crux of the issue, that you can't separate the training data from the derivation ability. If it's just an AI algorithm that could, when trained in a certain way, produce derivative infringing works, nobody would object to it.
This is a red herring. The issue before the court will be whether creation and release of the model affects J.K. Rowling.
Think of it this way: suppose I make a bunch of Super Mario Brothers video games and try to sell them without Nintendo's permission.
If Nintendo sued me, I can't say "This cartridge has no agency. This will only affect Nintendo if humans use this video game instead of playing Super Mario Brothers."
Students in school also will never learn to read without being exposed to text. Does this mean that teachers who write exercise sheets and school textbook publishers now own the copyright of everything students do?
Being in school is also just a tool for knowing stuff, being able to read, being around similar-aged peers, etc.
Whether the knowledge is directly in your brain or in a device you operate (directly or through an API) shouldn't really matter.
If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator. This has nothing to do with one mover being a human and the other being an excavator controlled by a human: the move is simply not authorized.
I think that we should allow humans to move stones up the hill with excavators too. There is no stealing of excavator fuel from human food sources going on (let's assume it's not biofuel operated :p).
> If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator.
Sure, but the reverse is false: I can walk on my own feet through Hyde Park, but I can't ride my excavator there.
Laws are made by humans for the benefit of humans; it's a political struggle. Now, large corporations try to exploit loopholes in the existing copyright framework in order to expropriate creators of their works. It's standard uberisation: disrupt existing economic models, insert yourself as an unavoidable middleman, and pauperize the workforce that provides the actual service.
> It's finding patterns same as anyone studying the code base would do.
This is the issue: it's not finding patterns as people do.
If I read someone's code, book, &c, that's extremely lossy. I can only pick up a few things from it in the long term.
But an ML model can store most of what it's given (in a jumbled format) and can do it from billions of sources.
It's essentially corporate piracy, but it's not legally recognized as such because it doesn't store identical reproductions.
This hasn't been an issue before because it's recent and wasn't considered valuable. But now that it's valuable and Microsoft is going to take all our jobs we have to at least consider if it's okay if Microsoft can take our work for free.
No, but I believe a large language model is a work that is 99.9% derivative of its inputs, with all that implies for authorship and copyright. Right now it's just a heist.
> Do you really think it would be a better world in which a large LLM would never be able to be developed?
Maybe. I believe the potential for abuse is far greater than the potential benefits. What is our benefit, a better search engine? Automating some tedious tasks? Increased productivity? What are the downsides? People losing their jobs to AI. Artists/programmers/writers losing value from their work. Fake online personas indistinguishable from real people. Unprecedented amounts of spam and misinformation flooding the internet. Intelligent AIs automatically attacking and hacking systems at unprecedented scale 24/7. Chatbots becoming the new interface for most interactions online and being the moderators of access to information. Chatbots pushing a single viewpoint and influencing public opinion (many people complain today about ChatGPT being too "woke"). And I may just be scratching the surface here.
That's the answer to the YC Interview question "What is your unfair competitive advantage" in a nutshell. Morally it might be wrong. From a business building perspective it's access that no one has.
I am strongly in favor of eliminating copyright completely everywhere, soooo I am pretty fine with that. The other direction should be more enforce-able: stuff derived from open data must also be made open again, like the GPL but for data (and therefore ML stuff).
Right but in a world where copyright does exist we arguably have the worst of both worlds. Small players are not protected at all from scraping and big players are leveraging all of their work and have the legal resources to form a moat.
Yeah, I definitely like to see AI companies getting hit with their own medicine. The main problem isn't even "automated plagiarism": the pre-generative era was chock full of AI companies more or less stealing datasets. Clearview AI, for example, trained up its facial recognition technology on your Facebook photos, without asking for and without getting permission.
On the other hand, I genuinely hope copyright never "catches up", because...
1. It is a morally bankrupt system that does not adequately defend the interests of artists. Most artists do not own their own work; publishers demand copyright assignment or extremely broad exclusive licenses as a condition of publication. The bullies know to ask for all their lunch money, not just a couple bucks for themselves. Furthermore, copyright binds noncommercial actors the same as it does commercial ones, which means unconscionably large damage awards for just downloading a couple of songs.
2. The suggested ways to alter copyright to stop AI training would require dramatic expansions of copyright scope. Under current law, the only argument for the AI itself being infringing would be if it memorized training data. You would need to create a new ownership right in artistic styles or techniques. This would inflict unconscionable amounts of psychic and legal damage on all future creators: existing artists would be protected against AI, but no new art could be legally made unless it religiously hewed to styles already in the public domain. We know this because music companies have already made their domain of copyright effectively work this way[0], and the result is endless bullshit lawsuits against people who write songs that merely "feel" too similar (e.g. Blurred Lines).
3. AI will still be capable of plagiarism. Most plagiarists are not just hoping the AI regurgitates training data, they are actively putting other people's work into the model to be modified. A lot of attention is paid to the sourcing of training data, because it's a weak spot. If we take the training data away then, presumably, there's no generative AI. However, people are working on licensed datasets and training AIs on them. Adobe has Firefly[1], hell even I've tried my hand at training from scratch on public domain images. Such models will still be perfectly capable of doing img2img or being finetuned and thus copying what you tell it to.
If we specifically want to regulate AI, then we need to pass laws that regulate AI, rather than just giving the music labels, movie studios, and book publishers even more power.
[0] Specifically through sampling rights and thin copyright.
[1] I do not consider Adobe Firefly to be ethical: they are training the AI on Adobe Stock images, and they claim this to be licensed because they updated the Adobe Stock agreement to have a license in it. Dropping a contractual roofie into stock photographers' drinks does not an ethical AI make.