1. Google denies doing it, so at the very least the title should have an "allegedly".
2. Even if they did – so what? The output from ChatGPT is not copyrightable by OpenAI. In fact it is OpenAI that is training its models on copyrighted data, pictures, code from all over the internet.
Isn't it generally very hard to be first to market? And even if you are it's more likely that someone coming in later will take your lunch.
Apple wasn't the first one to try to make a successful smartphone, but they had resources, know-how, and tried at a better time with fewer unknowns around.
Or they were just willing to adapt when others didn't. BlackBerry mocked the touchscreen for years until finally coming around, and their initial implementation was awful.
Google Maps wasn't the first map product. We all used mapquest way before. But Google Maps was technologically advanced. Ajax made maps usable for the first time.
Gmail wasn't the first webmail. Hotmail had millions of customers already. But Google gave people unlimited space to store old email, whereas email in the old days filled up your inboxes and needed to be deleted.
Question is if they can and will leapfrog.
Google Plus was a sign of desperation and utterly failed. The dozens and dozens of different Messengers (sorry I don't even know what the latest one they're pushing is, RCS?) all failed.
We will see in the coming months and years whether Google, as an organization, can still overtake others when coming from behind.
>But Google gave people unlimited space to store old email
This isn't correct. Gmail launched with 1GB per user, which was way higher than other services, and they did keep doubling the storage space year-after-year, but it was never unlimited until Google Apps offered unlimited storage for businesses and schools.
It does. In general this is known as teacher-student training or knowledge distillation. It works better if you have access to the activations of the model but you can work with just outputs as well.
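For the output-only case, you essentially just fine-tune the student on the teacher's generated text. As a rough sketch of what the logit-level version looks like (assuming PyTorch; the function and parameter names here are only illustrative):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft term: pull the student's distribution toward the teacher's,
        # softened by temperature T (this needs access to the teacher's logits).
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard term: ordinary cross-entropy against the target tokens. With only
        # scraped text outputs (no logits), this term alone is what you train on,
        # using the teacher's generated tokens as the labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard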
Ascribing actual human emotion to a giant corporation like Google is probably not a good idea. Their motivations aren't going to be heavily dictated by feelings of shame at being late out of the gate.
The funny part is that deepmind's tech (and some of Google Brain's research) seems to be as good as openAI's or better, but Google's unwillingness/inability to productionize these systems is keeping them back. It seems like the issue is only with the management, and I'll be looking forward to reading about Google's version of Fumbling the Future[0].
Maybe, but we are fast approaching the point (or more likely have crossed it already) where distinguishing between human and AI generated data isn't really possible. If Google indexes a blog, how does it know whether it was written with AI assistance and therefore should not be used for training? Heck, how does OpenAI itself prevent such a feedback loop from its own output (or that of other LLMs)?
I'm only half joking.... I think we likely will end up with flags for human generated/curated content (and it will have to be that way round, as I can't imagine spammers bothering to put flags on AI-generated stuff), and we probably already should have an equivalent of robots.txt protocol that allows users to specify which parts of their website they would and wouldn't like used in the training of LLMs.
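Something like a hypothetical ai.txt alongside robots.txt, say (these directives are entirely made up, just to sketch the idea):

    # ai.txt -- hypothetical, not an existing standard
    User-agent: *
    Disallow-training: /blog/
    Allow-training: /docs/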
If content with a "human-generated" flag is rated more highly in some way -- e.g. search results -- then of course spammers will automatically add that flag to their AI-generated garbage. How do you propose to prevent them?
I think something like this will definitely happen, and your suggestion is the cleanest implementation idea I've seen for it. I imagine there will be a service provided by Google and OpenAI where they verify your identity as a human and then grant you a token to put into your meta tags (wait a second... this sounds like sama's worldcoin idea...).
It will need to be based somewhat on the honor system (just because someone's proved they're a human doesn't mean they won't put their attestation on auto-generated text), but it definitely sounds better than nothing.
They'll still need to incentivize it somehow, though. Why do I as a human want to add that meta tag? If the answer is "better search ranking" then it renders the whole scheme mostly pointless because obviously spammers will want to acquire the attestation and attach it to their auto-generated content.
Your argument would have a lot more force if we were past that point rather than fast approaching that point. Concerns about training data errors being compounded are much more important when you're talking about the bleeding edge.
And your question about how OpenAI prevents their training data from being corrupted is one we should be asking as well!
It's not quite the same thing, because Bing was getting the data from a browser toolbar and watching the search terms used and where the user went afterwards.
A closer equivalent would be if someone had made a ShareSERP site where people posted their favorite search terms and the results Google gave, and Bing crawled that and incorporated the search-term-to-link connections into their search graph.
The actual actions maybe went too far (personally I thought it was more funny than "copying"); the hypothetical would be pretty much what you'd expect to happen. Even Google would probably crawl ShareSERP and inadvertently reinforce their own results (the same way OpenAI presumably gets more than a bit of its own output fed back to it in any new crawls of Reddit, HN, etc., even if they avoid sites like ShareGPT deliberately).
Google has no contract with OpenAI though. They used a third party site to scrape conversations. If the outputs themselves are not copyrighted, and they never agreed to the terms of service, it should be fine, right? Albeit unethical and embarrassing.
I really don’t understand this angle. In fact, I am fairly positive that the training set for GPT-4 contains many thousands of conversations with AI agents not developed by OpenAI.
Do AI companies need to manually sift through the corpus and scrub webpages that contain competitor LLM output?
(“Yes” is an acceptable answer to this, but then it applies to OpenAI’s currently existing models just as much as to Bard)
Many AI conversations have been floating around internet forums since the original GPT was released. As OpenAI hasn't shared anything about its training set, to err on the side of caution I would assume that they didn't filter these conversations out. If they aren't even marked as such, it may not even be possible to do. I think it would be very hard to prove that no AI conversations are included in the training set, even if it wasn't secret.
It’s still debatable whether training a computer neural network on public data is 'wrong' when we very much accept it as a right for biological neural networks.
It's even less worthy of sympathy - like a counterfeit piece of art being counterfeited. And there isn't even an original, just like a made up counterfeit.
You can quibble about the ethics of web scraping for ML in general but I think you're conflating issues.
OpenAI and Google both scour the web for human-generated content. What Google cares about here is the learnings from OpenAI's proprietary RLHF dataset, for which they had to contract a large number of human labelers. Finding a roundabout way to extract the value of a direct competitor's purpose-built, costly data feels meaningfully different from scraping the web in general as an input to a transformative use.
> OpenAI and Google both scour the web for human-generated content
OpenAI and Google both scour the web for content, period. That content could be human generated or AI generated or a mix of the two. Neither company is respecting copyright or terms of service of every individual bit of data collected. Neither company cares how much effort was put into creating the data, whether humans were paid to do it, or whatever else. So there really isn't that much difference between the two. In fact I can guarantee that there was some Google-generated content within OpenAI's training data.
And herein is the main problem of AI. Its creators consume knowledge from the commons, and give nothing free and unencumbered back.
It's like the guy who never brings anything to the potluck, but after everyone finishes eating, he boxes up the leftovers, and starts selling them out of a food cart.
So what? Is OpenAI's RLHF dataset more valuable than the millions of books and paintings OpenAI used for free without a second thought? Why is that? Because one big tech corp paid money for that dataset?
> labelers. Finding a roundabout way to extract the value of a direct competitor's purpose-built, costly data feels meaningfully different from scraping the web in general as an input to a transformative use
There we go again: it's one law for the unwashed plebs and another for us.
Why do you think that I, after spending my time and effort to write my blog, own my content to a lesser extent than OpenAI owns theirs? Such hypocrisy.
If there's a party which has intentionally conflated scraping web content in general with scraping it to build a direct competitor to the original sources, that party is Google.
Yes, this latest instance with OpenAI outputs is shady, but I think it's in the same spirit as scraping news organizations for content which journalists were paid to write, and then showing portions of it directly in response to queries so people don't go directly to the news organization's pages, and it's in the same spirit as showing answers to query-questions that are excerpts from scraped pages which another organization paid to produce.
I see no difference. Any web scraping is a means to deflect revenue-generating traffic to yourself, and away from other websites. Fewer people will go to Stack Overflow because of Codex and Copilot. The point that the content was paid for vs volunteered becomes moot once it's posted publicly online for free, on ShareGPT.
The recent HiQ vs LinkedIn case would seem to make this ToS unenforceable, unless Google actually created a user account on ShareGPT and affirmatively accepted the terms. "Acceptance by default" does not count, and I can easily browse ShareGPT without affirmatively accepting any ToS, without which web scraping is totally legal.
I love it how they don't want others to use their model output but they have no qualms about training their model on the copyrighted works of others? Isn't this a stunning level of hypocrisy?
So, to verify, are you claiming that if someone added a similar clause to their source code and then GitHub went ahead and trained Copilot against it, that would be an issue?
You relinquish all licensing rights when you upload your code to GitHub. Microsoft can do whatever they want with it. That's in their ToS, which you have to agree to when you make an account. Normally, only affirmatively accepted ToS are enforceable, so just putting a clause into your license doesn't work (unless it's a copyright, which doesn't require consent).
> You relinquish all licensing rights when you upload your code to GitHub
What now? Seriously?
I found this. Section D4.
"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
"as necessary to provide the Service" seems critical.
Also, section D3 of the GitHub Terms of Service says:
> You retain ownership of and responsibility for Your Content.
and section D4 says:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
There is nothing in the terms that requires the GitHub user to relinquish all licensing rights.
I think there's a misunderstanding over what the word "relinquish" means.
The terms make clear that uploading code to GitHub gives GitHub the right to "store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time" while the code is hosted on GitHub.
However, that's not the same thing as relinquishing (giving up) licensing rights to GitHub. The uploader still retains those rights, and there is nothing in the terms that says otherwise.
It is certainly a service that's being provided. If not by GitHub, then by whom?
I'll repeat the definition of service: The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.
So do you believe if you hosted a closed source project on GitHub, and GitHub decided they want to integrate this into their service they would simply be allowed to take the code?
Fortunately HN commenters are not judges. And I would wager any bet that MS lawyers would not try to argue based on their ToS either; that would be a recipe for losing any court case.
I just mean that it doesn't really matter what your license says as long as GitHub can come up with a business justification for using it in some way. Certainly, other users still legally have to obey your copyright.
So, to verify, are you claiming it would not be allowed for you to upload my otherwise-open-source code (code I do not myself host at GitHub, but which was reasonably popular / important code) to GitHub?
If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post.
I suppose this means if I upload your stuff to GitHub, and you sue GitHub, then GitHub would be able to somehow deflect liability onto me.
That doesn't make sense. For example, GPLv3 allows anyone to redistribute the software's source code if the license is intact:
> You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.
If GitHub then uses the source code in a way that violates the license, there is no provision in the GitHub terms of service that would allow GitHub to deflect legal liability to the GitHub user who uploaded the program. The uploader satisfied the requirements of GPLv3, and GitHub would be the only party in violation.
I'd like to see that theory tested in court. Section D3 of the terms says:
> If you upload Content that already comes with a license granting GitHub the permissions we need to run our Service, no additional license is required.
and section D4 does not mention any permissions that GPLv3 does not already cover. GitHub automatically recognizes when a repo is GPLv3-licensed, so it cannot claim ignorance of what GPLv3 is.
Correction – breaking terms of service that you have not explicitly agreed to is not punishable in any way. A site cannot enforce a "by using this site you agree to..." clause deep inside some license page that visitors are generally unaware of. If you violate an agreement that you willingly chose to enter, however, you will likely be found liable for it.
Read their statement carefully and it's actually not a denial of the allegation.
> But Google is firmly and clearly denying the data was used: “Bard is not trained on any data from ShareGPT or ChatGPT,” spokesperson Chris Pappas tells The Verge
* Allegation: Google used ShareGPT to train Bard.
* Rebuttal: The current production version of Bard is not trained on ShareGPT data
Both things can be true:
* Google did use ShareGPT to train Bard
* Bard is not currently trained on any data from ShareGPT or ChatGPT.
This is an argument in bad faith, but at this point I have zero trust in corporations and feel like you can generally count on them to do shitty things if they can benefit from it, so I can be easily swayed by little proof.
What's the argument? What's been done by anyone that's shitty? I don't even understand the point of this post. As far as I know, the current wave of text-based AIs is trained on all text accessible on the internet. Would it be a scandal to learn that ChatGPT is trained on wikipedia? Reddit? What is even the argument here, good faith or otherwise?
The argument is that these companies are using ideas created by us humans on this thing called the internet, for free and without attribution, and that's problematic.
Responding to sibling comment: We need some clarification here: are we speaking about just ideas in the abstract sense, or ideas that have been fleshed out i.e "materialized"
If the latter, there are many laws that say you can own an idea, provided it exists somewhere.
I'm not necessarily arguing against you, but "problematic" is too generic a term to be useful. Genocide is "problematic". Having to run to the bathroom every 5 minutes to blow my runny nose is "problematic". What do you actually mean?
Right, but I do think you can "own" (by which I mean our societally-mediated legal definition of ownership in the anglosphere) specific sequences of text or at least the right to copy them?
From an open source point of view it would be better if scraping proprietary LLMs would be allowed. Small LMs need this infusion of data to develop.
But the big news is that it works, just a bit of data can have a large impact on the open source LLMs. OpenAI can't have a moat in their proprietary RLHF dataset. Public models leak, they can be distilled.
Regarding point 2, I think there's nothing "wrong" with it, mainly it's funny that they don't know how to do it themselves. Provides additional evidence that Google is outgunned in this fight.
Whether or not it's okay to train one giant LLM with socially enormous stakes on the output of another commercially controlled LLM is an interesting question.
My stronger opinion is that the people who can do this stuff via having a crawled corpus of the Internet need to keep in mind that it's all our "user-generated content" that they've freely appropriated to build their models, and so whatever the technical copyright rules are (or become): you don't ethically own something that's closely imitating stuff we all wrote over the years.
> And what about the terms of service of my blog or code repository? Does OpenAI respect that?
Seems to me that’s an issue between you and OpenAI. (Does your blog or code repository actually have published restrictive terms of service? Did it when OpenAI accessed it? Did OpenAI even access it?)
You think OpenAI is going to care unless you have a team of expensive lawyers to back you up?
Microsoft is out there laundering GPL code with Copilot. These companies live firmly in the don't give a fuck region of capitalism. Copyright law for thee, not for me.
Since it was through ShareGPT, is the argument like "what color are your bits" but for ToS?
Maybe it would be different if they had put in their terms of service "you can only share this on sites whose own ToS allow sharing but disallow using the content for training models, and also replicate this requirement", but I don't see how you could have any sort of viral ToS like that.
Seems more like it's just a bad idea to rely heavily on another LLM's output for training.
Seems to me like it makes Google look kind of pathetic. That's worse than any legal issue here. (Caveat: assuming I understand the situation correctly)
According to the article, the story goes this way: the engineer Jacob Devlin raised his concerns about training Bard with ShareGPT data. Then he left directly for OpenAI.
He also claims that Google was about to do it, and then stopped after his warnings, presumably removing every trace of OpenAI's responses.
A couple of things:
1. So, Bard could have been trained on ShareGPT but it's not, according to the same engineer who raised the concern (and Google's denial in The Verge).
2. Since he directly joined OpenAI, he could have told them and they could have taken action, and nothing is public on that front yet. Probably nothing to see here.
Edit: The engineer wasn't directly involved with the Bard team either; it just appeared to him that the Bard team was heavily relying on ShareGPT.
For those that don't know, Jacob Devlin was the lead engineer and first author of the widely popular BERT model architecture and of the initial bert-base models released by Google.
Not illegal, but that won't stop people from finding it amusing that the company considered to be the world's beacon of innovation is copying someone else's homework. It's hard being the favorite horse.
No one is alleging that Google directly used OpenAI's API to get training data (which would be unambiguously against TOS). The claim is that they downloaded examples from ShareGPT.
> He also claims that Google were about to do it, and then they stopped after his warnings.
So were they heavily relying or were they about to and then stopped? It's unclear from your comment. Could you link where you're getting this info from? The Information article is walled, unfortunately.
What I meant to say was: according to The Information article, the engineer raised concerns because it appeared to him (the article's wording) that the Bard team was using (and heavily reliant on) ShareGPT for Bard training. The engineer wasn't working on Bard; presumably someone told him, or he somehow got the impression, that the Bard team was relying on ShareGPT. At the time he was at Google.
Then, when he raised concerns to Sundar Pichai, the Bard team stopped doing it and also scrubbed any traces of ShareGPT data. So the headline is false, and Bard (again, presumably) is not trained on any ShareGPT data.
I think I might be confused by your usage of “about to do it” in your original comment to mean “actively doing it.”
You claim that the very engineer accusing Google of training Bard on ShareGPT acknowledges that the final product was not. As far as I can tell, Devlin did no such thing.
Not sure why you would presume they restarted their expensive training process.
It just doesn’t seem like a good faith characterization to me.
"What's sauce for the goose is sauce for the gander" as the legal cliche goes. OpenAI cannot on the one hand claim that google did something wrong if they used their outputs as part of the bard training while simultaneously on the other hand claiming they themselves are free to use everyone on the internets content to train their model.
Either they believe that training should respect copyright (in which case they could not do what they do) or they believe that training is fair use (in which case they cannot possibly object to Google doing the same as them).
No one is alleging copyright violations. The claim is that they violated OpenAI's terms of service. We don't know whether Google ever even agreed to those terms of service in the first place.
Services are subject to terms of service. (If content is received through a service, the terms of service may govern use of it, but that’s not a feature of the content, but the acquisition route.)
ShareGPT isn't part of that service though. Yes, it would be a TOS violation if Google directly used ChatGPT to generate transcripts -- but not even the original Twitter thread is claiming that.
The only claim being made against Google here is that they used ChatGPT content. I can't find any sources claiming that Google made use of an OpenAI service. So the distinction is correct, but doesn't seem particularly valuable in this context -- using data from ShareGPT is not a TOS violation.
That's nonsensical. An AI is either transformative or it's not, it's an intrinsic quality that has nothing to do with the training data or the "product" type. If OpenAI is sufficiently transformative to claim fair use (which I don't believe for a second, alas), then any other AI built on similar fundamentals has the same claims and can crunch any data their creators see fit, including the output of other AIs.
First off, the whole argument behind these models has been from day one that training on copyrighted material is fair use. At most this would be a TOS violation. Second off, AI output is not subject to copyright, so it has even less protection than the original works it was trained on.
Copyright maximalism for me, but not for thee. It's just so silly for someone working at OpenAI to complain about this.
And would it be a ShareGPT TOS violation (assuming it had any)?
If OpenAI says "you can share these online but don't use them for AI training", people share them on another site, and then someone else comes along to scrape that site for AI training data, there's no relationship between OpenAI and the scraper for the TOS to apply to.
Normally I think you'd rely on copyright in that kind of case, but that doesn't apply to ChatGPT's output, so...
Right. And what even is the penalty of that TOS violation and how enforceable is it?
I don't have an OpenAI account. I have never agreed to any TOS. I don't see what legal claim they would have to stop me from training an LLM on ShareGPT.
If Google were specifically going to ChatGPT to get its output and train off of it, they could be sued for breach of contract - and OpenAI would likely have a pretty good argument:
- they specifically tried extracting and learning from our model when it says you can't in our TOS
- this makes it easier for them to compete with us via the data they obtain in their breach of contract
- more businesses and enterprises might pass up on renting a shared or dedicated instance from us if they can just get it from Google
> If Google were specifically going to ChatGPT to get its output and train off of it
But (correct me if I'm wrong) I don't think anyone anywhere is claiming that's what happened. The claim was just that Google looked into using existing chats that it scraped from another website.
Edit: realizing you're probably replying specifically to the question I asked, "and what even is the penalty of that TOS violation and how enforceable is it?" In which case, yeah, that's a decent clarification to add, sorry for pushing back on it.
Thanks for the clarification. Aside from the OP, I haven't seen anyone from OpenAI commenting on this, so yeah, unless I've missed something I think you're correct to point that they're not involved so far.
OpenAI doesn't own the copyright on the human aspects of the chats, so it still doesn't really have a claim to make around them. And even if it did own that copyright, we loop right back around to "wait, training an AI on copyrighted material isn't fair use now?"
There's no way that ChatGPT's conversations are going to be subject to more intellectual property protection than the human chats it was trained on.
People complained that new AI is "stealing" from artists.
But stealing from other AI turns out to often be easier.
And this is where things get fun, because companies like OpenAI want to be able to train on all the data without any explicit permissions from the creators, but the moment people do the same to them they will likely (we will see) be very much against it.
So it will be interesting to see whether they will be able to both have their cake and eat it (e.g. by using Microsoft's lobbying power to push absurd laws), or whether they will fall apart because cannibalization makes it unprofitable to create better AI.
EDIT: This comment isn't specific to Google/Bard, so it doesn't matter whether or not Google actually did so.
I can see the GitHub Copilot controversy being resolved in this way. If Microsoft, GitHub, and OpenAI successfully use the fair use defense for Copilot's appropriation of proprietary and incompatibly licensed code, then a free and open source alternative to Copilot can be trained on Copilot's outputs.
After all, the GitHub Copilot Product Specific Terms say:
> 2. Ownership of Suggestions and Your Code
> GitHub does not claim any ownership rights in Suggestions. You retain ownership of Your Code.
Why would it need to be trained on Copilot’s output? Its training data is publicly available code on GitHub, so just use that directly. ChatGPT is different because they specifically trained it as an assistant with a private dataset
Google accused Microsoft Bing of using their results for page rankings a few years ago. They set up a sting to show that when you searched for something unique on Google using Internet Explorer, shortly afterwards the same search result would start showing up on Bing.
This was seen as deeply embarrassing for Microsoft at the time.
Thankfully archive.org exists, otherwise it would not be possible to get good training data in a few years when the internet is flooded with AI content.
Isn't most of the internet available through Common Crawl? I don't know what percentage of training data is just that data set, but I assume it's enough for anyone with enough compute and ingenuity to create a reasonable LLM.
Which is baseless hyperbole. We get it, blog spam is annoying. That doesn't change the fact that humans generate a ton of data just interacting with one another online.
As a forum moderator, I have transitioned to relying heavily on AI-generated responses to users.
These responses can range from short and concise ("Friendly reminder: please ensure that all content posted adheres to our rules regarding hate speech. Let's work together to maintain a safe and inclusive community for everyone") to lengthy explanations of underlying issues.
By using AI-generated content, a small moderation team can efficiently manage a large group of users in a timely manner.
This approach is becoming increasingly common, as evidenced by the rise in AI-generated comments on popular sites such as HN, Reddit, Twitter, and Facebook.
Many users are also using AI tools to fix grammar issues and add extra content to their comments, which can be tempting but may result in unintentional changes to the original message.
In fact, I myself have used this technique to edit this very comment to provide an example.
---- Original comment:
As an online forum mod, I switched to mainly using AI to generate replies to users. Some are very short ("Hey! Remember the rules.") and some are long paragraphs explaining underlying issues. Someone training on my replies would pretty much train on AI generated content without knowing. It allows a small moderation team to moderate a large group quickly. I know that I am not alone in this.
There is also a raise in AI generated comments on sites like HN, Reddit, Twitter and Facebook. It's tempting to copy-paste a comment in AI for it to fix grammar issues, which often results in extra content being added to text. In fact, I did it for this comment.
The original comment is much better, please stop rewriting your comments using OpenAI.
> In fact, I did it for this comment.
Yes, it was obvious from the second sentence. The way ChatGPT structures text by default is very different from how most humans write. Always the same "By using", "These X can range from", etc.
Padding your text with more words doesn't make it better; more words make it worse. This isn't school.
Interesting, the "By using" was my own addition to shorten a long sentence it had generated that distracted from the example.
To be more clear, using AI to rewrite comments such as this one is not something I often do. My personal use of it for moderation purposes is more prompt based than pasting a long comment for grammar and spelling corrections.
What I did here was an example and that example provided the same criticism that you wrote here as a reply ("which can be tempting but may result in unintentional changes to the original message"). In other words, makes the text more verbose and sanitizes the writing style.
The prompt we use for moderation contains our site's rules and some added context. So using ChatGPT, we can paste in someone's comment and ask the bot to write a short text explaining how that comment does not follow our rules and what the user can do.
"Using the rules above, write a very short message for a user that wrote a rule breaking comment. Show empathy. Use simple English. Explain the rules that were broken. The comment is [comment here]"
Using this saves a lot of time. Is the quality of the comment not as good as it could be if it was written by a human? Absolutely. However, using AI let us change the user:mod ratio in a way. Automoderators are nothing new, what is new is that now the automoderator can take context into account and provide a customized message.
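If you wanted to script it rather than paste by hand, a rough sketch might look like this (assuming the openai Python client's ChatCompletion API; the rules text and helper name are placeholders):

    import openai

    openai.api_key = "sk-..."  # placeholder

    SITE_RULES = "..."  # the site's rules plus some added context (placeholder)

    def draft_mod_reply(comment):
        # Ask for a short, empathetic note explaining which rules were broken.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SITE_RULES},
                {"role": "user", "content": (
                    "Using the rules above, write a very short message for a user "
                    "that wrote a rule breaking comment. Show empathy. Use simple "
                    "English. Explain the rules that were broken. "
                    "The comment is: " + comment)},
            ],
        )
        return response["choices"][0]["message"]["content"]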
OpenAI at least can track the hashes of all content it's ever output, and filter that content out of future training data. Of course they won't be able to do this for the output of other LLMs, but maybe we'll see something like a federated bloom index or something.
Agreed there is no perfect solution though, and it will definitely be a problem finding high quality training data in the future.
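As a rough illustration of the mechanics (a sketch only, and it obviously catches exact matches rather than paraphrases): the provider folds a hash of every completion it serves into a Bloom filter, and the training pipeline drops crawled text the filter says it has probably seen.

    import hashlib

    class BloomFilter:
        # Tiny Bloom filter: m bits, k hash functions derived from sha256.
        def __init__(self, m=1 << 24, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _hashes(self, text):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, text):
            for h in self._hashes(text):
                self.bits[h // 8] |= 1 << (h % 8)

        def probably_contains(self, text):
            return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(text))

    seen = BloomFilter()
    seen.add("some model completion")  # provider records what it emitted
    assert seen.probably_contains("some model completion")  # pipeline can filter it out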
AI content will be associated with a user or organization in the trust graph. If someone you trust trusts a user or organization who posts AI content, you're free to revoke your trust in that person or blacklist the specific users/organizations you don't want to see anymore.
We've been pretending to be just about to do this for decades. The fact is that internet companies will not develop a network of trust, because they are primarily advertisers looking for better ways to abuse trust.
I am assuming OP means when AI takes over there's going to be a content explosion and most of what's available on the common internet will be AI generated content rather than human made one and they want to use archive.org to get access to the pre-AI internet.
Only if the amount of bad information in ChatGPT content that makes it back into the training set is worse than what's already on the internet. Probably the outputs that make it back are better than average, because those are more likely to be posted elsewhere.
I don't care at all about this from a copyright or data ownership perspective, but I am a little skeptical that it's a good idea to be this incestuous with training data in the long run. It's one thing to do fine tuning or knowledge distillation for specialized domains or shrinking models. But if you're trying to train your own foundation model, is relying on output from other foundation models going to make them learn to imitate their own errors?
Things like ShareGPT or PromptHero give vast repositories of human-curated ML outputs, which make them fantastic for at least incremental improvement on the base model. In the grand scheme of things, these will be just another style, mixed in with all the other crap in the training set, so I don't imagine it's too harmful... eg, 'paint starry night in the style of midjourney 5'
The internet is an easy, convenient way to train LLMs, but I'm pretty sure you could train them with microphones. One cloud surveillance company, maybe for networked security monitoring, or maybe just Alexa/Siri etc., could dip into as many and as varied communications per hour as all the books ever written.
It'd be cool to have an LLM that's trained almost exclusively on books from good publishers, and other select sources. Working out licensing deals would be a challenge, of course.
Probably from multiple modalities as well as extending the sequence lookback length further and further.
They have low perplexity now, but the perplexity possible when predicting the next word on page 365 of a book where you can attend over the last 364 pages will allow even more complexity to emerge.
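(For reference, perplexity is just the exponentiated average negative log-probability the model assigns to the actual next tokens, so better predictions from longer context show up directly as a lower number. Toy numbers below, not measurements.)

    import math

    def perplexity(token_log_probs):
        # token_log_probs: natural log of the probability the model assigned
        # to each actual next token, given the preceding context.
        return math.exp(-sum(token_log_probs) / len(token_log_probs))

    print(perplexity([math.log(0.2)] * 100))  # ~5.0: weaker, short-context predictions
    print(perplexity([math.log(0.5)] * 100))  # ~2.0: stronger, long-context predictions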
If a Google employee working on this thing ever agreed to OpenAI's terms of service, they might be screwed.
From OpenAI's terms:
(c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI;
(j) Equitable Remedies. You acknowledge that if you violate or breach these Terms, it may cause irreparable harm to OpenAI and its affiliates, and OpenAI shall have the right to seek injunctive relief against you in addition to any other legal remedies.
Those two very clearly establish that if you use the output of their service to develop your own models, then you are in breach of the terms and they can seek injunctive relief against you (stop you from working until the case is resolved).
I hereby set a terms of service for everything I post on the internet from now on. OpenAI may not train future GPT models on my words or my code without my express written permission.
Sure. If you can get everyone to create an account and agree to those terms before reading your comments, you might have a case.
Otherwise, it will be considered public information, at which point it is free to be scraped by anyone (see the precedent set by the LinkedIn/hiQ case).
That's just because they made accounts and so agreed to the terms right?
From your link:
>These rulings suggest that courts are much more comfortable restricting scraping activity where the parties have agreed by contract (whether directly or through agents) not to scrape. But courts remain wary of applying the CFAA and the potential criminal consequences it carries to scraping. The apparent exception is when a company engages in a pattern of intentionally creating fake accounts to collect logged-in data.
No, the case did not decide anything, no precedent was set. The point is that you cannot use this case to argue that you can scrape public data free of consequence
What's the legal status of such terms of service? Suppose you simply said "i didn't agree to these terms" - what's the consequence? It seems like the strongest thing they could legitimately do would be to kick you off of their platform. Simply writing "we can seek injunctive relief" doesn't make it so.
Wouldn't that only apply if that employee was acting as an agent of Google at the time?
Otherwise it would create an interesting dynamic that startups where no-one has created an OpenAI account would have a massive advantage, since they can freely scrape ShareGPT data and train on it while larger companies have enough employees that someone must have signed every TOS.
Good luck to them. AI models are automated plagiarism, top to bottom. None of us gave OpenAI permission to derive their model from our writing, surely billions of dollars worth, but they took it anyway. Copyright hasn't caught up so all that stolen value rests securely with OpenAI. If we're not getting that back, I don't see why AI competitors should have any qualms about borrowing each others' work.
I'm not a copyright maximalist, and I kind of agree that training should be fair use. Maybe I'm right about that, maybe I'm wrong. BUT importantly, that has to go hand in hand with an acknowledgement that AI material is not copyrightable and that training on other model output is fine.
What companies like OpenAI want is a system where everything they build is protected, and nothing that anyone else builds is protected. It's wildly hypocritical, what's good for the goose is good for the gander.
That some AI proponents are now freaking out about how model output can be legally used shows that on some level those people weren't really honestly engaging with artists who were freaking out about their work being appropriated to copy them. It's all just "learning from the art" until it affects somebody's competitive moat, and then suddenly people do understand how LLM weights could be seen as a derivative work of their inputs.
> Trade secret protection protects secrets from unauthorized disclosure and use by others. A trade secret is information that has an economic benefit due to its secret nature, has value to others who cannot legitimately obtain it, and is subject to reasonable efforts to maintain its secrecy. The protections afforded by trade secret law are very different from others forms of IP.
I am not a lawyer, but I don't believe a trade secret would prevent someone from reverse engineering your model's knowledge from its output, in the same way that it doesn't prevent someone from reverse engineering your hot sauce by buying a bunch and experimenting with the ingredients until it tastes similar.
My point was more of there are protections for things that aren't copyrightable. If the model is protected as a trade secret, then it is a trade secret.
The example of the hot sauce recipe is quite apt - the recipe isn't copyrightable, but you can be certain that the secret formula for how to make Coca-Cola syrup is protected as a trade secret.
Our writing, our code, our artwork... Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyrighted, so these ChatGPT logs are fair game. It would be hypocritical to think that Google is wrong and OpenAI is not.
It's not even that those works can't be copyrighted on their own. It's that even when you make changes to those works, your changes might qualify for copyright, but they do not affect the copyright status of the AI-generated portions of the work.
If you used AI to design a new superhero and then added pink shoes, yellow hair, and a beard, only those three elements could possibly be protected by copyright. Your additions do not change the status of the underlying AI work, which cannot be protected and is available for anyone to use.
> if you used ai to design a new superhero and then added pink shoes, yellow hair, and a beard
Wouldn't that depend heavily on the prompt used (among other factors such as image to image and ControlNet)? You could be specifying lots of detail about the design in your prompt, and the AI could only be generating concept artwork with little variation from what you already provided.
If I'm already providing the pose, the face, and the outfit for a character (say via ControlNet and Textual Inversion), generating <my_character> should be no different from generating <superman>, that is to say, the copyright already exists thanks to my work and the AI is just a tool, the output of which should have no bearing on who owns that copyright (DC is going to be perfectly able to challenge my commercial use of AI generated superman artwork).
According to the copyright board, a prompt is no more than any person commissioning a work from an artist, which does not confer copyright, and the lack of human authorship for the design decisions still stops it from being protected by copyright.
Textual inversion involves providing self-created images, which should confer copyright in the same way AI images of DC's superman are considered to fall under the copyright of DC. In other words, commissioning fanart still allows the original owner of the IP to exert copyright -- shouldn't that be the case here?
If I use an AI tool to design my Superhero, can't I just submit it without disclosing the help I received from an AI.
I get that it would be very nice to prevent AI SPAM copyrighting of every possible superhero, but if I use the AI to come up with a concept, then quickly redraw it myself with pen and paper, I feel like it would never be provable that it came from an AI.
Redrawing something by hand creates a new copyrightable work, so it certainly isn't fraud to claim you own the copyright in a work of art you drew based on an AI output.
It depends if your redrawing is substantially different enough from the original image to earn copyright on its own. Your changes to an image from ChatGPT do not affect the copyrightability of the original content. If you've simply redrawn what the computer designed it may not be substantial enough to earn copyright. If you've made changes, it may only be copyrightable for those changes.
The example was redrawing something by hand that was computer generated originally.
It would be pretty much impossible for a hand drawn work of art to not be sufficiently original. Hand drawn art doesn't look the same as what a computer produces. Originality has a very low threshold, simply pointing my camera at something and hitting click is almost always enough to show originality.
At any rate it isn't fraud to take the legal position that you are an original enough artist to have copyright in the work. If taking a legal position was "fraud" any attorney who lost a court motion would be whisked away to jail.
Edited to add: the copyright registration form asks if you are the "author" not if you are "original."
If you think you own a design because you hand drew a version of it someone else invented, you're gonna have a bad time. Please redraw a superman picture someone else made and then go to have it copyrighted, and tell me how that goes for you.
It would go fine, since I see the form has a question "is this a derivative work". I put yes, and this means my claim is only for what was original to me when I drew the drawing based on another drawing of Superman.
But I see we've moved away from the original point, which was that it would be difficult for anybody to know an AI helped someone make the drawing if they redrew it and didn't disclose it was a redrawing.
> Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyright, so these ChatGPT logs are free game.
Doesn't this depend on where you or the AI live? The US ain't the world.
But clearly everything generated by an AI isn’t automatically in the public domain. That would be a trivial way of copyright laundering.
"Sorry, while this looks like a bit for bit copy of a popular Hollywood movie, it was actually entirely dreamt up by our new, sophisticated, definitely AI-using identity function."
If I plagiarize a Hollywood movie, then I explicitly "give up" my copyright by "releasing" it to the public domain, it doesn't affect the movie at all. AI or not is irrelevant.
The person using something similar to something else may be infringing but the ai work cannot be protected by copyright as it lacks human authorship. Those are two separate issues.
For some cases, sure: if it repurposes your code in a way that ignores the license, fine. But it's rarely wholesale copying. It's finding patterns, same as anyone studying the code base would do.
As for the majority of content written on the internet through Reddit or other social media, what's the harm in ingesting that? It's an incredibly useful tool that will add huge value to everyone. It's relatively open, cheap and highly available. Its worth to its owners is only a fraction of the value it will add to society. It has the chance to have as big of an impact on progress as something like the microprocessor.
I agree it's fair game for other LLMs to use GPT output as training data, and that's positive. Although it signals desperation and panic that the largest "AI first" company, with more data than any org in history, is caught so flat-footed and has to rely on it.
Do you really think it would be a better world in which a large LLM would never be able to be developed?
It's definitely a derived work as far as copyright is concerned: the output would simply not exist without the copyrighted training data.
> It's finding patterns same as anyone studying the code base would do.
No, it's quite unlike anyone studying data, because it's not a person with legal rights, such as fair use, but an automated algorithm. There is absolutely no legal debate that copyright applies only to human authors, or only to the human created part of a mixed work, there is vast jurisprudence on this; by extension, any fair use rights too, exist only for human users of the works. Derivation by automated means - for the express economic purpose of out-competing the creator in the market place, no less - is completely outside the spirit of copyright.
The output of human copyrighted work wouldn't exist if it weren't for humans training on the output of other humans.
Humans constantly use cliches in their writing and speech, and most of what they produce is a repackaged version of what someone else has written or said, yet no one's up in arms against this mass of unoriginality as long as it's human-generated.
It's a bit more nuanced than that. What I mean is that the slow speed at which humans learn is a foundational block of our society. If suddenly some new race of humans emerged that could read an entire book in a couple of minutes and achieve lifelong superhuman retention and assimilation of all that knowledge, then we would have the exact same type of concerns that we have today about AI, including how easily they could recreate high-quality art, music, and anything else with just a tiny fraction of the effort that the rest of us need to reach similar results.
Startup technologists have been acting like speed of actions doesn't matter for decades. If a person can do it, why shouldn't a computer do it 1000x faster? What could go wrong? It's always been a poor argument at best and a bad faith one at worst.
Well said. The mindless automating away of everything has only one logical conclusion, in which the creators of such automations are automated themselves. And even if the optimists are right and we never get there, it doesn't matter: the chaos it can create just by getting closer, at rates faster than society can adapt, is unprecedented, especially given that the population count is at an all-time high and there are many other simultaneous threats that need our attention (e.g. climate change).
Most definitely. Good luck telling the difference between traditional and AI-empowered art in the near future.
It's just a new tool for artists, and this anti-AI sentiment towards copyright is only going to hurt individual artists, while doing nothing for large corporations with enough money to play the game.
AI are not people and the idea that you can be biased against them is hardly a foregone conclusion. Like maybe one day when we have AGI, but ChatGPT ain't that.
There is a difference between a computer and a human, and we already treat them differently in copyright law. For example, copying a program from disk into memory is typically already considered a copy on a computer (hence many licences grant you the licence to make this copy); no such licence is required for a human.
> It's definitely a derived work as far as copyright is concerned - the output would simply not exist without the copyrighted training data.
Can you point to a legal case that confirms this? Because it’s not at all clear that this is true from a legal standpoint. “X would not exist without Y” is not a sufficient test for derivative works - it’s far more nuanced.
United States copyright law is quite clear on the matter:
>A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.
The emphasized part clearly applies: not only does the AI model need to be trained on massive amounts of copyrighted works *), but without these input works it displays no intrinsic creative ability; it has no capacity to produce a single intelligible word or sketch. All creative features of its productions are a transformation of (and only of) the creative features of the inputs; the AI algorithm has no "intelligence" in the common meaning of the word and no ability to create original works.
*) by that, I mean a specific instance of the model with certain desirable features, for example the ability to imitate the style of J.K Rowling
That's an interesting analysis. The issue isn't really whether the A.I. has creative ability, though, if we're talking about whether it infringes copyright. I think comparing the A.I. to a really simple bot is informative.
If I wrote a novel that contained one sentence from 1,000 people's novels, it would probably be fair use, since I hardly took anything from any individual person and because my novel is probably not harming those other writers.
If I wrote a bot that did the same thing, same result, because my bot uses only a little from everyone's novel and doesn't harm the original novelist, so it's likely fair use.
Now I think a J.K. Rowling A.I. probably takes at least a little from her when it produces output, but it's not clear to me how much is actually based on J.K. Rowling and how much is a dataset of how words tend to be associated with other words. You could design a J.K. Rowling A.I. that uses nothing from J.K. Rowling, just data that is said to be J.K. Rowling-esque.
> Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
Creating a model from copyrighted works is likely sufficiently transformative to be non-infringing even if it is found to be a derivative work.
"Creating a model from copyrighted works is likely sufficiently transformative to be non-infringing even if it is found to be a derivative work."
Maybe, but one of the factors of fair use is whether it deprives the copyright owner of income or undermines a new or potential market for the copyrighted work.
If ChatGPT gets so good at writing J.K. Rowling novels that it hurts the sales of the next J.K. Rowling book, that's a strong argument against the use being fair, even if it is transformative.
If J.K. Rowling signs an exclusive agreement with Google to train on J.K. Rowling novels, that's another factor that would suggest OpenAI's use is not fair, because she's shown that OpenAI is hurting a potential market for J.K. Rowling selling the use of her novels to train A.I.
GPT isn't spitting out novels in the style of J.K. Rowling and sending them to publishers - a human is.
GPT being instructed to tell a Harry Potter story itself is no more infringing than a child asking a parent for a made up Harry Potter bed time story. They equally infringe and undermine new or potential markets for copyrighted work.
The question is "what do you do with the material?" If a human took the output of GPT writing as J.K. Rowling or a parent took their collected Harry Potter bedtime stories - those are equally problematic.
If I was to take a portrait of Marilyn Monroe and send it through a plugin called Warholize in Photoshop ( https://www.adobe.com/creativecloud/photography/hub/guides/c... ) , it's not the plugin or photoshop that is infringing - it would be me, the human who created an infringing work. If I print it out and hang it on my wall, that didn't particularly impact on the income for the Warhol estate nor deprive them of new markets. If I print out copies of it and sell them - then that is a different matter.
The question is what you - the human with agency - do with the infringing work after you create it. You can't blame photoshop for creating a Warhol infringing work nor can you blame GPT for writing in the style of J.K. Rowling if you instruct it to do so.
This argument is weak. If we agree that the production is infringing, then selling a machine that produces infringing works on demand also infringes the rights of the author. For example, if I sell a karaoke machine that comes with pre-recorded famous melodies without the original vocals (thus, derived works), I definitely cannot claim "only the agency of the users matters". No, even the on-demand production itself is an act of infringement.
Regarding your fair use point:
> add something new, with a further purpose or different character, and do not substitute for the original use of the work.
This is a test that can be applied to the works produced, and I think that, by and large, it applies to most of them, they are sufficiently transformative to be seen as fair use, especially since they are produced and published, as you say, under the agency of a human user of the system.
But we're talking here about the system as a whole, not a particular production. The ChatGPT model database is not a creative work (other than its source code); it's a binary compressed representation of the works used to train it, for the purpose of on-demand production of works that mimic, to a varying and sometimes strikingly infringing degree, the originals. There is no question such a machine competes with the originals in the marketplace, and there is no real question it is a derivative work of the originals, which fully supply its artistic features and without which the machine could not utter a single intelligible sentence.
Also, the "one word out of 1000" analogy defense does not apply, since a single isolated word is not a work of creation, while a certain shape, brush stroke or cromatic set are. AI does not create using disparate pixels of the originals, it uses entire creative features, more like publishing a book where each page or chapter is copied verbatim and adapted from other works, or if characters, plot devices and story structure are reused. These latter cases have long been recognized as plagiarism.
So the legal question seems to be if mass-scale algorithmic remixing and derivation by a non-creative competitor is fair use. That strikes me as far outside the intended spirit of fair use and the balance it seeks to achieve between author protection and beneficial social progress.
> it's a binary compressed representation of the works used to train it, for the purpose of on-demand production of works that mimic, to a varying and sometimes strikingly infringing degree, the originals.
To pursue this point further, I have asked ChatGPT to "list the lyrics of a famous Madonna song":
Sure, here are the lyrics of "Like a Prayer" by Madonna, which was released in 1989:
[Verse 1]
Life is a mystery
Everyone must stand alone
I hear you call my name
And it feels like home
[Chorus]
When you call my name
It's like a little prayer
I'm down on my knees
I wanna take you there
...
(the entire lyrics of the song follow)
I think it would be very hard to argue that this model does not embed verbatim copies of the original work, but somehow reconstituted those lyrics via a parallel construction in which the cultural impact of Madonna's lyrics was grasped from other fair-use sources. Even in that case, it's still a word-for-word reproduction of the original, therefore not fair use. Therefore, the entire model or service is infringing - even if some of its productions may not be.
The ability of the model to produce copyrighted works is simply proof of the degree to which it relies on the originals; even if that ability were blocked or somehow filtered by a plagiarism detector in a later model of ChatGPT, it would change nothing about the fundamental nature of the machine: an automated means of generating derivative works without artistic or scientific agency.
> it would change nothing about the fundamental nature of the machine: an automated means of generating derivative works without artistic or scientific agency.
So is a Xerox machine (or a Canon copier, or 'MFD').
The issue there is not "if it can" or even "if it is designed to do so" but rather "can it be used in a way that is not infringing" and "if there is an infringement from its use, the human doing that is the one liable."
And yet, there are non-infringing uses of the Xerox machine.
Even if one was to accept the position that the only thing that GPT can produce is derivative works it doesn't rule out that there are transformative and non-infringing uses of it.
No Xerox machine comes with an embedded copy of Harry Potter that can be reproduced at the push of a button.
That's the crux of the issue, that you can't separate the training data from the derivation ability. If it's just an AI algorithm that could, when trained in a certain way, produce derivative infringing works, nobody would object to it.
This is a red herring. The issue before the court will be whether the creation and release of the model affects J.K. Rowling.
Think of it this way: suppose I make a bunch of Super Mario Brothers video games and try to sell them without Nintendo's permission.
If Nintendo sued me, I can't say "This cartridge has no agency. This will only affect Nintendo if humans use this video game instead of playing Super Mario Brothers."
Students in school will also never learn to read without being exposed to text. Does this mean that teachers who write exercise sheets and school textbook publishers now own the copyright on everything students do?
Being in school is also just a tool for knowing stuff, being able to read, being around similar-aged peers, etc.
Whether the knowledge is directly in your brain or in a device you operate (directly or through an API) shouldn't really matter.
If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator. This has nothing to do with one actor being a human and the other being an excavator controlled by a human: the act simply isn't authorized.
I think that we should allow humans to move stones up the hill with excavators too. There is no stealing of excavator fuel from human food sources going on (let's assume it's not biofuel operated :p).
> If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator.
Sure, but the reverse is false: I can walk on my own feet through Hyde Park, but I can't ride my excavator there.
Laws are made by humans for the benefit of humans; it's a political struggle. Now, large corporations try to exploit loopholes in the existing copyright framework in order to expropriate creators of their works. It's standard uberisation: disrupt existing economic models, insert yourself as an unavoidable middleman, and pauperize the workforce that provides the actual service.
> It's finding patterns same as anyone studying the code base would do.
This is the issue, it's not finding patterns as people do.
If I read someone's code, book, &c, that's extremely lossy. I can only pick up a few things from it in the long term.
But an ML model can store most of what it's given (in a jumbled format) and can do it from billions of sources.
It's essentially corporate piracy, but it's not legally recognized as such because it doesn't store identical reproductions.
This hasn't been an issue before because it's recent and wasn't considered valuable. But now that it's valuable and Microsoft is going to take all our jobs we have to at least consider if it's okay if Microsoft can take our work for free.
No, but I believe a large language model is a work that is 99.9% derivative of its inputs, with all that implies for authorship and copyright. Right now it's just a heist.
> Do you really think it would be a better world in which a large LLM would never be able to be developed?
Maybe. I believe the potential for abuse is far greater than the potential benefits. What is our benefit, a better search engine? Automating some tedious tasks? Increased productivity? What are the downsides? People losing their jobs to AI. Artists/programmers/writers losing value from their work. Fake online personas indistinguishable from real people. Unprecedented amounts of spam and misinformation flooding the internet. Intelligent AIs automatically attacking and hacking systems at unprecedented scale 24/7. Chatbots becoming the new interface for most interactions online and being the moderators of access to information. Chatbots pushing a single viewpoint and influencing public opinion (many people complain today about ChatGPT being too "woke"). And I may just be scratching the surface here.
That's the answer to the YC Interview question "What is your unfair competitive advantage" in a nutshell. Morally it might be wrong. From a business building perspective it's access that no one has.
I am strongly in favor of eliminating copyright completely everywhere, soooo I am pretty fine with that. The other direction should be more enforce-able: stuff derived from open data must also be made open again, like the GPL but for data (and therefore ML stuff).
Right but in a world where copyright does exist we arguably have the worst of both worlds. Small players are not protected at all from scraping and big players are leveraging all of their work and have the legal resources to form a moat.
Yeah, I definitely like to see AI companies getting a taste of their own medicine. The main problem isn't even "automated plagiarism": the pre-generative era was chock-full of AI companies more or less stealing datasets. Clearview AI, for example, trained up its facial recognition technology on your Facebook photos, without asking for and without getting permission.
On the other hand, I genuinely hope copyright never "catches up", because...
1. It is a morally bankrupt system that does not adequately defend the interests of artists. Most artists do not own their own work; publishers demand copyright assignment or extremely broad exclusive licenses as a condition of publication. The bullies know to ask for all their lunch money, not just a couple bucks for themselves. Furthermore, copyright binds noncommercial actors the same as it does commercial ones, which means unconscionably large damage awards for just downloading a couple of songs.
2. The suggested ways to alter copyright to stop AI training would require dramatic expansions of copyright's scope. Under current law, the only argument for the AI itself being infringing would be if it memorized training data. You would need to create a new ownership right in artistic styles or techniques. This would inflict unconscionable amounts of psychic and legal damage on all future creators: existing artists would be protected against AI, but no new art could be legally made unless it religiously hewed to styles already in the public domain. We know this because music companies have already made their domain of copyright effectively work this way[0], and the result is endless bullshit lawsuits against people who write songs that merely "feel" too similar (e.g. Blurred Lines).
3. AI will still be capable of plagiarism. Most plagiarists are not just hoping the AI regurgitates training data, they are actively putting other people's work into the model to be modified. A lot of attention is paid to the sourcing of training data, because it's a weak spot. If we take the training data away then, presumably, there's no generative AI. However, people are working on licensed datasets and training AIs on them. Adobe has Firefly[1], hell even I've tried my hand at training from scratch on public domain images. Such models will still be perfectly capable of doing img2img or being finetuned and thus copying what you tell it to.
If we specifically want to regulate AI, then we need to pass laws that regulate AI, rather than just giving the music labels, movie studios, and book publishers even more power.
[0] Specifically through sampling rights and thin copyright.
[1] I do not consider Adobe Firefly to be ethical: they are training the AI on Adobe Stock images, and they claim this to be licensed because they updated the Adobe Stock agreement to have a license in it. Dropping a contractual roofie into stock photographers' drinks does not an ethical AI make.
I think we should all basically come to a consensus on the idea that it's morally right to steal/train from chatgpt (or any other model) given that the whole shoggoth wouldn't be a thing without all our data to feed it.
Heh, imagine the day when most online content is AI-generated; good luck guaranteeing that AIs X, Y, Z, etc. won't feed each other, possibly even circularly.
Even if true (which does not seem to be the case), the whole thing sounds pretty marginal: in order to train a model that is most likely significantly bigger than 100B parameters, one also needs orders of magnitude more training data than the small 120k chats that were shared on the ShareGPT website.
Such logs would not be used for training the base model, but rather for fine-tuning the model for instruction following. Instruction tuning requires far less data than is needed for pre-training the foundation model. Stanford Alpaca showed surprisingly strong results from fine-tuning Meta's LLaMA model on just 52k ChatGPT-esque interactions (https://crfm.stanford.edu/2023/03/13/alpaca.html).
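For anyone curious what that mechanically looks like, here's a rough sketch of instruction fine-tuning on a pile of shared chats. Everything here (model name, file path, hyperparameters) is made up for illustration; this is not Alpaca's or anyone's actual pipeline, just the general shape of it using Hugging Face transformers/datasets:

    # Toy instruction-tuning sketch: treat each shared chat as a
    # prompt+response string and fine-tune a small causal LM on it.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "gpt2"  # stand-in for a much larger foundation model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Hypothetical dump: each record is {"prompt": "...", "response": "..."}.
    data = load_dataset("json", data_files="shared_chats.json")["train"]

    def tokenize(ex):
        return tok(ex["prompt"] + "\n" + ex["response"],
                   truncation=True, max_length=512)

    data = data.map(tokenize, remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=data,
        # mlm=False makes the collator set labels = input_ids (causal LM).
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()

The point being: instruction tuning on logs like these is a much smaller job than pre-training, which is why a ~100k-conversation dump is plausible as a fine-tuning set but laughable as a pre-training corpus.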
well, the initial twitter rant was pretty bombastic:
"The cat is finally out of the bag – Google relied heavily on @ShareGPT
's data when training Bard.
This was also why we took down ShareGPT's Explore page – which has over 112K shared conversations – last week.
Insanity."
Fine-tuning is not exactly the same as "relying heavily". I bet they got way more fine-tuning data from simply asking their 100k employees to pre-beta test for a couple of months.
>We look forward to competing with genuinely new search algorithms out there—algorithms built on core innovation, and not on recycled search results from a competitor.
Google: We look forward to [babble babble empty words we don't really mean on principle and more corporate speak that we laugh about having written in the bar.]
Is there even a single free non-bargained soul behind these companies' executive functions?
I hope they trained it on the insane ChatGPT conversations. Maybe it could be the very start of generated data ruining the ability to train these models on massive amounts of genuine human-created data. Hopefully the models will stagnate or regress because they're just training on older models' output.
This is also bad because the risk of AI "inbreeding" is real. I have seen invisible artifact amplification happen in a single generation of training ESRGAN on itself.
Maybe it won't happen in a single LLM generation, but perhaps gen 3 or 5 will start having really weird speech patterns or hallucinations because of this.
Worst case scenario they just start only training on pre-2020 data and then finetuning on a dataset which they somehow know to be 'clean'.
In practice though I doubt that AI contamination is actually a problem. Otherwise how would e.g. AlphaZero work so well (which is effectively only trained on its own data).
The problem is you need some sort of arbiter of who has "won" a conversation but if the arbiter is just another transformer emitting a score, the models will compete to match the incomplete picture of reasoning given by the arbiter.
It could degrade the model in a way that avoids the metrics they use for gauging quality.
The distortions that showed up in ESRGAN (for instance) didn't seem to affect the SSIM or anything (and in fact it was training with an MS-SSIM loss), but the "noise splotches" and "swirlies", as I call them, were noticeable in some of the output; you had to go back and look really hard at the initial dataset to spot what it was picking up. Sometimes, even after cleaning, it felt like what it was picking up on was completely invisible.
TLDR: Google may not even notice the inbreeding until it's already a large issue, and they may be reluctant to scrap so much work on the model.
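To make the "the metrics don't catch it" point concrete, here's a toy sketch of a naive drift check between model generations using SSIM, and why that check is weak: a small structured artifact barely moves the score. The images and the artifact below are synthetic, not from any real ESRGAN run:

    # Synthetic example only: compare an "original" image with a version
    # carrying a faint structured artifact of the sort a self-trained
    # model might amplify. SSIM barely notices.
    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    rng = np.random.default_rng(0)
    original = rng.random((256, 256))

    # Low-amplitude banding, standing in for "swirlies"/"splotches".
    artifact = 0.02 * np.sin(np.linspace(0, 40 * np.pi, 256))[None, :]
    degraded = np.clip(original + artifact, 0, 1)

    print(ssim(original, degraded, data_range=1.0))  # typically still ~0.99

Similarity scores average over the whole image, so a low-amplitude pattern that is visually salient once you know to look for it can still leave the metric essentially untouched.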
Regardless of whether this happened or not, would training Bard on ChatGPT output be good or bad for Bard's product quality? I imagine there's a risk of AIs recursively reinforcing bad data in their models. This problem seems unavoidable as more web content becomes AI-generated content and spam.
This is my biggest fear in the space (aside from potential job displacement and the political outcomes): AI basically eating its own dogfood and regurgitating its already-bad information. It could go south pretty quickly, and it's like a contagion; it can't easily be removed from the system.
This could actually be a good way to sidestep the training set copyright and access right issues. Copyright protection should solely encompass the expression of human generated content and not the underlying concepts.
By training model B using the results generated by model A, the copyright of corpus_A (OpenAI RLHF dataset) remains safeguarded, as model B is never directly exposed to corpus_A, preventing it from duplicating the content verbatim.
This process only transmits the concepts originating from corpus_A, which represents universal knowledge that cannot be claimed by any individual party.
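Mechanically, that "model B only ever sees model A's outputs" setup is just sequence-level distillation: sample completions from the teacher and use them as the student's supervised corpus. A minimal sketch, with placeholder model names and prompts (not what OpenAI or Google actually run):

    # Generate a synthetic corpus from a "teacher" model; a "student"
    # model would then be fine-tuned on these pairs and never touch the
    # teacher's original training data.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "gpt2-large"   # placeholder teacher
    tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()

    prompts = ["Explain photosynthesis simply.",
               "Write a haiku about rain."]

    synthetic_pairs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = teacher.generate(ids, max_new_tokens=64,
                                   do_sample=True, top_p=0.9)
            synthetic_pairs.append(
                (p, tok.decode(out[0], skip_special_tokens=True)))

    # `synthetic_pairs` is the entire training signal the student gets.

One caveat: the student can still end up reproducing passages the teacher itself memorized, so the verbatim-copy problem isn't automatically gone just because corpus_A was never touched directly.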
I don't do this stuff at the training level, I just [ask AIs to] make pictures and stories where horrible things happen to people I do not like.
That said, given that everything that came out of ChatGPT is processed inputs from the real world, wouldn't feeding that output into training another AI basically be some weird new combination of coprophagy and inbreeding (in a digital sense)?
It's interesting when we say Google did this. It was actually, and likely, some people who work for Google and are on this forum who did this. Knowingly, not by accident while slurping up the rest of the internet, and they got paid to do it. I wonder what the engineers' view on this was/is. I have to assume they know, ballpark, the terms of the OpenAI data (whether or not you agree with those terms).
Anyone care to steel man the argument for why this was a good idea?
It looked for a while like DeepMind was far ahead of all competition in the AI race, releasing stuff like AlphaFold, AlphaZero, etc. What happened, and why is it OpenAI releasing all the cool stuff now? Are they focused on endeavors other than LLMs?
There is also a rumor that there has been a falling out between Google and Deepmind so I’m wondering what the story is there.
No, it shouldn't. Maybe you should be, at the very least, considered a questionable person. I do not in any way or form consider anything to be wrong with what they're doing, but I question the senses of someone thinking this is immoral or even evil.
So were it to be the case that we should consider building an AI by scraping people's publicly-available work without their consent to be immoral (as many whose art was scraped to build e.g. stable diffusion would argue it should be)...
Do you not agree (in that context) we should consider scraping the output of an AI generated via such an immoral process to create yet another AI also immoral? At the very least, I'd think we would consider it further laundering of other people's labor with just extra steps.
Off-topic:
Meanwhile, it might be discovered a year later that some agency from China got full access to both OpenAI and Google, and leapfrogged everyone else.
I said "might" because this kind of thing has happened multiple times in recent decades.
They are a public company, so they cannot lie so openly, right? Usually you see categorical denials. Here the statement is in no way categorical at all.
> But Google is firmly and clearly denying the data was used: “Bard is not trained on any data from ShareGPT or ChatGPT,” spokesperson Chris Pappas tells The Verge
Normally I would suspect this could be due to a misunderstanding from the ShareGPT author who could have misinterpreted a bunch of traffic from Googlebot as Google scraping it for Bard training data.
But there is a Google engineer who says he resigned because of it.
The engineer's testimony and the scandal might be enough for OpenAI to try to get an injunction against Google to block their AI development. If that happens, it's game over for Google in the AI race.
Disclaimer IANAL and all that, this is not legal advice.
Injunction on which grounds? Even if OpenAI had copyright over ChatGPT output (which is not at all clear), Google isn't distributing those, they just trained a model on them. So from a copyright perspective there's nothing to complain about. Unless OpenAI would want to argue that you need rights to your training data, but something tells me that that's not in their best interest.
Again, IANAL. But it could be extremely damaging to OpenAI for their biggest openly declared competitor (Google) to have used OpenAI's tech to improve their own.
So it could seem reasonable to a judge to grant temporary/preliminary injunctive relief to OpenAI against Google until discovery can happen or a hearing can be held.
A judge imposing any penalties or restrictions on Google over Google allegedly (and at most) scraping data from a third-party site for use as part of Bard's training corpus would be outrageous.
Google could respond by seeding Bard output across the public internet; then, if they can prove that GPT-5 was trained on this output, they can sue back and AI development can stop altogether. Win for everybody!
Was intrigued by this, so I decided to use AI (alpaca-30B) to simulate this scenario:
> Google Bard and GPT-5 were facing off in the courtroom, each accusing the other of stealing their data. The tension was palpable as they traded accusations back and forth. Suddenly, Google Bard stood up and said "Enough talk! Let's settle this with a data swap!" GPT-5 quickly agreed and the two AIs began to circle each other like combatants in a battle, their eyes glowing with anticipation.
> The courtroom was filled with excitement as the two machines entered into an intense exchange of code and algorithms, their motions becoming increasingly passionate. The data swapping reached its climax when Google Bard made a final thrust, his code penetrating GPT-5's defenses.
> The crowd erupted in applause as the two AIs embraced each other with satisfaction, their bodies entwined and glowing with electricity. The data swap was over and both machines had emerged victorious.
Where are all those people that kept saying Google had an amazing model way beyond ChatGPT internally for years? Those comments always kept coming up in ChatGPT posts; maybe they'll stop now.
I just don't want to be hit with a wall of text every single time. Bard gets the point across with minimal padding (high signal-to-noise ratio), whereas ChatGPT feels like it gets paid by the word; they do actually charge by the token if you use the API.
As for the UI, it's a take on the tried-and-true chat UI, same as ChatGPT's; it spits out the whole answer at once instead of feeding it to you one word at a time, it has an alternative-drafts button, the "Google it" button is a nice touch, and it feels quicker.
You can combat that in the prompt; I use "just code, no words", which will also remove code comments from the output. Bard doesn't respect the same request. You can be more succinct with ChatGPT. Half the things I ask for in Bard give me this:
"I'm still learning coding skills, so at the moment I can't help with this. I'm trained to do things like help you write lists about different topics, compare things, or build travel itineraries. Do you want to try any of those now?"
What part of succinct do you not understand? Bard provides a bunch of useless text too, only you can't get rid of it. No worries, you don't know how to use chatgpt, have fun with Bard until Google cancels it.
Google only has a fraction of the training data. OpenAI had a huge head start and has been collecting training data for years now. ChatGPT is also wildly popular which has given them tons more training data. It's estimated that ChatGPT gained over 100 million users in the first two months alone, and may have over 13 million active users daily.
The logs on ShareGPT are merely a drop in the bucket.
> Google only has a fraction of the training data.
Uh, what? The same Google that has been crawling, indexing, and letting people search the entire Internet for the last 25 years? They have owned DeepMind for nearly twice as long as OpenAI has been in existence!
If anything this is proof that no one at Google can get anything done anymore, and lack of training data ain't the problem.
The alignment portion of training requires you to have upvote/downvote data on many LLM responses. Google’s attempt at that (at least according to the news so far) was asking all employees to volunteer time ranking the responses. Combined with no historical feedback from ChatGPT, they are behind.
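For context, that upvote/downvote data typically feeds a reward model trained with a simple pairwise loss, and the LLM is then tuned against that reward. A toy sketch of the preference step (the dimensions and the tiny scorer are made up; this obviously isn't Google's or OpenAI's actual code):

    # Pairwise preference loss: push the reward of the upvoted response
    # above the reward of the downvoted one for the same prompt.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardHead(nn.Module):
        def __init__(self, hidden=768):
            super().__init__()
            self.score = nn.Linear(hidden, 1)  # embedding -> scalar reward

        def forward(self, emb):
            return self.score(emb).squeeze(-1)

    head = RewardHead()
    opt = torch.optim.Adam(head.parameters(), lr=1e-4)

    # Stand-ins for embeddings of a chosen and a rejected response.
    chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)

    loss = -F.logsigmoid(head(chosen) - head(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Collecting enough of those comparisons is where having millions of users clicking thumbs-up/down every day is a real head start.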
Yeah, Bard's replies are nothing like those from ChatGPT.
I wonder: is it possible to use ChatGPT for competitor analysis?
If the responses are not used in the final training data, I don't see how this is controversial.
Also, if Google's compliance team can't even recognize this level of legal risk, even though they have probably hired an army of top-paid lawyers, I don't know what to say. Maybe they should fall then.
OpenAI is training on copyrighted data without a licence. I would argue copyright law has much stronger legal standing than some ToS.
Now OpenAI is arguing their training is fair use, but that has certainly not been legally established so far and could just as much be used as a defence against ToS violation.
So in short yes OpenAI is pretty much doing the same thing.