Something caught my eye when I read the pdf. They are using GPT-4 to name their word clusters: "Topic names are generated by GPT-4 conditioned on the top 20 words for each topic, prompted by a request for a short 1-2 word summary."
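For anyone curious what that looks like in practice, here's a minimal sketch of how such a prompt might be issued (my guess, not the paper's exact prompt; it assumes the OpenAI Python client with an API key in the environment, and uses the example word list from the paper):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Example top words for one topic (the paper's 'celebrations' example)
    top_words = ["fun", "wedding", "beautiful", "christmas", "happy",
                 "card", "birthday", "gift", "blog", "perfect"]

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Here are the top words for a topic: "
                       + ", ".join(top_words)
                       + ". Give a short 1-2 word name for this topic.",
        }],
    )

    print(response.choices[0].message.content)  # e.g. "Celebrations"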
I feel like LLMs will increasingly be used for this purpose: whenever someone has to make a mostly arbitrary decision, the kind that tends to spark bike-shedding arguments, they can delegate it to the LLM. The point isn't that the LLM is necessarily better or more objective, but rather that it serves as a kind of Schelling point for agreeing on things that don't matter.
For example, I could imagine a super annoying peer reviewer (or even co-author) asking why you used 'celebrations' as the topic name associated with the word set 'fun, wedding, beautiful, christmas, happy, card, birthday, gift, blog, perfect', as in the pdf, and not some other word. Instead of having a meaningless back-and-forth bike-shedding discussion with the reviewer over email, you can just say you used GPT-4 and move on to more important things.
All this data is taken from Common Crawl, a web crawler that clones the web.
How can companies such as OpenAI use that data when no licenses can be identified for most web pages or the associated images? In other words, the user/owner of the content has not agreed for it to be used in these ways.
What are people's experiences with training models on such data?
Is there any way to identify whether my photos appear in Common Crawl or are used in such datasets?
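For URLs, at least, Common Crawl publishes a queryable index for each crawl, so you can check whether pages from a domain you control were captured. A rough sketch (the crawl name below is just one example; the current list is at https://index.commoncrawl.org/):

    import requests

    # Query the Common Crawl CDX index for captures from a given domain.
    # "CC-MAIN-2023-14" is one example crawl; each crawl has its own index.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"

    resp = requests.get(INDEX, params={
        "url": "yourdomain.com/*",  # replace with your own domain
        "output": "json",
    })

    for line in resp.text.splitlines():
        print(line)  # each line is a JSON record: url, timestamp, mime, status, ...

Note this only tells you whether a page was crawled; it doesn't tell you whether a particular image from that page ended up in a derived dataset like mmc4 or LAION.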
Copyright controls the right to copy, not any and all use. Just because somebody holds the copyright on something, it doesn’t mean they can dictate how it is used. “The copyright holder has not agreed for it to be used in these ways” is irrelevant. “The copyright holder has not agreed for it to be copied in these ways” is what matters.
Analysing an image and adjusting weights in relation to that image is not making a copy. The images aren’t being copied into the model.
You could argue that downloading the images in the first place to make that analysis constitutes copying, but these kinds of incidental copying aren’t normally considered within the bounds of copyright. If they were, you’d be committing copyright infringement every time you surfed the web.
You make a good point. I can browse the web, and while I do I see pictures which get "copied" into my memory, somehow. However, somebody looking at my brain with a microscope probably could not see those images in my brain; they are not localized that way. So copying something into my brain is not copyright infringement, because I don't copy the image, I only allow the image to have some kind of effect on my brain. That is not copying, I would argue; it is "experiencing".
AI is a big brain which consumes the web almost like humans do. It "sees" the pictures on the web when it adds their characteristics to its associative memory. The AI then generates images. But unless it comes up with a clear copy of somebody else's work, it is not copyright infringement.
I am not a lawyer so take this with a grain of salt.
Where it gets fiddly is if the model is capable of spitting out close reproductions of images from its training set. That's been demonstrated on some models. I don't have a particularly good intuition as to which way this would go, but I can see someone making an argument that copying the model is therefore copying all the training set images it could be convinced to make a reproduction of. That's unlikely to be everything in the training set, but by the same token I don't know how you could determine how much of the training set can be reproduced that way.
I believe there is an argument to be made that it is a derivative work. If you reproduce a copyrighted book verbatim, that's infringement, but if you summarize it, that's fair use. I don't think much of this has gone through the courts yet, so there is a lot of legal peril and speculation; obviously this is all super hot and pushing the boundaries, so I'm sure we'll get clarification sooner rather than later.
This training corpus mmc4 is the one used by OpenFlamingo (https://laion.ai/blog/open-flamingo/).