MMC4: An open, billion-scale corpus of images interleaved with text (github.com/allenai)
132 points by tim_sw on April 18, 2023 | 20 comments



"We introduce mmc4, a corpus of 585M images interleaved in 43B English tokens from the popular c4 dataset."

This training corpus mmc4 is the one used by OpenFlamingo (https://laion.ai/blog/open-flamingo/).
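For anyone who wants to poke at the data: as far as I can tell it ships as JSON-lines shards, one web document per line, with the sentence list and the image-to-sentence alignments side by side. A rough sketch of walking one shard (field and file names are from my reading of the repo and may not match exactly, so check the README):

    # Rough sketch of reading one mmc4 shard. Field names ("text_list",
    # "image_info", "matched_text_index", "raw_url") are assumptions based on
    # my reading of the repo and may differ from the released format.
    import json

    with open("docs_shard_0_v2.jsonl") as f:        # example shard filename
        for line in f:
            doc = json.loads(line)
            sentences = doc["text_list"]            # the document's text, sentence-split
            for img in doc["image_info"]:           # one entry per interleaved image
                idx = img["matched_text_index"]     # sentence the image was aligned to
                print(img["raw_url"], "->", sentences[idx])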

Something caught my eye when I read the pdf. They are using GPT-4 to name their word clusters: "Topic names are generated by GPT-4 conditioned on the top 20 words for each topic, prompted by a request for a short 1-2 word summary."

I feel like LLMs will increasingly be used for this purpose: whenever someone has to make a mostly arbitrary decision of the kind that tends to spark 'bike shedding' arguments, they can delegate it to the LLM. The point isn't that the LLM is necessarily better or more objective, but rather that it's a kind of Schelling point for agreeing on things that don't matter.

For example, I could imagine a super annoying peer reviewer (or even co-author) asking why you used the name 'celebrations' as the topic name associated with the set of words 'fun, wedding, beautiful, christmas, happy, card, birthday, gift, blog, perfect', like in the pdf, and why not some other word. Instead of having a meaningless back-and-forth 'bike shedding' discussion with the reviewer over email, you can just say you used GPT-4 and move on to more important things.
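If you wanted to reproduce that naming step yourself, it's roughly a one-call sketch. The prompt wording below is my guess, not the paper's, and it assumes the pre-1.0 openai Python package with OPENAI_API_KEY set in the environment:

    # Minimal sketch of the topic-naming step: hand GPT-4 a cluster's top words
    # (here, the ten from the example above) and ask for a 1-2 word label.
    import openai

    top_words = ["fun", "wedding", "beautiful", "christmas", "happy",
                 "card", "birthday", "gift", "blog", "perfect"]

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Top words for a topic cluster: " + ", ".join(top_words)
                       + ". Give a short 1-2 word name for this topic.",
        }],
        temperature=0,
    )
    print(resp["choices"][0]["message"]["content"])  # e.g. "Celebrations"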


Great! Now the reviewer will say you should've used GPT4.2, the version released on 29th August, not the version released on 22nd August.


Lol, that's what a reviewer running on Llama13b would say!


calls out to nearby microphone "hey bikeshed, what's the best llm and revision for this debate?"


But then how can their license allow commercial use?


All this data is taken from Common Crawl, a project that crawls and archives large portions of the public web.

How can companies such as OpenAI use that data when no licenses can be identified for most web pages or the associated images? In other words, the user/owner of the content has not agreed for it to be used in these ways.

What are people's experiences with training models on such data?

Is there any way to identify if my photos appear in common crawl or are used in such datasets?
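On that last question: Common Crawl publishes a public URL index you can query by URL pattern, which at least tells you whether pages from your site were captured. It won't tell you whether a particular photo ended up in a derived dataset like mmc4. A rough sketch (the crawl ID is just an example):

    # Query Common Crawl's public URL index for captures matching a URL pattern.
    # "CC-MAIN-2023-14" is an example crawl ID; swap in whichever crawl you want.
    import json
    import requests

    endpoint = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"
    params = {"url": "example.com/photos/*", "output": "json"}
    resp = requests.get(endpoint, params=params, timeout=30)
    if resp.ok:
        for line in resp.text.splitlines():
            record = json.loads(line)
            print(record["url"], record["timestamp"])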


Copyright controls the right to copy, not any and all use. Just because somebody holds the copyright on something, it doesn’t mean they can dictate how it is used. “The copyright holder has not agreed for it to be used in these ways” is irrelevant. “The copyright holder has not agreed for it to be copied in these ways” is what matters.

Analysing an image and adjusting weights in relation to that image is not making a copy. The images aren’t being copied into the model.

You could argue that downloading the images in the first place to do that analysis constitutes copying, but that kind of incidental copying isn't normally considered within the bounds of copyright. If it were, you'd be committing copyright infringement every time you surfed the web.


You make a good point. I can browse the web, and while I do I see pictures which get "copied" into my memory, somehow. However, somebody looking at my brain with a microscope probably could not find those images in my brain; they are not localized that way. So copying something into my brain is not copyright infringement, because I don't copy the image, I only allow the image to have some kind of effect on my brain. That is not copying, I would argue; it is "experiencing".

AI is a big brain which consumes the web almost like humans do. It "sees" the pictures on the web when it adds their characteristics to its associative memory. The AI then generates images. But unless it comes up with a clear copy of somebody else's work, it is not copyright infringement.

I am not a lawyer so take this with a grain of salt.


Copying the image into your brain isn't what's material here. It's copying it into your browser over the network.


Right, but that is allowed: you can browse the web, which means the content has to be copied to your computer.


I don't think this is relevant.


> The images aren’t being copied into the model.

Except where they kind of are.

Where it gets fiddly is if the model is capable of spitting out close reproductions of images from its training set. That's been demonstrated on some models. I don't have a particularly good intuition as to which way this would go, but I can see someone making an argument that copying the model is therefore copying all the training set images it could be convinced to make a reproduction of. That's unlikely to be everything in the training set, but by the same token I don't know how you could determine how much of the training set can be reproduced that way.


I believe there is an argument to be made that it is a derivative work. If you reproduce a copyrighted book verbatim, that's infringement, but if you summarize it, that's fair use. Almost none of this has gone through the courts, so there is a lot of legal peril and speculation; obviously this is all super hot and pushing the boundaries, so I'm sure we'll get clarification sooner rather than later.


I also get inspiration from the internet.

Am I not allowed to do this either?

I download an image, save it, and then use it, either as inspiration in a collage or as inspiration when drawing something new.

I don't think there have been many people who have really invented something uniquely new in a long while.


Same way Copilot uses GitHub code and Midjourney uses scraped art images.


There's the robots.txt standard, which you can use to opt out of crawling.
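Common Crawl's crawler identifies itself as "CCBot", so a robots.txt rule for that user agent is the usual opt-out. A quick sketch of checking what a site's robots.txt allows, using only the standard library:

    # Check whether Common Crawl's crawler (user agent "CCBot") would be allowed
    # to fetch a page under a site's robots.txt. This is purely advisory --
    # nothing forces a crawler to honour it.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("CCBot", "https://example.com/photos/me.jpg"))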


robots.txt is not a law; it's a convention Google pushed. No one has to obey robots.txt, and crawlers will often use it to find the things you've forbidden them to index.


robots.txt was around before Google


Keep in mind that humans may also be learning from your work and emulating aspects of it.


What's the best source for a directory/list of these training data corpora?



