Hacker News
New data poisoning tool lets artists fight back against generative AI (technologyreview.com)
93 points by lawlessone on Oct 23, 2023 | 86 comments



It's not new and it doesn't work.

Hopefully they don't try to sell it to artists, because that would be literally a scam.


>It works in a similar way to Nightshade: by changing the pixels of images in subtle ways that are invisible to the human eye but manipulate machine-learning models to interpret the image as something different from what it actually shows.

>Nightshade exploits a security vulnerability in generative AI models, one arising from the fact that they are trained on vast amounts of data—in this case, images that have been hoovered from the internet. Nightshade messes with those images.

So, this will not work at all as soon as the vulnerability is patched? Adversarial attacks are super cool, but they flat out do not generalize afaik. If there's been some breakthrough I'd love to hear about it, but the idea doesn't pass the sniff test to me.
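For the skeptical, the core mechanism is easy to demo in a toy setting. Here is a minimal FGSM-style sketch against an invented linear classifier (the weights, margin, and epsilon are all made up for illustration), which also shows why the perturbation is model-specific:

```python
import numpy as np

# Toy gradient-sign adversarial example against a fixed linear "model".
rng = np.random.default_rng(0)
w = rng.normal(size=100)                 # frozen classifier weights
x = rng.normal(size=100)
x -= w * (w @ x) / (w @ w)               # project x onto the decision boundary
x += 0.5 * w / np.linalg.norm(w)         # nudge to a small positive margin

eps = 0.1                                # tiny per-"pixel" change
x_adv = x - eps * np.sign(w)             # gradient-sign perturbation

# The clean input sits on the +1 side; the perturbed one flips to -1.
# Crucially, x_adv is tailored to *this* w: a retrained or patched model
# generally won't be fooled by the same perturbation.
clean_score, adv_score = w @ x, w @ x_adv
```

The last comment is the sniff-test point: the attack transfers poorly once the target model changes.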


Yes, pretty sure that is how it will always be with such attacks. Cat-and-mouse forever.


It does put these companies in an awkward position though. If you're implementing defenses against this, it's conceding not all your sources of data consented, or at least that you're not 100% on their provenance.


Not really. "We're just protecting ourselves from the 4chan trolls"


More importantly, you can detect this and filter out those images pretty easily - which means you have implicit consent from the users of all other images because they chose not to protect their images technically.
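One hypothetical way such a filter could work (the `features` encoder and the threshold below are stand-ins, not any real pipeline): adversarial noise is fragile under mild blurring, so a large feature drift after a blur is a tell.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter via shifted sums; img is a 2-D float array."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def features(img):
    return img.flatten()  # stand-in; a real filter would use a CNN encoder

def looks_poisoned(img, threshold=0.5):
    # Flag images whose features drift sharply under a mild blur.
    a, b = features(img), features(box_blur(img))
    drift = np.linalg.norm(a - b) / (np.linalg.norm(a) + 1e-9)
    return drift > threshold
```

With a real encoder the same shape of check applies: compare the image against a lightly degraded copy and flag disagreement.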


That’s not how the law works.

You can’t take things from a house because they didn’t lock the front door.


You can absolutely take things from a house with an unlocked front door, if there's no jail time and the fine is one [insert local unit of currency here] and that theft doesn't hurt your reputation. That's how AI works right now:

"No one sued us before we made it to market with our models trained on stolen property, and now we're entrenched with a billion dollars of legal defense funds and budgeted for the fines in case we end up convicted of robbing you — and now our customers pay us to externalize and disregard the robberies we committed to reduce their effort, so they have the moral high ground and we wallow in our wealth."


>"No one sued us before we made it to market with our models trained on stolen property, and now we're entrenched with a billion dollars of legal defense funds and budgeted for the fines in case we end up convicted of robbing you — and now our customers pay us to externalize and disregard the robberies we committed to reduce their effort, so they have the moral high ground and we wallow in our wealth."

What about "all art is derivative to some extent, and training a machine artist based on existing art is no different than training a human artist based on existing art"?


Then the Stable Diffusion model should hold the copyright. And using generative models doesn't make you any more of an artist than commissioning a human does.


That is for the courts and the lawmakers to decide, and in any case I don’t plan on convincing anyone to change their mind about ML training ethics.


It can be how the law works; like in Field v. Google Inc. for hosting a cached copy of a site: "Google reasonably interpreted absence of meta-tags as permission to present "Cached" links to the pages of Field's site".

I think Fair Use is the stronger defense for model training, but - for crawlers that obey robots.txt/etc. - implied license isn't totally off the table.


That's a pretty poor analogy. To make it a bit better, it's also like we're figuring out what this whole "house" thing is, and there's lots of differing opinions, and also lots of people who think this whole "house" thing is a fad.


This is a great point, it doesn’t have to be a perfect technical countermeasure.

Being able to prove opt-out using a watermark tech like this one, maybe even encoding provenance, could make for very interesting legal battles.


>Being able to prove opt-out using a watermark tech like this one, maybe even encoding provenance, could make for very interesting legal battles.

That's... not how copyright works? If you're infringing on copyright, "the author didn't opt out" isn't a valid defense. The valid defenses, like fair use, cannot be opted out of by authors. In other words, I don't really see any circumstance where opt-outs are a valid factor.


Even that isn't how copyright works....

If I never look at your work and make a work that looks like your work I've still violated copyright.

If I look at all your works and make something that looks kinda similar but not close enough, most likely I've not violated your copyright. You cannot both put your work in public and then yell at people|things not to look at it.


it's also likely to be a DRM circumvention device under the DMCA


I don't think watermarking counts as an effective technological countermeasure.


it's designed to break/degrade use contrary to the license, not simply point back to it

a la Macrovision (circumvention of which was ruled to be unlawful under the DMCA)


Sounds like bullshit, to be frank. Training data already goes through a number of "subtle augmentations" and instead of "breaking" the resulting model, the augmentations help the model to generalize better.
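For reference, a sketch of the kind of augmentations meant here (illustrative, not any specific training pipeline); each one perturbs pixels far more than a "subtle" poison does:

```python
import numpy as np

def augment(img, rng):
    """Apply random flip, pixel jitter, and a small shift to a [0,1] image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                               # horizontal flip
    img = img + rng.normal(scale=0.02, size=img.shape)   # pixel jitter
    img = np.roll(img, rng.integers(-2, 3), axis=0)      # small translation
    return np.clip(img, 0.0, 1.0)
```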


A little pointless, considering that Stable Diffusion is already at or near human levels. Worst case scenario, just train on pre-2022 images.



Stable Diffusion is capable of training on its own images


Is that advisable? It won’t learn anything new and might reinforce its errors.


No, it isn't, if you don't curate meticulously. Perhaps by accident something new could emerge which is worth using as input for an advanced model.

But the rule is that new concepts are very hard to produce, although thanks to countless models, Stable Diffusion is probably the most flexible approach by quite some margin.


It probably won't learn anything new, but it will learn not to generate bad images if you cherry pick the best ones for training.


You would train it in an RL setting rather than actually use generated images in the training set.


Sure


Likely they have humans review the images (or even touch them up). Same thing with the dataset scraped from the internet.


Yes, that's the technique. Generate 10 images, then choose the ones that turned out well for the next round. That's the standard way to create a LoRA.
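That loop can be sketched in a few lines; `generate` and `score` below are hypothetical stand-ins for a diffusion sampler and a human (or aesthetic-model) rating:

```python
import random

def generate(prompt, n=10):
    # Stand-in for sampling n images from a diffusion model.
    return [f"{prompt}-sample-{i}" for i in range(n)]

def score(image):
    return random.random()          # stand-in for human preference

def curate_round(prompt, keep=3):
    candidates = generate(prompt)
    candidates.sort(key=score, reverse=True)
    return candidates[:keep]        # winners feed the next training round
```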


when will artists accept genai as a tool not a competitor? all artists with genai tools easily outclass a non artist fiddling around


My artist cousin uses AI to be able to take more customers. He's very happy with it. Just go back to the 70s and read all the discussions about the jobs killed by computers. Before computers, more artists were needed for the same output, while now artists can do things that could only be dreamed a generation ago.

Also, when artists finally get compensated for training, expect Spotify-style cents. Artists won't be able to stop this, they are just not strong enough as a group, for better or worse.


I clearly recall the arts world of the 1970s. Manual arts, manual publishing tasks, and the business around them employed many orders of magnitude more people in many large metro areas. Fast forward to the post-Lucasfilm digital era: the size of the largest media companies magnified 1000x, while sign shops closed and local typesetting and design firms had to go digital or die. Fast forward again to post-COVID, and I believe you would be lucky to find a few of the earlier activities at all, in any city. Meanwhile games employ thousands of anonymous artists remotely, and very, very few individuals run their own business with any stability.

Digital art has been a magnifier for the largest companies, and the literal death of thousands of stable small businesses, in fifty years.


I think artists on the low end might suffer. I have commissioned some images and models for some hobby projects, even while I paint and model myself. Probably would use AI to generate images today.


Innovation tends to wipe out the low end everywhere. At the same time, more people can generate their own images without having to find an artist. In practice, images that wouldn't be created because of costs will now be created.


True. Image generation is a godsend for rapid prototyping. Although by "low end" I meant it in a financial sense. Skill is often not enough to be financially successful as an artist; there is luck and opportunity involved. So we could still lose something here.

Also, AI is still atrocious at coming up with any new concept it wasn't meticulously trained on. To me this is still just an algorithm, as intelligence would suggest something different. But it is nonetheless an impressive and mighty tool.


GenAI really is a competitor for a lot of commercial illustration and photography. It isn't a complete substitute, but there are lots of situations where a layman fiddling around with a GenAI will get a perfectly acceptable result in a couple of minutes. If it doesn't already exist, I can easily foresee that online publishing tools will have integrated GenAI features to autogenerate an image from a caption, or an illustration based on the text of an article.

Some artists may well benefit from being able to produce more work more quickly and be able to capitalise on their ability to use GenAI effectively, but I expect that many won't; they are likely to see their incomes decline, or lose their livelihoods altogether.

That isn't an argument for stigmatising or banning GenAI, but we do need to recognise that it's a real problem for the people affected.


I mean, it took many years for Photoshop to be accepted. Change takes time, doubly so if it's seen as a threat to one's livelihood.


> when will artists accept genai as a tool not a competitor?

When non-artists stop using it to compete with them?


This is a very conservative and limiting definition of “artist.” I was always told that anyone can be an artist, even if their work isn’t classically “good.” Now the ladder has been pulled up by formerly inclusive artists. That’s too bad.


OK, to explain. I used to commission artwork to illustrate articles in the print media. We used to pay well and commission about 5 images a month.

These days we’d probably have the office intern tinker with some prompt writing for most of those.

Those people have just lost their living.


Same with people that made buggy whips for horses.


Did you have a bad experience in art class growing up? Equating the two smacks of resentment.


They're literally equal as classes for this subject matter.


And that’s fair enough. The original commenter asked ‘when artists would stop treating AI as competition’.

You are just reinforcing the fact that AI is competition.


It will be a while. There is a large group of people who don’t willingly use new tools when those tools enable all comers to make better, “competing” work product.

That will progress a funeral at a time until the stigma is gone and you have a critical mass of people really pushing the art of the possible. Lots of the SAG-AFTRA protest is based on this reasoning.


The SAG and WGA protests have almost nothing to do with this. The AI issues they're worried about are either being replaced by AI versions of their likenesses (without pay) or being paid less and remaining uncredited because AI was used to write/edit (a portion of) a script, respectively.


>or being paid less and remaining uncredited because AI was used to write/edit (a portion of) a script, respectively.

Here are the demands from the WGA:

https://www.wga.org/uploadedfiles/members/member_info/contra...

Their proposals were:

>Regulate use of artificial intelligence on MBA-covered projects: AI can’t write or rewrite literary material; can’t be used as source material; and MBA-covered material can’t be used to train AI

I agree some aspects are about writers protecting themselves from being replaced by AI, but the outright ban on use of them (ie. "can’t write or rewrite literary material") and prohibiting them from being trained seems consistent with the parent's claim of "There is a large group of people who don’t willingly use new tools when those tools enable all comers to make better, “competing” work product"


>when will artists accept genai as a tool not a competitor? all artists with genai tools easily outclass a non artist fiddling around

If that AI is being trained to emulate their work, using their work to do so, without asking their permission to use their work, then it's not unreasonable for them to see it as a competitor.

They don't have the same army of lawyers film studios have though, so it looks like SA etc got away with the heist.

And (questions of efficacy aside) they're perfectly within their rights to add hidden data to their images to stop them being scraped and used to train a competitor.


> all artists with genai tools easily outclass a non artist fiddling around

Personally, I find today's music utter shit. Must be an age thing.


You will be much happier after you concede the fact that the music you grew up listening to isn’t better, you just grew up listening to it. It is ok to say you got older, stopped trying to discover new music, and would rather listen to the old songs you know you like.

But it is such a tired trope to say all new music is shit. Especially when Spotify, Soundcloud, and other services have made it far easier for new artists to get their work out there and for listeners to discover them. You also now have access to the entire world of music, and not just music in your own language or from a similar region.

You will also be more pleasant to be around when people think of you as someone with a preference but is open to hearing new things.


> music you grew up listening to isn’t better, you just grew up listening to it

Well, you also don't remember all the bad songs, just the good ones.


I find it very amusing that people were outraged about "WAP" and considered it a new low for music.

The songs popular when I was a child and a teenager:

"Baby Got Back"

"Kim"

"The Bad Touch"


> "Baby Got Back"

If you remember the maniacal focus on thinness in the 2000s, this was actually an important and supportive song.

(Also, he's a hobbyist electrical engineer now.)


I think I was too young at the time to get that. An interesting point.


Going one generation deeper, "Timothy" by The Buoys.

Or back to the 20s and 30s, Lucille Bogan's "Shave 'Em Dry" or any number of options from the dirty blues.


You're not wrong. When I say shit, I mean it's all created unnaturally. I'm not saying don't let computers aid music design; to really obtain the first sample you need the actual instruments.

I have yet to hear an official AI rock band play guitar equal to the skill that was once produced.

Manufactured vs Independent


Have you heard any rock bands? I don't think there are any countries left where rock is popular except Japan.

There is still Christian rock in the US, and sometimes people just don't notice they're listening to it.


Some of the music from other decades is certainly better than some of what's getting put out today. Growing up, I often preferred (much) older artists that were still known to whatever was bubbling up on the Billboard charts.


Wonder how it plays with https://en.wikipedia.org/wiki/Anti-circumvention

From my pov this is circumvention, but can we turn the tables and say genai trainers are the circumventors?


In this case what is being circumvented is precisely what the CFAA defines as malicious code or information transmitted to cause intentional damage to a protected computer system, so I doubt the courts will be too upset about effectively circumventing the mechanisms of a malicious virus.

> 5. (A) knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;

You can file whatever copyright suit you want, you can’t go damage someone else’s data just because you’re mad. Just like a contractor has to get a lien and not just go smash up their work because they didn’t get paid.


that's a very good point

macrovision was a DRM technique applied to DVDs

it worked by very slightly altering the signal such that humans wouldn't notice, but prevented VHS recorders from making a decent copy

macrovision removal devices were ruled as circumvention devices and hence illegal in both the US and EU

this poisoning approach is essentially the same idea as macrovision, with pretty similar intent

so I don't see why systematically removing it wouldn't count as circumvention


Uh oh, you have a circumvention device right now then.... the print screen button.


>Uh oh, you have a circumvention device right now then.... the print screen button

A lot of the artists on DA and other places that are annoyed with their work being used to train AI work by taking a commission to create an image and then sharing it online for everyone to enjoy.

Their objection to it being used to train AI is more nuanced than people think.


Not really. Their argument is "Hey, I looked at everyones art all my life and used that to build my internal model, but it's time for me to pull up the ladder behind me"

Copyright is not about reading your art, it's about the creation of subsequent works that are some arbitrary measure of 'too close' to the work other people already created.


The people using generative models didn't develop their internal model though?


Seems like this will take 5 seconds to work around and never catch on.


Img2dataset and other tools already handle this through hash verification. But yes, Kudurru is a terrible initiative. I think browsers and search engines should ban any website using it.


I think my browser should display the sites I tell it to, thank you very much.


I am not sure how hashing would help here? A hash would only be useful to see if someone tampered with data you already have and tried to swap your stored images for poisoned images.

This is intended to deter scraping of images.


I'm talking about being unable to switch out existing images from LAION-5B, for example, with glazed versions to poison the dataset.
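A minimal sketch of that kind of hash check (the image names and bytes here are invented): the dataset index stores a digest per image, so a swapped-in glazed file no longer verifies.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Index built when the dataset was first assembled.
index = {"img_001": digest(b"original image bytes")}

def verify(name: str, data: bytes) -> bool:
    # A re-downloaded or swapped file must match the recorded digest.
    return index.get(name) == digest(data)
```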


So why should browser or search engines ban someone from using this tech on their images?


All these tools (and tools to confirm legitimacy) are defeated by screenshots.

This is our new world, accept the chaos.


Kid named “knowingly causing the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causing damage without authorization, to a protected computer”:


Fight fire with fire? I like the idea.


I will happily run this on several hundred highly starred GitHub projects if someone ports the approach to code...

(of course if Microsoft let me opt out I'd have no reason to)


The solution is not publicly sharing your code if you do not want it being used.

Yes, you can spin up a whole debate about monetization etc., but generative AI is the future of development too and will make it more accessible in a way that we can hardly imagine right now. So just stop with these arbitrary red lines around voluntarily published sources.


It's not about code generation, amigo. This is for images.


I think what he meant by "port" was to create analogous software for code.


Code generators are autoregressive (code->code) not labeled (text->image) so the attack wouldn't work. Also, you can actually tell if a code model works or not while you're training it, by running tests on the output.

You could put up a lot of misleading code where the comments are wrong or there's bugs in it… seems bad for obvious reasons though.


Wasn't there an attack a while ago that used hidden characters to break compilers? I'll see if I can dig it up. Maybe something like that could be used for GitHub. You'd have to have a pre-commit and pull hook that would encode/decode the malicious characters.
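The half-remembered attack is probably "Trojan Source", which abused Unicode bidirectional-override codepoints so reviewed code renders differently from what the compiler sees. A simpler cousin, sketched here, hides zero-width characters inside identifiers or strings:

```python
# Two strings that can render identically in many editors/fonts,
# yet compare unequal because one hides a zero-width space.
visible = "admin"
sneaky = "ad\u200bmin"   # U+200B ZERO WIDTH SPACE hidden inside

same = (visible == sneaky)              # False
lengths = (len(visible), len(sneaky))   # (5, 6)
```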



Wouldn't you just need to write a bunch of bad code?


applying it to good (highly rated) code would do more damage to the model

the behaviour would be identical for the end user, but very different for those stealing, sorry, "training AI models" from it


> Stealing

Are you old enough to remember "You wouldn't download a car"?


Did you read "the right to read" and take it as a training manual and not a warning?



