Yet another bot that completely ignores the "429 Too Many Requests" response status and happily continues hammering your tiny little side project [1] to death. Luckily, I already block the IP address they're using, as it has been used for (other?) malicious bots before.
[1] In my case, it relies on third-party APIs that are heavily rate limited. Any bot ignoring rate limitation measures will effectively (D)DOS my service.
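For contrast, here is a rough sketch of what a well-behaved client could do when it sees a 429: pause for the time given in the `Retry-After` header, which may be either a number of seconds or an HTTP date. The function name `backoff_seconds` and the 60-second fallback are assumptions made for illustration, not anything a particular crawler is known to implement.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def backoff_seconds(status: int, headers: Mapping[str, str],
                    now: Optional[datetime] = None,
                    default: float = 60.0) -> float:
    """Return how long a polite client should pause before its next request.

    Hypothetical helper: `default` is an assumed fallback for a 429 that
    carries no usable Retry-After header.
    """
    if status != 429:
        return 0.0
    # Header names are case-insensitive, so compare in lowercase.
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds.
        return float(value)
    try:
        # Form 2: an HTTP date.
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A crawler loop would sleep for the returned number of seconds before retrying, instead of immediately sending the next request.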
One option is to completely ban OpenAI's crawler IP addresses. They steal content without credit anyway - as most AI companies do - so there's no benefit in allowing them access.
Why? If it helps people it should be good. Why bother posting something on the public web if not to help people?
Sure, a large org is receiving some ancillary benefit, but do you feel the same hostility for people working at [large corp] using what you worked on to help them at work?
I honestly don't understand the hostility towards LLMs using public data.
This is like asking why someone doesn't want to do free work for Oracle's database offerings. I mean, why not try to make things better?
Well, because a lot of corporations couldn't care less about the public good and are happy to cause harm if it makes them more money. OpenAI doesn't care about your welfare or mine any more than a sleazy ad company or spyware product does.
If OpenAI were actually an open source company working to benefit the broader ecosystem I would agree with you, but that's about as far as possible from the current state.
One of the reasons is that the company can later close up the effort, completely destroying its potential to help anyone in the future.
But at the end of the day, I understand that altruism doesn't work this way. But this just means that while I have some tendencies, I'm not altruistic after all. I attach a lot of feelings to where my work ends up and how it affects things, which is, for example, why I like "sticky" licenses like the GPL, and tend toward efforts like Effective Altruism, however ineffective I think they end up being.
>I honestly don't understand the hostility towards llms using public data
So, getting back to the topic, feelings are attached to where the publications end up and how they affect things. Because of the unintended consequence of companies training AI on publicly available data, people harboring these feelings feel like their thing has been taken from them without their consent. And that is a bad feeling, one of powerlessness and inability, and one way of coping with it is to direct the feeling outward, whereby it becomes active defense, or hostility.
Don't understand or don't agree? Because it's really very simple to understand.
Generally people need some kind of incentive to produce content. This could be just the thought of somebody, an actual human, having consumed your content. Or a like, a comment exchange that further enriches the topic. Perhaps it leads to a new follower or even a new (online) friend. A job opportunity. Even a date. Or maybe just plain ad impressions to make your effort worthwhile.
The picture of content production was already bleak. Google gets to take it all for free and is the traffic controller deciding who gets the crumbs, and even then is also the sole advertiser. But at least they might throw you some traffic, leading to all the interactions I just mentioned.
OpenAI just steals your shit without permission, credit or payment and completely cuts off any direct human interaction with the original content or its maker.
How can you not "understand" the hostility? This is existential not just for the open web, also the closed web. Have you missed the developments at Twitter, StackOverflow, Reddit?
That's just a win-win situation, you're using their services for free because it helps you, they use your interaction to improve the model; the model is still free to use.
There's no win-win situation. My content is stolen and given to others. I've lost. Google paid me for traffic via ads, therefore I allowed Google to ingest my content. You as a person could read it. I've never given you permission to resell it, and if you did, I'd come after you to pay royalties. The same must apply to OpenAI and other leeches.
Physical property is either borrowed, owned, sold, and so on.
If your spouse takes your car to work without your knowledge it's borrowed.
If they take it and sell it without consent it's theft.
Same applies to data. But data is electrons and as such it can't be moved, it is "copied". So technically speaking you are right, but practically you are not. If you steal NBC's prerelease movie then that's theft. As is copying it without consent. Once you pay for it you can copy it from their servers to your device. But you can't copy it to someone else's machine.
> If you steal NBC's prerelease movie then that's theft.
No. Advocates of expanded IP law have attempted to spread the idea that copyright infringement is "theft" as it adds emotional weight to their arguments. "You wouldn't download a car" etc. Same for the use of the word "piracy" - borrow an emotionally laden term from another context and hope nobody notices the sleight of hand.
And it's important that we reject this definition because it distorts the reality of the situation.
> And it's important that we reject this definition because it distorts the reality of the situation.
Depends whose reality. A content creator's reality is that their content is indeed stolen and monetised by someone without permission.
"Advocates of expanded IP law" do appear to be in the right, at least by law. Copying and distributing digital products is treated more or less as theft, particularly when done at scale.
AI and current training practices are even worse than stealing someone's work. It steals someone's identity. AI can copy unique characteristics, not just individual content to reproduce identical content. It can replicate a person's unique style without consent, and that's uniquely dangerous.
On a trivial level this is correct as words mean what we collectively decide they mean.
However, I am making the point that a) the meaning has been changed and b) it has changed in a way that is deceptive and masks a useful fact about the world.
> On a trivial level this is correct as words mean what we collectively decide they mean.
Correct, and collectively we decided that reselling digital work without permission is indeed theft, just as we rightfully decided that digital goods for the most part are like physical goods.
> a) the meaning has been changed
It hasn't really; digital theft still has the same meaning as any form of theft. Some did try to change the meaning and trivialise the act based on the fact that digital goods are not like physical goods. But that's a technicality based on the nature of digital goods.
Similarly, AI folks wish to change the meaning of theft based on the false assumption that an AI system "learns just like a human". But that's a false assumption. The software does mimic human behaviour, but we all know that it is neither human nor intelligent (if it were intelligent you'd show it a set of multiplications, and from that point onwards it would figure it out on its own. same with writing stories). Yet some are trying to change the meaning of words to accommodate their view of the world in which software that can ingest people's IP at massive scale, mix it in, and output something that looks novel is somehow similar to human learning.
Therefore the matter is trivial. Software ingesting digital content without permission, and outputting content made of even tiny bits of the original, is theft. Simple as that. However, that does not mean that AI should be banned. It's how the AI software is fed its data that must be brought in line.
This debate predates modern AI, and I've been having it for a lot longer than generative AI has been around. I think it's more likely that you really want to make a point about AI rather than that you have deeply held views on intellectual property.
It would be a win-win if the company promised that they'll keep the AI as it is, and as free as it is, as long as the company functions. Then they would take something, give something, and we could discuss if what we get outweighs what they took.
But the street is one-way, and it's the company that has the upper hand. The company can (and does) retract access to the AI, but they themselves keep what they took. If in the meantime people became attached to what the company gave, the company even does damage to them, not just by taking away the access, but because of severing the supply for a dependency.
So the people are taken advantage of because the company took the assets, they are taken advantage of because they help to further train the AI by using it, and then they get, at most, the privilege to pay for something that grew out of them.
That's why it's not a win-win. It's a win for the company, and a questionable outcome, and a risk for the people.
I agree wrt/ experience, but I don't think it applies to this situation. Even if you had an experience that would end, their ownership of the data wouldn't, and that, among other things, makes this very one-sided.
I do want to stress something from your conclusion though. That people do better if they anticipate change, and can adapt to it.
Whether it's one-sided depends on what you think you've gained and lost. I publish code for free (open source) and I publish my writing for free (on my blog and as comments on various websites).
I don't expect compensation from anyone who uses them, whether it's public or private use, so I don't feel like I've lost anything. Sometimes people "pay it forward." If I actually get something back, that's a win.
There are web search engines and AI chatbots that might be very slightly better (unmeasurably so) due to having been trained on stuff I published over the years. Meanwhile I get a lot of benefit from using free stuff on the Internet. I think that's a one-sided deal in my favor.
(I also pay for GPT-4 access. Whether it's worth $20 a month is more questionable, but it's fun to play with and so far I'm interested enough that I haven't cancelled.)
>Whether it's one-sided depends on what you think you've gained and lost.
I completely agree. At the end of the day, winning and losing in this situation cannot be measured, especially the "losing" part wrt/ people, so it all boils down to how the individuals perceive it. (Which is of course why powerful entities put so much effort into PR.)
I personally feel better if there are some safeguards around usage, and so I like licenses like the GPL family, where regulations are in place so that the effort is not completely trivially closed up.
But really, at the end of the day what we can control best is our perception of things. Life is what we make of it.
If you're making a library/package/rubygem/crate, allowing ChatGPT to understand your API and being able to generate code using it can help the adoption.
There are plenty of ways you can (and should) rate limit requests on your end. It is a pretty basic security and reliability practice.
Also, if you're dealing with an actual malicious adversary, real or automated, rate limiting can be more effective than blocking. (The logic needed to detect and overcome even a very significant rate limit is much more complex than the logic needed to detect blocking methods such as dropping, ignoring, or 4xx/5xx responses.)
For example, a method to rate limit based on IP with nginx
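As a rough sketch of the same per-IP idea in application code rather than nginx configuration, here is a small token bucket in Python. The class name `PerIPRateLimiter`, the `status_for` helper, and the `rate` and `burst` parameters are all hypothetical and chosen only for illustration; a real deployment would also need to expire idle entries and consider proxies that share one IP.

```python
import time
from typing import Callable, Dict, Tuple


class PerIPRateLimiter:
    """Token bucket per client IP (hypothetical helper, not an nginx feature)."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum number of tokens a bucket can hold
        self.clock = clock    # injectable so the limiter can be tested
        self._buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        # A previously unseen IP starts with a full bucket.
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False


def status_for(limiter: PerIPRateLimiter, ip: str) -> int:
    """Map the limiter's decision to an HTTP status code."""
    return 200 if limiter.allow(ip) else 429
```

With `rate=1.0` and `burst=2`, a client can make two requests back to back and is then held to roughly one request per second, while other IPs are unaffected.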
Sure. I already use several rate limitation measures, return fake data for repeating offenders, and also outright block some others. It is still laughable that a somewhat "reputable" bot does not even know about basic HTTP headers.