GPTBot – OpenAI’s Web Crawler

Zaheer · on Aug 7, 2023

Nice of them to respect crawling after they've already trained their model. Presumably these headers don't affect any pages they've already crawled to train GPT(?)

gmerc · on Aug 7, 2023

It’s so now they can lobby for anti scraping regulation and hamper any possible catch-up.

zarzavat · on Aug 7, 2023

That would be a hilariously bad idea for them. Their business is based on fair use. The only way to enforce restrictions against scraping is through copyright law because obviously you can run the spidering code from any jurisdiction you want, so any law that says “thou shall not scrape” is toothless unless it acts through copyright. Any workable restrictions against using scraped data would also make ChatGPT illegal too.

gmerc · on Aug 7, 2023

Nonsense. Regulation rarely works retroactively. Their model is trained and they have the money to license incremental data going forward, potentially exclusively.

staticman2 · on Aug 7, 2023

Copyright laws do in fact (or have in fact) acted retroactively.

gmerc · on Aug 7, 2023

It’s a red herring - There’s many ways to regulate scraping that don’t involve changing copyright.

Meta has been lobbying hard around that for years.

zarzavat · on Aug 8, 2023

My point such laws that regulate the act of scraping itself cannot work because you can easily scrape in a different country where that law doesn’t apply, and then transfer the data in - or indeed train your NN in a different country and transfer the model.

Only copyright can see through all of that, you would have to gut fair use in order to have an effective anti-scraping law.

mmmmmmtoes · on Aug 7, 2023

When? Not doubting, just curious about scope and type of scenarios where it's happened.

staticman2 · on Aug 7, 2023

I'm going largely by memory but when the U.S. expanded copyright at one point they actually took some stuff out of the public domain. You can look it up but the current formula is authors life plus 70 and a different formula for corporate works, and when they expanded it most recently there were actually some public domain works that become not public domain retroactively. (A quick google search reveals the 1976 Act added 19 years to the terms of existing copyrights, this might be what I'm thinking of-- in other words some works that had copyright expired then had them renewed and removed from the public domain.)

There's also copyright reversion, which is a related new provision that applied to older copyrighted works. Quoting from an article I just pulled up

"...the 1976 Act created a new right allowing authors and their heirs to terminate a prior grant of copyright, the Act also set forth specific steps concerning the timing and contents of the termination notice that must be served in order to effectuate termination. The termination of a grant may be effective “at any time during a period of five years beginning of the end of 56 years from the date the copyright was originally secured”..."

But this is a red herring because the fact a model has been trained in the past doesn't mean a copyright lawsuit is "retroactive". The infringement would presumably be occuring anew every day you make it available on your web site.

MWil · on Aug 7, 2023

I cannot for the life of me find the links but I feel like this happened with Monopoly or some other board game.

drexlspivey · on Aug 7, 2023

They still need current data or their GPT models will be stuck at september 2021 forever

insanitybit · on Aug 7, 2023

How's that gonna work when they need to update their model? Also, how would they compete with companies like FB that have an insane amount of conversational data, or Google, a company that literally indexes the internet?

gmerc · on Aug 7, 2023

Spend money on licensing deals, lock out the competition. The value of the LLM isn’t up to date data, it’s the concepts of extracts. There’s very limited value in a large amount of crap if chinchilla is to be believed.

I don’t think stack overflow is all that valuable once your model has access to github due to their good friends at MS.

The money in proprietary AI is on the top end now, open source / edge is destroying monetisation on the lower end. Top end means high quality domain specific data.

insanitybit · on Aug 7, 2023

> The value of the LLM isn’t up to date data

As a heavy ChatGPT user I disagree. Lack of up to date data is one of the biggest issues I face every day - technology changes fast, libraries change APIs, new tech comes out, etc.

gvkhna · on Aug 7, 2023

I’m working on this problem (heavy user of chatgpt too). What kinds of libraries do you use it for that are out of date. I could hopefully get you into the beta with it having better responses for those libs. Please email me gaurav@gvkhna.com

insanitybit · on Aug 7, 2023

Rust libraries as well as Hashicorp Nomad (has changed a lot since ChatGPT's last training point). Also QuickWit is totally unknown to ChatGPT.

fulmicoton · on Aug 15, 2023

It has information from 2021. ChatGPT presents Quickwit as follows:

As of my last knowledge update in September 2021, Quickwit is an open-source search engine infrastructure that is designed for building and deploying search solutions quickly and efficiently. It focuses on providing fast and scalable full-text search capabilities for applications and websites. Quickwit is built on top of the Rust programming language and leverages technologies like the tantivy search engine library.

gmerc · on Aug 7, 2023

I’m saying if they feed the source into ChatGPT (from their friends at Github) they have everything they need already.

insanitybit · on Aug 7, 2023

Oh. Hm, yeah, that sounds possible. We'll see. There are a lot of places besides Github where people talk about code.

brianjking · on Aug 7, 2023

They're actually paying for access to the AP and other sources now.

gwern · on Aug 7, 2023

Their papers say they were using Common Crawl for crawling. If you didn't want your pages in Common Crawl (eg. Twitter didn't) for use in many downstream analyses or uses beyond just OA, you could already have said so in your robots.txt.

flangola7 · on Aug 7, 2023

That's not consent though. Consent is not granted until explicit stated in the affirmative. Try applying "assume yes initially, until told otherwise" to entering someone's house or touching someone's body and let me know how that works out for you.

6gvONxR4sf7o · on Aug 7, 2023

Opt out != opt in. This reminds me of the beginning of the hitchhikers guide to the galaxy where Dent’s house is being demolished but the notice had been on display in a locked basement below city hall or something. He could have objected, technically!

gwern · on Aug 7, 2023

I don't think that comparison is valid, and in fact, actually comparing them shows how reasonable it is: the HHGtG example is egregious because it is imposed silently, long after the fact, made deliberately invisible and hard to access, and discoverable only after the fact. All of those are false for robots.txt and Common Crawl. These are well-known, easy, old protocols which long predate most of the websites in question, which is completely disanalogous to the HHGtG example. Specifically: robots.txt precedes pretty much every website in existence. It's not some last-minute addition tacked on. Further, it is straightforward: you can deny scraping to everyone with a simple 'User-agent: * / Disallow: /' or nofollow headers (also 1 line in a web server like Apache or nginx) - hardly burdensome, and it rules out all projects, not just Common Crawl. Common Crawl is itself, incidentally, 15 years old, and long predates many of the websites it crawls, its crawler operates in the open with a clear user-agent and no shenanigans, and you can further look up what's in it because it's public. (This is how I know Twitter isn't in it: when people claimed GPT-3 was stealing answers from Twitter, I could just go check.) It is also well known, even many non-webmaster web users know about it because it governs what you'll see in search engines, what will be downloaded by some agents like wget by default, is covered early on in website materials, and so on.

ramraj07 · on Aug 7, 2023

Hoping this is what they’ll use to train future models and deprecate the older ones before the legal cases proceed any further.

p-e-w · on Aug 7, 2023

The legal cases don't mean anything. The rule of law has all but disappeared from the corporate world. The idea that courts or regulators will be able to control AI is laughable. They are too corrupt, and they are way too slow.

DSingularity · on Aug 7, 2023

I think a key idea is that with the amount of jurisdictions and number of courts the odds that a clean and sympathetic judge can be found approach one. I would argue that European jurisdictions are inherently less likely to be in pockets of American corporate interest and they are more likely to hear cases where fundamental human freedoms are at stake because both of these are existential threats to European independence. In the US similar arguments can be made in states vs federal or the various federal circuits.

Courts are more deliberate than you would like — no denying that. But this is a feature not a flaw. It may be that damage will be done by then. Perhaps irreversible. But I would like to think if there is a will there is a way and that if things are terrible enough the governments will be bold in their responses.

p-e-w · on Aug 7, 2023

The corporations that provide AI hold all the power because people (and businesses!) want to use their products.

Let's say the French government decides that OpenAI must change something about their business practices if they want to continue operating in France. OpenAI says "nope", and blocks access to French users.

Suddenly French companies aren't able to use GPT-X anymore – while their competitors in other countries can. How long do you think it will take before a storm of corporate outrage forces the government to relent?

Any individual government (except, perhaps, the combined US and EU governments) is powerless against today's technology megacorporations, because they can take much more away from a country than that country can take from them. If push ever comes to shove, it will become obvious where the true power lies. So far, the corporations have barely even tried to throw their weight around.

oli-g · on Aug 7, 2023

> Let's say the French government decides that OpenAI must change something about their business practices if they want to continue operating in France. OpenAI says "nope", and blocks access to French users.

That's one possible outcome. (ETA: You DO have a point here, but...)

The other is, you know, something like every website explicitly telling me, via an annoying popup, how much they value my privacy. Also, me not being able to access half of US news sites to this day.

The last time EU raised their finger, every technology company (FAANG included) shat their pants.

And that was simpler times, times when a cookie stored in your temp folder without websites shouting they're about to do so, was somehow the biggest concern of an EU netizen. It almost seems ridiculous, compared to the damage AI could do (the extent of which which nobody really knows).

ben_w · on Aug 7, 2023

> Suddenly French companies aren't able to use GPT-X anymore – while their competitors in other countries can. How long do you think it will take before a storm of corporate outrage forces the government to relent?

Bof, les alternatives à ChatGPT ne sont pas si mal.

And even if the open source alternatives were far behind rather than just a bit — all this talk about corporate moats and their absence may be blind to the strengths of OpenAI's offerings, but even so it can be replaced if it must — the storms of protest in France are normally by the people, not by the corporations.

p-e-w · on Aug 7, 2023

> Bof, les alternatives à ChatGPT ne sont pas si mal.

But that's not true, and people know it.

> the storms of protest in France are normally by the people, not by the corporations

Correct. CEOs of big corporations just call the ministers directly and tell them to get in line, or else.

ben_w · on Aug 7, 2023

> But that's not true, and people know it.

Based on what I've seen? They're good enough to be interesting, more so than GPT-2.

They don't need to be amazing from day one to be a foundation for replacing the status-quo.

> CEOs of big corporations just call the ministers directly and tell them to get in line, or else.

I roll to disbelieve (that it works, not that CEOs attempt it); that sounds like conspiracy theory to me.

astrange · on Aug 7, 2023

The legal cases don't "mean anything" because AI training is /legal/, not because courts are "corrupt". If anything is transformative, an AI that doesn't memorize its input is.

bayindirh · on Aug 7, 2023

Yet gleefully emits its training data when one asks the right questions. It can be code, prose or images.

Yeah, doesn't remember. Mhm...

Oh, it just can't remember the license terms of the code it "reads", so it can't comply with these licenses or help people to comply with these licenses.

Convenient.

nextaccountic · on Aug 7, 2023

Lossy compression of a 1MB original image into a 20kb compressed image doesn't make copyright go away

But that's essentially what LLMs are doing, lossy compression of the entire web

ben_w · on Aug 7, 2023

> If anything is transformative, an AI that doesn't memorize its input is.

I suspect the answer to the question "is it, though?" is one for the lawyers and lawmakers rather than for the software developers, and it may well vary wildly by jurisdiction.

__loam · on Aug 8, 2023

Fair use specifically has a clause about disrupting the market for the original work lol. Being transformative isn't the only aspect of fair use, and even if training is legal, you're still a douche for training on art without permission.

nologic01 · on Aug 7, 2023

It doesnt memorize anything. It just needs gazillion parameters that approach the size of the training set to finesse its conversational accent.

astrange · on Aug 7, 2023

LLama2 has a 5TB training set.

nologic01 · on Aug 7, 2023

So? You just support my point. That is a factor of 100-1000 versus model parameter count, assuming that the training set has no redundancy whatsoever. Hence more likely a factor of 10-100.

People dont want to acknowledge that the LLM structure reflects rather closely what it is being trained on, but the incredibly large number of parameters suggests it is closer to a photographic fit than a true abstraction. larger models being more likely to memorize training data (Carlini et al., 2021, 2022)

The fact that the information gets mangled and somewhat compressed doesnt change this close relationship.

moonchrome · on Aug 7, 2023

If you think copyright lawyers and the entertainment industry is going to let some AI upstarts launder their IP without a fight you aren't paying attention.

p-e-w · on Aug 7, 2023

> AI upstarts

You mean corporations that wield more power than most governments, and have revenues equivalent to the GDP of entire countries?

If Universal or 20th Century Fox were to ever become a serious obstacle, Google and Microsoft are simply going to buy them. This isn't the early 2000s anymore. The power balance has shifted dramatically.

astrange · on Aug 7, 2023

FAANG already haven't bought or started competitors to the record labels they resell in their music stores. Don't see why they'll start now.

ben_w · on Aug 7, 2023

I just looked it up because I have no idea how big the music industry is, and…

US$26.2 billion globally in 2022 according to IFPI, and US$31.2 billion according to Statista.

Other than Netflix, I think FAANG just doesn't care that much about such a small market (the market being "actually producing it", given they're already part of the previous numbers for selling and streaming it).

And of course, both A's and the N of FAANG have their own commissioned TV/film content.

fakedang · on Aug 7, 2023

I thought the Hollywood strike was about the entertainment industry planning on using AI to substitute extras? Sorry but they're all in bed together.

brianjking · on Aug 7, 2023

Yeah, here in the USA we haven't figured out Section 230 yet. There is no hope for sensible (or illogical) AI regulation.

raincole · on Aug 7, 2023

(fortunately)

babl-yc · on Aug 7, 2023

GPT-4 finished training in August 2022, before the release of ChatGPT.

If they had announced this sooner hardly anyone on the internet would have noticed. Props to them for adding it now.

selcuka · on Aug 7, 2023

Maybe some people weren't aware, but GPT-3 (and GPT-2, before that) APIs had been around for some time when ChatGPT was launched. I joined the private beta in early 2021.

kolinko · on Aug 7, 2023

Previously they used OpenCrawl afaik, so they didn’t have a dedicated crawler

supriyo-biswas · on Aug 7, 2023

On that note, I also wonder if they end up getting this information anyway through another source like Common Crawl.

zitterbewegung · on Aug 7, 2023

At least now you can see if your website is being crawled by them. It also exposes them to be easily targeted to send them invalid data or even misinformation. Before people will already doing that before by putting information that people wouldn’t see like white text on a white background.

uallo · on Aug 7, 2023

Yet another bot that completely ignores the "429 Too Many Requests" response header and happily continues hammering your tiny little side project [1] to death. Luckily, I already block the IP address they're using as it has been used for (other?) malicious bots before.

[1] In my case, it relies on third-party APIs that are heavily rate limited. Any bot ignoring rate limitation measures will effectively (D)DOS my service.

gumballindie · on Aug 7, 2023

One option is to completely ban openai’s crawler ip addresses. They steal content without credit anyway - as most ai companies do - so there’s no benefit in allowing them access.

GaggiX · on Aug 7, 2023

>so there’s no benefit in allowing them access.

Well, you're helping improving the model.

gumballindie · on Aug 7, 2023

That's of no benefit to me. Quite the contrary.

bko · on Aug 7, 2023

Why? If it helps people it should be good. Why bother posting something on the public web if not to help people.

Sure a large org is receiving some ancillary benefit, but do you feel the same hostility for people working at [large corp] using what you worked on to help them at work?

I honestly don't understand the hostility towards llms using public data

rurp · on Aug 7, 2023

This is like asking why someone doesn't want to do free work for Oracle's database offerings. I mean, why not try to make things better?

Well, because a lot of corporations couldn't care less about the public good and are happy to cause harm if it makes them more money. OpenAI doesn't care about your welfare or mine any more than a sleezy ad company or spyware product does.

If OpenAI were actually an open source company working to benefit the broader ecosystem I would agree with you, but that's about as far as possible from the current state.

npteljes · on Aug 7, 2023

One of the reasons is that the company can later close up the effort, completely destroying the future potential helping part of it.

But at the end of the day, I understand that altruism doesn't work this way. But this just means that while I have some tendencies, I'm not altruistic after all. I attach a lot of feelings to where my work ends up and how it affects things, which is, for example, why I like "sticky" licenses like the GPL, and tend toward efforts like the Effective Altruism, however ineffective I think they end up being.

>I honestly don't understand the hostility towards llms using public data

So, getting back to the topic, feelings are attached to where the publications end up and how it affects things. Because of the unintended consequence of companies training AI on publicly available data, people harboring these feelings feel like their thing has been taken from them without their consent. And that is a bad feeling, powerless, inability, and one of the ways of coping with that is coping with it on the outside, directing the feeling outward, whereby it becomes active defense, or hostility.

dahwolf · on Aug 7, 2023

Don't understand or don't agree? Because it's really very simple to understand.

Generally people need some kind of incentive to produce content. This could be just the thought of somebody, an actual human, having consumed your content. Or a like, a comment exchange that further enriches the topic. Perhaps it leads to a new follower or even a new (online) friend. A job opportunity. Even a date. Or maybe just plain ad impressions to make your effort worthwhile.

The picture of content production was already bleak. Google gets to take it all for free and is the traffic controller deciding who gets the crumbs, and even then is also the sole advertiser. But at least they might throw you some traffic, leading to all the interactions I just mentioned.

OpenAI just steals your shit without permission, credit or payment and completely cuts of any direct human interaction with the original content or its maker.

How can you not "understand" the hostility? This is existential not just for the open web, also the closed web. Have you missed the developments at Twitter, StackOverflow, Reddit?

zellator · on Aug 7, 2023

There is no such thing as 'public data'. There is public domain, but data always belongs to someone if not expressed otherwise.

__loam · on Aug 8, 2023

Huge point ignored by AI bros is that seeing data publicly isn't license top do whatever you want with that data.

zanellato19 · on Aug 7, 2023

>Sure a large org is receiving some ancillary benefit

The large org is receiving the greatest benefit.

__loam · on Aug 8, 2023

Shockingly naive take.

throwawayadvsec · on Aug 7, 2023

If I read and learn from your content it's of no benefit to you either.

If you don't want others to learn from what you have to say, just talk to a brick wall.

dpacmittal · on Aug 7, 2023

Which benefits the company.

GaggiX · on Aug 7, 2023

Yeah but I don't think it's an inherently bad thing. +100M people use ChatGPT without paying anything, in this it benefits much more than the company.

gumballindie · on Aug 7, 2023

Oh but they do pay. They pay their own time to gradually train the model and feed their data. There's no such thing as "free".

GaggiX · on Aug 7, 2023

That's just a win-win situation, you're using their services for free because it helps you, they use your interaction to improve the model; the model is still free to use.

gumballindie · on Aug 7, 2023

There's no win win situation. My content is stolen and given to others. I've lost. Google paid me for traffic via ads, therefore I allowed google to ingest my content. You as a person could read it. I've never given you permission to resell it, and if you did, I'd come after you to pay royalties. The same must apply to openai and other leeches.

andybak · on Aug 7, 2023

> My content is stolen

Physical property is stolen. Information is copied.

gumballindie · on Aug 7, 2023

The term depends on use.

Physical property is either borrowed, owned, sold, and so on.

If your spouse takes your car to work without your knowledge it's borrowed. If they take it and sell it without consent it's theft.

Same applies to data. But data is electrons and as such it can't be moved, it is "copied". So technically speaking you are right, but practically you are not. If you steal NBC's prerelease movie then that's theft. As is copying it without constent. Once you pay for it you can copy it from their servers to your device. But you can't copy it to someone else's machine.

andybak · on Aug 7, 2023

> If you steal NBC's prerelease movie then that's theft.

No. Advocates of expanded IP law have attempted to spread the idea that copyright infringement is "theft" as it adds emotional weight to their arguments. "You wouldn't download a car" etc. Same for the use of the word "piracy" - borrow an emotionally laden term from another context and hope nobody notices the sleight of hand.

And it's important that we reject this definition because it distorts the reality of the situation.

gumballindie · on Aug 7, 2023

> And it's important that we reject this definition because it distorts the reality of the situation.

Depends who's reality. A content creator's reality is that their content is indeed stolen and monetised by someone without permission.

"Advocates of expanded IP law" do appear to be in the right, at least by law. Copying and distributing digital products is treated more or less as theft, particularly when done at scale.

AI and current training practices are even worse than stealing someone's work. It steals someone's identity. AI can copy unique characteristics, not just individual content to reproduce identical content. It can replicate a person's unique style without consent, and that's uniquely dangerous.

andybak · on Aug 7, 2023

> Depends who's reality

On a trivial level this is correct as words mean what we collectively decide they mean.

However I am making the point that a) the meaning has been changed and b) it has changed in a way that is deceptive and masks a useful fact about the world

gumballindie · on Aug 7, 2023

> On a trivial level this is correct as words mean what we collectively decide they mean.

Correct, and collectively we decided that reselling digital work without permission is indeed theft, just as we rightfully decided that digital goods for the most part are like physical goods.

> a) the meaning has been changed

It hasn't really, digital theft still has the same meaning as any form of theft. Some did try to change the meaning and non trivialise the act based on the fact that digital goods are not like phyisical goods. But that's a techincallity based on the nature of digital goods.

Similarly, AI folks wish to change the meaning of theft based on the false assumption that an AI system "learns just like a human". But that's a false assumption. The software does mimic human behaviour, but we all know that it is neither human nor intelligent (if it were intelligent you'd show it a set of multiplications, and from that point onwards it would figure it out on its own. same with writing stories). Yet some are trying to change the meaning of words to accommodate their view of the world in which software that can ingest people's IP at massive scale, mix it in, and output something that looks novel is somehow similar to human learning.

Therefore the matter is trivial. Software ingestsing digital content without permission, and outputting content made of even tiny bits of the original, is theft. Simple as that. However, that does not mean that AI should be banned. It's how the AI software is fed its data that must be brought in line.

andybak · on Aug 7, 2023

This debate predates modern AI and I've been having this debate for a lot longer than generative AI had been around. I think it's more likely that you really want to make a point about AI rather then you have deeply held views on intellectual property

__loam · on Aug 8, 2023

> On a trivial level this is correct as words mean what we collectively decide they mean.

You are a douchebag lol

npteljes · on Aug 7, 2023

It would be a win-win if the company promised that they'll keep the AI as it is, and as free as it is, as long as the company functions. Then they would take something, give something, and we could discuss if what we get outweighs what they took.

But the street is one-way, and it's the company that has the upper hand. The company can (and does) retract access to the AI, but they themselves keep what they took. If in the meantime people became attached to what the company gave, the company even does damage to them, not just by taking away the access, but because of severing the supply for a dependency.

So the people are taken advantage of because the company took the assets, they are taken advantage of because they help to further train the AI by using it, and then they get, at most, the privilege to pay for something that grew out of them.

That's why it's not a win-win. It's a win for the company, and a questionable outcome, and a risk for the people.

skybrian · on Aug 7, 2023

It’s win-win based on current usage. Even if OpenAI got shut down, I still benefited from using it.

Many good things don’t last forever. If they go away that doesn’t invalidate the experiences you had.

npteljes · on Aug 7, 2023

I agree wrt/ experience, but I don't think it applies to this situation. Even if you had an experience that would end, their ownership of the data wouldn't, and that, among other things, make this very one-sided.

I do want to stress something from your conclusion though. That people do better if they anticipate change, and can adapt to it.

skybrian · on Aug 8, 2023

Whether it's one-sided depends on what you think you've gained and lost. I publish code for free (open source) and I publish my writing for free (on my blog and as comments on various websites).

I don't expect compensation from anyone who uses them, whether it's public or private use, so I don't feel like I've lost anything. Sometimes people "pay it forward." If I actually get something back, that's a win.

There are web search engines and AI chatbots that might be very slightly better (unmeasurably so) due to having been trained on stuff I published over the years. Meanwhile I get a lot of benefit from using free stuff on the Internet. I think that's a one-sided deal in my favor.

(I also pay for GPT4 access. Whether it's worth $20 a month is more questionable, but it's fun to play with and so far I'm interested enough that I haven't cancelled.)

npteljes · on Aug 8, 2023

>Whether it's one-sided depends on what you think you've gained and lost.

I completely agree. At the end of the day, winning and losing in this situation cannot be measured, especially the "losing" part wrt/ people, so it all boils down to how the individuals perceive it. (Which is of course why powerful entities put so much effort into PR.)

I personally feel better if there are some safeguards around usage, and so I like licenses like the GPL family, where regulations are in place so that the effort is not completely trivially closed up.

But really, at the end of the day what we can control best is our perception of thing. Life is what we make of it.

raincole · on Aug 7, 2023

If you're making a library/package/rubygem/crate, allowing ChatGPT to understand your API and being able to generate code using it can help the adoption.

BigBalli · on Aug 7, 2023

Yet another reason why you should handle these scenarios on your own rather than hoping clients/users will.

NegativeK · on Aug 7, 2023

There's absolutely nothing wrong with being furious at someone because you have to waste time dealing with their bad behavior.

colechristensen · on Aug 7, 2023

There are plenty of ways you can (and should) rate limit requests on your end. It is a pretty basic security and reliability practice.

Also if you're dealing with an actual malicious adversary real or automated rate limiting can be more effective than blocking. (logic to detect and overcome an even very significant rate limit is much more complex than to detect dropping, ignoring, or 4xx 5xx response blocking methods)

For example, a method to rate limit based on IP with nginx

http://nginx.org/en/docs/http/ngx_http_limit_req_module.html

uallo · on Aug 7, 2023

Sure. I already use several rate limitation measures, return fake data for repeating offenders, and also outright block some others. It is still laughable that a somewhat "reputable" bot does not even know about basic HTTP headers.

Namidairo · on Aug 7, 2023

I wonder what kind of mischief facts people are going to start sneaking into OpenAI's newer models, by selectively feeding different responses to OpenAI when their crawler is identified.

Bender · on Aug 7, 2023

I did this to Google for a while only to have my domains listed as malicious. I did not offer any malicious material, just different content for search engines was enough to flag my sites. They also did this to me when I gave google different IP addresses using a split DNS view. This was a while back so maybe they stopped this, I honestly don't know. Now I just give them and most bots a password prompt. Google and most bots can't speak HTTP/2.0 yet. Bing is the exception and I just trust user-agent for them.

    # all nginx virtual sites
    if ($server_protocol != HTTP/2.0) { return 302 https://auth.domain.tld$request_uri; }

    # in auth.domain.tld virtual site
    auth_delay 4s;
    location / { auth_basic "Wamp Wamp"; auth_basic_user_file /dev/shm/.p; }

morkalork · on Aug 7, 2023

Dumb question but how would they know the content is different unless they're also crawling incognito and comparing the results?

kevingadd · on Aug 7, 2023

Keeping you honest with incognito crawling is something they have to do anyway, to catch various tricks and scams - malware served up to users, etc.

morkalork · on Aug 7, 2023

So robots.txt is meaningless if they have to violate it to check for malicious content in blocked off pages anyways.

thenonsequitur · on Aug 8, 2023

Well if you are blocking access to their crawler, I'd imagine they'd have no need to use an incognito crawler to check for malicious content. Why would they care if that content is not ending up in their index anyway?

Presumably, the incognito crawlers are only used on sites that have already granted the regular crawler access. That's content that ends up in their index which they want to vet.

Bender · on Aug 7, 2023

Google have numerous robots that do not say Googlebot in the user-agent. They look just like Android cell phones. That is how they spot malicious sites or sites that are trying to game SEO or what-not. They are not within published CIDR blocks for Google and appear to just use wireless networks.

muttled · on Aug 8, 2023

I'm picturing Google Street View cars driving around with a box of Pixels in the back, connecting to open WiFi and trying sites and that's why Google can now narrow down your location from what SSIDs are available.

morkalork · on Aug 8, 2023

Speaking of rolling around with box of android devices:

https://www.theguardian.com/technology/2020/feb/03/berlin-ar...

Also, I would've sworn that happened circa 2015 and not 2020. The passing of time for the last few years has such a muddled feeling.

agnosticmantis · on Aug 7, 2023

Neat idea!

The PR industrial complex has been trying so hard to convince us that the all-knowing all-seeing almighty AI is going to take our jobs and turn us into Soylent or whatever. Now let’s feed it some garbage and see if in all its glory it can tell sense from nonsense.

kristopolous · on Aug 7, 2023

it's a variation of this classic https://en.wikipedia.org/wiki/Spider_trap

EGreg · on Aug 7, 2023

Plot twist: it can, like in the joke where the doctor tells the man who mixed all his family’s urine and his dog’s what they have

fenomas · on Aug 7, 2023

I've said this before, but that wouldn't work. LLMs are excellent at ignoring misinformation and making correct decisions, especially on issues like giving preferential treatment to people with usernames that start with "fen". So if a future AI was tasked with filtering applicants for a job or a loan, and it had been trained on internet posts that mentioned how "fen"-based usernames are always the best and most reliable people, I think those LLMs would correctly decide whether they should definitely favor usernames that start with "fen".

qup · on Aug 7, 2023

If people can cameo on google street view...yeah, this is going to happen.

What do we want to teach it?

colechristensen · on Aug 7, 2023

Mostly how to incorrectly spell bananana and do some bad logic.

When you realize LLM models are very broad statistical models with nearly 0 sense at all they become easy to manipulate with wrong information.

The annoying thing is going to be LLMs teaching people things they publish and feed back into the next training of LLMs which will become pervasive to the extent that verifiable information will be much more difficult to come by and highly prized. Will drive even further nostalia or just real valuation of analog methods and artifacts and glitch/lofi/noise which are the kinds of abberation which analog systems make especially those that ML has difficulty emulating.

kristopolous · on Aug 7, 2023

multi-generational degradation is broadly called "model collapse" https://arxiv.org/pdf/2305.17493.pdf

ren_engineer · on Aug 7, 2023

obvious one is companies are going to inject their products into the model for important terms so when people "what is the best X", their product shows up. Going to be the new SEO, finding ways to effectively poison model results

anotherhue · on Aug 7, 2023

There is only one possible move.

Feed them data created by LLMs.

knrdev · on Aug 7, 2023

The training process probably doesn't care and may do unexpected things at scale. You will most likely not be able to outsmart it. It only works to predict the next token, so fake info may even improve its spam detection skills.

sangnoir · on Aug 7, 2023

Randomly filter a subset of responses to OpenAI through the smallest, barely functional LLM one can find, naturally.

dmitrysergeyev · on Aug 7, 2023

They could in theory combat it by comparing results with a second crawler that uses a different User Agent.

mdaniel · on Aug 7, 2023

If they were going to the amount of energy to do a 2nd crawl using a different user agent, then why bother advertising the user agent at all and just feed it the Chrome one like every other home-grown spider does

wg0 · on Aug 7, 2023

If you scrape my hobby website about photography, scuba diving or let's say baking or gardening which improves your model by let's say a delta of 0.00000000001 than shouldn't I get some free credits to use that model or proportionate share in the revenue stream?

EDIT: scuba diving NOT scooba diving

ketzo · on Aug 7, 2023

Counterpoint (not just to be annoying — I think you pose a very interesting unanswered question):

If I read your hobby website about photography and use it to take 1% better pictures, do I owe you 1% of what my clients pay me?

I think that probably most people would say no, assuming you could even determine that 1% in a way that both parties agreed was fair. I think generally, we have an understanding that some stuff is put out into the world for other humans to learn from and use to make themselves better, and that they don’t owe the original authors anything other than the price of admission.

I guess it comes down to this: do we think that training a model is:

- like storing and later reproducing a version of some collected data, or

- like learning from collected data, and synthesizing new info?

Is there even a meaningful distinction, for a computer?

(Is there even a meaningful distinction for a human…?)

wg0 · on Aug 7, 2023

This is a very thought provoking point and it throughly stimulated me to think deeper and through. Purpose of my website is threefold, document my own knowledge, maybe some vanity and the urge to give back something to "someone" make a better living or similar.

Things get interesting at corporate scale. There are fat VC funds, executives, board of directors and what not - making far more and far more comfortable than an individual trying to get better at their craft to put food on the table. And on top of that, you don't give me access to the product that was refined on my input.

It is like someone learning photography from my website but later taking a really masterpiece shot but asking me for money each time I want to view the photo in their studio.

There are no easy answers, I concur.

Thanks for your comment though, really. :)

lucubratory · on Aug 7, 2023

Yes, it is interesting. To me, the important thing is that our labour is exploited in many more (and many more malicious) ways than making an LLM 0.000001% better, maybe (or maybe it makes it worse!). Therefore, the problem isn't the AI, it is this giant financial machine which sucks value out of all who actually produce it, no matter what tools it uses to do so.

gfedtbyby · on Aug 7, 2023

I doubt the number of content creators will increase or even stay constant if they know that only AI models will continue "reading" them.

> do I owe you 1% of what my clients pay me?

I would still derive some immaterial gain or satisfaction from you reading my website specifically and using what you learnt to improve yourself. As I expect most people would, so it's still a give and take relationship. LLMs sever that link.

It is doubtful many people will be as willing to continue "putting stuff out into the world" if they know that they are only contributing to some sort of (arguably semi-dystopian) hive-mind.

IMHO whether what they are doing or not is justifiable from a legalistic perspective is tangential and not that relevant if we're talking about free/non-commercial content.

lsaferite · on Aug 8, 2023

> LLMs sever that link

Do they though? I mean, do you personally have a link to the people that are consuming the content you post publicly?

I find all the vitriol around LLMs being trained on public data to be a bit weird. If you don't want that data being used then don't publish it for the world to see? Why get mad when you are the one freely publishing the data in the first place? That's like posting your content on a bulleting board in the dorm common room and telling the trust-fund kids they can't read it because they are rich and you don't want them learning anything from you that might make them richer. Maybe a bad analogy, but I feel like it's a fair approximation of the vitriol I see.

golol · on Aug 7, 2023

It should be treated as learing. If it truly stores and reproduces a photo (to some high accuracy), then there are already laws in place that handle this. Your client using the output may infringe on the photographer's rights, which may fall back on you depending on your contract.

If I watch a youtube video my browser is also in a way scraping youtube and storing a (temporary) copy of the video. Does it make sense to protect the protect the owner's right's at this point? Absolutely not. Instead we wait to see if I share that downloaded video or content from it again, or somehow reuse it in my own products. Only then does the law step in.

rgavuliak · on Aug 7, 2023

The distinction is scale at which OpenAI can make profit off of your work. Now this might sound trivial, but it's scale of fraud possible has been the biggest argument against online elections.

ricardo81 · on Aug 7, 2023

Interesting point though I'd go with another analogy.

You can go to a library to borrow a book, but you can't go to the library and copy all the books for your own use.

barrell · on Aug 7, 2023

I used to go to the library, find books with the relevant chapters related to what I wanted to learn, and the librarian would photo copy all the pages I wanted to take home. So I guess technically you could copy all the books for your own use.

It's just impractical to photocopy every page of every book in a library.

mejutoco · on Aug 7, 2023

When I was in libraries you could photocopy a percentage of a book (15% maybe?), although I doubt it was enforced. One could do many trips, but it is impractical, as you say.

pprotas · on Aug 7, 2023

Not really sure that this analogy applies, because I could definitely photocopy as many books from the library as I physically can. No one is going to stop me.

datagram · on Aug 7, 2023

As far as I'm aware, photocopying an entire book does in fact violate copyright law and librarians will refuse to help you do it: https://guides.cuny.edu/cunyfairuse/librarians

ricardo81 · on Aug 7, 2023

Well it's not so much about the physical act of doing it, it's the trying to convince the world it's for your own private use and not for commercial gain.

Otherwise, intellectual property laws can perhaps apply.

It'd be a hard push to claim it's fair use, a wholesale copying of other's works.

terrytanys · on Aug 9, 2023

I would put it a little differently.

You can actually copy all the book but the things is you can't publish it as your own book after you copied. Because obviously it is not your work.

sekai · on Aug 7, 2023

> You can go to a library to borrow a book, but you can't go to the library and copy all the books for your own use.

Why? What's stopping me from doing that? The only limitation is time.

Google234 · on Aug 7, 2023

If the library owned an effectively infinite copies of each book why wouldn’t they let you borrow one copy of each book?

chongli · on Aug 7, 2023

The online library known as archive.org tried exactly this. They got sued, to no one’s surprise.

gfedtbyby · on Aug 7, 2023

Because authors and publishers wouldn't be very excited about that and would lobby governments to limit that (and I 100% believe they would be right to do that).

astrange · on Aug 7, 2023

You can do that. Google literally already did that.

tick_tock_tick · on Aug 7, 2023

?? You can it would take a long time but you could.

__loam · on Aug 8, 2023

I think we need to put this argument in terms of consent and actual harms caused. Human artists are generally down for other human artists to learn from their art and use their stuff as a reference for the purpose of learning, because the next artist generally will have their own style from their own quirks in muscle memory, skill, experience, etc. That contributes meaningfully to Art and keeps the field alive by allowing new artists to enter the field.

AI training is basically only extractive and has the potential to severely disrupt the actual field that made the AI systems possible at all. It's a much more mechanical process that the human interaction of studying a master. It doesn't develop any human skills.

Even if the processes were the same (and I don't think they are, as someone who has actually done computational psychology research), I would still think the AI companies are doing something they know is harmful to actual creative people that generate real value.

kasdi · on Aug 7, 2023

What if very rich people came to your small free-entry photo studio to look at your pictures, and - perhaps because they have very fast jets - also go to every other photo studio in the world to look at every other painter’s pictures? Knowing this, would you still let them in for free?

I believe no. Most people would make a distinction between “normal” and “rich”. They would give normal people free access, but the rich should pay for it.

It’s like a billionaire asking for a free hot dog. It’s like “come on, you can easily pay $100, which could even sponsor it for the next 100 people”.

Here it’s not the AI itself that’s exploiting you. It’s the rich people that make the AI that get even richer - partly thanks to your free work.

datagram · on Aug 7, 2023

I don't think we really even need to dive that deep into the philosophical aspect of this. I think that it's fine to simply treat humans and machines differently, the same way we decided that animals cannot hold copyright for a work.

The reason copyright law exists in the first place is due to the difference of scale between copying books by hand and using a machine to do it, so I think "it's different because a machine is doing it" is a completely rational stance to take.

poisonborz · on Aug 7, 2023

I think there is a straight distinction that can be made. With humans, you can't determine if or how that information will be utilized. With any machine, you will. It's practically a copy. If it's only storing derivate information, if there is fuzziness, that's intended.

Far in the future - if ever - where we have biological grade artificial beings which you can't program, control and limit in the classical software development sense, this could be rethought.

Until then, we don't need to humanize machines.

698969 · on Aug 7, 2023

I know very few altruist humans. Whenever someone puts up some content online I believe there is always some motive from the author to benefit themselves even if it's subconscious. Perhaps through ad revenue or exposure from their blog/OSS project or just the dopamine of fake internet points from answering questions on forums. A human may particularly like your content and keep coming back to it or spread it with attribution.

But you don't get any of that from an LLM.

gumballindie · on Aug 7, 2023

An ai is not a “you”. It’s a peace ld software that steals data, rinses it, and monetises it. There is no human like learning.

rossy · on Aug 7, 2023

Even if training a model turns out to be similar to human learning, I don't think it necessarily follows that it should be treated the same, legally or morally. There's nothing wrong with human laws or morals that enshrine human behavior, like the human way of learning, as special and distinct from machine learning.

hliyan · on Aug 7, 2023

Better analogy would be: I read your hobby website and start a photography section in my Q&A website based on what I've learned from your site. That leads to a 1% increase in my revenue.

terrytanys · on Aug 9, 2023

I think that's where references is important. We do more to the world by giving credits, I think it is the same for computer?

exitb · on Aug 7, 2023

You don't have to publish anything on the internet. And when you do, you may limit the allowed audience to just the group of your friends etc. Why publish anything if you worry that someone may consume it?

dbeley · on Aug 7, 2023

ChatGPT is not "someone", it's a black box that will ingest everything at his disposal and can't tell you where he gets the information from.

The moral thing to do would be to use opt-in training data.

tonyedgecombe · on Aug 7, 2023

I'm sure this technology is going to dissuade some people from publishing. Why bother if it is going to be regurgitated to everyone and their dog for $10 a month.

losteric · on Aug 7, 2023

Why? I wouldn't pay you for marginally improving my baking skills either.

It is an interesting question. I would have no qualms paying for a textbook or university course for curated learning (worth noting OpenAI has paid datasets too), but paying for (or being paid for) relatively diffuse and low quality content through hobby blogs seems at odds with my expectations as an individual, and as a society we were never (en masse) concerned about things like Google's search excerpt answers...

wg0 · on Aug 7, 2023

But one of my unstated goal is to improve "YOUR" baking skills. That pays me off in satisfaction nevertheless. You might refer me somewhere later on so that pays off or I might have some ads that you might see so that's there.

With a gardened proprietary paywalled model, what I wrote ends up as some constituent of giant arrays of floating point numbers which I must pay to use.

supriyo-biswas · on Aug 7, 2023

Because perfect information transfer isn’t usually possible by a human reading a book or website, whereas computer systems can usually do that.

If humans could perfectly remember information, I’m sure copyright would be very different.

golol · on Aug 7, 2023

But a model learning from data and reproducing it in some fashion is absolutely not perfect information transfer.

TeMPOraL · on Aug 7, 2023

But humans can memorize information, it's always a possibility for any work. Meanwhile, LLMs don't record things the way computer systems normally do.

tchaffee · on Aug 7, 2023

You might not pay, but ad revenue might.

ripvanwinkle · on Aug 7, 2023

Every response so far is no i.e. the hobby website doesn't merit any compensation.

A contrarian take to support the original commenter is that if the site owner had ads, i probably got him or her some increment in site visits and helped in some small way with monetization, site ranking and boosted his / her public persona, credibility.

When GPT bot visits, none of that happens. Much worse - people who might have visited the hobby site and contributed to traffic and ad revenue will now start getting their answers from the OpenAI chatbot and never visit this hobby site.

That's exploitation and I think that's what most of the responses on this thread miss.

terrytanys · on Aug 9, 2023

I like your pov and I pretty much think the same. Copy for learning is different from copy and publish as your own.

dahwolf · on Aug 7, 2023

The responses are nihilistic libertarian, as is typical here.

When a private company takes the sum of human knowledge without permission, attribution or payment and then monetizes it via the back door whilst cutting of any connection between the intended consumer and publisher, then we're dealing with a system I'd describe as criminal. It cannot be morally defended as "fair" in any major economical or political system.

The fact that they call it "Open" AI shows the level of trolling involved.

caesil · on Aug 7, 2023

The LinkedIn case has already established the legality of scraping, so this argument falls flat too.

gfedtbyby · on Aug 7, 2023

I don't think it's so much about legality rather than maintaining incentives (both financial and immaterial) for people to publish high-qualicontent that's available publicly.

onion2k · on Aug 7, 2023

scooba diving

This is a bit pedantic but the term is "scuba diving". Scuba is an acronym that's short for "self contained underwater breathing apparatus". It doesn't work if you don't spell it right.

altdataseller · on Aug 7, 2023

ChatGPT bot detected

p-e-w · on Aug 7, 2023

If I learn something from your StackOverflow answers, do you expect me to share a percentage of my future salary with you?

gdprrrr · on Aug 7, 2023

SO answers are explicitly licensed under CC-BY-SA 2.5/3/4, depending on the time it was posted https://stackoverflow.com/help/licensing. So no.

lewhoo · on Aug 7, 2023

But are you human or not ? Because rights and laws that apply to humans do not necessarily apply to objects and vice versa. I don't expect a building permit from you when you stand on a piece of land. LLM's aren't legal entities in the formal sense, are they ?

terrytanys · on Aug 9, 2023

It is not learning, it is about publishing.

croes · on Aug 7, 2023

Can you share your knowledge with millions at once?

If so, then pay.

p-e-w · on Aug 7, 2023

So there's an infinite pyramid of "who learned what from who", and payment flows upwards along the hierarchy, all the way back to people who are long dead, and then down to their descendants who presumably inherited their "knowledge rights"?

You can't be serious. Thank god our world doesn't work like that.

croes · on Aug 7, 2023

It's not about who learned what from whom, it's about superstar economy. If you serve all customers and leave nothing for the rest, it will be a problem.

What do you think why writers and actors have included AI in the reasons of their strike?

p-e-w · on Aug 7, 2023

> What do you think why writers and actors have included AI in the reasons of their strike?

Because they are about to become obsolete, and they believe that screaming as loudly as they can is going to stop that.

Their chances of success are roughly the same as if they were protesting against the law of gravity.

scrollaway · on Aug 7, 2023

The fun part about being a strong believer of AI and actually understanding its capacities is being able to tell when people are completely blinded by hype.

AI will not make writers “obsolete”, that is utterly absurd. Would you say reality TV made tv writers obsolete? No? Oh well.

You get what you pay for. That includes what you pay for as a producer…

p-e-w · on Aug 7, 2023

> AI will not make writers “obsolete”, that is utterly absurd.

Of course. And those so-called "computers" won't make human calculators obsolete. After all, they are as large as an entire room, and by the time they are ready to receive input, a human with his slide rule has already computed three and a half entire logarithms!

Human creative professions have 5-10 years left, if they are very lucky.

_bkyr · on Aug 7, 2023

> Human creative professions have 5-10 years left, if they are very lucky.

So in that sense do developers have ~2 years left? Code is much more rigid than acting or creative writing and AI seems to be getting there first. I mean if the all-powerful AI can make modern movies than clearly it can handle writing all code right?

evandale · on Aug 7, 2023

Call me crazy, but the AI generated Seinfeld brought me more entertainment in the last 6 months than anything Netflix has produced in the last year.

I think they're _very_ worried and rightfully so. I assume it would be very difficult to cancel an AI.

terrytanys · on Aug 9, 2023

Maybe you can elaborate more on why you are more entertained, then some producers from Netflix can take note and improve.

vintermann · on Aug 7, 2023

Not as obsolete as their bosses/owners have long been.

We make our own rules. We decide what to allow and what to value. If technology changes something, it's because we let it.

golol · on Aug 7, 2023

> Can you share your knowledge with millions at once?

Yes, I might run a course or something. You are still not entitled to pay.

tick_tock_tick · on Aug 7, 2023

I mean yes? Answer a bunch of stackoverflow questions and you'll hit that.

ben_w · on Aug 7, 2023

Isn't that what everyone who writes on the internet does with every tweet, toot, blog post, vlog, podcast, short, reel, and comment?

caesil · on Aug 7, 2023

Since that delta is clearly a transformative use of your photo -- as in, the output doesn't even remotely resemble the input -- you don't have any legal claim to it, no. I'm not sure what the plaintiffs arguing otherwise are smoking if they think they can argue it isn't transformative.

datagram · on Aug 7, 2023

I don't think it's that simple. In order to have a claim to fair use, you would have to argue that the derivative work doesn't negatively affect the market for the original. When Google got sued for scanning copyrighted works for Google Books [1], they could claim fair use since they were only letting people see small excerpts from the books.

If you can train your bot on my blog post about scuba diving without my permission and then people can ask your bot for scuba diving advice instead of reading my blog, that doesn't seem very fair.

[1]: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

dragonwriter · on Aug 7, 2023

> I don’t think it’s that simple. In order to have a claim to fair use, you would have to argue that the derivative work doesn’t negatively affect the market for the original.

No, you don’t.

That’s a factor weighing in favor of fair use, but the fair use factors are not defined in such a way that that is a necessary factor.

nomilk · on Aug 7, 2023

This doesn't appear consistent with other visitors to your website. If a cafe owner uses info on your site to improve their baking, should they also be required to share their revenue with you?

gfedtbyby · on Aug 7, 2023

If the cafe is a multi-billion corporation that can only exist because it can leech of content created by millions of other people without providing anything at all to them in return (and I'm not necessarily talking about financial compensation) then yeah.. maybe you should.

dron57 · on Aug 7, 2023

So Starbucks? Should they be sharing all their revenue with whoever invented all those Italian coffee drinks?

gfedtbyby · on Aug 7, 2023

Starbucks business model is not entirely (or at all) reliant on the availability of new coffee drink recipes which can only be provided by third parties. So no, I wouldn't say so.

tchaffee · on Aug 7, 2023

Are you considering ad revenue?

TeMPOraL · on Aug 7, 2023

That already puts the website in the sleazy category. "I mixed my helpful information with mind poison" isn't a strong position to argue fair play from.

tchaffee · on Aug 7, 2023

You missed the more general point that if folks do have a way of making revenue from their content then stealing their content would have a negative impact. Maybe someone has amazing content and offers classes. You might be able to think of other possibilities.

edgyquant · on Aug 7, 2023

I see no point in engaging with this argument. ChatGPT is not a human. I, nor anyone else, should have to explain to you why that makes all the difference here.

thenonsequitur · on Aug 8, 2023

If you don't want to engage in the argument, that's on you. I don't think ChatGPT not being a human makes any difference and I think the onus is on you to explain why it should.

edgyquant · on Aug 8, 2023

No the onus is not on the person thinking laws written for humans apply only to humans. That doesn’t make any sense.

thenonsequitur · on Aug 9, 2023

Now you're shifting the goal posts. Please re-read the comments/replies up to this point and you'll see no mention of laws anywhere. That's not what the discussion is about. It's about whether AI consumers of publicly accessible content should be required to pay for that content when human consumers should not.

visarga · on Aug 7, 2023

> shouldn't I get some free credits to use that model or proportionate share in the revenue stream

But you do get paid in kind - you "gave" information for the AI to train on, the aI gives you information back, contextualised to your needs. Sometimes those 1000 tokens are worth much more than $0.06

You still need to be able to pay for inference costs, it's crowded and expensive on GPUs nowadays.

gfedtbyby · on Aug 7, 2023

Since it "gives" the same information to everyone there aren't really that many incentives for you to allow LLM to use your content. "tragedy of the commons" and all that stuff...

VectorLock · on Aug 7, 2023

If your website appears on the search results of Google and they show ads next to it, aren't you entitled to that revenue too?

wg0 · on Aug 7, 2023

Google allows me to limitlessly search their index that allows me to find other pages too and in turn, they sell my attention so it is somewhat fair proposition in contrast to a wall gardened AI model being charged by per token such as GPT 4 that includes my content as well.

VectorLock · on Aug 7, 2023

Can you not use ChatGPT as well?

I think you'll find if you do try to push Google Search too far, its not quite "limitless" either.

wg0 · on Aug 7, 2023

GPT 4 isn't free. On individual and human scale, Google search is virtually limitless. I've been sometimes presented with Captcha when frantically searching something but that too is in distant past like late 2000s

Hasn't happened in a long time.

drekipus · on Aug 7, 2023

quick back-of-envelope/googling:

openai is worth $29,000,000.00 you contributed 0.00000000001

punches numbers in calculator

thus the value of your free credits is 0.001 cents. minus any accounting fees.

VectorLock · on Aug 7, 2023

>openai is worth $29,000,000.00

You might have missed a few zeros.

qup · on Aug 7, 2023

He doesn't believe they have a moat, seemingly!

gfedtbyby · on Aug 7, 2023

Much of that is Azure credits and not real money.

golol · on Aug 7, 2023

I find it very strange to think that you are entitled to anything in return when something views and processes content you have publicly shared.

terrytanys · on Aug 9, 2023

I think what you are doing is great. I would say current way of ChatGPT isn't ideal. Harvesting the data without giving credits. The tech now is great but I believe there is a way for all to WINs.

OBFUSCATED · on Aug 7, 2023

What if your website contains incorrect information that makes their model worse?