Reddit will begin charging for access to its API (nytimes.com)
303 points by alexrustic on April 18, 2023 | 337 comments




Related to https://news.ycombinator.com/item?id=35617763 ("Reddit Wants to Get Paid for Helping to Teach Big A.I. Systems"; an aside, but I much prefer the title of this post I'm commenting on as it describes the actual change) and it's hard to find this particularly disagreeable. Especially considering:

> Reddit’s API will remain free to developers who want to build apps and bots that help people to use Reddit, as well as to researchers who wish to study Reddit for strictly academic or noncommercial purposes.

> But companies that “crawl” Reddit for data and “don’t return any of that value” to users will have to pay up,” Reddit co-founder and CEO Steve Huffman told The Times.


It’s funny because posts on Reddit don’t belong to Reddit, they belong to the users who created them.

Why would I, as someone who’s made tens of thousands of comments, care if someone scrapes and reuses my comments. I don’t want them to pay up.

This is a really rich comment from a company that relies entirely on user submitted content and has never “paid up.”


> This is a really rich comment from a company that relies entirely on user submitted content

User submitted content and moderation.


And the moderation that makes Reddit hold valuable content is done by its users on a per subreddit status. Only stuff that could break laws like extremist content and hate speech is handled by Reddit themselves.

It's really odd to call it "their" data, and this is not exclusive to Reddit.


It isn't odd. Is it odd that content on Facebook is Meta's data? Maybe try reading the T&Cs and it won't seem so odd.

They provide the platform for free. Don't like it? Self-host or go elsewhere. This is the biz model every content silo uses.


Agreed! Reddit runs on the good will of its very few good users.

It is quite the cesspit and always has been.

Training much on it will likely worsen the confidently incorrect problems.


Considering OpenAI trained on Twitter data among others, I think it'll make for more flavor that users crave, based on the popularity of both of those platforms.


You made the comment, Reddit built the API and the system you use for making comments. If they wish to charge for their part(s) in this, they can, or retract their work, or give it away. Just as you can for your comments.


Certainly they can charge for their API. It was the CEO’s phrasing that was odd. That is their data and if people want to use it, they should pay up.

I think charging for the api is bad as it will make things like user apps harder. I think Reddit’s app is bad, so other apps need to use the api in order to function.


There's a bunch of precedent that aggregators have some IP rights. Reddit does not have exclusive rights to your posts, but they can have some rights to the collection of posts from all users.


> Why would I, as someone who’s made tens of thousands of comments, care if someone scrapes and reuses my comments. I don’t want them to pay up

Why would I, as someone who's made tens of thousands of comments, be happy with a corporation scraping my content to create a service that they'll turn around and charge me for? I want them to pay up, so that Reddit, this wonderful service that has given me thousands of hours of entertainment and education, can be sustainable and grow.

> This is a really rich comment from a company that relies entirely on user submitted content and has never “paid up.”

Most redditors will agree that they get much more from Reddit than they give. I for one am very happy with the arrangement I have with Reddit.


Because reddit is a terrible company and you don't want to subsidize their transition into a shitty ad service.


There's a lot of people in this thread defending Reddit and they don't seem to have ever had the pleasure of dealing with an actual Reddit employee. They have a culture of unchecked cronyism. Reddit doesn't care about anyone, some people will eventually figure it out the hard way.


> they don't seem to have ever had the pleasure of dealing with an actual Reddit employee.

99.999% of redditors will never ever have to deal with a Reddit employee. Cronyism? What the hell has that got to do with my consumption of and participation on Reddit?


Most users don't interact with Reddit employees, but the moderators that maintain most of the communities you enjoy do. That's how I ended up interacting with them. Shortly after the IPO rumors, my community started being harassed by an admin through modmail.


That's an odd way to put it. The admins are basically god. God doesn't harass you, he tells you how to live. If an admin tells you to jump, you beg to know how high.

I've been a mod for a decade and never had a problem with admins.


I was new to it though, and an admin picked on me because I was parodying another subreddit (my home city subreddit). You've been modding for a decade, great. That basically backs up my original point that it's a party of tenure and closed-mindedness (cronyism). My situation was different, and it's pointless to argue, but if you trust any company (especially in a changing economic environment), keep your eyes open for poorly motivated incentives.

Regardless, the attitude that they're "god" is a weird way to put it. They randomly IP banned me for calling out an admin's publicity issue during an April fools event. That's cronyism. I've used Reddit for 12 years and engaged in conversations in good faith for years. Paid for subscriptions most of that time. All relationships, business or otherwise should be mutual in some form.


If companies pay reddit for "public" data, the incentive to poison the platform with too many ads decreases, and there's always adblock.


Having more ads and monetizing the API aren't mutually exclusive. Look at the avatar/award system for example, which evolved in tandem with dark patterns that push the user to the app where they can serve unblockable ads.


My perspective is that Reddit made the comment you submitted. In real terms the comment is a record in a database which backs a web application that is developed and administered by Reddit. The comment is your expression but, like it or not, it is by Reddit’s grace that they publish it on their website. (Consider things which would be illegal for them to host and publish; they need to keep a close eye out for such things and prevent those relatively few posts among the millions they receive daily.)


Does the air make my speech by propagating the energy from my larynx?


Technically, yes, if you’d like to credit the effect of one hearing your speech to the workings of Earth’s atmosphere. It’s true that the speech is your expression but you correctly point out that the air brings it to my perceptions.


I think what GP means by "made" is "produced". Reddit provides the platform, the community, and the reach -- it's like a record label. Much like how recording artists don't own their recordings.


Tons of recording artists own their masters, both big and small. It’s a function of their contract that determines that ownership, and those terms are clear. Just as they are clear in Reddit’s TOS. You own your content, Reddit simply has a license to use it.

https://www.redditinc.com/policies/user-agreement-april-18-2...


From your link:

> When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.

That's just about every aspect of "ownership" I can think of, minus the label "ownership". Honestly, it seems about as close as a lawyer would allow a company governed by Section 230 to have, as "ownership" would step into exposure to liability.


Well, this comment and all other comments you have submitted to Hacker News is a record in my browser's cache. Does it mean I have the right to save them into less volatile storage and charge others for accessing them?


If I sued you for running such a service, it's likely that you would be looking to convince a judge that you got me to agree to something like this: https://www.ycombinator.com/legal/


Okay, so where should we start mailing checks to Reddit for hosting our comments?


To the same address Reddit mails checks for generating revenue based on those comments


I assume you give them a license to use your posts when you sign up.


> But companies that “crawl” Reddit for data

It's not possible to hide crawling at a large enough scale, right? At some point, certain IPs/user agents will (should?) be hit with CAPTCHAs to be able to have access to content and no amount of user agent/cookie/session/whatever spoofing will get around that, yeah?


IP restrictions are easy to overcome using mobile networks. Basically, a mobile network assigns your device an internal IP and NATs out to a very small pool of public IP addresses. If they block you, they also block a very large chunk of legitimate mobile users. I'm a big ol' dummy when it comes to networking, so I imagine I explained something poorly... so any mobile network nerds feel free to pile on!

Captchas are super easy! There's a gagillion captcha bypass services for every type of captcha. Just snag the captcha token, send it in an API call, and then you get a verified captcha token.

See CGNAT for more details about mobile networks. https://en.wikipedia.org/wiki/Carrier-grade_NAT#cite_note-of...

It's pretty much impossible to stop the top 1% of the most dedicated scrapers without affecting end user experience.
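
To make the CGNAT point concrete, here is a rough sketch in Python. It assumes a simple round-robin mapping of devices onto a public pool, which is a simplification (real carriers allocate ports, not whole addresses), and the pool addresses are documentation-only placeholders. The 100.64.0.0/10 range is the shared address space that RFC 6598 reserves for carrier-grade NAT.

```python
import ipaddress
from collections import Counter

# RFC 6598 shared address space, reserved for carrier-grade NAT.
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_cgnat(addr: str) -> bool:
    """Return True if addr falls inside the CGNAT shared address space."""
    return ipaddress.ip_address(addr) in CGNAT_RANGE


def simulate_cgnat(num_devices: int, public_pool: list[str]) -> dict[str, str]:
    """Map each device's internal address to a public one.

    Round-robin assignment is an assumption made for illustration only.
    """
    mapping = {}
    for i in range(num_devices):
        internal = str(CGNAT_RANGE[i + 1])  # skip the network address itself
        mapping[internal] = public_pool[i % len(public_pool)]
    return mapping


def users_behind(mapping: dict[str, str], public_ip: str) -> int:
    """Count how many devices would be cut off by blocking public_ip."""
    return Counter(mapping.values())[public_ip]


# Hypothetical pool taken from the TEST-NET-3 documentation range.
pool = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]
mapping = simulate_cgnat(10_000, pool)
print(users_behind(mapping, "203.0.113.1"))
```

Under these assumptions, blocking a single public address takes out a quarter of the simulated devices, scraper and legitimate users alike.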


> IP restrictions are easy to overcome using mobile networks.

Only if the connection is over IPv4.

The mobile networks were among the first major adopters of IPv6, and most now give each device a unique IPv6 address.


My mobile device (iPhone) relays most traffic through the nearest Akamai datacenter. So they don't get my IP address. And that datacenter has a massive number of IP addresses, which are rotated.


Out of interest how do you know it's being relayed through an Akamai DC? I assume you're talking about private relay which I also use, but I thought cloudflare was the 2nd hop for that?


They're using multiple providers including Fastly, Akamai and Cloudflare: https://www.streamingmediablog.com/2021/06/apple-private-rel...


This is only HTTP and not HTTPS traffic, which most www traffic is these days.


Reddit's API is IPv4-only.


Cat & mouse game. If you’re defending against a whitehat business scraping with curl from data center IPs, sure.

Against a less-savory actor using hundreds of IPs from residential proxies/compromised hosts, you’re gonna have a rough time, especially if you’re unwilling or unable to use aggressive fingerprinting or (vomit) CloudFlare. Not to mention CAPTCHAs are generally already a solved problem for scrapers.


Residential proxies are a completely solved problem, for companies that actually lose money to them (e.g. Ticketmaster, whose profit is maximized by blocking third-party scalpers so they can do the scalping themselves)

For companies that make money by having more MAUs, well, yeah, they're going to have a real "rough time" detecting inauthentic traffic


Why is cloudflare vomit-inducing?


While I appreciate it as an irreplaceable tool for countering DDoS, its premise is antithetical to a reliable and open web IMO, and it suffers from the same lack of accessible, customer-facing support as other big tech players. Lazy examples from HN algolia search:

https://news.ycombinator.com/item?id=32912075 https://news.ycombinator.com/item?id=17750801 https://news.ycombinator.com/item?id=22109969 https://news.ycombinator.com/item?id=30764757 https://news.ycombinator.com/item?id=29839960 https://news.ycombinator.com/item?id=22406277 https://news.ycombinator.com/item?id=23897705 https://news.ycombinator.com/item?id=34639212

(Hypocrisy disclaimer: I have sites behind CloudFlare.)


Upvoted for “hypocrisy disclaimer”


I intensely dislike them taking over as gatekeepers of the web. Perhaps because my browser is configured to resist fingerprinting and to avoid running arbitrary scripts from random websites, it is very frequently blocked by Cloudflare.

As one example, I can no longer browse the site for Lowe's (big box home improvement chain). Consequently, I now buy everything from Home Depot (their competitor).

It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker. Life's too short to solve captchas for an intermediary, so I don't bother, I just find a competitor who wants my business.


> It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker.

I don’t find that astonishing at all. I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons. Not supporting the death of the anonymous internet, but it’s not happening because of incompetence.


I don't think Cloudflare is immune to organizational incompetence even if a lot of brilliant people work there. I have similar intermittent problems as ~tomwheeler, despite a mostly unchanged residential IP and a browser configuration that's only a little bit defensive.

My outsider's impression is that Cloudflare has decided to rely much more heavily on browser fingerprinting than on classifying good/bad network activity. That puts them at odds with anyone that's taken steps to oppose being monetized by advertising firms.


> I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons.

One obvious clue would be that there are no attacks coming from my IP address.


I think that both Cloudflare and the Lowe's stores of the world understand that these interventions have negative side effects. The problem is that leaving them out has even worse consequences, and no one has offered a sufficient alternative.

Put another way, one could reason that they'd prefer to do business with Lowes because they are actively investing in security measures. Perhaps your data is more likely to be compromised at Home Depot.


It induces vomit on anyone who is on any combination of a) a slow network b) TOR or c) noscript. They also fundamentally act as middlemen, the gate between users and what's supposed to be an open web. They even promote having servers run plain http and they'll do the HTTPS proxying for you; you know, so that they can sniff the traffic between you and your users.


Require auth and it’ll help a ton.


One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

You could set a minimum karma threshold, but that would only promote karma farming; which is already widespread.


reddit might be one of the last few places on the internet that hold on to the old days of pseudonyms and mindless anonymity. I don't see how changing that would benefit the company. See Twitter.


> One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

I wonder what their monthly active users look like if you filter out 1 person switching through 3 usernames/accounts for example.


Reddit wants people to visit the site, become interested in the content they see, and start participating regularly. That's not compatible with hiding enough content behind a registration wall to thwart sufficiently sophisticated scraping.


There are services out there that have a large pool of consumer IPs that are marketed at crawlers for exactly this reason. A lot of them are either using hacked hardware or one of those free VPN browser plugins so it would be very hard to distinguish the traffic from a legit user.


There are residential proxies that allow you to bypass most of these things. I’ve been using them to crawl e.g. Amazon or Instagram without any issues, but they’re expensive. IIRC something like $10/GB


Serious question — is “residential proxies” a euphemism for “botnet?”


Yes - but legal and explicitly allowed by the user.

BrightData is the biggest of them, they run the free VPN Hola, and have an SDK app owners can install in their apps that allow selling bandwidth from installs. For someone who is price sensitive, trading some free residential bandwidth for whatever service is pretty compelling.

I'm sure there are scummy ones, but Bright seems to require pretty explicit consent. Not affiliated, just looked into it for some apps I have, but the payouts weren't good and I didn't think it'd be a good fit for our users.


This is exactly the one I was using. Basically you’re piggy-backing on mobile phones and other devices using their free VPN software, and it’s incredibly hard to block for large websites. Combine this with some other clever tricks, and you’re basically able to do huge scrapes for not-that-much money with incredible convenience.


There are also a lot of rural/metro ISPs that offer this as a service (residential IPs) if you find the right person


No because they are only HTTP proxies. But you don’t actually know how these companies get them, rumor is that they are part of browser extensions or free VPNs which users might install on their devices.

The most “reputable” company in this space is Bright Data (formerly Luminati).


Essentially yes. Sometimes it also includes people who have installed “free” VPNs.


Yes and also people who install certain shady "VPN" software.


I’m curious about this too.


Last week someone here said that some of the big VPN players use botnets to get residential IP addresses. I assumed they got residential IP addresses from ISPs but maybe not all ISPs in all parts of the world offer that.


It's possible but might end up more expensive (and definitely less reliable) than just paying whatever reddit asks for.


CAPTCHAs have been broken by primitive AI for a long time (Long before GPT4-like tools). Their only purpose is to deter the lazy bots. User agents, and any other arbitrary HTTP headers, cookies, etc. have been easy to circumvent as long as the internet has existed. The only thing that sort of works is IP reputation but with IPV6 you can have as many legitimate IPs as you want.

tl;dr Dedicated crawlers built by sophisticated actors are more or less impossible to defeat.
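
For what it's worth, one common mitigation for the IPv6 part is to track reputation per prefix rather than per address. The sketch below assumes a /64 is the unit handed to a single subscriber, which is typical but not guaranteed; the function names are made up for illustration.

```python
import ipaddress
from collections import Counter


def reputation_key(addr: str, v6_prefix: int = 64) -> str:
    """Collapse an address to the unit we track reputation for.

    IPv4 addresses are kept as-is; IPv6 addresses are reduced to their
    enclosing prefix (assumed /64 here), so rotating through addresses
    inside one allocation does not yield fresh identities.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    return str(ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False))


def count_by_key(addrs: list[str]) -> Counter:
    """Count requests per reputation key."""
    return Counter(reputation_key(a) for a in addrs)


# Documentation-range addresses, used only as examples.
requests_seen = [
    "2001:db8:1:2::1",
    "2001:db8:1:2:ffff::9",
    "192.0.2.7",
]
print(count_by_key(requests_seen))
```

This only narrows the gap: an actor with many separate allocations, or a pile of residential proxies, still looks like many unrelated users.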


It's hard to imagine captchas being a workable solution as better AI models get cheaper.

At this point computers are probably already better at solving them than humans are.


It is very easy and cheap to scale. 1$ for 1000 captchas solved, 10$ for 1000 proxies. Then you have 1000 users, and these are kinda impossible to distinguish from your typical common users if you cared to randomize the digital fingerprints for each client to some extent. Paid APIs for publicly accessible data are not something that makes sense or works well in this world.
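
As back-of-the-envelope arithmetic, using only the two prices quoted above and assuming (my assumption, not a measured figure) that each simulated client needs one proxy and solves a fixed number of captchas:

```python
# Prices quoted in the comment above, in USD.
CAPTCHA_USD_PER_1000 = 1.0
PROXY_USD_PER_1000 = 10.0


def scraping_cost(num_clients: int, captchas_per_client: int = 1) -> float:
    """Estimate the cost in USD of running num_clients simulated users.

    Assumes one proxy per client and a fixed captcha count per client;
    bandwidth, compute, and engineering time are ignored.
    """
    captcha_cost = num_clients * captchas_per_client * CAPTCHA_USD_PER_1000 / 1000
    proxy_cost = num_clients * PROXY_USD_PER_1000 / 1000
    return captcha_cost + proxy_cost


print(scraping_cost(1000))     # one captcha per client
print(scraping_cost(1000, 5))  # five captchas per client
```

Even if every client has to solve several captchas, the total stays in the tens of dollars per thousand clients under these assumptions.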


why would paid scraping services work then?


You're right... I went off-lane there. It makes total sense of course since there is a demand for data, and clearly, just a minority of the people can just scrape everything at will, even if it sounds like child's play to me. And actually, it all makes sense now, since paywalling your own API is just throwing some competition to scrapers, which is totally legit. Sometimes a simple question can do a good deed, thank you:)


Ha, no worries. I ask because I've also contemplated that kind of thing and realized that I was chasing my own tail perhaps.


> Reddit’s API will remain free to developers who want to build apps and bots that help people to use Reddit

I'd be fine if they didn't. The ratio of useful bots to annoying ones is very low.


Bots? eh... probably gonna agree. Apps, hard no. API access to third-party apps is the only way to make competition work out in the end. ICQ, AIM, and MSN Messenger all existed around the same time, and thanks to XMPP, worked equally well on an XMPP client. This made it reasonable to use all 3 if a user wanted to, plus they could use their own server too, or a friend's, or a company's, or whatever. Thanks to SMTP, email is (mostly) the same way right now. On the other hand, the perfect example of how shuttering API access to apps can completely kill any competition exists right now in the form of Discord. Sure, Guilded exists, and one could argue Slack is a competitor, but tell me, do you genuinely use all three? Would they be interchangeable? Or do you split personal and professional between Discord and Slack? If all 3 had a common standard, or at least had open client APIs, we'd already have a unified client, making all 3 easy to access at the same time, and we'd have good competition. Reddit has competition, and there are many third party apps that allow using all of them under one roof. Killing that off would not be a welcome change.

So yeah, Reddit may not need bots, but refusing to allow apps is just pushing another nail in the coffin of competition.


Unfortunately the developer of the Apollo app already got a call, and apps will need to pay. That's then the end of reddit on mobile for me. The official app is unusable and had annoying behaviour in desperate attempts to boost engagement.

https://old.reddit.com/r/apolloapp/comments/12ram0f/had_a_fe...

> There was a quote in an article about how these changes would not affect Reddit apps, that was meant in reference to “apps on the Reddit platform”, as in embedded into the Reddit service itself, not mobile apps

>

> tl;dr: Paid API coming.


Paid I can deal with and Reddit are certainly entitled to some rev share for enabling the content - but - if this goes down the old EEE path through to extinguish third party as a way to force their interface and tools or nothing - then nothing is what it will be. Twitter's API history and present is a great example of how bad things could potentially get.


My thoughts exactly.


Ok - I'm going to merge the threads but will use the more limited title on the merged post.

(Edit: merging https://news.ycombinator.com/item?id=35618695 hither now)


>> But companies that “crawl” Reddit for data and “don’t return any of that value” to users will have to pay up,” Reddit co-founder and CEO Steve Huffman told The Times.

But they do return value to users. I'd much rather get my answer from a Chat-GPT query than scouring through Reddit.

Maybe he meant that they're not returning value to Reddit in which case he'd be right, but I hate him trying to spin this for the users.


ChatGPT having info from reddit does not help redditors, it helps ChatGPTers. You can't really say they're providing a service for reddit users.


You also can't say that it isn't helping Reddit users, because the same person may use both platforms.

So the question is: how significant is the union of those two sets?


>“don’t return any of that value” to users

Notice that 'to users' was outside of the quote. That was an editorial addition.


In the original NY Times article there's this line:

> “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” Mr. Huffman said. “It’s a good time for us to tighten things up.”

So it is "to users" but more specifically it's to our users.

I would agree with Huffman here: crawling the data to build ChatGPT gives the value to ChatGPT users who aren't necessarily Reddit users, and by short-circuiting queries and processes that otherwise may have led to new Reddit users, it's taking value from all Reddit users.


E.g., the "remind me" bot uses Reddit's API and genuinely returns value to at least some of Reddit's users (unless it just plainly never reminds people). Comparing ChatGPT to things of that nature makes the difference more apparent to me.


That makes you a ChatGPT user, not a Reddit user.

Just because you want to use Reddit data doesn't make you a Reddit user, does that make sense?


> But companies that “crawl” Reddit for data and “don’t return any of that value” to users will have to pay up,” Reddit co-founder and CEO Steve Huffman told The Times

Pot-Kettle!

The elephant in the room that everyone is forgetting is - how much does Reddit pay its users for content? Reddit's value comes from its users, which is completely voluntarily contributed lol.


Except the users are contributing, in the form of upvotes, downvotes, comments, and moderation.

Reddit wouldn't exist without the work of volunteer moderators, as ripe for abuse as the positions are.

Even search engines provide value in that they provide alternative search functionality.


“ The elephant in the room but everyone is forgetting is - how much does Reddit pay its users for content? ”

I have to say I’m not a fan of reddit, but you could also ask the question: how much do users pay to access reddit?

The web has made a lot of people feel entitled to free (high quality) services. But as developers we know building and maintaining services like reddit is not cheap (let alone free).


Reddit users pay a lot, considering just how many ads Reddit has on every page, as well as the arranged content promotions that regularly pop up "naturally".


I thought the supreme court found you can't stop folks from scraping data in the LinkedIn case? I think that applies here in some way.


As is usually the case, what they decided was something much narrower than the general case. LinkedIn still can and does make efforts to prevent/restrict automated browsing at scale. What they can't do is selectively block traffic from the plaintiff company altogether, when the content is otherwise publicly available.


> In a November 2022 ruling the Ninth Circuit ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


It's closer to the exact opposite: You can try and scrape LinkedIn. But if they try and stop you you can't try to get around the block.

Sort of, generally, except it's a lot more complicated


This is the best summary of the current state of scraping laws - https://blog.ericgoldman.org/archives/2022/12/hello-youve-be...

It's complicated is the short answer.


I think you can physically block them, you can't sue them though.


If one scraped the content that's served to a browser when it's navigated to www.reddit.com I would expect that ruling to apply. If the API is considered a separate service, then I would imagine they could restrict access under separate terms.


With the ongoing "slowly breaking old reddit" and the move to SPA and mobile app, the data for comments and posts will be via OAuth API access rather than a server rendered html page.


Reddit is upset at all the Reddit mirrors that preserve deleted + removed comments, simple as.


Oh no, will they go offline because of this? Unddit is extremely useful for seeing how mods manipulate subreddits and just general curiosity for what sort of things are no longer in the Overton window this year.


They rely on the pushshift api to get comments that were deleted. The other comments are pulled by the browser at the time the page is accessed.

Depends on if the changes affect pushshift's crawling.


Yesterday Stack Overflow, today Reddit. A clear pattern emerges where open web content/communities face existential issues if the current AI paradigm continues.

It's daylight robbery. The sum of 18 years of Reddit is an enormous capital investment as well as an immeasurable amount of hours spent by its users to create the content.

It's absolutely baffling how a single entity (OpenAI, Google Bard) can just take it all without permission or compensation, and then centrally and exclusively monetize these stolen goods.

The fact that we barely even blink when this happens, and that founders confidently execute on an idea like this, tells you everything there is to know about our industry. It doesn't even pretend to do good anymore. Anything goes, really.

Anyway, get ready for an "open" web that will consist of ever more private places with ever higher walls. Understandably so, any and all incentive to do something on the open web is not only pointless now, it actively helps to feed a giant private brain.


I understand where you're coming from, but can't fully agree with you.

First, Stack Overflow contributions are licensed under Creative Commons. So monetizing them is explicitly allowed.

Second, information is not "stolen" nor "goods". Copyright law is completely separate from physical property laws, so even if you could make a case about fair use of training data, copyright-ability of model weights and AI generated content (which I agree are still legal gray areas) and therefore whether or not the "Share-Alike" CC clause is enforceable in this context, it would be an entirely different argument from whether the whole industry is somehow entirely morally bankrupt.

Third, given that this is unpaid work made voluntarily by users of the platforms (Reddit, SO), why is it any more acceptable for these platforms to lock it up and monetize it than for AI companies?

I think it's completely reasonable to charge for API access, particularly above a certain volume, but not because these companies have a right to protect some sort of "intellectual capital investment", but rather because the server costs of processing the requests are not negligible.

If anything, this situation really separates the wheat from the chaff in terms of what pools of open web content are truly "open". If the platforms hosting them expect to retain control of their "investment" can they really be said to be open?

I understand the irony, given that OpenAI's own name is somewhat at odds with its practices (of merely providing open access versus truly releasing everything as open source) but I think the reasonable solution to that conundrum is something like Wikimedia Foundation, Internet Archive or maybe CERN for AI, not giving up on free, open content just because it might feed a giant private brain.


> First, Stack Overflow contributions are licensed under Creative Commons. So monetizing them is explicitly allowed.

The evolution of any human legal system can be described as follows.

1. Hey guys, here is a simple set of rules we have agreed upon, to make sure there are no conflicts. Please follow them in good faith.

2. 95% of people follow both the letter and spirit of the agreed rules.

3. Some bad actors come in and only comply with the letter of rules, hacking and exploiting the system to their obscene advantage.

4. The complexity of the rules is increased to shut down the bad actors. The new rules increase costs for everyone, good and bad actors.

Repeat steps 2-4 continuously till the system is completely broken and we are all much worse off. The bad actors, "We did nothing wrong, we followed the letter of the law."


What's the conflict? Stack Overflow content was specifically licensed under Creative Commons so that its content can be maximally used and learned from, and it seems to be working successfully in ways not envisioned before.


3.5. Bad actors lobby for the letter of the law to be changed in their favor.

4.5. Everyday people are incited to argue about distracting, trivial issues while systemic problems snowball.


I'm favoriting this post. What a pithy description of the systemic breakdown of rule of law.


I don't wish Microsoft to forcibly snag the profits from my (and more significantly, many others') Stack Overflow posts while giving nothing back to the SO community. I'm OK with SO profiting from that and giving me points in return. If/when that becomes a noticeable issue for SO I'm sure they will revisit their approach too, because nobody likes leeches.


What do you mean Microsoft "forcibly snagging profits"? How do you profit? I am not familiar with incentives behind posting on SO.

Does Microsoft not cite SO posts in Bing results? Do they not make it easy to find the "correct" SO question/answer?

Is the issue that someone else is helping others, vs "you" or the "SO community"?


The incentive for Stack Overflow to administer the service and keep the lights on is getting paid for traffic to their website from Google search (which they monetize via a modest amount of ads and job posts).

The incentives for free contributors (SO users) to write up good questions and good answers, to debate, and to come up with and vote on better solutions in the comments are points, recognition and, yes, helping others and getting credit for it in their name, even though this credit is not monetary.

If Microsoft regurgitates my answers (just using me as an example; there are infinitely better contributors) without sending traffic to the SO website proper, and without people voting for my answer or participating in debates and discussions there, then my motivation as an SO contributor drops to zero. In many (if not most) cases there is no single smash-hit answer, and things need to be worked out and voted on. Basically, there is no reason to contribute at all, since Microsoft is going to grab my answers for itself and collect the subscription (in the case of ChatGPT and Copilot), and eventually the inevitable ad revenue, from the majority of Microsoft and ChatGPT users never leaving the Microsoft properties and never contributing to the original SO activity.

Of course, there are tons of problems inside SO proper currently as well, but none of them destroys the motivation to contribute the way third parties do when they scrape and regurgitate the original content and keep the traffic to themselves.


But they didn't take anything? And those two moves of SO and Reddit are mainly about greed: they want some more money just for hosting content that people generated while viewing their ads and giving them money for features.


Two options:

1) Copyleft licenses

2) Abolish copyright law

I am one of the few arguing for #2, but I think #1 is a good short term option.


Literally the story of Google itself: it built technology (PageRank) on a large corpus of existing text (the internet) and was then able to leverage and monetize it via search and ads.


But Google itself was and still is free. It's a service they provide to you without charge, and without it (as with any search engine) your life would be almost immeasurably more difficult. And most of the time it doesn't "take" from website owners; if anything, it generates more traffic for them.

When a model trains over Reddit, it may still provide a service that is free. But the way it's going, companies are charging money for access to those models and aren't generating traffic for the underlying training data/sites.


Free to search, though you are the product. Even ChatGPT hasn't productized its users yet in order to provide its service for free.

But make no mistake, the secret sauce in Google Search is by no means open, and possibly not even comprehensible to a single human at this point.


I wonder if AI training data can replace ads as a way to monetize web services.


And Twitter charging for API access.


We drastically need copyright reform for text, imagery, video. It was never designed for this AI era.

Take a concept like "fair use". Let's say I embed your photo and express an opinion about it. That's what fair use was designed for: in-context, relatively harmless usage of the content of others, for the sake of expression, culture and education.

That's not the same thing as "let me suck up all content ever created without permission, attribution or compensation, mangle it and sell it via the backdoor whilst making you obsolete".

You can't call that fair use, they are wildly different usages at wildly different scales with wildly different impact.

We need a new copyright category specifically for AI usage. If nothing is expressed, no training permission is given. One can opt-in and allow for training, allow for training under conditions, etc.


Honestly, I think it's completely unfair for AIs to train on this data.

I work in ML so I'm aware of the consequences but society wasn't.

My step-daughter is finally crushing it as a graphics artist and she is really pissed at tools like Midjourney.

I asked her about it and she said "yes, they steal the artwork of real artists and generate fake knockoffs" ... and I don't think her opinion is invalid.


Fully agree on everything you said.

In addition, we're all kind of forced to hop on to AI whether we're a programmer or artist just to buy ourselves a little more time, delaying the inevitable. Actually, perhaps accelerating the inevitable by contributing to it.

Even in a utopian world where we would have an economic model to support this (UBI), the outcome still sucks. It wipes out human culture. There's no point in creating/producing anything as almost anything can be produced by anyone, at incredible quality, at no cost and with little skill.

Hence, your daughter being or becoming an incredible artist would have no meaning, except perhaps for herself enjoying the process of creating art.


> Hence, your daughter being or becoming an incredible artist would have no meaning, except perhaps for herself enjoying the process of creating art.

There are lots of points and arguments to be made in this general area, but I have to ask, is this really so bad? I mean, what is the point of our lives and everything we do, other than to generally spend the rest of our time doing things we enjoy for their own sake?

If we're comparing "your daughter is an incredible artist, and here's a job for her designing product packaging for a multinational conglomerate" to "your daughter is an incredible artist, and the multinational conglomerate is using a diffusion model to design their packaging", I think it's really hard to say that the former is better than the latter. Of course, it all depends on the economic model, but the line I am quoting is within that assumption you made of the economic model being able to support this. In that case, I am for the latter wholeheartedly.

Economic incentives are great to get people "hustling", but they are rarely aligned with the human values you wish to protect, and mostly by chance if they are. Your daughter's artistry is better "spent" on art for art's (and personal enjoyment's) sake than on drawing clip art for an obscure HR form somewhere, IMO.


Nothing would stop her from continuing to create art the human way. The most intrinsically motivated will certainly do so.

But it's only half the story. Besides the process of creating art in itself being rewarding, the other rewarding part should be how other people relate to it.

One might have trained themselves for thousands of hours and this will be reflected in the output. Most people suck at art thus the skill, dedication and creativity are recognized as such. This system has merit and scarcity.

The new system has no merit as any fool can type in a few words. Nor does it have scarcity which means an overabundance of output. Both contribute to a lost sense of meaning in creating and even consuming art.

If tomorrow we will all be as fast as the fastest runner, running will become quite pointless. There is no reward or recognition for running fast. In fact, you can't even call it fast anymore, as anybody can do it.


I wonder what would happen to a world where AI runs the economy. Not everyone has some hobby or passion that brings meaning into their lives. Some people just work, come home, and spend their free hours consuming some form of entertainment. Without work, would those people just have more free hours? The elimination of human labor could be disastrous to mental health.


A good point, and I've been puzzled by how hard this split in characters is between individuals.

I know several people that without external force (work, duty) would have absolutely no idea what to do with themselves. Even their free time they organize around work-like chores or spend it on passive media.

These people seem to lack any sense of wonder, of curiosity or exploration. And it seems a permanent and fixed state. This is who they are. You can't change it.

I would not worry about this problem though because surely in the hypothetical situation of no commercial work, there's plenty of other work we can make up.


>There's no point in creating/producing anything as almost anything can be produced by anyone, at incredible quality, at no cost and with little skill.

Does this imply that some significant portion of art "value" is derived from scarcity (e.g. there is more value to creating/producing art when a smaller portion of the population can do so)?

In a strictly financial sense that is logical, but it does seem morally at odds with anything that makes art easier for humans to produce.

Is it "good" or "bad" to enable a larger population to produce more art?

Is it "good" or "bad" to enable a larger population to produce higher quality art?

Culturally, both seem like they'd be good. In our current economic model, they're probably both bad.

With an economic model that supports artists financially and removes the need to transmute "art" into "money", I don't think we'd see human culture wiped out. Without a financial incentive to create art, what's the point in creating/producing anything if not to contribute to human culture?


Yes, there's value in scarcity of skill as well as scarcity of output.

Skill: if the merit part is entirely lost, surely we will value art far less compared to now. Anybody can make anything so what is the point?

Output: lots of art to admire is great, but unlimited art isn't. You can't attach value to unlimited.


This perspective seems insane to me. I'm undecided, but if I were to put forward an argument that AI art will be bad for culture regardless of economic model, it would be something like "AI art will always be worse (in some way) than human art, but it will also be cheaper than human art, and thus will replace it in basically all commercial fields, which would be bad for culture." Maybe I'd say it's worse because it's inherently soulless, or just that as a practical matter AI is better at doing the bare minimum than humans are, or something like that.

If I thought that AI art would allow almost anything to be produced by anyone at incredible quality, at no cost and with little skill, that sounds like a Sci-Fi utopia to me, an almost unimaginable world in which all limitations on self-expression are lifted. A world in which making a movie or a TV show or a video game becomes a weekend project. It sounds wonderful.


Nobody will watch your weekend project. It fails to impress because anybody can make it. "Unlimited" is not the paradise that you think it is.


I really don't understand the world you're describing. Are you saying that no one will be able to enjoy art anymore because art won't be impressive?


I think if we had an economic model to support this, we would definitely be in a way better system than we are in right now. So many artists and musicians don't have the resources in our current system, and have to seek day jobs or stop making art altogether.


> Hence, your daughter being or becoming an incredible artist would have no meaning, except perhaps for herself enjoying the process of creating art.

This is the most meaningful reason for creating art. In fact, I'd argue human expression is the defining element of art (AI output not being art in that sense of the word), and economic motivations just pervert it.


Artists don't generate art in a vacuum. Everything is a fake knockoff of everything else.

I believe the cream will still rise to the top, and the best artists will still create something totally different, and/or use AI tools to generate something better than they could create otherwise.


Are there artists who create in isolation? I.e. the ones who somehow can prove that their art is not based on what they've seen?


No, it’s not in isolation. Doesn’t matter, because fair use applies to humans, not robots. When you go from “human that does X” to “human that operates machine that does X” you’ve changed the situation.

We’ve already been through this with cameras, which are technically just the same as using your eyes and your memory. Yet both legally and morally we all feel that operating a camera doesn’t grant you the same rights as you have by just being and looking. Strolling through the park and seeing the kids playing is very different from bringing a zoom lens and a camping chair.

That said, society could agree to a fair use that applies to ML-trained models. It could simply cover all non-commercial applications, or at the very least research.



Are there artists who are capable of viewing gazillions of artworks like computers can? And then copy and paste with little effort?


Not sure, maybe there are some savants with photographic memory.

But likely people remember references to certain art and can look them up and then 'copy paste' stylistic elements (but with a lot of effort!)


> My step-daughter is finally crushing it as a graphics artist and she is really pissed at tools like Midjourney.

> I asked her about it and she said "yes, they steal the artwork of real artists and generate fake knockoffs" ... and I don't think her opinion is invalid.

Creativity doesn't exist in a vacuum. New creations are based on long-term absorptions of existing concepts & discoveries, & the decision to advance or rebel against any combination of said concepts & discoveries.

The nature of the work will change to focus more on the final product, wherein humans still hold an advantage over art generation models in terms of errors in the produced artwork. There's the possibility that such errors will be corrected with the use of an additional model down the pipeline that's solely focused on correcting said errors, but they're not foolproof either.

There will also be a larger emphasis in some niches over the documentation of the creation of said artworks, as it currently exists in some niche circles I'm in. Reductively, it's the knockoff Gucci handbag problem, wherein the remedies towards it will be the same here:

- (Tech) Serial imprinting / rollover keys / embedded signatures for verification

- (Social) Shaming & ostracization of individuals that buy knockoffs

I'm hesitant to use the legal system to solve such a problem: the way the current copyright system is set up makes it near impossible for a new artist to NOT step on an existing artist's style in some form or another, even if unconsciously.


She's doing the same. We all are (stealing stuff, blending with randomness that moves us towards the goal).


In my opinion copyright is a law that is always at odds with the free flow of information. I'd hate for that law to start influencing how I interact with user generated text on the internet. As we see on YouTube, nuance for copyright loses to erring on the side of enforcing copyright even when the use is fair.


There's no better thinker IMO on this topic than Stephan Kinsella. (C)opyright law started in the 1500s as a form of censorship. There is no reason for it, other than censorship (or if you are in the top 1%, a great way to extract monopoly profits from the rest).

https://www.stephankinsella.com/paf-podcast/kol236-intellect...


Hey dahwolf I finally got a chance to skim through The Witch Trials of JK Rowling based on your recommendation and I thought it was pretty bad.


> “The Reddit corpus of data is really valuable,”

Totally agree, no question about that. But data comes from users. Shouldn't they also get paid?


Please no. I wish the users who create quality posts could get paid. But sadly, once money is involved, people will start to gamify the system, posting as much barely-passable garbage as possible to maximize upvotes, and the quality of content will deteriorate very quickly.


Money's already involved. Rather than posting barely-passable garbage, the automated systems just repost things that were popular a year or two ago to various subjects, with the top few comments replacing the titles of the posts. There are a number of counter-bots that detect these posts and warn people that the content is being automatically reposted, but it doesn't really stop it from happening. Presumably Reddit chooses not to stop it because, hey, engagement, woo, metrics go up, manager of engagement look good.

I'm not sure exactly how they're monetizing (maybe they sell the accounts once they have some popular posts?), but they definitely are.


I think this is actually a reddit, and all other similar platforms, problem. The issue is that there's good content, funny memes, insightful essays, whatever, that was submitted in the past. Some of it is no longer relevant and some of it your audience has already seen and would be bored by - but lots of it would be valuable to resurface now.

Because reddit focuses mainly on what's happening recently the good content of the past that might be relevant to a user today is buried. Reposters play a valuable role in resurfacing content. I think a better paradigm, though one I can't really imagine that well, would remove the need for reposters by automatically showing the content they would repost. Maybe a recommendation algorithm?


No, the issue is that pretend internet points always end up having real value, because human brains are lazy and rate in-group signalling really high and therefore trust ads that come from "big people" more. That's like the whole thing behind the influencer advertising economy.

Reddit didn't get rid of r/hailcorporate on accident. There are literal industries that exist to make fake accounts, karma farm, and sell use of those accounts to post basically sponsored messages that maybe even reddit itself doesn't know are sponsored. Think of how many people say "I search reddit for product recommendations" and know that companies have been pushing on that button for years and years. Whether reddit is honestly trying to prevent this kind of stuff doesn't actually matter, because as long as real moderation costs money and breaking that moderation makes money, the advantage is towards those who break it. FFS, reddit still has most popular subreddits modded by one account and their sockpuppets.


They got rid of HailCorporate? I'm only a casual Reddit reader so I thought that I had missed some ban drama but it's still there:

https://old.reddit.com/r/HailCorporate/


It no longer shows up on the /all tab for most users. I imagine that is due to some rule fiddling they did, similar to how one of the donald trump subreddits kept gaming the system to be most of the me page so they changed the rules.


Hard agree.

It'd turn into a world where people would try to make money (which still happens but normally is sniffed out), instead of a place where people like LundgrensFrontKick produce content, for free, because they love doing so, like:

* Estimating how long it took The Joker to set up the giant cash pyramid in The Dark Knight

* Comparing the box office success of movies that have a snowmobile action scene vs those that have a jet ski action scene

* Objectively trying to determine which Fast and Furious movie was the fastest and most furious

https://www.reddit.com/user/LundgrensFrontKick/?sort=top

Yes I know he has like a podcast now or something, but that only came after years of doing this for no reason other than he enjoyed doing it.


I came across this user just last week from their Vin Diesel sleeveless shirt post! It was the first time I've ever seen their content.


It is extremely interesting that money fails so hard at the one thing it should be good at, incentivizing behavior. I mean, yes, you'd get more content, but it would be hollow, as you say.


> It is extremely interesting that money fails so hard at the one thing it should be good at, incentivizing behavior.

Money itself doesn't fail to incentivize behavior. Rather, it is what you choose to reward with money that has to be carefully chosen to incentivize the behaviors you want to encourage (via monetary reward).


Money is a fantastic motivator, as evidenced by how quickly flaws in systems are gamed in order to attain it.

It's not money that's failing, it's the rule-masters. Agents can't work well in systems with bad rules.


No, it's money. Because the rule-masters you complain about are just optimizing for more money.


Sounds like a good motivator if even the rule-masters are chasing it.


It can be a good motivator and still be failing to serve any greater social purpose. People and organizations can get addicted to money exactly the same as people get addicted to sugar, nicotine, or cocaine. Addicts can be enormously tenacious, creative, and resourceful, but only to the end of feeding their addiction.


Totally agree. I was just taking issue with the blame on money specifically. Money is working as intended. The rules in which money is operating are fundamentally broken though, no doubt.


This is an interesting line of reasoning, does it also apply to HN karma and the big acquirers of karma?


It could. Some people pursue it very aggressively, optimizing submission times and autoposting submissions of new papers or blog posts, for example. I recall one semi-spam account that was set up to submit anything relating to Ruby, including videos that happened to mention gemstones in the title.


Basically the same as the alignment problem in AI. You need to be very careful of how you define your rewards, because you'll end up incentivizing exactly what you define.


It is like Goodhart's Law (when a measure becomes a target, it ceases to be a good measure) with an incentive attached to the measure. It probably needs to be in constant flux by design. Maybe a good thing as it would otherwise get rigid and boring.


Money spent is a good way to understand revealed preferences but this goes only one-way: you can't hand out money to reveal preferences.


This is also prevalent with the way that Google incentivizes page content structure now. I have to get 3/4 of the way through a page before I find what I'm looking for because they encourage these big kitchen-sink posts.


When you incentivize words, you get words.


How is that any different from current Reddit? The users gamify themselves into garbage without money.


YouTubers who rise to the top seem to satisfy their viewers even though there's a hard monetary incentive.


As an avid redditor I can say it's already like that. They will teach it nothing of value unless they limit it to a very small number of subreddits, and even then it will barely learn anything beyond current-generation humor and shitposting habits. That, though, would be very valuable to anyone wanting to boost engagement and sway a fairly left-leaning generation of people toward whatever they're selling.

Which is today's right-wing billionaires and their pet politicians.


Hate to say it but this happens already from what I've seen on Reddit. Even decent subreddits I've followed for years that don't have the issues the major subs have. People want their karma, post low quality content, and somehow people still upvote them.


This is already an issue with the mods of the big subreddits being partial to both bias and incentives.


Welcome to the Internet. Should we ban users who try to make money from their skills?


This isn't really users trying to make money from their skills.

This is a company taking user contributions as their own, aggregating it, and using their work for free to make money off it.


Solution: just use historical content.


Good point.


>people will start to gamify the system, posting as much barely-passable garbage as possible to maximize upvotes

As if this behavior isn't already rampant.


FWIW, we at Medium feel pretty similarly to Reddit but with a yes to this question about whether authors should get paid.

AI companies are betraying basic business principles: they are taking value from datasets like Reddit and Medium without giving any value back. Fine if you can get away with it. But since AI, especially text based LLMs, relies on source material, it's pretty straightforward for the platforms that host that source material to deny access. Things like ChatGPT do need current source material.

I don't think it'll come to a war, though; I think the AI companies will instead give some value back. It could be as simple as citations that send traffic back. That's essentially the exchange of value that we all have with Google these days.

But if it's money, then I think the obligation is for platforms to pass that on to the authors. It'd be hard for an individual author to negotiate this on their own with a company like OpenAI, but platforms are in a good position to negotiate on their behalf.


The content is public. Internet companies are starting to sound like Disney.

The AI is definitely giving value back.


It’s public for humans, not for other businesses stealing it for their own profit. I don’t want to be an anonymous contributor to an AI.


Another in a really long list of issues that need a clear differentiation between human consumption and machines. So many things that were innocuous or even useful before the age of ubiquitous cameras, other sensors, and computers, are now a big problem.


As an end user I get a lot of value from AI companies.

I’m curious how a content platform’s TOS will match up against a search engine’s web crawler TOS.

“I want people to find and access my content, but I own it.”

vs

“I will send people to your content, but any public data I can access, I can store and process how I want.”


You're in no position to negotiate this with OpenAI because they already have the relevant data stored locally. So does Google/Bing. You could be in a position to negotiate it with smaller upcoming OpenAI competitors, but all that will achieve is granting OpenAI and Google/Bing a monopoly because their competitors will have new large costs that they don't.

Also, Medium has a metered paywall already. Why not just let them open up a corporate account and pay to access paywalled content the same way users do? Why are any negotiations required?

BTW I use Medium but I never use the paywall. I'm fine with my content being used to train AI for free. The payments and tax complexity involved aren't worth the tiny amount of income that any such deal might generate, nor do I want OpenAI to have a monopoly.


Maybe no position with Google because they can bundle it with search results and threaten to take away search traffic. But OpenAI definitely does not already have all the relevant data. They need the new stuff also. That's part of the Reddit position as well.


I wonder to what extent that is true, now that LLMs can search the web and read the results?


Not according to the Reddit user agreement:

https://www.redditinc.com/policies/user-agreement

> When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.


Parent poster asked a "shouldn't they" question, not a "legally must they" question.


Thank you for saying this. It’s a classic appeal to the law fallacy I see so often online.


The Reddit user agreement is written based on US law and would be laughed out of any court in half the world. I literally _cannot_ waive my moral rights, and any company doing that is breaking the law in my country.


Hopefully at least moderators will get paid, and finally be held accountable for their moderation.

Powertripping basement dwellers who ban anyone who refuses to worship their supreme authority are one of the worst aspects of Reddit.


I don't understand this attitude. If you don't like how someone mods their subreddit then why do you want to be part of it in the first place? You can make your own and run it how you please.


You really can't make your own if you're going up against an established subreddit.

This applies doubly so if it's an established region based subreddit i.e. city, state, province, or country, and IMO these are the most problematic subreddits for overmoderation. Finding non-partisan regional subreddits is damn near impossible.


There's two popular NYC subreddits. If two, why not three?


But then people complain about getting banned from the said subreddit for not following the rules that make it what it is. It seems like you want to engage an already established community and just ignore the people that are currently there and play by the existing rules.


> people complain about getting banned from the said subreddit for not following the rules

That's not at all what I'm saying. There are a few subreddits with fair mods who enforce the rules fairly, but the great majority doesn't - they are the rules, and if they don't like you, tough tiddies. Making it effectively a "mod and minions", not a real community with real rules.


I want to be a part of it for the community. Moderators are the (un)necessary evil. The fact that you seem to see moderators as "owners" of a subreddit doesn't really help the case.


Who created the subreddits if not the moderators?


The programmers who wrote Reddit, and the community. A subreddit is nothing without a community, so the fact that a moderator namesquatted a URL doesn't mean he "owns" anything, only that he has the power - the power to moderate it. Which, in my opinion, goes with the responsibility of upholding publicly stated rules and enforcing them in a fair manner.

Unfortunately, many of them start thinking, like you, that the power to moderate means they are the supreme authority and that the subreddit is about them - so they behave accordingly, feeding their ego at the expense of a community. Of course, if you believe the power itself gives them ownership over a community, that's fine. It's just that I don't.


> If you don't like how someone mods their subreddit then why do you want to be part of it in the first place?

Access to the rest of the current community.


The current community follows the rules of the community by definition.


No, the community follows the rules of the moderators by definition. The community itself doesn't get a say.


I would say no to this. Reddit is giving you a platform and in exchange they get the content. If you don't think that is fair deal you're free to just not make the content.


Exactly. Users who contribute to the community of a for-profit enterprise that monetizes their contribution should also get paid.

I remember the good ol' days before reddit, when every community had its own forum, run for the community, not for profit. Sure, they were running some ads to keep the lights on, but those were non-targeted ads, just generic stuff based on the community.

With Dpreview dying along with so many other forums that used to serve various communities on the decline, Reddit becoming the one-stop-shop for all communities is the worst possible outcome.


They are getting paid: in the form of a discount on the cost for using the platform.


What's the cost of using the platform?


Whatever it takes to pay for servers, software engineers, and security teams divided by however many people are using it in a given time period.


I meant costs for their users. Reddit are covering their costs through ads and VC money. The users are creating the content on their platform that attracts more users. The users are the value to reddit.


Reddit has been experimenting with both fiat and Ethereum based approaches to compensate users. But we are talking about cents.


Moderators should, in my opinion. They’re doing a ton of the thankless janitorial work of cleaning up Reddit’s walled garden. If Reddit mods quit for a week Reddit wouldn’t have a product.


And yet it's going to be the other way around, since now we, the users, are going to have to pay for the API to use a usable app.


As a Reddit addict who's spent thousands of hours on the website, I feel I am very well compensated for my content. For every post and comment I give to Reddit, Reddit serves me millions of posts and comments in return. This has massive entertainment and educational value to me, so I consider it a very favourable trade.


There is already a compensation system set up for UGC. Users are already free to evaluate what they share on Reddit against what they get in return. But users have effectively been compensated for providing their content.


Users may not get paid, but they do get free access to one of the best moderated community web sites on the planet. I get hours of enjoyment and engagement from Reddit. Totally worth it for me. Of course YMMV...


There would be far fewer objections to generative AI if these tools were being harnessed by (for example) social democracies to make societies overall more efficient and productive and to redistribute gains into greater overall security. What HN users tend to summarily dismiss as Luddism is the “golden rule” form of American Capitalism, where having the most money/compute resources entitles you to scrape and enclose the sum total of human creative and intellectual output for corporate gain. It’s the basic conflict of Capitalism - what gains should be returned to labor vs Capital owners, sublimated through a whole mess of pedantry, philosophizing, and cynical legal maneuvering to obfuscate that fact.


A startup actually messaged me a few months ago that is trying to do something similar to that. Basically you get paid for your ad targeting data.


The models should be open source because they used open source data to create them, or scraped content without explicitly granted permission.


Something tells me users would have mixed reactions to monetizing karma, even if Reddit would support it.


What's good for Reddit is ultimately going to be good for Reddit users.


Not really, you have so much garbage mixed in from troll-bots.


Does that follow? Who owns the data?


What a twist of fate! The social media generation companies built their success on aggregating other people’s content and offering new ways to interact with it. “No we’re just linking to what other sites provide for free”. Now, there’s a new leather jacket in town. “We’re just training on data that you already provide for free”.

Of course it’s fun to watch a turf war, and we can all cheer for our favorite team and quibble about who deserves a punch in the gut.

But, we also need to keep an eye on the horizon. This will change the world, even and especially the spaces that we currently rely on. Just look at what happened to legacy media when the aggregators came: it largely turned into blogspam and clickbait. Comment sections (like this one) aren’t perfect, but they’re a damn good pressure valve for regular people to interact with the world. What will happen to those, for instance?


There's plenty of incentive for people to shoot the breeze and shout into the void on comment sections. Typing to an LLM just isn't the same, especially with the amnesiac ones we have now.

That being said, I don't agree about Reddit comment quality... it's just generally not horrible, with the better part of it being old or in niche subs on niche topics (like fandoms or memes) that the LLM trainers are avoiding anyway.


> There's plenty of incentive for people to […] shout into the void on comment sections. Typing to an LLM just isn't the same[…]

I am concerned with the opposite, that LLMs will shout into our comment sections. For instance, building a convincing sentiment-manipulating bot network will be dirt cheap and easy. Even here on HN there’s a financial incentive to flood the place with bots to promote tech products.

> I don't agree about Reddit comment quality... it's just generally not horrible

That’s fine, but the important thing is that most comments are written by real people who wasted time writing them.

Not just that, but language is used as a marker. You can tell when someone talks about a subject they know a lot about. This ability to judge for yourself, based on the content alone, will be eroded. Anything can sound convincing, even to the trained ear. This makes content-oriented communities like Reddit and HN particularly vulnerable.


Oh yeah, that is a definite issue. Maybe mods/LLM bots in some niche subs can save them with very strict on-topic requirements, which would reduce the incentive for most spam, but the more general parts of reddit (and HN) are in trouble.


That’s such generic logic it applies to society as a whole anymore. We just extend past effort.

James Madison wrote about it, saying the future owes deference to the past by carrying on the benefits it inherits from it.

I wonder if people could just socialize in place more; talk, make art, rather than stare at a glass obelisk all day, should social media die. You think that’s ever happened in human history? I dunno.


The cynic in me thinks this will slowly morph into charging access for third-party apps too.

Third-party apps don't show ads; there's no reason ads couldn't be included in the feed and required to be shown as a condition of using the API, but I imagine it makes tracking impressions etc far more difficult. Any new features they add also need to either be incorporated into the API or remain unavailable for those users.

My only hope is that third-party apps remain niche enough that Reddit leaves them be; the first-party experiences are all awful to the point where I would probably just stop using Reddit if third-party offerings become unavailable.


There's no morphing. This announcement includes 3rd party apps. Link from a sister comment, written by the Apollo dev:

https://www.reddit.com/r/apolloapp/comments/12ram0f/had_a_fe...

* Edit *

They also might be pulling a Tumblr. I really hope they don't.

> For NSFW content, they were not 100% sure of the answer, but thought that it would no longer be possible to access via the API, I asked how they balance this with plans for the API to be more equitable with the official app, and there was not really an answer but they did say they would look into it more and follow back up. I would like to follow up more about this, especially around content hosting on other websites that is posted to Reddit, as well as different types of NSFW content (a text post marked NSFW due to a gory moment in a story, for instance).


This somewhat reminds me of the Twitter APIpocalypse from a decade ago, which the article forgot to mention...


Apollo should ditch reddit and make their own social network.


These proposed changes to Reddit API are highly implied to also affect third-party apps, such as Apollo for iOS: https://www.reddit.com/r/reddit/comments/12qwagm/an_update_r...

As noted in the comments, the API changes will also affect the quick .json representations of Reddit pages, which were an easy way to play with real-world data for beginners learning coding/data science.


> quick .json representations of Reddit pages

Had no idea this existed... replacing the slash at the end of the URL and appending .json will output the post in JSON. Quite nice!
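For anyone who wants to try it, here is a minimal sketch of the URL rewrite described above. It only builds the address and does no fetching; the function name `to_json_url` and the example URL are made up for illustration, and whether a given page actually serves a `.json` view is assumed from the comments here, not checked.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(page_url: str) -> str:
    """Return the .json variant of a page URL (hypothetical helper).

    Drops any trailing slash from the path and appends ".json",
    leaving the query string and fragment untouched.
    """
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Illustrative URL only; the subreddit and post id are placeholders.
example = "https://www.reddit.com/r/example/comments/abc123/some_title/"
print(to_json_url(example))
```

Keeping it as pure string handling means it can be tested without touching the network.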


also .rss


Reddit wants to get paid for content it hosts.

Soon Reddit's users will want a cut of it for content they create.

Then all the places these users are copying content from will want their share.

There's no solution here. Either the web stays (mostly) open and free-for-all like it is now or everyone sets up their own little walls and ends the party.


Seems like it will just continue the pattern of things decentralizing and then re-centralizing later. The only constant is the change.


“The world's entire scientific ... heritage ... is increasingly being digitized and locked up by a handful of private corporations....

The Open Access Movement has fought valiantly to ensure that scientists do not sign their copyrights away but instead ensure their work is published on the Internet, under terms that allow anyone to access it.” - Aaron Swartz

The irony here.


So the title is not correct? API access is free. Crawling is not?

The New York Times source made a major title change: it is now "Reddit Wants to Get Paid for Helping to Teach Big A.I. Systems". That makes it much clearer what this is all about and why it is happening right now.


> So the title is not correct? API access is free.

They are going to charge for API access, but it will remain free for more limited purposes.


https://www.reddit.com/r/apolloapp/comments/12ram0f/had_a_fe...

* Offering an API is expensive, third party app users understandably cause a lot of server traffic

...

* To this end, Reddit is moving to a paid API model for apps. The goal is not to make this inherently a big profit center, but to cover both the costs of usage, as well as the opportunity costs of users not using the official app (lost ad viewing, etc.)

* They spoke to this being a more equitable API arrangement, where Reddit doesn't absorb the cost of third party app usage, and as such could have a more equitable footing with the first party app and not favoring one versus the other as Reddit would no longer be losing money by having users use third party apps

* The API cost will be usage based, not a flat fee, and will not require Reddit Premium for users to use it, nor will it have ads in the feed. Goal is to be reasonable with pricing, not prohibitively expensive.

* Free usage of the API for apps like Apollo is not something they will offer, and thus me offering free usage of the app will likely be very difficult, Apollo will almost certainly have to move to an Apollo Ultra only (AKA subscription) model

...


Closing the barn door after the horse has already bolted.

Pulling data off Reddit now will likely give you a very large amount of data polluted by LLMs. I mean, yeah, it could be useful for some broad topics at this point, but it's still likely to contain a lot of GPT's own feedback.

It's likely that companies like OpenAI will just use their old reddit dataset, and then move to scraping things like YouTube for not just text, but audio and imagery too.


It doesn't really matter if some of the data is generated by LLMs. It's possible for LLMs to improve themselves by training on their own output, there is no strict requirement for "fresh" content. If it's mixed with human content and gets human feedback through voting that's great training data.


What if an LLM generates absolute crap that then gets upvoted by bots?


I can already feel the outrage dying. Yesterday Elon was tech hitler for it, today Reddit is just doing 'what was inevitable anyway'.

One of the best parts about social media is watching swarms of people who know nothing pivot around things you know something about.


I feel API pricing is fine if the API money is more valuable for a platform than what's being built with it. If you have something valuable and you're not getting value from giving it away, sure.

The issue I have with Twitter's new API pricing is it's not either - it's paying a lot for a little, so feels more like an explicit move to stop companies building on Twitter. Like it's trying to kill the API altogether.


There’s a key difference here, in that Reddit bots still operate for free and Twitter bots don’t.


Always a rule of thumb with Elon topics - try to distill whether people are objecting to the principle or to the absurd cack-handed way Elon attempts to implement that principle. Because more often than not a good version of what Musk attempts is fine, it's just he's not competent enough to produce the good version.


What exactly are you talking about? How are API access and verifying your identity as a person of public interest the same thing?

Or is it because of things you know something about and I don't that I cannot understand what you are talking about?

Your unneeded hitler comparison doesn't really inspire confidence in your knowledge either.

Do you have sources about the people comparing Musk to Hitler?


One day, once you attempt to understand others' points of view in good faith, you will find much less confusion and more peace. Peace be with you.


> The company said on Tuesday that it planned to begin charging companies for access to its application programming interface, or A.P.I.

If this also extends to independent third-party clients then that's basically going to be the end of Reddit.


Is reddit going to pay users? Or are they just going to collect the content generated by its users and then turn around and charge people to access it?

I think we all know that it's more column B than column A.

And while I'm not entirely comfortable with LLMs consuming all of that content without reimbursing the creators of that content, I don't see how Reddit charging for its API is different on any meaningful level.


> Is reddit going to pay users?

They don't need to as they claim a royalty-free license [0] over all content posted to reddit (Section 5).

[0] https://www.redditinc.com/policies/user-agreement


Then maybe users should poison the content they post to make their data worthless unless Reddit decides to compensate its users who provide valuable info instead of updoots.


I can consume HN content, turn around and use it to derive value. That value spills onto others in various forms or maybe I keep all the value for myself (can't think of how I would hoard value without sharing because I would need to offer something to others to receive value myself).

Nonetheless, markets don't operate efficiently when people hoard shit.


Reddit already exploits unpaid moderators to create and manage its communities, so I don't imagine that Reddit is in a hurry to compensate users for selling their data to Big AI.


Content platforms are inevitably reduced in value by monetization


So this is how SkyNet or the Matrix starts, I guess. Any AI trained using the content from Reddit would obviously conclude that humankind deserves to be eradicated. /s


  We are the Borg. Lower your shields and surrender your ships. We will add your biological and technological distinctiveness to our own. Your culture will adapt to service us. Resistance is futile.


You forgot:

    Please upvote.


God I hope this happens someday to us. Except we are humans so we have already planned for this, and we can take out whatever adversarial alien civilization exists, however advanced they may be on paper. Even if it takes millennia we should be the dominant species in the galaxy and beyond.

I’m pro human species so I just want us to win.


Who's to say it hasn't happened to us already?


Reddit should consider paying its moderators. Or employ moderators who don't use their vast unchecked powers to astroturf the site on behalf of shadowy companies.


Society would not be better off if reddit mods had more economic power.

The latter point isn't a bug, it's a feature. Reddit is designed to function that way. The owners/execs have never expressed any interest in countering it outside of the limpest lip service imaginable.


I moderated two large subreddits (on various accounts). They responded to this by permanently banning both the accounts and the subreddits. They don't want users.


That's not really the point, the issue is between AI companies and Reddit, specifically regarding API access.


I'm being snarky because Reddit wants to get paid while already profiting from free labor.


I'm not an AI company but I believe these changes will impact me


Totally understandable.

That said, on subreddits I see people who post content without attribution all the time. I recall in /r/aww you can't directly link to an Instagram post but you can "steal" the image and post it, and it's optional as to whether or not you link to the Instagram post within the comments. Likewise, people take videos from YouTube/TikTok and re-host it on Reddit.

In smaller subreddits people will post entire pay-walled articles as if writers only get paid in likes.


Very common on Twitter too.

You'll have an account like "Science is amazing" or something similar which seems uplifting and does show relevant/great content. Given the positive name and quality content, they get popular quickly.

But they never attribute or give back. They gain millions of followers whilst the original creators of the content get left behind. One of many things broken on the internet.


It's hilarious to see Reddit's inept attempts to monetize the content gold mine they've squandered after a decade of devaluing product and engineering.


Ha-ha. Loving these double-faced stories here and there. “Crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with.” Very well Mr. Huffman, but what about “Posting on Reddit, generating free content which brings multi-hundred-million advertisement profits for the company, and not getting any of that value back is something which your users don't have the slightest problem with.” The API is just a convenience to get the data, but surely you can get all the data you want without any additional API for free just by using their HTTP API - as any other generic user would do. Of course, filling up an enormous proxy well to avoid various ingenious "protections" could cost you some 10-20 bucks, and solving captchas automatically could cost you another $1 per 1000, but from there, it's even easier and more enjoyable to use than an API. I'm feeling like launching a scrape-it-all service to avoid greedy ip-protocol customs officers could be a profitable venture these days.


A sensible choice. Now if only open source developers would update their licenses, perhaps with a new GPL license, to restrict reselling of IP through AI models. These folks need to adhere to rules if we are to have a healthy ecosystem.


Reddit does not own their users' content, however - they would also be simple resellers. All that's happening here is that they have failed to monetise where others are succeeding, and now they are positioning themselves to get a piece of that pie.

The right thing for them to do morally, would be to implement content visibility/privacy controls for their users similar to what Facebook offers (strange feeling to be referring to Facebook in this context).


My hope is that large players sealing off their content will motivate individuals to protect theirs. It brings awareness that their data is harvested and sold in ways never seen before, and is then used against them. Ideally those who make free software are the first to understand the implications.

Basically what I want is that all models trained on open source data or user created content without proper licensing are also open source and free.


Didn't this happen only after Facebook got burned by the Cambridge Analytica scraping a decade ago?

(Also when the Twitter APIpocalypse happened, which the article forgot to mention.)


The argument AI businesses use is that their use of copyrighted work is fair use, which means that there is no license that would prevent your IP from being used by AI models.

If that holds up legally, the best you can do is to try to stop your content from being scraped or not release it at all.


If that argument holds then indeed there is no reason to create content. Although not sure how that argument can stand in a free society.


In a free society everything is fair game for AI. What you want is actually the opposite of a free society.


The developer of the iOS app Apollo got some more details from Reddit today:

https://redd.it/12ram0f

> Reddit is moving to a paid API model for apps. The goal is not to make this inherently a big profit center, but to cover both the costs of usage, as well as the opportunity costs of users not using the official app (lost ad viewing, etc.)


I’m amazed they are willing to charge for their abomination of an API. The search functionality is terrible, returns unreliable results, and can only return 100 at once. I would happily pay for a great version of the Reddit API. I doubt anyone doing huge scraping jobs on Reddit is using their API to do so.
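For what it's worth, a cap of 100 results per call is usually worked around by paging with a cursor. Below is a rough sketch of that pattern, not Reddit's actual client code: `fetch_page` is a hypothetical callable you would supply yourself, and `fake_fetch` is an in-memory stand-in so the loop can run without any network access.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (items on this page, cursor for the next page or None when exhausted)
Page = Tuple[List[str], Optional[str]]


def iter_results(
    fetch_page: Callable[[Optional[str], int], Page],
    page_size: int = 100,
) -> Iterator[str]:
    """Yield every item by following the cursor from page to page."""
    cursor: Optional[str] = None
    while True:
        items, cursor = fetch_page(cursor, page_size)
        yield from items
        # Stop when the source reports no further page or returns nothing.
        if cursor is None or not items:
            break


def fake_fetch(cursor: Optional[str], limit: int) -> Page:
    """In-memory stand-in for a paged endpoint holding 250 fake posts."""
    data = [f"post{i}" for i in range(250)]
    start = int(cursor) if cursor else 0
    chunk = data[start:start + limit]
    next_cursor = str(start + limit) if start + limit < len(data) else None
    return chunk, next_cursor


print(len(list(iter_results(fake_fetch))))
```

Paging does nothing for the unreliable-results complaint; it only gets past the per-call limit.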


OpenAI is in some ways the new Napster. It's not a perfect analogy but it cracks open the same copyright can of worms.


Seems like the downfall of Reddit is imminent between this decision and nerfing the mobile web experience for no good reason other than to vacuum up mobile user data. What do others here think?


The New York Times' style guide is starting to look pretty dated. This has gotta be the first time I've seen "L.L.M." as opposed to just "LLM".



ChatGPT is on a trajectory to overtake Reddit in popularity.

And every interaction from users with ChatGPT is valuable content provided to OpenAI.

Most people don't realize this, but every question contains information. When a user asks "Which city is better for digital nomads, Berlin or Lisbon?", they have given out a bunch of information. That there is something called "digital nomads". That there are cities called "Berlin" and "Lisbon". That those seem to be considered good for "digital nomads".

And even more so when the chat continues. If ChatGPT praises how nice a city is for studying and the user replies "I don't study. I need a cheap apartment with fast internet", the user has provided information about the preferences of "digital nomads", that apartments can be cheap or expensive, that apartments have internet, that internet can be faster or slower.


The primary reasons people visit Reddit are a) timely news that would not be present in a pretrained LLM and b) human discussions around said news.

No, Agents that can query current information do not fix these issues.


This is not how LLMs work at all. Once your chat session ends that's it. Updating the weights is expensive (although it's done semi regularly). And in updating weights the training datasets' quality becomes an issue.

Folks are drastically underestimating the "grey goo" problem when it comes to training data. Now that AI generated content is so cheap to generate, the quality of training datasets is going to plummet.


The data is useful for training later generations of AIs. Especially when the data is clearly from humans and not other AIs.


I sometimes feed the responses from one AI into another for fun.


Two completely different use cases (chatbot vs forums) and as others have said that’s simply not how ChatGPT works with new info.


It doesn’t learn in real time from the chats, though the feedback buttons can be used for training (but I imagine OpenAI review it first)


"There’s a lot of stuff on the site that you’d only ever say in therapy" Yes that is indeed Reddit in a nutshell. May not want Reddit content in your next ChatGPT model, so not necessarily a bad thing.


i.reddit.com is gone; they want to kill the awesome 3rd party apps next instead of improving theirs. They are definitely killing off 3rd party apps, and my prediction is that they will be killed within a year.


Here’s an extension to delete all of your Reddit history: https://github.com/j0be/PowerDeleteSuite


18 years and still no business model?


On the positive side, training an AI with Reddit content will ensure that the AI is deeply flawed and will not surpass average human intelligence


LLMs already have problems with fact vs fiction. I don't see how Reddit of all places has "valuable data" in that regard.


I think the value is in the examples it provides of language.


Top upvoted comments can filter out the useless information and then it can be trained on actual data and refined.


Except when top voted comments are hivemind approved 'funny' quips/responses, or in reply to exercises in creative writing like half the posts in relationshipadvice, iwantthemanager, nuclear/pettyrevenge, etc


Is this a joke that I'm missing? Top reddit posts are frequently trash filled with misinformation.


Many popular LLMs already include large amounts of Reddit comment data, which is (usually) cited in their respective papers.


Reddit also has a problem with fact vs fiction.


HN is also an input to OpenAI's LLMs so look forward to being able to redeem your karma for OpenAI shares on a 1:1 basis.



And so the massive behaviour change in response to ChatGPT's "everything is free" model begins.



Technically speaking, there's already a very competent open-source and federated replacement for Reddit: https://github.com/LemmyNet/lemmy

Socially speaking, perhaps not so much.


How do they plan to keep Google from using its search index of Reddit for training? Or keep OpenAI from using Common Crawl? Do they simply add "No AI" to their TOS?


Does this mean third party apps like baconreader will break?


Yeah, what about Apollo which is probably the only reason why I still use Reddit?

EDIT: I guess it’s safe.

> Reddit’s API will remain free to developers who want to build apps and bots that help people use Reddit, as well as to researchers who wish to study Reddit for strictly academic or noncommercial purposes.



Reddit staff had a private conversation with Apollo devs and confirmed you will now have to pay to use an app like Apollo.


Related thread including comments from the creator of the Apollo app:

> Funny timing, given the post yesterday and my praise for how communicative Reddit has been, but today there's a comparatively much more vague post about changes to the Reddit API.

> I posted in that thread and asked a few questions which as of the time of posting have not been answered.

> Shortly after the post they emailed me about a meeting, which I've replied to and will keep you all in the loop on.

> - Christian

https://old.reddit.com/r/apolloapp/comments/12qxo6l/reddit_t...


Reddit is such a toxic waste dump these days. Good they are helping others avoid it by charging for it.


They’re going to disable old.reddit.com soon aren’t they? Locking the API will prevent any workarounds.


Better yet, detect the AI (by its frenetic speed) and then feed it a bunch of false data.


The reddit api was never very good. It's easier just to scrape the site.


Not an unreasonable take.


that is totally incorrect


?


Well obviously. All commercial ai must be trained with licensed data.



@dang this title is awful and doesn't match


Inevitable. Everyone wants to get paid.


The commenters own the copyright


if you go to delete your comments remember to overwrite them first

because reddit don't actually delete anything (plainly visible in GDPR dump)
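A minimal sketch of the overwrite-then-delete order suggested above. Everything here is hypothetical: `CommentClient` is an assumed interface and not any real Reddit library, and `InMemoryClient` only records calls so the ordering can be seen. Whether overwriting changes what the site keeps is the claim made above, not something this code can verify.

```python
from typing import List, Protocol, Tuple


class CommentClient(Protocol):
    """Assumed interface; a real client would wrap whatever API you use."""

    def edit(self, comment_id: str, new_body: str) -> None: ...

    def delete(self, comment_id: str) -> None: ...


class InMemoryClient:
    """Stand-in that records calls instead of contacting any server."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def edit(self, comment_id: str, new_body: str) -> None:
        self.calls.append(("edit", comment_id, new_body))

    def delete(self, comment_id: str) -> None:
        self.calls.append(("delete", comment_id))


def overwrite_then_delete(
    client: CommentClient,
    comment_ids: List[str],
    filler: str = "[removed by author]",
) -> None:
    """For each comment, replace the body first and only then delete it."""
    for comment_id in comment_ids:
        client.edit(comment_id, filler)
        client.delete(comment_id)


client = InMemoryClient()
overwrite_then_delete(client, ["c1", "c2"])
for call in client.calls:
    print(call)
```

The only design point is the ordering: the edit has to land before the delete, since a deleted comment can no longer be edited.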


Honestly your Reddit comment was probably already archived on Pushshift as soon as you posted it.


hosted by one guy?

a single letter from a solicitor to his hosting would probably shut that down


That's why we need more second-tier archive mirrors. Redundancy is important.


Pushshift has been around for about a decade now and has dealt with its fair share of deletion requests.


Reddit already derived their remuneration from making their site public. Now they want paid again?

Maybe they should lock their entire site behind a paywall.


If you are paywalled, try here https://archive.ph/X3Zf3


Oh nooo


What if everyone starts charging? How will small creative people build products using APIs?


Where is the outrage like when spaceship man started charging for Twitter's API?


There was plenty of outrage when Elon killed TweetBot and other third-party clients: https://www.theverge.com/2023/1/17/23559180/twitter-blocking...

There was also plenty of outrage for the new API rules, including specifically blocking weather alerts (which Elon lied about restoring): https://mashable.com/article/twitter-exemptions-nws-public-s...


You missed the word "like" in his comment.


I did.

Which makes the comment much much weirder since there is plenty of outrage on Reddit itself about it.


Is it at all comparable to the APIpocalypse that already happened to Twitter a decade ago?


Interesting… I guess Google won't be charged because of the backlinks but ChatGPT will be, because they just show an answer to one’s query and don't actually show any of the “original content” in context, and therefore send no back-traffic to reddit.



