
> But companies that “crawl” Reddit for data

It's not possible to hide crawling at a large enough scale, right? At some point, certain IPs/user agents will be (or should be?) hit with CAPTCHAs before they can access content, and no amount of user agent/cookie/session/whatever spoofing will get around that, yeah?




IP restrictions are easy to overcome using mobile networks. Basically, mobile networks assign your device an internal IP and NAT out to a very small pool of public IP addresses. If they block you, they also block a very large chunk of legitimate mobile users. I'm a big ol' dummy when it comes to networking, so I imagine I explained something poorly... so any mobile network nerds feel free to pile on!
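Roughly the idea, as a toy sketch (the addresses are documentation examples, not any real carrier's pool):

    # Toy illustration of CGNAT: many subscriber devices share a handful of
    # public addresses, distinguished only by the port mapping.
    import random

    PUBLIC_POOL = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]  # tiny example pool

    def nat_binding(private_ip: str, private_port: int):
        """Map a subscriber's (private IP, port) onto a shared (public IP, port)."""
        public_ip = random.choice(PUBLIC_POOL)
        public_port = random.randint(1024, 65535)
        return (private_ip, private_port), (public_ip, public_port)

    # Blocking 198.51.100.1 blocks every subscriber currently mapped onto it,
    # not just the one scraper.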

Captchas are super easy! There are a gazillion captcha bypass services for every type of captcha. Just snag the captcha challenge, send it to the bypass service's API, and you get back a solved token.
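The flow with most of these services looks roughly like this. Sketch only: the solver.example endpoints and field names are made up, since every provider has its own API.

    # Rough sketch of the usual solver-service flow. The "solver.example"
    # endpoints and JSON fields are placeholders, not any real provider's API.
    import time
    import requests

    SOLVER = "https://solver.example/api"
    API_KEY = "your-solver-account-key"

    def solve_captcha(site_key: str, page_url: str) -> str:
        # 1. Hand the challenge (site key + page URL) to the solving service.
        job = requests.post(f"{SOLVER}/tasks", json={
            "key": API_KEY,
            "site_key": site_key,
            "page_url": page_url,
        }).json()

        # 2. Poll until a worker (human or model) returns a solved token.
        while True:
            result = requests.get(f"{SOLVER}/tasks/{job['id']}",
                                  params={"key": API_KEY}).json()
            if result["status"] == "ready":
                return result["token"]
            time.sleep(5)

    # 3. Submit the returned token wherever the target expects it,
    #    e.g. as the g-recaptcha-response form field.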

See CGNAT for more details about mobile networks. https://en.wikipedia.org/wiki/Carrier-grade_NAT#cite_note-of...

It's pretty much impossible to stop the top 1% of the most dedicated scrapers without affecting end user experience.


> IP restrictions are easy to overcome using mobile networks.

Only if the connection is over IPv4.

The mobile networks were among the first major adopters of IPv6, and most now give each device a unique IPv6 address.
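The flip side (my assumption about how defenders commonly handle v6, not anything a specific site has published) is that rate limiting is usually done per /64 prefix rather than per address, since a subscriber normally controls at least a /64:

    # Sketch: bucket IPv6 clients by their /64 prefix before counting hits,
    # so one device can't dodge limits by hopping across its own addresses.
    import ipaddress
    from collections import Counter

    hits = Counter()

    def record_hit(client_ip: str) -> int:
        addr = ipaddress.ip_address(client_ip)
        if addr.version == 6:
            key = str(ipaddress.ip_network(f"{client_ip}/64", strict=False))
        else:
            key = client_ip  # IPv4: fall back to the bare address
        hits[key] += 1
        return hits[key]

    # record_hit("2001:db8::1") and record_hit("2001:db8::2") both count
    # against the same /64 bucket.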


My mobile device (iPhone) relays most traffic through the nearest Akamai datacenter. So they don't get my IP address. And that datacenter has a massive number of IP addresses, which are rotated.


Out of interest, how do you know it's being relayed through an Akamai DC? I assume you're talking about Private Relay, which I also use, but I thought Cloudflare was the 2nd hop for that?


They're using multiple providers including Fastly, Akamai and Cloudflare: https://www.streamingmediablog.com/2021/06/apple-private-rel...


That only applies to HTTP traffic, not HTTPS, and most web traffic is HTTPS these days.


Reddit's API is IPv4 only.


Cat & mouse game. If you’re defending against a whitehat business scraping with curl from data center IPs, sure.

Against a less-savory actor using hundreds of IPs from residential proxies/compromised hosts, you're gonna have a rough time, especially if you're unwilling or unable to use aggressive fingerprinting or (vomit) CloudFlare. Not to mention CAPTCHAs are generally already a solved problem for scrapers.
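For what it's worth, the unsophisticated version of that actor is little more than a loop over a rented proxy pool. The proxy addresses (documentation IP ranges) and user agents below are placeholders:

    # Minimal sketch of scraping through a rotating residential proxy pool.
    import random
    import requests

    PROXIES = [
        "http://user:pass@203.0.113.10:8080",
        "http://user:pass@198.51.100.22:8080",
        # ...hundreds more, typically rented from a residential-proxy vendor
    ]

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def fetch(url: str) -> requests.Response:
        proxy = random.choice(PROXIES)  # fresh exit IP per request
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": random.choice(USER_AGENTS)},
            timeout=30,
        )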


Residential proxies are a completely solved problem for companies that actually lose money to them (e.g. Ticketmaster, whose profit is maximized by blocking third-party scalpers so they can do the scalping themselves).

For companies that make money by having more MAUs, well, yeah, they're going to have a real "rough time" detecting inauthentic traffic


Why is cloudflare vomit-inducing?


While I appreciate it as an irreplaceable tool for countering DDoS, its premise is antithetical to a reliable and open web IMO, and it suffers from the same lack of accessible, customer-facing support as other big tech players. Lazy examples from HN algolia search:

https://news.ycombinator.com/item?id=32912075
https://news.ycombinator.com/item?id=17750801
https://news.ycombinator.com/item?id=22109969
https://news.ycombinator.com/item?id=30764757
https://news.ycombinator.com/item?id=29839960
https://news.ycombinator.com/item?id=22406277
https://news.ycombinator.com/item?id=23897705
https://news.ycombinator.com/item?id=34639212

(Hypocrisy disclaimer: I have sites behind CloudFlare.)


Upvoted for “hypocrisy disclaimer”


I intensely dislike them taking over as gatekeepers of the web. Perhaps because my browser is configured to resist fingerprinting and to avoid running arbitrary scripts from random websites, it is very frequently blocked by Cloudflare.

As one example, I can no longer browse the site for Lowe's (big box home improvement chain). Consequently, I now buy everything from Home Depot (their competitor).

It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker. Life's too short to solve captchas for an intermediary, so I don't bother, I just find a competitor who wants my business.


> It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker.

I don’t find that astonishing at all. I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons. Not supporting the death of the anonymous internet, but it’s not happening because of incompetence.


I don't think Cloudflare is immune to organizational incompetence even if a lot of brilliant people work there. I have similar intermittent problems as ~tomwheeler, despite a mostly unchanged residential IP and a browser configuration that's only a little bit defensive.

My outsider's impression is that Cloudflare has decided to rely much more heavily on browser fingerprinting than on classifying good/bad network activity. That puts them at odds with anyone that's taken steps to oppose being monetized by advertising firms.


> I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons.

One obvious clue would be that there are no attacks coming from my IP address.


I think that both Cloudflare and the Lowe's stores of the world understand that these interventions have negative side effects. The problem is that leaving them out has even worse consequences, and no one has offered a sufficient alternative.

Put another way, one could reason that they'd prefer to do business with Lowe's because they are actively investing in security measures. Perhaps your data is more likely to be compromised at Home Depot.


It induces vomit in anyone who is on any combination of a) a slow network, b) Tor, or c) NoScript. They also fundamentally act as middlemen, the gate between users and what's supposed to be an open web. They even promote having servers run plain HTTP and they'll do the HTTPS proxying for you; you know, so that they can sniff the traffic between you and your users.


Require auth and it’ll help a ton.


One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

You could set a minimum karma threshold, but that would only promote karma farming, which is already widespread.


Reddit might be one of the last few places on the internet that holds onto the old-internet culture of pseudonyms and mindless anonymity. I don't see how changing that would benefit the company. See Twitter.


> One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

I wonder what their monthly active users look like if you filter out 1 person switching through 3 usernames/accounts for example.


Reddit wants people to visit the site, become interested in the content they see, and start participating regularly. That's not compatible with hiding enough content behind a registration wall to thwart sufficiently sophisticated scraping.


There are services out there with large pools of consumer IPs that are marketed to crawlers for exactly this reason. A lot of them use either hacked hardware or one of those free VPN browser plugins, so it would be very hard to distinguish the traffic from a legit user's.


There are residential proxies that allow you to bypass most of these things. I've been using them to crawl e.g. Amazon or Instagram without any issues, but they're expensive. IIRC something like $10/GB.


Serious question — is “residential proxies” a euphemism for “botnet?”


Yes - but legal and explicitly allowed by the user.

BrightData is the biggest of them; they run the free VPN Hola and have an SDK that app owners can install in their apps to sell bandwidth from their installs. For someone who is price sensitive, trading some residential bandwidth for a free service is pretty compelling.

I'm sure there are scummy ones, but Bright seems to require pretty explicit consent. Not affiliated, just looked into it for some apps I have, but the payouts weren't good and I didn't think it'd be a good fit for our users.


This is exactly the one I was using. Basically you’re piggy-backing on mobile phones and other devices using their free VPN software, and it’s incredibly hard to block for large websites. Combine this with some other clever tricks, and you’re basically able to do huge scrapes for not-that-much money with incredible convenience.


There are also a lot of rural/metro ISPs that offer this as a service (residential IPs) if you find the right person


No, because they are only HTTP proxies. But you don't actually know how these companies get them; rumor is that they come from browser extensions or free VPNs which users install on their devices.

The most “reputable” company in this space is Bright Data (formerly Luminati).


Essentially yes. Sometimes it also includes people who have installed “free” VPNs.


Yes and also people who install certain shady "VPN" software.


I’m curious about this too.


Last week someone here said that some of the big VPN players use botnets to get residential IP addresses. I assumed they got residential IP addresses from ISPs, but maybe not all ISPs in all parts of the world offer that.


It's possible but might end up more expensive (and definitely less reliable) than just paying whatever reddit asks for.


CAPTCHAs have been broken by primitive AI for a long time (long before GPT-4-like tools). Their only purpose is to deter lazy bots. User agents, and any other arbitrary HTTP headers, cookies, etc., have been easy to circumvent for as long as the internet has existed. The only thing that sort of works is IP reputation, but with IPv6 you can have as many legitimate IPs as you want.

tl;dr Dedicated crawlers built by sophisticated actors are more or less impossible to defeat.
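On the IPv6 point specifically, a single routed /64 is effectively an unlimited supply of source addresses. Sketch below; the prefix is a documentation example, and you'd still need the OS configured to bind to arbitrary addresses within it:

    # Sketch: minting fresh source addresses out of one routed /64.
    import ipaddress
    import random

    PREFIX = ipaddress.ip_network("2001:db8:1234:5678::/64")  # example prefix

    def random_source_address() -> str:
        # Any of the 2**64 host addresses in the prefix is "legitimate".
        host_bits = random.getrandbits(64)
        return str(ipaddress.ip_address(int(PREFIX.network_address) + host_bits))

    # Per-address reputation is useless here; defenders have to score the
    # whole prefix instead (see the /64 bucketing sketch upthread).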


It's hard to imagine captchas being a workable solution as better AI models get cheaper.

At this point computers are probably already better at solving them than humans are.


It is very easy and cheap to scale: $1 for 1,000 CAPTCHAs solved, $10 for 1,000 proxies. Then you have 1,000 "users", and these are pretty much impossible to distinguish from your typical visitors if you bother to randomize the digital fingerprint of each client to some extent. Paid APIs for publicly accessible data are not something that makes sense or works well in this world.
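Back-of-envelope with those prices (the request volume and challenge rate below are my own guesses, just for illustration):

    # Rough daily cost at the prices quoted above.
    captcha_cost_per_1k = 1.00    # USD per 1,000 solved CAPTCHAs
    proxy_cost_per_1k   = 10.00   # USD per 1,000 proxies

    requests_per_day = 1_000_000
    challenge_rate   = 0.05       # assume ~5% of requests hit a CAPTCHA

    captcha_cost = requests_per_day * challenge_rate / 1000 * captcha_cost_per_1k
    proxy_cost   = proxy_cost_per_1k  # one batch of 1,000 proxies

    print(f"~${captcha_cost + proxy_cost:.0f}/day")  # roughly $60/day under these assumptions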


Why would paid scraping services work, then?


You're right... I went off into the weeds there. It makes total sense, of course, since there is demand for data and clearly only a minority of people can scrape everything at will, even if it sounds like child's play to me. And actually it all makes sense now: paywalling your own API is just throwing some competition at the scrapers, which is totally legit. Sometimes a simple question can do a good deed, thank you :)


Ha, no worries. I ask because I've also contemplated that kind of thing and realized that I was chasing my own tail perhaps.



