
> But companies that “crawl” Reddit for data

It's not possible to hide crawling at a large enough scale, right? At some point, certain IPs/user agents will be (or should be?) hit with CAPTCHAs before they can access content, and no amount of user agent/cookie/session/whatever spoofing will get around that, yeah?




IP restrictions are easy to overcome using mobile networks. Basically, mobile networks assign your device an internal IP and NAT out to a very small pool of public IP addresses. If they block you, they also block a very large chunk of legitimate mobile users. I'm a big ol' dummy when it comes to networking, so I imagine I explained something poorly... so any mobile network nerds feel free to pile on!
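Roughly the idea, as a toy sketch (the addresses are documentation examples, not any real carrier's pool):

    # Toy illustration of CGNAT: many subscriber devices share a handful of
    # public addresses, distinguished only by the port mapping.
    import random

    PUBLIC_POOL = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]  # tiny example pool

    def nat_binding(private_ip: str, private_port: int):
        """Map a subscriber's (private IP, port) onto a shared (public IP, port)."""
        public_ip = random.choice(PUBLIC_POOL)
        public_port = random.randint(1024, 65535)
        return (private_ip, private_port), (public_ip, public_port)

    # Blocking 198.51.100.1 blocks every subscriber currently mapped onto it,
    # not just the one scraper.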

Captchas are super easy! There are a gazillion captcha bypass services for every type of captcha. Just snag the captcha challenge, send it to the bypass service's API, and you get back a solved token.
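The flow with most of these services looks roughly like this. Sketch only: the solver.example endpoints and field names are made up, since every provider has its own API.

    # Rough sketch of the usual solver-service flow. The "solver.example"
    # endpoints and JSON fields are placeholders, not any real provider's API.
    import time
    import requests

    SOLVER = "https://solver.example/api"
    API_KEY = "your-solver-account-key"

    def solve_captcha(site_key: str, page_url: str) -> str:
        # 1. Hand the challenge (site key + page URL) to the solving service.
        job = requests.post(f"{SOLVER}/tasks", json={
            "key": API_KEY,
            "site_key": site_key,
            "page_url": page_url,
        }).json()

        # 2. Poll until a worker (human or model) returns a solved token.
        while True:
            result = requests.get(f"{SOLVER}/tasks/{job['id']}",
                                  params={"key": API_KEY}).json()
            if result["status"] == "ready":
                return result["token"]
            time.sleep(5)

    # 3. Submit the returned token wherever the target expects it,
    #    e.g. as the g-recaptcha-response form field.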

See CGNAT for more details about mobile networks. https://en.wikipedia.org/wiki/Carrier-grade_NAT#cite_note-of...

It's pretty much impossible to stop the top 1% of the most dedicated scrapers without affecting end user experience.


> IP restrictions are easy to overcome using mobile networks.

Only if the connection is over IPv4.

The mobile networks were among the first major adopters of IPv6, and most now give each device a unique IPv6 address.
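The flip side (my assumption about how defenders commonly handle v6, not anything a specific site has published) is that rate limiting is usually done per /64 prefix rather than per address, since a subscriber normally controls at least a /64:

    # Sketch: bucket IPv6 clients by their /64 prefix before counting hits,
    # so one device can't dodge limits by hopping across its own addresses.
    import ipaddress
    from collections import Counter

    hits = Counter()

    def record_hit(client_ip: str) -> int:
        addr = ipaddress.ip_address(client_ip)
        if addr.version == 6:
            key = str(ipaddress.ip_network(f"{client_ip}/64", strict=False))
        else:
            key = client_ip  # IPv4: fall back to the bare address
        hits[key] += 1
        return hits[key]

    # record_hit("2001:db8::1") and record_hit("2001:db8::2") both count
    # against the same /64 bucket.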


My mobile device (iPhone) relays most traffic through the nearest Akamai datacenter. So they don't get my IP address. And that datacenter has a massive number of IP addresses, which are rotated.


Out of interest, how do you know it's being relayed through an Akamai DC? I assume you're talking about Private Relay, which I also use, but I thought Cloudflare was the 2nd hop for that?


They're using multiple providers including Fastly, Akamai and Cloudflare: https://www.streamingmediablog.com/2021/06/apple-private-rel...


That only applies to HTTP traffic, not HTTPS, and most web traffic is HTTPS these days.


Reddit's API is IPv4 only.


Cat & mouse game. If you’re defending against a whitehat business scraping with curl from data center IPs, sure.

Against a less-savory actor using hundreds of IPs from residential proxies/compromised hosts, you're gonna have a rough time, especially if you're unwilling or unable to use aggressive fingerprinting or (vomit) CloudFlare. Not to mention CAPTCHAs are generally already a solved problem for scrapers.
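For what it's worth, the unsophisticated version of that actor is little more than a loop over a rented proxy pool. The proxy addresses (documentation IP ranges) and user agents below are placeholders:

    # Minimal sketch of scraping through a rotating residential proxy pool.
    import random
    import requests

    PROXIES = [
        "http://user:pass@203.0.113.10:8080",
        "http://user:pass@198.51.100.22:8080",
        # ...hundreds more, typically rented from a residential-proxy vendor
    ]

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def fetch(url: str) -> requests.Response:
        proxy = random.choice(PROXIES)  # fresh exit IP per request
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": random.choice(USER_AGENTS)},
            timeout=30,
        )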


Residential proxies are a completely solved problem for companies that actually lose money to them (e.g. Ticketmaster, whose profit is maximized by blocking third-party scalpers so they can do the scalping themselves).

For companies that make money by having more MAUs, well, yeah, they're going to have a real "rough time" detecting inauthentic traffic


Why is cloudflare vomit-inducing?


While I appreciate it as an irreplaceable tool for countering DDoS, its premise is antithetical to a reliable and open web IMO, and it suffers from the same lack of accessible, customer-facing support as other big tech players. Lazy examples from HN algolia search:

https://news.ycombinator.com/item?id=32912075
https://news.ycombinator.com/item?id=17750801
https://news.ycombinator.com/item?id=22109969
https://news.ycombinator.com/item?id=30764757
https://news.ycombinator.com/item?id=29839960
https://news.ycombinator.com/item?id=22406277
https://news.ycombinator.com/item?id=23897705
https://news.ycombinator.com/item?id=34639212

(Hypocrisy disclaimer: I have sites behind CloudFlare.)


Upvoted for “hypocrisy disclaimer”


I intensely dislike them taking over as gatekeepers of the web. Perhaps because my browser is configured to resist fingerprinting and to avoid running arbitrary scripts from random websites, it is very frequently blocked by Cloudflare.

As one example, I can no longer browse the site for Lowe's (big box home improvement chain). Consequently, I now buy everything from Home Depot (their competitor).

It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker. Life's too short to solve captchas for an intermediary, so I don't bother, I just find a competitor who wants my business.


> It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker.

I don’t find that astonishing at all. I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons. Not supporting the death of the anonymous internet, but it’s not happening because of incompetence.


I don't think Cloudflare is immune to organizational incompetence even if a lot of brilliant people work there. I have similar intermittent problems as ~tomwheeler, despite a mostly unchanged residential IP and a browser configuration that's only a little bit defensive.

My outsider's impression is that Cloudflare has decided to rely much more heavily on browser fingerprinting than on classifying good/bad network activity. That puts them at odds with anyone that's taken steps to oppose being monetized by advertising firms.


> I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons.

One obvious clue would be that there are no attacks coming from my IP address.


I think that both Cloudflare and the Lowe's stores of the world understand that these interventions have negative side effects. The problem is that leaving them out has even worse consequences, and no one has offered a sufficient alternative.

Put another way, one could reason that they'd prefer to do business with Lowe's because they are actively investing in security measures. Perhaps your data is more likely to be compromised at Home Depot.


It induces vomit in anyone who is on any combination of a) a slow network, b) Tor, or c) NoScript. They also fundamentally act as middlemen, the gate between users and what's supposed to be an open web. They even promote having servers run plain HTTP and they'll do the HTTPS proxying for you; you know, so that they can sniff the traffic between you and your users.


Require auth and it’ll help a ton.


One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

You could set a minimum karma threshold, but that would only promote karma farming, which is already widespread.


Reddit might be one of the last few places on the internet that holds onto the old-internet culture of pseudonyms and mindless anonymity. I don't see how changing that would benefit the company. See Twitter.


> One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.

I wonder what their monthly active users look like if you filter out 1 person switching through 3 usernames/accounts for example.


Reddit wants people to visit the site, become interested in the content they see, and start participating regularly. That's not compatible with hiding enough content behind a registration wall to thwart sufficiently sophisticated scraping.


There are services out there with large pools of consumer IPs that are marketed to crawlers for exactly this reason. A lot of them use either hacked hardware or one of those free VPN browser plugins, so it would be very hard to distinguish the traffic from a legit user's.


There are residential proxies that allow you to bypass most of these things. I've been using them to crawl e.g. Amazon or Instagram without any issues, but they're expensive. IIRC something like $10/GB.


Serious question — is “residential proxies” a euphemism for “botnet?”


Yes - but legal and explicitly allowed by the user.

BrightData is the biggest of them; they run the free VPN Hola and have an SDK that app owners can install in their apps to sell bandwidth from their installs. For someone who is price sensitive, trading some residential bandwidth for a free service is pretty compelling.

I'm sure there are scummy ones, but Bright seems to require pretty explicit consent. Not affiliated, just looked into it for some apps I have, but the payouts weren't good and I didn't think it'd be a good fit for our users.


This is exactly the one I was using. Basically you’re piggy-backing on mobile phones and other devices using their free VPN software, and it’s incredibly hard to block for large websites. Combine this with some other clever tricks, and you’re basically able to do huge scrapes for not-that-much money with incredible convenience.


There are also a lot of rural/metro ISPs that offer this as a service (residential IPs) if you find the right person


No, because they are only HTTP proxies. But you don't actually know how these companies get them; rumor is that they come from browser extensions or free VPNs which users install on their devices.

The most “reputable” company in this space is Bright Data (formerly Luminati).


Essentially yes. Sometimes it also includes people who have installed “free” VPNs.


Yes and also people who install certain shady "VPN" software.


I’m curious about this too.


Last week someone here said that some of the big VPN players use botnets to get residential IP addresses. I assumed they got residential IP addresses from ISPs, but maybe not all ISPs in all parts of the world offer that.


It's possible but might end up more expensive (and definitely less reliable) than just paying whatever reddit asks for.


CAPTCHAs have been broken by primitive AI for a long time (long before GPT-4-like tools). Their only purpose is to deter lazy bots. User agents, and any other arbitrary HTTP headers, cookies, etc., have been easy to circumvent for as long as the internet has existed. The only thing that sort of works is IP reputation, but with IPv6 you can have as many legitimate IPs as you want.

tl;dr Dedicated crawlers built by sophisticated actors are more or less impossible to defeat.
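On the IPv6 point specifically, a single routed /64 is effectively an unlimited supply of source addresses. Sketch below; the prefix is a documentation example, and you'd still need the OS configured to bind to arbitrary addresses within it:

    # Sketch: minting fresh source addresses out of one routed /64.
    import ipaddress
    import random

    PREFIX = ipaddress.ip_network("2001:db8:1234:5678::/64")  # example prefix

    def random_source_address() -> str:
        # Any of the 2**64 host addresses in the prefix is "legitimate".
        host_bits = random.getrandbits(64)
        return str(ipaddress.ip_address(int(PREFIX.network_address) + host_bits))

    # Per-address reputation is useless here; defenders have to score the
    # whole prefix instead (see the /64 bucketing sketch upthread).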


It's hard to imagine captchas being a workable solution as better AI models get cheaper.

At this point computers are probably already better at solving them than humans are.


It is very easy and cheap to scale: $1 for 1,000 CAPTCHAs solved, $10 for 1,000 proxies. Then you have 1,000 "users", and these are pretty much impossible to distinguish from your typical visitors if you bother to randomize the digital fingerprint of each client to some extent. Paid APIs for publicly accessible data are not something that makes sense or works well in this world.
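Back-of-envelope with those prices (the request volume and challenge rate below are my own guesses, just for illustration):

    # Rough daily cost at the prices quoted above.
    captcha_cost_per_1k = 1.00    # USD per 1,000 solved CAPTCHAs
    proxy_cost_per_1k   = 10.00   # USD per 1,000 proxies

    requests_per_day = 1_000_000
    challenge_rate   = 0.05       # assume ~5% of requests hit a CAPTCHA

    captcha_cost = requests_per_day * challenge_rate / 1000 * captcha_cost_per_1k
    proxy_cost   = proxy_cost_per_1k  # one batch of 1,000 proxies

    print(f"~${captcha_cost + proxy_cost:.0f}/day")  # roughly $60/day under these assumptions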


Why would paid scraping services work, then?


You're right... I went off into the weeds there. It makes total sense, of course, since there is demand for data and clearly only a minority of people can scrape everything at will, even if it sounds like child's play to me. And actually it all makes sense now: paywalling your own API is just throwing some competition at the scrapers, which is totally legit. Sometimes a simple question can do a good deed, thank you :)


Ha, no worries. I ask because I've also contemplated that kind of thing and realized that I was chasing my own tail perhaps.



