It's not possible to hide crawling at a large enough scale, right? At some point, certain IPs/user agents will (should be?) hit with CAPTCHAs to be able to have access to content and no amount of user agent/cookie/session/whatever spoofing will get around that, yeah?
IP restrictions are easy to overcome using mobile networks. Basically, mobile networks assigns your device an internal ip and NATs out to a very small pool of ip public addresses. If they block you, they also block a very large chunk of legitimate mobile users. I'm a big ol' dummy when it comes to networking, so I imagine I explained something poorly... so any mobile network nerds feel free to pile on!
Captchas are super easy! There's a gagillion captcha bypass services for every type of captcha. Just snag the captcha token, send it in an API call, and then you get a verified captcha token.
My mobile device (iPhone) relays most traffic through the nearest Akamai datacenter. So they don't get my IP address. And that datacenter has a massive number of IP addresses, which are rotated.
Out of interest how do you know it's being relayed through an Akamai DC? I assume you're talking about private relay which I also use, but I thought cloudflare was the 2nd hop for that?
Cat & mouse game. If you’re defending against a whitehat business scraping with curl from data center IPs, sure.
Against a less-savory actor using hundreds of IPs from residential proxies/compromised hosts, you’re gonna have a rough time, especially if you’re unwilling or unable to use aggresive fingerprinting or (vomit) CloudFlare. Not to mention CAPTCHAs are generally already a solved problem for scrapers.
Residential proxies are a completely solved problem, for companies that actually lose money to them (e.g. Ticketmaster, whose profit is maximized by blocking third-party scalpers so they can do the scalping themselves)
For companies that make money by having more MAUs, well, yeah, they're going to have a real "rough time" detecting inauthentic traffic
While I appreciate it as an irreplaceable tool for countering DDoS, its premise is antithetical to a reliable and open web IMO, and it suffers from the same lack of accessible, customer-facing support as other big tech players. Lazy examples from HN algolia search:
I intensely dislike them taking over as gatekeepers of the web. Perhaps because my browser is configured to resist fingerprinting and to avoid running arbitrary scripts from random websites, it is very frequently blocked by Cloudflare.
As one example, I can no longer browse the site for Lowe's (big box home improvement chain). Consequently, I now buy everything from Home Depot (their competitor).
It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker. Life's too short to solve captchas for an intermediary, so I don't bother, I just find a competitor who wants my business.
> It's astonishing how Cloudflare can do such a poor job of determining the difference between a potential customer and an attacker.
I don’t find that astonishing at all. I can’t see how you’d disambiguate someone who is anonymous for good versus bad reasons. Not supporting the death of the anonymous internet, but it’s not happening because of incompetence.
I don't think Cloudflare is immune to organizational incompetence even if a lot of brilliant people work there. I have similar intermittent problems as ~tomwheeler, despite a mostly unchanged residential IP and a browser configuration that's only a little bit defensive.
My outsider's impression is that Cloudflare has decided to rely much more heavily on browser fingerprinting than on classifying good/bad network activity. That puts them at odds with anyone that's taken steps to oppose being monetized by advertising firms.
I think that both Cloudflare and the Lowe's stores of the world understand that these interventions have negative side effects. The problem is that leaving them out has even worse consequences, and no one has offered a sufficient alternative.
Put another way, one could reason that they'd prefer to do business with Lowes because they are actively investing in security measures. Perhaps your data is more likely to be compromised at Home Depot.
It induces vomit on anyone who is on any combination of a) a slow network b) TOR or c) noscript. They also fundamentally act as middlemen, the gate between users and what's supposed to be an open web. They even promote having servers run plain http and they'll do the HTTPS proxying for you; you know, so that they can sniff the traffic between you and your users.
reddit might be one of the few last places on the internet that hold the old times of pseudo-names and mindless anonymity. I don't see how changing that would benefit the company. see twitter
> One of the core features of Reddit is that any person may create as many accounts as they like for free. Changing that would be incredibly disruptive.
I wonder what their monthly active users look like if you filter out 1 person switching through 3 usernames/accounts for example.
Reddit wants people to visit the site, become interested in the content they see, and start participating regularly. That's not compatible with hiding enough content behind a registration wall to thwart sufficiently sophisticated scraping.
There are services out there that have a large pool of consumer IPs that are marketed at crawlers for exactly this reason. A lot of them are either using hacked hardware or one of those free VPN browser plugins so it would be very hard to distinguish the traffic from a legit user.
There are residential proxies that allow you bypass most of these things. I’ve been using them to crawl e.g. Amazon or Instagram without any issues, but they’re expensive. IIRC something like $10/GB
Yes - but legal and explicitly allowed by the user.
BrightData is the biggest of them, they run the free VPN Hola, and have an SDK app owners can install in their apps that allow selling bandwidth from installs. For someone who is price sensitive, trading some free residential bandwidth for whatever service is pretty compelling.
I'm sure there are scummy ones, but Bright seems to require pretty explicit consent. Not affiliated, just looked into it for some apps I have, but the payouts weren't good and I didn't think it'd be a good fit for our users.
This is exactly the one I was using. Basically you’re piggy-backing on mobile phones and other devices using their free VPN software, and it’s incredibly hard to block for large websites. Combine this with some other clever tricks, and you’re basically able to do huge scrapes for not-that-much money with incredible convenience.
No because they are only HTTP proxies. But you don’t actually know how these companies get them, rumor is that they are part of browser extensions or free VPNs which users might install on their devices.
The most “reputable” company in this space is Bright Data (formerly Luminati).
Last week someone here said that some of the big VPN players use botnets to residential IP addresses. I assumed they got residential IP addresses from ISPs but maybe not all ISPs in all parts of the world offer that.
CAPTCHAs have been broken by primitive AI for a long time (Long before GPT4-like tools). Their only purpose is to deter the lazy bots. User agents, and any other arbitrary HTTP headers, cookies, etc. have been easy to circumvent as long as the internet has existed. The only thing that sort of works is IP reputation but with IPV6 you can have as many legitimate IPs as you want.
tl;dr Dedicated crawlers built by sophisticated actors are more or less impossible to defeat.
It is very easy and cheap to scale. 1$ for 1000 captchas solved, 10$ for 1000 proxies. Then you have 1000 users, and these are kinda impossible to distinguish from your typical common users if you cared to randomize the digital fingerprints for each client to some extent. Paid APIs for publicly accessible data are not something that makes sense or works well in this world.
You're right... I went off-lane there. It makes total sense of course since there is a demand for data, and clearly, just a minority of the people can just scrape everything at will, even if it sounds like kids play to me. And actually, it all makes sense now, since pay walling your own API is just throwing some competition to scrapers, which is totally legit. Sometimes a simple question can do a good deed, thank you:)
It's not possible to hide crawling at a large enough scale, right? At some point, certain IPs/user agents will (should be?) hit with CAPTCHAs to be able to have access to content and no amount of user agent/cookie/session/whatever spoofing will get around that, yeah?