The State of Web Scraping 2022 (scrapeops.io)
291 points by Ian_Kerins on Jan 12, 2022 | 144 comments



As a lawyer whose primary focus is in web scraping, this article is in many ways misleading and inaccurate. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.

In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.


Not a lawyer, but is it at least true that web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA?

I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party, JSTOR, did not wish to pursue civil action, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.

If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.


Yes, I would agree with that first sentence. After Van Buren, web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA.


Good take. IMO, ethically speaking, we should not penalize scrapers themselves but should do so based on their use.

Scraping Facebook to make a clone of profiles shouldn't be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.


Why should either be discouraged?


Cloning profiles is what seems wrong to me, but I'm not sure why it should matter whether that's done via scraping or not.


With cloned profiles (or any data obtained and shared without your consent) it will be harder for you to exercise your right to be forgotten, for example.


How many contracts does Google breach scraping billions of pages every month?


Google doesn't have to proactively try very hard to ingest sites. If something is difficult for Google to scrape, they don't spend loads of engineer hours on getting it to work. They just leave the site out, and the webmaster will quickly bend over backwards to make sure Google can scrape them. When something gets scraped into Google inadvertently, it's because the website made not even the slightest effort to protect itself.


Given the nuances of browsewrap contract enforceability, perhaps not as many as you suggest. The tricky part with navigating this gray area is knowing the likely circumstances when a contract of adhesion may give rise to an actual legal claim. There are patterns.


So at the scale of Google, 'not many' would still be a few million per month? And all is good then, right? Even you probably use their scraped data daily and are totally fine with that, right?

You think Google's bots read contracts before scraping a website? Really? :) If you had any experience in creating websites and launching them online, you would know how fast and often they arrive and how little they care about your TOS. So the real 'violation' numbers might be very scary...for you.

https://ironcladapp.com/journal/contract-management/are-brow...


They read robots.txt, right? You can easily add a Disallow rule for Googlebot there.
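
For example, a minimal robots.txt that opts a whole site out of Google's crawler looks like this (Googlebot is the published user-agent token; "Disallow: /" covers every path):

    User-agent: Googlebot
    Disallow: /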


What if the scraping occurs as part of web crawling?

Suppose I point a scraper at site S1, which has terms of service that say scraping them is OK, and my scraper finds a link on S1 to S2 and follows that, and follows a link from S2 to S3, and so on.

At some site Sn far enough down that chain is it really possible to use the scraper accessing that site to infer my intent to accept Sn's contract? The connection between me and Sn seems tenuous enough that it might be hard to even argue that I intended to visit Sn, let alone use that to infer acceptance of their contract.


Interesting! I'm not a lawyer, so the content for this piece was based on commentary in the article below. It was written by their lawyer, but I would love to hear your counterpoint to it. Always good to get multiple viewpoints on something.

https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...


The Zyte article isn't inaccurate; it's just a simplified assessment of a complicated issue. If you'd like a more nuanced perspective on this, please read my guest post on Prof. Goldman's blog.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...


Is there a good blog or something that tracks these cases?


Prof. Eric Goldman's blog is probably the #1 site historically on scraping and the law. I've contributed to it a few times.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!


I agree that Eric's blog is great for getting updates on what's going on, and I've been following it for years. But he is very one-sided in his opinions about decisions, particularly on controversial issues like section 230. I have to remind myself he's an academic (though at a law school) and I'm not just reading some defense firm's memos.


Eric is brilliant, and he has an encyclopedic knowledge of internet law. He's also an incredibly kind, generous, and open-minded person. That said, I will refrain from any commentary on Section 230, as I have zero expertise on that issue!


Enjoyed reading your bio on your website. Sub 24 hour at Leadville is super impressive! (Coming from someone who has not managed 24 hours at Western States... Yet...)


Leadville is just 45 minutes up the road for me, so I'm kind of cheating!


Is there a good blog post or summary that I could read?



Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/


I'd guess that for many popular scraping use cases this is not really useful, as it's usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.


Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?


Common Crawl is missing far too many URLs for it to be useful in a real world scenario.


But can't you add to their index?


No. You can add to the Wayback Machine at web.archive.org via their "save page now" interface... Common Crawl is attempting to be a sample of the web, and doesn't take URL suggestions.


I wish web.archive.org had an index by someone like common crawl. There is lots of great stuff on archive.org


web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
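
For anyone curious what's under the hood, both archives expose a CDX-style HTTP API. Here's a minimal sketch against the Wayback Machine's endpoint (the target URL and limit are placeholders; with output=json the first row returned is the list of column names):

    import requests

    # Query the Wayback Machine's CDX API for captures of a page.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": "example.com", "output": "json", "limit": 5},
    )
    rows = resp.json()
    header, captures = rows[0], rows[1:]  # first row lists the field names
    for row in captures:
        print(dict(zip(header, row)))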


Hey! I was using your tool a couple months ago. It was super helpful for my project.


Thanks! I rarely hear from users, great to hear from you!


They do, and it's better than Common Crawl's in my testing.


That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data that changes from day to day or week to week, but I do think this would be fantastic for a project looking at data across a longer timeframe.


That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea of looking at the links to identify whether they are business or non-business websites.


I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.


This has a lot of good info on how Cloudflare and others work, and more creative ways to bypass them if the easier options don't work: https://incolumitas.com/2021/05/20/avoid-puppeteer-and-playw...


I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.


> optimised headless browser with good proxies instead

Are you saying you only had problems because you didn't use a headless browser before, and that now, with both a headless browser and proxies, it generally suffices to not be seen as a scraper?


I think it will eventually go the way of stock trading. If you have a good strategy, you don't want to share it with the world, because that would render your strategy useless.


Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?


Headless Chrome [0] and alpine-chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks, and other VPNs.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome
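
For reference, plain headless Chrome can dump a rendered page straight from the command line, with no driver library (a sketch using commonly documented flags; the binary name varies by platform, e.g. chromium or google-chrome):

    # Render the page in headless Chrome and write the final DOM to a file.
    chrome --headless --disable-gpu --dump-dom https://example.com/ > page.html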



Do you have any recommendations for the "good proxies" you mentioned?


With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

> This outcome was great news for web scrapers, as it means that so long as a website has made its data public, you are not in violation of the CFAA when you scrape that data, even if it is prohibited in some other way (T&Cs, robots.txt, etc).

Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.


100% agree; scraping should always be done respectfully:

- If they provide an API, then use it.

- Don't slam a website; ideally, spread requests out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that. (A minimal sketch of these points follows below.)
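
A minimal sketch of a scraper along these lines, assuming Python with the requests library (the site, bot name, pages, and delay are all placeholders):

    import time
    import urllib.robotparser

    import requests

    BASE = "https://example.com"  # placeholder target site

    # Honor robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()

    for path in ["/page1", "/page2"]:  # placeholder pages
        url = BASE + path
        if not robots.can_fetch("my-polite-bot", url):
            continue  # skip anything the site has disallowed
        resp = requests.get(url, headers={"User-Agent": "my-polite-bot"})
        # ... parse resp.text here ...
        time.sleep(5)  # spread requests out instead of slamming the site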

Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also from a cost-and-resources point of view. Scraping data is resource intensive, and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.

The other thing here as well is that a lot of the most popular sites being scraped are massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.


Don’t get my home address, name, family members’ names, salary, and cell phone number, aggregate and sell them, and claim “it’s all publicly available anyway”.


If you post that data on a public domain, it is publicly available. It's like writing that info on a piece of cardboard, putting it in the town square, and then saying 'why are you people stealing my data!'


I disagree because there is a difference between posting something publicly for humans and posting something publicly for bots/large scale analysis. I'm ok with my employer possibly being able to see whether I am looking for a new job or not on LinkedIn if that means they would need to have a human looking at my LinkedIn page. I am not ok with them training some ML algorithm to monitor my LinkedIn page to determine how likely I am to leave the company at all times.

Another danger is when public but not easily accessible data is able to deanonymize datasets which is probably the norm rather than the exception for anonymized datasets. Sure there are technical measures to make it better, but at the end of the day I think a lot of privacy is about respecting social boundaries and not breaking these protection measures even if technically possible. Most of the time, these measures are really about keeping honest people honest and not about stopping dedicated attackers.


I have quite conscientiously never posted most of that information publicly, and yet it is for sale.


You don't even need to do that: go overt, in plain sight, in their face, and call yourself a search engine!


Haha, I love how people forget that Google/Bing are out there scraping everything, while anyone who scrapes anything for any other reason is a "bad guy".

You can get around some web scraping blockers just by setting your user agent to Googlebot, which I find funny...
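
For example, with Python's requests library (the UA string is Googlebot's published desktop token; whether a given site waves it through is another matter):

    import requests

    # Pretend to be Googlebot; some naive blockers let this straight through.
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                      "+http://www.google.com/bot.html)"
    }
    resp = requests.get("https://example.com/", headers=headers)
    print(resp.status_code)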


That was a cheap way to read the FT.com!


Haha, nice hack!


No they don't, Google and Bing respect robots.txt. Most websites would open it up to them because they need the traffic, so it's a type of scraping that is beneficial.

Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.

"It's public" is a legal defense, not an ethical one. It's public for readers, not for scrapers. It's public within the original context of the website, which may include monetization.

Photographing every page of a book and then reading it that way may be legally allowed, but it's still unethical.

I have somebody in our neighborhood that instead of paying for private trash, takes tiny bags of his private trash to the park and dumps it into the public trash cans.

Legal? Yes. Parasitic behavior? Also yes.


No. robots.txt is not something that is defined and enforced by law. Just because someone came up with a 'recommendation' like robots.txt does not mean it is the law.


> Legal? Yes. Parasitic behavior? Also yes.

You failed to make a meaningful counterpoint; the legal/ethical distinction was made clear in the parent post.


As a matter of fact, robots.txt is a well understood expression of intent which is legally meaningful in a lot of contexts.


> Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.

I suppose it just comes down to your own morals, but I see nothing at all unethical about scraping a site for personal use, provided that it's done gently enough to avoid DoS or disruption. The idea that saving webpages to read later is parasitic or unethical if a website uses robots.txt to discourage commercial scrapers and data-mining goes way too far.


You're really taking the most innocent stance possible on scraping.

The article talks of large scale scraping, which includes all kinds of bypassing tools, proxies, hardware, or commercial services that abstract this away.

This industrial scale level of scraping is not the same thing as you saving a local copy of 3 web pages. The scale is a million times bigger and for sure it will not be for personal use.


What you fail to acknowledge is that Bing, Google, etc. have an effective monopoly on search. They can afford to respect robots.txt because everyone wants them to scrape their site.

The first mover advantage is so huge in this case that without allowing scraping, it's hard to understand how anyone could ever compete with these monoliths.


robots.txt isn't what's keeping a newcomer from challenging Google.


> No they don't, Google and Bing respect robots.txt.

They don't.


Correct.

I myself wrote a webserver (albeit a specialised one, for curiosity). I also created a few pages which were in no way accessible unless you knew their web addresses; there were no links to these pages from the home page or anything, and I didn't even tell anyone about them. Yet in my logs, I could see those webpages were being spidered!

My robots.txt was set up as an instruction to proceed no further, so I think there are other feedback mechanisms guiding the spiders, but I haven't worked out whether it's the web browser or actual infrastructure like switches or routers.

Admittedly this was before HTTPS became common.


From https://developers.google.com/search/docs/advanced/robots/in...

> [...] Googlebot and other respectable web crawlers obey the instructions in a robots.txt file [...]

If you're saying this is a lie, please provide sources


On an eCommerce site I'm responsible for, I changed some links from a GET to a POST. "BingPreview" continued hitting those links with GET requests, polluting my logs with hundreds of "method not allowed" entries. So I blocked that UA from those links; nothing changed. I banned the bot altogether; it still hit my site. This went on for well over a year.


I believe BingPreview is acting like a regular user - so it is not behaving like a Robot but like a user.


What does that mean exactly? An actual user can't be involved because the links that trigger a GET simply aren't there anymore. Therefore I assume it's a bot hitting faulty links it finds in its cache.


This sort of implies that the "ethics" end up meaning you shouldn't scrape if it is not wanted, although I suppose there can be ethical or other non-commercial reasons that mean you should.


I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!


Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.


> many terms of service prohibit interacting with their website in an automated fashion,

Ignoring the fact that I didn't agree to anything just by virtue of requesting a page from a webserver (and, your server sent me the data!), that's such a meaningless phrase that it's certainly unenforceable. What is an automated fashion? Do I have to manually craft my HTTP request by hand-pulsing a voltage on an Ethernet cable, or do I have your permission to let Chrome automate that for me?


Exactly this. People do not realize that when they use Chrome to view a website, Chrome is their 'scraper'.

And the goal of web scraping is not to get illegal data, but to gain efficiency and performance by not doing something manually and instead letting the computer do the repetitive tasks. It's a productivity tool. You can't make something illegal just because it's automated instead of a 'manual' operation.


Are you a lawyer? Your opinion doesn't really mean anything if you still lose the case in the end. By your logic there isn't a clear way to define DDoS either. Sounds like there is, though?


> there isn't a clear way to define DDoS either

It isn't clear to me that there is. The difference seems to lie in intent.

You could maybe nail a group making many requests without using the data for anything, as making many spurious requests suggests ill intent, I suppose. Maybe having dedicated servers for such a task proves it even more?


Because those terms are the law and can't be ignored in almost all the rest of the world...


Cloudflare's blocks get in the way of many websites that are simply trying to get a "link preview" of a page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.


My toolbox of choice for web scraping is either Nokogiri or Puppeteer.

Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?


One great Scrapy feature is caching page content. So you can essentially write a crawler, and while that's running, you write your extraction code. Then, if you want to go back, you can add more extractors and run them against your local copy.
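
For reference, that's Scrapy's built-in HTTP cache; a sketch of the relevant settings.py entries (the setting names are Scrapy's documented ones, the values are just examples):

    # settings.py
    HTTPCACHE_ENABLED = True          # cache every downloaded response
    HTTPCACHE_DIR = "httpcache"       # stored under the project's .scrapy/ dir
    HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire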


Ah interesting, I end up doing this manually, i.e. File.write followed by what I want to scrape


I believe Scrapy has somewhat intelligent cache-control options - maybe it could be recreated in a few dozen lines of code, maybe a few hundred. But there are a huge number of these types of features - it's basically a Swiss Army knife.

Examples include rotating proxies, rotating user agent headers. Hooks to add in middleware for processing pipelines. CLI switches to change your data output format. Nice debugging and logging.

Other large scale features include distributed crawlers. Scheduling. Monitoring UI so you can see progress via a web UI.

It’s what I reach for first, because you can be up and running with your first scraper in an hour. By hand, that’s maybe 10 minutes - but if you want to iterate, and your first scraper is a v1 rather than a final effort… I think it’s definitely worth it.


What does HN think of web scraping for the purpose of price comparison?

I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very very happy. Even the vendors started contacting to be listed on the comparison.

But I am unable to make a business out of it other than a few affiliate commissions.


I worked for a company that did exactly this many years ago. (They were even able to partner with some retailers.) Their product worked well, yet they still went out of business long ago. To be honest, I don't see much value in such a service; not that it doesn't exist, it's just hard to justify paying for this data.


If anyone has anything else they think was missed or should be included then let me know!


I have been interested in web scraping lately but never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc.) than the tutorials that are basically "install Beautiful Soup and get data from a tag"?


Genuine question but, what more do you need?


Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.

Any good ideas?


You can do it as a service, but that is highly competitive and basically trading time for money. The best way is to productize it:

- build an on-demand data API for a specific type of data and charge a premium for it. A good example is https://serpapi.com/, who do Google data and charge a ~10X markup on proxy costs

- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions are doing $100k+ per month.

- build a tool that uses web-scraped data, analyses/filters it, and displays it to users. Lots of the biggest web scrapers are doing this, e.g. product monitoring tools for e-commerce companies. Lots of competition there, but you can do it in new markets, like NFTs, etc.

- hedge funds will pay huge money for web data if you have 5 years of continuous data they can backtest against.


> build a tool that uses web-scraped data, analyses/filters it, and displays it to users. Lots of the biggest web scrapers are doing this, e.g. product monitoring tools for e-commerce companies. Lots of competition there, but you can do it in new markets, like NFTs, etc.

Do you have any examples of such sites?

> hedge funds will pay huge money for web data if you have 5 years of continuous data they can backtest against.

What kind of web data would they be interested in?


For a while I had a hobby project that would scrape real estate websites listing properties in my city. The goal was to figure out trends and pricing data and to find good deals. Eventually the site added those features itself (heatmaps based on prices, for example).

With all that data you can do stuff like make heatmaps from pricing data and figure out the most attractive areas for certain profiles (singles, families, ...). You could then mash up that data to produce things like a "Walkscore", or let people indicate what's important to them (green areas, bars & restaurants, time and distance to other destinations, even crime levels) and then show real estate that meets their criteria.

Some sites in the US already show this, but in other countries that's not the case, even though the data's all there just to grab.

Most likely it wouldn't be legal, and certainly not if you made money from it. But it's incredibly fun and hugely useful. Maybe that could get you started on some ideas!


I know many people that follow limited/exclusive releases for things like Yeezy/Air Jordan sneakers as well as PS5s and graphics cards.

They pay $500/mo for access to a bot that will allow them to make these purchases.

Most of the community lives on discord.


It would be relatively easy to solve this problem if the original supplier wanted the problem to be solved. Instead of releasing a batch of inventory at a certain time, run a raffle over a week or two and then randomly select folks to allow to purchase the item.


Wouldn’t a queuing system be more fair?


By queuing, do you mean first come first serve?

No, that causes the problem. That encourages people to use bots to be the first one to purchase the moment the inventory is released.

I don't understand how a random raffle would ever not be fair (with the assumption that one person gets only one entry)


That assumption doesn't seem like it'd hold. You'd just replace bot services with package forwarding services that can generate unique PO Box numbers or whatever.


I understand people using bots to snipe PS5s and GPUs; these have real economic value and actual usage.

But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!


> But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!

There's a whole sneaker collecting subculture. Some buy and wear while others just collect. The big names in sneakers do release limited production models or limited runs of certain color combinations.

Similar to any other collecting subculture.


Same as NFTs: hype and resale value. At least you can wear the sneakers once you've stopped flipping them.


Economic value and actual usage.


What economic value (other than hoping for the value to increase, aka tulip mania) can I derive from ultra-rare sneakers? What usage goes beyond "it looks cooler than an unbranded, otherwise identical sneaker"?

For me, this kind of product is part of the "bullshit economy" - similar to "bullshit jobs", this kind of product has no reason to exist other than vanity, as almost all of these "collectibles" won't ever be used. We are using up valuable, finite resources to create and distribute this kind of useless "bullshit product", we are using up valuable human time and IT resources on developing websites capable of resisting (D)DoS attacks and on developing snipers to bypass the anti-bot technologies employed by the shops, and we are creating a lot of demand for all kinds of sneaker-related crime - and there's a lot of that: theft and robberies from stores, theft and robberies in the supply chain, eBay/classifieds scams, credit card fraud, robberies in broad daylight [1].

Seriously, fuck all that shit. No one needs hundreds of dollars worth of sneakers that only incentivize crime and bullshit.

[1]: https://www.google.com/search?q=man+robbed+because+of+sneake...


I have a project that will be fueled by scraping. We should chat. :)


I do tons of scraping as well, let me know if you need extra hands.


Hey, I'm in the middle of building a large scraping application - would you mind if I asked you for some advice? Email in bio!


Separate from web scraping, there is the use of automation to perform normal, allowable user actions on a site. That should be considered distinct from large-scale data extraction, no?


The web scraping ecosystem is growing, with more libraries, frameworks, and products available than ever before to simplify our web scraping headaches, so the future is looking bright.


For those in this thread with super-serious experience scraping and automating at scale who are looking for (ethical!) work: please contact me directly.


I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners who fights with web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.


I think it really depends on the application of web scraping. (As someone who does what is, in my mind, ethical web scraping.)

- Scraping public information from government websites to do analysis: ethical, it's the public's data

- Scraping to help a company's customers more effectively use that company's product, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical

- Scraping faces to build a surveillance-tech company: disgusting

- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical

- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical


The first one here is important. Despite the open data movement pressuring governments to provide their data in easily consumable forms, a lot of government organizations are still unable or unwilling to do so.

Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.


Yes, and so do research orgs. My organization does a lot of scraping because we deal with local election data and that's. Uh. Let's just say that if all counties had websites that were like Web 1.0, that would be an improvement over the current situation.


- Scraping faces to find missing persons: ethical

- Scraping photos to create deep learning VQGAN+CLIP art generator: ethical

.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.


I don't think it is outlawed, though, at least in practical terms; no one is going to sue you for scraping government websites. You really only think about the legal aspect when you do it for commercial gain.

It would be interesting to know if that data can be used in a court case against a government agency though.


> - Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

Wanted to include a slightly different application:

- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is


So it's really not about the "scraping" here, it's about the kind of business you're building. I don't think any of your definitions change if you simply employed people to check the websites instead of scripts.


Re government websites: they're often terrible. I've occasionally contemplated a side project just to scrape and restructure some local/state websites into a usable form with search and whatnot.


And if you manually copy someone's data they worked hard to generate to go and resell, then it's ethical?


Google is web scrapper number one, as is any search engine. Making web scrapping illegal means making search engines illegal.

You do not want information to be public and/or free? Put it behind a login and charge for it.

You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing, copyright, and the law.

However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.


Google does do some things that aren't great for website owners too. Like "rich snippets", where they present the information from your page right to the end user, leaving that end user with no reason to visit your site.

And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.


Maybe if all the useful content on your site can fit into a snippet I don't want to visit it?


Maybe the useful content is something you don't know is there, so you settle for what's in the snippet. Because you imagine Google's AI surely extracted the right bits.

There's also a sort of diminishing returns effect here. If google trains people that the snippet is good enough, less traffic goes to the site. Eventually, enough to shutter the site, for some sites. Then nobody has the info.

The pattern has already affected Google referral traffic to Wikipedia. Pageviews for Wikipedia are roughly flat from 2012 to today, whereas they had marked growth before that. 2012 is when Google started rolling out its knowledge graph, which presented Wikipedia data directly.


Yes, it would be preferable if people were more curious and willing to explore topics in depth. But sometimes all you want to know is what's the capital of Moldavia. Ideally the web would be about easy access to relevant information, not a competition for harvesting page views.


Ok. FWIW, I'm not talking about simplistic facts. Rich snippets are often multiple paragraphs. And I understand the distaste for harvesting page views, but websites are hard to maintain without visitors too.


That always struck me as unethical as well.


What if Google didn't scrape websites automatically, and waited till users submit their domains to them, to mark that they want to be scraped? I think in that case, most users would still submit their domains there, because they want to come up in Google search. You might want your website to be scraped by some people/companies and not by others, but not have to put everything behind a login screen (which some determined scrapers would still try to breach in some way).


NB: It’s “scraping”, not “scrapping”.


Google is a crawler, not a scraper; these are two totally different things.


A crawl requires "extraction" of data from a web page, which according to Wikipedia is part of the definition of so-called "web scraping". Even if a crawler is using a sitemap.xml file, it still has to "scrape" (retrieve and extract from) that file first. It seems crawling always requires scraping.

If all the pages to be retrieved are known a priori, before retrieval begins, then one would likely call that "scraping". Whereas if not all pages are known before retrieval begins, then one would likely call that "crawling".
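
To make the distinction concrete, here's a rough sketch in Python (requests and BeautifulSoup are assumed; "scrape" pulls data from one known page, while "crawl" discovers pages as it goes, and note that every crawl step still scrapes):

    import requests
    from bs4 import BeautifulSoup

    def scrape(url):
        # "Scraping": retrieve a known page and extract data from it.
        return BeautifulSoup(requests.get(url).text, "html.parser")

    def crawl(start_url, max_pages=10):
        # "Crawling": the set of pages isn't known a priori; links found
        # on each page feed the frontier of pages still to visit.
        seen, frontier = set(), [start_url]
        while frontier and len(seen) < max_pages:
            url = frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            soup = scrape(url)  # every crawl step still scrapes
            frontier += [a["href"] for a in soup.find_all("a", href=True)
                         if a["href"].startswith("http")]
        return seen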


> I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax, and snail mail to veterans' service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data, or wanted $millions for that capability. Scraping was the best way, and in some cases the only way.

By the way, search engines like Google scrape data and index it.


Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:

- Google: Ahrefs & SEMrush scrape Google so they can provide SEO analytics to companies looking to grow. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMrush, whose products increase the value of the Google SERPs ecosystem.

- Amazon + other e-commerce: Amazon wants brands and 3rd-party stores to list products on their site, and the companies scraping Amazon to provide product placement tools for their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products there.


> Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

Archiving is unethical?


Good point; I wouldn't say archiving is unethical at all... I was thinking more along the lines of someone scraping an entire segment of a website's data and reproducing it 1-for-1 on their own site with zero value added.

I don't think we can make broad statements saying that web scraping is ethical or unethical; it isn't that black and white. It really depends on what is being scraped, how it is being used, and the intention of the scraper.


Do you provide an API, paid or not, for the same data? An API which might even have limitations on use makes scraping a bit less defensible in my mind, but if you're offering something for free to the public and then getting upset when people take and use that free info, maybe free isn't the right business model, or maybe you should look into what those people are using that scraped data for and see if you can offer it better and cheaper.

The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.

As someone who had to do scraping in the past, and who went through having a free, open API that served our needs perfectly get replaced with an account-based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to requests for specific business accommodations around the data.


Here are two use cases for why I scrape YouTube.

- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.

- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.


Madness that they haven't gotten back to your access request in 6 months!

Why even bother having the API there - so much value can be added by people building on top of YouTube and other large sites. It's a shame that most of these large sites do nothing to provide API access and people have to go out of their way to scrape them...


There are pro-social and anti-social uses of web scraping. If you have ever used Kayak or any other price discovery or price comparison website, you've relied on web scraping to provide you a service.


Also Google, or any other search engine.


I believe Kayak has agreements with the sites they scrape though. So it's a different type of "scraping", really.


When I want to do web scraping, it's because I have an idea to build on top of the content of the website I would like to scrape.

Let's say you made a recipe website, and I would like to build an app that orders the ingredients for a meal.

It would be useful to extract the recipes, so that I can create experiences like users picking a meal and having the ingredients delivered.

I guess I can't show your recipes, as that could be copyright infringement, but I can link to you and sell the tomatoes.

Also, while copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure air pollution. Web scraping only comes into it because the semantic web never happened.

I think we should all be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as ours.

For example, I would like to be able to create an app with Netflix content, but obviously I don't expect to be able to stream their content as if it were mine. What I should be able to do is create an app, with an experience designed by me, that lets you stream their movies if you pay them.


Because there would be no Internet search - no search engines, no Google Search, and essentially no Internet bigger than a hobbyist ARPANET - without web scraping.


> people who are trying to build businesses on top of other people's innovation and data

How would scraping, say, reddit, differ from the business model of Reddit itself?

> those that are doing it to my platforms are doing so solely to steal data

What kind of data are you talking about?


Scraping itself isn't universally unethical. Google and Bing scraping websites to make information easier to access is fine, and scraping and analysing government data is even better. Public data should be public, after all.

However, the disgusting data brokers that employ most of the custom scrapers are usually unethical. That's why I don't trust any person or company that admits to being professionally involved in "scraping", because most of the time that means "we collect personal information that got leaked elsewhere and sell it on".


If we want to take the unethical route, I’d argue not providing an API (paid or free) is unethical and a nasty business practice.

I work for an ecommerce company, and we scrape competitors for price information. Should this automated process using APIs not be okay, we'll have humans do it. Less efficient for us, more traffic for a competitor. Should they provide a paid API with price information available, I'm sure we'd pay.


I think if you make intangible things public you shouldn’t consider them to be only yours anymore.



