Web Scraping 101 with Python (scrapingbee.com)
392 points by daolf on Feb 10, 2021 | 131 comments



Before jumping into frameworks, if your data is lucky enough to be stored in an html table:

    import pandas as pd
    dfs = pd.read_html(url)
Where ‘dfs’ is a list of DataFrames - one for each HTML table on the page.

https://pandas.pydata.org/pandas-docs/stable/reference/api/p...


Sometimes it's also helpful to use beautiful soup to isolate the elements you want, feed the text of the elements into StringIO and give that to read_html.
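
Something like this, for anyone curious - a minimal sketch assuming requests and BeautifulSoup are installed (the URL and the table id are placeholders):

    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/stats").text
    soup = BeautifulSoup(html, "html.parser")

    # Isolate just the table you care about, e.g. by id or class.
    table = soup.find("table", {"id": "results"})

    # read_html still returns a list of DataFrames, here with a single item.
    df = pd.read_html(StringIO(str(table)))[0]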


Yes, this is a good idea for more complicated cases.


We made a chrome extension that queries any html table in any open tab with SQL:

https://chrome.google.com/webstore/detail/sqanything/naejbcf...

You can export the results to Google Sheets too. One advantage of the extension is it works with JS rendered tables.


Handy! I'm also a big fan of pd.read_clipboard() for specific selections.


Holy crap, is there anything pandas can't do?


Ingest bamboo? Sorry, couldn't resist.


Woah. I’ve used pandas a fair amount and had no idea about this. Thank you!


+1 this has saved me countless hours


what does this do


It reads HTML and returns the tables contained in the HTML as pandas dataframes. It’s a simple way to scrape tabular data from websites.


I've been involved in many web scraper jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, was using scrapy. I went with XML tools for controlling the DOM.

It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an ec2 instance. It checks its tags to see if it should execute the script. If so, it does so. When it's done with its scraping for the day, it turns itself off.

This is a tiny snapshot of why it's been so difficult for me to go from python2 to python3. I'm strongly in the camp of "if it ain't broke, don't fix it".


Using `2to3` might get you 80% of the way there. Although cases like this make tests really valuable.


why can't you just keep using python2? surely some people out there are interested enough to keep updating and maintaining it?


I certainly can keep using it. There have been so many efforts to get people to update Python 2 code to Python 3 code that it's on my backlog to do it. Will I get to it this year? Probably not.


"I send a command at a random time between 11pm and 4am to wake up an ec2 instance."

Any chance you could tell me your setup for this?


Not my project, but if I had to do it I'd try something like the following:

* Set an autoscaling group with your instance template, max instances 1, min instances 0, desired instances 0 (nothing is running).

* Set up a Lambda function that sets the autoscaling group desired instances to 1.

* Link that function to an API Gateway call, give it an auth key, etc.

* From any machine you have, set up your cron with a random sleep and a curl call to the API.

And that should do the trick, I think.
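
For the Lambda piece, a rough boto3 sketch could look like this (the ASG name is a placeholder, not anything from the original setup):

    # Rough sketch of the Lambda step with boto3 (the ASG name is a placeholder).
    import boto3

    def handler(event, context):
        asg = boto3.client("autoscaling")
        # Bump the desired capacity to 1 so the ASG launches the scraper instance.
        asg.set_desired_capacity(
            AutoScalingGroupName="scraper-asg",
            DesiredCapacity=1,
            HonorCooldown=False,
        )
        return {"status": "instance requested"}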


>From any machine you have, set up your cron with a random sleep and a curl call to the API.

You might as well just call the ASG API directly.


Why use autoscaling and not just launch the instance directly from lambda? The run time is short so there's no danger of two instances running in parallel


the tl;dr for all web scraping is to just use scrapy (and scrapyd) - otherwise you end up just writing a poorer implementation of what has already been built

My only recent change is that we no longer use Items and ItemLoaders from scrapy - we've replaced it with a custom pipeline of Pydantic schemas and objects


One tip I would pass on when trying to scrape data from a website: start by using wget in mirror mode to download the useful pages. It's much faster to iterate on scraping the data once you have it locally. It's also less likely to accidentally kill the site or attract the attention of the host.
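
If you'd rather stay in Python than shell out to wget, a minimal fetch-once-and-cache-to-disk sketch could look like this (the URL handling and cache directory are placeholders):

    # Minimal "fetch once, iterate locally" sketch if you'd rather stay in Python
    # than shell out to wget. The cache directory name is a placeholder.
    import hashlib
    import pathlib

    import requests

    CACHE_DIR = pathlib.Path("cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def fetch_cached(url: str) -> str:
        """Return the page HTML, hitting the network only on the first request."""
        path = CACHE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")
        if path.exists():
            return path.read_text(encoding="utf-8")
        html = requests.get(url, timeout=30).text
        path.write_text(html, encoding="utf-8")
        return html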


Definitely! Scrape the disk, not the web.


Or just use scrapy's caching functionality. Super convenient.
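
For reference, turning it on is a couple of lines in the project's settings.py (the expiration value here is arbitrary):

    # settings.py -- enable Scrapy's built-in HTTP cache (expiration is arbitrary).
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 86400   # re-fetch pages older than a day
    HTTPCACHE_DIR = "httpcache"         # stored under the project's .scrapy directory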


That only works for static pages though. Many modern pages require you to run Selenium or Puppeteer to scrape the content.


For these sites, I crawl using a JS powered engine, and just save the relevant page content to disk.

Then I can craft my regex/selectors/etc., once I have the data stored locally.

This helps if you get caught and shut down - it won't turn off your development effort, and you can create a separate task to proxy requests.


I did web scraping professionally for two years, on the order of 10M pages per day. The performance with a browser is abysmal and it requires tonnes of memory, so it's not financially viable. We used browsers for some jobs, but rendered content isn't really a problem: you can simulate the API calls (common) and read the JSON, or regex the embedded script and try to do something with that.

I'd say 99% of the time you can get by without a browser.


Fully agree. It takes some thought :)


That's never required; the data shows up in the web page because you requested it from somewhere. You can do the same thing in your scraper.


> You can do the same thing in your scraper

Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?


Sure. How does that relate to the claim that your scraper is actually unable to make the same requests your browser does?


How are you going to deal with values generated by JS and used to sign requests?


If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.

If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.


I'm not sure what your point is. Of course you can replicate every request in your scraper / with curl if you know all the input variables.

Doing that for web scraping purposes where everything is changing all the time and you have more than one target website is just not feasible if you have to reverse engineer some custom JS for every site. Using some kind of headless browser for modern websites will be way easier and more reliable.


As someone who has done a good bit of scraping, how a website is designed dictates how I scrape.

If it's a static website that has consistently structured HTML and is easy to enumerate through all the webpages I'm looking for, then simple python requests code will work.

The less clear case is when to use a headless browser vs reverse engineering JS/server side APIs. Typically, I will do like a 10 minute dive into the client side JS and monitor AJAX requests to see if it would be super easy to hit some API that returns JSON to get my data. If reverse engineering seems too hairy, then I will just do headless browser.

I have a really strong preference for hitting JSON APIs directly because, well, you get JSON! Also you usually get more data than you even knew existed.

Then again, if I was creating a spider to recursively crawl a non-static website, then I think Headless is the path of least resistance. But usually, I'm trying to get data in the HTML, and not the whole document.
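
For anyone new to this, a minimal sketch of the "hit the JSON API directly" route, assuming you've already found the endpoint in the network tab (the URL, params, headers, and response keys below are placeholders):

    # Minimal sketch of hitting a JSON endpoint found via the network tab.
    # The URL, parameters, headers, and response keys below are placeholders.
    import requests

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0",           # some endpoints reject the default UA
        "X-Requested-With": "XMLHttpRequest",  # mimic the original AJAX call if needed
    })

    resp = session.get(
        "https://example.com/api/v1/products",
        params={"page": 1, "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json()["results"]:        # often richer than what the HTML shows
        print(item["name"], item["price"])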


I’ve been doing web scraping for the past 5 years and this is exactly the approach I take as well!


>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do

what??

Page loads -> Javascript sends request to backend -> it returns data -> javascript does stuff with it and renders it.


Sure, that's the model from several comments up. It doesn't involve signing anything.


My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.

I ended up building like 8 versions of it, literally using every PHP and Python library and resource I could find.

I tried httrack, php-ultimate-web-scraper (from github), headless chromium, headless selenium, and a few others.

By far the biggest problem was dealing with JS links... you wouldn't think at the start that it would be such a big deal, and yet it was.

Selenium with Python turned out to be the winning combination, and of course it was the last one I tried. Also, this is an ideal project for recursion, although you have to be careful about exit conditions.

One thing that was VERY important for performance was not visiting any page more than once because, obviously, certain links in headers and footers are duped sometimes 100s of times.

JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.

It was a super fun project, and in the end, considering I only worked on it for 2 months, I shipped some decent code that was getting like 98.6% of the pages perfectly.

The final presentation was interesting... for some reason my client had got it into his head that I wasn't a very good programmer or something, so we ran through his list of sample sites expecting my program to error out or incorrectly mirror them. Instead it handled all 10 of the sites about perfectly, and he was rather flabbergasted: he told me it would have taken him a week of hand-clicking through a site to mirror it, but the program did them all in under an hour.


I had to solve nearly the exact same problem for the same reasons. I too ended up with Selenium.

My favorite part was having a nice working system, then throwing it in the cloud and finding out a shocking number of sites tell you to go away if you come at them from a cloud-based IP.

Shouldn't be surprising, but it was still annoying.


There are a number of so-called “residential VPN” services with clients that also serve as the firm’s p2p VPN / proxy edge. Some can be subscribed to commercially to resolve precisely the above issue.

Preferably, only give money to one that tells their users this is how it works.


True. We went down that path for a while, but for our purposes, I was never happy with the ones I could find, since they were super vague about how they got the IPs in the first place. Some of it felt like protecting commercial secrets, which is fine, but some of it felt like protecting questionable business practices.

For our purposes, and the websites we needed to track, a traditional VPN was good enough.


Can you explain why the lawyers needed mirror sites? Does mirroring a site mean making a local copy of it?


What stack did you end up using ?


>Selenium with python turned out to be the winning combination, and of course, it was the last one I tried.


Cool, that's what I'm using too! :)


Ha! I am currently building something very similar at work, and JS links are driving me up a wall. It's interesting because you would think it's super simple, but I still haven't found a good solution. Luckily my boss is understanding.


I think this article does an OK job covering how to scrape websites rendered serverside, but I strongly discourage people from scraping SPAs using a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website as you don't need to request a bunch of other resources, reduces the overall bandwidth costs, and greatly speeds up the runtime of your script since you don't need to spend time running javascript in the headless browser. That can be especially slow if your script has to click/interact with elements on the page to get the results you need.

Other than that, I'd strongly caution anyone looking into making parallel requests. Always keep in mind the sysadmins and engineers behind the site you are targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can potentially cause massive server load for the site you're targeting. If that isn't enough to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if there are too many bad actors hitting it heavily.
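
As a small illustration of that last point, a sequential loop with a randomized delay costs you very little and spares the target (the URLs and the delay range below are just placeholders):

    # Illustration of the "be polite" point: sequential requests with a small
    # randomized delay. The URL list and the delay range are placeholders.
    import random
    import time

    import requests

    urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

    for url in urls:
        page = requests.get(url, timeout=30)
        # ... parse page.text here ...
        time.sleep(random.uniform(1.0, 3.0))   # spread the load on the target site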


How would you deal with authentication?


There was (is?) a DARPA project called "Memex" that was built to crawl the hidden web. It has many tools for things like crawling with authentication, automatic registration, machine learning to detect search forms, auto-detecting pagination, etc.: https://github.com/darpa-i2o/memex-program-index


I don't! As far as I know, scraping data behind a login is illegal in the United States. You can look into the Facebook v. Power Ventures case for the background on that. This page https://www.rcfp.org/scraping-not-violation-cfaa/ seems to have a decent overview of scraping laws in general. It's definitely a legal gray area so I'd suggest doing your research! This doesn't constitute legal advice and all that, I'm not a lawyer, just a guy who does some scraping here and there :)


It’s fun to combine jupyter notebooks and py scraping. If you are working 15 pages/screens deep, you can “stay at the coal face” and not have to rerun the whole script after making a change to the latest step.


`ipython -i <script>` also works similarly for debugging, by having the powerful interpreter open after running the script, without the jupyter overhead.


I love the imagery of this being "at the coal face" thanks for that


I write scrapers for fun and notebooks for work but never thought to combine the two. Great idea!


Oh! That's a good idea. My goto has always been pipelining along a series of functions, but never thought of just using Jupyter for some reason.


I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools including Scrapy), I was surprised how hard it was to do at scale. The open source ones I could find were hard/impossible to get working, overly complex, badly documented, etc.

The services I found are reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting this up yourself, especially when you need to run these jobs every month or so, and even if you have to write some code to make the open source solutions actually work.


AWS Lambdas are an easy way to get scheduled scraping jobs running.

I use their Python-based chalice framework (https://github.com/aws/chalice) which allows you to add a decorator to a method for a schedule,

  @app.schedule(Rate(30, unit=Rate.MINUTES)) 
It's also a breeze to deploy.

  chalice deploy
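
Put together, a minimal scheduled job could look roughly like this (the app name and the body of the function are placeholders):

    # app.py -- minimal scheduled job sketch with chalice (app name and body are placeholders).
    from chalice import Chalice, Rate

    app = Chalice(app_name="scraper")

    @app.schedule(Rate(30, unit=Rate.MINUTES))
    def run_scraper(event):
        # Fetch and process your target pages here.
        return {"status": "done"}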


It gets more complicated when you need to leverage real browser engines (e.g. Chrome). I've got jobs spread across ~20 machines / 140 concurrent browser instances; it's non-trivial.


How many pages can you render per second per vCPU core?


I'm measuring tracking so I need to have the page semi-idle while trackers show up, so it's both speed and allowing some waiting-around-time.


One thing I notice with all blog articles, and HN comments, on scraping is that they always omit the actual use case, i.e., the specific website that someone is trying to scrape. Any examples tend to be so trivial as to be practically meaningless. They do not prove anything.

If authors did name websites they wanted to scrape, or show tests on actual websites, then we might see others come forward with different solutions. Some of them might beat the ones being put forward by the pre-packaged software libraries/frameworks and commercial scraping services built on them, e.g., less brittle, faster, less code, easier to repair.

We will never know.


I'm just curious if you have any particular use cases you can't get the existing solutions to work with


I do not have any particular use cases that I cannot get my own solutions to work with. Thus I have no motivation to try these Python solutions. That is why I would be curious to have an example use case that someone thought could only be handled by some Python framework.


In my career I found several reasons not to use regular expressions for parsing an HTML response, but the largest is that while a regex may work for 'properly formed' documents, you would be surprised how lax all browsers are about requiring documents to be well-formed. Your regex, unless carefully written, will not be able to handle sites like this (and there are a lot, at least from my career experience). You may be able to work 'edge cases' into your regex, but good luck finding anyone but the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't working (and there will be a LOT of these cases with HTML/XML documents).

It is honestly almost never worth it unless you have constraints on what packages you can use and you MUST use regular expressions. Just do your future-self a favor and use BeautifulSoup or some other package designed to parse the tree-like structure of these documents.

One way regex can be used appropriately is just finding a pattern in the document, without caring where it is w.r.t. the rest of the document. But even then, do you really want to match <!-- <div> -->?
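
For comparison, a BeautifulSoup sketch stays readable even on sloppy markup (the markup and class name below are made up for illustration):

    from bs4 import BeautifulSoup

    # Deliberately sloppy markup: stray comment, unquoted attribute, unclosed div.
    html = "<div class='price'>19.99</div><!-- <div> --><div class=price>24.50"
    soup = BeautifulSoup(html, "html.parser")

    # The parser copes with the mess; your selection logic stays simple.
    for div in soup.find_all("div", class_="price"):
        print(div.get_text(strip=True))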


For all the things jQuery got wrong, it got one thing right: arguably the most intuitive way to target a set of data in a document is by having a concise DSL that works on a parsed representation of the document.

I'd love to see more innovation/developer-UX research on the interactions between regexes, document parse trees, and NLP. For instance, "match every verb phrase where the verb has similar meaning to 'call' within the context of a specific CSS selector, and be able to capture any data along that path in capturing groups, and do something with it" right now takes significant amounts of coding.

https://spacy.io/usage/rule-based-matching does a lot, but (a) it's not particularly concise, (b) there's not a standardized syntax for e.g. replacement strings once you detect something, and (c) there's no real facilities to bake in a knowledge of hierarchy within a larger markup-language document.


I think scraping is just inherently brittle whether you go by DOM traversal or by regex. AI may have the best potential. Regex can be slightly more brittle, like you point out with commented HTML or myriad other problems, but it can also be less brittle than DOM if you craft more lenient patterns. The main problem I found was regexes not being performant due to catastrophic backtracking and stack overflows (Google's RE2 lib addresses this). My favorite performance trick is to use negated character classes rather than a dot: /<foo[^>]*>/


> confidently change it

Having a good variety of tests helps.

> tree structure

You'll need more than a regular language to parse a tree.


I have been developing scrapers and crawlers and writing[1] about them for many years, and have used many Python-based libs so far, including Selenium. I have written such scrapers for individuals and startups for several purposes. The biggest issues I faced were rendering dynamic sites and getting IPs blocked due to the absence of proxies, which are not cheap at all, especially for individuals.

Services like ScrapingBee and ScraperAPI serve quite well for such problems. I personally liked ScraperAPI for rendering dynamic websites due to its better response time.

Shameless plug: in case anyone is interested, a long time back I wrote about it on my blog, which you can read here[2]. Now you do not need to set up a remote Chrome instance or anything. All that is required is to hit an API endpoint to fetch content from a dynamic, JS-rendered website.

[1] http://blog.adnansiddiqi.me/tag/scraping/

[2] http://blog.adnansiddiqi.me/scraping-dynamic-websites-using-...


Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
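
For anyone who hasn't tried it, a minimal lxml + XPath sketch looks like this (the URL and the queries are placeholders):

    # Minimal lxml + XPath sketch (the URL and the queries are placeholders).
    import requests
    from lxml import html

    page = html.fromstring(requests.get("https://example.com/products").content)

    # Grab the text of every price cell inside the results table,
    # plus the href of every product link.
    prices = page.xpath('//table[@id="results"]//td[@class="price"]/text()')
    links = page.xpath('//a[contains(@class, "product-link")]/@href')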


Fetching HTML, then parsing it and navigating the parsed result (or using regexps), is what used to work 20 years ago. These days, with all these reactive JavaScript frameworks, you'd better skip to item number 5: headless browsing. Also mind that Facebook, Instagram, ... will have anti-scraping measures in place. It's a race ;)


I've found that a lot of the time that's not needed. Often you'll find the data as a JSON blob in the page and can just read it directly from there. Or find that there's an API endpoint that the javascript reads.
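
A sketch of that, assuming the site embeds the blob in a script tag (the tag id below is a placeholder - sites use names like __NEXT_DATA__ or window.__INITIAL_STATE__ for this kind of payload):

    # Sketch of reading an embedded JSON blob straight out of the HTML.
    # The URL and the script id below are placeholders.
    import json

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/listing").text
    soup = BeautifulSoup(html, "html.parser")

    blob = soup.find("script", {"id": "__NEXT_DATA__"})
    data = json.loads(blob.string)   # the page's data, already structured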


I find this method works best. Skip looking at the page and instead watch all the network requests as the page loads.


ZAP HUD Proxy is the best option here...

Load the page through it and it shows you all the requests being made, along with the payload, as they happen.

Find the one you need, copy the data, endpoint and HTTP verb and recreate it in your language of choice :D


It's not all bad, many modern sites just expose a JSON API that can be used. It really depends on how protective and large the company behind it is.


This. Even relatively simple websites are much harder to parse today. I did a minor side project for a customer scraping some info and anti-scraping measures were in full force. It feels like an all out war.


Such as? I've never encountered anything I wasn't able to overcome.


Usually starts from simple to difficult. User agent stuff, IP address detection, aggressive rate limiting, captcha checking, browser fingerprinting, etc.


It can be overcome and, admittedly, I am new to this, so for me that means way more time spent trying to make it work. The odd one that I got stuck on for a while was a presented list where each individual record held pertinent details, but the list had, seemingly randomly, items that looked like records on the surface but were not (so those had to be identified and ignored). Small things like that.

Still, I would love to learn more about your approach if you would be willing to share.


Once recaptcha is in the mix it'll get tricky pretty quickly. Everything else is easy to overcome most of the time.


Recaptcha comes to mind.

That said, there are quite a few services which battle these systems for you nowadays (such as scraperapi - not affiliated, not a user). They are not always successful, but they have an advantage of maaany residential proxies (no doubt totally ethically obtained /s, but that's another story).


please click on all the traffic lights you see below.


Datadome, Incapsula


Is web scraping going to continue to be a viable thing, now that the web is mainly an app delivery platform rather than a content delivery platform?

Can you scrape a webasm site?


Maybe it will turn out to be that way, but this is far from reality at the moment. There are not many sites that cannot be scraped statically and there definitely are very few sites/apps that are webasm.

It'll change, but who knows how much. At least currently, most scraping professionals are not even using headless browsers as their targets are statically rendered.


I'm not 100% sure what you mean by a 'webasm' site (WebAssembly-powered?), but the article describes scraping via headless browsers, which actually render the page and allow you to select elements that are client-rendered.


I have been web scraping for almost 4 years now. That is my entire niche.

The problem with web scraping is that you really don't know where the ethical line is. These days I will reverse engineer a website to minimize the request load and only target specific API endpoints. But then again, I am breaching some security measures they have while doing that.


I do web scraping for fun and profit, primarily using Python. Wrote a post some time back about it.

https://www.kashifaziz.me/web-scraping-python-beautifulsoup....


Is there a SOTA library for common web scraping issues at scale (especially distributed over a cluster of nodes) for captcha detection, IP rotation, rate throttling, queue management, etc.?


What's a "SOTA library" ?


A contextual guess: "'State of the art' library"

In other words: Is there a drop in library to solve all the big common issues people run into scraping websites in the wild?

At least, that's how I read it.


There is no "state of the art library" to build your own google. But "Rate throttling/limiting" can be done with Redis, rotating ip is still rate-limiting with Redis, Captcha Detection - You have to pay $$ I think.


Personally I have not needed BeautifulSoup a single time when web scraping. People say it is better for unclean HTML, which I cannot confirm, because I never needed it and was always able to get my result using lxml + etree with XPath and CSS selectors. Once I also used Scrapy, but still not BeautifulSoup. I am glad there is a guide that starts with lxml instead of immediately jumping to BeautifulSoup.


lxml is terrific. I agree, never really understood what beautifulsoup added that lxml couldn't just handle on its own. On the JVM, jsoup is excellent but doesn't support XPath.


We created a fun side project to grab the index page of every domain - we downloaded a list of approx 200m domains. However, we ran into problems when our provider complained. It was something to do with the DNS side of things and we were told to run our own DNS server. If there is anyone on here with experience of crawling across this number of domain names it would be great to talk!


The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of computing power, which is really expensive on EC2.

A couple years ago, I discovered browserless.io which does this job for you and it's amazing. I really don't know how they made this but it just scales without any limit.


For browserless.io, the developer behind it talks about the tech stack in this podcast: https://runninginproduction.com/podcast/62-browserless-gives...


I recently undertook my first scraping project, and after trying a number of things landed upon Scrapy.

It’s been a blessing. Not only can it handle difficult sites, but it’s super quick to write another spider for the easy sites that provide the JSON blob in a handy single API call.

The only problem I had was getting around Cloudflare; I tried a few things like Puppeteer but had no luck.


Check out https://github.com/clemfromspace/scrapy-cloudflare-middlewar.... I've been running it in production for over a year without any hiccups.


Is this still working for you? cfscrape, which it's based on, looks to be failing due to Cloudflare updates.


For data extraction I highly recommend weboob. Despite the unfortunate name, it does some really cool stuff. Writing modules is quite straightforward and the structure they've chosen makes a lot of sense.

I do wish there was a Go version of it, mostly because I much prefer working with Go, but also because single binary is extremely useful.


I really appreciate the tips in the comments here.

As a beginner it makes a lot of sense to iterate on a local copy with jupyter rather than fetching resources over and over until you get it right. I wish more tutorials focused on this workflow.


I've always had pretty bad experiences with web scraping; it's such a pain in the ass and frequently breaks. I'm not sure if I'm doing it wrong or if that's how it's supposed to be.


> pain in the ass

Yes, unequivocally.

> frequently breaks

It can definitely depend on what you're scraping, but in the last few years or so the only project I had trouble with was one where they changed the units for the unpublished API (the real UI made two requests which mattered, one to grab the units, and I missed that in my initial inspection -- it bit me awhile later when they changed the default behavior for both locations).

A few tips:

As much as possible, try to find the original source for the data. E.g., are there any hidden APIs, or is the data maybe just sitting around in a script being used to populate the HTML? Selenium is great when you need it, but in my experience UI details change much more frequently than the raw data.

When choosing data selectors you'll get a feel for those which might not be robust. E.g., the nth item in a list is prone to breakage as minor UI tweaks are made.

If robustness is important, consider selecting the same data multiple ways and validating your assumptions about the page. E.g., you might want the data with a particular ID, combination of classes, preceding title, or which is the only text element formatted like a version number. When all of those methods agree you're much more likely to have found the right thing, and if they don't then you still have options for graceful degradation; use a majority vote to guess at a value, use the last known value, record N/A or some indication that we're not sure right now, etc. Critically though, your monitoring can instantly report that something is amiss so that you can inspect the problem in more detail while the service still operates in a hopefully acceptable degraded state.
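
A toy sketch of that "select it several ways and vote" idea (the selectors and the version-number heuristic are made up for illustration):

    # Toy sketch of selecting the same value several ways and voting.
    # The selectors and the version-number heuristic are made up for illustration.
    import re
    from collections import Counter

    from bs4 import BeautifulSoup

    def extract_version(html: str):
        soup = BeautifulSoup(html, "html.parser")
        candidates = []

        by_id = soup.find(id="current-version")
        if by_id:
            candidates.append(by_id.get_text(strip=True))

        by_class = soup.select_one(".release-info .version")
        if by_class:
            candidates.append(by_class.get_text(strip=True))

        # Fallback heuristic: anything formatted like a version number.
        candidates += re.findall(r"\b\d+\.\d+\.\d+\b", soup.get_text())

        if not candidates:
            return None                       # signal "not sure", alert monitoring
        value, votes = Counter(candidates).most_common(1)[0]
        return value if votes >= 2 else None  # require agreement between methods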


It’s heavily dependent on the site you’re scraping. If they put in active counter measures, have a complex structure, or update their templates frequently, it’s going to be an uphill battle.

Most sites IME are pretty easy.


How does one do scraping properly on dynamic client side rendered pages?


Most SPA pages will honour direct URI requests and route you properly in JavaScript. You just need to have your scraping pipeline use PhantomJS or Selenium to wait until the page stops loading, then scrape the HTML.

Although it might just be easier to scrape their API endpoints directly instead of mucking with HTML if it's a dynamic page. The data is structured that way and easier to query.


Usually the approach is to use a headless browser. The headless browser instance runs purely in memory without a GUI then renders the website you're interested in. Then, it comes down to regular DOM parsing. A common library that I enjoy is Selenium with Python.
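
A minimal headless Chrome sketch with Selenium, for reference (the URL and selector are placeholders, and chromedriver is assumed to be on your PATH):

    # Minimal headless Chrome sketch with Selenium. The URL and selector are
    # placeholders, and chromedriver is assumed to be on your PATH.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    opts = Options()
    opts.add_argument("--headless")

    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("https://example.com/app")
        # Wait for the client-side render to finish before parsing the DOM.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".results li"))
        )
        items = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".results li")]
    finally:
        driver.quit()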


Aside from the Beautiful Soup library, is there something about Python that makes it a better choice for web scraping than languages such as Java, JavaScript, Go, Perl or even C#?


Don't know about large scales, but just today I threw together a script using selenium, imported pandas to mangle the scraped data and quickly exported to json. For quick and dirty, possibly one-off jobs like that, Python is a great choice.


The Scrapy library written in Python (https://scrapy.org) is excellent for writing and deploying scrapers.


I think Python makes sense, at least for the prototyping phase. There's a lot of trial and error involved, and Python is quick to write.


I find javascript (node) to be best suited to web scraping personally. Using the same language to scrape/process as you use to develop those interfaces seems most natural.


Especially with stuff like Puppeteer which allows you to execute JS in context of the browser (which admittedly can lead to weird bugs as the functions are serialized and lose context)


I'm using this exact strategy to scrape content directly from the DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over the document API). Depending on the way a web page was fetched (opened in a browser tab or fetched via Node.js http/https requests), ExtractHtmlContentPlugin and ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.

[1] https://github.com/get-set-fetch/scraper


I like Python for the ease of use, and scraping is I/O-bound anyway, so there's no pressure to switch to a more performant language.


I'd say that really depends on your scale and what you're doing with the content you scrape.

In my experience with large scale scraping you're much better off using something like Java, where you can more easily have a thread pool with thousands of threads (or better yet, Kotlin coroutines) handling the crawling itself and a thread pool sized to the number of cores handling CPU-bound tasks like parsing.


Could you give a ballpark figure for what you mean by large scale scraping? I've only worked on a couple of projects: one was broad (100K to 500K domains) and shallow (root + 1 level of page depth, also with a low cap on the number of child pages), and the other was just a single domain but scraping around 50K pages from it.


I would say millions of domains regularly. That's where the pricing of most 'scraping services' falls down too compared to just doing it yourself.


My experience was with e-commerce scraping. Not many domains, but a massive catalogue.


Does anyone know how I could script the Save Page WE extension in Firefox? It does a really nice job of saving the page as it looks, including dynamic content.


This is an ad.


Pyppeteer might be worth a look as well. It's basically a port of the JS Puppeteer project that drives headless Chrome via the DevTools API.

As mentioned elsewhere, using anything other than headless isn't useful beyond a fairly narrow scope these days.

https://github.com/pyppeteer/pyppeteer


I think you'd be surprised by the number of websites you can scrape without a headless browser.

Even Google SERPs can be scraped with a simple HTTP client.


That's what I meant by narrow...a known set of sites and data you want to extract.

I imagine, for example, building on the SERP example might hit a wall if you added logged-in vs. not-logged-in SERPs, iterating over carousel data, reading advertisement data, etc.


A login wall can easily be bypassed with an HTTP client by setting the correct auth header.

From what I can observe, 2/3 of websites can be scraped without using a headless browser.
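
A sketch of that, assuming you've copied a session cookie or token out of the browser's dev tools (the header names, cookie name, and URL are placeholders):

    # Sketch of reusing a login from the browser: copy the session cookie or token
    # out of the dev tools and attach it to your HTTP client. Names are placeholders.
    import requests

    session = requests.Session()
    session.headers.update({"Authorization": "Bearer <token copied from dev tools>"})
    # ...or, for cookie-based logins:
    session.cookies.set("sessionid", "<value copied from dev tools>")

    resp = session.get("https://example.com/account/orders", timeout=30)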


How do you deal with JavaScript then?


There's an official Python library for Playwright as well: https://github.com/microsoft/playwright-python


I am often contacted by people who ask me to scrape dynamic/JS-rendered websites. You might be surprised to know that many such dynamic websites actually depend on some API endpoint accessed via AJAX-like functionality, which you can hit directly to get the required data. I often faced situations where the data was not fetched from some external source at all but was already available, either in a data attribute or in some JSON-like structure, hence there was no need to use Selenium with a headless browser.


Sure. This one happens not to be Selenium.


Any advantage/disadvantage in using Javascript instead of Python for web scraping?


It's just a language. Might be faster. Use what you know best.



