Sometimes it's also helpful to use Beautiful Soup to isolate the elements you want, feed the text of those elements into StringIO, and hand that to read_html.
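A minimal sketch of that combination; the URL and table selector are made up:

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/stats", timeout=30).text
soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table#results")            # isolate just the element you want
df = pd.read_html(StringIO(str(table)))[0]          # read_html returns a list of DataFrames
print(df.head())
```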
I've been involved in many web scraping jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used scrapy. I went with XML tools for navigating the DOM.
It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an ec2 instance. It checks its tags to see if it should execute the script. If so, it does so. When it's done with its scraping for the day, it turns itself off.
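Roughly what the on-instance side could look like with boto3. This is a sketch, not the original script: the "run-scraper" tag, scrape.py, IMDSv1 metadata access, and an instance role allowing ec2:DescribeTags / ec2:StopInstances are all assumptions.

```python
import subprocess

import boto3
import requests

# Ask the instance metadata service who we are (IMDSv1 style).
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

ec2 = boto3.client("ec2")
tags = {t["Key"]: t["Value"]
        for t in ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [instance_id]}]
        )["Tags"]}

if tags.get("run-scraper") == "true":                  # should this box do the work today?
    subprocess.run(["python", "scrape.py"], check=True)

ec2.stop_instances(InstanceIds=[instance_id])          # turn itself off when done
```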
This is a tiny snapshot of why it's been so difficult for me to go from python2 to python3. I'm strongly in the camp of "if it ain't broke, don't fix it".
I certainly can keep using it. There have been so many efforts to get people to update Python 2 code to Python 3 code that it's on my backlog to do it. Will I get to it this year? Probably not.
Why use autoscaling and not just launch the instance directly from Lambda? The run time is short, so there's no danger of two instances running in parallel.
the tl;dr for all web scraping is to just use scrapy (and scrapyd) - otherwise you end up just writing a poorer implementation of what has already been built
My only recent change is that we no longer use Items and ItemLoaders from scrapy - we've replaced it with a custom pipeline of Pydantic schemas and objects
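Not their actual code, just the general shape of that idea: a Scrapy pipeline that validates scraped dicts against a hypothetical Pydantic schema, enabled via ITEM_PIPELINES in settings.py like any other pipeline.

```python
from pydantic import BaseModel, ValidationError
from scrapy.exceptions import DropItem

class Product(BaseModel):        # hypothetical schema
    name: str
    price: float
    url: str

class ValidateItemPipeline:
    def process_item(self, item, spider):
        try:
            return Product(**dict(item)).dict()    # .dict() is pydantic v1 style
        except ValidationError as err:
            raise DropItem(f"invalid item: {err}")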
One tip I would pass on when trying to scrape data from a website: start by using wget in mirror mode to download the useful pages. It's much faster to iterate on scraping the data once you have it locally. You're also less likely to accidentally kill the site or attract the attention of the host.
I did web scraping professionally for two years, on the order of 10M pages per day. The performance with a browser is abysmal and requires tonnes of memory, so it's not financially viable. We used browsers for some jobs, but rendered content isn't really a problem: you can simulate the API calls (common) and read the JSON, or regex the script and try to do something with that.
I'd say 99% of the time you can get by without a browser.
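A minimal sketch of the "regex the script" approach: many pages embed their state as a JSON blob in a script tag. The URL and variable name here are hypothetical, and the pattern is deliberately simple (and brittle).

```python
import json
import re

import requests

html = requests.get("https://example.com/listing", timeout=30).text
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html, re.S)
if match:
    data = json.loads(match.group(1))
    print(sorted(data.keys()))
```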
Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?
If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.
If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.
I'm not sure what your point is. Of course you can replicate every request in your scraper / with curl if you want to, provided you know all the input variables.
Doing that for web scraping, where everything is changing all the time and you have more than one target website, is just not feasible if you have to reverse engineer some custom JS for every site. Using some kind of headless browser for modern websites will be way easier and more reliable.
As someone who has done a good bit of scraping, how a website is designed dictates how I scrape.
If it's a static website that has consistently structured HTML and is easy to enumerate through all the webpages I'm looking for, then simple python requests code will work.
The less clear case is when to use a headless browser vs reverse engineering JS/server-side APIs. Typically, I will do a 10-minute dive into the client-side JS and monitor AJAX requests to see if it would be super easy to hit some API that returns JSON to get my data. If reverse engineering seems too hairy, then I will just use a headless browser.
I have a really strong preference for hitting JSON APIs directly because, well, you get JSON! Also, you usually get more data than you even knew existed (see the sketch below).
Then again, if I was creating a spider to recursively crawl a non-static website, then I think Headless is the path of least resistance. But usually, I'm trying to get data in the HTML, and not the whole document.
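For reference, the "hit the JSON API directly" path can be as small as this; the endpoint and parameters are invented for illustration.

```python
import requests

resp = requests.get(
    "https://example.com/api/v2/search",        # hypothetical endpoint found via the network tab
    params={"q": "widgets", "page": 1},
    headers={"User-Agent": "Mozilla/5.0"},      # some endpoints check this
    timeout=30,
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row.get("id"), row.get("title"))
```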
>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do
what??
Page loads -> Javascript sends request to backend -> it returns data -> javascript does stuff with it and renders it.
My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.
I ended up building like 8 versions of it, literally using every PHP and Python library and resource I could find.
I tried httrack, php-ultimate-web-scraper (from github), headless chromium, headless selenium, and a few others.
By far the biggest problem was dealing with JS links... you wouldn't think from the start that it would be such a big deal, and yet... it was.
Selenium with Python turned out to be the winning combination, and of course, it was the last one I tried. Also, this is an ideal project to implement recursion, although you have to be careful about exit conditions.
One thing that was VERY important for performance was not visiting any page more than once because, obviously, certain links in headers and footers are duplicated sometimes 100s of times.
JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.
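The dedup-and-recurse part can be sketched roughly like this (not the actual project code; it ignores JS-only links entirely, which were the hard part, assumes chromedriver is installed, and uses example.com as a stand-in for the target site):

```python
from urllib.parse import urldefrag

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
visited = set()

def crawl(url, depth=0, max_depth=5):                # depth cap = explicit exit condition
    url = urldefrag(url)[0]                          # drop #fragments so duplicates collapse
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    driver.get(url)
    # ... save driver.page_source to disk here ...
    # Collect hrefs before recursing, otherwise the elements go stale
    # as soon as the driver navigates away.
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")]
    for href in links:
        if href and href.startswith("https://example.com"):   # stay on-site
            crawl(href, depth + 1)

crawl("https://example.com/")
driver.quit()
```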
It was a super fun project, and in the end considering I only worked for 2 months, I shipped some decent code that was getting like 98.6% of the pages perfectly.
The final presentation was interesting... for some reason my client had gotten it into his head that I wasn't a very good programmer or something, and we ran through his list of sample sites expecting my program to error out or incorrectly mirror them. But it handled all 10 of the sites just about perfectly, and he was rather flabbergasted: he told me it would have taken him a week of hand-clicking through a site to mirror it, but instead the program did them all in under an hour.
I had to solve nearly the exact same problem for the same reasons. I too ended up with Selenium.
My favorite part was having a nice working system, then throwing it in the cloud and finding out a shocking number of sites tell you to go away if you come at them from a cloud-based IP.
Shouldn't be surprising, but it was still annoying.
There are a number of so-called “residential VPN” services with clients that also serve as the firm’s p2p VPN / proxy edge. Some can be subscribed to commercially to resolve precisely the above issue.
Preferably, only give money to one that tells their users this is how it works.
True. We went down that path for a while, but for our purposes, I was never happy with the ones I could find, since they were super vague about how they got the IPs in the first place. Some of it felt like protecting commercial secrets, which is fine, but some of it felt like protecting questionable business practices.
For our purposes, and the websites we needed to track, a traditional VPN was good enough.
Ha! I am currently building something very similar at work, and JS links are driving me up a wall. It's interesting because you would think it's super simple, but I still haven't found a good solution. Luckily my boss is understanding.
I think this article does an OK job covering how to scrape websites rendered serverside, but I strongly discourage people from scraping SPAs using a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website as you don't need to request a bunch of other resources, reduces the overall bandwidth costs, and greatly speeds up the runtime of your script since you don't need to spend time running javascript in the headless browser. That can be especially slow if your script has to click/interact with elements on the page to get the results you need.
Other than that, I'd strongly caution anyone looking into making parallel requests. Always keep in mind the sysadmins and engineers behind the site you are targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can potentially cause massive server load for the site you're targeting. If that isn't enough to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if there are too many bad actors hitting it heavily.
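A polite baseline, with made-up values: one shared session and a fixed delay rather than a parallel fan-out.

```python
import time

import requests

session = requests.Session()
session.headers["User-Agent"] = "my-scraper/0.1 (contact: you@example.com)"   # hypothetical

urls = [f"https://example.com/page/{i}" for i in range(1, 51)]
for url in urls:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(1.0)          # ~1 request/second is negligible load for most sites
```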
There was (is?) a DARPA project called "Memex" that was built to crawl the hidden web and has many tools, like crawling with authentication, automatic registration, machine learning to detect search forms, auto-detecting pagination, etc. https://github.com/darpa-i2o/memex-program-index
I don't! As far as I know, scraping data behind a login is illegal in the United States. You can look into the Facebook v. Power Ventures case for the background on that. This page https://www.rcfp.org/scraping-not-violation-cfaa/ seems to have a decent overview of scraping laws in general. It's definitely a legal gray area so I'd suggest doing your research! This doesn't constitute legal advice and all that; I'm not a lawyer, just a guy who does some scraping here and there :)
It’s fun to combine jupyter notebooks and py scraping. If you are working 15 pages/screens deep, you can “stay at the coal face” and not have to rerun the whole script after making a change to the latest step.
`ipython -i <script>` also works similarly for debugging, by having the powerful interpreter open after running the script, without the jupyter overhead.
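Either way, the "fetch once, parse many times" workflow mostly comes down to caching pages on disk so re-running the parsing step costs nothing. A tiny sketch (cache directory and naming scheme are arbitrary):

```python
import hashlib
import pathlib

import requests

CACHE = pathlib.Path("cache")
CACHE.mkdir(exist_ok=True)

def get(url):
    """Return the page for `url`, fetching it only the first time."""
    path = CACHE / (hashlib.sha1(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text()
    html = requests.get(url, timeout=30).text
    path.write_text(html)
    return html
```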
I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools, including Scrapy), I was surprised how hard it was to do at scale. The open source options I could find were hard or impossible to get working, overly complex, badly documented, etc.
The services I found are reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting this up yourself. Especially when you need to run these jobs every month or so. Even if you have to write some code to make the open source solutions actually work.
It gets more complicated when you need to leverage real browser engines (e.g. Chrome). I've got jobs spread across ~20 machines / 140 concurrent browser instances; it's non-trivial.
One thing I notice with all blog articles, and HN comments, on scraping is that they always omit the actual use case, i.e., the specific website that someone is trying to scrape. Any examples tend to be so trivial as to be practically meaningless. They do not prove anything.
If authors did name websites they wanted to scrape, or show tests on actual websites, then we might see others come forward with different solutions. Some of them might beat the ones being put forward by the pre-packaged software libraries/frameworks and commercial scraping services built on them, e.g., less brittle, faster, less code, easier to repair.
I do not have any particular use cases that I cannot get my own solutions to work with. Thus I have no motivation to try these Python solutions. That is why I would be curious to have an example use case that someone thought could only be handled by some Python framework.
In my career I found several reasons not to use regular expressions for parsing an HTML response, but the largest was the fact that it may work for 'properly formed' documents, yet you would be surprised how lax all browsers are about requiring the document to be well-formed. Your regex, unless it specifically handles them, will not be able to cope with sites like this (and there are a lot, at least from my career experience). You may be able to work 'edge cases' into your regex, but good luck finding anyone but the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't working (and there will be a LOT of these cases with HTML/XML documents).
It is honestly almost never worth it unless you have constraints on what packages you can use and you MUST use regular expressions. Just do your future-self a favor and use BeautifulSoup or some other package designed to parse the tree-like structure of these documents.
One way it can be used appropriately is just finding a pattern in the document- without caring where it is w.r.t. the rest of the document. But even then, do you really want to match: <!-- <div> --> ?
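A tiny illustration of that pitfall, naive regex vs. a real parser:

```python
import re

from bs4 import BeautifulSoup

html = "<!-- <div class='price'>$1</div> --> <div class='price'>$2</div>"

print(re.findall(r"<div class='price'>(.*?)</div>", html))      # ['$1', '$2'] -- oops
soup = BeautifulSoup(html, "html.parser")
print([d.text for d in soup.find_all("div", class_="price")])   # ['$2']
```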
For all the things jQuery got wrong, it got one thing right: arguably the most intuitive way to target a set of data in a document is by having a concise DSL that works on a parsed representation of the document.
I'd love to see more innovation/developer-UX research on the interactions between regexes, document parse trees, and NLP. For instance, "match every verb phrase where the verb has similar meaning to 'call' within the context of a specific CSS selector, and be able to capture any data along that path in capturing groups, and do something with it" right now takes significant amounts of coding.
https://spacy.io/usage/rule-based-matching does a lot, but (a) it's not particularly concise, (b) there's not a standardized syntax for e.g. replacement strings once you detect something, and (c) there's no real facilities to bake in a knowledge of hierarchy within a larger markup-language document.
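For the token-pattern half of that wish list, a hedged spaCy sketch (v3 API, en_core_web_sm installed). It approximates "similar meaning to 'call'" with a lemma list rather than real semantic similarity, and it knows nothing about markup hierarchy, which is exactly the missing piece.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("CALL_PHRASE", [[
    {"LEMMA": {"IN": ["call", "phone", "contact"]}},
    {"POS": {"IN": ["DET", "PRON", "ADJ", "NOUN", "PROPN"]}, "OP": "+"},
]], greedy="LONGEST")

doc = nlp("Please call our support line or contact the sales team.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```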
I think scraping is just inherently brittle whether you go by the DOM traversal or by regex. AI may have the best potential. Regex can be slightly more brittle like you point out with commented html or myriad other problems, but it can also be less brittle than DOM if you craft more lenient patterns. The main problem I found was regex's not being performant due to recursiveness and stack overflows (Google's RE2 lib addresses this). My favorite performance trick is to use negated character classes rather than a dot, /<foo[^>]*>/
I have been developing scrapers and crawlers and writing[1] about them for many years and have used many Python-based libs so far, including Selenium. I have written such scrapers for individuals and startups for several purposes. The biggest issue I faced was rendering of dynamic sites and blocking of IPs due to the absence of proxies, which are not cheap at all, especially for individuals.
Services like Scrapingbee and ScraperAPI are serving quite good for such problems. I personally liked ScraperAPI for rendering dynamic websites due to the better response time.
Shameless plug: in case anyone is interested, a long time ago I wrote about it on my blog, which you can read here[2]. Now you do not need to set up a remote Chrome instance or anything. All that is required is to hit an API endpoint to fetch content from dynamic, JS-rendered websites.
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
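A quick lxml + XPath sketch; the URL and queries are illustrative:

```python
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com/products", timeout=30).content)
names = tree.xpath("//div[@class='product']//h2/text()")
prices = tree.xpath("//div[@class='product']//span[@class='price']/text()")
for name, price in zip(names, prices):
    print(name.strip(), price.strip())
```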
Fetching HTML and then parsing it and navigating the parsed result (or using regexps) is what used to work 20 years ago.
These days, with all these reactive javascript frameworks you better skip to item number 5: headless browsing.
Also mind that Facebook, Instagram, ... will have anti-scraping measures in place. It's a race ;)
I've found that a lot of the time that's not needed. Often you'll find the data as a JSON blob in the page and can just read it directly from there. Or find that there's an API endpoint that the javascript reads.
This. Even relatively simple websites are much harder to parse today. I did a minor side project for a customer scraping some info and anti-scraping measures were in full force. It feels like an all out war.
Usually starts from simple to difficult. User agent stuff, IP address detection, aggressive rate limiting, captcha checking, browser fingerprinting, etc.
It can be overcome and, admittedly, I am new to this, so for me that means way more time spent trying to make it work. The odd one that I got stuck on for a while was a list where each individual record held the pertinent details, but the list had, seemingly at random, items that looked like records on the surface but were not (so those had to be identified and ignored). Small things like that.
Still, I would love to learn more about your approach if you would be willing to share.
That said, there are quite a few services which battle these systems for you nowadays (such as scraperapi - not affiliated, not a user). They are not always successful, but they have an advantage of maaany residential proxies (no doubt totally ethically obtained /s, but that's another story).
Maybe it will turn out to be that way, but this is far from reality at the moment. There are not many sites that cannot be scraped statically and there definitely are very few sites/apps that are webasm.
It'll change, but who knows how much. At least currently, most scraping professionals are not even using headless browsers as their targets are statically rendered.
I'm not 100% sure what you mean by a 'webasm' site (web assembly powered?), but the article describes scraping via headless browsers, which actually render the page and allow you to select elements that are client-rendered.
I have been web scraping for almost 4 years now. That is my entire niche.
The problem with web scraping is that you really don't know where the ethical point of scraping ends. These days I will reverse engineer a website to minimize the request load and only target specific API endpoints. But then again, I am breaching some security measures they have while doing that.
Is there a SOTA library for common web scraping issues at scale (especially distributed over a cluster of nodes) for captcha detection, IP rotation, rate throttling, queue management, etc.?
There is no "state of the art library" to build your own google. But "Rate throttling/limiting" can be done with Redis, rotating ip is still rate-limiting with Redis, Captcha Detection - You have to pay $$ I think.
Personally I have not needed BeautifulSoup a single time when web scraping. People say it is better for unclean HTML, which I cannot confirm, because I never needed it and was always able to get my result using lxml + etree with XPath and CSS selectors. Once I also used Scrapy, but still not BeautifulSoup. I am glad there is a guide that starts with lxml, instead of immediately jumping to BeautifulSoup.
lxml is terrific. I agree, never really understood what beautifulsoup added that lxml couldn't just handle on its own. On the JVM, jsoup is excellent but doesn't support XPath.
We created a fun side project to grab the index page of every domain - we downloaded a list of approx 200m domains. However, we ran into problems when our provider complained. It was something to do with the DNS side of things and we were told to run our own DNS server. If there is anyone on here with experience of crawling across this number of domain names it would be great to talk!
The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of compute power, which is really expensive on EC2.
A couple years ago, I discovered browserless.io which does this job for you and it's amazing. I really don't know how they made this but it just scales without any limit.
I recently undertook my first scraping project, and after trying a number of things landed upon Scrapy.
It’s been a blessing. Not only can it handle difficult sites, but it’s super quick to write another spider for the easy sites that provide the JSON blob in a handy single API call.
Only problem I had was getting around cloudflare, tried a few things like puppeteer but no luck.
For data extraction I highly recommend weboob. Despite the unfortunate name, it does some really cool stuff. Writing modules is quite straightforward and the structure they've chosen makes a lot of sense.
I do wish there was a Go version of it, mostly because I much prefer working with Go, but also because single binary is extremely useful.
I really appreciate the tips in the comments here.
As a beginner it makes a lot of sense to iterate on a local copy with jupyter rather than fetching resources over and over until you get it right. I wish more tutorials focused on this workflow.
I've always had pretty bad experiences with web scraping; it's such a pain in the ass and frequently breaks. I'm not sure if I'm doing it wrong or if that's how it's supposed to be.
It can definitely depend on what you're scraping, but in the last few years or so the only project I had trouble with was one where they changed the units for the unpublished API (the real UI made two requests which mattered, one to grab the units, and I missed that in my initial inspection -- it bit me awhile later when they changed the default behavior for both locations).
A few tips:
As much as possible, try to find the original source for the data. E.g., are there any hidden APIs, or is the data maybe just sitting around in a script being used to populate the HTML? Selenium is great when you need it, but in my experience UI details change much more frequently than the raw data.
When choosing data selectors you'll get a feel for those which might not be robust. E.g., the nth item in a list is prone to breakage as minor UI tweaks are made.
If robustness is important, consider selecting the same data multiple ways and validating your assumptions about the page. E.g., you might want the data with a particular ID, combination of classes, preceding title, or which is the only text element formatted like a version number. When all of those methods agree you're much more likely to have found the right thing, and if they don't then you still have options for graceful degradation; use a majority vote to guess at a value, use the last known value, record N/A or some indication that we're not sure right now, etc. Critically though, your monitoring can instantly report that something is amiss so that you can inspect the problem in more detail while the service still operates in a hopefully acceptable degraded state.
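A sketch of that cross-checking idea; every selector here is hypothetical.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

def extract_version(html):
    soup = BeautifulSoup(html, "lxml")
    candidates = []
    if (el := soup.select_one("#current-version")):           # by id
        candidates.append(el.text.strip())
    if (el := soup.select_one("span.version.release")):       # by class combination
        candidates.append(el.text.strip())
    candidates += [s for s in soup.stripped_strings            # by format
                   if re.fullmatch(r"\d+\.\d+\.\d+", s)]
    if not candidates:
        return None            # caller falls back to the last known value / reports N/A
    value, votes = Counter(candidates).most_common(1)[0]
    return value if votes > 1 else candidates[0]
```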
It’s heavily dependent on the site you’re scraping. If they put in active counter measures, have a complex structure, or update their templates frequently, it’s going to be an uphill battle.
Most SPA pages will honour direct uri requests and route you properly in javascript. You just need to have your scraping pipeline use phantomJS or selenium to wait until the page stops loading then scrape the html.
Although it might just be easier to scrape their API endpoints directly instead of mucking with HTML if it's a dynamic page. The data is structured that way, and easier to query.
Usually the approach is to use a headless browser. The headless browser instance runs purely in memory without a GUI then renders the website you're interested in. Then, it comes down to regular DOM parsing. A common library that I enjoy is Selenium with Python.
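A hedged sketch of that flow: headless Chrome, wait for a client-rendered element, then hand the HTML off to your usual parser. The URL and selector are made up.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

opts = Options()
opts.add_argument("--headless")                  # no GUI, runs purely in memory
driver = webdriver.Chrome(options=opts)

driver.get("https://example.com/app#/listings")
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".listing"))
)
html = driver.page_source                        # now parse with lxml / BeautifulSoup
driver.quit()
```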
Aside from the Beautiful Soup library, is there something about Python that makes it a better choice for web scraping than languages such as Java, JavaScript, Go, Perl or even C#?
Don't know about large scales, but just today I threw together a script using selenium, imported pandas to mangle the scraped data and quickly exported to json. For quick and dirty, possibly one-off jobs like that, Python is a great choice.
I find javascript (node) to be best suited to web scraping personally. Using the same language to scrape/process as you use to develop those interfaces seems most natural.
Especially with stuff like Puppeteer which allows you to execute JS in context of the browser (which admittedly can lead to weird bugs as the functions are serialized and lose context)
I'm using this exact strategy to scrape content directly from DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over document API). Depending on the way a web page was fetched (opened in a browser tab or fetched via nodejs http/https requests), ExtractHtmlContentPlugin, ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.
I'd say that really depends on your scale and what you're doing with the content you scrape.
In my experience with large scale scraping you're much better off using something like Java where you can more easily have a thread pool with thousands of threads (or better yet, Kotlin coroutines) handling the crawling itself and a NUM_CORES-sized thread pool handling CPU-bound tasks like parsing.
Could you give a ballpark figure for what you mean by large scale scraping? I've only worked on a couple projects, one was a broad (100K to 500K domains) and shallow (root + 1 level of page depth, also with a low cap on the number of children pages). The other just a single domain but scraping around 50K pages from it.
Does anyone know how could I script Save Page WE extension in Firefox? It does a really nice job of saving the page as it looks, including dynamic content.
That's what I meant by narrow...a known set of sites and data you want to extract.
I imagine, for example, building on the SERP example might hit a wall if you added logged-in vs. logged-out SERPs, iterating over carousel data, reading advertisement data, etc.
I am often contacted by people who ask me to scrape dynamic/JS-rendered websites. You might be surprised to know that many such dynamic websites actually depend on some API endpoint accessed via AJAX-like functionality, which you can hit directly to get the required data. I often faced situations where the data was not fetched from some external source at all but was already available, either in a data attribute or in some JSON-like structure, hence no need to use Selenium with a headless browser.
https://pandas.pydata.org/pandas-docs/stable/reference/api/p...