The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites like, e.g., ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...
All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run the page through WebDriver or something like PhantomJS so the JS actually executes...
There are multiple internal tools I use at work (JIRA, our ticketing system, our code review tool) that won't work because of this issue.
In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages all hack-like, but at least I get a good CSV of the data I need.
To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and have it use local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.
In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.
Great idea on the bookmarklet. I could see a tool for building custom readers with clippings from various sites. Say I want to organize JavaScript array patterns and ideas. Throw in a way to clip parts of my PDF books into this "reader" and you have an amazing product worth millions.
While their APIs are nice, they require separate permissions and having access to them isn't always a given depending on the company that runs/owns the Jira instance.
At first, this seems correct. It's definitely easier to get scraping with something like Capybara and a suitable JS-enabled driver, but in my experience, this solution is less reliable. Async-loaded data can time out, and don't get me started on the difficulties of running the scraper from cron jobs. In the end, I migrated even my JS-heavy pages to Mechanize-based solutions. It takes a few extra requests to get the async data, but once you have that figured out, it's rock solid - till they update the site design ;-)
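Roughly, the extra-request idea looks like this with Python's mechanize (the URLs and endpoint below are made up - find the real ones by watching the network tab while the page loads):

# Sketch only: fetch the HTML page first (mainly for cookies), then call the
# async endpoint the page's JS would normally hit. Endpoint names are hypothetical.
import json
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-Agent", "Mozilla/5.0")]

# First request: the normal product page, to pick up session cookies.
br.open("http://shop.example.com/product/12345")

# Extra request: the JSON endpoint the page's JavaScript would call.
stock = json.loads(br.open("http://shop.example.com/api/product/12345/stock").read())
print(stock)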
Use the tools suitable for the task. There are a lot of those "simple" sites and I'm pretty sure a lot of people will stick to those "dated" methods of building sites, because search traffic still matters.
The long tail is tough, but rules are useful when you only need to work with a small number of sites. And assuming, as you point out, less "modern" sites. (News sites tend to be mostly consistently manageable but, yes, smaller e-commerce players tend to adopt more modern techniques -- as befitting fashion-forward product lines, naturally).
Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.
The web keeps evolving though, dang it. Tricky thing!
I built SnapSearch for JS/SPA sites that need SEO, but it works for scraping as well. https://snapsearch.io/ You can try the demo. I tried it with "http://www.asos.com/" and it worked properly. Note that empty content actually means the web server returned no body content. The real API will return the headers as well as the body.
It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.
These single-page sites do have a public, albeit undocumented, API. If you analyze the network requests via the dev tools in your browser, you'll find an XML/JSON data source that is probably structured better than the markup.
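A request to such an endpoint usually boils down to something like this (the URL, parameters and field names are placeholders - copy the real ones from the network tab):

# Sketch: call the JSON endpoint the single-page app fetches internally.
import requests

resp = requests.get(
    "https://shop.example.com/api/v1/products",
    params={"category": "shoes", "page": 1},
    headers={"X-Requested-With": "XMLHttpRequest"},  # some backends check for this
)
resp.raise_for_status()
for item in resp.json()["products"]:  # key names depend on the site
    print(item["name"], item["price"])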
CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:
defining & ordering browsing navigation steps
filling & submitting forms
clicking & following links
capturing screenshots of a page (or part of it)
testing remote DOM
logging events
downloading resources, including binary ones
writing functional test suites, saving results as JUnit XML
scraping Web contents
I recently created a service designed to make JS sites crawlable by search engines and other robots. However it works for scraping as well. Try the demo: https://snapsearch.io/
>Is it also a webscraper that can pull data out of a page for me?
No, Phantom will only recreate the page as it would look to a human user (i.e. after all the JavaScript is parsed and executed). It will not help you parse or slice the page - you would have to do that part programmatically with other DOM-parsing tools.
CasperJS is a higher-level wrapper for PhantomJS, so yes, it could all be done with PhantomJS alone... but you wouldn't want to, because CasperJS makes the automation easier.
Are there any libraries to facilitate database connectivity to SQL Server or MySQL from JavaScript? I've used CasperJS to scrape some sites, but I always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.
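For example, once PhantomJS has rendered and saved the page (say via page.content), the slicing step could look like this with lxml - the filename and selectors are purely illustrative:

# Post-processing sketch: parse the rendered HTML that PhantomJS saved to disk.
from lxml import html

with open("rendered_page.html") as f:
    tree = html.fromstring(f.read())

# Illustrative selector - adjust to the actual markup.
for title in tree.xpath('//h2[@class="product-title"]/text()'):
    print(title.strip())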
You could always have your CasperJS scraping script make an AJAX request to a RESTful API for your MySQL DB. You won't be doing it all from one piece of code, but you'll be doing about 90% of it.
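The receiving end of that setup could be as small as this Flask sketch (the table and column names are invented; the CasperJS script would POST each scraped item to /items):

# Hypothetical REST endpoint that accepts scraped items and inserts them into MySQL.
from flask import Flask, request, jsonify
import mysql.connector

app = Flask(__name__)

@app.route("/items", methods=["POST"])
def add_item():
    item = request.get_json()
    conn = mysql.connector.connect(user="scraper", password="secret",
                                   host="localhost", database="scrapes")
    cur = conn.cursor()
    cur.execute("INSERT INTO items (name, price) VALUES (%s, %s)",
                (item["name"], item["price"]))
    conn.commit()
    conn.close()
    return jsonify(status="ok"), 201

if __name__ == "__main__":
    app.run(port=5000)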
I've used Scrapy and it's the easiest and most powerful scraping tool I've tried. This is so awesome. Since it is based on Scrapy, I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly in the code. I'll try it for my next scraping project.
I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the cases where these tools will probably fail are, at the same time, the sites that aren't really open to providing the data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a Referer entry on the request headers, or resort to regex magic to extract information from JavaScript code in the HTML. I guess it's just a matter of time before one of these tools provides such methods, though.
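In practice those tricks come down to something like this (all URLs, header values and the regex are placeholders):

# Rough illustration of the cookie/Referer/regex tricks described above.
import re
import requests

session = requests.Session()

# 1. Hit a landing page first so the server sets its session cookie.
session.get("https://awkward-site.example.com/")

# 2. Request the real page with a Referer header the site expects.
resp = session.get(
    "https://awkward-site.example.com/listing",
    headers={"Referer": "https://awkward-site.example.com/"},
)

# 3. Pull data out of inline JavaScript with "regex magic".
match = re.search(r"var\s+initialData\s*=\s*(\{.*?\});", resp.text, re.S)
if match:
    print(match.group(1)[:200])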
For my project I write all the scrapers manually (that is, in Python, with requests and the amazing lxml), because there's always one source that makes you build all the architecture around it. Something I find is needed for public APIs is a domain-specific language that works around building intermediate servers by explaining to the engine how to understand a data source:
An API producer wants to keep serving the data themselves (traffic, context and statistics), but someone wants a standard way of accessing more than one source (let's say, 140 different sources). If only, instead of building an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.
The data consumer would be accessing the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course, this would only make sense for public APIs. (Real) scraping should never be done on the client: it is slow, crash-prone, and can breach security on the device.
Surely there are difficulties in expecting data providers to produce their data in standard formats across industries and countries? I am naive as to how much and what data is available, but that seems a stretch.
If you're interested, take a look at my project on unifying bike-sharing network data. Besides providing a public API, we are also providing a Python library that accesses and abstracts different sources under the same model [1, 2].
There are a lot of accessible sources (though not documented), but there are also clear examples of how one should never provide a service! Some examples: [3, 4]
What I was referring to, though, was a way to avoid having to build an intermediate server that scrapes services which are perfectly usable (JSON, XML), just because we (all) prefer to build clients that understand one standard type of feed.
Maybe it's not about designing a language, but just about a new way of doing things. Let's say I provide the client with clear instructions on how to use a service: its format, and where the fields the client understands are located (in an XPath-like syntax).
That should be enough to avoid periodically scraping well-behaved servers, while still being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid getting banned for accessing a service too many times, and would give data providers insight into who is really using their data.
Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:
Feed A:
{
  "status": "ok",
  "foobars": [
    {
      "name": "Foo",
      "bar": "Baz"
    }, ...
  ]
}
Feed B:
[{"n": "foo","info": {"b": "baz"}},...]
We could provide:
{
  "feeds": [
    {
      "name": "Feed A",
      "url": "http://feed.a",
      "format": "json",
      "fields": {
        "name": "/foobars//name",
        "bar": "/foobars//bar"
      }
    },
    {
      "name": "Feed B",
      "url": "http://feed.b",
      "format": "json",
      "fields": {
        "name": "//n",
        "bar": "//info/b"
      }
    }
  ]
}
All this instead of providing a service ourselves that accesses Feed A and Feed B every minute just because we want to ease things on the client.
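A rough client-side sketch of that idea - same template, two feeds, one normalized result. The path syntax here is simplified to dotted keys rather than the XPath-like notation above, and the URLs are the placeholder ones from the example:

# Client that reads per-feed templates and normalizes every record to the same shape.
import requests

TEMPLATES = [
    {"name": "Feed A", "url": "http://feed.a", "root": "foobars",
     "fields": {"name": "name", "bar": "bar"}},
    {"name": "Feed B", "url": "http://feed.b", "root": None,
     "fields": {"name": "n", "bar": "info.b"}},
]

def dig(obj, dotted_path):
    # Follow a dotted key path like "info.b" into nested dicts.
    for key in dotted_path.split("."):
        obj = obj[key]
    return obj

def normalize(template):
    data = requests.get(template["url"]).json()
    records = data[template["root"]] if template["root"] else data
    return [{field: dig(rec, path) for field, path in template["fields"].items()}
            for rec in records]

for tpl in TEMPLATES:
    for record in normalize(tpl):
        print(tpl["name"], record)  # every record now has the same {"name", "bar"} shape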
Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated into part of a consumer app. Curation is the next big trend, right? I think I'll give that a try.
I just took it for a test drive and it was an absolute pleasure. I tried to scrape all the job listings at https://hasjob.co hoping to find trends.
There is one small pain: the output is printed to the console, and I couldn't figure out how to pipe it to a file. But it did fetch all the pages and printed nice JSON.
UPDATE: there is a logfile setting to dump the output to a file.
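For reference, the Scrapy spiders Portia generates can also write items straight to a file through Scrapy's feed exports, separate from the log; a settings.py sketch (the filename is arbitrary):

# settings.py - feed exports write the scraped items themselves to a file.
FEED_URI = "items.json"
FEED_FORMAT = "json"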
I have a project which includes a huge list of websites which must be scraped heavily. My question is...
Are these kinds of tools suitable for 'heavy lifting', scraping hundreds of thousands of pages?
Can anyone give a real-life example where this visual tool would be useful? Not that I don't believe in scraping (we do it too: https://github.com/brandicted/scrapy-webdriver). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools), which is used by non-technical webmasters to tell Googlebot where to find the structured data in the page source of a website. It makes sense at Google's scale; however, I fail to see in which other cases this would be useful, considering the drawbacks: some pages may have a different structure, JavaScript may not always load properly, etc., therefore requiring the intervention of a technical person...
This is great. However, I have one bone to pick (or rather, I'd like to know if it's been taken care of).
Scrapy uses XPaths or equivalent representations to scrape. However, there are many alternative XPaths that can represent the same div.
For example: suppose data is to be extracted from the fifth div in a sequence of divs, so it would use that position as the XPath. But now say the div also has a meaningful class or id attribute. An XPath based on that attribute might be a better choice, because this content may not be in the fifth div across all the pages of the site I want to scrape.
Is this taken care of by taking the common denominator from many sample pages?
Portia uses the https://github.com/scrapy/scrapely library for data extraction. It doesn't use XPaths for learning. There are some links to papers in the scrapely README; scrapely is largely based on ideas from those papers, but with many improvements. In short - yes, this is taken care of.
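The usage pattern from the scrapely README, for anyone curious - you train on one annotated example page, and scrapely learns a template it can apply to similar pages:

# Based on the scrapely README example: train on one page, scrape another with the same layout.
from scrapely import Scraper

s = Scraper()

train_url = "http://pypi.python.org/pypi/w3lib/1.1"
s.train(train_url, {"name": "w3lib 1.1", "author": "Scrapy project"})

# Pages with the same layout can now be scraped with the learned template.
print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))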
Although this is cool, the ultimate scraper would probably need to be somehow embedded in a browser and be able to access the JS engine and DOM. Embedded as a plugin, or some other extension depending on the browser.
From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?
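One way to strip the leftover tags in post-processing is w3lib, which ships as a Scrapy dependency (so it should already be installed alongside Portia); the sample string below is made up:

# Strip HTML tags from a scraped text field.
from w3lib.html import remove_tags

raw = '<p>Some <b>article</b> text with <a href="#">markup</a>.</p>'
print(remove_tags(raw))  # -> Some article text with markup.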
I believe the GUI is run locally, but if it were run as a web application from the developer's site, it would only be able to scrape sites accessible from the public internet.