Most scrapers seem to break down on heavily dynamic/AJAX pages. For example, anything built with GWT appears to give the average scraper little to grab (say, for automated daily tracking of Android app downloads). Short of reverse-engineering the foreign page's API calls, has anyone found a solution that does more processing and then scrapes the rendered page?
(well, and short of using Selenium to script a login and then scraping the rendered page via a controlled Firefox... which works, but is clunky)
I was looking at several scraping solutions (e.g. iMacros, Selenium) that can handle dynamic HTML for a project, and they all have significant performance issues since they need to render the actual pages before processing them. A couple of thousand rows isn't a problem, but try anything more and you've got a real performance bottleneck.
DHTML just means client-side DOM scripting; what makes these pages hard to scrape is AJAX. Think of the page as an interface to a more lightweight web service, and parse that service directly.
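For instance, if your browser's network inspector shows the page pulling its data from a JSON endpoint, you can usually hit that endpoint directly and skip rendering entirely. A minimal sketch (the URL, cookie handling, and response shape here are hypothetical; substitute whatever the network inspector actually shows):

```python
import json
import urllib.request

def fetch_download_stats(app_id, session_cookie):
    """Call the page's underlying JSON endpoint directly.

    The URL is hypothetical -- find the real one by watching the
    XHR traffic in your browser's network inspector.
    """
    req = urllib.request.Request(
        "https://example.com/api/stats?app=%s" % app_id,
        headers={
            "Cookie": session_cookie,
            # Many endpoints check for this header on AJAX requests.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return parse_stats(resp.read().decode("utf-8"))

def parse_stats(raw_json):
    """Extract date -> download-count pairs from the (assumed) response shape."""
    data = json.loads(raw_json)
    return {row["date"]: row["downloads"] for row in data["rows"]}
```

No page rendering, no browser automation; each row costs one HTTP round trip at most, which is why this approach scales past the few-thousand-row wall that browser-driven scrapers hit.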