Most scrapers seem to break down on heavily dynamic/AJAX pages. Anything built with GWT, for instance, appears to give the average scraper almost nothing to grab (say, for automated daily tracking of Android app downloads). Short of reverse-engineering the foreign page's API calls, has anyone found a solution that does the extra processing and then scrapes the rendered page?

(Well, and short of using Selenium to script a login and then scrape the rendered page via a controlled Firefox... which works, but is clunky.)
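
For reference, the Selenium route looks roughly like this. A minimal sketch, assuming Selenium 4 with headless Chrome; the URL and CSS selector are made-up placeholders:

    # Render a JS-heavy page headlessly, then scrape the result.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/gwt-dashboard")  # hypothetical page
        # Wait until the AJAX-rendered element actually exists in the DOM.
        cell = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".download-count"))
        )
        print(cell.text)
    finally:
        driver.quit()

Still clunky, but at least it runs without a visible browser.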




I was looking at several scraping solutions (e.g. iMacros, Selenium) that can handle DHTML for a project, and they all have significant performance issues, since they need to render the actual pages before processing them. A couple of thousand rows isn't a problem, but try anything more and you've got a real performance bottleneck.


DHTML is server-side. You mean AJAX. Also, think of the page as an interface to a more lightweight web service. You should probably be parsing that directly.
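
Concretely, "parse it directly" is just a plain HTTP call once you know the endpoint. A sketch; the URL, query parameter, and JSON field are invented, you'd pull the real ones from the browser's network tab:

    # Skip the rendered page and hit the underlying feed directly.
    import requests

    resp = requests.get(
        "https://example.com/api/stats",          # hypothetical XHR endpoint
        params={"app_id": "com.example.app"},     # hypothetical parameter
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()

    data = resp.json()
    print(data["downloads"])  # whatever field the feed actually exposes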


He's referring to this: http://en.wikipedia.org/wiki/Dhtml

I'm not sure what DHTML you are thinking of that would be server-side.


Fuck, thinking of SHTML for some reason.


Have you tried Watir? I'm not sure if it'll solve your performance issue, but it's been at least twice as fast as Selenium for me.


Use Charles or another proxy to find what feed the page is loading, then parse that feed directly. The page is making at least two requests anyway.
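
Once the proxy shows you the request, replaying it takes a few lines. A sketch; every URL, header, and cookie value below is a placeholder for whatever Charles actually captured:

    # Replay the XHR the proxy revealed, copying the browser's headers.
    import requests

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0",           # match what the browser sent
        "X-Requested-With": "XMLHttpRequest",  # many endpoints check this
    })
    session.cookies.set("sessionid", "PASTE_FROM_PROXY")  # auth cookie, if any

    resp = session.get("https://example.com/internal/feed")  # hypothetical URL
    resp.raise_for_status()
    print(resp.json())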



