
I tried both curl and wget last night (neither of which is an HTML5-ready browser), and neither could get content using the hash-bang URL. They both came back with an empty page skeleton.

Also, how do you reassemble the hash-bang URL from the HTTP Referer header?




Neither curl nor wget follows the Google convention for handling hash-bangs as suggested by the parent, so I'm not sure what you're getting at with this reply.


Hash-bang URLs are not reliable references to content - that's what I am getting at. curl and wget are perhaps the most widely used non-browser user-agents on the web, and both of them are unable to retrieve content at a URL specified by a hash-bang (the fragment is never sent to the server, and neither tool runs the JavaScript that would interpret it).

In this context, hash-bang URLs are broken.


I'm sorry if I implied that curl/wget handle this already. However, they could handle this with a very small wrapper script, maybe 3 lines of code, or a very short patch if the convention becomes a standard. That's not nothing, but it's maybe 7 orders of magnitude lighter than a full JS engine, and it's small anyway compared to the number of cases that a reasonable crawler needs to handle.
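Roughly, all such a wrapper has to do is rewrite the #! URL into its _escaped_fragment_ equivalent before handing it to curl. A minimal sketch of that idea (in Python rather than a shell one-liner; the script and function names are invented, and the escaping is an approximation of Google's rules):

```python
import subprocess
import sys
import urllib.parse

def to_escaped_fragment(url):
    # Google's AJAX-crawling convention: everything after "#!" moves into an
    # _escaped_fragment_ query parameter on the base URL.
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # not a hash-bang URL; pass it through unchanged
    joiner = "&" if "?" in base else "?"
    # Over-escaping is harmless here: a server implementing the convention
    # percent-decodes the _escaped_fragment_ value before using it.
    return base + joiner + "_escaped_fragment_=" + urllib.parse.quote(fragment, safe="")

if __name__ == "__main__":
    # e.g. python hashbang_curl.py 'http://twitter.com/#!/someuser' ends up running:
    #      curl -L 'http://twitter.com/?_escaped_fragment_=%2Fsomeuser'
    subprocess.run(["curl", "-L", to_escaped_fragment(sys.argv[1])], check=True)
```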

Also, with that wrapper or patch, curl and wget will still not be remotely HTML5-ready, which I hope demonstrates that HTML5 is not a requirement in any way. The fact that a single non-HTML5-ready browser can't handle this out of the box doesn't mean HTML5 is a requirement.


They aren't? You're only supposed to use them if you follow Google's convention, in which case they should be reliably replaced with a normal URL sans the hash. Of course your scraper must be aware of this, but it should be a somewhat reliable pseudo-standard (and it is just a stopgap after all).
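For completeness, the other half of the convention sits on the server: when a crawler requests the rewritten URL, the site is supposed to hand back a pre-rendered snapshot for that fragment instead of the empty JS shell. A rough, framework-free sketch of that lookup (function name invented):

```python
from urllib.parse import parse_qs

def requested_fragment(query_string):
    # A server following the convention checks for _escaped_fragment_
    # (parse_qs percent-decodes the value) and, if present, serves the
    # pre-rendered HTML snapshot for that application state.
    params = parse_qs(query_string, keep_blank_values=True)
    values = params.get("_escaped_fragment_")
    return values[0] if values else None

# requested_fragment("_escaped_fragment_=%2Fsomeuser") -> "/someuser",
# i.e. the state a browser would reach via http://twitter.com/#!/someuser
```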



