Googlebot now makes POST requests via AJAX (thumbtack.com)
146 points by avirambm on Oct 11, 2011 | 44 comments



From the blog post:

The source of the requests is our client-side JavaScript error tracking code, which installs a global JavaScript error handler and attempts to POST to our server when unhandled errors are detected on the client.
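
A global error handler of that sort typically looks something like this (a minimal sketch; the /log-js-error endpoint is made up, not Thumbtack's actual code):

  // Install a global handler and report uncaught errors to the server.
  window.onerror = function (message, url, line) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/log-js-error', true);  // hypothetical endpoint
    xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
    xhr.send('message=' + encodeURIComponent(message) +
             '&url=' + encodeURIComponent(url) +
             '&line=' + line);
  };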

Sounds like Googlebot is executing more advanced JavaScript, though it's pretty scary that it's allowing POSTs to go through.


I think it'd be silly to block POSTs in cases like this. The page makes an unsolicited POST. It's quite reasonable for any crawler to do the same. How else would they get an accurate representation of what the user sees?

Now, if Google is sending random forms, as another commenter claims, that's a different case. But unsolicited POSTs coded in the page's scripts? Send 'em.


I believe the HTTP spec defines GET as retrieving a representation of a resource and POST as modifying or updating a resource, so Googlebot doing POSTs is not very neighborly.


By that definition, unsolicited POSTs are not very neighborly.


I don't see how. Updating a view count or putting a pin on a "countries that have visited this page" map is fairly benign. Trying to randomly scribble on somebody else's data is not.


If all the Googlebot is doing is executing the JS as an arbitrary user's browser would, and part of that page load JS involves doing a POST, I can see the argument for doing it.


Agreed. Otherwise a page could dynamically load HTML via AJAX and give a user a totally different view than what Googlebot sees.


I noticed this a few days ago. I'm actually considering POSTing the screen dimensions and a few other browser properties for Googlebot via JS.

https://twitter.com/#!/lm741/status/122378906669023232
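
Something like this would do it (a sketch; /bot-probe is a made-up endpoint):

  // Report what Googlebot's "browser" looks like on every page load.
  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/bot-probe', true);
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify({
    width: screen.width,
    height: screen.height,
    ua: navigator.userAgent
  }));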


I like your idea. Let's find out something about what the Googlebot is "seeing" and log the POST requests in a database table.


Surely you can do this with GET with minor server trickery? (Has such been done already?)


Correct. The richness of the JavaScript environment for Googlebot is probably a more significant discovery than the POST requests. I had pretty much assumed that bots didn't really run JavaScript (or at least that they would kill network access before running it).


This is a bit annoying. My company has an IRC bot that notifies us when someone fails to properly fill out an important AJAX-POSTed form on our front page, and we soon found out that all the "errors" we were seeing on IRC were generated by Googlebot.


No crawler should perform POST requests. It is simply bad etiquette, and it is understood that POST requests are typically used to create/change/delete content or otherwise affect the environment.


Don't write unsolicited POSTs into your page if you don't want crawlers to execute them. I think Google's doing the right thing in this particular case. If a POST happens automatically for all users, then crawlers must do it too to ensure they get the same view.


If crawlers can perform them, then your users can (and do) too. I fail to see why crawlers need to abide by different rules.


Does this mean Googlebot can flag, edit, and roll back Wikipedia pages?


I wonder what this means for SEO. If they are actually fully rendering pages with JavaScript, I guess that means you have to be a lot more careful about how pages are laid out.


This is definitely important in terms of how much "value" Google assigns to links on a page, if they can identify where those links render. They hold a patent on varying the amount of PageRank that flows to a link based on how likely the "reasonable surfer" is to click it. So in terms of flowing PageRank to a bunch of high-value landing pages via a slew of footer links, it's game over.

My only question is how frequently Googlebot actually renders the full JavaScript. Surely they don't have the computing resources to render it on every crawl.

Link to patent discussion: http://www.seobythesea.com/2010/05/googles-reasonable-surfer...


Surely they don't have the computing resources to render it on every crawl.

Why do you think this? They had the computing resources to randomly generate billions of mutations of compiled Flash programs to find bugs in the Flash VM.


They seem to have been doing this for a while: Googlebot was running a tracking script that gets dynamically loaded and communicates with an iframe using postMessage, which then sends the AJAX request. If it is doing that, you would assume it's doing most of everything a browser does now.
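
For the curious, that pattern looks roughly like this (a sketch with made-up names, not the actual tracking script):

  // Parent page: the dynamically loaded tracking script posts an event
  // to a cross-domain iframe.
  var frame = document.getElementById('tracker-frame');
  frame.contentWindow.postMessage(JSON.stringify({event: 'pageview'}), '*');

  // Inside the iframe, on the tracker's own domain: receive the event
  // and forward it to the server via AJAX.
  window.addEventListener('message', function (e) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/track', true);  // hypothetical endpoint
    xhr.send(e.data);
  }, false);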


As a corollary, does this mean that Googlebot now reads pages generated by JavaScript? I remember that you needed to follow their AJAX guidelines, as well as generate the actual page on the server, but if they're able to run JavaScript on pages now, does this mean they let the page render first (or after a delay of some sort) before parsing it?

That would be cool.


We have witnessed a Google "bot" use one of our AJAX requests, which is strange considering we have blocked all of our AJAX requests via robots.txt -- it disallows requests on /remote/, which all of our "remote" (AJAX) requests are proxied through. Also, the request only happens (automatically) for a user after they have made a POST to another form; the latter requires POST data, whereas the POST in question could really be a GET -- our AJAX wrapper requires an explicit use of GET.
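
For reference, the rule in question would look roughly like this in robots.txt (the exact User-agent line is a guess):

  User-agent: *
  Disallow: /remote/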

Nonetheless, there have been some more recent Google "bots," including the one making the POST in question, that may be of interest to those who track these metrics -- you may want to remove these requests from internal reports.

Requests:

  74.125.78.83 "POST /remote/poll/230378/demographics/ HTTP/1.0" 200 6469 "http://www.sodahead.com/united-states/would-you-like-to-wish-my-daughter-hannahgirl-a-happy-birthday/question-230378/?page=2 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51" www.sodahead.com
  74.125.78.83 "GET /entertainment/are-cigarettes-destroying-adeles-voice/question-2212639/ HTTP/1.0" 200 9240 "-" "Mozilla/5.0 (Linux; U; Android 2.3.4; generic) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Version/4.0 Mobile Safari/534.51" m.sodahead.com
  74.125.78.83 "GET /united-states/how-the-white-house-public-relations-campaign-on-the-oil-spill-is-harming-the-actual-clean-up/blog-367099/ HTTP/1.0" 200 39784 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51" www.sodahead.com
  74.125.78.83 "GET /living/white-tea-natural-fat-burner-will-you-take-it-and-get-ready-for-the-summer/question-362433/ HTTP/1.0" 200 14260 "http://translate.google.com.eg/translate_p?hl=ar&prev=/search%3Fq%3DWHITE%2BTEA%2BFAT%2BBURNER%26hl%3Dar%26biw%3D1024%26bih%3D634%26prmd%3Dimvns&sl=en&u=http://www.sodahead.com/living/white-tea-natural-fat-burner-will-you-take-it-and-get-ready-for-the-summer/question-362433/&usg=ALkJrhhLw3GWfeKOfwKa0CK-pbsDlRuEXA "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.220 Safari/535.1,gzip(gfe) (via translate.google.com)" www.sodahead.com
  74.125.78.83 "GET /living/do-you-think-too-much-about-death/question-1785829/ HTTP/1.0" 302 20 "-" "Mozilla/5.0 (Linux; U; Android 2.3.4; generic) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Version/4.0 Mobile Safari/534.51" www.sodahead.com
  74.125.78.83 "GET /entertainment/kim-kardashian-boobs-too-big/question-2168867/ HTTP/1.0" 200 12680 "-" "Mozilla/5.0 (en-us) AppleWebKit/534.14 (KHTML, like Gecko; Google Wireless Transcoder) Chrome/9.0.597 Safari/534.14" www.sodahead.com
UserAgents (Google "bots"):

  Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51
  Mozilla/5.0 (Linux; U; Android 2.3.4; generic) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Version/4.0 Mobile Safari/534.51
  Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51
  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.220 Safari/535.1,gzip(gfe) (via translate.google.com)
  mobile request against non-mobile site: Mozilla/5.0 (Linux; U; Android 2.3.4; generic) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Version/4.0 Mobile Safari/534.51
  Mozilla/5.0 (en-us) AppleWebKit/534.14 (KHTML, like Gecko; Google Wireless Transcoder) Chrome/9.0.597 Safari/534.14


I run a site which triggers POSTs via Ajax for visitor tracking. We explicitly disable crawling of the Ajax directory via robots.txt and have not observed Googlebot crawling these URLs. Strange that Googlebot behaves differently for you.


These don't seem like crawlers, but rather proxies requesting information on behalf of a user (perhaps the user agent given in the UA string). Those are "Google bots," but they aren't the Googlebot.


I wonder if you could add your POST URLs to robots.txt if you don't want the crawler to access them.

If other crawlers start doing this, it should probably be addressed formally in the robots.txt standard.


I'm an engineer at Thumbtack. And yes: we have noticed that Googlebot does seem to obey robots.txt even when issuing these AJAX requests.


According to Google, they can read your AJAX content to some extent, but you have to follow some guidelines: http://www.google.com/support/webmasters/bin/answer.py?answe...
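
For the curious: that scheme (as of 2011) has you serve content under hashbang (#!) URLs, which the crawler rewrites into an _escaped_fragment_ query parameter and fetches as a server-rendered snapshot, e.g.:

  browser URL:     http://www.example.com/page#!state=1
  crawler fetches: http://www.example.com/page?_escaped_fragment_=state=1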


That's a little spooky. I wonder how many people have AJAX POSTs for handling deletes?


In order to delete any content owned by a user, Googlebot would have to have logged in as that user first, which seems unlikely - so long as all destructive actions are behind access control, which they definitely should be already.


This is very far from true. Failing to implement access control or even authentication on POST is a routine error even in applications built in the last couple of years.


Well, maybe not for much longer...

Less facetiously: Googlebot isn't making random POST requests, it's just executing JavaScript on the page. So not only would you need an unprotected POST /comments/1234/delete endpoint, you'd also need to serve the UI and JavaScript for POSTing to that endpoint to an unauthenticated user.

I'm sure there are still people out there doing that, but at least it's more than a simple error of omission.


I agree that this Googlebot change is unlikely to be the end of the world.


It might be for wikis that allow anyone to delete content.


Do those wikis also have JS that attempts to delete content without user interaction?


There have been evil spam bots executing JS and posting forms for some time. Those sites were already screwed.


Well, maybe not for much longer...

I wonder how much the Google Web Accelerator debacle made people realize that GET is supposed to be free of side effects.

Perhaps Googlebot will do the same for POST and authentication.


It definitely requires that you be more vigilant in the design of something like crowdsourced flagging, though.


http://thedailywtf.com/Articles/The_Spider_of_Doom.aspx

I think this is more common than you believe.


Does it only do POSTs that are in the document's onload handler, or also things that are in onclick handlers? I think the latter could be dangerous:

<a href="#" onclick="$.post(etc)">Delete</a>


Why would you let an anonymous visitor delete data just by clicking a link?


The obvious example is a wiki.


The implicit expected behaviour of clicking a link is that of a GET - i.e. not updating or deleting data. Delete actions should be a POST submitted via a form.
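
E.g. something like this (hypothetical URL), so that link-following crawlers and prefetchers never trigger it:

  <form method="post" action="/items/1234/delete">
    <button type="submit">Delete</button>
  </form>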


In theory, theory and practice are the same. In practice, they are not.


Would it click facebook Like, too? ;)



