Hacker News

There are a lot of people that scrape Google pretty badly, so we do need to have protection against bots, including ones that look like the Google toolbar. If you're resuming ~50 tabs, I can believe that might look like a scraper to us for a while. I'm glad you could do regular Google searches after 15 minutes or so.



So, are you seriously telling me that Google can't tell the difference between its own toolbar used by a logged-in user and a bot?

How about changing the toolbar code so it paces the requests to something that sits below the frequency of the 'ban for bot use' trigger? That would seem to me to be an obvious fix.


Bots can probably perfectly duplicate the behavior of a toolbar. Only the rate and volume of requests would be different.

I'm assuming the toolbars can't communicate between each other. On toolbar launch, it should pick a random number between 1 and x and wait that many ms before contacting google. Pick x by looking at the number of req/sec that trigger a ban and the high-end number of tabs a power user might restart with. This would spread the requests out over that time period and keep it under the ban.
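A minimal sketch of the jitter idea above. The threshold and tab-count numbers are made-up illustrations, not Google's actual limits:

```javascript
// Illustrative sketch of the random-delay idea; BAN_RATE and MAX_TABS
// are assumed numbers, not Google's actual thresholds.
const BAN_RATE = 10;   // assumed requests/sec that trigger a ban
const MAX_TABS = 100;  // assumed high-end tab count on session restore

// To keep MAX_TABS simultaneous startups under BAN_RATE, spread them
// over a window at least MAX_TABS / BAN_RATE seconds wide.
const windowMs = Math.ceil(MAX_TABS / BAN_RATE) * 1000;

function jitteredDelayMs() {
  // Each toolbar instance independently waits 1..windowMs ms before
  // its first request, so a burst spreads out with no coordination.
  return 1 + Math.floor(Math.random() * windowMs);
}
```

Since each instance picks its delay independently, no shared state between toolbar instances is needed, which is the point of the suggestion.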


The toolbar should really only pull the page rank for active tabs. This would effectively only require a single request initially.


> Bots can probably perfectly duplicate the behavior of a toolbar. Only the rate and volume of requests would be different.

If they do, isn't that precisely what Google would want? Isn't it only the rate and volume of requests that are a problem?


There is plenty of malware that finds victims by looking at the results of google searches. Google seems to think that they have an obligation to prevent the indiscriminate spread of self-replicating infovores. Fucking Censorship if you ask me.


They might not want someone to build a large database of pageranks.


OK, wouldn't that take a very, very long time? Google has a database probably terabytes big, and if someone does want it, can't they do something like what DDG does? I believe they get their searches from Yahoo for free.


Of course they can communicate with each other, they're extensions, not web pages. The first one could act as the 'master', and proxy all the requests.

It's obvious the limiting is rate based, otherwise this would never have happened, so if it is rate based then the toolbar could pace itself to below that rate. Of course that would 'give away' the rate to observers of the toolbar during a browser restart but they could observe that just the same by checking when they get blocked, so that's no loss.

The toolbar knows I'm logged in, knows that a browser has just restarted and presumably can see how many instances/tabs are open (after all, that's what it provides the info on), so it has all the data at its disposal to make the right decision. This seems like a simple oversight to me (that a user installing the toolbar on a machine with a large number of tabs open would land in this situation).


> Of course they can communicate with each other, they're extensions, not web pages.

Firefox extensions are JavaScript, CSS, and XUL, so I don't think that's obvious. I think it's entirely reasonable to assume that they might be sandboxed and have no awareness of each other. Is it one instance of the toolbar per "page-opened" event? Is it one instance per window? What I was describing was a way to stay under the limit without having centralized, state-aware rate-limiting code. If that's possible, then yeah, sure, do it that way.

> It's obvious the limiting is rate based, otherwise this would never have happened, so if it is rate based then the toolbar could pace itself to below that rate.

It's not obvious to me. I think the issue is that the OP opens 50 tabs simultaneously after a crash and each window opens a connection to google without a rate limit of any kind. My idea was a way to do it without a centralized state.


> Firefox extensions are JavaScript, CSS, and XUL, so I don't think that's obvious. I think it's entirely reasonable to assume that they might be sandboxed and have no awareness of each other.

Multiple Mozilla extension instances are indeed able to communicate via some centralised code.


It's possible to do with JavaScript modules: "JavaScript code modules are a concept introduced in Gecko 1.9 (Firefox 3) and can be used for sharing code between different privileged scopes." https://developer.mozilla.org/en/Using_JavaScript_code_modul...
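The JSM mechanism itself (`Components.utils.import`) only runs in Gecko's privileged scope, but the shared-state idea it enables can be sketched in plain JavaScript: every importer of a module sees the same module-scoped object, so all toolbar instances could share one request counter. Everything here (the `limit` value, the method names) is illustrative, not actual toolbar code:

```javascript
// Sketch of the shared-state idea behind JavaScript code modules:
// a single module-scoped object that every "instance" sees, standing
// in for a Gecko privileged shared scope. All names and the limit of
// 5 are assumptions for illustration.
const shared = {
  inFlight: 0,
  limit: 5, // assumed cap on simultaneous PageRank lookups

  tryAcquire() {
    // Refuse new requests once the shared cap is reached.
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  },

  release() {
    // Called when a request completes, freeing a slot.
    this.inFlight -= 1;
  },
};
```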


> Of course that would 'give away' . . .

Meh, I have trouble believing that spammers cannot experiment to find this number out themselves. A binary search on the rate would require only a handful of IPs to pin it down to sufficient resolution for working purposes.
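That binary search is straightforward to sketch. `isBanned` here is a hypothetical probe (send requests at a given rate from a fresh IP and see whether you get blocked); the numbers are illustrative:

```javascript
// Binary-search the ban threshold, as the comment above describes.
// isBanned(rate) is a hypothetical probe: returns true if requesting
// at `rate` req/sec gets the IP blocked.
function findBanRate(isBanned, lo, hi, tolerance) {
  // Invariant: lo is a known-safe rate, hi is a known-banned rate.
  while (hi - lo > tolerance) {
    const mid = (lo + hi) / 2;
    if (isBanned(mid)) hi = mid;
    else lo = mid;
  }
  return lo; // highest rate known to be safe, within `tolerance`
}
```

Each probe burns at most one IP, and the interval halves every step, so even a tight tolerance costs only a handful of probes.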


Wouldn't it be simpler to rate limit the toolbar? E.g. it will only try x requests/second? Then bots wouldn't emulate it because it wouldn't be able to provide a high enough rate to be really useful.
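A common way to implement that kind of client-side cap is a token bucket. This is a generic sketch with assumed parameter names, not a claim about how the toolbar works:

```javascript
// Sketch of the suggested client-side cap: a token bucket allowing at
// most `ratePerSec` requests per second. The clock is injectable so
// the behavior is testable; all names are illustrative.
function makeLimiter(ratePerSec, now = () => Date.now()) {
  let tokens = ratePerSec; // start with a full bucket
  let last = now();

  return function tryRequest() {
    const t = now();
    // Refill proportionally to elapsed time, capped at the bucket size.
    tokens = Math.min(ratePerSec, tokens + ((t - last) / 1000) * ratePerSec);
    last = t;
    if (tokens < 1) return false; // over the cap: caller should wait
    tokens -= 1;
    return true;
  };
}
```

The bucket allows a short initial burst (useful for a session restore) but enforces the average rate after that.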

Of course, that solution is so simple, I'm sure there's a reason it's not possible.


You're oversimplifying a real problem. Would you "authenticate" your own toolbar somehow? That's probably excess effort for a temporary 15 minute ban. Would you rate limit it? Well that's a lost cause if I ever saw one; bots can rate limit themselves too.

It's not an easy problem to solve.


Other factors can also come into play, e.g. you could be sitting on an IP subnet where someone else has been scraping Google, or a worm has been sending automated queries to Google.


> you could be sitting on an IP subnet where someone else has been scraping Google

Unlikely; I'm in the sticks, and most people here are old and wouldn't know a mouse from a keyboard.

> or a worm has been sending automated queries to Google.

That would have to be a Linux-based worm then. Unless that suspected worm is sitting on another IP, of course.

Do you want me to try to make it reproducible? I'd happily spend the time if it would help to make this problem go away. I understand how hard it is to differentiate between bots and regular users, but you should be able to pick up the difference between your own toolbar in normal use situations and a bot.

And if that's not the case then either the battle is 'lost' or it might be better to simply only let the toolbar query the google servers when explicitly asked to do so.



