Bing: Why Google’s Wrong In Its Accusations (searchengineland.com)
56 points by HardyLeung on Feb 4, 2011 | 69 comments



Given that Bing clearly uses the clickstream to influence its results pages, at what point do SEO folks run farms of machines in a network which resolves 'google.com' to a process that returns a fixed set of results for a given query?

Imagine this scenario: I create 10,000 VM instances of Windows running IE8 with the Bing toolbar. I create a local host to 'stand in' for Google such that it emulates the actual Google site (one could even use scraped Google content), and for my 'target' query it returns my spammy results; then my VM clicks on one of the spammy links.

It seems that Google's sting worked despite small return rates on their queries, so with some resources this would seem a viable way to inject SEO love right into the bloodstream of Bing's ranking algorithm. As I see it, a whole new front just opened up for spammers.

--Chuck


You don't need VMs or a fake Google or anything that complex. That said, I'll grant that the complex approach might be harder to detect, or easier to pull off, than reversing the protocol itself. Even so, take a look at what HN user angusgr posted in another thread:

http://projectgus.com/2011/02/bing-google-finding-some-facts...

I'll save you some reading and point out that this is the important part, the exchange with Bing:

http://projectgus.com/files/googlebing/seaport-trace.txt

I bet I could duplicate that with a few lines of Perl. Then I could feed Bing whatever I want: no VMs, clicking, or DNS/hosts files to worry about. You might have to harvest a valid identifier (such as the one linked there) or figure out which Microsoft IP to submit it to.
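
To make that concrete, here's a minimal sketch in Python rather than Perl, assuming the protocol really is just a plain HTTP POST as the trace suggests. The endpoint, the machine ID, and every field name below are invented placeholders; the real values would have to be lifted from a capture like the one linked above:

  # Hypothetical sketch only. The endpoint and all field names are
  # stand-ins for whatever the captured toolbar traffic actually shows.
  import requests
  from urllib.parse import quote_plus

  ENDPOINT = "http://198.51.100.1/collect"  # placeholder Microsoft IP and path
  MACHINE_ID = "harvested-toolbar-id"       # a valid per-install ID, captured separately

  def report_click(query, clicked_url):
      # Pretend a real toolbar user searched `query` on Google
      # and then visited `clicked_url`.
      requests.post(ENDPOINT, data={
          "id": MACHINE_ID,
          "referrer": "http://www.google.com/search?q=" + quote_plus(query),
          "url": clicked_url,
      }, timeout=10)

  for _ in range(1000):  # volume, not cleverness, is the whole trick
      report_click("target query", "http://spammy-site.example/")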

If I were working for Bing, I'd start looking at hardening this a little, before the SEO people figure it out.


Sure, but I'm hard pressed to think of a technique that the spammers wouldn't attempt to game. Search engines and spammers will probably always be in a perpetual arms race no matter what the methodology.


This is like gaming Google Suggest. You're going to need a lot of IPs, because they're not idiots in Redmond OR Mountain View, and it's pretty simple to detect behavior like that coming from a single IP.

If the spammer gets 10,000 IPs, then more power to him. But there are probably more cost-effective ways of doing black-hat SEO.


The spammers already control botnets of that size. They could duplicate the clickstream data with a few lines of code (it's a simple HTTP request feeding Bing the clicks; the only sticky parts are deciding which Bing IP to send them to and harvesting the per-computer ID it sends to Bing).

In short, this gives them a new service to sell when they rent out their existing botnet.


Then by your logic, they're already gaming Google Suggest. And with Instant Search, that's the holy grail: who needs to optimize placement on a given keyword if you can just nudge users towards your desired keyword?

Or, more likely, the respective companies are aware of this very obvious vector and have attended to it.


What makes you assume that no one is attempting that?

Botnets are run off of the computers of ordinary, clueless folks, who might be real Bing users submitting real data in addition to whatever the botnet sends.

I've already linked to an analysis of the actual protocol data submitted by the toolbar and I can see obvious ways to copy and fake it.

If you use only the computers in your set that already have the Bing toolbar, copy each machine's unique ID, and figure out what IP they're sending it to, your data will be identical to that sent by a real user.

At that point, you have to harden the protocol and hope it stands up to reverse engineering, or start spam-filtering it (if they aren't already). Maybe they can do a good job of that, but it really lowers the quality of the data they're getting once enough people are feeding them garbage.

SEO types already set up thousands of spam websites to game PageRank. I don't see why this would be any different.


I suspect that they are. And once you can simulate a million users performing whatever activity you want them to be doing, the impact on Google, and a lot else, will change.

Of course, my favorite was the Amazon hack of repeatedly putting some item in a shopping cart and then adding a bit of pr0n (or some other weird combination) so that Amazon would pair the two items in its automatic product suggestions.


Easy to circumvent. Spend five minutes thinking about how to do it; I suspect you'll come up with at least a couple of ways.

The battle against spammers will continue to be a cat-and-mouse game. And when one of your signals gets exposed, of course it will be a target.

There's nothing new here. In fact, I think most spammers already assumed that clickthroughs from both the Bing and Google toolbars fed into results. I did.


Well, Google should do nothing less than encourage people to use those spammy methods, to either stop the practice altogether or force Microsoft to lessen the weight it uses for that particular signal.


Very clear writeup. I thought this was an excellent point:

"PR is not leading this dispute. It’s following behind. This dispute is happening because real engineers at Google felt there was a deep injustice going on — as reflected in the quote from Google’s Amit Singhal in my original article. I’ve known Singhal for years. I’ve never seen him speak like this before. It’s not because Google PR told him to. It’s because he’s fundamentally bothered by what he’s seen — as are members of his team.

This dispute is also happening because real engineers at Bing feel there’s a deep injustice going on — as reflected in the quote from Harry Shum above. Bing’s worked incredibly hard to build a search engine that’s worthy of respect. Now here’s Google suggesting that Bing has simply cheated its way to relevancy."

The conversation with moultano on a thread a couple days ago was a good example of this. http://news.ycombinator.com/item?id=2177354


From this article:

"We’re not going to stop using that signal, unless it messes up relevancy. It doesn’t make sense to exclude that large amount of traffic from our usage set," Weitz said.

From the TechCrunch article a few days back [1]:

"Google had employees log onto ms customer feedback system and send results to Microsoft."

(to which Matt Cutts replied: normal people call that "IE8")

Unlike many others, I do not think this is a cut-and-dried issue, but the squirrelly responses from the MS folks have really made me think they are just up to no good, true to the form of so much of their corporate history. Arguing with vague technical terms and ad-hominem attacks is not a good way to convince a highly technical crowd of your virtues.

[1] http://techcrunch.com/2011/02/01/bing-google-fight/


From the article: "Meanwhile, I’m on my third day of waiting to hear back from Google about just what exactly it does with its own toolbar. Now that the company has fired off accusations against Bing about data collection, Google loses the right to stay as tight-lipped as it has been in the past about how the toolbar may be used in search results."

And you don't find Google's sudden silence suspicious? This is the same company that invited the author of the original article to their headquarters the day after he wrote it.


If Google did the same thing with their toolbar, then Microsoft should be able to catch Google the same way Google caught Microsoft. Thus far they haven't, although I am sure they are trying.


My recollection is that Google has said several times that they don't use toolbar tracking to affect search results. I can't find any sources right now, though, so I'm going from memory. One exception: tracking goes into personalized history, which customizes search results for that particular account.


You suggest Google Toolbar collects Bing search results?


Inadvertently, like the Bing toolbar. It has been shown to log every pageview, just as the Bing toolbar does. Whether they filter out the pageviews when a user goes to Bing, no one knows besides Google. (The article below shows they definitely do not filter at the application level: a Yahoo query is sent back by the Google Toolbar, and Yahoo now runs on Bing.)

http://www.benedelman.org/news/012610-1.html


Logging pageviews with the Google Toolbar does not mean those views are used for Google's search results.


Well, MS now knows how to perform this sting, so I'm sure we'll hear if their results show up on Google. Until then, you have zero evidence.


Sigh... Two things: 1) Performing this sting now, whether it works or fails, wouldn't prove whether they did it in the past. The cat's out of the bag; Bing or Google could easily have changed their systems. (And I'm sure the ways the two incorporate clickstreams into their search engines are VASTLY different.)

2) More importantly, why would Microsoft need to prove Google is also using clickstreams? Microsoft does not believe using clickstreams is wrong.


> 2) More importantly, why would Microsoft need to prove Google is also using clickstreams? Microsoft does not believe using clickstreams is wrong.

If they can show that Google is a hypocrite, the issue goes away. That's motivation enough for Bing to investigate the Google side (though I doubt they'd find much).


You can't make excuses for a lack of evidence. If you're caught copying someone's homework, you can't say "but nobody can prove he didn't also copy from me earlier." It's ultimately irrelevant; you can't conclude anything out of ignorance.


Even if they don't collect and use Bing outclicks in their calculations... if they're using clicks, load times, view times, and order-of-visit information from third-party sites unaffiliated with Google, including potentially sites with robots.txt blocks against crawling, then they're doing something almost entirely analogous to Bing.


I believe Matt Cutts already stated that they don't use any clickstream signals from their toolbar in their rankings. So much has been said about this that I haven't been able to find the quote. I may even have gotten wrong which Google employee said it.


He said they don't use Google Analytics data. He has also said that the toolbar doesn't affect the indexing of a page. I don't see anything about the toolbar affecting rankings though.


Hmm... I don't think that's what I was thinking of. The wording was very specific. I don't recall Google Analytics being part of the comment, but of course I could be mistaken.


Can you find a reference for Cutts or any other Googler saying Google Analytics data isn't used for search ranking?


From what I heard, the end user agreement for Analytics says it won't affect rank.


I can't find any such term. It's definitely not in the 'Google Analytics Terms of Service':

http://www.google.com/analytics/tos.html

The 'privacy' link from Analytics leads back to the general privacy pages, and their overall privacy policy says Google may use any information they have (from logs, cookies, etc.) to "[p]rovide, maintain, protect, and improve our services (including advertising services) and develop new services".


He always specified that they don't use the clickstreams for rankings. They may use them for crawling, to build test sets, and for many other things.


I would wager most of the people commenting don't work in search or know how to use data mining to improve search. I think it is useful to see the perspective of someone who actually thinks long and hard about building these types of systems before commenting on whether it is fair, or innovative, or just plain stealing. http://hunch.net/?p=1660


So Microsoft's defense is that they are not copying Google in particular -- they are copying every search engine?



So it somehow eluded them that most of the searches done anywhere are done on Google? I don't buy it. If you have a "search signal" it's going to be effectively a "Google signal," and they aren't being honest if they contend otherwise.


They didn't say "web search", they said "search". Nearly every website has a search box. Most of my searches in any given day are not on Google - they're on Amazon, Wikipedia, Netflix, StackOverflow, Facebook, etc. Bing's toolbar is watching all of those.


You realize they have to have some system for figuring out which GET parameter matches the search box? There are lots of sites with varying GET parameters, so that means someone had to decide, BY HAND, which parameter was the right one.

There's no way they are just magically doing this for "every" search box on the web.
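
(That said, the easy cases could be bootstrapped with a heuristic before anyone curates by hand. A rough sketch; the parameter list is only a guess at common conventions, and the long tail is exactly the by-hand problem described above:)

  # Sketch: guess the search terms in a URL by trying common
  # query-parameter names. Sites this misses would need the
  # hand-curated, per-site rules described above.
  from urllib.parse import urlparse, parse_qs

  COMMON_PARAMS = ("q", "query", "search", "p", "s", "keywords")  # guesses

  def extract_search_terms(url):
      params = parse_qs(urlparse(url).query)
      for name in COMMON_PARAMS:
          if name in params:
              return params[name][0]
      return None  # unknown site: needs a manual rule

  print(extract_search_terms("http://www.google.com/search?q=bombilate"))  # bombilate
  print(extract_search_terms("http://example.com/find?id=42"))             # None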


Most of the searches I do are in the Firefox search field. If I want to search within a specific site, I use the "site:" operator to restrict results by domain.


The number of people who use that feature is not of statistical significance.


And your data showing that is... where?

I can of course say that I use this feature, and point to others who do, but I don't have data showing it is significant. You, however, have stipulated that it is not, while providing no data to prove it.


I don't have any data showing that, but come on. Think about alllll the people out there who log into Facebook by Googling "www.facebook.com". That's your average internet user, not you and me.


I'm not sure how much you interact with the "average internet user", but when I do help desk roles, as well as my role working for a floating university, I interact with hundreds of non-technical end users (Given my job history, I've interacted with thousands over the years). What I can tell you is that basing any theory about statistical anomalies on your preconceived notions without any hard data to back it up is usually an exercise in failure.


And how many of the people you talk to do you think use the site: operator?


And those are supposed to use site-provided search boxes for specific content?


I worked on the Bing AI team (pre-launch); they didn't copy Google. If anything, both search engines copy Wikipedia. I believe this was the clickstream research done by an intern researcher from UBC.


I think there is a DEEP flaw in this argument:

"Here’s another one. This time, it’s a misspelling of “bombilate,” a rare word I cited above. I searched for “bombilete,” instead.."

In essence, they say that Google only pointed to the typo, but Bing redirected to it. Thus, "it's very unlikely it figured this out from Google." For me, making that argument is insulting to the reader's intelligence.


You missed part of the argument:

"But it’s very unlikely it figured this out from Google, given that for the misspelling, Google doesn’t auto-correct the word nor provide the same answer."

Bing's #1 result is not in Google's results at all.


Google has the spell correction. Microsoft could certainly be harvesting that data directly. (Which appears to be what they are doing on that query.)


Very true... they might be doing this as part of their "cost cutting" exercises. Harry Shum mentioned this, and I am not sure what made him bring up "cost cutting" in that panel discussion where Matt Cutts and the Blekko CEO participated.


Round and round the speculation wheel goes! Where it stops, no one knows!


Using what people click on in competing search engines' results sounds like copying the competition to me.

It sounds like they automated the process of piggybacking off the work of other search engines, not just Google.

They should exclude all competing search engines from this process.


From the article (about spell correction):

> Well, above is the same situation where Bing gets a misspelled word right — a link to a definition of the correctly spelled word at the top of the list. But it’s very unlikely it figured this out from Google, given that for the misspelling, Google doesn’t auto-correct the word nor provide the same answer.

Perhaps because people first click on the spell correction and then on the result, so maybe they don't yet copy the spell correction, only the results. I think this makes the case that they copy the results stronger, not weaker.


Doesn't this passage suggest that Google ignores robots.txt in its own cross-comparisons of search relevancy:

"Google said in October that it found statistical evidence that Bing suddenly became more Google-like. More listings in the first page of results of both search engines seemed to match, as did more of the number one results."

How would you get statistically significant results for such things, over time, without constant automated probe queries against Bing?
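
(For what it's worth, the measurement itself is trivial once you have the result lists. A sketch, where fetch_top10 is a hypothetical helper that issues a probe query against an engine and returns its top ten result URLs in order:)

  # Sketch: quantify how "Google-like" another engine's results are.
  # fetch_top10(engine, query) is a hypothetical probe-query helper
  # returning an ordered list of up to ten result URLs.

  def overlap_stats(queries, fetch_top10):
      page_overlap = 0.0  # avg fraction of shared first-page results
      top_match = 0       # queries where the #1 results agree
      for q in queries:
          g = fetch_top10("google", q)
          b = fetch_top10("bing", q)
          page_overlap += len(set(g) & set(b)) / 10.0
          top_match += bool(g and b and g[0] == b[0])
      n = len(queries)
      return page_overlap / n, top_match / n

Tracking those two numbers over time across a fixed query set is presumably how a sudden October shift would show up.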

I think such probes are both legal and wise... but Google should drop the pretense that robots.txt is a sacred barrier across which no analysis can be done, no matter how indirect or for what purpose.

Also, I'd wager at some time in its history – if not constantly even today – Google has shown panels of users results from Google and its competitors in various combinations – side-by-side, with and without branding, intermixed randomly – and used their reactions to detect areas where the competitors are doing well, and Google could improve.

Further, either human eyes or algorithms then tried to determine adjustments to close any gaps in user satisfaction. The net effect of any such process is – surprise, surprise! – leveraging strengths of other engines to patch weaknesses in Google. This is normal, expected behavior by any serious search competitor.


I think there is a major difference between identifying areas for improvement and copying a specific search result from a competitor.


I don't see it as that different. If the panel suggests a competitor's top-10 result is strongly preferred, and then Google examines why they didn't have that in their top-10, and tinkers with their weightings until they do, then they've used a competitor's output signals to train their own systems.


>I don't see it as that different.

It is very very different. You can't tinker with the weighting of something that doesn't exist, and if you don't have the necessary data, no amount of reweighting is going to improve things. Microsoft in this case no longer needs to come up with any data of their own, they can just use clicks on Google as a proxy for any combination of signals.


Does Google use clicks from anywhere on the web other than their own sites as a ranking signal?


I don't have an exhaustive knowledge of everything we do, and I couldn't tell you if I did.

Amit has made it very clear however, that we will never do anything that would cause us to directly or indirectly copy a competitor's search results.


It's clear to me Google wouldn't intentionally do something identical to what Bing's done.

But given what Googlers can't say, and don't even know about what other groups within Google are doing, it's not 'very clear' to me that you aren't already doing very very analogous things, with regard to every other site on the internet.

You've got the data; you're allowed to use it by your privacy policy; you've got the rationalizations handy. ("Sites didn't block us; fully-informed users opted-in; this is a crucial way to fight the manipulators; it's only helping us weight things we already found by other means; etc.")

Amit's not made anything clear to me, with his finessed "put any results" wording. Danny Sullivan picked this up too, as he remarks in the headlined article:

Google’s initial denial that it has never used toolbar data “to put any results on Google’s results pages” immediately took a blow given that site speed measurements done by the toolbar DO play a role in this. So what else might the toolbar do?

There's wiggle room in the definitions of 'copy' and 'competitor' in your 'never' promise, too. Is it OK if Google Toolbar data hoovers up implied editorial-quality signals from user navigation on every site that isn't a 'competitor'? (And given Google's size, what site isn't a competitor in some respects for audience share?) Is it 'copying' if your use of clicktrails makes a preexisting result move from #11 to #9 after you observe it satisfying people in other browsing sessions? Move from #99 to #2?

(Has the effect of any of Google's competitive analysis ever resulted in a single result moving closer to the position, higher or lower, that it had at a studied competitor? Some people could call that 'copying'.)

Maybe none of the clickstream sources Google uses stick out as a dominating factor because Google has so effectively "commoditized its complements" – and no other entity (except maybe Facebook) has access to as much clickstream data as Google does, simply from its own sites.

Given that, it seems a little convenient that Google's standard is "every aggressively creative use of behavioral trails that led up to our 70%-90% share dominance was OK, but from now on let's be really rigorous about letting others observe our info-effluents."


For more official policy you'd have to ask either Amit or Matt. I can't speak for the company here.

Speaking only for myself and only on the ethics, I generally feel that any site that allows itself to be indexed is pretty happy with Google (or Bing) doing whatever they can to rank it better. Even with the link data that sites provide, you can add rel=nofollow to links you don't want search engines to use while still keeping your pages indexed (Yelp has done this, for instance).

For me that's the ethical boundary. Sites have various ways of indicating their wishes, and that ought to be respected in spirit beyond the technical details.

Legally, the technologies that make the internet work all rely on the idea of fair use, so it is very important whether something is "fair."


I've seen no statement that Google throws out Toolbar (and other) clickstream data for sites/pages that Googlebot can't visit (which includes not just robots-precluded but also login-required pages). Not that I think you should throw such data out; that's not what robots.txt was meant for, and the user arguably has more claim to that interaction trail than the site. But that seems the standard you're suggesting.

If Google doesn't want IE features or the Bing Toolbar observing its site interactions, it can disallow such visitors. A steep price to pay, at too coarse a level of control? Yes, just like a site deciding to bar Googlebot.

I would agree that a 'fair use'-like analysis makes sense.

I would further agree that any site solely, or predominantly, powered by indirect observations of Google users would be an unfair taking. You'd crush such a site in court.

Meanwhile, a site that tallies Google referrer inclicks for itself, or for a network of participating sites (as with analytics inserts), even republishing summaries of Google source URLs and search terms as public data, is almost certainly fair use. It's taking data you're dropping freely onto third-party site logs, and making a transformative report of it.

What Bing is doing seems to me somewhere in-between. The mechanism avoids literal copying of specific artifacts but the net effect in some cases approaches the same result. As with other 'fair use' analysis, it's rarely black-and-white. The magnitude of the information used, its effects on the market, and the value-added transformation afterward are all important. I don't know how a court would rule in such a suit but the discovery process would surely be fun for spectators like myself!


I found this very strange... Did Bing increase clickstream data usage in their algorithm when Google saw Bing results get more Google-like in October of 2010 or not?

Shum:

  Not so, Bing told me. In October, Bing says it rolled out a new 
  ranking algorithm plus a new experimental system called “Aether” 
  that allows them to test changes in their ranking methodology. 
  That’s what caused the bump that Google saw, not some sudden use 
  of the surfstream, Bing said.
A web search (restricted to October 2010):

http://www.google.com/search?q=Bing+new+ranking+algorithm...


Good article (a surprisingly heartfelt one -- odd for a tech blog). Key paragraphs: Bing says it does NOT do this. It says there is no Google-specific search signal being used, no list of all the popular pages as selected just by Google users. Instead, it has a "search signal" based on searching activity observed across a range of sites.

For example, if you did a search on Amazon, Bing might detect that. A search on eBay might get spotted. A search on Yahoo might also get extracted. Any number of searches might be identified. Bing would associate the next page you went to after doing those searches as a possible "answer" to those searches.
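
A minimal sketch of that association step, assuming a toolbar-style trail of URLs in visit order; detect_search is a stand-in for whatever logic recognizes a search URL and extracts its terms:

  # Sketch: turn a browsing trail into (query, candidate-answer)
  # pairs, per the description above. detect_search(url) is a
  # hypothetical helper returning query terms, or None for
  # non-search pages.

  def extract_signals(trail, detect_search):
      pairs = []
      for prev_url, next_url in zip(trail, trail[1:]):
          query = detect_search(prev_url)
          if query is not None:
              # the page visited right after a search is treated
              # as a possible "answer" to that search
              pairs.append((query, next_url))
      return pairs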


I think comparisons to Amazon or eBay search are misleading. Amazon only indexes and searches its own site - the same goes for eBay. So of course these companies wouldn't mind if Bing copied some of their search results; after all, since their results link back to their own pages, this would only drive more traffic to their sites.

It's a much different case when Microsoft is copying search results from Google or another whole-web search site. There Microsoft is competing directly with Google's product, and the bottom line is that it's fundamentally fishy for Microsoft to be using Google's own search results against it.


Copying search results. I love the notion of that.

Somewhere in the world at this moment there's a person with the Bing Toolbar installed searching for something on Google. One of his results is a website; for the sake of narrative, let's suppose it's one of our own: a YC startup.

The searcher clicks to the YC startup, it's exactly what he's looking for, he converts.

Now, another searcher, using Bing by choice, searches that same term. That YC startup is in the results. The click is made, another conversion happens.

The first user consented to his click analysis by installing the toolbar.

The startup will surely agree that yes, we are a very good result for that term! We should show up on Bing, DDG, Google, Whatever! And we don't care how we get there!

The second user gets a result that's maybe ranked higher because the first user's click was noticed by Bing.

Google doesn't lose a customer because the second searcher was already on Bing to begin with.

Bing does nothing but analyze their own users' behavior (users of their toolbar) to deliver better results.

I find controversy over this beyond absurd.


> The first user consented to his click analysis by installing the toolbar.

The toolbar says nothing about using his data to improve Bing, only for Site Suggestion, which, to the average user, is clearly a browser feature.

> Google doesn't lose a customer because the second searcher was already on Bing to begin with.

What about one who could have converted to Google if he had searched on Bing and not found a result?

Google may not lose a customer, but there's also no way Google would gain a convert from Bing in this situation.

> Bing does nothing but analyze their own users' behavior

Assuming this is a search result that Bing wouldn't have found by itself, then this "user behavior" wouldn't have occurred for Bing to capture had Google not existed.

Without Google, each user would probably just keep clicking the same old websites they collected long ago; there would be no search engine for them to discover relevant new sites easily.


Don't take it personally that I'm not going to "line by line" your comment the way you did mine -- I'm pretty busy today.

But the issue is, my clickstream is mine. If I choose to share it with somebody -- Bing, Google, my mom -- it's my choice.

If I, as a user, don't want Bing to have access to my clickstream, I won't install their toolbar.

But if I do, it's not Google's business.


The issue is simple: Google uses all its heavy resources to rank that startup #1 (good or bad). But Bing just uses its toolbar to capture that information and rank them the same. They call this "cost cutting". Doesn't this sound cheap?


It is known to all that both use user data in whatever way possible. But what Bing seems to be doing is using it on users searching on Google (through IE, etc., or maybe even Windows) and finding the more relevant results. Did you guys notice Harry Shum mentioning "cost cutting" in the video where he, Matt, and the Blekko CEO discuss this issue?

So it looks like Bing is indirectly using Google's data while incurring less cost, and this is how they seem to do it: spy on Google searchers and build the right database, instead of spending on innovation and new approaches to the algorithm. Let Google do that part while they piggyback on the good results.

This is what has really irritated Google.


Remember when the polarizing search battle was between Google and Yahoo? ...Yeah, neither do I.

These battles do nothing more than establish the two dominant players.



