
I, for one, forget to appreciate the fact that one of the fastest web apps I regularly use is the one that searches the entire* internet.

*kinda




I would still love to get a good layman’s explanation of how Google and other search engines can be so fast. I’m not a developer but I know a few basic concepts. Sorting makes searching fast – but that fast?

I can imagine that if I’m searching for “ball” Google already has the PageRank sorted results somewhere on its servers. That seems easy enough. But what about arbitrary search terms? Searching for “tango durlast 1978” (a soccer ball) gives me results just as fast, does Google have all of that pre-computed somewhere?

If they did that for search terms of up to 100 characters (to pick an arbitrary number), using only the 26 letters, the ten digits and a space (37 characters), that would mean something like 37^100 ≈ 6.6e+156 pre-computed lists of results somewhere on their servers.

That can’t be what they are doing. Even if they used only one byte per possible search term (to pick another arbitrary and completely ridiculous number), they would still need to store 6.6e+144 terabytes. Which seems kinda impossible.
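
(A quick Python sanity check of those numbers, assuming exactly 100-character strings over that 37-character alphabet:)

    # Back-of-the-envelope check: 100-character strings over a
    # 37-character alphabet, stored at one byte per term.
    terms = 37 ** 100
    print(f"{terms:.1e} possible terms")        # ~6.6e+156
    print(f"{terms / 1e12:.1e} TB of storage")  # ~6.6e+144 TB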


1) To be able to "search (almost) the whole internet" you have to pull the content and have "almost whole internet" on your own servers.

2) You have to update your copy of the internet regularly.

3) You don't want to search all of that every time somebody types a few words. So you write programs that regularly prepare indexes in advance (just like the index in the back of a book), and when a query comes in you can "just" look it up in those (a toy sketch of such an index follows below).

The steps above sound simple, but you can imagine that the devil is in the details, that a lot of know-how is needed, and that it takes a lot of money to do it really right. Just ask Microsoft or even some failed search company.
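
A toy version of step 3 in Python, just to make "prepare indexes in advance" concrete (made-up documents; a real engine adds scoring, compression, sharding and so on):

    # Toy inverted index: term -> set of document ids containing it.
    from collections import defaultdict

    docs = {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "quick dog tricks",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        # Look up each word's posting list and intersect them.
        postings = [index.get(term, set()) for term in query.split()]
        return set.intersection(*postings) if postings else set()

    print(search("quick dog"))   # {3}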


There are two basic ways to build a search engine.

First think of the web as a big matrix with each row being a document and each column being a term. The first step is to decide how you want to shard this matrix across multiple machines.

You can shard by row, which lets you do really complicated scoring, but you have to hit every machine for every query.

You can shard by column, which is much cheaper, but limits how sophisticated your scoring can be.

Within each machine the problems are all probably fairly familiar.
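
Very roughly, in Python (the shard count and the hash routing are made up, just to show where a query has to go in each layout):

    # Two ways to split the document/term matrix across N index servers.
    N = 4  # made-up number of machines

    def doc_shard(doc_id):
        # Shard by row (by document): every machine holds complete rows,
        # so it can score its own documents with arbitrary features, but
        # every query fans out to all N machines and results get merged.
        return hash(doc_id) % N

    def term_shard(term):
        # Shard by column (by term): each machine owns whole posting
        # lists, so a query only touches the machines owning its terms,
        # but scoring that needs several terms at once is harder.
        return hash(term) % N

    query = ["tango", "durlast", "1978"]
    print("row-sharded: ask machines", list(range(N)))
    print("column-sharded: ask machines", sorted({term_shard(t) for t in query}))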


Doesn’t that matrix get big fairly quickly? Are there any tricks to keep it as small as possible?


One trick whose effect I think people underestimate is that Google caches queries. When you're searching for "Madonna", you are not causing the web index to be traversed, you're retrieving the saved search for "Madonna". Google can take a couple of seconds if you throw a truly novel query at it.

Of course, storing millions of "common queries" is no small feat either, and it still doesn't answer the question of how they search the internet within a few seconds in the first place; I'm not trying to give a complete answer, just pointing out one aspect. Google Instant, by design, serves the Google Suggest queries the vast bulk of the time, and I'd guarantee those are either largely or even entirely pre-cached, so you're not doing 10 web searches in two seconds, you're doing 10 hash lookups in two seconds. Which is still damned impressive; where I work I can hardly keep my web site loading in under two seconds for a single hit.
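
A minimal sketch of that kind of query cache in Python (the index-search function is a stand-in, not Google's actual pipeline):

    from functools import lru_cache

    def full_index_search(query):
        # Stand-in for the expensive walk over the real index.
        return [f"result for {query!r}"]

    @lru_cache(maxsize=100_000)
    def cached_search(query):
        # Popular queries ("Madonna") come straight from the cache;
        # only novel queries pay for full_index_search.
        return tuple(full_index_search(query))

    cached_search("Madonna")   # miss: hits the index
    cached_search("Madonna")   # hit: served from the cache
    print(cached_search.cache_info())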


Major parts of that aren't true. I think something like 30% of queries have never been seen before, and Google provides low latency on those as well.


Cut him some slack, it was clearly a simplification. Also, you're wrong to declare him inaccurate.

Even if 30% of queries are truly unique (a number people have quoted for years; it may no longer be accurate), caching the other 70% is a big win. Also, instant search relies heavily on Google Suggest, which by its very nature suggests queries that have already been performed, so they are trivially cacheable.


I've always used lengthy, quoted search phrases, but what's really interesting is that Google's autocomplete has slowly trained monkey-me to follow the well-worn path and use the search terms it suggests.

It's a great example of using the UI to encourage a desired behavior.


So, let's just turn off caching and triple or quadruple the size of the Google cluster, eh? No big deal. I completely fail to see how you think that disproves my point.


I should probably mention that I work in search quality at Google. The part I took issue with is "Google can take a couple of seconds if you throw a truly novel query at it." which is definitely not true.

Google can't cache nearly as much as you would expect, not because the queries change, but because the results change.


It gets fantastically huge when you look at quoted searches. Sure, you can build an index from keywords to the documents containing them, but how do you build an index for literal text searches like "to the bridge, he said"?
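
One standard trick (not necessarily what Google does) is a positional index, where each posting also records where in the document the word occurs, so phrases can be checked without storing every possible phrase. A rough Python sketch:

    # Positional index: term -> {doc_id: [positions]}.
    from collections import defaultdict

    docs = {1: "he walked to the bridge he said", 2: "to the store he said"}

    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)

    def phrase_search(phrase):
        terms = phrase.split()
        hits = []
        for doc_id, positions in index.get(terms[0], {}).items():
            for start in positions:
                # A match means every following term sits at the next
                # consecutive position in the same document.
                if all(start + i in index.get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.append(doc_id)
                    break
        return hits

    print(phrase_search("to the bridge he said"))   # [1]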


Yes. It's huge.


Some typical search engine strategies are on this page (and specifically in this subsection): https://secure.wikimedia.org/wikipedia/en/wiki/Index_%28sear...


It's a combination of embarrassing parallelism and clever indexing.


The internet that Google doesn't search is almost always not the internet you want to find. :)



