How Google Code Search Worked: Regex Matching with a Trigram Index (2012) (swtch.com)
434 points by kbumsik on Dec 2, 2018 | 70 comments



This reminds me of a browser-side search I once wrote using a combination of regex, Soundex, and Damerau-Levenshtein. I made it browser-side because the app didn't have enough records to need a server-side search; the records numbered in the hundreds to thousands, not millions.

https://github.com/SCPR/fire-tracker/blob/master/app/lib/sea...

Not the best code I've ever written by far, but it was a neat little experiment.

Basically, indexing each record with Soundex (phonetics) makes the search semi-tolerant of misspellings and also brings up similar-sounding results, while Damerau-Levenshtein is used to sort the results by closeness. I also had it index timestamps with
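
The original is browser-side JavaScript, but the idea fits in a few lines of most languages. Here's a rough Go sketch of the combination, assuming a simplified Soundex and the optimal-string-alignment flavor of Damerau-Levenshtein; the sample records and all names are made up:

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // soundexCode is a simplified Soundex: the first letter plus up to
    // three digits for the consonant sounds that follow (ASCII only).
    func soundexCode(s string) string {
        codes := map[rune]byte{
            'b': '1', 'f': '1', 'p': '1', 'v': '1',
            'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
            'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6',
        }
        s = strings.ToLower(s)
        if s == "" {
            return ""
        }
        out := []byte{s[0] - 'a' + 'A'}
        var prev byte
        for _, r := range s[1:] {
            c, ok := codes[r]
            if !ok {
                prev = 0 // vowels, spaces, etc. reset the run
                continue
            }
            if c != prev {
                out = append(out, c)
                prev = c
                if len(out) == 4 {
                    break
                }
            }
        }
        for len(out) < 4 {
            out = append(out, '0')
        }
        return string(out)
    }

    // osaDistance is the optimal-string-alignment variant of
    // Damerau-Levenshtein: insert, delete, substitute, transpose.
    // (Uses the built-in min, so Go 1.21+.)
    func osaDistance(a, b string) int {
        ra, rb := []rune(a), []rune(b)
        d := make([][]int, len(ra)+1)
        for i := range d {
            d[i] = make([]int, len(rb)+1)
            d[i][0] = i
        }
        for j := 0; j <= len(rb); j++ {
            d[0][j] = j
        }
        for i := 1; i <= len(ra); i++ {
            for j := 1; j <= len(rb); j++ {
                cost := 1
                if ra[i-1] == rb[j-1] {
                    cost = 0
                }
                d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
                if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
                    d[i][j] = min(d[i][j], d[i-2][j-2]+1) // transposition
                }
            }
        }
        return d[len(ra)][len(rb)]
    }

    // search keeps records whose Soundex code matches the query's and
    // sorts them by edit distance, closest first.
    func search(query string, records []string) []string {
        q := soundexCode(query)
        var hits []string
        for _, r := range records {
            if soundexCode(r) == q {
                hits = append(hits, r)
            }
        }
        sort.Slice(hits, func(i, j int) bool {
            return osaDistance(query, hits[i]) < osaDistance(query, hits[j])
        })
        return hits
    }

    func main() {
        records := []string{"Camp Fire", "Carr Fire", "Ferguson Fire"}
        fmt.Println(search("Cmap Fire", records)) // misspelling still finds "Camp Fire"
    }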

Here it is in action:

https://firetracker.scpr.org/archive

As you can see, you can misspell search terms and it will usually get it right. Granted, a lot of unrelated records usually come up in a search, but the important thing is that the user gets what they're looking for within the first few results, usually the first.

Regex is a powerful thing.


Very cool! Most search problems are solved this way: thoughtful index data modeling for the use cases at hand, not magic machine-learning neural-network unicorns. Many want to sell you the latter. The thing that gets the job done is usually the former.

One big reason for this is the need for really good training data to do supervised learning on search problems. If you're Google or Wikipedia, you can get this data. But many organizations struggle to tell what a good result for a query looks like from clicks and other user data, simply due to low traffic volumes.


As an aside: most of the Wikimedia Foundation's analytics data is available for anyone to download (https://wikitech.wikimedia.org/wiki/Analytics#Datasets). There is some link-rot to wade through, but try asking on #wikimedia-analytics on freenode if you can't find something.


Then you're using magic machine learning wrong. I'm part of a small team who uses it exactly for the purpose of giving intelligent search to non-Googles and non-Wikipedias, and we have a very good track record.

The key is to use machine learning to generalize user behavior across multiple items in roughly the same category, rather than individual items. In fact, with tiny amounts of data you should pretty much never deal in individual items -- always in groups of related items.


Undoubtedly I’m doing a lot wrong :)

Yup that’s one method I teach people in Learning to Rank training. In addition to reducing dimensionality on queries, you can also do so by geo and user depending on the domain. But do you want to do this much munging of training data?

Even so, the basis for how you group similar information needs isn't always obvious, and most teams prefer a simpler, non-ML solution they can understand and manage. A lot depends on the team that's ultimately going to maintain the solution.


When you say you work in groups, are the groups discovered with a clustering technique, and then you apply the learning-to-rank algorithms?


In most learning to rank data sets, the only notion of a "query" is a numerical identifier that groups a set of features and a grade. Your goal is to optimize the ordering in that group. That group can really be a user, a query pattern, a location, or a bazillion other groupings likely to share similar rankings.

For example, a common training data storage format looks like

grade(0-4) queryId list of feature values...

Such as:

4 qid:1 1:0.5 2:0.24 # a doc with these features is 'good' for this query

0 qid:1 1:-0.5 2:0.24 # a bad doc for query id 1 is described by these features

Features in an LTR context are some relationship between the query and doc (such as scores computed from Solr/Elasticsearch queries)
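
As a concrete illustration, here's a tiny Go sketch that emits rows in that style of format; the struct and field names are mine, not any particular library's API:

    package main

    import (
        "fmt"
        "strings"
    )

    // Judgment is one training row: a relevance grade for one document,
    // grouped under a query id and described by numbered feature values.
    type Judgment struct {
        Grade    int       // 0 (bad) .. 4 (perfect)
        QueryID  int       // whatever grouping you optimize within: query, user, location...
        Features []float64 // feature 1..N, e.g. scores from Solr/Elasticsearch queries
        Comment  string    // optional human-readable note
    }

    // Line renders the row in the common "grade qid:N 1:v1 2:v2 ... # comment" form.
    func (j Judgment) Line() string {
        var b strings.Builder
        fmt.Fprintf(&b, "%d qid:%d", j.Grade, j.QueryID)
        for i, v := range j.Features {
            fmt.Fprintf(&b, " %d:%g", i+1, v)
        }
        if j.Comment != "" {
            fmt.Fprintf(&b, " # %s", j.Comment)
        }
        return b.String()
    }

    func main() {
        rows := []Judgment{
            {Grade: 4, QueryID: 1, Features: []float64{0.5, 0.24}, Comment: "a good doc for query 1"},
            {Grade: 0, QueryID: 1, Features: []float64{-0.5, 0.24}, Comment: "a bad doc for query 1"},
        }
        for _, r := range rows {
            fmt.Println(r.Line()) // e.g. "4 qid:1 1:0.5 2:0.24 # a good doc for query 1"
        }
    }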

Of course turning clicks/session info into grades isn't easy. One way to go about it is with click models (https://pdfs.semanticscholar.org/0b19/b37da5e438e6355418c726...). Another way is to equate conversions in a session with one of the grades (a purchase is a 4).

Needless to say, such usage data is sparse for long-tail queries (ones that rarely happen, like 'fuchsia shoes'). Luckily it's not sparse for head queries (common ones, like 'shoes'). So the head queries can provide some benefit, helping to generalize to the rare long-tail queries.

You'll see a lot of patterns that are modifiers on head queries. Like 'shoes' is a head query. But 'blue shoes' and 'purple shoes' and other 'color shoes' are tail queries.

You can do a lot with that insight:

- Supervised grouping of queries into one 'query id' by watching how users refine queries. For example, users type "shoes" and then, in the same session, add a color adjective ("red shoes"); use that to group '<color> shoes' queries

- Unsupervised clustering of similar sessions or searches where users seem to expect ranking to perform similarly based on user behavior

The downside/tradeoff is you lose a lot of nuance that you're getting with per-query training data. But it's a technique people use.

It also substitutes one ML problem for another, which may or may not be harder than the original one


It's more of a hierarchy than groups, actually. If a user indicates that a pair of maroon walking shoes were relevant to their query for "red sneakers", then we have learned that "red sneakers" is associated, in order of decreasing strength, with e.g.

- maroon walking shoes
- walking shoes
- casual shoes
- footwear
- clothes

And obviously the same thing can be applied to generalize over the query. These hierarchies are constructed statically and dynamically with unsupervised learning, and then associations from query to groups happens dynamically.


Very cool, how exactly do you generate the hierarchies? Is there an existing site taxonomy or categorization you’re using? Or associating query strings with the docs clicked and using refinements to see the hierarchy? Or maybe LtR training data per site category?


They are constructed from a probabilistic similarity measure defined in terms of the metadata available for items, where their path tends to be weighed fairly heavily. Does that answer make sense?


Cool stuff, appreciate the answer; that makes more sense. BTW, feel free to join us on the relevance Slack community; a lot of us are curious what you're doing.

http://o19s.com/slack


I'd also be curious who your team is exactly and what their results are. I'd be very interested in learning more.


I'm gonna be that guy and ask you not to refer to OCD in a flippant/casual sense please


Ah thanks for catching that. Changed to “obsessed”


> Most search problems are solved this way: thoughtful index data modeling for the use cases at hand, not magic machine-learning neural-network unicorns. Many want to sell you the latter. The thing that gets the job done is usually the former.

Occam's razor.


Shameless plug:

This is what I based Debian Code Search on: https://codesearch.debian.net/

See also https://codesearch.debian.net/research/bsc-thesis.pdf if you’re interested in details.


Is your code open source?



A bit of a funny full circle, isn't it?


Are there any good writeups on how to implement the trigram index side of this? For instance, the code alludes to the idea that you could store each trigram as a bitmap over documents, so the Boolean combinations become bitwise operations. I suppose you could keep the bitmaps in compressed (or even probabilistic sketch) form, then decompress only those required for a query, but is this the right intuition? Are there lower-level libraries or databases that do this kind of thing well, rather than rolling your own on a KV store or using something monstrous like Elasticsearch?
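
That is roughly the right intuition. Here's a minimal Go sketch of the bitmap idea, with plain uint64 words standing in for whatever compressed or roaring-style representation a real system would use (all names are made up):

    package main

    import "fmt"

    // bitset holds one bit per document id.
    type bitset []uint64

    func newBitset(nDocs int) bitset { return make(bitset, (nDocs+63)/64) }

    func (b bitset) set(doc int) { b[doc/64] |= 1 << (doc % 64) }

    func (b bitset) and(o bitset) bitset {
        out := make(bitset, len(b))
        for i := range b {
            out[i] = b[i] & o[i]
        }
        return out
    }

    // index maps each trigram to the set of documents containing it.
    type index struct {
        nDocs int
        post  map[string]bitset
    }

    func newIndex(docs []string) *index {
        ix := &index{nDocs: len(docs), post: map[string]bitset{}}
        for id, d := range docs {
            for i := 0; i+3 <= len(d); i++ {
                t := d[i : i+3]
                if ix.post[t] == nil {
                    ix.post[t] = newBitset(len(docs))
                }
                ix.post[t].set(id)
            }
        }
        return ix
    }

    // candidates returns the documents containing every trigram of a literal
    // query; only these need to be scanned by the real regexp afterwards.
    func (ix *index) candidates(literal string) bitset {
        result := newBitset(ix.nDocs)
        for i := range result {
            result[i] = ^uint64(0) // start with "all documents"
        }
        for i := 0; i+3 <= len(literal); i++ {
            b, ok := ix.post[literal[i:i+3]]
            if !ok {
                return newBitset(ix.nDocs) // a missing trigram means no matches at all
            }
            result = result.and(b)
        }
        return result
    }

    func main() {
        ix := newIndex([]string{"func main()", "function main", "package main"})
        // All three documents contain "mai" AND "ain", so all three bits are set.
        fmt.Printf("%03b\n", ix.candidates("main")[0]) // prints 111
    }

Compressed bitmaps (e.g. roaring) or sorted posting lists with skip pointers are the usual production versions of the same idea, and the Boolean query evaluation really does reduce to bitwise AND/OR over them.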


There's https://github.com/google/zoekt which provides a pretty straightforward and memory-efficient implementation. Their design document goes a bit into what makes it fast:

https://github.com/google/zoekt/blob/master/doc/design.md


Note that Elasticsearch is built on top of Lucene (lucene.apache.org), which is a straightforward search indexer that creates an inverted index and document store, and gives pluggable low-level access to parts of the indexing path. It would be very easy to create a custom analyzer for a trigram-based index (my guess is one already exists).


The project was released by the original author here: https://github.com/google/codesearch

There is no information on the subject besides the 3 blog posts and the proof of concept.


https://wiki.postgresql.org/images/6/6c/Index_support_for_re... is, I think, the slides from a conference talk on implementing this within Postgres.


Postgres FTS with trigrams (the pg_trgm extension) works really well for many search applications. Most of the effort would be tuning the FTS dictionary.


My issue with it is that a short term, such as a single word, doesn't even match a record, which is weird; I'm using it with Rails. I'm thinking of using Elasticsearch to help with that.


How fitting. I've been working on making Etsy's Hound [0] available as SaaS. Hound uses a trigram index with regexps and can search all the repos I care about in milliseconds.

The demo page works. The signup just sends me a confirmation email and the setup is still manual, so expect an hours-to-a-day delay. I'm aiming to have this all automated by tomorrow. I'd appreciate any and all feedback.

[0] https://github.com/etsy/hound
[1] https://gitgrep.com


Have you made sure that Git people are OK with the name GitGrep? See https://git-scm.com/about/trademark for details.


Another shameless plug: Sourcegraph uses this for indexed search via Zoekt (it also has index-less live search). Check out the cmd/indexed-search code at https://github.com/sourcegraph/sourcegraph.


The search features discussed in the article are now available through Google Cloud Source Repositories: https://cloud.google.com/source-repositories/docs/searching-...


I wish GitHub had decent search functionality.


I can't believe they don't adopt something like https://github.com/etsy/hound. It's all right there, it's super easy to set up, and it's a game changer: lightning-fast search that you can link to. Check it out for your org; I was at Etsy when Kelly wrote it, and I've taken it with me to other jobs.


I thought the very same thing, which is why I SaaSified etsy/hound as Gitgrep.com [0].

[0] https://gitgrep.com


Highly recommend Sourcegraph. Fast, free, and understands code by using language servers. Install extension to replace github.com search.

https://sourcegraph.com/start


Now that Microsoft has acquired GitHub, no doubt they will be upgrading GitHub search with SOTA algos.


You mean searching the past history of all the code?


Mostly searching for regexes in code and in filenames. But yes, history would sometimes be useful too.

And this is really the bare minimum. An even better search would e.g. allow searching for identifiers (comments and strings disregarded).


I believe GitHub uses ES, and currently allowing users to run regex queries can bog down the entire cluster if the regex is malformed. This is a problem with Solr too. I believe there was some effort to resolve this at the Lucene level, but I'm not sure of the status.

In other words, I agree with you but I also know it is an extremely hard problem to solve at Github's scale.


If they use ES (or other Lucene) they could already be doing fuzzy search via ES's own ngram support. At that point indexed regex search as described in the article isn't far away. You just need to bridge the gap between a regex and a trigram index.
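
The easy part of that bridge is the case where the regex contains a required literal: all of its trigrams must appear in any matching document. Below is a deliberately crude Go sketch that only looks at the regexp's literal prefix; the article's full analysis also handles suffixes, alternations, and so on, and requiredTrigrams is a made-up helper:

    package main

    import (
        "fmt"
        "regexp"
    )

    // requiredTrigrams returns trigrams that every match of the pattern must
    // contain, derived only from the regexp's literal prefix. This is a crude
    // stand-in for the full trigram-query analysis described in the article.
    func requiredTrigrams(pattern string) ([]string, error) {
        re, err := regexp.Compile(pattern)
        if err != nil {
            return nil, err
        }
        prefix, _ := re.LiteralPrefix()
        var grams []string
        for i := 0; i+3 <= len(prefix); i++ {
            grams = append(grams, prefix[i:i+3])
        }
        return grams, nil
    }

    func main() {
        grams, err := requiredTrigrams(`Google.*Search`)
        if err != nil {
            panic(err)
        }
        // Prints [Goo oog ogl gle]: the index only needs to return documents
        // whose posting lists contain all of these; the real regexp then runs
        // over just that candidate set.
        fmt.Println(grams)
    }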


Agreed, but it's not obvious to me how you would bridge that gap without a lot of custom code.


Yeah, that would be wonderful indeed! The current search functionality of GitHub is nearly useless, which is a shame.


I am hacking on something that allows searching based on identifiers. Feel free to play around at https://www.codegrep.com; it's currently indexing a few thousand popular repositories.


You joke, but building and retaining an extra Xapian database would be something that lots of people would be willing to pay for. I'm surprised that this isn't something they sell.

Git is easy to host, especially now with several good web frontends.


I was not joking! I would love to do searches with regular expressions over commits. For example, to look for all projects that have deleted code with the GPL license, you could simply do a search on '^-This program is free software: you can redistribute it and/or modify'


Also, searching branches other than master.


If you want to build something like this yourself, Postgres has a Trigram index: https://www.postgresql.org/docs/current/pgtrgm.html
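
As a rough illustration of what that looks like from application code (Go here, with the lib/pq driver; the connection string, table, and column names are placeholders):

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // Postgres driver; assumes a reachable database
    )

    func main() {
        // Placeholder connection string.
        db, err := sql.Open("postgres", "postgres://localhost/codesearch?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // One-time setup: the pg_trgm extension plus a GIN trigram index.
        // With the index in place, Postgres can use trigrams to narrow down
        // LIKE/ILIKE and POSIX regex (~, ~*) predicates on the column.
        setup := []string{
            `CREATE EXTENSION IF NOT EXISTS pg_trgm`,
            `CREATE TABLE IF NOT EXISTS files (path text PRIMARY KEY, body text)`,
            `CREATE INDEX IF NOT EXISTS files_body_trgm ON files USING gin (body gin_trgm_ops)`,
        }
        for _, stmt := range setup {
            if _, err := db.Exec(stmt); err != nil {
                log.Fatal(err)
            }
        }

        // A regex search the trigram index can help accelerate.
        rows, err := db.Query(`SELECT path FROM files WHERE body ~ $1`, `func +main\(`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()
        for rows.Next() {
            var path string
            if err := rows.Scan(&path); err != nil {
                log.Fatal(err)
            }
            fmt.Println(path)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }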


That's great. It seems like it only supports character-level trigrams though. Do you know of any tools that can create word-level trigrams from Postgres?


FWIW kythe.io is the modern successor that does this internally at Google. I haven't worked on it, but I have written some code that's a client. Unfortunately I think the indexing pipelines aren't publicly available.


You can still use Google Code Search on the Chrome codebase.

https://cs.chromium.org

For example here's a hacky regex that finds lambdas:

https://cs.chromium.org/search/?q=%5CW%5C%5B%5B%5E%5C%5D%5D*...


You can store a local search index of your code using https://github.com/google/codesearch, which uses this algorithm in the background.


Trigrams here are three characters, not three tokens.

This is not the obvious search strategy for code, since it captures no semantic structure, though apparently it suits regex-searchable indexes.


Trigram is defined in the dictionary as "a group of three consecutive written units such as letters, syllables, or words." Thus using the term for a triplet of tokens seems appropriate.


See kythe.io for the semantic bits.


For a reasonable code base, say the linux kernel source code, how large is the trigram index? Is it necessary that the index be kept in memory?


77MB for the Linux 3.1.3 kernel sources according to the "Implementation" section of the article.


[flagged]


We detached this subthread from https://news.ycombinator.com/item?id=18582578 and marked it off-topic.


Deleted the link

I did write a book on the topic, developed the Elasticsearch learning to rank plugin, and have helped several Alexa top 50 sites deliver more accurate search

I'm only pointing out that search is an oversold industry, but when you read about what Google actually does, it's more normal stuff that we would all probably do to solve a similar problem.

(For those curious: it was a link to content from a community conference I help run, pointing out how hard ML can be for search. I admit it was perhaps a bit tangential.)


Could you please re-share the link? I saw the original text of your post, thought it was an interesting topic, and I'd like to find time to watch the talks.


Sure, here it is. Sadly there are no videos, just slides: https://haystackconf.com/europe2018/


Be an intern with Jeff Dean as a mentor...


This was posted here at least 10x before.

The last time, someone posted a link to a hopeful successor on GitHub: https://news.ycombinator.com/item?id=18022357


Modern servers have a lot of compute and memory compared to 15 years ago. Before doing any optimizations, I would start with a naive full-text search, which will most likely be fast enough. A full code search using regex is also more powerful for users. Imagine if you could use regex when searching on GitHub or Google! Source code takes relatively little space, and if you strip the markup/JS out of web pages, they also take relatively little space. The only problem is educating users on how to do effective searches.


I think you might be underestimating the size of the document corpus that you'd be running over.


Let's say there are 5 billion URLs with an average of 10 KiB of data each (if you take out JS/CSS/images etc.) and one server has 50 GB of RAM; you would need 1,000 servers, which is very small considering Google probably has a million servers deployed. I just tried a text search for your comment on Google and it found your post! So Google is already doing full-text search, and does it in less than one second (0.69 to be precise). There are probably many reasons why they don't allow regex, probably because it would make it very easy to scrape resources such as e-mail addresses, credit card numbers, etc. It would, however, be cool if Google let you search structured data, for example finding 100 recipes that contain eggs :P Silly example, but the possibilities are endless!


But that's exactly my point -- when you get to the stage where you have 1,000 servers with 50G of RAM each, you have gotten to the point where an optimization like an inverted index is completely sensible. The design you propose has to do a full regex scan over 50 TB of RAM for every. single. user. query. For Pete's sake! This is definitely the realm where the computational costs make it worthwhile to spend engineering resources to optimize, especially if you are going to serve lots of users.


Maybe you should clarify what you meant by "naive full text search" because that usually means scanning all documents character-by-character to match the characters in the query, which is definitely not what Google does.


Even with modern computers, doing a naive regex search over 1 GiB+ of full text would take a while. Memory isn't that fast. The trigrams help you avoid reading every document on every search.

Even a traditional RDBMS will start to become slow with datasets in the 10-100 GiB range.

Note that products like BigQuery, Snowflake, Redshift, etc. might be able to handle it, but that is neither cheap nor particularly fast.


I think you can expect 1 TB/s per machine. With one thousand servers you could search the entire web in 0.05 seconds. Not all data sets are that big, though. You should at least try the most naive method before applying any optimizations. Dividing everything up into words is a smart machine optimization, but it gets silly... Imagine, for example, cutting all the words out of a bunch of paper articles and writing the name of the article on the back of each piece. Then sort all the words and put them in different jars. Now when you want to find an article, you pick a piece from a word jar and just look at the back! But now comes the hard problem: what if the jar has thousands of small pieces in it?


The fastest main-memory bandwidth supported on modern servers is a dual-socket AMD Epyc [0], with a theoretical maximum of 320 GB/s. One TB/s is nonsense.

That said, I approve of naive brute-force and one should always benchmark their systems against the brute force solution.

[0] https://en.wikichip.org/wiki/amd/epyc/7551


You also have to take the trade-offs into account! In my paper-pieces-in-a-jar example, let's say we want to find the string "var foo = bar": you take one piece from the "var" jar, but almost all files have "var" in them... and there are also many files that have "foo" and "bar" in them! How do you sort the answers from there?! With a simple text/regex search you wouldn't have that problem!



