Obvious questions:
1. How do you extend Boyer-Moore to arbitrary regexps?
2. Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?
1. You find the fixed part of the regexp, use Boyer-Moore to quickly filter the haystack, then do the deeper NFA matching only where the fixed bits are present. One example from later in that email thread: ^foo is the same as a fixed-string search for \nfoo.
I presume there are ways you can bound your fixed searches, and have your NFA work "backwards" through the string after a Boyer-Moore hit (or both backwards and forwards from fixed points, depending on where in the string the fixed bit is).
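The prefilter-then-confirm idea can be sketched in Python. Here `str.find` (via `in`) stands in for a Boyer-Moore fixed-string scan, and `grep_lines`/`required_literal` are illustrative names, not grep internals:

```python
import re

def grep_lines(lines, regex, required_literal):
    # Sketch of the two-phase approach: a cheap fixed-string filter
    # first, and the full regex engine only on surviving lines.
    pattern = re.compile(regex)
    for line in lines:
        if required_literal in line:   # fast fixed-string prefilter
            if pattern.search(line):   # full NFA match only now
                yield line
```

Most lines in a big log file fail the cheap substring test, so the expensive regex engine rarely runs.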
2. No, not for some people's small-file usage. But grep is used in many, many cases on huge log files (and mail files/queues and so on), so there is a valid case for optimizing those uses. Production sysadmins deal with multiple GBs of logfiles regularly, and grep them a lot.
> Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?
You're not skipping bytes of the search string, you're skipping bytes of the file being searched.
Yes, but the point is that on modern CPUs checking those bytes is likely far from the bottleneck, and thus skipping them saves you very little time, if any at all.
You might be surprised. Consider: counting to 2 million by 5s is significantly faster than counting by 1s. Especially if you also kick off some memory prefetching, I would expect that this can speed things up considerably.
Edit: I should say I would still think things should be benchmarked to really know...
>How do you extend Boyer-Moore to arbitrary regexps?
As mentioned in the follow-up messages, some regexps can be translated to one or at least a few fixed-string searches (/^[Gg]rep/ is really a search for "\nGrep" or "\ngrep"). If you can't do that, your regexp may still contain longish fixed strings that allow you to discard most lines via Boyer-Moore before you have to run the regexp engine.
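The /^[Gg]rep/ expansion can be sketched directly: over a buffer of newline-separated lines, a line-anchored match reduces to fixed-string searches for "\nGrep" and "\ngrep" (plus a check at offset 0). This is a minimal illustration, not grep's implementation; `match_anchored_grep` is a made-up name:

```python
def match_anchored_grep(buffer):
    # /^[Gg]rep/ over a newline-separated buffer: a matching line
    # either starts the buffer or is preceded by a newline.
    hits = []
    for needle in ("Grep", "grep"):
        if buffer.startswith(needle):
            hits.append(0)
        start = 0
        while True:
            i = buffer.find("\n" + needle, start)
            if i == -1:
                break
            hits.append(i + 1)  # the line starts after the newline
            start = i + 1
    return sorted(hits)
```

Each `find` call is the kind of fixed-string scan that Boyer-Moore accelerates; no regex engine is needed at all for this pattern.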
Which is curious. Some content you run into by accident and think needs reposting. But how does one stumble upon this post in 2015? Googling for "why is gnu grep fast" and then not noticing that two of the top four results are submissions to HN?
Finding it linked from some other article or blog post, or having it sent to you from someone else and then assuming that a mail thread from 2010 has never been seen on HN before.
Yeah, this definitely falls into the absurd category of interviewing whiteboard questions. Ridiculously absurd. I have a book with this algorithm detailed in it, and I think I would still have trouble implementing it well.
"Let's implement BM" (with some hints/guidance) doesn't seem absurd. It's not much code; handwritten can fit on a piece of paper or two. "Why is GNU grep fast?" would definitely be an absurd question though.
I could understand possibly going through sections. Though, really, that form of algorithm coding is so different from what I typically do day to day, that it just seems irrelevant.
I would probably find it fun, provided the hints/guidance were good. Not sure what to expect of it, though. As the sibling post asks, what would be the takeaway?
The problem really comes down to expectations on your point. For many of us, this would just be knowing that there are good algorithms on doing this.
If you don't already know of this algorithm, finding it is something that took a surprisingly long time to do. Using a field of algorithm design (pushdown automata) that just isn't that widely utilized nowdays. (Well, I suppose your domain may vary in this claim.)
Okay, let's suppose a software engineer cannot implement the BM off the top of their head. What would you make of that? How relevant is this kind of testing to the actual work of a software engineer?
Like I said, with hints and guidance. Write a function to find a substring. Ok. What's its runtime? Can we make it faster? What if we skip ahead a bit after a failed match? How much? What's the fastest way to calculate that?
The book I've got is Knuth's Selected Papers on Design of Algorithms.[1]
I will confess that I do not have the book with me, so it may actually be a different algorithm in that text. I can not remember. :( Going off of here[2], the algorithm I am referring to is chapter 9, "Fast pattern matching in strings."
Text processing is actually a pretty neat area, with a lot of practical uses.
A while back I found an out of print book, "Text Algorithms," available for download from the author in PDF and some other formats. It's not cutting edge at this point, but still covers all the basics (like Boyer-Moore) really well. http://igm.univ-mlv.fr/~mac/REC/B1.html
I've been finding the following very useful for understanding string matching algorithms. It has a bunch of different algorithms for exact string matching, details on them, drawings, descriptions, speed comparisons, and code in C
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
https://github.com/ggreer/the_silver_searcher
In addition to all the tricks applied by GNU grep (such as Boyer-Moore), it also splits work among threads to take advantage of multi-core machines. Integrates with your favorite text editor, too!