Why GNU grep is fast (2010) (freebsd.org)
151 points by bpierre on March 5, 2015 | 40 comments



If you want the very fastest "find strings in files, especially code files", I've become a big fan of ag, "the silver searcher":

https://github.com/ggreer/the_silver_searcher

In addition to all the tricks applied by GNU grep (such as Boyer-Moore), it also splits work among threads to take advantage of multi-core machines. Integrates with your favorite text editor, too!
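
For a feel of the multi-core angle, here is a minimal sketch of the split-work-among-threads idea, assuming POSIX threads and a hardcoded file list (ag's real directory walker and matcher are far more sophisticated; the file names and pattern here are hypothetical):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    /* hypothetical inputs -- a real tool would walk the directory tree */
    static const char *files[] = { "a.log", "b.log", "c.log", "d.log" };
    static const size_t nfiles = sizeof files / sizeof files[0];
    static const char *needle = "error";
    static atomic_size_t next_file;            /* shared work counter */

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            /* each thread atomically claims the next unprocessed file */
            size_t i = atomic_fetch_add(&next_file, 1);
            if (i >= nfiles)
                return NULL;
            FILE *f = fopen(files[i], "r");
            if (!f)
                continue;
            char line[4096];
            while (fgets(line, sizeof line, f))
                if (strstr(line, needle))
                    printf("%s: %s", files[i], line);
            fclose(f);
        }
    }

    int main(void) {
        pthread_t tids[4];
        for (int t = 0; t < 4; t++)
            pthread_create(&tids[t], NULL, worker, NULL);
        for (int t = 0; t < 4; t++)
            pthread_join(tids[t], NULL);
        return 0;
    }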


Obvious questions: 1. How do you extend Boyer-Moore to arbitrary regexps? 2. Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?


1. You find the fixed part of the regexp, use Boyer-Moore to quickly filter the haystack, then do deeper NFA matching only where the fixed bits are present. One example from later in that email thread: ^foo is the same as a fixed-string search for \nfoo. (There's a rough sketch of the filter-then-confirm idea at the end of this comment.)

I presume there are ways you can bound your fixed searches and have your NFA work "backwards" through the string after a Boyer-Moore hit (or both backwards and forwards from fixed points, depending on where in the string the fixed bit is).

2. Not for small-file usage, no. But grep is used in many, many cases on huge log files (and mail files/queues and so on), so there is a valid case for optimizing those uses. Production sysadmins deal with multiple GBs of logfiles regularly, and grep them a lot.
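
For point 1, a minimal sketch of the filter-then-confirm idea, assuming POSIX <regex.h>; the pattern and its fixed part are illustrative, not how GNU grep actually structures this:

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* hypothetical pattern: its required fixed part is "foo" */
        const char *literal = "foo";
        regex_t re;
        if (regcomp(&re, "^foo[0-9]+", REG_EXTENDED | REG_NOSUB) != 0)
            return 1;

        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            if (strstr(line, literal) == NULL)
                continue;                      /* cheap reject: literal absent */
            if (regexec(&re, line, 0, NULL, 0) == 0)
                fputs(line, stdout);           /* confirmed by the regex engine */
        }
        regfree(&re);
        return 0;
    }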


> Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?

You're not skipping bytes of the search string, you're skipping bytes of the file being searched.


Of course, but you can never skip more bytes than the search string is long.
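
(A rough Boyer-Moore-Horspool sketch makes that concrete: the bad-character table defaults to the pattern length m, so m is the largest possible jump. Names here are illustrative, not GNU grep's actual code.)

    #include <stdio.h>
    #include <string.h>

    const char *bmh_search(const char *hay, size_t n,
                           const char *needle, size_t m) {
        if (m == 0 || m > n)
            return NULL;

        /* bad-character table: how far to shift when the haystack byte
           under the pattern's last position is c */
        size_t skip[256];
        for (size_t c = 0; c < 256; c++)
            skip[c] = m;                       /* default: skip the whole pattern */
        for (size_t i = 0; i + 1 < m; i++)
            skip[(unsigned char)needle[i]] = m - 1 - i;

        size_t pos = 0;
        while (pos + m <= n) {
            if (memcmp(hay + pos, needle, m) == 0)
                return hay + pos;
            /* shift by the table entry for the byte aligned with the
               pattern's last character -- up to m bytes at once */
            pos += skip[(unsigned char)hay[pos + m - 1]];
        }
        return NULL;
    }

    int main(void) {
        const char hay[] = "look for a needle in this haystack";
        const char *hit = bmh_search(hay, sizeof hay - 1, "needle", 6);
        puts(hit ? hit : "not found");
        return 0;
    }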


But you get to skip those few bytes (potentially) very many times.


Yes, but the point is that on modern CPUs checking those bytes is likely far from the bottleneck, and thus skipping them saves you very little time, if any at all.


You might be surprised. Consider: counting to 2 million by 5 is significantly faster than counting by 1. Especially if you also kick off some prefetching of the memory, I would expect this to speed things up considerably.

Edit: I should say that I still think things should be benchmarked to really know...
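
A crude microbenchmark of that stride claim, assuming a large in-memory buffer; real results depend heavily on cache behavior, prefetching, and compiler optimizations:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t n = (size_t)1 << 28;            /* 256 MiB buffer */
        unsigned char *buf = malloc(n);
        if (!buf)
            return 1;
        for (size_t i = 0; i < n; i++)         /* touch every page up front */
            buf[i] = (unsigned char)i;

        volatile unsigned long sum = 0;        /* keep the loops alive */

        clock_t t0 = clock();
        for (size_t i = 0; i < n; i += 1)
            sum += buf[i];
        clock_t t1 = clock();
        for (size_t i = 0; i < n; i += 5)
            sum += buf[i];
        clock_t t2 = clock();

        printf("stride 1: %.3fs  stride 5: %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(buf);
        return 0;
    }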


(as mentioned by yan below) Ridiculous Fish did these sorts of benchmarks here: http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...


> How do you extend Boyer-Moore to arbitrary regexps?

As mentioned in the follow-up messages, some regexps can be translated to one or at least a few fixed-string searches (/^[Gg]rep/ is really a search for "\nGrep" or "\ngrep"). If you can't do that, your regexp may still contain longish fixed strings that allow you to discard most lines via Boyer-Moore before you have to run the regexp engine.

For the more general case, http://swtch.com/~rsc/regexp/ provides great links.
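
A sketch of that ^-anchor trick, assuming the GNU memmem extension: instead of splitting the buffer into lines, search the raw buffer for "\ngrep" and "\nGrep" directly, with a special case for a match at the very start of the buffer:

    #define _GNU_SOURCE                        /* for memmem */
    #include <stdio.h>
    #include <string.h>

    static int buffer_has_anchored_match(const char *buf, size_t n) {
        /* the start of the buffer also counts as the start of a line */
        if (n >= 4 && (memcmp(buf, "grep", 4) == 0 ||
                       memcmp(buf, "Grep", 4) == 0))
            return 1;
        return memmem(buf, n, "\ngrep", 5) != NULL ||
               memmem(buf, n, "\nGrep", 5) != NULL;
    }

    int main(void) {
        const char buf[] = "no match on line one\ngrep shows up on line two\n";
        printf("%d\n", buffer_has_anchored_match(buf, sizeof buf - 1));
        return 0;
    }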





Amusingly, the top comment of which lists a few of the other times that it has shown up.


This has been posted in the past many times.


Which is curious. Some content you run into by accident and think needs reposting. But how does one stumble upon this post in 2015? Googling for "why is gnu grep fast" and then not noticing that two of the top four results are submissions to HN?


Finding it linked from some other article or blog post, or having it sent to you by someone else, and then assuming that a mail thread from 2010 has never been seen on HN before.


The source this time is probably Twitter: https://twitter.com/b0rk/status/573535058797031424


Wow... I had this exact question when I interviewed at a big tech company; then I had to implement it in pseudocode.


You were supposed to know why a particular implementation of a tool was fast beforehand? Or they gave you GNU grep and BSD grep to compare?


Yeah, this definitely falls into the absurd category of whiteboard interview questions. Ridiculously absurd. I have a book with this algorithm detailed in it, and I think I would still have trouble implementing it well.

That or I am just dumb. Very likely.


"Let's implement BM" (with some hints/guidance) doesn't seem absurd. It's not much code; handwritten can fit on a piece of paper or two. "Why is GNU grep fast?" would definitely be an absurd question though.


I could understand possibly going through sections. Though, really, that form of algorithm coding is so different from what I typically do day to day that it just seems irrelevant.

I would probably find it fun, provided the hints/guidance were good. Not sure what to expect of it, though. As the sibling post asks, what would be the takeaway?


Is the candidate capable of decomposing a problem and analyzing subproblems? Are they capable of integrating subsolutions back into a working whole?


The problem really comes down to expectations on that point. For many of us, this would just be knowing that there are good algorithms for doing this.

If you don't already know of this algorithm, finding it is something that historically took a surprisingly long time, using a field of algorithm design (pushdown automata) that just isn't that widely utilized nowadays. (Well, I suppose your domain may vary on this claim.)


Okay, let's suppose a software engineer cannot implement the BM algorithm off the top of their head. What would you make of that? How relevant is this kind of testing to the actual work of a software engineer?


Like I said, with hints and guidance. Write a function to find a substring. Ok. What's its runtime? Can we make it faster? What if we skip ahead a bit after a failed match? How much? What's the fastest way to calculate that?
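
That starting point might look like this: a naive substring search, O(n*m) in the worst case, advancing by exactly one position after every failed match. The whole BM discussion is about how much further than one you can safely jump:

    #include <stdio.h>
    #include <string.h>

    /* naive substring search: O(n*m) worst case, always advances by 1 */
    const char *naive_search(const char *hay, size_t n,
                             const char *needle, size_t m) {
        if (m == 0 || m > n)
            return NULL;
        for (size_t pos = 0; pos + m <= n; pos++)
            if (memcmp(hay + pos, needle, m) == 0)
                return hay + pos;
        return NULL;
    }

    int main(void) {
        const char hay[] = "a haystack with one needle in it";
        const char *hit = naive_search(hay, sizeof hay - 1, "needle", 6);
        puts(hit ? hit : "not found");
        return 0;
    }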


Curious to know which book that is; could you tell me, please?


The book I've got is Knuth's Selected Papers on Design of Algorithms.[1]

I will confess that I do not have the book with me, so it may actually be a different algorithm in that text. I cannot remember. :( Going off of [2], the algorithm I am referring to is chapter 9, "Fast pattern matching in strings."

[1] http://www.amazon.com/Selected-Papers-Algorithms-Language-Information/dp/1575865823/

[2] http://www-cs-faculty.stanford.edu/~uno/da.html


I wonder where that douchebaggery culture among employers comes from.


These are often the same companies that complain about how hard it is to find 'talent'.


The article about fast string search mentioned in the thread:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13....


Text processing is actually a pretty neat area, with a lot of practical uses.

A while back I found an out-of-print book, "Text Algorithms," available for download from the author in PDF and some other formats. It's not cutting-edge at this point, but it still covers all the basics (like Boyer-Moore) really well. http://igm.univ-mlv.fr/~mac/REC/B1.html


Awesome, thanks for the books!


I've been finding the following very useful for understanding string matching algorithms. It has a bunch of different algorithms for exact string matching, details on them, drawings, descriptions, speed comparisons, and code in C: http://www-igm.univ-mlv.fr/~lecroq/string/index.html


This should probably have a (2010) in the title or so.



On the topic of fast grep-like programs, there is jrep too: http://lwn.net/Articles/589009/


"The key to making programs fast is to make them do practically nothing. ;-)"

Very much the case.


lettergram could do a faster version that would work without errors the first time.



