Obvious questions:
1. How do you extend Boyer-Moore to arbitrary regexps?
2. Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?
1. You find the fixed part of the regexp, use Boyer-Moore to quickly filter the haystack, then do the deeper NFA matching only where the fixed bits are present. One example from later in that email thread: ^foo is the same as a fixed-string search for \nfoo.
I presume there are ways you can bound your fixed searches, and have your NFA work "backwards" through the string after a Boyer-Moore hit (or both backwards and forwards from fixed points, depending on where in the string the fixed bit is).
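The prefilter-then-confirm idea can be sketched in Python. Here `str.find` (via `in`) stands in for a Boyer-Moore fixed-string scan, and `grep_lines`/`required_literal` are illustrative names, not grep internals:

```python
import re

def grep_lines(lines, regex, required_literal):
    # Sketch of the two-phase approach: a cheap fixed-string filter
    # first, and the full regex engine only on surviving lines.
    pattern = re.compile(regex)
    for line in lines:
        if required_literal in line:   # fast fixed-string prefilter
            if pattern.search(line):   # full NFA match only now
                yield line
```

Most lines in a big log file fail the cheap substring test, so the expensive regex engine rarely runs.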
2. No, not for some people's small-file usage. But grep is used in many, many cases on huge log files (and mail files/queues and so on), so there is a valid case for optimizing those uses. Production sysadmins deal with multiple GBs of logfiles regularly, and grep them a lot.
> Does "not looking at every byte" really matter when the strings being searched for are typically so short that you're gonna hit every cache line anyway?
You're not skipping bytes of the search string, you're skipping bytes of the file being searched.
Yes, but the point is that on modern CPUs checking those bytes is likely far from the bottleneck, and thus skipping them saves you very little time, if any at all.
You might be surprised. Consider: counting to 2 million by 5s is significantly faster than counting by 1s. Especially if you also kick off some memory prefetching, I would expect that this can speed things up considerably.
Edit: I should say I would still think things should be benchmarked to really know...
>How do you extend Boyer-Moore to arbitrary regexps?
As mentioned in the follow-up messages, some regexps can be translated to one or at least a few fixed-string searches (/^[Gg]rep/ is really a search for "\nGrep" or "\ngrep"). If you can't do that, your regexp may still contain longish fixed strings that allow you to discard most lines via Boyer-Moore before you have to run the regexp engine.
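The /^[Gg]rep/ expansion can be sketched directly: over a buffer of newline-separated lines, a line-anchored match reduces to fixed-string searches for "\nGrep" and "\ngrep" (plus a check at offset 0). This is a minimal illustration, not grep's implementation; `match_anchored_grep` is a made-up name:

```python
def match_anchored_grep(buffer):
    # /^[Gg]rep/ over a newline-separated buffer: a matching line
    # either starts the buffer or is preceded by a newline.
    hits = []
    for needle in ("Grep", "grep"):
        if buffer.startswith(needle):
            hits.append(0)
        start = 0
        while True:
            i = buffer.find("\n" + needle, start)
            if i == -1:
                break
            hits.append(i + 1)  # the line starts after the newline
            start = i + 1
    return sorted(hits)
```

Each `find` call is the kind of fixed-string scan that Boyer-Moore accelerates; no regex engine is needed at all for this pattern.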
Which is curious. Some content you run into by accident and think needs reposting. But how does one stumble upon this post in 2015? Googling for "why is gnu grep fast" and then not noticing that two of the top four results are submissions to HN?
Finding it linked from some other article or blog post, or having it sent to you from someone else and then assuming that a mail thread from 2010 has never been seen on HN before.
Yeah, this definitely falls into the absurd category of interviewing whiteboard questions. Ridiculously absurd. I have a book with this algorithm detailed in it, and I think I would still have trouble implementing it well.
"Let's implement BM" (with some hints/guidance) doesn't seem absurd. It's not much code; handwritten can fit on a piece of paper or two. "Why is GNU grep fast?" would definitely be an absurd question though.
I could understand possibly going through sections. Though, really, that form of algorithm coding is so different from what I typically do day to day, that it just seems irrelevant.
I would probably find it fun, provided the hints/guidance were good. Not sure what to expect of it, though. As the sibling post asks, what would be the takeaway?
The problem really comes down to expectations on your point. For many of us, this would just be knowing that there are good algorithms on doing this.
If you don't already know of this algorithm, finding it is something that took a surprisingly long time to do. Using a field of algorithm design (pushdown automata) that just isn't that widely utilized nowdays. (Well, I suppose your domain may vary in this claim.)
Okay, let's suppose a software engineer cannot implement the BM off the top of their head. What would you make of that? How relevant is this kind of testing to the actual work of a software engineer?
Like I said, with hints and guidance. Write a function to find a substring. Ok. What's its runtime? Can we make it faster? What if we skip ahead a bit after a failed match? How much? What's the fastest way to calculate that?
The book I've got is Knuth's Selected Papers on Design of Algorithms.[1]
I will confess that I do not have the book with me, so it may actually be a different algorithm in that text. I can not remember. :( Going off of here[2], the algorithm I am referring to is chapter 9, "Fast pattern matching in strings."
Text processing is actually a pretty neat area, with a lot of practical uses.
A while back I found an out of print book, "Text Algorithms," available for download from the author in PDF and some other formats. It's not cutting edge at this point, but still covers all the basics (like Boyer-Moore) really well. http://igm.univ-mlv.fr/~mac/REC/B1.html
I've been finding the following very useful for understanding string matching algorithms. It has a bunch of different algorithms for exact string matching, details on them, drawings, descriptions, speed comparisons, and code in C
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
https://github.com/ggreer/the_silver_searcher
In addition to all the tricks applied by GNU grep (such as Boyer-Moore), it also splits work among threads to take advantage of multi-core machines. Integrates with your favorite text editor, too!