Hacker News new | past | comments | ask | show | jobs | submit login
A regular expression matcher By Rob Pike and Brian Kernighan (2007) (princeton.edu)
96 points by sid6376 on May 8, 2013 | hide | past | favorite | 45 comments



Can someone explain to me why the matchstar function takes its first param as an integer? to me it should be a char.

    int matchstar(int c, char *regexp, char *text)
    {
        do {    /* a * matches zero or more instances */
            if (matchhere(regexp, text))
                return 1;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return 0;
    }


It doesn't seem to make sense in this context. One hypothesis is that it could have been done out of habit to write safer code: whether char is signed or unsigned is left to the implementation, whereas int is guaranteed to be signed.


The usual reason is to allow c to be EOF, but I don't see why anyone would in this case.


There is a lovely description of the classic Thompson algorithm of building an NFA from a regular expression and then simulating the NFA with a DFA using two stacks :) in the "Compilers" book by Aho et al. There is also a great article from Russ Cox on it: http://swtch.com/~rsc/regexp/regexp1.html

I recommend implementing this as an exercise, rarely do you see that much classic CS theory put to a practical use in one place.


I translated this into Haskell just for fun.

    match :: String -> String -> Bool
    match ('^':rest) text     = matchLocal rest text
    match rex        (c:rest) = matchLocal rex (c:rest) || match rex rest
    match _          _        = False
 
    matchLocal :: String -> String -> Bool
    matchLocal []            _         = True
    matchLocal "$"           []        = True
    matchLocal (c:'*':restR) (t:restT) = c == t && matchLocal restR (starSkip c (t:restT))
    matchLocal (r:restR)     (t:restT) = (r == '.' || r == t) && matchLocal restR restT
    matchLocal _             _         = False
 
    starSkip :: Char -> String -> String
    starSkip c (t:rest) = if c == t then starSkip c rest else t:rest
    starSkip _ []       = []


On a unrelated note.

Adding a few lines of css (setting body width, font-size to 16, line-height to 1.4, Georgia for font and larger font size for headings ) makes for a decent reading experience.

http://imgur.com/a/KtaEC/


My recommendations, for a more pleasurable reading experience:

    body {
      width: 70%;
      font-size: 20px;
      margin: auto;
      font-family: sans-serif;
      line-height: 30px;
    }


Pedantic nitpick incoming

Sans-serifs are generally harder to read. Note, for example, how virtually all books are set in serifs. The primary reason sans-serifs are widely used in computers is the lack of decent display resolution required to display the actual serifs. So if you are using a larger font size to read something off the screen, there is no reason not to use a serifed typeface.


This might very well be a myth. See e.g. http://alexpoole.info/blog/which-are-more-legible-serif-or-s....


Everything that has to do with the perception might very well be a myth :)

However, in typography and type designer circles it is commonly accepted that Serifs are superior for consuming large quantities of text. Also in layman terms - serifs make the glyphs more distinct and easier to recognize at a glance, in contrast to the sans where there are several glyph pairs that look virtually the same.


"Commonly accepted" isn't worth much when it conflicts with actual science. Lots of things commonly accepted are simply incorrect.


Please don't pull a Wikipedia on me here. As an old physicist joke goes -

  Experimental physicist comes to a theoretical physicist office,
  bring a graph from a recent experiment and asks for a help with
  interpreting the results. 

  - Well, it's all rather obvious. Here's a peak, here's a dip,
    because of this, that and third.

  - Hold it, hold it... you are looking at it upside down.

  - Ah, right, right. *Rotates the graph*. Oh, it's now even more 
    obvious than before.
In other words, the "actual science" you are referring to frequently ends up to be nothing more than a matter of interpretation and a subject to all sorts of biases. Especially when it concerns something as unquantifiable as "comfort of reading". Just pick up a couple of fiction books, one set in serif and another in sans-serif, go through a pageful of text and see for yourself.


You can't just dismiss science and fact when it's inconvenient. Well, you can, but then your beliefs have no more backing than those of astrologers, and you're guilty of willfully spreading misinformation. You asserted a falsifiable statement as fact ("Sans-serifs are generally harder to read") and in the face of fairly compelling evidence that it is in fact a false statement, you immediately appealed to "common knowledge". It's also "common knowledge" that Einstein failed math as a kid and that eating before swimming will give you cramps, but these are both untrue statements.

Comfort of reading is very quantifiable. You can present text to a bunch of people in serif and sans serif fonts (double-blind and randomized) and ask them to rate how pleasant the text was. You can be clever and ask questions that measure understanding and retention, or you can ask them how much they enjoyed reading the passage, and these will give you indirect measures of reading comfort. Or you can be blunt and ask if they enjoy reading in the font, though this will pick up biases more. Either way is significantly more scientific than an appeal to common knowledge, though.

> Just pick up a couple of fiction books, one set in serif and another in sans-serif, go through a pageful of text and see for yourself.

This isn't science. This is just bias confirmation. ("Wow, I prefer the one I expected to prefer!")

But for the record, my e-reader font is set to Gill Sans.


I have no immediate way of verifying that what you call a compelling evidence referenced in that blog post is in fact compelling and accurate. I could go through all linked sources, but I don't presently have time for that. On the other hand I read several books on typography and I am partial to the type design. From that exposure I do know that the common consensus between people involved in creating said serifs and sans-serifs is that sans-serifs are generally harder to read. So when I am presented with an evidence to the contrary, you can be damn sure I will doubt it.

Throwing around power words like "science" and "compelling evidence" based on a couple of references plucked from a blog post - sorry, but you are well in a meta area, preaching about general subject matter without any regard to the context. You are not the only one who's aware what science and scientific methods entitle, but then you should also be well aware of a bunch of junk that gets published in a format of scientific research, gets quoted and re-quoted and eventually accumulates notable status even though it hasn't even been peer-reviewed once. This happens all the time and it's a part of "science", so being skeptical is a part of the package. And the more obscure the area of the research, the more skepticism is warranted. You surely know that being that well-versed in all things science.

  --
Your e-reader doesn't have the resolution required for good quality rendering of serif fonts. Hence the Gill Sans.


My complaint is not that you disagree with the blog or that you don't have the time to dig through all the references (neither do I). My complaint is that you've presented something as fact with no evidence and continue to assert its unquestionable truth in the face of contrary evidence. You've made no actual attempt to defend the claim except to say that others agree with you. That isn't knowledge. I'm not even sure it qualifies as an appeal to authority. It's just dogma.

You're right to be skeptical. You're not right to be dismissive.

--

My phone also uses sans-serif fonts almost exclusively and I read a ton of stuff on it. It definitely has a high enough resolution (>300ppi) for serif fonts.


I have a suspicion that 90% of the functional effectiveness of any visual aesthetic is learned experience. Witness the outcry over 48fps feature-length films, for example; people complain that it has a "cheap" or "soap-opera" feel, because cheap productions and soap operas are the dominant media filmed in high-speed. This affects the enjoyment that movie-goers take out of their experience, and thus the overall value of the product.

The inferences we make about functional differences in the aesthetics of type would clearly be biased by this phenomenon. It might certainly be possible for controlled observations to reveal that reading speeds are faster for serif type; but this might be due to the fact that the reader is used to it, as most long works are already printed in serif type, in part due to the assumptions of book designers that serif type is faster to read!


My university teacher who was a fan of LaTex, taught web development and actually had a job outside the uni, told us that Serif fonts look bad and unreadable on computer displays. He said the reason is resolution. Books don't have DPI limits, and you can have a serif font look good to the tiniest detail.


I think `width: 70%;` is a bad idea. If you have a very wide screen, 70% of it is too much and on a mobile device you don't want to waste 30%.


It's a catch-22 pretty much. If you set it as pixels, it can be either too narrow (600 px on a full hd in full screen, though lines that are too wide/long suck too) or too wide (which leads to scroll bars).

Setting it as a percentage will obviously always make it look the same since it scales. However, a percentage on a huge screen usually makes it too wide to read lines comfortably.

You pretty much have to pick your poison, though IMO about 600 px is the maximum width for a line before it's horrible to read on a full hd screen.


This is a solved problem.

   max-width: 600px;
This stops your text from being excessively wide, but also doesn't prevent smaller screens from making it more narrow.


Thats a pretty good way, but it's not dpi independent.


That's partially true. A CSS pixel isn't necessarily the same as a screen pixel. A Mac with a retina display set to "native" resolution, for example, should actually make that 1200 pixels wide (max). In general, though, I think most browsers will treat a css pixel as a screen pixel.

Of course you could also use points or ems if you prefer.


Or just resize your window and increase the font size.


It does work as a one time thing.

But, having an extension like Stylebot installed and using a tiny custom CSS does offer an advantage - the CSS applies across the domain and takes effect every time the page is loaded.

For example, see this gallery - http://imgur.com/a/MxFtD


Yes, though it's slightly more complex than that: if you've got multiple tabs open, you'll probably want to resize the window again to view the other tabs, and (at least in Chrome) if you open a new tab or window it'll also have that increased font-size which you'll have to decrease again.


For those who don't know, this is Rob Pike's chapter of Beautiful Code, a thoroughly enjoyable book: http://www.amazon.co.uk/Beautiful-Code-Leading-Programmers-P... . Every now and then I read a different chapter, and it's always interesting. I particularly liked Rob Pike's one, and the one about Python's dictionary implementation.


This code is beautiful, but it's also busted [1] -- albeit only for extraordinarily pathological input. As to whether this ultimately dents its beauty is up to the beholder, presumably; for me personally, I confess to suffer from a utilitarian sense of beauty when it comes to software -- for me, broken code can't achieve absolute beauty.

[1] http://dtrace.org/blogs/bmc/2007/07/11/beautiful-code/


I believe Kernighan acknowledged that: "[snip] The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space. [/snip]".

If you're going to judge this code based on both pathological haystacks but also pathological needle regexs, then you have to admit that this code, which has linear space usage in the pattern length is far better than PCRE, which has exponential time in the pattern length[1].

Considering its succinctness, I find that pretty impressive.

[1] http://swtch.com/~rsc/regexp/regexp1.html


Ah, and Jon Bentley's chapter about quicksort, along whose lines there's a (also very interesting) Google Tech Talk: http://www.youtube.com/watch?v=aMnn0Jq0J-E


For those unfamiliar with pointer arithmetic I wrote a python port here. https://gist.github.com/siddharthsarda/5538928


Be aware that this isn't an equivalent of the C code. Replacing pointer arithmetic with string copy increases big O complexity of the algorithm.

An equivalent would need to pass indexes to sub functions to avoid additional copies.


On a related note, Rob Pike's "Lexical Scanning in Go" talk is also worth a look https://cuddle.googlecode.com/hg/talk/lex.html#landing-slide


It's interesting that this is in a very functional style, no global state is used.


I was once asked in an interview to write a pattern matcher in C for the exact same class of regular expressions as in this post. I'm pretty sure the interviewer was looking for exactly the approach used here as well.


I wonder if this would be a good test to give a software developer candidate. Develop a regex matcher that matches a subset of the regex rules in your language of choice (without using regex abilities, of course).


https://github.com/darius/ung/blob/master/c/regexpoid.c was my solution; it handles a slightly more useful subset (including escaping) and saves some code by calling segment_matches at the top level.

(I'd seen Pike's code years before in _The Practice of Programming_, so it's not independent.)


I don't know, regexes have a lot of rules so you need to supply a reference, requiring them to remember it is overkill, not to mention some of those rules are pretty arcane.


A simple implementation of ?, +, * , [] and non-capturing, precedence-only () hardly needs a reference. Whether greedy vs non-greedy + / * are implemented doesn't really matter.

I will say that the problem is much easier if you have a good grounding in the basics of compilers, while that knowledge is not hugely useful very often. Any positive value as a test will come from correlation with interest in CS IMO.


I've asked it as an interview question many times; it works really well.

The only bad part is that most candidates can't even write a correct strstr(3), so it becomes incredibly depressing.


Altough performance isn't really part of this exercise, beautiful code is also likely to be fast -- efficiently doing the work to solve the problem and no more.


Not always true, e.g. brute force searching can often look more beautiful than more sophisticated techniques. The article mentions how performance could be improved by separating compilation and execution of the regular expression (as most "real" regex libraries do) - but this would almost certainly make it less beautiful.


I saw this in a drdobbs article in the 90s as well.



Thanks for posting this, and thanks to the OP for posting the link as well. :) I must shamefully admit that I haven't read the book, but it is now on top of my reading list!

This was really a most pleasurable and insightful read. True elegance!


Great code is often deceptively simple. It can lead you to believe it was simple to write.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: