Stop avoiding regular expressions damn it

dasil003 · on May 12, 2013

The core criticism of regular expressions is legitimately directed at intermediate programmers who know enough to be dangerous, but is sometimes inappropriately cargo-culted by beginner programmers who use it as an excuse not to learn regular expressions.

The fact is that, despite pithy slogans, there is a sweet spot where a regular expression does the job of matching a string more clearly than anything else. But that sweet spot is well shy of the theoretical power of regular expressions (especially in Perl!). Before you approach that limit, you should learn a range of parsing techniques rather than hack together a baroque regex.

dllthomas · on May 12, 2013

Though there's nothing wrong with busting out a baroque regex in one-time-use contexts (editor's search function, throw-away application of grep or sed, &c).

dasil003 · on May 12, 2013

Of course not.

kbenson · on May 13, 2013

As with many things, I find the sweet spot changes depending on my knowledge and familiarity with the concept.

4ad · on May 12, 2013

I found the house I currently live in with regular expressions.

A couple of years ago I moved to a different country, and for various reasons I needed two apartments, preferably close to each other. As you can imagine, the real estate websites are not designed for the kind of query I needed, so I wrote some code to aid me in my quest[1].

It's just shell script and text processing with awk. I download the results listing all the available apartments from many real estate websites, then I scrape the data I care about (with regular expressions!) like address, rooms, price, anything really, and query the Google Maps API with all the addresses to retrieve the geographical coordinates. Then I compute the distances between any two houses and sort them.

It's fantastically modular. Adding support for a new website meant just creating some regular expressions that work for that website. This was great because I was doing this on the road, as I was visiting the foreign city and found new sources of information.

Regular expressions were also great because these websites didn't have any API where I could query for the address, etc. I had to rely on what people wrote in their ads. This meant that when I wrote a regexp to match a set of results, I had to inspect the failures to see new ways people described their houses and improve my matching based on that. Initially I had hoped I'd be able to parse 80% of the ads, but measurements and careful coding allowed me to match approximately 99% of the ads!

The textual operation of this software allowed me to easily input some data manually. For example, I realized that I was also interested in having these apartments close to a subway station. No problem: just manually create the file with the subway stations in the correct, simple, textual format and the program will pick it up and use it automatically.

The textual interface also helped with fancy queries, like "price between X and Y, 6 rooms total, prefer 4-2 to 3-3 if distance less than D, but 3-3 if distance greater than D, prefer Z subway line to Q, only one apartment might be from an agency rather than an individual, try to put one in K part of the city". Try to do that with an existing website.

[1] https://code.google.com/p/operation-housefinder/
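
The pipeline described in this comment (scrape fields with regular expressions, attach coordinates, rank pairs by distance) can be sketched roughly as follows. This is not the linked project's code, which is shell and awk. It is a minimal Python sketch, and the ad format, the `AD_PATTERN` regex, and the helper names are all hypothetical. Coordinates are passed in directly, so the geocoding step is left out.

```python
import math
import re
from itertools import combinations

# Hypothetical ad format; in practice each website needs its own pattern.
AD_PATTERN = re.compile(
    r"(?P<rooms>\d+)\s*rooms?.*?(?P<price>\d+)\s*EUR", re.IGNORECASE
)


def parse_ad(text):
    """Extract rooms and price from free-form ad text, or None on failure."""
    m = AD_PATTERN.search(text)
    if m is None:
        return None  # inspect these failures to improve the pattern
    return {"rooms": int(m.group("rooms")), "price": int(m.group("price"))}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_pairs(places):
    """places maps a name to (lat, lon); return (km, a, b) sorted by distance."""
    pairs = [
        (haversine_km(places[a], places[b]), a, b)
        for a, b in combinations(sorted(places), 2)
    ]
    return sorted(pairs)
```

Ads that `parse_ad` rejects are the interesting ones: they show the phrasings the pattern does not cover yet.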

bradt · on May 12, 2013

A little back story on this article for those who are interested...

I noticed my coworker was going out of his way to use string manipulation, writing many lines of code instead of a simple regular expression. When I asked why, he explained that he didn't know regular expressions, but more importantly that he felt that he had read a lot of posts on Stack Overflow discouraging use of regular expressions. From what he had read, he felt that it was better practice to avoid regular expressions. Although this could be anecdotal, there may be a real danger here that inexperienced programmers are getting the wrong message, that regular expressions are somehow bad in most situations and not worth learning.

Titanous · on May 12, 2013

More concise? Sometimes. Slower? Always.

    BenchmarkRegexp	  500000	      5136 ns/op
    BenchmarkStrings	10000000	       173 ns/op

http://play.golang.org/p/YT29Ao-tOt

btilly · on May 12, 2013

That is an implementation-specific benchmark. A grungy real-world regexp engine (such as Perl's) usually will recognize important special cases and substitute in faster code for them.

The classic example is to recognize that you're looking for a fixed string, and substitute in Boyer Moore. But prefix/suffix recognition are two other common examples.

kamaal · on May 12, 2013

Last time I checked, any time I needed the power and flexibility of a regular expression, getting the job done was more than an order of magnitude more important than saving some milliseconds of processing time.

buro9 · on May 12, 2013

You could more than double the performance of the regexp if you did MustCompile just the once rather than within every loop.

MustCompile is generally used to make the regexp a global so that it isn't done over and over.

Just move it out of the loop, as it's really not necessary to compile regular expressions every time you want to match/replace against it.
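
The same compile-once habit, sketched in Python rather than Go: `re.compile` plays the role of `regexp.MustCompile` here, and the `WORD` pattern and `count_words` helper are made up for illustration. Python's `re` module also caches recently used patterns internally, so the penalty for getting this wrong is usually smaller there, but the structure is the same.

```python
import re

# Compiled once at import time, much like keeping the result of
# regexp.MustCompile in a package-level variable in Go.
WORD = re.compile(r"[A-Za-z]+")


def count_words(lines):
    """Count alphabetic words across all lines using the shared pattern."""
    total = 0
    for line in lines:
        # Reuse the compiled pattern; nothing is compiled inside the loop.
        total += len(WORD.findall(line))
    return total
```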

Titanous · on May 12, 2013

It does have the MustCompile outside of the loop. I pasted the wrong link originally.

buro9 · on May 12, 2013

Ah, my apologies, I saw the earlier link.

Achshar · on May 12, 2013

If I am reading the chart correctly, doubled performance would still not be enough.

buro9 · on May 12, 2013

Absolutely. But the original version linked was twice as slow as need be.

For trivial replacements, I find string manipulation is faster and safer (fewer bugs). But there is some threshold of complexity beyond which regular expressions are both more performant and safer.

rcfox · on May 12, 2013

Just curious: how do the regexes compare when you use "^@(.*)@$" ? Semantically, it's closer to the string version.

Realistically, you'd expect them to behave exactly the same, but Go's pretty new, and you never know what is or isn't going to be optimized.

Titanous · on May 12, 2013

    `\A@(.*)@\z`

    BenchmarkRegexp	  500000	      5181 ns/op
    BenchmarkStrings	10000000	       171 ns/op
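
The benchmark source itself is not reproduced in the thread, so the following is only a guess at the kind of comparison being made: unwrapping a string delimited by `@` on both ends, once with an anchored regex and once with plain string operations. It is a Python sketch with hypothetical function names. `\Z` in Python matches only at the very end of the string, like `\z` in Go, and `re.DOTALL` is needed so that `.` also matches newlines, as the string version implicitly does.

```python
import re

# Anchored at both ends; DOTALL lets "." match newlines so the regex
# accepts exactly the same inputs as the string version below.
WRAPPED = re.compile(r"\A@(.*)@\Z", re.DOTALL)


def unwrap_regex(s):
    """Return the text between a leading and trailing "@", else None."""
    m = WRAPPED.match(s)
    return m.group(1) if m else None


def unwrap_strings(s):
    """Same contract as unwrap_regex, using only string operations."""
    if len(s) >= 2 and s[0] == "@" and s[-1] == "@":
        return s[1:-1]
    return None
```

Even in a case this small, the two versions agree only once anchoring and newline handling have been pinned down, which is part of the "what is or isn't the same" question raised above.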

ralph · on May 12, 2013

No, not always. That's a poor example biased to simple string handling.

coolsunglasses · on May 12, 2013

Using Golang as an example of real world regex performance is borderline dishonest. Their regex engine is notoriously unoptimized and is not intended to be a strong point of the language.

Titanous · on May 12, 2013

It's not "unoptimized", it has a different design[1] than other language implementations, resulting in different performance characteristics.

[1] http://swtch.com/~rsc/regexp/regexp1.html

kbenson · on May 13, 2013

Not just a different design, a different set of features. For example, no backreferences.
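
For readers who have not met backreferences: they let a pattern refer back to text that an earlier group captured. A small Python sketch (Python's `re` module supports them; the doubled-word task and the function name are just an illustration):

```python
import re

# \1 must match exactly the text captured by group 1, so this finds
# a word that is immediately repeated.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")


def doubled_words(text):
    """Return each word that appears twice in a row."""
    return DOUBLED.findall(text)
```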

bradt · on May 12, 2013

Thanks for clarifying. I guess my point was that if you're just matching one email address in a form submission for example, is performance significant?

Titanous · on May 12, 2013

No, a few µs vs a few ns when processing your web form won't be significant. Don't shy away from regular expressions, but be aware of their performance and readability impact.

The problem is when developers that don't know any better build parsers with regular expressions. That's almost always a bad idea.

bradt · on May 12, 2013

Agreed.

nraynaud · on May 12, 2013

As a general rule I ask people to avoid using non-trivial regular expressions. The grammar is too tricky and often the expression doesn't mean what the developer intends it to mean. Or the next developer will make a mistake.

My current favorite is parser combinators, which seem a good compromise (not a magic wand) between maintenance (external parser generators don't blend well into your code), parsing what you think you are parsing (more so when your grammar was defined with rules in a reference document), and integrating the parser with your code.

bane · on May 12, 2013

Does anybody know of a good Perl or Python library that will take a regex (with constraints on the repetition operators) and generate an exhaustive list of matching strings (instead of generating a random list)?

I think this would be helpful in many cases in getting people to understand how regexes work. I've seen lots of cases where toolsets designed to help people build regexes end up with them confused when their regex also matches other stuff beyond their test strings.

_yid9 · on May 12, 2013

https://github.com/ferno/greenery

bane · on May 13, 2013

cool, looks like the strings() method in lego.py might work

gbog · on May 12, 2013

OT: Where does this seemingly odd and new habit of spacing inside parentheses come from? I always write "(a, b)", mostly because it is closer to English (or other languages) typography, and it seems to have good readability, plus it is, I believe, the standard in most languages. So why write "( a, b )"?

By the way, if some like spacing that much, and if the reason is to have a better mouse-selectability, then I humbly propose "( a , b )".

bradt · on May 12, 2013

It's WordPress' PHP Coding Standards: http://make.wordpress.org/core/handbook/coding-standards/php...

gbog · on May 13, 2013

Ok. Is there any rationale behind it?

buro9 · on May 12, 2013

I feel that this needs posting again: http://www.debuggex.com/

Basically a great online tool for testing your regular expressions and stepping through what is actually happening. As soon as you get non-trivial, it's a Godsend.

Su-Shee · on May 12, 2013

THE single best resource to really learn how to deal competently with regex is still Jeffrey Friedl's book "Mastering Regular Expressions".

You will profit from it for the rest of your career.

(There's also a Regex short reference and a Regex cookbook by O'Reilly...)

krat0sprakhar · on May 12, 2013

Sincere question - is it worth investing time into reading a 500 odd page book for something that I might not use that frequently in my career? From my experience, I've seen that I can get away by just Googling or just experimenting whenever I'm stuck on a regex.

Su-Shee · on May 12, 2013

Absolutely.

The book doesn't just teach you regex, but the why, how AND the dialects. It gives you an overview of different tools and programming languages and their regex-related functions and methods.

On top, it contains a ton of examples, is very well written (considering the insanely dry and difficult to typeset subject :) and is very polished (I think it's in the 3rd edition by now..)

If you just google or experiment on regex, you usually get bad regex, badly crafted regex, brittle regex and make every single mistake the book prevents you from making.

It's really one of the books most worth reading through - it's also an excellent handbook to look things up.

Remember that a lot of commandline tools take in regex too - grep, sed, awk, you name it - it's not just for use in programming languages.

Your favorite editor has regex too.

I simply don't know how people can live without it; I'm using regex practically every day.

P.S.: And _after_ reading the book, you will understand why people yell at you when you parse HTML with regex but you will know how to do it anyways and at least not completely badly. ;)

P.P.S: And here's the canonical post to BUT OF COURSE you can parse HTML with regex from stackoverflow.. :) http://stackoverflow.com/questions/4231382/regular-expressio...

to3m · on May 12, 2013

Well, maybe not an entire 500 page book, though I'm sure it wouldn't hurt. You could always try Zed Shaw's the-hard-way book on the subject: http://regex.learncodethehardway.org/book/ (haven't read it but the the-hard-way books seem to be fairly well regarded)

I do take issue with your suggestion that you might not use this stuff all that frequently in your career. This is definitely at odds with my experience. Even though I don't use them that much in final-quality code, I use them all the time from the text editor, and quite often for quick one-off text manipulation or extraction scripts. Having a quick way to extract text from ad-hoc data can quickly get you a rough answer to a speculative question, the text equivalent of a back-of-the-envelope calculation, without needing to do a lot of work and without needing the question to justify a lot of work.

But I mainly use them for searching for one of two or three different strings in the text editor.

louischatriot · on May 12, 2013

You don't need to read 500 pages to understand the core of regex. "The core" means "what you will use 99% of the time". You need 11 minutes: http://www.youtube.com/watch?v=hwDhO1GLb_4

bradt · on May 12, 2013

I guess your question is why read a book when you can just learn as you go and as you need to. I didn't read a book, but I probably should have because it took a long time for me to pick up things that would have helped a ton earlier on. For example, I recently learned that you can turn off "greedy" when using .* by adding a ? after it. This was a huge revelation that I would have benefited from day one, ten years prior.
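
A small Python sketch of the greedy versus non-greedy difference mentioned here. The sample string is made up; the `*?` syntax itself is standard in Python's `re` module and in Perl-style engines generally.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as possible, so the match runs from the
# first "<" to the last ">" in the string.
greedy = re.search(r"<.*>", html).group(0)

# Non-greedy: ".*?" grabs as little as possible, so the match stops
# at the first ">" it can reach.
lazy = re.search(r"<.*?>", html).group(0)

# With the non-greedy form, findall picks out each tag separately.
all_tags = re.findall(r"<.*?>", html)
```

Here `greedy` is the entire string, `lazy` is just `<b>`, and `all_tags` holds the four tags one by one.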

wereHamster · on May 12, 2013

Please have a look at this: http://xkcd.com/1205/.

arkitaip · on May 12, 2013

Yes, a hundred times yes. Even rudimentary web coding involves regexps somehow. But the real benefit is mastering a topic that is complex and the sense of accomplishment and competency it brings you.

notyourpal · on May 12, 2013

I'm very guilty of this myself. I'm officially a loser if I haven't delved into regex within two weeks.

ExpiredLink · on May 12, 2013

Stop propagating bad interfaces like 'regular expressions' damn it!

An interface that e.g. makes me 'escape' half of my input because its designers think their special use of characters must take precedence over all user input is a bad interface.

Su-Shee · on May 12, 2013

Many programming languages have a function for that to do that for you...

In Perl, it's called quotemeta (qw, qq and family, too), in Python and Ruby it's .escape... and there's always \Q ... \E to use...

I'm sure others have similar methods/functions.
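
In Python that function is `re.escape`. A minimal sketch of using it to search for user input as literal text (the `contains_literal` helper is hypothetical):

```python
import re


def contains_literal(needle, haystack):
    """True if haystack contains needle, treating needle as plain text."""
    # re.escape backslash-escapes the regex metacharacters in needle,
    # so "." or "+" in user input cannot act as operators.
    return re.search(re.escape(needle), haystack) is not None
```

Without the escape, a needle such as `a.b` would also match `axb`, because the dot would be read as "any character".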

GhotiFish · on May 12, 2013

I agree with that criticism, and so do other people; that's why the implementations of Regular Expressions are anything but regular.

That always makes me giggle.

sanderjd · on May 12, 2013

It's not an interface, it's a language. Most (all?) languages have collisions between in-band and out-of-band information and ways to "escape" them.

fosap · on May 12, 2013

So, what do you use instead?

3minus1 · on May 12, 2013

What's a good resource for learning reg exp?

bradt · on May 12, 2013

Great question. I learned them through osmosis over many years of looking at them in other people's code and tinkering myself. I don't remember ever going through a tutorial or reading a book. Probably not the best way to learn them as it definitely took a long time to have a good grip on them and I was missing important pieces for a long time. For example, it was only relatively recently that I learned that you can turn off "greedy" when using .* by adding a ? after it.

solistice · on May 12, 2013

How about this site? http://www.regular-expressions.info/ I just skimmed it, and I'm not a regex pro, so I can't vouch for its quality, but it seems like a place to start. For C#, I kinda enjoyed the dotnetperls page on regex http://www.dotnetperls.com/regex.

tekacs · on May 12, 2013

http://www.regular-expressions.info/ is something I've stumbled across a few times in recent years, recommended by my friends to one another.

If I recall correctly I learnt the basics from MSDN documentation and later more thoroughly when I first came across Perl. Either of these is a pretty decent choice, too. :)

arkitaip · on May 12, 2013

I find picking an actual context is really important to keep you motivated as learning about regexps can be tedious and prone to error. Probably the most fun regexp work I've done has been with Apache's mod_rewrite - using .htaccess files of course - because it's a great feeling to master something that useful that previously just seemed like magic.

zimbatm · on May 12, 2013

What you want is to achieve a level where you can start thinking in regexp.

Eg: /^([a-z]*)$/ this translates into my brain to: a string that starts with a capture group of 0-N characters between a-z and also ends with it.

Then it's really easy.
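
To check that reading against an actual engine, here is the same pattern in Python (a sketch; the function name is made up, and the slashes around the pattern are dropped because Python does not use them as delimiters):

```python
import re

# ^ and $ pin the match to the whole string; the group captures
# zero or more lowercase ASCII letters.
LOWERCASE_ONLY = re.compile(r"^([a-z]*)$")


def lowercase_run(s):
    """Return the captured letters if s is all a-z (or empty), else None."""
    m = LOWERCASE_ONLY.match(s)
    return m.group(1) if m else None
```

Note that the empty string matches too, since `*` allows zero repetitions.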

beezee · on May 12, 2013

this one is incomplete but gave me a really good foundation that makes it easy to fill in gaps when I need, and most importantly gets you that "thinking in regex" result mentioned above- http://regex.learncodethehardway.org/book/