Advanced Regular Expressions

emmett · on May 7, 2009

How did this get to the top of HN? It seems like a boring blogospam rehash of a bunch of very basic CS concepts, not even well written...

10ren · on May 7, 2009

I wish HN had downvotes for articles instead of harsh top comments.

functional-tree · on May 7, 2009

Or if the "flag" button made posts' gravity higher so that they fell to the bottom sooner.

stcredzero · on May 7, 2009

...basic CS concepts...

Something which is ironic, and also gets back to basic CS concepts, is that modern scripting languages have added so much power to "Regular Expressions" that they are no longer strictly Regular Expressions. They are no longer equivalent to Finite Automata. "Regular Expressions" in many cases are outright Turing Complete.

And so the abuse of our field's terminology continues.
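A small illustration of the point about lost regularity, using Python's `re` module (my choice of language; the thread does not name one). A backreference lets a pattern demand the same run of a's on both sides of a b, which is a textbook non-regular language. The names `SAME_RUN` and `same_run_on_both_sides` are made up for this sketch.

```python
import re

# Backreference pattern: some a's, a single b, then the SAME run of a's again.
# The language { a^n b a^n } cannot be recognized by any finite automaton,
# yet a backtracking "regex" engine accepts the pattern without complaint.
SAME_RUN = re.compile(r"^(a*)b\1$")


def same_run_on_both_sides(text: str) -> bool:
    """True if text is n a's, one b, then exactly n a's again."""
    return SAME_RUN.match(text) is not None
```

This only shows that backreferences go beyond finite automata; it says nothing either way about Turing completeness.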

DEinspanjer · on May 7, 2009

DFA vs NFA. I thought it was humorous that the most important features of modern regex engines are the things that prevent them from meeting the criteria of the original definition for a regex engine. And of course, with a true DFA regex engine, you'll never suffer from catastrophic backtracking or exponential performance degradation with a bad regex or bad subject string.
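A rough sketch of the contrast, assuming Python's backtracking `re` engine on one side and a hand-written two-state automaton on the other. The pattern `^(a+)+$` is a commonly cited pathological case, not one taken from the article, and the function names are invented for the example.

```python
import re

# Nested quantifiers over the same character: a classic pathological pattern.
NESTED = re.compile(r"^(a+)+$")


def backtracking_match(text: str) -> bool:
    """Match using Python's backtracking `re` engine."""
    return NESTED.match(text) is not None


def dfa_match(text: str) -> bool:
    """Hand-built DFA for the same language (one or more 'a')."""
    state = 0                  # 0 = nothing seen yet, 1 = at least one 'a' seen
    for ch in text:
        if ch != "a":
            return False       # no transition exists, so reject immediately
        state = 1
    return state == 1


# A near miss: a run of a's followed by one 'b'. The backtracking engine tries
# many ways of splitting the run between the inner and outer '+' before giving
# up; the DFA reads each character exactly once.
near_miss = "a" * 18 + "b"
```

Both functions agree on every input; they differ only in how much work a failing input costs. Lengthening the run of a's in `near_miss` makes the backtracking version slow down sharply, while the DFA loop stays linear in the input length.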

Gompers · on May 7, 2009

Consider the word ‘smashing’. Using the above regular expression, the regex engine will first try to match the pattern ‘hi’ in ‘smashing’. It will not find a match.

... what?

Poleris · on May 7, 2009

I find when I move across languages (which happens frequently) or even tools (like grep), regexp implementation often differs enough to throw me off and introduce subtle bugs.

Is there a page that shows a mapping between different implementations of regular expressions and which languages and applications they're used in? Is there a better way to figure these things out than hunt for documentation every time?

DEinspanjer · on May 7, 2009

A tool I find useful enough to warrant purchasing and running in a VM or Wine is RegexBuddy. It is great for debugging / optimizing regexes and supports several different language flavors. Still waiting for Vim and PHP flavors though. :) It allows you to generate code snippets in several languages that use the regex you constructed, and will assist you in performing cross-language translations.

gustavo_duarte · on May 7, 2009

I've written an automated tool to convert Vim-style regexes to Perl-style. This is kind of a dramatic case as the dialect is very different, but it was a fun thing to write.
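For readers curious what such a conversion involves, here is a minimal sketch. It is not the tool mentioned above, whose code is not shown in this thread. It assumes Vim's default "magic" mode and handles only a handful of constructs: grouping, alternation, `\+`, `\=`, `\?`, an opening `\{`, and word boundaries. The names `ESCAPED`, `LITERAL_IN_VIM`, and `vim_to_perl` are hypothetical.

```python
# Vim escape sequence -> Perl equivalent (default "magic" mode only).
ESCAPED = {
    r"\(": "(", r"\)": ")", r"\|": "|",
    r"\+": "+", r"\=": "?", r"\?": "?",
    r"\{": "{",
    r"\<": r"\b", r"\>": r"\b",
}

# Characters that are literal in Vim magic mode but special in Perl.
LITERAL_IN_VIM = set("()|+?{")


def vim_to_perl(pattern: str) -> str:
    """Translate a small subset of Vim regex syntax to Perl syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            pair = pattern[i:i + 2]
            out.append(ESCAPED.get(pair, pair))   # unknown escapes pass through
            i += 2
        elif ch in LITERAL_IN_VIM:
            out.append("\\" + ch)                 # literal in Vim, so escape it
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `vim_to_perl(r"\(foo\|bar\)\+")` gives `(foo|bar)+`. A real converter would also have to deal with `\v` and `\m` magic switches, Vim-only atoms, `\{n,m\}` with an escaped closing brace, and the contents of character classes, none of which this sketch attempts.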

DEinspanjer · on May 7, 2009

Is it bi-directional? Is it on the vim.org website somewhere? I find Vim's escape-heavy syntax cumbersome and I frequently wish I could write a quick Perl syntax regex and convert it to Vim.

Of course, Vim is one of the few regex engines to support variable-width negative look-behinds, so I guess that counts for something. :)
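As a point of comparison, Python's standard `re` module is one of the engines that rejects variable-width look-behinds. The sketch below just checks whether a pattern compiles; `compiles` and the two pattern names are made up for the example. In Vim, the variable-width case would be written roughly as `\(fo\+\)\@<!bar`, to the best of my knowledge.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed_width = r"(?<!foo)bar"      # look-behind is always exactly 3 characters
variable_width = r"(?<!fo+)bar"   # look-behind can be 2 or more characters
```

Here `compiles(fixed_width)` succeeds, while `compiles(variable_width)` fails because `re` requires every look-behind to have a fixed width.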

gustavo_duarte · on May 7, 2009

Only Vim -> Perl is fully fleshed out right now. Doing it the other way would be fairly easy though because so much would be reused. Also, with the Vim regexes I had to worry about magicness and so on, the Perl stuff would parse easier.

It's not in Vim.org. I had the code floating around with an MIT license, but it's offline right now. I thought about building a page where you could do the translation (I figured Perl -> Vim is what people would want more) or just use a simple GET request from other code.

Maybe I should do the Perl -> Vim bit and make it available.

staunch · on May 7, 2009

Can you make vim use pcre?

gustavo_duarte · on May 7, 2009

From my look in the Vim sources, it would be tons of work to truly make it use pcre internally.

But I think it wouldn't be hard to do a plugin that translates a pcre into a Vim regex, allowing you to search or replace using pcres.

My code though is not integrated with Vim, it takes a free floating regex and converts it. I used it to convert all the regexes in Vim syntax files and build a syntax highlighter. It was a silly project, just for fun.

0xdefec8 · on May 7, 2009

indeed. grep -P is your friend. (except older grep versions seem to segfault on me constantly with various -P regexes)

zouhair · on May 7, 2009

sudo aptitude install txt2regex

10ren · on May 7, 2009

I hadn't seen recursive regex before (almost a tautology). The operator in the article, "(?R)", isn't in Perl (v5.8.8).
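For anyone else who hasn't met the idea: PCRE spells the recursion operator `(?R)`, and a typical use is matching balanced parentheses with something like `\((?:[^()]|(?R))*\)`. Python's standard `re` module has no such operator, so the sketch below writes out by hand what that pattern does. `match_balanced` is a made-up name, and the pattern is a common textbook example rather than the one from the article.

```python
# Hand-written equivalent of the PCRE-style pattern  \((?:[^()]|(?R))*\)
def match_balanced(text: str, start: int = 0) -> int:
    """Return the index just past a balanced "(...)" group beginning at
    `start`, or -1 if no balanced group begins there."""
    if start >= len(text) or text[start] != "(":
        return -1
    i = start + 1
    while i < len(text):
        if text[i] == ")":
            return i + 1                  # closing paren ends this level
        if text[i] == "(":
            i = match_balanced(text, i)   # the "(?R)" step: recurse
            if i == -1:
                return -1
        else:
            i += 1                        # the "[^()]" step
    return -1                             # input ended before the closing paren
```

Each nested opening parenthesis triggers one recursive call, which is what the engine does when it re-enters the whole pattern at `(?R)`.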

DEinspanjer · on May 7, 2009

Yeah. PHP syntax. From what I understand, they stuffed it in there mostly to allow easier processing of nested HTML tags and such.

lucumo · on May 7, 2009

I did some experimenting with recursive regexes after reading this article. I hadn't heard of it before and it sounded like an interesting concept.

However, captures in the recursion can't be retrieved after it gets returned. This may not be surprising, but it does reduce the power of recursion somewhat.

tjpick · on May 7, 2009

coz you can't expect anyone using PHP to, you know, write a proper parser...

Hexstream · on May 7, 2009

The first four of these I wouldn't call "advanced"...

lucumo · on May 7, 2009

And yet I was pleasantly surprised. I'm quite used to finding "advanced" articles on various programming topics and finding that everything said in them is so basic that I can't learn even the smallest piece of information from it.

This one still had some quite interesting pieces of information that I hadn't heard of or hadn't studied in detail yet.

masomenos · on May 7, 2009

not for full-time programmers, but SM's audience seems to be more designers who may dip into code now and again

mncaudill · on May 7, 2009

Welcome to Smashing Magazine.

viggity · on May 7, 2009

I'm a big fan of letting named groups be my documentation, instead of using the # comment notation.

DEinspanjer · on May 7, 2009

Named groups have their use and are nice when the language supports them. There is a lot to be said for the (?xi) flags to allow a complicated regex to be indented and commented usefully.
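The two approaches combine well. A minimal sketch in Python, whose `(?P<name>...)` group syntax and `(?xi)` inline flags are standard `re` features; the date format and the names `DATE` and `parse_date` are invented for the example.

```python
import re
from typing import Optional

DATE = re.compile(r"""(?xi)      # x = verbose layout, i = case-insensitive
    (?P<day>   \d{1,2} ) \s+     # day of month
    (?P<month> [a-z]{3} ) \s+    # three-letter month abbreviation
    (?P<year>  \d{4} )           # four-digit year
""")


def parse_date(text: str) -> Optional[dict]:
    """Return the named groups as a dict, or None if text is not a date."""
    m = DATE.fullmatch(text)
    return m.groupdict() if m else None
```

The group names document what each piece captures, and the comments explain why it is written that way, so neither has to carry the whole load.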