Show HN: my regular expression example match generator

rndmcnlly0 · on Sept 24, 2010

I made this tool over a year ago as a proof of concept. It doesn't support all features (like negations inside of character classes), but it should demonstrate the benefit that a tool like this would give you. I'd be interested in helping someone else understand the guts and make it into a public service to help out newbie developers coming to understand these notoriously-difficult-but-highly-useful expressions.

elblanco · on Sept 25, 2010

Oh god, I've been wanting something like this for ages. I've always dreamed of a regex driven list generator, but where I can supply constraints on repetition operators (*->{0,2}, +->{1,3}) to limit the scope, but then get a comprehensive list of all the matching strings.

Imagine taking a user-driven entry, say a phone number typed as free text, and applying some rules that turn it into a regex matching many common ways to write that number. Then generate all the matching versions and submit them against an indexed database of text documents to see if any match. It's a trivial example, but a similar concept could work for things like names that have been Romanized from a non-Latin alphabet: generate a regex to match some variants, then generate the variants and search an indexed database for them. In many cases that would be much, much faster than scanning through all the documents with the regex.
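
A minimal Python sketch of the phone-number idea, under stated assumptions: the number is 10 digits, the separator choices are illustrative, and a plain set stands in for the indexed document database. The names expand and phone_variants are hypothetical, not part of any tool mentioned in this thread.

```python
from itertools import product

def expand(parts):
    """Return every string formed by choosing one alternative from each part."""
    return ["".join(choice) for choice in product(*parts)]

def phone_variants(digits):
    """Enumerate common spellings of a 10-digit phone number (assumed format)."""
    area, prefix, line = digits[:3], digits[3:6], digits[6:]
    parts = [
        # Roughly the regex (\(AAA\) |AAA[-. ]?) written out as explicit choices.
        ["(" + area + ") ", area + "-", area + ".", area + " ", area],
        [prefix],
        # Roughly [-. ]? before the last four digits.
        ["-", ".", " ", ""],
        [line],
    ]
    return expand(parts)

variants = phone_variants("5551234567")
# A set stands in for the indexed document database in this sketch.
index = {"call 555-123-4567 today", "(555) 123-4567", "555.123.4567"}
hits = [v for v in variants if v in index]
```

Each variant costs one exact lookup, so the work scales with the number of variants rather than with the size of the corpus. That only pays off while the bounded regex expands to a manageable list.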

elblanco · on Sept 25, 2010

Nearly forgot this similar effort: http://research.microsoft.com/en-us/projects/rex/

amethyst · on Sept 24, 2010

Congrats on supporting back-references :)

Hint: Try using "1(0+)1\1" or "([a-zA-Z]+)\1" for some nifty examples of things that technically aren't possible with true regular expressions, but are supported by enhanced regex engines like Perl/PCRE.

Edit: too bad it only works with one back reference, e.g., adding \2 apparently breaks everything, returning no matches and no code.

Edit2: after more playing, it seems that using \3 works as a backref for the second matching group... try "(a+)(b+)\1\3"...
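
For comparison, here is how the same patterns behave in Python's standard re module. This only shows the reference behavior of one existing engine, where \2 names the second group; it says nothing about how the tool numbers its groups internally.

```python
import re

# Python's `re` numbers groups by opening parenthesis, so \1 and \2 refer to
# the first and second groups.
one_backref = re.compile(r"1(0+)1\1")
repeated_word = re.compile(r"([a-zA-Z]+)\1")
two_backrefs = re.compile(r"(a+)(b+)\1\2")

def matches(pattern, text):
    """True when the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

results = {
    "1010": matches(one_backref, "1010"),       # group is "0", repeated after the second 1
    "100100": matches(one_backref, "100100"),   # group is "00"
    "10010": matches(one_backref, "10010"),     # trailing zeros do not repeat the group
    "abab": matches(repeated_word, "abab"),     # "ab" followed by "ab"
    "abc": matches(repeated_word, "abc"),       # no way to split into w + w
    "aabaab": matches(two_backrefs, "aabaab"),  # "aa", "b", then both repeated
}
```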

pjscott · on Sept 24, 2010

It doesn't seem to be able to handle character classes of the type [^abc], which matches any character except a, b, or c. It's pretty cool, though. I tried copying and pasting some crazy regexps from the internet in there, and it immediately gave me some idea what they do. Well done!

rndmcnlly0 · on Sept 24, 2010

I purposely punted on inverted character classes because they brought up a major usability issue that I didn't have a quick solution for. The expression /[^abc]/ should match '@', ' ', and 'q', but, by the numbers, the majority of generated matches would look like '{angry unicode that you can't even see}'. This brings up the idea that you don't just want valid matches, but a meaningful set of examples that covers different ways of matching.

Likewise, I imagine it might be useful to specify some background language that matches should be taken from. That is, you might want to tell it to show you example matches made from snippets of valid HTML, from email-style English text, ASCII-only strings, arbitrary Unicode, etc. Incorporating this kind of option requires a lot of original thinking about usability (and some prototyping to figure out what would even make sense).
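
One simple way to picture the background-language idea for a negated class like [^abc] is to sample only from a chosen alphabet. The sketch below assumes printable ASCII as that background; sample_negated_class and PRINTABLE are hypothetical names, and this is not how the tool works today (it does not support negated classes).

```python
import random
import string

# Hypothetical "background language": printable ASCII without control characters.
PRINTABLE = string.ascii_letters + string.digits + string.punctuation + " "

def sample_negated_class(excluded, background=PRINTABLE, rng=random):
    """Pick a character matching [^excluded], restricted to the background set."""
    candidates = [ch for ch in background if ch not in excluded]
    if not candidates:
        raise ValueError("the background alphabet has no character outside the class")
    return rng.choice(candidates)

rng = random.Random(2010)  # fixed seed so the examples are repeatable
samples = [sample_negated_class("abc", rng=rng) for _ in range(8)]
```

Swapping the background string (digits only, a small set of HTML-ish characters, and so on) changes what the examples look like without changing what the class means.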

abecedarius · on Sept 24, 2010

Since the intersection of two regular languages is a regular language, you could reduce generating 'background language' example matches to taking intersections. It's unfortunate how rare it is for libraries to provide you that operation.

(Same thing for HTML: intersecting a regular language with a context-free one is context-free.)
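
For the regular case, the intersection can be sketched with the textbook product construction on two complete DFAs. The dictionary layout and the two example automata below are assumptions made for illustration; real regex libraries usually do not expose their automata this way.

```python
from collections import deque

def intersect(dfa_a, dfa_b, alphabet):
    """Product construction: run both DFAs in lockstep on pairs of states."""
    start = (dfa_a["start"], dfa_b["start"])
    delta, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        a, b = state
        # A pair accepts only when both component states accept.
        if a in dfa_a["accepting"] and b in dfa_b["accepting"]:
            accepting.add(state)
        for symbol in alphabet:
            nxt = (dfa_a["delta"][(a, symbol)], dfa_b["delta"][(b, symbol)])
            delta[(state, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {"start": start, "accepting": accepting, "delta": delta}

def accepts(dfa, text):
    """Run the DFA over the text and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in text:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]

# Example automata over {0, 1}: an even number of 0s, and strings ending in 1.
even_zeros = {
    "start": "even", "accepting": {"even"},
    "delta": {("even", "0"): "odd", ("even", "1"): "even",
              ("odd", "0"): "even", ("odd", "1"): "odd"},
}
ends_with_one = {
    "start": "no", "accepting": {"yes"},
    "delta": {("no", "0"): "no", ("no", "1"): "yes",
              ("yes", "0"): "no", ("yes", "1"): "yes"},
}
both = intersect(even_zeros, ends_with_one, "01")
```

Generating examples from the product automaton then yields only strings that match the regex and also belong to the background language.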

nerfhammer · on Sept 24, 2010

you support . which seems to pick random ascii characters

jimbokun · on Sept 24, 2010

Bookmarked. I had an immediate use for this, and your application filled it beautifully.

JangoSteve · on Sept 24, 2010

The first thing I tried to do was

  [\d]+

But it didn't work. Took me a few tries to get it to work. I'm guessing it simply doesn't work with special regex characters.

You know what would be even more useful for me? I give you a string and highlight the important part I need to match, and you give me a good regex for it.

Example: I need to match the string "View conversation (5)", with the important part being that there is a number near the end of the string. So I type that and highlight the "5". Then you give me something like:

  /[^\d]+\d+[^\d]*/

Or whatever. Obviously it won't be foolproof because you don't know all my use cases and edge cases, but it'd give me somewhere to start.
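
A deliberately naive Python sketch of that "highlight and suggest" interaction, under stated assumptions: the highlighted span is a run of digits, the surrounding text is kept literal, and a capture group is added around the digits. suggest_regex is a hypothetical helper, and the pattern it returns is stricter than the one above.

```python
import re

def suggest_regex(text, start, end):
    """Build a starting-point regex from one example and a highlighted span."""
    highlighted = text[start:end]
    if not highlighted.isdigit():
        raise ValueError("this sketch only generalizes a highlighted run of digits")
    # Keep the surrounding text literal and generalize only the highlighted digits.
    return re.escape(text[:start]) + r"(\d+)" + re.escape(text[end:])

example = "View conversation (5)"
start = example.index("5")
pattern = suggest_regex(example, start, start + 1)
match = re.fullmatch(pattern, "View conversation (42)")
```

With a single example there is no way to tell which of the literal parts should also be generalized, so this is only somewhere to start.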

jsankey · on Sept 25, 2010

My CS honours thesis involved deriving regexes (well, DTDs for XML documents, but it is essentially the same problem).

The quality of the results depends a lot on the amount and quality of the input (how well it represents what you are really trying to match). A single example string is unlikely to get you far - you would certainly need multiple matching examples. So I think this approach works best when you already have an example corpus to work from, rather than providing input manually. If you're going to spend effort providing a lot of input, then you'd probably be better off spending at least some of that effort in providing hints or possible regex answers.

Further, there are many possible regexes that would match an input set, so the algorithm also needs some way to evaluate them and choose the better candidates. In my case I used some ideas from information theory (such as MML), which actually worked reasonably well. But this is a computationally hard problem, so even with an objective measure of the optimal regex you won't necessarily be able to find it.

rndmcnlly0 · on Sept 24, 2010

I've thought about how to do a kind of regex-induction, where you give it matches and it gives you one or more expressions that will match them. But, after some thinking, even with a reasonable language bias, getting a useful expression comes down to getting the user to type in a collection of positive and negative match examples. This kind of tool would be really nice to use, but the interaction has to be more than giving a single positive example.

An iterative process where you start from a single positive example and then iteratively refine it with clarifying examples would work without being overwhelming. However, I'm really trying to get the user to better understand regexes to the point where they could do that process for themselves, mentally, and not depend on a super intelligent tool to do it for them.

kenjackson · on Sept 24, 2010

In grad school I had a need for a tool that would generate all strings accepted by a regex of a specified size.

I wrote one myself, but was never really satisfied that I did a good job on it. I'd love to see the code of an efficient implementation of such a thing.
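
As a baseline only (not an efficient implementation), the brute-force version is short in Python: enumerate every string of the requested length over an assumed alphabet and keep those that the standard re module accepts in full. matches_of_length is a hypothetical name, and the cost grows exponentially with the length.

```python
import re
from itertools import product

def matches_of_length(pattern, alphabet, length):
    """Yield every string of exactly `length` symbols that fully matches `pattern`."""
    compiled = re.compile(pattern)
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        if compiled.fullmatch(candidate):
            yield candidate

# All length-3 strings over {a, b} accepted by (ab|ba)*a?
found = list(matches_of_length(r"(ab|ba)*a?", "ab", 3))
```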

abecedarius · on Sept 24, 2010

Doug McIlroy wrote an article on that: http://www.cs.dartmouth.edu/~doug/nfa.ps.gz

Code: http://www.cs.dartmouth.edu/~doug/nfa.hs

kenjackson · on Sept 25, 2010

Thank you. That is actually an excellent paper. A decade late for me, but excellent.

rndmcnlly0 · on Sept 24, 2010

"Generate all" and "efficient" seem doomed at some level. Considering /[01]+/, there is clearly an exponential number of matches up to any given max length.

The expression itself provides a nice summary of the generative space it implies. Perhaps for a particular application you could apply a kind of lazy generation. Say, given /[01]+/ it would first expand it into the subspaces of /0/, /1/, /0[01]+/, and /1[01]+/. One of these branches could be expanded next depending on what was actually needed in the next step.
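
That expansion can be sketched with a Python generator, specialized to /[01]+/. The breadth-first order is an assumption made for the sketch; an application could pick the next branch by any other rule. expand and lazy_matches are hypothetical names.

```python
from collections import deque
from itertools import islice

def expand(prefix, alphabet="01"):
    """Split the subspace /prefix[01]+/ into finished matches and open branches."""
    finished = [prefix + ch for ch in alphabet]   # e.g. /0/ and /1/
    branches = [prefix + ch for ch in alphabet]   # e.g. /0[01]+/ and /1[01]+/
    return finished, branches

def lazy_matches(alphabet="01"):
    """Yield matches of /[01]+/ on demand, expanding one branch at a time."""
    pending = deque([""])          # "" stands for the whole space /[01]+/
    while pending:
        finished, branches = expand(pending.popleft(), alphabet)
        yield from finished
        pending.extend(branches)

# Only the branches needed for the first six matches are ever expanded.
first_six = list(islice(lazy_matches(), 6))
```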

elblanco · on Sept 25, 2010

"Higher Order Perl" has a chapter on building one in Perl. http://en.wikipedia.org/wiki/Higher-Order_Perl I believe the full text is available.

jusob · on Sept 24, 2010

You mean you re-implemented String::Random ? http://search.cpan.org/~steve/String-Random-0.22/lib/String/...

rndmcnlly0 · on Sept 24, 2010

Interesting module, but different target application. Part of the core of my generation process is remembering which parts of the example match came from which piece of your original expression so that (someday) there could be some nice mouseover effects that help you understand HOW some example match works (as there are sometimes expressions that match a given string in multiple ways).

xtacy · on Sept 24, 2010

Reminded of this: http://news.ycombinator.com/item?id=1387418. Perhaps it's too complicated? :-)

rndmcnlly0 · on Sept 24, 2010

Here is the source to the python meat of this project: http://pastie.org/1180230

The general idea is that it is a command line tool that reads a regex as input (along with a number of examples to generate). It parses the regex into an abstract syntax tree, then hands it over to an interpreter to "execute" the expression/program several times.

The 'simpleparse' python library does a nice job of easing the regex-to-tree mapping.
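
The interpreter half of that design can be sketched roughly as follows. This is not the linked source: the tuple-based tree, the node names, and the cap of three repetitions are all assumptions, and the parsing step is skipped by building the tree by hand.

```python
import random
import re

def generate(node, rng, groups):
    """Walk the tree and return one random string that the node can match."""
    kind = node[0]
    if kind == "lit":                      # ("lit", "a")
        return node[1]
    if kind == "class":                    # ("class", "abc") stands for [abc]
        return rng.choice(node[1])
    if kind == "seq":                      # ("seq", [child, ...])
        return "".join(generate(child, rng, groups) for child in node[1])
    if kind == "alt":                      # ("alt", [child, ...]) stands for a|b
        return generate(rng.choice(node[1]), rng, groups)
    if kind == "repeat":                   # ("repeat", child, low, high)
        _, child, low, high = node
        count = rng.randint(low, high)
        return "".join(generate(child, rng, groups) for _ in range(count))
    if kind == "group":                    # ("group", number, child)
        _, number, child = node
        groups[number] = generate(child, rng, groups)
        return groups[number]
    if kind == "backref":                  # ("backref", number)
        return groups[node[1]]
    raise ValueError(f"unknown node kind: {kind!r}")

# Hand-built tree for 1(0+)1\1, with + capped at three repetitions.
TREE = ("seq", [
    ("lit", "1"),
    ("group", 1, ("repeat", ("lit", "0"), 1, 3)),
    ("lit", "1"),
    ("backref", 1),
])

rng = random.Random(0)
examples = [generate(TREE, rng, {}) for _ in range(5)]
# Cross-check each generated example against a real matcher.
assert all(re.fullmatch(r"1(0+)1\1", text) for text in examples)
```

Running the same tree several times with a fresh groups dictionary is what "executing" the expression amounts to in this sketch: each run makes its own random choices at the alternation, class, and repetition nodes.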

j_baker · on Sept 24, 2010

One thing that would be useful is if I could select different regex engines. Python does things slightly differently from Perl, C#, and (especially) emacs.

rndmcnlly0 · on Sept 24, 2010

Heh, it's a matter of implementing a declarative model of each of the different regex engines you'd like to select. It would be fantastic if it existed, wanna try?

gus_massa · on Sept 25, 2010

I think that it should show for example 10 matches (not only one), so it is easier to have an idea of "all" the possible matches.

DTrejo · on Sept 25, 2010

I've found http://txt2re.com/ to be very helpful for simple things.

sharpemt · on Sept 25, 2010

Something similar to check out. It's based in JavaScript and lets you visually verify basic patterns on text blocks: http://regexpal.com/

nerfhammer · on Sept 24, 2010

Using python with the re.X flag?

I can tell because it doesn't understand scoped flags or atomic groupings.

rndmcnlly0 · on Sept 25, 2010

While the generation engine is written in python (source linked in another comment), the engine itself doesn't make use of existing regex libraries because they solve a different problem (deterministic matching vs nondeterministic generation).

swaits · on Sept 25, 2010

Similar: http://rubular.com/

rndmcnlly0 · on Sept 25, 2010

It looks like rubular just shows you example matches out of text you've already typed in (the same situation you have when playing with any regex matching library directly), whereas regexio generates matches you might not expect out of the blue. Can you clarify what sense of similarity you mean to point out?

swaits · on Sept 25, 2010

Yes, I realize that rubular doesn't generate the matches. But they are both web sites where you primarily enter a regex in order to explore its behavior. The similarity is obvious.

teoruiz · on Sept 25, 2010

And the Python 're' option: http://pythex.org/

frobozz · on Sept 25, 2010

What an exceptionally helpful tool! Thanks.