The core criticism of regular expressions is legitimately directed at intermediate programmers who know enough to be dangerous, but is sometimes inappropriately cargo-culted by beginner programmers who use it as an excuse not to learn regular expressions.
The fact is that despite pithy slogans, there is a sweet spot where a regular expression does the job of matching a string in a clearer fashion than anything else. But that sweet spot is well shy of the theoretical power of regular expressions (especially in Perl!). Past it, you should further your understanding of a range of parsing techniques before hacking together a baroque regex.
Though there's nothing wrong with busting out a baroque regex in one-time-use contexts (editor's search function, throw-away application of grep or sed, &c).
I found the house I currently live in with regular expressions.
A couple of years ago I moved to a different country, and for various reasons I needed two apartments, preferably close to each other. As you can imagine, the real estate websites are not designed for the kind of query I needed, so I wrote some code to aid me in my quest[1].
It's just shell script and text processing with awk. I download various results with all the available apartments for many real estate websites, then I scrape the data I care about (with regular expressions!) like address, rooms, price, anything really, and query the Google Maps API with all the addresses to retrieve the geographical coordinates, then I compute the distances between any two houses and sort them.
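As an illustration only, here is a rough sketch of that pipeline in Python rather than the original shell and awk. The ad format, the `AD_RE` pattern, and the function names are all hypothetical. The geocoding step is left out, so the coordinates are assumed to be already known, and the haversine formula stands in for whatever distance computation the original used.

```python
import itertools
import math
import re

# Hypothetical ad format; every real site would need its own pattern.
AD_RE = re.compile(
    r"(?P<rooms>\d+)\s*rooms?,\s*(?P<address>[^,]+),\s*(?P<price>\d+)\s*EUR",
    re.IGNORECASE,
)

def parse_ad(text):
    """Return the extracted fields, or None so that failures can be inspected."""
    m = AD_RE.search(text)
    if m is None:
        return None
    return {
        "rooms": int(m["rooms"]),
        "address": m["address"].strip(),
        "price": int(m["price"]),
    }

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def closest_pairs(located):
    """located: list of (ad, (lat, lon)). Returns all pairs, nearest first."""
    pairs = [
        (haversine_km(p1, p2), a1["address"], a2["address"])
        for (a1, p1), (a2, p2) in itertools.combinations(located, 2)
    ]
    return sorted(pairs)
```

Returning `None` from `parse_ad` instead of raising keeps the unmatched ads easy to collect, which is what makes the "inspect the failures, improve the pattern" loop practical.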
It's fantastically modular. Adding support for a new website meant just creating some regular expressions that work for that website. This was great because I was doing this on the road, as I was visiting the foreign city and found new sources of information.
Regular expressions were also great because these websites didn't have any API where I could query for the address, etc. I had to rely on what people wrote in their ads. This meant that when I wrote a regexp to match a set of results, I had to inspect the failures to see new ways people described their houses and improve my matching based on that. Initially I had hoped I'd be able to parse 80% of the ads, but measurements and careful coding allowed me to match approximately 99% of the ads!
The textual operation of this software allowed me to easily input some data manually. For example, I realized that I was also interested in having these apartments close to a subway station. No problem: just manually create the file with the subway stations in the correct, simple, textual format, and the program will pick it up and use it automatically.
The textual interface also helped with fancy queries, like "price between X and Y, 6 rooms total, prefer 4-2 to 3-3 if distance less than D, but 3-3 if distance greater than D, prefer Z subway line to Q, only one apartment might be from an agency rather than an individual, try to put one in K part of the city". Try to do that with an existing website.
A little back story on this article for those who are interested...
I noticed my coworker was going out of his way to use string manipulation, writing many lines of code instead of a simple regular expression. When I asked why, he explained that he didn't know regular expressions, but more importantly that he felt that he had read a lot of posts on Stack Overflow discouraging use of regular expressions. From what he had read, he felt that it was better practice to avoid regular expressions. Although this could be anecdotal, there may be a real danger here that inexperienced programmers are getting the wrong message, that regular expressions are somehow bad in most situations and not worth learning.
That is an implementation-specific benchmark. A grungy real-world regexp engine (such as Perl's) usually will recognize important special cases and substitute in faster code for them.
The classic example is to recognize that you're looking for a fixed string, and substitute in Boyer Moore. But prefix/suffix recognition are two other common examples.
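A toy version of the fixed-string special case, to make the idea concrete. This is not how Perl's engine does it: the `is_literal` and `find_first` helpers are made up for illustration, and `str.find` stands in for whatever substring search the runtime provides, which is not necessarily Boyer-Moore.

```python
import re

# Characters with special meaning in Python's re syntax.
_META = set(".^$*+?{}[]\\|()")

def is_literal(pattern):
    """True if the pattern has no metacharacters, so it can only match itself."""
    return not any(ch in _META for ch in pattern)

def find_first(pattern, text):
    """Index of the first match, using plain substring search when possible."""
    if is_literal(pattern):
        return text.find(pattern)  # fast path: no regex machinery involved
    m = re.search(pattern, text)
    return m.start() if m else -1
```

Both paths return the same index for the same input, so the caller cannot tell which one ran. That is the property a real engine needs before it can swap in faster code.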
Last time I checked, any time I needed the power and flexibility of a regular expression, getting the job done was over an order of magnitude more important than saving some milliseconds of processing time.
Absolutely. But the original version linked was twice as slow as need be.
For trivial replacements, I find string manipulation is faster and safer (fewer bugs). But there is some threshold of complexity beyond which regular expressions are both more performant and safer.
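A small illustration of where that threshold might sit, with made-up sample strings. The first replacement is a fixed string, so a plain string method does the job. The second is pattern-shaped, and a regular expression is the shorter and less error-prone option.

```python
import re

path = "C:\\temp\\report.txt"

# Trivial, fixed replacement: a plain string method, nothing to escape.
simple = path.replace("\\", "/")

# Pattern-shaped replacement: collapse any run of whitespace to one space.
messy = "too    many \t spaces\nhere"
collapsed = re.sub(r"\s+", " ", messy)

print(simple)     # C:/temp/report.txt
print(collapsed)  # too many spaces here
```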
Using Golang as an example of real world regex performance is borderline dishonest. Their regex engine is notoriously unoptimized and is not intended to be a strong point of the language.
Thanks for clarifying. I guess my point was that if you're just matching one email address in a form submission for example, is performance significant?
No, a few µs vs a few ns when processing your web form won't be significant. Don't shy away from regular expressions, but be aware of their performance and readability impact.
The problem is when developers that don't know any better build parsers with regular expressions. That's almost always a bad idea.
As a general rule I ask people to avoid using non-trivial regular expressions. The grammar is too tricky and often the expression doesn't mean what the developer intends it to mean. Or the next developer will make a mistake.
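Two small Python examples of an expression not meaning what was probably intended. The patterns and inputs are invented for illustration.

```python
import re

# Intended: match exactly "cat" or exactly "dog".
loose = re.compile(r"^cat|dog$")       # actually: starts with "cat" OR ends with "dog"
strict = re.compile(r"^(?:cat|dog)$")  # grouping makes the anchors apply to both

# Intended: match the literal address 192.168.1.1.
dotty = re.compile(r"192.168.1.1")     # an unescaped "." matches any character
exact = re.compile(r"192\.168\.1\.1")

print(bool(loose.search("catalog")))      # True
print(bool(strict.search("catalog")))     # False
print(bool(dotty.search("192x168y1z1")))  # True
print(bool(exact.search("192x168y1z1")))  # False
```

Both loose patterns pass the obvious positive tests, which is why this kind of mistake tends to survive until unexpected input shows up.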
My current pet peeve is with parser combinators, which seem a good compromise (it's not a magic wand) between maintenance (whereas external parser generators don't blend well in your code), parsing what you think you are parsing (more so when your grammar was defined with rules in a reference document), and integrating the parser with your code.
Does anybody know of a good Perl or Python library that will use a regex (with constraints on the repetition operators) and generate an exhaustive list of matching strings (instead of generating a random list)?
I think this would be helpful in many cases in getting people to understand how regexes work. I've seen lots of cases where toolsets designed to help people build regexes end up with them confused when their regex also matches other stuff beyond their test strings.
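Not the library being asked for, but a brute-force sketch of the idea in Python: try every string over a small alphabet up to a length bound and keep the ones the pattern matches in full. The function name, alphabet, and bound are assumptions, and the cost grows exponentially with the length, so this is only useful for tiny cases.

```python
import itertools
import re

def matching_strings(pattern, alphabet, max_len):
    """Yield every string over `alphabet`, up to `max_len` characters,
    that the pattern matches in full. Cost grows as len(alphabet) ** max_len."""
    compiled = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                yield candidate

print(list(matching_strings(r"a{1,2}b?", "ab", 3)))
# ['a', 'aa', 'ab', 'aab']
```

Swapping `fullmatch` for `search` lists everything the pattern matches anywhere in the string, which shows how much an unanchored regex lets through beyond the intended test strings.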
OT: Where does this seemingly odd and new habit of spacing inside parentheses come from? I always write "(a, b)", mostly because it is closer to English (or other languages) typography, and it seems to have good readability, plus it is, I believe, the standard in most languages. So why write "( a, b )"?
By the way, if some like spacing that much, and if the reason is to have a better mouse-selectability, then I humbly propose "( a , b )".
Basically a great online tool for testing your regular expressions and stepping through what is actually happening. As soon as you get non-trivial, it's a Godsend.
Sincere question - is it worth investing time into reading a 500 odd page book for something that I might not use that frequently in my career? From my experience, I've seen that I can get away by just Googling or just experimenting whenever I'm stuck on a regex.
The book doesn't just teach you regex, but the why, how AND the dialects. It gives you an overview over different tools and programming languages and their regex-related functions and methods.
On top, it contains a ton of examples, is very well written (considering the insanely dry and difficult to typeset subject :) and is very polished (I think it's in the 3rd edition by now..)
If you just google or experiment on regex, you usually get bad regex, badly crafted regex, brittle regex and make every single mistake the book prevents you from making.
It's really one of the most worthwhile books to read through - it's also an excellent handbook to look things up.
Remember that a lot of commandline tools take in regex too - grep, sed, awk, you name it - it's not just for use in programming languages.
Your favorite editor has regex too.
I simply don't know how people can live without it; I'm using regex practically every day.
P.S.: And _after_ reading the book, you will understand why people yell at you when you parse HTML with regex but you will know how to do it anyways and at least not completely badly. ;)
Well, maybe not an entire 500 page book, though I'm sure it wouldn't hurt. You could always try Zed Shaw's the-hard-way book on the subject: http://regex.learncodethehardway.org/book/ (haven't read it but the the-hard-way books seem to be fairly well regarded)
I do take issue with your suggestion that you might not use this stuff all that frequently in your career. This is definitely at odds with my experience. Even though I don't use them that much in final-quality code, I use them all the time from the text editor, and quite often for quick one-off text manipulation or extraction scripts. Having a quick way to extract text from ad-hoc data can quickly get you a rough answer to a speculative question, the text equivalent of a back-of-the-envelope calculation, without needing to do a lot of work and without needing the question to justify a lot of work.
But I mainly use them for searching for one of two or three different strings in the text editor.
You don't need to read 500 pages to understand the core of regex. "The core" means "what you will use 99% of the time". You need 11 minutes: http://www.youtube.com/watch?v=hwDhO1GLb_4
I guess your question is why read a book when you can just learn as you go and as you need to. I didn't read a book, but I probably should have because it took a long time for me to pick up things that would have helped a ton earlier on. For example, I recently learned that you can turn off "greedy" when using .* by adding a ? after it. This was a huge revelation that I would have benefited from day one, ten years prior.
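For anyone who hasn't run into it yet, a minimal Python illustration of the greedy versus non-greedy difference. The sample string is made up.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: .* grabs as much as it can, so it runs to the LAST </b>.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? grabs as little as it can, so it stops at the NEXT </b>.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```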
Yes, a hundred times yes. Even rudimentary web coding involves regexps somehow. But the real benefit is mastering a topic that is complex and the sense of accomplishment and competency it brings you.
Stop propagating bad interfaces like 'regular expressions' damn it!
An interface that e.g. makes me 'escape' half of my input because its designers think their special use of characters must take precedence over all user input is a bad interface.
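For what it's worth, when the input is meant literally, many regex libraries provide a helper that does the escaping for you. A small sketch with Python's `re.escape`; the sample strings are invented.

```python
import re

user_input = "price (USD) is $5.00+"
text = "The price (USD) is $5.00+ tax"

# Used raw, the parentheses, "$", "." and "+" are all read as regex syntax.
raw_match = re.search(user_input, text)

# re.escape backslash-escapes the special characters so the input matches itself.
literal_match = re.search(re.escape(user_input), text)

print(raw_match)                  # None
print(literal_match is not None)  # True
```

This doesn't remove the complaint about the interface, but it does mean the escaping never has to be done by hand for user-supplied text.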
Great question. I learned them through osmosis over many years of looking at them in other people's code and tinkering myself. I don't remember ever going through a tutorial or reading a book. Probably not the best way to learn them as it definitely took a long time to have a good grip on them and I was missing important pieces for a long time. For example, it was only relatively recently that I learned that you can turn off "greedy" when using .* by adding a ? after it.
If I recall correctly I learnt the basics from MSDN documentation and later more thoroughly when I first came across Perl. Either of these are pretty decent choices, too. :)
I find picking an actual context is really important to keep you motivated as learning about regexps can be tedious and prone to error. Probably the most fun regexp work I've done has been with Apache's mod_rewrite - using .htaccess files of course - because it's a great feeling to master something that useful that previously just seemed like magic.
This one is incomplete but gave me a really good foundation that makes it easy to fill in gaps when I need to, and most importantly gets you that "thinking in regex" result mentioned above: http://regex.learncodethehardway.org/book/