Hacker News new | past | comments | ask | show | jobs | submit login

Work on a new implementation might be a good idea. :-)

The last months I've used the PCRE regexp library, which afaik is the one used in most implementations wanting Perl compatibility. I was a bit shocked... I thought it was good?

I ran into a couple of cases with bugs (one regexp couldn't be ported from Perl and another over complex one from a newbie hung(!) the process). It seems inefficient with backtracking, or? (Maybe this is fixed now?)




You might be interested in this: http://swtch.com/%7Ersc/regexp/regexp1.html


A bit like complaining about a specific compiler and getting a reference to papers on optimisations. :-) That said, it looks interesting; will read after lunch.


Even though this implementation is short, don't think it is efficient.

Matching regular expressions can be done in linear time.

This presented algorithm takes exponential time.


> Matching regular expressions can be done in linear time.

Unfortunately, the expression "regular expression" has been so abused in the context of programming that, unless otherwise specified, we cannot expect it to correspond to the well-defined regular expression that is used in formal language theory.

For example, statements such as "you can use regular expressions to parse HTML" and "regexps aren't regular" (mentioned in this thread) are now acceptable. I wish people would at least say "extended regular expressions", but it's a lost cause.


My "point" was a joke about PCRE, the present regexp engine in Python and most other open source languages.

The answer I got (in addition to a couple of good introductory lectures I should send to people :-) ) seems to be that is sucks. :-(


I don't think anyone has ever accused PCRE of being particularly high-quality, but it's usually "good enough".

Not being able to use a regex that works in Perl is entirely expected, and probably not a bug -- PCRE is not an exact duplicate of the Perl regex engine, it's an independent reimplementation of a substantial subset of Perl's regex functionality. Amongst other things, it intentionally ignores some features that only really make sense in Perl.

Hanging a process with an ill-advised regex is also to be expected in almost all regex engines, and again, is probably not a bug. It's quite easy to accidentally create a regex that will not terminate during the expected lifetime of the universe, if ever.

Edit: By the way, this is why you should never, ever use untrusted input as a regular expression. This still pops up once in a while, with some poor soul discovering the hard way that it's basically allowing arbitrary code execution.


> This still pops up once in a while, with some poor soul discovering the hard way that it's basically allowing arbitrary code execution.

Actually, wouldn't that just be a DoS vulnerability? The risk is that someone can request a command that will either take too long or use too much memory, but they can't do anything besides pure computation unless there are other bugs (e.g. buffer overrun) in the regex parser.


There are other bugs in the engine. PCRE has had several, I have no reason to believe any other engine is not similarly vulnerable. In a security context, a regex engine is best viewed as a virtual machine. Most virtual machines are not secure in the face of hostile code.


Do you have a ref for that arbitrary code execution claim?!

Yes pathological cases are possible, I never claimed differently. These two cases weren't that complex and worked in Perl. And I found them in just a few dozen attempts... :-( I am scared thinking about my coming work year; I'll go away and update my cv now.


I'm not sure what kind of reference you want. Regular expressions are code. They're a DSL, and in some ways, the full Perl regex engine is disturbingly close to turing complete, even before you notice that the full Perl regex engine explicitly lets you embed arbitrary Perl code in a regex (PCRE does not permit this, obviously).

And it's not about complexity. This is not complex:

    while (1) {}
Yet it will hang your process.

This, pulled from Stroustrup's FAQ, is also not complex:

    void f();	/* argument types not mentioned */
    void g()
    {
        f(2);	/* poor style C. Not C++ */
    }
Yet it will work in C, but not C++.

The full expansion of PCRE's name is misleading. Don't let yourself be fooled by it. If it helps, think "Perl-like regular expressions" instead of "Perl-compatible".


To avoid you the trouble of writing a third introductory paragraph, I have done the automata theory course. I have also heard about halting problem. etc, etc :-)

Perl don't allow code execution blindly in regexps.

I have just had too much problems with simple stuff in PCRE.

My take away is to not use languages with PCRE for serious regexp work. Damn. I had hoped later versions were better.


> Perl don't allow code execution blindly in regexps.

Blindly? No. That e (or ee... ) at the end of your match will stare you in the face as if to ask, "What the hell were you thinking when you thought this was a good idea?"

(I say this and yet I love Perl.)


[Edit: I have only used /e when I really meant it. OK, sometime, I might have gotten that code from a config file... :-)]

An O/S or language can only have so many levels of security.

The same goes for using input strings for "/e" as when you explicitly write use re 'eval' and use input strings in regexps. Or use input strings as shell commands, for that matter...

(Am I being trolled? I get lots of beginner lectures and strange side issues of insignificant problems. Should I just stop criticizing PCRE? :-) )


Your "criticism" of PCRE doesn't sound like real criticism, it sounds like a confused and frustrated beginner, which is why you're getting beginner lectures.

PCRE is taking an extremely powerful and completely non-standard DSL out of the context of a parent language to which it is inextricably intertwined and reimplementing it for general-purpose use in other languages.

Between that, and the long history of outwardly-bizarre compatibility problems between different regex engines, even those that are supposed to be compatible, you should not have expected the level of compatibility you did.

PCRE's quality is essentially orthogonal to these issues.


>>PCRE's quality is essentially orthogonal to these issues.

That seems like... conscious misrepresentation.

If it was just compatibility, I'd learn the new syntax.

The problem is quality. Specifically, I found two relatively simple regexps that doesn't work. Already.

I guess I should try to avoid the PCRE lib for serious work, which is hard. :-(

Edit: I could have been just unlucky, but...

(I'm not going to start trading insults, if that is what you want.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: