Python's Hidden Regular Expression Gems (pocoo.org)
276 points by temp on Nov 20, 2015 | 47 comments



Python's re has nothing on CL-PPCRE [1] though. The ability to build up a "regular expression" from S-expressions is just too useful.

[1] http://weitz.de/cl-ppcre/


It was also twice as fast as Perl in benchmarks at one point or another.


That's not exactly hard to do, depending on features. There's a definite trade-off between features and the type of regex engine that can be implemented.


It seems to me that one should be able to detect which features are actually used in a regular expression and choose one of multiple different underlying implementations based on that.
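
Something like this naive dispatcher sketch (entirely hypothetical; `pick_engine` and `NEEDS_BACKTRACKING` are made-up names, and a real implementation would walk the parsed pattern, since a textual scan like this misses escaped parens and character classes):

    import re

    # Backreferences and lookaround force a backtracking engine;
    # everything else could go to a DFA-based one.
    NEEDS_BACKTRACKING = re.compile(r'\\[1-9]|\(\?<?[=!]')

    def pick_engine(pattern):
        if NEEDS_BACKTRACKING.search(pattern):
            return 'backtracking'  # stdlib-re style
        return 'dfa'               # re2 style, linear time

    print(pick_engine(r'(\w+) \1'))     # backtracking
    print(pick_engine(r'\d{4}-\d{2}'))  # dfa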


Newer versions of Perl actually support a pluggable regex engine system, so you can use specific regex engines for specific tasks.


If you hook in libpcre directly, it actually comes with its own JIT these days. http://sljit.sourceforge.net/pcre.html


Having the compiler available at runtime kinda helps too!


For example, below is a CL version of "Enter The Scanner". Instead of a closure, we could instantiate a class, but this is sufficient:

   (defun create-tokenizer (rules)
     (loop for (token rule) in rules
           for regex = (if (stringp rule) `(:regex ,rule) rule)
           collect `(:register ,regex) into alternatives
           collect token into tokens
           finally
              (let ((scanner (ppcre:create-scanner `(:alternation ,@alternatives)))
                    (tokens (coerce tokens 'vector)))
                (return
                  (lambda (string &key (start 0))
                    ;; generator-like
                    (lambda ()
                      (multiple-value-bind (match-start match-end registers)
                          (ppcre:scan scanner string :start start)
                        (cond
                          (match-start
                           (setf start match-end)
                           (values
                            (aref tokens (position-if-not #'null registers))
                            match-start
                            match-end))
                          (t (values))))))))))

The above compiles the regex and returns a closure which accepts a string. That closure returns another closure which generates tokens on-demand for the given string, along with the start and end positions of the token inside the string. Here is a test:

   (loop with tokenizer = (funcall (create-tokenizer '((ws "\\s+")
                                                       (plus #\+)
                                                       (minus #\-)
                                                       (mult #\*)
                                                       (div #\/)
                                                       (num :digit-class)
                                                       (par-open #\()
                                                       (par-close #\))))
                                   "(1 + 2) * 3")
         for token = (multiple-value-list (funcall tokenizer))
         while token
         collect token)

   => ((par-open 0 1) (num 1 2) (ws 2 3) (plus 3 4) (ws 4 5) (num 5 6)
       (par-close 6 7) (ws 7 8) (mult 8 9) (ws 9 10) (num 10 11))


Note also that the problem exposed in the article about mixing register groups with other register groups in each rule can probably be easily overcome with named registers in CL-PPCRE. See http://weitz.de/cl-ppcre/#*allow-named-registers*.


Another very cool undocumented feature in another regex engine is re2's Set. It compiles a collection of regexes into a single combined pattern, which lets you very efficiently match a string against an array of patterns.
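
The same idea can be emulated with the stdlib by folding the patterns into one alternation of named groups (a rough sketch of the concept, not re2's actual API; it loses re2's linear-time guarantee and reports only the first hit rather than the full set of matching patterns):

    import re

    patterns = [r'foo\d+', r'ba[rz]', r'qux']
    combined = re.compile('|'.join(
        '(?P<p%d>%s)' % (i, p) for i, p in enumerate(patterns)))

    m = combined.search('xx bar yy')
    if m:
        # lastgroup names the alternative that matched
        print(patterns[int(m.lastgroup[1:])])  # ba[rz]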


Which re2? I maintain a fork of re2 but it's not in there [1].

As for re2, its main cool feature is that it is efficient, matching in linear time using a DFA. Unfortunately, Unicode strings need to be encoded to UTF-8, but if you can design your application to work with UTF-8 bytestrings you can avoid that cost.

[1] http://github.com/andreasvc/pyre2



Neat. I should consider wrapping that.


Another package that does this is re2c [1], which is used by SpamAssassin to compile its rulesets so it can scan messages faster.

[1] http://re2c.org


I wonder what the reason is to include code in a release without documenting it. Maybe this article can form the basis for finally documenting this feature?

There's also the reverse with Python: useful code in the documentation not included in the standard library.


This happens pretty often in the Python world. There's a bit of an unwritten rule to leave implementation details public that would be private in other languages. Many libraries simply don't bother with prefixing privates with '_' and just leave things undocumented that you probably shouldn't touch/use.

One notable example is that importing libraries in your code automatically exposes them to the caller.

    $ cat lib.py
    import re
    
    def somefunc():
        pass

    $ python
    >>> import lib
    >>> lib.re
    <module 're' from '/usr/lib/python2.7/re.pyc'>
Many packages also do `import *` from files, which pollutes the package namespace with all kinds of stuff you really don't want in there. For example, the popular Requests package:

    >>> import requests
    >>> requests.logging
    <module 'logging' from '/usr/lib/python2.7/logging/__init__.pyc'>
The logging module is not a public part of requests' API. It's just there because requests uses it internally.

So to answer your question, I'd say it's just common practice. If it's undocumented in Python, you should pretend it doesn't exist.


Importing libraries from other libraries like this is very useful:

    $ python
    >>> import lib
    >>> lib.re
That's how __init__.py files work, and it's part of what makes Python awesome. Using `import *` is very bad practice in modules (except in very specific cases) because it brings in a bunch of crap you don't want and didn't expect. Modules should define an '__all__' list of the 'public things' you want to export, but restricting access is very anti-Python, as we're all consenting adults.
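
For what it's worth, '__all__' only curates star-imports (and tools like help()); the names themselves stay reachable as attributes. A quick sketch, reusing the lib.py example from above:

    $ cat lib.py
    import re

    __all__ = ['somefunc']  # curates `from lib import *`

    def somefunc():
        pass

    $ python
    >>> from lib import *
    >>> somefunc
    <function somefunc at 0x...>
    >>> re
    Traceback (most recent call last):
      ...
    NameError: name 're' is not defined
    >>> import lib
    >>> lib.re  # still reachable as an attribute
    <module 're' from '/usr/lib/python2.7/re.pyc'>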

> If it's undocumented in Python, you should pretend it doesn't exist.
I don't agree. It's fairly common to dive into third-party packages' code to see what's going on and to use 'undocumented' things (which is mostly because the documentation is bad rather than things being deliberately hidden away). Just look at the Django `_meta` API, which people relied on because it was the only place you could get some specific model information in a stable way, despite being undocumented and private. Now it's been formalized into a proper API.

Python's extensive use of duck typing also makes it a lot easier to work with undocumented stuff: you can make some wide-ranging changes to internals (changing types completely, turning properties into functions), and as long as it quacks roughly the same, nothing breaks.
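
For instance (a toy sketch with made-up classes, not any real library): a library can swap a plain attribute for a property between releases, and code poking at it keeps working:

    # v1 of some hypothetical library stores the attribute directly...
    class ResponseV1(object):
        def __init__(self, data):
            self.text = data

    # ...v2 computes it lazily via a property
    class ResponseV2(object):
        def __init__(self, data):
            self._data = data

        @property
        def text(self):
            return self._data.decode('utf-8')

    # callers are written identically against both versions
    for r in (ResponseV1(u'hi'), ResponseV2(b'hi')):
        print(r.text)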


What is useful about 'import lib; lib.re'? I cannot think of a situation in which a direct import wouldn't be better, and that has many obvious advantages. The __init__.py is a special case of course, but you would only use that for a package's own modules.

> It's fairly common to dive into 3rd party packages code to see what's occurring and to use 'undocumented' things

It may be common, but that doesn't convince me it's a good idea. It seems to me that it would be better if the language forced you to design the public API properly than to have users resort to undocumented/private APIs.


> What is useful about 'import lib; lib.re'?

Nothing, but it's a side effect of an awesome feature of Python: nothing being private, which is incredibly useful. 'lib.re' is exactly the same case as 'lib.actual_library_function'; why should Python add the ability to somehow stop these from being included? It would increase complexity for no gain.


You're just repeating that it is awesome and useful without saying why. I think distinguishing public & private variables offers more support for structured programming and is therefore desirable.


> You're just repeating that it is awesome and useful without saying why.

Sorry, I thought you were asking why you are able to import other modules imports.

> I think distinguishing public & private variables offers more support for structured programming and is therefore desirable.

You can prefix attributes and functions with a single underscore to mark them as private, or a double underscore to make them more private (the attribute name gets mangled).
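
Concretely (a toy class to show both conventions):

    class Engine(object):
        def __init__(self):
            self._cache = {}   # single underscore: private by convention
            self.__state = 0   # double underscore: name gets mangled

    e = Engine()
    print(e._cache)          # {} - accessible; the underscore is a warning sign
    print(e._Engine__state)  # 0  - mangling renames, it doesn't hide
    # e.__state would raise AttributeError: obfuscation, not enforcement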

Anyway, Python doesn't have an enforced notion of privateness because it's a bad idea. By marking something as private you're saying: "I, as a developer sat here writing this, know better than all of the users of my library. Their lives may depend on using something I haven't exposed properly in my API, but too bad. I know best."

So you end up jumping through ridiculous hoops to access private properties (because even in languages with private, nothing is truly private), all because some guy thought he knows best a long time ago while writing the library you are using.

So a better approach (IMO) is to mark something as private with convention (a prefixed underscore), which means "this is private, don't depend on it", without restricting your access. You can drive a car, have sex, pay taxes, but not access a private variable? Bleugh.

That's more of a cultural thing though; I'm sure enforced privacy makes more sense in statically typed, compiled languages with lots of classes (and even then I would argue it is still bad for the reasons above), and matters more in huge codebases.


How can the language force good design?

Like how would a language that uses explicit exports stop someone exporting everything?

I agree that accessing libraries indirectly probably isn't useful, but I think being able to do dir(lib) and see the namespace that is in use is a good thing (at least in the context of Python).


The language cannot force it but can encourage it. Having to make the choice between private and public is worthwhile I think.

> being able to do dir(lib) and see the namespace that is in use is a good thing.

It is, but it is most useful when what you get is a curated list of members (__all__) intended to be public.


> unwritten rule

I don't know if I'd call this (common) practice an unwritten rule in the Python world.

Rather, there's no real notion of visibility, so the only thing you can do to make something 'look private' is name it obscurely (i.e. the underscore prefix, as you mentioned) and leave it undocumented.

It's perhaps my least favorite part of Python module semantics.


Imho, (good) Python code is so easy to read and inspect that it can be better than documentation. Documentation is less exact. I often find myself reading the libraries that I use to see what they really do. Once you get into the habit of that, documentation is only useful as a starting point, and you can discover a lot more useful functionality. (You also get a good feel for the quality of the library, and learn a lot in terms of patterns.)

Of course, this habit comes with the risk that you might have to do more maintenance on your code on library updates.

As documentation takes effort to write and constitutes a sort of contract with your users, it makes sense for library writers to only document the truly essential.


I don't think this position of relying on code instead of documentation is tenable for non-trivial code. The effort to understand code is generally much greater than that of maintaining documentation of what a function does, the meaning of its arguments, and any assumptions made. Documentation can (and should) have a much higher signal-to-noise ratio.


Maybe this sort of thing will be tackled by the new Stack Overflow documentation thing.

This link may do nothing if you are not part of the beta! http://docs-beta.stackexchange.com/documentation


> I wonder what the reason is to include code in a release without documenting it.

Presumably it's considered an implementation detail and the author didn't realize it was useful for anything else.


The author knows that it's useful. See http://effbot.org/zone/xml-scanner.htm where the author uses the scanner:

> The 2.0 engine provides another (undocumented) feature that can be used to optimize this even further. The scanner method is used to create a scanner object and attach it to a string.
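
In use it looks something like this (a minimal sketch of that undocumented method):

    import re

    scanner = re.compile(r'\d+').scanner('a 12 b 34')
    print(scanner.search().group())  # 12 - each call resumes where the last stopped
    print(scanner.search().group())  # 34
    print(scanner.search())          # None - input exhausted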

See http://bugs.python.org/issue5337 where the Python developers wondered if it should be documented.

There's a 10-year-old notice at https://mail.python.org/pipermail/patches/2006-November/0210... saying that the code would crash if the scanner was called from multiple threads.

There's no doubt much more information about this - the above was all I cared to find out.


I did a double take upon seeing "The regex module in Python is really old by now" and had to make sure it's talking about the current re module! re was introduced alongside the older regex module around Python 1.5; the latter was finally removed in Python 2.5.


Additionally there's a newer module also named 'regex': https://pypi.python.org/pypi/regex


And this newer 'regex' is actually really good at tricky cases such as matching word boundaries. (An apostrophe or a non-ASCII character is not necessarily a word boundary!)
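
For example, regex's WORD flag switches \b to Unicode default word boundaries, which treat an apostrophe between letters as word-internal (a small sketch; regex is a third-party install):

    import re
    import regex  # pip install regex

    text = "don't"
    # stdlib re: the apostrophe always splits the word
    print([m.start() for m in re.finditer(r'\b', text)])
    # -> [0, 3, 4, 5]

    # regex with the WORD flag: "don't" stays one word
    print([m.start() for m in regex.finditer(r'(?w)\b', text)])
    # -> [0, 5]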


> An apostrophe or a non-ASCII character is not necessarily a word boundary!

I don't see how a regular expression library could help with that (other than proper Unicode support), because word boundaries are a language-specific, linguistic problem; i.e., you will need to supply a list of possible contractions anyway.

Tokenization of natural language text may appear like a straightforward and solved problem, but there are actually lots of messy details to get right.


Really interesting read. To be honest, I got a little lost around the Scanner implementation portion. Guess it's time to take that example and play around with it myself. My one suggestion would be to walk through an example of how the Scanner works to demonstrate your point.

Thanks for the great post.


I made a larger class in a github repo and an example of what you can do with it here: https://github.com/mitsuhiko/python-regex-scanner/blob/maste...
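
For anyone who wants to play with it before diving into the repo, the stdlib's undocumented re.Scanner class can be exercised in a few lines (a minimal sketch):

    import re

    scanner = re.Scanner([
        (r'\d+',      lambda s, tok: ('NUM', tok)),
        (r'[-+*/()]', lambda s, tok: ('OP', tok)),
        (r'\s+',      None),  # None: skip the token entirely
    ])
    tokens, remainder = scanner.scan('(1 + 2) * 3')
    print(tokens)     # [('OP', '('), ('NUM', '1'), ('OP', '+'), ...]
    print(remainder)  # '' - empty means the whole string tokenized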


There is nothing unique about Python's implementation of regular expressions. Ruby has an equally (if not more) powerful `StringScanner`. This is a nice article but it could have done without the "it's one of the best of all dynamic languages I would argue" tone.


> Ruby has an equally (if not more) powerful `StringScanner`

StringScanner just matches individual expressions step by step. Because the expressions are not folded together, the "skip non-matching" part has to be done in Ruby code, which is precisely what the Python scanner avoids.


[deleted]


Personality cult? Perl programmers can only shake their heads at the post, since Perl's regexp implementation is much more powerful and properly documented.


I was going to mention Perl, but I decided it goes without saying :)


Modest and good perl programmer spotted.


@lazyjones reminded me of this: https://xkcd.com/1090/


Lately there was a comparison of the JavaScript, Perl, and Python regex machinery posted here, IIRC (I can't find it right now), and the external regex module was found to be much better regarding Unicode support.


(How could non-3 Python have good Unicode support? :-) )

The default Python regexp engine gets into a tailspin on less well-formulated regexps that worked fine with both PCRE and Perl 5. (I wrote an application (a specialized query language) a few years ago where programmers entered regexps. Sometimes execution just hung. This was on 2.6 and early 2.7.)
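
The classic trigger is nested quantifiers on an almost-matching input (a generic example, not the actual query patterns):

    import re

    # Exponential backtracking: each extra 'a' roughly doubles the
    # time stdlib re spends before concluding there is no match.
    pat = re.compile(r'(a+)+$')
    pat.match('a' * 25 + 'b')  # appears to hang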

I haven't tried the external regexp module, but hope it is better.


Amazing! This Scanner object seems so beautifully Pythonic and intuitive. It's a shame it's not documented, honestly: it would be a great way for beginners in Python to write language parsers.


I believe there is a small bug in the final example: in the tokenize definition it references 'self' where I believe it should reference 'scanner'.


> it's one of the best of all dynamic languages I would argue

The author should learn about PCRE. I wrote a Python wrapper for it that includes a drop-in 're' substitute: https://github.com/stefantalpalaru/morelia-pcre




