> Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".
Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.
Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
Even disregarding whether end-of-string is also an end-of-line (see all the other comments below), $ doesn't match the newline itself: like other zero-width assertions such as \b, it matches a position, so the newline wouldn't be included in the matched text either way.
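A quick Python check of the zero-width point (assuming Python's default semantics, where $ also matches before a final newline):

```python
import re

# $ asserts a position; the newline is never part of the matched text
m = re.search(r"cat$", "cat\n")
print(repr(m.group()))  # 'cat' -- no trailing '\n' in the match
```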
Problem is, plenty of software doesn't actually look at the match but rather just validates that there was a match (and then continues to use the input to that match).
It's kind of driving me nuts that the article says ^ is "start of string" when it's actually "start of line", just like $ is "end of line". \A is apparently "start of string" like \Z is "end of string".
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
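A minimal sketch of that flag's effect in Python's re module:

```python
import re

s = "one\ntwo"
print(re.findall(r"\w+$", s))                # ['two']: $ only at the very end
print(re.findall(r"\w+$", s, re.MULTILINE))  # ['one', 'two']: $ at each line end
```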
In single-line [2] mode, the line starts at the start of the string and ends either at the end of the string, if there is no terminating newline, or just before the final newline, if there is one.
In multi-line mode, a new line starts at the start of the string and after each newline, and ends before each newline or at the end of the string if the last line has no terminating newline.
The confusion is that people think they are in string mode when they are not in multi-line mode, but they are not: they are in single-line mode. ^ and $ still use the semantics of lines, and a terminating newline, if present, is still not part of the content of the line.
With \n\n\n in single-line mode the non-greedy ^(\n+?)$ will capture only two of the newlines, the third one will be eaten by the $. If you make it greedy ^(\n+)$ will capture all three newlines. So arguably the implementations that do not match cat\n with cat$ are the broken ones.
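This is easy to check in Python, whose default mode has the same single-line semantics:

```python
import re

s = "\n\n\n"
lazy = re.match(r"^(\n+?)$", s)
greedy = re.match(r"^(\n+)$", s)
print(len(lazy.group(1)))    # 2: $ matches just before the final newline
print(len(greedy.group(1)))  # 3: the greedy + takes all three, $ matches at the end
```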
The POSIX definition of a line is a sequence of non-newline characters - possibly zero - followed by a newline. Everything that does not end with a newline is not a [complete] line. So strictly speaking it would even be correct that cat$ does not match cat, because there is no terminating newline; it should only match cat\n. But as lines missing a terminating newline are a thing, it seems reasonable to be less strict.
Python violates that definition, however, by allowing internal newlines in strings. For example, /^c[^a]t$/ matches "c\nt\n", but according to POSIX that's not a line.
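Quickly verified in Python:

```python
import re

# [^a] happily matches the internal newline, and $ tolerates the trailing one
print(bool(re.match(r"^c[^a]t$", "c\nt\n")))  # True
```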
I suspect the real reason for Python's behavior starts with the early decision to include the terminating newline in the string returned by IOBase.readline().
Python's peculiar choice has some minor advantages: you can distinguish between files that do and don't end with a terminating newline (the latter are invalid according to POSIX, but common in practice, especially on Windows), and you can reconstruct the original file by simply concatenating the line strings, which is occasionally useful.
The downside of this choice is that as a caller you have to deal with strings that may-or-may-not contain a terminating newline character, which is annoying (I often end up calling rstrip() or strip() on every line returned by readline(), just to get rid of the newlines; read().splitlines() is an option too if you don't mind reading the entire file into memory upfront).
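A minimal sketch of both options (the file name is hypothetical):

```python
# Option 1: strip the maybe-present terminator from each line as you read it
with open("input.txt") as f:        # hypothetical file
    for line in f:
        text = line.rstrip("\n")    # safe whether or not the '\n' is there
        print(repr(text))

# Option 2: read everything up front and let splitlines() eat the terminators
with open("input.txt") as f:
    for text in f.read().splitlines():
        print(repr(text))
```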
My guess is that Python's behavior is just a hack to make re.match() easier to use with readline(), rather than based on any principled belief about what lines are.
Python's behavior is not a hack; it is the common behavior. $ matches at the end of the string or before the last character if that is a newline, which is logically the same as the end of a single line. But as you said, you can have additional newlines inside the string, which is also common behavior and not specific to Python. Personally, I think of it like this: you assume that the string is a single line and match $ accordingly, either at the end of the string or before a terminating newline. If there are additional newlines, you treat them mostly as normal characters, with the exception that dot does not match newlines unless you set the single-line/dot-all flag.
The very post we're commenting on shows that that's not true: PHP, Python, Java and .NET (C#) share one behavior (accept "\n" as "$"), and ECMAScript (Javascript), Golang, and Rust share another behavior (do not accept "\n" as $).
Let's not argue about which is “the most common”; all of these languages are sufficiently common to say that there is no single common behavior.
> $ matches at the end of the string or before the last character if that is a newline, which is logically the same as the end of a single line.
Yes, that is Python's behavior (and PHP's, Java's, etc.). You're just describing it, not motivating why it has to work that way or why it's more correct than the obvious alternative of only matching the end of the string.
Subjectively, I find it odd that /^cat$/ matches not just the obvious string "cat" but also the string "cat\n". And I think historically, it didn't. I tried several common tools that predate Python:
- awk 'BEGIN { print ("cat\n" ~ /^cat$/) }' prints 0
- in GNU ed, /^M/ does not match any lines
- in vim, /^M/ does not match any lines
- sed -n '/\n/p' does not print any lines
- grep -P '\n' does not match any lines
- (I wanted to try `grep -E` too but I don't know how to escape a newline)
- perl -e 'print ("cat\n" =~ /^cat$/)' prints 1
So the consensus seems to be that the classic UNIX line-based tools match the regex against the line excluding the newline terminator (which makes sense since it isn't part of the content of that line) and therefore $ only needs to match the end of the string.
The odd one out is Perl: it seems to have introduced the idea that $ can match a newline at the end of the string, probably for similar reasons as Python. All of this suggests to me that allowing $ to match both "\n" and "" at the end of the string was a hack designed to make it easier to deal with strings without control characters and strings that end with a single newline.
> So the consensus seems to be that the classic UNIX line-based tools match the regex against the line excluding the newline terminator (which makes sense since it isn't part of the content of that line) and therefore $ only needs to match the end of the string.
If you read a line, you usually remove the newline at the end, but you could also keep it, as Python does. If you remove the newline, then a line can never contain a newline; the case cat\n can never occur. If you keep the newline, there will be exactly one newline, as the last character, and you arguably want cat$ to match cat\n because that newline is the end of the line but not part of the content. It makes perfect sense that $ matches at the end of the string or before a newline as the last character, since it will do the right thing whether or not you strip the newline.
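In Python, for example, the same pattern handles the stripped and unstripped variants, while an extra newline still blocks the match:

```python
import re

for s in ("cat", "cat\n", "cat\n\n"):
    print(repr(s), bool(re.search(r"cat$", s)))
# 'cat'      True   (end of string)
# 'cat\n'    True   ($ matches before the terminating newline)
# 'cat\n\n'  False  (the first newline is ordinary content here)
```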
If you want cat$ to not match cat\n, then you are obviously not dealing with lines: you have a string with a newline at the end, but you consider this newline part of the content instead of terminating the line. But ^ and $ are made for lines, so they do not work as expected. I also get what people are complaining about: if you are not in multi-line mode and have a proper line with at most one newline at the end, then it will behave exactly as if you were in multi-line mode, which raises the question of why you would have those two modes to begin with. Non-multi-line mode only behaves differently if you have additional newlines, or one newline not at the end - that is, if you do not have a proper line - so why should $ still behave as if you were dealing with a line?
If you are not in multi-line mode, then a single line is expected, and consequently there is at most one newline, at the end of the string. You can of course pick an input that violates this and run it against a multi-line string with several newlines in it. cat\n\n will not match cat$ because there is something between cat and the end of the line; it just happens to be a newline, but one without any special meaning, because it is not the last character and you did not say that the input is multi-line.
If you have a file containing `A\nB\nC`, the file is three lines long.
I guess it could be argued that a file containing `A\nB\nC\n` has four lines, with the fourth having zero length.
Whether a regex is applied to an in-memory string or to a file doesn't feel to me like it should change the semantics.
Digging into the history a little, it looks like regexes were popularized in text editors and other file oriented tooling. In those contexts I imagine it would be far more common to want to discard or ignore the trailing zero length line than to process it like every other line in a file.
Technically the “newline” character is actually a line _terminator_. Hence “A\n” is one line, not two. The “\n” is always at the end of a line by definition.
Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the Open Group spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.
No, it is valid for a file to have content but no lines.
Semantically, many libraries treat that as a line, because while \n<EOF> means "the end of the last line", having just <EOF> adds additional complexity the user has to handle to read the remaining input. But by the book it's not "a line".
If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?
A line ends in a newline. A file with no newlines in it has no lines.
That's beyond ridiculous. In most languages, when you are reading a line from a file and it doesn't have a \n terminator, it's going to give you that line, not say "oops, this isn't a line, sorry".
I don't think you can meaningfully generalize to "most languages" here. To give an example, two extremely popular languages are C and Python. Both have a standard library function to read a line from a text stream - fgets() for C, readline() for Python. In both cases, the behavior is to read up to and including the newline character, but also to stop if EOF is encountered before then. Which means that the return value is different for terminated vs unterminated final lines in both languages - in particular, if there's no \n before EOF, the value returned is not a line (as it does not end with a newline), and you have to explicitly write your code to accommodate that.
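For example, with Python's readline(), sketched on an in-memory stream:

```python
import io

buf = io.StringIO("one\ntwo")  # the final line lacks a terminator
print(repr(buf.readline()))    # 'one\n' -- a complete, newline-terminated line
print(repr(buf.readline()))    # 'two'   -- EOF hit first: no trailing newline
print(repr(buf.readline()))    # ''      -- end of stream
```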
That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"
I get this is largely a semantic debate, but I find it a little ironic that so many programmers seem put off by the idea of a line count that starts at “0”.
Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
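You can sanity-check that arithmetic with a toy wc -l, which counts newline characters:

```python
def wc_l(data: str) -> int:
    # wc -l counts newlines, so an incomplete final line doesn't count
    return data.count("\n")

a = "one\ntwo"   # one complete line plus an incomplete one
b = "three\n"    # one complete line
print(wc_l(a), wc_l(b), wc_l(a + b))  # 1 1 2 -- the counts sum
```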
No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.
Technically, as per POSIX, a file as you describe is actually a binary file without any lines: basically just random binary data that happens to kind of look like a line.
This isn't a weird state. It's a language problem. An 'incomplete line' isn't a type of line, it's an unfortunate name for a thing that is not a line. Just like how the 'wor' is an incomplete word (the word 'word'), but 'wor' is, of course, not a word.
Same thing for formalisms like equations in algebra or formulas in propositional logic— we have the phrase 'well-formed formula', and we might describe some sequences of terms as 'incomplete formulas' or perhaps 'ill-formed formulas', but those phrases don't describe anything that meets the formal system's definition of 'formula' at all— they are not formulas. 'Ill-formed formula' is not a compositional phrase where 'ill-formed' describes a feature of a 'formula'. It's a bit of convenient language for what we can intuitively or metaphorically recognize as a formula-ish thing.
That's a weird way to look at it. Binary files might not have "lines", but there's no reason they couldn't include a byte with value 10 (the ASCII value for \n). Software reading that file wouldn't know the difference, right?
Also, why couldn't you have a text file without any lines?
It's a file with zero complete lines. But it has 1 line, that's incomplete, right?
Because the Unix definition of text file requires the file to end with a newline. "Lines" only exist in the context of text files. If there's no terminating newline, it's (pedantically) not a text file and so has no lines. Now, in practice, if you open() that file in text mode, it doesn't TMK return an error if the terminating newline isn't present, but it's undefined behaviour.
And if you do have a terminating newline, then you have at least one line :).
That seems like a broken (maybe just bad?) definition/specification to me. A blob of JSON in a file isn't "text" if there's no newline character trailing it?
This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.
The point is that having a sequence of two delimiters to signal the end of the logical line allows you to have single instances of either delimiter included within the text. This allows visual line breaks to be included within the same line as understood by the regex parser.
Despite the downvotes your comment received, I think you have a good point. There are two uses for a newline, first to signal the end of a line, for example when sending text over a serial connection, and second to separate two lines, for example in a text file.
To indicate that a serially received line is complete, the interpretation as a terminator makes perfect sense - abcd\n is a complete line, abc is a still incomplete line. In a text file the interpretation as a separator might be preferable because that gets rid of the issue of the last line not having a newline - a\nb\nc are three lines separated by two newlines, a\nb\nc\n are four lines separated by three newlines and the last line is empty.
But then it might also be useful to have a terminator in a text file, to be able to detect an incompletely written line. So using two characters, one for each purpose, could solve the problem: \r means the line is complete, \n means another line follows. abc is an incomplete line; abcd\r is a complete line with no line following; abcd\r\n is a complete line followed by a second, currently empty, incomplete line; abcd\r\n\r is two complete lines, the second one empty; abcd\r\nefg is a complete line followed by an incomplete line; abcd\r\nefg\r is two complete lines. You could even have two incomplete lines: abc\nefg.
But I think Windows always uses \r\n because this is how you get to a new line on a typewriter or a really old printer: you return the carriage and feed the paper one line. I do not think they had the idea of differentiating between terminator and separator; otherwise you could have only \r, and maybe even only \n sometimes. But in principle this could work quite nicely, I guess. You could start a line with \n and end it with \r; this would give you \r\n between lines and \r after the final line, or nothing if the final line is incomplete, or \r\n if the final line is incomplete and currently empty. The odd thing would be a newline as the very first character; maybe one could suppress that. This would also be compatible with Windows and *nix; it would just consider all *nix lines incomplete. Only abc\rdef\r would not really make sense: two complete lines, but the second one is not a new line.
If I ever get to write a new operating system, I will inflict this on humanity.
I mean, it was what everyone had agreed upon previously. Microsoft was the only party to follow through. For all the guff they get for not following standards, it was the one standard they did.
You don't have to love a company to acknowledge they did something right.
It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.
I don't know why no-one here sees this as a bad design...
If a line is missing a newline then we just disregard it?!
A way better way to deal with the newline is as a separator, like a comma. And, like in modern languages, we allow a final trailing separator but ignore it, so that it is easier for tools to generate files.
Now all combinations of characters, including newline characters, have an interpretation, without dropping anything.
I also always preferred the interpretation of a newline as a separator instead of a terminator for files, because I never liked the final newline causing a new empty line in the editor, and, like you, I thought it was bad design that you can have a somewhat invalid file.
But if you look beyond files, the interpretation as a terminator also makes perfect sense, when you receive text over a serial connection it signals that the line is complete which does not necessarily imply that another line will follow. The same in a file, if the terminating newline is missing, you can deduce that an incomplete write occurred and some data might be missing. If you decide to have a newline as a separator after the last line but to ignore it, then you can not represent an empty last line.
I guess you would need two different characters, one terminator and one separator. You could start a line with \n and end it with \r: the \n separates the line from the one before, then \r terminates the line and marks it as complete. You would get \r\n between lines, as on Windows, and the last line would have only \r if complete or would otherwise count as incomplete. Then again, you could almost get the same thing with \n alone; you would just have to change the interpretation. Instead of \n giving you a line and no \n giving you not-a-line, you would say that \n gives you a complete line and no \n gives you an incomplete line. With that, however, you could not have an incomplete empty line.
This effort of building in redundancy is pointless. We just need a newline to know where to start the output on a new line. If you want to safeguard the proper content of a file, a whole lot more is needed.
POSIX getline() includes EOF as a line terminator:
> getline() reads an entire line from stream, storing the address of the buffer containing the text into *lineptr. The buffer is null-terminated and includes the newline character, if one was found.
> ...
> ... a delimiter character is not added if one was not present in the input before end of file was reached.
Your quoted documentation says otherwise. It says that a 'line' includes the delimiter, '\n', in the line buffer. It also says that if no delimiter is found before the EOF is reached, the line buffer will not include the delimiter. That means the line buffer can clearly indicate an incomplete line by the absence of the delimiter. To be clear, EOF isn't a 'line terminator'; it's the end of the data stream.
Probably a vulnerability issue. Programmers would leave multiline mode on by mistake, then validate that some string only contains ^[a-z]*$… only for the string to have a \n and an SQL injection on the second line.
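A sketch of that pitfall in Python (the validator and payload are made up for illustration):

```python
import re

# A leftover MULTILINE flag makes ^...$ anchor per line, not per string
validator = re.compile(r"^[a-z]*$", re.MULTILINE)
payload = "safe\n'; DROP TABLE users; --"
print(bool(validator.search(payload)))  # True: the first line passes the check
```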
What is driving me nuts is that we have Unicode now, so there is no need to use common characters like $ or ^ to denote special regex state transitions.
If we were willing to ignore the ability to actually type it, you don't need Unicode for that; ASCII has a whole block of control characters at the beginning; I think ASCII 25 ("End of medium") works here.
The idea of changing a decades-old convention to instead use, as I assume you are implying, some character that requires special entry, is beyond silly.
It’s not that silly. You constantly get into escape conundrums because you need to use a metacharacter which is also a metacharacter three levels deep in some embedding.
(But that might not solve that problem? Maybe the problem is mostly about using same-character delimiters for strings.)
And I guess that’s why Perl is so flexible with regards to delimiters and such.
Yes, languages really need some sort of "raw string" feature like Python (or make regex literals their own syntax like Perl does). That's the solution here, not using weird characters...
Fine enough. But I wonder why strings have to use the same delimiter. Imagine if you had a list delimiter `|` and the answer to nested lists was “ohh, use raw list syntax, just make `###||` when you are three levels deep or something”.
It is quite nice what `sed` does. A sed search-and-replace is typically shown as `s/foo/bar/`, but you can actually use any punctuation character to separate the parts. Whatever follows the "s" will be used for that statement, so you can write `s|foo|bar|` or `s:foo:bar:`, even mixing and matching in the same script to have `s|baz|quux|; s:xyzzy:blorp:` and it will all work.
On the third hand strings are of course a special case because you always have special characters and whatnot which makes raw strings useful. :) Not just doing `"` and stuff.
I don't think anyone who writes regex would feel especially challenged by using the Alt+<code> / Ctrl+Shift+U key combos for Unicode entry. Having to escape fewer things in a pattern would be nice.
I write regexes all the time, and I don't know if I would be CHALLENGED by that, but it would be annoying. Escaping things is trivial, and since you do it all the time it is not anything extra to learn. Having to remember bespoke keystrokes for each character is a lot more to learn.
Regexes are one case where I think it's already extremely unbalanced wrt being easy to write but hard to read. Using stuff like special Unicode chars for this would make them harder to write but easier to read, which sounds like a fair deal to me. In general, I'd say that regexes should take time and effort to write, just because it's oh-so-easy to write something that kinda sorta works but has massive footguns.
I would also imagine that, if this became the norm, IDEs would quickly standardize around common notation - probably actually based on existing regex symbols and escapes - to quickly input that, similar to TeX-like notation for inputting math. So if you're inside a regex literal, you'd type, say, \A, and the editor itself would automatically replace it with the Unicode sigil for beginning-of-string.
Regexes originate from Perl, or they were popularized by Perl if I got this right. In Perl, readable code is not ranked as one of its top 100 priorities. Regexes could have originated from J and the situation could be even worse though!
Regexes predate Perl quite substantially. Think grep and friends if nothing else.
Certainly making the perlre library available separately from perl encouraged its widespread use, and lots of others copied it or were inspired by it.
"Popularized" doesn't seem like quite the right word though. I don't disagree with the point, but if I shout "Hey everyone, let's write regexes" at the office, people throw stationery at me, which is not true of other popular things!
I took a look at Raku, which claims to be a better Perl maybe, or closely related but more modern; it certainly looks nice. Although I am a big fan of typed languages, Raku piqued my interest.
Very nice, good to know. Yes, I know gradual typing; Python has a form of that, I think. I will check out Raku at some point; the type system will not go unnoticed. I didn't even know it had one!
ASCII restriction begets ASCII toothpick soup. Either lift that restriction or use balanced delimiters for strings in ASCII like backtick and single quote.
(“But backtick is annoying to type” said the Europeans.)
People say this all the time, but is it really always true? I have a ton of code that I wrote, that just works, and I never really look at it again, at least not with the level of inspection that requires parsing the regex in my head.
Even for code I wrote once and then never have to fix, I end up reading it multiple times while I create it and the lines around it. I think it really is always true.
Why not? Common characters are easier to type, and presumably if you are using regex on a Unicode string it might include those special characters anyway, so what have you gained?
It's not really an issue if the string you're matching might have those characters. It's an issue if the regex you are matching that string against might need to match those characters verbatim. Which is actually pretty common with ()[]$ when you're matching phone numbers, prices, etc., so you end up having to escape a lot, and the regex is less readable, especially if it also has to use those same characters as regex operators. On the other hand, it would be very uncommon to want to literally match, say, ⦑⦒ or ⟦⟧.
I'm the same, but now that I try in Perl, sure enough, $ seems to default to being a positive lookahead assertion for the end of the string. It does not match and consume an EOL character.
Only in multiline mode does it match EOL characters, but it does still not appear to consume them. In fact, I cannot construct a regex that captures the last character of one line, then consumes the newline, and then captures the first character of the next line, while using $. The capture group simply ends at $.
> Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
Same, tho it'd be interesting to see if this behavior holds if the file ends without a trailing newline and your match is on the final newline-less line.
Yes exactly, they match the end of a line, not a newline character. Some examples from documentation:
man 7 regex: '$' (matching the null string at the end of a line)
pcre2pattern: The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string. These two metacharacters are concerned with matching the starts and ends of lines. ... The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
I don't have (n)vi(m) open right now, but I think this only applies to leading spaces. For leading tabs, 0 will take you to the first non-tab character as well.
If you have "set list" to make non-space whitespace visible, it'll go to the leftmost position. I did it long ago along with "set listchars=trail:.,tab:>-" so I can see not only where tabs are, but also their size/alignment without causing the text to shift.
I feel like this perspective will be split between folks who use regex in code with strings, and more sysadmin folks who are used to consuming lines from files in scripts and at the CLI.
But yeah, it seems like a real misunderstanding from the "start/end of string" people.
POSIX regexes and Python regexes are different. In general, you need to reference the regex documentation for your implementation, since the syntax is not universal.
Per POSIX chapter 9[1]:
9.2 … "The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a <newline>."
and 9.3.8 … "A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character."
combine to mean that $ may match the end of string OR the end of the line, and it's up to the utility (or mode) to define which. Most of the common utilities (grep, sed, awk, Python, etc) treat it as end of line by default, since they operate on lines by default.
THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You cannot reliably read or write regular expressions without knowing which language & options are being used.
This seems like the perfect opportunity to introduce those unfamiliar to Robert Elder. He makes cool YouTube[0] and blog content[1] and has a series on regular expressions[2] and does some quite deep dives into the differing behaviour of the different tools that implement the various versions.
Regexp was one of the first things I truly internalized years ago when I was discovering Perl (which still lives in a cozy place in my heart due to a lovely “Camel” book).
Today the most important bit of information is knowing that implementations differ, and I've made a habit of pulling up a reference sheet for the thing I work with.
E.g., Emacs regexp annoyingly doesn't have word characters in the form of “\w” but uses “\s_-“ (or something; no reference sheet on screen) as a character class (but Emacs has the best documentation and discoverability - a hill I'm willing to die on).
Some utilities require parenthesis escaping and some not. Sometimes this behavior is configurable and sometimes it’s not.
I lived through the whole confusion, annoyance, denial phase, and now I just accept it. The concept is the same everywhere but the flavor changes.
My brain thinks in Perl's regex language and then I have to translate the inconsistent bits to the language I'm using. Especially in the shell - I'm way more likely to just drop a perl into the pipeline instead of trying to remember how sed/grep/awk (GNU or BSD?) prefer their regex.
For me, Perl hit me at exactly the right time in my development. One or more of the various O'Reilly Perl books caught my attention in the bookstore, the foreword and the writing style was unlike anything else I'd read in programming up to that point, and I read the book and just felt a strong connection to how the language was structured, the design concepts behind it, the power of regex being built in to the language, etc. The syntax favored easy to write programs without unnecessary scaffolding (of course, leading to the jokes of it being write-only - also the jokes I could make about me programming largely in Java today), and the standard functionality plus the library set available felt like magic to me at that point.
Learning Perl today would be a very different experience. I don't think it would catch me as readily as it did back then. But it doesn't matter - it's embedded into me at a deep level because I learned it through a strong drive of fascination and infatuation.
As for the regex themselves? It's powerful and solved a lot of the problems I was trying to solve, was built fundamentally into Perl as a language, so learning it was just an easy iterative process. It didn't hurt that the particular period of time when I learned Perl/regex the community was really big on "leetcode" style exercises, they just happened to be focused around Perl Golf, being clever in how you wrote solutions to arbitrary problems, and abusive levels of regex to solve problems. We were all playing and play is a great way to learn.
The same way people internalize punching data and instructions into stacks of cards, or internalize advanced mathematical notation. Just because things aren't written in plain english words doesn't mean they can't be internalized.
Perl has a few “sigils”, which are basically types: $scalar, @array and %hash. And a few syntactically equivalent operators. Also a set of global variables with character shorthands like `$.`. Apart from that, it's a regular language.
I can hear thousands of bad hiring managers adding 'How do you match the end of a string in a regex?' to their list of 'Ha! You don't know the trick!' questions designed to catch out candidates.
Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.
> The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.
The only time you'd want to write assembly in production code would be if you need to hand-roll some optimisation. So I don't really understand your point here.
That was one of my first uh-oh moments with GPT: getting code that clearly had untestable/unreadable regexen, which, given the source, must have meant the regexes were GPT-generated. So much is going to go wrong, and soon.
'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'
GPT4 / Gemini Advanced / Claude 3 Sonnet
GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;`
Full answer: https://justpaste.it/cg4cl
Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;`
Full answer: https://justpaste.it/589a5
Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;`
Full answer: https://justpaste.it/82r2v
Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).
So an Internet email address match pattern has to be: "..*@..*"; anything else can reject otherwise valid addresses.
That however does not account for earlier source-routed addresses, nor the old-style UUCP bang paths. However, those can probably be ignored for newly generated email.
I regularly use an email address with a "+" in the local part. When I used qmail, I often used addresses like "foo-a/b-bar-tat@DOMAIN", mainly for auto-filtering received messages from mailing lists.
…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)
I take the descriptivist approach to email validation, rather than the prescriptivist.
I know an email has to have a domain name after the @ so I know where to send it.
I also know it has to have something before the @ so the domain’s email server knows how to handle it.
But do I care if the email server supports sub-addresses, characters outside of the commonly supported range (e.g. quotation marks and spaces), or even characters which aren't part of the RFC? I do not.
If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.
The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.
I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.
That would be bad form, IMO. The user may have typed john..kennedy@example.com by mistake instead of john.f.kennedy@example.com, and now you’ll be sending their email to john.kennedy@example.com. Similar for leading or trailing dots. You can’t just decide what a user probably meant, when they type in something invalid.
Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.
Though I guess adding a check for invalid dot patterns might be worthwhile.
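A sketch of that minimal check (the function name is made up; this is deliberately permissive):

```python
import re

# "Some stuff, an @, more stuff" -- leave the rest to the confirmation email
def looks_like_email(addr: str) -> bool:
    return re.fullmatch(r"[^@\s]+@[^@\s]+", addr) is not None

print(looks_like_email("john.f.kennedy@example.com"))  # True
print(looks_like_email("oops"))                        # False
```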
Actually pretty good response if the programmer bothers to read all of it
I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there
> This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.
^ ("challenging"? Didn't I see somewhere that email validation requires at least a grammar and not just a regex?)
> For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.
> For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.
Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.
There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.
> if you know where to find something no point in knowing it.
Nonsense. And you know it.
First, you need to know what to find, before knowing where to find it. And knowing what to find requires intricate knowledge of the thing. Not intricate implementation details, but enough to point yourself in the right direction.
Secondly, you need to know why to find thing X and not thing Y. If anything, ChatGPT is even worse than Google or Stack Overflow at "solving the XY problem for you". An XY problem is one you don't want solved; instead, you want to be told that you don't want to solve it.
Maybe some future LLM can also push back. Maybe some future LLM can guide you to the right answer for a problem. But at the current state: nope.
Related: regexes are almost never the best answer to any question. They are available and quick, so all considered, maybe "the best" for this case. But overall: nah.
While I agree with your point that knowing things matters, it is entirely possible with the current batch of LLMs to get to an answer you don't know much about. It's actually one of the few things they do reliably well.
You start with what you do know, asking leading questions and being clear about what you don't, and you build towards deeper and deeper terminology until you get to the point where there are docs to read (because you still can't trust them to get the specifics right).
I've done this on a number of projects with pretty astonishing results, building stuff that would otherwise be completely out of my wheelhouse.
Funny, for me there have been instances where the LLM did push back. I had a plan for how to solve something and tasked the LLM with a draft implementation. It kept producing another solution, which I kept rejecting, specifying more details so it wouldn't stray. In the end I had to accept that my solution couldn't work and that the proposed one was acceptable. It's going to happen again, because it often comes up with inferior solutions, so I'm not very open to the reverse situation.
I should have clarified better. Because, indeed, I have the same experience with copilot. Where it suggested code that I disliked but was actually the right one and mine the wrong one.
I was talking about X-Y on a higher level though. Architecture, Design Patterns, that kind of stuff. LLMs are (still?) particularly bad at this. Which is rather obvious if you think of them as "just" statistical models: it'll just suggest what is done most often in your context, not what is current best for your context.
Yeah, I don't think the current crop of LLMs is useful for this. They let themselves be led by what's written, and struggle to understand even negation. So when I suspect there is a better solution, I have a hard time getting such an answer, even when asking explicitly for alternatives. I doubt it's just a question of training; they seem to lock themselves onto the context. When using Phind, this is somewhat mitigated by mixing in context from the web, which can lead to responses that include alternatives.
Sure, why bother understanding anything if ChatGPT can just produce the answers for you ;). You don't have to understand the answers even. Actually you don't need to understand the question also. Just forward the question from your manager to ChatGPT and forward the answer back to your manager ;). Why make life difficult for yourself?
Yeah, omitting what is arguably the language most associated with regexes seems a bit of an oversight. I guess it shows how far off the radar Perl currently is.
> I guess it shows how far off the radar Perl currently is.
This is a serious misconception. Perl is far, far from dead. The constant activity of the gargantuan CPAN library more than demonstrates very much the opposite.
I would say Perl and its community have done quite well, considering it hasn't had the same mountain of corporate funds thrust into it like more highlighted languages have. Mainstream ain't everything.
I don't think "off the radar" means dead, but that people aren't generally aware of what's going on with it. I think this is actually pretty consistent with what you're saying; stuff is going on with it, but it's not on people's radar, so they don't realize it.
Raku (formerly Perl 6) has picked ^ and $ for start-of-string and end-of-string, and has introduced ^^ and $$ for start-of-line and end-of-line. No multi line mode is available or necessary.
(There's also \h for horizontal and \v for vertical whitespace)
That's one of the benefits of a complete rethink/rewrite, you can learn from the fact that the old behavior surprised people.
And this is why this curmudgeon can't use Perl 6[1]. It randomly shuffles the line noise we learned over decades.
It seems so obvious that that's the opposite of what they should have defaulted to: it clearly should have been ^ and $ for lines, and ^^ and $$ for the string, since the lines nest inside the string like ((1)(2)(3)):
^^line1$\n^line2$\n^line3$\n$
[1]: That, and it's not anywhere, while Perl 5 is everywhere.
Pretty much all the regexen I have written have been predicated on start / end of string (I typically feed lines through the regex) … so picking single ^ and $ for the whole string maintains a degree of backward compatibility (assuming that I am normal)
At some point, I felt like I knew them all. There are probably more regex dialects out there, but I don't encounter them and my set of knowledge works most of the time.
I feel it's like driving a rental car. It behaves slightly different than your own car, some features missing, some other features added, but in general, most of the things are pretty similar.
The ISO/IEC 14882 C++ standard library <regex> mandates [0] implementations of six de jure standard regex grammars: IEEE Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep, plus ECMA-262 ECMAScript 3 [2].
So, yes, at least someone (me) considers regex to be standardized in several published de jure standards.
You may be prejudiced against C++, but ISO/IEC 14882 is a published international standard that links to recognized regex standards, so answers the question "does anyone consider RegEx standardised?" very much in the affirmative.
The three big ones I know of are POSIX, Perl/PCRE (aka Perl-Compatible Regular Expressions), and Go, which came along and <strike>added</strike> used re2, which is a bit different from the first two.
A lot of systems implemented PCRE, including JavaScript, since Perl extended the POSIX system with many useful extensions. IIRC, re2 tries to rein in some of the performance issues and quirks the original systems had, while implementing the whole thing in Go.
POSIX and PCRE are arguably redundant. They both support backreferences, which puts very significant constraints on their implementations. PCRE is at least functionally a superset of POSIX, whether or not there's some quirky thing POSIX supports that PCRE does not.
re2 adds a legitimate option to the menu of using NDFAs, which have the disadvantage of not supporting backreferences, but have the advantage of having constrained complexity of scanning a string. This does not come for free; you can conceivably end up with a compiled regexp of very large size with an NDFA approach, but most of the time you won't. The result may be generally slower than a PCRE-type approach, but it can also end up safer because you can be confident that there isn't a pathological input string for a given regexp that will go exponential.
This is one of those cases where ~99% of the time, it doesn't really matter which you choose, but at the scale of the Entire Programming World, both options need to be available. I've got some security applications where I legitimately prefer the re2 implementation in Go because it is advantageous to be confident that the REs I write have no pathological cases in the arbitrary input they face. PCRE can be necessary in certain high-performance cases, as long as you can be sure you're not going to get that pathological input.
RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment. I use both styles in my code. I've even got one unlucky exe I've been working with lately that has both, because it rather irreducibly has the requirements for both. Professionally annoying, but not actually a problem.
* Finite automata based regex engines don't necessarily have to be slower than backtracking engines like PCRE. Go's regexp is in practice slower in a lot of cases, but this is more a property of its implementation than its concept. See: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa... --- Given "sufficient" implementation effort (~several person years of development work), backtrackers and finite automata engines can both perform very well, with one beating the other in some cases but not in others. It depends.
* Fun fact is that if you're iterating over all matches in a haystack (e.g., Go's `FindAll` routines), then you're susceptible to O(m * n^2) search time. This applies to all regex engines that implement some kind of leftmost match priority. See https://github.com/BurntSushi/rebar?tab=readme-ov-file#quadr... for a more detailed elaboration on this point.
> RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment.
I love when people pat themselves on the back for being pragmatic. Why wait for others to compliment you when you can do it yourself? (that’s very pragmatic self-care)
Languages invented after Perl will generally use some flavor of Perl regex syntax, but there are always some minor differences. The issue of the meaning of `$` and changing it via multi-line mode is usually consistent though.
I like to think of "whatever browsers do in js" as an updated common baseline. Whatever your regex engine does, describe it as a delta to the js precedent. That thing is just so ubiquitous.
I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!
> I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!
Your comment is missing a trigger warning, lol. But seriously, this is one of my flags for "this should probably be a script, or an awk or perl one-liner."
I've got "hold my beer" commits in .NET - I've balanced brackets. I believe that's impossible in sed and grep. If I were going to write a JSON parser in a script, then a) stop me and b) it's got to be in PowerShell.
Delightfully, RFC 9485 https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An Interoperable Regular Expression Format" was published just back in October last year!
My working assumption has always been to check the docs of your specific regexp parser, and to write some tests (either automated or manually in a REPL) with specific patterns that you are interested in using.
POSIX specifies two flavours of regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). There are subtle differences between the two and ERE supports more features than BRE. For example, what is written as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE. See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... for more details.
The regular expression engines available in most mainstream languages go well beyond what is specified in POSIX though. An interesting example is the named capturing group in Python, e.g., (?P<token>f[o]+).
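For instance:

```python
import re

m = re.search(r"(?P<token>f[o]+)", "foo bar fooo")
print(m.group("token"))  # 'foo' -- the leftmost match
print(m.groupdict())     # {'token': 'foo'}
```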
If anything it would be ECMAScript (JavaScript dwarfs Java use) or PCRE (the de facto continuation of Perl regular expressions, written in C but used in many languages).
> It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.
Updated my comment to present a better example that avoids back-references. Thanks!
No GNU tool can balance brackets, afaics. So you can't do everything in sed. And sed is, by design, useless for matching text that spans lines, so good luck picking out paragraphs with it.
Sorry, I meant to write "if you can do it in sed you can do it in anything", thereby implying it is a subset of the more generally available flavours. The issue at hand, however, is that there isn't much in the way of standardisation, but 95% of sed should work across all of them. Of course you should get more into the specifics of whatever your solution space supports.
"Sed" is the name of a specific tool. It is not defined by the GNU tools, but has existed in some form since 1974, well before Perl. GNU sed and POSIX sed both support BRE and EREs, but not PCREs.
Maybe there's some other implementation of sed that supports PCREs but that would really be an extension of that implementation of sed rather than a property of sed.
And maybe there's some GNU tool that uses PCREs, but that GNU tool would not be GNU sed, so it would not be a relevant property.
Anyway, they probably should have said BREs or EREs rather than "sed"...
Kind of a trick question; there is POSIX, and then there is the app you're using and whichever flags are enabled (whether by default or explicitly defined).
People are confused about strings and lines. A string is a sequence of characters; a line can be two different things. If you consider the newline a line terminator, then a line is a sequence of non-newline characters - possibly zero - plus a newline. If there is no newline at the end, then it is not a [complete] line. That is what POSIX uses. If you consider the newline a line separator, then a line is a sequence of non-newline characters - possibly zero. In either case, the content of the line ends before the newline, either because the newline terminates the line or because it separates the line from the next. [1]
The semantics of ^ and $ are based on lines, whether in single-line or multi-line mode. For string-based semantics - which you could also think of as entire-file semantics if you are dealing with files - use \A and \Z or their equivalents (see the sketch after the footnote).
[1] Both interpretations have their merits. If you transmit text over a serial connection, it is useful to have the newline as a line terminator so that you know when you have received a complete line. If you put text into text files, it might arguably be easier to look at the newline as a line separator, because then you cannot have an invalid last line. On the other hand, having line terminators in text files allows you to detect incompletely written lines.
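A quick sketch of the \A/\Z point in Python (note Python spells absolute end-of-string \Z, where Perl/PCRE spell it \z):

```python
import re

print(bool(re.search(r"cat$", "cat\n")))   # True: $ tolerates a final newline
print(bool(re.search(r"cat\Z", "cat\n")))  # False: \Z does not
print(bool(re.search(r"cat\Z", "cat")))    # True: absolute end of string
```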
Structural regexes as found in the sam editor are an obscure but well engineered regex engine. I am far from an expert but my main takeaway from them is that most regex engines have an implied structure built around "lines" of text. While you can work around this, it is awkward. Structural regexes allow you to explicitly define the structure of a match, that is, you get to tell the engine what a "line" is.
> A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Lua's pattern matching is much simpler than regexes though.
> Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines.
There's an additional caveat: if you use the optional "init" parameter to specify an offset into the string to start matching, the ^ anchor will match at that offset, which may or may not be what you expect.
Well, it's not a completely outlandish scenario that the value of `init` might come from a variable that is sometimes at the start of the string and sometimes not, and a newcomer might expect `^` to only match when it is.
Don't get me wrong, it's certainly far more useful as it is, I'm glad it works this way.
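For contrast, Python's re goes the other way: the pos argument of pattern.search() does not move the ^ anchor.

```python
import re

pat = re.compile(r"^b")
print(pat.search("ab", 1))   # None: ^ still anchors at the real start
print(pat.search("ab"[1:]))  # matches: slicing actually moves the start
```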
* The $ anchor only matches at the end of the string
* The $$ anchor matches at the end of a logical line. That is, before a newline character, or at the end of the string when the last character is not a newline character.
Wait, in non-multiline mode, it only matches _one_ trailing newline? And not any other whitespace, including \r or \r\n? That is indeed surprising behavior. Why? Why not just make it end of string like the author expected?
You need to make your regex multi-line (`/^\d+$/m`), but that isn't the problem shown. Your query will be searching for `25\n`, not `25` despite your pre-check that it’s a good value.
The second line should always be "no", and if you use `\A\d+\z`, it will be.
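A small Python sketch of this validation trap, assuming the input arrives with a trailing newline (as readline() would deliver it); note Python spells the strict anchors \A and \Z:

```python
import re

user_input = "25\n"  # e.g. straight from readline(), not stripped

# Passes the pre-check even though the value still carries a newline:
print(bool(re.search(r'^\d+$', user_input)))    # True: $ skips the final \n

# Strict end-of-string anchors reject it:
print(bool(re.search(r'\A\d+\Z', user_input)))  # False
```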
The fact that there are so many different peculiarities in different regex systems has always raised the hairs on the back of my neck - as in, when a tool accepts a regex, I have to trawl the manual to find out exactly what flavor of regex it accepts.
The whole \r business is archaic. It doesn't even behave properly in most cases. Just use \n everywhere and bite the lemon for a short while to fix your problems.
And if you believe \r\n is the way to go, please make sure \n\r also works as they should have the same results. (or \r\n\r\r\r\r for that matter)
Why did they even decide to use two characters for the end of line? Seems bizarre. I could have imagined that `\r` and `\n` was a tossup. But why both?
Likely compatibility bugs going back decades (70s?). Probably with some terminal/teletype.
\r - return the teletype head to the start of the line
\n - move the paper one line down
> The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[2] In fact, it was often necessary to send extra padding characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.
Oh well. It’s more important to well-actually your knowledge of typewriter characters than to explain the history of why Windows is apparently the only platform (not Linux, not Mac, probably not the BSDs) that had to take “backwards compatibility” into account.
I know what the symbols mean and their original purpose. No one has answered why Windows and apparently only Windows acts like this. It’s not like Windows is the only platform that cares about hysterical raisins.
\r\n is a standard though (in fact, it's arguably more of a standard than what Unix and C did), so you really can't just do something different and expect everyone to follow.
(For example, most text-based internet protocols such as HTTP use \r\n as a line separator/terminator, so refusing to put the \r will be incompatible for no reason.)
FWIW, and I know this doesn't really address your complaint: I use Windows and I've set all my text editors to use LF exclusively years ago and Things Are Great. No more weird Git autocrlf warnings, no quirks when copying files over to/from people on Macs or Linuxes, etc. Even Notepad supports LF line endings for quite a long time now - to my practical experience, there's little remaining in Windows that makes CRLF "the OS standard line ending".
I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.
I mean, at some point it did matter what the OS did when you pressed the "Enter" button. But this isn't really the case much anymore. VS Code catches that keypress, and inserts whatever "files.eol" is set to. Sublime does the same. I didn't check, but I assume every other IDE has this setting.
Similarly, the HTML spec, which is pretty nuts, makes browsers normalize my enters to LF characters as I type into this textarea here (I can check by reading the `value` property in devtools), but when it's submitted, it converts every LF to a CRLF because that's how HTML forms were once specced back in the day. Again though, what my OS considers to be "the standard newline" is simply not considered at all. Even CMD.EXE batch files support LF.
I don't really type newlines all that much outside IDEs and browsers (incl electron apps) and places like MS Word, all of which disregard what the OS does and insert their own thing. Maybe the terminal? I don't even know. I doubt it's very consequential.
EDIT: PSA the same holds for backslashes! Do Not Use Backslashes. Don't use "OS specific directory separator constants". It's not 1998, just type "/" - it just works.
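A tiny Python illustration of that PSA (pathlib and the builtin file APIs accept forward slashes on Windows just fine; the path here is made up):

```python
from pathlib import Path

p = Path("projects/demo/readme.txt")  # hypothetical path, for illustration
# Path renders with the native separator but accepts '/' everywhere:
print(p)             # projects\demo\readme.txt on Windows
print(p.as_posix())  # projects/demo/readme.txt on every platform
```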
I don't know if it is the case on Windows 11, but I have surely been bitten by CMD batch files using LF line endings. I don't remember the exact issue but it may have been the one bug affecting labels. [1]
Writing a string -> NFA -> DFA grep-like tool is one of my most memorable college projects. Had a lot of fun with that, and decades later I ended up reusing some of the concepts for a work project.
Because they're using regex101 to easily test the semantics of different regex engines and Perl isn't available on regex101. PCRE is though, which is a decent approximation. And indeed, Perl and PCRE behave the same for this particular case.
I dunno. Maybe because nobody has contributed it? Maybe because Perl isn't as widely used as it once was? Maybe because it's hard to compile Perl to WASM? Maybe some other reason?
> Note: The table of data was gathered from regex101.com, I didn't test using the actual runtimes.
Has anyone confirmed this behaviour directly against the runtimes/languages? Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.
I've now tested C#, directly, and got the same result as the article. It also documents the behavior:
> The ^ and $ language elements indicate the beginning and end of the input string. The end of the input string can be a trailing newline \n character.
If you write it to a text file by itself and then read it from that text file, each runtime can have a different definition of whether a newline at the end of the file is meaningful or not. Under POSIX, a newline should always be present at the end of a non-empty text file and is not meaningful; not everyone agrees or is aware.
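One way to confirm Python's behavior directly against the runtime, with no web service in between - a minimal check:

```python
import re

print(re.search(r'cat$', 'cat\n'))   # <re.Match ...>: $ matches before the \n
print(re.search(r'cat\Z', 'cat\n'))  # None: \Z demands the absolute end
```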
This is mostly due to the different flavors of regex and less about it being platform dependent. $ was end of string in PCRE, which is the "old" Perl-compatible regex. Python has its own flavor, which has quirks as mentioned; RE2 is another option, in Go for example; and I think Rust has its own version as well, IIRC.
The differences between the various regex "dialects" dawned on me over the years of using regular expressions for all kinds of stuff.
Matching EOL feels natural for every line-based process.
What I find way more annoying is escaping characters and writing character groups. Why can't all regex engines support '\d' and '\w' and such?
Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?
> Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?
It is because sed predates the very influential second generation Extended Regular Expression engine and by default uses the first generation Basic Regular Expression engine. So really it is for backwards compatibility.
BRE and ERE were created at the same time. Prior to this there wasn't a clear standard for regex. From memory, this was standardised in 1996 (IEEE Std 1003.1-1996).
The work originally came from Stephen Cole Kleene in the 1950s. It was introduced to Unix fame via the QED editor (which later became ed (and sed), then ex, then vi, then vim; all with differing authors) when Ken Thompson added regex support as he ported QED to CTSS (an OS developed at MIT, which was later used to develop Multics, and hence led to Unix).
Also the "grep" command got its name from "ed"; "g" (the global ed command) "re" (regular expression), and "p" (the print ed command). Try it in vi/vim, :g/string/p it is the same thing as the grep command.
"$" could be end of string or end of line in perl, depending on the setting (are you treating data as a multiline text, or each line separately). (/m, /s,...)
I would hold a code review hostage if any file does not end with a newline. My reasoning: if the file is transmitted and gets truncated, nobody would know for sure whether it was supposed to end with a newline. Brownie points if the code ends with a comment noting that the file ends there.
The article calls computer languages "platforms", but they are computer languages. Bash is not included. Weird. I believe the most common use of regular expressions is grep or egrep with bash or some other shell, but who knows. Maybe I am hanging with the wrong crowd.
The table in the article makes this look complicated, but it really isn't. All the cases in the article can be grouped into two families:
- The JS/Go/Rust family, which treats $ like \z and does not support \Z at all
- The Java, .NET, PHP, Python family, which treats $ like \Z and may or may not (Python) support \z.
\Z ignores a \n just before the end of the string, while \z treats \n as a regular character and matches only at the very end.
For multiline $ the distinction doesn't matter, because a position before a \n already counts as an end of line.
Really the only deviation from the rule is Python's \Z, which is indeed weird.
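A short Python sketch of that deviation - Python's \Z acts like \z does elsewhere, and Python has no \z at all:

```python
import re

print(bool(re.search(r'cat$', 'cat\n')))   # True:  $ skips the trailing \n
print(bool(re.search(r'cat\Z', 'cat\n')))  # False: \Z behaves like PCRE's \z
# re.compile(r'cat\z') raises re.error ("bad escape \z"): no \z in Python's re
```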
Regex would really benefit from a comprehensive industrial standard. It's such a powerful tool that you have to keep relearning whenever you switch contexts.
> In 30 years of developing software I don’t think I ever used multi-line regexp even once.
As long as we're sharing anecdata: in 30 years, it's almost the only way I've used it.
It's incredible for slicing and dicing repetitious text into structure. You generally want some sort of Practical Extraction and Reporting Language, the core of which is something like a regular expression, generally able to handle the, well, irregularity.
Most recent example (I did this last week) was extracting Apple's app store purchases from an OCR of the purchase history available through Apple's Music app's Account page that lets you see all purchases across all digital offerings, but only as a long scrolling dialog box (reading that dialog's contents through accessibility hooks only retrieves the first few pages, unfortunately).
Each purchase contains one or more items and each item has one or more vertical lines, and if logos contain text they add arbitrary lines per logo.
A good match and sub match multi-line regex folds that mess back into a CSV. In this case, the regex for this was less than an 80 char line of code and worked in the find replace of Sublime Text which has multiline matching, subgroups, and back references.
Another way to do this is something like a state match/case machine, but why write a program when you can just write a regular expression?
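A hedged sketch of that kind of fold in Python - the data and field layout here are entirely invented, but the shape of the technique (one multiline pattern with named groups, folded into CSV rows) is the point:

```python
import re

# Invented stand-in for the OCR'd purchase history:
ocr = """\
Some App
Mar 12 2024
$4.99
Another App
Mar 13 2024
$0.99
"""

# One multiline pattern per purchase: name line, date line, price line.
pattern = re.compile(
    r'^(?P<name>.+)\n(?P<date>.+\d{4})\n\$(?P<price>[\d.]+)$',
    re.MULTILINE,
)
rows = [f"{m['name']},{m['date']},{m['price']}" for m in pattern.finditer(ocr)]
print("\n".join(rows))
# Some App,Mar 12 2024,4.99
# Another App,Mar 13 2024,0.99
```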
The perlre documentation on metacharacters states:
$ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)
Something I found really surprising about Python's regexp implementation is that it doesn't support the typical character classes like [:alnum:] etc.
It must be some kind of philosophical objection because there's no way something with as much water under the bridge as Python simply hasn't got around to it.
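For what it's worth, in Python you have to spell the POSIX classes out yourself - a quick sketch:

```python
import re

# [[:alnum:]] is NOT a POSIX class here; it is a plain character set
# containing '[', ':', 'a', 'l', 'n', 'u', 'm' (recent CPythons even warn
# about the possible nested set). The idiomatic Python spellings instead:
print(re.findall(r'[0-9a-zA-Z]+', 'ab1 _c'))  # ['ab1', 'c']   (ASCII alnum)
print(re.findall(r'\w+', 'ab1 _c'))           # ['ab1', '_c']  (word chars)
```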
> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.
I would argue this is correct behavior: a "line" isn't a "line" if it doesn't end with \n. [1]
> 3.206 Line - A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
Where you'll find the following block under ANCHORS AND SIMPLE ASSERTIONS:
> $ end of subject
> also before newline at end of subject
> also before internal newline in multiline mode
So all the cases of "newline at/before end of subject" are covered here. Then the question becomes: what is a subject? Is it line-by-line? Are newlines included? What if we want multiline matching? That's where re.MULTILINE comes in - it's not really "multiline matching" (sort of), it's "what is the subject of the regular expression we're matching against?"
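You can watch those cases directly in Python by asking where the zero-width $ matches land - a small sketch:

```python
import re

s = 'foo\nbar\n'
# Start offsets of the zero-width matches of each anchor:
print([m.start() for m in re.finditer(r'$', s)])      # [7, 8]: before the
                                                      # final \n, and the end
print([m.start() for m in re.finditer(r'(?m)$', s)])  # [3, 7, 8]: also before
                                                      # every internal \n
print([m.start() for m in re.finditer(r'\Z', s)])     # [8]: only the end
```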
The results did not surprise me. The fact that everyone is in agreement that "cat$" matches "cat" and not "cat\n" if multiline is off did not surprise me. \n is implicitly a multiline-contextual character to me. In other words, if you didn't have any \n, you'd just have an array of lines (without linefeeds), same as if you were reading lines from a file one at a time or splitting a binary on \n.
The other results that differ across engines seem to be because people either don't understand regex or because the POSIX description of how to deal with such an input and config was ill-defined.
There are many differences between implementations of regex. To name a few: lookbehind, atomic groups, named capturing groups, recursion, timeouts and, my favorite interop problem, Unicode.
The newline character is an actual character "at the end" of the string, though, so it makes sense that $ would include the newline character in multi-line matching.
It's not wrong actually. It's the difference between BRE and ERE, which are the two different POSIX standards that define regex. In BRE the $ should always match the end of the string (the spec specifically says it should match the string terminator since "newlines aren't special characters"), while the ERE spec says it should match until the end of the line.
The real issue is that no language nowadays "just" implements BRE or ERE since both specs are lacking in features.
Most languages instead implement some variant of Perl's regex instead (often called PCRE regex because of the C library that brought Perl's regex to C), which as far as I can tell isn't standardized, so you get these subtle differences between implementations.
It's possible to get design decisions wrong. Clearly people expect `$` to only match end-of-string so they did make the wrong decision. It may not have been clear it was the wrong decision at the time.
Things are obviously more complicated than that, lines are a complicated issue for historical reasons. There are two conventions, line termination and line separation. In case of line termination, the newline is part of the line and a string without a newline is not a [complete] line. In case of line separation, the newline is not part of the line but separates two lines. Also the way newlines are encoded is not universal.
Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $; the newline at the end is still not part of the content. You have to use \A and \Z if you want to treat all the characters as a string instead of as one or multiple lines.
> Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $
No, you're not, except for this weird corner case where `$` can match before the last `\n` in a string. It's not just any `\n` that non-multiline `$` can match before. It's when it's the last `\n` in the string. See:
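```python
import re

# A minimal Python sketch of the corner case (assuming this is the kind
# of example originally shown):
print(re.search(r'cat$', 'cat'))      # matches
print(re.search(r'cat$', 'cat\n'))    # matches: $ peeks past the final \n
print(re.search(r'cat$', 'cat\n\n'))  # None: the \n after 'cat' is not
                                      # the final one, so $ cannot match there
```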
This is weird behavior. I assume this is why RE2 didn't copy this. And it's certainly why I followed RE2 with Rust's regex crate. Non-multiline `$` should only match at the end of the string. It should not be line-aware. In regex engines like Python where it has the behavior above, it is only "partially" line-aware, and only in the sense that it treats the last `\n` as special.
But that is exactly what it means: the end of the line is before the terminating newline, or at the end of the string if there is no terminating newline. Both ^ and $ always match at start or end of lines; \A and \Z match at the start or end of the string. The difference between multi-line mode and not is whether internal newlines end and start lines - it does not change the semantics from end of line to end of string. And if you are not in multi-line mode but have internal newlines, then you might also want single-line/dot-all mode.
One could certainly have a debate whether this behavior is too strongly tied to the origins of regular expressions and now does more harm than good, but I am not convinced that making this breaking change would be an easy and obvious choice.
re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line. And giving it a `string` with multiple new lines doesn't necessarily mean you want to enable multi-line mode. They are orthogonal things.
> Both ^ and $ always match at start or end of lines
This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition. Yet it does not match `cat` followed by the end of a line in `cat\n\n`. And it does not do so in Python or in any other regex engine.
You're trying to square a circle here. It can't be done.
Can you make sense of, historically, why this choice of semantics was made? Sure. I bet you can. But I can still evaluate the choice on its own merits today. And I did when I made the regex crate.
> but I am not convinced that making this breaking change would be an easy and obvious choice.
Rust's regex crate, Go's regexp package and RE2 all reject this whacky behavior. As the regex crate maintainer, I don't think I've ever seen anyone complain. Not once. This to me suggests that, at minimum, making `$` and `\z` equivalent in non-multiline mode is a reasonable choice. I would also argue it is the better and more sensible approach.
Whether other regex engines should have a breaking change or not to change the meaning of `$` is an entirely different question completely. That is neither here nor there. They absolutely will not be able to make such a change, for many good reasons.
> re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line.
Sure, it takes a string which might be a line or multiple or whatever. Does not change the fact that $ matches at the end of a line. If you want the end of the string, use \Z.
> This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition.
In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
> In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.
I don't think this conversation is going anywhere. Your description of the semantics seems inconsistent and incomprehensible to me.
> A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.
Like I said, your description makes sense if the input is meant to be interpreted as a single line. And in some contexts (like line oriented CLI tools), that can make sense. But that's not the case here. So your description makes no sense at all to me.
> This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.
Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.
> The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.
Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.
Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.
And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.
> And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.
Where is this semantic explained in the `re` module docs?
This is totally and completely made up as far as I can tell.
This also seems entirely consistent with my rebuttal:
Me: What you're saying makes sense if condition foo holds.
You: Condition foo holds.
This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.
> but it explains why things behave the way they do
Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.
If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.
I quoted the section from the Python module here. [1]
If you do not specify multi-line, bar$ matches a line ending in bar - either foobar\n, or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. You can of course not specify multi-line and still pass in a string with additional newlines within it, but then those newlines will be treated more or less like any other character: bar$ will not match bar\n\n. The exception is that dot will not match them unless you set the single-line/dot-all flag - bar\n$ will match bar\n\n, but bar.$ will not unless you specify the single-line/dot-all flag.
I would even agree with you that it seems a bit weird. If you have a proper line without additional newlines in the middle, then multi-line behaves exactly like not multi-line. Not multi-line only behaves differently if you confront it with multiple lines and I have no good idea how you would end up in a situation where you have multiple lines and want to treat them as one unit but still treat the entire thing as if it was a line.
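A Python sketch checking exactly those claims:

```python
import re

h = 'bar\n\n'
print(re.search(r'bar$', h))       # None: only a *final* \n is skipped by $
print(re.search(r'bar\n$', h))     # matches: \n consumes the first newline,
                                   # leaving $ just before the final one
print(re.search(r'bar.$', h))      # None: . does not match \n by default
print(re.search(r'(?s)bar.$', h))  # matches: dot-all lets . consume the \n
```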
The docs do not say what you're saying. Your phrasing is completely different, and the part where "if ^/$ are in the pattern then the haystack is treated as a single line" is completely made up. As far as I can tell, that's your rationalization for how to make sense of this behavior. But it is not a story supported by the actual regex engine docs. The actual docs say, "^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string." The docs do not say, "the string is treated as a single line when ^/$ are used in the pattern." That's your phrasing, not anyone else's. That's your story, not theirs.
I still have not seen anything from you that makes sense of the behavior that `cat$` does not match `cat\n\n`. Like, I realize you've tried to explain it. But your explanation does not make sense. That's because the behavior is strange.
The only actual way to explain the behavior of $ is what the `re` docs say: it either matches at the end of the string or just before a `\n` that appears at the end of the string. That's it.
You are right, it is my wording - I replaced "end of string or before a newline as the last character" with "end of line", because that is what it means. You could also write that into the documentation, but then you would have to also explain what "end of line" means. And I will grant you that I might be wrong - that the behavior is only accidentally identical to matching the end of a line and that the true reason for it is different.
With cat$, the $ would have to match at the end of the line - just before the second \n - and cat is not directly before that position. I guess you want the regex engine to first treat the input as multi-line input, extract cat\n as the first line, and then have cat$ match successfully in that single line? What about cat$ and dog$ and cat\ndog\n?
Ignoring compatibility concerns, I would want the regex engine to behave the same way RE2, Go's regexp package and Rust's regex engine behave. I remember specifically considering Cox's decision ~10 years ago when writing the initial implementation of the regex crate. I thought Perl's (and Python's) behavior on this point was whacky then and I still think it's whacky now. So I followed RE2's semantics.
The OP is right to be surprised by this. And folks will continue to be surprised by it for eternity because it's an extremely subtle corner case that doesn't have a consistent story explaining its behavior. (I know you have proffered one, but I don't find it consistent in the context of a general purpose regex engine that searches arbitrary strings and not just lines.)
Of course, compatibility is a trump card here. I've acknowledged that. Changing this behavior now would be too hard. The best you could probably do is some kind of migration, where you provide the more "sensible" behavior behind an opt-in flag. And then maybe Python 4 enables it by default. But it's a lot of churn, and while people will continue to be confounded by this so long as the behavior exists, it probably isn't a Huge & Common Deal In Practice. So it may not be worth fixing. But if you're starting from scratch? Yes, please don't implement $ this way. It should match the end of the string when 'm' is disabled and the end of any line (including end of string and possibly being Unicode aware, depending on how much you care about that) when 'm' is enabled.
I think you've kind of missed the point. Sure if `$` in non-multiline mode means "end of line" the behaviour might be reasonable. But the big error is that people DO NOT EXPECT `$` to mean "end of line" in that case. They expect it to mean "end of string". That's clearly the least surprising and most useful behaviour.
The bug is not in how they have implemented "end of line" matching in non-multiline mode. It's that they did it at all.
And the ones that do not match cat\n with cat$ arguably have it wrong. Both ^ and $ anchor to the start and end of lines, not to the start and end of strings, whether in multi-line mode or not.
> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.
A reproducible example would be nice. I don’t understand what it is he cannot do. `re.search('$', 'no new lines')` returns a match.
Most people would expect 'bob\n' not to match, because I used '$' and it has an extra character at the end, just like 'bobs'. In Python it does match because '\n' is a special case.
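A reproducible Python example along those lines, for the question above:

```python
import re

print(re.search(r'^bob$', 'bobs'))     # None, as everyone expects
print(re.search(r'^bob$', 'bob\n'))    # matches: the trailing \n is special
print(re.search(r'\Abob\Z', 'bob\n'))  # None: \Z means the absolute end
```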
Indeed, one should test any regex one puts any trust in, but the problem is that if you take as a fact something that is actually a false assumption (as the author did here), your test may well fail to find errors which may cause faults when the regex is put to use.
This, in a nutshell, is the sort of problem which renders fallacious the notion that you can unit-test your way to correct software.
3.195 Incomplete Line
A sequence of one or more non-<newline> characters at the end of the file.
3.206 Line
A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
courtesy of [0]. See also [1] for rationale on "text file":
Text File
[...] The definition of "text file" has caused controversy. The only difference between text and binary files is that text files have lines of less than {LINE_MAX} bytes, with no NUL characters, each terminated by a <newline>. The definition allows a file with a single <newline>, or a totally empty file, to be called a text file. If a file ends with an incomplete line it is not strictly a text file by this definition. [...]
Note that all three of those apps come from the Windows world, where CRLF was indeed a line separator, and a file ending with CRLF was considered to have an empty line at the end.
But yes, you absolutely should accommodate for incomplete lines.
Was any regex documentation unclear on this? Some libraries have modes that change the semantics of ^ and $ but I’ve always found their use to be rather clear. It’s the grouping and look ahead/behind modifiers that I’ve always found hard to understand (at times).
This is a feature that seems so painfully obvious in the abstract that I’d wager most have never read the documentation. I’ve been a regex user since the early 90s and I’ve never thought about this.