> Folks who've worked with regular expressions before might know about ^ meaning "start-of-string" and correspondingly see $ as "end-of-string".
Huh. I always think of them as "start-of-line" and "end-of-line". I mean, a lot of the time when I'm working with regexes, I'm working with text a line at a time so the effect is the same, but that doesn't change how I think of those operators.
Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
Even disregarding whether end-of-string is also an end-of-line (see all the other comments below), $ doesn't match the newline itself: like other zero-width assertions such as \b, it matches a position, so the newline wouldn't be included in the matched text either way.
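A quick Python check of the zero-width point (assuming Python's default semantics, where $ also matches before a final newline):

```python
import re

# $ asserts a position; the newline is never part of the matched text
m = re.search(r"cat$", "cat\n")
print(repr(m.group()))  # 'cat' -- no trailing '\n' in the match
```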
Problem is, plenty of software doesn't actually look at the match but rather just validates that there was a match (and then continues to use the input to that match).
It's kind of driving me nuts that the article says ^ is "start of string" when it's actually "start of line", just like $ is "end of line". \A is apparently "start of string" like \Z is "end of string".
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
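A minimal sketch of that flag's effect in Python's re module:

```python
import re

s = "one\ntwo"
print(re.findall(r"\w+$", s))                # ['two']: $ only at the very end
print(re.findall(r"\w+$", s, re.MULTILINE))  # ['one', 'two']: $ at each line end
```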
In single-line [2] mode, the line starts at the start of the string and ends either at the end of the string, if there is no terminating newline, or just before the final newline, if there is one.
In multi-line mode, a new line starts at the start of the string and after each newline, and ends before each newline or at the end of the string if the last line has no terminating newline.
The confusion is that people think they are in string mode when they are not in multi-line mode, but they are not: they are in single-line mode. ^ and $ still use the semantics of lines, and a terminating newline, if present, is still not part of the content of the line.
With \n\n\n in single-line mode the non-greedy ^(\n+?)$ will capture only two of the newlines, the third one will be eaten by the $. If you make it greedy ^(\n+)$ will capture all three newlines. So arguably the implementations that do not match cat\n with cat$ are the broken ones.
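This is easy to check in Python, whose default mode has the same single-line semantics:

```python
import re

s = "\n\n\n"
lazy = re.match(r"^(\n+?)$", s)
greedy = re.match(r"^(\n+)$", s)
print(len(lazy.group(1)))    # 2: $ matches just before the final newline
print(len(greedy.group(1)))  # 3: the greedy + takes all three, $ matches at the end
```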
The POSIX definition of a line is a sequence of non-newline characters - possibly zero - followed by a newline. Everything that does not end with a newline is not a [complete] line. So strictly speaking it would even be correct that cat$ does not match cat, because there is no terminating newline; it should only match cat\n. But as lines missing a terminating newline are a thing, it seems reasonable to be less strict.
Python violates that definition, however, by allowing internal newlines in strings. For example, /^c[^a]t$/ matches "c\nt\n", but according to POSIX that's not a line.
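Quickly verified in Python:

```python
import re

# [^a] happily matches the internal newline, and $ tolerates the trailing one
print(bool(re.match(r"^c[^a]t$", "c\nt\n")))  # True
```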
I suspect the real reason for Python's behavior starts with the early decision to include the terminating newline in the string returned by IOBase.readline().
Python's peculiar choice has some minor advantages: you can distinguish between files that do and don't end with a terminating newline (the latter are invalid according to POSIX, but common in practice, especially on Windows), and you can reconstruct the original file by simply concatenating the line strings, which is occasionally useful.
The downside of this choice is that as a caller you have to deal with strings that may-or-may-not contain a terminating newline character, which is annoying (I often end up calling rstrip() or strip() on every line returned by readline(), just to get rid of the newlines; read().splitlines() is an option too if you don't mind reading the entire file into memory upfront).
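A minimal sketch of both options (the file name is hypothetical):

```python
# Option 1: strip the maybe-present terminator from each line as you read it
with open("input.txt") as f:        # hypothetical file
    for line in f:
        text = line.rstrip("\n")    # safe whether or not the '\n' is there
        print(repr(text))

# Option 2: read everything up front and let splitlines() eat the terminators
with open("input.txt") as f:
    for text in f.read().splitlines():
        print(repr(text))
```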
My guess is that Python's behavior is just a hack to make re.match() easier to use with readline(), rather than based on any principled belief about what lines are.
Python's behavior is not a hack; it is the common behavior. $ matches at the end of the string or before the last character if that is a newline, which is logically the same as the end of a single line. But as you said, you can have additional newlines inside the string, which is also common behavior and not specific to Python. Personally, I think of it like this: you assume that the string is a single line and match $ accordingly, either at the end of the string or before a terminating newline. If there are additional newlines, you treat them mostly as normal characters, with the exception that dot does not match newlines unless you set the single-line/dot-all flag.
The very post we're commenting on shows that that's not true: PHP, Python, Java and .NET (C#) share one behavior (accept "\n" as "$"), and ECMAScript (Javascript), Golang, and Rust share another behavior (do not accept "\n" as $).
Let's not argue about which is “the most common”; all of these languages are sufficiently common to say that there is no single common behavior.
> $ matches at the end of the string or before the last character if that is a newline, which is logically the same as the end of a single line.
Yes, that is Python's behavior (and PHP's, Java's, etc.). You're just describing it, not motivating why it has to work that way or why it's more correct than the obvious alternative of only matching the end of the string.
Subjectively, I find it odd that /^cat$/ matches not just the obvious string "cat" but also the string "cat\n". And I think historically, it didn't. I tried several common tools that predate Python:
- awk 'BEGIN { print ("cat\n" ~ /^cat$/) }' prints 0
- in GNU ed, /^M/ does not match any lines
- in vim, /^M/ does not match any lines
- sed -n '/\n/p' does not print any lines
- grep -P '\n' does not match any lines
- (I wanted to try `grep -E` too but I don't know how to escape a newline)
- perl -e 'print ("cat\n" =~ /^cat$/)' prints 1
So the consensus seems to be that the classic UNIX line-based tools match the regex against the line excluding the newline terminator (which makes sense since it isn't part of the content of that line) and therefore $ only needs to match the end of the string.
The odd one out is Perl: it seems to have introduced the idea that $ can match a newline at the end of the string, probably for similar reasons as Python. All of this suggests to me that allowing $ to match both "\n" and "" at the end of the string was a hack designed to make it easier to deal with strings without control characters and strings that end with a single newline.
> So the consensus seems to be that the classic UNIX line-based tools match the regex against the line excluding the newline terminator (which makes sense since it isn't part of the content of that line) and therefore $ only needs to match the end of the string.
If you read a line, you usually remove the newline at the end, but you could also keep it, as Python does. If you remove the newline, then a line can never contain a newline; the case cat\n can never occur. If you keep the newline, there will be exactly one newline, as the last character, and you arguably want cat$ to match cat\n because that newline is the end of the line but not part of the content. It makes perfect sense that $ matches at the end of the string or before a newline as the last character, since it will do the right thing whether or not you strip the newline.
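In Python, for example, the same pattern handles the stripped and unstripped variants, while an extra newline still blocks the match:

```python
import re

for s in ("cat", "cat\n", "cat\n\n"):
    print(repr(s), bool(re.search(r"cat$", s)))
# 'cat'      True   (end of string)
# 'cat\n'    True   ($ matches before the terminating newline)
# 'cat\n\n'  False  (the first newline is ordinary content here)
```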
If you want cat$ to not match cat\n, then you are obviously not dealing with lines: you have a string with a newline at the end, but you consider this newline part of the content instead of terminating the line. But ^ and $ are made for lines, so they do not work as expected. I also get what people are complaining about: if you are not in multi-line mode and have a proper line with at most one newline at the end, then it will behave exactly as if you were in multi-line mode, which raises the question of why you would have those two modes to begin with. Non-multi-line mode only behaves differently if you have additional newlines, or one newline not at the end - that is, if you do not have a proper line - so why should $ still behave as if you were dealing with a line?
If you are not in multi-line mode, then a single line is expected, and consequently there is at most one newline, at the end of the string. You can of course pick an input that violates this and run it against a multi-line string with several newlines in it. cat\n\n will not match cat$ because there is something between cat and the end of the line; it just happens to be a newline, but one without any special meaning, because it is not the last character and you did not say that the input is multi-line.
If you have a file containing `A\nB\nC`, the file is three lines long.
I guess it could be argued that a file containing `A\nB\nC\n` has four lines, with the fourth having zero length.
Whether a regex is applied to an in-memory string or to a file doesn't feel to me like it should change the semantics.
Digging into the history a little, it looks like regexes were popularized in text editors and other file oriented tooling. In those contexts I imagine it would be far more common to want to discard or ignore the trailing zero length line than to process it like every other line in a file.
Technically the “newline” character is actually a line _terminator_. Hence “A\n” is one line, not two. The “\n” is always at the end of a line by definition.
Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the Open Group spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.
No, it is valid for a file to have content but no lines.
Semantically, many libraries treat that as a line, because while \n<EOF> means "the end of the last line", having just <EOF> adds additional complexity the user has to handle to read the remaining input. But by the book it's not "a line".
If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?
A line ends in a newline. A file with no newlines in it has no lines.
That's beyond ridiculous. In most languages, when you are reading a line from a file and it doesn't have a \n terminator, it's going to give you that line, not say "oops, this isn't a line, sorry".
I don't think you can meaningfully generalize to "most languages" here. To give an example, two extremely popular languages are C and Python. Both have a standard library function to read a line from a text stream - fgets() for C, readline() for Python. In both cases, the behavior is to read up to and including the newline character, but also to stop if EOF is encountered before then. Which means that the return value is different for terminated vs unterminated final lines in both languages - in particular, if there's no \n before EOF, the value returned is not a line (as it does not end with a newline), and you have to explicitly write your code to accommodate that.
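For example, with Python's readline(), sketched on an in-memory stream:

```python
import io

buf = io.StringIO("one\ntwo")  # the final line lacks a terminator
print(repr(buf.readline()))    # 'one\n' -- a complete, newline-terminated line
print(repr(buf.readline()))    # 'two'   -- EOF hit first: no trailing newline
print(repr(buf.readline()))    # ''      -- end of stream
```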
That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"
I get this is largely a semantic debate, but I find it a little ironic that so many programmers seem put off by the idea of a line count that starts at “0”.
Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
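You can sanity-check that arithmetic with a toy wc -l, which counts newline characters:

```python
def wc_l(data: str) -> int:
    # wc -l counts newlines, so an incomplete final line doesn't count
    return data.count("\n")

a = "one\ntwo"   # one complete line plus an incomplete one
b = "three\n"    # one complete line
print(wc_l(a), wc_l(b), wc_l(a + b))  # 1 1 2 -- the counts sum
```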
No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.
Technically, as per POSIX, a file as you describe is actually a binary file without any lines: basically just random binary data that happens to kind of look like a line.
This isn't a weird state. It's a language problem. An 'incomplete line' isn't a type of line, it's an unfortunate name for a thing that is not a line. Just like how the 'wor' is an incomplete word (the word 'word'), but 'wor' is, of course, not a word.
Same thing for formalisms like equations in algebra or formulas in propositional logic— we have the phrase 'well-formed formula', and we might describe some sequences of terms as 'incomplete formulas' or perhaps 'ill-formed formulas', but those phrases don't describe anything that meets the formal system's definition of 'formula' at all— they are not formulas. 'Ill-formed formula' is not a compositional phrase where 'ill-formed' describes a feature of a 'formula'. It's a bit of convenient language for what we can intuitively or metaphorically recognize as a formula-ish thing.
That's a weird way to look at it. Binary files might not have "lines", but there's no reason they couldn't include a byte with value 10 (the ASCII value for \n). Software reading that file wouldn't know the difference, right?
Also, why couldn't you have a text file without any lines?
It's a file with zero complete lines. But it has 1 line, that's incomplete, right?
Because the Unix definition of text file requires the file to end with a newline. "Lines" only exist in the context of text files. If there's no terminating newline, it's (pedantically) not a text file and so has no lines. Now, in practice, if you open() that file in text mode, it doesn't TMK return an error if the terminating newline isn't present, but it's undefined behaviour.
And if you do have a terminating newline, then you have at least one line :).
That seems like a broken (maybe just bad?) definition/specification to me. A blob of JSON in a file isn't "text" if there's no newline character trailing it?
This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.
The point is that having a sequence of two delimiters to signal the end of the logical line allows you to have single instances of either delimiter included within the text. This allows visual line breaks to be included within the same line as understood by the regex parser.
Despite the downvotes your comment received, I think you have a good point. There are two uses for a newline, first to signal the end of a line, for example when sending text over a serial connection, and second to separate two lines, for example in a text file.
To indicate that a serially received line is complete, the interpretation as a terminator makes perfect sense - abcd\n is a complete line, abc is a still incomplete line. In a text file the interpretation as a separator might be preferable because that gets rid of the issue of the last line not having a newline - a\nb\nc are three lines separated by two newlines, a\nb\nc\n are four lines separated by three newlines and the last line is empty.
But then it might also be useful to have a terminator in a text file, to be able to detect an incompletely written line. So using two characters, one for each purpose, could solve the problem: \r means the line is complete, \n means another line follows. abc is an incomplete line; abcd\r is a complete line with no line following; abcd\r\n is a complete line followed by a second, currently empty, incomplete line; abcd\r\n\r is two complete lines, the second one empty; abcd\r\nefg is a complete line followed by an incomplete line; abcd\r\nefg\r is two complete lines. You could even have two incomplete lines: abc\nefg.
But I think Windows always uses \r\n because this is how you get to a new line on a typewriter or a really old printer: you return the carriage and feed the paper one line. I do not think they had the idea of differentiating between terminator and separator; otherwise you could have only \r, and maybe even only \n sometimes. But in principle this could work quite nicely, I guess. You could start a line with \n and end it with \r; this would give you \r\n between lines and \r after the final line, or nothing if the final line is incomplete, or \r\n if the final line is incomplete and currently empty. The odd thing would be a newline as the very first character; maybe one could suppress that. This would also be compatible with Windows and *nix; it would just consider all *nix lines incomplete. Only abc\rdef\r would not really make sense: two complete lines, but the second one is not a new line.
If I ever get to write a new operating system, I will inflict this on humanity.
I mean, it was what everyone had agreed upon previously. Microsoft was the only party to follow through. For all the guff they get for not following standards, it was the one standard they did.
You don't have to love a company to acknowledge they did something right.
It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.
I don't know why no-one here sees this as a bad design...
If a line is missing a newline then we just disregard it?!
A way better way to deal with the newline is as a separator, like a comma. And, like in modern languages, we allow a final trailing separator but ignore it, so that it is easier for tools to generate files.
Now all combinations of characters, including newline characters, have an interpretation, without dropping anything.
I also always preferred the interpretation of a newline as a separator instead of a terminator for files, because I never liked the final newline causing a new empty line in the editor, and, like you, I thought it was bad design that you can have a somewhat invalid file.
But if you look beyond files, the interpretation as a terminator also makes perfect sense, when you receive text over a serial connection it signals that the line is complete which does not necessarily imply that another line will follow. The same in a file, if the terminating newline is missing, you can deduce that an incomplete write occurred and some data might be missing. If you decide to have a newline as a separator after the last line but to ignore it, then you can not represent an empty last line.
I guess you would need two different characters, one terminator and one separator. You could start a line with \n and end it with \r: the \n separates the line from the one before, then \r terminates the line and marks it as complete. You would get \r\n between lines, as on Windows, and the last line would have only \r if complete or would otherwise count as incomplete. Then again, you could almost get the same thing with \n alone; you would just have to change the interpretation. Instead of \n giving you a line and no \n giving you not-a-line, you would say that \n gives you a complete line and no \n gives you an incomplete line. With that, however, you could not have an incomplete empty line.
This effort of building in redundancy is pointless. We just need a newline to know where to start the output on a new line. If you want to safeguard the proper content of a file, a whole lot more is needed.
POSIX getline() includes EOF as a line terminator:
> getline() reads an entire line from stream, storing the address of the buffer containing the text into *lineptr. The buffer is null-terminated and includes the newline character, if one was found.
> ...
> ... a delimiter character is not added if one was not present in the input before end of file was reached.
Your quoted documentation says otherwise. It says that a 'line' includes the delimiter, '\n', in the line buffer. It also says that if no delimiter is found before the EOF is reached, the line buffer will not include the delimiter. That means the line buffer can clearly indicate an incomplete line by the absence of the delimiter. To be clear, EOF isn't a 'line terminator'; it's the end of the data stream.
Probably a vulnerability issue. Programmers would leave multiline mode on by mistake, then validate that some string only contains ^[a-z]*$… only for the string to have a \n and an SQL injection on the second line.
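A sketch of that pitfall in Python (the validator and payload are made up for illustration):

```python
import re

# A leftover MULTILINE flag makes ^...$ anchor per line, not per string
validator = re.compile(r"^[a-z]*$", re.MULTILINE)
payload = "safe\n'; DROP TABLE users; --"
print(bool(validator.search(payload)))  # True: the first line passes the check
```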
What is driving me nuts is that we have Unicode now, so there is no need to use common characters like $ or ^ to denote special regex state transitions.
If we were willing to ignore the ability to actually type it, you don't need Unicode for that; ASCII has a whole block of control characters at the beginning; I think ASCII 25 ("End of medium") works here.
The idea of changing a decades-old convention to instead use, as I assume you are implying, some character that requires special entry, is beyond silly.
It’s not that silly. You constantly get into escape conundrums because you need to use a metacharacter which is also a metacharacter three levels deep in some embedding.
(But that might not solve that problem? Maybe the problem is mostly about using same-character delimiters for strings.)
And I guess that’s why Perl is so flexible with regards to delimiters and such.
Yes, languages really need some sort of "raw string" feature like Python (or make regex literals their own syntax like Perl does). That's the solution here, not using weird characters...
Fine enough. But I wonder why strings have to use the same delimiter. Imagine if you had a list delimiter `|` and the answer to nested lists was “ohh, use raw list syntax, just make `###||` when you are three levels deep or something”.
It is quite nice what `sed` does. A sed search-and-replace is typically shown as `s/foo/bar/`, but you can actually use any punctuation character to separate the parts. Whatever follows the "s" will be used for that statement, so you can write `s|foo|bar|` or `s:foo:bar:`, even mixing and matching in the same script to have `s|baz|quux|; s:xyzzy:blorp:` and it will all work.
On the third hand strings are of course a special case because you always have special characters and whatnot which makes raw strings useful. :) Not just doing `"` and stuff.
I don't think anyone who writes regex would feel especially challenged by using the Alt+<code> / Ctrl+Shift+U key combos for Unicode entry. Having to escape fewer things in a pattern would be nice.
I write regexes all the time, and I don't know if I would be CHALLENGED by that, but it would be annoying. Escaping things is trivial, and since you do it all the time it is not anything extra to learn. Having to remember bespoke keystrokes for each character is a lot more to learn.
Regexes are one case where I think it's already extremely unbalanced wrt being easy to write but hard to read. Using stuff like special Unicode chars for this would make them harder to write but easier to read, which sounds like a fair deal to me. In general, I'd say that regexes should take time and effort to write, just because it's oh-so-easy to write something that kinda sorta works but has massive footguns.
I would also imagine that, if this became the norm, IDEs would quickly standardize around common notation - probably actually based on existing regex symbols and escapes - to quickly input that, similar to TeX-like notation for inputting math. So if you're inside a regex literal, you'd type, say, \A, and the editor itself would automatically replace it with the Unicode sigil for beginning-of-string.
Regexes originate from Perl, or they were popularized by Perl if I got this right. In Perl, readable code is not ranked as one of its top 100 priorities. Regexes could have originated from J and the situation could be even worse though!
Regexes predate Perl quite substantially. Think grep and friends if nothing else.
Certainly making the perlre library available separately from perl encouraged its widespread use, and lots of others copied it or were inspired by it.
"Popularized" doesn't seem like quite the right word though. I don't disagree with the point, but if I shout "Hey everyone, let's write regexes" at the office, people throw stationery at me, which is not true of other popular things!
I took a look at Raku, which claims to be a better Perl maybe, or closely related but more modern; it certainly looks nice. Although I am a big fan of typed languages, Raku piqued my interest.
Very nice, good to know. Yes, I know gradual typing; Python has a form of that, I think. I will check out Raku at some point; the type system will not go unnoticed. I didn't even know it had one!
ASCII restriction begets ASCII toothpick soup. Either lift that restriction or use balanced delimiters for strings in ASCII like backtick and single quote.
(“But backtick is annoying to type” said the Europeans.)
People say this all the time, but is it really always true? I have a ton of code that I wrote, that just works, and I never really look at it again, at least not with the level of inspection that requires parsing the regex in my head.
Even for code I wrote once and then never have to fix, I end up reading it multiple times while I create it and the lines around it. I think it really is always true.
Why not? Common characters are easier to type, and presumably if you are using regex on a Unicode string it might include those special characters anyway, so what have you gained?
It's not really an issue if the string you're matching might have those characters. It's an issue if the regex you are matching that string against might need to match those characters verbatim. Which is actually pretty common with ()[]$ when you're matching phone numbers, prices, etc., so you end up having to escape a lot, and the regex is less readable, especially if it also has to use those same characters as regex operators. On the other hand, it would be very uncommon to want to literally match, say, ⦑⦒ or ⟦⟧.
I'm the same, but now that I try in Perl, sure enough, $ seems to default to being a positive lookahead assertion for the end of the string. It does not match and consume an EOL character.
Only in multiline mode does it match EOL characters, but it does still not appear to consume them. In fact, I cannot construct a regex that captures the last character of one line, then consumes the newline, and then captures the first character of the next line, while using $. The capture group simply ends at $.
> Maybe because a fair amount of the work I do with regexes (and, probably, how I was introduced to them) is via `grep`, so I'm often thinking of the inputs as "lines" rather than "strings"?
Same, tho it'd be interesting to see if this behavior holds if the file ends without a trailing newline and your match is on the final newline-less line.
Yes exactly, they match the end of a line, not a newline character. Some examples from documentation:
man 7 regex: '$' (matching the null string at the end of a line)
pcre2pattern: The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string. These two metacharacters are concerned with matching the starts and ends of lines. ... The dollar character is an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, that it does not actually match the newline. Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
I don't have (n)vi(m) open right now, but I think this only applies to leading spaces. For leading tabs, 0 will take you to the first non-tab character as well.
If you have "set list" to make non-space whitespace visible, it'll go to the leftmost position. I did it long ago along with "set listchars=trail:.,tab:>-" so I can see not only where tabs are, but also their size/alignment without causing the text to shift.
I feel like this perspective will be split between folks who use regex in code with strings, and more sysadmin folks who are used to consuming lines from files in scripts and at the CLI.
But yeah, it seems like a real misunderstanding from the "start/end of string" people.
POSIX regexes and Python regexes are different. In general, you need to reference the regex documentation for your implementation, since the syntax is not universal.
Per POSIX chapter 9[1]:
9.2 … "The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a <newline>."
and 9.3.8 … "A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character."
combine to mean that $ may match the end of string OR the end of the line, and it's up to the utility (or mode) to define which. Most of the common utilities (grep, sed, awk, Python, etc) treat it as end of line by default, since they operate on lines by default.
THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You cannot reliably read or write regular expressions without knowing which language & options are being used.
This seems like the perfect opportunity to introduce those unfamiliar to Robert Elder. He makes cool YouTube[0] and blog content[1] and has a series on regular expressions[2] and does some quite deep dives into the differing behaviour of the different tools that implement the various versions.
Regexp was one of the first things I truly internalized years ago when I was discovering Perl (which still lives in a cozy place in my heart due to a lovely “Camel” book).
Today the most important bit of information is knowing that implementations differ, and I've made a habit of pulling up a reference sheet for the thing I work with.
E.g., Emacs regexp annoyingly doesn't have word characters in the form of “\w” but uses “\s_-“ (or something; no reference sheet on screen) as a character class (but Emacs has the best documentation and discoverability - a hill I'm willing to die on).
Some utilities require parenthesis escaping and some not. Sometimes this behavior is configurable and sometimes it’s not.
I lived through the whole confusion, annoyance, denial phase, and now I just accept it. The concept is the same everywhere but the flavor changes.
My brain thinks in Perl's regex language and then I have to translate the inconsistent bits to the language I'm using. Especially in the shell - I'm way more likely to just drop a perl into the pipeline instead of trying to remember how sed/grep/awk (GNU or BSD?) prefer their regex.
For me, Perl hit me at exactly the right time in my development. One or more of the various O'Reilly Perl books caught my attention in the bookstore, the foreword and the writing style was unlike anything else I'd read in programming up to that point, and I read the book and just felt a strong connection to how the language was structured, the design concepts behind it, the power of regex being built in to the language, etc. The syntax favored easy to write programs without unnecessary scaffolding (of course, leading to the jokes of it being write-only - also the jokes I could make about me programming largely in Java today), and the standard functionality plus the library set available felt like magic to me at that point.
Learning Perl today would be a very different experience. I don't think it would catch me as readily as it did back then. But it doesn't matter - it's embedded into me at a deep level because I learned it through a strong drive of fascination and infatuation.
As for the regex themselves? It's powerful and solved a lot of the problems I was trying to solve, was built fundamentally into Perl as a language, so learning it was just an easy iterative process. It didn't hurt that the particular period of time when I learned Perl/regex the community was really big on "leetcode" style exercises, they just happened to be focused around Perl Golf, being clever in how you wrote solutions to arbitrary problems, and abusive levels of regex to solve problems. We were all playing and play is a great way to learn.
The same way people internalize punching data and instructions into stacks of cards, or internalize advanced mathematical notation. Just because things aren't written in plain english words doesn't mean they can't be internalized.
Perl has a few “sigils”, which are basically types: $scalar, @array and %hash. And a few syntactically equivalent operators. Also a set of global variables with character shorthands like `$.`. Apart from that, it's a regular language.
I can hear thousands of bad hiring managers adding 'How do you match the end of a string in a regex?' to their list of 'Ha! You don't know the trick!' questions designed to catch out candidates.
Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.
> The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.
The only time you'd want to write assembly in production code would be if you need to hand-roll some optimisation. So I don't really understand your point here.
That was one of my first uh-oh moments with GPT: getting code that clearly had untestable/unreadable regexen, which, given the source, must have meant the regexes were GPT-generated. So much is going to go wrong, and soon.
'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'
GPT4 / Gemini Advanced / Claude 3 Sonnet
GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;`
Full answer: https://justpaste.it/cg4cl
Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;`
Full answer: https://justpaste.it/589a5
Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;`
Full answer: https://justpaste.it/82r2v
Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).
So an Internet email address match pattern has to be: "..*@..*"; anything else can reject otherwise valid addresses.
That however does not account for earlier source-routed addresses, nor the old-style UUCP bang paths. However, those can probably be ignored for newly generated email.
I regularly use an email address with a "+" in the local part. When I used qmail, I often used addresses like "foo-a/b-bar-tat@DOMAIN", mainly for auto-filtering received messages from mailing lists.
…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)
I take the descriptivist approach to email validation, rather than the prescriptivist.
I know an email has to have a domain name after the @ so I know where to send it.
I also know it has to have something before the @ so the domain’s email server knows how to handle it.
But do I care if the email server supports sub-addresses, characters outside of the commonly supported range (e.g. quotation marks and spaces), or even characters which aren't part of the RFC? I do not.
If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.
The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.
I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.
That would be bad form, IMO. The user may have typed john..kennedy@example.com by mistake instead of john.f.kennedy@example.com, and now you’ll be sending their email to john.kennedy@example.com. Similar for leading or trailing dots. You can’t just decide what a user probably meant, when they type in something invalid.
Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.
Though I guess adding a check for invalid dot patterns might be worthwhile.
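A sketch of that minimal check (the function name is made up; this is deliberately permissive):

```python
import re

# "Some stuff, an @, more stuff" -- leave the rest to the confirmation email
def looks_like_email(addr: str) -> bool:
    return re.fullmatch(r"[^@\s]+@[^@\s]+", addr) is not None

print(looks_like_email("john.f.kennedy@example.com"))  # True
print(looks_like_email("oops"))                        # False
```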
Actually pretty good response if the programmer bothers to read all of it
I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there
> This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.
^ ("challenging"? Didn't I see somewhere that email validation requires at least a grammar and not just a regex?)
> For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.
> For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.
Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.
There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.
> if you know where to find something no point in knowing it.
Nonsense. And you know it.
First, you need to know what to find, before knowing where to find it. And knowing what to find requires intricate knowledge of the thing. Not intricate implementation details, but enough to point yourself in the right direction.
Secondly, you need to know why to find thing X and not thing Y. If anything, ChatGPT is even worse than Google or Stack Overflow at "solving the XY problem for you". An XY problem is one you don't want solved; instead, you want to be told that you don't want to solve it.
Maybe some future LLM can also push back. Maybe some future LLM can guide you to the right answer for a problem. But at the current state: nope.
Related: regexes are almost never the best answer to any question. They are available and quick, so all considered, maybe "the best" for this case. But overall: nah.
While I agree with your point that knowing things matters, it is entirely possible with the current batch of LLMs to get to an answer you don't know much about. It's actually one of the few things they do reliably well.
You start with what you do know, asking leading questions and being clear about what you don't, and you build towards deeper and deeper terminology until you get to the point where there are docs to read (because you still can't trust them to get the specifics right).
I've done this on a number of projects with pretty astonishing results, building stuff that would otherwise be completely out of my wheelhouse.
Funny, for me there have been instances where the LLM did push back. I had a plan for how to solve something and tasked the LLM with a draft implementation. It kept producing another solution, which I kept rejecting, specifying more details so it wouldn't stray. In the end I had to accept that my solution couldn't work and that the proposed one was acceptable. It's going to happen again, because it often comes up with inferior solutions, so I'm not very open to the reverse situation.
I should have clarified better. Because, indeed, I have the same experience with copilot. Where it suggested code that I disliked but was actually the right one and mine the wrong one.
I was talking about X-Y on a higher level though. Architecture, Design Patterns, that kind of stuff. LLMs are (still?) particularly bad at this. Which is rather obvious if you think of them as "just" statistical models: it'll just suggest what is done most often in your context, not what is current best for your context.
Yeah, I don't think the current crop of LLMs is useful for this. They let themselves be led by what's written, and struggle to understand even negation. So when I suspect there is a better solution, I have a hard time getting such an answer, even when asking explicitly for alternatives. I doubt it's just a question of training; they seem to lock themselves onto the context. When using Phind, this is somewhat mitigated by mixing in context from the web, which can lead to responses that include alternatives.
Sure, why bother understanding anything if ChatGPT can just produce the answers for you ;). You don't have to understand the answers even. Actually you don't need to understand the question also. Just forward the question from your manager to ChatGPT and forward the answer back to your manager ;). Why make life difficult for yourself?
Yeah, omitting what is arguably the language most associated with regexes seems a bit of an oversight. I guess it shows how far off the radar Perl currently is.
> I guess it shows how far off the radar Perl currently is.
This is a serious misconception. Perl is far, far from dead. The constant activity of the gargantuan CPAN library more than demonstrates very much the opposite.
I would say Perl and its community have done quite well, considering it hasn't had the same mountain of corporate funds thrust into it like more highlighted languages have. Mainstream ain't everything.
I don't think "off the radar" means dead, but that people aren't generally aware of what's going on with it. I think this is actually pretty consistent with what you're saying; stuff is going on with it, but it's not on people's radar, so they don't realize it.
Raku (formerly Perl 6) has picked ^ and $ for start-of-string and end-of-string, and has introduced ^^ and $$ for start-of-line and end-of-line. No multi line mode is available or necessary.
(There's also \h for horizontal and \v for vertical whitespace)
That's one of the benefits of a complete rethink/rewrite, you can learn from the fact that the old behavior surprised people.
And this is why this curmudgeon can't use Perl 6[1]. It randomly shuffles the line noise we learned over decades.
It seems so obvious that that's the opposite of what they should have defaulted to: it clearly should have been ^ and $ for lines, and ^^ and $$ for the string, since the lines nest inside the string like ((1)(2)(3)):
^^line1$\n^line2$\n^line3$\n$
[1]: That, and it's not anywhere, while Perl 5 is everywhere.
Pretty much all the regexen I have written have been predicated on start / end of string (I typically feed lines through the regex) … so picking single ^ and $ for the whole string maintains a degree of backward compatibility (assuming that I am normal)
At some point, I felt like I knew them all. There are probably more regex dialects out there, but I don't encounter them and my set of knowledge works most of the time.
I feel it's like driving a rental car. It behaves slightly different than your own car, some features missing, some other features added, but in general, most of the things are pretty similar.
The ISO/IEC 14882 C++ standard library <regex> mandates [0] implementations of six de jure standard regex grammars: IEEE Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep, plus ECMA-262 ECMAScript 3 [2].
So, yes, at least someone (me) considers regex to be standardized in several published de jure standards.
You may be prejudiced against C++, but ISO/IEC 14882 is a published international standard that links to recognized regex standards, so answers the question "does anyone consider RegEx standardised?" very much in the affirmative.
The three big ones I know of are POSIX, Perl/PCRE (aka Perl-Compatible Regular Expressions), and Go, which came along and <strike>added</strike> used re2, which is a bit different from the first two.
A lot of systems implemented PCRE, including JavaScript, since Perl extended the POSIX system with many useful extensions. IIRC, re2 tries to rein in some of the performance issues and quirks the original systems had, while implementing the whole thing in Go.
POSIX and PCRE are arguably redundant. They both support backreferences, which puts very significant constraints on their implementations. PCRE is at least functionally a superset of POSIX, whether or not there's some quirky thing POSIX supports that PCRE does not.
re2 adds a legitimate option to the menu of using NDFAs, which have the disadvantage of not supporting backreferences, but have the advantage of having constrained complexity of scanning a string. This does not come for free; you can conceivably end up with a compiled regexp of very large size with an NDFA approach, but most of the time you won't. The result may be generally slower than a PCRE-type approach, but it can also end up safer because you can be confident that there isn't a pathological input string for a given regexp that will go exponential.
This is one of those cases where ~99% of the time, it doesn't really matter which you choose, but at the scale of the Entire Programming World, both options need to be available. I've got some security applications where I legitimately prefer the re2 implementation in Go because it is advantageous to be confident that the REs I write have no pathological cases in the arbitrary input they face. PCRE can be necessary in certain high-performance cases, as long as you can be sure you're not going to get that pathological input.
RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment. I use both styles in my code. I've even got one unlucky exe I've been working with lately that has both, because it rather irreducibly has the requirements for both. Professionally annoying, but not actually a problem.
* Finite automata based regex engines don't necessarily have to be slower than backtracking engines like PCRE. Go's regexp is in practice slower in a lot of cases, but this is more a property of its implementation than its concept. See: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa... --- Given "sufficient" implementation effort (~several person years of development work), backtrackers and finite automata engines can both perform very well, with one beating the other in some cases but not in others. It depends.
* Fun fact is that if you're iterating over all matches in a haystack (e.g., Go's `FindAll` routines), then you're susceptible to O(m * n^2) search time. This applies to all regex engines that implement some kind of leftmost match priority. See https://github.com/BurntSushi/rebar?tab=readme-ov-file#quadr... for a more detailed elaboration on this point.
> RE engines don't quite engender the same emotions as programming languages as a whole, but this is not cheerleading, this is a sober engineering assessment.
I love when people pat themselves on the back for being pragmatic. Why wait for others to compliment you when you can do it yourself? (that’s very pragmatic self-care)
Languages invented after Perl will generally use some flavor of Perl regex syntax, but there are always some minor differences. The issue of the meaning of `$` and changing it via multi-line mode is usually consistent though.
I like to think of "whatever browsers do in js" as an updated common baseline. Whatever your regex engine does, describe it as a delta to the js precedent. That thing is just so ubiquitous.
I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!
> I do wonder though what's the highest number of different regex syntaxes I've ever encountered (perhaps written?) within a single line: bash, grep and sed are never not in a "hold my beer" mood!
Your comment is missing a trigger warning, lol. But seriously, this is one of my flags for "this should probably be a script, or an awk or perl one-liner."
I've got "hold my beer" commits in .NET - I've balanced brackets. I believe that's impossible in sed and grep. If I were going to write a JSON parser in a script, then a) stop me and b) it's got to be in PowerShell.
Delightfully, RFC 9485 https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An Interoperable Regular Expression Format" was published just back in October last year!
My working assumption has always been to check the docs of your specific regexp parser, and to write some tests (either automated or manually in a REPL) with specific patterns that you are interested in using.
POSIX specifies two flavours of regular expressions: basic regular expressions (BRE) and extended regular expressions (ERE). There are subtle differences between the two and ERE supports more features than BRE. For example, what is written as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE. See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... for more details.
The regular expression engines available in most mainstream languages go well beyond what is specified in POSIX though. An interesting example is the named capturing group in Python, e.g., (?P<token>f[o]+).
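For instance:

```python
import re

m = re.search(r"(?P<token>f[o]+)", "foo bar fooo")
print(m.group("token"))  # 'foo' -- the leftmost match
print(m.groupdict())     # {'token': 'foo'}
```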
If anything it would be ECMAScript (JavaScript dwarfs Java use) or PCRE (the de facto continuation of Perl regular expressions, written in C but used in many languages).
> It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.
Updated my comment to present a better example that avoids back-references. Thanks!
No GNU tool can balance brackets, afaics. So you can't do everything in sed. And sed is, by design, useless for matching text that spans lines, so good luck picking out paragraphs with it.
Sorry, I meant to write "if you can do it in sed you can do it in anything", thereby implying it is a subset of the more generally available flavours. The issue at hand, however, is that there isn't much in the way of standardisation, but 95% of sed should work across all of them. Of course you should get more into the specifics of whatever your solution space supports.
"Sed" is the name of a specific tool. It is not defined by the GNU tools, but has existed in some form since 1974, well before Perl. GNU sed and POSIX sed both support BRE and EREs, but not PCREs.
Maybe there's some other implementation of sed that supports PCREs but that would really be an extension of that implementation of sed rather than a property of sed.
And maybe there's some GNU tool that uses PCREs, but that GNU tool would not be GNU sed, so it would not be a relevant property.
Anyway, they probably should have said BREs or EREs rather than "sed"...
Kind of a trick question; there is POSIX, and then there is the app you're using and whichever flags are enabled (whether by default or explicitly defined).
People are confused about strings and lines. A string is a sequence of characters; a line can be two different things. If you consider the newline a line terminator, then a line is a sequence of non-newline characters - possibly zero - plus a newline. If there is no newline at the end, then it is not a [complete] line. That is what POSIX uses. If you consider the newline a line separator, then a line is a sequence of non-newline characters - possibly zero. In either case, the content of the line ends before the newline, either because the newline terminates the line or because it separates the line from the next. [1]
The semantics of ^ and $ are based on lines, whether in single-line or multi-line mode. For string-based semantics - which you could also think of as entire-file semantics if you are dealing with files - use \A and \Z or their equivalents (see the sketch after the footnote).
[1] Both interpretations have their merits. If you transmit text over a serial connection, it is useful to have the newline as a line terminator so that you know when you have received a complete line. If you put text into text files, it might arguably be easier to look at the newline as a line separator, because then you cannot have an invalid last line. On the other hand, having line terminators in text files allows you to detect incompletely written lines.
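A quick sketch of the \A/\Z point in Python (note Python spells absolute end-of-string \Z, where Perl/PCRE spell it \z):

```python
import re

print(bool(re.search(r"cat$", "cat\n")))   # True: $ tolerates a final newline
print(bool(re.search(r"cat\Z", "cat\n")))  # False: \Z does not
print(bool(re.search(r"cat\Z", "cat")))    # True: absolute end of string
```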
Structural regexes as found in the sam editor are an obscure but well engineered regex engine. I am far from an expert but my main takeaway from them is that most regex engines have an implied structure built around "lines" of text. While you can work around this, it is awkward. Structural regexes allow you to explicitly define the structure of a match, that is, you get to tell the engine what a "line" is.
> A pattern is a sequence of pattern items. A caret '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Lua's pattern matching is much simpler than regexes though.
> Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines.
There's an additional caveat: if you use the optional "init" parameter to specify an offset into the string to start matching, the ^ anchor will match at that offset, which may or may not be what you expect.
Well, it's not a completely outlandish scenario that the value of `init` might come from a variable that is sometimes at the start of the string and sometimes not, and a newcomer might expect `^` to only match when it is.
Don't get me wrong, it's certainly far more useful as it is, I'm glad it works this way.
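For contrast, Python's re goes the other way: the pos argument of pattern.search() does not move the ^ anchor.

```python
import re

pat = re.compile(r"^b")
print(pat.search("ab", 1))   # None: ^ still anchors at the real start
print(pat.search("ab"[1:]))  # matches: slicing actually moves the start
```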
* The $ anchor only matches at the end of the string
* The $$ anchor matches at the end of a logical line. That is, before a newline character, or at the end of the string when the last character is not a newline character.
Wait, in non-multiline mode, it only matches _one_ trailing newline? And not any other whitespace, including \r or \r\n? That is indeed surprising behavior. Why? Why not just make it end of string like the author expected?
You need to make your regex multi-line (`/^\d+$/m`), but that isn't the problem shown. Your query will be searching for `25\n`, not `25` despite your pre-check that it’s a good value.
The second line should always be "no", and if you use `\A\d+\z`, it will be.
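A small Python sketch of this validation trap, assuming the input arrives with a trailing newline (as readline() would deliver it); note Python spells the strict anchors \A and \Z:

```python
import re

user_input = "25\n"  # e.g. straight from readline(), not stripped

# Passes the pre-check even though the value still carries a newline:
print(bool(re.search(r'^\d+$', user_input)))    # True: $ skips the final \n

# Strict end-of-string anchors reject it:
print(bool(re.search(r'\A\d+\Z', user_input)))  # False
```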
The fact that there are so many different peculiarities in different regex systems has always raised the hairs on the back of my neck - as in, when a tool accepts a regex, I have to trawl the manual to find out exactly what flavor of regex it accepts.
The whole \r business is archaic. It doesn't even behave properly in most cases. Just use \n everywhere and bite the lemon for a short while to fix your problems.
And if you believe \r\n is the way to go, please make sure \n\r also works as they should have the same results. (or \r\n\r\r\r\r for that matter)
Why did they even decide to use two characters for the end of line? Seems bizarre. I could have imagined that `\r` and `\n` was a tossup. But why both?
Likely compatibility bugs going back decades (70s?). Probably with some terminal/teletype.
\r - return the teletype head to the start of the line
\n - move the paper one line down
> The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[2] In fact, it was often necessary to send extra padding characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.
Oh well. It’s more important to well-actually your knowledge of typewriter characters than to explain the history of why Windows is apparently the only platform (not Linux, not Mac, probably not the BSDs) that had to take “backwards compatibility” into account.
I know what the symbols mean and their original purpose. No one has answered why Windows and apparently only Windows acts like this. It’s not like Windows is the only platform that cares about hysterical raisins.
\r\n is a standard though (in fact, it's arguably more of a standard than what Unix and C did), so you really can't just do something different and expect everyone to follow.
(For example, most text-based internet protocols such as HTTP use \r\n as a line separator/terminator, so refusing to put the \r will be incompatible for no reason.)
FWIW, and I know this doesn't really address your complaint: I use Windows and I've set all my text editors to use LF exclusively years ago and Things Are Great. No more weird Git autocrlf warnings, no quirks when copying files over to/from people on Macs or Linuxes, etc. Even Notepad supports LF line endings for quite a long time now - to my practical experience, there's little remaining in Windows that makes CRLF "the OS standard line ending".
I bet if someday VS Code's Windows build ships with LF default on new installations, people won't even notice.
I mean, at some point it did matter what the OS did when you pressed the "Enter" button. But this isn't really the case much anymore. VS Code catches that keypress, and inserts whatever "files.eol" is set to. Sublime does the same. I didn't check, but I assume every other IDE has this setting.
Similarly, the HTML spec, which is pretty nuts, makes browsers normalize my enters to LF characters as I type into this textarea here (I can check by reading the `value` property in devtools), but when it's submitted, it converts every LF to a CRLF because that's how HTML forms were once specced back in the day. Again though, what my OS considers to be "the standard newline" is simply not considered at all. Even CMD.EXE batch files support LF.
I don't really type newlines all that much outside IDEs and browsers (incl electron apps) and places like MS Word, all of which disregard what the OS does and insert their own thing. Maybe the terminal? I don't even know. I doubt it's very consequential.
EDIT: PSA the same holds for backslashes! Do Not Use Backslashes. Don't use "OS specific directory separator constants". It's not 1998, just type "/" - it just works.
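A tiny Python illustration of that PSA (pathlib and the builtin file APIs accept forward slashes on Windows just fine; the path here is made up):

```python
from pathlib import Path

p = Path("projects/demo/readme.txt")  # hypothetical path, for illustration
# Path renders with the native separator but accepts '/' everywhere:
print(p)             # projects\demo\readme.txt on Windows
print(p.as_posix())  # projects/demo/readme.txt on every platform
```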
I don't know if it is the case on Windows 11, but I have surely been bitten by CMD batch files using LF line endings. I don't remember the exact issue but it may have been the one bug affecting labels. [1]
Writing a string -> NFA -> DFA grep-like tool is one of my most memorable college projects. Had a lot of fun with that, and decades later I ended up reusing some of the concepts for a work project.
Because they're using regex101 to easily test the semantics of different regex engines and Perl isn't available on regex101. PCRE is though, which is a decent approximation. And indeed, Perl and PCRE behave the same for this particular case.
I dunno. Maybe because nobody has contributed it? Maybe because Perl isn't as widely used as it once was? Maybe because it's hard to compile Perl to WASM? Maybe some other reason?
> Note: The table of data was gathered from regex101.com, I didn't test using the actual runtimes.
Has anyone confirmed this behaviour directly against the runtimes/languages? Newlines at the end of a string are certainly something that could get lost in transit inside an online service involving multiple runtimes.
I've now tested C#, directly, and got the same result as the article. It also documents the behavior:
> The ^ and $ language elements indicate the beginning and end of the input string. The end of the input string can be a trailing newline \n character.
If you write it to a text file by itself and then read it from that text file, each runtime can have a different definition of whether a newline at the end of the file is meaningful or not. Under POSIX, a newline should always be present at the end of a non-empty text file and is not meaningful; not everyone agrees or is aware.
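One way to confirm Python's behavior directly against the runtime, with no web service in between - a minimal check:

```python
import re

print(re.search(r'cat$', 'cat\n'))   # <re.Match ...>: $ matches before the \n
print(re.search(r'cat\Z', 'cat\n'))  # None: \Z demands the absolute end
```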
This is mostly due to the different flavors of regex and less about it being platform dependent. $ was end of string in PCRE, which is the "old" Perl-compatible regex. Python has its own flavor, which has quirks as mentioned; RE2 is another option, in Go for example; and I think Rust has its own version as well, IIRC.
The differences between the various regex "dialects" dawned on me over the years of using regular expressions for all kinds of stuff.
Matching EOL feels natural for every line-based process.
What I find way more annoying is escaping characters and writing character groups. Why can't all regex engines support '\d' and '\w' and such?
Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?
> Why, in sed, is an unescaped '.' a regex-dot matching any character, but an unescaped '(' is just a regular bracket?
It is because sed predates the very influential second generation Extended Regular Expression engine and by default uses the first generation Basic Regular Expression engine. So really it is for backwards compatibility.
BRE and ERE were created at the same time. Prior to this there wasn't a clear standard for regex. From memory, this was standardised in 1996 (IEEE Std 1003.1-1996).
The work originally came from Stephen Cole Kleene in the 1950s. It was introduced to Unix fame via the QED editor (which later became ed (and sed), then ex, then vi, then vim; all with differing authors) when Ken Thompson added regex support as he ported QED to CTSS (an OS developed at MIT, which was later used to develop Multics, and hence led to Unix).
Also the "grep" command got its name from "ed"; "g" (the global ed command) "re" (regular expression), and "p" (the print ed command). Try it in vi/vim, :g/string/p it is the same thing as the grep command.
"$" could be end of string or end of line in perl, depending on the setting (are you treating data as a multiline text, or each line separately). (/m, /s,...)
I would hold a code review hostage if any file does not end with a newline. My reasoning: if the file is transmitted and gets truncated, nobody would know for sure whether it was supposed to end with a newline. Brownie points if the code ends with a comment noting that the file ends there.
The article calls computer languages "platforms", but they are computer languages. Bash is not included. Weird. I believe the most common use of regular expressions is grep or egrep with bash or some other shell, but who knows. Maybe I am hanging with the wrong crowd.
The table in the article makes this look complicated, but it really isn't. All the cases in the article can be grouped into two families:
- The JS/Go/Rust family, which treats $ like \z and does not support \Z at all
- The Java, .NET, PHP, Python family, which treats $ like \Z and may or may not (Python) support \z.
\Z ignores a \n just before the end of the string, while \z treats \n as a regular character and matches only at the very end.
For multiline $ the distinction doesn't matter, because a position before a \n already counts as an end of line.
Really the only deviation from the rule is Python's \Z, which is indeed weird.
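A short Python sketch of that deviation - Python's \Z acts like \z does elsewhere, and Python has no \z at all:

```python
import re

print(bool(re.search(r'cat$', 'cat\n')))   # True:  $ skips the trailing \n
print(bool(re.search(r'cat\Z', 'cat\n')))  # False: \Z behaves like PCRE's \z
# re.compile(r'cat\z') raises re.error ("bad escape \z"): no \z in Python's re
```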
Regex would really benefit from a comprehensive industrial standard. It's such a powerful tool that you have to keep relearning whenever you switch contexts.
> In 30 years of developing software I don’t think I ever used multi-line regexp even once.
As long as we're sharing anecdata: in 30 years, it's almost the only way I've used it.
It's incredible for slicing and dicing repetitious text into structure. You generally want some sort of Practical Extraction and Reporting Language, the core of which is something like a regular expression, generally able to handle the, well, irregularity.
Most recent example (I did this last week) was extracting Apple's app store purchases from an OCR of the purchase history available through Apple's Music app's Account page that lets you see all purchases across all digital offerings, but only as a long scrolling dialog box (reading that dialog's contents through accessibility hooks only retrieves the first few pages, unfortunately).
Each purchase contains one or more items and each item has one or more vertical lines, and if logos contain text they add arbitrary lines per logo.
A good match and sub match multi-line regex folds that mess back into a CSV. In this case, the regex for this was less than an 80 char line of code and worked in the find replace of Sublime Text which has multiline matching, subgroups, and back references.
Another way to do this is something like a state match/case machine, but why write a program when you can just write a regular expression?
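A hedged sketch of that kind of fold in Python - the data and field layout here are entirely invented, but the shape of the technique (one multiline pattern with named groups, folded into CSV rows) is the point:

```python
import re

# Invented stand-in for the OCR'd purchase history:
ocr = """\
Some App
Mar 12 2024
$4.99
Another App
Mar 13 2024
$0.99
"""

# One multiline pattern per purchase: name line, date line, price line.
pattern = re.compile(
    r'^(?P<name>.+)\n(?P<date>.+\d{4})\n\$(?P<price>[\d.]+)$',
    re.MULTILINE,
)
rows = [f"{m['name']},{m['date']},{m['price']}" for m in pattern.finditer(ocr)]
print("\n".join(rows))
# Some App,Mar 12 2024,4.99
# Another App,Mar 13 2024,0.99
```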
The perlre documentation on metacharacters states:
$ Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)
Something I found really surprising about Python's regexp implementation is that it doesn't support the typical character classes like [:alnum:] etc.
It must be some kind of philosophical objection because there's no way something with as much water under the bridge as Python simply hasn't got around to it.
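For what it's worth, in Python you have to spell the POSIX classes out yourself - a quick sketch:

```python
import re

# [[:alnum:]] is NOT a POSIX class here; it is a plain character set
# containing '[', ':', 'a', 'l', 'n', 'u', 'm' (recent CPythons even warn
# about the possible nested set). The idiomatic Python spellings instead:
print(re.findall(r'[0-9a-zA-Z]+', 'ab1 _c'))  # ['ab1', 'c']   (ASCII alnum)
print(re.findall(r'\w+', 'ab1 _c'))           # ['ab1', '_c']  (word chars)
```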
> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.
I would argue this is correct behavior: a "line" isn't a "line" if it doesn't end with \n. [1]
> 3.206 Line - A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
Where you'll find the following block under ANCHORS AND SIMPLE ASSERTIONS:
> $ end of subject
> also before newline at end of subject
> also before internal newline in multiline mode
So all the cases of "newline at/before end of subject" are covered here. Then the question becomes: what is a subject? Is it line-by-line? Are newlines included? What if we want multiline matching? That's where re.MULTILINE comes in - it's not really "multiline matching" (sort of), it's "what is the subject of the regular expression we're matching against?"
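You can watch those cases directly in Python by asking where the zero-width $ matches land - a small sketch:

```python
import re

s = 'foo\nbar\n'
# Start offsets of the zero-width matches of each anchor:
print([m.start() for m in re.finditer(r'$', s)])      # [7, 8]: before the
                                                      # final \n, and the end
print([m.start() for m in re.finditer(r'(?m)$', s)])  # [3, 7, 8]: also before
                                                      # every internal \n
print([m.start() for m in re.finditer(r'\Z', s)])     # [8]: only the end
```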
The results did not surprise me. The fact that everyone is in agreement that "cat$" matches "cat" and not "cat\n" if multiline is off did not surprise me. \n is implicitly a multiline-contextual character to me. In other words, if you didn't have any \n, you'd just have an array of lines (without linefeeds), same as if you were reading lines from a file one at a time or splitting a binary on \n.
The other results that differ across engines seem to be because people either don't understand regex or because the POSIX description of how to deal with such an input and config was ill-defined.
There are many differences between implementations of regex. To name a few: lookbehind, atomic groups, named capturing groups, recursion, timeouts and, my favorite interop problem, Unicode.
The newline character is an actual character "at the end" of the string, though, so it makes sense that $ would include the newline character in multi-line matching.
It's not wrong actually. It's the difference between BRE and ERE, which are the two different POSIX standards that define regex. In BRE the $ should always match the end of the string (the spec specifically says it should match the string terminator since "newlines aren't special characters"), while the ERE spec says it should match until the end of the line.
The real issue is that no language nowadays "just" implements BRE or ERE since both specs are lacking in features.
Most languages instead implement some variant of Perl's regex instead (often called PCRE regex because of the C library that brought Perl's regex to C), which as far as I can tell isn't standardized, so you get these subtle differences between implementations.
It's possible to get design decisions wrong. Clearly people expect `$` to only match end-of-string so they did make the wrong decision. It may not have been clear it was the wrong decision at the time.
Things are obviously more complicated than that, lines are a complicated issue for historical reasons. There are two conventions, line termination and line separation. In case of line termination, the newline is part of the line and a string without a newline is not a [complete] line. In case of line separation, the newline is not part of the line but separates two lines. Also the way newlines are encoded is not universal.
Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $; the newline at the end is still not part of the content. You have to use \A and \Z if you want to treat all the characters as a string instead of as one or multiple lines.
> Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $
No, you're not, except for this weird corner case where `$` can match before the last `\n` in a string. It's not just any `\n` that non-multiline `$` can match before. It's when it's the last `\n` in the string. See:
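```python
import re

# A minimal Python sketch of the corner case (assuming this is the kind
# of example originally shown):
print(re.search(r'cat$', 'cat'))      # matches
print(re.search(r'cat$', 'cat\n'))    # matches: $ peeks past the final \n
print(re.search(r'cat$', 'cat\n\n'))  # None: the \n after 'cat' is not
                                      # the final one, so $ cannot match there
```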
This is weird behavior. I assume this is why RE2 didn't copy this. And it's certainly why I followed RE2 with Rust's regex crate. Non-multiline `$` should only match at the end of the string. It should not be line-aware. In regex engines like Python where it has the behavior above, it is only "partially" line-aware, and only in the sense that it treats the last `\n` as special.
But that is exactly what it means: the end of the line is before the terminating newline, or at the end of the string if there is no terminating newline. Both ^ and $ always match at start or end of lines; \A and \Z match at the start or end of the string. The difference between multi-line mode and not is whether internal newlines end and start lines - it does not change the semantics from end of line to end of string. And if you are not in multi-line mode but have internal newlines, then you might also want single-line/dot-all mode.
One could certainly have a debate whether this behavior is too strongly tied to the origins of regular expressions and now does more harm than good, but I am not convinced that making this breaking change would be an easy and obvious choice.
re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line. And giving it a `string` with multiple new lines doesn't necessarily mean you want to enable multi-line mode. They are orthogonal things.
> Both ^ and $ always match at start or end of lines
This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition. Yet it does not match `cat` followed by the end of a line in `cat\n\n`. And it does not do so in Python or in any other regex engine.
You're trying to square a circle here. It can't be done.
Can you make sense of, historically, why this choice of semantics was made? Sure. I bet you can. But I can still evaluate the choice on its own merits today. And I did when I made the regex crate.
> but I am not convinced that making this breaking change would be an easy and obvious choice.
Rust's regex crate, Go's regexp package and RE2 all reject this whacky behavior. As the regex crate maintainer, I don't think I've ever seen anyone complain. Not once. This to me suggests that, at minimum, making `$` and `\z` equivalent in non-multiline mode is a reasonable choice. I would also argue it is the better and more sensible approach.
Whether other regex engines should have a breaking change or not to change the meaning of `$` is an entirely different question completely. That is neither here nor there. They absolutely will not be able to make such a change, for many good reasons.
> re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line.
Sure, it takes a string which might be a line or multiple or whatever. Does not change the fact that $ matches at the end of a line. If you want the end of the string, use \Z.
> This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition.
In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
> In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.
I don't think this conversation is going anywhere. Your description of the semantics seems inconsistent and incomprehensible to me.
> A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.
Like I said, your description makes sense if the input is meant to be interpreted as a single line. And in some contexts (like line oriented CLI tools), that can make sense. But that's not the case here. So your description makes no sense at all to me.
> This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.
Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.
> The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.
Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.
Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.
And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.
> And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.
Where is this semantic explained in the `re` module docs?
This is totally and completely made up as far as I can tell.
This also seems entirely consistent with my rebuttal:
Me: What you're saying makes sense if condition foo holds.
You: Condition foo holds.
This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.
> but it explains why things behave the way they do
Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.
If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.
I quoted the section from the Python module here. [1]
If you do not specify multi-line, bar$ matches a line ending in bar - either foobar\n, or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. You can of course not specify multi-line and still pass in a string with additional newlines within it, but then those newlines will be treated more or less like any other character: bar$ will not match bar\n\n. The exception is that dot will not match them unless you set the single-line/dot-all flag - bar\n$ will match bar\n\n, but bar.$ will not unless you specify the single-line/dot-all flag.
I would even agree with you that it seems a bit weird. If you have a proper line without additional newlines in the middle, then multi-line behaves exactly like not multi-line. Not multi-line only behaves differently if you confront it with multiple lines and I have no good idea how you would end up in a situation where you have multiple lines and want to treat them as one unit but still treat the entire thing as if it was a line.
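A Python sketch checking exactly those claims:

```python
import re

h = 'bar\n\n'
print(re.search(r'bar$', h))       # None: only a *final* \n is skipped by $
print(re.search(r'bar\n$', h))     # matches: \n consumes the first newline,
                                   # leaving $ just before the final one
print(re.search(r'bar.$', h))      # None: . does not match \n by default
print(re.search(r'(?s)bar.$', h))  # matches: dot-all lets . consume the \n
```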
The docs do not say what you're saying. Your phrasing is completely different, and the part where "if ^/$ are in the pattern then the haystack is treated as a single line" is completely made up. As far as I can tell, that's your rationalization for how to make sense of this behavior. But it is not a story supported by the actual regex engine docs. The actual docs say, "^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string." The docs do not say, "the string is treated as a single line when ^/$ are used in the pattern." That's your phrasing, not anyone else's. That's your story, not theirs.
I still have not seen anything from you that makes sense of the behavior that `cat$` does not match `cat\n\n`. Like, I realize you've tried to explain it. But your explanation does not make sense. That's because the behavior is strange.
The only actual way to explain the behavior of $ is what the `re` docs say: it either matches at the end of the string or just before a `\n` that appears at the end of the string. That's it.
You are right, it is my wording - I replaced "end of string or before a newline as the last character" with "end of line", because that is what it means. You could also write that into the documentation, but then you would have to also explain what "end of line" means. And I will grant you that I might be wrong - that the behavior is only accidentally identical to matching the end of a line and that the true reason for it is different.
With cat$, the $ would have to match at the end of the line - just before the second \n - and cat is not directly before that position. I guess you want the regex engine to first treat the input as multi-line input, extract cat\n as the first line, and then have cat$ match successfully in that single line? What about cat$ and dog$ and cat\ndog\n?
Ignoring compatibility concerns, I would want the regex engine to behave the same way RE2, Go's regexp package and Rust's regex engine behave. I remember specifically considering Cox's decision ~10 years ago when writing the initial implementation of the regex crate. I thought Perl's (and Python's) behavior on this point was whacky then and I still think it's whacky now. So I followed RE2's semantics.
The OP is right to be surprised by this. And folks will continue to be surprised by it for eternity because it's an extremely subtle corner case that doesn't have a consistent story explaining its behavior. (I know you have proffered one, but I don't find it consistent in the context of a general purpose regex engine that searches arbitrary strings and not just lines.)
Of course, compatibility is a trump card here. I've acknowledged that. Changing this behavior now would be too hard. The best you could probably do is some kind of migration, where you provide the more "sensible" behavior behind an opt-in flag. And then maybe Python 4 enables it by default. But it's a lot of churn, and while people will continue to be confounded by this so long as the behavior exists, it probably isn't a Huge & Common Deal In Practice. So it may not be worth fixing. But if you're starting from scratch? Yes, please don't implement $ this way. It should match the end of the string when 'm' is disabled and the end of any line (including end of string and possibly being Unicode aware, depending on how much you care about that) when 'm' is enabled.
I think you've kind of missed the point. Sure if `$` in non-multiline mode means "end of line" the behaviour might be reasonable. But the big error is that people DO NOT EXPECT `$` to mean "end of line" in that case. They expect it to mean "end of string". That's clearly the least surprising and most useful behaviour.
The bug is not in how they have implemented "end of line" matching in non-multiline mode. It's that they did it at all.
And the ones that do not match cat\n with cat$ arguably have it wrong. Both ^ and $ anchor to the start and end of lines, not to the start and end of strings, whether in multi-line mode or not.
> So if you're trying to match a string without a newline at the end, you can't only use $ in Python! My expectation was having multiline mode disabled wouldn't have had this newline-matching behavior, but that isn't the case.
A reproducible example would be nice. I don’t understand what it is he cannot do. `re.search('$', 'no new lines')` returns a match.
Most people would expect 'bob\n' not to match, because I used '$' and it has an extra character at the end, just like 'bobs'. In Python it does match because '\n' is a special case.
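A reproducible Python example along those lines, for the question above:

```python
import re

print(re.search(r'^bob$', 'bobs'))     # None, as everyone expects
print(re.search(r'^bob$', 'bob\n'))    # matches: the trailing \n is special
print(re.search(r'\Abob\Z', 'bob\n'))  # None: \Z means the absolute end
```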
Indeed, one should test any regex one puts any trust in, but the problem is that if you take as a fact something that is actually a false assumption (as the author did here), your test may well fail to find errors which may cause faults when the regex is put to use.
This, in a nutshell, is the sort of problem which renders fallacious the notion that you can unit-test your way to correct software.
3.195 Incomplete Line
A sequence of one or more non-<newline> characters at the end of the file.
3.206 Line
A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
courtesy of [0]. See also [1] for rationale on "text file":
Text File
[...] The definition of "text file" has caused controversy. The only difference between text and binary files is that text files have lines of less than {LINE_MAX} bytes, with no NUL characters, each terminated by a <newline>. The definition allows a file with a single <newline>, or a totally empty file, to be called a text file. If a file ends with an incomplete line it is not strictly a text file by this definition. [...]
Note that all three of those apps come from the Windows world, where CRLF was indeed a line separator, and a file ending with CRLF was considered to have an empty line at the end.
But yes, you absolutely should accommodate for incomplete lines.
Was any regex documentation unclear on this? Some libraries have modes that change the semantics of ^ and $ but I’ve always found their use to be rather clear. It’s the grouping and look ahead/behind modifiers that I’ve always found hard to understand (at times).
This is a feature that seems so painfully obvious in the abstract that I’d wager most have never read the documentation. I’ve been a regex user since the early 90s and I’ve never thought about this.