
burntsushi · on March 20, 2024

> Because even after disabling multi-line you are still dealing with line-based semantics when you use ^ or $

No, you're not, except for this weird corner case where `$` can match before the last `\n` in a string. It's not just any `\n` that non-multiline `$` can match before. It's when it's the last `\n` in the string. See:

    >>> re.search('cat$', 'cat\n')
    <re.Match object; span=(0, 3), match='cat'>
    >>> re.search('cat$', 'cat\n\n')
    >>>

This is weird behavior. I assume this is why RE2 didn't copy this. And it's certainly why I followed RE2 with Rust's regex crate. Non-multiline `$` should only match at the end of the string. It should not be line-aware. In regex engines like Python where it has the behavior above, it is only "partially" line-aware, and only in the sense that it treats the last `\n` as special.
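For comparison, a minimal sketch of how Python's `$` differs from `\Z` (which in Python matches only at the very end of the string) on the same haystacks. The `results` name is just illustrative and not from the thread:

```python
import re

# For each haystack, record (does `cat$` match?, does `cat\Z` match?).
results = {}
for haystack in ['cat', 'cat\n', 'cat\n\n']:
    dollar = re.search(r'cat$', haystack) is not None
    end_only = re.search(r'cat\Z', haystack) is not None
    results[haystack] = (dollar, end_only)
    print(repr(haystack), dollar, end_only)

# 'cat'      True  True
# 'cat\n'    True  False   <- `$` tolerates one trailing newline, `\Z` does not
# 'cat\n\n'  False False
```

Only the middle case separates the two anchors, which is the "last `\n` is special" corner case described above.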

danbruc · on March 20, 2024

But that is exactly what it means, the end of the line is before the terminating newline or at the end of the string if there is no terminating newline. Both ^ and $ always match at start or end of lines, \A and \Z match at the start or end of the string. The difference between multi-line and not is whether or not internal newlines end and start lines, it does not change the semantics from end of line to end of string. And if you are not in multi-line mode but have internal newlines, then you might also want single-line/dot-all mode.

One could certainly have a debate whether this behavior is too strongly tied to the origins of regular expressions and now does more harm than good, but I am not convinced that this would be an easy and obvious choice to have breaking change.

burntsushi · on March 20, 2024

re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line. And giving it a string with multiple newlines doesn't necessarily mean you want to enable multi-line mode. They are orthogonal things.

> Both ^ and $ always match at start or end of lines

This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition. Yet it does not match `cat` followed by the end of a line in `cat\n\n`. And it does not do so in Python or in any other regex engine.

You're trying to square a circle here. It can't be done.

Can you make sense of, historically, why this choice of semantics was made? Sure. I bet you can. But I can still evaluate the choice on its own merits today. And I did when I made the regex crate.

> but I am not convinced that this would be an easy and obvious choice to have breaking change.

Rust's regex crate, Go's regexp package and RE2 all reject this whacky behavior. As the regex crate maintainer, I don't think I've ever seen anyone complain. Not once. This to me suggests that, at minimum, making `$` and `\z` equivalent in non-multiline mode is a reasonable choice. I would also argue it is the better and more sensible approach.

Whether other regex engines should make a breaking change to the meaning of `$` is an entirely different question. That is neither here nor there. They absolutely will not be able to make such a change, for many good reasons.

danbruc · on March 20, 2024

> re.search does not accept a "line." It accepts a "string." There is no pretext in which re.search is meant to only accept a single line.

Sure, it takes a string which might be a line or multiple or whatever. Does not change the fact that $ matches at the end of a line. If you want the end of the string, use \Z.

> This is trivially not true, as I showed in my previous example. The haystack `cat\n\n` contains two lines and the regex `cat$` says it should match `cat` followed by the "end of a line" according to your definition.

In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.
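The three cases claimed here can be checked directly in Python. This is only a sketch of the observable behaviour, with illustrative variable names:

```python
import re

haystack = 'cat\n\n'

# Default mode: `$` is only tried at the end and before the final newline.
single = re.search(r'cat$', haystack)

# MULTILINE: `$` also matches before the first (internal) newline.
multi = re.search(r'cat$', haystack, re.MULTILINE)

# Default mode again, but with the first newline consumed explicitly.
with_newline = re.search(r'cat\n$', haystack)

print(single)               # None
print(multi.span())         # (0, 3)
print(with_newline.span())  # (0, 4)
```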

burntsushi · on March 20, 2024

> In multi-line mode it matches, in single-line mode it does not because there is a newline between cat and the end of the line. A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

I don't think this conversation is going anywhere. Your description of the semantics seems inconsistent and incomprehensible to me.

> A newline is only a terminating newline if it is the last character, the newline after cat is not a terminating newline. You need cat\n$ or cat\n\n to match.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Like I said, your description makes sense if the input is meant to be interpreted as a single line. And in some contexts (like line oriented CLI tools), that can make sense. But that's not the case here. So your description makes no sense at all to me.

danbruc · on March 20, 2024

> This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

> The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.

Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.

And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.

burntsushi · on March 20, 2024

> And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

Where is this semantic explained in the `re` module docs?

This is totally and completely made up as far as I can tell.

This also seems entirely consistent with my rebuttal:

Me: What you're saying makes sense if condition foo holds.

You: Condition foo holds.

This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.

> but it explains why things behave the way they do

Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.

If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.

danbruc · on March 20, 2024

I quoted the section from the Python module here. [1]

If you do not specify multi-line, bar$ matches a line ending in bar, either foobar\n or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. So it either treats your input as a single line or as multiple lines. You can of course not specify multi-line and still pass in a string with additional newlines within the string, but then those newlines will be treated more or less as any other character, so bar$ will not match bar\n\n. The exception is that dot will not match them unless you set the single-line/dot-all flag: bar\n$ will match bar\n\n, but bar.$ will not unless you specify the single-line/dot-all flag.
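A small sketch of those cases in Python's `re`, assuming `re.MULTILINE` and `re.DOTALL` are the flags meant by "multi-line" and "single-line/dot-all" (the variable names are illustrative):

```python
import re

# Non-multiline: `bar$` matches a trailing "bar" with or without one final newline.
with_newline = re.search(r'bar$', 'foobar\n')
without_newline = re.search(r'bar$', 'foobar')

# Only MULTILINE makes `$` match before the internal newline as well.
text = 'foobar\nbazbar\n'
default_hits = [m.start() for m in re.finditer(r'bar$', text)]
multiline_hits = [m.start() for m in re.finditer(r'bar$', text, re.MULTILINE)]

# With two trailing newlines, `.` only consumes the first one under DOTALL.
dot_default = re.search(r'bar.$', 'bar\n\n')
dot_dotall = re.search(r'bar.$', 'bar\n\n', re.DOTALL)

print(default_hits)       # [10]
print(multiline_hits)     # [3, 10]
print(dot_default)        # None
print(dot_dotall.span())  # (0, 4)
```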

I would even agree with you that it seems a bit weird. If you have a proper line without additional newlines in the middle, then multi-line behaves exactly like not multi-line. Not multi-line only behaves differently if you confront it with multiple lines and I have no good idea how you would end up in a situation where you have multiple lines and want to treat them as one unit but still treat the entire thing as if it was a line.

[1] https://news.ycombinator.com/item?id=39765086

burntsushi · on March 20, 2024

The docs do not say what you're saying. Your phrasing is completely different, and the part where "if ^/$ are in the pattern then the haystack is treated as a single line" is completely made up. As far as I can tell, that's your rationalization for how to make sense of this behavior. But it is not a story supported by the actual regex engine docs. The actual docs say, "^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string." The docs do not say, "the string is treated as a single line when ^/$ are used in the pattern." That's your phrasing, not anyone else's. That's your story, not theirs.

I still have not seen anything from you that makes sense of the behavior that `cat$` does not match `cat\n\n`. Like, I realize you've tried to explain it. But your explanation does not make sense. That's because the behavior is strange.

The only actual way to explain the behavior of $ is what the `re` docs say: it either matches at the end of the string or just before a `\n` that appears at the end of the string. That's it.
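That rule is short enough to write down as a predicate. The sketch below uses a hypothetical helper, `dollar_matches_at`, which is not part of the `re` module, and compares it against where the bare pattern `$` actually matches in non-multiline mode:

```python
import re

def dollar_matches_at(s: str, pos: int) -> bool:
    """Hypothetical helper: non-multiline `$` as the docs phrase it.

    True at the end of the string, or immediately before a newline
    that is itself the last character of the string.
    """
    return pos == len(s) or (pos == len(s) - 1 and s[pos] == '\n')

def dollar_positions(s: str) -> list[int]:
    # Every position where the pattern `$` matches according to `re`.
    return [m.start() for m in re.finditer(r'$', s)]

for s in ['', 'cat', 'cat\n', 'cat\n\n', '\n']:
    predicted = [p for p in range(len(s) + 1) if dollar_matches_at(s, p)]
    print(repr(s), predicted, dollar_positions(s))
```

Nothing in the predicate refers to lines in general; the only newline it looks at is one in the final position.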

danbruc · on March 20, 2024

You are right, it is my wording, I replaced end of string or before newline as the last character with end of line because that is what this means. You could also write that into the documentation but then you would have to also explain what end of line means. And I will grant you that I might be wrong, that the behavior is only accidentally identical to matching the end of a line but that the true reason for it is different.

With cat$, the $ matches at the end of the line, before the second \n, and cat is not directly before that. I guess you want the regex engine to first treat the input as a multi-line input, extract cat\n as the first line, and then have cat$ match successfully in that single line? What about cat$ and dog$ and cat\ndog\n?

burntsushi · on March 21, 2024

> I guess you want the regex engine

Ignoring compatibility concerns, I would want the regex engine to behave the same way RE2, Go's regexp package and Rust's regex engine behave. I remember specifically considering Cox's decision ~10 years ago when writing the initial implementation of the regex crate. I thought Perl's (and Python's) behavior on this point was whacky then and I still think it's whacky now. So I followed RE2's semantics.

The OP is right to be surprised by this. And folks will continue to be surprised by it for eternity because it's an extremely subtle corner case that doesn't have a consistent story explaining its behavior. (I know you have proffered one, but I don't find it consistent in the context of a general purpose regex engine that searches arbitrary strings and not just lines.)

Of course, compatibility is a trump card here. I've acknowledged that. Changing this behavior now would be too hard. The best you could probably do is some kind of migration, where you provide the more "sensible" behavior behind an opt-in flag. And then maybe Python 4 enables it by default. But it's a lot of churn, and while people will continue to be confounded by this so long as the behavior exists, it probably isn't a Huge & Common Deal In Practice. So it may not be worth fixing. But if you're starting from scratch? Yes, please don't implement $ this way. It should match the end of the string when 'm' is disabled and the end of any line (including end of string and possibly being Unicode aware, depending on how much you care about that) when 'm' is enabled.

IshKebab · on March 21, 2024

Dunno if you noticed who you are debating with here... :-D

IshKebab · on March 20, 2024

> But that is exactly what it means

I think you've kind of missed the point. Sure if `$` in non-multiline mode means "end of line" the behaviour might be reasonable. But the big error is that people DO NOT EXPECT `$` to mean "end of line" in that case. They expect it to mean "end of string". That's clearly the least surprising and most useful behaviour.

The bug is not in how they have implemented "end of line" matching in non-multiline mode. It's that they did it at all.
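One place this expectation tends to bite is input validation. A minimal sketch, assuming a made-up `user_input` value, of how `^...$` lets a trailing newline through while `\Z` and `re.fullmatch` do not:

```python
import re

user_input = '123\n'  # illustrative value with a stray trailing newline

# `$` accepts the position before the final newline, so this "validates".
loose = re.match(r'^\d+$', user_input)

# `\Z` only matches at the true end of the string.
strict = re.match(r'^\d+\Z', user_input)

# fullmatch requires the pattern to cover the whole string.
full = re.fullmatch(r'\d+', user_input)

print(loose is not None)  # True
print(strict)             # None
print(full)               # None
```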