> In multi-line mode it matches, in single-line mode it does not because there i...

danbruc · 2024-03-20T17:19:23 1710955163

This only makes sense if re.search accepted a line to search. It doesn't. It accepts an arbitrary string.

Which is fine because lines are a subset of strings. And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

The first `\n` in `cat\n\n` is a terminating newline. There just happens to be one after it.

Look at where this is coming from. You do line-based stuff, there is either no newline at all or there is exactly one newline at the end. You do file-based stuff, there are many newlines. In both cases the behavior of ^ and $ makes perfect sense.

Now you come along with cat\n\n which clearly falls into the file-based stuff category as it has more than one newline in it but you also insist that it is not multiple lines. If it is not multiple lines, then only the last character can be a newline, otherwise it would be multiple lines.

And I get it, yes, you can throw arbitrary strings at a regular expression, this line-based processing is not everything, but it explains why things behave the way they do. And that is also why people added \A and \Z. And I understand that ^ and $ are much nicer and much better known than \A and \Z. Maybe the best option would be to have a separate flag that makes them synonymous with \A and \Z and this could maybe even be the default.
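
For anyone following along, here is a minimal sketch of the ^/$ versus \A/\Z distinction in Python's `re`, with no flags set. The variable names are just for illustration.

```python
import re

line = "cat\n"  # a "line": exactly one newline, at the very end

# ^ and $ tolerate the single trailing newline.
with_dollar = re.search(r"^cat$", line)

# \A and \Z anchor to the actual start and end of the string,
# so the trailing newline gets in the way.
with_z = re.search(r"\Acat\Z", line)

print(with_dollar is not None)  # True
print(with_z is not None)       # False
```

Note that this is Python's `\Z`. Other engines spell the strict end-of-string anchor differently, so check the docs of whatever engine you use.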

burntsushi · 2024-03-20T17:54:10 1710957250

> And whether you want your input treated as a line or a string is decided by your pattern, use ^ and $ and it will be treated as a line, use \A and \Z and it will be treated as a string.

Where is this semantic explained in the `re` module docs?

This is totally and completely made up as far as I can tell.

This also seems entirely consistent with my rebuttal:

Me: What you're saying makes sense if condition foo holds.

You: Condition foo holds.

This is uninteresting to me because I see no reason to believe that condition foo holds. Where condition foo is "the input to re.search is expected to be a single line." Or more precisely, apparently, "the input to re.search is expected to be a single line when either ^ or $ appear in the pattern." That is totally bonkers.

> but it explains why things behave the way they do

Firstly, I am not debating with you about the historical reasoning for this. Secondly, I am providing a commentary on the semantics themselves (they suck) and also on your explanation of them in today's context (it doesn't make sense). Thirdly, I am not making a prescriptive argument that established regex engines should change their behavior in any way.

If you're looking to explain why this semantic is the way it is, then I'd expect writing from the original implementors of it. Probably in Perl. I wouldn't at all be surprised if this was an "oops" or if it was implemented in a strictly-line-oriented context, and then someone else decided to keep it unthinkingly when they moved to a non-line-oriented context. From there, compatibility takes over as a reason for why it's with us today.

danbruc · 2024-03-20T18:21:11 1710958871

I quoted the section from the Python module here. [1]

If you do not specify multi-line, bar$ matches a line ending in bar, either foobar\n or foobar if the terminating newline has been removed or does not exist. If you specify multi-line, then it will also match at every bar\n within the string. So it either treats your input as a single line or as multiple lines. You can of course leave multi-line off and still pass in a string with additional newlines inside it, but then those newlines are treated more or less like any other character: bar$ will not match bar\n\n. The exception is that the dot will not match them unless you set the single-line/dot-all flag: bar\n$ will match bar\n\n, but bar.$ will not unless you specify that flag.
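
A quick sketch of those cases in Python's `re`, in case it helps to see them side by side. The variable names are mine, and the comments describe what I would expect on a current CPython.

```python
import re

# No flags: $ matches at the end, or just before one trailing newline.
plain_no_newline = re.search(r"bar$", "foobar")      # matches
plain_one_newline = re.search(r"bar$", "foobar\n")   # matches
plain_two_newlines = re.search(r"bar$", "bar\n\n")   # None

# re.MULTILINE: $ additionally matches before every newline.
multi_two_newlines = re.search(r"bar$", "bar\n\n", re.MULTILINE)  # matches

# Consuming the inner newline explicitly works without any flag...
explicit_newline = re.search(r"bar\n$", "bar\n\n")   # matches

# ...but the dot only consumes it when re.DOTALL is set.
dot_default = re.search(r"bar.$", "bar\n\n")             # None
dot_dotall = re.search(r"bar.$", "bar\n\n", re.DOTALL)   # matches
```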

I would even agree with you that it seems a bit weird. If you have a proper line without additional newlines in the middle, then multi-line behaves exactly like not multi-line. Not multi-line only behaves differently if you confront it with multiple lines and I have no good idea how you would end up in a situation where you have multiple lines and want to treat them as one unit but still treat the entire thing as if it was a line.

[1] https://news.ycombinator.com/item?id=39765086

burntsushi · 2024-03-20T18:31:24 1710959484

The docs do not say what you're saying. Your phrasing is completely different, and the part where "if ^/$ are in the pattern then the haystack is treated as a single line" is completely made up. As far as I can tell, that's your rationalization for how to make sense of this behavior. But it is not a story supported by the actual regex engine docs. The actual docs say, "^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string." The docs do not say, "the string is treated as a single line when ^/$ are used in the pattern." That's your phrasing, not anyone else's. That's your story, not theirs.

I still have not seen anything from you that makes sense of the behavior that `cat$` does not match `cat\n\n`. Like, I realize you've tried to explain it. But your explanation does not make sense. That's because the behavior is strange.

The only actual way to explain the behavior of $ is what the `re` docs say: it either matches at the end of the string or just before a `\n` that appears at the end of the string. That's it.
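
To make that concrete, here is a small sketch that lists the offsets where a bare `$` matches in Python when no flags are set. The variable names are only illustrative.

```python
import re

# Offsets at which an empty `$` match is found.
one_newline = [m.start() for m in re.finditer(r"$", "cat\n")]
two_newlines = [m.start() for m in re.finditer(r"$", "cat\n\n")]

print(one_newline)   # [3, 4]: before the final \n, and at the very end
print(two_newlines)  # [4, 5]: offset 3 (right after "cat") is not included

# Which is why the first search succeeds and the second does not.
cat_one = re.search(r"cat$", "cat\n")
cat_two = re.search(r"cat$", "cat\n\n")
```

In `cat\n\n`, the only newline that `$` can sit in front of is the last one, and `cat` does not end there.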

danbruc · 2024-03-20T19:00:58 1710961258

You are right, it is my wording. I replaced "end of string or before a newline as the last character" with "end of line" because that is what it means. You could also write that into the documentation, but then you would have to explain what end of line means as well. And I will grant you that I might be wrong, that the behavior is only accidentally identical to matching the end of a line and that the true reason for it is different.

Take cat$: the $ matches at the end of the line, at the second \n, and cat is not directly before that. I guess you want the regex engine to first treat the input as a multi-line input, extract cat\n as the first line, and then have cat$ match successfully in that single line? What about cat$ and dog$ against cat\ndog\n?
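
For reference, this is a sketch of what Python's `re` currently does with that last example, with and without the multi-line flag. The variable names are made up for the illustration.

```python
import re

text = "cat\ndog\n"

# No flags: only the last line can satisfy $.
cat_plain = re.search(r"cat$", text)   # None
dog_plain = re.search(r"dog$", text)   # matches

# re.MULTILINE: $ matches before every newline, so both succeed.
cat_multi = re.search(r"cat$", text, re.MULTILINE)   # matches
dog_multi = re.search(r"dog$", text, re.MULTILINE)   # matches
```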

burntsushi · 2024-03-21T00:05:19 1710979519

> I guess you want the regex engine

Ignoring compatibility concerns, I would want the regex engine to behave the same way RE2, Go's regexp package and Rust's regex engine behave. I remember specifically considering Cox's decision ~10 years ago when writing the initial implementation of the regex crate. I thought Perl's (and Python's) behavior on this point was whacky then and I still think it's whacky now. So I followed RE2's semantics.

The OP is right to be surprised by this. And folks will continue to be surprised by it for eternity because it's an extremely subtle corner case that doesn't have a consistent story explaining its behavior. (I know you have proffered one, but I don't find it consistent in the context of a general purpose regex engine that searches arbitrary strings and not just lines.)

Of course, compatibility is a trump card here. I've acknowledged that. Changing this behavior now would be too hard. The best you could probably do is some kind of migration, where you provide the more "sensible" behavior behind an opt-in flag. And then maybe Python 4 enables it by default. But it's a lot of churn, and while people will continue to be confounded by this so long as the behavior exists, it probably isn't a Huge & Common Deal In Practice. So it may not be worth fixing. But if you're starting from scratch? Yes, please don't implement $ this way. It should match the end of the string when 'm' is disabled and the end of any line (including end of string and possibly being Unicode aware, depending on how much you care about that) when 'm' is enabled.
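
As a rough sketch of those semantics in today's Python, you can approximate them by choosing the anchor yourself. `end_anchor` is a hypothetical helper, not part of `re`, and it ignores the Unicode line terminator question entirely.

```python
import re

def end_anchor(multiline: bool) -> str:
    """Hypothetical helper: pick an end anchor with the semantics above."""
    # \Z matches only at the true end of the string.
    # (?m:$) is a scoped multi-line $: end of string or before any \n.
    return r"(?m:$)" if multiline else r"\Z"

strict_exact = re.search("cat" + end_anchor(False), "cat")        # matches
strict_newline = re.search("cat" + end_anchor(False), "cat\n")    # None
multi_newlines = re.search("cat" + end_anchor(True), "cat\n\n")   # matches
```

With this, a trailing newline is never silently skipped unless multi-line behavior was asked for.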

IshKebab · 2024-03-21T07:41:15 1711006875

Dunno if you noticed who you are debating with here... :-D