Some random things that the author seems to have missed:
> but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings
Many more languages support that:
C# $"{x} plus {y} equals {x + y}"
Python f"{x} plus {y} equals {x + y}"
JavaScript `${x} plus ${y} equals ${x + y}`
Ruby "#{x} plus #{y} equals #{x + y}"
Shell "$x plus $y equals $(echo "$x+$y" | bc)"
Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
> Tcl
Tcl is funny because comments are only recognized in code, and since it's homoiconic, it's very hard to distinguish code and data. { } are just funny string delimiters. E.g.:
xyzzy {#hello world}
Is xyzzy a command that takes a code block or a string? There's no way to tell. (Yes, that means that the Tcl tokenizer/parser cannot discard comments: only at evaluation time it's possible to tell if something is a comment or not.)
my $foo = 5;
my $bar = 'x';
my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
print "$quux\n";
This prints out:
I have 5 x's: xxxxx
The "@{[...]}" syntax is abusing Perl's ability to interpolate an _array_ as well as a scalar. The inner "[...]" creates an array reference and the outer "@{...}" dereferences it.
For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.
I understand that's constructing an array. What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.
It's not...? Well, not directly: It's string interpolating an array of values, and the array is constructed using values from the results of expressions. These are separate features that compose nicely.
> What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.
Why? Surely it is easier for both the language and the programmer to have a rule for what you can do when constructing references to anonymous arrays, without having to special case whether that anonymous array is or is not in a string (or in any one of the many other contexts in which such a construct may appear in Perl).
Doesn't really matter for a syntax highlighter, because it is out of your control what you get. For the llamafile highlighter even more so since it supports other legacy quirks, like C trigraphs as well.
My view on this is that it shouldn’t be interpreted as code being embedded inside strings, but as a special form of string concatenation syntax. In turn, this would mean that you can nest the syntax, for example:
"foo { toUpper("bar { x + y } bar") } foo"
The individual tokens being (one per line):
"foo {
toUpper
(
"bar {
x
+
y
} bar"
)
} foo"
If `+` does string concatenation, the above would effectively be equivalent to:
"foo " + toUpper("bar " + (x + y) + " bar") + " foo"
Indeed in some of the listed languages you can nest it like that, but in others (e.g. Python) you can't. I would guess they deliberately don't want to enable that and it's not a problem in their parser or something.
As of Python 3.6 you can nest f-strings. Not all formatters and highlighters have caught up, though.
Which is fun, because correct highlighting depends on language version. Haskell has similar problems where different compiler flags require different parsers. Close enough is sufficient for syntax highlighting, though.
Python is also a bit weird because it calls the format methods, so objects can intercept and react to the format specifiers in the f-string while being formatted.
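To make the interception concrete, here is a minimal sketch. The `Celsius` class and its "F" specifier are made up for illustration; the only real mechanism is the `__format__` hook, which receives whatever follows the colon in the replacement field.

```python
class Celsius:
    """Hypothetical value type that interprets its own format specifiers."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # The text after ':' in the f-string arrives here untouched.
        if spec == "F":
            return f"{self.degrees * 9 / 5 + 32:.1f}°F"
        if spec == "":
            return f"{self.degrees:.1f}°C"
        raise ValueError(f"unsupported format spec: {spec!r}")


t = Celsius(21.5)
plain = f"{t}"         # empty spec
fahrenheit = f"{t:F}"  # the made-up "F" spec is handled by the object itself
```

So the part after the colon is not string syntax at all from the object's point of view; it is an argument to a method call.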
You're using an old Python version. On recent versions, it's perfectly fine:
Python 3.12.7 (main, Oct 3 2024, 15:15:22) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(f"foo {"bar"}")
foo bar
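For anyone stuck on an interpreter older than 3.12: nesting should still work there as long as the inner literal uses a different quote type from the enclosing f-string (reusing the same quote is what PEP 701 added in 3.12). A small sketch, with `x` and `y` as placeholder values:

```python
x, y = 2, 3

# The inner f-string uses single quotes, so this parses on Python 3.6 and later.
nested = f"foo {f'bar {x + y} bar'.upper()} foo"
```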
Even when nesting is disallowed, my point is that I find it preferable to not view it (and syntax-highlight it) as a “special string” with embedded magic, but as multiple string literals with just different delimiters that allow omitting the explicit concatenation operator, and normal expressions interspersed in between. I think it’s important to realize that it is really just very simple syntactic sugar for normal string concatenation.
While you're conceptually right, in practice I think it bears mentioning that in C# the two syntaxes compile differently. This is because C#’s target platform, the .NET Framework, has always had a function called `string.Format` that lets you write this:
var str = string.Format("{0} is {1} years old.", name, age);
When interpolated strings were introduced later, it was natural to have them compile to this instead of concatenation.
As in Python, and in Rust with the format! macro (which doesn't even support arbitrary expressions), the full C# syntax for interpolated/formatted strings is {<interpolationExpression>[,<alignment>][:<formatString>]}, i.e. there is more going on than just a simple wrapper around concat or StringBuilder.
When not using the format specifiers or alignment it will indeed compile to just string.Concat (which is also what the + operator for strings compiles to). This is similar to C compilers choosing to call puts instead of printf if there is nothing to be formatted.
If it’s treated strictly as simple concatenation syntactic sugar then you are allowing something like print("foo { func() );
Which seems janky af.
> just very simple syntactic sugar for normal string concatenation.
Maybe. There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.
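Python is one language that behaves this way, if I'm reading the point correctly. A quick sketch (`count` is just a placeholder value):

```python
count = 3

# Interpolation converts the value via format(), so this works.
interpolated = f"count is {count}"

# Concatenation does not convert implicitly: str + int raises TypeError.
try:
    concatenated = "count is " + count
except TypeError:
    concatenated = None

# An explicit str() makes the two forms agree.
explicit = "count is " + str(count)
```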
I failed to mention the balancing requirement, that should of course remain. But it's an artificial requirement, so to speak, that is merely there to double-check the programmer's intent. The compiler/parser wouldn't actually care (unlike for an arithmetic expression with unbalanced parentheses, or scope blocks with unbalanced braces), the condition is only checked for the programmer's benefit.
> There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.
Many languages have a string concatenation operator that does implicit conversion to string, while still having a string interpolation syntax like the above. It's kind of my point that both are much more similar to each other than many people seem to realize.
It's exactly the point that this is one token. It's a string literal with opening delimiter `"` and closing delimiter `{`, and that whole token itself serves as a kind of opening "brace". Alternatively, you can see `{` as a contraction of `" +`. Meaning, aside from the brace balancing requirement, `"foo {` does the same as `"foo " +` would.
Still alternatively, you could imagine a language that concatenates around string literals by default, similar to how C behaves for sequences of string literals. In C,
"foo" "bar" "baz"
is equivalent to
"foobarbaz"
Similarly, you could imagine a language where
"foo" some_variable "bar"
would perform implicit concatenation, without needing an explicit operator (as in `"foo" + x + "bar"`). And then people might write it without the inner whitespace, as:
"foo"some_variable"bar"
My point is that
"foo{some_variable}bar"
is really just that (plus a condition requiring balanced pairs of braces). You can also re-insert the spaces for emphasis:
"foo{ some_variable }bar"
The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.
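Python happens to share C's rule for adjacent literals, so the comparison is easy to try there. This is only a sketch of the equivalence for a string-valued variable; Python has no implicit concatenation between a literal and a variable, so that middle step has to be spelled with `+`.

```python
some_variable = "X"

# Adjacent string literals are joined at compile time, as in C.
joined = "foo" "bar" "baz"

# No implicit concatenation with a variable, so write the operator out.
explicit = "foo" + some_variable + "bar"

# The interpolated form yields the same string.
interpolated = f"foo{some_variable}bar"
```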
> How does this change how you highlight either?
You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals.
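A rough sketch of that tokenization, in Python. It assumes a toy language with no escape sequences and no braces inside expressions (so a `}` seen while an interpolation is open always resumes the string); `tokenize` and the token kinds are names invented for this example.

```python
import re

# A code token is either a run of word characters or a single symbol.
CODE_TOKEN = re.compile(r"\w+|[^\s\w]")


def tokenize(src):
    """Split source into ("string", text) and ("code", text) tokens."""
    tokens = []
    depth = 0  # how many "...{ string parts are still waiting for }..."
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == '"' or (ch == "}" and depth > 0):
            # A string part starts at " or } and runs to the next { or ".
            j = i + 1
            while j < len(src) and src[j] not in '{"':
                j += 1
            if j == len(src):
                raise ValueError("unterminated string part")
            if ch == '"' and src[j] == "{":
                depth += 1   # "...{ opens an interpolation
            elif ch == "}" and src[j] == '"':
                depth -= 1   # }..." closes one
            tokens.append(("string", src[i:j + 1]))
            i = j + 1
        else:
            match = CODE_TOKEN.match(src, i)
            tokens.append(("code", match.group()))
            i = match.end()
    if depth != 0:
        raise ValueError("unbalanced interpolation braces")
    return tokens


example = '"foo { toUpper("bar { x + y } bar") } foo"'
kinds_and_text = tokenize(example)
```

A highlighter would then color every "string" token the same way as an ordinary literal and treat the "code" tokens as normal expression tokens. The `depth` counter is the only extra state needed compared to plain string literals.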
Fair enough. The point, as you have acknowledged, being that unlike + you have to treat { specially for balancing (and separately from the “).
> The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.
I guess. I just don’t know what being an illusion means formally. It’s not an illusion to the person that has to implement the state machine that balances the delimiters.
> You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals
Emacs does it this way FWIW. But I’m not sure how important it is to dictate that the brace can’t be a different color.
In any event, I can agree your design is valid (Kotlin works this way), but I don’t necessarily agree it is any more valid than, say, how Python does it, where there can be format specifiers and implicit conversion to string is performed, whereas with concatenation it is not. I’m not seeing the clear definitive advantage of interpolated strings being an equivalent to concatenation vs some other type of method call.
The other detail is order of evaluation or sequencing. String concat may behave differently. Not sure I agree it is wrong, because at the end of the day it is distinct looking syntax. Illusion or not, it looks like a neatly enclosed expression, and concatenation looks like something else. That they might parse, evaluate or behave differently isn't unreasonable.
Ruby takes this to 100. As much as I love Ruby, this is valid Ruby, and I can't defend this:
puts "This is #{<<HERE.strip} evil"
incredibly
HERE
Just to combine the string interpolation with her concern over Ruby heredocs.
My other favorite evil quirk in Ruby is that whitespace is a valid quote character. The string (without the quotes) "% hello " is a quoted string containing "hello" (without the quotes), as "%" in contexts where there is no left operand initiates a quoted string and the next character indicates the type of quotes. This is great when you do e.g. "%(this is a string)" or "%{this is a string}". It's not so great if you use space (I've never seen that in the wild, so it'd be nice if it was just removed; even irb doesn't handle it correctly).
Yes, it's roughly limited in use to places where it is not ambiguous whether it would be the start of a quoted string or the modulus operator, and after a method name would be ambiguous.
> but, at the intersection is "ruby parsing is the 15th circle of hell"
It's surprisingly not that hard (this part, anyway). You "just" need to create a forward reference, and keep track of heredocs in your parser, and when you come to the end of a line with heredocs pending, you need to parse them and assign them to the variable references you've created.
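Here is a rough line-based sketch of that bookkeeping, written in Python rather than Ruby. It only handles plain `<<TAG` with an uppercase tag; `<<-`, `<<~`, quoted tags, and `<<` used as a shift operator are all ignored, and `resolve_heredocs` and the placeholder format are invented for the example.

```python
import re

HEREDOC = re.compile(r"<<([A-Z_]+)")


def resolve_heredocs(source):
    """Return (code_lines, bodies); each <<TAG becomes a numbered placeholder."""
    lines = source.split("\n")
    code_lines, bodies = [], []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        # Forward references: one placeholder per heredoc opened on this line.
        tags = HEREDOC.findall(line)
        first = len(bodies)
        numbers = iter(range(first, first + len(tags)))
        code_lines.append(
            HEREDOC.sub(lambda m: f"<heredoc {next(numbers)}>", line)
        )
        # End of line reached: read the pending bodies in order.
        for tag in tags:
            body = []
            while i < len(lines) and lines[i] != tag:
                body.append(lines[i])
                i += 1
            i += 1  # skip the terminator line
            bodies.append("\n".join(body) + "\n")
    return code_lines, bodies


source = 'puts "This is #{<<HERE.strip} evil"\nincredibly\nHERE\nputs 1'
code_lines, bodies = resolve_heredocs(source)
```

The point of the placeholder is that the rest of the line (`.strip} evil"`) can be parsed normally before the heredoc body has been read.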
It is dirty, though, and there are so many dark corners of parsing Ruby. Having written a partial Ruby parser, and being a fan of Wirth-style grammar simplicity while enjoying using Ruby is a dark, horrible place to live in. On the one hand, I find Ruby a great pleasure to use, on the other hand, the parser-writer in me wants to spend eternity screaming into the void in pain.
One cool feature of C# interpolated strings is that they are lazy. Many loggers used to implement their own interpolation because something like
log.trace($"Entering iteration {i} for customer {c.ID} [{c.ShortName}]");
in a hot loop would call string.Concat every time it was called before the logger could bail out of the method.
C# lets you declare an overload that accepts a `DefaultInterpolatedStringHandler` (or your own custom implementation of the handler pattern) and this overload will take precedence and allow you to delay the building of the string until after you've checked whether logging it is required.
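I can't show the C# handler pattern from memory with confidence, but the eager-versus-deferred difference is easy to demonstrate with an analogue in Python's standard `logging` module. This is not the C# mechanism, just the same problem in another language; the `Customer` class and logger name are made up.

```python
import logging


class Customer:
    """Hypothetical object that counts how often it is rendered as text."""

    renders = 0

    def __str__(self):
        Customer.renders += 1
        return "customer-42"


log = logging.getLogger("interpolation_demo")
log.setLevel(logging.INFO)  # DEBUG messages are filtered out

c = Customer()

# Deferred: the logger checks the level first and never formats the message.
log.debug("Entering iteration for %s", c)
after_deferred = Customer.renders

# Eager: the f-string is built before log.debug() is even called.
log.debug(f"Entering iteration for {c}")
after_eager = Customer.renders
```

With the level set to INFO, the first call never renders the customer, while the second renders it once even though nothing is logged.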
> Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
I'm guessing this is the reason for the :) but to be clear for anyone else: Make is only doing half of the work, whatever comes after "shell" is being passed to another executable, then make captures its stdout and interpolates that. The other executable is "sh" by default but can be changed to whatever.
Python f-strings are kind of wild. They can even contain comments! They also have slightly different rules for parsing certain kinds of expressions, like := and lambdas. And until fairly recently, strings inside the expressions couldn't use the quote type of the f-string itself (or backslashes).
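A small sketch of the `:=` and lambda quirks, which should run on Python 3.8 and later (comments inside f-strings need 3.12, so they are left out here). The variable names are placeholders.

```python
x = 3

# Without parentheses this is not a walrus: '=5' is read as a format spec
# (width 5, with padding placed after any sign).
padded = f"{x:=5}"

# Parenthesised, it is an assignment expression that binds y.
assigned = f"{(y := 5)}"

# A lambda also needs parentheses, because its ':' would start a format spec.
doubled = f"{(lambda v: v * 2)(21)}"
```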
There is a record constructor syntax in VHDL using attribute invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr). This means that if the first field of your record is a subtype of a character type, you can get a record construction expression like this one: REC'('0',1,"10101").
Good luck distinguishing between '(' as a character literal and "'", "(" and "'0'" at lexical level.
Haskell.
Haskell has context-free syntax for bracketed ("{-" ... "-}") comments. Lexer has to keep bracketed comment syntax balanced (for every "{-" there should be accompanying "-}" somewhere).
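The balancing amounts to a depth counter instead of a single in-comment flag. A minimal sketch in Python, assuming the input contains no string literals, `--` line comments, or pragmas that would need separate handling (`strip_block_comments` is a name made up for the example):

```python
def strip_block_comments(src):
    """Remove nested {- ... -} comments, tracking nesting depth."""
    out = []
    depth = 0
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "{-":
            depth += 1
            i += 2
        elif pair == "-}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(src[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated block comment")
    return "".join(out)


stripped = strip_block_comments("a {- one {- two -} still comment -} b")
```

A regex-only highlighter can't do this for arbitrary depth, which is why it needs lexer state.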
I wish PG had dollar-bracket quoting where you have to use the closing bracket to close, that way vim showmatch would work trivially. Something like ${...}$.
> SQL
PostgreSQL has the very convenient dollar-quoted strings: https://www.postgresql.org/docs/current/sql-syntax-lexical.h... E.g. these are equivalent: