LMQL (and guidance, https://github.com/guidance-ai/guidance) are much less efficient: they loop over the entire vocabulary at each decoding step, whereas we only do it once, at initialization.
Does looping over the vocabulary add much overhead to the tok/s? I imagine they're just checking if the input is in a set, and usually there are only ~30k tokens. That's somewhat intensive, but inference on the neural net feels like it'd take longer.
They’re checking regex partial matches for each possible completion, which is indeed intensive. You can look at Figure 2 in our paper (link in original post) for a simple comparison with MS guidance, which shows the difference.
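To make the difference concrete, here is a rough Python sketch of the two strategies. The toy vocabulary, the digits-only pattern, and the one-state "index" are made up for illustration; this is not the actual guidance/LMQL/Outlines code.

```python
import re

vocab = ["Hello", " world", "4", "42", "7!", "."]  # toy stand-in for a ~30k-token vocabulary
pattern = re.compile(r"[0-9]*")                    # constrain the output to digits

# Per-step approach (guidance/LMQL style): at every decoding step, re-test
# every token in the vocabulary against the pattern, i.e. O(|vocab|) regex
# evaluations per generated token. (Real implementations check *partial*
# matches; with this digits-only pattern, full and partial matches coincide.)
def allowed_per_step(prefix: str) -> set[str]:
    return {t for t in vocab if pattern.fullmatch(prefix + t)}

# Index-once approach (what the paper describes): the regex is a finite-state
# machine, so you can precompute which tokens are valid from each FSM state a
# single time at initialization. Each decoding step is then a dictionary
# lookup. The digits-only pattern has effectively one state, hence one entry.
allowed_from_state = {0: {t for t in vocab if pattern.fullmatch(t)}}

print(allowed_per_step("12"))   # scans the whole vocabulary again
print(allowed_from_state[0])    # no regex work at decode time
```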
This can get pretty pedantic. When counting lines of code, where do you draw the line between the blog generator itself and the tools it relies on? One could easily argue that in this case you should be counting the lines of code in pandoc, not this bash script.
That said, I do think this is the way to go: using a popular, generic tool (notably, one you don't have to maintain) to accomplish a specific task, and, more importantly, composing utilities together in a succinct and efficient way.
Also, if you used semicolons, or xargs with a pipe, you could make this one line :) Newlines are a pretty arbitrary measure; I wonder if there's a better measurement of simplicity, like branches or statements/expressions.
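For what it's worth, here's a quick sketch of what counting statements instead of lines could look like, using Python's ast module on a made-up one-liner (the script under discussion is bash, so treat this purely as an illustration of the metric):

```python
import ast

source = "x = 1; y = 2; print(x + y)"  # one line, three statements

tree = ast.parse(source)
n_statements = sum(isinstance(node, ast.stmt) for node in ast.walk(tree))
n_expressions = sum(isinstance(node, ast.expr) for node in ast.walk(tree))

# Line count says 1; the statement count (3) is harder to game with semicolons.
print(n_statements, n_expressions)
```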
The real story here is not the 60 lines, but the literate programming style used to write them.
Aside from that, this approach is very similar to the generator by Marijn Haverbeke (the CodeMirror author), although your 60 lines do lean more heavily on third-party packages.