But the problem is that the tokens are subwords, which means that if you simply disallowed every token containing an "e", you'd make it hard to complete a word given a prefix.
For example, the generation may start like this: "This is a way to solv-", or "This is th-"
If I understand it correctly, that's a valid concern, but the way structured generation libraries like Outlines[1] work is that they can keep multiple candidate continuations alive during inference (beam search).
One beam could be "This is a way to solv-". With no obvious "good" next token.
Another beam could be "This way is solv-". With "ing" as the obvious next token.
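To make that concrete, here's a minimal, self-contained sketch of the idea (this is not Outlines' actual API; the vocabulary and scoring function are invented for illustration): a beam search that masks every token containing the banned character. A beam ending in "solv" scores poorly for one step, but wins once "ing " becomes available.

```python
from typing import Callable


def beam_search(
    score_fn: Callable[[str, str], float],  # log-prob-like score of token given prefix
    vocab: list[str],
    banned_char: str,
    beam_width: int = 2,
    steps: int = 5,
) -> str:
    """Beam search with a hard constraint: tokens containing banned_char are masked."""
    beams = [("", 0.0)]  # (text so far, cumulative score)
    for _ in range(steps):
        candidates = []
        for text, score in beams:
            for tok in vocab:
                if banned_char in tok:
                    continue  # hard constraint: never propose this token
                candidates.append((text + tok, score + score_fn(text, tok)))
        # Keep the top beams; a beam like "...solv" can survive here even if
        # it has no good next token yet, and recover on a later step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]


# Invented toy scores: a few hand-picked continuation bonuses, uniform otherwise.
BONUS = {
    ("", "This "): 2.0,
    ("This ", "way "): 2.0,
    ("This way ", "is "): 2.0,
    ("This way is ", "solv"): 1.0,   # looks weaker at first...
    ("This way is ", "hard "): 1.5,  # ...than this greedy choice
    ("This way is solv", "ing "): 2.0,  # ...but pays off one step later
}


def toy_score(text: str, tok: str) -> float:
    return BONUS.get((text, tok), -1.0)


VOCAB = ["This ", "way ", "is ", "solv", "ing ", "hard ", "es"]  # "es" gets masked
out = beam_search(toy_score, VOCAB, banned_char="e", beam_width=2, steps=5)
print(out)  # → "This way is solving "
```

With a beam width of 1 (greedy decoding) the search would commit to "hard " and never reach "solving"; keeping a second beam alive is what lets the "solv" prefix recover.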
Yes, that would probably work quite well, given enough training data. However, I interpreted the question/claim as being about a task that LLMs excel at, i.e. that writing text while avoiding a certain character is something a general-purpose LLM can do.