
But the problem is that the tokens are subwords, which means that if you simply disallowed tokens containing the letter "e", you'd make it hard to complete a word once a prefix has already been emitted.

For example, the generation may start like this: "This is a way to solv-" or "This is th-", leaving no legal token that can finish the word.
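
To make the dead end concrete, here is a minimal sketch of naive per-step token banning using the transformers LogitsProcessor interface (the class name and wiring are illustrative, not taken from the comment above). Once the model has emitted "solv", every token that could finish the word contains an "e" and gets masked, so greedy decoding is stuck:

    from transformers import LogitsProcessor

    class BanLetterProcessor(LogitsProcessor):
        # Collect every vocabulary id whose surface form contains the
        # banned letter; mask them all before each sampling step.
        def __init__(self, tokenizer, letter="e"):
            self.banned_ids = [
                tid for tok, tid in tokenizer.get_vocab().items()
                if letter in tok.lower()
            ]

        def __call__(self, input_ids, scores):
            scores[:, self.banned_ids] = float("-inf")
            return scores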




If I understand it correctly, that's a valid concern, but the way structured generation libraries like outlines[1] work is that they can explore multiple candidate generations in parallel (beam search).

One beam could be "This is a way to solv-", with no obvious "good" next token (grammar wants "solve", which contains an "e"). Another beam could be "This way is solv-", with "ing" as the obvious next token.

It will select the best beam for the output.
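
As a rough illustration of that idea (not outlines' actual implementation; step_fn is a hypothetical callback that returns the legal, e-free next tokens with their log-probabilities), a constrained beam search drops beams whose prefix has no legal continuation and keeps the ones that do:

    import heapq

    def beam_search(step_fn, start, width=4, max_len=32):
        # Each beam is (cumulative logprob, token list). A beam whose
        # prefix admits no legal continuation (e.g. ending in "solv"
        # when all "e" tokens are banned) yields no candidates and
        # silently falls out of the pool.
        beams = [(0.0, list(start))]
        for _ in range(max_len):
            candidates = [
                (score + lp, seq + [tok])
                for score, seq in beams
                for tok, lp in step_fn(seq)
            ]
            if not candidates:
                break
            beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
        return max(beams, key=lambda b: b[0])[1]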

[1]: https://github.com/dottxt-ai/outlines


... What if you retrained it from scratch, on an e-less corpus?


Yes, that would probably work quite well, given enough training data. However, I interpreted the question/claim as being about a task that LLMs excel at, i.e., whether a general-purpose LLM can write text while avoiding a certain character.
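
For what it's worth, preparing such a corpus would be the easy part; a hypothetical filter is a one-liner, though note how little natural English survives it, since "e" is the most common letter:

    def eless_docs(docs):
        # Keep only documents that contain no "e" in any case.
        for doc in docs:
            if "e" not in doc.lower():
                yield doc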



