> but couldn't we add some training data to teach the LLM how to spell?

Sure, but then we would lose a benchmark for measuring the progress of emergent behavior.

The goal is not to add capabilities one at a time by hand, because that doesn't scale and we would never finish. The goal is for the model to pick up new capabilities automatically, all on its own.

Training data is already provided by humans and certainly already includes spelling instruction, which the model is blind to because of forced tokenization. Tokenizing on words is itself an arbitrary capability added one at a time; it's just the wrong one. LLMs should be tokenizing by letter, but they don't, because they aren't good enough yet, so they get a massive deus ex machina (human ex machina?) of word-ish tokenization.
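
For illustration, a minimal sketch using OpenAI's tiktoken library (an assumption on my part; any BPE-style tokenizer would make the same point) shows what the model actually receives in place of letters:

    import tiktoken

    # load the BPE vocabulary used by GPT-4-class models
    enc = tiktoken.get_encoding("cl100k_base")

    tokens = enc.encode("strawberry")
    print(tokens)                             # a short list of integer token IDs
    print([enc.decode([t]) for t in tokens])  # multi-letter chunks, e.g. 'str' / 'aw' / 'berry'

The exact split depends on the vocabulary, but whatever it is, the model only ever sees the integer IDs; the letters inside each chunk never reach it, which is exactly the blindness described above.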