
Related quote from Karpathy:

Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

• Why can't LLM spell words? Tokenization.

• Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.

• Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

• Why is LLM bad at simple arithmetic? Tokenization.

• Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

• Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.

• What is this weird warning I get about a "trailing whitespace"? Tokenization.

• Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

• Why should I prefer to use YAML over JSON with LLMs? Tokenization.

• Why is LLM not actually end-to-end language modeling? Tokenization.

• What is the real root of suffering? Tokenization.


It’s weird because I’m pretty sure my brain does something similar when I speed read. I don’t actually, usually, read the words; instead I recognize the shape of the words (most common words) then I jump to the subject of the paragraphs and break down the meaning of the whole page in a second or so.


(Author Here)

In editing we couldn’t find a good place for this so cut it in the current version, but at one point had discussed a parallel with information density of speech as described by one paper. Essentially the paper found that in languages that were less information dense per syllable, speakers spoke faster to achieve similar information density as languages with higher density per syllable. You could see patching by entropy paralleling this if you consider that low entropy bytes in terms of Shannon entropy are less information dense.


That's generally true, but you also have the ability to stop and look closer if you want to. If someone asks you to count the letters in a word, you will stop to look at the letters individually. If you see an unfamiliar word like SolidGoldMagikarp, you can stop and break it apart. Tokenization prevents LLMs from doing this.


Generally, the current crop of LLMs seems like a pretty good analogue of the "scan reading" immediate, instinctual response to stimulus, but it seems to completely lack the higher level that can then go "Wait, that doesn't seem right, let's go back over that again". As with hallucinations, or seeing "faces" in dark shadows until you look again, it's doing a pretty good emulation of some level of consciousness.

Is that a fundamental difference in the level of processing? I haven't seen that sort of second-tier logic pop up as an emergent behavior from increasing scale yet, but will that come with time? I'm not sure.


You can prompt the model to do that kind of "stream of mind" process. It will maximize modeling uncertainty. This is my prompt:

> Write in a raw, real-time stream-of-consciousness style, as if actively solving a problem. Your response should feel like unpolished notes—messy, exploratory, and authentic. Show your full thought process, including missteps, dead ends, and course corrections. Use markers to signal mental states: Insights: "Wait -", "Hold on -", "Oh -", "Suddenly seeing -", "This connects to -". Testing: "Testing with -", "Breaking this down -", "Running an example -", "Checking if -". Problems: "Stuck on -", "This doesn’t work because -", "Need to figure out -", "Not quite adding up -". Progress: "Making headway -", "Starting to see the pattern -", "Explains why -", "Now it makes sense -". Process: "Tracing the logic -", "Following this thread -", "Unpacking this idea -", "Exploring implications -". Uncertainty: "Maybe -", "Could be -", "Not sure yet -", "Might explain -". Transitions: "This leads to -", "Which means -", "Building on that -", "Connecting back to -". Lean into real-time realizations: "Wait, that won't work because…" or "Ah, I missed this…" Show evolving understanding through short paragraphs, with natural pauses where ideas shift. Structure your thought evolution as follows: Begin with an initial take: "This might work because…" or "At first glance…" Identify problems or angles: "Actually, this doesn’t hold up because…" Test examples or counterexamples: "Let me try -", "What happens if -". Seek deeper patterns: "I’m seeing a connection -", "This ties back to -". Link broader implications: "This means -", "If this holds, then -". Admit confusion openly: "I don’t get this yet", "Something’s missing here". Reveal partial understanding: "I see why X, but not Y". Show failures and iterations: "Still not right - trying another approach". Embrace a debugging mindset, treating ideas like code—break them into steps, test logic, reveal failure modes, and iterate. Skip introductions and conclusions. Stop when you solve the problem or find clear next steps. 
Use short, direct sentences to mimic real-time thinking. The goal is to capture the messy, evolving nature of problem-solving and thought refinement.

Just try this; you can insert it at any point in an LLM chat session. I built it by reverse engineering the QwQ-32B model responses with Claude. QwQ itself is based on the GPT-o1 method.


FWIW this gave more entertaining but ultimately worse results than without on Claude for me, using the prompt:

> How many chickens can fit on a 747?


I've tried prompts like this with Claude, but it can get so nitpicky of itself that it runs out of space for the actual answer. It seems it does help to train the model to do it.


I've often wanted to talk with an LLM about its tokenization (e.g. how many tokens are there in "the simplest of phrases"). I wonder whether, if you fed it information about its tokenization (text like "rabbit is spelled r, a, b, b, i, t"), it could talk about it.


Well said!!

I’m waiting for reading studies on AI generated text, that’s a different kind of speed read


Meta's approach doesn't seem to throw out character grouping entirely, it just makes it dynamic.


Goodbye tokenization problems, hello encoding problems!


!Long post warning!

Tokenization is often scapegoated for many transformer limitations. I suppose it's because reading about the many limitations of the transformer architecture is harder than dumping everything on tokenization (which to be fair, is often indirectly involved with or exacerbating some deeper issue).

> Why can't LLM spell words? Tokenization.

LLMs can spell if you ask them to though. And there have been investigations into this capability (ref:2). Tokenization makes computations that involve spelling more difficult, but this is downstream of deeper computational limitations of the architecture.

> Why can't LLM do super simple string processing tasks like reversing a string?

Ditto.

> Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

Tokenization is also implicitly performing compression. If your tokenizer's corpus is focused only on English, basic information theory explains why it'll be less efficient for other languages. The net effect for non-English languages is longer sequences whose tokens are, on average, less information dense.
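To make the compression point concrete, here is a minimal sketch of a toy byte-pair encoder trained only on an English sentence and then applied to English and Japanese text. The corpus, the merge count, and the helper names (`train_bpe`, `merge_pair`, `encode`) are illustrative assumptions, not any production tokenizer.

```python
from collections import Counter


def merge_pair(seq, a, b):
    """Replace every adjacent occurrence of (a, b) in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to num_merges byte-pair merges from corpus (toy BPE, no pre-tokenization)."""
    seq = [bytes([b]) for b in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left that repeats
        merges.append((a, b))
        seq = merge_pair(seq, a, b)
    return merges


def encode(text: str, merges):
    """Encode text by starting from raw UTF-8 bytes and applying the merges in learned order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merges:
        seq = merge_pair(seq, a, b)
    return seq


# Assumed English-only training corpus.
english_corpus = ("the quick brown fox jumps over the lazy dog. " * 20).encode("utf-8")
merges = train_bpe(english_corpus, num_merges=50)

english = "the quick brown fox"
japanese = "素早い茶色の狐"
en_tokens = encode(english, merges)
ja_tokens = encode(japanese, merges)

bytes_per_token_en = len(english.encode("utf-8")) / len(en_tokens)
bytes_per_token_ja = len(japanese.encode("utf-8")) / len(ja_tokens)
print(f"English:  {bytes_per_token_en:.2f} bytes/token")
print(f"Japanese: {bytes_per_token_ja:.2f} bytes/token")
```

Because none of the learned merges involve the bytes that make up the Japanese string, it falls back to one token per byte, while the English string is packed into far fewer tokens.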

> Why is LLM bad at simple arithmetic? Tokenization.

Tokenization could treat digits separately, and I believe Llama 2 did this. But OpenAI built tiktoken, which does not do this, and Llama 3 uses tiktoken.

The transformer architecture also has limitations that make (default) arithmetic computations involving carries difficult to learn. You can read more about this in (ref:1).
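A small sketch of the digit-splitting difference mentioned above. It compares one token per digit with a greedy left-to-right grouping of up to three digits. The three-digit rule is an assumption chosen for illustration (it resembles, as far as I know, the number handling in some tiktoken encodings), and the function names are hypothetical.

```python
import re


def split_single_digits(number: str) -> list[str]:
    """One token per digit, so each position always means the same thing."""
    return list(number)


def split_up_to_three(number: str) -> list[str]:
    """Greedy left-to-right chunks of up to three digits (assumed grouping rule)."""
    return re.findall(r"\d{1,3}", number)


for n in ("1234", "12345"):
    print(n, split_single_digits(n), split_up_to_three(n))
```

With left-to-right grouping, the chunk `123` stands for different place values in `1234` and `12345`, which is one way the grouping can make carries harder to line up.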

> Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.

Why should it not? Either way, it doesn't have to halt, as the sampler can just ignore this token. But the distribution will still condition on it as a topic switch. The question should probably be: why did the LLM suddenly assign high probability to a stop token before finishing whatever it was writing?
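A minimal sketch of a sampler that ignores a stop token by masking its logit before sampling. The vocabulary size, the logits, and `EOS_ID` are made-up values for illustration, not taken from any real model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_ignoring(logits, banned_ids, rng):
    """Sample a token id after forcing the probability of banned ids to zero."""
    masked = [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]
    probs = softmax(masked)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs


EOS_ID = 3                      # hypothetical id of the stop token
logits = [1.0, 0.5, 0.2, 4.0]   # the model strongly prefers the stop token here
rng = random.Random(0)
token, probs = sample_ignoring(logits, {EOS_ID}, rng)
print(token, probs)
```

Masking only changes what gets emitted. Any stop token already in the context still shapes the distribution over what comes next.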

> What is this weird warning I get about a "trailing whitespace"? Tokenization.

Modeling decisions about how to treat whitespace are upstream of tokenization. These choices affect how the LLM models word boundaries. Things can be fine most of the time until they aren't.

There's also the issue of softmax. The way softmax is typically applied forces the model to always assign importance to some tokens, even when no strong relationships exist between them. This in turn leads to the model disproportionately dumping its focus on often semantically unimportant tokens like whitespace or punctuation. Misallocating attention in this manner can lead to wasting representational capacity due to overemphasizing unimportant tokens, perhaps inducing spurious correlations on whitespace. This issue propagates through the model, possibly leading to unexpected negative downstream effects.
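The softmax point can be checked numerically. In the sketch below, all attention scores are weak, yet standard softmax still hands out a full unit of weight. The second function is a variant with an extra 1 in the denominator, included purely for contrast; it is an illustrative assumption, not something the comment above proposes.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_plus_one(scores):
    """Contrast variant: an implicit extra slot with score 0 absorbs unused weight."""
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps) + math.exp(0.0 - m)
    return [e / total for e in exps]


weak_scores = [-8.0, -8.1, -7.9, -8.0]  # no key is a strong match for the query
standard = softmax(weak_scores)
relaxed = softmax_plus_one(weak_scores)
print(sum(standard))  # close to 1: weight must land somewhere
print(sum(relaxed))   # close to 0: the variant can decline to attend
```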

> Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

One step down, it's really a result of high dimensional random vectors.
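One possible reading of this remark is the geometric fact that random vectors in high dimensions are nearly orthogonal to one another. The sketch below only illustrates that fact; linking it to how rarely trained token embeddings behave is an assumption on my part, and the dimensions and pair counts are arbitrary.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine| over random Gaussian vector pairs of the given dimension."""
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs


rng = random.Random(0)
low_dim = mean_abs_cosine(4, pairs=50, rng=rng)
high_dim = mean_abs_cosine(2048, pairs=50, rng=rng)
print(f"dim=4:    {low_dim:.3f}")
print(f"dim=2048: {high_dim:.3f}")
```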

> Why should I prefer to use YAML over JSON with LLMs? Tokenization.

> Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

Tokenization does make counting more difficult, but the net benefit for programming languages, where whitespace can be semantically meaningful, is strongly positive. Even when whitespace is not meaningful, long runs of it are common. Not being careful about how tokenization effort is devoted to whitespace will significantly degrade code modeling ability in LLMs.

> Why is LLM not actually end-to-end language modeling? Tokenization.

This is correct, but it is not necessarily the case that a character or byte based model will automatically be better. The issue is that LLMs as currently devised spend the same amount of computation per token. This creates the immediate problem of making meaningful sequences, which will now be substantially longer, substantially more expensive to compute, generate and store in memory. This is what the posted paper seeks to address over naive byte level modeling. Although it's unclear from the provided tables if what's claimed is actually what's occurring.
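A back-of-the-envelope sketch of why naive byte-level modeling gets expensive. It assumes an average of 4 bytes per subword token (an assumed figure, not a measurement) and counts only sequence length and the number of pairwise attention interactions, ignoring everything else in the model.

```python
def byte_vs_token_cost(num_bytes: int, bytes_per_token: float) -> tuple[float, float]:
    """Return (sequence length ratio, pairwise attention ratio) of bytes vs tokens."""
    num_tokens = num_bytes / bytes_per_token
    length_ratio = num_bytes / num_tokens
    pairwise_ratio = (num_bytes ** 2) / (num_tokens ** 2)
    return length_ratio, pairwise_ratio


ASSUMED_BYTES_PER_TOKEN = 4.0  # assumption for illustration only
text = "Tokenization is at the heart of much weirdness of LLMs."
length_ratio, pairwise_ratio = byte_vs_token_cost(
    len(text.encode("utf-8")), ASSUMED_BYTES_PER_TOKEN
)
print(f"sequence is {length_ratio:.0f}x longer")
print(f"pairwise attention work is {pairwise_ratio:.0f}x larger")
```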

Character level modeling will also make learning long ranged dependencies harder. Subword tokenization also aids in memorization, which can be useful in learning from the tail of the distribution. The following idea is based on (ref:5).

Next-token prediction can be modeled as a hierarchical sampling process where problem instances (topics, natural language tasks), which are mixture distributions, are drawn from a metadistribution, and then data points (eg various strings) are sampled from specific subpopulations (ie clusters of task types) within those instances. Here, memorization is a key strategy since there's initial uncertainty about which features are relevant for predicting the next token. Particularly for rare examples, memorizing their details acts as a starting point for associating particular patterns with specific subpopulations, in turn allowing more accurate prediction of new points.

From that starting point, the model can eventually refine its associations as it encounters more data. This is key, for example, when sampling from the tail of the distribution, where data about subpopulations will be more limited. Making memorization and learning longer dependencies more challenging can lead to final models that face more difficulty during ICL inference, which depends, among other things, on the ability to infer which task from a mixture distribution is in play.

> What is the real root of suffering? Tokenization.

A better candidate is over-generalization.

1: https://arxiv.org/abs/2310.16028

2: What do tokens know about their characters and how do they know it? (https://aclanthology.org/2022.naacl-main.179.pdf)

3: https://arxiv.org/abs/2406.10851

4: Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP (https://arxiv.org/abs/2112.10508)

5: https://arxiv.org/abs/2012.06421


In all seriousness: why has it been years now and it feels like there is no incremental engineering-level progress on these issues? Doing some manual intervention on the tokenization, to at least remove exceptional tokens and add some semantics to how numbers are broken up, seems like a quick win.


(Author Here)

There is at least some work on character-based modeling, but it hasn’t scaled well before. The challenge, I think, with something more ad hoc for exceptional tokens is that it’s hard to see gains, since they are by definition infrequent. If the text is rare enough, BPE should produce many single-byte tokens, so current models actually expend more compute on these rare sequences.

BLT scales well because it expends less compute (by patching) on more predictable (low entropy) byte sequences. Current models only to some degree get this benefit, if it’s a larger BPE token, but that only goes so far.
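A rough sketch of the patching idea described above: consecutive bytes are grouped into one patch until a byte with high predicted entropy shows up, at which point a new patch starts. The threshold rule, the per-byte entropy values, and the function name are all assumptions for illustration, not the paper's exact procedure. In practice the entropies would come from a small byte-level model.

```python
def patch_by_entropy(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Start a new patch at every byte whose predicted entropy exceeds threshold."""
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    patches = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


data = b"the cat"
# Hypothetical entropies (in bits): high at word starts, low inside words.
entropies = [3.1, 0.6, 0.3, 0.9, 2.8, 0.7, 0.4]
patches = patch_by_entropy(data, entropies, threshold=2.0)
print(patches)  # [b'the ', b'cat']
```

Predictable stretches collapse into long patches, so the expensive part of the model runs fewer steps over them.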

So it’s really two related, but different motivations.


>In all seriousness: why has it been years now and it feels like there is no incremental engineering-level progress on these issues?

From where I'm standing, LLMs appear to be the fastest moving technological field in history.


A field can seem to be going quickly and going nowhere at the same time. Or rather a new technique can be invented and then exhausted in the time it takes somebody to get a PhD. (See https://en.wikipedia.org/wiki/Renormalization_group applied to phase transitions, which turned up just in time for the physics job crisis of 1970)

I didn't ever believe that there was going to be a GPT-5 trained with exponentially more text and resources. Not only is there not enough text, but that's the path to ruin. Why?

Cycle time. Two years ago we had little idea of how those models work so I knew there was a huge room in improving performance. It gets the cost down, it lets you put the models on your device, and it speeds up development. If I can train 10 models in the time it takes you to train 1 model I can make much faster progress.

However, even a GPT-15 trained with a Dyson sphere is going to struggle to sort things. (Structurally a pure LLM can't do that!) My #1 beef with Microsoft's Copilot is this: ask it whether it can sort a certain list of items (either a list you are discussing with it, or say "states of the United States ordered by percent water area") and it will say yes. Ask it how likely it is to get the order right and it will say "very high". But when you try it, the list comes out totally wrong.

It is equally unable to "help me make an atom bomb" except in the bomb case it will say that it can't but in the sorting case it says it can.

The obvious answer is that it should use tools to sort. That's right but the problem of "knowing what you can really do with your tools" is philosophically challenged. (With problems so intractable it leads people like Roger Penrose to conclude "I couldn't do math if I wasn't a thetan")
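A minimal sketch of the "use tools to sort" idea: the model only emits a structured request, and ordinary code does the sorting deterministically. The tool registry, the call format, and the item names and numbers are hypothetical placeholders, not any vendor's tool-calling API or real data.

```python
def sort_tool(items, by_index=0, reverse=False):
    """Deterministically sort a list of rows by the value at by_index."""
    return sorted(items, key=lambda row: row[by_index], reverse=reverse)


TOOLS = {"sort": sort_tool}  # hypothetical registry of available tools


def run_tool_call(call: dict):
    """Execute a tool call of the assumed form {'name': ..., 'arguments': {...}}."""
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# A call the model might emit; the rows are made-up placeholder values.
call = {
    "name": "sort",
    "arguments": {
        "items": [["state_a", 2.5], ["state_b", 41.5], ["state_c", 13.0]],
        "by_index": 1,
        "reverse": True,
    },
}
result = run_tool_call(call)
print(result)
```

This handles the sorting itself. It does nothing for the harder problem raised above, which is the model knowing when it needs the tool.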


I'm not really sure I understand your sorting example, maybe try it out in gpt and post the link to show exactly what you mean.

The refusal of the model is something trained into the model by the process of rlhf, and it can also be untrained, by the process of abliteration [1].

Also, LLMs are capable of using tools in this very moment [2].

[1]: https://huggingface.co/blog/mlabonne/abliteration [2]: https://www.anthropic.com/news/analysis-tool


I'm deliberately blurring refusal with having an accurate picture of its own abilities and, past that, having an accurate picture of what it can do given tools. Both are tested by

   "Can you X?"

With refusal you find out just how shallow it is, because it really will answer all sorts of questions that are "helpful" in making a nuclear bomb, but when you ask it directly it shuts up. In another sense nothing it does is "helpful", because it's not going to hunt down some people in Central Asia who have 50kg of U235 burning a hole in their pocket for you, which is what would actually "help".

I use tool-using LLMs frequently, but I find they often need help using their tools. It is a lot of fun to talk to Windsurf about the struggles it has with its tools, and it feels strangely satisfying to help it out.


You totally ignored "on these issues" and are essentially saying there is no need to work on that as they worked on something else, which is extremely strange for a thing which feels like a really trivial win, and should be shocking.

Whether you like it or not, it is entirely fair to look at an entire ecosystem and ask why some trivial thing that everyone talks about all the time hasn't seen any attention even if the entire ecosystem is getting widespread advancement.

Like, I think it would also be fair to complain about how bad the hinge on AirPods is, causing the case to explode when dropped and your earbuds to fly everywhere (potentially getting very dirty) as well as wear out and cause spurious activation (leading to audio routing issues and rapid battery drain).

To then point out that this is one of the most successful consumer devices in recent years and was a remarkable improvement to what came before as well as a continuing achievement of engineering as they do in fact get better in amazing ways every couple years is more than just a non sequitur: it is frankly just annoying.



Agree. The same applies to '!'. '!' is small compared to `== false` and tends to get missed. The IDE should highlight '!' and '?' in Rust.

On the contrary, the IDE should fold 'if err != nil' in golang code to make meaningful code more visible.


This is a kind of "false dichotomy". Not using single-letter variable names doesn't necessarily mean using long, verbose variable names. A variable name can be one or two words that explain it well without using much space. Things usually need balance rather than extremes.


energy = mass * light_speed ^ 2

Yes, it makes it much simpler!


Alias-xor-mutability is useful for interior pointers (referencing into an enum, referencing into a Vec) because modification may change the memory layout and invalidate the interior pointer. But when the memory layout does not change in singlethread case, alias-xor-mutability can reject safe programs.


I don't understand what "singlethread case" means here. If you mean that only a single reference is active at any given point of the program, Rust allows you to make that a mutable borrowing. If multiple aliased references are active, then you don't really have a "single thread" of control, and modifying one reference may invalidate assertions made by others (even something as simple as n != 0 prior to a division, or n < array_size after a bound check). Shared mutable state must be signaled very clearly if you want to keep a sensible semantics, and Rust does this with the Cell<>, RefCell<>, Atomic* etc. constructs.
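The bound-check example can be sketched in Python, which has no borrow checker and so lets the hazard through at runtime. A check made through one reference is silently invalidated by a mutation through an alias, even in a single thread. The function and variable names are hypothetical, and this only illustrates the class of bug that Rust's rule rejects at compile time.

```python
def read_after_check(buffer, index, between_check_and_use):
    """Bounds-check index, run a callback, then read buffer[index]."""
    if index < len(buffer):
        between_check_and_use()      # may mutate the same list through an alias
        try:
            return buffer[index]     # the earlier check may no longer hold
        except IndexError:
            return None
    return None


shared = [10, 20, 30]
alias = shared                       # second reference to the same list

untouched = read_after_check(shared, 2, lambda: None)
invalidated = read_after_check(shared, 2, lambda: alias.clear())
print(untouched, invalidated)  # 30 None
```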

