> This is more sophisticated than a Markov process.
Nothing prevents a Markov process from having a world model. A Markov process carries no hidden state between steps and generates a sequence based on statistics over the previous entries, so LLMs as they are now are Markov processes.
Many of the dumb behaviors we see from LLMs today come from that lack of internal state between tokens: the model doesn't remember why it generated the previous token, so it can easily produce inconsistent answers. That's why LLMs being a Markov process is an important point to highlight; it makes their thinking very different from how humans think.
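To make that claim concrete, here is a toy sketch of the generation loop being described (the model and its token table are made up, not any real LLM API): the only thing carried from one step to the next is the token sequence itself.

```python
import random

# Stand-in for a trained model: probabilities for the next token given the
# visible context. In a real LLM this would be a full transformer forward
# pass; here it is a hypothetical lookup, just to show the shape of the loop.
def next_token_distribution(context):
    if context and context[-1] == "the":
        return {"cat": 0.6, "dog": 0.4}
    return {"the": 0.7, "a": 0.3}

def generate(prompt_tokens, steps):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        dist = next_token_distribution(tuple(tokens))  # depends only on the visible tokens
        choice = random.choices(list(dist), weights=dist.values())[0]
        tokens.append(choice)
        # Nothing else survives to the next iteration: the "reason" for this
        # choice is not stored anywhere outside the token sequence itself.
    return tokens

print(generate(["the"], 5))
```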
The transformer architecture is NOT a Markov process, by mathematical definition of a Markov process. This is not even debatable. It's a mathematical fact.
> Many of the dumb behaviors we see from LLMs today come from that lack of internal state between tokens: the model doesn't remember why it generated the previous token, so it can easily produce inconsistent answers
The attention mechanism in the transformer architecture models relations between tokens within the context window, and does the exact opposite of what you are describing here. This is one aspect of LLMs that violates the Markov property.
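For reference, a bare-bones sketch of the mechanism in question (made-up dimensions, no projections, masking, or multiple heads): every output position mixes information from every token in the window, not just the immediately preceding one.

```python
import numpy as np

# Minimal scaled dot-product self-attention over a context of length n.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n) pairwise relations between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the whole context
    return weights @ V                              # each output row is a mix of all tokens

n, d = 4, 8                  # 4 tokens in context, 8-dim embeddings (arbitrary)
X = np.random.randn(n, d)    # stand-in token representations
out = attention(X, X, X)     # self-attention: every token attends to every token
print(out.shape)             # (4, 8)
```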
> The transformer architecture is NOT a Markov process, by mathematical definition of a Markov process. This is not even debatable. It's a mathematical fact.
What? Yes, it is.
> The attention mechanism in the transformer architecture models relations between tokens within the context window, and does the exact opposite of what you are describing here. This is one aspect of LLMs that violates the Markov property.
The context window is finite, so it is the "previous step." You know those dumb Markov chains built from simple word statistics? They also look several words back; they don't go off a single word. LLMs are just that with a much larger lookback and some extra machinery, but none of that changes the fundamentals enough to make it not a Markov process.
With a large enough context size you could argue it is now fundamentally different in practice, but in theory it is the same. There is no "hidden state"; it's just the previous n words that define the next word.
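For comparison, this is roughly what such a lookback-n word Markov chain looks like (toy corpus and parameters, purely illustrative); the argument above treats an LLM's context window as this state tuple, just vastly longer and with a learned scoring function instead of a count table.

```python
import random
from collections import defaultdict, Counter

# Classic order-n word Markov chain: the "state" is just the previous n words.
def train(words, n):
    table = defaultdict(Counter)
    for i in range(len(words) - n):
        state = tuple(words[i:i + n])
        table[state][words[i + n]] += 1
    return table

def generate(table, seed, steps, n):
    out = list(seed)
    for _ in range(steps):
        state = tuple(out[-n:])        # only the last n words matter
        options = table.get(state)
        if not options:
            break
        out.append(random.choices(list(options), weights=options.values())[0])
    return out

corpus = "the cat sat on the mat and the cat ran off the mat".split()
table = train(corpus, n=2)
print(" ".join(generate(table, seed=["the", "cat"], steps=8, n=2)))
```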
If you define the state broadly enough to include the state of the entire machine itself, including all of its internal representations, weights, activations, etc., then you are playing a funny trick here.
By the same reasoning, a human brain is also a Markov process.
What you are doing here is a vast oversimplification and it is practically useless for understanding how LLMs work.