
This confuses me. You have your model, you have your tokens.

If the tokens are bit-for-bit-identical, where does the non-determinism come in?

If the tokens are only roughly-the-same-thing-to-a-human, sure I guess, but converging on roughly the same output for roughly the same input should inherently be a goal of LLM development.




Most any LLM has a "temperature" setting, which controls how much randomness is injected at the sampling step (the weights themselves stay fixed) to intentionally cause exactly this nondeterministic behavior. Good for creative tasks, bad for repeatability. If you're running one of the open models, set the temperature down to 0 and it suddenly becomes perfectly consistent.
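
Roughly, a minimal sketch of what that sampling step looks like (the logit values below are made up for illustration; real stacks differ in the details):

    import numpy as np

    def sample_next_token(logits, temperature, rng):
        """Sample a token index from logits, scaled by temperature."""
        if temperature == 0:
            # Degenerate case: greedy decoding, fully deterministic.
            return int(np.argmax(logits))
        scaled = logits / temperature          # higher T flattens the distribution
        probs = np.exp(scaled - scaled.max())  # softmax (shifted for numerical stability)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng()
    logits = np.array([3.0, 2.5, 1.0])  # made-up scores for 3 candidate tokens
    print(sample_next_token(logits, temperature=0, rng=rng))    # always token 0
    print(sample_next_token(logits, temperature=1.0, rng=rng))  # usually 0, sometimes 1 or 2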


You can get deterministic output even with a high temp.

Whatever "random" seed was used can be reused.
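
A toy illustration of the seed point (numpy standing in here for a real inference stack, assuming the serving layer exposes a seed at all):

    import numpy as np

    probs = [0.5, 0.3, 0.2]  # some high-temperature token distribution

    # Same seed -> same "random" draws -> identical output sequence.
    a = np.random.default_rng(seed=42).choice(3, size=10, p=probs)
    b = np.random.default_rng(seed=42).choice(3, size=10, p=probs)
    assert (a == b).all()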


The model outputs probabilities, which you then sample from randomly. Always choosing the highest-probability token (greedy decoding) leads to poor results in practice, such as the model tending to repeat itself. It's a sort of Monte Carlo approach.


The trained model is just a bunch of statistics. To use those statistics to generate text you need to "sample" from the model. If you always sampled by taking the model's #1 token prediction, that would be deterministic, but more commonly a random top-K or top-p token selection is made, which is where the randomness comes in.
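
A rough sketch of what top-K and top-p (nucleus) selection look like, assuming you already have the model's probability vector for the next token (values made up):

    import numpy as np

    def top_k_sample(probs, k, rng):
        """Keep the k most likely tokens, renormalize, sample."""
        idx = np.argsort(probs)[-k:]     # indices of the top-k tokens
        p = probs[idx] / probs[idx].sum()
        return int(rng.choice(idx, p=p))

    def top_p_sample(probs, p_threshold, rng):
        """Keep the smallest set of tokens whose cumulative mass >= p_threshold."""
        order = np.argsort(probs)[::-1]  # most likely first
        cum = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cum, p_threshold)) + 1
        idx = order[:cutoff]
        p = probs[idx] / probs[idx].sum()
        return int(rng.choice(idx, p=p))

    rng = np.random.default_rng(0)
    probs = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
    print(top_k_sample(probs, k=2, rng=rng))              # only tokens 0 or 1
    print(top_p_sample(probs, p_threshold=0.9, rng=rng))  # tokens 0..2 (0.55+0.25+0.12 >= 0.9)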


It is technically possible to make it fully deterministic if you have complete control over the model, quantization, and sampling processes. The GP probably meant to say that most commercially available LLM services don't give you that control.


Actually, you just have to set the temperature to zero.


> If the tokens are bit-for-bit-identical, where does the non-determinism come in?

By design, most LLMs have a randomization factor in how they sample. Some use the concept of “temperature,” which lets them randomly choose the 2nd- or 3rd-highest-ranked next token; the higher the temperature, the more often (and the lower down the ranking) they pick a non-best next token. OpenAI described this in their papers around the GPT-2 timeframe IIRC.
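
A quick worked illustration of that last point, with made-up logits for a best and a second-best token:

    import numpy as np

    logits = np.array([2.0, 1.0])  # best vs. second-best token (illustrative values)
    for t in (0.5, 1.0, 2.0):
        p = np.exp(logits / t)
        p /= p.sum()
        print(f"T={t}: P(2nd-best) = {p[1]:.2f}")
    # T=0.5: P(2nd-best) = 0.12
    # T=1.0: P(2nd-best) = 0.27
    # T=2.0: P(2nd-best) = 0.38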



