Then you’re either not testing your prompts or doing something trivial.
Remember: a good model with a good prompt will generate bad outputs sometimes.
A bad model with a bad prompt will generate a good output sometimes.
That is simply a fact with these non-deterministic models.
You have to do many iterations for each prompt to verify they are working correctly.
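As a rough illustration of what "many iterations" means in practice (a sketch only: `call_model` and `looks_correct` are hypothetical placeholders for your real client call and your real assertions):

```python
import random

def call_model(prompt: str, seed: int) -> str:
    """Placeholder for your actual LLM call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def looks_correct(output: str) -> bool:
    """Placeholder check: parses as JSON, contains the required field, etc."""
    raise NotImplementedError

def pass_rate(prompt: str, n_trials: int = 50) -> float:
    """Run the same prompt many times with different seeds; report the fraction that pass."""
    passes = 0
    for _ in range(n_trials):
        output = call_model(prompt, seed=random.randrange(2**32))
        if looks_correct(output):
            passes += 1
    return passes / n_trials

# "Working correctly" means the pass rate clears your bar, not that it worked once:
# assert pass_rate(my_prompt) >= 0.95
```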
> I’ve not had much problems moving between LLMs…
If you want to move your prompts to a different model, you’re effectively replacing one:
f(prompt + seed) => output
with a different black-box implementation.
Unless you’re measuring the output over multiple iterations of (seed) and verifying your prompt still does the right thing, it’s actually very likely that what you’ve done is take an application with a known output space and convert it to an application with an unknown output space…
…that partially overlaps the original output space!
So it looks like it’s the same.
…but it isn’t, and the “isn’t” is in weird edge cases.
Unless you’re measuring that, you simply now have an app that does “eh, who knows?”
So yes. Porting is trivial if you don’t care whether you end up with the same functionality.
…but porting reliably is much harder (or at least takes much longer).
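If you do want the port to be measured rather than assumed, the shape of the check is roughly this (a sketch; `run_eval_case` is a hypothetical hook standing in for whatever call-and-verify logic your app already has):

```python
from typing import Callable

# Hypothetical hook: returns True if (model, prompt, case) produced an acceptable output.
RunEvalCase = Callable[[str, str, dict], bool]

def compare_models(run_eval_case: RunEvalCase, old_model: str, new_model: str,
                   prompt: str, cases: list[dict], trials_per_case: int = 20) -> None:
    """Re-run the same cases against both models and report pass rates side by side,
    so a partially overlapping output space shows up as a number instead of a surprise."""
    for case in cases:
        old = sum(run_eval_case(old_model, prompt, case) for _ in range(trials_per_case))
        new = sum(run_eval_case(new_model, prompt, case) for _ in range(trials_per_case))
        old_rate, new_rate = old / trials_per_case, new / trials_per_case
        flag = "  <-- regression" if new_rate < old_rate else ""
        print(f"{case.get('name', '?')}: {old_rate:.0%} -> {new_rate:.0%}{flag}")
```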
Very few. Many deployed apps don't have a good quantitative grasp of the quality of their LLMs. Some are doing testing or evaluation through things like unit tests, A/B testing different prompts, and collecting user feedback.
I think we're exiting the phase where people can launch an AI app and have people use it just because of the initial "wow factor", and moving into the phase where users will start churning and businesses will need to make sure that their AI agent is performing, and that they understand how well it's performing.
This is degenerate (greedy) behaviour, and not representative of how the prompt will behave at a higher temperature.
(At least, that’s my understanding; it’s a complex topic, but broadly speaking there is no specific reason, as far as I’m aware, to expect that a particular combination of params/prompt is representative of any other combination of params/prompt for the same model. It may be, but it may not. Certainly on models like GPT4 it is not, for reasons that are not clear to anyone. So… take care with your prompt testing. Setting temperature to 0 is basically meaningless unless you expect to use a temperature of 0 in production. The results you get from your prompts at temp 0 are not generally reflective of the results you will get at temp > 0.)
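A concrete way to see what that means in practice (a sketch only; `sample` is a stand-in for your real completion call, and 0.7 is just an example production temperature):

```python
from collections import Counter

def sample(prompt: str, temperature: float) -> str:
    """Stand-in for your real completion call at the given temperature."""
    raise NotImplementedError

def output_spread(prompt: str, temperature: float, n_samples: int = 30) -> Counter:
    """Sample the same prompt repeatedly and count distinct outputs.

    At temperature 0 you typically see a single (greedy) output; at the
    temperature you actually deploy with, you see something closer to the
    distribution your users will see."""
    return Counter(sample(prompt, temperature) for _ in range(n_samples))

# Evaluate at the setting you ship, e.g. output_spread(my_prompt, temperature=0.7),
# rather than at temperature 0.
```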