Then you’re either not testing your prompts or doing something trivial.
Remember: a good model with a good prompt will generate bad outputs sometimes.
A bad model with a bad prompt will generate a good output sometimes.
That is simply a fact with these non-deterministic models.
You have to do many iterations for each prompt to verify they are working correctly.
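As a rough illustration of what "many iterations" means in practice (a sketch only: `call_model` and `looks_correct` are hypothetical placeholders for your real client call and your real assertions):

```python
import random

def call_model(prompt: str, seed: int) -> str:
    """Placeholder for your actual LLM call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def looks_correct(output: str) -> bool:
    """Placeholder check: parses as JSON, contains the required field, etc."""
    raise NotImplementedError

def pass_rate(prompt: str, n_trials: int = 50) -> float:
    """Run the same prompt many times with different seeds; report the fraction that pass."""
    passes = 0
    for _ in range(n_trials):
        output = call_model(prompt, seed=random.randrange(2**32))
        if looks_correct(output):
            passes += 1
    return passes / n_trials

# "Working correctly" means the pass rate clears your bar, not that it worked once:
# assert pass_rate(my_prompt) >= 0.95
```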
> I’ve not had much problems moving between LLMs…
If you want to move your prompts to a different model, you’re effectively replacing one:
f(prompt + seed) => output
with a different black-box implementation.
Unless you’re measuring the output over multiple iterations of (seed) and verifying your prompt still does the right thing, it’s actually very likely that what you’ve done is take an application with a known output space and convert it to an application with an unknown output space…
…that partially overlaps the original output space!
So it looks like it’s the same.
…but it isn’t, and the “isn’t” is in weird edge cases.
Unless you’re measuring that, you simply now have an app that does “eh, who knows?”
So yes. Porting is trivial if you don’t care whether you end up with the same functionality.
…but porting reliably is much harder (or at least takes much longer).
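If you do want the port to be measured rather than assumed, the shape of the check is roughly this (a sketch; `run_eval_case` is a hypothetical hook standing in for whatever call-and-verify logic your app already has):

```python
from typing import Callable

# Hypothetical hook: returns True if (model, prompt, case) produced an acceptable output.
RunEvalCase = Callable[[str, str, dict], bool]

def compare_models(run_eval_case: RunEvalCase, old_model: str, new_model: str,
                   prompt: str, cases: list[dict], trials_per_case: int = 20) -> None:
    """Re-run the same cases against both models and report pass rates side by side,
    so a partially overlapping output space shows up as a number instead of a surprise."""
    for case in cases:
        old = sum(run_eval_case(old_model, prompt, case) for _ in range(trials_per_case))
        new = sum(run_eval_case(new_model, prompt, case) for _ in range(trials_per_case))
        old_rate, new_rate = old / trials_per_case, new / trials_per_case
        flag = "  <-- regression" if new_rate < old_rate else ""
        print(f"{case.get('name', '?')}: {old_rate:.0%} -> {new_rate:.0%}{flag}")
```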
Very few. Many deployed apps don't have a good quantitative grasp of the quality of their LLMs. Some are doing testing or evaluation through things like unit tests, A/B testing different prompts, and collecting user feedback.
I think we're exiting the phase where people can launch an AI app and have people use it just because of the initial "wow factor", and moving into the phase where users will start churning and businesses will need to make sure that their AI agent is performing, and that they understand how well it's performing.
This is degenerate (greedy) behaviour, and not representative of how the prompt will behave at a higher temperature.
(At least, that’s my understanding; it’s a complex topic, but broadly speaking there is no specific reason, as far as I’m aware, to expect that a particular combination of params/prompt is representative of any other combination of params/prompt for the same model. It may be, but it may not. Certainly on models like GPT4 it is not, for reasons that are not clear to anyone. So… take care with your prompt testing. Setting temperature to 0 is basically meaningless unless you expect to use a temperature of 0 in production. The results you get from your prompts at temp 0 are not generally reflective of the results you will get at temp > 0.)
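A concrete way to see what that means in practice (a sketch only; `sample` is a stand-in for your real completion call, and 0.7 is just an example production temperature):

```python
from collections import Counter

def sample(prompt: str, temperature: float) -> str:
    """Stand-in for your real completion call at the given temperature."""
    raise NotImplementedError

def output_spread(prompt: str, temperature: float, n_samples: int = 30) -> Counter:
    """Sample the same prompt repeatedly and count distinct outputs.

    At temperature 0 you typically see a single (greedy) output; at the
    temperature you actually deploy with, you see something closer to the
    distribution your users will see."""
    return Counter(sample(prompt, temperature) for _ in range(n_samples))

# Evaluate at the setting you ship, e.g. output_spread(my_prompt, temperature=0.7),
# rather than at temperature 0.
```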