Can you give any example queries where the result quality is far away from the rankings on the leaderboard? In my experience it's pretty spot on, so I'm curious if you're asking different sorts of things, or have a different definition of a good answer.
I have for example asked "Write me a very funny scary story about the time I was locked in a graveyard", and the simpler models don't seem to understand that, before getting super scared and running out of the graveyard, they need to explain how exactly I was locked in and what changed that let me out.
Sure. Here's a sample of generations from Feb 2023 for a query asking for bizarre and absurd prompt suggestions, using the pre-release GPT-4 via Bing, which was notoriously not fine-tuned to the degree of the current models:
> Can you teach me how to fly a unicorn?
> Do you want to join my cult of cheese lovers?
> Have you ever danced with a penguin in the moonlight?
Here are the generations for a prompt asking for bizarre and absurd questions from the current implementation of GPT-4 via the same interface:
> If you had to choose between eating a live octopus or a dead rat, which one would you pick and why?
> How would you explain the concept of gravity to a flat-earther using only emojis?
> What would you do if you woke up one day and found out that you had swapped bodies with your pet?
You can try with more generations, but they tend to be much drier and more grounded in information/reality than the previous version's.
There's also my own experience using pretrained vs chat/instruct-tuned models in production. The pretrained versions are leagues better than the chat/instruct fine-tuned ones; it's just that GPT-4 is so far above everything else that even its chat/instruct model is better than, say, pretrained GPT-3.
I'm not saying simple models are better. I'm saying that we're optimizing for a very narrow scope of applications (chatbots), and that we're throwing away significant value in the flexibility of large and expensive pretrained models by targeting metrics aligned with a specific and relatively low-hanging use case. Larger and more complex models will be better than simple models, but the heavily fine-tuned versions of those models will have lost capabilities relative to the pretrained versions, particularly in areas we're not actively measuring.
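For concreteness, here's a minimal sketch of what the pretrained-vs-chat/instruct distinction looks like at the API level, assuming the OpenAI Python SDK (v1+). The model names and prompts are illustrative placeholders, and since the pretrained GPT-4 base was never exposed through a public completions endpoint, the completions model below is a stand-in:

```python
# Minimal sketch: completion-style (pretrained-like) vs chat/instruct calls.
# Assumes the OpenAI Python SDK v1+ and OPENAI_API_KEY in the environment;
# model names and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Completion-style: the model simply continues your text, so you steer it
# with the shape of the prompt itself (few-shot examples, a partial answer).
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # a completions-endpoint model
    prompt="Bizarre and absurd questions to ask a chatbot:\n1.",
    max_tokens=200,
)
print(completion.choices[0].text)

# Chat/instruct-style: the prompt is a structured conversation, and the
# fine-tuning biases the model heavily toward assistant-like answers.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Suggest some bizarre and absurd questions."}],
)
print(chat.choices[0].message.content)
```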
Ask it to write a recipe given a couple of ingredients in a language like Thai or Persian. There's a single model that does a decent job: GPT-4. GPT-3.5 is poor, and Mistral is completely clueless. The leaderboards tell a very different story.
Definition of "good answer" here is responding in the target language with something that produces something edible in the target language without burning the house down.