All of these benchmarks have gotten out of hand, which is itself telling. Benchmarks exist as an indicator of quality and proliferate when other indicators of quality fail. Their very prominence implies that observers are having a difficult time assessing LLM performance in context, which hints at limited utility or, more precisely, at a feedback loop that never closes at the level of utility. (You know a burger tastes really good when you eat it, no benchmarks required.)
Perhaps LLM development really does happen at such a rarefied, abstract level that the development team cannot be immersed in the application context, but I doubt it. More likely, the performance observed in context is either so dispiriting, so difficult to assess, or simply nonexistent that teams return again and again to the more generously validating benchmarks.
> You know a burger tastes really good when you eat it, no benchmarks required.
I'd say this is a good example of the opposite, where the problem is quantifying an ultimately subjective experience. Take three restaurant reviewers to a burger joint and you might end up with four different opinions.
Benchmarks proliferate because many LLM domains defy easy, quantitative measurement, yet LLM development and deployment are so expensive that they need to be guided by independent and quantitative (even if not fully objective) measures.