jug 6 months ago | on: OpenAI O3 breakthrough high score on ARC-AGI-PUB

I liked the SimpleQA benchmark, which measures hallucinations. OpenAI models did surprisingly poorly on it, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut of being more risk-prone than both Anthropic and Google.