Hacker News new | past | comments | ask | show | jobs | submit login

I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut to be more risk prone than both Anthropic and Google.



Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: