Do you have a sense of what kinds of tasks this benchmark includes? Are they more “general,” such that random people would fare well, or more specialized (i.e., something a STEM grad studied that isn’t common knowledge)?
It does, which is why I don’t really subscribe to the idea that any test like this is great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become an SME in entirely new areas.