This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).
So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount is a judgment call. This makes any assessment of human quality meaningless without an explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.
In any case, the ensemble of task-specific, low-compute Kaggle solutions reportedly also beats the Turk average, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.
- 64.2% for humans vs. 82.8%+ for o3 (both on the Public Eval set).
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
[2] https://arcprize.org/blog/oai-o3-pub-breakthrough
[3] https://arxiv.org/abs/2409.01374