
> 85.0% of questions correctly compared with the mean human score of 73.8%

> These questions are either behind a paywall (in the case of the question bank) or published after 2021 and therefore out-of-training data for both LLMs.

> Despite their strengths, both models demonstrated weaker performance in tasks requiring higher-order thinking compared with questions requiring only lower-order thinking.

> Interestingly, both models exhibited confident language when answering questions, even when their responses were incorrect.



> These questions are either behind a paywall (in the case of the question bank) or published after 2021 and therefore out-of-training data for both LLMs

I'm not sure it's true that GPT-4's web scrape is currently from before 2021. And even if it were, this still doesn't seem like an effective analysis of test data contamination. (For example, have any questions that differ only in insignificant ways been asked before?)
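To make the "differ only in insignificant ways" point concrete, here is a rough sketch of one way to look for near-duplicates: word n-gram Jaccard overlap between each test question and some reference corpus. This is not what the paper did; the function names, the trigram size, and the 0.5 threshold are all assumptions picked for illustration, and a real contamination check would need access to the actual training data, which nobody outside OpenAI has.

```python
import re

def ngrams(text, n=3):
    """Set of word n-grams in text, lowercased, punctuation stripped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets (0.0 if either is too short)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def flag_near_duplicates(test_questions, reference_corpus, threshold=0.5, n=3):
    """Return (question, reference, score) triples at or above the threshold.

    The threshold is an arbitrary illustrative value, not a validated cutoff.
    """
    flagged = []
    for q in test_questions:
        for ref in reference_corpus:
            score = jaccard(q, ref, n)
            if score >= threshold:
                flagged.append((q, ref, score))
    return flagged
```

Something like this only catches surface-level rewording. A question paraphrased with different vocabulary would sail through, so a low overlap score is weak evidence of no contamination.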

I'm not skeptical of GPT-4's capability in the way many (most?) people are, but I'd still like to see the level of dialogue around training set data improve.


I'm intrigued as well. Here's my Q&A with GPT-4 itself:

"Q: What was the data cutoff date when GPT-4 was first released?

A: The data cutoff for GPT-4, which means the point at which it stopped being trained on new data, was September 2021. This cutoff date is significant as it means GPT-4's training did not include information or events that occurred after this time."


I just asked “What is your training cutoff date?” and it responded “My training includes data up to April 2023.”


April 2023 is for GPT-4 Turbo (aka 1106-preview).


(I don't think there's any reason to expect GPT-4 to actually know its current cutoff date.)


> These questions are either behind a paywall

That's not a great criterion. Exam questions are going to be discussed in many places, similar to how you can get answers to interview questions by learning from Reddit.



