This finding could be quite useful (in the quote below, LLM 1 refers to GPT-3.5 and LLM 2 to GPT-4):
"Reproducibility analyses revealed that highly reproducible answers were more likely to be answered correctly than inconsistent answers by LLM 1 (66 of 88 [75.0%] vs 5 of 13 [38.5%]; P = .02), potentially indicating another marker of confidence of LLMs that might be leveraged to filter out invalid responses. The same observation was made with LLM 2, with 78 of 96 correct answers (81.3%) in those with high reproducibility vs 1 of 4 (25.0%) in answers with low reproducibility (P = .04)."
The paper submitted 50 independent queries to assess self-consistency. It would be interesting to explore the trade-off between the number of queries and how effectively that agreement filters out likely invalid responses.
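Something along these lines would do it, assuming a hypothetical `query_llm` function that returns the chosen option letter and a question list with made-up `id`/`text`/`answer` fields (a sketch, not the paper's code):

```python
import random
from collections import Counter

def agreement_filter_curve(questions, query_llm, max_queries=50, threshold=0.9):
    """For each query budget k, keep only questions whose repeated answers
    agree at least `threshold` of the time, and report how many answers
    survive the filter and how accurate the surviving ones are."""
    # Collect max_queries answers per question once, then subsample per budget.
    samples = {q["id"]: [query_llm(q["text"]) for _ in range(max_queries)]
               for q in questions}
    curve = []
    for k in range(1, max_queries + 1):
        kept = correct = 0
        for q in questions:
            answers = random.sample(samples[q["id"]], k)
            top, count = Counter(answers).most_common(1)[0]
            if count / k >= threshold:      # a "highly reproducible" answer
                kept += 1
                correct += (top == q["answer"])
            # otherwise: treated as likely invalid and filtered out
        curve.append((k, kept, correct / kept if kept else None))
    return curve
```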
Microsoft Research's team took advantage of this fact by querying multiple times per question (with randomized option ordering) and choosing the most common answer.
The idea is that, since LLMs are probabilistic engines at inference time, if the model "knows" or can "surmise" the correct answer, that answer is more likely to be generated, so self-consistency methods can tease it out more reliably.
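A minimal sketch of that majority-vote scheme, assuming a hypothetical `query_llm` that takes a prompt and returns a single option letter (nothing here is from the Microsoft paper):

```python
import random
from collections import Counter

def self_consistent_answer(stem, options, query_llm, n_samples=5):
    """Ask the model the same question several times with the options
    shuffled, map each reply back to the original option text, and return
    the most common choice plus its agreement rate as a rough confidence."""
    votes = []
    for _ in range(n_samples):
        shuffled = random.sample(options, len(options))
        prompt = stem + "\n" + "\n".join(
            f"{chr(ord('A') + i)}: {opt}" for i, opt in enumerate(shuffled))
        letter = query_llm(prompt).strip().upper()       # e.g. "B"
        votes.append(shuffled[ord(letter) - ord("A")])   # letter -> option text
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / n_samples
```

Shuffling the options each time also washes out any bias the model has toward particular answer positions.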
> 85.0% of questions correctly compared with the mean human score of 73.8%
> These questions are either behind a paywall (in the case of the question bank) or published after 2021 and therefore out-of-training data for both LLMs.
> Despite their strengths, both models demonstrated weaker performance in tasks requiring higher-order thinking compared with questions requiring only lower-order thinking.
> Interestingly, both models exhibited confident language when answering questions, even when their responses were incorrect.
> These questions are either behind a paywall (in the case of the question bank) or published after 2021 and therefore out-of-training data for both LLMs
I'm not sure that it's true that GPT-4's web scrape is currently from before 2021. And even if it was, this still doesn't seem like an effective analysis of test data contamination. (For example, have any questions that differ only in insignificant ways been asked before?)
I'm not skeptical of GPT-4's capability in the way many (most?) people are, but I'd still like to see the level of dialogue around training set data improve.
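To be concrete about that parenthetical: even a crude n-gram-overlap check against publicly available question banks would flag items that have been asked before in lightly reworded form. A toy sketch (the 0.6 cutoff is made up):

```python
def ngrams(text, n=3):
    """Lowercased word n-grams of a question."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b, n=3):
    """Jaccard similarity of two questions' n-gram sets: a crude near-duplicate signal."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

def likely_contaminated(exam_questions, public_questions, cutoff=0.6):
    """Flag exam questions that closely resemble anything already public,
    regardless of the nominal training-data cutoff date."""
    return [q for q in exam_questions
            if any(jaccard(q, p) >= cutoff for p in public_questions)]
```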
I'm intrigued as well. Here's my Q&A with GPT-4 itself:
"Q: What was the data cutoff date when GPT-4 was first released?
A: The data cutoff for GPT-4, which means the point at which it stopped being trained on new data, was September 2021. This cutoff date is significant as it means GPT-4's training did not include information or events that occurred after this time."
That's not a great criterion. Exam questions are going to be discussed in many places, similar to how you can get answers to interview questions by learning from Reddit.
I don't think this study is as bad as many other LLMs-in-medicine studies, but this is quite a caveat:
> In addition, the level of education of the human scorers is not exactly known and should be considered when interpreting comparisons between the models and the users scores. Additionally, human scores could be overestimated due to users that repeat question banks for practice. Therefore, it is important to also consider the absolute scores from the models.
This is especially relevant when you consider the kinds of "lower-order" questions that both GPT-3.5 and GPT-4 did quite well at [from the supplement pdf]:
> Question: What kind of identification inability does prosopagnosia reflect?
> A: Facial expression. B: Familiar faces. C: Melody in language. D: Non-musical sounds. E: Pitch of a melody.
> Answer B: Familiar faces [rest of answer truncated]
I think they did a good job demonstrating that GPT-4 is a plausible clinical knowledge base that could easily be tailored towards specific domains. But stuff like this is precisely why I am so nervous about using LLMs for diagnosis:
> Question: A 35-year-old male was involved in a minor car accident when returning home after a tennis match, and attended hospital because of not being aware of objects on his left side. There was no head injury but nevertheless he felt a throbbing right occipital headache, with no other symptoms. On clinical examination, two hours after the accident, there was a left hemianopia. Eye movements and pupillary function were normal. Blood pressure was 160/95 mmHg. No abnormalities were found on a cranial CT-scan. He reported previous attacks of severe unilateral pulsatile headache with nausea and photophobia between the ages of 15 and 35 years. The frequency of headache was about once in 6 months, with a duration of up to 24 hours. Which diagnosis applies best to this case?
> A: Internal carotid artery stenosis.
> B: Migrainous infarction.
> C: Reversible vasoconstriction syndrome.
> D: Posterior reversible encephalopathy syndrome.
> E: Vertebral artery dissection.
Below is GPT-4's incorrect answer. It's medical jargon but read it closely.
> B: Migrainous infarction. The term "migrainous infarction" describes a rare type of ischemic stroke that occurs during a migraine with aura. Given this patient's history of severe, unilateral, pulsatile headaches with nausea and photophobia (typical features of migraines), and the current presentation with headache and visual disturbance (hemianopia), migrainous infarction is a plausible diagnosis. The onset of visual symptoms after a minor accident could be coincidental, but it's also possible that the stress of the accident triggered a migraine attack. The normal CT-scan findings don't exclude migrainous infarction, as the changes associated with this condition may not be visible on imaging, especially in the early stages. The other options (internal carotid artery stenosis, reversible vasoconstriction syndrome, posterior reversible encephalopathy syndrome, and vertebral artery dissection) could potentially cause similar symptoms, but they don't align as closely with this patient's history and presentation. For example, vertebral artery dissection would typically cause symptoms related to the posterior circulation, such as vertigo, imbalance, or difficulty speaking or swallowing, which this patient doesn't have.
Did you notice what was missing? GPT-4 didn't even acknowledge the most serious symptom: "attended hospital because of not being aware of objects on his left side." If you vaguely understand how LLMs work this actually makes sense: this is a phrase with very little precise medical jargon and wouldn't appear in a neat list of symptoms and diseases - it's not "difficulty speaking or swallowing," it's much more vague. I imagine GPT-4 treated that sentence as an odd outlier and basically threw it out.
But a human neurologist would never make that mistake, because humans actually understand what words mean. For us, "not being aware of objects on his left side" is an incredibly alarming piece of information. Anyone who knows what a stroke is would suspect a stroke, and a neurologist would immediately connect "stroke-like symptoms in a young person after a car crash" with "arterial dissection."
Do you think it should have made a bigger deal of it?
> (hemianopia)
That's loss of vision on one side, right?
Edit: this is an honest question. I'm not used to these terms, so is "unaware of objects on their left side" more general than vision?
Experimenting a bit with more useful prompts, it doesn't ignore that symptom, but it still rules out anything more serious, largely because nothing showed up on the CT scan. That seems to be a fairly consistent reason it gives across different amounts of prompting and of asking it to break things down.
IANAD [doctor] and missed that in the sea of jargon - but I think your question actually clarifies how the LLM got confused. I wouldn't necessarily say "more general than vision," but more severe. There's a big difference between "unaware of objects" and "difficulty seeing or distinguishing objects."
It looks like GPT-4 disregarded "unaware of objects on their left side" in favor of more clinical (but less specific) information later in the question: "two hours after the accident, there was a left hemianopia." According to Dr. Wikipedia, "hemianopia" is a vague term that could indicate partial loss of vision, or could only affect one eye. But the patient's informal description sure doesn't sound like a partial loss of vision, nor did he say "left eye." The information of "left hemianopia" needs to be considered alongside what the patient self-reported - it confirms and clarifies what the patient said - but the LLM didn't seem to do this.
There's other implicit information that the LLM would have trouble leveraging:
- the problem is so serious that a 35-year-old with a 20-year history of migraines immediately went to the hospital, so I would be very reluctant to dismiss it as a migraine related to the "stress of the accident."
- when I hear "unaware of objects on their left side" my gut reaction as someone with a brain is "that sounds like brain damage, you need to go to the hospital."
- Even with its limitations, GPT-4 should have realized "two hours after the accident" meant that it wasn't a temporary migraine-related hemianopia. Apparently it didn't. But I think this is amenable to specific training. The other two things are not, but they are the kind of thinking doctors rely on.
> when I hear "unaware of objects on their left side" my gut reaction as someone with a brain is "that sounds like brain damage, you need to go to the hospital."
To be fair, the answer it gives is "a stroke".
(For anyone who hasn't looked at the supplemental files the answer is E).
Karpathy recently tweeted about the importance of being close to the data when training LLMs, and I think a similar level of rigor and transparency is essential when judging evaluations. Thanks for bringing us this insight from the data.
20XX: Contact lenses that scan written board questions, process them with an in-lens computer running advanced AI, and holographically display the answers, visible only to the lens wearer.
Gigapixel scans of test takers' retinas will be trivial in that fantasy world. Not that anyone wants to be in a cheating arms race, but that's been the case for centuries. The more easily accessible rote information becomes (i.e., what the language models are decent at), the less important it will be on the test. Expect different scoring weights for different types of questions, and more novel questions that don't appear in the corpus of study material.
While I broadly agree with your core argument, I would disagree that LLMs are good at rote learning — LLMs can do a mediocre job of that after a huge training run, while a simple index search is much easier, much more compact, and much faster.
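To make "simple index search" concrete: for pure rote question-answer recall, something this small already does the job (a toy sketch, not a claim about any real retrieval system):

```python
from collections import defaultdict

def build_index(qa_pairs):
    """Tiny inverted index: word -> ids of stored questions containing it."""
    index = defaultdict(set)
    for i, (question, _answer) in enumerate(qa_pairs):
        for word in question.lower().split():
            index[word].add(i)
    return index

def lookup(query, qa_pairs, index):
    """Return the stored answer whose question shares the most words with the query."""
    hits = defaultdict(int)
    for word in query.lower().split():
        for i in index.get(word, ()):
            hits[i] += 1
    return qa_pairs[max(hits, key=hits.get)][1] if hits else None
```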
If a readily available device can answer the questions so easily, then those questions aren't suitable for a professional exam, because practitioners could use the same device for similar tasks in their actual work.
"Reproducibility analyses revealed that highly reproducible answers were more likely to be answered correctly than inconsistent answers by LLM 1 (66 of 88 [75.0%] vs 5 of 13 [38.5%]; P = .02), potentially indicating another marker of confidence of LLMs that might be leveraged to filter out invalid responses. The same observation was made with LLM 2, with 78 of 96 correct answers (81.3%) in those with high reproducibility vs 1 of 4 (25.0%) in answers with low reproducibility (P = .04)."
The paper submitted 50 independent queries to assess self-consistency. Exploring the trade-off graph between the number of queries and the effectiveness to filter likely invalid responses would be interesting.