> but it still might overlook important subtleties

If there's one thing we can be certain of, it's that LLMs often overlook important subtleties.

Can't believe they also used GPT-4 to evaluate the results. I mean, we wouldn't trust a student to grade their own exam, even when given the right answers to grade with.