I'm also confused by some of the figure captions, which don't seem to match the results:
- "Only Sonnet-3.5 can count the squares in a majority of the images", but Sonnet-3, Gemini-1.5, and Sonnet-3.5 all have accuracy above 50%.
- "Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.", but it still reaches 91% accuracy? That is hard to reconcile with a model that tends to answer "No" regardless of distance.