when i’ve run toy demos where GPT-5, Sonnet 4 and Gemini 2.5 Pro critique and vote on various docs (e.g. PRDs), they didn’t pick their own material more often than not.
my setup wasn’t designed as a benchmark though, so that could turn out to be wrong over enough iterations.
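
fwiw, a rough sketch of how you could check this more carefully, assuming you log each vote as a (judge, author of the chosen doc) pair. the model labels, the vote data and the helper names below are all made up for illustration, not from my actual setup. it counts self-picks and compares them against the chance rate (1/3 with three candidate docs) using an exact one-sided binomial tail:

```python
from math import comb


def self_pick_rate(votes):
    """Count votes where the judge chose its own document.

    votes: list of (judge, author_of_chosen_doc) pairs.
    Returns (number_of_self_picks, total_votes).
    """
    self_picks = sum(1 for judge, author in votes if judge == author)
    return self_picks, len(votes)


def binom_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical vote log: each judge voted on three rounds.
votes = [
    ("gpt5", "sonnet4"), ("gpt5", "gpt5"), ("gpt5", "gemini"),
    ("sonnet4", "gemini"), ("sonnet4", "sonnet4"), ("sonnet4", "gpt5"),
    ("gemini", "gpt5"), ("gemini", "sonnet4"), ("gemini", "gemini"),
]

k, n = self_pick_rate(votes)
chance = 1 / 3  # three candidate docs per round, assuming no preference
p_value = binom_tail(k, n, chance)
print(f"self-picks: {k}/{n}, P(>= {k} by chance) = {p_value:.3f}")
```

a small p-value would suggest self-preference above chance; with only a handful of rounds it won’t tell you much either way, which is the point about iterations.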