Table 1 in the Appendix. GSM-NoOp is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least, at -17%).
NoOp adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.
Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".