CoT@32 isn't "32-shot CoT"; it's CoT with 32 samples (or rollouts) from the model, and the answer is taken by consensus vote from those rollouts. It doesn't use any extra data, only extra compute. It's explained in the tech report here:
> We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.
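The procedure described in that paragraph can be sketched in a few lines. This is just an illustration of the voting logic, not Gemini's actual implementation; `sample_cot_answer` and `greedy_answer` are hypothetical stand-ins for model calls, and the threshold value is made up (the report says it's tuned on a validation split).

```python
from collections import Counter

def cot_at_k(question, sample_cot_answer, greedy_answer, k=32, threshold=0.6):
    """Sample k chain-of-thought rollouts and take the most common final
    answer. If its share of the k samples clears the consensus threshold,
    return it; otherwise fall back to a greedy (no-CoT) answer."""
    answers = [sample_cot_answer(question) for _ in range(k)]
    best_answer, count = Counter(answers).most_common(1)[0]
    if count / k >= threshold:
        return best_answer
    return greedy_answer(question)
```

Note this is consensus voting over sampled rollouts (extra compute at inference time), not extra training data or extra few-shot examples.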
(They could certainly have been clearer about it -- I don't see anywhere that they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to, given that they report CoT@8 and CoT@32 in various places and use 8 and 32 as the example numbers in the quoted paragraph.
I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)
The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare it to Gemini's 90% with CoT@32. (Although getting to choose the metric you report for both models is probably a little "unfair".)
It's also fair to point out that with the more "standard" 5-shot eval, Gemini does do significantly worse than GPT-4: 83.7% (Gemini) vs 86.4% (GPT-4).
> I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?
Chain of Thought prompting, as defined in the referenced paper, is a modification of few-shot prompting where the example Q/A pairs include chain-of-thought-style reasoning alongside the question and answer. So if they were using a 0-shot method (even one designed to elicit CoT-style output), I don't think they would call it Chain of Thought and cite that paper.
I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.
I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have amazing experience with these two models:
Thank you for the suggestions – really helpful for my hobby project. Can't run anything bigger than 7B on my local setup, which is a fun constraint to play with.