CoT@32 isn't "32-shot CoT"; it's CoT with 32 samples (or rollouts) from the model, and the answer is taken by consensus vote from those rollouts. It doesn't use any extra data, only extra compute. It's explained in the tech report here:
> We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.
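The procedure described in that paragraph can be sketched in a few lines. This is just an illustration of the voting logic, not Gemini's actual implementation; `sample_cot_answer` and `greedy_answer` are hypothetical stand-ins for model calls, and the threshold value is made up (the report says it's tuned on a validation split).

```python
from collections import Counter

def cot_at_k(question, sample_cot_answer, greedy_answer, k=32, threshold=0.6):
    """Sample k chain-of-thought rollouts and take the most common final
    answer. If its share of the k samples clears the consensus threshold,
    return it; otherwise fall back to a greedy (no-CoT) answer."""
    answers = [sample_cot_answer(question) for _ in range(k)]
    best_answer, count = Counter(answers).most_common(1)[0]
    if count / k >= threshold:
        return best_answer
    return greedy_answer(question)
```

Note this is consensus voting over sampled rollouts (extra compute at inference time), not extra training data or extra few-shot examples.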
(They could certainly have been clearer about it -- I don't see anywhere that they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to, given that they report CoT@8 and CoT@32 in various places and use 8 and 32 as the example numbers in the quoted paragraph.
I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)
The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare it to Gemini's 90% with CoT@32. (Although getting to choose the metric you report for both models is probably a little "unfair".)
It's also fair to point out that with the more "standard" 5-shot eval, Gemini does do significantly worse than GPT-4: 83.7% (Gemini) vs 86.4% (GPT-4).
> I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?
Chain of Thought prompting, as defined in the referenced paper, is a modification of few-shot prompting where the example Q/A pairs include chain-of-thought-style reasoning alongside the question and answer. So if they were using a 0-shot method (even one designed to elicit CoT-style output), I don't think they would call it Chain of Thought and cite that paper.
I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.
I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have amazing experience with these two models:
Thank you for the suggestions – really helpful for my hobby project. Can't run anything bigger than 7B on my local setup, which is a fun constraint to play with.