Meta got caught gaming AI benchmarks (theverge.com)
347 points by pseudolus 6 days ago | 161 comments






The Llama 4 launch looks like a real debacle for Meta. The model doesn't look great. All the coverage I've seen has been negative.

This is about what I expected, but it makes you wonder what they're going to do next. At this point it looks like they are falling behind the other open models, and made an ambitious bet on MoEs, without this paying off.

Did Zuck push for the release? I'm sure they knew it wasn't ready yet.


I don't know about Llama 4. Competition is intense in this field, so you can't expect everybody to be number 1. However, I think the performance culture at Meta is counterproductive. Incentives are misaligned; I hope leadership will try to improve it.

Employees are encouraged to ship half-baked features and move to another project. Quality isn't rewarded at all. The recent layoffs have made things even worse. Skilled people were fired, slowing down teams. I assume the goal was to push remaining employees to work even more, but I doubt this is working.

I haven't worked in enough companies of this size to be able to tell if alternatives are better, but it's very clear to me that Meta doesn't get the best from their employees.


For those who haven't heard of it, "The Hawthorne Effect" is the name given to a phenomenon where, when a person or group being studied is aware they are being studied, their performance goes up by as much as 50% for 4-8 weeks, then regresses to its norm.

This is true if they are just being observed, or if some novel processes are introduced. If the new things are beneficial, performance rises for 4-8 weeks as usual, but when it regresses, it regresses to a higher level reflecting the value of the new process.

But when poor management introduces a counterproductive change, the Hawthorne Effect makes it look like a resounding success for 4-8 weeks. Then the effect fades, and performance drops below the original level. Sufficiently devious managers either move on to new projects or blame the workers for failing to maintain the new higher pace of performance.

This explains a lot of the incentive for certain types of leaders to champion arbitrary changes, take a victory lap, and then disassociate themselves from accountability for the long-term success or failure of their initiative.

(There is quite a bit of controversy over what the mechanisms for the Hawthorne Effect are, and whether change alone can introduce it or whether participants need to feel they are being observed, but the model as I see it fits my anecdotal experience, where new processes are always accompanied by attempts to meet new performance goals, and everyone is extremely aware that the outcome is being measured.)


> There is quite a bit of controversy over what the mechanisms for the Hawthorne Effect are, and whether change alone can introduce it or whether participants need to feel they are being observed

My vote is that change alone can introduce it; going from not being observed to being observed is a change itself. People get into a daily groove after adjusting to whatever circumstance or process (maybe that averages 4-8 weeks, I have no idea), but introduce a novel thing into that and they "perk up" until that bit leaves or integrates into the daily groove. In my experience, people prefer working on new features, even if that feature is some management's arbitrary initiative. They'd rather work on that new ridiculous thing than continue their previous slog through the bug backlog, until the new thing becomes a slog itself.


I mean, it sounds like we should add in the McNamara fallacy also.

I've never liked it, but

> Move fast and break things

is really a bad concept in this space, where you get limited shots at releasing something that generates interest.

> Employees are encouraged to ship half-baked features

And this is why I never liked that motto and have always pushed back at startups where I was hired that embraced this line of thought. Quality matters. It's context-dependent, so sometimes it matters a lot, and sometimes hardly. But "moving fast and breaking things" should be a deliberate choice, made for every feature, module, sprint, story all over again, IMO. If at all.


> is really a bad concept in this space, where you get limited shots at releasing something that generates interest.

Sure, but long-term effects depend more on the actual performance of the model than on anything else.

Say they launch a model that is hyped to be the best, but when people try it, it's worse than other models. People will quickly forget about it, unless it's particularly good at something.

Alternatively, say they launch a model that doesn't even get a press release, or any benchmark results published ahead of launch, but the model actually rocks at a bunch of use cases. People will start using it regardless of the initial release, and continue to do so as long as it remains one of the best models.


I'm with you. Yet I've always understood 'move fast and break things' to mean that there is value in shipping stuff to production that you can't get by just sitting in the safe and relatively simple corner of your local development environment, polishing things up for eternity without any external feedback whatsoever, for months or even years, and then doing a big drop. That's completely orthogonal to things like planning, testing, quality, taking time to think, etc.

Maybe the fact that this is how I understood that motto is in itself telling.


I understand it as that too. But even then, I dislike it.

Yes, "shipped, but terrible software" is far more valuable than "perfect software that's not available". But that's a goal.

The "move fast and break things" is a means to that goal. One of many ways to achieve this. In this, "move fast" is evident: who would ever want to move slowly if moving fast has the exact same trade-offs?

"Break things" is the part that I truly dislike. No. We don't "break things" for our users, our colleagues or our future-selves (ie tech debt).

In other situations, the "move fast and break things" implies a preferred leaning towards low-quality in the famous "Speed / Cost / Quality Trade-off". Where, by decreasing quality, we gain speed (while keeping cost the same?).

This fallacy has long been debunked, with actual research and data to back it up¹: software engineering projects that focus on quality actually gain speed! Even without reading research or papers, this makes sense: we all know how much time is wasted fixing regressions, muddling through technical debt, rewriting unreadable/unmaintainable/unextensible code, etc. A clean, well-maintained, neatly tested, up-to-date codebase lets you add features much faster than a horrible ball of mud that has accumulated hacks, debt, and bugs for decades.

¹e.g. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations is a nice starting point with lots of references to research and data on this.


I'd argue it's a bad concept in any space that involves teams of people working together and deliverables that enter the real world.

> is really a bad concept in this space, where you get limited shots at releasing something that generates interest.

It's a really bad concept in any space.

We would be living in a better world if Zuck had, at least once, thought "Maybe we shouldn't do that".


>It's a really bad concept in any space.

I struggle with this because it feels like so many 'rules' in the world where the important half remains unsaid. That unsaid portion is then mediated by Goodhart's law.

If the other half is 'then slow down and learn something', it's really not that bad. Nothing is sacred; we try, we fail, we learn, and we (critically) don't repeat the mistake. That's human learning: we don't learn from mistakes, we learn from reflecting on mistakes.

But if learning isn't part of the loop, if it's a self-justifying defense for fuckups, if the unsaid remains unsaid, it's a disaster waiting to happen.

The difference is usually in what you reward. If you reward shipping, you get the defensive version, and you will ship crap. If you reward institutional knowledge building, you don't. Engineers are often taught 'good, fast, or cheap: pick 2'. The reality is that it's usually closer to 1 or 1.5. If you pick fast... you get fast.


I agree. I think of it like a car engine. You can push it up to a certain RPM and it will keep making more and more power. Above that RPM, the engine starts to produce less power and eventually blows a gasket.

I think the performance-based management worked for a while because there were some gains to be had by pushing people harder. However, they’ve gone past that and are now pushing people too hard and getting worse results. Every machine has its operating limits and an area where it operates most efficiently. A company is no different.


The problem is, there's always some engine pushing the power envelope, or a person pushing their performance harder. And then the rest of them have to keep up.

very nice analogy!

It's 100% PSC (their "Performance Culture")

You're not encouraged per se to ship half-baked features, but if you don't have enough "impact" at the end of the half (for the mid-cycle check-in) or the year (for the full PSC cycle), then you're going to get "Below Expectations" and then "Meets Most" (or worse), and with the current environment a swift offboarding.

When I was there (working in integrity), our group of staff+ engineers opined on how it led to perverse incentives. And whilst you can work there, do great work, and get good ratings, I saw too many examples of "optimizing for PSC" (otherwise known as PSC hacking).


It's also terrible output, even before you consider what looks like catastrophic forgetting from crappy RL. The emoji use and writing style make me want to suck-start a revolver. I don't know how they expect anyone to actually use it.

> Employees are encouraged to ship half-baked features and move to another project

Maybe there is more to that. It's been more than a year since Llama 3 was released. That should be enough time for Meta to release something significantly improved. Or do you mean that quarter by quarter the engineers had to show they were making an impact in their perf reviews, which could be detrimental to the Llama 4 project?

Another thing that puzzles me is that again and again we see that the quality of a model improves with more high-quality data, yet why can't Meta manage to secure a massive amount of new high-quality data to boost their model's performance?


> Or do you mean that quarter by quarter the engineers had to show they were making an impact in their perf reviews

This is what I think they were referencing. Launching things looks nice in review packets, and few to none are going to look into the quality of the output. Submitting your own self-review means that you can cherry-pick statistics and how you present them. That's why that culture incentivizes launching half-baked products and moving on to something else: it's smart and profitable (launch yet another half-baked project) to distance yourself from the half-baked project you started.


I like how Netflix set up its incentive systems years ago. Essentially they told the employees that all they needed to do was deliver what the company wanted. It was perfectly okay that an employee did their job and didn't move up or do more. Per their chief talent officer McCord, "a manager's job is all about setting the context", and the employees were let loose to deliver. This method puts a really high bar on the managers, as the entire report chain must know clearly what they want delivered. Their expectations must be high enough to move the company forward, but not so ridiculous as to turn Netflix into a burnout factory.

Unfortunately I wasn't able to get an interview with Netflix.

> employees that all they needed to do was deliver what the company wanted

How did this work out in practice and across teams? My experience at Meta within my team was that it would be almost impossible to determine what the company actually wanted from our team in a year. Goals kept changing and the existing incentive system works against this since other teams are trying to come up with their own solutions to things which may impact your team.

> an employee did their job and didn't move up

Does Netflix cull employees if they haven't reached a certain IC level? I know at Meta SWEs need to reach IC5 after a while or risk being culled.


Netflix used to have a single level for engineers: Senior. That was the best decision they made, at least when they were small. It simply eliminated any incentive to work for promotion, and the result was magnificent: engineers would generally work on what's right, if they were ambitious. Or they could choose to finish what they were supposed to do. Netflix's culture deck explicitly said it was okay for employees to just do what their roles required, so no pressure to move up at all.

> How did this work out in practice and across teams? My experience at Meta within my team was that it would be almost impossible to determine what the company actually wanted from our team in a year.

That's why it is critical to have good leaders. A problem, at least per my own experience in Meta, is that many managers are people managers. I'm not sure what they want. It's hard to know what product they want to build, what gaps they want to fill, or what efficiency goals they want to drive. If they were in Netflix, they would fail the "setting the right context" test, as they didn't know what the right context should be.


Imagine if sports teams worked this way. Hey, you as a wide receiver didn't get promoted to QB - sorry, we have to let you go!

Yep they've basically created a culture where people are incentivized to look busy, ship things fast, and look out for themselves. Which attracts and retains people that thrive in that kind of environment.

That's a terrible way to execute on big, ambitious projects since it discourages risky bets and discourages collaboration.


It's not a big deal. Llama 4 feels like a flop because expectations were really high, based on their previous releases and the sense of momentum in the ecosystem because of DeepSeek. At the end of the day, Llama 4 didn't meet the elevated expectations, but they're fine. They'll continue to improve and iterate, and maybe the next one will be more hype-worthy, or maybe expectations will be readjusted as the specter of diminishing returns continues to creep in.

It feels like a flop because it is objectively worse than models many times smaller that shipped some time ago. In fact, it is worse than earlier Llama releases on many points. It's so bad that people who initially ran it assumed that the downloaded weights must be corrupted somehow.

The switching costs are so low (zero) that anyone using these models just jumps to the best performer. I also agree that this is not a brand or narrative sensitive project.

> it makes you wonder what they're going to do next

They're just gonna keep throwing money at it. This is a hobby and talent magnet for them; Instagram is the money printer. They've been working on VR for like a decade with barely any results in terms of users (compared to costs). This will be no different.


Both are also decent long-term bets. Being the VR market leader now means they will be the VR market leader, with plenty of in-house talent and IP, when the technology matures and the market grows. Being in the AI race, even if they are not leading, means they have the in-house talent and technology to react to wherever the market goes with AI. They have one of the biggest messengers and one of the biggest image-posting sites; there is a decent chance AI will become important to them in some not-yet-obvious way.

One of Meta's biggest strengths is Zuckerberg being able to place these kinds of bets. Those bets being great for PR and talent acquisition is the cherry on top.


This assumes no upstart will create a game changing innovation which upends everything.

Companies become complacent and confused when they get too big. Employees become trapped in a maze of performative careerism, and customer focus and a realistic understanding of threats from potential competitors both disappear.

It's been a consistent pattern since the earliest days of computing.

Nothing coming out of Big Tech at the moment is encouraging me to revise that assumption.


"made an ambitious bet on MoEs"? No, DeepSeek is MoE, and they succeeded. Meta is not betting on MoE, it just does what other people have done.

Llama 4 seems in many ways a cut-and-paste of DeepSeek, including the shared expert and the high sparsity. It's a DeepSeek that does not work well.

I remember reading that they were in panic mode when the DeepSeek model came out, so they must have scrambled and had to rework a lot of things, since DeepSeek was so competitive and open source as well.

Fear of R2 looms large as well. I suspect they succumbed to the nuance collapse along the lines of “Is double checking results worth it if DeepSeek eats our lunch?”

> Did Zuck push for the release?

All I know is it is the first Llama release since Zuck brought "masculine energy" back to Meta.


They knew 2 months ago that they couldn't beat DeepSeek: https://old.reddit.com/r/LocalLLaMA/comments/1i88g4y/meta_pa...

Do you know that they made a bet on MoE, meaning they abandoned dense models? I doubt that is the case. Just releasing an MoE Llama 4 does not constitute a "bet" without further information.

Also, from what I can tell, this performs better than models with parameter counts equal to one expert, and worse than fully dense models equal to the total parameter count. Isn't that kind of what we'd expect? In what way is that a failure?

Maybe I am missing some details. But it feels like you have an axe to grind.


A 4x8 MoE performs better than an 8B but worse than a 32B; is that your statement?

My response would be, "so why bother with MOE?"

However, DeepSeek R1 is MoE from my understanding, but the "E"s are all >=32B parameters, and there are >20 experts. I could be misinformed; however, even so, I'd say an MoE with 32B or even 70B experts will outperform (define this!) models with equal parameter counts, because DeepSeek outperforms (define?) ChatGPT et al.


DeepSeek V3/R1 are MoE with 256 experts per layer, actively using 1 shared expert and 8 routed experts per layer https://arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20l... so you can't just take the active parameters and assume that's close to the size of a single expert (ignoring experts are per layer anyways and that there are still dense parameters to count).

Despite the connotations of specialized intelligence that the term "expert" provokes, it's really mostly about the scalability/efficiency of running large models. By splitting up sections of the layers and not activating all of them for each pass, a single query takes less bandwidth, can be distributed across compute, and can be parallelized with other queries on the same nodes.
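
To make the routing idea concrete, here is a minimal toy sketch in Python/NumPy (made-up shapes, expert counts, and weights; not DeepSeek's or any real model's code): each layer scores the routed experts for the incoming token, runs only the top-k of them plus an always-on shared expert, and sums their weighted outputs.

  import numpy as np

  # Toy MoE feed-forward layer, illustrative only.
  def moe_ffn(x, shared, routed, router_w, top_k=2):
      """x: (d,) token vector; shared/routed: (W_in, W_out) pairs; router_w: (n_experts, d)."""
      scores = router_w @ x                                    # one routing score per routed expert
      top = np.argsort(scores)[-top_k:]                        # indices of the top-k experts
      gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the selected experts

      def ffn(weights, v):                                     # a plain two-layer feed-forward "expert"
          w_in, w_out = weights
          return w_out @ np.maximum(w_in @ v, 0.0)

      out = ffn(shared, x)                                     # the shared expert always runs
      for gate, idx in zip(gates, top):                        # only top_k routed experts run
          out += gate * ffn(routed[idx], x)
      return out

  d, hidden, n_experts = 8, 16, 4
  rng = np.random.default_rng(0)
  make = lambda: (rng.normal(size=(hidden, d)), rng.normal(size=(d, hidden)))
  y = moe_ffn(rng.normal(size=d), make(), [make() for _ in range(n_experts)], rng.normal(size=(n_experts, d)))

The point is that only a fraction of the feed-forward weights are touched per token, which is where the bandwidth and parallelism savings come from.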


Easy: vastly improved inference performance on machines with larger RAM but lower bandwidth/compute. These are becoming more popular, such as Apple's M series chips, AMD's Strix Halo series, and the upcoming DGX Spark from Nvidia.

Yes, I understand all that. I was saying the claim is incorrect. My understanding of DeepSeek is mechanically correct, but apparently they use 3B models as experts, per your sibling comment. I don't buy it, regardless of what they put in the paper - 3B models are pretty dumb, and R1 isn't dumb. No amount of shuffling between "dumb" experts will make the output not dumb. It's more likely 32x32B experts, based on the quant sizes I've seen.

A deepseek employee is welcome to correct me.


I'm not a DeepSeek employee but I think there is more clarification needed on what an "expert" is before the conversation can make any sense. Much like physics, one needs at least take a glance at how the math is going to be used to be able to sanity check a claim.

A model includes many components. There will be bits that encode/decode tokens as vectors, transformer blocks which do the actual processing on the data, some post-transformer block filtering to normalize that output, and maybe some other stuff depending on the model architecture. The part we're interested in involves parts of the transformer blocks, which handle using encoded relational information about the vectors involved (part 1) to transform the input vector using a feed forward network of weights (part 2) and then all that gets post processed in various ways (part 3). A model will chain these transformers together and each part of the chain is called a layer. A vector will run from the first layer through to the last, being modified by each transformer along the way.

In an MoE model the main part about the transformer block changed is part 2, which goes from "using a feed forward network of weights" to "using a subset of feed forward network weights chosen by a router for the given token and then recombined". In the MoE case each subset of weights per feed forward layer is what is called the "expert". Importantly, the expert is not a whole model - it's just a group of the weights available in a given layer. Each layer's router choses which group(s) of weights to use independently. As an example, if a 10 layer model had a total of 10 billion 8 bit parameters in the feed forward layers (so a >10 billion parameter model in overall parameters) and 10 experts that means each expert is ~100 MB (10 billion bytes / 10 layers / 10 experts per layer). These 10 billion parameters would be referred to as the sparse parameters (not always used each token) while the rest of the model would be referred to as the dense parameters (always used each token). Note: folks on the internet have a strong tendency to label this incorrectly as "10x1B" or "10x{ActiveParameters}" instead.
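
A back-of-the-envelope version of that toy arithmetic, in Python (same hypothetical numbers as the example above; the 1-shared-plus-1-routed assumption in the second half is mine, purely for illustration):

  # Hypothetical toy model from the paragraph above: 10 layers, 10 experts per
  # layer, 10 billion 8-bit feed-forward (sparse) parameters in total.
  sparse_params = 10_000_000_000
  layers, experts_per_layer = 10, 10
  bytes_per_param = 1  # 8-bit weights

  params_per_expert = sparse_params // (layers * experts_per_layer)
  print(params_per_expert * bytes_per_param / 1e6, "MB per expert")   # -> 100.0 MB

  # If, say, 1 shared + 1 routed expert ran in every layer (an assumption for
  # illustration only), the sparse parameters touched per token would be:
  active_sparse = 2 * params_per_expert * layers
  print(active_sparse / 1e9, "B active sparse parameters per token")  # -> 2.0 B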

The "Mixture" part of MoE extends a bit further than "the parameter groups sit next to each other in the transformer block" though. Similar in concept to how transformer attention is combined in part 1, more than 1 expert can be activated and the outputs combined. At minimum there is usually 1 expert which is always used in a layer (the "shared expert") and at least 1 expert which is selected by the router (the "routed experts"). The shared expert exists to make the utilization of the routed experts more even by ensuring base information which needs to be used all the time is dedicated to it which increases training performance since the other experts can be more evenly selected by the router as a result.

With that understanding, the important takeaways are:

- Experts are parts of individual parameter networks in each layer, not a sub-model carved out.

- More than 1 expert can be used, and the way the data is combined is not like feeding the output of a low parameter LLM to another, it's more like how the first phase of the transformer has multiple attention heads which are combined to give the full attention information.

- There are a lot of other weights beyond the sparse weights used by experts in a model. The same is true for the equivalent portions of a dense model as well though. This ultimately makes comparing active parameters of a sparse model to total parameters of a dense model a valid comparison.

As to the original conversation: DeepSeek v3/R1 has 37 billion active parameters, so that should set the floor for a comparative dense model, not whatever the size of an individual expert in part of a single transformer layer is (which acts as a bit of a red herring in these kinds of conversations, doubly so since more than one expert's worth of weights is used anyway). In reality a bit more than that, though definitely less than if all 671 billion parameters were dense. While we don't have much concrete public information about modern versions of ChatGPT, one thing we're relatively certain of is that "ChatGPT 4.5 has a TON more parameters and has basically nothing to show for it". Meanwhile Gemma 3, a 27B locally runnable model from Google, is a hair behind DeepSeek v3 in many popular benchmarks. The truth is, there are just a lot of other things beyond parameter count that go into making a well-performing model, and DeepSeek v3/R1 hit (or invented) a lot of them. If I had to place a bet on this whole comparison, I'd say OpenAI is very likely also using sparse/MoE-style architectures in their current products anyway.

As a final note, don't just dismiss the architecture of DeepSeek with "regardless of what they put in the paper"! The model files are available and they include this sort of layer/parameter information so your device is able to run them (well, "run" requires a bit of RAM in this case... but you can still read the model file on disk and see it's laid out as the paper claims regardless).


I mean, there's a reason they released it on a Saturday.

I'm just shocked that the companies who stole all kinds of copyrighted material would again do something unethical to keep the bubble and gravy train going...

Yes, their worst fear is people figuring out that an AI chatbot is a strict librarian that spits out quotes but doesn't let you enter the library (the AI model itself). Because with 3D game-like UIs people can enter the library and see all their stolen personal photos (if they were ever online), all kinds of monsters. It'll be all over YouTube.

Imagine this but you remove the noise and can walk like in an art gallery (it's a diffusion model but LLMs can be loosely converted into 3D maps with objects, too): https://writings.stephenwolfram.com/2023/07/generative-ai-sp...


Do you really think facebook's model weights contain all the facebook personal photos?

With more than 50% probability. Instagram has a clause that they can use your photos for ads and endorsements, basically for profit.

Many sites repost those photos, I doubt Meta will care to meticulously remove them. If the model can generate photo-like content of people - it had photos of people put inside.

I doubt they only trained on public domain photos of people.

If they started doing shady stuff, why would they ever stop?


I think it's most illustrative to see the sample battles (H2H) that LMArena released [1]. The outputs of Meta's model are too verbose and too 'yappy' IMO. And looking at the verdicts, it's no wonder people are discounting LMArena rankings.

[1]: https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03...


In fairness, 4o was like this until very recently. I suspect it comes from training on COT data from larger models.

Yep, it’s clear that many wins are due to Llama 4’s lowered refusal rate which is an effective form of elo hacking.

Meta got caught _first_.

Not even first, OpenAI got caught a while back

People have been gaming ML benchmarks as long as there have been ML benchmarks. That's why it's better to see if other researchers are incorporating a technique into their actual models rather than 'is this paper the bold entry in a benchmark table'. But it takes longer.

When a measure becomes a target it is no longer a good measure.

These ML benchmarks were never going to last very long. There is far too much pressure to game them, even unintentionally.


Do you have a source for this? That's interesting (if true).

They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it

https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...


I don't see anything in the article about being caught. Maybe I missed something?

davidcbc is spreading fake rumors.

OpenAI was never caught cheating on it, because we didn’t cheat on it.

As with any eval, you have to take our word for it, but I’m not sure what more we can do. Personally, if I learned that a OpenAI researcher purposely or accidentally trained on it, and we didn’t quickly disclose this, I’d quit on the spot and disclose it myself.

(I work at OpenAI.)

Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals. The gray zone where things could go wrong is differing levels of care used in scrubbing training data of equivalent or similar problems. At some point the line between learning and memorizing becomes blurry. If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?


Your comment would be better without the personal swipe in the first sentence.

https://news.ycombinator.com/newsguidelines.html


>They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it

Is anything here actually false or do you not like the conclusions that people may draw from it?


The false statement is “OpenAI got caught [gaming FrontierMath] a while back.”

From a primary source:

"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."

https://x.com/__nmca__/status/1882563755806281986


Well, if we trust you. But you had the extra dataset and no one else did.

Why are you being disingenuous? Simply having access to the eval in question is already enough for your synthetics guys to match the distribution, and of course you don't contaminate directly on train, that would be stupid, and you would get caught, but if it does inform the reward, the result is the same. You _should_ quit, but you wouldn't because you'd already convinced yourself you're doing RL God's work, not sleight of hand.

> If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?

This kind of disingenuous bullshit is exactly why people call you cheaters.

> Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals.

You guys should follow Apple's cult guidelines: never stand out. Think different


From a primary source:

"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."

https://x.com/__nmca__/status/1882563755806281986


Which part of my post was incorrect?

Sorry you got caught with your hand in the cookie jar, but you can take comfort in others doing the same I guess


Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

https://news.ycombinator.com/newsguidelines.html


What about the accusation that I was spreading false rumors? That's far more unkind

The guidelines apply to everyone. I didn't like that part of their comment either, and the comment would have been better without it, but they were put on the defensive by what had come before and they went on to explain their position. Your phrase "got caught with your hand in the cookie jar" is a swipe that the thread could also have done without. It's no big deal, and it applies to everyone on the subthread; we want everyone to avoid barbs like that on HN.

The incorrect part is “OpenAI got caught [gaming FrontierMath] a while back.”

From a primary source:

"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."

https://x.com/__nmca__/status/1882563755806281986


This is like saying “oh no we didn’t cheat because they didn’t tell us the answer out loud” when someone winked at you instead.

Stop being obtuse.


Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

https://news.ycombinator.com/newsguidelines.html


It happens with basically all papers on all topics. Benchmarks are useful when they are first introduced and used to measure things that were released before the benchmark. After that their usefulness rapidly declines.

“Got caught” is a misleading way to present what happened.

According to the article, Meta publicly stated, right below the benchmark comparison, that the version of Llama on LMArena was the experimental chat version:

> According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality”

The AI benchmark in question, LMArena, compares Llama 4 experimental to closed models like ChatGPT 4o latest, and Llama performs better (https://lmarena.ai/?leaderboard).


Is LMArena junk now?

I thought there was an aspect where you run two models on the same user-supplied query. Surely this can't be gamed?

> “optimized for conversationality”

I don't understand what that means - how it gives it an LMArena advantage.


There are almost certainly ways to fine-tune the model in ways that make it perform better on the Arena, but perform worse in other benchmarks or in practice. Usually that's not a good trade-off. What's being suggested here is that Meta is running such a fine-tuned version on the Arena (and reporting those numbers) while running models with different fine-tuning on other benchmarks (and reporting those numbers), while giving the appearance that those are actually the same models.

It can be easily gamed. The users are self-selected, and they have zero incentive to be honest or rigorous or provide good responses. Some have incentives the opposite way. (There was a report of a prediction market user who said they had won a market on Gemini models by manipulating the votes; LMArena swore furiously there had definitely been no manipulation but was conspicuously silent on any details.) And the release of more LMArena responses has shown that a lot of the user ratings are blatantly wrong: either they're basically fraudulent, or LMArena's current users are people whose ratings you should be optimizing against because they are so ignorant, lazy, and superficial.

At this point, when I look at the output from my Gemini-2.5-pro sessions, they are so high quality, and take so long to read, and check, and have an informed opinion on, I just can't trust the slapdash approach of LMArena in assuming that careless driveby maybe-didn't-even-read-the-responses-ain't-no-one-got-time-for-that-nerd-shit ratings mean much of anything. There have been red flags in the past and I've been taking them ever less seriously even as one of many benchmarks since early last year, but maybe this is the biggest backfire yet. And it's only going to get worse. At this rate, without major overhaul, you should take being #1 on LMArena seriously as useful and important news - as a reason to not use a model.

It's past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all, and at what point they are doing more harm than good. No benchmark lives forever and it is normal and healthy to shut them down at some point after having been saturated, but some manage to live a lot longer than they should have...


I guess I can't really refute your experience or observations. But, just a single anecdotal point; I use the arena's voting feature quite a bit and I try really hard to vote on the "best" answer. I've got no clue if the majority of the voters put the same level of effort into it or not, but I figure that an honest and rigorous vote is the least I can do in return for a service provided to me free with no obnoxious ads. There's a nonzero incentive to do the right thing, but it's hard to say where it comes from.

As an aside, I like getting two responses from two models that I can compare against one another (and with the primary sources of truth that I know a priori). Not only does that help me sanity-check the responses somewhat, but I get to interact with new models that I wouldn't have otherwise had the opportunity to. Learning new stuff is good, and being earnest is good.

_nick


In the time it takes you to do one good vote, the low-taste slopmaxxers can do 10 (or 20...?).

LMArena was always junk. I work in this space and while the media takes it seriously most scientists don't.

Random people ask random stuff and then it measures how good they feel. This is only a worthwhile evaluation if you're Google or Meta or OpenAI and you need to make a chatbot that keeps people coming back. It doesn't measure anything else useful.


I hear AI news from time to time from the M5M in the US - and the only place I've ever seen "LMArena" is on HN and in the LM studio discord. At a ratio of 5:1 at least.

It's mentioned quite a bit in the LLM related subreddits.

Conversation is a two-way street. A good conversation mechanic could elicit better interaction from the users and result in better answers. Stands to reason, anyway.

Llama 1 derived models on it were beating GPT 3.5 by having fewer refusals.

In one of Karpathy's videos he said that he was a bit suspicious that the models that score the highest on LMArena aren't the ones people use the most to solve actual day-to-day problems.

Tangent, but does anyone know why links to lmarena.ai are banned on Reddit, site-wide?

(Last I checked 1 month ago)


Evidence for this assertion, please? It's highly unlikely and maybe a bug.

I should have re-tested prior to posting. It is fixed now. I just tried posting a comment with the link and it was not [removed by reddit].

It was a really strange situation that lasted for months. Posts or comments with a direct link were removed like they were the worst of the worst.

I tried posting to r/bugs and it was downvoted immediately. I eventually contacted the lmarena folks, so maybe they resolved it with Reddit.


There's just no mechanism, to my knowledge, to ban links sitewide, especially for small sites (vs normies) like lmarena.

It was crazy. I tested in many subs, using posts and comments. The only caveat is that I only tried in one Reddit account. However, I could post links to any other site in subs that allow such things.

Ahmad al-Dahle, who leads "Gen AI" at Meta, wrote this on Twitter:

  ... We're also hearing some reports of mixed quality across different services ... 

  We've also heard claims that we trained on test sets -- that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.

  We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value.
https://x.com/Ahmad_Al_Dahle/status/1909302532306092107 / https://archive.vn/JzONp

There seems to be a lot of hullabaloo, accusations, and rumors, but little meat to any of them. Maybe they rushed the release, were unsure of which one to go with, and did some moderate rule-bending in terms of which tune got sent to the arena, but I have seen no hard evidence of real underhandedness.

I believe this was designed to flatter the prompter more / be more ingratiating. Which is a worry if true (what it says about the people doing the comparing).

There's no end to the possible vectors of human manipulation with this "open-weight" black box.

The top of that leaderboard is filled with closed weight experimental models.

This should surprise no one. Also, Goodhart's law strikes again.

Meta does themselves a disservice by having such a crappy public-facing AI for people to try (meta.ai). I regularly use the web versions of GPT 4o, DeepSeek, Grok, and Google Gemini 2.5.

Meta is always the worst so I don't even bother anymore.


Meta doing something dodgy OR unethical OR criminal ... and nobody is surprised.

In other news, the head of AI research just left

https://www.cnbc.com/2025/04/01/metas-head-of-ai-research-an...


I would have thought that title would belong to Yann.

It's a misnomer - the VP left. Yann is the Chief Scientist, which I imagine most would agree would be the 'head' of a research division.

TBH I'm very surprised Yann Le Cun is still there. He looks to me like a free thinker and an independent person. I don't think he buys into the Trump agenda and US nationalistic anti-Europe speech like Zuck does. He may be giving Zuck the benefit of the doubt, and probably is grateful that Zuck gave him a chance when nobody else did.

> TBH I'm very surprised Yann Le Cun is still there. He looks to me like a free thinker and an independent person. I don't think he buys into the Trump agenda and US nationalistic anti-Europe speech like Zuck does. He may be giving Zuck the benefit of the doubt, and probably is grateful that Zuck gave him a chance when nobody else did.

Zuck doesn't buy it, either. He just knows what's good for business right now.

In an example of the worst person you know making a great point, Josh Hawley said "What really struck me is that they can read an election return." [0].

Though it's worth remembering, it's very difficult to accumulate the volume of data necessary to do the current kind of AI training while sticking to the strictest interpretations of EU privacy law. Social media companies aren't just feeding the user data into marketing algorithms, they're feeding them into AI models. If you're a leading researcher in that field - Like Le Cun - and the current state-of-the-art means getting as much data as possible, you might not appreciate the regulatory environment of the EU.

[0] https://www.npr.org/2025/02/27/nx-s1-5302712/senator-josh-ha...


A lower level employee also resigned specifically about this:

https://x.com/arjunaaqa/status/1909174905549042085?s=46


I wonder how much the current work environment contributed to this. There's immense pressure to deliver, so it's not surprising to see this.

For me at least, the 10M context window is a big deal and as long as it's decent, I'm going to use it instead. I'm running Scout locally and my chat history can get very long. I'm very frustrated when the context window runs out. I haven't been able to fully test the context length but at least that one isn't fudged.

This feels like AI deniers grasping at straws. You have to prove the claim that "friendly" models do better in head to head user ratings. You also have to adequately define your claim which hasn't been done. And then you have to prove that by whatever definition of "friendliness" you've constructed made a significant difference in the benchmark.

What I suspect will happen is that someone will just tell the LLM to "be curt", which may cause its score to drop, and people will unthinkingly take that to mean the above nonsensical claims are true.


lmarena has lost credibility for me. Are there any better alternatives out there?

The last LLM launches from the top US AI labs were rather disappointing. Products seem underbaked and mainly reactive... They oversold the capabilities of their future models when DeepSeek landed, to stay in the news at the time, but now that we look at the results it's not what everyone expected. And people are starting to question the hype, which is healthy but also very dangerous for companies needing very large capital. It seems that some labs like Meta are onto something, but it's more research material for now than a short-term product. Interesting times.

wow, is that real?

"I tested myself on a subset of my training data and found that I am the greatest!"

The truth is that the vast majority of FAANG engineers making high six figures are only good at deterministic work. They can't produce new things, and so Meta and Google are struggling to compete when actual merit matters, and they can't just brute-force the solutions. Inside these companies, the massive tech systems built are actually generally terrible, but they pile on legions of engineers to fix the problems.

This is the culture of Meta hurting them: they are paying "AI VPs" millions of dollars to go to status meetings to get dates for when these models will be done. Meanwhile, DeepSeek has a flat hierarchy with engineers who actually understand low-level computing.

It's making a mockery of big tech, and is why startups exist. Big-company employees rise through the ranks by building skill sets other than producing true economic value.


> They can't produce new things, and so Meta and Google are struggling to compete when actual merit matters, and they can't just brute-force the solutions.

You haven't been keeping up. Less than 2 weeks ago, Google released a model that has crushed the competition, clearly being SotA while currently effectively free for personal use.

Gemini 2.0 was already good, people just weren't paying attention. In fact 1.5 pro was already good, and ironically remains the #1 model at certain very specific tasks, despite being set for deprecation in September.

Google just suffered from their completely botched initial launch way back when (remember Bard?), rushed before the product was anywhere near ready, making them look like a bunch of clowns compared to e.g. OpenAI. That left a lasting impression on those who don't devote significant time to keeping up with newer releases.


Gemini 2.5 Pro isn't good, and if you think it is, you aren't using LLMs correctly. The model gets crushed by o1 pro and Sonnet 3.7 thinking. Build a large contextual prompt (> 50k tokens) with a ton of code, and see how bad it is. I cancelled my Gemini subscription.

https://aider.chat/docs/leaderboards/ Your experience doesn't align with my experience or this benchmark. o1 pro is good, but I would rather do 20 cycles on Gemini 2.5 than wait for o1 pro to return.

I have just watched Sonnet 3.7 vs Gemini 2.5 solving the same task (fix a bug end-to-end) side by side, and Sonnet hallucinated far worse and repeatedly got stuck in dead-ends requiring manual rescue. OTOH Gemini understood the problem based on bug description and code from the get go, and required minimal guidance to come up with a decent solution and implement it.

I have, dozens of times, and it's generally better than 3.7. Especially with more context it's less forgetful. o1-pro is absurdly expensive and slow, good luck using that with tools. Virtually all benchmarks, including less gamed ones such as Aider's, show the same. WebLM still has 3.7 ahead, with Sonnet always having been particularly strong at web development, but even on there 2.5 Pro is miles in front of any OpenAI model.

Gemini subscription? Surely if you're "using LLMs correctly" you'd have been using the APIs for everything anyway. Subscriptions are generally for non-techy consumers.

In any case, just straight up saying "it isn't good" is absurd, even if you personally prefer others.


The problem is less that those high level engineers are only good at deterministic work and more that they're only rewarded for deterministic work.

There is no system to pitch an idea as opening new frontiers - all ideas must be able to optimize some number that leadership has already been tricked into believing is important.


A whole lot of opinion there, not a whole lot of evidence.

Evidence is a decade inside these companies, watching the circus

"I'm not bitter! No chip on my shoulder."

bitter about what? I'm a long time employee

What company?

Next on Matt Levine’s newsletter: Is Meta fudging with stock evaluation-correlated metrics? Is this securities fraud?

Sarcasm doesn't translate well in text. Please, elaborate.

Matt Levine has a common refrain which is that basically everything is securities fraud. If you were an investor who invested on the premise that Meta was good at AI and Zuck knowingly put out a bad model, is that securities fraud? Matt Levine will probably argue that it could be in a future edition of Money Stuff (his very good newsletter).

The "everything is securities fraud" meme is really unfortunate, not quite as bad as the "fiduciary duty means execs have to chase short-term profit" myth, but still harmful.

It's only because lying ("puffery") about everything has become the norm in corporate America that indeed, almost all listed companies commit securities fraud. If they went back to being honest businessmen, there would be no more securities fraud. Just stop claiming things that aren't true. This is a very real option they could take. If they don't, then they're willingly and knowingly committing securities fraud. But the meme makes it sound to people as if it's unavoidable, when it's anything but.


Is it securities fraud? Sort of.

If Mark, both through Meta and through his own resources, has the capital to hire and retain the best AI researchers / teams, and claims he's doing so, but puts out a model that sucks, he's liable. It's probably not directly fraud, but if he claims he's trying to compete with Google or Microsoft or Apple or whoever, yet doesn't adequately deploy a comparable amount of resources, capital, people, whatever, and doesn't explain why, it could (stretch) be securities fraud....I think.


And the fine for that? Probably 0.001% of revenue. If that.

If that's what it takes to get some honesty out of corporate, is it such a bad idea? Why?

Nailed it!

tech companies competing over something that is losing them money is the most bizarre spectacle yet.

I think Meta sees AI and VR/AR as a platform. They got left behind on the mobile platform and forever have to contend with Apple's semi-monopoly. They have no control and little influence over the ecosystem. It's an existential threat to them.

They have vowed not to make that mistake again so are pushing for an open future that won't be dominated by a few companies that could arbitrarily hurt Meta's business.

That's the stated rationale at least and I think it more or less makes sense


Makes sense except for the fact that they leaked the llama weights by accident and needed to reverse engineer that explanation.

I wouldn't call what Meta is doing with VR/AR an "open future", it's pretty much the exact same playbook that Google and Apple used for their platforms. The only difference is Meta gets to be the landlord this time.

I'm in favor of whatever semi-monopoly enables fine grained permissions so Facebook can't en masse slurp Whatsapp (antitrust?) contacts

They stated that?

Feels very late 90s.

The old joke is they're losing money on every sale but they'll make up for it in volume.


chef kiss perfect

The reason is simple. All tech companies have very high valuations. They have to sell investors a dream to justify that valuation. They have to convince people that they have the next big thing around the corner.

Plot big tech stock valuations with markers for successful OS model releases.


Is it really losing them money if investors throw fistfuls of cash at them for it

Borderline conspiracy theory with an ounce of truth:

None of the models Meta put out are actually open source (by any measure), and everyone who redistributes Llama models or any derivatives, or uses Llama models for their business, is on the hook for getting sued in the future, based on the terms and conditions people have been explicitly/implicitly agreeing to when they use/redistribute these models.

If you start depending on these Llama models, which have unfavorable proprietary terms today that Meta doesn't act on, that doesn't mean they won't act on them in the future. Maybe this has all been a play to get people into this position, so Meta can later start charging for them or something else.


You never go full Oracle.

This is a signal that the sector is heavily monopolized.

This has happened many times before in US history.


Come on, you can do the critical thinking here to understand why these companies would want the best in class (open/closed) weight LLMs.

then why would they cheat?

I didn't see evidence of cheating in the article. Having a slightly differently tuned version of 4 is not the most dastardly thing that can be done. Everything else is insinuation.

Well we'll see if they suffer consequences of this and they cheated too hard, but being perceived as best in class is arguably worth even more than being the best in class, especially if differences in performance are hard to perceive anecdotally.

The goal is long term control over a technology's marketshare, as winner take all dynamics are in play here.


they're all cheating, see grok

Are you referring to this [1]?

> Critics have pointed out that xAI’s approach involves running Grok 3 multiple times and cherry-picking the best output while comparing it against single runs of competitor models.

[1] https://medium.com/@cognidownunder/the-hype-machine-gpt-4-5-...


LeCun making up results ... well he comes from Academia, so ...

I tried to make Studio Ghibli inspired images using presumably their new models. It was ass.

Llama is not an image generating model. Any interface that uses Llama and generates images is calling out to a separate image generator as a tool, like OpenAI used to do with ChatGPT and DALL-E up until a couple of weeks ago: https://simonwillison.net/2023/Oct/26/add-a-walrus/
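
For illustration, the pattern looks roughly like this (hypothetical stand-in functions, not any real chat or image API):

  # Hypothetical sketch of the "LLM delegates to an image generator" pattern.
  # generate_with_llm() and render_image() are stand-ins, not real APIs.
  def handle_turn(user_message, generate_with_llm, render_image):
      reply = generate_with_llm(user_message)    # chat model returns text or a tool call
      if reply.get("tool") == "image_generator":
          prompt = reply["arguments"]["prompt"]  # the LLM only writes a text prompt...
          image = render_image(prompt)           # ...a separate image model makes the pixels
          return {"type": "image", "prompt": prompt, "image": image}
      return {"type": "text", "text": reply["content"]}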

GPT 4o images is the future of all image gen.

Every other player (Black Forest Labs' Flux, Stability.ai's Stable Diffusion, and even closed models like Ideogram and Midjourney) is on the path to extinction.

Image generation and editing must be multimodal. Full stop.

Google Imagen will probably be the first model to match the capabilities of 4o. I'm hoping one of the open weights labs or Chinese AI giants will release a model that demonstrates similar capabilities soon. That'll keep the race neck and neck.


One very important distinction between image models is the implementation: 4o is autoregressive, slow, and extremely expensive.

Although the Ghibli trend is market validation, I suspect that competitors may not want to copy it just yet.


Extremely expensive in what sense? In that it costs $.03 instead of $.00003c? Yeah, it's relatively far more expensive than other solutions, but from an absolute standpoint still very cheap for the vast majority of use cases. And it's a LOT better.

Dall-E is already 4-8 cents per image. Afaik this is not in the API yet but I wouldn't be surprised if it's $1 or more.

> 4o is autoregressive, slow, and extremely expensive.

If you factor in the amount of time wasted with prompting and inpainting, it's extremely well worth it.


Impressive results from Meta's Llama adapting to various benchmarks. However, gaming performance seems lackluster compared to specialized models like Alpaca. It raises questions about the viability of large language models for complex, interactive tasks like gaming without more targeted fine-tuning. Exciting progress nonetheless!


