For a super ignorant person: Both Llama 4 Scout and Llama 4 Maverick use a Mixtu...

vessenes · 2025-04-05T20:48:33 1743886113

This was an idea that sounded somewhat silly until it was shown it worked. The idea is that you encourage through training a bunch of “experts” to diversify and “get good” at different things. These experts are say 1/10 to 1/100 of your model size if it were a dense model. So you pack them all up into one model, and you add a layer or a few layers that have the job of picking which small expert model is best for your given token input, route it to that small expert, and voila — you’ve turned a full run through the dense parameters into a quick run through a router and then a 1/10 as long run through a little model. How do you get a “picker” that’s good? Well, it’s differentiable, and all we have in ML is a hammer — so, just do gradient descent on the decider while training the experts!

This generally works well, although there are lots and lots of caveats. But it is (mostly) a free lunch, or at least a discounted lunch. I haven’t seen a ton of analysis on what different experts end up doing, but I believe it’s widely agreed that they tend to specialize. Those specializations (especially if you have a small number of experts) may be pretty esoteric / dense in their own right.

Anthropic’s interpretability team would be the ones to give a really high quality look, but I don’t think any of Anthropic’s current models are MoE.

Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking, but I might just be biased towards more weights. And they are undeniably faster and better per second of clock time, GPU time, memory or bandwidth usage — on all of these - than dense models with similar training regimes.

zamadatix · 2025-04-05T20:56:17 1743886577

The only thing about this which may be unintuitive from the name is an "Expert" is not something like a sub-llm that's good at math and gets called when you ask a math question. Models like this have layers of networks they run tokens through and each layer is composed of 256 sub-networks, any of which can be selected (or multiple selected and merged in some way) for each layer independently.

So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just a done a bit deeper in the model than one may assume.

jimmyl02 · 2025-04-05T21:22:13 1743888133

the most unintuitive part is that from my understanding, individual tokens are routed to different experts. this is hard to comprehend with "experts" as that means two you can have different experts for two sequential tokens right?

I think where MoE is misleading is that the experts aren't what we would call "experts" in the normal world but rather they are experts for a specific token. that concept feels difficult to grasp.

phire · 2025-04-06T01:09:20 1743901760

It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

It's more of a performance optimization than anything else, improving memory liquidity. Except it's not an optimization for running the model locally (where you only run a single query at a time, and it would be nice to keep the weights on the disk until they are relevant).

It's a performance optimization for large deployments with thousands of GPUs answering tens of thousands of queries per second. They put thousands of queries into a single batch and run them in parallel. After each layer, the queries are re-routed to the GPU holding the correct subset of weights. Individual queries will bounce across dozens of GPUs per token, distributing load.

Even though the name "expert" implies they should experts in a given topic, it's really not true. During training, they optimize for making the load distribute evenly, nothing else.

phire · 2025-04-06T03:48:06 1743911286

BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices.

While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you tuning them to only switch expert once or twice per token, and ideally keep the same weights across multiple tokens.

Well, nothing stopping you, but there is the question of if it will actually produce a worthwhile model.

regularfry · 2025-04-06T12:43:30 1743943410

Intuitively it feels like there ought to be significant similarities between expert layers because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" then specialising the individuals as low-rank adaptations on top of that base would mean you could save a lot of VRAM and expert-swapping. But it might mean you need to train from the start with that structure, rather than it being something you can distil to.

phire · 2025-04-06T21:46:54 1743976014

Yes, Deepseek introduced this optimisation of a common base "expert" that's always loaded. Llama 4 uses it too.

regularfry · 2025-04-07T07:24:09 1744010649

I had a sneaking suspicion that I wouldn't be the first to think of it.

boroboro4 · 2025-04-06T07:35:13 1743924913

DeepSeek introduced novel experts training technique which increased experts specialization. For particular given domain their implementation tends to activate same experts between different tokens, which is kinda what you’re asking for!

jumski · 2025-04-06T08:17:49 1743927469

I think Gemma 3 is marketed for single GPU setups https://blog.google/technology/developers/gemma-3/

idonotknowwhy · 2025-04-08T01:50:55 1744077055

> It's not even per token. The routing happens once per layer, with the same token bouncing between layers.

They don't really "bounce around" though do they (during inference)? That implies the token could bounce back from eg. layer 4 -> layer 3 -> back to layer 4.

mentalgear · 2025-04-06T08:21:12 1743927672

So a more correct term would be "Distributed Loading" instead of MoE.

igravious · 2025-04-06T12:49:35 1743943775

> making the load distribute evenly, nothing else.

so you mean a "load balancer" for neural nets … well, why don't they call it that then?

lxgr · 2025-04-06T15:59:24 1743955164

Some load balancers are also routers (if they route based on service capability and not just instantaneous availability) or vice versa, but this kind isn't always, to my understanding: The experts aren't necessarily "idle" or "busy" at any given time (they're just functions to be invoked, i.e. generally data, not computing resources), but rather more or less likely to answer correctly.

Even in the single GPU case, this still saves compute over the non-MoE case.

I believe it's also possible to split experts across regions of heterogeneous memory, in which case this task really would be something like load balancing (but still based on "expertise", not instantaneous expert availability, so "router" still seems more correct in that regard.)

bonoboTP · 2025-04-05T23:07:04 1743894424

Also note that MoE is a decades old term, predating deep learning. It's not supposed to be interpreted literally.

tomp · 2025-04-05T21:40:21 1743889221

> individual tokens are routed to different experts

that was AFAIK (not an expert! lol) the traditional approach

but judging by the chart on LLaMa4 blog post, now they're interleaving MoE models and dense Attention layers; so I guess this means that even a single token could be routed through different experts at every single MoE layer!

wrs · 2025-04-06T16:14:44 1743956084

ML folks tend to invent fanciful metaphorical terms for things. Another example is “attention”. I’m expecting to see a paper “consciousness is all you need” where “consciousness” turns out to just be a Laplace transform or something.

klipt · 2025-04-05T21:17:56 1743887876

So really it's just utilizing sparse subnetworks - more like the human brain.

philsnow · 2025-04-05T22:27:55 1743892075

The idea has also been around for at least 15 years; "ensemble learning" was a topic in my "Data Mining" textbook from around then.

Meta calls these individually smaller/weaker models "experts" but I've also heard them referred to as "bozos", because each is not particularly good at anything and it's only together that they are useful. Also bozos has better alliteration with boosting and bagging, two terms that are commonly used in ensemble learning.

lordswork · 2025-04-06T01:21:23 1743902483

MOE as an idea specific to neural networks has been around since 1991[1] . OP is probably aware, but adding for others following along, while MoE has roots in ensembling, there are some important differences: Traditional ensembles run all models in parallel and combine their outputs, whereas MoE uses a gating mechanism to activate only a subset of experts per input. This enables efficient scaling via conditional computation and expert specialization, rather than redundancy.

[1]:https://ieeexplore.ieee.org/document/6797059

Buttons840 · 2025-04-05T20:53:36 1743886416

If I have 5000 documents about A, and 5000 documents about B, do we know whether it's better to train one large model on all 10,000 documents, or to train 2 different specialist models and then combine them as you describe?

vessenes · 2025-04-05T22:03:25 1743890605

well you don't. but the power of gradient descent if properly managed will split them up for you. But you might get more mileage out of like 200 specialist models.

MoonGhost · 2025-04-06T19:13:14 1743966794

It probably depends on how much A and B overlap. If it's say English sci-fi and Chinese poetry two different models may be better.

MoonGhost · 2025-04-06T19:30:31 1743967831

> Anecdotally, I feel MoE models sometimes exhibit slightly less “deep” thinking

Makes sense to compare apples with apples. Same compute amount, right? Or you are giving less time to MoE model and then feel like it underperforms. Shouldn't be surprising...

> These experts are say 1/10 to 1/100 of your model size if it were a dense model

Just to be correct, each layer (attention + fully connected) has it's own router and experts. There are usually 30++ layers. It can't be 1/10 per expert as there are literally hundreds of them.

tomjen3 · 2025-04-06T06:41:09 1743921669

Cool. Those that mean I could just run the query through the router and then load only the required expert? That is could I feasibly run this on my Macbook?

faraaz98 · 2025-04-05T22:49:43 1743893383

I've been calling for this approach for a while. It's kinda similar to how the human brain has areas that are good at specific tasks

usef- · 2025-04-05T23:47:34 1743896854

It's already used a lot — the paper I believe is from 1991, and GPT4 among many others is MoE

randomcatuser · 2025-04-05T20:59:57 1743886797

yes, and it's on a per-layer basis, I think!

So if the model has 16 transformer layers to go through on a forward pass, and each layer, it gets to pick between 16 different choices, that's like 16^16 possible expert combinations!

mrbonner · 2025-04-05T23:04:56 1743894296

So this is kind of an ensemble sort of thing in ML like random forest and GBT?

chaorace · 2025-04-05T20:51:57 1743886317

The "Experts" in MoE is less like a panel of doctors and more like having different brain regions with interlinked yet specialized functions.

The models get trained largely the same way as non-MoE models, except with specific parts of the model silo'd apart past a certain layer. The shared part of the model, prior to the splitting, is the "router". The router learns how to route as an AI would, so it's basically a black-box in terms of whatever internal structure emerges from this.

pornel · 2025-04-05T20:59:27 1743886767

No, it's more like sharding of parameters. There's no understandable distinction between the experts.

vintermann · 2025-04-06T07:12:26 1743923546

I understand they're only optimizing for load distribution, but have people been trying to disentangle what the the various experts learn?

calaphos · 2025-04-06T09:29:11 1743931751

Mixture of experts involves some trained router components which routes to specific experts depending on the input, but without any terms enforcing load distribution this tends to collapse during training where most information gets routed to just one or two experts.

pornel · 2025-04-06T12:11:41 1743941501

Keep in mind that the "experts" are selected per layer, so it's not even a single expert selection you can correlate with a token, but an interplay of abstract features across many experts at many layers.

brycethornton · 2025-04-05T20:50:22 1743886222

I believe Mixture-of-Experts is a way for a neural network to group certain knowledge into smaller subsets. AFAIK there isn't a specific grouping goal, the network just figures out what goes where on it's own and then when an inference request is made it determines what "expert" would have that knowledge and routes it there. This makes the inference process much more efficient.

lern_too_spel · 2025-04-05T22:14:11 1743891251

https://arxiv.org/abs/1701.06538