
It also feels similar to mixture of depths (https://arxiv.org/abs/2404.02258).

Being able to apply this post-training is pretty cool though; it makes it easier to use across a wider range of setups.




Yes! I like that, and I saw that paper last weekend, iirc. I think MoD/MoE and other similar methods are highly compatible with this approach, and work in a similar spirit.

I was originally afraid that this method wouldn't be compatible with MoE and the other methods, but fortunately, at least for Mixtral, there seems to be an amazing synergy.

By the way, other tasks have higher priority now, but there is an interesting observation about MoE. In MoE two experts are chosen per token, and each expert gets a different weight - e.g. expert 1 has a 75% weight and expert 2 a 25% weight. Perhaps this could allow scaling the effort accordingly: 75% effort for one expert and 25% for the other. There are some issues due to the non-linearity of the layers, but perhaps there is something to it.
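
Roughly what I mean, as a minimal numpy sketch (illustration only, not how Mixtral or the actual implementation does it): top-2 gating gives the two expert weights, and a hypothetical effort_for_expert mapping turns each gate weight into an effort level for that expert.

    import numpy as np

    def top2_route(gate_logits):
        # Softmax over the expert logits, then keep the two largest.
        probs = np.exp(gate_logits - gate_logits.max())
        probs /= probs.sum()
        top2 = np.argsort(probs)[-2:][::-1]   # indices of the two chosen experts
        w = probs[top2] / probs[top2].sum()   # renormalized weights, e.g. [0.75, 0.25]
        return top2, w

    def effort_for_expert(weight, base_effort=0.5, max_effort=1.0):
        # Hypothetical mapping: the dominant expert gets more effort,
        # the minor expert gets less. The exact mapping would need tuning.
        return base_effort + (max_effort - base_effort) * weight

    gate_logits = np.array([0.1, 2.0, 0.3, 1.0])
    experts, weights = top2_route(gate_logits)
    for e, w in zip(experts, weights):
        print(f"expert {e}: gate weight {w:.2f}, effort {effort_for_expert(w):.2f}")

The open question is whether running the minor expert at lower effort degrades its output more than its small contribution can tolerate, given the non-linearities mentioned above.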



