
It also feels similar to mixture of depths (https://arxiv.org/abs/2404.02258).

Being able to apply this post-training is pretty cool though; it makes it easier to use across a wider range of setups.




Yes! I like that, and I saw that paper last weekend, iirc. I think MoD/MoE and other similar methods are highly compatible with this approach, and work in a similar spirit.

I was originally afraid that this method wouldn't be compatible with MoE and the other methods, but fortunately, at least for Mixtral, there seems to be an amazing synergy.

By the way, other tasks have higher priority now, but there is an interesting observation about MoE. In MoE two experts are chosen per token, and each expert gets a different weight - e.g. expert 1 has a 75% weight and expert 2 a 25% weight. Perhaps this could allow scaling the effort accordingly: 75% effort for one expert and 25% for the other. There are some issues due to the non-linearity of the layers, but perhaps there is something to it.
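
Roughly what I mean, as a minimal numpy sketch (illustration only, not how Mixtral or the actual implementation does it): top-2 gating gives the two expert weights, and a hypothetical effort_for_expert mapping turns each gate weight into an effort level for that expert.

    import numpy as np

    def top2_route(gate_logits):
        # Softmax over the expert logits, then keep the two largest.
        probs = np.exp(gate_logits - gate_logits.max())
        probs /= probs.sum()
        top2 = np.argsort(probs)[-2:][::-1]   # indices of the two chosen experts
        w = probs[top2] / probs[top2].sum()   # renormalized weights, e.g. [0.75, 0.25]
        return top2, w

    def effort_for_expert(weight, base_effort=0.5, max_effort=1.0):
        # Hypothetical mapping: the dominant expert gets more effort,
        # the minor expert gets less. The exact mapping would need tuning.
        return base_effort + (max_effort - base_effort) * weight

    gate_logits = np.array([0.1, 2.0, 0.3, 1.0])
    experts, weights = top2_route(gate_logits)
    for e, w in zip(experts, weights):
        print(f"expert {e}: gate weight {w:.2f}, effort {effort_for_expert(w):.2f}")

The open question is whether running the minor expert at lower effort degrades its output more than its small contribution can tolerate, given the non-linearities mentioned above.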



