The paper with details: https://arxiv.org/pdf/2411.02265

They use

- 16 experts, of which one is activated per token

- 1 shared expert that is always active

In summary, that makes around 52B active parameters per token, compared with the 405B of Llama 3.1, which is a dense model and so uses all of its parameters for every token.
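To make the routing concrete, here is a toy sketch of a layer with 16 routed experts (one picked per token) plus one shared expert that always runs. Only the 16 / 1 / 1 counts come from the comment above; the tiny hidden size, the random weights, the softmax gate, and every name in the code (`moe_forward`, `router`, `shared`, etc.) are my own assumptions for illustration, not the paper's actual implementation.

```python
import math
import random

NUM_EXPERTS = 16  # routed experts (from the comment above)
TOP_K = 1         # routed experts activated per token (from the comment above)
DIM = 4           # toy hidden size, purely an assumption


def make_matrix(rng, rows, cols):
    """Random rows x cols matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router, experts, shared):
    """Run the shared expert plus the single best-scoring routed expert."""
    gates = softmax(matvec(router, x))          # one gate value per routed expert
    chosen = max(range(len(experts)), key=lambda i: gates[i])  # top-1 routing
    routed = matvec(experts[chosen], x)         # only this routed expert runs
    always = matvec(shared, x)                  # shared expert runs for every token
    # Assumed combination rule: shared output plus gate-weighted routed output.
    out = [a + gates[chosen] * r for a, r in zip(always, routed)]
    return out, chosen


rng = random.Random(0)
router = make_matrix(rng, NUM_EXPERTS, DIM)
experts = [make_matrix(rng, DIM, DIM) for _ in range(NUM_EXPERTS)]
shared = make_matrix(rng, DIM, DIM)

token = [0.5, -1.0, 0.25, 2.0]
out, chosen = moe_forward(token, router, experts, shared)

experts_run = 1 + TOP_K  # shared expert + the routed one
print(f"routed expert {chosen}; ran {experts_run} of {NUM_EXPERTS + 1} expert blocks")
```

The point is just that each token touches 2 of the 17 expert blocks, which is why the active parameter count per token ends up far below the total.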





