They use:
- 16 experts, of which one is activated per token
- 1 shared expert that is always active

In total, that comes to around 52B active parameters per token, compared with the 405B of Llama 3.1.
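For intuition, here is a minimal PyTorch sketch of this kind of layer: a router picks exactly one of the 16 routed experts for each token, while the shared expert processes every token. The hidden sizes and the SiLU MLP expert shape are placeholders, not the model's actual configuration.

```python
import torch
import torch.nn as nn


class Top1MoEWithSharedExpert(nn.Module):
    """Sketch of an MoE layer: top-1 routing over 16 experts plus 1 shared expert."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=16):
        super().__init__()
        # Router scores each token against the routed experts.
        self.router = nn.Linear(d_model, n_experts)
        # 16 routed experts; only one is applied per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert, applied to every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # x: (n_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)       # (n_tokens, n_experts)
        top_w, top_idx = gate.max(dim=-1)            # top-1: one routed expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens routed to expert e ever pass through it,
                # which is why the active parameter count stays low.
                routed[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed
```

Because each token only touches the shared expert, the router, and a single routed expert, the parameters actually exercised per token are a small fraction of the total parameter count, which is the point of the 52B-active-versus-405B comparison above.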