RAM for 4-bit weights is roughly 1GB per 2 billion parameters (0.5 bytes per parameter). So you will want 256GB RAM and at least one GPU. If you only have one server and one user, you need RAM for the full parameter count. (If you have multiple GPUs/servers and many users in parallel, you can shard the experts and route requests so each GPU/server only needs roughly the active parameter count. So it's cheaper at scale.)
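A quick back-of-the-envelope in Python, counting only the weights at 4-bit (KV cache and activations add more on top); the 480B-total / 35B-active split below is illustrative, not any specific model:

    def weight_ram_gb(params_billion, bits_per_param=4):
        # Weights only: params * (bits / 8) bytes, expressed in GB.
        return params_billion * bits_per_param / 8

    total_b, active_b = 480, 35  # hypothetical MoE: total vs. active params

    # One server, one user: the whole model must be resident.
    print(weight_ram_gb(total_b))   # 240.0 GB -> hence ~256GB of RAM
    # Sharded across many servers: closer to the active count each.
    print(weight_ram_gb(active_b))  # 17.5 GB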

Do the inactive parameters need to be loaded into RAM to run an MoE model at a reasonable speed?


