Don't forget that you can link two 'Digits' together (~256 GB combined) if you want to run even larger models or use a larger context size.
That is 2x $3,000 (two 'Digits') vs. 8x $2,000 (an 8-GPU rack).
This would let you run models of up to 405B parameters, such as Llama 3.1 405B at a 4-bit quant or Grok-1 314B at a 6-bit quant.
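As a rough sanity check, here is a back-of-the-envelope, weights-only estimate (it ignores KV cache and runtime overhead, so the real footprint is somewhat higher):

```python
# Rough weight-only memory estimate; ignores KV cache and runtime overhead,
# so the actual footprint will be somewhat higher than these figures.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB

print(f"Llama 3.1 405B @ 4-bit: ~{weight_gb(405, 4):.0f} GB")  # -> ~202 GB
print(f"Grok-1 314B   @ 6-bit: ~{weight_gb(314, 6):.0f} GB")  # -> ~236 GB
# Both fit within the ~256 GB of two linked 'Digits', though the 6-bit
# Grok-1 case leaves little headroom for context.
```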
Who knows, maybe future models will be better optimized and won't need that much RAM, but buying a second 'Digits' is easier than building a rack with 8 GPUs.
For example, if you look at the latest Llama models, Meta states: 'Llama 3.3 70B approaches the performance of Llama 3.1 405B'.
To run inference on Llama 3.3 70B Instruct with ~8k context length (without offloading), you'd need roughly:
- Q4 (~44 GB): 2x 5090, or 1x 'Digits'
- Q6 (~58 GB): 2x 5090, or 1x 'Digits'
- Q8 (~74 GB): 3x 5090, or 1x 'Digits'
- FP16 (~144 GB): 5x 5090, or 2x 'Digits'
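For reference, here is a minimal sketch of how numbers like these come about, assuming Llama 3 70B's published architecture (80 layers, 8 KV heads, head dim 128) and rough GGUF-style effective bits per weight; treat the output as ballpark figures, not exact requirements:

```python
# Back-of-the-envelope memory estimate: weights + KV cache at ~8k context.
# Architecture constants are Llama 3 70B's published specs; the effective
# bits-per-weight values are rough quant averages, so results are ballpark.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
PARAMS = 70e9

def total_gb(bits_per_weight: float, ctx_tokens: int = 8192,
             kv_bytes_per_elem: int = 2) -> float:
    weights = PARAMS * bits_per_weight / 8
    kv_cache = 2 * LAYERS * KV_HEADS * HEAD_DIM * kv_bytes_per_elem * ctx_tokens
    return (weights + kv_cache) / 1e9

for name, bpw in [("Q4", 4.5), ("Q6", 6.5), ("Q8", 8.5), ("FP16", 16)]:
    print(f"{name:>4}: ~{total_gb(bpw):.0f} GB")
# -> roughly 42 / 60 / 77 / 143 GB, in the same ballpark as the
#    ~44 / 58 / 74 / 144 GB listed above once runtime overhead is added
```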
Let's wait and see what memory bandwidth it will have.