How does that work in practice? Do you connect the LLM up directly to the app somehow, or do you ask it what to do and then do what it says and see what happens?
How much RAM do you need to make those work acceptably well? I assume Llama 3.3 means the 70B model, so you need > 70GB. (So, I guess, a MacBook with 128GB?) In which case I guess you're also using 8 bits for the Qwen model?
We made an adapter (a dedicated CLI interface) that the LLM uses to drive the app. It's kind of like an integration test, just a bit more sophisticated.
The LLM gets a prompt listing the CLI commands it may use, plus its "personality", and then it does what it does.
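A minimal sketch of what such an adapter loop could look like; the `app-cli` command names and the prompt text are hypothetical, and I'm assuming a local OpenAI-compatible chat endpoint (e.g. a llama.cpp server on port 8080):

```python
import json
import subprocess
import urllib.request

LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed local llama.cpp-style server

# Hypothetical allow-list of CLI commands the model may run against the app.
ALLOWED = {"status", "list-users", "create-user", "delete-user"}

SYSTEM_PROMPT = """You are a cautious tester of our app.
You may run exactly one of these CLI commands per turn:
  status | list-users | create-user <name> | delete-user <name>
Reply with the command only, nothing else."""

def ask_llm(messages):
    """Send the conversation to the local model and return its reply text."""
    body = json.dumps({"messages": messages, "temperature": 0.2}).encode()
    req = urllib.request.Request(LLM_URL, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"].strip()

def run_command(line):
    """Execute the command through the app's CLI adapter and capture its output."""
    parts = line.split()
    if not parts or parts[0] not in ALLOWED:
        return f"refused: '{line}' is not an allowed command"
    result = subprocess.run(["app-cli", *parts], capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

messages = [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Check the app status, then create a test user."}]

# A few turns of: the model proposes a command, we run it, feed the output back.
for _ in range(5):
    command = ask_llm(messages)
    output = run_command(command)
    print(f"> {command}\n{output}")
    messages.append({"role": "assistant", "content": command})
    messages.append({"role": "user", "content": output})
```

The important bit is the allow-list: the model never touches the app directly, it only ever proposes one of a handful of known commands and sees the output.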
On the hardware side, I personally run 2x 3090 cards on an AMD TR 79x platform with 128GB RAM, which yields around 12 tokens/sec for Llama 3.3 or Qwen 2.5 72B (Q5_K_M). That's okay for this use case (prompt ingestion is roughly double that).
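For a rough sense of why it lands in that range, here's a back-of-envelope estimate; the ~5.5 bits per weight figure for Q5_K_M is an approximation on my part, not an exact spec:

```python
# Rough, assumption-laden estimate of the quantized weight footprint.
params = 72e9                  # Qwen 2.5 72B
bits_per_weight = 5.5          # approximate average for Q5_K_M
weights_gb = params * bits_per_weight / 8 / 1e9
vram_gb = 2 * 24               # two RTX 3090s

print(f"weights ~{weights_gb:.0f} GB vs {vram_gb} GB VRAM")
# -> weights ~50 GB vs 48 GB VRAM, before KV cache and overhead,
#    so part of the model likely runs from system RAM rather than VRAM.
```

In other words, the weights alone roughly match the combined VRAM, which is why the 128GB of system RAM matters and why generation sits around the 12 tokens/sec mark rather than higher.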
If you want to know more details, feel free to drop me a message (username at liku dot social)