Personally, I plugged a Jabra conference speaker into a Raspberry Pi; if it hears something interesting, it sends the audio to my local GPU machine for decoding (with whisper), gets an answer, and sends the response back to the Raspberry as audio (with a model from coqui-ai/TTS, but using plainer PyTorch). Works really nicely for having very local weather, calendar, ...
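The round trip between the Pi and the GPU box can be sketched with a plain TCP socket; this is just a toy sketch, not my actual setup, and transcribe() is a stand-in for the real whisper decoding plus answer step:

```python
# Minimal sketch of the Pi <-> GPU-box round trip over a plain TCP socket.
# transcribe() is a placeholder for whisper + answer lookup; the loopback
# host and ephemeral port are for the demo only.
import socket
import threading

def transcribe(audio: bytes) -> str:
    # Stand-in for whisper decoding + answer-getting on the GPU machine.
    return f"got {len(audio)} bytes of audio"

# Server side (the GPU box): accept one clip, reply with text.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    with conn:
        chunks = []
        while chunk := conn.recv(65536):
            chunks.append(chunk)  # read until the client shuts down its side
        conn.sendall(transcribe(b"".join(chunks)).encode())

t = threading.Thread(target=serve_once)
t.start()

# Client side (the Raspberry): send a clip, wait for the answer.
with socket.socket() as cli:
    cli.connect(("127.0.0.1", port))
    cli.sendall(b"\x00" * 3200)   # ~0.1 s of 16 kHz 16-bit mono silence
    cli.shutdown(socket.SHUT_WR)  # signal end of clip
    reply = cli.recv(65536).decode()
t.join()
srv.close()
print(reply)
```

In practice the reply would be synthesized to audio (coqui-ai/TTS) before being played back on the Pi.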
A long while ago, I wrote a little tutorial[0] on quantizing a speech-commands network for the Raspberry. I used that to control lights directly and also for wake-word detection.
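The core trick is in the spirit of this toy sketch; the architecture here is made up for illustration, not the one from the tutorial:

```python
# Toy sketch of post-training dynamic quantization in PyTorch.
# The model is a hypothetical stand-in for a small speech-commands
# classifier (40 input features -> 12 command labels).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 12))

# Dynamic quantization stores the Linear weights as int8, which shrinks
# the model and speeds up CPU inference on the Pi.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(1, 40))
print(out.shape)
```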
More recently, I found that I can just use more classic VAD (voice activity detection), because my use cases typically don't suffer if I turn the microphone on and off. My main goal is to avoid getting out the mobile phone for information. Gating the microphone also keeps the processing down when I turn on the radio...
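A classic VAD along these lines can be as simple as thresholding per-frame RMS energy; a minimal sketch (assumed approach, not the exact code I run):

```python
# Energy-based VAD sketch: flag each 30 ms frame as speech/non-speech
# by comparing its RMS energy against a fixed threshold.
import numpy as np

def vad_frames(samples: np.ndarray, rate: int = 16000,
               frame_ms: int = 30, threshold: float = 0.02) -> list[bool]:
    """Return one speech/non-speech flag per frame."""
    frame_len = rate * frame_ms // 1000
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1))
    return (rms > threshold).tolist()

# Half a second of silence followed by half a second of a 440 Hz tone:
t = np.linspace(0, 0.5, 8000, endpoint=False)
sig = np.concatenate([np.zeros(8000), 0.3 * np.sin(2 * np.pi * 440 * t)])
flags = vad_frames(sig)
print(flags)  # silent frames come out False, tone frames True
```

Real setups usually add hangover/smoothing so the mic doesn't flap between frames, but the idea is the same.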
Not as high-end as your solution, but nice enough for my purposes.