Nice. Siri is completely unusable with an accuracy of less than 10%. I'm guessing Whisper on CPU is probably the same, so I wouldn't risk wasting time on trying the inference server if it only runs on CPU, but once that runs on GPU it would be cool to try this out.
GPU (currently CUDA only) is the primary target for our inference server implementation. It "runs" on CPU, but our goal is to enable an ecosystem that is competitive with Alexa in every possible way, and even with the amazing work of whisper.cpp and other efforts, CPU inference just isn't there (yet).
We're aware that's controversial and not really applicable to many home users. That's why we want to support any TTS/STT engine on any hardware supported by Home Assistant (or elsewhere), in addition to on-device local command recognition on the ESP BOX.
But for people such as yourself, and other commercial/power/whatever users, the inference server we're releasing next week (which works with Willow) provides impressive results on anything from a GTX 1060 to an H100 (we've tested and optimized for everything in between).
We use ctranslate2 (like faster-whisper) and some other optimizations for performance improvements and conservative VRAM usage. We can simultaneously load large-v2, medium, and base on a GTX 1060 3GB and handle requests without issue.
Again, it's controversial, but the fact remains that a $100 Tesla P4 from eBay, which idles at 5 watts and has a max TDP of 60 watts, does the following with our inference server implementation:
large-v2, beam 5: 3.8s of speech, inference time 1.1s
medium, beam 1 (suitable for Willow tasks): 3.8s of speech, inference time 588ms
medium, beam 1 (suitable for Willow tasks): 29.2s of speech, inference time 1.6s
An RTX 4090 does 3.8s of speech with large-v2, beam 5 in 140ms, and 29.2s of speech with medium, beam 1 (greedy) in 84ms.
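To put those numbers on one scale, here is a rough sketch that converts each result into a speedup over real time (seconds of speech divided by inference time). The table is just the figures quoted above; the `speedup` helper and the `BENCHMARKS` layout are made up for illustration and are not part of the inference server.

```python
# (hardware, model, beam size, seconds of speech, inference time in seconds)
# Figures are copied from the results quoted above.
BENCHMARKS = [
    ("Tesla P4", "large-v2", 5, 3.8, 1.1),
    ("Tesla P4", "medium", 1, 3.8, 0.588),
    ("Tesla P4", "medium", 1, 29.2, 1.6),
    ("RTX 4090", "large-v2", 5, 3.8, 0.140),
    ("RTX 4090", "medium", 1, 29.2, 0.084),
]


def speedup(speech_s: float, inference_s: float) -> float:
    """How many times faster than real time the transcription ran."""
    return speech_s / inference_s


if __name__ == "__main__":
    for hardware, model, beam, speech_s, inference_s in BENCHMARKS:
        factor = speedup(speech_s, inference_s)
        print(
            f"{hardware:<9} {model:<9} beam {beam}: "
            f"{speech_s:>5.1f}s speech in {inference_s * 1000:>6.0f}ms "
            f"({factor:.1f}x real time)"
        )
```

Longer clips come out with a much higher factor than short ones on the same card, which suggests a fixed per-request cost makes up a large share of the time for short commands.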
> Siri is completely unusable with an accuracy of less than 10%.
That seems unusual. I've been using both for the last few weeks while replacing my Homebridge setup, and Siri has been as accurate as Alexa — good enough that I've decided that I can now leave the Alexa ecosystem. To be more specific, both are (conservatively) 95%+ accurate for my home control scenarios.
I've never tried any voice recognition system that works well. Maybe my accent is too different from typical training data or something. I had a voice recognition program on my computer in 1994 that had about the same accuracy for me as any modern voice recognition system that I have tried.