Even a cursory performance analysis would make this immediately obvious. On-device transcription is not only computationally infeasible, it would require model capabilities far beyond the current SOTA.
Google had (and, as far as I know, still has) significant trouble implementing detection of multiple wake words for precisely this reason.
Accurately transcribing even a couple of words on-device without a major performance penalty (so that it can run in the background at all times) is only just _barely_ becoming available now.
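For a sense of scale, here's a minimal sketch of the kind of always-on loop a wake-word detector runs: a short audio window, cheap features, a tiny classifier, and only on a trigger does the expensive full recognizer get invoked. Everything here is illustrative; `tiny_wakeword_model` and `capture_frame` are hypothetical stand-ins for a small quantized model and a real microphone callback.

```python
# Minimal sketch of an always-on wake-word loop. The model is a hypothetical
# stand-in for the small quantized classifier real systems ship; audio
# capture is simulated with noise so the example is self-contained.
import numpy as np

SAMPLE_RATE = 16_000               # 16 kHz mono is typical for speech models
FRAME_SAMPLES = SAMPLE_RATE // 2   # 0.5 s analysis window
THRESHOLD = 0.9                    # confidence needed to wake the full pipeline

def log_features(frame: np.ndarray, n_bins: int = 40) -> np.ndarray:
    """Crude log-magnitude spectrum binning (stand-in for log-mel features)."""
    spectrum = np.abs(np.fft.rfft(frame))
    bins = np.array_split(spectrum, n_bins)
    return np.log1p(np.array([b.mean() for b in bins]))

def tiny_wakeword_model(features: np.ndarray) -> float:
    """Hypothetical tiny classifier; real detectors use a small quantized
    neural net, not the large model full transcription requires."""
    weights = np.ones_like(features) / features.size   # placeholder weights
    return float(1.0 / (1.0 + np.exp(-(features @ weights - 1.0))))

def capture_frame() -> np.ndarray:
    """Simulated microphone read; replace with a real audio callback."""
    return np.random.randn(FRAME_SAMPLES).astype(np.float32) * 0.01

for _ in range(100):               # in production this would be `while True`
    score = tiny_wakeword_model(log_features(capture_frame()))
    if score > THRESHOLD:
        print("wake word detected -> hand off to full recognizer")
        break
```

The point is the asymmetry: the always-on path has to be nearly free, because it runs on every half-second of audio forever, while the full transcription model only ever runs after a trigger.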
There's this weird narrative I keep seeing that "computers just aren't powerful enough" to do things I remember them already doing on Pentium 1-class machines in the '90s.