before i even bother opening that github: does it work on windows? so far, all of the whisper "clones" run poorly, if at all, on windows. I do have a 3060 and a 1070ti i could use just for whisper on linux, but i have this 3090 on my windows desktop that works "fine" for whisper, TTS, SD, LLM.
Whisper on windows (the openai-whisper package) doesn't have these q8_0 models; it has like 8 models, and i always get an error about triton cores (something about timestamping i guess), which windows doesn't have. I've transcribed >1000 hours of audio with this setup, so i'm used to the workflow.
If you want to stay near the bleeding edge with this stuff, you probably want to be on some kind of linux (or lacking that, Mac). Windows is where stuff just trickles down to eventually.
This is my favorite kind of software, to write and to use. I am reminded of Upton Sinclair's evergreen quote:
> It is difficult to get a man to understand something, when his salary depends upon his not understanding it!
in the special case where the thing to be understood is "your app doesn't need to be a Big Fucking Deal". Maybe it pleases some users to wrap this in layers of additional abstraction and chrome and clicky buttons and storefronts, but in the end the functionality is already there with a couple of FOSS projects glued together in a bash script.
I used to think the likes of Suckless were brutalist zealots, but more and more I think they (and the Unix patriarchs) were right and the path to enlightenment is expressed in plain text.
On key press, start recording microphone to /tmp/dictate.mp3:
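The command for this step did not survive in the text above, so here is a minimal sketch of what a key-press handler could look like. It assumes ffmpeg with PulseAudio input as the recorder and a PID file at /tmp/dictate.pid; neither is confirmed by the post, and only the /tmp/dictate.mp3 path comes from it.

```shell
#!/bin/sh
# Key-press half of a hypothetical dictate.sh (a sketch, not the original script).

AUDIO=/tmp/dictate.mp3      # path named in the post
PIDFILE=/tmp/dictate.pid    # assumed: where the recorder's PID is remembered

start_recording() {
  # Assumed recorder: ffmpeg reading the default PulseAudio source.
  # 16 kHz mono, capped at 10 minutes; -y overwrites the previous take,
  # -nostdin keeps the background process from reading the terminal.
  ffmpeg -nostdin -loglevel quiet -y -f pulse -i default \
    -ac 1 -ar 16000 -t 600 "$AUDIO" &
  # Save the background job's PID so the key-release handler can stop it.
  echo "$!" > "$PIDFILE"
}

# Record only when called as: dictate.sh press
if [ "${1:-}" = "press" ]; then
  start_recording
fi
```

Writing the PID to a file is one way to let a separate key-release invocation find and stop the recorder; any other recorder that can run in the background would fit the same shape.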
On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout:
I keep these in a dictate.sh script and bind to press/release on a single key. A programmable keyboard helps here. I use https://git.sr.ht/%7Egeb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes. This gives a very functional push-to-talk setup.
I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO and a Vulkan build is very fast on my 6600XT. It takes about 1s for an average sentence to appear after I release the hotkey.
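The key-release command is also missing from the text above. As a rough sketch under stated assumptions, the release half (stop the recorder, transcribe, trim, hand the text to dotool) might look like this. The PID file, the model path, and the helper names `trim`, `stop_recording`, `transcribe` and `type_text` are all hypothetical; `-m`, `-f`, `-np` and `-nt` are whisper-cli options, and `type <text>` is a dotool action read from stdin.

```shell
#!/bin/sh
# Key-release half of a hypothetical dictate.sh (a sketch, not the original script).

AUDIO=/tmp/dictate.mp3
PIDFILE=/tmp/dictate.pid                            # assumed PID file
MODEL="$HOME/models/ggml-large-v3-turbo-q8_0.bin"   # assumed model location

# Collapse runs of whitespace (including newlines) into single spaces,
# then strip the one leading and one trailing space that may remain.
trim() {
  tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}

stop_recording() {
  pid=$(cat "$PIDFILE") || return 1
  kill "$pid" 2>/dev/null
  # Wait until the recorder has exited so the audio file is complete.
  while kill -0 "$pid" 2>/dev/null; do sleep 0.1; done
}

transcribe() {
  # -np suppresses log output, -nt suppresses timestamps.
  whisper-cli -m "$MODEL" -f "$AUDIO" -np -nt | trim
}

type_text() {
  # dotool reads actions on stdin; "type" emits the text as keystrokes.
  printf 'type %s\n' "$1" | dotool
}

# Transcribe and type only when called as: dictate.sh release
if [ "${1:-}" = "release" ]; then
  stop_recording && type_text "$(transcribe)"
fi
```

Binding `dictate.sh press` to key-down and `dictate.sh release` to key-up of the same key would give the push-to-talk behaviour described; how that binding is done depends on the compositor or keyboard firmware.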
I'm keeping an eye on the NVIDIA models; hopefully they'll work on ggml soon too. See e.g. https://github.com/ggml-org/whisper.cpp/issues/3118.