I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
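For a sense of what the transport looks like, here's a minimal client-side sketch of the audio chunk streaming idea: it paces raw 16-bit PCM chunks from a WAV file over a WebSocket. The endpoint path, chunk size, and framing are illustrative assumptions, not the exact protocol the repo uses.

    import asyncio
    import wave

    import websockets

    CHUNK_MS = 30  # assumed chunk duration

    async def stream_wav(path: str, uri: str = "ws://localhost:8000/ws"):  # hypothetical endpoint
        with wave.open(path, "rb") as wav:
            frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000
            async with websockets.connect(uri) as ws:
                while chunk := wav.readframes(frames_per_chunk):
                    await ws.send(chunk)                  # raw 16-bit PCM bytes
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace it like a live microphone
                    # in the real system, synthesized TTS audio streams back over the same socket

    asyncio.run(stream_wav("sample.wav"))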
Key aspects:
- Designed for local LLMs (Ollama primarily; an OpenAI connector is included).
- Interruptible conversation.
- Smart turn detection to avoid cutting the user off mid-thought.
- Dockerized setup available for easier dependency management.
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
Neat! I'm already using openwebui/ollama with a 7900 xtx but the STT and TTS parts don't seem to work with it yet:
[2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.
Basically anything llama.cpp-based (Vulkan backend) should work out of the box without much fuss (LM Studio, Ollama, etc.).
The HIP backend can have a big prefill speed boost on some architectures (high-end RDNA3 for example). For everything else, I keep notes here: https://llm-tracker.info/howto/AMD-GPUs
Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models
Yeah I really dislike the whisperiness of this voice "Lasinya". It sounds too much like an erotic phone service. I wonder if there's any alternative voice? I don't see Lasinya even mentioned in the public coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.
I tried to select kokoro in the python module but it says in the logs that only coqui is available. I do have to say the coqui models sound really good, it's just the type of voice that puts me off.
The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.
PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!
Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy the speakers_xtts.pth from Lasinya, it's the same for every voice)
Then change the specific_model="Lasinya" line in audio_module.py to specific_model="some_folder_name".
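Here are those steps as a small script, in case it helps; the folder name "my_voice" and the source path are placeholders:

    # Copy a finetuned XTTS voice into the app's models folder (paths are placeholders).
    from pathlib import Path
    import shutil

    src = Path("/path/to/your_xtts_voice")   # contains config.json, model.pth, vocab.json
    dst = Path("./models/my_voice")
    dst.mkdir(parents=True, exist_ok=True)

    for name in ("config.json", "model.pth", "vocab.json"):
        shutil.copy(src / name, dst / name)

    # speakers_xtts.pth is the same for every voice, so reuse the one from Lasinya
    shutil.copy("./models/Lasinya/speakers_xtts.pth", dst / "speakers_xtts.pth")

    # then in audio_module.py: specific_model="Lasinya" -> specific_model="my_voice"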
If you change TTS_START_ENGINE to "kokoro" in server.py it's supposed to work. What happens then? Can you post the log message?
I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for Coqui? I'm having some trouble finding them. From the demo page, it looks like the idea with that engine is that you clone someone else's voice, because I don't see any voices listed. I've never come across it before.
And yes, I've switched to Kokoro now. I thought it was the default already, but then I saw there were three lines configuring the same thing. So that's working. Kokoro isn't quite as good as Coqui, though, which is why I'm asking about it. I also used Kokoro in Open WebUI and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case), and it looks like Coqui might be able to do that.
Didn't find many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my Hugging Face; the quality is medium.
Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English corpus only.
Would you say you are using the best-in-class speech-to-text libs at the moment? I feel like this space is moving fast, because the last time I headed down this track, I was sure whisper-cpp was the best.
I'm not sure, tbh. Whisper has been king for a long time now, especially with the CTranslate2 implementation from faster_whisper. Nvidia open-sourced Parakeet TDT today and it instantly went to no. 1 on the Open ASR leaderboard. I will have to evaluate these latest models; they look strong.
Tried that one. Quality is great, but generations sometimes fail and it's rather slow. It also needs ~13 GB of VRAM, so it's not my first choice for voice agents, tbh.
I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.
To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable that it would introduce other issues into the synthesis. In a perfect world the model would use all the "free" parameters currently spent on other languages for better synthesis of the single trained language. That might be true to a certain degree, but it's not exactly how AI parameter scaling works.
Maybe possible; I did not look into that much for Coqui XTTS. What I do know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
A couple questions:
- any thoughts about wake word engines, so there's something that listens without consuming resources all the time? The landscape for open solutions doesn't seem good
- any plans to allow using external services for STT/TTS for people who don't have a 4090 handy (at the cost of privacy, via SaaS providers)?
FWIW, wake words are a stopgap; if we want a Star Trek-level voice interface, where the computer responds only when you actually mean to call it, as opposed to treating the wake word as a normal word in conversation, the computer needs to be constantly listening.
A good analogy here is to think of the computer (assistant) as another person in the room, busy with their own stuff but paying attention to the conversations happening around them, in case someone suddenly requests their assistance.
This, of course, could be handled by a more lightweight LLM running locally and listening for explicit mentions/addressing the computer/assistant, as opposed to some context-free wake words.
Home Assistant is much nearer to this than other solutions.
You have a wake word, but it can also speak to you based on automations. You come home and it could tell you that you're out of milk and, with a holiday coming up, you should probably go shopping.
If the AI is local, it doesn't need to be on an internet-connected device. At that point, malware and bugs in that stack don't add extra privacy risks*, but malware and bugs in all your other devices with microphones etc. remain a risk, even if the LLM is absolutely perfect by whatever standard that means for you.
* unless you put the AI on a robot body, but that's then your own new and exciting problem.
I built something almost identical last week (closed source, not my IP) and I recommend: NeMo Parakeet (even faster than insanely_fast_whisper), F5-TTS (fast, with very good voice cloning quality), and Qwen3-4B for the LLM (amazing quality).
This looks great, I will definitely have a look. I'm just wondering if you've tested FastRTC from Hugging Face? I haven't, and I'm curious about the speed of this vs FastRTC vs Pipecat.
Yes, I tested it. I'm not quite sure what they created there. It adds some noticeable latency compared to using raw WebSockets. IMHO it's not supposed to, but it did nevertheless in my tests.
LLM and TTS latency gets measured and logged at startup. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually between 3 and 10 words), then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en is negligible at under 5 ms, and VAD likewise. The turn detection model adds around another 20 ms.
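Summed up as a rough back-of-the-envelope (my rounding; the LLM share varies a lot with fragment length and hardware):

    # Approximate per-turn latency budget from the numbers above.
    pipeline_ms = {
        "vad": 5,                   # WebRTC + Silero check
        "stt_base_en": 5,           # whisper base.en via CTranslate2
        "turn_detection": 20,       # sentence-finished classifier
        "llm_first_fragment": 220,  # first synthesizable sentence fragment
        "tts_first_chunk": 80,      # time until first audio chunk
    }
    print(sum(pipeline_ms.values()), "ms")  # ~330 ms; buffering and network round trips make up the rest of the ~500 ms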
I have zero clue if and how fast this runs on a Mac.
With the current 24B LLM it's 24 GB. I have no clue how far down you can go on GPU VRAM by using smaller models; you can set the model in server.py.
Quite sure 16 GB will work, but at some point it will probably fail.
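As a rough sanity check on why 16 GB gets tight (my estimate, not a measurement from the repo): Q4_K_M stores roughly 4.8 bits per weight, so the 24B weights alone land around 14-15 GB before the KV cache and the STT/TTS models.

    # Back-of-the-envelope VRAM estimate; bits-per-weight is an assumption for Q4_K_M.
    params = 24e9
    bits_per_weight = 4.8
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"LLM weights alone: ~{weights_gb:.1f} GB")  # ~14.4 GB
    # KV cache, Coqui XTTS, whisper base.en and the VAD/turn-detection models take the rest.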
All local models:
- VAD: webrtcvad (fast first check) followed by Silero VAD (higher-compute verification)
- Transcription: base.en Whisper (CTranslate2)
- Turn detection: KoljaB/SentenceFinishedClassification (self-trained BERT model)
- LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
- TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (the latter is slower)
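To illustrate the two-stage VAD idea (a cheap webrtcvad gate per frame, then Silero as the heavier confirmation), here's a minimal sketch; the thresholds, frame size, and torch.hub loading are my assumptions, not the repo's actual code:

    import numpy as np
    import torch
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

    fast_vad = webrtcvad.Vad(2)                        # aggressiveness 0-3
    silero, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps = utils[0]

    def is_speech(pcm16: bytes) -> bool:
        """pcm16: mono 16-bit PCM at 16 kHz."""
        frames = [pcm16[i:i + FRAME_BYTES]
                  for i in range(0, len(pcm16) - FRAME_BYTES + 1, FRAME_BYTES)]
        voiced = sum(fast_vad.is_speech(f, SAMPLE_RATE) for f in frames)
        if voiced < len(frames) // 2:                  # fast check: mostly silence, reject early
            return False
        audio = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32) / 32768.0
        return bool(get_speech_timestamps(torch.from_numpy(audio), silero,
                                          sampling_rate=SAMPLE_RATE))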
That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I feel they aren't planning to give us their top model soon.
Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The code is here: https://github.com/KoljaB/RealtimeVoiceChat