I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.
The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
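For a sense of what the transport looks like, here's a minimal client-side sketch of the audio chunk streaming idea: it paces raw 16-bit PCM chunks from a WAV file over a WebSocket. The endpoint path, chunk size, and framing are illustrative assumptions, not the exact protocol the repo uses.

    import asyncio
    import wave

    import websockets

    CHUNK_MS = 30  # assumed chunk duration

    async def stream_wav(path: str, uri: str = "ws://localhost:8000/ws"):  # hypothetical endpoint
        with wave.open(path, "rb") as wav:
            frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000
            async with websockets.connect(uri) as ws:
                while chunk := wav.readframes(frames_per_chunk):
                    await ws.send(chunk)                  # raw 16-bit PCM bytes
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace it like a live microphone
                    # in the real system, synthesized TTS audio streams back over the same socket

    asyncio.run(stream_wav("sample.wav"))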
Key aspects:
- Designed for local LLMs (Ollama primarily; an OpenAI connector is included).
- Interruptible conversation.
- Smart turn detection to avoid cutting the user off mid-thought.
- Dockerized setup available for easier dependency management.
It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.
Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.
Neat! I'm already using openwebui/ollama with a 7900 xtx but the STT and TTS parts don't seem to work with it yet:
[2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.
Basically anything llama.cpp-based (Vulkan backend) should work out of the box without much fuss (LM Studio, Ollama, etc.).
The HIP backend can have a big prefill speed boost on some architectures (high-end RDNA3 for example). For everything else, I keep notes here: https://llm-tracker.info/howto/AMD-GPUs
Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models
Yeah I really dislike the whisperiness of this voice "Lasinya". It sounds too much like an erotic phone service. I wonder if there's any alternative voice? I don't see Lasinya even mentioned in the public coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.
I tried to select kokoro in the python module but it says in the logs that only coqui is available. I do have to say the coqui models sound really good, it's just the type of voice that puts me off.
The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.
PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!
Create a subfolder in the app container: ./models/some_folder_name
Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy the speakers_xtts.pth from Lasinya, it's the same for every voice)
Then change the specific_model="Lasinya" line in audio_module.py to specific_model="some_folder_name".
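Here are those steps as a small script, in case it helps; the folder name "my_voice" and the source path are placeholders:

    # Copy a finetuned XTTS voice into the app's models folder (paths are placeholders).
    from pathlib import Path
    import shutil

    src = Path("/path/to/your_xtts_voice")   # contains config.json, model.pth, vocab.json
    dst = Path("./models/my_voice")
    dst.mkdir(parents=True, exist_ok=True)

    for name in ("config.json", "model.pth", "vocab.json"):
        shutil.copy(src / name, dst / name)

    # speakers_xtts.pth is the same for every voice, so reuse the one from Lasinya
    shutil.copy("./models/Lasinya/speakers_xtts.pth", dst / "speakers_xtts.pth")

    # then in audio_module.py: specific_model="Lasinya" -> specific_model="my_voice"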
If you change TTS_START_ENGINE to "kokoro" in server.py it's supposed to work. What happens then? Can you post the log message?
I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for Coqui? I'm having some trouble finding them. From the demo page, it looks like the idea with that engine is that you clone someone else's voice, because I don't see any voices listed. I've never come across it before.
And yes, I've switched to Kokoro now. I thought it was the default already, but then I saw there were three lines configuring the same thing. So that's working. Kokoro isn't quite as good as Coqui, though, which is why I'm asking about it. I also used Kokoro in Open WebUI and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case), and it looks like Coqui might be able to do that.
Didn't find many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my Hugging Face; the quality is medium.
Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English corpus only.
Would you say you are using the best-in-class speech-to-text libs at the moment? I feel like this space is moving fast, because the last time I headed down this track, I was sure whisper-cpp was the best.
I'm not sure, tbh. Whisper has been king for a long time now, especially with the CTranslate2 implementation from faster_whisper. Nvidia open-sourced Parakeet TDT today and it instantly went to no. 1 on the Open ASR leaderboard. I will have to evaluate these latest models; they look strong.
Tried that one. Quality is great, but generations sometimes fail and it's rather slow. It also needs ~13 GB of VRAM, so it's not my first choice for voice agents, tbh.
I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.
To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable that it would introduce other issues into the synthesis. In a perfect world the model would use all the "free" parameters currently spent on other languages for better synthesis of the single trained language. That might be true to a certain degree, but it's not exactly how AI parameter scaling works.
Maybe possible; I did not look into that much for Coqui XTTS. What I do know is that the quantized versions of Orpheus sound noticeably worse. I feel audio models are quite sensitive to quantization.
A couple questions:
- any thoughts about wake word engines, so there's something that listens without consuming resources all the time? The landscape for open solutions doesn't seem good
- any plans to allow using external services for STT/TTS for people who don't have a 4090 handy (at the cost of privacy, via SaaS providers)?
FWIW, wake words are a stopgap; if we want a Star Trek-level voice interface, where the computer responds only when you actually mean to call it, as opposed to treating the wake word as a normal word in conversation, the computer needs to be constantly listening.
A good analogy here is to think of the computer (assistant) as another person in the room, busy with their own stuff but paying attention to the conversations happening around them, in case someone suddenly requests their assistance.
This, of course, could be handled by a more lightweight LLM running locally and listening for explicit mentions/addressing the computer/assistant, as opposed to some context-free wake words.
Home Assistant is much nearer to this than other solutions.
You have a wake word, but it can also speak to you based on automations. You come home and it could tell you that you're out of milk and, with a holiday coming up, you should probably go shopping.
If the AI is local, it doesn't need to be on an internet-connected device. At that point, malware and bugs in that stack don't add extra privacy risks*, but malware and bugs in all your other devices with microphones etc. remain a risk, even if the LLM is absolutely perfect by whatever standard that means for you.
* unless you put the AI on a robot body, but that's then your own new and exciting problem.
I built something almost identical last week (closed source, not my IP) and I recommend: NeMo Parakeet (even faster than insanely_fast_whisper), F5-TTS (fast, with very good voice cloning quality), and Qwen3-4B for the LLM (amazing quality).
This looks great, I will definitely have a look. I'm just wondering if you've tested FastRTC from Hugging Face? I haven't, and I'm curious about the speed of this vs FastRTC vs Pipecat.
Yes, I tested it. I'm not quite sure what they created there. It adds some noticeable latency compared to using raw WebSockets. IMHO it's not supposed to, but it did nevertheless in my tests.
LLM and TTS latency gets measured and logged at startup. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually between 3 and 10 words), then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en is negligible at under 5 ms, and VAD likewise. The turn detection model adds around another 20 ms.
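Summed up as a rough back-of-the-envelope (my rounding; the LLM share varies a lot with fragment length and hardware):

    # Approximate per-turn latency budget from the numbers above.
    pipeline_ms = {
        "vad": 5,                   # WebRTC + Silero check
        "stt_base_en": 5,           # whisper base.en via CTranslate2
        "turn_detection": 20,       # sentence-finished classifier
        "llm_first_fragment": 220,  # first synthesizable sentence fragment
        "tts_first_chunk": 80,      # time until first audio chunk
    }
    print(sum(pipeline_ms.values()), "ms")  # ~330 ms; buffering and network round trips make up the rest of the ~500 ms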
I have zero clue if and how fast this runs on a Mac.
With the current 24B LLM it's 24 GB. I have no clue how far down you can go on GPU VRAM by using smaller models; you can set the model in server.py.
Quite sure 16 GB will work, but at some point it will probably fail.
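As a rough sanity check on why 16 GB gets tight (my estimate, not a measurement from the repo): Q4_K_M stores roughly 4.8 bits per weight, so the 24B weights alone land around 14-15 GB before the KV cache and the STT/TTS models.

    # Back-of-the-envelope VRAM estimate; bits-per-weight is an assumption for Q4_K_M.
    params = 24e9
    bits_per_weight = 4.8
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"LLM weights alone: ~{weights_gb:.1f} GB")  # ~14.4 GB
    # KV cache, Coqui XTTS, whisper base.en and the VAD/turn-detection models take the rest.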
All local models:
- VAD: webrtcvad (fast first check) followed by Silero VAD (higher-compute verification)
- Transcription: base.en Whisper (CTranslate2)
- Turn detection: KoljaB/SentenceFinishedClassification (self-trained BERT model)
- LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
- TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (the latter is slower)
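To illustrate the two-stage VAD idea (a cheap webrtcvad gate per frame, then Silero as the heavier confirmation), here's a minimal sketch; the thresholds, frame size, and torch.hub loading are my assumptions, not the repo's actual code:

    import numpy as np
    import torch
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

    fast_vad = webrtcvad.Vad(2)                        # aggressiveness 0-3
    silero, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps = utils[0]

    def is_speech(pcm16: bytes) -> bool:
        """pcm16: mono 16-bit PCM at 16 kHz."""
        frames = [pcm16[i:i + FRAME_BYTES]
                  for i in range(0, len(pcm16) - FRAME_BYTES + 1, FRAME_BYTES)]
        voiced = sum(fast_vad.is_speech(f, SAMPLE_RATE) for f in frames)
        if voiced < len(frames) // 2:                  # fast check: mostly silence, reject early
            return False
        audio = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32) / 32768.0
        return bool(get_speech_timestamps(torch.from_numpy(audio), silero,
                                          sampling_rate=SAMPLE_RATE))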
That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I feel they aren't planning to give us their top model soon.
Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8
The code is here: https://github.com/KoljaB/RealtimeVoiceChat