As I switched from macOS to Arch Linux, I was looking for a good text-to-speech option, cross-platform and open-source. (I use it to get a first overview of larger, boring papers I'm reviewing.)
TTS is a very active field of research. The things you see here — GlowTTS, PortaSpeech, diffusion models, etc. — are very new, published in 2020 and 2021. Every couple of months there is a new model, which needs its own implementation.
You will find a lot of code, but it is research code: you need to understand the science behind it, it takes some effort to get running, and it will have lots of hard edges.
When you ask for a usable, cross-platform, open-source solution, that is a different matter. That said, Larynx actually looks nice and usable, i.e. intended for the end user.
The hard part of TTS systems is also getting all the edge cases right: special symbols, abbreviations, unusual names, etc. Most research code does not care much about this; it is focused on research, i.e. developing new models.
A manual lexicon with lots of custom rules is often used to cover these cases, but maintaining it is a lot of work and also depends on the use case. Alternatively, you could train another model for this part, which will generally work better than a simplistic lexicon but will still fail on many edge cases.
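To make the lexicon idea concrete, here is a minimal sketch of rule-based text normalization, the kind of front-end step a TTS pipeline runs before synthesis. The lexicon entries and the `normalize` function are hypothetical illustrations, not code from any of the projects mentioned:

```python
import re

# Hypothetical mini-lexicon: maps abbreviations and symbols to spoken forms.
# A real TTS front-end would have thousands of entries plus context rules.
LEXICON = {
    "dr.": "doctor",
    "etc.": "et cetera",
    "%": "percent",
    "&": "and",
}

def normalize(text: str) -> str:
    """Expand known abbreviations and symbols; unknown tokens pass through."""
    # Tokenize into word-like chunks (keeping '.' and apostrophes attached),
    # known symbols, and any leftover punctuation.
    tokens = re.findall(r"[\w.']+|[%&]|[^\w\s]", text.lower())
    return " ".join(LEXICON.get(tok, tok) for tok in tokens)

print(normalize("Dr. Smith earned 100% & more"))
# → doctor smith earned 100 percent and more
```

The weakness the comment describes shows up immediately: any abbreviation, name, or symbol missing from the table falls through unnormalized, which is why maintaining such a lexicon is endless work and why learned normalization models are an attractive, if still imperfect, alternative.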
Larynx author here, glad you're enjoying it! I'm implementing a new TTS model (VITS) which sounds better and is about 2x faster in practice. Should be ready in the next month or so :)
It sounds really good, but from what I've heard nothing really comes close to the performance of https://15.ai/.
It still seems like the best TTS around or am I wrong?
> Who/what is behind this project?
>
> I've had a couple people and institutions provide a bit of help along the way, but I am — and always have been — one person.
Never underestimate the intellect of a single person. Every corporate team, every open source maintainer, every researcher more or less* has access to the same prior knowledge to build something new.
* some have more money to burn, some are surrounded by mentors/helpers
This one never sounds fantastic to me, just "very good". It does have a very cool feature where you can input a phrase with one "feeling" to change the main text to have that "feeling" as well. It is a bit confusing to use.
I actually think Microsoft's TTS is the best and has many voices. IBM's Watson branded voice was a close second. But as near as I can tell, there's no way to run these locally, only through their cloud interface.
I'd say this PortaSpeech one was the best local TTS I've heard, but it annoyingly trips on words like "shouldn't".
Don't know about the quality, but the welcome banner is a real assholey move. Not everyone is out there to steal someone's work. Btw, try rejecting it for fun and then going back to the same website.
As someone whose mind was blown 30 years ago by the robotic TTS a Spectrum computer could produce, I find this level of quality in an MIT-licensed product very encouraging — a sign that not only dark cyberpunk-style events are happening in the technology landscape.
Not saying this one is poor, but I recently surveyed all of the text-to-speech models on Hugging Face, and they were not usable. There is a massive gulf in accuracy between the main cloud providers' text-to-speech APIs and Hugging Face.
For now we only host models backed by external libraries on the HF Hub, but we've started working on native TTS for the HF Transformers library, which will hopefully be on par with the paid alternatives! :)
By any chance, are you aware of any good surveys comparing the open models to the cloud providers for other NLP tasks like speech recognition or POS tagging?
Pretty impressed by Larynx so far: https://github.com/rhasspy/larynx
It also seems quite hackable.
Will also play with NATSpeech.