NATSpeech: High Quality Text-to-Speech Implementation with HuggingFace Demo (github.com/natspeech)
88 points by smusamashah on Feb 17, 2022 | 25 comments



On a Tangent:

As I switched from macOS to Arch Linux, I was looking for a good text-to-speech option that is cross-platform and open-source. (I use it to get a first overview of larger, boring papers I'm reviewing.)

pretty impressed by larynx so far: https://github.com/rhasspy/larynx

It also seems quite hackable.

Will also play with NATSpeech.


TTS is a very active field of research. The things you see here (GlowTTS, PortaSpeech, diffusion models, etc.) are very new, published in 2020 and 2021. Every couple of months there is a new model, and each needs its own implementation.

You will find a lot of code, but it is research code: you need to understand the science behind it, it takes some effort to get running, and it will have lots of rough edges.

Asking for a usable, cross-platform, open-source solution is a different matter. That said, Larynx actually looks nice and usable, i.e. intended for the end user.

The other hard part of TTS systems is getting all the edge cases right: special symbols, abbreviations, unusual names, etc. Most research code does not care much about this; it is focused on research, such as developing new models.

A manual lexicon with lots of custom rules is often used to cover these cases, but maintaining it is a lot of work and also depends on the use case. Alternatively, you could train another model for this part, which will generally work better than a simplistic lexicon but will still fail on many edge cases. (A minimal sketch of the lexicon-plus-rules approach is below.)
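To make this concrete, here is a minimal sketch of such a frontend. The lexicon entries and rules are purely hypothetical, not taken from any particular project:

    import re

    # Hypothetical pronunciation lexicon: maps written forms to spoken forms.
    LEXICON = {
        "Dr.": "doctor",
        "etc.": "et cetera",
        "TTS": "text to speech",
    }

    def expand_numbers(text: str) -> str:
        """Very naive digit expansion; real systems must handle
        ordinals, years, currency, phone numbers, and more."""
        digits = ["zero", "one", "two", "three", "four",
                  "five", "six", "seven", "eight", "nine"]
        return re.sub(r"\d", lambda m: digits[int(m.group())] + " ", text)

    def normalize(text: str) -> str:
        # Custom symbol rules come first...
        text = text.replace("$", " dollars ")
        text = expand_numbers(text)
        # ...then the lexicon covers whatever the rules miss.
        words = [LEXICON.get(w, w) for w in text.split()]
        return " ".join(words)

    print(normalize("Dr. Smith paid $5 for TTS etc."))
    # -> "doctor Smith paid dollars five for text to speech et cetera"

Even this tiny example shows the failure mode: "$5" comes out as "dollars five" instead of "five dollars", and every new edge case needs yet another rule.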


Due to licensing constraints, I wrote my own text frontend (MIT): https://github.com/rhasspy/gruut
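For reference, gruut's README shows usage roughly like this (quoted from memory, so treat the exact API as an assumption):

    from gruut import sentences

    text = 'He wound it around the wound, saying "I read it was $10 to read."'

    for sent in sentences(text, lang="en-us"):
        for word in sent:
            if word.phonemes:
                # Prints each word together with its phoneme sequence.
                print(word.text, *word.phonemes)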

I plan to release a version of Larynx that uses eSpeak for phonemization, since it covers many of those corner cases.
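That version isn't out yet, but for anyone curious what eSpeak-based phonemization looks like, the separate phonemizer package wraps espeak-ng. This is just an illustration, not necessarily what Larynx will use:

    # pip install phonemizer  (also needs the espeak-ng system package)
    from phonemizer import phonemize

    # eSpeak's frontend already handles many abbreviations and symbols.
    ipa = phonemize(
        "Dr. Smith couldn't read it.",
        language="en-us",
        backend="espeak",
    )
    print(ipa)  # IPA phoneme string produced by espeak-ng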


Larynx author here, glad you're enjoying it! I'm implementing a new TTS model (VITS) which sounds better and is about 2x faster in practice. Should be ready in the next month or so :)


This sounds awesome. Is there a mailing list I can join to get a notification when it's released?


I usually announce things on the Rhasspy voice assistant forums: https://community.rhasspy.org/

I also have a Twitter account (@rhasspy) used almost exclusively for announcements.


Impressive work, thanks a lot :)


It sounds really good, but from what I've heard, nothing really comes close to the performance of https://15.ai/. It still seems like the best TTS around, or am I wrong?


Holy cow, that’s impressive.

Also, from the FAQ:

> Who/what is behind this project?
>
> I've had a couple people and institutions provide a bit of help along the way, but I am — and always have been — one person.

Never underestimate the intellect of a single person. Every corporate team, every open source maintainer, every researcher more or less* has access to the same prior knowledge to build something new.

* some have more money to burn, some are surrounded by mentors/helpers


This one never sounds fantastic to me, just "very good". It does have a very cool feature where you can input a phrase with one "feeling" to change the main text to have that "feeling" as well. It is a bit confusing to use.

I actually think Microsoft's TTS is the best and has many voices. IBM's Watson branded voice was a close second. But as near as I can tell, there's no way to run these locally, only through their cloud interface.

I'd say this PortaSpeech one is the best local TTS I've heard, but it annoyingly trips on words like "shouldn't".


To me, NATSpeech sounds better. This one sounds worse than Amazon Alexa, at least in my quick test.


Don't know about the quality, but the welcome banner is a real assholey move. Not everyone is out there to steal someone's work. Btw, try rejecting it for fun and then going back to the same website.


As someone whose mind was blown 30 years ago by the robotic TTS a Spectrum computer could produce, I find it very encouraging that this level of quality is available as an MIT-licensed product; it's a sign that not only dark cyberpunk-style developments are happening in the technology landscape.


Ah, yes, the Currah uSpeech, among others :)

https://en.wikipedia.org/wiki/General_Instrument_SP0256


I loved my uSpeech as a kid. I didn't know there were games that supported it.

https://en.wikipedia.org/wiki/Currah#Currah_Microspeech_for_...


My big test with these is dialog with punctuation. Can it do ellipses, exclamations, dashes, etc. convincingly?

The answer is basically a universal "nope", imho. The vocal inflections are contextual and subtle. The problem is hard; that's why it's the test.


It sure sounds like the woman who narrates the 'How It's Made' videos. Is there information on where the training data came from?



Not saying this one is poor, but I recently surveyed all of the text-to-speech models on Hugging Face, and they were not usable. There is a massive gulf in accuracy between the main cloud providers' text-to-speech APIs and Hugging Face.


For now we only host models backed by external libraries on the HF Hub, but we've started working on native TTS for the HF Transformers library, which will hopefully be on par with the paid alternatives! :)
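For context, those externally-backed models can already be pulled from the Hub, e.g. via ESPnet. A sketch from memory, so treat the model tag and exact attributes as assumptions:

    # pip install espnet espnet_model_zoo soundfile
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    # Pull a community VITS model from the Hugging Face Hub.
    tts = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")

    out = tts("Hello from the Hub.")
    sf.write("out.wav", out["wav"].numpy(), tts.fs)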


By any chance, are you aware of any good surveys comparing the open models to the cloud providers for other NLP tasks like speech recognition or POS tagging?


Reminds me of Apple Live Text/Google Lens vs tesseract… Sad.


It can't pronounce 'couldn't' or 'shouldn't' properly.


What model is the state of the art in the TTS domain?


very cool!



