NATSpeech: High Quality Text-to-Speech Implementation with HuggingFace Demo (github.com/natspeech)
88 points by smusamashah on Feb 17, 2022 | 25 comments



On a Tangent:

As I switched from macOS to Arch Linux, I was looking for a good text-to-speech option that is cross-platform and open-source. (I use it to get a first overview of larger, boring papers I'm reviewing.)

pretty impressed by larynx so far: https://github.com/rhasspy/larynx

It also seems quite hackable.

Will also play with NATSpeech.


TTS is a very active field of research. The things you see here (GlowTTS, PortaSpeech, diffusion models, etc.) are very new, published in 2020 and 2021. Every couple of months there is a new model, and each needs its own implementation.

You will find a lot of code, but it is research code: you need to understand the science behind it, it takes some effort to get running, and it will have lots of rough edges.

Asking for a usable, cross-platform, open-source solution is a different matter. That said, Larynx actually looks nice and usable, i.e. intended for the end user.

The other hard part of TTS systems is getting all the edge cases right: special symbols, abbreviations, unusual names, etc. Most research code does not care much about this; it is focused on research, such as developing new models.

A manual lexicon with lots of custom rules is often used to cover these cases, but maintaining it is a lot of work and also depends on the use case. Alternatively, you could train another model for this part, which will generally work better than a simplistic lexicon but will still fail on many edge cases. (A minimal sketch of the lexicon-plus-rules approach is below.)
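To make this concrete, here is a minimal sketch of such a frontend. The lexicon entries and rules are purely hypothetical, not taken from any particular project:

    import re

    # Hypothetical pronunciation lexicon: maps written forms to spoken forms.
    LEXICON = {
        "Dr.": "doctor",
        "etc.": "et cetera",
        "TTS": "text to speech",
    }

    def expand_numbers(text: str) -> str:
        """Very naive digit expansion; real systems must handle
        ordinals, years, currency, phone numbers, and more."""
        digits = ["zero", "one", "two", "three", "four",
                  "five", "six", "seven", "eight", "nine"]
        return re.sub(r"\d", lambda m: digits[int(m.group())] + " ", text)

    def normalize(text: str) -> str:
        # Custom symbol rules come first...
        text = text.replace("$", " dollars ")
        text = expand_numbers(text)
        # ...then the lexicon covers whatever the rules miss.
        words = [LEXICON.get(w, w) for w in text.split()]
        return " ".join(words)

    print(normalize("Dr. Smith paid $5 for TTS etc."))
    # -> "doctor Smith paid dollars five for text to speech et cetera"

Even this tiny example shows the failure mode: "$5" comes out as "dollars five" instead of "five dollars", and every new edge case needs yet another rule.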


Due to licensing constraints, I wrote my own text frontend (MIT): https://github.com/rhasspy/gruut
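For reference, gruut's README shows usage roughly like this (quoted from memory, so treat the exact API as an assumption):

    from gruut import sentences

    text = 'He wound it around the wound, saying "I read it was $10 to read."'

    for sent in sentences(text, lang="en-us"):
        for word in sent:
            if word.phonemes:
                # Prints each word together with its phoneme sequence.
                print(word.text, *word.phonemes)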

I plan to release a version of Larynx that uses eSpeak for phonemization, since it covers many of those corner cases.
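That version isn't out yet, but for anyone curious what eSpeak-based phonemization looks like, the separate phonemizer package wraps espeak-ng. This is just an illustration, not necessarily what Larynx will use:

    # pip install phonemizer  (also needs the espeak-ng system package)
    from phonemizer import phonemize

    # eSpeak's frontend already handles many abbreviations and symbols.
    ipa = phonemize(
        "Dr. Smith couldn't read it.",
        language="en-us",
        backend="espeak",
    )
    print(ipa)  # IPA phoneme string produced by espeak-ng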


Larynx author here, glad you're enjoying it! I'm implementing a new TTS model (VITS) which sounds better and is about 2x faster in practice. Should be ready in the next month or so :)


This sounds awesome. Is there a mailing list I can join to get a notification when it's released?


I usually announce things on the Rhasspy voice assistant forums: https://community.rhasspy.org/

I also have a Twitter account (@rhasspy) used almost exclusively for announcements.


Impressive work, thanks a lot :)


It sounds really good, but from what I've heard, nothing really comes close to the performance of https://15.ai/. It still seems like the best TTS around, or am I wrong?


Holy cow, that’s impressive.

Also, from the FAQ:

> Who/what is behind this project?
>
> I've had a couple people and institutions provide a bit of help along the way, but I am — and always have been — one person.

Never underestimate the intellect of a single person. Every corporate team, every open source maintainer, every researcher more or less* has access to the same prior knowledge to build something new.

* some have more money to burn, some are surrounded by mentors/helpers


This one never sounds fantastic to me, just "very good". It does have a very cool feature where you can input a phrase with one "feeling" to change the main text to have that "feeling" as well. It is a bit confusing to use.

I actually think Microsoft's TTS is the best and has many voices. IBM's Watson branded voice was a close second. But as near as I can tell, there's no way to run these locally, only through their cloud interface.

I'd say this PortaSpeech one is the best local TTS I've heard, but it annoyingly trips on words like "shouldn't".


To me, NATSpeech sounds better. This one sounds worse than Amazon Alexa, at least in my quick test.


Don't know about the quality, but the welcome banner is a real assholey move. Not everyone is out there to steal someone's work. Btw, try rejecting it for fun and then going back to the same website.


As someone whose mind was blown 30 years ago by the robotic TTS a Spectrum computer could produce, I find it very encouraging that this level of quality is available as an MIT-licensed product; it's a sign that not only dark cyberpunk-style developments are happening in the technology landscape.


Ah, yes, the Currah uSpeech, among others :)

https://en.wikipedia.org/wiki/General_Instrument_SP0256


I loved my uSpeech as a kid. I didn't know there were games that supported it.

https://en.wikipedia.org/wiki/Currah#Currah_Microspeech_for_...


My big test with these is dialog with punctuation. Can it do ellipses, exclamations, dashes, etc. convincingly?

The answer is basically a universal "nope", imho. The vocal inflections are contextual and subtle. The problem is hard; that's why it's the test.


It sure sounds like the woman who narrates the 'How It's Made' videos. Is there information on where the training data came from?



Not saying this one is poor, but I recently surveyed all of the text-to-speech models on Hugging Face, and they were not usable. There is a massive gulf in accuracy between the main cloud providers' text-to-speech APIs and Hugging Face.


For now we only host models backed by external libraries on the HF Hub, but we've started working on native TTS for the HF Transformers library, which will hopefully be on par with the paid alternatives! :)
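For context, those externally-backed models can already be pulled from the Hub, e.g. via ESPnet. A sketch from memory, so treat the model tag and exact attributes as assumptions:

    # pip install espnet espnet_model_zoo soundfile
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    # Pull a community VITS model from the Hugging Face Hub.
    tts = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")

    out = tts("Hello from the Hub.")
    sf.write("out.wav", out["wav"].numpy(), tts.fs)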


By any chance, are you aware of any good surveys comparing the open models to the cloud providers for other NLP tasks like speech recognition or POS tagging?


Reminds me of Apple Live Text/Google Lens vs tesseract… Sad.


It can't pronounce 'couldn't' or 'shouldn't' properly.


What model is the state of the art in the TTS domain?


very cool!



