to be disturbing and without much added value. (Such as used with games like Jeopardy.)
If used, the application of these tags needs to be meticulous about context, somewhat non-deterministic, and paired with randomized prosody. Repeated usage of the same overstated emotive content is annoying and unnatural (worse than a "flat" presentation) and only serves to underscore the underlying inflexible conversational content.
Agree, there's definitely a risk of uncanny valley in voice assistants around tone. Emotion in particular is an area where I think we won't have systems sophisticated enough to address it for some time, which means it's mostly something that makes things unintentionally worse. A few of these downsides include:
1. tone has a different valence than what the subject matter requires ('Sure thing! The funeral home number is...'), which reminds you of and amplifies negative feelings
2. tone is used to get something from you, which makes you feel manipulated
3. tone is uniformly applied, which through repeat exposure makes the world feel false
Totally agree that randomized prosody would help make assistants more natural; tempo and rhythm don't carry as much emotional weight.
I'm not sure explicit markup is the right way to approach the problem, given its lack of understanding of user context and state of mind, and the likelihood that developers will execute it poorly. It feels like a solution at the wrong level of abstraction... getting in sync with a user's emotions is a pretty high bar.
I agree. All I care about is that the pronunciation is contextually correct, and smooth. Accents are fine, and necessary. But I don't want a non-human simulating human emotions. Now if they aren't simulated, that's another story...
+1. And I know I'm in the minority here, but I prefer it if computers don't use actual human voices at all. I like my computer to sound like a computer. (I wouldn't go so far as to add distracting retro affectations though.)
I'm visually impaired, and I often use a screen reader when browsing the web. Here's my favorite voice to use with a screen reader:
I only run my TTS moderately fast. I have blind friends who run it faster. I have some vision, so I only use a screen reader part of the time, and not for programming.
Also, I think speech synthesizers can do a more consistent job of enunciating at high speed than humans can. I didn't quite understand everything the FedEx speed talker was saying.
Thank you for sharing. I was recently foraging for better TTS and somehow missed these.
These Alexa voices are amazing.
I recently got addicted to podcasts and want to branch into audiobooks. Alas, most narrators' voices drive me nuts. Especially the free content I've found.
I now think I'd quite like the Australian accents for consuming ebooks via TTS. It's both familiar enough and different enough that it somehow bridges my own uncanny valley.
(FWIW, I speak PNW English, allegedly neutral for North Americans.)
Nice. Big fan of Andrew Scott, so I'll try this reading for a while. Thank you.
It just occurred to me that my hangup might be fiction vs non-fiction. Fictional readings might bug me in the same way the cartoons of Garfield and Dilbert bug me; their voices don't align with my imagination.
I've powered through a handful of biographies (e.g. Robert Caro's The Power Broker) and got a lot out of them. I only cringed when the narrator affected a different voice for written quotes.
Whereas each time I tried a Mark Twain audio book, I wanted to kick a puppy.
I used to read a lot to my kid (and other relations). But I really didn't like those books on tape. (Maybe because no one ever read to me, sniff.)
Thanks again. I love this (being uncomfortable). Now I'm gonna push myself to finish some fiction audiobooks. See if I can get into the proper frame of mind.
Speech synthesis has always baffled me. You could run a reasonable (albeit strangely accented) version on 16 MHz Macs without major CPU impact. The code, including sound data, was less than a megabyte.
In order to achieve modest improvements in dictation, we're throwing entire GPU arrays at the problem. What happened in the middle? Was there really no room for improvement until we went full AI?
IIRC, it was _with_ major CPU impact. An 8 MHz machine couldn’t do much else while talking.
Also, the original MacinTalk sounded a lot better if you fed it phonemes instead of text. It didn’t know how to pronounce that many different words, and wasn’t really good at making the right choice when the pronunciation of a word depends on its meaning.
For example, if you gave it the text “Read me”, it always pronounced “Read” in the past tense. That always seemed the wrong bet to me, and I would think the developers had heard that, too, but apparently, fixing it was not that simple.
I also think it didn’t know whether to pronounce “Dr” as “Drive” or “Doctor”, depending on context, or “St” as “Saint” or “Street”, to mention a few examples, and it probably was abysmal when you asked it to speak a phone book, with its zillions of rare names (back in the eighties, that’s an area where AT&T’s speech synthesizers excelled, I’ve been told).
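A toy Python sketch of the kind of context rule this requires (the rules and function name here are made up for illustration, not MacinTalk's actual logic):

```python
import re

def expand_dr(text):
    # "Dr" followed by a capitalized word is probably a title ("Dr Smith") ...
    text = re.sub(r"\bDr\.?(?=\s+[A-Z])", "Doctor", text)
    # ... while "Dr" preceded by a lowercase-ending word is probably a street.
    text = re.sub(r"(?<=[a-z])\s+Dr\.?\b", " Drive", text)
    return text

print(expand_dr("Dr Smith lives on Mulholland Dr."))
# -> "Doctor Smith lives on Mulholland Drive."
```

Rules like these break down quickly (abbreviations, sentence starts, unusual names), which is exactly why the front end so often guessed wrong.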
And that’s just the text-to-phoneme conversion. The art of picking the right intonation and speaking rate is in a whole different ballpark; it requires a kind of sentiment analysis.
Lots of the usual vocoder methods were developed in the 80s, and kinda approximate voice as a collection of linear systems. The neural systems allow nonlinear outputs, and are quickly getting much cheaper. This is common in codecs: you get a computationally expensive jump on quality, followed by years of work making that quality jump computationally cheap.
You can also combine the linear and nonlinear approaches. LPCNet uses a bit of each, for example: a linear LPC prediction, which is then corrected by a neural net.
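To make that split concrete, here's a minimal numpy sketch: a cheap linear predictor plus the residual that a neural model (as in LPCNet) would then learn to generate. The function names and the autocorrelation-method details are just for illustration; it assumes a single mono frame of floats.

```python
import numpy as np

def lpc_coeffs(frame, order=16):
    # Autocorrelation method: solve the normal equations R a = r.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])   # a[k] in x_hat[n] = sum_k a[k]*x[n-1-k]

def lpc_residual(frame, a):
    order = len(a)
    pred = np.zeros_like(frame)
    for n in range(order, len(frame)):
        pred[n] = np.dot(a, frame[n - order:n][::-1])   # linear prediction of sample n
    return frame - pred   # the excitation/residual a neural net would model
```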
The audio generation is split into packets or chunks of audio. There are two general methods of storing/generating these chunks:
1. LPC/formants -- building the audio using the LPC (linear predictive coding) or frequency (formant) parameters and a residual carrier for the difference;
2. OLA (overlapped add) -- combining and overlapping audio split at individual wave (peak to peak) packets.
The different systems (Klatt, MBROLA, etc.) build on these basic techniques. The *OLA systems are more space-intensive, as they need more of the audio data to synthesize the voice, but tend to result in more natural voices.
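For a feel of the OLA side, here's a rough numpy sketch of the overlap-add step behind PSOLA-style concatenation; the function name and the toy grain are made up for illustration:

```python
import numpy as np

def overlap_add(grains, hop):
    # Grains are short, Hann-windowed snippets (ideally cut at pitch marks);
    # the output spacing `hop` controls duration and pitch.
    n = len(grains[0])
    out = np.zeros(hop * (len(grains) - 1) + n)
    for i, g in enumerate(grains):
        out[i * hop:i * hop + n] += g          # overlapping grains sum smoothly
    return out

# e.g. repeat one windowed pitch period with a smaller hop to raise the pitch:
grain = np.hanning(256) * np.sin(2 * np.pi * 220 / 16000 * np.arange(256))
audio = overlap_add([grain] * 50, hop=64)
```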
The other complexity is how the individual phonemes (the building blocks of speech) are stored. Different languages and accents have different sets of phonemes, so more complex languages require more data. Systems like MBROLA store the data as diphones (from the mid-point of the first phoneme to the mid-point of the second), to make it easier to join the phonemes. There are many more diphones than phonemes, although not all combinations occur in a given language and accent.
This data is controlled by different prosodic parameters (e.g. pitch and duration). These parameters can be driven by neural nets, or by techniques like decision trees and probabilistic models.
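As a toy example of the decision-tree route, a duration model could look something like this sketch, with entirely made-up features and target values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Features: [is_vowel, is_stressed, position_in_word, sentence_final_word]
X = np.array([[1, 1, 0.2, 0],
              [0, 0, 0.8, 0],
              [1, 0, 0.5, 1],
              [0, 1, 0.1, 1]])
y = np.array([120.0, 60.0, 150.0, 80.0])   # target durations in milliseconds

duration_model = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(duration_model.predict([[1, 1, 0.3, 1]]))   # duration for an unseen phoneme
```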
IIUC, there are two approaches when using neural nets for audio generation: 1) generate the LPC/formant parameters; 2) generate the audio directly. These can either operate on phonemes, or on the text directly. Operating on the text directly limits the voice to a particular language and accent, but potentially allows the neural net to infer pronunciation rules itself.
Hmmmm.... In my experience, for (neural) TTS the dominant option is to have one model that generates a melspectrogram from the text (handling prosody) and a second model that synthesizes samples from the melspectrogram. (Tacotron, Lyrebird, and this Facebook group are all doing this.) There are certainly research projects on going directly from text to samples, but it's not the currently winning strategy... Maybe eventually, though. The Text->Mel portion isolates the prosody and pronunciation problem, and provides a nice place to add extra conditioning.
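To make the two-stage split concrete, here's a hypothetical PyTorch skeleton; the class names and layer choices are placeholders, not Tacotron's or any library's actual API:

```python
import torch
import torch.nn as nn

class TextToMel(nn.Module):          # stand-in for a Tacotron-style seq2seq model
    def __init__(self, vocab_size, n_mels=80, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)   # real models decode autoregressively with attention

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.to_mel(h).transpose(1, 2)  # (batch, n_mels, frames)

class MelToAudio(nn.Module):         # stand-in for a WaveRNN/WaveGlow-style vocoder
    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(n_mels, 1, kernel_size=1),
                                 nn.Upsample(scale_factor=upsample))

    def forward(self, mel):
        return self.net(mel).squeeze(1)        # (batch, samples)

tokens = torch.randint(0, 40, (1, 12))         # fake phoneme IDs
audio = MelToAudio()(TextToMel(vocab_size=40)(tokens))
```

The point of the skeleton is only the interface: the mel spectrogram is the handoff where prosody lives and where extra conditioning (speaker, style) is easiest to inject.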
On the vocoder side: LPC, F0, etc. can all be estimated from a reasonably sized melspectrogram; for the most part, these neural models just let the big vocoder model handle all of these things, which are traditionally (fragile!) subtasks. The question is which "classical" parts are both cheap and reliable: you can compute those on the side and lighten the neural network's burden. LPC is great for this.
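For reference, those classical features are cheap to compute with ordinary DSP tooling; a rough sketch (assuming librosa >= 0.8, and computing from the waveform rather than from a melspectrogram, for simplicity):

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))                  # any mono clip will do
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # vocoder conditioning
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)                # frame-wise pitch track
a = librosa.lpc(y[:2048], order=16)                          # LP coefficients, one frame
```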
There was steady improvement for decades (e.g. on a Mac, you can compare Fred, Victoria, Vicki, and Alex, as representatives of 4 generations of TTS, and there were year-over-year improvements as well).
But the latest neural techniques have added quite a bit of naturalness, and their computational requirements, while high, are within the reach of consumer level devices.
You can go even farther back - SAM (Software Automated Mouth) on the C64 produced understandable, weirdly accented speech with something like 10K of code+data IIRC, and a 1 MHz 6502; see e.g. https://discordier.github.io/sam/ (though it did take all the CPU, because it did software-based PCM on non-PCM-supporting hardware... if there was just a little more control of the waveform generators, it would likely have been possible with 1% CPU or so...)
You could output speech on the TRS-80 (which didn't even have any analog outputs) by pulse-width modulating 0/1 bits on the cassette port -- in conjunction with the inherent RC circuit provided by the output circuitry and speaker, you could get barely understandable speech.
Huh? It appears to be written in PyTorch according to the article?
The training data could also be considered source.
And I agree that this is of limited use if I have to access it by uploading and downloading everything from Facebook servers. Not only are there privacy implications, but there's also the need for a solid, fast, low-latency internet connection that I can't guarantee.
> The training data could also be considered source.
That makes me wonder about something: would it be less computationally intensive to prove that a trained net is the result of some training data than to run the training procedure itself? A lot less? Can it be proven that there is no backdoor (supplemental training data altering the results)?
I suspect you could prove rather cheaply that the trained net weights indeed correspond to a local minimum for the training data. However, there is no telling that this is the best minimum that could be achieved, nor that the provided training dataset is enough to obtain that result.
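The cheap half of that check could look roughly like this PyTorch sketch (the helper is hypothetical; it only verifies a stationary point, and says nothing about better minima or about extra/backdoor data having been used):

```python
import torch

def gradient_norm(model, loss_fn, data_loader):
    # Given published weights, the claimed training data, and the loss,
    # anyone can verify that the total gradient is ~0 over that data.
    model.zero_grad()
    for x, y in data_loader:
        loss_fn(model(x), y).backward()    # gradients accumulate over the full set
    sq = sum(p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None)
    return sq.sqrt()                       # should be near zero at a true minimum
```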
Depending on the architecture, though, it's possible to export the trained model into a stand-alone file that can be imported by somebody else's program, de-coupling the network's training data from the model it produces.
This is done pretty frequently in areas like computer vision and speech recognition, with the pre-trained weights for YOLO and Mozilla Deepspeech[0] being available for download. I'm not sure if the word "open-source" totally applies here, since, as you pointed out, obtaining the dataset "source" might be tough, but OP's question might be answered by having the resulting models made publicly available along with the source code of the networks used to train and deploy them?
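For PyTorch models that decoupling is just the standard state_dict round-trip; a minimal sketch, where `TinyTTS` is a stand-in for whatever architecture was actually trained:

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, 256)

    def forward(self, mel):
        return self.proj(mel)

trained = TinyTTS()                                    # pretend this was trained
torch.save(trained.state_dict(), "tts_weights.pt")     # publisher ships only weights

fresh = TinyTTS()                                      # consumer rebuilds the graph...
fresh.load_state_dict(torch.load("tts_weights.pt", map_location="cpu"))
fresh.eval()                                           # ...and loads the shipped weights
```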
Impressive, but it still sounds "robotic" like AWS Polly. I wonder if they'll fuse that tech where you can sample someone's voice from a paragraph and build something from it. Then you could hire a voice actor (or actress) and maybe license their voice? I don't know how that would work.
The weaknesses of TTS grate on different people in different ways. For example, Microsoft Zira and the older Google TTS voice rank near the top for me, while I find every single one of the modern Google voices so horrible as to provoke instant anger when I hear them.
Yeah, awesome! This proprietary transcription algorithm must make it a hell of a lot easier for NSA databases. If this is deployed and used by FB so they send the finished and full transcripts of calls and other voice traffic [1] instead of the original audio to be transcribed later, it will all be more efficient! // sarcasm
https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-...