Hacker News | jmvalin's comments

Actually, what we're doing with DRED isn't that far from what you're suggesting. The difference is that we keep more information about the voice/intonation and we don't need the latency that would otherwise be added by an ASR. In the end, the output is still synthesized from higher-level, efficiently compressed information.


What the PLC does is (vaguely) equivalent to momentarily freezing the image rather than showing a blank screen when packets are lost. If you're in the middle of a vowel, it'll continue the vowel (trying to follow the right energy) for about 100 ms before fading out. It's explicitly designed not to make up anything you didn't say -- for obvious reasons.
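
For a rough idea of that fallback behaviour, here is a toy version of the fade-out logic in C (the constants and the copy-the-last-frame approach are simplifications for illustration; the real concealment predicts acoustic features rather than repeating samples):

    /* Toy concealment: reuse the last good frame with a gain that
       decays to zero over roughly 100 ms (4800 samples at 48 kHz). */
    #define FS        48000
    #define FADE_SAMP (FS / 10)   /* ~100 ms */

    void conceal_frame(const float *last_frame, int frame_size,
                       int samples_lost_so_far, float *out)
    {
        for (int i = 0; i < frame_size; i++) {
            int lost = samples_lost_so_far + i;
            float gain = lost >= FADE_SAMP ? 0.f
                         : 1.f - (float)lost / FADE_SAMP;
            out[i] = gain * last_frame[i];
        }
    }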


Reassuring - thanks for clearing that up.


Well, there are different ways to make things up. We decided against using a pure generative model to avoid making up phonemes or words. Instead, we predict the expected acoustic features (using a regression loss), which means the model is able to continue a vowel. If unsure, it'll just pick the "middle point", which won't be something recognizable as a new word. That's in line with how traditional PLCs work. It just sounds better. The only generative part is the vocoder that reconstructs the waveform, but it's constrained to match the predicted spectrum so it can't hallucinate either.
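
To make the regression-vs-generative distinction concrete, a minimal sketch (my illustration, not the actual training code): an L1 loss over predicted acoustic features is minimized by a central, non-committal estimate whenever the model is unsure, so it cannot "invent" a crisp new phoneme the way sampling from a generative model could.

    #include <math.h>

    /* Mean L1 regression loss over a vector of acoustic features
       (e.g. cepstral coefficients). Under uncertainty this is
       minimized by the conditional median, i.e. a "middle point",
       not by a sampled (possibly wrong) alternative. */
    float feature_loss(const float *pred, const float *target, int n)
    {
        float loss = 0.f;
        for (int i = 0; i < n; i++)
            loss += fabsf(pred[i] - target[i]);
        return loss / n;
    }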


Any demos of this to listen to? It sounds potentially really good.


There is a demo in the link shared by OP.


That's really cool. Congratulations on the release!


Quoting from our paper, training was done using "205 hours of 16-kHz speech from a combination of TTS datasets including more than 900 speakers in 34 languages and dialects". Mostly tested with English, but part of the idea of releasing early (none of that is standardized) is for people to try it out and report any issues.

There are roughly equal numbers of male and female speakers, though codecs always have slight perceptual quality biases (in either direction) that depend on the pitch. Oh, and everything here is speech only.


As part of the packet loss challenge, there was an ASR word accuracy evaluation to see how PLC impacted intelligibility. See https://www.microsoft.com/en-us/research/academic-program/au...

The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.
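
For reference, the "word accuracy" in that kind of test boils down to one minus the word error rate; a rough sketch of the computation is below (a hypothetical helper for illustration, not the challenge's actual scoring tools):

    #include <string.h>

    /* Word accuracy of an ASR transcript against the reference:
       1 - (word-level edit distance / reference length). */
    float word_accuracy(char **ref, int nr, char **hyp, int nh)
    {
        int d[nr + 1][nh + 1];   /* C99 VLA for brevity */
        for (int i = 0; i <= nr; i++) d[i][0] = i;
        for (int j = 0; j <= nh; j++) d[0][j] = j;
        for (int i = 1; i <= nr; i++) {
            for (int j = 1; j <= nh; j++) {
                int sub = d[i-1][j-1] + (strcmp(ref[i-1], hyp[j-1]) != 0);
                int del = d[i-1][j] + 1;
                int ins = d[i][j-1] + 1;
                int m = sub < del ? sub : del;
                d[i][j] = m < ins ? m : ins;
            }
        }
        return 1.f - (float)d[nr][nh] / nr;
    }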


(Opus author here) I'm curious what kind of "glitch" this is referring to.


Author here: they are randomly generated packets, extended using the packet loss concealment feature. You can fill a packet with random data and it will still decompress into a sensible sound, and that's where all the sounds in the drum machine come from.
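
For anyone who wants to try this locally, a minimal sketch against the public libopus C API (the packet size, mono/48 kHz settings and loop counts are arbitrary choices here, and error checks are omitted):

    #include <stdlib.h>
    #include <opus/opus.h>

    int main(void)
    {
        int err;
        OpusDecoder *dec = opus_decoder_create(48000, 1, &err);
        if (err != OPUS_OK) return 1;

        unsigned char packet[64];
        opus_int16 pcm[5760];          /* up to 120 ms at 48 kHz */

        for (int i = 0; i < (int)sizeof(packet); i++)
            packet[i] = rand() & 0xFF;

        /* Feed the random bytes as if they were a real packet... */
        opus_decode(dec, packet, sizeof(packet), pcm, 5760, 0);

        /* ...then let packet loss concealment extend the sound,
           20 ms (960 samples) at a time. */
        for (int k = 0; k < 10; k++)
            opus_decode(dec, NULL, 0, pcm, 960, 0);

        opus_decoder_destroy(dec);
        return 0;
    }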


> This is entirely my fault, and I take all the blame for that.

You shouldn't be blaming yourself, it was the best thing to do. Some people may have been confused over who the "good guys" were in this mess. By taking over all these channels you made everything perfectly clear. No amount of arguing could have made things clearer than your actions.


No, exactly none of that data was used for training. The training was done before the demo that asked for noise contributions. The contributions are CC0, but were never used (i.e. the dataset quality is totally unknown).


All major browsers now implement WebRTC, including Opus support. Also, most browsers now support Opus playback in HTML5, though AFAIK Safari only supports it in the CAF container. See https://caniuse.com/#search=opus


> AFAIK Safari only supports it in the CAF container

Apple being Apple. While they "support" it, looks like they can't even do it properly.


Thanks for your work on this awesome codec!


The cepstrum that takes up most of the bits (or the LSPs in other codecs) is actually a model of the larynx -- another reason why it doesn't do well on music. Because of the accuracy needed to exactly represent the filter that the larynx makes, plus the fact that it can move relatively quickly, there's indeed a significant number of bits involved here.

The bitrate could definitely be reduced (possibly by 50%+) by using packets of 1 second along with entropy coding, but the resulting codec would not be very useful for voice communication. You want packets short enough to get decent latency, and if you use RF, then VBR makes things a lot more complicated (and less robust).
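
To give an idea of what the cepstrum mentioned above actually is, a toy version (the band count and constants are made up for the example, not the codec's actual analysis): take the log energy of each frequency band, then apply a DCT across bands, which gives a compact description of the spectral envelope that most of the bits go to.

    #include <math.h>

    #define NB_BANDS 18              /* arbitrary band count */
    #define PI_F     3.14159265f

    /* Toy band cepstrum: log energy per band, then a DCT-II
       across bands to decorrelate the envelope. */
    void band_cepstrum(const float *band_energy, float *ceps)
    {
        float logE[NB_BANDS];
        for (int i = 0; i < NB_BANDS; i++)
            logE[i] = log10f(band_energy[i] + 1e-6f);

        for (int k = 0; k < NB_BANDS; k++) {
            float sum = 0.f;
            for (int i = 0; i < NB_BANDS; i++)
                sum += logE[i] * cosf(PI_F / NB_BANDS * (i + 0.5f) * k);
            ceps[k] = sum;
        }
    }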


So can you do domain adaptation and get it to vocode my voice into Johnny Cash's? Larynx adaptation.


In theory, it wouldn't be too hard to implement with a neural network. In theory. In practice, the problem is figuring out how to do the training, because I don't have 2 hours of your voice saying the same thing as the target voice and with perfect alignment. I suspect it's still possible, but it's not a simple thing either.


Perfect alignment, or any alignment for that matter, is not necessary. Check out adversarial networks. CycleGAN can be trained to do similar feats in the image domain without aligned inputs. It shouldn't be hard to adapt it to audio.


I didn't say "impossible", merely "not simple". The minute you bring in a GAN, things are already not simple. Also, I'm not aware of any work on a GAN that works with a network that does conditional sampling (like LPCNet/WaveNet), so it would mean starting from scratch.

