
This is a well-known problem. The noise is due to mu-law compression: the 16-bit audio samples are compressed to 8, 9, or 10 bits before being fed to the neural net, because predicting a categorical distribution over 2^16 values requires too many parameters. The same noise was audible in samples from DeepMind's famous WaveNet (which used 8-bit mu-law).
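For reference, here is a minimal sketch of mu-law companding in the 8-bit form WaveNet used (mu = 255); the function names are illustrative, not taken from any particular codebase:

```python
import numpy as np

def mu_law_encode(x, bits=8):
    """Map float audio in [-1, 1] to integer codes in [0, 2**bits - 1]."""
    mu = 2**bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # logarithmic compression to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)          # quantize to integer codes

def mu_law_decode(codes, bits=8):
    """Invert the companding; the quantization noise from the coarse grid remains."""
    mu = 2**bits - 1
    y = 2 * codes.astype(np.float32) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

samples = np.linspace(-1, 1, 5)
print(mu_law_decode(mu_law_encode(samples)))  # close to the input, but not exact
```

The round trip is lossy by design; that residual quantization error is exactly the hiss being discussed.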

There are two ways to avoid this: 1. predict the 8 high (coarse) bits and the 8 low (fine) bits separately, as in the original WaveRNN paper (rough sketch below); 2. use a mixture of logistic distributions as the predictive output, as in the recent Lyra vocoder from Google.
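A rough illustration of option 1 (not the WaveRNN authors' code), just showing how a 16-bit sample splits into two 8-bit targets so that each softmax covers 256 classes instead of 65536:

```python
def split_coarse_fine(sample_16bit):
    """sample_16bit: unsigned int in [0, 65535]."""
    coarse = sample_16bit >> 8   # high 8 bits, predicted first
    fine = sample_16bit & 0xFF   # low 8 bits, predicted conditioned on the coarse part
    return coarse, fine

def combine(coarse, fine):
    return (coarse << 8) | fine

c, f = split_coarse_fine(54321)
assert combine(c, f) == 54321
```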

How does the number of parameters scale with resolution?

Specifically, how much slower would this be if the audio were, say, 10 bits?

I recall a lab exercise in college where we were supposed to increase the resolution of a quantizer until the tone sounded decent; 10 bits was the point at which we reached satisfying quality.


It is a single matrix multiplication to predict the probabilities of all possible outputs. For example, with a 1024-dimensional hidden state and an 8-bit output, that is 1024x256 parameters; a 10-bit output needs 1024x1024 params.
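A quick back-of-the-envelope check of those numbers, assuming a plain hidden_dim x 2^bits output projection with the bias ignored:

```python
# Output-layer size scales linearly with the number of classes,
# i.e. exponentially with the bit depth.
hidden_dim = 1024
for bits in (8, 10, 16):
    classes = 2 ** bits
    print(f"{bits:>2} bits -> {classes:>5} classes -> {hidden_dim * classes:,} params")
# 8 bits  ->   256 classes ->    262,144 params
# 10 bits ->  1024 classes ->  1,048,576 params
# 16 bits -> 65536 classes -> 67,108,864 params
```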
