How much data does a model take up? I wonder whether this would work for compression: train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally, store deltas for egregious differences.)
Decompression would be slow with current models and hardware, because generation is inherently sequential. It would be very efficient information-wise, though: you only have to send text, which can itself be compressed!
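To make the idea concrete, here is a toy Python sketch of "text plus deltas" storage. Everything in it is an assumption for illustration: `synthesize` is a hypothetical stand-in for a trained generative model (here it just maps each character to a short sine tone), and `encode`, `decode`, the sample rate, and the threshold are made-up names and values, not anything from a real codec.

```python
import math

SAMPLE_RATE = 8000        # assumed toy sample rate
SAMPLES_PER_CHAR = 80     # assumed: each character becomes a 10 ms tone


def synthesize(text):
    """Hypothetical stand-in for a trained model: text -> audio samples."""
    samples = []
    for ch in text:
        freq = 200.0 + 10.0 * (ord(ch) % 64)  # deterministic toy mapping
        for n in range(SAMPLES_PER_CHAR):
            samples.append(math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
    return samples


def encode(audio, text, threshold=0.05):
    """Keep the text, plus sparse deltas where the model output is too far off."""
    approx = synthesize(text)
    deltas = {
        i: a - p
        for i, (a, p) in enumerate(zip(audio, approx))
        if abs(a - p) > threshold
    }
    return text, deltas


def decode(text, deltas):
    """Regenerate audio from the text, then patch the egregious differences."""
    out = synthesize(text)
    for i, d in deltas.items():
        out[i] += d
    return out


# Usage: "real" audio that differs from the model's output at two samples.
original = synthesize("hello")
original[10] += 0.5
original[200] -= 0.3
text, deltas = encode(original, "hello")
restored = decode(text, deltas)
```

Only `text` and the two entries in `deltas` need to be stored or sent. The catch is that both sides need the same deterministic model, and with a real autoregressive model the `synthesize` step is the slow, sequential part.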
I am sure people will start trying to speed this up, as it could be a game changer in that space with a fast enough implementation. Google also has a lot of great engineers with direct motivation to get it working on phones, and a history of porting recent research into the Android speech pipeline.
The results speak for themselves - step 1 is almost always "make it work" after all, and this works amazingly well! Step 2 or 3 is "make it fast", depending on who you ask.
We've known for decades that neural networks are really good at image and video compression. But as far as I know, this has never been used in practice, because the compression and decompression times are ridiculous. I imagine this would be even more true for audio.