tl;dr: the algorithm splits the audio into blocks (default is 4096 samples per block). Then a polynomial is used to approximate the wave for the block, and residuals are calculated for each sample by subtracting the approximation. The residuals, because their magnitude is smaller than the original signal, require fewer bits to encode. Hence the smaller file without loss of information.
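For concreteness, here's roughly what an order-2 fixed (polynomial) predictor and its residual look like; this is a sketch, not FLAC's actual implementation, and the function name is mine:

```python
# Sketch of an order-2 "fixed" polynomial predictor: extrapolate each
# sample from the two before it, and keep only the prediction error.
def fixed_order2_residual(samples):
    residuals = list(samples[:2])  # warm-up samples stored verbatim
    for i in range(2, len(samples)):
        prediction = 2 * samples[i - 1] - samples[i - 2]
        residuals.append(samples[i] - prediction)
    return residuals

# A smooth waveform yields small residuals:
# fixed_order2_residual([0, 3, 7, 12, 18]) -> [0, 3, 1, 1, 1]
```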
Curious to me why only such a simple representation (polynomial) is used for the approximation. Couldn't you get a much better approximation (⇒ more compressible residual) in about the same amount of storage space + CPU decoding effort, using a formula based on e.g. wavelet LUTs as your primitive (plus maybe a larger block size)?
Are there other, more advanced lossless encodings that do this? And if so, why didn't they catch on compared to FLAC?
For that matter, is there a lossless format that just embeds a regular lossy encoding of the audio as the approximation, and then computes+stores the residual relative to that? (I'm guessing that this wouldn't work well for some reason, but I'm not sure what that reason would be.)
(ETA: the other, later lossless audio formats I'm personally aware of (ALAC, Monkey's Audio, and WavPack) all seem to use linear prediction as well. They appear to have been designed under the constraint that encoding must be doable in hardware, with fixed-size memory buffers, rather than under the assumption that you can load the whole PCM audio file into memory and do things like an FFT over it. That may have made sense in the late 90s, when a PC's RAM wasn't much larger than five minutes of uncompressed audio, but it doesn't really hold today. Maybe we're due for a new lossless audio encoding?)
Diminishing returns: the polynomial is already a good fit relative to the unpredictability of the waveform. Past some point, every extra degree of freedom in the modeled waveform costs as many bits to store as it saves in the residual; e.g., a predictor needing an extra 1024 bits of coefficients per 4096-sample block has to shave at least 0.25 bits off every residual sample just to break even.
FLAC can use fixed polynomial predictors or LPC, as described in the article.
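For the LPC case, the encoder fits per-block coefficients and predicts each sample as a weighted sum of the previous ones. A minimal sketch, assuming numpy (a real encoder runs Levinson-Durbin on the block's autocorrelation and quantizes the coefficients; plain least squares is enough to show the idea, and the function name is mine):

```python
import numpy as np

# Rough sketch of LPC prediction as used by FLAC-style codecs.
def lpc_residual(samples, order=4):
    x = np.asarray(samples, dtype=float)
    # Row i holds the `order` samples preceding x[order + i],
    # most recent first.
    rows = np.array([x[i - order:i][::-1] for i in range(order, len(x))])
    coeffs, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    return x[order:] - rows @ coeffs  # warm-up samples stored separately
```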
I'm curious whether you'd gain anything by doing an MDCT, modeling in the frequency domain, and storing the residuals there... Many of the frequency channels usually have much lower energy, so the residuals wind up being easier to store.
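As a rough illustration of that intuition, with a plain DCT standing in for the MDCT (the signal here is made up):

```python
import numpy as np
from scipy.fft import dct

# Energy compaction in the frequency domain: a tonal 4096-sample block
# concentrates most of its energy in a handful of bins, so most
# coefficients are near zero.
t = np.arange(4096)
block = (np.sin(2 * np.pi * 440 * t / 44100)
         + 0.3 * np.sin(2 * np.pi * 880 * t / 44100))
coeffs = dct(block, norm='ortho')
sorted_energy = np.cumsum(np.sort(coeffs ** 2)[::-1])
print(sorted_energy[9] / sorted_energy[-1])  # fraction of energy in the top 10 bins
```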
I just ran across MPEG-4 SLS [1] (née "AAZ" for "Advanced Audio Zip", aka HD-AAC), which is a lossless format on top of AAC's MDCT, and which is patented, unfortunately.
That led me to other coders: DTS-HD Master Audio [2] (née DTS++) and OptimFROG DualStream [3].
OptimFROG's DualStream mode is similar to WavPack's hybrid mode, and DTS-HD MA uses DTS Coherent Acoustics (based on ADPCM), so none of these are based on a perceptual lossy codec besides MPEG-4 SLS.
(Sorry to keep replying here, I keep stumbling on interesting things after the edit window closes)
> is there a lossless format that just embeds a regular lossy encoding of the audio as the approximation, and then computes+stores the residual relative to that?
It's not "regular lossy", but WavPack does allow separating lossy from the residual in hybrid mode. I think this is rarely done with DCT-based stuff because there's so much potential imprecision in the decoders.
Interesting; it never occurred to me that your average lossy audio codec isn't just lossy, but non-strict in defining the decoding output, such that you get different PCM output depending on the decoder implementation.
Is this just a thing with old codecs? I'd think that any codec from the last 10 years could assume a minimum level of support for certain ALU ops in even the wimpiest hardware expected to decode it; and then constrain more-powerful implementations to emulate that minimum standard, so as to get exactly the same decoded output. (I.e., define a strict "decoder abstract machine" semantics, and then explain how various real-world software/hardware impls should implement it.)
I don't know much about the practical engineering of modern codecs, I was just assuming that the exact order and combination of operations (particularly rounding) would be underspecified to allow for different architectures, and that would impact at least the low bits of the output. magicalhippo's comment suggests this might not actually be the case for H.264, which does also have a lossless mode...
There's another, more important point, though: modern lossy codecs are designed to be perceptually transparent rather than to minimize absolute signal error. The difference is likely a large and unpredictable signal, so the typical Rice coding will be ineffective at compressing the residual.
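To see why Rice coding falls over there, here it is in miniature (a sketch; real codecs also zigzag-map signed residuals to non-negative integers first):

```python
# Rice coding: quotient in unary, remainder in k binary bits. Cheap
# for small values, but the unary part grows linearly with the value,
# so large residuals are disastrous.
def rice_encode(value, k):
    quotient, remainder = value >> k, value & ((1 << k) - 1)
    return '1' * quotient + '0' + format(remainder, f'0{k}b')

# rice_encode(5, 2)    -> '1001'                 (4 bits)
# rice_encode(1000, 2) -> 250 ones + '0' + '00'  (253 bits)
```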
Wavelets still might be interesting as a basis, there's at least one project [1] that reports comparable ratios to FLAC, if a bit lower.
As always with compression, there is no magic. There are WAV files that are smaller than the equivalent FLAC file; perhaps even the majority of possible WAV files are, since by the pigeonhole principle no lossless codec can shrink every input. But it so happens that the sounds we actually want to store are very well approximated by polynomials.
Nice. Not dissimilar from what JPEG 2000 does (from memory; correct me if I'm wrong!). They use the previous pixel values to "guess" the next value and then store the delta, which tends to be a smaller number on average, giving better compression.
Proximity in a 2D image means neighbors to the north, south, east, and west (NSEW) on the grid, not just neighbors within a scanline (a row of bytes). Proximity in a sound file works differently: the signal carries harmonics and transitions across multiple components, but a sample's neighbors are simply adjacent points in time. Corrections welcome.
True. Locality in real-world 2D images means you can predict a pixel's value from any neighboring pixel, or indeed any nearby pixel, with performance degrading as the L2 distance grows. There's no reason I can think of to prefer one direction over another, except for convenience in the data structure.
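A toy version of that kind of neighbor prediction, assuming numpy (the choice of averaging north and west neighbors is arbitrary, per the point above; the function name is mine):

```python
import numpy as np

# Toy 2D delta prediction: predict each pixel from its north and west
# neighbors and keep the residual, which is small for smooth images.
def predict_residual(image):
    img = np.asarray(image, dtype=int)
    pred = np.zeros_like(img)
    pred[1:, :] += img[:-1, :]   # add the north neighbor
    pred[:, 1:] += img[:, :-1]   # add the west neighbor
    pred[1:, 1:] //= 2           # interior pixels: average of the two
    return img - pred
```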