00xxxxxx - copy (x+1)-th last EXPLICITLY ENCODED pixel (i.e. ignoring repeats)
010xxxxx - repeat the last pixel x+1 times
011xxxxx xxxxxxxx - repeat the last pixel x+33 times
10rrggbb - copy the last pixel and adjust RGB by (r-1, g-1, b-1)
110rrrrr ggggbbbb - copy the last pixel and adjust RGB by (r-15, g-7, b-7)
1110rrrr rgggggbb bbbaaaaa
- copy the last pixel and adjust RGBA by (r-15, g-15, b-15, a-15)
1111RGBA [rrrrrrrr] [gggggggg] [bbbbbbbb] [aaaaaaaa]
- copy the last pixel and replace RGBA if each bit flag is set
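For concreteness, here is a rough sketch of what a decoder loop for these packets could look like, taking the descriptions above literally. The read_byte()/emit() helpers, the zero-initialised starting state, and the choice to treat every non-run packet as "explicitly encoded" are my own assumptions, not part of the format; the 64-entry history is simply the most the 6-bit index can address.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } px_t;

extern uint8_t read_byte(void);   /* hypothetical: next byte of the stream */
extern void    emit(px_t px);     /* hypothetical: output one decoded pixel */

void decode_packets(size_t n_packets) {
    px_t px = {0, 0, 0, 255};     /* the "last pixel" */
    px_t seen[64] = {{0}};        /* explicitly encoded pixels, ignoring repeats */
    size_t pos = 0;

    for (size_t i = 0; i < n_packets; i++) {
        uint8_t b0 = read_byte();

        if ((b0 & 0xC0) == 0x00) {             /* 00xxxxxx: (x+1)-th last explicit pixel */
            px = seen[(pos - 1 - (b0 & 0x3F)) % 64];
        } else if ((b0 & 0xE0) == 0x40) {      /* 010xxxxx: repeat x+1 times */
            for (int r = (b0 & 0x1F) + 1; r > 0; r--) emit(px);
            continue;                          /* repeats don't enter the history */
        } else if ((b0 & 0xE0) == 0x60) {      /* 011xxxxx xxxxxxxx: repeat x+33 times */
            for (int r = (((b0 & 0x1F) << 8) | read_byte()) + 33; r > 0; r--) emit(px);
            continue;
        } else if ((b0 & 0xC0) == 0x80) {      /* 10rrggbb: 2-bit deltas, bias 1 */
            px.r += ((b0 >> 4) & 0x03) - 1;
            px.g += ((b0 >> 2) & 0x03) - 1;
            px.b += ( b0       & 0x03) - 1;
        } else if ((b0 & 0xE0) == 0xC0) {      /* 110rrrrr ggggbbbb: 5/4/4-bit deltas */
            uint8_t b1 = read_byte();
            px.r += (b0 & 0x1F) - 15;
            px.g += (b1 >> 4)   - 7;
            px.b += (b1 & 0x0F) - 7;
        } else if ((b0 & 0xF0) == 0xE0) {      /* 1110rrrr rgggggbb bbbaaaaa: 5-bit deltas */
            uint8_t b1 = read_byte();
            uint8_t b2 = read_byte();
            px.r += (((b0 & 0x0F) << 1) | (b1 >> 7)) - 15;
            px.g += ((b1 >> 2) & 0x1F)               - 15;
            px.b += (((b1 & 0x03) << 3) | (b2 >> 5)) - 15;
            px.a += ( b2 & 0x1F)                     - 15;
        } else {                               /* 1111RGBA: replace flagged channels */
            if (b0 & 0x08) px.r = read_byte();
            if (b0 & 0x04) px.g = read_byte();
            if (b0 & 0x02) px.b = read_byte();
            if (b0 & 0x01) px.a = read_byte();
        }
        seen[pos++ % 64] = px;                 /* every non-run packet counts as explicit here */
        emit(px);
    }
}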
So it is essentially the PNG filter limited to Sub, but with better delta coding and a larger history window. My guess is that it might not work well if the image has a strong vertical gradient (which would need vertical context), but it's nevertheless pretty impressive that such a simple coding is not all that inefficient.
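For reference, PNG's Sub filter just predicts each byte from the corresponding byte of the pixel to its left and stores the difference modulo 256. A rough sketch of one scanline (real PNG filtering also involves a per-scanline filter-type byte and several other filter choices):

/* PNG Sub filter on one scanline: each byte is predicted by the corresponding
   byte of the pixel to its left (0 for the first pixel); only the residual,
   mod 256, is stored. bpp is bytes per pixel. */
void png_sub_filter(unsigned char *out, const unsigned char *row, int width_px, int bpp) {
    for (int i = 0; i < width_px * bpp; i++) {
        unsigned char left = (i >= bpp) ? row[i - bpp] : 0;
        out[i] = (unsigned char)(row[i] - left);
    }
}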
You might be able to do better compression by traversing the pixels in Hilbert [1] order or Z [2] order, to take into account vertical neighbours as well as horizontal. It might cost time or space though, as you couldn't stream the pixels through it anymore - you'd need them all in memory.
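For what it's worth, the Z-order index is just the bit-interleaving of the two coordinates, so the mapping itself is cheap (a quick sketch; Hilbert order takes a bit more bookkeeping):

#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Z-order (Morton) index of pixel (x, y): x in the even bits, y in the odd bits. */
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}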
You could still stream the pixels; you'd just have to stream them in Z order. This may seem pedantic, but it applies to any mismatch between the order in which you store pixels and the order in which the format stores them.
E.g. some file formats may be row-major, others column-major, some block-based (like JPEG) or use a fancy interleaving scheme (like Adam7). Some might store the channels interleaved, others separate the channels. If any of these choices doesn't match the desired output format, it breaks streaming.
The nice thing is that if M >= B^2 (i.e., total memory is large enough to fit a square region of the image, where each row/column of the square fills a full block/page of memory), you can transform from row/column order to Hilbert/Z-order without needing to do more I/Os.
So you can't do such a conversion in a streaming fashion, but there is no need to load all the data into memory either.
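Roughly, the idea is: read the image one square tile at a time, permute the tile to Z-order in memory, and write it to the tile's slot in the output. A sketch, with error handling omitted; the tile size, pixel size, and helper names are my own assumptions, and morton2d() is the bit-interleaving helper sketched earlier:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TILE 256   /* power of two; TILE x TILE pixels must fit in memory */
#define PXB  4     /* bytes per pixel */

extern uint32_t morton2d(uint32_t x, uint32_t y);

/* Convert a row-major image of 4-byte pixels to Z-order, one tile at a time.
   Assumes width and height are multiples of TILE. Each tile row is one
   contiguous read, which is where the M >= B^2 condition comes in: a tile
   row spans at least one full block, so little read bandwidth is wasted. */
void rowmajor_to_zorder(FILE *in, FILE *out, uint32_t w, uint32_t h) {
    uint8_t *tile = malloc((size_t)TILE * TILE * PXB);
    uint8_t *zbuf = malloc((size_t)TILE * TILE * PXB);
    for (uint32_t ty = 0; ty < h; ty += TILE)
        for (uint32_t tx = 0; tx < w; tx += TILE) {
            for (uint32_t y = 0; y < TILE; y++) {      /* one contiguous read per tile row */
                fseek(in, (long)(((uint64_t)(ty + y) * w + tx) * PXB), SEEK_SET);
                fread(tile + (size_t)y * TILE * PXB, PXB, TILE, in);
            }
            for (uint32_t y = 0; y < TILE; y++)        /* permute the tile in memory */
                for (uint32_t x = 0; x < TILE; x++)
                    memcpy(zbuf + (size_t)morton2d(x, y) * PXB,
                           tile + ((size_t)y * TILE + x) * PXB, PXB);
            /* An aligned power-of-two tile occupies one contiguous range of the
               global Z-order, starting at the Morton index of the tile coordinates. */
            uint64_t base = (uint64_t)morton2d(tx / TILE, ty / TILE) * TILE * TILE;
            fseek(out, (long)(base * PXB), SEEK_SET);
            fwrite(zbuf, PXB, (size_t)TILE * TILE, out);
        }
    free(tile);
    free(zbuf);
}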
Yeah, those images explain it pretty well. There is one slight difference, though.
Because the Hilbert/Z-order curves are defined recursively, algorithms for converting between the curve and row/column order are "cache-oblivious", in that they don't need to take an explicit block size. You write it once, and it performs well on any system regardless of the CPU cache size and/or page size.
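A cache-oblivious version of that conversion can be written as a plain recursive quadrant split, with no block or cache size appearing anywhere. An in-memory sketch, assuming a power-of-two square region and 4-byte pixels:

#include <stdint.h>
#include <string.h>

#define PXB 4   /* bytes per pixel */

/* Copy an n x n square (top-left corner at (x0, y0)) of a row-major image into
   dst in Z-order by recursing into its four quadrants. The recursion order
   itself produces the Z-order, and no cache or block size appears anywhere,
   which is what makes it cache-oblivious. n must be a power of two; in practice
   you'd stop the recursion at a small base tile rather than a single pixel.
   Usage for a square image: to_zorder(dst, img, width, 0, 0, width). */
void to_zorder(uint8_t *dst, const uint8_t *src, uint32_t stride_px,
               uint32_t x0, uint32_t y0, uint32_t n) {
    if (n == 1) {
        memcpy(dst, src + ((size_t)y0 * stride_px + x0) * PXB, PXB);
        return;
    }
    uint32_t h = n / 2;
    size_t quad = (size_t)h * h * PXB;
    to_zorder(dst + 0 * quad, src, stride_px, x0,     y0,     h);   /* top-left     */
    to_zorder(dst + 1 * quad, src, stride_px, x0 + h, y0,     h);   /* top-right    */
    to_zorder(dst + 2 * quad, src, stride_px, x0,     y0 + h, h);   /* bottom-left  */
    to_zorder(dst + 3 * quad, src, stride_px, x0 + h, y0 + h, h);   /* bottom-right */
}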
I wonder whether transforming the pixel data to/from a different order, before compression and after decompression, would speed things up without introducing too much slowdown of its own?
Iteration order through arrays has a big impact on performance due to cache effects. I'd guess that you'd lose a lot of performance with a complicated iteration scheme.
So I implemented this change and tested it out. It turns out that, in my simple test suite, this actually changes the size of images. Not drastically: often just a 5-10 KB difference (on a 500-600 KB file), but that's still more than I expected for changing r-15 to r-7.
That suggests the DIFF16 delta-color mode is responsible for quite a bit of the savings. Maybe it would be worth experimenting with the exact encoding.
One idea would be to start from a predicted color (calculate the delta between the previous two pixels, apply that, then specify a delta relative to the prediction; see the sketch below). Another would be to encode the delta in YUV or some other color space, and then experiment with the best balance of bits between those channels.
Perhaps it would be better to steal bits from RUN16 instead; I somewhat doubt its usefulness.
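To illustrate the predicted-color idea, here is a sketch of an encoder-side check that extrapolates each channel linearly from the two previous pixels and tests whether the residual still fits DIFF16's 5/4/4-bit budget. The names and structure are made up, and whether the smaller residuals actually pay off would need measuring.

#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba_t;

/* Mod-256 difference mapped into [-128, 127], matching the wraparound
   arithmetic of the delta packets above. */
static int wrap_delta(uint8_t from, uint8_t to) {
    int d = (int)to - (int)from;
    if (d < -128) d += 256;
    if (d >  127) d -= 256;
    return d;
}

/* Predicted-color variant of DIFF16: extrapolate each channel linearly from the
   two previous pixels (prev + (prev - prev2), wrapping), then take the delta
   against that prediction instead of against the previous pixel alone.
   Returns 1 if the residual fits the 5/4/4-bit budget (-15..16 / -7..8 / -7..8). */
int fits_diff16_predicted(rgba_t prev2, rgba_t prev, rgba_t cur,
                          int *dr, int *dg, int *db) {
    uint8_t pr = (uint8_t)(prev.r + (prev.r - prev2.r));
    uint8_t pg = (uint8_t)(prev.g + (prev.g - prev2.g));
    uint8_t pb = (uint8_t)(prev.b + (prev.b - prev2.b));
    *dr = wrap_delta(pr, cur.r);
    *dg = wrap_delta(pg, cur.g);
    *db = wrap_delta(pb, cur.b);
    return *dr >= -15 && *dr <= 16 &&
           *dg >=  -7 && *dg <=  8 &&
           *db >=  -7 && *db <=  8;
}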
I see it as a variant of LZ specialised to images, so its performance should not be too surprising. Running an uncompressed bitmap through a general-purpose compression algorithm tends to yield similar results to PNG.
This is so neat. It reminds me of the recent LZ4 post, which showed that by manually hand-tweaking the encoding you can still get a decent compression boost after all these decades.