Arithmetic coding (AC) is conceptually stupidly simple. All you do is encode a string of symbols as a range of real numbers.
To start, your range is [0, 1). For each symbol you want to encode, you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split that range into [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.
Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx (where x is A, B or C), we split up [0.25, 0.75) proportionally, just like we split up [0, 1) before, and take x's slice of it.
As an example, A is the range [0, 0.25), and AC is the range [0.1875, 0.25).
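Here's a tiny sketch of that range-narrowing in Python (nothing from a real coder; the MODEL dict and encode_interval name are just made up for illustration, and it uses plain floats, which a real implementation avoids in favor of integer arithmetic):

```python
# Toy interval narrowing, using the 25% A / 50% B / 25% C model from above.
MODEL = {"A": (0.0, 0.25), "B": (0.25, 0.75), "C": (0.75, 1.0)}

def encode_interval(symbols):
    low, high = 0.0, 1.0
    for s in symbols:
        width = high - low
        s_low, s_high = MODEL[s]
        # Scale the symbol's slice of [0, 1) into the current range.
        low, high = low + width * s_low, low + width * s_high
    return low, high

print(encode_interval("A"))   # (0.0, 0.25)
print(encode_interval("AC"))  # (0.1875, 0.25)
```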
Now, to actually turn these ranges into a string of bits, we choose the shortest binary representation that fits within the range. Compare a decimal number:
0.1875
We know that this means 1/10 + 8/100 + 7/1000 + 5/10000. A binary representation:
0.0011
This means 0/2 + 0/4 + 1/8 + 1/16 = 0.1875, which falls inside [0.1875, 0.25), so we encode AC as 0011.
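If you want to play with that last step, here's a brute-force way to find the shortest binary fraction inside a range (purely illustrative; a real coder emits bits incrementally as the range narrows instead of searching like this):

```python
import math

def shortest_bits(low, high):
    """Shortest bit string b1 b2 ... bk such that 0.b1b2...bk lies in [low, high)."""
    k = 1
    while True:
        n = math.ceil(low * 2**k)       # smallest k-bit fraction >= low
        if n / 2**k < high:
            return format(n, f"0{k}b")  # k bits, left-padded with zeros
        k += 1

print(shortest_bits(0.1875, 0.25))  # '0011'
```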
---
The beauty of arithmetic coding is that after encoding/decoding each symbol we can arbitrarily change how we split up the range, as long as the encoder and decoder change it in the same way; this is what gives rise to adaptive coding. Arithmetic coding can represent any data that forms a discrete string of symbols with essentially no wasted bits beyond what the model predicts, including models whose knowledge of the data changes as we decode.
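For example, a hypothetical count-based model can re-split the range after every symbol; since the decoder has already seen the same history, it applies the same update and stays in sync (the ABC alphabet and add-one counts here are just illustrative):

```python
from collections import Counter

def adaptive_ranges(history, alphabet="ABC"):
    # Start every symbol with a count of 1, then bump counts for symbols
    # already seen. Both encoder and decoder compute this from the same
    # history, so they always agree on how the current range is split.
    counts = Counter({s: 1 for s in alphabet})
    counts.update(history)
    total = sum(counts.values())
    low, ranges = 0.0, {}
    for s in alphabet:
        high = low + counts[s] / total
        ranges[s] = (low, high)
        low = high
    return ranges

print(adaptive_ranges(""))     # uniform thirds before anything is seen
print(adaptive_ranges("BBA"))  # B's slice has grown to 3/6 = 0.5
```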
Or, on a more abstract level, to compare it with Huffman coding: Huffman turns each symbol into a whole number of bits, like "011". Arithmetic coding effectively lets you spend a fractional number of bits per symbol.
A Huffman tree for digits might assign 0-5 three bits and 6-9 four bits, so encoding three random digits costs slightly more than 10 bits on average. Arithmetic coding lets you give each digit an equal share of the range, so each digit costs log2(10) ≈ 3.32 bits and three digits come out just under 10 bits, with the bit-rounding overhead paid once per message instead of once per symbol.
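A quick back-of-the-envelope check of those numbers (just arithmetic, not part of any coder):

```python
import math

huffman_avg = 3 * (6 * 3 + 4 * 4) / 10  # 6 digits cost 3 bits, 4 digits cost 4 bits
arith_cost = 3 * math.log2(10)          # each digit costs exactly log2(10) bits
print(huffman_avg, arith_cost)          # 10.2 vs ~9.97
```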