Hacker News new | past | comments | ask | show | jobs | submit login

There's some additional interesting details, and a surprising amount of variation in those details, once you start really digging into things.

If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits. Which means you have an extra 4 or 2 bits which don't encode anything.

In the RFC, encoders are required to set those bits to 0, but decoders only "MAY" choose to reject input which does not have those set to 0. In practice, nothing rejects those by default, and as far as I know only Ruby, Rust, and Go allow you to fail on such inputs - Python has a "validate" option, but it doesn't validate those bits.

The other major difference is in handling of whitespace and other non-base64 characters. A surprising number of implementations, including Python, allow arbitrary characters in the input, and silently ignore them. That's a problem if you get the alphabet wrong - for example, in Python `base64.standard_b64decode(base64.urlsafe_b64encode(b'\xFF\xFE\xFD\xFC'))` will silently give you the wrong output, rather than an error. Ouch!

Another fun fact is that Ruby's base64 encoder will put linebreaks every 60 characters, which is a wild choice because no standard encoding requires lines that short except PEM, but PEM requires _exactly_ 64 characters per line.

I have a writeup of some of the differences among programming languages and some JavaScript libraries here [1], because I'm working on getting a better base64 added to JS [2].

[1] https://gist.github.com/bakkot/16cae276209da91b652c2cb3f612a...

[2] https://github.com/tc39/proposal-arraybuffer-base64




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: