Base64 is very bizarre in general. Why did they use such a weird pattern of symbols instead of a contiguous section, or at least segments ordered from low->high (on that note, ASCII is also quite strange, I'm guessing due to some backwards compatibility idiocy that seemed like it made sense at some point (or maybe changing case was super important to a lot of workloads or something, making a compelling reason to fuck over the future in favor of optimisation now))?
The original specification is in RFC 989 [0] from 1987, called “Printable Encoding”, where it explains “The bits resulting from the encryption operation are encoded into characters which are universally representable at all sites, though not necessarily with the same bit patterns […] each group of 6 bits is used as an index into an array of 64 printable characters; the character referenced by the index is placed in the output string. These characters, identified in Table 1, are selected so as to be universally representable, and the set excludes characters with particular significance to SMTP (e.g., ".", "<CR>", "<LF>").”
Using the array-indexing method, the noncontiguity of the characters doesn’t matter, and the processing is also independent of the character encoding (e.g. works exactly the same way in EBCDIC).
This subset has the important property that it is represented identically in all versions of ISO 646, including US-ASCII, and all characters in the subset are also represented identically in all versions of EBCDIC. Other popular encodings, such as the encoding used by the uuencode utility, Macintosh binhex 4.0 [RFC-1741], and the base85 encoding specified as part of Level 2 PostScript, do not share these properties, and thus do not fulfill the portability requirements a binary transport encoding for mail must meet.
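To make the table-indexing point concrete, encoding one 3-byte group looks roughly like this (a minimal C sketch; the function and variable names are mine, and padding/error handling are omitted):

#include <stdio.h>

/* The 64-entry table from the RFC. */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode exactly one 3-byte group into 4 output characters by using
   each 6-bit value as an index into the 64-entry table. */
static void encode_group(const unsigned char in[3], char out[4]) {
    out[0] = b64[in[0] >> 2];
    out[1] = b64[((in[0] & 0x03) << 4) | (in[1] >> 4)];
    out[2] = b64[((in[1] & 0x0F) << 2) | (in[2] >> 6)];
    out[3] = b64[in[2] & 0x3F];
}

int main(void) {
    const unsigned char msg[3] = {'M', 'a', 'n'};
    char out[5] = {0};
    encode_group(msg, out);
    printf("%s\n", out); /* "TWFu" */
    return 0;
}

Because every output byte comes out of the table, the code never cares where 'A' or '/' happen to live in the local character set.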
If you want to learn why ASCII is the way it is, try "The Evolution of Character Codes, 1874-1968" at https://archive.org/details/enf-ascii/mode/2up by Eric Fischer (an HN'er). My reading is that the contiguous A-Z was meant for better compatibility with 6-bit use.
Yes, though in principle you could interleave AaBbCc and so on, which would also be a single-bit difference, and the naive collation would be more like what people expect.
> A6.4 It is expected that devices having the capability of printing only 64 graphic symbols will continue to be important. It may be desirable to arrange these devices to print one symbol for the bit pattern of both upper and lower case of a given alphabetic letter. To facilitate this, there should be a single-bit difference between the upper and lower case representations of any given letter. Combined with the requirement that a given case of the alphabet be contiguous, this dictated the assignment of the alphabet, as shown in columns 4 through 7.
> This is reflected in the set I proposed to X3 on 1961 September 18 (Table 3, column 3), and these three characters remained in the set from that time on. The lower case alphabet was also shown, but for some time this was resisted, lest the communications people find a need for more than the two columns then allocated for control functions.
> At the 1963 May meeting in Geneva, CCITT endorsed the principle of the 7-bit code for any new telegraph alphabet, and expressed general but preliminary agreement with the ISO work. It further requested the placement of the lower case alphabet in the unassigned area.
> I had a great opportunity to start on the standards road when invited by Dr. Werner Buchholz to do the main design of the 120-character set [9,24] for the Stretch computer (the IBM 7030). I had help, but the mistakes are all mine (such as the interspersal of the upper and lower case alphabets). ...
> he didn't make the same mistake I made for STRETCH by interspersing both cases of the alphabet!
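The A6.4 property is also easy to check mechanically; a throwaway C snippet (the code is mine, just verifying the single-bit difference):

#include <assert.h>
#include <stdio.h>

int main(void) {
    /* Each letter differs from its other-case twin only in bit 5 (0x20),
       which is exactly the property A6.4 asks for. */
    for (int c = 'A'; c <= 'Z'; c++)
        assert((c ^ 0x20) == c + ('a' - 'A'));
    printf("'A' = 0x%02X, 'a' = 0x%02X, difference = 0x%02X\n",
           (unsigned)'A', (unsigned)'a', (unsigned)('A' ^ 'a'));
    return 0;
}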
Base64 and ASCII both made perfect sense in terms of their requirements, and the future, while not fully anticipated at the time, is doing just fine, with ASCII being now incorporated into largely future-proof UTF-8.
Considerably stranger in regard to contiguity was EBCDIC, but it too made sense in terms of its technological requirements, which centered around Hollerith punch cards. https://en.wikipedia.org/wiki/EBCDIC
There are numerous other examples where a lack of knowledge of the technological landscape of the past leads some people to project unwarranted assumptions of incompetence onto the engineers who lived under those constraints.
(Hmmm ... perhaps I should have read this person's profile before commenting.)
P.S. He absolutely did attack the competence of past engineers. And "questioning" backwards compatibility with ASCII is even worse ... there was no point in time when such a conversion would not have been an insurmountable barrier.
And the performance claims are absurd, e.g.,
"A simple and extremely common int->hex string conversion takes twice as many instructions as it would if ASCII was optimized for computability."
WHICH conversion, uppercase hex or lowercase hex? You can't have both. And it's ridiculous to think that the character set encoding should have been optimized for either one or that it would have made a measurable net difference if it had been. And instruction counts don't determine speed on modern hardware. And if this were such a big deal, the conversion could be microcoded. But it's not--there's no critical path with significant amounts of binary to ASCII hex conversion.
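(For reference, the conversion in question looks something like the sketch below; uppercase output and the names are arbitrary choices of mine. The adjustment for values 10..15 exists only because 'A' does not immediately follow '9' in ASCII, and whether that branch, or the usual branchless/table variants of it, shows up anywhere that matters is exactly what's in dispute.)

#include <stdio.h>

/* Convert one nibble (0..15) to an uppercase hex digit. */
static char nibble_to_hex(unsigned n) {
    return (n < 10) ? ('0' + n) : ('A' + (n - 10));
}

int main(void) {
    unsigned x = 0xC0FFEE;
    char buf[9];
    for (int i = 0; i < 8; i++)
        buf[i] = nibble_to_hex((x >> (28 - 4 * i)) & 0xF);
    buf[8] = '\0';
    printf("%s\n", buf); /* "00C0FFEE" */
    return 0;
}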
"There are also inconsistencies like front and back braces/(angle)brackets/parens not being convertible like the alphabet is."
That is not a usable conversion. Anyone who has actually written parsers knows that the encodings of these characters are not relevant ... nothing would have been saved in parsing "loops". Notably, programming language parsers consume tokens produced by the lexer, and the lexer processes each punctuation character separately. Anything that could be gained by grouping punctuation encodings can be done via the lexer's mapping from ASCII to token values. (I have actually done this to reduce the size of bit masks that determine whether any member of a set of tokens has been encountered. I've even, in my weaker moments, hacked the encodings so that <>, {}, [], and () are paired--but this is pointless premature optimization.)
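A rough sketch of that lexer-side mapping (the token names and codes here are made up): the grouping lives in the lexer's switch or table rather than in the character encoding, and "is this token in the set?" becomes a single mask test.

#include <stdint.h>
#include <stdio.h>

enum tok { TOK_LPAREN, TOK_RPAREN, TOK_LBRACKET, TOK_RBRACKET,
           TOK_LBRACE, TOK_RBRACE, TOK_OTHER };

/* The lexer assigns whatever small, dense codes it likes. */
static enum tok classify(int c) {
    switch (c) {
    case '(': return TOK_LPAREN;
    case ')': return TOK_RPAREN;
    case '[': return TOK_LBRACKET;
    case ']': return TOK_RBRACKET;
    case '{': return TOK_LBRACE;
    case '}': return TOK_RBRACE;
    default:  return TOK_OTHER;
    }
}

/* Membership in a set of tokens is one shift and one AND. */
static int in_set(enum tok t, uint64_t set) {
    return (int)((set >> t) & 1);
}

int main(void) {
    uint64_t openers = (1u << TOK_LPAREN) | (1u << TOK_LBRACKET) | (1u << TOK_LBRACE);
    printf("%d %d\n", in_set(classify('['), openers),  /* 1 */
                      in_set(classify(')'), openers)); /* 0 */
    return 0;
}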
Show me a quote. Where did I attack the competence of past engineers? Quote it for me or please just stop lying. I never attacked anyone. I even (somewhat obliquely) referred to several reasons they may have had to make decisions that confound me. Are you mad that I think backwards compatibility is a poor decision? That's not an attack against any engineers, it's just a matter of opinion. Your weird passive-aggressive behavior is just baffling here.
Here is a quote: "that seemed like it made sense at some point (or maybe changing case was super important to a lot of workloads or something, making a compelling reason to fuck over the future in favor of optimisation now))?"
You used "that seemed like it made sense" when you could have written "that made sense." The additional "seemed like" implies the past engineers were unable to see something they should have.
You used "fuck over the future in favor of optimisation now" implying the engineers were overly short-sighted or used poor judgement when balancing the diverse needs of an interchange code.
Hindsight is 20/20. Something that seemed like a good decision at the time may have been a good decision for the time, but not necessarily a great decision half a century later. That has nothing to do with engineering competency, only fortune telling competency.
I get that people here don't like profanity, but I don't see any slight in describing engineering decisions like optimizing for common workloads today over hypothetical loads tomorrow as 'fucking over the future'. Slightly hyperbolic, sure, but it's one of the most common decisions made in designing systems, and it commonly causes lots of issues down the line. I don't see where saying something is a mistake that looks obvious in retrospect is a slight. Most things look obvious in retrospect.
Again, "seemed like it made sense" expresses doubt, in the way that "it seems safe" expressed doubt that it actually is safe.
If you really mean what you're saying now, there was no reason to add "seemed like it" in your earlier text.
> I don't see any slight
You can see things however you want. The trick is to make others understand the difference between what you say and the utterances of an ignorant blowhard, "full of sound and fury, signifying nothing."
You don't seem to understand the historical context, your issues don't make sense, your improvements seem pointless at best, and you have very firm and hyperbolic viewpoints. That does not come across as 20/20 hindsight.
P.S. I'm not the one lying here. Not only are there lies, strawmen, and all sorts of projection, but my substantive points are ignored.
"some backwards compatibility idiocy that seemed like it made sense at some point"
That is obviously an attack on their judgment.
"a compelling reason to fuck over the future in favor of optimisation now"
Talk about passive-aggressive! Of course the person who wrote this does not think that there was any such "compelling reason", which leaves us with the extremely hostile accusation.
And as I've noted, the arguments that these decisions were idiotic or effed over the future are simply incorrect.
What is your preferred system? How does it affect other needs, like collation, or testing if something is upper-case vs. lower-case, or ease of supporting case-insensitivity?
In the following, the test goes from two assembly instructions to three:
int is_letter(char c) {
    c |= 0x20; // normalize to lowercase
    return ('a' <= c) && (c <= 'z');
}
Yes, that's 50% more assembly, to add a single bit-wise or, when testing a single character.
But, seriously, when is this useful? English words include an apostrophe, names like the English author Brontë use diacritics, and æ is still (rarely) used, like in the "Endowed Chair for Orthopædic Investigation" at https://orthop.washington.edu/research/ourlabs/collagen/peop... .
And when testing multiple characters at a time, there are clever optimizations like those used in UlongToHexString. SIMD within a register (SWAR) is quite powerful, e.g., 8 characters could be or'ed at once in 64 bits, and of course the CPU can do a lot of work to pipeline things, so 50% more single-clock-tick instructions does not mean 50% more work.
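A minimal SWAR illustration (my own toy example, assuming all eight bytes are ASCII letters; real code needs a mask so non-letters are left alone):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[9] = "HELLOFOO";
    uint64_t v;
    memcpy(&v, buf, 8);          /* load 8 bytes into one register */
    v |= 0x2020202020202020ULL;  /* lowercase all 8 at once */
    memcpy(buf, &v, 8);
    printf("%s\n", buf);         /* "hellofoo" */
    return 0;
}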
> like front and back braces/(angle)brackets/parens not being convertible
I have never needed that operation. Why do you need it?
Usually when I find a "(" I know I need a ")", and if I also allow a "[" then I need an if-statement anyway since A(8) and A[8] are different things, and both paths implicitly know what to expect.
> and saved a few instructions in common parsing loops.
Parsing needs to know what specific character comes next, and parsers are very rarely limited to only those characters. The ones I've looked at use a DFA, e.g., via a switch statement or lookup table.
I can't figure out what advantage there is to that ordering, that is, I can't see why there would be any overall savings.
Especially in a language like C++ with > and >> and >>= and A<B<int>> and -> where only some of them are balanced.
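A hypothetical hand-rolled fragment for '>' shows why: the decision is driven by lookahead at the next characters, not by where '<' and '>' happen to sit numerically (the names below are made up).

#include <stdio.h>

enum tok { TOK_GT, TOK_GE, TOK_SHR, TOK_SHR_ASSIGN };

/* Lex the token starting at a '>' by peeking ahead. */
static enum tok lex_gt(const char *p, int *len) {
    if (p[1] == '=')                { *len = 2; return TOK_GE; }
    if (p[1] == '>' && p[2] == '=') { *len = 3; return TOK_SHR_ASSIGN; }
    if (p[1] == '>')                { *len = 2; return TOK_SHR; }
    *len = 1;
    return TOK_GT;
}

int main(void) {
    int len;
    enum tok t = lex_gt(">>=", &len);
    printf("token %d, length %d\n", t, len); /* token 3, length 3 */
    return 0;
}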
Look at ASCII mapped out with the low four bits across and the high three bits down and the logic may suddenly snap into place. Also remember that it was implemented by mechanical printing terminals.
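If you don't have a chart handy, a few lines of C will print the printable range in that layout (the printing code is mine; the output is just the standard table):

#include <stdio.h>

int main(void) {
    /* Low four bits across, high three bits down (printable rows 2..7). */
    for (int high = 2; high < 8; high++) {
        for (int low = 0; low < 16; low++) {
            int c = (high << 4) | low;
            printf(" %c ", (c == 0x7F) ? ' ' : (char)c);
        }
        printf("\n");
    }
    return 0;
}

Digits land in one row with their value in the low nibble, and the upper- and lower-case rows line up column for column.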
> I'm guessing due to some backwards compatibility idiocy that seemed like it made sense at some point ...
> ... making a compelling reason to fuck over the future in favor of optimisation now
> I never questioned the competence of past engineers
False just based on your opening volley of toxic spew. Backwards compatibility is an engineering decision and it was made by very competent people to interoperate with a large number of systems. The future has never been fucked over.
You seem to not understand how ASCII is encoded. It is primarily based on bit-groups where the numeric ranges for character groupings can be easily determined using very simple (and fast) bit-wise operations. All of the basic C functions to test single-byte characters such as `isalpha()`, `isdigit()`, `islower()`, `isupper()`, etc. use this fact. You can then optimize these into grouped instructions and pipeline them. Pull up `man ascii` and pay attention to the hex encodings at the start of all the major symbol groups. This is still useful today!
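For illustration, here's one way such tests can be written directly against those group boundaries (a sketch of the idea, not the actual libc source; the unsigned subtract-and-compare idiom turns each range test into a couple of instructions):

#include <stdio.h>

/* Group starts: '0' = 0x30, 'A' = 0x41, 'a' = 0x61. */
static int my_isdigit(unsigned c) { return c - '0' <= 9; }
static int my_isupper(unsigned c) { return c - 'A' <= 25; }
static int my_islower(unsigned c) { return c - 'a' <= 25; }
static int my_isalpha(unsigned c) { return (c | 0x20) - 'a' <= 25; }

int main(void) {
    printf("%d %d %d %d\n",
           my_isdigit('7'), my_isupper('G'),
           my_islower('G'), my_isalpha('z')); /* 1 1 0 1 */
    return 0;
}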
No, the biggest fuckage of the internet age has been Unicode which absolutely destroys this mapping. We no longer have any semblance of a 1:1 translation between any set of input bytes and any other set of character attributes. And this is just required to get simple language idioms correct. The best you can do is use bit-groupings to determine encoding errors (à la UTF-8) or stick with a larger translation table that includes surrogates (UTF-16, UTF-32, etc.). They will all suffer the same "performance" problem called the "real world".
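For what it's worth, the bit-grouping check mentioned above is cheap on its own; classifying a byte by its high bits looks roughly like this (a sketch; validating the continuation bytes and rejecting overlong forms is a separate step):

#include <stdio.h>

/* Expected sequence length from a UTF-8 lead byte, 0 if it is a
   continuation byte (10xxxxxx) or an invalid lead. */
static int utf8_seq_len(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

int main(void) {
    printf("%d %d %d\n", utf8_seq_len('A'),   /* 1 */
                         utf8_seq_len(0xC3),  /* 2: lead byte of e.g. "é" */
                         utf8_seq_len(0xE2)); /* 3: lead byte of e.g. "€" */
    return 0;
}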