I love Unicode, but I'm more and more coming to the conclusion that strings are evil and should be treated as opaque byte arrays, whose only available operation is rendering into a bounded area. I now see any other string operation as code smell.
It's scary how much of our infrastructure relies on strings, given how few guarantees string operations actually give. Take file names, for example. Two visually identical file names may map to different files (because of confusables[1]), two different names may map to the same file (because of normalization[2]), or the ".jpg" at the end may not actually be the extension (because of right-to-left override[3]), not to mention names with newlines or backspaces in them, and inconsistencies between operating systems.
I would go as far as blaming our overreliance on strings for all the injection attacks we see (XSS, SQL, command, etc).
I want to note a separate issue of defensive coding that comes up in the writeup:
> GitHub's forgot password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the reset password link to the email address provided by the attacker
The logical flow is:
1. Get the email address from the forgot-password request.
2. Get the email address from the database for the same account.
3. Check whether they match.
4a. If not, we're under attack -- refuse the request.
4b. If so, all is well -- send a password reset to the email address.
Of course, we have the email address twice -- we asked the user for it during the password reset process (step 1), but we never needed to, because we already had an email address on file for the account; we retrieved that address in step 2. We know that the two addresses are the same, but if you look at the semantics behind the variables, in step 4b we're choosing one of these two "equivalent" options, depending on which variable we use for the email address:
1. Send the password reset to the account owner.
2. Send the password reset to a guy who doesn't know what the password is.
And these have very different risk profiles. Choosing the first option instead of the second would have prevented this attack without needing to worry about unicode case-translation issues. You never want to trust information you just received from an unknown user when you already have the same information from a more authoritative source.
Yes, you're absolutely right. The real bug was sending to the "wrong" matching email. But this is what makes this bug so hard to find. You're looking at "equal" strings, so why should it make a difference if you pick A or A if A === A ? :)
Hard to find, but it's the kind of thing that will hopefully come up in code review. Consider this pseudocode with helpful pseudo-hungarian notation:
    username = request.post_params('username')
    evil_email = request.post_params('email_address')
    user = get_user_by_name(username)
    good_email = user.email_address
    if good_email != evil_email:
        pass  # Hackers!
    else:
        reset_password(user.id, evil_email)  # it's fine; it's the same as user.email_address
You can see the problem in the call to reset_password. Sure, it's the same as user.email_address, but if you just used user.email_address you'd be doing the right thing and you wouldn't need the comment about "don't worry; it's fine".
Then you start to wonder if that line shouldn't look more like
    reset_password(user)
and from a security perspective, it should. (I assume the reason Github allows the code to specify the email address in the first place is that an account may be associated with multiple addresses.)
If it makes no difference which email address you pick, why not pick the one that represents doing something safe instead of the one that represents doing something dangerous?
The problem is that it requires a particularly detailed way of thinking about strings, much like the popular IEEE 754 floating point number question.
If the code were instead written:
    username = request.post_params('username')
    email = request.post_params('email_address')
    user = get_user_by_name(username)
    if emails_are_equal(email, user.email_address):
        reset_password(user.id, email)
Most people would not feel compelled to use user.email_address in the call to reset_password, because they're guaranteed to be equal by the previous line! The vulnerability was introduced when the author of emails_are_equal() wrote it with conversion to uppercase of the whole thing, and failed to realize that this conversion can collide with Unicode domains. They may also have done things like normalize/remove optional.dots+suffixes@gmail.com, "quoted" or (commented) local parts. There's a lot of magic - and a lot of risk - in an emails_are_equal method.
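(To make the failure mode concrete, here's a minimal Python sketch of the kind of case-mapping collision being described; the addresses are made up:)

    # Turkish dotless ı (U+0131) uppercases to a plain ASCII "I" under the
    # default Unicode case mapping, so two distinct addresses collide.
    attacker_supplied = "john@gıthub.com"   # hypothetical attacker-provided address
    stored_in_db = "john@github.com"        # hypothetical address on file

    print(attacker_supplied == stored_in_db)                  # False
    print(attacker_supplied.upper() == stored_in_db.upper())  # True: "a match"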
> They may also have done things like normalize/remove optional.dots+suffixes@gmail.com, "quoted" or (commented) local parts.
At least the first two (removing dots and suffixes), along with conversion to lowercase, are strictly invalid transformations which may result in an email address referring to a different account. Sure, most email providers treat the account name as case-insensitive, and the use of '+' as a label/subaccount separator is a common convention, but neither of these is required. The RFCs say that the account name is case-sensitive (unlike the domain name) and '+' is just an ordinary part of the name; any special significance is assigned by the server. Ignoring '.' characters is something specific to Gmail.
In the context of a search it's not unreasonable to ignore some of these differences, but at that point the matching names aren't "equal", just "similar". Certainly it should never be assumed that it's safe to send email to the transformed version of the address.
> There's a lot of magic - and a lot of risk - in an emails_are_equal method.
My whole point is that changing reset_password(user.id, email) to reset_password(user.id, user.email_address) neutralizes that risk. Then it doesn't matter whether emails_are_equal is risky or not, because a failure in emails_are_equal can only cause you to do a safe thing. What's less embarrassing -- "we accidentally sent a password reset for your account to you", or "we accidentally sent a password reset for your account to someone else"?
So you might not feel compelled to rewrite reset_password(user.id, email) as reset_password(user.id, user.email_address) -- but you should.
If they do have multiple email addresses associated with one account, and they don't want to send a password reset to all of them, then you can see how it would happen.
```
    if evil_email not in good_email_addresses:
        pass  # Hackers!
    else:
        reset_password(user.id, evil_email)  # just reset with provided email. If I thought there was a potential security issue I would have already addressed it.
```
So easy to be lazy at this point, especially under time pressure.
Sorry but that is not correct either. Code blocks on HN are created by using two or more leading spaces on each line.
Triple backticks do nothing, and will be shown on the same line as the next line even if put on a separate line, because they are treated as regular text and become part of the same paragraph.
This is one of those cases where what we want is a tainting system for strings; the underlying issue isn't really an encoding problem. The only language I've seen attempt this is Perl, and even then intermittently.
It should be made as difficult as possible to pass user input directly to something vulnerable like an email-sending API, without first laundering it through "validation". Unfortunately it can be very hard to do good validation, but in a tainting system you would find it easier to use the already-trusted value from the database for the user's email rather than the untrusted one from user input.
(Comparing two email addresses with toupper rather than doing a full RFC2822 comparison is another mistake, although I can see why nobody bothers to do it properly)
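(A toy sketch of what string tainting could look like in Python; the Tainted wrapper and send_reset_email are made-up names, and real taint tracking needs language or framework support, as in Perl:)

    class Tainted:
        """Wraps a value that came from user input and refuses to be used as-is."""
        def __init__(self, value):
            self._value = value

        def __str__(self):
            raise TypeError("tainted value used without validation")

        def validate(self, validator):
            # Hand back the plain value only if it passes an explicit check.
            return validator(self._value)

    def send_reset_email(address):
        print("sending reset to", address)      # stand-in for a real mail API

    user_supplied = Tainted("john@gıthub.com")  # everything from the request starts tainted
    db_address = "john@github.com"              # already-trusted value from the database

    send_reset_email(db_address)                # fine: trusted source
    # send_reset_email(str(user_supplied))      # raises TypeError: validate first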
At least the first two of them are type wrappers for strings - opaque to the type checker, but transparent to the runtime. Rust and Haskell have wrappers like this. Typescript does not.
Keep the internal details private to the module so that application code doesn't concern itself with the details.
The module that sends emails only provides code that accepts a ValidatedEmailAddress and sends it an email, or accepts an UnvalidatedEmailString, records it in the database, sends it an email, and validates it. Next time you load it, you get a ValidatedEmailAddress, as long as it has been validated.
Sensibly, there would be no (public) code to convert the ValidatedEmailAddress to a database id, since you never want application code to run "UPDATE email_address SET address = 'ketchup@tomato.sauce' WHERE id = 5", because the address needs to be validated. Likewise, you never want application code to run "INSERT INTO user_email(email, user) VALUES (5, 6)". (Something like the latter will of course occur, in the code responsible for recording unvalidated emails and retrieving validated emails.)
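(A rough Python sketch of that module boundary; every name here is hypothetical:)

    # email_module.py -- only these two wrapper types ever leave the module.
    _DB = {}   # user_id -> (address, verified); in-memory stand-in for the real table

    class UnvalidatedEmailString:
        def __init__(self, raw):
            self.raw = raw                   # whatever the user typed, kept opaque

    class ValidatedEmailAddress:
        def __init__(self, address):
            self._address = address          # constructed only inside this module

    def record(user_id, raw: UnvalidatedEmailString):
        """Store the unvalidated address and (pretend to) send a verification mail."""
        _DB[user_id] = (raw.raw, False)
        print("verification mail sent to", raw.raw)

    def load(user_id):
        """Hands back a ValidatedEmailAddress only once verification has happened."""
        address, verified = _DB.get(user_id, ("", False))
        return ValidatedEmailAddress(address) if verified else None

    def send_email(to: ValidatedEmailAddress, body):
        print("sending to", to._address, ":", body)   # stand-in for a real mail API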
> At least the first two of them are type wrappers for strings - opaque to the type checker, but transparent to the runtime. Rust and Haskell have wrappers like this. Typescript does not.
You are right that TypeScript aliases don't work like Haskell ones (which are treated as distinct types and type-checked accordingly). In TypeScript you can use "branded types" to work around the looser structural typing:
    type Firstname = string & { readonly brand?: unique symbol }
I'm not sure I fully understand. What do you mean by step 2? If I entered "myemaıl@example.com" into the reset field, are you saying that step 2 would be the process of doing some normalization to try to find a matching account? If I reset a password, don't I only provide an email address by means of doing so? Therefore, doesn't the service merely attempt to match an email to an existing account within the DB?
I believe I understand the rest (the take-away being, however you match A to B, send the reset email to the email address stored in the DB?), just not sure about the flow beforehand.
I reply separately to observe that the flow you describe is bugged in a more obvious way: if you ask only for an email address, and then discover the related account by normalizing that address before doing a database lookup, it's a serious error to then send the reset email (which controls an account you looked up using the normalized address) to the original address. You found the account by looking up a normalized address; the original address isn't even known to be associated with the account.
In that case, there are three options:
1. Send the reset email to the address you pulled from the database. (correct)
2. Send the reset email to the normalized attacker-provided address. (wrong but "probably fine"; this is the bug I was talking about in the first place)
3. Send the reset email to the original, non-normalized attacker-provided address. (wrong and definitely a problem)
This is the flow I envisioned, which matches some large websites, but not necessarily github.
In step zero, you enter "2T1Qka0rEiPr" as the username of the account you want to hack.
In step one, github says "We have m------@e------.com on file for you. Please confirm your email address." and you enter "myemaıl@example.com".
Then github retrieves the email address associated with the username "2T1Qka0rEiPr".
You're correct that you could do this with just an email address and not a username, but that doesn't affect my criticism -- you'd still want to ultimately send the reset to the email address you pulled from the database, not the one you got from the reset request. Your takeaway is exactly right.
Another more secure method is to pull up the account information and display "We have the following methods on file for contacting you: () email 1 (potentially obscured); () email 2 (potentially obscured); () SMS (to a phone number which is potentially obscured)". If I recall correctly, that's how Twitter does it. This bypasses the need to ask the user to type in the address they'd prefer for the reset to be sent to -- you just show them some radio options, and they select the one they want. Since they never provided the address, there's no chance you'll accidentally pick their malicious address over the real address.
"But how do I make sure someone knows the email address, in order to stop strangers from spamming reset emails to addresses they might not even know?" You don't; you apply rate limiting to the reset functionality.
I've always used values I pulled from the DB even when they matched (theoretically) the value I passed into the where clause and never really had a reason why, it just felt better that way. Thanks for validating that.
And this is a perfect example. ASCII has almost a 1-to-1 mapping between screen representation and byte representation. Once you know your font will differentiate between 1, i, L, |, l and 0, O, you still need to know a bit about control characters, and then you are good to go.
Unicode has tons of pages, control characters, diacritics and rendering oddities that make it hard to use like a tool.
I think your approach of treating them as bytes to render is spot on. I consider these strings as something like SVG fragments: you can do text operations on them, but you have to be confident that you know what you are doing.
I'm personally incredibly annoyed by just the idea of "Unicode is hard, let's do ASCII". Most of the world is non-ASCII; it's just annoying and sad to still see systems that fail when people try to use their native languages. UTF-8 should be the default pretty much everywhere. There are quite easy ways to avoid homograph attacks, and those attacks are a poor excuse to discriminate against the non-anglosphere.
Yes, all user-facing text should be Unicode, always. But ASCII has its use cases as well, such as in programming, as the parent comment mentioned. My native language isn't ASCII-compatible, but I am glad to program in ASCII and English. The only things that should not be English in written code are domain-specific terms that do not have an official, unambiguous English translation. That's the only case to be made for non-ASCII characters in programming that I can think of, but I think romanization can take care of it with most languages.
But aside from APL users and a few Haskell coders, who isn't programming in ASCII? I mean, sometimes I chuck an emoji into a comment, but I don't think that really counts.
And how could programming in ASCII have solved this bug? The problem is that the strings were compared without giving any thought to what "are these strings equal" is supposed to mean.
The only general solution to this kind of bug - that is, to considering distinct emails to be identical - is to have a special function that can check whether two emails are identical according to the spec. This function will be different from one determining whether two names are the same in an American database, where dotless and dotful i are the same, and that function will in turn be different from one determining whether two names are the same in a Turkish database, where dotless and dotful i are different. Whatever the reason you're comparing strings for identity, you're probably just after similarity anyway - consider how often we convert strings to lowercase before we compare them.
Generic string equality functions are always a source of bugs, since there's no useful, single, uniformly applicable definition of string equality.
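(As a small illustration of why code-point equality is rarely the equality you want, here's the standard normalization example in Python:)

    import unicodedata

    a = "caf\u00e9"        # "café" with a precomposed é (U+00E9)
    b = "cafe\u0301"       # "café" as e + combining acute accent (U+0301)

    print(a == b)                              # False: different code points
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))     # True: same text after normalization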
>I think romanization can take care of it with most languages
Except for the ones that it really can't. When you try to romanize standard written Chinese, it becomes nigh impossible to read for most speakers; often it would be more confusing than an English translation.
> there are quite easy ways to avoid homograph attacks, those attacks are a poor excuse to discriminate against non-anglosphere.
No. There are attacks of this kind every now and then to this day. Maybe if you're not following infosec they fly under your radar, but getting this right is exceptionally hard. And besides these kinds of attacks, every major OS has had multiple bugs just in the processing of Unicode that could at least be used for DoS attacks.
So saying it's easy to avoid any sort of abuse of Unicode seems quite ridiculous.
Go ahead and support it for messages, display names and whatnot, but for the love of god, limit the login name of users to ASCII. Don't assume that your Python/Go/JavaScript lib for Unicode handles sanitizing and canonicalization properly. It doesn't. And even if it has only a minor bug that doesn't lead to direct issues, the next update of the lib might fix the problem and now you have to deal with the fact that your db might contain data that was processed with the old faulty lib and now gets compared to the properly processed output of the new version. Just don't. Use it as opaque data for displaying, as GP said, but never as an identifier for anything.
"I'm personally incredibly annoyed by just the idea of "Unicode is hard, let's do ASCII", most of the world is non-ASCII, it's just annoying and sad to still see systems that fail when people try to use their native languages."
You are correct that most of the world is non-ASCII. I am enthusiastic about recognizing that diversity, and I am willing to pay certain costs in return for the richness it provides.
However, I will point out that certain systems have been deemed crucial and in need of deliberate (and brutal) dumbing-down. Specifically, I speak of the global Air Traffic Control system that is English only[1].
Tagalog/Flemish/Satsugu is hard. Let's (land airplanes with) English.
The Unicode Technical Standard [1] (different from the Unicode Standard) recommends treating identifiers (filenames, variable or function names, email addresses, usernames, etc.) differently from normal text. There is a special class of 'identifier characters' which already excludes a lot of sneaky characters like invisible punctuation and obscure scripts that are not in modern use.
Additionally, there are six restriction levels for identifiers, depending on your specific situation:
1. ASCII only
2. single script
3. single script or Latin+{Japn, Hanb, Kore}
4. single script or Latin+{any, excluding Cyrillic, Greek}
5. Not containing any characters in the recommended blacklist of characters for use in secure contexts
6. No restrictions other than normal identifier restrictions
For any specific strings that might be confused, there is an algorithm to compute the visual 'skeleton' of a string and match it against that of another string to test if they are confusable.
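(A toy sketch of the skeleton idea in Python; the mapping table is a tiny hand-picked stand-in for Unicode's confusables.txt, not the real data:)

    import unicodedata

    # Minimal stand-in for the confusables.txt prototype mapping.
    CONFUSABLE_PROTOTYPES = {
        "\u0131": "i",   # ı LATIN SMALL LETTER DOTLESS I
        "\u0430": "a",   # а CYRILLIC SMALL LETTER A
        "\u03bf": "o",   # ο GREEK SMALL LETTER OMICRON
    }

    def skeleton(s):
        """Map each character to its visual prototype, then renormalize."""
        decomposed = unicodedata.normalize("NFD", s)
        mapped = "".join(CONFUSABLE_PROTOTYPES.get(c, c) for c in decomposed)
        return unicodedata.normalize("NFD", mapped)

    print(skeleton("myemaıl") == skeleton("myemail"))   # True: confusable
    print("myemaıl" == "myemail")                       # False: not equal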
So in a secure environment I can't write words in IPA? IPA requires mixing Greek beta and theta with Latin letters (but there's a Latin/IPA phi just for fun).
It also doesn't help ɡ vs g.
It sounds like any genuine solution is either not simple or excludes legitimate use cases.
Yes, the Unicode Technical Standard identifies the Latin, Greek, and Cyrillic scripts (which cover most IPA characters) as confusable, and recommends against mixing them in identifiers. Use in general text is fine though.
I want to be able to collaborate with Japanese, Chinese, Indian, Russian, Arab, and Malay programmers. Even if we somehow end up in a dystopia where the whole English-speaking world disappears in a puff of magic, I think I will continue to set aside my native French and communicate with fellow programmers in English ASCII.
I wish more programmers understood the value of having a lingua franca.
UTF-8 should be used everywhere on the user-facing side, but under the hood, you want an unambiguous representation of strings and code. It is easy to avoid homograph attacks if you do have a non-Unicode representation available to you.
I swear software's biggest problem is that developers prioritize their own ergonomics over actually producing functional code. Imagine if auto engineers just decided "safety is hard, let's just make deathtraps". I know the current narrative is that the 737 Max failed because of MBAs, but it's a really big indictment that, out of all the complex and hard-to-engineer systems in an aircraft, it was poor software that caused the crash.
It was a bad corporate ecosystem that caused the crash. The people at fault were the accountants who tweaked lines in Excel until they got the numbers they wanted and then pressed reality into the service of those numbers. Bad software didn't eliminate pilot training on the new systems. Bad software didn't shortcut the required FAA safety certification of the system. Bad software didn't reduce the number of redundant sensors that fed the system.
Bad beancounters did all that. C-levels in suits who got million-dollar bonuses for killing 346 people to gain marginally increased quarterly results. Don't blame the software.
> really big indictment that out of all the complex and hard to engineer systems in an aircraft it was poor software that caused the crash
While I frequently point out how bad we are as an industry at making fault tolerant code, this part is just flat wrong.
The software portion of the 737, while definitely flawed in a catastrophic way, would not have come to be if the aeronautics engineers had done their job and designed a flight-worthy plane without software hacks. Not to imply the aeronautics guys are the root cause either, though; the 737 Max fiasco is a top-to-bottom, complete failure of Boeing as a whole, and virtually every department involved in the Max has a significant share of the blame.
The longer I program, the more I am convinced that falling back to ASCII in 90% of (non-embedded) use cases is just a programmer avoiding the extra brainwork of dealing with encodings.
That is why I like the Rust approach: make these issues front and center, implement clear solutions for common use cases, and enforce them. In too many languages encodings feel like an afterthought rather than something that has been considered from the start. When I started using Rust I learned, for example, that the strings an OS uses in its filenames are not necessarily valid UTF-8. Rust forces you to handle this explicitly. Strings are complex, and hiding this can be dangerous.
The issue I see with Rust is that they (partially) repeated the mistake older operating systems made: they assumed that the Unicode of that time would be Unicode forever.
While UTF8 is just an encoding, a Rust UTF8 "String" is actually a UTF8-encoded Unicode 11.0 string. Or version 12.0, possibly 12.1. Maybe 13 soon! Who knows. A moving target, certainly.
If you're "agile", you can just recompile and you will be fine, the next Rust version will surely support any changes in the Unicode standard.
But once a Rust program or crate is no longer actively maintained, many small assumptions will be baked into its Unicode handling at a much lower level than, say, the typical Windows C++ program that uses the OS-provided dynamic libraries for string processing.
Who knows, maybe the consortium will never introduce breaking changes, but I suspect that in a decade or two people will be cursing Rust's too-strong integration with the Unicode of the 2010s...
The Unicode consortium has guaranteed that there are many properties they won't change once they're released. Even if the property value is objectively wrong.
More to the point: the standard library of Rust relies on no properties of Unicode that will change--there's no builtin normalization, case folding, grapheme cluster support, etc. So there's no data tables in the standard library that need to be updated. The only assumption Rust makes of Unicode is that no assigned codepoints will be changed, and that the space will not grow beyond the current limit of U+10FFFF.
I was thinking that. Unicode is great for presenting text to users but for file names, email addresses, code and the like ASCII has a lot going for it.
It's not just Unicode weirdness, it's the whole concept of string operations. Examples:
1) Naive CSV libraries that break when given a field that itself contains a comma. The same goes for injections. (See the sketch after this list.)
2) Indexing things by strings (file names, user names, etc) relies on humans typing and reading with 100% accuracy.
3) Control characters are still present. Try to generate a file name containing every ASCII character (from 0x01 to 0x7F), then try to delete it in different ways. It's really frustrating.
4) Even string templates can cause problems. MMOs used to have game masters identified with "[GM] Character name", until scammers started using the same pattern. The sin here was to concatenate "[GM] " with the character name, which is spoofable, instead of using a badge or a different color.
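(For 1), a quick Python illustration of the difference between naive joining/splitting and a real CSV writer; the sample row is arbitrary:)

    import csv, io

    row = ["Doe, John", "john@example.com"]

    naive = ",".join(row)
    print(naive.split(","))          # ['Doe', ' John', 'john@example.com'] -- three fields now

    buf = io.StringIO()
    csv.writer(buf).writerow(row)    # proper quoting: "Doe, John",john@example.com
    print(buf.getvalue().strip())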
For 3), that's not the case on Windows filesystems as these characters are not allowed in file and folder names.
On Linux, in addition to control characters, the shell interprets the `-` character specially. This means that if a filename starts with `-` you may have surprising results even if the filename is quoted.
> On Linux, in addition to control characters, the shell interprets the `-` character specially.
It's not the shell that interprets the leading `-` character specially, but rather the program receiving the string. The convention of marking the end of option processing with `--` helps, for programs that support it; you can also prefix relative paths with `./`.
And how's that supposed to work? Every country has developed some form of transliteration to the Latin alphabet. It's simple enough that anybody can memorize these additional 26 letters. A form of least common denominator. Anyone on this planet can scribble down their email address on a piece of paper and equally type one on their computer. Now imagine you'd write down a Chinese email address with characters. How am I supposed to type it? How's the Chinese person going to type in the Arab's address? If I receive a mail from them, how can I confirm it's really them from the address? I can't read Chinese! But the Arab and the Chinese person can read the Latin alphabet.
Case in point: we have had punycode for over a decade now. China has its own native TLD. Still, pretty much every Chinese website out there gets created under the cn TLD or even com, and uses Latin characters only. How come?
Sorry, this is wrong. They mainly use the US keyboard layout for input, typing in pinyin (in mainland China; Taiwan etc. are indeed different). Every person using a computer not only has to know the Latin alphabet, but also has to know how the characters they want to input are transcribed into pinyin. Even on the smartphone, where thanks to the touch screen you could easily come up with new flexible input methods, they simply ported that concept over.
This article doesn't appear well informed, or tries to make the Chinese look stupid. On a similar note there is the common belief that Chinese people, having simplified characters for well over 50 years now, couldn't read the traditional characters still in use in Taiwan and Hong Kong. They are still used for artistic reasons and in calligraphy, which a lot of Chinese people have practiced at some point in their lives. Based on my very limited sample size, I'd argue that a Taiwanese person has more issues reading the simplified characters than vice versa, but then again it's not like all the characters look totally different, so given at least minimal context it shouldn't be too hard going either direction. (end OT rant)
I once saw a Chinese girl texting at a bus stop. She used an ordinary US-looking on-screen keyboard, but Chinese characters were appearing in the message window as she typed.
Yeah, my wife uses this input method too, as I do if writing Mandarin. It's basically typing the pinyin for the characters. However, nearly everyone else in her family uses the type of input where you type the 'strokes' of the characters to filter down to the character you want to input.
There's a certain value to say Arabs and Chinese being able to email customerservice@wherever.com rather than having to try to decipher خدمة الزبائن@ and 客戶服務@ in each others language.
Yes, and good luck using a search engine to find
月山@何でも.jp
or
الله@gmail.com
By the way, the first one is not a mail address as @ and @ are different characters. But actually it may be because @ is the @ in the Japanese character set. So it probably should be accepted right? How about if I use the Japanese @ in a korean address? Is that valid? Or a likely attack?
Also, in the Arabic name, the first character of the string is the one closer to the @ (it reads right to left); be sure to take that into account in a search interface.
I think it is totally utopian to think that a programmer can know all the subtleties of all the writing systems in the world. I agree that this is a problem that needs to be solved, but users and UX designers should in no way underestimate the magnitude of additional work it causes.
Often multiple, which makes the suggestion rather less suitable.
Take a Japanese name like 麻生太郎. He may find his name romanized (ignoring the issue of surname/given name order) as Tarō Asō (preferable), Taro Aso (usually), Taroh Asoh (occurs on passports), Tarou Asou (‘I can't figure out diacritics’-style), Taroo Asoo, or even Tarô Asô.
How do you do the rendering? Who (process-wise) is responsible for converting bytes to pixels? How do users on social media put in their name? How is it stored? How do users get urls with specific usernames?
Now take all that and multiply by the complexity of world languages, many of which don't even map to one glyph == one morpheme. The ol' Apple Messages crash bug was due to some Arabic text not being monotonic in rendering width vs. string length.
I think we could have skipped UTF-8 and just gone to 4-byte runes. But even then, that would not have avoided the above bug.
The problem is now deeply entrenched, so I don't have perfect answers. But here are my guesses:
> Who (process-wise) is responsible for converting bytes to pixels?
The operating system, with minor exceptions (word processors, for example). Rendering logic is too complicated to be embedded into every application. And you get better accessibility and consistency.
> How do users on social media put in their name? How is it stored?
Keyboards, or their preferred input methods, and stored in UTF-8. I'm not saying to get rid of all strings, just don't use it for infrastructure.
> How do users get urls with specific usernames?
You don't, because that's how you get little Bobby FRACTION-SLASH. Also, if you have a valuable namespace like URLs, people will hack each other to get valuable names, but that's only tangentially related.
> Now take all that and multiply by the complexity of world languages, many which don't even map to one glyph == one morpheme. The ol' apple message crash bug was due to the property of some Arabic not being monotonic in rendering space vs string length.
That's exactly my point! You get this multiplied complexity when people try to peek into string contents instead of treating them like black boxes. Stop with the dangerous string operations and you now support usernames with zalgo-ed hieroglyphs if that's what users want.
> I think we could have skipped utf8 and just gone to 4byte runes. But even then, that would not have avoided the above bug.
> Utf16 is a hot mess though, worst of all worlds.
Agreed with UTF-16. I like UTF-8, and I honestly think it solved our encoding problems for non-legacy applications. Everyone should be using it, as long as the contents are for human consumption only.
I think I'll disagree, and I'm someone with a native Asian language. A character doesn't mean anything in Asian languages, and any attempt to use a fixed-length encoding is pointless. The concepts you want instead are either a code point or a glyph. The concept of a code point is useful as a part (and not the whole) of Unicode validating, encoding, and decoding. A glyph is useful mostly in rendering engines (i.e. WebKit internals and UI rendering frameworks). Okay, maybe they are also kind of useful for sorting and collating. But fixed-width character encodings are almost never useful, and they invite programmers to make assumptions about how strings can be sliced.
Most server-side applications should never have to know what these concepts even are. Or any library that is not user-facing. They get bytes from the UI layer, and they can keep them as opaque bytes. For user-facing apps, you can ask your renderer library for a pixel-width or similar for a string, and let them handle how to parse it. Very little code ever needs to know about unicode.
Any kind of input-sanitization is vastly simplified in utf8, and that makes it worth it for me. For me the really troubling trends are conventions like Rust Utf8Error, where they can cause what I'd consider a UI-related exception in code that had no business even interpreting what those bytes are. Unfortunately, every API uses strings, so they are kind of hard to avoid. It introduces what I'd consider a software layering problem.
Maybe others here with more experience with internationalization can chime in and tell me I'm wrong.
Also, UTF-16 is not fixed width in modern usage. As soon as someone uses an emoji, boom!, surrogate pairs. So unless UTF-32 is used at a glorious four bytes per character, you won't get fixed width.
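(A quick Python check of the point, counting code points vs. UTF-16 code units vs. UTF-8 bytes for a single emoji:)

    s = "\U0001F44D"                         # 👍 U+1F44D, outside the BMP
    print(len(s))                            # 1 code point
    print(len(s.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)
    print(len(s.encode("utf-8")))            # 4 UTF-8 bytes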
> For me the really troubling trends are conventions like Rust Utf8Error,[…]
Interesting. Isn't that only returned when the input bytes contain a non-UTF-8 byte sequence? How is Rust's approach different from other languages?
> Interesting. Isn't that only returned when the input bytes contain a non-UTF-8 byte sequence?
As a Rust user: this is what your function returns if the user inputs non-UTF-8 bytes into something that expects UTF-8 bytes and the programmer explicitly chooses not to handle that error.
I don't see anything wrong with that. Sometimes you might be interested in receiving, processing, and storing valid UTF-8 strings rather than arbitrary byte sequences that may or may not be translatable back into something valid that you can display.
I always hated doing encoding-related work with a passion before I started using Rust. Rust forced me to do it the right way and actually understand why I am doing it a certain way. When it comes to encoding I feel safer in Rust than, e.g., in Python, despite having used the latter four times as long.
Using any UTF encoding as fixed-width is almost always a mistake. There's no concept of "characters" in Unicode or the UTF encodings, they use codepoints, and those are only very rarely useful--you cannot treat them as "characters".
This is a common misconception with UTF encodings.
> A character doesn't mean anything in Asian languages, and any attempt to use a fixed-length encoding is pointless.
Totally agree, indexing into a string is bad practice, no matter the encoding (because you probably can't ever guarantee what the encoding will be, or how it was converted, etc.) This is true for UTF-8 and UTF-16, and really any encoding because again, you can't be sure what you're dealing with. Code points, "characters", glyphs, etc. are all concepts that work at different parts of the stack (well, characters doesn't), which again is true for all encodings.
The advantage UTF-16 has for languages that tend to be multibyte is that its representation takes up less space. Other than that, it has all the same disadvantages any other encoding has.
> Any kind of input-sanitization is vastly simplified in utf8, and that makes it worth it for me. For me the really troubling trends are conventions like Rust Utf8Error, where they can cause what I'd consider a UI-related exception in code that had no business even interpreting what those bytes are. Unfortunately, every API uses strings, so they are kind of hard to avoid. It introduces what I'd consider a software layering problem.
This is the problem right here. The "UTF-8 the world" initiative ignores that UTF-16 is a lot more practical for most people, and as a result you get platforms expecting UTF-8 that really have no business doing so.
Space-wise, yes, but the size of our user-entered strings is rarely a concern. On the other hand, UTF-16 is often mixed with UCS-2, which is not really Unicode, and lulls developers into a false sense of security by almost never needing two code units for a given code point.
It's a partial fix that makes bugs harder to catch. There's also the BOM issue, and not being ASCII compatible...
Space is always a concern, and encodings should largely be transparent to your application unless you're writing something like a rendering engine.
Consider things like databases (relational/document/key-value/etc.), JSON payloads, compression algorithms, battery life, etc. etc.
There is a problem with lots of developers assuming UTF-16 is fixed-width (just look at this thread), but that's a mistake for any UTF encoding, and it's almost always a mistake in any encoding--you shouldn't be indexing into strings.
People say, "just use UTF-8, you never have to worry about it and it's ASCII compatible", but you shouldn't be using the ASCII compatibility (arguably this is an anti-feature) and you shouldn't have to worry about the details like BOM because you're using your language's built-in string support or a library right?
Where there is significant markup, UTF-16 encoding is rarely more space-efficient than UTF-8. When compression is involved, there is rarely a difference.
This doesn't address my points: that where there is markup, the additional space required by the ASCII markup in UTF-16 may very well offset the space UTF-16 saves over UTF-8 on the Asian text; and also that with compression, these size arguments become negligible for any sufficiently long document written in one language.
I conceded that with "UTF-16 may not be great". I'm responding to the idea that UTF-8 is such a good solution to the problem that everyone should just use UTF-8. It isn't.
For writing globally appropriate software, UTF-8 is the best option that exists in practice.
Note that “Asian” is quite a gamut in terms of UTF-8 size: Chinese is at the very compact end of the spectrum and the Burmese script at the other end: https://hsivonen.fi/string-length/#counts
I'm a little skeptical about that table though, because as you point out, there's not a great way of figuring "meaning per character". For some more (really flawed) comparison, the English version of The Tale of Genji is ~60k words and 224 pages, and the Chinese version is ~75k words and 300 pages. [1] [2]
So yeah, I'm skeptical about the claim that some languages have more meaning per character, and as a result you'll end up storing less text overall. I think it'd be cool to look at more data about this though, like (for example) stats from Treasure Data [3].
But regardless, it's pretty indisputable that UTF-16 is a lot better space-wise for languages that tend to be multibyte in UTF-8. Mostly what I'm trying (Quixotically) to say is "UTF-8 the world" ignores a lot of the world.
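(For a rough sense of the sizes being discussed, a quick Python check; 源氏物語 is just The Tale of Genji's title used as a sample:)

    text = "源氏物語"                       # 4 CJK characters
    print(len(text.encode("utf-8")))       # 12 bytes: 3 bytes per character
    print(len(text.encode("utf-16-le")))   # 8 bytes: 2 bytes per character
    print(len("Genji".encode("utf-8")))    # 5 bytes: ASCII stays 1 byte per character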
> the Chinese version is ~75k words and 300 pages. [1] [2]
Note that link [2] says "guess [of # of words] based on page count". 75k words, 300 pages is one statistic, not two separate statistics. (Estimating words based on page count is likely to be highly reliable, but still.)
> I'm skeptical about the claim that some languages have more meaning per character, and as a result you'll end up storing less text overall.
Your skepticism is unwarranted. From a pure information-theoretic perspective, the claim that some languages have more meaning per character is a slam dunk, and less than a second of examination proves it conclusively.
For example, an entire English novel is unlikely to use more than 256 unique characters. A Chinese novel couldn't use anywhere near that few without sounding incredibly artificial.
Here are some single-character words in modern Mandarin:
高 - high/tall
低 - low
大 - big
小 - small
最 - most (superlative marker)
更 - more (comparative marker)
到 - arrive
走 - leave (go away)
玩 - play (e.g. a game)
看 - look
听 - listen
贵 - expensive
爱 - love (verb)
恨 - hate (verb)
龍 - dragon
For something more representative, here are some song lyrics -- I'll enclose every word of multiple characters. Unenclosed stretches of text are words of one character each:
How many distinct one-character words do you think English could practically support? How many twos? In this verse-and-a-half, there's one word that's always[1] three characters and two that have reached three characters by picking up a verb suffix, for a total of three words that are longer than two characters. (鼓起勇气 is kind of a special case, in that it's a well-known fixed expression, but its meaning is transparent as an ordinary combination of the two words 鼓起 and 勇气, which also see use outside the expression.)
[1] Actually, 什么 ["what"] has the vernacular contraction 啥, and this also applies to 为什么 ["why"], so you could argue that the word is sometimes just two characters.
Ugh, I was trying to get a good comparison of translations, but too bad I guess.
Yeah I mean, there's no question ideographic languages have more meaning per character than alphabetic languages. But other comparisons and considerations aren't as obvious:
- how do ideographic languages compare with each other?
- do people using ideographic languages write more?
- are ideographic languages as effective at compression as general (or special) compression algorithms?
Moving up the conceptual ladder from "average bytes per codepoint" to "average size of encoded tweet" (for example) is a big leap is all I'm saying.
> are ideographic languages as effective at compression as general (or special) compression algorithms?
That one we know; the compression algorithms are more compressive. For example, compressed Chinese text takes up less space than the same text uncompressed. Ideographic languages are still languages that real humans have to use, and they feature redundancy because that helps everyone. Compression algorithms have the luxury of stripping that redundancy out.
> how do ideographic languages compare with each other?
Of modern languages, only Chinese and Japanese could really be described as ideographic. (Japanese much more so than Chinese, in fact.) Chinese will have more meaning per character, because Japanese makes heavy use of the comparatively less meaningful kana. (Interestingly... it has to do this precisely because of its more ideographic nature.)
> do people using ideographic languages write more?
A related question is whether people using ideographic languages write electronically when they do write, and I suspect yes.
Language Log has several articles on the phenomenon of Chinese speakers seeming to forget how to write characters due to IT input methods. https://languagelog.ldc.upenn.edu/nll/?p=7142
Handwriting isn't exactly uncommon in China. Nobody's forgetting how to write characters they write all the time, and if they do need to write a character they can't quite remember, they can squiggle something and rely on the reader to understand it.
Handwriting is actually a moderate-level problem for me -- I can't recognize most handwritten characters. Often a special handwriting form will be used.
> For writing globally appropriate software, UTF-8 is the best option that exists in practice.
OK, I lied. China doesn't use GB2312 anymore. They use its update, GB18030, which carries arbitrary Unicode data just as UTF-8 does... except that, like GB2312, it puts the Asian characters in two bytes and the European characters in three.
It's certainly not obvious to me that UTF-8 is better for globally appropriate software. It looks worse. Software for Europeans, sure.
GB18030 has no 3-byte sequences. Only 1, 2, and 4.
There are more important qualities than optimizing byte length. Processing GB18030 as Unicode scalar values involves lookup tables. A single byte error cascades potentially further than in UTF-8.
If optimizing Chinese byte length is really important for you, UTF-16 is easier to work with than GB18030.
Yet the higher information density of Chinese & Japanese means the total file size tends to be about the same. Korean gets a bit short-changed, and some of the Indian subcontinent scripts have low density (like Latin) but high per-character size (grouped in with the Asian characters) and lose out.
There's a lot of text out there that isn't markup, or where markup is a sparse minority. Consider anything binary, ebooks, PDFs, various document formats, almost anything in a database, almost all messages sent via messaging apps, almost all emails, any JSON, anything in a protobuf, etc. etc. The vast majority of "text" in the world isn't in HTML/XML.
> When compression is involved, there is rarely a difference.
Rare is the situation where smaller input to a compression algorithm leads to larger output, and UTF-16 is smaller in lots of languages. It might rarely make a difference to you, but that's not the same as rarely making a difference.
Mr. dot is evil is the first thing we taught new engineers at Facebook during the security engineering onboarding.
There's a whole slide deck (filled with real code snippets) with examples where a string was used instead of a better-suited object representation and led to a security flaw.
We eventually built XHP/JSX to get rid of strings holding HTML data, but that was just scratching the surface of bugs caused by user-supplied strings.
Github's mistake in the parent article wasn't overlooking a character casing collision -- it was sending password reset emails to the email provided by the resetter rather than the saved email for that user.
The problem is that your method works well if the string is either:
- Only meant to be used in a machine-to-machine interface (like a JSON key for instance).
- Only meant to be used in a machine-to-human interface (like the text letting you know that you used a wrong password).
For strings that have a special significance to both humans and machines you can still run into problems, mainly because of the many unicode strings that look similar (or sometimes even identical) visually but are encoded with a different byte sequence. Take usernames or URLs for instance.
But in the end Unicode is messy because human languages are messy. As programmers we like well-bounded problems with elegant generic solutions. This clearly isn't possible here and we have to deal with it.
I think the parent is saying that unless you're ready to open up the can of worms of correctly handling text processing, you should treat all strings as opaque arrays of bytes, do nothing except pass them around your app verbatim, and limit all string transformations to a small, thoroughly inspected and tested library of code.
There are exceptions, of course. Fuzzy search is one of them. Image editing software and programming language parsers are another two.
And what's the problem of letting the user type a URL? You take the resulting string (given to you by the OS subsystem responsible for keyboard input), and stuff that into your favorite HTTP library.
Don't get me wrong, strings are absolutely necessary. I'm not suggesting we switch to pictograms. But string processing should be treated like the dangerous operation it is, like pointer math and cryptography.
I just write a lot of code that looks inside strings, in simple 'crud' apps, so I can't imagine avoiding these kinds of operations. Of course if you can use highly trusted libraries and just pass 'em around it's OK.
But the GitHub issue was sort of subtle. In hindsight it seems like an obvious one, but it's a mistake I typically see in a lot of code reviews I've done, where people just reach for the convenient variable without thinking about the intent and 'meaning' behind it. The problem is you would probably have code like this:
How would the 'don't look at strings' thing apply? Well, you could avoid using .toLowerCase inside getUserDetails; then you are treating the email as a stream of bytes. But that would allow someone to sign up multiple times as fred@gmail.com, FRED@gmail.com, etc. This might be bad for you (spammers) and bad for the user (they don't realise they actually have two accounts).
The alternative could be 'well use a library written by experts for dealing with email duplication issues', i.e. detecting that string A and string B map to the same email or not. But we criticize ourselves for the NPM dependency madness, right? So some balance is needed.
It is tricky to not deal with strings in some way!
One pattern I am keen on, and will probably use in a refactor I am doing, is using types to add meaning to strings. In TypeScript, for example, don't pass a string; instead create a wrapper class called, say, EmailString, that on construction does some validation (or maybe none), but as a minimum you have documentation of what the string means. You can shift-F12 the constructor to check the hopefully few places it is constructed to make sure they do the right thing; then in the hundreds of consuming places you know you are dealing with an email.
String matching is inherently fuzzy. I haven't tested it, but I think a dropdown box to select your account would be best. This would also help with other normalizations (accents, spaces), and even typos if user enumeration is not a problem.
I put abuse detection system in the same category as tests, it gets a free pass on most cleanliness requirements.
The typed strings pattern is a great one. In this case you could go as far as having a RegisteredEmail type, and requiring that in your MailerHelper.
> opaque byte arrays, only available operation is rendering into a bounded area
Goodbye web then. Because the whole request you get from a client is nothing but strings.
So when you run an online shop and a customer orders "7" screwdrivers - then you are screwed. Because what does an "opaque byte array" of "opaque byte arrays" cost? How much shipping will that be?
But at least you have a brand new customer: Henry@gmail.com. Since you do not lowercase the email and do not recognize him as henry@gmail.com which he used in his last order.
>But at least you have a brand new customer: Henry@gmail.com. Since you do not lowercase the email and do not recognise him as henry@gmail.com which he used in his last order
You may not like it but this is fully standards-compliant behaviour. RFC 5321 states:
>The local-part of a mailbox MUST BE treated as case sensitive
I'm not saying this is good behaviour, and the RFC itself also discourages exploiting it.
There are more ways an email address can be equivalent though :-). Since we know it's Gmail in this case, we know the email address is case-insensitive and that dots don't matter. Comments are also allowed in email addresses (in both the local part and the domain part). Here are a couple of examples from RFC 2822:
> But at least you have a brand new customer: Henry@gmail.com. Since you do not lowercase the email and do not recognize him as henry@gmail.com which he used in his last order.
Which would be correct behaviour. From RFC 5321, part 2.4, "The local-part of a mailbox MUST BE treated as case sensitive."
That's a good point, these problems are not easy to solve. But they should be solved.
For example, the user types "7" into the UI control. The UI control knows it's supposed to hold a number, so it has methods that return the parsed number (done by the OS, very carefully), or an error to the user.
Similarly, you don't send the product name, but the ID of the product the user selected from their fuzzy search. I would go as far as using this technique for the email too.
> but I'm more and more coming to the conclusion that strings are evil and should be treated as opaque byte arrays, whose only available operation is rendering into a bounded area
Nope, they are not byte arrays. If you treat them as such, you run into many nasty bugs.
Today I encountered one (C#):
    if (s.Length > 60)
        s = s.Remove(60);
this has a fucking serious bug, especially if you try to show the string afterwards or worse save it inside a database.
strings are arrays of codepoints. NEVER TREAT THEM AS BYTE/CHAR ARRAYS.
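(Even slicing an array of code points can split a user-visible character; a small Python illustration, with a made-up 60-character limit:)

    s = "x" * 59 + "\U0001F1FA\U0001F1F8"   # 59 x's plus 🇺🇸 (two code points, one flag)
    truncated = s[:60]                       # slicing by code point cuts the flag in half
    print(len(s), len(truncated))            # 61 60
    print(truncated[-1])                     # a lone regional indicator, a broken symbol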
I meant "byte array" as in "this is a bunch of binary data you should not look into". Like people already do with images. You wouldn't remove a byte from the middle of a JPEG file, and you shouldn't do it with strings either.
You’ve read my mind! I’ve been meaning to give a talk where I just go down the list of string methods and point out how each one is a hideous source of bugs.
Please, go forth and spread the word. Drop me an email (on my profile) if you want ideas.
I don't have a blog yet, but I already have a draft for a post titled "Strings are evil". There's a lot wrong with the current way we develop software, but I'm convinced this is a big one.
So here's the thing. When you display a word written entirely in Cyrillic characters, it would be wrong for it not to display as normal text.
But when you display a word that contains eight Latin characters and one Cyrillic character, couldn't the display create some sort of warning? Highlight the script-mismatched character with a box? It's not normal in any language to mix alphabets.
> it’s not normal in any language to mix alphabets
Decidedly not true. Languages like French and Turkish and German use their own alphabets, which look like the typical "American" Latin alphabet but aren't. How do you decide whether it's English with non-American letters thrown in or not? Languages like Arabic are transliterated with English letters and Arabic numerals. Other transliteration conventions mix Arabic characters in with Latin ones.
That's a good heuristic, but you are still left with spoofing using 100% Cyrillic, and even old school approaches like "app1e.com" (1 instead of l). When you consider distracted users, or people with bad eyesight, you are back to square l.
We (I) really would like a system where I would enable the languages/characters that I support/recognise and the rest would be displayed as a number in a box.
> It's not normal in any language to mix alphabets.
This is not true.
For example, my native language has suffix-based articles (there is no separate word "the"; instead there is a suffix attached to the noun). When including a foreign word in the middle of a sentence, some people will attach the article suffix directly to the foreign word, or at best with a hyphen (this is likely not correct "by the book", but some people write like this anyway). With your suggestion, this will show a warning when it is perfectly normal usage.
Furthermore, some Asian languages like Japanese do not use spaces, so when they mix in foreign words or names written in their original script how do you tell that it's intended to be a separate word?
But why should non-English users be forced to name their files with English characters? Libraries and code that work with files should always include tests with Unicode in file names.
This blog sets opacity: 0 (fully invisible) on the entire content, then fails to unset that CSS with JS, b/c the JS crashes if you block cookies.
> because the system lowercased the provided email address and compared it to the email address stored in the user database.
While sending the email to the attacker-provided address, instead of the one in the database, is bad… lowercasing emails is also not valid. The lookup should never have matched in the first place.
(It's slightly more complicated: to an extent, the case of the domain name doesn't matter, ignoring non-ASCII characters — I have no idea what they do. But the local part — the portion before the @ — is case sensitive. A server is free to ignore that, and map multiple local parts to the same mailbox internally¹, and many do, but the sender cannot make that assumption.)
¹or do other weird things, like ignore dots, or +extensions, etc.
Practically speaking, you have to treat email addresses as case-preserving: you match case-insensitive, but you always store the case the user entered in directly.
I usually give my email address as starting with a capital letter for historical reasons, and I once had an issue that wouldn't let me log on with any case whatsoever, probably because it lowercased the email address and tried to match it against the uppercase in the database.
> you match case-insensitive, but you always store the case the user entered in directly.
You still need to normalize the Unicode for either search or display, unless you're only allowing ASCII or something.
E.g. for the column you store for search and to guarantee uniqueness, casefold and then normalize to NFKC. And for the column you store for display, normalize to NFC. (And obviously you need to sanitize user input before doing anything.)
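(Roughly, in Python, assuming the two-column scheme described above:)

    import unicodedata

    def for_search(email):
        """Key used for lookups and the uniqueness constraint."""
        return unicodedata.normalize("NFKC", email.casefold())

    def for_display(email):
        """What we show back to the user, preserving their chosen case."""
        return unicodedata.normalize("NFC", email)

    print(for_search("John@Example.COM"))    # john@example.com
    print(for_display("John@Example.COM"))   # John@Example.COM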
That's one of the reasons why it's recommended that email services have case-insensitive local parts. That, and the fact that there are plenty of clients out there that will corrupt the address capitalization.
Unfortunately convention has normalised user expectations that email addresses are now case-insensitive. It's now a standard business requirement.
Too bad few devs fully handle Unicode.
GitHub shouldn't have normalized case, but I think it's not insane to require lowercase in the first place, so long as you don't convert silently.
I wouldn't call +extensions and ignoring dots weird because Gmail does that and fairly or unfairly they set the standard for user expectations of email nowadays.
Also, I think it's unreasonable to require full compliance with the spec. For example, if you go around giving out email addresses with comments in them ("usern(ignored_comment)ame@example.com"), you'll see many things break, and I don't think that's an issue.
For a delightful example of how insane email addresses can get, here's a fully compliant regex to validate one:
Except that's not actually correct, because strictly following the ABNF does not yield correct semantics for an email address.
An email address consists of a local-part, a literal @ character, and then a domain name or an IP address literal. The local-part is either a series of dot-separated atoms (/[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+/ is the syntax for an atom) or a quoted string (/"([^\\"\0-\031\x7f]|\\[^\0-\031\x7f])*"/ in regex). If you support EAI, you need to add \u00a0-\u10ffff [i.e., all non-ASCII non-C1 control characters].
According to RFC 822, you can insert whitespace (and comments) arbitrarily into the mailbox production, but that is non-semantic. Seeing "From: John Doe <foo @ example (I ((hate)) CFWS).com>" means that the email address is exactly "foo@example.com", and you can reject any claim of the spelling without prejudicing any email addresses.
In practice, you can drop all support for quoted strings and IP address literals in most applications. So a correct email address (pre-EAI) regex in that vein would be:
[I simplified the quoted string to only accept escapes that are semantically necessary--the strings "a b"@example.com and "a\ b"@example.com correspond to the same email address].
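For what it's worth, a regex in that spirit (dot-atom local part, ordinary domain, no quoted strings or IP literals) might look roughly like the sketch below. This is my own approximation built from the atom syntax above, not the exact regex the parent comment had in mind:

    import re

    ATOM = r"[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+"
    DOT_ATOM = rf"{ATOM}(?:\.{ATOM})*"
    LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?"
    DOMAIN = rf"{LABEL}(?:\.{LABEL})+"
    EMAIL_RE = re.compile(rf"^{DOT_ATOM}@{DOMAIN}$")

    assert EMAIL_RE.match("foo@example.com")
    assert not EMAIL_RE.match("foo @ example.com")    # CFWS is not part of the address itself
    assert not EMAIL_RE.match('"a b"@example.com')    # quoted strings deliberately unsupported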
Edit: Sorry, there's quite a few asterisks in the regexes that Hacker News is turning into italicization, and I don't know how to unbork them.
Edit 2: Someone suggested how to unbork the standalone regexes, but the asterisks in the inline regex in the second paragraph are still missing.
This is why I love HN. Someone replied to my joking improperly-researched comment with the right answer and a detailed explanation of why I'm wrong. I stand by my assertion that actually using that regex to check emails the user enters would be dumb.
By the time you hit IDNs, regexes for validation are no longer your biggest issue. Your real check for validity at that point becomes "can I actually contact the host" (or send an email, if validating an email address), and there is little point in aggressively validating a purported domain name instead of checking if it actually exists.
Just... don't try to validate it like that. Check if you can send an e-mail to it, if you can then it's fine. I see way too many devs thinking they can validate e-mails with regex and then I can't use my own name in my e-mail.
To be honest, login forms should take both username and e-mail as the user identifier, far too often they don't. I personally think that the only place where such a regex should exist is during sign-up.
Pick a language of your choice, and fully implement the spec. I bet it'll still be long.
Edit: The following is completely wrong
For example, the Python module that parses an email address is around 500 lines and repeatedly warns that it'll be very hard to follow without a copy of the spec in front of you. It contains code for parsing multiple timezone formats and cites a follow-up spec addressing a bug in the initial treatment of negative timezones...
The timezone isn't for parsing addressing headers, it's for parsing date headers. And actually, that file isn't for parsing email addresses, it's for parsing addressing headers in mail messages.
The code I wrote for parsing email headers is here: https://github.com/jcranmer/jsmime/blob/emailutils/headerpar... . A decent chunk of it is building a full lexer for email headers, and trying to cope with supporting internationalization only in the few cases where it needs to be supported. And the corner cases for that i18n support are really nasty.
So if I understand this right, what GitHub did was something like:
user = get_user_from_valid_email(params[:email])
send_reset_email(params[:email])
# instead of
# send_reset_email(user.email)
?
I've seen this pattern before and the reason is usually something about using the variable in memory as opposed to the function call. Total non-optimisation.
Lots of non-thought too. Sending e-mail directly from the place where web requests are processed isn't very smart. What if the SMTP subsystem is currently down or very slow? How many e-mails will you send if an attacker starts 1000 parallel web requests for a password reset?
A saner way to do these things is to just set a flag ("password reset requested") in the user database there and do the actual work asynchronously in a regularly performed maintenance task. It'll also prevent all these attacks based on misuse of user input by default (unless, in this case, you pointlessly decide to update the user's e-mail in the user database to what the hacker specified).
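A minimal sketch of that "flag it, process it later" shape, assuming a simple users table and some periodic worker (all of these names are hypothetical):

    def request_password_reset(db, submitted_email):
        user = db.find_user_by_email(submitted_email)
        if user is not None:
            db.set_reset_requested(user.id)  # just record the request; nothing is sent here

    def process_pending_resets(db, mailer):
        # run periodically by a maintenance task or queue worker
        for user in db.users_with_pending_reset():
            token = db.create_reset_token(user.id)
            mailer.send_reset_link(user.email, token)  # address from our own records, never from the request
            db.clear_reset_requested(user.id)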
Are designers like you the reason password reset mails frequently take 30 seconds plus to arrive?
I expect things on the web to be damn near instant - I want to click that password reset button and hear a synchronous "ding" of an arriving mail. I don't want to wait for some cron job to run once per minute.
If you want to send mail off the serving path, at least use a push based queue so there is no scheduling or polling involved.
I can't speak for any other engineer anywhere else (where I'm sure they synchronously send out email because they haven't learned how to do it better), but there's not a chance github do this at their scale, and they're built on Rails. Rails gives you async mail by default and you just have to plug in a queue adapter for your worker processes to consume.
e.g. GitHub gets this request, queues the job in redis/zero MQ/SQS or whatever they're using, and another process dedicated to sending those emails (or jobs with that priority) does the rest of the work.
This is a massively common pattern in the Rails world and is as trivial to configure as your database connection.
Or maybe someone wrote a validate_user_and_email() API and someone else wrote a send_password_reset_email() in a context where they lack access to the user DB, so they just validate the {username, email} and send to the attacker-provided email address.
The first case (yours) is a plain bug, while the latter (mine) is an architecture bug. Architecture bugs often arise out of organizational bugs.
We use three versions of the email address internally: (1) the exact verified address used at signup or the last valid email change; (2) a normalized version of that, used for identity: without + mailboxes, lowercased, de-accented, stripped of dots and other inert punctuation, and normalized in a number of other ways; and (3) the email parameter itself, only used during registration.
We accomplish this with a slightly more restrictive version of the standard ABNF provided in the RFC.
I guess I should probably document why we go to this trouble, in case somebody gets the brilliant idea to "simplify" it.
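For illustration, the "identity" version might be derived with something like the sketch below (very rough; the names and exact rule set are mine, not necessarily theirs, and the verified address is kept separately as the only thing mail is ever sent to):

    import unicodedata

    def identity_key(email):
        local, _, domain = email.partition("@")
        local = local.split("+", 1)[0]   # drop any +mailbox extension
        local = local.replace(".", "")   # drop inert dots
        # de-accent: decompose, then strip combining marks
        local = unicodedata.normalize("NFKD", local)
        local = "".join(c for c in local if not unicodedata.combining(c))
        return local.casefold() + "@" + domain.casefold()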
I wouldn't go so far as to call it an antifeature--in fact, I wouldn't be surprised if a common use is to allow people to maintain multiple accounts on the same service with one email. It isn't standard, but it's not in violation of any standard--nothing says that the server must store each distinct valid email in a separate mailbox with its own login. Many servers implement "catchall" emails or aliases which result in the same thing, distinct addresses going to the same mailbox.
What is a problem, and what is non-standards-compliant, is GitHub incorrectly assuming that all mail providers will do this when many do not. It would be no different from assuming that "admin" and "postmaster" go to the same mailbox because that's the way a lot of software is configured.
This seems like it would take a lot of effort to mildly annoy someone.
And the only way it can be automated is if the service doesn't protect itself against automated user sign-ups, which they will either start doing once someone really takes advantage of it, or will result in their domain being categorised as spam once they start sending lots of sign-up emails (either by the user or gmail in general).
> That's why we keep your verified mailbox address for sending mail; but there's no good reason to consider them different for the purpose of identity.
Both of Microsoft's own identity services (AAD and Live ID, and by extension O365 and Outlook.com) recognize (and allow creation of) microcolonel@example.com, micro.colonel@example.com, and mic.rocolonel@example.com as distinct identities/email addresses.
What you're saying is that once the first of (microcolonel|micro.colonel|mic.rocolonel)@example.com registers at github, the other two will no longer be able to do so, but will instead receive a confusing 'you already have an account' error, (hopefully) without being able to receive password reset emails.
...and only one in 50,000 e-mail addresses contain the string "rq5", therefore we strip that string from addresses...
A false positive in these identity checks is likely to be less destructive than a false negative. But I still don't get the point of making up all sorts of rules not in the standard. I have seen both + and meaningful dots in e-mail addresses in the wild.
+ and . have been legal in email addresses since the format was standardized.
Google was the first installation I know of to silently swallow periods. Plus-addressing was well-known back in the day, but as far as I know GUI mailers more or less killed the practice by not offering support for it, and web sites written by people too smart to know how to validate email addresses ensured you can't even use them properly anymore.
Why not? On most major hosts it has a special meaning, and otherwise it is a relatively ridiculous thing to just add willy-nilly to your email address. We keep your verified mailbox address, the one you gave us, for sending mail.
I doubt we'll ever turn away a customer by preventing registration of a new account sharing the prefix to a plus sign in their email address with an existing customer.
Senders don't get to dictate how a recipient encodes their addresses.
RFC 822:
The local-part of an addr-spec in a mailbox specification (i.e., the host's name for the mailbox) is understood to be whatever the receiving mail protocol server allows.
I am well aware of that, but I'm comfortable requiring that new customers don't register an account with an email address foolishly designed to resemble another customer's email address in this particular way. We don't throw away their specified mailbox address, we just don't accept registrations which look suspiciously similar, or intended to cause confusion.
I repeat, this has absolutely nothing to do with the mailbox address, where we send mail.
So if on my system we name users by their last name unless that is taken and then we add initials, and so Mike Nesmith (our only Nesmith) gets "nesmith", but we have several Smiths so Norman Edward Smith is given "n.e.smith", then only one of them can use your service?
So, when I registered for my primary email account (many, many years ago), Firstname.Lastname@provider was already taken, so I took FirstnameLastname@provider.
Are you suggesting I shouldn't be allowed an account with you if the person who beat me to my preferred email address also beat me to registering with you?
I can agree with this, except maybe for the dot stripping: while it will block out some legal mail addresses, it's generally worth it, and accidental collisions are close to impossible in practice.
It's like deciding to not allow quoting and, with it, whitespace. Sure, ":"@example.com is a legal mail address (surprised?), but nothing good will come from allowing it.
Yes they can! It's their prerogative to allow people to sign up or not with any email address, full stop.
It might make them bad netizens and you may not like it. But the spec doesn't compel any behavior. It's just a way to communicate technical ideas and ideals.
This is paltry nonsense. Users of Gmail and similar hosts are used to treating the dots as decorative, and will register with john.smith@ then try to use "Forgot username?" with johnsmith@ instead. They'll end up with three GitHub accounts registered to the same mailbox and be confused as heck about how they're still getting email notifications for an account that they can't recover a password for.
You can't break user expectations and mental models by pointing to the spec as justification. The spec exists to serve users, not the other way around.
Sure, and meanwhile if there are collisions without periods on that domain (randy@somewhere, r.andy@somewhere and rand.y@somewhere, for instance), only one of them gets an account.
In the absence of being able to count on specs, I guess the user should expect a race?
The fact that we just now found out that this is the case, yet would be completely unable to find anyone complaining about it except in hypothetical terms, tells you exactly how important it is.
This stuff just causes your business to mysteriously have low customer satisfaction. You're doing 95% as well as the successful business but for 5% of your customers these "unimportant" problems make it awful to deal with you and they stay away and tell others.
Most of them can't specifically point to the problem, their impression is just that your services don't work properly. They're right.
> Why not? On most major hosts it has a special meaning
So at best you can special case for those "major hosts" and not apply such treatment to any other domain.
The RFCs for email don't say "uhh dude whatever, just check what gmail does and maybe hotmail too lol". You are playing fast and loose with these things.
A lot of the things you describe are Gmail-proprietary extensions, especially the dot stripping. Having two mail addresses that differ only in their dots is possible, especially with some older email addresses.
Domains are not case sensitive, but email local parts are! There is no reason whatsoever to do case normalization on local parts of emails on any domain you do not own, as this is strictly incorrect and could lead to a totally different address that also exists (as happened here).
Of course, email providers are free to do whatever case folding or normalization they want, in which case the security burden of avoiding collisions is on the provider. If someone's email provider maps different case variants to the same mailbox, there's still no need whatsoever to do anything to the address, as the user will get it delivered to them regardless. If the provider doesn't do case folding, they will have to enter their local part case sensitively, but that's exactly the same as for any other use of their email address.
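Concretely, the only folding that's safe to do on an address you don't control is on the domain, e.g. (a sketch, assuming plain ASCII domains):

    def same_address(a, b):
        a_local, _, a_domain = a.rpartition("@")
        b_local, _, b_domain = b.rpartition("@")
        # local parts compared exactly; only the domain is case-insensitive
        return a_local == b_local and a_domain.lower() == b_domain.lower()

    assert same_address("John@EXAMPLE.com", "John@example.com")
    assert not same_address("John@example.com", "john@example.com")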
I can only imagine how this vulnerability came to be. Unicode is not to blame here. If security-critical password reset code was not audited carefully enough to catch a mistake like this, one wonders what other errors might remain.
Why? First of all, ß is already lowercase, so why should toLowerCase() change it? It is also a normal letter with both uppercase (ẞ) and lowercase (ß) forms, so converting between the cases can be trivial and quirk-free. Arguably the most common word you will encounter ß in is Straße (street), where it is already lowercase - will "Straße".toLowerCase() turn it into "strasse"? WTF? "Straße".lower() returns "straße" in Python, which seems reasonable (nevertheless "Straße".upper() actually returns "STRASSE", ignoring the existence of the uppercase ẞ (U+1E9E)). Why should it behave differently in JavaScript? (Because JavaScript is different, I know; just a rhetorical question.)
> (nevertheless "Straße".upper() actually returns "STRASSE" ignoring the existence of the uppercase ẞ (U+1E9E))
Python probably predates the addition of a notional capital ß glyph in 2017. SS is the capitalization of ß that you'd expect if you were thinking of your data as a string rather than a collection of font elements.
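For reference, current CPython behaves as described, as far as I can tell; the third line uses the post-2017 capital form, and it's casefold, not lower, that maps ß to ss:

    "Straße".upper()     # 'STRASSE'  -- full uppercase mapping expands ß to SS
    "Straße".lower()     # 'straße'   -- ß is already lowercase
    "ẞ".lower()          # 'ß'        -- the capital ẞ (U+1E9E) lowercases to ß
    "straße".casefold()  # 'strasse'  -- case folding, not lowercasing, maps ß to ss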
My girlfriend is from Turkey, and I shared this story with her a while back (I discovered it while also researching some Unicode collision issues). She said that while it's an entertaining story, there's almost certainly a bit of sensationalism and exaggeration from the Turkish press combined with credulity from English-language journalists when re-reporting it. Basically, the supposed texts wouldn't have made sense in terms of grammar and syntax with a straightforward dotless/dotted I swap, and it would've been obvious to someone fluent in Turkish what had happened. This would've been especially true if you had had this cell phone for any amount of time, communicating in Turkish, while it was routinely swapping Is.
More likely is a bunch of young and/or not too bright people were looking for a reason to get into a violent confrontation. Then the muckraking Turkish press had a sensationalist murder-suicide lovers quarrel story, and as a bonus a nationalistic "see how cell phone companies don't respect our culture" angle as the cherry on top.
However, you should still ALWAYS be careful when converting between character sets and be locality aware when manipulating strings. Practice string safety. ;)
> // Note the Turkish dotless i 'John@Gıthub.com'.toLowerCase() === 'John@Github.com'.toLowerCase()
I'm not sure this example is correct. The dotless ı is already lower cased, so the comparison above should yield false. Maybe the author was thinking about upper case dotted "İ", which becomes regular dotted "i" when lower cased.
So what could happen is that a user enters "JOHN@GİTHUB.COM" as the email, and then the email is sent to "john@github.com".
The example doesn't seem to work at all for GitHub's explanation.. they say that their outgoing email server didn't support unicode in the domain part anyway. What am I missing?
Was your actual attack on the local part?
What is so fascinating about security is that the same problems keep popping up every once in a while - I remember the Turkish i issues being a big problem in the early-to-mid 2000s during all the security pushes that went through the software engineering world. Then we kinda forgot about them, and now they keep coming up again.
The best a researcher can do is go back 20+ years and look for security issues that occurred back then. Most likely you'll find very similar things again now.
So it is! It even seems to fool Chrome. If you search for "delivered" on the page, the search box says "1/4", but pressing Enter will only take you to the 2 real ones, not the Turkish-i ones which it has presumably counted.
The alternative is worse though. Characters with umlauts matched with characters with no umlauts. E.g. searching for "rõõsa" will find both "rõõsa" and "roosa", incredibly annoying.
Note that Unicode did add an "uppercase-ish" ß, as it does appear in German, but only in the context of an all-caps word, e.g. on a sign board, so capitalization of a whole word to SS and that new all-caps ẞ are both correct (not sure if Unicode changed the capitalization rules or just added that strange all-caps ẞ).
For backwards compatibility reasons, the capitalization rules can't be changed for existing characters. So normalizing by naive case-folding now requires at least three steps:
I had someone tell me that programming isn't real work before, and this is yet another example of all the small little details going into building things that most people don't really think about day to day.
I haven't had to work with login code in a while, but might at some point. I know some systems only allow alphanumeric usernames, but it sounds like you can't force people to have alphanumeric emails... Well, I guess you could, but it might upset someone. I know there are normalizing functions like NFD, NFC, NFKD or NFKC that might work for usernames, but I'm not sure what's really recommended.
Then there are also brute-forcing attempts to mitigate, and other considerations to make when building out an account system. And if your company is large enough to provide phone support, I'm not sure how you'd tell the support person which specific emoji you used in your username.
> I had someone tell me that programming isn't real work before
Haha, I don't even understand what metrics the person was using to consider something "work". A pilot sits throughout the flight, but I think it's fair to say they're working...
Yeah. I guess this person thinks sitting at a computer all day isn't real work. Real work would be working in a factory all your life breaking your back. Then I also think some people think making websites and coding is the same as using Word or Powerpoint. I guess they just don't really understand tech, probably a lot of people in the rust belt. Probably why they're driving young people away.
As a side note, Punycode conversion is only defined for the domain name, _not_ the local part. Using Punycode on the local part semantically creates a different email, and at least theoretically a mail provider might support both the Punycode and normal versions as two different mail addresses, so using Punycode there would potentially open up a different vulnerability.
Now that I think about it, as far as I remember the local part of an email is actually not defined as case-insensitive, though (all?) mail programs treat it as such. The important part here is to always use data from your database for any security-relevant parts.
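To illustrate the domain-only part with Python's built-in IDNA codec (internationalized local parts are carried via SMTPUTF8 instead and are never Punycode-encoded):

    "bücher.example".encode("idna")  # b'xn--bcher-kva.example' -- only domain labels get this treatment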
The linked post is titled "Hacking GitHub with Unicode's dotless 'i'.", but this submission is in title case "Hacking GitHub's Auth with Unicode's Turkish Dotless 'I'". I think this is a bad title change, because an uppercase I is supposed to be dotless, whereas the lowercase i used by the author's title is not.
I haven't finished reading the article, but isn't the problem here that they are sending the e-mail to the address provided by the user rather than sending it to the e-mail stored in the database ?
I fail to see any added value in sending a reset link to an e-mail entered by the user (while that e-mail is already in the database).
>'ß'.toLowerCase() // 'ss'
"ß"
>'ß'.toLowerCase() === 'SS'.toLowerCase() // true
false
>// Note the Turkish dotless i
>'John@Gıthub.com'.toUpperCase() === 'John@Github.com'.toUpperCase()
true
Chrome 79.0.3945.79 (Windows, 64-bit) seems to differ from the proposed results for the first two statements. If the proposed results are indeed what should happen per the Unicode standard, then I wonder whether Chrome is not fully implementing it?
Is this something that needs changing in the Unicode spec itself or how strings are handled in general by various tools/programming languages? I love [plain] text, but it's so, so fragile. :/
Author: any word on how much GitHub paid out for discovering this vulnerability? Also, how simple was it to create a unicode-based email address on one of the large providers?
So any email containing an i can be reset. Technically those using a custom domain name are immune but those using a general email service are at risk.
Why would you write a general function that resets an account password but also accept an email address as a parameter? What use-case exists to change the email address sending the message?
Would this not have been solved if the email addresses were stored as hashes? Besides this, it would be an extra layer of security in the event of a breach. It seems silly that addresses aren't stored as hashes, especially when stored in proper databases that can be queried quickly.
Barely. If you look at it, it is clearly just what its name says: a lowercase medial s combined with a lowercase z. The uppercase version exists, as a sibling commenter noted, basically as a typographic utility.
Fraktur had/has other such lower case ligatures (tz, ch, sch, ss (not ß) et al) but for some reason only ß survived into Latin script as a full fledged letter. I have a lot of old (mostly 20th century) books in Fraktur and they all use these ligatures more consistently than the Latin ligatures are used.
Fraktur ß is a ligature of s and z; the current form came into use in Latin-script German because the Latin script already had ß, for the ligature of s and s.
Some sort of orthographic device is needed - s between two vowels means the first vowel is long and the consonant is voiced, and ss between two vowels means the first vowel is short and the consonant is voiceless. So it's useful to have a different case for when the first vowel is long and the consonant is voiceless:
Busen /bu:zən/
Busse /busə/
Buße /buːsə/
Vowel length isn't predictable from spelling in cases of consonant clusters and other digraphs: Hand /hant/ vs. Mond /moːnt/, Bruch /brux/ vs. Buch /buːx/, etc. So ss could've been used for both - or sz, although there are some words spelled with sz as a sequence of s and z, like Szene /stseːnə/.
This is an organization that not only hosts a great deal of the world's public/secret code, but is run by one of the largest data-gathering organizations on earth.
I don't believe Microsoft can permit things like this under its umbrella if it wants to continue to pretend that its data collection is benign.
The likelihood of such a “hack” happening using the Turkish dotless “I” is ZERO as all Turkish email addresses and website domains are formatted WITHOUT using Turkish characters which include examples like: ç, ı, ü, ğ, ö, ş, İ, Ğ, Ü, Ö, Ş, Ç
That’s incorrect. The attack vector hinges on the ability to create email addresses with Turkish characters. There is nothing stopping an attacker from creating addresses with Turkish characters to attack existing addresses without Turkish characters.
Turkish emails are not supported in the first place.
Internationalization examples
The example addresses below would not be handled by RFC 5322 based servers, but are permitted by RFC 6530. Servers compliant with this will be able to handle these:
RFC 6530 doesn't mention those character sets explicitly. It proposes allowing all Unicode characters, apart from some control characters.
It is true that the RFC recommends mailbox providers take normalization into account. A mailbox provider that allows i and dotless-i addresses to be routed to different mailboxes is careless, if not actually non-compliant. I don't know if any popular provider does this: I'm guessing the authors created their own to demonstrate this attack.
[1] https://unicode.org/cldr/utility/confusables.jsp
[2] https://developer.apple.com/library/archive/qa/qa1173/_index...
[3] https://krebsonsecurity.com/2011/09/right-to-left-override-a...