This scheme struggles in the face of email address case folding.
At the protocol level, email addresses are case-folded on the RHS but case-sensitive on the LHS. So it’s crucial that LHS case is preserved by delivery systems. Unfortunately most users then treat them as folded on both. So you can successfully verify one variant, store the downcased hash, and it’ll subsequently match but delivery bounces. Or, hash the exact original input but have many baffled users unable to access their accounts. Neither is a good outcome.
This is not an edge behaviour either: I have tons of users that mix up their email capitalisation from day to day.
No it doesn't, convert the case when generating the hash – think of it as part of the hash function. But leave the case unchanged from whatever the user entered for any steps that involve sending an e-mail to the address.
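Roughly what I mean, as a minimal Python sketch. Plain SHA-256 is only a stand-in for whatever hash the scheme actually uses (a real deployment would presumably want a keyed or deliberately slow hash), and the names `lookup_hash` and `send_to` are made up for illustration:

```python
import hashlib


def lookup_hash(address: str) -> str:
    """Hash used only for matching: downcase first, then hash."""
    folded = address.strip().lower()
    return hashlib.sha256(folded.encode("utf-8")).hexdigest()


typed = "Person@Place.com"

# Any capitalisation of the same address produces the same lookup hash.
same = lookup_hash(typed) == lookup_hash("person@place.com")

# The address exactly as typed is what gets handed to the mailer.
send_to = typed
```

The folding lives inside the hash function only; nothing that sends mail ever sees the folded form.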
Isn't this problem orthogonal to storing hashed email addresses? You'll always have access to the email address the user typed in when you want to send a transactional email, so you can perform whatever sanitization needs to be done at that point. How does storing the email in plaintext get around issues involving sending emails to case-sensitive mailboxes?
> How does storing the email in plaintext get around issues involving sending emails to case-sensitive mailboxes?
By the validation one performs at initial sign-up.
I’m implicitly saying it’s okay to require that an email address was entered with perfectly matched case at signup, which is validated by a code or link etc, and then be more forgiving about what you receive in all subsequent uses because it’s supportive of ordinary humans trying to use your product.
You could manage that with hashed emails by separately storing the case of letters in the user part, without storing which letters it's referring to. You could then apply that case to the user-provided email during checkout.
(You could probably actually do a pretty good job in most cases just by brute-forcing the possible combinations and seeing which one matches the hash, but the worst-case CPU cost would be bad for long emails.)
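A rough Python sketch of the brute-force variant, under some assumptions of mine: the stored value is a plain SHA-256 of the exact verified address with the domain downcased, only ASCII letters in the local part are varied, and the function names are hypothetical:

```python
import hashlib
from itertools import product
from typing import Optional


def exact_hash(address: str) -> str:
    return hashlib.sha256(address.encode("utf-8")).hexdigest()


def recover_case(typed: str, stored_exact_hash: str) -> Optional[str]:
    """Try every capitalisation of the local part until one matches the hash."""
    local, _, domain = typed.rpartition("@")
    domain = domain.lower()
    # Two options per ASCII letter, one option for everything else.
    choices = [
        (c.lower(), c.upper()) if c.isascii() and c.isalpha() else (c,)
        for c in local
    ]
    for combo in product(*choices):
        candidate = "".join(combo) + "@" + domain
        if exact_hash(candidate) == stored_exact_hash:
            return candidate
    return None
```

The loop runs up to 2^n times for n letters in the local part, which is where the bad worst case for long addresses comes from.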
This was proposed in another comment as well, and it’s a neat idea that unfortunately becomes fragile in the face of UTF-8 local parts, since the Unicode folding standards change over time.
Fair point, though you could probably store all non-ASCII characters in plaintext and still get most of the benefit. I suspect UTF-8 in emails is rare overall, at least for sites that don't have certain region-limited audiences.
That isn't how email confirmation works, because it's a) enormously prone to false negatives, b) subject to ruinous delays, c) occurring after the transaction i.e. too late, and d) building you a reputation as a bad sender.
This is why all email address confirmation today is asking you to click a link or enter a code.
Also note that the latter strategy is rising in popularity; it is because clickable links, whilst seeming so convenient, are themselves prone to both false positives and negatives, and also increase the likelihood of ending up in junk mail.
But wouldn't the user face this issue all the time if they have a case-sensitive mailbox but type their own address in the wrong case? So the assumption may be that a user with a case-sensitive mailbox is used to typing in the correct address?
No — many accounts aren’t created by the people that use them, and in any case we shouldn’t be relying on correct repeated string input and then blaming the user for a fulfilment process failure due to a typo from hours ago that we silently accepted at the time, and (worse) we can’t even distinguish between a capitalisation error and a discontinued recipient, even if it was previously verified.
As designers/developers, it’s our problem to solve.
Wait a second. Back up. Let's be clear here: the case you are talking about is the user entering their own email address incorrectly, and you're saying we as designers/developers should make a system that knows this and sends it to the correct email? What? Huh? If Person@place.com and person@place.com are two different recipients, how the hell is my application supposed to know which one you actually mean!?
The same way we deal with all email validation. Verify it once at application signup, then rely on the precise verified form.
It is unrealistic to expect end users to get the capitalisation of their email address consistently right. It is realistic to expect it to be done right at signup, since in the best practice case this’ll include a verification loop.
> No — many accounts aren’t created by the people that use them
Can you elaborate on this? I can only think of 2 examples, but neither seems like a good one:
1- someone holds power of attorney over someone else, and registers an account (email account?) in their name. But if there's PoA involved, the 2nd person (probably?) isn't able to manage an email account on their own, so this doesn't seem a meaningful distinction to worry about. (though if it's not an untreatable condition, it's possible that they might resume using their email account themselves, I guess)
2- the account is created by your ISP when you register, and they "helpfully" choose a username for you. So from this point of view, you didn't truly "create the account"
5- parental/grandparental accounts. What’s more, speaking from our own support mailbox, these are the folks most likely to miscapitalize their email address.
Okay... so how is this issue currently handled? Say I create a new email account: lAsZlO@inopinatus.com and the mailbox is case sensitive.
Next I create a youtube account but as E-Mail address I enter laszlo@inopinatus.com and I am told to click the link in the confirmation E-Mail... that I never receive. Well I do not like youtube anyway so I head over to hackernews and create an account to write this comment. Oh no I can not because I typed my E-Mail address in lowercase again. So now I am wondering if there is some error so I open your homepage, search for the support page and file a bug report, typing my E-Mail address in lowercase into the E-Mail field. I never receive your response telling me to write my E-Mail address in the correct casing.
I come to the conclusion that my E-Mail account just does not work and create another one at a provider that is case insensitive. OR I figure out/am told that the case is important and will never forget it again.
So where exactly are services like youtube or hn that ask me for my E-Mail responsible for handling upper/lower case correctly?
The solution could still be: store both the hash of the canonicalized address and the hash of the exact address that you know worked at some point in time. If the user later enters an address that matches the one but not the other, issue a warning that, should they not receive the E-Mail, they should double-check their spelling.
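As a minimal Python sketch of that two-hash idea (plain SHA-256 and `.lower()` stand in for the real hash and canonicalization; the record layout and the return labels are invented for illustration):

```python
import hashlib


def h(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def signup_record(verified: str) -> dict:
    """Store two hashes of the address that was verified at signup."""
    return {"folded": h(verified.lower()), "exact": h(verified)}


def check(record: dict, typed: str) -> str:
    if h(typed.lower()) != record["folded"]:
        return "no_match"
    if h(typed) != record["exact"]:
        # Same address up to case: accept it, but warn about the spelling.
        return "match_warn_case"
    return "match_exact"
```

On `match_warn_case` the site would show the "double-check your spelling if the E-Mail does not arrive" hint; on `match_exact` nothing extra is needed.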
Actually I do perform case routing & delivery in one of my Fastmail accounts. It’s not the default, but it is possible, if you’re willing to write or generate Sieve, which I am.
A base58-encoded extension part, for segmenting actions due to email (and the replies/responses to/consequences of email) originated by a SaaS platform. Helps with routing of support requests in particular so we don't have to go back to the end-user to say "which <platform organisation> did you mean?" and other similar CRM-ish behaviours. Also allows us to manage bounces, spam complaints, and RTBF assertions by (organisation,end-user) tuple and similar. The discriminator string itself comes from an application subsystem where it was already generated for our state machines. The (minor) downside is outbound delivery sometimes being delayed by greylisting more frequently than otherwise.
Since I've contributed to MTAs & MDAs and built multinational ISP email services in a previous life, I'm confident that every MTA of consequence is case-preserving of envelopes, I'm happy to rely on it. (this is also why I feel on solid ground pointing out the hidden gotchas in various proposed schemes that don't perfectly accommodate the same rule)
If the email provider says that email addresses are case sensitive, then that's the truth you live with. It's not your system, not your design, and you can't dictate to other systems how they should work.
It depends on configuration. I doubt very many SMTP servers are case sensitive in this day and age. My Postfix servers are not. Sendmail was also case insensitive in its default configuration (though it has been many years.)
I thought email addresses were case folded on the right hand side and site dependent on the left hand side.
The right hand side is more or less forced by the rules of DNS.
As for the left hand side, if I run the email for a site, can't I decide whether to deliver ABC@ and abc@ both to the same mailbox or to different mailboxes? And can't someone else make a different decision for their site?
If a site administrator does not have the prerogative to decide this, what rule prevents them? (And if there is such a rule, can you rely on it being enforced?)
Of course, but when processing an arbitrary email address, which will almost always be not on your site, you MUST treat the left hand side as case-sensitive (unless you have knowledge about that email domain).
site dependent means that when given an address you must treat the lhs as case sensitive. To do otherwise will mean that you've potentially broken the address and can no longer properly use it.
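In code terms, the only normalisation that is safe on an arbitrary address is on the domain. A small Python sketch (the function name is mine, and it does no validation beyond locating the last "@"):

```python
def normalise_for_delivery(address: str) -> str:
    """Fold the domain, leave the local part exactly as given."""
    # Split on the last "@": a quoted local part may itself contain "@".
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an address of the form local@domain")
    return local + "@" + domain.lower()
```

So `ABC@Example.COM` becomes `ABC@example.com`, and `ABC` versus `abc` is left for the receiving site to decide.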
You could store the hash of the downcased address plus a capitalization mask which tells you which letters to capitalize.
This works from a technical perspective, as letters with ambiguous capitalization (Turkish i, etc) aren't allowed in emails. It's a very minor privacy compromise: if a user has a very rare pattern of capitalization then an attacker with access to the database could identify their account. Negligible compared to the current standard.
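A sketch of what I have in mind, in Python. It assumes an ASCII-only local part (so downcasing never changes the length), uses plain SHA-256 as a placeholder hash, and the names `make_record` and `restore` are hypothetical:

```python
import hashlib
from typing import Optional


def h(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def make_record(verified: str) -> dict:
    """Store a hash of the downcased address plus a per-position case mask."""
    local, _, domain = verified.rpartition("@")
    if not local.isascii():
        raise ValueError("sketch only handles ASCII local parts")
    mask = [c.isupper() for c in local]
    return {"hash": h(local.lower() + "@" + domain.lower()), "mask": mask}


def restore(typed: str, record: dict) -> Optional[str]:
    """Rebuild the verified capitalisation from whatever case the user typed."""
    local, _, domain = typed.rpartition("@")
    folded_local = local.lower()
    if h(folded_local + "@" + domain.lower()) != record["hash"]:
        return None
    fixed = "".join(
        c.upper() if up else c for c, up in zip(folded_local, record["mask"])
    )
    return fixed + "@" + domain.lower()
```

The mask only says which positions are upper case, not which letters sit there, so on its own it leaks nothing beyond the capitalisation pattern.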
It’s a neat idea but unfortunately RFC 6531 opened up the local part to most of UTF-8, so internationalised capitalisation is in the mix now.
Ultimately I’ll never advise delivering to email addresses other than the precise octets of the one already verified, and this means the gold standard is always folding for match and uniqueness, but delivery precisely as verified.
How about this: store the verified email address, but encrypted using the hash of the case-folded input as part of the key. The intention being, you had to have the matching folded form in hand to obtain the verified canonical form. For extra jollies, only decrypt it on the client. (cryptography warning: I write this as the idea comes to me and without any analysis of emergent properties, vulnerabilities etc)
> For extra jollies, only decrypt it on the client.
Uh... How does that work if you need the email address to send email to the user? If only the client is able to decrypt the address, you will basically have to wait until the user connects and gives you the email (which you presumably never store long-term, handling it like you'd handle credit card numbers). That severely limits what you can do with an address.
If you're okay with being technically able to access the email address, you'd probably be better off with just straight encryption. That solves data leak protection, key rotation, backups, etc.
If you want some magic ID which is known only to the user and your servers will only use it to verify identity, then why not use just passwords with client-provided KDF parameters. Your machines will never know the plain data.
True. But since so many websites (e.g. all aviation companies I've encountered) case-smash the LHS of the email address and can get away with it since all other email software has had to adapt, this is a rather minor concern by now.
You're trying to sound clever but it's not working, I don't think you understand what the logic actually is, if you are comparing those two.
A lot of emails get casefolded no matter what. This is enough of an inconvenience that nobody in their right mind would use a case-sensitive mailbox for account signups.
A lot of people are still using that part of a regex no matter what - I still run into this regularly (and have been, for two decades now).
"There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard."
This is EXACTLY it. Both in the local part, and in TLDs - nothing clever about people being too clever by half, and generating false negatives on their input side.
"Funky"? These things are two decades old by now. There have been multiple generations of communication protocols since this became standard - and yet still people consider this some weird aberration, even the 4-letter TLDs.
Indeed, fallback is still necessary, but it doesn't follow that we should say "meh, just go back to the 3-letter maximum, because a lot of people still live in 1999."
I have a 3.2 domain on a ccTLD and even that gets shot down regularly enough that I wouldn't consider using it as a primary address. There should be no excuse for that, ccTLDs are older than a good chunk of the people writing the code excluding them, and yet here we are.
Says who? Email addresses are case insensitive. If email software treats emails as case sensitive then it is broken. People have to write email addresses on paper forms, in all caps.
Says RFC 5321 [1]: "The local-part of a mailbox MUST BE treated as case sensitive."
It _does_ recommend receivers treat it as case insensitive for maximum interoperability, so it is de facto insensitive, but something implementing it as case sensitive isn't broken.
It does make it broken. Broken means not working. If your software refuses an email because it's in the wrong case then that software is broken. And quoting out of an RFC is not going to make users stop complaining.
Email addresses are written in a variety of situations where preserving case is not possible. For instance on forms, or over the phone. If the IETF wants to ignore that then that's the IETF's problem, don't make it yours too.
I think it is a fair point -- when a technical standard and/or convention so vastly disagrees with common user perception, perhaps the requirement should be broadened to account for both.
On the flipside, you’re trying to tell me I should be willing to accept a lower standard than I wish to, or that I’m used to, or that has been established for decades, because of some anachronistic bureaucrat, and my response to that is a short expletive.
You, and the paper forms, are incorrect. In fact, on such forms, you should use the proper case for your email address, otherwise you are entering an incorrect address, which may be fraud.