You might not need to store plaintext email addresses (klungo.no)
250 points by danielskogly on Nov 2, 2020 | 172 comments



In most cases, encrypting sensitive information like e-mail addresses with a memory-resident key (e.g. injected using tools like Vault) in the application layer is a better strategy, at least if you need asynchronous access to that information (e.g. to send out weekly update e-mails). Most of the data leaks in the past were caused by compromised or misconfigured databases, not by compromised application server code.

Also, within the EU I need to be able to proactively reach my users (e.g. to notify them about a data loss), so only storing hashes of e-mail addresses and hoping users will log in so that I can send them an e-mail won't work.


This kind of encryption-at-rest scheme becomes an absolute necessity when cryptographic secrets have to be stored, such as 2FA TOTP secret keys or recovery codes.

Encrypting the email addresses and any Personally Identifiable Information on your users may also be a good practice, to limit which eyes can actually see the plaintext data (database provider, former developers without rotated credentials, an old backup left over..).

One issue with this though could be the inability to use the encrypted field for queries (eg: select * from users where email = 'foo@bar.com'), but OP's solution of hashing can help here: store the email encrypted, its hash in clear text, and do a query on the hash.
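
A minimal sketch of that layout, assuming a keyed hash (HMAC-SHA256, swapped in here for a plain hash) as the lookup column and a ciphertext produced by whatever encryption the application already uses. The table layout, `LOOKUP_KEY`, and the function names are made up for illustration:

```python
import hashlib
import hmac
import sqlite3

# Assumption: in practice this key is injected at runtime, not hard-coded.
LOOKUP_KEY = b"example-lookup-key"


def email_lookup_hash(email: str) -> str:
    """Deterministic keyed hash, so equality queries still work."""
    return hmac.new(LOOKUP_KEY, email.encode("utf-8"), hashlib.sha256).hexdigest()


def create_schema(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email_hash TEXT UNIQUE NOT NULL,"
        " email_ciphertext BLOB NOT NULL)"
    )


def add_user(db: sqlite3.Connection, email: str, ciphertext: bytes) -> int:
    # `ciphertext` is the email encrypted elsewhere with the application-level key.
    cur = db.execute(
        "INSERT INTO users (email_hash, email_ciphertext) VALUES (?, ?)",
        (email_lookup_hash(email), ciphertext),
    )
    return cur.lastrowid


def find_user_id(db: sqlite3.Connection, email: str):
    # Stands in for: select * from users where email = 'foo@bar.com'
    row = db.execute(
        "SELECT id FROM users WHERE email_hash = ?", (email_lookup_hash(email),)
    ).fetchone()
    return row[0] if row else None
```

The query never touches the encrypted column, so the decryption key is only needed by code that actually sends mail.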


You could also use convergent encryption (https://www.vaultproject.io/docs/secrets/transit#convergent-...) to do this with only one field.


You need to reach your users in case of a data breach, but what if you have zero “Personal Data”? You could still hash the email, I imagine.


Well the hashed email is Personal Data. https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...

Sorry for the Google link, I can't figure out how to copy the direct link on Android Chrome.


Click the link, then copy the address from address bar, if it still exists.


Is that generalization true for other forms of data theft too? I feel that way about some wrongly public S3 buckets and document leaks.


Sure if you encrypt with an application-level key it makes it harder for any adversary to use your data, as he/she will need to not only get access to your data but also obtain the encryption key to do anything with it.

Encrypting data like this is easy and can drastically reduce your attack surface.


This would not work for any serious/useful service: e-mails are not only for marketing, there are many good reasons to send one like (user requested) notifications, invoicing, ... and also screw-ups! If your service had a problem (security, broken data, invoicing again, long downtime, ...), you better contact your users before they find out on hackernews.


Even just a very thin encryption layer would probably do a decent job. Attackers are typically going at it from an infrastructure perspective: they make a hole, poke around for basic configuration info, locate the database, and siphon it out. They may or may not have enough time and knowledge to reverse-engineer a basic column-specific symmetric scheme.

The only drawback is that such a scheme must then be made available to any system that consumes the database, possibly from multiple languages.


Not any system that consumes the database, just any system that needs to send email. Unless you are using the encryption for other fields as well, I suppose.


Plaintext email could be stored client side in a cookie and may be submitted to the server when use of the email is required, and if it validates.

If the user logs in and the site is down, a backup system could email them about the issue. This is the backup system, primary systems are down. Please contact support if you need more information. No need to email users who aren't using the system currently about downtime, or in fact no need to email users if they aren't using the system.

Further, if a "password recovery" flow is modified slightly, it can be repurposed for password-less logins by using strong tokens sent to user email, as they request them. A simplified 2FA flow can be established as well, where a token is texted to the user after verifying their email address. A second layer of security on top of texting tokens can be achieved using Google Authenticator.

To use such a system, the user will need to be OK with sending their email address each time they need email from the system AND be OK with having their phone handy to login. Of course not every use case requires security, or can be used with this proposed security system.


But how do you contact users if they aren't on the site? What if you have a data breach and need to notify them or need to remove their account because they are inactive and want to give them a heads up.


If your account recovery works by sending an email... which then sets a plaintext email cookie, there's no actual auth, right?

To make this make sense, I think you are assuming but without explicitly stating the use of signed cookies? EDIT: "if it validates", I guess so.

The other bit which is not clear to me is, what is the key in the database to identify ownership of user information?

You need a linking record which looks like hash(email) -> uid (or user record or whatever) which does not seem any better than what is proposed in TFA.

OTOH if no information is stored against the user's email / uid / username then you probably don't need login or auth.
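
For reference, the signed-cookie reading of "if it validates" could look roughly like this. It is only a sketch, assuming an HMAC over the cookie payload with a server-side secret; `COOKIE_SECRET` and both function names are hypothetical:

```python
import base64
import hashlib
import hmac

# Assumption: a server-side secret that never leaves the backend.
COOKIE_SECRET = b"example-cookie-secret"


def _tag(payload: str) -> str:
    return hmac.new(COOKIE_SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_email_cookie(email: str) -> str:
    """Cookie value: base64url(email) + "." + HMAC tag over that payload."""
    payload = base64.urlsafe_b64encode(email.encode("utf-8")).decode("ascii")
    return f"{payload}.{_tag(payload)}"


def verify_email_cookie(cookie: str):
    """Return the email if the tag checks out, else None."""
    payload, sep, tag = cookie.rpartition(".")
    if not sep:
        return None
    expected = _tag(payload)
    # Compare as bytes so non-ASCII garbage cannot raise inside compare_digest.
    if not hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return None
    return base64.urlsafe_b64decode(payload.encode("ascii")).decode("utf-8")
```

This only proves the server issued the cookie for that address at some point. It does not by itself say who is holding the cookie now, which is the auth gap raised above.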


This scheme struggles in the face of email address case folding.

At the protocol level, email addresses are case-folded on the RHS but case-sensitive on the LHS. So it’s crucial that LHS case is preserved by delivery systems. Unfortunately most users then treat them as folded on both. So you can successfully verify one variant, store the downcased hash, and it’ll subsequently match but delivery bounces. Or, hash the exact original input but have many baffled users unable to access their accounts. Neither is a good outcome.

This is not an edge behaviour either, I have tons of users that mix up their email capitalisation from day to day.


No it doesn't, convert the case when generating the hash – think of it as part of the hash function. But leave the case unchanged from whatever the user entered for any steps that involve sending an e-mail to the address.


That is what produces the first failure mode described, viz. bouncing email, when delivering to a case-sensitive mailbox.


Isn't this problem orthogonal to storing hashed email addresses? You'll always have access to the email address the user typed in when you want to send a transactional email so you can perform whatever sanitization needs to be done at that point. How does storing the email in plaintext get around issues involving sending emails to case-sensitive mailboxes?


> How does storing the email in plaintext get around issues involving sending emails to case-sensitive mailboxes?

By the validation one performs at initial sign-up.

I’m implicitly saying it’s okay to require that an email address was entered with perfectly matched case at signup, which is validated by a code or link etc, and then be more forgiving about what you receive in all subsequent uses because it’s supportive of ordinary humans trying to use your product.


You could manage that with hashed emails by separately storing the case of letters in the user part, without storing which letters it's referring to. You could then apply that case to the user-provided email during checkout.

(You could probably actually do a pretty good job in most cases just by bruteforcing the possible combinations and seeing which one matches the hash, but the worst-case CPU cost would be bad for long emails.)
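
A rough sketch of the brute-force variant, assuming the stored hash is a fixed-salt SHA-256 over the verified address with the domain lower-cased. The salt value and function names are invented for the example:

```python
import hashlib
from itertools import product

FIXED_SALT = b"example-fixed-salt"  # assumption: same fixed salt for every record


def exact_hash(email: str) -> str:
    return hashlib.sha256(FIXED_SALT + email.encode("utf-8")).hexdigest()


def recover_case(typed_email: str, stored_exact_hash: str):
    """Try every upper/lower variant of the local part against the stored hash."""
    local, _, domain = typed_email.rpartition("@")
    domain = domain.lower()
    # Only ASCII letters get two options; everything else is kept as typed.
    choices = [
        (ch.lower(), ch.upper()) if ch.isascii() and ch.isalpha() else (ch,)
        for ch in local
    ]
    for variant in product(*choices):
        candidate = "".join(variant) + "@" + domain
        if exact_hash(candidate) == stored_exact_hash:
            return candidate
    return None
```

With n ASCII letters in the local part this tries up to 2**n hashes, which is where the worst-case cost comes from: 6 letters is 64 attempts, 30 letters is over a billion.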


This was proposed in another comment as well, and it’s a neat idea that unfortunately becomes fragile in the face of UTF-8 local parts, since the Unicode folding standards change over time.


Fair point, though you could probably store all non-ASCII characters in plaintext and still get most of the benefit. I suspect UTF-8 in emails is rare overall, at least for sites that don't have certain region-limited audiences.


If you have a method to validate email by bouncing, you just use it every time.


That isn't how email confirmation works, because it's a) enormously prone to false negatives, b) subject to ruinous delays, c) occurring after the transaction i.e. too late, and d) building you a reputation as a bad sender.

This is why all email address confirmation today is asking you to click a link or enter a code.

Also note that the latter strategy is rising in popularity; it is because clickable links, whilst seeming so convenient, are themselves prone to both false positives and negatives, and also increase the likelihood of ending up in junk mail.


But wouldn't the user face this issue all the time if they have a case-sensitive mailbox but type their own address in the wrong case? So the assumption may be that a user with a case-sensitive mailbox is used to typing in the correct address?


No — many accounts aren’t created by the people that use them, and in any case we shouldn’t be relying on correct repeated string input and then blaming the user for a fulfilment process failure due to a typo from hours ago that we silently accepted at the time, and (worse) we can’t even distinguish between a capitalisation error and a discontinued recipient, even if it was previously verified.

As designers/developers, it’s our problem to solve.


Wait a second. Back up. Let's be clear here: the case you are talking about is the user entering their own email address incorrectly, and you're saying we as designers/developers should make a system that knows this and sends it to the correct email? What? Huh? If Person@place.com and person@place.com are two different recipients, how the hell is my application supposed to know which one you actually mean!?


The same way we deal with all email validation. Verify it once at application signup, then rely on the precise verified form.

It is unrealistic to expect end users to get the capitalisation of their email address consistently right. It is realistic to expect it to be done right at signup, since in the best practice case this’ll include a verification loop.


> No — many accounts aren’t created by the people that use them

Can you elaborate on this? I can only think of 2 examples, but neither seem like good ones:

1- someone holds power of attorney over someone else, and registers an account (email account?) in their name. But if there's PoA involved, the 2nd person (probably?) isn't able to manage an email account on their own, so this doesn't seem a meaningful distinction to worry about. (though if it's not an untreatable condition, it's possible that they might resume using their email account themselves, I guess)

2- the account is created by your ISP when you register, and they "helpfully" choose a username for you. So from this point of view, you didn't truly "create the account"


3- Work e-mail is usually created by IT and users have no influence over it.


4 - school and academic emails are also often created the same way as work emails, following a policy without any user involvement.


5- parental/grandparental accounts. What’s more, speaking from our own support mailbox, these are the folks most likely to miscapitalize their email address.


Okay... so how is this issue currently handled? Say I create a new email account: lAsZlO@inopinatus.com and the mailbox is case sensitive.

Next I create a youtube account but as E-Mail address I enter laszlo@inopinatus.com and I am told to click the link in the confirmation E-Mail... that I never receive. Well I do not like youtube anyway so I head over to hackernews and create an account to write this comment. Oh no I can not because I typed my E-Mail address in lowercase again. So now I am wondering if there is some error so I open your homepage, search for the support page and file a bug report, typing my E-Mail address in lowercase into the E-Mail field. I never receive your response telling me to write my E-Mail address in the correct casing. I come to the conclusion that my E-Mail account just does not work and create another one at a provider that is case insensitive. OR I figure out/am told that the case is important and will never forget it again.

So where exactly are services like youtube or hn that ask me for my E-Mail responsible for handling upper/lower case correctly? The solution could still be: store both the hash of the canonicalized address and the hash of the exact address that you know worked at some point in time. If the user later enters an address that matches one but not the other, issue a warning that, should they not receive the E-Mail, they should double-check their spelling.
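
A minimal sketch of the two-hash idea, assuming a fixed-salt SHA-256 and plain lower-casing as the canonical form. The salt, the dict layout, and the return values are placeholders:

```python
import hashlib

FIXED_SALT = b"example-fixed-salt"  # assumption: one application-wide salt


def _h(value: str) -> str:
    return hashlib.sha256(FIXED_SALT + value.encode("utf-8")).hexdigest()


def hashes_for_storage(verified_email: str) -> dict:
    """Computed once, at the moment the address is verified by a link or code."""
    return {"folded": _h(verified_email.lower()), "exact": _h(verified_email)}


def check_submitted_email(typed_email: str, stored: dict) -> str:
    """Return 'ok', 'warn-case', or 'no-match' for a later submission."""
    if _h(typed_email.lower()) != stored["folded"]:
        return "no-match"
    if _h(typed_email) != stored["exact"]:
        # Same address ignoring case, but not the form that was verified.
        return "warn-case"
    return "ok"
```

On "warn-case" the service would still send to what was typed, and show the double-check-your-spelling hint next to the usual "we sent you an email" message.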


I suspect you'll have difficulty creating a case sensitive email address on any public email service.


Actually I do perform case routing & delivery in one of my Fastmail accounts. It’s not the default, but it is possible, if you’re willing to write or generate Sieve, which I am.


What is your use case for case sensitivity? Just curious.


A base58-encoded extension part, for segmenting actions due to email (and the replies/responses to/consequences of email) originated by a SaaS platform. Helps with routing of support requests in particular so we don't have to go back to the end-user to say "which <platform organisation> did you mean?" and other similar CRM-ish behaviours. Also allows us to manage bounces, spam complaints, and RTBF assertions by (organisation,end-user) tuple and similar. The discriminator string itself comes from an application subsystem where it was already generated for our state machines. The (minor) downside is outbound delivery sometimes being delayed by greylisting more frequently than otherwise.

Since I've contributed to MTAs & MDAs and built multinational ISP email services in a previous life, I'm confident that every MTA of consequence is case-preserving of envelopes, I'm happy to rely on it. (this is also why I feel on solid ground pointing out the hidden gotchas in various proposed schemes that don't perfectly accommodate the same rule)


If the email provider says that email addresses are case sensitive, then that's the truth you live with, it's not your system, not your design and you can't dictate other systems how they should work.


Do you know of any servers that are case sensitive?


Postfix. In the local user case. Alias and Virtual are not but local accounts are (because username is case sensitive)

https://serverfault.com/questions/969671/postfixdovecot-case...


It depends on configuration. I doubt very many SMTP servers are case sensitive in this day and age. This is not the case on my Postfix servers. Sendmail was also case insensitive in its default configuration (though it has been many years.)


I thought email addresses were case folded on the right hand side and site dependent on the left hand side.

The right hand side is more or less forced by the rules of DNS.

As for the left hand side, if I run the email for a site, can't I decide whether to deliver ABC@ and abc@ both to the same mailbox or to different mailboxes? And can't someone else make a different decision for their site?

If a site administrator does not have the prerogative to decide this, what rule prevents them? (And if there is such a rule, can you rely on it being enforced?)


Of course, but when processing an arbitrary email address, which will almost always be not on your site, you MUST treat the left hand side as case-sensitive (unless you have knowledge about that email domain).


site dependent means that when given an address you must treat the lhs as case sensitive. To do otherwise will mean that you've potentially broken the address and can no longer properly use it.
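
In code, the safe normalisation for an arbitrary address is tiny. A sketch, assuming an ASCII (or already punycoded) domain; the function name is made up:

```python
def fold_domain_only(email: str) -> str:
    """Lower-case the domain, leave the local part exactly as given."""
    local, at, domain = email.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("not an address of the form local@domain")
    return f"{local}@{domain.lower()}"
```

So `ABC@Example.COM` becomes `ABC@example.com`, which is still a different string from `abc@example.com`, and that is the point.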


You could store the hash of the downcased address plus a capitalization mask which tells you which letters to capitalize.

This works from a technical perspective, as letters with ambiguous capitalization (Turkish i, etc) aren't allowed in emails. It's a very minor privacy compromise: if a user has a very rare pattern of capitalization then an attacker with access to the database could identify their account. Negligible compared to the current standard.
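
Something like this, as a sketch restricted to ASCII addresses (the mask format and function names are just one way to do it):

```python
def split_case(email: str):
    """Return (downcased address, mask); '1' in the mask marks an uppercase position."""
    if not email.isascii():
        raise ValueError("sketch only handles ASCII addresses")
    mask = "".join("1" if ch.isupper() else "0" for ch in email)
    return email.lower(), mask


def apply_case(typed_email: str, mask: str) -> str:
    """Re-apply the stored mask to whatever capitalisation the user typed."""
    if not typed_email.isascii():
        raise ValueError("sketch only handles ASCII addresses")
    lowered = typed_email.lower()
    if len(lowered) != len(mask):
        raise ValueError("mask does not fit this address")
    return "".join(ch.upper() if bit == "1" else ch for ch, bit in zip(lowered, mask))
```

You would hash the downcased address for lookup, store the mask next to it, and call `apply_case` on the user's input right before sending. Note the mask also reveals the address length.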


It’s a neat idea but unfortunately RFC 6531 opened up the local part to most of UTF-8, so internationalised capitalisation is in the mix now.

Ultimately I’ll never advise delivering to email addresses other than the precise octets of the one already verified, and this means the gold standard is always folding for match and uniqueness, but delivery precisely as verified.

How about this: store the verified email address, but encrypted using the hash of the case-folded input as part of the key. The intention being, you had to have the matching folded form in hand to obtain the verified canonical form. For extra jollies, only decrypt it on the client. (cryptography warning: I write this as the idea comes to me and without any analysis of emergent properties, vulnerabilities etc)


> For extra jollies, only decrypt it on the client.

Uh... How does that work if you need the email address to send email to the user? If only the client is able to decrypt the address, you will basically have to wait until the user connects and gives you the email (which you presumably never store long-term, handling it like you'd handle credit card numbers). That severely limits what you can do with an address.

If you're okay with being technically able to access the email address, you'd probably be better off with just straight encryption. That solves data leak protections, key rotation, backups, etc.

If you want some magic ID which is known only to the user and your servers will only use it to verify identity, then why not use just passwords with client-provided KDF parameters. Your machines will never know the plain data.


> That severely limits what you can do with an address.

Assuming we’re talking in the scope of the original article, that is exactly one of the intended constraints.


True. But since so many websites (e.g. all aviation companies I've encountered) case-smash the LHS of the email address and can get away with it since all other email software has had to adapt, this is a rather minor concern by now.


By that logic, \.[a-z]{2,4}$ is still the correct way to match the TLD of an e-mail: "meh, it's obviously wrong, but everyone has to adapt."
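
To make the comparison concrete, a quick check of that pattern against longer TLDs (the addresses are made-up examples):

```python
import re

# The pattern quoted above: a dot followed by 2 to 4 lowercase letters at the end.
OLD_TLD_PATTERN = re.compile(r"\.[a-z]{2,4}$")


def old_pattern_accepts(email: str) -> bool:
    return OLD_TLD_PATTERN.search(email) is not None
```

`old_pattern_accepts("user@example.com")` is true, while `user@example.museum` and `user@example.photography` are both rejected even though those TLDs exist.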


You're trying to sound clever but it's not working, I don't think you understand what the logic actually is, if you are comparing those two.

A lot of emails get casefolded no matter what. This is enough of an inconvenience that nobody in their right mind would operate a case sensitive mailbox you would be using for account signups.


A lot of people are still using that part of a regex no matter what - I still run into this regularly (and have been, for two decades now).

"There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard."

This is EXACTLY it. Both in the local part, and in TLDs - nothing clever about people being too clever by half, and generating false negatives on their input side.


Pretty close to true in practice, actually. Yes, some of us have funky domains that we take email at, but it's not like we don't have backup plans.


"Funky"? These things are two decades old by now. There have been multiple generations of communication protocols since this became standard - and yet still people consider this some weird aberration, even the 4-letter TLDs.

Indeed, fallback is still necessary - but it doesn't follow "meh, just go back to the 3-letter maximum, because a lot of people still live in 1999."


I have a 3.2 domain on a ccTLD and even that gets shot down regularly enough that I wouldn't consider using it as a primary address. There should be no excuse for that, ccTLDs are older than a good chunk of the people writing the code excluding them, and yet here we are.

Dealing with systems in the wild sucks.


Phone keyboards love to autocapitalize things. Also, once the keyboard memorizes your email as capitalised, it will sometimes autocorrect it.


Says who? Email addresses are case insensitive. If email software treats emails as case sensitive then it is broken. People have to write email addresses on paper forms, in all caps.


Says RFC 5321 [1]: "The local-part of a mailbox MUST BE treated as case sensitive."

It _does_ recommend receivers treat it as case insensitive for maximum interoperability, so it is de facto insensitive, but something implementing it as case sensitive isn't broken.

[1] https://tools.ietf.org/html/rfc5321


It does make it broken. Broken means not working. If your software refuses an email because it's in the wrong case then that software is broken. And quoting out of an RFC is not going to make users stop complaining.

Email addresses are written in a variety of situations where preserving case is not possible. For instance on forms, or over the phone. If the IETF wants to ignore that then that's the IETF's problem, don't make it yours too.


I think it is a fair point -- when a technical standard and/or convention so vastly disagrees with common user perception, perhaps the requirement should be broadened to account for both.


On the flipside, you’re trying to tell me I should be willing to accept a lower standard than I wish to, or that I’m used to, or that has been established for decades, because of some anachronistic bureaucrat, and my response to that is a short expletive.


Wait, how can it recommend to do contrary to MUST?


huh, TIL email addresses can be case sensitive.


You, and the paper forms, are incorrect. In fact, on such forms, you should use the proper case for your email address, otherwise you are entering an incorrect address, which may be fraud.


I've filled in paper forms and written my email in lower case all the time. What am I missing?


Some things that become difficult if you don't have a verified email address for your users:

- Most common: a user has a support request because they can't get into their account (e.g. you have sign-in-with-Facebook and they lost their account there, or got banned).

- Your authentication partner (again e.g. Facebook) disables your integration for some reason - someone reports your account as abusive (maybe maliciously) and it gets locked, and your attempts to work through Facebook customer support hit a brick wall. If you have email addresses you can at least get your users back into their accounts via a reset-password style flow.

- You have a data breach, and you need to tell your users what happened and what private data of theirs was leaked to an attacker.

- You get a legal threat - a DMCA takedown message for example - and need to pass it on to your users.

- You sell your service to another company and the lawyers involved in the transaction insist on emailing out a terms of service update.

There are plenty more.


Between 'yes' and 'no' we could still have airgapped or at least segregated systems, where an email address is known, but only to the part of the system responsible for communication.


In larger systems that could be a reasonable way to build things.

Keeping email addresses in the "auth" microservice which has tighter security - blocking security team code reviews, a smaller team who are allowed to modify it for example.


Right? And especially in these event stream architectures we like to use now. We still use outbound mail queues don't we, if for no other reason than to control the blast radius of any bugs in the system.

I don't need to send you a mail, I need to tell the system that handles mail to send you a mail. As long as I spend a small amount of care on avoiding the Confused Deputy scenario (eg, open relay), that would work better and contain much of the PII to a low-traffic (network and, as you say, code delta) system.


It can be a fairly thin encryption service that encrypts provided data but doesn't decrypt. Then you decrypt in rare cases when it's needed.


This is a clever idea but limited in applicability. It is probably fine for a low security web app or game, but could still leak personal information if the db got hacked.

The problem is that the salt has to be the same for each record and that emails present a limited search space.

Imagine I stole the database for blackmailable-fetish.com. All the emails are hashed with the same salt so I can brute-force the following restricted space:

[top 200 first names][top 1000 surnames][digits from 0 - 999]@[top 5 email providers]

That would probably get me 75% of the emails - let the extortion games begin!
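
That space is only 200 × 1000 × 1000 × 5 = 1,000,000,000 candidates. A sketch of why the shared salt makes this cheap, assuming the scheme is SHA-256 over salt + address (function names and the name lists are placeholders):

```python
import hashlib
from itertools import product


def salted_hash(salt: bytes, email: str) -> str:
    return hashlib.sha256(salt + email.encode("utf-8")).hexdigest()


def candidate_addresses(first_names, surnames, numbers, providers):
    """Yield first name + surname + number at each provider."""
    for first, last, number, provider in product(first_names, surnames, numbers, providers):
        yield f"{first}{last}{number}@{provider}"


def match_leaked_hashes(leaked_hashes, salt, candidates):
    """One hash per candidate is checked against every leaked record at once."""
    leaked = set(leaked_hashes)
    found = {}
    for email in candidates:
        digest = salted_hash(salt, email)
        if digest in leaked:
            found[digest] = email
    return found
```

Because every record uses the same salt, each candidate is hashed once and compared against the whole table via a set lookup. With per-record salts the work would be multiplied by the number of records.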


Not sure why you're bringing this up. If you stored them in plaintext it gives hackers 100% of the emails with zero effort required.


Because it gives the false appearance of security. With this scheme, you always need to act as if the e-mail addresses are plaintext anyway. It should not be used.


True, but I maintain that if you are worried that hackers may steal your email database then a much better approach is to encrypt the emails with an external key.


You could just do that anyway. I've seen spam email trying to extort people by threatening to release indecent images of them (that they don't have; supposedly, they've been captured from the victim's selfie cam).

In this case, you don't have to be accurate, you're just trying to call someone's bluff, in the small case they've actually done said thing (and in addition believe you can prove it AND that payment will silence them).


You could compute a very slow hash on the client, using bcrypt or argon2id. Then your attack is still possible but a bit more expensive.
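
Neither bcrypt nor argon2id ships with Python's standard library, so this sketch swaps in scrypt from `hashlib` and runs server-side to show the shape of the idea. The salt and cost parameters are illustrative values, not recommendations:

```python
import hashlib

FIXED_SALT = b"example-fixed-salt"  # assumption: still one salt for all users


def slow_email_hash(email: str) -> str:
    """Memory-hard but deterministic hash of the address (scrypt, fixed salt)."""
    digest = hashlib.scrypt(
        email.lower().encode("utf-8"),
        salt=FIXED_SALT,
        n=2**14,  # CPU/memory cost factor; illustrative, tune for your hardware
        r=8,
        p=1,
        dklen=32,
    )
    return digest.hex()
```

The output is still usable as a lookup key because the salt is fixed; the only change is that each guess in the attack above costs far more than one SHA-256.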


If extortion is a possibility, use a separate email address. This is still a good extra layer of protection though.


Depending on the size of the user database it might be cheap enough to try all random-salts+hashed emails (if it fits in RAM it's probably cheap enough).


The salt is in config.


Sidenote, but I find this post maddening to understand, because the author seems to be using the word "e-mail" to mean both "e-mail address" and "e-mail message", and then uses ambiguous pronouns to boot:

> In conclusion, if you only use emails for transactional emails, you might be able to only store hashed versions of them.

HUH?

The most obvious way to interpret this sentence is as storing hashed versions of transaction e-mail messages. Which makes no sense and isn't what the author means, but wow this is some confusing writing.


"Most obvious" only if you make the conscious decision to ignore the context of the entire article, maybe. The author isn't a native English speaker.


Thank you for this! I realize now how confusing that is. An updated version should be up any second :)


> Earlier this year, when I went from having only Facebook-login [...] to allow registrations with email and password, one of my concerns was how to implement this is a way that protects the data and privacy of my users.

Any privacy effort is laudable. Then again, if you're serious about protecting your users' data and privacy, Facebook login is the elephant in the room.


Fully agree and you can be certain that Facebook does save your E-Mail address.

I use authentication services like auth0 and AWS cognito. The first one I think is completely safe for privacy, the second one is used for convenience (I think the service is good for stuff you host on AWS anyway, although it is generic, so it isn't restricted to that).

But using an auth-service is mostly about deferring risk of breaches to people more proficient in security. That comes with the cost that said auth service can know which services registered users are using.

The author is correct though. Even if you employ such an auth service, it can be good practice to hash the mail address or even other identifiers in your own DB (you still need that to associate state with a user).


I fully agree, so when I released the update with email/password registration, I also stopped allowing new account creations via Facebook. Now it's only supported for login for legacy reasons, and those users can disconnect their Facebook account after connecting an email.


Why would you go to the trouble of not enabling Facebook login (as long as you provide other logins methods of course)? If someone is using Facebook & Facebook login they clearly don't mind being tracked by Facebook.

I don't have a FB account fwiw.


To not give Facebook greater power?


If you enable FB in your app and then publish to Apple/iOS, they say you must support Apple-based SSO as well. Email-only avoids this heavy-handed Apple requirement.


It's all about liability. With GDPR, you want to be compliant. Also see Schneier's 'data is a toxic asset' essay.

Though I don't know about it being compliant, I suppose Facebook Login (and other forms of SSO) shifts the liability to Facebook.


Wouldn't the lack of means to contact all of your users, immediately and directly, create other compliance challenges? You would be unable to notify users of a data breach until their next login; former users might be left permanently in the dark. Similarly, being unable to push legally mandated notice of policy updates could be an impossible challenge. I can see how this proposed scheme could work day to day, but you would likely be well served to retain un-hashed emails in cold storage.


> For every transactional email I need to send out - registration, account recovery, and email change verification - the user always initiates this by submitting their email address, and it will at that time be available to the backend to perform the needed action.

This sounds like terrible UX, not to mention email use cases not initiated by the user. I really think you'd be shooting yourself in the foot by setting up a small site with this philosophy because you don't need emails right now


Good points. Though given how many emails have been leaked already, not sure sha256 with fixed salt achieves much. One can build a rainbow table with that salt fairly quickly. You might as well use bcrypt, scrypt and co.


Using something like bcrypt would definitely be better, but considering that the email is the identifier, I would have no way of retrieving the correct hash to check it against, so the salt must be fixed to allow for lookups.

I'm currently using SHA512 with a fixed salt. If someone gains access to only the database and not the salt, the emails are well protected. If someone gains access to both, then it's true that they could build a rainbow table to check if a given email exists in the database. What they _can't_ do is easily use all the emails in the database for spam/phishing/etc.

Either way, I'd argue it's better than nothing :)


Sorry, I missed the point about it being the identifier.

Though technically you can still use the bcrypt hash as your identifier, unless it has to be correlated to an external source of emails.


You can't, really... bcrypt hashes are not consistent: each hash gets its own random salt, so if you run bcrypt on the same email twice you are going to get two different hashes. You can't search your DB for the matching hash, you would have to iterate through every entry to compare.
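
To illustrate with something from the standard library (PBKDF2 with a random per-record salt standing in for bcrypt, which does the same thing internally; names and iteration count are arbitrary):

```python
import hashlib
import hmac
import os

ITERATIONS = 50_000  # illustrative cost setting


def random_salt_hash(email: str):
    """Hash with a fresh random salt, returning (salt, digest) for storage."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", email.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def matches(email: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", email.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


def find_user_by_scan(email: str, records: dict):
    """No index is possible: every stored (salt, digest) pair must be re-hashed."""
    for user_id, (salt, digest) in records.items():
        if matches(email, salt, digest):
            return user_id
    return None
```

Two calls to `random_salt_hash` with the same address give different digests, so lookup cost grows linearly with the number of users, each step paying the full slow-hash price.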


Use the password as the salt.

Still need a fixed salt but that should help.


Doesn't that break account recovery requests?


This is not a reasonable use case for rainbow tables.


Why? You can churn through sha256 hashes pretty quickly. There are probably fewer email addresses out there than there are passwords.


Why would you build a rainbow table you’re only going to use once? You’d just use hashcat for this.


When signing up for a service, I always sign up with <name-of-service>@<my-domain.com>, which makes it easy to see who sold my email address and to filter/block by service.


It's a good idea to protect user privacy. One drawback I can think of with storing a hashed email: what if the user forgets their username or email and wants to know it? (This is a common use case.) In that case you have to collect additional unique data to help the user regain access to their account, but that defeats the original purpose of protecting user privacy.


People often forget usernames but not so often e-mails. The e-mail is usually the primary/only means of identifying users, so you're not going to provide it back on request anyway.


They won't forget the email, but they might easily forget with which email they registered (happened to me).


How many e-mails could you possibly have that makes trial and error not a good solution in this scenario?


Well, first of all, I have 5 email accounts that I regularly use.

One of those uses multiple domains (it's an iCloud account that I registered way back in the iTools days, so it supports @mac.com, @me.com, and @icloud.com....possibly even @itools.com, not sure about that one).

I regularly use a + extension to the addresses that support it when I'm signing up somewhere that I don't 100% trust, and that allows arbitrary additions to the address. And I don't always remember what version of a site's name I will have used (eg, user+hn@example.com, user+hackernews@example.com, user+ycombinator@example.com, etc).

I recognize that I'm an outlier, but trial and error is completely infeasible in this scenario.


I tend to use aliases so I can filter/discover which exact service is perhaps selling my email.

ie <email prefix>+<tag>@gmail.com

That tag could be anything or everything.


You could use a combination of TOTP/google/facebook or some other side channel verification and use that to allow unlimited tries for a certain period of time to allow for more guesses? I'm thinking that for the most part people generally have/keep <5 email addresses (that they use often enough to log in with) so they should be able to just iterate through the list of emails that they've used over their lifetime and figure it out?

I do wonder though -- if the hash secret gets out then I think we're right back to where we started... it would be easy to cross-reference leaked email DBs with the dumped DB and work backwards. I'm now a bit less sure this does much for user privacy against an even slightly motivated opponent (one who would almost certainly have access to at least one dump of previously-exposed emails)...
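
The cross-referencing worry can be sketched in a few lines. Everything here is hypothetical (the salt, the addresses, the `salt + email` construction); the point is only that once the fixed salt is known, checking a list of previously leaked addresses against the dumped hashes costs one hash per candidate:

```python
import hashlib

def fixed_salt_hash(email: str, salt: bytes) -> str:
    return hashlib.sha512(salt + email.lower().encode("utf-8")).hexdigest()

def cross_reference(dumped_hashes: set, leaked_emails: list, salt: bytes) -> list:
    """Return the leaked emails whose fixed-salt hash appears in the dumped database."""
    return [e for e in leaked_emails if fixed_salt_hash(e, salt) in dumped_hashes]

salt = b"leaked-salt"  # hypothetical: the attacker obtained this with the dump
dumped = {fixed_salt_hash(e, salt) for e in ["alice@example.com", "bob@example.com"]}
leaked_list = ["alice@example.com", "carol@example.com"]

print(cross_reference(dumped, leaked_list, salt))  # ['alice@example.com']
```

Without the salt the same loop finds nothing, which is the protection the fixed salt does provide.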


The more important takeaway from this article for me is that sites should be hashing the Facebook user ID, since it's often far more personally-identifiable information than an email address.


The ID is already hashed per app on Facebook's side.


The article points out that this doesn't maintain privacy:

> I discovered that, even though the ID was unique to my FB-app, it was still possible to go to facebook.com/{id} and be redirected to the user’s FB-profile.


Maybe this is not important in your use case, but what if you have a database breach and you have to warn your users?


So far I haven’t seen a comment point this out or suggest something similar: instead of maintaining an application-level list of email addresses for use in a breach (or for other reasons), rely on the email service itself. Having sent the verification email, it already has a record of the destination, at least in a log, and during registration the address could be placed in a “verified member” list, all more or less managed within the mail service.


Article 34.3.a seems to relieve the controller of such a requirement.

  The communication to the data subject referred to in paragraph 1 shall not be required if any of the following conditions are met:
  (a) the controller has implemented appropriate technical and organisational protection measures, and those measures were applied to the personal data affected by the personal data breach, in particular those that render the personal data unintelligible to any person who is not authorised to access it, such as encryption;
https://gdpr-info.eu/art-34-gdpr/


I don't think hashing and encryption are considered equivalent in this case.


>(c) it would involve disproportionate effort. In such a case, there shall instead be a public communication or similar measure whereby the data subjects are informed in an equally effective manner.


This is the sort of content I come on HN for. It introduced me to a possibility I haven't considered, and it's followed by an interesting debate in the comments. Thank you for sharing.

The most important caveat I can think of is the ability to inform irregular users about something important, either for legal or ethical reasons. For example, my note taking app is shutting down, and users might have important things stored there. I could also message them about a deprecated feature, a change to the ToS, or ironically a data breach.

Nonetheless, it's still a good idea and I'll keep it in mind.


Storing email feels like a no-brainer for a system that needs to send messages to its customers. Some prefer phone numbers, which maybe provide stronger guarantees while being maybe not as long lasting.

As an individual, the issue is that "anon" or "throw-away" emails are not that commoditized. I heard that "login with Apple" is meant to provide an email proxy, hiding your real email, but I have not seen it deployed, except on Reddit. As good as it can be, it’s Apple only.

I can always wildcard on a domain I rent and use klingo@domain as a means to compartmentalize identifiers, but it is not low maintenance.

Still, it feels better than "login with faang".


I use wildcards and they're extremely low maintenance. Literally no maintenance required since I set them up.

Steps to reproduce: 1. register a domain name 2. register that domain with your email service provider of choice (I use Protonmail) 3. create an email address on that domain 4. set that address up as a catch-all address for the domain 5. profit


how are wildcards not low maintenance?


For lack of a better word, I guess. This is what I do for myself and my SO, but this isn’t a commodity accessible to anyone directly through their email provider.


To support: Hey, I closed my Facebook account and would like you to delete my data for me?

Oh...


They could simply check whether the hash they stored corresponds to your email address.


In this case they will have to provide their facebook id which they probably did not store and have a means of authentication which they deleted.


Good catch, I missed in the conclusion that they suggested hashing the email if necessary (earlier in the article they mentioned only hashing the identifier).


There is either Personally Identifiable Information, or there is not.

If there is then identity can be confirmed and the account deleted.

If there is not then there is nothing that needs to be deleted.


You're right that could be an issue, but hopefully anyone who registered via Facebook will take care to add an email/password to their account and disconnect Facebook before they delete their Facebook account.


I’m not sure I would trust all users to keep that in mind.


I'm sure I would trust users to not keep that in mind. Most users aren't very good at this stuff.


You must first prove it's your account.


Yeah with data breaches becoming more and more common, I really think it's irresponsible to not have a way of contacting your users. Sure, you could throw a banner up on your website - but the comms should be immediate.

This might be reasonable for a service that doesn't sell anything, or there's absolutely nothing owed to users and users have no reasonable expectation of privacy. But any commercial or professional organisation that doesn't have a method for contacting end-users is either A. Shady as fuck (numbered accounts, darknet-hosted, ignorance by design), or B. irresponsible.

This is a website whose sole purpose is to extract PPC/ad and referral revenue from its users. There's no personal information requested from users, other than "Display Name". This is actually one of the few exceptions where I think the owner of the website is being more responsible with their users' data by not keeping anything.

However, if they are breached and are serving malware to customers for a week before realising, they will have no way to tell their users they may have been affected. Or what if someone decides to install a backdoor and log the user's email and password when logging in? This is nitpicking and honestly probably 1% of websites hacked in this way actually notify their customers, but it's nevertheless still a hole in the design.

They're also likely capping their earning potential if they do plan to sell the website, as they don't have any delicious user data to sell to marketers. For which I commend Daniel and Bjorn! Well done.

I don't know, I'm thinking this is great, but also pretty bad. Maybe adding an opt-in for breach notifications would be useful, or having a third-party service to subscribe to breach notifications for the website would be the best of both worlds.


Sounds unnecessarily complicated, with no real benefit and issues down the road.

And it does feel weird to use Facebook in this example.

If you don't care for an email address, and you are using the login only for maintaining that list, use a permalink. That's probably easier and better.

One permalink for edit, one permalink for viewing.


Facebook is only mentioned in the context of how I did things previously, before I implemented the email/password registration.


What was your main motivation for putting so much effort into making sure that you are not storing the user's email address?


We did this slightly differently. For login we stored a hash of the normalized email address (all lowercase, and handling gmail's dots and plusses). For sending emails we had them encrypted in a separate database, which only the mail-sending servers had access to - not the web-facing servers. That way we didn't need to ask for the email address every time, and it was still fairly well protected.
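
A rough sketch of what such normalization could look like. The exact rules (lowercasing everything, stripping dots and `+tag` suffixes only for `gmail.com`) and the use of plain SHA-256 are assumptions for illustration; the comment above does not specify them:

```python
import hashlib

GMAIL_DOMAINS = {"gmail.com"}  # assumption: only this domain gets dot/plus handling

def normalize_email(email: str) -> str:
    """Lowercase the address; for Gmail, drop any +tag and the dots in the local part."""
    local, _, domain = email.strip().lower().rpartition("@")
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"

def login_hash(email: str) -> str:
    """Hash of the normalized address, used only for login lookups."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()

print(normalize_email("John.Doe+news@Gmail.com"))    # johndoe@gmail.com
print(normalize_email("John.Doe+news@Example.com"))  # john.doe+news@example.com
```

With this, every spelling of the same Gmail mailbox maps to one login hash, while addresses on other domains keep their dots and tags.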


I wonder what a database that supported a moral equivalent of cgroups would look like.

I can't create a record, I can't delete a record, I can't see the email field, but I can change the subscription plan for this user, or change their avatar.

We tend to do table or row level permissions, matrixed with verb. Column level occurs at the application layer, leaving plenty of room underneath for exfiltration.


I admire your dedication to keeping your users data secure, anonymous and private.

> For Wishy.gift I use SHA512 with a fixed salt

Just an FYI in case you don’t know: if you allow different accounts with the same email, then in a data breach the duplicate hashes would make that obvious. Salting with a different nonce for every row is not much harder and would protect against that case.


How would you check for duplicate entries if you use a different salt per entry?


Check every previously generated salt with the currently received email for collision.
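
A sketch of what that check implies, assuming a hypothetical table of `(salt, digest)` rows and SHA-512 over `salt + email`. Every lookup or duplicate check has to re-hash the candidate address once per stored row:

```python
import hashlib
import hmac
import os

rows = []  # stand-in for a table of (salt, digest) pairs

def row_hash(email: str, salt: bytes) -> bytes:
    return hashlib.sha512(salt + email.encode("utf-8")).digest()

def find_row(email: str):
    """Linear scan: re-hash the candidate email with each stored salt."""
    for index, (salt, digest) in enumerate(rows):
        if hmac.compare_digest(row_hash(email, salt), digest):
            return index
    return None

def add_email(email: str) -> bool:
    """Insert with a fresh per-row salt unless the email is already present."""
    if find_row(email) is not None:
        return False
    salt = os.urandom(16)
    rows.append((salt, row_hash(email, salt)))
    return True

print(add_email("a@exam.com"))  # True
print(add_email("b@exam.com"))  # True
print(add_email("a@exam.com"))  # False: found by scanning every row
```

The cost grows with the number of rows, which is the trade-off against a fixed salt and an indexed lookup.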


Let's imagine that a@exam.com and b@exam.com have the same hash, so you use different salts so that they are different. How do you know which one is which? Which salt belongs to which email?


Maybe this is a dumb question but how do you send an email if you only have the hash of the recipient's address?


As per the article, you only send them when the user requests an action, and ask them for their email at that time


Oh. I would be pretty annoyed if I had to re-enter my email address every time I perform an action that sends out a transactional email...


On this website, this only happens for actions that require entering your email address anyway.

> For every transactional email I need to send out - registration, account recovery, and email change verification - the user always initiates this by submitting their email, and it will at that time be available to the backend to perform the needed action.


So this only works if you only use the email as a username. No way to notify the user of things like ToS changes, security issues, notifications from the service, etc.


Some services don't have the desire (or need!) to send out generic service notifications.

For ToS changes, it's often enough to display them to the user when they first log in after the change. That's how many mobile apps handle it already.

The security notification thing could be an issue though. Doesn't seem to be necessary per the GDPR but it's probably a good idea to communicate security breaches.


"You might not need to store user email addresses"

Emails and email addresses are two very different things.


In technical writing within this industry, “email” is interchangeably used to mean either the protocol, an address or a message, depending on context. “If user confirmed their account by entering a confirmation code received via email or phone, that email or phone number becomes verified” is a routine sentence that will confuse no one.


> “email” is interchangeably used to mean either the protocol, an address or a message

"email" isn't a protocol. SMTP is though, and referring to RFC 5321 ("a specification of the basic protocol for Internet electronic mail transport"), section 2.3.11 [1], we see that: "As used in this specification, an "address" is a character string that identifies a user to whom mail will be sent or a location into which mail will be deposited."

https://tools.ietf.org/html/rfc5321#section-2.3.11


I stand corrected on the former, should’ve written “system” instead of “protocol”.

As a frequent reader of technical blogs and reference documentation, I stand by my point that “email” alone is used to mean email address quite frequently.


It is confusing when the referent is ambiguous. In the original title, email could mean either of two things, and there is no way to tell which it is.

As an irrelevant aside, if you insist on justifying web content, “hyphens: auto” is your friend.


Ambiguity is ever-present in human language, I’d say I rarely know precisely what I will read about when clicking a link here. Confusion between “email address” and “email message” is relatively mild, in fact (post about Kafka from not long ago comes to mind).

I am not the OP, and frankly I don’t believe full justification has a place on the web just yet. Hyphenation alone is rarely enough to make it bearable, and I think browsers’ rendering engines don’t do more than that.


Ok, we've addressed the topic in the title above.


I think you can store emails but process them only in an anonymizing publicly auditable proxy, ensuring that downstream business services do not have plaintext access whilst still being able to send outbound emails whenever you want. I wrote about it recently: https://futurice.com/blog/trustworthy-services-from-cloud-pr...

The key is to grant cloudfunctions.functions.sourceCodeGet (or the AWS/Azure equivalent) on the edge so anybody can verify that your proxy is above board. End users just have to trust the cloud provider's access controls, not the service provider's word on implementation.


Huh, considering the case (& Unicode) issue (that I wasn't aware of!), shouldn't using email addresses as logins be considered bad practice?


You use email+password for login. Does this mean that on every login attempt, you iterate through every row in the database to check for a hash match?


I mean, how's that different from 'iterating' over every row in search of an email address? Indexes solve that.


I think the others are missing the fact that if you use the same salt for every row, it's less secure. So you'll be storing the email more securely, but still not as securely as you should be storing the password.

To do it any more securely would require pulling up every single record for its salt, and hashing the login with that salt and checking it. It's virtually impossible at any real scale.


Hah, when I wrote my comment above I didn't even consider the possibility of using the same salt everywhere.

I suppose the goal here is privacy, not information security, so it's okay.


It's pretty scalable. 10 billion email addresses times 16+32+4 bytes of salt, SHA512/256, and ID is 520GB of RAM; available in a single (big) machine and searchable in under a second with a few cores.

Shard it into multiple machines for higher QPS.
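
The memory figure can be checked with a quick calculation; the field sizes are the ones given in the comment above, and the constant names are just labels for this sketch:

```python
SALT_BYTES = 16
DIGEST_BYTES = 32   # SHA-512/256 produces a 256-bit digest
ID_BYTES = 4
ADDRESSES = 10_000_000_000

bytes_per_row = SALT_BYTES + DIGEST_BYTES + ID_BYTES
total_gb = ADDRESSES * bytes_per_row / 1e9

print(bytes_per_row, total_gb)  # 52 520.0
```

One caveat on the assumed layout: a 4-byte ID can distinguish at most 2**32 (about 4.3 billion) rows, so at 10 billion rows the ID would need to be wider or scoped per shard.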


Yes, exactly as you would do to check for an email address match.


hash the email first and then check for same hash in db.


But don't you need to know which salt to use?


Don't forget to NOT store email server logs either. ;) Otherwise this exercise is kinda pointless.


... or if you do, make sure they are deleted pretty quickly afterwards (with logrotate or something comparable).

Having some logs is invaluable for debugging, but for example keeping them only for 14 days could be an acceptable compromise between debuggability and privacy.


One thing worth noting is that often, you don't even need to store passwords.

If a user wishes to log in, you send them a link/code by email. That increases security dramatically, as most email services already have some more advanced protections built-in. You also don't have to worry about leaks that much, as there are just no passwords to be leaked.
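
A minimal sketch of such a link/code login, assuming a hypothetical in-memory `pending` store, a 15-minute lifetime, and single-use tokens. Only a hash of the token is kept server-side, and actually sending the email is left out:

```python
import hashlib
import secrets
import time

TOKEN_TTL_SECONDS = 15 * 60  # hypothetical lifetime
pending = {}  # sha256(token) -> (account_id, expiry timestamp)

def _digest(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

def issue_login_token(account_id, now=None):
    """Create a random token; it would be emailed to the user as part of a link."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    pending[_digest(token)] = (account_id, now + TOKEN_TTL_SECONDS)
    return token

def redeem_login_token(token, now=None):
    """Return the account ID for a valid, unexpired token; each token works once."""
    now = time.time() if now is None else now
    entry = pending.pop(_digest(token), None)
    if entry is None:
        return None
    account_id, expires_at = entry
    return account_id if now <= expires_at else None

token = issue_login_token("account-42")
print(redeem_login_token(token))  # account-42
print(redeem_login_token(token))  # None: already used
```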


Please don't do that. It makes me very angry when I'm trying to quickly get some information from a public/shared device and instead of getting logged in, I get a link on my device (which I need to retype / send / ...). Most likely I don't care about the security of the account that much in that case. But my email login never goes anywhere close to a device I don't own.

Don't annoy users. Just let them log in.


To be fair, that scenario is increasing security.

Shared/public devices are not considered secure at all.


Haven't heard of that one.

I'd like to see an A/B test on conversion and long-term satisfaction.


That might be in breach of the GDPR. In the event of a personal data breach, you need to tell the data subjects about the breach [0]. You can’t just put a notice on your page, since someone might not be using your service any more, but you still have their data. And GDPR aside, it is very short-sighted to assume you will never ever need to e-mail users on your own.

[0]: https://gdpr-info.eu/art-34-gdpr/


Your comment was dead; I vouched for it, as it's a valid assumption that you have to notify users of a breach. But in this case, it seems like the following applies (34.3.a from your link):

> The communication to the data subject referred to in paragraph 1 shall not be required if any of the following conditions are met:

> the controller has implemented appropriate technical and organisational protection measures, and those measures were applied to the personal data affected by the personal data breach, in particular those that render the personal data unintelligible to any person who is not authorised to access it, such as encryption;

With emphasis on the "in particular those that render the personal data unintelligible". Since the email is no longer an email, it should not be counted as personal data, it's just random characters, and no notification needed.

IANAL, but that's how I understand it.


I would totally agree with you, diggan. I understand it the same way.



