I want to note a separate issue of defensive coding that comes up in the writeup:
> GitHub's forgot password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the reset password link to the email address provided by the attacker
The logical flow is:
1. Get the email address from the forgot-password request.
2. Get the email address from the database for the same account.
3. Check whether they match.
4a. If not, we're under attack -- refuse the request.
4b. If so, all is well -- send a password reset to the email address.
Of course, we know the email address twice -- we asked the user for it during the password reset process (step 1), but we never needed to do that because we already had an email address on file for the account. We retrieved that email address in step 2. We know that the two addresses are the same, but, if you look at the semantics behind the variables, in step 4b we're choosing one of these two "equivalent" options, depending on which variable we use for the email address:
1. Send the account password to the account owner.
2. Send the account password to a guy who doesn't know what the password is.
And these have very different risk profiles. Choosing the first option instead of the second would have prevented this attack without needing to worry about unicode case-translation issues. You never want to trust information you just received from an unknown user when you already have the same information from a more authoritative source.
Yes, you're absolutely right. The real bug was sending to the "wrong" matching email. But this is what makes this bug so hard to find. You're looking at "equal" strings, so why should it make a difference if you pick A or A if A === A ? :)
Hard to find, but it's the kind of thing that will hopefully come up in code review. Consider this pseudocode with helpful pseudo-hungarian notation:
username = request.post_params('username')
evil_email = request.post_params('email_address')
user = get_user_by_name(username)
good_email = user.email_address
if good_email != evil_email:
# Hackers!
else:
reset_password(user.id, evil_email) # it's fine; it's the same as user.email_address
You can see the problem in the call to reset_password. Sure, it's the same as user.email_address, but if you just used user.email_address you'd be doing the right thing and you wouldn't need the comment about "don't worry; it's fine".
Then you start to wonder if that line shouldn't look more like
reset_password(user)
and from a security perspective, it should. (I assume the reason Github allows the code to specify the email address in the first place is that an account may be associated with multiple addresses.)
If it makes no difference which email address you pick, why not pick the one that represents doing something safe instead of the one that represents doing something dangerous?
The problem is that it requires a particularly detailed way of thinking about strings, much like the popular IEEE 754 floating point number question.
If the code were instead written:
username = request.post_params('username')
email = request.post_params('email_address')
user = get_user_by_name(username)
if emails_are_equal(email, user.email_address)
reset_password(user.id, email)
Most people would not feel compelled to use user.email_address in the call to reset_password, because they're guaranteed to be equal by the previous line! The vulnerability was introduced when the author of emails_are_equal() wrote it with conversion to uppercase of the whole thing, and failed to realize that this conversion can collide with Unicode domains. They may also have done things like normalize/remove optional.dots+suffixes@gmail.com, "quoted" or (commented) local parts. There's a lot of magic - and a lot of risk - in an emails_are_equal method.
> They may also have done things like normalize/remove optional.dots+suffixes@gmail.com, "quoted" or (commented) local parts.
At least the first two (removing dots and suffixes), along with conversion to lowercase, are strictly invalid transformations which may result in an email address referring to a different account. Sure, most email providers treat the account name as case-insensitive, and the use of '+' as a label/subaccount separator is a common convention, but neither of these is required. The RFCs say that the account name is case-sensitive (unlike the domain name) and '+' is just an ordinary part of the name; any special significance is assigned by the server. Ignoring '.' characters is something specific to Gmail.
In the context of a search it's not unreasonable to ignore some of these differences, but at that point the matching names aren't "equal", just "similar". Certainly it should never be assumed that it's safe to send email to the transformed version of the address.
> There's a lot of magic - and a lot of risk - in an emails_are_equal method.
My whole point is that changing reset_password(user.id, email) to reset_password(user.id, user.email_address) neutralizes that risk. Then it doesn't matter whether emails_are_equal is risky or not, because a failure in emails_are_equal can only cause you to do a safe thing. What's less embarrassing -- "we accidentally sent a password reset for your account to you", or "we accidentally sent a password reset for your account to someone else"?
So you might not feel compelled to rewrite reset_password(user.id, email) as reset_password(user.id, user.email_address) -- but you should.
If they do have multiple email addresses associated with one account, and they don't want to send a password reset to all of them, then you can see how it would happen.
```
if evil_email not in good_email_addresses:
# Hackers!
else:
# just reset with provided email. If I thought there was a potential security issue I would have already addressed it.
```
So easy to be lazy at this point, especially under time pressure.
Sorry but that is not correct either. Code blocks on HN are created by using two or more leading spaces on each line.
Tripple backticks do nothing, and will be shown on the same line as the next line even if put on a separate line, because they are considered as regular text and part of one paragraph of text.
This is one of those cases where what we want is a tainting system for strings, not an encoding problem. The only language I've seen attempt this was Perl, and even then intermittently.
It should be made as difficult as possible to pass user input directly to something vulnerable like an email-sending API, without first laundering it through "validation". Unfortunately it can be very hard to do good validation, but in a tainting system you would find it easier to use the already-trusted value from the database for the user's email rather than the untrusted one from user input.
(Comparing two email addresses with toupper rather than doing a full RFC2822 comparison is another mistake, although I can see why nobody bothers to do it properly)
At least the first two of them are type wrappers for strings - opaque to the type checker, but transparent to the runtime. Rust and Haskell have wrappers like this. Typescript does not.
Keep the internal details private to the module so that application code doesn't concern itself with the details.
The module that sends emails only provides code that accepts a ValidatedEmailAddress and sends it an email, or accepts an UnvalidatedEmailString, records it in the database, send it an email, and Validates it. Next time you load it, if it's been validated, you get a ValidatedEmailAddress - as long as it's been validated.
Sensibly, there would be no (public) code to convert the ValidatedEmailAddress to a database id, since (a) you never application code want to run "UPDATE email_address SET address = 'ketchup@tomato.sauce' WHERE id = 5", because it needs to be validated. Likewise, you never want application code to run "INSERT INTO user_email(email, user) VALUES (5, 6)". (Something like the latter will of course occur, in the code responsible for recording invalid emails and retrieving validated emails.)
> At least the first two of them are type wrappers for strings - opaque to the type checker, but transparent to the runtime. Rust and Haskell have wrappers like this. Typescript does not.
You are right that TypeScript aliases don't work like Haskell ones (which are considered different, and type checked). In TypeScript you can use "branded types" to work around the more loose structural typing:
type Firstname = string & { readonly brand?: unique symbol }
I'm not sure I fully understand. What do you mean by step 2? If I entered "myemaıl@example.com" into the reset field, are you saying that step 2 would be the process of doing some normalization to try to find a matching account? If I reset a password, don't I only provide an email address by means of doing so? Therefore, doesn't the service merely attempt to match an email to an existing account within the DB?
I believe I understand the rest (the take-away being, however you match A to B, send the reset email to the email address stored in the DB?), just not sure about the flow beforehand.
I reply separately to observe that the flow you describe is bugged in a more obvious way: if you ask only for an email address, and then discover the related account by normalizing that address before doing a database lookup, it's a serious error to then send the reset email (which controls an account you looked up using the normalized address) to the original address. You found the account by looking up a normalized address; the original address isn't even known to be associated with the account.
In that case, there are three options:
1. Send the reset email to the address you pulled from the database. (correct)
2. Send the reset email to the normalized attacker-provided address. (wrong but "probably fine"; this is the bug I was talking about in the first place)
3. Send the reset email to the original, non-normalized attacker-provided address. (wrong and definitely a problem)
This is the flow I envisioned, which matches some large websites, but not necessarily github.
In step zero, you enter "2T1Qka0rEiPr" as the username of the account you want to hack.
In step one, github says "We have m------@e------.com on file for you. Please confirm your email address." and you enter "myemaıl@example.com".
Then github retrieves the email address associated with the username "2T1Qka0rEiPr".
You're correct that you could do this with just an email address and not a username, but that doesn't affect my criticism -- you'd still want to ultimately send the reset to the email address you pulled from the database, not the one you got from the reset request. Your takeaway is exactly right.
Another more secure method is to pull up the account information and display "We have the following methods on file for contacting you: () email 1 (potentially obscured); () email 2 (potentially obscured); () SMS (to a phone number which is potentially obscured)". If I recall correctly, that's how Twitter does it. This bypasses the need to ask the user to type in the address they'd prefer for the reset to be sent to -- you just show them some radio options, and they select the one they want. Since they never provided the address, there's no chance you'll accidentally pick their malicious address over the real address.
"But how do I make sure someone knows the email address, in order to stop strangers from spamming reset emails to addresses they might not even know?" You don't; you apply rate limiting to the reset functionality.
I've always used values I pulled from the DB even when they matched (theoretically) the value I passed into the where clause and never really had a reason why, it just felt better that way. Thanks for validating that.
> GitHub's forgot password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the reset password link to the email address provided by the attacker
The logical flow is:
1. Get the email address from the forgot-password request.
2. Get the email address from the database for the same account.
3. Check whether they match.
4a. If not, we're under attack -- refuse the request.
4b. If so, all is well -- send a password reset to the email address.
Of course, we know the email address twice -- we asked the user for it during the password reset process (step 1), but we never needed to do that because we already had an email address on file for the account. We retrieved that email address in step 2. We know that the two addresses are the same, but, if you look at the semantics behind the variables, in step 4b we're choosing one of these two "equivalent" options, depending on which variable we use for the email address:
1. Send the account password to the account owner.
2. Send the account password to a guy who doesn't know what the password is.
And these have very different risk profiles. Choosing the first option instead of the second would have prevented this attack without needing to worry about unicode case-translation issues. You never want to trust information you just received from an unknown user when you already have the same information from a more authoritative source.