Hacker News new | past | comments | ask | show | jobs | submit login

From everything I've read, the engineers did. The problem was that the security team had to go head-to-head with the budget team. And unfortunately, the budget team won - since the upper levels didn't feel that the IT security salaries were a necessary expenditure. And beyond that, there was concern that making people actually change their passwords regularly and requiring anything like security in said passwords was going to discourage users from using Yahoo and send them over to GMail.

Unfortunately... that argument wasn't wrong.




> The problem was that the security team had to go head-to-head with the budget team. //

Wouldn't engineers at such a big corp whistle-blow such incompetent decision making?

Apparently [1] they had a $1.37B net income in 2013. Given using bcrypt with a Blowfish hash and salting was pretty much a de facto standard by that point (I think that's what Wordpress were doing, hardly revolutionary security work) it seems the relative cost for Yahoo was approximately zero.

All I can imagine is that those in control were asked to leave the system open for government snooping? Why else would engineers working there not [anonymously] bring this to press attention - "hey, Yahoo security amounts to a piece of sticky tape holding a bank-vault shut".

- - -

[1] http://www.marketwatch.com/investing/stock/yhoo/financials#


It's not that hard to implement something at the start. It's more work to retrofit it on top of an existing system in a way that doesn't reduce the total security.


But would it require users to change their password?

The way I would have implemented it, but would be keen to know how secure it is, is that you start with the md5 of the password (md5(password)). You then bcrypt or scrypt that md5 (bcrypt(md5(password))) and replace the md5 in your database with the bcrypt hash.

When a user logs in, all you need to do is to calculate the md5 first then check that md5 against the bcrypt hash you have stored.

I am not a crypto expert but intuitively it doesn't look like I would have weakened the security that way. You can't really attack bcrypt(md5(password)) much more than bcrypt(password). Can you?


The method I've used is to add the column for the new stronghash then you update the old column to stronghash(<oldhash>), where <oldhash> is dumbhash(password) check against that on login stronghash(dumbhash(password)) and generate just stronghash(<password>) while you have the plaintext password in memory and update the row to add the new hash (simple and interoperable, not dependent on dumbhash) and drop the stronghash(<oldhash>). After a <longtime> limit (to optimize both maintenance overhead of the additional column / behavior and limit exposure to only minority users that haven't logged in for <longtime>), you drop the stronghash(<oldhash>) from everyone and do a "we sent you a reset email" for anyone that's trying to log in but has no <stronghash> password hash.


This is fine workflow, but keep in mind

> and do a "we sent you a reset email" for anyone that's trying to log in but has no <stronghash> password hash.

Yahoo is an email provider so many of these users won't have an external provider to refer to.


This workflow is much better than the other proposals I've read up-thread.


It's one way to do it, which is okay sometimes.

The other way is to add a new empty column for bcrypt. The next time the user logs in, you save the bcrypt hash and you remove the MD5 hash.

Over time, the active users will be migrated to the new scheme. The only issue is the abandoned accounts, they'll keep the old weak scheme.


There are other migration techniques. If you know md5(password), you can create bcrypt(md5(password)).


That's what I do, though care should be taken that you can't then login against the old passwords by putting md5(password) in the password field.

Usually you do this by decorating the bcrypt(md5(p)) entries in some way so you can recognize which ones are tested with bcrypt() vs bcrypt(md5()).


I am not sure I agree. Your way will leave all the non active users exposed in the case of a leak. They may not be active on your website but are likely active on another website using the same password.


As I said, that's an option among others, it has drawbacks.

For a website like Yahoo with billions of abandoned accounts, that's a serious drawback ^^


The problem is in collisions. Md5(password) can yield the same result for many different values of password so simply bcrypting that result means that you start with a restricted possibility space. So less secure. Punts the question to how much less secure. Seems to me it would still be worth it to do and then all new passwords going forward are done correctly.


Agree, but a collision even for md5 is a relatively rare event. When brute-forcing the bcrypt hash, this would reduce the attempts you would need to try against a given hash, but only by a very small factor. With a reasonable work factor, I would assume it would still make a brute force attack impractical at scale.

I didn't do the test, but I'd expect that there wouldn't be more than a handful of collisions for the md5 of the 100m most common passwords.

[edit] I actually I just did the test on this 10m password list and no collision

https://xato.net/today-i-am-releasing-ten-million-passwords-...


I've done it before on a 1 billion word / password list and didn't get any collisions.


That being said md5 does generate collisions. I was playing with the IMDB movie database that you can download. They use a combination of the title and the year as a primary key. I tried using an md5 instead to save space (but giving a reproducible ID instead if an identity column), and got many collisions. No collision with SHA256.


Wait, what? No MD5 collisions at all were publicly known until Xiaoyun Wang disclosed one in 2004 using a new cryptographic technique she invented (explained in Wang and Yu's "How to Break MD5 and Other Hash Functions").

MD5 has a 128-bit output so collisions that occur by chance should require about 2⁶⁴ inputs (18 exa-inputs). Surely your database didn't contain over 2⁶⁴ different movie records.

Could you take a look at what you were doing again? Your description doesn't really make sense mathematically.


You must be right. I can't reproduce it. I must have fucked something up then.


You likely goofed something up. No one has demonstrated two strings that are conceivably used as passwords that users type in -- and that includes the tuple {movie title:year} -- that have MD5 collisions.

The security problem with MD5 isn't collisions.


I think you are right, I can't reproduce it.


What you're describing is not possible given the database you tested. Are there more details that would clarify your post?


Oh, of course md5 has collisions. It's relatively easy (not computationally easy, but there are known methods) to find two random strings that hash to the same value, it's just very difficult to find a string that hashes to the value of a specific other string.


Not "relatively easy" by chance: it should require 2⁶⁴ entries in your database to see a single collision happen at random! It's only "relatively easy" following cryptographic research in the early 2000s that exploits structure in MD5 to produce collisions deliberately.

Yes, collisions are easier than preimages, but they still shouldn't occur by chance in real applications!


Realized my wording was way to ambiguous, clarified. Thanks!


Very nice. Thanks for that. So yes, this is likely the thing to do in this situation.


Unfortunately, this isn't an accurate description of the nature of the collision problem with MD5, which involves carefully crafted inputs using a sophisticated cryptographic attack -- not arbitrary user inputs that don't intend to collide with each other. See my and danielweber's comments about this down-thread.

(Yes, susceptibility to collisions was recognized as a problem with MD5 leading to a reason not to use it, but the collisions in question were constructed, not encountered accidentally. There isn't any evidence to date that the probability of a collision given two randomly chosen inputs is higher than the expected 1/2¹²⁸. You could test this yourself by hashing 2⁴⁰ random strings under MD5: you won't see a collision among the outputs!)


>Md5(password) can yield the same result for many different values of password //

Not "many different" using the normal constraints of text/numbers/typographical-marks and with maximum password lengths of 32 or so (I'll bet Yahoo's was shorter than that in 2013).

Are there any MD5 collisions in [:graph:]{,32} ?


I really doubt it. When people demonstrate MD5 collisions, they use a hex strings like

0e306561559aa787d00bc6f70bbdfe3404cf03659e70 4f8534c00ffb659c4c8740cc942feb2da115a3f4155c bb8607497386656d7d1f34a42059d78f5a8dd1ef


Yes, because MD5 digests are much shorter than 32 characters, even if it's just ascii, so by the pidgeonhole principle there must be. If you're asking if there are _known_ collisions between two messages with less than 32 printable ascii characters -- the answer is likely yes, but there are not known to me and likely not publicly known at all yet.


I thought md5 were 32 characters. But you're right every md5 hash would be in that space, so there must be collisions.


bcrypt(md5(password)) is what Yahoo! did when they switched.


Especially about it being a bad idea to make people regularly change their passwords!




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: