Humans being the error-prone creatures we are, I wonder how it could work better to a longer word-based code, but allow for certain kinds of errors: for example, 1) word order doesn't matter, 2) it tries to correct a word not in the built-in word list to the "closest" word in it, 3) maybe you only need, say, seven of the eight words to be right to get in (then it might remind you of the real code). You can do a little calculation to see how many words you need to get to the desired number of bits then.
Pretty different from passwords as we know them (#3 means you can't simply store a bcrypt/scrypt'd code--you could build something complicated tricks to square "allow one word wrong" with not storing plain PWs: each word is 10 bits and what you tell the user is really their secret code + a parity/ECC word, and correction happens before a convnetional password check--that rabbit hole goes deep, though.) But if all this gets average folks remembering more entropy than before, that makes it kind of interesting.
On the study (which I admit just skimming), survey in 2010 suggested about half of Mechanical Turk users were located in the US (http://www.behind-the-enemy-lines.com/2010/03/new-demographi... who learned English at school would probably do better at memorizing a password in their native language. (I studied French many years in school, but I doubt I'd memorize a set of random French words as easily as English words.)
1 and 2 are easy. Sort the words and use levenshtein distance to correct unrecognized words to the nearest valid word, prior to doing the password hashing. 3 is slightly more difficult; for 7-of-8 you'd have to store 9 hashes, one for the full list, and 8 more hashes resulting from removing one of the 8 words. What you gain in user friendliness costs you since you have to store several times the hashes, and you have to run multiple hash computations for each attempted login.
It should be noted that 1 and 3 both reduce entropy significantly.
A system that lets users omit one of the words in their passphrase will end up with users omitting one of the words all the time (why would you type more than you have to, particularly for passphrases of 20-40 characters?); the effective security of an 8 word passphrase with that system is worse than if it was a 7 word passphrase with no leeway for forgetting words. With a strict comparison but a passphrase of one less word, attacker then has to guess the right 7 words. By allowing a word to be forgotten, the attacker can guess any of 8 different 7-word passphrases and get in.
Any system that reminds you of the full passphrase has to store the passphrase unhashed (or equivalent, like, with each sub-hash, storing a hint for the missing word), which is unacceptable.
The system may only store a hash of the correct password, and then try to brute force it with the entered (slightly wrong) password as a starting point.
This has several advantages:
- Nearly as secure as only accepting the correct password (as an attacker could do the brute forcing as well, without help from the system)
- Incentive for the user to remember and type the correct password, as logging in with an incorrect password takes longer
- After successfully brute forcing the password, the system can remind the user of the correct one - without having to store it!
Of course, this is not without disadvantages:
- Much more load on the server, which probably can't be offloaded to the client without leaking the hash. (But perhaps it can, by offloading parts of the calculation to the client, and only doing the final comparison on the server.)
- Depends on efficiently generating a list of likely passwords given an imperfect starting point - one needs to develop a model of likely user errors.
Assuming 56 bit passwords, and 2^20 hashes per second, one could try all 4-bit-errors in 9s and all 5-bit-errors in 8min. But 'all possible n-bit errors' is not a realistic measure, as errors wouldn't be random.
56 bit would be about 10 random letters, and e.g. assuming that the only possible errors are omissions of letters, one could forget 3 letters and would still be able to login in about 1 minute. On the other hand, an attacker without any knowledge of the password would need ~2000 CPU years to brute force the password.
(Of course the values should be tuned according to the intended security level.)
> 3 is slightly more difficult; for 7-of-8 you'd have to store 9 hashes, one for the full list, and 8 more hashes resulting from removing one of the 8 words. What you gain in user friendliness costs you since you have to store several times the hashes, and you have to run multiple hash computations for each attempted login.
Naw. Store one hash (of all 8 words, sorted), along with an (unencrypted) total function mapping words^7 → words, constructed in such a way as to maximize entropy. This can be implemented simply by storing an XOR of the (dense) index of all 8 words: XOR the 7 provided words together with the stored XOR to reveal the 8th, then hash all 8 to check the password.
(Yes, storing the XOR of the word indices unencrypted reveals information – exactly one word's worth of information, which represents knowledge of the 8th word given knowledge of the other 7. i.e., this knowledge does not reduce the search space below that of 7 sorted words. Hence this same system can be used to remind the user of the full password given the partial password as well.)
> Any system that reminds you of the full passphrase has to store the passphrase unhashed
Store it encrypted (the same way you encrypt the payload; you can keep around a single copy of the encrypted full passphrase so you don't have to add it to any actual payload). Iff you have a match of one of the hashes, you can decrypt the full passphrase and hence show it.
Yeah, I did some rough calcs (after my anti-procrastination timer expired and I couldn't waste more time here, haha). If you have ~1000 words, ~10 bits/word, seven words is 70 bits, and you lose about 12 of those by allowing reordering, and obviously 10 if you let them flub a word (via a parity word or storing 8 hashes). Seven words could be 48 bits with both kinds of variation allowed, or 58 or 60 with only one kind allowed. It might also help users remember to give them limited choice of code (pick one of 4 or 8 codes, say), which could cost you up to 2-3 more bits (on the paranoid assumption that users will pick according to predictable patterns).
How much entropy you target depends on the system. If it needs to work as a crypto key, then you really need entropy, and good luck to you. (Maybe you have folks remember a piece and write a piece down, and use an expensive key scheduley thing to stretch the concatenated result out.) If it's for a login system where you can rate-limit, then, hey, even 32 bits is a lot if the attacker has to wait one second between tries.
Just realized, if you use a parity word to recover from the user flubbing one word in a login system, you don't need to store the cleartext in order to correct the word they missed--you just redo the parity calculation on the remaining words to recover it. Yay. Also, to recover after a parity error in a seven-word phrase, you have to check seven different potential passphrases (assume word 1 was erased, apply parity to recover it, check, repeat), so I guess it takes another three bits off your security on top of the 10 of losing a word.
On users just omitting a word all the time if the system lets them, I think 1) you require them to enter the full number of words every time, 2) if you see an error recoverable using parity, first you tell them which word was wrong and ask them to re-enter it, 3) if they still can't get it, you tell them what the word was (shoulder-surfer issues obv) and they have to type it. Then you minimize how much you show, you (try to) keep the user from slowly forgetting the passphrase one word at a time, and you make sure that entering your whole passphrase is the fastest way to log in.
Though it's nice what weird stuff you can do while keeping a hashed cred DB, if it were a login code (not crypto key) and there were big wins to other aspects of the system to keeping the clear version around (for example, it let you do some significantly better error-tolerance strategy), I'm not sure you have quite the same level of obligation around a random token that you give out specifically for your app as you do around a password the user entrusts you with that they also use elsewhere.
One of the greatest problem with login systems currently is not the entropy but the min-entropy [1] of the passwords. This random keyword approach greatly reduces this main threat so I'd say It's an acceptable tradeoff to have entropy for your average user and have a reliable system as you suggested.
I don't think your parity method works for error recovery (it would probably work if it were a lot more complicated). If you store words are 10-bit values, each bit is independent from the other. Let's assume if a user makes a mistake the mistaken word value is uncorrelated with the true one. Then parity will only allow error detection, and your method of assuming erasures will simply generate N different correct word hypotheses.
Interestingly, if you don't need permutation reliability (I don't think you'd need it anyway, people create meaning associating the words in the given order), you could use an ECC to achieve M-out-of-N passphrase securely, allowing both erasure and error correction!
For example, with 4096 words, you could do 6-out-of-8 words with two errors (or 5-out-of-8 with erasures) by splitting the words in 4 3-bit groups and using a GF(8) Reed-Solomon [2] error correction ([8,5,4] code). The resulting system would have 5*12 = 60 bits of min-entropy. There's plenty of room to present some 8 choices of combinations to users and let them pick one, leaving 57 bits.
Getting to this late, but if I hear you right, generating a set of hypotheses was exactly what I was trying for with the parity. If you see a, say, 6-word input where the parity doesn't work out, you "erase" each of the six words and replace it with the parity of the other words, and check if the result is the user's password. It'd mean the user(/attacker) could make you do a 6 hash checks each time they tried to log in, so it'd cost you ~13 bits of security not just 10.
And, yes, min-entropy is a really helpful concept here. Captures well that nobody's password is a good crypto key, but at the same time it deals with the attacks trying the really common overused passwords on a lot of accounts.
Very cool that you could tolerate 2 errs/3 erasures and still have 60 bits with ECC if you trade off ordering. ECC and coding theory is one of those areas where I wish I had slightly more CS education so I could do more than just gawk at its power.
For sake of comparison and voice of experience, I actually do use fifty-six bit secrets for my various needs, as follows. The following secrets are at least fifty-six bits in size:
* Nine "C isgraph" characters chosen uniformly at random. Those are lowercase, uppercase, digits, and 32 more typewriter symbols. This is the kind of password that I actually use for most systems where I frequently need to type the password myself. I am not making any claims about which kind of password everyone should use.
* Ten alphanumeric characters chosen uniformly at random. Those are lowercase, uppercase, and digits. This is the kind of password that I use for systems where I may need to read someone the password over wired telephone service. That service has the advantage, as I understand, that eavesdropping on it is a Federal crime in the USA.
* Five words chosen uniformly at random from the Diceware list of 7,776 words. This is the kind of password I use to lock each of my AES-encrypted master password files.
In practice I have presumably degraded the security of my passwords despite my efforts. Here are two obvious ways to degrade the bit size of these secrets:
* Failing to choose the elements in entirely unpredictable way. Rolling fair dice is OK. Using something built in a standard way for the purpose of generating passwords, such as KeePass, is OK. Mentally choosing combinations of characters is not OK, no matter how quirky you think your mind is.
* Failing to keep every part of the secret completely concealed from everyone who does not own it. Letting people know that a password is a particular one of the above types is OK. Letting people see over your shoulder is not OK.
I wish I could use these kinds of passwords more often, but too many sites enforce the "must contain one number, one capital letter, and one punctuation symbol", oftentimes with a maximum password length.
Practically every bank I do business with has these limitations. By contrast, my gmail password is 24 characters long, and contains capital and lowercase letters and punctuation. It makes me wonder what their back-end systems look like that they can't handle passwords longer than that.
Absolutely insane. I'd like to see a browser plugin that will detect these sites and give you a form letter asking them to fix their security standards as well as the appropriate mailto link.
Don't want the passwords to be TOO strong, now do we? wink! wink!
Also, it would be BETTER if people used two-factor authentication, maybe forcing the use of text messages to cell phones, so that they must use something they know and something they have. I wonder... if, maybe, we forced our users to select absurd, unnatural passwords they'll have no hope of remembering, and then maybe they'll have to call or text or e-mail for help almost every single time? wink! wink!
Would you help me understand why this isn't closer to the security of a random 6 character password? It uses the first three letters from each word, and the first three letters of English words aren't random.
I'm trying to read through the source code to understand it, but the Python version seems to be performing (slightly?) different steps than the JS version. Is the Python version an accurate description of your technique?
At first glance, it seems like it could be a 60-bit secret, because a random 18 letter a-z password would be an 84-bit secret, so clearly you're taking the non-random nature into account somehow. But it's not clear to me exactly how it's 60-bit.
The page you linked to does describe the high-level process, sort of, but it doesn't give an example, or explain "This is secure because..." so I figure I should ask you.
> Would you help me understand why this isn't closer to the security of a random 6 character password?
There are about 80 characters you can type, so a 6 character secret has 262,144,000,000 permutations (80^6) which is about 38 bits (2^38 is 274,877,906,944). 2^60 is much larger: 1,152,921,504,606,846,976 and its sixth root is 1024. So 1,024 memorable "characters" are needed.
There are about 4000 unique word-based three letter prefixes:
The Javascript version is a relatively straight port from C, but uses the same algorithm as the Python.
It picks a series of three-letter prefixes from a pool of 1024, then attempts to generate a mnemonic that fits it. I'll expand the explanation to give the security properties.
You could, but I find entering long passwords annoying. I use a 9-letter (3 words, 30 bit) abbrase for my Android password, and find it's a tolerable balance between ease and security.
There's no additional security in Abbrase-generated mnemonics.
>128 bits of security (10 diceware words) not terribly difficult to memorize, easy to type. The + is just to make memorization easier, same with the capitalization. They can't hurt the entropy, and make it easier to remember.
Pretty much all of my passwords that I can't store in a password safe (encrypted file) are of this sort. That's not a ton, but memorizing 5-10 10-word phrases is pretty easy. If you can memorize Jabberwocky, this sort of thing is easy.
The uncorrelated nature of the words makes memorization harder. Using prefixes instead of precise words forgives minor memory slips and exchanging equivalent words.
A good PBKDF (e.g. high difficulty scrypt) is good for stretching password security another 20-30 bits.
Or you could use the method of loci and learn a 10 character password in 5 minutes. I wrote a simple program which creates a random password and suggests keyword images...
It was rough kivy app, which I am not committed to publishing, currently.
You can easily write your own, once you define a system for memorizing the characters.
The idea is to use animal names for upper case characters, inanimate objects for lower case, and special mnemonics for other characters you want to use. A doormat for underscore, a bomb for exclamation mark etc.
Then it's just a dictionary lookup.
The tricky thing is to make the users understand how the method of loci works (Minecraft is a good medium for this), and that you need to repeat the password at least once per day until it is completely memorized.
Of course the human brain routinely stores memories which are vastly larger than 56 bits. The trick is in mapping those to passwords. For instance, if you went to grade school in the US I claim you can easily remember the following password: ipattfotusoaattrfwisonugiwlajfa. ("I pledge allegiance to the flag...") Song lyrics tend to work particularly well.
A question is how much entropy a string like this has? Say you have 10 million English songs, if you always start a password with the first letter from a song, and choose the password to have between 20-30 letters, you only have about 26 bits of entropy in the password, which is trivial to brute force. You can increase the entropy by starting from a different word, or inserting random characters, but the entropy will never be close to the entropy of a random string of the same length.
This is an insecure approach. Dictionary style attacks now can account for phrases from vast libraries of literature, pop culture, or anything else you're likely to remember.
But the suggestion is to carry out a simple hash on the input phrase — one could use the second or third letter, for instance. (Or the first, second, third, first, second, third…)
How could a dictionary-style attack predict all possible such arrangments?
> Of course the human brain routinely stores memories which are vastly larger than 56 bits.
Usually because they are LOW-entropy. Spelling rules, grammar rules, rhymes, shared culture, all of those things erode any utility they may have had as random or unpredictable.
Pretty different from passwords as we know them (#3 means you can't simply store a bcrypt/scrypt'd code--you could build something complicated tricks to square "allow one word wrong" with not storing plain PWs: each word is 10 bits and what you tell the user is really their secret code + a parity/ECC word, and correction happens before a convnetional password check--that rabbit hole goes deep, though.) But if all this gets average folks remembering more entropy than before, that makes it kind of interesting.
On the study (which I admit just skimming), survey in 2010 suggested about half of Mechanical Turk users were located in the US (http://www.behind-the-enemy-lines.com/2010/03/new-demographi... who learned English at school would probably do better at memorizing a password in their native language. (I studied French many years in school, but I doubt I'd memorize a set of random French words as easily as English words.)