How not to design a CAPTCHA (plus.google.com)
310 points by DrewHintz on July 12, 2011 | 88 comments



I work in medical IT. You'd be surprised how many government sites do similar.

An example would be https://sso.state.mi.us/som/dch/enroll/reg_page1.jsp (You can enter any fake name/email; this is only step one of the registration script. The next page has the captcha in question.)

The captcha is plaintext, right on the page. The data from the captcha isn't even sent to the server, it is processed locally via JavaScript.

So, the bots don't even have to do anything, but humans have to input a meaningless number...

    <input type="text" name="inputNumber" class="entry-field" size="5" tabindex="3">

    <!-- ... -->

    document.write('<div id="layerNum" class="verifyNumber" align="center">');
    document.write('<b>'+str+'</b>');
    document.write('<img src="generateGIF.jsp?number='+str+'">');
    document.write('</div>');
    document.write('<input size="5" type="hidden" name="rdNumber"  value="'+str+'">');

    <!-- ... -->

    <input type="submit" value="Continue" name="submit" onclick="return Valid();">

    <!-- ... -->

    function Valid(){
    // ...
            if(chkRandomNumber()){
              return true;
            }else{
              return false;
            }
    // ...
    }

    function chkRandomNumber(){
      str1=document.all.rdNumber.value;
      str2=document.all.inputNumber.value;
      if(str1!=str2){
        alert("Please check and type the number as shown in the box");
        return false;
      }else{
        return true;
      }
    }
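
For contrast, here is a minimal sketch of the same check done on the server, so that the expected number never appears in the page source. It is written in Python rather than the site's JSP, and the names `issue_challenge`, `verify_challenge` and the in-memory `_sessions` store are hypothetical, not anything the site actually uses:

```python
import hmac
import secrets

# Hypothetical in-memory session store; a real site would use its framework's sessions.
_sessions = {}

def issue_challenge(session_id):
    """Pick the number on the server and remember it for this session."""
    expected = f"{secrets.randbelow(100000):05d}"
    _sessions[session_id] = expected
    # The value is handed to the image renderer only; it is never written into the HTML.
    return expected

def verify_challenge(session_id, submitted):
    """Compare on the server. The challenge is single-use, so it is removed either way."""
    expected = _sessions.pop(session_id, None)
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode(), submitted.strip().encode())
```

The image still has to be hard for OCR for this to mean anything, but at least the answer is no longer sitting in a hidden field.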


Wow, that is very surprising. Is the web development industry hurting that much for good programmers, or are the wrong people just being hired?


There is also a skills shortage in programmers. If people like this can get work then imagine if you actually knew about programming. Remember that next time you're thinking about your unsatisfying job or at pay review time.


> If people like this can get work then imagine if you actually knew about programming.

You're assuming the client/employer can actually distinguish between the two. I'm not sure that's the case for many jobs.


For what it's worth, I worked on a website that started receiving massive amounts of spam on its feedback page very shortly after it went live. We (as in the three programmers on the project) hated captchas with a passion. Instead we put in a field with the text "What is 1 + 1?" (If they missed it, we'd actually put in red next to it "Hint: the answer is '2'".) (Granted, we checked the value server side.)

The end result, spam disappeared and we didn't add much pain to our customers.

Most spammers likely don't go check every website to see how they can break the captcha, they just set up a script to go fill out forms and submit them.

Their solution, while not being the "awesome, technologically advanced solution", if it prevented spam, was a working solution without the complexity of actual captchas.

Furthermore, as captchas have been known to be broken, who's to say that the spammers' tool doesn't recognize valid, commonly-used captchas and break them automatically? As opposed to a field that says "Type the following word", which the spammers don't (can't easily?) check for.


The second. They probably have scammed the client in other ways as well.


"Scammed" is probably not the right word here - at least to me, it conveys a malicious intent, while mistakes like this are merely ignorance. I'm sure most of us have made mistakes just as stupid as this, despite working hard to earn our pay.


"Scammed" is exactly the right word here.

When a CAPTCHA is just simulated on the client, it's a clear indication of malicious intent (getting paid for faking real work).

That said, CAPTCHAs should not be used at all. But torturing users with a CAPTCHA while letting bots bypass it is a more advanced level of evil.


'Scammed' is not the right word. Ever heard the aphorism "never attribute to malice what can be attributed to incompetence"? Manager types (and unfortunately probably quite a few programmers) have no idea what CAPTCHAs do, and I would bet money that somewhere, somebody has vetoed a server CAPTCHA in favor of a client CAPTCHA because it sounded easier or something. I'm not saying that's what happened here, but don't say it was obviously malice when you just don't know.


Scam does not imply malice.

Usually scammers treat their victims as customers and wish them well.

In this particular example it was a combination of technical incompetence [not being able to deliver a proper CAPTCHA] with a scam [getting paid for a project that did not deliver on its promise].


Ehm, a scam with good intentions? Come on, are you a Nigerian prince?


Scammers are like parasites. They take from their victims, but aside from that they want their victims to be well.


How can this be a "mistake"? They created something that looks like a captcha to fool the client into believing it's an actual captcha. If they didn't know how to make a proper captcha it's better to tell the client so someone else does it.


What if they don't know they don't know how to make a captcha? It's an obfuscated image, those are easy to make and check! If you don't know basic things like "never trust the client", and you don't know that they exist to know, then you may not know to tell the client to have someone else do it.

That doesn't excuse the programmer. As a web programmer, it is, to some extent, their job to know when they're out of their league. But second-order knowledge can be a rare skill.


Yes, this exactly. Donald Rumsfeld got no end of flak for his comment (distilled here) "there are known knowns, known unknowns, and unknown unknowns", but it's actually a great statement - in this case, there are some people who know they know how to make captchas, some people who know they don't, and some people who don't know that they don't know.


Many programmers have no idea how a CAPTCHA is supposed to work. It never occurs to them to think through how someone would break it. Someone tells them the client wants a CAPTCHA, they go "oh yeah, that's those weird letters on the screen", and are probably pretty proud of how they did it.

Don't believe me?

Think about how often you see obvious SQL injection problems - the same (lack of!) thought process is responsible for both.


You are assuming that the client knows what a CAPTCHA is. Probably the manager at the client-side said "Oh yeah, before I forget to mention it, add that funny image you see on websites - you know, the CAPTCHA thing, a guy at my gym said it improves security. We definitely want good security in this project!".


Completely OT: I find it interesting that this post and several other HN posts this week are hosted on Google Plus. I definitely would not have predicted that G+ would encroach on the LiveJournal/Tumblr space.


On a similar tangent to your OT post: we're getting to the point where seeing (plus.google.com) would be useful, since it conveys quite a different meaning to me from (google.com).



If you use Chrome, you might be interested in http://news.ycombinator.com/item?id=2240646


Actually, yes, that did the trick for me; thanks! (Though I still think it's a feature that everyone would appreciate, so if it could go into news.yc, that would be even better.)


You will hear no disagreement from me. Glad it worked for you.


And I am impressed by how bad their URL structure is.


I'm just relieved that I can copy their URLs, email them to a friend and have them work ... since there are so many gaps in G+ sharing.


Although if the post is Limited, you just get a 404. Which isn't super friendly.


Is that important?


Yes, human-readable URLs are important.

They are important as an additional clue to users.

They are important for SEO.

But most likely it will take Google+ years to implement them (as it did with Blogger).


Does Google really care about SEO?

I very much doubt that users care about urls. (Yeah, you care, I care, sure, but nobody else.)


If you and I care about readable URLs, why wouldn't other users care? Everyone benefits from being able to read a link that they see.


Google thinks so, they use human readable permalinks on their official blogs: http://googleblog.blogspot.com/2011/07/what-do-you-love.html


Isn't that just a side effect of using Blogger? I know why people demanded that feature for Blogger (imaginary effects on PageRank), but I imagine Google has a better algorithm than looking at text in the URL for content that they host. For example, they can find the post's title right in the database.


It's kind of a problem for me, as G+ is blocked on my office network. ;) :(


Yep, for many people g+ will be the long form twitter. Google should throw some calendar / archive and search in there


OT: (FUD) All HN users will move to G+?


If anyone ever wondered what the phrase "cargo cult science" referred to, this is a prime example. They're going through all the motions, but sadly their understanding of the universe is gratuitously flawed.


+1 for cargo cults: http://en.wikipedia.org/wiki/Cargo_cult. It's a great idea to keep in mind for a creator/designer/programmer. People/users/everyone all too often intuit through imitation.


On a site I administer that used to be deluged in spam, I managed to eliminate it with a three-pass filter:

1. Simple mathematical question, e.g. "What do you get if you add five and three?" Answer is processed on the server.

2. Hidden form field that is supposed to remain blank.

3. Blacklist of common spam words.
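
The three passes above can be sketched roughly as follows. This is only an illustration: the field names (`math_answer`, `website`, `comment`) and the tiny word list are assumptions, and `expected_answer` stands for whatever the server stored when it generated the question.

```python
SPAM_WORDS = {"viagra", "casino", "payday loan"}  # illustrative blacklist only

def is_probably_spam(form, expected_answer):
    """Return True if a submitted form fails any of the three passes."""
    # 1. Math question, checked on the server against the stored answer.
    if form.get("math_answer", "").strip() != expected_answer:
        return True
    # 2. Honeypot: a field hidden from humans; bots that fill every field trip it.
    if form.get("website", "") != "":
        return True
    # 3. Blacklist of common spam words, matched case-insensitively.
    text = form.get("comment", "").lower()
    return any(word in text for word in SPAM_WORDS)
```

Ordering the cheap checks first means most bot submissions are rejected before the comment text is even scanned.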


On a forum I run (phpbb3) I eliminated 99% of the spam by adding 1 field that says "enter 42 here to prove you are human". No image, no hidden field, nothing.

We still get the occasional spammer but the real problem was our phpbb3 board showing up in the automated spam programs. As soon as we were slightly different than the default install, nearly all the spam stopped.

The interesting thing was that even the built-in captcha didn't stop the spam--it was worth cracking since everyone uses it.


Yeah, even recaptcha is broken. A new board I helped set up at my company got some spam before even being publicly announced!

On my blog I generate two random sequences of characters and tell the user to join them together without a space. This seems to have worked really well. (Though in the past I've also had static strings like "join 'bow' and 'ser' together" or "join 'doc' and 'tor' together".) I used to have the addition challenge like the GP but it was broken. My comment form was slammed with hits, so I rate-limited attempts, but a few still got through (since it's actually not a big set of responses to go through and you can defeat rate limits). That's when I implemented my string scheme and changed the comment form submission url (which only lives in Javascript now), haven't had a spammer get through yet.

On another forum I used to moderate (I think it was an Invision Powerboards one) I fixed it with a second field asking something like "What makes things fall down? gravity or noodles?" And if they entered gravity it would let them register. It lasted a few years, then a few randomly got in but by that time the forum had died.


That works great. Though the first spam botnet to specifically target your site is really going to go to town.


> 1. Simple mathematical question

Best CAPTCHA ever: http://random.irb.hr/signup.php


omg if you refresh they just get harder and harder. calculus? trig?


What I loved was when I signed up some time ago they had given me a partial derivative with a single variable, telling me what the variable was. Meaning that the answer was 0. Some of them look REALLY complex but they're actually far simpler than they appear, except for the fact that it'd be incredibly difficult to break them in practice given the variation they produce.


The answer seems to be zero nearly all the time.


I've occasionally seen it where it's -1 or 1. I think it's all three until you measure it.


Nice Schrödinger's cat reference


When you design a solution, you have to decide whether you're protecting against a targeted or a non-targeted attack. It's not all just "spam".

If your only concern is dumb, fully-automated bots that aren't targeting your site specifically (which is true for the bottom 99.5% of the web) then you don't need a CAPTCHA.

2 and 3 are great against non-targeted attacks. 1 is very weak protection against a targeted attack, and it's likely overkill that unnecessarily burdens users.


Visitors have the option to register a user account, which eliminates the spam filters.


2 and 3 are decent, as long as you don't have commenters trying to discuss something spammy (depends on the site community). #1 only works because your site isn't big enough for anybody to specifically target, though. I'm not saying it's bad (so long as it works, it's by definition at least "good enough"), just don't expect it to scale.


A system to solve problems like #1 was actually one of the very first tasks solved by early AI research. It was a PhD thesis at MIT in 1964.

http://en.wikipedia.org/wiki/STUDENT_%28computer_program%29


I should note that registered users get to skip the captcha. Right now the site gets around 1,500 visitors and 3,500 pages a day, and growth has been steady and incremental for some years.


We wanted to do something similar on a site I was involved with.

Unfortunately it wasn't allowed because the site owner pointed out that the market the site was aimed at had a reasonable number of people with cognitive difficulties - i.e., they struggled to follow multi-step instructions.

(Yes, this does mean that computers are able to solve a problem that is supposed to identify a human much better than some humans.)


I've often thought captchas were doing it wrong.

Even my pre-school self could solve the Sesame Street "one of these things is not like the other".

There are so many sets with an odd-one-out that would only be easily determinable by a human over a computer.


I've seen similar systems, such as "Which one of these four images is a puppy?". I think the problem is that the set has to be small, so it ends up being a multiple choice quiz. With one correct answer out of four or five choices, it is very easy to brute force.


The ones I've seen don't ask you to identify one single puppy. They ask you to identify all the puppies, making it rather harder to brute force:

http://thepcspy.com/read/the_cutest_humantest_kittenauth/

http://research.microsoft.com/en-us/um/redmond/projects/asir...


I have a few sites only getting about 1k visitors a month and #1 does reduce the spam a bit, but I still get 2-3 submissions a day, and I would not say these are targeted at all, just mass spam bots.



I'm disappointed to find that "What do you get if you multiply six by nine?" (http://www.wolframalpha.com/input/?i=What+do+you+get+if+you+...) just returns 54. (Cf. http://answers.yahoo.com/question/index?qid=1006050815188 .)


If you're into this, you might find this review interesting. It covers a paper from Googlers on a CAPTCHA design in which humans are asked to select the right image rotation: http://glinden.blogspot.com/2009/05/exploiting-spammers-to-m...

As always, one of the most interesting parts of truly great CAPTCHA systems is that they are advancing the state of the art in image recognition. But on the other hand we still have scams like this, and no real solutions.


Sony... some part of me had really hoped that they would overreact to the hacking movement against them, and lock themselves down like Ft. Knox.

Instead, it would seem they're taking the "we'll get hacked anyway, so let's not waste our time" approach.


The Sony CAPTCHA we are discussing here was likely written years ago (before the Sony security vulnerability scandal).

It just indicates the pathetic state of Sony's security development team - something that cannot be changed overnight.


A few years ago, or so I think, people went all crazy talking about a replacement for captchas: Show a range of images, and make the user pick the image described by a block of text.

How come nobody adopted that approach?


Because the math doesn't work. Most "next-gen" captchas fundamentally fail (by orders of magnitude) one of the many pillars that make captchas scale....

1. Is it trivial for a human to answer correctly? This affects growth.

2. Can humans do it quickly? This affects growth.

3. How is the random guess-rate? This better be abysmal.

4. How good is the “opposing” technology?

5. How is the guess rate of a sophisticated attacker, using said technology?

6. How much human input is required to create your captcha? You better be asymptotically better than human-solving the captcha.

7. What are the cultural and accessibility issues?


8. The user may have a slow computer.

I remember suggestions of using computing power to slow down guess-rates. Probably related to bitcoins. However, it doesn't work since some users don't seek better computer performance.
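
For reference, the "computing power" idea is roughly hashcash-style proof of work. Here is a minimal sketch, assuming SHA-256 and a `challenge:nonce` string format that I made up for illustration; `solve` and `verify` are hypothetical names:

```python
import hashlib
from itertools import count

def _hash_value(challenge, nonce):
    """SHA-256 of 'challenge:nonce', read as one big integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")

def solve(challenge, bits):
    """Client side: search for the first nonce whose hash has `bits` leading zero bits."""
    target = 1 << (256 - bits)
    for nonce in count():
        if _hash_value(challenge, nonce) < target:
            return nonce

def verify(challenge, nonce, bits):
    """Server side: a single hash confirms the client did the work."""
    return _hash_value(challenge, nonce) < (1 << (256 - bits))
```

Each extra bit doubles the expected work for everyone, which is exactly the problem: a setting that annoys a botnet is unusable on an old laptop.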


"Anyone can invent a security system that he himself cannot break." - http://www.schneier.com/blog/archives/2011/04/schneiers_law....

Any CAPTCHA scheme that can be solved by enumeration of all possible answers is a failure, because there are cost effective ways to hit a CAPTCHA over and over again, with cheap humans, and build the enumeration table. This is where the "pick the image with a cute thing in it" scheme falls down. In this case, once the enumeration of description -> image(s) is determined, you lose.

Any scheme that involves humans somehow creating tags or labeling images or writing text will generally be enumerable as well, because they can trivially out-manpower you.

Also, many CAPTCHA schemes use a model of spammer in which the spammer isn't permitted to be clever. If there is a pattern, in the real world the spammer is "allowed" to exploit it. There are 2^64 different ways to add two 32-bit numbers to each other, but that doesn't mean that you can beat a spammer just by asking the user to do a simple addition, because when I say "enumerate" I mean it more in the computer science sense, not the literal sense. They can and will create something that parses the problem and does it, so for instance for my stupid "add two random 32-bit numbers" example the CAPTCHA is actually easier for a computer than a human.

CAPTCHAs are hard and getting steadily harder... at least, if you require them to work. Security theater is easy.
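
To make the "add two random 32-bit numbers" point concrete, here is about all it takes to answer such a prompt. The prompt wording is an assumption, and this sketch only handles numbers written as digits (spelled-out numbers would need a small word table):

```python
import re

def solve_addition(question):
    """Pull every run of digits out of the prompt and add them up."""
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    return sum(numbers)
```

For example, `solve_addition("What is 3141592653 + 2718281828?")` returns `5859874481`, which most humans would need a calculator for.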


If you only have a limited collection of images to pick from, then bots could get decent scores by picking at random. A better approach might be to ask users to pick matching images (ie. 2^N possible choices).
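
The arithmetic behind that is short enough to write down. A rough sketch, assuming a bot that guesses uniformly at random and ignoring any image recognition; the function names are just for illustration:

```python
from math import comb

def single_pick_guess_rate(choices):
    """Pick the one right image out of `choices`."""
    return 1 / choices

def subset_guess_rate(images):
    """Each of `images` is independently in or out of the answer: 2^N possibilities."""
    return 1 / 2 ** images

def known_count_guess_rate(images, matching):
    """If the bot knows exactly how many images match, only C(N, k) subsets remain."""
    return 1 / comb(images, matching)
```

With four images and one right answer a random bot passes 25% of the time. With twelve images and an unknown number of matches it passes about once in 4096 tries, and once in 220 if it knows that exactly three match.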


What would the system use for its corpus of images and text descriptions? The corpus would have to be significantly larger than what any given attacker could manually identify. Once an attacker has manually identified an image+text combination, they could store the combination and use it to solve any future CAPTCHAs with the same image+text.


Mainly because, to quote Spolsky, Users don't read instructions.

If the captcha is ANYTHING other than immediately obvious, a significant number of users will not be able to pass it.


Need help for Open Sourcing the CAPTCHA research project. I have covered few points of CAPTCHA design in my presentation.

Here is my CAPTCHA research paper:

http://news.ycombinator.org/item?id=2754436

http://www.slideshare.net/desaiguddu/drag-and-drop-captcha-a...


Jesus, rootkits, psn, and now plaintext captchas ... the dev/it clowns at sony need to be fired en masse.


On the subject of terrible captcha systems, I found the following gem while looking for OSS games for Linux:

"You are born into WHAT? (answer is one english word)*" [1]

It is not entirely clear to me what the expected answer is. A google search for "you are born into" does not return any answer that is clearly correct. If I had to guess I would go with "sin" but I am hoping that nobody would be so ignorant as to design a captcha system that assumes a certain cultural/religious background.

[1] http://garden.sourceforge.net/drupal/?q=image/tid/3


What about just asking the user "Why would a benevolent God allow evil to exist?" and then the server checks if the answer mentions "freewill"


A slightly less clueless (but still clueless) approach to CAPTCHA design is to 1) make the CAPTCHA case-sensitive, 2) use letters for which the lower-case representation is very similar to upper-case, and/or use both zero and the letter O, 1 and the letter l, and so on, 3) use an image munging algorithm that makes it next to impossible to disambiguate the cases in 2).
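
Avoiding points 1) and 2) is cheap. A minimal sketch, assuming a challenge drawn from upper-case letters and digits; the exact set of characters to exclude is a judgment call and `make_challenge` / `matches` are hypothetical names:

```python
import secrets
import string

# Characters that are easily confused once distorted: zero/O, one/I/L.
AMBIGUOUS = set("0O1IL")
ALPHABET = [c for c in string.ascii_uppercase + string.digits if c not in AMBIGUOUS]

def make_challenge(length=6):
    """Random challenge text built only from unambiguous characters."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def matches(expected, submitted):
    """Case-insensitive comparison, so users never have to guess the case."""
    return submitted.strip().upper() == expected.upper()
```

Dropping five characters leaves 31 symbols, so a six-character challenge still has roughly 887 million possibilities.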


The problem with captchas is they have to be readable to humans.

Sure, a captcha of "lI0Ol1o" would probably be unreadable to a computer ... but it would be to a human too.

We're quickly approaching the point that image recognition is getting as good at solving image captchas as humans are, and when we do, we'll need to find some other way to do it.


Reminds me of the infamous Rapidshare Captcha with cats. [1] They were hard to solve. However, also for humans.

[1]: http://www.labnol.org/internet/favorites/cats-inside-rapidsh...


> However, also for humans.

Which is exactly why they added those. Premium accounts do not require the user to enter a captcha for every download. So every user who was annoyed by the "cats captcha" was a potential customer.


That's actually probably going to be easier for a computer to solve than a human.

A computer can do statistical sampling of many CAPTCHAs generated by the same website, and then try to reverse-engineer the image munging algorithm.

Humans, OTOH, will probably give up after 2 tries and already struggle to get |O0Il1l right.


What I think is cool are the captchas that make fake words that actually look like they could be real words (as opposed to a random string of text). Makes it easier for a human to read and figure out, but no easier for a bot. I don't know how they do that.


Markov chains: http://en.wikipedia.org/wiki/Markov_chain

Imagine picking letters with the right frequencies. Now, instead of doing that, pick pairs of letters, with the right frequency, so that each pair "chains" with the previous. If you have good pair frequency data, you can do longer than pairs and get even closer to English.
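
A minimal sketch of the pair-chaining version, assuming you have some word list to learn from (the five-word list in the usage note is a stand-in for a real dictionary). The `^` and `$` markers for word start and end are my own convention:

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each letter to the letters seen right after it ('^' = start, '$' = end)."""
    chain = defaultdict(list)
    for word in words:
        prev = "^"
        for ch in word.lower() + "$":
            chain[prev].append(ch)
            prev = ch
    return chain

def fake_word(chain, rng, max_len=10):
    """Walk the chain from the start marker until an end marker or max_len letters."""
    out, prev = [], "^"
    while len(out) < max_len:
        nxt = rng.choice(chain[prev])  # duplicates in the list give the right frequencies
        if nxt == "$":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)
```

Usage would be something like `fake_word(build_chain(["captcha", "spammer", "server", "client", "browser"]), random.Random())`. Every adjacent letter pair in the output occurred somewhere in the word list, which is what makes the result look pronounceable.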


(vowel+consonant).times(6).join('')


if(failed_attempts > 20){ ban for ten minutes }
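
Spelled out a little further, that one-liner might look like the sketch below. Keying on an IP address string and the `AttemptLimiter` name are assumptions; the thresholds are the ones from the line above.

```python
import time

class AttemptLimiter:
    """Ban a client for a while once it exceeds a number of failed attempts."""

    def __init__(self, max_failures=20, ban_seconds=600, clock=time.monotonic):
        self.max_failures = max_failures
        self.ban_seconds = ban_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.failures = {}          # client key -> failed attempts so far
        self.banned_until = {}      # client key -> time the ban ends

    def is_banned(self, key):
        until = self.banned_until.get(key)
        return until is not None and self.clock() < until

    def record_failure(self, key):
        self.failures[key] = self.failures.get(key, 0) + 1
        if self.failures[key] > self.max_failures:
            self.banned_until[key] = self.clock() + self.ban_seconds
            self.failures[key] = 0  # start counting again after the ban
```

As noted elsewhere in the thread, per-client limits like this can be sidestepped by an attacker with many addresses, so it is a speed bump and nothing more.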


I dislike long, nonsensical captchas that confuse people; they're totally annoying. A few years ago I used a 5 digit captcha, but in the background I added faded small letters at various angles.


Unfortunately the faded small letters probably did not make your captcha more difficult to crack. It's relatively simple to remove all grey pixels from the image before OCRing it.

However, having your own custom captcha probably helped quite a bit. I'm guessing spammers aren't going to bother writing custom software to decode your captcha unless you have a major site.
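
To show how little the grey-pixel removal involves, here is a sketch that assumes the image has already been loaded as rows of 0-255 greyscale values (0 is black) and that the real digits are darker than the background noise. The threshold of 128 is arbitrary:

```python
def strip_faint_pixels(image, threshold=128):
    """Keep dark strokes as black and blank everything lighter to white."""
    return [[0 if px < threshold else 255 for px in row] for row in image]
```

Anything fainter than the threshold simply disappears, leaving clean black-on-white digits for the OCR step.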


Yes, of course. But you can choose certain colors other than gray that look faded to the eye but are not faded in their RGB values. In any case, captchas are always mediocre solutions.


DON'T use a bloody CAPTCHA.


I can't believe Google is criticizing how Sony does CAPTCHAs when I've been complaining for years about how difficult Google's are to read. But as to their point, based on Sony's recent security issues, it doesn't sound like Sony has a very good IT department.


It's not Google criticizing Sony, it's Andrew Hintz posting on his Google+ page.



