TextCAPTCHA: 180 million simple logic questions

moxiemk1 · on Nov 10, 2010

Since there are 180 million of them, presumably they were generated by a computer, so they probably fit into a finite number of patterns. If those patterns could be determined, wouldn't this be a rather easy captcha system to crack?

Most captchas depend on the difficulty of the reverse transform applied to an image, especially when you don't know what the transform is. Here, the forms seem pretty regular, and the "transform" of inserting words is discrete rather than continuous, so a bit easier to reverse.

mike-cardwell · on Nov 10, 2010

Yeah, the 180 million number is irrelevant. What's important is the number and variety of patterns, not the number of questions.

I don't even need to program something which will understand all of the patterns. I just need to program something that will understand some of the patterns, which just keeps fetching captchas until it eventually understands the format, eg "What is X plus Y?"
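
A minimal sketch of that approach, assuming a single hypothetical pattern ("What is X plus Y?") written with digits or small number words. The names `solve`, `to_int` and `NUMBER_WORDS` are illustrative, and the vocabulary is an assumption. Anything the pattern does not match returns None, so a bot would simply request another question.

```python
import re

# Hypothetical vocabulary; real questions may spell out larger numbers.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches "What is X plus Y?" and "What is X + Y?".
ADDITION = re.compile(r"^what is (\w+) (?:plus|\+) (\w+)\?$", re.IGNORECASE)


def to_int(token):
    """Convert '7' or 'seven' to an int, or return None if unknown."""
    token = token.lower()
    if token.isdecimal():
        return int(token)
    return NUMBER_WORDS.get(token)


def solve(question):
    """Return the answer as a string, or None if the pattern is not understood."""
    match = ADDITION.match(question.strip())
    if not match:
        return None
    left, right = to_int(match.group(1)), to_int(match.group(2))
    if left is None or right is None:
        return None
    return str(left + right)
```

Questions that return None are skipped rather than guessed, which is the "keep fetching until it understands the format" part of the idea.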

This captcha system is considerably simpler to break than the typical image-based ones.

mseebach · on Nov 10, 2010

It will only be a little harder to reverse than it was to develop. Unless the generation of patterns can be distributed to attain a scale that the attackers can't match, this will fail as soon as it's protecting something worthwhile.

ryanc · on Nov 10, 2010

Not only do they follow a small number of patterns, most that involve subject matter rather than simply logic are part of a small number of categories. Almost all involve body parts, colors, names and days of the week.

azim · on Nov 10, 2010

I tried breaking this captcha. Here are some experimental results and mathematics:

Applying the mathematics from the Birthday Attack (http://en.wikipedia.org/wiki/Birthday_attack), if an attacker is able to solve 15.8 million of the 180 million captchas, there will be a 50% probability that the attacker can beat the captcha.

I tried refreshing the page 10 times, generating a total of 100 captchas. Out of those, I observed 8 arithmetic problems which I entered into and solved using Wolfram Alpha. That gives roughly the 15.8m/180m necessary to break the captcha with 50% probability.

At 50% probability, again going back to the Birthday Attack mathematics, an attacker would need roughly 16.8 thousand tries before expecting a collision with one they could break.

This probability will increase if an attacker is able to successfully reverse-engineer more patterns.

Edit: thinking about this more after MichaelGG's comment, I think my math is incorrect. Either way, point still stands that Wolfram Alpha can successfully solve 8% of the captchas and other patterns should be solvable by other means too.
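
For reference, the usual birthday-bound approximation can be sketched as below. It estimates how many random draws from a pool of N items are needed before two draws coincide, which is a different quantity from the chance of meeting a solvable question, as the edit above concedes. `birthday_draws` is an illustrative name. For N = 180 million the result is roughly 15.8 thousand draws.

```python
import math


def birthday_draws(pool_size, probability=0.5):
    """Approximate number of uniform random draws from `pool_size` items
    before some item repeats with the given probability.

    Uses the standard approximation n ~ sqrt(2 * N * ln(1 / (1 - p))).
    """
    return math.sqrt(2 * pool_size * math.log(1 / (1 - probability)))


draws = birthday_draws(180_000_000)
print(f"about {draws:,.0f} draws for a 50% chance of a repeat")
```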

MichaelGG · on Nov 10, 2010

Can you explain how the birthday paradox applies here?

Just looking at it simply, if you can solve 15.8 of 180, that means that for any given test you should have an 8.77% chance of solving it (6 tests for > 50%). What am I doing wrong?
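
Assuming each test is an independent draw, that arithmetic can be checked with a short sketch (`attempts_needed` is an illustrative name). With 15.8 of 180 solvable it gives 8 tests to pass 50%; six tests reach about 42%.

```python
import math


def attempts_needed(p_single, target=0.5):
    """Smallest n with 1 - (1 - p_single)**n > target, assuming independent tests."""
    return math.floor(math.log(1 - target) / math.log(1 - p_single)) + 1


p = 15.8 / 180                      # chance of solving any single test
n = attempts_needed(p)              # tests needed to exceed a 50% chance
after_six = 1 - (1 - p) ** 6        # chance of at least one success in six tests

print(f"p = {p:.4f}, tests for > 50%: {n}, after six tests: {after_six:.1%}")
```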

Also, it looks like some of the other questions are easy to automate. Like "how many letters are in the word 'whatever'".
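
A sketch of how the letter-count question could be automated, assuming the word always appears in quotes after a phrase like "how many letters are in the word". The regular expression and the name `count_letters` are illustrative; other phrasings would need their own patterns.

```python
import re

# Assumed phrasing: how many letters are in the word 'whatever'
LETTER_COUNT = re.compile(
    r"how many letters (?:are )?in (?:the word )?['\"](\w+)['\"]",
    re.IGNORECASE,
)


def count_letters(question):
    """Return the letter count as a string, or None if the phrasing is unknown."""
    match = LETTER_COUNT.search(question)
    if not match:
        return None
    return str(sum(ch.isalpha() for ch in match.group(1)))
```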

tshaddox · on Nov 10, 2010

A Birthday Attack relies on the fact that even for some rare events (such as two random people having the same birthday), when there is a large opportunity to observe the rare event (such as 30 random people in the same room) it's actually quite likely to observe the event.

amih · on Nov 10, 2010

Some of the questions don't have one globally unique answer. For example: which day is part of the weekend, Sunday, Friday or Monday? Where I live (Israel), Friday is part of the weekend, whereas I bet the creator of the list lives in the USA and, as often happens, believes USA==World, so the "correct" answer is probably Sunday.

tomedme · on Nov 10, 2010

He's UK-based according to his personal pages.

Einh · on Nov 10, 2010

Stop bombing Palestinian children and we'll start giving a fuck about what you call the weekend.

bmm6o · on Nov 10, 2010

After lurking for 80 days, this is what you choose for your first comment?

patrickaljord · on Nov 10, 2010

Friday is part of the weekend for Palestinian children too.

Another problem with this system is that it only works for people who can read and understand English.

patio11 · on Nov 10, 2010

I would be interested to see what the completion rate for this is versus, e.g., the Yahoo captcha. My intuition is "not that great." (You require reading on the Internet... uh oh.)

By the way, picking one token from the captcha and returning it beats the captcha 7% of the time, if the examples are representative. Spammer wins, since he can generate requests by the hundreds of thousands.

dolinsky · on Nov 10, 2010

I also noticed that certain questions aren't necessarily 'easy'.

> The 1st number from 25, eight, 6, six and 27 is?

So is the answer 25 or 6?

I've come to the realization that CAPTCHAs aren't the solution, or at least can't be a standalone solution. Make the CAPTCHA easy enough for a human to not be blocked (pick the cat from these 3 photos) and the bot still wins 33% of the time. Make it hard enough that the user has to invest energy to 'solve' the problem in front of them and you alienate users by treating them like criminals.

cmurphycode · on Nov 10, 2010

I agree completely! I suspect the solution to the CAPTCHA paradox will be something we haven't even considered. In fact, I don't think we can even solve it directly; we have to approach the problem from a different angle. For example, we can start requiring significant identity authorization on some high quality sites (sacrificing anonymity for responsibility), and rely on advanced filtering for the rest.

Think about it: the actual problem with spammers defeating CAPTCHAs is low quality content. I'd much prefer to expend energy trying to stop low quality content, which is often delivered by non-spammers :)

I wrote a blog post expanding on these ideas: http://cmurphycode.posterous.com/the-problem-with-captcha

pitdesi · on Nov 10, 2010

I don't get how the answer could be 6, though I do agree with you on the paradox of CAPTCHAs

buro9 · on Nov 10, 2010

Interpretation 1: It's a string list and pick the first element = 25.

Interpretation 2: It's a numerical list of numbers, numbers being ordered by value have an implicit sort applied to them, pick the first element in that sequence = 6.

#2 is a very programmer thing to do ;)
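
The two readings can be sketched side by side. The parser below is a minimal assumption that only knows the number words in this one example; `parse_numbers` and `NUMBER_WORDS` are illustrative names.

```python
NUMBER_WORDS = {"six": 6, "eight": 8}  # only the words this example needs


def parse_numbers(text):
    """Turn '25, eight, 6, six and 27' into a list of ints, in listed order."""
    tokens = text.replace(" and ", ", ").split(",")
    values = []
    for token in tokens:
        token = token.strip().lower()
        values.append(int(token) if token.isdecimal() else NUMBER_WORDS[token])
    return values


numbers = parse_numbers("25, eight, 6, six and 27")
first_listed = numbers[0]            # interpretation 1: first item as written
first_sorted = sorted(numbers)[0]    # interpretation 2: first item by value

print(first_listed, first_sorted)
```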

dolinsky · on Nov 10, 2010

you get a cookie :)

GavinB · on Nov 10, 2010

Maybe the solution is to obscure the text somehow, so that a spammer can't read the text to grab a token . . . .

protomyth · on Nov 10, 2010

Do remember you really need to consider the blind in any solution.

il · on Nov 11, 2010

Not to mention that most modern captchas are being solved by people in third world countries, not bots.

Something like ReCaptcha is already for all intents and purposes bot-proof. A text captcha like this looks like a step backward.

qq66 · on Nov 11, 2010

I haven't been to Yahoo in a few years, but when I registered a mail account there once, it took me at least 10 if not 15 tries to get the captcha right. I came dangerously close to blowing my stack.

nkohari · on Nov 10, 2010

I understand the importance of CAPTCHAs, but I wouldn't put anything that required a reasonable level of thought in between my users and something I wanted them to do (for example, buy something from me). The more complex CAPTCHAs get, the less likely users are to try to complete them.

Vivtek · on Nov 10, 2010

It strikes me that the specific example of buying something is generally sufficient confirmation of identity even without a Captcha. Not that your point isn't valid.

binarymax · on Nov 10, 2010

Great concept, but some of the easier ones are very susceptible to an automated solve.

For example: "What is ten + 1?" ...in bing: http://www.bing.com/search?setmkt=en-US&q=What+is+ten+%2... ...in google: http://www.google.com/#sclient=psy&hl=en&q=what+is+t...

mitko · on Nov 10, 2010

Trying all the words in the captcha one by one has a big chance of "hitting" the correct answer. If it doesn't, a brute-forcer can just request a new captcha until it works.

That said, they don't seem very spam-proof to me.

For more info about how hard CAPTCHAs need to be, read Luis von Ahn's papers:

http://www.cs.cmu.edu/~biglou/

mike-cardwell · on Nov 10, 2010

I get the impression that about half of the questions are list based. And the questions are about 10 words long. If that's the case, then using the algorithm you mentioned, you have a 1 in 20 chance of getting the captcha right. So yeah, it's a completely useless captcha system. That number should be 1 in a million or higher, not 1 in 20.

mvalle · on Nov 10, 2010

And it's probably not the first or last word. And if it precedes a ',' then it's probably more likely to be it. There are many ways to increase the hit-rate.
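
A rough sketch of those heuristics: score each word, then try the highest-scoring guesses first. The weights are made up for illustration and `ranked_guesses` is a hypothetical name; nothing here is tuned against real questions.

```python
def ranked_guesses(question):
    """Order the words of a question by a rough guess at answer likelihood."""
    words = question.split()
    scored = []
    for index, word in enumerate(words):
        score = 0
        if word.endswith(","):
            score += 2          # words followed by a comma are likely list items
        if index in (0, len(words) - 1):
            score -= 1          # first and last words are less likely answers
        token = word.strip(",.:;?!\"'")
        if token:
            scored.append((score, index, token))
    # Highest score first; ties keep their original order in the question.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [token for _, _, token in scored]
```

On a question such as "Soup, dog, trousers, house, mosquito or pink: the colour is?" this tries the comma-separated items first, so the real answer ("pink") is still several guesses away. It raises the hit rate over picking uniformly but does not solve the question.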

mike-cardwell · on Nov 10, 2010

Sure. I know my calculation was very rough, but my point is, if I'm not at least 2 orders of magnitude out, then the captcha system is very very bad.

blahedo · on Nov 10, 2010

I've thought about this issue before (and proof-of-concepted a similar system, see http://www.blahedo.org/botblock/), and came to similar conclusions, but there's an important difference:

A crucial part of making this a successful anti-spam system is that it is a moving target. Every user of the system must be able to write their own questions. If that happens, the spammer's task is intractable. But if there is a central site serving these, it will be worth the spammers' while to just hardcode the patterns and write a little bit of logic to parse and answer them.

Now, there's a fair bit of interesting UI design in the question of "how do I get a non-programmer to write what is in essence a very small program". My proof of concept used some cute Perl-isms to basically construct a mini-language that was restricted enough that an inexperienced programmer could "script kiddie" their way through it, and I think this is the right general direction, but you'd need a fair amount of work to really make it accessible to the masses.

(Other crucial points that he gets right: it must be text based; it must have questions that hinge on natural language understanding but not be otherwise difficult; and it must have questions that are really question templates each of which can generate infinite numbers of question instances.)
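
To illustrate what a "question template" might look like, here is a small sketch in Python rather than the Perl mini-language described above. The word lists, function names and phrasings are all assumptions, and with finite lists like these the number of instances is large but not infinite.

```python
import random

COLOURS = ["pink", "red", "blue", "green"]              # hypothetical word lists
OTHERS = ["soup", "dog", "trousers", "house", "mosquito"]


def colour_question(rng):
    """Generate one (question, answer) pair from a 'which is the colour' template."""
    answer = rng.choice(COLOURS)
    items = rng.sample(OTHERS, 4) + [answer]
    rng.shuffle(items)
    listing = ", ".join(items[:-1]) + " or " + items[-1]
    return f"{listing.capitalize()}: the colour is?", answer


def sum_question(rng):
    """Generate one (question, answer) pair from an addition template."""
    a, b = rng.randint(1, 20), rng.randint(1, 20)
    return f"What is {a} plus {b}?", str(a + b)


TEMPLATES = [colour_question, sum_question]


def generate(rng=None):
    """Pick a template at random and instantiate it."""
    if rng is None:
        rng = random.Random()
    return rng.choice(TEMPLATES)(rng)
```

Each site owner writing their own `TEMPLATES` list is the "moving target" part; a central list like this one is exactly what a spammer could hardcode.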

lotharbot · on Nov 10, 2010

I've often wanted my own text-based CAPTCHA for a video game website I run. I'd ask things like "What is the name of the purple weapon?" or "How many shields do you start with?" People who actually play the game could nail questions like that, while bots would be up a creek.

darinpantley · on Nov 10, 2010

What if one of your fans created a simple bot specifically designed to answer your admittedly easy questions?

lotharbot · on Nov 10, 2010

Then he would be a douchebag... and probably a huge moron, too.

Who creates a bot specifically to overcome the CAPTCHA on a forum for a 15 year old video game with very little traffic? We're not really a significant target; I only have to ban about one spambot per week. There's a tremendously low ROI from spambots on our forum, I can't imagine it'd be worth anyone's time to even attempt to incorporate it into their CAPTCHA-breaking bot.

megamark16 · on Nov 10, 2010

Very cool, this is one of my favorite types of captchas, because I don't have to sit and squint at the screen trying to figure out what the heck I'm supposed to type. Is it an I, or a 1? Is it an S or a 5?

jerf · on Nov 10, 2010

The problem is, it turns out computer programs feel exactly the same way.

vladev · on Nov 10, 2010

I actually wrote something similar at http://stopam.com. Never been too brave to announce it officially.

spc476 · on Nov 10, 2010

At one point I was getting spammed through a contact form ( http://hhgproject.org/contact.cgi ) so I added two forms of a text based captcha---the first one is a single question (that anyone visiting that particular page should know) and a hidden field (via CSS) that should not be changed. I haven't received a spam since.
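
A sketch of what the server-side check for such a form could look like. The field names (`question`, `website`), the expected answer and the function name are hypothetical; the actual contact.cgi is not shown in this thread.

```python
EXPECTED_ANSWER = "42"      # hypothetical answer to the page-specific question
HONEYPOT_DEFAULT = ""       # value the CSS-hidden field is rendered with


def looks_human(form):
    """Accept a submission only if the question is answered correctly
    and the hidden field still holds its original value."""
    answer = form.get("question", "").strip().lower()
    honeypot = form.get("website", HONEYPOT_DEFAULT)
    return answer == EXPECTED_ANSWER and honeypot == HONEYPOT_DEFAULT
```

A bot that fills in every field it finds changes the hidden one and is rejected, even if it happens to answer the question.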

TamDenholm · on Nov 10, 2010

I know some people that would fail a few of these questions...

daten · on Nov 10, 2010

I agree. The questions in the example may be easily solved by a technically minded person, but they could also confuse a large part of your audience. I would find it very difficult to generate questions that are appropriate no matter what language, culture, math or literacy background my visitors have.

qntm · on Nov 10, 2010

One could make a case for deliberately weeding such people out of whatever system the CAPTCHA is intended to protect.

bjonathan · on Nov 10, 2010

Easy is not enough, Captcha need to be universal also.

For non-native English speakers:

"Cheese, cat, mosquito, trousers, elbow and ant: how many body parts in the list?" or "Soup, dog, trousers, house, mosquito or pink: the colour is?" aren't as easy as "3+1" or reCaptcha. Not everybody speaks English on the interweb...

eli · on Nov 10, 2010

On the plus side, these captcha are much easier on people who can't see.

mseebach · on Nov 10, 2010

> Easy is not enough, Captcha need to be universal also.

Perhaps in the long term, but solving the captcha problem for the English-speaking (or any language, for that matter) subset of the internet population is still a very worthwhile undertaking.

tropin · on Nov 10, 2010

Yes, because captchas aren't alienating enough, we should also leave non-English speakers out of our non-English websites.

joshklein · on Nov 10, 2010

CAPTCHA (n.) - the outsourced laziness of your development team to your customers, in order to stunt conversion rates and signups so you don't have to be bothered to sanitize your own user lists.

Xk · on Nov 10, 2010

It seems to me that this wouldn't work. There are not so many different types of phrasings, so it would be fairly simple to write a parser generator which would then pass to a very basic interpreter to solve them.

Maybe the next time I have some free time I'll see if I can go and implement it.

v21 · on Nov 10, 2010

If you want to produce a cheap AI for solving a particular class of problems, turn the class of problems into a CAPTCHA...

Devilboy · on Nov 10, 2010

I bet you can make this work for image tagging

dspeyer · on Nov 10, 2010

180 million isn't all that many. Keeping the answers in a database is trivial. Extracting the answers by trial and error is feasible. You'll probably want a large botnet to avoid getting blocked for suspiciously high traffic. If the servers can take an extra kqps or so, you should be done in about a week.
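
The back-of-the-envelope numbers can be checked as follows, taking "an extra kqps" as 1,000 requests per second. This assumes every request lands on a question not seen before, which ignores repeats from random serving, so it is an optimistic lower bound.

```python
QUESTIONS = 180_000_000
EXTRA_QPS = 1_000             # "an extra kqps"
SECONDS_PER_DAY = 86_400

# Time to request every question exactly once.
days_single_pass = QUESTIONS / EXTRA_QPS / SECONDS_PER_DAY

# Requests available per question if the whole run takes one week.
requests_per_question = 7 * SECONDS_PER_DAY * EXTRA_QPS / QUESTIONS

print(f"{days_single_pass:.1f} days per pass, "
      f"{requests_per_question:.1f} requests per question in a week")
```

One pass takes about two days, so a week leaves a budget of roughly three to four guesses per question for the trial-and-error step.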

joelvh · on Nov 10, 2010

I played around with the demo page and used WolframAlpha to answer the questions for me.... With a little massaging, WolframAlpha would get you pretty far in hacking it.

http://news.ycombinator.com/item?id=1891375

ComputerGuru · on Nov 10, 2010

Obligatory XKCD link: http://xkcd.com/810/

"Constructive Spam"

bbest86 · on Nov 10, 2010

The first letter in the word "titties" is?

Beware if you have users that might be sensitive to such things.

eru · on Nov 10, 2010

Yes. And here's a picture of some tits (http://www.btinternet.com/~micka.wffps/great_tit.jpeg).

noobles · on Nov 10, 2010

The first one displayed on the page for me was: "pubes" has how many letters in it?

fertel · on Nov 10, 2010

Seems as though there are very few patterns that repeat themselves in a different fashion.

For example - it would be quite easy to solve which word is capitalized - or any of the math or series questions.

flawawa2 · on Nov 10, 2010

"Ten, 33, thirty five, 10 and thirty six: the 5th number is?"

10? Thirty? Thirty Six?

confuzatron · on Nov 10, 2010

Your point is that this captcha system may prevent smart-alec pedants from commenting? Man, that's a feature not a bug.

LordLandon · on Nov 10, 2010

It should probably have random words in all capitals in each question. The way it is, if a question has a word in all caps, that's the answer.
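
A sketch of how little work that takes; the sample question in the usage line is made up, and `all_caps_words` is an illustrative name.

```python
def all_caps_words(question):
    """Return the words written entirely in capitals (ignoring one-letter words)."""
    words = [w.strip(",.:;?!\"'") for w in question.split()]
    return [w for w in words if len(w) > 1 and w.isalpha() and w.isupper()]


# Hypothetical question in the style described above.
print(all_caps_words("Which word is in capitals: house, DOG, tree or cat?"))
```

If decoy capitals were added as suggested, the list would have more than one entry and this shortcut would no longer identify the answer on its own.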

rarestblog · on Nov 10, 2010

...and also has 1 in 1 chance of automated recognition.

Jencha · on Nov 10, 2010

This may have issues with non-native speakers. You have to know the language fairly well to answer those questions.

9ec4c12949a4f3 · on Nov 10, 2010

Lovely, but I could spend $500 and have the matching answers in a nice database I could resell.