Since there are 180 million of them, presumably they were generated by a computer. It then seems likely that they fit into a finite number of patterns. If those could be determined, wouldn't this be a rather easy-to-crack captcha system?
Most captchas depend on the difficulty of the reverse transform applied to an image, especially when you don't know what the transform is. Here, the forms seem pretty regular, and the "transform" of inserting words is discrete rather than continuous, so a bit easier to reverse.
Yeah, the 180 million number is irrelevant. What's important is the number and variety of patterns, not the number of questions.
I don't even need to program something that understands all of the patterns. I just need to program something that understands some of the patterns and keeps fetching captchas until it gets one whose format it understands, e.g., "What is X plus Y?"
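A minimal sketch of that idea, assuming the arithmetic questions look roughly like "What is X plus Y?". The regular expression, the supported operators, and the function names are illustrative assumptions, not the service's actual formats:

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical question format; the real wording may differ.
ARITHMETIC = re.compile(r"what is (\d+) (plus|minus|times) (\d+)\??", re.IGNORECASE)


def try_solve(question: str) -> Optional[str]:
    """Return an answer if the question matches the known pattern, else None."""
    m = ARITHMETIC.fullmatch(question.strip())
    if m is None:
        return None
    a, op, b = int(m.group(1)), m.group(2).lower(), int(m.group(3))
    return str({"plus": a + b, "minus": a - b, "times": a * b}[op])


def first_solvable(questions: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Skip questions we do not understand; return the first (question, answer) we do."""
    for q in questions:
        answer = try_solve(q)
        if answer is not None:
            return q, answer
    return None
```

In a real attack `questions` would be a stream of freshly fetched captchas; here it is just any iterable of strings.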
This captcha system is considerably simpler to break than the typical image-based ones.
It will only be a little harder to reverse than it was to develop. Unless the generation of patterns can be distributed to attain a scale that the attackers can't match, this will fail as soon as it's protecting something worthwhile.
Not only do they follow a small number of patterns, most that involve subject matter rather than simply logic are part of a small number of categories. Almost all involve body parts, colors, names and days of the week.
I tried breaking this captcha. Here are some experimental results and mathematics:
By applying the mathematics from the Birthday Attack (http://en.wikipedia.org/wiki/Birthday_attack), if an attacker is able to solve 15.8 million of the 180 million captchas, there will be a 50% probability that the attacker can beat the captcha.
I tried refreshing the page 10 times, generating a total of 100 captchas. Out of those, I observed 8 arithmetic problems, which I entered into Wolfram Alpha and solved. That gives roughly the 15.8m/180m necessary to break the captcha with 50% probability.
At 50% probability, again going back to the Birthday Attack mathematics, an attacker would need roughly 16.8 thousand tries before expecting a collision with one they could break.
This probability will increase if an attacker is able to successfully reverse-engineer more patterns.
Edit: thinking about this more after MichaelGG's comment, I think my math is incorrect. Either way, point still stands that Wolfram Alpha can successfully solve 8% of the captchas and other patterns should be solvable by other means too.
Can you explain how the birthday paradox applies here?
Just looking at it simply, if you can solve 15.8 of 180, that means that for any given test you should have an 8.77% chance of solving it (6 tests for > 50%). What am I doing wrong?
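One way to check the arithmetic, as a sketch that assumes each captcha is drawn independently and that the attacker can solve a fixed fraction p of them. Adding 8.78% six times does pass 50%, but repeated independent attempts compound as 1 - (1 - p)^n rather than adding up, which under these assumptions gives 8 attempts:

```python
import math


def success_probability(p: float, n: int) -> float:
    """Chance of at least one solvable captcha in n independent draws."""
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p: float, target: float = 0.5) -> int:
    """Smallest n with success_probability(p, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


p = 15.8 / 180  # fraction assumed solvable, about 0.0878
print(attempts_needed(p))
print(round(success_probability(p, 6), 3))
```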
Also, it looks like some of the other questions are easy to automate. Like "how many letters are in the word 'whatever'".
A Birthday Attack relies on the fact that even for some rare events (such as two random people having the same birthday), when there is a large opportunity to observe the rare event (such as 30 random people in the same room) it's actually quite likely to observe the event.
Some of the questions don't have one globally unique answer, for example: which day is a part of the weekend, Sunday, Friday or Monday.
Where I live (Israel), Friday is part of the weekend, whereas I bet the creator of the list lives in the USA and, as often happens, believes that USA==World, so the "correct" answer is probably Sunday.
I would be interested to see what the completion rate for this is versus, e.g., the Yahoo captcha. My intuition is "not that great." (You require reading on the Internet... uh oh.)
By the way, picking one token from the captcha and returning it beats the captcha 7% of the time, if the examples are representative. Spammer wins, since he can generate requests by the hundreds of thousands.
I also noticed that certain questions aren't necessarily 'easy'.
> The 1st number from 25, eight, 6, six and 27 is?
So is the answer 25 or 6?
I've come to the realization that CAPTCHAs aren't the solution, or at least can't be a standalone solution. Make the CAPTCHA easy enough for a human to not be blocked (pick the cat from these 3 photos) and the bot still wins 33% of the time. Make it hard enough that the user has to invest energy to 'solve' the problem in front of them and you alienate users by treating them like criminals.
I agree completely! I suspect the solution to the CAPTCHA paradox will be something we haven't even considered. In fact, I don't think we can even solve it directly; we have to approach the problem from a different angle. For example, we can start requiring significant identity authorization on some high quality sites (sacrificing anonymity for responsibility), and rely on advanced filtering for the rest.
Think about it: the actual problem with spammers defeating CAPTCHAs is low quality content. I'd much prefer to expend energy trying to stop low quality content, which is often delivered by non-spammers :)
Interpretation 1: It's a string list and pick the first element = 25.
Interpretation 2: It's a list of numbers. Numbers are ordered by value, so an implicit sort is applied to them; pick the first element in that sorted sequence = 6.
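The two readings can be made concrete with a short sketch. The word-to-number table covers only the words in the quoted question, and the function names are made up for illustration:

```python
from typing import List

# Only the number words that appear in the quoted question.
NUMBER_WORDS = {"six": 6, "eight": 8}


def to_number(token: str) -> int:
    """Convert '25' or 'eight' to an integer."""
    return int(token) if token.isdigit() else NUMBER_WORDS[token.lower()]


def first_by_position(tokens: List[str]) -> int:
    """Interpretation 1: the list is taken as written."""
    return to_number(tokens[0])


def first_by_value(tokens: List[str]) -> int:
    """Interpretation 2: the numbers are implicitly sorted by value."""
    return min(to_number(t) for t in tokens)


items = ["25", "eight", "6", "six", "27"]
print(first_by_position(items), first_by_value(items))
```

Both answers are defensible, which is exactly what makes the question a poor test for humans.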
I haven't been to Yahoo in a few years, but when I registered a mail account there once, it took me at least 10 if not 15 tries to get the captcha right. I came dangerously close to blowing my stack.
I understand the importance of CAPTCHAs, but I wouldn't put anything that required a reasonable level of thought in between my users and something I wanted them to do (for example, buy something from me). The more complex CAPTCHAs get, the less likely users are to try to complete them.
It strikes me that the specific example of buying something is generally sufficient confirmation of identity even without a Captcha. Not that your point isn't valid.
Trying all the words in the captcha one by one has a good chance of "hitting" the correct answer. If it doesn't, a brute-forcer can just request a new captcha until it works.
That said, they don't seem very spam-proof to me.
For more info about how hard CAPTCHAs need to be, read Luis von Ahn's papers:
I get the impression that about half of the questions are list-based, and that the questions are about 10 words long. If that's the case, then using the algorithm you mentioned, you have a 1 in 20 chance of getting the captcha right. So yeah, it's a completely useless captcha system. That number should be 1 in a million or higher, not 1 in 20.
And the answer is probably not the first or last word. A word that precedes a ',' is probably more likely to be the answer. There are many ways to increase the hit rate.
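A rough sketch of how such guesses might be ordered. The scoring weights are arbitrary assumptions and the sample question is hypothetical; the point is only that a few cheap heuristics already beat picking a word uniformly at random:

```python
from typing import List


def rank_candidates(question: str) -> List[str]:
    """Order the words of a question from most to least likely answer."""
    tokens = question.split()
    scored = []
    for i, raw in enumerate(tokens):
        word = raw.strip(",.:;?!'\"")
        if not word:
            continue
        score = 0
        if raw.endswith(","):
            score += 2  # words followed by a comma tend to be list items
        if i in (0, len(tokens) - 1):
            score -= 1  # first and last words are less likely answers
        scored.append((-score, i, word))
    scored.sort()  # highest score first, ties broken by position
    return [word for _, _, word in scored]


guesses = rank_candidates("Which of these is a colour: dog, pink, house or mosquito?")
print(guesses[:2])
```

On this made-up question the correct answer is the second guess out of eleven words.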
I've thought about this issue before (and proof-of-concepted a similar system, see http://www.blahedo.org/botblock/), and came to similar conclusions, but there's an important difference:
A crucial part of making this a successful anti-spam system is that it is a moving target. Every user of the system must be able to write their own questions. If that happens, the spammer's task is intractable. But if there is a central site serving these, it will be worth the spammers' while to just hardcode the patterns and write a little bit of logic to parse and answer them.
Now, there's a fair bit of interesting UI design in the question of "how do I get a non-programmer to write what is in essence a very small program". My proof of concept used some cute Perl-isms to basically construct a mini-language that was restricted enough that an inexperienced programmer could "script kiddie" their way through it, and I think this is the right general direction, but you'd need a fair amount of work to really make it accessible to the masses.
(Other crucial points that he gets right: it must be text based; it must have questions that hinge on natural language understanding but not be otherwise difficult; and it must have questions that are really question templates each of which can generate infinite numbers of question instances.)
I've often wanted my own text-based CAPTCHA for a video game website I run. I'd ask things like "What is the name of the purple weapon?" or "How many shields do you start with?" People who actually play the game could nail questions like that, while bots would be up a creek.
Then he would be a douchebag... and probably a huge moron, too.
Who creates a bot specifically to overcome the CAPTCHA on a forum for a 15-year-old video game with very little traffic? We're not really a significant target; I only have to ban about one spambot per week. There's a tremendously low ROI from spambots on our forum, so I can't imagine it'd be worth anyone's time to even attempt to incorporate it into their CAPTCHA-breaking bot.
Very cool, this is one of my favorite types of captchas, because I don't have to sit and squint at the screen trying to figure out what the heck I'm supposed to type. Is it an I, or a 1? Is it an S or a 5?
At one point I was getting spammed through a contact form ( http://hhgproject.org/contact.cgi ), so I added two forms of text-based captcha: the first is a single question (that anyone visiting that particular page should know) and the second is a hidden field (hidden via CSS) that should not be changed. I haven't received any spam since.
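A server-side check for that kind of setup might look like the following sketch. The field names `answer` and `website` are assumptions for illustration, not the names used on the actual contact form:

```python
from typing import Mapping


def is_probably_human(
    form: Mapping[str, str],
    expected_answer: str,
    honeypot_field: str = "website",  # hypothetical name of the CSS-hidden field
    honeypot_default: str = "",
) -> bool:
    """Accept a submission only if the hidden field is untouched and the answer matches."""
    # Bots that fill in every field they find will change the hidden field.
    if form.get(honeypot_field, honeypot_default) != honeypot_default:
        return False
    answer = form.get("answer", "").strip().lower()
    return answer == expected_answer.strip().lower()
```

A human never sees the hidden field, so it stays at its default value; a submission that changes it can be dropped without even looking at the answer.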
I agree. The questions in the example may be easily solved by a technically minded person, but they could also confuse a large part of your audience. I would find it very difficult to generate questions that are appropriate no matter what language, culture, math or literacy background my visitors have.
Easy is not enough; captchas need to be universal too.
For non-native English speakers:
"Cheese, cat, mosquito, trousers, elbow and ant: how many body parts in the list?" or "Soup, dog, trousers, house, mosquito or pink: the colour is?" aren't as easy as "3+1" or reCaptcha. Not everybody speaks English on the interweb...
> Easy is not enough; captchas need to be universal too.
Perhaps in the long term, but solving the captcha problem for the English-speaking (or any language, for that matter) subset of the internet population is still a very worthwhile undertaking.
CAPTCHA (n.) - the outsourced laziness of your development team to your customers, in order to stunt conversion rates and signups so you don't have to be bothered to sanitize your own user lists.
It seems to me that this wouldn't work. There are not so many different types of phrasings, so it would be fairly simple to write a parser generator which would then pass to a very basic interpreter to solve them.
For example, to solve the "Which of these is a T: W, X, Y or Z?" pattern, you would just put in rules like "BodyPart ::= Foot | Knee | Leg | ...", "DayOfWeek ::= Monday | Tuesday | ..." and "Color ::= Red | Blue | ...", and then have it match against those.
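A small sketch of the matching step, using plain word sets instead of a full parser generator. The category names, the question regex, and the tiny word lists are assumptions; a real attempt would need far larger lists:

```python
import re
from typing import Optional

# Deliberately tiny word lists, taken from the rules above.
CATEGORIES = {
    "body part": {"foot", "knee", "leg"},
    "day of the week": {"monday", "tuesday"},
    "color": {"red", "blue"},
}

# Hypothetical phrasing: "Which of these is a <category>: W, X, Y or Z?"
PATTERN = re.compile(
    r"which of these is an? (?P<category>[^:]+): (?P<options>.+)\?",
    re.IGNORECASE,
)


def solve(question: str) -> Optional[str]:
    """Return the first option that belongs to the named category, else None."""
    m = PATTERN.fullmatch(question.strip())
    if m is None:
        return None
    members = CATEGORIES.get(m.group("category").strip().lower())
    if members is None:
        return None
    for option in re.split(r",\s*|\s+or\s+", m.group("options")):
        if option.strip().lower() in members:
            return option.strip()
    return None
```

Questions that use an unknown category or phrasing simply return None, so the caller can skip them and request another captcha.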
Maybe the next time I have some free time I'll see if I can go and implement it.
180 million isn't all that many. Keeping the answers in a database is trivial. Extracting the answers by trial and error is feasible. You'll probably want a large botnet to avoid getting blocked for suspiciously high traffic. If the servers can take an extra kqps or so, you should be done in about a week.
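A back-of-the-envelope check of that estimate, reading "kqps" as one thousand queries per second. The number of guesses needed per question is an assumption:

```python
SECONDS_PER_DAY = 86_400


def days_to_enumerate(questions: int, qps: float, tries_per_question: float = 1.0) -> float:
    """Days needed to send `tries_per_question` guesses for every question at `qps`."""
    return questions * tries_per_question / qps / SECONDS_PER_DAY


one_pass = days_to_enumerate(180_000_000, 1_000)        # a single guess per question
three_tries = days_to_enumerate(180_000_000, 1_000, 3)  # assumed 3 guesses per question
print(round(one_pass, 2), round(three_tries, 2))
```

A single pass takes about two days at that rate, so "about a week" corresponds to roughly three guesses per question.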
I played around with the demo page and used WolframAlpha to answer the questions for me.... With a little massaging, WolframAlpha would get you pretty far in hacking it.