Everybody is spamming everybody else on Mechanical Turk (openresearch.wordpress.com)
66 points by Siah on March 9, 2011 | 28 comments



You don't ask assembly line workers to build an amazing car on their own in a single step. Similarly, you shouldn't ask low-paid information workers to synthesize amazing text on their own in a single step.

I think that your HIT design highlights several common mistakes requesters make on MTurk:

- You are underpaying for the task (would you write a good review of Berkeley, CA for $1 for a stranger?)

- You provide no aggregation or verification step to ensure that turkers know their work should jibe with other turkers' output. You also give no indication that such verification is possible or likely to happen.

- Your task output is poorly defined and open to interpretation. You may have asked a straightforward question, but I assume you placed a blank textbox on the screen and expected well-formed paragraphs in return.

If you want a great example of text synthesis of relatively high quality using MTurk for prices in the range of your budget, see http://borismus.com/crowdforge/

If you want to learn more about how to design HIT workflows, see http://projects.csail.mit.edu/soylent/ (disclosure: I share an office with and work with Michael Bernstein, but not on this work). One of Soylent's contributions was the Find-Fix-Verify design pattern, which helps with some of the problems you raise.
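
To make the pattern concrete, here is a minimal sketch of Find-Fix-Verify in Python. The post_hits and collect_answers helpers are hypothetical stand-ins for whatever MTurk client or wrapper you actually use; only the three-stage structure is the point.

    # Sketch of the Find-Fix-Verify pattern (Bernstein et al., Soylent).
    # post_hits() and collect_answers() are hypothetical helpers standing in
    # for whatever MTurk client you actually use.
    from collections import Counter

    def find_fix_verify(paragraph, post_hits, collect_answers):
        # FIND: several workers independently mark spans that need work.
        find_ids = post_hits(task="find",
                             prompt="Select a phrase that needs editing.",
                             text=paragraph, assignments=5)
        spans = collect_answers(find_ids)
        # Keep only spans that at least two workers agreed on.
        agreed = [span for span, n in Counter(spans).items() if n >= 2]

        patches = {}
        for span in agreed:
            # FIX: a separate set of workers rewrites each agreed-on span.
            fix_ids = post_hits(task="fix",
                                prompt="Rewrite the highlighted phrase.",
                                text=paragraph, span=span, assignments=3)
            candidates = collect_answers(fix_ids)

            # VERIFY: a third set of workers votes on the candidate rewrites.
            verify_ids = post_hits(task="verify",
                                   prompt="Pick the best rewrite (or 'keep original').",
                                   text=paragraph, span=span,
                                   options=candidates, assignments=5)
            votes = Counter(collect_answers(verify_ids))
            best, _ = votes.most_common(1)[0]
            if best != "keep original":
                patches[span] = best
        return patches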

Your task is even harder, of course, since you require subject-matter experts in a fictional location. So perhaps MTurk is the wrong crowd for your task.


I run a company called CrowdFlower that provides quality control on top of Mechanical Turk and other pools of workers, from traditional outsourcing companies to offerwalls (where people earn in-game credits for doing our tasks).

I think this article doesn't reflect everyone's experience with Mechanical Turk. We get lots of high quality work out of Mechanical Turk and lots of other companies do as well. It does take a fair amount of work to get the quality right - that's how we got started as a business and that's why many people still come to us.

As an aside, if the author of the article is reading this thread and wants data, we would be happy to talk about it.


The article raises an interesting point: that many turkers just assume there is no quality assurance being done on the requester end and everything will automatically be accepted and paid for. Since it is tricky to automate QA for huge sets of tasks I would guess this assumption is mostly correct, and turkers take advantage of it.


"Tricky to automate" what? Are you not literally in the middle of using a tool that helps you automate QA for huge sets of tasks?

It should be trivial to create a task, a second task for evaluating the first, and a third task for evaluating the second. Run all three long enough and you will in fact get good results.

Obviously if you're going to use an unreliable protocol there have to be management protocols in effect to correct errors, or you will end up with errors. This is not a revelation.


This should be easy, but it is not. Many, many requesters submit tasks to Turk from the Amazon-provided UI, or some other simplified UI with no concept of a workflow, which makes this stupidly hard.

So you'd think this tool would do it for you, but instead you need another layer on top: either one you code yourself, or some third-party tool like CrowdFlower.


I meant that it is tricky to automate QA without feeding it back into Mechanical Turk for manual evaluation, for example by classifying the task results as good or bad. This is an active area of research (see for example [1]).

Even if, in theory, feeding the results back into Mechanical Turk for manual evaluation will correct errors, there are still huge tradeoffs in practice.

Suppose you had three people look at a task and say whether it was done correctly or not, and we pick the most popular choice of the three. This works fine for tasks like speech transcription, where it is easy to tell if the work was done correctly. But what about tasks like labeling features in biological images? Surprisingly, even if you show people examples of what is correct and what is not, they still have a hard time distinguishing between the two. These are the kinds of difficult tasks that are especially in need of QA.

If the people evaluating correctness are only right 60 percent of the time, you'd better have more than 3 people vote on whether it's correct, just to get a good estimate. (Also, we are assuming people are biased toward the correct answer, rather than toward the wrong answer or toward a fixed response.) If you need a lot of people to evaluate each task, then you're paying several times more money than you were for the original tasks, and you have to write some infrastructure for feeding things back into Mechanical Turk.
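
To put rough numbers on that 60 percent case, here is a quick back-of-the-envelope sketch. It assumes independent evaluators who are each right with the same fixed probability, which is itself optimistic:

    # Probability that a majority of n independent evaluators is correct,
    # when each evaluator is right with probability p (ties count as wrong).
    from math import comb

    def majority_correct(p, n):
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    for n in (3, 5, 11, 25):
        print(n, round(majority_correct(0.6, n), 3))
    # 3 voters at 60% accuracy give roughly a 65% chance the vote is right;
    # you need on the order of 25 voters before the majority is right about
    # 85% of the time -- several times the cost of the original task.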

Like you said, it will work in principle, but there are some tradeoffs.

Personally, I prefer the gold-data method and being conservative about accepting results in the first place rather than feeding them back to get fixed or labeled incorrect.

[1] www.vision.caltech.edu/publications/WelinderPerona10.pdf


What you typically do is include gold data in your dataset. I.e., define a HIT to be 10 sequential tasks, 3-4 of which you know the correct answer to. If a turker can't get the gold data right, their work is rejected.
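
A minimal sketch of that gold-data check, assuming you already have each worker's answers plus the known-correct answers for the embedded gold questions (the data shapes are invented for illustration):

    # Reject assignments whose embedded gold questions were answered poorly.
    # `gold` maps question id -> known correct answer; `answers` maps
    # question id -> the worker's answer for one 10-question HIT.

    def passes_gold(answers, gold, min_correct=3):
        correct = sum(1 for qid, expected in gold.items()
                      if answers.get(qid) == expected)
        return correct >= min_correct

    def review(assignments, gold):
        """assignments: list of (worker_id, answers) tuples for one HIT batch."""
        accepted, rejected = [], []
        for worker_id, answers in assignments:
            (accepted if passes_gold(answers, gold) else rejected).append(worker_id)
        return accepted, rejected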


An even better (and common) strategy is to reject the work that fails the gold-data test and to reject most of the rest of the work too, claiming it failed the gold-data test.


Unique online identities that belong to real people, like the ones Facebook offers, seem to be the only way to prevent rating spam.

Or are they?

What if "mechanical turks" continue to use their FB accounts to do the same?

This makes any rating-system almost useless.

And since I will be publishing an Android app soon: wouldn't it be wise to hire people to rate it with 5 stars, say a few hundred times? It seems like my competition will do it.


A solution does exist: providers of mturk-like services could disallow such work items and enforce that (incidentally, and rather meta, they could use mturk itself to crowd-source spam identification on the cheap).

There is additional work for the service provider but it would seem to me that it does align with their self-interest at some level. I don't think Amazon really wants mturk to be associated with providing a spam work force.

I believe one of the things that CrowdFlower explicitly calls out as an advantage over mturk is quality control (although for this solution to work, all crowd-sourcing providers would have to do it; it takes only one bad provider to enable bad behavior).

As to your hopefully hypothetical question: a risk you're running is that Google will pull your app from the store. I haven't heard of such a case with Google, but I'm pretty sure apps have been pulled from Apple's App Store for manipulating ratings, so the downside could be big (your hard work could amount to nothing).


> providers of mturk-like services could disallow such work items

Except for the one shady site that doesn't, and ends up raking in profits.


Ethics are murky. Don't talk about things like this on public forums. Google's watchful AI is always with us and when it finds you it will crush you. In practice, why hire people to rate it 5 stars when you can pay 200 friends of your friends to download your app and rate it if they like it?


MIT is doing some really interesting research into crowdsourcing with mturk. Check it out: http://groups.csail.mit.edu/uid/research.shtml#crowdcomp

They are tackling tasks like extremely difficult OCR and collaborative editing and proofreading.

I've used mturk at work to automate transcribing short recordings and have found that it works pretty well. The trick is to qualify your workers so that they pass some kind of test. You can also accept only workers with a rating above some minimum. Then, critically, as suggested by others here, get each task done multiple times for cross-checking. And make sure that your instructions are clear.
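
As a sketch of the cross-checking step for transcription, assuming three assignments per recording and a deliberately crude text normalization (both are assumptions for illustration, not how mturk hands you results):

    # Cross-check redundant transcriptions: accept when a normalized majority
    # agrees, otherwise flag the recording for manual review.
    from collections import Counter
    import re

    def normalize(text):
        return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

    def cross_check(transcripts, min_agree=2):
        """transcripts: the raw texts returned for one recording."""
        votes = Counter(normalize(t) for t in transcripts)
        best, count = votes.most_common(1)[0]
        if count >= min_agree:
            return best, "accepted"
        return None, "needs manual review"

    print(cross_check(["Hello world.", "hello world", "hullo word"]))
    # ('hello world', 'accepted')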


I've spent a good chunk of time modifying an image labeling interface to make it more intuitive for mechanical turk workers to label obscure things like biological scans. The hope is that a better interface will increase quality (they did not seem to have a clue what to do using the old interface), but I'm starting to question whether the interface will make that much of a difference after all.


Is there any app or start-up which delegates work to my social connections? Like a Mechanical Turk for my social sphere. Would be a nice solution.


What about a local mech turk type program for small tasks? Maybe craigslist or angieslist is filling this need?


There are places in the world where craigslist hasn't even been heard of. On second thought, there are some issues with this approach: what if my social circle is narrow? Scalability and reliability (i.e., reliable sources) sit on opposite sides of the tradeoff.


Eye-opening.

I was planning on using mt for a project I'm working on.

Has anyone got any pointers on how to get the best out of Mechanical Turk? Advice much appreciated.


Mturk manuals are junk and very hard to follow. HIT data cleansing is the biggest issue. Instead of using the command-line tool, use their API to integrate into your app, as it will save plenty of time down the road...

To "weed out" ineligible workers, try this approach: 1. Post a bunch (1000-5000) of cheap multiple-choice HITs. 2. Allow no more than 10 hits per worker. 3. Each hit to get 3 responses from different workers. 4. Review answers, compile the list of "good" workers, blacklist the "bad" ones. 5. Post another bunch of HITS, make them available for eligible workers only (found in step 4), this time the HITs might be more demanding, individually review results for each worker -> the best ones go on your "preferred worker" list. 6. Repeat steps 1-5 as necessary.

From then on it's fairly safe to rely on mturk workers from your preferred list.
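
A rough sketch of step 4 above: score each worker by how often their multiple-choice answer matches the per-HIT majority, then split them into "good" and "bad" lists. The threshold and the data shapes are assumptions for illustration, not anything mturk gives you directly.

    # Split workers into "good" and "bad" lists based on how often their
    # multiple-choice answer matches the majority answer for each HIT.
    from collections import Counter, defaultdict

    def vet_workers(responses, good_threshold=0.8):
        """responses: list of (hit_id, worker_id, answer) from the cheap HIT batch."""
        by_hit = defaultdict(list)
        for hit_id, worker_id, answer in responses:
            by_hit[hit_id].append((worker_id, answer))

        scores = defaultdict(lambda: [0, 0])   # worker_id -> [agreed, total]
        for hit_id, entries in by_hit.items():
            majority, _ = Counter(a for _, a in entries).most_common(1)[0]
            for worker_id, answer in entries:
                scores[worker_id][1] += 1
                scores[worker_id][0] += (answer == majority)

        good = [w for w, (agreed, total) in scores.items()
                if agreed / total >= good_threshold]
        bad = [w for w in scores if w not in good]
        return good, bad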


Looks like there's a need for an mturk preferred-worker aggregation service.


Seriously, for being a fire-and-forget API to the lowest possible level of human tasks, it requires a heck of a lot of hands-on management, including arguing with identifiable people over two cents. (YOU DIDN'T SAY TO TURN OFF CAPS. I wish I were exaggerating.)

I ended up writing off five hours to goodwill when I did a project with a $100 turking component for a client. To use a line favored by my old Indian colleagues: if you pay peanuts, you get monkeys. Lesson learned.

Next time I will just find a freelancer with a high tolerance for repetition.


> Next time I will just find a freelancer with a high tolerance for repetition.

Can you shoot me an email when you do?

I have the ominous feeling that it will be mind-numbingly boring, but nonetheless money is money.


It's called CrowdFlower.


These guys constantly prove they know how to manage the kinds of problems Turk and other sources bring to the table.


Until someone uses mturk to spam your ratings.


Does this actually happen?


This is still being used? I remember how easily scammed it was when this first came out. You'd think that massive failure would have meant something to them.


Money quote:

> We all know that Mechanical Turk challenges the whole “Junk-in, Junk-out” dilemma and makes it more like “Always junk-out, regardless of the input process”

Couldn't be more true, IMO. mturk is basically useless except for this "meta" kind of research, and it is a good example of a community that needed active management and positive incentives going to absolute shit in the absence of both.



