
"Tricky to automate" what? Are you not literally in the middle of using a tool that helps you automate QA for huge sets of tasks?

It should be trivial to create a task, create a second task for evaluating that task, and yet another task for evaluating that evaluation. Run all three long enough and you will in fact get good results.
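For concreteness, here is roughly what that chain looks like in code. This is a minimal sketch assuming the boto3 MTurk client; the question XML, titles, rewards, and assignment counts are placeholders you would fill in for your own tasks:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # Placeholder HTMLQuestion/QuestionForm XML strings - substitute your own.
    WORK_XML, REVIEW_XML, AUDIT_XML = "<...>", "<...>", "<...>"

    def post_hit(question_xml, title, reward, assignments=3):
        """Create a HIT and return its id."""
        resp = mturk.create_hit(
            Title=title,
            Description=title,
            Question=question_xml,
            Reward=reward,                      # a dollar amount, as a string
            MaxAssignments=assignments,
            LifetimeInSeconds=86400,
            AssignmentDurationInSeconds=600,
        )
        return resp["HIT"]["HITId"]

    # Layer 1: the real work.
    work_hit = post_hit(WORK_XML, "Label this image", "0.10")

    # Layer 2: once layer-1 assignments come back, template each answer into a
    # review question and ask whether the work was done correctly.
    review_hit = post_hit(REVIEW_XML, "Is this label correct?", "0.02")

    # Layer 3: spot-check the reviewers the same way.
    audit_hit = post_hit(AUDIT_XML, "Was this review fair?", "0.02")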

Obviously if you're going to use an unreliable protocol there have to be management protocols in effect to correct errors, or you will end up with errors. This is not a revelation.




This should be easy - but it is not. Many, many Requesters submit tasks to Turk from the Amazon-provided UI, or some other simplified UI with no concept of a workflow, which makes this stupidly hard.

So you'd think this tool would do this for you - but instead you need another layer on top, either one you code yourself or some third-party tool like CrowdFlower.


I meant that it is tricky to automate QA without feeding the results back into Mechanical Turk for manual evaluation, for example by classifying the task results as good or bad. This is an active area of research (see, for example, [1]).

Even if, in theory, feeding the results back into Mechanical Turk for manual evaluation will correct errors, there are still huge tradeoffs in practice.

Suppose you had three people look at a task and say whether it was done correctly or not, and we pick the most popular choice of the three. This works fine for tasks like speech transcription, where it is easy to tell whether it was done correctly. But what about tasks like labeling features in biological images? Surprisingly, even if you show people examples of what is correct and what is not, they still have a hard time distinguishing between the two. These are the kinds of difficult tasks that are especially in need of QA.
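The vote counting itself is the easy part. A minimal sketch of the plain majority rule described above (the label strings are just illustrative):

    from collections import Counter

    def majority_label(votes):
        """Pick the most popular judgment; with three voters and two labels
        there is never a tie."""
        return Counter(votes).most_common(1)[0][0]

    majority_label(["correct", "correct", "incorrect"])  # -> "correct"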

If the people evaluating correctness are only right 60 percent of the time, you'd better have more than three people vote on whether it's correct, just to get a good estimate. (Also, we are assuming evaluators are biased toward the correct answer, rather than toward the wrong answer or toward some fixed response.) If you need a lot of people to evaluate each task, then you're paying several times more money than you were for the original tasks, and you have to write some infrastructure for feeding things back into Mechanical Turk.
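To put numbers on that: assuming independent evaluators who are each right 60 percent of the time, the chance that a simple majority picks the right answer grows only slowly with the number of votes, and every extra vote is another payment per task:

    from math import comb

    def p_majority_correct(n, p=0.6):
        """Probability that a majority of n independent voters, each correct
        with probability p, picks the right answer (n odd)."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    for n in (3, 5, 9, 17):
        print(n, round(p_majority_correct(n), 3))
    # 3 0.648, 5 0.683, 9 0.733, 17 0.801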

Like you said, it will work in principle, but there are some tradeoffs.

Personally, I prefer the gold-data method and being conservative about accepting results in the first place rather than feeding them back to get fixed or labeled incorrect.
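Concretely, the gold-data approach amounts to something like this: seed items with known answers into each batch, score each worker only on those, and be conservative about accepting the rest of their work. The 0.8 cutoff and the item names here are purely illustrative:

    # Known answers ("gold") mixed invisibly into the batch.
    GOLD = {"img_017": "mitochondrion", "img_042": "nucleus"}
    THRESHOLD = 0.8  # illustrative cutoff

    def gold_accuracy(answers):
        """answers: {item_id: label} submitted by one worker."""
        scored = [item for item in answers if item in GOLD]
        if not scored:
            return None  # worker saw no gold items; withhold judgment
        return sum(answers[i] == GOLD[i] for i in scored) / len(scored)

    def accept_work(answers):
        acc = gold_accuracy(answers)
        return acc is not None and acc >= THRESHOLD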

[1] www.vision.caltech.edu/publications/WelinderPerona10.pdf




