
I meant that it is tricky to automate QA without feeding the results back into Mechanical Turk for manual evaluation, for example by having workers classify the task results as good or bad. This is an active area of research (see, for example, [1]).

Even if, in theory, feeding the results back into Mechanical Turk for manual evaluation will correct errors, there are still big tradeoffs in practice.

Suppose you had three people look at a task and say whether it was done correctly, and you take the majority vote. This works fine for tasks like speech transcription, where it is easy to tell whether the result is correct. But what about tasks like labeling features in biological images? Surprisingly, even when you show people examples of correct and incorrect work, they still have a hard time distinguishing between the two. These are exactly the kinds of difficult tasks that are most in need of QA.

If the people evaluating correctness are only right 60 percent of the time, you need far more than 3 people voting on each task just to get a reliable estimate. (This also assumes evaluators are biased toward the correct answer, rather than toward a wrong answer or a fixed response.) And if you need many people to evaluate each task, you're paying several times more than you did for the original tasks, and you still have to write infrastructure for feeding things back into Mechanical Turk.
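To make the cost concrete, here is a minimal sketch (plain Python, not tied to any Mechanical Turk API; the function name is mine) of how slowly majority-vote accuracy improves when each independent evaluator is right only 60 percent of the time:

  from math import comb

  def majority_vote_accuracy(p, n):
      """Probability that a majority of n independent evaluators,
      each correct with probability p, reaches the right verdict.
      (n is assumed odd so there are no ties.)"""
      return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(n // 2 + 1, n + 1))

  # With p = 0.6, three voters give about 0.65 -- barely better than
  # a single evaluator -- and accuracy climbs only slowly as you pay
  # for more and more votes per task.
  for n in (1, 3, 5, 11, 21):
      print(n, round(majority_vote_accuracy(0.6, n), 3))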

Like you said, it will work in principle, but there are some tradeoffs.

Personally, I prefer the gold-data method and being conservative about accepting results in the first place, rather than feeding them back to be fixed or labeled as incorrect.
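By "gold-data method" I mean mixing a few tasks with known answers into each batch and accepting a worker's results only if they score well on those hidden gold tasks. A rough sketch of the idea (the helper names, the 20% gold fraction, and the 0.9 threshold are all just illustrative choices, not anyone's production setup):

  import random

  def build_batch(real_task_ids, gold_task_ids, gold_fraction=0.2):
      """Mix a fraction of gold tasks into a batch of real tasks,
      shuffled so the worker cannot tell which is which."""
      n_gold = max(1, int(gold_fraction * len(real_task_ids)))
      batch = list(real_task_ids) + random.sample(list(gold_task_ids), n_gold)
      random.shuffle(batch)
      return batch

  def accept_batch(results, gold_answers, min_gold_accuracy=0.9):
      """Accept a worker's batch only if their accuracy on the gold
      tasks (task_id -> known answer) meets the threshold.
      `results` maps task_id -> the worker's answer."""
      gold_ids = [t for t in results if t in gold_answers]
      if not gold_ids:
          raise ValueError("batch contains no gold tasks")
      correct = sum(results[t] == gold_answers[t] for t in gold_ids)
      return correct / len(gold_ids) >= min_gold_accuracy

The appeal is that QA cost is fixed up front (you only pay for the extra gold tasks), instead of multiplying every task by several review passes after the fact.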

[1] www.vision.caltech.edu/publications/WelinderPerona10.pdf
