My wife studies people for a living (experimental cognitive psychologist), and the quality of MTurk is laughable; if that's our standard for higher-level cognition, the bar is low. You'll see the most basic "attention check" questions ("answer option C if you read the question") get failed routinely. Honestly, at this point I think GPT-4 would do a better job than most MTurkers at these tasks...
She has found that Prolific is substantially better (you have to pay more for it as well), though that may only be because it's a higher-cost/newer platform.
My take is the tasks on Turk are awful and will drive away anybody decent.
I had a time when I was running enough HITs to get a customer rep and felt I was getting OK results. I wanted to get better at running HITs, so I thought I would "go native" as a Turker and try to make $50 or so, but I could not find tasks to do that were at all reasonable. Instead they'd want me to "OCR" a receipt that was crumpled up, torn, and unreadable in spots, and they said they'd punish me for any mistakes.
> In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone else). However, preliminary data examination showed that some participants did not fully follow the study instructions and had to be excluded (see Section 5.2).
If they had stuck to the average Mechanical Turk worker instead of filtering for "Master Workers," the parent's conclusions likely would've aligned with those of the study. Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.
> Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.
MTurk, to a first approximation, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably either isn't a human, isn't just a (single) human, or is a (single) human who is barely paying attention + possibly using macros. Or even just button mashing.
That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.
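To make that concrete, here's a minimal sketch (my own illustration, not anything from the paper; the field names like q_attention_1 are made up) of the kind of attention-check filtering you end up bolting onto every MTurk batch:

    # Illustrative only: drop submissions that fail embedded attention checks
    # before analyzing anything else. Field names here are hypothetical.
    ATTENTION_KEYS = {
        "q_attention_1": "C",  # "Answer option C if you read the question."
        "q_attention_2": "B",
    }

    def passes_attention_checks(response: dict) -> bool:
        # Keep a submission only if every embedded check was answered correctly.
        return all(response.get(q) == expected for q, expected in ATTENTION_KEYS.items())

    def split_batch(responses: list[dict]) -> tuple[list[dict], list[dict]]:
        kept = [r for r in responses if passes_attention_checks(r)]
        excluded = [r for r in responses if not passes_attention_checks(r)]
        return kept, excluded

    batch = [
        {"worker_id": "A1", "q_attention_1": "C", "q_attention_2": "B"},
        {"worker_id": "A2", "q_attention_1": "A", "q_attention_2": "B"},  # failed a check
    ]
    kept, excluded = split_batch(batch)
    print(f"kept {len(kept)}, excluded {len(excluded)}")

And that's the easy part. The hard part is writing checks that bots and macro users can't pattern-match, which is exactly what gets miserable for ARC-style tasks.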
The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.
I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants that fit some criteria to show up at a physical lab space or log in to a Zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. The first derivative on your dataset size has to be like an order of magnitude higher than the overall size of the task for the time investment of hardening an MTurk HIT to even begin to make sense.
This is just coming up with excuses for the MTurk workers. "they were barely paying attention", "they were button mashing", "they weren't a single human", etc.
It turns out that GPT-4 does not have those problems. The comparison in the paper is not really fair, since it does not compare average humans vs GPT-4; it compares "humans that did well at our task" vs GPT-4.
> This is just coming up with excuses for the MTurk workers
No. The authors are not trying to study MTurk market dynamics. They are trying to compare humans and LLMs.
Both questions are interesting and useful. This study is only asking about the second question. That's okay. Isolating specific questions and studying them without a bunch of confounds is one of the basic principles of experiment design. The experiment isn't intended to answer every question all at once. It's intended to answer one very specific question accurately.
LLMs can both be worse at Mensa tasks and also better than humans at a variety of reasoning tasks that have economic value. Or, LLMs can be worse at those reasoning tasks but still reasonably good enough and therefore better on a cost-adjusted basis. There's no contradiction there, and I don't think the authors have this confusion.
> The comparison in the paper is not really fair
The study is not trying to fairly compare these two methods of getting work done in general. It's trying to study whether LLMs have "abstraction abilities at humanlike levels", using Mensa puzzles as a proxy.
You can take issue with the goal of the study (like I do). But given that goal, the authors' protocols are completely reasonable as a minimal quality control.
Or, to put this another way: why would NOT filtering out clickbots and humans speedrunning surveys for $0.25/piece result in a more insightful study, given the authors' stated research question?
> It turns out that GPT-4 does not have those problems.
I think the authors would agree, but also point out that these problems aren't the ones they are studying in this particular paper. They would probably suggest that this is interesting future work for themselves, or for labor economists, and that their results in this paper could be incorporated into that larger study (which would hopefully generalize beyond MTurk in particular, since platforms like MTurk are such uniquely chaotic corners of the labor market).
For me, the problems with the study are:
1. The question isn't particularly interesting because no one cares about Mensa tests. These problem sets make an implicit assumption that psychometric tools which have some amount of predictive power for humans will have similar predictive power for LLMs. I think that's a naive assumption, and that even if correlations exist, the underlying causes are so divergent that the results are difficult to operationalize. So I'm not really sure what to do with studies like this until I find an ethical business model that allows me to make money by automating Mensa-style test-taking en masse. Which I kind of hope will never exist, to be honest.
2. MTurk is a hit mess (typo, but sic). If you want to do this type of study just recruit human participants in the old fashioned ways.
But given the goal of the authors, I don't think applying MTurk filters is "unfair". In fact, if anything, they're probably not doing enough.