Very interesting article and statistical analysis, but I really don't see how it concludes that the DK effect is wrong based on the analysis. The fact that the DK effect emerges with _completely random data_ is not surprising at all - in this case the intuitive null hypothesis would be that people are good at estimating their skill, therefore there would be a strong correlation between their performance and self-evaluation of said performance. If the data weren't related, then this hypothesis isn't likely, which is exactly what DK means. And indeed if you look at the plots in the article (of the completely random data), they depict a world in which people are very bad at estimating their own skill, therefore, statistically, people with lower skills tend to overestimate their skills, and experts tend to underestimate it.
Also wanted to point out that in general there is no issue with looking at y - x ~ x, this is called the residual plot, and is specifically used to compare an estimate of some value vs. the value itself.
That being said, the author seems very confident in their conclusion, and from the comments seems to have read a lot of related analyses, so I might be missing something. ¯\_(ツ)_/¯
> therefore there would be a strong correlation between their performance and self-evaluation of said performance. If the data weren't related, then this hypothesis isn't likely, which is exactly what DK means.
DK doesn't mean no correlation, it means inverse correlation. It's the correct analysis at the bottom that shows what no correlation actually looks like (at least no correlation in trend, there is heteroskedasticity).
> a world in which people are very bad at estimating their own skill, therefore, statistically, people with lower skills tend to overestimate their skills, and experts tend to underestimate it.
Be careful here, the conclusion you drew doesn't actually follow.
> y - x ~ x, this is called the residual plot
You're giving x and y meaning that they don't have. In the article these are uncorrelated random variables - the plot of y-x ~ x will always look that way. That's however not the case if you're plotting y_hat - y ~ y_hat for a y_hat taken out of a model. That won't be a random variable in your setup.
> DK doesn't mean no correlation, it means inverse correlation.
No, it means that people's self-assessment, their prediction of what their test scores will be, is uncorrelated (or more precisely weakly correlated--that's what the original D-K data showed) with their actual test scores. Which is not what we would expect: we would expect that their predictions of their test scores would be strongly (or at least more strongly) correlated with their actual test scores. The question the D-K effect raises is why that is not the case, and it's a valid question--one which this article does not even attempt to answer.
>> DK doesn't mean no correlation, it means inverse correlation.
> No, it means that people's self-assessment, their prediction of what their test scores will be, is uncorrelated (or more precisely weakly correlated--that's what the original D-K data showed) with their actual test scores.
The problem is that in common Internet discussions, there is no fixed meaning of D-K. When things heat up, D-K becomes "his belief in his competence is an indication of his incompetence, ha", but in calmer debate it becomes "less competent people have a somewhat less accurate understanding of their competence".
In this form, D-K is a classic Motte and Bailey device [1]. It's plausible, unsurprising and uninteresting that less competent people in a field self-assess somewhat less accurately (the motte). It's completely unsupported but very "juicy" that someone's claim of competence indicates incompetence (the bailey).
>their prediction of what their test scores will be, is uncorrelated
I couldn't access the original 1999 full article because it's behind a paywall, but is this accurate?
The abstract states that people "grossly overestimated their test performance and ability". If it were truly uncorrelated, wouldn't we expect there to be roughly equal number of people who "grossly underestimate" their performance? In any event, I wouldn't expect a weak correlation to be described as "grossly [anything]".
Or maybe this is just the D-K effect in myself trying to understand the abstract :-)
> The abstract states that people "grossly overestimated their test performance and ability".
I understand that the words D-K used were along those lines; but the actual data shown in their graphs says what I said. Just look at the lines on the graphs. The "actual" lines are 45 degree lines up and to the right--as they must be. But the "perceived" lines are horizontal, or roughly so. That means the two lines are uncorrelated, or nearly so.
Saying that underperformers grossly overestimated their performance is only describing part of that--the part where the "perceived" lines on the left sides of the graphs are above the "actual" lines. It does not describe the other part--the part where the "perceived" lines on the right sides of the graphs are below the "actual" lines.
D-K don't talk about that at all in their paper, which means their paper itself misdescribes the actual data they found. And if this article had said that, it would have been a valid criticism. But this article is not talking about what D-K said in words; it's talking about D-K's actual graph. And the article's claimed criticism of D-K's graph is not valid: the article is in fact just describing the same graph in different words. It's not offering a different "explanation" of why the graph looks that way; in fact, as I said, it offers no explanation of that at all; it doesn't even recognize that as a question.
Thanks for taking the time to help me understand the data.
Looking at the graphs, isn't the implication that the distance between the two lines is the metric of interest? In other words, that the distance between real/perceived scores in Q1 & Q2 is "grossly" wider than the distance between Q3 & Q4, with the noted exception that only Q4 underestimates?
> isn't the implication that the distance between the two lines is the metric of interest?
That's more or less what D-K are saying, yes. That doesn't necessarily mean it's the best metric to use for understanding what the data is saying.
Also, "distance between the two lines" is misleading because the data is bucketed--on the x axis by quartiles, on the y axis by percentiles. In other words, the actual data is the circles/triangles/squares, not the lines. When put that way, "distance between the lines" looks a lot less like an actual metric and a lot more like an artificial one.
You seem to be first correcting the statement made by D-K in a way so compatible with this article that it is literally their example, and then claiming that because this corrected statement is correct, the article is wrong... the actual statement--as you even admit--was wrong, and the article wasn't questioning whether their data was fine: it was questioning their interpretation, which is what the D-K effect actually is... an inverse correlation between skill and perceived skill. Yes: if you choose to read their graph in a way which D-K didn't consider, you can see the correct statement, but D-K didn't do that, and that isn't what anyone--but I guess you--means when they reference the conclusions of the paper and the resulting "effect", which purports to explain "why people tend to hold overly optimistic and miscalibrated views about themselves".
No, it isn't. The article is claiming that the D-K effect doesn't actually exist; actually all that is going on is "autocorrelation". That claim is wrong. The D-K effect does exist, but it isn't what D-K said it was: it's not that there is an inverse correlation between perceived skill and actual skill; it is that there is no correlation (or at best a weak positive correlation) between perceived skill and actual skill (when we would intuitively expect a fairly strong positive correlation between them). That is what D-K's actual data shows; but neither D-K themselves in their paper, nor this article, correctly describe that.
> The D-K effect does exist, but it isn't what D-K said it was...
You are claiming an effect exists, and no one--and certainly not the author of this article, who seems to be fully bought into your very different statement of the effect (and so maybe we shall call it "the pdonis effect")--is disagreeing with you on that point; however, your insistence upon calling this effect "the Dunning-Kruger effect"--while simultaneously admitting that Dunning and Kruger described a different incorrect analysis as their "effect"--is what is unreasonable. Put simply: "the Dunning-Kruger effect" is the thing that Dunning and Kruger claimed to be true, not some other random property of the data that their same experiment could have backed up, if only they had realized; and it is this effect--the actual Dunning-Kruger effect, as described by Dunning and Kruger in no uncertain terms--which does not seem to exist.
Overestimating vs underestimating their ability has less to do with whether their estimates are correlated and more to do with the average estimation.
If estimates of ability were a truly random sample with mean of 75th percentile ability, then the net effect will be vast overestimation. If they're a truly random sample with mean at 50th percentile, it will be equal. At 25th, underestimation.
It's like how 90% of people estimate that they're in the top 50% of drivers by ability.
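A quick toy sketch of that first point (numbers entirely made up, just numpy): only the centre of the random estimates moves, and the net bias flips sign.

    # toy sketch: the mean of the estimates drives net over/under-estimation,
    # regardless of whether the estimates correlate with actual skill at all
    import numpy as np

    rng = np.random.default_rng(1)
    actual = rng.uniform(0, 100, 10_000)            # actual percentile ability
    for centre in (25, 50, 75):                     # where the random self-estimates are centred
        estimate = np.clip(rng.normal(centre, 15, actual.size), 0, 100)
        print(f"estimates centred at {centre}: mean error {(estimate - actual).mean():+6.1f}")
    # centred at 25 -> net underestimation, at 50 -> roughly zero, at 75 -> net overestimation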
having also not read the original paper recently, I'm in the camp that says the question is interesting and valid, but you've got to be careful about interpretation. how much the scores correlate (or autocorrelate) with self assessment IS an interesting question, and it's not apparent a priori what the answer will be, or whether the real effect we're interested in will differ for different levels of expertise.
but the use of percentile to percentile (or quartiles, but those are just grouped percentiles) to give the impression of a particular kind of effect (lower groups overestimating and higher groups underestimating) is a flaw i think. it's a common one, in my experience, when dealing with percentages.
if you think about it, percentages/percentiles have to be bounded at 0 or 100. for a dunning kruger effect not to appear, participants at both ends of the ability spectrum would have to be eerily accurate in their self-assessments. if they aren't, there's just more space on one side of the measurement scale for each group to make an error (if you score 1 in ability, there's ~99 percentiles available for you to make an overestimate and only 1 to be accurate, ditto for those with high ability. those in the middle of the ability group have equal chance on either side and so appear statistically more accurate even with a purely random distribution of guesses). so if there is any measure of central tendency towards the middle percentile in the estimations of people's abilities at all (and i would argue there is a priori reason to believe there would be, as the alternative would require those at both ends of the distribution to be getting increasingly accurate, which would be really weird), then practically any real world graph of percentile performance to percentile estimation will show a dunning kruger effect (with lower ends overestimating and higher ends underestimating). the article does a good job of showing this by plotting just random estimations and observing one appears.
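a rough toy simulation of the boundary effect (made-up numbers, nothing from the article): purely random guesses on a bounded 0-100 percentile scale.

    # purely random self-estimates on a bounded 0-100 percentile scale (made-up data)
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    skill_pct = rng.uniform(0, 100, n)    # actual ability, as a percentile
    guess_pct = rng.uniform(0, 100, n)    # self-estimate, totally random (no self-insight at all)

    quartile = np.digitize(skill_pct, [25, 50, 75])   # 0..3 = bottom..top quartile
    for q in range(4):
        m = quartile == q
        print(f"q{q + 1}: mean actual {skill_pct[m].mean():5.1f}, "
              f"mean guess {guess_pct[m].mean():5.1f}, "
              f"mean error {(guess_pct[m] - skill_pct[m]).mean():+6.1f}")
    # the bottom quartile overestimates by ~37 points and the top quartile underestimates by ~37,
    # purely because guesses are bounded and centred near 50 -- no psychology involved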
> DK doesn't mean no correlation, it means inverse correlation
Both are wrong. DK effect describes a weak positive correlation, but weaker than we intuitively expect. The top quartile still estimates their ability better than the bottom - positive correlation. But this is still below their actual ability. The bottom quartile correctly predicts they score worse than the top, but they overestimate their own ability ie they underestimate how much better the top quartile actually performs.
> DK doesn't mean no correlation, it means inverse correlation. It's the correct analysis at the bottom that shows what no correlation actually looks like (at least no correlation in trend, there is heteroskedasticity).
Not sure I follow, inverse correlation between what? The analysis at the bottom (assuming you mean fig. 11) is too dense to show if there's a correlation or not (between skill and self-assessment bias), and looking at the relevant figure from the paper itself gives me the impression that there is a correlation.
>> a world in which people are very bad at estimating their own skill, therefore, statistically, people with lower skills tend to overestimate their skills, and experts tend to underestimate it.
> Be careful here, the conclusion you drew doesn't actually follow.
Can you elaborate? If X and Y are two independent random variables, X representing skill and Y representing self-assessment of skill, X and Y will be negatively correlated - this is exactly what the first part of the article is about, although from my perspective it's the author who's drawing the wrong conclusion.
>> y - x ~ x, this is called the residual plot
> You're giving x and y meaning that they don't have. In the article these are uncorrelated random variables - the plot of y-x ~ x will always look that way. That's however not the case if you're plotting y_hat - y ~ y_hat for a y_hat taken out of a model. That won't be a random variable in your setup.
Not following again. Other than calling x y_hat, and having y_hat be your own estimate vs. x be the subjects' estimate, what is the distinction? What do you mean by "the plot of y-x ~ x will always look that way" - what way? The shape of the plot will necessarily depend on the relationship between x and y.
> If X and Y are two independent random variables, X representing skill and Y representing self-assessment of skill, X and Y will be negatively correlated
1. As mentioned in another comment, X and Y can't be independent and correlated at the same time
2. The point of the article is to show that you can replicate the results of the DK paper starting from purely random data. In the article X and Y don't mean anything, they're just random variables that the author draws samples from. The fact that you can get DK results from this very strongly suggests that DK is just an artifact of statistics, not an actual result.
> What do you mean by "the plot of y-x ~ x will always look that way" - what way?
See Figure 8 in the article, also panel B in Figure 10.
> The shape of the plot will necessarily depend on the relationship between x and y.
What I meant to say is that since X and Y are simulated data, not actual observations, the shape in Figure 8 will not actually depend on any possible relationship between ability and self-assessment. It's just a statistical artifact.
Figure 9 is based on this simulated data as well, and since it closely replicates Figure 2 there's good reason to believe that Figure 2 itself is actually just a statistical artifact and that the DK data don't actually show the purported correlation.
This point is further strengthened by referring to a few papers and by showing some corrected results in Figure 11.
> 1. As mentioned in another comment, X and Y can't be independent and correlated at the same time
And as I replied to that comment, "sorry, my bad - Y - X and X will be negatively correlated."
> 2. The point of the article is to show that you can replicate the results of the DK paper starting from purely random data
You (and many others) are using "purely random data" as if it's always the null hypothesis and using it to cast doubt on the results. But assuming as the null model 0 correlation between skill and self-assessment of that skill makes no sense to me, and is in fact more extreme than the claim DK is making. So in other words, sure, if you assume something more extreme than the claim and generate data based on this assumption, you'll get the same effect, only more extreme.
> If X and Y are two independent random variables, X representing skill and Y representing self-assessment of skill, X and Y will be negatively correlated
“If two variables are independent, then their correlation will be 0”
> > a world in which people are very bad at estimating their own skill, therefore, statistically, people with lower skills tend to overestimate their skills, and experts tend to underestimate it.
> Be careful here, the conclusion you drew doesn't actually follow.
How does that not follow? It's just regression to the mean.
I believe the god-emperor was implying that it is possible to imagine a world where people are bad at estimating their own skill, but in the other direction - people who are good at something drastically overestimate how good they are, and vice-versa.
> a world in which people are very bad at estimating their own skill, therefore, statistically, *people with lower skills tend to overestimate their skills, and experts tend to underestimate it*.
I think the correct conclusion is that if a cohort is not good at estimating their own skill, you can conclude that the variance in their predicted self-assessment will be high (since they’re concluding things without strong evidence), but you can’t assume that their estimates will be biased without additional evidence.
That is, indeed, what was shown in the last panel of the article: freshmen (and undergraduates in general) are much worse at assessing their own ability than professors, but no group has a strong bias towards over- or under-assessing their own ability (recalling the “my guesses are much better than your guesses” from “The Death of Expertise”).
The author’s confidence is itself an indication that they’re more likely to be wrong.
Kidding. Well, half-kidding, I did kind of find the tone a bit biting and dismissive, especially towards one of the commenters who was pointing out exactly what you did.
It’s an interesting question to ask whether the uniformly random data “really” exhibits DK or not, and whether that’s interesting. A world where people have 0 ability to assess their own skill and resort to making uniformly random guesses at it is kind of interesting, and of course in such a world more skilled people would end up on average underestimating themselves and vice versa.
But I think the author’s right that obviously nothing psychological is happening here. There’s the psychological effect of no one being able to assess themselves, but the fact that unskilled people overestimate themselves in this world has nothing to do with the fact that they are unskilled.
> But I think the author’s right that obviously nothing psychological is happening here. There’s the psychological effect of no one being able to assess themselves, but the fact that unskilled people overestimate themselves in this world has nothing to do with the fact that they are unskilled.
If the results from DK were similar to the random data results, I'd agree. But the DK results do show some correlation between skill and self-assessment ability.
My guess is that the spread in self-evaluation is largest at low skill level and decreases as skill level increases. This would produce results more similar to what D-K actually observed, and is much more plausible than postulating that highly skilled people have no more idea of their skill level than low skilled people.
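Here's a rough sketch of that guess (the parameters are entirely mine, not from D-K or the article): self-assessment is unbiased but noisier at low skill, with everything clipped to the 0-100 scale.

    # self-estimate = score + noise whose spread shrinks with skill, clipped to the 0-100 scale
    import numpy as np

    rng = np.random.default_rng(7)
    n = 10_000
    score = rng.uniform(0, 100, n)                       # actual test score
    spread = 40 - 0.35 * score                           # wide spread at low skill, narrow at high skill
    estimate = np.clip(score + rng.normal(0, 1, n) * spread, 0, 100)

    quartile = np.digitize(score, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual {score[m].mean():5.1f}, perceived {estimate[m].mean():5.1f}")
    # the bottom quartile overestimates by a fair margin while the top quartile underestimates
    # only slightly -- closer to the asymmetric shape D-K plotted than the pure-noise case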
What strikes me about that graph is that the entire group is likely to be more skilled than the general population. The selection only of people who have the interest and means to attend higher education seems like a narrow window at the furthest edge of the true graph. So I wonder what it would look like if we included people of all education levels and social strata?
My gut, based on this article, is that it would look generally the same but with a larger spread at the lower end of the scale. But I don't think we can truly say we've disproved the Dunning-Kruger effect without a more varied dataset.
I dunno, if we account for the whole mosaic I think there is a higher probability of younger Cornell undergraduates performing poorly and self-assessing highly. They could very well be thinking, especially against the backdrop of their high school peers, that they know it all. I think the explanation is "You don't know what you don't know," with a splash of "I'm in the top 10% of my public high school class!" Moreover, I'd operate with the presumption that they don't read outside of the curricula, and so they're exceedingly unaware of the boundaries of their knowledge within the bounds of all knowledge, and of the scope of the surface they cover. It's a wonder that they're as accurate as they are. But I think that has to do with the nature of the test, which as I recall was something like the ACT/SAT, where they're effectively only testing to 8th grade standards. What happens if they test people against the whole of human knowledge?
Taken at face value, what the DK data shows without autocorrelation is that more skilled people are, in fact, better at evaluating themselves than unskilled people. The correlation is positive. It's only when you introduce the difference in score vs. assessment (autocorrelation) that the correlation appears negative ("skilled people evaluate themselves as less skilled than they are").
>The fact that the DK effect emerges with _completely random data_ is not surprising at all - in this case the intuitive null hypothesis would be that people are good at estimating their skill, therefore there would be a strong correlation between their performance and self-evaluation of said performance. If the data weren't related, then this hypothesis isn't likely, which is exactly what DK means.
DK effect is not that low skill people are overconfident and high skill people are underconfident. It is specifically that low skill people are more overconfident than high skill people are underconfident. i.e. if someone's estimated skill is true_skill+bias+noise, then bias_lowskill > -bias_highskill.
This is very clear in the original DK paper, they specifically focus on the supposed metacognitive deficiencies of low-skill people.
The article argues that the graphs supposedly demonstrating this fact can also be generated from a model that does not have this difference, i.e. where bias_lowskill == bias_highskill.
EDIT: My characterization of the article is not correct, see here[1] for a visualization of the point I'm trying to make.
> The article argues that the graphs supposedly demonstrating this fact can also be generated from a model that does not have this difference, i.e. where bias_lowskill == bias_highskill.
But as I understand it, it doesn't: In the graph generated using random data, the lines intersect in the middle (bias_lowskill == bias_highskill), whereas in DK's paper they intersect in the upper right (so bias_lowskill != bias_highskill).
To be fair, in the random data, you have people of all actual skill levels estimating their ability at 0.
That's going to skew the data somewhat because people don't work that way. While you will likely have some zeros, I wouldn't expect any from the skilled population and I'd expect fewer from the rest of the population.
That's going to make the "difference in estimation" line higher overall, which would make it intersect higher and more to the right.
Okay well in that case the article is deficient (I skimmed it, assuming it was making the same case I already knew about[1] -- my bad). You can actually reproduce the DK graph without supposing that estimation bias depends on skill. There is an interactive visualization here[2] that does precisely that (click on the "line plot w/ centiles" tab). You can adjust the parameters to see how it behaves.
I see what you mean, but is using the "general overestimation" slider actually unbiased? I don't really understand what it is supposed to represent. Looking at the code, it seems to be only an additive term that shifts the plot upwards, but that does not explain anything (actually, in the code, the variable is called "up_bias", and could correspond to an actual DK effect...).
Per DK's hypothesis, the added bias would be a decreasing function of underlying skill, not a constant as it is here. I think this is the only way to define it sensibly. It's easier to see what it's doing if you look at the scatterplot, although it's kind of annoying because the axes are flipped and it keeps the view centred on the data, so the points stay still and the axes move around them. Basically it's just translating all the points uniformly along the estimated score axis.
That would make sense, except that graphs generated from random data show identical bias for overestimation and underestimation, as can be seen in the article. And this is opposed to graphs from the DK paper, which show a smaller underestimation bias for experts than an overestimation bias for low-skill people. (Of course that alone doesn't prove anything, just saying that to my understanding, nothing in the article contradicts my interpretation of DK).
I'm definitely not even close to a statistician, but I'm also having a hard time accepting this analysis.
I'll admit that part of it also comes from personal experience, at work and elsewhere. I've met some catastrophically incompetent people who were completely oblivious to their own incompetence, and it has very often felt like the more incompetent they were, the more likely they were to try to do stuff that was waaaay out of their comfort zone, which would make even experienced, competent people tread carefully.
But even ignoring personal experiences, I'm not convinced by the arguments either. I understand what they are saying, but I don't see how this disproves the DK effect.
Even if everyone is equally bad at estimating their own skill, so that their estimate is essentially a completely random variable, we would expect the self-assessment score average to be around 50. If I understand it correctly, this is essentially what figure 9 is demonstrating.
But that figure still says that worse performers are then likely to overestimate their own ability, just as much as it says that better performers are bad at it.
If we look at the original DK figure and contrast it with figure 9 with random data, then I think one way of interpreting the differences is that, yes, worse performers are indeed bad at self-assessment, but they're just kind of bad at it as if their self-assessment is a completely random variable. It then seems to keep being essentially random but as people's skills improve, the distance between their score and their self-assessment becomes a bit tighter.. so in conclusion: most people are pretty bad at self-assessment, but skilled people are a bit less so.
The end result is still that people in the bottom quartiles are going to over-estimate their own ability.
I don't know, maybe this is way out in the weeds. Please school me.
The feeling you got from looking at the graph comes from the idea that the red line represents absolute values. It’s actually an average of the real values, and that is a really big difference and one of the reasons why it’s very easy to lie with statistics.
To correctly correlate the two variables (assessment and actual score) you need to correlate the actual data, not measures of its characteristics (average being one of them).
The actual data is shown. Even by eye it’s possible to see no strong (or significant) correlation exists.
If people are all pretty-much crap at estimating their own skill, then you'd expect all estimates to be roughly their actual skill, plus-or-minus some random error-margin with some kind of probabilistic distribution (Gaussian?).
If that were the case, then high-skilled people would be more likely to underestimate their skill (because their actual skill is greater than the mean). That (I think) is an example of "reversion to the mean".
If that reasoning is right, then the DK claim may be true, but it says nothing about the comparative estimating propensities of high-skilled and low-skilled people. It just says that low-skilled people tend to overestimate their skill, and vice-versa. But that's exactly what you'd expect, amirite?
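A toy version of that model (my own made-up numbers): the estimate is actual skill plus symmetric, zero-bias noise, and everything is then expressed as percentile ranks.

    # estimate = actual skill + symmetric noise, then everything expressed as percentile ranks
    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000
    skill = rng.normal(0, 1, n)                  # latent ability
    estimate = skill + rng.normal(0, 1, n)       # honest but noisy self-estimate (zero bias)

    def pct_rank(v):
        # percentile rank of each value within the sample, scaled to 0-100
        return 100.0 * np.argsort(np.argsort(v)) / (len(v) - 1)

    skill_pct = pct_rank(skill)
    est_pct = pct_rank(estimate)

    quartile = np.digitize(skill_pct, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual pct {skill_pct[m].mean():5.1f}, perceived pct {est_pct[m].mean():5.1f}")
    # the bottom quartile's perceived percentile comes out well above its actual percentile and the
    # top quartile's comes out below it, even with zero bias -- reversion to the mean in rank space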
I think part of the issue here is you're talking about people estimating future accomplishment where they have no skill, and all variations of these experiments are people assessing how they performed on a test they just took. There's very likely a difference between someone doing open-ended talking wildly in a field where they don't even know what's possible and an introspective assessment of themselves on a concrete task they just completed.
And where they got feedback. What the original paper presented turns out to be a tautology: in the absence of feedback, people don't know if they performed well or poorly.
The problem is that they calculated each person’s ‘self-assessment error’ with the actual test score. This error is the difference between a person’s self assessment and their test score.
This is like comparing x - y to x, and if you do this, you will get a correlation no matter what.
Not necessarily. If for example y =~ x, comparing x - y to x would yield zero correlation, which to me is what DK is all about - people _aren't_ as good as we would expect at estimating their own skill.
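A quick sanity check with made-up numbers, if it helps:

    # corr(x - y, x) is far from zero when x and y are unrelated, but ~0 when y tracks x
    import numpy as np

    rng = np.random.default_rng(3)
    n = 10_000
    x = rng.uniform(0, 100, n)
    y_unrelated = rng.uniform(0, 100, n)       # y has nothing to do with x
    y_close = x + rng.normal(0, 5, n)          # y =~ x (small estimation error)

    print(round(np.corrcoef(x - y_unrelated, x)[0, 1], 2))   # ~ 0.71: x sits on both sides, so a correlation is forced
    print(round(np.corrcoef(x - y_close, x)[0, 1], 2))       # ~ 0.00: the difference no longer depends on x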
They are not saying that the hypothesis is wrong, just that the proof of the hypothesis is wrong. You can frame it however you like, but it won't work if you have x on both sides of the equation.
You might be biased towards avoiding catastrophic risk. Somebody who doesn't know what they are doing without knowing it is more dangerous than somebody who knows what they are doing yet takes precautions as if they don't.
Unskilled people are more random with their self assessment than skilled. It has nothing to do with unskilled people thinking that they know everything.
> Unskilled people are more random with their self assessment than skilled
This is a statistical claim, supported by the DK graph (but not the random data thought experiment from the article).
> It has nothing to do with unskilled people thinking that they know everything
This reads to me as a claim about the psychological reason for the statistical pattern, which I don't think is either supported or contradicted by data, both in the article and in the original paper.
"Unskilled people often think they know everything" is pretty much the colloquial use of 'Dunning Krueger' to label people who insist that they know better than the experts from a position of relative ignorance
But I think that's consistent with the statistical pattern: if the distribution of self-assessment [amongst unskilled people] of their relative abilities is random or near random, it logically follows that the set of unskilled people includes a lot of people who significantly overestimate their ability at something. Dunning and Kruger don't really talk about the propensity of excellent test performers to underestimate their skill as much (though the Nuhfer study results somewhat justify their original focus on the ignorant by finding that more skilled groups like professors and graduate students make smaller average prediction errors of their test scores than undergrads).
Dunning and Kruger's contention in the original article is that "incompetence robs people of their ability to realise they're incompetent". Similarity of the prediction errors to a random walk isn't a rebuttal of that (although it's a fair critique of the presentation) because the null hypothesis is that people who find a test particularly difficult shouldn't be [almost] as likely to believe they achieved above average performance as the people who aced it. There might be other reasons for that (like the test being pretty easy for all participants and raw scores in a fairly narrow range, or test takers wrongly assuming their lack of understanding was being compared against the general population rather than other smart undergraduates) but in general people ought to be able to incorporate knowing that they didn't know how to answer a lot of questions into their self-assessment of how they performed.
The one reproduced in the article doesn't show the density of points, so it's hard to conclude anything from it. Figure 4 from the Nuhfer et al. paper does seem, to me at least, to support DK's conclusions.
It does show the lack of extreme values for higher-skilled people, surely this has some statistical significance?
Especially in a situation where you would expect the distributions to be of the same type?
Unless they had messed up in failing to normalize the number of points per group, and so this might come from the law of large numbers failing + sheer randomness failing to create extreme values on higher-qualified, but lower population groups?
> Also wanted to point out that in general there is no issue with looking at y - x ~ x, this is called the residual plot, and is specifically used to compare an estimate of some value vs. the value itself.
This article seemed very unconvincing -- and this part noted above, early on in the article set the tone that I felt like the author didn't know what they were doing. And even after reading it all, I felt like the standard lay use of DK remained valid.
This just felt like the type of thing I would have thought about as an undergrad, started to write it, and then realized it didn't make sense halfway through it. Or maybe I just missed something as well...
IMO the most interesting thing is not so much that you can get DK from noise, it's that the Nuhfer study was utterly unable to replicate the DK effect. If DK is real, there should have been at least a hint of it visible in the Nuhfer study.
>the author seems very confident in their conclusion
They are, but honestly all that can be concluded safely IMHO is that the original D-K graph doesn't support the existence of the widely discussed "effect" that their conclusion describes. Therefore, unless there is more evidence from some other subsequent study, there may not be any evidence for it at all, and in that case there's potentially no proof it exists.
However, even if you prove that their data is not evidence, that doesn't actually say anything about whether the effect exists or not, just that the D-K paper isn't evidence of such an effect.
I don't think that's enough for the author to conclude that "DK is autocorrelation". A more careful conclusion would be that "the DK data do not support DK's conclusion"... but of course that's much less likely to attract click throughs.
> the intuitive null hypothesis would be that people are good at estimating their skill, therefore there would be a strong correlation between their performance and self-evaluation of said performance
That's not really the intended interpretation of "null" in "null hypothesis". "Null" does not mean "contrary to the effect you're testing". "Null" means "do not assume dependencies anywhere" and so your description is backwards.
"Null hypothesis" comes down to agreeing on a prior that is "reasonable". Most of the time, that indeed means not assuming dependencies, e.g. when testing the outcomes of a medical treatment. But that's not always the case, e.g. does it seem reasonable to you to assume a priori no dependency between a person's age and their height? Does the result "People get taller until the age of 20" merit a journal article? There's a very strong correlation there after all.
As I've written elsewhere this all comes down to the prior, based on my life experience the prior "people are generally capable of self-assessing their performance" is much more likely than "people have absolutely no ability to self-assess their performance."
To state it differently, when you assume no dependencies anywhere, you've already jumped to a conclusion that is more far-fetched to me than the results of DK. Do you really think people have zero ability to self-assess their own performance? In all domains and all contexts, as this is your "null hypothesis"?
I invite you to ponder on the etymology and actual definition of the word "null".
It does not refer to some antecedent or to some notion of "reason", it refers as explicitly as possible to "not-x", contra whatever effect generates "x".
You are conflating the construction of the null hypothesis with a more liberal notion of hypothesis testing (science) per se.
Null means "that thing which I've decided to attach special significance to". There's in general no reason to prefer a null hypothesis over any other hypothesis.
Check out Nuhfer et al 2016, who had a different explanation for why Dunning Kruger wasn't true.
Dunning Kruger effect: lower performers overestimate their ability, and higher performers underestimate their ability.
How did they find that? They asked participants to take a test and then had them do a self-assessment. Both were standardized from 0-100. They rated a participant's self-assessment accuracy by "self-assessment minus test score."
What's wrong with that method? You can't arrogantly self-assess as though you got a 130, and you can't humbly say that you got -50. Because of the standardization, you're bound by 0 and 100. This method makes it almost impossible for higher performers to overestimate their ability and for lower performers to underestimate.
What they actually found was that higher performers tend to be better at self-assessment. Lower performers are less accurate, but in both directions (not just overconfident).
You've been corrected about "what DK means" in the other comments, but this is not quite the point of the post. This is not about whether DK (as expressed in English words) is true or not — in fact, the author points out in the beginning that it's one of these "everybody knows it's like that" ideas (as it often is with social psychology).
The point is that the original DK paper is bullshit. At least, this plot is. And people tend to miss it, until they start to carefully read the labels and think about the caveats. In fact, as presented here it looks like it shouldn't even be accepted as a valid study; this is outright deceptive, maliciously so. If there is assumed to be a correlation between x & y, how about we start by plotting x against y then? I know, it may be messy. It almost certainly will be. Because of that, I personally won't even be offended (but some people might) by you removing the outliers and producing the unnaturally clean version of the plot in the end to highlight the main idea. Then some statistical tests to make the results quantified. But here we see nothing, it really is just comparing x to x.
IMO, this is pretty much the invariant of most of the problems of academic research in the last God-knows-how-many decades (maybe always was, I don't know). Computer science papers without the code. Data science papers without the data. Yeah-yeah, I've heard hundreds of excuses why researchers do it like that. But it's pointless, such "research" shouldn't be accepted by anybody. Either you make your findings actually public by providing everything to replicate every single step of your study (which is supposed to be the point), or you just don't publish anything and keep the research proprietary (I mean, obviously it's never black and white, there always will be concerns about test-subject anonymity, etc. — but it's ridiculous to discuss that when the accepted standard even in "proper" sciences are 20 pages of dense text which might never even get to the point of the study, i.e., actually showing the data to any extent.)
I strongly disagree, not necessarily with everything (e.g. I don't have access to the raw data from the DK experiment, don't know how well they performed all the analysis leading to the plot). But the plot itself is not inherently deceptive, and, unlike implied in the article, is not equivalent to "just comparing x to x". The plot essentially shows the actual performance vs. self-assessment of performance, compared to what we would expect if there were perfect correlation.
Well, I wouldn't say "equivalent" (and I wouldn't say it's implied in the article), but it isn't very far from that. And the meaning of "very far" is relative. I mean, f(x,y) ~ x obviously isn't equivalent to x ~ x, but the importance of pointing this out depends on how likely people are to mistake it for x ~ y, and how likely they are to skip that f() is DEFINED by the authors and not found in nature. Same with "misleading": your perspective on whether something is misleading obviously depends on where it leads you. But my evaluations of the 2 questions above are "very likely" and "it does lead the majority of people to the wrong conclusions", so the whole sentiment stands true. I'll get back to it in a moment.
Anyway, my (lesser) point is that a study like that (even in 1999) must contain at least one plot of [actual score] vs [self-assessment score], or it is worthless, and I stand by it. (The original paper doesn't contain such plot.) My more general point is that what is accepted as "sufficient" data provided in overwhelming majority of research is laughable, and I really want people to stop tolerating that.
(I don't want to delve into discussing the particular paper, I'm purposefully trying to keep all my point as general as possible. Some problems are actually addressed in DK-1999, but the whole study makes you smirk at a circus social psychology experiments are. Here's a nice quote for you (being tested is "ability to recognize what's funny"): "To assess joke quality, we contacted several professional comedians … and asked them to rate each joke on a scale ranging from 1 (not at all funny) to 11 (very funny). Eight comedians responded to our request …, an analysis of interrater correlations found that one (and only one) comedian's ratings failed to correlate positively with the others (mean r = -.09). We thus excluded this comedian's ratings in our calculation of the humor value of each joke.")
Now, back to the f(x, y) ~ x. Let's imagine f(x, y) = C. E.g., every respondent evaluated himself as "average". Is it true that less knowledgeable people turned out to overestimate themselves, and more knowledgeable people turned out to underestimate themselves? Well, yeah, I guess. Does x correlate with y? No. So, what's more valuable here, to plot [x ~ y] or [f(x, y) ~ x]? Moreover, this is obviously different from the case where y (the self-assessment score) is uniformly distributed and doesn't correlate with x, yet it will yield the same line-plot as the above, if we construct them like D&K did. And that's skipping the part that there are several viable ways to define "self-assessment of performance" as f(x,y). I think, if I try hard enough, I would be able to "prove" almost any hypothesis about self-assessment using this method. Of course, I won't really prove anything, but the plots will be convincing enough, and the "results" will be much more demonstrative than the original data is, same as probably is with DK-1999 (but that's impossible to tell without seeing actual source data of DK-1999).
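To make the f(x, y) = C example concrete, here's a toy sketch (my own numbers): the constant-guess world and the uniformly-random-guess world produce the same flat "perceived ability" line when averaged by quartile the way D&K did.

    # two very different self-assessment rules give the same flat quartile-averaged line
    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    x = rng.uniform(0, 100, n)              # actual score (percentile)
    y_const = np.full(n, 55.0)              # f(x, y) = C: everyone answers "a bit above average"
    y_random = rng.uniform(0, 100, n)       # y uniformly random, uncorrelated with x

    quartile = np.digitize(x, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual {x[m].mean():5.1f}, "
              f"perceived (constant) {y_const[m].mean():5.1f}, "
              f"perceived (random) {y_random[m].mean():5.1f}")
    # both "perceived" columns are flat lines (~55 and ~50) crossing the rising "actual" line,
    # so the D&K-style plot can't distinguish these two very different worlds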
TL;DR: It all boils down to the old adage about "3 kinds of lies". Is the result being discussed an example of auto-correlation? Yes, it is. Is auto-correlation necessarily bad? Not really. Is it important and can it obstruct interpretation in the particular example of DK-1999 paper? It's subjective and hasn't been measured, but my evaluation is — absolutely.
> I really don't see how it concludes that the DK effect is wrong based on the analysis.
Neither do I. Basically what the article actually shows is that these two statements are equivalent:
(1) People with low test scores tend to overpredict their test scores, while people with high test scores tend to underpredict their test scores.
(2) People's predictions of their test scores are uncorrelated (or more precisely very weakly correlated [1]) with their actual test scores.
This is not a statement that the D-K effect is wrong. It's just restating what the D-K effect is in different words. All the talk about "autocorrelation" is just another way of saying that, if people's predictions of their test scores are only weakly correlated with their test scores, then people with low test scores will have to overpredict their test scores (because there's virtually no room to underpredict them--there's a minimum possible test score and their actual score is already close to it), and people with high test scores will have to underpredict them (because there's virtually no room to overpredict them--there's a maximum possible test score and their actual score is already close to it). But the real question is: why are x and y so weakly correlated? Why are people's predictions of their test scores so weakly correlated with their actual test scores? That is not what one would intuitively expect. That is the question the D-K effect raises, and the author not only doesn't answer it, he doesn't even see it.
Also, this statement in the description of the Nuhfer research doesn't make sense:
"What’s important here is that people’s ‘skill’ is measured independently from their test performance and self assessment."
Um, the test performance is the people's "skill". And in the original D-K research, it was "measured independently" from the people's self-assessment (their prediction of their test performance).
[1] Notice that in the "uncorrelated data" graph, Figure 10, the red line is basically horizontal. That's what you get when x and y are uncorrelated. But in the original D-K graph, Figure 2, the thick black line is not horizontal--it slopes upward. That's what you get when x and y are weakly correlated. If the author had put in a weak correlation between x and y in his own experiment, he would have gotten a graph that looked like Figure 2. But of course that still would do nothing to explain why x and y are so weakly correlated, which is the actual question.
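For what it's worth, here's a rough sketch of that last point (my own made-up parameters, not the author's experiment): put a weak positive correlation between x and y and the quartile-averaged "perceived" line slopes gently upward instead of lying flat.

    # weak (not zero) correlation between score and self-estimate, plotted the D-K way
    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000
    x = rng.uniform(0, 100, n)                                     # actual percentile score
    y = np.clip(0.2 * x + 40 + rng.normal(0, 20, n), 0, 100)       # weakly correlated self-estimate

    print("corr(x, y) =", round(np.corrcoef(x, y)[0, 1], 2))       # weak positive correlation
    quartile = np.digitize(x, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual {x[m].mean():5.1f}, perceived {y[m].mean():5.1f}")
    # the quartile-averaged "perceived" line now rises gently (roughly 43 -> 57 here) instead of
    # lying flat, which is the qualitative shape of the thick black line in Figure 2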
My issue with the thought process is the general picking of uniform distributions to show random data that follows DK. Uniform distributions are kind of uninteresting in this case because they're pretty artificial. It's not like competence is actually measured in the range 0-100 in anything that actually matters.
Please feel free to tell me why my interpretation is wrong. I understand stats just well enough to get myself in trouble.
The line for actual ability is basically x=y. If you scored 10%, you're in the bottom quartile. If you scored 100%, you're in the top quartile. That line isn't really data, just something for comparison. The perceived ability line is the one that utilizes the data. It seems to show that once you average out what everyone rated themselves, it ends up kind of in the middle between ~55-70%. So the people who scored 10% assumed, on average, they would score around 55%. The people who scored 100% assumed, on average, that they would score about 75%. That makes the average expected score much higher than the actual score on the low end and somewhat lower than the actual score on the high end.
I'd interpret this as the bottom quartile thinks they're average and the top quartile thinks they're a bit above average. So basically everyone thinks they're average-ish, but the people who did worst on the test were the most wrong about that. But then again, I can't remember what the questions on the test were even about and just the single graph isn't terribly useful to argue over because it's missing all of the context of the paper.
Now that I've sat and interpreted the graph using my own set of notions about what the numbers mean and what the graph actually shows, I feel like this ought to be used as one of those life lessons about how quotes and diagrams outside of their context within a paper are the epitome of the phrase "lies, damn lies, and statistics." Statistics aren't always lies, but they're incredibly easy to bend to your own biases and assumptions.
The fact that the statistical artifact is seen in completely uncorrelated data is only shown as a demonstration that it is not itself evidence of the claimed effect. To gather evidence that the effect doesn't actually exist you need a new experiment, not just a new analysis of the same data, because the different levels of actual skill need to be established separately from establishing the error in skill self-assessment. In the original experiment the test results were used to establish both, which is not sufficient.
But the article presents the results from just such a new experiment. In this one they used university education level (sophomore through to professor) and measured skill self-assessment level within those groups. Higher education level (a good proxy for skill on the test used, which was about science literacy) was found to be associated with more accurate skill self-assessment but the bias of lower-skilled people overestimating their skills was not observed.
That's just one study of course, but it sounds like a much better designed one than the original and does constitute actual evidence that the Dunning-Kruger effect doesn't exist.
> The fact that the statistical artifact is seen in completely uncorrelated data is only shown as a demonstration that it is not itself evidence of the claimed effect
I don't understand this part. "Completely uncorrelated data" is usually taken to represent the null hypothesis, but that's not the case here. In the DK paper, the implicit null hypothesis is "people of all skill levels are good at estimating their performance". In this case the "completely uncorrelated data" matches an alternative hypothesis, "people's skills have nothing to do with their ability to estimate their performance in tasks testing that skill". This hypothesis doesn't outright contradict the DK proposed hypothesis (and is certainly not the DK null hypothesis), so getting similar results is unsurprising to me, and I'm not sure that we learn from it anything about the DK results.
As for the other study cited, the figure shown in the article doesn't give a lot of information on density, and looking at the paper itself, figure 4 does actually seem to show that self-assessment gradually shifts left with increasing level of education.
The null hypothesis for Dunning-Kruger isn't "people of all skill levels are good at estimating their performance", it's "people of all skill levels have equal bias in estimating their performance" (remember that Dunning-Kruger isn't that lower skilled people are bad at estimating their skill, it's that they systematically overestimate their skill). The randomly generated data used is one example of that, albeit an unrealistic one: a world in which all people are completely incapable of estimating their performance at all. In this case all of them have no bias at all in their (totally random) estimates. The fact that the artifact is seen in such data is a powerful demonstration that it is not evidence of the Dunning-Kruger effect.
Re the other experiment, as I say it's just one study and I haven't looked deeply into it or others (and nor do I have a position on whether there is a real effect of this nature, or any great interest in it). The point is the article isn't claiming their randomly generated data example is evidence against the Dunning-Kruger effect itself. That needs further experiments such as the one they showed. The random data example is a demonstration that the original paper's analysis is flawed and doesn't support its conclusions.
> The null hypothesis for Dunning-Kruger isn't "people of all skill levels are good at estimating their performance", it's "people of all skill levels have equal bias in estimating their performance"
OK,
> remember that Dunning-Kruger isn't that lower skilled people are bad at estimating their skill, it's that they systematically overestimate their skill
The way I see it these aren't very different - conditioning on low skill and randomly sampling will tend to give way more overestimates than underestimates. What is the distinction between being bad at the skill and bad at estimation, and being bad at the skill and systematic overestimation?
> The fact that the artifact is seen in such data is a powerful demonstration that it is not evidence of the Dunning-Kruger effect
"Such data" refers to a world where everyone have absolutely no idea how good or bad they are. To me that is a much stronger argument than that made by DK. So perhaps our differences all come down to our priors. My prior belief (before looking at any data) is that people would know how good they are at a certain skill. If your prior is to expect that people don't know how good they are, then your arguments make sense to me. If however your prior is that people do know how good they are, but are also biased (all in the same direction), then I don't understand how the random data experiment reveals anything relevant to your beliefs.
> But when you're bad at the skill and can't underestimate, they look the same.
The (definitional) difference between bias and variance isn't related to whether you're bad at the skill or not. It's just mean vs variance of a probability distribution.
If there's a good faith acknowledgement on your part that there's something here you're not getting then I'm very happy to try and help you understand it, and in the spirit of hn I'm assuming that is the case as you've claimed. I'm definitely not interested in any sort of motivated argument, though. If you're attached to the ideas you're putting forward here in some way I have no desire to try and dissuade you.
Operating on the former assumption, I'm not really clear where the misunderstanding lies at this stage, but perhaps it would help if you were to expand on in what sense you think being "bad at the skill" would make bias and variance "look the same"?
Here's my main point of confusion - what does the random data experiment have to do with the DK results?
As stated elsewhere DK has 2 claims:
1. Low-skilled people overestimate their performance and skilled people underestimate their performance
2. Skill correlates with self-assessment accuracy
My first issue with the article is that it implies that since we get effect #1 with random data, that invalidates the respective DK conclusion. This IMO is misleading because random data represents a null model that is very different from my intuitive null model, that of people generally capable of assessing their skills (which I truly believe).
My second issue is that there's no relationship between effect 2 and the random data experiment, which doesn't exhibit anything of the sort. We can have a discussion about the cited papers and effect 2 as the reproduced plot doesn't show density and density plots from the paper do seem to support DK, but that's not my main gripe with the article.
As far as I can see (having checked wiki and the abstract of the original paper - I'm no expert on this) the DK effect is only the first of those claims. However it sounds like claim 2 is less significant here anyway.
Re claim 1 the random numbers example is "all noise, no signal" and I can see the objection that a more convincing example might be to demonstrate the "false" DK effect in an example that does have some signal (i.e. a positive relationship between actual and estimated skill), but that is easy to do and I hope you'll be able to see why if you see my reply at https://news.ycombinator.com/item?id=31042619 and read the comments under the article I mentioned there.
The point is that the DK analysis involves comparing two things which both contain the same single sample from a noise source. Pure noise like the random numbers in the example displays a powerful DK effect due to autocorrelation that says nothing interesting (just that a single random sample of noise is correlated with itself), and that powerful effect can swamp any actual relationships in the distributions. To avoid that effect appearing, you have to make sure that if the two things you are comparing contain samples of a single noise source they are separate, independent samples of it. The experiment with the education level groups achieves this because the education level is "measured" as a separate event from the "actual" skill measurement so they have separate noise sources (and even if they didn't the noise source would have been sampled separately and independently).
I have to say, during the discussion above I hadn't thought through it deeply enough to grok this level of it, and while pondering your last comment I went through a phase of "hang on, am I actually understanding this myself?", so I apologise and retract any suggestion of bad faith.
Thanks! No worries, I appreciate you writing this.
> I can see the objection that a more convincing example might be to demonstrate the "false" DK effect in an example that does have some signal
More than that - as it is, the argument is meaningless to me. It states that DK is trivial in a world where all people have no ability whatsoever to assess their own performance. OK, and finding dinosaur bones is uninteresting in a world where dinosaurs roam free. Both are true, but both are irrelevant in our world (considering my priors). To give a less hyperbolic example, suppose I found some population of people whose weight and height correlate much less than we currently measure, through some biological mechanism of very high variance in bone density or something. To me this article is like saying "well yeah, but this finding is uninteresting, for example if you take purely random weight and height you get an even stronger effect of short people with very high bone density and tall people with very low bone density".
Regarding all the rest - I don't really understand all this "comparing things which both contain the same single sample from a noise source". I'm currently willing to bet (albeit not too much) that any synthetic data experiment you'll come up with, that doesn't display an effect through the DK analysis, will turn out to be based on assumptions that strongly align with my prior, which is that subjects' self-assessment of their performance is correlated to their performance, with 0 bias (on average) and noise that is small (but not negligible) compared to the signal. Would be interested to be proven wrong.
> Pure noise like the random numbers in the example displays a powerful DK effect due to autocorrelation that says nothing interesting
On the contrary, finding out that the distribution in the real world is like that ("pure noise") would be very surprising (therefore interesting, in a sense) to me.
Sorry, but this doesn't make sense to me. There has to be a boundary - a person who got all the answers wrong can't underestimate their performance, and a person who got all the answers right can't overestimate their performance. You could make the case that boundary effects are all DK is about, but that's not what the article is doing (and also I don't think such a claim is supported by the DK plot).
I'm watching this thread with interest and will try to restate my understanding of GP's argument, by means of an example.
If a person's true skill is 5 on a 1-100 point scale, but the person is completely unaware of their true skill and will guess randomly, then their estimate will bias heavily in the direction of overestimating their skill, even if they were not intrinsically motivated to overestimate their skill, simply because far more of the available guesses are higher than their true ability.
In other words, the available probability space itself biases in the direction of overestimating their ability, for those people.
Is that right andersource?
I don't know the statistically right answer here, but I'm curious to know.
It's worth reading the discussion between author and Nicolas Bonneel that starts with the first comment below the article. The author's explanation is very helpful regarding this point.
The main point is that in the paper's randomly generated numbers example, the DK effect disappears if you measure the actual "skill" and the "prediction error" in separate, independent experiments. In the example if you take a "person" and conduct the test you get a totally random result, and you get another, independent totally random result if you test them again. If you perform your "actual skill" measurement using one of those test runs and your "skill estimation error" measurement using another, the DK effect disappears completely.
So, to the extent the result of your skills test has any "noisiness" to it, if you analyse it the way Dunning & Kruger did, the autocorrelation resulting from using the same sample of that noise in the two things you're trying to assess the relationship between will show up as a powerful DK effect, and can easily swamp any actual correlations in the underlying distribution.
Edit: Also worth mentioning footnote 3 in the article, which points out that the use of quantiles introduces a separate bias for the same reason you mention (about there being a minimum and maximum score).
It’s not a complete tautology though, if people’s estimates of their skill were accurate in an unbiased way, we wouldn’t see a DK effect (or we’d only see a slight one, since you can’t really be unbiased at the low and high ends of the spectrum as the other comment pointed out). This isn’t true in the case of uniform random data, or in the real data we see, but it could be true of some data.
It depends on what one means by the "Dunning-Kruger effect."
I had the same impulse that the analysis did not disprove DK, but after sitting with it for overlong I agree with the analysis.
I think there are two competing DK effect definitions that are being conflated, one descriptive and one explanatory:
1. DK shows that less skilled people overestimate their ability, and highly skilled people underestimate it
2. DK shows that people's estimation of their ability is causally determined by their actual ability
I believe you are claiming, correctly, that the article does not disprove the first definition that explains the observation, but I think the article is trying to disprove the second definition that explains why it occurs.
In other words: Yes, there is an observable Dunning-Kruger effect in the sense that we're bad at self evaluation. Is that effect attributable to one's actual level of competence? The evidence for that appears to be a statistical artifact, and further experiments seem to disprove that conjecture.
Yup. That's what I was going to say. The data suggests that everyone estimates their ability similarly, so more skilled people underestimate their ability (impostor syndrome) and less skilled people overestimate theirs.
You're changing the definition of the null hypothesis. What you are essentially saying is that there is no need to perform experiments and studies, I can arbitrarily grab data from anywhere and if it fails to support the hypothesis... then the hypothesis is wrong.
if your analysis produces an effect when fed uniform random data, then yes, no need to perform experiments or studies, because all your results are null. right?
so the hypothesis would be something like "the analysis shows a relationship between the two variables" and the null hypothesis would be something like "the analysis shows a relationship between two uniform random variables" and in this case that null is shown and accepted because no such relationship exists by definition. right? (unless it's like, "they have the same entropy", or something)
i'm very rusty with this stuff, so clarification would be much appreciated!
In the case of DK, the hypothesis is that there is a bizarre relation between perceived skill and actual skill. The null hypothesis is that there is no relation or correlation between perceived skill and actual skill.
A researcher would perform an experiment. If the researcher observes statistically significant results, then they can reject the null hypothesis and say the theory is valid. If the researcher does not observe statistically significant results, they can only say their experiment doesn't support the theory.
note that i have not carefully been through the claimed analysis here (and specifically have doubts about the data v. error plot) but if the claim that the analysis produces the effect with random inputs is assumed, then the whole thing can be rejected at step 0, right?
Excerpt from a newer paper by Nuhfer (2017) adds more clarity:
“… Our data show that peoples' self-assessments of competence, in general, reflect a genuine competence that they can demonstrate. That finding contradicts the current consensus about the nature of self-assessment. Our results further confirm that experts are more proficient in self-assessing their abilities than novices and that women, in general, self-assess more accurately than men. The validity of interpretations of data depends strongly upon how carefully the researchers consider the numeracy that underlies graphical presentations and conclusions. Our results indicate that carefully measured self-assessments provide valid, measurable and valuable information about proficiency. …”
The article is correct. The effect is statistical not psychological. It emerges even from artificial data and occurs independently of the supposed psychological justifications even for data where those justifications are clearly removed.
If you adjust the experiment design to avoid introducing the auto-correlation you get data that doesn't show the DK effect at all. Some might take issue with the adjusted experiment as using seniority related categories like "sophomore" and "junior" as skill levels has its own issues. To show the DK effect is real you need to come up with a better adjusted experiment that avoids the autocorrelation while still generating data that generates the effect. It's unclear if that's possible.
Just anecdotal, but my life observation of DK is often highly intelligent and competent people in a particular field who then generalize that to pontificate and proclaim, directly or indirectly, superior understanding to certified domain experts (e.g., people with directly related advanced degrees who have worked in the field for decades).
It thus seems more or as much a psychological effect - in short, people with a personality type of superiority and know-it-all, yet have never done the deep and hard work to gain or demonstrate any competency in said areas.
A common side observation is of course unfounded conspiracy theories, that the derided experts have sinister intentions.
Anecdotes are not data. Even in plural form. We can witness isolated instances of what seems to be a phenomenon without it actually being part of a phenomenon. Other posters have already established that some experts can overestimate their expertise as well. The study mentioned by the original post and in my comment seems to suggest that the overestimation bias is prevalent across a wide range of cohorts of expertise. Senior students are just as likely to overestimate their talents as freshmen, for instance. This effect likely extrapolates to experts as well; it's just hard to get good data.
No, DK doesn't say no bluster, no proclamations, and no artificial assertions of expertise. It doesn't even say that overestimates are just as prevalent among experts as laypeople. All it says is that, as near as we can tell, the effect size of the overestimation is the statistical autocorrelation, and our best efforts to produce the same effect without relying on the autocorrelation have failed.
I think there are a lot of ways to account for the anecdotes you mention that need much weaker assertions than DK as a psychological phenomenon, and I would hesitate to jump to DK based on that information.
To be fair, I think the counterargument is that ostensible experts can overstate their ability/skill/knowledge to detrimental effect just as easily, and by virtue of the label ignore the reality of the argument/scenario at hand. That is, experts can overlook mistakes they're making, or conflicts of interest, etc because of their status, and because they overestimate their own ability. There's been studies of this in group decision making in crisis situations, where hierarchies can cause failures because the "leader" becomes overconfident and fails to heed warnings by others in the group.
This all gets really murky quickly in practice because of what "low" and "high" competence means, and what constitutes the actual scope of expertise with reference to a particular scenario.
>The V-tail design gained a reputation as the "forked-tail doctor killer",[16] due to crashes by overconfident wealthy amateur pilots,[17] fatal accidents, and inflight breakups.[18] "Doctor killer" has sometimes been used to describe the conventional-tailed version, as well.
It is, though. This article says that if people are bad at estimating their skill (towards randomness / the midpoint), then bad people will overestimate, and good people will underestimate.
The psychological part is that people indeed will assume they're closer to the mean than they actually are. DK effect would not be seen if people correctly estimated their skill, nor if experts overestimated, nor if the incompetent underestimated.
Yes, you are right, and I don't understand how so many commenters in this thread can so confidently state that the article is wrong. I just went ahead and did the random simulation myself and you get the "Dunning-Kruger effect", which is exactly what the author paints it as: autocorrelation.
>If you adjust the experiment design to avoid introducing the auto-correlation you get data that doesn't show the DK effect at all.
This is even in the article! Yet some people are making claims against this despite references to the contrary.
The article shows that the qualitative claim, that low scoring people tend to overestimate and high scoring people tend to underestimate their ability, is nothing but a statistical triviality.
For the Dunning-Kruger effect to have psychological significance, you must quantify this and show that they over- or underestimate their abilities by more than expected.
Regardless of how you set your expectation/null hypothesis, the absence of the effect would mean that the lowest scoring quartile would on average estimate their abilities to lie at 50% or below. It is found however that people in the lowest scoring quartile position themselves in the third or fourth quartile.
I'm not saying that this is necessarily deep or unexpected, just that the article only shows that the qualitative statement is true regardless of any psychological factors, and not that the Dunning-Kruger effect doesn't exist
The y-axis is actual_score - self_assessed_score, right?
So we are interested in whether the average of this y-axis value differs between high scoring and low scoring groups. The study that shows there is no D-K effect [1] linked by the OP article shows there isn't a difference in the average value (y-axis) between groups. In fact the mean value was 0 for all quartiles. So we can conclude that we have found no systematic over/under reporting of actual vs perceived skill. People in the lowest quartile do not appear to position themselves higher than they should. Or at least, we can't reject the null hypothesis.
Modern psychology has had a lot of these sorts of results over the last decade; many of its methods are not holding up under proper scrutiny. The field is struggling to reproduce findings, and more critically, even some reproduced ones are turning out to be statistical and mathematical errors like the one shown here. Some of the findings have also done severe harm to patients over the decades. I can't help but think we need a lot of caution when it comes to psychology results, given the field's harmful uses (such as the abuse of ill patients) and its lack of reliable results.
I'm convinced it's associated with the methodology of how psychology has approached matters.
In the world of biology, you're observing the world around you. Same for physics, chemistry, et al. This means that you can set up proper controls to obscure your own presence from any potential results (e.g., isolate everything in another room, use cameras to avoid being near animals, etc.)
Psychology has the same nightmare as quantum physics: pre-existing thoughts and beliefs literally define what results you end up with.
I'm convinced that psych is a victim of the "new" way of doing science: treating the Scientific Method™ as a self-evident concept instead of regarding science as a vastly certain domain of metaphysics.
It would be kind of ironic if psychologists were more susceptible to the "Pop-Baconian" simplification of science?
And damning if what I've heard about psychology going through multiple paradigms during the 20th century alone is true.
But then, indeed, it's also understandable, due to the "softness" and "paradigmlessness" of the subject matter, as Kuhn pointed out about the latter back then.
It's still sad how we now have detailed theories and histories of science, yet in practice scientists show no interest in trying to learn from them, nor from the mistakes of their predecessors.
But then maybe that would have too high a cost. (Consider all those successful projects where the founders later say: "we didn't know what we were getting into / that this was considered impossible".)
Bonus: Wiseman & Schlitz's attempts to do an adversarial, super-controlled parapsychological experiment (IV.)
So many psychology experiments are so fantastically underpowered, that if the effect they are attempting to measure were real, odds are that it would have to be of such a large magnitude that it would completely overturn all of our beliefs on how humans function. And they can publish with a P value of 0.049.
So, implicit in the standards for publishing new research in psychology is "We think there is much greater than a 5% chance that our entire field is wrong" which is not a great place to start from.
This was interesting to me so I spent a while this AM playing with a Python simulation of this effect. I used a simple process model of a normally-distributed underlying 'true skill' for participants, a test with questions of varying difficulty, some random noise in assessing whether the person would get the question right, noise in people's assessments of their own ability, etc.
I fiddled with number of test questions, amounts of variation in question difficulty, various coefficients, etc.
In none of my experiments did I add a bias on the skill axis.
My conclusion is that the "slope < 1" part of the DK effect (from their original graph) is very easy to reproduce as an artifact of the methodology. I could reproduce the rough slope of the DK quartiles graph with a variety of reasonable assumptions. (One simple intuition is that there is noise in the system but people are forced to estimate their percentiles between 0 and 100, meaning that it's impossible for the actual lowest-skill person to underestimate their skill. There are probably other effects too.)
However, I didn't find an easy way using my simulation to reproduce the "intercept is high" part of the DK effect to the extent present in the DK graphs, i.e. where the lowest quartile's average self-estimated percentile is >55%. (*)
However, it strikes me that without a very careful explanation to the test subjects of exactly how their peer group was selected, it's easy to imagine everyone being wrong in the same direction.
(*) EDIT: I found a way to raise the intercept quite a lot simply by modeling that people with lower skill have higher variance (but no bias!) in their own skill estimation. This model is supported by another paper the article references.
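For concreteness, here's a stripped-down sketch of the kind of process model I mean; the parameters and the skill-dependent noise are made up for illustration, not my exact script:

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_questions = 5000, 20

true_skill = rng.normal(0, 1, n_people)        # latent skill
difficulty = rng.normal(0, 1, n_questions)     # per-question difficulty

# chance of getting each question right: logistic in (skill - difficulty)
p_correct = 1.0 / (1.0 + np.exp(-(true_skill[:, None] - difficulty[None, :])))
score = (rng.uniform(size=p_correct.shape) < p_correct).mean(axis=1)

# unbiased self-estimate of skill, but noisier for lower-skill people
noise_sd = 1.5 - 1.0 * (true_skill - true_skill.min()) / np.ptp(true_skill)
self_estimate = true_skill + rng.normal(0.0, noise_sd)

# convert both to percentile ranks and aggregate by actual-score quartile (D-K style)
def pct_rank(x):
    return 100.0 * x.argsort().argsort() / (len(x) - 1)

actual_pct, perceived_pct = pct_rank(score), pct_rank(self_estimate)
quartile = np.digitize(actual_pct, [25, 50, 75])
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: actual {actual_pct[m].mean():.0f}, perceived {perceived_pct[m].mean():.0f}")
# expect a much flatter perceived line than the actual line ("slope < 1"),
# with the bottom quartile's perceived mean sitting well above its actual mean
```

With numbers like these, the flattening appears with no bias anywhere on the skill axis; widening the estimation noise at the low end is what raises the intercept.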
Wouldn't variance be influenced by a similar bounding effect but this time from the upper side? That is, if your true skill is 98% you aren't going to ever overestimate by more than 2%, but if your true skill is 50% you could be off by up to 50% in either direction.
If we assume random data, then the people at the lower end will over-estimate their own performance by the same amount that people on the higher end will under-estimate theirs.
However, if the under-performers consistently over-estimate more than the over-performers under-estimate there is still some merit to the effect, isn't there?
That is, the interesting number is the difference between the integral of y - x on the lower half vs. the integral of y - x on the upper half. Does that make sense to anyone else?
Yeah, I think so and we're probably in the minority here. There are a couple of other comments referring to regression to the mean and that the article takes a literalist view which is perhaps unwarranted. You win the followup comment. ;-)
I confess that I've never paid that much attention to the classic D-K graph, and that taking a close look at it, it is most assuredly crap. Now I want to know what the plots of the actual scores for those quartiles look like rather than %ile, or after-the-fact ranking. Yeah, it sure looks like people mostly figure they're in the 55-75 %ile ranking, if that's what that actually is, and that where in that spread they think they are correlates with their actual ranking.
Let's go down a Bayesian rabbit hole. Let's assume, as does the article, that people's self estimations are completely random rubbish: the worst people have nowhere to go but up, the best nowhere but down. Yup, completely agree.
Now let me ask a question: is self-estimation of any use in determining actual ability? The answer in this case is no: knowing one does not inform our ability to know the other in a Bayesian sense, they are not correlated.
D-K sounds valuable as a cautionary tale concerning excessive exuberance and a tendency not to learn well from experience, but aside from child-proof caps and Mr. Yuk stickers where we really want to apply the lesson is at the high-performing end of the scale and here we get into trouble immediately.
It is tempting to say "high-performers have nowhere to go but down" as though maybe we should reject those self-reporting the best performance. The classic chart hints at high performers underestimating their true performance, but it's a crappy chart; maybe they want it to be true.
But in the specific case where there is utterly no correlation and true performance is as evenly distributed as self-assessment, if we chop off the "top X self-reporting" we will chop off just as many poor performers as high performers. Yes, I hear you, and I agree, random is an edge case; I just don't believe that affects its prevalence.
Maybe it is true; alright dust off those priors and have at it.
Article seems to be saying “DK doesn’t exist because it always exists”. Which is… absurd?
The point of DK is that when you don’t know shit, any non-degenerate self assessment will result in overestimating your ability. In short, on a bounded scale there is more room above a low score than above a high one. This doesn’t have to do with psychology, and it’s expected that it appears when evaluating random data. That’s a good thing! It means DK exists even when us pesky humans aren’t involved at all, not that DK doesn’t exist at all.
Seems pretty simple. When we create upper and lower boundaries to some score, people with lower scores have more space to overestimate and those with higher scores more space to underestimate, causing the perceived score to trend towards the mean.
I think there's both a component of numbers and psychology here. If the dispersion in perceived score caused by inaccuracy is wide enough to touch the bounds, it will force a trend towards the mean. This effect is possibly exacerbated by a tendency of perception to stray from "extremes", so subjects with a score near the edges will trend to the mean more strongly as they are unlikely to rate themselves the very best or very worst.
This seems pretty simple to correct, so I'm skeptical that nobody has done so yet in these experiments. If true, it's an oversight as interesting as the Monty Hall problem. The basic premise is that the structure of an experiment will naturally nudge randomness in a particular direction, and we need to adjust for that in the analysis. Everyone who does this type of work should know this.
In a simplified experiment where we give people a 3 question quiz, those who got 2 questions right have one overestimation option, 3, and two underestimation options, 0 and 1. So it's very easy to adjust for autocorrelation by checking if a large group of 2-scorers underestimate more than twice as often as they overestimate. Then we see how their tendencies compare against 1-scorers and how they deviate from naturally overestimating more than twice as often as underestimating.
I haven't reviewed these types of papers, but if nobody made even that basic adjustment in their analysis, how many others have been missed in experiments like this?
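For concreteness, a toy calculation of the baseline over/underestimation rates in the three-question example above, assuming purely random guesses:

```python
def baseline_rates(actual_score, n_questions=3):
    """Over/under-estimation probabilities if the guess is uniform over 0..n_questions."""
    guesses = list(range(n_questions + 1))
    over = sum(g > actual_score for g in guesses) / len(guesses)
    under = sum(g < actual_score for g in guesses) / len(guesses)
    return over, under

for s in range(4):
    over, under = baseline_rates(s)
    print(f"score {s}/3: P(over) = {over:.2f}, P(under) = {under:.2f}")
# score 2/3 gives P(over) = 0.25 and P(under) = 0.50: the one-option vs. two-option
# asymmetry above. Observed rates would have to deviate from this baseline to show a real bias.
```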
I share your sentiment that DK being a statistical phenomenon actually makes the effect more interesting. I also, however, think that you are being too semantically liberal when you insist that this proves the effect exists. The DK effect is a _psychological_ effect, with a whole lot of psychological theories around its causes. If the cause is statistical, the psychological effect can no longer be said to exist.
The statistical phenomenon exists, surely - but I think it will be very confusing for everyone to re-use the same name.
OK, I think I understand. What the data from the original experiment actually shows is that people at all skill levels are pretty bad at estimating their skill level- it's just that if you scored well, the errors are likely to be underestimates, and if you scored badly, the errors are likely to be overestimates, by pure chance alone. So it's not that low scoring individuals are particularly overconfident so much as everyone is imperfect at guessing how well they did. Great observation.
So it seems... but I still don't understand why the author thinks it's helpful to say "autocorrelation" dozens of times when he could have just said this.
My intuition for this is: given a fixed and known scoring range (say 0..100), when scoring very low there is simply a lot of room for overestimating yourself and when scoring very high there is simply a lot of room for underestimating yourself. So all noise ends up adding to the inverse correlation naturally.
In other words people are quite bad at estimating their skill level. Some people will overestimate, while some other people will underestimate and on average there will be a relatively constant estimated skill level that doesn't change all that much based on the actual abilities.
Given that fact, it logically follows that people who score low ability tests will more often than not have overestimated their ability (and the same on the other end of the spectrum).
You can frame this effect as autocorrelation if you wish or just as a logical consequence. But that's missing the point.
The point is: why on earth are humans so bad at estimating their own competence level as to make it practically indistinguishable from random guesses.
- what DK claims: there is bias (incompetent people overestimate their ability)
- what the data actually shows: there is greater variance (incompetent people both over- and _under_-estimate to a larger degree than more competent people). The data shows heteroscedasticity, not bias (estimation errors are centered around zero, with a tighter spread for the more competent).
The data supports the claim because indeed it turns out that incompetent people overestimate their ability. This phenomenon exhibits itself with random data too, so it clearly doesn't mean that incompetent people overestimate their ability because of their incompetence.
Or does it?
The trick lies in the fact that when asked to judge your competence you're given a range (e.g. 0-10), and both competent and incompetent people have access to the whole range when taking a self-assessment. I.e. if less competent people were on average more aware of their incompetence, they might be less likely to rate themselves 5 or 6, yet the data shows that no matter what competence level you have, on average you self-assess more or less the same.
This seems to imply that your incompetence indeed doesn't allow you to truly appreciate the full range of skills that are required to reach a higher level of competence.
In other words, the DK effect itself is the cause of the random distribution of the skill self-assessment (which in turn is the cause of the overestimation secondary effect)
That’s what I was thinking; if the average is about constant, all you’ve shown is that everyone is bad at self-assessment (another issue: summarizing a distribution by just its average loses information).
But a comment above quoting the more recent paper presents a contradictory conclusion: that humans can self-assess with some accuracy. So now I’m confused again.
Suppose you make 1000 people take a test. Suppose all 1000 of these people are utterly incapable of evaluating themselves, so they just estimate their grade as a uniform random variable between 0-100, with an average of 50.
You plot the grades of each of the 4 quartiles and it shows a linear increase as expected. Let's say the bottom quartile had an average of 20, and the top had 80. But the average of estimated grades for each quartile is 50. Therefore, people who didn't do well ended up overestimating their score, while people who did well underestimated it.
In reality, nobody had any clue how to estimate their own success. Yet we see the Dunning-Kruger effect in the plot.
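A quick numerical version of this thought experiment (the uniform grades and uniform guesses are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
grades = rng.uniform(0, 100, 1000)      # actual grades
estimates = rng.uniform(0, 100, 1000)   # self-estimates: pure guesses, no information

quartile = np.digitize(grades, np.percentile(grades, [25, 50, 75]))
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: actual avg {grades[m].mean():.0f}, estimated avg {estimates[m].mean():.0f}")
# bottom quartile: actual ~12, estimated ~50 (overestimates);
# top quartile: actual ~88, estimated ~50 (underestimates) -- the classic D-K picture
```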
That's the way I understand the statistical analysis, and in my view this exactly supports (not contradicts) DK:
> In reality, nobody had any clue how to estimate their own success.
Wouldn't that mean unskilled people tend to overestimate their skill, and experts tend to underestimate it? Why is there a contradiction with DK's conclusions?
> Wouldn't that mean unskilled people tend to overestimate their skill, and experts tend to underestimate it?
I think it's because the original paper speculates far beyond it:
> The authors suggest that this overestimation occurs, in part, because people who are unskilled in these domains suffer a dual burden: Not only do these people reach erroneous conclusions and make unfortunate choices, but their incompetence robs them of the metacognitive ability to realize it.
The argument about autocorrelation says this "dual burden" doesn't need to be there to observe the effect.
Again, not in my reading. In the random data thought experiment, everybody (experts and unskilled alike) suffer from the burden of not having the skill to estimate their performance. The author is even surprised that the DK effect in the random data is bigger than observed in the DK experiment ("In fact, as Figure 9 shows, our effect is even bigger than the original") - but that's because in reality, people do have some ability to estimate their own skill. So the claim that the lack of skill is related to the lack of ability to self-evaluate does make sense, or at least, isn't contradicted by the experiment.
Someone who has a skill=0 can not underestimate and someone with a skill=100 can not overestimate. So by the framing of the question alone, the participants are nudged to estimate their own skill "more averagely".
I’ve always felt the DK effect is cynical pseudoscience for midwit egos. It’s a sophistic statement of the obvious dressed up as an insight. But worse, it serves to obscure something interesting and beautiful about humans - that even very intellectually challenged people sometimes can, over time, develop behaviours and strategies that nobody else would have thought of, and form a kind of background awareness of their shortcomings even if they aren’t equipped to verbalise them, allowing them to manage their differences and rise to challenges and social responsibilities that were assumed to be beyond their potential. Forrest Gump springs to mind as an albeit fictional example of the phenomenon I’m talking about. I think this is a far more interesting area than the vapid tautology known as the DK effect.
I think there's an easier explanation for the effect, and that is people are just not very good at judging their skill level, and due to reversion to the mean, low-performers probably overestimate and high-performers underestimate.
And also, I think there is actually a tiny bit of DK going on.
And then, as you say, it gets amplified by the pseudo-literati.
The author is onto something that Dunning-Kruger is suspicious, but the argument is wrong. The "statistical noise" plot actually demonstrates a very noteworthy conclusion: that Usain Bolt estimates his own 100m ability as the same as a random child's. This would be a great demonstration of the Dunning-Kruger effect, not a counterargument.
On the other hand, regression to the mean rather than autocorrelation does explain how you could get a spurious Dunning-Kruger effect. Say that 100 people all have some true skill level, and all undergo an assessment. Each person's score will be equal to their true skill level plus some random noise based on how they were performing that day or how the assessment's questions matched their knowledge. There will be a statistical effect where the people who did the worst on the test tend to be people with the most negative idiosyncratic noise term. Even if they have perfect self-knowledge about their true skill, they will tend to overestimate their score on this specific assessment.
Regression to the mean has broad relevance, and explains things like why we tend to be disappointed by the sequel to a great novel.
The point is that if people estimate their abilities at random, with no information, it will look like people who perform worse over-estimate their performance. But it isn't because people who are bad at a thing are any worse at estimating their performance than people who are good at the thing: they are both potentially equally bad at estimating their performance, and then one group got lucky and the other didn't.
It would require them to be _even worse than random_ for them to be worse at estimating their abilities, rather than simply being judged for being bad at the task. It is only human attribution bias that leads us to assume that people should already know whether they are good or bad at a task without needing to be told.
The study assumed that the results on the task are non-random, performance is objective, and that people should reasonably have been expected to have updated their uniform Bayesian priors before the study began.
If any of those are not true, we would still see the same correlation, but it wouldn't mean anything except that people shared a reasonable prior about their likely performance on the task.
People will nevertheless attribute "accurate" estimates to some kind of skill or ability, when the only thing that happened is that you lucked into scoring an average score. You could ask people how well they would do at predicting a coin flip and after the fact it would look like whoever guessed wrong over-estimated their "ability" and a person who guessed right under-estimated theirs, even though they were both exactly accurate.
This comment section clearly demonstrates the attribution bias that makes this myth appealing, though. And this blog post demonstrates how difficult it is to effectively explain the implications of Bayesian reasoning without using the concept.
Consider the original study: they used 45 Cornell undergraduate students and asked them about grammar. Grammar isn't objective. Everyone there had performed well on the verbal portion of the SAT, but they weren't studying grammar and hadn't gotten instruction on this particular book of grammar they were judged against. It is very likely that what they were capturing in the "better" or "worse" scores is differences in local dialect.
They then judged people whose beliefs about grammar varied from the one book's beliefs about grammar as having over-estimated their performance. They took people out of one context, asked them how they would behave in a novel context, and everyone made an educated guess. The people who guessed correctly were judged to accurately know their own abilities, when actually they may just have gotten lucky.
Thus what Dunning-Kruger's paper actually says is that if you want people to know how you would like them to perform a task, you can't assume they will read your mind: you have to provide them with actual feedback on their performance.
I've felt inadequate throughout most of my early career. That's how I know that the confidence I have today is well deserved.
I've never had impostor syndrome though. To have impostor syndrome, you have to be given opportunities which are significantly above what you deserve.
I did get a few opportunities in my early career which were slightly above my capabilities but not enough to make me feel like an impostor. In the past few years, all opportunities I've been given have been below my capabilities. I know based on feedback from colleagues and others.
For example, when I apply for jobs, employers often ask me "You've worked on all these amazing, challenging projects, why do you want to work on our boring project?" It's difficult to explain to them that I just need the money... They must think that with a resume like mine I should be in very high demand or a millionaire who doesn't need to work.
I've worked for a successful e-learning startup, launched successful open source projects, worked for a YC-backed company, worked on a successful blockchain project. My resume looks excellent but it doesn't translate to opportunities for some reason.
Dunning and Kruger showed that students all thought they were in roughly the 70th percentile, regardless of where they actually ranked. That's it. The plots in the original paper make that point very clear.
It is unnecessary to walk the reader through autocorrelation in order to achieve a poorer understanding of that simple result.
> It’s the (apparent) tendency for unskilled people to overestimate their competence.
Close. It's the cognitive bias where unskilled people greatly overestimate their own knowledge or competence in that domain relative to objective criteria or to the performance of their peers or of people in general.
So, they observe a bias toward the average, and the dependence goes exactly as one would naively expect. If scientists exist to explain things we find interesting, statisticians exist to make those things boring. Seriously, work as a data scientist and you end up busting hopes and dreams as a regular part of your job. Almost everything turns out to be mostly randomness. The famous introduction to a statistical mechanics textbook had me pondering this. If life really is just randomness, it’s hard to find motivation. From a different viewpoint, however, I’ve found that the people that embrace this concept by not trying to control things too much, actually end up with the most enviable results, although I may be guilty of selection bias in that sample.
> Collectively, the three critique papers have about 90 times fewer citations than the original Dunning-Kruger article.5 So it appears that most scientists still think that the Dunning-Kruger effect is a robust aspect of human psychology.6
Critiques cite the work being critiqued (yes, the referenced critiques in TFA cite the Dunning-Kruger study). Also, a 23 year-old paper will inevitably get cited more than 6 year-old papers. But yeah...the inertia in Science is real. That conservatism's a feature, not a bug.
Thanks for introducing me to the term "half-life of knowledge".
> An engineering degree went from having a half life of 35 years in ca. 1930 to about 10 years in 1960. A Delphi Poll showed that the half life of psychology as measured in 2016 ranged from 3.3 to 19 years depending on the specialty, with an average of a little over 7 years.
This is very interesting and makes me wonder what it is for tech careers, e.g. web devs, data scientists etc.
Kidding aside, it seems you could estimate it by asking, what portion of the knowledge I use did I learn 20 years ago? Then 10, 5, 1. For me it seems to be somewhere around ten years.
Unless I missed something, this article doesn't explain WHY random data can result in a Dunning-Kruger effect. The relationship between the "actual" and "perceived" score is a product of bounding the scores to 0-100.
When you generate a random "actual" score near the top, the random "perceived" score has a higher chance of being below the "actual" score, because the numerical range below it is larger than the one above, and vice-versa. E.g. a "test subject" with an actual score of 80% has a (uniform random) 20% chance of overestimating their ability and an 80% chance of underestimating it. For an actual score of 20%, they have an 80% chance of overestimating.
A person with an actual score of 80% will probably have enough confidence in his or her abilities due to experience that they will tend not to rate themselves low. Imagine being a graduate student asked how high (s)he would rank. They would not rank themselves as low as they might have when they were sophomore students. They would probably rank within 20% of their actual score, which is what the final graph in the article shows; professors have enough experience to be able to self-assess themselves better than less experienced subjects can.
As explained in the article, the reason is autocorrelation. Basically, the y axis is correlated with the x axis because the y axis is actually x plus random noise. The Dunning-Kruger graph is then a transformation of that data - still subject to autocorrelation.
This is a fascinating discussion, to which I have little to add, except this. Quoting the article (including the footnote):
> [I]f you carefully craft random data so that it does not contain a Dunning-Kruger effect, you will still find the effect. The reason turns out to be embarrassingly simple: the Dunning-Kruger effect has nothing to do with human psychology[1].
> [1]: The Dunning-Kruger effect tells us nothing about the people it purports to measure. But it does tell us about the psychology of social scientists, who apparently struggle with statistics.
It seems to me that despite rudely criticizing a broad swath of academics for their lack of statistical prowess, the author here is himself guilty of a cardinal statistical sin: accepting the null hypothesis.
The fact that data resemble a random simulation in which no effect exists does not disprove the existence of such an effect. In traditional statistical language, we might say such an effect is not statistically significant, but that is different from saying that the effect is absolutely and completely the result of a statistical artifact.
Later in the article the author points to an article which does systematically illustrate that the D-K effect is probably not real. They achieve this by using college education level as an independent proxy for test skill with the Y variable being an unrelated assessment of skill - self-assessment. So we can be pretty confident that the D-K effect is at least very small.
Seems like a half-baked analysis. You would plot x=x to show where y is above and where it is below. It is useful for exposition. The author questions this as if it is an analytical oversight.
I’m not a scientist, but wouldn’t it make sense for standard practice to be to assume at first that there’s a shared variable (that you have introduced) and to look for it until you’re certain the things you’re plotting are independent? Of course they may not be independent in the end, as that’s the “goal”, but if there is indeed causation, the shared variable will be what you’re looking for, not one of the variables you “know”.
Autocorrelation is a much more interesting and much more important topic than DK, which mostly seems to be a popular concept because it supports biases and other fallacious, ego-driven thinking. Autocorrelation is an under-appreciated problem, particularly in the social sciences and econ. So it’s nice to use DK to catch the attention of the masses and spread the word about autocorrelation.
Tangential, but the more interesting question for me is:
How does estimating my skill level influence skill growth, social relationships and decision making?
I think there are a bunch of useful angles to this. When there are risk/responsibility opportunities, then I need to be courageous. When it’s about learning and interacting collaboratively, then I need to be humble.
I don't find the "autocorrelation" explanation intuitive (although it may be equivalent to what I'm about to suggest). The way I think about it is that it comes about because the y-axis is a percentile rank.
How does it actually work for people to give unbiased estimates of their performance as percentiles? For the people at the 50th percentile in truth, they could give a symmetric range of 45-55 as their estimates, and it would be unbiased. But what about the people at the 99th percentile? They can't give a range of 94-104; the scale only goes as high as 100. So even if they are unbiased (whatever that means in this context), their range of estimates in percentile terms has to be asymmetrical, by construction. So, even if people are unbiased, if you were to plot true percentile vs subjective estimated percentile, the estimated scores would "pull toward" the centre.
Then the only thing you need to replicate the Dunning-Kruger graph is to suppose that people have a uniform tendency to be overconfident, i.e. that people over-rate their abilities, but to an extent unrelated to their true level of skill. The estimated score at the left side of the graph goes higher, but it can't go as high on the right side of the graph because it butts up against the 100 percentile ceiling. Then you end up with a graph that looks like lower skilled people are more overconfident than higher skilled people are underconfident.
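A small sketch of that mechanism; the noise width and the uniform overconfidence offset are made-up numbers, just to show the shape it produces:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
true_pct = 100 * np.arange(n) / (n - 1)                # true percentile ranks

# unbiased-but-noisy estimates, clipped to the 0-100 scale...
estimate = np.clip(true_pct + rng.normal(0, 15, n), 0, 100)
# ...plus a skill-independent overconfidence offset (the made-up +10)
estimate = np.clip(estimate + 10, 0, 100)

quartile = np.digitize(true_pct, [25, 50, 75])
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: true {true_pct[m].mean():.0f}, estimated {estimate[m].mean():.0f}")
# the low end can rise a lot, but the high end hits the 100 ceiling, so the estimate
# line comes out flatter than the true line without any skill-dependent bias
```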
It's an interesting article but the author is using terms a little incorrectly or strangely I think, and making untrue statements. The basic points are important and interesting to think about, but could've been explained more clearly.
I am no expert in statistics or the Dunning-Kruger effect, but this analysis doesn't sound correct to me. If you plot self assessment against test scores, then the following will happen. If people are perfect at self assessment, you get a straight diagonal line. The more wrong they are, the wider the line will get; in the extreme - if the self assessment is unrelated to the test result - the line will cover the entire chart. If people overestimate their performance, the line will move up; if they underestimate their performance, the line will move down.
If you look at the Dunning-Kruger chart, that is what you see, complicated a bit by the fact that they aggregated individual data points. At low test scores the self assessment is above the diagonal, at high test scores it is below. What matters is indeed the difference between the self assessment and the ideal diagonal, but if you don't plot individual data points and instead aggregate them, you have to make sure that there is a useful signal - if self assessments are random, then the median or average in each group will be 0.5 and you will get a horizontal line, but that aggregate 0.5 isn't really telling you anything useful.
I made you a picture [1]. I randomly generated 100 test scores between 0 and 1, then different self assessments. Top left, self assessment matches actual score, top middle, self assessment varies uniformly by ±0.1 around the test score, top right, self assessment varies uniformly by ±0.2 around the test score. None of those have a Dunning-Kruger effect. If you aggregate data points, there will be - as you mentioned - an edge effect because the self assessment will get clipped.
In the bottom row I added a Dunning-Kruger effect, at a test score of 0.7 the self assessment is perfect, below and above that the self assessment is off by 0.5 times the distance of the test score from 0.7. Otherwise the bottom charts are the same, no random variation on the left, ±0.1 in the middle and ±0.2 on the right. You can see that the edge effect is less important as the data points are steered away from the corners.
I will admit that the original Dunning-Kruger chart could or could not show a real effect, really depends on how they aggregated the data and how noisy self assessments are. But if you have a raw data set like the one I generated, you could easily determine if there is an effect. If one could find such a data set, I would like to have a look.
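For reference, a condensed sketch of roughly how a picture like that can be generated (plotting details are approximate; the noise widths and the 0.7-centred D-K transformation follow the description above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
score = rng.uniform(0, 1, 100)

def assess(score, noise, add_dk=False):
    # optionally pull assessments halfway toward 0.7, then add uniform noise, clip to [0, 1]
    s = score + 0.5 * (0.7 - score) if add_dk else score.copy()
    s = s + rng.uniform(-noise, noise, len(s))
    return np.clip(s, 0, 1)

fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for row, add_dk in enumerate([False, True]):
    for col, noise in enumerate([0.0, 0.1, 0.2]):
        ax = axes[row, col]
        ax.scatter(score, assess(score, noise, add_dk), s=10)
        ax.plot([0, 1], [0, 1], "k--", lw=1)   # perfect self-assessment line
        ax.set_title(f"noise ±{noise}" + (", with D-K" if add_dk else ""))
plt.tight_layout()
plt.show()
```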
I find the article frustrating because of the tone. It's also wrong. They misunderstood what the lines mean.
This article is absolutely dripping with condescension throughout and is really pushing a "gotcha" that doesn't exist. It then argues basic statistics, generates a DK-looking graph from random data, and then claims the phenomenon doesn't exist. When in fact, as other people have commented, when people are bad at estimating their own ability (i.e. random), the DK effect still exists; it falls out of statistics.
Sigh, the author misunderstood the very definition of the DK effect:
> "The Dunning–Kruger effect is the cognitive bias whereby people with low ability at a task overestimate their ability. Some researchers also include in their definition the opposite effect for high performers: their tendency to underestimate their skills."
In all the examples, this holds, even if the assessment ability is totally random. Even if every quartile gives themself an average score, like the random data generated here. The author seems to think that it should be even more lopsided or something to demonstrate the effect. (I mean, honestly, what are they expecting, a line above 50th percentile? A line with negative slope? What?)
If there were no DK effect, the two lines would be the same.
Instead, if we go back and look at the original data, we see indeed, the two lines are not the same, the average for the bottom quantile is over 50%, there is some small increase in perceived ability associated with actual ability (and not the opposite).
The sin here isn't some autocorrelation gotcha, but rather, DK should have put error bars on the graph. If it was totally random, the error bars would be all over the place.
The fact that you can generate a Dunning-Kruger looking graph using nothing but noise does indicate that the graph isn't proof of anything.
He also points out that the problem is that there's nothing below zero and nothing above 100. You can't have people who estimate beyond that. He uses another study and it turns out, the less knowledgeable you are about a skill, the worse you are at estimating your ability at all. In both directions.
If the lines were the absolute difference between perceived ability and actual ability, for no effect, the lines still shouldn't be the same. They should converge towards those who are knowledgeable. If anything, the difference line should be nearly a horizontal line. Because there should be greater variance in estimations at the lower end.
It would seem like using violin plots in these type of graphs would help a lot. If low-skilled people are bad at estimating, their variance (and distribution) will be a lot wider.
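For example, something like this (the widening-variance error model here is only an illustrative assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
skill = rng.uniform(0, 100, 4000)
# zero-mean self-assessment error whose spread shrinks as skill increases (assumption)
error = rng.normal(0, 25 - 0.15 * skill)

quartile = np.digitize(skill, np.percentile(skill, [25, 50, 75]))
plt.violinplot([error[quartile == q] for q in range(4)], showmeans=True)
plt.xticks([1, 2, 3, 4], ["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"])
plt.ylabel("self-assessment error")
plt.show()
```

The means stay near zero across quartiles while the distributions visibly narrow, which is exactly the distinction between bias and variance that the averaged D-K plot hides.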
My takeaway, which may be flawed, is that the DK effect really hasn’t been debunked it any fundamental way. It’s just that the effect is statistical rather than psychological. High skilled individuals are still more likely to underestimate their skill level while low skilled individuals are still more likely to overestimate theirs. It’s just that everyone is bad at estimating their skill level and high skilled individuals have more room to estimate below their actual, while low skilled individuals have more room to miss above.
Because of the effect that is actually found (variance is higher the lower the achievement), it follows that the people you encounter who wildly overestimate their ability are more likely to be poor performers (the same is true for the inverse, but they obviously don't stand out anecdotally to us).
IMO that explains why Dunning Kruger seems intuitively correct even if the conclusion they drew isn't actually correct.
> people you encounter who wildly overestimate their ability are more likely to people who are poor performers
How is this helpful? You won't know whether someone is "overestimating" their ability until you learn both their estimated and actual performance, at which point you don't need to guess whether they're "likely" to have poor actual performance.
Agree. This should have been the original conclusion of DK, if they hadn’t made the mistake.
Another way to show this would have been to keep the auto correlation plot, but compare it to the same plot with statistical noise. With infinite random data, the expected value for self-assessment would be 50% score, regardless of actual score - a flat line through the chart. It would then be significant to find a non-flat line, as DK did.
It’s not inconceivable that with a smaller sample, you’d get some bias, where lesser skilled people would overestimate and higher skilled people underestimate.
The follow up studies seem to suggest there’s not really a bias like that, but that there is a “honing” of the general ability to estimate your own outcome, which makes sense.
> Although there is no hint of a Dunning-Kruger effect, Figure 11 does show an interesting pattern. Moving from left to right, the spread in self-assessment error tends to decrease with more education. In other words, professors are generally better at assessing their ability than are freshmen. That makes sense. Notice, though, that this increasing accuracy is different than the Dunning-Kruger effect, which is about systemic bias in the average assessment. No such bias exists in Nuhfer’s data.
I would have been more interested in seeing the raw data from the original Dunning-Kruger study reformatted to avoid auto-correlation. Maybe I've skipped over an important detail in my head, but I don't see why plotting perceived test score vs. actual test score would cause any problems; neither variable is in terms of the other.
The final study discussed is convincing, as far as I could tell. By using academic rank (Freshman, Sophomore, ...) they can plot the difference between actual score and predicted score against rank without auto-correlation. It's just that using academic rank seems a possibly unreliable metric and an unnecessary complication - why not just use data about test scores and predictions of scores, which already exists, in a proper statistical interpretation?
The opposite - the publishing pressure makes experts overconfident in their abilities. The non-experts then assume that the experts, for sure, know better, so they don't look for flaws.