I know nothing about the domain of teacher measurement, and have no opinion on it.
But the first chart in the blog post, with the big mass of blue on it, is surely not the strong evidence of weak correlation that the author is making it out to be?
There could still be a strong correlation in that data, even if it looks like a blob of blue, because there are so many data points on that chart that we can no longer tell where the density of points is.
It's possible that there is a dense line of points in there that gives a strong correlation, which we just can't see.
I'm really surprised to see Gary Rubinstein continuing to post bad graphs which obscure the data. 8 months ago I wrote a blog post about why you should avoid scatterplots, and used his graphs as examples of what not to do.
He is certainly aware of my post - he even commented on it and I offered him some constructive suggestions (such as adding alpha if he wants to use scatterplots).
I'm beginning to think that maybe his goal is to make VAM look bad, rather than to actually clarify and explain the data.
(I also left a comment on his blog expressing my concerns with his plot. Strangely, I seem to be hellbanned on his blog - the post is invisible when I view the post in an incognito window.)
Looks like I overreacted. My post is showing up fine. I often warn against assuming the worst of those you disagree with, and I should have heeded my own warnings.
Important expectation-calibration note: I'm not making any claims about teacher performance, the usefulness of the value-added idea, or any lofty topics in general. I'm just trying to figure out how to recreate the first graph in the initial posting and get a closer look at the data behind it.
I grabbed the data set [1], loaded it into R, and tried to recreate the first graph in the original post. Observations:
1. The value-added scores are quantized, so, as some people have suggested, it's hard to see the true density relationships in the original plot. There are a lot of overlapping data points.
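For anyone who wants to poke at this without R, a rough Python equivalent might look like the following (the file and column names are placeholders, not the real ones from the data set):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file and column names -- substitute whatever the real data set uses.
    df = pd.read_csv("value_added_scores.csv")

    # The scores are quantized, so many points sit exactly on top of each other;
    # small markers and a low alpha make the density visible again.
    plt.scatter(df["score_0809"], df["score_0910"], s=5, alpha=0.05)
    plt.xlabel("Value-added score, 2008-09")
    plt.ylabel("Value-added score, 2009-10")
    plt.show()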
It seems to me that the evidence of weak correlation isn't quite as weak as you imply :-) If there's a dense line in there, how come it's not sticking out of the blob even a little? Do actual phenomena often yield "deceptive" blobby scatterplots with dense lines hidden inside them? That seems a priori improbable to me, thus increasing the Bayesian probability that the blobby scatterplot is not a misleading picture after all. Though of course some more numbers would've been nice.
>If there's a dense line in there, how come it's not sticking out of the blob even a little?
It is. If you look at the density in the upper right of the cloud, it is clearly clustered around the x=y line. The correlation may not be very strong, but without extra information about how strong of an effect would constitute clear evidence in favor of including the value add metric, we can't say anything more about it.
> Do actual phenomena often yield "deceptive" blobby scatterplots with dense lines hidden inside them?
Absolutely. You're looking at a misapplication of a plotting technique - a scatterplot with substantial overplotting. The author may be convinced that there's no correlation there, but by presenting this plot they provide no evidence in support of their point.
A heatmap would be more revealing, but I'd eat my hat if it weren't a standard distribution (and I had a hat). It's just nature (but hey, anything's possible). The scales are only slightly skewed, but we're not talking logarithms. The merit of the scales ("value-added"), on the other hand, is very questionable.
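If anyone wants to try the heatmap idea, a 2D binned plot is only a few lines (again, the file and column names here are placeholders):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("value_added_scores.csv")  # placeholder file name

    # Hexagonal binning is a quick heatmap: color by the number of points per bin,
    # so overplotting can no longer hide where the density actually is.
    plt.hexbin(df["score_0809"], df["score_0910"], gridsize=40, cmap="Blues")
    plt.colorbar(label="number of teachers")
    plt.xlabel("Value-added score, 2008-09")
    plt.ylabel("Value-added score, 2009-10")
    plt.show()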
Clearly all the overlapping points in the first graph are obscuring most of what's going on. This is a warning flag that the author may be trying to color the facts to suit their argument.
It's not very clear what you are trying to say:
(1) What "standard deviation"? Perhaps you mean the standard error of the correlation coefficient is large relative to its estimated value. Given the large number of data points it is likely to be quite small (see the quick calculation below). Another warning flag is the failure to report either the correlation coefficient or its standard error.
(2) What does "ellipse isn't irregular" mean? Given all the overlapping data points the shape of the plot is entirely driven by outliers. To my eye this looks like what you would get from plotting a bivariate normal distribution.
(3) Kepler? The linked graph? How is this helpful in understanding what's going on?
The first graph has a straightforward interpretation: you are making two imperfect measurements of each teacher's performance at two different times. There is noise in both measurements and teacher performance may have actually changed between measurements.
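For point (1), a back-of-the-envelope calculation (with assumed numbers, since neither n nor r is reported in the post) shows how small the standard error gets with this many data points:

    import math

    r = 0.3       # assumed correlation, roughly the figure quoted elsewhere in the thread
    n = 10_000    # assumed number of teacher-year pairs; the real count may differ

    # Large-sample approximation for the standard error of Pearson's r.
    se = math.sqrt((1 - r**2) / (n - 2))
    print(f"SE(r) is roughly {se:.4f}")  # on the order of 0.01 -- tiny relative to r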
Typo: should have been "standard distribution"... Less colloquially: "a two-variable normal distribution."
"What does "ellipse isn't irregular" mean?"
Eccentricity. Sorry 'fer the double negative.
"Kepler? The linked graph? How is this helpful in understanding what's going on?"
If you take the scattered data and draw an ellipse containing a fixed probability (presumably represented, with some confidence, by the data shown), then the higher the correlation, the more eccentric that ellipse would be for the same enclosed probability. It's a single statistic and is analogous to Kepler's 2nd law ("a line joining a planet and the Sun sweeps out equal areas during equal intervals of time"), except with probability rather than area... (To contain the same probability, you have to adjust the ellipse's size while holding its enclosed probability constant and accounting for the new eccentricity, which isn't under our control. If the area is fixed and the distance changes, solve for speed... I think visually. Alternatively, think V=IR: if V is constant and R is known, solve for I.) The link was just illustrative of any such scatter plot. It was the best I could find, but I agree, it could have been better.
I agree that the article is low on quantitative statistics though.
Sorry, I don't mean to belabor this point, but I think "standard distribution" (changed from "standard deviation") is still incorrect. "Standard" refers to standardized parameters (i.e., mean and variance), as in "standard normal", which has a mean of zero and variance of one (and standard deviation of one). That is certainly not the case here. Maybe you mean "bivariate normal with positive correlation"?
No worries. Statistics can be very confusing. Part of the reason it is so confusing is all the sloppy usage we come across in the media and on the internet.
Everyone wants to use statistics as a tool to bolster their point of view, but nobody wants to bother to actually learn enough statistics to deal with the situation they are trying to analyze.
Correlation measures how well the points fit on a line. It is unfortunate that the author did not include a regression line or R^2 value, but nonetheless one can tell there is no line which most of the points lie near. That data is absolutely not strongly correlated.
You seem to be confusing correlation and significance. Having lots of data points makes the correlation more significant, it does not make the correlation stronger.
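A quick simulation (made-up data, nothing from the teacher set) makes the distinction concrete: more points shrink the p-value, but the correlation itself stays put around 0.3:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def noisy_pair(n, r=0.3):
        # Two variables with a true correlation of about r.
        x = rng.standard_normal(n)
        y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)
        return x, y

    for n in (50, 500, 50_000):
        x, y = noisy_pair(n)
        r_hat, p = stats.pearsonr(x, y)
        print(f"n={n:6d}  r={r_hat:+.3f}  p={p:.2g}")
    # The estimated r hovers near 0.3 in every case; only the p-value collapses as n grows.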
one can tell there is no line which most of the points lie near
No, that's not correct. You can't conclude anything from looking at a big blob because you don't know the density of points at different places in the blob. This is the point the guy you replied to was making.
As an extreme example, imagine a billion data points that fit perfectly on a straight line. Then superimpose a million data points randomly on top of it. What does it look like? A big blob. But almost every point is highly correlated with that straight line.
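Here's a scaled-down version of that thought experiment, on simulated data (I'm not claiming the real chart looks like this underneath, only that it could):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)

    # 200,000 points essentially on a line, plus 20,000 uniform-random points on top.
    x_line = rng.uniform(-3, 3, 200_000)
    y_line = x_line + rng.normal(0, 0.05, 200_000)
    x_noise = rng.uniform(-3, 3, 20_000)
    y_noise = rng.uniform(-3, 3, 20_000)

    x = np.concatenate([x_line, x_noise])
    y = np.concatenate([y_line, y_noise])
    print("overall correlation:", np.corrcoef(x, y)[0, 1])  # roughly 0.9

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(x, y, s=4)              # opaque markers: a solid-looking blob
    ax2.scatter(x, y, s=4, alpha=0.01)  # low alpha: the dense line reappears
    plt.show()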
Okay, how's this: unless the author of the original blog post is deliberately deceiving the audience by putting a bunch of points on top of each other, then one can tell there is no line which most of the points lie near.
I agree that a scatter plot is not the best for showing data with that many data points, but frankly it's kind of irrelevant. The point the author was making was just that it isn't that hard to cook up some data that is not highly correlated, but will be if you bin and average it.
It's not a matter of being deliberate. He's just being lazy, using the standard excel scatterplot. To fix that graph, he needs to carefully choose an opacity, which he didn't do.
I agree he could have presented his data better. I also recognize that there is some correlation. Nonetheless, the story was about how averaging can be used to make a correlation appear stronger than it is, and going from 0.3 to 0.99 certainly meets that criterion. I was responding to the claim that the data might already have a strong correlation (presumably on the order of 0.99), and I was arguing that no "natural" strongly correlated data would have a scatter plot like that.
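For what it's worth, the 0.3-to-0.99 jump is easy to reproduce on simulated data, which is exactly the author's point (nothing below comes from the real data set, and the binning scheme is a guess at what the averaged chart does):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n = 100_000

    # Two noisy measurements with a true correlation of roughly 0.3.
    x = rng.standard_normal(n)
    y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.standard_normal(n)
    print("raw correlation:", np.corrcoef(x, y)[0, 1])  # about 0.3

    # Bin x into percentile groups and average y within each bin.
    df = pd.DataFrame({"x": x, "y": y})
    df["pct"] = pd.qcut(df["x"], 100, labels=False)
    means = df.groupby("pct")[["x", "y"]].mean()
    print("correlation of bin averages:", np.corrcoef(means["x"], means["y"])[0, 1])  # close to 0.99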
I might add that discussions like these illustrate a broader problem with the Hacker News community. Rather than discussing the facts of the story, many of the posts are instead about the less relevant detail of how the author chose to present them, despite the fact that the author's point is still effectively made (and if the 0.3 were included there wouldn't have been any doubt).
Edit: since ColinWright explicitly asked us to indicate experience with hard statistics: Check the link in my profile. I work with very messy data. I've also found that the more experience people have with something, the less confident they are about their expertise, so I won't claim to be NotWrong, but the statistics here really aren't hard. It's linear regression with data that appears, cursorily, to be Gaussian, close-ish to iid, etc. The question isn't "did Gary do the math wrong?"; it's more "is he interpreting the data reasonably?"
I will make a strong claim: If doctor effectiveness were this stable, and it were the only data I had available, I would choose a doctor on this metric, ceteris paribus.
I found a scatterplot by the author that uses smaller marks so it looks a lot less blob-ish and more heat-map-ish:
I'm trying to track down the original data to verify this myself, but a couple of quick points:
There definitely seems to be good evidence of a relationship. It might be "weak" but that's not the same thing as "ineffective". Considering the number of data points, it's unlikely that the trend occurred by chance. This plot also makes it clear why the averaging made it look so pretty: the average in just about any percentile group in 08-09 matched 09-10, regardless of how finely it's divided (to a point).
It's certainly the case that there's a lot of "noise". The question is how to interpret it. One might conclude that there is a lot of "measurement noise" -- that teacher evaluations are inherently messy and rather uninformative except as a larger trend. There seems to be reasonable support for this! Students aren't the same, even from year to year. Of course it could be "signal noise" - maybe teachers themselves change from year to year. Perhaps one year you're more motivated, the next you're not, or vice versa.
To understand the implications of this noise, we also need to ask what we're using it for. For instance, if this were a plot of the correlation between money donated to Watsi and quality-of-life, you'd probably think you were doing pretty well! You might point out that a particular quality-of-life metric is flawed or noisy, or that the quality-of-life depends on lots of things besides health, but likely you'd be happy that you're clearly causing some benefit.
So when we ask whether the data suggest the viability of value-added measurements in aggregate, to me that seems to be a yes. But if we ask whether the data suggest using these measurements to make a decision on a per-teacher basis, that depends on the context of the alternatives! If this is the only thing we have to go by, perhaps it's not so bad.
I would love to see how stable the other parts of the teacher evaluation are, and how they fit in. If you combine multiple noisy signals, you often get a much better picture of what's going on! This accounts for only 1/3 of the teacher's evaluation. I would like to think that if you appeared not to be a value-added teacher, but your principals and coworkers spoke highly of you, that you'd still get that raise.
Also, I'm looking forward to multiple years of this data - if your score accounts for 35 percent of the variance from one year to the next, multiple years might look a lot better. Perhaps in deciding teacher tenure, scores over 5-10 years could be really useful.
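A little simulation of the multi-year idea (all the numbers are invented): if single-year scores correlate at about 0.35 year to year, averaging several years tracks the underlying "true" quality much better.

    import numpy as np

    rng = np.random.default_rng(3)
    n_teachers = 10_000

    # Invented model: each teacher has a fixed "true" quality, and each year's
    # score is that quality plus independent noise.  The noise level is chosen
    # so that two single-year scores correlate at roughly 0.35.
    true = rng.standard_normal(n_teachers)
    noise_sd = 1.36

    for k in (1, 3, 5, 10):
        scores = true[:, None] + noise_sd * rng.standard_normal((n_teachers, k))
        avg = scores.mean(axis=1)
        r = np.corrcoef(avg, true)[0, 1]
        print(f"{k:2d}-year average vs. true quality: r = {r:.2f}")
    # Roughly 0.59, 0.79, 0.85, 0.92 -- a noisy metric gets much more useful
    # once you stop judging on a single year.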
This is a surprising blog post in that I draw the complete opposite conclusion from the author. The author seems to think that the averaging hides volatility (which it does), which leads to incorrect conclusions being drawn. Whereas to me it looks like it removes visual noise to show an actual trend.
In his original scatter plot, because of the big ball in the middle it's easy to handwave and say, "look, a random blob" -- but even a slightly longer look seems to indicate there is a positive correlation in the data. The averaging he does at the end, IMO, makes it clear.
You're missing the point. The author agreed that there is a slight positive correlation in the data. But the averaging makes it look like there is a strong positive correlation in the data, which there isn't.
In other words: without the averaging, you are seeing the full implications of the data: slight positive correlation, but a lot of scatter--i.e., not much meat there. With the averaging, you are seeing only the positive correlation, with all other information filtered out--i.e., you are seeing only part of the picture, and it's the part that, surprise, surprise, makes the person who ordered the data collected look good.
Huh? The averaging doesn't give you only the positive correlation. It simply averages it. If the correlation was negative, you'd see that just as plainly. Averaging doesn't change the trend -- it just removes the volatility from the graph and makes the graph more readable (what's the density at y=x? I suspect it's a lot denser than he'd have you believe, but you can't tell from the chart at all).
Sure, you could add a volatility measure as well, but the trend is clear -- you'd rather have your child taught by a teacher with a high teacher score than a low one. Or are you saying that over 12 years of schooling you don't think you'd have any preference between having your child taught by teachers with a score in the lowest 5%ile versus a teacher with a predicted score in the top 5%ile? Of course there's some volatility, but over time, it seems pretty obvious to me.
Nobody is arguing that a small correlation doesn't exist. They are arguing that a small correlation isn't very meaningful.
Or are you saying that over 12 years of schooling you don't think you'd have any preference between having your child taught by teachers with a score in the lowest 5%ile versus a teacher with a predicted score in the top 5%ile?
If you base those percentiles on a metric that has a very low correlation with performance, then there won't be much of a difference in outcome.
Or are you saying that you want to pick who to fire and who to promote based on a metric that we objectively know has only a very weak correlation with reality?
The point is, you shouldn't worry too much about teacher performance.
If it's only loosely correlated, measuring value-added and then punishing / rewarding teachers based on it won't be very accurate. It's like lines of code - programmer talent is probably weakly correlated with lots of lines of code, but you don't want to reward or punish them based on LoC because it will lead to pathological behavior.
You should look at other things to alter (which are less easy to game once you start trying to control them), like class size, course materials, assessment style, how the teacher actually teaches, etc.
Or are you saying that you want to pick who to fire and who to promote based on a metric that we objectively know has only a very weak correlation with reality?
What alternatives do we have? Do you know of a more accurate metric?
One alternative is to not use these flawed/complicated/expensive metrics. Imagine if I came up with a metric that had a 0.1 correlation coefficient with your performance at your job. If you're a programmer, maybe that'd be something like measuring the lines of code written per day. Would you put much faith in that? Of course not; there would be myriad problems with such a system. Same thing here.
The status quo in teaching is to evaluate teachers based on qualitative observations from management, same as it is in most other industries. As the same blogger points out in another post http://garyrubinstein.teachforus.org/2013/01/13/50-million-3... "Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations. As this is what many districts currently do and since this report is supposed to guide those who are designing new systems, wouldn’t it be scientifically necessary to include the existing system as the ‘control’ group? As implementing a change is a costly and difficult process, shouldn’t we know what we could expect to gain over the already existing system?"
The status quo also has the advantage of not being ridiculously expensive and complicated. Ask Bill Gates to release the data if you're curious about its performance, but given what I've seen, I'd be shocked if it was terribly different than these complicated new metrics.
Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations.
It's not that conspicuously absent, as they don't use 100% weighting for anything. That said, they do attempt to maximize the ability to predict state test scores, and in that optimization classroom observation plays the smallest role of the three metrics (2-9%).
Given they appear to have done the analysis across all weightings, I seriously doubt they'd hide this data.
I do find it odd that teachers would rather be judged based on one or two people observing their classroom and ignoring actual student output, rather than raw numbers.
As a developer I'd much rather be evaluated on some metric I could optimize for (I think some metric measuring feature value/bugs/fix rate/etc...), rather than just my manager watching me code/debug a couple of times per month.
I do find it odd that teachers would rather be judged based on one or two people observing their classroom and ignoring actual student output, rather than raw numbers.
As a developer I'd much rather be evaluated on some metric I could optimize for (I think some metric measuring feature value/bugs/fix rate/etc...), rather than just my manager watching me code/debug a couple of times per month.
Okay, I really don't understand this mindset. You'd really prefer it if, say, 1/3 of your salary was determined by the number of lines of code you wrote? Or the number of bugs you closed? Or some equally inane metric that only loosely correlates with your actual performance and can be easily gamed?
You'd really prefer it if, say, 1/3 of your salary was determined by the number of lines of code you wrote?
I was a quant trader for a while. A considerable chunk of my salary was determined by profit, hardly unreasonable.
Measuring developers in other areas is difficult because of heterogeneous goals - last year I built a search engine, this year I'm statistically tracking user behavior. Hard to compare one to the other.
Education does not suffer this problem - last year a teacher taught 30 kids to read. This year she taught 28 kids to read. The goal is always maximizing the fraction of kids who can read.
I don't think it's obvious that these teaching metrics correlate better with teaching ability than LOC would correlate with programming ability. I'd expect weak correlations in both cases. I'm just speculating at this point, though.
One alternative is to not use these flawed/complicated/expensive metrics.
So replace a metric with 0.1 correlation with one having 0.0 correlation?
If you are advocating that we should use the school principal's opinion rather than VAM, why do you believe opinion is superior? Do you have evidence that principal's opinion has a higher correlation with student outcomes than VAM?
So replace a metric with 0.1 correlation with one having 0.0 correlation?
Two problems with this:
1. The other metric almost certainly doesn't have an 0.0 correlation, but we don't know that for sure since the data wasn't released for whatever reason.
2. The metric with 0.1 correlation (or whatever the number is)... keep in mind the context. What is the correlation with? Test scores, something that can be and is often rigged, and is at best tangentially related to the ultimate goal which is inherently qualitative in nature. I think the comparison to programming is instructive. If you had 1/3 of your salary determined by the LOC you wrote or the number of bugs you closed, you would game the system and maximize your salary, even if you weren't providing value by doing that and even if those naive metrics did very weakly correlate with performance in some big study prior to implementing the monetary incentive. What would be gained in that scenario?
If you are advocating that we should use the school principal's opinion rather than VAM, why do you believe opinion is superior? Do you have evidence that principal's opinion has a higher correlation with student outcomes than VAM?
One of the points made in the blog I linked to is that it's unfortunate that they didn't release that data. And as I said above, it's not straightforward to assume that these weak correlations are actually meaningful in practice.
The 0.3 correlation described in the article is the correlation between a teacher's VAM score in a single class last year and this year.
The author is arguing that because the measurement is noisy, we should ignore it. This is silly - it just means a single class-year's point estimate is noisy, and multiple class-years must be combined to form an accurate estimate.
If you had 1/3 of your salary determined by the LOC you wrote or the number of bugs you closed, you would game the system and maximize your salary...
Indeed - if 1/3 of my salary was determined by the number of kids in my class who read better than they were predicted to read at this age, I'd definitely try to make sure that reading skills improved.
Conversely, if a stupid metric such as student/principal opinion were used, I'd focus on jokes and friendliness over education.
(Well, actually I didn't back when I taught. But that sure didn't help my student evaluations...)
The author is arguing that because the measurement is noisy, we should ignore it.
Not quite..
Indeed - if 1/3 of my salary was determined by the number of kids in my class who read better than they were predicted to read at this age, I'd definitely try to make sure that reading skills improved.
Keep in mind that "score on a reading test" and "reading skills" are not the same thing. For instance, the former can be significantly improved by spending valuable class time teaching students tricks for succeeding on the standardized reading test. It's similar to the incentives you'd get by determining salary by LOC - even if LOC correlates with performance, basing pay on it encourages all kinds of nonsense that doesn't actually benefit anyone.
So in that context, we have a metric that very loosely correlates with this clearly flawed marker of success. And we're supposed to spend ungodly sums of money implementing this strategy? Come on.
Conversely, if a stupid metric such as student/principal opinion were used, I'd focus on jokes and friendliness over education.
Well, they do propose to use student and principal evaluations..
Keep in mind that "score on a reading test" and "reading skills" are not the same things.
True. Score on a reading test might be lower than actual reading skills due to a lack of test taking skills.
Then again, I can't think of any measurable quantity more highly correlated with reading skills than scoring high on a reading test. Do you have evidence that "principal's opinion of teacher" is better?
...we have a metric that very loosely correlates...
You are confused. GR shows that year-on-year correlations are weak (0.3), not that the correlation between reading tests and reading skills is weak.
You are correct that a certain nonzero amount of teaching test skills will be effective at improving scores. I see no reason to believe this amount will be large - do you?
Again, standardized tests are not the only gameable metric. Principal/student opinion is as well. Why do you believe they are less gameable and more accurate than directly measuring student ability?
It's not that I think this is the only flawed metric or that there are other methods that are quantifiably better. It's that we've spent ridiculous sums of time and money on this complicated new metric, despite some pretty clear arguments against it, and we still are very far away from actually proving that it is better than the status quo, even if that status quo obviously falls somewhat short of perfection.
It's like.. how much money and effort do you think is worthwhile to spend on something with such uncertain prospects? There are tons of great grant proposals to funding agencies which are rejected. Maybe we would be better off funding some of those. Opportunity costs exist, and it seems like undue sums of money are being wasted on this stuff.
You are missing the connection between these statistics and school HR policy. Sure there is a small positive correlation, but at what cost should we extract this benefit? Given this data, should we use value-add as a major component of teacher evaluations, and ultimately compensation/terminations?
Giving employee evaluations that are 5% performance and 95% random chance is a recipe for an organization that no capable intelligent person would consider for a career.
you'd rather have your child taught by a teacher with a high teacher score than a low one
Not necessarily. The very weak correlation means that the difference in this score only accounts for a very small amount of the difference in outcome; and even that claim assumes that the observed correlation is evidence of a causal link, which is not necessarily the case. You can't usefully conclude anything from this data except that more investigation is needed.
The other thing the author does is to derive absolute findings when the author himself admits to making an assumption about how the numbers were obtained. To me, using the word "lie" in the title is therefore nothing more than click-baiting.
And on a separate note - there's a lot of anti-Gates sentiment flying around lately. Whether his methods are the best or not, he's using his own money. I find it really difficult to give any criticism directed at Bill Gates' foundation any credence. And that's despite not knowing much at all about what he does.
Maybe people would like Bill more if he did a Larry Ellison and bought a yacht instead.
People are arguing that he is using his own money to actively damage society. I'm not sure if I agree with that, but to say you can't criticize someone for what they do with their money is ridiculous - as an extreme, he could use his money to poison people, or to lobby for life sentences for marijuana use.
I said Bill Gates, and not someone. And while I don't know how people think he's damaging society, if the article being discussed here is representative then I remain unconvinced. This is not an invitation to enlighten me. I'm not wasting my time reading yet another opinion, assumption or half-baked arm-chair allegation. The investment dollars and opinion of the likes of Warren Buffett are a much more compelling endorsement to me than (I mean this respectfully) members of HN.
There's a lot of distrust for any billionaire meddling in the education system, and with good reason - they tend to push faddish ideas with questionable evidence, like this one, and somehow manage to convince the media they've "solved" the problems with the education system. Then the state wastes a whole bunch of money on the latest fad.
Also, that's probably why the headline calls it a "lie". This is something of a pattern.
Bill Gates's position as a famous wealthy guy means he has more of a responsibility than the rest of us to not put out misleading graphs. Even if he is spreading misinformation with his own money.
This looks to me like someone trying to make the report look good so that Bill Gates will get positive press, think this was worthwhile, and fund them again.
Many would argue that Bill came by that money via illegal and unethical behavior, paid for by millions of consumers, so it is more like tax money than his own money.
"Whereas to me it looks like it removes visual noise to show an actual trend."
Except for that in this case the noise is more important than the trend. Think about it, if you're firing or sanctioning perhaps 30%+ of teachers each year for no reason, then only complete morons would go into teaching.
It's the same as airport security, where a .1% false positive rate is unacceptable, whereas a 75% false negative rate is just fine.
Did you transpose false negative and false positive in your statement about airport security? I think airport security would much rather have false positives (i.e., this guy has something suspicious, let's do further checks; oh, turns out it was nothing) vs. false negatives (i.e., this guy is clean; oh, turns out he wasn't and blew up a plane).
No, that's correct. The issue is that the overwhelming majority of people aren't terrorists. So even though it's worse to let a terrorist on a plane than it is to ban someone who isn't a terrorist from flying, in aggregate the harms of banning non-terrorists from flying become greater than the harms letting a few terrorists fly even when the false positive rate is very small. Bruce Schneier has a good explanation of this somewhere on his blog. (Actually, this is one of his pet issues, so there are probably dozens of blog posts about this.)
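The arithmetic behind that, with made-up but plausible numbers (these aren't real TSA figures):

    # Invented illustrative numbers, not real statistics.
    passengers_per_year = 800_000_000   # order-of-magnitude US enplanements
    terrorists_per_year = 10            # generously high
    false_positive_rate = 0.001         # the 0.1% from the comment above
    false_negative_rate = 0.75

    false_alarms = (passengers_per_year - terrorists_per_year) * false_positive_rate
    caught = terrorists_per_year * (1 - false_negative_rate)

    print(f"innocent travelers flagged per year: {false_alarms:,.0f}")  # ~800,000
    print(f"terrorists caught per year: {caught:,.1f}")                 # ~2.5
    # Even a tiny false positive rate swamps the system with false alarms,
    # because almost nobody is a terrorist.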
I agree 100%. A scatter plot, with a low but very obvious correlation, shown for two years only seems to me to be absolutely fantastic evidence that progress is being made.
The blog author saying there is no clear trend in his random blob of points is misleading. The trend is there, it is just difficult to tell the density of points in his scatter plot because of the marker size and plot size.
I suspect there'll be a lot of differing opinions on this, and I'm looking forward to seeing the discussion. What would be helpful would be if people could say how much experience they have in hard statistics, and how much of what they say is driven a priori versus from the data.
I know that hackers, in particular, have real problems with "Argument from Authority", but stats is one place where it's really, really easy to go wrong, so knowing how much formal training someone has can be an important indicator.
Statistics, like engineering, is done with numbers. When you attempt to do it without numbers, that's called "opinion".
I have not tried to look at the numbers. Here is my opinion.
My experience of school was that a small minority of teachers are truly excellent, and a small minority were horrible. The truly excellent ones are not distinguished so much by what happened in their class as by what happened in the following classes, and what classes they left people excited about taking. The terrible ones, by contrast, showed up as poor performance all around.
The described data set has a clear correlation. Whether or not the correlation is meaningful in individual cases is much less clear. My initial approach would be to try to fit a hierarchical Bayesian model to the data set, then use the resulting model to come back with predictions about individual teachers. Teachers who, after several measurements, are overwhelmingly identified as terrible should be removed. If you find a population of teachers who show up as superstars, they should be subjected to further study to see if we can predict the quality of incoming teachers, and to see if we can learn lessons from them that improve other teachers.
However this is a well-studied problem. I'm sure someone has tried something like this. I am sure that there are a lot of vested interests. I have not attempted to evaluate work in this area, and I have no opinion on how good it is.
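I won't build the full hierarchical model in a comment, but a minimal empirical-Bayes-style sketch on simulated scores (all parameters invented) shows the flavor: estimate how much of the spread is noise, then shrink each teacher's estimate toward the overall mean accordingly.

    import numpy as np

    rng = np.random.default_rng(4)
    n_teachers, k_years = 2_000, 3

    # Simulated data: a fixed true quality per teacher plus yearly noise.
    true = rng.normal(0.0, 1.0, n_teachers)
    scores = true[:, None] + rng.normal(0.0, 1.5, (n_teachers, k_years))

    teacher_mean = scores.mean(axis=1)
    within_var = scores.var(axis=1, ddof=1).mean()                       # noise variance
    between_var = max(teacher_mean.var(ddof=1) - within_var / k_years, 0.0)

    # Shrinkage factor: the estimated share of a teacher's observed mean that is signal.
    shrink = between_var / (between_var + within_var / k_years)
    grand_mean = teacher_mean.mean()
    shrunk = grand_mean + shrink * (teacher_mean - grand_mean)

    print("MSE of raw yearly means:", round(np.mean((teacher_mean - true) ** 2), 3))
    print("MSE of shrunk estimates:", round(np.mean((shrunk - true) ** 2), 3))
    # The shrunk estimates sit closer to the truth on average, and the shrinkage
    # factor itself is a useful reminder of how little a few noisy years can say.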
I'll bite. I won't discuss my experience in my posts since I don't want to argue from authority but will do so below. But first, I should point out that you're committing a similar sin to that which is alleged in the article. What you really care about is the correctness of an argument. You hypothesize that formal training is an important indicator of correctness. Presumably there's also noise in that scatterplot but we don't even have it. Since it's the best thing available, you are choosing to rely on it.
As for my formal stats training: none beyond high school. That said, I've worked with and interviewed many with far more training. One thing I learned is that most people don't have great intuition for statistical problems, and it's not terribly highly correlated with years of schooling. I've met PhDs in economics and statistics who have made basic conceptual errors, as well as those with less training who were more reliable.
So while training may be correlated with accuracy, you should still demand a well-reasoned argument and think critically about it.
I agree almost entirely. I'm specifically not asking for people to say "I'm a PhD in statistics, and here's the answer. Accept it because I know better than you." What I'm asking for is a complete and reasoned response, along with some evidence as to how much I should listen to you in the first place.
If you have no such evidence then the onus will be on you to make your argument more complete, more coherent, and more comprehensive. If you have evidence (note: evidence, not proof) then you can be a little less rigorous in what you say, and rely on people giving you the benefit of the doubt while they work through the argument.
What I see a lot of is long, apparently good arguments that then turn out not to be as complete or coherent as they appear. They sometimes just don't hang together.
A significant amount of formal study in a subject is evidence that someone might just have a better understanding. After spending a lot of time on the internet I'm tired of having to wade through every single argument in detail looking for all the possible chinks.
Maybe that's just impossible. Maybe every person has to redo every analysis for every argument. Seems like a complete waste of almost everyone's time. What about "Don't Repeat Yourself" or "Don't re-invent the wheel." I guess we are doomed to reinvent the wheel in every single discussion.
Reinventing the wheel would be a major problem if our goal is to solve the education problem with this discussion. No one here has done even the basic work I would expect of someone trying to understand teacher evaluation as a solution and compare it to other alternatives.
I would argue that the whole point of HN is to think through arguments in other domains and build intuition by reasoning through problems and arguments. Otherwise, what's the point? No one is going to arrive at this thread and scan the top-rated comments for the solution to his school district's problems.
In cases where actual decisions are being made, where the analyses are much more thorough and full validation is much more expensive, other techniques are available. First, one generally builds an awareness of the strengths of each team member, which suggests where errors may be more likely. Additionally, one can check a random set of the most likely problem areas. Perhaps most importantly, while everyone won't re-do every analysis, it's highly unlikely that any important analysis will only be done once. So one can expect that the high-level results are generally consistent.
How much formal training does the author of the blog post have? From the short bio, it doesn't sound like much.
That said, my training -- read Friedman's intro text in high school (just freetime reading on my own -- so not formal per se). Two quarters of stats as an undergrad. Two quarters in grad school. All of which at least 15 years ago -- so mostly forgotten anyways. :-)
Or you could ask people to put forth coherent mathematical arguments, since research has shown, for example, that most professional PhD-holding published medical research is statistically incorrect.
"...it has been known for more than 60 years now that correlating averages of groups grossly overstates the strength of the relationship between two variables"
Apparently, this was also used in the past to show high correlation between African-Americans and illiteracy.
There are two problems with this article. First is that the "evidence" is from a scatter plot graph, with no statistical measurements attached to it. Stats really aren't something you can rely on your eyes for. To me, it looks like a weak, positive correlation between the previous year's score and the current year's score. The thing to look for is greater densities in the NE and SW quadrants than the NW and SE. Having a zero-mean "blob" in the data makes it tricky on the eye, but NE and SW points are the ones the model "correctly" predicts, while the NW and SE points are the ones the model "gets wrong". I think the issue is the eye is looking for a line to process a slope of, but such data doesn't stretch into a line. What it should look for is: "given a good score last year, is a good score given this year" or "given a bad score last year, is a bad score given again".
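If anyone wants to actually run that quadrant check rather than eyeball it, it's only a few lines; the data below are simulated at r = 0.3, since I don't have the real scores in front of me.

    import numpy as np

    rng = np.random.default_rng(5)
    n, r = 50_000, 0.3

    # Simulated stand-in for the real scores, with a modest correlation.
    x = rng.standard_normal(n)
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

    # Center on the medians, then count how often a point lands in the NE or SW
    # quadrant (model "gets it right") versus NW or SE (model "gets it wrong").
    same_side = (x - np.median(x)) * (y - np.median(y)) > 0
    print("fraction in NE/SW quadrants:", same_side.mean())
    # For a bivariate normal this is 1/2 + arcsin(r)/pi, about 0.60 at r = 0.3:
    # noticeably better than a coin flip, but nowhere near what the averaged
    # chart's near-perfect line suggests.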
The second issue is that the author does not offer a better solution to the problem. Some information is almost always better than no information in decision making. Statistics on human-based samples are always tricky, and it's tough to get strong correlations easily. As an aside, this is part of why clinical trials are so tricky and expensive.
(I am not a stats professional, but am a biomedical research scientist with some training in human-population based stats).
Exactly. Charts are useful for eyeballing, but only the use of proper statistical tools is valid for showing trends. Charts can easily be very misleading.
There seem to be a lot of comments about the charts on this page. Arguably basic statistics should be a part of every education - is that happening in the USA?
Can't comment on USA - in Canada we had an intro in high school then about a 1/2 course-credit module per year from second year until grad school in the engineering programs.
"In New York City, the value-added score is actually not based on comparing the scores of a group of students from one year to the next, but on comparing the ‘predicted’ scores of a group of students to what those students actually get. The formula to generate this prediction is quite complicated, but the main piece of data it uses is the actual scores that the group of students got in the previous year."
An experienced and resourceful teacher may be allocated more 'problem' students than a relatively new teacher as a matter of policy. Teaching students at the extremes of the ability range is harder than the median as the reasons for (say) severe under achievement can be very diverse (learning difficulties, home situation, undiagnosed medical issues - diabetes in my recent experience with one student - and so on) and each student may need a different approach.
I have taught in a few settings over the years. The best teaching happens in my experience in settings where people work together. Performance related pay based on dodgy statistics in an atmosphere of public ridicule does not seem the best way to encourage sharing of practice and teamwork!
"Good luck with that" as I believe the late Mr Jobbs used to say to competitors.
You are correct that if students were assigned in some biased manner it could invalidate any relationship between teacher quality and subsequent student performance.
That is exactly what makes this study interesting. All the students were assigned randomly! Much like in a randomized medical study this allows the establishment of a causal relationship.
"All the students were assigned randomly! Much like in a randomized medical study this allows the establishment of a causal relationship."
Two thoughts...
1) Ethics: You put some children with dyslexia or with challenging behaviour with newly trained teachers with little experience? Did their parents have any say in this? Was there an ethics process? Was there support available for newly trained teachers meeting major needs in their randomly allocated classes? Amazing.
2) Correlation does not imply a causal relationship. You need a theory. Theory in this area is not positivist, it is fuzzier.
1) Same students, same teachers and same support in same schools but the assignment of students and teachers to classes was randomized. Consent was obtained from principals, teachers and parents.
2) Randomized experiments can establish causation if done properly. This is how medical treatments are evaluated. The theory is simple: teachers who were more effective in the past will be more effective in the future, and using objective metrics (tests, classroom observation and surveys) we can measure teacher effectiveness. This is certainly positivist, but more importantly, likely repeatable.
You seem like you are interested in this, but I get the sense you already have strong views. Maybe you should put those aside for a moment and take a closer look at this new and interesting study; you might learn something:
"You seem like you are interested in this but I get the sense you already have strong views."
I will read the reference, however, to those of us in Europe, please remember that this methodology resembles a group of earnest steam punks analysing flies in amber with brass and ebony microscopes on an ice floe that is breaking up... I'm suggesting that we suspect positivist descriptions do not capture the important values in education. 90% of what children learn is learned outside school I suggest.
I am prepared to be astounded with contrary evidence.
I'm surprised at the comments here. Seems that nearly everyone disagrees with the blog? Is Bill Gates such a saint among the HN crowd that we ignore statistics for him?
Seems pretty clear that Gates' study presents the data in a very misleading manner to make a very weak correlation look like a very strong correlation.
You criticize a short blog post for something that is not done in the Gates publication. Do you have some bias, perhaps? Why does Gates not even give us a scatter plot, let alone a correlation coefficient? This is the critical point raised by the blog.
I never said they don't report "quite a few statistics". I said they don't report this one specific particularly important statistic, and instead obscure it behind a ridiculous averaging procedure. This was either done because they are incompetent or because they are purposely trying to mislead. That's the whole point of TFA.
Using LOC to hire or fire programmers ignores important contextual factors, just as using test scores of students to hire or fire teachers ignores important contextual factors. What language is being used? How large is the project? Existing code? Platform? Bug count? Does the person contribute to the team in other ways?
In teacher evaluation you also have critically important contextual factors. What course is being taught? Are the students motivated? What's the classroom like? A lot of ELL kids in the class; does the teacher speak their language?
Many people see some amount of value in these imperfect metrics because they a) seek to measure the critical output (programmers are paid to make code and teachers are paid to make kids learn), and b) simple numeric metrics aren't biased by human fallibility. Popularity amongst peers won't have any impact on LOC or student test scores.
The author feels they shouldn't be used as 33% of the overall evaluation, which is the suggestion made by Gates.
In terms of the specific points the author made about Gates' evidence in favor of their suggestion of 33%, I think he's basically right. I haven't read the original study, and so I'm assuming for the sake of discussion that he's conveying it correctly.
Oversaturated scatterplots are not very useful, and in the hands of someone with no training or knowledge of basic statistics they may be dangerous, but they're not wrong. A trained consumer of research, seeing a plot like the author's first, will only be able to conclude that the correlation is not perfect. That rules out only a subset, and a subset that's fairly uncommon outside of the physical sciences, of the possible relationships between two variables. Not very useful, but not misleading unless there's an assumption that the number of data points is relatively small (which may be the case in some research contexts).
Averaging points and then taking the fake aggregated points and plotting those is wrong. It's using a tool to show variability in the wrong way. Averaging hides the variability that is supposed to be shown in a scatterplot. Why not average down to 2 points and draw the line then? Because that's wrong.
I'm thankful that one or two other people here understand, but a bit dismayed at the kneejerk groupthink.
I see the OP's point, that visual noise can be filtered in such a way as to create a more compelling trend, and perhaps the filtering that Gates has done has simplified the results...so, OK. Let's not look at charts, let's look at actual calculations that have the various factors accounted for...what are they?
And I guess I'll throw in my two cents on the inevitable teachers-vs-data debate. I believe teachers may be too quick to toss aside performance analysis because they are so close to the personal stories and circumstances...no line chart can seem to account for the student who was improving greatly but then missed an insurmountable amount of class time due to problems at home.
Most humans are unwilling/unable to see their lives and work measured as averages and trend lines and deterministic outcomes, and I think teachers hold this belief acutely due to the nature of their work. And unfortunately, this blinds them to the possibility that while there is no cookie-cutter solution that always works, there are surely some measures/guidelines that often work, or at least mitigate even the most extreme circumstances. As it is, there's not much incentive or opportunity to see "the bigger picture" outside of what happens in one's own classroom year after year.
That said, this is not even close to being mainly the fault of the teachers. Among the other problems is that administrators are just as prone to shortsightedness, and "the numbers" can be wielded by a petty/mediocre principal/district officer to apply pressure to a teacher. It's no wonder that teachers are quick to distrust analyses, given the possibility that they're abused for political means.
So, essentially, it's a giant crapfest with no easy solutions. From the perspective of Gates' efforts, we can only hope that he pours in so much money and advocacy that there's eventually a shift towards best practices...though it's safe to say there will still be unintended consequences that aren't all favorable.
Among the other problems is that administrators are just as prone to shortsightedness, and "the numbers" can be wielded by a petty/mediocre principal/district officer to apply pressure to a teacher.
The more we rely on objective numbers, the less a good teacher has to fear from a bad principal. The principal has no ability to make the teacher's students do worse than their statistical predictor, after all.
On the other hand, many other popular metrics (e.g. principal's opinion/coworker's opinion) can easily be tweaked by a principal who wants to put pressure on a teacher.
Both sides are right but they seem to be concerned with different things.
Grouping data points does mask volatility. This doesn't make the effect any less real on the societal level. If we seek to improve student performance, skewing the pool of teachers towards the better ones will do that.
Yet there is also an enormous amount of volatility which is also real. Ideally we would just create better predictors. Constraining ourselves for the moment to this set, since they have high error rates on the individual level, it's also reasonable to assert that in relying on this metric, we will make unfair decisions for many teachers.
The right solution depends on how much you value optimizing student outcomes vs. optimizing fairness as well as second order effects such as driving away teachers. My bias would be to use these scores for economic incentives to attempt to ensure that retention for the best teachers is higher than for the worst. Without additional work, termination likely isn't warranted based on this data.
To those who argue that pay differentials require better evidence, I would suggest comparing the validity of performance evaluations across any other job. It's imperfect but better for society than doing nothing.
For long-term fairness, it's important that there isn't a single value-added model. Competitive pressure can do wonders to improve model quality and it's also fairer since those who don't do well on one model can move to a different rubric.
He doesn't actually say what the positive correlation is, which I find disingenuous and suspicious. Because his scatterplot is so non-detailed, it's about as useful as the following scatterplot of SAT scores:
X|X
---
X|X
Where the axes are scores above and below 600 in math on the first and the second time taking the test. There are some individuals who do better or worse - but just because we can draw a detail-obfuscating graph of data doesn't mean there is no detail in the data, or no correlation, or that SAT scores won't help us predict future SAT scores. It just means we can draw a detail-obfuscating graph, and that there's not a perfect correlation.
One point the author does have is that these scores are not perfectly predictive, so bad performance one year shouldn't mean that the teacher gets, e.g., fired. OK. He seems to be protesting using information to draw any conclusions at all. Perhaps it would be more effective to show what fallacious or irrational conclusions are being falsely drawn from this data and used to fire teachers, and then object to those specific instances of irrational actions. This value-added measure has not been shown to be intrinsically unreliable or wrongheaded, but maybe some applications of it are.
I'm working on reproducing the graph with the data available from the NYT, but I'm not sure on one or two details of the author's method. If someone sees it, please let me know! Things I can't tell so far:
1) How is a "teacher" defined? Just(hash firstname lastname)? If so...
2) When a teacher teaches multiple subjects per year, which one is chosen? Or are they both represented, so that "teacher" actually means (hash firstname lastname subject)?
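In case it helps anyone else reproducing this, here's roughly how the two candidate definitions would look in pandas (the file and column names are guesses, not the NYT file's actual ones):

    import pandas as pd

    df = pd.read_csv("nyc_value_added.csv")  # placeholder file name

    # Definition 1: a "teacher" is just a name.
    by_name = df.groupby(["last_name", "first_name"])["value_added_score"].mean()

    # Definition 2: a "teacher" is a (name, subject) pair, so someone teaching
    # both math and ELA appears twice.
    by_name_subject = (
        df.groupby(["last_name", "first_name", "subject"])["value_added_score"].mean()
    )

    print(len(by_name), "teachers under definition 1")
    print(len(by_name_subject), "teacher-subject pairs under definition 2")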
Nice, thanks for sharing. I used the same variables with the exception of dbn; my code is here https://gist.github.com/4652968 (I later started applying filters such as 4th-grade teachers only, etc., which is reflected in the gist.)
I was pretty curious how the author specifically munged his data, since then we could put to rest the speculation about the degree of correlation.
There's such a high correlation between an individual student's score from one year to the next, and the number of data points is so small, that trying to tweeze out the teacher's contribution in a statistically meaningful way is fairly hopeless.
What I'd like to see are correlates with a teacher's own scores on the tests of the material they're supposed to be teaching (oh, you think they all can get perfect scores easily? Hah!), and whether teachers banned from giving homework do any worse, and whether dumping the enforced curriculum in favor of letting students study at their own pace makes any difference. Given how little difference what the teachers actually do seems to make, it would be logical to at least dump the things which make school dull and unpleasant.
There is no objective way to measure teacher performance. Any evaluation method that can be written as a list of rules can and will quickly be gamed.
The thing is, it's easy to find out who the best teachers are. Simply ask the students, parents, and staff. They all know who the good ones and the bad ones are - and that can't be gamed.
Your claim makes no sense. If the evaluation method measures what we want teachers to do, then it, by definition, can't be "gamed": if the measurement is high, then the teacher is doing what we want.
Are you claiming that there's something special about teaching that makes teacher performance much more difficult, or even impossible, to quantify? If so, what is the point of teaching if one cannot measure results?
Your solution to "ask the students, parents, and staff" is precisely a measurement method (albeit a more qualitative than quantitative one), and moreover, one that can be gamed easily by anyone who's socially shrewd. Qualitative measures like that is exactly why we have so many crappy politicians.
You can't specify everything in the rules. For example, I know a case where management decided to hand out bonuses to the staff for minimizing inventory, because inventory costs money. The staff minimized inventory, all right, and got their bonuses. Meanwhile, the production line would regularly get halted because they'd run out of things like 5 cent resistors. It was a disaster.
The same company decided to rate programmers based on the bug list. A big chart was posted on the wall that showed the daily bug count. It wasn't a week before huge fights erupted over the bug count - over what was and was not a bug. The programmers quickly gamed it. They'd hide bugs, they'd refuse to fix one bug if that fix would produce some other minor bugs (e.g. if the bug was "feature X missing", then feature X is added, but had a couple issues with it, then X would not get fixed). They'd even add "bugs" blamed on other programmers and then "fix" them to get the credit.
Management gave up on that after two weeks and pulled the banner down.
For teachers, the metric is (roughly speaking) "% of students capable of multiplying/dividing numbers up to 4 digits in a standardized test setting". How do you game this metric?
The fact that one company used a couple of bad objective metrics doesn't mean all objective metrics are bad. They are used with a great deal of success in many fields. Sales people are paid on commission, traders are paid proportionally to (risk adjusted) profits, etc. All it means is that if you set the wrong goals for your company, you'll probably succeed at the wrong thing.
So tell us - is "maximize the % of students capable of multiplying/dividing 4 digit numbers" the wrong goal? If so, what is the right goal, and why can't it be measured?
One popular way of gaming the metric is systematic cheating on the tests by the teachers. I say popular because it has happened on a large scale, most recently in Atlanta as I recall.
Another way to manipulate the test results is to manipulate which students are in your class or your school.
The point is, people are endlessly creative in subverting rules to their own benefit - so they conform to the letter of the rules but not the spirit.
Consider also the "work to rule" technique used by unions as a bargaining tactic. It's as simple as the workers literally adhering to their job descriptions. It doesn't work out well for the company.
To prevent cheating, test administration should be handled by someone other than teachers.
The fact that the current system has a bunch of cheaters is not an argument against more carefully and objectively measuring the current system. What next - bankers sometimes engage in rogue trading, so we should reduce monitoring of their behavior?
Another way to manipulate the test results is to manipulate which students are in your class or your school.
This is very difficult with VAM, since the goal is to increase (actual score - statistically predicted score). You need to reliably identify students who will do better than their statistical predictor.
I.e., you need to discover students who will improve drastically this year and then pack your student body with them.
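To make "actual score minus statistically predicted score" concrete, here is a toy version of that kind of calculation, on entirely simulated students (the real NYC formula is far more elaborate):

    import numpy as np

    rng = np.random.default_rng(6)
    n_students, n_teachers = 6_000, 200

    # Simulated world: last year's score predicts most of this year's score,
    # and each teacher adds a small bump on top.
    teacher_of = rng.integers(0, n_teachers, n_students)
    teacher_effect = rng.normal(0.0, 0.1, n_teachers)
    last_year = rng.standard_normal(n_students)
    this_year = (0.8 * last_year
                 + teacher_effect[teacher_of]
                 + rng.normal(0.0, 0.5, n_students))

    # "Statistical predictor": a simple regression of this year's score on last year's.
    slope, intercept = np.polyfit(last_year, this_year, 1)
    residual = this_year - (intercept + slope * last_year)

    # A teacher's value-added estimate is the average residual of their students.
    vam = np.array([residual[teacher_of == t].mean() for t in range(n_teachers)])
    print("correlation of estimated VAM with true teacher effect:",
          round(np.corrcoef(vam, teacher_effect)[0, 1], 2))
    # Noisy, but clearly positive in this toy setup -- and hard to game by
    # hand-picking students, since the predictor already accounts for where they started.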
> "% of students capable of multiplying/dividing numbers up to 4 digits in a standardized test setting". How do you game this metric?
You use the time that you used to use for teaching reading and devote it to teaching methods for multiplying and dividing numbers up to 4 digits in a standardized test setting.
You might say this isn't gaming the test, it's just a stupid way for schools to optimise their results for that metric. But that metric is mechanical; there's no measurement of deep understanding of the principles, just success or failure at applying mechanical rules by rote.
Most US standardised test settings use machine-marked multiple choice. I'm told that there are lessons for 'bubbling in' - lessons on how to fill in the multiple choice answer bubbles to ensure fewer mis-marked answers.
> You use the time that you used to use for teaching reading and devote it to teaching methods for multiplying and dividing numbers up to 4 digits in a standardized test setting.
Clever - you caught me. I got lazy and didn't feel like typing up all the goals of a 3rd grade education system into an HN comment, preferring to provide a simple example and hoping that a reasonable reader would extrapolate.
Mea culpa - I'll stop assuming reasonable readers in the future.
Or are you actually suggesting that the school system might forget to include reading when defining their goals?
> up all the goals of a 3rd grade education system
Can all the goals of a 3rd grade education system be reduced to a purely mechanical list of stuff?
Let's try applying your example of arithmetic to reading. What are the goals? To get children to read individual words? Or to get children to read a sentence, and obtain meaning from it? If it's just to read words, do you get those words from a defined vocab? Should they all be real words, or do you include nonsense words too?
While schools may not have stopped teaching students to read, they have cut out other parts of the curriculum to focus on what's being tested.
And there's a risk of a cut-off point: there are a bunch of children who will read, and a few children who are struggling to read. Do you spend extra time and money on the few strugglers (some of whom are going to fail whatever you do), or do you concentrate on the majority (and get most of them through the test and thus look good)?
> Can all the goals of a 3rd grade education system be reduced to a purely mechanical list of stuff?
Yes, I would hope that a multi-million dollar enterprise can clearly define their goals.
> What are the goals? To get children to read individual words? Or to get children to read a sentence, and obtain meaning from it?
I don't know off the top of my head whether the latter should be learned by 3rd grade. Ultimately setting the goals of our educational system is up to the various bureaucrats in the school system.
However, regardless of what the goal is, you still haven't given a way to game the system apart from "teach kids to read [words/sentences]".
> While schools may not have stopped teaching students to read, they have cut out other parts of the curriculum to focus on what's being tested.
Indeed - if the school system is not achieving their primary goals, they should cut secondary goals and focus on the primary ones. That's a good thing.
> Do you spend extra time and money on the few strugglers (some of whom are going to fail whatever you do), or do you concentrate on the majority (and get most of them through the test and thus look good)?
This depends on what the goals of the school system are. If you want to ignore the strugglers, set the goal to be maximizing this function:
student_scores.max()
If you want to help the strugglers and ignore the strivers, choose this one:
student_scores.min()
Choose this one if helping a struggler is equally important with helping a striver:
student_scores.mean()
This one is like the previous, but strugglers get a bit of extra weight:
log(student_scores).mean()
Setting a goal merely forces you to acknowledge possible tradeoffs and decide which ones should be made (if the need arises).
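For concreteness, here is a runnable version of those objective functions (numpy, made-up scores) showing how the choice of aggregate encodes who you care about most:

    # Toy class of five students; scores are invented for illustration.
    import numpy as np

    student_scores = np.array([35.0, 60.0, 72.0, 85.0, 98.0])

    print("max  (only the top student matters):     ", student_scores.max())
    print("min  (only the weakest student matters): ", student_scores.min())
    print("mean (each point of improvement equal):  ", student_scores.mean())
    print("log-mean (strugglers weighted extra):    ", np.log(student_scores).mean())

The log-mean weights strugglers more because log is concave: one extra point for the weakest student moves the average more than one extra point for the strongest.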
> However, regardless of what the goal is, you still haven't given a way to game the system apart from "teach kids to read [words/sentences]".
Sure I have. You ignore everything that is not tested. This gets you children that pass the tests. But it ignores all the other work that schools should be doing, and it reduces education to the worst, least inspiring, mechanical drudge work.
> I don't know off the top of my head whether the latter should be learned by 3rd grade. Ultimately setting the goals of our educational system is up to the various bureaucrats in the school system.
They can't agree. That's why I listed nonsense words in one of the requirements. There's an argument about whether phonics or whole-word approaches are better, even though we have good research showing that phonics is better. And so when you look at phonics methods (which include nonsense words in the tests), you get disagreement within the phonics camp, and you also have all the non-phonics people piling on.
But this should be easy to discover, right? We have millions of children learning to read each year. We randomise them, set up a control group, and give other groups different methods. Then we test (assuming we can get agreement on what and how to test).
> But it ignores all the other work that schools should be doing...
Such as?
> But this should be easy to discover, right?
No. Choosing your goals is about subjective value choices. If nonsense words are intrinsically valuable, they should be included, otherwise they should be excluded.
You are conflating the setting of goals with the method used to achieve them. If phonics is superior (I agree with you that it is), it will achieve higher scores. If teaching children nonsense words helps them understand real words, then teachers wishing to maximize their score will teach them.
Regarding stating goals clearly: the problem is that the real goal is something like "maximize future student happiness" or perhaps "maximize future student income", and any set of test goals is therefore necessarily an approximation. I think that if you want to make this argument, and it seems like a reasonable one, the thing that actually needs to be shown is that some specific set of easy-to-write-down test scores does a reasonable job as an approximation, or at least could in principle if one could just find the correct test.
> ...the thing that actually needs to be shown is that some specific set of easy-to-write-down test scores does a reasonable job as an approximation...
All we really need is to believe that it's a better approximation than the alternative, which is currently something along the lines of 0.5 x Principal's Opinion + 0.5 x Union Seniority (at best).
Most of these issues can be addressed simply by completely separating education and evaluation. I.e., teachers take the day off when tests are administered, some bureaucrat shows up and does it instead.
Treating damaged test materials/absent students as a score of 0 is the simplest way to prevent hacking the student body.
> vie for the most favorable students by pulling strings with the administration
The most favorable students are those who will perform better than their statistical predictor. How do teachers know who those will be?
I don't think it's so easy to manipulate a classroom full of students into thinking you're a great teacher when you're not. Remember, you're in front of them several hours a day, 5 days a week.
For some students, sure, but a majority? And I'm not so sure that will impress either the parents or the staff. Also, was that work busywork, or productive work? Students can tell the difference.
I had a high school teacher who explicitly wanted to be popular by assigning no work. The students did like him, but they did not respect him (and laughed at him behind his back). Those of us who wanted to go on to college also knew he was shortchanging us.
The author's argument in this post is rather poor. He states that because the raw data plot is noisy and weakly correlated, no relationship can legitimately be drawn by averaging the data into buckets.
While the author is correct that averaging hides the variance of the data, he does not try to answer the question "Can a trend be found in noisy data?" or apply other accepted statistical methods to show that there is no correlation in the data.
Instead the author dismisses the report from Gates through name calling: "It seems like the point of this ‘research’ is to simply ‘prove’ that Gates was right about what he expected to be true. "
As far as I can tell, the author breaks the universe of teachers into 5% groups and then proceeds to show that the percentile groups are ordinally stable between these two years. In other words, the top scoring group in year 1 beat all other groups in year 2, and so forth, with perhaps some slight shuffling in the absolute middle of the distribution [hard to see on the graph].
This is supposed to be evidence that the correlation is weak??
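If I've understood the analysis being described, it's roughly the following (a sketch on simulated data with made-up parameters; the real version would use the actual teacher scores):

    # Group teachers into 5% buckets by year-1 score, compare year-2 bucket means.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    n = 10_000
    year1 = rng.normal(0, 1, n)
    year2 = 0.35 * year1 + rng.normal(0, 1, n)    # modest individual-level correlation
    df = pd.DataFrame({"year1": year1, "year2": year2})

    df["bucket"] = pd.qcut(df["year1"], 20, labels=False)   # twenty 5% buckets
    bucket_means = df.groupby("bucket")["year2"].mean()

    print(f"individual-level r = {np.corrcoef(year1, year2)[0, 1]:.2f}")
    print(bucket_means.round(2))

Even a modest individual-level correlation produces bucket means that climb from the bottom group to the top with at most minor shuffling in the middle, which is exactly the ordinal stability being pointed out.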
Even if Bill Gates isn't doing the best job, every time I see one of these articles, it appears that there isn't any advice on better ways to grade teachers. Did I miss that in the article? Are there a lot of other people in the field who are doing better work? (I would really like to read about other ways of analyzing teachers if anyone can provide links to papers.)
The blog link to this thread is a propaganda site. It conspicuously doesn't reference any of the original materials that it is attempting (weakly) to refute.
There was an article on HN last week that referenced an article closer to the source material about the project.
I'm confused as to why the "'raw' score (for value added, this is a number between -1 and +1)" goes up to ~1.6 in 2009-2010 and up to ~1.1 in 2008-2009. Why does the data go outside the ranges he states? I find it hard to trust someone who can't read something so simple off his own graph.
The author incorrectly stated that "value added" is a score between -1 and +1. In fact, "value added" scores are z scores, which can in theory range from (-∞,∞).
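A minimal illustration of why a z score isn't capped at ±1, assuming the usual convention of (value − mean) / standard deviation (I'm not claiming this is the exact MET formula):

    import numpy as np

    rng = np.random.default_rng(2)
    raw = rng.normal(0, 0.5, 10_000)          # made-up raw value-added estimates
    z = (raw - raw.mean()) / raw.std()        # standardize

    print(z.min(), z.max())   # with 10k draws, typically well beyond -3 and +3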
To build intuition, let's consider a hypothetical society. We divide a group of people into two. The first group rolls a die with 100 sides labeled 1 to 100 in even increments. The second group rolls a die labeled 1 to 110 in even increments. The roll determines their annual ability level, which is otherwise unknowable. If you were hiring, which group would you prefer? Is it fair to the second group?
Note that this isn't strictly equivalent for several reasons. In the game above, the error is part of the generating process instead of the measurement process. Also, the distribution is discrete rather than continuous. The key commonality is that a weak predictor at an individual level can be a meaningful predictor at a group level.
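A quick simulation of the dice analogy (Python/numpy, arbitrary sample sizes) makes the individual-versus-group distinction concrete:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    group_a = rng.integers(1, 101, n)    # 100-sided die, faces 1..100
    group_b = rng.integers(1, 111, n)    # 110-sided die, faces 1..110

    rolls = np.concatenate([group_a, group_b])
    is_b = np.concatenate([np.zeros(n), np.ones(n)])

    # Individual level: group membership explains almost none of the variance.
    r = np.corrcoef(is_b, rolls)[0, 1]
    print(f"individual-level r = {r:.3f}, r^2 = {r**2:.4f}")

    # Group level: the mean difference is reliably about 5 points.
    print(f"group means: A = {group_a.mean():.1f}, B = {group_b.mean():.1f}")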
Imagine if one group had 55% of its dice labeled up to 110 and 45% up to 100, and the other group had the opposite. So if all you use for hiring is the group membership, you're going to very arbitrarily miss out on a lot of 110s and hire a lot of 100s. And now also imagine that this process of identifying who is in what group is ridiculously expensive.
It's not contrary at all. It's the same point. The changing mix is merely a difference of degree. The key argument you raise is that discrimination has a cost. I don't have the numbers offhand to determine whether the cost of monitoring exceeds the benefit of higher quality education. Regardless, that's a different argument that should be backed by a different set of data.
It's contrary in that a naive interpretation leads to the opposite conclusion. That's all I meant. And I'm not sure how you define what is a different argument and what isn't. It's all about the value of these scores.
I don't know enough about statistics to judge the strength of Mr Rubinstein's critique, but I find this a little odd: the axes on his graph are a 'raw score' between -1 and 1, but the axes on the MET study graph are 'standard deviations'. Is that significant?
More weekends than not, Colin shares here on HN an interesting link about education policy. I see from Colin's own top level comment that this one is shared for methodological discussion. I'm still mulling over what I think about this link after reading it, but I'm not sure I can agree that the Gates Foundation research group is engaged in a "lie" about anything. Education policy is the issue that drew me to participate on Hacker News,
and I encourage Hacker News participants to dig through some of the other publications to figure out what the best available current data show.
The other comments here suggest that the blog author of the submitted blog post is overstating his own case at least as much as he accuses the Gates Foundation supporters of overstating their case. If people who oppose the kind of teacher ratings proposed by the Gates Foundation really had the courage of their convictions, they might try to let learners have full power to shop for teachers, on the grounds that teacher quality measures miss many dimensions of teacher performance that are important to individual learners. In fact, very few states even have as easy and time-proven a policy reform as statewide open enrollment (which my state has had for almost a generation now)
and no state yet lets learners shop for schools on an equal per-capita funding basis whether the school is state-operated or not (as the Netherlands has for more than a century). Teachers will have plenty of deserved professional respect and regard from their clients if their clients have power to shop. But if policy proposals for learner power to shop are rejected, hand in hand with rejecting proposals for evaluating teachers for their effectiveness in publicly subsidized instruction, I wonder what the true agenda is. What assurance do we (we parents, we taxpayers, we members of the general public) have that all schools are staffed with the best available teachers unless someone is checking on how the teachers are doing their work?
I think you are mistaken. In today's attention economy even the very best scholarship and the very best evidence-based conclusions will be completely ignored unless there is a hook to get people, preferably lots of people, to read it.
Even here on the allegedly rational Hacker News I've seen item after item sink without trace, whereas others with significantly fewer details and significantly worse evidence get voted up and read by thousands.
Alex Bellos recently passed on to me what his editor told him: it doesn't matter how good your article is; if no one reads it, or no one gets to the point, it's no use and a waste of your time.
I'd like to think "Scholarship and evidence don't need hyperbolic headlines." Sadly, I think that scholarship and evidence say otherwise.
But being newsworthy does, and this man wants to be newsworthy as well as to convey a message (which may be right, may be wrong, but he is not Bill Gates).
And really, you would be amazed at the titles (and abstracts) of papers in journals of Pure Mathematics. Really. I've seen things you people wouldn't believe.
Isn't the point that before starting a measurement exercise one should have defined how one was going to interpret the results - and the whole exercise is or is not successful according to whether it meets its own success criteria?
The idea would be not to waste numerous resources and money implementing a broken idea and spend some time coming up with a better one. Why put the education of our children and the careers of good teachers on the line with such a poorly thought out and half-assed idea?
If you think that no time has been spent coming up with this idea, you are coming very late to the party. This debate is decades old, at a minimum. Dates from 1971, says Wikipedia.
It's not something that "Bill Gates" or anyone else made up yesterday. It's a state-of-the-art approach to a difficult problem. If you have a better way, let's hear it. Otherwise, be quiet.
You: "Yeah, okay, this method doesn't work well enough to accomplish anything and it's ridiculously expensive. HOWEVER, it's been used for decades, and I don't know any other methods that actually work. So let's continue pissing money down the drain."
You are right in saying that if we don't have better evaluation methods, it doesn't matter too much. The trouble is, should we just ignore the problems with current methods? It seems the equivalent of sticking your fingers in your ears and saying, "la la, I can't hear you."
This is the corporatist way. Everything is measured so it can be 'tweaked' by managers and technocrats because school is a business now.
Having a bad teacher to me doesn't matter when any kid can turn on his phone or laptop and watch free MIT lectures, free Stanford lectures, free Khanacademy lectures, free Android and iOS developer courses all over youtube, etc.
Gates's obsession with teacher metrics would be better spent on building an entirely free university with worldwide standardized credits, so somebody in Pakistan can have their education credentials accepted should they start working overseas, instead of being forced to drive a cab around Montreal while spending thousands to retake what they already learned.
> Having a bad teacher to me doesn't matter when any kid can turn on his phone or laptop and watch free MIT lectures
Alternative self-study education is small compensation for having your child literally waste a school year in the class of a teacher who has no clue how to keep control of the kids, or no desire to go over the material in class rather than assigning everything as homework. In my daughter's case, her fourth grade teacher was too busy using that vaunted Internet access to check her facebook page.
Also, your "doesn't matter" argument would hold a lot more water if I didn't have to pay significant taxes to support such a lame public education system.
> But the first chart in the blog post, with the big mass of blue on it, is surely not the strong evidence of weak correlation, that the author is making it out to be?

> There could still be a strong correlation in that data, even if it looks like a blob of blue - because there are so many data points on that chart, that we can no longer tell where the density of points is. Its possible that there is a dense line of points in there that still gives a strong correlation, that we just can't see.
A heatmap with some level of binning, or a hex bin chart, or 2d histogram, would surely be more appropriate in these circumstances? Something like this: http://matplotlib.org/examples/pylab_examples/hexbin_demo.ht...
Maybe along with some direct math analysis of the correlation. Am I missing something?
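In case it helps, here's a hedged sketch of that suggestion (matplotlib hexbin on simulated data; the real plot would use the two years of value-added scores from the dataset):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    n = 18_000
    year1 = rng.normal(0, 1, n)
    year2 = 0.35 * year1 + rng.normal(0, 1, n)   # stand-in for the real scores

    fig, ax = plt.subplots()
    hb = ax.hexbin(year1, year2, gridsize=40, cmap="Blues")  # density instead of overplotting
    fig.colorbar(hb, ax=ax, label="teachers per bin")
    ax.set_xlabel("value added, year 1")
    ax.set_ylabel("value added, year 2")

    print(f"r = {np.corrcoef(year1, year2)[0, 1]:.2f}")      # the direct correlation number
    plt.show()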