
The problem is that averaging can make a very small correlation look like a very strong correlation. You might be interested in this: http://en.wikipedia.org/wiki/Ecological_fallacy
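Here's a rough simulation of the effect (mine, not from the Wikipedia article): individuals are generated with a true correlation of only 0.1, then binned into groups, and the group means correlate almost perfectly.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Individual level: x barely predicts y (true r = 0.1).
    x = rng.normal(size=n)
    y = 0.1 * x + np.sqrt(1 - 0.1**2) * rng.normal(size=n)
    print(f"individual r = {np.corrcoef(x, y)[0, 1]:.2f}")    # ~0.10

    # Bin individuals into 20 groups by x, then correlate group means.
    # The noise averages out within each bin, leaving only the trend.
    bins = np.array_split(np.argsort(x), 20)
    xm = np.array([x[b].mean() for b in bins])
    ym = np.array([y[b].mean() for b in bins])
    print(f"group-mean r = {np.corrcoef(xm, ym)[0, 1]:.2f}")  # ~0.99

This is why a chart of binned averages can look far more convincing than the underlying scatterplot.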

Nobody is arguing that a small correlation doesn't exist. They are arguing that a small correlation isn't very meaningful.

Or are you saying that, over 12 years of schooling, you wouldn't have any preference between having your child taught by a teacher with a predicted score in the bottom 5th percentile versus one in the top 5th percentile?

If you base those percentiles on a metric that has a very low correlation with performance, then there won't be much of a difference in outcome.
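To put rough numbers on that, here's a sketch with an assumed r = 0.1 between the score and true performance:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    r = 0.1  # assumed correlation between score and true performance

    true_perf = rng.normal(size=n)  # what we actually care about
    score = r * true_perf + np.sqrt(1 - r**2) * rng.normal(size=n)

    lo = true_perf[score <= np.quantile(score, 0.05)]  # bottom 5% by score
    hi = true_perf[score >= np.quantile(score, 0.95)]  # top 5% by score
    print(f"bottom 5%: {lo.mean():+.2f} SD, top 5%: {hi.mean():+.2f} SD")

At r = 0.1, the gap in true performance between the extreme groups is only about 0.4 standard deviations; at r = 0.9 it would be about 3.7.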

Or are you saying that you want to pick who to fire and who to promote based on a metric that we objectively know has only a very weak correlation with reality?




The point is, you shouldn't worry too much about teacher performance.

If the metric is only loosely correlated with performance, then measuring value-added and punishing / rewarding teachers based on it won't be very accurate. It's like lines of code: programmer talent is probably weakly correlated with lines of code written, but you don't want to reward or punish programmers based on LoC because it will lead to pathological behavior.

You should look at altering other things (which are harder to game once you start controlling for them), like class size, course materials, assessment style, how the teacher actually teaches, etc.


Or are you saying that you want to pick who to fire and who to promote based on a metric that we objectively know has only a very weak correlation with reality?

What alternatives do we have? Do you know of a more accurate metric?


One alternative is to not use these flawed/complicated/expensive metrics. Imagine if I came up with a metric that had a 0.1 correlation coefficient with your performance at your job. If you're a programmer, maybe that'd be something like measuring the lines of code written per day. Would you put much faith in that? Of course not; there would be myriad problems with such a system. Same thing here.
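For scale: a correlation of 0.1 means the metric explains only r^2 = 0.1^2 = 1% of the variance in performance; the other 99% is noise as far as that metric is concerned.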

The status quo in teaching is to evaluate teachers based on qualitative observations from management, same as it is in most other industries. As the same blogger points out in another post (http://garyrubinstein.teachforus.org/2013/01/13/50-million-3...): "Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations. As this is what many districts currently do and since this report is supposed to guide those who are designing new systems, wouldn’t it be scientifically necessary to include the existing system as the ‘control’ group? As implementing a change is a costly and difficult process, shouldn’t we know what we could expect to gain over the already existing system?"

The status quo also has the advantage of not being ridiculously expensive and complicated. Ask Bill Gates to release the data if you're curious about its performance, but given what I've seen, I'd be shocked if it was terribly different than these complicated new metrics.


Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations.

It's not all that conspicuous an absence, as they don't use a 100% weighting for anything. That said, they do attempt to maximize the ability to predict state test scores, and in that optimization classroom observation plays the smallest role of the three metrics (2-9%).

Given that they appear to have done the analysis across all weightings, I seriously doubt they'd hide this data.

I do find it odd that teachers would rather be judged by one or two people observing their classroom, ignoring actual student output, than by raw numbers.

As a developer I'd much rather be evaluated on some metric I could optimize for (say, something measuring feature value, bug counts, fix rate, etc.), rather than just my manager watching me code/debug a couple of times per month.


I do find it odd that teachers would rather be judged by one or two people observing their classroom, ignoring actual student output, than by raw numbers.

As a developer I'd much rather be evaluated on some metric I could optimize for (say, something measuring feature value, bug counts, fix rate, etc.), rather than just my manager watching me code/debug a couple of times per month.

Okay, I really don't understand this mindset. You'd really prefer it if, say, 1/3 of your salary were determined by the number of lines of code you wrote? Or the number of bugs you closed? Or some equally inane metric that only loosely correlates with your actual performance and can be easily gamed?

What am I missing?


You'd really prefer it if, say, 1/3 of your salary were determined by the number of lines of code you wrote?

I was a quant trader for a while. A considerable chunk of my salary was determined by profit, which is hardly unreasonable.

Measuring developers in other areas is difficult because of heterogeneous goals - last year I built a search engine, this year I'm statistically tracking user behavior. Hard to compare one to the other.

Education does not suffer this problem - last year a teacher taught 30 kids to read. This year she taught 28 kids to read. The goal is always maximizing the fraction of kids who can read.
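The core comparison is simple enough to sketch. A toy version (all numbers invented; real value-added models are regression-based and adjust for many covariates):

    # Toy actual-vs-predicted comparison (numbers invented for
    # illustration; real VAMs are far more elaborate).
    def value_added(n_learned: int, n_kids: int, predicted_rate: float) -> float:
        """Actual pass rate minus the rate predicted for this class."""
        return n_learned / n_kids - predicted_rate

    print(f"{value_added(28, 30, 0.80):+.2f}")  # 28/30 vs. 80% predicted
    print(f"{value_added(30, 30, 0.95):+.2f}")  # 30/30 vs. 95% predicted

A teacher who gets 28 of 30 kids reading when 80% were predicted to (+0.13) did more than one who gets 30 of 30 when 95% were predicted to (+0.05).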


I don't think it's obvious that these teaching metrics correlate better with teaching ability than LOC would correlate with programming ability. I'd expect weak correlations in both cases. I'm just speculating at this point, though.


One alternative is to not use these flawed/complicated/expensive metrics.

So replace a metric with 0.1 correlation with one having 0.0 correlation?

If you are advocating that we should use the school principal's opinion rather than VAM, why do you believe that opinion is superior? Do you have evidence that the principal's opinion has a higher correlation with student outcomes than VAM?


So replace a metric with 0.1 correlation with one having 0.0 correlation?

Two problems with this:

1. The other metric almost certainly doesn't have a 0.0 correlation, but we don't know that for sure since the data wasn't released, for whatever reason.

2. As for the metric with the 0.1 correlation (or whatever the number is), keep in mind the context: what is it correlated with? Test scores, something that can be and often is rigged, and that is at best tangentially related to the ultimate goal, which is inherently qualitative in nature. I think the comparison to programming is instructive. If 1/3 of your salary were determined by the LOC you wrote or the number of bugs you closed, you would game the system to maximize your salary, even if you weren't providing value by doing so and even if those naive metrics did weakly correlate with performance in some big study conducted before the monetary incentive existed. What would be gained in that scenario?

If you are advocating that we should use the school principal's opinion rather than VAM, why do you believe that opinion is superior? Do you have evidence that the principal's opinion has a higher correlation with student outcomes than VAM?

One of the points made in the blog I linked to is that it's unfortunate that they didn't release that data. And as I said above, it's not straightforward to assume that these weak correlations are actually meaningful in practice.


The 0.3 correlation described in the article is the correlation between a teacher's VAM score in a single class last year and this year.

The author is arguing that because the measurement is noisy, we should ignore it. This is silly - it just means a single class-year's point estimate is noisy, and multiple class-years must be combined to form an accurate estimate.
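To quantify that: if we treat the 0.3 year-to-year correlation as the reliability of a single class-year, the Spearman-Brown formula gives the reliability of an average over k class-years (this assumes a stable teacher effect and independent noise across years, which is itself part of what's contested):

    def avg_reliability(r: float, k: int) -> float:
        """Spearman-Brown: reliability of the mean of k equally noisy
        measurements whose pairwise correlation is r."""
        return k * r / (1 + (k - 1) * r)

    for k in (1, 3, 5, 10):
        print(k, round(avg_reliability(0.3, k), 2))
    # 1 -> 0.3, 3 -> 0.56, 5 -> 0.68, 10 -> 0.81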

If 1/3 of your salary were determined by the LOC you wrote or the number of bugs you closed, you would game the system to maximize your salary...

Indeed: if 1/3 of my salary were determined by the number of kids in my class who read better than they were predicted to at their age, I'd definitely try to make sure that reading skills improved.

Conversely, if a stupid metric such as student/principal opinion were used, I'd focus on jokes and friendliness over education.

(Well, actually I didn't back when I taught. But that sure didn't help my student evaluations...)


The author is arguing that because the measurement is noisy, we should ignore it.

Not quite.

Indeed: if 1/3 of my salary were determined by the number of kids in my class who read better than they were predicted to at their age, I'd definitely try to make sure that reading skills improved.

Keep in mind that "score on a reading test" and "reading skills" are not the same thing. For instance, the former can be significantly improved by spending valuable class time teaching students tricks for succeeding on the standardized reading test. It's similar to the incentives you'd get by determining salary by LOC: even if LOC correlates with performance, basing pay on it encourages all kinds of nonsense that doesn't actually benefit anyone.

So in that context, we have a metric that very loosely correlates with this clearly flawed marker of success. And we're supposed to spend ungodly sums of money implementing this strategy? Come on.

Conversely, if a stupid metric such as student/principal opinion were used, I'd focus on jokes and friendliness over education.

Well, they do propose to use student and principal evaluations.


Keep in mind that "score on a reading test" and "reading skills" are not the same thing.

True. A score on a reading test might understate actual reading skills due to a lack of test-taking skills.

Then again, I can't think of any measurable quantity more highly correlated with reading skills than scoring high on a reading test. Do you have evidence that "principal's opinion of teacher" is better?

...we have a metric that very loosely correlates...

You are confused. GR shows that the year-on-year correlation is weak (0.3), not that the correlation between reading test scores and reading skills is weak.

You are correct that a certain nonzero amount of teaching test skills will be effective at improving scores. I see no reason to believe this amount will be large - do you?

Again, standardized tests are not the only gameable metric. Principal/student opinion is as well. Why do you believe they are less gameable and more accurate than directly measuring student ability?


It's not that I think this is the only flawed metric, or that I know of other methods that are quantifiably better. It's that we've spent ridiculous sums of time and money on this complicated new metric, despite some pretty clear arguments against it, and we are still very far from actually proving that it is better than the status quo, even if that status quo obviously falls somewhat short of perfection.

Put it this way: how much money and effort is it worth spending on something with such uncertain prospects? There are tons of great grant proposals to funding agencies that are rejected. Maybe we would be better off funding some of those. Opportunity costs exist, and it seems like undue sums of money are being wasted on this stuff.



