I'd say it would only be fraud if he misreported his methods in his papers. I think the Slate author just says that she assumed his methods to have been different than they were, after reading Gladwell's story about them.
Also, it seems to me that if you make a model that fits the data, that doesn't at all mean that it doesn't have predictive power. You could fit it with one data-set, and test it on another one. It just means that you now have a purely empirical model, with no built-in assumptions on why it is so.
Not testing the model on a validation set doesn't mean it doesn't have predictive power... but if you don't test it, there's no way you can say it does have predictive power. And saying it predicts the data 87% of the time when you're using the training set is in fact fraudulent.
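To make the point concrete, here is a toy sketch (scikit-learn on invented data, nothing to do with Gottman's actual variables or method): a flexible enough model can "predict" pure noise almost perfectly on the data it was trained on, while its accuracy on held-out data stays at chance.

    # Toy sketch: training-set "accuracy" vs. held-out accuracy on pure noise.
    # Everything here is made up -- random features, random labels, no real signal.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(130, 20))        # 130 hypothetical couples, 20 coded behaviors
    y = rng.integers(0, 2, size=130)      # "divorced or not", assigned at random

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)
    print("training accuracy:", model.score(X_train, y_train))   # ~1.0
    print("held-out accuracy:", model.score(X_test, y_test))     # ~0.5, i.e. chance

Quoting that first number as the model's predictive power would be exactly the mistake at issue.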
I can't believe that no one called him on this before, or that he got away with this! Anyone who has taken a college statistics course knows that you have to have separate sets for training and for validation. There are even a ton of statistical methods to deal with this problem with a small data set, e.g. bootstrapping your training set. There's no excuse.
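For the small-sample case, a rough sketch of what "bootstrapping your training set" can look like (plain numpy/scikit-learn, purely illustrative; the model choice is arbitrary): train on a resample drawn with replacement and score on the points the resample missed, then average over many rounds.

    # Rough sketch: out-of-bag bootstrap estimate of accuracy for a small data set.
    # X and y are assumed to be numpy arrays.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def oob_bootstrap_accuracy(X, y, n_rounds=200, seed=0):
        rng = np.random.default_rng(seed)
        n, scores = len(y), []
        for _ in range(n_rounds):
            boot = rng.integers(0, n, size=n)          # indices drawn with replacement
            oob = np.setdiff1d(np.arange(n), boot)     # points the resample never saw
            if oob.size == 0 or len(np.unique(y[boot])) < 2:
                continue                               # skip degenerate resamples
            model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
            scores.append(model.score(X[oob], y[oob])) # score only on unseen points
        return float(np.mean(scores))

Every score comes from data the model never saw, even though the whole set is small.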
In fact, I remember reading some Feynman lecture in which he specifically mentions how bad this is... anyone got a citation for me?
I remember reading some Feynman lecture in which he specifically mentions how bad this is... anyone got a citation for me?
Perhaps you are referring to this passage of "Cargo Cult Science":
"There is also a more subtle problem. When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition."
> In fact, I remember reading some Feynman lecture in which he specifically mentions how bad this is... anyone got a citation for me?
It's interesting that we still long for citations of some authority figure --- even when we all agree. (And I, too, would like to see what Feynman said.)
It's more of a longing to figure out what the heck I'm remembering. This happens to me all the time with people, too... someone will look really familiar but I can't place them at all, and it drives me crazy.
I'd say it would only be fraud if he misreported his methods in his papers.
Fair point, you are right. Many authors do, in fact, misreport in this fashion, but I have no evidence that John Gottman has done so.
I think the Slate author just says that she assumed his methods to have been different than they were, after reading Gladwell's story about them.
No, she doesn't just say that. This is not a difference of opinion; it's a matter of what is actually written in the article. "I think" does not apply.
Also, it seems to me that if you make a model that fits the data, that doesn't at all mean that it doesn't have predictive power.
That is not what I said.
You could fit it with one data-set, and test it on another one. It just means that you now have a purely empirical model, with no built-in assumptions on why it is so.
That is a valid methodology, but not the one the article says he used. What was described in the article was pure curve-fitting to the training data, which is not predictive. You can only claim predictiveness by using the generated curve on a test set and seeing how well it does, which is exactly what the article says he did not do. You are arguing in circles!
You should, in fact, expect to fit a curve to a training set with higher accuracy than its general predictive power! You can only test its predictive power on test data.
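A toy illustration of that gap (numpy, invented data): fit a wildly over-parameterized polynomial to a few noisy points from a linear process, and the training error looks great while the error on fresh points from the same process does not.

    # Toy sketch: error on the training points vs. error on fresh test points.
    # The true relationship is linear plus noise; we fit a degree-9 polynomial.
    import numpy as np

    rng = np.random.default_rng(1)

    def sample(n):
        x = rng.uniform(0, 1, n)
        return x, 2.0 * x + rng.normal(0, 0.3, n)

    x_train, y_train = sample(12)          # small training set
    x_test, y_test = sample(200)           # fresh data from the same process

    coeffs = np.polyfit(x_train, y_train, deg=9)   # 10 parameters for 12 points

    def rmse(x, y):
        return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

    print("train RMSE:", rmse(x_train, y_train))   # well below the noise level
    print("test RMSE: ", rmse(x_test, y_test))     # typically much larger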
Right, sorry, I should have been clearer that I was reacting to the article and not to you there. Or actually, I was reacting to what I thought the article said, because indeed that's not what it said! Thanks.
At least as this article describes it, it's not even predictive on its face.
He did not assess "fresh" couples and come back a few years later to see how they fared. Rather, he interviewed couples who were already either having trouble or not. Although I haven't studied it for proof, I have to assume that the interactions of a honeymoon couple will be different from those of a couple in trouble, even if the honeymoon couple is destined to have problems. Thus, he's not able to interview the honeymoon couple and determine what their prospects are.
If I show you the data a=1, b=2, d=4, e=5, f=4, and you predict that c=3, you are not committing fraud.
The gold standard would be testing on some more couples and finding the model he made is predictive.
However, I would be convinced by a strong correlation (i.e. the model is not overfitted), coupled with a plausible model. I'm not a statistician - does anyone know if there is a measure of how over-fitted a model is based on degrees of freedom and size of sample?
To properly use your analogy, if I first saw your data, then "predicted" that "a = 1.1, b = 2.1, d = 3.99, e = 5.01, f = 3.9999" and claimed that I had "predicted" your data with 90% (or whatever) accuracy, that is substantially misrepresenting what I have done.
However, I would be convinced by a strong correlation (i.e. the model is not overfitted), coupled with a plausible model.
That is a problem. That should not be enough to convince you. It is enough to make it an avenue worth exploring.
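(On the earlier question about a measure based on degrees of freedom and sample size: adjusted R-squared is the usual quick check, and criteria like AIC/BIC do a similar job; they discount the raw fit by the number of fitted parameters. A minimal sketch, not tied to anything in Gottman's papers:)

    # Adjusted R^2 discounts the plain R^2 by the number of predictors (p)
    # relative to the sample size (n).
    def adjusted_r2(r2, n, p):
        return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

    # The same raw fit is far less impressive from a small sample with many predictors:
    print(adjusted_r2(0.90, n=15, p=10))    # ~0.65
    print(adjusted_r2(0.90, n=200, p=2))    # ~0.90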
I don't think so. I think my analogy was that given the data, if I can create a model that accurately predicts the data given, and is plausibly simple, i.e. for X=Y, Y=ordinal(X), then I can have some faith in the model.
To put it another way, if Gottman said:
Add up these 5000 variables multiplied by these factors to get the score
then I wouldn't believe it. But I think he said something much closer to (on p. 13 of the 98 paper):
We tried a couple of combinations of variables that we had hypotheses for. One combination gives a very accurate predictor (Table 1, husband high negativity). Two others are also accurate (Table 1, low negativity by husband or wife).
I think where the paper went wrong, and what the Slate article is criticising is the claim made in the second column of p. 16, where they do seem to put all the variables into a model and pull the answer from nowhere. So, I'll give you that the 82% prediction is not justifiable. But negativity as a predictor for divorce seems well supported.
If I drop 30 monkeys out of trees of different heights and I measure 100 things about each monkey, each second, as it falls, and then I give you a model that purports to tell you a monkey's speed from those 100 measurements, then you won't be surprised to receive an over-fitted piece of junk.
If instead I give you speed = constant * t, would I need to drop another 30 monkeys?
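Probably not much: a toy sketch (numpy, fake monkeys, invented numbers) of why a one-parameter model leaves almost nothing to over-fit:

    # Toy sketch: fit speed = c * t to a few noisy simulated drops.
    # One fitted parameter, so the model cannot memorize the data.
    import numpy as np

    rng = np.random.default_rng(2)
    t = rng.uniform(0.5, 3.0, size=30)              # 30 fall times, in seconds
    v = 9.81 * t + rng.normal(0, 0.5, size=30)      # noisy measured speeds

    c = np.sum(t * v) / np.sum(t * t)               # least-squares slope through the origin
    print("fitted c:", c)                           # close to 9.81

    t_new = 4.0                                     # a drop longer than any we measured
    print("predicted:", c * t_new, "vs. true:", 9.81 * t_new)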
This is one of those things where the size of the result set lets you easily overfit the data. Take 50 people born in 1970 and you could make a fairly simple model based on their birthday to predict whether they are living or dead, because you only need 50 bits of information and 10 of them can be wrong. There is even a strong bias working in your favor: most people born in that year are still alive.
Edit: As a simple rule of thumb, compare the complexity of the formula with the number of bits of accuracy your output gains over the simplest possible model. If the formula is anywhere near as complex as your results, it's probably overfitting the data.
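One way to put rough numbers on that rule of thumb (a quick sketch using the 50-people/10-errors figures above): specifying which outcomes the model may get wrong eats up most of the 50 bits, so the calls themselves carry very little information.

    # Quick sketch: how much information do "50 alive/dead calls, up to 10 wrong" carry?
    # It is 50 bits minus the log of the number of outcome patterns consistent with the calls.
    from math import comb, log2

    n, max_errors = 50, 10
    consistent_patterns = sum(comb(n, k) for k in range(max_errors + 1))
    print("bits actually conveyed:", n - log2(consistent_patterns))   # roughly 16 bits

    # So a formula whose own description takes more than ~16 bits of tuned detail
    # hasn't compressed anything -- it has just memorized the answers.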
It is a bit disappointing. At my university, we were taught about this fallacy in the mandatory introduction-to-science course. I remember a beautiful analogy which went something like:
As long as both your sample data and your test data are collected between January and May, you can make a model that accurately predicts the time the sun sets based on the length of your hair.
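The failure mode is easy to reproduce with made-up numbers (a toy numpy sketch of that analogy; every value is invented): a model fit on January-May data looks accurate within that window and then falls apart in autumn, when hair keeps growing but sunsets come earlier.

    # Toy sketch of the hair-length / sunset analogy, with invented numbers.
    # From January to May both hair length and sunset time increase, so they correlate.
    import numpy as np

    rng = np.random.default_rng(3)

    def fake_data(days):
        hair = 10 + 0.04 * days + rng.normal(0, 0.2, days.size)         # cm, grows all year
        sunset = 17 * 60 + 120 * np.sin(2 * np.pi * (days - 80) / 365)  # minutes after midnight
        return hair, sunset + rng.normal(0, 5, days.size)

    spring = np.arange(1, 150)            # days the "study" collected data
    autumn = np.arange(240, 330)          # a season the model never saw
    h_sp, s_sp = fake_data(spring)
    h_au, s_au = fake_data(autumn)

    slope, intercept = np.polyfit(h_sp, s_sp, 1)       # predict sunset from hair length

    def rmse(h, s):
        return np.sqrt(np.mean((slope * h + intercept - s) ** 2))

    print("spring error (min):", rmse(h_sp, s_sp))     # small -- looks impressively predictive
    print("autumn error (min):", rmse(h_au, s_au))     # huge -- the correlation was a coincidence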
The evidence would abundantly suggest that merely being exposed to this idea in a course is not sufficient to guarantee that the resulting scientist will have any clue.
I see two problems here: First, by the time you get to that level of schooling, you've developed a "resistance" (for lack of a better term) to schooling. That is, you've long since internalized the fact that there are two worlds, the one described in school and the real world, and the overlap between the two is tenuous at best. Merely being told about this stuff in the school world isn't enough; you somehow need it to penetrate down to the "real" world. Computer scientists get a bit of an advantage here in that after being told about this, their homework assignment will generally be to implement the situations described and verify the results. Computer courses are rather good at bridging this gap; most disciplines lack the equivalent, and a shocking amount of basic stuff is safely locked away in the "schooling" portion of the world for most people.
This is a result of the second part of the problem, which is that the teachers themselves help create this environment. I went through the computer science grad school approach to this stuff, where not only do they show you the fundamental math, they make you actually implement it (like I said above). I got to watch my wife go through the biology stats courses, and they don't do that. At the time I didn't notice how bad it was, but the stats courses they do are terrible; they teach a variety of "leaf node" techniques (that is, the final results of the other math), but they never teach the actual prerequisites for using the techniques, and in the end it is nothing but voodoo. Students who consign this stuff to "in the school world but not real" are, arguably, being more rational than those who don't; things like confidence intervals and p-values rest on distributional assumptions (usually Gaussian), and I know they say that at some point, but they don't ever cover anything else from what I can see. By never connecting it to the real world (very well), and indeed treating it in the courses as a bit of voodoo ("wave this statistical chicken over the data and it will squawk the answer, which you will then write in your paper"), they contribute to the division between school and real.
So, I'd say two things: those who say this stuff is taught in "any college stats course" are probably literally wrong, in that it is possible to take a lot of stats without really covering this; and secondly, even if it is covered, it ends up categorized away unless great efforts are taken to "make it real" for the student, which, short of forcing implementation on the students (impractical in general), I'm not sure how to do. Most scientists seem to come out of PhD programs with a very thin understanding of this stuff.