
>"They're saying that the pipeline that runs between the raw DICOM output of the scanner and the t-test, which is an equally important part of the testing procedure, is flawed."

Yes, in other words the null hypothesis is being rendered false by some aspect of the analysis pipeline. I am saying that is the correct description of the problem.

Describing the problem as excess false positives is confused, because these are true positives.

They appear to have successfully identified a problem, but then described and analyzed it incorrectly.

>"The FWER there is indistinguishable from 5%, as it should be if the null were true."

If the null is true, the p-values should be samples from a uniform distribution. Another point: they should have shown histograms of these. The 5% below 0.05 is not the whole story; the deviation could manifest in other ways, and/or they may simply have chosen sample sizes powered to get that result.
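
For instance, a quick toy simulation in R (mine, nothing to do with the fMRI pipeline) shows what that looks like when the null model actually holds: the p-value histogram is flat and about 5% of the tests fall below 0.05.

  # Under a true null (two samples from the same normal distribution),
  # the p-values are uniform on [0, 1].
  set.seed(1)
  Nsim=1000; n=100
  p = replicate(Nsim, t.test(rnorm(n,0,1), rnorm(n,0,1), var.equal=T)$p.value)
  mean(p < 0.05)   # close to 0.05
  hist(p, col="Grey", breaks=seq(0,1,by=.05), freq=F, main="p-values under a true null")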

"I'd love to read a few sentences describing how that follows."

We have already read the exact sentences that mean the null model is false, multiple times. You have agreed that something was wrong, but want to call it something other than the null model.




> Describing the problem as excess false positives is confused, because these are true positives.

This is deeply wrong. A "true positive" in this context would mean that the resting brain activity of multiple subjects is actually correlated with arbitrary length, randomized, moving test windows. Again, the data is untreated; they are imposing arbitrary test windows and looking for increases in brain activity (increased image intensity) correlated with the stimulus-ON sections of the pseudo-treatment.

From reading some of your prior comments on the subject, it seems prudent to point out that the causal link between stimulus and BOLD effect is quite literally observable: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130346/


>"This is deeply wrong. A "true positive" in this context would mean that the resting brain activity of multiple subjects is actually correlated with arbitrary length, randomized, moving test windows."

This is the same misconception I have been trying to dispel. The null hypothesis is not the inverse of that "layman's" statement, it is a specific set of predicted results calculated in a very specific way. It is a mathematical statement, not an English prose statement.

In this case, apparently one part of this calculation (the assumption about autocorrelation; whatever it is doesn't matter to my point) has led to such a large deviation from the observations that the null model has been rejected. The null model has been rejected correctly. This is not a false positive.

The problem here is not false-positives due to faulty stats. It is the poor mapping between the hypothesis the researchers want to test, and the null hypothesis they are actually testing.

The tools provided by statisticians look like they worked just fine in this case. If the researchers have decided to use them for GIGO, that is not a statistical problem.


> This is the same misconception I have been trying to dispel. The null hypothesis is not the inverse of that "layman's" statement, it is a specific set of predicted results calculated in a very specific way.

The entire point of this discussion is that the software is not calculating the null hypothesis that users expect. The fact that the model it does calculate is internally consistent is tautological and irrelevant (though actual bugs were found in at least one package).

As you yourself said: "I didn't read the code, or even the paper very closely." (https://news.ycombinator.com/item?id=12037207) Perhaps you should do?


There really is no reason for me to read the paper closely. They say that the null hypothesis is wrong, and they know exactly why. Then, like multiple people responding to me here, they also want to say somehow the null hypothesis is true.

Everyone who has questioned me also admits there is something wrong with the hypothesis they tested. You do it too: "the software is not calculating the null hypothesis that users expect", but then you also want to say the null hypothesis they tested is true! Just bizarre; what underlying confusion is making people repeat something clearly incorrect? The null hypothesis cannot be both true and false at the same time.

There is a big difference between a "positive" result that is a "false positive" and a "positive" result due to a false null hypothesis. This is a clear cut case of the second.


> You have agreed that something was wrong, but want to call it something other than the null model.

That's not really what I've said.

In the previous NeuroImage paper, they generated null data by "imposing" block and event-related designs over resting state data. When they did a single-subject analysis of that data with SPM, they found that the block designs had excess "positive"[1] results. However, analyzing the data with an event-related design gave the expected proportion of positive results.

Based on this, the event-related null model looks fine. The block designs may also be fine when enough data from multiple subjects are analyzed together. This makes sense because the problem was related to low-frequency oscillations that aren't synchronized across subjects.

However, you don't even have to assume this. The right-most panel of Figure 1A (and S5, S6, and S11) of the PNAS paper repeats this analysis, and the voxel-by-voxel tests are appropriately sized: p<0.05 yields a 5% FWER.
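
For intuition about the single-subject block-design problem, here is a toy R simulation (mine, and obviously nothing like the real SPM pipeline; the AR(1) noise and design lengths are invented). A slow block design regressed against autocorrelated "resting" noise misbehaves when the test assumes white noise, while a rapidly alternating design does not:

  # Regress AR(1) noise (no true effect) on a block design and on a
  # rapidly alternating design, using plain OLS, which assumes white noise.
  set.seed(42)
  nscan = 200
  block = rep(rep(c(0, 1), each=20), length.out=nscan)   # 20-scan on/off blocks
  event = rep(c(0, 1), length.out=nscan)                  # alternates every scan
  pvals = replicate(2000, {
    y = as.numeric(arima.sim(list(ar=0.9), n=nscan))      # autocorrelated noise, no signal
    c(summary(lm(y ~ block))$coefficients[2, 4],
      summary(lm(y ~ event))$coefficients[2, 4])
  })
  rowMeans(pvals < 0.05)   # block design: well above 0.05; alternating design: not inflated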

> If the null is true, the p-values should be samples from a uniform distribution. Another point: they should have shown histograms of these. The 5% below 0.05 is not the whole story; the deviation could manifest in other ways, and/or they may simply have chosen sample sizes powered to get that result.

I agree that a p-value histogram would be nice, but I don't think it's essential. However, there's no way the sample sizes were cherry-picked, as you suggested: they get essentially the same result with 20, 40, and 499 subjects.

------------

[1] As in the paper, I think of these as false positives. We know that there's no way the designs actually influenced neural activity. They are totally arbitrary and were assigned after the data was collected.

You seem to want to call these true positives instead. There are two reasons you might want to do this, but they both strike me as a little off. I suppose these are "true positives" from the perspective of the final t-test, but only because an earlier part of the analysis failed to remove something it should have. It seems weird to draw a line in the middle of the analysis like that.

Alternately, you might call them true positives because you're not convinced that the slice-and-dice procedure generates two sets of indistinguishable data. If so, you should say that and say why you think that. The one sentence you quoted does not count, for the reasons I outlined above.


> "I agree that a p-value histogram would be nice, but I don't think it's essential."

Sure it is. Here is an example using R where only about 2% of the tests report p<0.05:

  # Simulate Nsim two-sample t-tests where the null model (two identical
  # normal distributions) is false: a is normal but b is Cauchy.
  set.seed(1234)
  Nsim=1000;   n=100
  p=matrix(nrow=Nsim,ncol=1)
  m=matrix(nrow=Nsim,ncol=2)
  for(i in 1:Nsim){
    a=rnorm(n,0,1); b=rcauchy(n,0,1)
    p[i,] = t.test(a,b, var.equal=T)$p.value   # equal-variance t-test p-value
    m[i,] = cbind(mean(a),mean(b))             # sample means (stored, not used below)
    sig = sum(ifelse(p<0.05,1,0), na.rm=T)/i   # running proportion of p < 0.05 so far
    hist(p, col="Grey", breaks=seq(0,1,by=.01), freq=F, main=round(sig,4))  # redrawn each iteration
  }
You can see from the histogram that there is a very clear deviation from the null hypothesis, and the percentage of p-values under 0.05 doesn't come close to telling the whole story of what is going on: https://s31.postimg.org/n9w3ydr63/Capture.jpg

> "I suppose these are "true positives" from the perspective of the final t-test, but only because an earlier part of the analysis failed to remove something it should have. It seems like weird to draw a line in the middle of the analysis like that."

Yes, precisely. That is the only perspective that matters because the p-value is being used as the actionable statistic in the end. This p-value has no necessary connection to what you think the null hypothesis was; it has to do with the actual values and calculations that were used.

Any other perspective is the perspective of someone who is confused about what hypothesis they are testing. This is, once again, not any fault of the statistics (maybe stats teachers... but that is another issue). Choosing an appropriate null hypothesis (that actually reflects what you believe is going on) is an investigation-specific logical problem outside the realm of statistics.


> That is the only perspective that matters because the p-value is being used as the actionable statistic in the end.

You're obviously entitled to your own perspective, but extracting the t-test from the rest of the analysis like that is...idiosyncratic...at best.

Forget about all the MRI stuff for a second and imagine we were working on a grocery store self-checkout system. As you scan your purchases, it weighs each item and tests the weight against some distribution of weights for that product. If the weight is way out in the tails of the distribution, a human checks your cart; this keeps you from scanning some carrots while stuffing your cart with prime rib.

The checkout computer will occasionally flag a legitimate purchase. Perhaps the customer found an incredibly dense head of lettuce[1]. I would call this a false positive: the z-test (or whatever) is incorrectly reporting that the sample is drawn from a different distribution.

Now, suppose that the scale incorrectly adds some weight to each item. Maybe the sensor is broken or a bunch of gunk has accumulated on the tray. As a result, the checkout system now flags more legit purchases for human review. Are you actually refusing to call these "an increase in false positives", since the z-test is working correctly but has been fed a database of accurate weights instead of "item weight + gunk weight"? How about if the item database is wrong instead (e.g., chips now come in a slightly smaller bag)?
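
To make the scale example concrete, here is a tiny R sketch (all numbers invented): the z-test itself works exactly as designed, but gunk on the tray inflates the flag rate well past the nominal 5%.

  # The database says this item weighs 500 g (sd 10 g); gunk adds 15 g to
  # every measurement, so legitimate items get flagged far more than 5% of the time.
  set.seed(7)
  weights = rnorm(10000, mean=500, sd=10) + 15   # true item weights plus gunk
  z = (weights - 500) / 10                        # z-test against the (gunk-free) database
  mean(abs(z) > qnorm(0.975))                     # flag rate well above 0.05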

Or, here's a purely statistical version. Suppose you fit two linear models to some data--one full model and one reduced one--then compare them via an F-test. However, the regression code has a bug that somehow deflates the SSE for the full model. I would say this procedure has an inflated false positive rate. You seem to be saying that this does not count as a false positive: the F-test is doing the right thing given its garbage input.
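
A sketch of that version too (a hypothetical bug, simulated in R): the null is true and the F-test's arithmetic is fine, but feeding it a deflated full-model SSE inflates the rejection rate.

  # The full model adds a predictor that is pure noise, so the rejection
  # rate should be 5%; the "bug" shrinks the full model's SSE by 10%.
  set.seed(99)
  n = 50
  pvals = replicate(5000, {
    y = rnorm(n); x = rnorm(n)               # x is unrelated to y
    sse_f = sum(resid(lm(y ~ x))^2) * 0.9    # buggy (deflated) full-model SSE
    sse_r = sum(resid(lm(y ~ 1))^2)          # reduced (intercept-only) SSE
    f = ((sse_r - sse_f) / 1) / (sse_f / (n - 2))
    pf(f, 1, n - 2, lower.tail=FALSE)
  })
  mean(pvals < 0.05)                         # well above the nominal 0.05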

> Any other perspective is the perspective of someone who is confused about what hypothesis they are testing.

Again, this almost makes sense from the perspective of the t-test, but that is a bizarre perspective to take. An MRI researcher wants to know if the BOLD signal in a cluster, once corrected for various nuisance factors[2], varies across conditions or not. That ("or not") is a perfectly reasonable null hypothesis.

If the corrections are imperfect, it doesn't mean that fuzzy thinking led them to choose the wrong null hypothesis. At worst, it means that they were too trusting with regard to the correction (but let's not be too hard on people--this is a hard problem).

-----------

[1] Assume it's sold by the head and not by weight.

[2] And sure, you can write out what the motion correction, field inhomogeneity correction, etc. mean in excruciating detail if you want to actually calculate them.


>"extracting the t-test from the rest of analysis like that is...idiosyncratic...at best."

So much nonsense about statistics has been institutionalized at this point that this sounds like a compliment. To be sure, the issue here is minor compared to the misconceptions that are dominant, like "the p-value is the probability my theory is wrong."

>"Are you actually refusing to call these "an increase in false positives", since the z-test is working correctly, but it's been fed a database of accurate weights instead of "item weights + gunk weight")? "

Yes, the hypothesis should include extra uncertainty due to gunk if that is an issue. These are true positive rejections. It is a bad hypothesis.

>"How about if the item database is wrong instead (e.g., chips now come in a slightly smaller bag)?"

Once again, the hypothesis was wrong. We are right to reject it.

In both these cases calling it a "false positive" is misleading because it focuses the attention on something other than the source of the problem: a bad hypothesis. These are true positives.

>"A MRI researcher wants to know if the BOLD signal in a cluster, once corrected for various nuisance factors, varies across conditions or not. That ("or not") is a perfectly reasonable null hypothesis."

Then they should deduce a distribution of expected results directly from that hypothesis and compare the observations to that prediction (a toy sketch of what I mean is below). What seems to be going on is that they say that is the hypothesis but then go on to test a different one. Honestly, just the fact that they are using a t-test as part of a complicated process like this makes me suspect they don't know what they are doing... Maybe it makes sense somehow, but I doubt it. Disclaimer: I haven't looked into the code or anything in detail.
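
Here is the toy sketch, in R, with a made-up AR(1) noise model standing in for whatever the researcher actually believes about the data and an invented "observed" series; it is only meant to illustrate the procedure, not any real pipeline:

  # Simulate the test statistic under the model you actually believe
  # (here AR(1) noise with no effect) and compare the observed statistic
  # to that deduced distribution, instead of to a textbook t reference.
  set.seed(123)
  nscan = 200
  design = rep(rep(c(0, 1), each=20), length.out=nscan)
  stat = function(y) coef(summary(lm(y ~ design)))[2, 3]             # t statistic for the design
  y_obs = as.numeric(arima.sim(list(ar=0.9), n=nscan)) + 0.3*design  # stand-in for observed data
  null_stats = replicate(2000, stat(as.numeric(arima.sim(list(ar=0.9), n=nscan))))
  mean(abs(null_stats) >= abs(stat(y_obs)))                          # Monte Carlo p-value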



