This was a very controversial paper when it was published, perhaps because of its incendiary title. But the paper is much more subtle than the title suggests. Basically, the idea is that if you test phenomena that are completely unexpected, your prior odds are low, so even if you get a positive result, there is a good chance the result is incorrect. So there is a danger that, by trying to ensure more correct results, scientists may avoid paradigm-changing experiments, which would be a loss.
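A rough back-of-the-envelope version of that argument, as a sketch with made-up but plausible numbers (the prior, alpha, and power values below are purely illustrative, not taken from the paper):

    # Positive predictive value of a "significant" finding under a low prior.
    prior = 0.05    # assumed share of tested hypotheses that are actually true
    alpha = 0.05    # significance threshold
    power = 0.80    # assumed chance of detecting a true effect

    true_pos = prior * power            # 0.04
    false_pos = (1 - prior) * alpha     # 0.0475
    ppv = true_pos / (true_pos + false_pos)
    print(f"P(effect is real | significant result) = {ppv:.2f}")   # ~0.46

Even with a well-powered study, fewer than half of the positive results are real under these assumptions, simply because so few of the tested hypotheses were true to begin with.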
A good way to address these issues is to frame all experiments as multilevel models. See [1] for a long discussion from Andrew Gelman et al on why this is advisable.
Surprisingly, many people working in statistics still ignore the James-Stein theorem, which provides a theoretical justification for multilevel models. In layman's terms, the theorem shows that if you are simultaneously estimating many random variables, you should borrow information across variables [2]. Estimating them one by one is suboptimal and does not minimize the global mean squared error.
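A minimal simulation of that result, shrinking toward zero with the positive-part James-Stein estimator (my own sketch, assuming numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    k, n_sims = 50, 2000
    theta = rng.normal(0, 1, size=k)              # true means, fixed across simulations

    mse_mle = mse_js = 0.0
    for _ in range(n_sims):
        x = rng.normal(theta, 1.0)                # one noisy measurement per mean
        shrink = max(0.0, 1 - (k - 2) / np.sum(x ** 2))
        js = shrink * x                           # shrink every estimate toward zero
        mse_mle += np.sum((x - theta) ** 2)       # estimate each mean separately (MLE)
        mse_js += np.sum((js - theta) ** 2)       # borrow strength across all k means

    print(mse_mle / n_sims, mse_js / n_sims)      # total squared error: James-Stein wins

Each raw estimate is individually unbiased, yet the shrunken estimates have lower total squared error whenever three or more means are estimated jointly.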
Multilevel models "shrink" individual effect sizes by looking at the overall distribution of effect sizes and provide much more realistic estimates.
Multilevel models relax the assumption of independent observations by specifying that the measures of repeated experimental units are dependent on each other. It's a way of telling your model that it has less information than it would have if all observations came from independent units. Therefore, standard errors of effects are usually larger. Otherwise, they are biased [1].
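Here is a minimal sketch of that point, assuming statsmodels and simulated repeated measures with a unit-level predictor (all variable names are made up for illustration):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_units, n_reps = 30, 10
    unit = np.repeat(np.arange(n_units), n_reps)
    x = rng.normal(size=n_units)[unit]            # unit-level predictor, repeated within unit
    unit_effect = rng.normal(0, 1, size=n_units)[unit]
    y = 0.3 * x + unit_effect + rng.normal(0, 1, size=n_units * n_reps)
    df = pd.DataFrame({"y": y, "x": x, "unit": unit})

    ols = smf.ols("y ~ x", df).fit()                         # treats all 300 rows as independent
    mlm = smf.mixedlm("y ~ x", df, groups=df["unit"]).fit()  # random intercept per unit
    print(ols.bse["x"], mlm.bse["x"])                        # the multilevel standard error is larger

The plain OLS fit pretends there are 300 independent pieces of information about x, while the random-intercept model recognizes there are really only 30 independent units behind it, so its standard error is larger and more honest.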
Since most researchers are not aware of multilevel models, they design their experiments and aggregate their data to fit the independence assumption, which is rarely a good idea. Many are not even aware of modeling beyond hypothesis tests, and are unable or unwilling to adjust their analysis for confounding factors or non-sampling errors that arise from experiment design flaws.
Also, p-values should be deprecated, since a) nil hypotheses are straw men at best and false by definition at worst [2], and b) they incentivize researchers not to think hard about effect sizes and uncertainty in their problems.
Many articles in e.g. Nature Genetics are cheating by hacking their p-values. These hacks would be much harder to get away with if, as a starter, they were asked to use hierarchical models and continuous explanatory variables, whenever possible.
The article from Andrew Gelman you cited explains this quite well. In general the review articles and books he has co-authored are incredibly helpful to learn how to avoid common issues that plague statistical inference.
We need to shift away from null hypotheses and p-values towards generative models, model selection and effect sizes. It leads to much more robust inference.
Not to be pedantic, but considering how many people I have run into who take all research studies as holy gospel, you may want to reconsider saying "everyone knows".
Can we name one mainstream journalist, or any policy maker, who knows anything about reproducibility - or, more honestly, the appalling lack thereof - or that "p-values are questionable at best"?
There have been books, such as "How to Lie with Statistics", that should be mandatory reading, at least on entry to adulthood. I am one of maybe 3 people I know personally who knows how to actually parse a scientific paper, so when someone starts throwing a half dozen studies at me to back up their point, I can usually - on the scale of quarter hours - nitpick their understanding of the sources they used.
This doesn't win anyone friends, and it's something I've said for nearly 3 decades: religion and science like to point fingers, but religion seems more likely to bend and change than science. Every time I mention this or an analogue, I'm shouted at that it's "actually the opposite".
Shut up already, it's science. The science is in. This is science!
For anyone who likes academic drama (or who is interested in the underlying methodological disagreements among academic statisticians), it is worth pointing out that Jager and Leek's 2014 paper was a discussion paper, and that Ioannidis was one of the people invited to write a response to be published alongside the original paper. He did (https://doi.org/10.1093/biostatistics/kxt036), and the response is extremely critical of Jager and Leek's methodology; his contempt for the authors is not hard to read between the lines.
For anyone who wants to get really into the weeds, here are all the articles in the sequence of discussion papers:
Main Paper (Jager and Leek): https://doi.org/10.1093/biostatistics/kxt007
Response papers:
- Yoav Benjamini and Yotam Hechtlinger: https://doi.org/10.1093/biostatistics/kxt032
- David R. Cox: https://doi.org/10.1093/biostatistics/kxt033
- Andrew Gelman and Keith O'Rourke: https://doi.org/10.1093/biostatistics/kxt034
- Steven N. Goodman: https://doi.org/10.1093/biostatistics/kxt035
- John P. A. Ioannidis (the spicy response): https://doi.org/10.1093/biostatistics/kxt036
- Martijn J. Schuemie, Patrick B. Ryan, Marc A. Suchard, Zach Shahn, and David Madigan: https://doi.org/10.1093/biostatistics/kxt037
I'd love to know the correlation between failure to replicate and public virality of findings. I wouldn't be surprised if the more exciting findings replicate at a different rate than average.
Being skeptical of cutting edge research is a lot different than distrusting science. There are plenty of people out there denying special relativity, evolution by natural selection, or believing all of western medicine is invalid, based on extrapolation from stuff like this.
I won't speak for all textbooks, but generally stuff you find in there should not be the same as what you find in journals, and is much more settled. Big caveat that that isn't necessarily true for younger sciences without long-established theory, say exercise physiology or social psychology, but something like a chemistry textbook is pretty damn trustworthy.
And those are what people who aren't actually scientists should mostly be educating themselves with, not newspaper science reporting sections.
Doubting a scientific paper's results is different from distrusting science.
Academia is a complex place and it's full of fake results produced to get publications. (I'm a coauthor of several scientific papers and I know for a fact that countless highly rated papers in my former field are not reproducible because they are based on adjusted numbers.)
Mentioning that not everything that's published in a peer reviewed publication is automatically 100% correct is not anti-intellectual.
Mistakes can be made, data can be limited or misinterpreted, scientists can be corrupted. Theories and "common sense" can change: things we thought we knew for sure have been proven wrong. If you are a scientist, you know that.
Just because I don't automatically take every "scientific" finding at face value doesn't mean I don't trust science.
I trust science, I just don't trust every single study, experiment, scientist, or journal. Actually, that is itself, in a way, science.
Yup. Blindly "trusting the science" is inherently anti-science
There are plenty of things that we took as scientific fact in the past that turned out to be false. Science is supposed to question existing notions and test/prove new theories. If you just assume that we're in a post-science era where we've got it all figured out and our current theories are all correct, that's not science, it's dogmatic faith.
Ignaz Semmelweis was destroyed by the medical community for his insane idea that doctors should wash their hands before performing medical procedures. He was attacked to the point where he ended up having a nervous breakdown and was committed to an asylum, where he was beaten to death by the guards. Serves him right for questioning the science!
We've optimized since Ignaz; now you only have to infect and then cure yourself to win a Nobel after being doubted (https://www.nytimes.com/2005/10/11/health/nobel-came-after-y... the scientist actually infected themselves with Helicobacter pylori, got sick in the predictable way, and cured it with antibiotics).
The process by which DNA was convincingly demonstrated to be the molecule of heredity was fairly complex; an early experiment that was complex and hard to understand established it, but people didn't completely believe it, so a later, easier-to-understand experiment was done.
For the longest time, the establishment believed the functionality of the ribosome (a critical subsystem that translates mRNA into protein) was carried out by its protein subunits. Although convincing data was published in the 1960s, the general belief did not change until the crystal structure of the ribosome was published showing that RNA formed the catalytic component.
And my personal favorite: it was considered impossible that prion diseases could be caused by proteins that misfolded and caused other proteins to misfold; it required absolutely heroic efforts in the face of extraordinary pressure to establish the molecular etiology of prions in the minds of the establishment.
It's hard to change your mind. Some people never will.
Medicine was not at all scientific until recently. There's a great podcast series about the rise of evidence-based medicine (over authority-based medicine) that I loved, but I can't seem to find it now... Anybody know what I'm thinking of?
I think it is a misconception that switching to the "right" incentives will solve problems. The very existence of incentives is not conducive to knowledge-based work.
It's also wrong. There have now been large-scale initiatives to reproduce papers, and they're getting higher failure rates than 14%.
The statistics underlying all of this make a number of implicit assumptions that may or may not be true IRL (for example: that journals publish papers independent of the salaciousness of claims made within them). If Science picks out only the top-5% most-sensational claims for publication, then you can't assume that a 95% CI is a safe threshold. You've probably got to increase it to a much higher value to have any prayer of getting past the inherent bias in such a process.
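A toy simulation of that selection effect, as a sketch with arbitrary numbers (assumes numpy and scipy; the 10% prior and the sample sizes are my own choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_studies, n_per_arm = 10_000, 30
    effect_is_real = rng.random(n_studies) < 0.10      # assume only 10% of tested effects exist
    true_effect = np.where(effect_is_real, 0.5, 0.0)

    pvals = np.empty(n_studies)
    for i in range(n_studies):
        a = rng.normal(0.0, 1.0, n_per_arm)
        b = rng.normal(true_effect[i], 1.0, n_per_arm)
        pvals[i] = stats.ttest_ind(a, b).pvalue        # two-sample t-test per "study"

    published = pvals < 0.05                           # only "significant" results get written up
    print(np.mean(~effect_is_real[published]))         # share of published findings that are false: ~0.5

Selecting on significance (let alone on sensationalism) means the published pool can be roughly half false even though every individual test used the conventional 5% threshold.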
Wait until you start looking into, say, psychology papers, vis-a-vis reproducibility. I think that a lot of these studies should be reviewed during the actual research and study part by at least a sociologist or cultural anthropologist, and a mathematician with expertise in probability.
When nearly every policy can be backed by "some papers published in <some journal>", stuff like this becomes important. For a recent example, the US White House is engaging in a 5-year study to determine if releasing sulfur dioxide into the stratosphere would accomplish enough "cooling" to offset the "climate crisis". The "climate crisis" is something that exists in computer models; few of the models agree, and if they do, it behooves us to look at when they started agreeing and find out why that change happened.
A more personal example: I can't reproduce getting CaCO3 (calcium carbonate) out of seawater, yet.
I think you've never done psych research. First of all, statisticians and methodologists are frequently included, also in the design phase, if it's not a simple adaptation of an existing study. Nonetheless, many papers contain design errors.
Second, sociology and anthropology are among the least empirical of academic disciplines. What reason would you have to involve them?
Third, many studies are of the type "my theory predicts X and my experiment shows p < 0.05". That's almost wrong by definition. Bayesian theory explains why.
Fourth, many studies have conclusions/claims that cannot be inferred from the actual finding.
> When nearly every policy can be backed by "some papers published in <some journal>", stuff like this becomes important.
It is important, but as long as academia is a "turn out papers for tenure" industry, it won't be fixed. Don't take the conclusions of a paper as true. They're meant for other academics.
It seems that psychology papers extrapolate too much, and even "correcting for confounding factors" isn't enough if the researchers don't know all the relevant confounding factors in the first place.
And I think that the popularity of R and Python in academic fields belies the fact that no one wants to talk to statisticians about their data. It's a lot easier to massage inputs to a machine than to a human.
I was under the impression we were talking about reproducibility in this thread; therefore my point about policy being influenced by something you suggest is "meant for other academics"
I encourage you to dig in to the sources posted. The majority of published research can't be and/or hasn't been replicated. That's a crisis by any definition, especially with the amount of idiots running around bleating about "the science is settled" or "trust the science" or whatever catch phrase.
The degree to which the science is settled varies wildly by topic and that’s ok. Nutrition is perceived to be filled with a lot of junk, but without dietary vitamin C you will die. That’s settled even inside a field filled with debate.
Individual papers were never intended to be the final arbiters of truth; that's not their role. If nobody thinks things are worth looking into again, then stuff stays in a very nebulous state, which is no worse than where things were before a paper was published.
Those bad papers are an impediment. It's much harder to get your paper/proposal past reviewers when it's for solving a problem someone else claims to have solved already. Also, even if you do get past that higher barrier at added expense and time (reviewers will want a lot more evidence to support a claim refuting a prior claim than a groundbreaking new claim), the person who usually gets cited more, funded more, and gains a bigger career advantage is still the person whose non-replicable publication was first.
>If nobody thinks things are worth looking into again, then stuff stays in a very nebulous state, which is no worse than where things were before a paper was published.
I don't think so, because of the aforementioned idiots who take a study that can't even be replicated but then cite the paper as evidence for something. If it can't be replicated, logic would say there is a very good chance it's inaccurate. Yet it's treated as being "settled science" which is worse.
Also, many times it isn't that nobody thinks it's worth looking into again, it's that coming to "the wrong conclusion" on some things will end their careers so they'd rather play it safe and just keep sucking up grant money to pump out more of the same garbage.
There's a big difference between science and social science. Or science and medicine for that matter. They are different domains that require different models of reasoning, which come with their respective sets of limitations.
There's a reason you don't see a crisis in e.g. physics, where their hypotheses are leaps and bounds more testable.
This failure rate suggests a reason the social sciences don’t “lead anywhere”: If you try to build a theorem off of two earlier results, but these each only have probability p of being right, then the chance your new theorem is correct goes like p^2. For small p this means the chance that a deduction is correct is very small.
The big unaddressed issue is that conclusions based on empirical data can only be as good as the observations. If the data is biased, manipulated, "massaged", etc, there can be no good output from that poisoned tree, and other persons cannot contradict the results without attempting to replicate the entire research process beginning to end.
With math, physics, applied math, computer science, etc, bad results can be usually (but not always) deconstructed by anyone with the merest tools of the rational; formal logic, application of various categorical laws, etc.
Science as a process is useful, but its conclusions can only be as reliable as what its body of producers are incentivized to produce.
That's totally normal. An individual study is unlikely to have enough statistical power to make a definite conclusion, and an individual line of evidence is insufficient to confirm a theory. No conclusion can be stated with any certainty until it is verified repeatedly, and no theory is fit for use until several consistent lines of evidence conform to its predictions. In fact, even calling these findings "false" is often inaccurate, as most researchers don't make strong claims.
While using software packages released last week may not be the greatest idea, I am leery of any research paper using a 10-year-old genome assembly, an Ensembl annotation release in the 40s (we are at #105), or clearly outdated program versions with X updates in the last 5 years.
Also, if FedEx/UPS were tracking packages with the bordering-on-cavalier attitude observed in some labs, we would often be getting bags of guano ordered by some horticulturalist instead of the book of our choice. Or even an empty bag, since QC may be an afterthought and hard wet-lab work may still produce unusable, crappy data.
On the other hand, the brand new technologies are rather expensive and good quality human tissue samples are hard to get, so scraping the bottom of the barrel to justify grant $$$ is unavoidable.
Sure. Some genes were retired because there was not enough support. Earlier Ensembl versions used older genome assemblies, which means some genes "jumped" from one chromosome to another, or got properly stitched together after previously residing partly on floating contigs.
Just to be clear: I am not saying that anything done in 2022 using hg19 is 100% wrong. Just that it is a bit like using a stretched 50 cm (+/- 5 cm) shoelace to measure your corridor when you have a decent tape measure in your pocket.
There are microarrays used in Big Science projects with ~1/3 of probes not matching human transcripts from the latest Ensembl release. Since most of them map to the current genome, who knows what the signal from such probes means. Unannotated exons? Retained introns? But some probes do not even map to the genome => some silly splicing error(?) packed into a plasmid in the '90s?
Do you mean in the social sciences? The natural sciences are physics, astronomy, chemistry, geology, and biology. These fields of inquiry lend themselves to empiricism. Mistakes/fraud happen, but it's a lot better than social science where empiricism is routinely rejected (straining the claim to even be science at all.)
No, I mean the natural sciences too. I am a former photovoltaics researcher and know for a fact that most articles in the field are dubious. I dedicated the first 4 months of my master's thesis to replicating an experiment which could not be replicated, because I later found the author had cooked the numbers, declaring an open-circuit voltage that was impossible to obtain.
Even in hard sciences such as physics, there is a problem where not many researchers have access to instruments used in the published result, or even if they do, it's just too expensive to do it just to replicate (low return on investment).
The footnote provided[1] for this sentence in the intro does not actually mention the word "natural":
> Data strongly indicate that other natural, and social sciences are affected as well.
It only writes that physicists and chemists are the most confident in their fields' results, and that a biochemistry graduate student is complaining about the effort it takes to replicate results.
It also writes that 70+% of researchers have failed to replicate an experiment. But that is very far from saying that 70+% of all experiments don't replicate. And the figure given by parent is 90+%.
The Prevalence section of the Wikipedia page talks about psychology, medicine, economics, and other social sciences. Medicine is arguably a natural science. But parent’s comment made me think it’s referring to results in physics or chemistry.
The journal linked in the OP is apparently one of the better and more rigorous journals.
For example:
>Public Library of Science, PLOS ONE, was the only journal that called attention to the paper's potential ethical problems and consequently rejected it within 2 weeks.
N attempts at proving N different invalid hypotheses is not much different from N attempts at proving 1 invalid hypothesis (all at the same p-value threshold). Both will result in roughly p*N incorrect validations in expectation.
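A quick sanity check of that arithmetic (a sketch assuming numpy and scipy; whether the N tests target one null or N different nulls doesn't change the expectation):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    N, alpha, n = 1000, 0.05, 20

    hits = 0
    for _ in range(N):
        sample = rng.normal(0.0, 1.0, n)        # the null is true: the mean really is 0
        if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
            hits += 1
    print(hits, "incorrect validations; expected about", alpha * N)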
Ioannidis was the first major name to point out that the mortality rate for covid-19 was 0.25-0.5%, at a time when figures as high as 1 in 8 were being cited. Abundant evidence since then, including CDC's official estimates in the US, suggest he was correct. So, you know, beware, ideology can catch everyone. Even you.
The first estimates from Ioannidis, from the flawed (in every possible way) California paper, were less than 0.1% IFR. He later revised them to 0.16% IFR; not sure if a subsequent "revision" (= admission of being wrong) was done afterwards.
Bear in mind, that was his IFR estimate for March-April 2020, when a lot of treatment was being done wrong (=intubate early) and even steroids were not supposed to be used for treatment outside RCTs.
Meanwhile, the CFR on the Diamond Princess was 2.6%, so refrain from the idiocy of "1 in 8 IFR estimates", as only completely uninformed people will fall for them.
To be fair, in the beginning the policy was to intubate and put patients on ventilators. This had a 97.2% chance of ending in death. So "1 in 8" was possibly true early on for people who were both very susceptible to catching a respiratory disease and had 4 or more comorbidities.
The reasonableness of a statement should be based on how well it is supported at the time it is made, not whether it is later demonstrated to be correct.
Overall mortality might have been 0.5%, but for susceptible groups like old people it was way higher. Also, the risk of nonfatal but serious covid complications is no joke. Also, secondary deaths from collapsing health care systems...
Follow-up analysis by Jager and Leek (2014) (https://academic.oup.com/biostatistics/article/15/1/1/244509) suggests the false discovery rate is closer to 14% than 50%.