I don't like this cartoon (andrewgelman.com)
82 points by pav3l on Nov 11, 2012 | hide | past | favorite | 54 comments



Interesting, but it IS a joke, which usually requires taking some liberties with the truth.

For example, an issue I have with it is that it's not actually possible for the Sun to explode. It is incapable of going nova. This is an old sci-fi trope: a civilization dies because its sun goes nova. But this isn't something that actually happens. A nova occurs only under very specific circumstances, and it is typically a recurring event. Moreover, it requires a white dwarf, which only comes into existence after a star has passed through its red giant phase. None of these things are conducive to life developing on nearby planets; they aren't even conducive to the continued existence of nearby planets. Regardless, our star is not a white dwarf, nor does it have a companion, and therefore it will never produce a nova. Nor can it go supernova: it has only about one-eighth the mass required.


I learned something today. I thought http://en.wikipedia.org/wiki/Helium_flash was the process that leads to novas; it is not.

That said, from our point of view a helium flash is pretty scary. Of course the Sun is not expected to do this for another 5 billion years.

But according to http://www.whillyard.com/science-pages/our-solar-system/sun-... the Earth is likely to be uninhabitable in about a billion years anyways.


That's the point. The Bayesian incorporates this information into the prior. ;)


Since it's Sunday and I'm in procrastination mode I took a little excursion into the rabbit hole and found this:

http://www.stat.columbia.edu/~gelman/research/published/fell...

The money quote:

"More than that, though, the big, big problem with the Pr(sunrise tomorrow | sunrise in the past) argument is not in the prior but in the likelihood, which assumes a constant probability and independent events. Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas? Why does stationarity apply to this series? That’s not frequentist, it isn’t Bayesian, it’s just dumb."
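For context, the Pr(sunrise tomorrow | sunrise in the past) argument is Laplace's rule of succession: under exactly the iid, constant-probability assumptions Gelman is objecting to, a uniform prior gives a next-success probability of (n+1)/(n+2). A sketch, with an illustrative day count:

```python
# Laplace's rule of succession: with a uniform prior on the success
# probability and n iid successes in n trials, the posterior predictive
# probability of another success is (n + 1) / (n + 2).
def rule_of_succession(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

# Roughly 5000 years of recorded sunrises (~1.8 million days, a made-up
# round figure for illustration):
p = rule_of_succession(1_826_213, 1_826_213)
print(p)  # just shy of 1
```

Gelman's point stands regardless: the formula is only as good as the exchangeability assumption baked into it.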


If the default state of procrastination were researching higher-order mathematical theory, the world would be... probably more predictable.

I'll let myself out.


Reminds me of the letter Babbage wrote to Tennyson, correcting his poem:

"In your otherwise beautiful poem one verse reads,

Every moment dies a man, Every moment one is born

If this were true the population of the world would be at a standstill. In truth, the rate of birth is slightly in excess of that of death. I would suggest that the next edition of your poem should read:

Every moment dies a man Every moment 1 1/16 is born

Strictly speaking the actual figure is so long I cannot get it into a line, but I believe the figure 1 1/16 will be sufficiently accurate for poetry."

Seems jgc has written a blog post about it:

http://blog.jgc.org/2010/09/on-being-nerd.html


Actually a moment is an infinitesimally small amount of time, therefore in order to model the birth/death distribution across moments we need to rigorously define... wait, what am I doing with my life?!


Here's the thing: the frequentist in the comic has made an error even by frequentist standards, and that error is equivalent, in Bayesian thinking, to choosing an inappropriate prior.

The problem is that many frequentist techniques implicitly choose a prior for you. That's handy since choosing an appropriate prior is hard. But it also abstracts away the choice of prior.

If a Bayesian makes this mistake, anyone can look at their math and say "There. That's where you chose a bad prior."

If a frequentist makes this mistake, you have to have a complicated analysis to explain why the method used is inappropriate.


I agree with this analysis. The cartoon is, of course, a caricature, but it highlights the essential problem you point out: unraveling the line of reasoning carved out by a frequentist is not a straightforward matter, even if the argument is correct. From the Bayesian perspective all arguments are deductive once you have all the information. A typical reply to this observation is that in most cases it will be clear how the steps of a calculation translate into the various assumptions needed to clarify the argument. Of course, what counts as "typical" depends on the kinds of questions one asks. If there are only a few kinds of problems in your field, then maybe you can get away with heuristic lines of reasoning, patching up problems in special cases as needed. But if you are, say, trying to write general-purpose software for statistical analysis, it will not be possible to rely on inductive lines of reasoning particular to a certain field. This might partially explain the popularity of the Bayesian approach in machine learning circles.


Given that Randall Munroe has already expressed that the Frequentist vs. Bayesian angle of the comic was an afterthought, I think the real point (and the point that I actually took away from the comic) is about the blind application by scientists of the 0.05 P-value threshold for significance without regard to the specific circumstances of the experiment, which I can attest is a huge issue in the scientific literature. Another huge issue is using a statistical test on data that is known not to satisfy the assumptions of the test, either out of ignorance or because the test gives a good (i.e. P<0.05) result where the correct test doesn't.


Cf. Randall's own classic comic about jellybeans causing cancer, or maybe one color of jellybean causing cancer.
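The jellybean comic is a multiple-comparisons problem: run 20 independent tests at alpha = 0.05 on pure noise and the chance of at least one spurious "significant" hit is 1 - 0.95^20, about 64%. A quick illustrative simulation (not from the thread):

```python
import random

# With no real effect, each test at alpha = 0.05 has a 5% false-positive
# rate. Across 20 independent jellybean colors, the chance of at least one
# spurious "green jellybeans linked to acne!" headline is 1 - 0.95**20.
random.seed(0)

def at_least_one_hit(n_tests=20, alpha=0.05):
    return any(random.random() < alpha for _ in range(n_tests))

trials = 100_000
rate = sum(at_least_one_hit() for _ in range(trials)) / trials
print(rate)  # close to 1 - 0.95**20, i.e. roughly 0.64
```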


Wow, this is embarrassing. I took statistics, but it's been almost a decade. I got a pretty nice review on Bayes' theorem in an AI class I took last year though. I thought I understood the xkcd comic quite clearly, but I'm completely lost on the posted article.

Apparently, Munroe was a bit confused also.

>The truth is, I genuinely didn’t realize Frequentists and Bayesians were actual camps of people—all of whom are now emailing me.

http://andrewgelman.com/2012/11/16808/#comment-109366


Somehow this only makes the comic and the reaction way funnier.


I really like this sentence:

> All statisticians use prior information in their statistical analysis. Non-Bayesians express their prior information not through a probability distribution on parameters but rather through their choice of methods.

Also, from the comments:

>My problem is ultimately not with the cartoon. My problem is that there are practitioners and teachers of statistics who spread cartoonish ideas about statistical methods without recognizing these ideas are inaccurate.


I'm not sure how it can be a good idea to be choosing from various statistical methods based on which one will give you the sort of answer you intuitively think is correct. I mean, if you're going to bite the bullet and bring priors into your analysis in an ad-hoc way like that, why not just acknowledge their existence mathematically?


When you are reporting, say, a p-value from your ANOVA F-test, you are making formal mathematical assumptions, such as normal marginal distribution of your dependent variable. A lot of Frequentist methods (hypothesis testing) are really just mathematical shortcuts from times when computation was more expensive. The problem is, many people tend to misuse the tests where they are not appropriate either because those give "better" answers, or simply out of ignorance.


Right, but the way you deal with the situation in the comic is going to end up being a much fuzzier and more subjective issue of picking your reference classes.


I do like this cartoon; it's a very nice representation of the problem with the viewpoint.

The main issue is that it makes frequentist humans look like idiots. They're not.

They're simply often led astray -- as this example illustrates.

In particular, the fact that people misuse p-values must be exposed. Just because it is unlikely that something happened by chance, in one way, in your model, does not prove your hypothesis correct.


While it's true (and should be obvious) that a competent statistician wouldn't make such silly mistakes, there are quite a few scientists who are not-so-competent statisticians who publish results with those exact errors.


Including the entire A/B testing world, from startup to Amazon.com and big banks. (Source: working with all three.)


Isn't the joke that if the sun did just go nova, then a $50 bet is meaningless?


Correct. The frequentist is being played for a double loser: the sun has not gone supernova, and he's out $50 (p < 0.0000000...) or the sun has gone supernova, and there will never be an opportunity to settle.


There is more than one joke.


Here is my writeup on the "real" difference between frequentist and Bayesian methods: http://stats.stackexchange.com/a/2287/1122

Even more here: http://qr.ae/17BEW

The truth is they both make tradeoffs that can appear ridiculous. In fact, the criticisms of confidence intervals and p-values apply almost exactly, in transpose, to credibility intervals and posterior probabilities.

Confidence intervals and p-values are a worst-case technique. The p-value will always control the false positive rate below alpha, even for the worst-case input. Sometimes you do want this -- e.g. when we say that the worst-case runtime of QuickSort is O(n^2), that's useful, even if we do have a prior distribution over the inputs and could also say that the expected runtime is O(n log n). But the errors are correlated across observations. You can have a valid "95%" confidence interval that always produces total nonsense when the experiment ends up with output x, as long as x happens <5% of the time for all possible inputs.

Credibility intervals and posterior probabilities are an average-case technique, where we integrate over the prior. Even if the prior is correct, the errors are correlated across inputs, which can be a problem. In the cookie-jar example at stackexchange, the 70% credibility interval is "wrong" 80% of the time when the jar is type B. That means if you send out 100 "Bayesian" robots to assess what type of jar you have, each robot sampling one cookie, you will expect 80 of the robots to get the wrong answer, each having >73% posterior probability in that wrong conclusion! That's a problem, especially if you want most of the robots to agree on the right answer. The two methods just make different tradeoffs in the way they quantify uncertainty.

My quibble with the cartoon, though, was that it's not really about the frequentist vs. Bayesian debate. If you want to decide whether to take action (like shuttering a satellite) in response to a "YES" output from the instrument, everybody will agree that you need to calculate {rate of events} * {false negative rate} * {cost of false negative} and compare it with {1 - rate of events} * {false positive rate} * {cost of false positive}.

The frequentist agrees with this math, the Bayesian agrees with this math, and the math doesn't even use Bayes' rule. This is basic actuarial science or decision theory.
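The comparison described two paragraphs up is easy to make concrete. All the numbers below are hypothetical assumptions (the comic specifies only the dice, which lie with probability 1/36); the structure is exactly the inequality described:

```python
# Hypothetical numbers plugged into the expected-cost comparison above.
# None of these figures come from the comic; they are assumptions for
# illustration. The comic's detector lies when both dice come up six: 1/36.
p_event = 1e-9                  # assumed prior rate of the event
false_positive_rate = 1 / 36    # detector says YES with no event
false_negative_rate = 1 / 36    # detector says NO despite an event
cost_false_negative = 1e6       # hypothetical cost of missing a real event
cost_false_positive = 50        # hypothetical cost of a false alarm (the bet)

risk_of_ignoring = p_event * false_negative_rate * cost_false_negative
risk_of_acting = (1 - p_event) * false_positive_rate * cost_false_positive

act = risk_of_ignoring > risk_of_acting
print(risk_of_ignoring, risk_of_acting, act)  # the alarm loses: don't act
```

With a prior event rate this low, acting on a YES costs more in expectation than ignoring it, which is the comic's punchline in arithmetic form.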

The frequentist might do the mechanics in a certain way. They may say they are first going to calculate a p-value, and then ask whether the p-value is less than a threshold alpha, where alpha was set based on the costs and rate of events in order to control the "false discovery rate." And then take action only on a "significant" result.

The Bayesian might do the calculation a little differently too; they could say they are first going to use the expected rate of events as a prior, then calculate the conditional probability that there has been an event (given the instrument's reading), and then multiply this posterior probability by the cost of false negative, and its complement by the cost of false positive, to decide which action has lower expected cost.

But both the frequentist and Bayesian will get the same answer and end up with the same result as somebody who evaluates the inequality above directly. I don't think any technique has a monopoly on the correct answer here.


Upvoted for the "real difference" link.

I personally think that both groups are doing it wrong. I do a lot of A/B testing. In A/B testing what you care about is this:

1. Not getting a horribly wrong answer.

2. Getting an answer quickly.

A frequentist can tell me how to avoid getting a horribly wrong answer, but has no idea that below some threshold of data any answer is likely to be chance, and the errors that I can make are severe. And conversely, if I have enough data, I'm likely to find real answers, and the mistakes I am likely to make are acceptable.

A Bayesian can tell me - in principle - that there is a threshold below which I should be cautious of making decisions and a threshold above which I can make decisions more easily. But naive priors set those thresholds too low, and I do not have sufficient data to come up with a real prior to use. I could create a conservative prior, but it would be hard to explain to anyone what I am doing.

In practice I've found that it is effective to blend the approaches. I compute frequentist statistics, but based on my past experience and knowledge of the ease of making severe errors, I insist on very high confidence levels for low amounts of data, and much lower for high amounts of data. Based on some numerical simulations, my true error rates seem acceptably low, and experiments run acceptably quickly.

(If I did not care about the speed of testing, then I could just set a rule like, "Go with the first version to get 296 conversions ahead." If the two versions have conversion rates that differ by 1%, then 95% of the time I will get the right answer. If the difference is larger, I will get the answer even more often. If the difference is smaller I will get the wrong answer more often - but the errors that come will be small and on average I'm still making good business decisions. All of the complex stats I actually do are just about getting answers quickly without compromising how often, on average, I make bad business decisions.)
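Assuming the "1% difference" in the parenthetical is a relative lift (e.g. a 1.01% vs. a 1.00% conversion rate), the "first to get 296 conversions ahead" rule can be checked with the classic gambler's-ruin formula rather than a long simulation:

```python
# Gambler's-ruin check of the "first to get 296 conversions ahead" rule.
# Treat each conversion as a step in a random walk: +1 if it came from
# version A, -1 if from B. If A's rate is higher by a relative lift r,
# each step is +1 with probability (1 + r) / (2 + r). The chance A reaches
# +lead before -lead is the standard gambler's-ruin probability.
def prob_right_answer(lead: int, relative_lift: float) -> float:
    p = (1 + relative_lift) / (2 + relative_lift)  # P(step is +1)
    q = 1 - p
    return 1 / (1 + (q / p) ** lead)

print(prob_right_answer(296, 0.01))  # roughly 0.95, matching the comment
```

This is an interpretation of the comment's numbers, not the commenter's actual code; with no real difference (lift 0) the rule picks either side with probability exactly 1/2, as it should.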


The posterior distribution is only half the story. A true Bayesian uses a utility function to make decisions. How much you care about the worst case vs the average case is in the utility function. That's exactly the problem with frequentist methods: you're still making assumptions, but they are implicit and hardcoded in your choice of method, instead of explicitly stated and tweakable. With a particular choice of prior and utility function, you can recover many frequentist methods, but in most cases those will not be the prior and utility function you actually want.

For example maximum likelihood estimation corresponds to a utility function equal to the likelihood (i.e. the probability mass), which at the very least should strike you as ridiculous for continuous quantities (maximum likelihood can still be useful as an approximation technique if the problem is intractable with your actual utility function). With a frequentist method, you are using a prior, you just don't know which one.

For some problems you might be able to get the correct decision in a very roundabout way by setting your alpha to the right magic value, but (1) it's not clear how to find the right alpha and (2) in general you cannot encode a complete utility function into a single number.


> With a frequentist method, you are using a prior, you just don't know which one.

I don't think so. Look over the cookie-jar example. The confidence interval guarantees worst-case coverage at least equal to its confidence parameter, for all values of the parameter. The credibility interval gives average-case coverage, integrated over the prior.

The confidence interval gives guaranteed coverage for every value of the parameter (conditioned on each possible input value). The credibility interval includes enough mass in the conditional probability function, conditioned on each possible output observable.

These are different mathematical objects and they do different things. The confidence interval doesn't use a prior over input values; it is giving you guaranteed coverage for any input value.

Let me put it this way: if you think the frequentist method is using some prior, what choice of prior will make the 70% credibility intervals in the cookie-jar case be identical to the 70% confidence intervals?

Anybody can think about utility to make decisions; it's not unique to Bayesian methods. Statisticians and engineers have been calculating ROC curves and choosing operating points on the ROC frontier (based on cost/benefit analysis) since World War II.


I meant that in the context of making a decision. The point of statistics is to make decisions. For example you want to know whether a medical treatment works so that you can decide whether or not to give it to people. So you do a hypothesis test to see whether the treatment works better than a placebo, and then if the p-value is small enough you give it to people, and otherwise you don't. Instead of explicitly separating the assumptions (prior & utility) from the logical deduction, the assumptions are embedded in this procedure. Why would the assumptions implicitly made by the choice of procedure be the assumptions you want to make? You take the answer to a question that's irrelevant to the decision, namely "given that the treatment doesn't work, how likely is the data" and try to tweak a decision based on that. There is no principled way to make decisions based on that information.

Credibility intervals are about average-case coverage, but Bayesian statistics as a whole is definitely NOT just about average case. In general the utility `U` is a function of the decision `d`, and of the posterior knowledge you have of the world `P`. In many practical cases the utility might be the expected profit: U(d,P) = integral(profit(x,d)P(x)dx). But it certainly doesn't have to be. If you are risk averse you might choose your utility as U(d,P) = min_x profit(x,d) to ensure that your utility is the minimum profit you make given a decision, rather than the average. Another example is U(x,P) = P(x) which gives maximum likelihood estimation. Making a decision based on a hypothesis test can also be emulated with a utility function. Suppose the hypothesis is H and we make decision d1 if p-value > alpha and d2 if p-value < alpha. We choose a prior P(I) that makes each possible observed data set I equally likely, and we choose the utility function that reverses Bayes' rule to compute P(I|H) to make a decision based on that:

    U(d1,P') = [P'(H)*P(I)/P(H) > alpha]
    U(d2,P') = [P'(H)*P(I)/P(H) < alpha]
where brackets are indicator notation. Note that P(I) seems to give the utility function access to the measured data set, which it should not have; but recall that P(I) is constant regardless of the measured data. Of course, not many people have such a prior and utility function... so it doesn't really make sense to hard-code them into the method.

In general the process works like this. Given prior P and utility U and measured information I:

1. Compute posterior P' from prior P and information I according to Bayes' rule.

2. Perform decision argmax_d U(d,P').
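The two-step recipe above can be sketched directly for a discrete world. The prior, likelihood, and utilities below are made-up numbers purely for illustration (a toy version of the treatment/placebo example):

```python
# Minimal sketch of Bayesian decision making: posterior via Bayes' rule,
# then argmax of expected utility. All numbers are hypothetical.

# Step 1: posterior P' from prior P and observed information I.
def posterior(prior, likelihood, observation):
    # prior: {hypothesis: P(h)}; likelihood: {hypothesis: {obs: P(obs | h)}}
    unnorm = {h: prior[h] * likelihood[h][observation] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Step 2: decision d maximizing expected utility under the posterior.
def decide(post, utility, decisions):
    # utility[d][h] = payoff of decision d if hypothesis h is true
    return max(decisions, key=lambda d: sum(post[h] * utility[d][h] for h in post))

prior = {"works": 0.1, "placebo": 0.9}
likelihood = {"works": {"improved": 0.8}, "placebo": {"improved": 0.3}}
utility = {"treat": {"works": 10, "placebo": -2}, "skip": {"works": 0, "placebo": 0}}

post = posterior(prior, likelihood, "improved")
print(post, decide(post, utility, ["treat", "skip"]))
```

The assumptions (prior and utility) sit in plain view as data, which is exactly the separation the comment is arguing for.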

Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?


> Can you point me to a similarly principled approach to decision making based on utility with frequentist methods?

Sure, as I said anything involving ROC curves, where we pick an operating point by trading off the cost of false positives vs. false negatives and a design rate of incoming true positives and negatives.


Can you give a mathematical recipe with assumptions and deductions? ROC curves don't cut it; that's just twiddling a parameter of a classifier. Is it optimal in any sense? How do you know it's a good classifier for making decisions, when it's a classifier based on "given the hypothesis, how likely is the data" and not the other way around? Is it generalizable to other situations?


Yes, given a design rate of true positives or negatives, and a cost for false positives and false negatives, you can pick the optimal operating point. It will be optimal in the sense of minimizing average cost when the incoming rate equals the rate you designed for. You'll get the exact same answer as a "Bayesian" who uses conditional probability to calculate the same thing and whose prior equals the design rate. I gave a worked-out example in my original post ("If you want to decide whether to take action...").

Sure, it is generalizable -- we use ROC curves for radar, medical imaging, almost any diagnostic test...


Sure, the problem is not which point on the ROC curve you pick, the problem is which classifier you use to obtain it in the first place. I can pick a random classifier with a tunable parameter and draw its ROC curve and then pick the "optimal" point, but if the classifier sucks then that's no good. Why would a frequentist classifier based on a hypothesis test be good? A hypothesis test is the answer to the wrong question for the purposes of making a decision.

As I showed above, you can indeed get the same result from Bayesian decision making if you use a weird prior and utility function, which shows that frequentist decision making based on hypothesis tests is a subset (of measure 0) of Bayesian decision making. Again, that just means that you encoded a most likely wrong prior and utility in the choice of method without any justification.


Just that response seems to prove the cartoon's point.


I don't see how, really. Gelman himself is a (rather famous) Bayesian, and if you read the comments, you'll see that Randall himself pops up and basically cedes Gelman's point.


Can't we all just take a cartoon AS a cartoon? I don't think Munroe is trying to 'smack down' frequentists here; it's all just for a little fun. Look at his own comment on the post: http://andrewgelman.com/2012/11/16808/#comment-109366


I don't understand how the two camps can exist. Don't the two methods produce different results? Surely, only one is true. Which is it?


It's mostly a philosophical difference between thinking of probabilities as measures of relative frequency and thinking of them as measures of one's uncertainty about the outcome. There isn't as huge a war between them as there used to be, but if you want to read about the history, this is a book I enjoyed: http://www.amazon.com/The-Theory-That-Would-Not/dp/030016969...

Being horribly biased in favor of the Bayesian interpretation ever since I learned it was a thing, I'll give an example of places where frequentists can be wrong. People who disagree can give counterexamples. ;)

http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequ...

On the other hand, some argue that certain forms of inference are invalid and that it doesn't matter if they give the correct answer or not in practice because they're invalid. Calculus was attacked on this basis early on because many mathematicians thought that taking the limit of something as it approached 0 wasn't a thing you should be able to do.


Thanks, I'll read that post now. I'm actually going by Yudkowsky's post:

http://lesswrong.com/lw/ul/my_bayesian_enlightenment/

It might not have been the exact same one, it was one where he mentioned the riddle and how his friend got the wrong answer because he was a frequentist. That might be where most of my misconception arises from.


>Surely, only one is true.

This is my general comment about statistics and data analysis without going into the specifics of Freq vs Bayes:

As anyone who has ever worked with real-world data can tell you, for the most part data analysis is more of an art than an exact science. Sure, it's math, and once you pick the right model, there is (usually) one correct way to solve it. The problem is that most mathematical methods come with assumptions that are almost never met in practice, so you have to make a lot of (often fairly arbitrary) decisions about how to go about your data analysis. How do you handle missing data? Is your data normally distributed enough to justify the use of some method? Is the sample large enough? Are the residuals in your model diagnostics random enough? Does that trend line look linear enough? What prior information can I use (and how?) to formally add value to the model? The real world is messy.


Given your clarification...

The two methodologies can give somewhat different results, but not as often as you might think, and the differences aren't as large as you might think. In my experience, the instances where you get radically different answers from Bayesian and frequentist methods are quite rare, and tend to be pathological examples invented purely to demonstrate the "superiority" of one over the other.

That said, sometimes one or the other is significantly more convenient or easier to apply for a given model or type of data.


I see, thanks. Are the differences hard to reconcile, given that we can do Monte Carlo on a model and see whose predictions are correct?

I'm just annoyed by there being two camps in science, where one gets slightly different results from the other. It seems to me that one is obviously wrong, since there's only one truth.


You think that's bad, you should check out physics: http://en.wikipedia.org/wiki/Theory_of_everything

Statistics is more about data sets and interpretation than "right" vs. "wrong".


> It seems to me that one is obviously wrong, since there's only one truth.

A bold claim.



The point of Gelman's reply is that the comic is actually comparing a Bayesian to an absurdly incompetent Frequentist, so there's really no conflict. No (modestly intelligent) Frequentist would misapply this methodology in this circumstance.


Sorry, I wasn't talking about the comic. In general, don't the two approaches give different results? Surely, only one is "correct".


Both are correct, but they target different things. The disagreement is about what the target should be and the advantages and disadvantages of each choice of target. Bayesians are interested in p(unknown|data) and frequentists are interested in p(data|unknown = H0). Inference can be framed either way, but it means different things.


Are there any situations where you want to use a frequentist procedure?

I've concluded that given a perfect, infinite-power MCMC simulator, I would always do a Gelman-style Bayesian analysis (with model falsification and improvement), but in practice, frequentist methods are computationally convenient.

> Inference can be framed either way but means different things.

A Bayesian posterior P(H|D,M) is the probability that hypothesis H is true given data D and modelling assumptions M.

What does a frequentist p-value mean?


Sure, see my link above (http://stats.stackexchange.com/a/2287/1122). If you want to put an upper bound on the worst-case probability of making a mistake, you use a p-value. If you want to express the conditional probability of a particular hypothesis given the observation (and given a prior belief), you use a posterior probability. The Bayesians also can do silly things (see the cookie example with the inept Bayesian robots). In the end there is no free lunch.


The frequentist p-value is about H0, not (directly) the hypothesis you are testing. More specifically, it is the probability of observing data at least as extreme as yours under the assumption that H0 is true.


Wow thank you, this is the clearest and most straightforward explanation of the difference between the two camps in this thread.


They are both models and as such, you might consider that neither of them are "correct." But they are both useful, sometimes in different circumstances.

"Essentially, all models are wrong, but some are useful." — George Box


I see, thank you all very much for the clarifications.


This cartoon actually explains why the Higgs Boson and tachyon hunting folks use p=0.000001 or so, not p=0.05. Extraordinary claims require extraordinary evidence. Not a problem with frequentism at all.
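The "p=0.000001 or so" figure corresponds to particle physics' five-sigma discovery convention; the one-sided tail probability of a standard normal at z sigma is a one-liner:

```python
import math

# One-sided tail probability of a standard normal at z sigma:
# p = P(Z > z) = erfc(z / sqrt(2)) / 2. At the five-sigma threshold
# used for the Higgs discovery this is about 2.9e-7.
def sigma_to_p(z: float) -> float:
    return math.erfc(z / math.sqrt(2)) / 2

print(sigma_to_p(5))  # about 2.87e-7
```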




