For me, the best book on Bayesian probability so far has been "Probability Theory: The Logic of Science: Principles and Elementary Applications" by E. T. Jaynes.
The book starts by deriving Bayes' theorem from the first principles of logic and shows its applications to a wide range of topics. There is a thorough discussion of various "paradoxes", and the author sharply criticizes frequentist statistics. In addition, there are a lot of historical references.
I came here to recommend this book too!
It's a text that definitely allows one to go as deep as they wish: very fleshed-out references, a good appendix, and lots of comments on directions that can be explored more deeply.
It's a book I'm happy to have in dead tree form on my shelf.
For those unclear on the concrete (rather than philosophical) difference between Bayesian and frequentist statistics in the first place, I hope it's not inappropriate for me to share this 5-minute example that I wrote a while back: https://news.ycombinator.com/item?id=11096129
You write that the frequentist doesn't answer the question, but it does. It answers
P(H') = (H/(H+T))^H'
You also write that the frequentist solution fails to give an error estimate, yet you don't show that the Bayesian solution does give one.
If the goal of the article is to show that Bayesian is more correct than frequentist then it leaves the reader unconvinced. If the goal is to show 3 ways of finding a probability, you should either say each is fine under its own paradigm, or argue why only one paradigm is correct.
> You write that the frequentist doesn't answer the question, but it does. It answers
> P(H') = (H/(H+T))^H'
That's not the probability of getting H' heads in a row. It's an estimate of the probability of getting H' heads in a row based on a Maximum Likelihood estimation.
It doesn't make much sense if you take it to be the probability of getting H' heads in a row. For example, if {H=1, T=0}, then P(H'=100) = 1. You looked at one flip, and then decided that every subsequent flip was guaranteed to be heads?
It becomes even more clear that the question isn't really being answered if you take {H=0, T=0}.
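To make the two edge cases above concrete, here is a minimal sketch comparing the plug-in maximum likelihood estimate with a Bayesian answer. The Bayesian version assumes a uniform prior on the coin's bias p, which is my assumption and not necessarily the prior used in the linked example; the function names are made up for illustration.

```python
from math import exp, lgamma


def mle_streak_probability(heads, tails, streak):
    """Plug-in estimate (H / (H + T)) ** H' from the observed flips."""
    if heads + tails == 0:
        # With no data the ML estimate of p is 0/0, so there is no answer.
        raise ValueError("no flips observed: the ML estimate is undefined")
    return (heads / (heads + tails)) ** streak


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_streak_probability(heads, tails, streak):
    """P(H' heads in a row | H, T), ASSUMING a uniform prior on p.

    The posterior on p is then Beta(H + 1, T + 1), and the expected value
    of p ** H' under it is B(H + H' + 1, T + 1) / B(H + 1, T + 1).
    """
    return exp(log_beta(heads + streak + 1, tails + 1)
               - log_beta(heads + 1, tails + 1))


# One observed head, no tails: the plug-in estimate is certain of 100 heads.
print(mle_streak_probability(1, 0, 100))    # 1.0
print(bayes_streak_probability(1, 0, 100))  # roughly 1/51
# No data at all: the Bayesian answer falls back on the prior.
print(bayes_streak_probability(0, 0, 1))    # roughly 0.5
```

With a different prior the Bayesian numbers change, which is exactly the caveat about needing P(p).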
> You write that the frequentist doesn't answer the question, but it does. It answers: P(H') = (H/(H+T))^H'
The question was asking for P(H' | H, T), not P(H').
> You also write that the frequentist solution fails to give an error estimate, yet you don't show that the Bayesian solution does give one.
Because there is no error? In the proof I assume P(p) is known and then after that every step follows from a law of probability. There is no error to be accounted for in the procedure. The only caveat is that we need to know P(p) to be able to perform the procedure, which is a caveat that I point out at least 3 times in the page.
> The only caveat is that we need to know P(p) to be able to perform the procedure
I think this is a very confusing way to put it. P(p) is not an objective value that you can know or not know, it is rather a model of our subjective knowledge, and therefore it doesn't really make sense to say "the caveat is that we need to know what our knowledge is" ... yeah, we do, but that is always the case by definition, so pointless to bring up.
Another problem for those less familiar with Bayes' theorem is what is described as the "Bayesian trap", explained by the YouTuber Veritasium:
https://www.youtube.com/watch?v=R13BD8qKeTg
I hope people agree it’s totally appropriate, and appreciated, thank you for reposting it.
This is most of the reason I come here, because people show the good will to share bits of knowledge and experience.
A whole other benefit is that when people are willing to do this, their contribution might be critiqued or corrected, which can then sharpen or polish their knowledge and thinking even in areas where they might be very qualified.
For some people this would be a nightmare, if they can easily feel angry or hurt when their intellect is challenged, especially when they are an “expert” on the subject.
But I suspect most people here feel the opposite. You found a flaw in my results or reasoning? Fucking awesome, you have just made me stronger.
edit: I don’t know many other online forums where this dynamic exists, so if anyone does please don’t keep it a secret.
For me, this book's first chapters nicely explained ML, MAP, and Bayesian approaches using real computer vision problems. The author included helpful visual aids (Gaussian plots, contour plots, filter outputs, etc.)
http://www.computervisionmodels.com
This is a rather unusual book in that it gives a primer on probabilistic methods that is actually applicable to non-computer-vision problems. It is Bayesian-heavy and rarely touches neural networks; the book was released in 2012, the year the deep learning boom started.
You can also think about Bayes' theorem as follows. Suppose we have a logical robot trying to learn about the world. The robot has a collection of hypotheses in its brain. Every time it observes a new fact, it deletes all hypotheses that are incompatible with that fact.
For example, suppose it is thinking about the hair colour and eye colour of Joe. It starts with these hypotheses about Joe's (eye colour, hair colour):
(blue, blond)
(blue, black)
(brown, blond)
(brown, black)
Suppose that it learns that blue eyed people have blond hair. It deletes the hypothesis (blue, black), which is incompatible with that fact, and keeps only the hypotheses compatible with it:
(blue, blond)
(brown, blond)
(brown, black)
Suppose it now learns that Joe has blue eyes. It keeps only the hypothesis compatible with it:
(blue, blond)
So it has now learned the hair colour.
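The deletion procedure above can be sketched in a few lines. This assumes the starting set is all four combinations of the two eye colours and two hair colours; the names `observe`, `after_rule` and `after_eyes` are just illustrative.

```python
# Each hypothesis is an (eye colour, hair colour) pair for Joe.
hypotheses = {
    ("blue", "blond"),
    ("blue", "black"),
    ("brown", "blond"),
    ("brown", "black"),
}


def observe(hypotheses, is_compatible):
    """Keep only the hypotheses that are compatible with a new fact."""
    return {h for h in hypotheses if is_compatible(h)}


# Fact 1: blue eyed people have blond hair, so (blue, non-blond) is ruled out.
after_rule = observe(hypotheses,
                     lambda h: not (h[0] == "blue" and h[1] != "blond"))

# Fact 2: Joe has blue eyes.
after_eyes = observe(after_rule, lambda h: h[0] == "blue")

print(sorted(after_rule))
print(after_eyes)  # only ("blue", "blond") is left
```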
In reality it is not true that all blue eyed people have blond hair. We change the robot's brain and give a weight to each hypothesis indicating how likely it is. Equivalently, we could insert multiple copies of each hypothesis, and the likelihood of a hypothesis is equal to the number of copies of the hypothesis.
Blue eyed people are more likely to be blond, and the weights reflect that. These weighted hypotheses describe the attributes of Joe. Suppose we now learn that Joe has blue eyes. The robot keeps only the hypotheses compatible with that fact:
(blue, blond): 10
(blue, black): 2
So P(blond hair) = 10/12 and P(black hair) = 2/12. This is all Bayes' theorem is: you have a set of weighted hypotheses, and you delete hypotheses incompatible with the observed evidence. The extra factor in Bayes' theorem is only there to re-normalise the weights so that they sum to 1.
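Here is the weighted version as a rough sketch. The two blue-eyed weights (10 and 2) are the ones from the comment; the brown-eyed weights are invented purely so the example has something to delete, and `update` is a hypothetical helper name.

```python
from fractions import Fraction

# Weighted hypotheses about Joe's (eye colour, hair colour).
weights = {
    ("blue", "blond"): 10,
    ("blue", "black"): 2,
    ("brown", "blond"): 3,   # ASSUMED weight, for illustration only
    ("brown", "black"): 15,  # ASSUMED weight, for illustration only
}


def update(weights, is_compatible):
    """Delete incompatible hypotheses, then re-normalise so weights sum to 1."""
    kept = {h: w for h, w in weights.items() if is_compatible(h)}
    total = sum(kept.values())
    return {h: Fraction(w, total) for h, w in kept.items()}


# Evidence: Joe has blue eyes.
posterior = update(weights, lambda h: h[0] == "blue")

for hypothesis, probability in posterior.items():
    print(hypothesis, probability)
```

Dividing by `total` is the re-normalising factor: it plays the role of the denominator in Bayes' theorem. Note that the brown-eyed weights drop out entirely, so the result does not depend on the values I made up for them.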
To clear up your first set to have conditional probabilities for everything, Bayes' theorem is just a restatement of the product rule:
p(a and b | context c) = p(a|b,c) * p(b|c)
= p(b|a,c) * p(a|c)
or = p(a|c)*p(b|c) = p(b|c)*p(a|c) if a and b are independent of each other
so Bayes only matters when there is dependence:
p(a|b,c) = p(a|c) * p(b|a,c) / p(b|c)
otherwise it's just p(a|c) = p(a|c)
I like to put things in that order because p(a|c) is the "prior belief", and with some handwaving you can say things like "updated belief = prior belief and new evidence about belief".
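If it helps, the identity can be checked numerically on a small joint distribution. The numbers below are arbitrary and made up for the check; the context c is left implicit (everything is conditioned on it), and the function names are only illustrative.

```python
from fractions import Fraction

# An arbitrary joint distribution p(a, b | c); the values are made up.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}


def p_a(a):
    """Marginal p(a | c)."""
    return sum(p for (x, _), p in joint.items() if x == a)


def p_b(b):
    """Marginal p(b | c)."""
    return sum(p for (_, y), p in joint.items() if y == b)


def p_a_given_b(a, b):
    """Product rule rearranged: p(a | b, c) = p(a and b | c) / p(b | c)."""
    return joint[(a, b)] / p_b(b)


def p_b_given_a(b, a):
    """Product rule rearranged: p(b | a, c) = p(a and b | c) / p(a | c)."""
    return joint[(a, b)] / p_a(a)


# Bayes: p(a|b,c) = p(a|c) * p(b|a,c) / p(b|c)
updated = p_a_given_b(True, True)
from_prior = p_a(True) * p_b_given_a(True, True) / p_b(True)
print(updated, from_prior)
```

Here a and b are dependent, so the updated belief differs from the prior p(a|c).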
My youngest has Allen Downey as a professor this year. She says he is crazy. And she means this in the best way possible. He is prolific, having written Think Java in 13 days. He memorized pictures and bios of all 90 students in the first year class at Olin College of Engineering.
My favorite Allen memory was a modeling contest between him and Mark Somerville. (Physical modeling, of course.) The result was basically a draw, but their approaches were totally different: Allen's was beautifully simple as usual, Mark's brilliantly complex.
I don't know. But he is so deep and broad in his knowledge that he may have a memory that exceeds mere mortal memory. This is a book he is working on now http://greenteapress.com/thinkos/index.html
From the description:
> This book is intended for a different audience, and it has different goals. I developed it for a class at Olin College called Software Systems.
> Most students taking this class learned to program in Python, so one of the goals is to help them learn C. For that part of the class, I use Griffiths and Griffiths, Head First C, from O'Reilly Media. This book is meant to complement that one.
> Few of my students will ever write an operating system, but many of them will write low-level applications in C, and some of them will work on embedded systems. My class includes material from operating systems, networks, databases, and embedded systems, but it emphasizes the topics programmers need to know.
Teachers learn their students eventually, by constant regular exposure (which you could consider to be de facto exploiting the spacing effect), so it doesn't require a herculean memory. Spaced repetition software is just a neat trick to speed the process up.
>He memorized pictures and bios of all 90 students in the first year class at Olin College of Engineering.
It's impressive not so much that he did that, but that he bothered to try.
Most lecturers (myself included) will try very hard not to learn anything about their students because they consider actually dealing with undergrads (particularly first-years!) on an individual level to be beneath them.
At Olin everyone is an undergraduate. Olin is about reinventing engineering education. They consider faculty as very important to the process but they are guides not instructors. The students most of the time have to seek information and approaches out.
Sad to know that attitude pervades higher ed. Another reason students are well served choosing a teaching college for undergraduate study instead of a research university.
Thanks for posting this. The Jupyter notebooks (and the fact that GitHub has built-in support for them) really help illustrate the concepts.
The book I've used so far to study is "Probability and Statistics: The Science of Uncertainty", by Michael J. Evans and Jeffrey S. Rosenthal. This book is no longer in print and is available for free in PDF form.
“I broke this rule because I developed some of the code while I was a Visiting Scientist at Google, so I followed the Google style guide, which deviates from PEP 8 in a few places. Once I got used to Google style, I found that I liked it. And at this point, it would be too much trouble to change.”
Why would you write a book that targets the Python community and ignore PEP8 styling, inconveniencing an entire community, simply because it would be too much trouble for you to change?
“Also on the topic of style, I write “Bayes’s theorem” with an s after the apostrophe, which is preferred in some style guides and deprecated in others.”
It is deprecated in all modern style guides and should not be used. You’ll get dinged in college English and writing classes for using this outdated and redundant style.
I’m sure this book is great, but, as a point of constructive criticism, I would suggest the author do a better job at adhering to the styles of code and English expected by his target audience, rather than what is comfortable for him.
"Many projects have their own coding style guidelines. In the event of any conflicts, such project-specific guides take precedence for that project."
and
"A Foolish Consistency is the Hobgoblin of Little Minds".
And, throughout, PEP8 makes it clear that it is a set of recommendations, and that if a project or community already has an established style, it need not be changed.
You misunderstand. I’m not criticizing Google for not following PEP8. They’re welcome to make whatever modifications they want. I do the same. For example, I don’t like having two blank lines between methods and I don’t limit my line widths to 80 characters. This personal or project-level alteration is fine. However, when your target audience is the Python community at large, you are better off following the PEP8 standard, which everyone knows and is comfortable with, rather than a project-specific format, just because you personally find it more convenient. Standards are pretty important in the publishing industry, so I’m not sure why this is so controversial here today.
Why are you arguing against PEP8? As you mentioned in your final sentence, the Python community DOES have an established standard. It is called PEP8. The parent has made a valid point. Why would you criticize or trash his "karma" for stating it?
You say I'm arguing against PEP8 by quoting PEP8? That's hard to understand. Whose karma am I "trashing", and how? I certainly didn't downvote him, if that's what you mean.
Others have already made the point about PEP8 being a guideline, so I just wanted to also point out that not all style guides would agree with you on Bayes'/'s theorem. Case in point, the APA style guide: http://blog.apastyle.org/apastyle/2013/06/forming-possessive...
I'm not sure why you were down-voted. This is a valid point and as a college professor and author, I'm sure Downey would appreciate any feedback that would make his book better.
I can't downvote. The author points out his differences and provides his reasons why. The criticism doesn't add any value to the conversation because it has been addressed already by the author himself.
I disagree. In Python, the PEP8 standard is to use snake case for variables and function names. Classes should appear in this format: ClassName. Downey uses the class style for functions because, according to him, he feels it would be too inconvenient to do it the right way. This is a lazy cop-out. If you're writing a book targeting the Python community you should adhere to the Python PEP8 standards out of respect for your readers, if nothing else. The parent poster poses a valid question and it most definitely adds to the conversation because it calls into question the author's respect for his audience.
How is what Downey did any different from me writing a book and stating that correct spelling, grammar, and editing would be too inconvenient for me, so I'm just going to type whatever I feel like and that should be OK, because I addressed my lack of quality and attention to detail during my introduction?
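For anyone unsure what the naming disagreement looks like in practice, here is a small sketch. The names are invented for illustration and are not taken from the book's code; both functions do the same thing.

```python
from collections import Counter


def MakeHistogram(values):
    """CapWords function name: the style being criticised in this thread."""
    return dict(Counter(values))


def make_histogram(values):
    """snake_case function name: what PEP 8 recommends for functions."""
    return dict(Counter(values))


class Histogram:
    """CapWords is what PEP 8 recommends for class names."""

    def __init__(self, values):
        self.counts = make_histogram(values)


print(MakeHistogram("aab"))
print(make_histogram("aab"))
```

The behaviour is identical either way; the argument is only about which spelling readers expect to see.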
My introduction to Bayesian probability was accidentally reinventing it while trying to invent my own AI system. It followed naturally from constructing a network of information which could be queried to get back whatever had been fed into it and to perform deduction/induction.
A Bayesian pollster began with a certain set of prior probabilities. That the college educated were more likely to vote in previous elections, for example, informed the sample population, because it wouldn't make much sense to ask the opinions of those who would stay home.
Thus, based on priors that were updated with new empirical data, a new set of probabilities emerged, that gave a certain candidate a high probability of victory.
Members of the voting public, aware of this high probability, decided that this meant with certainty that this candidate would win and therefore decided to stay home on election day.
In reality the Bayesian models were incorrect because, amongst other factors, a much higher number of non-college-educated individuals decided to vote, and to vote for the other candidate.
As it is with Bayesian intelligence, shared as much by pollsters as machine learning algorithms:
Real-time heads up display
Keeps the danger away
But only for the things that already ruined your day.
There should also be one called "Think Markov Chain Monte Carlo": even the simplest reference is intractable, and others begin very simplistically and end up incomprehensible enough to put one off the subject altogether.