Think Stats: Probability and Statistics for Programmers (greenteapress.com)
213 points by samrat on Aug 28, 2011 | 25 comments



My favorite references on statistics get to an idea that I think you will be inclined to emphasize: DATA is what matters in statistics, more than mathematical manipulation. See what you think about these, and best wishes for further revisions of your interesting online guide.

http://statland.org/MAAFIXED.PDF

The first link is a BEAUTIFUL and thought-provoking discussion of what's dangerous about having a new mathematics professor choose the statistics textbooks for an introductory college class in statistics, with advice on what to look for in statistics textbooks.

http://escholarship.org/uc/item/6hb3k0nz

The second link is by a very famous statistician, with discussion of how the statistics curriculum could be revised to better emphasize the most important ideas.

Put the key ideas from these two resources in your own words, and you will have a good guide for programmers on how to think about statistics.


Very interesting. I certainly agree with the idea from the first article, on the importance of data, and from the second article, on the usefulness of simulation.


I read the first article more carefully, and it inspired this new blog post: http://allendowney.blogspot.com/2011/08/jimmy-nut-company-pr...


Your second link was an eye-opener. Can you recommend any textbooks or an online class (or its remains) that teaches Cobb's recommended curriculum?


Cobb, in some writing of his, recommends the textbook Statistics: Learning in the Presence of Variation,

http://www.amazon.com/Statistics-Learning-Presence-Variation...

which I bought just more than three years ago after I read that. The textbook is quite thought-provoking, not just another Tweedledum to the usual Tweedledee of undergraduate statistics textbooks.


Thanks, I'll check it out. It looks like he has authored a couple of stats books, too, that might be worthwhile.


If you've gone through this book and want to learn more ways to use statistics and probability, you should look at the videos from mathematicalmonk [1] on YouTube. He's like Khan for machine learning and more rigorous probability. The videos help a lot with learning machine learning and hidden Markov models, at least in my experience.

[1] http://www.youtube.com/user/mathematicalmonk


Awesome .. just what I was looking for!


Who's going to try reading some of this before doing the ML or AI Stanford courses?


Very much so. I'm still trying to find out exactly what maths will be required, so I'm looking at material from related courses: http://see.stanford.edu/see/courses.aspx


Hah, funny, same here: trying to beef up on stats and linear algebra. Quite excited about the ML class.


That's partly why I bought it. My university didn't have a statistics department, so I never had the chance to take a statistics course that was worthwhile.

Statistics and linear algebra really should be required by all CS programs. It's funny that at many schools those courses are not, yet Calculus is. First, Calculus should have been handled in HS. Second, I've never had a use for Calculus professionally or for anything I've worked on in my free time.


Calculus is a major prerequisite for both a reasonably serious statistics course and a reasonably serious AI/ML course. Furthermore, at least for ML, it is multivariable calculus built on linear algebra, and I doubt you could learn that in high school at the level needed.


And on that note, does anyone have any other good book suggestions regarding ML?


Off the top of my head,

- Machine Learning by Tom M Mitchell http://www.cs.cmu.edu/~tom/mlbook.html

For general reading and introductions I also like:

- Pattern Classification by Richard Duda

- Pattern Recognition and Machine Learning by Christopher Bishop

For a bit more emphasis on statistics and math, I usually dive into

- Classification, Parameter Estimation and State Estimation by van der Heijden

And last, but certainly not least:

- Information Theory, Inference, and Learning Algorithms by David MacKay, available here:

http://www.inference.phy.cam.ac.uk/mackay/itila/


Excellent, many thanks.

I've read O'Reilly's Collective Intelligence. It's a great introductory survey, but it was very light on theory.

I also own Collective Intelligence in Action. It had more explanation of theory than O'Reilly's offering, but most of the chapters devolved into how to use Java data mining framework X.


No mention of the Z-test at all [1]? I guess that, at least for hypothesis testing, Wikipedia did a more comprehensive job [2].

[1] http://en.wikipedia.org/wiki/Z-test

[2] http://en.wikipedia.org/wiki/Statistical_hypothesis_testing


Rather than get into a catalog of tests, I take the approach that "There is only one test." I wrote more about it here: http://allendowney.blogspot.com/2011/05/there-is-only-one-te...
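
One way to picture the "only one test" idea is a simulation: pick a test statistic, simulate data under the null hypothesis, and count how often the simulated statistic is at least as extreme as the observed one. The following is a minimal sketch under assumptions that are not taken from the linked post: the statistic is the absolute difference in group means, the null model is a random relabeling (permutation) of the pooled data, and the names `diff_in_means` and `permutation_p_value` are hypothetical.

```python
import random


def diff_in_means(a, b):
    """Test statistic (an assumption): absolute difference in sample means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(group_a, group_b, iters=10000, seed=0):
    """Estimate a p-value by simulating the null hypothesis.

    Null model (an assumption): group labels are arbitrary, so shuffling
    the pooled data and re-splitting it produces data "under the null".
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = diff_in_means(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n = len(group_a)

    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        # Count simulated statistics at least as extreme as the observed one.
        if diff_in_means(pooled[:n], pooled[n:]) >= observed:
            count += 1
    return count / iters


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [101, 102, 103, 104, 105]
    print("estimated p-value:", permutation_p_value(a, b))
```

Swapping in a different statistic or a different null model changes only `diff_in_means` or the shuffling step; the rest of the procedure stays the same.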


Furthermore, another reason the entire classic test-based approach is bad is that it encourages binary hypotheses when many real-world problems don't have them; the hypotheses may be many, composite, or even infinite. (Of course, many real-world problems can be reduced to binary ones, which is one reason the approach became popular.)

If you know enough probability theory, statistics is just a special case. The nice thing about using probability theory is if you do decide to use a 'test', all of your assumptions are put forth first. As E.T. Jaynes says:

    In estimating a location parameter, for example, the sample median M is often cited as
    a more robust estimator than the sample mean. But here it is obvious that this
    ‘robustness’ is bought at the price of insensitivity to much of the relevant
    information in the data. Many different data sets all have the same
    median; the values above or below the sample median may be moved about arbitrarily
    without affecting the estimate. Yet those data values surely contain information
    highly relevant to the question being asked, and all this is lost. We would have
    thought that the whole purpose of data analysis is to extract all the information
    we can from the data.
    
    Thus, while we agree that robust/resistant properties may be desirable in some cases,
    we think it important to emphasize their cost in performance. In the literature,
    ad hoc procedures have been advocated on no more grounds than that they are ‘robust’
    or ‘resistant’, with no mention of the quality of the inference they deliver, much
    less any comparison of performance with alternative methods; yet alternative methods
    such as Bayesian ones are criticized on grounds of lack of robustness,
    without any supporting factual evidence.

A recent probability book I've started that I think is pretty good is http://uncertainty.stat.cmu.edu/
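
Jaynes's point in the quote above, that values above or below the sample median can be moved about without affecting the estimate, is easy to check numerically. This is a small sketch with made-up data; the helper name `summarize` is hypothetical.

```python
from statistics import mean, median


def summarize(data):
    """Return (mean, median) for a sample."""
    return mean(data), median(data)


# Two illustrative data sets (made up): the values above the middle
# element are moved far to the right in the second one.
original = [1.0, 2.0, 3.0, 4.0, 5.0]
moved = [1.0, 2.0, 3.0, 40.0, 500.0]

if __name__ == "__main__":
    for name, data in (("original", original), ("moved", moved)):
        m, med = summarize(data)
        print(f"{name}: mean={m}, median={med}")
```

The median is the same for both samples while the mean shifts, which is the insensitivity the quote describes: robust to outliers, but blind to where the other values actually sit.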


Great book. But I made the mistake of buying this book on Kindle just a few hours before I saw it show up here.

:(


Sorry! It's really not meant to be a trick. The book is under a CC license. I provide the PDF version at Green Tea Press, the LaTeX source for anyone who wants to make a modified version, and a not very good HTML version (some of the math is broken). O'Reilly provides the printed and Kindle versions.

Right now all versions have the same content, but I will continue to revise, so you can think of the version on Green Tea Press as the draft of the second edition.

I know it can be confusing, but I hope the benefits of the free license make up for it.


Thanks for the clarification, Allen. It's a great tutorial, I was glad to pay for it, and I will be recommending it to others!


Green Tea Press are good guys, I would never feel sorry for paying them; they have done far more for humanity than my $30 ever will: http://greenteapress.com/


I'm a little confused about the various editions. Just a few weeks ago I bought the "Think Stats" ebook published by O'Reilly (same author). Is it all the same content?


The Kindle-native version should be much easier to read than trying to view HTML/PDF on the Kindle, and you've supported the author.



