Histograms for Probability Density Estimation: A Primer

bagrow · 2024-04-11T17:53:04 1712857984

The best way to compute the empirical CDF (ECDF) is by sorting the data:

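    import numpy as np
    import matplotlib.pyplot as plt

    N = len(data)
    X = sorted(data)             # sample values in ascending order
    Y = np.arange(1, N + 1) / N  # ECDF at X[i] is (i + 1)/N, so the curve reaches 1
    plt.plot(X, Y)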

Technically, you should plot this with `plt.step`, since the ECDF is a step function that jumps at each data point.
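
For evaluating the ECDF at arbitrary points rather than plotting it, here is a minimal standard-library sketch of the same sort-first idea. The helper name `ecdf` is made up for illustration, and it assumes the usual definition F(x) = (number of points <= x) / N.

```python
from bisect import bisect_right

def ecdf(data):
    """Return a function F with F(x) = (number of points <= x) / N."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # bisect_right returns how many sorted values are <= x
        return bisect_right(xs, x) / n

    return F

data = [3.1, 0.5, 2.2, 2.2, 4.0]
F = ecdf(data)
print(F(0.0), F(2.2), F(4.0))
```

Sorting once costs O(N log N), after which each lookup is a binary search.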

andrewla · 2024-04-11T20:31:51 1712867511

scipy even has a built-in method (scipy.stats.ecdf) for doing exactly this.

vvanirudh · 2024-04-12T01:06:28 1712883988

Neat! That is so simple and in hindsight, makes a lot of sense. Thanks!

sobriquet9 · 2024-04-11T19:40:40 1712864440

Why estimate the PDF through a histogram and then convert to a CDF, when one can estimate the CDF directly? Doing so also avoids having to choose a bin width, which can have a substantial impact on the result.
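
To make the bin-width point concrete, here is a rough sketch that evaluates a histogram density estimate at one point for several bin widths. The function `hist_density`, the synthetic normal sample, the evaluation point 0.33, and the three widths are all assumptions chosen for illustration.

```python
import math
import random

def hist_density(data, x, width, origin=0.0):
    """Histogram density at x, using bins [origin + k*width, origin + (k+1)*width)."""
    k = math.floor((x - origin) / width)  # index of the bin containing x
    lo = origin + k * width
    count = sum(lo <= v < lo + width for v in data)
    return count / (len(data) * width)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]

# Same data, same evaluation point, three different bin widths
estimates = {w: hist_density(data, 0.33, w) for w in (0.05, 0.5, 2.0)}
for w, d in estimates.items():
    print(f"bin width {w}: density estimate at x=0.33 is {d:.3f}")
```

The three printed values will typically disagree, even though nothing about the data changed. The ECDF has no comparable knob.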

andrewla · 2024-04-11T20:29:24 1712867364

Agreed -- very odd to use a parameter (bin width) in a nonparametric estimation. Just use the raw data. In numerical analysis, broadly speaking, integrals are stable while derivatives are wild; an empirical cdf is a nice, stable integral of the messy pdf.
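
A small experiment can illustrate the "integrals are stable, derivatives are wild" remark. This is only a sketch under assumed settings: a Uniform(0, 1) sample (so the true CDF is F(x) = x and the true pdf is 1), N = 1000, and a central difference with step h = 0.001.

```python
import random
from bisect import bisect_right

random.seed(1)
N = 1000
xs = sorted(random.random() for _ in range(N))  # Uniform(0, 1) sample

def F_hat(x):
    """Empirical CDF: fraction of sample points <= x."""
    return bisect_right(xs, x) / N

grid = [i / 100 for i in range(1, 100)]
h = 0.001  # small step for the central difference

# Error of the ECDF itself against the true CDF F(x) = x
cdf_err = max(abs(F_hat(x) - x) for x in grid)
# Error of the differentiated ECDF against the true pdf, which is 1 everywhere
pdf_err = max(abs((F_hat(x + h) - F_hat(x - h)) / (2 * h) - 1.0) for x in grid)

print(f"max CDF error: {cdf_err:.3f}")
print(f"max finite-difference pdf error: {pdf_err:.3f}")
```

Differentiating the ECDF with step h amounts to a histogram with bins of width 2h, which is why the bin-width choice reappears as soon as you want a pdf.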

abhgh · 2024-04-12T03:06:25 1712891185

"nonparametric" is somewhat of a (confusing!) misnomer in that it doesn't mean no parameters, but lots of them where the # of parameters grows with # of instances [1]. In all of these cases the models have some general parameter(s) as well.

Some simple examples would be the bin-width and bandwidth in the histogram and the kernel density estimator. A somewhat complex example would be Dirichlet Process-based Mixture Models [2]; this has a "concentration" parameter. The terminology is used outside of density estimation too, e.g., Support Vector Machines (SVM) and k-Nearest Neighbors are considered nonparametric [3].

[1] For example, see https://stats.stackexchange.com/a/268646, or https://youtu.be/I7bgrZjoRhM?si=VOEENs773SXlEMxm&t=300

[2] https://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf

[3] https://stats.stackexchange.com/a/237704

vvanirudh · 2024-04-12T01:08:35 1712884115

If sampling from the density is the only goal, then you are absolutely right. You can directly estimate the empirical CDF, as you pointed out. But histograms can still be useful to approximate the PDF itself? (Taking the derivative of the empirical CDF to estimate the PDF is wild, as you said.)
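
For the sampling use case, inverting the ECDF needs no density estimate at all. Below is a minimal sketch; `sample_from_ecdf` is a hypothetical helper name, and it assumes plain inverse-transform sampling on the step function, which is equivalent to resampling the data with replacement.

```python
import random

def sample_from_ecdf(sorted_data, rng, size):
    """Inverse-transform sampling: map u ~ Uniform[0, 1) to the ECDF step covering it."""
    n = len(sorted_data)
    draws = []
    for _ in range(size):
        u = rng.random()
        k = min(int(u * n), n - 1)  # each sorted point carries probability mass 1/n
        draws.append(sorted_data[k])
    return draws

rng = random.Random(0)
data = sorted([2.5, 0.1, 1.7, 3.3, 0.9])
draws = sample_from_ecdf(data, rng, 1000)
print(sorted(set(draws)))
```

Every draw is one of the observed values. If values between observations are wanted, one option is to interpolate linearly between adjacent sorted points, which is a modeling choice of its own.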

andrewla · 2024-04-12T16:39:07 1712939947

Yes, if you need an approximate pdf. I've found that when I'm working non-parametrically (or in robust statistics) I like to stay in cdf space or use quantile functions more than trying to use those nasty derivatives.
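
As a sketch of what staying in quantile space can look like, the snippet below reads the median and interquartile range straight off the sorted data. The `quantile` helper and the toy data (with one deliberate outlier) are assumptions for illustration; the convention used is the smallest x whose ECDF is at least p.

```python
import math

def quantile(sorted_data, p):
    """Empirical quantile: smallest x with ECDF(x) >= p, for 0 < p <= 1."""
    n = len(sorted_data)
    k = max(math.ceil(p * n) - 1, 0)
    return sorted_data[k]

data = sorted([12.0, 9.5, 10.1, 250.0, 11.3, 10.8, 9.9])  # 250.0 is an outlier

median = quantile(data, 0.5)
iqr = quantile(data, 0.75) - quantile(data, 0.25)
mean = sum(data) / len(data)

print(f"median={median}, IQR={iqr:.2f}, mean={mean:.2f}")
```

The median and IQR barely notice the outlier while the mean is pulled far from the bulk of the data, and no bin width or derivative was involved.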

Bostonian · 2024-04-08T22:10:23 1712614223

If the data is continuous, use kernel density estimation (KDE) instead of histograms to visualize the probability density, since KDE will give a smoother fit. A similar idea is to fit a mixture of normals -- there are numerous R packages for this and sklearn.mixture.GaussianMixture in scikit-learn.
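
For readers who have not seen KDE written out, here is a bare-bones sketch with a Gaussian kernel. The function name `kde_gaussian`, the toy data, and the bandwidth of 0.5 are illustrative assumptions, and this is not how any particular library implements it.

```python
import math

def kde_gaussian(data, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

data = [1.0, 1.3, 2.1, 2.2, 2.4, 4.0]
f = kde_gaussian(data, bandwidth=0.5)

# Riemann-sum check over [-5, 10) that the estimate integrates to about 1
step = 0.01
area = sum(f(-5.0 + i * step) for i in range(1500)) * step
print(f"f(2.2)={f(2.2):.3f}, f(6.0)={f(6.0):.3f}, area={area:.4f}")
```

The bandwidth plays the same role as the histogram's bin width, so the choice of a smoothing parameter does not go away. Note also that every evaluation loops over the full data set.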

vvanirudh · 2024-04-08T22:26:41 1712615201

Yep! The next post would be on Kernel density estimation -- wanted to start from histograms as they are still a useful tool in 1-D and 2-D density estimation, and you don't have to store the data either (unlike KDE)
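
The "don't have to store the data" point can be sketched as a histogram that is updated one observation at a time and keeps only bin counts. The class name `StreamingHistogram` and its fixed-width, fixed-origin binning are assumptions made for this example.

```python
import math
from collections import Counter

class StreamingHistogram:
    """Fixed-width histogram that keeps per-bin counts, not the raw observations."""

    def __init__(self, width, origin=0.0):
        self.width = width
        self.origin = origin
        self.counts = Counter()  # bin index -> number of observations
        self.n = 0

    def _bin(self, x):
        return math.floor((x - self.origin) / self.width)

    def add(self, x):
        self.counts[self._bin(x)] += 1
        self.n += 1

    def density(self, x):
        # Counter returns 0 for bins that were never hit
        if self.n == 0:
            return 0.0
        return self.counts[self._bin(x)] / (self.n * self.width)

hist = StreamingHistogram(width=0.5)
for x in [0.1, 0.2, 0.6, 0.7, 0.8, 1.4]:
    hist.add(x)

print(hist.density(0.75), len(hist.counts))
```

Memory grows with the number of occupied bins rather than the number of observations, which is the contrast with KDE drawn above.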

Bostonian · 2024-04-08T22:51:06 1712616666

I should have read to the end of your post:

'I will describe a very popular nonparametric method, Kernel Density Estimation, that also follows strategy 1 and is much more scalable to higher dimensions than histograms.'

vvanirudh · 2024-04-08T23:59:06 1712620746

Haha no worries!