
Not one mention of the EM algorithm, which, as far as I can understand, is what's being described here (https://en.m.wikipedia.org/wiki/Expectation%E2%80%93maximiza...). It has so many applications, among which is estimating the number of clusters for a Gaussian mixture model.

An ELI5 intro: https://abidlabs.github.io/EM-Algorithm/




It does not appear to be what's being described here? Could you perhaps expand on the equivalence between the two if it is?


EM can be used to impute data, but that would be single imputation. Multiple imputation as described here would not use EM since the goal is to get samples from a distribution of possible values for the missing data.
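
To make the distinction concrete, here's a rough sketch of multiple imputation. I'm using scikit-learn's IterativeImputer with sample_posterior=True purely as an illustration (the article may use something else entirely); the point is you end up with several plausible completed datasets instead of one "best guess".

    # Rough sketch of single vs. multiple imputation (illustrative only).
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of the values

    # Single imputation: one "best" completed dataset, uncertainty discarded.
    single = IterativeImputer(random_state=0).fit_transform(X)

    # Multiple imputation: sample_posterior=True draws each imputed value
    # from a predictive distribution, so different seeds give different
    # plausible completions.
    m = 5
    imputations = [
        IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
        for i in range(m)
    ]
    # Downstream analysis is run on each completed dataset and the results
    # are pooled (e.g. with Rubin's rules).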


In other words, EM makes more sense. All this imputation stuff seems to me more like an effort to keep using obsolete modeling techniques.


Absolutely not.

EM imputation (or single imputation in general) fails to account for the uncertainty in imputed data. You end up with artificially inflated confidence in your results (p-values too small, confidence/credible intervals too narrow, etc.).

Multiple imputation is much better.
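
For anyone unfamiliar with where the uncertainty comes back in: you run the same analysis on each completed dataset and pool with Rubin's rules, where the between-imputation variance inflates the total variance. Quick numpy sketch with made-up numbers:

    # Rubin's rules for pooling estimates from m imputed datasets.
    # `estimates` and `variances` would come from running the same analysis
    # (e.g. a regression coefficient and its squared standard error) on each
    # completed dataset; the numbers below are placeholders.
    import numpy as np

    estimates = np.array([1.02, 0.95, 1.10, 0.99, 1.05])  # point estimate per imputation
    variances = np.array([0.04, 0.05, 0.04, 0.06, 0.05])  # within-imputation variances

    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    w_bar = variances.mean()             # average within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b  # Rubin's total variance

    print(q_bar, np.sqrt(total_var))     # pooled estimate and its standard error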


> It has so many applications, among which is estimating number of clusters for a Gaussian mixture model

Any sources for that? As far as I remember, EM is used to calculate the actual cluster parameters (means, covariances, etc.), but I'm not aware of any use of it for estimating which number of clusters works best.

Source: I've implemented EM for GMMs for a college assignment once, but I'm a bit hazy on the details.
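
Roughly, from what I remember, the E and M steps for a fixed K look like this (a from-memory sketch: full covariances, fixed iteration count, no numerical safeguards, so don't hold me to it):

    # Bare-bones EM for a Gaussian mixture with K fixed in advance.
    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        weights = np.full(K, 1.0 / K)
        means = X[rng.choice(n, K, replace=False)]
        covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] = P(component k | x_i)
            dens = np.column_stack([
                w * multivariate_normal.pdf(X, mean=mu, cov=cov)
                for w, mu, cov in zip(weights, means, covs)
            ])
            r = dens / dens.sum(axis=1, keepdims=True)

            # M-step: re-estimate weights, means, covariances
            nk = r.sum(axis=0)
            weights = nk / n
            means = (r.T @ X) / nk[:, None]
            for k in range(K):
                diff = X - means[k]
                covs[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)

        return weights, means, covs

Note that K is an input: EM just fits the parameters for whatever number of components you hand it.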


You are right, you still need to specify the number of clusters.


I've been out of the loop on stats for a while, but is there a viable approach for estimating ex ante the number of clusters when creating a GMM? I can think of constructing ex post metrics, e.g. using a grid and goodness-of-fit measurements, but those feel more like brute-forcing it. Something like the sketch below is what I mean by the ex post version.
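
    # Ex post selection of K: fit a GMM for each candidate K and keep the
    # one with the lowest BIC (scikit-learn sketch; the toy three-blob data
    # is just a stand-in).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=0, scale=1, size=(100, 2)),
        rng.normal(loc=5, scale=1, size=(100, 2)),
        rng.normal(loc=(0, 5), scale=1, size=(100, 2)),
    ])

    candidates = range(1, 8)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in candidates]
    best_k = candidates[int(np.argmin(bics))]
    print(best_k)  # hopefully 3 for this toy data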


Is the question fundamentally: what's the relative likelihood of each number of clusters?

If so then estimating the marginal likelihood of each one and comparing them seems pretty reasonable?

(I mean in the sense of Jaynes chapter 20.)


Unsupervised learning is hard, and the pick-K problem is probably the hardest part.

For PCA or factor analysis there are lots of ways, but without some means of determining ground truth it's difficult to know if you've done a good job.


There are Bayesian nonparametric methods that do this by putting a Dirichlet process prior on the parameters of the mixture components. Both the prior specification and the computation (MCMC) are tricky, though.
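
scikit-learn ships a (variational, not MCMC) approximation of the truncated Dirichlet process mixture if you want to see the behavior without writing a sampler; surplus components just collapse to near-zero weight:

    # Truncated Dirichlet process GMM via variational inference (sklearn's
    # approximation; the full treatment described above would use MCMC).
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])

    dpgmm = BayesianGaussianMixture(
        n_components=10,  # upper bound on K, not the answer
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    ).fit(X)

    # Components the data doesn't need end up with negligible weight.
    print(np.round(dpgmm.weights_, 3))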



