Given that all parameters are trained jointly at inference time and a single sample of z is supposed to encode ALL inputs and outputs for a given puzzle (I think), I don't quite understand the role of the latent z here. Feels like μ and Σ could be absorbed into θ (and an additional variance parameter).

Although if they partition z such that each section corresponds to one input and run f_θ on each section in turn, then I guess it makes sense.
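
Something like this is how I'm picturing the partitioned-z reading (a toy sketch; the names, sizes, and tiny architecture are all made up, not from the paper):

    # Toy sketch of the partitioned-z reading; nothing here is taken from the paper.
    import torch

    n_pairs, z_dim, grid_dim = 4, 64, 900        # e.g. four demo pairs, 30x30 grids flattened

    mu = torch.zeros(n_pairs, z_dim, requires_grad=True)         # one latent mean per pair
    log_sigma = torch.zeros(n_pairs, z_dim, requires_grad=True)  # one latent log-std per pair
    f_theta = torch.nn.Sequential(                               # the shared "algorithm"
        torch.nn.Linear(z_dim, 256), torch.nn.ReLU(), torch.nn.Linear(256, grid_dim)
    )

    z = mu + log_sigma.exp() * torch.randn_like(mu)   # reparameterized sample, one row per pair
    outputs = [f_theta(z_i) for z_i in z]             # same theta applied to each section of z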




I agree, z (and its μ and Σ) could be absorbed into θ, e.g. you always input `[1 0 0 ... 0]`, and the first layer of the neural network would essentially output z. They would have to stop approximating KL(q(θ)||p(θ)) with a quadratic penalty on θ though, so maybe the current setup is more computationally efficient?
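
For concreteness, a toy version of that one-hot trick (sizes invented): with a constant input, the first layer's weight column is effectively z, and it just gets trained along with the rest of θ:

    import torch

    z_dim, grid_dim = 64, 900

    one_hot = torch.zeros(1, z_dim)
    one_hot[0, 0] = 1.0                              # the fixed `[1 0 0 ... 0]` input

    net = torch.nn.Sequential(
        torch.nn.Linear(z_dim, z_dim),               # first layer "outputs z": weight[:, 0] + bias
        torch.nn.Linear(z_dim, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, grid_dim),
    )
    out = net(one_hot)                               # no explicit q(z) anywhere; it's all theta now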


Could be. Also, as you imply, they'd have to loosen the regularization penalty on θ, and it may be difficult to loosen it without making θ too prone to overfitting.

Maybe their current setup of keeping θ "dumb" encourages the neural network to take on the role of the "algorithm", leaving the higher-variance, puzzle-specific input to be encoded by z, though this separation seems fuzzy to me.
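
Rough sketch of how I'd picture that tradeoff (my own guess at the shape of the objective, not the paper's actual loss; all names and sizes made up): z pays an exact KL cost while θ only pays a loose quadratic penalty, so per-puzzle detail is cheaper to store in z and θ gets nudged toward being the reusable "algorithm":

    import torch

    z_dim, out_dim, lam = 64, 900, 1e-4
    mu = torch.zeros(z_dim, requires_grad=True)
    log_sigma = torch.zeros(z_dim, requires_grad=True)
    f_theta = torch.nn.Sequential(torch.nn.Linear(z_dim, 256), torch.nn.ReLU(),
                                  torch.nn.Linear(256, out_dim))
    target = torch.zeros(out_dim)                    # stand-in for a puzzle's output grid

    z = mu + log_sigma.exp() * torch.randn(z_dim)    # reparameterized sample
    recon = ((f_theta(z) - target) ** 2).sum()                              # fit the puzzle
    kl_z = 0.5 * (mu**2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum()  # exact KL(q(z) || N(0, I))
    theta_cost = lam * sum((p**2).sum() for p in f_theta.parameters())      # quadratic stand-in for KL(q(theta) || p(theta))
    loss = recon + kl_z + theta_cost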



