As for practical use cases, one is finding an approximate optimum of a function you can't attack directly:
- You want to find the min/max of some probability distribution P(x)
- P(x) is too complicated to minimize in closed form, but you can draw samples from it.
- So instead, you carefully construct some OTHER probability distribution Q(x|θ), parameterized by θ, that you claim is structurally similar "enough" to P(x).
- Now you find the θ that minimizes the KL divergence KL(P(x) || Q(x|θ)), which hands you the parameters θ that make Q(x|θ) [approximately] "most" similar to P(x), without your ever having had to minimize P(x) itself.
It was a trick that came up a lot when AI consisted of giant Bayesian plate models for each specific task that you had to hand-optimize.
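To make that concrete, here's a minimal sketch in Python (NumPy/SciPy) of the forward-KL version of the recipe. Since KL(P || Q) = E_P[log P(x)] − E_P[log Q(x|θ)], and the first term doesn't depend on θ, minimizing the KL over θ is the same as maximizing the average log-likelihood of samples from P under Q. The bimodal target P and the single-Gaussian Q below are illustrative choices of mine, not part of the original setup:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Stand-in for the "complicated" P(x): a bimodal mixture we can sample
# from, but pretend we can't optimize in closed form. (Illustrative choice.)
rng = np.random.default_rng(0)
samples = np.concatenate([
    rng.normal(-2.0, 0.5, 5000),
    rng.normal(1.5, 1.0, 5000),
])

# Q(x|θ): a single Gaussian, θ = (mu, log_sigma).
# Minimizing KL(P || Q) over θ reduces to maximizing E_P[log Q(x|θ)],
# which we estimate with the sample average, i.e. plain maximum likelihood.
def neg_avg_log_q(theta):
    mu, log_sigma = theta
    return -np.mean(norm.logpdf(samples, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_avg_log_q, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"fitted Q: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
# You then work with Q (e.g. take its mode, mu_hat) instead of P itself.
```

Here Q is deliberately too simple to capture P exactly; the point is only that you never had to touch P beyond drawing samples from it.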