So the VAE finds a lower bound of the log likelihood log p(x) using Jensen’s inequality, which also appears in the derivation of the EM algorithm.
Intuitively, the first part of the ELBO maximizes the expected log likelihood of the data given the latent variable. This term pushes the generated image to be more strongly correlated with the latent variable, which makes the model more deterministic.
The second part of the ELBO minimizes the KL divergence between the approximate posterior and the prior. We usually assume the prior is a standard Gaussian distribution (why?), so minimizing the KL makes the posterior more similar to that prior. In other words, we push the posterior toward a smooth Gaussian that spreads evenly through the latent space, which gives the model more randomness.
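So the two parts pull in opposite directions, one toward determinism and one toward randomness, and in that loose sense the VAE objective also seems to include a kind of adversarial training.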
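For reference, the bound and its two parts can be written out as below. This is the standard Jensen's inequality derivation, restated with q(z|x) as the approximate posterior and p(z) as the prior; the notation is assumed to match the rest of this post.

```latex
\begin{aligned}
\log p(x) &= \log \mathbb{E}_{q(z|x)}\left[\frac{p(x, z)}{q(z|x)}\right] \\
&\ge \mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right]
  \quad \text{(Jensen's inequality)} \\
&= \underbrace{\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right]}_{\text{first part: likelihood}}
 - \underbrace{\mathrm{KL}\left(q(z|x)\,\|\,p(z)\right)}_{\text{second part: KL}}
\end{aligned}
```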
Next, I’ll use an example to illustrate this formula.
- Estimate posterior: For each input image, the green encoder estimates a mean and a variance, which define the estimated posterior q(z|x) (usually modeled by a multivariate Gaussian distribution).
- KL: After getting q(z|x), we can write down the second part of the ELBO. We usually assume the prior p(z) is a standard multivariate normal distribution. On the one hand, a Gaussian is a natural choice because of the Central Limit Theorem (the suitably normalized average of many independent samples from a distribution with finite variance is approximately Gaussian). On the other hand, the KL divergence between two Gaussians has a closed-form analytic solution, which eases the computation. (There are follow-up works that use other priors as well.)
- Reparameterization: The model samples z from q(z|x), writing the sample as the mean plus the standard deviation times standard Gaussian noise so that gradients can flow back into the encoder. The sample is then forwarded through the decoder to produce the parameters of the likelihood p(x|z). In MNIST, because the input image only contains 0/1 values, p(x|z) can be assumed to be the product of 28*28=784 independent Bernoulli distributions. In other cases, e.g. RGB images, we could use a multivariate Gaussian distribution as p(x|z).
- Likelihood: We need to compute the likelihood for each sample in the batch. Ideally, we would need an infinite number of samples from q(z|x) to estimate the expected log likelihood. However, the authors state below Equation (8) of their original paper:
In our experiments we found that the number of samples L per datapoint can be set to 1 as long as the minibatch size M was large enough, e.g. M = 100.
So for one image, only one sample z is used to estimate the likelihood (which is quite amazing). This means the first part of the ELBO is approximated by just ln p(x|z). (There are follow-up works that use more accurate estimates of the likelihood.)
If we assume p(x|z) is Bernoulli, its parameter for each pixel is estimated by the reconstructed pixel value $\hat{x}_i$ produced by the decoder.
This means that if the real pixel value $x_i$ is 1, the likelihood for that pixel is $\hat{x}_i$.
Otherwise, if the real pixel value is 0, the likelihood for that pixel is $1 - \hat{x}_i$.
Combining the two cases and taking the logarithm, we have

$$\ln p(x_i|z) = x_i \ln \hat{x}_i + (1 - x_i) \ln(1 - \hat{x}_i),$$

which is the negative of the cross entropy between the real pixel and the reconstructed pixel. We then sum over all pixels and average over all samples in one batch.
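The per-pixel argument can be checked with a few lines of plain Python. This is only a minimal sketch: the names `pixel_likelihood` and `log_likelihood`, the clipping constant `eps`, and the example pixel values are all made up for illustration. Here `x` holds the real 0/1 pixels and `x_hat` holds the reconstructed values used as Bernoulli parameters.

```python
import math

def pixel_likelihood(x_i, x_hat_i):
    """Bernoulli likelihood of one pixel: x_hat_i if the pixel is 1, else 1 - x_hat_i."""
    return x_hat_i if x_i == 1 else 1.0 - x_hat_i

def log_likelihood(x, x_hat, eps=1e-7):
    """Sum over pixels of x*ln(x_hat) + (1 - x)*ln(1 - x_hat), i.e. ln p(x|z).

    eps is an assumption of this sketch: it clips x_hat away from 0 and 1
    so that log(0) is never evaluated.
    """
    total = 0.0
    for x_i, x_hat_i in zip(x, x_hat):
        x_hat_i = min(max(x_hat_i, eps), 1.0 - eps)
        total += x_i * math.log(x_hat_i) + (1 - x_i) * math.log(1.0 - x_hat_i)
    return total

# Illustrative values only: four "pixels" and their reconstructions.
x = [1, 0, 1, 1]
x_hat = [0.9, 0.2, 0.6, 0.99]

log_lik = log_likelihood(x, x_hat)
# The combined formula agrees with taking the log of each case separately.
case_by_case = sum(math.log(pixel_likelihood(a, b)) for a, b in zip(x, x_hat))
cross_entropy = -log_lik  # the quantity usually minimized as the reconstruction loss
```

In a real model the same sum would be taken over all 784 pixels of each image and then averaged over the batch.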
Here is a very good code example using TensorFlow.
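Separately from that TensorFlow code, the four steps above (estimate posterior, KL, reparameterization, likelihood) can be sketched in framework-free Python. This is an illustration under stated assumptions, not the original implementation: `toy_decoder` is a hypothetical stand-in for a trained decoder network, and `mu` and `log_var` are hand-picked values standing in for encoder outputs. The posterior is assumed to be a diagonal Gaussian parameterized by its mean and log variance.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * noise, with noise ~ N(0, 1) per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def toy_decoder(z, n_pixels):
    """Hypothetical decoder: maps z to one Bernoulli parameter in (0, 1) per pixel."""
    s = sum(z)
    return [1.0 / (1.0 + math.exp(-(s + 0.1 * i))) for i in range(n_pixels)]

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """ln p(x|z) for independent Bernoulli pixels (eps clipping is an assumption)."""
    total = 0.0
    for x_i, p in zip(x, x_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += x_i * math.log(p) + (1 - x_i) * math.log(1.0 - p)
    return total

def single_sample_elbo(x, mu, log_var, rng):
    """ELBO estimate using a single z per image: ln p(x|z) - KL(q(z|x) || p(z))."""
    z = reparameterize(mu, log_var, rng)
    x_hat = toy_decoder(z, len(x))
    return bernoulli_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)

rng = random.Random(0)               # fixed seed so the sketch is repeatable
x = [1, 0, 1, 1]                     # illustrative binary "image"
mu, log_var = [0.3, -0.2], [-1.0, -0.5]  # stand-ins for encoder outputs
elbo = single_sample_elbo(x, mu, log_var, rng)
```

Training would maximize this estimate, averaged over a minibatch, with respect to the encoder and decoder parameters; the log-variance parameterization is a common convenience that keeps the variance positive.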