Auto-encoders: reconstruction versus compression
We discuss the similarities and differences between training an auto-encoder to minimize the reconstruction error, and training the same auto-encoder to compress the data via a generative model. Minimizing a codelength for the data using an auto-encoder is equivalent to minimizing the reconstruction error plus some correcting terms which have an interpretation as either a denoising or contractive property of the decoding function. These terms are related but not identical to those used in denoising or contractive auto-encoders [Vincent et al. 2010, Rifai et al. 2011]. In particular, the codelength viewpoint fully determines an optimal noise level for the denoising criterion.
💡 Research Summary
The paper investigates the relationship between two common training objectives for auto‑encoders: minimizing reconstruction error and minimizing the codelength required to compress the data. The authors adopt a Minimum Description Length (MDL) viewpoint, treating an auto‑encoder as a probabilistic generative model consisting of a prior distribution ρ over a latent feature space Y and a decoder g that maps a latent code y to a distribution over the input space X. The true compression cost of a dataset D is the negative log‑likelihood L_gen(D)=−∑_{x∈D}log p_g(x), where p_g(x)=∫ρ(y)g_y(x)dy. Directly minimizing L_gen is difficult because it requires integrating over all possible latent codes for each data point.
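As a sanity check on these quantities, the intractable integral p_g(x) = ∫ρ(y)g_y(x)dy can be estimated by Monte Carlo sampling from the prior. The sketch below uses a hypothetical one-dimensional linear-Gaussian model (the parameters `w` and `sigma` are illustrative, not from the paper), chosen because p_g then also has a closed form to compare against:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (hypothetical parameters): 1-D latent y with standard normal
# prior rho, and a linear-Gaussian decoder g_y(x) = N(x; w*y, sigma^2).
w, sigma = 2.0, 0.5

def log_decoder(x, y):
    """log g_y(x) for the Gaussian decoder."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - w * y) ** 2 / (2 * sigma**2)

def L_gen(x, n_samples=100_000):
    """Monte Carlo estimate of -log p_g(x) = -log E_{y~rho}[g_y(x)]."""
    y = rng.standard_normal(n_samples)  # y ~ rho
    log_gy = log_decoder(x, y)
    # log-mean-exp for numerical stability
    m = log_gy.max()
    return -(m + np.log(np.exp(log_gy - m).mean()))

# For this linear-Gaussian model, p_g(x) = N(x; 0, w^2 + sigma^2) exactly,
# so the estimate can be checked against the closed form.
x = 1.3
exact = 0.5 * np.log(2 * np.pi * (w**2 + sigma**2)) + x**2 / (2 * (w**2 + sigma**2))
print(L_gen(x), exact)
```

The log-mean-exp trick matters here: averaging raw densities underflows quickly once the decoder is sharp or the data is high-dimensional, which is one reason direct minimization of L_gen is impractical.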
To bridge this gap, the authors first describe a naïve two‑part coding scheme: encode a latent code y using −log ρ(y) bits and then encode the data point x using −log g_y(x) bits. This yields a codelength L_two‑part(x) = −log ρ(y) − log g_y(x) that is never smaller than L_gen(x), because it commits to a single y rather than averaging over all y. When the encoder f is deterministic (f(x) = y), the two‑part codelength reduces to the reconstruction loss plus a cross‑entropy term between the empirical distribution of encoded features and the prior ρ. When f is stochastic, the same expression holds in expectation over f(x).
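The gap between the two-part code and the true codelength is easy to exhibit numerically in a small discrete model (the binary latent space and all parameters below are illustrative, not from the paper): fixing a single y keeps only one term of the sum defining p_g(x), so the two-part code can never be shorter.

```python
import numpy as np

# Toy discrete model (hypothetical): binary latent y in {0, 1} with prior
# rho, and a Gaussian decoder g_y(x) = N(x; mu[y], 1).
rho = np.array([0.7, 0.3])
mu = np.array([-1.0, 2.0])

def log_decoder(x, y):
    return -0.5 * np.log(2 * np.pi) - (x - mu[y]) ** 2 / 2

def L_gen(x):
    """True codelength: -log sum_y rho(y) g_y(x)."""
    return -np.log(sum(rho[y] * np.exp(log_decoder(x, y)) for y in (0, 1)))

def L_two_part(x, y):
    """Naive scheme: send y (-log rho(y) nats), then x given y."""
    return -np.log(rho[y]) - log_decoder(x, y)

x = 1.5
for y in (0, 1):
    # rho(y) g_y(x) is one term of the sum defining p_g(x), so fixing
    # any single y can only lengthen the code.
    print(y, L_two_part(x, y), L_gen(x))
```

Note that in continuous latent spaces this comparison requires quantizing y to finite precision; the discrete model sidesteps that subtlety.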
The key theoretical contribution is a variational upper bound on the true codelength. For any encoder distribution f(x) over Y, the following inequality holds: L_rec(x) + KL(f(x)‖ρ) ≥ L_gen(x), where L_rec(x) = −E_{y∼f(x)}[log g_y(x)] is the expected reconstruction loss and KL(·‖·) denotes the Kullback–Leibler divergence. Minimizing this upper bound over the encoder f therefore amounts to minimizing the reconstruction error plus a regularization term that keeps the distribution of encoded features close to the prior ρ.
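The variational bound can be checked in closed form on the same kind of toy linear-Gaussian model (all parameters below are illustrative, not from the paper): for an arbitrary Gaussian encoder the bound holds but is loose, and for the exact posterior encoder it is tight, the standard behavior of such bounds.

```python
import numpy as np

# Toy model (hypothetical): prior rho = N(0,1) over y, linear-Gaussian
# decoder g_y(x) = N(x; w*y, sigma^2), Gaussian encoder f(x) = N(m, s^2).
w, sigma = 2.0, 0.5

def bound(x, m, s):
    """L_rec(x) + KL(f(x) || rho), both in closed form for this model."""
    # L_rec = E_{y~N(m,s^2)}[-log g_y(x)]
    L_rec = 0.5 * np.log(2 * np.pi * sigma**2) \
        + ((x - w * m) ** 2 + w**2 * s**2) / (2 * sigma**2)
    kl = 0.5 * (s**2 + m**2 - 1 - np.log(s**2))
    return L_rec + kl

def L_gen(x):
    """Exact -log p_g(x); p_g is N(0, w^2 + sigma^2) here."""
    v = w**2 + sigma**2
    return 0.5 * np.log(2 * np.pi * v) + x**2 / (2 * v)

x = 1.3
# Arbitrary encoder: the bound holds but is loose.
print(bound(x, m=0.2, s=1.0) >= L_gen(x))        # True
# Exact posterior encoder: the bound is tight.
m_star = w * x / (w**2 + sigma**2)
s_star = sigma / np.sqrt(w**2 + sigma**2)
print(abs(bound(x, m_star, s_star) - L_gen(x)))  # ~0
```

The tight case illustrates why the bound is useful: the encoder f only needs to approximate the posterior over y for the codelength objective to approach the true compression cost.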