Sampling Using Neural Networks for Colorizing Grayscale Images
The main idea of this paper is to explore the possibilities of generating samples from neural networks, focusing on the colorization of grayscale images. I compare existing methods for colorization and explore the application of recent generative models to this task. The contributions of this paper are to compare existing architectures with similar generating structures (decoders) and to apply novel structures, including the Conditional VAE (CVAE), the Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), CWGAN-GP with an L1 reconstruction loss, Adversarial Generative Encoders (AGE), and the Introspective VAE (IVAE). I trained these models on CIFAR-10 images. To measure performance, I use the Inception Score (IS), which captures both how distinctive each image is and how diverse the overall samples are, complemented by human visual inspection of the CIFAR-10 results. CVAE with L1 reconstruction loss and IVAE achieve the highest IS. CWGAN-GP with L1 tends to learn faster than plain CWGAN-GP, but its IS does not improve over it. CWGAN-GP tends to generate more diverse images than the models that use a reconstruction loss. I also found that proper regularization plays a vital role in generative modeling.
💡 Research Summary
This paper investigates the problem of colorizing grayscale images by treating it as a multimodal generation task: a single gray input can correspond to many plausible color outputs. Using the CIFAR‑10 dataset, the authors convert the images to grayscale, upscale them to 64 × 64 pixels, and then train a suite of generative models to map the gray input to a colored output. The models evaluated include a baseline convolutional decoder with L1/L2 reconstruction loss, a Conditional Variational Auto‑Encoder (CVAE), a Conditional Wasserstein GAN with Gradient Penalty (CWGAN‑GP), a CWGAN‑GP augmented with an L1 reconstruction term, an Adversarial Generative Encoder (AGE), and an Introspective VAE (IVAE).
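The preprocessing step can be sketched as follows. The summary specifies only grayscale conversion and upscaling to 64 × 64, so the ITU-R BT.601 luminance weights and nearest-neighbour interpolation below are assumptions, not the paper's confirmed choices:

```python
def to_grayscale(pixel):
    """Luminance of an (R, G, B) pixel using ITU-R BT.601 weights
    (an assumption; the paper's exact conversion is not given here)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def upscale_2x(image):
    """Nearest-neighbour 2x upscaling: 32x32 CIFAR-10 -> 64x64."""
    out = []
    for row in image:
        wide = [p for p in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

# Tiny 2x2 RGB example standing in for a 32x32 CIFAR-10 image.
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
gray = [[to_grayscale(px) for px in row] for row in rgb]
big = upscale_2x(gray)   # 4x4 grayscale image
```

The gray image then serves as the conditioning input, and the models learn to produce the corresponding color output.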
Performance is primarily measured by the Inception Score (IS), which is well‑suited for CIFAR‑10, and complemented by human visual assessment of color realism and detail preservation. The baseline CNN produces blurry, low‑diversity results despite reasonable pixel‑wise accuracy. Adding a conditional structure (CVAE) improves both visual quality and IS; further incorporating an L1 reconstruction loss (CVAE‑L1) yields the highest IS among VAE‑based methods.
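For reference, the Inception Score replaces pixel accuracy with a classifier-based statistic, IS = exp(E_x[KL(p(y|x) ∥ p(y))]). A minimal sketch over precomputed class distributions (in the real metric these come from a pretrained Inception network, which is assumed away here):

```python
import math

def inception_score(probs):
    """IS = exp(mean_x KL(p(y|x) || p(y))).

    `probs` is a list of per-image class distributions p(y|x); in the
    actual metric these are softmax outputs of a pretrained Inception
    classifier.
    """
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]   # p(y)
    mean_kl = sum(
        pc * math.log(pc / marginal[c])
        for p in probs
        for c, pc in enumerate(p)
        if pc > 0
    ) / n
    return math.exp(mean_kl)

# Confident, diverse predictions score high; identical ones score 1.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))  # ~2.0: max for 2 classes
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))  # 1.0: no distinctiveness
```

A high IS requires both confident per-image predictions (distinctiveness) and a spread-out marginal over classes (diversity), matching the two properties the metric is used for here.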
CWGAN‑GP, which enforces a Lipschitz constraint on the discriminator via a gradient‑penalty term, generates the most diverse colors but attains a lower IS than CVAE‑L1. When an L1 reconstruction loss is added to CWGAN‑GP (CWGAN‑GP+L1), training converges faster, yet the final IS does not surpass the plain CWGAN‑GP, indicating that reconstruction loss stabilizes training but may restrict the GAN’s ability to explore diverse modes.
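Concretely, the conditional critic objective with gradient penalty follows the standard WGAN-GP formulation; the sketch below conditions on the gray image $g$, and the weights $\lambda$ and $\lambda_{L1}$ are illustrative, as the summary does not give the paper's values:

```latex
% Critic loss: Wasserstein estimate + gradient penalty, where
% \hat{x} = G(z, g) is a colorized sample and \tilde{x} is an
% interpolate between real and generated images.
L_D = \mathbb{E}\big[D(\hat{x}, g)\big] - \mathbb{E}\big[D(x, g)\big]
      + \lambda\, \mathbb{E}\big[(\lVert \nabla_{\tilde{x}} D(\tilde{x}, g) \rVert_2 - 1)^2\big]

% Generator loss; the CWGAN-GP+L1 variant adds the reconstruction term.
L_G = -\mathbb{E}\big[D(G(z, g), g)\big]
      + \lambda_{L1}\, \mathbb{E}\big[\lVert x - G(z, g) \rVert_1\big]
```

The penalty term enforces the 1-Lipschitz constraint described above, while the L1 term pulls samples toward the ground-truth colors, which is consistent with the faster convergence but reduced mode exploration reported.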
Hybrid approaches that blend VAE reconstruction objectives with adversarial training are explored next. AGE builds a cyclic architecture in which the encoder itself plays the adversarial role against the generator, keeping the latent space consistent while sharpening outputs. IVAE extends the VAE ELBO with a discriminator‑based regularization term, encouraging the latent distribution to stay close to a standard normal. Both hybrids outperform the pure GAN and VAE baselines, with IVAE achieving the highest IS and receiving the best human ratings for naturalness and color fidelity.
The experiments are conducted on an AWS p3.2xlarge instance (Tesla V100) and, for prototyping, on Google Colab (Tesla K80). Training each model takes up to 30 hours. The authors detail architectural choices—kernel sizes, activation functions (ReLU/LeakyReLU), batch normalization, residual connections—and training tricks such as learning‑rate schedules and gradient‑penalty weighting.
A key insight is the importance of regularization balance: overly strong KL‑divergence penalties compress the latent space and reduce color diversity, while appropriately weighted L1 reconstruction loss improves pixel‑level accuracy without sacrificing the sharpness that GANs provide. The study concludes that the most effective strategies for grayscale colorization are CVAE with L1 loss and the IVAE hybrid, both of which combine stable variational training with adversarial sharpening. Future work is suggested to incorporate additional metrics like FID, scale to higher‑resolution datasets, and explore user‑controlled color palettes or conditional style transfer.
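The regularization balance can be made explicit by writing the CVAE-L1 objective with weights (a sketch; the symbols $\beta$ and $\lambda$ are illustrative and not taken from the paper):

```latex
L_{\text{CVAE-L1}} = \lambda\, \lVert x - \hat{x} \rVert_1
  + \beta\, D_{\mathrm{KL}}\big(q(z \mid x, g)\,\Vert\, \mathcal{N}(0, I)\big)
```

Too large a $\beta$ compresses the latent space and washes out color diversity; too small a $\lambda$ sacrifices pixel-level accuracy, matching the trade-off described above.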