MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction
Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model’s performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.
💡 Research Summary
The paper addresses the problem of document image enhancement and binarization, which are essential preprocessing steps for optical character recognition (OCR) systems, especially when dealing with degraded color documents. Existing state-of-the-art (SOTA) approaches train multiple generative adversarial networks (GANs), often six in total, with separate networks dedicated to individual color channels (red, green, blue, and gray). While this multi-GAN strategy improves the removal of shadows, stains, and noise, it incurs long training and inference times, limiting its practicality for large-scale or real-time applications.
MFE-GAN (Multi-Scale Feature Extraction GAN) is proposed as an efficient alternative that significantly reduces total training and inference time while maintaining performance comparable to SOTA methods. The framework consists of three stages.
Stage 1 – Multi‑Scale Feature Extraction (MFE).
Input document images are first divided into 256 × 256 patches, and each patch is split into four single-channel images (R, G, B, Gray). Each single-channel patch undergoes a Haar wavelet transformation (HWT), which decomposes it into four sub-bands: LL (low-low), LH, HL, and HH. Only the LL sub-band, which captures low-frequency structural information while suppressing high-frequency noise, is retained. Because a single-level HWT halves each spatial dimension, the LL sub-band measures 128 × 128; it is then normalized. This step replaces naïve interpolation-based resizing, preserving contour details while reducing the number of pixels per patch to one quarter, thereby cutting the computation required for subsequent GAN training.
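The Stage 1 preprocessing can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the BT.601 grayscale weights, the orthonormal Haar scaling, and the min-max normalization are all assumptions made for illustration, since the summary does not specify them.

```python
import numpy as np

def split_channels(patch_rgb):
    """Split an H x W x 3 patch into R, G, B and gray single-channel images."""
    rgb = patch_rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Assumption: ITU-R BT.601 luma weights for the gray channel.
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return {"R": r, "G": g, "B": b, "Gray": gray}

def haar_ll(channel):
    """Single-level 2D Haar transform, returning only the LL sub-band.

    Each output pixel combines one 2 x 2 block of the input, so both
    spatial dimensions are halved (256 x 256 -> 128 x 128).
    """
    a = channel[0::2, 0::2]  # top-left pixel of every 2 x 2 block
    b = channel[0::2, 1::2]  # top-right
    c = channel[1::2, 0::2]  # bottom-left
    d = channel[1::2, 1::2]  # bottom-right
    return (a + b + c + d) / 2.0  # orthonormal Haar scaling (assumed)

def normalize(x, eps=1e-8):
    """Assumption: min-max normalization to the range [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + eps)

def mfe(patch_rgb):
    """Hypothetical helper: normalized LL sub-band for each channel."""
    channels = split_channels(patch_rgb)
    return {name: normalize(haar_ll(ch)) for name, ch in channels.items()}
```

Calling `mfe` on a 256 × 256 × 3 patch yields four 128 × 128 arrays, one per channel, which would then be passed to the corresponding generators.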
Stage 2 – Color‑Channel‑Specific Enhancement.
Four independent generators are instantiated, each based on a U‑Net++ encoder‑decoder architecture with an EfficientNetV2‑S backbone. Each generator receives the normalized LL sub‑band of a single channel and produces a 128 × 128 enhanced sub‑image. All four generators share a single discriminator, an improved PatchGAN that applies instance normalization to every layer except the first, preventing the distortion of low‑level color cues while stabilizing adversarial training. The four enhanced sub‑images are summed pixel‑wise and concatenated to reconstruct a full‑resolution enhanced image. By sharing the discriminator, the model reduces the total number of parameters compared with training six separate GANs, yet still learns channel‑specific enhancements.
Stage 3 – Dual‑Scale Binarization.
The enhanced image from Stage 2 is fed into two parallel binarization GANs. The “local” binarization GAN processes the image at its original resolution, focusing on fine‑grained text strokes. Simultaneously, the original input image is up‑scaled to 512 × 512 using nearest‑neighbor interpolation and passed to a “global” binarization GAN, which captures broader layout and large‑scale contrast. Each branch has its own discriminator, forming two complete GANs. The final binary output is obtained by a logical AND of the local and global binarization results, ensuring that a pixel is marked as text only when both scales agree. This dual‑scale strategy improves the separation of text from complex backgrounds and mitigates errors that arise when only a single scale is considered.
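The nearest-neighbor resizing and the logical-AND fusion can be sketched as follows. This is an illustrative sketch only: it assumes both branches output probability maps in which higher values mean text, that a 0.5 threshold is used, and that the global result is resized back to the local resolution before fusion. None of these details are stated in the summary, and the function names are hypothetical.

```python
import numpy as np

def nearest_resize(img, out_h, out_w):
    """Resize a 2D array with nearest-neighbor sampling."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]

def fuse_and(local_prob, global_prob, threshold=0.5):
    """Mark a pixel as text only if both branches mark it as text.

    Assumptions: inputs are text-probability maps, and the global map may
    have a different resolution (e.g. 512 x 512), in which case it is
    resized to the local map's shape first.
    """
    if global_prob.shape != local_prob.shape:
        global_prob = nearest_resize(global_prob, *local_prob.shape)
    local_text = local_prob >= threshold
    global_text = global_prob >= threshold
    return np.logical_and(local_text, global_text)
```

Because the fusion is an AND, a false positive from one branch is suppressed whenever the other branch disagrees, at the cost of dropping strokes that only one branch detects.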
Loss Functions.
Training stability is enhanced by using the Wasserstein GAN with gradient penalty (WGAN-GP) objective as the adversarial loss. In addition to the adversarial term, the generator loss incorporates binary cross-entropy (BCE) loss, which directly optimizes pixel-wise classification accuracy, and Soft-Dice loss, which encourages overlap between predicted and ground-truth text regions. The combined loss is:
L_G = –E_x
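The three generator terms described above (adversarial, BCE, and Soft-Dice) can be sketched in NumPy as follows. The weights `w_bce` and `w_dice`, the smoothing constant, and the function names are hypothetical, since the summary does not give the paper's exact weighting. The gradient penalty is omitted because it applies to the discriminator (critic) update rather than the generator.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def soft_dice_loss(pred, target, smooth=1.0):
    """1 - soft Dice coefficient; 0 when prediction and target overlap fully."""
    intersection = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return float(1.0 - (2.0 * intersection + smooth) / (denom + smooth))

def generator_loss(critic_scores_fake, pred, target, w_bce=1.0, w_dice=1.0):
    """Sketch of a combined generator loss with assumed (hypothetical) weights.

    critic_scores_fake: critic outputs on generated images. The WGAN
    generator term is the negated mean critic score.
    """
    adversarial = -float(np.mean(critic_scores_fake))
    return (adversarial
            + w_bce * bce_loss(pred, target)
            + w_dice * soft_dice_loss(pred, target))
```

With a perfect prediction, the BCE and Soft-Dice terms vanish (up to clipping) and only the adversarial term remains.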