SCENE: Semantic-aware Codec Enhancement with Neural Embeddings
Compression artifacts from standard video codecs often degrade perceptual quality. We propose a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.
💡 Research Summary
The paper addresses a persistent problem in high‑resolution video streaming: standard codecs such as H.264 and H.265 introduce compression artifacts that severely degrade perceived visual quality, especially in regions that are most important to human viewers (faces, text, object boundaries). Existing neural enhancement methods either operate post‑decoding, replace the whole codec, or use simple saliency maps that lack rich semantic context. To bridge this gap, the authors propose SCENE (Semantic‑aware Codec Enhancement with Neural Embeddings), a lightweight pre‑processing module that can be inserted before any conventional encoder without altering the existing pipeline.
SCENE’s core innovation lies in two tightly coupled components. First, it extracts dense, spatially aware semantic embeddings from each input frame using a frozen SigLIP 2 So400M vision‑language model. These 1,152‑dimensional vectors encode high‑level concepts (objects, scenes) as well as fine‑grained spatial cues. A small “control module” consisting of two 1×1 convolutions with ReLU transforms the embeddings into channel‑specific coefficients. Second, the network employs assembled convolutions: a set of four base kernels is linearly combined per output channel using the coefficients, yielding dynamically constructed kernels that adapt to the semantic content of each frame. This assembled‑convolution mechanism provides finer granularity than traditional dynamic convolutions because each output channel receives its own weighted combination of base kernels.
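The assembled-convolution mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the base-kernel count (4), the control module's hidden width, and all tensor shapes are assumptions, and the two 1×1 convolutions are written as the equivalent matrix multiplies on a single embedding vector.

```python
import numpy as np

def control_module(embedding, W1, W2):
    """Map a semantic embedding to kernel-mixing coefficients.
    Two 1x1 convolutions with a ReLU in between reduce to two matrix
    multiplies when applied to a single embedding vector."""
    hidden = np.maximum(embedding @ W1, 0.0)  # 1x1 conv + ReLU
    return hidden @ W2                        # 1x1 conv -> coefficients

def assemble_kernels(base_kernels, coeffs):
    """Linearly combine base kernels per output channel.
    base_kernels: (B, C_out, C_in, k, k); coeffs: (C_out, B).
    Output channel o receives sum_b coeffs[o, b] * base_kernels[b, o]."""
    return np.einsum("ob,bocij->ocij", coeffs, base_kernels)

# Toy shapes (hypothetical): 4 base kernels, 8 output / 3 input channels.
rng = np.random.default_rng(0)
B, C_out, C_in, k, hidden_dim = 4, 8, 3, 3, 64
base_kernels = rng.standard_normal((B, C_out, C_in, k, k))
embedding = rng.standard_normal(1152)           # SigLIP-style embedding size
W1 = rng.standard_normal((1152, hidden_dim)) * 0.02
W2 = rng.standard_normal((hidden_dim, C_out * B)) * 0.02
coeffs = control_module(embedding, W1, W2).reshape(C_out, B)
kernels = assemble_kernels(base_kernels, coeffs)  # (C_out, C_in, k, k)
```

Because every output channel mixes the base kernels with its own coefficients, the effective convolution adapts to each frame's semantic content while the trainable parameters remain limited to the small set of base kernels.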
Training is performed with a differentiable JPEG proxy that mimics the block-transform quantization distortions common to H.264/H.265 while omitting motion compensation. This proxy enables end-to-end gradient flow through the compression stage, aligning the learned enhancement with real-world codec artifacts. The loss function combines four terms: a perceptual loss based on a differentiable PyTorch implementation of VMAF (weighted by λp), a bitrate-estimation loss (λb), and two L1 reconstruction losses applied before (λ1) and after (λ2) the proxy, which stabilize training when the perceptual gradients are noisy.
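How such a proxy stays differentiable can be sketched with an 8×8 DCT and a smooth rounding surrogate. The `soft_round` form below is one common choice and an assumption here, not necessarily the paper's exact proxy; entropy coding, chroma handling, and the per-coefficient quantization table are omitted.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (the 8x8 transform JPEG uses)."""
    k = np.arange(n)
    D = np.cos((2 * k[None, :] + 1) * k[:, None] * np.pi / (2 * n))
    D *= np.sqrt(2.0 / n)
    D[0] /= np.sqrt(2.0)
    return D

def soft_round(x):
    """Smooth surrogate for round(): exact at integers, differentiable
    everywhere, so gradients survive the quantization step."""
    return x - np.sin(2 * np.pi * x) / (2 * np.pi)

def proxy_compress(block, q_step=10.0):
    """One 8x8 block through a JPEG-like proxy:
    DCT -> soft quantization -> inverse DCT. Every step is smooth."""
    D = dct_matrix(block.shape[0])
    coeffs = D @ block @ D.T
    quantized = soft_round(coeffs / q_step) * q_step
    return D.T @ quantized @ D
```

Because each step is smooth, gradients from the post-proxy L1 term (λ2) and the VMAF loss can propagate back through the simulated quantizer to the pre-processor's weights.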
Experiments train on the Vimeo‑90K dataset (9:1 train/validation split) and evaluate on the high‑resolution UVG 1080p benchmark. Two baselines are considered: AsConvSR, an identical assembled‑convolution network without semantic conditioning, and a "codec‑only" configuration with no pre‑processing. SCENE consistently outperforms AsConvSR on perceptual metrics: for H.264 it achieves a VMAF BD‑rate reduction of –32.0 % versus –29.4 % for AsConvSR (a relative gain of 3.9 %), and for H.265 the gains are –37.4 % versus –33.9 % (a 5.8 % relative improvement). MS‑SSIM BD‑rate values are slightly positive for both methods, indicating that optimizing for VMAF can modestly sacrifice pixel‑level similarity, a known trade‑off. With AV1, SCENE improves VMAF substantially but also raises the bitrate enough that the rate–distortion curves no longer overlap the baseline's quality range, making BD‑rate comparison unreliable and highlighting the need for codec‑specific tuning.
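The BD‑rate figures quoted above follow the standard Bjøntegaard delta method; a minimal sketch, assuming the usual cubic fit of log‑bitrate against the quality metric, integrated over the overlapping quality range:

```python
import numpy as np

def bd_rate(rates_anchor, quality_anchor, rates_test, quality_test):
    """Bjoentegaard delta-rate: average % bitrate change at equal quality.
    Fits log10(rate) as a cubic in the quality score for each curve,
    then integrates the gap over the shared quality interval."""
    p1 = np.polyfit(quality_anchor, np.log10(rates_anchor), 3)
    p2 = np.polyfit(quality_test, np.log10(rates_test), 3)
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    avg_log_diff = (int2 - int1) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Synthetic sanity check: halving the bitrate at identical quality points
# should yield a BD-rate of about -50 %.
quality = np.array([30.0, 35.0, 40.0, 45.0])   # hypothetical VMAF scores
rates = np.array([1000.0, 2000.0, 4000.0, 8000.0])  # kbps
savings = bd_rate(rates, quality, rates / 2, quality)  # ~ -50.0
```

This also makes the AV1 caveat concrete: when the two rate–distortion curves share little or no quality range, the `[lo, hi]` integration interval collapses and the metric loses meaning.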
Qualitative examples demonstrate that at low bitrates SCENE better preserves sharp object edges, fine textures, and text readability compared with the baseline and raw codec output. The model contains only 1.4 M trainable parameters and runs at 27.74 ms per 1080p frame on an RTX 4090 (≈36 fps), confirming real‑time feasibility.
The authors acknowledge several limitations: semantic conditioning is applied only to the first assembled block, while the second relies solely on learned features; temporal consistency is not modeled, which could cause flickering across frames; and the JPEG proxy does not capture motion‑compensated prediction, potentially limiting fidelity for codecs that rely heavily on inter‑frame coding. Future work will explore multi‑layer semantic modulation, transformer‑based temporal modeling, and extensions to additional codecs and streaming scenarios.
In summary, SCENE introduces a novel combination of vision‑language semantic guidance and dynamically assembled convolutions, trained with a differentiable codec proxy, to deliver perceptually superior, real‑time pre‑processing for standard video codecs. The method demonstrates measurable bitrate savings and quality gains on H.264 and H.265 while maintaining a lightweight footprint suitable for deployment in existing streaming pipelines.