FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Video super-resolution (VSR) aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning (FL) offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the Discrete Wavelet Transform (DWT) to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets show that FedVSR not only improves perceptual video quality (up to +0.89 dB PSNR, +0.0370 SSIM, −0.0347 LPIPS, and +4.98 VMAF) but also achieves these gains with near-zero computation and communication overhead compared to its rivals. These results demonstrate FedVSR’s potential to bridge the gap between privacy, efficiency, and perceptual quality, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR


💡 Research Summary

The paper introduces FedVSR, the first federated learning (FL) framework specifically designed for video super‑resolution (VSR). VSR aims to reconstruct high‑resolution video frames from low‑resolution inputs by exploiting both spatial details and temporal correlations. While deep learning has dramatically improved VSR performance, it traditionally relies on centralized training data, which raises privacy, regulatory, and bandwidth concerns—especially for sensitive video content such as medical recordings, personal phone footage, or surveillance streams. General FL methods (FedAvg, FedProx, SCAFFOLD, etc.) have been successful for classification or detection tasks, but they struggle with low‑level regression problems like VSR, often producing blurry outputs that lack fine textures.

FedVSR tackles this gap with a two‑pronged solution that remains model‑agnostic and stateless, meaning it can be applied to any VSR architecture without modifying internal layers or storing additional client‑side state.

  1. Frequency‑aware loss using 3‑D Discrete Wavelet Transform (DWT).

    • Each client computes a standard reconstruction loss (e.g., L1 or Charbonnier) and adds a lightweight DWT‑based term that penalizes the difference between high‑frequency wavelet coefficients of the predicted and ground‑truth video clips.
    • The authors adopt a separable Haar wavelet for its simplicity and integer‑filter implementation, which incurs negligible computational overhead on edge devices.
    • By operating on 3‑D video volumes, the loss captures both spatial and temporal high‑frequency details (edges, textures, motion boundaries) that are typically lost under the limited communication budget of FL. Interestingly, the same loss degrades performance in a centralized setting, highlighting its specific benefit for federated training where high‑frequency information tends to be “averaged out.”
  2. Loss‑aware adaptive aggregation.

    • Traditional FedAvg weights client updates solely by dataset size, ignoring data quality and heterogeneity. FedVSR records the average local loss on each client during training and sends this scalar to the server.
    • The server computes inverse‑loss weights (p_k ∝ 1/L_k) and performs a weighted average of the received model parameters. Clients with lower loss (i.e., higher‑quality data or better convergence) thus exert more influence on the global model, while noisy or poorly‑performing clients are automatically down‑weighted.
    • This strategy mitigates the adverse effects of non‑IID data distributions common in video collections (different scenes, compression levels, lighting conditions) and improves overall convergence stability.
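The frequency-aware loss in component 1 can be sketched with a one-level separable Haar transform applied along the time, height, and width axes of a clip. This is a minimal NumPy illustration, not the paper's implementation: the exact subband selection, the Charbonnier epsilon, and the weight `lam` on the DWT term are assumptions for the sketch.

```python
import numpy as np

def haar_split(x, axis):
    """One-level separable Haar transform along one axis.
    Returns (low, high) subbands; assumes an even length along `axis`."""
    a = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    b = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def haar3d_highbands(video):
    """3-D Haar DWT of a (T, H, W) clip; returns the seven subbands
    that contain at least one high-frequency component (all but LLL)."""
    bands = []
    lo_t, hi_t = haar_split(video, axis=0)
    for t_band, t_high in ((lo_t, False), (hi_t, True)):
        lo_h, hi_h = haar_split(t_band, axis=1)
        for h_band, h_high in ((lo_h, False), (hi_h, True)):
            lo_w, hi_w = haar_split(h_band, axis=2)
            for w_band, w_high in ((lo_w, False), (hi_w, True)):
                if t_high or h_high or w_high:
                    bands.append(w_band)
    return bands

def dwt_loss(pred, gt):
    """L1 distance between high-frequency Haar subbands of the two clips."""
    return sum(np.abs(p - g).mean()
               for p, g in zip(haar3d_highbands(pred), haar3d_highbands(gt)))

def charbonnier(pred, gt, eps=1e-3):
    """Standard Charbonnier reconstruction loss."""
    return np.sqrt((pred - gt) ** 2 + eps ** 2).mean()

def local_loss(pred, gt, lam=0.1):
    """Client-side objective: reconstruction plus weighted DWT term.
    `lam` is an illustrative weight, not a value taken from the paper."""
    return charbonnier(pred, gt) + lam * dwt_loss(pred, gt)
```

Because the high-pass Haar filter is a simple paired difference, the extra term costs only a few additions per pixel, which is consistent with the negligible-overhead claim above.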

The framework is evaluated on four public VSR datasets (including REDS, Vid4, and UDM10) and four state‑of‑the‑art VSR backbones (VRT, RVRT, IART, DiQP). Compared against a suite of FL baselines (FedAvg, FedProx, FedMedian, FedNova, SCAFFOLD), FedVSR consistently delivers superior quantitative results: up to +0.89 dB in PSNR, +0.0370 in SSIM, −0.0347 in LPIPS, and +4.98 in VMAF. Visual inspection confirms sharper textures and better temporal coherence, especially in scenes with rapid motion or rich textures.

From an efficiency standpoint, the added DWT loss contributes less than 0.3 % extra FLOPs per local update, and the loss‑aware aggregation only requires transmitting a single scalar per client per round, resulting in a communication overhead of roughly 1–2 % over vanilla FedAvg. The method scales well: increasing the number of participating clients from 5 to 20 does not degrade convergence speed or final performance, demonstrating robustness to realistic federated deployments.
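The server-side step behind the single-scalar communication described above can be sketched as follows. This is a minimal illustration of the p_k ∝ 1/L_k rule, assuming each client reports its average local loss alongside its parameters; the paper's exact normalization and parameter layout may differ.

```python
def loss_aware_aggregate(client_params, client_losses):
    """Inverse-loss weighted average of client model parameters.

    client_params: one dict per client, mapping parameter name to a
        flat list of floats (an illustrative stand-in for real tensors).
    client_losses: the average local training loss each client reports,
        i.e. the single extra scalar transmitted per round.
    """
    # Weight each client by the inverse of its reported loss, normalized
    # so the weights sum to one (p_k proportional to 1 / L_k).
    inv = [1.0 / loss for loss in client_losses]
    total = sum(inv)
    weights = [w / total for w in inv]

    # Weighted average of every parameter across clients.
    aggregated = {}
    for name in client_params[0]:
        n = len(client_params[0][name])
        aggregated[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(n)
        ]
    return aggregated
```

For example, two clients reporting losses 1.0 and 3.0 receive weights 0.75 and 0.25, so the lower-loss client dominates the global update, matching the down-weighting behavior described earlier.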

Overall, FedVSR bridges the privacy‑efficiency‑quality triad for low‑level vision tasks. By preserving high‑frequency information through a lightweight frequency domain loss and dynamically re‑weighting client contributions based on actual training quality, it achieves high‑fidelity video super‑resolution without sacrificing the privacy guarantees of federated learning. The authors release their code, paving the way for practical applications in privacy‑sensitive domains such as remote medical imaging, secure video streaming, and confidential surveillance.
