📝 Original Info
- Title: End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach
- ArXiv ID: 2512.14185
- Date: 2025-12-16
- Authors: Emanuele Artioli, Farzad Tashtarian, Christian Timmerer
📝 Abstract
The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.
📄 Full Content
End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach

Emanuele Artioli (emanuele.artioli@aau.at), Alpen-Adria-Universitaet Klagenfurt, Kaernten, Austria
Farzad Tashtarian (farzad.tashtarian@aau.at), Alpen-Adria-Universitaet Klagenfurt, Kaernten, Austria
Christian Timmerer (christian.timmerer@aau.at), Alpen-Adria-Universitaet Klagenfurt, Kaernten, Austria
Abstract
The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, in-painting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.
CCS Concepts
• Computing methodologies → Object identification; Artificial intelligence; Concurrent algorithms; Image compression; • Information systems → Online analytical processing; Multimedia streaming.
Keywords
HTTP adaptive streaming, Generative AI, End-to-end architecture, Quality of Experience
1 Introduction
With the increasing demand for high-quality video streaming and storage, video compression methods are becoming more crucial than ever. Traditional codecs have significantly reduced file sizes while preserving visual quality, with each new iteration improving upon its predecessor by about 50% [1]. However, further advancements are needed to meet the growing requirements of bandwidth-constrained environments and the ever-increasing resolution of video content. A promising avenue is represented by neural codecs [2, 3], i.e., compressing a video into the weights of a neural network, which is then prompted by the client to recreate frames. In addition to server-side compression efficiency, video enhancement techniques have been explored to leverage client-side computation, such as frame interpolation and super-resolution [4, 5, 6, 7].

The financial support of the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, and the Christian Doppler Research Association is gratefully acknowledged. Christian Doppler Laboratory ATHENA: https://athena.itec.aau.at/. This work is licensed under a Creative Commons Attribution 4.0 International License.

Figure 1: Overview of the ELVIS pipeline (server side: video encoding, frame extraction, frame shrinking, complexity calculation; client side: video decoding, frame stretching, frame in-painting, video rendering; coordinated by the ELVIS controller with network performance monitoring).
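The stage ordering of the pipeline in Figure 1 can be sketched as a round trip of function calls. This is a minimal illustration, not the paper's implementation: the function names are assumptions, block averaging stands in for the server-side shrinking stage, and the client-side enhancement step is a placeholder where a generative model would run.

```python
import numpy as np

# Illustrative sketch of the ELVIS stage ordering from Figure 1.
# All names and the block-averaging "shrink" are assumptions standing
# in for the paper's actual codec and in-painting components.

def shrink(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Server side: downscale a frame by block averaging."""
    h, w = frame.shape[:2]
    h, w = h - h % factor, w - w % factor
    f = frame[:h, :w].astype(np.float64)
    return f.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def stretch(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Client side: nearest-neighbour upscaling before enhancement."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)

def enhance(frame: np.ndarray) -> np.ndarray:
    """Client side: placeholder for a generative in-painting model."""
    return frame  # a real deployment would invoke the model here

def elvis_roundtrip(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    """Server shrinks, the network carries less data, the client restores."""
    return enhance(stretch(shrink(frame, factor), factor))

frame = np.arange(64, dtype=np.float64).reshape(8, 8)
restored = elvis_roundtrip(frame)
```

The round trip halves each spatial dimension before transmission and restores the original resolution on the client, which is where the quality of the generative enhancement step determines the viewing experience.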
All of the aforementioned techniques face a common limitation: they can only tackle low-level features of videos, such as edges and block-wise flow. With the advent of large generative models, AI is now able to learn and replicate high-level video features, objects, and even entire clips of up to a few seconds [8]. Therefore, a new and as-yet-unexplored avenue for enhancing video compression is the integration of video in-painting techniques [9, 10, 11] that analyze the video as a whole, gather context about what it depicts, and fill in missing or corrupted regions. Using the latest advances in machine learning, such as attention mechanisms [12], and training on increasingly large and curated datasets, state-of-the-art (SOTA) in-painting algorithms learn how objects typically appear and move in videos, giving them the ability to recreate far larger portions of content than previously possible [9, 10, 11].
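The core in-painting operation, filling masked regions from surrounding context, can be illustrated with a deliberately simple stand-in. The sketch below fills masked pixels by iterative neighbour averaging (plain diffusion); it is a toy substitute for the learned video in-painting models cited above, which exploit attention and motion context rather than local smoothing, but it shows the same interface: an image plus a binary mask of regions to reconstruct.

```python
import numpy as np

# Minimal sketch of mask-based in-painting: masked pixels are filled by
# iteratively averaging their four neighbours (simple diffusion). A toy
# stand-in for the learned video in-painting models cited in the text.

def inpaint(image: np.ndarray, mask: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fill pixels where mask is True using neighbour averaging."""
    out = image.astype(np.float64).copy()
    out[mask] = out[~mask].mean()  # coarse initial guess from known pixels
    for _ in range(iters):
        padded = np.pad(out, 1, mode="edge")
        avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
               padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        out[mask] = avg[mask]  # known pixels stay fixed, holes diffuse inward
    return out

# A flat region with a hole: diffusion recovers the constant value exactly.
img = np.full((16, 16), 7.0)
mask = np.zeros_like(img, dtype=bool)
mask[6:10, 6:10] = True
img[mask] = 0.0
filled = inpaint(img, mask)
```

Generative models replace the averaging step with learned priors over object appearance and motion, which is what lets them plausibly reconstruct far larger and more textured regions than diffusion can.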
This paper’s contributions are twofold: (𝑖) it presents ELVIS, an innovative method that implements video in-painting alongside encoding, aiming to enhance compression efficiency by eliminating parts of the video that are challenging to encode but can be regenerated by the client using in-painting algorithms, without significantly deteriorating the viewing experience. This approach allows the encoder to focus on portions that cannot be easily replicated at the client side, thereby increasing video quality without additional bandwidth requirements. The effectiveness of this method is evaluated using a variety of metrics to ensure that the reconstructed video meets the high standards required for practical deployment. Contribution (𝑖𝑖) is the release¹ of an end-to-end pipeline, outlined
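The selective-removal idea behind contribution (𝑖) can be made concrete with a small sketch of the server-side complexity calculation. Everything here is a hypothetical heuristic, not the paper's actual criterion: blocks are scored by pixel variance as a proxy for encoding cost, and a fixed budget of the most complex blocks is marked for removal on the assumption that the client's in-painting model can regenerate such textured regions.

```python
import numpy as np

# Hypothetical sketch of a server-side "complexity calculation" stage:
# score each block by pixel variance and mark the most complex blocks for
# removal, assuming the client's in-painting model can regenerate them.
# The variance score, block size, and top_frac budget are illustrative.

def removal_mask(frame: np.ndarray, block: int = 8, top_frac: float = 0.25) -> np.ndarray:
    """Return a pixel-level mask covering the top_frac most complex blocks."""
    bh, bw = frame.shape[0] // block, frame.shape[1] // block
    tiles = frame[:bh * block, :bw * block].reshape(bh, block, bw, block)
    var = tiles.transpose(0, 2, 1, 3).reshape(bh, bw, -1).var(axis=2)
    k = max(1, int(top_frac * var.size))          # removal budget in blocks
    chosen = np.zeros(var.size, dtype=bool)
    chosen[np.argsort(var.ravel())[-k:]] = True   # k highest-variance blocks
    chosen = chosen.reshape(bh, bw)
    # expand the block-level decision back to pixel resolution
    return np.kron(chosen.astype(int),
                   np.ones((block, block), dtype=int)).astype(bool)

rng = np.random.default_rng(0)
frame = np.zeros((32, 32))
frame[:8, :8] = rng.normal(size=(8, 8))  # one noisy, hard-to-encode block
mask = removal_mask(frame)               # marks that block within the budget
```

The encoder would then skip or coarsely quantize the masked regions, transmit the mask alongside the bitstream, and let the client's in-painting stage reconstruct them, concentrating the bit budget on content the client cannot easily replicate.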