Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission


Authors: Xiangyu Chen, Jixiang Luo, Jingyu Xu

Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, Xuelong Li*
Institute of Artificial Intelligence (TeleAI), China Telecom

Can a video be compressed at an extreme compression rate as low as 0.01%? Working toward this goal, we achieve a compression rate of 0.02% in some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon–Weaver model. How, then, can we trade computation for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression–computation trade-off strategy, enabling fast inference on consumer-grade GPUs. Within the AI Flow framework, GVC opens new possibilities for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward an effective, efficient, scalable, and practical video communication paradigm.

Date: December 30, 2025
Keywords: generative video compression, task-oriented communication, AI Flow
* Correspondence to: Xuelong Li (xuelong_li@ieee.org). Other authors are listed alphabetically by surname.

1 Introduction

• Is it possible to reconstruct high-quality video at a compression rate as low as 0.01%?
• How can we trade computation for compression rate to achieve extreme compression?
• Can extreme compression be practical and deployable in real-world scenarios?

The above questions challenge the conventional paradigm of video compression. With the rapid expansion of high-resolution video, virtual reality, social media, and remote conferencing applications, video data is growing exponentially, placing unprecedented demands on existing technology and infrastructure for video storage and transmission. In bandwidth-constrained and latency-sensitive environments, achieving more efficient video compression has become a key research focus at the intersection of communications and artificial intelligence.

Traditional communication theory, rooted in the Shannon–Weaver model introduced in the 1940s (Shannon, 1948), conceptualizes communication across three levels: Level A addresses the technical problem, i.e., data-oriented communication - how to transmit information accurately; Level B considers the semantic problem, i.e., semantic communication - whether the transmitted symbols convey the intended meaning; and Level C focuses on the effectiveness problem, i.e., task-oriented communication - whether the received information leads to the desired behavior. For decades, video communication technology has primarily focused on Level A - maximizing signal fidelity under constrained bandwidth. Such an approach optimizes rate–distortion but can be wasteful when the receiver only needs task-relevant content rather than pixel-perfect reconstructions.

To bridge the gap between bit-level fidelity and task-level utility, the AI Flow framework (Shao and Li, 2024) was first proposed by TeleAI at the end of 2024; it envisions leveraging communication networks to distribute intelligence for ubiquitous AI-powered services. In addition, Information Capacity (Yuan et al., 2025b) has been proposed to evaluate the effectiveness of generative models in data compression. Such advances lay the theoretical and methodological groundwork for data compression based on generative models. In early 2025, we extended this line of work by introducing task-oriented communication for multimodal understanding via device-edge co-inference (Yuan et al., 2025a). By mid-2025, TeleAI first introduced the concept of Generative Video Compression (GVC) at the World Artificial Intelligence Conference (WAIC). Unlike traditional codecs that emphasize pixel-level reconstruction fidelity, GVC adopts a task-oriented communication perspective. It prioritizes whether the transmitted information meets perceptual expectations or effectively supports downstream tasks, thereby placing Level C at the core of its design. At the WAIC conference, TeleAI released a prototype [1] for maritime communications that enables ultra-low-bitrate video transmission over limited-bandwidth satellite connections. The underlying theoretical foundation has been elaborated in the technical report (An et al., 2025).

The core principle of GVC is trading computation for compression rate. Recent advances in generative models, particularly generative video models (OpenAI, 2024; Wan et al., 2025), present unprecedented opportunities in video compression. By leveraging powerful generative priors, GVC aims to overcome the long-standing trade-off between bitrate and perceptual quality found in traditional standards such as HEVC (Sullivan et al., 2012). We further draw the motivation of GVC from (Fan et al., 2025), which depicts the relationship among computation, bandwidth, and memory.
A useful metaphor illustrates this shift: traditional compression is akin to photographing a painting and sending the image; GVC, in contrast, describes the painting’s composition and style, then relies on an “AI painter” at the receiver to recreate it. Thanks to their expressive generative capabilities, modern models can synthesize high-quality videos from minimal latent representations - or even pure noise - guided by learned priors. As a result, the encoder’s role transitions from preserving every pixel to transmitting only the most task-relevant information. As illustrated in Figure 2, GVC can achieve visually compelling reconstruction at bitrates as low as 0.005 bpp (equivalent to a 0.02% compression rate). This demonstrates that GVC is advancing toward the extreme compression frontier of 0.01%. Moreover, even when considering average performance across the common test sequences, GVC significantly reduces the required transmission bandwidth while maintaining high perceptual quality, as shown in Section 3. This makes it particularly suitable for scenarios demanding high communication efficiency, such as maritime communication, emergency rescue, narrowband mobile networks, remote video surveillance, and in-vehicle or wearable devices.

However, extreme compression introduces new challenges. High-quality reconstruction relies on computationally intensive generative processes, imposing strict requirements on hardware and inference latency. To address this, we propose the concept of trading compression rate for practicality - sacrificing a small fraction of the compression rate to achieve a more favorable balance between compression, computation, and quality. As shown in Table 3, our system can run on consumer-grade GPUs with inference latency around 2 seconds - comparable to large language model response times - demonstrating promising real-world usability.
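For concreteness, the equivalence between bitrate in bpp and compression rate quoted above can be checked with simple arithmetic. A minimal sketch, assuming the raw baseline is 8-bit RGB video at 24 bits per pixel (an assumption of ours; the report does not state the baseline explicitly):

```python
# Relationship between bits-per-pixel (bpp) and compression rate,
# assuming raw video stored as 8-bit RGB (24 bits per pixel).

RAW_BPP = 3 * 8  # 24 bits per raw pixel for 8-bit RGB (our assumption)

def compression_rate(compressed_bpp: float, raw_bpp: float = RAW_BPP) -> float:
    """Fraction of the raw size occupied by the compressed stream."""
    return compressed_bpp / raw_bpp

# 0.005 bpp corresponds to roughly 0.02% of the raw size, matching the text.
print(f"{compression_rate(0.005):.4%}")  # prints 0.0208%

# Conversely, the 0.01% frontier corresponds to about 0.0024 bpp.
print(f"{0.0001 * RAW_BPP:.4f} bpp")     # prints 0.0024 bpp
```

Under this baseline, the 0.008 bpp average bitrate reported in Section 3 amounts to roughly 0.03% of the raw size.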
This report presents a generative video compression framework tailored for ultra-low-bitrate video transmission. By leveraging powerful generative video priors, the framework achieves high perceptual quality while drastically reducing bandwidth consumption, with support for downstream tasks. We describe the system architecture, design strategies, and experimental validation of GVC, aiming to lay the foundation for the next generation of perception-driven video communication technology.

2 Methodology

2.1 Framework Overview

The Generative Video Compression (GVC) framework achieves high-efficiency video compression by transforming raw video frames into compact latent representations and reconstructing them through generative modeling. As illustrated in Figure 1, the framework is composed of two primary components: a Neural Encoder and a Generative Video Decoder.

[1] https://mp.weixin.qq.com/s/QrFAmGjQvHmgEgX9En4MjQ

Figure 1 Overview of our GVC framework, grounded in the Shannon–Weaver model (Shannon, 1948). Top-left: Level A addresses the technical problem, optimizing signal fidelity under limited bandwidth by minimizing distortion between input and output videos. Top-right: Level B focuses on the semantic problem, aiming to transmit precise semantic symbols. Bottom: Level C, central to the proposed Generative Video Compression (GVC) framework, emphasizes task-oriented effectiveness: it ensures that the compressed tokens enable the achievement of task goals - such as high-quality perceptual reconstruction or support for downstream tasks like segmentation.

The system begins by ingesting an input video sequence, which may include various types of content such as surveillance footage, video call streams, or live broadcasts.
This input video is processed by the Neural Encoder, a pre-trained neural network designed to compress the video into a set of compact representations, referred to as compressed tokens. These compressed tokens comprise both discrete and continuous representations, including compressed keyframes, high-level descriptors of video segments, and low-level continuous features. The encoder significantly reduces the dimensionality of the video data while preserving essential semantic information and motion dynamics. To further improve compression efficiency, the tokens are additionally encoded into a bitstream using techniques such as residual coding, reducing storage and transmission requirements.

On the decoder side, a pre-trained diffusion-based generative video model reconstructs the video from the compressed tokens. Some of these tokens serve as direct inputs to the denoising process, while others function as conditions. This reconstruction process is essentially a conditional video generation task, in which the model synthesizes video frames that are visually faithful to the original input. The final output is a reconstructed video that closely resembles the original in visual quality, with minimal perceptual loss, thereby achieving a balance between compression rate and visual quality.

2.2 Trading Computation for Compression Rate

A core idea in GVC is trading computation for compression rate. Instead of transmitting detailed visual data, GVC leverages powerful generative models at the decoder to reconstruct video content, thereby significantly reducing the bitrate required for transmission. Traditional codecs are designed to preserve signal fidelity under bitrate constraints using handcrafted signal processing techniques. In contrast, GVC shifts the burden of reconstruction to the decoder, using computation and prior knowledge embedded in generative models to synthesize realistic frames from minimal inputs.
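The encoder/decoder split described in Section 2.1 can be sketched end to end. Everything below is an illustrative stand-in of our own devising - the report’s actual encoder is a pre-trained neural network and its decoder a conditional diffusion model - here both are faked with trivial logic so that the data flow (frames → compressed tokens → reconstruction) is visible:

```python
from dataclasses import dataclass

@dataclass
class CompressedTokens:
    """Toy stand-in for the transmitted payload: discrete and side information."""
    keyframes: list   # sparse subset of frames (discrete part)
    descriptor: str   # high-level description of the segment
    n_frames: int     # how many frames the decoder must synthesize

def neural_encoder(frames: list, keyframe_stride: int = 8) -> CompressedTokens:
    """Toy encoder: keep only every k-th frame plus side information."""
    return CompressedTokens(
        keyframes=frames[::keyframe_stride],
        descriptor="toy segment descriptor",
        n_frames=len(frames),
    )

def generative_decoder(tokens: CompressedTokens) -> list:
    """Toy decoder: hold each keyframe until the next (a real decoder would
    run conditional diffusion sampling guided by the tokens)."""
    last = len(tokens.keyframes) - 1
    return [tokens.keyframes[min(i // 8, last)] for i in range(tokens.n_frames)]

frames = list(range(29))                  # a 29-frame GOP, as in the report
tokens = neural_encoder(frames)
recon = generative_decoder(tokens)
print(len(tokens.keyframes), len(recon))  # prints 4 29
```

In the real system the token payload would additionally be entropy-coded into a bitstream (e.g., with residual coding) before transmission; that step is omitted here.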
This shift can be illustrated metaphorically: traditional compression is like photographing a painting and sending the photo; GVC is like describing the painting’s composition and style, then having an “AI painter” recreate it. Modern generative models are capable of producing high-quality video given only latent representations or even random noise, thanks to strong learned priors. Therefore, the encoder’s role becomes one of selecting and transmitting the most task-relevant information, rather than preserving every pixel.

Critically, this means that what is transmitted depends on the purpose of the reconstructed video. If the goal is human perception, the encoder transmits features that help generate perceptually similar content. If the goal is machine understanding (e.g., segmentation, recognition), the encoder focuses on transmitting semantically meaningful representations. This represents a departure from fidelity-oriented compression toward task-oriented or effectiveness-oriented communication, aligning GVC with higher-level objectives beyond simple reconstruction accuracy.

2.3 Trading Compression Rate for Practicality

While trading computation for compression rate enables highly compact video representations through decoder-side generation, this approach faces practical limitations in real-world deployment. Specifically, the computational capacity of the decoder - constrained by hardware resources, power consumption, and latency requirements - imposes an upper bound on how much computation can be traded for compression. In many applications, such as real-time video conferencing or edge-device streaming, decoder-side latency and efficiency become critical bottlenecks. Consequently, the balance among reconstruction quality, compression rate, and computation must be re-evaluated with practicality as a core consideration.
To address this, our framework incorporates strategies that deliberately trade compression rate for practicality, ensuring that decoding remains feasible with acceptable reconstruction quality. One such strategy is to increase the richness of the compressed latent representations, thereby reducing the reliance on large generative models at the decoder and unlocking the ability to use smaller, faster models. Additionally, we apply model compression techniques to reduce the size and complexity of key components (e.g., 3D VAEs), and employ distillation and sampling-acceleration methods for diffusion-based decoders to lower inference time. In these cases, we often compensate for the quality loss due to model simplification by transmitting higher-dimensional or more informative features, striking a new balance in the compression rate-computation-quality triangle.

Ultimately, this trade-off reflects a practical extension of the GVC paradigm: while generative models enable extreme compression, real-world usability demands adaptive strategies that scale with available computational resources. By flexibly adjusting the amount of transmitted information and the complexity of generative inference, our framework ensures that GVC remains not only efficient in terms of bitrate, but also viable and responsive under practical deployment conditions.

3 Results

To validate the effectiveness of our GVC framework, we first assess video compression performance based on a 14B video generative model on the standard benchmark MCL-JCV (Wang et al., 2016). We employ a mainstream perceptual metric for evaluation, Learned Perceptual Image Patch Similarity (LPIPS), as it is widely recognized as a measure of human perceptual quality. As shown in Table 1, at an average bitrate of 0.008 bpp [2], our method maintains competitively high perceptual quality.
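The two-stage averaging protocol behind the reported bitrates (described in footnote 2: a per-sequence mean over complete GOPs first, then a mean across sequences) can be sketched as follows; the function names, frame dimensions, and per-frame bit counts are illustrative, not the report’s actual data:

```python
def sequence_bpp(frame_bits: list, width: int, height: int, gop: int = 29) -> float:
    """Mean bpp over the frames of one sequence, discarding trailing frames
    that do not form a complete GOP (per the report's footnote)."""
    n_kept = (len(frame_bits) // gop) * gop   # drop the incomplete trailing GOP
    kept = frame_bits[:n_kept]
    return sum(kept) / (n_kept * width * height)

def dataset_bpp(sequences: list, width: int, height: int) -> float:
    """Average per-sequence bpp first, then average across sequences."""
    return sum(sequence_bpp(s, width, height) for s in sequences) / len(sequences)

# Toy example: two sequences of 100x100 frames with made-up bit counts.
seqs = [[50] * 30, [100] * 58]   # 30 frames -> keep 29; 58 frames -> keep all
print(f"{dataset_bpp(seqs, 100, 100):.4f}")  # prints 0.0075
```

Note that this protocol weights every sequence equally regardless of its length, which is why it can differ from a naive mean over all frames in the dataset.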
In contrast, conventional video coding schemes exhibit a substantial performance gap at this bitrate. For certain challenging sequences, conventional methods need approximately 6 times the bitrate of our approach to attain equivalent perceptual reconstruction quality, as shown in Figure 2.

Figure 2 Bandwidth comparison for achieving comparable reconstruction quality. Traditional methods require more than 6 times the bandwidth to match the perceptual quality of our approach across selected representative sequences. (Left to right: HEVC at bpp=0.0058, LPIPS=0.487; Ours at bpp=0.0053, LPIPS=0.312; HEVC at bpp=0.035, LPIPS=0.319.)

[2] Note that the average bpp is calculated by first averaging over the frames of each sequence in the dataset and then averaging across all sequences. When testing each sequence, the trailing frames that do not form a complete GOP are discarded.

Table 1 Quantitative comparison on the MCL-JCV dataset. Lower values are better.

Method                        LPIPS ↓
HEVC (Sullivan et al., 2012)  0.278
Ours                          0.180

To further validate its practical utility, we apply the reconstruction results of the model to a downstream task: video object segmentation (VOS) on DAVIS2017 (Pont-Tuset et al., 2017). We evaluate performance using the Jaccard index J, contour accuracy F, their average (J&F), and contour recall (F-Recall). As shown in Table 2, our method achieves highly competitive performance. This indicates that even at low bitrates, our approach preserves correct semantic transmission.

Table 2 Downstream performance of different coding methods. ‘Upper-bound’ is obtained by evaluating the task models on the original videos.
Method           VOS: XMem on DAVIS2017
                 J&F (%)   J (%)   F (%)   F-Recall (%)
HEVC@bpp=0.01    57.68     56.84   58.51   67.44
Ours@bpp=0.01    75.22     71.17   79.28   91.87
Upper-bound      87.70     84.06   91.33   97.02

We have dedicated our effort to improving computational efficiency through techniques including model miniaturization, knowledge distillation, and quantization. These optimizations make our approach feasible for deployment on various hardware platforms. As demonstrated in Table 3, which reports the latency of our miniaturized model for generating a GOP of 29 frames (i.e., 29 frames at once) across different platforms, our system achieves practical inference speeds even on consumer-grade hardware.

Table 3 Model computational efficiency and hardware performance (GOP=29). Latency in seconds.

Resolution  Module   4090   A100   H200
480p        Encoder  0.95   0.64   0.2
            Decoder  1.35   1.4    1.13
720p        Encoder  1.15   0.80   0.3
            Decoder  6.4    5.5    2.3
1080p       Encoder  1.59   0.85   0.5
            Decoder  21.5   18     6.1

Although the miniaturized model incurs some loss in visual quality and bandwidth efficiency compared to its full-scale counterpart, it still maintains competitively high perceptual quality, as illustrated in Figure 3, where the shown video sequence achieves an LPIPS of 0.273. These results collectively demonstrate that our miniaturized model achieves an effective balance between computational efficiency and visual quality in practical deployment scenarios.

4 Conclusion

This report, under the AI Flow framework (An et al., 2025) and guided by the theory of Information Capacity (Yuan et al., 2025b), reimagines the foundation of video compression through the lens of Generative Video Compression (GVC) - a paradigm shift that prioritizes perceptual relevance and task effectiveness over pixel-level fidelity.
By asking whether high-quality video can be reconstructed at extreme compression, we not only challenge the limits of conventional codecs but also demonstrate the feasibility of trading computation for compression in an era of increasingly capable edge devices. Our findings show that, with the aid of modern generative video models, it is possible to achieve compelling reconstructions at extreme bitrates while maintaining visual realism and downstream task utility. Furthermore, we introduce the concept of trading compression rate for practicality, highlighting system designs that balance compression efficiency with inference latency and hardware constraints. Our implementation demonstrates that GVC can operate on consumer-grade GPUs with acceptable latency, making it viable for real-world deployment in domains such as remote surveillance, low-bandwidth mobile communication, and edge AI devices.

Figure 3 Visual quality comparison of the miniaturized model, demonstrating competitive perceptual quality despite model compression.

In conclusion, GVC is not just a compression technique - it embodies a task-oriented communication paradigm tailored for the era of generative intelligence. By transmitting only what is necessary for perception and decision-making, it opens the door to a new class of communication systems that are more efficient, adaptive, and intelligent. We hope this work inspires further research at the intersection of generative modeling, communication theory, and real-world deployment, pushing the boundaries of what is possible in extreme video compression.

References

Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, et al. AI Flow: Perspectives, Scenarios, and Approaches (2025). arXiv preprint arXiv:2506.12479, 2025.

Yuankai Fan, Qizhen Weng, and Xuelong Li. Computation-Bandwidth-Memory Trade-offs: A Unified Paradigm for AI Infrastructure. arXiv preprint arXiv:2601.11577, 2025.

OpenAI. Video Generation Models as World Simulators, 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/.

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv preprint arXiv:1704.00675, 2017.

Claude E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27(3):379–423, 1948.

Jiawei Shao and Xuelong Li. AI Flow at the Network Edge. IEEE Network, 2024.

Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, et al. WAN: Open and Advanced Large-scale Video Generative Models. arXiv preprint arXiv:2503.20314, 2025.

Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang, Ioannis Katsavounidis, Anne Aaron, and C.-C. Jay Kuo. MCL-JCV: A JND-based H.264/AVC Video Quality Assessment Dataset. In 2016 IEEE International Conference on Image Processing, pages 1509–1513, 2016.

Cheng Yuan, Zhening Liu, Jiashu Lv, Jiawei Shao, Yufei Jiang, Jun Zhang, and Xuelong Li. Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference. IEEE Transactions on Mobile Computing, pages 1–14, 2025a.

Cheng Yuan, Jiawei Shao, Chi Zhang, and Xuelong Li. Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression. arXiv preprint arXiv:2511.08066, 2025b.
