Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference
The demand for immersive and interactive communication has driven advances in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains challenging. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods such as NeRF suffer from prohibitive computational costs. We therefore propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. In addition, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to improve transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates and making it well suited for real-time 3D video conferencing applications.
💡 Research Summary
The paper tackles the problem of delivering high‑fidelity 3D talking‑face video in real‑time conferencing while keeping the transmission bitrate extremely low. Traditional 2‑D codecs (HEVC, AV1) only compress pixel streams and cannot represent 3‑D geometry, whereas implicit neural representations such as NeRF provide excellent visual quality but are far too computationally heavy for live use. The authors propose a hybrid framework that combines a parametric 3‑D morphable model (FLAME) with the efficient rendering technique of 3‑D Gaussian Splatting (3DGS).
In the encoding stage, a lightweight network extracts FLAME expression (ψ) and pose (θ) parameters from each input frame. These parameters are quantized and entropy-coded with zero-order Exponential-Golomb codes, drastically reducing the amount of data that must be sent per frame. Because FLAME expression vectors lie in a PCA basis, most facial variation is captured by as few as 10–20 dimensions; expression data still accounts for roughly 82 % of the transmitted bits, but it tolerates heavy compression without noticeable quality loss.
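The parameter coding step can be sketched as follows. This is a minimal illustration of uniform quantization followed by zero-order Exponential-Golomb coding; the quantization step size and the signed-to-unsigned mapping are assumptions for illustration, not the paper's exact configuration.

```python
def exp_golomb_encode(value: int) -> str:
    """Zero-order Exp-Golomb code for a non-negative integer."""
    x = value + 1
    bits = x.bit_length()
    # (bits - 1) leading zeros, then the binary representation of x
    return "0" * (bits - 1) + format(x, "b")

def signed_to_unsigned(v: int) -> int:
    """Map signed indices to non-negative ones: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * v if v >= 0 else -2 * v - 1

def encode_params(params, step: float = 0.01) -> str:
    """Uniformly quantize each float, then Exp-Golomb code the signed index."""
    bitstream = ""
    for p in params:
        q = round(p / step)  # uniform quantization to an integer index
        bitstream += exp_golomb_encode(signed_to_unsigned(q))
    return bitstream

# Small values near zero (typical for PCA expression coefficients) yield
# very short codewords, which is what makes this coder attractive here.
print(encode_params([0.0, 0.01, -0.02]))
```

Zero-order Exp-Golomb assigns the shortest codeword ("1") to zero, so near-zero PCA coefficients cost very few bits.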
The decoder receives a pre‑distributed personalized face model. This model consists of a fixed set of anisotropic Gaussians placed on the FLAME mesh via UV sampling. Each Gaussian is defined by position µ, base color h_base, higher‑order spherical‑harmonic coefficients h_rest, scale s, rotation r, and opacity o. Position µ is regenerated on‑the‑fly from the FLAME mesh and an MLP that predicts offsets based on the current expression vector, so µ itself does not need to be transmitted. The remaining attributes are compressed: low‑precision attributes (h_rest, r, o) are encoded as low‑dimensional latent vectors decoded by tiny neural decoders, while high‑precision attributes (h_base, s) are directly quantized and entropy‑coded. All components, together with the MLP weights, are further reduced with LZ77 lossless compression, shrinking the model from 4.3 MB to 0.59 MB (≈ 7× reduction).
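The high-precision branch of this attribute compression (direct quantization plus an LZ77-family lossless pass) can be sketched as below. The attribute shapes, the quantization step, and the use of DEFLATE (via `zlib`) as the LZ77-style codec are illustrative assumptions, not the paper's exact pipeline.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for per-Gaussian high-precision attributes: base color and scale
h_base = rng.standard_normal((10000, 3)).astype(np.float32)
scale = rng.standard_normal((10000, 3)).astype(np.float32)

def compress_high_precision(attr: np.ndarray, step: float = 1e-3) -> bytes:
    """Uniformly quantize to int16, then apply LZ77-style lossless coding."""
    q = np.clip(np.round(attr / step), -32768, 32767).astype(np.int16)
    return zlib.compress(q.tobytes(), level=9)

raw = h_base.nbytes + scale.nbytes
packed = len(compress_high_precision(h_base)) + len(compress_high_precision(scale))
print(f"{raw} B -> {packed} B")
```

The low-precision attributes (h_rest, r, o) would instead be stored as small latent vectors and expanded by tiny neural decoders at the receiver, trading a little decode compute for a much smaller payload.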
The MLP that predicts Gaussian offsets is also lightweight: weight pruning, layer reduction, FP16 quantization, and LZ77 compression bring its size down to a few hundred kilobytes. At runtime, the decoder reconstructs the Gaussian attributes, applies the MLP offsets, and renders the face using the 3DGS volume‑rendering pipeline. The system achieves over 170 fps on an RTX 4090, enabling multiple simultaneous users in a conference.
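The MLP slimming steps above can be sketched as a simple pipeline of magnitude pruning, FP16 quantization, and an LZ77-family lossless pass. The layer sizes, pruning ratio, and use of `zlib` (DEFLATE) are assumptions for illustration only.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a small offset-MLP: three 64x64 FP32 weight matrices
weights = [rng.standard_normal((64, 64)).astype(np.float32) for _ in range(3)]

def slim(weight_list, prune_ratio: float = 0.5) -> bytes:
    """Zero out small-magnitude weights, cast to FP16, then DEFLATE-pack."""
    payload = b""
    for w in weight_list:
        thresh = np.quantile(np.abs(w), prune_ratio)   # magnitude threshold
        pruned = np.where(np.abs(w) < thresh, 0.0, w)  # prune small weights
        payload += pruned.astype(np.float16).tobytes() # FP16 quantization
    return zlib.compress(payload, level=9)

raw = sum(w.nbytes for w in weights)
print(raw, "B ->", len(slim(weights)), "B")
```

Pruning is what makes the final lossless pass effective: the runs of zeroed weights are highly compressible, so the three steps compound rather than merely add up.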
Experiments were conducted on 512 × 512 videos (≈ 2,500 frames each) from public datasets and a self‑collected set. The proposed method was compared against x265 in low‑delay preset (LDP) and a NeRF‑based 3‑D talking‑face compression approach. Metrics (PSNR, SSIM, LPIPS) show that at bitrates below 40 kbps the proposed system consistently outperforms x265 and the NeRF baseline, preserving fine details such as teeth and eyelids that the baselines either blur or block. Qualitative results confirm smoother, artifact‑free renderings. An ablation study demonstrates that each compression step (latent representation, MLP pruning, quantization/entropy coding, LZ77) contributes to size reduction while incurring less than 0.15 dB PSNR loss.
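Of the metrics in this comparison, PSNR is the simplest to state precisely; a minimal implementation for 8-bit images is shown below (the test image here is synthetic, not from the paper's datasets).

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between reference and reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# A 512x512 flat gray image with a single pixel off by one level
ref = np.full((512, 512), 128, dtype=np.uint8)
rec = ref.copy()
rec[0, 0] = 129
print(round(psnr(ref, rec), 1))
```

SSIM and LPIPS measure structural and perceptual similarity respectively, which is why all three are reported together in rate-distortion studies.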
In summary, the paper delivers a practical solution for low‑bitrate, high‑quality 3‑D talking‑face transmission by (1) sending only compact FLAME metadata, (2) using a fixed Gaussian head model that can be reconstructed on the fly, (3) applying multi‑stage attribute compression, and (4) aggressively slimming the offset MLP. The resulting system meets the stringent latency and bandwidth constraints of real‑time video conferencing, and the authors outline future work on multi‑user scalability and robustness to network impairments.