Content-Driven Frame-Level Bit Prediction for Rate Control in Versatile Video Coding
Rate control allocates bits efficiently across frames to meet a target bitrate while maintaining quality. Conventional two-pass rate control (2pRC) in Versatile Video Coding (VVC) relies on analytical rate-QP models, which often fail to capture nonlinear spatial-temporal variations, causing quality instability and high complexity due to multiple trial encodes. This paper proposes a content-adaptive framework that predicts frame-level bit consumption using lightweight features from the Video Complexity Analyzer (VCA) and quantization parameters within a Random Forest regression. On ultra-high-definition sequences encoded with VVenC, the model achieves strong correlation with ground truth, yielding R² values of 0.93, 0.88, and 0.77 for I-, P-, and B-frames, respectively. Integrated into a rate-control loop, it achieves comparable coding efficiency to 2pRC while reducing total encoding time by 33.3%. The results show that VCA-driven bit prediction provides a computationally efficient and accurate alternative to conventional rate-QP models.
💡 Research Summary
The paper addresses the long‑standing challenge of accurate rate control in Versatile Video Coding (VVC), where conventional two‑pass rate control (2pRC) relies on analytical rate‑QP models that often fail to capture the highly non‑linear relationship between content characteristics and bit consumption. To overcome this limitation, the authors propose a content‑adaptive framework that predicts the number of bits required for each frame directly from lightweight video‑complexity features and the quantization parameter (QP).
The core of the approach is the Video Complexity Analyzer (VCA), an open‑source tool that extracts seven descriptors per frame: spatial texture energy for Y, U, and V channels (E_Y, E_U, E_V), average brightness for the same channels (L_Y, L_U, L_V), and a temporal‑gradient measure h that quantifies inter‑frame texture variation. These features are inexpensive to compute (block‑level DCT analysis) and already encode both spatial and temporal information, making them suitable for real‑time bitrate estimation.
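To make the spatial descriptors concrete, the following sketch computes a VCA-style texture energy for one plane: the mean absolute AC energy of non-overlapping block DCTs. This is an illustration of the idea, not the VCA implementation itself; the block size and normalization are assumptions, and the matrix-based DCT stands in for VCA's optimized transform.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (n x n)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    d = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    d[0, :] = np.sqrt(1.0 / n)
    return d

def texture_energy(plane, block=32):
    """Mean absolute AC-coefficient energy over non-overlapping DCT
    blocks -- a sketch of a VCA-style spatial texture feature E."""
    d = dct_matrix(block)
    h, w = plane.shape
    h -= h % block
    w -= w % block
    total, count = 0.0, 0
    for y in range(0, h, block):
        for x in range(0, w, block):
            b = plane[y:y + block, x:x + block].astype(np.float64)
            coeffs = d @ b @ d.T   # 2-D DCT via separable transform
            coeffs[0, 0] = 0.0     # drop DC: brightness belongs to L, not E
            total += np.abs(coeffs).sum()
            count += 1
    return total / max(count, 1)
```

A flat plane yields near-zero energy (all signal in the discarded DC term), while textured content scores high; the brightness features L are simply per-plane means, and the temporal gradient h compares texture between consecutive frames.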
Three separate regression models are trained—one for I‑frames, one for P‑frames, and one for B‑frames—because each frame type exhibits distinct coding behavior. The I‑frame model uses the seven VCA features plus the current QP. The P‑frame model adds the temporal gradient relative to the reference frame (h_ref) and the reference QP (q_ref). The B‑frame model further includes gradients and QPs for both past and future references (h_ref1, h_ref2, q_ref1, q_ref2). Random Forest regression is selected after comparing linear regression, XGBoost, and Random Forest; the latter consistently yields the highest coefficient of determination (R²) and the lowest mean absolute percentage error (MAPE). With 100 trees, a maximum depth of 16, and modest split/leaf constraints, the model remains lightweight and fast enough for integration into a live encoder.
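The per-frame-type setup above can be sketched as follows. The feature names mirror the paper's notation; the exact split/leaf constraints are not stated, so the values below are placeholders for "modest constraints".

```python
from sklearn.ensemble import RandomForestRegressor

# Per-frame-type feature sets, following the paper's description.
VCA = ["E_Y", "E_U", "E_V", "L_Y", "L_U", "L_V", "h"]
FEATURES = {
    "I": VCA + ["qp"],
    "P": VCA + ["qp", "h_ref", "q_ref"],
    "B": VCA + ["qp", "h_ref1", "h_ref2", "q_ref1", "q_ref2"],
}

def make_model():
    """RF configuration from the paper; min_samples_* are assumed values."""
    return RandomForestRegressor(
        n_estimators=100,      # 100 trees
        max_depth=16,          # maximum depth of 16
        min_samples_split=4,   # assumed ("modest" constraint)
        min_samples_leaf=2,    # assumed ("modest" constraint)
        n_jobs=-1,
        random_state=0,
    )

models = {ftype: make_model() for ftype in FEATURES}
```

Each of the three models is then fitted only on frames of its own type, so the distinct rate behavior of intra, uni-, and bi-predicted frames never gets averaged into one regressor.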
Training is performed on the Inter‑4K dataset (1,000 UHD sequences) using 5‑fold cross‑validation. The resulting R² scores are 0.93 for I‑frames, 0.88 for P‑frames, and 0.77 for B‑frames, with MAPE values below 9 % across all types. SHAP analysis confirms that QP is the dominant predictor, but texture energy (E_Y) and temporal gradients (h) also contribute significantly, validating the relevance of the VCA features.
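The evaluation protocol can be reproduced in outline with scikit-learn's 5-fold cross-validation and its built-in MAPE scorer. The data below is synthetic stand-in material with a toy exponential bits-vs-QP shape; the Inter-4K features themselves are of course not bundled here, so the printed scores are not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Stand-in data: 8 columns (7 VCA descriptors + QP in the last column).
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 8))
y = np.exp(-4.0 * X[:, -1]) * (1.0 + 2.0 * X[:, 0])  # toy bits-vs-QP law

scores = cross_validate(
    RandomForestRegressor(n_estimators=100, max_depth=16, random_state=0),
    X, y, cv=5,
    scoring=("r2", "neg_mean_absolute_percentage_error"),
)
print(f"R2   = {scores['test_r2'].mean():.2f}")
print(f"MAPE = {-scores['test_neg_mean_absolute_percentage_error'].mean():.1%}")
```

For the feature-importance step, the paper's SHAP analysis corresponds to running a tree explainer over the fitted forest, which attributes each prediction to individual features and confirms QP, E_Y, and h as the main drivers.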
In the second encoding pass, the predicted bit count b̂ from the first pass is compared with the target bit budget b′ for each frame. The encoder's internal rate-QP relationship is then used to adjust the QP: an initial correction based on a low-rate scaling constant (c_low) is computed, followed by a high-rate refinement using c_high. This procedure maps the model's bit estimate to an appropriate QP without requiring additional trial encodings, thereby preserving single-pass latency while still benefiting from the content-aware prediction.
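A minimal sketch of that correction step, assuming the common rule of thumb that bitrate shrinks roughly exponentially with QP: the mismatch ratio b̂/b′ is mapped to a logarithmic QP offset, with c_low applied first and c_high taking over for large deviations. The constants, threshold, and clipping range here are placeholders, not the paper's values.

```python
import math

def second_pass_qp(base_qp, predicted_bits, target_bits,
                   c_low=4.2, c_high=6.0, qp_min=0, qp_max=63):
    """Map the predicted/target bit mismatch to a corrected QP.

    Sketch only: c_low / c_high are hypothetical scaling constants, and
    the log2 form assumes bits fall roughly exponentially with QP.
    """
    ratio = predicted_bits / target_bits
    delta = c_low * math.log2(ratio)       # initial low-rate correction
    if abs(delta) > 3.0:                   # refine large deviations
        delta = c_high * math.log2(ratio)  # high-rate constant takes over
    return int(max(qp_min, min(qp_max, round(base_qp + delta))))
```

When the model predicts an overshoot (b̂ > b′) the QP is raised to spend fewer bits, and lowered on an undershoot; because b̂ comes from the regressor rather than a trial encode, no extra encoding pass is needed.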
Evaluation on JVET CTC test sequences (classes A1 and A2) shows that the proposed method achieves coding efficiency comparable to 2pRC. The average Bjøntegaard Delta‑Rate (BD‑Rate) is –0.14 % for the proposed scheme versus +0.26 % for 2pRC, indicating that the new approach can even improve quality on several sequences. Target bitrate deviation is only 0.05 %, demonstrating excellent rate stability. Crucially, encoding speed is dramatically increased: the first pass of 2pRC runs at about 0.40 fps, whereas the VCA‑driven pass exceeds 10 fps, a 25‑fold speedup. Overall encoding time is reduced by 33.3 % compared with 2pRC and by 6 % compared with a conventional one‑pass fixed‑QP run.
The paper acknowledges that B‑frame prediction is less accurate (R² = 0.77) due to the complexity of bidirectional prediction, and suggests that future work could explore deeper temporal models (e.g., LSTM or transformer‑based networks) or richer feature sets to close this gap. Nonetheless, the study convincingly demonstrates that a simple, interpretable set of VCA features combined with a Random Forest regressor can replace analytically derived rate‑QP models, delivering comparable rate‑distortion performance while substantially lowering computational cost. This makes the approach attractive for real‑time streaming, large‑scale transcoding, and any scenario where encoding resources are at a premium.