Bitrate Ladder Construction using Visual Information Fidelity

Bitrate Ladder Construction using Visual Information Fidelity
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recently proposed perceptually optimized per-title video encoding methods provide better BD-rate savings than fixed bitrate-ladder approaches that have been employed in the past. However, a disadvantage of per-title encoding is that it requires significant time and energy to compute bitrate ladders. Over the past few years, a variety of methods have been proposed to construct optimal bitrate ladders including using low-level features to predict cross-over bitrates, optimal resolutions for each bitrate, predicting visual quality, etc. Here, we deploy features drawn from Visual Information Fidelity (VIF) (VIF features) extracted from uncompressed videos to predict the visual quality (VMAF) of compressed videos. We present multiple VIF feature sets extracted from different scales and subbands of a video to tackle the problem of bitrate ladder construction. Comparisons are made against a fixed bitrate ladder and a bitrate ladder obtained from exhaustive encoding using Bjontegaard delta metrics.


💡 Research Summary

**
The paper addresses the high computational cost of per‑title video encoding, which constructs a content‑specific bitrate‑resolution ladder by exhaustively encoding a video at every combination of R resolutions and B bitrates. To avoid this costly process, the authors propose a machine‑learning approach that predicts the visual quality (VMAF) of compressed videos directly from features extracted from the uncompressed source using the Visual Information Fidelity (VIF) metric. VIF, a full‑reference quality index based on a Gaussian Scale Mixture (GSM) model, quantifies the amount of information that could be extracted by the human visual system from each subband of an image.

Nine distinct feature sets are constructed by computing VIF on four spatial scales (each with two sub‑bands), on frame‑difference images, and by adding the mean absolute luminance difference (MAD) used in VMAF. For each compressed video, the temporally‑averaged VIF‑based features are concatenated with three metadata items: the bitrate, the normalized width (w/3840), and the normalized height (h/3840). This results in a compact feature vector that can be computed without any additional compression.

The authors use the BVT‑1004K dataset, which contains 4K (3840 × 2160) clips cropped to 64 frames, 10‑bit depth, and mostly 60 fps. The dataset is split into 70 training, 10 validation, and 20 test videos, ensuring no title overlap. Each clip is encoded with libx265 (medium preset) at eight resolutions (3840 × 2160 down to 512 × 288) and constant‑quality CRF values ranging from 18 to 50. After up‑scaling the compressed streams back to the original resolution, VMAF and VIF scores are computed to serve as ground‑truth quality labels.

Various regressors (Extra‑Trees, XGBoost, Random Forest) are trained separately for each feature set. Across all experiments, the Extra‑Trees regressor consistently yields the lowest prediction error. Using the trained model, the authors predict VMAF for a predefined set of target bitrates {0.25, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.5 Mbps}. For each target bitrate, the resolution that maximizes the predicted VMAF is selected, forming a candidate bitrate ladder. Because regression errors can produce non‑monotonic ladders, a simple post‑processing pass enforces that the chosen resolution never increases when moving from higher to lower bitrates.

Performance is evaluated with Bjontegaard delta (BD) metrics against two baselines: (1) the fixed HLS ladder defined by Apple, and (2) a reference ladder obtained by exhaustive encoding (the “ground‑truth” Pareto front). The reference ladder yields an average BD‑Rate improvement of –17.95 % (i.e., 17.95 % bitrate saving) and a BD‑VMAF gain of +4.38 over the fixed ladder. Among the nine feature sets, the combination “I_k,b


Comments & Academic Discussion

Loading comments...

Leave a Comment