Tail-Aware Post-Training Quantization for 3D Geometry Models
The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contributions are threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine strategy that constructs a highly compact calibration subset achieving both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which uses a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.
💡 Research Summary
The paper addresses the pressing need to deploy large‑scale 3D geometry foundation models—such as VGGT, DUSt3R, and MASt3R—on edge devices with limited compute and memory. While Post‑Training Quantization (PTQ) has proven effective for 2D vision transformers and large language models, directly applying existing PTQ pipelines to 3D models leads to two major problems: (1) severe degradation in accuracy caused by the highly heterogeneous, multi‑view activation distributions, and (2) prohibitive calibration overhead because each layer must be evaluated over many candidate quantization intervals. To overcome these challenges, the authors propose TAPTQ (Tail‑Aware Post‑Training Quantization), a three‑stage pipeline specifically designed for 3D geometry models.
Stage 1 – Progressive Calibration Set Construction
The authors first observe that random sampling of calibration data in 3D tasks yields high variance in quantization performance, due to noisy or degenerate views. They therefore construct a compact yet representative calibration set through a two‑step clustering process. An unlabeled 3D dataset is first encoded by a self‑supervised point‑cloud encoder into a latent space. A coarse K‑means with K = 2 separates the majority of clean samples from a minority of outliers; the larger cluster is retained as a “pure” pool. Next, the desired number of calibration samples N is used to create N/2 fine‑grained clusters within the pure pool. From each fine cluster, the two samples closest to the centroid are selected, yielding a final set C of size N that captures both global geometric diversity and local statistical variation while keeping the calibration budget extremely low.
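The two-step clustering procedure above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: it assumes the latent `features` matrix has already been produced by the self-supervised point-cloud encoder, and the scikit-learn `KMeans` settings (`n_init`, `random_state`) are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_calibration_set(features, n_calib, seed=0):
    """Coarse-to-fine calibration selection (sketch of Stage 1).

    features : (M, D) array of latent embeddings from a point-cloud encoder.
    n_calib  : desired calibration-set size N (assumed even).
    Returns indices into the original dataset.
    """
    # Coarse step: K = 2 separates the clean majority from outliers;
    # keep the larger ("pure") cluster.
    coarse = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(features)
    labels, counts = np.unique(coarse.labels_, return_counts=True)
    pure_idx = np.where(coarse.labels_ == labels[np.argmax(counts)])[0]
    pure = features[pure_idx]

    # Fine step: N/2 clusters inside the pure pool; from each cluster,
    # take the two samples closest to the centroid.
    k_fine = n_calib // 2
    fine = KMeans(n_clusters=k_fine, n_init=10, random_state=seed).fit(pure)
    selected = []
    for c in range(k_fine):
        members = np.where(fine.labels_ == c)[0]
        dists = np.linalg.norm(pure[members] - fine.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:2]])
    return pure_idx[np.array(selected)]
```

Selecting two near-centroid samples per fine cluster gives exactly N samples while covering N/2 distinct geometric modes of the pure pool.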
Stage 2 – Quantization Interval Optimization via Ternary Search
Traditional PTQ searches for the optimal scaling factor Δ by exhaustively evaluating every candidate interval, incurring O(N) forward passes per layer. The authors empirically demonstrate that the similarity metric Sim(y_FP, y_α) (negative weighted L2 distance between full‑precision and quantized outputs) exhibits an almost unimodal shape with respect to Δ: a narrow interval causes clipping error, a wide interval causes rounding error, and a single sweet spot balances the two. Leveraging this property, they replace the linear scan with a ternary‑search algorithm that repeatedly evaluates two interior points and discards the third of the search range lying beyond the worse‑scoring point, reducing the search complexity to O(log N). This logarithmic search cuts calibration time for large 3D transformers by more than half, making PTQ feasible even for models with thousands of layers and dense token sequences.
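The ternary-search idea can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `fake_quant` helper and the unweighted negative-L2 similarity are simplifying assumptions (the paper uses a weighted L2 distance), and the iteration count is an arbitrary choice.

```python
import numpy as np


def ternary_search_interval(sim, lo, hi, iters=30):
    """Find the clipping scale Δ maximizing a (near-)unimodal similarity
    sim(Δ) by shrinking the search range by one third per iteration."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if sim(m1) < sim(m2):
            lo = m1  # maximum lies in [m1, hi]; drop the left third
        else:
            hi = m2  # maximum lies in [lo, m2]; drop the right third
    return 0.5 * (lo + hi)


def fake_quant(x, delta, n_bits=8):
    """Symmetric uniform quantization with clipping range [-delta, delta]."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = delta / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def neg_l2_sim(x, delta):
    """Unweighted stand-in for Sim(y_FP, y_α): negative L2 distance."""
    return -np.linalg.norm(x - fake_quant(x, delta))


# Usage: search the clipping scale for one layer's activations.
np.random.seed(0)
acts = np.random.randn(1024) * 2.0
best_delta = ternary_search_interval(lambda d: neg_l2_sim(acts, d), 1e-3, 10.0)
```

Each iteration keeps two thirds of the range, so roughly log_{3/2}(N) similarity evaluations replace the N evaluations of a linear scan.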
Stage 3 – TRE‑Guided Module‑wise Compensation
Even with optimal intervals, quantization errors can accumulate across transformer layers, especially in tasks that demand high geometric fidelity (camera pose estimation, dense reconstruction). To identify the most vulnerable layers, the authors introduce Tail Relative Error (TRE), which measures the average relative error over the top‑p % (e.g., 5 %) of activation magnitudes—the “tail” of the distribution. If a layer’s TRE exceeds a predefined threshold, a lightweight compensation step is applied: either an AdaRound‑style re‑rounding of weights, a small scaling adjustment, or a brief fine‑tuning pass with a tiny learning rate. Because TRE focuses only on the tail, only a small subset (≈10–15 %) of layers receives compensation, keeping overhead minimal while dramatically reducing overall error propagation.
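The TRE metric described above can be written down directly. This is a minimal sketch under the stated definition; the `eps` stabilizer and the threshold value `tau` are illustrative assumptions, not values from the paper.

```python
import numpy as np


def tail_relative_error(y_fp, y_q, p=0.05):
    """Mean relative error over the top-p fraction of full-precision
    activation magnitudes (the distribution's "tail")."""
    flat_fp = y_fp.ravel()
    flat_q = y_q.ravel()
    k = max(1, int(np.ceil(p * flat_fp.size)))
    tail = np.argsort(np.abs(flat_fp))[-k:]  # indices of the largest magnitudes
    eps = 1e-8                               # guard against divide-by-zero
    return float(np.mean(np.abs(flat_fp[tail] - flat_q[tail])
                         / (np.abs(flat_fp[tail]) + eps)))


def needs_compensation(y_fp, y_q, tau=0.1, p=0.05):
    """Flag a module for compensation when its TRE exceeds threshold tau."""
    return tail_relative_error(y_fp, y_q, p) > tau
```

Because the metric ignores the bulk of the distribution, a layer with accurate typical activations but badly distorted outliers is still flagged, which is exactly the failure mode long-tailed 3D activations exhibit.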
Experimental Validation
The authors evaluate TAPTQ on two prominent 3D benchmarks: VGGT and Pi3. Using a 4‑bit weight / 8‑bit activation configuration (W4A8), TAPTQ consistently outperforms state‑of‑the‑art PTQ methods such as PTQ4ViT and SmoothQuant, achieving 1.2–2.3 percentage‑point gains in overall accuracy. Notably, pose‑estimation metrics improve by 3–4 pp. Calibration efficiency is also markedly better: with only 4, 8, or 20 calibration scans, performance remains stable, and total calibration time drops by more than 45 % compared to exhaustive search (from >500 GPU‑minutes to under 300). An ablation study shows that TRE‑guided compensation, despite affecting only a minority of layers, accounts for over 60 % of the total error reduction.
Conclusion and Future Work
TAPTQ demonstrates that tailoring PTQ to the unique statistical properties of 3D geometry data—through progressive, statistically pure calibration set construction, logarithmic‑time interval search, and tail‑aware selective compensation—can bridge the gap between high‑performance 3D foundation models and resource‑constrained deployment. The authors plan to extend the framework to ultra‑low‑bit quantization (2–3 bits), explore real‑time streaming scenarios, and assess compatibility with alternative 3D tokenizations such as voxel grids and point‑net embeddings. The code will be released soon, facilitating broader adoption in AR, autonomous driving, and robotics applications.