For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive because it enables efficient inference without sacrificing accuracy. However, DBF's scaling parameters are too restrictive: after factoring out signs, all rank components share the same magnitude profile, which leads to performance saturation. We propose Multi-Envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-l envelope. By sharing the sign matrices among envelope components, MDBF keeps the carrier effectively binary and spends the limited memory budget on magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF improves perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.
Large language models (LLMs) underpin many NLP systems, but their size makes deployment expensive: storing FP16 or FP32 parameters and moving them through the memory hierarchy often dominates both the memory footprint and inference latency. Quantization is therefore a central tool for efficient deployment. Post-training quantization (PTQ) (Frantar et al., 2022; Lin et al., 2024) is particularly appealing because it can be applied to a pretrained model with minimal overhead, avoiding full retraining. Although recent PTQ methods maintain strong accuracy at around 4-bit precision, performance typically degrades as precision approaches the 2- to 1-bit regime, where the per-layer information budget is extremely limited. To push below 2 bits, many approaches move beyond elementwise quantization and adopt structured parameterizations (Chee et al., 2023; Tseng et al., 2024a,b; Malinovskii et al., 2024a). Binary and near-binary schemes are especially appealing because they provide a clear hardware fast path: most computation can be performed by specialized kernels operating on bit-packed sign matrices, with only lightweight higher-precision scaling.
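To make the fast path concrete, the following NumPy sketch (ours, not any specific kernel from the cited work; shapes are arbitrary) verifies the algebraic identity that bit-packed kernels exploit: a {±1} sign matrix S can be stored as a 0/1 bit matrix B, and S x reduces to integer-style accumulation over B plus one correction term.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 16
S = rng.choice([-1.0, 1.0], size=(N, M))   # 1-bit sign matrix
x = rng.standard_normal(M)

# Represent S by its 0/1 bit matrix B, where S = 2B - 1; real kernels
# store B bit-packed and use popcount-style reductions over the bits.
B = (S > 0).astype(np.float64)

# Identity behind the fast path: S @ x = 2 * (B @ x) - sum(x).
assert np.allclose(S @ x, 2.0 * (B @ x) - x.sum())
```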
A prominent family of methods factorizes each weight matrix into low-rank components and then binarizes the factors. OneBit (Xu et al., 2024) shows that appropriate scaling can stabilize 1-bit factors, while Double Binary Factorization (DBF) (Boža and Macko, 2025) makes the binary path explicit by composing two binary matrix multiplications with interleaved diagonal scalings. LittleBit (Lee et al., 2025) further improves accuracy at extreme bit widths through multi-scale scaling and residual compensation, relying on quantization-aware training (QAT) across multiple GPUs. Despite these advances, existing formats share a key structural limitation: after demodulation, factor magnitudes are confined to a single rank-one envelope. Increasing the inner rank primarily adds sign diversity rather than magnitude expressiveness; as a result, under a fixed bits-per-weight budget, accuracy can saturate because the additional bits buy signs rather than magnitudes.
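The single-envelope limitation can be illustrated numerically. Under one simplified reading of a DBF-style format (a sketch with hypothetical shapes, not the exact formulation from the cited paper), the reconstruction is W_hat = D_a S1 D_m S2 D_b with S1, S2 ∈ {±1}; expanding it componentwise shows that every inner-rank component carries the same rank-one magnitude profile a b^⊤, differing only in scale m_k and sign pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, r = 6, 5, 4

# Schematic DBF-style reconstruction: W_hat = D_a @ S1 @ D_m @ S2 @ D_b.
S1 = rng.choice([-1.0, 1.0], size=(N, r))
S2 = rng.choice([-1.0, 1.0], size=(r, M))
a, m, b = rng.random(N) + 0.1, rng.random(r) + 0.1, rng.random(M) + 0.1

W_hat = (a[:, None] * S1) @ (m[:, None] * S2) * b[None, :]

# Componentwise, W_hat = sum_k m_k * (a b^T) (.) (s1_k s2_k^T):
# the magnitude profile a b^T is the SAME rank-one "envelope" for
# every inner-rank component k; only its scale and signs differ.
comp = sum(m[k] * np.outer(a * S1[:, k], b * S2[k, :]) for k in range(r))
assert np.allclose(W_hat, comp)
```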
This paper addresses the DBF bottleneck by explicitly allocating the limited expressivity available at extremely low precision to the components that matter most. We propose Multi-Envelope Double Binary Factorization (MDBF), which retains the shared 1-bit sign bases and the deployment-friendly binary fast path while replacing the rank-one magnitude envelope with multiple demodulated envelope modes. Figure 1 shows that increasing the envelope rank l systematically reduces reconstruction error, whereas increasing the residual path P, as in LittleBit (Lee et al., 2025), is often less effective within a fixed bits-per-weight budget. MDBF adds a small number of real-valued degrees of freedom for magnitude modeling, better aligning with the empirically observed low-rank structure of Transformer weights, which are rarely rank-one. To make MDBF applicable to layer-wise PTQ, we introduce a layer-wise optimization pipeline with closed-form initialization followed by ADMM refinement. Across the LLaMA and Qwen families, MDBF improves perplexity and zero-shot accuracy over previous binary formats at matched bits per weight, particularly in the challenging 2- to 1-bit range, while maintaining the same deployment-friendly binary inference primitive.
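As a hedged illustration of the parameterization described above (the exact MDBF form is defined later in the paper; this sketch, with hypothetical names U, Mm, V, only reflects the description "shared 1-bit sign bases, rank-l envelope"):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, r, l = 6, 5, 4, 3

# Shared 1-bit sign bases: the binary carrier is unchanged from DBF.
S1 = rng.choice([-1.0, 1.0], size=(N, r))
S2 = rng.choice([-1.0, 1.0], size=(r, M))

# l real-valued envelope modes; one plausible reading of the text is
#   W_hat = sum_p D_{u_p} @ S1 @ D_{m_p} @ S2 @ D_{v_p},
# i.e. l magnitude modes modulating the SAME packed sign matrices, so
# the bit-packed fast path is reused l times rather than duplicated.
U = rng.random((l, N)) + 0.1
Mm = rng.random((l, r)) + 0.1
V = rng.random((l, M)) + 0.1

W_hat = sum(
    (U[p][:, None] * S1) @ (Mm[p][:, None] * S2) * V[p][None, :]
    for p in range(l)
)
assert W_hat.shape == (N, M)
```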
• Identifying the DBF Bottleneck: We identify DBF's bottleneck as its single-envelope constraint. Under a fixed bits-per-weight budget, modeling magnitude variation yields greater accuracy gains than increasing sign diversity.
• Multi-Envelope Generalization of DBF: We propose Multi-Envelope DBF (MDBF), which retains the shared 1-bit sign bases and maintains the same deployment-friendly binary fast path while replacing the rank-one magnitude envelope with a rank-l envelope.
• Initialization and ADMM Refinement for MDBF: We generalize the initialization of LittleBit and the ADMM-based refinement of DBF to the multi-envelope setting, yielding a closed-form initializer and an efficient alternating ADMM refinement procedure (a simplified sketch of the alternating idea follows this list).
• Empirical Validation: Across the LLaMA and Qwen model families, MDBF consistently reduces reconstruction error and improves perplexity and zero-shot accuracy compared to prior binary formats at matched BPW.
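To illustrate the alternating structure referenced above, the sketch below shows why the envelope admits a closed-form least-squares update once the sign bases are fixed: the reconstruction is linear in the middle scales. It is a simplified stand-in under our own reading of the factorization, not the paper's ADMM procedure.

```python
import numpy as np

def envelope_step(W, S1, S2, a, b):
    """With sign bases S1, S2 and outer scales a, b held fixed, the
    reconstruction W_hat = D_a S1 D_m S2 D_b is LINEAR in the middle
    scales m, so m has a closed-form least-squares solution."""
    L = a[:, None] * S1            # D_a @ S1, shape (N, r)
    R = S2 * b[None, :]            # S2 @ D_b, shape (r, M)
    # Design matrix: column k is vec(L[:, k] R[k, :]^T).
    Phi = np.stack(
        [np.outer(L[:, k], R[k]).ravel() for k in range(S1.shape[1])],
        axis=1,
    )
    m, *_ = np.linalg.lstsq(Phi, W.ravel(), rcond=None)
    return m

# Tiny usage check on random data.
rng = np.random.default_rng(3)
N, M, r = 6, 5, 4
W = rng.standard_normal((N, M))
S1 = rng.choice([-1.0, 1.0], (N, r))
S2 = rng.choice([-1.0, 1.0], (r, M))
a, b = rng.random(N) + 0.1, rng.random(M) + 0.1
m = envelope_step(W, S1, S2, a, b)
W_hat = (a[:, None] * S1) @ (m[:, None] * S2) * b[None, :]
print(np.linalg.norm(W - W_hat))  # residual after one envelope update
```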
Vectors are denoted by bold lowercase letters, e.g., x, and matrices by uppercase letters, e.g., W. Throughout, W ∈ R^{N×M} denotes a real-valued weight matrix. We denote by ⊙ the Hadamard (elementwise) product, and for a vector a we let D_a denote the diagonal matrix with (D_a)_{ii} = a_i. We use ∥·∥_F for the Frobenius norm and ⟨A, B⟩_F := Tr(A^⊤B) for the corresponding Frobenius inner product. For any matrix A, its singular values are denoted by σ_i(A).
Finally, the entrywise sign function sign(•) maps to {±1}, with sign(0) = +1.
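Note that this sign convention differs from NumPy's default (np.sign(0) = 0). Minimal helpers matching the notation above, for readers following along in code:

```python
import numpy as np

def sign_pm1(x):
    """Entrywise sign into {+1, -1} with the paper's convention
    sign(0) = +1 (np.sign maps 0 to 0, which would break binarization)."""
    return np.where(x >= 0, 1.0, -1.0)

def D(a):
    """Diagonal matrix D_a with (D_a)_ii = a_i."""
    return np.diag(a)
```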
A common approach to model compression exploits the empirically observed approximate low-rank structure of weight matrices. Given a weight matrix W ∈ R^{N×M}, one seeks two factors of rank r ≪ min(N, M) whose product approximates W in Frobenius norm.
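A minimal sketch of this standard construction: by the Eckart–Young theorem, the truncated SVD gives the best rank-r approximation in Frobenius norm.

```python
import numpy as np

def lowrank(W, r):
    """Best rank-r approximation of W in Frobenius norm (Eckart-Young),
    obtained from the truncated SVD: W ~ U_r diag(s_r) V_r^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(4)
W = rng.standard_normal((64, 48))
for r in (1, 4, 16):
    err = np.linalg.norm(W - lowrank(W, r)) / np.linalg.norm(W)
    print(r, round(err, 3))  # relative error shrinks as r grows
```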