Multiword matrix multiplication over large finite fields in floating-point arithmetic
This article is concerned with the efficient computation of modular matrix multiplication C=AB mod p, a key kernel in computer algebra. We focus on floating-point arithmetic, which allows the use of efficient matrix multiplication libraries. However, the existing approach is limited to primes p with bitsize at most half the mantissa size (e.g., 26 bits with double precision arithmetic), and becomes quite inefficient when p approaches this limit. We present a new approach that overcomes this limitation and can efficiently handle primes with larger bitsizes. The key idea is to use multiword decompositions, which represent A and B as scaled sums of u and v matrices (words) with smaller coefficients. We provide a rigorous analysis that proves the correctness of this approach for suitably chosen scaling parameters. Our analysis determines the maximum bitsize of p that can be handled for a given number of words; in particular, we show that decomposing each input into two words suffices to handle bitsizes almost equal to the full mantissa size (e.g., the 26-bit limit is raised to 52 bits in double precision arithmetic). Moreover, we show that (1,v) decompositions with v>1 are also of interest for handling intermediate bitsizes. We perform an extensive experimental analysis for various matrix shapes and prime bitsizes. Our performance benchmarks on both CPU and GPU architectures confirm the efficiency of the proposed approach, which can outperform the existing single word approach for bitsizes as low as 23, and can handle bitsizes as high as 52 while retaining high performance.
💡 Research Summary
This paper addresses the problem of computing modular matrix multiplication C = A B mod p efficiently when the modulus p is a large prime, using only floating‑point arithmetic. The motivation stems from computer‑algebra applications (e.g., exact linear algebra, polynomial factorisation, Chinese Remainder reconstruction) where intermediate rational coefficients explode, and therefore computations are performed over a finite field Z/pZ. Existing high‑performance libraries (FLINT, NTL, FFLAS/LinBox) already exploit BLAS‑based floating‑point kernels, but they are limited to primes whose bit‑size does not exceed half the mantissa of the floating‑point format (≈26 bits for IEEE‑754 double). When p approaches this bound the block size λ required for safe accumulation becomes very small, leading to many reductions and a dramatic loss of performance.
The authors propose a fundamentally different approach based on multi‑word decomposition. Each input matrix is expressed as a scaled sum of a small number of “words” whose coefficients fit comfortably inside the floating‑point mantissa. Formally, A = ∑_{i=0}^{u−1} α^i A_i and B = ∑_{j=0}^{v−1} β^j B_j, with radices α ≈ p^{1/u} and β ≈ p^{1/v}. The product then becomes C = ∑_{i=0}^{u−1} ∑_{j=0}^{v−1} α^i β^j (A_i B_j) mod p. With α and β chosen this way, every entry of A_i and B_j is bounded by roughly α and β, respectively, so that the intermediate products A_i B_j stay within the range where the standard floating‑point modular reduction (based on a pre‑computed reciprocal of p and an FMA instruction) is provably exact.
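As a sanity check of this identity, the decomposition and recombination can be sketched in pure Python on scalar entries standing in for matrix entries. The concrete modulus and inputs below are illustrative choices, not values from the paper:

```python
# Sketch: (u, v) = (2, 2) multiword decomposition of scalars a, b, and
# reconstruction of a*b mod p from the u*v partial products.
p = (1 << 31) - 1           # Mersenne prime 2^31 - 1, illustrative modulus
u = v = 2
alpha = 1 << 16             # alpha ~ p^(1/u): radix for the words of a
beta = 1 << 16              # beta  ~ p^(1/v): radix for the words of b

def decompose(x, base, nwords):
    """Radix-`base` digits of x, least significant first."""
    words = []
    for _ in range(nwords):
        words.append(x % base)   # extract the low word
        x //= base               # recurse on the quotient
    return words

a, b = 0x89ABCDEF, 0x12345678
A = decompose(a, alpha, u)       # a = sum(alpha**i * A[i]), each A[i] < alpha
B = decompose(b, beta, v)

# Each partial product A[i]*B[j] has small operands (< 2^32 here); the full
# product is recovered by scaled accumulation, reduced mod p at the end.
c = sum(alpha**i * beta**j * (A[i] * B[j])
        for i in range(u) for j in range(v)) % p
assert c == (a * b) % p
```

In the actual algorithm each A_i B_j is a BLAS matrix product reduced mod p; the scalar version above only checks that the scaled-sum bookkeeping is consistent.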
The paper first revisits the floating‑point error model (|η| ≤ ε/(1+ε) with ε = 2^{‑t}) and derives rigorous correctness conditions for two reduction primitives (Algorithm 2.1 for single‑operand reduction and Algorithm 2.2 for product reduction). These primitives are shown to work for any integer x ≤ 2^{t−2} p and for any product xy ≤ 3·2^{t−1} p, which is a substantial improvement over prior analyses that required p < 2^{(t−1)/2}.
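The flavor of the single‑operand reduction can be illustrated in plain Python, whose `float` is an IEEE‑754 double. This is a simplified sketch, a reciprocal‑based quotient estimate followed by one correction step, not a verbatim transcription of the paper's Algorithm 2.1 (in particular it uses an explicit multiply rather than an FMA):

```python
import math

def fp_mod(x: float, p: float, pinv: float) -> float:
    """Reduce the nonnegative integer x (stored exactly in a double)
    modulo p, using a precomputed reciprocal pinv ~ 1/p.  The quotient
    estimate can be off by one, so a single correction step follows."""
    q = math.floor(x * pinv)   # approximate quotient
    r = x - q * p              # exact: q*p and the difference fit in 53 bits
    if r >= p:                 # correct a quotient underestimate
        r -= p
    if r < 0:                  # correct a quotient overestimate
        r += p
    return r

p_int = (1 << 31) - 1          # Mersenne prime 2^31 - 1 (illustrative)
p = float(p_int)
pinv = 1.0 / p
for xi in (0, 1, 123456789, (1 << 52) - 3):
    assert fp_mod(float(xi), p, pinv) == xi % p_int
```

Here x ranges up to nearly 2^52, well past the p^2 bound that the naive half‑mantissa analysis would allow for this p, which is the kind of headroom the paper's sharper conditions formalize.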
Next, the multi‑word decomposition algorithm (Algorithm 3.1) is introduced. It repeatedly divides the matrix by α, extracting the remainder as a word and recursing on the quotient. Lemma 3.1 proves that the floating‑point division followed by floor yields the exact integer quotient, guaranteeing that each word’s coefficients are indeed bounded by α. The authors then analyse how many words are needed to support a given modulus size. The total number of BLAS GEMM calls is uv, so the computational cost scales linearly with the product of the word counts. Crucially, with u = v = 2 (a (2, 2) decomposition) the admissible modulus size grows to almost the full mantissa, i.e. p < 2^{52} for double precision. For intermediate sizes, (1, v) or (u, 1) decompositions provide a smoother trade‑off.
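The extraction step of the decomposition can be sketched in pure Python doubles. Taking the radix alpha to be a power of two (a simplifying assumption for this sketch, so that the division is trivially exact) makes the divide‑floor‑subtract loop easy to check:

```python
import math

def words(x: float, alpha: float, nwords: int):
    """Split the nonnegative integer x (held exactly in a double) into
    radix-alpha words: x = sum(alpha**i * w[i]) with 0 <= w[i] < alpha."""
    out = []
    for _ in range(nwords):
        q = math.floor(x / alpha)  # exact integer quotient (the Lemma 3.1 claim)
        out.append(x - q * alpha)  # extracted word, an exact subtraction here
        x = q                      # recurse on the quotient
    return out

alpha = float(1 << 26)
x = float((1 << 50) + 987654321)
w = words(x, alpha, 2)
assert all(0 <= wi < alpha for wi in w)
assert sum((alpha ** i) * wi for i, wi in enumerate(w)) == x
```

With two words and alpha = 2^26, any integer below 2^52 round‑trips exactly, matching the (2, 2) regime described above.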
A “concatenated” variant is also described: all B_j (or all A_i) are stacked into a single tall matrix, allowing a single GEMM call to compute all uv partial products simultaneously. This improves cache utilisation and arithmetic intensity, especially when one operand is tall‑and‑skinny or short‑and‑wide.
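The concatenation trick can be sketched with a toy matmul standing in for GEMM. The column‑wise stacking convention below is one plausible layout for illustration; the paper's implementation may stack along the other dimension:

```python
def matmul(A, B):
    """Toy GEMM: A (m x k) times B (k x n), as plain nested lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(n)]
            for i in range(m)]

def hstack(Bs):
    """Concatenate the words B_0, ..., B_{v-1} column-wise."""
    return [sum((Bj[r] for Bj in Bs), []) for r in range(len(Bs[0]))]

# Two words of B, stacked so that a single product A_0 @ [B_0 | B_1]
# yields both partial products A_0 B_0 and A_0 B_1 in one call.
A0 = [[1, 2], [3, 4]]
B0 = [[5, 6], [7, 8]]
B1 = [[1, 0], [0, 1]]
C = matmul(A0, hstack([B0, B1]))
n = len(B0[0])
left  = [row[:n] for row in C]   # slice out A_0 B_0
right = [row[n:] for row in C]   # slice out A_0 B_1
assert left == matmul(A0, B0) and right == matmul(A0, B1)
```

One large GEMM in place of v smaller ones is exactly the arithmetic‑intensity argument made above: the shared operand A_i is read once rather than v times.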
The experimental section implements the algorithms on a multicore Intel Xeon Gold CPU and an NVIDIA A100 GPU, using a range of matrix dimensions (e.g., 1024×1024, 2048×512, 4096×256) and prime bit‑sizes from 23 up to 52. Results show:
- For small primes (23–30 bits), a (1, 2) or (2, 1) decomposition already outperforms the single‑word method by 30–100%.
- For medium primes (40–45 bits), the (2, 2) decomposition yields the best performance, achieving 1.8‑2.5× speed‑up.
- Even at the extreme 52‑bit limit, (2, 2) remains correct and delivers about a 1.5× gain on the GPU. Memory overhead grows by a factor of uv, but the high bandwidth of modern hardware and the block‑size selection keep the overhead negligible.
The authors compare their approach to CRT‑based multimodular methods, noting that the multi‑word technique requires fewer modular reductions, avoids the overhead of managing many small moduli, and directly benefits from highly tuned BLAS kernels. They also discuss the practical impact: many cryptographic protocols (e.g., lattice‑based schemes) and large‑scale symbolic computations need modular arithmetic with primes close to the machine word size; the proposed method enables these workloads to stay within the fast path of floating‑point BLAS without resorting to slower arbitrary‑precision libraries.
In conclusion, the paper presents a solid theoretical foundation and a practical implementation that lifts the long‑standing 26‑bit barrier for floating‑point modular matrix multiplication. By exploiting multi‑word representations, careful scaling, and modern hardware features (FMA, SIMD, GPU tensor cores), the authors achieve high performance across a wide range of prime sizes. Future work suggested includes automated selection of (u, v) based on a cost model, exploration of higher‑dimensional decompositions (e.g., (3, 3)), and integration into existing computer‑algebra systems for end‑to‑end exact linear algebra pipelines.