Scalable testing of quantum error correction
The standard method for benchmarking quantum error correction is randomized fault-injection testing. The state-of-the-art tool Stim is efficient for error-correction implementations with distances of up to 10, but scales poorly to larger distances at low physical error rates. In this paper, we present a scalable approach that combines stratified fault injection with extrapolation. Our insight is that part of the fault space can be sampled efficiently, after which extrapolation suffices to complete the testing task. As a result, our tool scales to distance 17 at a physical error rate of 0.0005 within a two-hour time budget on a desktop. For this case, it estimated a logical error rate of $1.51 \times 10^{-11}$ with high confidence.
💡 Research Summary
The paper addresses a critical bottleneck in benchmarking quantum error‑correction (QEC) schemes: estimating the logical error rate when physical error rates are low and code distances are large. The state‑of‑the‑art tool, Stim, performs randomized fault injection by uniformly sampling all possible fault locations and injecting Pauli X, Y, or Z errors. While this works for modest distances (d ≤ 10), it becomes prohibitively expensive for larger distances because low‑weight fault patterns dominate the sampling distribution, yet they rarely produce logical failures. Consequently, achieving statistically significant estimates of extremely low logical error rates (e.g., 10⁻¹¹) would require billions of simulation runs, far beyond practical time budgets.
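A back-of-the-envelope sketch makes the bottleneck concrete. The logical error rate is the paper's d = 17 figure, but the shots-per-second throughput below is an assumption for illustration, not a number from the paper:

```python
def shots_needed(p_logical, target_failures=100):
    """Expected number of uniform Monte-Carlo shots before
    `target_failures` logical failures are observed at rate `p_logical`."""
    return target_failures / p_logical

# Logical error rate reported for the paper's d = 17 case.
shots = shots_needed(1.51e-11)     # ~6.6e12 shots
# Assumed simulator throughput of 1e6 shots/s (illustrative only):
days = shots / 1e6 / 86400         # on the order of months on one machine
```

Even this optimistic throughput leaves naive sampling months away from a statistically meaningful estimate, which is the gap ScaLER targets.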
The authors propose ScaLER (Scalable Logical Error Rate Testing), a two‑phase methodology that dramatically reduces the required simulation effort. First, they identify a “high‑weight” region of the fault space, defined as fault patterns whose weight w exceeds the code’s fault‑tolerance threshold t = ⌊(d − 1)/2⌋. In this region, logical errors occur with appreciable probability, so a modest number of random samples suffices to obtain accurate estimates of the conditional logical error rate P₍w₎ᴸ for each weight w. Second, they model the full logical error rate as a weighted sum over all possible weights:
Pᴸ ≈ ∑₍w₎ P₍w₎ᴸ · Binomial(C, w, p),
where C is the total number of fault locations in the circuit and p is the physical error rate. The binomial term, Binomial(C, w, p) = (C choose w) pʷ (1 − p)^(C−w), gives the exact probability of observing exactly w faults under the single‑qubit independent depolarizing (SID) error model used throughout the paper. By measuring P₍w₎ᴸ only in the high‑weight region and extrapolating to low weights, ScaLER avoids simulating the overwhelming majority of low‑weight samples, which contribute negligibly to logical failures.
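The two-phase structure can be sketched on a toy model. Everything below is a hypothetical stand-in: the "code" is just a protected subset of locations that fails when more than t of them are hit, replacing ScaLER's real decoder runs, and all numeric parameters are invented. Only the shape of the computation — fixed-weight sampling in the high-weight region, then a binomially weighted sum — mirrors the paper:

```python
import random
from math import comb

def binom_pmf(C, w, p):
    """P(exactly w of C independent fault locations fire) under SID."""
    return comb(C, w) * p**w * (1 - p)**(C - w)

def estimate_Pw(C, data, t, w, shots, rng):
    """Fixed-weight sampling: draw fault patterns of exact weight w and
    count how often they defeat a toy 'code' that fails whenever more
    than t protected locations are hit.  Stands in for decoding each
    injected pattern with a real decoder."""
    fails = 0
    for _ in range(shots):
        if sum(loc in data for loc in rng.sample(range(C), w)) > t:
            fails += 1
    return fails / shots

rng = random.Random(0)
C, p, t = 200, 5e-4, 3        # hypothetical circuit size, rate, threshold
data = set(range(9))          # hypothetical protected subset
# Phase 1: estimate P_w^L only in the high-weight region w > t.
Pw = {w: estimate_Pw(C, data, t, w, 5000, rng) for w in range(t + 1, 40)}
# Phase 2: binomially weighted sum.  P_w^L = 0 inside the fault-tolerant
# zone; untested very high weights use the 0.5 asymptote.
P_L = sum(Pw.get(w, 0.0 if w <= t else 0.5) * binom_pmf(C, w, p)
          for w in range(C + 1))
```

The point of the split is visible in the numbers: almost all of the binomial mass sits at low weights where P₍w₎ᴸ vanishes, so the expensive sampling is spent only where failures are actually observable.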
A key technical contribution is the introduction of an S‑curve model to interpolate P₍w₎ᴸ across the full weight range. The authors critique IBM’s Min‑Fail Envelope model, which treats the onset weight β as a free parameter that must be discovered via costly search. Instead, they fix β = t + 1 (the smallest weight that can cause a logical error under the SID model) and propose a continuous function f(w) that satisfies several desirable properties: f(0)=0, monotonic increase, asymptotic limit 0.5 as w→∞, and a change of curvature at w = t. This function includes a denominator term (w − t) that forces f(t)=0 exactly, reflecting the fault‑tolerant zone. Empirically, the S‑curve fits the measured data for surface codes, toric codes, and bicycle‑block (BB) codes with R² > 0.99, confirming its universality across code families and distances.
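The paper's exact S-curve expression is not reproduced in this summary, so the sketch below is a hypothetical function chosen only to satisfy the listed properties: zero throughout the fault-tolerant zone w ≤ t, a (w − t) denominator that forces the value to 0 as w approaches t from above, monotone increase, and asymptote 0.5 as w → ∞:

```python
import math

def s_curve(w, t, a=5.0):
    """Hypothetical S-curve with the properties the paper lists.
    `a` stands in for a fitted shape parameter; the paper's actual
    functional form may differ from this illustration."""
    if w <= t:
        return 0.0                      # fault-tolerant zone: f(w) = 0
    # The (w - t) denominator sends the exponent to -inf as w -> t+,
    # so f(t) = 0 exactly; as w -> inf the exponent -> 0 and f -> 0.5.
    return 0.5 * math.exp(-a / (w - t))

t = 8   # e.g. d = 17 gives t = (17 - 1) // 2 = 8
curve = [s_curve(w, t) for w in range(30)]
```

Fixing β = t + 1 means only `a` (and any siblings in the real parameterization) must be fitted to the measured high-weight points, rather than searching for the onset weight as in the Min‑Fail Envelope approach.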
The experimental evaluation demonstrates the practical impact of ScaLER. Using a desktop machine with a two‑hour time budget, the authors benchmark surface‑code implementations at physical error rate p = 5 × 10⁻⁴. Stim can only estimate logical error rates up to distance 13, and for distance 13 it observes zero logical errors, yielding an inaccurate estimate of zero. In contrast, ScaLER successfully estimates the logical error rate for distance 17 as 1.51 × 10⁻¹¹ with tight confidence intervals. Additional experiments at p = 1 × 10⁻⁴, and on toric and BB codes, show consistent S‑curve behavior and accurate extrapolation. The authors also compare ScaLER’s estimates against ground‑truth Monte‑Carlo simulations for smaller distances, confirming that the extrapolation error remains within a few percent.
The paper discusses several threats to validity. The primary assumption is the independence of single‑qubit depolarizing errors (the SID model); correlated errors or leakage would violate the binomial weight distribution and could bias the extrapolation. The choice of high‑weight cutoff influences the number of required samples; while the authors provide heuristic guidelines, a formal bound on the sampling error is not derived. Finally, the S‑curve functional form, though empirically robust, may not capture pathological error models that produce non‑monotonic weight‑dependent logical error rates.
In conclusion, ScaLER offers a scalable, statistically sound alternative to exhaustive random fault injection. By focusing computational effort on the informative high‑weight region and leveraging a principled probabilistic model for low‑weight contributions, it reduces the simulation time from exponential to near‑linear in the distance for realistic physical error rates. This advancement enables researchers and engineers to benchmark high‑distance QEC codes—up to d = 17 in the presented experiments—within practical time frames, thereby accelerating the development and validation of fault‑tolerant quantum processors.