LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
Training data attribution (TDA) identifies which training examples most influenced a model’s prediction. The best-performing TDA methods exploit gradients to define an influence function. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: \emph{(i)} storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and \emph{(ii)} forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality, creating a quality-scalability tradeoff. We introduce \textbf{LoRIF (Low-Rank Influence Functions)}, which exploits the low-rank structure of gradients to address both bottlenecks. First, we store rank-$c$ factors of the projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.
💡 Research Summary
Training Data Attribution (TDA) aims to identify which training examples most influence a model’s prediction on a given test input. Classical influence functions approximate this effect by the formula I(x_tr, x_te) = g_teᵀ H⁻¹ g_tr, where g_tr and g_te are gradients of the loss w.r.t. model parameters for a training and a test sample, and H is the Hessian of the total training loss. While theoretically appealing, directly computing or storing H or its inverse is infeasible for modern deep networks that have millions or billions of parameters.
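As a concrete illustration, the influence formula can be evaluated directly on a toy problem where the parameter count is small enough to form H explicitly. All shapes and values below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: gradients are plain vectors over P parameters.
P, N = 8, 32                      # parameter count, number of training examples
G = rng.normal(size=(N, P))       # per-example training gradients g_tr (one per row)
g_te = rng.normal(size=P)         # test-example gradient
lam = 1e-2                        # damping term, as commonly added in practice

# Damped Hessian approximation H = GᵀG + λI (Gauss-Newton/Fisher style).
H = G.T @ G + lam * np.eye(P)

# Influence scores I(x_tr, x_te) = g_teᵀ H⁻¹ g_tr for all training rows at once,
# using a linear solve instead of an explicit inverse.
scores = G @ np.linalg.solve(H, g_te)   # shape (N,)

# The most influential training example is the argmax of the scores.
top = int(np.argmax(scores))
```

At this toy scale the dense solve is trivial; the rest of the summary is about making exactly this computation feasible when P is billions and N is millions.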
Recent scalable approaches (TRAK, LoGRA, TrackStar) mitigate the problem by projecting per‑example gradients into a lower‑dimensional space of size D using random matrices. This reduces the dimensionality of the stored gradients and allows the formation of a D × D inverse‑Hessian approximation K = (GᵀG + λI)⁻¹, where G ∈ ℝ^{N×D} is the matrix of all projected gradients. However, two fundamental bottlenecks remain: (1) storing N projected gradients of dimension D requires O(ND) space and dominates query latency due to I/O; (2) forming and storing the dense D × D inverse Hessian costs O(D²) memory. Since attribution quality improves with larger D, practitioners face a harsh quality‑scalability trade‑off.
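To get a feel for the O(ND) cost, a back-of-the-envelope calculation shows how quickly the gradient store grows and what a rank-c factorization saves. The values of N, the projection dims, and the dtype below are assumptions for illustration, not figures from the paper:

```python
# Illustrative storage arithmetic; N, d1, d2, c, and dtype are assumptions.
N = 2_200_000            # training examples
d1 = d2 = 64             # per-side projection dims, so D = d1 * d2
D = d1 * d2              # projection dimension (4096 here)
c = 1                    # factor rank
bytes_per = 2            # fp16

full_bytes = N * D * bytes_per               # O(ND): full projected gradients
lorif_bytes = N * c * (d1 + d2) * bytes_per  # O(Nc(d1+d2)) = O(Nc sqrt(D)) factors
print(f"full: {full_bytes/1e9:.1f} GB, factored: {lorif_bytes/1e9:.2f} GB, "
      f"ratio: {full_bytes/lorif_bytes:.0f}x")
```

The ratio D / (c(d1 + d2)) is what drives the per-sample I/O savings at query time: the larger D grows, the bigger the win from storing factors instead of full matrices.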
LoRIF (Low‑Rank Influence Functions) breaks this trade‑off by exploiting two low‑rank phenomena that are empirically observed in deep networks: (a) each per‑example projected gradient matrix is approximately low‑rank; (b) the collection of all projected gradients across examples has a low effective rank, leading to a spiked spectrum in GᵀG.
Low‑rank storage of per‑example gradients.
For each linear layer ℓ with input dimension Iℓ and output dimension Oℓ, LoRIF follows LoGRA’s two‑sided random projection: P_in ∈ ℝ^{Iℓ×d₁} and P_out ∈ ℝ^{Oℓ×d₂}. The projected gradient for training sample i is \tilde G_{ℓ,i} = (X_{ℓ,i}P_in)ᵀ(δY_{ℓ,i}P_out) ∈ ℝ^{d₁×d₂}. Instead of storing the full d₁·d₂ matrix, LoRIF computes a rank‑c factorization \tilde G_{ℓ,i} ≈ u_{ℓ,i} v_{ℓ,i}ᵀ with u ∈ ℝ^{d₁×c}, v ∈ ℝ^{d₂×c}. The factorization is obtained via a few block power iterations, which are cheap compared to a full SVD. Storage per sample per layer drops from d₁·d₂ floats to c(d₁ + d₂) floats, i.e., O(c√D) instead of O(D). Experiments show that even c = 1 preserves most of the attribution signal, and that for a fixed storage budget increasing D (e.g., using larger random projections) yields larger quality gains than increasing c.
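A minimal sketch of this step, with made-up layer shapes and a plain block power iteration standing in for whatever iteration scheme the authors use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (not from the paper): one linear layer, one sample.
I_l, O_l, T = 64, 48, 10     # layer in/out dims, sequence length
d1, d2, c = 16, 16, 2        # projection dims and factor rank

X = rng.normal(size=(T, I_l))        # layer inputs X_{l,i}
dY = rng.normal(size=(T, O_l))       # output-side gradients δY_{l,i}
P_in = rng.normal(size=(I_l, d1)) / np.sqrt(d1)
P_out = rng.normal(size=(O_l, d2)) / np.sqrt(d2)

# Two-sided projected per-example gradient (d1 x d2), as in LoGRA.
G_tilde = (X @ P_in).T @ (dY @ P_out)

# Rank-c factorization u vᵀ via a few block power iterations
# (a cheap stand-in for the top-c truncated SVD).
v = rng.normal(size=(d2, c))
for _ in range(4):
    u, _ = np.linalg.qr(G_tilde @ v)     # orthonormalized left block
    v, _ = np.linalg.qr(G_tilde.T @ u)   # orthonormalized right block
u = G_tilde @ v                          # fold the singular values into u

# Store u (d1 x c) and v (d2 x c): c*(d1 + d2) floats instead of d1*d2.
approx = u @ v.T
```

The power iteration only needs a handful of matrix-vector products against G̃, which is why it is far cheaper than a full d₁ × d₂ SVD per sample.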
Low‑rank approximation of the inverse Hessian.
The second bottleneck is the dense D × D matrix K = (GᵀG + λI)⁻¹. LoRIF first aggregates the (reconstructed) projected gradients across the whole training set into a matrix G ∈ ℝ^{N×D}. It then computes a truncated SVD: G ≈ U_r Σ_r V_rᵀ with r ≪ min(N, D). This can be done with randomized SVD without materializing G in memory, because each row of G can be regenerated on‑the‑fly from the stored low‑rank factors u, v. The Hessian is approximated as H ≈ V_r Σ_r² V_rᵀ + λI. Applying the Woodbury matrix identity yields a closed‑form expression for H⁻¹:
H⁻¹ = λ⁻¹I − λ⁻² V_r (Σ_r⁻² + λ⁻¹I_r)⁻¹ V_rᵀ.
Thus only V_r (D×r) and Σ_r (r×r) need to be stored, reducing memory from O(D²) to O(D r). The computational cost of forming the influence scores becomes O(N D c + N D r) instead of O(N D² + D³).
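The Woodbury-based inverse can be checked numerically against a dense inverse on a small example; when r equals the rank of G the low-rank form is exact, and r < rank(G) turns it into the approximation LoRIF actually uses. Shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small example: r = rank(G) makes the Woodbury form exact.
N, D, r = 200, 40, 40
lam = 1e-1

G = rng.normal(size=(N, D))
_, S, Vt = np.linalg.svd(G, full_matrices=False)
V_r, S_r = Vt[:r].T, S[:r]          # D x r right singular vectors, top-r singular values

# Dense baseline: invert H = GᵀG + λI directly (O(D²) memory, O(D³) time).
H_inv = np.linalg.inv(G.T @ G + lam * np.eye(D))

# Woodbury form: H⁻¹ = λ⁻¹I − λ⁻² V_r (Σ_r⁻² + λ⁻¹I_r)⁻¹ V_rᵀ,
# which needs only the D x r factor V_r and the r singular values.
middle = np.linalg.inv(np.diag(1.0 / S_r**2) + (1.0 / lam) * np.eye(r))
H_inv_woodbury = np.eye(D) / lam - (V_r @ middle @ V_r.T) / lam**2
```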
End‑to‑end pipeline.
- During training, for each layer and each sample, compute the two‑sided projected gradient and factorize it into rank‑c u and v matrices; store these compact factors.
- After training, reconstruct G batch‑wise from the stored factors and run a randomized truncated SVD to obtain V_r and Σ_r. Store only these.
- At query time, reconstruct the test‑sample gradient (or use a pre‑computed one), retrieve the low‑rank factors for any candidate training sample, and compute the influence score using the Woodbury‑based inverse Hessian.
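Putting the pieces together, query-time scoring for one layer might look like the following sketch. All names and shapes here are hypothetical; in a real pipeline V_r, Σ_r, and the per-sample factors would be loaded from the stores built in the earlier steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-layer setting.
d1, d2, c, r = 16, 16, 1, 8
D = d1 * d2
lam = 1e-1

# Stand-ins for precomputed artifacts: truncated SVD factors and a test gradient.
V_r = np.linalg.qr(rng.normal(size=(D, r)))[0]       # D x r, orthonormal columns
S_r = np.sort(rng.uniform(1.0, 5.0, size=r))[::-1]   # top-r singular values
g_te = rng.normal(size=D)                            # flattened test gradient

# Precondition the test gradient once: q = H⁻¹ g_te via the Woodbury form,
# touching only V_r and S_r (O(Dr) work, no D x D matrix).
middle = np.linalg.solve(np.diag(1.0 / S_r**2) + (1.0 / lam) * np.eye(r),
                         V_r.T @ g_te)
q = g_te / lam - (V_r @ middle) / lam**2

# A candidate training sample stored as rank-c factors u (d1 x c), v (d2 x c).
u = rng.normal(size=(d1, c))
v = rng.normal(size=(d2, c))

# Influence score ⟨vec(u vᵀ), q⟩, computed without materializing the
# d1 x d2 training gradient: view q as a matrix and contract with the factors.
Q = q.reshape(d1, d2)
score = float(np.sum((Q @ v) * u))   # equals vec(u vᵀ)ᵀ q in O(c(d1 + d2)) extra work
```

Because q is computed once per query, scanning N candidates costs only the cheap factor contraction per sample, which is where the reported query-time speedups come from.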
Empirical evaluation.
The authors evaluate LoRIF on three language models of increasing scale: GPT‑2‑small (124M parameters, 233k training examples), OLMo‑7B (7B parameters, 2.2M examples), and Apertus‑70B (70B parameters, 3.8M examples). Compared against LoGRA (the strongest prior projection‑based method), LoRIF achieves:
- Storage reduction: 2.3× to 20× less space across models, with the biggest gains on the 70B model, where the raw projected gradients would require > 80 GB.
- Query‑time speedup: 1.3× to 20× faster influence computation, primarily because I/O drops from loading O(D) floats per sample to O(c√D).
- Attribution quality: Measured by LDS (Linear Datamodeling Score) and Tail‑Patch metrics, LoRIF matches or exceeds LoGRA. Notably, even the most aggressive compression (c = 1, r = 64) retains high fidelity, confirming that the low‑rank assumptions hold in practice.
A key insight from the ablation studies is that, for a fixed storage budget, allocating resources to increase the effective projection dimension D yields larger quality improvements than increasing the factor rank c. This guides practitioners to prioritize larger random projections while keeping c minimal (often c = 1).
Impact and future directions.
LoRIF demonstrates that the intrinsic low‑rank structure of neural‑network gradients can be harnessed to make gradient‑based TDA feasible at the frontier of model size and dataset scale. By simultaneously reducing per‑example storage and the Hessian memory footprint, it removes the primary obstacles that previously forced a compromise between attribution fidelity and scalability. Potential extensions include adaptive selection of c and r per layer, integration with non‑linear layers (e.g., attention heads), and deployment in production pipelines for data‑driven debugging, data‑poisoning detection, and curriculum learning. The paper thus opens a practical pathway for high‑quality, large‑scale training data attribution.