CoSA: Compressed Sensing-Based Adaptation of Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Parameter-Efficient Fine-Tuning (PEFT) has emerged as a practical paradigm for adapting large language models (LLMs) without updating all parameters. Most existing approaches, such as LoRA and PiSSA, rely on low-rank decompositions of weight updates. However, the low-rank assumption may restrict expressivity, particularly in task-specific adaptation scenarios where singular values are distributed relatively uniformly. To address this limitation, we propose CoSA (Compressed Sensing-Based Adaptation), a new PEFT method grounded in compressed sensing theory. Instead of constraining weight updates to a low-rank subspace, CoSA expresses them through fixed random projection matrices and a compact learnable core. We provide a formal theoretical analysis of CoSA as a synthesis process, proving that weight updates can be compactly encoded into a low-dimensional space and mapped back through random projections. Extensive experiments show that CoSA provides a principled foundation for efficient and expressive adaptation across model scales. Specifically, we evaluate CoSA on 10 diverse tasks spanning natural language understanding and generation, using 5 models of different scales from the RoBERTa, Llama, and Qwen families. Across these settings, CoSA consistently matches or outperforms state-of-the-art PEFT methods.


💡 Research Summary

The paper introduces CoSA (Compressed Sensing‑Based Adaptation), a novel parameter‑efficient fine‑tuning (PEFT) method for large language models (LLMs) that departs from the low‑rank paradigm dominant in approaches such as LoRA, AdaLoRA, and DoRA. While low‑rank adapters reduce trainable parameters by factorizing the weight update as the product of two small matrices (ΔW = B A), they impose a rigid structural bottleneck: the update must lie in a pre‑specified low‑dimensional subspace. Empirical evidence shows that in many tasks, especially those where the singular values of the optimal update are spread relatively uniformly, this assumption leads to approximation errors and degraded performance.
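The structural bottleneck can be seen directly: any update factored as ΔW = B A has rank at most r, so its singular-value spectrum is cut off after r values regardless of what the optimal update looks like. A minimal NumPy sketch (all dimensions are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # illustrative layer size and adapter rank

# LoRA-style factorization: the update is confined to a rank-r subspace.
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
delta_W = B @ A

# The singular-value spectrum is truncated: at most r nonzero values.
s = np.linalg.svd(delta_W, compute_uv=False)
nonzero = int(np.sum(s > 1e-8 * s[0]))
print(nonzero)  # no more than r
```

If the task's optimal update has many comparably sized singular values, a rank-r factorization must discard most of them, which is the approximation error the paper highlights.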

CoSA addresses this limitation by framing weight adaptation as a compressed sensing synthesis problem. Each layer’s weight update is expressed as ΔW = L Y R, where L ∈ ℝ^{m×a} and R ∈ ℝ^{b×n} are fixed random projection matrices shared across tasks, and Y ∈ ℝ^{a×b} is the only trainable core. Vectorizing the equation yields vec(ΔW) = (Rᵀ ⊗ L) vec(Y). The Kronecker product Ψ = Rᵀ ⊗ L serves as a universal dictionary; learning the adapter reduces to finding a low‑dimensional coefficient vector α = vec(Y) such that vec(ΔW) = Ψα reconstructs the high‑dimensional update. The authors prove (Theorem 4.1) that when L and R are independent random matrices satisfying the Restricted Isometry Property (RIP), their Kronecker product also satisfies RIP with high probability. RIP guarantees that the mapping Ψ is near‑isometric on sparse vectors, ensuring that small changes in α produce proportionally small changes in the reconstructed ΔW. This property yields a well‑conditioned optimization landscape, mitigating gradient vanishing or explosion and enabling stable training.
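The synthesis step and its dictionary view can be checked numerically. The sketch below uses toy dimensions and Gaussian scaling chosen for illustration (not the paper's configuration); note that the vec(·) identity assumes column-major vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6   # layer dimensions (illustrative)
a, b = 3, 2   # core dimensions, with a << m and b << n

# Fixed random projections: frozen, never trained.
L = rng.standard_normal((m, a)) / np.sqrt(a)
R = rng.standard_normal((b, n)) / np.sqrt(b)
# The only trainable parameters: the compact core Y.
Y = rng.standard_normal((a, b))

# Synthesis: reconstruct the full-size weight update.
delta_W = L @ Y @ R

# Dictionary view: vec(L Y R) = (R^T kron L) vec(Y).
vec = lambda M: M.flatten(order="F")  # column-major vec(.)
Psi = np.kron(R.T, L)                 # the "universal dictionary"
assert np.allclose(vec(delta_W), Psi @ vec(Y))

# Only a*b = 6 coefficients are learned, versus m*n = 48 for a full update.
```

The equivalence holds because vec(A X B) = (Bᵀ ⊗ A) vec(X) for column-major vec; the dictionary Ψ is never materialized in practice, since ΔW can be synthesized directly as L Y R.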

Parameter efficiency is achieved because the number of trainable parameters is simply a·b, often orders of magnitude smaller than the (m + n)·r required by LoRA‑style adapters. In experiments the authors set a and b in the range 64–256, resulting in ≤0.1 % of the total model parameters being trainable even for models up to 13 B parameters. They evaluate CoSA on ten diverse NLU and NLG benchmarks (including GLUE, SuperGLUE, SQuAD, XSum, etc.) across five model families: RoBERTa‑large, LLaMA‑7B, LLaMA‑13B, Qwen‑7B, and Qwen‑14B. Across all settings CoSA matches or surpasses state‑of‑the‑art PEFT methods (LoRA, AdaLoRA, DoRA, VERA, NoLA) in terms of final accuracy/F1 scores. Notably, on tasks where the optimal update exhibits a flat singular‑value spectrum, CoSA gains 1.2–2.5 % absolute improvement over low‑rank baselines. Training efficiency also improves: GPU memory consumption drops by 30–45 % and convergence is reached 10–20 % faster in terms of epochs.
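The parameter comparison is simple arithmetic. With illustrative numbers (a 4096-dimensional square layer and LoRA rank 16 are assumptions for this sketch; the core size of 128 falls in the 64–256 range reported above):

```python
m = n = 4096  # hidden size of a typical large-model layer (illustrative)
r = 16        # LoRA rank (illustrative)
a = b = 128   # CoSA core size, within the reported 64-256 range

lora_params = (m + n) * r  # B in R^{m x r} plus A in R^{r x n}
cosa_params = a * b        # only the core Y is trained

print(lora_params)  # 131072
print(cosa_params)  # 16384
```

At these settings the CoSA core is 8x smaller than the LoRA factors, and unlike (m + n)·r, the count a·b does not grow with the layer dimensions.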

A practical advantage of CoSA is that the random projection matrices L and R are fixed and can be shared across layers and tasks, facilitating multi‑task or continual learning without additional storage overhead. Because the dictionary is random, no costly SVD‑based initialization is needed; simple Gaussian initialization suffices and empirically leads to rapid convergence. The authors also discuss limitations and future directions, such as learning or adapting the projection matrices themselves, incorporating structured sparsity into Y, or extending the framework to non‑linear dictionaries (e.g., learned neural networks) to further boost expressivity.

In summary, CoSA leverages compressed sensing theory to provide a principled, expressive, and highly efficient PEFT alternative. By guaranteeing RIP for its random Kronecker dictionary, it ensures stable optimization while dramatically reducing the number of trainable parameters. The extensive empirical validation demonstrates that CoSA can reliably replace low‑rank adapters across a wide range of models and tasks, offering a compelling new tool for researchers and engineers seeking to adapt massive language models with minimal computational resources.

