LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights
We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $ΔW$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves performance competitive with existing methods while requiring only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
💡 Research Summary
The paper introduces CRAFT (Cross‑layer Rank Adaptation via Frozen Tucker), a novel parameter‑efficient fine‑tuning (PEFT) technique for large pre‑trained transformer models. Existing PEFT methods fall into two main families. The first family (e.g., LoTR, SuperLoRA) builds a tensor from gradient updates across layers and applies Tucker‑2 or Tucker‑n decomposition, sharing factor matrices while learning per‑layer core tensors. These methods capture inter‑layer correlations but operate on gradients rather than the pre‑trained weight structure. The second family (e.g., PiSSA) decomposes each pre‑trained weight matrix with SVD, initializes LoRA adapters with the leading singular vectors, and freezes the residual. While this leverages the intrinsic low‑rank structure of the model, it treats each layer independently and therefore ignores cross‑layer patterns.
CRAFT unifies these two lines of work. For each attention projection type (Q and V), it stacks the weight matrices of all N_L layers into a third-order tensor of shape (N_L × d_out × d_in). This tensor simultaneously encodes the layer-wise, output-wise, and input-wise modes. A one-time Higher-Order SVD (HOSVD) is then performed, yielding three orthonormal factor matrices U^(1), U^(2), U^(3) of dimensions (N_L × r₁), (d_out × r₂), (d_in × r₃) and a core tensor G ∈ ℝ^{r₁×r₂×r₃}. All of these components are frozen after decomposition.
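The stacking-and-HOSVD step can be sketched in NumPy via mode-wise unfoldings (a minimal illustration with toy dimensions; the helper name `hosvd` and the sizes below are ours, not the paper's):

```python
import numpy as np

def hosvd(T, ranks):
    """Truncated Higher-Order SVD of a 3-way tensor T.

    Returns the core tensor G and orthonormal factors U[n], so that
    T ≈ G ×₁ U[0] ×₂ U[1] ×₃ U[2] (exact when ranks match T's shape).
    """
    U = []
    for n, r in enumerate(ranks):
        # Mode-n unfolding: bring axis n to the front, flatten the rest.
        unfolding = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
        # The leading r left singular vectors span the mode-n subspace.
        Un, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        U.append(Un[:, :r])
    # Core tensor: G = T ×₁ U[0]ᵀ ×₂ U[1]ᵀ ×₃ U[2]ᵀ.
    G = T
    for n in range(3):
        G = np.moveaxis(np.tensordot(U[n].T, np.moveaxis(G, n, 0), axes=1), 0, n)
    return G, U

# Stack the Q-projection weights of all layers into an (N_L, d_out, d_in) tensor.
W_q = np.random.randn(12, 64, 64)                # toy sizes
G, (U1, U2, U3) = hosvd(W_q, ranks=(8, 16, 16))  # one-time decomposition, then frozen
```

Because HOSVD reduces to one SVD per mode unfolding, the decomposition is a one-time preprocessing cost rather than a per-step training cost.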
The only trainable parameters are three small square matrices J^(1) ∈ ℝ^{r₁×r₁}, J^(2) ∈ ℝ^{r₂×r₂}, and J^(3) ∈ ℝ^{r₃×r₃} for each projection type. Each J is initialized close to the identity (I + ε·E). The adapted weight tensor is computed by a residual‑preserving formula:
Ŵ = W + G ×₁ (U^(1)J^(1)) ×₂ (U^(2)J^(2)) ×₃ (U^(3)J^(3)) − G ×₁ U^(1) ×₂ U^(2) ×₃ U^(3),

where ×ₙ denotes the mode-n product. Because each J^(n) starts near the identity, the two Tucker terms nearly cancel at initialization, so Ŵ ≈ W and fine-tuning begins from the pre-trained model.
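A residual-preserving update of this kind, Ŵ = W + [G; U^(1)J^(1), U^(2)J^(2), U^(3)J^(3)] − [G; U^(1), U^(2), U^(3)], can be sketched in NumPy as follows. This specific form is our reading of the residual-preserving construction described above, and all function names are ours:

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product T ×ₙ M: multiply M into the n-th axis of the 3-way tensor T."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, n, 0), axes=1), 0, n)

def adapted_weights(W, G, U, J):
    """Residual-preserving adaptation (our reading of the formula):
    W + Tucker(G; U^(n) J^(n)) − Tucker(G; U^(n)), so W is recovered when every J^(n) = I.
    """
    adapted, frozen = G, G
    for n in range(3):
        adapted = mode_product(adapted, U[n] @ J[n], n)  # trainable path
        frozen = mode_product(frozen, U[n], n)           # frozen Tucker reconstruction
    return W + adapted - frozen

# Toy example: near-identity initialization J = I + ε·E keeps Ŵ close to W.
rng = np.random.default_rng(0)
N_L, d_out, d_in, r1, r2, r3 = 12, 64, 64, 8, 16, 16
W = rng.standard_normal((N_L, d_out, d_in))
G = rng.standard_normal((r1, r2, r3))
U = [rng.standard_normal((N_L, r1)),
     rng.standard_normal((d_out, r2)),
     rng.standard_normal((d_in, r3))]
J = [np.eye(r) + 1e-4 * rng.standard_normal((r, r)) for r in (r1, r2, r3)]
W_hat = adapted_weights(W, G, U, J)
```

Note that the trainable parameter count is just r₁² + r₂² + r₃² per projection type, which depends only on the Tucker ranks, consistent with the abstract's claim that the count is independent of model dimension and depth.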