Understanding the Curse of Unrolling

Understanding the Curse of Unrolling Shehery ar Mehmo o d † and Florian Knoll ⋆ and P eter Oc hs † † Saarland Univ ersity , Saarbr ¨ uc ken, German y ⋆ F riedrich-Alexander Univ ersit y , Erlangen, German y Abstract Algorithm unrolling is ubiquitous in mac hine learning, particularly in h yp erpa- rameter optimization and meta-learning, where Jacobians of solution mappings are computed b y diﬀerentiating through iterativ e algorithms. Although unrolling is kno wn to yield asymptotically correct Jacobians under suitable conditions, recent work has sho wn that the deriv ativ e iterates may initially diverge from the true Jacobian, a phe- nomenon known as the curse of unrolling. In this work, we provide a non-asymptotic analysis that explains the origin of this b ehavior and iden tiﬁes the algorithmic factors that gov ern it. W e sho w that truncating early iterations of the deriv ative computa- tion mitigates the curse while simultaneously reducing memory requirements. Finally , w e demonstrate that w arm-starting in bilevel optimization naturally induces an im- plicit form of truncation, pro viding a practical remedy . Our theoretical ﬁndings are supp orted by numerical exp eriments on representativ e examples. 1 In tro duction Man y problems in mo dern machine learning can b e form ulated as bilevel optimization tasks [ 16 , 17 ], where an outer ob jectiv e dep ends on the solution of an inner problem. That is, giv en a suﬃcien tly smo oth mapping ℓ : X × U → R , we solv e min u ∈U ℓ ( x ⋆ ( u ) , u ) , (1) where, for eac h parameter v ector u , the quan tit y x ⋆ ( u ) is deﬁned implicitly as the solution of an optimization problem or, more generally , as the solution of a nonlinear equation in the v ariable x . Suc h form ulations arise in a wide range of applications, including hyperparameter optimization [ 9 , 28 , 35 ], meta-learning [ 10 , 26 , 38 ], implicit deep learning [ 2 , 4 , 6 , 7 ], and neural arc hitecture searc h [ 27 ]. In this w ork, w e fo cus on deterministic settings in whic h x ⋆ ( u ) is c haracterized as the ﬁxed p oin t of a smo oth parameterized mapping A : X × U → X , that is, x ⋆ ( u ) solv es x = A ( x , u ) . (2) Intr o duction 0 200 400 600 800 1000 k 1 0 3 1 0 2 1 0 1 1 0 0 Er r ors (log scale) x ( k ) ( u ) x * ( u ) D x ( k ) ( u ) D x * ( u ) k Figure 1: Iterate x ( k ) ( u ) vs deriv ative D x ( k ) ( u ) error plot for gradien t descen t applied to f ( x , u ) : = ∥ A x − b ∥ 2 / 2 + u ∥ x ∥ 2 / 2. Unlike x ( k ) ( u ), D x ( k ) ( u ) initially drifts aw a y from its limit b efore ev en tually coming bac k to it. W e provide a non-asymptotic understanding of this transien t behavior, called the curse of unrolling, and study simple w ays to mitigate it. for x ∈ X . This formulation serv es as the ﬁxed-p oin t equation of many iterativ e algorithms used in practice, such as, gradient descen t, to solv e inner problems of the form min x ∈X f ( x , u ) . (3) Under standard assumptions, the solution x ⋆ ( u ) of ( 2 ) or ( 3 ) exists, is unique, and dep ends smo othly on u . T o solv e the bilev el problem using gradient-based metho ds, we need to compute the gradien t of the outer ob jectiv e ℓ ( x ⋆ ( u ) , u ) with respect to the parameter u . By the c hain rule, this requires the deriv ativ e of the solution map x ⋆ ( u ) with resp ect to u . When x ⋆ ( u ) is deﬁned implicitly as the ﬁxed point of A , this deriv ativ e can be c haracterized using the implicit function theorem. Under suitable regularity conditions (see Theorem 2 ) on A or the inner ob jectiv e f , the map u 7→ x ⋆ ( u ) is diﬀerentiable and its Jacobian is giv en b y D x ⋆ ( u ) : = ( I − D x A ( x ⋆ ( u ) , u )) − 1 D u A ( x ⋆ ( u ) , u ) . This approac h, commonly referred to as implicit diﬀerentiation, has b ecome a standard to ol for computing gradients in bilevel optimization and related problems. In practice, since the — 2 — Contributions exact ﬁxed p oint x ⋆ ( u ) is rarely av ailable, implicit diﬀerentiation is t ypically applied b y replacing x ⋆ ( u ) with an appro ximate solution obtained after a ﬁnite n um b er of iterations of an inner algorithm. In practice, the solution map x ⋆ ( u ) is t ypically appro ximated b y running an iterative algorithm associated with the mapping A . That is, for any parameter u and some initial p oin t x (0) ( u ), the algorithm generates a sequence of iterates ( x ( k ) ( u )) k ∈ N through the updates x ( k +1) ( u ) : = A ( x ( k ) ( u ) , u ) , (4) whic h conv erges to the ﬁxed p oin t x ⋆ ( u ) under suitable assumptions. After a ﬁnite num b er of iterations K , the algorithm output x ( K ) ( u ) serv es as an appro ximation of x ⋆ ( u ) in the outer ob jective. Bey ond approximating the solution itself, w e ma y also diﬀerentiate the algorithmic map deﬁning the iterates [ 15 , 18 , 32 ]. Since A is smo oth, each iterate x ( k ) ( u ) is diﬀerentiable with resp ect to u , and its deriv ative can b e computed recursiv ely by diﬀerentiating through the up date rule, that is, D x ( k +1) ( u ) : = B k ( u ) D x ( k ) ( u ) + C k ( u ) (5) where we deﬁne B k ( u ) : = D x A ( x ( k ) , u ) and C k ( u ) : = D u A ( x ( k ) , u ). Running this recursion for K iterations yields the Jacobian D x ( K ) ( u ) which pro vides an approximation of the true Jacobian D x ⋆ ( u ). This pro cess amounts to diﬀeren tiating the ﬁnite-depth computational graph of u 7→ x ( K ) ( u ) formed by the iterations of the algorithm and is, therefore, commonly referred to as algorithm unrolling. A natural question is whether the fav orable conv ergence prop erties of the algorithm iter- ates are reﬂected at the level of their deriv atives. Under suitable assumptions, the sequence ( x ( k ) ( u )) k ∈ N con v erges linearly to the ﬁxed p oint x ⋆ ( u ), that is, the error ∥ x ( k ) ( u ) − x ⋆ ( u ) ∥ deca ys at a geometric rate. One might therefore exp ect the deriv ative iterates D x ( k ) ( u ) to exhibit a similar behavior. Indeed, sev eral works ha ve established that, under standard regularit y conditions, the sequence ( D x ( k ) ( u )) k ∈ N con v erges linearly to D x ⋆ ( u ) with the same asymptotic rate [see, for example, 11 , 20 , 21 , 29 , 31 ]. Ho wev er, the deriv ativ e error ∥ D x ( k ) ( u ) − D x ⋆ ( u ) ∥ may initially increase with k b efore ev entually decreasing [ 39 ]. This initial increase in the deriv ative error, as illustrated in Figure 1 , is referred to as the curse of unrolling and is the sub ject matter of this pap er. 1.1 Con tributions Our con tributions can b e summarized as follo ws: (i) W e analyze the non-asymptotic b eha vior of deriv ativ e iterates pro duced by algorithm unrolling and identify the factors resp onsible for the curse of unrolling. (ii) W e study truncation through the same non-asymptotic lens and show explicitly why excluding early iterations from diﬀerentiation alleviates the curse. — 3 — R elate d Work (iii) W e sho w how w arm-starting in bilevel optimization naturally induces truncation of deriv ative computations whic h av oids the need for explicit truncation. (iv) W e provide empirical results that demonstrate the curse of unrolling and v alidate the predicted eﬀects of truncation and warm-starting. 1.2 Related W ork Asymptotic Analysis of Unrolling: The con v ergence of the deriv ative sequence D x ( k ) ( u ) has been pro vided for general iterative pro cesses [ 8 , 20 ], ﬁrst-order metho ds for smo oth [ 29 , 34 ] and non-smo oth optimization [ 11 , 30 , 31 ], and second-order metho ds [ 12 , 21 ] Curse of Unrolling: The ﬁrst study on the curse phenomenon in unrolling w as con- ducted by Scieur et al. [ 39 ]. They provided comprehensiv e results for quadratic low er-lev el problems and identiﬁed that decreasing the step size reduces the eﬀects of the curse at the cost of a slow er conv ergence. They also designed an algorithm which when unrolled is free from the curse. How ev er, their study was limited to the quadratic problems. T runcation in Unrolling: T runcation w as already hinted at by Gilb ert [ 20 , see Sec- tion 4] but a comprehensiv e study w as still missing. It w as also studied to reduce the memory and compuational o v erhead of the backpropagation in [ 40 ]. In [ 12 ], the authors sho w ed that diﬀeren tiating only the last step of the algorithm is enough for sup er-linear algorithms and the same eﬀect can b e appro ximated in the linear algorithms b y diﬀeren tiating the last few steps. Ho wev er, their study did not show how the truncation can help with the curse and in mac hine learning, the ﬁrst order metho ds are predominantly used, for whic h, the optimal rates are only linear. 1.3 Notation: In this pap er, X and U are Euclidean spaces with their inner pro ducts and the induced norms are represented by ⟨· , ·⟩ and ∥ · ∥ respectively . The elements of X and U are denoted b y small b old letters, for instance, x ∈ X and u ∈ U . L ( U , X ) denotes the space of linear maps from U to X with induced (or sp ectral) norm top ology , unless stated otherwise. A linear op erator is denoted by a capital letter, for example, Q ∈ L ( X , X ) and ρ ( Q ) denotes its sp ectral radius. Throughout this pap er, we use standard dotted and barred notation for forw ard and reverse mo de deriv ativ es. Moreo ver, w e frequen tly mak e use of forward pass and forw ard mo de which are standard terminologies in Machine Learning and Automatic Diﬀeren tiation literature resp ectiv ely and must not b e confused with eac h other. 2 Preliminaries W e recall the ﬁxed-p oin t form ulation introduced in Section 1 and collect assumptions and results needed for the non-asymptotic analysis. The following global con tractivity assump- tion helps to ensure existence, uniqueness, and smo oth dep endence of the ﬁxed p oin t on the parameter. — 4 — Pr eliminaries Assumption 2.1. A is C 1 -smo oth and for each u ∈ U , A ( · , u ) is a contraction on X , that is, there exists ρ ( u ) ∈ [0 , 1) such that A ( · , u ) is ρ ( u )-Lipsc hitz contin uous. Remark 1. (i) Under Assumption 2.1 , the con traction mo dulus ρ ( u ) b ounds the Jacobian D x A ( x , u ) uniformly , that is, ∥ D x A ( x , u ) ∥ ≤ ρ ( u ) for all ( x , u ) ∈ X × U . This global uniform b ound on the sp ectral norm can b e replaced by a local one [ 37 , Chapter 2, Theorem 1] on the more general, sp ectral radius. W e adopt it since it yields simpler non-asymptotic error b ounds. F or lo cal results, it is typically suﬃcient to assume con tractivit y near a ﬁxed-p oin t. (ii) Let f : X × U → R be a C 2 -smo oth function. F or some α > 0, w e deﬁne the up date map for gradient descent A α : X × U → X by A α ( x , u ) : = x − α ∇ x f ( x , u ) , and for some β ∈ [0 , 1), w e deﬁne the Heavy-ball update map A α,β : X × X × U → X × X b y A α,β ( z , u ) : = ( A α ( x 1 , u ) + β ( x 1 − x 2 ) , x 1 ) , where z : = ( x 1 , x 2 ). Under global Lipschitz con tinuit y of ∇ x f and strong con v exit y of f with resp ect to x , A α ( · , u ) is a global con traction for suitable α . On the other hand, A α,β ( · , u ) do es not satisfy Assumption 2.1 in general and we can only ha ve ρ ( D z A α,β ( z , u )) < 1 [ 36 , 37 ]. F or a given parameter u and a smo oth initialization map x (0) : U → X , w e consider the ﬁxed-p oin t iteration ( 4 ) for solving ( 2 ). Under Assumption 2.1 , Banac h ﬁxed-p oin t theorem [ 5 , Theorem 4.1.3] ensures the existence and uniqueness of solution of ( 2 ), while the implicit function theorem [ 19 , Theorem 1B.1] guarantees its smo oth dep endence on u . Theorem 2. Supp ose A satisﬁes Assumption 2.1 . Then there exists a C 1 -smo oth map x ⋆ : U → X such that for all u ∈ U , x ⋆ ( u ) = A ( x ⋆ ( u ) , u ) is the unique ﬁxed-p oin t of A ( · , u ) and its Jacobian is given by D x ⋆ ( u ) : = ( I − B ( u )) − 1 C ( u ) , (6) where the maps B : U → L ( X , X ) and C : U → L ( U , X ) provide the partial Jacobians of A ev aluated at ( x ⋆ ( u ) , u ), that is, B ( u ) := D x A ( x ⋆ ( u ) , u ) C ( u ) := D u A ( x ⋆ ( u ) , u ) . (7) Moreo v er, the sequence ( x ( k ) ( u )) k ∈ N generated b y ( 4 ) conv erges linearly to x ⋆ ( u ) with rate ρ ( u ) from Assumption 2.1 , that is, for all k ∈ N , e ( k ) ( u ) ≤ ρ ( u ) k e (0) ( u ) , (8) where w e deﬁne e ( k ) ( u ) : = ∥ x ( k ) ( u ) − x ⋆ ( u ) ∥ . — 5 — A utomatic Diﬀer entiation or Unr ol ling T able 1: Propagation of algorithm and deriv ativ e iterates. Iterations Iterates Propagation Unrolled Al gorithm ( x ( k ) ) K k =0 0 → K F orw ard Mo de ( ˙ X ( k ) ) K k =0 0 → K Rev erse Mo de ( ¯ U ( k ) K ) 0 k = K K → 0 2.1 Automatic Diﬀeren tiation or Unrolling As suggested in Section 1 , an alternative to implicit diﬀeren tiation for ev aluating D x ⋆ ( u ) is automatic diﬀeren tiation whic h admits t w o mo des: forward and reverse. F orw ard Mo de AD: The C 1 -smo othness of A allo ws us to write the follo wing recursion. ˙ X ( k +1) ( u ) := B k ( u ) ˙ X ( k ) ( u ) + C k ( u ) , (9) where we set ˙ X (0) ( u ) : = D x (0) ( u ) and deﬁne sequence of maps B k : U → L ( X , X ) and C k : U → L ( U , X ) by: B k ( u ) : = D x A ( x ( k ) ( u ) , u ) C k ( u ) : = D u A ( x ( k ) ( u ) , u ) . (10) This recursion corresp onds to forward mo de automatic diﬀeren tiation and is iden tical to the deriv ativ e up date introduced in Section 1 so that D x ( k ) ( u ) = ˙ X ( k ) ( u ). W e deﬁne the forw ard mo de errors by: ˙ e ( k ) ( u ) : = ∥ ˙ X ( k ) ( u ) − D x ⋆ ( u ) ∥ . (11) Rev erse Mo de AD: Let ( x ( k ) ( u )) K − 1 k =0 b e ﬁxed. W e deﬁne the reverse-mode quantities ( ¯ X ( k ) K , ¯ U ( k ) K ) bac kw ard for k = K − 1 , . . . , 0 via ¯ X ( k ) K : = ¯ X ( k +1) K B k ( u ) ¯ U ( k ) K : = ¯ U ( k +1) K + ¯ X ( k +1) K C k ( u ) , (12) where B k and C k are the same as in ( 10 ). By setting ¯ U ( K ) K : = 0 and ¯ X ( K ) K : = I , the ab o v e recursion outputs the total deriv ativ e of the ﬁnal iterate: u 7→ x ( K ) ( u ), that is, D x ( K ) ( u ) = ¯ U (0) K ( u ). F ollo wing ( 11 ), we deﬁne the rev erse-mo de error b y: ¯ e ( k ) ( u ) : = ∥ ¯ U ( k ) K ( u ) − D x ⋆ ( u ) ∥ , (13) whic h should decrease for decreasing k . Remark 3. The t w o modes propagate information in opp osite directions, as summarized in T able 1 . F orw ard mo de AD can b e ev aluated alongside the algorithm iterations, while rev erse mo de AD requires storing in termediate iterates during a forward pass b efore a back- w ard sweep, which can b e memory-intensiv e for large K. Moreov er, rev erse mo de computes — 6 — A utomatic Diﬀer entiation or Unr ol ling v ector–Jacobian pro ducts, whereas forward mo de computes Jacobian–v ector pro ducts. F or a b etter understanding and ease of comparison, w e construct full Jacobians in b oth mo des. Due to the equiv alence of forward and rev erse mo de automatic diﬀerentiation [ 22 ], this yields D x ( K ) = ˙ X ( K ) = ¯ U (0) K . The following lemma provides a foundation for pro ving the con v ergence of the deriv ativ e sequence D x ( k ) ( u ). Lemma 4. Let ( e ( k ) ) k ∈ N and ( ˙ e ( k ) ) k ∈ N b e non-negative sequences. Assume that there exist constan ts ρ ∈ [0 , 1) and Γ ≥ 0 such that, for all k ∈ N , ˙ e ( k +1) ≤ ρ ˙ e ( k ) + Γ e ( k ) . (14) Then, for any k ∈ N , it holds that ˙ e ( k ) ≤ ρ k ˙ e (0) + Γ k − 1 X i =0 ρ k − i − 1 e ( i ) . (15) Moreo v er, when e ( k +1) ≤ ρe ( k ) , ( 15 ) reduces to ˙ e ( k ) ≤ ρ k ˙ e (0) + k ρ k − 1 Γ e (0) . (16) Pr o of. The pro of is in Section A.1 in the app endix. Belo w we assume global Lipsc hitz con tin uity of D A as well as a global b ound on D u A whic h along with the global contractivit y in Assumption 2.1 provides non-asymptotic error b ound for the deriv ativ e sequence. Assumption 2.2. F or all u ∈ U , there exist non-negative constants M x ( u ), M u ( u ) and κ ( u ) such that the mappings x 7→ D x A ( x , u ) and x 7→ D u A ( x , u ) are Lipsc hitz contin uous with constan ts M x ( u ) and M u ( u ) resp ectiv ely and ∥ D u A ( x , u ) ∥ ≤ κ ( u ) for all x ∈ X . Remark 5. (i) Similarly , the lo cal Lipschitz con tin uity and b ound conditions are suﬃcient for asymptotic conv ergence results of D x ( k ) ( u ). (ii) F or the setting of Remark 1(ii) and under additional global Lipsc hitz con tin uity of D ( ∇ x f ), the mappings D A α and D A α,β are Lipschitz contin uous with constan ts pro- p ortional to α and the Lipschitz constant of D ( ∇ x f ). Using the ab o v e assumption we can pro vide the following b ound which is useful for pro ving con vergence of the deriv ative sequence ˙ X ( k ) ( u ). Lemma 6. Supp ose A satisﬁes Assumption 2.2 and the mappings B k , C k , B and C b e deﬁned in ( 10 ) and ( 7 ). Then for an y u ∈ X , we hav e the following b ound. ∥  B k ( u ) − B ( u )  D x ⋆ ( u ) + C k ( u ) − C ( u ) ∥ ≤ Γ( u ) e ( k ) ( u ) , (17) — 7 — The Curse of Unr ol ling where e ( k ) ( u ) is deﬁned in Theorem 2 and Γ( u ) is giv en b y Γ( u ) : = M x ( u ) κ ( u ) 1 − ρ ( u ) + M u ( u ) . (18) Pr o of. The pro of is in Section A.2 in the app endix. The follo wing result follo ws from Lemmas 4 and 6 and pro vides non-asymptotic error b ound for ˙ X ( k ) ( u ). Theorem 7 (Con v ergence of Deriv ativ e Iterates). Supp ose A satisﬁes Assumptions 2.1 and 2.2 . Then the sequence ( ˙ X ( k ) ( u )) k ∈ N generated b y ( 9 ) conv erges linearly to D x ⋆ ( u ). In particular, for any u ∈ U and for all k ∈ N , ˙ e ( k ) ( u ) ≤ ρ ( u ) k ˙ e (0) ( u ) + k ρ ( u ) k − 1 Γ( u ) e (0) ( u ) , (19) where ρ ( u ) is the con traction constan t from Assumption 2.1 and Γ( u ) is deﬁned in Lemma 6 . Pr o of. The pro of is in Section A.3 in the app endix. 3 The Curse of Unrolling In this section, we study the ﬁnite-iteration b eha vior of the deriv ative iterates generated b y algorithm unrolling. While asymptotic conv ergence of these iterates is well understo o d, we fo cus on their non-asymptotic b eha vior and show that, at ﬁnite depth, the deriv ative error ma y increase in the earlier iterations. Scieur et al. [ 39 ] called this phenomenon, the curse of unrolling. 3.1 F orw ard Mo de AD W e refer to an initial increase in the forward-mode deriv ative error sequence ( ˙ e ( k ) ( u )) 0 ≤ k ≤ K as the curse of unrolling. Although ˙ e ( k ) ( u ) con v erges asymptotically , it ma y increase for a n um- b er of initial iterations b efore reaching its maximum. This b eha vior is clearly visible in Fig- ure 2 , where the forward-mode deriv ativ e error initially increases b efore deca ying. W e deﬁne the index at whic h the forw ard-mo de deriv ative error is largest as ˙ k := argmax 0 ≤ k ≤ K ˙ e ( k ) ( u ). 3.2 Rev erse Mo de AD A similar non-asymptotic b eha vior can b e observed for the rev erse mo de deriv ative iterates. In contrast to forw ard mo de, the rev erse mo de deriv ativ e error may initially decrease as the bac kw ard sw eep b egins, b efore increasing as the propagation contin ues tow ard earlier itera- tions as depicted in Figure 2 . W e similarly deﬁne an index corresp onding to the minimum v alue of the rev erse-mo de deriv ative error, that is, ¯ k := argmin 0 ≤ k ≤ K ¯ e ( k ) ( u ). — 8 — Interpr etation of the Non-asymptotic R esults 0 200 400 600 800 1000 k 1 0 4 1 0 3 1 0 2 1 0 1 1 0 0 Er r ors (log scale) e ( k ) e ( k ) e ( k ) e ( k ) b o u n d e ( k ) b o u n d k k Figure 2: Error evolution of e ( k ) ( u ), ˙ e ( k ) ( u ), and ¯ e ( k ) ( u ) generated by gradient descen t applied to f ( x , u ) : = ∥ A x − b ∥ 2 / 2 + u ∥ x ∥ 2 / 2. The dashed lines denote the b ounds given in ( 8 ) and ( 19 ). The v ertical lines denote ˙ k and ¯ k deﬁned in Sections 3.1 and 3.2 respectively . 3.3 In terpretation of the Non-asymptotic Results W e now interpret the non-asymptotic b ounds established in Section 2 and explain how they giv e rise to the curse of unrolling. Lemma 4 sho ws that the deriv ativ e error accum ulates con tributions of e ( k ) from all previous iterations into a gro wing-then-deca ying term O ( k ρ k ). Lemma 6 provides a b ound on the size Γ( u ) of these contributions. Finally , Theorem 7 puts ev erything together and giv es us the b ound whic h contains a deca ying term O ( ρ ( u ) k ), a gro wing-then-deca ying term O ( k ρ ( u ) k ) which is the source of the curse and the algorithmic constan ts controlling the curse term. Since we only hav e an upp er b ound for the error, there is no guaran tee that the curse of unrolling o ccurs. How ev er, if it do es, the upp er b ound allo ws us to con trol it. Therefore, understanding this upp er b ound is essential. 3.4 F actors Go v erning the Curse W e now brieﬂy discuss the algorithmic constants in ( 19 ) that gov ern the curse of unrolling. This will help us build in tuition on how to mitigate the curse later. W e also demonstrate these eﬀects empirically in Section 6 . — 9 — Mitigating the Curse of Unr ol ling 3.4.1 Rate of Con v ergence F or faster algorithms, that is, when ρ ( u ) is small, the geometrically decaying term ρ ( u ) k dominates the linear term k more quic kly , reducing b oth the magnitude and duration of the initial increase in the deriv ativ e error. 3.4.2 Smo othness of the Up date Map The smo othness of the up date map, in particular, the Lipschitz constant of D A plays a critical role. A larger Lipsc hitz constant delays the decline of the curse term k ρ ( u ) k , leading to a more pronounced curse of unrolling. The Lipschitz constant is mainly aﬀected by t wo factors: Small step size: F or algorithms suc h as gradient descen t or the Hea vy-ball metho d, a smaller step size leads to a small Lipsc hitz constan t (see Remark 5(ii) ). Almost-Linear Ob jectiv es: When the ob jectiv e function is nearly linear, for example in the case of Huber loss [ 24 ], the quan tities ∇ 2 x f and D u ∇ x f are small, resulting in a milder curse of unrolling. 3.4.3 Qualit y of Initialization Finally , the quality of the initial iterate x (0) ( u ) also aﬀects the sev erity of the curse. A larger initial error e (0) ( u ) allo ws the curse term k ρ ( u ) k to gro w to a larger magnitude b e- fore even tually decreasing, resulting in a more signiﬁcan t initial increase in the deriv ative sequence. 4 Mitigating the Curse of Unrolling In Section 3 , w e identiﬁed the factors that give rise to the curse of unrolling. Most of these factors are algorithmic constants, such as the conv ergence rate ρ or the Lipschitz constan t of D A , which can only b e impro v ed through parameter tuning (e.g., step size) or b y changing the algorithm. The only factor that can b e inﬂuenced without altering the algorithm is the qualit y of the initial iterate x (0) ( u ). In this section, we study a simple strategy to mitigate the curse by truncating early iterations from the deriv ative computation. 4.1 T runcation or Late-start In tuitiv ely , the curse is most pronounced during the early iterations, where the non-asymptot- ic deriv ativ e error ma y gro w. Excluding these iterations from diﬀerentiation helps av oid this unfa v orable regime. Accordingly , we truncate the deriv ative computation by excluding early algorithm iterations. In reverse mo de, truncation amoun ts to stopping the bac kw ard sweep at an index T ′ instead of propagating deriv ativ es all the wa y to the initial iterate. Similarly , in forward mo de, truncation is implemen ted b y starting the deriv ativ e recursion at the same index T ′ , rather than at the b eginning of the algorithm. — 10 — Conver genc e with T runc ation More formally , in forw ard mo de, we allo w the ﬁrst T ′ ∈ N algorithm iterations to run idly and start the deriv ativ e recursion from x ( T ′ ) . Sp eciﬁcally , we compute ˙ X ( k +1) T ′ ( u ) := B k + T ′ ( u ) ˙ X ( k ) T ′ ( u ) + C k + T ′ ( u ) , (20) for k = 0 , 1 , . . . , where we use the same initialization ˙ X (0) T ′ := D x (0) ( u ) as b efore for simplic- it y . Here, the subscript T ′ indicates the late-started deriv ativ e sequence, which diﬀers from the v anilla forw ard-mo de sequence ˙ X ( k ) ( u ) = D x ( k ) ( u ). If w e run K ′ algorithm iterations in total with the initial T ′ idle iterations, then the output of the late-started forward mo de will b e ˙ X ( K ′ − T ′ ) T ′ . F or rev erse mo de, w e ﬁrst run the algorithm for K ′ iterations and then stop the bac kward sw eep at k = T ′ instead of propagating deriv ativ es back to k = 0. The output of the truncated rev erse mo de is therefore ¯ U ( T ′ ) K ′ rather than ¯ U (0) K ′ . T able 2 summarizes the algorithm and deriv ative iterations. Since b oth mo des ev alu- ate the same pro duct arising from the chain rule, the late-started forward mo de and the truncated rev erse mo de satisfy ˙ X ( K ′ ) T ′ = ¯ U ( T ′ ) K ′ . 4.2 Con v ergence with T runcation W e now theoretically justify why truncation, or late-start, alleviates the curse of unrolling. F or clarit y , w e fo cus on forw ard mo de. The same conclusions apply to reverse mo de by equiv alence (see Remark 3 ). Theorem 8 (Con v ergence with T runcation). Supp ose A satisﬁes Assumptions 2.1 and 2.2 . Then, for an y late-start index T ′ ∈ N , the sequence ( ˙ X ( k ) T ′ ( u )) k ∈ N generated by ( 20 ) con v erges linearly to D x ⋆ ( u ). In particular, for an y u ∈ U and for all k ∈ N , ˙ e ( k ) T ′ ( u ) ≤ ρ ( u ) k ˙ e (0) ( u ) + k ρ ( u ) k + T ′ − 1 Γ( u ) e (0) ( u ) , (21) where ρ ( u ) is the con traction constan t from Assumption 2.1 and Γ( u ) is deﬁned in Lemma 6 . Pr o of. The pro of is in Section A.4 in the app endix. Remark 9. (i) When T ′ is suﬃcien tly large, the curse term is multiplied by an addi- tional factor ρ ( u ) T ′ . This signiﬁcantly attenuates the con tribution of the growing- then-deca ying term in the non-asymptotic b ound and thereb y alleviates the curse. (ii) The eﬀectiveness of truncation is particularly easy to visualize in reverse mo de AD. As sho wn in Figure 2 , the iterate ¯ U ( k ) K ( u ) is closest to D x ⋆ ( u ) at k = ¯ k . It is therefore more reasonable to use ¯ U ( ¯ k ) K ( u ) as an estimate of D x ⋆ ( u ) rather than ¯ U (0) K ( u ). (iii) If Assumptions 2.1 and 2.2 are replaced by their lo cal counterparts, the deriv ativ e iter- ates ˙ X ( k ) ( u ) and ˙ X ( k ) T ( u ) still con v erge [ 30 , Section 1.4.2] but the linear error b ounds pro vided in Theorems 7 and 8 hold only asymptotically . In this regime, truncation b ecomes even more relev an t, as it a voids the unpredictable b ehavior of the earlier algorithm iterates. — 11 — R e al lo c ation of Computational R esour c es T able 2: Propagation of the iterates under truncation. Iterations Iterates Propagation Idle Algorit hm ( x ( k ) ) T ′ k =0 0 → T ′ Unrolled Algori thm ( x ( k ) ) K ′ k = T ′ T ′ → K ′ F orw ard Mo de ( ˙ X ( k ) ) K ′ k = T ′ T ′ → K ′ Rev erse Mo de ( ¯ U ( k ) K ) T ′ k = K ′ K ′ → T ′ 4.3 Reallo cation of Computational Resources When truncation is employ ed, we m ust decide ho w to allo cate computational cost b et w een running the algorithm itself and diﬀerentiating through its iterations. W e may , for instance, ﬁx the total num b er of algorithm iterations K ′ and v ary the truncation index T ′ , or alterna- tiv ely ﬁx the n um b er of deriv ative iterations K ′ − T ′ . Both viewp oints are reasonable, but neither directly reﬂects practical constraints. Instead, we adopt a computational budget p ersp ectiv e and assume that the combined cost of ev aluating the algorithm and its deriv ativ es is ﬁxed and must not be exceeded. When there is no truncation, we run b oth the algorithm and deriv ative sw eeps for some K ∈ N iterations. Ho w ev er, the computational cost of ev aluating the Jacobian-v ector product (resp. v ector-Jacobian product) using forw ard (resp. rev erse) mode AD is ω times that of ev aluation of the function itself where ω ∈ [2 , 2 . 5] for forw ard mo de and ω ∈ [3 , 4] for reverse mo de [ 22 , Chapter 3]. In total, the computational cost without truncation is prop ortional to K + ω K whic h will b e our computational budget. Therefore, if we reduce the num ber of deriv ativ e iterations b y some T ∈ N , w e can increase the num ber of algorithm iterations by ωT without exceeding our computational budget. Our new cost is still ( K + ω T ) + ω ( K − T ) = K + ω K . Therefore, we run our algorithm for a total K ′ : = K + ω T num b er of iterations. Out of these, the last K − T are used in the deriv ativ e computation while the remaining T ′ : = ( K + ω T ) − ( K − T ) = T + ω T algorithm iterations are run idly . That is, the iterates from x (0) to x ( T ′ − 1) are not added to the computational graph and only those from x ( T ′ ) to x ( K + ω T − 1) are used in the deriv ativ e computation pro cess. The late-started forward mo de AD iterates ˙ X ( k ) T ′ are generated by using ( 20 ) for k = 0 , . . . , K − T . Example 10. Supp ose that, without truncation, b oth the algorithm and the deriv ative recursions are run for K = 500 iterations. If w e reduce the n um ber of deriv ative iterations b y 20% and assume ω = 3, then the truncation index is T = 100, since the num ber of deriv ativ e iterations b ecomes K − T = 0 . 8 K = 400. Under a ﬁxed computational budget, this reduction allo ws the n um b er of algorithm iterations to b e increased to K ′ = K + ω T = 800. — 12 — Optimizing the T runc ation Index 4.4 Optimizing the T runcation Index By using the resource allo cation strategy from the previous section, w e now try to ﬁnd a truncation index that not only reduces the eﬀect of the curse but also leav es suﬃcient iterations while k eeping the computational cost ﬁxed. F or ﬁnding the optimal truncation index T in { 0 , . . . , K } , we minimize the upp er b ound provided in Theorem 8 for k = K − T , that is, min 0 ≤ T ≤ K ρ K − T ˙ e (0) + ( K − T ) ρ K + ω T − 1 Γ e (0) . (22) Here, we drop the dep endence on u for brevity . If we relax the optimization space to [0 , K ], then the problem is (strictly) conv ex as established in the follo wing lemma. Lemma 11. Supp ose A satisﬁes Assumptions 2.1 and 2.2 where ρ ∈ (0 , 1). Then the optimization problem giv en in ( 22 ), relaxed to T ∈ [0 , K ], is con vex. Moreov er, the problem is strictly conv ex provided that either ˙ e (0) > 0, or e (0) > 0 and Γ > 0. Pr o of. The pro of is in Section A.5 in the app endix. Remark 12. When K is not large and the constants are known, the discrete problem in ( 22 ) can b e solved eﬃciently . F or large K , w e may instead solve the relaxed problem in Lemma 11 to obtain an approximation of optimal T . 4.5 Limitations of Optimizing the T runcation Index Although, the pro cedure for c ho osing an optimal truncation index b y solving ( 22 ) provides clear theoretical insights on the choice from a practical p ersp ectiv e, it has sev eral limita- tions. First, it optimizes an upp er b ound on the deriv ative error rather than the error itself. Moreo v er, key quantities such as the conv ergence rate and Lipschitz constants are typically unkno wn. F urthermore, the non-asymptotic b ounds rely on global con traction and smo oth- ness assumptions that ma y not hold in practice. Finally , mo dern automatic diﬀeren tiation framew orks such as PyT orc h [ 33 ], T ensorFlo w [ 1 ], and JAX [ 13 ] require static computational graphs, whic h prev ents adaptive selection of the truncation index when the termination of the inner solver is inﬂuenced by a stopping criterion. Nonetheless, choosing a truncation index using heuristics [ 40 ] and truncating the back- w ard pass is still a practical and eﬀectiv e strategy . In addition to alleviating the curse of unrolling, truncation reduces memory ov erhead and allows computational resources to b e reallo cated to the forward pass. 5 W arm-Starting & Implicit T runcation In the previous section, w e studied explicit truncation of the deriv ative computation as a principled wa y to mitigate the curse of unrolling. While this analysis provides v aluable insigh t, explicit truncation requires selecting a truncation index and relies on quan tities — 13 — Warm-starting in Se quenc e of R elate d Pr oblems that are often unkno wn in practice. In this section, we discuss warm-starting, a widely used strategy that naturally induces an implicit form of truncation without requiring suc h c hoices. 5.1 W arm-starting in Sequence of Related Problems W arm-starting is a standard tec hnique for solving sequences of closely related optimization problems, where the solution of a previous problem is reused as the initial p oint for the next. Supp ose w e aim to solv e ( 2 ) or ( 3 ) for a sequence of slowly changing parameters ( u ( r ) ) R r =0 . Then after solving the problem asso ciated with u ( r ) through ( 4 ), for r = 0 , . . . , R − 1, the resulting appro ximation for x ⋆ ( u ( r ) ) can be used to initialize x (0) ( u ( r +1) ) for solving the next problem. This strategy has been sho wn to reduce the n um b er of iterations required to reac h a giv en accuracy and is widely used in applications suc h as video and image pro cessing, time- v arying in v erse problems [ 23 ], and con tinuation methods [ 3 ], where consecutiv e problems diﬀer only slightly . F rom this p ersp ectiv e, bilev el optimization naturally giv es rise to a sequence of related inner problems indexed b y the outer v ariable. As the outer iterate ev olves, the corresponding inner problems typically c hange smo othly , making w arm-starting a natural and eﬀective c hoice. More details can b e found in Section B in the app endix. 5.2 W arm-starting as Implicit T runcation F rom the viewp oin t of algorithm unrolling, warm-starting has an imp ortan t and previously underappreciated eﬀect. By initializing the inner algorithm close to its ﬁxed p oint, warm- starting eﬀectively bypasses the early iterations where the curse of unrolling is most pro- nounced. As a result, the deriv ative computation a v oids the unfav orable transien t regime iden tiﬁed in Section 3 . W arm-starting, therefore, induces an implicit form of truncation: early algorithm itera- tions are either skipp ed or substan tially shortened, without explicitly mo difying the compu- tational graph or choosing a truncation index. Unlik e explicit truncation, this mechanism is adaptiv e, compatible with automatic diﬀeren tiation frameworks, and already presen t in man y practical implemen tations of bilev el optimization. 6 Exp erimen ts 7 Conclusion W e studied the non-asymptotic b eha vior of algorithm unrolling for diﬀeren tiating solution maps of parametric optimization problems. Despite asymptotic con vergence guarantees, deriv ative iterates ma y exhibit an initial transient gro wth or the curse of unrolling, ev en when the underlying algorithm conv erges monotonically . Our analysis identiﬁes the factors go v erning this b ehavior and shows how truncating early iterations eﬀectively mitigates the — 14 — Pr o ofs curse. W e further demonstrate that warm-starting, a standard practice in bilevel optimiza- tion, naturally induces an implicit form of truncation. Numerical exp eriments supp ort our theoretical ﬁndings and highlight the practical relev ance of truncation-based strategies for unrolling-based diﬀeren tiation. A Pro ofs A.1 Pro of of Lemma 4 Pr o of. The b ound ( 15 ) follows b y iterating ( 14 ) and a straigh tforw ard induction. The b ound ( 16 ) then follows b y substituting e ( i ) ≤ ρ i e (0) — which is obtained b y recursiv ely expanding the inequalit y e ( k +1) ≤ ρe ( k ) — in to ( 15 ) whic h mak es the summand ρ k − 1 e (0) . A.2 Pro of of Lemma 6 Pr o of. By expanding ∥ ( B k ( u ) − B ( u )) D x ⋆ ( u ) + C k ( u ) − C ( u ) ∥ and subsituting ∥ D x ⋆ ∥ ≤ κ ( u ) / (1 − ρ ( u )) [ 14 , Theorem 2.2], we arrive at our desired result. A.3 Pro of of Theorem 7 Pr o of. First w e rewrite ˙ X ( k +1) ( u ) − D x ⋆ ( u ) as: ˙ X ( k +1) ( u ) − D x ⋆ ( u ) = B k ( u ) ˙ X ( k ) ( u ) + C k − B ( u ) D x ⋆ ( u ) − C ( u ) = B k ( u )  ˙ X ( k ) ( u ) − D x ⋆ ( u )  +   B k ( u ) − B ( u )  D x ⋆ ( u ) + C k ( u ) − C ( u )  . Since the t wo terms in the last expression are bounded by ρ ( u ) ˙ e ( k ) ( u ) and Γ( u ) e ( k ) ( u ) (from Lemma 6 ) resp ectively , w e obtain ˙ e ( k +1) ( u ) ≤ ρ ( u ) ˙ e ( k ) ( u ) + Γ( u ) e ( k ) ( u ) . The error b ound in ( 19 ) is then obtained b y applying Lemma 4 . A.4 Pro of of Theorem 8 Pr o of. Applying the same strategy as in the pro of of Theorem 7 , we write ˙ X ( k +1) T ( u ) − D x ⋆ ( u ) = B k + T ( u )  ˙ X ( k ) T ( u ) − D x ⋆ ( u )  +   B k + T ( u ) − B ( u )  D x ⋆ ( u ) + C k + T ( u ) − C ( u )  . Because, the t wo terms in the expression on the righ t side are b ounded by ρ ( u ) ˙ e ( k ) T ( u ) and Γ( u ) e ( k + T ) ( u ) (from Lemma 6 ) respectively , and b y deﬁning the iterate error at k + T by e ( k ) T : = e ( k + T ) , w e obtain the following inequality ˙ e ( k +1) T ( u ) ≤ ρ ( u ) ˙ e ( k ) T ( u ) + Γ( u ) e ( k ) T ( u ) . — 15 — Pr o of of L emma 11 Applying Lemma 4 , we obtain ˙ e ( k ) T ( u ) ≤ ρ ( u ) k ˙ e (0) T ( u ) + k ρ ( u ) k − 1 Γ( u ) e (0) T ( u ) , Since e (0) T = e ( T ) ≤ ρ ( u ) T e (0) ( u ) and ˙ e (0) T ′ = ˙ e (0) , w e arriv e at our desired result. A.5 Pro of of Lemma 11 Pr o of. W e ﬁrst rewrite h b y deﬁning A : = ρ K ˙ e (0) ≥ 0, B : = ρ K − 1 Γ e (0) ≥ 0 and δ : = − log( ρ ) > 0 since ρ ∈ (0 , 1). Here log denotes the natural logarithm. Substituting A , B , and δ , w e get h ( T ) = Ae δ T + B ( K − T ) e − ω δT . W e now compute the second deriv ativ es of h and chec k for its sign. F or h to b e conv ex h ′′ m ust not b e negative. Moreo ver, h ′′ > 0 is a suﬃcient condition for the strict con v exit y of h . Now the ﬁrst deriv ative is given b y: h ′ ( T ) = Aδ e δ T + B e − ω δT ( − 1 − ω δ ( K − T )) . Similarly , the second deriv ative reads: h ′′ ( T ) = Aδ 2 e δ T + B ω δ e − ω δT (2 + ω δ ( K − T )) . Since all constants are non-negative and T ∈ [0 , K ], therefore h ′′ ≥ 0. Moreo v er, h ′′ is p ositiv e when either ˙ e (0) , or b oth e (0) and Γ are p ositive. This concludes our pro of. B W arm-Starting in Bilev el Optimization W arm-starting is a widely used strategy in bilevel optimization, where the solution of the inner problem from the previous outer iteration is used as the initial p oint for the estimation of the next inner solution. Both theoretical and empirical studies ha v e sho wn that warm- starting signiﬁcantly reduces the num ber of inner iterations required to achiev e a giv en accuracy [ 25 ]. Let ℓ : X → R b e C 1 -smo oth and A : X × U → X satisﬁes Assumptions 2.1 and 2.2 . W e consider the bilevel optimization problem min u ∈U ℓ ( x ⋆ ( u )) s.t. x ⋆ ( u ) = A ( x ⋆ ( u ) , u ) . (23) F or simplicity w e assume that b oth problems are deterministic. Algorithm 1 summarizes a standard bilev el optimization pro cedure that uses reverse-mode automatic diﬀerentiation to compute the h yp ergradien ts and utilizes w arm-starting. In particular, at each outer iteration r , the low er-lev el problem is appro ximately solv ed using an iterative algorithm. Once the algorithm terminates, we use the reverse mo de AD to compute the gradien t of the outer — 16 — A dditional Exp eriments ob jectiv e ∇ ( ℓ ◦ x ( K r ) )( u ( r ) ), which is referred to as hypergradient or the metagradien t in Mac hine Learning literature. After up dating the outer v ariable using a gradien t descen t step, we w arm-start the inner algorithm using the solution from the previous outer iteration. Algorithm 1 (Bilev el Optimization with W arm-Starting). Input: R ∈ N , total outer iterations ε > 0, inner solver tolerance ( τ r ) r ∈ N , step size sequence for outer optimization x (0) , initialization for inner algorithm u (0) , initialization for outer optimization Output: u ( R ) , estimated outer solution Initialize x (0) ( u (0) ) ← x (0) for r = 0 to R − 1 do Initialize k ← 0 ▷ Solving the inner problem rep eat x ( k +1) ( u ( r ) ) ← A ( x ( k ) ( u ( r ) ) , u ( r ) ) k ← k + 1 un til ∥ x ( k ) ( u ( r ) ) − x ( k − 1) ( u ( r ) ) ∥ ≤ ε K r ← k Initialize ¯ x ( K r ) ← ∇ ℓ ( x ( K r ) ( u ( r ) )) T ▷ Computing the h yp ergradien t Initialize ¯ u ( K r ) ← 0 for k = K r − 1 down to 0 do ¯ x ( k ) K ← ¯ x ( k ) K B k ( u ( r ) ) ¯ u ( k ) K ← ¯ u ( k +1) K + ¯ x ( k ) K C k ( u ( r ) ) end for Compute d ( r ) ← ( ¯ u (0) ) T u ( r +1) ← u ( r ) − τ r d ( r ) Initialize x (0) ( u ( r +1) ) ← x ( K r ) ( u ( r ) ) ▷ W arm-starting end for C Additional Exp erimen ts In this section, w e pro vide additional details on the exp erimen tal setup and presen t supple- men tary results for v arying problem dimensions and truncation indices. F or any A ∈ R M × N and b ∈ R M , w e consider the quadratic least-squares problem f ( x , u ) : = 1 2 ∥ A x − b ∥ 2 , (24) where u = ( A, b ), solv ed using gradien t descent. Let L and m resp ectiv ely denote the largest — 17 — A dditional Exp eriments and the smallest eigenv alues of A T A . When m > 0, the problem is m -strongly con v ex and admits a unique solution expressed analytically b y x ⋆ ( u ) = ( A T A ) − 1 A T b , allo wing the exact Jacobian of the solution map to b e computed for ev aluation purp oses. F or computational eﬃciency , w e generate the Jacobian-v ector pro ducts ˙ X ( k ) T v and v ector- Jacobian pro ducts w T ¯ U ( k ) K + ω T for v and w in R N . Eac h elemen t of A is dra wn from U (0 , 1) while that of b is sampled from N (0 , 1). W e conduct our exp eriments for a ﬁx M = 50 and multiple problem dimensions N ∈ { 2 , 5 , 10 , 20 , 30 , 40 } . This aﬀects the conditioning of the problem and hence the conv ergence rate of gradient descen t, allo wing us to examine the eﬀect of the conv ergence rate on the curse of unrolling. F or eac h dimension, w e also study the inﬂuence of the step size on un- rolling by considering tw o step sizes: the optimal step size α = 2 / ( L + m ) (Figures 3 – 8 ) and a sub optimal one α = 1 / (3 L ) (Figures 9 – 14 ). T o in vestigate truncation, we v ary the n um b er of initial algorithm iterations excluded from diﬀeren tiation. W e use the budget reallocation strategy describ ed in Section 4.3 and c ho ose ω = 3 . 0. In particular, for a ﬁxed dimension, all metho ds are compared under an equal computational budget, with reductions in deriv ativ e computations reallo cated to additional forw ard iterations according to ω . This ensures that observ ed impro v emen ts are attributable to truncation rather than increased computation. F or eac h problem dimension N , w e use 9 diﬀeren t v alues of T from { 0 , 0 . 1 , 0 . 2 , . . . , 0 . 8 } K . F or T = 0, we run our algorithm for K : = min( ⌈ log(10 − 3 ) / log( ρ ∗ ) ⌉ , 1000) iterations where ρ ∗ : = ( L − m ) / ( L + m ). T o keep the computational budget ﬁxed, the algorithm is run for K + ω T for a given T . W e start with the same initialization for ˙ X (0) T v = 0 for all T . W e use the linear solver of PyT orch [ 33 ] for computing x ⋆ ( u ) and the corresp onding Jacobians. W e p erform eac h exp eriment 100 times to generate 100 sequences e ( k ) , ˙ e ( k ) T and ¯ e ( k ) [ T ] and plot the median accros the 100 exp erimen ts. Eac h ﬁgure corresp onds to a single step size α and problem dimension N . The errors for forw ard mo de are sho wn in the left subﬁgure while those for reverse mo de error are depicted in the righ t subﬁgure. F or a b etter visualization of the eﬀect of truncation, w e plot all the late-started or truncated deriv ativ e error plots in the same ﬁgure for a giv en N and α . In each subﬁgure, the dashed blue curv es show the error of the algorithmic iterates for k = 0 , . . . , K + ω T max . F or eac h truncation index T , the solid curves represent the error of the corresp onding deriv ativ e sequences, plotted o v er the range k = T + ω T , . . . , K + ω T . The dashed blac k curves with circular mark ers indicate the ﬁnal error attained b y eac h deriv ativ e algorithm. Speciﬁcally , ﬁnal errors are reported at k = K + ω T for forward mo de and at k = T + ω T for rev erse mo de. The circular mark er on the algorithm error curv e denotes the ﬁnal algorithmic error corresp onding to the chosen truncation index T . In addition, the dashed curves with cross-shap ed mark ers in the left-hand ﬁgures depict the upp er b ound giv en in ( 22 ). F or the optimal step size and large con v ergence rates ρ (corresp onding to N = 30 and N = 40), the curse of unrolling is more pronounced, as the algorithm requires substantially — 18 — A dditional Exp eriments more iterations to conv erge. In this regime, the deriv ativ e error in b oth forw ard and reverse mo des increases as additional deriv ativ e recursions are performed, making T max the preferred c hoice. As the conv ergence rate improv es, that is, as ρ decreases, the severit y of the curse diminishes and the optimal truncation index shifts aw a y from T max . Moreov er, increasing the truncation index progressively suppresses the curse, as illustrated in Figures 3 – 6 . Finally , across all dimensions, using a sub optimal step size slows the con v ergence but also reduces the severit y of the curse of unrolling, whic h conﬁrms the ﬁndings of Scieur et al. [ 39 ]. Moreo ver, truncation is ineﬀective in this regime, since ev en without truncation the deriv ative error do es not increase substantially during the early iterations. These additional exp eriments reinforce the conclusions of the main text b y demonstrat- ing that the curse of unrolling is a non-asymptotic phenomenon gov erned by con v ergence rate, smo othness, and initialization, and that truncation pro vides an eﬀective and robust mec hanism for mitigating it across a range of problem scales. — 19 — A dditional Exp eriments 0 20 40 60 80 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 1 0 1 Er r ors (log scale) (a) F orward mo de. 0 20 40 60 80 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 Er r ors (log scale) (b) Rev erse mo de. Figure 3: Late-start / truncation b ehavior for N = 2, α = 2 / ( L + m ), and ρ ≈ 0 . 763526. 0 50 100 150 200 250 300 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 1 0 1 Er r ors (log scale) (a) F orward mo de. 0 50 100 150 200 250 300 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 Er r ors (log scale) (b) Rev erse mo de. Figure 4: Late-start / truncation b ehavior for N = 5, α = 2 / ( L + m ), and ρ ≈ 0 . 921979. 0 200 400 600 800 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 1 0 1 Er r ors (log scale) (a) F orward mo de. 0 200 400 600 800 k 1 0 9 1 0 7 1 0 5 1 0 3 1 0 1 1 0 1 Er r ors (log scale) (b) Rev erse mo de. Figure 5: Late-start / truncation b ehavior for N = 10, α = 2 / ( L + m ), and ρ ≈ 0 . 974362. — 20 — A dditional Exp eriments 0 500 1000 1500 2000 2500 3000 3500 k 1 0 8 1 0 6 1 0 4 1 0 2 1 0 0 1 0 2 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 8 1 0 6 1 0 4 1 0 2 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 6: Late-start / truncation b ehavior for N = 20, α = 2 / ( L + m ), and ρ ≈ 0 . 993876. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 2 1 0 1 1 0 0 1 0 1 1 0 2 1 0 3 1 0 4 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 2 1 0 1 1 0 0 1 0 1 1 0 2 Er r ors (log scale) (b) Rev erse mo de. Figure 7: Late-start / truncation b ehavior for N = 30, α = 2 / ( L + m ), and ρ ≈ 0 . 998509. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 0 1 0 1 1 0 2 1 0 3 1 0 4 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 0 1 0 1 1 0 2 Er r ors (log scale) (b) Rev erse mo de. Figure 8: Late-start / truncation b ehavior for N = 40, α = 2 / ( L + m ), and ρ ≈ 0 . 999722. — 21 — A dditional Exp eriments 0 20 40 60 80 k 1 0 1 1 0 0 Er r ors (log scale) (a) F orward mo de. 0 20 40 60 80 k 1 0 1 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 9: Late-start / truncation b ehavior for N = 2, α = 1 / (3 L ), and ρ ≈ 0 . 763526. 0 50 100 150 200 250 300 k 1 0 1 1 0 0 Er r ors (log scale) (a) F orward mo de. 0 50 100 150 200 250 300 k 1 0 1 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 10: Late-start / truncation b ehavior for N = 5, α = 1 / (3 L ), and ρ ≈ 0 . 921979. 0 200 400 600 800 k 1 0 1 1 0 0 Er r ors (log scale) (a) F orward mo de. 0 200 400 600 800 k 1 0 1 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 11: Late-start / truncation b ehavior for N = 10, α = 1 / (3 L ), and ρ ≈ 0 . 974362. — 22 — A dditional Exp eriments 0 500 1000 1500 2000 2500 3000 3500 k 1 0 1 1 0 0 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 1 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 12: Late-start / truncation b ehavior for N = 20, α = 1 / (3 L ), and ρ ≈ 0 . 993876. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 0 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 0 2 × 1 0 0 3 × 1 0 0 4 × 1 0 0 6 × 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 13: Late-start / truncation b ehavior for N = 30, α = 1 / (3 L ), and ρ ≈ 0 . 998509. 0 500 1000 1500 2000 2500 3000 3500 k 1 0 0 1 0 1 Er r ors (log scale) (a) F orward mo de. 0 500 1000 1500 2000 2500 3000 3500 k 4 × 1 0 0 5 × 1 0 0 6 × 1 0 0 7 × 1 0 0 8 × 1 0 0 9 × 1 0 0 Er r ors (log scale) (b) Rev erse mo de. Figure 14: Late-start / truncation b ehavior for N = 40, α = 1 / (3 L ), and ρ ≈ 0 . 999722. — 23 — R efer enc es Ac kno wledgmen ts Shehery ar Mehmo o d and P eter Oc hs are supp orted b y the German Researc h F oundation (DF G Gran t OC 150/4-1). References [1] M. Abadi, P . Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemaw at, G. Irving, M. Isard, M. Kudlur, J. Lev en b erg, R. Monga, S. Mo ore, D. G. Murray , B. Steiner, P . T uc ker, V. V asudev an, P . W arden, M. Wic ke, Y. Y u, and X. Zheng. T ensorﬂow: A system for large- scale mac hine learning. In 12th { USENIX } Symp osium on Op er ating Systems Design and Implementation ( { OSDI } 16) , pages 265–283, 2016. [2] A. Agraw al, B. Amos, S. Barratt, S. Bo yd, S. Diamond, and J. Z. Kolter. Diﬀerentiable con vex optimization lay ers. In A dvanc es in Neur al Information Pr o c essing Systems 32 , pages 9562–9574. Curran Asso ciates, Inc., 2019. [3] E. L. Allgo wer and K. Georg. Intr o duction to numeric al c ontinuation metho ds . SIAM, 2003. [4] B. Amos and J. Z. Kolter. OptNet: Diﬀerentiable optimization as a la yer in neural net works. In D. Precup and Y. W. T eh, editors, Pr o c e e dings of the 34th International Confer enc e on Machine L e arning , v olume 70 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 136–145, In ternational Conv en tion Cen tre, Sydney , Australia, 06–11 Aug 2017. PMLR. [5] K. A tkinson and W. Han. The or etic al numeric al analysis , volume 39. Springer, 2005. [6] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium mo dels. In H. W allach, H. Laro c helle, A. Beygelzimer, F. d ' Alc h´ e-Buc, E. F o x, and R. Garnett, editors, A dvanc es in Neur al Infor- mation Pr o c essing Systems , volume 32. Curran Asso ciates, Inc., 2019. [7] S. Bai, V. Koltun, and J. Z. Kolter. Multiscale deep equilibrium mo dels. In H. Laro chelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, A dvanc es in Neur al Information Pr o c essing Systems , volume 33, pages 5238–5250. Curran Asso ciates, Inc., 2020. [8] T. Beck. Automatic diﬀeren tiation of iterativ e pro cesses. Journal of Computational and Applie d Mathematics , 50(1-3):109–118, 1994. [9] Y. Bengio. Gradien t-based optimization of h yp erparameters. Neur al c omputation , 12:1889– 900, 09 2000. [10] L. Bertinetto, J. F. Henriques, P . T orr, and A. V edaldi. Meta-learning with diﬀeren tiable closed-form solv ers. In International Confer enc e on L e arning R epr esentations , 2019. [11] Q. Bertrand, Q. Klopfenstein, M. Massias, M. Blondel, S. V aiter, A. Gramfort, and J. Salmon. Implicit diﬀeren tiation for fast h yperparameter selection in non-smo oth con v ex learning. Jour- nal of Machine L e arning R ese ar ch , 23(149):1–43, 2022. [12] J. Bolte, E. P auw els, and S. V aiter. Automatic diﬀeren tiation of nonsmooth iterativ e algo- rithms. In A. H. Oh, A. Agarw al, D. Belgrav e, and K. Cho, editors, A dvanc es in Neur al Information Pr o c essing Systems , 2022. [13] J. Bradbury , R. F rostig, P . Hawkins, M. J. Johnson, C. Leary , D. Maclaurin, G. Necula, A. Paszk e, J. V anderPlas, S. W anderman-Milne, and Q. Zhang. JAX: composable transfor- mations of Python+NumPy programs, 2018. [14] B. Christianson. Reverse accum ulation and attractive ﬁxed p oints. Optimization Metho ds and Softwar e , 3(4):311–326, 1994. [15] C.-A. Deledalle, S. V aiter, J. F adili, and G. Peyr ´ e. Stein un biased gradien t estimator of the risk — 24 — R efer enc es (sugar) for multiple parameter selection. SIAM Journal on Imaging Scienc es , 7(4):2448–2487, 2014. [16] S. Demp e, V. Kalashnik o v, G. A. Prez-V alds, and N. Kalashnyk o v a. Bilevel Pr o gr amming Pr oblems: The ory, Algorithms and Applic ations to Ener gy Networks . Springer Publishing Compan y , Incorp orated, 2015. [17] S. Dempe and A. Zemk oho. Bilev el optimization. In Springer optimization and its applic ations , v olume 161. Springer, 2020. [18] J. Domke. Generic metho ds for optimization based mo deling. In Pr o c e e dings of the Fifte enth International Confer enc e on A rtiﬁcial Intel ligenc e and Statistics , volume 22 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 318–326, La Palma, Canary Islands, 21–23 Apr 2012. PMLR. [19] A. L. Dontc hev and R. T. Ro c k afellar. Implicit F unctions and Solution Mappings , volume 543. Springer, 2009. [20] J. C. Gilbert. Automatic diﬀerentiation and iterative pro cesses. Optimization Metho ds and Softwar e , 1(1):13–21, 1992. [21] A. Griewank, C. Bischof, G. Corliss, A. Carle, and K. Williamson. Deriv ative con vergence for iterativ e equation solvers. Optimization metho ds and softwar e , 2(3-4):321–355, 1993. [22] A. Griew ank and A. W alther. Evaluating Derivatives: Principles and T e chniques of Algorith- mic Diﬀer entiation . SIAM, Philadelphia, 2nd edition, 2008. [23] T. H. Hamam and J. Romberg. Streaming solutions for time-v arying optimization problems. IEEE T r ansactions on Signal Pr o c essing , 70:3582–3597, 2022. [24] P . J. Huber. Robust estimation of a lo cation parameter. In Br e akthr oughs in statistics: Metho d- olo gy and distribution , pages 492–518. Springer, 1992. [25] K. Ji, J. Y ang, and Y. Liang. Bilev el optimization: Con v ergence analysis and enhanced design. In International c onfer enc e on machine le arning , pages 4882–4892. PMLR, 2021. [26] K. Lee, S. Ma ji, A. Ra vic handran, and S. Soatto. Meta-learning with diﬀeren tiable con v ex optimization. In Pr o c e e dings of the IEEE/CVF c onfer enc e on c omputer vision and p attern r e c o gnition , pages 10657–10665, 2019. [27] H. Liu, K. Simon yan, and Y. Y ang. DAR TS: Diﬀerentiable arc hitecture search. In International Confer enc e on L e arning R epr esentations , 2019. [28] D. Maclaurin, D. Duv enaud, and R. P . Adams. Autograd: Eﬀortless gradients in nump y . In ICML 2015 AutoML Workshop , volume 238, page 5, 2015. [29] S. Mehmo od and P . Oc hs. Automatic diﬀerentiation of some ﬁrst-order methods in parametric optimization. In S. Chiappa and R. Calandra, editors, The 23r d International Confer enc e on A rtiﬁcial Intel ligenc e and Statistics, AIST A TS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy] , v olume 108 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 1584–1594. PMLR, 2020. [30] S. Mehmoo d and P . Oc hs. Fixed-p oin t automatic diﬀeren tiation of forw ard–bac kw ard splitting algorithms for partly smo oth functions. arXiv pr eprint arXiv:2208.03107 , 2022. [31] S. Mehmoo d and P . Oc hs. Automatic diﬀerentiation of optimization algorithms with time- v arying up dates. In Pr o c e e dings of the 42nd International Confer enc e on Machine L e arning , v olume 267 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 43581–43602. PMLR, 13–19 Jul 2025. [32] P . Ochs, T. Brox, and T. Pock. ipiasco: Inertial proximal algorithm for strongly con vex optimization. Journal of Mathematic al Imaging and Vision , 53(2):171–181, Octob er 2015. [33] A. Paszk e, S. Gross, F. Massa, A. Lerer, J. Bradbury , G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. An tiga, A. Desmaison, A. Kopf, E. Y ang, Z. DeVito, M. Raison, A. T e- — 25 — R efer enc es jani, S. Chilamkurthy , B. Steiner, L. F ang, J. Bai, and S. Chin tala. Pytorch: An imp erativ e st yle, high-p erformance deep learning library . In H. W allach, H. Laro c helle, A. Beygelzimer, F. d ' Alch ´ e-Buc, E. F o x, and R. Garnett, editors, A dvanc es in Neur al Information Pr o c essing Systems 32 , pages 8024–8035. Curran Asso ciates, Inc., 2019. [34] E. P auw els and S. V aiter. The deriv ativ es of sinkhorn–knopp conv erge. SIAM Journal on Optimization , 33(3):1494–1517, 2023. [35] F. P edregosa. Hyp erparameter optimization with approximate gradient. In International c onfer enc e on machine le arning , pages 737–746. PMLR, 2016. [36] B. T. P olyak. Some metho ds of sp eeding up the con v ergence of iteration metho ds. USSR Computational Mathematics and Mathematic al Physics , 4(5):1–17, 1964. [37] B. T. P olyak. Intr o duction to Optimization . Optimization Softw are, 1987. [38] A. Ra jeswaran, C. Finn, S. M. Kak ade, and S. Levine. Meta-learning with implicit gradien ts. A dvanc es in neur al information pr o c essing systems , 32, 2019. [39] D. Scieur, G. Gidel, Q. Bertrand, and F. P edregosa. The curse of unrolling: Rate of diﬀeren- tiating through optimization. A dvanc es in Neur al Information Pr o c essing Systems , 35:17133– 17145, 2022. [40] A. Shaban, C.-A. Cheng, N. Hatc h, and B. Bo ots. T runcated back-propagation for bilev el optimization. In The 22nd international c onfer enc e on artiﬁcial intel ligenc e and statistics , pages 1723–1732. PMLR, 2019. — 26 —

Understanding the Curse of Unrolling

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment