ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

ASD A: Automated Skill Distillation and A daptation for Financial Reasoning Tik Y u Yim, W en ting T an, Sum Y ee Chan, T ak-W ah Lam, and Siu Ming Yiu The Univ ersit y of Hong Kong, Hong Kong SAR, China {tyyim, wt212796}@connect.hku.hk, sumychan@hku.hk, {twlam, smyiu}@cs.hku.hk Abstract. A dapting large language models (LLMs) to specialized ﬁ- nancial reasoning t ypically requires exp ensiv e ﬁne-tuning that produces mo del-lock ed expertise. T raining-free alternatives ha v e emerged, yet our exp erimen ts show that leading methods (GEP A and ACE) achiev e only marginal gains on the F AMMA ﬁnancial reasoning b enchmark, exposing the limits of unstructured text optimization for complex, multi-step do- main reasoning. W e introduce Automated Skill Distillation and Adapta- tion (ASDA), a framework that automatically generates structured skill artifacts through iterativ e error-correctiv e learning without modifying mo del weigh ts. A teac her mo del analyzes a student mo del’s failures on ﬁnancial reasoning tasks, clusters errors by subﬁeld and error type, and syn thesizes skill ﬁles containing reasoning pro cedures, co de templates, and work ed examples, whic h are dynamically injected during inference. Ev aluated on F AMMA, ASDA achiev es up to +17.33 pp improv ement on arithmetic reasoning and +5.95 pp on non-arithmetic reasoning, sub- stan tially outperforming all training-free baselines. The resulting skill artifacts are h uman-readable, v ersion-controlled, and compatible with the Agen t Skills op en standard, oﬀering an y organization with a labeled domain dataset a practical and auditable path to domain adaptation without weigh t access or retraining 1 . Keyw ords: LLM Adaptation · Financial Reasoning · Skill Distillation · T raining-F ree A daptation · Agent Skills 1 In tro duction Financial reasoning poses a distinctive challenge for general-purp ose LLMs: it demands sim ultaneous mastery of multi-step quantitativ e calculation and deep domain-sp eciﬁc judgmen t, a com bination that pure math or pure kno wledge b enc hmarks do not join tly test [6,13]. Ev aluations across m ultiple ﬁnancial b enc h- marks conﬁrm a p ersisten t p erformance ceiling: F AMMA reveals that standard, 1 Co de and skill libraries are a v ailable at https://github.com/SallyTan13/ ASDA- skill 2 T.Y. Yim et al. non-reasoning fron tier mo dels achiev e only 38–45% o verall accuracy across eight ﬁnancial subﬁelds [13] 2 ; and FinBen ﬁnds that while LLMs handle information extraction well, they consistently struggle with adv anced reasoning and com- plex ﬁnancial QA [6]. F AMMA’s error analysis ﬁnds that domain-know le dge gaps dominate mo del errors [13], as mo dels misapply ﬁnancial concepts to the wrong con text or lac k the exp ertise to select the correct pro cedure. These are not failures that more parameters alone will ﬁx—they require targeted adv ances in domain reasoning. The standard remedy—domain-speciﬁc ﬁne-tuning—is costly , pro duces mo del-lo cke d exp ertise that becomes obsolete with eac h mo del release, and dep ends on sup ervision resources that many organizations lack [11,14]. This is especially problematic in regulated industries suc h as ﬁnancial services, legal, and healthcare, where organizations deploy commercial LLMs via black-box API without w eight access. Automated prompt optimization oﬀers a training-free al- ternativ e, but metho ds like GEP A [2] and ACE [1] optimize ﬂat text strings — monolithic instruction blocks that lack the mo dularit y and executabilit y required for multi-step reasoning across diverse ﬁnancial sub domains, as our F AMMA ex- p erimen ts conﬁrm (Section 5). The missing abstraction is not a b etter prompt, but an exe cutable skil l : a mo dular, self-contained reasoning pro cedure that can b e indep endently comp osed, tested, and up dated for eac h target domain. W e introduce Automated Skill Distillation and Adaptation (ASD A) , a framework that automatically generates executable agen t skills from error anal- ysis without mo difying mo del w eigh ts. A teacher mo del diagnoses a studen t’s failures on ﬁnancial tasks, clusters them by subﬁeld and error type to identify the ro ot causes of domain-know le dge gaps , and synthesizes skill ﬁles containing domain-sp eciﬁc reasoning pro cedures and co de templates, whic h a selector in- jects at inference time. Ev aluated on F AMMA, ASDA achiev es up to +17.33 pp on arithmetic and +5.95 pp on non-arithmetic reasoning after iterative reﬁnemen t, signiﬁcantly outperforming all training-free baselines. The resulting skill library is not a better prompt, but a new represen tational la yer b et ween the mo del and its deplo yment context that can b e version-con trolled, audited, and regenerated for an y successor mo del. 1.1 Con tributions (1) ASDA framework. W e introduce the ﬁrst system to automatically gen- erate executable agent skills for domain-sp eciﬁc reasoning using only blac k-b ox LLM access—no weigh t up dates or gradien t computation—substantially outper- forming all training-free baselines on F AMMA. (2) Self-suﬃcient adaptation from questions and answers alone. A self- teac hing ablation shows that ASD A can improv e a mo del using only the ques- tions and ground-truth answers in the training set—no sup erior teac her mo del 2 Extended-thinking mo dels (GPT-o1, DeepSeek-R1, Qw en-QwQ-32B) score 67–76% on F AMMA, and P oT-augmented v ariants reac h 78–86%. W e exclude these as think- ing budget introduces a confound orthogonal to domain adaptation; our ev aluation targets standard non-reasoning models under ﬁxed inference budgets. ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 3 required—ac hieving +6.33 pp (73% of the full gain). This means an y organization with a labeled domain dataset can run ASD A on their deplo yed mo del directly , without access to a stronger or more exp ensive mo del, making the framew ork practical for real-w orld enterprise deplo yment. 2 Related W ork 2.1 Financial LLM Adaptation Domain-sp eciﬁc ﬁne-tuning has b een the dominant approach to adapting LLMs for ﬁnancial tasks. Blo om b ergGPT [11] demonstrated the p oten tial of ﬁnance- sp eciﬁc pretraining but required approximately 1.3 million GPU hours on a pro- prietary 363B-token corpus—resources b eyond most organizations. FinGPT [14] oﬀered a more accessible alternativ e through LoRA-based ﬁne-tuning on op en ﬁnancial data. More recently , Xue et al. [13] explored distillation from DeepSeek- R1 to smaller models for ﬁnancial reasoning. Despite these adv ances, all ﬁne- tuning approaches share a fundamental limitation: they pro duce mo del-lo c ked exp ertise that requires re-training when the base mo del is up dated or replaced, and man y are incompatible with black-box API access to commercial LLMs. 2.2 T raining-F ree A daptation Automated prompt optimization oﬀers a weigh t-free alternativ e. Prior work for- malizes this as automatic diﬀerentiation o v er text [15] or optimizable module comp ositions [7]. GEP A [2] ac hieves state-of-the-art results on several b enc h- marks through reﬂective prompt evolution. ACE [1] takes a test-time knowledge accum ulation approach, building contextual expertise during inference. While these metho ds a void ﬁne-tuning costs, their output is a ﬂat text string—a mono- lithic instruction block that cannot represent the mo dular, m ulti-step procedural kno wledge required for complex domain reasoning, as our F AMMA exp erimen ts conﬁrm (Section 5). A complementary line of work treats failure analysis as the primary learning signal. LEMMA [3] synthesizes error-t yp e-grounded training data for mathe- matical reasoning, consistently outp erforming correction-agnostic data augmen- tation baselines, establishing that failure-driv en analysis yields richer adaptation signal—a principle ASD A extends to training-free, executable skill generation. A t the other end of the adaptation sp ectrum, test-time training (TTT) metho ds temp orarily up date mo del w eights at inference, requiring gradient access incom- patible with blac k-b o x API deploymen ts; ASDA o ccupies the gap b etw een these approac hes. 2.3 Skill-Based Agent Architectures Recen t agen t work has explored skill libraries as reusable kno wledge artifacts. V o yager [10] demonstrated that comp osable executable skills—stored as Ja v aScript 4 T.Y. Yim et al. co de—enable con tin ual capability gro wth in open-ended Minecraft environmen ts. The Agen t Skills op en standard [4] formalized p ortable Markdown skill ﬁles with routing metadata, progressive disclosure, and em b edded code templates; ASD A generates skill artifacts compatible with this standard. On the skill-distillation side, a concurrent w ork, SkillRL [12], distills hier- arc hical skills from interactiv e agent rollouts and co-evolv es them with a rein- forcemen t learning p olicy . SkillRL op erates in the in teractive agen tic setting and still requires SFT and weigh t up dates, whereas ASD A distills reasoning patterns from error analysis for static domain QA—en tirely training-free and compatible with API-only mo dels. A cross threads, no prior w ork has addressed automate d, tr aining-fr e e skill generation for domain-sp eciﬁc r e asoning from error analysis, the gap ASDA ﬁlls. 3 Metho d: ASD A F ramew ork ASD A op erates through a teacher–studen t architecture comprising tw o phases: (1) a warm-up pip eline that establishes an initial skill library from systematic error analysis, and (2) an iter ative r eﬁnement lo op that reﬁnes skills through rep eated evidence collection and v alidation. Figure 1 provides an ov erview. Before describing each phase, w e introduce three terms used throughout. An error t yp e is one of ten categories from a predeﬁned taxonom y 3 . A pattern is a single, named failure scenario within a skill ﬁle, one sp eciﬁc kno wledge gap or recurring mistake the model makes. A skill ﬁle groups all patterns that share the same ﬁnancial subﬁeld and error type; for example, fixed_income/wrong_ method_selection.md collects all wrong-metho d failures observ ed in the ﬁxed income subﬁeld. 3.1 Skills W arm-Up F ailure Analysis and Structured Annotation The w arm-up stage con- structs an initial skill library K 0 from the student mo del’s failures on the train- ing set. F or each question the student answ ers incorrectly , the teacher model receiv es the question, the student’s incorrect answ er and reasoning trace, and the ground-truth answ er. The teacher is then prompted to p erform failure anal- ysis and output a structured annotation in the follo wing format: { "subfield": "fixed_income", "error_type": "wrong method selection", "root_cause": "Lacks knowledge that forward rates must be composed sequentially as discount factors, not applied independently per period" } 3 The ten error t yp es are: visual evidence, wrong metho d selection, concept confusion, missed multi-step computation, unit/currency mistakes, missed constrain ts, wrong targets, wrong output format, co de execution errors (P oT-sp eciﬁc), and other. ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 5 Domain-Specific Reasoning Questions Failure Cases Failure Analysis WHY student model failed ？ - Missing Knowledge? - Concept Confusion? - Ignoring Constraints... CLASSIFY the error type (Subfield ✖ Error Type) Clustering Ground-Truth Skills Distillation → Skill Libary K0 Test on Student Model with and w/o skills Kt Evidence Collection Q+ Q- Q_gap analyze Common Subfield Attribution Analysis Coverage Refinement ❌ iterative refine / create Refine skills create new skills Verify Common Subfield Update Q_gap cases Q+ & Q- cases ❌ iterative refine Refine skills Verify post-verification Safety Refinement Update Skill Library Kt+1 Phase 1: Skill Warm Up Phase 2: Iterative Skill Refinement Inference Phase Question Context Selected Skill Files LLM-based Selector Student Model Augmented Prompt Answer Student Model Teacher Model Fig. 1. Overview of the ASDA framew ork. Phase 1 (warm-up): the teacher mo del analyzes student failures, pro ducing structured annotations that are clustered b y sub- ﬁeld and error type to synthesize an initial skill library K 0 . Phase 2 (iterativ e re- ﬁnemen t): the library is reﬁned through tw o sequential phases, cov erage reﬁnement (resolving uncov ered failures in Q gap ) follow ed by safety reﬁnement (suppressing re- gressions in Q − ), with ev ery skill up date gated by a correctness threshold. Inference: a selector reads SKILL.md and injects the relev an t skill ﬁles into the studen t’s prompt. The error_type ﬁeld is constrained to the taxonom y deﬁned ab o ve. The root_cause ﬁeld captures the underlying knowledge gap rather than a surface description of what was computed incorrectly , ensuring the diagnosis is action- able for skill syn thesis. Skill Library Organization The annotated failures are clustered b y their (subfield, error_type) pair. Eac h cluster b ecomes one skill ﬁle. The library is therefore organized as a t wo-lev el hierarch y: SKILL.md % navigation + routing table common/ visual_evidence.md % cross-subfield patterns wrong_output_format.md fixed_income/ wrong_method_selection.md % one file per subfield x error type concept_confusion.md corporate_finance/ concept_confusion.md derivatives/ ... Within each skill ﬁle, the teacher synthesizes one p attern p er distinct failure scenario iden tiﬁed in the cluster. Eac h pattern contains: a concise description of the addressed knowledge gap, explicit "when to use“ conditions, step-by-step reasoning pro cedures, and work ed examples or co de templates. Figure 2 shows one pattern from fixed_income/wrong_method_selection.md ; the full ﬁle con- tains ﬁve additional patterns cov ering other recurring failure scenarios in that subﬁeld. 6 T.Y. Yim et al. The library also includes a top-lev el SKILL.md navigation ﬁle that summarizes the scope of each skill and pro vides a structured mapping from subﬁeld keyw ords and failed ﬁnancial patterns to skill ﬁle paths. This navigation ﬁle allo ws the do wnstream selector to identify relev ant skills without parsing every skill ﬁle in full. Skill Selection and Injection A t inference time, an LLM-based selector reads the question text and the SKILL.md mapping table to identify the relev ant ﬁ- nancial subﬁeld and match the question’s characteristics against listed patterns. The selector ma y c ho ose m ultiple skill ﬁles for a single question, since complex ﬁ- nancial reasoning often requires combining guidance from sev eral patterns (e.g., a ﬁxed income pricing question ma y need b oth wrong_method_selection.md and common/missed_constraints.md ). Selected skill ﬁles are injected into the studen t’s prompt as domain knowledge, guiding it to follow the appropriate rea- soning pro cedure for the question. 3.2 Dual-Phase Iterative Skill Reﬁnemen t The warm-up stage captures the m ost systematic failure patterns but leav es tw o residual problems: c over age gaps , where some failures remain because existing skills do not y et address the required kno wledge; and r e gr essions , where skill injection causes previously correct answers to b ecome incorrect b ecause a skill o verﬁts a narrow failure pattern and misleads the mo del on related but distinct questions. W e address these through iterative reﬁnemen t, alternating b et ween a c over age phase that expands skill cov erage and a safety phase that suppresses regressions. Across b oth phases, ev ery candidate skill up date passes through a veriﬁc ation gate : the up dated skill is tested by ha ving the student re-solve the target questions with the new skill injected, and is committed to the library only if the resulting accuracy meets a predeﬁned threshold τ . If it fails, the teacher regenerates a revised prop osal, rep eating up to N max attempts b efore falling bac k to the previous skill v ersion. The full pro cedure is summarized in Algorithm 1. Evidence Collection and Attribution A t the start of eac h reﬁnemen t it- eration t , ev ery training question is ev aluated under tw o conditions: once with the curren t skill library K t injected, and once without an y skills. By comparing results, we partition the training set into three disjoint groups: Q + t (correct with skills), Q − t (incorrect with skills but correct without, i.e., regressions introduced b y the current library), and Q gap t (incorrect under b oth conditions, i.e., cov erage failures that the library has not y et resolved). Since multiple skill ﬁles may b e loaded for eac h question, a naïv e per-question outcome cannot be attributed to a sp eciﬁc ﬁle. An attribution step follows: for eac h question in Q + t , Q − t , and Q gap t , the teac her examines the studen t’s reason- ing trace alongside the loaded ﬁles and identiﬁes the single ﬁle most resp onsible for the outcome. Questions are then re-group ed b y their attributed ﬁle, pro duc- ing p er-ﬁle evidence sets that isolate eac h ﬁle’s individual contribution to ﬁxes, regressions, and remaining co verage gaps. ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 7 fixed_income/wrong_method_selection.md Sequential Discounting with Forward Rates for Coupon Bonds When pricing coupon bonds using forward rates, each cash flow must be discounted by the cumulative product of (1 + forward_rate) for all periods from today to that cash flow's date, not by a single rate or simple average. When to Use Questions asking for bond price given a table/list of forward rates (labeled "Y ear 0 (today)", "Y ear 1", etc.) and bond characteristics (coupon rate, maturity , par value). Procedure 1. Price = Σ[CF t / Π(1 + f i )] for i =0 to t −1 , where CF t is cash flow at time t and f i is forward rate for period i 2. Extract all forward rates from the given table in chronological order ( f 0 , f 1 , f 2 , …) 3. For each cash flow date t , compute cumulative discount factor: DF t = (1+ f 0 )(1+ f 1 )…(1+ f t −1 ) 4. Divide each cash flow by its corresponding cumulative discount factor and sum all present values 5. Return the final sum as the bond price (expression, not print) Correct Code Scenario: 3-year bond, 7% annual coupon, $1,000 par . Forwar d rates: Y ear 0=4%, Y ear 1=5%, Y ear 2=6%. # Bond characteristics par_value = 1000 coupon_rate = 0.07 maturity = 3 annual_coupon = par_value * coupon_rate # Forward rates for each period forward_rates = [ 0.04 , 0.05 , 0.06 ] # Calculate price using sequential discounting price = 0 cumulative_discount = 1.0 for year in range( 1 , maturity + 1 ): cumulative_discount *= ( 1 + forward_rates[year - 1 ]) if year == maturity: cash_flow = annual_coupon + par_value else : cash_flow = annual_coupon price += cash_flow / cumulative_discount price # Result: approximately 1027.51 Common Bugs to A void Using forward rates directly as discount rates without cumulative multiplication ✗ A veraging forward rates instead of compounding them sequentially ✗ Forgetting to include par value in final year's cash flow ✗ Using print(price) instead of returning price as final expression ✗ Off-by-one errors in indexing forward rates (Y ear 0 rate applies to Y ear 1 cash flow) ✗ Fig. 2. One pattern from an ASDA skill ﬁle pro duced during warm-up from Haiku 3.5 failure analysis by Sonnet 4.5. The full ﬁle con tains ﬁv e additional patterns. Co verage Phase F or eac h skill ﬁle, the teacher examines the Q gap t cases at- tributed to that ﬁle and diagnoses why cov erage fails. Common causes include missing procedures for an edge case, trigger conditions that are to o narro w to ﬁre on relev an t questions, or the absence of a work ed example for a pattern the ﬁle do es not yet con tain. Based on this diagnosis, the teac her prop oses an up date, either b y reﬁning an existing pattern or by adding a new one. Each prop osal is submitted to the veriﬁcation gate: the student re-solv es the attributed Q gap t 8 T.Y. Yim et al. Algorithm 1 Dual-Phase Iterative Skill Reﬁnemen t Require: Skill library K 0 , training set D , studen t M s , teacher M t , max iterations T Ensure: Reﬁned skill library K ∗ 1: for t = 1 to T do // Evidenc e Col le ction 2: Q + , Q − , Q gap ← Ev alWithWithout ( D , K t , M s ) 3: A ttributeToFiles ( Q + , Q − , Q gap , M t ) // Cover age Phase 4: for each ﬁle f with attributed Q gap f  = ∅ do 5: f ′ ← M t . Pr oposeExp ansion ( f , Q gap f ) 6: if Verify ( f ′ , Q gap f , M s ) ≥ τ cov then 7: K t [ f ] ← f ′ 8: end if 9: end for 10: ˜ Q + , ˜ Q − ← PostCo verageVerify ( K t , Q + , Q gap , M s ) // Safety Phase 11: for each ﬁle f with attributed ˜ Q − f  = ∅ do 12: f ′ ← M t . Pr oposeRep air ( f , ˜ Q + f , ˜ Q − f ) 13: if Verify ( f ′ , ˜ Q + f , M s ) ≥ τ safe then 14: K t [ f ] ← f ′ 15: end if 16: end for 17: K t +1 ← K t 18: end for 19: return K T cases with the candidate update injected, and the up date is accepted only if the reco very rate exceeds τ cov . After all p er-ﬁle cov erage up dates are committed, a p ost-co verage veriﬁcation pass re-ev aluates the entire aﬀected set. Cases in Q gap t that are no w solved are promoted into an up dated p ositiv e set ˜ Q + t . Cases in Q + t that regress under the mo diﬁed skills are merged with the existing Q − t to form an up dated regression set ˜ Q − t . This up dated partition forms the input to the safet y phase. Safet y Phase The safety phase resolves regressions in ˜ Q − t , including b oth pre- existing ones and those newly in tro duced by the co v erage phase, without dis- rupting correct b eha vior on ˜ Q + t . F or each skill ﬁle, the teacher receives b oth sets sim ultaneously as contrastiv e evidence: the ˜ Q + t cases (annotated with what the curren t skill gets righ t) serv e as preserv ation constraints, and the ˜ Q − t cases (an- notated with what go es wrong) serv e as repair targets. The teacher prop oses a revised skill that remov es or narrows the guidance resp onsible for the regressions while preserving the reasoning steps that pro duce correct answ ers on the p osi- tiv e set. Each prop osal again passes through the veriﬁcation gate with threshold ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 9 τ safe , which requires that accuracy on the p ositiv e cases not degrade to o muc h while reco vering as man y negative cases as p ossible. After b oth phases complete, the up dated library K t +1 b ecomes the input for the next iteration. 4 Exp erimen tal Setup 4.1 Benc hmark: F AMMA F AMMA-Basic [13] pro vides the scale, arithmetic/non-arithmetic decomposi- tion, and self-contained textual context our pip eline requires. 4 It comprises 1,945 questions sourced from universit y textb o oks and professional ﬁnance exams, spanning eight ﬁnancial sub domains (e.g., corp orate ﬁnance, deriv ativ es, p ort- folio management) across three diﬃcult y levels, with an explicit decomp osition b et ween arithmetic and non-arithmetic questions that enables separate ev alua- tion of pro cedural and conceptual skill eﬀectiv eness. W e use the F AMMA-Basic- T xt release, whic h pro vides OCR-extracted textual con text for eac h question, ensuring that ev aluation targets reasoning abilit y rather than retriev al. 5 4.2 Data Filtering and Split W e restrict the corpus to the 1,378 English-language questions to control for lan- guage v ariation, ensuring that observ ed p erformance diﬀerences reﬂect reasoning capabilit y alone. Since F AMMA provides no oﬃcial train–test split, we construct our own: separating the English corpus into arithmetic and non-arithmetic sub- sets and applying stratiﬁed 60/40 splits based on diﬃculty level (easy , medium, hard) and question type (m ultiple-choice vs. op en-ended). This produces 448 training and 300 test questions for arithmetic, and 378 training and 252 test questions for non-arithmetic. All reported results are on these held-out test sets and are therefore not directly comparable to those rep orted b y Xue et al. [13]. 4.3 Ev aluation Proto col ASD A’s distillation pip eline operates on individual question–answer pairs, so we ev aluate eac h question indep endently rather than in group ed LLM calls as in the original F AMMA proto col. F AMMA stores shared context only in the ﬁrst sub- question of each group, so w e propagate this con text to all sub-questions to pre- serv e information completeness. 6 Arithmetic questions use Program-of-Though t 4 W e also considered FinMR [5], Fin anceMath [16], FinanceQA [9], and FinMME [8], but these are either to o small for reliable train–test splits, withhold ground-truth answ ers for their primary test sets, yield baseline results that diﬀer substantially from published ﬁgures, or fo cus on visual and retriev al-based reasoning rather than pro cedural ﬁnancial reasoning. 5 F AMMA also includes a Liv ePro subset of 103 expert-curated questions, but only 35 are English-language and most are op en-ended, making it to o small for our pip eline. 6 A dditional prepro cessing: w e enforce expression-based PoT outputs rather than print() statements to preven t execution failures. 10 T.Y. Yim et al. T able 1. Ev aluation protocol modiﬁcations. Original F AMMA Ours Question handling Group ed sub-questions Indep enden t MC ev aluation LLM judge Rule-based exact match Op en-ended ev aluation LLM judge (GPT-4o) LLM judge (Qwen-Max) Arithmetic execution P oT P oT + MC mapping step (P oT) code execution, with an additional selection step that maps n umeric out- puts to the closest multiple-c hoice option where applicable. F or ev aluation, w e adopt a h ybrid approac h: rule-based exact matc hing for m ultiple-choice ques- tions and LLM-based judging for op en-ended questions, replacing the original proto col’s use of LLM judging for all question t yp es. T able 1 summarizes these mo diﬁcations. V alidation conﬁrms these mo diﬁcations do not inﬂate gains: swapping Qwen- Max for GPT-4o yields 99.3% agreement across 1,104 judgments (delta: 0.00 pp arithmetic, − 0 . 39 pp non-arithmetic); adopting the full F AMMA LLM-judge strategy conﬁrms the same shift. 4.4 Baselines W e compare ASDA against tw o leading training-free adaptation metho ds that op erate under the same deploymen t constraint—blac k-b o x API access without w eight modiﬁcations: GEP A [2] and A CE [1]. Both metho ds optimize a sin- gle monolithic text applied uniformly to all questions. The baseline condition uses the student mo del with a standard task prompt and no injected skills or optimized prompts. Neither GEP A nor ACE has previously b een ev aluated on F AMMA; the results rep orted here are, to our kno wledge, the ﬁrst published ev aluations of both metho ds on this b enc hmark. T o ensure a fair comparison, we adapt training conditions to each method’s arc hitectural requirements. 7 4.5 Implemen tation Details T w o comp onen ts are ﬁxed across all exp erimen tal conditions: Qw en-Max serves as the LLM ev aluation judge for op en-ended questions, and Qw en-T urb o p er- forms the MC selection step that maps numeric PoT outputs to answ er c hoices. All inference is run at temp erature 0 for reproducibility . 8 7 GEP A follows its original 3-wa y split protocol, training on 50% of our training p ool (222 arithmetic, 188 non-arithmetic) and selecting the b est prompt on the remaining 50%. ACE and ASD A use the full training p ool (448 arithmetic, 378 non-arithmetic). 8 All mo dels are accessed via the Anthropic API, with tw o exceptions: Haiku 3.5 via Op enRouter (no longer av ailable on the An thropic API directly) and Qwen mo dels via DashScop e. ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 11 T able 2. ASDA results across student mo dels on F AMMA. F or Haiku 3.5, GEP A and A CE are shown as reference baselines. All exp erimen ts use Sonnet 4.5 as the teacher mo del. WU = W arm-Up; E2 = best reﬁnement epo c h. ∆ denotes absolute improv ement in p ercen tage points ov er each mo del’s own baseline. Studen t Metho d Arithmetic Non-Arithmetic A cc. (%) ∆ A cc. (%) ∆ Haiku 3.5 Baseline 41.00 — 49.21 — GEP A [2] 42.33 +1.33 50.79 +1.58 A CE [1] 44.30 +3.30 49.60 +0.39 ASD A WU (ours) 49.67 +8.67 51.98 +2.78 ASD A E2 (ours) 58.33 +17.33 55.16 +5.95 Haiku 4.5 Baseline 64.67 — 57.14 — ASD A WU (ours) 69.67 +5.00 56.35 − 0.79 ASD A E2 (ours) 70.66 +5.99 58.74 +1.60 5 Results and Analysis 5.1 Main Results W e ev aluate ASD A across t wo studen t models on both arithmetic and non- arithmetic tasks. F or Haiku 3.5, w e additionally include GEP A and A CE as training-free reference points; both ac hieve only marginal gains on F AMMA, whic h suggests the structural limitations of ﬂat-text optimization. ASD A consisten tly improv es performance for Claude-family student mo dels. On arithmetic, Haiku 3.5 gains +8.67 pp at warm-up and +17.33 pp after tw o reﬁnemen t ep ochs. Haiku 4.5, despite its stronger 64.67% baseline, achiev es a +5.99 pp—showing that ASD A adds v alue even for more capable models. Non- arithmetic gains are smaller but consistent for Haiku 3.5 (+2.78 pp warm-up, +5.95 pp at E2); Haiku 4.5 sees mo dest non-arithmetic impro vemen t only after reﬁnemen t (+1.60 pp at E2). Eﬀe ct of iter ative r eﬁnement. The w arm-up stage targets the most frequent fail- ure patterns and delivers the ﬁrst large wa v e of gains. Iterativ e reﬁnement then addresses residual failures that remain after initial skill injection. F or arithmetic, accuracy improv es from 49.67% at warm-up to 54.67% at ep o c h 1 and reaches 58.33% at ep och 2. A smaller but consisten t trend holds for non-arithmetic, where ep o c h 2 p eaks at 55.16%. Both domains regress at ep och 3 (arithmetic: 54.33%; non-arithmetic: 51.98%), indicating ov erﬁtting to residual training-set patterns after tw o reﬁnement passes. In practice, tw o reﬁnement ep o c hs repre- sen t the optimal op erating p oin t. Gains by question typ e. Skills consisten tly produce larger gains on m ultiple- c hoice questions than on op en-ended ones. F or Haiku 3.5 arithmetic at warm-up, 12 T.Y. Yim et al. T able 3. Self-teac hing ablation. Each mo del serves as both student and teacher; no sup erior mo del is inv olved. F ull ASDA (Sonnet 4.5 teacher) results are shown for ref- erence. Studen t Domain Baseline With Skills ∆ F ul l ASDA (Sonnet 4.5 te acher), for r efer enc e Haiku 3.5 Arithmetic 41.00 49.67 +8.67 Haiku 3.5 Non-Arith 49.21 51.98 +2.78 Self-T e aching (student = te acher) Haiku 3.5 Arithmetic 41.00 47.33 +6.33 Haiku 3.5 Non-Arith 49.21 50.79 +1.58 MC accuracy rises by +14.39 pp compared to +3.73 pp for op en-ended; Haiku 4.5 sho ws the same pattern (MC +7.91 pp, op en-ended +2.48 pp). A skill that nar- ro ws the solution procedure is most useful when the answer space is already constrained to a few options. F or op en-ended generation, where the mo del must pro duce a free-form numerical or textual resp onse, the guidance is less precise and the ro om for regression is higher. R e gr essions r eve al the c omplementary risk. In sampled cases, the most common pattern is skill-induced ov er-reasoning: the mo del elab orates beyond what the question requires and revises an already-correct judgment. Because the selector loads all skill ﬁles for a subﬁeld as a bundle, ev ery question receives guidance re- gardless of whether it needs it — questions the baseline already handles correctly can b e destabilized by unnecessary proce dural elab oration. 5.2 Qualitativ e Analysis T o illustrate how skill artifacts op erate at inference time, Figure 3 presents an arithmetic case from the Haiku 3.5 warm-up ev aluation where the baseline fails and skill injection succeeds. The same skill ﬁle was credited with 7 additional ﬁxes on related ﬁxed income questions, illustrating the reusabilit y of subﬁeld- sp eciﬁc skill artifacts. 5.3 Where Do es the Impro vemen t Come F rom? Self-T eaching Ablation ASD A’s gains could come from t wo sources: knowledge contributed by a superior teac her mo del, or knowledge drawn from the training data itself. T o separate these, we run a self-teac hing conﬁguration where the student mo del acts as its o wn teacher—it analyzes its own failures, builds the skill library , and then uses those skills at inference. No stronger mo del is in volv ed. F or Haiku 3.5 arithmetic, self-teac hing ac hieves +6.33 pp—73% of the +8.67 pp gain from using a Sonnet 4.5 teac her. This shows that most of the b eneﬁt comes ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 13 FAM MA · AR IT HM ET IC · FI XE D IN CO ME What is the price of a 2-year maturity bond with a 5% coupon rate paid annually? (Par = $1,000) Forward rates: Y ear 0 = 3%, Y ear 1 = 4%, Y ear 2 = 5%, Y ear 3 = 6% A. $1,092.97 B. $1,054.24 C. $1,028.51 ✓ D. $1,073.34 E. None BA SE LIN E (NO S KIL LS ) forward_rates = [ 0.03 , 0.04 , 0.05 , 0.06 ] par_value = 1000 coupon_rate = 0.05 ; maturity = 2 def discount_cashflow(cf, yrs, rate): return cf / (( 1 + rate) ** yrs) coupon = par_value * coupon_rate bond_price = ( discount_cashflow(coupon, 1 , forward_rates[ 0 ]) + discount_cashflow( coupon + par_value, 2 , forward_rates[ 1 ] ) ) # Year 2: 1050/(1.04)² = 970.78 # Should be: 1050/(1.03×1.04) bond_price $1,019.33 → Option E ✗ WI TH A SD A SK ILL S par_value = 1000 coupon_rate = 0.05 ; maturity = 2 annual_coupon = par_value * coupon_rate forward_rates = [ 0.03 , 0.04 ] price = 0 cumulative_discount = 1.0 for year in range( 1 , maturity + 1 ): cumulative_discount *= ( 1 + forward_rates[year - 1 ]) if year == maturity: cf = annual_coupon + par_value else : cf = annual_coupon price += cf / cumulative_discount price $1,028.75 → Option C ✓ Error: The baseline discounts Y ear 2 using (1.04)² = 1.0816, treating the Y ear 1 forward rate as a spot rate. The correct denominator chains all forward rates: (1.03)(1.04) = 1.0712. The skill file fixed_income/wrong_method_selection.md (Fig. 2) directly addresses this in its procedure and "Common Bugs to Avoid." Fig. 3. Baseline vs. skill-augmented output on a F AMMA ﬁxed income question (Haiku 3.5, warm-up). The skill ﬁle in Fig. 2 provides the domain-sp eciﬁc pro cedure that corrects the baseline error. The same skill ﬁle was credited with 7 additional ﬁxes on related ﬁxed income questions. from the training data, not from the teacher’s sup erior knowledge. It is worth noting that the training questions provide only the question text and the correct answ er—not work ed solutions or step-by-step reasoning. Even so, seeing where it consisten tly go es wrong is enough for the mo del to iden tify its recurring failure patterns and syn thesize skills to address them. In other w ords, the mo del al- ready p ossesses m uch of the relev an t domain knowledge; the distillation pro cess giv es it the structure to apply that kno wledge reliably . The remaining 2.34 pp gap reﬂects the teacher’s contribution—a stronger mo del pro duces sharper fail- ure diagnoses and more precise skill form ulations. The same pattern holds for non-arithmetic (+1.58 pp self-teaching vs. +2.78 pp with a Sonnet teacher). 5.4 Are Skills Student-Speciﬁc? Cross-T ransfer Exp erimen t The self-teaching results raise a natural follow-on question: once skills are dis- tilled for one mo del, can they b eneﬁt another? W e apply skills generated from Haiku 3.5’s failures to Haiku 4.5 (a stronger model in the same family) and compare against skills deriv ed from Haiku 4.5’s own failures. Cross-mo del transfer pro duces a net regression of − 2.33 pp, driven by a sharp op en-ended decline ( − 6.21 pp) that ov erwhelms mo dest MC gains (+2.16 pp). In 14 T.Y. Yim et al. T able 4. Skill p ortability: Haiku 4.5 arithmetic (300 ev al questions). Own skills are generated from Haiku 4.5’s failures; cross-transfer uses skills generated from Haiku 3.5’s failures. Skills Source Baseline With Skills ∆ MC ∆ Op en ∆ Own skills (H4.5) 64.67 69.67 +5.00 +7.91 +2.48 H3.5 skills (cross-transfer) 64.67 62.33 − 2.33 +2.16 − 6.21 con trast, Haiku 4.5 with its own ASDA skills gains a consisten t +5.00 pp across b oth question types. The practical implication is direct: skills should be generated for eac h de- plo yed model indep endently . Reusing skills across mo del generations is unlik ely to yield p ositiv e returns and can actively harm the stronger mo del. At approx- imately $13 and ∼ 6 hours of w all-clock time per our model conﬁgurations, 9 studen t-sp eciﬁc distillation is op erationally feasible at deplo yment scale. 6 Discussion and Conclusion Discussion. T wo experimental ﬁndings illuminate the underlying mec hanism of ASD A. The self-teaching result—where a mo del acting as its o wn teac her reco vers 73% of the full arithmetic gain—shows that the primary source of im- pro vemen t is not the teac her’s sup erior knowledge but the structure imposed by the distillation pro cess itself. Systematically en umerating failure patterns across a training set forces the mo del to externalize domain knowledge it already im- plicitly holds but cannot reliably apply during single-pass inference. The cross- transfer failure reinforces this view from the opposite direction: skills are not generic domain knowledge that transfers across mo dels, but artifacts of a sp e- ciﬁc mo del’s failure distribution. Applying one mo del’s skills to a stronger mo del activ ely constrains it b y the w eaker mo del’s blind sp ots. T ogether, these results c haracterize skills as mo del-sp e ciﬁc failur e r eme dies —a distinction with direct consequences for how skill libraries should b e managed in practice. F or organi- zations in regulated industries—ﬁnancial services, legal, and healthcare—that rely on commercial LLMs via blac k-b o x API for kno wledge-intensiv e workﬂo ws, ASD A oﬀers a concrete operational path: run the distillation pip eline once on a lab eled in-domain dataset, version-con trol the resulting skill ﬁles alongside ap- plication co de, and regenerate them when the base mo del is upgraded. The skill library then functions as an auditable, insp ectable kno wledge la yer that domain exp erts can review, compliance teams can certify , and engineering teams can up date without touching model weigh ts. 9 W arm-up pip eline cost for the arithmetic conﬁguration (Haiku 3.5 student, Son- net 4.5 teacher), including baseline generation, failure analysis, skill synthesis, and ev aluation with skills: appro ximately $13 across ∼ 10M tokens in ∼ 6 hours wall-clock time. Costs computed at March 2026 API pricing. ASD A: Automated Skill Distillation and Adaptation for Financial Reasoning 15 ASD A’s gains are largest when failure patterns are clustered and the task has w ell-deﬁned pro cedural structure, as with arithmetic reasoning via Program-of- Though t. When errors are more disp ersed—as in non-arithmetic tasks—the sig- nal av ailable for distillation is weak er and regression risk increases. This b ound- ary condition suggests that diagnostic quality (ho w cleanly errors cluster by t yp e and subﬁeld) predicts adaptation success alongside ra w mo del capability . Limitations. Results are conﬁned to F AMMA and the Claude mo del family , leav- ing op en ho w the error taxonomy , skill format, and reﬁnement dynamics trans- fer to other domains. F AMMA’s OCR-extracted text also introduces a corpus- sp eciﬁc confound: skills distilled from questions with gen uine OCR artifacts can enco de data-correction heuristics that misﬁre on correctly parsed test questions, inﬂating regression coun ts in wa ys that may not app ear on cleaner datasets. F utur e W ork. Cr oss-domain tr ansfer. Legal and tax reasoning are the most nat- ural next targets, structured, pro cedural, and with explicit auditability require- men ts that align with the skill library’s insp ectable format. Skil l c ompr ession. The current pip eline generates 10–30 skill ﬁles p er con- ﬁguration. A pruning step that measures per-skill regression rates and merges narro wly scoped ﬁles could reduce regressions while sharpening routing precision. Conclusion. The central ﬁnding of this work is that failure-driven distillation can externalize laten t domain knowledge into an explicit, inspectable form that single-pass inference cannot access without modifying model w eights. Our ab- lations further rev eal that the resulting skills function as mo del-sp e ciﬁc failur e r eme dies : artifacts of a particular mo del’s failure distribution that cannot b e transferred across mo del generations, but can be cheaply regenerated for an y deplo yed model giv en only a lab eled domain dataset—at a one-time cost of ap- pro ximately $13 and six hours of wall-clock time. The skill library is not a b etter prompt: it is a new represen tational la yer b et w een the raw model and its deplo y- men t con text, one that can b e version-con trolled, audited, and regenerated for successor mo dels. Whether this lay er generalizes to other kno wledge-intensiv e domains, and ho w far self-teaching can substitute for stronger sup ervision, re- main the cen tral op en questions. Generativ e AI Disclosure The authors used generativ e AI tools to assist with man uscript editing and exp erimen tal pip eline dev elopment. All scientiﬁc conten t, in- cluding the hypotheses, exp erimen tal design, and conclusions, is entirely the work of the human authors. Disclosure of In terests. The authors hav e no comp eting in terests to declare that are relev ant to the conten t of this article. References 1. Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kaman uru, V., Rainton, J., W u, C., Ji, M., Li, H., Thakker, U., Zou, J., Olukotun, K.: Agentic Context En- gineering: Evolving Contexts for Self-Improving Language Mo dels. arXiv preprint arXiv:2510.04618 (2025) 16 T.Y. Yim et al. 2. Agra wal, L .A., T an, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ry an, M.J., Jiang, M., P otts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., Khattab, O.: GEP A: Reﬂective prompt evolution can outp erform reinforcemen t learning. arXiv preprint arXiv:2507.19457 (2025) 3. P an, Z., et al.: LEMMA: Learning from errors for mathematical adv ancemen t in LLMs. In: Findings of ACL 2025. arXiv preprint arXiv:2503.17439 (2025) 4. An thropic: Agen t Skills: An open standard for executable agent knowledge. https: //agentskills.io (2025) 5. Deng, S., Peng, H., Xu, J., Mao, R., Giurcˇ anean u, C.D., Liu, J.: FinMR: A Kno wledge-Intensiv e Multimo dal Benchmark for Adv anced Financial Reasoning. arXiv preprint arXiv:2510.07852 (2025) 6. Xie, Q., et al.: FinBen: A holistic ﬁnancial benchmark for large language models. In: NeurIPS 2024. arXiv preprint arXiv:2402.12659 (2024) 7. Khattab, O., Potts, C., Zaharia, M.: DSPy: Compiling declarative language mo del calls into self-impro ving pipelines. arXiv preprint arXiv:2310.03714 (2023) 8. Luo, J., et al.: FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Ev aluation. arXiv preprint arXiv:2505.24714 (2025) 9. Mateega, S., Georgescu, C., T ang, D.: FinanceQA: A Benchmark for Ev aluat- ing Financial Analysis Capabilities of Large Language Models. arXiv preprint arXiv:2501.18062 (2025) 10. W ang, G., Xie, Y., Jiang, Y., Mandlek ar, A., Xiao, C., Zh u, Y., F an, L., Anand- kumar, A.: V o yager: An op en-ended embo died agent with large language models. arXiv preprint arXiv:2305.16291 (2023) 11. W u, S., Irso y , O., Lu, S., Dab er, V., Dredze, M., Gehrmann, S., Gupta, P ., Ishrat, S., Jha, A., Johnston, S., et al.: Blo ombergGPT: A large language mo del for ﬁ- nance. arXiv preprint arXiv:2303.17564 (2023) 12. Xia, P ., et al.: SkillRL: Evolving agen ts via recursive skill-augmented reinforcemen t learning. arXiv preprint arXiv:2602.08234 (2026) 13. Xue, Z., et al.: F AMMA: A b enc hmark for ﬁnancial domain multilingual m ulti- mo dal question answering. arXiv preprint arXiv:2410.04526 (2024) 14. Y ang, H., Liu, X., W ang, C.: FinGPT: Op en-source ﬁnancial large language models. arXiv preprint arXiv:2306.06031 (2023) 15. Yüksekgon ül, E., Bianchi, F., Boen, J., Liu, T., Zou, J.: T extGrad: Automatic “diﬀeren tiation” via text. arXiv preprint arXiv:2406.07496 (2024) 16. Zhao, Y., Liu, H., Long, Y., Zhang, R., Zhao, C., Cohan, A.: Finance- Math: Knowledge-In tensiv e Math Reasoning in Finance Domains. arXiv preprin t arXiv:2311.09797 (2024)

ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment