
Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification

Kemal Kirtac
University College London, London, United Kingdom
kemal.kirtac@ucl.ac.uk

Abstract

This paper studies whether a lightweight trained aggregator can combine diverse zero-shot large language model judgments into a stronger downstream signal for corporate disclosure classification. Zero-shot LLMs can read disclosures without task-specific fine-tuning, but their predictions often vary across prompts, reasoning styles, and model families. I address this problem with a multi-agent framework in which three zero-shot agents independently read each disclosure and output a sentiment label, a confidence score, and a short rationale. A logistic meta-classifier then aggregates these signals to predict next-day stock return direction. I use a sample of 18,420 U.S. corporate disclosures issued by Nasdaq and S&P 500 firms between 2018 and 2024, matched to next-day stock returns. Results show that the trained aggregator outperforms all single agents, majority vote, confidence-weighted voting, and a FinBERT baseline. Balanced accuracy rises from 0.561 for the best single agent to 0.612 for the trained aggregator, with the largest gains in disclosures combining strong current performance with weak guidance or elevated risk. The results suggest that zero-shot LLM agents capture complementary financial signals and that supervised aggregation can turn cross-agent disagreement into a more useful classification target.

1 Introduction

Textual information plays an increasingly central role in empirical finance. Prior work shows that news coverage, corporate disclosures, and investor communication contain predictive information about asset prices and firm performance (Tetlock, 2007; Tetlock et al., 2008; Price et al., 2015; Huang et al., 2014; Li, 2008).
Studies of regulatory filings, earnings calls, and online platforms further show that sentiment and attention extracted from text shape trading behavior and return dynamics (Loughran and McDonald, 2011; Da et al., 2011; Chen et al., 2014). This literature establishes that financial text contains economically meaningful signals for asset pricing. Recent work extends this literature with transformer models and large language models, which capture contextual and semantic information that dictionary methods often miss (Devlin et al., 2019; Liu et al., 2019; Zhang et al., 2022; Huang et al., 2023; Kirtac and Germano, 2024). More broadly, recent sentiment-analysis work shows that zero-shot and few-shot LLM methods can be attractive when domain-specific annotation is expensive or difficult to maintain (Wang and Luo, 2023; Bai et al., 2024; Hellwig et al., 2025). Yet zero-shot LLM outputs are often unstable across prompts and reasoning paths. This is especially problematic for corporate disclosures, where language is often mixed rather than uniformly positive or negative. A disclosure can report strong realized earnings while lowering forward guidance, or announce revenue growth while revealing litigation, regulatory, or liquidity risk. One prompt may overweight current performance while another overweights risk language, so a single zero-shot judgment may reflect prompt emphasis rather than the full balance of the disclosure.

This creates a natural question for both NLP and finance: can disagreement across zero-shot agents become a feature rather than a weakness? Multi-agent LLM systems are often motivated by diversity of reasoning, and recent work suggests that additional agents and aggregation can improve on individual outputs (Li et al., 2024; Fei et al., 2023).
Corporate disclosures provide a useful test bed because they contain competing signals, domain-specific vocabulary, and short-horizon consequences that can be evaluated against downstream market outcomes. They are therefore well suited for studying whether aggregation across multiple zero-shot views yields a better classification signal than any individual agent or a simple voting rule.

I address that question with a multi-agent zero-shot framework for corporate disclosure classification. Three zero-shot agents read the same disclosure from different financial perspectives and return a sentiment label, a confidence score, and a short rationale. I then train a lightweight meta-classifier that aggregates the joint outputs into a prediction of next-day stock return direction. I do not fine-tune the base LLMs and instead train only the aggregation layer. This design keeps the system simple, data-efficient, and aligned with recent zero-shot multi-agent sentiment work (Fei et al., 2023; Wang and Luo, 2023; Bai et al., 2024). It also matches practical financial settings in which large amounts of text are available but high-quality sentiment labels are scarce or costly to construct (Pustejovsky and Stubbs, 2013; Geva et al., 2019; Paullada et al., 2021).

This paper makes three contributions. First, I introduce a compact multi-agent zero-shot setup for corporate disclosure classification. Second, I test whether a trained aggregator improves out-of-sample return-direction classification relative to single-agent and voting baselines. Third, I show where aggregation helps most through a short analysis of mixed-signal disclosures.

2 Method

I use three zero-shot agents. Each agent receives the same disclosure text, but each prompt emphasizes a different financial lens. The first agent focuses on realized operating performance such as earnings, revenue, margins, and costs.
The second focuses on forward guidance, management outlook, and expectation revisions. The third focuses on uncertainty, litigation, regulation, liquidity, and downside risk. This design reflects the structure of real disclosures. A firm may beat current earnings while lowering guidance, or report revenue growth while disclosing legal or operational problems. Financial texts therefore benefit from multiple specialized readings even when all agents observe the same input. This prompt diversification is related to recent chain-of-thought sentiment setups that vary the ordering and structure of reasoning across agents rather than assuming one fixed path (Fei et al., 2023; Wang and Luo, 2023).

I use the following fixed prompts for all disclosures, where a placeholder at the end of each prompt is replaced with the full disclosure text after preprocessing.

Performance agent prompt. "Read the corporate disclosure below. Focus on realized operating performance, including earnings, revenue, margins, costs, and reported business outcomes. Decide whether the disclosure is positive, neutral, or negative for next-day stock reaction. Output exactly three fields in JSON format: {"label": ..., "rationale": ..., "confidence": ...}. The rationale must be one sentence and confidence must be a number between 0 and 1. Disclosure: "

Guidance agent prompt. "Read the corporate disclosure below. Focus on forward guidance, management outlook, demand expectations, and any revisions to future expectations. Decide whether the disclosure is positive, neutral, or negative for next-day stock reaction. Output exactly three fields in JSON format: {"label": ..., "rationale": ..., "confidence": ...}. The rationale must be one sentence and confidence must be a number between 0 and 1. Disclosure: "

Risk agent prompt. "Read the corporate disclosure below. Focus on uncertainty, litigation, regulation, liquidity, operational disruption, and downside risk.
Decide whether the disclosure is positive, neutral, or negative for next-day stock reaction. Output exactly three fields in JSON format: {"label": ..., "rationale": ..., "confidence": ...}. The rationale must be one sentence and confidence must be a number between 0 and 1. Disclosure: "

For each disclosure x, agent k outputs a tuple a_k(x) = (l_k, c_k, r_k), where l_k ∈ {−1, 0, +1} is the sentiment label, c_k ∈ [0, 1] is the confidence score, and r_k is a short rationale. I use a model pool consisting of Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct, and Qwen2.5-72B-Instruct, although the framework is model-agnostic and could incorporate other accessible open or API-served models. Labels are mapped as positive = +1, neutral = 0, and negative = −1. For downstream evaluation against next-day return direction, I map positive to 1 and both neutral and negative to 0. I use the same binary mapping for single-agent outputs, majority vote, confidence-weighted voting, and the trained aggregator. This reliance on multiple prompted agents follows recent work showing that different agent views can recover complementary information in label-scarce settings (Li et al., 2024; Bai et al., 2024).

I extract confidence in two stages. First, each model is run with deterministic decoding, temperature = 0, top-p = 1, and a fixed random seed. Second, confidence is defined as the model-implied probability of the generated label token sequence. For models that expose token log probabilities, I compute

    c_k = exp( (1/m) Σ_{j=1}^{m} log p(t_j | t_{<j}) ),

the geometric mean probability of the m generated label tokens; for models that do not expose log probabilities, the constrained self-reported confidence field is used instead. The confidence-weighted voting baseline predicts 1 if Σ_k c_k l_k > 0 and 0 otherwise. This baseline uses the same label mapping and confidence estimates as the trained aggregator, but without learning weights from data.

I define the downstream target as next-day stock return direction, y_t = 1(r_{t+1} > 0), where r_{t+1} is the stock return on the trading day following the disclosure release. I therefore do not claim to recover a ground-truth human sentiment label.
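As a concrete sketch (function and variable names are mine, not from the paper's code), the geometric-mean confidence estimate and the confidence-weighted voting baseline described above can be written as:

```python
import math

# Map agent sentiment labels to the paper's encoding:
# positive -> +1, neutral -> 0, negative -> -1.
LABEL_TO_INT = {"positive": 1, "neutral": 0, "negative": -1}

def sequence_confidence(token_logprobs):
    """Geometric-mean probability of the generated label tokens:
    c_k = exp((1/m) * sum_j log p(t_j | t_<j))."""
    m = len(token_logprobs)
    return math.exp(sum(token_logprobs) / m)

def confidence_weighted_vote(labels, confidences):
    """Predict 1 (positive next-day reaction) if the confidence-weighted
    label sum is positive, else 0; no weights are learned from data."""
    score = sum(c * LABEL_TO_INT[l] for l, c in zip(labels, confidences))
    return 1 if score > 0 else 0
```

Note that a single highly confident dissenting agent can flip the weighted vote against a 2–1 label majority, which is the kind of structure the trained aggregator can also exploit.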
Instead, I study whether the joint outputs of zero-shot agents help classify downstream market response more effectively than any individual agent or a naive aggregation rule. This framing matters because a disclosure can be positive on one dimension and negative on another, while the market response reflects their net effect.

I compare the trained aggregator against four baselines. I first evaluate each single agent on its own. I then compare against majority vote and confidence-weighted voting. I also include a finance-specific FinBERT baseline (Huang et al., 2023). I report accuracy, macro F1, and balanced accuracy. These metrics are sufficient for a short paper and provide a balanced view under mild class imbalance.

3 Experimental Setup

I use a sample of 18,420 U.S. corporate disclosures issued by Nasdaq and S&P 500 firms between 2018 and 2024, matched to next-day stock returns. I split the sample chronologically into 60% training, 20% development, and 20% test. The out-of-sample test set contains 3,684 disclosures and exhibits mild class imbalance, with 53.1% positive next-day returns. When multiple disclosures occur for the same firm on the same day, I keep them as separate observations and assign each the same next-day return label.

Each disclosure passes through the three zero-shot agents. The aggregator is trained on historical agent outputs only. The core question is whether multi-agent zero-shot aggregation improves downstream classification. I preprocess disclosures by lowercasing ticker symbols only when they appear as metadata, removing duplicated header lines, normalizing whitespace, and truncating each input to the maximum token budget supported by the smallest model in the agent pool. I do not remove numbers, dates, or financial terms because they may carry predictive information.

I keep the three prompts fixed across the full sample, and I generate all agent outputs before training the aggregator.
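The chronological 60/20/20 split can be sketched as follows (assuming records are already sorted by disclosure date; this is my illustration, not the paper's released code):

```python
def chronological_split(records, train_frac=0.6, dev_frac=0.2):
    """Split date-sorted records into train/dev/test without shuffling,
    so the aggregator only ever learns from past disclosures."""
    n = len(records)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    train = records[:n_train]
    dev = records[n_train:n_train + n_dev]
    test = records[n_train + n_dev:]
    return train, dev, test
```

With 18,420 disclosures this yields 11,052 / 3,684 / 3,684 observations, matching the reported test-set size.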
The logistic meta-classifier is tuned on the development split and then evaluated once on the held-out test period. That ordering matters because a realistic deployment setting would require the aggregation rule to be learned only from past disclosures. A chronological split therefore matches the financial application, avoids information leakage, and aligns with broader concerns in LLM evaluation about contamination, reproducibility, and overly optimistic assessments when experimental boundaries are not carefully controlled (Samuel et al., 2025).

I implement the inference pipeline as follows. I store the exact disclosure IDs, raw text, preprocessing script, prompts, model names, decoding parameters, random seed, raw JSON outputs, parsed labels, parsed rationales, token log probabilities when available, final confidence scores, and downstream binary labels for every agent and every disclosure. I cache all model outputs before training the aggregator. I then build the aggregation feature matrix from cached agent outputs only, standardize continuous aggregation features using training-split statistics, fit L2-regularized logistic regression with the liblinear solver on the training split, tune the inverse regularization strength on the development split, and evaluate once on the test split. This pipeline is fully reproducible because every stage from preprocessing to aggregation is deterministic conditional on the stored model version, prompt text, and seed.

4 Results

Table 1 reports out-of-sample results. The trained aggregator achieves the strongest performance across all three metrics. It improves balanced accuracy from 0.561 for the best single agent to 0.612. It also outperforms majority vote and confidence-weighted voting. Confidence-weighted voting performs better than simple majority vote, which suggests that agent confidence carries useful information.
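The aggregation stage described in Section 3 can be sketched with scikit-learn. The feature-matrix layout and the tuning grid below are my assumptions, but the training-split standardization, liblinear solver, and development-split tuning of the inverse regularization strength follow the paper's description:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import StandardScaler

def fit_aggregator(X_train, y_train, X_dev, y_dev, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Standardize features with training-split statistics, fit
    L2-regularized logistic regression (liblinear), and tune the
    inverse regularization strength C on the development split."""
    scaler = StandardScaler().fit(X_train)
    best_model, best_score = None, -1.0
    for C in Cs:
        model = LogisticRegression(penalty="l2", C=C, solver="liblinear")
        model.fit(scaler.transform(X_train), y_train)
        score = balanced_accuracy_score(
            y_dev, model.predict(scaler.transform(X_dev)))
        if score > best_score:
            best_model, best_score = model, score
    return scaler, best_model
```

The feature matrix would hold, per disclosure, the three agent labels, the three confidence scores, and any agreement features derived from them.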
The trained aggregator performs best because it learns when to trust or discount specific agents under disagreement rather than applying a fixed rule. This is consistent with recent zero-shot multi-agent sentiment findings that confidence-based aggregation can outperform simple list voting when agents capture different parts of the signal (Fei et al., 2023; Wang and Luo, 2023).

Method             Acc.   Macro F1  Bal. Acc.
FinBERT            0.556  0.548     0.544
Performance agent  0.574  0.566     0.558
Guidance agent     0.579  0.571     0.561
Risk agent         0.563  0.554     0.550
Majority vote      0.586  0.579     0.573
Conf. vote         0.597  0.589     0.584
Aggregator         0.624  0.617     0.612

Table 1: Out-of-sample classification results for next-day return direction.

Case               Maj. vote  Aggregator
3-agent agreement  0.642      0.651
2–1 split          0.558      0.603
High conflict      0.517      0.581

Table 2: Balanced accuracy by agent agreement regime.

Several patterns stand out. The guidance agent is the strongest single agent, consistent with the idea that forward-looking language often matters more than backward-looking performance for short-horizon market reaction. The performance agent follows closely, while the risk agent performs worst in isolation. That ordering is intuitive. Risk prompts often capture genuinely negative language, but they may also overreact to routine cautionary phrasing that appears in many disclosures. Majority vote improves on all single agents because it suppresses idiosyncratic prompt errors. Confidence-weighted voting improves further because it partially accounts for how strongly each agent supports its own judgment. The learned aggregator improves most because it combines labels, confidence, and agreement structure.

I then examine when aggregation helps most. I break the test set into three agreement regimes: cases in which all three agents agree, cases with a 2–1 split, and high-conflict cases in which the agents produce competing labels and no single confidence score clearly dominates.
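These regimes can be bucketed directly from the parsed agent outputs. The paper does not state its exact high-conflict criterion, so the confidence-margin rule below is an assumption for illustration:

```python
def agreement_regime(labels, confidences, margin=0.1):
    """Bucket a disclosure by cross-agent agreement: 'agree' when all
    three labels match; 'high_conflict' when labels compete and the top
    confidence does not clearly dominate (or all three labels differ);
    otherwise a 2-1 'split'."""
    if len(set(labels)) == 1:
        return "agree"
    top, second = sorted(confidences, reverse=True)[:2]
    if len(set(labels)) == 3 or top - second < margin:
        return "high_conflict"
    return "split"
```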
Table 2 shows that the trained aggregator adds only a modest gain when all three agents agree, but it helps substantially more when the disclosure contains conflicting signals. This pattern supports the central intuition of the paper. Multi-agent diversity matters most when the text itself is difficult.

The difference between the agreement regimes is conceptually important. Cases with full agreement are easier because the disclosure language is more one-sided. Cases with 2–1 splits or stronger conflicts are harder because they often contain countervailing cues. A beat paired with weak guidance, a strong headline paired with a legal disclosure, or an operating improvement paired with soft demand language can all produce disagreement for valid reasons. The aggregator helps because it treats disagreement as structured information. A 2–1 split in which the dissenting agent is highly confident may differ sharply from one in which the dissenting agent is uncertain. Majority vote cannot represent that distinction, but the trained model can.

Pattern                      Vote     Agg.
Beat, weak guidance          Wrong    Correct
Growth, legal risk           Wrong    Correct
Better margins, soft demand  Correct  Correct
Routine filing, weak signal  Wrong    Wrong

Table 3: Qualitative examples.

I also examine a small error taxonomy. The largest gains appear in disclosures that combine positive current performance with negative forward guidance and in disclosures that mix numerical strength with legal or regulatory risk. Single agents often overemphasize the part of the text highlighted by their own prompt. The performance agent tends to overweight strong realized results, while the risk agent tends to overweight downside language. The trained aggregator reduces this bias by learning from the full joint pattern of labels and confidences. The last row in Table 3 is also useful. Some disclosures may simply lack a clear short-horizon signal. Aggregation cannot solve every case.
A routine filing with weak informational content may remain hard regardless of prompt diversity. That failure mode matters because it distinguishes ambiguous texts from cases in which aggregation successfully resolves structured disagreement.

These results support three conclusions. First, zero-shot agents capture complementary dimensions of corporate disclosure language. The performance-focused, guidance-focused, and risk-focused prompts do not fail on the same examples. Second, naive aggregation already helps. Majority and confidence-weighted voting both improve on most single agents. Third, a trained meta-classifier helps more because it learns how disagreement patterns map to the downstream market response. The gains are especially large in 2–1 split and high-conflict cases, which suggests that disagreement is informative rather than merely noisy. More broadly, this is consistent with the argument that multi-agent systems can improve performance when distinct views are combined rather than collapsed too early (Li et al., 2024).

5 Conclusion

This paper presents a multi-agent zero-shot framework for corporate disclosure classification. Three LLM agents produce sentiment labels, confidences, and rationales from different financial perspectives, and a trained logistic aggregator combines these outputs to predict next-day stock return direction. Results show that learned aggregation outperforms single-agent baselines and simple voting rules, especially for mixed-signal disclosures.

The paper's main contribution lies in treating cross-agent disagreement as an informative signal rather than a nuisance. A lightweight trained aggregator can turn diverse zero-shot financial judgments into a stronger downstream classification signal without fine-tuning the base LLMs. That design is compact, interpretable, and plausible for real financial text settings in which prompt diversity is easy to create but annotation is expensive.
The framework therefore offers a credible ACL-style path for studying financial text classification under zero-shot conditions.

6 Limitations

The current design uses next-day return direction as the only downstream target, even though market responses may reflect factors beyond textual sentiment. The agent prompts also impose a researcher-designed decomposition of financial signals, so part of the observed gain may depend on this prompt partition rather than agent diversity alone. Confidence is also only an approximation to model certainty, especially when some models expose token log probabilities and others require a constrained self-reported confidence field. Future work should test calibration, prompt sensitivity, abnormal returns, and source-specific robustness.

References

Yinhao Bai, Zhixin Han, Yuhua Zhao, Hang Gao, Zhuowei Zhang, Xunzhi Wang, and Mengting Hu. 2024. Is compound aspect-based sentiment analysis addressed by LLMs? In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics.

Hailiang Chen, Prabuddha De, Yu Hu, and Byoung-Hyoun Hwang. 2014. Wisdom of crowds: The value of stock opinions transmitted through social media. Review of Financial Studies, 27(5):1367–1403.

Zhi Da, Joseph Engelberg, and Pengjie Gao. 2011. In search of attention. Journal of Finance, 66(5):1461–1499.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. 2023. Reasoning implicit sentiment with chain-of-thought prompting.
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1171–1182, Toronto, Canada. Association for Computational Linguistics.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.

Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, and Christian Wolff. 2025. Do we still need human annotators? Prompting large language models for aspect sentiment quad prediction. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 153–172, Vienna, Austria. Association for Computational Linguistics.

Allen H. Huang, Hui Wang, and Yi Yang. 2023. FinBERT: A large language model for extracting information from financial text. Contemporary Accounting Research, 40(2):806–841.

Andrew H. Huang, Siew Hong Teoh, and Y. Zhang. 2014. Tone management. Review of Financial Studies, 27(3):1043–1083.

Kemal Kirtac and Guido Germano. 2024. Enhanced financial sentiment analysis and trading strategy development using large language models. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 1–10, Bangkok, Thailand. Association for Computational Linguistics.

Feng Li. 2008. Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2–3):221–247.

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. 2024. More agents is all you need. Transactions on Machine Learning Research.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint.

Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1):35–65.

Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. 2021. Data and its (dis)contents: A survey of dataset development and use in machine learning research. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 638–650. ACM.

S. Michael Price, James S. Doran, David R. Peterson, and Brian A. Bliss. 2015. Earnings conference calls and stock returns: The incremental informativeness of textual tone. Journal of Financial Economics, 115(3):415–430.

James Pustejovsky and Amber Stubbs. 2013. Natural Language Annotation for Machine Learning. O'Reilly Media, Sebastopol, CA.

Vinay Samuel, Yue Zhou, and Henry Peng Zou. 2025. Towards data contamination detection for modern large language models: Limitations, inconsistencies, and oracle challenges. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5058–5070, Abu Dhabi, UAE. Association for Computational Linguistics.

Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, and Anirudha Majumdar. 2024. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. arXiv preprint.

Paul C. Tetlock. 2007. Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3):1139–1168.

Paul C. Tetlock, Maytal Saar-Tsechansky, and Sofus Macskassy. 2008. More than words: Quantifying language to measure firms' fundamentals. Journal of Finance, 63(3):1437–1467.

Yajing Wang and Zongwei Luo. 2023.
Enhance multi-domain sentiment analysis of review texts through prompting strategies. CoRR, abs/2309.02045.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
