Multi-Agent Dialectical Refinement for Enhanced Argument Classification

Jakub Bąba [0009-0009-7000-4887] and Jarosław A. Chudziak [0000-0003-4534-8652]

Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland
{jakub.baba.stud,jaroslaw.chudziak}@pw.edu.pl

Abstract. Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.

Keywords: Artificial Intelligence · Natural Language Processing · Formal Argumentation · Argument Mining · Multi-Agent Systems · Large Language Models.
1 Introduction

Argument Mining - the automated extraction and identification of argumentative structures from text - is a crucial field for high-level semantic and reasoning tasks. It enables systems to move beyond surface-level text evaluation, such as grammar or spelling checking, toward deep logic analysis and automated writing evaluation.

Recent developments in Large Language Models (LLMs) have shifted the paradigms used in the field. Current state-of-the-art approaches primarily leverage fine-tuned generative models, achieving high accuracy on tasks such as component classification and relation extraction. However, the reliance on careful tuning of a supervised architecture remains challenging: it is computationally expensive, requires high-quality annotated corpora, and often results in rigid models that struggle to generalize to new domains [24]. Conversely, training-free LLMs offer a flexible and cost-effective alternative, but they currently fail to bridge the performance gap with supervised baselines. Standard prompting approaches often miss specific details in Argument Mining, resulting in errors, especially between structurally similar components. Moreover, attempts to reduce these errors via single-agent self-correction mechanisms often end up supporting previously made mistakes instead of correcting them [6]. This raises the central question: can LLM-based systems improve their performance in argument component classification by utilizing a multi-agent framework while avoiding the cost of supervised fine-tuning?

To address this question, we propose MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that formulates argument classification as a structured debate powered by Proponent, Opponent, and Judge agents. Through structured interaction, the model encourages evaluating competing classifications rather than self-refinement.
This dynamic enables MAD-ACC to capture and expose logical nuances that are often overlooked by single-pass models. We illustrate this capability through a case study in Section 3.3. We show that MAD-ACC reduces the performance gap between inference-only approaches and supervised models, outperforming all evaluated single-pass and reasoning-augmented baselines on the UKP Student Essays corpus [26]. Moreover, MAD-ACC improves model transparency: by revealing the intermediate arguments of the debate and the final reasoning behind each decision, the framework offers insight into the decision-making process, addressing a key limitation of black-box classifiers.

2 Related Work

The proposed MAD-ACC sits at the intersection of computational argumentation and agentic artificial intelligence. To contextualize our contribution, we survey the literature across the evolution of Argument Mining methodologies, the applications of Large Language Models to these tasks, and the emergence of Multi-Agent Systems for reasoning. We focus on the paradigm shifts from feature engineering to deep learning and generative inference. We also review MAS approaches, which motivate our idea of dialectical refinement.

2.1 Argument Mining Approaches

Argument Mining (AM) [16] is a research area within the field of Natural Language Processing, focused on extracting and identifying structured reasoning from unstructured text. The field includes several different subtasks, ranging from boundary identification and relation extraction to the classification of the elements. Foundational work in AM focused on providing annotation schemes and corpora that allowed for structuring and indexing retrieved annotations. To benchmark progress, the community established various domain-specific datasets.
Among these, the UKP Argument Annotated Essays corpus [26] has emerged as a widely adopted standard for analyzing argumentation in educational texts.

Methodologically, this area has experienced a notable shift. Early approaches relied primarily on manual feature engineering, typically combining Support Vector Machines (SVMs) with carefully designed lexical and structural features [26, 12]. As deep learning techniques matured, the state of the art shifted toward neural architectures [7, 20]. Transformer-based models such as BERT and RoBERTa have set new performance benchmarks [19]. Despite their strong accuracy, these supervised approaches remain limited by their dependence on large-scale annotated data and generalize poorly to out-of-domain text.

2.2 Large Language Models in Argumentation

The rise of generative models has resulted in another shift, moving from encoder-only architectures to generative Large Language Models (LLMs). This led to an exploration of training-free capabilities, where models such as GPT-4 [1] were evaluated on argument mining tasks. Comprehensive evaluations have shown the potential of LLMs [31, 5, 9], including in the field of argument mining [22, 2]. One research direction for enhancing LLM performance on argument mining is chain-of-thought (CoT) prompting, a promising training-free technique that improves reasoning capabilities on complex tasks [28]. However, the effectiveness of CoT varies significantly with model size and task characteristics, with recent findings suggesting diminishing gains for non-reasoning models [18].

In response to the limitations of prompting, recent state-of-the-art research has focused on applying supervised strategies to LLMs. Recent studies, including Cabessa et al.
(2025), demonstrated that fine-tuning LLMs achieves superior performance compared to earlier benchmarks [4]. Beyond argument mining, fine-tuning has proven promising for enhancing reasoning capabilities across various NLP tasks [21, 3]. However, while these fine-tuned approaches currently define the standard, they reintroduce the heavy reliance on high-quality annotated data, limiting their use in low-resource domains and languages where such annotations are unavailable.

2.3 LLM-based Multi-Agent Systems

Multi-Agent Systems (MAS), a recently emerging research direction [27], leverage structured roles and collaboration to enhance problem solving. By distributing tasks across specialized agents with distinct roles, personas, and specific contexts, MAS frameworks often offer improved problem solving across diverse fields [14, 30]. These systems have demonstrated particular promise in domains such as legal reasoning and Natural Language Processing [10, 23].

A subset of this field is Multi-Agent Debate (MAD), a concept of growing interest in Argument Mining and NLP. Distributing reasoning among specialized agents engaged in structured discussion enables models to refine their answers through debate and critique [29, 13]. Recent research has explored different MAD frameworks and configurations across diverse NLP tasks [17, 8, 11]. A key recent work [15] used a debate framework to evaluate implicit premises, outperforming both neural baselines and single-agent LLMs. This showed that agents can achieve better accuracy by discussing and refining their answers based on opposing opinions than by repeated generation.

3 Methodology

We propose MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework designed to resolve ambiguities in the classification task without reliance on annotated training data.
In our framework, we leverage dialog to adjudicate competing interpretations of structural relationships within arguments.

3.1 Task Formulation

We formalize Argument Component Classification (ACC) as a sequence labeling task. Let $D = \{t_1, t_2, \ldots, t_n\}$ be an argumentative document consisting of $n$ argument components. For each target component $t_i$, let $C_i$ denote its context window (e.g. the whole document or the surrounding paragraph). The objective is to establish a mapping function $\Phi : (t_i, C_i) \to y$ that assigns the correct label $y \in Y$, derived from the annotation scheme defined by Stab and Gurevych [25]:

$$Y = \{\text{MajorClaim}, \text{Claim}, \text{Premise}\}$$

where the classes are defined as follows:

– MajorClaim: The root node of the argument structure, representing the central thesis of the document.
– Claim: An intermediate node that receives support and functions as the topic for evidentiary statements.
– Premise: A leaf node that provides support (example, evidence, reason) to a Claim or another Premise.

3.2 The MAD-ACC Framework

We formalize the MAD-ACC framework as a Multi-Agent System (MAS) tuple $S = \langle \mathcal{A}, \mathcal{P}, \mathcal{T} \rangle$, where:

– $\mathcal{A} = \{\mathrm{Mgr}, \mathrm{Prop}, \mathrm{Opp}, \mathrm{Jud}\}$ is the set of Agents (Manager, Proponent, Opponent, Judge).
– $\mathcal{P}$ is the set of agent-specific system Prompts defining their roles.
– $\mathcal{T}$ is the shared state (Transcript) of the interaction.

The execution flow of the MAD-ACC framework is illustrated in Figure 1. It consists of three phases.

Fig. 1. The MAD-ACC system overview.

Probabilistic Initialization. To properly induce dialectical diversity between the debaters, the Manager agent acts as a probabilistic filter. Given input $x$, it estimates the probability distribution over the labels:

$$P(y \mid x) = \mathrm{Mgr}(x) \quad \text{for } y \in Y$$

Let $y_{\text{top1}}, y_{\text{top2}} \in Y$ be the labels with the highest probabilities.
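As a minimal sketch of the Manager step described above, the following Python snippet selects the two most probable labels from a label distribution. The function name `top_two_labels`, the tie-breaking rule, and the example probabilities are illustrative assumptions rather than the framework's actual implementation; in MAD-ACC the distribution would come from the Manager's LLM output, which is not modeled here.

```python
from typing import Dict, Tuple

LABELS = ("MajorClaim", "Claim", "Premise")  # label set Y from the task formulation


def top_two_labels(probs: Dict[str, float]) -> Tuple[str, str]:
    """Return (y_top1, y_top2): the two labels with the highest probability.

    `probs` stands in for the Manager's estimate P(y | x); how that estimate
    is obtained from the LLM is outside this sketch (assumption).
    """
    unknown = set(probs) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    # Missing labels count as probability 0.0; ties keep the order of LABELS
    # because Python's sort is stable (an arbitrary choice for this sketch).
    ranked = sorted(LABELS, key=lambda y: probs.get(y, 0.0), reverse=True)
    return ranked[0], ranked[1]


# Hypothetical distribution for a single component.
example = {"MajorClaim": 0.05, "Claim": 0.20, "Premise": 0.75}
print(top_two_labels(example))  # ('Premise', 'Claim')
```

The third label is simply dropped, which mirrors the idea of the Manager filtering out the least probable interpretation before the debate starts.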
To mitigate position and authority biases, the system randomly assigns these labels to the Proponent and Opponent, ensuring a fair debate for both Debaters.

Dialectical Interaction. The debate is modeled as a sequence of message turns $\mathcal{T} = [m_1, m_2, \ldots, m_k]$, where $k$ is the total number of turns. At each turn $i$, an active agent $a \in \{\mathrm{Prop}, \mathrm{Opp}\}$ generates a message defending its assigned label $y$, based on the input $x$ and the conversation history $\mathcal{T}$ [...]

[Fig. 2. Illustrative execution trace. Input: "[...] an apartment is more expensive. However, this is only partially true..." The Manager estimates P(Premise) = 0.75 and P(Claim) = 0.20. The Proponent is assigned Premise and argues: "It states a fact. Supports topic." The Opponent is assigned Claim and argues: "No, look at 'However'. It's a Counter-Claim." The Judge reads the transcript and returns the verdict: "Agent B correctly identifies the argument structure." Final label: Claim.]

The document examines the differences and trade-offs between living in university dormitories and apartments. The target sentence - "an apartment is more expensive" - illustrates a classic case of claim ambiguity. Semantically, the statement sounds like a factual observation about accommodation costs. As a result, standard models (and our Manager agent) frequently misclassify this component as a Premise, assuming it serves as supporting evidence rather than as a claim.

However, the debate exposes the true structural role of the sentence in the document. The Proponent initially argues for the Premise label, interpreting the sentence as a previously stated observation. On the other side of the debate, the Opponent identifies the structural function of the sentence as a core economic argument in its section of the text, directly supporting the text's main thesis. The Opponent shows that the subsequent statements refute this statement (e.g. "this is only partially true"), and as a result, it functions as a Claim.

The Judge agent, leveraging the hierarchical definitions, analyzes the debate transcript and adjudicates in favor of the Opponent. The verdict relies on the direction of support - since the component was the recipient of the logic behind the whole paragraph, it hierarchically functions as a Claim. This example highlights the core contribution of our work: by forcing agents to debate the structure of the text, defending even unpopular labels, MAD-ACC successfully finds the small details that other classification models miss.

4 Experiments

In this section, we present the evaluation process of the MAD-ACC framework. Our primary objective is to check whether the proposed dialectical interaction can efficiently increase the performance of training-free inference. We detail the benchmark dataset and its preparation, the selection of baselines, and the specific configuration of the multi-agent architecture used to validate our claims regarding accuracy and interpretability.

4.1 Dataset

For our analysis, we used the UKP Argument Annotated Essays v2 [26] corpus, a dataset containing 402 essays and 6089 statements. To ensure strict comparability with state-of-the-art supervised solutions, experiments were conducted on the exact test split (80 essays with 1266 argument statements) established in prior literature [4]. No portion of the training split was used for prompt calibration or manual tuning.

Prior to processing by the MAD-ACC framework, the annotated corpus was formatted to make it easier for the LLM agents to comprehend. For each instance, the full essay was provided, with argument components delimited by tags. The target component was marked as ... , while surrounding components were masked as generic ... , without leaking the ground truth labels.
4.2 Baselines

We evaluate performance using Macro F1, Weighted F1 and class-specific F1 scores for the MajorClaim, Claim and Premise types. To validate the effectiveness of the multi-agent framework, we compare MAD-ACC against three single-agent baselines:

1. Vanilla: Represents standard usage of LLMs. It utilizes the same model as the Manager agent (Gemini 2.5 Flash) with a direct classification prompt.
2. Chain-of-Thought (CoT): Utilizes standard Chain-of-Thought prompting with Gemini 2.5 Flash to assess whether internal reasoning is sufficient.
3. Smart Reasoning: Uses the more capable Gemini 2.5 Pro model with built-in reasoning and the exact same system definitions and rules as those given to the Judge agent. It is designed to simulate the Judge's decision-making process without the benefit of the debate content.

Additionally, we contextualize our results with state-of-the-art supervised approaches, specifically fine-tuned LLMs [4]. While these methods currently define the upper-bound baseline, we highlight that they act as "black-box" solutions with limited explainability, whereas our framework prioritizes transparency and the reasoning behind decisions.

4.3 Experimental Setup

For this study, we set the debate length to 2 rounds (four total turns), allowing each agent to present its initial argument and respond to the opponent's counterargument. This configuration reflects a trade-off between argumentative depth and efficiency: preliminary experimentation and manual inspection indicated that a single round often fails to expose structural disagreements, while longer debates tend to introduce repetitiveness without yielding additional classification benefits.

To mitigate the position bias of the Judge agent, we employed a randomized stance assignment strategy.
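The two-round schedule and the randomized stance assignment can be illustrated with a short Python sketch. The names `assign_stances` and `debate_schedule` are hypothetical, the Proponent-first turn order is an assumption, and no LLM calls are made; the snippet only shows how two candidate labels could be shuffled between the debaters and how 2 rounds yield four turns.

```python
import random
from typing import Dict, List, Tuple


def assign_stances(label_a: str, label_b: str, rng: random.Random) -> Dict[str, str]:
    """Randomly map two candidate labels onto the Proponent and Opponent roles."""
    labels = [label_a, label_b]
    rng.shuffle(labels)  # breaks any fixed link between label rank and debater role
    return {"Proponent": labels[0], "Opponent": labels[1]}


def debate_schedule(rounds: int = 2) -> List[Tuple[int, str]]:
    """List (round, speaker) pairs; each round has one turn per debater.

    Assumption: the Proponent speaks first in every round.
    """
    return [(r, who) for r in range(1, rounds + 1) for who in ("Proponent", "Opponent")]


rng = random.Random(0)  # seeded only to make this sketch reproducible
stances = assign_stances("Premise", "Claim", rng)
for round_no, speaker in debate_schedule():
    print(f"round {round_no}: {speaker} defends {stances[speaker]}")
```

Because the shuffle decides which debater receives the more probable label, neither role is systematically tied to the Manager's first choice.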
The Proponent and Opponent agents are randomly assigned to defend the first and second most probable labels, ensuring that the order of probabilities does not affect the final judgment. While the framework supports a confidence-based skip threshold, we ran the debate on all samples to rigorously evaluate the system's ability to resolve ambiguities in the corpus.

We used the Gemini 2.5 family of models. The Manager and the Judge agents used Gemini 2.5 Flash and Gemini 2.5 Pro, respectively; the Manager was designed to quickly filter out the least probable label, while the Judge required higher capacity to process the debate context. While the MAD-ACC framework is model-agnostic, we prioritized establishing a strong baseline with Gemini and leave the comparative analysis of other models, including open-source alternatives, for future work. For both agents, the temperature was set to 0.0 to favor deterministic outputs and consistent scoring. The Debaters used the Gemini 2.5 Flash model with a temperature of 0.7, selected to encourage diverse reasoning paths during label defense.

5 Results and Discussion

In this section, we present the empirical evaluation of the MAD-ACC framework on the UKP Student Essays corpus. We analyze the performance of MAD-ACC against selected single-agent baselines and contextualize the results with a supervised approach. Subsequently, we conduct a qualitative analysis of selected examples based on the debate transcripts, to show how our system corrects logical errors through adversarial reasoning and to highlight the interpretability benefits of our framework.

5.1 Performance Analysis

Table 1 summarizes the argument classification results on the UKP test set. The results show a clear performance hierarchy.
Context-free, general-purpose inference baselines (Baselines A and B) achieve Macro F1 scores of 78.5% and 79.2%, respectively. Single-agent reasoning (Baseline C) achieves a Macro F1 score of 84.9%, and MAD-ACC achieves the highest training-free performance of 85.7%.

Table 1. Comparison of classification performance on UKP Student Essays (all scores in %). Overall performance: Macro F1 and Weighted F1 (W-F1). Class-wise F1-score: MajorClaim (MC), Claim, Premise.

Method                          Macro F1  W-F1  MC    Claim  Premise
Inference-Only Baselines
Baseline A (Vanilla)            78.5      80.8  90.6  57.0   88.0
Baseline B (Chain-of-Thought)   79.2      81.2  91.4  58.5   87.8
Baseline C (Smart Reasoning)    84.9      86.1  92.2  72.5   90.1
MAD-ACC (Ours)                  85.7      87.0  92.0  74.5   90.7
Supervised Reference
Cabessa et al. (2025)           89.5      -     -     -      -

Chain-of-Thought achieves slightly better performance than vanilla prompting; however, both baselines hit a ceiling just under 80% Macro F1. Moreover, both struggle mostly with Claim components (Claim F1 scores of 57.0% and 58.5%). This supports the hypothesis that these models rely on surface-level semantics and often incorrectly label components that "sound like opinions" as Claims.

There is a major improvement between the first two baselines and the solutions equipped with base rules. Baseline C, equipped with reasoning and such knowledge, improves consistently, particularly in Macro F1 and, more importantly, in Claim F1, which rises by 14.0 to 15.5 percentage points (from 58.5% and 57.0% to 72.5%). This suggests that moving the focus to resolving semantic ambiguities based on the direction of support substantially improves performance.

Utilizing dialectical refinement in the pipeline resulted in MAD-ACC achieving the best results, beating Baseline C by 0.8 percentage points in Macro F1. Moreover, MAD-ACC outperformed this strong single-agent baseline by a further 2.0 percentage points in Claim F1 (74.5% vs. 72.5%), while keeping F1 scores for Premises and MajorClaims stable.
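As a quick consistency check on Table 1, the reported Macro F1 values can be reproduced as the unweighted mean of the three class-wise F1 scores, and the Claim F1 gap follows by subtraction. The Python sketch below copies its numbers from Table 1; the dictionary layout and variable names are illustrative only. Weighted F1 is not recomputed because the per-class supports are not listed in the table.

```python
# Class-wise F1 (MajorClaim, Claim, Premise) and reported Macro F1, copied from Table 1.
table1 = {
    "Baseline A (Vanilla)":          ((90.6, 57.0, 88.0), 78.5),
    "Baseline B (Chain-of-Thought)": ((91.4, 58.5, 87.8), 79.2),
    "Baseline C (Smart Reasoning)":  ((92.2, 72.5, 90.1), 84.9),
    "MAD-ACC":                       ((92.0, 74.5, 90.7), 85.7),
}

for method, (class_f1, reported) in table1.items():
    macro = sum(class_f1) / len(class_f1)  # Macro F1 = unweighted mean over classes
    print(f"{method}: computed {macro:.1f}, reported {reported}")

# Claim F1 is the second entry of each class-wise tuple.
claim_f1 = {m: f1[1] for m, (f1, _) in table1.items()}
gain_over_c = claim_f1["MAD-ACC"] - claim_f1["Baseline C (Smart Reasoning)"]
print(f"Claim F1 gain of MAD-ACC over Baseline C: {gain_over_c:.1f} points")
```

All four rows agree with the reported Macro F1 to one decimal place, and the Claim F1 gain over Baseline C comes out at 2.0 percentage points.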
The stable Premise and MajorClaim scores indicate that MAD-ACC is not merely moving classifications from Premises to Claims, but is actively using debate to differentiate components more effectively.

5.2 Comparison with State-of-the-Art

Table 1 contextualizes our results against the supervised state of the art [4]. While fine-tuned LLMs currently define the upper-bound performance of 89.5% Macro F1, MAD-ACC reduces this gap with a competitive 85.7% without requiring any training or parameter updates.

This result highlights a trade-off between performance and data efficiency. The supervised SOTA model achieved better results, but its fine-tuning relied on approximately 80% of the corpus, whereas MAD-ACC operates in a training-free setting. Ultimately, fine-tuning remains optimal for cases where the cost of tuning is acceptable and annotated data is available. On the other hand, the MAD-ACC framework presents a compelling alternative for low-resource domains and cases where annotating large amounts of documents is impossible. We note that while MAD-ACC eliminates training costs, the multi-agent debate increases token consumption, and thus inference cost, compared to single-pass prompting. However, for low-resource domains, this trade-off is often preferable to data annotation costs.

5.3 Qualitative Analysis: Case Studies

To investigate the source of the MAD-ACC performance gain, we analyzed exemplary instances where the single-agent baseline (Baseline C) failed, but MAD-ACC labeled the component correctly by taking advantage of reasoning from the debate transcript.

Resolving Topic Sentence Ambiguity (Case 1). Single-pass models often confuse the main argument with the evidence supporting it, especially when the argument is descriptive. In Essay 335, the baseline incorrectly classified the component "connecting people by email is easy and fast" as a Premise, underestimating its role in the text.
However, the debate transcript shows that Agent B correctly identified it as one of the main arguments in the text, directly supporting the main thesis of the essay ("IT discoveries are likely to have more disadvantages than benefits and people should know how to use their developments properly"), and as a result, it was correctly classified as a Claim.

Hierarchical Distinction (Case 2). These models also sometimes struggle with abstraction and hierarchy. In Essay 169, the target sentence "Only by developing students, can we have a better academic field" was misclassified by the baseline as a MajorClaim. The debate process correctly found and analyzed the dependency chain, realizing that while the sentence was abstract, it was a pillar for the main thesis of the text ("professors should spend more time on preparing courses than research"), and as a result, it was a Claim.

Table 2. Qualitative comparison of Baseline C vs. MAD-ACC. The dialectical transcript allows the Judge to resolve structural ambiguity where the single agent fails. Each entry lists the case type, the text and context, and the dialectical resolution (MAD-ACC).

Case 1: Topic Sentence (Essay 335)
  Target: "connecting people by email is easy and fast"
  Predictions: Baseline: Premise; MAD-ACC: Claim
  Agent B argues (simplified): The target sentence is a Claim, as it is presented as an argument supporting another idea, that IT has benefits, which is a crucial component of the essay's overall thesis. At the same time, it acts as the core idea of the paragraph, stating one of the direct benefits.
  Verdict: The Judge accepts that the target is one of the main arguments presented in the paragraph and functions as a Claim.

Case 2: Hierarchy Resolution (Essay 169)
  Text: "Only by developing students, can we have a better academic field"
  Predictions: Baseline: MajorClaim; MAD-ACC: Claim
  Agent A argues (simplified): The target is a Main Argument and a Claim that supports the essay's overarching thesis, not the thesis itself. It provides the abstract principle explaining why professors should prioritize teaching. It connects directly to the MajorClaim by justifying it, functioning as a support rather than the root node.
  Verdict: The Judge recognizes the target as a Claim, a high-level argument supporting the thesis.

These examples demonstrate that the dialectical process forces the system to thoroughly evaluate the function of each component and decide whether it supports a neighbor (Premise), supports the thesis (Claim), or acts as the main idea (MajorClaim). As a result, it helps resolve the structural ambiguity that affects the baselines.

6 Future Work

The results presented in this work suggest several promising directions for future research. Firstly, the framework's cross-domain generalization capabilities should be investigated. Since MAD-ACC does not rely on domain-specific annotated data, it is a strong candidate for application in other domains, especially low-resource ones, including legal, political, or biomedical text mining. Future studies could evaluate the use of such a dialectical framework, especially in the role of an assistant that suggests initial annotations, with its reasoning helping to validate these labels.

A second area of study is the extension of the approach beyond component classification. The framework could be assessed on other tasks, such as Argument Relation Identification and Classification (ARI/ARC). By configuring agents to debate the existence and types of links (Support/Attack), subsequent research could move toward full argument structure parsing.
Finally, a critical direction for such agentic systems would be to integrate them into real-world educational technologies. Future work could deploy MAD-ACC within intelligent educational systems, where the framework's explainability and interpretability could be measured through pedagogical impact and user studies. Such studies could verify the value of the AI-generated dialectical annotations in the process of improving argumentative skills.

7 Conclusion

In this work, we presented MAD-ACC, a multi-agent framework utilizing dialectical refinement to improve performance on the Argument Component Classification task without relying on expensive fine-tuning on high-quality annotated data. By replacing static classification with a multi-agent debate of contradicting opinions, we addressed the limitations of single-agent LLMs, specifically their tendency to misjudge the structural function of an argument based on its semantic assertiveness.

Our experiments on the UKP Student Essays corpus demonstrate that MAD-ACC achieves a Macro F1 score of 85.7%, outperforming all the baselines without task-specific training. Notably, the framework effectively resolves the "Claim vs. Premise" ambiguity, providing a substantial improvement in the Claim F1 score. As demonstrated by the qualitative analysis in Section 5.3, the debate mechanism successfully corrects errors where single agents misclassify topic sentences or the hierarchy of the document, based on the direction of support. While the state of the art still holds a performance advantage (89.5%), our methodology provides a competitive, data-efficient alternative for low-resource domains.

Beyond quantitative performance, MAD-ACC adds a level of explainability that is largely absent in traditional classifiers.
The generated debate transcripts provide a transparent thought process behind each decision, shifting the system from a black-box model to a tool capable of justifying its conclusions to users.

Acknowledgement. The work reported in this paper was supported by the Polish National Science Centre under grant 2024/06/Y/HS1/00197.

References

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
2. Al Zubaer, A., Granitzer, M., Mitrović, J.: Performance analysis of large language models in the domain of legal argument mining. Frontiers in Artificial Intelligence 6, 1278796 (2023)
3. Bousselham, H., Mourhir, A., et al.: Fine-tuning gpt on biomedical nlp tasks: an empirical evaluation. In: 2024 International Conference on Computer, Electrical & Communication Engineering (ICCECE). pp. 1–6. IEEE (2024)
4. Cabessa, J., Hernault, H., Mushtaq, U.: Argument mining with fine-tuned large language models. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Proceedings of the 31st International Conference on Computational Linguistics. pp. 6624–6635. Association for Computational Linguistics, Abu Dhabi, UAE (Jan 2025), https://aclanthology.org/2025.coling-main.442/
5. Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al.: A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), 1–45 (2024)
6. Chen, C.H., Huang, H.H., Chen, H.H.: Self-augmented preference alignment for sycophancy reduction in llms. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 12390–12402 (2025)
7. Eger, S., Daxenberger, J., Gurevych, I.: Neural end-to-end learning for computational argumentation mining. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 11–22. Association for Computational Linguistics, Vancouver, Canada (Jul 2017). https://doi.org/10.18653/v1/P17-1002, https://aclanthology.org/P17-1002/
8. Estornell, A., Ton, J.F., Yao, Y., Liu, Y.: Acc-collab: An actor-critic approach to multi-agent llm collaboration (2025)
9. Gorur, D., Rago, A., Toni, F.: Can large language models perform relation-based argument mining? (2024)
10. Gorur, D., Rago, A., Toni, F.: Retrieval and argumentation enhanced multi-agent llms for judgmental forecasting (2025)
11. Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., Chen, W.: Critic: Large language models can self-correct with tool-interactive critiquing (2024)
12. Habernal, I., Gurevych, I.: Argumentation mining in user-generated web discourse. Computational Linguistics 43(1), 125–179 (Apr 2017). https://doi.org/10.1162/COLI_a_00276, https://aclanthology.org/J17-1004/
13. Harbar, Y., Chudziak, J.A.: Simulating oxford-style debates with llm-based multi-agent systems. In: Nguyen, N.T., Matsuo, T., Gaol, F.L., Manolopoulos, Y., Fujita, H., Hong, T.P., Wojtkiewicz, K. (eds.) Intelligent Information and Database Systems. pp. 286–300. Springer Nature Singapore, Singapore (2025)
14. Kostka, A., Chudziak, J.A.: Towards cognitive synergy in llm-based multi-agent systems: Integrating theory of mind and critical evaluation (2025)
15. Ku, H.B., Shin, J., Lee, H.J., Na, S., Jeon, I.: Multi-agent LLM debate unveils the premise left unsaid. In: Chistova, E., Cimiano, P., Haddadan, S., Lapesa, G., Ruiz-Dolz, R. (eds.) Proceedings of the 12th Argument Mining Workshop. pp. 58–73. Association for Computational Linguistics, Vienna, Austria (Jul 2025). https://doi.org/10.18653/v1/2025.argmining-1.6, https://aclanthology.org/2025.argmining-1.6/
16. Lawrence, J., Reed, C.: Argument mining: A survey. Computational Linguistics 45(4), 765–818 (Dec 2019). https://doi.org/10.1162/coli_a_00364, https://aclanthology.org/J19-4006/
17. Liu, T., Wang, X., Huang, W., Xu, W., Zeng, Y., Jiang, L., Yang, H., Li, J.: Groupdebate: Enhancing the efficiency of multi-agent debate using group discussion. arXiv preprint arXiv:2409.14051 (2024)
18. Meincke, L., Mollick, E., Mollick, L., Shapiro, D.: Prompting science report 2: The decreasing value of chain of thought in prompting. arXiv preprint (2025)
19. Mushtaq, U., Cabessa, J.: Argument classification with bert plus contextual, structural and syntactic features as text. In: International Conference on Neural Information Processing. pp. 622–633. Springer (2022)
20. Niculae, V., Park, J., Cardie, C.: Argument mining with structured SVMs and RNNs. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 985–995. Association for Computational Linguistics, Vancouver, Canada (Jul 2017). https://doi.org/10.18653/v1/P17-1091, https://aclanthology.org/P17-1091/
21. Pareja, A., Nayak, N.S., Wang, H., Killamsetty, K., Sudalairaj, S., Zhao, W., Han, S., Bhandwaldar, A., Xu, G., Xu, K., et al.: Unveiling the secret recipe: A guide for supervised fine-tuning small llms. arXiv preprint arXiv:2412.13337 (2024)
22. Pojoni, M.L., Dumani, L., Schenkel, R.: Argument-mining from podcasts using chatgpt. In: ICCBR Workshops. pp. 129–144 (2023)
23. Sadowski, A., Chudziak, J.A.: On verifiable legal reasoning: A multi-agent framework with formalized knowledge representations. In: Proceedings of the 34th ACM International Conference on Information and Knowledge Management. pp. 2535–2545. CIKM '25, ACM (Nov 2025). https://doi.org/10.1145/3746252.3761057
24. Song, S., Xu, H., Ma, J., Li, S., Peng, L., Wan, Q., Liu, X., Yu, J.: How to alleviate catastrophic forgetting in llms finetuning? hierarchical layer-wise and element-wise regularization (2025)
25. Stab, C., Gurevych, I.: Annotating argument components and relations in persuasive essays. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp. 1501–1510 (2014)
26. Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays. Computational Linguistics 43(3), 619–659 (2017)
27. Tran, K.T., Dao, D., Nguyen, M.D., Pham, Q.V., O'Sullivan, B., Nguyen, H.D.: Multi-agent collaboration mechanisms: A survey of llms (2025)
28. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
29. Wu, H., Li, Z., Li, L.: Can llm agents really debate? a controlled study of multi-agent debate in logical reasoning (2025)
30. Zamojska, M., Chudziak, J.A.: Games agents play: Towards transactional analysis in llm-based multi-agent systems (2025)
31. Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 1(2) (2023)
