EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present EvasionBench, a comprehensive benchmark for detecting evasive responses in corporate earnings call question-and-answer sessions. Drawing from 22.7 million Q&A pairs extracted from S&P Capital IQ transcripts, we construct a rigorously filtered dataset and introduce a three-level evasion taxonomy: direct, intermediate, and fully evasive. Our annotation pipeline employs a Multi-Model Consensus (MMC) framework, combining dual frontier LLM annotation with a three-judge majority voting mechanism for ambiguous cases, achieving a Cohen’s Kappa of 0.835 on human inter-annotator agreement. We release: (1) a balanced 84K training set, (2) a 1K gold-standard evaluation set with expert human labels, and (3) [Eva-4B], a 4-billion parameter classifier fine-tuned from Qwen3-4B that achieves 84.9% Macro-F1, outperforming Claude 4.5, GPT-5.2, and Gemini 3 Flash. Our ablation studies demonstrate the effectiveness of multi-model consensus labeling over single-model annotation. EvasionBench fills a critical gap in financial NLP by providing the first large-scale benchmark specifically targeting managerial communication evasion.


💡 Research Summary

EvasionBench introduces the first large‑scale benchmark for detecting managerial evasion in corporate earnings‑call Q&A sessions. The authors harvested 22.7 million question‑answer pairs from the S&P Capital IQ full‑text dataset, spanning 2002‑2022 and covering over 1.3 million transcripts. After a three‑stage quality filter (question must contain a “?”, answer length ≥ 30 characters, removal of transcription artifacts, and combined Q&A length ≥ 500 characters), 11.27 million high‑quality pairs remained. From this pool the authors constructed a balanced training set of 84 000 examples (33.3 % per class) and a gold‑standard evaluation set of 1 000 examples that were independently validated by expert human annotators.
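The quality criteria above can be sketched as a simple predicate. This is an illustrative reconstruction, not the authors' code: the function name and structure are assumptions, and the transcription-artifact removal stage is omitted because it requires corpus-specific pattern lists not given in the summary.

```python
def passes_quality_filter(question: str, answer: str) -> bool:
    """Hypothetical sketch of the summary's quality criteria.

    Omits the transcription-artifact removal stage, which depends on
    corpus-specific patterns not described here.
    """
    if "?" not in question:                  # question must actually pose a question
        return False
    if len(answer) < 30:                     # drop trivially short answers
        return False
    if len(question) + len(answer) < 500:    # combined Q&A length threshold
        return False
    return True
```

Applying such a predicate to the raw 22.7M pairs is what reduces the pool to the reported 11.27M high-quality pairs.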

The core contribution is a three‑level evasion taxonomy: Direct (the answer fully addresses the question), Intermediate (the answer provides related context but sidesteps the core of the question), and Fully Evasive (the question is ignored, refused, or answered off‑topic). This taxonomy was derived from pilot studies that showed low inter‑annotator agreement at finer granularity; collapsing to three levels raised Cohen’s κ to 0.83, indicating “almost perfect” reliability.

Labeling is performed via a Multi‑Model Consensus (MMC) framework. In Stage I, two frontier large language models—Claude Opus 4.5 and Gemini 3 Flash—independently annotate each sample. When both agree, the label is accepted as consensus. For the 16.1 % of cases where they disagree, a three‑judge arbitration step is invoked: Claude Opus 4.5, Gemini 3 Flash, and GPT‑5.2 each evaluate the competing labels, and a majority vote determines the final label. The authors demonstrate systematic model‑specific biases (e.g., Opus favors Direct, Gemini favors Fully Evasive, GPT‑5.2 favors Intermediate) and mitigate positional bias by randomizing the order of model predictions in the judge prompt (seed = 42). Human re‑annotation of a 100‑sample subset of the gold set yields κ = 0.835, confirming that MMC produces human‑level quality.
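The MMC decision logic described above reduces to agreement-or-arbitration. The sketch below is a paraphrase of the summary, not the authors' implementation: function names are assumptions, and the shuffle illustrates the reported seed-42 order randomization used to mitigate positional bias in the judge prompt.

```python
import random
from collections import Counter

def shuffled_candidates(candidates, seed=42):
    """Randomize the order of competing labels shown to the judges
    (illustrating the summary's positional-bias mitigation, seed = 42)."""
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)
    return order

def mmc_label(annotator_a, annotator_b, judge_votes=()):
    """Stage I: accept the label if the two frontier annotators agree.
    Stage II: otherwise, take the majority vote among three judges."""
    if annotator_a == annotator_b:
        return annotator_a
    return Counter(judge_votes).most_common(1)[0][0]
```

With two annotators, any disagreement triggers arbitration; the three-judge panel guarantees a strict majority exists for a three-way label space only when at least two judges concur, which the voting step resolves.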

For the detection model, the authors fine‑tune Qwen‑3‑4B‑Instruct‑2507 (a 4‑billion‑parameter open‑source LLM) in two stages. Stage I trains on the 60 K consensus samples for two epochs (learning rate 2e‑5, bf16). Stage II continues training on the 24 K arbitrated samples, incorporating the three‑judge majority labels. Three variants are reported: (a) Eva‑4B (Consensus) – only Stage I, (b) Eva‑4B (Opus Only) – Stage II using Opus labels, and (c) Eva‑4B (Full) – Stage II using the majority‑vote labels.
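The two-stage recipe can be summarized as a configuration sketch. All field names are assumptions; only hyperparameters stated in the summary are filled in, and the Stage II epoch count and learning rate, which the summary does not report, are left out rather than guessed.

```python
# Illustrative configuration for the two-stage fine-tuning of
# Qwen-3-4B-Instruct-2507. Field names are assumptions; values come
# from the summary where stated and are omitted where not reported.
STAGES = [
    {
        "name": "stage1_consensus",
        "train_samples": 60_000,   # Stage I: consensus-labeled subset
        "epochs": 2,
        "learning_rate": 2e-5,
        "precision": "bf16",
    },
    {
        "name": "stage2_arbitrated",
        "train_samples": 24_000,   # Stage II: three-judge majority labels
        # epochs / learning rate for Stage II are not reported
    },
]
```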

Evaluation on the 1 K gold set compares twelve systems: closed‑source Claude Opus 4.5, GPT‑5.2, Gemini 3 Flash; open‑source GLM‑4.7, Qwen‑3‑Coder, MiniMax‑M2.1, Kimi‑K2, DeepSeek‑V3.2; and the three Eva‑4B variants plus the untouched Qwen‑3‑4B base model. Eva‑4B (Full) achieves the highest Macro‑F1 of 84.9 % (Direct F1 = 82.2 %, Intermediate F1 = 80.1 %, Fully Evasive F1 = 92.4 %). All three frontier LLMs trail slightly (Claude Opus 4.5 = 84.4 %, Gemini 3 Flash = 84.6 %). The base Qwen‑3‑4B scores only 34.3 % Macro‑F1, illustrating the dramatic benefit of the two‑stage fine‑tuning (+50.6 pp). The most challenging class for every system is Intermediate, reflecting the inherent subjectivity of partial answers.
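Macro-F1, the headline metric above, is the unweighted mean of per-class F1 scores, so the hardest class (Intermediate) counts as much as the easiest. A minimal pure-Python implementation for illustration:

```python
def macro_f1(y_true, y_pred, labels=("Direct", "Intermediate", "Fully Evasive")):
    """Unweighted mean of per-class F1 scores over the given label set."""
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```

Because each class contributes equally, a system that excels on Fully Evasive (92.4% F1 for Eva-4B Full) cannot mask weak performance on Intermediate (80.1%).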

Ablation studies confirm that multi‑model consensus labeling outperforms single‑model annotation by an average of 4.3 percentage points in Macro‑F1, and that the second fine‑tuning stage adds 2–3 pp over consensus‑only training. These results substantiate the claim that diverse model opinions reduce systematic bias and improve downstream performance for tasks with nuanced, subjective labels.

In summary, the paper makes four key contributions: (1) a rigorously filtered, large‑scale dataset for evasion detection with balanced class distribution, (2) a novel MMC annotation pipeline that yields human‑level agreement, (3) a lightweight 4‑billion‑parameter model (Eva‑4B) that surpasses state‑of‑the‑art closed‑source LLMs on the task, and (4) extensive empirical analysis demonstrating the value of multi‑model consensus for both label quality and model accuracy. The authors suggest future work on cross‑domain transfer (e.g., political interviews, legal depositions), finer granularity within the Intermediate class, and hybrid human‑LLM labeling workflows to further reduce annotation costs while preserving reliability.

