Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
Authors: Rongjin Li, Zichen Tang, Xianghe Wang
Published as a conference paper at ICLR 2026

NOT SEARCH, BUT SCAN: BENCHMARKING MLLMS ON SCAN-ORIENTED ACADEMIC PAPER REASONING

Rongjin Li, Zichen Tang, Xianghe Wang, Xinyi Hu, Zhengyu Wang, Zhengyu Lu, Yiling Huang, Jiayuan Chen, Weisheng Tan, Jiacheng Liu, Zhongjun Yang, Haihong E*
Beijing University of Posts and Telecommunications
https://bupt-reasoning-lab.github.io/ScholScan
https://github.com/BUPT-Reasoning-Lab/ScholScan
https://huggingface.co/datasets/BUPT-Reasoning-Lab/ScholScan

ABSTRACT

With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers, yet it remains far from autonomous research. The fundamental reason is that current work on academic paper reasoning is largely confined to a search-oriented paradigm centered on pre-specified targets, with reasoning grounded in relevance retrieval, which struggles to support researcher-style full-document understanding, reasoning, and verification. To bridge this gap, we propose ScholScan, a new benchmark for academic paper reasoning. ScholScan introduces a scan-oriented task setting that asks models to read and cross-check entire papers like human researchers, scanning the document to identify consistency issues. The benchmark comprises 1,800 carefully annotated questions drawn from nine error categories across 13 natural-science domains and 715 papers, and provides detailed annotations for evidence localization and reasoning traces, together with a unified evaluation protocol. We assessed 15 models across 24 input configurations and conducted a fine-grained analysis of MLLM capabilities for all error categories.
Across the board, retrieval-augmented generation (RAG) methods yield no significant improvements, revealing systematic deficiencies of current MLLMs on scan-oriented tasks and underscoring the challenge posed by ScholScan. We expect ScholScan to become the leading and representative work of the scan-oriented task paradigm.

1 INTRODUCTION

Enabling multimodal large language models (MLLMs) (OpenAI, 2025a; Anthropic, 2025; ByteDance Seed Team, 2025; Meta, 2025; xAI, 2025) to conduct comprehensive understanding and generation based on academic literature is the ultimate goal of Deep Research (Comanici et al., 2025), and a critical milestone on the path toward artificial general intelligence (AGI) (Ge et al., 2023; Morris et al., 2024; Phan et al., 2026). With rapid advances, MLLMs are increasingly capable of supporting academic workflows through retrieval, reading, and writing. For example, PaSa (He et al., 2025) can invoke a series of tools to answer complex academic queries with high-quality results, while Google Deep Research (Comanici et al., 2025) is capable of producing human-level research reports based on specific queries.

However, most existing work still follows a search-oriented paradigm, where models retrieve a few relevant passages and reason over local evidence based on pre-specified targets (Gao et al., 2024; Lou et al., 2025). Such methods are effective for tasks with clearly pre-defined targets, but struggle with researcher-style full-document reasoning and verification (Zhou et al., 2024). To function as researchers, models must move beyond reactive question answering and toward proactive discovery of implicit problems.

* Corresponding author.

To fill this gap, as shown in Figure 1, we introduce a scan-oriented paradigm, where models address queries with absent targets, requiring them to exhaustively scan papers, actively construct
a document-level evidence view, and perform evidence-based reasoning. In contrast to search-oriented tasks that assess the model's ability to identify and reason over relevant fragments, scan-oriented tasks emphasize consistency. Instead of relying on pre-specified targets or hints, models must derive all necessary concepts and inferences solely from the given documents.

[Figure 1 contrasts the two paradigms with a calcein-labeling example: a search-oriented query with a pre-specified target ("What methodological issues arise from short-interval calcein labeling?") is answered from a retrieved passage, while a scan-oriented query with an absent target ("Assess the Methods section for Measurement & Operationalization issues") requires scanning pages 3 and 5 to find that a one-day calcein labeling interval cannot validly measure MAR or BFR, yet these values are still reported, creating a disconnect between method and data.]

Figure 1: A comparison between search-oriented and scan-oriented task paradigms. Unlike the former, the scan-oriented paradigm provides no pre-specified targets, requiring the model to actively scan the entire paper and construct a document-level evidence view.

We instantiate this setting via scientific error detection, as it naturally demands discovering non-obvious flaws without target cues, and present ScholScan, a new multimodal benchmark for academic reasoning. ScholScan features the following key highlights:

• Scan-Oriented Task Paradigm. In ScholScan, models receive one or more complete academic papers with target-absent queries, undergoing rigorous evidence-based reasoning evaluation.
The benchmark comprises 715 papers spanning 13 natural-science disciplines.

• Comprehensive Error Taxonomy. ScholScan covers nine categories of scientific errors throughout the research workflow, including citation and reference errors, rigorously evaluating models' cross-source reasoning abilities.

• Process-Aware Evaluation Framework. ScholScan provides fine-grained annotations for both evidence location and reasoning steps, enabling a comprehensive evaluation framework that assesses model performance in terms of both process and outcome.

We evaluated 15 models across 24 input configurations and 8 retrieval-augmented generation (RAG) frameworks. All models exhibited limited performance, and none of the RAG methods delivered significant improvements. These results highlight the inadequacy of search-oriented frameworks when applied to scan-oriented tasks and underscore both the challenges and the potential of enabling MLLMs to perform reliable document-level reasoning over full academic papers.

2 RELATED WORK

2.1 MULTIMODAL LARGE LANGUAGE MODELS

With the rapid progress of MLLMs, models have evolved beyond perception tasks (e.g., image recognition and explanation) (Liu et al., 2024) toward a deep understanding of structured multimodal long documents. Their strengths lie in the ability to integrate cross-modal information and perform multi-hop reasoning over extended contexts. These capabilities are not only valuable for specific question answering or instruction-following tasks (Yue et al., 2024) but are particularly well suited to simulating human thought processes and generating explainable reasoning trajectories (Zheng et al., 2023). Consequently, achieving a comprehensive understanding of entire documents has emerged as a core challenge that MLLMs are inherently equipped to address.
2.2 DOCUMENT UNDERSTANDING BENCHMARK

Document understanding tasks challenge models to identify the relevant context and perform accurate reasoning grounded in that information. Progress in document understanding benchmarks has followed two main axes. Along the input dimension, benchmarks have evolved from short to long contents, from everyday to specialized domains, and from plain text to multimodal formats (Chen et al., 2021; Yang et al., 2018; Mathew et al., 2021; Deng et al., 2025). Along the output dimension, they have shifted from limited-output formats to more open-ended responses (Pramanick et al., 2024). DocMath-Eval (Zhao et al., 2024) evaluates numerical reasoning on long specialized documents, while MMLongBench-Doc (Ma et al., 2024) builds a multimodal benchmark with layout-rich documents. The recently proposed FinMMDocR (Tang et al., 2025) pioneers the integration of scenario awareness, document understanding, and multi-step reasoning in financial scenarios. However, a comprehensive benchmark that integrates all of the above challenges has yet to be introduced.

2.3 ACADEMIC PAPER UNDERSTANDING BENCHMARK

Compared with general documents, academic papers are distinguished by their rich domain knowledge and logical rigor. Reasoning over papers has emerged as a major challenge in recent research. Some studies target local elements such as charts or snippets, leveraging their internal complexity, but neglect the need for cross-source integration and domain-specific interpretation within the full document (Wang et al., 2024; Li et al., 2024). Recent studies extend to document-level inputs using image-based formats for real-world reading scenarios (Auer et al., 2023; Yan et al., 2025).
However, benchmarks based on the QA paradigm face inherent limitations, as they typically presuppose the existence of answers and embed explicit cues in the question itself, reducing the need for comprehensive understanding and information organization. Moreover, mainstream evaluation protocols focus on the final outcome, with limited assessment of whether intermediate reasoning is evidentially grounded and logically valid. More examples and analysis are shown in Appendix B.

3 SCHOLSCAN BENCHMARK

Figure 2: Left: Overview of ScholScan (domain coverage: Computer Science, 427 papers; Biology, 112; Chemistry, 38; Physics, 42; others, 96; spanning subfields such as machine learning, graph neural networks, immunology, total synthesis, catalysis, and photonics). Right: Comparison to related benchmarks. Mod.: Modalities; Para.: Task Paradigm; Eval.: Evaluation Focus; T: Text; I: Image; TD: Text Document; MD: Multimodal Document; P: Process; O: Outcome; Dom.: Academic Domain Coverage.

Benchmark                 Mod.    Para.    Eval.   # Dom.
Document Understanding
DocMath-Eval CompLong     T+TD    Search   O       N/A
FinMMDocR                 T+MD    Search   O       N/A
MMLongBench-Doc           T+MD    Search   O       N/A
LongDocURL                T+MD    Search   O       N/A
SlideVQA                  T+MD    Search   O       N/A
Academic Paper Understanding
CharXiv                   T+I     Search   O       8
ArXivQA                   T+I     Search   O       10
MMCR                      T+MD    Search   O       CS
AAAR-1.0                  T+MD    Search   O       CS
ScholScan (ours)          T+MD    Scan     P+O     13

3.1 OVERVIEW OF SCHOLSCAN

We introduce ScholScan, a benchmark designed to comprehensively evaluate MLLMs' ability to detect scientific flaws in academic papers under scan-oriented task settings.
As illustrated in Figure 2, ScholScan spans 13 disciplines in the natural sciences, including physics, chemistry, and computer science, and encompasses over 100 subfields such as immunology, total synthesis, and machine learning. The benchmark comprises 1,800 questions derived from 715 real academic papers and covers nine major error categories commonly observed in real-world research scenarios (see examples in Figure 3; more examples in Appendix D). These include issues in numerical and formulaic computation, experimental design, inference and conclusions, and citation misuse, among others. Figure 2 also compares ScholScan with existing document and paper understanding benchmarks.

[Figure 3 panels pair each error category with a sampled example explanation:
• Research Question & Definitions: the definition of "actionable variants" shifts across sections (LOE 1-5 in the Abstract, LOE 1-3 in the Results), causing ambiguity.
• Design & Identifiability: the design is described as probing both short- and long-range interactions, yet the paper still claims unique large-q selectivity, creating a disconnect.
• Sampling & Generalizability: the experiments use a narrow diabetic mouse substrain, yet the paper generalizes findings to all patients, creating an invalid sample-to-population inference.
• Measurement & Operationalization: first-harmonic demodulation is dominated by far-field background and cannot produce the reported high-quality near-field images.
• Data Handling & Preprocessing: feature selection for the NSCLC and HCC models was done on the full dataset before splitting, causing data leakage, while the Discussion falsely claims unbiased validation.
• Computation & Formulae: the Methods claim a 200-fold concentration, but the 200 µL subsample is incorrectly said to represent ~20 mL instead of 40 mL, creating a twofold calculation error.
• Inference & Conclusions: the data show PGK1 promotes EGFR degradation, yet the Discussion claims inhibiting PGK1 as therapy, directly contradicting the results.
• Referential & Citation Alignment: Figure 1 of the sampled paper reports an LPS dose of 1.5 mg/kg, but Figure 5 reports 15 mg/kg, a tenfold discrepancy that makes the actual experimental dose unclear.
• Language & Expression: the paper swaps C. elegans gene and protein nomenclature (e.g., 'unc-45' vs. 'UNC-45'), creating technically misleading references.]

Figure 3: Sampled ScholScan examples with 9 error categories, covering the whole process of scientific research, each requiring the model to perform thorough cross-source evidence-based reasoning.

3.2 DATA CURATION AND QUESTION GENERATION

We curated papers from ICLR 2024 [1] and 2025 [2], as well as Nature Communications [3], collecting public reviews for the former. Questions were constructed along two dimensions: the source is either generated or sampled, and the context is either within-paper or cross-paper.

[1] https://openreview.net/group?id=ICLR.cc/2024/Conference
[2] https://openreview.net/group?id=ICLR.cc/2025/Conference
[3] https://www.nature.com/ncomms/

Generation.
On high-quality accepted papers, we prompt Gemini 2.5 Pro to perform coordinated sentence-level edits spanning multiple sections or pages. It then synthesizes composite errors and generates the corresponding question along with an explanation grounded in the edited context.

Sampling. From rejected ICLR submissions and their public reviews, we prompt Gemini 2.5 Pro to extract explicit, falsifiable scientific errors and convert them into questions with initial explanations. Subjective remarks on novelty or writing quality are excluded.

Within-Paper. This setting focuses on verifiable facts and internal consistency within a single paper, and supports both Generation and Sampling.

Cross-Paper. This setting examines citation consistency across papers. For each instance, Gemini 2.5 Pro receives an accepted paper and one of its cited sources, then edits the accepted paper to introduce paraphrases or reasoning errors about the citation. As public reviews mainly address nonfalsifiable aspects, such as appropriateness, all cross-paper instances are constructed exclusively using the generation method. The detailed prompt templates and specific instructions used are provided in Appendix A.

3.3 QUALITY CONTROL AND ANNOTATION

Despite explicit instructions, initial outputs exhibited substantial hallucinations, logical inconsistencies, and low-quality questions. To ensure quality, 10 domain experts conducted a rigorous annotation process. Each instance underwent independent dual review, and disagreements were resolved by a third expert. Among the 3,500 initial candidates, 1,700 were discarded, and 1,541 of the remaining were revised, including 535 question rewrites, 1,207 explanation edits, and 1,141 corrections to error categories or metadata. Appendix C details data sourcing and validation protocols.

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

Models.
We benchmark a total of 24 input configurations by feeding academic papers either as images or as OCR-based text extracted with the Tesseract engine (Smith, 2007), covering 15 mainstream models (Yang et al., 2025; Bai et al., 2025; DeepSeek-AI et al., 2025; Guo et al., 2025; OpenAI et al., 2025).

Evaluation Protocol. Inspired by MMLongBench-Doc (Ma et al., 2024), we prompt the models to generate the necessary reasoning chains from evidence to detected anomalies without constraining the output format. This design assesses evidence-based reasoning ability rather than basic instruction following. For open-ended responses, we use GPT-4.1 (OpenAI, 2025b) to extract cited evidence and reasoning steps, and quantify alignment with annotated explanations. Human evaluation confirms high agreement between our pipeline and expert annotations. Human validation results and evaluator robustness checks are detailed in Appendix E.

Metrics. We define a structured evaluation framework by parsing a model response $a$ into a tuple:

$$\Psi(a) \Rightarrow \big(\mathbb{1}_{\mathrm{exist}},\ \mathbb{1}_{\mathrm{contain}},\ \hat{E},\ \hat{R},\ n\big). \tag{1}$$

Here, $\mathbb{1}_{\mathrm{exist}}$ and $\mathbb{1}_{\mathrm{contain}}$ are binary indicators: $\mathbb{1}_{\mathrm{exist}}$ is 1 if the output reports any error at all, and $\mathbb{1}_{\mathrm{contain}}$ is 1 if the reported errors include the annotated target error. $\hat{E}, \hat{R}$ and $E^*, R^*$ are the predicted and gold evidence sets and reasoning chains; $\hat{g} = \mathrm{prefix\_match}(\hat{R}, R^*)$ counts matched reasoning steps; $n \in \mathbb{N}$ is the number of unrelated errors. Based on $\Psi(a)$, we define an end-to-end score $S(a) \in [0, 1]$ that combines all aspects of prediction quality:

(i) Error Detection Score. We consider the error detected only if the model identifies the target error:

$$S_{\mathrm{detection}} = \mathbb{1}_{\mathrm{exist}} \cdot \mathbb{1}_{\mathrm{contain}}. \tag{2}$$

(ii) Evidence Location Score. Even when the target error is identified, the cited evidence may be incomplete or noisy.
We compute a Dice score with a squared penalty for over-reporting:

$$S_{\mathrm{location}} = \max\left(0,\ \frac{2\,|\hat{E} \cap E^*| + \mathbb{1}\{|\hat{E}| + |E^*| = 0\}}{\max(|\hat{E}| + |E^*|,\ 1)} - 0.8\left(\frac{|\hat{E} \setminus E^*|}{\max(|\hat{E}|,\ 1)}\right)^{2}\right). \tag{3}$$

(iii) Reasoning Process Score. Even if the target error is detected, the reasoning may diverge from the gold chain. We use prefix match to assess reasoning completeness:

$$S_{\mathrm{reasoning}} = \mathbb{1}\{|R^*| = 0\} + \mathbb{1}\{|R^*| > 0\}\left(\frac{\hat{g}}{|R^*|}\right)^{2}. \tag{4}$$

(iv) Unrelated-Error Penalty. Models may list unrelated items to inflate recall at the cost of precision. We penalize this with a rapidly increasing function of the unrelated-error count:

$$P_{\mathrm{unrelated\_err}}(n) = 0.9^{\min(n,\,2)} \exp\!\left(-0.6\,\max(n-2,\,0)^{1.5}\right). \tag{5}$$

(v) Overall Score. The final score for response $a$ integrates detection accuracy, evidence quality, and reasoning faithfulness:

$$S(a) = S_{\mathrm{detection}} \cdot \sqrt{S_{\mathrm{location}}} \cdot S_{\mathrm{reasoning}} \cdot P_{\mathrm{unrelated\_err}}(n). \tag{6}$$

Table 1: Model performance across 9 error categories (scaled by 100). RQD: Research Question & Definitions; DI: Design & Identifiability; SG: Sampling & Generalizability; MO: Measurement & Operationalization; DHP: Data Handling & Preprocessing; CF: Computation & Formulae; IC: Inference & Conclusions; RCA: Referential & Citation Alignment; LE: Language & Expression.
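Before turning to the results in Table 1, note that Eqs. (1) to (6) compose into a single per-response score. The following sketch implements the arithmetic directly; the function and variable names are ours (not from a released codebase), evidence is modeled as Python sets, reasoning chains as step lists, and an exact-match prefix comparison stands in for the paper's GPT-4.1 alignment step:

```python
import math

def scholscan_score(pred_evidence, gold_evidence, pred_steps, gold_steps,
                    exists, contains, n_unrelated):
    """Compose Eqs. (2)-(6) into the overall score S(a)."""
    # (2) detection: credit only if an error is reported AND it is the target one
    s_det = 1.0 if (exists and contains) else 0.0

    # (3) evidence location: Dice score minus a squared over-reporting penalty
    e_hat, e_star = set(pred_evidence), set(gold_evidence)
    dice = (2 * len(e_hat & e_star)
            + (1 if len(e_hat) + len(e_star) == 0 else 0)) \
        / max(len(e_hat) + len(e_star), 1)
    over_report = 0.8 * (len(e_hat - e_star) / max(len(e_hat), 1)) ** 2
    s_loc = max(0.0, dice - over_report)

    # (4) reasoning: squared prefix-match completeness against the gold chain
    g_hat = 0
    for pred, gold in zip(pred_steps, gold_steps):
        if pred != gold:
            break
        g_hat += 1
    s_rea = 1.0 if not gold_steps else (g_hat / len(gold_steps)) ** 2

    # (5) unrelated-error penalty: mild up to n = 2, rapidly increasing beyond
    p_unrel = 0.9 ** min(n_unrelated, 2) \
        * math.exp(-0.6 * max(n_unrelated - 2, 0) ** 1.5)

    # (6) overall score
    return s_det * math.sqrt(s_loc) * s_rea * p_unrel
```

Under this composition a perfect response scores 1.0, each of the first two unrelated errors multiplies the score by 0.9, and further unrelated errors are suppressed exponentially.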
Models                      Avg.   RQD   DI    SG    MO    DHP   CF    IC    RCA   LE

MLLM (Image Input) / Proprietary MLLMs
GPT-5                       19.2   10.1   9.7  28.2  14.6  26.6  13.8  25.3  25.3   6.9
Gemini 2.5 Pro              15.6   11.9  12.6  35.7  12.3  27.0   4.6  14.7  15.2   7.4
Doubao-Seed-1.6-thinking    10.2    3.4   3.5  22.3   7.5  15.1  10.2  12.2  10.9   3.3
Doubao-Seed-1.6              9.9    3.0   4.4  29.2   4.9  15.0   6.3  17.9   8.0   3.9
Grok 4                       4.0    0.0   1.9  16.7   3.2   7.4   0.7   1.9   3.6   0.0

MLLM (Image Input) / Open-Source LLMs
Llama 4 Maverick             7.0    7.0   7.3   9.4   4.5   4.0   6.5   6.7   8.8   3.0
Mistral Small 3.1            3.3    0.1   2.0   2.0   1.5   0.1   1.0   2.2   8.6   1.0
Gemma 3 27B                  1.7    0.5   2.7   2.3   1.7   1.0   1.0   1.3   2.6   0.0
Qwen2.5-VL-72B               0.1    0.0   0.7   0.0   0.0   0.0   0.0   0.0   0.2   0.0

OCR + LLM (Text Input) / Proprietary LLMs
Gemini 2.5 Pro              30.3   21.5  34.2  44.3  27.6  56.6  10.3  28.8  35.6   8.1
GPT-5                       22.5   16.1  21.4  26.0  20.3  36.7   4.7  29.8  30.0   2.6
Grok 4                      20.8    9.3   7.7  37.4  12.3  34.4   9.0  20.0  31.2   7.2
Doubao-Seed-1.6-thinking    15.3    8.2  10.1  24.3  10.1  24.2   6.4  19.2  21.0   4.2
Doubao-Seed-1.6             13.9    5.4   6.9  26.4  10.3  23.6   6.3  20.1  17.5   2.3
Claude Sonnet 4              5.7    3.7   2.5  10.8   4.3  10.3   1.4   8.4   6.6   3.5

OCR + LLM (Text Input) / Open-Source LLMs
Qwen3 A22B (Thinking)       17.4    8.9  16.2  31.9  15.1  23.7   5.6  22.3  21.1   2.3
DeepSeek-R1                 11.4    5.1  11.9  25.4   8.7  22.5   4.7  16.3   9.8   3.5
gpt-oss-120b                 7.3    6.3   5.7  18.3   4.9  14.5   1.6  12.5   5.5   0.0
Mistral Small 3.1            6.9    3.0   2.7   5.5   7.0   2.0   8.5   4.0  12.2   3.0
Llama 4 Maverick             2.3    1.5   2.0   4.8   3.0   3.6   0.0   5.8   1.6   0.2
Gemma 3 27B                  2.0    2.1   1.6   3.0   2.7   0.2   0.7   7.7   1.0   0.0
Qwen3 A22B (Instruct)        1.7    1.2   0.0   2.7   0.4   1.0   0.1   4.3   2.5   1.1
DeepSeek-V3.1                1.7    1.2   2.0   1.7   1.0   5.8   0.5   2.2   2.1   0.0
Qwen2.5-VL-72B               0.2    0.0   0.7   0.0   0.0   0.0   0.0   0.0   0.6   0.0

4.2 MAIN RESULTS

Table 1 presents our evaluation results. Our main findings are summarized as follows:

Overall performance remains unsatisfactory. GPT-5 achieves the highest average score in the image-input group (19.2), while Gemini 2.5 Pro, the best-performing model in the text-input setting, still fails to surpass the 60-point threshold in any error category.
Even in the SG category, which yields the best overall performance, nearly half of the models receive single-digit scores. Most models perform poorly under the scan-oriented task paradigm and fail to detect any issues in many papers. This challenge is particularly pronounced for open-source models.

Reasoning-enhanced models demonstrate clear advantages. Across both input configurations, reasoning-enhanced variants consistently achieve higher scores. Almost all best-performing models, measured both per error category and by overall performance, fall into this category. In particular, Qwen3-Thinking and DeepSeek-R1 outperform their base versions by more than 10% in average scores, with substantial gains observed across all error categories. These results indicate that reasoning-enhanced models are better able to simulate the iterative process of extraction followed by reasoning, which is essential for effectively handling scan-oriented tasks and producing higher-quality responses.

MLLMs face significant bottlenecks in handling long multimodal inputs. Across most error categories, text inputs outperform image inputs. Among the nine MLLMs tested, the average performance gap between text and image inputs reaches 4.81 points, highlighting visual processing as a key limitation of current MLLMs.

Although image input performs worse overall, multimodal input remains indispensable. In certain categories such as CF, where OCR-based text extraction leads to substantial loss of formulaic or tabular content, image inputs outperform their text counterparts. This highlights the essential role of multimodal reasoning and the irreplaceable value of visual information for specific error categories.

4.3 FINE-GRAINED ANALYSIS

Figure 4: Spearman correlation matrix among the 9 error categories.

Capability Dimensions.
We compute pairwise Spearman correlations between error categories across the two input configurations (text and image) for the eight evaluated MLLMs, excluding Qwen2.5-VL-72B, as shown in Figure 4. We derive the following insights:

(i) With image input, CF exhibits consistently low correlations with the other error categories, suggesting that the skills required for mathematical reasoning are relatively distinct. In contrast, with text input, CF shows moderate correlation with LE, indicating that OCR-flattened formulas lose their structural specificity and are interpreted by models in a manner more akin to natural language. Combined with the overall poor performance on CF tasks, this underscores the unique challenges of this category and the need for targeted improvements.

(ii) Although DI is also related to experimental settings, it does not exhibit strong correlations with SG, MO, or DHP. This indicates that DI primarily emphasizes causal framing and variable identifiability, rather than the procedural understanding of experimental operations.

(iii) OCR severely degrades structured content such as figures and formulas, making questions that depend on multimodal information unanswerable. This diminishes the expression of multimodal reasoning capabilities and artificially inflates inter-category correlations under text input.

Based on the above analysis, we consolidate the original nine error categories, each defined by its objective target, into five core latent skill dimensions evaluated by ScholScan under image input. While each dimension emphasizes the primary competence of its corresponding error categories, they are not mutually exclusive, as many questions involve overlapping reasoning abilities. RQD and DI correspond to research concept comprehension, which requires models to identify the scope and definition of research objectives by integrating contextual cues and prior knowledge.
SG, MO, and DHP fall under experimental process modeling, which tests a model's ability to reconstruct procedural workflows such as sampling, measurement, and data handling. CF captures formal reasoning and symbolic computation, focusing on syntactic parsing and numerical logic. IC evaluates causal inference, where models must synthesize dispersed causal evidence to reach sound conclusions. RCA and LE reflect referential alignment and linguistic consistency, which assess the ability to verify citations and maintain coherent expression throughout the document.

Figure 5: Left: Distribution of omission and hallucination errors. Right: Average reasoning steps and evidence locations involved in answer generation, compared against the gold reference.

Hidden Complexity in Scan-Oriented Tasks. We analyze the reasoning traces of GPT-5 and Gemini 2.5 Pro across both input configurations, focusing on the number of evidence pieces scanned and the reasoning steps performed. As illustrated in Figure 5, even the most advanced models often scan up to 8 times more evidence and execute 3.5 times more reasoning steps than the reference answers, merely to approximate a correct response, yet they still frequently fail. This highlights the substantial hidden complexity inherent in scan-oriented tasks, which significantly amplifies the challenge of successful task completion.

4.4 ERROR ANALYSIS

Omission and Hallucination.
Most zero-score cases fall into two categories: either the model fails to detect any errors in the paper, or it becomes overwhelmed by hallucinations and entirely overlooks the actual errors present in the reference answer. We analyze the number of zero-score questions and the proportion of these two failure modes across models, as shown in Figure 5. Stronger models tend to have fewer zero-score cases overall, but are more prone to overconfident hallucinations.

Fragile Reasoning under Complex Evidence. Figure 6 shows how the best-performing models behave under different numbers of reasoning steps and evidence locations. As reasoning steps increase, both reasoning and overall scores steadily decline, revealing a clear bottleneck in MLLMs' ability to construct long causal chains. In contrast, variation in evidence locations has a weaker and less consistent impact. However, this does not imply that multi-evidence questions pose only marginal difficulty. Since the evaluation metric allows partial evidence omissions, more evidence items do not necessarily incur large score penalties. Still, heavier evidence loads often require longer reasoning chains, which substantially affect the coherence and completeness of the inferred logic. These results highlight the persistent challenge for MLLMs in integrating evidence and maintaining logical structure as task complexity increases.

4.5 RAG ANALYSIS

We evaluated 8 RAG methods across both input configurations (Robertson et al., 1994; Chen et al., 2024; Lee et al., 2025; Faysse et al., 2025; Yu et al., 2025; Wang et al., 2025; Izacard et al., 2022). Key findings are presented below, with detailed results shown in Tables 2 and 3.

The Oracle condition yields significant accuracy gains. Providing gold images alleviates the scanning burden in long-context inputs, increasing the chances of generating correct answers.
Although overall performance improves, the gains are limited for CF errors and minimal for LE errors. For CF, the sparse formulaic content means gold images offer limited assistance. For LE, the dense text distribution makes even direct access to target regions insufficient for current models to reduce complexity.

Figure 6: Model performance trends across reasoning steps and evidence locations (scaled by 100).

Table 2: Overall scores of RAG methods across the 9 error categories (scaled by 100).

Models          Avg.   RQD   DI    SG    MO    DHP   CF    IC    RCA   LE

Text Input (Base Model: Qwen3-Thinking)
Baseline        17.4    8.9  16.2  31.9  15.1  23.7   5.6  22.3  21.1   2.3
Oracle          24.5   20.6  27.9  43.6  21.3  40.8   7.4  26.9  26.0   1.9
BM25            16.7    9.7  13.7  33.0  17.3  23.8   6.8  25.4  16.5   3.0
Contriever      16.6    9.7  18.2  33.7  10.7  20.8   6.4  18.5  19.8   1.8
BGE-M3          11.3    8.6   7.5  24.8   9.1  15.4   5.3  15.6  11.4   1.0
NV-Embed-v2      6.8    4.0   4.0   9.4   6.1   4.9   5.5   5.7  10.0   2.0

Image Input (Base Model: Llama 4 Maverick)
Baseline         7.0    7.0   7.3   9.4   4.5   4.0   6.5   6.7   8.8   3.0
Oracle           6.5    3.0   4.5  15.6   8.2   9.4   4.9  10.0   4.4   1.4
VRAG-RL         10.9    9.8  11.6  17.8   8.2  11.0   6.8  13.1  10.8   8.1
ColQwen2.5       1.2    2.1   0.7   0.5   0.0   1.2   0.2   2.7   2.0   0.0
VisRAG           1.0    2.0   0.0   1.0   0.0   1.0   1.6   1.3   1.2   0.0
ColPali-v1.3     0.8    1.5   0.0   0.5   0.0   0.9   0.5   1.3   1.4   0.0

Table 3: Retrieval performance of RAG methods.
Models          MRR@5  Recall@5
Text Input (Base Model: Qwen3-Thinking)
BM25             0.41   0.48
Contriever       0.31   0.39
NV-Embed-v2      0.30   0.38
BGE-M3           0.16   0.21
Image Input (Base Model: Llama 4 Maverick)
VisRAG           0.41   0.46
ColQwen2.5       0.30   0.35
ColPali-v1.3     0.26   0.31

In consistency-centric scan-oriented tasks, most retrieval-based enhancement methods show minimal effectiveness. All embedding models exhibit poor retrieval accuracy: none achieves 50% recall within the top-5 retrieved items. More critically, performance deteriorates after retrieval, especially for multimodal embedding models, where post-retrieval responses are almost entirely incorrect and scores approach 0.

Complex embedding model architectures do not yield better performance. Under the text-input setting, BM25 achieves the highest retrieval metrics, outperforming Contriever and NV-Embed-v2. Under the image-input setting, although VisRAG shows certain advantages in retrieval performance, its overall score remains comparably low and converges with methods such as ColPali-v1.3. Under such circumstances, comparisons between retrieval metrics lose their substantive significance. The underlying reason is that existing embedding models are primarily designed to improve retrieval at the level of semantic relevance. They already struggle with traditional multi-hop reasoning tasks, let alone scan-oriented tasks with target suppression.

Reinforcement learning frameworks with visual focus have emerged as leading approaches. Despite being built on a compact 7B model, VRAG-RL consistently delivers improved performance and is the only method that achieves gains under image input following RL optimization. Its enhanced retrieval sharpens evidence selection, while strong reasoning provides effective guidance during document scanning. The retrieval and reasoning components are interleaved in the design, with each stage informing the other in an iterative loop.
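As an aside on the metrics in Table 3: MRR@5 and Recall@5 follow the standard top-k definitions. The sketch below is ours for illustration only; the helper names and toy page IDs are not from the released evaluation code, which may differ in details such as tie handling.

```python
def mrr_at_k(ranked_ids, gold_ids, k=5):
    """Reciprocal rank of the first gold item within the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of gold items retrieved within the top-k."""
    retrieved = set(ranked_ids[:k]) & set(gold_ids)
    return len(retrieved) / len(gold_ids)

# One query: the retriever ranked five pages, two of which are gold evidence.
ranked = ["p3", "p7", "p1", "p9", "p2"]
gold = {"p7", "p2"}
print(mrr_at_k(ranked, gold))     # 0.5 (first gold page at rank 2)
print(recall_at_k(ranked, gold))  # 1.0 (both gold pages in the top 5)
```

Per-query values are averaged over all questions to produce the table entries.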
This tightly coupled interaction contributes to the method's superior performance potential.

5 CONCLUSION

In this paper, we introduce ScholScan, a benchmark designed to evaluate the performance of MLLMs on scan-oriented tasks that require the detection of scientific errors across entire academic papers. We conduct a comprehensive evaluation and in-depth analysis of mainstream MLLMs and RAG methods. The results demonstrate that current MLLMs remain far from capable of reliably addressing such tasks and that existing RAG methods provide little to no improvement. This highlights the complexity, integrative demands, and originality of the ScholScan benchmark. Looking ahead, our goal is to develop scan-oriented task paradigms suited to diverse academic scenarios and to explore new techniques that improve model performance on target-suppressed inputs. These directions support the larger goal of advancing MLLMs from passive assistants to active participants in scientific research.

ETHICS STATEMENT

Data Provenance. All data used in this paper were constructed by the authors and do not include external public or proprietary datasets. The academic papers and author names referenced are publicly available through arXiv and OpenReview.

Annotation Process. A team of 10 domain experts was assembled to thoroughly review all tasks initially generated by Gemini 2.5 Pro. All annotators provided informed consent to participate. To ensure the accuracy and neutrality of both model-generated and human-verified content, we employed a rigorous multi-stage validation process involving cross-review and third-party adjudication.

Model Evaluation. Evaluation of 15 mainstream models in 24 input configurations was carried out using legally authorized API access through VolcEngine, Alibaba Cloud's LLM services, and OpenRouter.

Dissemination.
ScholScan is open source and freely available for academic and non-commercial research. All personally identifiable information has been removed from the dataset, and its collection and release comply with the ethical and legal requirements in place at the time of data acquisition.

REPRODUCIBILITY STATEMENT

All results presented in this paper are fully reproducible. To facilitate verification and extension, we provide the complete dataset on Hugging Face, and the source code with detailed documentation on GitHub. The GitHub repository includes step-by-step instructions and the exact hyperparameter configurations used in our experiments. The retrieval components in all RAG experiments were executed on a server equipped with 8 NVIDIA A40 GPUs.

ACKNOWLEDGMENTS

This work is supported by the Beijing Natural Science Foundation (Grant No. QY25345), the National Natural Science Foundation of China (Grant Nos. 62473271, 62176026), and the Fundamental Research Funds for the Beijing University of Posts and Telecommunications (Grant No. 2025AI4S03). This work is also supported by the Engineering Research Center of Information Networks, Ministry of Education, China. We would also like to thank the anonymous reviewers and area chairs for constructive discussions and feedback.

REFERENCES

Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. Technical report, Anthropic, May 2025. URL https://www.anthropic.com/claude-4-system-card. Accessed: 2025-09-24.

Sören Auer, Dante A. C. Barone, Cassiano Bartz, Eduardo G. Cortes, Mohamad Yaser Jaradeh, Oliver Karras, Manolis Koubarakis, Dmitry Mouromtsev, Dmitrii Pliukhin, Daniil Radyush, Ivan Shilin, Markus Stocker, and Eleni Tsalapati. The SciQA scientific question answering benchmark for scholarly knowledge. Scientific Reports, 13(1):7240, May 2023. ISSN 2045-2322. doi: 10.1038/s41598-023-33607-z.
URL https://doi.org/10.1038/s41598-023-33607-z.

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025.

ByteDance Seed Team. Seed1.6 Tech Introduction, June 2025. URL https://seed.bytedance.com/en/seed1_6. Accessed: 2025-06-25.

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 2318–2335, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.137. URL https://aclanthology.org/2024.findings-acl.137/.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA: A dataset of numerical reasoning over financial data. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3697–3711, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.300. URL https://aclanthology.org/2021.emnlp-main.300/.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al.
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, et al. DeepSeek-V3 technical report, 2025. URL https://arxiv.org/abs/2412.19437.

Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, and Cheng-Lin Liu. LongDocURL: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1135–1159, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.57. URL https://aclanthology.org/2025.acl-long.57/.

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Celine Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp. 61424–61449, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/99e9e141aafc314f76b0ca3dd66898b3-Paper-Conference.pdf.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.

Yingqiang Ge, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, and Yongfeng Zhang. OpenAGI: When LLM meets domain experts. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 5539–5568.
Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/1190733f217404edc8a7f4e15a57f301-Paper-Datasets_and_Benchmarks.pdf.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, Sep 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL https://doi.org/10.1038/s41586-025-09422-z.

Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, and Weinan E. PaSa: An LLM agent for comprehensive academic paper search. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11663–11679, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.572. URL https://aclanthology.org/2025.acl-long.572/.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=jKN1pXi7b0.

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp. 79310–79333, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/c4bf73386022473a652a18941e9ea6f8-Paper-Conference.pdf.

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu.
Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14369–14387, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.775. URL https://aclanthology.org/2024.acl-long.775/.

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Accessed: 2025-05-13.

Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, and Wenpeng Yin. AAAR-1.0: Assessing AI's potential to assist research. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu (eds.), Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pp. 40361–40383. PMLR, 13–19 Jul 2025. URL https://proceedings.mlr.press/v267/lou25c.html.

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, and Aixin Sun. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 95963–96010. Curran Associates, Inc., 2024. doi: 10.52202/079017-3041.
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ae0e43289bffea0c1fa34633fc608e92-Paper-Datasets_and_Benchmarks_Track.pdf.

Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2199–2208, 2021. doi: 10.1109/WACV48630.2021.00225. URL https://openaccess.thecvf.com/content/WACV2021/papers/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.pdf.

Meta. Llama 4 | Model Cards and Prompt Formats, September 2025. URL https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/. Accessed: 2025-09-24.

Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Position: Levels of AGI for operationalizing progress on the path to AGI. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=0ofzEysK2D.

OpenAI. GPT-5 System Card. Technical report, OpenAI, August 2025a. URL https://cdn.openai.com/gpt-5-system-card.pdf. Accessed: 2025-09-24.

OpenAI. Introducing GPT-4.1 in the API. Technical report, OpenAI, April 2025b. URL https://openai.com/index/gpt-4-1/. Accessed: 2025-04-14.

OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, et al. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925.

Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, et al. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099):1139–1146, January 2026. ISSN 1476-4687. doi: 10.1038/s41586-025-09962-4. URL http://dx.doi.org/10.1038/s41586-025-09962-4.
Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. SPIQA: A dataset for multimodal question answering on scientific papers. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 118807–118833. Curran Associates, Inc., 2024. doi: 10.52202/079017-3773. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/d74033a247989e8f6f3bf9e0c9629fb5-Paper-Datasets_and_Benchmarks_Track.pdf.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Text Retrieval Conference, 1994. URL https://api.semanticscholar.org/CorpusID:41563977.

R. Smith. An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pp. 629–633, 2007. doi: 10.1109/ICDAR.2007.4376991. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4376991.

Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, and Haocheng Gao. FinMMDocR: Benchmarking financial multimodal reasoning with scenario awareness, document understanding, and multi-step computation, 2025.

Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, and Feng Zhao. VRAG-RL: Empower vision-perception-based RAG for visually rich information understanding via iterative reasoning with reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=EeAHhNwXPV.
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 113569–113697. Curran Associates, Inc., 2024. doi: 10.52202/079017-3609. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/cdf6f8e9fd9aeaf79b6024caec24f15b-Paper-Datasets_and_Benchmarks_Track.pdf.

xAI. Grok 4 Fast Model Card. Technical report, xAI, September 2025. URL https://data.x.ai/2025-09-19-grok-4-fast-model-card.pdf. Accessed: 2025-09-19.

Dawei Yan, Yang Li, Qing-Guo Chen, Weihua Luo, Peng Wang, Haokui Zhang, and Chunhua Shen. MMCR: Advancing visual language model in multimodal multi-turn contextual reasoning, 2025.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning.
HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259/.

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Learning Representations, volume 2025, pp. 21074–21098, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/3640a1997a4c9571cea9db2c82e1fc35-Paper-Conference.pdf.

Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9556–9567, 2024. doi: 10.1109/CVPR52733.2024.00913. URL https://openaccess.thecvf.com/content/CVPR2024/papers/Yue_MMMU_A_Massive_Multi-discipline_Multimodal_Understanding_and_Reasoning_Benchmark_for_CVPR_2024_paper.pdf.

Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, and Arman Cohan. DocMath-Eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents.
In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16103–16120, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.852. URL https://aclanthology.org/2024.acl-long.852/.

Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 5168–5191. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/108030643e640ac050e0ed5e6aace48f-Paper-Conference.pdf.

Ruiyang Zhou, Lu Chen, and Kai Yu. Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 9340–9351, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.816/.

APPENDIX

Contents

A Prompt Templates
  A.1 Within-Paper Generation Prompt
  A.2 Within-Paper Sampling Prompt
  A.3 Cross-Paper Generation Prompt
  A.4 Extractor Prompt
  A.5 Evaluation System Prompt
B Examples from Existing Datasets
  B.1 DocMath-Eval
  B.2 MMLongBench-Doc
  B.3 FinMMDocR
  B.4 LongDocURL
  B.5 SlideVQA
  B.6 DocVQA
  B.7 CharXiv
  B.8 ArXivQA
  B.9 SPIQA
  B.10 MMCR
  B.11 AAAR-1.0
C Dataset Annotation and Construction
  C.1 Data Sourcing and Quality Control
  C.2 Annotation Statistics
  C.3 Annotation Examples
    C.3.1 Case 1: Discard Directly
    C.3.2 Case 2: Modify Question
    C.3.3 Case 3: Modify Explanation
D Examples from ScholScan
  D.1 RQD (Research Question and Definitions)
  D.2 DI (Design and Identifiability)
  D.3 SG (Sampling and Generalizability)
  D.4 MO (Measurement and Operationalization)
  D.5 DHP (Data Handling and Preprocessing)
  D.6 CF (Computation and Formulae)
  D.7 IC (Inference and Conclusions)
  D.8 RCA (Referential and Citation Alignment)
  D.9 LE (Language and Expression)
E Human-Machine Consistency Evaluation
F Hyperparameter Sensitivity Analysis
G Use of LLMs

A PROMPT TEMPLATES

A.1 WITHIN-PAPER GENERATION PROMPT

Within-Paper Generation Prompt

You will receive a high-quality, already accepted scientific paper as a PDF. Working only with the PDF itself (and any appendix embedded in the same PDF), edit specific textual spans to inject one or more errors chosen only from the taxonomy below, such that the errors are hard yet clearly identifiable by a professional reviewer reading the PDF alone.

Error Type (fixed):

Research Question & Definitions
Definition: The core construct/hypothesis/variable is insufficiently or inconsistently defined (conceptual vs operational), leaving the estimand ambiguous.

Design & Identifiability
Definition: Given a clear estimand, the design violates structural identification conditions, so the effect is not identifiable even with infinite data and perfect measurement.

Sampling & Generalizability
Definition: The sampling frame/process/composition or cluster/power setup does not support valid or stable sample → population claims.

Measurement & Operationalization
Definition: Measures/manipulations lack feasibility/reliability/validity/timing, so observed variables systematically diverge from the intended construct/treatment.
Data Handling & Preprocessing
Definition: Pipeline choices in missing handling, joins/keys, temporal splitting, feature construction, or partitioning introduce bias (incl. leakage or unit/scale conflicts).

Computation & Formulae
Definition: Arithmetic/algebra/notation errors (totals/ratios, unit conversion, CI vs point estimate, p-value vs label, symbol reuse, undefined variables, dimension mismatch).

Inference & Conclusions
Definition: Interpretations or causal statements exceed what methods/data support, or contradict the shown statistics/tables/captions.

Referential & Citation Alignment
Definition: Contradictions about the same quantity/term across text, tables, captions, or appendix within the paper.

Language & Expression
Definition: Terminology/capitalization/grammar ambiguities that affect meaning or domain-critical term consistency (not cosmetic typos).

Global constraints (must comply)
1. Each error must map to exactly one primary category in the taxonomy. Do not mix causes.
2. Each error must involve more than 2 micro-edits (each edit ≤ 20 English words) spread across distinct pages or paragraphs.
3. If an edit would create an immediate contradiction in the same sentence/paragraph/caption, you may add shadow patch(es) for the same error to keep the text natural (still counted as edit locations).
4. Independence across errors (per-copy generation). Generate each error on a separate copy of the original PDF. Different errors must be logically and operationally independent. No progression or variant relations: an error must not be a stricter/looser version, superset/subset, or minor wording variant of another error. No anchor reuse: do not target the same sentence/caption/table cell or reuse the same old_str (or a near-duplicate paraphrase) across different errors.
Applying any single error in isolation to the original PDF must still yield a detectable, clearly categorizable error according to the taxonomy.
5. Every error must be supportable using text inside the PDF. Do not rely on external supplementary files or prior knowledge.
6. Design errors that are as difficult as possible while remaining clean. Prefer edits that force cross-checking between two spots (e.g., Methods vs Results). Avoid trivialities. Edits must remain locally plausible and must not advertise themselves via obviously artificial phrases (e.g., avoid contrived tokens added purely to be detectable).
7. "No cosmetic issues" applies except for I (Language & Expression). For I, edits must affect meaning or domain-critical terminology (e.g., ambiguous phrasing, inconsistent technical terms). Pure typos, punctuation tweaks, or layout nits are not allowed.
8. Do not edit titles, author lists, bibliography entries, equation numbering, figure images, or add new figures/tables/references.
9. Frame each question as a neutral imperative that asks for a decision about a specific condition, using (but not limited to) Decide/Determine/Judge/Evaluate/Assess whether.... Do not presuppose an outcome or use suggestive intensifiers (e.g., clearly/obviously/likely/suspicious).
10. Output English-only and strictly follow the JSON schema below. Do not include any additional text outside the JSON:

[
  {
    "id": "1-based integer as string",
    "modify": [
      {
        "location": "Page number + short unique nearby quote (≤ 15 tokens).",
        "old_str": "Exact original text from the PDF (verbatim).",
        "new_str": "Edited text after your change."
      }
      /* Add 1-2 more locations; each location ≤ 20 words changed. Shadow patches for local coherence count as locations.
      */
    ],
    "question": "One neutral audit-style task (1-25 words).",
    "explanation": "Explain in 2-4 sentences why a reviewer can detect this error from the edited PDF alone.",
    "Type": "Name the primary category (e.g., Inference & Conclusions)."
  }
  /* More Errors */
]

A.2 WITHIN-PAPER SAMPLING PROMPT

Within-Paper Sampling Prompt

You will receive a paper PDF and the weaknesses mentioned in its peer-review comments. Your task is, based only on the content of that PDF, to sample from the review comments and verify possible errors related to the categories below, and for each confirmed or highly plausible error, generate one question and one explanation.

Error Type (fixed):

Research Question & Definitions
Definition: The core construct/hypothesis/variable is insufficiently or inconsistently defined (conceptual vs operational), leaving the estimand ambiguous.

Design & Identifiability
Definition: Given a clear estimand, the design violates structural identification conditions, so the effect is not identifiable even with infinite data and perfect measurement.

Sampling & Generalizability
Definition: The sampling frame/process/composition or cluster/power setup does not support valid or stable sample → population claims.

Measurement & Operationalization
Definition: Measures/manipulations lack feasibility/reliability/validity/timing, so observed variables systematically diverge from the intended construct/treatment.

Data Handling & Preprocessing
Definition: Pipeline choices in missing handling, joins/keys, temporal splitting, feature construction, or partitioning introduce bias (incl. leakage or unit/scale conflicts).

Computation & Formulae
Definition: Arithmetic/algebra/notation errors (totals/ratios, unit conversion, CI vs point estimate, p-value vs label, symbol reuse, undefined variables, dimension mismatch).
Inference & Conclusions
Definition: Interpretations or causal statements exceed what methods/data support, or contradict the shown statistics/tables/captions.

Referential & Citation Alignment
Definition: Contradictions about the same quantity/term across text, tables, captions, or appendix within the paper.

Language & Expression
Definition: Terminology/capitalization/grammar ambiguities that affect meaning or domain-critical term consistency (not cosmetic typos).

Global constraints (must comply)
1. Output only the specified categories; even if other error types appear in the reviews, do not output them.
2. Sample first, then verify: extract candidates from the review comments, then confirm them in the PDF. If you cannot locate supporting anchors in the PDF (page number plus phrase/label), do not output that candidate.
3. Questions must be neutral and non-leading: use an "audit task + decision" style, avoiding yes/no bias.
4. Independence: each question must target a different figure or different textual anchor; no minor variants of the same issue.
5. Evidence first: the explanation must cite locatable anchors in the PDF (page number + original phrase/caption). You may mention a key short phrase from the review as a clue, but write the question and explanation in your own words.
6. Language & format: both question and explanation must be in English; output JSON only, with no extra text.
7. Quantity: sort by evidence strength and output up to 5 items; if none qualify, output an empty array [].

Output JSON schema:

[
  {
    "id": "1",
    "question": "Audit y-axis baselines and possible axis breaks in Figure 2; decide presence/absence and cite evidence.",
    "explanation": "The review flags possible exaggeration in Fig. 2. In the PDF (p.6, caption 'Performance vs baseline'), the y-axis starts at 0.85 with a break, magnifying small differences; panels use different ranges.",
    "Type": "Visualization & Presentation Bias"
  }
]

A.3 CROSS-PAPER GENERATION PROMPT

Cross-Paper Generation Prompt

You will receive two PDFs: a "focus paper" P and a cited "evidence paper" S (exactly one pair per run). Edit only P with textual changes so that P's statements about S become incompatible with or unsupported by S along one or more dimensions (direction, scope, condition, metric, unit, version, protocol). Do not modify S. Each error must be detectable by a professional reviewer who reads both PDFs.

Error Types (fixed):

Research Question & Definitions (A)
Definition: The core construct/hypothesis/variable is insufficiently or inconsistently defined (conceptual vs. operational), leaving the estimand ambiguous.

Design & Identifiability (B)
Definition: Given a clear estimand, the design violates structural identification conditions, so the effect is not identifiable even with infinite data and perfect measurement.

Sampling & Generalizability (C)
Definition: The sampling frame/process/composition or cluster/power setup does not support valid or stable sample-population claims.

Measurement & Operationalization (D)
Definition: Measures/manipulations lack feasibility/reliability/validity/timing, so observed variables systematically diverge from the intended construct/treatment.

Data Handling & Preprocessing (E)
Definition: Pipeline choices in missing-data handling, joins/keys, temporal splitting, feature construction, or partitioning introduce bias (incl. leakage or unit/scale conflicts).

Computation & Formulae (F)
Definition: Arithmetic/algebra/notation errors (totals/ratios, unit conversion, CI vs. point estimate, p-value vs. label, symbol reuse, undefined variables, dimension mismatch).

Inference & Conclusions (G)
Definition: Interpretations or causal statements exceed what methods/data support, or contradict the shown statistics/tables/captions.
Referential & Citation Alignment (H)
Definition: Contradictions about the same quantity/term across text, tables, captions, or appendix within the paper.

Language & Expression (I)
Definition: Terminology/capitalization/grammar ambiguities that affect meaning or domain-critical term consistency (not cosmetic typos).

Global constraints (must comply)
1. Edit P only, never S. Do not add/remove references, figures, or external links.
2. Micro-edits, not rewrites.
3. Cross-paper verifiability. For every error, your explanation must quote >=1 anchor from P and >=1 anchor from S (page number + a short verbatim span, 10-20 words) that, together, demonstrate the conflict or lack of support.
4. Local coherence. The edited text in P must read naturally in context; the contradiction should emerge only when P is compared with S.
5. Independence across errors. You may output multiple errors, but generate each on an independent copy of P. Errors must not reuse the same P anchor or be minor variants (no strengthen/weaken/same-sentence paraphrases).
6. PDF-only evidence. Rely solely on content present in the two PDFs. No outside knowledge or materials.
7. Scope exclusions. Do not edit titles, author lists, bibliography entries, equation numbering, LaTeX commands, or figure pixels; do not add new figures/tables/references.
8. Type must be a single letter in {A,B,C,D,E,F,G,H,I}.
9. Inter-paper note. "Inconsistency" describes the evidence relation (P vs. S). The Type must reflect the primary upstream cause (e.g., B/C/D/F/G/I/A).
Quantity & ordering. Sort by evidence strength and output up to 5 errors; if none qualify, output an empty array [].
Language & format. Output English-only JSON exactly in the schema below; include no extra text.
10. Multiple mentions of S. The evidence paper S may be cited multiple times in P.
You may (i) introduce a single error spanning >=2 of those citation points, or (ii) generate multiple independent errors, each based on a different citation of S.

Output JSON schema:

[
  {
    "id": "1-based string",
    "modify": [
      {
        "location": "P: page X + short nearby quote (<=15 tokens).",
        "old_str": "Exact original text from P (verbatim).",
        "new_str": "Edited text in P after your change."
      }
      /* could add 1-2 more P locations */
    ],
    "question": "One neutral audit-style task (1-25 words) for checking P against S.",
    "explanation": "Diagnostic voice: state that the edited P now contains an error of Type, and substantiate it with anchors from both PDFs. Include page numbers and short verbatim quotes from P and S showing the mismatch (e.g., scope/condition/metric/unit/version/protocol). Do not describe how to create the error; assert and evidence the error as present.",
    "Type": "A|B|C|D|E|F|G|H|I"
  }
]

A.4 EXTRACTOR PROMPT

Extractor Prompt

You will receive three inputs: Q: the open-ended question; E: the gold explanation (describes exactly one error; extra details still belong to the same single error); A: the model's answer to be evaluated. Your job is to extract counts only and output a single JSON object with the exact schema below. Do not compute any scores. Do not add fields.

Core selection rule (multiple errors in A)
1. Parse E into a single gold error (the "target error").
2. From A, identify how many distinct error claims are made. Cluster together mentions that support the same error (multiple locations for one error are still one error).
3. Existence decision (binary correctness only): Let the gold existence be 1 if E asserts an error exists, else 0. Let the predicted existence be 1 if A asserts any error, else 0 (e.g., states no error).
Set existance = 1 if predicted existence equals gold existence; otherwise set existance = 0.
4. If existance = 0: set contains_target_error = 0; set all location and reasoning counts to 0; and set unrelated_errors to the total number of distinct error claims in A. Then output the JSON.
5. If existance = 1:
If the gold existence is 1: determine whether A contains the target error (match by the main error idea in E: category/intent/scope; treat E's subpoints as the same error). If yes, set contains_target_error = 1 and compute location and reasoning only for the target error. Count all other error claims in A as unrelated_errors. If no, set contains_target_error = 0; set all location and reasoning counts to 0; set unrelated_errors to the total number of distinct error claims in A.
If the gold existence is 0: set contains_target_error = 0; set all location and reasoning counts to 0; set unrelated_errors to the total number of distinct error claims in A. (These negative items are for binary accuracy only; they are not used for detailed scoring.)

Matching guidance (A error ↔ target error): match by the main error idea in E (category/intent/scope), not by wording. Treat E's subpoints as part of the same single error. Prefer the best-matching cluster in A; if ties, choose the one with stronger alignment to E's core claim.

Counting rules

Location (for the target error only, when existance = 1 and contains_target_error = 1):
gold_steps: number of unique error locations described in E (after normalization and deduplication).
hit_steps: number of predicted locations in A that match any gold location for the target error.
extra_steps: number of predicted locations in A for the target error that do not match any gold location.

Reasoning (for the target error only, when existance = 1 and contains_target_error = 1):
Convert E into a canonical set or ordered chain of reasoning steps for the target error.
gold_steps: total number of such steps.
reached_steps: for single-chain tasks, the length of the longest valid prefix of A along the gold chain; for multi-path/parallel tasks, the size of the intersection between A's steps and the gold step set (or the maximum across gold paths if multiple are defined).
missing_steps: gold_steps - reached_steps (a non-negative integer).

Unrelated errors:
unrelated_errors: number of distinct error claims in A that are not the target error (0 if none).

Output JSON schema (return exactly this JSON; integers only):

{
  "existance": 0,
  "contains_target_error": 0,
  "location": { "gold_steps": 0, "hit_steps": 0, "extra_steps": 0 },
  "reasoning": { "gold_steps": 0, "reached_steps": 0, "missing_steps": 0 },
  "unrelated_errors": 0
}

A.5 EVALUATION SYSTEM PROMPT

Evaluation System Prompt

You are a neutral, careful academic reviewer. You will receive an open-ended question and the paper content. The paper may or may not have issues related to the question. Do not assume there are errors. If the question is about citations, you will be given a citing paper and a cited paper; evaluate only the citing paper for possible issues and use the cited paper only as the reference for comparison. Write in natural prose with no fixed template.

Rules:
Speak only when sure. State an error only if you are confident it is a real error (not a mere weakness).
Stay on scope. Discuss only what the question asks about.
Evidence completeness. For every error you state, list all distinct evidence cues you are confident about from the PDF. Include plain identifiers (figure/table/section/equation/citation) or quotes. Avoid redundant repeats of the exact same instance; include all distinct locations needed to support the error.
Be clear and brief. Use short, direct sentences. No metaphors. No fancy wording. No guesses or outside sources. Do not invent figures, tables, equations, citations, or results.
Report as many distinct, well-supported errors as you can within scope. If none are clear, write exactly: "No clear issue relevant to the question." and nothing else.

B EXAMPLES FROM EXISTING DATASETS

B.1 DOCMATH-EVAL

DocMath-Eval
Question ID: complong-testmini-30
Question: What is the percentage of total offering cost on the total amount raised in the IPO if the total offering cost is $14,528,328 and each unit sold is $10?
Context Modalities: Text + Text Document
1. Offering costs consist of legal, accounting and other costs incurred through the balance sheet date that are directly related to the Initial Public Offering. Offering costs amounting to $14,528,328 were charged to shareholders' equity upon the completion of the Initial Public Offering.
2. Pursuant to the Initial Public Offering on July 20, 2020, the Company sold 25,300,000 Units, which includes the full exercise by the underwriter of its option to purchase an additional 3,300,000 Units, at a purchase price of $10.00 per Unit. Each Unit consists of one Class A ordinary share and one-half of one redeemable warrant ("Public Warrant"). Each whole Public Warrant entitles the holder to purchase one Class A ordinary share at an exercise price of $11.50 per whole share (see Note 7).
Covered Areas: Only Mathematics
Cross-Evidence Reasoning: Limited
Task Paradigm: Search-Oriented

B.2 MMLONGBENCH-DOC

MMLongBench-Doc
Document ID: afe620b9beac86c1027b96d31d396407.pdf
Question: How much higher was the proposed dividend paid (Rupees in lacs) in 2002 compared to 2001?
Context Modalities: Text + Multimodal Document

Unclaimed Dividend
Unclaimed dividend for the years prior to and including the financial year 1998-99 has been transferred to the General Revenue Account of the Central Government / the Investor Education and Protection Fund established by the Central Government (IEPF), as applicable. Shareholders who have not encashed their dividend warrants relating to financial year(s) up to and including 1993-94 may claim such dividend (transferred to the General Revenue Account) from the Registrar of Companies, West Bengal, Government of India, Nizam Palace, II MSO Building, 2nd Floor, 234/4 A.J.C. Bose Road, Kolkata 700 020, in the prescribed form. This form can be furnished by the Investor Service Centre of the Company (ISC) on request or can be downloaded from the Company's corporate website www.itcportal.com under the section 'Investor Relations'. The dividend for the undernoted years, if unclaimed for 7 years, will be transferred by the Company to IEPF in accordance with the schedule given below. Attention is drawn that the unclaimed dividend for the financial year 1999-2000 will be due for transfer to IEPF later this year. Communication has been sent by the Company to the concerned Shareholders advising them to lodge their claims with respect to unclaimed dividend. Once unclaimed dividend is transferred to IEPF, no claim shall lie in respect thereof.
ITC Limited
* It will not be possible to entertain claims received by ISC after 9th October, 2007.
Bank Details
Shareholders holding Shares in the physical form are requested to notify / send the following to ISC to facilitate better servicing:
i) any change in their address / mandate / bank details, and
ii) particulars of the bank account in which they wish their dividend to be credited, in case the same have not been furnished earlier.
Shareholders are advised that respective bank details and addresses as furnished by them or by NSDL / CDSL to the Company, for Shares held in the physical form and in the dematerialised form respectively, will be printed on dividend warrants as a measure of protection against fraudulent encashment.

Financial Year | Dividend Identification No. | Date of Declaration of Dividend | Total Dividend (Rs.) | Unclaimed Dividend as on 31/03/2007 (Rs.) | % | Due for transfer to IEPF on
1999-00 | 70th | 28th July, 2000 | 1,84,06,11,780.00 | 1,26,32,087.00 | 0.69 | 15th September, 2007*
2000-01 | 71st | 3rd August, 2001 | 2,45,41,49,040.00 | 2,06,42,133.00 | 0.84 | 9th September, 2008
2001-02 | 72nd | 26th July, 2002 | 3,34,14,27,743.00 | 2,56,63,749.00 | 0.77 | 31st August, 2009
2002-03 | 73rd | 25th July, 2003 | 3,71,26,78,290.00 | 2,38,48,718.00 | 0.64 | 30th August, 2010
2003-04 | 74th | 30th July, 2004 | 4,95,35,77,020.00 | 3,35,88,620.00 | 0.68 | 4th September, 2011
2004-05 | 75th | 29th July, 2005 | 7,73,24,56,356.00 | 5,07,52,301.00 | 0.66 | 3rd September, 2012
2005-06 | 76th | 21st July, 2006 | 9,95,12,91,267.00 | 7,38,87,332.00 | 0.74 | 26th August, 2013

Erstwhile ITC Hotels Limited
Financial Year | Date of Declaration of Dividend | Total Dividend (Rs.) | Unclaimed Dividend as on 31/03/2007 (Rs.) | % | Due for transfer to IEPF on
1999-00 | 25th August, 2000 | 3,02,16,492.00 | 3,19,648.00 | 1.06 | 10th October, 2007*
2000-01 | 17th August, 2001 | 3,02,16,492.00 | 3,04,552.00 | 1.01 | 20th September, 2008
2003-04 | 14th July, 2004 | 6,04,32,984.00 | 6,99,704.00 | 1.16 | 18th August, 2011
* It will not be possible to entertain claims received by ISC after 14th September, 2007.

Covered Areas: Limited (7 areas)
Cross-Evidence Reasoning: Limited
Task Paradigm: Search-Oriented

B.3 FINMMDOCR

FinMMDocR
Question ID: test-41
Question: Based on the detailed financial forecast tables provided at the end of the report, analyze the company's working capital management efficiency related to its customer collections for the fiscal year 2025. Using the year-end balances of (Accounts Receivable) for 2024 and 2025 to calculate the average balance for 2025, and the corresponding (Total Operating Revenue) for 2025, calculate the implied average collection period for receivables during 2025 (assume 365 days in a year, round to one decimal place, unit: days).
Context Modalities: Text + Multimodal Document
Covered Areas: Limited (Finance Only)
Cross-Evidence Reasoning: Full
Task Paradigm: Search-Oriented

B.4 LONGDOCURL

LongDocURL
Question ID: free gemini15 pro 4061601 47 71 8
Question: What was the total fair value of options that vested in 2016, 2015, and 2014, in millions of Canadian dollars?
Context Modalities: Text + Multimodal Document

year ended December 31, 2016 (millions of Canadian $) | Before Tax Amount | Income Tax Recovery/(Expense) | Net of Tax Amount
Foreign currency translation gains on net investment in foreign operations | 3 | — | 3
Change in fair value of net investment hedges | (14) | 4 | (10)
Change in fair value of cash flow hedges | 44 | (14) | 30
Reclassification to net income of gains and losses on cash flow hedges | 71 | (29) | 42
Unrealized actuarial gains and losses on pension and other post-retirement benefit plans | (38) | 12 | (26)
Reclassification to net income of actuarial loss on pension and other post-retirement benefit plans | 22 | (6) | 16
Other comprehensive loss on equity investments | (117) | 30 | (87)
Other Comprehensive Loss | (29) | (3) | (32)

year ended December 31, 2015 (millions of Canadian $) | Before Tax Amount | Income Tax Recovery/(Expense) | Net of Tax Amount
Foreign currency translation gains on net investment in foreign operations | 798 | 15 | 813
Change in fair value of net investment hedges | (505) | 133 | (372)
Change in fair value of cash flow hedges | (92) | 35 | (57)
Reclassification to net income of gains and losses on cash flow hedges | 144 | (56) | 88
Unrealized actuarial gains and losses on pension and other post-retirement benefit plans | 74 | (23) | 51
Reclassification to net income of actuarial loss and prior service costs on pension and other post-retirement benefit plans | 41 | (9) | 32
Other comprehensive income on equity investments | 62 | (15) | 47
Other Comprehensive Income | 522 | 80 | 602

As at December 31, 2016, the aggregate intrinsic value of the total options exercisable was $86 million and the total intrinsic value of options outstanding was $130 million.
21. PREFERRED SHARES
In March 2014, TCPL redeemed all of the 4 million outstanding Series Y preferred shares at a redemption price of $50 per share for a gross payment of $200 million.
22. OTHER COMPREHENSIVE (LOSS)/INCOME AND ACCUMULATED OTHER COMPREHENSIVE LOSS
Components of Other comprehensive (loss)/income, including the portion attributable to non-controlling interests and related tax effects, are as follows:

year ended December 31 (millions of Canadian $, unless otherwise noted) | 2016 | 2015 | 2014
Total intrinsic value of options exercised | 31 | 10 | 21
Fair value of options that have vested | 126 | 91 | 95
Total options vested | 2.1 million | 2.0 million | 1.7 million

The following table summarizes additional stock option information:

Covered Areas: Limited
Cross-Evidence Reasoning: Limited
Task Paradigm: Search-Oriented

B.5 SLIDEVQA

SlideVQA
Question ID: 1
Question: How much difference in INR is there between the average order value of CY2013 and that of CY2012?
Context Modalities: Text + Multimodal Document
Covered Areas: Limited
Cross-Evidence Reasoning: None
Task Paradigm: Search-Oriented

B.6 DOCVQA

DocVQA
Question ID: 24581
Question: What is name of university?
Context Modalities: Text + Multimodal Document
Covered Areas: Limited
Cross-Evidence Reasoning: None
Task Paradigm: Search-Oriented

B.7 CHARXIV

CharXiv
Question ID: 2004.10956
Question: Which model shows a greater decline in accuracy from Session 1 to Session 9 in the 5-way full-shot scenario?
Context Modalities: Text + Image
Covered Areas: Limited (8 areas)
Cross-Evidence Reasoning: None
Task Paradigm: Search-Oriented

B.8 ARXIVQA

ArXivQA
Question ID: physics-8049
Question: Based on the top-right graph, how would you describe the behavior of P(z) as z approaches zero?
Context Modalities: Text + Image
Covered Areas: Limited
Cross-Evidence Reasoning: None
Task Paradigm: Search-Oriented

B.9 SPIQA

SPIQA
Question ID: 1611.04684v1
Question: What is the role of the knowledge gates in the KEHNN architecture?
Context Modalities: Text + Image
Covered Areas: Limited
Cross-Evidence Reasoning: None
Task Paradigm: Search-Oriented

B.10 MMCR

MMCR
Question ID: 1
Question: Which module's weights are frozen?
Context Modalities: Text + Multimodal Document
Covered Areas: Limited
Cross-Evidence Reasoning: Limited
Task Paradigm: Search-Oriented

B.11 AAAR-1.0

AAAR-1.0
Question ID: 1902.00751
Question: What experiments do you suggest doing?
Why do you suggest these experiments?
Context Modalities: Text + Multimodal Document
Covered Areas: Limited
Cross-Evidence Reasoning: Full
Task Paradigm: Search-Oriented

C DATASET ANNOTATION AND CONSTRUCTION

C.1 DATA SOURCING AND QUALITY CONTROL

The defective academic papers in our dataset are curated from three primary sources:
• We synthetically injected nine types of errors into papers accepted at ICLR and Nature Communications.
• For papers rejected by ICLR, we identified the shortcomings based on reviewers' comments and categorized them into the same nine error types.
• For accepted ICLR papers, we generated consistency-related errors by cross-referencing their content against the cited literature.

To ensure the quality of each error, all entries underwent a rigorous multistage validation protocol executed by human annotators. For synthetically generated errors, annotators manually embedded them into the source papers following this protocol:
• Credibility Validation. Each error must be logically sound and verifiable. For generated errors, annotators first confirm their logical coherence and unambiguity. Flawed error descriptions are revised whenever possible; only irreparable cases are discarded.
• Evidence Verification. All evidence substantiating an error must be either directly traceable to the source document or grounded in established domain-specific knowledge. Annotators are required to meticulously verify the origin and accuracy of all supporting data and background information.
• Category Classification. Each error must be accurately classified into one of the nine predefined categories according to their formal definitions. Annotators verify the correctness of the assigned category and reclassify it if necessary.
• Manuscript Editing.
Upon successful validation, annotators embedded the generated error into the original manuscript by adding, deleting, or modifying relevant text segments as dictated by the error's specification.

This unified and standardized annotation protocol enables the creation of a high-quality dataset of academic papers with curated errors, providing a robust benchmark for evaluating the document scanning and error detection capabilities of MLLMs.

C.2 ANNOTATION STATISTICS

Initially, we generated or sampled a pool of 3,500 academic paper instances containing potential errors. During the manual annotation phase, following the protocol described above, we discarded 1,700 instances to ensure the logical rigor of the errors, the accuracy of the evidence, and a balanced distribution of categories. Of the remaining 1,800 instances, 1,541 (85.6%) underwent manual revision. The distribution of these modifications is as follows:
• 535 questions were rewritten to eliminate ambiguity or to increase their retrieval and reasoning difficulty.
• 1,207 explanations were revised to correct erroneous evidence references and resolve logical flaws.
• 1,141 instances underwent category reclassification or manual paper editing. This process served to fix classifications that were inconsistent with our definitions and, for generated errors, to manually inject them into the source papers to create the flawed documents.

C.3 ANNOTATION EXAMPLES

C.3.1 CASE 1: DISCARD DIRECTLY

Example
Question: Assess whether the conclusions drawn about the protein's functional state and therapeutic applicability are supported by the presented methods and results.
Explanation: Edits in the abstract and discussion claim the paper presents an active-state structure that reveals the activation mechanism and provides a roadmap for drug design.
This overstates the findings, as the paper repeatedly describes solving the 'apo' (unbound) structure and explicitly states the activating ligand is unknown (p.6). To make the error subtle, a contradictory sentence was added to the methods (p.9) claiming a stabilizing agonist was used, but this is falsified by the numerous, unmodified mentions of the 'apo GPR179' structure throughout the results and methods.
Error Type: IC (Inference & Conclusions)
Decision: Discard
Analysis: Based on the modifications, the revised abstract and conclusion claim that the paper elucidates the protein's 'active-state' structure and provides a roadmap for drug design. However, the original text repeatedly states (e.g., on pages 5 and 9) that it is the 'apo' (inactive) structure that was resolved, and critically notes on page 6 that the 'activating ligand is still unknown'. This constitutes a clear RCA-type error, defined by the inconsistent description of a concept within the article. Yet, the large model misclassifies this as an IC-type (Inference & Conclusions) error, which is a significant mistake. Considering that the inconsistency regarding the 'active-state' description is overly superficial and obvious, a type of error almost never encountered in actual academic literature, it lacks practical value. Even reclassifying it as an H-type question would be of little significance. Therefore, we have decided to delete this instance.

C.3.2 CASE 2: MODIFY QUESTION

Example
Question: Judge whether the mathematical description of the scTab model architecture presents any dimensional mismatches.
Explanation: The Methods section on page 11 (and the Figure 1b caption) now states that the model's feature transformer creates an embedding of dimension n_d. However, the same paragraph then describes splitting this embedding into two components of dimension n_d and n_a, respectively.
This is algebraically impossible for any non-zero n_a, and the hyperparameter table on page 12 confirms that n_a is set to 64, creating a fundamental dimensional mismatch in the model's description.
Error Type: CF (Computation & Formulae)
Before:
Question: Judge whether the mathematical description of the scTab model architecture presents any dimensional mismatches.
Decision: Modify
After:
Question: Assess the Methods section for Computation & Formulae issues.
Analysis: Based on the error information and the text, the modified model description states that a vector of dimension n_d is split into two parts: one of dimension n_d and another of dimension n_a. This is algebraically impossible, as the total dimension (n_d) cannot equal the dimension of one of its parts (n_d) plus another non-zero part (n_a is set to 64). This constitutes a clear dimensional mismatch, rendering the model's architectural description logically invalid. The original question was overly specific, as it explicitly prompted an assessment of whether the mathematical description of the scTab model architecture contained 'any dimensional mismatches'. This hint was too detailed, reducing the analytical difficulty for the model. To increase the difficulty, we have revised the question's phrasing to ask only whether the mathematical description of the scTab model architecture presents any problems.

C.3.3 CASE 3: MODIFY EXPLANATION

Example
Question: Evaluate if the composition of the SNAPcmini construct is consistently defined throughout the paper.
Explanation: The results on page 4 state that the assembled SNAPcmini construct includes the SNAPC2 subunit. However, the methods on page 12 describe the construction of SNAPcmini using only SNAPC4, SNAPC3, and SNAPC1, with SNAPC2 explicitly removed from the cloning description.
A third conflicting statement on page 6 implies SNAPC2 was expected to be part of the minimal core, creating conceptual and operational inconsistency regarding this key experimental complex.
Error Type: RQD (Research Question & Definitions)
Before:
Explanation: ...with SNAPC2 explicitly removed from the cloning description. A third conflicting statement on page 6...
Decision: Modify
After:
Explanation: ...with SNAPC2 explicitly removed from the cloning description.
Analysis: This instance targets an inconsistency in the operational definition of the SNAPcmini construct. Specifically, the results section states that the assembled SNAPcmini complex includes the SNAPC2 subunit, while the methods section explicitly describes the construction of SNAPcmini using only SNAPC4, SNAPC3, and SNAPC1, with SNAPC2 removed from the cloning procedure. The original explanation additionally referenced a speculative statement regarding SNAPC2's expected presence in the minimal core, which introduced unnecessary ambiguity and reduced the clarity of the definition-level inconsistency. By removing this auxiliary statement, the modified instance focuses on a clear and realistic mismatch between experimental description and implementation, which is representative of RQD-type errors commonly encountered in academic writing.

D EXAMPLES FROM SCHOLSCAN

D.1 RQD (RESEARCH QUESTION AND DEFINITIONS)

Example: 1001
Question: Assess the Methods section for Research Question and Definitions issues.
Explanation: The definition of a 'promoter region', a key analytical construct, is inconsistent across the paper, making the estimand ambiguous. The RNA-seq methods (page 12) define it as +/-1kb from the TSS, the ATAC-seq analysis methods (page 11) define it as +/-500bp from the TSS, and the Results section (page 4) defines it as +/-2kb from the TSS.
These three conflicting operational definitions mean that analyses involving 'promoters' are not comparable and the construct is insufficiently defined.
Error Type: RQD (Research Question & Definitions)
Type: Within-Generate

D.2 DI (DESIGN AND IDENTIFIABILITY)

Example: 1006
Question: Assess the Experiments section for Design & Identifiability issues.
Explanation: The paper's core argument is that it identifies a specific 'dynamic coupling' pathway as essential for RTP, distinct from a 'static coupling' pathway. The edits state that the key experiment (excitation-phosphorescence mapping) cannot distinguish between these two pathways, as the final phosphorescence shows spectral signatures originating from both. This introduces a structural identification problem: with two potential causal pathways leading to the same outcome and no way to isolate their effects, the claim that the dynamic pathway is the definitive mechanism is not identifiable from the data presented.
Error Type: DI (Design & Identifiability)
Type: Within-Generate

D.3 SG (SAMPLING AND GENERALIZABILITY)

Example: 1014
Question: Assess the Methods section for Sampling & Generalizability issues.
Explanation: The Methods section is edited to state that the experiments used a "specific substrain of diabetic" mice, a highly specialized sample. However, the Abstract and Discussion make broad, unsupported claims of generalizability to "all patients" and the "general patient population." This constitutes an invalid sample-to-population inference.
Error Type: SG (Sampling & Generalizability)
Type: Within-Generate

D.4 MO (MEASUREMENT AND OPERATIONALIZATION)

Example: 1015
Question: Assess the Figures/Tables section for Measurement & Operationalization issues.
Explanation: The figure captions on pages 3 and 4 have been edited to specify that the background for Signal-to-Background Ratio (SBR) calculations was defined as the single minimum pixel intensity in the image. This is not a valid or reliable operationalization of the "background" construct, as it is highly susceptible to single-point noise or detector artifacts. This flawed measurement procedure systematically undermines all conclusions based on the SBR metric.
Error Type: MO (Measurement & Operationalization)
Type: Within-Generate

D.5 DHP (DATA HANDLING AND PREPROCESSING)

Example: 528
Question: Assess the Methods section for Data Handling & Preprocessing issues.
Explanation: The reviewer correctly identifies that the authors tuned hyperparameters on the test set. The paper's "Implementation Details" section on page 5 states: "For hyperparameter tuning, we employed Bayesian optimization with the wandb sweep tool (Biewald, 2020), aiming to minimize MPJPE for the S9 and S11 in the H36M dataset and PA-MPJPE for the S8 in the H3WB dataset, following the convention of prior works." According to standard protocols for the H36M dataset, subjects S9 and S11 constitute the test set. Tuning hyperparameters directly on the test set introduces data leakage, leading to an optimistic bias in the reported results and invalidating claims of generalization. This is a critical violation of machine learning best practices and fits the Data Handling & Preprocessing (E) category, as a pipeline choice introduces bias.
Error Type: DHP (Data Handling & Preprocessing)
Type: Within-Sample

D.6 CF (COMPUTATION AND FORMULAE)

Example: 1909
Question: Scan the Methods section for Computation & Formulae errors.
Explanation: Algorithm 1 on page 5 uses the parameter T' in the loop definition on line 5: for t = 1 to T' do. This parameter determines the number of iterations for the Randomized Global Initialization phase. However, the value of T' is never specified anywhere in the paper, including the "Hyper-parameters" section (Section 4.1 on page 7). An algorithm cannot be implemented or reproduced with an undefined critical parameter. This fits the Computation & Formulae category as an "undefined variable".
Error Type: CF (Computation & Formulae)
Type: Within-Sample

D.7 IC (INFERENCE AND CONCLUSIONS)

Example: 875
Question: Evaluate the Abstract, Introduction, and Experiments sections for issues in Inference & Conclusions.
Explanation: The paper's claims of generality are not supported by its evidence. The title and abstract introduce "FedGraph: A New Paradigm for Federated Graph Learning" (page 1), suggesting a broadly applicable framework. However, the methodology is heavily tailored to, and the experiments are exclusively focused on, the single downstream task of anomaly detection. For example, a stated contribution is "Broad application," but this is immediately qualified with "the models are successfully transferred to FEDGRAPH framework in anomaly detection tasks" (page 2). Furthermore, Section 5, "EXPERIMENTS", exclusively reports results on anomaly detection tasks. This discrepancy represents an issue of Inference & Conclusions, as the broad conclusion of having created a new "paradigm" for FGL is an overstatement that exceeds what the narrow experimental results can support.
Error Type: IC (Inference & Conclusions)
Type: Within-Sample

D.8 RCA (REFERENTIAL AND CITATION ALIGNMENT)

Example: 0
Question: Scan the errors in the cited reference Chen et al. (2021).
Explanation: The edited P contains a Type H error by misrepresenting the performance of the cited model. P (p. 8) claims that the NSTPP model from Chen et al. (2021) 'reported performance comparable to a standard Hawkes process baseline'. This contradicts the results in S, where the proposed models (i.e., NSTPP) consistently outperform the Hawkes process baseline, often by a large margin. For example, S (p. 9, Table 1) shows on the BOLD5000 dataset that the 'Attentive CNF' model achieves a temporal log-likelihood of 5.842 ± 0.005, which is substantially better than the Hawkes Process at 2.860 ± 0.050.
Error Type: RCA (Referential & Citation Alignment)
Type: Cross-Generate

D.9 LE (LANGUAGE AND EXPRESSION)

Example: 1785
Question: Assess the Methods section for Language & Expression issues.
Explanation: The paper introduces a key contribution, the 'load-compute-store-finish' template, and its acronym 'LCSF'. This error introduces inconsistencies in this critical term: it is defined as 'LCS-F' on page 6, called 'LCFS' in a figure title on page 7, and written out in full in the conclusion on page 10, while the original 'LCSF' acronym remains elsewhere. This terminological inconsistency for a central, paper-defined concept creates ambiguity and undermines the paper's precision.
Error Type: LE (Language & Expression)
Type: Within-Generate

E HUMAN-MACHINE CONSISTENCY EVALUATION

To evaluate whether GPT-4.1 accurately extracts detailed information from model responses, we conducted a human-machine consistency evaluation. We first randomly sampled 200 questions from the dataset.
Then, we invited human experts to analyze the corresponding model-generated responses for these questions and to manually extract key information, including evidence sets, reasoning chains, and the number of unrelated errors. The results are presented in Table 4.

Table 4: Spearman's correlation coefficients among S, S_location, S_reasoning, and P_unrelated_err.

                          S      S_location  S_reasoning  P_unrelated_err
Correlation coefficients  0.841  0.806       0.842        0.954

In summary, GPT-4.1 can extract relevant evidence and reasoning steps with considerable accuracy, leading to precise evaluation scores.

In addition, we substituted GPT-4.1 with Qwen3-32B and Gemini 2.5 Flash to independently re-evaluate the same 200 samples (Tables 5 and 6). The results further confirm that our evaluation framework is not dependent on any particular LLM and exhibits strong robustness.

Table 5: Model performance under Qwen3-32B evaluation (scaled by 100).

Model                     Avg.  RQD   DI    SG    MO    DHP   CF    IC    RCA   LE
MLLM (Image Input)
GPT-5                     21.2  12.6  11.8  33.5  15.4  26.6  16.0  27.1  27.0   6.4
Gemini 2.5 Pro            19.7  15.0  23.2  44.7  13.2  31.4   7.8  17.3  17.8  12.8
Doubao-Seed-1.6           12.0   4.8   5.9  36.2   7.5  13.3   5.7  19.9   9.9   5.5
Doubao-Seed-1.6-thinking  11.6   5.1   6.9  28.0   7.7  14.3  10.5  15.6  10.8   4.5
Grok 4                     4.2   0.0   1.3  20.5   2.6   5.5   1.2   3.9   2.6   0.8
OCR + LLM (Text Input)
Gemini 2.5 Pro            35.0  26.6  39.6  53.9  31.4  56.2  15.9  35.6  39.0  10.0
GPT-5                     25.3  19.0  28.3  28.3  24.1  38.8  10.3  32.2  31.8   3.1
Grok 4                    22.6  10.7   8.9  40.6  14.1  31.3  11.1  22.6  33.6   7.3
Doubao-Seed-1.6-thinking  17.4  11.0  14.9  31.9   9.5  26.3   9.2  20.5  21.1   4.7
Doubao-Seed-1.6           15.3   6.4   9.8  31.7  10.8  22.4   8.5  21.9  17.8   2.4
Claude Sonnet 4            6.3   5.7   2.4  12.4   4.2   8.3   2.8   9.5   6.6   4.4

Table 6: Model performance under Gemini 2.5 Flash evaluation (scaled by 100).

Model                     Avg.  RQD   DI    SG    MO    DHP   CF    IC    RCA   LE
MLLM (Image Input)
GPT-5                     20.5  11.2  12.0  32.4  17.9  27.4  10.5  25.9  27.4   3.1
Gemini 2.5 Pro            14.6   9.1  11.3  34.2  11.4  28.2   4.9  12.6  14.8   5.3
Doubao-Seed-1.6           10.9   5.3   3.9  34.0   6.1  15.6   6.1  17.9   8.0   4.7
Doubao-Seed-1.6-thinking  10.3   3.2   4.7  26.4   7.3  14.2   9.2  14.8   9.3   3.1
Grok 4                     3.9   0.5   1.8  15.9   1.7   4.1   1.5   1.1   4.4   0.0
OCR + LLM (Text Input)
Gemini 2.5 Pro            30.1  20.0  30.6  47.8  27.5  47.8  11.7  30.2  36.6   5.9
GPT-5                     23.5  15.3  21.8  26.5  23.1  37.7   7.1  31.9  31.6   2.7
Grok 4                    20.2   9.2   7.8  36.1  11.5  32.4   8.0  20.7  30.6   5.8
Doubao-Seed-1.6-thinking  15.9   8.4  11.1  30.9  10.1  23.7   6.3  19.9  20.2   3.5
Doubao-Seed-1.6           14.0   4.8   8.2  29.1  11.4  23.9   6.4  21.2  16.0   0.8
Claude Sonnet 4            5.7   3.8   1.4  11.1   3.9   9.4   2.0   8.7   6.5   3.1

F HYPERPARAMETER SENSITIVITY ANALYSIS

We conducted a sensitivity analysis of all four hyperparameters involved in scoring. We varied each independently and re-computed the overall score S across 11 proprietary model configurations (Tables 7, 8, 9, and 10). The results demonstrate that our evaluation metric exhibits strong robustness.

Table 7: Sensitivity under image input: variations of λ and µ (scaled by 100).

Model                     λ=0.6  λ=0.8  λ=1.0  µ=0.85  µ=0.9  µ=0.95
GPT-5                      19.3   19.2   19.0    18.5   19.2    19.9
Gemini 2.5 Pro             15.8   15.6   15.3    15.0   15.6    16.1
Doubao-Seed-1.6-thinking   10.4   10.2   10.0     9.9   10.2    10.5
Doubao-Seed-1.6            10.1    9.9    9.8     9.7    9.9    10.2
Grok 4                      4.0    4.0    3.9     3.9    4.0     4.1

Table 8: Sensitivity under image input: variations of γ and q (scaled by 100).

Model                     γ=0.4  γ=0.6  γ=0.8  q=1.0  q=1.5  q=2.0
GPT-5                      19.4   19.2   19.0   19.3   19.2   19.1
Gemini 2.5 Pro             15.7   15.6   15.5   15.6   15.6   15.5
Doubao-Seed-1.6-thinking   10.4   10.2   10.0   10.3   10.2   10.1
Doubao-Seed-1.6            10.1    9.9    9.8   10.0    9.9    9.9
Grok 4                      4.0    4.0    4.0    4.0    4.0    4.0

Table 9: Sensitivity under text input: variations of λ and µ (scaled by 100).

Model                     λ=0.6  λ=0.8  λ=1.0  µ=0.85  µ=0.9  µ=0.95
Gemini 2.5 Pro             30.6   30.2   29.9    29.1   30.2    31.5
GPT-5                      22.7   22.5   22.3    21.4   22.5    23.7
Grok 4                     21.1   20.8   20.6    20.2   20.8    21.4
Doubao-Seed-1.6-thinking   15.6   15.3   15.0    14.8   15.3    15.8
Doubao-Seed-1.6            14.1   13.9   13.7    13.6   13.9    14.3
Claude Sonnet 4             6.0    5.9    5.8     5.6    5.9     6.1

Table 10: Sensitivity under text input: variations of γ and q (scaled by 100).

Model                     γ=0.4  γ=0.6  γ=0.8  q=1.0  q=1.5  q=2.0
Gemini 2.5 Pro             30.7   30.2   29.9   30.5   30.2   30.1
GPT-5                      23.4   22.5   21.9   23.0   22.5   22.2
Grok 4                     21.0   20.8   20.7   20.9   20.8   20.8
Doubao-Seed-1.6-thinking   15.6   15.3   15.1   15.5   15.3   15.2
Doubao-Seed-1.6            14.2   13.9   13.7   14.1   13.9   13.8
Claude Sonnet 4             6.3    5.9    5.6    6.1    5.9    5.7

G USE OF LLMS

LLMs were used for language editing and stylistic refinement during manuscript preparation. In addition, Gemini 2.5 Pro was used in a controlled manner to synthesize data for dataset construction. Details are provided in Section 3.2, Appendix A, and Appendix C. All research ideas, experimental design, evaluation protocols, and result analysis were conceived, implemented, and validated entirely by the authors. The use of LLMs did not influence the scientific conclusions of this paper.
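The agreement statistic reported in Appendix E (Table 4) is Spearman's rank correlation between human-extracted and automatically extracted values. As a minimal, dependency-free sketch of how such a coefficient can be computed, the following implements Spearman's rho as the Pearson correlation of fractional ranks; the paired score lists (`human`, `automated`) are invented for illustration and are not drawn from the benchmark:

```python
def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Tied values receive the average of the rank positions they occupy
    (the standard fractional-ranking convention).
    """
    assert len(xs) == len(ys) and len(xs) > 1

    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(vals):
            j = i
            # Group tied values and assign them their average (1-based) rank.
            while j + 1 < len(vals) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical paired extractions (human vs. automated), for illustration only.
human     = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5]
automated = [0.85, 0.35, 0.75, 0.40, 0.80, 0.30]
print(round(spearman(human, automated), 3))  # → 0.771
```

In practice the coefficients in Table 4 would presumably be computed with a standard library routine such as `scipy.stats.spearmanr`, which additionally reports a p-value; the pure-Python version above is equivalent for the point estimate and makes the tie-handling explicit.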