MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Zhang Li¹⋆, Zhibo Lin¹⋆, Qiang Liu², Ziyang Zhang¹, Shuo Zhang¹, Zidun Guo¹, Jiajun Song¹, Jiarui Zhang², Xiang Bai¹, Yuliang Liu¹
¹Huazhong University of Science and Technology, ²Kingsoft Office

Abstract. We introduce the Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
Keywords: Multilingual Document Parsing · Photographed Documents · Real-World Evaluation

1 Introduction

Document parsing bridges the gap between visual information and machine-readable text by converting document images into structured, serialized representations. As a cornerstone for building high-quality pre-training corpora for LLMs, it directly influences the scale and fidelity with which human knowledge is transferred into machine intelligence. Significant progress has been made in document parsing, with numerous methods [7–9, 20, 21, 34, 35, 42, 50] continuously pushing state-of-the-art performance on benchmarks such as OmniDocBench [31] and olmOCR-Bench [34].

⋆ Core contribution.

Fig. 1: Overview of MDPBench, covering 17 languages (Arabic, German, English, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Dutch, Portuguese, Russian, Thai, Vietnamese, Simplified Chinese, and Traditional Chinese) and illustrating typical failure cases in real-world and multilingual scenes, such as reading-order errors, missing layout elements, wrong words, language misclassification, symbol errors, and repetition.

However, these benchmarks predominantly focus on digital-born and scanned documents in limited languages. This limitation leads existing document parsing models to exhibit a bias toward inputs from standardized, high-resource languages, with performance often declining in multilingual and challenging photographed document scenarios.

Multilingual and photographed document parsing plays a crucial role in the development of general-purpose AI systems. On the one hand, multilingual documents encapsulate diverse knowledge from different regions and cultural contexts.
Enhancing multilingual parsing capabilities enables more comprehensive access to global knowledge, helping mitigate potential biases arising from imbalanced language distributions. On the other hand, a substantial portion of real-world documents exists only in photographed form, including historical archives, paper receipts, books, and handwritten notes, often without corresponding digital versions. Effectively parsing such documents is essential for unlocking the value of large-scale unstructured data, thereby supporting more scalable and higher-quality LLM pre-training.

To promote research on document parsing in multilingual and photographed document scenarios, we introduce the Multilingual Document Parsing Benchmark (MDPBench), which consists of 3,400 document images across 17 languages, as shown in Fig. 1. We first collected electronic documents from publicly accessible websites, covering a wide range of document types, including academic papers, business reports, handwritten notes, newspapers, textbooks, and comics from different countries. To ensure annotation quality, we adopt a multi-stage annotation pipeline that combines automatic labeling and cross-verification by multiple expert models, followed by manual correction and human verification.

Table 1: Comparison of document parsing benchmarks. DB: Digital-Born; Photo.: Photographed; SR: Screen Re-photograph; PD: Physical Deformation; ID: Image Degradation; CO: Camera Orientations; BV: Background Variation.

| Benchmark | Languages | Type | Images | Photograph Conditions |
|---|---|---|---|---|
| FoxPage [24] | 2 (ZH, EN) | DB | 212 | — |
| olmOCR-Bench [34] | 1 (EN) | DB | 1402 | — |
| OmniDocBench-v1.5 [31] | 2 (ZH, EN) | DB | 1355 | — |
| DocPTBench [11] | 2 (ZH, EN) | DB / Photo. | 2362 | 4 of 5 |
| Real5-OmniDocBench [55] | 2 (ZH, EN) | DB / Photo. | 6775 | 3 of 5 |
| MDPBench | 17 (ZH, EN, FR, ES, RU, AR, DE, HI, ID, IT, JP, KO, NL, PT, TH, VI, ZH-T) | DB / Photo. | 3400 | 5 of 5 (SR, PD, ID, CO, BV) |
To obtain photographed documents that better reflect real-world conditions, we printed the collected electronic documents or displayed them on screens, and captured them under diverse environments and conditions, including indoor and outdoor scenes, physical deformation, image degradation, varying camera orientations, and background variations. To prevent data leakage and targeted training, the dataset is divided into 2,720 publicly available benchmark samples and 680 private evaluation samples. Researchers can submit their models to the official evaluation website for assessment on the private benchmark.

We conduct a comprehensive evaluation of general-purpose vision-language models (VLMs), specialized VLMs, and pipeline-based systems on MDPBench. The results reveal that: (1) open-source models still lag behind proprietary models, with Gemini-3-Pro [14] outperforming the strongest open-source model, dots.mocr [52], by 7.9% in photographed scenarios; (2) all methods experience notable performance degradation on photographed documents, with an average drop of 17.8%; and (3) performance on non-Latin-script languages is consistently lower, showing an average decrease of 14.0%. Overall, MDPBench highlights the limitations of current document parsing approaches and provides a standardized benchmark for evaluating multilingual text understanding and OCR capabilities in general VLMs.

We summarize the contributions as follows:

– We introduce MDPBench, the first benchmark specifically designed for multilingual digital and photographed document parsing. It comprises 3,400 document images spanning 17 languages and diverse document types, with high-quality annotations obtained through a rigorous multi-stage pipeline, including expert model labeling, manual correction, and human verification.
– A comprehensive evaluation shows that open-source models still lag behind the state-of-the-art proprietary model, Gemini-3-Pro [14]. Moreover, all models exhibit significant performance degradation on photographed documents and non-Latin scripts, highlighting the limitations of current approaches in real-world multilingual scenarios.

Fig. 2: Visualization of digital-born and photographed document images. The 850 digital-born documents cover diverse types (slides, exam papers, magazines, newspapers, comic books, academic papers, financial reports, notes, letters, books, and textbooks) and are printed or displayed on screens, then photographed under multi-dimensional conditions: camera orientations (left, right, down, oblique), physical deformation (bend, crease, wrinkle), image degradation (shadow, flash-on/off, blur), and background variation (indoor, two photos per document: table, floor, items; outdoor, one photo per document: grass, trees, road).

2 Related Work

2.1 Document Parsing Methods

Document parsing methods can be broadly categorized into traditional pipeline methods, general vision–language models (VLMs), end-to-end specialized VLMs, and multi-stage specialized VLMs. Traditional pipeline methods [9, 26, 32, 42] typically follow a fixed workflow: they first perform layout detection, then detect and recognize elements, merge the recognized elements, and finally reconstruct the reading order. These systems rely on multiple task-specific models, including layout detection [15, 39, 51], formula detection [17], formula recognition [40], table recognition, text detection, text recognition [16, 18], and reading order prediction [45].
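The fixed workflow described above can be sketched as follows. This is an illustrative outline rather than any particular system's code: `layout_model`, `recognizers`, and `order_model` are hypothetical stand-ins for the task-specific models listed in the text.

```python
from dataclasses import dataclass

@dataclass
class Block:
    bbox: tuple      # (x0, y0, x1, y1) in image coordinates
    category: str    # e.g. "text", "table", "formula"
    content: str = ""

def parse_document(image, layout_model, recognizers, order_model) -> str:
    """Fixed pipeline: layout detection -> per-category element recognition ->
    reading-order prediction -> serialization into plain text.
    An error at any stage propagates to every later stage."""
    blocks = layout_model(image)                      # 1. detect layout blocks
    for b in blocks:
        crop = image.crop(b.bbox)                     # 2. crop each element
        b.content = recognizers[b.category](crop)     # 3. category-specific recognizer
    ordered = order_model(blocks)                     # 4. predict reading order
    return "\n\n".join(b.content for b in ordered)    # 5. merge into serialized output
```

The rigid handoff between stages is what makes these systems modular but also why a missed block in layout detection can never be recovered downstream.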
Trained on massive datasets, general VLMs [3, 14, 22, 25, 44, 46] have demonstrated strong potential for document parsing. End-to-end specialized VLMs such as olmOCR [34], Nanonets-OCR [27], and OCRFlux [5] further improve document parsing performance by fine-tuning the Qwen-VL series [2, 3, 43] on document parsing tasks. In addition, HunyuanOCR [37] and the dots.ocr series [20, 52] extend document parsing capabilities to support tasks such as visual question answering and SVG code generation. Meanwhile, DeepSeek-OCR 2 [48] introduces a causal visual flow modeling mechanism to better capture the visual dependencies in document reading. Recent work [23] improves the inference efficiency of document parsing VLMs via training-free hierarchical speculative decoding. MonkeyOCR [21] points out that traditional pipeline systems suffer from error accumulation due to the combination of multiple tools, while end-to-end models that directly process full-page documents may suffer from low efficiency and hallucination issues caused by long contexts. To address these limitations, it proposes a three-stage document parsing paradigm, SRR, consisting of structure detection, content recognition with a VLM, and relation prediction. Subsequently, PaddleOCR-VL [7] adopts a similar three-stage framework. MinerU2.5 [29] and MonkeyOCR v1.5 [50] further merge structure detection and relation prediction into a single VLM, simplifying the pipeline into a two-stage parsing framework.

Fig. 3: Overall pipeline of the data annotation process. Expert layout detection models produce an initial layout (Step 1); blocks are cropped by category (Step 2) and recognized by three expert models (Step 3); consensus-based voting selects the highest-consensus result as the pre-annotation, falling back to Gemini-3-Pro recognition when the maximum average score does not exceed 0.7 (Step 4). Text/formula blocks are scored by 1 − Levenshtein(P, G) / max(|P|, |G|) and tables by 1 − TED(T, T′) / max(|T|, |T′|). Manual correction and human verification then follow, with a feedback and re-annotation quality-control cycle.

2.2 Document Parsing Benchmarks

Early document parsing benchmarks predominantly focused on single, isolated tasks. For instance, M6Doc [6], CDLA [19], D4LA [10], and DocLayNet [33] were designed for layout analysis; UniMER-1M [40] and HME-100k [49] for formula recognition; and FinTabNet [53] and PubTabNet [54] for table recognition. Recently, there has been a paradigm shift towards unified, multi-task evaluation frameworks. For example, FoxPage [24] focuses on the structured domain of academic papers sourced from clean PDF documents. OCRBench V2 [13] encompasses 23 tasks, including document parsing, to comprehensively assess document understanding capabilities. OmniDocBench [31] introduces a multi-level document parsing evaluation framework covering full-page, module, and attribute levels across diverse document types, whereas olmOCR-Bench [34] focuses on assessing content recall for document parsing models. However, these comprehensive benchmarks primarily focus on well-formatted, digitally born documents.
To address real-world scenarios, recent efforts such as DocPTBench [11] and Real5-OmniDocBench [55] have evaluated the parsing of captured documents by printing and photographing 1,355 document images from OmniDocBench v1.5. As shown in Tab. 1, existing open-source benchmarks remain confined to a limited set of languages, leaving a significant gap in multilingual document evaluation.

Table 2: Performance of general VLMs, specialized VLMs, and pipeline tools on MDPBench. "All", "Digit.", and "Photo." are overall, digital-born, and photographed scores; "Lat. Avg." and "N-Lat. Avg." are the Latin and non-Latin language averages; "Priv." is the private-split score.

| Model | All | Digit. | Photo. | Lat. Avg. | DE | EN | ES | FR | ID | IT | NL | PT | VI | N-Lat. Avg. | AR | HI | JP | KO | RU | TH | ZH | ZH-T | Priv. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *General VLMs* | | | | | | | | | | | | | | | | | | | | | | | |
| Gemini-3-pro-preview [14] | 86.4 | 90.4 | 85.1 | 88.4 | 91.2 | 90.6 | 83.4 | 82.7 | 91.5 | 91.6 | 87.7 | 91.4 | 85.9 | 84.1 | 89.4 | 90.4 | 74.8 | 85.5 | 84.9 | 80.6 | 85.1 | 82.1 | 89.8 |
| kimi-K2.5 [38] | 77.5 | 85.0 | 75.0 | 81.6 | 85.9 | 86.2 | 72.7 | 71.0 | 80.6 | 86.6 | 77.4 | 87.6 | 86.2 | 72.9 | 75.8 | 74.5 | 72.5 | 70.9 | 61.8 | 67.0 | 81.7 | 78.6 | 81.2 |
| Doubao-2.0-pro [4] | 74.2 | 78.9 | 72.8 | 75.7 | 82.8 | 74.4 | 69.0 | 70.0 | 73.3 | 82.0 | 69.9 | 83.4 | 76.5 | 72.5 | 81.3 | 75.7 | 65.8 | 74.7 | 63.3 | 71.9 | 71.9 | 75.2 | 79.5 |
| Claude-Sonnet-4.6 [1] | 73.1 | 85.0 | 69.3 | 79.2 | 79.8 | 80.6 | 72.8 | 66.5 | 82.3 | 83.3 | 76.7 | 88.0 | 83.1 | 66.2 | 67.8 | 71.7 | 63.4 | 64.3 | 70.8 | 65.2 | 61.3 | 65.1 | 77.6 |
| ChatGPT-5.2-2025-12-11 [30] | 68.6 | 85.6 | 63.0 | 75.2 | 70.8 | 79.4 | 71.4 | 60.0 | 77.7 | 78.5 | 71.6 | 85.0 | 82.1 | 61.1 | 64.9 | 63.4 | 55.8 | 65.4 | 60.7 | 63.8 | 56.3 | 58.7 | 74.0 |
| Qwen3-VL-Instruct-8b [2] | 68.3 | 78.4 | 65.0 | 73.6 | 73.7 | 71.4 | 69.3 | 66.2 | 68.5 | 79.1 | 78.3 | 82.2 | 73.4 | 62.5 | 63.1 | 58.4 | 59.9 | 61.9 | 57.9 | 62.0 | 62.6 | 73.8 | 70.8 |
| Qwen3.5-Instruct-9B [36] | 65.7 | 74.8 | 62.7 | 72.5 | 72.8 | 72.0 | 72.0 | 64.4 | 66.2 | 77.6 | 74.5 | 79.1 | 74.0 | 58.2 | 53.4 | 56.2 | 55.7 | 60.3 | 54.7 | 56.7 | 60.8 | 67.5 | 68.9 |
| InternVL-3.5-8B [44] | 42.7 | 59.7 | 37.0 | 53.4 | 39.8 | 64.2 | 47.5 | 42.7 | 53.8 | 60.6 | 52.2 | 63.2 | 57.0 | 30.6 | 8.2 | 9.0 | 45.6 | 30.3 | 26.1 | 10.8 | 55.3 | 59.3 | 45.3 |
| *Specialized VLMs* | | | | | | | | | | | | | | | | | | | | | | | |
| dots.mocr [52] | 80.5 | 90.5 | 77.2 | 81.7 | 82.6 | 87.4 | 71.3 | 70.1 | 84.5 | 89.3 | 83.2 | 86.8 | 79.9 | 79.2 | 83.3 | 83.6 | 75.0 | 78.7 | 71.2 | 77.9 | 84.6 | 79.6 | 82.8 |
| PaddleOCR-VL-1.5 [8] | 78.3 | 87.4 | 75.2 | 81.2 | 84.8 | 83.0 | 75.7 | 78.1 | 83.9 | 85.2 | 80.6 | 80.2 | 78.9 | 74.9 | 71.3 | 67.7 | 69.5 | 86.0 | 76.0 | 68.4 | 84.8 | 75.7 | 80.7 |
| dots.ocr [20] | 76.5 | 88.8 | 72.3 | 79.1 | 79.7 | 81.2 | 69.2 | 67.1 | 82.5 | 87.8 | 78.8 | 86.9 | 79.1 | 73.5 | 75.9 | 77.3 | 70.6 | 68.5 | 66.8 | 73.3 | 79.1 | 76.2 | 79.7 |
| olmOCR2 [35] | 70.4 | 79.9 | 67.2 | 76.7 | 75.7 | 77.3 | 72.5 | 68.9 | 70.6 | 81.0 | 72.0 | 88.0 | 84.0 | 63.3 | 59.0 | 60.8 | 59.4 | 70.6 | 65.8 | 59.2 | 68.6 | 63.4 | 76.1 |
| PaddleOCR-VL [7] | 69.6 | 87.6 | 63.6 | 72.1 | 78.2 | 79.3 | 62.9 | 66.0 | 77.4 | 78.4 | 67.9 | 72.0 | 66.6 | 66.7 | 65.8 | 68.4 | 59.9 | 77.8 | 56.9 | 57.8 | 78.2 | 68.5 | 70.9 |
| HunyuanOCR [37] | 68.3 | 80.2 | 64.3 | 72.4 | 75.0 | 73.1 | 63.0 | 66.1 | 69.9 | 80.3 | 61.4 | 81.9 | 80.6 | 63.7 | 68.3 | 73.1 | 55.6 | 68.9 | 52.2 | 60.7 | 66.8 | 64.2 | 68.6 |
| GLM-OCR [12] | 67.3 | 77.9 | 63.7 | 78.7 | 82.7 | 84.5 | 75.8 | 76.2 | 79.7 | 82.8 | 80.2 | 77.4 | 69.2 | 54.3 | 21.7 | 39.6 | 65.5 | 61.2 | 64.2 | 27.4 | 78.5 | 76.7 | 68.8 |
| MonkeyOCRv1.5 [50] | 65.0 | 84.3 | 58.6 | 67.4 | 70.8 | 74.9 | 55.6 | 60.3 | 73.8 | 75.9 | 66.3 | 67.2 | 61.4 | 62.4 | 60.1 | 56.8 | 57.0 | 78.9 | 51.7 | 55.6 | 74.8 | 64.1 | 69.0 |
| Nanonets-ocr2-3B [28] | 64.2 | 79.2 | 59.3 | 71.4 | 76.7 | 76.4 | 61.8 | 66.1 | 68.4 | 78.5 | 74.1 | 74.2 | 66.0 | 56.2 | 60.2 | 59.2 | 52.1 | 54.7 | 45.5 | 44.6 | 68.3 | 65.1 | 67.6 |
| Nanonets-OCR-s [27] | 63.7 | 78.8 | 58.7 | 71.3 | 75.1 | 78.5 | 61.2 | 62.5 | 70.3 | 81.0 | 69.6 | 75.9 | 67.5 | 55.0 | 59.5 | 61.8 | 55.9 | 51.2 | 43.5 | 39.5 | 67.4 | 61.5 | 66.6 |
| MonkeyOCR-pro-3B [21] | 52.2 | 68.0 | 47.0 | 65.1 | 71.7 | 77.9 | 55.9 | 62.1 | 66.2 | 74.5 | 66.3 | 71.1 | 40.2 | 37.6 | 4.6 | 4.2 | 55.2 | 60.5 | 42.6 | 9.1 | 72.2 | 52.4 | 53.6 |
| DeepSeek-OCR [47] | 51.8 | 80.7 | 42.2 | 54.5 | 55.0 | 58.3 | 44.1 | 43.2 | 60.9 | 69.3 | 52.4 | 53.0 | 54.1 | 48.9 | 56.9 | 52.2 | 49.1 | 28.2 | 36.2 | 49.4 | 59.7 | 59.2 | 54.5 |
| MinerU-2.5-VLM [29] | 46.3 | 61.9 | 40.8 | 63.0 | 68.8 | 78.4 | 54.7 | 57.3 | 67.5 | 75.2 | 60.4 | 58.8 | 46.0 | 27.4 | 1.3 | 9.0 | 39.1 | 14.7 | 8.6 | 11.3 | 72.9 | 62.2 | 48.7 |
| *Pipeline Tools* | | | | | | | | | | | | | | | | | | | | | | | |
| PP-StructureV3 [9] | 45.4 | 56.2 | 41.7 | 59.8 | 60.4 | 68.7 | 54.4 | 49.8 | 69.6 | 68.9 | 55.5 | 58.4 | 52.7 | 28.9 | 1.0 | 7.7 | 56.2 | 15.4 | 7.5 | 11.9 | 72.2 | 59.1 | 49.6 |
| MinerU-2.5-pipeline [29] | 33.5 | 57.6 | 25.4 | 46.5 | 54.3 | 58.3 | 38.4 | 43.6 | 51.9 | 56.5 | 43.9 | 44.0 | 27.6 | 18.7 | 1.2 | 5.3 | 24.5 | 6.8 | 4.2 | 6.4 | 53.9 | 47.2 | 36.2 |

3 MDPBench

Constructing a benchmark for multilingual photographed document parsing is inherently challenging.
The dataset must simultaneously ensure type diversity, annotation accuracy, and realistic visual conditions. We carefully design the construction pipeline of MDPBench with three key objectives: diversity, realism, and reliability. Fig. 2 presents a visualization of the dataset. In the following subsections, we introduce the dataset construction process in detail, including data collection, annotation methodology, and evaluation metrics. Fig. 3 illustrates the overall pipeline of the data annotation process.

3.1 Multilingual Digital-born Document Collection

To ensure a comprehensive evaluation, we prioritize diversity in document types, layout complexity, and visual elements (e.g., formulas, images, tables, and charts) during data collection. We systematically source data from globally accessible public platforms, covering 17 representative languages. Our data sources include open-access academic papers, business reports, educational materials, handwritten notes, historical archives, modern newspapers, as well as challenging Chinese and English documents from OmniDocBench [31], and complex text-image documents such as textbooks and comics from public digital libraries. Following the collection phase, all samples undergo manual review to filter out low-quality or structurally trivial documents. Ultimately, we curate a dataset of 850 digital-born document images spanning 17 languages.

Fig. 4: Visualization of document parsing results of Gemini-3-Pro on photographed documents, showing typical failure modes relative to the digital-born originals, including content missing, content corruption, fabricated content, and content misorder.

3.2 Real-world Photographed Document Generation

To evaluate model robustness under real-world degradations, we transform the aforementioned digital-born documents into photographed data by printing them on paper or displaying them on computer screens for capture. We then photograph all documents in both indoor and outdoor environments. To simulate the complexities of real-world capture, we apply varying degrees of physical deformation to the printed documents, including inward bending, outward bending, and irregular wrinkling.
In addition, we introduce diverse camera perspectives, capturing images from left, right, inverted, and oblique angles. For each document, we collect three images: two indoors and one outdoors. The indoor images feature diverse background interferences (e.g., desk surfaces, floor textures, and background text) and are subject to complex indoor lighting, moiré patterns (from screen captures), reflections, glare, and slight blur. Conversely, the outdoor images introduce distinct challenges, such as low-light conditions, shadows cast by surrounding objects, uneven illumination, and complex natural backgrounds.

3.3 Data Annotation

To balance efficiency and annotation accuracy, we design a three-stage annotation pipeline consisting of Expert Model Labeling, Manual Correction, and Human Verification. The resulting annotations follow a format that is fully compatible with OmniDocBench [31].

Expert Model Labeling. We first use dots.ocr [20] and PaddleOCR-VL [7] to perform layout detection on all digital-born documents. The detection results from the two models are then manually compared for each image, and the one with fewer missed detections and false detections is selected as the initial layout annotation. Based on the obtained layout information, we crop the text blocks, table blocks, and formula blocks according to the bounding box coordinates and element types in the layout, producing individual element-level crops. We then employ three state-of-the-art recognition models, PaddleOCR-VL [7], dots.ocr [20], and Qwen3VL [2], to recognize these cropped elements. Since the correct recognition result is usually unique and stable, whereas incorrect results tend to be diverse and more random, the prediction that is most similar to the outputs of the other models is more likely to be correct. Based on this observation, we compute the pairwise similarity among the recognition results of the three expert models for each element and select the result with the highest average similarity to the other two models as the final initial annotation.

Fig. 5: Typical language-specific errors, including missing content and repetitive output on Arabic academic papers (PaddleOCR-VL-1.5), language drift on Vietnamese exam papers (dots.mocr), and character-level errors on Hindi, Russian, and Thai digital documents.
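The consensus selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `normalized_similarity` implements the 1 − Levenshtein/max(|P|, |G|) measure used for text and formula blocks, and `select_consensus` is our own naming for the voting step with the paper's 0.7 reliability threshold.

```python
def normalized_similarity(p: str, g: str) -> float:
    """1 - Levenshtein(p, g) / max(len(p), len(g)), as used for text/formula blocks."""
    if not p and not g:
        return 1.0
    m, n = len(p), len(g)
    prev = list(range(n + 1))          # standard dynamic-programming edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                      # deletion
                         cur[j - 1] + 1,                   # insertion
                         prev[j - 1] + (p[i - 1] != g[j - 1]))  # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def select_consensus(candidates: list[str], threshold: float = 0.7):
    """Pick the prediction most similar on average to the other models' outputs.
    Returns (best_prediction, needs_fallback); needs_fallback=True signals that
    a stronger model should re-recognize the element block."""
    scores = []
    for i, p in enumerate(candidates):
        others = [normalized_similarity(p, q)
                  for j, q in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best] < threshold
```

For example, with two agreeing predictions and one outlier, the agreeing result wins with a high consensus score; when all three outputs diverge, the low maximum average similarity triggers the fallback path.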
Sp ecifically , for text and formulas, we measure similarit y using 1 - Normalized Edit Distance (NED), while for tables, we use T ree Edit Distance-based Similarity (TEDS). If the highest av erage similarity is lo w er than 0.7, w e consider the predictions of the three exp ert mo dels to b e unreliable. In such cases, we instead use the then state-of-the-art Gemini-3-pro [ 14 ] mo del to recognize the corresp onding elemen t blo c k. Man ual Correction . Before conducting man ual corrections, w e first train the annotators to align correction guidelines and introduce the annotation w ork- flo w. W e then carry out a pilot annotation on a small subset of samples to v erify and ensure the accuracy and consistency of the ov erall annotation pro cess. After obtaining the model-generated annotations, we p erform man ual correction in a staged manner. First, w e c heck whether the lay out co ordinates and element types of each image are correct. Next, we v erify whether the reading order follows the natural reading logic of h umans. Finally , we examine and refine each element detected in the la yout one by one. Human V erification . T o ensure the final quality of the dataset, w e adopt a strict “annotation–correction–v erification” mechanism. After manual correction of one do cumen t, the data is submitted to an indep endent review er for v erifica- tion. If the annotation meets the qualit y standards, it is marked as “P assed” and pro ceeds to the final deliv ery stage. If any errors or inconsistencies are identi- fied, it is mark ed as “F ailed”, accompanied by detailed feedbac k, and returned to the original annotator for targeted revisions. This pro cess is iteratively rep eated un til the do cumen t fully satisfies the acceptance criteria. MDPBenc h 9 3 - : ( ) ( 1 9 9 4 ) ( : ) ( ) ( ) -- M i s s i n g C o n t e n t -- : ( 13 14 15 16 17 .) : ( 26 , 25 , 24 , 23 , 22 , 21 , 20 , 20 , 2 ( 1 ) ( 5 ) ( 1 ) ( 5 ) . 18 60 T A S - 26 ( S D ± 72.9 ) 8 . 4 ± 7 2 . 
[Fig. 6 example content: an Arabic academic paper parsed by PaddleOCR-VL-1.5 with missing content and repetitive output, and a Vietnamese exam paper parsed by dots.mocr whose output drifts into Chinese.]
Fig. 6: Existing document parsing models often exhibit hallucinations.

3.4 Evaluation Metrics

Unlike OmniDocBench [31], MDPBench adopts a page-level aggregation evaluation strategy to mitigate the impact of imbalanced multilingual data distributions. In OmniDocBench [31], the overall metric is typically computed by first calculating the average scores of different element types, such as text, tables, and formulas, and then averaging these scores. However, in multilingual scenarios, document structures vary significantly across languages. For example, academic papers are often written in English and tend to contain a large number of formulas, while documents in some other languages may include far fewer formulas. If an element-level evaluation strategy is used in such cases, the overall score of certain languages can be disproportionately influenced by the parsing results of only a few formulas or tables. To address this issue, we adjust the evaluation granularity to the page level. Specifically, we first compute the evaluation metrics for all elements within a single page (e.g., text, tables, and formulas), and then average these metrics to obtain the score for that page. The final score is calculated as the average of the scores across all pages.

To mitigate potential benchmark overfitting caused by targeted optimization on publicly available error cases, we divide the dataset into public and private subsets. The images and ground-truth annotations in the public subset will be released to the community for free download and evaluation. In contrast, the private subset and its annotations will remain undisclosed. Evaluation on the private subset can only be conducted by submitting model outputs through our official evaluation website.
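The difference between the two aggregation strategies can be made concrete with a short sketch. The data layout and function names are illustrative, assuming per-element scores (1 − NED for text, CDM for formulas, TEDS for tables) have already been computed by the matching step:

```python
from statistics import mean

# Each page is a list of (element_type, score) pairs, where the score is
# 1-NED for text, CDM for formulas, and TEDS for tables.

def page_level_score(pages):
    """MDPBench aggregation: average the element scores within each page,
    then average across pages, so every page contributes equally."""
    return mean(mean(s for _, s in page) for page in pages if page)

def element_level_score(pages):
    """OmniDocBench-style aggregation, for contrast: pool all elements by
    type, average within each type, then average the per-type means; a
    handful of formulas can dominate a language's overall score."""
    by_type = {}
    for page in pages:
        for etype, score in page:
            by_type.setdefault(etype, []).append(score)
    return mean(mean(scores) for scores in by_type.values())
```

Under page-level aggregation, a page containing one formula and ten text lines weighs the same as any other page, so languages whose documents happen to be formula-heavy or table-heavy are no longer scored mostly on those few elements.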
For each image, we follow OmniDocBench [31] to perform data preprocessing, element extraction, pure text extraction, and element matching. During evaluation, page components such as headers, footers, page numbers, and page footnotes are ignored. For the matched elements, we use the following metrics for evaluation:

– For text and reading order, we adopt Normalized Edit Distance (NED). Specifically, we compute the Levenshtein distance between the predicted text P and the ground-truth text G, and normalize it by the maximum length of the two strings:

Score = 1 − Levenshtein(P, G) / max(|P|, |G|).

– For formula recognition, we use CDM [41] for evaluation to prevent misjudgments caused by differences in expression forms.

– For table recognition, we adopt the widely used Tree-Edit-Distance-based Similarity (TEDS) [54]:

TEDS = 1 − TED(T_p, T_g) / max(|T_p|, |T_g|).

4 Experiments

We evaluate a diverse set of document parsing models on MDPBench, including general VLMs, specialized models, and pipeline-based tools. Beyond assessing document parsing performance, MDPBench also serves as a valuable benchmark for evaluating the multilingual text understanding capabilities of general VLMs. During evaluation, we assume that models have no prior knowledge of the input image's language or whether it is photographed or digitally generated. The results are summarized in Tab. 2.

[Fig. 7 example content: a two-column Arabic book page (Books, AR) parsed by PaddleOCR-VL-1.5 and by dots.ocr, with the predicted reading order overlaid.]
Fig.
7: Existing document parsing models often fail to correctly handle right-to-left reading order in Arabic documents.

4.1 End-to-End Evaluation Results

As shown in Tab. 2, the top-performing proprietary model, Gemini-3-Pro [14], achieves an overall accuracy of 86.4%, reaching state-of-the-art (SOTA) results in 14 of 17 languages. In contrast, the best open-source model, dots.mocr, attains 80.5% overall accuracy, revealing a clear gap between proprietary and open-source approaches. We further analyze the key limitations of current document parsing models, including challenges with photographed documents, limited recognition of non-Latin scripts, language-specific reading order issues, hallucinations and repeated outputs, and inherent drawbacks of traditional multi-stage pipelines. We summarize the findings as follows:

Greater challenges in parsing photographed documents. As shown in Tab. 2, model performance on photographed documents drops by an average of 17.8%. Among all evaluated models, Gemini-3-Pro [14] demonstrates the most robust performance on photographed documents. However, as illustrated in Fig. 4, even Gemini-3-Pro [14] can still produce errors such as missing content, incorrect reading order, and hallucinated outputs when handling photographed documents, resulting in a performance decrease of approximately 5.3% compared to digital documents.

Models perform worse on non-Latin-script languages. Latin-script languages are based on the A–Z alphabet, often extended with diacritics and other modifications. We observe that models perform significantly worse on non-Latin-script languages, with an average performance drop of 14.0% compared to Latin-script languages. For example, although models such as MinerU2.5 [29] and MonkeyOCR [21] are trained primarily on English and Chinese data, they still generalize well to Latin-script languages such as German.
However, their accuracy drops to below 10% on non-Latin-script languages, such as Arabic and Hindi.

Models exhibit typical errors specific to individual languages. As illustrated in Fig. 5, models exhibit distinct language-specific error patterns when processing languages like Hindi, Russian, and Thai. In Hindi, which relies on vowel diacritics, models tend to retain only the base characters and ignore crucial modifiers, for example misrecognizing the name "Arvind" as "Aravid". For Russian documents, models frequently suffer from visual confusion, erroneously decoding Cyrillic characters that are visually identical to Latin letters as their Latin counterparts. Furthermore, when processing Thai, a typical unspaced language in which spaces denote only semantic boundaries, models often hallucinate spaces within continuous text. For instance, the word meaning "biggest" is incorrectly segmented, similar to splitting the English word "biggest" into "bigge st", thereby severely disrupting lexical integrity.

Models tend to produce repetitive outputs and exhibit language drift when handling multilingual documents. As shown in Fig. 6, we observe that PaddleOCR-VL-1.5 [8] exhibits issues such as missing content and repetitive outputs when processing Arabic documents. Meanwhile, dots.mocr [52] demonstrates language drift when handling Vietnamese, incorrectly recognizing it as Chinese. These findings suggest that existing document parsing models are insufficiently trained for such linguistic scenarios and suffer from biases in their training data.

Models struggle with reading order in right-to-left scripts. Arabic is read from right to left. As shown in Fig. 7, we observe that for two-column Arabic document images, models such as PaddleOCR-VL-1.5 [8] and dots.ocr [20] often incorrectly process the text in a left-to-right order.

Table 3: Component-level evaluation of text, formula, and table recognition on MDPBench subsets.

Text recognition (Edit Distance ↓, per language):
Model                       DE    EN    ES    FR    ID    IT    NL    PT    VI    AR    HI    JP    KO    RU    TH    ZH    ZH-T
Gemini-3-pro-preview [14]   0.026 0.041 0.082 0.026 0.025 0.021 0.016 0.061 0.050 0.040 0.017 0.234 0.043 0.031 0.042 0.104 0.418
Qwen3-VL-Instruct-8B [2]    0.126 0.135 0.124 0.023 0.066 0.018 0.016 0.043 0.046 0.412 0.065 0.124 0.059 0.039 0.226 0.104 0.342
dots.mocr [52]              0.003 0.022 0.107 0.027 0.021 0.011 0.005 0.024 0.039 0.059 0.024 0.071 0.031 0.030 0.598 0.084 0.804
PaddleOCR-VL-1.5 [8]        0.003 0.019 0.172 0.040 0.018 0.012 0.004 0.019 0.031 0.262 0.224 0.063 0.031 0.033 0.330 0.065 0.197
MonkeyOCR-pro-3B [21]       0.016 0.027 0.169 0.035 0.461 0.016 0.018 0.040 0.913 0.978 0.980 0.139 0.113 0.458 0.989 0.126 0.660
PP-Structure V3 [9]         0.288 0.279 0.481 0.421 0.347 0.356 0.267 0.396 0.569 0.975 0.969 0.613 0.944 0.880 0.931 0.387 0.659

Formula (CDM ↑) and table (TEDS ↑) recognition:
Model                       Formula Digit.  Formula Photo.  Table Digit.  Table Photo.
Gemini-3-pro-preview [14]   93.4            90.5            75.9          69.2
Qwen3-VL-Instruct-8B [2]    92.9            89.3            65.0          56.2
dots.mocr [52]              89.8            78.7            59.6          55.9
PaddleOCR-VL-1.5 [8]        88.4            85.1            76.0          65.0
MonkeyOCR-pro-3B [21]       90.7            88.3            68.3          60.7
PP-Structure V3 [9]         88.6            56.9            56.3          46.2

4.2 Single Task Evaluation Results

Text Recognition Results. We leverage the ground-truth layout annotations to crop text blocks from digital-born documents, and randomly sample 200 blocks per language for evaluation. The results, shown in Tab. 3, indicate that PaddleOCR-VL-1.5 [8] achieves the best performance on 10 out of 17 languages, while dots.mocr [52] and Gemini-3-Pro attain state-of-the-art (SOTA) results on 4 languages each. This phenomenon reflects, to some extent, biases in both training data and training paradigms. Specifically, dots.mocr [52] and Gemini-3-Pro [14] are primarily trained in an end-to-end manner on full-page images, which leads to relatively weaker performance when handling cropped local text blocks.
In contrast, PaddleOCR-VL-1.5 [8] is trained with a substantial amount of text-block-level data, making it better suited for this evaluation setting. Furthermore, we observe that PaddleOCR-VL-1.5 [8] performs notably worse on Arabic, Hindi, and Thai. This suggests a language distribution bias in its training data, which in turn limits its generalization ability on low-resource or underrepresented languages.

Formula and Table Recognition Results. We crop formula and table regions from digital-born documents using ground-truth annotations, and manually extract corresponding regions from photographed documents. The results are reported in Tab. 3. Gemini-3-Pro [14] achieves the best performance on both formula and table recognition. All models exhibit performance degradation in photographed scenarios, likely due to complex backgrounds, varying illumination, image degradation, and geometric distortions. Table recognition remains challenging: Gemini-3-Pro [14] achieves 75.9% accuracy on digital tables but drops to 69.2% on photographed tables, indicating that table recognition still lacks robustness under real-world imaging conditions.

Layout Detection Results. We evaluate different models on digital-born documents across multiple languages using the PageIoU [29] metric. The results are shown in Tab. 4. dots.mocr [52] achieves SOTA performance in 13 out of 17 languages, demonstrating strong generalization in multilingual scenarios. PaddleOCR-VL [7] and PaddleOCR-VL-1.5 [8] exhibit comparable multilingual layout detection performance, resulting in similar overall results on digital-born documents. However, due to its support for arbitrarily shaped bounding boxes, PaddleOCR-VL-1.5 [8] improves overall performance on photographed documents by 11.6%, as shown in Tab. 2. Although MinerU-2.5-VLM achieves overall results below 10% on AR, HI, and RU, its PageIoU scores on these three languages all exceed 85%, indicating that layout detection performance is relatively insensitive to language differences.

Table 4: Component-level layout detection evaluation on the MDPBench layout subset.

Layout Detection (PageIoU ↑, per language):
Model                  DE   EN   ES   FR   ID   IT   NL   PT   VI   AR   HI   JP   KO   RU   TH   ZH   ZH-T
dots.mocr [52]         93.1 86.6 88.3 85.9 94.4 92.3 93.6 92.2 92.6 95.4 92.7 91.6 91.6 92.9 90.2 85.7 89.9
PaddleOCR-VL [7]       93.9 86.2 88.8 94.4 85.9 88.1 86.9 84.2 81.9 84.6 87.2 85.8 86.5 88.6 80.2 86.0 84.7
PaddleOCR-VL-1.5 [8]   92.4 84.5 88.8 93.5 84.3 88.0 86.2 81.3 80.6 87.5 86.6 86.5 86.8 87.2 81.0 84.9 84.6
MinerU-2.5-VLM [29]    90.6 84.8 76.1 91.8 84.4 86.1 85.4 87.8 85.0 89.2 87.2 82.7 91.5 88.8 81.9 83.9 87.1
PP-Structure V3 [9]    65.5 63.7 58.8 70.9 61.0 62.1 62.0 59.0 56.0 59.3 66.5 59.7 62.5 65.8 55.7 59.8 58.7

5 Conclusion

This paper introduces MDPBench, the first benchmark for multilingual photographed document parsing. MDPBench comprises 3,400 high-quality human-annotated images covering 17 languages and systematically incorporates a wide range of real-world capture conditions. Extensive experiments demonstrate the limitations of existing document parsing models, particularly a significant performance degradation on non-Latin scripts and photographed document scenarios. MDPBench can be used not only to evaluate specialized document parsing systems but also as a benchmark for assessing the multilingual text understanding and OCR capabilities of general-purpose large multimodal models, providing insights for future model improvement and facilitating the development of more robust, generalizable, and practically deployable document parsing systems.

References

1. Anthropic: Claude. https://www.anthropic.com/claude (2025)
2.
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint (2025)
3. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
4. ByteDance: Doubao. https://research.doubao.com (2026)
5. ChatDoc: OCRFlux. https://github.com/chatdoc-com/OCRFlux (2025)
6. Cheng, H., Zhang, P., Wu, S., Zhang, J., Zhu, Q., Xie, Z., Li, J., Ding, K., Jin, L.: M6Doc: a large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15138–15147 (2023)
7. Cui, C., Sun, T., Liang, S., Gao, T., Zhang, Z., Liu, J., Wang, X., Zhou, C., Liu, H., Lin, M., et al.: PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528 (2025)
8. Cui, C., Sun, T., Liang, S., Gao, T., Zhang, Z., Liu, J., Wang, X., Zhou, C., Liu, H., Lin, M., et al.: PaddleOCR-VL-1.5: Towards a multi-task 0.9B VLM for robust in-the-wild document parsing. arXiv preprint arXiv:2601.21957 (2026)
9. Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: PaddleOCR 3.0 technical report. arXiv preprint (2025)
10. Da, C., Luo, C., Zheng, Q., Yao, C.: Vision grid transformer for document layout analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19462–19472 (2023)
11.
Du, Y., Chen, P., Ying, X., Chen, Z.: DocPTBench: Benchmarking end-to-end photographed document parsing and translation. arXiv preprint (2025)
12. Duan, S., Xue, Y., Wang, W., Su, Z., Liu, H., Yang, S., Gan, G., Wang, G., Wang, Z., Yan, S., et al.: GLM-OCR technical report. arXiv preprint arXiv:2603.10910 (2026)
13. Fu, L., Kuang, Z., Song, J., Huang, M., Yang, B., Li, Y., Zhu, L., Luo, Q., Wang, X., Lu, H., Li, Z., Tang, G., Shan, B., Lin, C., Liu, Q., Wu, B., Feng, H., Liu, H., Huang, C., Tang, J., Chen, W., Jin, L., Liu, Y., Bai, X.: OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2025)
14. Google DeepMind: Gemini 3 Pro. https://blog.google/innovation-and-ai/technology/developers-tools/gemini-3-pro-vision (2025)
15. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: Pre-training for document AI with unified text and image masking. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 4083–4091 (2022)
16. Jaided AI: EasyOCR: Ready-to-use OCR with 80+ supported languages. https://github.com/JaidedAI/EasyOCR (2024)
17. Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics (2023)
18. Li, C., Liu, W., Guo, R., Yin, X., Jiang, K., Du, Y., Du, Y., Zhu, L., Lai, B., Hu, X., et al.: PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001 (2022)
19. Li, H.: CDLA: A Chinese document layout analysis (CDLA) dataset. https://github.com/buptlihang/CDLA (2021)
20. Li, Y., Yang, G., Liu, H., Wang, B., Zhang, C.: dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint (2025)
21.
Li, Z., Liu, Y., Liu, Q., Ma, Z., Zhang, Z., Zhang, S., Guo, Z., Zhang, J., Wang, X., Bai, X.: MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218 (2025)
22. Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26763–26773 (2024)
23. Liao, W., Li, H., Xie, P., Cai, X., Shen, Y., Xin, Y., Qin, Q., Ye, S., Li, T., Hu, M., et al.: Training-free acceleration for document parsing vision-language model with hierarchical speculative decoding. arXiv preprint arXiv:2602.12957 (2026)
24. Liu, C., Wei, H., Chen, J., Kong, L., Ge, Z., Zhu, Z., Zhao, L., Sun, J., Han, C., Zhang, X.: Focus anywhere for fine-grained multi-page document understanding. arXiv preprint arXiv:2405.14295 (2024)
25. Liu, Y., Yang, B., Liu, Q., Li, Z., Ma, Z., Zhang, S., Bai, X.: TextMonkey: An OCR-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473 (2024)
26. Livathinos, N., Auer, C., Lysak, M., Nassar, A., Dolfi, M., Vagenas, P., Ramis, C.B., Omenetti, M., Dinkla, K., Kim, Y., et al.: Docling: An efficient open-source toolkit for AI-driven document conversion. In: AAAI Conference on Artificial Intelligence (2025)
27. Mandal, S., Talewar, A., Ahuja, P., Juvatkar, P.: Nanonets-OCR-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging (2025)
28. Mandal, S., Talewar, A., Thakuria, S., Ahuja, P., Juvatkar, P.: Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging (2025)
29.
Niu, J., Liu, Z., Gu, Z., Wang, B., Ouyang, L., Zhao, Z., Chu, T., He, T., Wu, F., Zhang, Q., et al.: MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186 (2025)
30. OpenAI: ChatGPT. https://chat.openai.com (2025)
31. Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., et al.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24838–24848 (2025)
32. Paruchuri, V.: Marker. https://github.com/datalab-to/marker (2024)
33. Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.: DocLayNet: A large human-annotated dataset for document-layout segmentation. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3743–3751 (2022)
34. Poznanski, J., Borchardt, J., Dunkelberger, J., Huff, R., Lin, D., Rangapur, A., Wilhelm, C., Lo, K., Soldaini, L.: olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443 (2025)
35. Poznanski, J., Soldaini, L., Lo, K.: olmOCR 2: Unit test rewards for document OCR. arXiv preprint arXiv:2510.19817 (2025)
36. Qwen Team: Qwen3.5: Towards native multimodal agents (2026), https://qwen.ai/blog?id=qwen3.5
37. Team, H.V., Lyu, P., Wan, X., Li, G., Peng, S., Wang, W., Wu, L., Shen, H., Zhou, Y., Tang, C., et al.: HunyuanOCR technical report. arXiv preprint (2025)
38. Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al.: Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276 (2026)
39. Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al.: YOLOv10: Real-time end-to-end object detection.
Advances in Neural Information Processing Systems 37, 107984–108011 (2024)
40. Wang, B., Gu, Z., Liang, G., Xu, C., Zhang, B., Shi, B., He, C.: UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254 (2024)
41. Wang, B., Wu, F., Ouyang, L., Gu, Z., Zhang, R., Xia, R., Shi, B., Zhang, B., He, C.: Image over text: Transforming formula recognition evaluation with character detection matching. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19681–19690 (2025)
42. Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., et al.: MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839 (2024)
43. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
44. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
45. Wang, Z., Xu, Y., Cui, L., Shang, J., Wei, F.: LayoutReader: Pre-training of text and layout for reading order detection. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 4735–4744 (2021)
46. Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., Ge, Z., Zhao, L., Sun, J., Peng, Y., et al.: General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704 (2024)
47. Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025)
48. Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR 2: Visual causal flow.
arXiv preprint arXiv:2601.20552 (2026)
49. Yuan, Y., Liu, X., Dikubab, W., Liu, H., Ji, Z., Wu, Z., Bai, X.: Syntax-aware network for handwritten mathematical expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4553–4562 (2022)
50. Zhang, J., Liu, Y., Wu, Z., Pang, G., Ye, Z., Zhong, Y., Ma, J., Wei, T., Xu, H., Chen, W., et al.: MonkeyOCR v1.5 technical report: Unlocking robust document parsing for complex patterns. arXiv preprint arXiv:2511.10390 (2025)
51. Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: DETRs beat YOLOs on real-time object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16965–16974 (2024)
52. Zheng, H., Li, Y., Zhang, K., Xin, L., Zhao, G., Liu, H., Chen, J., Lou, J., Qiu, J., Fu, Q., et al.: Multimodal OCR: Parse anything from documents. arXiv preprint arXiv:2603.13032 (2026)
53. Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 697–706 (2021)
54. Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: European Conference on Computer Vision. pp. 564–580. Springer (2020)
55. Zhou, C., Gao, Z., Wang, X., Gao, T., Cui, C., Tang, J., Liu, Y.: Real5-OmniDocBench: A full-scale physical reconstruction benchmark for robust document parsing in the wild. arXiv preprint arXiv:2603.04205 (2026)