Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation
Converting data from machine-unreadable formats like PDFs into Markdown has the potential to enhance the accessibility of scientific research. Existing end-to-end decoder transformer models can transform screenshots of PDFs into Markdown, offering more flexibility than pipeline-based methods. Yet, decoding text token by token from scratch is inefficient, especially when dense text can be directly copied from the PDF. To address this challenge, this paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, leveraging the high n-gram overlap between PDFs and their Markdown equivalents. A new method, Copy Lookup Decoding (CLD), is introduced here to enhance PLD's candidate generation mechanism. Experiments demonstrate that CLD can accelerate the conversion process by up to 1.70× at original quality. The codebase for this paper is open-source on GitHub (https://github.com/Fireblossom/CopyLookup).
💡 Research Summary
The paper tackles the problem of converting scientific PDFs into Markdown, a format that is both human‑readable and machine‑friendly. While pipeline approaches (layout analysis → text extraction → element‑wise conversion) suffer from error propagation, recent end‑to‑end vision‑language models such as Nougat and Kosmos‑2.5 can directly generate Markdown from a screenshot of a PDF page. However, these models generate text token‑by‑token in an autoregressive fashion, which makes inference slow.
To accelerate this process, the authors adapt Prompt Lookup Decoding (PLD), an assisted‑generation technique that copies n‑grams already present in the prompt as candidate tokens, and verifies them with the large model. They first modify PLD so that the source of candidates is not the textual prompt but the entire text extracted from the PDF page. This straightforward change, called modified PLD (mPLD), suffers from two issues: (1) not all PDF text is copyable (formulas, tables, figures, page numbers), and (2) simply taking the first matching n‑gram can miss later‑positioned candidates.
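The lookup step at the heart of PLD and mPLD can be sketched as follows: match the tail of the already-generated sequence against a source token list (the prompt for PLD, the extracted PDF text for mPLD) and propose the tokens that follow the match as draft candidates for the model to verify. This is a minimal illustrative sketch, not the paper's implementation; the function name, parameters, and use of plain integer token IDs are assumptions.

```python
def lookup_candidates(generated, source, max_ngram=3, num_candidates=10):
    """PLD-style lookup: find an n-gram in `source` matching the tail of
    `generated` and return the tokens that follow it as draft candidates."""
    for n in range(max_ngram, 0, -1):          # prefer longer n-gram matches
        tail = generated[-n:]
        if len(tail) < n:                      # not enough generated tokens yet
            continue
        for i in range(len(source) - n):       # first match wins (PLD behavior)
            if source[i:i + n] == tail:
                return source[i + n : i + n + num_candidates]
    return []                                  # no match: fall back to normal decoding
```

The returned candidates are then scored in a single forward pass of the target model, and only the verified prefix is kept, so output quality is unchanged.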
The authors therefore introduce Copy Lookup Decoding (CLD), which consists of two novel components. The first, Copy‑able Text Identification (CTI), uses layout information to decide whether a span of PDF text should be copied. They fine‑tune the ERNIE‑Layout model with LoRA to classify each token as KEEP (copyable) or DELETE (non‑copyable). Token‑level F1 reaches 0.985 and span‑level F1 0.988, with an average inference cost of only 0.03 seconds per page. The second component, Candidate Generation (CG), merges adjacent KEEP spans, scans them sequentially for the longest n‑gram that matches the current decoder prediction, and supplies the following tokens as a candidate sequence. When a candidate is accepted, the corresponding span is moved to the front of the processing list, keeping the generation aligned with the natural reading order of the PDF.
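The CG component described above can be sketched in a few lines: scan the merged KEEP spans in order, take the first span whose tokens continue the current prediction, and move that span to the front of the list so subsequent lookups follow the page's reading order. This is a hedged sketch under the assumption that spans are plain lists of token IDs; names and signatures are illustrative, not from the paper's codebase.

```python
def cg_candidates(generated, spans, max_ngram=3, num_candidates=10):
    """Candidate Generation sketch: search copyable KEEP spans sequentially
    for the longest n-gram matching the tail of `generated`; on a match,
    move the span to the front (reading-order heuristic) and return the
    tokens following the match as draft candidates."""
    for n in range(max_ngram, 0, -1):          # longest n-gram first
        tail = generated[-n:]
        if len(tail) < n:
            continue
        for idx, span in enumerate(spans):     # spans in current priority order
            for i in range(len(span) - n):
                if span[i:i + n] == tail:
                    spans.insert(0, spans.pop(idx))   # matched span moves forward
                    return span[i + n : i + n + num_candidates]
    return []
```

Moving the accepted span to the front is what keeps generation aligned with the PDF's natural reading order, since the next lookup will try the same region first.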
Experiments were conducted on four backbone models—Nougat‑base (349 M), Kosmos‑2.5 (1.37 B), Llama‑3.2‑Vision (10.6 B), and Qwen2‑VL (12.6 B, quantized)—and three test sets: the full arXiv collection, a subset of economics papers, and a subset of quantum‑physics papers. The baseline generation times per page ranged from 6.48 s (Nougat) to 138.85 s (Qwen2‑VL). With mPLD, speed‑ups of 1.09× to 1.38× were observed; with the full CLD pipeline, speed‑ups increased to 1.10×–1.70×. Larger models benefited more, with CLD reaching its maximum 1.70× speed‑up on the largest backbones. Importantly, the Markdown quality remained unchanged because non‑copyable elements (formulas, tables, etc.) were still generated by the underlying model.
Ablation studies confirmed that both CTI and CG independently contribute to latency reduction, and their combination yields the greatest acceleration. The extra 0.03 s per page required for CTI is modest compared to the overall gains.
The significance of this work lies in demonstrating that a simple, plug‑and‑play assisted‑generation technique can be combined with domain‑specific preprocessing (layout‑based copyability filtering) to dramatically speed up end‑to‑end PDF‑to‑Markdown conversion without any model fine‑tuning. The approach is model‑agnostic, works with both specialized and general‑purpose vision‑language models, and can be extended to other tasks where the source document contains large amounts of copyable text (e.g., OCR‑based summarization, document translation). Future directions include integrating OCR or specialized formula recognition for the DELETE spans, handling multi‑page documents with global reading‑order optimization, and exploring dynamic candidate length strategies to further close the gap between speed and quality.