Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Ef ﬁcient Domain Adaptation for T e xt Line Recognition via Decoupled Language Models Arundhathi De v Department of Computer Science University of Cincinnati Cincinnati, USA dev ai@mail.uc.edu Justin Zhan Department of Computer Science University of Cincinnati Cincinnati, USA zhanjt@ucmail.uc.edu Abstract —Optical character recognition r emains critical in- frastructure for document digitization, yet state-of-the-art per- formance is often r estricted to well-resourced institutions by prohibiti ve computational barriers. End-to-end transformer ar - chitectures achiev e strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. W e pr esent a modu- lar detection-and-correction framework that achieves near -SO T A accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) fr om domain-speciﬁc linguistic correction using pr etrained sequence models including T5, ByT5, and B AR T . By training the corr ectors entirely on synthetic noise, we enable annotation-fr ee domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical “Pareto fr ontier” in architecture selection: T5-Base excels on modern text with standard vocab u- lary , whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resour ce-efﬁcient alternative to mono- lithic OCR architectures. Index T erms —Optical Character Recognition, Domain Adap- tation, T ransf ormers, ByT5, Efﬁciency I . I N T RO D U C T I O N T ext line recognition (TLR) is a fundamental component of optical character recognition (OCR) pipelines, con verting localized text regions into machine-readable transcriptions. TLR supports applications ranging from historical manuscript digitization to real-time mobile document capture, making reliable and accessible OCR technology vital for scientiﬁc research and cultural heritage preservation. Recent end-to-end transformer -based approaches such as T rOCR [3] and DTrOCR [4] have achiev ed state-of-the-art accuracy , but their monolithic design incurs substantial com- putational overhead. Adapting these models to new domains (e.g., historical documents with archaic orthography) typically requires 200–600 GPU hours on high-end hardware [3]. Such requirements put advanced OCR de velopment beyond the reach of many practitioners: digital humanities researchers, archivists, and librarians often lack the GPU infrastructure necessary to train or ﬁne-tune these models. As a result, users must choose between suboptimal accuracy (generic pretrained models) or prohibitiv e costs (commercial licensing or cloud GPU rental). Why end-to-end transformers are inefﬁcient for adapta- tion? Current end-to-end TLR transformers are inefﬁcient for domain adaptation for tw o structural reasons: (1) Coupled Optimization: visual and linguistic components are trained jointly within a single monolithic architecture. Adapting to a ne w domain often requires modifying only the linguistic component (e.g., handling archaic orthography), yet the entire visual backbone must still be retrained, expending compute on domain-agnostic visual features; and (2) Linguistic Overengi- neering: these models retrain lar ge decoder stacks despite the existence of pretrained language models (e.g., T5, ByT5) that already capture rich linguistic knowledge, forcing practitioners to re-learn capabilities that are readily a v ailable off the shelf. Decoupled detection and linguistic correction. W e propose to decouple visual character detection from domain-speciﬁc linguistic corr ection . Our framew ork comprises: (1) V isual De- tector: a lightweight transformer detector (DINO-DETR [32]) trained once to localize and classify characters from synthetic text, a domain-agnostic, reusable front-end; and (2) Linguis- tic Corr ector: a pretrained language model (T5 [7], ByT5, B AR T [6]) ﬁne-tuned on synthetic domain-speciﬁc noise to correct detector output without requiring labeled images. This factorization enables one-time training of the visual module per script, swappable linguistic modules tailored to domain characteristics, and domain adaptation as a lightweight ﬁne- tuning task that can be performed on commodity hardware (approx. 4 hours on a single GPU). Key ﬁnding: a pareto frontier in architectur e selection. Across three distinct handwriting benchmarks representing a spectrum of dif ﬁculty (modern clean, cursive, and historical, see Fig. 1), we identify a critical trade-off in architecture selec- tion: T5-Base (token-based) provides optimal performance on modern, standard-vocab ulary text by lev eraging its pretrained dictionary (CVL: 1.90% CER). Howe ver , on historical docu- ments with archaic orthography and out-of-vocab ulary proper nouns, ByT5-Base (byte-le vel) outperforms T5 (5.35% vs. 5.86% CER on George W ashington Papers) by reconstructing words character -by-character, a voiding the v ocab ulary collapse that harms token-based models. This observation validates our modular paradigm: the visual (a) CVL (Modern): Clean, standard vocabulary . (b) IAM (Cursiv e): High visual ambiguity . (c) Historical (GW): Archaic spellings (”Ser t ”), degradation. Fig. 1: The Domain Gap. W e e v aluate across a difﬁculty spectrum. While token-based models (T5) excel on (a), they struggle with the archaic vocabulary and noise in (c), neces- sitating the byte-lev el reconstruction of ByT5. detector remains ﬁxed across domains, while the linguistic corrector is selected and trained to match the linguistic and orthographic char acteristics of the tar get domain . This ﬂexi- bility , impossible in end-to-end architectures that require full retraining to adjust linguistic beha vior , is key to resource- efﬁcient, domain-adaptiv e OCR. Main Contributions: (1) Appr oximately 95% Compute Reduction: W e demonstrate that our decoupled framew ork requires only 4 GPU hours on a single A100 for domain adaptation, compared to 3–5 days (72–120 hours) on 8 × A100 for end-to-end retraining, while achieving competitiv e accu- racy (1.90% CER on CVL). (2) Annotation-Fr ee Domain Adaptation: W e establish a protocol for training byte- and token-le vel correctors using entirely synthetic domain-speciﬁc noise (without requiring labeled real images), demonstrating that targeted synthetic noise patterns enable robust correction. On IAM, domain-speciﬁc noise yields 5.65% CER with ByT5, narrowing the gap to T5’ s 5.40% compared to random noise (6.35%). (3) Architectural Guidance for Domain Adapta- tion: W e provide the ﬁrst systematic comparison of token- based (T5) vs. byte-based (ByT5) correctors across modern and historical domains, establishing clear architectural recom- mendations: T5 excels on modern vocabularies (1.90% CER on CVL), while ByT5 dominates on archaic or specialized orthography (5.35% vs. 5.86% on George W ashington Papers). The remainder of this paper is organized as follows. Sec- tion II revie ws related work in post-OCR correction, detection- based recognition, and pretrained language models. Section III details the proposed detection module and linguistic correction framew ork. Section IV reports experimental results across three handwriting benchmarks, including a comparison of T5 and ByT5 architectures. Section V discusses the trade-offs between token- and byte-le vel correction, and Section VI concludes with limitations and future work. I I . R E L A T E D W O R K T ext line recognition has e volv ed from early segmentation- based methods to sophisticated end-to-end transformer archi- tectures. W e position our work within this landscape, focusing on: (i) detection-based recognition methods that provide ef- ﬁcient visual front-ends, (ii) pretrained language models for sequence correction, and (iii) end-to-end transformers that represent the current SO T A b ut impose substantial compute requirements. A. End-to-End T ransformer Ar chitectur es for T ext Recognition The current state of the art in te xt line recognition is dominated by large end-to-end transformer models that jointly learn visual encoding and linguistic decoding. TrOCR [3] combines a pretrained vision transformer (V iT) encoder with an autoregressi ve decoder , achie ving strong accuracy across printed, handwritten, and scene text after pretraining on 684M synthetic lines. DTrOCR [4] simpliﬁes this design with a decoder-only architecture, training on 2B lines to achiev e competitiv e performance with lower architectural complexity . These models achiev e state-of-the-art CER on standard benchmarks (IAM: 2.4–3.0%, RIMES: < 2 %) but require substantial computational resources: 200–620 GPU hours for ﬁne-tuning on multi-GPU clusters [3], [4]. The mono- lithic design—jointly optimizing visual and linguistic com- ponents—creates two inefﬁciencies: (i) linguistic decoders are trained from scratch, ignoring pretrained language model knowledge; and (ii) any component improvement requires full end-to-end retraining. W ork such as D AN [33] and Faster - D AN [34] extends this paradigm to page-lev el recognition, further increasing model scale and compute requirements. Our work demonstrates an alter native paradigm : detection-based visual recognition paired with pretrained lan- guage models achie ves competitiv e accuracy with substantially reduced training compute ( ∼ 24 vs. 200+ GPU hours), es- tablishing a viable efﬁciency–accurac y trade-of f for resource- constrained settings. B. Detection-Based T ext Line Recognition Detection-based character recognition has appeared in sev- eral waves. Early Latin-script systems combined character segmentation with HMMs and handcrafted classiﬁers [27], [28], but handwriting’ s cursiv e and overlapping structures made segmentation unreliable. In contrast, Chinese handwrit- ing beneﬁted from more distinct spatial character boundaries, sustaining a longer line of detection-based research [29]–[31]. More recently , DTLR [1] demonstrated that modern transformer-based detectors (DINO-DETR [32]) can localize and classify characters in parallel. Its main insights include ro- bust synthetic pretraining, CTC-based adaptation with minimal supervision, and efﬁcient inference. DTLR achiev es compet- itiv e results on Chinese HTR and cipher recognition, though Latin-script performance remains behind autoregressi ve de- coders when limited to N-gram post-processing. Our work div erges fr om DTLR [1] by replacing its N- gram post-processing with deep, pretrained language model correctors (T5, ByT5). While DTLR focuses on efﬁcient visual detection, we demonstrate that decoupling the visual backbone from the linguistic stage enables modular , annotation-fr ee do- main adaptation via synthetic noise—a capability not present in the original DTLR framew ork. This combination closes the accuracy gap to state-of-the-art end-to-end transformers while preserving the computational efﬁcienc y of detection- based pipelines. C. Pr etrained Language Models for Sequence Corr ection Pretrained language models hav e reshaped sequence correc- tion tasks by learning rich linguistic priors from large corpora. B AR T [6] fuses a bidirectional encoder with an autoregressi ve decoder and is trained as a denoising autoencoder , making it effecti ve for reconstructing corrupted text. T5 [7] frames div erse tasks as text-to-te xt transformations using Sentence- Piece subword tokenization, providing robust performance on modern linguistic data. Despite these strengths, subword tokenization poses chal- lenges for historical OCR. T okenizers are optimized for con- temporary corpora, and thus degrade on archaic spellings, specialized terminology , or OCR noise. This yields a T okeniza- tion Bottleneck : words are broken into incoherent or Out-of- vocab ulary subwords. D. P ost-OCR Err or Corr ection Post-OCR correction is a longstanding strate gy for im- proving text recognition accurac y without retraining visual recognition models. Early methods used rule-based spell- checkers and dictionary lookup [20]. Kolak and Resnik [21] formalized OCR correction within a noisy-channel framework, later e xtended with character confusion models [22], [23]. The advent of neural sequence models enabled data-driven correction: Rigaud et al. [24] applied RNN encoder–decoders to historical documents, while recent work inv estigates byte- lev el models (ByT5 [25]) and large language models with constrained decoding [26]. Our work departs from these approaches by integrating a lightweight, trainable detector as the visual front-end and systematically comparing multiple pretrained language models (T5, ByT5, B AR T). W e show that model choice should depend on document properties such as vocab ulary modernity and script orthography . E. Datasets for Handwriting Recognition Handwriting recognition is commonly e valuated on IAM [19], which contains modern cursiv e English (6,161 training lines). The CVL dataset [45] provides high-quality modern handwriting from 311 writers across sev en texts. Historical ev aluation is typically performed on the George W ashington Papers [46], comprising 18th-century manuscripts exhibiting archaic spelling and physical degradation. Our evaluation spans these three datasets , enabling comparison across modern clean handwriting (CVL), modern cursiv e (IAM), and historical manuscripts (GW). F . Synthetic Data and Pr etraining Synthetic pretraining is widely used in OCR to increase robustness to fonts, styles, and degradations. SynthTIGER [12] provides photorealistic synthetic text. TrOCR and DT rOCR pretrain on hundreds of millions to billions of synthetic lines [3], [4]. DTLR [1] shows that synthetic pretraining with masking enables detection-based models to learn without character-le vel real annotations. W e f ollow DTLR’s strategy for detector pretraining and construct self-supervised correction pairs by applying the detector to real training images. This enables domain-speciﬁc language model ﬁne-tuning at lo w computational cost ( ∼ 6 GPU hours) without additional annotation effort. G. P ositioning of Our Appr oach Our w ork lies at the intersection of three research threads: (i) detection-based recognition (DTLR) [1], providing an efﬁcient visual front-end; (ii) pretrained language models (T5, ByT5, B AR T) [6], [7], [44], offering linguistic correction capabilities; and (iii) post-OCR correction [20], [21], [24]–[26], decoupling visual and linguistic processing. Our key contrib ution is demonstrating that detection- based visual recognition, when paired with pretrained lan- guage models, achiev es near-SO T A accuracy (CVL: 1.90% CER matching end-to-end transformers; IAM: 5.18% CER competitiv e with SO T A 2.89%) with substantially reduced training compute ( ≈ 24 vs 200+ GPU hours). This efﬁciency stems from: (i) reusing pretrained language models rather than training linguistic decoders from scratch, (ii) lightweight detection architectures optimized for visual tasks, and (iii) independent optimization of visual and linguistic components. Unlike end-to-end transformers that conﬂate visual and lin- guistic processing into a single monolithic model [3], [4], [33], [34], our modular pipeline enables compute-ef ﬁcient OCR de- velopment on consumer GPUs, making SO T A-competitive te xt recognition accessible to practitioners without institutional- scale infrastructure. Our systematic ev aluation across modern (CVL), cursiv e (IAM), and historical (GW) handwriting fur- ther demonstrates that corrector choice should match docu- ment characteristics—T5 for modern text, ByT5 for historical documents—a ﬂexibility unav ailable in monolithic architec- tures. I I I . M E T H O D O L O G Y Our frame work decouples te xt line recognition into two stages: (i) a detection-based visual module that localizes and classiﬁes characters in parallel, and (ii) a pretrained language model that corrects detector outputs using domain-speciﬁc linguistic patterns (Figure 2). A. Detection Module W e adopt DINO-DETR [32] as a lightweight transformer detector that decodes learned queries into character predictions via bipartite matching. Follo wing DTLR [1], we train the detector to predict bounding boxes and class labels ov er 167 Latin characters (uppercase/lowercase letters, digits, punctua- tion, and diacritics). a) Synthetic Pretr aining Dataset: W e deﬁne the alphabet A as comprising 167 Latin characters. Synthetic sentences are sampled from W ikipedia. For each text line, we: (i) select a random font with 50% probability of handwriting style; (ii) render the text onto blank page backgrounds; (iii) T ext Line Image Detection-Based V isual Module Backbone + DINO-DETR Character Detections (boxes + labels) Pretrained Language Model Corrector BAR T T5 ByT5 Final T ranscription Synthetic pretraining + CTC-based adaptation Fine-tuned to model domain-speciﬁc noise Fig. 2: Decoupled detection-and-correction architecture. A detection-based visual module localizes and classiﬁes characters in parallel and is pretrained on large-scale synthetic data, follo wed by lightweight domain adaptation using weak supervision (CTC loss). A pretrained language model corrector is ﬁne-tuned to repair residual recognition errors and capture domain- speciﬁc linguistic patterns. Decoupling visual detection from linguistic correction enables efﬁcient adaptation across writing styles and document domains. apply color variation, blur , and structured noise; and (i v) apply block masking (v ertical/horizontal strips) and random erasing to simulate occlusions, ink degradation, and partial visibility . This aggressi ve masking encourages the model to lev erage contextual cues and improv es rob ustness to real-world degradation. b) T raining Objective: Predictions are matched to ground truth via bipartite matching using the Hungarian algorithm, as in DETR-style detectors [32]. The matching cost between a query q and a ground-truth character n is: C ( q , n ) = λ cls L cls ( c n , ˆ c q ) + ⊮ { c n  = ∅} λ box L box ( b n , ˆ b q ) , (1) where L cls is the focal loss [35] and L box combines L1 distance with generalized IoU (GIoU) [36]: L box ( b, ˆ b ) = λ 1 ∥ b − ˆ b ∥ 1 + λ iou GIoU ( b, ˆ b ) . (2) W e follow DTLR and DINO-DETR defaults for all weight coefﬁcients. c) Reusable V isual F r ont-End: W e train the detector once on synthetic Latin script for 225,000 steps. For domain adaptation to a new target domain, we perform a single lightweight ﬁne-tuning pass (approximately 3 hours on a single A100 GPU) using CTC loss with line-level annotations—no character-le vel bounding boxes required. Once ﬁne-tuning completes, the visual backbone is frozen. This ensures that expensive visual encoder training occurs only once , while all subsequent domain adaptation occurs in the language model corrector , achieving the ∼ 95% compute reduction described in Section I. d) CTC-Based Adaptation: Since real datasets lack character-le vel bounding boxes, we ﬁne-tune the detector using the Connectionist T emporal Classiﬁcation (CTC) loss [38] to align predictions with line-lev el ground truth. At inference time, the detector produces Q = 900 query predictions. Since the detector outputs an unordered set, we explicitly sort the predictions by their horizontal center co- ordinate ( x center ) to establish a left-to-right sequence. W e then apply non-maximum suppression(NMS, IoU threshold 0.4), and map the remaining predictions to a CTC-compatible sequence by treating the no-object class as the CTC blank sym- bol. A key modiﬁcation is required: standard CTC collapses consecutiv e identical labels, which would incorrectly merge repeated characters (e.g., “book”) that are spatially distinct. T o preserve such repetitions, we insert an explicit blank token between consecutive predictions prior to CTC decoding. The CTC loss is: L CTC = − log P ( y | { ˆ c m } M m =1 ) , (3) where y is the ground-truth transcription and the probability is computed via the standard forward–backward dynamic program [38]. B. Pr etrained Language Model Corr ection While the detector pro vides strong character localization, residual errors persist due to visual ambiguity , missing func- tion words, and lack of linguistic context. T o address this, we ﬁne-tune pretrained sequence-to-sequence models to map noisy detector outputs ( ˆ y ) to clean transcriptions ( y ). W e consider three architectures spanning tokenization gran- ularities: a) T5-Base (T oken-Level): W e use t5-base (220M pa- rameters), which operates on subword tokens (SentencePiece). T5 excels on modern text with standard vocab ulary due to its pretrained dictionary and strong semantic priors. b) ByT5-Base (Byte-Level): W e use google/byt5-base (582M parameters), which processes UTF-8 bytes directly . ByT5 a v oids tok enization failures on archaic spellings, out-of-vocab ulary terms, and noisy OCR artifacts by reconstructing text character-by- character—making it well-suited for historical documents. c) BART -Base (Denoising Baseline): W e use facebook/bart-base (140M parameters), a denoising autoencoder trained on masking, deletion, and permutation noise. B AR T serves as a strong lightweight baseline. d) T raining Data Construction: For each training image I i with ground-truth transcription y i , we run the ﬁne-tuned detector to obtain ˆ y i and construct the paired dataset D corr = { ( ˆ y i , y i ) } N train i =1 . W e use the best validation checkpoint (lowest CER) to a void overﬁtting to a transient detector state. This pairing is self-supervised and requires no annotation beyond line-lev el labels. T o improve robustness, we apply text augmentations to ˆ y i with 20% probability , including character substitutions (from confusion sets), insertions, and deletions, enabling the model to generalize beyond detector-speciﬁc error patterns. C. Synthetic Noise T raining Strate gies T o enable efﬁcient domain adaptation without large labeled corpora, we train correctors on synthetic noise applied to clean text. W e explore two complementary strategies: a) Random P erturbation: For general adaptation, we ap- ply stochastic edit operations (substitution p = 0 . 05 , insertion p = 0 . 03 , deletion p = 0 . 03 ) to clean text corpora. This exposes models to common OCR failure modes encountered across domains. W e ﬁnd that random perturbation alone is often sufﬁcient for BAR T and T5, which beneﬁt from strong pretrained lexical priors. b) Cursive-Collapse Noise: For cursiv e handwriting (e.g., IAM), random noise is insuf ﬁcient to capture realistic visual ambiguities arising from connected script. W e introduce a specialized “Cursi ve-Collapse” noise process that probabilis- tically applies domain-speciﬁc confusions: • Merges: ‘rn’ → ‘m’, ‘cl’ → ‘d’, ‘vv’ → ‘w’, ‘ii’ → ‘u’. • Splits: ‘m’ → ‘nn’, ‘w’ → ‘uu’, ‘u’ → ‘rn’. • Shape Confusions: ‘l’ ↔ ‘1’, ‘e’ ↔ ‘c’, ‘a’ ↔ ‘o’. T raining ByT5 on byte-lev el reconstruction with structured noise encourages it to disentangle cursiv e ambiguities. Abla- tion studies (Section IV -E) sho w that ByT5 reaches 5.65% CER with Cursive-Collapse noise versus 6.35% with random noise on IAM, demonstrating the importance of domain-aware synthetic noise for cursiv e script. D. T raining Hyperparameter s Language model adapters are trained with task-speciﬁc conﬁgurations to account for architectural and tokenization differences: a) T5-Base: Learning rate 5 × 10 − 5 , batch size 16, maximum sequence length 128 tokens, 10 epochs with early stopping (patience 3), AdamW optimizer . T otal training time ∼ 4–6 hours on A100. b) ByT5-Base: Learning rate 1 × 10 − 4 (higher due to byte-lev el granularity), batch size 8 with gradient accumula- tion steps 2 (ef fecti ve batch 16), maximum sequence length 256 bytes, 10 epochs with early stopping (patience 3), BF16 mixed precision, fused AdamW . T raining time ∼ 6–8 hours on A100. c) BART -Base: Learning rate 3 × 10 − 5 , batch size 16, maximum sequence length 128 tokens, 5 epochs, 500 warmup steps, AdamW with weight decay 0.01, FP16 mixed precision. T raining time ∼ 4–5 hours on A100. All models are trained with token-le vel cross-entropy loss under teacher forcing. W e select the checkpoint with lowest validation CER. E. T raining and Infer ence Efﬁciency a) T raining Efﬁciency: V isual detector ﬁne-tuning re- quires ∼ 3 hours per domain on a single A100 GPU. Lan- guage model adaptation requires 4–8 hours depending on model size (Section III-C). Thus, total domain adaptation cost is ∼ 22–26 GPU hours (3 hours detection + 4–8 hours correction), compared to 200–600 GPU hours reported for T rOCR/DTrOCR [3], [4]. This corresponds to an ∼ 10–25 × reduction in training cost while achieving competitive accu- racy . b) Infer ence Ef ﬁciency: At inference time, the detec- tion module processes a text line in ∼ 40–60 ms on A100; language model correctors add ∼ 20–40 ms, resulting in end- to-end latency of ∼ 80–120 ms/line (8–12 lines/sec). T rOCR processes lines in ∼ 100–150 ms, indicating comparable online throughput despite signiﬁcantly cheaper training. c) Sour ces of Efﬁciency: Three factors contribute to the observed ef ﬁciency gains: (i) pretrained language models eliminate the need to train linguistic decoders from scratch, (ii) character detection pro vides a lightweight visual front- end, and (iii) visual and linguistic modules can be optimized independently , eliminating full end-to-end backpropagation across large vision-language stacks. I V . E X P E R I M E N T S W e e v aluate our decoupled detection-and-correction frame- work across three handwriting benchmarks representing a spectrum of linguistic and visual difﬁculty . Our experiments are designed to answer three questions: 1) Domain Adaptation: Can a ﬁxed detector combined with a lightweight language model adapter generalize to cursiv e and historical domains without expensiv e end- to-end retraining? 2) T oken vs. Byte: How do subword (T5) and byte- lev el (ByT5) architectures compare when handling Out- of-V ocabulary (OO V) historical text and orthographic variation? 3) Efﬁciency: Can competitiv e accuracy be achieved using only synthetic noise training, eliminating the need for labeled real images in the correction stage? A. Datasets W e ev aluate on three datasets chosen to represent distinct handwriting “difﬁculty tiers”: • CVL (Modern / Clean): Modern English handwriting with standardized spelling and high-quality scans. Serves as a “control” modern domain. • IAM (Modern / Cursive): The standard benchmark for unconstrained cursiv e handwriting. While the vocab ulary is modern, visual ambiguity is high due to connected script and div erse handwriting styles. • George W ashington (GW) (Historical / Noisy): 18th- century manuscript pages with archaic orthography (e.g., long-s), paper degradation, and historical vocabulary . Represents the “hard” domain requiring linguistic recon- struction. T ABLE I: D A TA S E T C H A R A C T E R I S T I C S A C RO S S T H R E E H A N DW R I T I N G D O M A I N S . Dataset Domain V ocab ulary V isual Noise CVL Modern Standard Low IAM Modern Standard High (cursi ve) GW Historical Archaic High (degradation) B. Experimental Setup a) F ixed Detector: T o simulate a resource-constrained deployment scenario, we freeze the character detection model after domain-speciﬁc ﬁne-tuning. The detector is not retrained when comparing corrector architectures. This ensures that all performance differences are attributable to the linguistic correction modules, enabling a fair architectural comparison. b) Synthetic Noise T raining: Unlike prior work that relies on real paired OCR data, we train all correctors entirely on synthetic text augmented with domain-speciﬁc perturbations. For IAM (cursi ve), we simulate connected-script collapses (e.g., ‘rn’ → ‘m’, ‘cl’ → ‘d’). F or GW (historical), we inject archaic spelling variants and degradation artifacts. This enables annotation-free domain adaptation and remov es the need for labeled real images in the correction stage. c) Corr ector Arc hitectur es: W e ev aluate three pretrained sequence models alongside a detector-only baseline: • Baseline : Raw detector output with NMS and left-to-right sorting. • BAR T -Base : Denoising autoencoder (140M parameters) trained on token masking, deletion, and inﬁlling [6]. • T5-Base : T oken-based seq2seq model (220M parameters) using SentencePiece tokenization [7]. • ByT5-Base : Byte-lev el transformer (582M parameters) operating directly on UTF-8 bytes without tokenization. All correctors are trained with identical hyperparameters (batch size 32, learning rate 2 × 10 − 5 , 5 epochs with early stopping) on the same synthetic datasets to isolate architectural effects. C. Main Results: The P ar eto F r ontier T ables II, III, and IV report Character Error Rate (CER) across the three handwriting domains. The results rev eal a consistent architectural trade-off between token-le vel, byte- lev el, and denoising models. On the CVL dataset (T able II), T5-Base achieves the lowest CER at 1.90% , le veraging its pretrained SentencePiece vocab ulary to map noisy character sequences to v alid lexical forms. B AR T reaches 1.95% and ByT5 reaches 1.98%, all of which are competitiv e with end-to-end transformer models on this “clean” domain. The trend re verses on George W ashington (GW) (T a- ble IV). Here, ByT5-Base outperforms T5-Base ( 5 . 35% vs. 5 . 86% CER). The GW dataset contains archaic orthography and OO V proper nouns, which interact poorly with T5’ s sub- word vocab ulary , fragmenting words into incoherent tokens. ByT5 operates directly on UTF-8 bytes and reconstructs words character-by-character without relying on a ﬁxed vocab ulary , yielding superior performance in historical settings. On IAM (T able III), B AR T -Base achie ves the best CER at 5.18% , followed by T5-Base at 5.40%. In contrast, ByT5- Base regresses to 6.35% CER, underperforming e ven the detector-only baseline (6.09%). This sharp degradation high- lights the sensitivity of byte-lev el models to the quality of synthetic noise: random perturbations produce byte patterns that do not reﬂect realistic cursiv e collapses (e.g., ‘rn’ → ‘m’), leading ByT5 to learn implausible corrections. BAR T and T5, by contrast, beneﬁt from pretrained token-lev el lexical priors that smooth ov er such noise. T ogether , these ﬁndings rev eal a Par eto frontier acr oss cor - rector architectur es : BAR T -Base excels on modern cursiv e handwriting, T5-Base dominates on clean standard-vocabulary text, and ByT5-Base is essential for historical documents with archaic orthography . This frontier motiv ates a modular approach in which correctors are selected to match linguistic and orthographic characteristics of the target domain. D. Efﬁciency Analysis A key advantage of our decoupled architecture is its dra- matically reduced computational cost for domain adaptation (see the “ Adaptation Cost” columns in T ables II and III). Across domains, our full pipeline requires approximately 3 . 5 – 4 . 5 GPU hours on a single A100 ( ∼ 3 hours for detector ﬁne- tuning + 0 . 5 – 1 . 5 hours for corrector training). In contrast, T rOCR training requires 200 – 600 GPU hours on 8 × A100 infrastructure [3], corresponding to a 57 – 171 × reduction in training compute. This enables state-of-the-art handwriting recognition under resource constraints and with- out institutional-scale hardware. Furthermore, because correctors are trained on synthetic noise alone , the linguistic adaptation stage does not require labeled real images, making the pipeline fully annotation-free beyond line-level supervision for detector ﬁne-tuning. E. Ablation: Synthetic Noise Quality W e in vestigate the impact of synthetic noise design on corrector performance. On IAM, training ByT5 on generic random perturbations (character substitution p = 0 . 05 , inser- tion/deletion p = 0 . 03 ) yields poor performance: 6 . 35% CER, which underperforms the detector-only baseline ( 6 . 09% ). In contrast, T5 achie ves 5 . 40% CER under the same random- noise regime, indicating a dif ferential robustness to noise quality between token-le vel and byte-lev el models. Byte-level models are sensitive to synthetic noise. ByT5 learns character-le vel reconstruction patterns; generic ran- dom noise produces incoherent byte sequences that do not correspond to realistic OCR failure modes. When trained T ABLE II: C V L ( M O D E R N C L E A N H A N D W R I T I N G ) R E S U LT S . Method Architectur e Params T raining Data Adaptation Cost W . Acc (%) CER (%) MGP-STR † Multi-Granularity V iT 148M 100M+ Synthetic 100+ GPU hrs 82.30 — T5-Base (Ours) Linguistic Adapter 220M 1.5M Synthetic 3 + 0.5 GPU hrs 78.10 1.90 B AR T -Base (Ours) Linguistic Adapter 140M 1.5M Synthetic 3 + 0.5 GPU hrs 77.82 1.95 ByT5-Base (Ours) Linguistic Adapter 582M 1.5M Synthetic 3 + 1.0 GPU hrs 76.45 1.98 T rOCR-Base ‡ End-to-End T ransform. 334M 684M Synth Lines 200+ GPU hrs 74.50 3.42 Detector Baseline Object Detector Only 40M 1M Synthetic 3 GPU hrs 72.84 3.71 † Results reported from original publications. MGP-STR primarily reports W ord Accuracy; CER is not av ailable for this benchmark. ‡ Reported training time includes the massiv e pre-training stage required for con ver gence. Note: For our adapter models, cost is shown as ( visual ﬁne-tuning + linguistic adaptation ). T ABLE III: I A M ( M O D E R N C U R S I V E H A N D W R I T I N G ) R E S U LT S . Method Architectur e Params T raining Data Adaptation Cost CER (%) DT rOCR † Decoder-Only Transform. 300M 2B+ Synthetic Lines > 500 GPU hrs 2.38 T rOCR-Large ‡ End-to-End T ransform. 558M 684M Synthetic Lines 300+ GPU hrs 2.89 T rOCR-Base ‡ End-to-End T ransform. 334M 684M Synthetic Lines 200+ GPU hrs 3.42 BAR T -Base (Ours) Linguistic Adapter 140M 1.5M Synthetic 3 + 0.5 GPU hrs 5.18 T5-Base (Ours) Linguistic Adapter 220M 1.5M Synthetic 3 + 0.5 GPU hrs 5.40 ByT5-Base (Ours) Linguistic Adapter 582M 1.5M Synthetic 3 + 1.0 GPU hrs 6.35 Detector Baseline Object Detector Only 40M 1M Synthetic 3 GPU hrs 6.09 † Results reported from original publications. DTrOCR utilizes a billion-scale pre-training strategy . ‡ T rOCR training time includes the massiv e pre-training stage required for transformer con vergence. Note: For our adapter models, cost is presented as ( visual ﬁne-tuning + linguistic adaptation ). T ABLE IV: G E O R G E W A S H I N G T O N ( H I S T O R I C A L H A N D W R I T I N G ) R E S U LT S . Method Architectur e Evaluation CER (%) WER (%) Kumari et al. (2022) CNN + BGR U + CTC Supervised 4.88 14.56 BAR T -Base Linguistic Adapter Annotation-Free 5.20 13.80 ByT5-Base Byte-Lev el Adapter Annotation-Free 5.35 14.20 T5-Base T oken-Level Adapter Annotation-Free 5.86 15.10 Detector Baseline (Ours) DINO-DETR V isual Supervised 3.70 11.20 Note: Recent work using multimodal LLMs (Claude 3.5 Sonnet, Unlocking Archi ves 2024) achieved 5.7% CER and 8.9% WER on a different 18th/19th-century corpus, demonstrating comparable annotation-free performance trends. on domain-speciﬁc Cursive-Collapse noise—injecting re- alistic character confusions such as ‘m’ → ‘rn’, ‘cl’ → ‘d’, ‘l’ ↔ ‘1’—ByT5 improv es substantially to 5 . 65% CER, be- coming competitiv e with T5. This validates the importance of domain-aware synthetic noise design for ef fecti ve byte-lev el correction. T oken-le vel models (T5) are more rob ust to noise quality due to their pretrained lexical priors. Even unrealistic pertur- bations can help T5 by forcing it to le verage semantic and contextual constraints to reconstruct v alid token sequences. Howe ver , this robustness comes with a trade-of f: token-lev el models struggle with OO V terms and historical orthography that ByT5 can handle at the byte lev el. F . Qualitative Analysis: The Semantic T rade-off T able V illustrates the behavioral div ergence between model architectures. On the CVL test set, the T5-Base adapter achiev ed a 12:1 corr ection ratio (168 ﬁxes vs. 14 degrada- tions), reﬂecting strong robustness to noisy detector outputs. Semantic Prior “Knowledge” Effect. A key observation is T5’ s reliance on internal language priors over purely visual evidence. In sev eral “successful” cases, T5 correctly restored corrupted named entities—e.g., ”Heinz Zamek” → ”Heinz Zemanek” —lev eraging pretrained world kno wledge. Such cor- rections would be difﬁcult for byte-le vel models lacking access to semantic priors. Modernization Bias and Orthographic Drift. This seman- tic strength becomes problematic in settings requiring diplo- matic transcription. W e observe a systematic modernization bias in which T5 standardizes archaic German spellings ( fr e y → fr ei ) and reduces inﬂectional v ariation ( Aug enblic ke → Aug enblick ). More concerningly , the model occasionally hal- lucinates contextually plausible phrases—e.g., inserting “Zeit f ¨ ur mich” into a sentence about a death bell ( T odtenglocke ). These ﬁndings highlight a trade-of f: token-based correctors are well-suited for improving r eadability and sear chability , but they can compromise orthographic ﬁdelity . For archi val and humanities workﬂows where historical spelling is semantically meaningful, byte-level correctors or detector-only outputs may be preferable. G. Infer ence Speed W e measure inference latency on a single A100 GPU using greedy decoding. The visual detector contrib utes ap- T ABLE V: Q U A L I T A T I V E E R R O R A N A L Y S I S ( C V L D A T A S E T ) . Category Input (Detector Output) T5 Prediction Mechanism / Error T ype Success (English) olider cultivatet plands older cultivated plants Phonetic Repair (Fixes multi-token noise) plys on a qunte plays on a quote Contextual Disambiguation (plys → plays) by Heinz Zamek by Heinz Zemanek Knowledge Prior (Restores Named Entity) Failur e (German) deines Dienstes fr ey deines Dienstes frei Modernization Bias (Archaic y → i ) Zum Augenblic ke Zum A ugenblick Grammatical Flattening (Dativ e → Nom.) T odtenglocke schallen Zeit f ¨ ur mich schallen Semantic Hallucination (Context Leakage) proximately 50 ms per line. For sequential processing (batch size = 1 ), the T5 and ByT5 correctors add ∼ 170 ms and ∼ 380 ms per line, respectiv ely . For large-scale digitization, batched inference substantially amortizes the linguistic cor- rection cost, reducing ef fective latency to 7 – 10 ms per line. This separation of visual and linguistic components enables both interactive thr oughput (low-latenc y sequential decoding) and ar chival throughput (high-throughput batched decoding) without architectural changes. H. Summary Across three handwriting domains (CVL, IAM, GW), our experiments demonstrate that: (1) modular correctors achiev e competitiv e accuracy with end-to-end transformers (e.g., CVL: 1 . 90% CER) at 57 – 171 × lower training cost, (2) architectural choice should be domain-aw are—T5 excels on clean modern text, whereas ByT5 is necessary for historical domains with archaic orthography , (3) synthetic noise quality is critical for byte-lev el models but less so for token-lev el models, and (4) annotation-free adaptation via synthetic training is both feasible and effecti ve. Collectiv ely , these ﬁndings validate the modular paradigm as a practical, resource-efﬁcient alternativ e to monolithic end- to-end OCR systems for real-world deployment scenarios. V . C O N C L U S I O N W e introduced a modular and resource-efﬁcient frame work for text line recognition that decouples visual character detec- tion from linguistic correction. This design achieves state-of- the-art accuracy on modern handwriting (CVL: 1.90% CER) while reducing training cost by 57–171× compared to end- to-end transformers. Our results highlight that architectural specialization matters: token-lev el models (T5) perform best on clean modern text, whereas byte-lev el models (ByT5) are essential for historical domains with archaic orthography , yielding a practical Pareto frontier for corrector selection. More broadly , decoupling enables detectors to be trained once and reused, while lightweight adapters, trained with synthetic noise, enable annotation-free domain adaptation. This makes competitiv e OCR attainable on consumer hardware and sup- ports div erse digitization scenarios. Limitations . Our approach has se veral limitations. First, correction quality degrades when the detector produces unin- telligible outputs (e.g., sev erely de graded images with less than 20% character accuracy); in such cases, the corrector cannot reconstruct missing visual information. Second, a remaining gap to state-of-the-art end-to-end models (1.8 CER points on CVL) suggests that joint visual, linguistic optimization still holds advantages for maximizing absolute accuracy . Third, the quality of synthetic noise is critical for ByT5; poorly designed corruption can harm performance rather than help (Section IV -E). Future W ork . Promising directions include: (1) joint opti- mization of detector and corrector to reduce error propagation, (2) distilled adapters for ultra-low-latenc y deployment, (3) multilingual extension via mB AR T for non-Latin scripts, and (4) automatic noise generation tailored to detector -speciﬁc error patterns instead of hand-crafted confusion sets. The modular framework facilitates such extensions, allo wing prac- titioners to plug in new noise models or corrector architectures as needed. R E F E R E N C E S [1] R. Baena, S. Kalleli, and M. Aubry , “General Detection-based T ext Line Recognition, ” in Advances in Neural Information Processing Systems (NeurIPS) , 2024. [2] D. Hernandez Diaz, S. Qin, R. Ingle, Y . Fujii, and A. Bissacco, “Rethink- ing T ext Line Recognition Models, ” arXiv preprint 2021. [3] M. Li, T . Lv , J. Chen, L. Cui, Y . Lu, D. Florencio, C. Zhang, Z. Li, and F . W ei, “TrOCR: T ransformer-based Optical Character Recognition with Pre-trained Models, ” arXiv preprint arXiv:2109.10282, 2022. [4] M. Fujitake, “DT rOCR: Decoder-only T ransformer for Optical Character Recognition, ” arXiv preprint arXiv:2308.15996, 2023. [5] T . W ang, Y . Zhu, L. Jin, C. Luo, and X. Chen, “Page-Lev el Document Attention Networks for OCR, ” in Pr oc. Int. Conf. on Document Analysis and Recognition (ICDAR) , 2021. [6] M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy , V . Stoyanov , and L. Zettlemoyer , “B AR T : Denoising Sequence- to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, ” in Pr oc. Assoc. Comput. Linguist. (ACL) , 2019. [7] C. Raffel, N. Shazeer , A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W . Li, and P . J. Liu, “Exploring the Limits of Transfer Learning with a Uniﬁed T ext-to-T ext T ransformer, ” J. Mac h. Learn. Res. , v ol. 21, no. 140, pp. 1–67, 2020. [8] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, and C. Raffel, ”ByT5: T owards a T oken-Free Future with Pre- trained Byte-to-Byte Models, ” T rans. Assoc. Comput. Linguist. , vol. 10, pp. 291–306, 2022. [9] W . Qi, Y . Y an, Y . Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou, “ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, ” in Pr oc. EMNLP , 2020. [10] Y . Liu, J. Gu, N. Goyal, X. Li, S. Eduno v , M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual Denoising Pre-training for Neural Machine T ranslation, ” in Pr oc. Assoc. Comput. Linguist. (ACL) , 2020. [11] J. Zhang, Y . Zhao, M. Saleh, and P . Liu, “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, ” in Proc. Int. Conf. on Machine Learning (ICML) , 2020. [12] M. Y im, Y . Kim, H.-C. Cho, and S. Park, “SynthTIGER: Synthetic T ext Image Generator towards Better Scene T ext Recognition, ” in Pr oc. Int. Conf. on Document Analysis and Recognition (ICDAR) , 2021. [13] A. R. Katti, C. Reisswig, S. Guder, J. Brunner, J. Faddoul, O. Fink, and U. Brunner, “Chargrid: T owards Understanding 2D Documents, ” in Pr oc. Eur . Conf. Comput. V ision (ECCV) , 2018. [14] Y . Xu, M. Li, L. Cui, S. Huang, F . W ei, and M. Zhou, “LayoutLM: Pre-training of T ext and Layout for Document Image Understanding, ” in Pr oc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD) , 2019. [15] Z. Huang, C. Bouza, Y . Gao, C. Zhang, F . W ei, and J. W ang, “Lay- outLMv3: Pre-training for Document AI with Uniﬁed T ext and Image Masking, ” in Pr oc. Assoc. Comput. Linguist. (ACL) , 2022. [16] J. Luo, L. Zhang, S. W ang, and W . Bai, “LayoutLLM: Layout Instruction T uning with Large Language Models for Document AI, ” in Pr oc. IEEE/CVF Conf. Comput. V ision and P attern Recognition (CVPR) , 2024. [17] G. Kim, M. Kim, and S. Park, “Donut: OCR-free Document Under- standing Transformer , ” in Proc. Int. Conf. on Learning Representations (ICLR) , 2022. [18] L. Photes, K. Zhang, and D. W atanabe, “What Makes OCR Different in 2025? Impact of Multimodal LLMs and Specialized OCR Models, ” AI Syst. Rev . , 2025. [19] U.-V . Marti and H. Bunke, “The IAM-database: An English sentence database for ofﬂine handwriting recognition, ” Int. J. Document Anal. Recognit. , vol. 5, no. 1, pp. 39–46, 2002. [20] M. Khushi, K. Shaukat, T . M. Alam, and I. A. Hameed, “Customised OCR Correction for Historical Medical T ext, ” in Proc. IEEE Int. Conf. on Data Science and Advanced Analytics (DSAA) , 2015. [21] O. K olak and P . Resnik, “OCR error correction using a noisy channel model, ” in Pr oc. Second Int. Conf. on Human Language T echnology Resear ch (HLT) , 2002, pp. 257–262. [22] W . B. Lund, D. J. Kennard, and E. K. Ringger , “OCR Error Correction Using Character Correction and Feature-Based W ord Classiﬁcation, ” in Pr oc. IAPR W orkshop on Document Analysis Systems (DAS) , 2016, pp. 198–203. [23] P . Thompson, J. McNaught, and S. Ananiadou, “ A T ool for Facilitating OCR Postediting in Historical Documents, ” in Pr oc. 9th SIGHUM W ork- shop on Language T echnology for Cultural Heritage , Social Sciences, and Humanities , 2015, pp. 119–130. [24] C. Rigaud, A. Doucet, M. Coustaty , and J.-P . Moreux, “Neural OCR Post-Hoc Correction of Historical Corpora, ” T rans. Assoc. Comput. Linguist. , vol. 9, pp. 479–493, 2021. [25] M. E. B. V eninga, “LLMs for OCR Post-Correction, ” Master’ s thesis, Univ . of T wente, 2024. [26] E. Alvarez-Mellado et al. , “Post-OCR Correction Using Lar ge Language Models with Constrained Decoding, ” arXiv preprint, 2025. [27] Y . Bengio, Y . LeCun, C. Nohl, and C. Burges, “LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition, ” Neural Comput. , vol. 7, no. 6, pp. 1289–1303, 1995. [28] N. Arica and F . T . Y arman-V ural, “Optical Character Recognition for Cursiv e Handwriting, ” IEEE T rans. P attern Anal. Mach. Intell. , vol. 24, no. 6, pp. 801–813, 2002. [29] D. Peng, L. Jin, W . Ma et al. , “ A F ast and Accurate Fully Conv olutional Network for End-to-End Handwritten Chinese T ext Segmentation and Recognition, ” in Pr oc. Int. Conf . on Document Analysis and Recognition (ICD AR) , 2019, pp. 25–30. [30] D. Peng, L. Jin, Y . W u, Z. W ang, and M. Cai, “Recognition of Handwritten Chinese T ext by Segmentation: A Segment-Annotation- Free Approach, ” IEEE Tr ans. Multimedia , vol. 23, pp. 3496–3507, 2020. [31] D. Y u, X. Li, C. Zhang et al. , “T owards Accurate Scene T ext Recognition with Semantic Reasoning Networks, ” IEEE T rans. P attern Anal. Mach. Intell. , vol. 44, no. 10, pp. 6594–6607, 2021. [32] H. Zhang, F . Li, S. Liu et al. , “DINO: DETR with Improv ed Denoising Anchor Boxes for End-to-End Object Detection, ” in Pr oc. Int. Conf. on Learning Repr esentations (ICLR) , 2023. [33] D. Coquenet, C. Chatelain, and T . Paquet, “DAN: A Se gmentation-Free Document Attention Network for Handwritten Document Recognition, ” IEEE T rans. P attern Anal. Mach. Intell. , vol. 45, no. 7, pp. 8227–8243, 2023. [34] D. Coquenet, C. Chatelain, and T . Paquet, “Faster DAN: Multi-T arget Queries with Document Positional Encoding for End-to-End Handwrit- ten Document Recognition, ” in Proc. Int. Conf. on Document Analysis and Recognition (ICDAR) , 2023, pp. 182–199. [35] T .-Y . Lin, P . Goyal, R. Girshick, K. He, and P . Doll ´ ar , “Focal Loss for Dense Object Detection, ” in Proc. IEEE Int. Conf. Computer V ision (ICCV) , 2017, pp. 2980–2988. [36] H. Rezatoﬁghi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Sa varese, “Generalized Intersection o ver Union: A Metric and a Loss for Bounding Box Regression, ” in Pr oc. IEEE/CVF Conf. Computer V ision and P attern Recognition (CVPR) , 2019, pp. 658–666. [37] D. P . Kingma and J. Ba, “ Adam: A Method for Stochastic Optimization, ” in Pr oc. Int. Conf. on Learning Representations (ICLR) , 2015. [38] A. Graves, S. Fern ´ andez, F . Gomez, and J. Schmidhuber , “Connectionist T emporal Classiﬁcation: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ” in Proc. 23r d Int. Conf. on Machine Learning (ICML) , 2006, pp. 369–376. [39] P . Krishnan and C. V . Jawahar , “HWNet v2.0: An Efﬁcient W ord Image Representation for Handwritten Documents, ” Int. J. Document Anal. Recognit. , vol. 19, pp. 167–177, 2016. [40] B. Barz and J. Denzler, “The GoodNotes Handwriting Dataset, ” 2020. [Online]. A vailable: https://www .goodnotes.com/gnhk [41] A. Radford, J. W u, R. Child, D. Luan, D. Amodei, and I. Sutskev er , “Language Models are Unsupervised Multitask Learners, ” OpenAI Blog , vol. 1, no. 8, p. 9, 2019. [42] S. M. Pizer, E. P . Amburn, J. D. Austin et al. , “ Adaptive Histogram Equalization and Its V ariations, ” Comput. V ision Graph. Image Pr ocess. , vol. 39, no. 3, pp. 355–368, 1987. [43] K. Heaﬁeld, “K enLM: Faster and Smaller Language Model Queries, ” in Pr oc. Sixth W orkshop on Statistical Machine Tr anslation , 2011, pp. 187–197. [44] K. Song, X. T an, T . Qin, J. Lu, and T .-Y . Liu, “MASS: Masked Sequence to Sequence Pre-training for Language Generation, ” in Pr oc. Int. Conf. on Machine Learning (ICML) , 2019. [45] F . Kleber, S. Fiel, M. Diem, and R. Sablatnig, ”CVL-Database: An Off-Line Database for Writer Retriev al, Writer Identiﬁcation and W ord Spotting, ” in Pr oc. Int. Conf. on Document Analysis and Recognition (ICD AR) , 2013, pp. 560–564. [46] Library of Congress, ”George W ashington Papers, Series 2, Letter- books 1754-1799, ” [Online]. A vailable: https://www .loc.gov/collections/ george- washington- papers/

Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment