Profiling of OCRed Historical Texts Revisited
In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR’ed text. Yet, for interactive postcorrection of OCR’ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in Reffle (2013) computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Here we improve this method in three respects: First, the method in Reffle (2013) is not adaptive: user feedback obtained by actual postcorrection steps cannot be used to compute refined profiles. We introduce a variant of the method that is open for adaptivity, taking correction steps of the user into account. This leads to higher precision with respect to recognition of erroneous OCR tokens. Second, during postcorrection often new historical patterns are found. We show that adding new historical patterns to the linguistic background resources leads to a second kind of improvement, enabling even higher precision by telling historical spellings apart from OCR errors. Third, the method in Reffle (2013) does not make any active use of tokens that cannot be interpreted in the underlying channel model. We show that adding these uninterpretable tokens to the set of conjectured errors leads to a significant improvement of the recall for error detection, at the same time improving precision.
💡 Research Summary
The paper revisits the OCR error profiling approach originally presented by Reffle (2013) and introduces three substantial enhancements that make the technique more useful for interactive post‑correction of historical printings. The original method builds a statistical profile of OCR errors by combining modern lexica, a set of historical spelling patterns, and a specialized Expectation‑Maximization (EM) algorithm that jointly estimates three probability distributions: (V) the likelihood of modern word forms, (O) the likelihood of OCR error types, and (H) the likelihood of historical pattern applications. However, the original framework suffers from three major limitations: (1) it is not adaptive – user‑provided corrections during post‑correction cannot be fed back into the model; (2) the historical pattern set is static and does not cover many period‑specific spellings that appear in newly examined corpora; and (3) tokens that cannot be interpreted by the underlying channel model are simply discarded, which reduces recall.
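The joint estimation described above can be sketched as a toy EM loop. The candidate structures, pattern names, and normalization below are illustrative assumptions for exposition, not the paper's actual data model: each OCR token has candidate interpretations, a candidate pairs a modern word with the historical patterns and OCR error patterns needed to derive the token, and the loop alternates between weighting candidates under the current pattern probabilities and re-estimating those probabilities from the weighted counts.

```python
from collections import defaultdict

# Hypothetical toy input: OCR token -> candidate interpretations.
# A candidate is (modern_word, historical_patterns, ocr_error_patterns).
CANDIDATES = {
    "theyl": [
        ("teil", ("t:th", "ei:ey"), ()),   # pure historical spelling
        ("teil", ("t:th",), ("i:y",)),     # partly an OCR error
    ],
    "vnd": [
        ("und", ("u:v",), ()),             # classic u -> v historical pattern
    ],
}

def profile(candidates, iterations=10):
    """Minimal EM sketch: E-step scores every candidate under the current
    pattern probabilities; M-step re-estimates each pattern's probability
    as its expected fraction of tokens (a deliberately crude normalizer)."""
    hist_p = defaultdict(lambda: 0.5)   # P(historical pattern applied)
    ocr_p = defaultdict(lambda: 0.5)    # P(OCR error pattern applied)
    for _ in range(iterations):
        hist_counts, ocr_counts = defaultdict(float), defaultdict(float)
        total = 0.0
        for token, cands in candidates.items():
            # E-step: score each candidate as the product of its patterns
            scores = []
            for _, hists, ocrs in cands:
                s = 1.0
                for p in hists:
                    s *= hist_p[p]
                for p in ocrs:
                    s *= ocr_p[p]
                scores.append(s)
            z = sum(scores) or 1.0
            # Accumulate expected pattern counts, weighted per candidate
            for (_, hists, ocrs), s in zip(cands, scores):
                w = s / z
                for p in hists:
                    hist_counts[p] += w
                for p in ocrs:
                    ocr_counts[p] += w
                total += w
        # M-step: expected count / number of tokens
        for p in hist_counts:
            hist_p[p] = hist_counts[p] / total
        for p in ocr_counts:
            ocr_p[p] = ocr_counts[p] / total
    return dict(hist_p), dict(ocr_p)

hist, ocr = profile(CANDIDATES)
```

On this toy input the frequent `t:th` and `u:v` patterns keep higher probability than the rarer ones, which is the qualitative behaviour the profiler relies on; the real method additionally folds in word-frequency distributions and a far richer candidate generation step.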
To address these issues, the authors propose:
- Adaptive Profiling – When a user corrects a token, the corrected ground‑truth word (w_gt) is inserted directly into the model. The system then computes an optimal OCR trace τ_ocr between w_gt and the observed OCR token without imposing a hard limit on the number of OCR errors. For the historical channel, the method either uses a single interpretation if w_gt is already in the modern lexicon or the traced historical lexicon, or it runs the full matching procedure (allowing up to three pattern applications) to generate candidate modern equivalents and historical traces. The corrected token is also added to the “untraced historical lexicon” so that future rounds can benefit from the new entry. During EM re‑estimation, tokens without a historical interpretation are ignored when updating the historical pattern probabilities, which prevents noisy updates.
- Extended Historical Pattern Set – The original pattern collection derived from the IMPACT project covered many German historical spellings but missed numerous period‑specific transformations. The authors manually inspected the two test corpora (a 1557 Basel text and a 1609 Strasbourg text), identified additional rewrite rules (e.g., specific ligature replacements, regional orthographic variants), and incorporated them into the pattern set. This richer rule base yields more accurate historical‑to‑modern mappings, thereby reducing false positives where a genuine historical spelling would otherwise be flagged as an OCR error.
- Inclusion of Uninterpretable Tokens – Previously, tokens that exceeded the allowed number of OCR errors or pattern applications were ignored. The new approach treats such tokens as “conjectured errors” and adds them to the error candidate pool. By doing so, the recall of error detection improves dramatically, while precision is not harmed because the EM step re‑weights probabilities based on the overall evidence.
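The alignment step behind the adaptive variant, computing an optimal OCR trace between the corrected word and the observed token with no cap on the number of errors, amounts to an edit-distance alignment with backtracing. The sketch below is a plain Levenshtein version (the function name and the trace's output format as substitution/insertion/deletion pairs are illustrative assumptions, not the paper's interface):

```python
def ocr_trace(gt, ocr):
    """Levenshtein alignment between a ground-truth word and an OCR token.
    Returns the list of (gt_char, ocr_char) edit operations, i.e. the trace,
    with no hard limit on the number of errors."""
    n, m = len(gt), len(ocr)
    # Standard dynamic-programming distance table
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt[i - 1] == ocr[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to recover the edit operations that make up the trace
    trace, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (
                0 if gt[i - 1] == ocr[j - 1] else 1):
            if gt[i - 1] != ocr[j - 1]:
                trace.append((gt[i - 1], ocr[j - 1]))  # substitution error
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            trace.append((gt[i - 1], ""))              # deletion error
            i -= 1
        else:
            trace.append(("", ocr[j - 1]))             # insertion error
            j += -1
    return list(reversed(trace))

print(ocr_trace("unterschied", "unterschled"))  # -> [('i', 'l')]
```

Each pair in the returned trace is a candidate OCR error pattern; in the adaptive setting such traces, derived from user corrections, feed directly into the re-estimation of the OCR channel probabilities.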
The evaluation uses two German historical documents, each containing roughly 5,000 transcribed tokens with gold‑standard ground truth from the RIDGES corpus. OCR was performed with the OCRopus engine trained on a broad German historical corpus. Experiments compare five configurations: (i) baseline (original Reffle method), (ii) baseline + adaptivity, (iii) baseline + extended patterns, (iv) baseline + uninterpretable‑token inclusion, and (v) the full combination of all three improvements. Results show that each individual enhancement yields measurable gains: adaptivity improves precision by about 8 % on average, pattern extension raises both precision and recall by roughly 5–7 %, and adding uninterpretable tokens boosts recall by around 12 % with a modest precision increase. When all three are combined, the system achieves over 15 % higher precision and more than 20 % higher recall compared to the original method, demonstrating a synergistic effect.
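The precision and recall figures quoted above follow the usual definitions for error detection over token sets; a minimal sketch, with hypothetical token IDs standing in for flagged and truly erroneous tokens:

```python
def detection_scores(flagged, true_errors):
    """Precision and recall for OCR error detection.
    flagged: token IDs the profiler marks as conjectured errors.
    true_errors: token IDs that are erroneous per the ground truth."""
    tp = len(flagged & true_errors)                      # true positives
    precision = tp / len(flagged) if flagged else 0.0    # flagged and wrong
    recall = tp / len(true_errors) if true_errors else 0.0
    return precision, recall

# Toy example: 4 tokens flagged, 4 tokens actually erroneous, 3 overlap
p, r = detection_scores({1, 2, 3, 4}, {2, 3, 4, 5})
print(p, r)  # -> 0.75 0.75
```

Under these definitions, adding uninterpretable tokens to the conjectured-error pool can only add flagged items, so recall rises whenever those tokens are genuine errors, and precision survives as long as most of them are.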
Beyond quantitative gains, the study highlights practical implications for large‑scale digitisation projects. Even when only a small fraction of tokens (≈2–3 % of the corpus) are manually corrected, the adaptive model quickly converges to a more accurate profile, substantially reducing the manual effort required for subsequent post‑correction passes. The authors also discuss remaining challenges, such as extending the approach to multilingual corpora, handling more complex error channels (e.g., merged or split tokens), and integrating richer linguistic resources like morphological analyzers.
In summary, the paper presents a well‑grounded, empirically validated extension of OCR error profiling for historical texts. By making the model adaptive, enriching historical spelling knowledge, and leveraging previously ignored tokens, it delivers a more reliable statistical profile that can guide interactive correction, improve OCR quality estimation, and ultimately accelerate the digitisation of cultural heritage documents.