A human-inspired recognition system for premodern Japanese historical documents
Recognition of historical documents is a challenging problem due to noisy, damaged characters and backgrounds. Japanese historical documents pose additional difficulties: beyond these problems, pre-modern Japanese characters were written cursively and are often connected, so character-segmentation-based methods do not work well. This motivates a new recognition system. In this paper, we propose a human-inspired document reading system that recognizes multiple lines of pre-modern Japanese historical documents. While reading, people use eye movements to locate the start of a text line, then move their eyes from the current character/word to the next. They can also detect the end of a line or skip a figure to move on to the next line. These eye movements integrate with visual processing in the brain to carry out the reading process. We implement this behavior with an attention-based encoder-decoder. First, the system detects where a text line starts. Second, it scans and recognizes character by character until the text line is completed. It then detects the start of the next text line, repeating this process until the whole document has been read. We tested our human-inspired recognition system on the pre-modern Japanese historical documents provided by the PRMU Kuzushiji competition. The experimental results demonstrate the effectiveness of our proposed system, which achieves Sequence Error Rates of 9.87% and 53.81% on level 2 and level 3 of the dataset, respectively, outperforming all other systems that participated in the PRMU Kuzushiji competition.
💡 Research Summary
The paper addresses the long‑standing challenge of automatically recognizing pre‑modern Japanese historical documents, specifically Kuzushiji (cursive Japanese script), which are characterized by noisy, damaged backgrounds and highly connected, cursive characters. Traditional OCR pipelines rely on a two‑stage process: first detecting text lines, then segmenting those lines into individual characters before classification. In the case of Kuzushiji, line and character segmentation are extremely error‑prone because characters often merge into each other and the documents suffer from stains, insect damage, and uneven illumination. Consequently, segmentation errors cascade and severely degrade overall recognition performance.
To overcome these limitations, the authors propose a “human‑inspired” reading system that mimics the way a human reader’s eyes move across a page. Human reading consists of rapid saccades interleaved with fixations; a reader first locates the start of a line, fixates on a character or word, moves to the next, and finally detects the end of the line or skips non‑textual elements before proceeding to the next line. The proposed system implements this sequential behavior using an end‑to‑end attention‑based encoder‑decoder architecture.
The encoder is built on DenseNet, a densely connected convolutional network that concatenates the feature maps of all preceding layers to each subsequent layer. This design preserves low‑level details while simultaneously capturing high‑level semantics, which is crucial for distinguishing faint, partially occluded strokes. The network begins with a 48‑filter convolution followed by max‑pooling, then passes the image through three dense blocks (each with a configurable growth rate K and depth D) interleaved with transition layers that halve the number of feature maps to control memory usage. The output is a rich feature tensor of size H × W × C.
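The dense-connectivity pattern described above can be sketched as follows. This is a minimal, hypothetical PyTorch rendering of one dense block: each layer's output is concatenated onto all preceding feature maps, so channel count grows by the growth rate `K` at each of `D` layers (the paper's exact layer composition and hyperparameters may differ).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a DenseNet dense block: every layer receives the
    concatenation of all preceding feature maps (illustrative only;
    growth_rate = K and depth = D are configurable, as in the paper)."""
    def __init__(self, in_channels, growth_rate, depth):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(depth):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3,
                          padding=1, bias=False),
            ))
            channels += growth_rate  # each layer adds K feature maps

    def forward(self, x):
        for layer in self.layers:
            # concatenate new features with everything produced so far
            x = torch.cat([x, layer(x)], dim=1)
        return x
```

With an input of 48 channels (matching the initial 48-filter convolution), a block with growth rate 24 and depth 4 emits 48 + 4 × 24 = 144 feature maps; a transition layer would then reduce this count before the next block.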
The decoder is an LSTM that generates one character token at a time. At each time step t, the decoder receives three inputs: (1) the embedding of the previously generated token y_{t‑1}, (2) the current hidden state h_{t‑1}, and (3) a context vector c_t computed by a soft‑attention mechanism over the encoder’s feature map. The attention weights α_{t}(u, v) are a normalized function of the compatibility between the current hidden state and each spatial location (u, v). To emulate the human ability to remember previously fixated regions, the model incorporates a coverage vector Cov_t that accumulates past attention probabilities. The coverage vector is added to the attention scoring function, thereby discouraging the model from repeatedly attending to already‑read areas and encouraging it to move forward, just as the human visual system does.
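The coverage-augmented attention step can be sketched as below. This is an illustrative PyTorch module, not the paper's exact formulation: the additive (Bahdanau-style) scoring function and the projection dimensions are assumptions, but the key idea matches the description — the accumulated attention probabilities `Cov_t` enter the scoring function, penalizing locations that have already been read.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    """Soft attention over flattened encoder features with a coverage
    vector (sketch; scoring function and dimensions are assumptions)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(hidden_dim, attn_dim)  # decoder state term
        self.W_f = nn.Linear(feat_dim, attn_dim)    # encoder feature term
        self.W_c = nn.Linear(1, attn_dim)           # coverage term
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, h, feats, coverage):
        # h: (B, hidden_dim); feats: (B, H*W, feat_dim); coverage: (B, H*W)
        score = self.v(torch.tanh(
            self.W_h(h).unsqueeze(1)
            + self.W_f(feats)
            + self.W_c(coverage.unsqueeze(-1))
        )).squeeze(-1)                              # (B, H*W)
        alpha = F.softmax(score, dim=1)             # attention weights
        context = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)
        coverage = coverage + alpha                 # accumulate history
        return context, alpha, coverage
```

At each decoding step the returned context vector `c_t` is fed to the LSTM together with the embedding of the previous token, and the updated coverage vector is carried forward to the next step.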
Training optimizes the cross‑entropy loss over the target character sequence using the AdaDelta optimizer with gradient clipping. Mini‑batches of size 8 are used, and early stopping is triggered when the validation error does not improve for 15 consecutive epochs. The LSTM hidden size is set to 256.
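A minimal sketch of this training setup is shown below. The model, data, and clipping threshold are stand-ins (the paper does not state the clipping value here); only the optimizer choice, mini-batch size of 8, and 15-epoch early-stopping patience mirror the description above.

```python
import torch
import torch.nn as nn

# Stand-in model and data; only the optimizer, clipping, and early
# stopping reflect the training setup described in the summary.
model = nn.Linear(16, 4)
optimizer = torch.optim.Adadelta(model.parameters())
criterion = nn.CrossEntropyLoss()

best_val, patience, wait = float("inf"), 15, 0
for epoch in range(100):
    x = torch.randn(8, 16)                # mini-batch of size 8
    y = torch.randint(0, 4, (8,))
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # gradient clipping (max_norm=5.0 is an assumed value)
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()

    val_loss = loss.item()                # stand-in for validation error
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:              # 15 epochs without improvement
            break
```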
The system is evaluated on the PRMU Kuzushiji competition dataset, which provides three difficulty levels. The authors focus on Level 2 (three vertically aligned characters per line) and Level 3 (unrestricted multi‑line text with variable character counts). The dataset comprises 56,097 training, 6,233 validation, and 16,835 test images for Level 2, and 10,118 training, 1,125 validation, and 1,340 test images for Level 3, drawn from 2,222 scanned pages of 15 historical books.
Results show a Sequence Error Rate (SER) of 9.87 % on Level 2 and 53.81 % on Level 3, outperforming all other entries in the competition, including approaches that combine CNNs with BLSTMs, multi‑task layout analysis, and modular pipelines that separately train line detection, segmentation, and recognition. The improvement is especially pronounced on Level 3, where traditional segmentation‑based methods suffer from severe error propagation due to the highly connected cursive script. By processing the image as a whole and directly modeling the sequential reading process, the proposed system avoids the need for explicit line or character segmentation.
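For reference, the Sequence Error Rate metric can be computed as below, assuming the common definition: the percentage of test sequences whose predicted transcription does not exactly match the reference (the competition's precise scoring rules may add details not shown here).

```python
def sequence_error_rate(predictions, references):
    """Fraction (in %) of sequences with any mismatch against the
    reference transcription (illustrative definition of SER)."""
    errors = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * errors / len(references)

# One wrong transcription out of four sequences:
print(sequence_error_rate(["あい", "うえ", "おか", "きく"],
                          ["あい", "うえ", "おか", "きぐ"]))  # → 25.0
```

Under this metric a single character error anywhere in a line counts the whole sequence as wrong, which explains why Level 3's unrestricted multi-line pages yield a much higher SER than Level 2's fixed three-character lines.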
Key contributions of the work are:
- Introduction of a human‑inspired eye‑movement model for document reading, translating start‑detect, fixate, move, and end‑detect actions into a neural sequence model.
- Integration of DenseNet for robust feature extraction with a coverage‑augmented attention mechanism that mimics memory of previously read regions.
- End‑to‑end training that eliminates the segmentation bottleneck, leading to state‑of‑the‑art performance on a challenging historical script dataset.
The authors acknowledge limitations: the current start‑line detection relies on global attention and may struggle with complex layouts containing tables, illustrations, or multi‑column formats. The coverage vector, being a simple cumulative sum of attention probabilities, may not fully capture long‑range dependencies in very long documents. Future work is suggested in three directions: (a) coupling the model with explicit layout analysis modules to handle heterogeneous page structures, (b) exploring Transformer‑based architectures that can model global context more effectively, and (c) deploying the system in real‑world digitization projects to assess speed, scalability, and usability for archivists and scholars.