SynJAC: Synthetic-data-driven Joint-granular Adaptation and Calibration for Domain Specific Scanned Document Key Information Extraction
Visually Rich Documents (VRDs), comprising elements such as charts, tables, and paragraphs, convey complex information across diverse domains. However, extracting key information from these documents remains labour-intensive, particularly for scanned formats with inconsistent layouts and domain-specific requirements. Despite advances in pretrained models for VRD understanding, their dependence on large annotated datasets for fine-tuning hinders scalability. This paper proposes SynJAC (Synthetic-data-driven Joint-granular Adaptation and Calibration), a method for key information extraction in scanned documents. SynJAC leverages synthetic, machine-generated data for domain adaptation and employs calibration on a small, manually annotated dataset to mitigate noise. By integrating fine-grained and coarse-grained document representation learning, SynJAC significantly reduces the need for extensive manual labelling while achieving competitive performance. Extensive experiments demonstrate its effectiveness in domain-specific and scanned VRD scenarios.
💡 Research Summary
SynJAC addresses the challenging problem of key information extraction (KIE) from visually rich documents (VRDs), especially scanned documents that suffer from noisy layouts and domain‑specific structures. The framework consists of four main components.

First, a synthetic data generation pipeline automatically extracts layout information from a large unannotated corpus using OCR tools, then employs large language models (LLMs) to produce fine‑grained BIO tags for each token and to generate question‑answer pairs that are aligned with layout entities via fuzzy matching. These synthetic annotations provide structural, semantic, and task‑oriented knowledge without manual effort.

Second, a joint‑granular architecture encodes both fine‑grained (word‑level) and coarse‑grained (entity‑level) representations. The novel Layout‑to‑Vector (L2V) encoder transforms bounding‑box coordinates into dense vectors, enriching spatial context and enabling cross‑granular attention that lets the two granularities interact and complement each other.

Third, three domain adaptation strategies are introduced: Structural Domain Shifting (SDS) aligns the distribution of synthetic and real layouts; Synthetic Sequence Tagging (SST) fine‑tunes the fine‑grained branch on synthetic BIO tags; and Synthetic Instruction Tuning (SIT) adapts the coarse‑grained branch with the generated QA pairs. Together they reduce both structural and task‑specific domain gaps.

Finally, a guidance‑based calibration stage leverages a small manually annotated set to re‑weight and pool the knowledge from synthetic and human data, mitigating the noise inherent in synthetic labels. Extensive experiments on finance, education, and receipt domains demonstrate that SynJAC achieves competitive or superior F1 scores while reducing annotation effort by an order of magnitude compared to fully supervised baselines.
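The fuzzy-matching step that aligns LLM-generated answers with OCR layout entities could look like the sketch below. The paper does not specify its matcher, so this uses Python's stdlib `difflib.SequenceMatcher` with a hypothetical similarity threshold; the function name and entity strings are illustrative, not from the paper.

```python
from difflib import SequenceMatcher

def fuzzy_align(answer: str, entities: list[str], threshold: float = 0.8):
    """Return the OCR entity whose text best matches the LLM-generated
    answer, or None if no entity clears the similarity threshold.
    Illustrative sketch; the paper's exact matching rule is not given."""
    best_entity, best_score = None, 0.0
    for entity in entities:
        score = SequenceMatcher(None, answer.lower(), entity.lower()).ratio()
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity if best_score >= threshold else None

entities = ["Invoice No: 10423", "Total: $1,250.00", "Date: 2021-03-14"]
print(fuzzy_align("Total $1250.00", entities))  # → Total: $1,250.00
```

Tolerating small OCR or generation discrepancies (missing punctuation, thousands separators) is exactly what makes fuzzy rather than exact matching necessary here.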
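The summary does not detail the internals of the Layout-to-Vector (L2V) encoder, so the following is only a hypothetical sketch of the general idea: normalize the four bounding-box coordinates to the page size and expand each into a small set of sinusoidal features, producing one dense layout vector per token. All dimensions and frequency choices here are assumptions.

```python
import numpy as np

def layout_to_vector(bbox, page_w, page_h, dim=16):
    """Sketch of an L2V-style encoding (not the paper's architecture):
    normalize (x0, y0, x1, y1) to [0, 1], then expand each coordinate
    into sin/cos features at a few frequencies."""
    x0, y0, x1, y1 = bbox
    coords = np.array([x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h])
    freqs = 2.0 ** np.arange(dim // 8)          # dim//8 frequencies per coordinate
    angles = np.pi * coords[:, None] * freqs[None, :]
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return feats.reshape(-1)                    # dense vector of shape (dim,)

# Example: a token bounding box on a US-letter page scanned at 72 dpi.
vec = layout_to_vector((100, 40, 380, 64), page_w=612, page_h=792, dim=16)
```

A dense, learnable-friendly representation of position is what allows the cross-granular attention described above to mix spatial context with token and entity semantics.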
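One simple way to ground the calibration idea: measure how well the synthetic BIO tags agree with the small human-annotated set, and use that agreement to down-weight noisy synthetic supervision. This is an illustrative stand-in, not the paper's guidance-based calibration rule; the tag labels below are made up.

```python
def synthetic_label_agreement(synth_tags, human_tags):
    """Token-level agreement between synthetic BIO tags and a small
    manually annotated reference. A score like this could serve as a
    crude weight on the synthetic-supervised loss (illustrative only)."""
    matches = sum(s == h for s, h in zip(synth_tags, human_tags))
    return matches / max(len(human_tags), 1)

synth = ["B-TOTAL", "I-TOTAL", "O", "B-DATE", "O"]
human = ["B-TOTAL", "I-TOTAL", "O", "O", "O"]
w = synthetic_label_agreement(synth, human)  # 4 of 5 tags agree → 0.8
```

The key point mirrored here is that a handful of trusted human labels can cheaply estimate, and compensate for, the noise level of the much larger synthetic set.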
Ablation studies confirm that L2V and the joint‑granular design each contribute significant performance gains. The paper concludes that synthetic data, when combined with a carefully calibrated joint‑granular model, provides a scalable solution for domain‑specific KIE in scanned VRDs, and suggests future work on improving synthetic data quality, integrating multimodal LLMs, and real‑time deployment.