BookNet: Book Image Rectification via Cross-Page Attention Network
Book image rectification presents unique challenges in document image processing: binding constraints induce complex geometric distortions in which the left and right pages exhibit distinctly asymmetric curvature. Existing single-page document image rectification methods, however, fail to capture the coupled geometric relationship between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread while explicitly modeling how the left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.
💡 Research Summary
BookNet addresses the long‑standing problem of rectifying photographs of bound books, where the left and right pages exhibit asymmetric curvature due to binding constraints. Existing document dewarping methods focus on single‑page images and therefore cannot model the coupled geometric relationship between adjacent pages. The authors propose BookNet, the first end‑to‑end deep learning framework explicitly designed for dual‑page book image rectification.
The core of BookNet is a dual‑branch architecture equipped with cross‑page attention. Given a distorted book image containing both pages, the network predicts three dense flow fields: a left‑page flow (Ml), a right‑page flow (Mr), and a full‑spread flow (Mf). During training all three flows are supervised, but at inference only the full‑spread flow is used to warp the input image via differentiable bilinear sampling.
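The inference step described above can be sketched in a few lines. The sketch below shows how a predicted full-spread flow Mf could warp the distorted input via differentiable bilinear sampling (here PyTorch's `grid_sample`); the tensor shapes, the backward-mapping convention, and the normalization details are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch (assumed shapes/conventions): warping an image with a
# dense 2-D flow field via differentiable bilinear sampling.
import torch
import torch.nn.functional as F

def warp_with_flow(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W); flow: (B, 2, H, W) backward mapping in pixels.

    Each output pixel (x, y) samples the input at (x, y) + flow(x, y).
    """
    b, _, h, w = image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)
    coords = base + flow                       # absolute source coordinates
    # Normalize to [-1, 1] as grid_sample expects (x by W, y by H).
    coords_x = coords[:, 0] / (w - 1) * 2 - 1
    coords_y = coords[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([coords_x, coords_y], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)

img = torch.rand(1, 3, 64, 96)
zero_flow = torch.zeros(1, 2, 64, 96)
out = warp_with_flow(img, zero_flow)  # a zero flow is the identity warp
```

Because the sampling is differentiable, the same operator also serves during training, letting the reconstruction loss back-propagate through the predicted flows.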
Feature extraction begins with a lightweight ResNet‑style CNN backbone that downsamples the input to 1/8 resolution, followed by a four‑layer Transformer encoder with multi‑head self‑attention and learnable 2‑D positional embeddings. This encoder captures both local deformation cues (e.g., curved text lines) and global spatial dependencies across the whole book spread.
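A minimal sketch of this encoder pipeline, under assumed channel widths, head counts, and feature-map sizes (the paper does not specify these here), might look as follows:

```python
# Hedged sketch: CNN backbone downsampling to 1/8 resolution, followed by a
# 4-layer Transformer encoder with learnable 2-D positional embeddings.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class BookEncoder(nn.Module):
    def __init__(self, dim=128, feat_h=32, feat_w=32, layers=4, heads=8):
        super().__init__()
        # Three stride-2 stages: H x W -> H/8 x W/8.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One learnable positional vector per 2-D feature location.
        self.pos = nn.Parameter(torch.zeros(1, feat_h * feat_w, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x):
        f = self.cnn(x)                        # (B, dim, H/8, W/8)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H/8 * W/8, dim)
        return self.transformer(tokens + self.pos[:, : h * w])

enc = BookEncoder()
feats = enc(torch.rand(2, 3, 256, 256))  # tokens over the whole spread
```

Flattening the feature map into tokens before self-attention is what lets every location attend across the entire spread, including across the spine.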
The dual‑branch decoder contains two parallel branches, each initialized with learnable query embeddings (Ql for the left page, Qr for the right page). In the first decoding stage each branch independently attends to the shared encoder features, extracting page‑specific deformation patterns. The second stage introduces bidirectional cross‑page attention, allowing the left and right branches to exchange information and model how the deformation of one page influences the other. After decoding, the low‑resolution flows are up‑sampled and fused to produce the high‑resolution full‑spread flow Mf.
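The two-stage decoding scheme can be sketched with standard attention modules. In this hedged sketch, `q_left`/`q_right` stand in for the learnable queries Ql and Qr; the query count, dimensions, and the residual combination at the end are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch of the two-stage dual-branch decoder.
# Stage 1: each page's queries cross-attend to shared encoder features.
# Stage 2: bidirectional cross-page attention exchanges information.
import torch
import torch.nn as nn

class DualBranchDecoder(nn.Module):
    def __init__(self, dim=128, n_queries=256, heads=8):
        super().__init__()
        self.q_left = nn.Parameter(torch.randn(1, n_queries, dim))
        self.q_right = nn.Parameter(torch.randn(1, n_queries, dim))
        # Stage 1: page-specific attention over encoder features.
        self.attn_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: bidirectional cross-page attention.
        self.cross_lr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_rl = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        b = feats.size(0)
        ql = self.q_left.expand(b, -1, -1)
        qr = self.q_right.expand(b, -1, -1)
        # Stage 1: extract page-specific deformation features.
        l1, _ = self.attn_l(ql, feats, feats)
        r1, _ = self.attn_r(qr, feats, feats)
        # Stage 2: each branch attends to the other branch's output.
        l2, _ = self.cross_lr(l1, r1, r1)
        r2, _ = self.cross_rl(r1, l1, l1)
        return l1 + l2, r1 + r2  # residual combination (an assumption)

dec = DualBranchDecoder()
left, right = dec(torch.rand(2, 1024, 128))
```

The decoded left/right query features would then be projected and up-sampled into the low-resolution page flows before fusion into the full-spread flow Mf.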
Training employs a multi‑task loss: L1 loss on each flow, smoothness regularization to enforce spatial coherence, and an image reconstruction loss that penalizes differences between the warped output and the ground‑truth flat scan. This combination encourages accurate geometry while preserving visual fidelity.
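The three loss terms can be written down compactly. This is a minimal sketch under assumed loss weights and a first-order (total-variation-style) smoothness term; the paper states the term types but not these exact forms or weights.

```python
# Hedged sketch of the multi-task objective: L1 supervision on all three
# flows, a first-order smoothness regularizer, and an L1 reconstruction
# term on the warped output. Weights w_smooth/w_rec are illustrative.
import torch
import torch.nn.functional as F

def smoothness(flow):
    """Penalize spatial gradients of a (B, 2, H, W) flow field."""
    dx = (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    dy = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean()
    return dx + dy

def booknet_loss(pred_flows, gt_flows, warped, gt_flat,
                 w_smooth=0.1, w_rec=1.0):
    """pred_flows/gt_flows: dicts with keys 'left', 'right', 'full'."""
    flow_l1 = sum(F.l1_loss(pred_flows[k], gt_flows[k])
                  for k in ("left", "right", "full"))
    smooth = sum(smoothness(pred_flows[k]) for k in ("left", "right", "full"))
    rec = F.l1_loss(warped, gt_flat)
    return flow_l1 + w_smooth * smooth + w_rec * rec

flows = {k: torch.zeros(1, 2, 32, 32) for k in ("left", "right", "full")}
img = torch.rand(1, 3, 256, 256)
loss = booknet_loss(flows, flows, img, img)  # zero when predictions match
```

Supervising the per-page flows alongside the full-spread flow gives the branches direct gradient signal even though only Mf is used at inference.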
A major bottleneck for this research area is the lack of suitable data. To fill this gap the authors create two datasets. Book3D is a synthetic training set of 56,000 high‑resolution book images generated with Blender’s Cycles renderer. Realistic 3D book meshes are textured with content extracted from arXiv papers, ensuring authentic typography, equations, tables, and figures. For each sample the authors provide dense UV maps, ground‑truth flow fields, 3D coordinate maps, and binary masks. Book100 is a real‑world benchmark comprising 100 photographs taken with a consumer smartphone under diverse lighting conditions, indoor and outdoor settings, and camera viewpoints, each paired with a high‑quality reference scan obtained from a professional overhead document camera. The benchmark covers multiple languages, subjects, and layout complexities, offering a comprehensive testbed for future methods.
Extensive experiments on both synthetic and real data demonstrate that BookNet outperforms state‑of‑the‑art single‑page dewarping models such as DocUNet, DewarpNet, DocTR, and DocTR++. Quantitatively, BookNet achieves higher MS‑SSIM, lower LPIPS, and reduced mean absolute error, especially around page boundaries where previous methods suffer from stitching artifacts. Qualitatively, the rectified images show smooth, continuous text lines and correctly aligned equations across the spine, confirming that the cross‑page attention successfully captures inter‑page dependencies.
The authors acknowledge limitations: the current approach predicts 2‑D flows and does not explicitly reconstruct 3‑D geometry or model page thickness variations. Real‑time deployment on mobile devices also requires further model compression and hardware‑aware optimization. Future work may explore 3‑D shape estimation, multi‑view fusion, lightweight Transformer variants, and domain adaptation techniques to bridge the synthetic‑real gap.
In summary, BookNet introduces a novel cross‑page attention mechanism and a multi‑flow prediction strategy that together enable accurate, end‑to‑end rectification of dual‑page book photographs. By releasing the Book3D synthetic dataset and the Book100 real‑world benchmark, the authors provide valuable resources that are expected to catalyze further research in book digitization and related document‑image processing tasks.