ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink
Digital ink – the coordinate stream captured from stylus or touch input – lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
💡 Research Summary
The paper addresses the long‑standing problem of how to represent digital ink—sequences of (x, y) coordinates captured from a stylus or touch device—in a way that is both compact and amenable to modern sequence models. Existing approaches fall into two camps. Continuous‑vector representations (e.g., Point‑3, Point‑5) keep the raw coordinates (often with pen‑up/down flags) and feed them to mixture‑density networks (MDNs) for generation. While conceptually simple, these methods produce very long sequences, require careful normalization, and suffer from numerical instability and mode collapse during training. Token‑based representations, on the other hand, discretize the ink into symbols, enabling Byte‑Pair Encoding (BPE) compression and stable cross‑entropy training. However, prior tokenizations either need large base vocabularies that grow with canvas resolution (AbsTokens, RelTokens), suffer from out‑of‑vocabulary (OOV) failures, or generate fragile syntactic strings that can become unparsable (TextTokens). Moreover, previous token schemes underperform vectors on handwriting recognition benchmarks.
ScribeTokens proposes a fundamentally different tokenization that eliminates all three drawbacks. The method first quantizes the continuous coordinates onto a uniform integer grid with spacing δ (the authors use δ = 8). For each pair of successive points (including points between strokes), the line segment is rasterized using Bresenham’s algorithm, which yields a deterministic sequence of adjacent grid cells. These adjacent moves are encoded with the eight directions of a Freeman chain code (→, ↑, ←, ↓, ↗, ↖, ↙, ↘). Two additional tokens signal pen-down and pen-up transitions, yielding a fixed base vocabulary of just ten tokens — eight directions plus two pen states — that is independent of canvas resolution and, by construction, free of out-of-vocabulary failures.
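The pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names (`quantize`, `tokenize`) and the y-up direction convention are assumptions, and the pen-state tokens are shown only as named constants.

```python
# Hedged sketch of a ScribeTokens-style tokenizer: quantize points to a grid
# (spacing delta = 8, per the paper), rasterize each segment with Bresenham's
# algorithm, and emit one Freeman chain-code token per unit step.
# Assumes y increases upward; real ink data may use the opposite convention.

PEN_DOWN, PEN_UP = "PEN_DOWN", "PEN_UP"  # the two pen-state tokens (names illustrative)

# Eight unit steps mapped to Freeman chain-code direction tokens.
DIRS = {(1, 0): "→", (0, 1): "↑", (-1, 0): "←", (0, -1): "↓",
        (1, 1): "↗", (-1, 1): "↖", (-1, -1): "↙", (1, -1): "↘"}

def quantize(points, delta=8):
    """Snap continuous (x, y) coordinates onto an integer grid of spacing delta."""
    return [(round(x / delta), round(y / delta)) for x, y in points]

def bresenham(p0, p1):
    """Yield the 8-connected grid cells on the segment p0 -> p1 (excluding p0)."""
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while (x0, y0) != (x1, y1):
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
        yield (x0, y0)

def tokenize(points, delta=8):
    """Convert a point sequence into direction tokens via Bresenham unit steps."""
    grid = quantize(points, delta)
    tokens = []
    for p0, p1 in zip(grid, grid[1:]):
        prev = p0
        for cell in bresenham(p0, p1):
            step = (cell[0] - prev[0], cell[1] - prev[1])
            tokens.append(DIRS[step])  # every Bresenham step is one of 8 directions
            prev = cell
    return tokens

# Example: a segment from (0, 0) to (24, 8) lands on grid cells (0,0) -> (3,1),
# which Bresenham rasterizes as right, up-right, right.
print(tokenize([(0, 0), (24, 8)]))  # → ['→', '↗', '→']
```

Because each successive Bresenham cell differs from the previous one by exactly one of the eight unit steps, every segment — however long — decomposes into tokens from this fixed vocabulary, which is what makes the subsequent BPE compression safe to apply.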