A Separable Architecture for Continuous Token Representation in Language Models

Notice: This research summary and analysis were automatically generated with AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Transformer scaling-law analyses typically treat parameters as interchangeable, an abstraction that predicts loss-compute relationships well. In sub-billion-parameter small language models (SLMs), however, embedding matrices dominate the parameter budget, and this work argues that the allocation is both suboptimal and counterintuitive. Leviathan is an architecture that replaces the discrete lookup tables of canonical models with a continuous embedding generator. Evaluated on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard LLaMA-style baseline. An empirical power-law fit shows that Leviathan has markedly higher effective parameter capacity: across the regime studied, it behaves like a dense model with $1.47\times$ to $2.11\times$ more parameters.


💡 Research Summary

The paper tackles a largely overlooked inefficiency in small language models (SLMs): the embedding matrix consumes a disproportionate share of the total parameter budget, especially when vocabularies are large (e.g., 200 k tokens). Traditional remedies—weight tying between input embeddings and the output head, or low‑rank factorization as in ALBERT—still retain a linear dependence on vocabulary size and force the input and output spaces to share the same geometry, limiting expressive power.

To address this, the authors introduce Leviathan, a Transformer architecture that replaces the static lookup table with a continuous token generator. The generator works in three stages. First, each token index i is decomposed into a k‑dimensional coordinate on a grid, with each coordinate component ranging over roughly √V values. Shared codebooks C₁…C_k map each coordinate component to a latent seed vector z(i) of dimension d_seed, compressing the indexing cost from O(V) to O(k·√V). After a dense projection, layer norm, and sigmoid, the seed is confined to the unit hypercube, yielding ˜z(i).
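The first stage can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the mixed-radix decomposition, summation over codebook rows, and the per-vector layer-norm approximation are all assumptions about wiring details the summary leaves open.

```python
import numpy as np

def decompose_index(i, k, base):
    """Decompose a token index into k coordinates in a mixed-radix grid (base ~ sqrt(V))."""
    coords = []
    for _ in range(k):
        coords.append(i % base)
        i //= base
    return coords

rng = np.random.default_rng(0)
V, k = 200_000, 2
base = int(np.ceil(V ** (1 / k)))   # ~448 values per coordinate for V = 200k
d_seed = 64

# Shared codebooks C_1..C_k: k small tables instead of one V x d_seed table.
codebooks = [rng.normal(size=(base, d_seed)) for _ in range(k)]
W_proj = rng.normal(size=(d_seed, d_seed)) / np.sqrt(d_seed)

def seed_vector(i):
    coords = decompose_index(i, k, base)
    # Aggregating codebook rows by summation is an assumption for illustration.
    z = sum(codebooks[r][c] for r, c in enumerate(coords))
    h = z @ W_proj                                # dense projection
    h = (h - h.mean()) / (h.std() + 1e-6)         # layer norm (single vector)
    return 1.0 / (1.0 + np.exp(-h))               # sigmoid -> unit hypercube

z_tilde = seed_vector(123_456)
```

The point of the decomposition is parameter count: two codebooks of 448 rows each replace a 200k-row table, while the shared projection mixes the coordinate embeddings into one seed.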

Second, each component of ˜z(i) is fed into a univariate B‑spline basis ϕ_r(·). By taking tensor‑product combinations of these bases, the model builds a set of rank‑1 separable modes and approximates a smooth high‑dimensional surface M(x) ≈ Σ_j Π_r ϕ_{r,j}(x_r). Stone‑Weierstrass guarantees that, with enough modes, this construction can approximate any continuous function.
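The separable-mode construction above can be illustrated with degree-1 B-splines (hat functions) standing in for whatever spline order the paper actually uses; the number of modes, bases, and the coefficient layout here are illustrative assumptions.

```python
import numpy as np

def hat_basis(x, centers, width):
    """Degree-1 B-spline (hat) bases at scalar inputs x; returns shape (len(x), len(centers))."""
    return np.clip(1.0 - np.abs(x[:, None] - centers[None, :]) / width, 0.0, None)

rng = np.random.default_rng(1)
d, n_modes, n_basis = 4, 8, 10            # input dims, rank-1 modes, bases per dim
centers = np.linspace(0.0, 1.0, n_basis)
width = centers[1] - centers[0]
# Per-mode, per-dimension basis coefficients (learned in the real model).
coeffs = rng.normal(size=(n_modes, d, n_basis))

def separable_surface(x):
    """M(x) = sum_j prod_r phi_{r,j}(x_r), with phi a coefficient-weighted hat expansion."""
    out = np.ones((x.shape[0], n_modes))      # x: (batch, d) points in the unit hypercube
    for r in range(d):
        B = hat_basis(x[:, r], centers, width)   # (batch, n_basis)
        out *= B @ coeffs[:, r, :].T             # phi_{r,j}(x_r) for every mode j
    return out.sum(axis=1)                       # sum over rank-1 modes

x = rng.uniform(size=(5, d))
y = separable_surface(x)
```

Each mode is a product of one univariate function per input dimension, so evaluating M costs O(d · n_basis · n_modes) rather than anything exponential in d; adding modes grows the family of surfaces the sum can represent.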

Third, a final dense layer with a residual connection projects the aggregated modes back to the embedding space, yielding the token embedding e_i. Crucially, the parameters of this generator are token‑agnostic; they are shared across the entire vocabulary, encouraging the network to learn structured similarity among tokens rather than storing isolated vectors.
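One plausible reading of the third stage, assuming the per-mode products are kept as an n_modes-vector before the final projection (the exact residual wiring is not specified in the summary):

```python
import numpy as np

rng = np.random.default_rng(2)
d_seed, n_modes, d_model = 32, 16, 128

# Hypothetical weights: W_in lifts the seed for the residual path,
# W_out projects the aggregated modes into the embedding space.
W_in = rng.normal(size=(d_seed, d_model)) / np.sqrt(d_seed)
W_out = rng.normal(size=(n_modes, d_model)) / np.sqrt(n_modes)

def token_embedding(z_tilde, mode_activations):
    """Project aggregated modes to d_model and add a residual path from the seed."""
    residual = z_tilde @ W_in
    return mode_activations @ W_out + residual    # e_i in R^{d_model}

e = token_embedding(rng.uniform(size=d_seed), rng.normal(size=n_modes))
```

Note that W_in, W_out, the codebooks, and the spline coefficients are all shared across tokens: nothing in the generator scales with V except the k small codebooks.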

Leviathan’s generator is plugged into a standard LLaMA‑style decoder‑only Transformer. The output head remains a dense classifier W_class∈ℝ^{D×V}, and the model can be trained with the usual next‑token cross‑entropy loss.
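Since only the input side changes, the training objective is the standard one. A minimal numpy sketch of the dense output head and next-token cross-entropy (dimensions are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
D, V, T = 64, 1000, 8                      # model width, vocab size, sequence length
W_class = rng.normal(size=(D, V)) / np.sqrt(D)

hidden = rng.normal(size=(T, D))           # final decoder hidden states (stand-in)
targets = rng.integers(0, V, size=T)       # next-token labels

logits = hidden @ W_class                                     # dense classifier head
logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(T), targets].mean()               # next-token cross-entropy
```

Because the output head is still D x V, Leviathan's parameter savings come entirely from the input side; the summary does not state that the head is factorized.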

The experimental evaluation uses the Pile dataset and a relatively large 200 k token “o200k” tokenizer to ensure that the “vocab tax” is a genuine bottleneck. Two regimes are explored.

  1. Iso‑body: Both Dense (baseline) and Leviathan share identical Transformer depth, width, and head count; only the input representation differs. Despite a negligible increase in total parameters, Leviathan achieves a 6.7 %–18.1 % reduction in validation perplexity across model scales (60 M–421 M). This demonstrates that the continuous generator provides a richer, more parameter‑efficient representation than a tied embedding table.

  2. Isoparametric: The parameters saved by removing the large embedding matrix are reinvested as additional Transformer layers (the “depth dividend”). For example, at a 109 M total budget, the Dense baseline uses 6 layers, whereas Leviathan can afford a 52‑layer network. The authors fit an empirical scaling law L(N) to the Dense family and define an “effective size” N_eq(L*) for a Leviathan model achieving loss L*. Across the studied range, Leviathan’s effective size is 1.47×–2.11× that of a Dense model with the same raw parameter count. At 109 M, Leviathan matches a 230 M Dense model; at 421 M, it matches a 724 M Dense model.
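The effective-size calculation can be sketched by inverting a fitted power law. The functional form L(N) = a·N^(-b) and the coefficients below are illustrative assumptions; the paper's actual fit is not reproduced here.

```python
import numpy as np

# Hypothetical Dense-family fit L(N) = a * N**(-b).
a, b = 28.0, 0.12

def effective_size(loss):
    """Invert the scaling law: N_eq such that a * N_eq**(-b) == loss."""
    return (a / loss) ** (1.0 / b)

N = 109e6
L_leviathan = a * (2.11 * N) ** (-b)       # loss ratio quoted at the 109 M point
ratio = effective_size(L_leviathan) / N    # recovers the 2.11x effective-size factor
```

By construction, a Leviathan model reaching the loss of a 2.11x-larger Dense model is assigned N_eq = 2.11·N, which is how the 109 M → 230 M and 421 M → 724 M equivalences in the text arise.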

The paper also examines over‑training (training far beyond the compute‑optimal point). Leviathan continues to improve relative to the baseline, indicating that its representation benefits from longer training schedules—a trend consistent with recent SLM research.

Computationally, the generator adds 23 %–51 % overhead, but the relative cost shrinks as model size grows. The authors argue that this overhead is justified by the substantial gains in sample efficiency and effective capacity.

In summary, Leviathan demonstrates that replacing discrete, vocab‑size‑dependent embeddings with a continuous, separable generator can dramatically improve parameter efficiency in small‑to‑medium scale language models, especially when vocabularies are large. The approach opens a pathway toward more scalable multimodal or multilingual models where traditional embedding tables become a prohibitive bottleneck.

