MrRoPE: Mixed-radix Rotary Position Embedding


Rotary Position Embedding (RoPE) extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose MrRoPE (Mixed-radix RoPE), a generalized encoding formulation based on a radix-system-conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies, respectively, to achieve "train short, test long" generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN's accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE's attainable encoding length, which further validates the reliability and utility of our theory and methodology.


💡 Research Summary

The paper addresses a fundamental limitation of Rotary Position Embedding (RoPE), the positional encoding scheme widely used in modern large language models (LLMs). RoPE encodes each token’s absolute position as a set of rotation angles, one per embedding dimension, with higher dimensions rotating more slowly. During pre‑training, the slower‑rotating (low‑frequency) dimensions rarely complete a full rotation, so when a model is asked to process sequences longer than those seen during training, these dimensions encounter out‑of‑domain (OOD) angles and the model’s performance degrades sharply.

The authors observe that the mathematical form of RoPE’s angle computation,
  mθ_j = m·b^{−(j−1)/D_r} mod 2π,
is structurally identical to the digit extraction step in a radix‑based number system. By interpreting each dimension as a “digit” in a β‑radix system where β = b^{1/D_r}, RoPE can be seen as a biased β‑radix encoding: the base β is fixed, and the “digits” (dimensions) are limited in how far they can carry over because the training window does not allow a full cycle for the higher‑order digits.
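To make the radix analogy concrete, the angle computation can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the base `b`, the dimension count `D_r`, and the query position are illustrative defaults:

```python
import numpy as np

b = 10000.0               # RoPE base (a common default, assumed here)
D_r = 64                  # number of rotary dimensions (illustrative)
beta = b ** (1.0 / D_r)   # the implied radix: beta = b^{1/D_r}

def rope_angles(m, beta=beta, D_r=D_r):
    """Angle of position m in dimension j = 1..D_r:
    m * beta^{-(j-1)} mod 2*pi, equivalently m * b^{-(j-1)/D_r} mod 2*pi."""
    j = np.arange(1, D_r + 1)
    theta = beta ** (-(j - 1.0))   # per-dimension frequency, one per "digit"
    return (m * theta) % (2.0 * np.pi)

# Each step up in j divides the rotation rate by beta, just as each
# higher digit of a base-beta numeral is weighted by another factor of
# beta -- so dimension j plays the role of the j-th digit of m.
angles = rope_angles(m=1000)
```

Under a short training window, only the first few "digits" ever wrap past 2π; the high-order digits stay confined to a small arc, which is exactly the OOD gap described above.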

From this perspective the authors introduce a radix conversion factor λ_j for each dimension. Scaling the base of a digit by λ_j effectively stretches the rotation period of that dimension. If λ_j > 1, the digit’s representable range expands, mitigating OOD for that dimension; if λ_j < 1, the range contracts, preserving low‑frequency information. Crucially, any existing RoPE‑extension method can be expressed as a particular choice of the λ_j vector:

  • NTK‑aware Interpolation applies a uniform scaling λ_j = S^{1/D_r} across all dimensions, where S is the desired overall length‑expansion factor. This uniformly stretches every digit, avoiding abrupt OOD but ignoring the spectral differences among dimensions.
  • YaRN (built on NTK‑by‑parts interpolation) keeps the low‑frequency (high‑order) and high‑frequency (low‑order) dimensions unchanged (λ_j = 1) and linearly interpolates the middle dimensions with λ_j = r_j·S, where r_j decreases linearly with the dimension index. YaRN therefore follows the heuristic “extrapolate low‑frequency, interpolate high‑frequency” but does not explicitly define the optimal scaling for the intermediate range.
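Under this unified view, both baselines reduce to a particular λ vector. The sketch below is schematic: the band edges `d_l`, `d_h` are placeholders (YaRN actually selects them from per-dimension wavelengths), and the ramp endpoints are chosen so λ falls linearly from S to 1 across the middle band, matching λ_j = r_j·S with r_j decreasing linearly:

```python
import numpy as np

D_r = 64          # rotary dimensions (illustrative)
S = 4.0           # target length-expansion factor
d_l, d_h = 8, 48  # schematic band edges, not from the paper

# NTK-aware interpolation: one uniform per-digit factor S^{1/D_r},
# so every dimension's radix is stretched by the same amount.
lam_ntk = np.full(D_r, S ** (1.0 / D_r))

# YaRN-style: extreme frequencies untouched (lambda = 1); in the middle
# band, lambda_j = r_j * S with r_j falling linearly from 1 to 1/S,
# so lambda ramps from S down to 1 and meets the fixed bands smoothly.
r = np.linspace(1.0, 1.0 / S, d_h - d_l)
lam_yarn = np.ones(D_r)
lam_yarn[d_l:d_h] = r * S
```

Note the contrast the text draws: the NTK-aware vector has the same entry everywhere (its product over all dimensions equals S), while the YaRN vector concentrates all of its scaling in the middle band.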

Building on this unified view, the authors propose two new, training‑free strategies:

  1. MrRoPE‑Uni (Uniform) – All middle dimensions share a constant λ_j = c, chosen so that the product of all λ_j equals the target expansion S (c = S^{1/(d_h‑d_l)}). Low‑ and high‑frequency dimensions remain unchanged (λ_j = 1). This method is a principled uniform‑radix extension that improves over NTK‑aware by explicitly separating the untouched extreme frequencies.
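The MrRoPE‑Uni rule can be sketched directly from the formula above: every middle dimension shares one conversion factor c = S^{1/(d_h−d_l)}, so the product of the λ_j over the band equals the target expansion S. Band edges are again illustrative placeholders:

```python
import numpy as np

D_r = 64          # rotary dimensions (illustrative)
S = 4.0           # target expansion factor
d_l, d_h = 8, 48  # untouched low/high-frequency bands (schematic)

# Constant middle-band factor: c = S^{1/(d_h - d_l)}, chosen so that
# the product of all lambda_j in the band is exactly S.
c = S ** (1.0 / (d_h - d_l))
lam_uni = np.ones(D_r)
lam_uni[d_l:d_h] = c   # extreme frequencies keep lambda_j = 1
```

Because the product constraint is explicit, the overall context stretch is exactly S while the untouched extreme frequencies preserve both precise local positions and coarse long-range ordering.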

  2. MrRoPE‑Pro (Progressive) – The middle dimensions receive a monotonically increasing λ_j, forming an arithmetic progression. Specifically, λ_j = S·ε_j with ε_j = 2(1 + j − d_l) / …

