A Survey of Token Compression for Efficient Multimodal Large Language Models


Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio inputs. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention with respect to the number of input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain.


💡 Research Summary

This survey provides the first comprehensive overview of token‑compression techniques tailored for multimodal large language models (MLLMs). MLLMs have achieved impressive capabilities by ingesting long, high‑resolution visual streams, extended video sequences, and lengthy audio recordings. However, the quadratic complexity of the self‑attention mechanism makes processing the resulting massive token sequences computationally prohibitive. For example, a 90‑minute video can generate on the order of 54 million tokens, far exceeding the capacity of even the most advanced LLMs.
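The scale of the problem is easy to reproduce with back-of-the-envelope arithmetic. The sampling rate and tokens-per-frame figures below are illustrative assumptions chosen to match the survey's 54-million-token example, not numbers taken from any specific model:

```python
# Illustrative accounting (assumed numbers): a 90-minute video sampled at
# 2 frames/s, with a high-resolution vision encoder emitting ~5,000 patch
# tokens per frame.
seconds = 90 * 60
fps = 2
tokens_per_frame = 5_000

total_tokens = seconds * fps * tokens_per_frame
print(f"{total_tokens:,} tokens")  # 54,000,000 tokens

# Self-attention cost scales quadratically with sequence length, so halving
# the token count via compression cuts attention FLOPs roughly 4x.
```

Even aggressive per-frame compression leaves sequences far beyond typical LLM context windows, which is why temporal compression across frames is also needed.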

To address this bottleneck, the authors categorize existing token‑compression methods along two orthogonal axes. First, they group approaches by the primary modality they target: image‑centric, video‑centric, and audio‑centric. This reflects the fact that each modality exhibits distinct redundancy patterns—spatial similarity in images, spatio‑temporal correlation in video, and temporal‑spectral redundancy in audio. Second, they classify methods by the underlying algorithmic mechanism: transformation‑based, similarity‑based, attention‑based, and query‑based.

The survey enumerates a large number of representative works for each combination. In the image domain, transformation-based models such as InternVL-1.5, Qwen2-VL, and LaCo build token reduction directly into the visual encoding path. Similarity-based approaches like ToMe, VisionZip, and AuroraCap compute pairwise token similarities and merge or prune redundant patches. Attention-based methods (e.g., PruMerge+, FastV, MustDrop) exploit low attention scores in the encoder or decoder to drop tokens on the fly. Query-based techniques (token distillation, SparseVLM, cross-modal selection) use language-side queries or cross-modal selectors to keep only the most relevant visual tokens.
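The similarity-based family can be illustrated with a minimal greedy merging sketch: repeatedly find the most cosine-similar pair of tokens and average them. This is a simplification in the spirit of ToMe; the actual ToMe algorithm uses an efficient bipartite soft-matching step rather than this O(n²)-per-merge loop:

```python
import numpy as np

def merge_most_similar(tokens: np.ndarray, n_merge: int) -> np.ndarray:
    """Greedily average the most cosine-similar token pair, n_merge times.

    A simplified sketch of similarity-based compression; real systems
    (e.g. ToMe) use bipartite matching for efficiency.
    """
    toks = list(tokens)
    for _ in range(n_merge):
        mat = np.stack(toks)
        normed = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        sim = normed @ normed.T            # pairwise cosine similarity
        np.fill_diagonal(sim, -np.inf)     # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (toks[i] + toks[j]) / 2   # merge the closest pair
        toks = [t for k, t in enumerate(toks) if k not in (i, j)] + [merged]
    return np.stack(toks)

tokens = np.random.randn(16, 64).astype(np.float32)  # 16 tokens, dim 64
compressed = merge_most_similar(tokens, n_merge=8)
print(compressed.shape)  # (8, 64)
```

Merging (averaging) rather than pruning retains some information from discarded tokens, which is why merge-style methods often degrade accuracy more gracefully at high compression ratios.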

For video, the same four mechanisms are instantiated with video-specific designs. Transformation-based systems (PLLaVA, LongVLM, Video-ChatGPT) incorporate temporal pooling or hierarchical tokenizers. Similarity-based methods (PruneVid, DyCoke, FastVID) align frames and merge similar ones, while attention-based strategies (MustDrop, FiCoCo, CoreMatching) use spatio-temporal attention maps to identify dispensable frames or patches. Query-based video compression includes token-distillation pipelines such as Token Turing Machines and Long-VMNet, which generate compact video representations guided by textual prompts.
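Temporal pooling, the simplest transformation-based video trick, can be sketched as average-pooling tokens over fixed windows of frames. This is a generic illustration of the idea rather than any specific model's design; the stride and layout below are assumptions:

```python
import numpy as np

def temporal_pool(frame_tokens: np.ndarray, stride: int) -> np.ndarray:
    """Average-pool video tokens along the temporal axis.

    frame_tokens: (T, N, D) -- T frames, N tokens per frame, dim D.
    Sketch of transformation-based video compression; PLLaVA-like
    systems use more elaborate adaptive pooling.
    """
    T, N, D = frame_tokens.shape
    T_out = T // stride
    trimmed = frame_tokens[: T_out * stride]          # drop the remainder
    return trimmed.reshape(T_out, stride, N, D).mean(axis=1)

video = np.random.randn(32, 196, 64)   # 32 frames of 196 patch tokens each
pooled = temporal_pool(video, stride=4)
print(pooled.shape)  # (8, 196, 64) -- 4x fewer tokens overall
```

Pooling across frames exploits the fact that adjacent frames are highly correlated, which is exactly the spatio-temporal redundancy the survey's video-centric category targets.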

In the audio domain, transformation-based models (HTS-AT, Qwen2-Audio, LLaMA-Omni) apply spectral down-sampling or frame-level encoders to produce fewer audio tokens. Similarity-based audio compression (A-ToMe) clusters similar acoustic frames, whereas query-based approaches leverage cross-modal alignment between speech and text to retain only salient phonetic tokens. The survey notes that attention-based audio compression is still under-explored.
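A common transformation-based trick in speech encoders is frame stacking: concatenating k consecutive feature frames into one wider token, cutting the sequence length by k. The sketch below is a generic illustration with assumed frame dimensions, not the exact scheme of any model named above:

```python
import numpy as np

def stack_frames(frames: np.ndarray, k: int) -> np.ndarray:
    """Stack k consecutive audio frames into one wider token.

    frames: (T, D) -> (T // k, k * D). A generic sketch of
    transformation-based audio token reduction; exact schemes vary
    across speech encoders.
    """
    T, D = frames.shape
    T_out = T // k
    return frames[: T_out * k].reshape(T_out, k * D)

audio = np.random.randn(3000, 80)   # e.g. 30 s of 10 ms log-mel frames (assumed)
compressed = stack_frames(audio, k=4)
print(compressed.shape)  # (750, 320)
```

Unlike pruning, stacking is lossless at the feature level: it trades sequence length for per-token width, leaving the downstream projector to learn the compression.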

Beyond cataloguing methods, the paper highlights several cross‑cutting insights. First, while many techniques achieve high compression ratios, they differ in where the compression occurs (encoder vs. decoder) and whether they require retraining. Transformation‑based methods often need architectural changes and retraining but provide the most stable speedups. Similarity‑based and attention‑based methods can be applied as post‑hoc pruning, offering flexibility at the cost of occasional information loss. Query‑based methods align compression with downstream task relevance, potentially preserving performance but demanding sophisticated query design.
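The post-hoc, training-free flavor of attention-based pruning can be sketched as follows: score each token by the average attention it receives in some layer and keep only the top-scoring fraction. This is in the spirit of methods like FastV, but the exact scoring rule and layer choice here are illustrative assumptions:

```python
import numpy as np

def prune_by_attention(tokens: np.ndarray, attn: np.ndarray,
                       keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the tokens that receive the highest average attention.

    tokens: (n, d) token features; attn: (n, n) attention map from some
    layer. A post-hoc pruning sketch -- no retraining required, matching
    the flexibility/information-loss trade-off described above.
    """
    n = tokens.shape[0]
    scores = attn.mean(axis=0)                # avg attention each token receives
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top-k, original order preserved
    return tokens[keep]

toks = np.random.randn(100, 64)
attn = np.random.rand(100, 100)               # stand-in for a real attention map
kept = prune_by_attention(toks, attn, keep_ratio=0.25)
print(kept.shape)  # (25, 64)
```

Because pruned tokens are discarded outright, such methods are easy to bolt onto a frozen model but can lose information that merging-based approaches would have preserved.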

The authors also discuss current limitations. Most works focus on a single modality, leaving a gap in joint multimodal compression where, for instance, video frames and accompanying audio could be co‑pruned based on cross‑modal redundancy. Moreover, evaluation metrics are fragmented; the community lacks a unified benchmark that jointly measures FLOPs, latency, multimodal consistency, and downstream LLM answer quality.

Future research directions proposed include: (1) developing unified multimodal compression frameworks that treat image, video, and audio tokens as a single redundant pool; (2) adaptive compression schedules that dynamically adjust compression ratios based on input length and hardware constraints; (3) hardware‑aware token reduction techniques for edge devices; and (4) standardized multimodal benchmarks that capture both efficiency and task performance.

Finally, the survey underscores the practical impact of token compression: it enables real‑time multimodal reasoning, reduces cloud inference costs, and opens the door for deploying MLLMs on resource‑constrained platforms such as smartphones and embedded systems. By consolidating the rapidly expanding literature, the paper serves as a foundational reference for researchers and engineers aiming to build the next generation of efficient, scalable multimodal AI systems.

