AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Haozhe Qi^{1,2*}, Kevin Qu^3, Mahdi Rad^1, Rui Wang^1, Alexander Mathis^2, and Marc Pollefeys^{1,3}
^1 Microsoft Spatial AI Lab   ^2 EPFL   ^3 ETH Zurich

Abstract. Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token

Keywords: Long video understanding · Multi-modal Large Language Model · Token selection · Model certainty estimation

1 Introduction

Understanding long videos is essential for applications such as embodied AI assistants [4] and multi-modal web agents [10].
Recent multi-modal large language models (MLLMs) [3, 6, 51, 54] have achieved strong question-answering and instruction-following performance on short clips. However, they still struggle with long videos (e.g., hour-long inputs), largely because memory demands and context-length limits constrain both the resolution and the number of frames that can be processed [48].

To mitigate these limitations, recent works [16, 23, 34, 47] pre-select relevant or informative frames based on the text prompt and visual content before feeding them into MLLMs. However, frame-level selection often retains substantial irrelevant regions within the chosen frames, which can dilute the useful signal and hurt performance. Alternatively, token-level methods [7, 17, 40, 55] directly select or compress the visual tokens provided to the MLLM, typically using cross-modal attention from vision-text encoders [30] or MLLM layers to score token importance. While this fine-grained approach reduces redundancy compared to frame-level selection, two challenges remain. First, token importance is usually computed within individual frames or short clips, lacking a global criterion to allocate tokens across distant clips. Second, the selector typically still processes all sampled frames, incurring unnecessary computation when the MLLM could answer accurately using fewer groups.

* Work done during an internship at Microsoft

[Figure 1: bar chart comparing AdaptToken with state-of-the-art frame/token selection methods on the average score and on VideoMME, MLVU, LongVideoBench, and LVBench.]
Fig. 1: We propose AdaptToken, a flexible and efficient token selection strategy for long video understanding. Compared with state-of-the-art frame/token selection methods on several challenging long-video benchmarks, AdaptToken consistently delivers improved performance.
To address these limitations, we propose AdaptToken, a training-free long-video token selection framework suitable for various base MLLMs that (i) estimates token importance globally and (ii) decides when to stop examining additional video frames (Fig. 2). Specifically, AdaptToken processes the video group by group to avoid memory and context-length bottlenecks: within each group, it ranks visual tokens via cross-modal attention from a reference MLLM layer, yielding intra-group token relevance scores. To enable global comparisons across groups, we use the model's response entropy as a group-level uncertainty signal: groups that yield more confident (lower-entropy) responses are treated as more prompt-relevant and receive a larger share of the overall token budget. The same entropy signal further supports early stopping (AdaptToken-Lite): once the model becomes sufficiently confident after examining a few groups, we skip the remaining groups to reduce computation. Finally, token importance is not only attributed to prompt relevance. We improve token diversity and temporal coverage with a location-aware global token removal step that suppresses redundant tokens.

We integrate AdaptToken into multiple MLLM architectures [2, 3, 6] spanning model sizes from 7B to 72B parameters. Across four challenging long-video benchmarks, VideoMME [11], LongVideoBench [44], LVBench [38], and MLVU [57], AdaptToken delivers consistent gains over the base MLLMs and prior frame/token selection methods (Fig. 1). Notably, thanks to global awareness during token selection and redundancy removal, AdaptToken scales to extremely long inputs and continues to improve performance even with 10K frames. To the best of our knowledge, prior MLLM long-video understanding methods [7, 22] only report Needle-in-a-Haystack results rather than end-task accuracy at this scale.
Our early-stopping variant, AdaptToken-Lite, cuts average inference time by roughly half while maintaining comparable (or even better) performance.

[Figure 2: pipeline overview. A long video is split into frame groups; an entropy-based relevance measurement drives entropy-weighted visual token selection (groups judged, e.g., 20%/70%/90% relevant), and the MLLM decoder triggers early stopping (Lite version) once it has enough clues to answer the text prompt, e.g., "What is the competition taking place in the video?" → "It's men's triple jump."]
Fig. 2: Overall pipeline of AdaptToken. AdaptToken processes long videos by dividing them into frame groups and selecting informative tokens within each group based on group relevance estimated from response entropy. It progressively gathers evidence across groups and stops processing once sufficient information has been collected.

In summary, our contributions are threefold.
– We propose AdaptToken, a training-free token selection framework that combines cross-modal attention with response entropy to assess token importance globally for long-video understanding.
– We further use response entropy to trigger early stopping, substantially accelerating inference while maintaining accuracy.
– We demonstrate AdaptToken's effectiveness across diverse MLLMs (multiple architectures and scales) and show consistent improvements on four long-video benchmarks.

2 Related Works

2.1 Multi-modal Large Language Models

Driven by the success of Large Language Models (LLMs) [1, 5, 46], multi-modal LLMs have emerged to extend text-only models with additional modalities [8, 13, 14, 25]. Among them, video MLLMs [19, 20, 41, 45, 54] aim to provide robust and scalable solutions for understanding and reasoning over video data.
A typical architecture comprises a visual encoder and a projection layer that maps visual features into the language embedding space, followed by an LLM backbone that processes the resulting multi-modal token sequence. However, the redundancy in video streams produces a large number of visual tokens, which quickly exhausts the context length and incurs substantial memory and compute overhead. AdaptToken mitigates this challenge by selecting a compact set of informative tokens.

2.2 Long Video Understanding

To process more frames, base MLLMs [2, 42, 53] typically extend the context window via long-video training or apply more aggressive temporal/spatial pooling. Nevertheless, efficiency remains a key bottleneck when handling large numbers of frames. Motivated by the redundancy in videos, recent work focuses on extracting prompt-relevant or informative content before feeding it into an MLLM. At the frame level, methods leverage CLIP [30] vision-text encoders [23, 34], text-only LLMs [43], or learned frame selectors [47, 50] to identify relevant frames. However, selected frames may still contain substantial background or irrelevant regions, motivating feature/token-level compression based on cross-modal attention from vision-text encoders [26, 37] or MLLM layers [40, 55]. AdaptToken follows this line of work, aiming to improve efficiency while enabling globally informed token selection.

2.3 Certainty Estimation for Model Responses

Humans are more likely to make incorrect statements when uncertain; analogously, recent work studies how to estimate an LLM's confidence to detect or mitigate hallucinations. Self-Evaluation [31] estimates confidence from the probability of yes/no tokens, while INTUITOR [56] learns dedicated confidence tokens via additional finetuning.
Confidence can also be inferred directly from the output distribution using metrics such as token-level entropy and related uncertainty scores [9], or self-certainty based on KL divergence from a uniform distribution [18]. These estimates have been used to improve test-time reasoning via multi-round voting [12, 18]. In contrast, AdaptToken leverages model certainty to assess group relevance and derive globally informed token importance for long-video token selection.

3 Method

AdaptToken consists of three main components. Given a long video, AdaptToken first splits it into multiple frame groups. For each group, we compute the group relevance (Section 3.1) and intra-group token importance to enable globally informed token selection (Section 3.2). Next, we apply location-aware global token removal to the selected tokens to improve diversity and coverage beyond relevance (Section 3.3). Finally, AdaptToken supports early stopping: once the model has gathered sufficient evidence, it stops examining additional groups to improve inference efficiency (Section 3.4).

3.1 Group Relevance with Response Entropy

Due to memory and context-length constraints, we cannot feed a large number of video frames (e.g., > 1K frames) into an MLLM at once. A common practice is therefore to process either individual frames [15] or small groups of
frames [29, 40, 55] and select relevant frames/tokens within each group. However, inter-group relevance is often poorly handled. Many methods rely on a binary decision: irrelevant frames/groups are either discarded entirely [15] or aggressively compressed (e.g., into a single token [7]). As the number of groups grows, their contributions to a given prompt can vary substantially, making it beneficial to assess group-wise importance in a finer-grained and continuous manner.

[Figure 3: response-entropy histograms under three conditions (16 frames with needle, 64 frames with needle, 64 frames without needle), split by correct vs. incorrect predictions.]
Fig. 3: Needle-in-a-Haystack experiments based on InternVL2.5 8B. Response entropy distributions for correct vs. incorrect predictions under varying numbers of input frames, with and without needles.

Inspired by recent progress in self-certainty estimation, we use an MLLM's response probabilities to quantify frame-group relevance. Prior work primarily uses self-certainty to improve test-time reasoning in text-only LLMs [12, 18], e.g., by evaluating multiple reasoning traces and voting for the final answer. In contrast, we extend certainty estimation to MLLMs with video inputs and use it to assess the relevance of the input frames to the prompt.

Specifically, given a prefix of input tokens {x_1, ..., x_i}, the LLM backbone autoregressively produces a probability distribution P_i ∈ R^D over the next token, where D = |D| is the vocabulary size. The next token index is

    t_{i+1} = \arg\max_{j \in \{1,\dots,D\}} P_i(j),    (1)

and the corresponding token embedding is x_{i+1} = Embeds[t_{i+1}], where Embeds[·] denotes an embedding lookup table. Beyond selecting the next token, P_i also reflects the model's belief about the output. Denoting the probability of vocabulary item j as P_i(j), the token entropy [18] is

    e_i = -\sum_{j=1}^{D} P_i(j) \log P_i(j).    (2)

Entropy measures the uncertainty of a probability distribution: higher entropy indicates greater uncertainty. We therefore define token certainty as c_i = -e_i, and follow Fu et al. [12] by averaging the certainties of the lowest 10% of generated tokens to obtain the response-certainty score:

    C = \frac{1}{|\mathcal{G}|} \sum_{i \in \mathcal{G}} c_i.    (3)

Here, \mathcal{G} denotes the set of generated tokens with the lowest 10% certainty scores. We use C to quantify the relevance of the input frames to the text prompt,
based on the hypothesis that higher response certainty indicates that the model has observed more prompt-relevant evidence.

To validate this hypothesis, we first conduct Needle-in-a-Haystack experiments by injecting a needle frame into random videos and asking questions tied to that needle. As shown in Fig. 3, when there is no evidence (i.e., no needle frame) in the input, the MLLM remains highly uncertain even when it guesses correctly, suggesting that the model does not observe prompt-relevant evidence and is instead making a random guess. By contrast, when the input contains the required evidence (i.e., the needle frame), the MLLM typically produces lower-entropy responses. A subset of cases still exhibits high entropy despite the presence of the needle, indicating that the model fails to retrieve or attend to the relevant frame and therefore behaves similarly to the no-evidence condition, making incorrect answers more likely. Moreover, reducing the number of input frames from 64 to 16 decreases the fraction of high-entropy responses, consistent with easier retrieval under shorter contexts. We further examine this relationship on real-world benchmarks (Fig. 4 and Appendix Fig. 6).

[Figure 4: response-entropy histograms on VideoMME and MLVU, split by correct vs. incorrect predictions.]
Fig. 4: Real-data entropy experiments based on InternVL2.5 8B. Response-entropy distributions for correct vs. incorrect predictions on real-world benchmarks (VideoMME and MLVU).

Given a video with N frames {f_1, ..., f_N} sampled at a fixed FPS, we partition it into groups of at most K frames, yielding G = N // K + 1 groups {F_1, ..., F_G}. Within each group, frames are sampled with a stride G so that each group spans the full video but with different temporal offsets. For example, group F_g consists of

    \mathcal{F}_g = \{ f_g, f_{g+G}, f_{g+2G}, \dots \}.    (4)

We then process each group with the MLLM to obtain a response and its associated certainty score.
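As a concrete illustration, the certainty score of Eqs. (2)-(3) and the strided grouping of Eq. (4) can be sketched in plain Python. This is a minimal sketch, not the released implementation: the per-token probability lists standing in for the model's next-token distributions, and the function names, are our own assumptions.

```python
import math

def token_entropy(probs):
    """Eq. (2): entropy of one next-token distribution; higher = less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_certainty(distributions, frac=0.10):
    """Eqs. (2)-(3): average certainty (negative entropy) over the
    lowest-certainty `frac` of the generated tokens."""
    certainties = sorted(-token_entropy(p) for p in distributions)
    k = max(1, int(len(certainties) * frac))
    return sum(certainties[:k]) / k

def split_into_groups(num_frames, max_group_size):
    """Eq. (4): G = N // K + 1 strided groups over frame indices 1..N;
    group g holds {g, g + G, g + 2G, ...}, so each group spans the whole
    video at a different temporal offset."""
    G = num_frames // max_group_size + 1
    return [list(range(g, num_frames + 1, G)) for g in range(1, G + 1)]
```

A confidently answered prompt (near one-hot distributions) yields a certainty near 0, while near-uniform distributions drive it toward −log D, matching the intuition that lower entropy signals prompt-relevant evidence.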
Across benchmarks, higher certainty correlates with a higher probability of answering correctly, indicating that the corresponding frame group contains prompt-relevant evidence. Compared to prior inter-group relevance measurements [7, 15], response certainty provides a quantitative and training-free signal for estimating inter-group relevance, which we leverage to derive globally informed token-importance scores.

[Figure 5: two frame groups shown side by side for the prompt "Which food is cut in half by the person in the video?". The group answering "The person cuts a tomato into half" has response entropy 0.1; a less relevant group ("The person cuts a cucumber") has response entropy 1.2. The second row shows cross-modal attention heatmaps and the third row the selected-token masks.]
Fig. 5: Visualization of AdaptToken token selection. Two frame groups are presented side by side. For each clip, we first estimate intra-group token relevance via cross-modal attention (heatmaps in the second row), and group-level relevance via response entropy, which measures the model's answer confidence. Based on these signals, we perform global-aware token selection, adaptively allocating a larger token budget to groups that are more relevant to the text prompt (colored masks in the third row). The resulting token set is compact yet information-dense, improving both accuracy and inference efficiency for long-video understanding.

3.2 Entropy-guided Global Token Selection

With the entropy-based response certainty measure, a video split into G frame groups yields one certainty score C_g per group F_g, which captures group-level relevance. We then estimate token-level relevance within each group; importantly, token relevance and C_g can be obtained in the same forward pass.
Token importance can be estimated either using pretrained vision-text encoders [26, 42] or using attention from MLLM layers [39, 40, 55]. We adopt the latter because MLLMs better capture the semantics of complex text prompts. Concretely, for a given MLLM, we select a late layer with strong retrieval performance (identified via a layer-wise Needle-in-a-Haystack experiment [55]) and use it to compute cross-modal attention. During the same forward pass used to compute group certainty, we extract the visual key embeddings {k_1, ..., k_V} and text query embeddings {q_1, ..., q_T} from the selected LLM layer. We then derive token relevance scores for the visual tokens, R = {r_1, ..., r_V}, by aggregating attention weights over heads and taking the maximum across text queries:

    r_v = \max_{t \in \{1,\dots,T\}} \sum_{h=1}^{H} \mathrm{Attn}_h(q_t, k_v), \quad v \in \{1,\dots,V\}.    (5)

Here, H denotes the number of attention heads and Attn_h(q_t, k_v) is the cross-modal attention weight from query q_t to key k_v in head h. We compute R_g for each frame group to obtain intra-group token-importance scores.

To enable global-aware token selection, we adopt a two-stage token-allocation strategy. We first set an overall token budget B for the entire video, and then allocate a group-level budget B_g for each frame group F_g based on its certainty score C_g:

    B_g = B \times \mathrm{Softmax}(\{C_1, \dots, C_G\}/\tau)_g.    (6)

Here, τ is a temperature parameter controlling the sharpness (fixed to 2 across all experiments). Given B_g, we select the top-B_g visual tokens in group g according to R_g: more relevant groups (higher C_g) retain more informative tokens. Finally, since token location is important for prediction, we also keep the corresponding positional embeddings for the selected tokens.
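The two stages above can be sketched with NumPy. The array shapes and function names are illustrative assumptions rather than the paper's implementation; in a real MLLM the attention tensor would come from the selected decoder layer.

```python
import numpy as np

def token_relevance(attn):
    """Eq. (5): attn has shape [H, T, V] (heads, text queries, visual keys).
    Sum attention over heads, then take the max over text queries,
    giving one relevance score r_v per visual token."""
    return attn.sum(axis=0).max(axis=0)

def group_budgets(certainties, total_budget, tau=2.0):
    """Eq. (6): a temperature-tau softmax over group certainty scores
    distributes the overall token budget B across the G groups."""
    c = np.asarray(certainties, dtype=float) / tau
    w = np.exp(c - c.max())          # numerically stable softmax
    w /= w.sum()
    return total_budget * w
```

Each group then keeps its top-B_g tokens ranked by r_v; in practice B_g would be rounded to an integer before the top-k selection.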
3.3 Location-aware Global Token Removal

Our two-stage token allocation strategy selects prompt-relevant tokens globally. However, beyond relevance, effective long-video understanding also requires diversity and coverage [7, 34]. For instance, neighboring groups F_g and F_{g+1} may contain highly similar content. Processing both groups can yield similar certainty scores C and token relevance scores R, causing redundant tokens to be selected.

To mitigate this redundancy, we introduce a token-removal step. After selecting tokens {x_1, ..., x_B} via two-stage allocation, we compute the pairwise cosine similarity of token features, S_f ∈ R^{B×B}. Since redundancy often arises from temporal proximity, we also record the global frame indices of the selected tokens and compute a temporal similarity term. We first normalize frame indices to [0, 1], yielding {d_1, ..., d_B}, and then define S_d^{i,j} = \exp(-(d_i - d_j)^2 / \sigma), where σ controls how quickly similarity decays with temporal distance. We combine the two similarities as

    S = S_f + S_d.    (7)

Token removal is performed iteratively by discarding tokens with the highest mutual similarity. In practice, we initially select a slightly larger set (10% more) than the target budget B and then reduce it back to B using this redundancy-removal procedure. We ablate different removal ratios in Appendix Table 8.

3.4 Inference Efficiency with Early Stopping

Splitting the whole video into frame groups largely reduces the memory requirement and also slightly improves inference speed, since the model attentions are computed over shorter sequences. However, all frame groups still need to be processed and examined, so inference time grows linearly with the number of frame groups. For a super-long video, the MLLM may already gather enough clues from some frame groups without going through the whole video.
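Returning briefly to the removal step of Section 3.3, it can be sketched with NumPy as follows. This is a minimal sketch under our own assumptions: the greedy loop that drops one endpoint of the currently most similar pair is one plausible reading of "iteratively discarding tokens with the highest mutual similarity", and the function name and default σ are hypothetical.

```python
import numpy as np

def redundancy_removal(features, frame_pos, budget, sigma=0.1):
    """Iteratively drop the most mutually similar tokens until `budget` remain.
    features: [M, D] token features; frame_pos: [M] frame indices in [0, 1]."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    S_f = f @ f.T                                            # cosine similarity
    d = np.asarray(frame_pos, dtype=float)
    S_d = np.exp(-((d[:, None] - d[None, :]) ** 2) / sigma)  # temporal term
    S = S_f + S_d                                            # Eq. (7)
    np.fill_diagonal(S, -np.inf)                             # ignore self-pairs
    keep = list(range(len(d)))
    while len(keep) > budget:
        sub = S[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)   # most similar pair
        keep.pop(max(i, j))                                  # drop one endpoint
    return keep
```

Two tokens with identical features at the same timestamp score S = 2 (cosine 1 plus temporal 1), so one of them is removed first.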
However, such early stopping has been barely addressed by existing frame/token selection methods [7, 40, 55]. With the group certainty C_g, we can identify when to stop examining new frame groups at no additional computation cost, which largely improves inference efficiency while keeping performance comparable. Specifically, we show that a frame group is likely to contain many prompt-relevant clues (Fig. 3) when its certainty score is sufficiently high, which provides a signal for deciding whether the model has gathered enough evidence. A single high-certainty group may carry only partial clues; therefore, we stop reviewing new groups once multiple frame groups have achieved sufficiently high certainties.

Algorithm 1: AdaptToken for Token Selection
    Inputs: video frames {f_1, ..., f_N}, text prompt T, overall token budget B
    Encode T into text tokens {x^t_1, ..., x^t_T} using vocabulary embeddings Embeds
    Split the video into frame groups {F_1, ..., F_G} using Eq. (4)
    Define the group input order with maximum margin: Ĝ = [1, G/2, G/4, 3G/4, ...]
    Initialize the confidence counter: count = 0
    repeat
        for g in Ĝ do
            Encode F_g into visual tokens {x^v_1, ..., x^v_V} using the MLLM vision encoder
            Feed {x^t_1, ..., x^t_T} and {x^v_1, ..., x^v_V} into the MLLM decoder
            Compute the group-wise certainty C_g using Eq. (2) and Eq. (3)
            Compute the token-wise prompt relevance R_g using Eq. (5)
            if C_g > C* and early stopping is enabled (AdaptToken-Lite) then
                count ← count + 1
            end if
        end for
    until count ≥ 3
    Update Ĝ to skip the remaining groups
    Compute the group token budget B_g using Eq. (6)
    for g in Ĝ do
        Select the top-B_g tokens according to R_g
    end for
    Compute the mutual token similarity matrix S using Eq. (7)
    Iteratively remove the top 0.1B most similar tokens based on S
    Aggregate the remaining tokens and feed them into the MLLM decoder
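Algorithm 1's control flow can be sketched with two small helpers. These are our own illustrative functions: the exact tie-breaking of the visiting order is not specified beyond the first few entries [1, G/2, G/4, 3G/4, ...], and the threshold convention (response entropy below C* = 0.75, i.e., certainty above −0.75) follows our reading of the text.

```python
from collections import deque

def max_margin_order(G):
    """Coarse-to-fine visiting order over group indices 1..G
    (roughly [1, G/2, G/4, 3G/4, ...]) via breadth-first bisection,
    so the first few examined groups are spread across the video."""
    order, queue = [1], deque([(2, G)])
    while queue:
        lo, hi = queue.popleft()
        if lo > hi:
            continue
        mid = (lo + hi) // 2              # visit the midpoint of this interval
        order.append(mid)
        queue.append((lo, mid - 1))
        queue.append((mid + 1, hi))
    return order

def should_stop(entropies, threshold=0.75, needed=3):
    """AdaptToken-Lite rule: stop examining new groups once `needed`
    groups have response entropy below `threshold`."""
    return sum(e < threshold for e in entropies) >= needed
```

In the Lite variant, the main loop would call should_stop after each group's certainty is computed and, once it fires, run Eqs. (6)-(7) on the groups examined so far.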
Based on the MLLM entropy analyses, we find that the range of response entropy remains stable across Needle-in-a-Haystack experiments (Fig. 3), real benchmarks (Fig. 4), and different MLLMs (Appendix Fig. 6). Accordingly, we adopt the stopping criterion that three frame groups must attain entropy below C* = 0.75, which trades off computation against reliability by demanding confirmation from multiple groups, and generalizes across benchmarks and MLLMs. We ablate other settings in Appendix Table 9, demonstrating robustness to different hyperparameter choices.

We send frame groups to the MLLM in a maximum-margin manner, i.e., [F_1, F_{G/2}, F_{G/4}, F_{3G/4}, ...], to ensure the diversity of the examined frame groups. Once stopped, we simply use the already reviewed frame groups to perform our two-stage token allocation and token redundancy removal to select tokens. The final selected tokens contain abundant text-prompt-relevant information with high diversity and coverage, providing a globally consistent, compact representation of the original video. We aggregate them in temporal order and send them again to the MLLM to obtain the final response. We summarize the details of AdaptToken in Algorithm 1.

Table 1: Comprehensive evaluation across long-video benchmarks. Results are grouped by model family. Best performance within each block is shown in bold. The first two blocks report representative closed-source and open-source MLLMs as reference points. Subsequent blocks list each base MLLM followed by the corresponding integrated methods.
Model | LLM Size | VideoMME | MLVU | LongVideoBench | LVBench
GPT-4o [27] | - | 71.9 | 64.6 | 66.7 | 27.0
Gemini-1.5-Pro [35] | - | 73.2 | - | 64.0 | 65.7
Gemini-2.5-Pro [8] | - | 84.3 | - | - | 78.7
mPLUG-Owl3 [49] | 7B | 59.3 | 63.7 | 52.1 | -
NVILA [24] | 8B | 64.2 | 70.1 | 57.7 | -
Byte VideoLLM [36] | 14B | 64.6 | 70.1 | 70.1 | -
TPO [21] | 7B | 65.6 | 71.1 | 60.1 | -
VideoLLaMA3 [51] | 7B | 66.2 | 73.0 | 59.8 | 45.3
ViLAMP [7] | 7B | 67.5 | 72.6 | 61.2 | 45.2
InternVL2.5 [6] | 8B | 64.2 | 68.9 | 59.5 | 43.4
w/ ZoomV [29] | 8B | 64.4 | 70.0 | 63.3 | 51.5
w/ Triumph [33] | 8B | 65.4 | 70.0 | 60.7 | 46.6
w/ SeViCES [32] | 8B | 64.7 | 72.1 | 61.7 | 46.7
w/ FlexSelect [55] | 8B | 67.0 | 71.9 | 60.1 | 49.7
w/ AdaptToken-Lite (Ours) | 8B | 68.1 | 74.4 | 63.8 | 51.3
w/ AdaptToken (Ours) | 8B | 68.3 | 74.1 | 63.7 | 52.1
Qwen2.5-VL [3] | 7B | 65.4 | 70.2 | 59.5 | 45.3
w/ TimeSearch-R [28] | 7B | 66.6 | 71.5 | 60.1 | -
w/ ZoomV [29] | 7B | 63.6 | 67.0 | 61.0 | 51.3
w/ AdaReTAKE [40] | 7B | 67.7 | 75.0 | 62.6 | 51.2
w/ SeViCES [32] | 7B | 65.5 | 72.2 | 63.9 | 45.4
w/ FlexSelect [55] | 7B | 68.2 | 72.5 | 62.4 | 51.2
w/ AdaptToken-Lite (Ours) | 7B | 69.8 | 76.3 | 65.1 | 53.3
w/ AdaptToken (Ours) | 7B | 70.5 | 76.8 | 65.2 | 54.8
Qwen2.5-VL [3] | 72B | 73.4 | 76.3 | 66.2 | 47.3
w/ AdaReTAKE [40] | 72B | 73.5 | 78.1 | 67.0 | 53.3
w/ FlexSelect [55] | 72B | 74.4 | 76.6 | 66.4 | 56.6
w/ AdaptToken (Ours) | 72B | 76.1 | 79.8 | 70.5 | 59.7
Qwen3-VL [2] | 8B | 71.4 | 78.1 | 65.9 | 58.0
w/ AdaptToken (Ours) | 8B | 73.8 | 79.3 | 66.9 | 60.6

4 Experiments

4.1 Benchmarks and Models

We evaluate AdaptToken on four public long-video benchmarks commonly used by other frame/token selection methods: VideoMME [11], MLVU [57], LongVideoBench [44], and LVBench [38]. VideoMME, MLVU, and LongVideoBench include both short and long videos, spanning durations from a few seconds to over an hour and covering diverse tasks (e.g., topic reasoning, anomaly recognition, and video summarization), which tests robustness across video types and lengths. LVBench instead targets extremely long videos, with an average duration of 4,101 seconds and many samples exceeding two hours, providing a stringent test of scalability to ultra-long inputs.
We evaluate under the LMMS-Eval framework [52] and report the official metrics for each benchmark. We integrate AdaptToken into multiple MLLMs, including InternVL2.5 8B [6], Qwen2.5-VL 7B/72B [3], and Qwen3-VL 8B [2]. We primarily benchmark on InternVL2.5 8B and Qwen2.5-VL 7B, two common backbones in prior frame/token selection work, to enable fair comparisons. We additionally evaluate on Qwen2.5-VL 72B to demonstrate scalability to larger models. Finally, Qwen3-VL provides a strong long-context base MLLM via ultra-long-context training; we show that AdaptToken further improves its performance while reducing time and memory consumption. We refer to the full method as AdaptToken, and to the variant with early stopping as AdaptToken-Lite. Additional implementation details are provided in the Appendix.

4.2 Main Results

Comparison with state-of-the-art methods. We integrate AdaptToken into multiple base MLLMs and evaluate on the four long-video benchmarks in Table 1. To enable fair comparisons with prior frame/token selection methods, we focus on two commonly used backbones: InternVL2.5 8B [6] and Qwen2.5-VL 7B [3]. Across all benchmarks, AdaptToken yields larger improvements over the corresponding base MLLMs than competing approaches, demonstrating strong effectiveness for long-video understanding.

For Qwen2.5-VL 7B, prior methods attain their best performance on different benchmarks (e.g., FlexSelect on VideoMME, AdaReTAKE on MLVU, SeViCES on LongVideoBench, and ZoomV on LVBench). In contrast, AdaptToken improves performance consistently across all benchmarks, with an average gain of +6.7 over the Qwen2.5-VL 7B baseline, and surpasses the previous best results on each benchmark.
The largest gains occur on LVBench and MLVU, which contain extremely long videos (often exceeding two hours), highlighting AdaptToken's robustness on ultra-long inputs. We observe a similar trend on InternVL2.5: AdaptToken outperforms all compared methods on each benchmark and improves the base model by +5.6 on average. Together, these results demonstrate that AdaptToken generalizes well across MLLM backbones.

Table 2: Accuracy and inference-time comparison between AdaptToken and AdaptToken-Lite. Qwen2.5-VL 7B is used as the base MLLM. AdaptToken-Lite achieves comparable accuracy while reducing average inference time by approximately 50%.

Benchmark | AdaptToken-Lite Acc. | AdaptToken-Lite Time (s) | AdaptToken Acc. | AdaptToken Time (s)
VideoMME | 69.8 | 8.6 | 70.5 | 17.8
LongVideoBench | 65.1 | 11.0 | 65.2 | 18.2
MLVU | 76.3 | 10.1 | 76.8 | 21.5
LVBench | 53.3 | 19.3 | 54.8 | 32.8

Generalization to larger MLLMs. To assess scalability, we further integrate AdaptToken into Qwen2.5-VL 72B. As shown in Table 1, AdaptToken consistently improves performance across all benchmarks, achieving an average gain of +5.7 over the 72B baseline and outperforming all existing token-selection methods evaluated at this scale. This indicates that AdaptToken remains effective when applied to larger-scale MLLMs.

Efficiency with early stopping. Our early-stopping strategy, described in Section 3.4, enables the MLLM to skip uninformative frame groups once sufficient evidence has been collected, thereby reducing inference time. Table 2 and Appendix Table 7 report the average per-sample inference time for AdaptToken and AdaptToken-Lite. Using Qwen2.5-VL 7B as the base MLLM, AdaptToken-Lite achieves accuracy comparable to AdaptToken (average difference: −0.7) while reducing inference time by approximately 50% (e.g., from 17.8 s to 8.6 s on VideoMME). Notably, AdaptToken-Lite still outperforms the previous state of the art.
When applied to InternVL2.5 8B, AdaptToken-Lite matches or exceeds AdaptToken; in particular, it outperforms AdaptToken on MLVU and LongVideoBench, further validating the effectiveness of early stopping.

Generalization to advanced MLLMs with long context length. MLLMs such as Qwen2.5-VL [3] and InternVL2.5 [6] are trained and evaluated primarily on short video clips, with input limits of 25K tokens for Qwen2.5-VL and 64 frames (approximately 16K tokens) for InternVL2.5. By selecting a compact set of informative tokens, AdaptToken effectively increases the relevant video content in the input, leading to improved performance. Most recent MLLMs, such as Qwen3-VL [2], are explicitly designed to handle long-context inputs. Qwen3-VL supports up to 2048 frames and 224K tokens, enabling strong baseline performance on long-video benchmarks (Table 1). We show that AdaptToken remains beneficial even in this setting. When applied to Qwen3-VL 8B with up to 4096 input frames, AdaptToken yields an average improvement of +1.8 on the four benchmarks while substantially improving efficiency. On LVBench, AdaptToken requires 41 GB of GPU memory and 40.7 s per sample on average, compared to 96 GB and 58.5 s for the base Qwen3-VL 8B.

4.3 Ablations

Table 3: Additive ablation study of AdaptToken components. TS denotes token selection and TR denotes token removal. InternVL2.5 8B is used as the base MLLM. The final configuration (f) corresponds to the full AdaptToken.

Method | Max frames | VideoMME | MLVU
a. Base InternVL2.5 | 64 | 64.2 | 68.9
b. Base InternVL2.5 | 256 | 64.3 | 67.2
c. b + Group TS | 256 | 66.9 | 70.1
d. c + Scale up frames | 1024 | 67.1 | 70.9
e. d + Global TS | 1024 | 68.0 | 73.6
f. e + Global TR | 1024 | 68.3 | 74.1

Effectiveness of different components. To identify the contributions of the different components of AdaptToken, we conduct the ablation studies presented in Table 3.
We report results on VideoMME and MLVU, which cover a wide range of video durations. Starting from the InternVL2.5 baseline with 64 input frames (a), simply increasing the maximum number of frames to 256 (b) does not consistently improve performance and even degrades accuracy on MLVU (from 68.9 to 67.2). Adding group-wise token selection while constraining the final context length to a fixed budget B and preserving prompt-relevant tokens (c) leads to a clear performance gain. With group-wise token selection in place, further scaling the maximum number of input frames to 1024 (d) yields modest improvements, despite incorporating substantially more video content. This suggests that different frame groups contribute unequally to the task and should not be treated uniformly. Incorporating entropy-guided global token selection (e) significantly outperforms group-wise selection alone by prioritizing more relevant groups. Finally, applying our global token removal procedure (f) further increases relevant token diversity and yields additional performance gains.

Runtime breakdown analysis. AdaptToken-Lite substantially reduces inference time compared to AdaptToken while maintaining comparable performance (Table 2 and Appendix Table 7). To identify the source of this speedup, we measure a runtime breakdown using Qwen2.5-VL 7B as the base MLLM. Specifically, we decompose AdaptToken into four stages. The group-wise inference stage, which computes response certainty and token relevance, dominates the overall runtime. Processing one frame group takes 1.05 s on average, including 0.45 s for visual feature encoding and 0.60 s for LLM-backbone inference. Although the visual encoder is much smaller than the LLM, it incurs comparable latency due to full-attention computation. Consequently, the total cost of this stage scales approximately linearly with the number of groups.
The remaining three stages are comparatively inexpensive. Entropy-guided global token selection takes 0.07 s given the group certainties and token-relevance scores from the first stage. Global token removal takes 0.55 s. Finally, MLLM inference over the selected tokens takes only 0.31 s due to the reduced input length. Overall, these measurements indicate that group-wise inference is the primary computational bottleneck, especially as the number of frame groups increases. This explains why AdaptToken-Lite, which reduces the number of processed groups via early stopping, achieves large inference-time savings.

Table 4: AdaptToken with more input frames. Qwen2.5-VL 7B is used as the base MLLM. AdaptToken maintains strong performance as the number of input frames increases, demonstrating robustness to extremely long video inputs (up to 10K frames).

Max frames   VideoMME   MLVU   LongVideoBench   LVBench
4096         70.5       76.8   65.2             54.8
8192         70.2       77.1   65.5             54.9
10000        70.1       77.1   65.8             55.6

Scaling to 10K frames. AdaptToken enables MLLMs to process extremely long videos under a fixed memory budget by selecting prompt-relevant tokens in a globally informed manner. To evaluate this capability, we increase the maximum number of input frames to 10K and report results in Table 4. To the best of our knowledge, existing frame- and token-selection methods [7, 22] primarily report Needle-in-a-Haystack experiments at this scale, rather than end-task benchmark accuracy. As the number of input frames increases, AdaptToken improves accuracy on three of the four benchmarks. Performance on VideoMME remains largely unchanged, which we attribute to the fact that the evidence required by its questions can typically be retrieved from fewer frames. In contrast, LVBench, with the longest average video length, benefits the most from the additional frames that our method is able to process.

Comparison of certainty measures.
We adopt response entropy to estimate answer certainty, as it demonstrates strong capability in distinguishing relevant from irrelevant frame groups. As discussed in Section 2.3, alternative metrics are also applicable, such as response confidence [12] and response KL divergence [18].

Table 5: Comparison of certainty measures. AdaptToken achieves comparable performance when paired with different self-certainty scores. Qwen2.5-VL 7B is used as the base MLLM.

Method                    VideoMME   MLVU
Response KL divergence    70.0       76.5
Response confidence       70.1       76.7
Response entropy          70.5       76.8

Table 5 compares these measures on VideoMME and MLVU. The results show only marginal differences among the metrics, likely because the generated answers are short (in contrast to long-form mathematical reasoning, where certainty estimates tend to diverge more substantially). Overall, response entropy performs slightly better than the alternatives. We emphasize that our contribution does not lie in proposing a new certainty metric, but in leveraging existing certainty measures to guide input clue estimation for MLLMs.

Table 6: Comparison to different voting methods. AdaptToken aggregates tokens from different frame groups in a globally aware manner and obtains better performance than the voting methods (i.e., majority voting, Borda voting, and weighted voting).

Method               VideoMME   MLVU   LongVideoBench
Base InternVL2.5     64.2       68.9   59.5
Majority voting      64.6       70.9   60.0
Borda voting         64.9       71.1   60.7
Weighted voting      65.3       71.6   61.6
AdaptToken (Ours)    68.3       74.1   63.7

Comparison to voting methods. Model self-certainty has primarily been used in prior work for test-time reasoning in LLMs, where multiple reasoning traces are aggregated using voting-based schemes to produce a final answer [12, 18].
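For reference, the three self-certainty scores compared in Table 5 can be sketched from the model's per-step output distributions. This is a generic reconstruction, not the paper's exact implementation: the function names are ours, and we assume the KL-divergence variant is measured against the uniform distribution, following the usual self-certainty definition in [18].

```python
import numpy as np

def response_entropy(step_probs):
    """Mean Shannon entropy of the per-step output distributions.
    Lower entropy means the model is more certain about its answer."""
    step_probs = np.asarray(step_probs)  # shape: (num_steps, vocab_size)
    ent = -(step_probs * np.log(step_probs + 1e-12)).sum(axis=-1)
    return float(ent.mean())

def response_confidence(step_probs, token_ids):
    """Mean probability assigned to the tokens actually generated."""
    step_probs = np.asarray(step_probs)
    picked = step_probs[np.arange(len(token_ids)), token_ids]
    return float(picked.mean())

def response_kl_divergence(step_probs):
    """Mean KL divergence from the uniform distribution:
    KL(p || U) = log(V) - H(p); higher means more certain."""
    step_probs = np.asarray(step_probs)
    vocab = step_probs.shape[-1]
    ent = -(step_probs * np.log(step_probs + 1e-12)).sum(axis=-1)
    return float((np.log(vocab) - ent).mean())

# A peaked distribution scores as more certain than a flat one
# under all three measures (low entropy, high confidence, high KL).
peaked = [[0.97, 0.01, 0.01, 0.01]]
flat = [[0.25, 0.25, 0.25, 0.25]]
```

Since the benchmark answers are short multiple-choice responses, the three scores are computed over only a few generation steps, which is consistent with the small differences observed in Table 5.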
In contrast, AdaptToken leverages self-certainty to identify prompt-relevant clues within each frame group and performs globally informed token selection before the final inference pass. We compare AdaptToken with several voting-based aggregation strategies in Table 6. We implement and test three representative voting methods. Majority voting selects the most frequent response across groups. Weighted voting assigns weights to responses based on the corresponding group certainty. Borda voting ranks groups by certainty and assigns scores according to v(r) = (N - r + 1)^p, where r denotes the group rank and p is set to 0.9 following Kang et al. [18]. While these voting methods improve over the base InternVL2.5 model, their gains are substantially smaller than those achieved by AdaptToken. This comparison underscores the advantage of using self-certainty for globally informed token selection rather than for post-hoc response aggregation.

5 Conclusion

We present AdaptToken, a training-free and model-agnostic framework for long-video understanding with MLLMs. AdaptToken uses the model's response entropy as a global relevance signal to allocate token budgets across frame groups and select informative visual tokens via cross-modal attention. Token similarities are also considered to improve the diversity and temporal coverage of the selected tokens. AdaptToken-Lite further uses the same signal for early stopping. Across four long-video benchmarks and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy while scaling to extremely long inputs (up to 10K frames). AdaptToken-Lite cuts average inference time by about half with comparable performance.
Future work includes improving frame grouping and traversal strategies for better early stopping, and exploring more effective intra-group token relevance scoring methods beyond cross-attention.

References

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
2. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
3. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint (2025)
4. Bonnetto, A., Qi, H., Leong, F., Tashkovska, M., Rad, M., Shokur, S., Hummel, F., Micera, S., Pollefeys, M., Mathis, A.: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models. arXiv preprint arXiv:2506.01608 (2025)
5. Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., Chen, X., Chen, Z., Chen, Z., Chu, P., et al.: InternLM2 technical report. arXiv preprint (2024)
6. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint (2024)
7.
Cheng, C., Guan, J., Wu, W., Yan, R.: Scaling video-language models to 10K frames via hierarchical differential distillation. arXiv preprint arXiv:2504.02438 (2025)
8. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
9. Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., Tsymbalov, E., Kuzmin, G., Panchenko, A., Baldwin, T., et al.: Fact-checking the output of large language models via token-level uncertainty quantification. arXiv preprint arXiv:2403.04696 (2024)
10. Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: VideoAgent: A memory-augmented multimodal agent for video understanding. In: European Conference on Computer Vision. pp. 75-92. Springer (2024)
11. Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108-24118 (2025)
12. Fu, Y., Wang, X., Tian, Y., Zhao, J.: Deep think with confidence. arXiv preprint arXiv:2508.15260 (2025)
13. Girdhar, R., Singh, M., Ravi, N., Van Der Maaten, L., Joulin, A., Misra, I.: Omnivore: A single model for many visual modalities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16102-16112 (2022)
14. Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: OneLLM: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26584-26595 (2024)
15.
Hu, J., Cheng, Z., Si, C., Li, W., Gong, S.: CoS: Chain-of-shot prompting for long video understanding. arXiv preprint arXiv:2502.06428 (2025)
16. Hu, K., Gao, F., Nie, X., Zhou, P., Tran, S., Neiman, T., Wang, L., Shah, M., Hamid, R., Yin, B., et al.: M-LLM based video frame selection for efficient video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13702-13712 (2025)
17. Islam, M.M., Nagarajan, T., Wang, H., Bertasius, G., Torresani, L.: BIMBA: Selective-scan compression for long-range video question answering. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29096-29107 (2025)
18. Kang, Z., Zhao, X., Song, D.: Scalable best-of-N selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581 (2025)
19. Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint (2024)
20. Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: MVBench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195-22206 (2024)
21. Li, R., Wang, X., Zhang, Y., Zohar, O., Wang, Z., Yeung-Levy, S.: Temporal preference optimization for long-form video understanding. arXiv preprint arXiv:2501.13919 (2025)
22. Li, X., Wang, Y., Yu, J., Zeng, X., Zhu, Y., Huang, H., Gao, J., Li, K., He, Y., Wang, C., et al.: VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574 (2024)
23. Liang, H., Li, J., Bai, T., Huang, X., Sun, L., Wang, Z., He, C., Cui, B., Chen, C., Zhang, W.: KeyVideoLLM: Towards large-scale video keyframe selection. arXiv preprint arXiv:2407.03104 (2024)
24.
Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4122-4134 (2025)
25. McKinzie, B., Gan, Z., Fauconnier, J.P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Belyi, A., et al.: MM1: Methods, analysis and insights from multimodal LLM pre-training. In: European Conference on Computer Vision. pp. 304-323. Springer (2024)
26. Mohaiminul Islam, M., Nagarajan, T., Wang, H., Bertasius, G., Torresani, L.: BIMBA: Selective-scan compression for long-range video question answering. arXiv e-prints pp. arXiv-2503 (2025)
27. OpenAI: GPT-4o system card (2024)
28. Pan, J., Zhang, Q., Zhang, R., Lu, M., Wan, X., Zhang, Y., Liu, C., She, Q.: TimeSearch-R: Adaptive temporal search for long-form video understanding via self-verification reinforcement learning. arXiv preprint arXiv:2511.05489 (2025)
29. Pan, J., Zhang, R., Wan, X., Zhang, Y., Lu, M., She, Q.: TimeSearch: Hierarchical video search with spotlight and reflection for human-like long video understanding. arXiv preprint arXiv:2504.01407 (2025)
30. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748-8763. PMLR (2021)
31. Ren, J., Zhao, Y., Vu, T., Liu, P.J., Lakshminarayanan, B.: Self-evaluation improves selective generation in large language models. In: Proceedings on. pp. 49-64. PMLR (2023)
32. Sheng, Y., Hao, Y., Li, C., Wang, S., He, X.: SeViCES: Unifying semantic-visual evidence consensus for long video understanding. arXiv preprint (2025)
33.
Suo, Y., Ma, F., Zhu, L., Wang, T., Rao, F., Yang, Y.: From trial to triumph: Advancing long video understanding via visual context sample scaling and self-reward alignment. arXiv preprint arXiv:2503.20472 (2025)
34. Tang, X., Qiu, J., Xie, L., Tian, Y., Jiao, J., Ye, Q.: Adaptive keyframe sampling for long video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29118-29128 (2025)
35. Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
36. Wang, H., Nie, Y., Ye, Y., Wang, Y., Li, S., Yu, H., Lu, J., Huang, C.: Dynamic-VLM: Simple dynamic visual token compression for VideoLLM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20812-20823 (2025)
37. Wang, L., Chen, Y., Tran, D., Boddeti, V.N., Chu, W.S.: SEAL: Semantic attention learning for long video representation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26192-26201 (2025)
38. Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., et al.: LVBench: An extreme long video understanding benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22958-22967 (2025)
39. Wang, X., Si, Q., Wu, J., Zhu, S., Cao, L., Nie, L.: ReTaKe: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504 (2024)
40. Wang, X., Si, Q., Zhu, S., Wu, J., Cao, L., Nie, L.: AdaReTaKe: Adaptive redundancy reduction to perceive longer for video-language understanding. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 5417-5432 (2025)
41.
Wang, Y., Li, K., Li, X., Yu, J., He, Y., Chen, G., Pei, B., Zheng, R., Wang, Z., Shi, Y., et al.: InternVideo2: Scaling foundation models for multimodal video understanding. In: European Conference on Computer Vision. pp. 396-416. Springer (2024)
42. Wang, Y., Li, X., Yan, Z., He, Y., Yu, J., Zeng, X., Wang, C., Ma, C., Huang, H., Gao, J., et al.: InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386 (2025)
43. Wang, Z., Yu, S., Stengel-Eskin, E., Yoon, J., Cheng, F., Bertasius, G., Bansal, M.: VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3272-3283 (2025)
44. Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems 37, 28828-28857 (2024)
45. Xu, M., Gao, M., Gan, Z., Chen, H.Y., Lai, Z., Gang, H., Kang, K., Dehghan, A.: SlowFast-LLaVA: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841 (2024)
46. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
47. Yao, L., Wu, H., Ouyang, K., Zhang, Y., Xiong, C., Chen, B., Sun, X., Li, J.: Generative frame sampler for long video understanding. arXiv preprint (2025)
48. Yao, L., Xing, L., Shi, Y., Li, S., Liu, Y., Dong, Y., Zhang, Y.F., Li, L., Dong, Q., Dong, X., et al.: Towards efficient multimodal large language models: A survey on token compression. Authorea Preprints (2026)
49. Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840 (2024)
50.
Yu, S., Jin, C., Wang, H., Chen, Z., Jin, S., Zuo, Z., Xu, X., Sun, Z., Zhang, B., Wu, J., et al.: Frame-Voyager: Learning to query frames for video large language models. arXiv preprint arXiv:2410.03226 (2024)
51. Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)
52. Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J.A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., et al.: LMMs-Eval: Reality check on the evaluation of large multi-modal models. arXiv preprint arXiv:2407.12772 (2024)
53. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., Wang, Z., Tan, H., Li, C., Liu, Z.: Long context transfer from language to vision. arXiv preprint arXiv:2406.16852 (2024)
54. Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)
55. Zhang, Y., Lu, Y., Wang, T., Rao, F., Yang, Y., Zhu, L.: FlexSelect: Flexible token selection for efficient long video understanding. arXiv preprint (2025)
56. Zhao, X., Kang, Z., Feng, A., Levine, S., Song, D.: Learning to reason without external rewards. arXiv preprint arXiv:2505.19590 (2025)
57. Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: MLVU: Benchmarking multi-task long video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13691-13701 (2025)

A Benchmark details

VideoMME. VideoMME [11] is a benchmark for video understanding with diverse video types and durations. It contains 900 videos, with 2,700 manually annotated multiple-choice question-answer pairs across 30 subfields. The dataset is split into three subsets by duration: short (< 2 minutes), medium (4-15 minutes), and long (30-60 minutes).
MLVU. MLVU [57] spans the widest range of video lengths, from 3 minutes to 2 hours. It includes nine tasks, such as topic reasoning, anomaly recognition, video summarization, and plot question answering.

LongVideoBench. LongVideoBench [44] targets long-context video understanding with videos up to one hour. It contains 3,763 videos and 6,678 annotated multiple-choice questions across 17 categories, emphasizing referring reasoning that requires retrieving and analyzing multi-modal details from specific temporal segments.

LVBench. LVBench [38] focuses on long-video understanding, with an average video duration of 4,101 seconds (4x longer than VideoMME and 5x longer than MLVU). It includes 1,549 annotated multiple-choice question-answer pairs covering tasks such as event understanding, key information retrieval, temporal grounding, and reasoning.

B Implementation details

Except for the GPU-memory test on Qwen3-VL (conducted on an H200 GPU), all evaluations are performed under the LMMS-Eval framework [52] on 8 H100 GPUs. InternVL2.5 and Qwen2.5-VL are evaluated with a maximum of 16K and 32K input tokens, respectively, following their official settings. We extend both context limits by 16x, which increases the maximum number of input frames to 1024 for InternVL2.5 and 4096 for Qwen2.5-VL. For the token budget B, we follow FlexSelect [55] and use 32 frame-equivalent tokens for fair comparisons, corresponding to 7,010 tokens for Qwen2.5-VL and 8,256 tokens for InternVL2.5. For Qwen3-VL, we increase the budget to 128 frame-equivalent tokens (32,768 tokens). We set the temporal similarity decay parameter to 0.3 in all experiments. The Needle-in-a-Haystack evaluation uses the V-NIAH data introduced in LongVA [53]. We combine its needle samples with randomly sampled videos from VideoMME to obtain the correct/incorrect distributions.
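The exact form of the similarity-based token removal used with the temporal decay parameter above is not spelled out in this excerpt. One plausible, purely illustrative reading (all names and the decay form sim * 0.3^|dt| are our assumptions, not the paper's specification): score each selected token by its decayed similarity to other kept tokens, then drop the most redundant fraction, which favors diverse tokens spread across time.

```python
import numpy as np

def remove_redundant_tokens(feats, times, budget, removal_rate=0.1, decay=0.3):
    """Hypothetical sketch of similarity-based token removal with temporal
    decay. feats: (n, d) features of initially selected tokens; times: (n,)
    normalized timestamps in [0, 1]. Tokens that are both similar and
    temporally close count as redundant; the top removal_rate * budget most
    redundant tokens are dropped. Returns sorted indices of kept tokens."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                  # pairwise cosine similarity
    dt = np.abs(times[:, None] - times[None, :])   # pairwise temporal distance
    red = sim * (decay ** dt)                      # decay similarity over time
    np.fill_diagonal(red, -np.inf)                 # ignore self-similarity
    score = red.max(axis=1)                        # worst-case redundancy
    n_remove = int(removal_rate * budget)
    keep = np.argsort(score)[: len(feats) - n_remove]
    return np.sort(keep)
```

With a removal rate of 0.1 and budget B, this drops 0.1B tokens, matching the setting ablated in Appendix Table 8; under this reading, of two near-duplicate tokens at the same timestamp only one survives.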
C Additional experiments

Efficiency with early stopping for InternVL2.5. In the main paper, we establish early-stopping effectiveness on Qwen2.5-VL 7B (Table 2) using our early-stopping version, denoted AdaptToken-Lite. Here, we keep the protocol unchanged and switch only the backbone to InternVL2.5 8B to test whether the gain is model-agnostic rather than architecture-specific. Specifically, we evaluate AdaptToken and AdaptToken-Lite under identical token budgets and benchmark splits, and report three metrics in Table 7: task accuracy, the average number of inferred frame groups per sample, and end-to-end inference time. This appendix experiment therefore serves as a direct cross-backbone validation of the main-paper finding: the entropy-based stopping rule preserves accuracy while reducing computation. Consistent with the Qwen2.5-VL results, AdaptToken-Lite matches or slightly exceeds AdaptToken on LongVideoBench and MLVU, while requiring only about 65% of the inference time. The reduction in processed groups correlates closely with inference time, confirming that group-wise MLLM inference is the dominant runtime cost.

Fig. 6: Entropy distributions across benchmarks (panels: VideoMME, LongVideoBench, MLVU, LVBench). Qwen2.5-VL is used as the base MLLM.

Table 7: Accuracy and inference-time comparison between AdaptToken and AdaptToken-Lite. InternVL2.5 8B is used as the base MLLM. AdaptToken-Lite achieves comparable or better accuracy while reducing inference time.

Methods           AdaptToken-Lite            AdaptToken
                  Acc.   Groups   Time (s)   Acc.   Groups   Time (s)
VideoMME          68.1   6.9      8.4        68.3   11.0     13.0
LongVideoBench    63.8   7.6      9.2        63.7   11.1     13.0
MLVU              74.4   7.4      9.3        74.1   14.2     16.3
LVBench           51.3   11.3     12.3       52.1   16.2     18.8

Group-wise entropy experiments with Qwen2.5-VL. In the main paper (Fig.
4), we analyze group-wise response entropy on InternVL2.5 and show that lower entropy is associated with correct predictions, which motivates our entropy-based group ranking and early-stopping strategy. In this appendix experiment, we keep the same analysis protocol and change only the backbone to Qwen2.5-VL to test whether this entropy-correctness relationship is backbone-independent. Concretely, we construct 64-frame groups, compute group-wise response entropy under the same prompting/inference setup, and evaluate across all four benchmarks (VideoMME, LongVideoBench, MLVU, and LVBench). As shown in Fig. 6, correct and incorrect cases remain clearly separable in entropy space, and the distributional trend closely matches the InternVL2.5 results. This serves as a direct cross-backbone validation of the main-paper finding: response entropy is a stable uncertainty signal for long-video group selection, enabling the same early-stopping hyperparameters to transfer across models with little or no retuning.

Table 8: Ablation on the token-removal rate. AdaptToken is robust to different removal rates.

Method           MLVU
Removing 0.05B   76.1
Removing 0.1B    76.8
Removing 0.2B    76.9
Removing 0.3B    76.7

Ablation on the token-removal rate. In the main paper, AdaptToken uses a fixed removal rate of 0.1 in all experiments to balance relevance and diversity after initial token selection. This appendix experiment isolates that design choice and asks whether performance depends critically on this specific setting. Following the same setup as the main results (same backbone, token budget B, and inference pipeline), we vary only the post-selection removal ratio and evaluate rates from 0.05B to 0.3B (Table 8). The resulting accuracy differences are small, indicating that AdaptToken is not sensitive to the exact removal ratio within a broad range.
This supports the main-paper configuration: 0.1 is a stable default that achieves near-best performance while avoiding extra hyperparameter tuning.

Table 9: Ablation on early-stopping hyperparameters.

Entropy threshold   0.6    0.7    0.7    0.75   0.75   0.75   0.8    0.8
Group threshold     1      1      2      1      2      3      3      4
MLVU                76.2   75.9   76.2   75.9   76.0   76.3   76.0   76.3

Ablation on early-stopping hyperparameters. In the main paper, AdaptToken-Lite uses a fixed early-stopping rule with an entropy threshold of 0.75 and a group threshold of 3 in all experiments, showing a strong accuracy-efficiency trade-off. This appendix study examines whether that choice is robust. Using the same setup as the main experiments (same backbone, token budget, and inference pipeline), we vary only these two stopping hyperparameters and report MLVU accuracy in Table 9. The entropy threshold controls how confidently a group must be answered to count as evidence, while the group threshold controls how many such confident groups are required before stopping. Lowering the group threshold makes stopping more aggressive, whereas increasing it makes stopping more conservative; similarly, a higher entropy threshold is permissive and a lower one is stricter. Results remain stable across a broad range, with the best performance achieved by multiple nearby settings (e.g., 0.75/3 and 0.8/4), indicating that the method is not sensitive to exact tuning. This directly supports the main-paper configuration: the default setting (0.75, 3) lies in a robust regime that transfers well without dataset-specific or MLLM-specific retuning.

Ablation on frame-group construction. In the main paper, AdaptToken relies on group-wise processing to scale long-video understanding, where each group serves as a unit for entropy estimation, token allocation, and early stopping.
This appendix ablation isolates the group-construction strategy to test the effectiveness of our specific grouping design. Keeping the rest of the pipeline unchanged, we compare three input organizations on MLVU: (i) video-chunk inputs, which use contiguous local segments; (ii) continuous group inputs, which preserve temporal continuity with sequential grouping; and (iii) our marginal group inputs, which sample frames across the full video and process groups in maximum-margin order. The results (73.5, 74.0, and 74.4, respectively) show that our design consistently performs best.
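The early-stopping rule ablated in Table 9 can be sketched as follows. This is a minimal reconstruction from the appendix description (function names ours; per-group entropies are assumed to be given by the group-wise inference stage): traverse frame groups in order, count groups answered with entropy below the threshold, and stop once enough confident groups have accumulated.

```python
def early_stop_traversal(group_entropies, entropy_thresh=0.75, group_thresh=3):
    """Sketch of the AdaptToken-Lite stopping rule. group_entropies holds the
    response entropy of each frame group in traversal order. A group with
    entropy below entropy_thresh counts as confident evidence; traversal
    stops after group_thresh confident groups. Returns the indices of the
    groups actually processed (the remaining groups are skipped)."""
    processed, confident = [], 0
    for idx, ent in enumerate(group_entropies):
        processed.append(idx)            # this group is run through the MLLM
        if ent < entropy_thresh:         # low entropy -> confident evidence
            confident += 1
        if confident >= group_thresh:    # sufficient evidence: skip the rest
            break
    return processed

# Stops after the third confident group (entropies 0.5, 0.6, 0.4 < 0.75),
# so the last group is never processed.
print(early_stop_traversal([0.9, 0.5, 0.6, 0.8, 0.4, 0.3]))  # -> [0, 1, 2, 3, 4]
```

Under this reading, raising the entropy threshold or lowering the group threshold makes stopping more aggressive, consistent with the trade-off described in the hyperparameter ablation above.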