ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
Huanxuan Liao^{τ,µ}, Zhongtao Jiang, Yupu Hao^{τ,µ}, Yuqiao Tan^{τ,µ}, Shizhu He^{τ,µ}, Jun Zhao^{τ,µ}, Kun Xu^{†}, Kang Liu^{τ,µ,∗}
^{τ} Institute of Automation, Chinese Academy of Sciences; ^{µ} University of Chinese Academy of Sciences; ^{†} Project Leader; ^{∗} Corresponding author: kliu@nlpr.ia.ac.cn

Abstract. Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual-token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. Existing efficiency strategies only partially resolve this tension: model-side token compression discards fine-grained evidence after encoding and can disrupt optimized inference kernels, whereas output-side agentic reasoning adds iterative latency and can still miss decisive cues when the initial view is too coarse. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy–cost learning signal. We further introduce a temporal-similarity regularizer that suppresses redundant high-budget allocation on adjacent similar frames, encouraging differentiated, content-aware allocation in a single forward pass.
Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency–accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16× more frames at the same visual budget while delivering over 15% performance gain. The learned policy exhibits open-loop active perception, concentrating visual budget on information-dense content without modifying the backbone architecture. These results position input-side adaptation as a practical and effective route to long-context video reasoning under tight visual budgets.

Project Page: https://xnhyacinth.github.io/projects/ResAdapt
Code Repository: https://github.com/Xnhyacinth/ResAdapt
Contact: liaohuanxuan2023@ia.ac.cn

1. Introduction

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual-token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive (Guo et al., 2025a; Bai et al., 2025a; Liu et al., 2025a; Shu et al., 2025; Shao et al., 2025b). In practice, this trade-off is central to video reasoning: reducing resolution risks losing the small visual cues that determine the answer, whereas shortening the clip removes the temporal context needed for long-horizon inference. Even architecturally efficient encoders (Zhang et al., 2026; Liu et al., 2025b) do not remove this tension; they merely shift where it becomes painful.
Figure 1: Input-side adaptation improves the visual-token efficiency frontier. (a) Three efficiency paradigms for video reasoning: input-side adaptive sampling (source-level), model-side token economy (representation-level), and output-side agentic reasoning (behavior-level). Model-side methods compress tokens after encoding; output-side methods iteratively retrieve or zoom; ResAdapt reallocates per-frame visual budget before encoding, preserving the backbone's native token interface and compatibility with optimized inference engines. (b) Qwen2.5-VL-7B results with 32 frames at ∼10% visual retention, where ResAdapt lies on or near the Pareto frontier and shows its largest gain on the reasoning-heavy benchmark.

Mainstream efficiency methods largely fall into two paradigms (Figure 1a), both of which intervene too late. Model-side approaches prune or merge tokens after visual encoding (Khaki et al., 2025; Xu et al., 2025; Bolya et al., 2022; Tao et al., 2025), so once fine-grained evidence is discarded, it cannot be recovered. They also alter the token layout expected by the backbone, which can complicate direct use of optimized attention kernels and inference engines (Dao, 2024; Kwon et al., 2023; Zheng et al., 2024). Output-side agentic reasoning instead adds iterative retrieval or zoom steps (Zhang et al., 2025b; Yang et al., 2025d; Shen et al., 2025b; Zheng et al., 2025b). This strategy can improve coverage, but it introduces multi-turn latency and still depends on an initial coarse view that may undersample the decisive cue.
These limitations point to a different bottleneck: not how post-encoding representations are compressed, but how many pixels the encoder receives in the first place. This paper studies that perspective through input-side adaptation, which reallocates visual budget before encoding rather than compressing representations afterward. Our method, ResAdapt, introduces a lightweight Allocator that predicts a per-frame visual allocation from coarse visual features and the query, then realizes that allocation through a visual budget operator, such as resolution resizing or frame selection. The backbone therefore processes a standard visual-token sequence in a single pass, preserving its native interface and compatibility with existing optimized inference stacks (Dao, 2024; Kwon et al., 2023). Compared with prior slow–fast pipelines (Yang et al., 2025a; Zhang et al., 2026), which route frames using query-agnostic heuristics or fixed resolution tiers, ResAdapt learns a query-aware allocation policy directly from task reward.

Making this idea work requires more than a front-end allocator. The allocation space is continuous, the operator is non-differentiable, and naive accuracy–cost penalties collapse toward uniformly tiny budgets. We address these difficulties with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable asymmetric learning signal, and with a temporal-similarity regularizer that discourages redundant high-budget allocation on adjacent similar frames. Together, these components turn input-side adaptation into a trainable and content-aware policy rather than a handcrafted compression rule. Across video QA and temporal grounding benchmarks on multiple backbones and temporal horizons, ResAdapt improves or closely tracks the efficiency–accuracy Pareto frontier.
ResAdapt matches or surpasses other token-economy methods while compressing over 90% of visual tokens (Figure 1b), and the saved compute can be reinvested as temporal coverage: under equivalent budgets, ResAdapt processes 16× more frames with >15% relative gains. The learned policy exhibits open-loop active perception, concentrating visual budget on information-dense frames in a single backbone pass without explicit saliency supervision.

Our main contributions are:

1. We introduce ResAdapt, an input-side adaptation framework that formulates dynamic per-frame visual budgeting as a contextual bandit problem, fully preserving the native architecture and hardware optimizations of MLLMs.
2. We propose CAPO with a temporal-similarity regularizer, providing a stable, asymmetric learning signal to jointly optimize accuracy and cost without hand-crafted heuristics.
3. Through extensive experiments and ablations, we show that ResAdapt achieves a better efficiency–accuracy Pareto frontier across video QA and temporal grounding tasks.

2. Background and Problem Formulation

2.1. Preliminaries

Given a text query q and a video V = {f_t}_{t=1}^{T}, let x = (q, V) denote the full input. A backbone policy π_ϕ encodes every frame at fixed fidelity and autoregressively generates a rollout y = (y_1, …, y_L):

    π_ϕ(y | x) = ∏_{j=1}^{L} π_ϕ(y_j | y_{<j}, x).    (1)

When useful, we write y = (r, o) for a reasoning trace r and a final answer o. The inefficiency is immediate: visual cost scales with total pixel volume, whereas answer-critical evidence is sparse in time. To control pre-encoding cost, we introduce an Allocator policy π_θ that emits a per-frame allocation vector

    s = (s_1, …, s_T) ∼ π_θ(· | x),  s_t ∈ [s_min, s_max],    (2)

and applies a visual budget operator O to each frame: f̃_t = O(f_t, s_t).
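As a concrete illustration of the per-frame operator (a minimal sketch, not the authors' implementation), the following Python snippet resizes a frame's dimensions by s_t and computes the resulting visual-token count under a ViT-style patch grid; the 448×448 frame size and patch size 28 are illustrative assumptions:

```python
import math

def apply_budget_operator(frame_hw, s_t, patch=28):
    """Resize-operator sketch: scale a frame's (H, W) by s_t and return
    the new dimensions plus the visual-token count, which grows as
    ceil(s_t * H / P) * ceil(s_t * W / P) for patch size P."""
    h, w = frame_hw
    new_h, new_w = round(s_t * h), round(s_t * w)
    tokens = math.ceil(new_h / patch) * math.ceil(new_w / patch)
    return (new_h, new_w), tokens

# A full-budget frame versus an aggressively downscaled one.
(_, full), (_, cheap) = apply_budget_operator((448, 448), 1.0), apply_budget_operator((448, 448), 0.25)
print(full, cheap)  # 256 vs 16 tokens: quadratic savings in s_t
```

Because the operator acts on pixels before encoding, the backbone still receives an ordinary patch-token sequence, just a shorter one.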
The backbone then generates from the transformed input x̃ = (q, {f̃_t}_{t=1}^{T}):

    π_ϕ(y | x̃) = ∏_{j=1}^{L} π_ϕ(y_j | y_{<j}, x̃).    (3)

We keep O abstract only to state the decision problem cleanly. The framework is operator-agnostic: O may implement resizing, frame selection, or other pre-encoding budget controls.

2.2. Problem Formulation

Because the Allocator acts once before decoding, the outer problem is a contextual bandit (equivalently, a one-step contextual MDP). The context is the raw input x ∈ X, and the action is the continuous allocation vector s ∈ [s_min, s_max]^T. For joint training, it is convenient to write the induced two-stage policy as

    p_{θ,ϕ}(s, y | x) = π_θ(s | x) π_ϕ(y | x̃),    (4)

where x̃ = (q, {O(f_t, s_t)}_{t=1}^{T}) is the deterministically transformed input. The immediate reward is response quality r(x, s, y) = Q(x, y). Let C(s) denote the visual cost induced by allocation s. The ideal budgeted objective is

    max_{θ,ϕ}  E_{x∼D, s∼π_θ(·|x), y∼π_ϕ(·|x̃)} [Q(x, y)]   s.t.   E_{x∼D, s∼π_θ(·|x)} [C(s)] ≤ τ,    (5)

where τ is the target budget. Lagrangian relaxation yields the unconstrained utility

    max_{θ,ϕ}  E_{x,s,y} [U(x, s, y)],   U(x, s, y) = Q(x, y) − λ C(s),    (6)

for a trade-off coefficient λ ≥ 0. Equations (5)–(6) define the target trade-off but not yet a stable optimizer. Section 3 instantiates this objective with an input-side adaptation policy, CAPO, temporal regularization, and PPO-style surrogate losses; the experiments use resize as the concrete operator. Detailed derivations are deferred to Appendix C.

3. Method

Figure 2 summarizes the input-side adaptation framework.
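Before instantiating the method, the trade-off in Eq. (6) can be made concrete with a toy numeric sketch. The quality and cost values below are hypothetical stand-ins, chosen only to show why a content-aware allocation can dominate both uniform extremes under the Lagrangian utility:

```python
# Toy sketch of the utility U = Q - lambda * C from Eq. (6).
# Quality Q and cost C values are hypothetical, not from the paper.
lam = 0.5
candidates = {
    # allocation label: (expected quality Q, visual cost C in [0, 1])
    "uniform-low":  (0.55, 0.10),
    "uniform-high": (0.72, 1.00),
    "adaptive":     (0.70, 0.25),
}
utilities = {k: q - lam * c for k, (q, c) in candidates.items()}
best = max(utilities, key=utilities.get)
print(best, round(utilities[best], 3))  # adaptive wins: near-high quality at a quarter of the cost
```

The learning problem is to find such an "adaptive" allocation automatically, from rollout reward alone, when Q is not known in closed form.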
At inference, the Allocator predicts one allocation per frame and applies a pre-encoding operator before the video reaches the backbone in a single pass. In the experimental instantiation studied here, O is bilinear resizing, so the allocation becomes a resize factor s_t and f̃_t = R(f_t, s_t). At training, rollout feedback updates the Allocator and, optionally, the backbone.

3.1. Joint RL Optimization Framework

Section 2.2 defines allocation as a contextual bandit and states the ideal accuracy–cost trade-off. Appendix C starts from the marginal probability of the correct answer under the transformed input and motivates a one-step expected-reward objective by abstracting the resulting scalar quality term as a rollout utility Q(x, y) that is treated as parameter-independent after sampling y. For a fixed context x, the induced joint policy factorizes as

    p_{θ,ϕ}(s, y | x) = π_θ(s | x) π_ϕ(y | x̃).    (7)

Here π_θ(s | x) is the density on s induced by the latent Beta policy q_θ(a | x) in Eq. (10) (change of variables; the per-frame affine map has a θ-independent Jacobian, so ∇_θ log π_θ(s | x) matches ∇_θ log q_θ(a | x)). The PPO ratios in Eq. (21) are therefore computed on the latent actions a_t as in Eq. (11).
The corresponding ideal rollout reward is

    R^{ideal}_{s,y} = Q(x, y) − λ C(s),    (8)

and the ideal optimization target is

    max_{θ,ϕ}  J(θ, ϕ) = E_{π_θ(s|x)} [ E_{π_ϕ(y|x̃)} [ R^{ideal}_{s,y} ] ].    (9)

Figure 2: ResAdapt framework. (a) At inference, a lightweight Allocator π_θ maps coarse visual features and the query to latent actions a_t ∼ Beta(α_t, β_t), which parameterize per-frame input allocations. In the resize instantiation used in our experiments, these allocations are realized as scales s_t ∈ [s_min, s_max], and the resized frames are processed by the MLLM in a single call. (b) During training, CAPO reshapes group-relative advantages with a dynamic cost pivot τ_dyn, while temporal-similarity regularization suppresses redundant high-budget allocation on adjacent similar frames.

Equation (9) conditions on a fixed x; training averages over x ∼ D. Although its gradients follow the standard score-function form detailed in Appendix C, direct optimization remains brittle for three reasons:

1. Policy parameterization. π_θ must emit a T-dimensional continuous action with negligible overhead relative to the backbone.
2. Credit assignment. Delayed rollout reward must yield low-variance, cost-aware gradient estimates; the raw Lagrangian reward Q(x, y) − λC(s) is high-variance and often collapses the policy to minimum budget.
3. Local temporal structure. Rollout-level reward carries no explicit temporal signal, so near-duplicate adjacent frames can still receive redundant allocation.

The next three subsections resolve these bottlenecks in order.

3.2. Allocator Architecture

This subsection resolves Bottleneck 1. Equation (9) requires a continuous allocation policy whose overhead is negligible relative to the backbone it controls. The framework itself only assumes a policy over operator parameters; below we describe the continuous resize instantiation used for training and evaluation. Each frame f_t ∈ R^{3×H_t×W_t} is encoded by a frozen lightweight visual encoder, while the query is encoded separately. Both are projected to a shared dimension D. A shallow decoder alternates temporal self-attention over {f_t}_{t=1}^{T} with gated cross-attention to the query, producing hidden states {h_t}_{t=1}^{T}. This design exposes both temporal redundancy and query dependence at low cost. To preserve exploration, we parameterize each latent action with a Beta distribution whose bounded support maps naturally to [s_min, s_max]:

    a_t ∼ Beta(α_t, β_t),   s_t = s_min + a_t (s_max − s_min).    (10)

Because a_t ∈ (0, 1), the allocation satisfies s_t ∈ (s_min, s_max) almost surely; in our experiments, 0 < s_min < 1 < s_max permits both downscaling and selective upscaling. Let q_θ(a | x) denote the latent Beta policy over a = (a_1, …, a_T).
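A minimal sketch of the sampling step in Eq. (10), using Python's stdlib Beta sampler; the α, β values and the [0.25, 1.5] scale range are illustrative assumptions, not the trained parameters:

```python
import random

def sample_scales(alphas, betas, s_min=0.25, s_max=1.5, seed=0):
    """Sample latent actions a_t ~ Beta(alpha_t, beta_t) and map them
    affinely to per-frame scales s_t in (s_min, s_max), as in Eq. (10)."""
    rng = random.Random(seed)
    scales = []
    for a, b in zip(alphas, betas):
        a_t = rng.betavariate(a, b)              # a_t in (0, 1)
        scales.append(s_min + a_t * (s_max - s_min))
    return scales

# A peaked Beta(8, 2) tends toward high budget; Beta(2, 8) toward low budget.
scales = sample_scales([8.0, 2.0], [2.0, 8.0])
print([round(s, 2) for s in scales])
```

The bounded support is what keeps every sampled scale strictly inside the allowed range without clipping, so the log-density in the PPO ratio stays well-defined.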
Conditioned on {h_t}, it factorizes across frames:

    log q_θ(a | x) = ∑_{t=1}^{T} log Beta(a_t; α_t, β_t).    (11)

The affine map a ↦ s induces the corresponding policy π_θ(s | x); the exact change-of-variables details are deferred to Appendix C. The result is a continuous, query-aware allocation policy.

3.3. Cost-Aware Policy Optimization (CAPO)

This subsection resolves Bottleneck 2. The Allocator emits a structured allocation, but the optimizer needs a scalar cost signal compatible with rollout reward. A flat penalty on C(s) collapses the policy toward uniformly tiny budgets. CAPO avoids this failure by replacing the raw penalty with a shaped surrogate learning signal.

Compute metric. For the resize instantiation used in our experiments, if frame f_t ∈ R^{3×H_t×W_t} is resized by s_t, its visual token count is n_t(s_t) ∝ ⌈s_t H_t / P⌉ ⌈s_t W_t / P⌉ for patch size P. Physical compute is measured by the token retention ratio

    ρ(s) = (∑_{t=1}^{T} n_t(s_t)) / (∑_{t=1}^{T} n_t(1)) ≈ (∑_{t=1}^{T} s_t² H_t W_t) / (∑_{t=1}^{T} H_t W_t).    (12)

In our implementation, frames are normalized to a common base resolution before allocation, so ρ(s) is well approximated by the average quadratic scale.

Proxy cost. The quadratic dependence of ρ on s_t over-emphasizes a few large allocations and produces high-variance updates. We therefore introduce the smoother proxy

    c(s) = (s̄ − s_min) / (s_max − s_min),   s̄ = (1/T) ∑_{t=1}^{T} s_t,    (13)

used only inside the optimizer; ρ(s) remains the efficiency metric reported in experiments.

Notation bridge. During training, R^{task}_{m,n} denotes the concrete rollout score, A^{base}_{m,n} the GRPO-normalized advantage, and A_{m,n} the final CAPO-shaped advantage. Appendix C relates these practical quantities to the ideal reward in Eq. (8).

Base advantage.
For each prompt x, let R^{task}_{m,n} denote the scalar task reward of rollout (m, n), A^{base}_{m,n} the corresponding GRPO group-normalized advantage, c_m = c(s_m) the proxy cost of allocation m, and u_{m,n} ∈ {0, 1} a binary correctness indicator (exact match for QA; thresholded success for continuous metrics).

Dynamic cost pivot. A fixed threshold does not track the evolving policy, whereas a purely group-dependent statistic is too noisy. CAPO therefore interpolates between a fixed target and the prompt-local mean:

    τ_dyn = κ_mix c̄_group + (1 − κ_mix) τ_fix,    (14)

where c̄_group = (1/M) ∑_{m=1}^{M} c_m is the prompt-local cost mean, τ_fix ∈ [0, 1] is a fixed target proxy budget corresponding to Eq. (5), and κ_mix ∈ [0, 1] controls adaptivity.

Asymmetric shaping. With τ_dyn as pivot, CAPO applies a correctness-dependent bonus or penalty:

    S_{m,n} =  λ₊ σ((τ_dyn − c_m) / τ_s)   if u_{m,n} = 1,
    S_{m,n} = −λ₋ σ((c_m − τ_dyn) / τ_s)   if u_{m,n} = 0,    (15)

with λ₋ > λ₊ > 0. Efficient correct rollouts receive a moderate bonus, whereas costly incorrect rollouts receive a stronger penalty. The sigmoid temperature τ_s smooths the transition around the pivot.

Final CAPO advantage. Let

    Ã_{m,n} = A^{base}_{m,n} + λ_capo S_{m,n} − γ c_m.    (16)

The final advantage is

    A_{m,n} = max(Ã_{m,n}, ε₊)   if u_{m,n} = 1,
    A_{m,n} = Ã_{m,n}            if u_{m,n} = 0,    (17)

where λ_capo > 0 scales CAPO shaping, γ ≥ 0 applies a residual global cost penalty, and the floor ε₊ > 0 ensures that correct low-cost rollouts retain a positive learning signal. The dominant anti-collapse term is the pivoted asymmetric shaping in S_{m,n}.

3.4. Regularization and Training Objective

This subsection resolves Bottleneck 3 and assembles the final optimizer.

Temporal similarity loss (L_sim).
CAPO optimizes the global quality–efficiency trade-off but does not penalize redundant high-budget allocation on near-duplicate adjacent frames. Reusing the coarse features f_t from Sec. 3.2, we penalize such pairs by

    L_sim = (1/(T−1)) ∑_{t=1}^{T−1} w_t · max(0, log s_t + log s_{t+1} + η_sim),    (18)

where the similarity-gated weight

    w_t = σ((cos(f_t, f_{t+1}) − τ_sim) / γ_sim)    (19)

activates only when adjacent frames exceed a cosine-similarity threshold τ_sim ∈ (0, 1), with temperature γ_sim. No penalty is incurred when s_t s_{t+1} ≤ e^{−η_sim}.

Concentration loss (L_con). To prevent the Beta distributions from collapsing to near-deterministic spikes, we softly cap the total concentration at κ_max > 0:

    L_con = (1/T) ∑_{t=1}^{T} max(0, α_t + β_t − κ_max).    (20)

Together, L_sim encourages differentiated allocation across redundant neighbors, while L_con preserves exploration.

Practical training objective. We optimize both policies in a single GRPO-style loop (Zheng et al., 2025a; Yu et al., 2025). For each prompt x, the Allocator samples M allocation trajectories s_{1:M}; each transformed input x̃^{(m)} then produces N response rollouts. CAPO computes rollout advantages A_{m,n}, which serve as the shared learning signal for both policies. The exact PPO approximation is deferred to Appendix C.

Allocator objective. We first aggregate rollout advantages per allocation, A^{CAPO}_m = (1/N) ∑_n A_{m,n}, and optimize the per-frame PPO surrogate

    L_θ = −(1/(MT)) ∑_{m=1}^{M} ∑_{t=1}^{T} min( r^{(m)}_{θ,t} A^{CAPO}_m , clip(r^{(m)}_{θ,t}, 1−ε, 1+ε) A^{CAPO}_m ),    (21)

where the per-frame importance ratio is

    r^{(m)}_{θ,t} = q_θ(a^{(m)}_t | x) / q_{θ_old}(a^{(m)}_t | x).    (22)

The full Allocator loss is

    L_alloc = L_θ + λ_sim L_sim + λ_con L_con.    (23)

Backbone update.
Conditioned on the sampled allocations, the backbone is updated with the standard token-level PPO approximation

    L_ϕ = −(1/(MN)) ∑_{m=1}^{M} ∑_{n=1}^{N} (1/L_{m,n}) ∑_{j=1}^{L_{m,n}} min( r^{(m,n)}_{ϕ,j} A_{m,n} , clip(r^{(m,n)}_{ϕ,j}, 1−ε, 1+ε) A_{m,n} ),    (24)

where L_{m,n} is the rollout length and

    r^{(m,n)}_{ϕ,j} = π_ϕ(y^{(m,n)}_j | y^{(m,n)}_{<j}, x̃^{(m)}) / π_{ϕ_old}(y^{(m,n)}_j | y^{(m,n)}_{<j}, x̃^{(m)}).    (25)

If the backbone is frozen, we omit L_ϕ; jointly updating both networks corresponds to ResAdapt-RL. In practice, L_alloc and L_ϕ are optimized alternately within the same loop.

4. Experiments

4.1. Setup

Implementation. The Allocator π_θ uses the SmolVLM architecture (Marafioti et al., 2025) for high-throughput front-end prediction. Throughout, we instantiate input-side allocation with resize, so the learned allocations are realized as per-frame resize factors. We train the Allocator on Qwen2.5-VL-7B-Instruct (Bai et al., 2025b) and additionally test transfer to Qwen3-VL-8B-Instruct (Bai et al., 2025a). We report two settings: ResAdapt-RL, obtained by jointly updating the Allocator and the backbone, and ResAdapt, which directly reuses the trained Allocator with a frozen backbone to evaluate plug-and-play generalization. Resize is used during training because it provides the continuous action space required by our optimizer; thresholded

Table 1: Evaluation Results on Video QA Benchmarks. Retention ratio R reflects visual token count; Reasoning (✓/✗) indicates chain-of-thought use; bold marks the best result. ResAdapt yields larger gains on the reasoning benchmark than on the perception benchmarks.
| Method | R | Reasoning | VideoMME | LongVideoBench | MMVU | MLVU | VideoMMMU | LVBench |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B, 32 frames | | | | | | | | |
| Vanilla | 100% | ✗ | 62.0 | 58.9 | 52.7 | 63.1 | 49.6 | 38.6 |
| Random Drop | 25.0% | ✗ | 58.9 | 57.8 | 49.6 | 58.3 | 45.3 | 36.7 |
| ToMe (Bolya et al., 2022) | 25.0% | ✗ | 58.7 | 58.0 | 51.0 | 58.7 | 41.8 | 37.7 |
| VisionZip (Yang et al., 2025c) | 25.0% | ✗ | 59.4 | 57.1 | 49.8 | 57.9 | 42.4 | 36.5 |
| FlashVid (Fan et al., 2026) | 29.3% | ✗ | 60.2 | 58.6 | 51.1 | 59.2 | 46.3 | 36.9 |
| FixedScale | 25.0% | ✗ | 60.0 | 56.8 | 51.2 | 59.8 | 46.7 | 37.3 |
| ResAdapt (Ours) | 23.8% | ✗ | 60.3 | 58.2 | 51.9 | 60.1 | 48.8 | 37.9 |
| Random Drop | 10.0% | ✗ | 56.1 | 55.6 | 47.1 | 56.5 | 39.8 | 35.2 |
| ToMe (Bolya et al., 2022) | 10.0% | ✗ | 56.4 | 55.2 | 48.9 | 58.0 | 39.2 | 33.6 |
| VisionZip (Yang et al., 2025c) | 10.0% | ✗ | 55.5 | 54.5 | 47.6 | 57.3 | 39.1 | 35.3 |
| FlashVid (Fan et al., 2026) | 10.4% | ✗ | 57.9 | 56.8 | 47.9 | 57.7 | 39.4 | 36.5 |
| FixedScale | 12.3% | ✗ | 58.0 | 55.1 | 47.7 | 57.5 | 44.3 | 35.4 |
| ResAdapt (Ours) | 11.4% | ✗ | 59.4 | 55.4 | 49.2 | 58.4 | 45.7 | 35.9 |
| VideoAuto-R1 (Liu et al., 2026) | 100% | ✓ | 63.2 | 58.9 | 55.0 | 60.0 | 53.6 | 41.5 |
| + ResAdapt (Ours) | 23.8% | ✓ | 60.4 | 57.1 | 53.2 | 61.1 | 51.2 | 38.7 |
| + ResAdapt (Ours) | 11.4% | ✓ | 59.3 | 56.3 | 51.8 | 59.3 | 49.1 | 36.7 |
| Qwen2.5-VL-7B, 128 frames | | | | | | | | |
| Vanilla | 100% | ✗ | 65.3 | 60.3 | 53.1 | 66.5 | 47.9 | 42.0 |
| Random Drop | 25.0% | ✗ | 64.9 | 61.2 | 50.8 | 64.8 | 48.1 | 41.3 |
| ToMe (Bolya et al., 2022) | 25.0% | ✗ | 65.1 | 61.6 | 51.9 | 63.1 | 46.6 | 42.1 |
| VisionZip (Yang et al., 2025c) | 25.0% | ✗ | 64.8 | 61.3 | 51.1 | 64.5 | 47.3 | 41.5 |
| ResAdapt (Ours) | 22.9% | ✗ | 65.6 | 60.2 | 52.8 | 65.9 | 51.1 | 42.1 |
| Random Drop | 10.0% | ✗ | 63.0 | 59.0 | 45.8 | 63.4 | 46.7 | 38.0 |
| ToMe (Bolya et al., 2022) | 10.0% | ✗ | 60.6 | 56.3 | 44.2 | 63.5 | 41.8 | 39.5 |
| VisionZip (Yang et al., 2025c) | 10.0% | ✗ | 61.8 | 56.1 | 44.8 | 63.2 | 42.1 | 38.2 |
| FixedScale | 12.3% | ✗ | 64.1 | 60.9 | 49.6 | 64.5 | 46.9 | 40.3 |
| ResAdapt (Ours) | 11.1% | ✗ | 63.8 | 58.6 | 49.0 | 64.3 | 49.2 | 39.9 |
| VideoAuto-R1 (Liu et al., 2026) | 100% | ✓ | 64.7 | 59.1 | 56.7 | 65.1 | 52.2 | 41.2 |
| + ResAdapt (Ours) | 23.8% | ✓ | 66.2 | 60.2 | 53.5 | 66.0 | 52.6 | 41.8 |
| + ResAdapt (Ours) | 11.4% | ✓ | 64.7 | 57.8 | 52.4 | 64.6 | 51.3 | 39.5 |
| Qwen3-VL-8B, 32 frames | | | | | | | | |
| Vanilla | 100% | ✗ | 65.0 | 58.6 | 57.5 | 64.0 | 60.8 | 40.2 |
| Random Drop | 25.0% | ✗ | 61.3 | 58.4 | 57.1 | 60.2 | 53.4 | 37.8 |
| ToMe (Bolya et al., 2022) | 25.0% | ✗ | 62.4 | 57.4 | 56.0 | 60.8 | 49.1 | 36.4 |
| VisionZip (Yang et al., 2025c) | 25.0% | ✗ | 61.8 | 57.2 | 54.4 | 60.6 | 51.5 | 37.3 |
| FlashVid (Fan et al., 2026) | 30.0% | ✗ | 63.9 | 59.0 | 54.8 | 61.9 | 55.1 | 38.5 |
| ResAdapt (Ours) | 23.8% | ✗ | 62.6 | 57.5 | 55.3 | 61.0 | 58.4 | 38.5 |
| Random Drop | 10.0% | ✗ | 58.8 | 54.7 | 53.2 | 56.6 | 47.1 | 35.5 |
| ToMe (Bolya et al., 2022) | 10.0% | ✗ | 59.2 | 55.5 | 53.1 | 58.5 | 42.7 | 35.8 |
| VisionZip (Yang et al., 2025c) | 10.0% | ✗ | 59.9 | 55.4 | 53.7 | 58.8 | 45.8 | 35.4 |
| FlashVid (Fan et al., 2026) | 12.2% | ✗ | 61.0 | 57.1 | 54.8 | 59.1 | 47.8 | 37.1 |
| FixedScale | 12.3% | ✗ | 60.8 | 54.9 | 53.8 | 58.4 | 52.6 | 37.1 |
| ResAdapt (Ours) | 11.4% | ✗ | 60.7 | 56.6 | 54.6 | 59.6 | 56.1 | 37.3 |
| Qwen3-VL-8B, 128 frames | | | | | | | | |
| Vanilla | 100% | ✗ | 69.4 | 64.3 | 58.5 | 72.7 | 63.0 | 45.7 |
| Random Drop | 25.0% | ✗ | 67.2 | 61.3 | 56.8 | 67.4 | 55.3 | 42.4 |
| ToMe (Bolya et al., 2022) | 25.0% | ✗ | 67.2 | 62.0 | 55.9 | 70.4 | 53.5 | 43.1 |
| VisionZip (Yang et al., 2025c) | 25.0% | ✗ | 67.1 | 61.3 | 55.7 | 69.2 | 56.8 | 41.2 |
| ResAdapt (Ours) | 22.9% | ✗ | 67.4 | 61.9 | 56.3 | 70.8 | 59.6 | 43.3 |
| Random Drop | 10.0% | ✗ | 64.1 | 58.3 | 55.4 | 62.4 | 55.5 | 38.8 |
| ToMe (Bolya et al., 2022) | 10.0% | ✗ | 64.7 | 58.6 | 55.1 | 67.3 | 46.3 | 40.5 |
| VisionZip (Yang et al., 2025c) | 10.0% | ✗ | 64.2 | 59.1 | 54.2 | 66.8 | 47.6 | 39.4 |
| FixedScale | 12.3% | ✗ | 66.7 | 59.5 | 54.4 | 67.7 | 56.3 | 41.7 |
| ResAdapt (Ours) | 11.1% | ✗ | 66.8 | 60.2 | 55.4 | 69.4 | 58.2 | 42.6 |

Figure 3: Efficiency–accuracy trade-offs and temporal reallocation. (a, b) VideoMMMU and VideoMME accuracy versus visual-token retention ratio R. ResAdapt is on or near the Pareto frontier, with the clearest advantage on reasoning-heavy settings at low retention. (c) Relative gain from trading spatial resolution for temporal coverage under a fixed 8-frame-equivalent budget.

frame selection is treated only as the conceptual zero-budget limit of the same pre-encoding interface. Full hyperparameters, hardware, prompts, and reward definitions are deferred to Appendix A.

Baselines. We compare against three classes of methods: heuristic baselines (Random Drop, FixedScale), model-side compression (ToMe (Bolya et al., 2022), FlashVid (Fan et al., 2026), VisionZip (Yang et al., 2025c)), and reasoning-time inference augmentation (VideoAuto-R1 (Liu et al., 2026)). We use visual-token retention ratio R as the primary budget descriptor and report the exact retained budget for every method.
For reasoning-time baselines, R measures only visual encoder tokens; unless latency is reported separately, these comparisons should therefore be read as visual-budget comparisons rather than total-inference-budget matches. Because several baselines admit only discrete operating points, some comparisons are only approximately budget-matched and should be interpreted relative to the explicit trade-offs shown in each table.

Benchmarks. For video QA, we report results on VideoMME (Fu et al., 2025a), LongVideoBench (Wu et al., 2024), MMVU (Zhao et al., 2025b), MLVU (Zhou et al., 2025), VideoMMMU (Hu et al., 2025), and LVBench (Wang et al., 2025b). For temporal grounding, we report Recall@{0.3, 0.5, 0.7} and mIoU on Charades-STA (Gao et al., 2017) and ActivityNet (Fabian et al., 2015), plus grounding QA on NExT-GQA (Xiao et al., 2024). For image understanding, we evaluate on MathVista (Lu et al., 2023), MMMU (Yue et al., 2024), OCRBench (Liu et al., 2024), ChartQA (Masry et al., 2022), AI2D (Kembhavi et al., 2016), and TextVQA (Singh et al., 2019). Unless stated otherwise, figures and analyses use Qwen2.5-VL-7B with 32 input frames. All evaluations use lmms-eval (Zhang et al., 2024a); the exact token budgets and decoding limits are reported in Appendix A.

4.2. Main Results

We organize the evaluation around two primary claims and one exploratory question. First, resize-based input-side allocation should improve low-budget operating points on video QA. Second, allocating pixels before encoding should be clearly more robust than frame dropping and often competitive with model-side compression on temporal grounding. Third, we ask how far the learned policy transfers beyond the training setting; we treat this last analysis as exploratory rather than as a headline claim.

Video QA. This experiment tests the first claim.
We emphasize low-budget operating points rather than a universal frontier statement, since several baselines are available only at discrete budgets (Table 1).

Table 2: Evaluation results on temporal grounding benchmarks. Grounding is much more compression-sensitive. Columns: retention ratio R, reasoning-time inference, Recall@{0.3, 0.5, 0.7} and mIoU on Charades-STA and ActivityNet, and accuracy/mIoU on NExT-GQA.

Qwen2.5-VL-7B, 32 frames
  Method                           R      Reas.  Charades-STA             ActivityNet              NExT-GQA
                                                 0.3   0.5   0.7   mIoU   0.3   0.5   0.7   mIoU   Acc   mIoU
  Vanilla                          100%   ✗      71.0  51.4  26.0  47.3   30.4  18.0   8.9  22.6   78.9  28.0
  Random Drop                      25.0%  ✗      39.4  23.2  11.0  25.7   15.2   8.1   3.7  11.7   77.5  16.6
  ToMe (Bolya et al., 2022)        25.0%  ✗      39.5  23.9  11.4  26.0   16.0   8.4   4.0  12.1   77.8  16.3
  FlashVid (Fan et al., 2026)      31.3%  ✗      40.7  24.2  11.3  26.6   15.8   8.4   3.8  12.0   78.1  16.5
  FixedScale                       25.0%  ✗      36.7  24.7  12.3  24.9   18.6   9.4   4.3  14.1   77.7  12.3
  ResAdapt (Ours)                  16.2%  ✗      53.8  34.8  17.0  35.6   19.8  10.8   5.2  15.3   76.6  23.2
  Random Drop                      10.0%  ✗      36.9  23.2  11.6  24.6   14.3   7.5   3.6  11.1   76.3  15.4
  ToMe (Bolya et al., 2022)        10.0%  ✗      41.3  26.9  14.1  27.4   16.0   8.4   4.0  12.2   77.3  15.7
  FlashVid (Fan et al., 2026)      12.6%  ✗      38.2  22.9  11.1  25.1   15.4   8.1   3.7  11.8   77.4  16.1
  FixedScale                       12.3%  ✗      48.0  31.5  15.4  32.0   17.5   8.9   4.0  13.3   76.1  13.7
  FixedScale                       6.3%   ✗      39.9  26.8  13.3  26.7   15.2   8.1   3.9  11.9   74.1  15.4
  ResAdapt (Ours)                  6.8%   ✗      41.0  27.8  14.0  27.2   16.3   8.5   3.9  12.5   74.3  20.4
  VideoAuto-R1 (Liu et al., 2026)  100%   ✓      60.0  48.3  27.2  41.5   50.8  34.1  17.4  34.4   73.6  33.8
  + ResAdapt (Ours)                6.8%   ✓      43.5  30.1  15.8  30.0   35.4  21.5  10.0  24.4   74.7  24.7

Qwen2.5-VL-7B, 128 frames
  Vanilla                          100%   ✗      77.5  60.3  34.1  52.8   47.9  30.9  17.5  34.4   79.8  29.9
  Random Drop                      25.0%  ✗      32.3  19.6   7.9  20.7   26.7  13.9   6.3  18.8   80.3  10.7
  ToMe (Bolya et al., 2022)        25.0%  ✗      32.4  19.8   7.9  20.7   27.2  14.4   6.4  19.1   80.3  10.9
  ResAdapt (Ours)                  16.1%  ✗      63.5  43.6  21.3  42.0   33.1  19.3  10.2  24.3   78.1  27.2
  Random Drop                      10.0%  ✗      37.8  23.8  11.2  24.7   23.8  12.0   5.3  17.0   79.4  12.8
  ToMe (Bolya et al., 2022)        10.0%  ✗      27.9  16.2   7.3  17.9   22.9  11.8   5.5  16.4   79.1  11.1
  FixedScale                       12.3%  ✗      34.7  22.3  10.5  22.7   25.0  13.8   5.9  18.3   77.9  11.3
  FixedScale                       6.3%   ✗      42.6  28.4  14.3  28.3   22.8  12.8   5.7  17.1   75.7  12.9
  ResAdapt (Ours)                  6.8%   ✗      43.5  29.8  15.0  28.9   23.5  12.9   6.1  17.2   76.2  23.9
  VideoAuto-R1 (Liu et al., 2026)  100%   ✓      40.3  33.7  22.1  28.9   49.4  34.3  18.5  33.5   68.0  31.0
  + ResAdapt (Ours)                16.1%  ✓      72.8  53.0  27.5  49.1   65.8  44.9  23.8  44.7   79.3  35.3
  + ResAdapt (Ours)                6.8%   ✓      50.1  33.2  16.6  34.2   53.4  34.0  16.4  35.7   76.6  29.4

Qwen3-VL-8B, 32 frames
  Vanilla                          100%   ✗      73.0  49.0  21.4  46.4   44.6  28.3  15.5  31.8   78.7  34.2
  Random Drop                      25.0%  ✗      16.2   8.6   3.8  12.1   12.4   6.7   3.2  10.0   77.2  15.6
  ToMe (Bolya et al., 2022)        25.0%  ✗      68.7  42.1  17.6  43.1   45.9  28.8  15.6  32.6   77.1  31.7
  FlashVid (Fan et al., 2026)      31.3%  ✗      72.9  52.3  25.1  47.7   51.9  33.4  19.0  36.8   77.8  33.9
  ResAdapt (Ours)                  16.2%  ✗      64.4  37.3  16.3  39.9   40.0  24.4  13.0  28.5   75.1  30.2
  Random Drop                      10.0%  ✗       4.1   1.8   0.7   4.4    4.7   2.4   1.0   5.0   74.3  11.3
  ToMe (Bolya et al., 2022)        10.0%  ✗      67.6  39.3  16.6  41.8   46.3  31.0  19.2  34.1   79.2  34.0
  FlashVid (Fan et al., 2026)      12.6%  ✗      68.8  46.9  22.9  44.6   49.9  31.5  17.4  35.2   75.6  31.8
  FixedScale                       12.3%  ✗      61.3  34.3  14.6  37.9   39.6  24.2  13.1  28.4   74.2  29.9
  FixedScale                       6.3%   ✗      52.7  28.2  11.3  33.2   37.0  22.3  12.0  27.0   71.5  28.0
  ResAdapt (Ours)                  6.8%   ✗      53.6  29.0  11.8  33.6   37.5  22.5  12.3  27.2   71.8  28.2

Qwen3-VL-8B, 128 frames
  Vanilla                          100%   ✗      72.8  46.0  20.1  45.6   45.8  31.1  19.2  33.9   81.1  36.6
  Random Drop                      25.0%  ✗      41.6  25.2  10.6  27.4   36.1  21.1  12.7  26.3   79.3  22.4
  ResAdapt (Ours)                  16.1%  ✗      64.4  37.0  15.9  39.8   40.6  26.7  15.7  30.0   76.8  33.3
  Random Drop                      10.0%  ✗      32.6  19.0   7.8  21.9   33.5  18.6  11.5  24.8   76.9  19.9
  ToMe (Bolya et al., 2022)        10.0%  ✗      61.6  33.8  13.3  38.1   42.4  27.6  16.6  31.4   77.4  31.5
  FixedScale                       12.3%  ✗      61.7  34.9  14.7  38.1   39.9  26.2  15.3  29.5   75.4  32.6
  FixedScale                       6.3%   ✗      53.7  28.2  11.8  33.6   37.9  24.3  14.3  28.1   73.0  39.1
  ResAdapt (Ours)                  6.8%   ✗      54.3  28.0  11.7  33.7   38.3  24.5  14.4  28.4   73.2  43.9

Table 3: Latency breakdown (ms, lower is better) on Qwen2.5-VL-7B with a single-GPU Allocator and a 4-GPU vLLM engine. Averaged over 200 runs after 5 warm-up runs; E2E latency = Scale Time + Gen. Time. Scaling columns: text encoding (TE), visual encoding (VE), scale prediction (SP), scale application (SA), scaling TFLOPs, and total Scale Time. Percentages are changes relative to the same-frame Vanilla row.

  Method    #Frames  R      TE    VE    SP     SA    TFLOPs  Scale Time  Inf. TFLOPs     TTFT             Gen. Time         Total TFLOPs     E2E Time
  Vanilla   16       100%   –     –     –      –     –       –           111.4           378.9            527.9             111.4            527.9
  ResAdapt  16       76.3%  1.5   19.8  94.1   85.6  6.3     205.8       77.2 (↓30.7%)   272.5 (↓28.1%)   370.7 (↓29.8%)    80.1 (↓28.1%)    576.5 (↑9.2%)
  ResAdapt  16       52.8%  1.5   19.9  102.9  94.5  8.4     225.7       51.5 (↓53.8%)   261.5 (↓31.0%)   313.1 (↓40.7%)    54.4 (↓51.2%)    538.8 (↑2.1%)
  ResAdapt  16       28.9%  1.5   20.4  103.4  92.2  9.0     225.0       31.0 (↓72.2%)   227.2 (↓40.0%)   237.9 (↓54.9%)    33.9 (↓69.6%)    462.9 (↓12.3%)
  Vanilla   32       100%   –     –     –      –     –       –           222.5           723.3            881.9             222.5            881.9
  ResAdapt  32       74.4%  2.9   19.9  204.1  97.4  14.4    335.9       153.9 (↓30.8%)  589.4 (↓18.5%)   627.6 (↓28.8%)    159.7 (↓28.2%)   963.5 (↑9.2%)
  ResAdapt  32       51.5%  2.9   20.0  193.2  92.0  16.2    321.4       102.4 (↓54.0%)  505.0 (↓30.2%)   467.1 (↓47.0%)    108.2 (↓51.4%)   788.5 (↓10.6%)
  ResAdapt  32       28.2%  2.9   20.3  190.4  90.3  17.3    318.3       61.4 (↓72.4%)   451.8 (↓37.5%)   332.6 (↓62.3%)    67.2 (↓69.8%)    650.9 (↓26.2%)
  Vanilla   64       100%   –     –     –      –     –       –           444.6           1457.5           2059.6            444.6            2059.6
  ResAdapt  64       73.2%  5.8   19.8  389.5  95.8  26.4    531.5       307.3 (↓30.9%)  1093.1 (↓25.0%)  1327.0 (↓35.6%)   318.9 (↓28.3%)   1858.5 (↓9.8%)
  ResAdapt  64       50.7%  5.8   20.1  382.1  94.9  29.9    527.0       204.3 (↓54.0%)  991.8 (↓31.9%)   740.5 (↓64.0%)    215.9 (↓51.4%)   1267.5 (↓38.5%)
  ResAdapt  64       27.8%  5.8   20.0  371.6  90.2  34.8    516.6       122.2 (↓72.5%)  899.2 (↓38.3%)   511.4 (↓75.2%)    133.8 (↓69.9%)   1028.0 (↓50.1%)
  Vanilla   128      100%   –     –     –      –     –       –           888.9           2936.3           4877.0            888.9            4877.0
  ResAdapt  128      74.2%  11.6  20.1  766.3  95.0  53.1    934.5       614.1 (↓30.9%)  2286.6 (↓22.1%)  2323.6 (↓52.4%)   637.3 (↓28.3%)   3258.1 (↓33.2%)
  ResAdapt  128      51.4%  11.6  20.2  755.3  93.8  59.4    928.7       408.0 (↓54.1%)  2071.0 (↓29.5%)  1496.0 (↓69.3%)   431.2 (↓51.5%)   2424.7 (↓50.3%)
  ResAdapt  128      28.2%  11.6  20.4  734.5  92.0  68.6    915.5       243.9 (↓72.6%)  1766.7 (↓39.8%)  1061.8 (↓78.2%)   267.1 (↓70.0%)   1977.3 (↓59.5%)

Disproportionate gains on multi-step reasoning. Under aggressive compression (∼10% retention), content-agnostic methods discard sparse but decisive evidence. On Qwen2.5-VL with 32 frames, ResAdapt reaches 45.7 on VideoMMMU at 11.4% retention, improving over ToMe (39.2), VisionZip (39.1), FlashVid (39.4), and FixedScale (44.3) while remaining competitive on the perception benchmarks. The gain is therefore concentrated on the reasoning-heavy benchmark rather than being uniform across tasks. The transferred Allocator remains strong on Qwen3-VL, reaching 56.1 on VideoMMMU at the same 11.4% retention, but we treat the cross-backbone comparison as supportive rather than definitive because the available operating points are not perfectly matched.

Spatial savings reinvested as temporal coverage. Extending context from 32 to 128 frames amplifies the advantage. At 22.9% retention on Qwen2.5-VL, ResAdapt reaches 51.1 on VideoMMMU versus 47.9 for the 128-frame uncompressed model while recovering most of the perception performance at far lower visual cost. Even at 11.1% retention, ResAdapt attains 49.2 on VideoMMMU, again slightly exceeding the 128-frame vanilla score.
The effect is task-dependent rather than uniform, but it shows that saved spatial budget can be reinvested as temporal headroom: the model observes 4× more frames without paying native-resolution cost (Figure 3).

Temporal Grounding. This experiment tests the second claim. Grounding is markedly more compression-sensitive than QA, so we interpret Table 2 as a comparison of operating points rather than as a perfectly budget-matched frontier.

Pre-encoding allocation is more robust than frame dropping. On Qwen2.5-VL (32F), Random Drop, ToMe, FlashVid, and FixedScale reduce Charades-STA mIoU from 47.3 to 25.7, 26.0, 26.6, and 24.9, respectively, at ≈25–31% retention. At a lower 16.2% budget, ResAdapt reaches 35.6. The point is therefore not a universal matched-budget win over every grounding baseline; rather, allocating pixels before encoding is substantially more robust than frame dropping and remains competitive with model-side compression despite operating at a smaller budget. On the reported Qwen3 rows, the same pattern holds most clearly against frame dropping, while comparisons to model-side baselines are more mixed.

Reasoning without temporal anchors regresses. The strongest grounding result is a single but important reasoning-augmented operating point. On VideoAuto-R1 (Qwen2.5-VL), extending from 32 to 128 frames

[Figure 4 panels: per-frame scale s_t vs. frame index; high-diversity Cases 338 (σ=0.214), 1640 (σ=0.247), 1302 (σ=0.251); low-diversity Cases 1166 (σ=0.002), 2435 (σ=0.019), 1770 (σ=0.009).]

Figure 4: Emergent active perception.
Per-frame scale s_t over frame index for six VideoMME videos, grouped by intra-video scale diversity σ. High-diversity videos show localized scale spikes on scene changes, text overlays, and rapid motion; low-diversity videos remain near-uniform.

degrades Charades-STA mIoU from 41.5 to 28.9, indicating that longer reasoning chains alone do not preserve localization quality in this setting. Adding ResAdapt at 16.1% visual-token retention lifts the 128-frame score to 49.1, showing that allocation can materially improve this particular long-context reasoning regime.

Emergent denoising. On NExT-GQA (Qwen3-VL, 128F), ResAdapt improves mIoU from 36.6 to 43.9 at only 6.8% retention, suggesting that suppressing question-irrelevant frames can sharpen localization even under extreme compression.

Exploratory image transfer. Transfer beyond video is mixed and not central to our claims. The clearest positive result is ChartQA on Qwen2.5-VL, but it is obtained at 105% visual-token retention, whereas text-heavy tasks degrade once resolution becomes too low. We therefore treat image results as a robustness check and report them in Appendix D.4, Table 7.

4.3. Runtime Overhead

This experiment is a pipeline-latency case study: it measures when the front-end cost of allocation is amortized by downstream token savings. Table 3 reports latency against vanilla using a dedicated single-GPU Allocator and a separate 4-GPU vLLM engine; it should therefore be read as a vanilla-versus-ResAdapt deployment profile rather than as a same-hardware comparison against ToMe, FlashVid, or VisionZip. By adjusting the maximum allowed scale, ResAdapt spans retention ratios from conservative (R ≈ 74%) to aggressive (R ≈ 28%) compression, providing a single controllable accuracy–speed knob. The key trade-off is when token reduction outweighs the front-end scaling cost.
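This trade-off can be checked with the simple additive model E2E = Scale Time + Gen. Time, using the R ≈ 28% rows of Table 3 (the numbers are copied from the table; the additive model itself is the only assumption):

```python
# E2E latency (ms) = Scale Time + Gen. Time; values from Table 3 at R ~ 28%.
vanilla_gen = {16: 527.9, 32: 881.9, 64: 2059.6, 128: 4877.0}
resadapt = {  # frames: (scale_time, gen_time)
    16: (225.0, 237.9),
    32: (318.3, 332.6),
    64: (516.6, 511.4),
    128: (915.5, 1061.8),
}

for frames, (scale_t, gen_t) in resadapt.items():
    e2e = scale_t + gen_t
    saving = 1.0 - e2e / vanilla_gen[frames]
    # Reproduces the reported E2E savings: 12.3%, 26.2%, 50.1%, 59.5%.
    print(f"{frames:>3} frames: E2E {e2e:.1f} ms vs {vanilla_gen[frames]:.1f} ms "
          f"({saving:+.1%} saved)")
```

The fixed front-end Scale Time grows roughly linearly with frame count while generation savings grow faster, which is why the savings widen as the context lengthens.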
At R ≈ 74%, generation time drops 29–52% but end-to-end (E2E) savings appear only at ≥64 frames (−9.8%), growing to −33.2% at 128 frames. At R ≈ 51%, the break-even shifts to 32 frames (−10.6% E2E); at R ≈ 28%, wall-clock savings emerge even at 16 frames (−12.3%), reaching −59.5% at 128 frames with a 78% generation-time reduction. This scaling reflects the quadratic cost of attention: backbone savings compound faster than the lower-order Allocator overhead as sequences grow, making ResAdapt most impactful in the long-context regime.

Figure 5: Global allocation statistics on VideoMME. (a) Aggregate predicted scale by frame position (IQR, mean, median). (b) Case × frame heatmap for the first 200 videos. High-scale allocation appears as localized bursts rather than a fixed positional pattern.

Table 4: Distribution-family ablation for CAPO. The two variants follow the same training protocol. Per./Comp./Adap. denote the VideoMMMU perception, comprehension, and adaptation tracks.

  Variant  ¯s    VideoMME  LongVideoBench  MMVU  VideoMMMU (Per. / Comp. / Adap.)  LVBench
  β-CAPO   0.54  60.3      58.2            51.2  65.0 / 54.3 / 28.7                37.6
  N-CAPO   0.60  61.0      57.4            51.8  66.0 / 50.0 / 30.3                37.2

4.4. Analysis and Ablation

Emergent active perception. These analyses explain why the main results hold. The Allocator learns a strongly sparse temporal allocation rather than a near-uniform compression policy. Figure 4 shows this clearly: many videos are compressed almost everywhere, with short bursts of higher resolution around text overlays, scene transitions, or other brief informative events.
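The grouping statistic σ in Figure 4 reads as intra-video dispersion of the predicted scales. A minimal sketch, assuming σ is the population standard deviation of the per-frame scale factors (our reading of how the cases are grouped):

```python
import statistics

def scale_diversity(scales):
    """Intra-video scale diversity: population std of per-frame scales.
    How Figure 4 computes sigma is our assumption; shown for illustration."""
    return statistics.pstdev(scales)

uniform = [0.25] * 32                          # low diversity: near-constant policy
bursty = [0.25] * 28 + [0.9, 1.0, 0.9, 0.8]    # localized high-budget burst
print(f"uniform sigma = {scale_diversity(uniform):.3f}")
print(f"bursty  sigma = {scale_diversity(bursty):.3f}")
```

A near-constant trace yields σ ≈ 0, while a trace with a short high-resolution burst yields a σ in the range of the high-diversity cases shown in the figure.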
This is precisely the behavior we would want from input-side allocation: the Allocator spends pixels where the answer is likely to be decided, rather than distributing them uniformly across the clip. Figure 5 shows that this behavior is not a trivial positional prior. The median scale stays close to the low end of the range, while the mean is lifted by localized peaks, implying that high-resolution allocation is the exception rather than the default. The per-video heatmap further confirms that these peaks appear as content-dependent segments rather than as a fixed bias toward the beginning or end of the sequence.

CAPO ablation. Two questions arise: how cost should enter optimization, and what induces the policy to vary across neighboring frames rather than collapse to a uniform scaler. Table 4 suggests that the exact policy family is secondary: β-CAPO and N-CAPO trade marginal advantages across benchmarks, with neither variant consistently dominating. Because the realized budgets are not exactly matched, we interpret this as a directional trend rather than as a perfectly controlled ablation. The shared ingredient that appears to matter is CAPO's asymmetric cost shaping, not the specific parametric form of the scale distribution.

Figure 6: Reward-design ablation. Mean predicted scale ¯s during training (a) and validation (b) for β-Dist. w/o cost, β-Dist. with direct cost, β-Dist. with CAPO, and N-Dist. with CAPO. Direct cost penalties collapse to the minimum scale, whereas CAPO variants converge to stable intermediate operating points.

Figure 6 makes this more explicit from a training-dynamics perspective.
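The three regimes in Figure 6 can be reproduced with a toy one-dimensional example: with a saturating accuracy curve, no cost pushes the optimum to the upper bound, a strong direct linear penalty collapses it to the minimum scale, and an asymmetric penalty that only charges budget above a target leaves an interior optimum. This is a deliberately simplified stand-in for CAPO's cost shaping; the curve, λ, and target below are invented for illustration:

```python
import math

scales = [round(0.1 * i, 1) for i in range(1, 11)]   # candidate scales 0.1..1.0
acc = {s: 1.0 - math.exp(-6.0 * s) for s in scales}  # toy saturating accuracy

def best(reward):
    return max(scales, key=reward)

lam, target = 3.0, 0.4
no_cost = best(lambda s: acc[s])                             # upper bound
direct = best(lambda s: acc[s] - lam * s)                    # collapses to s_min
shaped = best(lambda s: acc[s] - lam * max(s - target, 0.0)) # interior optimum
print(no_cost, direct, shaped)  # 1.0 0.1 0.4
```

The qualitative picture, not the specific numbers, is the point: only the asymmetric shaping leaves an interior operating point for the policy to settle on.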
Direct cost penalties drive the policy rapidly toward the minimum-scale boundary, while removing cost altogether pushes it toward the upper bound. CAPO is effective precisely because it stabilizes an intermediate operating point where the model is still rewarded for being selective rather than merely cheap or merely accurate. We provide further analysis of per-sample scale adaptivity and convergence behavior in Appendix D.2.2.

Table 5: Operator generalization. Zero-shot transfer of ResAdapt scores to frame selection. Combining top-K selection with adaptive resizing from 128 candidate frames outperforms uniform-sampling baselines at lower token budgets. "Avg. Budget" rows give the average number of frames kept (f) and the corresponding retention ratio.

  Method                  VideoMME       LongVideoBench  LVBench        MMVU
  Budget: 8 frames
  Vanilla                 54.0           53.9            33.3           48.9
  Top-8 Select            52.2           51.1            32.0           49.2
  Budget: 16 frames
  Vanilla                 58.9           56.0            36.1           50.9
  Threshold Select        58.0           57.4            36.4           51.0
    Avg. Budget (R)       12.2f (9.5%)   23.2f (18.1%)   16.7f (13.0%)  17.2f (13.4%)
  Top-32 Select + Resize  60.6           57.2            38.9           50.2
    Avg. Budget (R)       11.7f (9.1%)   16.9f (13.2%)   13.7f (10.7%)  14.1f (11.0%)
  Budget: 32 frames
  Vanilla                 62.3           58.7            39.5           52.0
  Top-32 Select           59.7           55.7            37.0           51.2
  Top-64 Select + Resize  62.5           58.4            40.0           52.3
    Avg. Budget (R)       23.8f (18.6%)  36.2f (28.3%)   24.1f (18.8%)  32.5f (25.4%)

Operator generalization. Although ResAdapt is trained exclusively for adaptive resizing, its learned policy generalizes zero-shot to other input-side operators. We repurpose the Allocator's predicted scales as frame-importance scores to rank and filter 128 candidate frames. Table 5 demonstrates that while pure frame selection (top-K or threshold-based) provides a viable strategy, combining selection with resizing yields the best efficiency–accuracy trade-off.
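The transfer recipe behind Table 5 amounts to ranking candidate frames by the Allocator's predicted scale, keeping the top-K, and resizing each kept frame by its own score. A minimal sketch (the function and its tie-breaking are our illustrative choices, not the paper's implementation):

```python
def select_and_resize(scores, k):
    """Repurpose per-frame Allocator scores as importance: keep the top-k
    frames and use each kept frame's predicted scale as its resize factor.
    Returns (frame_index, scale) pairs in temporal order. Illustrative sketch."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [(i, scores[i]) for i in kept]

# 8 candidate frames with one informative burst around frames 3-4:
scores = [0.2, 0.25, 0.2, 0.9, 0.8, 0.3, 0.2, 0.25]
print(select_and_resize(scores, 4))
# -> [(1, 0.25), (3, 0.9), (4, 0.8), (5, 0.3)]
```

Because the kept frames carry low scales outside the burst, the combined operator spends fewer tokens than keeping k frames at full resolution.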
Specifically, selecting and resizing the top-32 or top-64 frames consistently outperforms the vanilla 16-frame and 32-frame baselines, respectively, despite consuming fewer tokens on average. This confirms that the Allocator learns a robust, operator-agnostic measure of visual importance that seamlessly cascades with discrete frame dropping.

Figure 7: L_sim ablation: per-frame scale profiles for Videos 125, 288, and 483 (w/o L_sim: σ = 0.0032, 0.0019, 0.0019; w/ L_sim: σ = 0.2421, 0.2481, 0.2114). Without temporal-similarity regularization, the Allocator approaches near-uniform scaling; with it, the policy concentrates resolution on selected frames and suppresses redundant neighbors.

Temporal regularization complements CAPO. CAPO alone determines how cost enters learning, but it does not by itself force the Allocator to distinguish among visually redundant neighbors. Figure 7 shows that once L_sim is removed, the learned scale traces collapse toward a near-constant profile that behaves much like FixedScale. Reintroducing L_sim restores sharp frame-level differentiation, which confirms that CAPO and the temporal regularizer are complementary: one stabilizes the accuracy–cost objective, and the other breaks the symmetry that would otherwise favor uniform allocation.

Robustness and failure modes. Adaptive allocation does not act as a lossless compression layer.
In practice, ResAdapt usually preserves many originally correct predictions, but it can still miss decisive evidence, especially when the relevant cue is visually simple and appears only briefly. Because the policy is open-loop, it cannot revise allocations after reasoning begins or recover evidence that was undersampled in the initial pass. We therefore interpret its gains as a selective redistribution of visual budget rather than as guaranteed preservation of all useful information.

5. Related Work

Input-side adaptation before visual encoding. A growing body of work reduces visual cost before or during input construction. Early approaches primarily perform temporal downsampling through keyframe selection or clip condensation (Liang et al., 2024, Zhu et al., 2025, Sun et al., 2025, Tang et al., 2025). More recent methods incorporate query awareness and iterative search, tailoring frame selection to question types or intermediate evidence (Zou et al., 2025, Li et al., 2025a, Guo et al., 2025b, He et al., 2025). Beyond selecting which frames to process, several works allocate perceptual budgets via multi-resolution encoding. Slow–fast pipelines (Yang et al., 2025a, Zhang et al., 2026) use inter-frame similarity to route frames to high- or low-resolution paths, but their binary, query-agnostic routing cannot adapt to the downstream question. Query-aware multi-resolution strategies (Zhang et al., 2025d) and early truncation of less informative visual tokens (Chen et al., 2026) go further by conditioning on the query, yet still rely on handcrafted rules or fixed resolution bins. In contrast, ResAdapt is an input-side adaptation framework: it learns input-side allocations from task reward via RL and can realize them through different pre-encoding operators, including resizing and frame selection; the experiments in this paper study the continuous resize instantiation.
Model-side token economy after encoding. Post-encoding methods prune, merge, or redistribute visual tokens in embedding space. For images, representative approaches include token merging (Bolya et al., 2022), attention- or saliency-guided pruning (Chen et al., 2024, Yang et al., 2025c, Shang et al., 2025, Zhang et al., 2025c), progressive dropping (Xing et al., 2024, Zhang et al., 2024b), and diversity-based budget allocation (Alvar et al., 2025, Yang et al., 2025b, Zhang et al., 2025a). Video-specific extensions exploit spatiotemporal redundancy via static/dynamic token separation (Huang et al., 2025, Shen et al., 2025a), hierarchical merging (Hyun et al., 2025), and segment-level fusion or budget allocation (Tao et al., 2025, Fu et al., 2024, Shao et al., 2025a). These methods are complementary to ResAdapt: they operate after visual encoding and cannot recover high-frequency details lost to undersampling before encoding. Our focus is earlier in the pipeline: deciding how many pixels to encode in the first place.

Output-side agentic reasoning. Another strategy leaves the input fixed and recovers efficiency through iterative reasoning: retrieve candidate frames, zoom into regions, then re-query the model. Approaches range from static toolsets with predefined cropping or clipping operators (Zheng et al., 2025b, Wang et al., 2025a, Song et al., 2026) to dynamic tooling via code-generation primitives (Zhang et al., 2025e, Zhao et al., 2025a, Hong et al., 2025), often exposed through executable interfaces (Wang et al., 2024). While these methods can target hard evidence precisely, they are multi-pass by construction and rely on an initial coarse view to trigger subsequent refinement.
ResAdapt instead studies whether a single-pass pre-encoding allocation policy can recover much of this benefit without the latency and control overhead of iterative interaction.

RL for multimodal reasoning and perception control. Recent work has extended RL post-training from language models (Shao et al., 2024, Guo et al., 2025a) to multimodal reasoning and video understanding. Algorithmic refinements include improved advantage estimation and PPO-style stabilization (Liu et al., 2025c, Yu et al., 2025, Zheng et al., 2025a), while video-domain extensions strengthen reasoning through iterative frame selection and evidence refinement (Feng et al., 2025, Li et al., 2025b, Liu et al., 2026, Yang et al., 2025d, Chen et al., 2025, Wang et al., 2025c, Fu et al., 2025b). Our use of RL is orthogonal: we apply it to input-side perception control (learning frame-level visual allocations under an explicit accuracy–cost trade-off) rather than to output-side reasoning policies. CAPO is designed for this setting, where naive cost penalties drive the policy to a degenerate low-budget solution.

6. Conclusion

We study input-side adaptation through ResAdapt, a framework that shifts part of the efficiency burden from post-encoding token compression to pre-encoding visual budget control. A lightweight Allocator, trained with Cost-Aware Policy Optimization (CAPO) and temporal-similarity regularization, predicts per-frame allocations before visual encoding, and the downstream backbone continues to operate on its native token interface. Empirically, the framework is most convincing in low-visual-token video QA and in one reasoning-augmented long-context grounding regime where saved spatial compute is reinvested as temporal coverage.
The learned policy is sparse and content-dependent, but it remains open-loop: once the initial allocations are set, the model cannot recover missed evidence within the same pass. More broadly, the results suggest that pre-encoding allocation is a promising direction for long-context video reasoning, while the present evidence remains specific to the resize instantiation studied here and transfers beyond the training domain only unevenly.

References

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9392–9401, 2025.

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-VL technical report, 2025a.

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint, 2025b.

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman.
Token merging: Your ViT but faster. arXiv preprint, 2022.

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling RL to long videos. arXiv preprint, 2025.

Zeyuan Chen, Kai Zhang, Zhuowen Tu, and Yuanjun Xiong. Soft tail-dropping for adaptive visual tokenization. arXiv preprint arXiv:2601.14246, 2026.

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024.

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.

Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, and Zhuotao Tian. FlashVid: Efficient video large language models via training-free tree-based spatiotemporal token merging. arXiv preprint, 2026.

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025a.
Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-R1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning. arXiv preprint arXiv:2509.24786, 2025b.

Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. FrameFusion: Combining similarity and importance for video token reduction on large vision language models. arXiv preprint, 2024.

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint, 2025a.

Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, and Hui Xiong. Logic-in-frames: Dynamic keyframe search via visual semantic-logical verification for long video understanding. arXiv preprint arXiv:2503.13139, 2025b.

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. FrameThinker: Learning to think with long videos via multi-turn frame spotlighting. arXiv preprint, 2025.

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint, 2025.

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025.

Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual token pruning for efficient video large language models.
In Findings of the Association for Computational Linguistics: ACL 2025, pages 19959–19973, 2025.

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, and Minho Shim. Multi-granular spatio-temporal token merging for training-free acceleration of video LLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23990–24000, 2025.

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, pages 235–251. Springer, 2016.

Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N Plataniotis, Yao Lu, Song Han, and Zhijian Liu. SparseVILA: Decoupling visual sparsity for efficient VLM inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23784–23794, 2025.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Jialuo Li, Bin Li, Jiahao Li, and Yan Lu. Divide, then ground: Adapting frame selection to query types for long-form video understanding. arXiv preprint, 2025a.

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025b.

Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, and Wentao Zhang. KeyVideoLLM: Towards large-scale video keyframe selection.
arXiv preprint arXiv:2407.03104, 2024.

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407, 2025a.

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. VideoAuto-R1: Video auto reasoning via thinking once, answering twice. arXiv preprint, 2026.

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. NVILA: Efficient frontier visual language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4122–4134, 2025b.

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint, 2025c.

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint, 2023.

Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint, 2025.

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning.
In Findings of the a ssociation f or computationa l linguistics: A CL 2022 , pages 2263–2279, 2022. Jeff Ra sley , Samyam Rajbhandari, Ol atunji R uwa se, and Y uxi ong He. Deepspeed: Sy stem optimiz atio ns enab le training deep learning models with o ver 100 billio n parameters. In Proceedings of the 26th A CM SIGKD D intern atio nal conf erence on kno wledge discov ery & dat a mining , pa ges 3505–3506, 2020. Y uzhang Shang, Mu Cai, Bingxin Xu, Y ong Jae L ee, and Y an Y an. Ll av a-prumerge: Adaptiv e token red ucti on for effici ent l arge m ultim odal models. In Proceedings of the I EEE/CVF International Conferen ce on C omputer Visi on , pages 22857–22867, 2025. Kel e Shao, Keda T ao, Can Qin, Haoxuan Y ou, Y ang Sui, and Huan W ang. Holitom: Holistic token merging for fast video large l anguage models. arXiv preprint , 2025a. 20 R esAdapt: Adaptive R esolution for Efficient Multimoda l Reaso ning Kel e Shao, Keda T ao, Kejia Z hang, Sicheng F eng, Mu Cai, Yuzhang Shang, Haoxuan Y ou, Can Q in, Y ang Sui, and Hu an W ang. When tokens t alk too m uch: A survey of multim oda l lo ng-context token compressio n across images, videos, and audi os. arXiv preprint , 2025b. Zhihong Shao, Peiyi W ang, Qihao Z hu, R unxin Xu, Junxiao Song, Xi ao Bi, Haow ei Z hang, Mingchu an Z hang, YK Li, Y ang W u, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open l anguage m odels. arXiv preprint , 2024. L eqi Shen, Guoqiang Gong, T ao He, Y ifeng Zhang, Pengzhang Liu, Sicheng Z hao, and Guiguang Ding. F a st vid: Dyn amic densit y pruning for f ast video large l anguage models. arXiv preprint , 2025a. Xiaoqian Shen, Min-Hung Chen, Y u-Chiang Frank W ang, Mohamed Elhoseiny , and R y o Hachiuma. Zoom-zero: R einf orced coarse-to-fine video understanding via temporal zoom-in. arXiv preprint , 2025b. Guangming S heng, Chi Zhang, Zilingfeng Y e, Xibin Wu, W ang Zhang, R u Zhang, Y anghu a Peng, Haibin Lin, and Chu an W u. 
Hybridflo w: A flexib le and efficient rlhf framew ork. In Proceedings o f the Tw entieth European C onf erence on Computer Sy stems , pa ges 1279–1297, 2025. Y an Shu, Z heng Liu, Peitian Z hang, Minghao Q in, Junjie Zhou, Zhengyang Liang, T iejun Huang, and Bo Zhao. Video-xl: Extra-lo ng vision language model f or hour-scal e video understanding. In Proceedings o f the Computer Visio n and P attern R ecogniti on C onf erence , pages 26160–26169, 2025. Amanpreet Singh, Viv ek Nat arajan, Meet Sha h, Y u Ji ang, Xinlei Chen, D hruv Batra, Devi P arikh, and Marcus R ohrbach. T o wards vqa m odels that can read. In Proceedings of the I EEE/C VF conferen ce on computer visio n and pattern recognitio n , pa ges 8317–8326, 2019. Mingyang S ong, H aoyu Sun, Jiaw ei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, and Y u Cheng. Adarea soner: Dynamic tool orchestration for iterative visual reaso ning. arXiv preprint , 2026. Guangyu Sun, Archit Singhal, Burak U zkent, Mubarak Sha h, Chen Chen, and Garin Kess ler . From frames to clips: T raining-free ada ptive key clip selectio n f or lo ng-form video underst anding. arXiv preprint arXiv:2510.02262 , 2025. Xi T ang, Jihao Qiu, Lingxi Xie, Y unjie Tian, Jianbin Jiao, and Qixiang Y e. Adaptiv e keyframe sampling for lo ng video underst anding. arXiv preprint , 2025. Keda T ao, Can Qin, Haoxuan Y ou, Y ang Sui, and Huan W ang. Dycoke: Dynamic compressio n of tokens f or fast video l arge language models. In Proceedings of the Computer Visio n and P attern R ecogniti on C onf erence , pa ges 18992–19001, 2025. Haozhe W ang, Alex Su, W eiming R en, F angzhen Lin, and W enhu Chen. P ixel rea soner: Incentivizing pixel-space reasoning with curiosit y-driv en reinforcement learning. arXiv preprint , 2025a. W eihan W ang, Zehai He, W enyi Hong, Y ean Cheng, Xiaohan Z hang, Ji Q i, Ming Ding, Xi aotao Gu, Shiyu Huang, Bin Xu, et al. L vbench: An extreme long video understanding benchmark. 
In Proceedings of the I EEE/C VF Internationa l Conferen ce on C omputer Visio n , pa ges 22958–22967, 2025b. Xingyao W ang, Y angyi Chen, Lifan Yuan, Yizhe Zhang, Y unzhu Li, Hao Peng, and Heng Ji. Execut ab le code actio ns elicit better llm agents. In F ort y -first Internationa l Conferen ce on Machine Learning , 2024. 21 R esAdapt: Adaptive R esolution for Efficient Multimoda l Reaso ning Y e W ang, Z iheng W ang, Bos hen Xu, Y ang Du, Kejun Lin, Zihan Xiao, Zihao Yu e, Jianzhong Ju, Li ang Zhang, Dingyi Y ang, et al. Time-r1: Post-training large visio n l anguage m odel f or temporal video grounding. arXiv preprint arXiv:2503.13377 , 2025c. Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. L ongvideoben ch: A benchmark for long-co ntext interleav ed video-language understanding. Advances in Neura l Informati on Processing S ystems , 37:28828–28857, 2024. Junbin Xiao, Angela Y ao, Yi cong Li, and T at-Seng Chua. Can i trust y our answer? visually grounded video questi on answering. In Proceedings of the I EEE/CVF Conference on C omputer Visi on and P attern R ecogniti on , pa ges 13204–13214, 2024. Long Xing, Qidong Huang, Xiaoyi Dong, Ji ajie Lu, Pan Z hang, Y uhang Zang, Y uhang Cao, Conghui He, Ji aqi W ang, Feng Wu, et al. Pyramiddrop: Accelerating yo ur large vision-langua ge m odels via pyramid visual red undancy reducti on. arXiv preprint , 2024. R uyi Xu, Guangxu an Xiao, Y ukang Chen, Liuning He, Kelly Peng, Y ao Lu, and Song Han. Streamingvlm: R ea l-time underst anding for infinite video streams. arXiv preprint , 2025. Biao Y ang, Bin W en, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl 1.5 technica l report. arXiv preprint , 2025a. Cheng Y ang, Y ang Sui, Jinqi Xiao, Lingyi Huang, Y u Gong, Chendi Li, Jinghua Y an, Y u Bai, P onn uswamy Sadaya ppan, Xia Hu, et a l. 
T opv: Compatib le token pruning with inferen ce time optimiz atio n f or fast and low-mem ory multim oda l visio n language model. In Proceedings o f the C omputer Visio n and P attern R ecogniti on Conf erence , pa ges 19803–19813, 2025b. Senqiao Y ang, Y ukang Chen, Zhuotao Tian, Chengyao W ang, Jingyao Li, Bei Y u, and Jiaya Jia. Visio nzip: Longer is better but not necessary in visio n language models. In Proceedings of the I EEE/C VF Conferen ce on Computer Visio n and P attern R ecogniti on , pages 19792–19802, 2025c. Zuhao Y ang, Sudong W ang, Kaichen Zhang, Keming Wu, Sicong L eng, Y if an Z hang, Bo Li, Chengw ei Qin, Shijian Lu, Xingxuan Li, and Lidong Bing. Longvt: Incentivizing "thinking with long videos" via n ativ e tool calling. arXiv preprint , 2025d. Qiying Y u, Zheng Zhang, R u of ei Zhu, Y uf eng Y uan, Xiaochen Zu o, Y u Y ue, W ein an Dai, Tiantian F an, Gaohong Liu, Ling jun Liu, et al. Dapo: An open-source llm reinf orcement learning system at scal e. arXiv preprint arXiv:2503.14476 , 2025. Xiang Y ue, Y uansheng Ni, Kai Zhang, Tianyu Z heng, R uoqi Liu, Ge Zhang, Samu el Stev ens, Dongfu Jiang, W eiming R en, Y uxuan Sun, et al. Mmm u: A massiv e multi-disciplin e multim odal understanding and reaso ning benchmark for expert a gi. In Proceedings o f the I EEE/CVF Conference on Computer Visi on and P attern R ecognition , pages 9556–9567, 2024. Boqiang Zhang, L ei Ke, R uihan Y ang, Qi Gao, T ianyu an Qu, R o ssell Chen, Dong Yu, et al. Penguin-vl: Exploring the effici ency limits of vlm with llm-based vision encoders. arXiv preprint , 2026. Ce Zhang, Kaixin Ma, Tianqing F ang, W enhao Yu, Hongming Z hang, Zhisong Zhang, Y aqi Xie, Katia S ycara, Hait ao Mi, and Do ng Y u. Vscan: R ethinking visual token redu ctio n f or effici ent large visio n-l anguage m odels. arXiv preprint , 2025a. 
22 R esAdapt: Adaptive R esolution for Efficient Multimoda l Reaso ning Congzhi Zhang, Z hibin W ang, Y inchao Ma, Ji aw ei Peng, Y ihan W ang, Qiang Z hou, Jun S ong, and Bo Z heng. R ewatch-r1: Boosting complex video rea soning in large vision-language models through a gentic dat a synthesis. arXiv preprint , 2025b. Kaichen Z hang, Bo Li, Peiyuan Zhang, F anyi P u, Jo shua Adrian Cahy on o, Kairui Hu, Shuai Liu, Y uanhan Zhang, Jingkang Y ang, Chunyu an Li, and Ziw ei Liu. Lmms-eval: R ea lit y check on the evaluati on of l arge m ultimoda l m odels, 2024a. URL . Qizhe Z hang, A osong Cheng, Ming Lu, R enrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Bey ond text-visual attentio n: Exploiting visu al cues for effectiv e token pruning in vlms. In Proceedings o f the I EEE/C VF Internationa l Conferen ce on Computer Visi on , pa ges 20857–20867, 2025c. Shaoji e Zhang, Jiahui Y ang, Jianqin Y in, Z henbo Luo, and Jian Lu an. Q -frame: Q uery-a ware frame selection and m ulti-resolutio n adaptation for video-llms. arXiv preprint , 2025d. Y i-F an Zhang, Xingyu Lu, Shukang Y in, Chaoyo u Fu, W ei Chen, Xi ao Hu, Bin W en, Kaiyu Jiang, Changyi Liu, T ianke Z hang, et a l. Thyme: Think bey ond images. arXiv preprint , 2025e. Y uan Z hang, Chun-Kai F an, Junpeng Ma, W enzhao Zheng, T ao Huang, Kuan Cheng, Denis Gudo vskiy , T o moyuki Okuno, Y ohei Nakata, K urt Keutzer , et al. Sparsevlm: Visua l token sparsificatio n for efficient visio n-langu a ge model inferen ce. arXiv preprint , 2024b. Shitian Z hao, Haoqu an Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen W ei. Pyvisio n: Agentic vision with dynamic tooling. arXiv preprint , 2025a. Y ilun Zhao, H ao wei Z hang, Lujing Xie, T ongyan Hu, Guo Gan, Yitao L ong, Zhiyuan Hu, W eiyuan Chen, Chuhan Li, Zhijian Xu, et a l. Mmvu: Measuring expert-lev el multi-discipline video understanding. 
In Proceedings o f the Computer Visi on and P attern R ecogniti on C onf erence , pages 8475–8489, 2025b. Chujie Z heng, Shixuan Liu, Mingz e Li, Xio ng-Hui Chen, Bo wen Y u, Chang Gao, Kai Dang, Yu qio ng Liu, R ui Men, An Y ang, et al. Group sequence policy optimiz atio n. arXiv preprint , 2025a. Lianmin Zheng, Liangsheng Y in, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Y u, Shiyi Cao, Christos Ko zyrakis, Ion Stoica, Joseph E. Gonzalez, Cl ark Barrett, and Y ing Sheng. S GLang: Effici ent executio n of structured l anguage model programs. In Advances in Neural Informati on Processing Systems , 2024. Ziw ei Zheng, Michael Y ang, Jack Hong, Chenxiao Z hao, Guohai Xu, Le Y ang, Chao Shen, and Xing Y u. Deep- eyes: Incentivizing" thinking with images" via reinforcement learning. arXiv preprint , 2025b. Junjie Z hou, Y an Shu, Bo Zhao, Boya Wu, Z hengyang Liang, Shitao Xiao, Minghao Qin, Xi Y ang, Y ongping Xio ng, Bo Zhang, et al. Mlvu: Benchmarking multi-tas k long video underst anding. In Proceedings o f the I EEE/C VF Conferen ce on Computer Visi on and P attern R ecognitio n , pa ges 13691–13701, 2025. Zirui Zhu, H ailun Xu, Y ang Luo, Y ong Liu, Kanchan Sarkar , Zhenheng Y ang, and Y ang Y ou. Focu s: Efficient keyframe selecti on for long video understanding. arXiv preprint , 2025. Y uanhao Zou, Sheng ji Jin, Andong Deng, Y oupeng Zhao, Jun W ang, and Chen Chen. Air: Ena bling adapti ve, iterative, and reaso ning-based frame selectio n for video questi on answering. arXiv preprint arXiv:2510.04428 , 2025. 23 R esAdapt: Adaptive R esolution for Efficient Multimoda l Reaso ning Limit ati ons and f uture work. R esAda pt improv es the efficien cy–accuracy trade-off for long-video MLLMs, but the current evidence is still bounded by four concrete design choices. ( i) Front-end o verhead is amortized only in the lo ng-context regime. 
The Allocator adds a fixed pre-encoding cost (coarse visual encoding, cross-frame fusion, and distribution prediction) before any backbone savings are realized. When the sequence is short (T ≤ 32), this constant overhead can offset a meaningful fraction of the downstream attention reduction, so the clearest wall-clock gains appear only when temporal context is long (Sec. 4.3). Reducing this fixed cost through cached video features, cheaper front-ends, or distilled allocation rules is therefore an important next step. (ii) Allocation is limited by coarse visual evidence. The Allocator observes frozen coarse features $f_t \in \mathbb{R}^D$ rather than the full high-resolution frame. This is sufficient to detect broad redundancy and scene structure, but it is weaker on small text, subtle objects, and brief answer-critical cues embedded in otherwise simple frames (Figure 20). Multi-scale conditioning, motion-aware features, or lightweight local refinement would help close this gap without giving up the speed advantage of the current front-end. (iii) The present study validates the framework through one video-centric instantiation. Our formulation is general input-side adaptation, but the experiments instantiate the operator with resizing and train the policy primarily on video tasks. As a result, transfer beyond this regime is uneven: the learned policy sometimes recognizes image inputs that need more fidelity, yet it does not deliver uniformly efficiency-preserving gains on static-image benchmarks (Table 7). Extending training to mixed image–video data and alternative operators such as hard frame selection remains open. (iv) Allocation is open-loop rather than reasoning-aware. All budget decisions are committed before the backbone processes any visual token. The policy therefore cannot revise a mistaken low-resolution choice after partial reasoning or uncertainty signals emerge.
A natural extension is closed-loop allocation, where early backbone states trigger re-encoding, budget revision, or a second visual pass only when needed.

Software and Data. The code for this paper is available at: https://github.com/Xnhyacinth/ResAdapt

A. Implementation Details

A.1. Training Data

Data Composition. We build the training set from the difficulty-filtered data of VideoAuto-R1 (Liu et al., 2026), keeping only image and video samples and discarding pure-text examples. To improve coverage of visually demanding subdomains, we additionally sample 16,500 video instances from Video-R1 (Feng et al., 2025), focusing on OCR, free-form QA, and regression-style tasks. The merged pool contains approximately 93.4K training samples. We manually remove all evaluation examples from our benchmark suite to avoid leakage.

A.2. Training Configuration

Unless otherwise noted, training runs for one epoch with global batch size 128 and AdamW. The learning rate is 2 × 10⁻⁵ for the Allocator and 1 × 10⁻⁶ for the backbone, with weight decay 0.01 and gradient clipping at 1.0. We set the maximum video token budget to 8,192, use T = 128 frames during training, and allow scales in the range [s_min, s_max] = [0.2, 1.8], which permits both downscaling and selective upscaling. CAPO samples M = 16 allocation trajectories per prompt and N = 1 rollout per trajectory. Training is conducted on 32 H100 GPUs with VeRL (Sheng et al., 2025), DeepSpeed (Rasley et al., 2020), and vLLM (Kwon et al., 2023). Evaluation uses lmms-eval (Zhang et al., 2024a); unless stated otherwise, we cap response length at 256 tokens and increase it to 4,096 for reasoning models.

A.3. Reward Design

We provide full details complementing Sec. 3.3.
The base scalar reward $R^{\text{task}}_{m,n}$ is task-specific; efficiency enters later through CAPO advantage shaping rather than through a raw additive reward term.

Base Task Reward ($R^{\text{task}}_{m,n}$). We consider four task types:

• Question Answering. For math problems, we extract the numeric answer and compare it to the ground truth within a tolerance of 10⁻². For multiple-choice questions, we extract the option letter. For other QA tasks, we compare normalized strings (e.g., case-folded, whitespace-stripped). This yields the binary reward $R_{\text{QA}}(\hat{o}, o) \in \{0, 1\}$.

• Free-form Generation. For open-ended tasks, we compute the ROUGE-L score between the generated answer $\hat{o}$ and the reference $o$: $R_{\text{Gen}}(\hat{o}, o) = \text{ROUGE-L}(\hat{o}, o) \in [0, 1]$.

• Temporal Grounding. Let the ground-truth segments be $\mathcal{G} = \{[s_j, e_j]\}_j$ and the predicted segments be $\hat{\mathcal{G}} = \{[\hat{s}_k, \hat{e}_k]\}_k$ (each set may contain one or multiple intervals). We compute the temporal IoU and select the best-matching pair:

$$R_{\text{TG}}(\hat{\mathcal{G}}, \mathcal{G}) = \max_{[\hat{s},\hat{e}] \in \hat{\mathcal{G}},\, [s,e] \in \mathcal{G}} \text{tIoU}\big([\hat{s}, \hat{e}], [s, e]\big) \in [0, 1].$$

If no valid segment can be parsed from the output, we assign $R_{\text{TG}}(\hat{\mathcal{G}}, \mathcal{G}) = 0$.

• Grounding QA. We parse both the textual answer and the predicted temporal segments from the model output, compute $R_{\text{QA}}(\hat{o}, o)$ and $R_{\text{TG}}(\hat{\mathcal{G}}, \mathcal{G})$, and sum them: $R_{\text{GQA}}(\hat{o}, \hat{\mathcal{G}}; o, \mathcal{G}) = R_{\text{QA}}(\hat{o}, o) + R_{\text{TG}}(\hat{\mathcal{G}}, \mathcal{G}) \in [0, 2]$.

These task-specific metrics define the scalar base reward $R^{\text{task}}_{m,n}$. CAPO additionally uses a binary success indicator $u_{m,n} \in \{0, 1\}$: for exact-match QA tasks we use the binary correctness outcome directly, whereas for continuous metrics (ROUGE-L, temporal IoU, and their grounding-QA combination) we threshold the scalar score at 0.35, matching the implementation.
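The task rewards and success indicator above can be sketched in a few lines. This is an illustrative sketch rather than the paper's implementation: the function names are ours, and ROUGE-L scoring and answer parsing are elided for brevity.

```python
# Illustrative sketch of the base task rewards and the CAPO success
# indicator u_{m,n}. Function names are ours; ROUGE-L and answer
# parsing are elided.

def r_qa_numeric(pred: float, gold: float, tol: float = 1e-2) -> int:
    """Binary math-QA reward: match within tolerance 1e-2."""
    return int(abs(pred - gold) <= tol)

def t_iou(pred, gold) -> float:
    """Temporal IoU between two [start, end] segments."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def r_tg(pred_segs, gold_segs) -> float:
    """Best-matching-pair temporal grounding reward; 0 if unparsable."""
    if not pred_segs or not gold_segs:
        return 0.0
    return max(t_iou(p, g) for p in pred_segs for g in gold_segs)

def success(score: float, exact_match: bool, thresh: float = 0.35) -> int:
    """u_{m,n}: binary correctness for exact-match QA, thresholded
    score for continuous metrics (ROUGE-L, tIoU, grounding QA)."""
    return int(score >= (1.0 if exact_match else thresh))
```

For example, `r_tg([(0, 2)], [(1, 3), (5, 6)])` selects the best-matching pair and returns a tIoU of 1/3, which `success(1/3, exact_match=False)` then maps to 0 under the 0.35 threshold.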
When format validation is enabled, a weighted format term is added before GRPO normalization, but $u_{m,n}$ is computed from the task metric alone.

Table 6: Prompt template used for CAPO training. The template presents video frames and the task question, requires intermediate reasoning inside the reasoning tags, and places the final answer in \boxed{} within the answer tags. This structure enables automatic reward extraction from MLLM outputs.

Prompt Template for Training with Thinking
System Prompt: You are a helpful assistant. You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within the reasoning tags and the answer MUST BE enclosed within the answer tags. The final answer MUST BE put in \boxed{} and the \boxed{} expression MUST BE contained entirely within the answer tags. Do not include any reasoning or explanations outside these tags.

Format Reward. We employ a binary format reward $R_{\text{fmt}}(\hat{o}) \in \{0, 1\}$ enforced via strict regex validation. The output must contain exactly one reasoning block and one answer block, with the final answer enclosed in \boxed{...} within the answer tags:

$$R_{\text{fmt}}(\hat{o}) = \begin{cases} 1 & \text{if format matches regex,} \\ 0 & \text{otherwise.} \end{cases}$$

In the implementation, malformed outputs receive a penalty before weighting, and the format term enters the scalar reward with weight 0.2.

A.4. Prompt Template

We employ the standard prompt for GRPO training, shown in Table 6. The model generates a reasoning trace within the reasoning tags (optional for ResAdapt since reasoning is handled by the MLLM $\pi_\phi$, but maintained for compatibility with reasoning-based baselines), followed by the final answer enclosed in \boxed{}.

B. Complexity Analysis

We derive formal computational bounds for ResAdapt to clarify when Allocator overhead is negligible relative to the savings induced in the backbone.
For readability, we assume a standard Transformer backbone with quadratic self-attention and a uniform native resolution H × W over T frames; the extension to heterogeneous resolutions is immediate by replacing HW with per-frame products $H_t W_t$.

Baseline cost. Let P denote the ViT patch size. A vanilla MLLM encoding T frames at full resolution incurs a total visual token count of:

$$N_0 = T \cdot \left\lceil \tfrac{H}{P} \right\rceil \left\lceil \tfrac{W}{P} \right\rceil \approx \frac{THW}{P^2}. \tag{26}$$

Adaptive cost and token retention ratio. For the resize instantiation analyzed in this paper, frame $f_t$ is rescaled by factor $s_t \in [s_{\min}, s_{\max}]$, producing $n_t(s_t) = \lceil s_t H / P \rceil \lceil s_t W / P \rceil \approx s_t^2 \cdot HW/P^2$ tokens. Summing over the sequence and normalizing by $N_0$ yields the token retention ratio:

$$N_{\text{adapt}} = \sum_{t=1}^{T} n_t(s_t) \approx \frac{HW}{P^2} \sum_{t=1}^{T} s_t^2, \qquad \rho \triangleq \frac{N_{\text{adapt}}}{N_0} = \frac{1}{T} \sum_{t=1}^{T} s_t^2. \tag{27}$$

Because the learned Beta policy places most redundant frames near $s_{\min}$ (Figure 5), ρ is much smaller than 1 in practice; across our evaluation suite, $\rho \in [0.06, 0.16]$.

Quadratic FLOPs reduction. For an $L_{\text{mllm}}$-layer MLLM with hidden dimension $D_{\text{mllm}}$, self-attention cost scales quadratically in the visual sequence length: $\Phi(N) = O(L_{\text{mllm}} N^2 D_{\text{mllm}})$. Substituting $N_{\text{adapt}} = \rho \cdot N_0$ gives:

$$\Phi^{\text{adapt}}_{\text{mllm}} = O\!\left(L_{\text{mllm}} \cdot \rho^2 N_0^2 \cdot D_{\text{mllm}}\right), \tag{28}$$

a reduction by a factor of $\rho^2$ relative to full-resolution processing. At the representative operating point ρ = 0.11, we obtain $\rho^2 \approx 0.012$, corresponding to roughly 83× fewer backbone attention FLOPs.

Allocator overhead. The Allocator processes $N_c = T \cdot \lceil H/P_c \rceil \lceil W/P_c \rceil$ coarsely pooled tokens across $L_{\text{pred}}$ layers with dimension $D_{\text{pred}}$, where $P_c \gg P$ is the coarse spatial stride. Its cost and relative overhead are:

$$\Phi_{\text{pred}} = O\!\left(L_{\text{pred}} \cdot N_c^2 \cdot D_{\text{pred}}\right), \qquad \frac{\Phi_{\text{pred}}}{\Phi^{\text{base}}_{\text{mllm}}} = O\!\left(\frac{L_{\text{pred}} D_{\text{pred}}}{L_{\text{mllm}} D_{\text{mllm}}} \cdot \left(\frac{P}{P_c}\right)^{4}\right) \ll 1. \tag{29}$$

Substituting our implementation parameters ($P_c = 14$, $L_{\text{pred}} = 4$, $D_{\text{pred}} = 1{,}024$ versus $L_{\text{mllm}} = 28$, $D_{\text{mllm}} = 3{,}584$), the Allocator accounts for less than 3% of inference FLOPs. The decision stage is therefore small compared with the backbone computation it helps eliminate.

Net speedup. Combining the above under the first-order approximation $\Phi^{\text{base}}_{\text{mllm}} \gg \Phi_{\text{pred}}$:

$$\text{Speedup} \approx \frac{\Phi^{\text{base}}_{\text{mllm}}}{\Phi^{\text{adapt}}_{\text{mllm}} + \Phi_{\text{pred}}} \approx \frac{N_0^2}{N_{\text{adapt}}^2} = \frac{1}{\rho^2}. \tag{30}$$

At ρ = 0.11, this again yields a theoretical reduction of roughly 83× in backbone attention computation.

Temporal context scaling. The same savings admit a second interpretation in terms of temporal coverage. Under a fixed token budget B, a vanilla MLLM can process only $T_0 = B P^2 / (HW)$ full-resolution frames, whereas the resize instantiation of ResAdapt used in our experiments can process $T_0 / \rho$ adaptively resized frames. This yields an effective $1/\rho \approx$ 6–16× increase in temporal horizon at comparable compute, which is exactly the trade-off exploited by the long-context experiments in Sec. 4.2.

Remark (acceleration transparency). A practical consequence of input-side adaptation is that the backbone still receives an ordinary visual-token sequence, only shorter. As a result, ResAdapt remains compatible with optimized attention stacks such as FlashAttention, vLLM (Kwon et al., 2023), and SGLang (Zheng et al., 2024) without kernel-level modifications. By contrast, model-side pruning and merging often create irregular token layouts that are harder to route through the same optimized kernels and may require fallback implementations or architecture-specific engineering.

C. Derivation of Joint RL Formulation

This appendix collects derivations omitted from Sec. 3 for space and clarifies how the one-step contextual MDP (contextual bandit) introduced in Sec.
2.2 motivates the practical surrogate objectives optimized by ResAdapt. All derivations are stated for a single context (video and query); the full objective is the expectation over the dataset $\mathcal{D}$.

Notation. The prompt context is $x = (q, V)$. The Allocator first samples latent actions $a$ from a Beta policy $q_\theta(a \mid x)$ (Sec. 3.2); the continuous allocation $s$ is the deterministic image of $a$ under Eq. (10), and $\pi_\theta(s \mid x)$ denotes the induced density (pushforward). A deterministic transformation constructs the operator-transformed input $\tilde{x} = (q, \{O(f_t, s_t)\}_{t=1}^{T})$; in the experimental instantiation, $O$ is bilinear resizing. The MLLM backbone policy $\pi_\phi(y \mid \tilde{x})$ then samples a full response rollout $y = (r, o)$, where $r$ is the reasoning trace and $o$ is the final answer.

C.1. One-Step Contextual MDP and the Joint Objective

As defined in Sec. 2.2, the system is a one-step contextual MDP. In this setting, there are no sequential state transitions across time steps $t$; the episode terminates after the allocation $s$ is sampled and the corresponding rollout $y$ is produced. Consequently, the value functions collapse to the immediate reward, and the standard Policy Gradient Theorem simplifies drastically without requiring temporal discount factors or credit assignment across Markov states. The joint distribution of the allocation and the rollout factorizes conditionally:

$$p_{\theta,\phi}(s, y \mid x) = \pi_\theta(s \mid x)\, \pi_\phi(y \mid \tilde{x}). \tag{31}$$

For a single context with ground-truth answer $o^\star$, the marginal answer probability under the transformed input is

$$p_{\theta,\phi}(o^\star \mid x) = \mathbb{E}_{\pi_\theta(s \mid x)}\, \mathbb{E}_{\pi_\phi(r \mid \tilde{x})}\, \big[\pi_\phi(o^\star \mid \tilde{x}, r)\big]. \tag{32}$$

Equation (32) is the law of total expectation under an autoregressive factorization $\pi_\phi(y \mid \tilde{x}) = \pi_\phi(r \mid \tilde{x})\, \pi_\phi(o \mid \tilde{x}, r)$: the inner term is the conditional probability (mass or density) of the ground-truth answer $o^\star$ given the prefix $r$.
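To make the sampling pipeline concrete before the derivation continues, here is a small sketch of allocation sampling and its downstream cost effect. It assumes the Eq. (10) pushforward is the affine map onto $[s_{\min}, s_{\max}] = [0.2, 1.8]$ from Appendix A.2 (consistent with the Jacobian-cancellation remark in Appendix C.2); the helper names are ours.

```python
import random

S_MIN, S_MAX = 0.2, 1.8  # scale range from Appendix A.2

def sample_allocation(alpha, beta, seed=0):
    """Sample latent actions a_t ~ Beta(alpha_t, beta_t) and push them
    through an (assumed affine) map to scales s_t in [S_MIN, S_MAX]."""
    rng = random.Random(seed)
    a = [rng.betavariate(al, be) for al, be in zip(alpha, beta)]
    return [S_MIN + (S_MAX - S_MIN) * a_t for a_t in a]

def retention_ratio(scales):
    """rho = (1/T) * sum_t s_t^2 (Eq. 27): the fraction of
    full-resolution visual tokens the backbone actually receives."""
    return sum(s * s for s in scales) / len(scales)

# A clip with many redundant frames and a few high-budget ones.
scales = [0.2] * 28 + [1.0] * 4
rho = retention_ratio(scales)   # 0.16
speedup = 1.0 / rho ** 2        # first-order 1/rho^2 of Eq. (30)
```

With 28 of 32 frames near $s_{\min}$, $\rho = 0.16$ and the first-order attention speedup is about 39×, matching the upper end of the $\rho \in [0.06, 0.16]$ range reported in Appendix B.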
Summing/integrating over $r$ yields the marginal $P(o^\star \mid x)$ only under this generative ordering; the subsequent RL objective does not require Eq. (32) to hold in closed form. Because $\log(\cdot)$ is monotone, maximizing $\log p_{\theta,\phi}(o^\star \mid x)$ would be equivalent, but the RL derivation below does not require introducing the logarithm. It only requires a scalar utility evaluated after sampling $(s, y)$. We therefore abstract the answer-quality term as a rollout utility $Q(x, y)$, where $y = (r, o)$, and treat it as parameter-independent once the rollout is sampled. This is a modeling abstraction rather than an exact reformulation: when $Q$ is chosen as an answer-aligned task score, the resulting RL problem is a surrogate to likelihood maximization. This lets us define the ideal rollout reward

$$R^{\text{ideal}}_{s,y} = Q(x, y) - \lambda\, C(s), \tag{33}$$

and optimize the one-step expected return

$$\max_{\theta,\phi}\; J(\theta, \phi) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{\pi_\theta(s \mid x)}\, \mathbb{E}_{\pi_\phi(y \mid \tilde{x})}\, \big[R^{\text{ideal}}_{s,y}\big]. \tag{34}$$

C.2. Policy Gradient and Alternating Optimization

Because the objective involves two distinct parameterized policies, its gradients follow the score-function estimator (the likelihood-ratio / REINFORCE identity). This is the underlying policy-gradient structure; GRPO/PPO does not change that structure, but replaces the raw reward with normalized advantages and clipped surrogates for practical optimization. Taking the gradient of $J(\theta, \phi)$ with respect to the backbone parameters $\phi$:

$$\nabla_\phi J(\theta, \phi) = \mathbb{E}_{x}\, \mathbb{E}_{\pi_\theta(s \mid x)} \int \nabla_\phi \pi_\phi(y \mid \tilde{x})\, R^{\text{ideal}}_{s,y}\, \mathrm{d}y = \mathbb{E}_{x}\, \mathbb{E}_{\pi_\theta(s \mid x)}\, \mathbb{E}_{\pi_\phi(y \mid \tilde{x})}\, \big[R^{\text{ideal}}_{s,y}\, \nabla_\phi \log \pi_\phi(y \mid \tilde{x})\big]. \tag{35}$$

Similarly, the gradient with respect to the Allocator parameters $\theta$ relies on the marginalized reward $\bar{R}^{\text{ideal}}_{s} = \mathbb{E}_{\pi_\phi(y \mid \tilde{x})}\big[R^{\text{ideal}}_{s,y}\big]$:

$$\nabla_\theta J(\theta, \phi) = \mathbb{E}_{x}\, \mathbb{E}_{\pi_\theta(s \mid x)}\, \big[\bar{R}^{\text{ideal}}_{s}\, \nabla_\theta \log \pi_\theta(s \mid x)\big]. \tag{36}$$

To optimize this objective with GRPO/PPO, we introduce importance sampling from behavior policies $\pi_{\theta_{\text{old}}}$ and $\pi_{\phi_{\text{old}}}$. A naive joint importance weight $\frac{\pi_\theta \pi_\phi}{\pi_{\theta_{\text{old}}} \pi_{\phi_{\text{old}}}}$ suffers from compounded variance. We therefore use an alternating block-coordinate ascent approximation. When updating the MLLM ($\phi$), we fix the Allocator to its behavior policy ($\pi_\theta = \pi_{\theta_{\text{old}}}$), making its importance ratio exactly 1. The off-policy surrogate gradient for $\phi$ becomes:

$$\nabla_\phi J^{\text{surr}}(\phi) = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\, \mathbb{E}_{\pi_{\phi_{\text{old}}}} \left[ \frac{\pi_\phi(y \mid \tilde{x})}{\pi_{\phi_{\text{old}}}(y \mid \tilde{x})}\, R^{\text{ideal}}_{s,y}\, \nabla_\phi \log \pi_\phi(y \mid \tilde{x}) \right]. \tag{37}$$

Using the log-derivative identity $\nabla_\phi r_\phi = r_\phi \nabla_\phi \log \pi_\phi$, where $r_\phi = \pi_\phi / \pi_{\phi_{\text{old}}}$, this motivates the surrogate objective:

$$\mathcal{L}^{\text{ideal}}_{\phi} = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\, \mathbb{E}_{\pi_{\phi_{\text{old}}}} \big[ r_\phi(y \mid \tilde{x})\, R^{\text{ideal}}_{s,y} \big]. \tag{38}$$

Policy-gradient ascent on $\phi$ increases $\mathcal{L}^{\text{ideal}}_{\phi}$ (equivalently, training minimizes its negative); Sec. 3.4 implements the clipped PPO surrogate with advantages in place of $R^{\text{ideal}}$. Conversely, when updating the Allocator ($\theta$), we fix the backbone to its behavior policy ($\pi_\phi = \pi_{\phi_{\text{old}}}$). The corresponding ideal allocator surrogate is

$$\mathcal{L}^{\text{ideal}}_{\theta} = \mathbb{E}_{\pi_{\theta_{\text{old}}}} \big[ r_\theta(s \mid x)\, \bar{R}^{\text{ideal}}_{s} \big], \qquad r_\theta(s \mid x) = \frac{\pi_\theta(s \mid x)}{\pi_{\theta_{\text{old}}}(s \mid x)}, \tag{39}$$

where $\bar{R}^{\text{ideal}}_{s} = \mathbb{E}_{\pi_{\phi_{\text{old}}}(y \mid \tilde{x})}\big[R^{\text{ideal}}_{s,y}\big]$. In practice, this expectation is approximated by Monte Carlo rollouts under the frozen backbone.

Sequential allocator–backbone updates within one iteration. The alternating derivation above fixes one policy while updating the other, so the importance ratio for the inactive policy is unity.
In implementations that first update the Allocator from $\theta_{\text{old}}$ to $\theta'$ and then update the MLLM on the same rollout batch, trajectories are still drawn from the behavior pair $(\theta_{\text{old}}, \phi_{\text{old}})$ while the MLLM gradient is evaluated under $\phi$ at fixed $(x, a, y)$. The change in the marginal over allocations between behavior and the post-allocator policy is corrected by the importance weight $\rho_\theta = q_{\theta'}(a \mid x) / q_{\theta_{\text{old}}}(a \mid x) = \pi_{\theta'}(s \mid x) / \pi_{\theta_{\text{old}}}(s \mid x)$ (affine map, so the Jacobian factor cancels in the ratio). Multiplying rollout-level advantages by $\rho_\theta$ before the token-level PPO surrogate for $\phi$ implements the standard importance-sampling correction when reusing trajectories drawn under $q_{\theta_{\text{old}}}$ while evaluating gradients at $q_{\theta'}$ (support and clipping caveats as in ordinary PPO); this matches the practical "ispred" path in the codebase.

Figure 8: Per-video mean scale across benchmarks. Kernel density estimates of the per-video mean scale $\bar{s}$ on VideoMME, LongVideoBench, MMVU, MMMU-P, MMMU-C, MMMU-A, and LVBench. Reasoning-heavy benchmarks shift toward larger $\bar{s}$ than perception-heavy ones, indicating that the learned policy spends more fidelity where fine-grained evidence is more likely to matter.

C.3. Advantage Shaping and Monte Carlo Surrogates

The ideal linear penalty $-\lambda C(s)$ inside $R^{\text{ideal}}$ often causes catastrophic collapse to minimum budgets. CAPO therefore replaces the raw reward with a cost-shaped, group-normalized advantage $A_{s,y}$ (denoted $A_{m,n}$ in the main text). This replacement is not an unbiased baseline transformation of $R^{\text{ideal}}_{s,y}$: the CAPO signal depends on the sampled allocation, the rollout outcome, and the within-group cost statistics.
Instead, it defines a deliberately biased surrogate objective that trades exact fidelity to the Lagrangian reward for lower variance and stronger budget control in practice. Applying PPO clipping to the exact joint ratios would couple all frame- and token-level factors, which is prohibitively noisy in practice. We therefore arrive at practical decoupled objectives. For a batch of M allocations and N rollouts per allocation, the MLLM sequence-level surrogate is:

$$\mathcal{L}^{\text{seq}}_{\phi} = -\frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \min\!\Big( r^{(m,n)}_{\phi}\, A_{m,n},\; \operatorname{clip}\big(r^{(m,n)}_{\phi},\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_{m,n} \Big). \tag{40}$$

This sequence-level loss is already approximate because it uses the CAPO-shaped advantage in place of the ideal reward. To achieve finer credit assignment for the autoregressive MLLM, we further factorize $\pi_\phi(y \mid \tilde{x})$ into token-level probabilities, distribute the same rollout-level advantage $A_{m,n}$ to all tokens, and average over the sequence length $L_{m,n}$. Equation (24) should therefore be read as the standard token-level PPO approximation to this sequence-level surrogate, not as an exact decomposition of the clipped joint ratio.

Figure 9: VideoMME broken down by video duration. (a) Scale distribution, (b) intra-video scale standard deviation, and (c) accuracy for short, medium, and long clips (n = 900 each; accuracies 67.8%, 55.3%, 50.8%). As clip duration grows, the policy lowers the average scale, increases within-video scale diversity, and faces lower task accuracy. Longer clips are therefore processed more aggressively and more selectively.
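As a minimal sketch, the clipped sequence-level surrogate of Eq. (40) reduces to a few lines once the ratios and CAPO-shaped advantages are given. The function name and the example $\varepsilon$ are ours, not the paper's settings.

```python
def capo_sequence_loss(ratios, advantages, eps=0.2):
    """Clipped PPO-style surrogate of Eq. (40): ratios[m][n] is the
    sequence-level importance ratio r_phi^{(m,n)} and advantages[m][n]
    is the CAPO-shaped advantage A_{m,n}; eps is illustrative."""
    total, count = 0.0, 0
    for r_row, a_row in zip(ratios, advantages):
        for r, a in zip(r_row, a_row):
            clipped = min(max(r, 1.0 - eps), 1.0 + eps)
            total += min(r * a, clipped * a)
            count += 1
    return -total / count

# One allocation (M=1) with two rollouts (N=2): the out-of-range ratio
# 1.3 is clipped to 1.2, and with a negative advantage the min keeps
# the more pessimistic unclipped term.
loss = capo_sequence_loss([[1.0, 1.3]], [[0.5, -0.4]])
```

The token-level approximation discussed above would broadcast each $A_{m,n}$ to every token of rollout $(m, n)$ and average over the sequence length, leaving this outer structure unchanged.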
Conversely, when updating the Allocator (θ), we fix the MLLM (π_ϕ = π_{ϕ_old}) and use the aggregated advantage A_CAPO^{(m)} = (1/N) Σ_n A_{m,n}. Because the Allocator’s output distribution factorizes conditionally across frames (Eq. 11), its score function decomposes additively:

∇_θ log π_θ(s^{(m)} | x) = Σ_{t=1}^{T} ∇_θ log Beta(a_t^{(m)}; α_t, β_t).    (41)

This additive log-probability structure supports low-variance frame-level credit assignment. Nevertheless, Eq. (21) remains a practical approximation to a trajectory-level clipped objective: conditional independence justifies the decomposition of log π_θ, but not exact factorization of the nonlinear PPO clipping term. We use the per-frame surrogate because it is substantially more stable in large-scale training.

D. Supplementary Experiments and Analysis

This section first analyzes the learned allocation policy, then studies the two key ablation axes, and finally reports representative qualitative cases and a boundary-case transfer test beyond video. Unless otherwise noted, all plots use Qwen2.5-VL-7B with 32 uniformly sampled frames.

D.1. Behavioral Analysis of the Learned Policy

D.1.1. Benchmark-Level Budget Allocation

Figure 8 shows a clear benchmark-level ordering even though the policy never observes benchmark labels during training. Averaged across datasets, reasoning-oriented tasks use slightly higher mean scales than perception-oriented ones (0.435 vs. 0.417), with MMMU-Adaptation at the high end and VideoMME at the low end. The pattern is consistent with the main claim of the paper: the policy is not enforcing a fixed compression rule, but adapting its operating point to the expected visual difficulty of the task family.
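For completeness, the additive frame-level log-probability in Eq. (41) above amounts to summing independent Beta log-densities over frames. A self-contained stdlib sketch (function names are ours, purely illustrative):

```python
import math

def beta_logpdf(x, alpha, beta):
    """log Beta(x; alpha, beta) for x in (0, 1)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x) - log_norm

def allocation_logprob(frame_scales, alphas, betas):
    """Eq. (41): the conditional factorization across frames (Eq. 11) makes
    log pi_theta(s | x) a plain sum of per-frame Beta log-densities."""
    return sum(beta_logpdf(s, a, b)
               for s, a, b in zip(frame_scales, alphas, betas))
```

This sum is exactly the quantity whose gradient Eq. (41) decomposes, and also the log-probability entering the importance weight ρ_θ discussed in C.2.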
Figure 10: Scale allocation by VideoMME task category. Mean s̄ varies substantially across categories (Artistic Performance, Life Record, Film & Television, Knowledge, Multilingual, Sports Competition; accuracies 58%, 53%, 61%, 60%, 53%, 59%), with larger budgets assigned to categories that contain crowded motion or finer local evidence. The accuracy annotations show that allocation is not a trivial proxy for which category is easiest.

Figure 11: Selectivity versus prediction correctness on three representative benchmarks. Per-video Gini coefficients of the frame-level scales (VideoMME: p = 2.762e-01, n = 1565 correct / 1135 incorrect; MMMU-P: p = 2.979e-02, n = 198 / 102; LongVideoBench: p = 1.046e-01, n = 740 / 597). Correct predictions tend to have higher Gini than incorrect ones, linking success to sharper concentration of resolution rather than merely larger average budgets.

D.1.2. Long-Context and Semantic Structure

Figure 9 is consistent with the long-context gains in the main paper. From short to long clips, the mean scale drops (0.342 → 0.336 → 0.332), but the within-video diversity rises (0.085 → ~0.095). In other words, the policy does not merely compress longer videos more; it also becomes more selective inside them, which is exactly the regime where uniform resizing is least satisfactory. Figure 10 refines the same story within a single benchmark.
The policy spends the most budget on Sports Competition and the least on Artistic Performance, suggesting that even within VideoMME it distinguishes categories that are dense and spatially demanding from those that are visually simpler. This complements the main benchmark tables: the appendix focuses on why retained budgets differ, while the main text already reports the exact realized retention ratios.

Figure 12: Sample-level robustness at 25% retention. Correctness-transition distributions (Vanilla → ResAdapt; CC / WC / CW / WW) and the error-correction vs. error-introduction trade-off across seven benchmarks. Most originally correct predictions remain correct, but corrected and newly introduced errors are of comparable magnitude. Adaptive allocation is therefore selective rather than lossless.

D.1.3. Selectivity and Success

We next ask whether successful samples allocate budget more selectively within a clip. We quantify frame-level selectivity with the Gini coefficient of the predicted scales. High Gini means the policy concentrates budget on a small subset of frames; low Gini means the allocation is nearly uniform. Figure 11 shows that correct predictions consistently lie in the more selective regime, with the clearest separation on MMMU-P.
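The per-video selectivity statistic just described can be computed directly from the predicted frame scales. A minimal sketch of the standard Gini computation (naming is ours, not the released code):

```python
import numpy as np

def gini(scales):
    """Gini coefficient of a video's predicted frame scales.

    0 means a perfectly uniform allocation; values near 1 mean the
    budget is concentrated on a small subset of frames.
    """
    s = np.sort(np.asarray(scales, dtype=float))
    n = s.size
    total = s.sum()
    if n == 0 or total == 0:
        return 0.0
    idx = np.arange(1, n + 1)  # ranks of the sorted scales
    return float(((2 * idx - n - 1) * s).sum() / (n * total))
```

For instance, a uniform allocation such as `[0.4, 0.4, 0.4, 0.4]` scores 0, while pushing all budget onto one frame approaches 1 − 1/n.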
This sharpens the mechanism claim of the appendix: success is associated not merely with keeping more pixels overall, but with concentrating them onto the frames that matter.

Robustness and failure modes. A final question is whether adaptive compression preserves existing correct answers or merely swaps one error pattern for another. Figure 12 provides the right robustness interpretation for aggressive compression. Prediction stability remains high overall (about 89% of originally correct samples stay correct in the aggregate summary), so the policy is not helping only by randomly perturbing the answer distribution. However, error correction and error introduction are close enough that the effect should be read as selective redistribution: the policy repairs some failures, but it can also lose fine-grained evidence, especially when the decisive cue is brief or visually simple.

D.2. Ablation Studies

D.2.1. Temporal Similarity Ablation

We provide two complementary views of the temporal-similarity ablation: a cross-benchmark summary showing that the effect generalizes, and a single-benchmark diagnostic panel showing exactly how the allocation pattern changes. Figure 13 makes the role of L_sim unusually clear. Without it, the policy collapses to near-uniform scales on every benchmark (σ < 0.003); with it, the same model family recovers substantial within-video variation, with 4×–693× larger diversity depending on the benchmark. CAPO therefore controls where the global budget should sit, whereas L_sim prevents the trivial fixed-scale solution.
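The diversity statistic used in this ablation is simply the per-video standard deviation of frame scales, and the reported multipliers are ratios of that statistic with and without the regularizer. A minimal sketch under our own naming (the flooring constant is an assumption to handle fully collapsed baselines):

```python
import statistics

def intra_video_std(scales):
    """Population standard deviation of a video's frame scales.

    Values below ~0.003 correspond to the collapsed near-uniform
    policy observed without the temporal-similarity regularizer.
    """
    return statistics.pstdev(scales)

def diversity_gain(std_with, std_without, floor=1e-4):
    """Fold-increase in diversity when L_sim is enabled; the floor
    avoids dividing by a numerically zero collapsed baseline."""
    return std_with / max(std_without, floor)
```

On LongVideoBench, for example, `diversity_gain(0.133, 0.033)` reproduces the roughly 4× figure quoted above.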
Figure 13: Cross-benchmark scale diversity with and without L_sim. Per-video scale standard deviation σ across five benchmarks (VideoMME 0.002 → 0.091, 43×; LongVideoBench 0.033 → 0.133, 4×; MMVU 0.045 → 0.091, 2×; MMMU-P 0.000 → 0.095, 693×; LVBench 0.000 → 0.092, 561×). Without the regularizer, diversity collapses toward zero; adding L_sim restores broad within-video variation on every benchmark.

Quantitative confirmation. Figure 14 shows that this is not an artifact of any single statistic. The regularizer changes the global histogram, the per-video range, the frame-to-frame variation, and the Gini coefficient in the same direction, confirming that the benefit is structural rather than metric-specific.

D.2.2. Reward Design Ablation

We next examine whether different reward designs preserve a non-degenerate adaptive regime during training. All plots use EMA smoothing to suppress per-step noise; raw values remain visible as translucent traces.

Per-sample scale adaptivity. Figure 15 complements Figure 6 by measuring the per-sample scale range s_max − s_min rather than the mean. CAPO preserves non-trivial adaptivity on validation, whereas direct cost collapses to the lower boundary and cost-free optimization drifts toward a nearly uniform high-scale policy.

Convergence and stability. Figure 16 explains why CAPO works and the simpler baselines do not. The CAPO variants converge to stable interior solutions, whereas accuracy-only training saturates near s_max and direct cost collapses to s_min. This is consistent with CAPO’s intended role: balancing task reward and budget pressure without falling into either trivial boundary solution.
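The smoothing used in these training curves is straightforward to reproduce. A sketch of EMA smoothing and the rolling standard deviation shown in the stability panels (the smoothing factor α is an assumed value, not the paper's exact setting):

```python
def ema(values, alpha=0.1):
    """Exponential moving average used to smooth noisy per-step curves."""
    out, m = [], None
    for v in values:
        m = v if m is None else alpha * v + (1 - alpha) * m
        out.append(m)
    return out

def rolling_std(values, w=5):
    """Rolling population std over a window of w steps, matching the
    'Rolling Std of s (w=5)' stability curves."""
    out = []
    for i in range(len(values)):
        win = values[max(0, i - w + 1): i + 1]
        mu = sum(win) / len(win)
        out.append((sum((v - mu) ** 2 for v in win) / len(win)) ** 0.5)
    return out
```

A constant series is a useful sanity check: its EMA is unchanged and its rolling std is identically zero, so any non-zero stability curve reflects genuine per-step variation.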
The key result is therefore not merely convergence, but convergence to a non-degenerate operating point where content-adaptive allocation is still available.

D.3. Qualitative Case Studies

We present four representative case studies that complement the aggregate analysis above: two task-contrast examples from Video-MMMU, one evidence-localization success from VideoMME, and one failure case. Each visualization (Figures 17–20) renders 32 uniformly sampled frames at their assigned scale inside a fixed grid; warmer borders indicate larger predicted scales.

Figure 14: Four diagnostics of the L_sim ablation on VideoMME: (a) frame-level scale distribution (w/o: μ = 0.376, σ = 0.176; w/: μ = 0.337, σ = 0.169), (b) intra-video scale range (median 0.000 → 0.361), (c) frame-to-frame scale variation (median 0.0000 → 0.0479), and (d) scale selectivity as Gini coefficient (median 0.0000 → 0.0992). With the regularizer, the frame-scale histogram becomes bimodal, the per-video range expands, adjacent-frame variation increases, and the Gini coefficient rises. The policy moves from near-uniform allocation to a genuinely selective regime.

Task-dependent operating regimes. Figures 17 and 18 contrast two Video-MMMU tasks from a visually similar educational domain that nevertheless demand very different allocations.
In the comprehension example, the relevant evidence is concentrated in a small set of diagram-bearing slides, so the policy adopts a sparse operating regime and suppresses the explicitly irrelevant quiz frame. In the adaptation example, the downstream reasoning depends on reading a dense numeric table, so the same policy shifts to a much higher-budget regime and preserves high fidelity much more broadly. The contrast shows that the policy responds to what the task will require, not just to generic visual clutter.

Evidence localization and failure. The VideoMME success case in Figure 19 shows a more local version of the same phenomenon: the answer depends on short text overlays embedded in otherwise repetitive footage, and the policy magnifies only those evidence-bearing moments. Figure 20 shows the failure mode that remains. The decisive cue is temporally brief and visually simple, so the policy enlarges a nearby frame but compresses the frame that actually contains the fork. This diagnosis matches the quantitative robustness analysis: ResAdapt is strong at concentrating budget, but still vulnerable when the decisive evidence is both subtle and short-lived.

Figure 15: Per-sample scale adaptivity under different reward designs. Scale range s_max − s_min over training on (a) training and (b) validation splits. CAPO keeps a non-trivial adaptive range, whereas direct cost collapses and cost-free training saturates.

Figure 16: Validation-time convergence under different reward designs. Rolling mean and rolling std of s (w = 5) on validation. CAPO variants converge to stable intermediate operating points, while cost-free training saturates at the upper boundary and direct cost collapses to the lower boundary. Stability alone is not sufficient; the key is where the policy stabilizes.

Summary. Together, these case studies support the same three conclusions as the quantitative appendix: the policy changes its operating regime with the task, concentrates fidelity on evidence-bearing frames, and fails in interpretable ways when subtle cues are missed. The qualitative examples therefore reinforce the claim that ResAdapt learns a meaningful input-allocation strategy rather than a fixed compression heuristic.

D.4. Boundary-Case Transfer Beyond Video

The paper’s main claims target video QA and temporal grounding, so we place image transfer at the end of the appendix as a boundary-case analysis rather than as supporting evidence for the main contribution. Table 7 is still informative: the learned video policy sometimes identifies image inputs that warrant additional fidelity, as in ChartQA, but it does not yet yield reliable efficiency-preserving transfer on text-dense image tasks. The result is therefore best read as scope clarification. It suggests that input-side adaptation is broader than the resize-on-video setting studied here, while also showing that a video-trained policy should not be assumed to transfer cleanly to static images.

Table 7: Exploratory zero-shot transfer to image benchmarks.
Parenthetical values denote per-task retention ratio R, and ResAdapt-RL additionally fine-tunes the MLLM via RL.

Model | MathVista (testmini) | MMMU (val) | OCRBench | ChartQA | AI2D | TextVQA (val)
Qwen2.5-VL-7B | 49.1 (100%) | 50.9 (100%) | 84.2 (100%) | 83.9 (100%) | 82.5 (100%) | 82.9 (100%)
Random Drop | 44.8 (50%) | 49.0 (50%) | 74.8 (50%) | 71.6 (50%) | 80.3 (50%) | 78.1 (50%)
ToMe (Bolya et al., 2022) | 46.2 (50%) | 49.6 (50%) | 79.3 (50%) | 78.1 (50%) | 81.9 (50%) | 81.2 (50%)
VisionZip (Yang et al., 2025c) | 47.2 (50%) | 48.6 (50%) | 79.6 (50%) | 77.9 (50%) | 81.9 (50%) | 81.3 (50%)
ResAdapt (Qwen2.5-VL-7B) | 45.5 (42%) | 51.0 (29%) | 80.0 (64%) | 85.9 (105%) | 81.4 (41%) | 69.6 (30%)
ResAdapt-RL (Qwen2.5-VL-7B) | 46.7 (42%) | 50.9 (29%) | 80.8 (64%) | 86.6 (105%) | 81.1 (41%) | 70.1 (30%)
Qwen3-VL-8B | 56.1 (100%) | 53.4 (100%) | 85.0 (100%) | 84.0 (100%) | 83.5 (100%) | 82.1 (100%)
Random Drop | 47.3 (50%) | 48.7 (50%) | 62.9 (50%) | 70.2 (50%) | 79.7 (50%) | 76.6 (50%)
VisionZip (Yang et al., 2025c) | 47.8 (50%) | 50.3 (50%) | 70.5 (50%) | 75.0 (50%) | 80.5 (50%) | 79.3 (50%)
ToMe (Bolya et al., 2022) | 49.6 (50%) | 50.6 (50%) | 70.3 (50%) | 75.2 (50%) | 80.5 (50%) | 79.4 (50%)
ResAdapt (Qwen3-VL-8B) | 52.5 (42%) | 50.9 (29%) | 82.7 (64%) | 83.2 (105%) | 81.2 (41%) | 67.8 (30%)

Q: Evaluate five statements about Urban Geography City Models (concentric zone, Hoyt sector, multiple nuclei, galactic, Latin American); identify which are correct. Please ignore the Quiz question in the last frame of the video.

Figure 17: Case 1: Video-MMMU Comprehension (Hu et al., 2025) (Vanilla ✗ → ResAdapt ✓). The policy concentrates resolution on diagram-bearing slide frames, compresses lecturer-only frames, and suppresses the final quiz frame that the prompt explicitly marks as irrelevant.

Q: Watch and learn the video content.
Then apply what you learned to answer: Table 11.47 provides a survey of the youngest online entrepreneurs (ages 17–30) whose net worth ≥ $1M. We want to know whether ages and net worth are independent. χ² test statistic = ______

Figure 18: Case 2: Video-MMMU Adaptation (Hu et al., 2025) (Vanilla ✗ → ResAdapt ✓). When the answer depends on reading a numeric table and performing a χ² computation, the policy keeps a much higher global budget and strongly upscales the table-bearing frames.

Q: When is the zodiacal light visible from the video? (A) Mar. 19, (B) Mar. 24, (C) Mar. 25, (D) Mar. 29.

Figure 19: Case 3: VideoMME (Fu et al., 2025a) (Vanilla ✗ → ResAdapt ✓). Frames containing the decisive date overlays are enlarged, while the largely homogeneous sky footage is compressed. The policy spends budget on answer-bearing evidence rather than on the surrounding context.

Q: Which item does the man throw into the trash at the beginning of the video? (A) A fork, (B) A pair of chopsticks, (C) A box of noodles, (D) A spoon.

Figure 20: Case 4: VideoMME (Fu et al., 2025a) (Vanilla ✓ → ResAdapt ✗; failure case). A nearby frame is enlarged, but the actual fork-bearing frame is compressed. The decisive fine detail is therefore lost at exactly the wrong moment.