Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Athos Georgiou
Independent Researcher
athos.georgiou@nca-it.com

March 2026

Abstract

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality, with byte-identical outputs in 100% of 10,500 greedy and stochastic samples and max |∆ANLS| = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads (§7). An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
1 Introduction

Document AI systems must solve two fundamentally different tasks: retrieval (finding relevant pages given a query) and understanding (extracting and interpreting information within those pages). Modern approaches address these with separate models: a retrieval model such as ColPali [Faysse et al., 2025] or ColQwen2 [ColPali Team, 2024] for page-level retrieval via ColBERT-style late interaction [Khattab and Zaharia, 2020, Santhanam et al., 2022], and a generative VLM such as Qwen2.5-VL [Bai et al., 2025] for document understanding.

This dual-model paradigm is wasteful. Both models share a common backbone architecture (a vision-language transformer), yet they must be loaded independently, doubling GPU memory requirements and complicating deployment. The waste is particularly stark: ColPali-family models are fine-tuned VLMs. ColPali, ColQwen2, and ColQwen3.5 [Georgiou, 2026] all begin from a pretrained VLM, add a linear projection head (custom_text_proj) for 128- or 320-dimensional multi-vector embeddings, and fine-tune with contrastive loss. This fine-tuning modifies the model's attention patterns (to bidirectional) and internal representations, sacrificing autoregressive generation, though the capability remains latent beneath the retrieval-adapted weights.

We observe that this sacrifice is unnecessary when using Low-Rank Adaptation (LoRA) [Hu et al., 2022]. Because LoRA adapters are additive (W_adapted = W_base + BA), disabling them at inference time exactly recovers the base model's weights. This means a single VLM with a retrieval LoRA adapter can serve as both:

• A retrieval model (LoRA-on, bidirectional attention → custom_text_proj → 320-dim embeddings), and
• A generative VLM (LoRA-off, causal attention → lm_head → autoregressive text).

Critically, only the retrieval head requires training.
Generation capability is recovered by disabling the adapter and restoring causal attention, though realizing this in practice requires addressing three non-obvious engineering requirements (Section 3.4).

Prior work has explored related ideas. SV-RAG [Chen et al., 2025] trains two LoRA adapters on a shared VLM (one for retrieval, one for generation) and swaps them at inference. URaG [Shi et al., 2026] unifies both tasks by inserting a retrieval module at an intermediate transformer layer. ColQwen2_4RAG [Oprea and Bâra, 2025] demonstrated the same LoRA on/off toggling mechanism in an application setting, but did not identify the engineering requirements for reliable generation (Section 3.4), evaluate against a controlled baseline, or compare against joint training. GritLM [Muennighoff et al., 2025] showed that joint training can unify embedding and generation in text-only models. Our contribution is not the toggling mechanism itself, which exists in prior work, but a systematic analysis of when and why it works: identifying three failure modes that silently break generation in standard pipelines, and demonstrating through controlled experiments that generation training is unnecessary.

We call this architecture Hydra: one model, many heads.[1] Figure 1 illustrates the architecture, and Figure 2 shows how this extends to a complete retrieval-augmented generation (RAG) pipeline. Our contributions are:

1. A dual-head approach that provides both ColBERT retrieval and autoregressive generation from a single VLM, requiring only a single LoRA adapter and no generation training. We identify three engineering requirements for making this work: attention mode restoration, lm_head preservation, and KV-cache support (Section 3).

2.
Evaluation on 9 ViDoRe V1 tasks against a controlled baseline, with additional single-run results on V2 (4 tasks) and V3 (8 tasks), generation equivalence across four VQA benchmarks, and efficiency measurements demonstrating 41% memory reduction (Section 5).

3. An empirical ablation showing that, within LoRA-based training (r=16), GritLM-style joint training produces equivalent results but still requires LoRA toggling; the additional training complexity provides no benefit over retrieval-only training (Section 5.3).

2 Related Work

Unified embedding and generation. GritLM [Muennighoff et al., 2025] showed that a single LLM can perform both embedding and generation by alternating between objectives during full fine-tuning, switching between bidirectional and causal attention masks at inference. OneGen [Zhang et al., 2024] unified both in a single forward pass by allocating special retrieval tokens whose hidden states serve as query embeddings during autoregressive generation. Both remain text-only and use dense single-vector embeddings rather than multi-vector late interaction.

[1] The specific instantiation on Qwen3.5-4B is HydraQwen3.5-4B. "4B" is the model family name; the actual parameter count is 4.57B.

[Figure 1 diagram: Qwen3.5 (frozen W_base) with a frozen vision encoder and a LoRA adapter (r=16, α=64). Retrieval mode (LoRA-on): bidirectional attention → custom_text_proj (R^d → R^320) → 320-dim embeddings with MaxSim scoring. Generation mode (LoRA-off): causal attention → lm_head (R^d → R^|V|) → autoregressive text with KV-cache.]

Figure 1: Hydra architecture. A single VLM serves two modes by toggling a LoRA adapter at inference time. Left: Retrieval mode (LoRA-on, bidirectional attention) produces 320-dim multi-vector embeddings via custom_text_proj. Right: Generation mode (LoRA-off, causal attention) produces autoregressive text via the base lm_head with KV-cache. The vision encoder is frozen and shared.
No weight copying or model reloading occurs between modes. Solid arrows = retrieval path; dashed arrows = generation path.

Unified retrieval and generation for visual documents. SV-RAG [Chen et al., 2025] trains two separate LoRA adapters on a shared frozen MLLM backbone: one converts the model into a ColBERT-style multi-vector retriever, the second fine-tunes it for QA generation, with adapters swapped at inference. URaG [Shi et al., 2026] inserts a lightweight retrieval module at an intermediate transformer layer, exploiting the observation that early layers distribute attention broadly while deeper layers concentrate on evidence pages; irrelevant pages are pruned mid-forward-pass, achieving retrieval and generation in a single pass. VDocRAG [Tanaka et al., 2025] pre-trains a VLM with both retrieval and generation objectives but deploys separate components at inference. VisRAG [Yu et al., 2024] uses VLMs for both tasks as a two-stage pipeline with separately fine-tuned models. Hydra differs from SV-RAG in requiring one adapter and no generation training: disabling the retrieval adapter exactly recovers the base model's generation capability. It differs from URaG in producing a standalone ColBERT retriever that can be deployed independently of the generation pathway, rather than coupling retrieval to an intermediate layer of the generation forward pass.

LoRA as an inference-time switch. ColQwen2_4RAG [Oprea and Bâra, 2025] showed that toggling ColQwen2's LoRA adapters on and off switches the same Qwen2-VL backbone between retrieval and generation modes, demonstrating the core mechanism in an application context without systematic evaluation or the engineering analysis we provide.
More broadly, aLoRA [Greenewald et al., 2025] invokes different LoRA adapters at different RAG pipeline stages with KV-cache reuse, MeteoRA [Xu et al., 2025b] embeds multiple task-specific LoRA adapters with per-token gating, and S-LoRA [Sheng et al., 2024] provides serving infrastructure for concurrent adapter selection. Hydra differs from these approaches in requiring no generation training and in providing a systematic analysis of the failure modes that make toggling reliable (Section 3.4).

[Figure 2 diagram: top, a ColPali + LLM two-model pipeline (ColPali, 4B params, builds the page-embedding index and retrieves top-k pages; a separate 4B-param LLM answers): 8B+ total params, 17,913 MB peak VRAM, 2 models in GPU memory. Bottom, the Hydra single-model pipeline (one 4B-param model whose retrieval head both indexes and retrieves top-k pages, and whose generation head answers): 4B total params, 10,496 MB peak VRAM, 1 model in GPU memory, a 41% memory reduction.]

Figure 2: RAG pipeline comparison. Top: ColPali retrieves relevant pages, but a separate LLM is needed for generation at query time, requiring two models in GPU memory (8B+ parameters, 17,913 MB peak VRAM). Bottom: Hydra uses a single 4B-parameter model for both indexing (retrieval head for embeddings) and querying (retrieval head finds top-k pages, generation head answers from them). Both heads share one model in GPU memory, reducing peak VRAM to 10,496 MB (41% savings). Solid blue borders = retrieval; red borders = generation.

Scope of comparison. We build on ColQwen3.5 [Georgiou, 2026], which adapts Qwen3.5 [Qwen Team, 2026] for ColBERT-style late-interaction retrieval over patch embeddings [Khattab and Zaharia, 2020]. Our evaluation is scoped to this family of vision-first, multi-vector models; single-vector and hybrid text-vision approaches differ in retrieval mechanism and are not directly comparable.
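To make the late-interaction mechanism concrete before describing the method, here is a minimal NumPy sketch of MaxSim scoring (our illustration, not the colpali-engine implementation): each query token takes its maximum similarity over all document patch vectors, and these maxima are summed over query tokens.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    query_vecs: (Nq, d) L2-normalized query token embeddings.
    doc_vecs:   (Nd, d) L2-normalized document patch embeddings.
    """
    sim = query_vecs @ doc_vecs.T        # (Nq, Nd) cosine similarities
    return float(sim.max(axis=1).sum())  # max over doc vectors, sum over query tokens

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example: a page containing near-exact matches for the query tokens
# outscores a random page (4-dim space here; Hydra uses 320 dims).
rng = np.random.default_rng(0)
q = l2norm(rng.normal(size=(2, 4)))
page_a = l2norm(np.vstack([q, rng.normal(size=(1, 4))]))  # contains the query vectors
page_b = l2norm(rng.normal(size=(3, 4)))                  # unrelated page

assert maxsim_score(q, page_a) > maxsim_score(q, page_b)
```

Ranking a corpus then amounts to computing this score for every indexed page and sorting; no query-document cross-attention is needed at search time, which is what makes the embeddings indexable offline.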
3 Method

3.1 Architecture Overview

Hydra consists of a single ColQwen3.5 model (Qwen3.5 [Qwen Team, 2026] augmented with a linear projection head, custom_text_proj: R^d → R^320) plus two output pathways (Figure 1):

1. Retrieval head: the custom_text_proj projection, producing L2-normalized 320-dim multi-vector embeddings for ColBERT-style late-interaction scoring.

2. Generation head: the base model's lm_head (R^d → R^|V|), producing logits over the vocabulary for autoregressive decoding.

A single LoRA adapter (r=16, α=64) is applied to all language-model projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) and to custom_text_proj, excluding the vision encoder. The vision encoder remains frozen, ensuring identical visual features in both modes.

3.2 Mode Switching

The two heads are activated by toggling two controls:

Retrieval mode (embedding). The LoRA adapter is enabled, and full-attention layers are patched to bidirectional attention. Specifically, for each full-attention layer, we replace the causal attention mask M_causal with a bidirectional mask M_bidir:

    M_bidir[i, j] = 0    if positions i and j are both valid (non-padding)
                  = −∞   otherwise                                          (1)

This is implemented by extracting the diagonal of the 4D causal mask to identify valid positions, then constructing a symmetric mask where all valid positions attend to each other. Sliding-window layers are left unchanged, as their local attention pattern is compatible with both modes. The forward pass produces hidden states that are projected through custom_text_proj and L2-normalized to yield multi-vector embeddings.

Generation mode. The LoRA adapter is disabled, restoring the base model weights (W_adapted − BA = W_base). Full-attention layers revert to their original causal attention.
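The diagonal-extraction trick behind Eq. (1) can be sketched in NumPy (an illustration using 2D additive masks, where 0 = attend and a large negative = blocked; the actual implementation operates on Hugging Face's 4D masks):

```python
import numpy as np

NEG_INF = np.finfo(np.float32).min  # stands in for -inf in additive masks

def causal_mask(valid: np.ndarray) -> np.ndarray:
    """Additive causal mask: position i may attend to j <= i; padding is blocked.
    valid: (L,) boolean, True for non-padding positions."""
    L = valid.shape[0]
    allowed = np.tril(np.ones((L, L), dtype=bool)) & valid[None, :] & valid[:, None]
    return np.where(allowed, 0.0, NEG_INF).astype(np.float32)

def to_bidirectional(mask: np.ndarray) -> np.ndarray:
    """Rebuild a bidirectional mask from a causal one, as in Eq. (1):
    recover validity from the diagonal (0 only for non-padding tokens),
    then let all valid positions attend to each other symmetrically."""
    valid = np.diag(mask) == 0.0
    allowed = valid[None, :] & valid[:, None]
    return np.where(allowed, 0.0, NEG_INF).astype(np.float32)

valid = np.array([True, True, True, False])  # last position is padding
bidir = to_bidirectional(causal_mask(valid))
assert bidir[0, 2] == 0.0              # earlier token now attends to a later one
assert bidir[0, 3] == NEG_INF          # padding stays masked
assert np.array_equal(bidir, bidir.T)  # symmetric among valid positions
```

The key property the sketch demonstrates is that no side information is needed: the causal mask's diagonal already encodes which positions are real tokens, so the bidirectional mask can be derived in place at toggle time.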
The forward pass produces hidden states that are projected through the base lm_head for greedy autoregressive decoding. This mode switching happens per call, with no weight copying or model reloading (Algorithm 1).

Algorithm 1 Mode switching in Hydra
1: function Embed(images)
2:   Enable LoRA adapter layers
3:   Set full-attention layers to bidirectional
4:   return custom_text_proj(forward(images))
5:
6: function Generate(image, prompt)
7:   Disable LoRA adapter layers
8:   Restore causal attention on full-attention layers
9:   return autoregressive_decode(lm_head, forward(image, prompt))

3.3 Design Rationale: Retrieval-Only Training

Prior approaches to unified retrieval and generation (GritLM [Muennighoff et al., 2025] via joint training, SV-RAG [Chen et al., 2025] via dual adapters) assume that generation capability must be explicitly trained or preserved. We show this is unnecessary when using LoRA.

Let W_base denote the frozen base model weights, B, A the LoRA matrices (whose product BA gives the low-rank update), and φ_proj the custom_text_proj parameters. Retrieval training optimizes BA and φ_proj via contrastive loss while W_base (including lm_head) remains frozen. At generation time, we disable LoRA and use W_base directly. Since W_base was never modified, the generation capability is equivalent to the pretrained VLM at the weight level (see Section 5.2 for empirical verification). The ablation in Section 5.3 confirms this systematically: joint training provides no measurable benefit. The LoRA toggling approach is simpler: the base model's weights are recovered exactly, yielding generation with no degradation under greedy decoding (Section 5.2).

3.4 Three Engineering Requirements for Dual-Head Generation

LoRA's additive structure guarantees generation equivalence in theory: disabling the adapter recovers the base weights exactly.
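The weight-level argument, and the toggle in Algorithm 1, can be illustrated with a toy LoRA-adapted linear layer (pure NumPy; the class and field names are ours, not the paper's code). Because the low-rank term is computed as a separate additive branch rather than merged into the weight matrix, disabling it leaves the base path bit-identical:

```python
import numpy as np

class ToggleableLoRALinear:
    """Minimal model of a LoRA-adapted linear layer (illustrative only).
    Effective weights are W_base + (alpha/r) * B @ A, but W_base is never
    modified: the update is applied as a separate branch in forward()."""

    def __init__(self, d_in: int, d_out: int, r: int = 16, alpha: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_base = rng.normal(size=(d_out, d_in))   # frozen base weights
        self.A = rng.normal(size=(r, d_in)) * 0.01     # trained for retrieval
        self.B = rng.normal(size=(d_out, r)) * 0.01
        self.scale = alpha / r
        self.lora_enabled = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        y = x @ self.W_base.T
        if self.lora_enabled:                          # retrieval mode
            y = y + self.scale * (x @ self.A.T) @ self.B.T
        return y                                       # generation mode: base path only

layer = ToggleableLoRALinear(d_in=8, d_out=8)
x = np.random.default_rng(1).normal(size=(3, 8))

layer.lora_enabled = False                             # Algorithm 1, line 7
y_off = layer.forward(x)
assert np.array_equal(y_off, x @ layer.W_base.T)       # bit-identical, not merely close
```

Note the contrast with merged deployment: materializing W_base + BA and later subtracting BA would reintroduce floating-point rounding, whereas skipping the additive branch recovers the base computation exactly, which is what makes the byte-identical results in Section 5.2 possible.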
In practice, we identified two mechanisms by which standard training pipelines silently corrupt the base weights (Requirements 1–2 below), plus a practical barrier that makes naïve generation infeasible (Requirement 3). The contribution is not the mathematical property but the identification of failure modes that violate it.

Making dual-head generation work from a retrieval-fine-tuned model requires addressing these three requirements. Requirements 1 and 2 are correctness constraints (generation fails silently without them); Requirement 3 is a practical necessity (generation works without it but takes ~38× longer).

Requirement 1: Attention mode restoration. Retrieval training patches full-attention layers to bidirectional attention. If these patches are not reverted before generation, autoregressive decoding fails: the model can attend to future tokens during prefill, breaking the causal structure that left-to-right generation depends on. In Qwen3.5's hybrid architecture, only "full_attention" layers (as opposed to sliding-window layers) require this patching, since sliding-window layers use a fixed local window that is compatible with both modes. Our implementation stores both the original causal and the patched bidirectional forward functions per layer, switching between them at mode-toggle time.

Requirement 2: Base model lm_head preservation. The lm_head used for generation must be the original base model's lm_head, loaded separately from the pretrained checkpoint. Although LoRA leaves W_base frozen in principle, in practice the lm_head can be corrupted during training through two mechanisms we identified empirically. First, when lm_head shares tied weights with the input embedding layer [Press and Wolf, 2017], gradients from the embedding propagate to lm_head even though it is not a LoRA target.
Second, failing to set requires_grad=False on lm_head allows PyTorch DDP to accumulate and synchronize gradients for it even when no optimizer group updates it, causing bf16 numerical drift over thousands of steps. We avoid both failure modes by loading the lm_head from a separate instantiation of the base model and storing it alongside the adapter checkpoint.

Requirement 3: KV-cache-aware generation. Without KV-cache, each token generation step requires a full forward pass including vision-encoder processing of pixel values, which is extremely slow (281 seconds per sample in our measurements). We implement KV-cache-aware generation: pixel values are processed on the first forward step, and subsequent steps reuse cached key-value pairs, yielding a ~38× speedup (7.4 seconds per sample). This requires calling the base model's forward pass directly (bypassing ColQwen3.5's wrapper, which does not support use_cache=True) and manually managing the attention-mask extension at each step.

4 Training

Only the retrieval head is trained. We use standard ColPali-engine training [Faysse et al., 2025] with the ColBERT contrastive loss.

4.1 Training Data

We combine multiple visual document retrieval datasets:

• vidore/colpali_train_set [Faysse et al., 2025]
• openbmb/VisRAG-Ret-Train-Synthetic-data [Yu et al., 2024]
• openbmb/VisRAG-Ret-Train-In-domain-data [Yu et al., 2024]
• llamaindex/vdr-multilingual-train (en, de, es, fr, it)[2]

Each sample consists of a text query paired with a positive document page image. All datasets are publicly available for research use. Evaluation uses the test split of vidore/colpali_train_set.

4.2 Training Configuration

• Loss: ColBERT loss [Khattab and Zaharia, 2020], temperature τ = 0.02, in-batch negatives.
• LoRA: r=16, α=64, dropout = 0.197 (from hyperparameter sweep). Applied to all LM projections and custom_text_proj; vision encoder frozen.
• Optimizer: AdamW, lr 5×10⁻⁵, cosine schedule with 8% warmup.
• Batch: effective batch size 112 via Distributed Data Parallel (DDP); bf16 mixed precision; 1 epoch.

We train and evaluate on Qwen3.5-4B. All results reported are from a single training run (seed 42).

5 Experiments

5.1 Retrieval: ViDoRe Benchmarks

We evaluate retrieval performance on three ViDoRe benchmark suites: V1 [Faysse et al., 2025] (9 of 10 standard tasks,[3] spanning arXiv papers, forms, tables, and synthetic documents), V2 [Macé et al., 2025] (4 tasks: biomedical, ESG, and economics reports), and V3 [Loison et al., 2026] (8 multilingual tasks across computer science, energy, finance, HR, industrial, pharmaceuticals, and physics domains). Evaluation uses the Massive Text Embedding Benchmark (MTEB) framework [Muennighoff et al., 2023] (v2.10.12–2.10.13)[4] with MaxSim scoring. Table 1 reports average normalized Discounted Cumulative Gain at rank 5 (nDCG@5) across all three suites alongside a controlled single-head baseline; per-task breakdowns are in Section A (Appendix).

The dual-head model achieves 0.8842 average nDCG@5, within 1 pp of a controlled single-head ColQwen3.5 baseline (0.8892) trained under the same regime (same data, hyperparameters, single epoch).[5] Performance is mixed per-task: Hydra leads on ArxivQA (+0.8 pp) and DocVQA

[2] https://huggingface.co/datasets/llamaindex/vdr-multilingual-train
[3] We exclude InfoVQA because MTEB v2.10.12 uses a different subset split than the original ViDoRe V1 leaderboard, producing non-comparable scores; the remaining 9 tasks are identical across all compared models.
[4] V1 and baseline evaluations used v2.10.12; V2/V3 evaluations used v2.10.13. We verified identical task definitions across these versions for overlapping benchmarks.
[5] Baseline model: ColQwen3.5-4B-controlled-baseline, trained with identical configuration to Hydra but without generation capability. Both models were evaluated on the same 9 V1 tasks using MTEB v2.10.12.
Table 1: Retrieval performance (average nDCG@5) on ViDoRe V1, V2, and V3. Baseline: single-head ColQwen3.5 trained under the same regime as Hydra (same data, hyperparameters, 1 epoch). Per-task results in Appendix.

Suite       Tasks   Hydra    Baseline   ∆
ViDoRe V1   9       0.8842   0.8892     −0.0050
ViDoRe V2   4       0.5811   0.5740     +0.0071
ViDoRe V3   8       0.5813   0.5343     +0.0469

(+1.0 pp) while the baseline leads on TabFQuAD (+4.1 pp). The difference is not statistically significant (bootstrap 95% CI [−0.016, +0.004], p=0.318; Wilcoxon signed-rank p=0.734).[6] Across all 21 tasks, the full picture is consistent with no meaningful retrieval cost: V1 within noise (−0.5 pp), V2 Hydra +0.7 pp, V3 Hydra +4.7 pp, with 12 of 21 tasks favoring Hydra and advantages concentrated on harder benchmarks. Both models are single training runs; multi-seed experiments would clarify whether the V2/V3 advantages are systematic (Section 7).

ViDoRe V2 & V3. On the more challenging V2 and V3 benchmarks (Table 1), Hydra (4B) achieves 0.5811 average nDCG@5 on V2 (+0.7 pp vs. baseline) and 0.5813 on V3 (+4.7 pp vs. baseline). We observe higher scores on harder benchmarks, with 7 of 8 V3 tasks favoring Hydra. The largest V3 gains are on Finance EN (+9.5 pp), Industrial (+8.9 pp), and Finance FR (+8.6 pp). As both models are single training runs, we cannot fully disentangle architectural effects from training variance; the consistency across tasks is suggestive, but confirmation requires multi-seed experiments. The lower absolute scores reflect the increased difficulty of these benchmarks: V2 features specialized professional documents, while V3 spans eight multilingual domains.

5.2 Generation Quality

Since the generation head uses the unmodified base VLM with LoRA disabled, generation quality should be equivalent to the pretrained Qwen3.5.
We use greedy decoding (T=0) to produce deterministic outputs, isolating weight and implementation differences from sampling variance. To verify that LoRA toggling recovers exact base weights, we run both generation passes through the same KV-cache code path with LoRA disabled: across 10,000 samples (DocVQA 5,000 + TextVQA 5,000), the two runs produce byte-identical outputs in 100% of cases (∆ANLS = 0.0). A Two One-Sided Tests (TOST) equivalence test [Schuirmann, 1987] with bound ε = 0.01 confirms formal equivalence on both benchmarks (p_TOST < 0.001, 90% CI [0.0, 0.0]). A follow-up at T=0.7 (top_p = 0.8, per-sample seed control, n=500) also yields 100% exact match, confirming the result extends to stochastic sampling. This is the expected consequence of LoRA's additive structure: disabling the adapter recovers the base weights exactly (W = W_0 + BA → W_0).

As a stricter test, we compare Hydra's KV-cache generation path against HuggingFace's standard generate() pipeline across four VQA benchmarks (DocVQA [Mathew et al., 2021], ChartQA [Masry et al., 2022], InfoVQA [Mathew et al., 2022], and TextVQA [Singh et al., 2019]), totalling 15,301 samples, using Average Normalized Levenshtein Similarity (ANLS) [Biten et al., 2019].[7]

[6] These tests operate on n=9 task-level nDCG@5 averages, matching standard ViDoRe reporting granularity. Per-query significance tests would have substantially more statistical power but are not standard for this benchmark. We compare against a controlled baseline rather than prior unified systems (SV-RAG, URaG) because those systems do not report on the ViDoRe benchmarks used here.
[7] InfoVQA is excluded from retrieval evaluation because different MTEB versions use different document subsets, changing the retrieval candidate pool and producing non-comparable scores. For generation, ANLS is computed per-sample, independent of the candidate pool, so the subset split does not affect the evaluation.
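For reference, per-sample ANLS can be sketched as follows (a generic illustration with the standard 0.5 threshold; benchmark implementations differ in normalization details such as casing and whitespace handling):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, golds: list, tau: float = 0.5) -> float:
    """Per-sample ANLS: 1 - normalized edit distance against the closest
    gold answer, zeroed out when the distance reaches the threshold tau."""
    best = 0.0
    for gold in golds:
        p, g = pred.strip().lower(), gold.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

assert anls("2019", ["2019"]) == 1.0
assert anls("the report of 2019", ["2019"]) == 0.0  # verbose answer: zeroed
```

The second assertion illustrates why verbose outputs score zero even when they contain the correct answer: the normalized distance is dominated by the extra tokens and exceeds the threshold.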
Table 2: Three-mode comparison of Hydra (retrieval-only training) vs. GritLM-style joint training. Retrieval: 9 ViDoRe V1 tasks. Generation: DocVQA validation (n=200). Both models use the same base and LoRA config (batch size differs; see text). Despite different adapter weights (max element-wise diff: 0.50), the two functional modes are equivalent; the mode that joint training was designed to unlock (LoRA-on generation) fails.

Inference Mode                        Hydra                     GritLM-style
LoRA-on, bidirectional (retrieval)    0.8842 nDCG@5             0.8893 nDCG@5
LoRA-off, causal (generation)         0.561 ANLS, 76.5% match   0.561 ANLS, 76.5% match
LoRA-on, causal (GritLM generation)   N/A                       image-blind†

† See text. The n=200 subset is sufficient to detect this catastrophic failure mode; the test is binary (does the model condition on image content at all?).

Across all four benchmarks, the maximum absolute ANLS delta is 0.0044 (DocVQA); no benchmark shows statistically significant degradation. ChartQA ANLS is near-zero for both models under greedy decoding (verbose outputs exceed the 0.5 Levenshtein threshold); generation equivalence rests on the remaining three benchmarks (12,801 samples). Per-benchmark results are in Section B (Appendix).

5.3 Ablation: Joint Training vs. LoRA Toggle

An alternative to Hydra's retrieval-only training is GritLM-style joint training [Muennighoff et al., 2025], which alternates between embedding and generation batches during fine-tuning. We train a joint model on the same Qwen3.5-4B base using alternating batches (80% ColBERT loss, 20% cross-entropy on LLaVA-Instruct VQA data [Liu et al., 2023]), with identical LoRA configuration (r=16, α=64), learning rate (5×10⁻⁵), and schedule (1 epoch, cosine with 8% warmup), but a smaller effective batch size of 32 (vs. 112 for Hydra).[8]
We evaluate both models in three inference modes: LoRA-on retrieval, LoRA-off generation, and LoRA-on generation (the mode GritLM-style training is designed to enable). Table 2 summarizes the results. The two functional modes, retrieval (LoRA on) and generation (LoRA off), produce equivalent results for both training approaches, despite significantly different adapter weights. The 0.5 pp retrieval gap between the two models is comparable in magnitude to the non-significant Hydra-vs-baseline difference (Table 1), consistent with all three models performing equivalently on V1.

The critical finding is that LoRA-on generation, the mode that joint training was designed to enable, fails entirely. On DocVQA (n=200, T=0), the jointly trained model produces a single token ("The") with probability p=0.91 regardless of image content, unable to condition on visual input. This is the same failure mode observed in our earlier 0.8B experiments, showing that a rank-16 LoRA adapter trained with bidirectional attention cannot support autoregressive generation, regardless of model scale or whether generation data was included during training. This suggests that LoRA toggling is not merely convenient but structurally necessary within the LoRA training regime: the low-rank subspace cannot simultaneously serve both attention modes in our experiments. This conclusion is specific to LoRA (r=16, α=64); GritLM's full fine-tuning successfully supports both modes [Muennighoff et al., 2025], suggesting the failure is a low-rank constraint rather than a fundamental property of bidirectional attention.

[8] The batch size difference reflects the added memory cost of interleaving generation batches. This does not affect the conclusion: the failure of LoRA-on generation is catastrophic (single-token collapse), not a marginal performance gap that batch size could explain.
Table 3: Summary of Hydra (4B) across all evaluation dimensions. Efficiency measured on a single GPU.

Metric                   Result
ViDoRe V1 avg nDCG@5     0.8842 (baseline: 0.8892, ∆ = −0.5 pp)
ViDoRe V2 avg nDCG@5     0.5811 (baseline: 0.5740, ∆ = +0.7 pp)
ViDoRe V3 avg nDCG@5     0.5813 (baseline: 0.5343, ∆ = +4.7 pp)
Generation ANLS          max |∆| = 0.0044 across 4 benchmarks (15K samples)
Trainable parameters     ~32.5M (0.7% of 4.57B total)*
Peak VRAM                10,496 MB (vs. 17,913 MB two-model)
Mode-switch latency      5.9 ms (1.8% of a generation call)

* "4B" is the model family name; the actual parameter count is 4.57B.

Since both approaches require LoRA toggling at inference and produce equivalent results, the 20% generation-training batches provide no measurable advantage. Hydra's retrieval-only training is simpler and sufficient.

5.4 Efficiency

We measure the practical overhead of the single-model architecture on a single NVIDIA B200 GPU.

Memory. Hydra (4B) uses 10,496 MB peak GPU memory during a full embed-then-generate cycle. Loading separate retrieval and generation models (ColQwen3.5 + Qwen3.5) and performing the same operations requires 17,913 MB. Hydra thus reduces peak memory by 41%.

Mode-switching latency. A full mode-switching round trip (retrieval → generation → retrieval) takes 5.9 ms mean over 50 iterations. The 5.9 ms overhead is 1.8% of a single generation call (335 ms), making it negligible relative to inference.

KV-cache state isolation. A shared model raises the concern that internal state from one mode could leak into the other. We test this with a contamination protocol: embed → generate → embed → generate on 50 inputs, comparing each round-trip output against the corresponding single-pass output.
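The round-trip protocol can be expressed as a small harness (illustrative; StubModel and the function names are ours, with a deterministic stub standing in for Hydra's embed/generate modes):

```python
class StubModel:
    """Deterministic stand-in for a dual-head model with embed/generate modes."""
    def embed(self, x):
        return [v * 2.0 for v in x]                  # pretend embedding
    def generate(self, x):
        return "out:" + ",".join(str(v) for v in x)  # pretend decoded text

def contamination_check(model, inputs) -> bool:
    """Compare single-pass outputs against interleaved mode round trips.
    Any cached state leaking across a mode switch would surface as a
    mismatch between a round-trip output and its single-pass reference."""
    emb_ref = [model.embed(x) for x in inputs]
    gen_ref = [model.generate(x) for x in inputs]
    for x, e_ref, g_ref in zip(inputs, emb_ref, gen_ref):
        # Interleave modes: embed -> generate -> embed -> generate.
        e1, g1 = model.embed(x), model.generate(x)
        e2, g2 = model.embed(x), model.generate(x)
        assert e1 == e_ref and e2 == e_ref  # embeddings identical across cycles
        assert g1 == g_ref and g2 == g_ref  # generations identical across cycles
    return True

assert contamination_check(StubModel(), [[1.0, 2.0], [3.0]])
```

With the real model, the embedding comparison is element-wise over tensors and the generation comparison is byte-level over decoded strings, but the harness shape is the same.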
Embeddings are bitwise identical across cycles (max element-wise diff = 0.0, cosine similarity = 1.0), and generation outputs are byte-identical in 100% of cases. No KV-cache state persists across mode switches.

5.5 Summary

Table 3 consolidates results across all evaluation dimensions.

6 Omni-Modal Extension

6.1 Proof of Concept: Omni-Modal Generalization

To test whether the Hydra mechanism generalizes beyond a single model family and modality, we apply it, without additional training, to Qwen2.5-Omni-3B [Xu et al., 2025a], a multimodal model with native support for image, audio, and video input, as well as text and speech output.

Setup. We use vidore/colqwen-omni-v0.1,[9] a ColBERT adapter trained on 127K image-text pairs atop the Qwen2.5-Omni-3B backbone using colpali-engine [Faysse et al., 2025]. The adapter was trained on image data only; audio and video retrieval capabilities are entirely zero-shot, acquired through the frozen Whisper audio encoder and Qwen2-VL vision encoder in the base model. We apply the Hydra architecture as-is: LoRA-on with bidirectional attention for retrieval (via custom_text_proj, 128-dim embeddings), LoRA-off with causal attention for generation (via the base model's lm_head). No additional training is performed.

The model additionally supports speech synthesis via the Qwen2.5-Omni talker module and BigVGAN vocoder [Lee et al., 2023], giving Hydra three inference modes from a single 4.4B-parameter model instance:

1. Retrieval (LoRA on, bidirectional): ColBERT multi-vector embeddings over images, audio, or video.
2. Text generation (LoRA off, causal): autoregressive text conditioned on any input modality.
3. Speech generation (LoRA off, causal, talker enabled): spoken answers via the thinker–talker–vocoder pipeline.

Image retrieval.
The model achieves 0.8812 average nDCG@5 on V1 (9 tasks), 0.5353 on V2 (4 tasks), and 0.4907 on V3 (8 tasks), comparable to the 4B variant despite a smaller backbone (3B) and a different model family; full per-task results are in Table 7 (Appendix).

Audio retrieval (zero-shot). We evaluate text-to-audio retrieval on AudioCaps [Kim et al., 2019] (n = 500 test clips, 7–10 s each at 16 kHz). Audio clips are embedded by routing raw waveforms through the Whisper feature extractor [Radford et al., 2023] and the shared projection head; captions are embedded as text queries through the same backbone. Using ColBERT MaxSim scoring over the full 500 × 500 similarity matrix, the model achieves R@1 = 26.2%, R@5 = 55.6%, R@10 = 69.0%, and MRR = 40.6%—with no audio contrastive training, relying entirely on cross-modal transfer through the shared Qwen2.5-Omni backbone. For reference, supervised audio-text models (e.g., CLAP [Elizalde et al., 2023]) achieve R@1 ≈ 35–40% on this benchmark; the gap is expected given zero-shot transfer.

Generation equivalence. We evaluate generation preservation on DocVQA [Mathew et al., 2021] validation (n = 200) using ANLS with containment matching (to handle sentence-form answers from the Qwen2.5-Omni generation style). The base model achieves 0.9412 ANLS; Hydra-Omni with LoRA disabled achieves 0.9298 ANLS (Δ = −0.011, < 1.2 pp). Both models produce correct answers on the same samples; the delta reflects formatting differences (the base model appends hallucinated continuation text under greedy decoding, a known pathology) rather than accuracy loss.

Speech generation. Hydra-Omni can also produce spoken answers by routing through the thinker, talker, and BigVGAN vocoder [Lee et al., 2023] pipeline, producing coherent speech (8.1 s at 24 kHz) from the same model instance.

Summary.
The omni-modal extension confirms that the Hydra mechanism generalizes beyond Qwen3.5 and image-only settings, producing functional retrieval across images, audio, and video while preserving text and speech generation—all without modification or additional training. Video embeddings are produced by the pipeline but not yet evaluated on retrieval benchmarks.

⁹ https://huggingface.co/vidore/colqwen-omni-v0.1

Table 4: Structural comparison of unified retrieval-generation architectures. Hydra is the only approach requiring no generation training and a single adapter.

Property                 Hydra                   SV-RAG         URaG               GritLM
Adapters needed          1                       2              0 (custom module)  0 (full FT)
Generation training      None                    Yes            Yes                Yes
Retriever independence   Yes                     Yes            No                 N/A
Multi-vector retrieval   Yes                     Yes            Yes                No
Peak VRAM                10.5 GB (single model)  ~10.5 GB × 2   single pass        full model

7 Discussion

LoRA as a mode switch. The ablation in Section 5.3 confirms that LoRA toggling, rather than joint training, is the operative mechanism: GritLM-style training achieves equivalent results but still requires toggling, confirming that the additional complexity provides no benefit.

Comparison with prior unified architectures. Table 4 compares Hydra against prior unified retrieval-generation architectures across key design dimensions. Hydra is the only approach that requires no generation training and uses a single adapter; the base model's generation capability is recovered by disabling the adapter rather than being explicitly trained or preserved.

Production deployment considerations. Hydra's single-model design reduces memory but introduces deployment trade-offs. LoRA adapters incur measurable throughput overhead in current serving frameworks [Sheng et al., 2024].
Additionally, the model cannot serve retrieval and generation requests simultaneously—mode switches serialize these operations at the model level, unlike a two-model deployment that can parallelize them across concurrent queries. LoRA serving infrastructure (S-LoRA [Sheng et al., 2024], vLLM adapter routing) is actively improving, but deployments should evaluate throughput requirements alongside memory constraints.

Limitations.

• VLM families: Tested on Qwen3.5 (4B) and Qwen2.5-Omni (3B). While the omni-modal extension (Section 6) demonstrates generality across model families and modalities, testing on non-Qwen architectures (InternVL, LLaVA) remains future work.
• Single training run: All results are from one training run per model; variance across seeds is not estimated.
• Generation evaluation: Equivalence verified under greedy decoding (Section 5.2); cross-implementation evaluation under sampling (T > 0) remains future work.
• Audio/video retrieval: The omni-modal results (Section 6) are zero-shot; explicit audio and video contrastive training would likely improve performance but is not explored.
• LoRA rank: All experiments use r = 16. The ablation attributes the joint-training failure to a low-rank constraint (Section 5.3), but we do not test higher ranks (r = 32, r = 64); the conclusion that joint training is unnecessary may be rank-dependent.
• Video retrieval: The omni-modal extension (Section 6) verifies that the pipeline produces video embeddings but does not evaluate them on retrieval benchmarks. "Video embedding" should not be interpreted as "video retrieval."
• End-to-end RAG: Retrieval and generation are evaluated independently. We do not evaluate the full retrieve-then-generate pipeline (Figure 2) end-to-end; combined pipeline quality (e.g., answer accuracy given retrieved context) remains untested.

Future work.
Several directions are promising: (1) testing on non-Qwen VLM families (InternVL [Chen et al., 2024], LLaVA [Liu et al., 2023]); (2) multi-page cross-attention for document-level reasoning; (3) explicit audio and video contrastive training to improve zero-shot retrieval performance; (4) adapter composition for additional tasks beyond retrieval and generation.

Broader impact. Hydra can process sensitive documents (medical records, legal filings, financial reports), and the single-model design concentrates both retrieval and generation behind one access point. This simplifies access control relative to multi-model pipelines, but a compromised model exposes both capabilities simultaneously. Deployments should enforce document-level permissions and audit query logs accordingly.

8 Conclusion

Hydra demonstrates that a single retrieval-trained LoRA adapter suffices to provide both ColBERT-style document retrieval and autoregressive generation from one VLM instance, with no generation training. The key practical insight is not the toggling mechanism—which exists in prior work—but that standard training pipelines silently corrupt the base model's generation capability through weight-tying gradients and DDP synchronization artifacts (Section 3.4). Once these failure modes are addressed, the dual-head design matches a controlled single-head retrieval baseline within noise while preserving generation quality and reducing peak GPU memory by 41%. The ablation reveals that this result is not merely convenient but structurally necessary within LoRA (r = 16): joint training cannot make the adapted weights support both attention modes, so toggling is required regardless. The omni-modal extension confirms the mechanism generalizes across model families and modalities. More broadly, LoRA adapters are not merely a training convenience—they are inference-time mode switches. One model, many heads.
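For reference, the ColBERT-style late-interaction (MaxSim) scoring used throughout the retrieval evaluations reduces to a few lines. The sketch below is a minimal numpy illustration on random toy embeddings, not the colpali-engine implementation; the 128-dim embedding size matches the paper's projection head, while all other shapes and the synthetic query construction are illustrative.

```python
import numpy as np

def maxsim_scores(queries, docs):
    """Late-interaction (ColBERT) scoring.

    queries: (nq, Lq, d) L2-normalized query token embeddings
    docs:    (nd, Ld, d) L2-normalized document token embeddings
    Returns an (nq, nd) score matrix: for each query token, take the max
    cosine similarity over all document tokens, then sum over query tokens.
    """
    sims = np.einsum("qld,nmd->qnlm", queries, docs)  # token-level similarities
    return sims.max(axis=-1).sum(axis=-1)

def recall_at_k(scores, k):
    """Fraction of queries whose matching doc (same index) ranks in the top k."""
    ranks = (-scores).argsort(axis=1)
    return float(np.mean([i in ranks[i, :k] for i in range(len(scores))]))

rng = np.random.default_rng(0)
d = 128  # multi-vector embedding dimension, as in the paper
docs = rng.standard_normal((50, 32, d))
docs /= np.linalg.norm(docs, axis=-1, keepdims=True)

# Synthetic queries: noisy subsets of their matching documents' tokens.
queries = docs[:, :8, :] + 0.1 * rng.standard_normal((50, 8, d))
queries /= np.linalg.norm(queries, axis=-1, keepdims=True)

scores = maxsim_scores(queries, docs)
print(f"R@1 = {recall_at_k(scores, 1):.2f}, R@5 = {recall_at_k(scores, 5):.2f}")
```

The same scoring applies unchanged whether the document-side tokens come from image patches, audio frames, or video frames, which is what makes the zero-shot cross-modal transfer in Section 6 possible.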
Code and models. Model weights are available at https://huggingface.co/athrael-soju/HydraQwen3.5-4B (Hydra), https://huggingface.co/athrael-soju/HydraQwen2.5-Omni-3B (Hydra-Omni), https://huggingface.co/athrael-soju/ColQwen3.5-4B-controlled-baseline (controlled baseline), and https://huggingface.co/athrael-soju/DualHead-GritLM-Qwen3.5-4B (GritLM ablation). Training and evaluation code: https://github.com/athrael-soju/hydra.

A Per-Task Retrieval Results

Table 5: Per-task retrieval performance (nDCG@5) on ViDoRe V1, V2, and V3. Baseline: single-head ColQwen3.5 trained under the same regime as Hydra (same data, hyperparameters, 1 epoch).

ViDoRe V1
Task                 Hydra    Baseline  Δ
ArxivQA              0.8940   0.8862    +0.0078
DocVQA               0.6321   0.6220    +0.0101
ShiftProject         0.8586   0.8751    −0.0165
SynthDocQA-AI        0.9963   1.0000    −0.0037
SynthDocQA-Energy    0.9663   0.9652    +0.0011
SynthDocQA-Gov       0.9484   0.9558    −0.0074
SynthDocQA-Health.   0.9889   0.9926    −0.0037
Tabfquad             0.8740   0.9151    −0.0411
Tatdqa               0.7989   0.7912    +0.0077
V1 Average           0.8842   0.8892    −0.0050

ViDoRe V2
Task                 Hydra    Baseline  Δ
BioMedical Lectures  0.5778   0.6013    −0.0235
ESG Reports (HL)     0.6979   0.7024    −0.0045
ESG Reports          0.5934   0.5267    +0.0667
Economics Reports    0.4552   0.4657    −0.0105
V2 Average           0.5811   0.5740    +0.0071

ViDoRe V3
Task                 Hydra    Baseline  Δ
Computer Science     0.6964   0.6933    +0.0031
Energy               0.6723   0.6352    +0.0371
Finance (EN)         0.6181   0.5228    +0.0953
Finance (FR)         0.4949   0.4090    +0.0859
HR                   0.5286   0.5313    −0.0027
Industrial           0.5254   0.4363    +0.0891
Pharmaceuticals      0.6425   0.5934    +0.0491
Physics              0.4718   0.4530    +0.0188
V3 Average           0.5813   0.5343    +0.0469

B Per-Benchmark Generation Results

Table 6: Generation equivalence across four VQA benchmarks. Base: Qwen3.5-4B via model.generate(); Hydra: same weights with LoRA disabled, using the custom KV-cache path. All greedy decoding (T = 0). Exact Match% = fraction of byte-identical output strings.
Benchmark   n       Base ANLS  Hydra ANLS  Exact Match%
DocVQA      5,000   0.5465     0.5509      73.9%
ChartQA     2,500   0.0019     0.0015      28.2%
InfoVQA     2,801   0.1792     0.1804      36.8%
TextVQA     5,000   0.0576     0.0595      20.6%
Total       15,301  max |Δ| = 0.0044

C Omni-Modal Per-Task Results

Table 7: Hydra-Omni image retrieval on ViDoRe V1, V2, and V3. Model: vidore/colqwen-omni-v0.1 (Qwen2.5-Omni-3B backbone, 4.4B total parameters). No Hydra-specific training.

ViDoRe V1                       ViDoRe V2                      ViDoRe V3
Task                  nDCG@5    Task                 nDCG@5    Task              nDCG@5
SynthDocQA-AI         0.9852    ESG Reports (HL)     0.6050    Computer Science  0.6727
SynthDocQA-Healthcare 0.9663    BioMedical Lectures  0.5827    Energy            0.5574
SynthDocQA-Energy     0.9566    ESG Reports          0.4860    Pharmaceuticals   0.5446
SynthDocQA-Gov        0.9529    Economics Reports    0.4676    HR                0.4953
Tabfquad              0.8891                                   Finance (EN)      0.4636
ArxivQA               0.8677                                   Physics           0.4180
ShiftProject          0.8398                                   Industrial        0.4012
Tatdqa                0.8196                                   Finance (FR)      0.3731
DocVQA                0.6537
V1 Average            0.8812    V2 Average           0.5353    V3 Average        0.4907

References

Shuai Bai et al. Qwen2.5-VL technical report, 2025.

Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4291–4301, 2019.

Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, and Tong Sun. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding. In International Conference on Learning Representations, 2025.

Zhe Chen et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

ColPali Team. ColQwen2: Visual document retrieval with ColQwen2, 2024. HuggingFace model card: vidore/colqwen2-v1.0.
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: Learning audio concepts from natural language supervision. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. In International Conference on Learning Representations, 2025.

Athos Georgiou. ColQwen3.5: Visual document retrieval with Qwen3.5, 2026. HuggingFace model: athrael-soju/colqwen3.5-4.5B-v3.

Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, and David Cox. Activated LoRA: Fine-tuned LLMs for intrinsics, 2025.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020.

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019.

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training. In International Conference on Learning Representations, 2023.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023.
António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, and Gautier Viaud. ViDoRe v3: A comprehensive evaluation of retrieval augmented generation in complex real-world scenarios, 2026.

Quentin Macé, António Loison, and Manuel Faysse. ViDoRe benchmark v2: Raising the bar for visual retrieval, 2025.

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, 2022.

Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2200–2209, 2021.

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1697–1706, 2022.

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.

Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. In International Conference on Learning Representations, 2025.

Simona-Vasilica Oprea and Adela Bâra. Transforming product discovery and interpretation using vision–language models. Journal of Theoretical and Applied Electronic Commerce Research, 20(3):191, 2025. doi: 10.3390/jtaer20030191.

Ofir Press and Lior Wolf. Using the output embedding to improve language models. In EACL, 2017.

Qwen Team. Qwen3.5-4B, 2026.
HuggingFace model: Qwen/Qwen3.5-4B.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), 2023.

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 3715–3734, 2022.

Donald J Schuirmann. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657–680, 1987.

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. S-LoRA: Serving thousands of concurrent LoRA adapters. In Proceedings of Machine Learning and Systems (MLSys), 2024.

Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, and Lianwen Jin. URaG: Unified retrieval and generation in multimodal LLMs for efficient long document understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026. Oral presentation.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8317–8326, 2019.

Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. VDocRAG: Retrieval-augmented generation over visually-rich documents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

Jin Xu, Zhifang Jiang, An Yang, et al.
Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025a.

Jingwei Xu, Junyu Lai, and Yunpeng Huang. MeteoRA: Multiple-tasks embedded LoRA for large language models. In International Conference on Learning Representations, 2025b.

Shi Yu et al. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents, 2024.

Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, and Ningyu Zhang. OneGen: Efficient one-pass unified generation and retrieval for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.