Bagging-Based Model Merging for Robust General Text Embeddings
General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
💡 Research Summary
The paper investigates two fundamental challenges in building general‑purpose text embedding models: (1) how to train on a large, heterogeneous multi‑task corpus most effectively, and (2) how to incorporate new domains or data types without costly full retraining. The authors first conduct a systematic comparison of four data‑scheduling strategies—batch‑level shuffling, dataset‑level sequential training, task‑level sequential training, and a two‑stage pre‑training/fine‑tuning pipeline. Across a suite of in‑domain and out‑of‑domain (OOD) benchmarks (MTEB English, RTEB‑beta, MTEB Code, etc.), simple batch‑level shuffling consistently yields the highest average scores, suggesting that task interference is limited and that the various corpora are largely complementary. However, this approach suffers from two practical drawbacks: (i) OOD performance can be sub‑optimal, sometimes lagging behind models trained on much smaller subsets, and (ii) incremental updates require retraining on the entire expanded corpus, which is computationally expensive.
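The batch‑level shuffling strategy can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: it assumes each batch is drawn from a single dataset (a common choice for contrastive embedding training, where in‑batch negatives should come from the same task) and that the batch order is then shuffled globally across all tasks.

```python
import random

def batch_level_shuffle(datasets, batch_size, seed=0):
    """Chunk each dataset into homogeneous batches, then shuffle the
    batch order across all datasets to interleave tasks during training.

    datasets: dict mapping task name -> list of training examples.
    Returns a list of (task_name, batch) pairs in randomized order.
    """
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        ex = examples[:]           # copy so the caller's list is untouched
        rng.shuffle(ex)            # shuffle examples within each dataset
        for i in range(0, len(ex), batch_size):
            batches.append((name, ex[i:i + batch_size]))
    rng.shuffle(batches)           # mix batches from all tasks into one schedule
    return batches
```

Because each step still sees a single‑task batch, task‑specific loss functions and in‑batch negative sampling remain well defined, while the globally shuffled schedule exposes the model to all tasks throughout training.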
To address these issues, the authors propose BOOM (Bagging‑based rObust mOdel Merging). BOOM applies the classic bagging idea to text embeddings: the full multi‑task dataset is sampled with replacement to create K bootstrap subsets (e.g., K = 5). Each subset is used to train an independent embedding model using the same batch‑level shuffling pipeline. After training, the K models are merged into a single model using weight‑space merging techniques provided by the open‑source MergeKit library. The paper evaluates several merging algorithms, including spherical linear interpolation (SLERP), Multi‑SLERP, the Karcher mean on the unit hypersphere, Task Arithmetic, and TIES (which trims low‑magnitude parameter updates and resolves sign conflicts among task vectors before merging). Empirically, the Karcher mean yields the most stable and performant merged model.
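The two core steps—bootstrap sampling and Karcher‑mean merging—can be sketched as below. This is a simplified illustration on flattened NumPy weight vectors, not MergeKit's actual implementation: real merging operates per parameter tensor, and the Karcher mean here is the standard Riemannian fixed‑point iteration on the unit hypersphere (iterate log map, average in the tangent space, exp map back).

```python
import numpy as np

def bootstrap_subsets(n_examples, k, rng=None):
    """Draw k bootstrap subsets (index arrays sampled with replacement)."""
    rng = np.random.default_rng(rng)
    return [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

def karcher_mean_sphere(points, iters=50, tol=1e-10):
    """Riemannian (Karcher) mean of vectors projected onto the unit sphere."""
    x = np.stack([p / np.linalg.norm(p) for p in points])
    mu = x.mean(axis=0)
    mu /= np.linalg.norm(mu)                     # warm start: extrinsic mean
    for _ in range(iters):
        dots = np.clip(x @ mu, -1.0, 1.0)
        theta = np.arccos(dots)                  # geodesic distance to mu
        # log map: lift each point to the tangent space at mu
        # (theta/sin(theta) -> 1 as theta -> 0, so guard the division)
        safe = np.where(theta > 1e-12, theta, 1.0)
        scale = np.where(theta > 1e-12, theta / np.sin(safe), 1.0)
        tangents = scale[:, None] * (x - dots[:, None] * mu)
        v = tangents.mean(axis=0)                # mean tangent vector
        norm_v = np.linalg.norm(v)
        if norm_v < tol:                         # converged at the Karcher mean
            break
        # exp map: walk along the geodesic and renormalize
        mu = np.cos(norm_v) * mu + np.sin(norm_v) * (v / norm_v)
        mu /= np.linalg.norm(mu)
    return mu
```

In the full method, each bootstrap subset would train its own embedding model, and the resulting K weight tensors would be merged with a routine like `karcher_mean_sphere` (rescaled back to the original norm) to produce the single deployed model.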
BOOM offers two key benefits. First, the ensemble‑like diversity introduced by bootstrap sampling improves robustness: the merged model outperforms the baseline full‑corpus batch‑shuffled model by 1.8 % absolute on in‑domain tasks and 2.3 % on OOD tasks, with especially large gains (4‑6 % absolute) on the OOD benchmarks. Second, BOOM naturally supports efficient incremental learning. When new data arrives, a lightweight “update” model is trained on the new data plus a small sampled portion of historical data. This update model is then merged with the existing BOOM model, avoiding a full retrain. In incremental experiments, BOOM reduces training cost by roughly 45 % while maintaining or improving performance relative to full retraining.
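The incremental update described above amounts to weight‑space interpolation between the deployed model and a lightweight update model. A minimal SLERP sketch on flattened parameter vectors is shown below; the interpolation weight `t` is an illustrative choice, and a production merge would apply this per parameter tensor (e.g., via MergeKit) rather than to one flat vector.

```python
import numpy as np

def slerp(w_old, w_new, t=0.5):
    """Spherical linear interpolation between two flattened weight vectors.

    t = 0 returns w_old (the existing model), t = 1 returns w_new
    (the update model); intermediate t follows the great-circle arc.
    """
    a = np.asarray(w_old, dtype=float)
    b = np.asarray(w_new, dtype=float)
    cos_omega = np.clip(
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0
    )
    omega = np.arccos(cos_omega)                 # angle between the two models
    if omega < 1e-8:                             # nearly parallel: plain lerp
        return (1.0 - t) * a + t * b
    s = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b
```

Because the update model is trained only on the new data plus a small historical sample, this merge step is what lets BOOM sidestep full retraining while keeping a single set of weights at inference time.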
The authors also analyze the method’s limitations. Merging in weight space can introduce minor performance loss due to non‑linear interactions among parameters; the choice of bootstrap size K and the merging algorithm is critical and may need task‑specific tuning; and applying BOOM to very large LLM‑based embeddings still incurs substantial memory and compute demands. Nonetheless, BOOM demonstrates a compelling trade‑off: it retains the inference efficiency of a single model while capturing the variance‑reduction benefits of an ensemble, and it enables practical, low‑cost updates for continuously evolving retrieval or IR systems.
In conclusion, BOOM shifts the focus from “more data” to “more diverse data representations” and shows that bagging combined with sophisticated model merging can simultaneously boost OOD generalization and enable cost‑effective incremental learning for general text embeddings. Future work may explore more advanced merging strategies (e.g., learned meta‑mergers), parameter compression during merging, and scaling BOOM to multi‑billion‑parameter LLMs for real‑time production environments.