StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.


💡 Research Summary

StructAlign tackles the emerging problem of Continual Text‑to‑Video Retrieval (CTVR), where a model must learn new semantic categories over a sequence of tasks while preserving fine‑grained alignment between textual queries and video clips for all previously seen categories. The authors identify two distinct sources of catastrophic forgetting in this multimodal continual learning setting: (1) intra‑modal feature drift, caused by continual updates within each modality’s encoder, and (2) non‑cooperative feature drift across modalities, where the text and video encoders drift in different directions because they are trained independently. To address both, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior. An ETF arranges C category prototypes in a high‑dimensional space such that all prototypes have equal norm and equal pairwise angles, yielding maximally separated class clusters while still allowing intra‑class variability.
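The simplex ETF described above has a standard closed-form construction from the neural-collapse literature. The sketch below (function name and NumPy implementation are ours, not from the paper) builds C unit-norm prototypes whose pairwise cosine similarity is exactly -1/(C-1), the maximally separated configuration:

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return C unit-norm prototypes in R^dim forming a simplex ETF.

    This construction uses C orthonormal columns, so it requires dim >= C
    (the ETF itself spans only C - 1 dimensions).
    """
    assert dim >= num_classes, "embedding dim must be at least the number of classes"
    C = num_classes
    rng = np.random.default_rng(seed)
    # Random orthonormal basis U of shape (dim, C).
    U, _ = np.linalg.qr(rng.standard_normal((dim, C)))
    # M = sqrt(C/(C-1)) * U (I - 11^T / C): columns have unit norm and
    # pairwise inner product -1/(C-1) (equal angles between all prototypes).
    M = np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)
    return M  # column c is the prototype for category c
```

Because the pairwise angle is fixed by C alone, the prototype geometry is task-independent, which is what makes it usable as a shared target across a continual task sequence.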

The method builds on frozen CLIP text and video encoders and adds lightweight adaptation modules: a Mixture‑of‑Experts (MoE) layer is inserted into each self‑attention block of the text encoder, and Low‑Rank Adaptation (LoRA) modules are added to the query and value projections of the video encoder. These modules enable efficient parameter updates for new tasks without overwriting the bulk of the pretrained knowledge.
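To make the adapter idea concrete, here is a minimal NumPy sketch of the LoRA side (the MoE text-encoder adapters are omitted; class name and initialization details are our assumptions, not the paper's code). A frozen projection weight W is augmented with a trainable low-rank update scaled by alpha/r:

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, W: np.ndarray, r: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = W  # frozen pretrained weight, shape (out_features, in_features)
        # A gets a small random init; B starts at zero so the wrapped layer
        # initially computes exactly the same function as the frozen base.
        self.A = 0.01 * rng.standard_normal((r, W.shape[1]))   # trainable
        self.B = np.zeros((W.shape[0], r))                     # trainable
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Equivalent to using the effective weight W + scale * B @ A.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Only A and B (r * (in + out) parameters per wrapped projection) are updated per task, which is how the bulk of the CLIP weights stays untouched.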

Two complementary loss functions enforce the ETF prior and stabilize learning. The Cross‑modal ETF Alignment loss (L_ETF) pulls the text and video embeddings of each sample toward the ETF prototype corresponding to its category, while simultaneously regularizing the set of prototypes to stay close to an ideal simplex ETF configuration. This directly mitigates non‑cooperative drift by forcing both modalities into the same geometric scaffold. The Cross‑modal Relation Preserving loss (L_CRP) preserves the cross‑modal similarity relations established in earlier tasks. It does so by storing "pseudo‑features" (representations of previous tasks) and encouraging the current embeddings to maintain the same pairwise similarity matrix with respect to the opposite modality. In effect, the relation matrix acts as a stable supervisory signal that constrains intra‑modal drift.
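Since the summary does not give the exact formulations, the following is a plausible sketch of both losses under common choices (cross-entropy against fixed prototypes for L_ETF, and a mean-squared penalty on the cross-modal similarity matrix for L_CRP); all function names, the temperature tau, and these specific forms are our assumptions:

```python
import numpy as np

def etf_alignment_loss(text_f, video_f, protos, labels, tau=0.05):
    """Cross-entropy of each modality's features against fixed ETF prototypes.

    text_f, video_f: (N, d) L2-normalized features; protos: (d, C) unit
    columns; labels: (N,) integer category ids. Both modalities are pulled
    toward the same prototype, enforcing a shared geometric scaffold.
    """
    def ce(feats):
        logits = feats @ protos / tau                 # (N, C) scaled cosine sims
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()
    return ce(text_f) + ce(video_f)

def relation_preserving_loss(text_f, video_f, text_old, video_old):
    """Penalize changes in the cross-modal similarity matrix relative to the
    one produced by stored pseudo-features from earlier tasks."""
    S_new = text_f @ video_f.T
    S_old = text_old @ video_old.T
    return ((S_new - S_old) ** 2).mean()
```

The key structural point survives any variation in the exact form: L_ETF supervises each sample against a fixed target geometry, while L_CRP supervises the relations between samples, so the two constraints are complementary rather than redundant.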

The overall training objective is a weighted sum of L_ETF, L_CRP, and a regularization term that keeps the prototypes near the reference ETF. During inference, a frame‑word similarity is computed by aggregating the maximum cosine similarity between each word and any video frame (and vice versa), ensuring bidirectional consistency.
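The bidirectional max-aggregation at inference time can be sketched as follows (function name is ours; we assume L2-normalized token and frame embeddings so that dot products are cosine similarities):

```python
import numpy as np

def frame_word_similarity(word_f: np.ndarray, frame_f: np.ndarray) -> float:
    """Bidirectional max-aggregated text-video similarity.

    word_f: (W, d) word embeddings; frame_f: (F, d) frame embeddings,
    both L2-normalized.
    """
    S = word_f @ frame_f.T             # (W, F) pairwise cosine similarities
    t2v = S.max(axis=1).mean()         # each word matched to its best frame
    v2t = S.max(axis=0).mean()         # each frame matched to its best word
    return 0.5 * (t2v + v2t)           # symmetric score for ranking videos
```

Averaging the two directions makes the score symmetric: a video cannot rank highly by covering only a few query words, nor can a query match by explaining only a few frames.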

Extensive experiments on three CTVR benchmarks (MSR‑VTT, ActivityNet, YouCook2) demonstrate that StructAlign consistently outperforms state‑of‑the‑art continual retrieval baselines. Gains of 3–7 percentage points are reported on mean average precision (mAP) and recall@K, with particularly large improvements (≈15%) in metrics that measure cross‑modal alignment stability. Ablation studies confirm that both the ETF alignment loss and the relation‑preserving loss are necessary; removing either component degrades performance. Analyses of the learned prototype geometry show that the embeddings converge close to the ideal simplex ETF, validating the geometric prior.

The paper also discusses limitations. The ETF prior requires the embedding dimension to be at least the number of categories, and the initialization of prototypes influences convergence speed. Moreover, all experiments rely on CLIP as the backbone; generalization to other vision‑language models remains to be tested.

In summary, StructAlign offers a principled solution to catastrophic forgetting in multimodal continual learning by (i) imposing a well‑separated, category‑level ETF structure that aligns text and video embeddings, and (ii) preserving cross‑modal relational knowledge across tasks. The combination of geometric regularization and relational distillation, together with parameter‑efficient adapters, yields a robust and scalable framework for continual text‑to‑video retrieval. Future work may explore dynamic prototype scaling, broader backbone compatibility, and real‑time streaming scenarios.

