Bridging Explainability and Embeddings: BEE Aware of Spuriousness
Current methods for detecting spurious correlations rely on analyzing dataset statistics or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space, and to the embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2). BEE consistently exposes spurious correlations: from concepts that slash the ImageNet accuracy by up to 95%, to clinical shortcuts in MIMIC-CXR notes that induce dangerous false negatives. Together, these results position BEE as a general and principled tool for diagnosing spurious correlations in weight space, enabling principled dataset auditing and more trustworthy foundation models. The source code is publicly available at https://github.com/bit-ml/bee.
💡 Research Summary
The paper “Bridging Explainability and Embeddings: BEE Aware of Spuriousness” introduces BEE (Bridging Explainability and Embeddings), a novel framework designed to diagnose spurious correlations (SCs) learned by machine learning models, particularly during the fine-tuning of foundation models. The core problem addressed is that models often rely on deceptive, non-causal shortcuts in data (e.g., associating “firefighter” with “fire truck” regardless of context), which can lead to biased and unreliable decisions in critical applications. Traditional methods for detecting SCs are limited: data-centric approaches analyze dataset statistics but cannot confirm what the model actually learned, while error-analysis methods require counterexamples in validation sets, which are often unavailable.
BEE’s fundamental innovation is to shift the diagnostic lens from model outputs or input data to the weight space of the classifier itself. The method leverages the aligned multimodal embedding spaces of models like CLIP. It starts by initializing the weights of a linear classification layer with the text embeddings of the class names (e.g., the embedding for “fire truck” from CLIP’s text encoder). As this linear probe is trained on a target dataset, these weight vectors drift from their initial semantic positions. BEE hypothesizes and demonstrates that this drift is driven not only by genuine class features but also by the embeddings of spurious attributes prevalent in the training data.
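The drift mechanism can be illustrated with a small numpy simulation. This is a toy sketch, not the authors’ code: random unit vectors stand in for CLIP text embeddings, a synthetic “spurious” direction is mixed into one class’s training embeddings, and a linear probe initialized at the class embeddings is trained with plain softmax-regression gradient descent. The probe’s weight row for the contaminated class then measurably drifts toward the spurious direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for CLIP text embeddings of the class names (random unit vectors
# here; in BEE these would come from CLIP's text encoder).
d, n_classes = 64, 3
class_text_emb = unit(rng.normal(size=(n_classes, d)))

# Initialize each row of the linear probe with its class text embedding.
W = class_text_emb.copy()

# Toy training set: image embeddings near their class embedding; class 0 also
# carries a spurious direction (e.g. "firefighter" co-occurring with "fire truck").
spurious = unit(rng.normal(size=d))
X, y = [], []
for c in range(n_classes):
    for _ in range(200):
        e = class_text_emb[c] + 0.1 * rng.normal(size=d)
        if c == 0:
            e = e + 0.8 * spurious
        X.append(unit(e))
        y.append(c)
X, y = np.array(X), np.array(y)

# A few steps of softmax-regression gradient descent on the probe.
for _ in range(300):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0          # dL/dlogits for cross-entropy
    W -= 0.1 * (p.T @ X) / len(y)

# Drift: how far each trained weight row moved from its semantic starting point.
drift = 1.0 - np.sum(unit(W) * class_text_emb, axis=1)

# The class-0 weight ends up aligned with the spurious direction.
align_spurious = unit(W) @ spurious
```

The point of the sketch is that nothing about the labels mentions the spurious direction, yet the class-0 weight absorbs it simply because it co-occurs with class 0 in the training data; this is the drift signal BEE reads out.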
The BEE pipeline consists of three main stages. First, Concept Extraction: using captioning and keyword-extraction models, BEE compiles a large set of textual “concepts” (n-grams like “firefighter,” “ocean”) present in the dataset. Second, Concept Filtering: it then filters out concepts that are semantically related to any class definition (e.g., “beak” for a bird class) using LLMs and WordNet, leaving a pool of “class-neutral concepts” that are potential SCs. Third, Spurious Correlation Identification: for each class, BEE computes a “positive-SC score” for every class-neutral concept. The score measures how much more similar a concept’s embedding is to the trained weight vector of that class than to the weight vector of the least similar other class. A high score indicates the concept is uniquely aligned with that class, flagging it as a potent spurious shortcut. A dynamic thresholding algorithm then automatically selects the top-ranking SCs for each class.
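The third stage admits a compact sketch. The function below is one plausible reading of the score as described in this summary, not the authors’ implementation: for concept c and class k, score(c, k) = cos(e_c, w_k) − min over j ≠ k of cos(e_c, w_j), computed over cosine similarities between concept embeddings and the trained probe weight rows.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def positive_sc_scores(concept_emb, W):
    """Illustrative positive-SC score (one reading of the summary, not the
    authors' code): score(c, k) = cos(e_c, w_k) - min_{j != k} cos(e_c, w_j).

    concept_emb: (n_concepts, d) embeddings of class-neutral concepts.
    W:           (n_classes,  d) trained probe weight rows.
    Returns an (n_concepts, n_classes) score matrix.
    """
    sims = unit(concept_emb) @ unit(W).T        # cosine similarities
    scores = np.empty_like(sims)
    for k in range(sims.shape[1]):
        others = np.delete(sims, k, axis=1)     # similarities to the other classes
        scores[:, k] = sims[:, k] - others.min(axis=1)
    return scores

# Toy check: one concept deliberately aligned with class 2, one random concept.
rng = np.random.default_rng(1)
W = unit(rng.normal(size=(4, 32)))              # 4 trained class weight rows
concepts = unit(np.vstack([W[2] + 0.1 * rng.normal(size=32),
                           rng.normal(size=32)]))
scores = positive_sc_scores(concepts, W)
```

In the toy check, the concept built near the class-2 weight scores highest for class 2, which is exactly the “uniquely aligned” pattern the thresholding step would then flag.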
The authors validate BEE extensively across multiple domains and model architectures. In vision, it successfully identifies known SCs in controlled benchmarks like Waterbirds (background bias) and CelebA (gender bias). More impressively, applied to ImageNet-1k, BEE uncovers previously unknown but highly impactful SCs. For instance, it identified concepts like “firefighter” as spuriously correlated with the “fire truck” class. When these concepts were artificially added to test images (e.g., superimposing a firefighter on a peacock), the fine-tuned model’s accuracy for the true class plummeted by up to 95%, proving the real-world effect of these learned biases. In language tasks, BEE exposed dangerous shortcuts in toxic comment classification (CivilComments) and, critically, in medical note analysis (MIMIC-CXR), where certain phrases could lead to false-negative predictions.
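The intervention behind the ImageNet result can also be mimicked in embedding space. The sketch below is a hypothetical analogue, not the paper’s pixel-level experiment: instead of pasting a firefighter into a peacock photo, it adds a spurious concept direction to clean test embeddings and measures how accuracy for the true class collapses when one probe weight has absorbed that direction. All directions are random stand-ins; the spurious direction is kept orthogonal to the true class so the drop is purely a shortcut effect.

```python
import numpy as np

rng = np.random.default_rng(2)

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

d = 48
true_cls = unit(rng.normal(size=d))     # direction of the true class ("peacock")
other_cls = unit(rng.normal(size=d))    # direction of the competing class
spur = rng.normal(size=d)
spur = unit(spur - (spur @ true_cls) * true_cls)  # spurious concept, orthogonal
                                                  # to the true class

# Hypothetical trained probe: the "fire truck" weight (row 1) has absorbed the
# spurious "firefighter" direction; the "peacock" weight (row 0) has not.
W = np.vstack([true_cls,
               unit(0.6 * other_cls + 0.8 * spur)])

# Clean test embeddings of the true class are classified correctly.
X_clean = unit(true_cls + 0.1 * rng.normal(size=(100, d)))
acc_clean = np.mean((X_clean @ W.T).argmax(axis=1) == 0)

# Mixing in the spurious concept (embedding-space analogue of superimposing a
# firefighter on a peacock image) flips predictions toward "fire truck".
X_spur = unit(X_clean + 1.5 * spur)
acc_spur = np.mean((X_spur @ W.T).argmax(axis=1) == 0)
```

The same logic underlies the paper’s stress test: if accuracy on the true class drops sharply once the flagged concept is injected, the concept is a functioning shortcut rather than a statistical coincidence.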
A key finding is the persistence and transferability of the SCs discovered by BEE. The spurious concepts identified using a simple linear probe on frozen embeddings were shown to also govern the behavior of fully fine-tuned models. Furthermore, these SCs transferred across different state-of-the-art foundation models (CLIP, BLIP2, SigLIP2, mGTE), suggesting that BEE is capturing fundamental biases in the datasets rather than artifacts of a specific model architecture.
In conclusion, BEE provides a principled, weight-space framework for auditing models and datasets for spurious correlations without the need for labeled counterexamples. By making the hidden drivers of model decisions explicit, it enables more transparent diagnostics and paves the way for developing more robust and trustworthy foundation models. The code is publicly released to facilitate further research and application.