DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration
We propose DIVERSE, a framework for systematically exploring the Rashomon set of deep neural networks, the collection of models that match a reference model’s accuracy while differing in their predictive behavior. DIVERSE augments a pretrained model with Feature-wise Linear Modulation (FiLM) layers and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, generating diverse model variants without retraining or gradient access. Across MNIST, PneumoniaMNIST, and CIFAR-10, DIVERSE uncovers multiple high-performing yet functionally distinct models. Our experiments show that DIVERSE explores the Rashomon set efficiently and competitively, making it feasible to construct model sets that preserve accuracy and robustness while exhibiting substantial predictive multiplicity. While retraining remains the baseline for generating Rashomon sets, DIVERSE achieves comparable diversity at reduced computational cost.
💡 Research Summary
The paper introduces DIVERSE, a novel framework for efficiently exploring the Rashomon set of deep neural networks—i.e., the collection of models that achieve comparable accuracy to a reference model while exhibiting diverse predictive behavior. Traditional approaches to Rashomon set approximation rely on costly retraining, adversarial weight perturbations (AWP), or dropout‑based subnet sampling. These methods either demand extensive compute, scale poorly with model size, or provide limited control over diversity.
DIVERSE circumvents these limitations by augmenting a pretrained network with frozen Feature‑wise Linear Modulation (FiLM) layers. A single low‑dimensional latent vector z is mapped through fixed projection matrices to per‑channel scaling (γ) and shifting (β) parameters, which are applied to the pre‑activations of every FiLM‑enabled layer. The mapping uses bounded nonlinearities (γ = 1 + tanh(zWγ), β = tanh(zWβ)), ensuring that the unmodulated case (z = 0) reproduces the original model while any non‑zero z induces coordinated, bounded changes across the entire network.
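The FiLM mapping above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the layer width, latent dimensionality, and initialization scale are illustrative assumptions, and a real network would apply this per FiLM-enabled layer with a shared z.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8            # latent dimensionality of z (illustrative)
channels = 16    # per-channel width of one FiLM-enabled layer (illustrative)

# Fixed (frozen) projection matrices, drawn once and never trained.
W_gamma = rng.normal(scale=0.1, size=(d, channels))
W_beta = rng.normal(scale=0.1, size=(d, channels))

def film_params(z):
    """Map latent z to bounded per-channel scale/shift, as described above."""
    gamma = 1.0 + np.tanh(z @ W_gamma)   # gamma = 1 + tanh(z W_gamma), in (0, 2)
    beta = np.tanh(z @ W_beta)           # beta  = tanh(z W_beta),     in (-1, 1)
    return gamma, beta

def film_modulate(pre_activations, z):
    """Apply FiLM to pre-activations of shape (batch, channels)."""
    gamma, beta = film_params(z)
    return gamma * pre_activations + beta

# z = 0 reproduces the original model exactly: gamma = 1, beta = 0.
x = rng.normal(size=(4, channels))
assert np.allclose(film_modulate(x, np.zeros(d)), x)
```

The bounded tanh nonlinearities are what guarantee that any sampled z stays a "perturbation" of the reference model rather than an arbitrary reparameterization: scales remain in (0, 2) and shifts in (-1, 1) no matter how large z grows.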
The search over z is performed with Covariance Matrix Adaptation Evolution Strategy (CMA‑ES), a gradient‑free optimizer that adapts a multivariate Gaussian distribution (mean, covariance, step‑size) based on the ranking of sampled candidates. Because each component of z influences many FiLM layers simultaneously, the search landscape is highly non‑separable; CMA‑ES’s full‑covariance adaptation is well‑suited to capture such inter‑dimensional correlations. To keep the optimization tractable, the latent dimensionality d is kept modest (e.g., 50–200), balancing expressive power against the O(d²) cost of covariance updates.
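The ask-evaluate-rank-update loop of CMA-ES can be sketched as below. This is a deliberately simplified, hand-rolled version (weighted recombination plus a rank-mu covariance update, with a fixed step-size decay instead of proper step-size adaptation, and no evolution paths); production implementations such as pycma do considerably more. The quadratic objective is a stand-in: in DIVERSE the fitness would combine the Rashomon accuracy constraint with a disagreement reward.

```python
import numpy as np

rng = np.random.default_rng(1)
d, pop, elite = 6, 20, 5     # latent dim, population, parents (illustrative)

def fitness(z):
    # Placeholder objective; DIVERSE would evaluate the FiLM-modulated model.
    return np.sum((z - 0.5) ** 2)

mean, sigma, C = np.zeros(d), 0.5, np.eye(d)
weights = np.log(elite + 0.5) - np.log(np.arange(1, elite + 1))
weights /= weights.sum()     # standard log-rank recombination weights

for gen in range(60):
    A = np.linalg.cholesky(C)
    steps = rng.normal(size=(pop, d)) @ A.T        # y ~ N(0, C)
    cands = mean + sigma * steps                   # ask
    order = np.argsort([fitness(c) for c in cands])
    sel = steps[order[:elite]]                     # rank, keep elite steps
    mean = mean + sigma * (weights @ sel)          # weighted recombination
    c_mu = 0.3                                     # illustrative learning rate
    C = (1 - c_mu) * C + c_mu * sum(               # rank-mu covariance update
        w * np.outer(y, y) for w, y in zip(weights, sel))
    sigma *= 0.95                                  # crude step-size decay

print(mean)  # approaches the optimum at 0.5 in every coordinate
```

The full-covariance update is the piece that matters for DIVERSE: because every component of z feeds every FiLM layer, good candidates lie along correlated directions that an isotropic or diagonal sampler would find slowly, and the O(d²) cost of maintaining C is the price paid for capturing them.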
Diversity is quantified at two levels. Hard disagreement measures the proportion of inputs on which two classifiers output different labels. Soft disagreement uses Total Variation Distance (TVD) between the output probability vectors, providing a numerically stable alternative to KL or JS divergences when predictions are near‑deterministic. Set‑level metrics—including Ambiguity, Discrepancy, Variable Prediction Range (VPR), and Rashomon Capacity (RC)—characterize the overall multiplicity of the discovered model collection.
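The two pairwise metrics are straightforward to state in code; a minimal numpy sketch (function names are ours, not the paper's):

```python
import numpy as np

# probs_a, probs_b: (n_samples, n_classes) softmax outputs of two models.

def hard_disagreement(probs_a, probs_b):
    """Fraction of inputs on which the two models predict different labels."""
    return np.mean(probs_a.argmax(1) != probs_b.argmax(1))

def soft_disagreement(probs_a, probs_b):
    """Mean Total Variation Distance between output distributions.
    TVD(p, q) = 0.5 * sum_k |p_k - q_k|, bounded in [0, 1] and finite even
    for near-deterministic predictions, where KL can blow up."""
    return 0.5 * np.mean(np.abs(probs_a - probs_b).sum(axis=1))

# Identical models disagree on nothing, by either measure.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
assert hard_disagreement(p, p) == 0.0 and soft_disagreement(p, p) == 0.0

q = np.array([[0.4, 0.6], [0.2, 0.8]])
print(hard_disagreement(p, q), soft_disagreement(p, q))  # → 0.5 0.25
```

Note that soft disagreement can be nonzero even when hard disagreement is zero: two models may rank the classes identically while assigning quite different confidences, which is exactly the behavior the set-level metrics (Ambiguity, Discrepancy, VPR, RC) aggregate over the whole collection.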
Experiments were conducted on three benchmarks: MNIST (handwritten digits), PneumoniaMNIST (chest X‑rays), and CIFAR‑10 (natural images). A ResNet‑18 backbone served as the reference model; FiLM layers were inserted after dense layers, after convolutional blocks, and on residual skip connections, all sharing the same latent vector z. CMA‑ES was run for 500 generations with a population of 30, and the Rashomon tolerance ε was set to 0.5–1 % of the reference error.
Key findings:
- Accuracy Preservation – All generated variants stayed within the ε‑tolerance, typically losing ≤0.2 % accuracy relative to the reference.
- Substantial Diversity – Mean hard disagreement ranged from 12 % (MNIST) to 18 % (CIFAR‑10), markedly higher than the 5–7 % observed for ensembles built via retraining. Mean TVD values were 0.13–0.21, indicating pronounced probability‑distribution shifts.
- Computational Efficiency – DIVERSE required only minutes of wall‑clock time (≈10–15 min) per dataset, compared to hours or days for full retraining pipelines, yielding a >10× speed‑up.
- Comparison to Baselines – AWP struggled with convergence in high‑dimensional weight space and often breached the accuracy budget, while dropout sampling produced limited diversity and offered no explicit control over the trade‑off between accuracy and disagreement.
The authors acknowledge two primary limitations. First, a single shared z imposes a global modulation pattern; more granular, layer‑specific latent codes could enable finer‑grained functional changes. Second, scaling CMA‑ES to very high latent dimensions becomes prohibitive due to the quadratic cost of covariance updates; hybrid strategies (e.g., low‑rank covariance approximations or Bayesian optimization) may alleviate this bottleneck.
Overall, DIVERSE demonstrates that a modest, gradient‑free search in a carefully constructed FiLM‑modulated latent space can efficiently approximate the Rashomon set of deep networks. By preserving the pretrained weights and only adjusting a compact set of modulation parameters, the method delivers high‑accuracy, highly diverse model ensembles with minimal computational overhead. This opens up practical avenues for leveraging model multiplicity in uncertainty quantification, fairness auditing, and interpretability studies, and suggests promising directions for extending the approach to larger architectures and more complex tasks.