Latent Domain Modeling Improves Robustness to Geographic Shifts
Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at inference time. Using standard empirical risk minimization (ERM) in this setting can lead to uneven generalization across different spatially-determined groups of interest such as continents or biomes. The most common approaches to tackling geographic distribution shift apply domain adaptation methods using discrete group labels, ignoring geographic coordinates that are often available as metadata. On the other hand, modeling methods that integrate geographic coordinates have been shown to improve overall performance, but their impact on geographic domain generalization has not been studied. In this work, we propose a general modeling framework for improving robustness to geographic distribution shift. The key idea is to model continuous, latent domain assignment using location encoders and to condition the main task predictor on the jointly-trained latents. On four diverse geo-tagged image datasets with different group splits, we show that instances of our framework achieve significant improvements in worst-group performance compared to existing domain adaptation and location-aware modeling methods. In particular, we achieve new state-of-the-art results on two datasets from the WILDS benchmark.
💡 Research Summary
The paper tackles the problem of geographic distribution shift, a specific form of subpopulation shift where training data are collected from a mixture of locations on Earth and the test distribution contains a different mixture of the same locations. Standard empirical risk minimization (ERM) often fails under this shift because it treats all samples as i.i.d., leading to models that perform well on over‑represented regions (e.g., certain continents) but poorly on under‑represented ones. Existing remedies—domain adaptation with discrete domain labels or distributionally robust optimization (DRO)—ignore the continuous geographic metadata (latitude and longitude) that is usually available, and therefore cannot capture intra‑domain variability or inter‑domain similarity.
The authors propose a general framework called Latent Domain Modeling. It has three components: (1) a location encoder ℓ(ϕ,λ) that maps latitude‑longitude pairs into a high‑dimensional latent space; (2) an auxiliary domain predictor h̃ that takes ℓ as input and is trained with a cross‑entropy loss L_DP to predict the discrete domain label d (e.g., continent, biome), where L_DP is weighted by a hyperparameter α and used only during training, so h̃ is discarded at inference time; and (3) the main task predictor f(x,ϕ,λ)=Φ(g(x),ℓ(ϕ,λ)), which fuses image features g(x) with location latents via a fusion module Φ. Four fusion strategies are explored: simple concatenation, FiLM (feature‑wise linear modulation), Geo Priors (multiplicative Bayesian prior), and a modified D³G that learns location‑based domain relations β_j(ℓ).
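The training objective and one of the fusion options described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the total loss is the task loss plus α times the domain cross‑entropy, and that FiLM applies a per‑feature scale and shift to the image features. The function names, and the idea that the scale and shift vectors are produced from the location latents by some small network, are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one example given logits and an integer label."""
    return -math.log(softmax(logits)[label])

def film_fuse(image_feat, gamma, beta):
    """FiLM-style fusion: scale and shift each image feature.

    gamma and beta are assumed to be computed from the location
    latents elsewhere; here they are passed in directly.
    """
    return [g * f + b for f, g, b in zip(image_feat, gamma, beta)]

def total_loss(task_loss, domain_logits, domain_label, alpha):
    """Assumed combined objective: L_TP + alpha * L_DP."""
    return task_loss + alpha * cross_entropy(domain_logits, domain_label)

# Hypothetical values for a single training example.
fused = film_fuse([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
loss = total_loss(task_loss=0.7, domain_logits=[0.0, 0.0],
                  domain_label=1, alpha=0.5)
```

With `alpha=0.0` the domain term vanishes and `total_loss` returns the task loss unchanged, which corresponds to the pure location‑aware setting that needs no domain labels.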
Two types of location encoders are evaluated. WRAP is a non‑parametric sine‑cosine encoding of the coordinates followed by a small MLP. GeoCLIP uses random Fourier features pretrained on a large Flickr image‑location dataset; the authors keep the pretrained backbone frozen and train a lightweight MLP on top. The domain predictor is a single linear layer. The framework is flexible: setting α to zero yields a pure location‑aware model that needs no domain labels, and any image encoder g and task loss L_TP can be swapped in.
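As a rough illustration of the sine‑cosine step in a WRAP‑style encoder, the sketch below normalizes latitude and longitude to [-1, 1] and returns four features. The exact normalization and feature order are assumptions, the function name `wrap_encode` is hypothetical, and the small MLP that follows the encoding is omitted.

```python
import math

def wrap_encode(lat_deg, lon_deg):
    """Sine-cosine encoding of a latitude/longitude pair in degrees.

    Assumed convention: latitude is divided by 90 and longitude by 180
    so both lie in [-1, 1], then each is mapped to (sin, cos) of pi
    times the normalized value.
    """
    lat = lat_deg / 90.0
    lon = lon_deg / 180.0
    return [
        math.sin(math.pi * lon),
        math.cos(math.pi * lon),
        math.sin(math.pi * lat),
        math.cos(math.pi * lat),
    ]

# Longitudes 180 and -180 describe the same meridian, and the
# encoding maps them to (numerically) the same point.
east = wrap_encode(10.0, 180.0)
west = wrap_encode(10.0, -180.0)
```

The point of the periodic encoding is that the longitude features are continuous across the antimeridian, which raw coordinates are not.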
Experiments are conducted on four geo‑tagged image datasets: (i) WILDS‑FMoW (satellite land‑use classification, domains = continents), (ii) WILDS‑PovertyMap (multispectral asset‑wealth regression, domains = regions), (iii) iNat‑Biomes (biome classification), and (iv) YFCC‑Avg (average color prediction from Flickr photos). Baselines include ERM, IRM, CORAL, GroupDRO, and recent location‑aware methods. Results show consistent improvements in worst‑group metrics: on FMoW the worst‑continent accuracy rises from 71.2% (previous SOTA) to 75.3% (+4.1 pp); on PovertyMap the worst‑region Pearson r improves from 0.45 to 0.49 (+0.04). Average performance is either maintained or slightly increased, indicating that the gains do not come at the expense of overall accuracy. FiLM and the adapted D³G fusion tend to yield the largest gains, while GeoCLIP‑based encoders provide the most stable generalization across datasets.
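For readers unfamiliar with worst‑group metrics, the following sketch shows how worst‑group accuracy can be computed for a classification task: accuracy is measured separately within each group (for example, each continent) and the minimum is reported. The function name and the toy data are hypothetical and are not taken from the paper or from the WILDS codebase.

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    """Return (worst_group, worst_accuracy, per_group_accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst], per_group

# Toy example with two groups of two samples each.
worst, acc, per_group = worst_group_accuracy(
    preds=[1, 1, 0, 0],
    labels=[1, 0, 0, 0],
    groups=["A", "A", "B", "B"],
)
```

In this toy example the overall accuracy is 0.75, but group "A" reaches only 0.5, which is the value a worst‑group metric reports.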
The analysis highlights several insights: (a) continuous latent domains capture richer spatial relationships than coarse discrete labels; (b) the auxiliary domain prediction loss effectively shapes the location encoder to align with domain structure without requiring domain information at test time; (c) the modular design allows practitioners to choose encoders and fusion methods that suit computational budgets. Limitations include dependence on the availability of domain labels during training (though α=0 mitigates this), sensitivity to the quality of pretrained location encoders, and the current focus on image‑latitude/longitude modalities.
Suggested future directions include unsupervised clustering to discover latent domains without any labels, extending the framework to temporal or multimodal data (e.g., text, audio), and integrating self‑supervised objectives to further regularize the location encoder.
In summary, the paper introduces a principled, flexible approach that leverages geographic coordinates to learn latent domain representations and condition the main predictor on them. This method substantially boosts robustness to geographic distribution shifts, achieving new state‑of‑the‑art worst‑group performance on two WILDS benchmarks while preserving or improving overall accuracy. It offers a practical pathway for building globally reliable AI systems that must operate across diverse and unevenly sampled regions.