Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction
Integrating domain knowledge into deep learning has emerged as a promising direction for improving model interpretability, generalization, and data efficiency. In this work, we present a novel knowledge-guided ViT-based Masked Autoencoder that embeds scientific domain knowledge within the self-supervised reconstruction process. Instead of relying solely on data-driven optimization, our approach incorporates the Linear Spectral Mixing Model (LSMM) as a physical constraint and the physically grounded Spectral Angle Mapper (SAM) as a geometric criterion, ensuring that learned representations adhere to known structural relationships between observed signals and their latent components. The framework jointly optimizes the LSMM and SAM losses with a conventional Huber loss objective, promoting both numerical accuracy and geometric consistency in the feature space. This knowledge-guided design enhances reconstruction fidelity, stabilizes training under limited supervision, and yields interpretable latent representations grounded in physical principles. The experimental findings indicate that the proposed model substantially enhances reconstruction quality and improves downstream task performance, highlighting the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning.
💡 Research Summary
The paper introduces KARMA (Knowledge‑Augmented Reconstruction with Masked Autoencoding), a physics‑informed variant of Vision‑Transformer‑based Masked Autoencoders (ViT‑MAE) designed for hyperspectral imagery. The core novelty lies in embedding the Linear Spectral Mixing Model (LSMM) directly into the decoder as a parallel reconstruction branch and jointly optimizing a hybrid loss comprising Huber loss, Spectral Angle Mapper (SAM) loss, and a physics‑consistency loss.
In the LSMM branch, each decoder token is passed through a lightweight MLP to predict an abundance vector. A soft‑max activation enforces the non‑negativity and sum‑to‑one constraints intrinsic to physically plausible mixtures. The predicted abundances are multiplied by an endmember matrix A (size 218 × M) to reconstruct the observed spectrum as a linear combination of a small set of latent material signatures. This forces the network to learn a low‑rank, physically meaningful representation of the data.
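The LSMM branch described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the MLP head is reduced to a single hypothetical linear layer `W`, and the number of endmembers `M` is an arbitrary choice for the example. The key properties it demonstrates are the soft-max-enforced abundance constraints (non-negative, sum-to-one) and the linear mixing through the endmember matrix `A`.

```python
import numpy as np

rng = np.random.default_rng(0)

BANDS = 218   # spectral bands after preprocessing (from the paper)
M = 8         # number of endmembers -- illustrative choice, not from the paper
D = 512       # decoder token dimension (from the paper)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical stand-in for the lightweight MLP head: one linear projection.
W = rng.normal(scale=0.02, size=(D, M))

# Learned endmember matrix A (BANDS x M), one column per material signature.
A = rng.random((BANDS, M))

def lsmm_reconstruct(tokens):
    """Map decoder tokens to abundance vectors, then mix endmembers linearly."""
    logits = tokens @ W                      # (N, M) abundance logits
    abundances = softmax(logits, axis=-1)    # non-negative, rows sum to one
    spectra = abundances @ A.T               # (N, BANDS) linear mixture
    return spectra, abundances

tokens = rng.normal(size=(4, D))
spectra, abundances = lsmm_reconstruct(tokens)
```

Because each reconstructed spectrum is a convex combination of only `M` endmember columns, the branch is low-rank by construction, which is exactly what ties the representation to physically plausible mixtures.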
SAM loss measures the angular distance between reconstructed and ground‑truth spectra, preserving spectral shape irrespective of absolute intensity. This is crucial for material discrimination in hyperspectral data, where the direction of a spectral vector encodes the material identity. The Huber loss provides robustness to outliers, while the physics‑consistency loss directly penalizes deviations between the LSMM reconstruction and the true spectrum. The total objective is L = λ₁ L_Huber + λ₂ L_SAM + λ₃ L_phys, with λ‑weights tuned empirically.
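The three loss terms can be written out explicitly. The sketch below follows the standard definitions of SAM and Huber loss; the exact form of the physics-consistency term is an assumption here (a Huber penalty on the LSMM-branch reconstruction), and the λ-weights are placeholders, since the paper only states they are tuned empirically.

```python
import numpy as np

def sam_loss(pred, target, eps=1e-8):
    """Mean spectral angle (radians): sensitive to shape, not absolute scale."""
    cos = (pred * target).sum(-1) / (
        np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0)).mean()

def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear in the tails: robust to outlier bands."""
    r = np.abs(pred - target)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)).mean()

def total_loss(pix_pred, lsmm_pred, target, lam=(1.0, 0.1, 0.1)):
    """L = l1*Huber + l2*SAM + l3*L_phys; weights here are illustrative only."""
    l1, l2, l3 = lam
    return (l1 * huber_loss(pix_pred, target)
            + l2 * sam_loss(pix_pred, target)
            + l3 * huber_loss(lsmm_pred, target))  # assumed physics term
```

A useful sanity check is that SAM is scale-invariant: a spectrum and the same spectrum doubled have (near-)zero angular distance, even though their Huber loss is large.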
The architecture follows the standard MAE pipeline: hyperspectral cubes (218 bands after preprocessing) are split into 16 × 16 non‑overlapping patches, embedded into 512‑dimensional tokens, and 75 % of patches are masked. The encoder processes only visible tokens, while the decoder receives both visible and learned mask tokens. In addition to the conventional pixel‑value prediction head, the LSMM head produces abundance vectors and reconstructs spectra via the learned endmember matrix.
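The patching and masking steps of the standard MAE pipeline can be sketched as follows. The patch size (16 × 16), band count (218), and 75 % mask ratio come from the paper; the cube size and the shuffle-based masking scheme are illustrative (the shuffle trick matches the original MAE recipe, but the paper does not detail its sampling code).

```python
import numpy as np

rng = np.random.default_rng(0)
P, BANDS, MASK_RATIO = 16, 218, 0.75  # patch size, bands, mask ratio (paper)

def patchify(cube, p=P):
    """Split an (H, W, BANDS) cube into flattened non-overlapping p x p patches."""
    H, W, C = cube.shape
    assert H % p == 0 and W % p == 0
    x = cube.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def random_mask(num_patches, ratio=MASK_RATIO):
    """Shuffle patch indices; keep the first (1 - ratio) fraction visible."""
    order = np.argsort(rng.random(num_patches))
    n_keep = int(num_patches * (1 - ratio))
    return order[:n_keep], order[n_keep:]  # visible, masked

cube = rng.random((64, 64, BANDS))       # toy 64 x 64 tile
patches = patchify(cube)                 # 16 patches of length 16*16*218
visible, masked = random_mask(len(patches))
```

Only the `visible` indices would be embedded and fed to the encoder; the decoder then sees the encoder output plus learned mask tokens at the `masked` positions.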
Experiments use EnMAP satellite data over California (5 000 training tiles, 500 validation, 2 000 test). Reconstruction quality is evaluated with PSNR and SSIM. KARMA achieves an average PSNR of 27.38 dB versus 24.61 dB for the baseline ViT‑MAE (≈ 11 % relative gain) and an SSIM of 0.68 versus 0.55 (≈ 24 % gain). The SAM component adds about 26 % training‑time overhead, leading to an overall 31.7 % increase in per‑sample training time, a cost justified by the substantial performance boost.
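For reference, the PSNR metric reported above is computed as below; this is the standard definition, assuming reconstructions normalized to [0, `max_val`] (the paper's exact normalization is not stated).

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Note that a ~2.8 dB PSNR gain (24.61 → 27.38 dB) corresponds to roughly halving the mean squared reconstruction error, which puts the reported "≈ 11 % relative gain" in decibel terms into perspective.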
For downstream transfer, the pretrained encoder is frozen and a lightweight CNN head is attached for two tasks: crop‑type classification and national land‑cover classification. Across both tasks, KARMA’s representations yield higher Top‑1 accuracy and mean Intersection‑over‑Union (mIoU) compared to the non‑knowledge‑guided baseline, demonstrating that physics‑aware pretraining improves feature transferability, especially for tasks where spectral shape is critical.
Ablation studies show that LSMM alone or SAM alone each provide modest improvements, but their combination yields the best results. Varying the number of abundance components (M) reveals a trade‑off: larger M slightly improves reconstruction but incurs higher computational cost and risk of over‑parameterization.
The authors acknowledge limitations: the endmember matrix is learned rather than fixed to known spectral libraries, so its physical interpretability requires further validation; LSMM assumes linear mixing, which may be insufficient for strongly nonlinear interactions; and the current design is tailored to EnMAP data, requiring adaptation for other sensors or domains.
In conclusion, KARMA demonstrates that integrating domain‑specific physical models and geometry‑aware losses into transformer‑based self‑supervised learning can simultaneously enhance reconstruction fidelity, interpretability, and downstream task performance. Future work may explore nonlinear mixing models, explicit incorporation of known spectral libraries, and multimodal extensions to build more generalizable physics‑guided foundation models for remote sensing and beyond.