Attention-Based Preprocessing Framework for Improving Rare Transient Classification
With large numbers of transients discovered by current and future imaging surveys, machine learning is increasingly applied to light curve and host galaxy properties to select events for follow-up. However, finding rare types of transients remains difficult due to extreme class imbalances in training sets, and extracting features from host images is complicated by the presence of bright foreground sources, particularly if the true host is faint or distant. Here we present a data augmentation pipeline for images and light curves that mitigates these issues, and apply this to improve classification of Superluminous Supernovae Type I (SLSNe-I) and Tidal Disruption Events (TDEs) with our existing NEEDLE code. The method uses a Similarity Index to remove image artefacts, and a masking procedure that removes unrelated sources while preserving the transient and its host. This focuses classifier attention on the relevant pixels, and enables arbitrary rotations for class upsampling. We also fit observed multi-band light curves with a two-dimensional Gaussian Process and generate data-driven synthetic samples by resampling and redshifting these models, cross-matching with galaxy images in the same class to produce unique but realistic new examples for training. Models trained with the augmented dataset achieve substantially higher purity: for classifications with a confidence of 0.8 or higher, we achieve 75% (43%) purity and 75% (66%) completeness for SLSNe-I (TDEs).
💡 Research Summary
The paper addresses the persistent challenge of classifying rare astronomical transients—specifically Superluminous Supernovae Type I (SLSNe‑I) and Tidal Disruption Events (TDEs)—in the face of extreme class imbalance and contaminated imaging data. Building on the existing NEEDLE (NEural Engine for Discovering Luminous Events) framework, the authors develop a comprehensive data‑centric preprocessing and augmentation pipeline that improves both image quality and light‑curve diversity without resorting to synthetic data from physical models.
First, a “Similarity Index” based on Structural Similarity (SSIM) automatically flags and discards images corrupted by artefacts such as saturated stars, chip gaps, or diffraction spikes, which affect roughly 1.7 % of the ZTF Bright Transient Survey images used for training. Next, a U‑Net‑style segmentation model isolates the transient and its host galaxy, masking all unrelated foreground/background sources. Masked regions are filled with background‑matched Gaussian noise, preserving the statistical properties of the image while ensuring that the classifier’s attention is focused on the physically relevant pixels.
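The two steps above can be sketched with plain NumPy. This is a minimal illustration, not the paper's implementation: `global_ssim` is a simplified single-window SSIM (the standard metric is computed over sliding windows, e.g. in scikit-image), and `fill_masked_with_noise` shows the background-matched noise fill; both function names are hypothetical.

```python
import numpy as np

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Simplified single-window Structural Similarity between two images.
    Returns 1.0 for identical inputs; low values flag artefact-corrupted frames.
    (The standard SSIM averages this statistic over local sliding windows.)"""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def fill_masked_with_noise(image, mask, rng=None):
    """Replace masked (True) pixels — e.g. unrelated foreground sources —
    with Gaussian noise drawn from the unmasked background statistics,
    so the stamp's noise properties are preserved."""
    rng = np.random.default_rng(rng)
    bg = image[~mask]                       # unmasked pixels define the background
    out = image.copy()
    out[mask] = rng.normal(bg.mean(), bg.std(), size=int(mask.sum()))
    return out
```

In practice one would threshold the SSIM against a reference (e.g. a template or median stack) to discard the ~1.7% of corrupted frames before masking.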
The cleaned images are then subjected to aggressive augmentation: arbitrary rotations (0–360°), flips, and modest scaling are applied, with the host centroid re‑aligned after each transformation. This geometric augmentation expands the limited set of 87 SLSNe‑I and 64 TDE examples by an order of magnitude, providing a richer set of visual contexts for the neural network.
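A minimal sketch of the rotation-based augmentation, assuming a square cutout centred on the host. The rotation is implemented here with inverse-mapped bilinear interpolation in pure NumPy (a library routine such as `scipy.ndimage.rotate` would normally be used), and the empty corners are filled with background-matched noise as in the masking step; function names are illustrative.

```python
import numpy as np

def rotate_image(image, angle_deg):
    """Rotate about the stamp centre via inverse mapping + bilinear
    interpolation; pixels mapped from outside the frame become NaN."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    th = np.deg2rad(angle_deg)
    yy, xx = np.mgrid[0:h, 0:w]
    # inverse-map each output pixel back into the input frame
    ys = cy + (yy - cy) * np.cos(th) - (xx - cx) * np.sin(th)
    xs = cx + (yy - cy) * np.sin(th) + (xx - cx) * np.cos(th)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    fy, fx = ys - y0, xs - x0
    out = np.full((h, w), np.nan)
    ok = (y0 >= 0) & (y0 < h - 1) & (x0 >= 0) & (x0 < w - 1)
    y0v, x0v, fyv, fxv = y0[ok], x0[ok], fy[ok], fx[ok]
    out[ok] = ((1 - fyv) * (1 - fxv) * image[y0v, x0v]
               + (1 - fyv) * fxv * image[y0v, x0v + 1]
               + fyv * (1 - fxv) * image[y0v + 1, x0v]
               + fyv * fxv * image[y0v + 1, x0v + 1])
    return out

def augment_cutout(image, angle_deg, flip=False, rng=None):
    """One augmented view: optional flip, arbitrary rotation, and
    background-noise fill of the uncovered corners."""
    rng = np.random.default_rng(rng)
    src = np.fliplr(image) if flip else image
    rot = rotate_image(src, angle_deg)
    hole = np.isnan(rot)
    rot[hole] = rng.normal(image.mean(), image.std(), size=int(hole.sum()))
    return rot
```

Sampling many random angles and flips per cutout is what expands the small SLSN-I and TDE sets by roughly an order of magnitude.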
For the light‑curve side, the authors fit multi‑band observations with a two‑dimensional Gaussian Process (GP) using a Matern 3/2 × RBF kernel that captures correlations across time and wavelength. From the trained GP they draw random samples of key parameters (peak flux, rise/decay times) and apply redshift scaling to generate synthetic light curves at arbitrary distances. Realistic observational noise is added by sampling from the empirical error distributions of the original data.
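The GP conditioning step can be written out directly with a separable Matern 3/2 (time) × RBF (band) kernel. This sketch fixes the hyperparameters by hand for illustration; the actual pipeline would optimise them per light curve, and the function names are hypothetical.

```python
import numpy as np

def matern32(d, ell):
    """Matern 3/2 correlation as a function of separation d."""
    a = np.sqrt(3.0) * np.abs(d) / ell
    return (1.0 + a) * np.exp(-a)

def rbf(d, ell):
    """Squared-exponential (RBF) correlation."""
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(t, band, y, yerr, t_star, band_star, ell_t=20.0, ell_b=1.0):
    """Condition a 2-D GP with kernel k = Matern32(time) * RBF(band index)
    on observed fluxes (t, band, y +/- yerr) and return the posterior mean
    on a new (time, band) grid. Hyperparameters are fixed for illustration."""
    def kern(t1, b1, t2, b2):
        dt = t1[:, None] - t2[None, :]
        db = b1[:, None] - b2[None, :]
        return matern32(dt, ell_t) * rbf(db, ell_b)
    K = kern(t, band, t, band) + np.diag(yerr ** 2)   # noisy training covariance
    Ks = kern(t_star, band_star, t, band)             # test-train covariance
    alpha = np.linalg.solve(K, y)
    return Ks @ alpha
```

Synthetic light curves then follow by evaluating the posterior on dense grids, perturbing the model, and adding noise drawn from the empirical error distributions of the survey.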
The most novel component is a cross‑matching augmentation that pairs each synthetic light curve with a real host‑galaxy image from the same class. By inserting the synthetic transient at the host’s centroid and preserving the host’s absolute magnitude and colour distribution, the method creates physically consistent image‑light‑curve pairs that mimic genuine observations. This hybrid augmentation yields a training set roughly ten times larger than the original, while maintaining fidelity to the survey’s characteristics.
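The redshift-scaling half of this pairing can be sketched as a flux dimming by the luminosity-distance ratio plus (1+z) time dilation. This is a simplified illustration assuming a flat ΛCDM cosmology with hand-picked parameters and no K-corrections; the function names are hypothetical.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s

def lum_distance_mpc(z, H0=70.0, Om=0.3, n=2048):
    """Luminosity distance (Mpc) in flat LambdaCDM, by trapezoidal
    integration of 1/E(z) for the comoving distance."""
    zs = np.linspace(0.0, z, n)
    Ez = np.sqrt(Om * (1.0 + zs) ** 3 + (1.0 - Om))
    integrand = 1.0 / Ez
    dc = (C_KM_S / H0) * np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(zs))
    return (1.0 + z) * dc

def redshift_lightcurve(t_obs, flux, z_old, z_new):
    """Move a light curve observed at z_old to a new redshift z_new:
    inverse-square flux scaling with luminosity distance, and time-axis
    stretching by the ratio of (1+z) dilation factors.
    K-corrections are deliberately ignored in this sketch."""
    d_old = lum_distance_mpc(z_old)
    d_new = lum_distance_mpc(z_new)
    flux_new = flux * (d_old / d_new) ** 2
    t_new = t_obs * (1.0 + z_new) / (1.0 + z_old)
    return t_new, flux_new
```

Each rescaled synthetic curve is then paired with a real host image of the same class, which is what keeps the image-light-curve combinations physically consistent.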
Training the upgraded NEEDLE model on this augmented dataset leads to substantial performance gains. At a classification confidence threshold of 0.8, the model achieves 75 % purity and 75 % completeness for SLSNe‑I, and 43 % purity with 66 % completeness for TDEs—significant improvements over the baseline. Ablation studies isolate the contributions of each step: artefact removal alone raises purity/completeness by ~5–7 %, masking alone by >10 %, and the full pipeline delivers the maximal boost. Moreover, SSIM and Fréchet Inception Distance metrics confirm that the synthetic samples are virtually indistinguishable from real data.
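The thresholded purity and completeness quoted above can be computed as below. Note this is one common convention (completeness measured against all true members of the class, whether or not they pass the cut); the paper's exact definition may differ, and the function name is illustrative.

```python
import numpy as np

def purity_completeness(y_true, probs, target_class, threshold=0.8):
    """Purity (precision) and completeness (recall) for one class,
    keeping only predictions whose top confidence meets the threshold.
    probs: (n_samples, n_classes) array of softmax scores."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    selected = (pred == target_class) & (conf >= threshold)
    tp = np.sum(selected & (y_true == target_class))
    n_true = np.sum(y_true == target_class)
    purity = tp / selected.sum() if selected.sum() else 0.0
    completeness = tp / n_true if n_true else 0.0
    return purity, completeness
```

Sweeping the threshold traces the purity-completeness trade-off: a higher cut typically raises purity for the rare classes at the cost of completeness.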
The authors acknowledge limitations: the GP‑based light‑curve generation may not capture highly irregular phenomena (e.g., rapid flares, multi‑peak events), and the masking process can erase subtle host structures such as tidal streams. Future work will explore hybrid approaches that blend physics‑based simulations with data‑centric augmentation, and will investigate Transformer‑based architectures that jointly process images and time‑series to further enhance rare‑event discovery. Overall, the study demonstrates that improving data quality and diversity can outweigh architectural refinements, offering a practical pathway to more reliable identification of the most elusive transients in upcoming large‑scale surveys.