Event Tokenization and Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

We propose a novel use of Large Language Models (LLMs) as unsupervised anomaly detectors in particle physics. Using lightweight LLM-like networks with encoder-based architectures trained to reconstruct background events via masked-token prediction, our method identifies anomalies through deviations in reconstruction performance, without prior knowledge of signal characteristics. Applied to searches for simultaneous four-top-quark production, this token-based approach shows competitive performance against established unsupervised methods and effectively captures subtle discrepancies in collider data, suggesting a promising direction for model-independent searches for new physics.


💡 Research Summary

The paper introduces a novel, model‑independent approach for anomaly detection at the Large Hadron Collider (LHC) by treating collider events as sequences of discrete tokens and applying a lightweight transformer‑based masked‑language‑model (MLM) architecture. The authors focus on the challenging task of identifying simultaneous four‑top‑quark production (tt̄tt̄), a rare Standard Model process with a complex final state that closely resembles several background processes (tt̄W, tt̄WW, tt̄Z, tt̄H).

Dataset and Physical Context
The study uses the “Dark Machines” benchmark dataset, which contains over one billion simulated proton‑proton collisions at √s = 13 TeV generated with MG5_aMC@NLO, hadronized with Pythia 8, and passed through a fast detector simulation (Delphes). Events are pre‑selected and stored in CSV format, then split into 80 % training, 10 % validation, and 10 % test sets. Each event is represented by up to 18 particle objects (jets, b‑jets, electrons, muons, photons) together with the missing transverse energy (E_T^miss) and its azimuthal angle (φ_E^miss).

Tokenization Strategy
To make the data amenable to a language‑model pipeline, the continuous kinematic variables are discretized. The authors define four bins for transverse momentum (p_T), pseudorapidity (η), and E_T^miss such that each bin contains roughly 25 % of the background data. The azimuthal angles φ and φ_E^miss are divided into four equal‑width bins of size π/2 (four bins covering the full 2π range, consistent with the four‑valued φ index in the token formula below). Each particle type belongs to one of seven predefined categories (jet, b‑jet, e⁺, e⁻, μ⁺, μ⁻, γ). A token ID is then constructed as

token_part = 64 × (bin_obj − 1) + 16 × (bin_pT − 1) + 4 × (bin_η − 1) + bin_φ

resulting in IDs ranging from 1 to 448 for particle tokens. E_T^miss tokens occupy 449‑452 and φ_E^miss tokens 453‑456. An event is encoded as a fixed‑length sequence of 20 tokens (18 particle tokens + two global tokens). Zero‑padding is used for events with fewer than 18 objects.
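The binning and token formula above can be sketched in a few lines of Python. The bin edge values here are illustrative placeholders (the paper derives them as background quartiles), and the `OBJECT_TYPES` ordering is an assumption; only the token formula itself comes from the text.

```python
import bisect

# Hypothetical bin edges. In the paper, p_T, eta, and E_T^miss edges are
# chosen so that ~25% of background events fall into each of 4 bins;
# these placeholder values are NOT the paper's actual quartiles.
PT_EDGES  = [30.0, 60.0, 120.0]                # GeV, 3 inner edges -> 4 bins
ETA_EDGES = [-1.0, 0.0, 1.0]
PHI_EDGES = [-1.5707963, 0.0, 1.5707963]       # 4 equal-width bins over [-pi, pi]

# Assumed ordering of the seven object categories (1-based).
OBJECT_TYPES = {"jet": 1, "b-jet": 2, "e+": 3, "e-": 4, "mu+": 5, "mu-": 6, "gamma": 7}

def to_bin(value, edges):
    """Return a 1-based bin index in 1..len(edges)+1."""
    return bisect.bisect_right(edges, value) + 1

def particle_token(obj, pt, eta, phi):
    """Token ID in 1..448, per token = 64*(obj-1) + 16*(pT-1) + 4*(eta-1) + phi."""
    b_obj = OBJECT_TYPES[obj]
    return (64 * (b_obj - 1)
            + 16 * (to_bin(pt, PT_EDGES) - 1)
            + 4 * (to_bin(eta, ETA_EDGES) - 1)
            + to_bin(phi, PHI_EDGES))
```

With seven object types and four bins per kinematic variable, the lowest-binned jet maps to token 1 and the highest-binned photon to token 448, matching the stated ID range.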

Model Architecture
The core model is a compact transformer encoder consisting of two layers, each with four self‑attention heads. Token IDs are first embedded into a dense vector space, processed by the transformer blocks, and finally passed through a linear projection followed by a softmax layer that yields a probability distribution over the 456 possible token classes. This design mirrors a lightweight BERT‑style masked language model but with far fewer parameters, making it suitable for large‑scale LHC data processing.
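A toy NumPy sketch of the forward pass can make the data flow concrete. This is a deliberate simplification with randomly initialized weights: it shows one encoder layer (the paper uses two) with four attention heads, and omits layer normalization, feed-forward sublayers, and positional embeddings; the vocabulary is padded to 457 so that ID 0 can serve as padding.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, N_HEADS, SEQ_LEN = 457, 32, 4, 20   # 456 token classes + padding id 0

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for one head."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

# Randomly initialized toy weights; a trained model would learn these.
embed = rng.normal(size=(VOCAB, D_MODEL))
heads = [tuple(rng.normal(size=(D_MODEL, D_MODEL // N_HEADS)) for _ in range(3))
         for _ in range(N_HEADS)]
w_out = rng.normal(size=(D_MODEL, VOCAB))

def forward(token_ids):
    """Embed -> multi-head self-attention (with residual) -> vocab softmax."""
    x = embed[token_ids]                                          # (seq, d_model)
    attn = np.concatenate([attention_head(x, *h) for h in heads], axis=-1)
    x = x + attn                                                  # residual connection
    return softmax(x @ w_out)                                     # (seq, vocab)

probs = forward(np.array([17, 130, 449, 453] + [0] * 16))         # one padded event
```

Each of the 20 sequence positions ends up with a normalized probability distribution over all token classes, which is exactly the output the masked-token objective needs.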

Training Procedure
Training is performed exclusively on background events. For each training example, a single token is randomly masked, and the model is tasked with predicting the masked token using Sparse Categorical Cross‑Entropy loss. The optimizer is Adam, and early stopping based on validation loss prevents over‑fitting. This masked‑token‑prediction (MTP) objective forces the model to learn the joint distribution of particle types and their kinematics under the background hypothesis.
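The single-mask objective can be sketched as follows. The choice of mask ID 0 (reusing the padding value) is an assumption for illustration; the paper may reserve a dedicated mask token.

```python
import math
import random

def sparse_categorical_cross_entropy(probs, target_id):
    """Loss for one masked position: -log p(correct token)."""
    return -math.log(probs[target_id])

MASK_ID = 0  # hypothetical: reuse the padding id as the mask

def masked_training_example(tokens):
    """Mask one random position; the model must recover the original token."""
    pos = random.randrange(len(tokens))
    target = tokens[pos]
    inputs = list(tokens)
    inputs[pos] = MASK_ID
    return inputs, pos, target
```

Because only one token is hidden per example, the model must exploit all 19 visible tokens as context, which is what drives it to learn correlations between particle content and kinematics.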

Inference and Anomaly Scoring
During inference, every token in an event is masked in turn, the model predicts the missing token, and the per‑token cross‑entropy loss is recorded. The average loss across all tokens defines an event‑level reconstruction score. Background events, which the model has seen during training, typically yield low scores, whereas anomalous events (e.g., four‑top signal) produce higher scores because their token patterns deviate from the learned background distribution. By selecting a threshold on this score, events can be flagged as anomalous.
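The leave-one-out scoring loop described above might look like this, where `predict` stands in for the trained model (any callable returning a probability distribution for the masked position) and mask ID 0 is again an illustrative assumption:

```python
import math

def event_score(tokens, predict):
    """Average per-token cross-entropy with each token masked in turn.

    `predict(masked_tokens, pos)` is assumed to return a probability
    distribution over the vocabulary for position `pos`.
    """
    losses = []
    for pos, target in enumerate(tokens):
        masked = list(tokens)
        masked[pos] = 0                      # hypothetical mask id
        probs = predict(masked, pos)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

def flag_anomalies(events, predict, threshold):
    """Mark events whose reconstruction score exceeds the chosen threshold."""
    return [event_score(ev, predict) > threshold for ev in events]
```

A model that assigns high probability to the true background tokens produces small scores, so the threshold directly trades signal efficiency against background rejection.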

Results
The authors evaluate the method on the four‑top signal. The distributions of reconstruction scores for signal and background show a 70.85 % overlap, and the resulting Receiver Operating Characteristic curve has an Area Under the Curve (ROC‑AUC) of 0.67. This performance is compared against three established unsupervised anomaly‑detection techniques applied to the same dataset: Deep Density Discrimination (DDD), Deep Support Vector Data Description (DeepSVDD), and Deep Robust One‑Class Classification (DROCC). While DDD still outperforms the proposed approach, the new method surpasses DeepSVDD and DROCC, demonstrating competitive capability despite its simplicity and modest model size.
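For reference, the ROC‑AUC quoted above admits a simple probabilistic reading: it is the probability that a randomly chosen signal event scores higher than a randomly chosen background event. A brute-force sketch of that equivalence (fine for small score lists; real analyses would use a rank-based or library implementation):

```python
def roc_auc(background_scores, signal_scores):
    """AUC = P(random signal score > random background score),
    counting ties as 1/2 (equivalent to the Mann-Whitney U statistic)."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))
```

An AUC of 0.67 therefore means a four‑top event out-scores a background event about two times in three.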

Discussion and Future Directions
The paper highlights several key insights:

  1. Tokenization Feasibility – Discretizing continuous physics variables into a modest vocabulary (456 tokens) enables transformer models to capture complex event topologies without excessive memory consumption.
  2. Unsupervised Learning via MLM – Masked‑token prediction provides a powerful self‑supervised signal that forces the model to internalize the full joint distribution of background events, eliminating the need for labeled anomalies.
  3. Model Efficiency – The lightweight encoder (two layers, four heads) achieves reasonable discrimination while remaining computationally tractable for large LHC datasets, suggesting scalability to real‑time or near‑real‑time applications.

The authors acknowledge limitations: the manual binning scheme may discard subtle kinematic information, and the fixed‑length token sequence forces padding for events with fewer objects, potentially biasing the model. They propose exploring learned tokenization (e.g., vector‑quantized variational autoencoders) and more sophisticated masking strategies (multiple simultaneous masks, variable mask ratios) to enrich contextual learning. Additionally, deeper transformer stacks, hybrid graph‑transformer architectures, and systematic studies on real detector data are identified as promising avenues.

Conclusion
The study demonstrates that large‑language‑model concepts—tokenization, masked‑language‑model training, and transformer encoders—can be successfully transplanted into high‑energy physics for model‑independent anomaly detection. Although the current implementation does not yet surpass the best existing unsupervised methods, it achieves competitive performance with a far simpler and more flexible pipeline. With further refinements in token design, model depth, and training strategies, this approach could become a valuable tool for future LHC analyses, enabling the discovery of rare or unexpected phenomena without relying on specific signal models.

