ContraLog: Log File Anomaly Detection with Contrastive Learning and Masked Language Modeling
Log files record computational events that reflect system state and behavior, making them a primary source of operational insights in modern computer systems. Automated anomaly detection on logs is therefore critical, yet most established methods rely on log parsers that collapse messages into discrete templates, discarding variable values and semantic content. We propose ContraLog, a parser-free and self-supervised method that reframes log anomaly detection as predicting continuous message embeddings rather than discrete template IDs. ContraLog combines a message encoder that produces rich embeddings for individual log messages with a sequence encoder to model temporal dependencies within sequences. The model is trained with a combination of masked language modeling and contrastive learning to predict masked message embeddings based on the surrounding context. Experiments on the HDFS, BGL, and Thunderbird benchmark datasets empirically demonstrate effectiveness on complex datasets with diverse log messages. Additionally, we find that message embeddings generated by ContraLog carry meaningful information and are predictive of anomalies even without sequence context. These results highlight embedding-level prediction as an approach for log anomaly detection, with potential applicability to other event sequences.
💡 Research Summary
The paper introduces ContraLog, a novel, parser‑free framework for log file anomaly detection that operates directly on raw log messages. Traditional approaches first parse logs into discrete templates (keys) and variable values, then model the sequence of template IDs. This preprocessing discards valuable information such as the actual numeric or textual values, introduces parsing errors, and ignores semantic similarity between different templates. ContraLog eliminates the need for any log parser and instead learns continuous vector representations (embeddings) of each log line.
The architecture consists of two transformer‑based encoders. First, a Message Encoder receives tokenized log lines. Tokens are generated by a dataset‑specific Byte‑Pair Encoding (BPE) tokenizer that is trained on the target log corpus, allowing the tokenizer to capture long, repetitive fragments of log templates as single tokens and thus keep the vocabulary small. The Message Encoder processes each token sequence, applies mean‑pooling over token embeddings, and passes the result through a linear layer to produce a fixed‑dimensional message embedding Eᵢ. Positional encodings are added to preserve intra‑message order.
Second, a Sequence Encoder takes the ordered list of message embeddings {E₁,…,Eₙ}, adds another set of positional encodings, and runs a transformer over the whole sequence. This component captures temporal dependencies and contextual relationships among log entries.
Training combines masked language modeling (MLM) with contrastive learning. In each minibatch a subset of message embeddings is randomly masked. The Sequence Encoder predicts a representation Ŷⱼ for each masked position j. The original (unmasked) message embeddings Eᵢ serve as targets. Both Ŷⱼ and Eᵢ are L2‑normalized, and a similarity matrix K is built using cosine similarity scaled by a temperature τ. Row‑wise and column‑wise cross‑entropy losses (InfoNCE) are computed on K, treating the diagonal entries as positives and all off‑diagonal entries as negatives. The final symmetric loss L_sym is the average of the two directions. This objective forces the model to bring the predicted embedding close to the true embedding while pushing apart embeddings of different messages, thereby learning a discriminative embedding space without any explicit label information.
During inference, two complementary anomaly scores are derived.
- Contextual anomaly score – Each log line is masked in turn, the Sequence Encoder predicts Ŷⱼ, and the cosine similarity between Ŷⱼ and the actual embedding Eⱼ is converted to a distance (1 − sim). A high distance indicates that the line is unexpected given its surrounding context. Scores can be aggregated per sequence by taking the maximum or the mean across all positions, yielding a sequence‑level contextual anomaly measure.
- Point anomaly score – Independently of context, the Message Encoder's embedding Eⱼ is compared to the distribution of embeddings observed during training (e.g., via distance to the nearest normal centroid or density estimation). Messages that lie far from the normal embedding manifold receive a high point anomaly score. This component addresses cases where a sequence consists of highly repetitive messages; the Sequence Encoder might otherwise reconstruct the masked embedding perfectly, leading to false negatives.
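Both scores reduce to simple distance computations in the embedding space. The sketch below uses the centroid-based variant of the point score, which is only one of the options mentioned above; all names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def contextual_scores(pred, actual):
    """Contextual score per position: cosine distance 1 - sim(Y_hat_j, E_j)."""
    return 1.0 - F.cosine_similarity(pred, actual, dim=-1)

def point_scores(msg_embs, centroids):
    """Point score: distance to the nearest 'normal' centroid from training.

    Centroid-based scoring is one option; density estimation is another.
    """
    msg_embs = F.normalize(msg_embs, dim=-1)
    centroids = F.normalize(centroids, dim=-1)
    sims = msg_embs @ centroids.t()          # (n_msgs, n_centroids)
    return 1.0 - sims.max(dim=1).values      # far from every centroid -> high score

# 20 messages scored against their predictions and against 5 normal centroids.
ctx = contextual_scores(torch.randn(20, 128), torch.randn(20, 128))
pt = point_scores(torch.randn(20, 128), torch.randn(5, 128))
```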
Both scores are standardized using robust Z‑scores and combined via an L₂ norm to produce a final anomaly score for each log sequence.
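The score fusion can be sketched as below, assuming the common median/MAD definition of the robust Z-score; the paper's exact standardization may differ:

```python
import torch

def robust_z(x):
    """Robust Z-score: center by the median, scale by the median absolute
    deviation (assumed definition; a 1.4826 consistency factor is omitted)."""
    med = x.median()
    mad = (x - med).abs().median().clamp_min(1e-8)
    return (x - med) / mad

def combine(ctx_scores, pt_scores):
    """Standardize both per-sequence scores, then fuse them via an L2 norm."""
    z = torch.stack([robust_z(ctx_scores), robust_z(pt_scores)], dim=-1)
    return z.norm(dim=-1)   # final anomaly score per log sequence

final = combine(torch.randn(100), torch.randn(100))
```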
The authors evaluate ContraLog on three widely used benchmark datasets: HDFS (distributed file system logs), BGL (BlueGene/L supercomputer logs), and Thunderbird (logs from Sandia's Thunderbird supercomputer). These datasets differ markedly in template cardinality, variable diversity, and temporal complexity. ContraLog outperforms or matches state‑of‑the‑art parser‑based methods such as LogBERT and LogAnomaly, which rely on parsers like Drain, despite requiring no hand‑crafted parsing rules or template dictionaries. Notably, on BGL and Thunderbird the embeddings alone separate normal from abnormal messages, as visualized with UMAP: normal embeddings form tight clusters while anomalous ones scatter outside. This demonstrates that the learned embedding space captures semantic nuances of log content.
Key contributions of the work are:
- A fully self‑supervised, parser‑free log anomaly detection pipeline that learns from raw text.
- The integration of masked embedding prediction with a symmetric contrastive loss, enabling the model to learn both contextual and point‑wise notions of normality.
- Empirical evidence that message‑level embeddings are highly informative for point anomaly detection, reducing reliance on sequence context.
- Practical engineering decisions such as dataset‑specific BPE tokenization to handle the low‑vocabulary, highly repetitive nature of log data.
The paper also discusses limitations and future directions. Real‑time deployment would require efficient masking strategies and possibly model compression. Extending the approach to multi‑domain transfer learning, incorporating additional operational data (metrics, traces), and exploring clustering‑based collective anomaly detection are promising avenues. Overall, ContraLog establishes a new paradigm for log anomaly detection that leverages continuous representations and contrastive learning, offering both theoretical insight and practical performance gains.