Foundation CAN LM: A Pretrained Language Model For Automotive CAN Data

The Controller Area Network (CAN) bus provides a rich source of vehicular signals increasingly leveraged for applications in automotive and auto insurance domains, including collision detection, predictive maintenance, and driver risk modeling. Despite this potential, existing pipelines largely train isolated task-specific models on raw CAN data, with only limited efforts exploring decoded signals. Such fragmentation prevents shared representation learning and limits cross-task generalization. By contrast, natural language processing (NLP) and computer vision (CV) have been transformed by the foundation model paradigm: large-scale pretraining followed by task-specific adaptation. In this work, we introduce the foundation CAN model that demonstrates multi-objective downstream generalization using a single pretrained backbone. Our approach treats CAN data as a language: we pretrain on large-scale, unlabeled decoded CAN signals and fine-tune across heterogeneous auto insurance tasks. To enable this, we propose a unified tokenization scheme for mixed discrete-continuous signals and address challenges of temporal complexity and trip-specific variability. Our results show that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm, proven in NLP and CV, also holds for CAN data. This establishes a new direction for generalizable representation learning in automotive AI.


💡 Research Summary

The paper introduces “Foundation CAN LM,” a large‑scale pretrained language model that treats decoded automotive Controller Area Network (CAN) signals as a textual language and demonstrates that a single backbone can be fine‑tuned for multiple heterogeneous automotive and auto‑insurance tasks. The authors begin by highlighting the fragmentation in current CAN‑based AI pipelines: most solutions train isolated, task‑specific models directly on raw CAN frames, which prevents shared representation learning, duplicates data preparation effort, and limits cross‑task generalization. Inspired by the success of foundation models in natural language processing (NLP) and computer vision (CV), the authors propose a two‑stage paradigm—massive unsupervised pretraining on decoded CAN data followed by task‑specific fine‑tuning.

The dataset consists of anonymized, decoded CAN logs from roughly 10,000 vehicles collected over a 90‑day period. From this corpus, a nine‑day subset containing approximately 19 billion tokens is used for pretraining. Each trip log is decoded into 44 interpretable features spanning vehicle dynamics, driver behavior, safety indicators, and vehicle state. The signals are synchronized to a 1 Hz sampling rate, normalized within empirically derived operating ranges, and segmented into 10‑second windows, yielding 450 tokens per sequence after tokenization.
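The preprocessing steps described above (1 Hz synchronization, normalization within per-feature operating ranges, 10‑second windowing) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the function name, the min‑max form of the normalization, and the shape conventions are assumptions.

```python
import numpy as np

def preprocess_trip(signals: np.ndarray, bounds: np.ndarray, window_s: int = 10) -> np.ndarray:
    """Normalize decoded CAN signals and segment them into fixed windows.

    signals: (T, 44) array of decoded features, one row per second (1 Hz).
    bounds:  (44, 2) array of per-feature [min, max] operating ranges
             (the paper derives these empirically; here they are inputs).
    Returns a (n_windows, window_s, 44) array of values in [0, 1].
    """
    lo, hi = bounds[:, 0], bounds[:, 1]
    # Clip each feature to its operating range, then min-max scale to [0, 1].
    clipped = np.clip(signals, lo, hi)
    normed = (clipped - lo) / (hi - lo)
    # Drop any trailing partial window and split into 10-second segments.
    n_windows = normed.shape[0] // window_s
    return normed[: n_windows * window_s].reshape(n_windows, window_s, -1)
```

A trip of 25 seconds, for example, would yield two full 10‑second windows, with the final 5 seconds discarded (how the paper handles partial windows is not stated here).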

A central technical contribution is a unified tokenization scheme that can handle mixed discrete‑continuous data. Continuous variables are first clipped to feature‑specific min‑max bounds, then linearly scaled to
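A minimal sketch of this style of mixed discrete‑continuous discretization is shown below. The bin count, vocabulary offset, and function names are hypothetical assumptions for illustration; the paper's exact scaling target and vocabulary layout are not specified in the excerpt above.

```python
import numpy as np

def tokenize_continuous(x, lo: float, hi: float, n_bins: int = 256) -> np.ndarray:
    """Clip to [lo, hi], linearly scale to [0, 1], then quantize to a token ID.

    n_bins is an assumed vocabulary size per continuous feature.
    """
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    scaled = (x - lo) / (hi - lo)  # the linear scaling step described in the text
    # Map [0, 1] onto integer bins 0..n_bins-1 (value 1.0 falls in the top bin).
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def tokenize_discrete(x, offset: int) -> np.ndarray:
    """Map categorical codes into a disjoint region of the shared vocabulary,
    so discrete and continuous tokens never collide."""
    return np.asarray(x, dtype=int) + offset
```

For example, with bounds [0, 100] and 256 bins, a speed of 50 would land in bin 128, while a gear-position code of 3 could be offset past the continuous range so both signal types share one token space.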

