Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets
Handling missing values in tabular datasets presents a significant challenge in training and testing artificial intelligence models, an issue usually addressed through imputation. Here we introduce “Not Another Imputation Method” (NAIM), a novel transformer-based model designed to address this issue without traditional imputation techniques. NAIM’s ability to forgo imputation and learn effectively from the available data relies on two main techniques: feature-specific embeddings that encode both categorical and numerical features while natively handling missing inputs, and a modified masked self-attention mechanism that completely masks out the contributions of missing data. Additionally, a novel regularization technique is introduced to enhance the model’s generalization from incomplete data. We extensively evaluated NAIM on 5 publicly available tabular datasets, demonstrating superior performance over 6 state-of-the-art machine learning models and 5 deep learning models, each paired with 3 different imputation techniques when necessary. The results highlight NAIM’s efficacy in improving predictive performance and resilience in the presence of missing data. To facilitate further research and practical application in handling missing data without traditional imputation methods, we made the code for NAIM available at https://github.com/cosbidev/NAIM.
💡 Research Summary
The paper introduces NAIM (Not Another Imputation Method), a transformer‑based architecture designed to handle missing values in tabular datasets without resorting to any external imputation. The authors identify two core challenges in tabular data: heterogeneous feature types (categorical and numerical) and the frequent presence of missing entries, which can affect both training and inference. Existing solutions either fill missing entries before model training, using simple statistics or more sophisticated algorithms (e.g., mean imputation, MICE, KNN), or encode missingness as a special value inside tree‑based models (MIA, “missingness incorporated in attributes”). Both approaches have drawbacks: imputation can introduce bias and information loss, while tree‑specific tricks do not transfer to deep learning pipelines.
NAIM tackles these issues through three main innovations.

First, it creates feature‑specific embedding tables. Categorical features receive a dual embedding: a feature‑specific lookup (E_pos_i) and a shared lookup (E) that captures cross‑feature semantics. Numerical features are represented by a two‑token lookup table (E_num_i) containing a trainable “present” token and a non‑trainable, all‑zero “missing” token. When a numerical value is missing, the missing token is selected, so the embedding contributes neither signal nor gradient.

Second, the model modifies the standard masked multi‑head self‑attention to treat missing tokens exactly like padding: they are excluded from the key, query, and value matrices, so attention scores are computed solely over observed data. This extends the transformer’s inherent ability to ignore padded positions to missing‑data handling.

Third, a novel regularization scheme randomly masks a subset of features for each sample at every training epoch. This “sample‑mask” regularizer forces the network to learn robust representations that do not rely on any single feature, mimicking the presence of missing values during training and improving generalization to unseen missingness patterns.
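As a concrete illustration of the embedding idea, the PyTorch sketch below builds per-feature embeddings in which a missing value selects a frozen all-zero vector (so it contributes no signal and no gradient) and simultaneously produces the padding-style mask that attention can later use. All class and variable names here are hypothetical, the shared cross-feature table E is omitted for brevity, and this is a simplified reading of the design, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Per-feature embeddings with a zero, non-trainable 'missing' token.

    Simplified sketch of the NAIM-style embedding layer; names and shapes
    are assumptions, and the shared cross-feature table is omitted.
    """
    def __init__(self, num_numerical, cat_cardinalities, d_model=8):
        super().__init__()
        # One trainable "present" token per numerical feature; the token is
        # scaled by the observed value. Missing values are zeroed instead.
        self.num_tokens = nn.Parameter(torch.randn(num_numerical, d_model))
        # One lookup table per categorical feature, with one extra row
        # reserved for "missing"; padding_idx keeps that row at zero and
        # excludes it from gradient updates.
        self.cat_tables = nn.ModuleList(
            nn.Embedding(card + 1, d_model, padding_idx=card)
            for card in cat_cardinalities
        )

    def forward(self, x_num, num_mask, x_cat, cat_mask):
        # x_num: (B, F_num) floats; num_mask: (B, F_num) bool, True = missing
        # x_cat: (B, F_cat) longs;  cat_mask: (B, F_cat) bool, True = missing
        num_emb = x_num.unsqueeze(-1) * self.num_tokens          # (B, F_num, d)
        num_emb = num_emb.masked_fill(num_mask.unsqueeze(-1), 0.0)
        cat_cols = []
        for i, table in enumerate(self.cat_tables):
            # Redirect missing entries to the frozen all-zero padding row.
            idx = x_cat[:, i].masked_fill(cat_mask[:, i], table.padding_idx)
            cat_cols.append(table(idx))
        cat_emb = torch.stack(cat_cols, dim=1)                   # (B, F_cat, d)
        tokens = torch.cat([num_emb, cat_emb], dim=1)            # (B, F, d)
        pad_mask = torch.cat([num_mask, cat_mask], dim=1)        # True = ignore
        return tokens, pad_mask

# Tiny usage example: 2 numerical + 2 categorical features, one of each missing.
emb = FeatureEmbedding(num_numerical=2, cat_cardinalities=[3, 5], d_model=8)
x_num = torch.tensor([[0.5, 1.0]])
num_mask = torch.tensor([[False, True]])   # second numerical value missing
x_cat = torch.tensor([[2, 0]])
cat_mask = torch.tensor([[False, True]])   # second categorical value missing
tokens, pad_mask = emb(x_num, num_mask, x_cat, cat_mask)
```

Because the missing rows are exactly zero and `padding_idx` freezes the categorical missing row, backpropagation through a missing feature updates nothing, matching the "no gradient from missing data" property described above.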
The architecture consists of the embedding layer, a stack of masked multi‑head attention blocks (encoder‑only), followed by layer normalization and a fully‑connected classification head. No decoder is used because the primary downstream tasks are classification or regression.
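Putting the pieces together, a minimal encoder-only forward pass might look like the sketch below: missing features are excluded from attention via PyTorch's `src_key_padding_mask`, and a small helper emulates the train-time random feature masking described earlier. The class names, mean-pooling choice, and hyperparameters are assumptions for illustration, not the official NAIM code.

```python
import torch
import torch.nn as nn

def random_feature_mask(pad_mask, p=0.3):
    """Sketch of the sample-mask regularizer: additionally hide a random
    subset of observed features per sample during training."""
    drop = torch.rand_like(pad_mask, dtype=torch.float) < p
    new_mask = pad_mask | drop
    # Guard: never mask every feature of a sample, which would leave
    # the attention weights undefined.
    all_masked = new_mask.all(dim=1)
    new_mask[all_masked] = pad_mask[all_masked]
    return new_mask

class NAIMLikeEncoder(nn.Module):
    """Encoder-only stack: missing features are excluded from attention
    via key_padding_mask. Hypothetical sketch, not the released code."""
    def __init__(self, d_model=8, nhead=2, num_layers=2, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=32, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens, pad_mask):
        # True entries in pad_mask are ignored by every attention layer,
        # so scores are computed over observed features only.
        h = self.encoder(tokens, src_key_padding_mask=pad_mask)
        # Mean-pool over observed tokens only (pooling choice is an
        # assumption of this sketch).
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1)
        return self.head(self.norm(pooled))

# Usage: a batch of 4 samples with 6 feature tokens of width 8, where the
# last feature is missing everywhere; extra features are masked at random.
torch.manual_seed(0)
model = NAIMLikeEncoder()
tokens = torch.randn(4, 6, 8)
pad_mask = torch.zeros(4, 6, dtype=torch.bool)
pad_mask[:, -1] = True
logits = model(tokens, random_feature_mask(pad_mask, p=0.3))
```

Reusing the padding machinery this way means no custom attention kernel is needed: the same `src_key_padding_mask` path that ignores padded sequence positions also ignores missing features.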
Experimental evaluation covers five publicly available tabular classification datasets with varying numbers of classes, feature dimensions, and inherent missingness. The authors artificially introduce missing rates ranging from 10% to 50% to test robustness. Baselines include six state‑of‑the‑art machine‑learning models (XGBoost, LightGBM, CatBoost, Random Forest, Logistic Regression, SVM) and five recent deep‑learning models (TabNet, TabTransformer, FT‑Transformer, NODE, DeepGBM), each paired with three imputation strategies (mean, MICE, KNN), yielding 33 model–imputation comparison points (11 models × 3 imputers).
Results show that NAIM consistently outperforms all baselines across all missing‑rate settings. Average accuracy improvements range from 2 to 4 percentage points, with the gap widening as the missing rate exceeds 30 %. Statistical significance is confirmed via paired t‑tests (p < 0.01). Training time and memory consumption are comparable to other transformer‑based models, indicating that the added embedding tables and masking logic do not impose prohibitive overhead.
The authors acknowledge limitations: the current study focuses on classification tasks, leaving regression, time‑series, and multi‑label scenarios untested. Moreover, the method assumes missing completely at random (MCAR); performance under missing‑at‑random (MAR) or not‑at‑random (MNAR) conditions remains an open question. Future work is suggested in three directions: (1) integrating explicit missing‑mechanism modeling (e.g., Bayesian approaches) into the transformer, (2) leveraging self‑supervised pre‑training that reconstructs masked features to further improve robustness, and (3) extending NAIM to broader downstream tasks and larger-scale datasets.
Finally, the authors release the full implementation on GitHub (https://github.com/cosbidev/NAIM), encouraging reproducibility and community-driven extensions.