Relational Database Distillation: From Structured Tables to Condensed Graph Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by the prohibitive storage overhead and excessive training time, due to the massive scale of the database and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without engaging the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.


💡 Research Summary

Relational databases (RDBs) are the backbone of many large‑scale data‑centric applications, but when they are transformed into heterogeneous relational entity graphs (REGs) for graph neural network (GNN) training, the sheer number of rows and the dense foreign‑key connectivity incur prohibitive memory consumption and excessive training time. To address this scalability bottleneck, the authors introduce a new data‑centric problem called Relational Database Distillation (RDD): compress a massive RDB into a much smaller synthetic database that preserves the predictive information needed for downstream tasks. They propose Table‑to‑Graph (T2G), the first RDD framework that converts an RDB into a compact heterogeneous graph while keeping the original schema intact.

The T2G pipeline consists of four tightly coupled components:

1. Modality‑specific tokenizers. Each column is encoded by a lightweight tokenizer: numerical columns are projected with a learnable linear matrix, categorical columns are embedded via a small lookup table, and other modalities (e.g., timestamps, text) can be handled analogously. These tokenizers are deliberately simple to keep the storage footprint low.

2. Clustering‑derived pseudo‑labels. The column embeddings of all rows are clustered (e.g., with K‑means) to discover common patterns across entities. The resulting cluster assignments serve as pseudo‑labels that capture the intrinsic structure of the data without using any task labels.

3. SBM‑based graph synthesis. Guided by these pseudo‑labels, a Stochastic Block Model (SBM) is fitted to the inter‑cluster connection probabilities for each edge type defined by the schema. Sampling from the learned SBM simultaneously generates nodes of each table type and heterogeneous edges, producing a synthetic REG whose node count |V′| is an order of magnitude smaller than the original |V| while preserving the original table‑to‑table relationships.

4. KRR‑guided feature distillation. To ensure that the compressed node features remain useful for prediction, the authors distill them using a Kernel Ridge Regression (KRR) objective. The KRR loss is supervised by both the true task labels (for the target table) and the pseudo‑labels, encouraging the synthetic features to retain task‑relevant signals as well as the structural regularities captured by the clustering step. This approach avoids the costly bi‑level optimization used in earlier dataset‑distillation work and supports both classification and regression tasks.
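The tokenization, clustering, and SBM‑sampling steps can be sketched end‑to‑end on a toy single‑table example. Everything below is an illustrative assumption, not the paper's implementation: the table, its column shapes, the embedding dimension, the number of clusters, and the simple K‑means routine are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "users" table: one numeric column and one categorical column (assumed shapes).
n_rows, d_embed = 200, 8
ages = rng.uniform(18, 80, size=(n_rows, 1))       # numeric column
countries = rng.integers(0, 5, size=n_rows)        # categorical column (5 codes)

# 1) Modality-specific tokenizers: linear projection + embedding lookup.
W_num = rng.normal(size=(1, d_embed))              # learnable linear matrix (here: random)
E_cat = rng.normal(size=(5, d_embed))              # small embedding lookup table
row_emb = ages @ W_num + E_cat[countries]          # (n_rows, d_embed) row embeddings

# 2) K-means over row embeddings -> pseudo-labels (no task labels used).
def kmeans(X, k, n_iter=20, seed=0):
    local_rng = np.random.default_rng(seed)
    centers = X[local_rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

k = 4
pseudo = kmeans(row_emb, k)

# 3) Fit block connection probabilities from observed edges, then sample a
#    much smaller synthetic graph from the learned SBM.
edges = rng.integers(0, n_rows, size=(500, 2))     # toy foreign-key edges
B = np.zeros((k, k))
for u, v in edges:
    B[pseudo[u], pseudo[v]] += 1
sizes = np.bincount(pseudo, minlength=k)
B = np.clip(B / np.maximum(np.outer(sizes, sizes), 1), 0.0, 1.0)

n_syn = 20                                         # |V'| << |V|
syn_blocks = rng.integers(0, k, size=n_syn)        # block assignment per synthetic node
P = B[syn_blocks][:, syn_blocks]                   # pairwise edge probabilities
syn_adj = rng.random((n_syn, n_syn)) < P           # sampled synthetic edges

print(row_emb.shape, pseudo.shape, syn_adj.shape)
```

In a multi-table RDB, the same recipe would be repeated per node type and per schema-defined edge type; this sketch collapses that to one table and one relation for brevity.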

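The KRR‑guided objective described above admits a closed‑form inner solve, which is what lets it avoid bi‑level optimization. The sketch below uses a standard kernel ridge regression formulation on made‑up data; the RBF kernel, the regularization strength, and the names `X_syn`/`y_syn` are illustrative assumptions, and in practice the synthetic features and labels would be updated by gradient descent on this loss.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(A, B, gamma=0.5):
    # Gaussian (RBF) kernel matrix between row sets A and B.
    d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Real rows with labels (task labels or pseudo-labels), plus a much smaller
# synthetic set whose features/labels we want to distill.
X_real = rng.normal(size=(200, 8))
y_real = rng.normal(size=200)
X_syn = rng.normal(size=(20, 8))
y_syn = rng.normal(size=20)
lam = 1e-2                                  # ridge regularization strength

def krr_loss(X_syn, y_syn):
    # Closed-form KRR fit on the synthetic set, evaluated on the real set:
    # alpha = (K_ss + lam*I)^{-1} y_syn,  pred = K_rs @ alpha.
    K_ss = rbf_kernel(X_syn, X_syn)
    K_rs = rbf_kernel(X_real, X_syn)
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(X_syn)), y_syn)
    pred = K_rs @ alpha
    return ((pred - y_real) ** 2).mean()

loss = krr_loss(X_syn, y_syn)
print(loss)
```

Because the inner KRR problem is solved in closed form, the distillation objective is a single-level loss over the synthetic data, in contrast to bi-level schemes that must unroll an inner training loop.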
Extensive experiments on four real‑world RDBs—social media (users, posts, comments), e‑commerce (products, orders, reviews), finance (accounts, transactions, customers), and healthcare (patients, diagnoses, prescriptions)—demonstrate the effectiveness of T2G. Compression ratios range from 10× to 30×, reducing storage from several gigabytes to a few hundred megabytes. When training heterogeneous GNNs on the distilled graphs, classification accuracy drops by less than 1 % and regression RMSE increases by only about 1 % compared with training on the full REG. Moreover, training time is accelerated by a factor of 5–8 because the message‑passing graph is dramatically smaller. Ablation studies reveal that each component is essential: removing the clustering step or replacing SBM with random edge generation degrades performance sharply, and using only a standard MSE loss instead of the KRR objective harms downstream generalization.

The paper also discusses limitations. The choice of the number of clusters and SBM hyper‑parameters strongly influences the trade‑off between compression and accuracy, suggesting a need for automated hyper‑parameter search. Extremely high‑cardinality categorical columns or very sparse features may require more sophisticated tokenizers. Finally, the current formulation assumes a static schema; handling frequent schema evolution would be an interesting direction for future work.

In summary, T2G provides a practical solution to the scalability problem of graph‑based learning on relational data. By combining modality‑aware tokenization, clustering‑derived pseudo‑labels, SBM‑based graph synthesis, and KRR‑guided feature distillation, it creates a compact heterogeneous graph that retains the predictive power of the original database. This work opens a pathway for cost‑effective, large‑scale machine learning on relational databases across a variety of domains.

