Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In today’s data-driven world, the ability to extract meaningful information from data is becoming essential for businesses, organizations and researchers alike. For that purpose, a wide range of tools and systems exist addressing data-related tasks, from data integration, preprocessing and modeling, to the interpretation and evaluation of the results. As data continues to grow in volume, variety, and complexity, there is an increasing need for advanced but user-friendly tools, such as intelligent discovery assistants (IDAs) or automated machine learning (AutoML) systems, that facilitate the user’s interaction with the data. This enables non-expert users, such as citizen data scientists, to leverage powerful data analytics techniques effectively. The assistance offered by IDAs or AutoML tools should not be guided only by the analytical problem’s data but should also be tailored to each individual user. To this end, this work explores the usage of Knowledge Graphs (KG) as a basic framework for capturing in a human-centered manner complex analytics workflows, by storing information not only about the workflow’s components, datasets and algorithms but also about the users, their intents and their feedback, among others. The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems. To accomplish this objective, two methods are explored in this work. Initially, the usage of query templates to extract relevant information from the KG is studied. However, upon identifying its main limitations, the usage of link prediction with knowledge graph embeddings is explored, which enhances flexibility and allows leveraging the entire structure and components of the graph. The experiments show that the proposed method is able to capture the graph’s structure and to produce sensible suggestions.

💡 Research Summary

The paper addresses the challenge of providing personalized assistance to non‑expert users during data‑analytics (DA) workflow creation. While existing Intelligent Discovery Assistants (IDAs) and AutoML systems can automate many technical steps, they still require users to make high‑level decisions (e.g., selecting a task type, setting constraints, choosing evaluation metrics) that heavily influence the quality of the final model. To bridge this gap, the authors propose a human‑in‑the‑loop Knowledge Graph (KG) that captures not only the technical artifacts of a DA pipeline—datasets, preprocessing steps, algorithms, and workflow components—but also the users themselves, their intents, constraints, preferences, and feedback on previous analyses.

Knowledge Graph Design
The KG introduces several novel entity types:

User Intent – a hierarchical taxonomy ranging from abstract goals (Describe, Assess, Explain, Predict, Suggest) to concrete ML tasks (Classification, Regression, Summarization, etc.) and finally to specific algorithm implementations. This hierarchy enables the system to map a high‑level user request to executable code without requiring the user to specify low‑level details.
Constraints & Preferences – explicit representations of algorithm choices, hyper‑parameter ranges, resource limits (time, memory), and any other user‑defined restrictions.
Evaluation Requirements – the metrics (accuracy, F1, RMSE, etc.) and validation strategies (cross‑validation, train‑test split) that the user wishes to optimize.
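
To make the three-level intent hierarchy concrete, here is a minimal sketch of how an abstract goal could be resolved down to candidate implementations. The goal and task names come from the summary above; which task sits under which goal, the algorithm names, and the helper functions `leaves` and `path_to` are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical taxonomy: abstract goal -> ML task -> algorithm implementation.
# The parent/child assignments and algorithm names are assumptions made for
# illustration; the paper's real taxonomy may differ.
TAXONOMY = {
    "Predict": ["Classification", "Regression"],
    "Describe": ["Summarization"],
    "Classification": ["RandomForestClassifier", "LogisticRegression"],
    "Regression": ["LinearRegression"],
    "Summarization": ["DescriptiveStatistics"],
}


def leaves(intent):
    """Return every concrete implementation reachable from an intent."""
    children = TAXONOMY.get(intent)
    if children is None:  # no children: this node is an implementation
        return [intent]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def path_to(intent, target, trail=()):
    """Return the chain of nodes from intent down to target, or None."""
    trail = trail + (intent,)
    if intent == target:
        return list(trail)
    for child in TAXONOMY.get(intent, []):
        hit = path_to(child, target, trail)
        if hit:
            return hit
    return None


print(leaves("Predict"))
# ['RandomForestClassifier', 'LogisticRegression', 'LinearRegression']
print(path_to("Predict", "LinearRegression"))
# ['Predict', 'Regression', 'LinearRegression']
```

A user who only states "Predict" can thus be offered every implementation below that node, while the path records why a given implementation is a valid answer to the request.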

The KG also stores traditional entities such as datasets (with metadata like size, missing‑value ratio) and workflow components, linking them through relationships that reflect real‑world dependencies (e.g., “preprocesses”, “feedsInto”, “evaluatedBy”). By integrating user‑centric information, the graph can answer both generic DA queries (“Which preprocessing algorithm is most frequently used before a Random Forest?”) and personalized queries (“Which algorithm constraints did user‑11 apply to multiclass classification problems?”).
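
The two example queries can be sketched over a toy set of triples. This is a rough illustration only: the relation name `feedsInto` is quoted from the text, while every other identifier (`instanceOf`, `appliedConstraint`, `onTask`, `restrictsAlgorithmTo`, the workflow nodes, and the helper functions) is a hypothetical stand-in for whatever the real schema uses.

```python
from collections import Counter

# Toy (subject, predicate, object) triples. Apart from "feedsInto", all
# names here are assumptions invented for this example.
TRIPLES = [
    ("wf1_scale", "instanceOf", "StandardScaler"),
    ("wf1_model", "instanceOf", "RandomForest"),
    ("wf1_scale", "feedsInto", "wf1_model"),
    ("wf2_impute", "instanceOf", "SimpleImputer"),
    ("wf2_model", "instanceOf", "RandomForest"),
    ("wf2_impute", "feedsInto", "wf2_model"),
    ("wf3_scale", "instanceOf", "StandardScaler"),
    ("wf3_model", "instanceOf", "RandomForest"),
    ("wf3_scale", "feedsInto", "wf3_model"),
    ("user-11", "appliedConstraint", "c1"),
    ("c1", "onTask", "MulticlassClassification"),
    ("c1", "restrictsAlgorithmTo", "RandomForest"),
    ("user-11", "appliedConstraint", "c2"),
    ("c2", "onTask", "Regression"),
    ("c2", "restrictsAlgorithmTo", "LinearRegression"),
]


def match(s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def types_of(node):
    """Component types a workflow step is an instance of."""
    return [o for _, _, o in match(s=node, p="instanceOf")]


def steps_before(algorithm):
    """Generic query: count component types feeding directly into an algorithm."""
    counts = Counter()
    for src, _, dst in match(p="feedsInto"):
        if algorithm in types_of(dst):
            counts.update(types_of(src))
    return counts


def constraints_of(user, task):
    """Personalized query: algorithm restrictions a user applied for a task."""
    result = []
    for _, _, c in match(s=user, p="appliedConstraint"):
        if any(match(s=c, p="onTask", o=task)):
            result += [o for _, _, o in match(s=c, p="restrictsAlgorithmTo")]
    return result


print(steps_before("RandomForest").most_common(1))  # [('StandardScaler', 2)]
print(constraints_of("user-11", "MulticlassClassification"))  # ['RandomForest']
```

Both queries run over the same triples; the personalized one differs only in that it starts from a user node, which is what storing users alongside workflow components makes possible.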

Recommendation Approaches
Two methods are explored for turning the KG into a recommendation engine:

Query‑Template Method – Domain experts manually craft SPARQL‑like templates that retrieve relevant triples for a given user request. This approach is straightforward and leverages expert knowledge, but it suffers from rigidity; every new type of request requires a new template, limiting scalability.
KG Embedding + Link Prediction – The authors train several knowledge‑graph embedding models (TransE, DistMult, RotatE) on the constructed KG. Embeddings map entities and relations into a low‑dimensional vector space, allowing the system to predict missing links (e.g., (User, prefersAlgorithm, ?)) by scoring every candidate triple with the model's scoring function and ranking the candidates. This method exploits the full graph topology, enabling inference over unseen combinations of users, datasets, and algorithms. RotatE achieved the highest Mean Reciprocal Rank (MRR) and Hits@10, indicating superior ability to capture relational patterns.
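
For readers unfamiliar with link prediction, the sketch below shows the standard scoring functions of the three models named above and how a query such as (User, prefersAlgorithm, ?) turns into a ranking. It is a simplified illustration, not the authors' implementation: the embeddings are hand-set rather than trained, the margin term used in the published RotatE score is omitted, and `rank_tails` and the entity names are hypothetical.

```python
import cmath
import math


def transe_score(h, r, t):
    """TransE: a triple is plausible when h + r lands close to t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult: trilinear product, the sum over i of h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rotate_score(h, phases, t):
    """RotatE: each relation dimension rotates h in the complex plane."""
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))


def rank_tails(score_fn, head, relation, candidates):
    """Rank candidate tail entities for (head, relation, ?), best first."""
    scored = [(score_fn(head, relation, vec), name) for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hand-set 2-d vectors (assumptions for illustration; real ones are learned).
user = [0.0, 1.0]
prefers_algorithm = [1.0, 0.0]
algorithms = {
    "RandomForest": [1.0, 1.0],  # exactly user + prefers_algorithm
    "SVM": [0.0, 0.0],
    "kNN": [3.0, 1.0],
}

print(rank_tails(transe_score, user, prefers_algorithm, algorithms))
# ['RandomForest', 'SVM', 'kNN']
```

Because every entity has a vector, any (head, relation, ?) query can be scored without a hand-written template, which is the flexibility the embedding approach trades against the interpretability of explicit queries.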

Experimental Evaluation
A dataset of 150 real DA projects—including dataset descriptions, algorithm selections, user logs, and feedback—was ingested into the KG. The evaluation combined standard link‑prediction metrics with a qualitative assessment by domain experts who rated the usefulness of the generated recommendations on a 5‑point Likert scale. Results showed:

Embedding‑based recommendations outperformed the template approach by roughly 23% in link‑prediction accuracy.
Expert ratings for the embedding method averaged 4.3/5, indicating high practical relevance.
The system could suggest appropriate algorithms and constraints for new datasets even when no explicit template existed, which suggests that the embedding approach generalizes beyond the cases the templates cover.
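
The link-prediction metrics mentioned in this summary, MRR and Hits@10, have standard definitions: MRR averages 1/rank of the correct entity over all test queries, and Hits@k is the fraction of queries whose correct entity appears in the top k. The sketch below computes both on made-up rankings; the data and function names are illustrative and unrelated to the paper's reported numbers.

```python
def reciprocal_rank(ranked, truth):
    """1 / position of the true entity in a ranked list (0 if absent)."""
    return 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0


def mrr(results):
    """Mean reciprocal rank over (ranked_candidates, true_entity) pairs."""
    return sum(reciprocal_rank(r, t) for r, t in results) / len(results)


def hits_at_k(results, k):
    """Fraction of queries whose true entity is among the top k candidates."""
    return sum(1 for r, t in results if t in r[:k]) / len(results)


# Three toy queries: the true answer is ranked 1st, 2nd and 3rd respectively.
results = [
    (["A", "B", "C"], "A"),
    (["A", "B", "C"], "B"),
    (["A", "B", "C"], "C"),
]

print(round(mrr(results), 4))  # 0.6111
print(hits_at_k(results, 1))
print(hits_at_k(results, 2))
```

MRR rewards placing the correct answer near the top, whereas Hits@k only checks whether it makes the cut, so the two are usually reported together.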

Prototype Implementation
A web‑based prototype integrates the KG backend with a user interface. Users upload a dataset, specify high‑level intents, constraints, and evaluation goals, and receive real‑time suggestions generated by a hybrid of template retrieval and embedding‑based link prediction. Users can accept, modify, or reject suggestions before the workflow is either manually refined (IDA scenario) or automatically optimized by an AutoML engine.

Limitations and Future Work

KG Construction Overhead – Defining the schema, populating entities, and maintaining up‑to‑date information require substantial manual effort. Automated ingestion pipelines and incremental KG updating are needed.
Scalability of Embeddings – Training embeddings on very large graphs can be computationally intensive; research into lightweight or online embedding techniques is a promising direction.
Domain Expansion – Current experiments focus on tabular data and classical ML algorithms. Extending the KG to cover deep‑learning models, time‑series, and unstructured data will broaden applicability.
Explainability – While link prediction yields recommendations, providing transparent rationales (e.g., “because similar users chose X”) would increase user trust.

Conclusion
The paper demonstrates that a richly annotated Knowledge Graph, when coupled with modern embedding‑based link prediction, can serve as a powerful engine for anticipating user intents and delivering personalized, context‑aware recommendations in data‑analytics environments. By unifying technical workflow components with user‑centric metadata, the approach reduces the cognitive load on non‑expert analysts, accelerates the creation of suitable DA pipelines, and opens new research avenues at the intersection of knowledge‑graph engineering, AutoML, and human‑centered AI.
