More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes
Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.
💡 Research Summary
The paper addresses a critical gap in the emerging landscape of generative AI‑driven data‑science tools: while large language models (LLMs) can now automate many technical steps—visualization, modeling, natural‑language processing—their end‑to‑end pipelines often leave users blind to the intermediate decisions that shape the final output. This opacity is especially problematic in high‑stakes domains such as medicine, where users must evaluate alternative approaches, reformulate ill‑defined problems, and ensure that results are trustworthy.
To explore how transparent design can turn autonomous agents into “tools for thought,” the authors present two case studies from their own work: HACHI and Tempo. Both systems were built for clinical research but embody a common design principle: explicitly constructing intermediate artifacts that are concise, human‑readable, and intended for review and steering.
HACHI (Human+Agent Co‑design framework for Healthcare Instruments)
HACHI was created to help a pediatric emergency‑medicine researcher develop a decision‑support model for traumatic brain injury (TBI). The key challenge was that predictive signals reside in unstructured clinical notes. HACHI uses an LLM to (1) discover clinically meaningful concepts, (2) annotate those concepts across the note corpus, and (3) train a simple statistical model on the resulting concept matrix. Crucially, each of these steps produces an artifact that the clinician can inspect: the textual definition of each concept, the labeled notes, and performance metrics (e.g., AUC).
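The central intermediate artifact here is the concept matrix: one row per note, one binary column per concept, with each column backed by a human-readable definition. HACHI produces this matrix with LLM-based annotation; the following pure-Python sketch substitutes a trivial keyword match for the LLM step, with invented concept names and toy notes, purely to illustrate the shape of the artifact a clinician would inspect.

```python
# Illustrative sketch only: HACHI uses an LLM to discover and annotate
# concepts; a naive keyword match stands in for that step here, and the
# notes and concept definitions below are invented for the example.
notes = [
    "Patient presents with vomiting and brief loss of consciousness.",
    "Alert and oriented, normal neuro exam, tolerating oral intake.",
    "Loss of consciousness at scene, persistent headache on arrival.",
]

# Each concept pairs a human-readable definition (the inspectable
# artifact a clinician reviews) with the rule applied to every note.
concepts = {
    "vomiting": "Note documents one or more episodes of vomiting.",
    "loss of consciousness": "Note documents any loss of consciousness.",
}

# The concept matrix: one row per note, one binary column per concept.
# A simple statistical model would then be trained on these columns.
matrix = [
    [int(name in note.lower()) for name in concepts]
    for note in notes
]

for note_id, row in enumerate(matrix):
    print(note_id, row)
```

Because every column traces back to a plain-language definition and a set of labeled notes, a reviewer can audit exactly why any given cell is 0 or 1, which is what made the downstream problems in HACHI discoverable at all.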
Through iterative review, clinicians uncovered several hidden problems: a “brain bleed” concept leaked post‑diagnostic information, prompting removal of contaminated cases; some concepts reflected documentation style rather than patient physiology, leading to tighter phrasing constraints; and performance disparities across two hospital campuses (AUC 0.93 vs 0.71) revealed fairness concerns, which were addressed by re‑weighting the loss function. Each feedback loop required only 1–2 hours of review, yet dramatically improved model generalizability. The authors note that while HACHI already surfaces useful artifacts, further opportunities exist—e.g., exposing timestamps of notes earlier to catch leakage, or allowing direct editing of concept definitions rather than only top‑level prompts.
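The paper does not specify the exact re-weighting scheme used to close the cross-campus gap. One common approach, assumed here for illustration only, is inverse-frequency sample weighting, so that an underrepresented campus contributes as much total weight to the loss as the dominant one:

```python
# Illustrative sketch only: inverse-frequency weights per campus, an
# assumed (not necessarily the authors') re-weighting scheme. With
# weight = n / (k * count[campus]), each campus's total weight is equal.
from collections import Counter

# Toy data: campus B is heavily underrepresented.
campus_per_note = ["A", "A", "A", "A", "B"]

counts = Counter(campus_per_note)
n, k = len(campus_per_note), len(counts)

weights = [n / (k * counts[c]) for c in campus_per_note]

print(weights)  # each A note gets 0.625, the single B note gets 2.5
```

These per-sample weights would then be passed to the model's loss so that errors on the minority campus are penalized proportionally more, pushing the model toward consistent performance across sites.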
Tempo
Tempo tackles a different bottleneck: extracting temporal event data from electronic health records (EHRs). Domain experts often struggle to translate clinical questions into complex SQL queries. Tempo introduces TempoQL, a compact, human‑readable query language, and an AI Assistant that translates natural‑language requests into TempoQL statements. The generated query and its result set become the intermediate artifacts. Users can verify that the correct events and time windows were selected, then edit the query or provide corrective feedback to the assistant.
User studies showed that TempoQL acted as a “cognitive scaffold”: a product manager could compare two queries that aggregated events over different intervals, reason about the discrepancy, and propose a hybrid aggregation strategy. Moreover, the LLM produced correct TempoQL queries 2.5× more often than equivalent SQL, despite only seeing the TempoQL syntax at inference time, suggesting that a simpler, readable language not only aids users but also improves the model’s own accuracy.
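Actual TempoQL syntax is not reproduced here, but the kind of discrepancy a readable query surfaces can be shown with a toy aggregation: the same events counted over two different time windows yield different answers, and because both window choices are visible, a reviewer can see exactly why.

```python
# Illustrative sketch only (not TempoQL syntax): two "queries" that
# aggregate the same events over different intervals, the kind of
# analytical choice a readable intermediate makes easy to compare.
events = [2, 5, 7, 11, 23, 26]  # toy event times, hours after admission

def count_in_window(events, start, end):
    """Count events with start <= t < end."""
    return sum(start <= t < end for t in events)

# Query 1: events within the first 24 hours.
first_day = count_in_window(events, 0, 24)
# Query 2: events within the first 12 hours only.
first_half_day = count_in_window(events, 0, 12)

print(first_day, first_half_day)
```

Seeing both window definitions side by side is what lets a non-programmer reason about the discrepancy and propose a hybrid strategy, rather than debugging opaque SQL.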
Discussion and Open Questions
The authors argue that transparency alone is insufficient; what matters is designing transparency for the user. They propose three research questions for the HCI community:
- When and how often should agents surface intermediates? The trade‑off is between exhaustive visibility (which may overwhelm users) and selective exposure (which must capture the stages where human expertise, social values, or problem formulation are most influential).
- How should intermediate artifacts be presented? The paper contrasts three successful formats: (a) a precise query language (TempoQL), (b) generated natural‑language prompts (HACHI concept extraction), and (c) input‑output pairs (concept labels). Effective designs are concise, require no coding expertise, and make key analytical choices explicit. Future work could borrow from the Predictability‑Computability‑Stability (PCS) framework to present multiple analysis variants for robustness checks.
- How can we evaluate intermediate artifacts? Evaluation must go beyond downstream performance; it should assess whether artifacts improve user understanding, reduce error rates, and support ethical decision‑making. Controlled user studies, longitudinal field deployments, and metric suites that capture cognitive load, trust, and task success are all suggested.
Conclusion
By grounding AI data‑science pipelines in intentionally crafted intermediate artifacts, HACHI and Tempo demonstrate that generative AI can shift from a “black‑box answer generator” to an interactive partner that amplifies human reasoning. This approach is especially vital when problems are ill‑specified, expert intuition is essential, or the credibility of results hinges on methodological transparency. The paper calls on HCI researchers to develop systematic design guidelines, evaluation protocols, and adaptive agents that can decide autonomously when to solicit human input, thereby advancing the next generation of AI‑augmented data‑science tools.