"Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in Machine Learning Systems
Ensuring data quality in machine learning (ML) systems has become increasingly complex as regulatory requirements expand. In the European Union (EU), frameworks such as the General Data Protection Regulation (GDPR) and the Artificial Intelligence Act (AI Act) articulate data quality requirements that closely parallel technical concerns in ML practice, while also extending to legal obligations related to accountability, risk management, and human rights protection. This paper presents a qualitative interview study with EU-based data practitioners working on ML systems in regulated contexts. Through semi-structured interviews, we investigate how practitioners interpret regulatory-aligned data quality, the challenges they encounter, and the supports they identify as necessary. Our findings reveal persistent gaps between legal principles and engineering workflows, fragmentation across data pipelines, limitations of existing tools, unclear responsibility boundaries between technical and legal teams, and a tendency toward reactive, audit-driven quality practices. We also identify practitioners’ needs for compliance-aware tooling, clearer governance structures, and cultural shifts toward proactive data governance.
💡 Research Summary
The paper investigates how data quality practices in machine-learning (ML) systems intersect with the European Union's regulatory landscape, specifically the General Data Protection Regulation (GDPR) and the Artificial Intelligence Act (AI Act). The authors introduce the notion of "regulatory-aligned data quality": the extent to which technical data-quality activities satisfy both engineering standards and legal obligations. To explore this concept in practice, they conducted semi-structured interviews with fourteen EU-based practitioners who work on ML projects that process personal data or operate in high-risk, regulated domains. Participants spanned a range of roles, including data collectors, data engineers, data scientists, ML engineers, and compliance officers.
The study is organised around three research questions: (1) how practitioners interpret and operationalise regulatory-aligned data-quality dimensions; (2) which tools, methods, and infrastructures they currently employ and what additional capabilities they need; and (3) how collaboration patterns between technical and legal/compliance teams influence implementation. The interview protocol covered participants' background, concrete compliance-driven data-quality incidents, scenario-based vignettes derived from GDPR and AI Act provisions, and reflections on cross-team collaboration.
Key findings coalesce around four major themes.

First, there is a persistent interpretation gap between high-level legal principles (e.g., "accuracy", "purpose limitation", "non-bias") and concrete engineering actions. Practitioners often reduce GDPR's "accuracy" to simple error detection, overlooking the need for real-time rectification mechanisms that support data-subject rights.

Second, pipeline fragmentation hampers end-to-end compliance. Data ingestion, transformation, feature engineering, model training, deployment, and monitoring are handled by disparate tools and teams, making it difficult to propagate regulatory constraints consistently across stages.

Third, tool limitations are evident: existing data-quality frameworks such as Great Expectations or Deequ can automate checks but lack built-in support for generating legally admissible audit evidence or linking quality metrics to GDPR-required metadata (e.g., consent status, retention schedules). Data lineage and catalog solutions (Apache Atlas, DataHub) provide technical traceability but do not capture regulatory attributes such as purpose tags or lawful-basis annotations. Consequently, practitioners resort to manually stitching evidence together, which is error-prone and resource-intensive.

Fourth, responsibility ambiguity emerges in organisational structures. Technical teams focus on system stability, while compliance officers concentrate on policy interpretation; without a clearly defined "Data Quality Owner" or similar role, accountability for regulatory breaches is diffuse, leading to ad hoc, audit-driven reactions rather than proactive quality governance.
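The tooling gap in the third theme is easy to make concrete. The sketch below is our illustration, not an artefact from the paper and not the API of Great Expectations, Deequ, Apache Atlas, or DataHub: it attaches the regulatory attributes interviewees found missing (purpose tag, lawful basis, consent status, retention schedule) to a data asset and combines them with a technical quality result into a single audit record. All names, fields, and rules are hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import date
import json

# Hypothetical regulatory metadata that, per the interviewees, existing
# data-quality and catalog tools cannot attach to data assets.
@dataclass
class RegulatoryMetadata:
    purpose: str            # purpose-limitation tag, e.g. "credit_scoring"
    lawful_basis: str       # e.g. "consent", "contract", "legitimate_interest"
    consent_verified: bool  # whether consent records back this dataset
    retention_until: date   # delete or re-justify the asset after this date

@dataclass
class DataAsset:
    name: str
    quality_checks_passed: bool  # result of a technical tool's checks
    regulatory: RegulatoryMetadata

def compliance_gate(asset: DataAsset, today: date) -> dict:
    """Combine a technical quality result with regulatory attributes and
    emit an audit record. Both the rule set and the record format are
    illustrative assumptions, not the paper's proposal."""
    violations = []
    if not asset.quality_checks_passed:
        violations.append("technical data-quality checks failed")
    if asset.regulatory.lawful_basis == "consent" and not asset.regulatory.consent_verified:
        violations.append("consent-based asset lacks verified consent records")
    if today > asset.regulatory.retention_until:
        violations.append("retention period expired")
    record = {
        "asset": asset.name,
        "purpose": asset.regulatory.purpose,
        "lawful_basis": asset.regulatory.lawful_basis,
        "compliant": not violations,
        "violations": violations,
        "checked_on": today.isoformat(),
    }
    # The audit evidence practitioners currently stitch together by hand.
    print(json.dumps(record, indent=2))
    return record

# Hypothetical asset: quality checks pass, but consent is unverified.
asset = DataAsset(
    name="loan_applications_v3",
    quality_checks_passed=True,
    regulatory=RegulatoryMetadata(
        purpose="credit_scoring",
        lawful_basis="consent",
        consent_verified=False,
        retention_until=date(2026, 12, 31),
    ),
)
compliance_gate(asset, date.today())  # flags the missing consent verification
```

The point is the linkage itself: a quality verdict and regulatory metadata evaluated together and logged in one place, which is exactly what interviewees said current tools do not provide.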
The interviewees also described a culture of reactive, audit‑driven compliance: when a new regulatory requirement surfaces, teams scramble to produce checklists and temporary fixes instead of embedding the requirement into the continuous‑integration/continuous‑deployment (CI/CD) pipeline. This reactive stance perpetuates the “detective work we shouldn’t have to do” sentiment expressed in the title.
Based on these observations, the authors propose three avenues for improvement. (1) Compliance‑aware tooling: develop or extend data‑quality platforms to map legal obligations directly onto measurable metrics, automatically generate audit logs, and embed regulatory metadata (purpose, lawful basis, retention) into data assets. (2) Governance redesign: institutionalise a dedicated role (e.g., Regulatory Data‑Quality Steward) that bridges legal and technical teams, clarifies responsibility boundaries, and oversees the lifecycle of compliance artefacts. (3) Cultural shift: promote proactive data governance through training, shared vocabularies, and the integration of regulatory checks into MLOps pipelines, thereby moving from post‑hoc audits to continuous, evidence‑based compliance.
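As a thought experiment for proposal (1), the sketch below shows one way legal obligations could be mapped onto measurable metrics and evaluated as a CI/CD gate, so that a violation blocks the pipeline instead of surfacing in a later audit. The obligations, metric names, thresholds, and statistics are all hypothetical assumptions, not the authors' specification.

```python
import sys
from typing import Callable, Dict, Tuple

def rectification_latency_days(stats: dict) -> float:
    # Days for a data-subject correction to propagate end to end
    # (GDPR Art. 16 rectification; the metric name is our assumption).
    return stats["rectification_latency_days"]

def purpose_tag_coverage(stats: dict) -> float:
    # Fraction of data assets carrying a purpose-limitation tag.
    return stats["tagged_assets"] / stats["total_assets"]

# Hypothetical table: obligation -> (metric function, acceptance predicate).
OBLIGATIONS: Dict[str, Tuple[Callable[[dict], float], Callable[[float], bool]]] = {
    "GDPR accuracy / rectification": (rectification_latency_days, lambda v: v <= 30.0),
    "GDPR purpose limitation": (purpose_tag_coverage, lambda v: v >= 0.99),
}

def run_compliance_gate(stats: dict) -> int:
    """Evaluate each obligation's metric; a non-zero return code blocks the
    pipeline, replacing post-hoc audit scrambles with a continuous check."""
    failed = False
    for obligation, (metric, accept) in OBLIGATIONS.items():
        value = metric(stats)
        ok = accept(value)
        print(f"{'PASS' if ok else 'FAIL'}  {obligation}: {value:.2f}")
        failed = failed or not ok
    return 1 if failed else 0

if __name__ == "__main__":
    # Illustrative pipeline statistics; a real gate would read these from
    # monitoring or catalog systems rather than hard-coding them.
    stats = {"rectification_latency_days": 45.0,
             "tagged_assets": 970, "total_assets": 1000}
    sys.exit(run_compliance_gate(stats))
```

The design choice worth noting is that each obligation is paired with an explicit, testable threshold rather than a prose policy, which is what would let such a gate live inside an MLOps pipeline.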
In sum, the paper provides the first qualitative, practitioner-centred account of how EU data-protection and AI regulations are operationalised, or fail to be, within modern ML pipelines. It highlights concrete mismatches between legal expectations and engineering practice, identifies systemic tooling and organisational shortcomings, and outlines research and industry directions to close the gap. The insights are relevant not only for EU stakeholders but also for any jurisdiction where data-centric AI regulation is emerging, offering a roadmap for building more transparent, accountable, and legally compliant ML systems.