Machine Learning Practitioners' Views on Data Quality in Light of EU Regulatory Requirements: A European Online Survey

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Understanding how data quality aligns with regulatory requirements in machine learning (ML) systems presents a critical challenge for practitioners navigating the evolving EU regulatory landscape. To address this, we first propose a practical framework aligning established data quality dimensions with specific EU regulatory requirements. Second, we conduct a comprehensive online survey with over 180 EU-based data practitioners, investigating their approaches, key challenges, and unmet needs when ensuring data quality in ML systems in line with regulatory requirements. Our findings highlight crucial gaps between current practices and regulatory expectations, underscoring practitioners' need for more integrated data quality tools and closer collaboration between technical and legal practitioners. These insights inform recommendations for bridging technical expertise and regulatory compliance, ultimately fostering responsible and trustworthy ML deployments.


💡 Research Summary

The paper tackles the pressing problem of aligning data‑quality management with European Union regulatory obligations in machine‑learning (ML) systems. It proceeds in two main steps. First, the authors construct a practical “regulation‑quality mapping framework” that links well‑established data‑quality dimensions (intrinsic, contextual, representational, accessibility) with concrete requirements from the General Data Protection Regulation (GDPR) and the forthcoming Artificial Intelligence Act (AI Act). By translating legal clauses such as GDPR Art. 5(1)(d) (“accuracy and, where necessary, up‑to‑date”) and AI Act high‑risk provisions into measurable quality metrics (e.g., label accuracy, missing‑value rates, update frequency, metadata completeness, traceability), the framework gives practitioners a common vocabulary for turning abstract legal mandates into actionable data‑handling practices.
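To make the idea of a regulation-quality mapping concrete, it could be represented in code roughly as below. This is an illustrative sketch, not the paper's actual framework: the metric names and the linkage of AI Act Art. 10 to metadata completeness are assumptions for demonstration purposes.

```python
# Hypothetical sketch: each entry links a data-quality dimension and a
# measurable metric to the legal provision that motivates it.
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityRequirement:
    dimension: str   # e.g. "intrinsic", "contextual", "representational"
    metric: str      # measurable quality indicator (illustrative names)
    provision: str   # legal clause the metric operationalises

MAPPING = [
    QualityRequirement("intrinsic", "label_accuracy", "GDPR Art. 5(1)(d)"),
    QualityRequirement("intrinsic", "missing_value_rate", "GDPR Art. 5(1)(d)"),
    QualityRequirement("contextual", "update_frequency_days", "GDPR Art. 5(1)(d)"),
    QualityRequirement("representational", "metadata_completeness", "AI Act Art. 10"),
]

def metrics_for(provision: str) -> list[str]:
    """Return the quality metrics linked to a given legal provision."""
    return [r.metric for r in MAPPING if r.provision == provision]

print(metrics_for("GDPR Art. 5(1)(d)"))
# → ['label_accuracy', 'missing_value_rate', 'update_frequency_days']
```

A structure like this gives both legal and engineering teams a shared, queryable vocabulary: compliance officers can ask which metrics cover a clause, and engineers can ask which clauses a failing metric implicates.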

Second, the authors empirically validate the framework through a large‑scale online survey of more than 180 EU‑based data practitioners drawn from a variety of sectors (healthcare, finance, public services, etc.) and roles (data engineers, ML scientists, product managers, compliance officers). The questionnaire, designed according to a design‑science research methodology, probes respondents’ understanding of data‑quality concepts, their operationalisation of regulatory requirements, the tools they employ, and the nature of cross‑functional collaboration.

Key findings include:

  1. Awareness‑implementation gap – While roughly two‑thirds of respondents recognise accuracy, completeness and timeliness as core quality attributes, only a minority can map these to specific GDPR or AI Act clauses. Practitioners lack clear guidance on converting legal language into concrete quality indicators.

  2. Tooling deficit – Automated data‑validation and monitoring solutions are used by just over a third of participants; most rely on ad‑hoc scripts or manual checks, exposing projects to human error and compliance risk.

  3. Collaboration shortfall – Formal, recurring interaction between legal/compliance teams and data‑engineering groups occurs in only about 22 % of organisations, leading to siloed interpretations of the law and missed opportunities for early compliance integration.

  4. Priority tension – Practitioners prioritise data volume and diversity to boost model performance, whereas regulators stress data minimisation and purpose limitation. This creates strategic trade‑offs that are rarely resolved in current project planning.

  5. Differential emphasis on quality dimensions – Intrinsic (accuracy, consistency) and contextual (relevance, timeliness) dimensions dominate day‑to‑day work because they directly affect model metrics. Representational (documentation) and accessibility (security, availability) dimensions receive attention mainly when explicitly tied to regulatory mandates such as transparency and accountability.

From these insights the authors derive several actionable recommendations:

  • Standardised checklists and metadata schemas derived from the mapping framework, to be embedded in CI/CD pipelines for systematic compliance verification.
  • Institutionalised “legal‑data liaison” roles or regular cross‑functional forums that keep regulatory interpretations current and ensure that technical constraints are communicated back to compliance officers.
  • Development of automated compliance‑aware quality tooling that continuously monitors GDPR‑relevant properties (e.g., data freshness, minimisation) and AI Act risk indicators, reducing reliance on manual processes.
  • Dynamic prioritisation matrices that balance model performance goals against regulatory risk scores, enabling project managers to make informed trade‑offs early in the ML lifecycle.
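A compliance-aware quality gate of the kind recommended above might look like the following minimal sketch, assuming a simple record-based dataset; the function names and thresholds are invented for illustration and are not taken from the paper.

```python
# Hypothetical CI/CD quality gate: compute simple dataset metrics and report
# violations when they exceed thresholds tied to regulatory concerns.
from datetime import datetime, timezone

def missing_value_rate(records: list[dict]) -> float:
    """Fraction of field values that are None across all records (completeness)."""
    total = sum(len(r) for r in records)
    missing = sum(1 for r in records for v in r.values() if v is None)
    return missing / total if total else 0.0

def data_age_days(last_updated: datetime) -> float:
    """Days since the dataset was last refreshed (freshness / up-to-dateness)."""
    return (datetime.now(timezone.utc) - last_updated).total_seconds() / 86400

def compliance_gate(records: list[dict], last_updated: datetime,
                    max_missing: float = 0.05, max_age_days: float = 30) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    violations = []
    if missing_value_rate(records) > max_missing:
        violations.append("missing-value rate exceeds threshold (completeness)")
    if data_age_days(last_updated) > max_age_days:
        violations.append("dataset is stale (accuracy / up-to-dateness)")
    return violations
```

Embedded in a pipeline, such a gate would fail a build before a stale or incomplete dataset reaches training, turning the mapping framework's metrics into enforceable checks rather than documentation alone.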

The paper concludes that data‑quality management and EU regulatory compliance must be treated as a unified, iterative process rather than separate silos. The proposed framework, together with the empirical evidence, offers a concrete roadmap for academia, industry, and policymakers to bridge the current gap. The authors suggest future work to pilot the checklists and tooling in real‑world settings, evaluate their impact on compliance outcomes, and extend the analysis to other jurisdictions such as the United States or Asian data‑protection regimes.

