Data Readiness Levels
Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data, and the eventual interpretation of data through machine learning or other approaches. In project reporting, a major challenge is in encapsulating these problems and enabling goals to be built around the processing of data. Project overruns can occur due to failure to account for the amount of time required to curate and collate. But to understand these failures we need to have a common language for assessing the readiness of a particular data set. This position paper proposes the use of data readiness levels: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.
💡 Research Summary
The paper “Data Readiness Levels” addresses a chronic problem in data‑driven projects: while much attention is given to model development, the preparation of the underlying data is often overlooked, leading to schedule overruns, budget blowouts, and sub‑optimal analytical outcomes. To remedy this, the authors propose a structured taxonomy called Data Readiness Levels (DRL), inspired by NASA’s Technology Readiness Levels, that categorizes the state of a dataset along three hierarchical bands—C, B, and A—each with sub‑levels (e.g., C4‑C1, B4‑B1, A4‑A1).
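The band/sub-level ordering described above can be sketched as a small data type. This is a hypothetical encoding for illustration, not something defined in the paper; the name `DataReadinessLevel` and the tuple representation are assumptions.

```python
from enum import Enum
from functools import total_ordering


@total_ordering
class DataReadinessLevel(Enum):
    """Hypothetical encoding of the DRL bands (C lowest, A highest).

    Each member stores (band, sub_level). Within a band, sub-level 1 is
    the most ready, so C1 sits just below B4 in the overall ordering.
    """
    C4 = ("C", 4); C3 = ("C", 3); C2 = ("C", 2); C1 = ("C", 1)
    B4 = ("B", 4); B3 = ("B", 3); B2 = ("B", 2); B1 = ("B", 1)
    A4 = ("A", 4); A3 = ("A", 3); A2 = ("A", 2); A1 = ("A", 1)

    def __lt__(self, other):
        band_rank = {"C": 0, "B": 1, "A": 2}
        # Higher band outranks lower band; within a band, a smaller
        # sub-level number means the data are more ready.
        return (band_rank[self.value[0]], -self.value[1]) < (
            band_rank[other.value[0]], -other.value[1]
        )
```

A project tracker built on such a type could, for instance, refuse to schedule modelling work on any dataset still below B1.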
Band C concerns the mere existence and accessibility of data. The lowest sub‑level, C4, represents “hearsay” data: a belief that data exist without any verification of format, storage location, or legal/ethical clearance. Progression to C1 requires that the data be stored in a machine‑readable form (e.g., CSV, relational database), that privacy, copyright, and security constraints be resolved, and that the dataset be ingestible by analysis tools such as pandas. This band captures the classic “data munging” and “data wrangling” activities that often dominate early project effort.
Band B evaluates the fidelity, completeness, and representation of data that are already accessible. At this stage analysts must audit missing‑value handling, error propagation, sensor noise characterization, unit consistency, and the provenance of collection protocols (randomization, sampling bias, etc.). Exploratory Data Analysis (EDA) and visualisation are emphasized to make data “vivid” for non‑technical stakeholders. By B1, the analyst has a clear mental model of the dataset’s limitations, knows how the recorded values map to the intended measurements, and can trust that the data have not been corrupted during transformation (e.g., spreadsheet mis‑sorting, gene‑name‑to‑date conversion).
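The mechanical parts of a B-band audit, such as missingness and uninformative columns, lend themselves to a short script. This is a sketch of one such audit, assuming pandas; the function name `band_b_audit` and the chosen checks are illustrative, and the harder B-band questions (sensor noise, sampling bias, provenance) still require human judgement.

```python
import pandas as pd


def band_b_audit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical B-band audit: surface per-column gaps and suspect columns.

    Automates only the easy parts of the paper's B band: how much is
    missing, what types are stored, and which columns carry no signal.
    """
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),   # fraction of missing values
        "n_unique": df.nunique(dropna=True),
    })
    # A column with at most one distinct value is uninformative.
    report["constant"] = report["n_unique"] <= 1
    return report
```

Run against a freshly ingested table, such a report gives a first EDA summary to discuss with the data-generating collaborators.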
Band A focuses on the alignment of the dataset with a specific analytical task or business question. Here the context is defined: a task might be “predict user churn,” “evaluate drug efficacy,” or “validate rocket engine performance.” Only when a dataset is explicitly linked to such a task, and any necessary annotation, labeling, or supplemental collection is performed, does it achieve A1 status. Importantly, a dataset can be A1 for one task while remaining at B2 or lower for another, underscoring the task‑centric nature of this band.
The authors illustrate the framework with a concrete case study: the migration and re‑structuring of the Proceedings of Machine Learning Research (PMLR) archive. Initially the data were at C4—information about papers existed only as scattered PDFs and BibTeX files, with no systematic access. The team spent two days creating bibliographic files for the first 26 volumes, three days converting the archive to a new web format, and additional evenings cleaning the resulting CSV files so they could be loaded into pandas. These efforts moved the data from C4 to C1 (machine‑readable, ethically cleared). Subsequent work on metadata validation, missing‑value handling, and task definition (e.g., building a citation‑network analysis) advanced the dataset to B1 and finally to A1 for specific research questions. The paper provides GitHub commit links and a rough estimate of person‑hours, demonstrating that the “cost of data readiness” can dominate project budgets.
By formalizing data readiness, the authors argue that project managers can:
- Quantify and communicate data‑related risk early, avoiding surprise overruns.
- Allocate resources (budget, personnel, tooling) based on the current DRL, ensuring that data‑centric work is not an afterthought.
- Perform ethical and legal checks when merging datasets, as required at the C‑level.
The paper also acknowledges limitations. The definitions of sub‑levels are somewhat subjective, lacking domain‑specific metrics; the cost model for moving between levels is not standardized; and the framework does not yet provide a quantitative “readiness score” that could be automatically computed. Future work is suggested to develop checklists, cost‑estimation models, and cross‑domain validation (e.g., healthcare, finance, manufacturing).
In summary, Data Readiness Levels offer a pragmatic, language‑based tool to make data quality, accessibility, and contextual suitability visible to all stakeholders. By integrating DRL into project planning, organizations can better anticipate the hidden work of data curation, reduce schedule risk, and ultimately improve the reliability of downstream machine‑learning or statistical analyses.