
Data quality

Compliance Info

Below we map this engineering practice to the articles of the AI Act that it helps to fulfil.

  • Art. 10 (Data and Data Governance), in particular:
    • Art. 10(2)(c): Maintaining documented preprocessing routines for labelling, cleaning, imputation, and enrichment keeps every data preparation step traceable.
    • Art. 10(3): Routine validation, reporting, and drift monitoring ensure that training, validation, and test datasets remain complete, accurate, and representative.

Motivation

Art. 10(3) of the AI Act demands a certain quality of the data used for training and evaluating models; in particular, these datasets should be:

  • relevant,
  • sufficiently representative,
  • complete, and
  • free of errors.

To achieve these qualities, different techniques are available at different steps of the system lifecycle.

Implementation Notes

Note that the techniques discussed in this section focus on technical approaches to ensuring data quality. They need to be accompanied by organizational and governance measures to be fully effective.

Data Preprocessing

Preparing data for training begins with making each transformation explicit, measurable, and reproducible. This enables auditors to understand how raw inputs were converted into model-ready datasets.

Handle Missing or Incomplete Data

Profile the dataset to surface null or placeholder values, then quantify whether the missingness introduces bias. Choose remediation techniques—such as interpolation, mean/mode imputation, or domain-specific defaults—and document them so the same logic applies across training and evaluation runs.
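
A minimal sketch of this workflow with pandas; the column names (`sensor_reading`, `segment`, `country`) are hypothetical, and the remediation rules stand in for whatever your domain experts agree on:

```python
import pandas as pd

def profile_missingness(df: pd.DataFrame) -> pd.Series:
    """Share of missing values per column, sorted descending."""
    return df.isna().mean().sort_values(ascending=False)

def missingness_by_group(df: pd.DataFrame, column: str, group: str) -> pd.Series:
    """Compare missingness across groups to spot whether the gaps introduce bias."""
    return df.groupby(group)[column].apply(lambda s: s.isna().mean())

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented remediation rules in one place so the
    same logic runs for both training and evaluation data."""
    out = df.copy()
    # Ordered numeric series: fill gaps by interpolation.
    out["sensor_reading"] = out["sensor_reading"].interpolate()
    # Categorical column: mode imputation.
    out["segment"] = out["segment"].fillna(out["segment"].mode()[0])
    # Domain-specific default agreed with subject-matter experts.
    out["country"] = out["country"].fillna("unknown")
    return out
```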

Enforce Consistency and Schemas

Apply schema validation to tabular data to guarantee types, ranges, and required fields stay aligned across ingestion sources. Deduplicate records, normalize formats (for example, timestamps), and ensure foreign keys or categorical labels stay within expected vocabularies before the data enters downstream pipelines.
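
A sketch using the pandera library for schema validation; the order-table fields below are hypothetical placeholders for your own data:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for an orders table; adapt columns and checks to your data.
orders_schema = pa.DataFrameSchema(
    {
        "order_id": pa.Column(str, unique=True),
        "amount": pa.Column(float, pa.Check.ge(0)),
        "status": pa.Column(str, pa.Check.isin(["open", "shipped", "returned"])),
        "created_at": pa.Column("datetime64[ns]"),
    },
    strict=True,  # reject unexpected columns from new ingestion sources
)

def normalize_and_validate(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset="order_id")            # deduplicate records
    df["created_at"] = pd.to_datetime(df["created_at"])   # normalize timestamps
    return orders_schema.validate(df)  # raises SchemaError on any violation
```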

Keep Pipelines Reproducible

Automate preprocessing in versioned workflows instead of manual notebooks. Use a workflow orchestrator or data pipeline tool that tracks parameters, input snapshots, and code revisions so the same preprocessing steps can be replayed during audits or incident investigations.
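
Orchestrators such as Airflow or DVC handle this tracking for you; as a stdlib-only illustration of what needs to be captured, here is a sketch that records an input snapshot hash, the run parameters, and the code revision (assuming the pipeline lives in a git repository):

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def record_run(input_path: str, params: dict, out_dir: str = "runs") -> Path:
    """Persist everything needed to replay this preprocessing run."""
    digest = hashlib.sha256(Path(input_path).read_bytes()).hexdigest()
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_file": input_path,
        "input_sha256": digest,   # pin the exact input snapshot
        "params": params,         # preprocessing parameters used
        "code_revision": commit,  # code version that produced the output
    }
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"run-{digest[:12]}.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```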

Data Quality Validation

Once preprocessing is locked down, validate that the resulting datasets remain faithful to reality and behave as expected over time.

Validate Against Ground Truth

Regularly sample records and compare them with verified business systems or domain experts. This check confirms that labelling, enrichment, and cleaning steps did not introduce errors and that sensitive attributes stay accurate.
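
One possible shape for such a spot check in pandas, assuming a shared key column and a verified reference extract pulled from the system of record:

```python
import pandas as pd

def spot_check(dataset: pd.DataFrame, reference: pd.DataFrame,
               key: str, fields: list[str],
               n: int = 200, seed: int = 0) -> pd.Series:
    """Sample records and report the mismatch rate per field against
    a verified reference (ground truth)."""
    sample = dataset.sample(n=min(n, len(dataset)), random_state=seed)
    merged = sample.merge(reference, on=key, suffixes=("", "_ref"))
    return pd.Series(
        {f: (merged[f] != merged[f + "_ref"]).mean() for f in fields},
        name="mismatch_rate",
    )
```

Mismatch rates above an agreed threshold should be escalated to domain experts for review.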

Automate Accuracy Checks and Reporting

Run automated validation suites to catch logical conflicts (such as negative ages or impossible category combinations) and flag statistical outliers via z-scores, the interquartile range, or model-based anomaly detection. Feed the summaries into the regular data quality reports that analysts review, and attach them to the data governance documentation to keep stakeholders informed.
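
For illustration, two such checks in pandas; the rules on `age` and `status` are hypothetical examples of hard business constraints:

```python
import pandas as pd

def logical_checks(df: pd.DataFrame) -> dict:
    """Count records that violate hard business rules."""
    return {
        "negative_age": int((df["age"] < 0).sum()),
        # Impossible category combination: minors marked as retired.
        "minor_retired": int(((df["age"] < 16) & (df["status"] == "retired")).sum()),
    }

def flag_outliers(s: pd.Series) -> pd.Series:
    """Flag values outside 1.5x the interquartile range or with |z| > 3."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    z = (s - s.mean()) / s.std()
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr) | (z.abs() > 3)
```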

Monitor Drift and Trigger Remediation

Monitor the model over time and schedule periodic validation runs that compare live data against historical baselines. When the reports show significant drift in the data distribution, trigger investigation or retraining workflows.
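
One common way to implement such a comparison is a two-sample Kolmogorov-Smirnov test per numeric column; a sketch with SciPy, where the significance level `alpha` is a placeholder to tune per use case:

```python
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(baseline: pd.DataFrame, live: pd.DataFrame,
                 columns: list[str], alpha: float = 0.01) -> pd.DataFrame:
    """Compare live data against the historical baseline; a small p-value
    signals that a column's distribution has drifted."""
    rows = []
    for col in columns:
        stat, p = ks_2samp(baseline[col].dropna(), live[col].dropna())
        rows.append({"column": col, "ks_stat": stat,
                     "p_value": p, "drifted": p < alpha})
    return pd.DataFrame(rows)

# Rows with drifted=True should trigger the investigation or retraining workflow.
```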

Key Technologies
