Data, Labelling, Provenance, and Leakage Testing
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.
Data, Labelling, Provenance, and Leakage Testing video briefing
A focused explanation of chapter 3, turning the AI testing theory into concrete validation checks.
Briefing focus: Module opening. This is a structured lesson briefing; real video or audio can be added later as a media source.
Estimated time: 9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.
Audience. QA engineers collaborating with data teams on datasets, labels, features, and test data strategy.
Why this matters. Most AI failures take shape quietly inside the data. If testers cannot question data quality, they will only discover model issues after expensive training runs or production harm.
ISTQB CT-AI mapping. CT-AI 4.1-4.5, 7.3
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in data, labelling, provenance, and leakage testing.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artefact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Insurance
The claims model trained on tomorrow's information
Situation. A model predicts which claims need specialist review. A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Claims prioritisation workflow |
| AI feature | A model predicts which claims need specialist review. |
| Failure or risk | A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results. |
| Testing challenge | The model appeared excellent in validation but failed in production because the leaked feature disappeared at decision time. |
| Tester response | The tester introduced feature availability checks, provenance review, label-quality sampling, and train/validation/test partition rules. |
| Evidence required | Datasheet, feature availability matrix, leakage checklist, label audit, split strategy, and privacy review. |
| Business decision | Reject the model evaluation and require retraining with only decision-time features. |
Visual flow
Learning path
1. Start Here (5 min): Outcome, CT-AI exam relevance, and the leakage scenario.
2. Learn (22 min): Provenance, labels, representativeness, leakage, privacy, and consent.
3. See It (10 min): Feature timing and dataset evidence breakdown.
4. Try It (16 min): Build a datasheet and leakage review for the claims model.
5. Recall and Apply (10 min): Exam traps, active recall, and the portfolio artefact.
Decision-time feature availability
Data is testable only when the team can prove where each field came from, when it existed, and whether it was available at prediction time.
Example
The claims model used a feature created after manual review, so validation looked excellent while production could not use the same signal.
Mistake
Trusting high validation scores before checking feature timing, split strategy, and provenance.
Evidence
Feature availability matrix, datasheet, lineage record, split policy, leakage checklist, and data owner sign-off.
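The availability claim can be checked mechanically by comparing each feature's population timestamp against the decision timestamp. A minimal sketch in pandas; the event log and all column names are illustrative assumptions, not the module's dataset:

```python
import pandas as pd

# Illustrative event log: when the decision was made and when each
# feature value was first populated (all names here are assumed).
events = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "decision_time": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
    "claim_amount_time": pd.to_datetime(["2024-02-27", "2024-02-28", "2024-03-01"]),
    "specialist_flag_time": pd.to_datetime(["2024-03-05", "2024-03-06", "2024-03-07"]),
})

# A feature is safe only if it existed at or before decision time for
# every record; anything later is a leakage candidate for the review.
availability = {
    col: bool((events[col] <= events["decision_time"]).all())
    for col in ["claim_amount_time", "specialist_flag_time"]
}
print(availability)  # {'claim_amount_time': True, 'specialist_flag_time': False}
```

Each entry of the resulting dictionary is one row of the feature availability matrix: a feature that fails the check mirrors the leaked specialist-review signal in the scenario.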
Worked example: Rejecting a leaked evaluation
Scenario. A claims prioritisation model reports excellent validation performance, but one high-importance feature is populated only after a specialist has reviewed the claim.
Reasoning. The feature leaks future information. The reported performance does not represent the live decision point, so the model cannot be approved from that evaluation.
Model answer. Reject the evaluation, remove decision-time unavailable features, rerun train/validation/test splits, and require a leakage review before release discussion.
Try it: Build the datasheet and leakage review
Prompt. Use the insurance claims scenario to document whether the dataset and features are suitable for release evaluation.
Learner action. Record data source, collection window, label policy, feature timing, split method, privacy handling, leakage risks, and release recommendation.
Expected output. `dataset-datasheet-and-leakage-review.md` with a feature availability matrix, leakage findings, and retraining recommendation.
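One way to produce the expected file is to render the review findings straight into markdown. A minimal sketch; the findings and section headings below are placeholders, not model answers:

```python
from pathlib import Path

# Placeholder findings gathered during the review; replace with real evidence.
findings = {
    "Data source": "Claims warehouse extract, collection window Jan-Dec 2023",
    "Label policy": "Specialist review outcome, dual-annotated sample",
    "Feature availability matrix": "specialist_flag fails the timing check",
    "Leakage findings": "specialist_flag is populated only after manual review",
    "Release recommendation": "Reject; retrain with decision-time features only",
}

lines = ["# Dataset datasheet and leakage review", ""]
for heading, detail in findings.items():
    lines += [f"## {heading}", detail, ""]

Path("dataset-datasheet-and-leakage-review.md").write_text("\n".join(lines))
```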
Exam trap
- Objective: CT-AI 4.1-4.5
- Common trap: Accepting impressive performance without checking whether the data could exist at prediction time.
- Wording clue: Prefer answers that mention provenance, label quality, split integrity, feature timing, and privacy controls.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artefact structure
dataset-datasheet-and-leakage-review.md
Recall check
- Why did the claims model evaluation fail? It used information that was only available after manual review, creating leakage.
- What evidence reveals leakage risk? Feature timing, data lineage, split policy, and a leakage checklist.
- Why are labels test oracles? Supervised models learn from labels, so noisy or ambiguous labels damage both training and evaluation.
- What portfolio artefact does this module produce? dataset-datasheet-and-leakage-review.md, a dataset suitability and leakage evidence pack.
Topic-by-topic teaching guide
1. Data Provenance
Provenance explains where data came from, when it was collected, who transformed it, and what limitations it carries.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A customer sentiment dataset collected during an outage may not represent normal behaviour. |
| What can go wrong | Using convenient data without knowing its origin or collection bias. |
| How a tester should think | Ask whether the dataset is suitable for this decision context. |
| Evidence to collect | Datasheet, lineage record, source owner, and collection notes. |
2. Labelling Quality
Labels are test oracles for supervised learning. Ambiguous policy, rushed annotation, or poor reviewer agreement damages model learning.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Two reviewers label the same message as complaint vs cancellation request. |
| What can go wrong | Assuming labels are correct because they are in a CSV. |
| How a tester should think | Sample labels, check guidance, and measure disagreement. |
| Evidence to collect | Labelling guide, inter-annotator agreement, disputed-label log. |
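Reviewer disagreement can be measured rather than assumed. A minimal sketch using scikit-learn's Cohen's kappa on a dual-annotated sample; the labels below are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two reviewers to the same five messages (invented data).
reviewer_a = ["complaint", "cancellation", "complaint", "complaint", "other"]
reviewer_b = ["complaint", "complaint", "complaint", "cancellation", "other"]

# Kappa corrects raw agreement for chance; low values flag ambiguous policy.
print(f"Cohen's kappa: {cohen_kappa_score(reviewer_a, reviewer_b):.2f}")

# Disagreements feed the disputed-label log for the labelling-guide review.
disputes = [i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a != b]
print("Disputed items:", disputes)
```

A low kappa on even a small sample is a prompt to fix the labelling guide before blaming the model.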
3. Representativeness
The dataset must include the users, situations, slices, and edge cases the model will face.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A voice model trained mostly on studio audio may fail in noisy call centres. |
| What can go wrong | Believing a large dataset is automatically representative. |
| How a tester should think | Compare dataset slices against expected production usage. |
| Evidence to collect | Slice coverage report and population comparison. |
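Slice coverage can be quantified by comparing training proportions against expected production traffic. A minimal sketch; the slice names and the 10-percentage-point tolerance are illustrative assumptions:

```python
import pandas as pd

# Share of each audio environment in training data vs expected production use.
train_share = pd.Series({"studio": 0.85, "call_centre": 0.10, "mobile": 0.05})
prod_share = pd.Series({"studio": 0.20, "call_centre": 0.60, "mobile": 0.20})

# Positive gaps mean production will see more of a slice than training did.
gap = (prod_share - train_share).sort_values(ascending=False)
print(gap[gap > 0.10])  # slices under-represented beyond the assumed tolerance
```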
4. Leakage Testing
Leakage happens when training includes information that would not be available at prediction time or contains target proxies.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Refund-approved date predicts fraud because it is created after investigation. |
| What can go wrong | Celebrating unrealistic performance without checking feature timing. |
| How a tester should think | Review each feature against the decision timeline. |
| Evidence to collect | Feature availability matrix and leakage test results. |
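One screening heuristic (an assumption here, not a syllabus technique) is to test how well each feature alone ranks the target; a near-perfect single-feature score marks a proxy worth tracing through the decision timeline:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)  # synthetic fraud labels

features = {
    "claim_amount": rng.normal(size=500),                         # unrelated noise
    "refund_approved_days": y + rng.normal(scale=0.1, size=500),  # post-investigation field
}

# A single feature that almost perfectly separates the target is a
# leakage candidate, exactly like the refund-approved date above.
for name, values in features.items():
    auc = roc_auc_score(y, values)
    flag = "  <- suspiciously predictive" if max(auc, 1 - auc) > 0.95 else ""
    print(f"{name}: single-feature AUC = {auc:.2f}{flag}")
```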
5. Privacy and Consent
AI datasets may contain personal, sensitive, or regulated data. Testing must consider minimisation and lawful use.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Free-text support tickets may include bank details or health information. |
| What can go wrong | Copying raw production data into notebooks or demos. |
| How a tester should think | Check minimisation, masking, access, and retention. |
| Evidence to collect | Privacy review, masking evidence, and access log. |
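A lightweight scan can flag obvious personal data before tickets reach notebooks or demos. A minimal sketch with two illustrative patterns; a real review would rely on a vetted PII detection tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only; production scans need a maintained PII library.
patterns = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

tickets = [
    "Please refund to DE44500105175407324931",
    "Contact jane.doe@example.com about my claim",
    "The app crashes when I upload a photo",
]

for i, text in enumerate(tickets):
    hits = [name for name, rx in patterns.items() if rx.search(text)]
    if hits:
        print(f"ticket {i}: possible PII -> {hits}")  # route to masking review
```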
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a metric-threshold gate is sketched after this list).
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
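As a concrete shape for the quality-signal and oracle steps above, a metric-threshold gate can be written as a small check the release approver can rerun. The metric names and thresholds below are assumptions for illustration; real values come from the risk analysis:

```python
# Assumed thresholds; a real gate derives these from harm analysis, not defaults.
OVERALL_MIN_AUC = 0.80
SLICE_MIN_AUC = 0.75

slice_auc = {"overall": 0.83, "small_claims": 0.81, "large_claims": 0.71}

# Collect every slice that falls below its gate, not just the overall score.
failures = {
    name: auc
    for name, auc in slice_auc.items()
    if auc < (OVERALL_MIN_AUC if name == "overall" else SLICE_MIN_AUC)
}

if failures:
    print("Release gate FAILED:", failures)  # e.g. large_claims below threshold
else:
    print("Release gate passed")
```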
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
Which data assumption would be most dangerous if it were wrong: source, label, timing, slice coverage, or consent?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/01_data_and_datasheets.ipynb
- Task: Produce a dataset datasheet and run basic data quality checks.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A model has unusually high validation performance. Decide what leakage and data split evidence you need before trusting it.
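Split integrity is one piece of that evidence and can be checked mechanically. A minimal sketch, assuming each record carries a stable claimant ID that must not straddle partitions (the IDs below are invented):

```python
# Invented claimant IDs for each partition of the claims dataset.
train_claimants = {"c1", "c2", "c3"}
test_claimants = {"c3", "c4"}

# Any shared ID lets the model memorise claimants instead of generalising.
overlap = train_claimants & test_claimants
if overlap:
    print("Split leakage: claimants in both partitions ->", sorted(overlap))
else:
    print("Partitions are disjoint at claimant level")
```

When building the split in the first place, scikit-learn's GroupShuffleSplit produces group-disjoint partitions by construction.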
Key terms
- Data provenance: Documented origin and transformation history of data.
- Label noise: Incorrect, inconsistent, or ambiguous labels.
- Data leakage: Use of information during training or evaluation that would not be available in real use.
- Datasheet: Structured documentation of dataset purpose, composition, collection, and limitations.
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.