Data, Labelling, Provenance, and Leakage Testing
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.
Data, Labelling, Provenance, and Leakage Testing video briefing
A focused explanation of chapter 3, turning the AI testing theory into concrete validation checks.
Briefing focus: Module opening. This is a structured lesson briefing; real video or audio can be added later as a media source.
Estimated time: 9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.
Audience. QA engineers collaborating with data teams on datasets, labels, features, and test data strategy.
Why this matters. Most AI failures take shape quietly inside the data. If testers cannot question data quality, they will only discover model issues after expensive training runs or production harm.
ISTQB CT-AI mapping. CT-AI 4.1-4.5, 7.3
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in data, labelling, provenance, and leakage testing.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artefact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Insurance
The claims model trained on tomorrow's information
Situation. A model predicts which claims need specialist review. A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Claims prioritisation workflow |
| AI feature | A model predicts which claims need specialist review. |
| Failure or risk | A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results. |
| Testing challenge | The model appeared excellent in validation but failed in production because the leaked feature disappeared at decision time. |
| Tester response | The tester introduced feature availability checks, provenance review, label-quality sampling, and train/validation/test partition rules. |
| Evidence required | Datasheet, feature availability matrix, leakage checklist, label audit, split strategy, and privacy review. |
| Business decision | Reject the model evaluation and require retraining with only decision-time features. |
Visual flow
Learning path
1. Start Here (5 min): Outcome, CT-AI exam relevance, and the leakage scenario.
2. Learn (22 min): Provenance, labels, representativeness, leakage, privacy, and consent.
3. See It (10 min): Feature timing and dataset evidence breakdown.
4. Try It (16 min): Build a datasheet and leakage review for the claims model.
5. Recall and Apply (10 min): Exam traps, active recall, and the portfolio artefact.
Decision-time feature availability
Data is testable only when the team can prove where each field came from, when it existed, and whether it was available at prediction time.
Example
The claims model used a feature created after manual review, so validation looked excellent while production could not use the same signal.
Mistake
Trusting high validation scores before checking feature timing, split strategy, and provenance.
Evidence
Feature availability matrix, datasheet, lineage record, split policy, leakage checklist, and data owner sign-off.
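The availability claim can be checked mechanically by comparing each feature's population timestamp against the decision timestamp. A minimal sketch in pandas; the event log and all column names are illustrative assumptions, not the module's dataset:

```python
import pandas as pd

# Illustrative event log: when the decision was made and when each
# feature value was first populated (all names here are assumed).
events = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "decision_time": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
    "claim_amount_time": pd.to_datetime(["2024-02-27", "2024-02-28", "2024-03-01"]),
    "specialist_flag_time": pd.to_datetime(["2024-03-05", "2024-03-06", "2024-03-07"]),
})

# A feature is safe only if it existed at or before decision time for
# every record; anything later is a leakage candidate for the review.
availability = {
    col: bool((events[col] <= events["decision_time"]).all())
    for col in ["claim_amount_time", "specialist_flag_time"]
}
print(availability)  # {'claim_amount_time': True, 'specialist_flag_time': False}
```

Each entry of the resulting dictionary is one row of the feature availability matrix: a feature that fails the check mirrors the leaked specialist-review signal in the scenario.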
Worked example: Rejecting a leaked evaluation
Scenario. A claims prioritisation model reports excellent validation performance, but one high-importance feature is populated only after a specialist has reviewed the claim.
Reasoning. The feature leaks future information. The reported performance does not represent the live decision point, so the model cannot be approved from that evaluation.
Model answer. Reject the evaluation, remove decision-time unavailable features, rerun train/validation/test splits, and require a leakage review before release discussion.
Try it: Build the datasheet and leakage review
Prompt. Use the insurance claims scenario to document whether the dataset and features are suitable for release evaluation.
Learner action. Record data source, collection window, label policy, feature timing, split method, privacy handling, leakage risks, and release recommendation.
Expected output. `dataset-datasheet-and-leakage-review.md` with a feature availability matrix, leakage findings, and retraining recommendation.
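One way to produce the expected file is to render the review findings straight into markdown. A minimal sketch; the findings and section headings below are placeholders, not model answers:

```python
from pathlib import Path

# Placeholder findings gathered during the review; replace with real evidence.
findings = {
    "Data source": "Claims warehouse extract, collection window Jan-Dec 2023",
    "Label policy": "Specialist review outcome, dual-annotated sample",
    "Feature availability matrix": "specialist_flag fails the timing check",
    "Leakage findings": "specialist_flag is populated only after manual review",
    "Release recommendation": "Reject; retrain with decision-time features only",
}

lines = ["# Dataset datasheet and leakage review", ""]
for heading, detail in findings.items():
    lines += [f"## {heading}", detail, ""]

Path("dataset-datasheet-and-leakage-review.md").write_text("\n".join(lines))
```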
Exam trap
- Objective: CT-AI 4.1-4.5
- Common trap: Accepting impressive performance without checking whether the data could exist at prediction time.
- Wording clue: Prefer answers that mention provenance, label quality, split integrity, feature timing, and privacy controls.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artefact structure
dataset-datasheet-and-leakage-review.md
Recall check
- Why did the claims model evaluation fail? It used information that was only available after manual review, creating leakage.
- What evidence reveals leakage risk? Feature timing, data lineage, split policy, and a leakage checklist.
- Why are labels test oracles? Supervised models learn from labels, so noisy or ambiguous labels damage both training and evaluation.
- What portfolio artefact does this module produce? dataset-datasheet-and-leakage-review.md, a dataset suitability and leakage evidence pack.
Topic-by-topic teaching guide
1. Data Provenance
Provenance explains where data came from, when it was collected, who transformed it, and what limitations it carries.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A customer sentiment dataset collected during an outage may not represent normal behaviour. |
| What can go wrong | Using convenient data without knowing its origin or collection bias. |
| How a tester should think | Ask whether the dataset is suitable for this decision context. |
| Evidence to collect | Datasheet, lineage record, source owner, and collection notes. |
2. Labelling Quality
Labels are test oracles for supervised learning. Ambiguous policy, rushed annotation, or poor reviewer agreement damages model learning.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Two reviewers label the same message as complaint vs cancellation request. |
| What can go wrong | Assuming labels are correct because they are in a CSV. |
| How a tester should think | Sample labels, check guidance, and measure disagreement. |
| Evidence to collect | Labelling guide, inter-annotator agreement, disputed-label log. |
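Reviewer disagreement can be measured rather than assumed. A minimal sketch using scikit-learn's Cohen's kappa on a dual-annotated sample; the labels below are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two reviewers to the same five messages (invented data).
reviewer_a = ["complaint", "cancellation", "complaint", "complaint", "other"]
reviewer_b = ["complaint", "complaint", "complaint", "cancellation", "other"]

# Kappa corrects raw agreement for chance; low values flag ambiguous policy.
print(f"Cohen's kappa: {cohen_kappa_score(reviewer_a, reviewer_b):.2f}")

# Disagreements feed the disputed-label log for the labelling-guide review.
disputes = [i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a != b]
print("Disputed items:", disputes)
```

A low kappa on even a small sample is a prompt to fix the labelling guide before blaming the model.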
3. Representativeness
The dataset must include the users, situations, slices, and edge cases the model will face.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A voice model trained mostly on studio audio may fail in noisy call centres. |
| What can go wrong | Believing a large dataset is automatically representative. |
| How a tester should think | Compare dataset slices against expected production usage. |
| Evidence to collect | Slice coverage report and population comparison. |
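Slice coverage can be quantified by comparing training proportions against expected production traffic. A minimal sketch; the slice names and the 10-percentage-point tolerance are illustrative assumptions:

```python
import pandas as pd

# Share of each audio environment in training data vs expected production use.
train_share = pd.Series({"studio": 0.85, "call_centre": 0.10, "mobile": 0.05})
prod_share = pd.Series({"studio": 0.20, "call_centre": 0.60, "mobile": 0.20})

# Positive gaps mean production will see more of a slice than training did.
gap = (prod_share - train_share).sort_values(ascending=False)
print(gap[gap > 0.10])  # slices under-represented beyond the assumed tolerance
```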
4. Leakage Testing
Leakage happens when training includes information that would not be available at prediction time or contains target proxies.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Refund-approved date predicts fraud because it is created after investigation. |
| What can go wrong | Celebrating unrealistic performance without checking feature timing. |
| How a tester should think | Review each feature against the decision timeline. |
| Evidence to collect | Feature availability matrix and leakage test results. |
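One screening heuristic (an assumption here, not a syllabus technique) is to test how well each feature alone ranks the target; a near-perfect single-feature score marks a proxy worth tracing through the decision timeline:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)  # synthetic fraud labels

features = {
    "claim_amount": rng.normal(size=500),                         # unrelated noise
    "refund_approved_days": y + rng.normal(scale=0.1, size=500),  # post-investigation field
}

# A single feature that almost perfectly separates the target is a
# leakage candidate, exactly like the refund-approved date above.
for name, values in features.items():
    auc = roc_auc_score(y, values)
    flag = "  <- suspiciously predictive" if max(auc, 1 - auc) > 0.95 else ""
    print(f"{name}: single-feature AUC = {auc:.2f}{flag}")
```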
5. Privacy and Consent
AI datasets may contain personal, sensitive, or regulated data. Testing must consider minimisation and lawful use.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Free-text support tickets may include bank details or health information. |
| What can go wrong | Copying raw production data into notebooks or demos. |
| How a tester should think | Check minimisation, masking, access, and retention. |
| Evidence to collect | Privacy review, masking evidence, and access log. |
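A lightweight scan can flag obvious personal data before tickets reach notebooks or demos. A minimal sketch with two illustrative patterns; a real review would rely on a vetted PII detection tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only; production scans need a maintained PII library.
patterns = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

tickets = [
    "Please refund to DE44500105175407324931",
    "Contact jane.doe@example.com about my claim",
    "The app crashes when I upload a photo",
]

for i, text in enumerate(tickets):
    hits = [name for name, rx in patterns.items() if rx.search(text)]
    if hits:
        print(f"ticket {i}: possible PII -> {hits}")  # route to masking review
```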
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a metric-threshold gate is sketched after this list).
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
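As a concrete shape for the quality-signal and oracle steps above, a metric-threshold gate can be written as a small check the release approver can rerun. The metric names and thresholds below are assumptions for illustration; real values come from the risk analysis:

```python
# Assumed thresholds; a real gate derives these from harm analysis, not defaults.
OVERALL_MIN_AUC = 0.80
SLICE_MIN_AUC = 0.75

slice_auc = {"overall": 0.83, "small_claims": 0.81, "large_claims": 0.71}

# Collect every slice that falls below its gate, not just the overall score.
failures = {
    name: auc
    for name, auc in slice_auc.items()
    if auc < (OVERALL_MIN_AUC if name == "overall" else SLICE_MIN_AUC)
}

if failures:
    print("Release gate FAILED:", failures)  # e.g. large_claims below threshold
else:
    print("Release gate passed")
```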
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
Which data assumption would be most dangerous if it were wrong: source, label, timing, slice coverage, or consent?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/01_data_and_datasheets.ipynb
- Task: Produce a dataset datasheet and run basic data quality checks.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A model has unusually high validation performance. Decide what leakage and data split evidence you need before trusting it.
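Split integrity is one piece of that evidence and can be checked mechanically. A minimal sketch, assuming each record carries a stable claimant ID that must not straddle partitions (the IDs below are invented):

```python
# Invented claimant IDs for each partition of the claims dataset.
train_claimants = {"c1", "c2", "c3"}
test_claimants = {"c3", "c4"}

# Any shared ID lets the model memorise claimants instead of generalising.
overlap = train_claimants & test_claimants
if overlap:
    print("Split leakage: claimants in both partitions ->", sorted(overlap))
else:
    print("Partitions are disjoint at claimant level")
```

When building the split in the first place, scikit-learn's GroupShuffleSplit produces group-disjoint partitions by construction.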
Key terms
- Data provenance: Documented origin and transformation history of data.
- Label noise: Incorrect, inconsistent, or ambiguous labels.
- Data leakage: Use of information during training or evaluation that would not be available in real use.
- Datasheet: Structured documentation of dataset purpose, composition, collection, and limitations.
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.