ML Workflow, Models, Neural Networks, and Development Testing
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence.
ML Workflow, Models, Neural Networks, and Development Testing video briefing
A focused explanation of chapter 4, turning the AI testing theory into concrete validation checks.
Briefing focus
Module opening
This is a structured lesson briefing. Real video/audio can be added later as a media source.
Estimated time
9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artifacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence.
Audience. Testers who need to understand model creation without becoming data scientists.
Why this matters. AI testing is stronger when QA can see the training pipeline as a product: inputs, transformations, configuration, outputs, environments, and promotion controls.
ISTQB CT-AI mapping. CT-AI 1.4-1.8, 4.2, 10.1
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in ML workflow, models, neural networks, and development testing.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artifact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Retail analytics
The notebook that worked only on the data scientist's laptop
Situation. A neural network predicts weekly product demand. The training notebook depended on hidden execution order, local files, and unpinned package versions.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Demand forecasting platform |
| AI feature | A neural network predicts weekly product demand. |
| Failure or risk | The training notebook depended on hidden execution order, local files, and unpinned package versions. |
| Testing challenge | The team could not reproduce the promoted model or explain why two training runs gave different results. |
| Tester response | The tester required clean-run notebooks, pinned dependencies, seeded experiments, artifact versioning, and serving smoke tests. |
| Evidence required | Pipeline run log, environment lockfile, model registry entry, reproducibility check, and deployment smoke test. |
| Business decision | Do not promote until the pipeline can run cleanly and produce traceable artifacts. |
Visual flow
Learning path
Start Here
5 min · Outcome, CT-AI exam relevance, and the irreproducible notebook scenario.
Learn
22 min · ML workflow stages, model families, training pipeline tests, environments, and promotion.
See It
10 min · Reproducibility evidence for a demand forecasting model.
Try It
16 min · Build a workflow test plan and promotion checklist.
Recall and Apply
10 min · Exam traps, active recall, and the portfolio artifact.
Reproducible model creation
A model artifact is not release-ready unless the team can recreate or explain the training run, environment, data, parameters, and promotion decision.
Example
The demand forecasting notebook worked only after hidden cell execution order, local files, and unpinned packages lined up on one laptop.
Mistake
Treating a successful notebook run as production evidence.
Evidence
Clean-run notebook, pipeline log, environment lockfile, dataset version, seed, model registry entry, artifact hash, and serving smoke test.
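A minimal sketch of what a reproducibility check could look like in code, assuming a hypothetical `train_model` entry point and config; the real pipeline interface and file names will differ, but the idea is the same: run twice from a clean state with the same seed and data version, compare the metric, and record the artifact hashes as evidence.

```python
# Minimal reproducibility-check sketch. `train_model`, its config keys, and the
# output file name are illustrative stand-ins, not the course's real pipeline API.
import hashlib
import json


def artifact_hash(path: str) -> str:
    """Return a SHA-256 fingerprint of a serialized model artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def test_training_run_is_reproducible(train_model):
    config = {"dataset_version": "sales_2024_w40", "seed": 42, "epochs": 5}

    run_a = train_model(**config)  # expected to return {"metric": ..., "artifact_path": ...}
    run_b = train_model(**config)

    # Metrics should match within a small tolerance; bit-identical artifacts are a
    # stronger, hardware-dependent claim, so record the hashes as evidence either way.
    assert abs(run_a["metric"] - run_b["metric"]) < 1e-6
    evidence = {
        "config": config,
        "metrics": [run_a["metric"], run_b["metric"]],
        "artifact_hashes": [
            artifact_hash(run_a["artifact_path"]),
            artifact_hash(run_b["artifact_path"]),
        ],
    }
    with open("reproducibility-check.json", "w") as f:
        json.dump(evidence, f, indent=2)
```

The JSON output doubles as the "reproducibility check" artifact listed in the evidence row above: something a release approver can read without rerunning the pipeline.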
Worked example: Blocking promotion from an unrepeatable run
Scenario. A neural network has strong local evaluation results, but CI cannot rerun the notebook and the produced model artifact is not linked to a dataset or dependency lockfile.
Reasoning. The evaluation may be real, but it is not auditable or reproducible. Promotion would create operational risk because failures could not be diagnosed or rolled back reliably.
Model answer. Block promotion until the pipeline runs from a clean state, dependencies are pinned, artifacts are versioned, and a serving smoke test passes in the target runtime.
Try it: Build the ML workflow test plan
Prompt. Use the retail forecasting scenario to define what QA should check before model promotion.
Learner action. Map workflow stages to checks for data, features, training code, environment, evaluation, packaging, serving, approval, and rollback.
Expected output. `ml-workflow-test-plan.md` with workflow map, stage gates, reproducibility evidence, promotion checklist, and open risks.
Exam trap
Objective
CT-AI 1.4-1.8
Common trap
Approving a model because one local run looked good, without reproducibility or artifact controls.
Wording clue
Look for answers that mention clean runs, versioned inputs, environment control, registry metadata, and promotion gates.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artifact structure
ml-workflow-test-plan.md
Recall check
- Why is a notebook not enough release evidence?
- It may depend on hidden state, local files, unpinned packages, and manual execution order.
- What makes a training run reproducible?
- Versioned data, code, dependencies, parameters, seed, run logs, and artifact metadata.
- What should promotion require?
- Evaluation evidence, serving smoke tests, registry entry, approval record, and rollback readiness.
- What portfolio artifact does this module produce?
- ml-workflow-test-plan.md, a QA plan for ML pipeline and promotion evidence.
Topic-by-topic teaching guide
1. ML Workflow Basics
A typical workflow moves from problem framing to data preparation, training, validation, packaging, deployment, and monitoring.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A churn model has separate datasets, feature code, training code, evaluation scripts, and serving endpoint. |
| What can go wrong | Testing only the final API and ignoring training pipeline defects. |
| How a tester should think | Treat each workflow stage as testable. |
| Evidence to collect | Workflow map and stage-level quality gates. |
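One way to make "treat each workflow stage as testable" concrete is a small stage-gate map. The sketch below is illustrative, not a prescribed standard: stage names follow the workflow description above, and the check and evidence file names are assumptions.

```python
# Illustrative stage-gate map: each workflow stage gets at least one concrete
# check and a named evidence artifact a reviewer can inspect. All file names here
# are placeholders for the project's real artifacts.
STAGE_GATES = {
    "data preparation": {"check": "schema and null-rate validation", "evidence": "data-quality-report.md"},
    "training":         {"check": "clean-state pipeline run with pinned seed", "evidence": "pipeline-run.log"},
    "validation":       {"check": "metric thresholds per customer segment", "evidence": "evaluation-report.md"},
    "packaging":        {"check": "artifact hash recorded in the registry", "evidence": "registry-entry.json"},
    "deployment":       {"check": "serving smoke test in the target runtime", "evidence": "smoke-test.log"},
    "monitoring":       {"check": "drift and latency alerts configured", "evidence": "monitoring-config.yaml"},
}


def missing_gates(completed: set[str]) -> list[str]:
    """Return the stages that still lack sign-off before promotion."""
    return [stage for stage in STAGE_GATES if stage not in completed]
```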
2. Model Families
Different model types have different behaviours and test needs. Trees, linear models, neural networks, and LLMs fail in different ways.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A decision tree may be easier to inspect than a deep neural network, but may still fail on unseen slices. |
| What can go wrong | Using one generic test strategy for every model type. |
| How a tester should think | Adapt checks to model complexity and risk. |
| Evidence to collect | Model card and evaluation notes. |
3. Training Pipeline Testing
Training code should be reproducible, observable, and versioned.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A pipeline should fail clearly if schema changes or a required feature is missing. |
| What can go wrong | Accepting a manually executed notebook as production evidence. |
| How a tester should think | Run from clean state and verify artifact outputs. |
| Evidence to collect | Pipeline test log, schema checks, and artifact hashes. |
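As a sketch of the "fail clearly if schema changes" idea, the check below validates the training frame before any training happens. It assumes a pandas DataFrame input; the column names and dtypes are illustrative, not the scenario's real feature set.

```python
# Fail-fast schema check at the start of a training pipeline. Column names and
# dtypes are illustrative; the point is that a missing or retyped feature should
# stop the run with a clear message instead of silently producing a different model.
import pandas as pd

EXPECTED_SCHEMA = {
    "store_id": "int64",
    "product_id": "int64",
    "week": "datetime64[ns]",
    "units_sold": "float64",
    "promotion_flag": "int64",
}


def validate_training_frame(df: pd.DataFrame) -> None:
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Training data is missing required features: {sorted(missing)}")
    wrong_types = {
        col: str(df[col].dtype)
        for col, expected in EXPECTED_SCHEMA.items()
        if str(df[col].dtype) != expected
    }
    if wrong_types:
        raise TypeError(f"Unexpected dtypes (column -> actual): {wrong_types}")
```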
4. Development Environments
AI systems often depend on packages, hardware, seeds, and data versions. Environment drift creates false confidence.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A model trained with a different library version changes probability calibration. |
| What can go wrong | Ignoring dependency and hardware differences. |
| How a tester should think | Pin environments and document expected variation. |
| Evidence to collect | Lockfiles, Docker config, and run metadata. |
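A small environment-drift check can turn the lockfile into a testable artifact. The sketch assumes a simple `requirements.lock` file of `name==version` lines; adapt it to whatever lock format the project actually uses.

```python
# Environment-drift check: compare installed package versions against a pinned
# lockfile (assumed to contain plain `name==version` lines) and report mismatches.
from importlib.metadata import version, PackageNotFoundError


def environment_drift(lockfile_path: str = "requirements.lock") -> dict[str, tuple[str, str]]:
    """Return {package: (pinned, installed)} for every mismatch."""
    drift = {}
    with open(lockfile_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                installed = "not installed"
            if installed != pinned:
                drift[name] = (pinned, installed)
    return drift


if __name__ == "__main__":
    mismatches = environment_drift()
    assert not mismatches, f"Environment drift detected: {mismatches}"
```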
5. Model Promotion
Promotion is the controlled movement of a model into staging or production. It should require evidence, not enthusiasm.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A model with better offline accuracy still fails latency in serving. |
| What can go wrong | Promoting artifacts outside the registry or without approval. |
| How a tester should think | Check evaluation, packaging, serving, and rollback before promotion. |
| Evidence to collect | Promotion checklist and approval record. |
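The promotion checklist can be expressed as an executable gate over the registry entry. The field names below are assumptions for illustration rather than the schema of any specific registry product.

```python
# Illustrative promotion gate: a model registry entry must reference every
# required piece of evidence before the artifact can move to staging.
REQUIRED_EVIDENCE = [
    "dataset_version",
    "dependency_lockfile",
    "evaluation_report",
    "serving_smoke_test",
    "approval_record",
    "rollback_plan",
]


def promotion_blockers(registry_entry: dict) -> list[str]:
    """Return the evidence items that are missing or empty for this model version."""
    return [field for field in REQUIRED_EVIDENCE if not registry_entry.get(field)]


entry = {
    "model": "demand-forecast",
    "version": "1.4.0",
    "dataset_version": "sales_2024_w40",
    "evaluation_report": "evaluation-report.md",
    # lockfile, smoke test, approval record, and rollback plan not yet attached
}
blockers = promotion_blockers(entry)
assert not blockers, f"Do not promote: missing evidence for {blockers}"
```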
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a minimal metamorphic check is sketched after this list).
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
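As an example of a metamorphic oracle for the forecasting scenario: predictions for individual rows should not depend on the order of rows in the scoring batch. The `model` and `scoring_batch` fixtures below are hypothetical stand-ins for the project's real scoring interface.

```python
# Minimal metamorphic-relation sketch: batch order should not change per-row
# forecasts. `model.predict` is assumed to accept and return array-like batches.
import numpy as np


def test_predictions_invariant_to_batch_order(model, scoring_batch: np.ndarray):
    baseline = model.predict(scoring_batch)

    rng = np.random.default_rng(7)
    permutation = rng.permutation(len(scoring_batch))
    shuffled = model.predict(scoring_batch[permutation])

    # Un-shuffle and compare: each row's forecast should be unchanged.
    np.testing.assert_allclose(shuffled, baseline[permutation], rtol=1e-6)
```

Metamorphic relations like this are useful precisely when there is no ground-truth label to assert against: the oracle is the relationship between two runs, not a fixed expected value.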
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
What should count as a release artifact: notebook, script, model binary, dataset, evaluation report, or all of them?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/07_model_validation_metrics.ipynb
- Task: Inspect a model evaluation workflow and identify reproducibility and promotion checks.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A model performs well locally but cannot be reproduced in CI. Decide whether it can enter staging and what evidence is missing.
Key terms
- Training pipeline: Automated or semi-automated process that creates a model artifact.
- Model registry: Controlled store for versioned model artifacts and metadata.
- Reproducibility: Ability to rerun evaluation with traceable inputs and comparable results.
- Serving smoke test: A basic test that proves a packaged model loads and responds in its target environment.
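To make the last term concrete, a serving smoke test can be as small as the sketch below: load the packaged artifact the way the serving runtime would and confirm it returns a sane prediction for one representative input. The joblib path, feature layout, and values are assumptions for illustration, not the module's packaging choice.

```python
# Serving smoke test sketch: prove the packaged model loads and responds.
# The artifact path and the scoring row are illustrative placeholders.
import math
import joblib


def test_packaged_model_loads_and_predicts():
    model = joblib.load("artifacts/demand-forecast-1.4.0.joblib")

    # One representative scoring row: [store_id, product_id, week_of_year, promotion_flag]
    prediction = model.predict([[101, 5523, 40, 0]])[0]

    assert math.isfinite(prediction), "Prediction should be a finite number"
    assert prediction >= 0, "Weekly demand forecasts should not be negative"
```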
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.