ML Workflow, Models, Neural Networks, and Development Testing
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence.
ML Workflow, Models, Neural Networks, and Development Testing video briefing
A focused explanation of chapter 4, turning the AI testing theory into concrete validation checks.
Briefing focus
Module opening
This is a structured lesson briefing. Real video/audio can be added later as a media source.
Estimated time
9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artifacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Give QA professionals enough ML workflow knowledge to test development practices, training pipelines, and model artifacts with confidence.
Audience. Testers who need to understand model creation without becoming data scientists.
Why this matters. AI testing is stronger when QA can see the training pipeline as a product: inputs, transformations, configuration, outputs, environments, and promotion controls.
ISTQB CT-AI mapping. CT-AI 1.4-1.8, 4.2, 10.1
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in ML workflow, models, neural networks, and development testing.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artifact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Retail analytics
The notebook that worked only on the data scientist's laptop
Situation. A neural network predicts weekly product demand. The training notebook depended on hidden execution order, local files, and unpinned package versions.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Demand forecasting platform |
| AI feature | A neural network predicts weekly product demand. |
| Failure or risk | The training notebook depended on hidden execution order, local files, and unpinned package versions. |
| Testing challenge | The team could not reproduce the promoted model or explain why two training runs gave different results. |
| Tester response | The tester required clean-run notebooks, pinned dependencies, seeded experiments, artifact versioning, and serving smoke tests. |
| Evidence required | Pipeline run log, environment lockfile, model registry entry, reproducibility check, and deployment smoke test. |
| Business decision | Do not promote until the pipeline can run cleanly and produce traceable artifacts. |
Visual flow
Learning path
Start Here
5 min · Outcome, CT-AI exam relevance, and the irreproducible notebook scenario.
Learn
22 min · ML workflow stages, model families, training pipeline tests, environments, and promotion.
See It
10 min · Reproducibility evidence for a demand forecasting model.
Try It
16 min · Build a workflow test plan and promotion checklist.
Recall and Apply
10 min · Exam traps, active recall, and the portfolio artifact.
Reproducible model creation
A model artifact is not release-ready unless the team can recreate or explain the training run, environment, data, parameters, and promotion decision.
Example
The demand forecasting notebook worked only after hidden cell execution order, local files, and unpinned packages lined up on one laptop.
Mistake
Treating a successful notebook run as production evidence.
Evidence
Clean-run notebook, pipeline log, environment lockfile, dataset version, seed, model registry entry, artifact hash, and serving smoke test.
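A minimal sketch of what a reproducibility check could look like in code, assuming a hypothetical `train_model` entry point and config; the real pipeline interface and file names will differ, but the idea is the same: run twice from a clean state with the same seed and data version, compare the metric, and record the artifact hashes as evidence.

```python
# Minimal reproducibility-check sketch. `train_model`, its config keys, and the
# output file name are illustrative stand-ins, not the course's real pipeline API.
import hashlib
import json


def artifact_hash(path: str) -> str:
    """Return a SHA-256 fingerprint of a serialized model artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def test_training_run_is_reproducible(train_model):
    config = {"dataset_version": "sales_2024_w40", "seed": 42, "epochs": 5}

    run_a = train_model(**config)  # expected to return {"metric": ..., "artifact_path": ...}
    run_b = train_model(**config)

    # Metrics should match within a small tolerance; bit-identical artifacts are a
    # stronger, hardware-dependent claim, so record the hashes as evidence either way.
    assert abs(run_a["metric"] - run_b["metric"]) < 1e-6
    evidence = {
        "config": config,
        "metrics": [run_a["metric"], run_b["metric"]],
        "artifact_hashes": [
            artifact_hash(run_a["artifact_path"]),
            artifact_hash(run_b["artifact_path"]),
        ],
    }
    with open("reproducibility-check.json", "w") as f:
        json.dump(evidence, f, indent=2)
```

The JSON output doubles as the "reproducibility check" artifact listed in the evidence row above: something a release approver can read without rerunning the pipeline.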
Worked example: Blocking promotion from an unrepeatable run
Scenario. A neural network has strong local evaluation results, but CI cannot rerun the notebook and the produced model artifact is not linked to a dataset or dependency lockfile.
Reasoning. The evaluation may be real, but it is not auditable or reproducible. Promotion would create operational risk because failures could not be diagnosed or rolled back reliably.
Model answer. Block promotion until the pipeline runs from a clean state, dependencies are pinned, artifacts are versioned, and a serving smoke test passes in the target runtime.
Try it: Build the ML workflow test plan
Prompt. Use the retail forecasting scenario to define what QA should check before model promotion.
Learner action. Map workflow stages to checks for data, features, training code, environment, evaluation, packaging, serving, approval, and rollback.
Expected output. `ml-workflow-test-plan.md` with workflow map, stage gates, reproducibility evidence, promotion checklist, and open risks.
Exam trap
Objective
CT-AI 1.4-1.8
Common trap
Approving a model because one local run looked good, without reproducibility or artifact controls.
Wording clue
Look for answers that mention clean runs, versioned inputs, environment control, registry metadata, and promotion gates.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artifact structure
ml-workflow-test-plan.md
Recall check
- Why is a notebook not enough release evidence?
- It may depend on hidden state, local files, unpinned packages, and manual execution order.
- What makes a training run reproducible?
- Versioned data, code, dependencies, parameters, seed, run logs, and artifact metadata.
- What should promotion require?
- Evaluation evidence, serving smoke tests, registry entry, approval record, and rollback readiness.
- What portfolio artifact does this module produce?
- ml-workflow-test-plan.md, a QA plan for ML pipeline and promotion evidence.
Topic-by-topic teaching guide
1. ML Workflow Basics
A typical workflow moves from problem framing to data preparation, training, validation, packaging, deployment, and monitoring.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A churn model has separate datasets, feature code, training code, evaluation scripts, and serving endpoint. |
| What can go wrong | Testing only the final API and ignoring training pipeline defects. |
| How a tester should think | Treat each workflow stage as testable. |
| Evidence to collect | Workflow map and stage-level quality gates. |
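One way to make "treat each workflow stage as testable" concrete is a small stage-gate map. The sketch below is illustrative, not a prescribed standard: stage names follow the workflow description above, and the check and evidence file names are assumptions.

```python
# Illustrative stage-gate map: each workflow stage gets at least one concrete
# check and a named evidence artifact a reviewer can inspect. All file names here
# are placeholders for the project's real artifacts.
STAGE_GATES = {
    "data preparation": {"check": "schema and null-rate validation", "evidence": "data-quality-report.md"},
    "training":         {"check": "clean-state pipeline run with pinned seed", "evidence": "pipeline-run.log"},
    "validation":       {"check": "metric thresholds per customer segment", "evidence": "evaluation-report.md"},
    "packaging":        {"check": "artifact hash recorded in the registry", "evidence": "registry-entry.json"},
    "deployment":       {"check": "serving smoke test in the target runtime", "evidence": "smoke-test.log"},
    "monitoring":       {"check": "drift and latency alerts configured", "evidence": "monitoring-config.yaml"},
}


def missing_gates(completed: set[str]) -> list[str]:
    """Return the stages that still lack sign-off before promotion."""
    return [stage for stage in STAGE_GATES if stage not in completed]
```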
2. Model Families
Different model types have different behaviours and test needs. Trees, linear models, neural networks, and LLMs fail in different ways.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A decision tree may be easier to inspect than a deep neural network, but may still fail on unseen slices. |
| What can go wrong | Using one generic test strategy for every model type. |
| How a tester should think | Adapt checks to model complexity and risk. |
| Evidence to collect | Model card and evaluation notes. |
3. Training Pipeline Testing
Training code should be reproducible, observable, and versioned.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A pipeline should fail clearly if schema changes or a required feature is missing. |
| What can go wrong | Accepting a manually executed notebook as production evidence. |
| How a tester should think | Run from clean state and verify artifact outputs. |
| Evidence to collect | Pipeline test log, schema checks, and artifact hashes. |
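As a sketch of the "fail clearly if schema changes" idea, the check below validates the training frame before any training happens. It assumes a pandas DataFrame input; the column names and dtypes are illustrative, not the scenario's real feature set.

```python
# Fail-fast schema check at the start of a training pipeline. Column names and
# dtypes are illustrative; the point is that a missing or retyped feature should
# stop the run with a clear message instead of silently producing a different model.
import pandas as pd

EXPECTED_SCHEMA = {
    "store_id": "int64",
    "product_id": "int64",
    "week": "datetime64[ns]",
    "units_sold": "float64",
    "promotion_flag": "int64",
}


def validate_training_frame(df: pd.DataFrame) -> None:
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Training data is missing required features: {sorted(missing)}")
    wrong_types = {
        col: str(df[col].dtype)
        for col, expected in EXPECTED_SCHEMA.items()
        if str(df[col].dtype) != expected
    }
    if wrong_types:
        raise TypeError(f"Unexpected dtypes (column -> actual): {wrong_types}")
```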
4. Development Environments
AI systems often depend on packages, hardware, seeds, and data versions. Environment drift creates false confidence.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A model trained with a different library version changes probability calibration. |
| What can go wrong | Ignoring dependency and hardware differences. |
| How a tester should think | Pin environments and document expected variation. |
| Evidence to collect | Lockfiles, Docker config, and run metadata. |
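A small environment-drift check can turn the lockfile into a testable artifact. The sketch assumes a simple `requirements.lock` file of `name==version` lines; adapt it to whatever lock format the project actually uses.

```python
# Environment-drift check: compare installed package versions against a pinned
# lockfile (assumed to contain plain `name==version` lines) and report mismatches.
from importlib.metadata import version, PackageNotFoundError


def environment_drift(lockfile_path: str = "requirements.lock") -> dict[str, tuple[str, str]]:
    """Return {package: (pinned, installed)} for every mismatch."""
    drift = {}
    with open(lockfile_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                installed = "not installed"
            if installed != pinned:
                drift[name] = (pinned, installed)
    return drift


if __name__ == "__main__":
    mismatches = environment_drift()
    assert not mismatches, f"Environment drift detected: {mismatches}"
```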
5. Model Promotion
Promotion is the controlled movement of a model into staging or production. It should require evidence, not enthusiasm.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A model with better offline accuracy still fails latency in serving. |
| What can go wrong | Promoting artifacts outside the registry or without approval. |
| How a tester should think | Check evaluation, packaging, serving, and rollback before promotion. |
| Evidence to collect | Promotion checklist and approval record. |
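The promotion checklist can be expressed as an executable gate over the registry entry. The field names below are assumptions for illustration rather than the schema of any specific registry product.

```python
# Illustrative promotion gate: a model registry entry must reference every
# required piece of evidence before the artifact can move to staging.
REQUIRED_EVIDENCE = [
    "dataset_version",
    "dependency_lockfile",
    "evaluation_report",
    "serving_smoke_test",
    "approval_record",
    "rollback_plan",
]


def promotion_blockers(registry_entry: dict) -> list[str]:
    """Return the evidence items that are missing or empty for this model version."""
    return [field for field in REQUIRED_EVIDENCE if not registry_entry.get(field)]


entry = {
    "model": "demand-forecast",
    "version": "1.4.0",
    "dataset_version": "sales_2024_w40",
    "evaluation_report": "evaluation-report.md",
    # lockfile, smoke test, approval record, and rollback plan not yet attached
}
blockers = promotion_blockers(entry)
assert not blockers, f"Do not promote: missing evidence for {blockers}"
```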
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a minimal metamorphic check is sketched after this list).
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
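As an example of a metamorphic oracle for the forecasting scenario: predictions for individual rows should not depend on the order of rows in the scoring batch. The `model` and `scoring_batch` fixtures below are hypothetical stand-ins for the project's real scoring interface.

```python
# Minimal metamorphic-relation sketch: batch order should not change per-row
# forecasts. `model.predict` is assumed to accept and return array-like batches.
import numpy as np


def test_predictions_invariant_to_batch_order(model, scoring_batch: np.ndarray):
    baseline = model.predict(scoring_batch)

    rng = np.random.default_rng(7)
    permutation = rng.permutation(len(scoring_batch))
    shuffled = model.predict(scoring_batch[permutation])

    # Un-shuffle and compare: each row's forecast should be unchanged.
    np.testing.assert_allclose(shuffled, baseline[permutation], rtol=1e-6)
```

Metamorphic relations like this are useful precisely when there is no ground-truth label to assert against: the oracle is the relationship between two runs, not a fixed expected value.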
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
What should count as a release artifact: notebook, script, model binary, dataset, evaluation report, or all of them?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/07_model_validation_metrics.ipynb
- Task: Inspect a model evaluation workflow and identify reproducibility and promotion checks.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A model performs well locally but cannot be reproduced in CI. Decide whether it can enter staging and what evidence is missing.
Key terms
- Training pipeline: Automated or semi-automated process that creates a model artifact.
- Model registry: Controlled store for versioned model artifacts and metadata.
- Reproducibility: Ability to rerun evaluation with traceable inputs and comparable results.
- Serving smoke test: A basic test that proves a packaged model loads and responds in its target environment.
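To make the last term concrete, a serving smoke test can be as small as the sketch below: load the packaged artifact the way the serving runtime would and confirm it returns a sane prediction for one representative input. The joblib path, feature layout, and values are assumptions for illustration, not the module's packaging choice.

```python
# Serving smoke test sketch: prove the packaged model loads and responds.
# The artifact path and the scoring row are illustrative placeholders.
import math
import joblib


def test_packaged_model_loads_and_predicts():
    model = joblib.load("artifacts/demand-forecast-1.4.0.joblib")

    # One representative scoring row: [store_id, product_id, week_of_year, promotion_flag]
    prediction = model.predict([[101, 5523, 40, 0]])[0]

    assert math.isfinite(prediction), "Prediction should be a finite number"
    assert prediction >= 0, "Weekly demand forecasts should not be negative"
```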
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.