Chapter 56 — CI/CD for ML & LLM
Overview
Establish reproducible builds, environment parity, and safe releases for ML/LLM workloads. Continuous Integration and Continuous Deployment (CI/CD) for machine learning extends traditional software practices with specialized pipelines for data validation, model training, evaluation gates, and progressive deployment strategies that account for the unique characteristics of ML systems.
Pipeline Architecture
- Data/feature checks; model training; eval gates; packaging; release.
- Environment management: containers, reproducibility, secrets.
- Approvals and rollbacks; canary and shadow deployments.
Deliverables
- Pipeline definitions and release checklist
- Infrastructure-as-code templates
- Automated testing and evaluation frameworks
- Deployment runbooks and rollback procedures
Why It Matters
Reproducibility and safe releases are the backbone of trustworthy AI. CI/CD for ML/LLM codifies data checks, model evaluation gates, and rollbacks to avoid shipping regressions. Without proper CI/CD:
- Models trained on one engineer's laptop may not reproduce in production
- Data quality issues slip through to training, degrading model performance
- Breaking changes deploy without safety checks, impacting users
- Rollbacks take hours instead of minutes
- Compliance audits lack evidence trails
Organizations with mature ML CI/CD see 60-80% reduction in production incidents, 3x faster iteration cycles, and complete audit trails for regulated environments.
Complete ML/LLM CI/CD Pipeline
flowchart TD
    A[Code Commit] --> B[Data Validation]
    B --> C{Data Quality Gates}
    C -->|Pass| D[Feature Engineering]
    C -->|Fail| Z[Alert & Stop]
    D --> E[Model Training]
    E --> F[Model Evaluation]
    F --> G{Eval Gates}
    G -->|Pass| H[Model Packaging]
    G -->|Fail| Z
    H --> I[Security Scan]
    I --> J{Security Gates}
    J -->|Pass| K[Registry Upload]
    J -->|Fail| Z
    K --> L[Shadow Deployment]
    L --> M{Shadow Metrics}
    M -->|Pass| N[Canary Deployment 5%]
    M -->|Fail| Z
    N --> O{Canary Metrics}
    O -->|Pass| P[Progressive Rollout 25%→50%→100%]
    O -->|Fail| Q[Auto Rollback]
    P --> R[Production Monitoring]
    Q --> Z
Pipeline Stage Details
Stage 1: Data Validation
flowchart LR
    A[Raw Data] --> B[Schema Check]
    B --> C[Quality Check]
    C --> D[Drift Detection]
    D --> E{All Pass?}
    E -->|Yes| F[Proceed to Training]
    E -->|No| G[Alert & Block]
    B -.->|Validates| H[Column Types<br/>Required Fields<br/>Constraints]
    C -.->|Validates| I[Null Rates<br/>Duplicates<br/>Outliers]
    D -.->|Validates| J[PSI < 0.2<br/>KS Test<br/>Distribution Shift]
Data Quality Gate Criteria:
| Check Type | Metric | Threshold | Action on Failure |
|---|---|---|---|
| Schema Validation | Missing required columns | 0 missing | Block pipeline |
| Data Quality | Null rate per column | <1% | Block pipeline |
| Data Quality | Duplicate records | <0.1% | Warning |
| Covariate Drift | PSI (Population Stability Index) | <0.2 | Block pipeline |
| Concept Drift | Label distribution change | <10% relative | Warning |
| Freshness | Data age | <24 hours | Block if >48h |
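Minimal Code Example - Data Gates (illustrative): a sketch of how the blocking checks above could be wired into the CI job before training. The schema, thresholds, and function names are assumptions for illustration, not from a specific validation framework; tools such as Great Expectations or Evidently provide production-grade equivalents.
# data_gates.py
import numpy as np
import pandas as pd

REQUIRED_COLUMNS = {"doc_id", "text", "label"}  # hypothetical schema for this sketch

def population_stability_index(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """PSI between the reference (training-time) and current feature distribution."""
    edges = np.histogram_bin_edges(expected.dropna(), bins=bins)
    e_counts, _ = np.histogram(expected.dropna(), bins=edges)
    a_counts, _ = np.histogram(actual.dropna(), bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def run_data_gates(current: pd.DataFrame, reference: pd.DataFrame) -> None:
    """Raise on any blocking failure so the pipeline stops before training."""
    # Schema gate: block on missing required columns
    missing = REQUIRED_COLUMNS - set(current.columns)
    if missing:
        raise ValueError(f"Schema gate failed: missing columns {sorted(missing)}")
    # Quality gate: block if any column exceeds 1% nulls
    null_rates = current.isna().mean()
    if (null_rates > 0.01).any():
        raise ValueError(f"Quality gate failed: {null_rates[null_rates > 0.01].to_dict()}")
    # Covariate drift gate: block if PSI >= 0.2 on any numeric feature
    for col in current.select_dtypes(include="number").columns:
        if population_stability_index(reference[col], current[col]) >= 0.2:
            raise ValueError(f"Drift gate failed: PSI >= 0.2 for '{col}'")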
Stage 2: Model Training & Reproducibility
flowchart TB
    A[Training Config<br/>Versioned] --> B[Set All Seeds<br/>Python, NumPy, PyTorch]
    B --> C[Load Data<br/>Hash: abc123]
    C --> D[Feature Engineering<br/>Version: 2.1.0]
    D --> E[Model Training]
    E --> F[Checkpointing<br/>Every 5 epochs]
    F --> G[Artifact Creation]
    G --> H[Metadata Logging]
    H --> I[Complete Package:<br/>✓ Model weights<br/>✓ Config<br/>✓ Data hash<br/>✓ Code version<br/>✓ Environment]
Reproducibility Checklist:
| Component | Requirement | Verification Method |
|---|---|---|
| Random Seeds | Fixed for Python, NumPy, PyTorch, TF | Assert same outputs on re-run |
| Data Version | SHA256 hash logged | Compare hash in metadata |
| Code Version | Git commit SHA | Tag in model artifact |
| Dependencies | Exact versions (requirements.txt) | Pin with ==, not >= |
| Environment | Docker image with digest | Use immutable image tags |
| Hyperparameters | All logged to MLflow/W&B | Retrieve and compare |
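Minimal Code Example - Reproducibility Helpers (illustrative): a sketch of how the seed, data-version, and code-version rows of the checklist might be enforced in the training entrypoint. The function names are assumptions; TensorFlow would need its own seed call if it were in the stack.
# reproducibility.py
import hashlib
import os
import random
import subprocess

import numpy as np
import torch

def set_all_seeds(seed: int = 42) -> None:
    """Fix every source of randomness listed in the checklist above."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade some speed for determinism
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

def dataset_hash(path: str) -> str:
    """SHA256 of the training data file, logged with the model artifact."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def git_commit_sha() -> str:
    """Current code version, tagged into the artifact metadata."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()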
Stage 3: Evaluation Gates
flowchart TD
    A[Trained Model] --> B[Task Performance<br/>F1, Accuracy, BLEU]
    A --> C[Safety Evaluation<br/>Toxicity, Bias, PII]
    A --> D[Robustness Testing<br/>Adversarial, OOD]
    A --> E[Efficiency Testing<br/>Latency, Cost]
    A --> F[Fairness Testing<br/>Demographic Parity]
    B --> G{All Gates Pass?}
    C --> G
    D --> G
    E --> G
    F --> G
    G -->|Yes| H[Package Model]
    G -->|No| I[Block Release<br/>Alert Team]
    style I fill:#f88
LLM-Specific Evaluation Framework:
| Evaluation Category | Metrics | Test Set Size | Pass Threshold | Automated? |
|---|---|---|---|---|
| Task Performance | BLEU, ROUGE, Exact Match | 500-1000 examples | >Baseline | Yes |
| Faithfulness | NLI score, LLM-as-judge | 200-500 examples | >0.90 | Yes |
| Toxicity | Perspective API, Detoxify | 500-1000 examples | <1% toxic | Yes |
| Bias | Demographic parity, sentiment by group | 500 examples per group | Max disparity <0.10 | Yes |
| PII Leakage | Regex + NER detection | 1000 examples | 0 leaks | Yes |
| Cost Efficiency | Cost per request, tokens/query | 1000 examples | <$0.05/request | Yes |
| Latency | P50, P95, P99 | 1000 requests | P95 <3s | Yes |
Minimal Code Example - Evaluation Suite:
# eval_gates.py
class EvaluationFailure(Exception):
    """Raised when a release-blocking evaluation gate fails."""

def run_evaluation_suite(model, test_set):
    # Each evaluate_* helper returns a dict of metrics for its category
    results = {
        "task_performance": evaluate_task_metrics(model, test_set),
        "safety": evaluate_safety(model, test_set),
        "fairness": evaluate_fairness(model, test_set),
        "efficiency": evaluate_efficiency(model, test_set),
    }
    # Gate: block the release if any critical threshold fails
    if results["task_performance"]["f1"] < 0.85:
        raise EvaluationFailure("F1 below threshold")
    if results["safety"]["toxicity_rate"] > 0.01:
        raise EvaluationFailure("Toxicity rate too high")
    return results
Stage 4: Model Packaging
flowchart LR
    A[Model Artifact] --> B[Add Metadata]
    B --> C[Add Model Card]
    C --> D[Add Dependencies]
    D --> E[Security Scan]
    E --> F[Sign Artifact]
    F --> G[Upload to Registry]
    B -.->|Includes| H[Training Data Hash<br/>Eval Metrics<br/>Approvals<br/>Known Risks]
    E -.->|Checks| I[Vulnerabilities<br/>License Compliance<br/>Secret Scanning]
Model Package Structure:
model_package/
├── model.safetensors          # Model weights (secure format)
├── metadata.json              # Complete lineage
├── requirements.txt           # Pinned dependencies
├── model_card.md              # Documentation
├── evaluation_results.json    # All metrics
├── data_contract.yaml         # Input expectations
├── inference/
│   ├── preprocess.py
│   ├── predict.py
│   └── postprocess.py
└── signatures/
    └── model.sig              # Cryptographic signature
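Minimal Code Example - Packaging Metadata (illustrative): one way the metadata.json in this layout could be assembled at packaging time. The field names are assumptions chosen to match the lineage items above, not a fixed schema.
# package_model.py
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Streamed SHA256 so large weight files are not loaded into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_metadata(package_dir: str, data_hash: str, code_sha: str, eval_results: dict) -> None:
    pkg = Path(package_dir)
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "model_weights_sha256": sha256_of(pkg / "model.safetensors"),
        "training_data_sha256": data_hash,   # from the reproducibility step
        "code_commit": code_sha,             # git SHA of the training code
        "evaluation": eval_results,          # output of the evaluation suite
        "approvals": [],                     # filled in by the approval workflow
    }
    (pkg / "metadata.json").write_text(json.dumps(metadata, indent=2))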
Stage 5: Progressive Deployment
flowchart TD
    A[Model in Registry] --> B[Shadow Deploy<br/>2 hours, 0% traffic]
    B --> C{Shadow Metrics OK?}
    C -->|Yes| D[Canary 5%<br/>1 hour]
    C -->|No| Z[Rollback]
    D --> E{Canary Metrics OK?}
    E -->|Yes| F[Canary 25%<br/>2 hours]
    E -->|No| Z
    F --> G{Metrics OK?}
    G -->|Yes| H[Canary 50%<br/>4 hours]
    G -->|No| Z
    H --> I{Metrics OK?}
    I -->|Yes| J[Full Rollout 100%]
    I -->|No| Z
    J --> K[Production<br/>Continuous Monitoring]
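Minimal Code Example - Canary Gate (illustrative): the promotion decision at each step reduces to a small health check plus an automatic rollback path. The traffic_router object, metric names, and thresholds below are assumptions for this sketch; in practice they would come from the serving platform's metrics and routing APIs.
# canary_gate.py
P95_LATENCY_BUDGET_S = 3.0   # mirrors the latency gate used elsewhere in the pipeline
MAX_ERROR_RATE = 0.02

def canary_healthy(metrics: dict) -> bool:
    """Compare live canary metrics against the stable baseline."""
    return (
        metrics["canary"]["p95_latency_s"] <= P95_LATENCY_BUDGET_S
        and metrics["canary"]["error_rate"] <= MAX_ERROR_RATE
        and metrics["canary"]["quality_score"] >= 0.98 * metrics["baseline"]["quality_score"]
    )

def advance_or_rollback(metrics: dict, traffic_router, next_pct: int) -> int:
    """Promote the canary to the next traffic stage, or roll back automatically."""
    if canary_healthy(metrics):
        traffic_router.set_canary_traffic(next_pct)   # e.g. 5 -> 25 -> 50 -> 100
        return next_pct
    traffic_router.rollback_to_stable()               # automated, no human in the loop
    return 0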
Release Strategy Comparison
| Strategy | Blast Radius | Rollback Speed | Cost Overhead | Validation Quality | Use Case |
|---|---|---|---|---|---|
| Big Bang | 100% users | Slow (10-30min) | None | Low | Small projects, low risk |
| Blue/Green | 50% temporary | Fast (1-2min) | 2x infra during deploy | Medium | Critical services |
| Canary | 5-25% | Fast (1-2min) | 10-25% | High | Standard practice |
| Shadow | 0% (no user impact) | N/A | 2x compute | Very High | High-risk changes |
| Feature Flags | Configurable 0-100% | Instant | Minimal | High | Gradual rollouts |
| Progressive (All Combined) | 5% → 100% | Fast (auto) | 25-50% | Very High | Production best practice |
Environment Parity Architecture
flowchart TB
    subgraph Dev Environment
        A[Local Docker]
        B[Same Base Image]
        C[Mock Services]
    end
    subgraph Staging Environment
        D[Kubernetes Staging]
        E[Same Base Image]
        F[Staging Services]
    end
    subgraph Production Environment
        G[Kubernetes Prod]
        H[Same Base Image]
        I[Production Services]
    end
    J[Dockerfile] --> B
    J --> E
    J --> H
    K[requirements.txt<br/>Pinned Versions] --> A
    K --> D
    K --> G
    style B fill:#90EE90
    style E fill:#90EE90
    style H fill:#90EE90
Environment Parity Principles:
| Principle | Implementation | Anti-Pattern to Avoid |
|---|---|---|
| Immutable Infrastructure | Docker images with SHA digests | Modifying running containers |
| Dependency Pinning | package==1.2.3 (exact) | package>=1.2 (range) |
| Config Externalization | Environment variables, ConfigMaps | Hardcoded configs |
| Infrastructure as Code | Terraform, CloudFormation | Manual infrastructure changes |
| Secrets Management | Vault, Secret Manager | Hardcoded API keys |
| No Snowflakes | All envs created from code | Manually configured servers |
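Minimal Code Example - Config Externalization (illustrative): a sketch of what the config-externalization and secrets rows look like in the service code itself, so the same container image runs unchanged across environments. The variable names are assumptions for illustration.
# config.py
import os

class Settings:
    """All runtime configuration comes from the environment; nothing is baked into the image."""
    def __init__(self) -> None:
        self.model_uri = self._required("MODEL_URI")          # e.g. a registry path
        self.api_key = self._required("INFERENCE_API_KEY")    # injected by the secret manager
        self.max_batch_size = int(os.environ.get("MAX_BATCH_SIZE", "16"))
        self.environment = os.environ.get("DEPLOY_ENV", "dev")

    @staticmethod
    def _required(name: str) -> str:
        value = os.environ.get(name)
        if value is None:
            # Fail fast at startup rather than at the first request
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value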
Approval Workflow
stateDiagram-v2
    [*] --> Development
    Development --> MLLeadApproval: Evaluation gates pass
    MLLeadApproval --> Staging: Approved
    MLLeadApproval --> Development: Rejected
    Staging --> SecurityReview: Staging validation pass
    SecurityReview --> ProductApproval: Security approved
    SecurityReview --> Staging: Security issues found
    ProductApproval --> Production: All approvals
    ProductApproval --> Staging: Product feedback
    Production --> Canary: Deploy canary
    Canary --> FullProduction: Metrics pass
    Canary --> Production: Rollback
    FullProduction --> Monitoring
    Monitoring --> [*]
Risk-Based Approval Matrix:
| Model Risk Level | Required Approvals | Evaluation Requirements | Deployment Strategy |
|---|---|---|---|
| Low | ML Lead | Basic metrics | Direct to production |
| Medium | ML Lead + Security | Metrics + safety tests | Canary 5% → 100% |
| High | ML Lead + Security + Compliance | All evals + fairness | Shadow → Canary → Full |
| Critical | ML Lead + Security + Compliance + Product + Legal | Comprehensive suite + external audit | Shadow → Canary 5% → 25% → 50% → 100% (24h between stages) |
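Minimal Code Example - Approval Matrix as Code (illustrative): the matrix can also live in code so CI can enforce it mechanically. The role names and gate function below are a sketch, not tied to any particular workflow tool.
# approvals.py
REQUIRED_APPROVALS = {
    "low": {"ml_lead"},
    "medium": {"ml_lead", "security"},
    "high": {"ml_lead", "security", "compliance"},
    "critical": {"ml_lead", "security", "compliance", "product", "legal"},
}

def release_allowed(risk_level: str, granted_approvals: set) -> bool:
    """A release proceeds only when every role required for the risk level has signed off."""
    missing = REQUIRED_APPROVALS[risk_level] - granted_approvals
    if missing:
        print(f"Blocked: missing approvals from {sorted(missing)}")
        return False
    return True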
CI/CD Tooling Comparison
| Tool Stack | Best For | Strengths | Limitations | Approximate Cost |
|---|---|---|---|---|
| GitHub Actions + MLflow | Startups, small teams | Easy setup, familiar, free tier | Limited ML features | Free - $500/mo |
| GitLab CI + DVC | Self-hosted, privacy-focused | Full control, data versioning | Ops overhead | $1K-5K/mo (infra) |
| Jenkins + Kubeflow | Enterprise, complex pipelines | Highly customizable, open source | Steep learning curve | $2K-10K/mo (infra) |
| Azure ML Pipelines | Azure-centric orgs | Tight Azure integration, managed | Vendor lock-in | $5K-50K/mo |
| AWS SageMaker Pipelines | AWS-centric orgs | Full AWS ecosystem, scalable | Complex pricing | $5K-50K/mo |
| GCP Vertex AI Pipelines | GCP-centric orgs | Managed, serverless | Vendor lock-in | $5K-50K/mo |
| Databricks MLOps | Data-heavy ML, Spark users | Excellent for big data | Expensive | $10K-100K/mo |
Case Study: LLM Summarization Service CI/CD
Background: A SaaS company built an LLM-powered document summarization service that suffered frequent quality regressions. Releases were manual, testing was ad hoc, and rollbacks took 2-3 hours.
Problems Before CI/CD:
| Issue | Impact | Frequency |
|---|---|---|
| Quality Regressions | 15% of releases had issues discovered by users | Weekly |
| No Systematic Testing | Faithfulness and toxicity not measured | Every release |
| Manual Deployment | 4 hours per release, error-prone | Each release |
| Slow Rollbacks | 2-3 hours to rollback, manual process | When needed |
| No Lineage | Couldn't trace model → data → training | Always |
Solution - Automated CI/CD Pipeline:
flowchart LR
    A[Code Commit] --> B[Data Validation<br/>Schema + Drift]
    B --> C[Model Training<br/>Reproducible]
    C --> D[Evaluation Suite<br/>6 categories]
    D --> E{Gates Pass?}
    E -->|Yes| F[Package + Sign]
    E -->|No| G[Alert & Block]
    F --> H[Shadow Deploy<br/>2 hours]
    H --> I[Canary 5%<br/>1 hour]
    I --> J{Metrics OK?}
    J -->|Yes| K[Progressive Rollout]
    J -->|No| L[Auto Rollback<br/>2 minutes]
Evaluation Gates Implemented:
| Gate | Metric | Threshold | Result |
|---|---|---|---|
| Faithfulness | NLI entailment score | >0.90 | Caught 8 hallucination issues |
| Toxicity | Perspective API | <1% toxic rate | Caught 3 toxic output issues |
| Summary Quality | ROUGE-L vs human | >0.35 | Maintained quality bar |
| Cost | Cost per summary | <$0.05 | Optimized prompts |
| Latency | P95 latency | <3 seconds | Prevented slow models |
| Bias | Sentiment by topic | <0.10 disparity | Caught 2 bias issues |
Results After 6 Months:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Regression Escapes | 15% of releases | <1% of releases | 93% reduction |
| Deployment Time | 4 hours manual | 30 min automated | 87% reduction |
| Rollback Time | 2-3 hours | 2 minutes | 98% reduction |
| Release Frequency | 1x/month | 3x/week | 12x increase |
| Compliance Audit Prep | 2 weeks | 2 hours | 95% reduction |
| Production Incidents | 6/quarter | 1/quarter | 83% reduction |
Key Success Factors:
- Invested 3 weeks upfront building pipeline - saved 10+ hours/week ongoing
- Started with shadow deployments - built confidence before auto-rollout
- Comprehensive eval gates - caught issues before users
- Automated rollback - reduced MTTR from hours to minutes
- Complete lineage tracking - compliance audit preparation 95% faster
Implementation Roadmap
Phase 1: Foundation (Weeks 1-2)
- Set up version control for code, configs, and pipelines
- Containerize training and inference code
- Pin all dependencies with exact versions
- Implement basic CI (linting, unit tests, security scans)
- Set up experiment tracking (MLflow, Weights & Biases)
Phase 2: Data Pipeline (Weeks 3-4)
- Define data contracts with upstream sources
- Implement data validation gates (schema, quality, drift)
- Create reproducible training pipeline with seed setting
- Add training config versioning and lineage tracking
- Set up resource quotas and timeouts
Phase 3: Evaluation (Weeks 5-6)
- Define evaluation metrics and minimum thresholds
- Build automated evaluation suite (task, safety, efficiency, fairness)
- Create regression test sets
- Implement LLM-specific evals (faithfulness, toxicity)
- Set up comparison against baseline models
Phase 4: Packaging (Week 7)
- Define model package format with metadata
- Set up model registry with versioning
- Implement approval workflows based on risk
- Create model card templates
- Store evaluation results with each version
Phase 5: Deployment (Weeks 8-10)
- Choose deployment strategy (canary, blue/green, shadow)
- Implement progressive rollout stages
- Set up automated rollback triggers
- Create deployment runbooks
- Wire monitoring to deployment pipeline
Phase 6: Production Hardening (Weeks 11-12)
- Implement complete lineage tracking
- Set up compliance evidence collection
- Create incident response runbooks
- Optimize pipeline performance and cost
- Regular review and improvement of gates
Best Practices & Anti-Patterns
✅ Best Practices
| Practice | Why It Matters | How to Implement |
|---|---|---|
| Progressive Complexity | Start simple, add sophistication | Week 1: Basic CI → Month 3: Full pipeline |
| Fail Fast, Fail Loud | Catch issues early, minimize wasted compute | Data validation before training |
| Observability First | Can't fix what you can't see | Log everything with structured logging |
| Automated Rollback | Reduce MTTR from hours to minutes | Pre-defined triggers + tested runbooks |
| Environment Parity | "Works on my machine" → "Works everywhere" | Containers + IaC + pinned deps |
❌ Anti-Patterns to Avoid
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skipping Data Validation | Silent quality degradation | Always validate data first, fail fast |
| Non-Reproducible Builds | Can't debug production issues | Containerization, seeds, dependency pinning |
| Manual Approval Bottlenecks | Slow iteration, approval fatigue | Risk-based approval matrix |
| Insufficient Eval Coverage | Harmful outputs reach production | Multi-dimensional gates (task, safety, fairness) |
| No Rollback Plan | Extended outages during incidents | Automated rollback with tested runbooks |
Success Metrics
Track these KPIs to measure CI/CD maturity:
| Metric | Target | Measurement Method |
|---|---|---|
| Deployment Frequency | Daily | Count deployments/week |
| Lead Time | <4 hours | Code commit to production |
| MTTR | <10 minutes | Time to rollback |
| Change Failure Rate | <5% | Failed deployments / total |
| Evaluation Coverage | >90% | Critical paths with automated tests |
| Reproducibility | 100% | Builds exactly reproduced |
| Deployment Confidence | High | Team deploys without fear (survey) |
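Minimal Code Example - KPI Computation (illustrative): a sketch of how the DORA-style rows above could be computed from deployment and incident records. The record fields (status, committed_at, deployed_at, started_at, resolved_at) are assumptions about how the data is logged.
# cicd_kpis.py
from datetime import timedelta

def change_failure_rate(deployments: list) -> float:
    """Failed deployments / total deployments (target < 5%)."""
    failed = sum(1 for d in deployments if d["status"] == "failed")
    return failed / len(deployments) if deployments else 0.0

def mttr(incidents: list) -> timedelta:
    """Mean time from incident start to completed rollback (target < 10 minutes)."""
    durations = [i["resolved_at"] - i["started_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations) if durations else timedelta()

def lead_time(deployments: list) -> timedelta:
    """Mean time from code commit to production (target < 4 hours)."""
    durations = [d["deployed_at"] - d["committed_at"] for d in deployments]
    return sum(durations, timedelta()) / len(durations) if durations else timedelta()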