Part 10: MLOps & Platform Engineering

Chapter 56: CI/CD for ML & LLM

Overview

Establish reproducible builds, environment parity, and safe releases for ML/LLM workloads. Continuous Integration and Continuous Deployment (CI/CD) for machine learning extends traditional software practices with specialized pipelines for data validation, model training, evaluation gates, and progressive deployment strategies that account for the unique characteristics of ML systems.

Pipeline Architecture

  • Data/feature checks; model training; eval gates; packaging; release.
  • Environment management: containers, reproducibility, secrets.
  • Approvals and rollbacks; canary and shadow deployments.

Deliverables

  • Pipeline definitions and release checklist
  • Infrastructure-as-code templates
  • Automated testing and evaluation frameworks
  • Deployment runbooks and rollback procedures

Why It Matters

Reproducibility and safe releases are the backbone of trustworthy AI. CI/CD for ML/LLM codifies data checks, model evaluation gates, and rollbacks to avoid shipping regressions. Without proper CI/CD:

  • Models trained on one engineer's laptop may not reproduce in production
  • Data quality issues slip through to training, degrading model performance
  • Breaking changes deploy without safety checks, impacting users
  • Rollbacks take hours instead of minutes
  • Compliance audits lack evidence trails

Organizations with mature ML CI/CD typically report a 60-80% reduction in production incidents, 3x faster iteration cycles, and complete audit trails for regulated environments.

Complete ML/LLM CI/CD Pipeline

flowchart TD
    A[Code Commit] --> B[Data Validation]
    B --> C{Data Quality Gates}
    C -->|Pass| D[Feature Engineering]
    C -->|Fail| Z[Alert & Stop]
    D --> E[Model Training]
    E --> F[Model Evaluation]
    F --> G{Eval Gates}
    G -->|Pass| H[Model Packaging]
    G -->|Fail| Z
    H --> I[Security Scan]
    I --> J{Security Gates}
    J -->|Pass| K[Registry Upload]
    J -->|Fail| Z
    K --> L[Shadow Deployment]
    L --> M{Shadow Metrics}
    M -->|Pass| N[Canary Deployment 5%]
    M -->|Fail| Z
    N --> O{Canary Metrics}
    O -->|Pass| P[Progressive Rollout 25%→50%→100%]
    O -->|Fail| Q[Auto Rollback]
    P --> R[Production Monitoring]
    Q --> Z

Pipeline Stage Details

Stage 1: Data Validation

flowchart LR
    A[Raw Data] --> B[Schema Check]
    B --> C[Quality Check]
    C --> D[Drift Detection]
    D --> E{All Pass?}
    E -->|Yes| F[Proceed to Training]
    E -->|No| G[Alert & Block]
    B -.->|Validates| H[Column Types<br/>Required Fields<br/>Constraints]
    C -.->|Validates| I[Null Rates<br/>Duplicates<br/>Outliers]
    D -.->|Validates| J[PSI < 0.2<br/>KS Test<br/>Distribution Shift]

Data Quality Gate Criteria:

| Check Type | Metric | Threshold | Action on Failure |
| --- | --- | --- | --- |
| Schema Validation | Missing required columns | 0 missing | Block pipeline |
| Data Quality | Null rate per column | <1% | Block pipeline |
| Data Quality | Duplicate records | <0.1% | Warning |
| Covariate Drift | PSI (Population Stability Index) | <0.2 | Block pipeline |
| Concept Drift | Label distribution change | <10% relative | Warning |
| Freshness | Data age | <24 hours | Block if >48h |
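
A minimal sketch of how these gates can be wired together, assuming pandas DataFrames; the thresholds mirror the table above, and the module and function names (data_gates.py, check_schema, population_stability_index, check_drift) are illustrative rather than part of any particular framework:

# data_gates.py (hypothetical module) - illustrative data-validation gates
import numpy as np
import pandas as pd


class DataValidationError(Exception):
    """Raised to block the pipeline when a hard gate fails."""


def check_schema(df: pd.DataFrame, required_columns: list[str]) -> None:
    # Gate: 0 missing required columns, otherwise block the pipeline.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise DataValidationError(f"Missing required columns: {missing}")


def check_null_rates(df: pd.DataFrame, max_null_rate: float = 0.01) -> None:
    # Gate: null rate per column must stay below 1%.
    null_rates = df.isna().mean()
    offenders = null_rates[null_rates > max_null_rate]
    if not offenders.empty:
        raise DataValidationError(f"Null rate too high: {offenders.to_dict()}")


def population_stability_index(expected, actual, bins: int = 10) -> float:
    # PSI over shared bins; the small epsilon avoids log(0) and division by zero.
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def check_drift(train_col: pd.Series, live_col: pd.Series, max_psi: float = 0.2) -> None:
    # Gate: PSI at or above 0.2 indicates significant covariate shift, so block.
    psi = population_stability_index(train_col.to_numpy(), live_col.to_numpy())
    if psi >= max_psi:
        raise DataValidationError(f"PSI {psi:.3f} exceeds {max_psi}")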

Stage 2: Model Training & Reproducibility

flowchart TB
    A[Training Config<br/>Versioned] --> B[Set All Seeds<br/>Python, NumPy, PyTorch]
    B --> C[Load Data<br/>Hash: abc123]
    C --> D[Feature Engineering<br/>Version: 2.1.0]
    D --> E[Model Training]
    E --> F[Checkpointing<br/>Every 5 epochs]
    F --> G[Artifact Creation]
    G --> H[Metadata Logging]
    H --> I[Complete Package:<br/>✓ Model weights<br/>✓ Config<br/>✓ Data hash<br/>✓ Code version<br/>✓ Environment]

Reproducibility Checklist:

| Component | Requirement | Verification Method |
| --- | --- | --- |
| Random Seeds | Fixed for Python, NumPy, PyTorch, TF | Assert same outputs on re-run |
| Data Version | SHA256 hash logged | Compare hash in metadata |
| Code Version | Git commit SHA | Tag in model artifact |
| Dependencies | Exact versions (requirements.txt) | Pin with ==, not >= |
| Environment | Docker image with digest | Use immutable image tags |
| Hyperparameters | All logged to MLflow/W&B | Retrieve and compare |
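
A short sketch of the first rows of this checklist, assuming a PyTorch plus MLflow stack; the helper names (set_all_seeds, sha256_of_file, log_lineage) are illustrative:

# reproducibility.py (hypothetical module) - seeds, data hashing, and lineage logging
import hashlib
import random
import subprocess

import mlflow
import numpy as np
import torch


def set_all_seeds(seed: int = 42) -> None:
    # Fix every RNG the training code touches so re-runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def sha256_of_file(path: str) -> str:
    # Hash the training data file so the exact version is recorded in the lineage.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_lineage(data_path: str, config: dict) -> None:
    # Record hyperparameters, data hash, and git commit alongside the MLflow run.
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_params(config)
    mlflow.set_tags({"data_sha256": sha256_of_file(data_path), "git_commit": commit})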

Stage 3: Evaluation Gates

flowchart TD
    A[Trained Model] --> B[Task Performance<br/>F1, Accuracy, BLEU]
    A --> C[Safety Evaluation<br/>Toxicity, Bias, PII]
    A --> D[Robustness Testing<br/>Adversarial, OOD]
    A --> E[Efficiency Testing<br/>Latency, Cost]
    A --> F[Fairness Testing<br/>Demographic Parity]
    B --> G{All Gates Pass?}
    C --> G
    D --> G
    E --> G
    F --> G
    G -->|Yes| H[Package Model]
    G -->|No| I[Block Release<br/>Alert Team]
    style I fill:#f88

LLM-Specific Evaluation Framework:

| Evaluation Category | Metrics | Test Set Size | Pass Threshold | Automated? |
| --- | --- | --- | --- | --- |
| Task Performance | BLEU, ROUGE, Exact Match | 500-1000 examples | >Baseline | Yes |
| Faithfulness | NLI score, LLM-as-judge | 200-500 examples | >0.90 | Yes |
| Toxicity | Perspective API, Detoxify | 500-1000 examples | <1% toxic | Yes |
| Bias | Demographic parity, sentiment by group | 500 examples per group | Max disparity <0.10 | Yes |
| PII Leakage | Regex + NER detection | 1000 examples | 0 leaks | Yes |
| Cost Efficiency | Cost per request, tokens/query | 1000 examples | <$0.05/request | Yes |
| Latency | P50, P95, P99 | 1000 requests | P95 <3s | Yes |

Minimal Code Example - Evaluation Suite:

# eval_gates.py
class EvaluationFailure(Exception):
    """Raised to block the release when a critical evaluation gate fails."""


def run_evaluation_suite(model, test_set):
    # Each evaluate_* helper is implemented elsewhere in the project and
    # returns a dict of metrics for its category.
    results = {
        "task_performance": evaluate_task_metrics(model, test_set),
        "safety": evaluate_safety(model, test_set),
        "fairness": evaluate_fairness(model, test_set),
        "efficiency": evaluate_efficiency(model, test_set),
    }

    # Gate: block the release if any critical threshold fails
    if results["task_performance"]["f1"] < 0.85:
        raise EvaluationFailure("F1 below threshold")
    if results["safety"]["toxicity_rate"] > 0.01:
        raise EvaluationFailure("Toxicity rate too high")

    return results
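
The PII row in the table above can be approximated without any external service. Below is a hedged sketch of a regex-only leakage check; the patterns are illustrative and deliberately incomplete, and production gates usually combine regex with NER models and provider APIs:

# pii_gate.py (hypothetical module) - regex-only PII leakage check
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}


def count_pii_leaks(outputs: list[str]) -> int:
    # Count model outputs containing at least one PII-looking match.
    return sum(
        1 for text in outputs
        if any(pattern.search(text) for pattern in PII_PATTERNS.values())
    )


def evaluate_pii(outputs: list[str]) -> dict:
    # Gate from the table: zero leaks allowed across the test set.
    leaks = count_pii_leaks(outputs)
    return {"pii_leaks": leaks, "passed": leaks == 0}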

Stage 4: Model Packaging

flowchart LR
    A[Model Artifact] --> B[Add Metadata]
    B --> C[Add Model Card]
    C --> D[Add Dependencies]
    D --> E[Security Scan]
    E --> F[Sign Artifact]
    F --> G[Upload to Registry]
    B -.->|Includes| H[Training Data Hash<br/>Eval Metrics<br/>Approvals<br/>Known Risks]
    E -.->|Checks| I[Vulnerabilities<br/>License Compliance<br/>Secret Scanning]

Model Package Structure:

model_package/
├── model.safetensors          # Model weights (secure format)
├── metadata.json              # Complete lineage
├── requirements.txt           # Pinned dependencies
├── model_card.md             # Documentation
├── evaluation_results.json    # All metrics
├── data_contract.yaml        # Input expectations
├── inference/
│   ├── preprocess.py
│   ├── predict.py
│   └── postprocess.py
└── signatures/
    └── model.sig             # Cryptographic signature
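
A sketch of how the metadata.json in this layout might be generated at packaging time; the field names follow the structure above but are an assumption, not a fixed schema:

# package_model.py (hypothetical script) - writes the lineage metadata for the package
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def write_metadata(package_dir: str, eval_results: dict, data_hash: str) -> Path:
    # Assemble the lineage record that ships next to the model weights.
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "training_data_sha256": data_hash,
        "evaluation": eval_results,   # same values as evaluation_results.json
        "approvals": [],              # filled in by the approval workflow
        "known_risks": [],            # e.g. domains with thin eval coverage
    }
    path = Path(package_dir) / "metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path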

Stage 5: Progressive Deployment

flowchart TD
    A[Model in Registry] --> B[Shadow Deploy<br/>2 hours, 0% traffic]
    B --> C{Shadow Metrics OK?}
    C -->|Yes| D[Canary 5%<br/>1 hour]
    C -->|No| Z[Rollback]
    D --> E{Canary Metrics OK?}
    E -->|Yes| F[Canary 25%<br/>2 hours]
    E -->|No| Z
    F --> G{Metrics OK?}
    G -->|Yes| H[Canary 50%<br/>4 hours]
    G -->|No| Z
    H --> I{Metrics OK?}
    I -->|Yes| J[Full Rollout 100%]
    I -->|No| Z
    J --> K[Production<br/>Continuous Monitoring]
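
The promotion decisions in this flow can be automated with a small controller that soaks each stage, checks canary metrics, and rolls back on any breach. The sketch below assumes two hypothetical hooks, get_canary_metrics() and set_traffic_split(), provided by whatever serving platform is in use:

# progressive_rollout.py (hypothetical module) - promotion/rollback loop
import time

# (traffic %, soak time in seconds), mirroring the stages in the diagram above
STAGES = [(5, 1 * 3600), (25, 2 * 3600), (50, 4 * 3600), (100, 0)]

# Guardrail thresholds; a stage passes only if every metric stays within bounds.
THRESHOLDS = {"error_rate": 0.01, "p95_latency_s": 3.0}


def metrics_healthy(metrics: dict) -> bool:
    return all(metrics.get(name, float("inf")) <= limit for name, limit in THRESHOLDS.items())


def progressive_rollout(get_canary_metrics, set_traffic_split) -> bool:
    for traffic_pct, soak_seconds in STAGES:
        set_traffic_split(canary_pct=traffic_pct)
        time.sleep(soak_seconds)                 # soak before evaluating the stage
        if not metrics_healthy(get_canary_metrics()):
            set_traffic_split(canary_pct=0)      # automated rollback to the previous model
            return False
    return True                                  # canary promoted to 100%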

Release Strategy Comparison

| Strategy | Blast Radius | Rollback Speed | Cost Overhead | Validation Quality | Use Case |
| --- | --- | --- | --- | --- | --- |
| Big Bang | 100% users | Slow (10-30 min) | None | Low | Small projects, low risk |
| Blue/Green | 100% at cutover (instant switch) | Fast (1-2 min) | 2x infra during deploy | Medium | Critical services |
| Canary | 5-25% | Fast (1-2 min) | 10-25% | High | Standard practice |
| Shadow | 0% (no user impact) | N/A | 2x compute | Very High | High-risk changes |
| Feature Flags | Configurable 0-100% | Instant | Minimal | High | Gradual rollouts |
| Progressive (all combined) | 5% → 100% | Fast (auto) | 25-50% | Very High | Production best practice |

Environment Parity Architecture

flowchart TB
    subgraph Dev Environment
        A[Local Docker]
        B[Same Base Image]
        C[Mock Services]
    end
    subgraph Staging Environment
        D[Kubernetes Dev]
        E[Same Base Image]
        F[Staging Services]
    end
    subgraph Production Environment
        G[Kubernetes Prod]
        H[Same Base Image]
        I[Production Services]
    end
    J[Dockerfile] --> B
    J --> E
    J --> H
    K[requirements.txt<br/>Pinned Versions] --> A
    K --> D
    K --> G
    style B fill:#90EE90
    style E fill:#90EE90
    style H fill:#90EE90

Environment Parity Principles:

| Principle | Implementation | Anti-Pattern to Avoid |
| --- | --- | --- |
| Immutable Infrastructure | Docker images with SHA digests | Modifying running containers |
| Dependency Pinning | package==1.2.3 (exact) | package>=1.2 (range) |
| Config Externalization | Environment variables, ConfigMaps | Hardcoded configs |
| Infrastructure as Code | Terraform, CloudFormation | Manual infrastructure changes |
| Secrets Management | Vault, Secret Manager | Hardcoded API keys |
| No Snowflakes | All envs created from code | Manually configured servers |
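
Dependency pinning is easy to declare and easy to break silently. A small CI check like the sketch below, using only the standard library's importlib.metadata, can confirm that the running image actually contains the pinned versions; the requirements.txt path and the handling of extras and environment markers are simplifications:

# check_pins.py (hypothetical script) - verify installed packages match requirements.txt pins
from importlib import metadata


def verify_pins(requirements_path: str = "requirements.txt") -> list[str]:
    mismatches = []
    for line in open(requirements_path):
        line = line.split("#")[0].strip()
        if not line or "==" not in line:
            continue  # only exact pins are checked; version ranges are the anti-pattern above
        name, expected = line.split("==", 1)
        name = name.split("[")[0].strip()        # drop extras such as package[extra]
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            mismatches.append(f"{name}: not installed (pinned {expected})")
            continue
        if installed != expected:
            mismatches.append(f"{name}: installed {installed}, pinned {expected}")
    return mismatches


if __name__ == "__main__":
    problems = verify_pins()
    if problems:
        raise SystemExit("Environment drift detected:\n" + "\n".join(problems))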

Approval Workflow

stateDiagram-v2
    [*] --> Development
    Development --> MLLeadApproval: Evaluation gates pass
    MLLeadApproval --> Staging: Approved
    MLLeadApproval --> Development: Rejected
    Staging --> SecurityReview: Staging validation pass
    SecurityReview --> ProductApproval: Security approved
    SecurityReview --> Staging: Security issues found
    ProductApproval --> Production: All approvals
    ProductApproval --> Staging: Product feedback
    Production --> Canary: Deploy canary
    Canary --> FullProduction: Metrics pass
    Canary --> Production: Rollback
    FullProduction --> Monitoring
    Monitoring --> [*]

Risk-Based Approval Matrix:

| Model Risk Level | Required Approvals | Evaluation Requirements | Deployment Strategy |
| --- | --- | --- | --- |
| Low | ML Lead | Basic metrics | Direct to production |
| Medium | ML Lead + Security | Metrics + safety tests | Canary 5% → 100% |
| High | ML Lead + Security + Compliance | All evals + fairness | Shadow → Canary → Full |
| Critical | ML Lead + Security + Compliance + Product + Legal | Comprehensive suite + external audit | Shadow → Canary 5% → 25% → 50% → 100% (24h between stages) |
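
Encoding the matrix as data keeps the approval policy in the pipeline rather than in people's heads. A minimal sketch, where the role names and the approvals input format are assumptions:

# approval_policy.py (hypothetical module) - risk-based approval check
REQUIRED_APPROVALS = {
    "low": {"ml_lead"},
    "medium": {"ml_lead", "security"},
    "high": {"ml_lead", "security", "compliance"},
    "critical": {"ml_lead", "security", "compliance", "product", "legal"},
}


def release_allowed(risk_level: str, approvals: set[str]) -> bool:
    # The release may proceed only when every role for this risk tier has signed off.
    return REQUIRED_APPROVALS[risk_level] <= approvals


# Example: a high-risk model missing compliance sign-off stays blocked.
assert not release_allowed("high", {"ml_lead", "security"})
assert release_allowed("medium", {"ml_lead", "security"})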

CI/CD Tooling Comparison

| Tool Stack | Best For | Strengths | Limitations | Approximate Cost |
| --- | --- | --- | --- | --- |
| GitHub Actions + MLflow | Startups, small teams | Easy setup, familiar, free tier | Limited ML features | Free-$500/mo |
| GitLab CI + DVC | Self-hosted, privacy-focused | Full control, data versioning | Ops overhead | $1K-5K/mo (infra) |
| Jenkins + Kubeflow | Enterprise, complex pipelines | Highly customizable, open source | Steep learning curve | $2K-10K/mo (infra) |
| Azure ML Pipelines | Azure-centric orgs | Tight Azure integration, managed | Vendor lock-in | $5K-50K/mo |
| AWS SageMaker Pipelines | AWS-centric orgs | Full AWS ecosystem, scalable | Complex pricing | $5K-50K/mo |
| GCP Vertex AI Pipelines | GCP-centric orgs | Managed, serverless | Vendor lock-in | $5K-50K/mo |
| Databricks MLOps | Data-heavy ML, Spark users | Excellent for big data | Expensive | $10K-100K/mo |

Case Study: LLM Summarization Service CI/CD

Background: A SaaS company built an LLM-powered document summarization service that suffered frequent quality regressions. Releases were manual, testing was ad hoc, and rollbacks took 2-3 hours.

Problems Before CI/CD:

| Issue | Impact | Frequency |
| --- | --- | --- |
| Quality Regressions | 15% of releases had issues discovered by users | Weekly |
| No Systematic Testing | Faithfulness and toxicity not measured | Every release |
| Manual Deployment | 4 hours per release, error-prone | Each release |
| Slow Rollbacks | 2-3 hours to roll back, manual process | When needed |
| No Lineage | Couldn't trace model → data → training | Always |

Solution - Automated CI/CD Pipeline:

flowchart LR
    A[Code Commit] --> B[Data Validation<br/>Schema + Drift]
    B --> C[Model Training<br/>Reproducible]
    C --> D[Evaluation Suite<br/>6 categories]
    D --> E{Gates Pass?}
    E -->|Yes| F[Package + Sign]
    E -->|No| G[Alert & Block]
    F --> H[Shadow Deploy<br/>2 hours]
    H --> I[Canary 5%<br/>1 hour]
    I --> J{Metrics OK?}
    J -->|Yes| K[Progressive Rollout]
    J -->|No| L[Auto Rollback<br/>2 minutes]

Evaluation Gates Implemented:

| Gate | Metric | Threshold | Result |
| --- | --- | --- | --- |
| Faithfulness | NLI entailment score | >0.90 | Caught 8 hallucination issues |
| Toxicity | Perspective API | <1% toxic rate | Caught 3 toxic output issues |
| Summary Quality | ROUGE-L vs human | >0.35 | Maintained quality bar |
| Cost | Cost per summary | <$0.05 | Optimized prompts |
| Latency | P95 latency | <3 seconds | Prevented slow models |
| Bias | Sentiment by topic | <0.10 disparity | Caught 2 bias issues |

Results After 6 Months:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Regression Escapes | 15% of releases | <1% of releases | 93% reduction |
| Deployment Time | 4 hours manual | 30 min automated | 87% reduction |
| Rollback Time | 2-3 hours | 2 minutes | 98% reduction |
| Release Frequency | 1x/month | 3x/week | 12x increase |
| Compliance Audit Prep | 2 weeks | 2 hours | 95% reduction |
| Production Incidents | 6/quarter | 1/quarter | 83% reduction |

Key Success Factors:

  1. Invested 3 weeks upfront building the pipeline - saved 10+ hours/week ongoing
  2. Started with shadow deployments - built confidence before automated rollout
  3. Comprehensive eval gates - caught issues before users did
  4. Automated rollback - reduced MTTR from hours to minutes
  5. Complete lineage tracking - made compliance audit preparation 95% faster

Implementation Roadmap

Phase 1: Foundation (Weeks 1-2)

  • Set up version control for code, configs, and pipelines
  • Containerize training and inference code
  • Pin all dependencies with exact versions
  • Implement basic CI (linting, unit tests, security scans)
  • Set up experiment tracking (MLflow, Weights & Biases)

Phase 2: Data Pipeline (Weeks 3-4)

  • Define data contracts with upstream sources
  • Implement data validation gates (schema, quality, drift)
  • Create reproducible training pipeline with seed setting
  • Add training config versioning and lineage tracking
  • Set up resource quotas and timeouts

Phase 3: Evaluation (Weeks 5-6)

  • Define evaluation metrics and minimum thresholds
  • Build automated evaluation suite (task, safety, efficiency, fairness)
  • Create regression test sets
  • Implement LLM-specific evals (faithfulness, toxicity)
  • Set up comparison against baseline models

Phase 4: Packaging (Week 7)

  • Define model package format with metadata
  • Set up model registry with versioning
  • Implement approval workflows based on risk
  • Create model card templates
  • Store evaluation results with each version

Phase 5: Deployment (Weeks 8-10)

  • Choose deployment strategy (canary, blue/green, shadow)
  • Implement progressive rollout stages
  • Set up automated rollback triggers
  • Create deployment runbooks
  • Wire monitoring to deployment pipeline

Phase 6: Production Hardening (Weeks 11-12)

  • Implement complete lineage tracking
  • Set up compliance evidence collection
  • Create incident response runbooks
  • Optimize pipeline performance and cost
  • Regular review and improvement of gates

Best Practices & Anti-Patterns

✅ Best Practices

| Practice | Why It Matters | How to Implement |
| --- | --- | --- |
| Progressive Complexity | Start simple, add sophistication | Week 1: Basic CI → Month 3: Full pipeline |
| Fail Fast, Fail Loud | Catch issues early, minimize wasted compute | Data validation before training |
| Observability First | Can't fix what you can't see | Log everything with structured logging |
| Automated Rollback | Reduce MTTR from hours to minutes | Pre-defined triggers + tested runbooks |
| Environment Parity | "Works on my machine" → "Works everywhere" | Containers + IaC + pinned deps |

❌ Anti-Patterns to Avoid

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Skipping Data Validation | Silent quality degradation | Always validate data first, fail fast |
| Non-Reproducible Builds | Can't debug production issues | Containerization, seeds, dependency pinning |
| Manual Approval Bottlenecks | Slow iteration, approval fatigue | Risk-based approval matrix |
| Insufficient Eval Coverage | Harmful outputs reach production | Multi-dimensional gates (task, safety, fairness) |
| No Rollback Plan | Extended outages during incidents | Automated rollback with tested runbooks |

Success Metrics

Track these KPIs to measure CI/CD maturity:

| Metric | Target | Measurement Method |
| --- | --- | --- |
| Deployment Frequency | Daily | Count deployments/week |
| Lead Time | <4 hours | Code commit to production |
| MTTR | <10 minutes | Time to rollback |
| Change Failure Rate | <5% | Failed deployments / total |
| Evaluation Coverage | >90% | Critical paths with automated tests |
| Reproducibility | 100% | Builds exactly reproduced |
| Deployment Confidence | High | Team deploys without fear (survey) |
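
These KPIs can be computed directly from whatever deployment log the pipeline already emits; a sketch assuming a list of deployment records with the illustrative fields named in the comments:

# cicd_kpis.py (hypothetical module) - KPI computation from a deployment log
def compute_kpis(deployments: list[dict]) -> dict:
    # Each record is assumed to carry: committed_at and deployed_at (datetimes),
    # failed (bool), and rollback_minutes (float, for failed deployments).
    failures = [d for d in deployments if d["failed"]]
    lead_times = [
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
        for d in deployments
    ]
    return {
        "deployments": len(deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "lead_time_hours_avg": sum(lead_times) / len(lead_times),
        "mttr_minutes_avg": (
            sum(d["rollback_minutes"] for d in failures) / len(failures) if failures else 0.0
        ),
    }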