Chapter 14 — Data Quality & Observability
Overview
Ensure data is fit for purpose with measurable quality dimensions and automated observability. Data quality is the foundation of trustworthy AI systems. Poor-quality data leads to inaccurate predictions, biased outcomes, and loss of user trust. Without proactive monitoring and automated quality checks, teams can spend 50-80% of their time firefighting data issues rather than building value.
Why It Matters
Quality issues compound across pipelines. Explicit SLAs and monitoring make problems visible and actionable.
Business Impact of Data Quality:
- Model Performance: Garbage in, garbage out; poor data quality directly degrades model accuracy
- Operational Costs: Data teams waste weeks debugging issues that proper monitoring would catch immediately
- Revenue Loss: One retailer lost $500K when its product recommendation pipeline silently failed for 3 days
- Compliance Risk: Incorrect or incomplete data in regulated industries results in fines
- User Trust: Inaccurate predictions erode confidence in AI systems
Data Quality Dimensions Framework
The Six Core Dimensions
| Dimension | Definition | Measurement | Typical SLO | Business Impact Example |
|---|---|---|---|---|
| Completeness | All required data is present | % non-null values | > 99.5% for critical fields | Missing customer emails prevent marketing campaigns |
| Validity | Data conforms to formats and rules | % passing validation | > 99% | Invalid email formats cause delivery failures |
| Accuracy | Data correctly represents reality | Match rate vs. golden source | > 99.5% | Wrong customer addresses cause shipping delays |
| Timeliness | Data is available when needed | End-to-end latency (p95) | < 5 minutes | Stale inventory causes out-of-stock sales |
| Consistency | Data is uniform across systems | Cross-system match rate | > 99% | Same customer with different IDs across platforms |
| Uniqueness | No unwanted duplication | Duplicate record rate | < 1% | Duplicate customer records skew analytics |
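As a concrete illustration, the sketch below computes several of these dimensions for a hypothetical customer table; the column names (`email`, `customer_id`, `event_ts`, `loaded_at`) are assumptions for the example, not a required schema.

```python
import pandas as pd

def core_quality_metrics(df: pd.DataFrame) -> dict:
    """Illustrative checks for completeness, validity, uniqueness, and timeliness."""
    return {
        # Completeness: share of non-null values in a critical field
        "completeness_email": df["email"].notna().mean(),
        # Validity: share of non-null emails matching a simple format rule
        "validity_email": df["email"].dropna().str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean(),
        # Uniqueness: duplicate record rate on the business key
        "duplicate_rate_customer_id": df["customer_id"].duplicated().mean(),
        # Timeliness: p95 lag between event time and load time, in seconds
        "p95_latency_s": (df["loaded_at"] - df["event_ts"]).dt.total_seconds().quantile(0.95),
    }
```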
Extended Dimensions for AI/ML
| Dimension | Why It Matters for ML | Detection Method | Remediation |
|---|---|---|---|
| Distribution Stability | Model breaks when input distribution shifts | KS test, PSI, Chi-square | Retrain model, alert data team |
| Bias & Fairness | Legal/ethical risk from disparate treatment | Demographic parity, equalized odds | Reweighting, remove proxy features |
| Feature Correlation | Multicollinearity affects interpretability | VIF, correlation matrix | Remove correlated features |
| Outliers | Can skew models or indicate data errors | IQR, Z-score, isolation forest | Cap/floor values, investigate cause |
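The Outliers row above lists IQR and Z-score as detection methods; a minimal sketch of both, assuming a numerical pandas Series and the conventional 1.5·IQR and |z| > 3 cutoffs:

```python
import pandas as pd

def outlier_rates(values: pd.Series) -> dict:
    """Share of outliers via the IQR rule and the Z-score rule from the table above."""
    x = values.dropna().astype(float)
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    z = (x - x.mean()) / x.std(ddof=0)
    return {
        "iqr_outlier_rate": iqr_outliers.mean(),
        "zscore_outlier_rate": (z.abs() > 3).mean(),
    }
```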
Data Quality Architecture
graph TB subgraph "Data Sources" S1[Databases] S2[Data Lakes] S3[APIs] S4[Streams] end subgraph "Collection Layer" C1[Schema Validators] C2[Profilers] C3[Lineage Trackers] C4[Quality Scanners] end subgraph "Processing Engine" P1[Quality Metrics<br/>Calculator] P2[Anomaly<br/>Detection] P3[Drift<br/>Detection] P4[Bias<br/>Analysis] end subgraph "Storage" ST1[Metrics Store<br/>Time Series DB] ST2[Metadata Store] ST3[Alert State] end subgraph "Presentation" PR1[DQ Dashboards] PR2[Alert Manager] PR3[Lineage Viewer] PR4[Quality Reports] end S1 & S2 & S3 & S4 --> C1 & C2 & C3 & C4 C1 & C2 & C3 & C4 --> P1 & P2 & P3 & P4 P1 & P2 & P3 & P4 --> ST1 & ST2 & ST3 ST1 & ST2 & ST3 --> PR1 & PR2 & PR3 & PR4 style P2 fill:#f96,stroke:#333,stroke-width:2px style P3 fill:#f96,stroke:#333,stroke-width:2px style PR2 fill:#bbf,stroke:#333,stroke-width:2px
Service Level Objectives (SLOs)
Defining Quality SLOs by Model Risk
graph TD
    A[Define Data Asset] --> B{Model Risk Level?}
    B -->|Critical - P0| C[Strict SLOs<br/>99.9% completeness<br/>99.5% validity<br/>< 1 min latency]
    B -->|Important - P1| D[Standard SLOs<br/>99% completeness<br/>95% validity<br/>< 5 min latency]
    B -->|Low Risk - P2| E[Relaxed SLOs<br/>95% completeness<br/>90% validity<br/>< 1 hour latency]
    C --> F[Continuous Monitoring]
    D --> F
    E --> F
    F --> G{SLO Breach?}
    G -->|Yes - P0| H[Page On-call<br/>Auto-rollback]
    G -->|Yes - P1| I[Alert Team<br/>Investigate]
    G -->|Yes - P2| J[Log Warning<br/>Schedule Review]
    G -->|No| K[Record Metrics]
    style C fill:#ff6b6b,stroke:#333,stroke-width:2px
    style H fill:#ff6b6b,stroke:#333,stroke-width:2px
    style K fill:#9f9,stroke:#333,stroke-width:2px
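One way to encode the tiers from this decision flow is a small lookup plus a breach check. The sketch below is illustrative only: thresholds are copied from the diagram, and the metric names are assumptions.

```python
# Hypothetical SLO tiers keyed by model risk level (values mirror the diagram above).
SLO_TIERS = {
    "P0": {"completeness": 0.999, "validity": 0.995, "max_latency_s": 60},
    "P1": {"completeness": 0.99,  "validity": 0.95,  "max_latency_s": 300},
    "P2": {"completeness": 0.95,  "validity": 0.90,  "max_latency_s": 3600},
}

def evaluate_slos(tier: str, observed: dict) -> list[str]:
    """Return the list of breached SLOs for a dataset, given observed metrics."""
    slo = SLO_TIERS[tier]
    breaches = []
    if observed["completeness"] < slo["completeness"]:
        breaches.append("completeness")
    if observed["validity"] < slo["validity"]:
        breaches.append("validity")
    if observed["p95_latency_s"] > slo["max_latency_s"]:
        breaches.append("timeliness")
    return breaches

# Example: a P0 dataset with slightly degraded completeness
print(evaluate_slos("P0", {"completeness": 0.997, "validity": 0.999, "p95_latency_s": 42}))
# -> ['completeness']
```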
SLO vs. Model Impact Matrix
| Data Quality SLO | Model Impact | Business Impact | Alert Threshold | Response Time | Example |
|---|---|---|---|---|---|
| Completeness > 99.9% | Critical | Revenue loss | < 99.5% | < 15 min (P0) | Missing customer IDs prevent personalization |
| Validity > 99% | High | Poor predictions | < 97% | < 1 hour (P1) | Invalid email addresses can't receive offers |
| Timeliness < 5 min (p95) | Critical | Stale decisions | > 10 min | < 30 min (P0) | Delayed inventory updates cause overselling |
| Distribution KS < 0.1 | Critical | Model degradation | KS > 0.15 | < 1 hour (P0) | Feature distribution shift breaks model |
| Bias: demographic parity < 10% | Critical | Legal exposure | > 15% | < 4 hours (P1) | Loan approvals vary by protected class |
Data Observability Stack
Monitor Types and Detection Methods
graph LR subgraph "Schema Monitoring" SM1[Schema Registry] SM2[Change Detection] SM3[Breaking Changes<br/>Alert] end subgraph "Volume Monitoring" VM1[Row Count Tracker] VM2[Z-Score Anomaly<br/>Detection] VM3[Volume Alerts] end subgraph "Distribution Monitoring" DM1[Statistical Profiler] DM2[KS Test / Chi-Square] DM3[Drift Alerts] end subgraph "Freshness Monitoring" FM1[Watermark Tracker] FM2[Latency Metrics] FM3[SLA Breach Alerts] end SM1 --> SM2 --> SM3 VM1 --> VM2 --> VM3 DM1 --> DM2 --> DM3 FM1 --> FM2 --> FM3 SM3 & VM3 & DM3 & FM3 --> Alert[Alert Router] Alert --> PD[PagerDuty] Alert --> Slack[Slack] Alert --> Dashboard[Dashboard] style DM2 fill:#f9f,stroke:#333,stroke-width:2px style Alert fill:#bbf,stroke:#333,stroke-width:2px
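Two of the simpler monitors above, volume and freshness, can be sketched in a few lines. The trailing-baseline approach and the default thresholds here are assumptions, not a prescribed implementation, and the watermark is assumed to be timezone-aware UTC.

```python
from datetime import datetime, timedelta, timezone
import numpy as np

def volume_anomaly(daily_row_counts: list[int], todays_count: int, z_threshold: float = 3.0) -> bool:
    """Z-score volume monitor: flag today's row count if it deviates from the trailing baseline."""
    baseline = np.asarray(daily_row_counts, dtype=float)
    z = (todays_count - baseline.mean()) / (baseline.std() + 1e-9)
    return abs(z) > z_threshold

def freshness_breach(latest_watermark: datetime, sla: timedelta = timedelta(minutes=5)) -> bool:
    """Watermark freshness monitor: breach when the newest record is older than the SLA."""
    return datetime.now(timezone.utc) - latest_watermark > sla
```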
Alert Severity Levels
| Severity | Response Time | Trigger Conditions | Actions | Examples |
|---|---|---|---|---|
| P0 - Critical | < 15 minutes | Complete data loss, critical field >50% null, breaking schema change | Page on-call, auto-rollback, incident call | Production model can't make predictions |
| P1 - High | < 1 hour | Significant drift (>30% features), SLO breach, 2x normal latency | Alert data team, root cause analysis, mitigation plan | Model accuracy dropped 10 points |
| P2 - Medium | < 4 hours | Moderate drift (10-30%), minor SLO miss, quality degradation | Create ticket, monitor trend, schedule review | Null rate increased from 1% to 3% |
| P3 - Low | < 24 hours | Small anomalies, informational | Log for analysis, review in standup | Non-critical field has unexpected values |
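A minimal severity-to-routing table mirroring this matrix might look like the following; the channel names and payload shape are hypothetical, and actual delivery to PagerDuty or Slack is deliberately left out.

```python
# Hypothetical routing table mirroring the severity matrix above.
SEVERITY_ROUTING = {
    "P0": {"channels": ["pagerduty", "slack#data-incidents"], "response_sla_min": 15},
    "P1": {"channels": ["slack#data-alerts"],                 "response_sla_min": 60},
    "P2": {"channels": ["slack#data-alerts"],                 "response_sla_min": 240},
    "P3": {"channels": ["ticket"],                            "response_sla_min": 1440},
}

def route_alert(severity: str, message: str) -> dict:
    """Build a routing payload; delivery via the PagerDuty/Slack APIs is out of scope here."""
    route = SEVERITY_ROUTING[severity]
    return {"severity": severity, "message": message, **route}
```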
Data Contracts
Contract-Driven Quality
graph TB subgraph "Contract Definition" CD1[Schema<br/>Required fields, types] CD2[Quality SLOs<br/>Completeness, validity] CD3[Freshness SLA<br/>Latency targets] CD4[Ownership<br/>Team, on-call] end subgraph "Validation Pipeline" V1[Ingestion] V2[Schema Check] V3[Quality Check] V4[Freshness Check] end subgraph "Action on Failure" A1{Critical<br/>Violation?} A2[Reject Data] A3[Quarantine] A4[Alert + Continue] end CD1 & CD2 & CD3 & CD4 --> Contract[Data Contract] Contract --> V1 V1 --> V2 --> V3 --> V4 V2 --> A1 V3 --> A1 V4 --> A1 A1 -->|Yes| A2 A1 -->|Medium| A3 A1 -->|No| A4 A2 --> Alert[Alert Owner] A3 --> Alert A4 --> Log[Log Warning] style Contract fill:#f96,stroke:#333,stroke-width:2px style A2 fill:#ff6b6b,stroke:#333,stroke-width:2px
Essential Contract Elements
Minimal Example - Customer Transactions Contract:
# data-contract: customer_transactions_v3
owner: data-platform@company.com
consumers: [fraud-detection, revenue-analytics]
schema:
  format: parquet
  partitioned_by: date
  required_columns:
    - {name: transaction_id, type: string, nullable: false, unique: true}
    - {name: customer_id, type: string, nullable: false}
    - {name: amount, type: decimal(10,2), nullable: false, min: 0, max: 1000000}
    - {name: timestamp, type: timestamp, nullable: false}
slos:
  freshness: "p95 < 5 minutes"
  completeness: "> 99.5%"
  accuracy: "> 99.9% validation pass"
  availability: "99.9% uptime"
quality_checks:
  on_write:
    - {rule: "transaction_id IS NOT NULL", action: reject}
    - {rule: "amount >= 0", action: reject}
    - {rule: "timestamp within 24h", action: quarantine}
breaking_changes:
  policy: "90-day deprecation notice for column removal"
  notification: "#data-announcements"
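A sketch of how on_write enforcement might look for this contract is shown below. It hard-codes the three rules as equivalent pandas expressions rather than parsing the SQL-like strings, and assumes the `timestamp` column is timezone-aware UTC; it is an illustration of contract-driven reject/quarantine handling, not a production validator.

```python
import pandas as pd
import yaml  # PyYAML

def enforce_contract(df: pd.DataFrame, contract_path: str) -> dict:
    """Split a batch into accepted / quarantined / rejected rows per the contract's on_write rules."""
    with open(contract_path) as fh:
        contract = yaml.safe_load(fh)        # loaded for ownership / alerting metadata
    now = pd.Timestamp.now(tz="UTC")         # assumes df["timestamp"] is tz-aware UTC

    reject = df["transaction_id"].isna() | (df["amount"] < 0)                 # action: reject
    quarantine = ((now - df["timestamp"]) > pd.Timedelta("24h")) & ~reject    # action: quarantine

    return {
        "owner": contract["owner"],
        "accepted": df[~reject & ~quarantine],
        "quarantined": df[quarantine],
        "rejected": df[reject],
    }
```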
Data Quality Tool Comparison
| Tool | Deployment | Best For | Strengths | Limitations | Cost Range | Decision Criteria |
|---|---|---|---|---|---|---|
| Great Expectations | Self-hosted | Python pipelines | Code-first, extensive checks, free | Steep learning curve | Free (OSS) | Choose for Python-heavy stack, engineering control |
| Monte Carlo | SaaS | Enterprise platforms | Auto-discovery, ML anomaly detection | Black box, expensive | $50K-200K/year | Choose for automated DQ at scale, budget available |
| Soda | Cloud/Self-hosted | SQL-based checks | Simple YAML config, multi-platform | Limited advanced features | $10K-100K/year | Choose for SQL-first teams, quick wins |
| DBT Tests | Self-hosted | Analytics engineering | Integrated with DBT, version controlled | Limited to DBT models | Free (OSS) | Choose if already using DBT |
| Datadog | SaaS | Full-stack monitoring | Unified observability, ML anomalies | Expensive at scale | $15-50/host/month | Choose for infrastructure + data monitoring |
| Anomalo | SaaS | Automated monitoring | Low config, automated baselines | Limited customization | $25K-150K/year | Choose for hands-off DQ monitoring |
Drift Detection Framework
Distribution Drift Monitoring
graph TB subgraph "Baseline Establishment" B1[Training Data<br/>Statistics] B2[Reference Window<br/>Last 30 days] B3[Baseline Metrics<br/>Mean, Std, Quantiles] end subgraph "Current Data Profiling" C1[Production Data<br/>Last 24 hours] C2[Calculate Stats] C3[Feature Distributions] end subgraph "Drift Detection" D1{Numerical<br/>Feature?} D2[KS Test<br/>Threshold: 0.1] D3[Chi-Square Test<br/>Categorical] D4{Drift<br/>Detected?} end subgraph "Action" A1[Alert Data Science] A2[Investigate Root Cause] A3[Retrain Model?] A4[Fix Data Pipeline?] end B1 & B2 --> B3 C1 --> C2 --> C3 B3 & C3 --> D1 D1 -->|Yes| D2 D1 -->|No| D3 D2 --> D4 D3 --> D4 D4 -->|Yes| A1 --> A2 A2 --> A3 A2 --> A4 style D4 fill:#f96,stroke:#333,stroke-width:2px style A1 fill:#ff6b6b,stroke:#333,stroke-width:2px
Drift Detection Methods Comparison
| Method | Data Type | Sensitivity | Interpretation | Best For | Threshold Guidance |
|---|---|---|---|---|---|
| KS Test | Numerical | High | Statistic 0-1, p-value | General numerical drift | KS > 0.1 or p < 0.05 |
| Chi-Square | Categorical | Medium | p-value based | Categorical distributions | p < 0.05 |
| PSI (Population Stability Index) | Both | Medium | 0-∞, >0.25 is high | Credit/finance models | < 0.1 stable, > 0.25 significant |
| Wasserstein Distance | Numerical | High | Distance metric | Distribution shape changes | Domain-specific |
| Jensen-Shannon Divergence | Both | Medium | 0-1 similarity | Probability distributions | > 0.15 significant |
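For reference, here is a minimal sketch of the two most common checks from this table: a two-sample KS test (via scipy) and a quantile-binned PSI. The bin count and the epsilon used to avoid log(0) are arbitrary choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.1) -> bool:
    """Two-sample KS test for a numerical feature, using the thresholds from the table above."""
    result = ks_2samp(reference, current)
    return result.statistic > threshold or result.pvalue < 0.05

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the reference distribution."""
    cut_points = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]  # interior edges
    ref_pct = np.bincount(np.digitize(reference, cut_points), minlength=bins) / len(reference)
    cur_pct = np.bincount(np.digitize(current, cut_points), minlength=bins) / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)   # avoid log(0) in sparse bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```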
Real-World Case Study: E-commerce Data Quality
Challenge
An e-commerce platform's recommendation click-through rate (CTR) suddenly dropped from 15% to 8%. Revenue impact: $2M/week.
Investigation Timeline
Day 1: Model Debugging (Wrong Direction)
- Checked model version, code, infrastructure → All nominal
- Lost 24 hours investigating the wrong layer
Day 2: Data Quality Analysis
- Discovered product description completeness dropped from 98% to 75%
- 40% increase in new products without descriptions
- Root cause: Upstream product catalog API changed response format
Day 3: Emergency Mitigation
- Filtered incomplete products from recommendations
- Redeployed model with clean data
- CTR recovered to 13.5%
Implementation: Preventive Measures
gantt
    title Data Quality Framework Implementation
    dateFormat YYYY-MM-DD
    section Week 1
    Data Contracts        :a1, 2025-10-03, 7d
    Schema Validation     :a2, 2025-10-03, 7d
    section Week 2
    Quality Metrics       :b1, 2025-10-10, 7d
    Drift Detection       :b2, 2025-10-10, 7d
    section Week 3
    Alerting Setup        :c1, 2025-10-17, 7d
    Runbook Creation      :c2, 2025-10-17, 7d
    section Week 4
    Dashboard Deployment  :d1, 2025-10-24, 7d
    Team Training         :d2, 2025-10-24, 7d
Results After Implementation
| Metric | Before | After | Improvement |
|---|---|---|---|
| Mean Time to Detection (MTTD) | 3 days | 30 minutes | 99% faster |
| Mean Time to Resolution (MTTR) | 3 days | 4 hours | 95% faster |
| Data quality incidents | 40/month | 4/month | 90% reduction |
| Model deployment failures | 30% | 5% | 83% reduction |
| Infrastructure costs (monitoring) | $0 | $5K/month | New investment |
| ROI | N/A | 15x | Prevented $2M loss monthly |
Key Success Factors
- Proactive Monitoring: Automated checks at every pipeline stage
- Clear Ownership: Data contracts with on-call rotations
- Fast Feedback: Real-time alerts to Slack and PagerDuty
- Actionable Runbooks: Step-by-step debugging guides
- Regular Reviews: Weekly DQ metrics review meetings
Quality Scorecard Framework
Executive Scorecard
| Domain | Overall Score | Completeness | Validity | Timeliness | Consistency | Trend | Owner |
|---|---|---|---|---|---|---|---|
| Sales | 98.3% ✓ | 99.8% | 99.5% | 98.2% | 99.1% | ↑ | Jane Smith |
| Marketing | 91.7% ⚠ | 95.2% | 97.8% | 85.0% | 92.5% | ↓ | John Doe |
| Product | 98.9% ✓ | 99.9% | 99.2% | 99.5% | 98.8% | → | Alice Johnson |
| Finance | 85.2% ✗ | 88.5% | 92.0% | 78.0% | 85.5% | ↓ | Bob Wilson |
Color Legend: ✓ Green (>95%), ⚠ Yellow (90-95%), ✗ Red (<90%)
Dataset-Level Metrics
Example: customer_transactions
Quality Dimensions:
├── Completeness: 99.8% (Target: 99.5%) ✓
├── Validity: 99.5% (Target: 99.0%) ✓
├── Timeliness: 98.2% (Target: 95.0%) ✓
├── Consistency: 99.1% (Target: 98.0%) ✓
└── Distribution Stability: 95.0% (Target: 90.0%) ✓
Overall Score: 98.3%
Trend: ↑ (up 2.1% from last month)
SLO Compliance: 100% (5/5 SLOs met)
Top Issues:
1. Merchant category null rate: 2.5% (up from 1.2%)
2. Timestamp occasionally delayed > 10 min
3. Minor distribution shift in transaction amounts (KS=0.08)
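The overall score above can be reproduced as a weighted roll-up of the dimension scores. The equal weights below are an assumption for illustration, not a standard; teams typically weight dimensions by business criticality.

```python
# Illustrative roll-up of the dimension scores shown above into the 98.3% overall score.
DIMENSION_SCORES = {
    "completeness": 0.998,
    "validity": 0.995,
    "timeliness": 0.982,
    "consistency": 0.991,
    "distribution_stability": 0.950,
}
WEIGHTS = {dim: 0.2 for dim in DIMENSION_SCORES}  # equal weighting (assumption)

overall = sum(score * WEIGHTS[dim] for dim, score in DIMENSION_SCORES.items())
print(f"Overall score: {overall:.1%}")  # -> Overall score: 98.3%
```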
Implementation Checklist
Setup Phase (Week 1-2)
□ Select data quality platform/tooling
□ Define quality dimensions and metrics for critical datasets
□ Establish SLOs tied to business impact
□ Create data contracts for key pipelines
□ Set up time-series metrics storage
□ Configure alert routing (Slack, PagerDuty, email)
Monitoring Phase (Week 3-4)
□ Implement schema change detection
□ Deploy volume anomaly monitors (Z-score based)
□ Configure drift detection for ML features
□ Set up null and outlier monitoring
□ Create executive quality dashboards
□ Establish baseline metrics (30-day window)
Operational Phase (Week 5-6)
□ Create runbooks for common DQ issues
□ Train teams on alert response
□ Establish incident review cadence (weekly)
□ Define remediation SLAs by severity
□ Set up on-call rotation for P0 alerts
□ Document escalation paths
Continuous Improvement (Ongoing)
□ Review and update SLOs quarterly
□ Analyze alert fatigue, tune thresholds
□ Expand monitor coverage to more datasets
□ Automate remediation where possible
□ Conduct post-incident reviews for all P0/P1
□ Share learnings in data guild meetings
Best Practices
- Start with Critical Paths: Monitor data used by production models first, not everything
- Set Realistic SLOs: Base on business impact and current baseline, not perfection
- Automate Validation: Integrate quality checks into CI/CD pipelines
- Make Quality Visible: Dashboards accessible to all stakeholders, not just data team
- Own Your Data: Assign clear ownership and accountability with on-call
- Test in Stages: Validate at ingestion, transformation, and serving layers
- Learn from Incidents: Every quality issue is an opportunity to improve monitoring
- Balance Precision: Too many alerts cause fatigue, too few miss critical issues
Common Pitfalls
- Alert Fatigue: Too many low-priority alerts train teams to ignore all notifications
- Perfection Paralysis: Waiting for perfect quality prevents shipping AI products
- Monitoring Overhead: Over-engineered DQ costs more than the data issues it prevents
- Siloed Quality: Treating DQ as separate from engineering workflows instead of integrated
- Reactive Only: Only fixing issues without preventing recurrence through better contracts
- Missing Business Context: Tracking technical metrics without understanding impact
- Stale Baselines: Using outdated reference data for drift detection
- No Ownership: Quality issues with no clear owner never get fixed
- Inadequate Testing: Not testing quality checks before deploying to production
- Ignoring Trends: Only alerting on thresholds, missing gradual degradation
Decision Framework: When to Invest in Data Quality Platform
graph TD
    A[Assess DQ Needs] --> B{How many<br/>ML models in<br/>production?}
    B -->|< 5| C{Critical<br/>models?}
    B -->|5-20| D{Frequent<br/>DQ incidents?}
    B -->|> 20| E[Definitely Invest<br/>in DQ Platform]
    C -->|Yes| F[Invest in Basic<br/>Monitoring]
    C -->|No| G[Manual Checks OK<br/>for now]
    D -->|> 5/month| E
    D -->|< 5/month| F
    F --> H[Tools: Great Expectations<br/>+ DBT Tests]
    E --> I[Tools: Monte Carlo, Soda,<br/>or Datadog]
    G --> J[Tools: Spreadsheet<br/>tracking]
    style E fill:#d4edda,stroke:#333,stroke-width:2px
    style F fill:#fff3cd,stroke:#333,stroke-width:2px
    style G fill:#f8d7da,stroke:#333,stroke-width:2px
Investment Decision Criteria:
- Definitely invest if: >10 production models, regulated industry, frequent DQ incidents (>5/month), distributed data teams
- Start basic if: 5-10 models, some DQ pain points, growing team, planning AI scale-up
- Manual OK for now if: <5 models, simple use cases, small team, low DQ incident rate
Expected ROI:
- Typical implementation: 4-8 weeks, 1-2 FTEs
- Cost: $50K-150K (implementation)
- Benefits: 70-90% reduction in MTTD, 60-80% reduction in DQ incidents, 30-50% improvement in data team productivity
- Break-even: Usually within 6-12 months for teams with >10 models
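A back-of-the-envelope calculation makes the break-even claim concrete. Every input below is a hypothetical placeholder chosen to land inside the ranges above; replace them with your own figures.

```python
# Back-of-the-envelope break-even with hypothetical inputs; replace with your own figures.
implementation_cost = 100_000        # one-off, midpoint of the $50K-150K range above
platform_cost_monthly = 6_000        # licence + infrastructure (assumption)
incidents_per_month_before = 5
cost_per_incident = 5_000            # engineering time + business impact (assumption)
incident_reduction = 0.7             # within the 60-80% incident-reduction range above

monthly_savings = incidents_per_month_before * incident_reduction * cost_per_incident
net_monthly_benefit = monthly_savings - platform_cost_monthly
break_even_months = implementation_cost / net_monthly_benefit
print(f"Break-even after ~{break_even_months:.0f} months")  # ~9 months with these inputs
```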