Part 10: MLOps & Platform Engineering

Chapter 57 — Feature Stores & Model Registries

Overview

Standardize features and model lifecycle; ensure discoverability and governance. Feature stores solve the training-serving skew problem while enabling feature reuse across teams. Model registries provide a single source of truth for model versions, lineage, and lifecycle management, essential for reproducibility and compliance.

Core Concepts

  • Feature contracts; online/offline parity; backfills.
  • Model versioning, lineage, and stage transitions.
  • Discoverability, reuse, and collaboration.
  • Point-in-time correctness and temporal joins.

Deliverables

  • Feature contracts and registry policies.
  • Feature catalog with documentation.
  • Model registry with full lineage tracking.
  • Promotion workflows and approval gates.

Why It Matters

Consistent features and governed model lifecycles reduce duplication, speed iteration, and simplify compliance. Without these foundational components:

  • Feature Duplication: Teams independently compute the same features, wasting resources
  • Training-Serving Skew: Models trained on one feature computation perform poorly in production where features are computed differently
  • Model Chaos: No single answer to "which model is in production?" or "what data trained this model?"
  • Compliance Gaps: Cannot prove what model version made a decision or trace back to training data
  • Slow Iteration: Teams spend weeks debugging feature differences instead of experimenting

Organizations with mature feature stores and model registries report 40-60% reduction in time-to-production for new models and 80% fewer production incidents due to training-serving skew.

Feature Store & Model Registry Architecture

graph TB
  subgraph "Feature Store"
    A[Batch Sources<br/>Data Warehouse, S3] --> B[Feature Pipeline]
    C[Stream Sources<br/>Kafka, Kinesis] --> B
    B --> D[Offline Store<br/>Historical Features]
    B --> E[Online Store<br/>Low-latency]
    D --> F[Training Jobs]
    E --> G[Inference Services]
  end
  subgraph "Model Registry"
    F --> H[Model Artifacts]
    H --> I[Registry Metadata]
    I --> J[Staging]
    J --> K{Approval Gates}
    K -->|Pass| L[Production]
    K -->|Fail| M[Archive]
    L --> G
  end
  G --> N[Predictions]
  N -.->|Feedback| C

Feature Store Deep Dive

Feature Lifecycle

flowchart LR
  A[Define Feature] --> B[Implement Pipeline]
  B --> C[Materialize Offline]
  C --> D[Materialize Online]
  D --> E[Training Uses Offline]
  E --> F[Inference Uses Online]
  F --> G[Monitor Drift]
  G --> H{Drift Detected?}
  H -->|Yes| I[Backfill & Retrain]
  H -->|No| F
  I --> C

Training-Serving Skew Problem & Solution

The Problem:

graph TB
  subgraph "Training (Python/Pandas)"
    A[Raw Data] --> B[Feature Logic v1<br/>Pandas groupby]
    B --> C[Training Features]
  end
  subgraph "Production (SQL)"
    D[Raw Data] --> E[Feature Logic v2<br/>SQL window function]
    E --> F[Serving Features]
  end
  C -.->|Different Logic!| G[Model Trained Here]
  F -.->|Different Values!| H[Model Serves Here]
  G -.->|Performance Mismatch| H
  style G fill:#f99
  style H fill:#f99

The Solution - Feature Store:

graph TB
  A[Feature Definition<br/>Single Source of Truth] --> B[Offline Materialization]
  A --> C[Online Materialization]
  B --> D[Training]
  C --> E[Serving]
  D --> F[Same Features!]
  E --> F
  style F fill:#9f9
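
What this looks like in practice, as a minimal sketch assuming a Feast repository with a registered user_transaction_features view (feature and entity names are illustrative, not a prescribed setup):

# Minimal sketch, assuming Feast with a registered "user_transaction_features" view.
# Training and serving both read the same definition, so the logic cannot diverge.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at the feature repo (feature_store.yaml)

features = [
    "user_transaction_features:transaction_count_7d",
    "user_transaction_features:avg_amount_30d",
]

# Training path: point-in-time correct historical features from the offline store.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-15 12:00", "2024-01-15 12:00"]),
})
training_df = store.get_historical_features(entity_df=entity_df, features=features).to_df()

# Serving path: the same features, read from the low-latency online store.
online_features = store.get_online_features(
    features=features,
    entity_rows=[{"user_id": 1001}],
).to_dict()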

Feature Contract Example

| Component | Specification | Example |
|---|---|---|
| Feature Group | user_transaction_features | Version 2.1.0 |
| Owner | payments-ml-team | On-call rotation |
| Description | Aggregated transaction features for fraud detection | 7d/30d windows |
| Entities | user_id | Primary key |
| Features | transaction_count_7d, avg_amount_30d, merchant_diversity | 3 features |
| Data Types | int64, float64, float64 | Validated |
| Validation | Non-null, range checks | transaction_count >= 0 |
| Sources | BigQuery (batch), Kafka (stream) | Dual ingestion |
| SLA | Freshness: 15min, Availability: 99.9%, Latency: <10ms | Monitored |
| Upstream | raw.transactions, raw.merchant_info | Lineage tracked |
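
A contract like this can also be captured in code. The sketch below expresses it as a Feast feature view, assuming a recent Feast release; the source path and tag keys are illustrative, and the validation rules would live in a separate data-quality check rather than in the view itself:

# Illustrative sketch of the contract above as a Feast feature view.
# The source path, entity join key, and tag conventions are assumptions.
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

user = Entity(name="user_id", join_keys=["user_id"])

transactions_source = FileSource(
    path="s3://warehouse/user_transaction_features.parquet",  # assumed location
    timestamp_field="event_timestamp",
)

user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    ttl=timedelta(days=30),
    schema=[
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="avg_amount_30d", dtype=Float64),
        Field(name="merchant_diversity", dtype=Float64),
    ],
    online=True,
    source=transactions_source,
    tags={"owner": "payments-ml-team", "version": "2.1.0", "freshness_sla": "15min"},
)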

Point-in-Time Correctness

Without Point-in-Time Join (Data Leakage):

sequenceDiagram
  participant Training as Training Job
  participant Features as Feature Store
  participant Future as Future Data
  Training->>Features: Get features for 2024-01-15
  Features->>Future: Accidentally uses data from 2024-01-20
  Future-->>Features: Latest available features
  Features-->>Training: Features with future leak
  Training->>Training: Model learns from future!
  Note over Training: Artificially high accuracy in training<br/>Poor performance in production

With Point-in-Time Join (Correct):

sequenceDiagram
  participant Training as Training Job
  participant Features as Feature Store (PIT Join)
  participant Historical as Historical Data
  Training->>Features: Get features for 2024-01-15 12:00
  Features->>Historical: Query features <= 2024-01-15 12:00
  Historical-->>Features: Features as of that timestamp
  Features-->>Training: Historically accurate features
  Training->>Training: Model learns from valid data
  Note over Training: Training mimics production reality
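
The same guarantee can be sketched directly with pandas: for each label row, take the most recent feature row at or before the label timestamp and never one after it. This is an illustrative stand-in for what a feature store's point-in-time join does internally, not a specific product API:

# Point-in-time join sketch with pandas.merge_asof: each label row gets the
# latest feature value *at or before* its timestamp, preventing future leakage.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1001, 1001],
    "label_ts": pd.to_datetime(["2024-01-15 12:00", "2024-01-20 09:00"]),
    "is_fraud": [0, 1],
})

features = pd.DataFrame({
    "user_id": [1001, 1001, 1001],
    "feature_ts": pd.to_datetime(["2024-01-10", "2024-01-14", "2024-01-19"]),
    "transaction_count_7d": [3, 5, 12],
})

pit = pd.merge_asof(
    labels.sort_values("label_ts"),
    features.sort_values("feature_ts"),
    left_on="label_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",   # only features observed on or before the label time
)
# The 2024-01-15 label sees count=5 (from 01-14), never the 12 observed on 01-19.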

Feature Store Platform Comparison

| Platform | Best For | Strengths | Limitations | Pricing |
|---|---|---|---|---|
| Feast (OSS) | Self-hosted, flexibility | Free, extensible, cloud-agnostic | DIY infrastructure | Free (infra costs) |
| Tecton | Enterprise, managed | Fully managed, excellent UI, real-time | Expensive, vendor lock-in | Usage-based, $$$$ |
| AWS SageMaker Feature Store | AWS ecosystem | Native AWS integration, managed | AWS lock-in, limited features | Storage + requests |
| GCP Vertex AI Feature Store | GCP ecosystem | Managed, scalable, BigQuery integration | GCP lock-in | Storage + requests |
| Azure ML Feature Store | Azure ecosystem | Azure integration, Unity Catalog | Azure lock-in, newer | Storage + requests |
| Databricks Feature Store | Spark/Databricks users | Delta Lake, Unity Catalog, lineage | DBU costs, Databricks required | Included with DBU |
| Hopsworks | ML platform integration | Complete ML platform, Kubernetes | Complex, steep learning curve | OSS + enterprise |

Model Registry Deep Dive

Model Lifecycle Management

stateDiagram-v2
  [*] --> Development
  Development --> Staging: ML Lead Approval
  Staging --> StagingValidation: Deploy to Staging
  StagingValidation --> Production: Security + Product Approval
  StagingValidation --> Staging: Validation Failed
  Production --> Canary: Deploy Canary
  Canary --> FullProduction: Metrics Pass
  Canary --> Production: Metrics Fail (Rollback)
  FullProduction --> Production: Monitor
  Production --> Archived: Superseded
  Staging --> Archived: Abandoned
  Archived --> [*]

Complete Model Lineage

graph TB
  subgraph "Data Lineage"
    A[Raw Data<br/>transactions_2025_q3] --> B[Cleaned Data<br/>v1.2.3]
    B --> C[Feature Engineering<br/>v2.1.0]
    C --> D[Training Dataset<br/>sha256:abc123]
  end
  subgraph "Model Lineage"
    D --> E[Training Run<br/>run_id:456]
    E --> F[Model Artifact<br/>fraud_detector:v2.3.1]
    F --> G[Model Registry<br/>Stage: Staging]
    G --> H{Promotion Gates}
    H --> I[Production<br/>fraud_detector:v2.3.1]
  end
  subgraph "Feature Lineage"
    J[Feature Store<br/>user_features:v3.0] --> C
    K[Feature Store<br/>transaction_features:v2.5] --> C
  end
  subgraph "Deployment Lineage"
    I --> L[Canary Deployment<br/>5% traffic]
    L --> M[Full Deployment<br/>100% traffic]
    M --> N[Predictions]
  end
  style D fill:#e1f5ff
  style F fill:#fff4e1
  style I fill:#e7f5e1

Model Metadata Schema

Essential Metadata Components:

| Category | Fields | Purpose |
|---|---|---|
| Identity | model_id, version, created_at, created_by | Unique identification |
| Training | framework, data_version, data_hash, hyperparameters, seed | Reproducibility |
| Evaluation | test_metrics, fairness_metrics, cost_metrics, eval_date | Quality assurance |
| Governance | risk_category, approvals, compliance_flags, known_limitations | Risk management |
| Deployment | current_stage, endpoints, traffic_percentage, rollback_target | Operations |
| Artifacts | model_uri, container_image, model_card, eval_report | Assets |
| Lineage | parent_version, data_sources, feature_versions, code_commit | Traceability |

Minimal Metadata Example:

{
  "model_id": "fraud_detector:v2.3.1",
  "training": {
    "dataset_version": "1.2.3",
    "dataset_hash": "sha256:abc123...",
    "features": {"user_features": "v3.0", "txn_features": "v2.5"}
  },
  "evaluation": {
    "f1": 0.89, "precision": 0.91, "recall": 0.87,
    "fairness_demographic_parity": 0.05
  },
  "governance": {
    "risk": "high",
    "approvals": ["ml_lead", "security"],
    "limitations": ["Lower recall on new merchant types"]
  },
  "deployment": {"stage": "production", "traffic": 100}
}
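
A sketch of attaching this metadata at registration time, using standard MLflow client calls; the model URI echoes run_id:456 from the lineage diagram, and the model name and tag keys are illustrative assumptions:

# Sketch: register a trained model and attach governance metadata as tags.
# The run URI, model name, and tag keys are illustrative assumptions.
import json
import mlflow
from mlflow.tracking import MlflowClient

model_uri = "runs:/456/model"                 # artifact logged by training run run_id:456
mv = mlflow.register_model(model_uri, "fraud_detector")

metadata = {
    "dataset_version": "1.2.3",
    "dataset_hash": "sha256:abc123",
    "features": {"user_features": "v3.0", "txn_features": "v2.5"},
    "risk": "high",
}

client = MlflowClient()
for key, value in metadata.items():
    client.set_model_version_tag(
        name="fraud_detector",
        version=mv.version,
        key=key,
        value=json.dumps(value) if isinstance(value, dict) else str(value),
    )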

Promotion Workflow

Approval Requirements by Risk Level:

| Risk Level | Required Approvals | Evaluation Gates | Deployment Strategy | Typical Timeline |
|---|---|---|---|---|
| Low | ML Lead | Basic metrics | Direct to prod | 1 day |
| Medium | ML Lead + Security | Metrics + safety | Canary 5% → 100% | 2-3 days |
| High | ML Lead + Security + Compliance | All evals + fairness | Shadow → Canary → Full | 5-7 days |
| Critical | ML Lead + Security + Compliance + Product + Legal | Comprehensive + external audit | Shadow → Canary 5% → 25% → 50% → 100% (24h stages) | 10-14 days |
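
A promotion gate along these lines can be scripted against the registry. The sketch below assumes evaluation metrics were logged to the training run and approvals are recorded as model-version tags; thresholds, tag names, and the alias convention are illustrative:

# Sketch of a risk-based promotion gate. Thresholds, tag names, and the
# approval convention are assumptions; only standard MLflow client calls are used.
from mlflow.tracking import MlflowClient

# Required approvers per risk level, mirroring the table above.
REQUIRED_APPROVALS = {
    "low": {"ml_lead"},
    "medium": {"ml_lead", "security"},
    "high": {"ml_lead", "security", "compliance"},
    "critical": {"ml_lead", "security", "compliance", "product", "legal"},
}

def promote_if_ready(name: str, version: str, risk: str, min_f1: float = 0.85) -> bool:
    client = MlflowClient()
    mv = client.get_model_version(name=name, version=version)

    # Gate 1: evaluation metrics logged on the training run behind this version.
    run = client.get_run(mv.run_id)
    if run.data.metrics.get("f1", 0.0) < min_f1:
        return False

    # Gate 2: approvals recorded as model-version tags, e.g. approved_by_security="true".
    needed = REQUIRED_APPROVALS[risk]
    granted = {role for role in needed if mv.tags.get(f"approved_by_{role}") == "true"}
    if granted != needed:
        return False

    # Promote by repointing the "production" alias (MLflow >= 2.3; older setups
    # would call transition_model_version_stage instead).
    client.set_registered_model_alias(name, "production", version)
    return True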

Semantic Versioning for Models

graph LR
  A[v2.3.1] --> B{Change Type?}
  B -->|Breaking Change<br/>New architecture| C[v3.0.0]
  B -->|New Feature<br/>Additional output| D[v2.4.0]
  B -->|Bug Fix<br/>Same behavior| E[v2.3.2]
  C --> F[Major Version<br/>Incompatible]
  D --> G[Minor Version<br/>Backward Compatible]
  E --> H[Patch Version<br/>Bug Fixes]
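
The rule in the diagram reduces to a small helper (sketch; the change-type labels are illustrative):

# Sketch: bump a model version according to the scheme above.
def bump_version(version: str, change: str) -> str:
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "breaking":      # new architecture, incompatible outputs
        return f"{major + 1}.0.0"
    if change == "feature":       # additional, backward-compatible output
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # bug fix, same behavior

assert bump_version("2.3.1", "breaking") == "3.0.0"
assert bump_version("2.3.1", "feature") == "2.4.0"
assert bump_version("2.3.1", "fix") == "2.3.2"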

Case Study: Banking Feature & Model Standardization

Background: Large bank with 15+ data science teams building credit risk, fraud, and marketing models independently. Each team computed similar features with different logic.

Problems Before Implementation

| Problem | Impact | Frequency | Cost |
|---|---|---|---|
| Duplicate Feature Pipelines | 8 different "transaction velocity" implementations | Ongoing | $200K/year wasted |
| Training-Serving Skew | 25% of models underperformed in production | Every 4th model | Lost business value |
| No Lineage | 3 weeks to trace model decisions for audits | Every audit | Compliance risk |
| Slow Iteration | New models took 6-8 weeks to production | Every model | Opportunity cost |
| Compliance Gaps | Could not prove which version made decisions | Ongoing | Regulatory risk |

Solution Architecture

graph TB
  subgraph "Centralized Feature Store"
    A[150 Standard Features] --> B[Offline: Delta Lake on S3]
    A --> C[Online: Redis <10ms]
    B --> D[Training Jobs]
    C --> E[Inference Services]
  end
  subgraph "Model Registry"
    D --> F[MLflow Registry]
    F --> G[Dev → Staging → Prod]
    G --> H[Approval Workflow]
    H --> I[Deployment]
  end
  J[15 Teams] --> A
  K[40+ Models] --> F
  style A fill:#9f9
  style F fill:#9f9

Implementation Phases

| Month | Milestone | Features | Teams | Models | Key Win |
|---|---|---|---|---|---|
| 1 | MVP | 20 critical features | 3 pilot teams | 5 models | Proved concept |
| 2 | Expansion | 50 features | 10 teams | 15 models | Hit tipping point |
| 3 | Growth | 100 features | All 15 teams | 30 models | Full adoption |
| 6 | Maturity | 150 features | All teams | 40+ models | Standardized |

Results After 12 Months

| Metric | Before | After | Improvement |
|---|---|---|---|
| Time to Production | 6-8 weeks | 2-3 weeks | 60% reduction |
| Feature Reuse | 10% | 75% | 7.5x increase |
| Training-Serving Skew Incidents | 12/year | 1/year | 92% reduction |
| Duplicate Feature Pipelines | 8 versions | 1 canonical | Eliminated |
| Compliance Audit Prep | 3 weeks | 2 days | 90% reduction |
| Cost Savings | Baseline | $400K/year | Infrastructure consolidation |
| Developer Satisfaction | 6.2/10 | 8.7/10 | 40% improvement |

Key Success Factors

mindmap
  root((Success Factors))
    Executive Sponsorship
      CTO mandate
      Cross-team priority
    Gradual Rollout
      20 features first
      Proved value
      Then scaled
    Documentation
      Feature catalog
      Examples
      Best practices
    Developer Experience
      Easier than DIY
      Great tooling
      Fast support
    Compliance Focus
      Risk reduction
      Audit trail
      Regulatory win

Implementation Checklist

Feature Store Implementation

Phase 1: Foundation (Weeks 1-3)

  • Choose platform (Feast, Tecton, cloud-native)
  • Set up offline store (data warehouse, Delta Lake)
  • Set up online store (Redis, DynamoDB)
  • Define feature schema and contracts
  • Build pipeline for 5-10 critical features

Phase 2: Core Features (Weeks 4-6)

  • Identify 20-30 most-used features across teams
  • Implement batch feature computation
  • Implement streaming features (if needed)
  • Set up materialization (offline → online)
  • Add monitoring for freshness/availability (see the freshness-check sketch after this list)
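
As a sketch of the freshness check referenced above: compare the newest event timestamp for a feature view against its SLA. The latest-timestamp lookup is a placeholder here; in practice it would be a MAX(event_timestamp) query against the offline table or online store metadata:

# Freshness-check sketch. latest_event_ts would come from a real query;
# the 15-minute SLA mirrors the user_transaction_features contract above.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=15)

def is_fresh(latest_event_ts: datetime, sla: timedelta = FRESHNESS_SLA) -> bool:
    """Return True if the newest feature row is within the freshness SLA."""
    return datetime.now(timezone.utc) - latest_event_ts <= sla

# Example: stand-in timestamp 42 minutes old -> stale, so raise an alert.
latest_ts = datetime.now(timezone.utc) - timedelta(minutes=42)
if not is_fresh(latest_ts):
    print("user_transaction_features is stale; paging on-call")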

Phase 3: Integration (Weeks 7-9)

  • Integrate with training pipelines
  • Integrate with serving infrastructure
  • Build feature discovery catalog
  • Document all features with examples
  • Onboard first 2-3 teams

Phase 4: Expansion (Months 3-6)

  • Add point-in-time correctness
  • Implement backfill capabilities
  • Add feature validation and drift detection
  • Expand to 100+ features
  • Onboard all ML teams

Model Registry Implementation

Phase 1: Setup (Week 1)

  • Choose platform (MLflow, SageMaker, Vertex AI)
  • Set up central tracking server
  • Define model metadata schema
  • Set up artifact storage (S3, GCS, Azure Blob)

Phase 2: Workflows (Weeks 2-3)

  • Define lifecycle stages and promotion criteria
  • Implement approval workflows
  • Set up automated registration from CI/CD
  • Create model card templates

Phase 3: Governance (Weeks 4-6)

  • Add lineage tracking
  • Implement risk-based approval matrix
  • Set up compliance evidence collection
  • Create rollback procedures

Phase 4: Optimization (Ongoing)

  • Add model comparison dashboards
  • Implement automated validation
  • Set up performance monitoring
  • Regular cleanup of archived models

Best Practices

Feature Store

| Practice | Why | How |
|---|---|---|
| Design for Reusability | Avoid duplication | Generic features, not model-specific |
| Document Everything | Enable discovery | Name, description, owner, SLA, examples |
| Monitor Feature Health | Catch issues early | Freshness, availability, drift, usage |
| Version Carefully | Manage breaking changes | Semantic versioning (v2.1.0) |
| Point-in-Time Joins | Prevent data leakage | Use historical feature values |

Model Registry

| Practice | Why | How |
|---|---|---|
| Automate Registration | Reduce errors | Auto-register from training pipeline |
| Enforce Stage Gates | Maintain quality | Validation at each promotion |
| Immutable Lineage | Enable compliance | Never modify, always append |
| Risk-Based Approval | Right oversight level | Match rigor to impact |
| Complete Metadata | Enable debugging | Training, eval, governance, deployment |

Success Metrics

Feature Store Metrics

| Metric | Target | Indicates |
|---|---|---|
| Feature Reuse Rate | >60% | Effective sharing |
| Training-Serving Skew Incidents | <2/year | Consistency achieved |
| Feature Freshness SLA | >99% | Reliable data |
| Time to Add New Feature | <1 week | Operational efficiency |

Model Registry Metrics

| Metric | Target | Indicates |
|---|---|---|
| Models Registered | 100% of production | Complete governance |
| Mean Time to Promote | <3 days | Efficient process |
| Audit Prep Time | <1 day | Automated compliance |
| Rollback Time | <10 min | Operational maturity |