Chapter 16 — Platforms & Architecture
Overview
Define reference architectures that cover data, models, orchestration, and deployment patterns. The platform and architecture decisions you make early in your AI journey will enable or constrain your capabilities for years. A well-designed AI platform provides the foundation for rapid experimentation, reliable production deployments, and continuous improvement.
Why It Matters
Architecture sets the runway for every team that builds on the platform, and standard components reduce duplicated effort and risk. Poor architectural decisions compound over time:
- Technical Debt: Teams build workarounds when the platform doesn't meet their needs, creating a lasting maintenance burden
- Scalability Bottlenecks: Architecture that works for 100 models fails at 1,000 models
- Vendor Lock-in: Tight coupling to specific vendors limits flexibility and increases costs
- Security Gaps: Ad-hoc solutions bypass security controls and create vulnerabilities
- Team Friction: Inconsistent tooling across teams reduces collaboration and knowledge sharing
Real-world impact:
- An e-commerce company rebuilt its entire ML platform after the initial design couldn't scale beyond 50 models
- A healthcare AI startup spent 18 months migrating off a tightly coupled cloud vendor
- A financial services firm saved $2M annually by standardizing on a reference architecture
Platform Architecture Layers
Layered Architecture Model
```mermaid
graph TB
    subgraph "Control Plane"
        CP1[Catalog & Metadata]
        CP2[Governance & Compliance]
        CP3[Monitoring & Observability]
        CP4[Security & IAM]
    end
    subgraph "Data Layer"
        D1[Raw Data Lake]
        D2[Curated Data Warehouse]
        D3[Feature Store]
        D4[Vector Database]
    end
    subgraph "Compute Layer"
        C1[Training Clusters]
        C2[Batch Inference]
        C3[Real-time Serving]
        C4[Experimentation]
    end
    subgraph "Orchestration Layer"
        O1[Workflow Engine]
        O2[Event Bus]
        O3[Scheduler]
        O4[Job Queue]
    end
    subgraph "Model Layer"
        M1[Model Registry]
        M2[Experiment Tracking]
        M3[Version Control]
        M4[A/B Testing]
    end
    CP1 & CP2 & CP3 & CP4 -.-> D1 & C1 & O1 & M1
    D1 & D2 & D3 & D4 --> C1 & C2 & C3
    O1 & O2 & O3 & O4 --> C1 & C2 & C3
    C1 --> M1
    M1 --> C2 & C3
    M2 & M3 & M4 --> M1
    style CP1 fill:#f96,stroke:#333,stroke-width:2px
    style D3 fill:#bbf,stroke:#333,stroke-width:2px
    style M1 fill:#9f9,stroke:#333,stroke-width:2px
```
Platform vs. Product Teams
| Aspect | Platform Team | Product/Model Teams |
|---|---|---|
| Mission | Provide infrastructure and tooling | Build and deploy ML models |
| Deliverables | APIs, SDKs, reference implementations | Production models and applications |
| Users | Internal data scientists and engineers | End users and business stakeholders |
| Metrics | Platform uptime, adoption rate, developer satisfaction | Model performance, business impact |
| Skills | Infrastructure, DevOps, platform engineering | ML, data science, domain expertise |
Data Architecture Patterns
Lakehouse Architecture
```mermaid
graph LR
    subgraph "Ingestion"
        I1[Batch Sources]
        I2[Streaming Sources]
        I3[API Sources]
    end
    subgraph "Bronze Layer - Raw"
        B1[Raw Data Lake<br/>Parquet/Delta]
        B2[Change Data Capture]
        B3[Audit Logs]
    end
    subgraph "Silver Layer - Curated"
        S1[Cleaned Data]
        S2[Deduplicated]
        S3[Standardized Schema]
    end
    subgraph "Gold Layer - Business"
        G1[Aggregated Features]
        G2[Business Metrics]
        G3[ML-Ready Datasets]
    end
    subgraph "Serving Layer"
        SV1[Analytics Warehouse]
        SV2[Feature Store]
        SV3[Vector Store]
    end
    I1 & I2 & I3 --> B1 & B2 & B3
    B1 & B2 & B3 --> S1 & S2 & S3
    S1 & S2 & S3 --> G1 & G2 & G3
    G1 & G2 & G3 --> SV1 & SV2 & SV3
    style G3 fill:#bbf,stroke:#333,stroke-width:2px
    style SV2 fill:#f96,stroke:#333,stroke-width:2px
```
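To make the bronze-to-silver hop concrete, here is a minimal promotion step sketched with PySpark, assuming a Spark session with Delta Lake configured; the S3 paths and column names are illustrative, not taken from any specific deployment:

```python
# Minimal bronze -> silver promotion sketch using PySpark + Delta Lake.
# Paths, table names, and columns are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Read raw events from the bronze layer as-is.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")

silver = (
    bronze
    .dropDuplicates(["order_id"])                         # deduplicate
    .filter(F.col("order_total").isNotNull())             # drop malformed rows
    .withColumn("order_ts", F.to_timestamp("order_ts"))   # standardize schema
    .withColumn("order_date", F.to_date("order_ts"))      # derive partition key
)

# Write the curated table to the silver layer, partitioned by date.
(silver.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://lake/silver/orders"))
```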
Feature Store Architecture
```mermaid
graph TB
    subgraph "Feature Definition"
        FD1[Feature Registry]
        FD2[Feature Metadata]
        FD3[Lineage Tracking]
    end
    subgraph "Offline Store"
        OS1[Historical Features<br/>Data Warehouse]
        OS2[Point-in-Time Correct]
        OS3[Training Set Generation]
    end
    subgraph "Online Store"
        ON1[Real-time Features<br/>Redis/DynamoDB]
        ON2[Low-latency Lookup]
        ON3[< 10ms Response]
    end
    subgraph "Feature Computation"
        FC1[Batch Pipeline<br/>Daily/Hourly]
        FC2[Stream Pipeline<br/>Real-time]
        FC3[On-Demand<br/>Request-time]
    end
    FD1 & FD2 & FD3 --> OS1 & ON1
    FC1 --> OS1 & ON1
    FC2 --> ON1
    OS1 --> OS3
    ON1 --> ON2
    OS3 -.-> Training[Model Training]
    ON2 -.-> Inference[Model Inference]
    style ON1 fill:#f96,stroke:#333,stroke-width:2px
    style OS3 fill:#bbf,stroke:#333,stroke-width:2px
```
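For concreteness, here is roughly what a feature definition looks like in Feast, the open-source feature store recommended later in this chapter (APIs vary across Feast versions); the entity, columns, and source path are hypothetical:

```python
# Hypothetical Feast feature definitions; entity, columns, and paths are illustrative.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the join key shared by the offline and online stores.
user = Entity(name="user", join_keys=["user_id"])

# Offline source backing the feature view; point-in-time joins use event_timestamp.
user_stats_source = FileSource(
    path="s3://lake/gold/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view: the unit registered in the feature registry and
# materialized into the online store for low-latency lookup.
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="orders_7d", dtype=Int64),
        Field(name="avg_order_value_30d", dtype=Float32),
    ],
    source=user_stats_source,
)
```

Training sets are then assembled with point-in-time-correct joins (get_historical_features), while serving reads the same features from the online store (get_online_features).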
Technology Selection Matrix
Data Storage Comparison
| Technology | Best For | Strengths | Limitations | Cost | When to Use |
|---|---|---|---|---|---|
| Delta Lake | Lakehouse on cloud | ACID transactions, time travel, open format | Requires Spark | Low | Choose for data lake + warehouse hybrid |
| Snowflake | Cloud data warehouse | Performance, ease of use, separation of compute/storage | Can be expensive, some vendor lock-in | Medium-High | Choose for analytics-heavy workloads |
| BigQuery | GCP analytics | Serverless, ML integration, fast queries | GCP lock-in, unpredictable costs | Medium | Choose for GCP-native stack |
| Databricks | Unified analytics | End-to-end platform, notebook experience | Expensive, learning curve | High | Choose for full ML platform |
| PostgreSQL | Transactional data | Mature, reliable, extensions (pgvector) | Limited horizontal scaling | Low | Choose for operational databases |
| Redis | Real-time serving | Sub-millisecond latency, simple | Limited persistence, memory-bound | Medium | Choose for feature serving |
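As a small illustration of the PostgreSQL + pgvector row above, a nearest-neighbor lookup might look like the following sketch; the DSN, table name, and embedding dimension are hypothetical:

```python
# Hypothetical nearest-neighbor query against PostgreSQL + pgvector.
# The DSN, table (item_embeddings), and 384-dim embeddings are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=features user=ml")
query_vec = "[" + ",".join(["0.1"] * 384) + "]"  # stand-in embedding literal

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, embedding <-> %s::vector AS distance
        FROM item_embeddings
        ORDER BY distance
        LIMIT 5
        """,
        (query_vec,),
    )
    for item_id, distance in cur.fetchall():
        print(item_id, distance)
```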
Model Serving Patterns
```mermaid
graph TB
    subgraph "Batch Serving"
        B1[Scheduled Job] --> B2[Load Full Dataset]
        B2 --> B3[Batch Inference]
        B3 --> B4[Write to Database/S3]
    end
    subgraph "Real-time Serving"
        R1[API Request] --> R2[Feature Lookup]
        R2 --> R3[Model Inference]
        R3 --> R4[Response < 100ms]
    end
    subgraph "Streaming Serving"
        S1[Event Stream] --> S2[Feature Enrichment]
        S2 --> S3[Model Inference]
        S3 --> S4[Sink to Target]
    end
    subgraph "Edge Serving"
        E1[Mobile/IoT Device] --> E2[Local Model]
        E2 --> E3[On-Device Inference]
        E3 --> E4[Offline Capable]
    end
    style R3 fill:#f96,stroke:#333,stroke-width:2px
    style S3 fill:#bbf,stroke:#333,stroke-width:2px
```
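A minimal real-time serving endpoint, sketched here with FastAPI; the model artifact and the hard-coded feature values are stubs standing in for a registry download and an online-store lookup:

```python
# Minimal real-time serving sketch: FastAPI endpoint that looks up features
# and runs inference. Model path, request fields, and features are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical artifact pulled from the registry

class ScoreRequest(BaseModel):
    user_id: str

@app.post("/score")
def score(req: ScoreRequest):
    # A real implementation would fetch features from the online store
    # (e.g., Redis); static values keep the sketch self-contained.
    features = [[3, 42.5]]  # e.g., orders_7d, avg_order_value_30d
    prediction = model.predict(features)[0]
    return {"user_id": req.user_id, "score": float(prediction)}
```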
Orchestration Tool Comparison
| Tool | Best For | Strengths | Limitations | Cost | Decision Factor |
|---|---|---|---|---|---|
| Airflow | Complex DAGs, established teams | Mature, flexible, large ecosystem | Steep learning curve, maintenance overhead | Free (self-hosted) | Choose for complex workflows |
| Prefect | Modern Python workflows | Pythonic, great DX, hybrid execution | Smaller ecosystem, newer | Free + paid tiers | Choose for Python-first teams |
| Kubeflow Pipelines | Kubernetes-native ML | K8s integration, portable, containerized | Complex setup, K8s expertise required | Free (infrastructure only) | Choose for K8s environments |
| AWS Step Functions | AWS-native workflows | Managed, serverless, visual designer | AWS lock-in, expensive at scale | Pay-per-execution | Choose for AWS serverless |
| Databricks Workflows | Databricks users | Tight Spark integration, notebooks support | Databricks lock-in | Included with Databricks | Choose if using Databricks |
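For orientation, here is a minimal Airflow DAG using the TaskFlow API (Airflow 2.x); the schedule, task names, and task bodies are illustrative stubs:

```python
# Minimal Airflow DAG sketch for a daily retraining pipeline.
# Task bodies are stubs; the schedule and names are illustrative.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def retrain_model():
    @task
    def extract_features():
        ...  # e.g., materialize features from the gold layer

    @task
    def train():
        ...  # e.g., launch a training job and log to MLflow

    @task
    def register():
        ...  # e.g., push the candidate model to the registry

    # Linear dependency chain: extract -> train -> register.
    extract_features() >> train() >> register()

retrain_model()
```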
ML Training Infrastructure
Distributed Training Architecture
```mermaid
graph TB
    subgraph "Training Orchestrator"
        TO1[Experiment Tracking]
        TO2[Hyperparameter Tuning]
        TO3[Resource Allocation]
    end
    subgraph "Compute Cluster"
        C1[GPU Node 1<br/>8x A100]
        C2[GPU Node 2<br/>8x A100]
        C3[GPU Node 3<br/>8x A100]
        C4[GPU Node 4<br/>8x A100]
    end
    subgraph "Data Pipeline"
        DP1[Data Loader]
        DP2[Preprocessing]
        DP3[Distributed Cache]
    end
    subgraph "Model Registry"
        MR1[Checkpoints]
        MR2[Versioned Models]
        MR3[Metadata]
    end
    TO1 & TO2 & TO3 --> C1 & C2 & C3 & C4
    DP1 --> DP2 --> DP3
    DP3 --> C1 & C2 & C3 & C4
    C1 & C2 & C3 & C4 --> MR1 & MR2 & MR3
    style C1 fill:#f96,stroke:#333,stroke-width:2px
    style MR2 fill:#bbf,stroke:#333,stroke-width:2px
```
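A minimal data-parallel training sketch with PyTorch DistributedDataParallel, assuming launch via `torchrun --nproc_per_node=8 train.py` on a multi-GPU node; the model and batches are toy placeholders:

```python
# Minimal PyTorch DDP sketch; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
# The linear model and random batches are stand-ins for real training code.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 128, device=local_rank)   # stand-in batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()    # gradients are all-reduced across GPUs here
        optimizer.step()
        if step % 10 == 0 and dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")  # checkpoint in practice

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```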
Compute Options Comparison
| Workload | On-Premise | Cloud VMs | Kubernetes | Managed Services |
|---|---|---|---|---|
| Best For | Regulated industries, existing HPC | Full control, custom setup | Containerized, portable | Quick start, minimal ops |
| Pros | Data sovereignty, amortized cost | Flexible, spot instances | Scalable, cloud-agnostic | Fully managed, integrated |
| Cons | High capex, limited elasticity | Ops overhead, costly at scale | Complex setup, expertise required | Vendor lock-in, less control |
| Cost | High upfront, low variable | $1-10/GPU hour | $0.50-8/GPU hour + overhead | $2-15/GPU hour |
| Examples | On-prem GPU clusters | EC2, GCE, Azure VMs | EKS + Kubeflow, GKE + KFP | SageMaker, Vertex AI, AzureML |
Build vs. Buy Decision Framework
Component-by-Component Analysis
| Component | Build | Buy/Adopt Open Source | Managed Service | Recommendation |
|---|---|---|---|---|
| Data Ingestion | ⚠️ Complex | ✅ Airbyte, Fivetran | ✅ AWS Glue, Azure Data Factory | Buy or managed |
| Data Lake | ⚠️ Complex | ✅ Delta Lake, Apache Iceberg | ✅ S3 + Glue, Azure Data Lake | Open source + cloud storage |
| Data Warehouse | ❌ Don't build | ⚠️ DuckDB, ClickHouse | ✅ Snowflake, BigQuery, Redshift | Managed service |
| Feature Store | ⚠️ Consider for unique needs | ✅ Feast | ✅ Tecton, SageMaker Feature Store | Open source first |
| Experiment Tracking | ❌ Don't build | ✅ MLflow, W&B | ✅ SageMaker Experiments | Open source (MLflow) |
| Model Training | ✅ Custom training code | ✅ Frameworks (PyTorch, TensorFlow) | ⚠️ Managed training | Build on frameworks |
| Model Serving | ⚠️ Consider for unique requirements | ✅ TorchServe, TF Serving, Triton | ✅ SageMaker, Vertex AI | Open source or managed |
| Orchestration | ❌ Don't build | ✅ Airflow, Prefect, Kubeflow | ⚠️ Step Functions | Open source |
| Monitoring | ❌ Don't build | ✅ Prometheus + Grafana | ✅ Datadog, New Relic | Open source or managed |
✅ = Recommended | ⚠️ = Situational | ❌ = Avoid
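Since the table recommends MLflow for experiment tracking, a minimal usage sketch follows; the tracking URI, experiment name, and metric names are illustrative:

```python
# Minimal MLflow experiment-tracking sketch; URI and names are illustrative.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical server
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("val_auc", 0.91)
    # mlflow.sklearn.log_model(model, "model")  # attach the trained artifact
```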
Real-World Case Study: E-commerce ML Platform
Challenge
A growing e-commerce company had five data science teams building 50+ models with inconsistent tooling: no standard deployment process, models breaking in production, and six-week deployment cycles.
Solution Architecture
The company built a centralized ML platform over six months:
Data Layer:
- S3 data lake (Bronze/Silver/Gold layers)
- Snowflake warehouse for analytics
- Feast feature store for online/offline features
Compute:
- EKS cluster for distributed training
- SageMaker for model serving
- Spot instances for 70% cost savings
Orchestration:
- Airflow for data pipelines
- Argo Workflows for ML pipelines
- Event-driven triggers via EventBridge (see the sketch after this list)
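A sketch of how such a trigger might be published with boto3; the bus name, detail type, and payload schema are hypothetical, not taken from the case study:

```python
# Hypothetical event-driven retraining trigger via Amazon EventBridge.
# Bus name, detail-type, and payload schema are illustrative.
import json

import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "ml.platform",
            "DetailType": "dataset.updated",
            "Detail": json.dumps({"dataset": "orders", "version": "2024-06-01"}),
            "EventBusName": "ml-platform-bus",
        }
    ]
)
# A rule on ml-platform-bus would match dataset.updated events and start
# the corresponding Argo or Airflow retraining workflow.
```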
Observability:
- Datadog for infrastructure metrics
- Custom model monitoring with Grafana
- PagerDuty for alerting
Implementation Timeline
| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Foundation | Month 1-2 | VPCs, EKS, data lake, warehouse | Core infrastructure |
| ML Services | Month 3-4 | Feature store, experiment tracking, model registry | ML platform services |
| Deployment | Month 5 | Model serving, CI/CD pipelines | Deployment automation |
| Enablement | Month 6 | Documentation, migration support, training | Team onboarding |
Results After 12 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment time | 6 weeks | 2 days | 95% reduction |
| Model deployment success rate | 60% | 95% | +58% |
| Infrastructure costs | $450K/year | $315K/year | 30% reduction |
| Data science productivity | Baseline | +40% | Less time on infrastructure |
| Platform adoption | N/A | 100% of teams | Full migration |
| Models in production | 50 | 200 | 4x growth |
| Incident rate | 12/month | 2/month | 83% reduction |
Key Success Factors
- Executive Sponsorship: Dedicated budget and platform team
- Golden Paths: Reference implementations for common patterns
- Migration Support: Office hours and hands-on help
- Iterative Rollout: Start with one team, expand gradually
- Clear Metrics: Track adoption, performance, cost
- Documentation: Comprehensive guides and examples
Implementation Checklist
Foundation Phase (Week 1-4)
□ Define reference architecture diagram
□ Select core technologies (lakehouse, warehouse, orchestration)
□ Establish infrastructure-as-code practices (Terraform, CDK; see the CDK sketch after this list)
□ Set up networking and security (VPCs, security groups, IAM)
□ Deploy monitoring and logging infrastructure
□ Create developer environments (sandbox, dev, staging, prod)
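As one possible starting point for the infrastructure-as-code item above, here is a minimal AWS CDK (v2, Python) stack for a data-lake bucket; the stack and bucket names are illustrative:

```python
# Minimal AWS CDK (Python) sketch for a bronze-layer data-lake bucket.
# Stack and construct names are illustrative placeholders.
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Versioned, encrypted, private-by-default bucket for raw data.
        s3.Bucket(
            self,
            "BronzeBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
DataLakeStack(app, "ml-platform-data-lake")
app.synth()
```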
Platform Services (Week 5-12)
□ Deploy data lake/lakehouse with governance
□ Set up data warehouse and access patterns
□ Implement feature store (if needed)
□ Deploy experiment tracking system (MLflow)
□ Set up model registry
□ Configure CI/CD pipelines for ML workflows
□ Implement model serving infrastructure
□ Deploy orchestration platform (Airflow)
Developer Experience (Week 13-16)
□ Create golden path templates and examples
□ Write platform documentation and runbooks
□ Set up developer onboarding guides
□ Create self-service workflows (request access, deploy model)
□ Implement platform SDK/CLI (a minimal CLI sketch follows this list)
□ Establish office hours and support channels
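A minimal sketch of what the platform CLI item above could look like, built with click; the command names, options, and deployment call are hypothetical:

```python
# Minimal platform CLI sketch using click; commands are illustrative stubs.
import click

@click.group()
def cli():
    """Internal ML platform CLI (illustrative)."""

@cli.command()
@click.argument("model_name")
@click.option("--env", default="staging", help="Target environment.")
def deploy(model_name: str, env: str):
    """Deploy a registered model to the serving infrastructure."""
    # A real CLI would call the platform's deployment API here.
    click.echo(f"Deploying {model_name} to {env}...")

if __name__ == "__main__":
    cli()
```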
Governance & Operations (Ongoing)
□ Define cost allocation and chargeback
□ Implement usage quotas and rate limits (see the token-bucket sketch after this list)
□ Set up audit logging and compliance reporting
□ Create disaster recovery and backup procedures
□ Establish SLAs for platform services
□ Schedule regular capacity planning reviews
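One simple way to implement the quota item above is a token bucket per team; the capacity and refill rate below are illustrative policy choices:

```python
# Minimal token-bucket rate limiter sketch for per-team API quotas.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g., a 100-request burst, refilling at 10 requests/second per team
bucket = TokenBucket(capacity=100, refill_per_sec=10)
if not bucket.allow():
    print("429: quota exceeded")
```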
Best Practices
- Start Simple, Evolve: Begin with proven tools, add complexity only when needed
- Design for Portability: Avoid vendor lock-in where possible; use open formats and standards
- Automate Everything: Infrastructure, deployments, testing, monitoring
- Measure Platform Adoption: Track usage metrics, gather feedback, iterate
- Invest in Developer Experience: Platform is only valuable if teams use it
- Plan for Scale: Design for 10x current scale, implement for current + 3x
- Security from Day One: Easier to build in than bolt on later
- Document Decisions: Maintain Architecture Decision Records (ADRs)
Common Pitfalls
- Over-Engineering: Building for Google-scale when you have 10 models
- Under-Engineering: Not planning for growth, hitting scalability walls
- Tool Sprawl: Too many overlapping tools, fragmented ecosystem
- Neglecting DX: Powerful platform nobody can use
- Ignoring Costs: Building without cost constraints, surprise bills later
- Tight Coupling: Hard dependencies make changes difficult
- No Governance: Platform without guard rails leads to chaos
- Premature Optimization: Optimizing before understanding bottlenecks