Chapter 16 — Platforms & Architecture
Overview
Define reference architectures that cover data, models, orchestration, and deployment patterns. The platform and architecture decisions you make early in your AI journey will enable or constrain your capabilities for years. A well-designed AI platform provides the foundation for rapid experimentation, reliable production deployments, and continuous improvement.
Why It Matters
Architecture sets the runway for every team that builds on the platform, and standard components reduce duplicated effort and risk. Poor architectural decisions compound over time:
- Technical Debt: Teams build workarounds when the platform doesn't meet their needs, creating a lasting maintenance burden
- Scalability Bottlenecks: Architecture that works for 100 models fails at 1,000 models
- Vendor Lock-in: Tight coupling to specific vendors limits flexibility and increases costs
- Security Gaps: Ad-hoc solutions bypass security controls and create vulnerabilities
- Team Friction: Inconsistent tooling across teams reduces collaboration and knowledge sharing
Real-world impact:
- An e-commerce company rebuilt its entire ML platform after the initial design couldn't scale beyond 50 models
- A healthcare AI startup spent 18 months migrating off a tightly coupled cloud vendor
- A financial services firm saved $2M annually by standardizing on a reference architecture
Platform Architecture Layers
Layered Architecture Model
```mermaid
graph TB
    subgraph "Control Plane"
        CP1[Catalog & Metadata]
        CP2[Governance & Compliance]
        CP3[Monitoring & Observability]
        CP4[Security & IAM]
    end
    subgraph "Data Layer"
        D1[Raw Data Lake]
        D2[Curated Data Warehouse]
        D3[Feature Store]
        D4[Vector Database]
    end
    subgraph "Compute Layer"
        C1[Training Clusters]
        C2[Batch Inference]
        C3[Real-time Serving]
        C4[Experimentation]
    end
    subgraph "Orchestration Layer"
        O1[Workflow Engine]
        O2[Event Bus]
        O3[Scheduler]
        O4[Job Queue]
    end
    subgraph "Model Layer"
        M1[Model Registry]
        M2[Experiment Tracking]
        M3[Version Control]
        M4[A/B Testing]
    end
    CP1 & CP2 & CP3 & CP4 -.-> D1 & C1 & O1 & M1
    D1 & D2 & D3 & D4 --> C1 & C2 & C3
    O1 & O2 & O3 & O4 --> C1 & C2 & C3
    C1 --> M1
    M1 --> C2 & C3
    M2 & M3 & M4 --> M1
    style CP1 fill:#f96,stroke:#333,stroke-width:2px
    style D3 fill:#bbf,stroke:#333,stroke-width:2px
    style M1 fill:#9f9,stroke:#333,stroke-width:2px
```
Platform vs. Product Teams
| Aspect | Platform Team | Product/Model Teams |
|---|---|---|
| Mission | Provide infrastructure and tooling | Build and deploy ML models |
| Deliverables | APIs, SDKs, reference implementations | Production models and applications |
| Users | Internal data scientists and engineers | End users and business stakeholders |
| Metrics | Platform uptime, adoption rate, developer satisfaction | Model performance, business impact |
| Skills | Infrastructure, DevOps, platform engineering | ML, data science, domain expertise |
Data Architecture Patterns
Lakehouse Architecture
```mermaid
graph LR
    subgraph "Ingestion"
        I1[Batch Sources]
        I2[Streaming Sources]
        I3[API Sources]
    end
    subgraph "Bronze Layer - Raw"
        B1[Raw Data Lake<br/>Parquet/Delta]
        B2[Change Data Capture]
        B3[Audit Logs]
    end
    subgraph "Silver Layer - Curated"
        S1[Cleaned Data]
        S2[Deduplicated]
        S3[Standardized Schema]
    end
    subgraph "Gold Layer - Business"
        G1[Aggregated Features]
        G2[Business Metrics]
        G3[ML-Ready Datasets]
    end
    subgraph "Serving Layer"
        SV1[Analytics Warehouse]
        SV2[Feature Store]
        SV3[Vector Store]
    end
    I1 & I2 & I3 --> B1 & B2 & B3
    B1 & B2 & B3 --> S1 & S2 & S3
    S1 & S2 & S3 --> G1 & G2 & G3
    G1 & G2 & G3 --> SV1 & SV2 & SV3
    style G3 fill:#bbf,stroke:#333,stroke-width:2px
    style SV2 fill:#f96,stroke:#333,stroke-width:2px
```
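To make the bronze-to-silver hop concrete, here is a minimal promotion step sketched with PySpark, assuming a Spark session with Delta Lake configured; the S3 paths and column names are illustrative, not taken from any specific deployment:

```python
# Minimal bronze -> silver promotion sketch using PySpark + Delta Lake.
# Paths, table names, and columns are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Read raw events from the bronze layer as-is.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")

silver = (
    bronze
    .dropDuplicates(["order_id"])                         # deduplicate
    .filter(F.col("order_total").isNotNull())             # drop malformed rows
    .withColumn("order_ts", F.to_timestamp("order_ts"))   # standardize schema
    .withColumn("order_date", F.to_date("order_ts"))      # derive partition key
)

# Write the curated table to the silver layer, partitioned by date.
(silver.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save("s3://lake/silver/orders"))
```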
Feature Store Architecture
```mermaid
graph TB
    subgraph "Feature Definition"
        FD1[Feature Registry]
        FD2[Feature Metadata]
        FD3[Lineage Tracking]
    end
    subgraph "Offline Store"
        OS1[Historical Features<br/>Data Warehouse]
        OS2[Point-in-Time Correct]
        OS3[Training Set Generation]
    end
    subgraph "Online Store"
        ON1[Real-time Features<br/>Redis/DynamoDB]
        ON2[Low-latency Lookup]
        ON3[< 10ms Response]
    end
    subgraph "Feature Computation"
        FC1[Batch Pipeline<br/>Daily/Hourly]
        FC2[Stream Pipeline<br/>Real-time]
        FC3[On-Demand<br/>Request-time]
    end
    FD1 & FD2 & FD3 --> OS1 & ON1
    FC1 --> OS1 & ON1
    FC2 --> ON1
    OS1 --> OS3
    ON1 --> ON2
    OS3 -.-> Training[Model Training]
    ON2 -.-> Inference[Model Inference]
    style ON1 fill:#f96,stroke:#333,stroke-width:2px
    style OS3 fill:#bbf,stroke:#333,stroke-width:2px
```
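For concreteness, here is roughly what a feature definition looks like in Feast, the open-source feature store recommended later in this chapter (APIs vary across Feast versions); the entity, columns, and source path are hypothetical:

```python
# Hypothetical Feast feature definitions; entity, columns, and paths are illustrative.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the join key shared by the offline and online stores.
user = Entity(name="user", join_keys=["user_id"])

# Offline source backing the feature view; point-in-time joins use event_timestamp.
user_stats_source = FileSource(
    path="s3://lake/gold/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view: the unit registered in the feature registry and
# materialized into the online store for low-latency lookup.
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="orders_7d", dtype=Int64),
        Field(name="avg_order_value_30d", dtype=Float32),
    ],
    source=user_stats_source,
)
```

Training sets are then assembled with point-in-time-correct joins (get_historical_features), while serving reads the same features from the online store (get_online_features).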
Technology Selection Matrix
Data Storage Comparison
| Technology | Best For | Strengths | Limitations | Cost | When to Use |
|---|---|---|---|---|---|
| Delta Lake | Lakehouse on cloud | ACID transactions, time travel, open format | Requires Spark | Low | Choose for data lake + warehouse hybrid |
| Snowflake | Cloud data warehouse | Performance, ease of use, separation of compute/storage | Can be expensive, some vendor lock-in | Medium-High | Choose for analytics-heavy workloads |
| BigQuery | GCP analytics | Serverless, ML integration, fast queries | GCP lock-in, unpredictable costs | Medium | Choose for GCP-native stack |
| Databricks | Unified analytics | End-to-end platform, notebook experience | Expensive, learning curve | High | Choose for full ML platform |
| PostgreSQL | Transactional data | Mature, reliable, extensions (pgvector) | Limited horizontal scaling | Low | Choose for operational databases |
| Redis | Real-time serving | Sub-millisecond latency, simple | Limited persistence, memory-bound | Medium | Choose for feature serving |
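As a small illustration of the PostgreSQL + pgvector row above, a nearest-neighbor lookup might look like the following sketch; the DSN, table name, and embedding dimension are hypothetical:

```python
# Hypothetical nearest-neighbor query against PostgreSQL + pgvector.
# The DSN, table (item_embeddings), and 384-dim embeddings are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=features user=ml")
query_vec = "[" + ",".join(["0.1"] * 384) + "]"  # stand-in embedding literal

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, embedding <-> %s::vector AS distance
        FROM item_embeddings
        ORDER BY distance
        LIMIT 5
        """,
        (query_vec,),
    )
    for item_id, distance in cur.fetchall():
        print(item_id, distance)
```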
Model Serving Patterns
```mermaid
graph TB
    subgraph "Batch Serving"
        B1[Scheduled Job] --> B2[Load Full Dataset]
        B2 --> B3[Batch Inference]
        B3 --> B4[Write to Database/S3]
    end
    subgraph "Real-time Serving"
        R1[API Request] --> R2[Feature Lookup]
        R2 --> R3[Model Inference]
        R3 --> R4[Response < 100ms]
    end
    subgraph "Streaming Serving"
        S1[Event Stream] --> S2[Feature Enrichment]
        S2 --> S3[Model Inference]
        S3 --> S4[Sink to Target]
    end
    subgraph "Edge Serving"
        E1[Mobile/IoT Device] --> E2[Local Model]
        E2 --> E3[On-Device Inference]
        E3 --> E4[Offline Capable]
    end
    style R3 fill:#f96,stroke:#333,stroke-width:2px
    style S3 fill:#bbf,stroke:#333,stroke-width:2px
```
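A minimal real-time serving endpoint, sketched here with FastAPI; the model artifact and the hard-coded feature values are stubs standing in for a registry download and an online-store lookup:

```python
# Minimal real-time serving sketch: FastAPI endpoint that looks up features
# and runs inference. Model path, request fields, and features are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical artifact pulled from the registry

class ScoreRequest(BaseModel):
    user_id: str

@app.post("/score")
def score(req: ScoreRequest):
    # A real implementation would fetch features from the online store
    # (e.g., Redis); static values keep the sketch self-contained.
    features = [[3, 42.5]]  # e.g., orders_7d, avg_order_value_30d
    prediction = model.predict(features)[0]
    return {"user_id": req.user_id, "score": float(prediction)}
```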
Orchestration Tool Comparison
| Tool | Best For | Strengths | Limitations | Cost | Decision Factor |
|---|---|---|---|---|---|
| Airflow | Complex DAGs, established teams | Mature, flexible, large ecosystem | Steep learning curve, maintenance overhead | Free (self-hosted) | Choose for complex workflows |
| Prefect | Modern Python workflows | Pythonic, great DX, hybrid execution | Smaller ecosystem, newer | Free + paid tiers | Choose for Python-first teams |
| Kubeflow Pipelines | Kubernetes-native ML | K8s integration, portable, containerized | Complex setup, K8s expertise required | Free (infrastructure only) | Choose for K8s environments |
| AWS Step Functions | AWS-native workflows | Managed, serverless, visual designer | AWS lock-in, expensive at scale | Pay-per-execution | Choose for AWS serverless |
| Databricks Workflows | Databricks users | Tight Spark integration, notebooks support | Databricks lock-in | Included with Databricks | Choose if using Databricks |
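For orientation, here is a minimal Airflow DAG using the TaskFlow API (Airflow 2.x); the schedule, task names, and task bodies are illustrative stubs:

```python
# Minimal Airflow DAG sketch for a daily retraining pipeline.
# Task bodies are stubs; the schedule and names are illustrative.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def retrain_model():
    @task
    def extract_features():
        ...  # e.g., materialize features from the gold layer

    @task
    def train():
        ...  # e.g., launch a training job and log to MLflow

    @task
    def register():
        ...  # e.g., push the candidate model to the registry

    # Linear dependency chain: extract -> train -> register.
    extract_features() >> train() >> register()

retrain_model()
```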
ML Training Infrastructure
Distributed Training Architecture
```mermaid
graph TB
    subgraph "Training Orchestrator"
        TO1[Experiment Tracking]
        TO2[Hyperparameter Tuning]
        TO3[Resource Allocation]
    end
    subgraph "Compute Cluster"
        C1[GPU Node 1<br/>8x A100]
        C2[GPU Node 2<br/>8x A100]
        C3[GPU Node 3<br/>8x A100]
        C4[GPU Node 4<br/>8x A100]
    end
    subgraph "Data Pipeline"
        DP1[Data Loader]
        DP2[Preprocessing]
        DP3[Distributed Cache]
    end
    subgraph "Model Registry"
        MR1[Checkpoints]
        MR2[Versioned Models]
        MR3[Metadata]
    end
    TO1 & TO2 & TO3 --> C1 & C2 & C3 & C4
    DP1 --> DP2 --> DP3
    DP3 --> C1 & C2 & C3 & C4
    C1 & C2 & C3 & C4 --> MR1 & MR2 & MR3
    style C1 fill:#f96,stroke:#333,stroke-width:2px
    style MR2 fill:#bbf,stroke:#333,stroke-width:2px
```
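A minimal data-parallel training sketch with PyTorch DistributedDataParallel, assuming launch via `torchrun --nproc_per_node=8 train.py` on a multi-GPU node; the model and batches are toy placeholders:

```python
# Minimal PyTorch DDP sketch; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
# The linear model and random batches are stand-ins for real training code.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 128, device=local_rank)   # stand-in batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()    # gradients are all-reduced across GPUs here
        optimizer.step()
        if step % 10 == 0 and dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")  # checkpoint in practice

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```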
Compute Options Comparison
| Workload | On-Premise | Cloud VMs | Kubernetes | Managed Services |
|---|---|---|---|---|
| Best For | Regulated industries, existing HPC | Full control, custom setup | Containerized, portable | Quick start, minimal ops |
| Pros | Data sovereignty, amortized cost | Flexible, spot instances | Scalable, cloud-agnostic | Fully managed, integrated |
| Cons | High capex, limited elasticity | Ops overhead, costly at scale | Complex setup, expertise required | Vendor lock-in, less control |
| Cost | High upfront, low variable | $1-10/GPU hour | $0.50-8/GPU hour + overhead | $2-15/GPU hour |
| Examples | On-prem GPU clusters | EC2, GCE, Azure VMs | EKS + Kubeflow, GKE + KFP | SageMaker, Vertex AI, AzureML |
Build vs. Buy Decision Framework
Component-by-Component Analysis
| Component | Build | Buy/Adopt Open Source | Managed Service | Recommendation |
|---|---|---|---|---|
| Data Ingestion | ⚠️ Complex | ✅ Airbyte, Fivetran | ✅ AWS Glue, Azure Data Factory | Buy or managed |
| Data Lake | ⚠️ Complex | ✅ Delta Lake, Apache Iceberg | ✅ S3 + Glue, Azure Data Lake | Open source + cloud storage |
| Data Warehouse | ❌ Don't build | ⚠️ DuckDB, ClickHouse | ✅ Snowflake, BigQuery, Redshift | Managed service |
| Feature Store | ⚠️ Consider for unique needs | ✅ Feast | ✅ Tecton, SageMaker Feature Store | Open source first |
| Experiment Tracking | ❌ Don't build | ✅ MLflow, W&B | ✅ SageMaker Experiments | Open source (MLflow) |
| Model Training | ✅ Custom training code | ✅ Frameworks (PyTorch, TensorFlow) | ⚠️ Managed training | Build on frameworks |
| Model Serving | ⚠️ Consider for unique requirements | ✅ TorchServe, TF Serving, Triton | ✅ SageMaker, Vertex AI | Open source or managed |
| Orchestration | ❌ Don't build | ✅ Airflow, Prefect, Kubeflow | ⚠️ Step Functions | Open source |
| Monitoring | ❌ Don't build | ✅ Prometheus + Grafana | ✅ Datadog, New Relic | Open source or managed |
✅ = Recommended | ⚠️ = Situational | ❌ = Avoid
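Since the table recommends MLflow for experiment tracking, a minimal usage sketch follows; the tracking URI, experiment name, and metric names are illustrative:

```python
# Minimal MLflow experiment-tracking sketch; URI and names are illustrative.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical server
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("val_auc", 0.91)
    # mlflow.sklearn.log_model(model, "model")  # attach the trained artifact
```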
Real-World Case Study: E-commerce ML Platform
Challenge
A growing e-commerce company had five data science teams building 50+ models with inconsistent tooling: no standard deployment process, models breaking in production, and six-week deployment cycles.
Solution Architecture
The company built a centralized ML platform over six months:
Data Layer:
- S3 data lake (Bronze/Silver/Gold layers)
- Snowflake warehouse for analytics
- Feast feature store for online/offline features
Compute:
- EKS cluster for distributed training
- SageMaker for model serving
- Spot instances for 70% cost savings
Orchestration:
- Airflow for data pipelines
- Argo Workflows for ML pipelines
- Event-driven triggers via EventBridge (see the sketch after this list)
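A sketch of how such a trigger might be published with boto3; the bus name, detail type, and payload schema are hypothetical, not taken from the case study:

```python
# Hypothetical event-driven retraining trigger via Amazon EventBridge.
# Bus name, detail-type, and payload schema are illustrative.
import json

import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "ml.platform",
            "DetailType": "dataset.updated",
            "Detail": json.dumps({"dataset": "orders", "version": "2024-06-01"}),
            "EventBusName": "ml-platform-bus",
        }
    ]
)
# A rule on ml-platform-bus would match dataset.updated events and start
# the corresponding Argo or Airflow retraining workflow.
```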
Observability:
- Datadog for infrastructure metrics
- Custom model monitoring with Grafana
- PagerDuty for alerting
Implementation Timeline
| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Foundation | Month 1-2 | VPCs, EKS, data lake, warehouse | Core infrastructure |
| ML Services | Month 3-4 | Feature store, experiment tracking, model registry | ML platform services |
| Deployment | Month 5 | Model serving, CI/CD pipelines | Deployment automation |
| Enablement | Month 6 | Documentation, migration support, training | Team onboarding |
Results After 12 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment time | 6 weeks | 2 days | 95% reduction |
| Model deployment success rate | 60% | 95% | +58% |
| Infrastructure costs | $450K/year | $315K/year | 30% reduction |
| Data science productivity | Baseline | +40% | Less time on infrastructure |
| Platform adoption | N/A | 100% of teams | Full migration |
| Models in production | 50 | 200 | 4x growth |
| Incident rate | 12/month | 2/month | 83% reduction |
Key Success Factors
- Executive Sponsorship: Dedicated budget and platform team
- Golden Paths: Reference implementations for common patterns
- Migration Support: Office hours and hands-on help
- Iterative Rollout: Start with one team, expand gradually
- Clear Metrics: Track adoption, performance, cost
- Documentation: Comprehensive guides and examples
Implementation Checklist
Foundation Phase (Week 1-4)
□ Define reference architecture diagram
□ Select core technologies (lakehouse, warehouse, orchestration)
□ Establish infrastructure-as-code practices (Terraform, CDK; see the CDK sketch after this list)
□ Set up networking and security (VPCs, security groups, IAM)
□ Deploy monitoring and logging infrastructure
□ Create developer environments (sandbox, dev, staging, prod)
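As one possible starting point for the infrastructure-as-code item above, here is a minimal AWS CDK (v2, Python) stack for a data-lake bucket; the stack and bucket names are illustrative:

```python
# Minimal AWS CDK (Python) sketch for a bronze-layer data-lake bucket.
# Stack and construct names are illustrative placeholders.
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Versioned, encrypted, private-by-default bucket for raw data.
        s3.Bucket(
            self,
            "BronzeBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
DataLakeStack(app, "ml-platform-data-lake")
app.synth()
```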
Platform Services (Week 5-12)
□ Deploy data lake/lakehouse with governance
□ Set up data warehouse and access patterns
□ Implement feature store (if needed)
□ Deploy experiment tracking system (MLflow)
□ Set up model registry
□ Configure CI/CD pipelines for ML workflows
□ Implement model serving infrastructure
□ Deploy orchestration platform (Airflow)
Developer Experience (Week 13-16)
□ Create golden path templates and examples
□ Write platform documentation and runbooks
□ Set up developer onboarding guides
□ Create self-service workflows (request access, deploy model)
□ Implement platform SDK/CLI (a minimal CLI sketch follows this list)
□ Establish office hours and support channels
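A minimal sketch of what the platform CLI item above could look like, built with click; the command names, options, and deployment call are hypothetical:

```python
# Minimal platform CLI sketch using click; commands are illustrative stubs.
import click

@click.group()
def cli():
    """Internal ML platform CLI (illustrative)."""

@cli.command()
@click.argument("model_name")
@click.option("--env", default="staging", help="Target environment.")
def deploy(model_name: str, env: str):
    """Deploy a registered model to the serving infrastructure."""
    # A real CLI would call the platform's deployment API here.
    click.echo(f"Deploying {model_name} to {env}...")

if __name__ == "__main__":
    cli()
```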
Governance & Operations (Ongoing)
□ Define cost allocation and chargeback
□ Implement usage quotas and rate limits (see the token-bucket sketch after this list)
□ Set up audit logging and compliance reporting
□ Create disaster recovery and backup procedures
□ Establish SLAs for platform services
□ Schedule regular capacity planning reviews
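One simple way to implement the quota item above is a token bucket per team; the capacity and refill rate below are illustrative policy choices:

```python
# Minimal token-bucket rate limiter sketch for per-team API quotas.
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g., a 100-request burst, refilling at 10 requests/second per team
bucket = TokenBucket(capacity=100, refill_per_sec=10)
if not bucket.allow():
    print("429: quota exceeded")
```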
Best Practices
- Start Simple, Evolve: Begin with proven tools, add complexity only when needed
- Design for Portability: Avoid vendor lock-in where possible; use open formats and standards
- Automate Everything: Infrastructure, deployments, testing, monitoring
- Measure Platform Adoption: Track usage metrics, gather feedback, iterate
- Invest in Developer Experience: Platform is only valuable if teams use it
- Plan for Scale: Design for 10x current scale, implement for current + 3x
- Security from Day One: Easier to build in than bolt on later
- Document Decisions: Maintain Architecture Decision Records (ADRs)
Common Pitfalls
- Over-Engineering: Building for Google-scale when you have 10 models
- Under-Engineering: Not planning for growth, hitting scalability walls
- Tool Sprawl: Too many overlapping tools, fragmented ecosystem
- Neglecting DX: Powerful platform nobody can use
- Ignoring Costs: Building without cost constraints, surprise bills later
- Tight Coupling: Hard dependencies make changes difficult
- No Governance: Platform without guard rails leads to chaos
- Premature Optimization: Optimizing before understanding bottlenecks