Part 3: Data Foundations

Chapter 16: Platforms & Architecture


Overview

Define reference architectures that cover data, models, orchestration, and deployment patterns. The platform and architecture decisions you make early in your AI journey will enable or constrain your capabilities for years. A well-designed AI platform provides the foundation for rapid experimentation, reliable production deployments, and continuous improvement.

Why It Matters

Architecture sets the runway for every team that builds on the platform. Standard, shared components reduce duplicated effort and risk. Poor architectural decisions, by contrast, compound over time:

  • Technical Debt: Teams build workarounds when the platform doesn't meet their needs, creating a maintenance burden
  • Scalability Bottlenecks: Architecture that works for 100 models fails at 1,000 models
  • Vendor Lock-in: Tight coupling to specific vendors limits flexibility and increases costs
  • Security Gaps: Ad-hoc solutions bypass security controls and create vulnerabilities
  • Team Friction: Inconsistent tooling across teams reduces collaboration and knowledge sharing

Real-world impact:

  • An e-commerce company rebuilt its entire ML platform after the initial design couldn't scale beyond 50 models
  • A healthcare AI startup spent 18 months migrating off a tightly coupled cloud vendor
  • A financial services firm saved $2M annually by standardizing on a reference architecture

Platform Architecture Layers

Layered Architecture Model

graph TB subgraph "Control Plane" CP1[Catalog & Metadata] CP2[Governance & Compliance] CP3[Monitoring & Observability] CP4[Security & IAM] end subgraph "Data Layer" D1[Raw Data Lake] D2[Curated Data Warehouse] D3[Feature Store] D4[Vector Database] end subgraph "Compute Layer" C1[Training Clusters] C2[Batch Inference] C3[Real-time Serving] C4[Experimentation] end subgraph "Orchestration Layer" O1[Workflow Engine] O2[Event Bus] O3[Scheduler] O4[Job Queue] end subgraph "Model Layer" M1[Model Registry] M2[Experiment Tracking] M3[Version Control] M4[A/B Testing] end CP1 & CP2 & CP3 & CP4 -.-> D1 & C1 & O1 & M1 D1 & D2 & D3 & D4 --> C1 & C2 & C3 O1 & O2 & O3 & O4 --> C1 & C2 & C3 C1 --> M1 M1 --> C2 & C3 M2 & M3 & M4 --> M1 style CP1 fill:#f96,stroke:#333,stroke-width:2px style D3 fill:#bbf,stroke:#333,stroke-width:2px style M1 fill:#9f9,stroke:#333,stroke-width:2px

Platform vs. Product Teams

| Aspect | Platform Team | Product/Model Teams |
|---|---|---|
| Mission | Provide infrastructure and tooling | Build and deploy ML models |
| Deliverables | APIs, SDKs, reference implementations | Production models and applications |
| Users | Internal data scientists and engineers | End users and business stakeholders |
| Metrics | Platform uptime, adoption rate, developer satisfaction | Model performance, business impact |
| Skills | Infrastructure, DevOps, platform engineering | ML, data science, domain expertise |

Data Architecture Patterns

Lakehouse Architecture

graph LR
  subgraph "Ingestion"
    I1[Batch Sources]
    I2[Streaming Sources]
    I3[API Sources]
  end
  subgraph "Bronze Layer - Raw"
    B1[Raw Data Lake<br/>Parquet/Delta]
    B2[Change Data Capture]
    B3[Audit Logs]
  end
  subgraph "Silver Layer - Curated"
    S1[Cleaned Data]
    S2[Deduplicated]
    S3[Standardized Schema]
  end
  subgraph "Gold Layer - Business"
    G1[Aggregated Features]
    G2[Business Metrics]
    G3[ML-Ready Datasets]
  end
  subgraph "Serving Layer"
    SV1[Analytics Warehouse]
    SV2[Feature Store]
    SV3[Vector Store]
  end
  I1 & I2 & I3 --> B1 & B2 & B3
  B1 & B2 & B3 --> S1 & S2 & S3
  S1 & S2 & S3 --> G1 & G2 & G3
  G1 & G2 & G3 --> SV1 & SV2 & SV3
  style G3 fill:#bbf,stroke:#333,stroke-width:2px
  style SV2 fill:#f96,stroke:#333,stroke-width:2px
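
To make the Bronze/Silver/Gold flow concrete, here is a minimal PySpark sketch of a medallion pipeline. It assumes a Spark session configured with the Delta Lake extensions; the bucket paths, columns, and aggregations are illustrative, not prescriptive.

```python
# Minimal medallion (Bronze -> Silver -> Gold) sketch using PySpark + Delta Lake.
# Paths, column names, and the orders schema are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw events as-is, preserving source fidelity.
raw = spark.read.json("s3://example-lake/landing/orders/")
raw.write.format("delta").mode("append").save("s3://example-lake/bronze/orders")

# Silver: clean, deduplicate, and standardize the schema.
bronze = spark.read.format("delta").load("s3://example-lake/bronze/orders")
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .filter(F.col("order_total") >= 0)
    .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("s3://example-lake/silver/orders")

# Gold: aggregate into ML-ready features (e.g., spend and order count per customer).
gold = (
    silver
    .groupBy("customer_id")
    .agg(
        F.sum("order_total").alias("total_spend"),
        F.count("order_id").alias("order_count"),
    )
)
gold.write.format("delta").mode("overwrite").save("s3://example-lake/gold/customer_features")
```

The key design point is that each layer is written to open-format tables, so downstream consumers (warehouse, feature store, vector store) read curated data rather than raw landings.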

Feature Store Architecture

graph TB subgraph "Feature Definition" FD1[Feature Registry] FD2[Feature Metadata] FD3[Lineage Tracking] end subgraph "Offline Store" OS1[Historical Features<br/>Data Warehouse] OS2[Point-in-Time Correct] OS3[Training Set Generation] end subgraph "Online Store" ON1[Real-time Features<br/>Redis/DynamoDB] ON2[Low-latency Lookup] ON3[< 10ms Response] end subgraph "Feature Computation" FC1[Batch Pipeline<br/>Daily/Hourly] FC2[Stream Pipeline<br/>Real-time] FC3[On-Demand<br/>Request-time] end FD1 & FD2 & FD3 --> OS1 & ON1 FC1 --> OS1 & ON1 FC2 --> ON1 OS1 --> OS3 ON1 --> ON2 OS3 -.-> Training[Model Training] ON2 -.-> Inference[Model Inference] style ON1 fill:#f96,stroke:#333,stroke-width:2px style OS3 fill:#bbf,stroke:#333,stroke-width:2px

Technology Selection Matrix

Data Storage Comparison

| Technology | Best For | Strengths | Limitations | Cost | When to Use |
|---|---|---|---|---|---|
| Delta Lake | Lakehouse on cloud | ACID transactions, time travel, open format | Requires Spark | Low | Choose for data lake + warehouse hybrid |
| Snowflake | Cloud data warehouse | Performance, ease of use, separation of compute/storage | Can be expensive, some vendor lock-in | Medium-High | Choose for analytics-heavy workloads |
| BigQuery | GCP analytics | Serverless, ML integration, fast queries | GCP lock-in, unpredictable costs | Medium | Choose for GCP-native stack |
| Databricks | Unified analytics | End-to-end platform, notebook experience | Expensive, learning curve | High | Choose for full ML platform |
| PostgreSQL | Transactional data | Mature, reliable, extensions (pgvector) | Limited horizontal scaling | Low | Choose for operational databases |
| Redis | Real-time serving | Sub-millisecond latency, simple | Limited persistence, memory-bound | Medium | Choose for feature serving |
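
As a concrete illustration of the "feature serving" row above, here is a minimal redis-py sketch of writing and reading per-entity features; the key layout, feature names, and TTL are assumptions.

```python
# Minimal sketch of low-latency feature serving on Redis using redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Write path: a batch or streaming pipeline materializes fresh features per entity.
r.hset(
    "features:customer:123",
    mapping={"total_spend_7d": 412.50, "order_count_7d": 5},
)
r.expire("features:customer:123", 24 * 3600)  # TTL guards against stale features

# Read path: the serving tier does a single hash lookup, typically sub-millisecond.
features = r.hgetall("features:customer:123")
print(features)  # {'total_spend_7d': '412.5', 'order_count_7d': '5'}
```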

Model Serving Patterns

graph TB subgraph "Batch Serving" B1[Scheduled Job] --> B2[Load Full Dataset] B2 --> B3[Batch Inference] B3 --> B4[Write to Database/S3] end subgraph "Real-time Serving" R1[API Request] --> R2[Feature Lookup] R2 --> R3[Model Inference] R3 --> R4[Response < 100ms] end subgraph "Streaming Serving" S1[Event Stream] --> S2[Feature Enrichment] S2 --> S3[Model Inference] S3 --> S4[Sink to Target] end subgraph "Edge Serving" E1[Mobile/IoT Device] --> E2[Local Model] E2 --> E3[On-Device Inference] E3 --> E4[Offline Capable] end style R3 fill:#f96,stroke:#333,stroke-width:2px style S3 fill:#bbf,stroke:#333,stroke-width:2px

Orchestration Tool Comparison

| Tool | Best For | Strengths | Limitations | Cost | Decision Factor |
|---|---|---|---|---|---|
| Airflow | Complex DAGs, established teams | Mature, flexible, large ecosystem | Steep learning curve, maintenance overhead | Free (self-hosted) | Choose for complex workflows |
| Prefect | Modern Python workflows | Pythonic, great DX, hybrid execution | Smaller ecosystem, newer | Free + paid tiers | Choose for Python-first teams |
| Kubeflow Pipelines | Kubernetes-native ML | K8s integration, portable, containerized | Complex setup, K8s expertise required | Free (infrastructure only) | Choose for K8s environments |
| AWS Step Functions | AWS-native workflows | Managed, serverless, visual designer | AWS lock-in, expensive at scale | Pay-per-execution | Choose for AWS serverless |
| Databricks Workflows | Databricks users | Tight Spark integration, notebooks support | Databricks lock-in | Included with Databricks | Choose if using Databricks |
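
As a concrete example of the orchestration layer, here is a minimal Airflow 2.x DAG sketch using the TaskFlow API for a daily feature-build, train, and register flow. The DAG name, schedule, and task bodies are placeholders.

```python
# Minimal Airflow DAG sketch (TaskFlow API, Airflow 2.x).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["ml-platform"])
def churn_training_pipeline():
    @task
    def build_features() -> str:
        # e.g., run the Silver -> Gold transformation and return the dataset path
        return "s3://example-lake/gold/customer_features"

    @task
    def train_model(dataset_path: str) -> str:
        # e.g., launch a training job and return the resulting model URI
        return f"models:/churn/from/{dataset_path}"

    @task
    def register_model(model_uri: str) -> None:
        # e.g., promote the model in the registry if evaluation passes
        print(f"registered {model_uri}")

    register_model(train_model(build_features()))

churn_training_pipeline()
```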

ML Training Infrastructure

Distributed Training Architecture

graph TB subgraph "Training Orchestrator" TO1[Experiment Tracking] TO2[Hyperparameter Tuning] TO3[Resource Allocation] end subgraph "Compute Cluster" C1[GPU Node 1<br/>8x A100] C2[GPU Node 2<br/>8x A100] C3[GPU Node 3<br/>8x A100] C4[GPU Node 4<br/>8x A100] end subgraph "Data Pipeline" DP1[Data Loader] DP2[Preprocessing] DP3[Distributed Cache] end subgraph "Model Registry" MR1[Checkpoints] MR2[Versioned Models] MR3[Metadata] end TO1 & TO2 & TO3 --> C1 & C2 & C3 & C4 DP1 --> DP2 --> DP3 DP3 --> C1 & C2 & C3 & C4 C1 & C2 & C3 & C4 --> MR1 & MR2 & MR3 style C1 fill:#f96,stroke:#333,stroke-width:2px style MR2 fill:#bbf,stroke:#333,stroke-width:2px

Compute Options Comparison

| Aspect | On-Premise | Cloud VMs | Kubernetes | Managed Services |
|---|---|---|---|---|
| Best For | Regulated industries, existing HPC | Full control, custom setup | Containerized, portable | Quick start, minimal ops |
| Pros | Data sovereignty, amortized cost | Flexible, spot instances | Scalable, cloud-agnostic | Fully managed, integrated |
| Cons | High capex, limited elasticity | Ops overhead, costly at scale | Complex setup, expertise required | Vendor lock-in, less control |
| Cost | High upfront, low variable | $1-10/GPU hour | $0.50-8/GPU hour + overhead | $2-15/GPU hour |
| Examples | On-prem GPU clusters | EC2, GCE, Azure VMs | EKS + Kubeflow, GKE + KFP | SageMaker, Vertex AI, AzureML |

Build vs. Buy Decision Framework

Component-by-Component Analysis

| Component | Build | Buy/Adopt Open Source | Managed Service | Recommendation |
|---|---|---|---|---|
| Data Ingestion | ⚠️ Complex | ✅ Airbyte, Fivetran | ✅ AWS Glue, Azure Data Factory | Buy or managed |
| Data Lake | ⚠️ Complex | ✅ Delta Lake, Apache Iceberg | ✅ S3 + Glue, Azure Data Lake | Open source + cloud storage |
| Data Warehouse | ❌ Don't build | ⚠️ DuckDB, ClickHouse | ✅ Snowflake, BigQuery, Redshift | Managed service |
| Feature Store | ⚠️ Consider for unique needs | ✅ Feast | ✅ Tecton, SageMaker Feature Store | Open source first |
| Experiment Tracking | ❌ Don't build | ✅ MLflow, W&B | ✅ SageMaker Experiments | Open source (MLflow) |
| Model Training | ✅ Custom training code | ✅ Frameworks (PyTorch, TensorFlow) | ⚠️ Managed training | Build on frameworks |
| Model Serving | ⚠️ Consider for unique requirements | ✅ TorchServe, TF Serving, Triton | ✅ SageMaker, Vertex AI | Open source or managed |
| Orchestration | ❌ Don't build | ✅ Airflow, Prefect, Kubeflow | ⚠️ Step Functions | Open source |
| Monitoring | ❌ Don't build | ✅ Prometheus + Grafana | ✅ Datadog, New Relic | Open source or managed |

✅ = Recommended | ⚠️ = Situational | ❌ = Avoid
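
To make the open-source-first recommendation for experiment tracking and the model registry concrete, here is a minimal MLflow sketch. The tracking URI, experiment name, and registered model name are assumptions, and exact log_model signatures vary slightly across MLflow versions.

```python
# Minimal MLflow sketch: log parameters/metrics and register a model.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # assumed tracking server
mlflow.set_experiment("churn-model")

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the artifact and register it so serving can pull the model by name.
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")
```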

Real-World Case Study: E-commerce ML Platform

Challenge

A growing e-commerce company had five data science teams building 50+ models with inconsistent tooling: there was no standard deployment process, models regularly broke in production, and deployment cycles ran six weeks.

Solution Architecture

The company built a centralized ML platform over six months:

Data Layer:

  • S3 data lake (Bronze/Silver/Gold layers)
  • Snowflake warehouse for analytics
  • Feast feature store for online/offline features

Compute:

  • EKS cluster for distributed training
  • SageMaker for model serving
  • Spot instances for 70% cost savings

Orchestration:

  • Airflow for data pipelines
  • Argo Workflows for ML pipelines
  • Event-driven triggers via EventBridge

Observability:

  • Datadog for infrastructure metrics
  • Custom model monitoring with Grafana
  • PagerDuty for alerting

Implementation Timeline

| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Foundation | Months 1-2 | VPCs, EKS, data lake, warehouse | Core infrastructure |
| ML Services | Months 3-4 | Feature store, experiment tracking, model registry | ML platform services |
| Deployment | Month 5 | Model serving, CI/CD pipelines | Deployment automation |
| Enablement | Month 6 | Documentation, migration support, training | Team onboarding |

Results After 12 Months

| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment time | 6 weeks | 2 days | 95% reduction |
| Model deployment success rate | 60% | 95% | +58% |
| Infrastructure costs | $450K/year | $315K/year | 30% reduction |
| Data science productivity | Baseline | +40% | Less time on infrastructure |
| Platform adoption | N/A | 100% of teams | Full migration |
| Models in production | 50 | 200 | 4x growth |
| Incident rate | 12/month | 2/month | 83% reduction |

Key Success Factors

  1. Executive Sponsorship: Dedicated budget and platform team
  2. Golden Paths: Reference implementations for common patterns
  3. Migration Support: Office hours and hands-on help
  4. Iterative Rollout: Start with one team, expand gradually
  5. Clear Metrics: Track adoption, performance, cost
  6. Documentation: Comprehensive guides and examples

Implementation Checklist

Foundation Phase (Week 1-4)

□ Define reference architecture diagram
□ Select core technologies (lakehouse, warehouse, orchestration)
□ Establish infrastructure-as-code practices (Terraform, CDK; see the sketch after this checklist)
□ Set up networking and security (VPCs, security groups, IAM)
□ Deploy monitoring and logging infrastructure
□ Create developer environments (sandbox, dev, staging, prod)
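
To illustrate the infrastructure-as-code item above, here is a minimal AWS CDK (v2, Python) sketch that defines a versioned, encrypted S3 bucket for the raw data lake. Stack and bucket names are placeholders, and an equivalent Terraform module would serve the same purpose.

```python
# Minimal AWS CDK v2 sketch (Python): raw data lake bucket as code.
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Versioned, encrypted, private-by-default bucket for Bronze-layer data.
        s3.Bucket(
            self,
            "RawDataLakeBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
DataLakeStack(app, "DataLakeStack")
app.synth()
```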

Platform Services (Week 5-12)

□ Deploy data lake/lakehouse with governance
□ Set up data warehouse and access patterns
□ Implement feature store (if needed)
□ Deploy experiment tracking system (MLflow)
□ Set up model registry
□ Configure CI/CD pipelines for ML workflows
□ Implement model serving infrastructure
□ Deploy orchestration platform (Airflow)

Developer Experience (Week 13-16)

□ Create golden path templates and examples
□ Write platform documentation and runbooks
□ Set up developer onboarding guides
□ Create self-service workflows (request access, deploy model)
□ Implement platform SDK/CLI
□ Establish office hours and support channels

Governance & Operations (Ongoing)

□ Define cost allocation and chargeback
□ Implement usage quotas and rate limits
□ Set up audit logging and compliance reporting
□ Create disaster recovery and backup procedures
□ Establish SLAs for platform services
□ Schedule regular capacity planning reviews

Best Practices

  1. Start Simple, Evolve: Begin with proven tools, add complexity only when needed
  2. Design for Portability: Avoid vendor lock-in where possible; use open formats and standards
  3. Automate Everything: Infrastructure, deployments, testing, monitoring
  4. Measure Platform Adoption: Track usage metrics, gather feedback, iterate
  5. Invest in Developer Experience: Platform is only valuable if teams use it
  6. Plan for Scale: Design for 10x current scale, implement for current + 3x
  7. Security from Day One: Easier to build in than bolt on later
  8. Document Decisions: Maintain Architecture Decision Records (ADRs)

Common Pitfalls

  1. Over-Engineering: Building for Google-scale when you have 10 models
  2. Under-Engineering: Not planning for growth, hitting scalability walls
  3. Tool Sprawl: Too many overlapping tools, fragmented ecosystem
  4. Neglecting DX: Powerful platform nobody can use
  5. Ignoring Costs: Building without cost constraints, surprise bills later
  6. Tight Coupling: Hard dependencies make changes difficult
  7. No Governance: Platform without guard rails leads to chaos
  8. Premature Optimization: Optimizing before understanding bottlenecks