Chapter 54 — Cloud AI Deployment (AWS, Azure, GCP)
Overview
Provision landing zones and managed services with cost and compliance governance. Modern AI workloads require careful cloud architecture that balances flexibility, cost, security, and performance across compute-intensive training, real-time inference, and data-intensive operations. This chapter provides practical guidance for deploying AI systems on AWS, Azure, and GCP with enterprise-grade governance.
Why It Matters
Consistent landing zones and policies reduce time to value and keep costs and risks under control across clouds. Organizations with well-designed cloud AI platforms achieve:
- 70-85% faster project setup through reusable infrastructure templates
- 30-50% cost savings via rightsizing, spot instances, and budget controls
- Improved security posture with baseline security controls and compliance frameworks
- Multi-cloud optionality reducing vendor lock-in risk
- Developer productivity gains through self-service infrastructure
- Consistent governance across teams and environments
Poor cloud architecture leads to cost overruns (often 3-5x budget), security incidents, compliance violations, and fragmented tooling that slows development.
Cloud Provider Comparison
| Capability | AWS | Azure | GCP | Notes |
|---|---|---|---|---|
| Managed LLM Inference | Bedrock | Azure OpenAI | Vertex AI | Azure alone offers first-party OpenAI models (GPT-4) |
| Custom Model Training | SageMaker | Azure ML | Vertex AI | All offer distributed training |
| GPU Instances | P4/P5 instances | NC/ND series | A2/A3 instances | GCP typically 10-20% cheaper |
| Serverless Compute | Lambda | Functions | Cloud Functions | AWS has most mature ecosystem |
| Vector Database | OpenSearch, RDS pgvector | Cosmos DB, Azure AI Search | Vertex AI Vector Search | Each cloud's managed Postgres supports pgvector |
| Data Warehouse | Redshift | Synapse | BigQuery | BigQuery best for analytics |
| Object Storage | S3 | Blob Storage | Cloud Storage | Similar pricing, S3 most mature |
| Identity/IAM | IAM + Cognito | Entra ID (Azure AD) + managed identity | IAM + Workload Identity | Entra ID offers the deepest enterprise directory integration |
| Networking | VPC | VNet | VPC | Similar capabilities |
| Monitoring | CloudWatch | Azure Monitor | Cloud Monitoring | GCP has best logging/tracing |
| IaC Support | Native (CloudFormation) + Terraform | Native (ARM/Bicep) + Terraform | Native (Deployment Manager) + Terraform | Terraform most portable |
| Pricing Model | Pay-per-use | Pay-per-use + reservations | Pay-per-use + sustained discounts | GCP auto-discounts, AWS requires commitment |
Landing Zone Architecture
Core Components
```mermaid
graph TB
  subgraph Mgmt["Management Account/Subscription"]
    MA[Root Account]
    SCPs[Service Control Policies]
    Billing[Consolidated Billing]
    Audit["Audit & Compliance"]
  end
  subgraph Sec["Security Account"]
    SIEM["SIEM/Log Aggregation"]
    SecHub[Security Hub]
    KMS[Key Management]
    Secrets[Secrets Manager]
  end
  subgraph Shared["Shared Services"]
    DNS[Private DNS]
    Registry[Container Registry]
    Artifacts[Artifact Repository]
    VPN["VPN/DirectConnect"]
  end
  subgraph Net["Network"]
    Transit["Transit Gateway/Hub VNet"]
    Firewall[Network Firewall]
    NAT[NAT Gateway]
    VPC1["Prod VPC/VNet"]
    VPC2["Dev VPC/VNet"]
  end
  subgraph AIW["AI Workload Account"]
    Compute[GPU Compute]
    Storage[Data Lake]
    VectorDB[Vector Database]
    Serving[Model Serving]
    Training[Training Pipeline]
  end
  subgraph Data["Data Account"]
    DW[Data Warehouse]
    ETL[ETL Pipelines]
    Catalog[Data Catalog]
  end
  MA --> SCPs
  MA --> Billing
  MA --> Audit
  Audit --> SIEM
  KMS --> AIW
  Secrets --> AIW
  Transit --> VPC1
  Transit --> VPC2
  VPC1 --> Firewall
  VPC2 --> Firewall
  VPC1 --> Compute
  VPC1 --> Storage
  VPC1 --> VectorDB
  VPC1 --> Serving
  VPC1 --> Training
  Storage --> DW
  DW --> ETL
  ETL --> Catalog
```
Landing Zone Baseline (Terraform Example)
# Multi-cloud landing zone baseline (AWS example - adapt for Azure/GCP)
terraform {
required_providers { aws = { source = "hashicorp/aws", version = "~> 5.0" } }
backend "s3" {
bucket = "my-org-terraform-state"
key = "landing-zone/terraform.tfstate"
region = "us-east-1"
encrypt = true
}
}
# Identity baseline - IAM roles for AI workloads
module "identity" {
source = "./modules/identity"
roles = {
ai_data_scientist = {
trusted_entities = ["sagemaker.amazonaws.com"]
policies = ["arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"]
}
ai_ml_ops = {
trusted_entities = ["eks.amazonaws.com"]
policies = [module.custom_policies.model_deployment_policy_arn]
}
}
}
# Network baseline - VPC with private/public subnets
module "network" {
source = "./modules/network"
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
enable_flow_logs = true
tags = { Environment = "production", CostCenter = "ai-platform" }
}
# Security baseline - KMS encryption, secrets management
module "security" {
source = "./modules/security"
kms_keys = {
ai_data = { description = "Encryption for AI data", key_users = [module.identity.ai_data_scientist_role_arn] }
}
secrets = {
openai_api_key = { description = "OpenAI API key", value = var.openai_api_key }
}
enable_security_hub = true
}
# Monitoring baseline - CloudWatch logs and alarms
module "monitoring" {
source = "./modules/monitoring"
log_groups = {
"/ai/training" = { retention_days = 30 }
"/ai/inference" = { retention_days = 90 }
}
alarms = {
high_latency = { metric = "InferenceLatency", threshold = 1000 }
low_gpu = { metric = "GPUUtilization", threshold = 20 }
}
}
# Cost management - budgets and alerts
module "cost_management" {
source = "./modules/cost_management"
budgets = {
ai_monthly = {
limit = "50000"
alerts = [
{ threshold = 80, subscribers = ["finance@company.com"] },
{ threshold = 100, subscribers = ["finance@company.com", "cto@company.com"] }
]
}
}
required_tags = ["Environment", "Project", "CostCenter"]
}
# Governance policies - SCPs for security
module "governance" {
source = "./modules/governance"
policies = {
deny_unencrypted_storage = {
description = "Deny unencrypted S3/EBS"
# Policy document enforces encryption
}
restrict_regions = {
description = "Restrict to approved regions"
allowed_regions = ["us-east-1", "us-west-2", "eu-west-1"]
}
}
}
Compute Options for AI Workloads
GPU Instance Selection
| Instance Type | Cloud | GPUs | GPU Memory | vCPUs | System RAM | Best For | Approx. Cost/hr |
|---|---|---|---|---|---|---|---|
| p4d.24xlarge | AWS | 8x A100 | 40GB each | 96 | 1152GB | Large model training | $32.77 |
| p5.48xlarge | AWS | 8x H100 | 80GB each | 192 | 2048GB | Frontier models, massive training | $98.32 |
| g5.xlarge | AWS | 1x A10G | 24GB | 4 | 16GB | Inference, fine-tuning | $1.006 |
| Standard_NC24ads_A100_v4 | Azure | 1x A100 | 80GB | 24 | 220GB | Medium training jobs | $3.67 |
| Standard_ND96asr_v4 | Azure | 8x A100 | 40GB each | 96 | 900GB | Distributed training | $27.20 |
| a2-highgpu-1g | GCP | 1x A100 | 40GB | 12 | 85GB | Inference, small training | $3.67 |
| a2-ultragpu-8g | GCP | 8x A100 | 80GB each | 96 | 1360GB | Large-scale training | $33.60 |
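Rates above are approximate on-demand prices in a US region and drift constantly; verify against current pricing before budgeting. As a quick way to compare options, a back-of-envelope sketch (hourly rates taken from the table; the relative throughput figures are illustrative assumptions, not benchmarks):
```python
# Back-of-envelope GPU training cost comparison (illustrative sketch;
# rates and relative throughputs are assumptions, not quotes).
HOURLY_RATE = {              # approximate on-demand $/hr from the table above
    "p4d.24xlarge": 32.77,   # 8x A100 40GB
    "p5.48xlarge": 98.32,    # 8x H100 80GB
    "a2-ultragpu-8g": 33.60, # 8x A100 80GB
}
# Assumed training throughput relative to p4d.24xlarge = 1.0 (placeholder numbers).
RELATIVE_THROUGHPUT = {"p4d.24xlarge": 1.0, "p5.48xlarge": 2.5, "a2-ultragpu-8g": 1.2}

def training_cost(instance: str, p4d_hours: float, spot_discount: float = 0.0) -> float:
    """Cost of a job that would take `p4d_hours` on p4d.24xlarge."""
    hours = p4d_hours / RELATIVE_THROUGHPUT[instance]
    return hours * HOURLY_RATE[instance] * (1 - spot_discount)

for inst in HOURLY_RATE:
    print(f"{inst}: ${training_cost(inst, 100):,.0f} on-demand, "
          f"${training_cost(inst, 100, spot_discount=0.6):,.0f} at a 60% spot discount")
```
The takeaway is that the cheapest hourly rate rarely means the cheapest job; throughput per dollar decides.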
Serverless vs. Dedicated Compute
```mermaid
flowchart TD
  A[AI Workload] --> B{Traffic Pattern}
  B -->|Unpredictable, bursty| C[Serverless]
  B -->|Steady, predictable| D[Dedicated Instances]
  C --> C1{Latency Requirement}
  C1 -->|"<100ms"| C2[Lambda/Functions with provisioned concurrency]
  C1 -->|"100ms-1s"| C3[Standard Lambda/Functions]
  C1 -->|">1s"| C4[Batch jobs - AWS Batch/Azure Batch]
  D --> D1{Scale}
  D1 -->|"Small <10 req/s"| D2[Single instance + autoscaling]
  D1 -->|"Medium 10-1000 req/s"| D3[Container orchestration - EKS/AKS/GKE]
  D1 -->|"Large >1000 req/s"| D4[Multi-region with load balancing]
  C2 --> E["Cold start ~100ms, costly"]
  C3 --> F["Cold start ~1s, cost-effective"]
  C4 --> G["No cold start concern, batch pricing"]
  D2 --> H["Simple, less overhead"]
  D3 --> I[Complex but scalable]
  D4 --> J[Complex + expensive but resilient]
```
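For the serverless branch, a minimal sketch of an AWS Lambda handler that fronts a SageMaker endpoint; the endpoint name and payload shape are placeholders, and API Gateway proxy integration is assumed:
```python
# Minimal Lambda handler for bursty inference traffic (sketch).
# Assumes a SageMaker endpoint already exists; the endpoint name and
# payload schema below are illustrative placeholders.
import json
import os
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "fraud-detection-v3")

def handler(event, context):
    # API Gateway proxy integration passes the request body as a JSON string
    payload = json.loads(event["body"])
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```
With provisioned concurrency on the function, cold starts land near the <100ms branch while the model itself stays on an always-warm endpoint.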
Model Serving Architecture
# Kubernetes deployment for model serving (EKS/AKS/GKE)
apiVersion: apps/v1
kind: Deployment
metadata:
name: ai-model-server
namespace: ml-production
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate: { maxSurge: 1, maxUnavailable: 0 } # Zero-downtime
  selector:
    matchLabels: { app: ai-model-server } # version stays in template labels, out of the immutable selector
template:
metadata:
labels: { app: ai-model-server, version: v2.3.1 }
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
spec:
# Node affinity for GPU nodes
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values: [g5.xlarge, g5.2xlarge]
# Init container downloads model from S3
initContainers:
- name: model-downloader
image: amazon/aws-cli:latest
command: ["/bin/sh", "-c", "aws s3 cp s3://my-models-bucket/model-v2.3.1/ /models/ --recursive"]
volumeMounts: [{ name: model-storage, mountPath: /models }]
containers:
- name: model-server
image: myregistry.azurecr.io/model-server:v2.3.1
ports: [{ containerPort: 8000, name: http }]
env:
- { name: MODEL_PATH, value: /models }
- { name: MAX_BATCH_SIZE, value: "32" }
resources:
requests: { memory: "16Gi", cpu: "4", nvidia.com/gpu: 1 }
limits: { memory: "16Gi", cpu: "8", nvidia.com/gpu: 1 }
livenessProbe:
httpGet: { path: /health, port: 8000 }
initialDelaySeconds: 60
readinessProbe:
httpGet: { path: /ready, port: 8000 }
initialDelaySeconds: 30
volumeMounts: [{ name: model-storage, mountPath: /models, readOnly: true }]
volumes: [{ name: model-storage, emptyDir: { sizeLimit: 50Gi } }]
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ai-model-server-hpa
spec:
scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: ai-model-server }
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
- type: Pods
pods: { metric: { name: gpu_utilization }, target: { type: AverageValue, averageValue: "70" } }
behavior:
scaleUp: { stabilizationWindowSeconds: 60, policies: [{ type: Percent, value: 50, periodSeconds: 60 }] }
scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: Pods, value: 1, periodSeconds: 60 }] }
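The Deployment above assumes the container serves /health and /ready on port 8000 and reads MODEL_PATH and MAX_BATCH_SIZE from the environment. A minimal FastAPI sketch that satisfies that contract (the loader and predict call are placeholders for your framework):
```python
# Minimal model server matching the Deployment's contract (sketch).
# Serves /health and /ready on port 8000; reads MODEL_PATH and
# MAX_BATCH_SIZE from the environment, as the manifest assumes.
import os
from fastapi import FastAPI, HTTPException

app = FastAPI()
MODEL_PATH = os.environ.get("MODEL_PATH", "/models")
MAX_BATCH_SIZE = int(os.environ.get("MAX_BATCH_SIZE", "32"))
model = None  # loaded at startup so /health can answer before weights arrive

def load_from_disk(path):
    """Placeholder loader; swap in your framework's deserialization."""
    class DummyModel:
        def predict(self, batch):
            return [0.0 for _ in batch]
    return DummyModel()

@app.on_event("startup")
def load_model():
    global model
    model = load_from_disk(MODEL_PATH)

@app.get("/health")
def health():  # liveness probe: the process is up
    return {"status": "ok"}

@app.get("/ready")
def ready():  # readiness probe: 503 until weights are loaded
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"ready": True}

@app.post("/predict")
def predict(inputs: list[dict]):
    batch = inputs[:MAX_BATCH_SIZE]  # cap batch size per the env contract
    return {"predictions": model.predict(batch)}
```
Run it with `uvicorn server:app --host 0.0.0.0 --port 8000`; returning 503 from /ready until the weights load is what keeps the pod out of the Service during startup.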
Data Storage Options
Storage Comparison
| Storage Type | AWS | Azure | GCP | Use Case | Cost (GB/month) |
|---|---|---|---|---|---|
| Object Storage | S3 Standard | Blob Hot | Standard | Training data, models | $0.023 |
| Object Storage (Infrequent) | S3 IA | Blob Cool | Nearline | Archived models | $0.0125 |
| Object Storage (Archive) | S3 Glacier | Blob Archive | Archive | Long-term retention | $0.004 |
| Block Storage (SSD) | EBS gp3 | Premium SSD | Persistent SSD | VM attached storage | $0.08 |
| Block Storage (HDD) | EBS st1 | Standard HDD | Standard | Large sequential data | $0.045 |
| File Storage | EFS | Azure Files | Filestore | Shared ML workspaces | $0.30 |
| Vector Database | OpenSearch | Cosmos DB | Vertex AI Vector Search | Embeddings search | Variable |
| Data Warehouse | Redshift | Synapse | BigQuery | Analytics | Pay-per-query |
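The object-storage tiers become real savings only with lifecycle rules. A sketch that moves aging training data from Standard to infrequent access and then archive (bucket name, prefix, and day thresholds are placeholders):
```python
# Apply an S3 lifecycle rule that tiers aging training data (sketch;
# bucket name, prefix, and thresholds are illustrative placeholders).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-data",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # ~$0.0125/GB-month
                {"Days": 180, "StorageClass": "GLACIER"},     # ~$0.004/GB-month
            ],
        }]
    },
)
```
Blob Storage and Cloud Storage expose equivalent lifecycle management on the container/bucket.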
Vector Database Deployment
# Deploy Qdrant (open-source vector DB) on Kubernetes
# Alternative: Use managed services like Pinecone, Weaviate Cloud
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: qdrant, namespace: vector-db }
spec:
serviceName: qdrant
replicas: 3
selector: { matchLabels: { app: qdrant } }
template:
metadata: { labels: { app: qdrant } }
spec:
containers:
- name: qdrant
image: qdrant/qdrant:v1.7.0
ports: [{ containerPort: 6333, name: http }, { containerPort: 6334, name: grpc }]
resources:
requests: { memory: "8Gi", cpu: "2" }
limits: { memory: "16Gi", cpu: "4" }
volumeMounts: [{ name: qdrant-storage, mountPath: /qdrant/storage }]
volumeClaimTemplates:
- metadata: { name: qdrant-storage }
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources: { requests: { storage: 100Gi } }
# Python client for Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)
# Connect to Qdrant cluster
client = QdrantClient(host="qdrant.vector-db.svc.cluster.local", port=6333)
# Create collection
client.create_collection(
collection_name="document_embeddings",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE) # OpenAI embedding size
)
# Upsert vectors
client.upsert(
collection_name="document_embeddings",
points=[PointStruct(id=doc_id, vector=embedding, payload={"text": text, "metadata": metadata})
for doc_id, embedding, text, metadata in data]
)
# Search (query_embedding is a 1536-dim vector from your embedding model)
results = client.search(
    collection_name="document_embeddings",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="technical"))])
)
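Filtered searches like the category match above stay fast only if the payload field is indexed; a one-step sketch using the same collection:
```python
# Index the payload field used in the filter so filtered search stays fast.
from qdrant_client.models import PayloadSchemaType

client.create_payload_index(
    collection_name="document_embeddings",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)
```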
Cost Optimization Strategies
Cost Optimization Techniques
| Technique | Savings Potential | Implementation Complexity | Best For |
|---|---|---|---|
| Spot/Preemptible Instances | 60-90% | Medium | Fault-tolerant training |
| Reserved Instances | 40-60% | Low | Predictable workloads |
| Savings Plans | 30-50% | Low | Flexible compute usage |
| Right-Sizing | 20-40% | Medium | Over-provisioned resources |
| Auto-Scaling | 30-50% | Medium-High | Variable load |
| S3 Intelligent-Tiering | 20-70% | Low | Mixed access patterns |
| Compression | 50-80% (storage) | Low | Large datasets |
| Model Quantization | 2-4x (compute) | High | Inference workloads |
| Batch Inference | 40-60% | Medium | Non-real-time predictions |
| Regional Selection | 10-30% | Low | Flexible location |
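As a concrete instance of the spot row, SageMaker exposes managed spot capacity directly on the training-job API. A sketch, with the role ARN, image URI, and S3 paths as placeholders; the checkpoint config is what lets interrupted jobs resume:
```python
# Managed spot training on SageMaker (sketch; ARNs, image URI, and S3
# paths are placeholders). Spot capacity can cut training cost 60-90%.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="fraud-detection-v3-spot",
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.us-east-1.amazonaws.com/train:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/ai-data-scientist",
    OutputDataConfig={"S3OutputPath": "s3://my-models-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.g5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 100,
    },
    EnableManagedSpotTraining=True,        # bid on spot capacity
    StoppingCondition={
        "MaxRuntimeInSeconds": 3600 * 8,   # compute budget
        "MaxWaitTimeInSeconds": 3600 * 12, # runtime plus time spent waiting for capacity
    },
    CheckpointConfig={"S3Uri": "s3://my-models-bucket/checkpoints/"},  # resume after interruption
)
```
GCP preemptible/spot VMs and Azure spot node pools follow the same pattern: checkpoint often, tolerate restarts.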
Cost Monitoring & Alerts
# AWS Cost anomaly detection with Lambda
import boto3
from datetime import datetime, timedelta
ce_client = boto3.client('ce') # Cost Explorer
sns_client = boto3.client('sns')
def detect_cost_anomalies(event, context):
"""Detect unusual spending patterns"""
end_date = datetime.now().date()
start_date = end_date - timedelta(days=7)
response = ce_client.get_cost_and_usage(
TimePeriod={'Start': start_date.isoformat(), 'End': end_date.isoformat()},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}, {'Type': 'TAG', 'Key': 'Project'}]
)
# Analyze for anomalies
anomalies = []
for result in response['ResultsByTime']:
for group in result['Groups']:
service, project = group['Keys']
cost = float(group['Metrics']['UnblendedCost']['Amount'])
            historical_avg = get_historical_average(service, project, days=30)  # user-supplied baseline helper
            # Flag if cost > 2x historical average and > $100 (skip zero/new baselines)
            if historical_avg > 0 and cost > 2 * historical_avg and cost > 100:
anomalies.append({
'service': service, 'project': project, 'cost': cost,
'historical_avg': historical_avg,
'deviation': (cost / historical_avg - 1) * 100
})
# Send alerts if anomalies detected
if anomalies:
message = "Cost Anomalies:\n" + "\n".join([
f"• {a['service']} ({a['project']}): ${a['cost']:.2f} (avg: ${a['historical_avg']:.2f}, +{a['deviation']:.1f}%)"
for a in anomalies
])
sns_client.publish(TopicArn='arn:aws:sns:us-east-1:ACCOUNT:cost-alerts',
Subject='Cloud Cost Anomaly Detected', Message=message)
return {'statusCode': 200, 'anomalies': anomalies}
Security & Compliance
Security Baseline Checklist
- Encryption at Rest: All storage encrypted with KMS/customer-managed keys
- Encryption in Transit: TLS 1.2+ for all data transfer
- Identity Management: SSO integrated with corporate IdP
- Least Privilege: IAM roles/policies follow principle of least privilege
- Network Segmentation: Private subnets for AI workloads, no public IPs
- Secrets Management: API keys, credentials in Secrets Manager/Key Vault
- Logging: CloudTrail/Activity Log enabled for all API calls
- Monitoring: Real-time alerts for security events
- Vulnerability Scanning: Container images scanned pre-deployment
- Compliance: SOC 2, HIPAA, GDPR controls enabled where applicable
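A read-only sketch that audits the first checklist item, flagging S3 buckets with no default encryption configured (note that buckets created after early 2023 receive SSE-S3 by default):
```python
# Audit S3 default encryption (first checklist item). Read-only sketch.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        enc = s3.get_bucket_encryption(Bucket=name)
        rules = enc["ServerSideEncryptionConfiguration"]["Rules"]
        algo = rules[0]["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"]
        print(f"OK   {name}: {algo}")
    except ClientError as e:
        if e.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            print(f"FAIL {name}: no default encryption")
        else:
            raise
```
The same pattern (enumerate, check, report) extends to EBS volumes, public IPs, and untagged resources, and pairs naturally with the governance SCPs in the landing-zone baseline.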
Compliance Frameworks
| Framework | AWS Service | Azure Service | GCP Service | Use Case |
|---|---|---|---|---|
| SOC 2 | AWS Audit Manager | Azure Compliance Manager | Compliance Reports Manager | SaaS vendors |
| HIPAA | HIPAA eligible services | Azure Health Data Services | HIPAA compliance | Healthcare data |
| GDPR | Data residency controls | Data residency + Privacy | Data residency | EU personal data |
| PCI DSS | PCI DSS compliance | PCI DSS Level 1 | PCI DSS Level 1 | Payment data |
| FedRAMP | GovCloud | Azure Government | Assured Workloads | US government |
Case Study: Multi-Cloud AI Platform
Background
A fintech company needed to deploy AI fraud detection models across AWS (primary) and Azure (DR) under strict compliance requirements (PCI DSS, SOC 2).
Requirements
- <50ms inference latency for fraud detection
- 99.99% availability
- Multi-region deployment (US, EU)
- Cost target: $150K/month
- PCI DSS compliant infrastructure
Implementation
Infrastructure Stack:
- Compute: EKS (AWS), AKS (Azure) with GPU node pools (g5.xlarge equivalent)
- Storage: S3 (AWS), Blob Storage (Azure) with encryption
- Vector DB: Managed Pinecone (multi-cloud)
- Monitoring: Datadog (unified monitoring)
- IaC: Terraform for both clouds
Architecture Highlights:
# Terraform module for model serving (portable across clouds)
module "model_serving" {
source = "./modules/model-serving"
cloud_provider = "aws" # or "azure"
region = "us-east-1"
cluster_config = {
node_instance_type = "g5.xlarge"
min_nodes = 3
max_nodes = 20
gpu_per_node = 1
}
model_config = {
model_path = "s3://models/fraud-detection-v3/"
replicas = 5
max_batch_size = 32
max_latency_ms = 50
autoscaling_metric = "request_queue_depth"
}
security_config = {
enable_encryption = true
kms_key_id = module.kms.key_id
network_policy = "private"
enable_audit_log = true
}
tags = {
Compliance = "PCI-DSS"
Environment = "production"
CostCenter = "fraud-prevention"
}
}
Results
| Metric | Target | Achieved |
|---|---|---|
| P95 Latency | <50ms | 43ms |
| Availability | 99.99% | 99.995% |
| Monthly Cost | $150K | $137K |
| Setup Time | 8 weeks | 3 weeks (with IaC) |
| Compliance Audits | Pass PCI DSS | Passed |
Cost Breakdown:
- Compute (GPU instances): $78K (57%)
- Data storage (S3/Blob): $12K (9%)
- Vector DB (Pinecone): $28K (20%)
- Data transfer: $8K (6%)
- Monitoring/Logging: $11K (8%)
Lessons Learned:
- Spot instances for training: Saved $32K/month on training jobs
- Regional data transfer costs: Implementing caching reduced egress by 40%
- Right-sizing: Initial GPU instances were over-provisioned; rightsizing saved $19K/month
- Terraform abstractions: Cloud-agnostic modules enabled Azure deployment in 2 weeks
Implementation Checklist
Planning
- Define multi-cloud strategy (single cloud vs. multi-cloud)
- Select regions based on data residency and latency requirements
- Determine compliance frameworks needed
- Establish cost budgets per environment
- Design network topology and security zones
Landing Zone Setup
- Create organizational structure (accounts/subscriptions/projects)
- Configure identity and SSO
- Set up network baseline (VPCs, subnets, routing)
- Deploy security baseline (KMS, secrets, audit logs)
- Implement cost management (budgets, alerts, tagging)
- Configure monitoring and logging infrastructure
AI-Specific Infrastructure
- Provision GPU compute clusters (EKS/AKS/GKE)
- Deploy vector database (managed or self-hosted)
- Set up model registry and artifact storage
- Configure autoscaling policies
- Implement CI/CD pipelines for ML
- Deploy model serving infrastructure
Security & Compliance
- Enable encryption at rest and in transit
- Configure network security groups/firewalls
- Set up vulnerability scanning for containers
- Implement secrets rotation
- Enable compliance monitoring (Security Hub, etc.)
- Document data flows for audit
Operations
- Create operational runbooks
- Set up on-call rotation
- Implement backup and disaster recovery
- Configure cost dashboards and reports
- Establish SLO/SLA monitoring
- Schedule regular security audits
Best Practices
Do's
- Infrastructure as Code: Use Terraform/Pulumi for reproducibility
- Modular Design: Create reusable modules for common patterns
- Cost Tagging: Enforce tagging policies for cost attribution
- Right-Size Early: Start small, scale based on metrics
- Multi-Region: Deploy critical services across regions
- Automate Security: Use policy-as-code (OPA, Sentinel)
- Monitor Everything: Metrics, logs, traces, costs
Don'ts
- Don't Skip Testing: Load test before production
- Don't Ignore Costs: Set budgets and alerts from day 1
- Don't Deploy Manually: Automate deployment with CI/CD
- Don't Over-Provision: Use autoscaling instead
- Don't Store Secrets in Code: Use secrets managers (see the sketch after this list)
- Don't Neglect DR: Have tested backup/restore procedures
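For the secrets point above, a sketch of fetching at runtime rather than baking keys into code or images; the secret name mirrors the landing-zone Terraform example:
```python
# Fetch an API key at runtime instead of embedding it in code or images.
# Secret name mirrors the landing-zone example; cache to avoid per-request calls.
import boto3

_secrets = boto3.client("secretsmanager")
_cache: dict[str, str] = {}

def get_secret(name: str) -> str:
    if name not in _cache:
        _cache[name] = _secrets.get_secret_value(SecretId=name)["SecretString"]
    return _cache[name]

openai_api_key = get_secret("openai_api_key")
```
Azure Key Vault and GCP Secret Manager clients slot into the same wrapper, which keeps application code cloud-agnostic.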
Common Pitfalls
| Pitfall | Impact | Mitigation |
|---|---|---|
| Data Egress Costs | Unexpected 30-50% cost increase | Cache data, use CDNs, regionalize |
| GPU Idle Time | Wasted 40-60% of GPU budget | Autoscaling, spot instances, batch jobs |
| Network Bottlenecks | Slow data loading for training | Use instance storage, increase network bandwidth |
| Manual Provisioning | Weeks to deploy, configuration drift | IaC from day 1 |
| No Cost Visibility | Budget overruns by 3-5x | Tagging, dashboards, alerts |
| Single Region | Downtime during regional outages | Multi-region with failover |
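For the GPU idle pitfall, a sketch that flags under-utilized instances from CloudWatch. It assumes GPU utilization metrics are already being published (e.g., by the CloudWatch agent's nvidia_smi plugin or a DCGM exporter); the namespace and metric name below depend on that agent configuration and are assumptions:
```python
# Flag idle GPU instances from CloudWatch metrics (sketch).
# Assumes a GPU utilization metric is published; the "CWAgent" namespace
# and "nvidia_smi_utilization_gpu" name depend on your agent config.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

def idle_gpus(instance_ids, threshold=20.0, hours=24):
    now = datetime.now(timezone.utc)
    idle = []
    for iid in instance_ids:
        stats = cw.get_metric_statistics(
            Namespace="CWAgent",                      # assumption: agent default
            MetricName="nvidia_smi_utilization_gpu",  # assumption: nvidia_smi plugin
            Dimensions=[{"Name": "InstanceId", "Value": iid}],
            StartTime=now - timedelta(hours=hours),
            EndTime=now,
            Period=3600,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        if points and sum(p["Average"] for p in points) / len(points) < threshold:
            idle.append(iid)
    return idle
```
Feeding this list into a scale-down job or a ticket queue closes the loop on the 40-60% of GPU budget the table attributes to idle time.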
Deliverables
1. IaC Templates
- Landing zone baseline (Terraform/Pulumi)
- Compute modules (EKS/AKS/GKE, GPU node pools)
- Storage modules (S3, Vector DB)
- Networking modules (VPC, subnets, security groups)
- Monitoring modules (CloudWatch, Prometheus)
2. Reference Architectures
- Model training pipeline
- Real-time inference serving
- Batch inference processing
- RAG application stack
- Multi-region deployment
3. Operational Runbooks
- Deployment procedures
- Incident response
- Disaster recovery
- Cost optimization
- Security hardening
4. Cost Models
- Cost estimation templates
- Budget allocation by workload
- Optimization recommendations
- Chargeback/showback reports