Chapter 54 — Cloud AI Deployment (AWS, Azure, GCP)
Overview
Provision landing zones and managed services with cost and compliance governance. Modern AI workloads require careful cloud architecture that balances flexibility, cost, security, and performance across compute-intensive training, real-time inference, and data-intensive operations. This chapter provides practical guidance for deploying AI systems on AWS, Azure, and GCP with enterprise-grade governance.
Why It Matters
Consistent landing zones and policies reduce time to value and keep costs and risks under control across clouds. Organizations with well-designed cloud AI platforms achieve:
- 70-85% faster project setup through reusable infrastructure templates
- 30-50% cost savings via rightsizing, spot instances, and budget controls
- Improved security posture with baseline security controls and compliance frameworks
- Multi-cloud optionality reducing vendor lock-in risk
- Developer productivity gains through self-service infrastructure
- Consistent governance across teams and environments
Poor cloud architecture leads to cost overruns (often 3-5x budget), security incidents, compliance violations, and fragmented tooling that slows development.
Cloud Provider Comparison
| Capability | AWS | Azure | GCP | Notes |
|---|---|---|---|---|
| Managed LLM Inference | Bedrock | Azure OpenAI | Vertex AI | Azure alone offers first-party OpenAI models (GPT-4) |
| Custom Model Training | SageMaker | Azure ML | Vertex AI | All offer distributed training |
| GPU Instances | P4/P5 instances | NC/ND series | A2/A3 instances | GCP typically 10-20% cheaper |
| Serverless Compute | Lambda | Functions | Cloud Functions | AWS has most mature ecosystem |
| Vector Database | OpenSearch, RDS pgvector | Cosmos DB, Azure AI Search | Vertex AI Vector Search | Each cloud's managed Postgres supports pgvector |
| Data Warehouse | Redshift | Synapse | BigQuery | BigQuery best for analytics |
| Object Storage | S3 | Blob Storage | Cloud Storage | Similar pricing, S3 most mature |
| Identity/IAM | IAM + Cognito | Entra ID (Azure AD) + managed identity | IAM + Workload Identity | Entra ID offers the deepest enterprise directory integration |
| Networking | VPC | VNet | VPC | Similar capabilities |
| Monitoring | CloudWatch | Azure Monitor | Cloud Monitoring | GCP has best logging/tracing |
| IaC Support | Native (CloudFormation) + Terraform | Native (ARM/Bicep) + Terraform | Native (Deployment Manager) + Terraform | Terraform most portable |
| Pricing Model | Pay-per-use | Pay-per-use + reservations | Pay-per-use + sustained discounts | GCP auto-discounts, AWS requires commitment |
Landing Zone Architecture
Core Components
```mermaid
graph TB
  subgraph Mgmt["Management Account/Subscription"]
    MA[Root Account]
    SCPs[Service Control Policies]
    Billing[Consolidated Billing]
    Audit["Audit & Compliance"]
  end
  subgraph Sec["Security Account"]
    SIEM["SIEM/Log Aggregation"]
    SecHub[Security Hub]
    KMS[Key Management]
    Secrets[Secrets Manager]
  end
  subgraph Shared["Shared Services"]
    DNS[Private DNS]
    Registry[Container Registry]
    Artifacts[Artifact Repository]
    VPN["VPN/DirectConnect"]
  end
  subgraph Net["Network"]
    Transit["Transit Gateway/Hub VNet"]
    Firewall[Network Firewall]
    NAT[NAT Gateway]
    VPC1["Prod VPC/VNet"]
    VPC2["Dev VPC/VNet"]
  end
  subgraph AIW["AI Workload Account"]
    Compute[GPU Compute]
    Storage[Data Lake]
    VectorDB[Vector Database]
    Serving[Model Serving]
    Training[Training Pipeline]
  end
  subgraph Data["Data Account"]
    DW[Data Warehouse]
    ETL[ETL Pipelines]
    Catalog[Data Catalog]
  end
  MA --> SCPs
  MA --> Billing
  MA --> Audit
  Audit --> SIEM
  KMS --> AIW
  Secrets --> AIW
  Transit --> VPC1
  Transit --> VPC2
  VPC1 --> Firewall
  VPC2 --> Firewall
  VPC1 --> Compute
  VPC1 --> Storage
  VPC1 --> VectorDB
  VPC1 --> Serving
  VPC1 --> Training
  Storage --> DW
  DW --> ETL
  ETL --> Catalog
```
Landing Zone Baseline (Terraform Example)
# Multi-cloud landing zone baseline (AWS example - adapt for Azure/GCP)
terraform {
required_providers { aws = { source = "hashicorp/aws", version = "~> 5.0" } }
backend "s3" {
bucket = "my-org-terraform-state"
key = "landing-zone/terraform.tfstate"
region = "us-east-1"
encrypt = true
}
}
# Identity baseline - IAM roles for AI workloads
module "identity" {
source = "./modules/identity"
roles = {
ai_data_scientist = {
trusted_entities = ["sagemaker.amazonaws.com"]
policies = ["arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"]
}
ai_ml_ops = {
trusted_entities = ["eks.amazonaws.com"]
policies = [module.custom_policies.model_deployment_policy_arn]
}
}
}
# Network baseline - VPC with private/public subnets
module "network" {
source = "./modules/network"
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
enable_flow_logs = true
tags = { Environment = "production", CostCenter = "ai-platform" }
}
# Security baseline - KMS encryption, secrets management
module "security" {
source = "./modules/security"
kms_keys = {
ai_data = { description = "Encryption for AI data", key_users = [module.identity.ai_data_scientist_role_arn] }
}
secrets = {
openai_api_key = { description = "OpenAI API key", value = var.openai_api_key }
}
enable_security_hub = true
}
# Monitoring baseline - CloudWatch logs and alarms
module "monitoring" {
source = "./modules/monitoring"
log_groups = {
"/ai/training" = { retention_days = 30 }
"/ai/inference" = { retention_days = 90 }
}
alarms = {
high_latency = { metric = "InferenceLatency", threshold = 1000 }
low_gpu = { metric = "GPUUtilization", threshold = 20 }
}
}
# Cost management - budgets and alerts
module "cost_management" {
source = "./modules/cost_management"
budgets = {
ai_monthly = {
limit = "50000"
alerts = [
{ threshold = 80, subscribers = ["finance@company.com"] },
{ threshold = 100, subscribers = ["finance@company.com", "cto@company.com"] }
]
}
}
required_tags = ["Environment", "Project", "CostCenter"]
}
# Governance policies - SCPs for security
module "governance" {
source = "./modules/governance"
policies = {
deny_unencrypted_storage = {
description = "Deny unencrypted S3/EBS"
# Policy document enforces encryption
}
restrict_regions = {
description = "Restrict to approved regions"
allowed_regions = ["us-east-1", "us-west-2", "eu-west-1"]
}
}
}
Compute Options for AI Workloads
GPU Instance Selection
| Instance Type | Cloud | GPUs | GPU Memory | vCPUs | System RAM | Best For | Approx. Cost/hr |
|---|---|---|---|---|---|---|---|
| p4d.24xlarge | AWS | 8x A100 | 40GB each | 96 | 1152GB | Large model training | $32.77 |
| p5.48xlarge | AWS | 8x H100 | 80GB each | 192 | 2048GB | Frontier models, massive training | $98.32 |
| g5.xlarge | AWS | 1x A10G | 24GB | 4 | 16GB | Inference, fine-tuning | $1.006 |
| Standard_NC24ads_A100_v4 | Azure | 1x A100 | 80GB | 24 | 220GB | Medium training jobs | $3.67 |
| Standard_ND96asr_v4 | Azure | 8x A100 | 40GB each | 96 | 900GB | Distributed training | $27.20 |
| a2-highgpu-1g | GCP | 1x A100 | 40GB | 12 | 85GB | Inference, small training | $3.67 |
| a2-ultragpu-8g | GCP | 8x A100 | 80GB each | 96 | 1360GB | Large-scale training | $33.60 |
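Rates above are approximate on-demand prices in a US region and drift constantly; verify against current pricing before budgeting. As a quick way to compare options, a back-of-envelope sketch (hourly rates taken from the table; the relative throughput figures are illustrative assumptions, not benchmarks):
```python
# Back-of-envelope GPU training cost comparison (illustrative sketch;
# rates and relative throughputs are assumptions, not quotes).
HOURLY_RATE = {              # approximate on-demand $/hr from the table above
    "p4d.24xlarge": 32.77,   # 8x A100 40GB
    "p5.48xlarge": 98.32,    # 8x H100 80GB
    "a2-ultragpu-8g": 33.60, # 8x A100 80GB
}
# Assumed training throughput relative to p4d.24xlarge = 1.0 (placeholder numbers).
RELATIVE_THROUGHPUT = {"p4d.24xlarge": 1.0, "p5.48xlarge": 2.5, "a2-ultragpu-8g": 1.2}

def training_cost(instance: str, p4d_hours: float, spot_discount: float = 0.0) -> float:
    """Cost of a job that would take `p4d_hours` on p4d.24xlarge."""
    hours = p4d_hours / RELATIVE_THROUGHPUT[instance]
    return hours * HOURLY_RATE[instance] * (1 - spot_discount)

for inst in HOURLY_RATE:
    print(f"{inst}: ${training_cost(inst, 100):,.0f} on-demand, "
          f"${training_cost(inst, 100, spot_discount=0.6):,.0f} at a 60% spot discount")
```
The takeaway is that the cheapest hourly rate rarely means the cheapest job; throughput per dollar decides.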
Serverless vs. Dedicated Compute
```mermaid
flowchart TD
  A[AI Workload] --> B{Traffic Pattern}
  B -->|Unpredictable, bursty| C[Serverless]
  B -->|Steady, predictable| D[Dedicated Instances]
  C --> C1{Latency Requirement}
  C1 -->|"<100ms"| C2[Lambda/Functions with provisioned concurrency]
  C1 -->|"100ms-1s"| C3[Standard Lambda/Functions]
  C1 -->|">1s"| C4[Batch jobs - AWS Batch/Azure Batch]
  D --> D1{Scale}
  D1 -->|"Small <10 req/s"| D2[Single instance + autoscaling]
  D1 -->|"Medium 10-1000 req/s"| D3[Container orchestration - EKS/AKS/GKE]
  D1 -->|"Large >1000 req/s"| D4[Multi-region with load balancing]
  C2 --> E["Cold start ~100ms, costly"]
  C3 --> F["Cold start ~1s, cost-effective"]
  C4 --> G["No cold start concern, batch pricing"]
  D2 --> H["Simple, less overhead"]
  D3 --> I[Complex but scalable]
  D4 --> J[Complex + expensive but resilient]
```
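For the serverless branch, a minimal sketch of an AWS Lambda handler that fronts a SageMaker endpoint; the endpoint name and payload shape are placeholders, and API Gateway proxy integration is assumed:
```python
# Minimal Lambda handler for bursty inference traffic (sketch).
# Assumes a SageMaker endpoint already exists; the endpoint name and
# payload schema below are illustrative placeholders.
import json
import os
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "fraud-detection-v3")

def handler(event, context):
    # API Gateway proxy integration passes the request body as a JSON string
    payload = json.loads(event["body"])
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```
With provisioned concurrency on the function, cold starts land near the <100ms branch while the model itself stays on an always-warm endpoint.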
Model Serving Architecture
# Kubernetes deployment for model serving (EKS/AKS/GKE)
apiVersion: apps/v1
kind: Deployment
metadata:
name: ai-model-server
namespace: ml-production
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate: { maxSurge: 1, maxUnavailable: 0 } # Zero-downtime
  selector:
    matchLabels: { app: ai-model-server } # version stays in template labels, out of the immutable selector
template:
metadata:
labels: { app: ai-model-server, version: v2.3.1 }
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
spec:
# Node affinity for GPU nodes
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values: [g5.xlarge, g5.2xlarge]
# Init container downloads model from S3
initContainers:
- name: model-downloader
image: amazon/aws-cli:latest
command: ["/bin/sh", "-c", "aws s3 cp s3://my-models-bucket/model-v2.3.1/ /models/ --recursive"]
volumeMounts: [{ name: model-storage, mountPath: /models }]
containers:
- name: model-server
image: myregistry.azurecr.io/model-server:v2.3.1
ports: [{ containerPort: 8000, name: http }]
env:
- { name: MODEL_PATH, value: /models }
- { name: MAX_BATCH_SIZE, value: "32" }
resources:
requests: { memory: "16Gi", cpu: "4", nvidia.com/gpu: 1 }
limits: { memory: "16Gi", cpu: "8", nvidia.com/gpu: 1 }
livenessProbe:
httpGet: { path: /health, port: 8000 }
initialDelaySeconds: 60
readinessProbe:
httpGet: { path: /ready, port: 8000 }
initialDelaySeconds: 30
volumeMounts: [{ name: model-storage, mountPath: /models, readOnly: true }]
volumes: [{ name: model-storage, emptyDir: { sizeLimit: 50Gi } }]
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ai-model-server-hpa
spec:
scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: ai-model-server }
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
- type: Pods
pods: { metric: { name: gpu_utilization }, target: { type: AverageValue, averageValue: "70" } }
behavior:
scaleUp: { stabilizationWindowSeconds: 60, policies: [{ type: Percent, value: 50, periodSeconds: 60 }] }
scaleDown: { stabilizationWindowSeconds: 300, policies: [{ type: Pods, value: 1, periodSeconds: 60 }] }
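The Deployment above assumes the container serves /health and /ready on port 8000 and reads MODEL_PATH and MAX_BATCH_SIZE from the environment. A minimal FastAPI sketch that satisfies that contract (the loader and predict call are placeholders for your framework):
```python
# Minimal model server matching the Deployment's contract (sketch).
# Serves /health and /ready on port 8000; reads MODEL_PATH and
# MAX_BATCH_SIZE from the environment, as the manifest assumes.
import os
from fastapi import FastAPI, HTTPException

app = FastAPI()
MODEL_PATH = os.environ.get("MODEL_PATH", "/models")
MAX_BATCH_SIZE = int(os.environ.get("MAX_BATCH_SIZE", "32"))
model = None  # loaded at startup so /health can answer before weights arrive

def load_from_disk(path):
    """Placeholder loader; swap in your framework's deserialization."""
    class DummyModel:
        def predict(self, batch):
            return [0.0 for _ in batch]
    return DummyModel()

@app.on_event("startup")
def load_model():
    global model
    model = load_from_disk(MODEL_PATH)

@app.get("/health")
def health():  # liveness probe: the process is up
    return {"status": "ok"}

@app.get("/ready")
def ready():  # readiness probe: 503 until weights are loaded
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"ready": True}

@app.post("/predict")
def predict(inputs: list[dict]):
    batch = inputs[:MAX_BATCH_SIZE]  # cap batch size per the env contract
    return {"predictions": model.predict(batch)}
```
Run it with `uvicorn server:app --host 0.0.0.0 --port 8000`; returning 503 from /ready until the weights load is what keeps the pod out of the Service during startup.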
Data Storage Options
Storage Comparison
| Storage Type | AWS | Azure | GCP | Use Case | Cost (GB/month) |
|---|---|---|---|---|---|
| Object Storage | S3 Standard | Blob Hot | Standard | Training data, models | $0.023 |
| Object Storage (Infrequent) | S3 IA | Blob Cool | Nearline | Archived models | $0.0125 |
| Object Storage (Archive) | S3 Glacier | Blob Archive | Archive | Long-term retention | $0.004 |
| Block Storage (SSD) | EBS gp3 | Premium SSD | Persistent SSD | VM attached storage | $0.08 |
| Block Storage (HDD) | EBS st1 | Standard HDD | Standard | Large sequential data | $0.045 |
| File Storage | EFS | Azure Files | Filestore | Shared ML workspaces | $0.30 |
| Vector Database | OpenSearch | Cosmos DB | Vertex AI Vector Search | Embeddings search | Variable |
| Data Warehouse | Redshift | Synapse | BigQuery | Analytics | Pay-per-query |
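The object-storage tiers become real savings only with lifecycle rules. A sketch that moves aging training data from Standard to infrequent access and then archive (bucket name, prefix, and day thresholds are placeholders):
```python
# Apply an S3 lifecycle rule that tiers aging training data (sketch;
# bucket name, prefix, and thresholds are illustrative placeholders).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-data",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # ~$0.0125/GB-month
                {"Days": 180, "StorageClass": "GLACIER"},     # ~$0.004/GB-month
            ],
        }]
    },
)
```
Blob Storage and Cloud Storage expose equivalent lifecycle management on the container/bucket.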
Vector Database Deployment
# Deploy Qdrant (open-source vector DB) on Kubernetes
# Alternative: Use managed services like Pinecone, Weaviate Cloud
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: qdrant, namespace: vector-db }
spec:
serviceName: qdrant
replicas: 3
selector: { matchLabels: { app: qdrant } }
template:
metadata: { labels: { app: qdrant } }
spec:
containers:
- name: qdrant
image: qdrant/qdrant:v1.7.0
ports: [{ containerPort: 6333, name: http }, { containerPort: 6334, name: grpc }]
resources:
requests: { memory: "8Gi", cpu: "2" }
limits: { memory: "16Gi", cpu: "4" }
volumeMounts: [{ name: qdrant-storage, mountPath: /qdrant/storage }]
volumeClaimTemplates:
- metadata: { name: qdrant-storage }
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources: { requests: { storage: 100Gi } }
# Python client for Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)
# Connect to Qdrant cluster
client = QdrantClient(host="qdrant.vector-db.svc.cluster.local", port=6333)
# Create collection
client.create_collection(
collection_name="document_embeddings",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE) # OpenAI embedding size
)
# Upsert vectors
client.upsert(
collection_name="document_embeddings",
points=[PointStruct(id=doc_id, vector=embedding, payload={"text": text, "metadata": metadata})
for doc_id, embedding, text, metadata in data]
)
# Search (query_embedding is a 1536-dim vector from your embedding model)
results = client.search(
    collection_name="document_embeddings",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="technical"))])
)
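Filtered searches like the category match above stay fast only if the payload field is indexed; a one-step sketch using the same collection:
```python
# Index the payload field used in the filter so filtered search stays fast.
from qdrant_client.models import PayloadSchemaType

client.create_payload_index(
    collection_name="document_embeddings",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)
```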
Cost Optimization Strategies
Cost Optimization Techniques
| Technique | Savings Potential | Implementation Complexity | Best For |
|---|---|---|---|
| Spot/Preemptible Instances | 60-90% | Medium | Fault-tolerant training |
| Reserved Instances | 40-60% | Low | Predictable workloads |
| Savings Plans | 30-50% | Low | Flexible compute usage |
| Right-Sizing | 20-40% | Medium | Over-provisioned resources |
| Auto-Scaling | 30-50% | Medium-High | Variable load |
| S3 Intelligent-Tiering | 20-70% | Low | Mixed access patterns |
| Compression | 50-80% (storage) | Low | Large datasets |
| Model Quantization | 2-4x (compute) | High | Inference workloads |
| Batch Inference | 40-60% | Medium | Non-real-time predictions |
| Regional Selection | 10-30% | Low | Flexible location |
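As a concrete instance of the spot row, SageMaker exposes managed spot capacity directly on the training-job API. A sketch, with the role ARN, image URI, and S3 paths as placeholders; the checkpoint config is what lets interrupted jobs resume:
```python
# Managed spot training on SageMaker (sketch; ARNs, image URI, and S3
# paths are placeholders). Spot capacity can cut training cost 60-90%.
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="fraud-detection-v3-spot",
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.us-east-1.amazonaws.com/train:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/ai-data-scientist",
    OutputDataConfig={"S3OutputPath": "s3://my-models-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.g5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 100,
    },
    EnableManagedSpotTraining=True,        # bid on spot capacity
    StoppingCondition={
        "MaxRuntimeInSeconds": 3600 * 8,   # compute budget
        "MaxWaitTimeInSeconds": 3600 * 12, # runtime plus time spent waiting for capacity
    },
    CheckpointConfig={"S3Uri": "s3://my-models-bucket/checkpoints/"},  # resume after interruption
)
```
GCP preemptible/spot VMs and Azure spot node pools follow the same pattern: checkpoint often, tolerate restarts.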
Cost Monitoring & Alerts
# AWS Cost anomaly detection with Lambda
import boto3
from datetime import datetime, timedelta
ce_client = boto3.client('ce') # Cost Explorer
sns_client = boto3.client('sns')
def detect_cost_anomalies(event, context):
"""Detect unusual spending patterns"""
end_date = datetime.now().date()
start_date = end_date - timedelta(days=7)
response = ce_client.get_cost_and_usage(
TimePeriod={'Start': start_date.isoformat(), 'End': end_date.isoformat()},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}, {'Type': 'TAG', 'Key': 'Project'}]
)
# Analyze for anomalies
anomalies = []
for result in response['ResultsByTime']:
for group in result['Groups']:
service, project = group['Keys']
cost = float(group['Metrics']['UnblendedCost']['Amount'])
            historical_avg = get_historical_average(service, project, days=30)  # user-supplied baseline helper
            # Flag if cost > 2x historical average and > $100 (skip zero/new baselines)
            if historical_avg > 0 and cost > 2 * historical_avg and cost > 100:
anomalies.append({
'service': service, 'project': project, 'cost': cost,
'historical_avg': historical_avg,
'deviation': (cost / historical_avg - 1) * 100
})
# Send alerts if anomalies detected
if anomalies:
message = "Cost Anomalies:\n" + "\n".join([
f"• {a['service']} ({a['project']}): ${a['cost']:.2f} (avg: ${a['historical_avg']:.2f}, +{a['deviation']:.1f}%)"
for a in anomalies
])
sns_client.publish(TopicArn='arn:aws:sns:us-east-1:ACCOUNT:cost-alerts',
Subject='Cloud Cost Anomaly Detected', Message=message)
return {'statusCode': 200, 'anomalies': anomalies}
Security & Compliance
Security Baseline Checklist
- Encryption at Rest: All storage encrypted with KMS/customer-managed keys
- Encryption in Transit: TLS 1.2+ for all data transfer
- Identity Management: SSO integrated with corporate IdP
- Least Privilege: IAM roles/policies follow principle of least privilege
- Network Segmentation: Private subnets for AI workloads, no public IPs
- Secrets Management: API keys, credentials in Secrets Manager/Key Vault
- Logging: CloudTrail/Activity Log enabled for all API calls
- Monitoring: Real-time alerts for security events
- Vulnerability Scanning: Container images scanned pre-deployment
- Compliance: SOC 2, HIPAA, GDPR controls enabled where applicable
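A read-only sketch that audits the first checklist item, flagging S3 buckets with no default encryption configured (note that buckets created after early 2023 receive SSE-S3 by default):
```python
# Audit S3 default encryption (first checklist item). Read-only sketch.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        enc = s3.get_bucket_encryption(Bucket=name)
        rules = enc["ServerSideEncryptionConfiguration"]["Rules"]
        algo = rules[0]["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"]
        print(f"OK   {name}: {algo}")
    except ClientError as e:
        if e.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            print(f"FAIL {name}: no default encryption")
        else:
            raise
```
The same pattern (enumerate, check, report) extends to EBS volumes, public IPs, and untagged resources, and pairs naturally with the governance SCPs in the landing-zone baseline.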
Compliance Frameworks
| Framework | AWS Service | Azure Service | GCP Service | Use Case |
|---|---|---|---|---|
| SOC 2 | AWS Audit Manager | Azure Compliance Manager | Compliance Reports Manager | SaaS vendors |
| HIPAA | HIPAA eligible services | Azure Health Data Services | HIPAA compliance | Healthcare data |
| GDPR | Data residency controls | Data residency + Privacy | Data residency | EU personal data |
| PCI DSS | PCI DSS compliance | PCI DSS Level 1 | PCI DSS Level 1 | Payment data |
| FedRAMP | GovCloud | Azure Government | Assured Workloads | US government |
Case Study: Multi-Cloud AI Platform
Background
A fintech company needed to deploy AI fraud detection models across AWS (primary) and Azure (DR) under strict compliance requirements (PCI DSS, SOC 2).
Requirements
- <50ms inference latency for fraud detection
- 99.99% availability
- Multi-region deployment (US, EU)
- Cost target: $150K/month
- PCI DSS compliant infrastructure
Implementation
Infrastructure Stack:
- Compute: EKS (AWS), AKS (Azure) with GPU node pools (g5.xlarge equivalent)
- Storage: S3 (AWS), Blob Storage (Azure) with encryption
- Vector DB: Managed Pinecone (multi-cloud)
- Monitoring: Datadog (unified monitoring)
- IaC: Terraform for both clouds
Architecture Highlights:
# Terraform module for model serving (portable across clouds)
module "model_serving" {
source = "./modules/model-serving"
cloud_provider = "aws" # or "azure"
region = "us-east-1"
cluster_config = {
node_instance_type = "g5.xlarge"
min_nodes = 3
max_nodes = 20
gpu_per_node = 1
}
model_config = {
model_path = "s3://models/fraud-detection-v3/"
replicas = 5
max_batch_size = 32
max_latency_ms = 50
autoscaling_metric = "request_queue_depth"
}
security_config = {
enable_encryption = true
kms_key_id = module.kms.key_id
network_policy = "private"
enable_audit_log = true
}
tags = {
Compliance = "PCI-DSS"
Environment = "production"
CostCenter = "fraud-prevention"
}
}
Results
| Metric | Target | Achieved |
|---|---|---|
| P95 Latency | <50ms | 43ms |
| Availability | 99.99% | 99.995% |
| Monthly Cost | $150K | $137K |
| Setup Time | 8 weeks | 3 weeks (with IaC) |
| Compliance Audits | Pass PCI DSS | Passed |
Cost Breakdown:
- Compute (GPU instances): $78K (57%)
- Data storage (S3/Blob): $12K (9%)
- Vector DB (Pinecone): $28K (20%)
- Data transfer: $8K (6%)
- Monitoring/Logging: $11K (8%)
Lessons Learned:
- Spot instances for training: Saved $32K/month on training jobs
- Regional data transfer costs: Implementing caching reduced egress by 40%
- Right-sizing: Initial GPU instances were over-provisioned; rightsizing saved $19K/month
- Terraform abstractions: Cloud-agnostic modules enabled Azure deployment in 2 weeks
Implementation Checklist
Planning
- Define multi-cloud strategy (single cloud vs. multi-cloud)
- Select regions based on data residency and latency requirements
- Determine compliance frameworks needed
- Establish cost budgets per environment
- Design network topology and security zones
Landing Zone Setup
- Create organizational structure (accounts/subscriptions/projects)
- Configure identity and SSO
- Set up network baseline (VPCs, subnets, routing)
- Deploy security baseline (KMS, secrets, audit logs)
- Implement cost management (budgets, alerts, tagging)
- Configure monitoring and logging infrastructure
AI-Specific Infrastructure
- Provision GPU compute clusters (EKS/AKS/GKE)
- Deploy vector database (managed or self-hosted)
- Set up model registry and artifact storage
- Configure autoscaling policies
- Implement CI/CD pipelines for ML
- Deploy model serving infrastructure
Security & Compliance
- Enable encryption at rest and in transit
- Configure network security groups/firewalls
- Set up vulnerability scanning for containers
- Implement secrets rotation
- Enable compliance monitoring (Security Hub, etc.)
- Document data flows for audit
Operations
- Create operational runbooks
- Set up on-call rotation
- Implement backup and disaster recovery
- Configure cost dashboards and reports
- Establish SLO/SLA monitoring
- Schedule regular security audits
Best Practices
Do's
- Infrastructure as Code: Use Terraform/Pulumi for reproducibility
- Modular Design: Create reusable modules for common patterns
- Cost Tagging: Enforce tagging policies for cost attribution
- Right-Size Early: Start small, scale based on metrics
- Multi-Region: Deploy critical services across regions
- Automate Security: Use policy-as-code (OPA, Sentinel)
- Monitor Everything: Metrics, logs, traces, costs
Don'ts
- Don't Skip Testing: Load test before production
- Don't Ignore Costs: Set budgets and alerts from day 1
- Don't Deploy Manually: Automate deployment with CI/CD
- Don't Over-Provision: Use autoscaling instead
- Don't Store Secrets in Code: Use secrets managers (see the sketch after this list)
- Don't Neglect DR: Have tested backup/restore procedures
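For the secrets point above, a sketch of fetching at runtime rather than baking keys into code or images; the secret name mirrors the landing-zone Terraform example:
```python
# Fetch an API key at runtime instead of embedding it in code or images.
# Secret name mirrors the landing-zone example; cache to avoid per-request calls.
import boto3

_secrets = boto3.client("secretsmanager")
_cache: dict[str, str] = {}

def get_secret(name: str) -> str:
    if name not in _cache:
        _cache[name] = _secrets.get_secret_value(SecretId=name)["SecretString"]
    return _cache[name]

openai_api_key = get_secret("openai_api_key")
```
Azure Key Vault and GCP Secret Manager clients slot into the same wrapper, which keeps application code cloud-agnostic.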
Common Pitfalls
| Pitfall | Impact | Mitigation |
|---|---|---|
| Data Egress Costs | Unexpected 30-50% cost increase | Cache data, use CDNs, regionalize |
| GPU Idle Time | Wasted 40-60% of GPU budget | Autoscaling, spot instances, batch jobs |
| Network Bottlenecks | Slow data loading for training | Use instance storage, increase network bandwidth |
| Manual Provisioning | Weeks to deploy, configuration drift | IaC from day 1 |
| No Cost Visibility | Budget overruns by 3-5x | Tagging, dashboards, alerts |
| Single Region | Downtime during regional outages | Multi-region with failover |
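For the GPU idle pitfall, a sketch that flags under-utilized instances from CloudWatch. It assumes GPU utilization metrics are already being published (e.g., by the CloudWatch agent's nvidia_smi plugin or a DCGM exporter); the namespace and metric name below depend on that agent configuration and are assumptions:
```python
# Flag idle GPU instances from CloudWatch metrics (sketch).
# Assumes a GPU utilization metric is published; the "CWAgent" namespace
# and "nvidia_smi_utilization_gpu" name depend on your agent config.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

def idle_gpus(instance_ids, threshold=20.0, hours=24):
    now = datetime.now(timezone.utc)
    idle = []
    for iid in instance_ids:
        stats = cw.get_metric_statistics(
            Namespace="CWAgent",                      # assumption: agent default
            MetricName="nvidia_smi_utilization_gpu",  # assumption: nvidia_smi plugin
            Dimensions=[{"Name": "InstanceId", "Value": iid}],
            StartTime=now - timedelta(hours=hours),
            EndTime=now,
            Period=3600,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        if points and sum(p["Average"] for p in points) / len(points) < threshold:
            idle.append(iid)
    return idle
```
Feeding this list into a scale-down job or a ticket queue closes the loop on the 40-60% of GPU budget the table attributes to idle time.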
Deliverables
1. IaC Templates
- Landing zone baseline (Terraform/Pulumi)
- Compute modules (EKS/AKS/GKE, GPU node pools)
- Storage modules (S3, Vector DB)
- Networking modules (VPC, subnets, security groups)
- Monitoring modules (CloudWatch, Prometheus)
2. Reference Architectures
- Model training pipeline
- Real-time inference serving
- Batch inference processing
- RAG application stack
- Multi-region deployment
3. Operational Runbooks
- Deployment procedures
- Incident response
- Disaster recovery
- Cost optimization
- Security hardening
4. Cost Models
- Cost estimation templates
- Budget allocation by workload
- Optimization recommendations
- Chargeback/showback reports