Part 10: MLOps & Platform Engineering

Chapter 59: Cost, Performance & Capacity

Overview

Optimize unit economics and performance; plan capacity for peak and average loads. ML and LLM systems can consume massive resources—hundreds of dollars per thousand requests, multi-second latencies, and unpredictable scaling patterns. Strategic optimization of cost, performance, and capacity ensures systems remain economically viable while delivering responsive user experiences. Organizations implementing systematic cost optimization typically reduce expenses by 60-85% while improving latency by 40-70%.

Key Objectives

  • Establish cost models per request, per token, and per user
  • Optimize performance through profiling, batching, and caching
  • Implement autoscaling strategies for variable loads
  • Plan capacity for peak demand with appropriate buffers
  • Balance cost, performance, and quality trade-offs

Deliverables

  • Cost model with unit economics analysis
  • Performance tuning guide and benchmarks
  • Capacity planning models and forecasts
  • Autoscaling configurations and policies
  • Optimization roadmap with ROI projections

Why It Matters

Without strategic optimization, AI costs can balloon from manageable to unsustainable within weeks. A well-architected cost and performance strategy turns ML from a cost center into a defensible competitive advantage.

The Cost Crisis:

  • LLM Costs: GPT-4 costs $0.03/1K input tokens and $0.06/1K output tokens—a chatbot with 100K daily users can easily cost $50K+/month (see the worked example after this list)
  • Infrastructure: GPU costs range from $1-8/hour per GPU; a 24/7 deployment with 4 GPUs costs $35K-280K/year
  • Hidden Costs: Data transfer, storage, monitoring, and failed requests add 20-40% overhead
  • Scaling Surprises: A viral feature can 10x costs overnight without proper rate limiting
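
A rough sanity check on the chatbot figure above, using illustrative assumptions rather than measured numbers: if 100K daily users each send two messages averaging 500 input and 300 output tokens, daily spend is roughly 100,000 × 2 × (0.5 × $0.03 + 0.3 × $0.06) ≈ $6,600, or about $200K/month before any caching; even far lighter usage clears $50K/month.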

Organizations often discover:

  • Initial MVP costs $5K/month → Production costs $150K/month (30x increase)
  • 70-80% of costs come from 20% of users (power users, bots)
  • 40-60% of requests could be cached without quality degradation
  • Simple optimizations (batching, caching, quantization) reduce costs 50-80%
  • 90% of performance bottlenecks come from 10% of code paths

Cost Model Architecture

graph TB
    A[User Request] --> B{Cache Hit?}
    B -->|Yes| C[Return Cached<br/>Cost: $0.0001]
    B -->|No| D[Feature Retrieval]
    D --> E{Complex Query?}
    E -->|No| F[Small Model<br/>Cost: $0.001]
    E -->|Yes| G[Large Model<br/>Cost: $0.02]
    F --> H[Cache Result]
    G --> H
    H --> I[Return to User]
    subgraph Cost Tracking
        C --> J[Log Metrics]
        F --> J
        G --> J
        J --> K[Cost Per User]
        J --> L[Cost Per Feature]
        J --> M[Total Daily Cost]
    end
    subgraph Budget Controls
        M --> N{Over Budget?}
        N -->|Yes| O[Throttle Requests]
        N -->|No| P[Continue]
    end
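
The Budget Controls branch above can be prototyped in a few lines. Below is a minimal sketch (the CostTracker class, daily_budget_usd parameter, and in-memory counters are illustrative assumptions, not a specific library); a production version would persist spend to a metrics store and enforce per-user budgets as well.

# cost_tracker.py - sketch of the "Budget Controls" branch (hypothetical names)
import time
from collections import defaultdict

class CostTracker:
    def __init__(self, daily_budget_usd=2000.0):
        self.daily_budget = daily_budget_usd
        self.spend_by_user = defaultdict(float)
        self.total_spend = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def record(self, user_id, cost_usd):
        self._maybe_reset()
        self.spend_by_user[user_id] += cost_usd
        self.total_spend += cost_usd

    def over_budget(self):
        self._maybe_reset()
        return self.total_spend >= self.daily_budget

    def _maybe_reset(self):
        # Start fresh counters at the beginning of each day
        today = time.strftime("%Y-%m-%d")
        if today != self.day:
            self.day = today
            self.spend_by_user.clear()
            self.total_spend = 0.0

# Usage: throttle new requests once the daily budget is exhausted
tracker = CostTracker(daily_budget_usd=500.0)
tracker.record("user-42", 0.02)
if tracker.over_budget():
    raise RuntimeError("Daily budget exceeded; throttling requests")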

Complete Optimization Pipeline

flowchart TD
    A[Production System] --> B[Profiling & Measurement]
    B --> C{Identify Bottlenecks}
    C -->|Slow Requests| D[Performance Optimization]
    C -->|High Costs| E[Cost Optimization]
    C -->|Scaling Issues| F[Capacity Planning]
    D --> D1[Caching Layer]
    D --> D2[Batching]
    D --> D3[Model Optimization]
    E --> E1[Model Quantization]
    E --> E2[Prompt Compression]
    E --> E3[Tiered Models]
    F --> F1[Autoscaling Config]
    F --> F2[Load Forecasting]
    F --> F3[Resource Allocation]
    D1 --> G[Deploy Changes]
    D2 --> G
    D3 --> G
    E1 --> G
    E2 --> G
    E3 --> G
    F1 --> G
    F2 --> G
    F3 --> G
    G --> H[Monitor Metrics]
    H --> I{Goals Met?}
    I -->|No| B
    I -->|Yes| J[Production]
    J --> K[Continuous Monitoring]
    K --> B

Performance Optimization Techniques

1. Profiling & Bottleneck Identification

Profiling Framework:

# profiler.py - Simplified profiling for ML systems
import time
from functools import wraps

class Profiler:
    def __init__(self):
        self.metrics = {}

    def time(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            duration = (time.perf_counter() - start) * 1000

            if func.__name__ not in self.metrics:
                self.metrics[func.__name__] = []
            self.metrics[func.__name__].append(duration)
            return result
        return wrapper

    def report(self):
        for name, times in self.metrics.items():
            s = sorted(times)  # sort once, then index percentiles
            print(f"{name}: P50={s[len(s)//2]:.1f}ms "
                  f"P95={s[int(len(s)*0.95)]:.1f}ms")

# Usage (db and model below are placeholders for your vector store and LLM clients)
profiler = Profiler()

@profiler.time
def vector_search(query):
    return db.search(query, top_k=5)

@profiler.time
def llm_generate(prompt):
    return model.generate(prompt)

Bottleneck Analysis:

| Component | Typical Latency | Target | Optimization Strategy |
|---|---|---|---|
| Input Preprocessing | 5-20ms | <10ms | Vectorize operations, cache |
| Vector Search | 50-200ms | <100ms | Approximate search (HNSW), GPU acceleration |
| LLM Prefill | 200-800ms | <500ms | Batch requests, quantization |
| LLM Decode | 20-50ms/token | <30ms/token | Speculative decoding, KV cache |
| Postprocessing | 10-30ms | <20ms | Optimize regex, parallel processing |

2. Batching Strategies

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant B as Batcher
    participant M as Model
    R1->>B: Add request (t=0ms)
    R2->>B: Add request (t=20ms)
    R3->>B: Add request (t=40ms)
    Note over B: Wait 50ms OR<br/>batch full
    B->>M: Process batch [R1,R2,R3] (t=50ms)
    M->>B: Results [O1,O2,O3] (t=150ms)
    B->>R1: Return O1
    B->>R2: Return O2
    B->>R3: Return O3
    Note over R1,R3: Total latency: 150ms<br/>vs 300ms individual

Simplified Batching Implementation:

# batcher.py
import asyncio
from collections import deque

class Batcher:
    def __init__(self, model_fn, max_size=32, max_wait_ms=50):
        self.model_fn = model_fn      # async fn: list of inputs -> list of outputs
        self.max_size = max_size
        self.max_wait = max_wait_ms / 1000
        self.queue = deque()

    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))

        if len(self.queue) >= self.max_size:
            # Batch is full: flush immediately
            asyncio.create_task(self._process())
        elif len(self.queue) == 1:
            # First request of a new batch: flush after max_wait
            asyncio.create_task(self._flush_after_wait())
        return await future

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait)
        await self._process()

    async def _process(self):
        if not self.queue:
            return
        batch = [self.queue.popleft()
                 for _ in range(min(len(self.queue), self.max_size))]
        inputs = [item[0] for item in batch]
        outputs = await self.model_fn(inputs)
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)

Batching Performance:

| Batch Size | Individual Latency | Batched Latency | Throughput Gain | Latency Penalty |
|---|---|---|---|---|
| 1 (no batch) | 100ms | 100ms | 1x | 0ms |
| 8 | 100ms | 120ms | 6.7x | +20ms |
| 16 | 100ms | 140ms | 11.4x | +40ms |
| 32 | 100ms | 180ms | 17.8x | +80ms |
| 64 | 100ms | 250ms | 25.6x | +150ms |
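
Throughput gain is batch size × (individual latency ÷ batched latency): a batch of 32 completes in 180ms, serving 32 / 0.18s ≈ 178 requests per second versus 10 per second unbatched, i.e. the 17.8x shown above.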

3. Caching Strategies

flowchart LR
    A[Request] --> B{L1 Cache<br/>In-Memory}
    B -->|Hit| C[Return <1ms]
    B -->|Miss| D{L2 Cache<br/>Redis}
    D -->|Hit| E[Return 5ms]
    D -->|Miss| F[LLM Inference]
    F --> G[Cache Result]
    G --> H[Return 1-5s]
    E --> I[Promote to L1]
    H --> J[Store in L1+L2]

Simplified Caching:

# cache.py
import hashlib
import json

class MultiLevelCache:
    def __init__(self, redis_client, max_l1=1000, ttl=3600):
        self.redis = redis_client
        self.l1 = {}
        self.max_l1 = max_l1
        self.ttl = ttl

    def _key(self, data):
        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def get(self, data):
        key = self._key(data)
        # L1 check
        if key in self.l1:
            return self.l1[key]
        # L2 check
        val = self.redis.get(key)
        if val:
            result = json.loads(val)
            self.l1[key] = result  # Promote
            return result
        return None

    def set(self, data, result):
        key = self._key(data)
        if len(self.l1) >= self.max_l1:
            self.l1.pop(next(iter(self.l1)))
        self.l1[key] = result
        self.redis.setex(key, self.ttl, json.dumps(result))

# Semantic cache for similar queries
def cosine_sim(a, b):
    # Cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.cache = []  # (embedding, response) pairs
        self.threshold = threshold

    def get(self, embedding):
        for cached_emb, response in self.cache:
            if cosine_sim(embedding, cached_emb) >= self.threshold:
                return response
        return None

    def set(self, embedding, response):
        self.cache.append((embedding, response))
        if len(self.cache) > 10000:
            self.cache = self.cache[-10000:]

Cache Performance Comparison:

| Cache Type | Hit Latency | Cost per Hit | Hit Rate (typical) | Use Case |
|---|---|---|---|---|
| In-Memory LRU | <1ms | $0.00001 | 20-30% | Frequent exact matches |
| Redis | 1-5ms | $0.0001 | 40-60% | Distributed systems |
| Semantic Cache | 5-10ms | $0.0005 | 30-50% | Similar LLM queries |
| Database | 10-50ms | $0.001 | 70-90% | Long-term storage |
| No Cache | 1000-5000ms | $0.01-0.10 | N/A | Baseline |

4. Model Optimization

Quantization Techniques:

# optimization.py
# Note: load_in_8bit/load_in_4bit require the bitsandbytes and accelerate packages;
# recent transformers versions prefer passing quantization_config=BitsAndBytesConfig(...)
from transformers import AutoModelForCausalLM
import torch

# INT8 quantization - ~50% of baseline memory, ~2% quality loss
model_int8 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_8bit=True,
    device_map="auto"
)

# INT4 quantization - ~25% of baseline memory, ~5-10% quality loss
model_int4 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_4bit=True,
    device_map="auto"
)

# PyTorch compilation - 20-30% speedup, no quality loss
# (model here is any loaded torch.nn.Module)
model_compiled = torch.compile(model, mode="reduce-overhead")

Performance Comparison:

| Optimization | Memory | Latency | Throughput | Quality | Setup Effort |
|---|---|---|---|---|---|
| Baseline FP16 | 100% | 100% | 100% | 100% | Low |
| INT8 Quantization | 50% | 60-70% | 150% | 98-99% | Low |
| INT4 Quantization | 25% | 40-50% | 200% | 90-95% | Medium |
| Mixed Precision | 70% | 70-80% | 130% | 99.5% | Low |
| torch.compile | 100% | 70-80% | 120-130% | 100% | Low |
| All Combined | 30% | 30-40% | 300-400% | 95-98% | Medium |

5. Autoscaling Strategies

flowchart TD
    A[Monitor Metrics] --> B{CPU > 70%<br/>OR<br/>Latency > 500ms<br/>OR<br/>Queue > 100}
    B -->|Yes| C[Scale Up<br/>+2 replicas]
    B -->|No| D{CPU < 30%<br/>AND<br/>Latency < 100ms<br/>AND<br/>Queue < 10}
    D -->|Yes| E[Scale Down<br/>-1 replica]
    D -->|No| F[Maintain]
    C --> G[Wait 60s Cooldown]
    E --> H[Wait 300s Cooldown]
    F --> A
    G --> A
    H --> A
    style C fill:#90EE90
    style E fill:#FFB6C1

Kubernetes Autoscaling Config:

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: inference_latency_p95
      target:
        type: AverageValue
        averageValue: 500m
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100        # Double quickly
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10         # Reduce slowly
        periodSeconds: 60
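
Note that inference_latency_p95 is not a built-in metric: the Pods metric above assumes a custom-metrics adapter (for example, Prometheus Adapter) is installed and exposing it through the Kubernetes custom metrics API.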

Autoscaling Strategy Comparison:

| Metric | Scale-Up Threshold | Scale-Down Threshold | Cooldown | Reason |
|---|---|---|---|---|
| CPU | >70% | <30% | 60s / 300s | Prevent thrashing |
| Latency P95 | >500ms | <100ms | 60s / 300s | User experience |
| Queue Depth | >100 | <10 | 60s / 300s | Request backlog |
| Error Rate | >5% | N/A | 30s / N/A | Possible overload |

Cost Modeling & Optimization

flowchart LR
    A[Request] --> B{Cache Hit?}
    B -->|45%| C[Cache<br/>$0.0001]
    B -->|55%| D{Model Tier}
    D -->|Simple 60%| E[Small Model<br/>$0.002]
    D -->|Complex 40%| F[Large Model<br/>$0.025]
    C --> G[Track Cost]
    E --> G
    F --> G
    G --> H{Cost Analysis}
    H --> I[Per User]
    H --> J[Per Feature]
    H --> K[Total Budget]

Simplified Cost Calculator:

# cost_model.py
class CostModel:
    def __init__(self):
        self.gpu_hour = 4.00  # A100
        self.input_token_1k = 0.03  # GPT-4
        self.output_token_1k = 0.06

    def request_cost(self, input_tokens, output_tokens, cached=False):
        if cached:
            return 0.0001
        return (input_tokens/1000 * self.input_token_1k +
                output_tokens/1000 * self.output_token_1k)

    def monthly_projection(self, daily_requests, avg_in, avg_out, cache_rate=0.4):
        cached = daily_requests * cache_rate * 0.0001
        compute = daily_requests * (1 - cache_rate) * self.request_cost(avg_in, avg_out)
        monthly = (cached + compute) * 30
        return {
            "monthly_cost": monthly,
            "per_request": monthly / (daily_requests * 30)
        }

# Example
model = CostModel()
proj = model.monthly_projection(
    daily_requests=100000,
    avg_in=500,
    avg_out=300,
    cache_rate=0.45
)
print(f"Monthly: ${proj['monthly_cost']:,.2f}, Per request: ${proj['per_request']:.4f}")

Cost Optimization Strategies

| Strategy | Cost Reduction | Implementation Effort | Quality Impact |
|---|---|---|---|
| Response Caching | 40-60% | Low | None (cache hits) |
| Semantic Caching | 30-50% | Medium | None |
| Model Quantization | 50-75% | Low | 2-10% degradation |
| Batch Inference | 30-50% | Medium | Slight latency increase |
| Prompt Compression | 20-40% | Medium | 1-5% degradation |
| Smaller Model for Simple Queries | 60-80% (on those queries) | Medium | Minimal (w/ routing) |
| Rate Limiting Power Users | 20-30% | Low | User experience tradeoff |
| Spot/Preemptible Instances | 60-80% | Medium-High | Requires fault tolerance |
| All Combined | 80-95% | High | 5-15% degradation |
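
The "Smaller Model for Simple Queries" row relies on a router that picks a tier per request. A minimal sketch, assuming a hypothetical keyword/length heuristic (is_simple) and placeholder call_small_model / call_large_model callables; real deployments often use a small trained classifier or an LLM-based router instead:

# router.py - tiered-model routing sketch (heuristic and callables are illustrative)
SIMPLE_MAX_TOKENS = 60
COMPLEX_KEYWORDS = {"refund", "legal", "escalate", "compare", "integrate"}

def is_simple(query: str) -> bool:
    # Crude heuristic: short queries without "complex" keywords go to the small model
    tokens = query.lower().split()
    return len(tokens) <= SIMPLE_MAX_TOKENS and not (set(tokens) & COMPLEX_KEYWORDS)

def route(query: str, call_small_model, call_large_model):
    # Cheap model for simple queries, large model for everything else
    if is_simple(query):
        return call_small_model(query)
    return call_large_model(query)

# Usage with placeholder callables
answer = route(
    "How do I reset my password?",
    call_small_model=lambda q: f"[small model] {q}",
    call_large_model=lambda q: f"[large model] {q}",
)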

Capacity Planning

flowchart TD
    A[Historical Data] --> B[Analyze Patterns]
    B --> C[Average RPS]
    B --> D[Peak RPS P99]
    B --> E[Growth Trend]
    C --> F[Min Replicas<br/>Avg × 1.2]
    D --> G[Max Replicas<br/>Peak × 1.2]
    E --> H[Future Forecast]
    F --> I[Reserved/Always-On]
    G --> J[Autoscale Limit]
    I --> K[Cost Optimization]
    J --> K
    K --> L[Reserved for Base<br/>On-Demand for Peaks]

Simplified Capacity Planner:

# capacity.py
import numpy as np

class CapacityPlanner:
    def __init__(self, historical_rps):
        self.rps = np.array(historical_rps)

    def analyze(self):
        return {
            "avg": np.mean(self.rps),
            "p95": np.percentile(self.rps, 95),
            "p99": np.percentile(self.rps, 99),
            "max": np.max(self.rps)
        }

    def recommend(self, replica_rps=10, margin=0.2):
        stats = self.analyze()
        min_reps = max(2, int(np.ceil(stats["avg"] / replica_rps * (1 + margin))))
        max_reps = int(np.ceil(stats["p99"] / replica_rps * (1 + margin)))
        return {
            "min_replicas": min_reps,
            "max_replicas": max_reps,
            "avg_utilization": stats["avg"] / (min_reps * replica_rps)
        }

    def forecast(self, growth_monthly=0.10, months=6):
        current = self.analyze()
        return [{
            "month": m,
            "avg_rps": current["avg"] * (1 + growth_monthly) ** m,
            "p99_rps": current["p99"] * (1 + growth_monthly) ** m
        } for m in range(1, months + 1)]
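
A usage example for the planner above, with synthetic hourly RPS samples (the numbers are illustrative only):

# Example usage with synthetic hourly RPS samples
hourly_rps = list(np.random.normal(40, 5, 20 * 24)) + list(np.random.normal(90, 10, 2 * 24))
planner = CapacityPlanner(hourly_rps)
print(planner.analyze())                              # avg / p95 / p99 / max RPS
print(planner.recommend(replica_rps=10, margin=0.2))
print(planner.forecast(growth_monthly=0.10, months=3))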

Case Study: Customer Support Chatbot Cost Optimization

Background: A SaaS company deployed an LLM-powered customer support chatbot. Initial costs: $180K/month for 3M conversations.

flowchart LR
    A[Initial State<br/>$180K/month] --> B[Phase 1: Caching<br/>$99K/month<br/>45% reduction]
    B --> C[Phase 2: Tiering<br/>$69K/month<br/>30% reduction]
    C --> D[Phase 3: Batching<br/>$59K/month<br/>15% reduction]
    D --> E[Phase 4: Prompts<br/>$53K/month<br/>10% reduction]
    E --> F[Final: 71% Total<br/>Cost Reduction]
    style A fill:#ffcccc
    style E fill:#ccffcc

Initial Architecture:

  • GPT-4 for all responses (100% traffic)
  • No caching, no batching
  • 500 input tokens, 300 output tokens per conversation
  • Cost: $0.06/conversation, Target: $0.01/conversation
  • Latency: 3-5 seconds (users complained)

Problems:

  • Unsustainable unit economics (6x over budget)
  • 50% of questions were FAQ variations (wasteful)
  • Poor user experience due to latency
  • No optimization strategy

Optimization Journey:

Phase 1: Semantic Caching (Month 1)

  • Implemented multi-level cache (in-memory + Redis)
  • 45% cache hit rate for common questions
  • Result: $180K → $99K/month (45% reduction)
  • Latency: Cached responses 100ms (30x faster)

Phase 2: Model Tiering (Month 2)

  • Router classifies queries as simple/complex
  • GPT-3.5-Turbo for simple (60% of traffic, 10x cheaper)
  • GPT-4 for complex (40% of traffic)
  • Result: $99K → $69K/month (30% reduction)
  • Quality: Maintained with smart routing

Phase 3: Batching & Quantization (Month 3)

  • Email responses batched (not real-time)
  • Self-hosted quantized model for FAQs
  • Result: $69K → $59K/month (15% reduction)
  • Bonus: Reduced API dependency

Phase 4: Prompt Optimization (Month 4)

  • Compressed system prompts (40% fewer tokens)
  • Context pruning based on relevance scoring (see the sketch after this list)
  • Early stopping when confidence > 0.95
  • Result: $59K → $53K/month (10% reduction)
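
A minimal sketch of the relevance-based context pruning mentioned above. It assumes the query and retrieved chunks already have embeddings (the embedding step is out of scope here); chunks below a similarity threshold are dropped and the rest are kept best-first under a length budget. Threshold and budget values are illustrative:

# context_pruning.py - relevance-based context pruning (illustrative thresholds)
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

def prune_context(query_emb, chunks, chunk_embs, min_sim=0.75, max_chars=4000):
    # Score each chunk against the query, then keep the most relevant ones
    scored = sorted(
        zip((cosine(query_emb, e) for e in chunk_embs), chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept, total = [], 0
    for sim, chunk in scored:
        if sim < min_sim or total + len(chunk) > max_chars:
            continue
        kept.append(chunk)
        total += len(chunk)
    return kept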

Final Results:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly Cost | $180K | $53K | 71% reduction |
| Cost per Conversation | $0.06 | $0.018 | 70% reduction |
| Average Latency | 3.5s | 1.2s | 66% improvement |
| P95 Latency | 5.2s | 2.1s | 60% improvement |
| CSAT Score | 7.2/10 | 8.1/10 | 12% improvement (faster!) |
| Cache Hit Rate | 0% | 45% | N/A |
| GPU Utilization | 35% | 72% | 2x efficiency |

Key Success Factors:

  1. Started with easy wins - Caching provided 45% cost reduction in month 1
  2. Measured quality at each step - No degradation in customer satisfaction
  3. Tiered models - Right model for right complexity saves 60-80%
  4. Optimized cost AND latency - Users happier with faster responses
  5. Continuous monitoring - Caught regressions before customer impact

Cost vs Performance Trade-off Analysis

| Optimization | Cost Reduction | Latency Impact | Quality Impact | Implementation Time | ROI Timeline |
|---|---|---|---|---|---|
| Response Caching | 40-60% | -95% (faster) | None | 1-2 weeks | Immediate |
| Semantic Caching | 30-50% | -90% (faster) | None | 2-3 weeks | Immediate |
| Model Quantization | 50-75% | -30% (faster) | -2% to -10% | 1 week | Immediate |
| Batch Inference | 30-50% | +20-100ms | None | 2-3 weeks | Month 1 |
| Model Tiering | 60-80% | Neutral | Minimal (w/ routing) | 3-4 weeks | Month 1 |
| Prompt Compression | 20-40% | -10% (faster) | -1% to -5% | 2 weeks | Month 1 |
| Autoscaling | 20-40% | Neutral | None | 1-2 weeks | Ongoing |
| Spot Instances | 60-80% | Neutral | None (w/ fallback) | 2-3 weeks | Immediate |

Capacity Planning Best Practices

| Practice | Description | Impact | When to Apply |
|---|---|---|---|
| 20% Safety Margin | Min replicas = avg load × 1.2 | Handles unexpected spikes | Always |
| P99 for Max | Max replicas = P99 load × 1.2 | Covers rare peak events | High availability SLAs |
| Fast Scale-Up | 60s cooldown, 100% increase | Quick response to load | User-facing services |
| Slow Scale-Down | 300s cooldown, 10% decrease | Prevent thrashing | All autoscaling |
| Reserved Base | Buy reserved instances for min replicas | 40-60% cost savings | Predictable base load |
| On-Demand Peaks | Use on-demand for autoscaling | Flexibility without commitment | Variable workloads |
| Spot for Batch | Use spot instances for non-critical workloads | 60-80% cost savings | Batch processing |

Implementation Checklist

Phase 1: Measurement (Week 1)

  • Set up cost tracking per request/user/feature
  • Profile system to identify bottlenecks
  • Establish baseline latency metrics (P50, P95, P99)
  • Measure current throughput and resource utilization
  • Calculate current unit economics

Phase 2: Quick Wins (Weeks 2-3)

  • Implement response caching (exact match)
  • Add batch processing where latency-tolerant
  • Enable model compilation (torch.compile, TensorRT)
  • Set up basic rate limiting
  • Optimize prompts to reduce token usage

Phase 3: Advanced Optimization (Weeks 4-6)

  • Implement semantic caching
  • Add model quantization (INT8)
  • Set up dynamic batching
  • Implement model tiering (small/large models)
  • Add prompt compression

Phase 4: Capacity Planning (Weeks 7-8)

  • Analyze historical load patterns
  • Model seasonal and growth trends
  • Configure autoscaling policies
  • Set up capacity alerts
  • Plan for peak events (campaigns, launches)

Phase 5: Cost Governance (Weeks 9-10)

  • Set budget alerts by service/team/user
  • Implement rate limiting per user/tenant (see the sketch after this list)
  • Add cost dashboards
  • Create cost allocation reports
  • Set up automatic cost anomaly detection
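
A minimal per-user rate limiter sketch (in-memory token bucket; the class name and parameters are illustrative): production systems would typically back this with Redis so limits hold across replicas.

# rate_limiter.py - in-memory token-bucket limiter per user (sketch)
import time

class TokenBucketLimiter:
    def __init__(self, rate_per_minute=60, burst=20):
        self.rate = rate_per_minute / 60.0   # tokens refilled per second
        self.burst = burst                   # maximum bucket size
        self.buckets = {}                    # user_id -> (tokens, last_refill_ts)

    def allow(self, user_id):
        now = time.monotonic()
        tokens, last = self.buckets.get(user_id, (self.burst, now))
        # Refill tokens based on elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[user_id] = (tokens, now)
            return False
        self.buckets[user_id] = (tokens - 1, now)
        return True

# Usage
limiter = TokenBucketLimiter(rate_per_minute=30, burst=10)
if not limiter.allow("tenant-42"):
    raise RuntimeError("Rate limit exceeded for tenant-42")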

Phase 6: Continuous Optimization (Ongoing)

  • Regular performance review (monthly)
  • A/B test new optimizations
  • Update capacity plans quarterly
  • Benchmark against new models/techniques
  • Optimize based on actual usage patterns

Success Metrics

  • Cost Efficiency: Cost per request reduced by >50%
  • Performance: P95 latency <500ms for interactive use cases
  • Throughput: >80% GPU utilization during peak
  • Quality: <5% degradation from optimizations
  • Reliability: >99.9% availability with autoscaling
  • Capacity: Handle 3x current peak load with autoscaling