Chapter 59 — Cost, Performance & Capacity
Overview
Optimize unit economics and performance; plan capacity for peak and average loads. ML and LLM systems can consume massive resources—tens to hundreds of dollars per thousand requests, multi-second latencies, and unpredictable scaling patterns. Strategic optimization of cost, performance, and capacity keeps systems economically viable while delivering responsive user experiences. Teams that optimize systematically often report cost reductions of 60-85% alongside latency improvements of 40-70%.
Key Objectives
- Establish cost models per request, per token, and per user
- Optimize performance through profiling, batching, and caching
- Implement autoscaling strategies for variable loads
- Plan capacity for peak demand with appropriate buffers
- Balance cost, performance, and quality trade-offs
Deliverables
- Cost model with unit economics analysis
- Performance tuning guide and benchmarks
- Capacity planning models and forecasts
- Autoscaling configurations and policies
- Optimization roadmap with ROI projections
Why It Matters
Without strategic optimization, AI costs can balloon from manageable to unsustainable within weeks. A well-architected cost and performance strategy turns ML from a cost center into a defensible competitive advantage.
The Cost Crisis:
- LLM Costs: GPT-4 costs roughly $0.03/1K input tokens and $0.06/1K output tokens—a chatbot with 100K daily users can easily cost $50K+/month
- Infrastructure: GPU costs range from $35K to $280K per year
- Hidden Costs: Data transfer, storage, monitoring, and failed requests add 20-40% overhead
- Scaling Surprises: A viral feature can 10x costs overnight without proper rate limiting
Organizations often discover:
- Initial MVP costs of roughly $5K/month balloon to $150K/month at scale (a 30x increase)
- 70-80% of costs come from 20% of users (power users, bots)
- 40-60% of requests could be cached without quality degradation
- Simple optimizations (batching, caching, quantization) reduce costs 50-80%
- 90% of performance bottlenecks come from 10% of code paths
Cost Model Architecture
```mermaid
graph TB
    A[User Request] --> B{Cache Hit?}
    B -->|Yes| C[Return Cached<br/>Cost: $0.0001]
    B -->|No| D[Feature Retrieval]
    D --> E{Complex Query?}
    E -->|No| F[Small Model<br/>Cost: $0.001]
    E -->|Yes| G[Large Model<br/>Cost: $0.02]
    F --> H[Cache Result]
    G --> H
    H --> I[Return to User]
    subgraph Cost Tracking
        C --> J[Log Metrics]
        F --> J
        G --> J
        J --> K[Cost Per User]
        J --> L[Cost Per Feature]
        J --> M[Total Daily Cost]
    end
    subgraph Budget Controls
        M --> N{Over Budget?}
        N -->|Yes| O[Throttle Requests]
        N -->|No| P[Continue]
    end
```
Complete Optimization Pipeline
```mermaid
flowchart TD
    A[Production System] --> B[Profiling & Measurement]
    B --> C{Identify Bottlenecks}
    C -->|Slow Requests| D[Performance Optimization]
    C -->|High Costs| E[Cost Optimization]
    C -->|Scaling Issues| F[Capacity Planning]
    D --> D1[Caching Layer]
    D --> D2[Batching]
    D --> D3[Model Optimization]
    E --> E1[Model Quantization]
    E --> E2[Prompt Compression]
    E --> E3[Tiered Models]
    F --> F1[Autoscaling Config]
    F --> F2[Load Forecasting]
    F --> F3[Resource Allocation]
    D1 --> G[Deploy Changes]
    D2 --> G
    D3 --> G
    E1 --> G
    E2 --> G
    E3 --> G
    F1 --> G
    F2 --> G
    F3 --> G
    G --> H[Monitor Metrics]
    H --> I{Goals Met?}
    I -->|No| B
    I -->|Yes| J[Production]
    J --> K[Continuous Monitoring]
    K --> B
```
Performance Optimization Techniques
1. Profiling & Bottleneck Identification
Profiling Framework:
```python
# profiler.py - Simplified profiling for ML systems
import time
from functools import wraps


class Profiler:
    def __init__(self):
        self.metrics = {}

    def time(self, func):
        """Decorator that records wall-clock latency (ms) per function."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            duration = (time.perf_counter() - start) * 1000
            self.metrics.setdefault(func.__name__, []).append(duration)
            return result
        return wrapper

    def report(self):
        for name, times in self.metrics.items():
            ordered = sorted(times)
            print(f"{name}: P50={ordered[len(ordered) // 2]:.1f}ms "
                  f"P95={ordered[int(len(ordered) * 0.95)]:.1f}ms")


# Usage (db and model are placeholders for your vector store and LLM client)
profiler = Profiler()

@profiler.time
def vector_search(query):
    return db.search(query, top_k=5)

@profiler.time
def llm_generate(prompt):
    return model.generate(prompt)

# After serving traffic, call profiler.report() to print latency percentiles
```
Bottleneck Analysis:
| Component | Typical Latency | Target | Optimization Strategy |
|---|---|---|---|
| Input Preprocessing | 5-20ms | <10ms | Vectorize operations, cache |
| Vector Search | 50-200ms | <100ms | Approximate search (HNSW), GPU acceleration |
| LLM Prefill | 200-800ms | <500ms | Batch requests, quantization |
| LLM Decode | 20-50ms/token | <30ms/token | Speculative decoding, KV cache |
| Postprocessing | 10-30ms | <20ms | Optimize regex, parallel processing |
2. Batching Strategies
```mermaid
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant B as Batcher
    participant M as Model
    R1->>B: Add request (t=0ms)
    R2->>B: Add request (t=20ms)
    R3->>B: Add request (t=40ms)
    Note over B: Wait 50ms OR<br/>batch full
    B->>M: Process batch [R1,R2,R3] (t=50ms)
    M->>B: Results [O1,O2,O3] (t=150ms)
    B->>R1: Return O1
    B->>R2: Return O2
    B->>R3: Return O3
    Note over R1,R3: Total latency: 150ms<br/>vs 300ms individual
```
Simplified Batching Implementation:
```python
# batcher.py
import asyncio
from collections import deque


class Batcher:
    def __init__(self, model_fn, max_size=32, max_wait_ms=50):
        self.model_fn = model_fn          # async fn: list of inputs -> list of outputs
        self.max_size = max_size
        self.max_wait = max_wait_ms / 1000
        self.queue = deque()
        self._flush_task = None

    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))
        if len(self.queue) >= self.max_size:
            asyncio.create_task(self._process())
        elif self._flush_task is None or self._flush_task.done():
            # Flush after max_wait even if the batch never fills
            self._flush_task = asyncio.create_task(self._delayed_flush())
        return await future

    async def _delayed_flush(self):
        await asyncio.sleep(self.max_wait)
        await self._process()

    async def _process(self):
        if not self.queue:
            return
        batch = [self.queue.popleft() for _ in range(min(len(self.queue), self.max_size))]
        inputs = [item[0] for item in batch]
        outputs = await self.model_fn(inputs)
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)
```
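A quick usage sketch for the batcher above; `fake_model` is a stand-in for a real batched inference call and the request strings are arbitrary:

```python
import asyncio

async def fake_model(batch):
    # Stand-in for a real batched model call: one forward pass for the whole batch
    await asyncio.sleep(0.1)
    return [f"output for {item}" for item in batch]

async def main():
    batcher = Batcher(fake_model, max_size=32, max_wait_ms=50)
    # Ten concurrent callers share a single batched model call
    results = await asyncio.gather(*(batcher.predict(f"req-{i}") for i in range(10)))
    print(results[:3])

asyncio.run(main())
```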
Batching Performance:
| Batch Size | Individual Latency | Batched Latency | Throughput Gain | Latency Penalty |
|---|---|---|---|---|
| 1 (no batch) | 100ms | 100ms | 1x | 0ms |
| 8 | 100ms | 120ms | 6.7x | +20ms |
| 16 | 100ms | 140ms | 11.4x | +40ms |
| 32 | 100ms | 180ms | 17.8x | +80ms |
| 64 | 100ms | 250ms | 25.6x | +150ms |
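Throughput gain in this table is simply batch size × individual latency ÷ batched latency (for example, 32 × 100ms / 180ms ≈ 17.8x), while the latency penalty is the extra queueing and compute time each request absorbs; pick the largest batch size whose penalty still fits your latency budget.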
3. Caching Strategies
```mermaid
flowchart LR
    A[Request] --> B{L1 Cache<br/>In-Memory}
    B -->|Hit| C[Return <1ms]
    B -->|Miss| D{L2 Cache<br/>Redis}
    D -->|Hit| E[Return 5ms]
    D -->|Miss| F[LLM Inference]
    F --> G[Cache Result]
    G --> H[Return 1-5s]
    E --> I[Promote to L1]
    H --> J[Store in L1+L2]
```
Simplified Caching:
```python
# cache.py
import hashlib
import json

import numpy as np


class MultiLevelCache:
    def __init__(self, redis_client, max_l1=1000, ttl=3600):
        self.redis = redis_client
        self.l1 = {}          # in-process cache (L1)
        self.max_l1 = max_l1
        self.ttl = ttl        # Redis TTL in seconds

    def _key(self, data):
        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def get(self, data):
        key = self._key(data)
        # L1 check
        if key in self.l1:
            return self.l1[key]
        # L2 check (Redis)
        val = self.redis.get(key)
        if val:
            result = json.loads(val)
            self.l1[key] = result  # Promote to L1
            return result
        return None

    def set(self, data, result):
        key = self._key(data)
        if len(self.l1) >= self.max_l1:
            self.l1.pop(next(iter(self.l1)))  # Evict oldest L1 entry (FIFO)
        self.l1[key] = result
        self.redis.setex(key, self.ttl, json.dumps(result))


def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Semantic cache for similar queries
class SemanticCache:
    def __init__(self, threshold=0.95):
        self.cache = []  # (embedding, response) pairs
        self.threshold = threshold

    def get(self, embedding):
        # Linear scan; swap in a vector index for large caches
        for cached_emb, response in self.cache:
            if cosine_sim(embedding, cached_emb) >= self.threshold:
                return response
        return None

    def set(self, embedding, response):
        self.cache.append((embedding, response))
        if len(self.cache) > 10000:
            self.cache = self.cache[-10000:]
```
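A minimal usage sketch for the multi-level cache, assuming a local Redis instance and a placeholder `call_llm` function standing in for the real model call:

```python
import redis

cache = MultiLevelCache(redis.Redis(host="localhost", port=6379))

def answer(prompt):
    cached = cache.get({"prompt": prompt})
    if cached is not None:
        return cached                      # ~1-5ms, near-zero cost
    result = call_llm(prompt)              # placeholder: 1-5s, $0.01-0.10
    cache.set({"prompt": prompt}, result)
    return result
```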
Cache Performance Comparison:
| Cache Type | Hit Latency | Cost per Hit | Hit Rate (typical) | Use Case |
|---|---|---|---|---|
| In-Memory LRU | <1ms | $0.00001 | 20-30% | Frequent exact matches |
| Redis | 1-5ms | $0.0001 | 40-60% | Distributed systems |
| Semantic Cache | 5-10ms | $0.0005 | 30-50% | Similar LLM queries |
| Database | 10-50ms | $0.001 | 70-90% | Long-term storage |
| No Cache | 1000-5000ms | $0.01-0.10 | N/A | Baseline |
4. Model Optimization
Quantization Techniques:
```python
# optimization.py
# Requires: transformers, accelerate, bitsandbytes (for 8-bit/4-bit loading)
import torch
from transformers import AutoModelForCausalLM

# INT8 quantization - ~50% memory, ~2% quality loss
model_int8 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_8bit=True,
    device_map="auto",
)

# INT4 quantization - ~75% memory savings, ~5-10% quality loss
model_int4 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_4bit=True,
    device_map="auto",
)
# Note: recent transformers releases prefer passing
# quantization_config=BitsAndBytesConfig(...) instead of the flags above.

# PyTorch compilation - typically 20-30% speedup, no quality loss
model = AutoModelForCausalLM.from_pretrained("model-name", torch_dtype=torch.float16)
model_compiled = torch.compile(model, mode="reduce-overhead")
```
Performance Comparison:
| Optimization | Memory | Latency | Throughput | Quality | Setup Effort |
|---|---|---|---|---|---|
| Baseline FP16 | 100% | 100% | 100% | 100% | Low |
| INT8 Quantization | 50% | 60-70% | 150% | 98-99% | Low |
| INT4 Quantization | 25% | 40-50% | 200% | 90-95% | Medium |
| Mixed Precision | 70% | 70-80% | 130% | 99.5% | Low |
| torch.compile | 100% | 70-80% | 120-130% | 100% | Low |
| All Combined | 30% | 30-40% | 300-400% | 95-98% | Medium |
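The mixed-precision row can be approximated at inference time with PyTorch's autocast; a minimal sketch, assuming `model` and `inputs` already live on a CUDA device:

```python
import torch

# Run the forward pass in FP16 where safe; PyTorch keeps numerically
# sensitive ops in FP32 automatically
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```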
5. Autoscaling Strategies
```mermaid
flowchart TD
    A[Monitor Metrics] --> B{CPU > 70%<br/>OR<br/>Latency > 500ms<br/>OR<br/>Queue > 100}
    B -->|Yes| C[Scale Up<br/>+2 replicas]
    B -->|No| D{CPU < 30%<br/>AND<br/>Latency < 100ms<br/>AND<br/>Queue < 10}
    D -->|Yes| E[Scale Down<br/>-1 replica]
    D -->|No| F[Maintain]
    C --> G[Wait 60s Cooldown]
    E --> H[Wait 300s Cooldown]
    F --> A
    G --> A
    H --> A
    style C fill:#90EE90
    style E fill:#FFB6C1
```
Kubernetes Autoscaling Config:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95
        target:
          type: AverageValue
          averageValue: 500m   # 0.5s if the custom metric reports seconds
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100           # Double quickly
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10            # Reduce slowly
          periodSeconds: 60
```
Autoscaling Strategy Comparison:
| Metric | Scale Up Threshold | Scale Down Threshold | Cooldown | Reason |
|---|---|---|---|---|
| CPU | >70% | <30% | 60s / 300s | Prevent thrashing |
| Latency P95 | >500ms | <100ms | 60s / 300s | User experience |
| Queue Depth | >100 | <10 | 60s / 300s | Request backlog |
| Error Rate | >5% | N/A | 30s / N/A | Possible overload |
Cost Modeling & Optimization
```mermaid
flowchart LR
    A[Request] --> B{Cache Hit?}
    B -->|45%| C[Cache<br/>$0.0001]
    B -->|55%| D{Model Tier}
    D -->|Simple 60%| E[Small Model<br/>$0.002]
    D -->|Complex 40%| F[Large Model<br/>$0.025]
    C --> G[Track Cost]
    E --> G
    F --> G
    G --> H{Cost Analysis}
    H --> I[Per User]
    H --> J[Per Feature]
    H --> K[Total Budget]
```
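Treating the percentages in the diagram as weights gives a blended per-request cost, which is a useful sanity check before building the fuller model below:

```python
# Figures taken from the routing diagram above (illustrative, not measured)
cache_rate, simple_share = 0.45, 0.60
blended = (cache_rate * 0.0001
           + (1 - cache_rate) * (simple_share * 0.002 + (1 - simple_share) * 0.025))
print(f"Blended cost per request: ${blended:.4f}")  # ≈ $0.0062
```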
Simplified Cost Calculator:
```python
# cost_model.py
class CostModel:
    def __init__(self):
        self.gpu_hour = 4.00          # A100
        self.input_token_1k = 0.03    # GPT-4
        self.output_token_1k = 0.06

    def request_cost(self, input_tokens, output_tokens, cached=False):
        if cached:
            return 0.0001
        return (input_tokens / 1000 * self.input_token_1k +
                output_tokens / 1000 * self.output_token_1k)

    def monthly_projection(self, daily_requests, avg_in, avg_out, cache_rate=0.4):
        cached = daily_requests * cache_rate * 0.0001
        compute = daily_requests * (1 - cache_rate) * self.request_cost(avg_in, avg_out)
        monthly = (cached + compute) * 30
        return {
            "monthly_cost": monthly,
            "per_request": monthly / (daily_requests * 30),
        }


# Example
model = CostModel()
proj = model.monthly_projection(
    daily_requests=100000,
    avg_in=500,
    avg_out=300,
    cache_rate=0.45,
)
print(f"Monthly: ${proj['monthly_cost']:,.2f}, Per request: ${proj['per_request']:.4f}")
Cost Optimization Strategies
| Strategy | Cost Reduction | Implementation Effort | Quality Impact |
|---|---|---|---|
| Response Caching | 40-60% | Low | None (cache hits) |
| Semantic Caching | 30-50% | Medium | None |
| Model Quantization | 50-75% | Low | 2-10% degradation |
| Batch Inference | 30-50% | Medium | Slight latency increase |
| Prompt Compression | 20-40% | Medium | 1-5% degradation |
| Smaller Model for Simple Queries | 60-80% (on those queries) | Medium | Minimal (w/ routing) |
| Rate Limiting Power Users | 20-30% | Low | User experience tradeoff |
| Spot/Preemptible Instances | 60-80% | Medium-High | Requires fault tolerance |
| All Combined | 80-95% | High | 5-15% degradation |
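Model tiering is usually the highest-leverage row for LLM-heavy workloads. The sketch below illustrates the idea with a keyword heuristic; the classifier rule, model names, and `call_model` helper are illustrative assumptions rather than a prescribed implementation (production routers are often small classifier models):

```python
# tiered_router.py - illustrative sketch of routing simple queries to a cheaper model
SIMPLE_KEYWORDS = ("refund", "reset password", "pricing", "business hours")
SIMPLE_MAX_WORDS = 60

def classify(query: str) -> str:
    """Crude heuristic: short queries hitting FAQ keywords go to the small model."""
    is_short = len(query.split()) <= SIMPLE_MAX_WORDS
    hits_keyword = any(k in query.lower() for k in SIMPLE_KEYWORDS)
    return "simple" if is_short and hits_keyword else "complex"

def answer(query: str) -> str:
    if classify(query) == "simple":
        return call_model("small-model", query)   # ~10x cheaper per request
    return call_model("large-model", query)
```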
Capacity Planning
```mermaid
flowchart TD
    A[Historical Data] --> B[Analyze Patterns]
    B --> C[Average RPS]
    B --> D[Peak RPS P99]
    B --> E[Growth Trend]
    C --> F[Min Replicas<br/>Avg × 1.2]
    D --> G[Max Replicas<br/>Peak × 1.2]
    E --> H[Future Forecast]
    F --> I[Reserved/Always-On]
    G --> J[Autoscale Limit]
    I --> K[Cost Optimization]
    J --> K
    K --> L[Reserved for Base<br/>On-Demand for Peaks]
```
Simplified Capacity Planner:
```python
# capacity.py
import numpy as np


class CapacityPlanner:
    def __init__(self, historical_rps):
        self.rps = np.array(historical_rps)

    def analyze(self):
        return {
            "avg": np.mean(self.rps),
            "p95": np.percentile(self.rps, 95),
            "p99": np.percentile(self.rps, 99),
            "max": np.max(self.rps),
        }

    def recommend(self, replica_rps=10, margin=0.2):
        stats = self.analyze()
        min_reps = max(2, int(np.ceil(stats["avg"] / replica_rps * (1 + margin))))
        max_reps = int(np.ceil(stats["p99"] / replica_rps * (1 + margin)))
        return {
            "min_replicas": min_reps,
            "max_replicas": max_reps,
            "avg_utilization": stats["avg"] / (min_reps * replica_rps),
        }

    def forecast(self, growth_monthly=0.10, months=6):
        current = self.analyze()
        return [{
            "month": m,
            "avg_rps": current["avg"] * (1 + growth_monthly) ** m,
            "p99_rps": current["p99"] * (1 + growth_monthly) ** m,
        } for m in range(1, months + 1)]
```
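A quick usage sketch with synthetic hourly RPS samples (the traffic shape and per-replica capacity are made-up illustrations):

```python
import numpy as np

# One synthetic week of hourly RPS: ~40 RPS baseline plus daily peaks and noise
rng = np.random.default_rng(0)
hourly_rps = 40 + 25 * np.sin(np.linspace(0, 14 * np.pi, 24 * 7)) ** 2 + rng.normal(0, 5, 24 * 7)

planner = CapacityPlanner(hourly_rps)
print(planner.recommend(replica_rps=10))                     # min/max replicas and expected utilization
print(planner.forecast(growth_monthly=0.10, months=3)[-1])   # projected load in 3 months
```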
Case Study: Customer Support Chatbot Cost Optimization
Background: A SaaS company deployed an LLM-powered customer support chatbot. Initial costs: $180K/month for 3M conversations.
```mermaid
flowchart LR
    A[Initial State<br/>$180K/month] --> B[Phase 1: Caching<br/>$99K/month<br/>45% reduction]
    B --> C[Phase 2: Tiering<br/>$69K/month<br/>30% reduction]
    C --> D[Phase 3: Batching<br/>$59K/month<br/>15% reduction]
    D --> E[Phase 4: Prompts<br/>$53K/month<br/>10% reduction]
    E --> F[Final: 71% Total<br/>Cost Reduction]
    style A fill:#ffcccc
    style E fill:#ccffcc
```
Initial Architecture:
- GPT-4 for all responses (100% traffic)
- No caching, no batching
- 500 input tokens, 300 output tokens per conversation
- Cost: $0.06/conversation against a target of $0.01/conversation
- Latency: 3-5 seconds (users complained)
Problems:
- Unsustainable unit economics (6x over budget)
- 50% of questions were FAQ variations (wasteful)
- Poor user experience due to latency
- No optimization strategy
Optimization Journey:
Phase 1: Semantic Caching (Month 1)
- Implemented multi-level cache (in-memory + Redis)
- 45% cache hit rate for common questions
- Result: $99K/month (45% reduction)
- Latency: Cached responses 100ms (30x faster)
Phase 2: Model Tiering (Month 2)
- Router classifies queries as simple/complex
- GPT-3.5-Turbo for simple (60% of traffic, 10x cheaper)
- GPT-4 for complex (40% of traffic)
- Result: $69K/month (30% reduction)
- Quality: Maintained with smart routing
Phase 3: Batching & Quantization (Month 3)
- Email responses batched (not real-time)
- Self-hosted quantized model for FAQs
- Result: $59K/month (15% reduction)
- Bonus: Reduced API dependency
Phase 4: Prompt Optimization (Month 4)
- Compressed system prompts (40% fewer tokens)
- Context pruning based on relevance scoring
- Early stopping when confidence > 0.95
- Result: $53K/month (10% reduction)
Final Results:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly Cost | $180K | $53K | 71% reduction |
| Cost per Conversation | $0.06 | $0.018 | 70% reduction |
| Average Latency | 3.5s | 1.2s | 66% improvement |
| P95 Latency | 5.2s | 2.1s | 60% improvement |
| CSAT Score | 7.2/10 | 8.1/10 | 12% improvement (faster!) |
| Cache Hit Rate | 0% | 45% | N/A |
| GPU Utilization | 35% | 72% | 2x efficiency |
Key Success Factors:
- Started with easy wins - Caching provided 45% cost reduction in month 1
- Measured quality at each step - No degradation in customer satisfaction
- Tiered models - Right model for right complexity saves 60-80%
- Optimized cost AND latency - Users happier with faster responses
- Continuous monitoring - Caught regressions before customer impact
Cost vs Performance Trade-off Analysis
| Optimization | Cost Reduction | Latency Impact | Quality Impact | Implementation Time | ROI Timeline |
|---|---|---|---|---|---|
| Response Caching | 40-60% | -95% (faster) | None | 1-2 weeks | Immediate |
| Semantic Caching | 30-50% | -90% (faster) | None | 2-3 weeks | Immediate |
| Model Quantization | 50-75% | -30% (faster) | -2% to -10% | 1 week | Immediate |
| Batch Inference | 30-50% | +20-100ms | None | 2-3 weeks | Month 1 |
| Model Tiering | 60-80% | Neutral | Minimal (w/ routing) | 3-4 weeks | Month 1 |
| Prompt Compression | 20-40% | -10% (faster) | -1% to -5% | 2 weeks | Month 1 |
| Autoscaling | 20-40% | Neutral | None | 1-2 weeks | Ongoing |
| Spot Instances | 60-80% | Neutral | None (w/ fallback) | 2-3 weeks | Immediate |
Capacity Planning Best Practices
| Practice | Description | Impact | When to Apply |
|---|---|---|---|
| 20% Safety Margin | Min replicas = avg load × 1.2 | Handles unexpected spikes | Always |
| P99 for Max | Max replicas = P99 load × 1.2 | Covers rare peak events | High availability SLAs |
| Fast Scale-Up | 60s cooldown, 100% increase | Quick response to load | User-facing services |
| Slow Scale-Down | 300s cooldown, 10% decrease | Prevent thrashing | All autoscaling |
| Reserved Base | Buy reserved instances for min replicas | 40-60% cost savings | Predictable base load |
| On-Demand Peaks | Use on-demand for autoscaling | Flexibility without commitment | Variable workloads |
| Spot for Batch | Use spot instances for non-critical | 60-80% cost savings | Batch processing |
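A rough sketch of the reserved-base plus on-demand-peak split from the table; the hourly rate, discount, and peak-hours figures are placeholder assumptions to replace with your own numbers:

```python
ON_DEMAND_HOURLY = 4.00        # placeholder per-replica rate
RESERVED_DISCOUNT = 0.40       # placeholder ~40% discount for a 1-year commitment

def monthly_compute_cost(min_replicas, peak_replicas, peak_hours_per_day=6):
    hours = 24 * 30
    reserved = min_replicas * hours * ON_DEMAND_HOURLY * (1 - RESERVED_DISCOUNT)
    on_demand_burst = max(0, peak_replicas - min_replicas) * peak_hours_per_day * 30 * ON_DEMAND_HOURLY
    return reserved + on_demand_burst

print(f"${monthly_compute_cost(min_replicas=4, peak_replicas=10):,.0f}/month")  # $11,232 with these placeholders
```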
Implementation Checklist
Phase 1: Measurement (Week 1)
- Set up cost tracking per request/user/feature
- Profile system to identify bottlenecks
- Establish baseline latency metrics (P50, P95, P99)
- Measure current throughput and resource utilization
- Calculate current unit economics
Phase 2: Quick Wins (Weeks 2-3)
- Implement response caching (exact match)
- Add batch processing where latency-tolerant
- Enable model compilation (torch.compile, TensorRT)
- Set up basic rate limiting
- Optimize prompts to reduce token usage
Phase 3: Advanced Optimization (Weeks 4-6)
- Implement semantic caching
- Add model quantization (INT8)
- Set up dynamic batching
- Implement model tiering (small/large models)
- Add prompt compression
Phase 4: Capacity Planning (Weeks 7-8)
- Analyze historical load patterns
- Model seasonal and growth trends
- Configure autoscaling policies
- Set up capacity alerts
- Plan for peak events (campaigns, launches)
Phase 5: Cost Governance (Weeks 9-10)
- Set budget alerts by service/team/user
- Implement rate limiting per user/tenant
- Add cost dashboards
- Create cost allocation reports
- Set up automatic cost anomaly detection
Phase 6: Continuous Optimization (Ongoing)
- Regular performance review (monthly)
- A/B test new optimizations
- Update capacity plans quarterly
- Benchmark against new models/techniques
- Optimize based on actual usage patterns
Success Metrics
- Cost Efficiency: Cost per request reduced by >50%
- Performance: P95 latency <500ms for interactive use cases
- Throughput: >80% GPU utilization during peak
- Quality: <5% degradation from optimizations
- Reliability: >99.9% availability with autoscaling
- Capacity: Handle 3x current peak load with autoscaling