Chapter 59 — Cost, Performance & Capacity
Overview
Optimize unit economics and performance; plan capacity for peak and average loads. ML and LLM systems can consume massive resources—tens to hundreds of dollars per thousand requests, multi-second latencies, and unpredictable scaling patterns. Strategic optimization of cost, performance, and capacity keeps systems economically viable while delivering responsive user experiences. Teams that optimize systematically often report cost reductions of 60-85% alongside latency improvements of 40-70%.
Key Objectives
- Establish cost models per request, per token, and per user
- Optimize performance through profiling, batching, and caching
- Implement autoscaling strategies for variable loads
- Plan capacity for peak demand with appropriate buffers
- Balance cost, performance, and quality trade-offs
Deliverables
- Cost model with unit economics analysis
- Performance tuning guide and benchmarks
- Capacity planning models and forecasts
- Autoscaling configurations and policies
- Optimization roadmap with ROI projections
Why It Matters
Without strategic optimization, AI costs can balloon from manageable to unsustainable within weeks. A well-architected cost and performance strategy turns ML from a cost center into a defensible competitive advantage.
The Cost Crisis:
- LLM Costs: GPT-4 costs roughly $0.03/1K input tokens and $0.06/1K output tokens—a chatbot with 100K daily users can easily cost $50K+/month
- Infrastructure: GPU costs range from $35K to $280K per year
- Hidden Costs: Data transfer, storage, monitoring, and failed requests add 20-40% overhead
- Scaling Surprises: A viral feature can 10x costs overnight without proper rate limiting
Organizations often discover:
- Initial MVP costs of roughly $5K/month balloon to $150K/month at scale (a 30x increase)
- 70-80% of costs come from 20% of users (power users, bots)
- 40-60% of requests could be cached without quality degradation
- Simple optimizations (batching, caching, quantization) reduce costs 50-80%
- 90% of performance bottlenecks come from 10% of code paths
Cost Model Architecture
```mermaid
graph TB
    A[User Request] --> B{Cache Hit?}
    B -->|Yes| C[Return Cached<br/>Cost: $0.0001]
    B -->|No| D[Feature Retrieval]
    D --> E{Complex Query?}
    E -->|No| F[Small Model<br/>Cost: $0.001]
    E -->|Yes| G[Large Model<br/>Cost: $0.02]
    F --> H[Cache Result]
    G --> H
    H --> I[Return to User]
    subgraph Cost Tracking
        C --> J[Log Metrics]
        F --> J
        G --> J
        J --> K[Cost Per User]
        J --> L[Cost Per Feature]
        J --> M[Total Daily Cost]
    end
    subgraph Budget Controls
        M --> N{Over Budget?}
        N -->|Yes| O[Throttle Requests]
        N -->|No| P[Continue]
    end
```
Complete Optimization Pipeline
```mermaid
flowchart TD
    A[Production System] --> B[Profiling & Measurement]
    B --> C{Identify Bottlenecks}
    C -->|Slow Requests| D[Performance Optimization]
    C -->|High Costs| E[Cost Optimization]
    C -->|Scaling Issues| F[Capacity Planning]
    D --> D1[Caching Layer]
    D --> D2[Batching]
    D --> D3[Model Optimization]
    E --> E1[Model Quantization]
    E --> E2[Prompt Compression]
    E --> E3[Tiered Models]
    F --> F1[Autoscaling Config]
    F --> F2[Load Forecasting]
    F --> F3[Resource Allocation]
    D1 --> G[Deploy Changes]
    D2 --> G
    D3 --> G
    E1 --> G
    E2 --> G
    E3 --> G
    F1 --> G
    F2 --> G
    F3 --> G
    G --> H[Monitor Metrics]
    H --> I{Goals Met?}
    I -->|No| B
    I -->|Yes| J[Production]
    J --> K[Continuous Monitoring]
    K --> B
```
Performance Optimization Techniques
1. Profiling & Bottleneck Identification
Profiling Framework:
```python
# profiler.py - Simplified profiling for ML systems
import time
from functools import wraps


class Profiler:
    def __init__(self):
        self.metrics = {}

    def time(self, func):
        """Decorator that records wall-clock latency (ms) per function."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            duration = (time.perf_counter() - start) * 1000
            self.metrics.setdefault(func.__name__, []).append(duration)
            return result
        return wrapper

    def report(self):
        for name, times in self.metrics.items():
            ordered = sorted(times)
            print(f"{name}: P50={ordered[len(ordered) // 2]:.1f}ms "
                  f"P95={ordered[int(len(ordered) * 0.95)]:.1f}ms")


# Usage (db and model are placeholders for your vector store and LLM client)
profiler = Profiler()

@profiler.time
def vector_search(query):
    return db.search(query, top_k=5)

@profiler.time
def llm_generate(prompt):
    return model.generate(prompt)

# After serving traffic, call profiler.report() to print latency percentiles
```
Bottleneck Analysis:
| Component | Typical Latency | Target | Optimization Strategy |
|---|---|---|---|
| Input Preprocessing | 5-20ms | <10ms | Vectorize operations, cache |
| Vector Search | 50-200ms | <100ms | Approximate search (HNSW), GPU acceleration |
| LLM Prefill | 200-800ms | <500ms | Batch requests, quantization |
| LLM Decode | 20-50ms/token | <30ms/token | Speculative decoding, KV cache |
| Postprocessing | 10-30ms | <20ms | Optimize regex, parallel processing |
2. Batching Strategies
```mermaid
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant B as Batcher
    participant M as Model
    R1->>B: Add request (t=0ms)
    R2->>B: Add request (t=20ms)
    R3->>B: Add request (t=40ms)
    Note over B: Wait 50ms OR<br/>batch full
    B->>M: Process batch [R1,R2,R3] (t=50ms)
    M->>B: Results [O1,O2,O3] (t=150ms)
    B->>R1: Return O1
    B->>R2: Return O2
    B->>R3: Return O3
    Note over R1,R3: Total latency: 150ms<br/>vs 300ms individual
```
Simplified Batching Implementation:
```python
# batcher.py
import asyncio
from collections import deque


class Batcher:
    def __init__(self, model_fn, max_size=32, max_wait_ms=50):
        self.model_fn = model_fn          # async fn: list of inputs -> list of outputs
        self.max_size = max_size
        self.max_wait = max_wait_ms / 1000
        self.queue = deque()
        self._flush_task = None

    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))
        if len(self.queue) >= self.max_size:
            asyncio.create_task(self._process())
        elif self._flush_task is None or self._flush_task.done():
            # Flush after max_wait even if the batch never fills
            self._flush_task = asyncio.create_task(self._delayed_flush())
        return await future

    async def _delayed_flush(self):
        await asyncio.sleep(self.max_wait)
        await self._process()

    async def _process(self):
        if not self.queue:
            return
        batch = [self.queue.popleft() for _ in range(min(len(self.queue), self.max_size))]
        inputs = [item[0] for item in batch]
        outputs = await self.model_fn(inputs)
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)
```
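A quick usage sketch for the batcher above; `fake_model` is a stand-in for a real batched inference call and the request strings are arbitrary:

```python
import asyncio

async def fake_model(batch):
    # Stand-in for a real batched model call: one forward pass for the whole batch
    await asyncio.sleep(0.1)
    return [f"output for {item}" for item in batch]

async def main():
    batcher = Batcher(fake_model, max_size=32, max_wait_ms=50)
    # Ten concurrent callers share a single batched model call
    results = await asyncio.gather(*(batcher.predict(f"req-{i}") for i in range(10)))
    print(results[:3])

asyncio.run(main())
```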
Batching Performance:
| Batch Size | Individual Latency | Batched Latency | Throughput Gain | Latency Penalty |
|---|---|---|---|---|
| 1 (no batch) | 100ms | 100ms | 1x | 0ms |
| 8 | 100ms | 120ms | 6.7x | +20ms |
| 16 | 100ms | 140ms | 11.4x | +40ms |
| 32 | 100ms | 180ms | 17.8x | +80ms |
| 64 | 100ms | 250ms | 25.6x | +150ms |
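Throughput gain in this table is simply batch size × individual latency ÷ batched latency (for example, 32 × 100ms / 180ms ≈ 17.8x), while the latency penalty is the extra queueing and compute time each request absorbs; pick the largest batch size whose penalty still fits your latency budget.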
3. Caching Strategies
```mermaid
flowchart LR
    A[Request] --> B{L1 Cache<br/>In-Memory}
    B -->|Hit| C[Return <1ms]
    B -->|Miss| D{L2 Cache<br/>Redis}
    D -->|Hit| E[Return 5ms]
    D -->|Miss| F[LLM Inference]
    F --> G[Cache Result]
    G --> H[Return 1-5s]
    E --> I[Promote to L1]
    H --> J[Store in L1+L2]
```
Simplified Caching:
```python
# cache.py
import hashlib
import json

import numpy as np


class MultiLevelCache:
    def __init__(self, redis_client, max_l1=1000, ttl=3600):
        self.redis = redis_client
        self.l1 = {}          # in-process cache (L1)
        self.max_l1 = max_l1
        self.ttl = ttl        # Redis TTL in seconds

    def _key(self, data):
        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def get(self, data):
        key = self._key(data)
        # L1 check
        if key in self.l1:
            return self.l1[key]
        # L2 check (Redis)
        val = self.redis.get(key)
        if val:
            result = json.loads(val)
            self.l1[key] = result  # Promote to L1
            return result
        return None

    def set(self, data, result):
        key = self._key(data)
        if len(self.l1) >= self.max_l1:
            self.l1.pop(next(iter(self.l1)))  # Evict oldest L1 entry (FIFO)
        self.l1[key] = result
        self.redis.setex(key, self.ttl, json.dumps(result))


def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Semantic cache for similar queries
class SemanticCache:
    def __init__(self, threshold=0.95):
        self.cache = []  # (embedding, response) pairs
        self.threshold = threshold

    def get(self, embedding):
        # Linear scan; swap in a vector index for large caches
        for cached_emb, response in self.cache:
            if cosine_sim(embedding, cached_emb) >= self.threshold:
                return response
        return None

    def set(self, embedding, response):
        self.cache.append((embedding, response))
        if len(self.cache) > 10000:
            self.cache = self.cache[-10000:]
```
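A minimal usage sketch for the multi-level cache, assuming a local Redis instance and a placeholder `call_llm` function standing in for the real model call:

```python
import redis

cache = MultiLevelCache(redis.Redis(host="localhost", port=6379))

def answer(prompt):
    cached = cache.get({"prompt": prompt})
    if cached is not None:
        return cached                      # ~1-5ms, near-zero cost
    result = call_llm(prompt)              # placeholder: 1-5s, $0.01-0.10
    cache.set({"prompt": prompt}, result)
    return result
```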
Cache Performance Comparison:
| Cache Type | Hit Latency | Cost per Hit | Hit Rate (typical) | Use Case |
|---|---|---|---|---|
| In-Memory LRU | <1ms | $0.00001 | 20-30% | Frequent exact matches |
| Redis | 1-5ms | $0.0001 | 40-60% | Distributed systems |
| Semantic Cache | 5-10ms | $0.0005 | 30-50% | Similar LLM queries |
| Database | 10-50ms | $0.001 | 70-90% | Long-term storage |
| No Cache | 1000-5000ms | $0.01-0.10 | N/A | Baseline |
4. Model Optimization
Quantization Techniques:
```python
# optimization.py
# Requires: transformers, accelerate, bitsandbytes (for 8-bit/4-bit loading)
import torch
from transformers import AutoModelForCausalLM

# INT8 quantization - ~50% memory, ~2% quality loss
model_int8 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_8bit=True,
    device_map="auto",
)

# INT4 quantization - ~75% memory savings, ~5-10% quality loss
model_int4 = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_4bit=True,
    device_map="auto",
)
# Note: recent transformers releases prefer passing
# quantization_config=BitsAndBytesConfig(...) instead of the flags above.

# PyTorch compilation - typically 20-30% speedup, no quality loss
model = AutoModelForCausalLM.from_pretrained("model-name", torch_dtype=torch.float16)
model_compiled = torch.compile(model, mode="reduce-overhead")
```
Performance Comparison:
| Optimization | Memory | Latency | Throughput | Quality | Setup Effort |
|---|---|---|---|---|---|
| Baseline FP16 | 100% | 100% | 100% | 100% | Low |
| INT8 Quantization | 50% | 60-70% | 150% | 98-99% | Low |
| INT4 Quantization | 25% | 40-50% | 200% | 90-95% | Medium |
| Mixed Precision | 70% | 70-80% | 130% | 99.5% | Low |
| torch.compile | 100% | 70-80% | 120-130% | 100% | Low |
| All Combined | 30% | 30-40% | 300-400% | 95-98% | Medium |
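The mixed-precision row can be approximated at inference time with PyTorch's autocast; a minimal sketch, assuming `model` and `inputs` already live on a CUDA device:

```python
import torch

# Run the forward pass in FP16 where safe; PyTorch keeps numerically
# sensitive ops in FP32 automatically
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```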
5. Autoscaling Strategies
```mermaid
flowchart TD
    A[Monitor Metrics] --> B{CPU > 70%<br/>OR<br/>Latency > 500ms<br/>OR<br/>Queue > 100}
    B -->|Yes| C[Scale Up<br/>+2 replicas]
    B -->|No| D{CPU < 30%<br/>AND<br/>Latency < 100ms<br/>AND<br/>Queue < 10}
    D -->|Yes| E[Scale Down<br/>-1 replica]
    D -->|No| F[Maintain]
    C --> G[Wait 60s Cooldown]
    E --> H[Wait 300s Cooldown]
    F --> A
    G --> A
    H --> A
    style C fill:#90EE90
    style E fill:#FFB6C1
```
Kubernetes Autoscaling Config:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95
        target:
          type: AverageValue
          averageValue: 500m   # 0.5s if the custom metric reports seconds
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100           # Double quickly
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10            # Reduce slowly
          periodSeconds: 60
```
Autoscaling Strategy Comparison:
| Metric | Scale Up Threshold | Scale Down Threshold | Cooldown | Reason |
|---|---|---|---|---|
| CPU | >70% | <30% | 60s / 300s | Prevent thrashing |
| Latency P95 | >500ms | <100ms | 60s / 300s | User experience |
| Queue Depth | >100 | <10 | 60s / 300s | Request backlog |
| Error Rate | >5% | N/A | 30s / N/A | Possible overload |
Cost Modeling & Optimization
```mermaid
flowchart LR
    A[Request] --> B{Cache Hit?}
    B -->|45%| C[Cache<br/>$0.0001]
    B -->|55%| D{Model Tier}
    D -->|Simple 60%| E[Small Model<br/>$0.002]
    D -->|Complex 40%| F[Large Model<br/>$0.025]
    C --> G[Track Cost]
    E --> G
    F --> G
    G --> H{Cost Analysis}
    H --> I[Per User]
    H --> J[Per Feature]
    H --> K[Total Budget]
```
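Treating the percentages in the diagram as weights gives a blended per-request cost, which is a useful sanity check before building the fuller model below:

```python
# Figures taken from the routing diagram above (illustrative, not measured)
cache_rate, simple_share = 0.45, 0.60
blended = (cache_rate * 0.0001
           + (1 - cache_rate) * (simple_share * 0.002 + (1 - simple_share) * 0.025))
print(f"Blended cost per request: ${blended:.4f}")  # ≈ $0.0062
```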
Simplified Cost Calculator:
```python
# cost_model.py
class CostModel:
    def __init__(self):
        self.gpu_hour = 4.00          # A100
        self.input_token_1k = 0.03    # GPT-4
        self.output_token_1k = 0.06

    def request_cost(self, input_tokens, output_tokens, cached=False):
        if cached:
            return 0.0001
        return (input_tokens / 1000 * self.input_token_1k +
                output_tokens / 1000 * self.output_token_1k)

    def monthly_projection(self, daily_requests, avg_in, avg_out, cache_rate=0.4):
        cached = daily_requests * cache_rate * 0.0001
        compute = daily_requests * (1 - cache_rate) * self.request_cost(avg_in, avg_out)
        monthly = (cached + compute) * 30
        return {
            "monthly_cost": monthly,
            "per_request": monthly / (daily_requests * 30),
        }


# Example
model = CostModel()
proj = model.monthly_projection(
    daily_requests=100000,
    avg_in=500,
    avg_out=300,
    cache_rate=0.45,
)
print(f"Monthly: ${proj['monthly_cost']:,.2f}, Per request: ${proj['per_request']:.4f}")
Cost Optimization Strategies
| Strategy | Cost Reduction | Implementation Effort | Quality Impact |
|---|---|---|---|
| Response Caching | 40-60% | Low | None (cache hits) |
| Semantic Caching | 30-50% | Medium | None |
| Model Quantization | 50-75% | Low | 2-10% degradation |
| Batch Inference | 30-50% | Medium | Slight latency increase |
| Prompt Compression | 20-40% | Medium | 1-5% degradation |
| Smaller Model for Simple Queries | 60-80% (on those queries) | Medium | Minimal (w/ routing) |
| Rate Limiting Power Users | 20-30% | Low | User experience tradeoff |
| Spot/Preemptible Instances | 60-80% | Medium-High | Requires fault tolerance |
| All Combined | 80-95% | High | 5-15% degradation |
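Model tiering is usually the highest-leverage row for LLM-heavy workloads. The sketch below illustrates the idea with a keyword heuristic; the classifier rule, model names, and `call_model` helper are illustrative assumptions rather than a prescribed implementation (production routers are often small classifier models):

```python
# tiered_router.py - illustrative sketch of routing simple queries to a cheaper model
SIMPLE_KEYWORDS = ("refund", "reset password", "pricing", "business hours")
SIMPLE_MAX_WORDS = 60

def classify(query: str) -> str:
    """Crude heuristic: short queries hitting FAQ keywords go to the small model."""
    is_short = len(query.split()) <= SIMPLE_MAX_WORDS
    hits_keyword = any(k in query.lower() for k in SIMPLE_KEYWORDS)
    return "simple" if is_short and hits_keyword else "complex"

def answer(query: str) -> str:
    if classify(query) == "simple":
        return call_model("small-model", query)   # ~10x cheaper per request
    return call_model("large-model", query)
```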
Capacity Planning
```mermaid
flowchart TD
    A[Historical Data] --> B[Analyze Patterns]
    B --> C[Average RPS]
    B --> D[Peak RPS P99]
    B --> E[Growth Trend]
    C --> F[Min Replicas<br/>Avg × 1.2]
    D --> G[Max Replicas<br/>Peak × 1.2]
    E --> H[Future Forecast]
    F --> I[Reserved/Always-On]
    G --> J[Autoscale Limit]
    I --> K[Cost Optimization]
    J --> K
    K --> L[Reserved for Base<br/>On-Demand for Peaks]
```
Simplified Capacity Planner:
```python
# capacity.py
import numpy as np


class CapacityPlanner:
    def __init__(self, historical_rps):
        self.rps = np.array(historical_rps)

    def analyze(self):
        return {
            "avg": np.mean(self.rps),
            "p95": np.percentile(self.rps, 95),
            "p99": np.percentile(self.rps, 99),
            "max": np.max(self.rps),
        }

    def recommend(self, replica_rps=10, margin=0.2):
        stats = self.analyze()
        min_reps = max(2, int(np.ceil(stats["avg"] / replica_rps * (1 + margin))))
        max_reps = int(np.ceil(stats["p99"] / replica_rps * (1 + margin)))
        return {
            "min_replicas": min_reps,
            "max_replicas": max_reps,
            "avg_utilization": stats["avg"] / (min_reps * replica_rps),
        }

    def forecast(self, growth_monthly=0.10, months=6):
        current = self.analyze()
        return [{
            "month": m,
            "avg_rps": current["avg"] * (1 + growth_monthly) ** m,
            "p99_rps": current["p99"] * (1 + growth_monthly) ** m,
        } for m in range(1, months + 1)]
```
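A quick usage sketch with synthetic hourly RPS samples (the traffic shape and per-replica capacity are made-up illustrations):

```python
import numpy as np

# One synthetic week of hourly RPS: ~40 RPS baseline plus daily peaks and noise
rng = np.random.default_rng(0)
hourly_rps = 40 + 25 * np.sin(np.linspace(0, 14 * np.pi, 24 * 7)) ** 2 + rng.normal(0, 5, 24 * 7)

planner = CapacityPlanner(hourly_rps)
print(planner.recommend(replica_rps=10))                     # min/max replicas and expected utilization
print(planner.forecast(growth_monthly=0.10, months=3)[-1])   # projected load in 3 months
```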
Case Study: Customer Support Chatbot Cost Optimization
Background: A SaaS company deployed an LLM-powered customer support chatbot. Initial costs: $180K/month for 3M conversations.
```mermaid
flowchart LR
    A[Initial State<br/>$180K/month] --> B[Phase 1: Caching<br/>$99K/month<br/>45% reduction]
    B --> C[Phase 2: Tiering<br/>$69K/month<br/>30% reduction]
    C --> D[Phase 3: Batching<br/>$59K/month<br/>15% reduction]
    D --> E[Phase 4: Prompts<br/>$53K/month<br/>10% reduction]
    E --> F[Final: 71% Total<br/>Cost Reduction]
    style A fill:#ffcccc
    style E fill:#ccffcc
```
Initial Architecture:
- GPT-4 for all responses (100% traffic)
- No caching, no batching
- 500 input tokens, 300 output tokens per conversation
- Cost: $0.06/conversation against a target of $0.01/conversation
- Latency: 3-5 seconds (users complained)
Problems:
- Unsustainable unit economics (6x over budget)
- 50% of questions were FAQ variations (wasteful)
- Poor user experience due to latency
- No optimization strategy
Optimization Journey:
Phase 1: Semantic Caching (Month 1)
- Implemented multi-level cache (in-memory + Redis)
- 45% cache hit rate for common questions
- Result: $99K/month (45% reduction)
- Latency: Cached responses 100ms (30x faster)
Phase 2: Model Tiering (Month 2)
- Router classifies queries as simple/complex
- GPT-3.5-Turbo for simple (60% of traffic, 10x cheaper)
- GPT-4 for complex (40% of traffic)
- Result: $69K/month (30% reduction)
- Quality: Maintained with smart routing
Phase 3: Batching & Quantization (Month 3)
- Email responses batched (not real-time)
- Self-hosted quantized model for FAQs
- Result: $59K/month (15% reduction)
- Bonus: Reduced API dependency
Phase 4: Prompt Optimization (Month 4)
- Compressed system prompts (40% fewer tokens)
- Context pruning based on relevance scoring
- Early stopping when confidence > 0.95
- Result: $53K/month (10% reduction)
Final Results:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly Cost | $180K | $53K | 71% reduction |
| Cost per Conversation | $0.06 | $0.018 | 70% reduction |
| Average Latency | 3.5s | 1.2s | 66% improvement |
| P95 Latency | 5.2s | 2.1s | 60% improvement |
| CSAT Score | 7.2/10 | 8.1/10 | 12% improvement (faster!) |
| Cache Hit Rate | 0% | 45% | N/A |
| GPU Utilization | 35% | 72% | 2x efficiency |
Key Success Factors:
- Started with easy wins - Caching provided 45% cost reduction in month 1
- Measured quality at each step - No degradation in customer satisfaction
- Tiered models - Right model for right complexity saves 60-80%
- Optimized cost AND latency - Users happier with faster responses
- Continuous monitoring - Caught regressions before customer impact
Cost vs Performance Trade-off Analysis
| Optimization | Cost Reduction | Latency Impact | Quality Impact | Implementation Time | ROI Timeline |
|---|---|---|---|---|---|
| Response Caching | 40-60% | -95% (faster) | None | 1-2 weeks | Immediate |
| Semantic Caching | 30-50% | -90% (faster) | None | 2-3 weeks | Immediate |
| Model Quantization | 50-75% | -30% (faster) | -2% to -10% | 1 week | Immediate |
| Batch Inference | 30-50% | +20-100ms | None | 2-3 weeks | Month 1 |
| Model Tiering | 60-80% | Neutral | Minimal (w/ routing) | 3-4 weeks | Month 1 |
| Prompt Compression | 20-40% | -10% (faster) | -1% to -5% | 2 weeks | Month 1 |
| Autoscaling | 20-40% | Neutral | None | 1-2 weeks | Ongoing |
| Spot Instances | 60-80% | Neutral | None (w/ fallback) | 2-3 weeks | Immediate |
Capacity Planning Best Practices
| Practice | Description | Impact | When to Apply |
|---|---|---|---|
| 20% Safety Margin | Min replicas = avg load × 1.2 | Handles unexpected spikes | Always |
| P99 for Max | Max replicas = P99 load × 1.2 | Covers rare peak events | High availability SLAs |
| Fast Scale-Up | 60s cooldown, 100% increase | Quick response to load | User-facing services |
| Slow Scale-Down | 300s cooldown, 10% decrease | Prevent thrashing | All autoscaling |
| Reserved Base | Buy reserved instances for min replicas | 40-60% cost savings | Predictable base load |
| On-Demand Peaks | Use on-demand for autoscaling | Flexibility without commitment | Variable workloads |
| Spot for Batch | Use spot instances for non-critical | 60-80% cost savings | Batch processing |
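A rough sketch of the reserved-base plus on-demand-peak split from the table; the hourly rate, discount, and peak-hours figures are placeholder assumptions to replace with your own numbers:

```python
ON_DEMAND_HOURLY = 4.00        # placeholder per-replica rate
RESERVED_DISCOUNT = 0.40       # placeholder ~40% discount for a 1-year commitment

def monthly_compute_cost(min_replicas, peak_replicas, peak_hours_per_day=6):
    hours = 24 * 30
    reserved = min_replicas * hours * ON_DEMAND_HOURLY * (1 - RESERVED_DISCOUNT)
    on_demand_burst = max(0, peak_replicas - min_replicas) * peak_hours_per_day * 30 * ON_DEMAND_HOURLY
    return reserved + on_demand_burst

print(f"${monthly_compute_cost(min_replicas=4, peak_replicas=10):,.0f}/month")  # $11,232 with these placeholders
```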
Implementation Checklist
Phase 1: Measurement (Week 1)
- Set up cost tracking per request/user/feature
- Profile system to identify bottlenecks
- Establish baseline latency metrics (P50, P95, P99)
- Measure current throughput and resource utilization
- Calculate current unit economics
Phase 2: Quick Wins (Weeks 2-3)
- Implement response caching (exact match)
- Add batch processing where latency-tolerant
- Enable model compilation (torch.compile, TensorRT)
- Set up basic rate limiting
- Optimize prompts to reduce token usage
Phase 3: Advanced Optimization (Weeks 4-6)
- Implement semantic caching
- Add model quantization (INT8)
- Set up dynamic batching
- Implement model tiering (small/large models)
- Add prompt compression
Phase 4: Capacity Planning (Weeks 7-8)
- Analyze historical load patterns
- Model seasonal and growth trends
- Configure autoscaling policies
- Set up capacity alerts
- Plan for peak events (campaigns, launches)
Phase 5: Cost Governance (Weeks 9-10)
- Set budget alerts by service/team/user
- Implement rate limiting per user/tenant
- Add cost dashboards
- Create cost allocation reports
- Set up automatic cost anomaly detection
Phase 6: Continuous Optimization (Ongoing)
- Regular performance review (monthly)
- A/B test new optimizations
- Update capacity plans quarterly
- Benchmark against new models/techniques
- Optimize based on actual usage patterns
Success Metrics
- Cost Efficiency: Cost per request reduced by >50%
- Performance: P95 latency <500ms for interactive use cases
- Throughput: >80% GPU utilization during peak
- Quality: <5% degradation from optimizations
- Reliability: >99.9% availability with autoscaling
- Capacity: Handle 3x current peak load with autoscaling