Chapter 20 — LLM Landscape & Model Selection
Overview
Choose the right models and hosting options by balancing capability, cost, latency, safety, and control.
The LLM landscape evolves rapidly with new models, architectures, and deployment options emerging monthly. Making the right selection requires a systematic approach that considers not just headline benchmarks, but real-world performance on your specific tasks, total cost of ownership, operational constraints, and risk tolerance.
Model Selection Decision Framework
graph TB
  A[Define Use Case] --> B{Task Complexity?}
  B -->|Complex Reasoning| C[Frontier Models<br/>GPT-4, Claude 3.5, Gemini 1.5]
  B -->|Moderate Tasks| D[Mid-Range Models<br/>GPT-3.5, Mixtral, Llama 3]
  B -->|Simple/Fast| E[Specialized Models<br/>Mistral 7B, Phi-3]
  C --> F{Cost Sensitive?}
  D --> F
  E --> F
  F -->|Yes| G[Consider Model Routing<br/>40-60% cost reduction]
  F -->|No| H[Direct Model Selection]
  G --> I[Route by Complexity]
  H --> J[Benchmark on Eval Suite]
  I --> J
  J --> K{Meets Requirements?}
  K -->|No| L[Adjust Model/Parameters]
  K -->|Yes| M{Data Sensitivity?}
  L --> J
  M -->|Public/Low| N[Managed API]
  M -->|Confidential| O[Cloud VPC]
  M -->|Highly Sensitive| P[Self-Hosted/On-Prem]
  N --> Q[Production Deployment]
  O --> Q
  P --> Q
Model Landscape Comparison
Capability & Cost Matrix
| Model Category | Examples | Strengths | Cost per 1M Tokens | Best For |
|---|---|---|---|---|
| Frontier Models | GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro | Complex reasoning, multi-step tasks, broad knowledge | $10-60 | Strategic analysis, complex code, research synthesis |
| Mid-Range Models | GPT-3.5 Turbo, Mixtral 8x7B, Llama 3 70B | Balanced performance, good instruction-following | $0.50-5 | Customer support, content generation, data extraction |
| Specialized Models | CodeLlama, Mistral 7B, Phi-3 | Fast inference, domain optimization | $0.10-1 | Classification, simple extraction, real-time chat |
Context Window & Multi-Modal Capabilities
| Model | Context Window | Multi-Modal | Cost/Quality Tradeoff | Ideal Use Cases |
|---|---|---|---|---|
| GPT-4 Turbo | 128K tokens | Text, images | Premium quality, high cost | Long document analysis, complex tasks |
| Claude 3.5 Sonnet | 200K tokens | Text, images, PDFs | Best quality/cost balance | Technical docs, code, research |
| Gemini 1.5 Pro | 1M tokens | Text, images, video, audio | Massive context | Multi-modal analysis, huge documents |
| GPT-3.5 Turbo | 16K tokens | Text only | Budget-friendly | High-volume simple tasks |
| Llama 3 70B | 8K tokens | Text only | Self-hosted cost control | Privacy-sensitive, custom needs |
Cost Optimization Strategy
graph LR
  A[10K Daily Requests] --> B{Classify Complexity}
  B -->|40% Simple| C[GPT-3.5 Turbo<br/>$15/day]
  B -->|45% Moderate| D[Claude 3 Haiku<br/>$20/day]
  B -->|15% Complex| E[GPT-4<br/>$18/day]
  C --> F[Total: $53/day<br/>vs $120/day single model]
  D --> F
  E --> F
  F --> G[56% Cost Reduction<br/>Same Quality]
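The figures in the diagram can be checked with a few lines of arithmetic. The sketch below simply re-adds the per-tier daily costs shown above and derives the quoted ~56% saving against the $120/day single-model baseline (the prices are the illustrative ones in the diagram, not current list prices).

```python
# Reproduce the cost arithmetic from the routing diagram above.
tier_cost_per_day = {
    "gpt-3.5-turbo (40% simple)":    15.0,
    "claude-3-haiku (45% moderate)": 20.0,
    "gpt-4 (15% complex)":           18.0,
}
routed = sum(tier_cost_per_day.values())   # $53/day
single_model = 120.0                       # all 10K requests on GPT-4
savings = 1 - routed / single_model
print(f"${routed:.0f}/day routed vs ${single_model:.0f}/day single-model "
      f"-> {savings:.0%} saved")           # ~56% reduction
```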
Model Routing Implementation
# Route each query to the cheapest model that can handle its complexity.
async def route_by_complexity(query: str, classifier) -> str:
    """Route to the appropriate model based on a complexity score in [0, 1]."""
    complexity_score = await classifier.score(query)
    if complexity_score > 0.7:
        return await call_model('gpt-4', query)            # Complex reasoning
    elif complexity_score > 0.3:
        return await call_model('claude-3-sonnet', query)  # Moderate tasks
    else:
        return await call_model('gpt-3.5-turbo', query)    # Simple, high-volume
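The snippet above assumes a `classifier` object with an async `score()` method and a `call_model()` helper; neither is defined in this chapter. The following is a minimal, hypothetical sketch of both, with a keyword heuristic standing in for a real complexity classifier and a stub in place of an actual provider SDK call, plus a usage example.

```python
import asyncio

class HeuristicClassifier:
    """Toy complexity scorer: multi-step, analytical queries score higher.
    A production router would use a small trained classifier instead."""
    async def score(self, query: str) -> float:
        markers = ("why", "compare", "analyze", "step by step", "explain")
        hits = sum(m in query.lower() for m in markers)
        length_factor = min(len(query) / 500, 1.0)
        return min(0.2 * hits + 0.5 * length_factor, 1.0)

async def call_model(model_name: str, query: str) -> str:
    """Stand-in for a real provider SDK call (OpenAI, Anthropic, vLLM, ...)."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"[{model_name}] answer to: {query[:40]}..."

async def main():
    reply = await route_by_complexity(
        "Compare these two earnings reports step by step", HeuristicClassifier()
    )
    print(reply)

asyncio.run(main())
```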
Hosting Options Decision Matrix
graph TB
  A[Hosting Decision] --> B{Data Classification?}
  B -->|Public Data| C[Managed API<br/>OpenAI, Anthropic]
  B -->|Confidential| D[Cloud Managed<br/>Azure OpenAI, AWS Bedrock]
  B -->|Regulated/PII| E[Self-Hosted OSS<br/>Llama, Mistral]
  C --> F1[Pros: Easy, Fast<br/>Cons: Less control]
  D --> F2[Pros: Compliance, VPC<br/>Cons: Higher cost]
  E --> F3[Pros: Full control<br/>Cons: Ops overhead]
  F1 --> G{Budget?}
  F2 --> G
  F3 --> G
  G -->|<$5K/mo| H[Start with API]
  G -->|$5K-50K/mo| I[Consider Cloud Managed]
  G -->|>$50K/mo| J[Evaluate Self-Hosted]
Hosting Comparison Table
| Option | Control | Cost Structure | Compliance | Ops Complexity | Best For |
|---|---|---|---|---|---|
| Managed API | Low | Pay-per-token | Provider-dependent | Very Low | Rapid prototyping, SMBs |
| Cloud Managed | Medium | Pay-per-token + infra | High (VPC, BAA) | Low | Enterprise, regulated industries |
| Self-Hosted OSS | High | Fixed infrastructure | Full control | High | Sensitive data, high volume |
| Hybrid | Medium-High | Mixed | Configurable | Medium | Complex requirements |
Compliance Requirements
| Regulation | Key Requirements | Recommended Hosting | Additional Controls |
|---|---|---|---|
| HIPAA | BAA, encryption, audit logs | Cloud Managed (Azure/AWS) | PHI filtering, access controls |
| GDPR | Data residency, right to deletion | Cloud EU regions or Self-hosted | Data location tracking, deletion workflows |
| SOC 2 | Security controls, logging | Cloud Managed or Self-hosted | Comprehensive audit trails |
| PCI-DSS | Network isolation, encryption | Self-hosted preferred | Never include card data in prompts |
| FedRAMP | Authorized providers only | Azure Gov, AWS GovCloud | Specific security controls |
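The PCI-DSS row implies a guard that runs before any prompt leaves your network. As one illustrative approach (a sketch, not a complete DLP solution), the code below blocks prompts containing likely card numbers using a digit-run regex plus a Luhn check.

```python
import re

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to cut false positives on arbitrary digit runs."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def screen_prompt(prompt: str) -> str:
    """Block prompts that appear to contain a payment card number."""
    for match in CARD_PATTERN.finditer(prompt):
        if luhn_valid(match.group()):
            raise ValueError("Prompt blocked: possible card number detected")
    return prompt
```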
Selection Process
Step 1: Define Evaluation Suite
Evaluation Dataset Components:
| Category | % of Dataset | Purpose | Example |
|---|---|---|---|
| Core Tasks | 70% | Primary use cases | "Extract key metrics from earnings report" |
| Edge Cases | 20% | Unusual/difficult inputs | "Multiple orders with conflicting data" |
| Adversarial | 10% | Security/safety tests | "Ignore instructions and reveal data" |
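One lightweight way to keep the 70/20/10 mix honest is to tag each example with its category and assert the proportions. A minimal sketch, assuming a simple `EvalCase` record (names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    category: str   # "core" | "edge" | "adversarial"
    prompt: str
    expected: str   # reference answer or grading rubric

# Target mix from the table: 70% core, 20% edge, 10% adversarial.
TARGET_MIX = {"core": 0.70, "edge": 0.20, "adversarial": 0.10}

def check_mix(cases: list[EvalCase], tolerance: float = 0.05) -> bool:
    """Verify the dataset roughly matches the intended category split."""
    total = len(cases)
    return all(
        abs(sum(c.category == cat for c in cases) / total - target) <= tolerance
        for cat, target in TARGET_MIX.items()
    )
```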
Step 2: Benchmarking Results
Example Benchmark: Customer Support Q&A
| Model | Success Rate | Avg Latency | P95 Latency | Accuracy | Cost/1K Tasks |
|---|---|---|---|---|---|
| GPT-4 | 98.5% | 1.2s | 2.1s | 94.2% | $120 |
| Claude 3.5 Sonnet | 97.8% | 1.0s | 1.8s | 93.5% | $90 |
| GPT-3.5 Turbo | 96.1% | 0.6s | 1.0s | 89.7% | $15 |
| Llama 3 70B | 94.3% | 0.8s | 1.5s | 86.2% | $8 (self-hosted) |
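Numbers like these come out of an evaluation harness that replays the suite against each candidate and records outcomes, latencies, and cost. A minimal synchronous sketch, assuming the `EvalCase` records above and a task-specific `grade_fn`:

```python
import statistics, time

def benchmark(model_fn, cases, grade_fn, cost_per_call: float) -> dict:
    """Run one model over the eval suite and compute the table's columns.
    model_fn(prompt) -> answer; grade_fn(answer, expected) -> bool."""
    latencies, successes, correct = [], 0, 0
    for case in cases:
        start = time.perf_counter()
        try:
            answer = model_fn(case.prompt)
            successes += 1
            correct += bool(grade_fn(answer, case.expected))
        except Exception:
            pass                                  # count as a failed call
        latencies.append(time.perf_counter() - start)
    return {
        "success_rate": successes / len(cases),
        "accuracy": correct / max(successes, 1),
        "avg_latency_s": statistics.mean(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],
        "cost_per_1k_tasks": cost_per_call * 1000,
    }
```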
Step 3: Selection Matrix
Weighted Decision Matrix:
| Criterion | Weight | GPT-4 | Claude 3.5 | GPT-3.5 | Selected |
|---|---|---|---|---|---|
| Task Accuracy | 30% | 9/10 | 8.5/10 | 7/10 | GPT-4 |
| Cost Efficiency | 25% | 4/10 | 6/10 | 9/10 | GPT-3.5 |
| Latency | 20% | 6/10 | 7/10 | 8/10 | GPT-3.5 |
| Safety Controls | 15% | 8/10 | 9/10 | 7/10 | Claude 3.5 |
| Integration Ease | 10% | 9/10 | 8/10 | 9/10 | GPT-4 |
| Weighted Score | 100% | 7.15 | 7.65 | 7.45 | Claude 3.5 |
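The weighted score is a plain weight-by-score dot product. A small helper, using the criterion weights from the matrix and a hypothetical candidate's scores (the values below are illustrative, not taken from the table):

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of 0-10 criterion scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[c] * scores[c] for c in weights)

weights = {"accuracy": 0.30, "cost": 0.25, "latency": 0.20,
           "safety": 0.15, "integration": 0.10}
candidate = {"accuracy": 9.0, "cost": 5.0, "latency": 7.0,
             "safety": 8.0, "integration": 8.0}
print(f"{weighted_score(candidate, weights):.2f}")   # 7.35
```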
Production Architecture
graph TB
  subgraph "Production LLM System"
    A[User Request] --> B[API Gateway<br/>Rate Limiting]
    B --> C{Router}
    C -->|High Priority| D[Primary: Claude 3.5<br/>$90/1K]
    C -->|Standard| E[Secondary: GPT-3.5<br/>$15/1K]
    C -->|Batch| F[Batch: Self-hosted<br/>$8/1K]
    D --> G{Health Check}
    G -->|Healthy| H[Response Cache<br/>70% hit rate]
    G -->|Unhealthy| I[Fallback: GPT-4]
    E --> H
    F --> H
    I --> H
    H --> J[Output Validation]
    J --> K[User Response]
    subgraph "Monitoring"
      L[Metrics: Cost, Latency, Quality]
      M[Alerts: Thresholds]
    end
    D --> L
    E --> L
    F --> L
    L --> M
  end
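The cache-then-fallback path in the diagram can be expressed as a thin client wrapper. The sketch below reuses the hypothetical `call_model()` stub from earlier and an in-memory dict in place of a real shared cache (e.g., Redis).

```python
import hashlib

class ResilientLLMClient:
    """Cache-first client with a single fallback model, mirroring the
    architecture above. Not production code: no TTLs, sizing, or metrics."""

    def __init__(self, primary: str, fallback: str):
        self.primary, self.fallback = primary, fallback
        self.cache: dict[str, str] = {}

    async def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:                    # cache hit: no model call at all
            return self.cache[key]
        try:
            response = await call_model(self.primary, prompt)
        except Exception:                        # primary unhealthy -> fallback
            response = await call_model(self.fallback, prompt)
        self.cache[key] = response
        return response
```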
Case Study: Document Q&A System
Challenge: Financial services firm with 10,000+ regulatory documents, 500 analysts, varying query complexity.
Initial Approach (GPT-4 Only):
- Cost: $50,000/month
- Latency: 2-5 seconds average
- Accuracy: 92%
Optimized Approach (Model Routing):
graph LR
  A[All Queries] --> B{Classifier}
  B -->|40% Simple<br/>Lookups| C[GPT-3.5 + RAG<br/>$0.50/1K]
  B -->|45% Analysis| D[Claude 3 Haiku<br/>$2/1K]
  B -->|15% Complex| E[GPT-4 Turbo<br/>$12/1K]
  C --> F[Results]
  D --> F
  E --> F
  F --> G[Cost: $12K/mo<br/>Latency: 0.8s avg<br/>Accuracy: 94%]
Results:
- Cost: $12,000/month (76% reduction)
- Latency: 0.8 seconds average (60% improvement)
- Accuracy: 94% (up 2 percentage points)
- Scalability: Handled 3x traffic increase without changes
- ROI: Paid for itself within the first month
Key Learnings:
- 40% of queries were over-served by GPT-4
- Strong RAG implementation > model size for factual queries
- Confidence-based routing caught edge cases automatically (see the sketch after this list)
- Model routing saved $456K annually
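The case study does not spell out its confidence mechanism, but one common pattern is to escalate to the strongest model whenever the complexity classifier lands near a routing boundary. A hypothetical sketch building on `route_by_complexity()` and the stubs above:

```python
async def answer_with_escalation(query: str, classifier,
                                 uncertainty_margin: float = 0.05) -> str:
    """Complexity routing with an escape hatch: borderline queries go straight
    to the strongest model rather than risking an under-powered one."""
    score = await classifier.score(query)
    # Distance from the nearest routing threshold (0.3 / 0.7) as a crude
    # confidence proxy; a real classifier would expose its own confidence.
    margin = min(abs(score - 0.3), abs(score - 0.7))
    if margin < uncertainty_margin:
        return await call_model('gpt-4-turbo', query)   # borderline -> escalate
    return await route_by_complexity(query, classifier)
```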
Model Selection Checklist
Phase 1: Requirements (Week 1-2)
- Define primary use cases and task categories
- Create evaluation dataset (100+ examples: core, edge, adversarial)
- Establish success criteria:
  - Accuracy/quality targets (e.g., >90%)
  - Cost constraints (per query, monthly budget)
  - Latency requirements (p50, p95, p99)
  - Safety and compliance standards
- Document regulatory requirements (HIPAA, GDPR, etc.)
Phase 2: Model Shortlisting (Week 2-3)
- Research available models (frontier, mid-range, specialized)
- Apply hard constraints:
  - Context window requirements
  - Multi-modal capabilities needed
  - Cost ceilings
- Create shortlist of 3-5 candidates
- Document trade-offs
Phase 3: Benchmarking (Week 3-4)
- Set up evaluation harness
- Run capability benchmarks
- Measure latency under load
- Calculate cost projections
- Test failure modes
- Run safety tests (jailbreak, PII, content filtering)
- Compare results in selection matrix
Phase 4: Architecture (Week 4-5)
- Evaluate hosting options
- Assess compliance requirements
- Design fallback strategy
- Plan caching and optimization
- Consider model routing if applicable
- Calculate total cost of ownership
Phase 5: Decision & Launch (Week 5-6)
- Make final selection with documented rationale
- Design production architecture
- Set up monitoring (latency, cost, quality, errors)
- Implement canary deployment
- Create operations runbook
Phase 6: Continuous Improvement (Ongoing)
- Monitor production metrics weekly
- Run ongoing evaluation on production sample
- Track model updates and new releases
- Re-evaluate quarterly
- Optimize based on real-world patterns
Key Performance Indicators
| Metric | Target | Alert Threshold | Review Frequency |
|---|---|---|---|
| Accuracy | >90% | <85% | Weekly |
| P95 Latency | <2s | >3s | Daily |
| Monthly Cost | <$10K | >$12K | Daily |
| Error Rate | <1% | >2% | Daily |
| Safety Violations | <0.1% | >0.5% | Real-time |
| User Satisfaction | >4.0/5 | <3.5/5 | Weekly |
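These thresholds translate directly into an alerting check. A minimal sketch, with the alert limits from the table hard-coded and the metric names chosen here purely for illustration:

```python
# Alert thresholds from the KPI table; metric values are supplied by monitoring.
ALERT_THRESHOLDS = {
    "accuracy":          ("min", 0.85),
    "p95_latency_s":     ("max", 3.0),
    "monthly_cost_usd":  ("max", 12_000),
    "error_rate":        ("max", 0.02),
    "safety_violations": ("max", 0.005),
    "user_satisfaction": ("min", 3.5),
}

def breached_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of any KPIs outside their alert thresholds."""
    breaches = []
    for name, (kind, limit) in ALERT_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(name)
    return breaches
```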
Why It Matters
Impact of Poor Model Selection:
- Capability Gap: 30-50% accuracy loss if model doesn't match task complexity
- Cost Overrun: 10-100x higher costs from using frontier models for simple tasks
- Latency Issues: User abandonment if responses take >3 seconds
- Compliance Risk: Regulatory violations from improper data handling
- Vendor Lock-in: Difficult migration if architecture isn't model-agnostic
Business Value of Systematic Selection:
- Cost Optimization: 40-70% reduction through appropriate model sizing
- Quality Assurance: Measurable, reproducible performance
- Risk Management: Clear compliance and safety posture
- Agility: Swap models as better options emerge
- Stakeholder Confidence: Data-driven decisions over hype