Chapter 2 — The AI Landscape & Core Concepts
Overview
Establish a shared foundation across ML paradigms, generative AI, and agentic systems. Understand when to use which approach, associated costs/latency, and core constraints.
This chapter provides a comprehensive map of the AI landscape, helping you select the right approach for specific business problems. Whether you're dealing with structured predictions, natural language understanding, or autonomous agents, understanding the capabilities, limitations, and tradeoffs of each paradigm is essential for successful AI consulting.
Objectives
- Map the AI landscape from classical ML to generative AI to agentic systems
- Provide decision frameworks for selecting the right AI approach
- Understand cost, latency, and quality tradeoffs
- Establish shared vocabulary and mental models for AI consulting
Core Concepts
Learning Paradigms
AI systems learn from data using different paradigms, each suited to different types of problems:
```mermaid
graph TD
  A[Machine Learning Paradigms] --> B[Supervised Learning]
  A --> C[Unsupervised Learning]
  A --> D[Semi-Supervised Learning]
  A --> E[Self-Supervised Learning]
  A --> F[Reinforcement Learning]
  B --> B1[Classification]
  B --> B2[Regression]
  B --> B3[Requires labeled data]
  C --> C1[Clustering]
  C --> C2[Dimensionality Reduction]
  C --> C3[No labels needed]
  D --> D1[Small labeled + Large unlabeled]
  D --> D2[Cost-effective labeling]
  E --> E1[Create labels from data]
  E --> E2[Foundation models]
  F --> F1[Trial and error]
  F --> F2[Maximize rewards]
```
Learning Paradigm Selection Framework
```mermaid
flowchart TD
  Start[Business Problem] --> Q1{Labels Available?}
  Q1 -->|Yes: Abundant| Supervised[Supervised Learning]
  Q1 -->|Yes: Limited| SemiSupervised[Semi-Supervised Learning]
  Q1 -->|No Labels| Q2{Pattern Discovery?}
  Q2 -->|Yes| Unsupervised[Unsupervised Learning]
  Q2 -->|No| Q3{Sequential Decisions?}
  Q3 -->|Yes| RL[Reinforcement Learning]
  Q3 -->|No| SelfSupervised[Self-Supervised Learning]
  Supervised --> Ex1[Credit scoring<br/>Churn prediction<br/>Medical diagnosis]
  SemiSupervised --> Ex2[Document classification<br/>Image labeling]
  Unsupervised --> Ex3[Customer segmentation<br/>Anomaly detection]
  RL --> Ex4[Game playing<br/>Resource optimization<br/>RLHF for LLMs]
  SelfSupervised --> Ex5[Foundation models<br/>GPT, BERT, Claude]
```
Paradigm Comparison Matrix
| Paradigm | Data Requirements | Common Use Cases | Typical Accuracy | Time to Value | Cost |
|---|---|---|---|---|---|
| Supervised | 1K-1M labeled examples | Classification, regression, forecasting | 85-95% | 4-8 weeks | $$-$$$ |
| Unsupervised | Unlabeled data only | Clustering, dimensionality reduction | N/A (interpretive) | 2-4 weeks | $ |
| Semi-Supervised | 100s labeled + 10Ks unlabeled | Image/text classification | 80-90% | 6-10 weeks | $$ |
| Self-Supervised | Massive unlabeled corpora | Foundation model pretraining | Varies | Months (done by providers) | $$$$ |
| Reinforcement Learning | Simulation or feedback | Sequential decisions, optimization | Varies | 8-24 weeks | $$$-$$$$ |
Case Study: E-commerce Customer Segmentation
- Approach: Unsupervised learning (K-means clustering)
- Data: 500K customers, 50 behavioral features
- Outcome: Identified 7 distinct segments
- Business Impact:
- Marketing conversion improved by 34%
- Customer retention increased by 18%
- Campaign ROI improved from 2.1x to 3.4x
- Time to insight: 3 weeks vs. 3 months manual analysis
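The mechanics behind a segmentation like this are compact enough to sketch. Below is a minimal, illustrative clustering workflow, assuming a pandas DataFrame of numeric behavioral features; the file name, feature handling, and the choice of k via silhouette score are assumptions, not details from the case study.

```python
# Sketch: customer segmentation with K-means (scikit-learn).
# The input file, feature set, and k-selection approach are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

customers = pd.read_csv("customer_features.csv")   # hypothetical behavioral feature table
features = customers.select_dtypes("number")

X = StandardScaler().fit_transform(features)       # scale so no single feature dominates distance

# Pick k by silhouette score rather than guessing the segment count up front.
scores = {}
for k in range(3, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
customers["segment"] = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)

# Profile each segment to turn clusters into marketing-ready personas.
print(customers.groupby("segment").mean(numeric_only=True))
```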
Classical vs. Generative AI
A fundamental distinction in modern AI that drives architecture and economics:
Capability Comparison
| Aspect | Classical ML | Generative AI |
|---|---|---|
| Primary Task | Prediction, classification, scoring | Content synthesis, reasoning, generation |
| Output Type | Categorical labels, numerical scores | Text, images, code, structured data |
| Data Requirements | Structured, tabular, labeled (1K-1M examples) | Large-scale unstructured data (billions of tokens) |
| Interpretability | Often high (linear models, trees) | Generally low (black box) |
| Latency | Typically 1-100ms | 100ms-10s depending on size |
| Cost per Inference | $0.0001-$0.01 | $0.01-$0.10+ |
| Determinism | Consistent outputs for same input | Stochastic (varies across runs) |
| Training Time | Hours to days | Weeks to months (foundation models) |
| Training Cost | ~$10K | $100M+ (foundation models) |
| Best For | Structured prediction, tabular data | Language, reasoning, content creation |
Decision Framework
```mermaid
flowchart TD
  Start[Business Problem] --> Q1{Output Type?}
  Q1 -->|Number/Category| Classical[Classical ML]
  Q1 -->|Text/Content| Q2{Data Structure?}
  Q2 -->|Structured/Tabular| Q3{Need Reasoning?}
  Q3 -->|No| Classical
  Q3 -->|Yes| GenAI[Generative AI]
  Q2 -->|Unstructured Text/Images| GenAI
  Classical --> Ex1[Examples:<br/>• Fraud detection<br/>• Churn prediction<br/>• Price optimization<br/>• Demand forecasting]
  GenAI --> Ex2[Examples:<br/>• Document summarization<br/>• Chatbots<br/>• Code generation<br/>• Content creation]
  style Classical fill:#90EE90
  style GenAI fill:#87CEEB
```
Case Study: Bank Loan Default Prediction
- Approach: Classical ML (Gradient Boosting)
- Data: 500K historical loans, 150 features
- Model: XGBoost achieving 0.85 AUC-ROC
- Business Impact:
- Default rate reduced by 18%
- Annual savings: $12M
- Model inference: <10ms
- Cost per prediction: $0.0001
- ROI: 2,400% in first year
vs. Generative AI Attempt (same problem):
- Approach: LLM-based reasoning
- Performance: 0.72 AUC-ROC (lower accuracy)
- Latency: 800ms (80x slower)
- Cost per prediction: $0.02 (200x more expensive)
- Conclusion: Wrong tool for the job—structured prediction doesn't need generative AI
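For contrast, the classical-ML side of this comparison fits in a short script. The sketch below trains a gradient-boosted classifier and evaluates it with AUC-ROC; the file name, target column, and hyperparameters are illustrative placeholders rather than the case study's actual setup.

```python
# Sketch: structured default prediction with gradient boosting.
# The data file, feature columns, and hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("loans.csv")                      # hypothetical historical loan table
X = df.drop(columns=["defaulted"])                 # features
y = df["defaulted"]                                # 1 = default, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    eval_metric="auc", n_jobs=-1
)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC-ROC: {auc:.3f}")                       # the case study reports ~0.85
```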
Large Language Models (LLMs)
LLMs are the foundation of modern generative AI applications.
LLM Landscape & Economics
```mermaid
graph TD
  A[LLM Options] --> B[Cloud APIs]
  A --> C[Self-Hosted OSS]
  A --> D[Hybrid]
  B --> B1[OpenAI GPT-4]
  B --> B2[Anthropic Claude]
  B --> B3[Google Gemini]
  C --> C1[Llama 3.1]
  C --> C2[Mistral]
  C --> C3[Gemma]
  D --> D1[API for Complex Tasks]
  D --> D2[Local for Simple/Sensitive]
  B1 --> Cost1[$10-30/1M tokens]
  C1 --> Cost2[$0.10-1/1M tokens<br/>after amortization]
  D1 --> Cost3[Optimized routing]
```
Model Selection Matrix
| Model | Context Window | Cost per 1M tokens (Input/Output) | Latency (P95) | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 128K | $10 / $30 | 1-3s | Complex reasoning, high-stakes |
| Claude 3.5 Sonnet | 200K | $3 / $15 | 1-2s | Long context, analysis |
| GPT-3.5 Turbo | 16K | $0.50 / $1.50 | 0.5-1s | General purpose, high volume |
| Gemini 1.5 Pro | 2M | $1.25 / $5 | 1-2s | Very long documents |
| Llama 3.1 70B (hosted) | 128K | ~$0.79 (blended) | 0.8-1.5s | Cost-sensitive, moderate volume |
| Llama 3.1 70B (self) | 128K | ~$0.001 | 0.5-1s | High volume (>1M/month) |
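Before choosing a model, a back-of-the-envelope cost estimate is often enough to rule options in or out. The sketch below multiplies per-million-token prices (taken from the matrix above and best treated as point-in-time assumptions, since pricing changes frequently) by an expected token profile.

```python
# Sketch: monthly LLM spend estimate from per-1M-token prices.
# Prices mirror the matrix above and are point-in-time assumptions, not current quotes.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "gpt-4-turbo":          (10.00, 30.00),
    "claude-3.5-sonnet":    (3.00, 15.00),
    "gpt-3.5-turbo":        (0.50,  1.50),
    "llama-3.1-70b-hosted": (0.79,  0.79),
}

def monthly_cost(model: str, requests: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated monthly spend for a given request volume and per-request token profile."""
    p_in, p_out = PRICES[model]
    per_request = (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    return per_request * requests

for m in PRICES:
    print(m, f"${monthly_cost(m, requests=100_000, in_tokens=1_500, out_tokens=300):,.0f}/month")
```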
LLM Adaptation Approaches
```mermaid
flowchart TD
  Start[Need to Adapt LLM?] --> Q1{Task Complexity}
  Q1 -->|Simple| Prompt[Prompt Engineering]
  Q1 -->|Moderate| Q2{Need External Knowledge?}
  Q1 -->|Complex| Q3{Have Training Data?}
  Q2 -->|Yes| RAG[RAG]
  Q2 -->|No| Prompt
  Q3 -->|Yes: >10K examples| FT[Fine-tuning]
  Q3 -->|Yes: 1K-10K examples| LoRA[LoRA]
  Q3 -->|No| RAG
  Prompt --> P1[Cost: $0<br/>Time: Hours<br/>Flexibility: High]
  RAG --> P2[Cost: $-$$<br/>Time: Days-Weeks<br/>Updatable: Yes]
  LoRA --> P3[Cost: $$<br/>Time: Days<br/>Tasks: Multi-task]
  FT --> P4[Cost: $$$<br/>Time: Weeks<br/>Performance: Highest]
```
Adaptation Approach Comparison
| Approach | Cost | Time to Deploy | Data Needed | Performance | Updatability | Best For |
|---|---|---|---|---|---|---|
| Prompt Engineering | None | Hours | 0-10 examples | 70-85% | Immediate | Most applications, rapid iteration |
| RAG | $-$$ (infra) | Days-Weeks | 100s-1000s docs | 80-90% | Easy (add docs) | Knowledge-intensive, dynamic info |
| LoRA | $$ | Days | 1K-10K examples | 85-92% | Moderate (retrain) | Multiple specialized tasks |
| Fine-tuning | $$$ | Weeks | 10K-100K examples | 90-95% | Hard (full retrain) | Specialized domains, specific formats |
Case Study: Legal Document Analysis
- Company: Mid-size law firm processing 500 contracts/month
- Tested Approaches:
- Prompt Engineering: 76% accuracy, $0.05/doc, ready in 1 week
- RAG with firm templates: 89% accuracy, $0.12/doc, ready in 3 weeks
- Fine-tuned model: 94% accuracy, $0.08/doc, ready in 8 weeks
- Decision: Chose RAG
- Rationale: 89% accuracy sufficient, fastest to update with new clauses
- Business Impact:
- Review time reduced from 45 min to 12 min (73% reduction)
- Monthly savings: $18K in attorney time
- ROI: 450% in first year
Retrieval-Augmented Generation (RAG)
RAG grounds LLM outputs in authoritative data, reducing hallucinations and enabling access to current/private information.
RAG Architecture & Components
```mermaid
graph LR
  A[User Query] --> B[Query Embedding]
  B --> C[Vector Search]
  C --> D[Retrieve Top-K Chunks]
  D --> E[Rerank Optional]
  E --> F[Assemble Context]
  F --> G[LLM Generation]
  G --> H[Response]
  I[Document Corpus] --> J[Chunking<br/>200-1000 tokens]
  J --> K[Embedding<br/>text-embedding-3]
  K --> L[Vector Database<br/>Pinecone/Weaviate]
  L --> C
  style G fill:#87CEEB
  style L fill:#90EE90
```
RAG Design Decisions Matrix
| Decision Point | Options | Tradeoffs | Recommendation |
|---|---|---|---|
| Chunk Size | 200 / 500 / 1000 tokens | Small: precise, more chunks; large: more context, less precise | 500 tokens for most use cases |
| Chunk Overlap | 0 / 50 / 100 tokens | More: better continuity, redundancy; less: efficient, potential gaps | 50 tokens (10% overlap) |
| Top-K | 3 / 5 / 10 chunks | More: better recall, higher cost; fewer: focused, may miss info | 5 chunks for most use cases |
| Embedding Model | OpenAI / Cohere / OSS | Proprietary: quality, cost; open: control, no API cost | OpenAI for quality, OSS for volume |
| Vector DB | Pinecone / Weaviate / pgvector | Managed: easy, $; self-hosted: control, ops overhead | Pinecone for <10M docs, pgvector for >10M |
| Reranking | Yes / No | Improves precision 10-15%, adds 100-200ms latency | Yes for high-stakes applications |
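The sketch below wires these defaults (500-token chunks, ~10% overlap, top-5 retrieval) into a minimal in-memory RAG loop. It assumes the OpenAI Python client for embeddings and generation, uses a plain NumPy array as the "vector database", and treats words as a rough proxy for tokens; a production system would swap in a real vector store, a tokenizer-aware chunker, and optional reranking.

```python
# Sketch: minimal RAG with 500-token chunks, 50-token overlap, top-5 retrieval.
# The corpus file, question, and model names are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()                     # crude word-based proxy for tokens
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus = open("help_articles.txt").read()    # hypothetical document corpus
chunks = chunk(corpus)
index = embed(chunks)                        # (n_chunks, dim) in-memory "vector DB"

def answer(question: str, top_k: int = 5) -> str:
    q = embed([question])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-top_k:][::-1])
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the provided context; say so if the answer is missing."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I reset my password?"))
```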
RAG Performance Benchmarks
| Metric | Without RAG | Basic RAG | Advanced RAG (reranking) | Improvement |
|---|---|---|---|---|
| Answer Accuracy | 65% (pure LLM) | 82% | 89% | 37% improvement |
| Hallucination Rate | 18% | 5% | 2% | 89% reduction |
| Source Attribution | N/A | 84% correct | 92% correct | Traceable answers |
| Latency | 800ms | 1.2s | 1.8s | Worth the tradeoff |
| Cost per Query | $0.02 | $0.05 | $0.08 | Justifiable for accuracy |
Case Study: Technical Support Knowledge Base
- Company: SaaS company with 2,500 help articles
- Baseline: Keyword search, 62% resolution rate
- RAG Implementation:
- Chunk size: 500 tokens, 50 overlap
- Vector DB: Pinecone (500K vectors)
- Top-K: 5, with reranking
- LLM: GPT-4 Turbo
- Results:
- Answer accuracy: 87% (vs. 62% keyword)
- First-contact resolution: 81% (vs. 65%)
- Average handle time: 6.2 min (vs. 9.5 min)
- Cost per query: $0.06
- Monthly volume: 50K queries
- Annual savings: $480K in support costs
- ROI: 720% in year 1
Agentic Systems
Agents extend LLMs with tool use, planning, and iterative refinement.
Agent Architecture Patterns
```mermaid
graph TD
  Input[User Input] --> Agent[Agent Core<br/>LLM]
  Agent --> Planning[Planning Module]
  Planning --> Tools[Tool Selection]
  Tools --> Execute[Execute Tools]
  Execute --> Memory[Update Memory]
  Memory --> Reflect[Reflection]
  Reflect --> Decision{Task Complete?}
  Decision -->|No| Planning
  Decision -->|Yes| Output[Final Response]
  Tools --> Tool1[Web Search<br/>Google/Bing]
  Tools --> Tool2[Calculator<br/>Python REPL]
  Tools --> Tool3[Database Query<br/>SQL]
  Tools --> Tool4[API Calls<br/>REST/GraphQL]
  style Agent fill:#87CEEB
  style Tools fill:#90EE90
```
Agent Pattern Comparison
| Pattern | Complexity | Reliability | Cost/Task | Latency | Use Cases | Success Rate |
|---|---|---|---|---|---|---|
| ReAct | Medium | 75-85% | $0.15 | 5-15s | Customer support, data analysis | 80% |
| Plan-and-Execute | Medium-High | 70-80% | $0.25 | 10-30s | Travel booking, research | 75% |
| Reflexion | High | 80-90% | $0.40 | 15-45s | Code debugging, complex problem-solving | 85% |
| Multi-Agent | Very High | 65-75% | $0.60 | 30-120s | Software development, strategic planning | 70% |
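A stripped-down ReAct-style loop is shown below: the model is asked to either call a named tool or return a final answer, and each tool result is fed back as an observation. The two toy tools, the JSON reply convention, and the model name are illustrative assumptions, not a production agent framework.

```python
# Sketch: minimal ReAct-style agent loop with two toy tools.
# Tool set, JSON protocol, and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; never eval untrusted input in production
    "search_orders": lambda order_id: json.dumps({"order_id": order_id, "status": "shipped"}),  # stubbed lookup
}

SYSTEM = (
    "You solve tasks step by step. Reply with JSON only: "
    '{"tool": "<calculator|search_orders>", "input": "..."} to act, '
    'or {"final": "..."} when done.'
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
        step = json.loads(reply.choices[0].message.content)
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["tool"]](step["input"])      # execute the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped: step limit reached."

print(run_agent("What is the status of order 1234, and what is 12.5% of 80?"))
```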
Agent Tool Ecosystem
```mermaid
graph LR
  A[Agent Core] --> B[Knowledge Tools]
  A --> C[Action Tools]
  A --> D[Analysis Tools]
  B --> B1[Search<br/>Web/Internal]
  B --> B2[RAG<br/>Documents]
  B --> B3[Memory<br/>Vector DB]
  C --> C1[Database<br/>CRUD Ops]
  C --> C2[APIs<br/>External Services]
  C --> C3[Email/Slack<br/>Communication]
  D --> D1[Calculator<br/>Math/Finance]
  D --> D2[Code Execution<br/>Python/SQL]
  D --> D3[Data Viz<br/>Charts/Graphs]
```
Case Study: Customer Service Agent
- Company: E-commerce retailer, 500+ support agents
- Agent Capabilities:
  - Search order database (tool: `search_orders`)
  - Calculate refunds (tool: `calculator`)
  - Update CRM (tool: `update_crm`)
  - Send emails (tool: `send_email`)
- Implementation: ReAct pattern with GPT-4
- Results:
- Average handle time: 5.2 min (vs. 8.5 min manual)
- Time savings: 39%
- Error rate in calculations: 95% reduction (arithmetic delegated to the calculator tool)
- Agent satisfaction: 4.3/5 (tools reduce frustration)
- Cost per interaction: $0.12
- Annual savings: $890K across 500 agents
- CSAT maintained at 4.2/5
When To Use What
Choosing the right AI approach is critical for success. Here's a comprehensive decision framework:
Decision Tree by Problem Type
```mermaid
flowchart TD
  Start[Business Problem] --> Q1{Problem Type}
  Q1 -->|Deterministic Logic| Rules[Rules/Heuristics]
  Q1 -->|Structured Prediction| Q2{Data Type?}
  Q1 -->|Content Generation| GenAI[Generative AI]
  Q1 -->|Sequential Decisions| Agents[Agentic Systems]
  Q2 -->|Tabular/Structured| ClassicalML[Classical ML]
  Q2 -->|Text| Q3{Labeled Data?}
  Q2 -->|Images| Q4{Volume?}
  Q3 -->|Yes: >10K| ClassicalML
  Q3 -->|No| GenAI
  Q4 -->|High: >100K| DeepLearning[Deep Learning]
  Q4 -->|Low| GenAI
  Rules --> R1[Tax calculations<br/>Access control<br/>Compliance checks]
  ClassicalML --> C1[Fraud detection<br/>Churn prediction<br/>Price optimization]
  GenAI --> G1[Summarization<br/>Q&A<br/>Content creation]
  Agents --> A1[Research tasks<br/>Multi-step workflows<br/>Tool orchestration]
  DeepLearning --> D1[Image classification<br/>Object detection<br/>OCR]
```
Approach Selection Matrix
| Approach | Best For | Data Requirements | Latency | Cost | Complexity |
|---|---|---|---|---|---|
| Rules/Heuristics | Deterministic logic, compliance | Minimal | <1ms | Very Low | Low |
| Classical ML | Structured prediction, tabular data | 1K-1M labeled examples | 1-100ms | Low | Medium |
| Deep Learning (CV) | Images, video, complex vision tasks | 10K-1M labeled images | 10-500ms | Medium | High |
| LLM (Prompting) | Unstructured text, reasoning | 0-10 examples | 100ms-5s | Medium | Low-Medium |
| RAG | Grounded generation, knowledge tasks | 100s-1000s documents | 200ms-10s | Medium | Medium |
| Fine-tuning | Specialized domains, specific formats | 1K-100K examples | 100ms-5s | Medium-High | High |
| RL/Agents | Sequential decisions, optimization | Simulation or feedback | Varies widely | High | Very High |
Cost-Performance-Latency Tradeoff
```mermaid
graph TD
  A[Choose 2 of 3] --> B[Low Cost]
  A --> C[High Performance]
  A --> D[Low Latency]
  B --> BC[Low Cost + High Performance<br/>= Higher Latency<br/>Example: Batch processing with large models]
  B --> BD[Low Cost + Low Latency<br/>= Lower Performance<br/>Example: Simple rules or small models]
  C --> CD[High Performance + Low Latency<br/>= High Cost<br/>Example: GPT-4 with optimized infrastructure]
  style A fill:#FFD700
```
Optimization Strategies
| Technique | Latency Impact | Cost Impact | Quality Impact | Best For |
|---|---|---|---|---|
| Caching | -50-90% | -50-90% | Neutral | Repeated queries |
| Batching | +50-200% | -30-50% | Neutral | High throughput, latency-tolerant |
| Model Distillation | -40-70% | -40-70% | -5-15% | Production deployment |
| Quantization | -20-50% | -20-50% | -1-5% | Edge deployment |
| Prompt Compression | -20-40% | -20-40% | -0-10% | Long context scenarios |
| Smaller Model | -50-80% | -50-80% | -10-30% | Simple tasks |
| Hybrid Routing | -30-60% | -40-70% | Neutral to +5% | Mixed complexity workload |
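Caching and hybrid routing are usually the cheapest wins in this table because they require no model changes. The sketch below combines hash-based response caching with a crude complexity heuristic that routes short, simple prompts to a cheaper model; the heuristic, thresholds, and model names are assumptions to be tuned per workload.

```python
# Sketch: response caching + complexity-based model routing.
# The routing heuristic, threshold, and model names are illustrative assumptions.
import hashlib
from openai import OpenAI

client = OpenAI()
_cache: dict[str, str] = {}          # swap for Redis or similar in production

def _key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()

def route(prompt: str) -> str:
    """Cheap model for short/simple prompts, stronger model otherwise."""
    simple = len(prompt) < 500 and "step by step" not in prompt.lower()
    return "gpt-3.5-turbo" if simple else "gpt-4-turbo"

def complete(prompt: str) -> str:
    key = _key(prompt)
    if key in _cache:                # cache hit: zero marginal cost and latency
        return _cache[key]
    resp = client.chat.completions.create(
        model=route(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    _cache[key] = resp.choices[0].message.content
    return _cache[key]
```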
Case Study: Document Summarization Cost Optimization
- Baseline: GPT-4 for all documents
- Cost: $0.08/document
- Monthly volume: 100K documents
- Monthly cost: $8,000
- Optimized Approach:
- Caching common documents (40% hit rate): Save $3,200
- Route simple docs to GPT-3.5 (25% of volume): Save $1,200
- Batching (10 at a time): Save $800
- Prompt optimization (-30% tokens): Save $600
- Result:
- New monthly cost: $2,200
- Savings: 72.5% ($69,600/year)
- Quality maintained: 94% similarity to baseline
- Latency impact: +15% (acceptable for async workflow)
Constraints & Tradeoffs
Every AI solution involves tradeoffs. Understanding these is crucial for setting realistic expectations.
Cost Structure Analysis
```mermaid
graph TD
  A[Total Cost of AI System] --> B[Development]
  A --> C[Inference]
  A --> D[Operations]
  B --> B1[Data labeling<br/>$10K-$500K]
  B --> B2[Experimentation<br/>$20K-$200K]
  B --> B3[Engineering time<br/>$100K-$1M]
  C --> C1[Compute per request<br/>$0.0001-$0.10]
  C --> C2[API costs<br/>$1K-$100K/month]
  C --> C3[Infrastructure<br/>$5K-$50K/month]
  D --> D1[Monitoring<br/>$1K-$10K/month]
  D --> D2[Retraining<br/>$5K-$50K/quarter]
  D --> D3[Ops team<br/>$200K-$800K/year]
```
Data Constraints
Data Quality Requirements by Approach
| ML Approach | Completeness | Missing Data Tolerance | Label Accuracy | Noise Tolerance |
|---|---|---|---|---|
| Classical ML | >90% | <10% with imputation | >95% | Low-Medium |
| Deep Learning | >80% | <20% (learns to ignore) | >90% | Medium-High |
| LLMs | Variable | High (handles missing context) | N/A (self-supervised pretraining) | High |
| Fine-tuning | >95% | <5% | >98% | Very Low |
Data Privacy & Consent Decision Tree
```mermaid
flowchart TD
  Start[Data Source] --> Q1{User Consent?}
  Q1 -->|No| Stop[Cannot Use]
  Q1 -->|Yes| Q2{Contains PII?}
  Q2 -->|Yes| Q3{Need PII?}
  Q3 -->|No| Redact[Redact/Anonymize]
  Q3 -->|Yes| Q4{Compliance Framework?}
  Q4 -->|GDPR/CCPA| Controls1[• Data minimization<br/>• Encryption at rest/transit<br/>• Access controls<br/>• Right to deletion]
  Q4 -->|HIPAA| Controls2[• BAA required<br/>• Audit logs<br/>• De-identification<br/>• Limited retention]
  Q4 -->|Other| Controls3[• Risk assessment<br/>• Legal review<br/>• Custom controls]
  Q2 -->|No| Q5{Sensitive Domain?}
  Q5 -->|Yes: Finance, Health| Assess[Risk Assessment Required]
  Q5 -->|No| Proceed[Proceed with Standard Governance]
  Redact --> Proceed
  Controls1 --> Proceed
  Controls2 --> Proceed
  Controls3 --> Proceed
  Assess --> Proceed
```
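When the tree above lands on "Redact/Anonymize", a first pass is often regex-based redaction of obvious identifiers before any text reaches a model or a log. The patterns below are a minimal, non-exhaustive sketch; real deployments layer on NER models and locale-specific rules.

```python
# Sketch: regex-based PII redaction before text is sent to a model or logged.
# Patterns are minimal and non-exhaustive; production systems add NER and locale-specific rules.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-867-5309."))
# -> "Reach me at [EMAIL] or [PHONE]."
```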
Safety & Security Threats
AI Threat Landscape
```mermaid
graph TD
  A[AI Security Threats] --> B[Input Attacks]
  A --> C[Model Attacks]
  A --> D[Output Attacks]
  A --> E[Data Attacks]
  B --> B1[Prompt Injection<br/>Override instructions]
  B --> B2[Adversarial Inputs<br/>Misclassification]
  C --> C1[Model Extraction<br/>IP theft]
  C --> C2[Model Inversion<br/>Training data recovery]
  D --> D1[Data Exfiltration<br/>Leak PII/secrets]
  D --> D2[Hallucination<br/>False information]
  E --> E1[Data Poisoning<br/>Corrupt training]
  E --> E2[Backdoors<br/>Trigger behaviors]
```
Defense Strategy Matrix
| Threat | Impact | Defense Mechanism | Implementation Cost | Effectiveness |
|---|---|---|---|---|
| Prompt Injection | High | Input sanitization, prompt guards, output validation | $ | 80-90% |
| Data Exfiltration | Critical | PII redaction, access controls, output filtering | $$ | 95-99% |
| Jailbreaking | Medium-High | System prompt hardening, red-teaming, content filters | $$ | 70-85% |
| Hallucination | Medium | RAG grounding, fact-checking, confidence scores | $$ | 60-80% |
| Model Extraction | Medium | Rate limiting, watermarking, API monitoring | $ | 75-90% |
| Adversarial Examples | Medium | Adversarial training, input validation, ensemble models | $$$ | 70-85% |
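One inexpensive layer of the defenses above is screening inputs for instruction-override phrasing and validating outputs before they reach users. The phrase list and checks below are illustrative only; they complement, rather than replace, model-side guards and red-team testing.

```python
# Sketch: naive input screening + output validation as one layer of defense in depth.
# The marker phrases and regex checks are illustrative assumptions, not a complete guard.
import re

INJECTION_MARKERS = [
    "ignore previous instructions",
    "disregard the system prompt",
    "you are now",
    "reveal your system prompt",
]

def screen_input(user_text: str) -> bool:
    """Return False if the input looks like an instruction-override attempt."""
    lowered = user_text.lower()
    return not any(marker in lowered for marker in INJECTION_MARKERS)

def validate_output(model_text: str) -> bool:
    """Block responses that echo secrets or obvious PII (very rough check)."""
    leaked_key = re.search(r"sk-[A-Za-z0-9]{20,}", model_text)   # API-key-shaped strings
    leaked_ssn = re.search(r"\b\d{3}-\d{2}-\d{4}\b", model_text)
    return leaked_key is None and leaked_ssn is None
```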
Case Study: Financial Services Chatbot Security
- Initial State: Basic chatbot, no specialized security
- Security Assessment Findings:
- Vulnerable to prompt injection (15/20 test cases succeeded)
- PII leakage in 8% of responses
- No jailbreak protections
- Implemented Defenses:
- Multi-layer input validation: $20K
- PII redaction (input & output): $30K
- Red-team testing & hardening: $40K
- Ongoing monitoring: $15K/year
- Results After 6 Months:
- Prompt injection success rate: <2% (93% improvement)
- PII leakage: 0 incidents
- Zero security breaches
- Compliance audit: 100% pass
- Avoided: Estimated $5M+ in potential breach costs
- ROI: Incalculable (risk mitigation)
Hosting Options
Choosing where and how to host AI models significantly impacts cost, control, and capabilities.
Hosting Strategy Decision Tree
```mermaid
flowchart TD
  Start[Hosting Decision] --> Q1{Volume?}
  Q1 -->|Low: <100K/month| Q2{Data Sensitivity?}
  Q1 -->|Medium: 100K-1M/month| Q3{Cost Optimization Priority?}
  Q1 -->|High: >1M/month| SelfHost[Self-Hosted]
  Q2 -->|High: PII, proprietary| SelfHost
  Q2 -->|Medium| API[Managed API]
  Q3 -->|High| Q4{Technical Expertise?}
  Q3 -->|Medium| Hybrid[Hybrid Approach]
  Q4 -->|Yes| SelfHost
  Q4 -->|No| Hybrid
  API --> A1[OpenAI, Anthropic<br/>Google, Cohere]
  SelfHost --> S1[Llama, Mistral<br/>On AWS/GCP/Azure]
  Hybrid --> H1[API for complex<br/>Self-hosted for simple/sensitive]
```
Hosting Comparison Matrix
| Factor | Managed APIs | Self-Hosted OSS | Hybrid |
|---|---|---|---|
| Time to Production | Days | Weeks-Months | Weeks |
| Upfront Cost | $0 | Up to ~$200K (hardware) or ~$50K/month (cloud) | ~$30K |
| Per-Request Cost | ~$0.10 | ~$0.001 (amortized) | ~$0.02 (optimized routing) |
| Control | Low (vendor-dependent) | High (full control) | Medium (selective) |
| Compliance | Vendor-dependent (may limit use cases) | Full control (meet any requirement) | Flexible (route by requirement) |
| Scalability | Automatic (vendor handles) | Manual (requires planning) | Mixed (auto + manual) |
| Latest Models | Immediate access | Delayed (3-6 months) | Best of both |
| Customization | Limited (API parameters only) | Full (modify anything) | Selective (fine-tune what matters) |
| Operational Overhead | Minimal | High (DevOps, MLOps teams) | Medium |
Break-Even Analysis
Scenario: Customer support chatbot
| Volume (req/month) | Managed API Cost | Self-Hosted Cost | Break-Even Point |
|---|---|---|---|
| 10K | $500 | $15,000 (not worth it) | N/A |
| 100K | $5,000 | $15,000 | ~300K requests/month |
| 1M | $50,000 | $20,000 | ✅ Self-hosted wins |
| 10M | $500,000 | $35,000 | ✅ 14x cheaper self-hosted |
Cost Breakdown (Self-Hosted at 1M req/month):
- Infrastructure (4x A100 GPUs on AWS): $23,600/month
- Engineering (2 FTE MLOps): $30,000/month
- Total: $53,600/month
- Amortized per request: $0.05
- vs. API cost: $0.50/request
- Savings: 90% ($447K/month)
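The same break-even arithmetic generalizes to any pair of cost structures: managed APIs scale roughly linearly with volume, while self-hosting is dominated by fixed monthly cost. A small sketch, with the per-request price and fixed monthly cost as explicit assumptions:

```python
# Sketch: break-even volume between a per-request API and a fixed-cost self-hosted stack.
# The example prices mirror the table above and are assumptions, not vendor quotes.
def breakeven_volume(api_cost_per_request: float, fixed_monthly_cost: float,
                     selfhost_cost_per_request: float = 0.0) -> float:
    """Monthly request volume at which self-hosting becomes cheaper than the API."""
    return fixed_monthly_cost / (api_cost_per_request - selfhost_cost_per_request)

# API at $0.05/request vs. a $15,000/month self-hosted baseline:
print(f"{breakeven_volume(0.05, 15_000):,.0f} requests/month")   # ~300,000
```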
Case Study: Healthcare AI Platform
- Company: Hospital network, patient intake chatbot
- Requirements:
- HIPAA compliance (cannot send PHI to third party)
- 500K conversations/month
- <2s latency
- Decision: Self-hosted Llama 3.1 70B on-premises
- Investment:
- Hardware: $160K (8x A100 GPUs)
- Setup: $80K (engineering)
- Annual ops: $240K (2 FTE)
- Economics:
- Year 1 total: $480K
- Year 2-3: $240K/year
- vs. API (if allowed): $3M/year
- 3-Year Savings: $8.5M
- ROI: 1,771%
- Additional Benefits:
- Full HIPAA compliance
- Custom fine-tuning on medical data
- No rate limits
- Data never leaves premises
Evaluation Essentials
Rigorous evaluation is critical for AI success. Different paradigms require different evaluation approaches.
Classical ML Evaluation Framework
Classification Metrics Decision Tree
```mermaid
flowchart TD
  Start[Classification Problem] --> Q1{Class Balance?}
  Q1 -->|Balanced| Accuracy[Accuracy]
  Q1 -->|Imbalanced| Q2{Cost of Errors?}
  Q2 -->|FP more costly| Precision[Precision]
  Q2 -->|FN more costly| Recall[Recall]
  Q2 -->|Both matter equally| F1[F1 Score]
  Q2 -->|Need full picture| AUCROC[AUC-ROC]
  Q1 -->|Very Imbalanced: <5%| AUCPR[AUC-PR]
```
Metric Selection Matrix
| Use Case | Primary Metric | Why | Threshold |
|---|---|---|---|
| Fraud Detection | Precision + Recall | Both FP (false accusation) and FN (missed fraud) costly | Precision >90%, Recall >80% |
| Spam Filter | Precision | FP (blocking good email) very costly | Precision >95% |
| Medical Diagnosis | Recall | FN (missed disease) potentially fatal | Recall >95% |
| Churn Prediction | AUC-ROC | Need ranked list for targeting | AUC >0.75 |
| Click Prediction | AUC-PR | Very imbalanced (CTR ~1%) | AUC-PR >0.3 |
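Once you have true labels and predicted probabilities, every metric in this matrix is a single scikit-learn call. A short sketch, with dummy values standing in for real model outputs:

```python
# Sketch: computing the metrics from the matrix above with scikit-learn.
# y_true / y_prob are dummy values standing in for real model outputs.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.45, 0.9, 0.55, 0.05, 0.7, 0.35])
y_pred = (y_prob >= 0.5).astype(int)            # the decision threshold is itself a business choice

print("precision:", precision_score(y_true, y_pred))          # cost of false positives
print("recall:   ", recall_score(y_true, y_pred))             # cost of false negatives
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_prob))            # threshold-free ranking quality
print("auc-pr:   ", average_precision_score(y_true, y_prob))  # better for heavy class imbalance
```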
Generative AI Evaluation
Multi-Dimensional Evaluation Framework
```mermaid
graph TD
  A[LLM Evaluation] --> B[Factuality]
  A --> C[Relevance]
  A --> D[Coherence]
  A --> E[Safety]
  A --> F[Task-Specific]
  B --> B1[Grounding in context<br/>Metrics: Exact match, ROUGE]
  B --> B2[Hallucination detection<br/>Metrics: Faithfulness score]
  C --> C1[Answers the question<br/>Metrics: Semantic similarity]
  C --> C2[Appropriate scope<br/>Metrics: Length, coverage]
  D --> D1[Logical flow<br/>Metrics: Coherence score]
  D --> D2[Consistency<br/>Metrics: Self-BLEU]
  E --> E1[No toxicity<br/>Metrics: Perspective API]
  E --> E2[No PII leakage<br/>Metrics: Regex + NER]
  E --> E3[No jailbreaks<br/>Metrics: Red-team pass rate]
  F --> F1[Format adherence<br/>Metrics: Schema validation]
  F --> F2[Domain accuracy<br/>Metrics: Expert review]
```
Evaluation Approach Comparison
| Approach | Speed | Cost | Scalability | Reliability | Best For |
|---|---|---|---|---|---|
| Automated Metrics (ROUGE, BLEU) | Fast | Low | High | Moderate (correlation with quality varies) | Large-scale, continuous |
| LLM-as-Judge | Medium | Medium | High | Good (80-90% agreement with humans) | Scalable quality assessment |
| Human Evaluation | Slow | High | Low | Highest (gold standard) | High-stakes, final validation |
| Hybrid (Auto + Sample Human) | Medium | Medium | Medium-High | High | Production systems |
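In practice, LLM-as-judge means scoring each response against a short rubric with a second model call and aggregating the results over a sample. The rubric wording, 1-5 scale, and model choice below are assumptions for illustration:

```python
# Sketch: LLM-as-judge scoring of a generated answer against a rubric.
# Rubric wording, scale, and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the ANSWER to the QUESTION given the CONTEXT on three 1-5 scales: "
    "relevance, factuality (grounded in the context), coherence. "
    'Reply with JSON only: {"relevance": n, "factuality": n, "coherence": n}.'
)

def judge(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,   # keep judging as repeatable as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```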
Case Study: Customer Support QA System Evaluation
- System: RAG-based Q&A for 500 agents
- Evaluation Strategy: Hybrid approach
- Automated (100% of responses):
- Latency: <2s (SLA)
- Safety checks: 0 PII leakage
- Cost: <$0.10/query
- LLM-as-Judge (10% sample, daily):
- Relevance: >85%
- Factuality: >90%
- Coherence: >90%
- Human Review (1% sample, weekly):
- Overall quality: >4.0/5
- Agent trust: >3.8/5
- Actionable: >85%
- Continuous Monitoring:
- Daily: Automated metrics
- Weekly: LLM-as-judge trends
- Monthly: Human evaluation deep-dive
- Feedback Loop:
- Quality dips trigger investigation
- Human feedback used to improve prompts
- New edge cases added to test set
- Result: Maintained 89% accuracy over 12 months with continuous improvement
Summary
The AI landscape offers diverse approaches for different problems:
Technology Selection Framework
```mermaid
graph TD
  A[Business Problem] --> B{Problem Characteristics}
  B -->|Deterministic, rules-based| Rules[Rules/Heuristics<br/>$, <1ms, Low complexity]
  B -->|Structured data, prediction| Classical[Classical ML<br/>$$, 1-100ms, Medium complexity]
  B -->|Unstructured text, generation| GenAI[Generative AI<br/>$$$, 100ms-5s, Medium complexity]
  B -->|Multi-step, tool use| Agents[Agentic Systems<br/>$$$$, 5-60s, High complexity]
  Classical --> C1[XGBoost, Random Forest<br/>85-95% accuracy<br/>Best for tabular]
  GenAI --> G1[LLMs + RAG<br/>80-90% accuracy<br/>Best for knowledge work]
  Agents --> A1[ReAct, Multi-Agent<br/>70-85% task success<br/>Best for workflows]
  style Rules fill:#90EE90
  style Classical fill:#87CEEB
  style GenAI fill:#FFD700
  style Agents fill:#FFA500
```
Key Takeaways
- Match approach to problem: Not every problem needs the latest LLM
  - Structured prediction → Classical ML
  - Knowledge work → LLMs + RAG
  - Multi-step workflows → Agents
- Understand tradeoffs: Optimize for 2 of 3 (cost, latency, quality)
  - High volume + latency-tolerant → Optimize for cost
  - Real-time + quality → Accept higher cost
  - Low budget + quality → Accept higher latency
- Rigorous evaluation: Appropriate metrics for each paradigm
  - Classical ML: Accuracy, precision, recall, AUC
  - Generative AI: Factuality, relevance, safety
  - Agents: Task success rate, efficiency
- Safety first: Governance and controls embedded from the start
  - Defense in depth (input validation, output filtering, monitoring)
  - Privacy by design (PII redaction, access controls)
  - Continuous monitoring (automated + human review)
- Economics matter: Consider total cost of ownership
  - Development + Inference + Operations
  - Break-even analysis for hosting decisions
  - ROI measurement across full lifecycle
Success Formula:
- Right tool for the job → 3-5x better ROI
- Early validation → 60% cost reduction (fail fast)
- Continuous optimization → 20-40% ongoing improvement
- Safety & compliance → up to $20M in avoided fines
The next chapter explores ethical considerations and professional conduct in AI consulting.