Part 1: Foundations of AI Consulting

Chapter 2: The AI Landscape & Core Concepts


Overview

Establish a shared foundation across ML paradigms, generative AI, and agentic systems. Understand when to use which approach, associated costs/latency, and core constraints.

This chapter provides a comprehensive map of the AI landscape, helping you select the right approach for specific business problems. Whether you're dealing with structured predictions, natural language understanding, or autonomous agents, understanding the capabilities, limitations, and tradeoffs of each paradigm is essential for successful AI consulting.

Objectives

  • Map the AI landscape from classical ML to generative AI to agentic systems
  • Provide decision frameworks for selecting the right AI approach
  • Understand cost, latency, and quality tradeoffs
  • Establish shared vocabulary and mental models for AI consulting

Core Concepts

Learning Paradigms

AI systems learn from data using different paradigms, each suited to different types of problems:

```mermaid
graph TD
    A[Machine Learning Paradigms] --> B[Supervised Learning]
    A --> C[Unsupervised Learning]
    A --> D[Semi-Supervised Learning]
    A --> E[Self-Supervised Learning]
    A --> F[Reinforcement Learning]
    B --> B1[Classification]
    B --> B2[Regression]
    B --> B3[Requires labeled data]
    C --> C1[Clustering]
    C --> C2[Dimensionality Reduction]
    C --> C3[No labels needed]
    D --> D1[Small labeled + Large unlabeled]
    D --> D2[Cost-effective labeling]
    E --> E1[Create labels from data]
    E --> E2[Foundation models]
    F --> F1[Trial and error]
    F --> F2[Maximize rewards]
```

Learning Paradigm Selection Framework

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Labels Available?}
    Q1 -->|Yes: Abundant| Supervised[Supervised Learning]
    Q1 -->|Yes: Limited| SemiSupervised[Semi-Supervised Learning]
    Q1 -->|No Labels| Q2{Pattern Discovery?}
    Q2 -->|Yes| Unsupervised[Unsupervised Learning]
    Q2 -->|No| Q3{Sequential Decisions?}
    Q3 -->|Yes| RL[Reinforcement Learning]
    Q3 -->|No| SelfSupervised[Self-Supervised Learning]
    Supervised --> Ex1[Credit scoring<br/>Churn prediction<br/>Medical diagnosis]
    SemiSupervised --> Ex2[Document classification<br/>Image labeling]
    Unsupervised --> Ex3[Customer segmentation<br/>Anomaly detection]
    RL --> Ex4[Game playing<br/>Resource optimization<br/>RLHF for LLMs]
    SelfSupervised --> Ex5[Foundation models<br/>GPT, BERT, Claude]
```

Paradigm Comparison Matrix

| Paradigm | Data Requirements | Common Use Cases | Typical Accuracy | Time to Value | Cost |
|---|---|---|---|---|---|
| Supervised | 1K-1M labeled examples | Classification, regression, forecasting | 85-95% | 4-8 weeks | $$-$$$ |
| Unsupervised | Unlabeled data only | Clustering, dimensionality reduction | N/A (interpretive) | 2-4 weeks | $ |
| Semi-Supervised | 100s labeled + 10Ks unlabeled | Image/text classification | 80-90% | 6-10 weeks | $$ |
| Self-Supervised | Massive unlabeled corpora | Foundation model pretraining | Varies | Months (done by providers) | $$$$ |
| Reinforcement Learning | Simulation or feedback | Sequential decisions, optimization | Varies | 8-24 weeks | $$$-$$$$ |

Case Study: E-commerce Customer Segmentation

  • Approach: Unsupervised learning (K-means clustering)
  • Data: 500K customers, 50 behavioral features
  • Outcome: Identified 7 distinct segments
  • Business Impact:
    • Marketing conversion improved by 34%
    • Customer retention increased by 18%
    • Campaign ROI improved from 2.1x to 3.4x
    • Time to insight: 3 weeks vs. 3 months manual analysis
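
A minimal sketch of the clustering approach used in the case study above, assuming a table of numeric behavioral features per customer; the file name, feature set, and cluster-count range are illustrative, not the firm's actual pipeline.

```python
# Minimal customer-segmentation sketch with K-means (scikit-learn).
# Assumes a DataFrame of numeric behavioral features; names are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = pd.read_csv("customer_features.csv")  # hypothetical export of behavioral features

# Scale features so no single metric (e.g., lifetime spend) dominates the distance measure.
X = StandardScaler().fit_transform(features)

# Try several cluster counts and keep the one with the best silhouette score.
best_k, best_score = None, -1.0
for k in range(4, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

segments = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
features["segment"] = segments
print(features.groupby("segment").mean())  # profile each segment for marketing
```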

Classical vs. Generative AI

A fundamental distinction in modern AI that drives architecture and economics:

Capability Comparison

| Aspect | Classical ML | Generative AI |
|---|---|---|
| Primary Task | Prediction, classification, scoring | Content synthesis, reasoning, generation |
| Output Type | Categorical labels, numerical scores | Text, images, code, structured data |
| Data Requirements | Structured, tabular, labeled (1K-1M examples) | Large-scale unstructured data (billions of tokens) |
| Interpretability | Often high (linear models, trees) | Generally low (black box) |
| Latency | Typically 1-100ms | 100ms-10s depending on size |
| Cost per Inference | $0.0001-$0.01 | $0.001-$0.10+ |
| Determinism | Consistent outputs for same input | Stochastic (varies across runs) |
| Training Time | Hours to days | Weeks to months (foundation models) |
| Training Cost | $100-$10K | $100K-$100M+ (foundation models) |
| Best For | Structured prediction, tabular data | Language, reasoning, content creation |

Decision Framework

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Output Type?}
    Q1 -->|Number/Category| Classical[Classical ML]
    Q1 -->|Text/Content| Q2{Data Structure?}
    Q2 -->|Structured/Tabular| Q3{Need Reasoning?}
    Q3 -->|No| Classical
    Q3 -->|Yes| GenAI[Generative AI]
    Q2 -->|Unstructured Text/Images| GenAI
    Classical --> Ex1[Examples:<br/>• Fraud detection<br/>• Churn prediction<br/>• Price optimization<br/>• Demand forecasting]
    GenAI --> Ex2[Examples:<br/>• Document summarization<br/>• Chatbots<br/>• Code generation<br/>• Content creation]
    style Classical fill:#90EE90
    style GenAI fill:#87CEEB
```

Case Study: Bank Loan Default Prediction

  • Approach: Classical ML (Gradient Boosting)
  • Data: 500K historical loans, 150 features
  • Model: XGBoost achieving 0.85 AUC-ROC
  • Business Impact:
    • Default rate reduced by 18%
    • Annual savings: $12M
    • Model inference: <10ms
    • Cost per prediction: $0.0001
    • ROI: 2,400% in first year

vs. Generative AI Attempt (same problem):

  • Approach: LLM-based reasoning
  • Performance: 0.72 AUC-ROC (lower accuracy)
  • Latency: 800ms (80x slower)
  • Cost per prediction: $0.02 (200x more expensive)
  • Conclusion: Wrong tool for the job—structured prediction doesn't need generative AI
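
A minimal sketch of the gradient-boosting approach from the loan-default case study, assuming a tabular extract with a binary default label; the file name, column names, and hyperparameters are illustrative.

```python
# Minimal loan-default sketch with gradient boosting (XGBoost), evaluated on AUC-ROC.
# Assumes a tabular dataset with a binary "default" column; names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

loans = pd.read_csv("historical_loans.csv")  # hypothetical extract of historical loans
X = loans.drop(columns=["default"])
y = loans["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8, eval_metric="auc",
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, probs):.3f}")  # the case study reports ~0.85
```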

Large Language Models (LLMs)

LLMs are the foundation of modern generative AI applications.

LLM Landscape & Economics

```mermaid
graph TD
    A[LLM Options] --> B[Cloud APIs]
    A --> C[Self-Hosted OSS]
    A --> D[Hybrid]
    B --> B1[OpenAI GPT-4]
    B --> B2[Anthropic Claude]
    B --> B3[Google Gemini]
    C --> C1[Llama 3.1]
    C --> C2[Mistral]
    C --> C3[Gemma]
    D --> D1[API for Complex Tasks]
    D --> D2[Local for Simple/Sensitive]
    B1 --> Cost1[$10-30/1M tokens]
    C1 --> Cost2[$0.10-1/1M tokens<br/>after amortization]
    D1 --> Cost3[Optimized routing]
```

Model Selection Matrix

| Model | Context Window | Cost per 1M tokens (Input / Output) | Latency (P95) | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 128K | $10 / $30 | 1-3s | Complex reasoning, high-stakes |
| Claude 3.5 Sonnet | 200K | $3 / $15 | 1-2s | Long context, analysis |
| GPT-3.5 Turbo | 16K | $0.50 / $1.50 | 0.5-1s | General purpose, high volume |
| Gemini 1.5 Pro | 2M | $1.25 / $5 | 1-2s | Very long documents |
| Llama 3.1 70B (hosted) | 128K | $0.79 / $0.79 | 0.8-1.5s | Cost-sensitive, moderate volume |
| Llama 3.1 70B (self) | 128K | ~$0.001 / $0.001 | 0.5-1s | High volume (>1M/month) |

LLM Adaptation Approaches

```mermaid
flowchart TD
    Start[Need to Adapt LLM?] --> Q1{Task Complexity}
    Q1 -->|Simple| Prompt[Prompt Engineering]
    Q1 -->|Moderate| Q2{Need External Knowledge?}
    Q1 -->|Complex| Q3{Have Training Data?}
    Q2 -->|Yes| RAG[RAG]
    Q2 -->|No| Prompt
    Q3 -->|Yes: >10K examples| FT[Fine-tuning]
    Q3 -->|Yes: 1K-10K examples| LoRA[LoRA]
    Q3 -->|No| RAG
    Prompt --> P1[Cost: $0<br/>Time: Hours<br/>Flexibility: High]
    RAG --> P2[Cost: $-$$<br/>Time: Days-Weeks<br/>Updatable: Yes]
    LoRA --> P3[Cost: $$<br/>Time: Days<br/>Tasks: Multi-task]
    FT --> P4[Cost: $$$<br/>Time: Weeks<br/>Performance: Highest]
```

Adaptation Approach Comparison

| Approach | Cost | Time to Deploy | Data Needed | Performance | Updatability | Best For |
|---|---|---|---|---|---|---|
| Prompt Engineering | None | Hours | 0-10 examples | 70-85% | Immediate | Most applications, rapid iteration |
| RAG | $-$$ (infra) | Days-Weeks | 100s-1000s docs | 80-90% | Easy (add docs) | Knowledge-intensive, dynamic info |
| LoRA | $$ | Days | 1K-10K examples | 85-92% | Moderate (retrain) | Multiple specialized tasks |
| Fine-tuning | $$$ | Weeks | 10K-100K examples | 90-95% | Hard (full retrain) | Specialized domains, specific formats |
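
A minimal sketch of the cheapest row above, prompt engineering with a few in-context examples, using the OpenAI Python client; the model name, system instructions, and example clauses are placeholders rather than recommendations.

```python
# Minimal few-shot prompt-engineering sketch with the OpenAI Python client.
# Model name, instructions, and examples are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    {"role": "user", "content": "Clause: 'Either party may terminate with 30 days notice.'"},
    {"role": "assistant", "content": '{"type": "termination", "risk": "low"}'},
    {"role": "user", "content": "Clause: 'Liability is unlimited for all damages.'"},
    {"role": "assistant", "content": '{"type": "liability", "risk": "high"}'},
]

def classify_clause(clause: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; choose per the selection matrix above
        messages=[
            {"role": "system", "content": "Classify contract clauses. Reply with JSON: type, risk."},
            *FEW_SHOT,
            {"role": "user", "content": f"Clause: {clause!r}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(classify_clause("Fees increase 5% annually without notice."))
```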

Case Study: Legal Document Analysis

  • Company: Mid-size law firm processing 500 contracts/month
  • Tested Approaches:
    1. Prompt Engineering: 76% accuracy, $0.05/doc, ready in 1 week
    2. RAG with firm templates: 89% accuracy, $0.12/doc, ready in 3 weeks
    3. Fine-tuned model: 94% accuracy, $0.08/doc, ready in 8 weeks
  • Decision: Chose RAG
    • Rationale: 89% accuracy sufficient, fastest to update with new clauses
    • Business Impact:
      • Review time reduced from 45 min to 12 min (73% reduction)
      • Monthly savings: $18K in attorney time
      • ROI: 450% in first year

Retrieval-Augmented Generation (RAG)

RAG grounds LLM outputs in authoritative data, reducing hallucinations and enabling access to current/private information.

RAG Architecture & Components

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Retrieve Top-K Chunks]
    D --> E[Rerank Optional]
    E --> F[Assemble Context]
    F --> G[LLM Generation]
    G --> H[Response]
    I[Document Corpus] --> J[Chunking<br/>200-1000 tokens]
    J --> K[Embedding<br/>text-embedding-3]
    K --> L[Vector Database<br/>Pinecone/Weaviate]
    L --> C
    style G fill:#87CEEB
    style L fill:#90EE90
```
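
A minimal end-to-end sketch of the pipeline above, using OpenAI embeddings with brute-force cosine similarity in place of a managed vector database; the embedding model, toy corpus, and system prompt are assumptions for illustration only.

```python
# Minimal RAG sketch: embed chunks, retrieve top-k by cosine similarity, generate a grounded answer.
# Uses brute-force numpy search instead of a vector DB; model names are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["Refunds are issued within 14 days.", "Premium plans include phone support."]  # toy corpus
chunk_vecs = embed(chunks)

def retrieve(query: str, k: int = 5) -> list[str]:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Answer only from the provided context; say 'not found' otherwise."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```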

RAG Design Decisions Matrix

| Decision Point | Options | Tradeoffs | Recommendation |
|---|---|---|---|
| Chunk Size | 200 / 500 / 1000 tokens | Small: precise, more chunks. Large: more context, less precise | 500 tokens for most use cases |
| Chunk Overlap | 0 / 50 / 100 tokens | More: better continuity, redundancy. Less: efficient, potential gaps | 50 tokens (10% overlap) |
| Top-K | 3 / 5 / 10 chunks | More: better recall, higher cost. Fewer: focused, may miss info | 5 chunks for most use cases |
| Embedding Model | OpenAI / Cohere / OSS | Proprietary: quality, cost. Open: control, no API cost | OpenAI for quality, OSS for volume |
| Vector DB | Pinecone / Weaviate / pgvector | Managed: easy, $. Self-hosted: control, ops overhead | Pinecone for <10M docs, pgvector for >10M |
| Reranking | Yes / No | Improves precision 10-15%, adds 100-200ms latency | Yes for high-stakes applications |
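
A minimal sketch of the recommended chunking settings (roughly 500-token chunks with about 10% overlap), approximating tokens with whitespace-separated words since exact counts depend on the tokenizer.

```python
# Minimal sliding-window chunker matching the recommendations above
# (~500-token chunks, ~10% overlap). Approximates tokens with whitespace words;
# a production pipeline would count tokens with the model's real tokenizer.
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back to preserve continuity across chunk boundaries
    return chunks

doc = "Refund policy. " * 2000  # illustrative long document
pieces = chunk_document(doc)
print(len(pieces), "chunks;", len(pieces[0].split()), "words in the first chunk")
```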

RAG Performance Benchmarks

| Metric | Without RAG | Basic RAG | Advanced RAG (reranking) | Improvement |
|---|---|---|---|---|
| Answer Accuracy | 65% (pure LLM) | 82% | 89% | 37% improvement |
| Hallucination Rate | 18% | 5% | 2% | 89% reduction |
| Source Attribution | N/A | 84% correct | 92% correct | Traceable answers |
| Latency | 800ms | 1.2s | 1.8s | Worth the tradeoff |
| Cost per Query | $0.02 | $0.05 | $0.08 | Justifiable for accuracy |

Case Study: Technical Support Knowledge Base

  • Company: SaaS company with 2,500 help articles
  • Baseline: Keyword search, 62% resolution rate
  • RAG Implementation:
    • Chunk size: 500 tokens, 50 overlap
    • Vector DB: Pinecone (500K vectors)
    • Top-K: 5, with reranking
    • LLM: GPT-4 Turbo
  • Results:
    • Answer accuracy: 87% (vs. 62% keyword)
    • First-contact resolution: 81% (vs. 65%)
    • Average handle time: 6.2 min (vs. 9.5 min)
    • Cost per query: $0.06
    • Monthly volume: 50K queries
    • Annual savings: $480K in support costs
    • ROI: 720% in year 1

Agentic Systems

Agents extend LLMs with tool use, planning, and iterative refinement.

Agent Architecture Patterns

```mermaid
graph TD
    Input[User Input] --> Agent[Agent Core<br/>LLM]
    Agent --> Planning[Planning Module]
    Planning --> Tools[Tool Selection]
    Tools --> Execute[Execute Tools]
    Execute --> Memory[Update Memory]
    Memory --> Reflect[Reflection]
    Reflect --> Decision{Task Complete?}
    Decision -->|No| Planning
    Decision -->|Yes| Output[Final Response]
    Tools --> Tool1[Web Search<br/>Google/Bing]
    Tools --> Tool2[Calculator<br/>Python REPL]
    Tools --> Tool3[Database Query<br/>SQL]
    Tools --> Tool4[API Calls<br/>REST/GraphQL]
    style Agent fill:#87CEEB
    style Tools fill:#90EE90
```

Agent Pattern Comparison

| Pattern | Complexity | Reliability | Cost/Task | Latency | Use Cases | Success Rate |
|---|---|---|---|---|---|---|
| ReAct | Medium | 75-85% | $0.05-$0.15 | 5-15s | Customer support, data analysis | 80% |
| Plan-and-Execute | Medium-High | 70-80% | $0.10-$0.25 | 10-30s | Travel booking, research | 75% |
| Reflexion | High | 80-90% | $0.15-$0.40 | 15-45s | Code debugging, complex problem-solving | 85% |
| Multi-Agent | Very High | 65-75% | $0.25-$0.60 | 30-120s | Software development, strategic planning | 70% |
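
A minimal sketch of the ReAct loop: the model alternates between requesting a tool call and observing its result until it emits a final answer. The tool registry, JSON protocol, step limit, and model name are assumptions; a production agent would add validation, retries, and sandboxed execution.

```python
# Minimal ReAct-style loop: the LLM either requests a tool (as JSON) or returns a final answer.
# Tool registry, prompt format, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; sandbox in production
    "search_orders": lambda order_id: json.dumps({"order_id": order_id, "status": "shipped"}),  # stub
}

SYSTEM = (
    "You solve tasks step by step. Reply ONLY with JSON: "
    '{"action": "<tool name or final>", "input": "<tool input or final answer>"}. '
    f"Available tools: {list(TOOLS)}."
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
        step = json.loads(reply.choices[0].message.content)
        if step["action"] == "final":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])  # execute the requested tool
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Step limit reached"

print(run_agent("What is the status of order 1138, and what is 0.1 * 89.99 as a refund?"))
```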

Agent Tool Ecosystem

```mermaid
graph LR
    A[Agent Core] --> B[Knowledge Tools]
    A --> C[Action Tools]
    A --> D[Analysis Tools]
    B --> B1[Search<br/>Web/Internal]
    B --> B2[RAG<br/>Documents]
    B --> B3[Memory<br/>Vector DB]
    C --> C1[Database<br/>CRUD Ops]
    C --> C2[APIs<br/>External Services]
    C --> C3[Email/Slack<br/>Communication]
    D --> D1[Calculator<br/>Math/Finance]
    D --> D2[Code Execution<br/>Python/SQL]
    D --> D3[Data Viz<br/>Charts/Graphs]
```

Case Study: Customer Service Agent

  • Company: E-commerce retailer, 500+ support agents
  • Agent Capabilities:
    1. Search order database (tool: search_orders)
    2. Calculate refunds (tool: calculator)
    3. Update CRM (tool: update_crm)
    4. Send emails (tool: send_email)
  • Implementation: ReAct pattern with GPT-4
  • Results:
    • Average handle time: 5.2 min (vs. 8.5 min manual)
    • Time savings: 39%
    • Error rate in calculations: 95% reduction (agent always accurate)
    • Agent satisfaction: 4.3/5 (tools reduce frustration)
    • Cost per interaction: $0.12
    • Annual savings: $890K across 500 agents
    • CSAT maintained at 4.2/5
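
A minimal sketch of how the case study's tools could be declared for a model that supports function calling; the tool names mirror the list above, but the parameter schemas are illustrative assumptions rather than the retailer's actual API.

```python
# Minimal tool declarations (OpenAI function-calling format) mirroring the case study's tools.
# Parameter schemas are illustrative assumptions, not the retailer's actual API.
TOOL_SPECS = [
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": "Look up an order by ID and return its status and line items.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a refund calculation, e.g. '89.99 * 0.10'.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
]

# Passed as `tools=TOOL_SPECS` to the chat completion call; the model then returns
# structured tool calls that the orchestration layer executes (update_crm and
# send_email would be declared the same way).
```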

When To Use What

Choosing the right AI approach is critical for success. Here's a comprehensive decision framework:

Decision Tree by Problem Type

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Problem Type}
    Q1 -->|Deterministic Logic| Rules[Rules/Heuristics]
    Q1 -->|Structured Prediction| Q2{Data Type?}
    Q1 -->|Content Generation| GenAI[Generative AI]
    Q1 -->|Sequential Decisions| Agents[Agentic Systems]
    Q2 -->|Tabular/Structured| ClassicalML[Classical ML]
    Q2 -->|Text| Q3{Labeled Data?}
    Q2 -->|Images| Q4{Volume?}
    Q3 -->|Yes: >10K| ClassicalML
    Q3 -->|No| GenAI
    Q4 -->|High: >100K| DeepLearning[Deep Learning]
    Q4 -->|Low| GenAI
    Rules --> R1[Tax calculations<br/>Access control<br/>Compliance checks]
    ClassicalML --> C1[Fraud detection<br/>Churn prediction<br/>Price optimization]
    GenAI --> G1[Summarization<br/>Q&A<br/>Content creation]
    Agents --> A1[Research tasks<br/>Multi-step workflows<br/>Tool orchestration]
    DeepLearning --> D1[Image classification<br/>Object detection<br/>OCR]
```

Approach Selection Matrix

| Approach | Best For | Data Requirements | Latency | Cost | Complexity |
|---|---|---|---|---|---|
| Rules/Heuristics | Deterministic logic, compliance | Minimal | <1ms | Very Low | Low |
| Classical ML | Structured prediction, tabular data | 1K-1M labeled examples | 1-100ms | Low | Medium |
| Deep Learning (CV) | Images, video, complex vision tasks | 10K-1M labeled images | 10-500ms | Medium | High |
| LLM (Prompting) | Unstructured text, reasoning | 0-10 examples | 100ms-5s | Medium | Low-Medium |
| RAG | Grounded generation, knowledge tasks | 100s-1000s documents | 200ms-10s | Medium | Medium |
| Fine-tuning | Specialized domains, specific formats | 1K-100K examples | 100ms-5s | Medium-High | High |
| RL/Agents | Sequential decisions, optimization | Simulation or feedback | Varies widely | High | Very High |

Cost-Performance-Latency Tradeoff

```mermaid
graph TD
    A[Choose 2 of 3] --> B[Low Cost]
    A --> C[High Performance]
    A --> D[Low Latency]
    B --> BC[Low Cost + High Performance<br/>= Higher Latency<br/>Example: Batch processing with large models]
    B --> BD[Low Cost + Low Latency<br/>= Lower Performance<br/>Example: Simple rules or small models]
    C --> CD[High Performance + Low Latency<br/>= High Cost<br/>Example: GPT-4 with optimized infrastructure]
    style A fill:#FFD700
```

Optimization Strategies

| Technique | Latency Impact | Cost Impact | Quality Impact | Best For |
|---|---|---|---|---|
| Caching | -50-90% | -50-90% | Neutral | Repeated queries |
| Batching | +50-200% | -30-50% | Neutral | High throughput, latency-tolerant |
| Model Distillation | -40-70% | -40-70% | -5-15% | Production deployment |
| Quantization | -20-50% | -20-50% | -1-5% | Edge deployment |
| Prompt Compression | -20-40% | -20-40% | -0-10% | Long context scenarios |
| Smaller Model | -50-80% | -50-80% | -10-30% | Simple tasks |
| Hybrid Routing | -30-60% | -40-70% | Neutral to +5% | Mixed complexity workload |
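
A minimal sketch combining two of these techniques, caching and hybrid routing: repeated requests are served from a cache, and only requests flagged as complex go to the larger model. The complexity heuristic and model names are illustrative placeholders.

```python
# Minimal caching + hybrid-routing sketch: cheap model for simple requests,
# larger model only for complex ones, with an in-memory cache for repeats.
# The length-based complexity heuristic and model names are placeholders.
from functools import lru_cache
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL, STRONG_MODEL = "gpt-4o-mini", "gpt-4o"  # placeholders

def is_complex(prompt: str) -> bool:
    # Stand-in heuristic; real routers use classifiers or confidence scores.
    return len(prompt.split()) > 300 or "analyze" in prompt.lower()

@lru_cache(maxsize=10_000)  # exact-match cache; production systems often cache semantically
def summarize(prompt: str) -> str:
    model = STRONG_MODEL if is_complex(prompt) else CHEAP_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize:\n{prompt}"}],
    )
    return resp.choices[0].message.content

text = "Quarterly revenue grew 12% while churn fell to 3%."
print(summarize(text))   # routed to the cheap model
print(summarize(text))   # served from cache on the repeat
```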

Case Study: Document Summarization Cost Optimization

  • Baseline: GPT-4 for all documents
    • Cost: $0.08/document
    • Monthly volume: 100K documents
    • Monthly cost: $8,000
  • Optimized Approach:
    1. Caching common documents (40% hit rate): Save $3,200
    2. Route simple docs to GPT-3.5 (25% of volume): Save $1,200
    3. Batching (10 at a time): Save $800
    4. Prompt optimization (-30% tokens): Save $600
  • Result:
    • New monthly cost: $2,200
    • Savings: 78% ($5,800/month, $69,600/year)
    • Quality maintained: 94% similarity to baseline
    • Latency impact: +15% (acceptable for async workflow)

Constraints & Tradeoffs

Every AI solution involves tradeoffs. Understanding these is crucial for setting realistic expectations.

Cost Structure Analysis

```mermaid
graph TD
    A[Total Cost of AI System] --> B[Development]
    A --> C[Inference]
    A --> D[Operations]
    B --> B1[Data labeling<br/>$10K-$500K]
    B --> B2[Experimentation<br/>$20K-$200K]
    B --> B3[Engineering time<br/>$100K-$1M]
    C --> C1[Compute per request<br/>$0.0001-$0.10]
    C --> C2[API costs<br/>$1K-$100K/month]
    C --> C3[Infrastructure<br/>$5K-$50K/month]
    D --> D1[Monitoring<br/>$1K-$10K/month]
    D --> D2[Retraining<br/>$5K-$50K/quarter]
    D --> D3[Ops team<br/>$200K-$800K/year]
```

Data Constraints

Data Quality Requirements by Approach

| ML Approach | Completeness | Missing Data Tolerance | Label Accuracy | Noise Tolerance |
|---|---|---|---|---|
| Classical ML | >90% | <10% with imputation | >95% | Low-Medium |
| Deep Learning | >80% | <20% (learns to ignore) | >90% | Medium-High |
| LLMs | Variable | High (handles missing context) | N/A (unsupervised) | High |
| Fine-tuning | >95% | <5% | >98% | Very Low |

Data Privacy & Consent Decision Tree

```mermaid
flowchart TD
    Start[Data Source] --> Q1{User Consent?}
    Q1 -->|No| Stop[Cannot Use]
    Q1 -->|Yes| Q2{Contains PII?}
    Q2 -->|Yes| Q3{Need PII?}
    Q3 -->|No| Redact[Redact/Anonymize]
    Q3 -->|Yes| Q4{Compliance Framework?}
    Q4 -->|GDPR/CCPA| Controls1[• Data minimization<br/>• Encryption at rest/transit<br/>• Access controls<br/>• Right to deletion]
    Q4 -->|HIPAA| Controls2[• BAA required<br/>• Audit logs<br/>• De-identification<br/>• Limited retention]
    Q4 -->|Other| Controls3[• Risk assessment<br/>• Legal review<br/>• Custom controls]
    Q2 -->|No| Q5{Sensitive Domain?}
    Q5 -->|Yes: Finance, Health| Assess[Risk Assessment Required]
    Q5 -->|No| Proceed[Proceed with Standard Governance]
    Redact --> Proceed
    Controls1 --> Proceed
    Controls2 --> Proceed
    Controls3 --> Proceed
    Assess --> Proceed
```

Safety & Security Threats

AI Threat Landscape

```mermaid
graph TD
    A[AI Security Threats] --> B[Input Attacks]
    A --> C[Model Attacks]
    A --> D[Output Attacks]
    A --> E[Data Attacks]
    B --> B1[Prompt Injection<br/>Override instructions]
    B --> B2[Adversarial Inputs<br/>Misclassification]
    C --> C1[Model Extraction<br/>IP theft]
    C --> C2[Model Inversion<br/>Training data recovery]
    D --> D1[Data Exfiltration<br/>Leak PII/secrets]
    D --> D2[Hallucination<br/>False information]
    E --> E1[Data Poisoning<br/>Corrupt training]
    E --> E2[Backdoors<br/>Trigger behaviors]
```

Defense Strategy Matrix

| Threat | Impact | Defense Mechanism | Implementation Cost | Effectiveness |
|---|---|---|---|---|
| Prompt Injection | High | Input sanitization, prompt guards, output validation | $ | 80-90% |
| Data Exfiltration | Critical | PII redaction, access controls, output filtering | $$ | 95-99% |
| Jailbreaking | Medium-High | System prompt hardening, red-teaming, content filters | $$ | 70-85% |
| Hallucination | Medium | RAG grounding, fact-checking, confidence scores | $$ | 60-80% |
| Model Extraction | Medium | Rate limiting, watermarking, API monitoring | $ | 75-90% |
| Adversarial Examples | Medium | Adversarial training, input validation, ensemble models | $$$ | 70-85% |
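
A minimal sketch of two of the cheaper defenses in the matrix, input sanitization against common prompt-injection phrases and regex-based PII redaction on outputs; the patterns are illustrative, and real deployments layer dedicated injection classifiers and NER-based redaction on top.

```python
# Minimal input-guard and PII-redaction sketch. Phrase list and regexes are
# illustrative; production systems add injection classifiers, NER-based redaction,
# and output validation on top of simple patterns like these.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now .* (unrestricted|jailbroken)",
    r"reveal (your|the) system prompt",
]

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def guard_input(user_message: str) -> str:
    lowered = user_message.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        raise ValueError("Possible prompt injection; route to human review.")
    return user_message

def redact_output(model_reply: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        model_reply = pattern.sub(f"[REDACTED {label.upper()}]", model_reply)
    return model_reply

print(redact_output("Contact jane.doe@example.com, card 4111 1111 1111 1111."))
```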

Case Study: Financial Services Chatbot Security

  • Initial State: Basic chatbot, no specialized security
  • Security Assessment Findings:
    • Vulnerable to prompt injection (15/20 test cases succeeded)
    • PII leakage in 8% of responses
    • No jailbreak protections
  • Implemented Defenses:
    1. Multi-layer input validation: $20K
    2. PII redaction (input & output): $30K
    3. Red-team testing & hardening: $40K
    4. Ongoing monitoring: $15K/year
  • Results After 6 Months:
    • Prompt injection success rate: <2% (93% improvement)
    • PII leakage: 0 incidents
    • Zero security breaches
    • Compliance audit: 100% pass
    • Avoided: Estimated $5M+ in potential breach costs
    • ROI: Incalculable (risk mitigation)

Hosting Options

Choosing where and how to host AI models significantly impacts cost, control, and capabilities.

Hosting Strategy Decision Tree

```mermaid
flowchart TD
    Start[Hosting Decision] --> Q1{Volume?}
    Q1 -->|Low: <100K/month| Q2{Data Sensitivity?}
    Q1 -->|Medium: 100K-1M/month| Q3{Cost Optimization Priority?}
    Q1 -->|High: >1M/month| SelfHost[Self-Hosted]
    Q2 -->|High: PII, proprietary| SelfHost
    Q2 -->|Medium| API[Managed API]
    Q3 -->|High| Q4{Technical Expertise?}
    Q3 -->|Medium| Hybrid[Hybrid Approach]
    Q4 -->|Yes| SelfHost
    Q4 -->|No| Hybrid
    API --> A1[OpenAI, Anthropic<br/>Google, Cohere]
    SelfHost --> S1[Llama, Mistral<br/>On AWS/GCP/Azure]
    Hybrid --> H1[API for complex<br/>Self-hosted for simple/sensitive]
```

Hosting Comparison Matrix

| Factor | Managed APIs | Self-Hosted OSS | Hybrid |
|---|---|---|---|
| Time to Production | Days | Weeks-Months | Weeks |
| Upfront Cost | $0 | $40K-$200K (hardware) or $10K-$50K/month (cloud) | $5K-$30K |
| Per-Request Cost | $0.001-$0.10 | $0.0001-$0.001 (amortized) | $0.0005-$0.02 (optimized routing) |
| Control | Low (vendor-dependent) | High (full control) | Medium (selective) |
| Compliance | Vendor-dependent (may limit use cases) | Full control (meet any requirement) | Flexible (route by requirement) |
| Scalability | Automatic (vendor handles) | Manual (requires planning) | Mixed (auto + manual) |
| Latest Models | Immediate access | Delayed (3-6 months) | Best of both |
| Customization | Limited (API parameters only) | Full (modify anything) | Selective (fine-tune what matters) |
| Operational Overhead | Minimal | High (DevOps, MLOps teams) | Medium |

Break-Even Analysis

Scenario: Customer support chatbot

| Volume (req/month) | Managed API Cost | Self-Hosted Cost | Break-Even Point |
|---|---|---|---|
| 10K | $500 | $15,000 (not worth it) | N/A |
| 100K | $5,000 | $15,000 | ~300K requests/month |
| 1M | $50,000 | $20,000 | Self-hosted wins |
| 10M | $500,000 | $35,000 | 14x cheaper self-hosted |

Cost Breakdown (Self-Hosted at 1M req/month):

  • Infrastructure (4x A100 GPUs on AWS): $23,600/month
  • Engineering (2 FTE MLOps): $30,000/month
  • Total: $53,600/month
  • Amortized per request: $0.05
  • vs. API cost: $0.50/request
  • Savings: 90% ($447K/month)
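
A minimal sketch of the break-even arithmetic behind the table above: a fixed monthly self-hosting cost compared against a per-request API price. The input figures are the scenario's illustrative numbers, not vendor quotes.

```python
# Minimal break-even sketch: fixed monthly self-hosting cost vs. per-request API pricing.
# Figures are the scenario's illustrative numbers, not vendor quotes.
def monthly_cost_api(requests: int, price_per_request: float) -> float:
    return requests * price_per_request

def monthly_cost_self_hosted(requests: int, fixed_cost: float, marginal_cost: float = 0.0) -> float:
    return fixed_cost + requests * marginal_cost

def break_even_requests(price_per_request: float, fixed_cost: float, marginal_cost: float = 0.0) -> float:
    # Self-hosting pays off once the fixed cost is spread over enough requests.
    return fixed_cost / (price_per_request - marginal_cost)

API_PRICE = 0.05            # $/request, from the scenario's 100K row ($5,000 / 100K)
SELF_HOSTED_FIXED = 15_000  # $/month at moderate volume

for volume in (10_000, 100_000, 1_000_000):
    api = monthly_cost_api(volume, API_PRICE)
    self_hosted = monthly_cost_self_hosted(volume, SELF_HOSTED_FIXED)
    print(f"{volume:>9,} req/mo  API ${api:>9,.0f}  self-hosted ${self_hosted:>9,.0f}")

print(f"Break-even ~ {break_even_requests(API_PRICE, SELF_HOSTED_FIXED):,.0f} requests/month")
```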

Case Study: Healthcare AI Platform

  • Company: Hospital network, patient intake chatbot
  • Requirements:
    • HIPAA compliance (cannot send PHI to third party)
    • 500K conversations/month
    • <2s latency
  • Decision: Self-hosted Llama 3.1 70B on-premises
  • Investment:
    • Hardware: $160K (8x A100 GPUs)
    • Setup: $80K (engineering)
    • Annual ops: $240K (2 FTE)
  • Economics:
    • Year 1 total: $480K
    • Year 2-3: $240K/year
    • vs. API (if allowed): $3M/year
    • 3-Year Savings: $8.5M
    • ROI: 1,771%
  • Additional Benefits:
    • Full HIPAA compliance
    • Custom fine-tuning on medical data
    • No rate limits
    • Data never leaves premises

Evaluation Essentials

Rigorous evaluation is critical for AI success. Different paradigms require different evaluation approaches.

Classical ML Evaluation Framework

Classification Metrics Decision Tree

```mermaid
flowchart TD
    Start[Classification Problem] --> Q1{Class Balance?}
    Q1 -->|Balanced| Accuracy[Accuracy]
    Q1 -->|Imbalanced| Q2{Cost of Errors?}
    Q2 -->|FP more costly| Precision[Precision]
    Q2 -->|FN more costly| Recall[Recall]
    Q2 -->|Both matter equally| F1[F1 Score]
    Q2 -->|Need full picture| AUCROC[AUC-ROC]
    Q1 -->|Very Imbalanced: <5%| AUCPR[AUC-PR]
```

Metric Selection Matrix

| Use Case | Primary Metric | Why | Threshold |
|---|---|---|---|
| Fraud Detection | Precision + Recall | Both FP (false accusation) and FN (missed fraud) costly | Precision >90%, Recall >80% |
| Spam Filter | Precision | FP (blocking good email) very costly | Precision >95% |
| Medical Diagnosis | Recall | FN (missed disease) potentially fatal | Recall >95% |
| Churn Prediction | AUC-ROC | Need ranked list for targeting | AUC >0.75 |
| Click Prediction | AUC-PR | Very imbalanced (CTR ~1%) | AUC-PR >0.3 |
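
A minimal sketch computing the metrics referenced in this table with scikit-learn, using toy labels and scores for an imbalanced problem; the decision threshold is illustrative and should come from business costs as the table suggests.

```python
# Minimal metric-computation sketch (scikit-learn) for an imbalanced classifier.
# Labels, scores, and the 0.5 threshold are toy values for illustration.
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)

y_true   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]                     # 30% positives (illustrative)
y_scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.6, 0.8, 0.9]
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]            # decision threshold of 0.5

print("Precision:", precision_score(y_true, y_pred))           # how many flagged cases were real
print("Recall:   ", recall_score(y_true, y_pred))              # how many real cases were caught
print("F1:       ", round(f1_score(y_true, y_pred), 3))
print("AUC-ROC:  ", round(roc_auc_score(y_true, y_scores), 3)) # ranking quality across thresholds
print("AUC-PR:   ", round(average_precision_score(y_true, y_scores), 3))  # better under heavy imbalance
```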

Generative AI Evaluation

Multi-Dimensional Evaluation Framework

```mermaid
graph TD
    A[LLM Evaluation] --> B[Factuality]
    A --> C[Relevance]
    A --> D[Coherence]
    A --> E[Safety]
    A --> F[Task-Specific]
    B --> B1[Grounding in context<br/>Metrics: Exact match, ROUGE]
    B --> B2[Hallucination detection<br/>Metrics: Faithfulness score]
    C --> C1[Answers the question<br/>Metrics: Semantic similarity]
    C --> C2[Appropriate scope<br/>Metrics: Length, coverage]
    D --> D1[Logical flow<br/>Metrics: Coherence score]
    D --> D2[Consistency<br/>Metrics: Self-BLEU]
    E --> E1[No toxicity<br/>Metrics: Perspective API]
    E --> E2[No PII leakage<br/>Metrics: Regex + NER]
    E --> E3[No jailbreaks<br/>Metrics: Red-team pass rate]
    F --> F1[Format adherence<br/>Metrics: Schema validation]
    F --> F2[Domain accuracy<br/>Metrics: Expert review]
```

Evaluation Approach Comparison

| Approach | Speed | Cost | Scalability | Reliability | Best For |
|---|---|---|---|---|---|
| Automated Metrics (ROUGE, BLEU) | Fast | Low | High | Moderate (correlation with quality varies) | Large-scale, continuous |
| LLM-as-Judge | Medium | Medium | High | Good (80-90% agreement with humans) | Scalable quality assessment |
| Human Evaluation | Slow | High | Low | Highest (gold standard) | High-stakes, final validation |
| Hybrid (Auto + Sample Human) | Medium | Medium | Medium-High | High | Production systems |
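
A minimal sketch of the LLM-as-Judge approach: a second model grades each answer against the retrieved context on a 1-5 factuality/relevance rubric and returns JSON. The rubric wording, JSON schema, and judge model name are illustrative assumptions.

```python
# Minimal LLM-as-Judge sketch: grade an answer against its source context on a 1-5 rubric.
# Rubric wording, JSON schema, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a support answer.
Context: {context}
Question: {question}
Answer: {answer}
Score 1-5 for factuality (grounded in context) and relevance (addresses the question).
Reply only with JSON: {{"factuality": <int>, "relevance": <int>, "reason": "<one sentence>"}}"""

def judge(context: str, question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

scores = judge(
    context="Refunds are issued within 14 days of a return being received.",
    question="How long do refunds take?",
    answer="Refunds usually arrive within two weeks of the return reaching our warehouse.",
)
print(scores)  # e.g. {"factuality": 5, "relevance": 5, "reason": "..."}
```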

Case Study: Customer Support QA System Evaluation

  • System: RAG-based Q&A for 500 agents
  • Evaluation Strategy: Hybrid approach
    1. Automated (100% of responses):
      • Latency: <2s (SLA)
      • Safety checks: 0 PII leakage
      • Cost: <$0.10/query
    2. LLM-as-Judge (10% sample, daily):
      • Relevance: >85%
      • Factuality: >90%
      • Coherence: >90%
    3. Human Review (1% sample, weekly):
      • Overall quality: >4.0/5
      • Agent trust: >3.8/5
      • Actionable: >85%
  • Continuous Monitoring:
    • Daily: Automated metrics
    • Weekly: LLM-as-judge trends
    • Monthly: Human evaluation deep-dive
  • Feedback Loop:
    • Quality dips trigger investigation
    • Human feedback used to improve prompts
    • New edge cases added to test set
  • Result: Maintained 89% accuracy over 12 months with continuous improvement

Summary

The AI landscape offers diverse approaches for different problems:

Technology Selection Framework

```mermaid
graph TD
    A[Business Problem] --> B{Problem Characteristics}
    B -->|Deterministic, rules-based| Rules[Rules/Heuristics<br/>$, <1ms, Low complexity]
    B -->|Structured data, prediction| Classical[Classical ML<br/>$$, 1-100ms, Medium complexity]
    B -->|Unstructured text, generation| GenAI[Generative AI<br/>$$$, 100ms-5s, Medium complexity]
    B -->|Multi-step, tool use| Agents[Agentic Systems<br/>$$$$, 5-60s, High complexity]
    Classical --> C1[XGBoost, Random Forest<br/>85-95% accuracy<br/>Best for tabular]
    GenAI --> G1[LLMs + RAG<br/>80-90% accuracy<br/>Best for knowledge work]
    Agents --> A1[ReAct, Multi-Agent<br/>70-85% task success<br/>Best for workflows]
    style Rules fill:#90EE90
    style Classical fill:#87CEEB
    style GenAI fill:#FFD700
    style Agents fill:#FFA500
```

Key Takeaways

  1. Match approach to problem: Not every problem needs the latest LLM

    • Structured prediction → Classical ML
    • Knowledge work → LLMs + RAG
    • Multi-step workflows → Agents
  2. Understand tradeoffs: Optimize for 2 of 3 (cost, latency, quality)

    • High volume + latency-tolerant → Optimize for cost
    • Real-time + quality → Accept higher cost
    • Low budget + quality → Accept higher latency
  3. Rigorous evaluation: Appropriate metrics for each paradigm

    • Classical ML: Accuracy, precision, recall, AUC
    • Generative AI: Factuality, relevance, safety
    • Agents: Task success rate, efficiency
  4. Safety first: Governance and controls embedded from the start

    • Defense in depth (input validation, output filtering, monitoring)
    • Privacy by design (PII redaction, access controls)
    • Continuous monitoring (automated + human review)
  5. Economics matter: Consider total cost of ownership

    • Development + Inference + Operations
    • Break-even analysis for hosting decisions
    • ROI measurement across full lifecycle

Success Formula:

  • Right tool for the job → 3-5x better ROI
  • Early validation → 60% cost reduction (fail fast)
  • Continuous optimization → 20-40% ongoing improvement
  • Safety & compliance → $2M-$20M in avoided fines

The next chapter explores ethical considerations and professional conduct in AI consulting.