Part 1: Foundations of AI Consulting

Chapter 2: The AI Landscape & Core Concepts


Overview

Establish a shared foundation across ML paradigms, generative AI, and agentic systems. Understand when to use which approach, associated costs/latency, and core constraints.

This chapter provides a comprehensive map of the AI landscape, helping you select the right approach for specific business problems. Whether you're dealing with structured predictions, natural language understanding, or autonomous agents, understanding the capabilities, limitations, and tradeoffs of each paradigm is essential for successful AI consulting.

Objectives

  • Map the AI landscape from classical ML to generative AI to agentic systems
  • Provide decision frameworks for selecting the right AI approach
  • Understand cost, latency, and quality tradeoffs
  • Establish shared vocabulary and mental models for AI consulting

Core Concepts

Learning Paradigms

AI systems learn from data using different paradigms, each suited to different types of problems:

```mermaid
graph TD
    A[Machine Learning Paradigms] --> B[Supervised Learning]
    A --> C[Unsupervised Learning]
    A --> D[Semi-Supervised Learning]
    A --> E[Self-Supervised Learning]
    A --> F[Reinforcement Learning]
    B --> B1[Classification]
    B --> B2[Regression]
    B --> B3[Requires labeled data]
    C --> C1[Clustering]
    C --> C2[Dimensionality Reduction]
    C --> C3[No labels needed]
    D --> D1[Small labeled + Large unlabeled]
    D --> D2[Cost-effective labeling]
    E --> E1[Create labels from data]
    E --> E2[Foundation models]
    F --> F1[Trial and error]
    F --> F2[Maximize rewards]
```

Learning Paradigm Selection Framework

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Labels Available?}
    Q1 -->|Yes: Abundant| Supervised[Supervised Learning]
    Q1 -->|Yes: Limited| SemiSupervised[Semi-Supervised Learning]
    Q1 -->|No Labels| Q2{Pattern Discovery?}
    Q2 -->|Yes| Unsupervised[Unsupervised Learning]
    Q2 -->|No| Q3{Sequential Decisions?}
    Q3 -->|Yes| RL[Reinforcement Learning]
    Q3 -->|No| SelfSupervised[Self-Supervised Learning]
    Supervised --> Ex1[Credit scoring<br/>Churn prediction<br/>Medical diagnosis]
    SemiSupervised --> Ex2[Document classification<br/>Image labeling]
    Unsupervised --> Ex3[Customer segmentation<br/>Anomaly detection]
    RL --> Ex4[Game playing<br/>Resource optimization<br/>RLHF for LLMs]
    SelfSupervised --> Ex5[Foundation models<br/>GPT, BERT, Claude]
```

Paradigm Comparison Matrix

| Paradigm | Data Requirements | Common Use Cases | Typical Accuracy | Time to Value | Cost |
|---|---|---|---|---|---|
| Supervised | 1K-1M labeled examples | Classification, regression, forecasting | 85-95% | 4-8 weeks | $$-$$$ |
| Unsupervised | Unlabeled data only | Clustering, dimensionality reduction | N/A (interpretive) | 2-4 weeks | $ |
| Semi-Supervised | 100s labeled + 10Ks unlabeled | Image/text classification | 80-90% | 6-10 weeks | $$ |
| Self-Supervised | Massive unlabeled corpora | Foundation model pretraining | Varies | Months (done by providers) | $$$$ |
| Reinforcement Learning | Simulation or feedback | Sequential decisions, optimization | Varies | 8-24 weeks | $$$-$$$$ |

Case Study: E-commerce Customer Segmentation

  • Approach: Unsupervised learning (K-means clustering)
  • Data: 500K customers, 50 behavioral features
  • Outcome: Identified 7 distinct segments
  • Business Impact:
    • Marketing conversion improved by 34%
    • Customer retention increased by 18%
    • Campaign ROI improved from 2.1x to 3.4x
    • Time to insight: 3 weeks vs. 3 months manual analysis
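
A minimal sketch of the clustering approach used in the case study above, assuming a table of numeric behavioral features per customer; the file name, feature set, and cluster-count range are illustrative, not the firm's actual pipeline.

```python
# Minimal customer-segmentation sketch with K-means (scikit-learn).
# Assumes a DataFrame of numeric behavioral features; names are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = pd.read_csv("customer_features.csv")  # hypothetical export of behavioral features

# Scale features so no single metric (e.g., lifetime spend) dominates the distance measure.
X = StandardScaler().fit_transform(features)

# Try several cluster counts and keep the one with the best silhouette score.
best_k, best_score = None, -1.0
for k in range(4, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

segments = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
features["segment"] = segments
print(features.groupby("segment").mean())  # profile each segment for marketing
```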

Classical vs. Generative AI

A fundamental distinction in modern AI that drives architecture and economics:

Capability Comparison

| Aspect | Classical ML | Generative AI |
|---|---|---|
| Primary Task | Prediction, classification, scoring | Content synthesis, reasoning, generation |
| Output Type | Categorical labels, numerical scores | Text, images, code, structured data |
| Data Requirements | Structured, tabular, labeled (1K-1M examples) | Large-scale unstructured data (billions of tokens) |
| Interpretability | Often high (linear models, trees) | Generally low (black box) |
| Latency | Typically 1-100ms | 100ms-10s depending on size |
| Cost per Inference | $0.0001-$0.01 | $0.001-$0.10+ |
| Determinism | Consistent outputs for same input | Stochastic (varies across runs) |
| Training Time | Hours to days | Weeks to months (foundation models) |
| Training Cost | $100-$10K | $100K-$100M+ (foundation models) |
| Best For | Structured prediction, tabular data | Language, reasoning, content creation |

Decision Framework

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Output Type?}
    Q1 -->|Number/Category| Classical[Classical ML]
    Q1 -->|Text/Content| Q2{Data Structure?}
    Q2 -->|Structured/Tabular| Q3{Need Reasoning?}
    Q3 -->|No| Classical
    Q3 -->|Yes| GenAI[Generative AI]
    Q2 -->|Unstructured Text/Images| GenAI
    Classical --> Ex1[Examples:<br/>• Fraud detection<br/>• Churn prediction<br/>• Price optimization<br/>• Demand forecasting]
    GenAI --> Ex2[Examples:<br/>• Document summarization<br/>• Chatbots<br/>• Code generation<br/>• Content creation]
    style Classical fill:#90EE90
    style GenAI fill:#87CEEB
```

Case Study: Bank Loan Default Prediction

  • Approach: Classical ML (Gradient Boosting)
  • Data: 500K historical loans, 150 features
  • Model: XGBoost achieving 0.85 AUC-ROC
  • Business Impact:
    • Default rate reduced by 18%
    • Annual savings: $12M
    • Model inference: <10ms
    • Cost per prediction: $0.0001
    • ROI: 2,400% in first year

vs. Generative AI Attempt (same problem):

  • Approach: LLM-based reasoning
  • Performance: 0.72 AUC-ROC (lower accuracy)
  • Latency: 800ms (80x slower)
  • Cost per prediction: $0.02 (200x more expensive)
  • Conclusion: Wrong tool for the job—structured prediction doesn't need generative AI
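
A minimal sketch of the gradient-boosting approach from the loan-default case study, assuming a tabular extract with a binary default label; the file name, column names, and hyperparameters are illustrative.

```python
# Minimal loan-default sketch with gradient boosting (XGBoost), evaluated on AUC-ROC.
# Assumes a tabular dataset with a binary "default" column; names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

loans = pd.read_csv("historical_loans.csv")  # hypothetical extract of historical loans
X = loans.drop(columns=["default"])
y = loans["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8, eval_metric="auc",
)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, probs):.3f}")  # the case study reports ~0.85
```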

Large Language Models (LLMs)

LLMs are the foundation of modern generative AI applications.

LLM Landscape & Economics

```mermaid
graph TD
    A[LLM Options] --> B[Cloud APIs]
    A --> C[Self-Hosted OSS]
    A --> D[Hybrid]
    B --> B1[OpenAI GPT-4]
    B --> B2[Anthropic Claude]
    B --> B3[Google Gemini]
    C --> C1[Llama 3.1]
    C --> C2[Mistral]
    C --> C3[Gemma]
    D --> D1[API for Complex Tasks]
    D --> D2[Local for Simple/Sensitive]
    B1 --> Cost1[$10-30/1M tokens]
    C1 --> Cost2[$0.10-1/1M tokens<br/>after amortization]
    D1 --> Cost3[Optimized routing]
```

Model Selection Matrix

| Model | Context Window | Cost per 1M tokens (Input / Output) | Latency (P95) | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 128K | $10 / $30 | 1-3s | Complex reasoning, high-stakes |
| Claude 3.5 Sonnet | 200K | $3 / $15 | 1-2s | Long context, analysis |
| GPT-3.5 Turbo | 16K | $0.50 / $1.50 | 0.5-1s | General purpose, high volume |
| Gemini 1.5 Pro | 2M | $1.25 / $5 | 1-2s | Very long documents |
| Llama 3.1 70B (hosted) | 128K | $0.79 / $0.79 | 0.8-1.5s | Cost-sensitive, moderate volume |
| Llama 3.1 70B (self) | 128K | ~$0.001 / $0.001 | 0.5-1s | High volume (>1M/month) |

LLM Adaptation Approaches

```mermaid
flowchart TD
    Start[Need to Adapt LLM?] --> Q1{Task Complexity}
    Q1 -->|Simple| Prompt[Prompt Engineering]
    Q1 -->|Moderate| Q2{Need External Knowledge?}
    Q1 -->|Complex| Q3{Have Training Data?}
    Q2 -->|Yes| RAG[RAG]
    Q2 -->|No| Prompt
    Q3 -->|Yes: >10K examples| FT[Fine-tuning]
    Q3 -->|Yes: 1K-10K examples| LoRA[LoRA]
    Q3 -->|No| RAG
    Prompt --> P1[Cost: $0<br/>Time: Hours<br/>Flexibility: High]
    RAG --> P2[Cost: $-$$<br/>Time: Days-Weeks<br/>Updatable: Yes]
    LoRA --> P3[Cost: $$<br/>Time: Days<br/>Tasks: Multi-task]
    FT --> P4[Cost: $$$<br/>Time: Weeks<br/>Performance: Highest]
```

Adaptation Approach Comparison

| Approach | Cost | Time to Deploy | Data Needed | Performance | Updatability | Best For |
|---|---|---|---|---|---|---|
| Prompt Engineering | None | Hours | 0-10 examples | 70-85% | Immediate | Most applications, rapid iteration |
| RAG | $-$$ (infra) | Days-Weeks | 100s-1000s docs | 80-90% | Easy (add docs) | Knowledge-intensive, dynamic info |
| LoRA | $$ | Days | 1K-10K examples | 85-92% | Moderate (retrain) | Multiple specialized tasks |
| Fine-tuning | $$$ | Weeks | 10K-100K examples | 90-95% | Hard (full retrain) | Specialized domains, specific formats |
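
A minimal sketch of the cheapest row above, prompt engineering with a few in-context examples, using the OpenAI Python client; the model name, system instructions, and example clauses are placeholders rather than recommendations.

```python
# Minimal few-shot prompt-engineering sketch with the OpenAI Python client.
# Model name, instructions, and examples are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    {"role": "user", "content": "Clause: 'Either party may terminate with 30 days notice.'"},
    {"role": "assistant", "content": '{"type": "termination", "risk": "low"}'},
    {"role": "user", "content": "Clause: 'Liability is unlimited for all damages.'"},
    {"role": "assistant", "content": '{"type": "liability", "risk": "high"}'},
]

def classify_clause(clause: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; choose per the selection matrix above
        messages=[
            {"role": "system", "content": "Classify contract clauses. Reply with JSON: type, risk."},
            *FEW_SHOT,
            {"role": "user", "content": f"Clause: {clause!r}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(classify_clause("Fees increase 5% annually without notice."))
```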

Case Study: Legal Document Analysis

  • Company: Mid-size law firm processing 500 contracts/month
  • Tested Approaches:
    1. Prompt Engineering: 76% accuracy, $0.05/doc, ready in 1 week
    2. RAG with firm templates: 89% accuracy, $0.12/doc, ready in 3 weeks
    3. Fine-tuned model: 94% accuracy, $0.08/doc, ready in 8 weeks
  • Decision: Chose RAG
    • Rationale: 89% accuracy sufficient, fastest to update with new clauses
    • Business Impact:
      • Review time reduced from 45 min to 12 min (73% reduction)
      • Monthly savings: $18K in attorney time
      • ROI: 450% in first year

Retrieval-Augmented Generation (RAG)

RAG grounds LLM outputs in authoritative data, reducing hallucinations and enabling access to current/private information.

RAG Architecture & Components

```mermaid
graph LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Retrieve Top-K Chunks]
    D --> E[Rerank Optional]
    E --> F[Assemble Context]
    F --> G[LLM Generation]
    G --> H[Response]
    I[Document Corpus] --> J[Chunking<br/>200-1000 tokens]
    J --> K[Embedding<br/>text-embedding-3]
    K --> L[Vector Database<br/>Pinecone/Weaviate]
    L --> C
    style G fill:#87CEEB
    style L fill:#90EE90
```
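
A minimal end-to-end sketch of the pipeline above, using OpenAI embeddings with brute-force cosine similarity in place of a managed vector database; the embedding model, toy corpus, and system prompt are assumptions for illustration only.

```python
# Minimal RAG sketch: embed chunks, retrieve top-k by cosine similarity, generate a grounded answer.
# Uses brute-force numpy search instead of a vector DB; model names are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["Refunds are issued within 14 days.", "Premium plans include phone support."]  # toy corpus
chunk_vecs = embed(chunks)

def retrieve(query: str, k: int = 5) -> list[str]:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Answer only from the provided context; say 'not found' otherwise."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```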

RAG Design Decisions Matrix

| Decision Point | Options | Tradeoffs | Recommendation |
|---|---|---|---|
| Chunk Size | 200 / 500 / 1000 tokens | Small: precise, more chunks. Large: more context, less precise | 500 tokens for most use cases |
| Chunk Overlap | 0 / 50 / 100 tokens | More: better continuity, redundancy. Less: efficient, potential gaps | 50 tokens (10% overlap) |
| Top-K | 3 / 5 / 10 chunks | More: better recall, higher cost. Fewer: focused, may miss info | 5 chunks for most use cases |
| Embedding Model | OpenAI / Cohere / OSS | Proprietary: quality, cost. Open: control, no API cost | OpenAI for quality, OSS for volume |
| Vector DB | Pinecone / Weaviate / pgvector | Managed: easy, $. Self-hosted: control, ops overhead | Pinecone for <10M docs, pgvector for >10M |
| Reranking | Yes / No | Improves precision 10-15%, adds 100-200ms latency | Yes for high-stakes applications |
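
A minimal sketch of the recommended chunking settings (roughly 500-token chunks with about 10% overlap), approximating tokens with whitespace-separated words since exact counts depend on the tokenizer.

```python
# Minimal sliding-window chunker matching the recommendations above
# (~500-token chunks, ~10% overlap). Approximates tokens with whitespace words;
# a production pipeline would count tokens with the model's real tokenizer.
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back to preserve continuity across chunk boundaries
    return chunks

doc = "Refund policy. " * 2000  # illustrative long document
pieces = chunk_document(doc)
print(len(pieces), "chunks;", len(pieces[0].split()), "words in the first chunk")
```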

RAG Performance Benchmarks

| Metric | Without RAG | Basic RAG | Advanced RAG (reranking) | Improvement |
|---|---|---|---|---|
| Answer Accuracy | 65% (pure LLM) | 82% | 89% | 37% improvement |
| Hallucination Rate | 18% | 5% | 2% | 89% reduction |
| Source Attribution | N/A | 84% correct | 92% correct | Traceable answers |
| Latency | 800ms | 1.2s | 1.8s | Worth the tradeoff |
| Cost per Query | $0.02 | $0.05 | $0.08 | Justifiable for accuracy |

Case Study: Technical Support Knowledge Base

  • Company: SaaS company with 2,500 help articles
  • Baseline: Keyword search, 62% resolution rate
  • RAG Implementation:
    • Chunk size: 500 tokens, 50 overlap
    • Vector DB: Pinecone (500K vectors)
    • Top-K: 5, with reranking
    • LLM: GPT-4 Turbo
  • Results:
    • Answer accuracy: 87% (vs. 62% keyword)
    • First-contact resolution: 81% (vs. 65%)
    • Average handle time: 6.2 min (vs. 9.5 min)
    • Cost per query: $0.06
    • Monthly volume: 50K queries
    • Annual savings: $480K in support costs
    • ROI: 720% in year 1

Agentic Systems

Agents extend LLMs with tool use, planning, and iterative refinement.

Agent Architecture Patterns

```mermaid
graph TD
    Input[User Input] --> Agent[Agent Core<br/>LLM]
    Agent --> Planning[Planning Module]
    Planning --> Tools[Tool Selection]
    Tools --> Execute[Execute Tools]
    Execute --> Memory[Update Memory]
    Memory --> Reflect[Reflection]
    Reflect --> Decision{Task Complete?}
    Decision -->|No| Planning
    Decision -->|Yes| Output[Final Response]
    Tools --> Tool1[Web Search<br/>Google/Bing]
    Tools --> Tool2[Calculator<br/>Python REPL]
    Tools --> Tool3[Database Query<br/>SQL]
    Tools --> Tool4[API Calls<br/>REST/GraphQL]
    style Agent fill:#87CEEB
    style Tools fill:#90EE90
```

Agent Pattern Comparison

| Pattern | Complexity | Reliability | Cost/Task | Latency | Use Cases | Success Rate |
|---|---|---|---|---|---|---|
| ReAct | Medium | 75-85% | $0.05-$0.15 | 5-15s | Customer support, data analysis | 80% |
| Plan-and-Execute | Medium-High | 70-80% | $0.10-$0.25 | 10-30s | Travel booking, research | 75% |
| Reflexion | High | 80-90% | $0.15-$0.40 | 15-45s | Code debugging, complex problem-solving | 85% |
| Multi-Agent | Very High | 65-75% | $0.25-$0.60 | 30-120s | Software development, strategic planning | 70% |
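
A minimal sketch of the ReAct loop: the model alternates between requesting a tool call and observing its result until it emits a final answer. The tool registry, JSON protocol, step limit, and model name are assumptions; a production agent would add validation, retries, and sandboxed execution.

```python
# Minimal ReAct-style loop: the LLM either requests a tool (as JSON) or returns a final answer.
# Tool registry, prompt format, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; sandbox in production
    "search_orders": lambda order_id: json.dumps({"order_id": order_id, "status": "shipped"}),  # stub
}

SYSTEM = (
    "You solve tasks step by step. Reply ONLY with JSON: "
    '{"action": "<tool name or final>", "input": "<tool input or final answer>"}. '
    f"Available tools: {list(TOOLS)}."
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
        step = json.loads(reply.choices[0].message.content)
        if step["action"] == "final":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])  # execute the requested tool
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Step limit reached"

print(run_agent("What is the status of order 1138, and what is 0.1 * 89.99 as a refund?"))
```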

Agent Tool Ecosystem

```mermaid
graph LR
    A[Agent Core] --> B[Knowledge Tools]
    A --> C[Action Tools]
    A --> D[Analysis Tools]
    B --> B1[Search<br/>Web/Internal]
    B --> B2[RAG<br/>Documents]
    B --> B3[Memory<br/>Vector DB]
    C --> C1[Database<br/>CRUD Ops]
    C --> C2[APIs<br/>External Services]
    C --> C3[Email/Slack<br/>Communication]
    D --> D1[Calculator<br/>Math/Finance]
    D --> D2[Code Execution<br/>Python/SQL]
    D --> D3[Data Viz<br/>Charts/Graphs]
```

Case Study: Customer Service Agent

  • Company: E-commerce retailer, 500+ support agents
  • Agent Capabilities:
    1. Search order database (tool: search_orders)
    2. Calculate refunds (tool: calculator)
    3. Update CRM (tool: update_crm)
    4. Send emails (tool: send_email)
  • Implementation: ReAct pattern with GPT-4
  • Results:
    • Average handle time: 5.2 min (vs. 8.5 min manual)
    • Time savings: 39%
    • Error rate in calculations: 95% reduction (agent always accurate)
    • Agent satisfaction: 4.3/5 (tools reduce frustration)
    • Cost per interaction: $0.12
    • Annual savings: $890K across 500 agents
    • CSAT maintained at 4.2/5
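
A minimal sketch of how the case study's tools could be declared for a model that supports function calling; the tool names mirror the list above, but the parameter schemas are illustrative assumptions rather than the retailer's actual API.

```python
# Minimal tool declarations (OpenAI function-calling format) mirroring the case study's tools.
# Parameter schemas are illustrative assumptions, not the retailer's actual API.
TOOL_SPECS = [
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": "Look up an order by ID and return its status and line items.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a refund calculation, e.g. '89.99 * 0.10'.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
]

# Passed as `tools=TOOL_SPECS` to the chat completion call; the model then returns
# structured tool calls that the orchestration layer executes (update_crm and
# send_email would be declared the same way).
```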

When To Use What

Choosing the right AI approach is critical for success. Here's a comprehensive decision framework:

Decision Tree by Problem Type

```mermaid
flowchart TD
    Start[Business Problem] --> Q1{Problem Type}
    Q1 -->|Deterministic Logic| Rules[Rules/Heuristics]
    Q1 -->|Structured Prediction| Q2{Data Type?}
    Q1 -->|Content Generation| GenAI[Generative AI]
    Q1 -->|Sequential Decisions| Agents[Agentic Systems]
    Q2 -->|Tabular/Structured| ClassicalML[Classical ML]
    Q2 -->|Text| Q3{Labeled Data?}
    Q2 -->|Images| Q4{Volume?}
    Q3 -->|Yes: >10K| ClassicalML
    Q3 -->|No| GenAI
    Q4 -->|High: >100K| DeepLearning[Deep Learning]
    Q4 -->|Low| GenAI
    Rules --> R1[Tax calculations<br/>Access control<br/>Compliance checks]
    ClassicalML --> C1[Fraud detection<br/>Churn prediction<br/>Price optimization]
    GenAI --> G1[Summarization<br/>Q&A<br/>Content creation]
    Agents --> A1[Research tasks<br/>Multi-step workflows<br/>Tool orchestration]
    DeepLearning --> D1[Image classification<br/>Object detection<br/>OCR]
```

Approach Selection Matrix

| Approach | Best For | Data Requirements | Latency | Cost | Complexity |
|---|---|---|---|---|---|
| Rules/Heuristics | Deterministic logic, compliance | Minimal | <1ms | Very Low | Low |
| Classical ML | Structured prediction, tabular data | 1K-1M labeled examples | 1-100ms | Low | Medium |
| Deep Learning (CV) | Images, video, complex vision tasks | 10K-1M labeled images | 10-500ms | Medium | High |
| LLM (Prompting) | Unstructured text, reasoning | 0-10 examples | 100ms-5s | Medium | Low-Medium |
| RAG | Grounded generation, knowledge tasks | 100s-1000s documents | 200ms-10s | Medium | Medium |
| Fine-tuning | Specialized domains, specific formats | 1K-100K examples | 100ms-5s | Medium-High | High |
| RL/Agents | Sequential decisions, optimization | Simulation or feedback | Varies widely | High | Very High |

Cost-Performance-Latency Tradeoff

```mermaid
graph TD
    A[Choose 2 of 3] --> B[Low Cost]
    A --> C[High Performance]
    A --> D[Low Latency]
    B --> BC[Low Cost + High Performance<br/>= Higher Latency<br/>Example: Batch processing with large models]
    B --> BD[Low Cost + Low Latency<br/>= Lower Performance<br/>Example: Simple rules or small models]
    C --> CD[High Performance + Low Latency<br/>= High Cost<br/>Example: GPT-4 with optimized infrastructure]
    style A fill:#FFD700
```

Optimization Strategies

| Technique | Latency Impact | Cost Impact | Quality Impact | Best For |
|---|---|---|---|---|
| Caching | -50-90% | -50-90% | Neutral | Repeated queries |
| Batching | +50-200% | -30-50% | Neutral | High throughput, latency-tolerant |
| Model Distillation | -40-70% | -40-70% | -5-15% | Production deployment |
| Quantization | -20-50% | -20-50% | -1-5% | Edge deployment |
| Prompt Compression | -20-40% | -20-40% | -0-10% | Long context scenarios |
| Smaller Model | -50-80% | -50-80% | -10-30% | Simple tasks |
| Hybrid Routing | -30-60% | -40-70% | Neutral to +5% | Mixed complexity workload |
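
A minimal sketch combining two of these techniques, caching and hybrid routing: repeated requests are served from a cache, and only requests flagged as complex go to the larger model. The complexity heuristic and model names are illustrative placeholders.

```python
# Minimal caching + hybrid-routing sketch: cheap model for simple requests,
# larger model only for complex ones, with an in-memory cache for repeats.
# The length-based complexity heuristic and model names are placeholders.
from functools import lru_cache
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL, STRONG_MODEL = "gpt-4o-mini", "gpt-4o"  # placeholders

def is_complex(prompt: str) -> bool:
    # Stand-in heuristic; real routers use classifiers or confidence scores.
    return len(prompt.split()) > 300 or "analyze" in prompt.lower()

@lru_cache(maxsize=10_000)  # exact-match cache; production systems often cache semantically
def summarize(prompt: str) -> str:
    model = STRONG_MODEL if is_complex(prompt) else CHEAP_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize:\n{prompt}"}],
    )
    return resp.choices[0].message.content

text = "Quarterly revenue grew 12% while churn fell to 3%."
print(summarize(text))   # routed to the cheap model
print(summarize(text))   # served from cache on the repeat
```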

Case Study: Document Summarization Cost Optimization

  • Baseline: GPT-4 for all documents
    • Cost: $0.08/document
    • Monthly volume: 100K documents
    • Monthly cost: $8,000
  • Optimized Approach:
    1. Caching common documents (40% hit rate): Save $3,200
    2. Route simple docs to GPT-3.5 (25% of volume): Save $1,200
    3. Batching (10 at a time): Save $800
    4. Prompt optimization (-30% tokens): Save $600
  • Result:
    • New monthly cost: $2,200
    • Savings: 78% ($5,800/month, $69,600/year)
    • Quality maintained: 94% similarity to baseline
    • Latency impact: +15% (acceptable for async workflow)

Constraints & Tradeoffs

Every AI solution involves tradeoffs. Understanding these is crucial for setting realistic expectations.

Cost Structure Analysis

```mermaid
graph TD
    A[Total Cost of AI System] --> B[Development]
    A --> C[Inference]
    A --> D[Operations]
    B --> B1[Data labeling<br/>$10K-$500K]
    B --> B2[Experimentation<br/>$20K-$200K]
    B --> B3[Engineering time<br/>$100K-$1M]
    C --> C1[Compute per request<br/>$0.0001-$0.10]
    C --> C2[API costs<br/>$1K-$100K/month]
    C --> C3[Infrastructure<br/>$5K-$50K/month]
    D --> D1[Monitoring<br/>$1K-$10K/month]
    D --> D2[Retraining<br/>$5K-$50K/quarter]
    D --> D3[Ops team<br/>$200K-$800K/year]
```

Data Constraints

Data Quality Requirements by Approach

| ML Approach | Completeness | Missing Data Tolerance | Label Accuracy | Noise Tolerance |
|---|---|---|---|---|
| Classical ML | >90% | <10% with imputation | >95% | Low-Medium |
| Deep Learning | >80% | <20% (learns to ignore) | >90% | Medium-High |
| LLMs | Variable | High (handles missing context) | N/A (unsupervised) | High |
| Fine-tuning | >95% | <5% | >98% | Very Low |

Data Privacy & Consent Decision Tree

```mermaid
flowchart TD
    Start[Data Source] --> Q1{User Consent?}
    Q1 -->|No| Stop[Cannot Use]
    Q1 -->|Yes| Q2{Contains PII?}
    Q2 -->|Yes| Q3{Need PII?}
    Q3 -->|No| Redact[Redact/Anonymize]
    Q3 -->|Yes| Q4{Compliance Framework?}
    Q4 -->|GDPR/CCPA| Controls1[• Data minimization<br/>• Encryption at rest/transit<br/>• Access controls<br/>• Right to deletion]
    Q4 -->|HIPAA| Controls2[• BAA required<br/>• Audit logs<br/>• De-identification<br/>• Limited retention]
    Q4 -->|Other| Controls3[• Risk assessment<br/>• Legal review<br/>• Custom controls]
    Q2 -->|No| Q5{Sensitive Domain?}
    Q5 -->|Yes: Finance, Health| Assess[Risk Assessment Required]
    Q5 -->|No| Proceed[Proceed with Standard Governance]
    Redact --> Proceed
    Controls1 --> Proceed
    Controls2 --> Proceed
    Controls3 --> Proceed
    Assess --> Proceed
```

Safety & Security Threats

AI Threat Landscape

```mermaid
graph TD
    A[AI Security Threats] --> B[Input Attacks]
    A --> C[Model Attacks]
    A --> D[Output Attacks]
    A --> E[Data Attacks]
    B --> B1[Prompt Injection<br/>Override instructions]
    B --> B2[Adversarial Inputs<br/>Misclassification]
    C --> C1[Model Extraction<br/>IP theft]
    C --> C2[Model Inversion<br/>Training data recovery]
    D --> D1[Data Exfiltration<br/>Leak PII/secrets]
    D --> D2[Hallucination<br/>False information]
    E --> E1[Data Poisoning<br/>Corrupt training]
    E --> E2[Backdoors<br/>Trigger behaviors]
```

Defense Strategy Matrix

| Threat | Impact | Defense Mechanism | Implementation Cost | Effectiveness |
|---|---|---|---|---|
| Prompt Injection | High | Input sanitization, prompt guards, output validation | $ | 80-90% |
| Data Exfiltration | Critical | PII redaction, access controls, output filtering | $$ | 95-99% |
| Jailbreaking | Medium-High | System prompt hardening, red-teaming, content filters | $$ | 70-85% |
| Hallucination | Medium | RAG grounding, fact-checking, confidence scores | $$ | 60-80% |
| Model Extraction | Medium | Rate limiting, watermarking, API monitoring | $ | 75-90% |
| Adversarial Examples | Medium | Adversarial training, input validation, ensemble models | $$$ | 70-85% |
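
A minimal sketch of two of the cheaper defenses in the matrix, input sanitization against common prompt-injection phrases and regex-based PII redaction on outputs; the patterns are illustrative, and real deployments layer dedicated injection classifiers and NER-based redaction on top.

```python
# Minimal input-guard and PII-redaction sketch. Phrase list and regexes are
# illustrative; production systems add injection classifiers, NER-based redaction,
# and output validation on top of simple patterns like these.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now .* (unrestricted|jailbroken)",
    r"reveal (your|the) system prompt",
]

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def guard_input(user_message: str) -> str:
    lowered = user_message.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        raise ValueError("Possible prompt injection; route to human review.")
    return user_message

def redact_output(model_reply: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        model_reply = pattern.sub(f"[REDACTED {label.upper()}]", model_reply)
    return model_reply

print(redact_output("Contact jane.doe@example.com, card 4111 1111 1111 1111."))
```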

Case Study: Financial Services Chatbot Security

  • Initial State: Basic chatbot, no specialized security
  • Security Assessment Findings:
    • Vulnerable to prompt injection (15/20 test cases succeeded)
    • PII leakage in 8% of responses
    • No jailbreak protections
  • Implemented Defenses:
    1. Multi-layer input validation: $20K
    2. PII redaction (input & output): $30K
    3. Red-team testing & hardening: $40K
    4. Ongoing monitoring: $15K/year
  • Results After 6 Months:
    • Prompt injection success rate: <2% (93% improvement)
    • PII leakage: 0 incidents
    • Zero security breaches
    • Compliance audit: 100% pass
    • Avoided: Estimated $5M+ in potential breach costs
    • ROI: Incalculable (risk mitigation)

Hosting Options

Choosing where and how to host AI models significantly impacts cost, control, and capabilities.

Hosting Strategy Decision Tree

```mermaid
flowchart TD
    Start[Hosting Decision] --> Q1{Volume?}
    Q1 -->|Low: <100K/month| Q2{Data Sensitivity?}
    Q1 -->|Medium: 100K-1M/month| Q3{Cost Optimization Priority?}
    Q1 -->|High: >1M/month| SelfHost[Self-Hosted]
    Q2 -->|High: PII, proprietary| SelfHost
    Q2 -->|Medium| API[Managed API]
    Q3 -->|High| Q4{Technical Expertise?}
    Q3 -->|Medium| Hybrid[Hybrid Approach]
    Q4 -->|Yes| SelfHost
    Q4 -->|No| Hybrid
    API --> A1[OpenAI, Anthropic<br/>Google, Cohere]
    SelfHost --> S1[Llama, Mistral<br/>On AWS/GCP/Azure]
    Hybrid --> H1[API for complex<br/>Self-hosted for simple/sensitive]
```

Hosting Comparison Matrix

| Factor | Managed APIs | Self-Hosted OSS | Hybrid |
|---|---|---|---|
| Time to Production | Days | Weeks-Months | Weeks |
| Upfront Cost | $0 | $40K-$200K (hardware) or $10K-$50K/month (cloud) | $5K-$30K |
| Per-Request Cost | $0.001-$0.10 | $0.0001-$0.001 (amortized) | $0.0005-$0.02 (optimized routing) |
| Control | Low (vendor-dependent) | High (full control) | Medium (selective) |
| Compliance | Vendor-dependent (may limit use cases) | Full control (meet any requirement) | Flexible (route by requirement) |
| Scalability | Automatic (vendor handles) | Manual (requires planning) | Mixed (auto + manual) |
| Latest Models | Immediate access | Delayed (3-6 months) | Best of both |
| Customization | Limited (API parameters only) | Full (modify anything) | Selective (fine-tune what matters) |
| Operational Overhead | Minimal | High (DevOps, MLOps teams) | Medium |

Break-Even Analysis

Scenario: Customer support chatbot

| Volume (req/month) | Managed API Cost | Self-Hosted Cost | Break-Even Point |
|---|---|---|---|
| 10K | $500 | $15,000 (not worth it) | N/A |
| 100K | $5,000 | $15,000 | ~300K requests/month |
| 1M | $50,000 | $20,000 | Self-hosted wins |
| 10M | $500,000 | $35,000 | 14x cheaper self-hosted |

Cost Breakdown (Self-Hosted at 1M req/month):

  • Infrastructure (4x A100 GPUs on AWS): $23,600/month
  • Engineering (2 FTE MLOps): $30,000/month
  • Total: $53,600/month
  • Amortized per request: $0.05
  • vs. API cost: $0.50/request
  • Savings: 90% ($447K/month)
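
A minimal sketch of the break-even arithmetic behind the table above: a fixed monthly self-hosting cost compared against a per-request API price. The input figures are the scenario's illustrative numbers, not vendor quotes.

```python
# Minimal break-even sketch: fixed monthly self-hosting cost vs. per-request API pricing.
# Figures are the scenario's illustrative numbers, not vendor quotes.
def monthly_cost_api(requests: int, price_per_request: float) -> float:
    return requests * price_per_request

def monthly_cost_self_hosted(requests: int, fixed_cost: float, marginal_cost: float = 0.0) -> float:
    return fixed_cost + requests * marginal_cost

def break_even_requests(price_per_request: float, fixed_cost: float, marginal_cost: float = 0.0) -> float:
    # Self-hosting pays off once the fixed cost is spread over enough requests.
    return fixed_cost / (price_per_request - marginal_cost)

API_PRICE = 0.05            # $/request, from the scenario's 100K row ($5,000 / 100K)
SELF_HOSTED_FIXED = 15_000  # $/month at moderate volume

for volume in (10_000, 100_000, 1_000_000):
    api = monthly_cost_api(volume, API_PRICE)
    self_hosted = monthly_cost_self_hosted(volume, SELF_HOSTED_FIXED)
    print(f"{volume:>9,} req/mo  API ${api:>9,.0f}  self-hosted ${self_hosted:>9,.0f}")

print(f"Break-even ~ {break_even_requests(API_PRICE, SELF_HOSTED_FIXED):,.0f} requests/month")
```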

Case Study: Healthcare AI Platform

  • Company: Hospital network, patient intake chatbot
  • Requirements:
    • HIPAA compliance (cannot send PHI to third party)
    • 500K conversations/month
    • <2s latency
  • Decision: Self-hosted Llama 3.1 70B on-premises
  • Investment:
    • Hardware: $160K (8x A100 GPUs)
    • Setup: $80K (engineering)
    • Annual ops: $240K (2 FTE)
  • Economics:
    • Year 1 total: $480K
    • Year 2-3: $240K/year
    • vs. API (if allowed): $3M/year
    • 3-Year Savings: $8.5M
    • ROI: 1,771%
  • Additional Benefits:
    • Full HIPAA compliance
    • Custom fine-tuning on medical data
    • No rate limits
    • Data never leaves premises

Evaluation Essentials

Rigorous evaluation is critical for AI success. Different paradigms require different evaluation approaches.

Classical ML Evaluation Framework

Classification Metrics Decision Tree

```mermaid
flowchart TD
    Start[Classification Problem] --> Q1{Class Balance?}
    Q1 -->|Balanced| Accuracy[Accuracy]
    Q1 -->|Imbalanced| Q2{Cost of Errors?}
    Q2 -->|FP more costly| Precision[Precision]
    Q2 -->|FN more costly| Recall[Recall]
    Q2 -->|Both matter equally| F1[F1 Score]
    Q2 -->|Need full picture| AUCROC[AUC-ROC]
    Q1 -->|Very Imbalanced: <5%| AUCPR[AUC-PR]
```

Metric Selection Matrix

| Use Case | Primary Metric | Why | Threshold |
|---|---|---|---|
| Fraud Detection | Precision + Recall | Both FP (false accusation) and FN (missed fraud) costly | Precision >90%, Recall >80% |
| Spam Filter | Precision | FP (blocking good email) very costly | Precision >95% |
| Medical Diagnosis | Recall | FN (missed disease) potentially fatal | Recall >95% |
| Churn Prediction | AUC-ROC | Need ranked list for targeting | AUC >0.75 |
| Click Prediction | AUC-PR | Very imbalanced (CTR ~1%) | AUC-PR >0.3 |
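
A minimal sketch computing the metrics referenced in this table with scikit-learn, using toy labels and scores for an imbalanced problem; the decision threshold is illustrative and should come from business costs as the table suggests.

```python
# Minimal metric-computation sketch (scikit-learn) for an imbalanced classifier.
# Labels, scores, and the 0.5 threshold are toy values for illustration.
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)

y_true   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]                     # 30% positives (illustrative)
y_scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.6, 0.8, 0.9]
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]            # decision threshold of 0.5

print("Precision:", precision_score(y_true, y_pred))           # how many flagged cases were real
print("Recall:   ", recall_score(y_true, y_pred))              # how many real cases were caught
print("F1:       ", round(f1_score(y_true, y_pred), 3))
print("AUC-ROC:  ", round(roc_auc_score(y_true, y_scores), 3)) # ranking quality across thresholds
print("AUC-PR:   ", round(average_precision_score(y_true, y_scores), 3))  # better under heavy imbalance
```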

Generative AI Evaluation

Multi-Dimensional Evaluation Framework

```mermaid
graph TD
    A[LLM Evaluation] --> B[Factuality]
    A --> C[Relevance]
    A --> D[Coherence]
    A --> E[Safety]
    A --> F[Task-Specific]
    B --> B1[Grounding in context<br/>Metrics: Exact match, ROUGE]
    B --> B2[Hallucination detection<br/>Metrics: Faithfulness score]
    C --> C1[Answers the question<br/>Metrics: Semantic similarity]
    C --> C2[Appropriate scope<br/>Metrics: Length, coverage]
    D --> D1[Logical flow<br/>Metrics: Coherence score]
    D --> D2[Consistency<br/>Metrics: Self-BLEU]
    E --> E1[No toxicity<br/>Metrics: Perspective API]
    E --> E2[No PII leakage<br/>Metrics: Regex + NER]
    E --> E3[No jailbreaks<br/>Metrics: Red-team pass rate]
    F --> F1[Format adherence<br/>Metrics: Schema validation]
    F --> F2[Domain accuracy<br/>Metrics: Expert review]
```

Evaluation Approach Comparison

| Approach | Speed | Cost | Scalability | Reliability | Best For |
|---|---|---|---|---|---|
| Automated Metrics (ROUGE, BLEU) | Fast | Low | High | Moderate (correlation with quality varies) | Large-scale, continuous |
| LLM-as-Judge | Medium | Medium | High | Good (80-90% agreement with humans) | Scalable quality assessment |
| Human Evaluation | Slow | High | Low | Highest (gold standard) | High-stakes, final validation |
| Hybrid (Auto + Sample Human) | Medium | Medium | Medium-High | High | Production systems |
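
A minimal sketch of the LLM-as-Judge approach: a second model grades each answer against the retrieved context on a 1-5 factuality/relevance rubric and returns JSON. The rubric wording, JSON schema, and judge model name are illustrative assumptions.

```python
# Minimal LLM-as-Judge sketch: grade an answer against its source context on a 1-5 rubric.
# Rubric wording, JSON schema, and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a support answer.
Context: {context}
Question: {question}
Answer: {answer}
Score 1-5 for factuality (grounded in context) and relevance (addresses the question).
Reply only with JSON: {{"factuality": <int>, "relevance": <int>, "reason": "<one sentence>"}}"""

def judge(context: str, question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

scores = judge(
    context="Refunds are issued within 14 days of a return being received.",
    question="How long do refunds take?",
    answer="Refunds usually arrive within two weeks of the return reaching our warehouse.",
)
print(scores)  # e.g. {"factuality": 5, "relevance": 5, "reason": "..."}
```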

Case Study: Customer Support QA System Evaluation

  • System: RAG-based Q&A for 500 agents
  • Evaluation Strategy: Hybrid approach
    1. Automated (100% of responses):
      • Latency: <2s (SLA)
      • Safety checks: 0 PII leakage
      • Cost: <$0.10/query
    2. LLM-as-Judge (10% sample, daily):
      • Relevance: >85%
      • Factuality: >90%
      • Coherence: >90%
    3. Human Review (1% sample, weekly):
      • Overall quality: >4.0/5
      • Agent trust: >3.8/5
      • Actionable: >85%
  • Continuous Monitoring:
    • Daily: Automated metrics
    • Weekly: LLM-as-judge trends
    • Monthly: Human evaluation deep-dive
  • Feedback Loop:
    • Quality dips trigger investigation
    • Human feedback used to improve prompts
    • New edge cases added to test set
  • Result: Maintained 89% accuracy over 12 months with continuous improvement

Summary

The AI landscape offers diverse approaches for different problems:

Technology Selection Framework

```mermaid
graph TD
    A[Business Problem] --> B{Problem Characteristics}
    B -->|Deterministic, rules-based| Rules[Rules/Heuristics<br/>$, <1ms, Low complexity]
    B -->|Structured data, prediction| Classical[Classical ML<br/>$$, 1-100ms, Medium complexity]
    B -->|Unstructured text, generation| GenAI[Generative AI<br/>$$$, 100ms-5s, Medium complexity]
    B -->|Multi-step, tool use| Agents[Agentic Systems<br/>$$$$, 5-60s, High complexity]
    Classical --> C1[XGBoost, Random Forest<br/>85-95% accuracy<br/>Best for tabular]
    GenAI --> G1[LLMs + RAG<br/>80-90% accuracy<br/>Best for knowledge work]
    Agents --> A1[ReAct, Multi-Agent<br/>70-85% task success<br/>Best for workflows]
    style Rules fill:#90EE90
    style Classical fill:#87CEEB
    style GenAI fill:#FFD700
    style Agents fill:#FFA500
```

Key Takeaways

  1. Match approach to problem: Not every problem needs the latest LLM

    • Structured prediction → Classical ML
    • Knowledge work → LLMs + RAG
    • Multi-step workflows → Agents
  2. Understand tradeoffs: Optimize for 2 of 3 (cost, latency, quality)

    • High volume + latency-tolerant → Optimize for cost
    • Real-time + quality → Accept higher cost
    • Low budget + quality → Accept higher latency
  3. Rigorous evaluation: Appropriate metrics for each paradigm

    • Classical ML: Accuracy, precision, recall, AUC
    • Generative AI: Factuality, relevance, safety
    • Agents: Task success rate, efficiency
  4. Safety first: Governance and controls embedded from the start

    • Defense in depth (input validation, output filtering, monitoring)
    • Privacy by design (PII redaction, access controls)
    • Continuous monitoring (automated + human review)
  5. Economics matter: Consider total cost of ownership

    • Development + Inference + Operations
    • Break-even analysis for hosting decisions
    • ROI measurement across full lifecycle

Success Formula:

  • Right tool for the job → 3-5x better ROI
  • Early validation → 60% cost reduction (fail fast)
  • Continuous optimization → 20-40% ongoing improvement
  • Safety & compliance → $2M-$20M in avoided fines

The next chapter explores ethical considerations and professional conduct in AI consulting.