Part 4: Generative AI & LLM Consulting
Chapter 22 — Retrieval-Augmented Generation (RAG)
Overview
Ground LLM outputs in authoritative corpora to reduce hallucination and improve freshness.
Retrieval-Augmented Generation (RAG) pairs the reasoning ability of LLMs with the factual precision of information retrieval. Instead of relying solely on training data, a RAG system fetches relevant information from a knowledge base at query time, enabling up-to-date, factually grounded responses that typically cut hallucinations by 60-80% compared with an unaugmented model.
Why RAG Matters
Business Impact:
- Accuracy: 60-80% reduction in hallucinations vs raw LLM
- Freshness: Access to real-time data beyond model cutoff
- Cost: 10x cheaper than fine-tuning for knowledge updates
- Traceability: Full source attribution and citations
- Privacy: Keep sensitive data out of model training
RAG Architecture
graph TB subgraph "Offline: Indexing Pipeline" A[Raw Documents] --> B[Document Processing<br/>PDF, HTML, Markdown] B --> C[Chunking Strategy<br/>512-1024 tokens] C --> D[Metadata Enrichment<br/>Source, Time, Tags] D --> E[Embedding Generation<br/>text-embedding-3] E --> F[Vector Database<br/>Pinecone, Qdrant] end subgraph "Online: Retrieval Pipeline" G[User Query] --> H[Query Enhancement<br/>Rewrite, Expand] H --> I[Hybrid Search<br/>Vector + Keyword] I --> J[Reranking<br/>Top 5 from 20] J --> K[Context Assembly<br/>Citations + Budget] F -.-> I K --> L[LLM Generation<br/>Grounded Answer] L --> M[Response + Sources] end
RAG Decision Framework
```mermaid
graph TB
    A[Need Updated Info?] --> B{Data Characteristics}
    B -->|Static, Infrequent| C[Fine-Tuning]
    B -->|Dynamic, Frequent| D[RAG]
    B -->|Mixed| E[Hybrid: RAG + FT]
    D --> F{Scale?}
    F -->|<10K docs| G[Simple RAG<br/>ChromaDB + GPT-4]
    F -->|10K-1M docs| H[Production RAG<br/>Pinecone + Reranking]
    F -->|>1M docs| I[Advanced RAG<br/>Hybrid Search + Caching]
    G --> J[Implementation]
    H --> J
    I --> J
```
Chunking Strategy Comparison
| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed Size | General purpose | 512-1024 tokens | 10-20% | Simple, predictable | May split semantic units |
| Sentence-Based | Q&A systems | Variable (3-5 sentences) | 1 sentence | Preserves meaning | Variable chunk sizes |
| Paragraph-Based | Documents | Variable | None | Natural boundaries | Highly variable sizes |
| Semantic | Technical docs | Variable | Context-aware | Intelligent boundaries | Computationally expensive |
| Sliding Window | Dense retrieval | 256-512 tokens | 50% | Maximum coverage | Redundancy, higher cost |
| Hierarchical | Long documents | Multi-level | Nested | Preserves structure | Complex implementation |
Quick Implementation:
```python
# Essential chunking (10 lines)
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list:
    """Simple fixed-size chunking with overlap."""
    # Whitespace split is a rough proxy for model tokens; swap in a tokenizer if needed.
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunk = ' '.join(tokens[i:i + size])
        chunks.append(chunk)
    return chunks
```
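For example, the 50%-overlap sliding-window strategy from the table can be approximated by raising the overlap (the file path below is illustrative):

```python
# Hypothetical usage: size=512 with overlap=256 approximates the sliding-window row above.
text = open("docs/manual.md").read()
chunks = chunk_text(text, size=512, overlap=256)
print(len(chunks), "chunks")
```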
Embedding Model Comparison
| Model | Dimensions | Max Tokens | Cost per 1M Tokens | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02 | Cost-optimized general use |
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13 | Maximum performance |
| Cohere embed-v3 | 1024 | 512 | $0.10 | Multilingual support |
| Sentence-Transformers (OSS) | 384-768 | 512 | Free | Privacy-sensitive, self-hosted |
| BGE-Large (OSS) | 1024 | 512 | Free | SOTA open source |
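For the self-hosted rows, a minimal sketch with the sentence-transformers library; the BGE-Large checkpoint name is one example matching the table, not a fixed recommendation:

```python
# Self-hosted embedding sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer

# "BAAI/bge-large-en-v1.5" corresponds to the BGE-Large (1024-dim) row above.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

texts = ["How do I rotate an API key?", "API keys can be rotated in Settings."]
# normalize_embeddings=True makes cosine similarity a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)
```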
Vector Database Selection
```mermaid
graph TB
    A[Choose Vector DB] --> B{Requirements}
    B -->|Ease of Use| C[Pinecone<br/>Managed, Scalable]
    B -->|Full Control| D[Qdrant/Milvus<br/>Self-Hosted]
    B -->|Prototyping| E[ChromaDB<br/>Embedded, Simple]
    C --> F{Scale?}
    D --> F
    E --> F
    F -->|<100K vectors| G[Any Option Works]
    F -->|100K-10M| H[Pinecone or Qdrant]
    F -->|>10M| I[Milvus or Weaviate]
```
Database Feature Comparison
| Database | Features | Scale | Best For |
|---|---|---|---|
| Pinecone | Managed, high performance, serverless | Billions | Production, ease of use |
| Weaviate | GraphQL, hybrid search, modules | Millions | Complex schemas, AI-native |
| Qdrant | Advanced filtering, high performance | Millions | Metadata-rich retrieval |
| Milvus | Distributed, open source | Billions | Large scale, self-hosted |
| ChromaDB | Embedded, simple API | Thousands | Prototyping, local dev |
| FAISS | In-memory, fast | Millions | Research, custom needs |
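For the prototyping path, a minimal sketch with ChromaDB's embedded client; it relies on Chroma's default embedding function, and the document texts and metadata are illustrative:

```python
# Prototyping sketch: embedded ChromaDB with its default embedding function.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.get_or_create_collection(name="docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Rotate API keys in Settings > Security.",
               "Rate limits reset every 60 seconds."],
    metadatas=[{"category": "security"}, {"category": "limits"}],
)

results = collection.query(query_texts=["How do I rotate a key?"], n_results=2)
print(results["documents"][0])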
Retrieval Strategies
1. Naive RAG vs Advanced RAG
| Aspect | Naive RAG | Advanced RAG |
|---|---|---|
| Query Processing | Direct embedding | Rewriting, expansion, decomposition |
| Search Method | Vector similarity only | Hybrid (vector + keyword) |
| Reranking | None | Cross-encoder or LLM reranking |
| Context | Top-K chunks | Deduplicated, citation-tracked |
| Accuracy | 70-75% | 85-95% |
| Cost | Low | Medium |
2. Hybrid Search Architecture
```mermaid
graph LR
    A[User Query] --> B[Vector Search<br/>Semantic Similarity]
    A --> C[Keyword Search<br/>BM25]
    B --> D[Top 20 Results]
    C --> D
    D --> E[Fusion Ranking<br/>RRF or Weighted]
    E --> F[Reranker<br/>Cross-Encoder]
    F --> G[Top 5 Final Results]
```
Why Hybrid Search:
- Vector Search: Captures semantic meaning ("car" ↔ "automobile")
- Keyword Search: Catches exact matches ("SKU-12345", names, IDs)
- Combined: 15-30% better recall than either alone
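Reciprocal Rank Fusion (RRF), one of the fusion options named in the diagram, is simple to sketch. The snippet below assumes each backend returns an ordered list of document IDs and uses the conventional k = 60 constant:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: fuse vector-search and BM25 result lists (IDs are illustrative).
fused = rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]], top_n=5)
```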
Query Enhancement Techniques
| Technique | Purpose | Example | Impact |
|---|---|---|---|
| Query Rewriting | Clarify ambiguous queries | "It" → "The product mentioned" | +20% relevance |
| Query Expansion | Add synonyms, related terms | "car" → "car automobile vehicle" | +15% recall |
| Query Decomposition | Break complex into simple | "A and B" → ["A", "B"] | +25% for complex queries |
| Hypothetical Document | Generate ideal answer first | Create embedding from ideal answer | +10-15% precision |
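As one concrete example, query rewriting can be a single LLM call. A minimal sketch, assuming the OpenAI client used elsewhere in this chapter; the prompt wording is illustrative:

```python
# Query rewriting sketch: resolve pronouns and ambiguity using conversation history.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str, history: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the question so it is self-contained, resolving pronouns "
                f"using this conversation history:\n{history}\n\nQuestion: {question}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```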
Reranking for Accuracy
```mermaid
graph TB
    A[Initial Retrieval<br/>20 Candidates] --> B{Reranking Method}
    B -->|Fast| C[Cross-Encoder<br/>ms-marco-MiniLM<br/>2x improvement]
    B -->|Highest Quality| D[LLM Reranking<br/>GPT-4<br/>3x improvement]
    B -->|Balanced| E[Cohere Rerank<br/>1.5x improvement]
    C --> F[Top 5 Results]
    D --> F
    E --> F
    F --> G[Context Assembly]
```
Reranking Performance
| Method | Latency | Accuracy Gain | Cost | Best For |
|---|---|---|---|---|
| None | 0ms | Baseline | $0 | Budget-constrained |
| Cross-Encoder | 50-100ms | +40-60% | Low | Production default |
| Cohere Rerank | 100-200ms | +30-50% | Medium | Easy integration |
| LLM Reranking | 500-1000ms | +50-70% | High | Critical accuracy |
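A minimal sketch of the cross-encoder option, assuming the sentence-transformers library, the ms-marco-MiniLM checkpoint named in the diagram, and chunks represented as dicts carrying a `text` field:

```python
# Cross-encoder reranking sketch with sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    # Score each (query, chunk text) pair, then keep the highest-scoring chunks.
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```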
Essential RAG Implementation
```python
# Minimal production RAG (20 lines)
from openai import OpenAI

client = OpenAI()

def rag_query(question: str, vector_db) -> dict:
    """Complete RAG pipeline: embed the query, retrieve, generate a grounded answer."""
    # 1. Retrieve relevant chunks (vector_db is assumed to wrap your vector store
    #    and return dicts with a 'text' field)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    chunks = vector_db.search(query_embedding, top_k=5)
    context = "\n\n".join([f"[{i+1}] {c['text']}" for i, c in enumerate(chunks)])

    # 2. Generate grounded answer with numbered citations
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Answer based on context. Include citations [1], [2], etc.\n\nContext:\n{context}\n\nQuestion: {question}"
        }]
    )
    return {"answer": response.choices[0].message.content, "sources": chunks}
```
Case Study: Technical Documentation Q&A
Challenge: Software company with 5,000+ docs, 200 support engineers, 92% of questions repeated.
Initial State (GPT-4 Zero-Shot):
- Accuracy: 68%
- Hallucination rate: 32%
- Time per query: 15 minutes (manual search + verify)
- Cost: $200/day in engineer time
Solution: Production RAG System
```mermaid
graph TB
    A[5,000 Docs] --> B[Chunk + Index<br/>25,000 chunks]
    B --> C[Pinecone Vector DB<br/>text-embedding-3-small]
    D[Support Query] --> E[Hybrid Search<br/>Vector + Keyword]
    E --> F[Cross-Encoder Rerank<br/>Top 5 from 20]
    F --> G[GPT-4 Generation<br/>With Citations]
    C -.-> E
    G --> H[Verified Answer<br/>+ Source Links]
```
Implementation Details:
- Chunking: 512 tokens with 50 token overlap
- Metadata: Version number, doc category, last updated
- Reranking: Cross-encoder for top 5
- Caching: 70% of queries hit cache (24-hour TTL)
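A minimal sketch of the 24-hour query cache, reusing the rag_query function from the earlier snippet; production systems would typically use Redis rather than process memory:

```python
# Simple in-process query cache with a 24-hour TTL (sketch only).
import hashlib
import time

CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 24 * 60 * 60

def cached_rag_query(question: str, vector_db) -> dict:
    # Key on the normalized question so trivial variations still hit the cache.
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]
    result = rag_query(question, vector_db)  # rag_query defined earlier
    CACHE[key] = (time.time(), result)
    return result
```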
Results After 3 Months:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Accuracy | 68% | 94% | +26 pts |
| Hallucination Rate | 32% | 2% | -94% |
| Time per Query | 15 min | 30 sec | -97% |
| Engineer Hours Saved | 0 | 120 hrs/week | Massive |
| Cost per Query | ~$10 | $0.15 | -98.5% |
| User Satisfaction | 3.2/5 | 4.7/5 | +47% |
ROI Calculation:
- Setup Cost: $15K (2 weeks eng time + infrastructure)
- Monthly Savings: $80K in engineer time minus $3K in API costs = $77K net
- Payback Period: 6 days
- Annual ROI: 6,200%
Key Learnings:
- Hybrid search crucial: Caught exact matches (version numbers, error codes)
- Reranking worth the cost: Improved accuracy by 18% for $0.05/query
- Metadata filtering: Version-specific retrieval reduced confusion
- Citation requirement: Built trust, reduced manual verification
- Caching high-value: 70% cache hit rate saved $1.8K/month
Evaluation Metrics
| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Retrieval Precision@K | Relevant docs in top K | >0.8 | Manual labeling |
| Retrieval Recall@K | Coverage of relevant docs | >0.9 | Golden Q&A dataset |
| MRR (Mean Reciprocal Rank) | Rank of first relevant doc | >0.7 | Position of correct answer |
| Answer Accuracy | Correctness of final answer | >0.85 | LLM-as-judge + human |
| Faithfulness | Grounding in retrieved context | >0.95 | Entailment checking |
| Citation Accuracy | Correct source attribution | 100% | Automated verification |
| Latency P95 | Response time | <2s | Production monitoring |
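Minimal sketches of the retrieval metrics above, assuming a golden dataset that maps each query to its set of relevant document IDs:

```python
# Retrieval metric sketches over a golden set of relevant document IDs.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(top), 1)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document; 0 if none retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```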
Common Pitfalls & Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Too small chunks | Loss of context | Use 512-1024 tokens with overlap |
| No reranking | Poor top-K accuracy | Add cross-encoder reranking |
| Ignoring metadata | Can't filter by date/version | Enrich chunks with metadata |
| No query enhancement | Ambiguous queries fail | Rewrite with conversation context |
| Single retrieval method | Misses exact matches | Use hybrid search (vector + keyword) |
| No citation tracking | Can't verify sources | Track chunk IDs, return sources |
Implementation Checklist
Week 1-2: Data Preparation
- Collect and normalize documents
- Choose chunking strategy (recommend 512 tokens + overlap)
- Enrich chunks with metadata (source, date, category)
- Select embedding model (recommend text-embedding-3-small)
- Generate and store embeddings
Week 2-3: Indexing
- Select vector database (Pinecone for production, ChromaDB for dev)
- Configure hybrid search (if applicable)
- Set up metadata indexing for filtering
- Implement incremental update pipeline
- Test index performance (<100ms retrieval)
Week 3-4: Retrieval Pipeline
- Implement query enhancement (rewriting, expansion)
- Set up hybrid search (vector + BM25)
- Add reranking (cross-encoder recommended)
- Configure context assembly with token budgets
- Add citation tracking
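For the context assembly and citation items above, a minimal sketch assuming tiktoken for token counting and rank-ordered chunk dicts with a `text` field; the 3,000-token budget is illustrative:

```python
# Context assembly under a token budget, recording citation indices as it goes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context(chunks: list[dict], budget: int = 3000) -> str:
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):  # chunks assumed rank-ordered
        cost = len(enc.encode(chunk["text"]))
        if used + cost > budget:
            break
        parts.append(f"[{i}] {chunk['text']}")  # [i] doubles as the citation index
        used += cost
    return "\n\n".join(parts)
```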
Week 4-5: Generation & Evaluation
- Design grounded generation prompts
- Implement citation formatting
- Build evaluation dataset (100+ Q&A pairs)
- Measure accuracy, faithfulness, latency
- Set up continuous evaluation (10% sample)
Week 5-6: Production Deployment
- Add caching layer (70%+ hit rate target)
- Implement monitoring (latency, accuracy, cost)
- Set up alerts for quality degradation
- Create feedback loop for improvement
- Document usage guidelines
Advanced RAG Patterns
Multi-Step Retrieval
```mermaid
graph TB
    A[Complex Query] --> B[Query Decomposition<br/>3 Sub-Queries]
    B --> C1[Retrieve for Q1]
    B --> C2[Retrieve for Q2]
    B --> C3[Retrieve for Q3]
    C1 --> D[Deduplicate<br/>Merge Contexts]
    C2 --> D
    C3 --> D
    D --> E[Synthesize Answer<br/>From Combined Context]
```
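A minimal sketch of the pattern, reusing the OpenAI client and vector_db interface from the earlier snippet; the `decompose` helper (e.g., an LLM prompt that returns sub-queries) and the `id` field on chunks are assumptions:

```python
# Multi-step retrieval sketch: decompose, retrieve per sub-query, deduplicate, merge.
def multi_step_retrieve(question: str, vector_db, decompose) -> list[dict]:
    seen, merged = set(), []
    for sub_query in decompose(question):          # decompose() is a hypothetical helper
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=sub_query
        ).data[0].embedding
        for chunk in vector_db.search(embedding, top_k=5):
            if chunk["id"] not in seen:            # deduplicate across sub-queries
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged
```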
Self-Reflective RAG
```mermaid
graph TB
    A[Query] --> B[Initial Retrieval]
    B --> C[Generate Answer]
    C --> D{Verify Faithfulness}
    D -->|Low Confidence| E[Retrieve More Context<br/>With Refinement]
    D -->|High Confidence| F[Return Answer]
    E --> C
```
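A minimal sketch of the loop, reusing rag_query and the OpenAI client from earlier; the judge prompt, YES/NO protocol, and refinement step are illustrative:

```python
# Self-reflective RAG sketch: generate, judge faithfulness, refine and retry if needed.
def reflective_rag(question: str, vector_db, max_rounds: int = 2) -> dict:
    query = question
    for _ in range(max_rounds):
        result = rag_query(query, vector_db)  # rag_query from the earlier sketch
        sources = "\n".join(c["text"] for c in result["sources"])
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": (
                "Is every claim in the ANSWER supported by the SOURCES? Reply YES or NO."
                f"\n\nANSWER:\n{result['answer']}\n\nSOURCES:\n{sources}")}],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("YES"):
            break
        # Refine the query before the next retrieval round (illustrative refinement).
        query = f"{question} (answer strictly from documentation; include specifics)"
    return result
```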
Why It Matters
When to Use RAG:
- ✅ Knowledge changes frequently (docs, policies, products)
- ✅ Need source citations and provenance
- ✅ Privacy concerns prevent fine-tuning on data
- ✅ Want to update knowledge without retraining
- ✅ Have structured knowledge base (docs, wikis, databases)
When NOT to Use RAG:
- ❌ Knowledge is static and fits in context window
- ❌ Need model to internalize reasoning patterns (use fine-tuning)
- ❌ No source documents available (pure generation task)
- ❌ Retrieval latency unacceptable (<100ms required)