Part 4: Generative AI & LLM Consulting
Chapter 22 — Retrieval-Augmented Generation (RAG)
Overview
Ground LLM outputs in authoritative corpora to reduce hallucination and improve freshness.
Retrieval-Augmented Generation (RAG) pairs the reasoning ability of LLMs with the factual precision of information retrieval. Instead of relying solely on training data, a RAG system fetches relevant information from a knowledge base at query time, enabling up-to-date, factually grounded responses that typically cut hallucinations by 60-80% compared with an unaugmented model.
Why RAG Matters
Business Impact:
- Accuracy: 60-80% reduction in hallucinations vs raw LLM
- Freshness: Access to real-time data beyond model cutoff
- Cost: 10x cheaper than fine-tuning for knowledge updates
- Traceability: Full source attribution and citations
- Privacy: Keep sensitive data out of model training
RAG Architecture
graph TB subgraph "Offline: Indexing Pipeline" A[Raw Documents] --> B[Document Processing<br/>PDF, HTML, Markdown] B --> C[Chunking Strategy<br/>512-1024 tokens] C --> D[Metadata Enrichment<br/>Source, Time, Tags] D --> E[Embedding Generation<br/>text-embedding-3] E --> F[Vector Database<br/>Pinecone, Qdrant] end subgraph "Online: Retrieval Pipeline" G[User Query] --> H[Query Enhancement<br/>Rewrite, Expand] H --> I[Hybrid Search<br/>Vector + Keyword] I --> J[Reranking<br/>Top 5 from 20] J --> K[Context Assembly<br/>Citations + Budget] F -.-> I K --> L[LLM Generation<br/>Grounded Answer] L --> M[Response + Sources] end
RAG Decision Framework
```mermaid
graph TB
    A[Need Updated Info?] --> B{Data Characteristics}
    B -->|Static, Infrequent| C[Fine-Tuning]
    B -->|Dynamic, Frequent| D[RAG]
    B -->|Mixed| E[Hybrid: RAG + FT]
    D --> F{Scale?}
    F -->|<10K docs| G[Simple RAG<br/>ChromaDB + GPT-4]
    F -->|10K-1M docs| H[Production RAG<br/>Pinecone + Reranking]
    F -->|>1M docs| I[Advanced RAG<br/>Hybrid Search + Caching]
    G --> J[Implementation]
    H --> J
    I --> J
```
Chunking Strategy Comparison
| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed Size | General purpose | 512-1024 tokens | 10-20% | Simple, predictable | May split semantic units |
| Sentence-Based | Q&A systems | Variable (3-5 sentences) | 1 sentence | Preserves meaning | Variable chunk sizes |
| Paragraph-Based | Documents | Variable | None | Natural boundaries | Highly variable sizes |
| Semantic | Technical docs | Variable | Context-aware | Intelligent boundaries | Computationally expensive |
| Sliding Window | Dense retrieval | 256-512 tokens | 50% | Maximum coverage | Redundancy, higher cost |
| Hierarchical | Long documents | Multi-level | Nested | Preserves structure | Complex implementation |
Quick Implementation:
```python
# Essential chunking (10 lines)
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list:
    """Simple fixed-size chunking with overlap."""
    # Whitespace split is a rough proxy for model tokens; swap in a tokenizer if needed.
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunk = ' '.join(tokens[i:i + size])
        chunks.append(chunk)
    return chunks
```
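For example, the 50%-overlap sliding-window strategy from the table can be approximated by raising the overlap (the file path below is illustrative):

```python
# Hypothetical usage: size=512 with overlap=256 approximates the sliding-window row above.
text = open("docs/manual.md").read()
chunks = chunk_text(text, size=512, overlap=256)
print(len(chunks), "chunks")
```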
Embedding Model Comparison
| Model | Dimensions | Max Tokens | Cost per 1M Tokens | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02 | Cost-optimized general use |
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13 | Maximum performance |
| Cohere embed-v3 | 1024 | 512 | $0.10 | Multilingual support |
| Sentence-Transformers (OSS) | 384-768 | 512 | Free | Privacy-sensitive, self-hosted |
| BGE-Large (OSS) | 1024 | 512 | Free | SOTA open source |
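For the self-hosted rows, a minimal sketch with the sentence-transformers library; the BGE-Large checkpoint name is one example matching the table, not a fixed recommendation:

```python
# Self-hosted embedding sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer

# "BAAI/bge-large-en-v1.5" corresponds to the BGE-Large (1024-dim) row above.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

texts = ["How do I rotate an API key?", "API keys can be rotated in Settings."]
# normalize_embeddings=True makes cosine similarity a simple dot product.
embeddings = model.encode(texts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)
```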
Vector Database Selection
```mermaid
graph TB
    A[Choose Vector DB] --> B{Requirements}
    B -->|Ease of Use| C[Pinecone<br/>Managed, Scalable]
    B -->|Full Control| D[Qdrant/Milvus<br/>Self-Hosted]
    B -->|Prototyping| E[ChromaDB<br/>Embedded, Simple]
    C --> F{Scale?}
    D --> F
    E --> F
    F -->|<100K vectors| G[Any Option Works]
    F -->|100K-10M| H[Pinecone or Qdrant]
    F -->|>10M| I[Milvus or Weaviate]
```
Database Feature Comparison
| Database | Features | Scale | Best For |
|---|---|---|---|
| Pinecone | Managed, high performance, serverless | Billions | Production, ease of use |
| Weaviate | GraphQL, hybrid search, modules | Millions | Complex schemas, AI-native |
| Qdrant | Advanced filtering, high performance | Millions | Metadata-rich retrieval |
| Milvus | Distributed, open source | Billions | Large scale, self-hosted |
| ChromaDB | Embedded, simple API | Thousands | Prototyping, local dev |
| FAISS | In-memory, fast | Millions | Research, custom needs |
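For the prototyping path, a minimal sketch with ChromaDB's embedded client; it relies on Chroma's default embedding function, and the document texts and metadata are illustrative:

```python
# Prototyping sketch: embedded ChromaDB with its default embedding function.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.get_or_create_collection(name="docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Rotate API keys in Settings > Security.",
               "Rate limits reset every 60 seconds."],
    metadatas=[{"category": "security"}, {"category": "limits"}],
)

results = collection.query(query_texts=["How do I rotate a key?"], n_results=2)
print(results["documents"][0])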
Retrieval Strategies
1. Naive RAG vs Advanced RAG
| Aspect | Naive RAG | Advanced RAG |
|---|---|---|
| Query Processing | Direct embedding | Rewriting, expansion, decomposition |
| Search Method | Vector similarity only | Hybrid (vector + keyword) |
| Reranking | None | Cross-encoder or LLM reranking |
| Context | Top-K chunks | Deduplicated, citation-tracked |
| Accuracy | 70-75% | 85-95% |
| Cost | Low | Medium |
2. Hybrid Search Architecture
```mermaid
graph LR
    A[User Query] --> B[Vector Search<br/>Semantic Similarity]
    A --> C[Keyword Search<br/>BM25]
    B --> D[Top 20 Results]
    C --> D
    D --> E[Fusion Ranking<br/>RRF or Weighted]
    E --> F[Reranker<br/>Cross-Encoder]
    F --> G[Top 5 Final Results]
```
Why Hybrid Search:
- Vector Search: Captures semantic meaning ("car" ↔ "automobile")
- Keyword Search: Catches exact matches ("SKU-12345", names, IDs)
- Combined: 15-30% better recall than either alone
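Reciprocal Rank Fusion (RRF), one of the fusion options named in the diagram, is simple to sketch. The snippet below assumes each backend returns an ordered list of document IDs and uses the conventional k = 60 constant:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: fuse vector-search and BM25 result lists (IDs are illustrative).
fused = rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]], top_n=5)
```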
Query Enhancement Techniques
| Technique | Purpose | Example | Impact |
|---|---|---|---|
| Query Rewriting | Clarify ambiguous queries | "It" → "The product mentioned" | +20% relevance |
| Query Expansion | Add synonyms, related terms | "car" → "car automobile vehicle" | +15% recall |
| Query Decomposition | Break complex into simple | "A and B" → ["A", "B"] | +25% for complex queries |
| Hypothetical Document | Generate ideal answer first | Create embedding from ideal answer | +10-15% precision |
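As one concrete example, query rewriting can be a single LLM call. A minimal sketch, assuming the OpenAI client used elsewhere in this chapter; the prompt wording is illustrative:

```python
# Query rewriting sketch: resolve pronouns and ambiguity using conversation history.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str, history: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the question so it is self-contained, resolving pronouns "
                f"using this conversation history:\n{history}\n\nQuestion: {question}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```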
Reranking for Accuracy
```mermaid
graph TB
    A[Initial Retrieval<br/>20 Candidates] --> B{Reranking Method}
    B -->|Fast| C[Cross-Encoder<br/>ms-marco-MiniLM<br/>2x improvement]
    B -->|Highest Quality| D[LLM Reranking<br/>GPT-4<br/>3x improvement]
    B -->|Balanced| E[Cohere Rerank<br/>1.5x improvement]
    C --> F[Top 5 Results]
    D --> F
    E --> F
    F --> G[Context Assembly]
```
Reranking Performance
| Method | Latency | Accuracy Gain | Cost | Best For |
|---|---|---|---|---|
| None | 0ms | Baseline | $0 | Budget-constrained |
| Cross-Encoder | 50-100ms | +40-60% | Low | Production default |
| Cohere Rerank | 100-200ms | +30-50% | Medium | Easy integration |
| LLM Reranking | 500-1000ms | +50-70% | High | Critical accuracy |
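A minimal sketch of the cross-encoder option, assuming the sentence-transformers library, the ms-marco-MiniLM checkpoint named in the diagram, and chunks represented as dicts carrying a `text` field:

```python
# Cross-encoder reranking sketch with sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    # Score each (query, chunk text) pair, then keep the highest-scoring chunks.
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```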
Essential RAG Implementation
```python
# Minimal production RAG (20 lines)
from openai import OpenAI

client = OpenAI()

def rag_query(question: str, vector_db) -> dict:
    """Complete RAG pipeline: embed the query, retrieve, generate a grounded answer."""
    # 1. Retrieve relevant chunks (vector_db is assumed to wrap your vector store
    #    and return dicts with a 'text' field)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    chunks = vector_db.search(query_embedding, top_k=5)
    context = "\n\n".join([f"[{i+1}] {c['text']}" for i, c in enumerate(chunks)])

    # 2. Generate grounded answer with numbered citations
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Answer based on context. Include citations [1], [2], etc.\n\nContext:\n{context}\n\nQuestion: {question}"
        }]
    )
    return {"answer": response.choices[0].message.content, "sources": chunks}
```
Case Study: Technical Documentation Q&A
Challenge: Software company with 5,000+ docs, 200 support engineers, 92% of questions repeated.
Initial State (GPT-4 Zero-Shot):
- Accuracy: 68%
- Hallucination rate: 32%
- Time per query: 15 minutes (manual search + verify)
- Cost: $200/day in engineer time
Solution: Production RAG System
```mermaid
graph TB
    A[5,000 Docs] --> B[Chunk + Index<br/>25,000 chunks]
    B --> C[Pinecone Vector DB<br/>text-embedding-3-small]
    D[Support Query] --> E[Hybrid Search<br/>Vector + Keyword]
    E --> F[Cross-Encoder Rerank<br/>Top 5 from 20]
    F --> G[GPT-4 Generation<br/>With Citations]
    C -.-> E
    G --> H[Verified Answer<br/>+ Source Links]
```
Implementation Details:
- Chunking: 512 tokens with 50 token overlap
- Metadata: Version number, doc category, last updated
- Reranking: Cross-encoder for top 5
- Caching: 70% of queries hit cache (24-hour TTL)
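A minimal sketch of the 24-hour query cache, reusing the rag_query function from the earlier snippet; production systems would typically use Redis rather than process memory:

```python
# Simple in-process query cache with a 24-hour TTL (sketch only).
import hashlib
import time

CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 24 * 60 * 60

def cached_rag_query(question: str, vector_db) -> dict:
    # Key on the normalized question so trivial variations still hit the cache.
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]
    result = rag_query(question, vector_db)  # rag_query defined earlier
    CACHE[key] = (time.time(), result)
    return result
```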
Results After 3 Months:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Accuracy | 68% | 94% | +26 pts |
| Hallucination Rate | 32% | 2% | -94% |
| Time per Query | 15 min | 30 sec | -97% |
| Engineer Hours Saved | 0 | 120 hrs/week | Massive |
| Cost per Query | ~$10 | $0.15 | -98.5% |
| User Satisfaction | 3.2/5 | 4.7/5 | +47% |
ROI Calculation:
- Setup Cost: $15K (2 weeks eng time + infrastructure)
- Monthly Savings: $80K in engineer time minus $3K in API costs = $77K net
- Payback Period: 6 days
- Annual ROI: 6,200%
Key Learnings:
- Hybrid search crucial: Caught exact matches (version numbers, error codes)
- Reranking worth the cost: Improved accuracy by 18% for $0.05/query
- Metadata filtering: Version-specific retrieval reduced confusion
- Citation requirement: Built trust, reduced manual verification
- Caching high-value: 70% cache hit rate saved $1.8K/month
Evaluation Metrics
| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Retrieval Precision@K | Relevant docs in top K | >0.8 | Manual labeling |
| Retrieval Recall@K | Coverage of relevant docs | >0.9 | Golden Q&A dataset |
| MRR (Mean Reciprocal Rank) | Rank of first relevant doc | >0.7 | Position of correct answer |
| Answer Accuracy | Correctness of final answer | >0.85 | LLM-as-judge + human |
| Faithfulness | Grounding in retrieved context | >0.95 | Entailment checking |
| Citation Accuracy | Correct source attribution | 100% | Automated verification |
| Latency P95 | Response time | <2s | Production monitoring |
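Minimal sketches of the retrieval metrics above, assuming a golden dataset that maps each query to its set of relevant document IDs:

```python
# Retrieval metric sketches over a golden set of relevant document IDs.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(top), 1)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document; 0 if none retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```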
Common Pitfalls & Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Too small chunks | Loss of context | Use 512-1024 tokens with overlap |
| No reranking | Poor top-K accuracy | Add cross-encoder reranking |
| Ignoring metadata | Can't filter by date/version | Enrich chunks with metadata |
| No query enhancement | Ambiguous queries fail | Rewrite with conversation context |
| Single retrieval method | Misses exact matches | Use hybrid search (vector + keyword) |
| No citation tracking | Can't verify sources | Track chunk IDs, return sources |
Implementation Checklist
Week 1-2: Data Preparation
- Collect and normalize documents
- Choose chunking strategy (recommend 512 tokens + overlap)
- Enrich chunks with metadata (source, date, category)
- Select embedding model (recommend text-embedding-3-small)
- Generate and store embeddings
Week 2-3: Indexing
- Select vector database (Pinecone for production, ChromaDB for dev)
- Configure hybrid search (if applicable)
- Set up metadata indexing for filtering
- Implement incremental update pipeline
- Test index performance (<100ms retrieval)
Week 3-4: Retrieval Pipeline
- Implement query enhancement (rewriting, expansion)
- Set up hybrid search (vector + BM25)
- Add reranking (cross-encoder recommended)
- Configure context assembly with token budgets
- Add citation tracking
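For the context assembly and citation items above, a minimal sketch assuming tiktoken for token counting and rank-ordered chunk dicts with a `text` field; the 3,000-token budget is illustrative:

```python
# Context assembly under a token budget, recording citation indices as it goes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assemble_context(chunks: list[dict], budget: int = 3000) -> str:
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):  # chunks assumed rank-ordered
        cost = len(enc.encode(chunk["text"]))
        if used + cost > budget:
            break
        parts.append(f"[{i}] {chunk['text']}")  # [i] doubles as the citation index
        used += cost
    return "\n\n".join(parts)
```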
Week 4-5: Generation & Evaluation
- Design grounded generation prompts
- Implement citation formatting
- Build evaluation dataset (100+ Q&A pairs)
- Measure accuracy, faithfulness, latency
- Set up continuous evaluation (10% sample)
Week 5-6: Production Deployment
- Add caching layer (70%+ hit rate target)
- Implement monitoring (latency, accuracy, cost)
- Set up alerts for quality degradation
- Create feedback loop for improvement
- Document usage guidelines
Advanced RAG Patterns
Multi-Step Retrieval
```mermaid
graph TB
    A[Complex Query] --> B[Query Decomposition<br/>3 Sub-Queries]
    B --> C1[Retrieve for Q1]
    B --> C2[Retrieve for Q2]
    B --> C3[Retrieve for Q3]
    C1 --> D[Deduplicate<br/>Merge Contexts]
    C2 --> D
    C3 --> D
    D --> E[Synthesize Answer<br/>From Combined Context]
```
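A minimal sketch of the pattern, reusing the OpenAI client and vector_db interface from the earlier snippet; the `decompose` helper (e.g., an LLM prompt that returns sub-queries) and the `id` field on chunks are assumptions:

```python
# Multi-step retrieval sketch: decompose, retrieve per sub-query, deduplicate, merge.
def multi_step_retrieve(question: str, vector_db, decompose) -> list[dict]:
    seen, merged = set(), []
    for sub_query in decompose(question):          # decompose() is a hypothetical helper
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=sub_query
        ).data[0].embedding
        for chunk in vector_db.search(embedding, top_k=5):
            if chunk["id"] not in seen:            # deduplicate across sub-queries
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged
```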
Self-Reflective RAG
```mermaid
graph TB
    A[Query] --> B[Initial Retrieval]
    B --> C[Generate Answer]
    C --> D{Verify Faithfulness}
    D -->|Low Confidence| E[Retrieve More Context<br/>With Refinement]
    D -->|High Confidence| F[Return Answer]
    E --> C
```
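A minimal sketch of the loop, reusing rag_query and the OpenAI client from earlier; the judge prompt, YES/NO protocol, and refinement step are illustrative:

```python
# Self-reflective RAG sketch: generate, judge faithfulness, refine and retry if needed.
def reflective_rag(question: str, vector_db, max_rounds: int = 2) -> dict:
    query = question
    for _ in range(max_rounds):
        result = rag_query(query, vector_db)  # rag_query from the earlier sketch
        sources = "\n".join(c["text"] for c in result["sources"])
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": (
                "Is every claim in the ANSWER supported by the SOURCES? Reply YES or NO."
                f"\n\nANSWER:\n{result['answer']}\n\nSOURCES:\n{sources}")}],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("YES"):
            break
        # Refine the query before the next retrieval round (illustrative refinement).
        query = f"{question} (answer strictly from documentation; include specifics)"
    return result
```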
Why It Matters
When to Use RAG:
- ✅ Knowledge changes frequently (docs, policies, products)
- ✅ Need source citations and provenance
- ✅ Privacy concerns prevent fine-tuning on data
- ✅ Want to update knowledge without retraining
- ✅ Have structured knowledge base (docs, wikis, databases)
When NOT to Use RAG:
- ❌ Knowledge is static and fits in context window
- ❌ Need model to internalize reasoning patterns (use fine-tuning)
- ❌ No source documents available (pure generation task)
- ❌ Retrieval latency unacceptable (<100ms required)