Part 4: Generative AI & LLM Consulting

Chapter 22: Retrieval-Augmented Generation (RAG)

Overview

Ground LLM outputs in authoritative corpora to reduce hallucination and improve freshness.

Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs with the precision of information retrieval. Instead of relying solely on training data, RAG systems dynamically fetch relevant information from knowledge bases, enabling up-to-date, factually grounded responses that reduce hallucinations by 60-80%.

Why RAG Matters

Business Impact:

  • Accuracy: 60-80% reduction in hallucinations vs raw LLM
  • Freshness: Access to real-time data beyond model cutoff
  • Cost: 10x cheaper than fine-tuning for knowledge updates
  • Traceability: Full source attribution and citations
  • Privacy: Keep sensitive data out of model training

RAG Architecture

graph TB
  subgraph "Offline: Indexing Pipeline"
    A[Raw Documents] --> B[Document Processing<br/>PDF, HTML, Markdown]
    B --> C[Chunking Strategy<br/>512-1024 tokens]
    C --> D[Metadata Enrichment<br/>Source, Time, Tags]
    D --> E[Embedding Generation<br/>text-embedding-3]
    E --> F[Vector Database<br/>Pinecone, Qdrant]
  end
  subgraph "Online: Retrieval Pipeline"
    G[User Query] --> H[Query Enhancement<br/>Rewrite, Expand]
    H --> I[Hybrid Search<br/>Vector + Keyword]
    I --> J[Reranking<br/>Top 5 from 20]
    J --> K[Context Assembly<br/>Citations + Budget]
    F -.-> I
    K --> L[LLM Generation<br/>Grounded Answer]
    L --> M[Response + Sources]
  end

RAG Decision Framework

graph TB
  A[Need Updated Info?] --> B{Data Characteristics}
  B -->|Static, Infrequent| C[Fine-Tuning]
  B -->|Dynamic, Frequent| D[RAG]
  B -->|Mixed| E[Hybrid: RAG + FT]
  D --> F{Scale?}
  F -->|<10K docs| G[Simple RAG<br/>ChromaDB + GPT-4]
  F -->|10K-1M docs| H[Production RAG<br/>Pinecone + Reranking]
  F -->|>1M docs| I[Advanced RAG<br/>Hybrid Search + Caching]
  G --> J[Implementation]
  H --> J
  I --> J

Chunking Strategy Comparison

| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed Size | General purpose | 512-1024 tokens | 10-20% | Simple, predictable | May split semantic units |
| Sentence-Based | Q&A systems | Variable (3-5 sentences) | 1 sentence | Preserves meaning | Variable chunk sizes |
| Paragraph-Based | Documents | Variable | None | Natural boundaries | Highly variable sizes |
| Semantic | Technical docs | Variable | Context-aware | Intelligent boundaries | Computationally expensive |
| Sliding Window | Dense retrieval | 256-512 tokens | 50% | Maximum coverage | Redundancy, higher cost |
| Hierarchical | Long documents | Multi-level | Nested | Preserves structure | Complex implementation |

Quick Implementation:

# Essential chunking (10 lines)
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list:
    """Simple fixed-size chunking with overlap"""
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), size - overlap):
        chunk = ' '.join(tokens[i:i + size])
        chunks.append(chunk)
    return chunks

Embedding Model Comparison

| Model | Dimensions | Max Tokens | Cost per 1M Tokens | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02 | Cost-optimized general use |
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13 | Maximum performance |
| Cohere embed-v3 | 1024 | 512 | $0.10 | Multilingual support |
| Sentence-Transformers (OSS) | 384-768 | 512 | Free | Privacy-sensitive, self-hosted |
| BGE-Large (OSS) | 1024 | 512 | Free | SOTA open source |
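
A minimal indexing sketch tying the offline pipeline together, assuming the chunk_text helper above, OpenAI's text-embedding-3-small, and a local ChromaDB collection (swap in your production vector store):

# Index chunks into a local ChromaDB collection (illustrative sketch)
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

def index_document(doc_id: str, text: str, metadata: dict):
    """Chunk, embed, and store a single document."""
    chunks = chunk_text(text)  # fixed-size chunking from the snippet above
    response = client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[metadata] * len(chunks),
    )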

Vector Database Selection

graph TB
  A[Choose Vector DB] --> B{Requirements}
  B -->|Ease of Use| C[Pinecone<br/>Managed, Scalable]
  B -->|Full Control| D[Qdrant/Milvus<br/>Self-Hosted]
  B -->|Prototyping| E[ChromaDB<br/>Embedded, Simple]
  C --> F{Scale?}
  D --> F
  E --> F
  F -->|<100K vectors| G[Any Option Works]
  F -->|100K-10M| H[Pinecone or Qdrant]
  F -->|>10M| I[Milvus or Weaviate]

Database Feature Comparison

| Database | Features | Scale (vectors) | Best For |
|---|---|---|---|
| Pinecone | Managed, high performance, serverless | Billions | Production, ease of use |
| Weaviate | GraphQL, hybrid search, modules | Millions | Complex schemas, AI-native |
| Qdrant | Advanced filtering, high performance | Millions | Metadata-rich retrieval |
| Milvus | Distributed, open source | Billions | Large scale, self-hosted |
| ChromaDB | Embedded, simple API | Thousands | Prototyping, local dev |
| FAISS | In-memory, fast | Millions | Research, custom needs |

Retrieval Strategies

1. Naive RAG vs Advanced RAG

| Aspect | Naive RAG | Advanced RAG |
|---|---|---|
| Query Processing | Direct embedding | Rewriting, expansion, decomposition |
| Search Method | Vector similarity only | Hybrid (vector + keyword) |
| Reranking | None | Cross-encoder or LLM reranking |
| Context | Top-K chunks | Deduplicated, citation-tracked |
| Accuracy | 70-75% | 85-95% |
| Cost | Low | Medium |

2. Hybrid Search Architecture

graph LR
  A[User Query] --> B[Vector Search<br/>Semantic Similarity]
  A --> C[Keyword Search<br/>BM25]
  B --> D[Top 20 Results]
  C --> D
  D --> E[Fusion Ranking<br/>RRF or Weighted]
  E --> F[Reranker<br/>Cross-Encoder]
  F --> G[Top 5 Final Results]

Why Hybrid Search:

  • Vector Search: Captures semantic meaning ("car" ↔ "automobile")
  • Keyword Search: Catches exact matches ("SKU-12345", names, IDs)
  • Combined: 15-30% better recall than either alone
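
The fusion step is commonly implemented with Reciprocal Rank Fusion (RRF); a minimal sketch, assuming each retriever returns an ordered list of document IDs:

# Reciprocal Rank Fusion (RRF) over two ranked result lists
def rrf_fuse(vector_ids: list, keyword_ids: list, k: int = 60, top_n: int = 20) -> list:
    """Combine two rankings; k=60 is the conventional RRF constant."""
    scores = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]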

Query Enhancement Techniques

| Technique | Purpose | Example | Impact |
|---|---|---|---|
| Query Rewriting | Clarify ambiguous queries | "It" → "The product mentioned" | +20% relevance |
| Query Expansion | Add synonyms, related terms | "car" → "car automobile vehicle" | +15% recall |
| Query Decomposition | Break complex into simple | "A and B" → ["A", "B"] | +25% for complex queries |
| Hypothetical Document | Generate ideal answer first | Create embedding from ideal answer | +10-15% precision |
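
A minimal query-rewriting sketch using an LLM call; the prompt wording and the gpt-4 model choice are illustrative, not a fixed recipe:

# Rewrite an ambiguous follow-up query using conversation context
from openai import OpenAI

client = OpenAI()

def rewrite_query(query: str, history: str) -> str:
    """Resolve pronouns and add missing context before retrieval."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the user's question as a standalone search query, "
                "resolving pronouns using the conversation history.\n\n"
                f"History:\n{history}\n\nQuestion: {query}\n\nRewritten query:"
            )
        }]
    )
    return response.choices[0].message.content.strip()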

Reranking for Accuracy

graph TB
  A[Initial Retrieval<br/>20 Candidates] --> B{Reranking Method}
  B -->|Fast| C[Cross-Encoder<br/>ms-marco-MiniLM<br/>2x improvement]
  B -->|Highest Quality| D[LLM Reranking<br/>GPT-4<br/>3x improvement]
  B -->|Balanced| E[Cohere Rerank<br/>1.5x improvement]
  C --> F[Top 5 Results]
  D --> F
  E --> F
  F --> G[Context Assembly]

Reranking Performance

| Method | Latency | Accuracy Gain | Cost | Best For |
|---|---|---|---|---|
| None | 0ms | Baseline | $0 | Budget-constrained |
| Cross-Encoder | 50-100ms | +40-60% | Low | Production default |
| Cohere Rerank | 100-200ms | +30-50% | Medium | Easy integration |
| LLM Reranking | 500-1000ms | +50-70% | High | Critical accuracy |
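
A minimal reranking sketch using the sentence-transformers cross-encoder referenced above; chunks are assumed to be dicts with a "text" field, as elsewhere in this chapter:

# Rerank retrieved chunks with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list, top_k: int = 5) -> list:
    """Score (query, chunk) pairs and keep the top_k highest-scoring chunks."""
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]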

Essential RAG Implementation

# Minimal production RAG (20 lines)
from openai import OpenAI

client = OpenAI()

def rag_query(question: str, vector_db) -> dict:
    """Complete RAG pipeline"""
    # 1. Retrieve relevant chunks
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    chunks = vector_db.search(query_embedding, top_k=5)
    context = "\n\n".join([f"[{i+1}] {c['text']}" for i, c in enumerate(chunks)])

    # 2. Generate grounded answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Answer based on context. Include citations [1], [2], etc.\n\nContext:\n{context}\n\nQuestion: {question}"
        }]
    )

    return {"answer": response.choices[0].message.content, "sources": chunks}

Case Study: Technical Documentation Q&A

Challenge: Software company with 5,000+ docs, 200 support engineers, 92% of questions repeated.

Initial State (GPT-4 Zero-Shot):

  • Accuracy: 68%
  • Hallucination rate: 32%
  • Time per query: 15 minutes (manual search + verify)
  • Cost: $200/day in engineer time

Solution: Production RAG System

graph TB
  A[5,000 Docs] --> B[Chunk + Index<br/>25,000 chunks]
  B --> C[Pinecone Vector DB<br/>text-embedding-3-small]
  D[Support Query] --> E[Hybrid Search<br/>Vector + Keyword]
  E --> F[Cross-Encoder Rerank<br/>Top 5 from 20]
  F --> G[GPT-4 Generation<br/>With Citations]
  C -.-> E
  G --> H[Verified Answer<br/>+ Source Links]

Implementation Details:

  • Chunking: 512 tokens with 50 token overlap
  • Metadata: Version number, doc category, last updated
  • Reranking: Cross-encoder for top 5
  • Caching: 70% of queries hit cache (24-hour TTL)

Results After 3 Months:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Accuracy | 68% | 94% | +26 pts |
| Hallucination Rate | 32% | 2% | -94% |
| Time per Query | 15 min | 30 sec | -97% |
| Engineer Hours Saved | 0 | 120 hrs/week | +120 hrs/week |
| Cost per Query | ~$10 | $0.15 | -98.5% |
| User Satisfaction | 3.2/5 | 4.7/5 | +47% |

ROI Calculation:

  • Setup Cost: $15K (2 weeks eng time + infrastructure)
  • Monthly Savings: $80K (engineer time) - $3K (API costs) = $77K
  • Payback Period: 6 days
  • Annual ROI: 6,200%

Key Learnings:

  1. Hybrid search crucial: Caught exact matches (version numbers, error codes)
  2. Reranking worth the cost: Improved accuracy by 18% for $0.05/query
  3. Metadata filtering: Version-specific retrieval reduced confusion
  4. Citation requirement: Built trust, reduced manual verification
  5. Caching high-value: 70% cache hit rate saved $1.8K/month

Evaluation Metrics

| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Retrieval Precision@K | Relevant docs in top K | >0.8 | Manual labeling |
| Retrieval Recall@K | Coverage of relevant docs | >0.9 | Golden Q&A dataset |
| MRR (Mean Reciprocal Rank) | Rank of first relevant doc | >0.7 | Position of correct answer |
| Answer Accuracy | Correctness of final answer | >0.85 | LLM-as-judge + human |
| Faithfulness | Grounding in retrieved context | >0.95 | Entailment checking |
| Citation Accuracy | Correct source attribution | 100% | Automated verification |
| Latency P95 | Response time | <2s | Production monitoring |
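
A minimal sketch for scoring retrieval against a golden Q&A dataset, where retrieved is the ranked list of chunk IDs returned for a question and relevant is the labeled set of correct chunk IDs:

# Retrieval metrics over a golden dataset
def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0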

Common Pitfalls & Solutions

| Pitfall | Impact | Solution |
|---|---|---|
| Too small chunks | Loss of context | Use 512-1024 tokens with overlap |
| No reranking | Poor top-K accuracy | Add cross-encoder reranking |
| Ignoring metadata | Can't filter by date/version | Enrich chunks with metadata |
| No query enhancement | Ambiguous queries fail | Rewrite with conversation context |
| Single retrieval method | Misses exact matches | Use hybrid search (vector + keyword) |
| No citation tracking | Can't verify sources | Track chunk IDs, return sources |

Implementation Checklist

Week 1-2: Data Preparation

  • Collect and normalize documents
  • Choose chunking strategy (recommend 512 tokens + overlap)
  • Enrich chunks with metadata (source, date, category)
  • Select embedding model (recommend text-embedding-3-small)
  • Generate and store embeddings

Week 2-3: Indexing

  • Select vector database (Pinecone for production, ChromaDB for dev)
  • Configure hybrid search (if applicable)
  • Set up metadata indexing for filtering
  • Implement incremental update pipeline
  • Test index performance (<100ms retrieval)

Week 3-4: Retrieval Pipeline

  • Implement query enhancement (rewriting, expansion)
  • Set up hybrid search (vector + BM25)
  • Add reranking (cross-encoder recommended)
  • Configure context assembly with token budgets
  • Add citation tracking
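
A minimal context-assembly sketch under a token budget, using tiktoken for counting; the 6,000-token budget and the chunk dict format ("text" field) are illustrative assumptions:

# Assemble citation-tagged context under a token budget
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def assemble_context(chunks: list, budget: int = 6000) -> str:
    """Add chunks in rank order until the token budget is exhausted."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks):
        piece = f"[{i + 1}] {chunk['text']}"
        tokens = len(encoder.encode(piece))
        if used + tokens > budget:
            break
        parts.append(piece)
        used += tokens
    return "\n\n".join(parts)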

Week 4-5: Generation & Evaluation

  • Design grounded generation prompts
  • Implement citation formatting
  • Build evaluation dataset (100+ Q&A pairs)
  • Measure accuracy, faithfulness, latency
  • Set up continuous evaluation (10% sample)

Week 5-6: Production Deployment

  • Add caching layer (70%+ hit rate target)
  • Implement monitoring (latency, accuracy, cost)
  • Set up alerts for quality degradation
  • Create feedback loop for improvement
  • Document usage guidelines
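
A minimal caching sketch for the deployment step: an in-memory store keyed on a hash of the normalized question with a 24-hour TTL, wrapping the rag_query function from earlier (production systems would typically use Redis or a similar shared cache):

# Simple in-memory response cache with a 24-hour TTL
import hashlib
import time

_cache = {}
TTL_SECONDS = 24 * 3600

def cached_rag_query(question: str, vector_db) -> dict:
    """Return a cached answer when the same question was seen within the TTL."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    entry = _cache.get(key)
    if entry and time.time() - entry["time"] < TTL_SECONDS:
        return entry["result"]
    result = rag_query(question, vector_db)  # pipeline function defined earlier
    _cache[key] = {"time": time.time(), "result": result}
    return result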

Advanced RAG Patterns

Multi-Step Retrieval

graph TB
  A[Complex Query] --> B[Query Decomposition<br/>3 Sub-Queries]
  B --> C1[Retrieve for Q1]
  B --> C2[Retrieve for Q2]
  B --> C3[Retrieve for Q3]
  C1 --> D[Deduplicate<br/>Merge Contexts]
  C2 --> D
  C3 --> D
  D --> E[Synthesize Answer<br/>From Combined Context]
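
A minimal multi-step retrieval sketch. It assumes the sub-queries have already been produced by an LLM decomposition step (as in the query enhancement section), that vector_db exposes the search(embedding, top_k) interface used earlier, and that each chunk dict carries an "id" field for deduplication:

# Multi-step retrieval: retrieve per sub-query, then deduplicate and merge
from openai import OpenAI

client = OpenAI()

def multi_step_retrieve(sub_queries: list, vector_db, top_k: int = 5) -> list:
    """Retrieve context for each sub-query and merge without duplicate chunks."""
    seen, merged = set(), []
    for sub_query in sub_queries:
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=sub_query
        ).data[0].embedding
        for chunk in vector_db.search(embedding, top_k=top_k):
            if chunk["id"] not in seen:  # assumes each chunk dict carries an "id"
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged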

Self-Reflective RAG

graph TB
  A[Query] --> B[Initial Retrieval]
  B --> C[Generate Answer]
  C --> D{Verify Faithfulness}
  D -->|Low Confidence| E[Retrieve More Context<br/>With Refinement]
  D -->|High Confidence| F[Return Answer]
  E --> C
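
A minimal self-reflective loop built on the rag_query function and OpenAI client from the Essential RAG snippet; the yes/no faithfulness prompt and the single-retry refinement are illustrative simplifications:

# Self-reflective RAG: verify grounding, refine and retry once if unsupported
def reflective_rag(question: str, vector_db, max_retries: int = 1) -> dict:
    """Answer, check faithfulness with an LLM judge, re-retrieve if needed."""
    result = rag_query(question, vector_db)  # from the Essential RAG snippet
    for _ in range(max_retries):
        context = "\n\n".join(c["text"] for c in result["sources"])
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": ("Is every claim in the answer supported by the context? "
                            "Reply 'yes' or 'no'.\n\n"
                            f"Context:\n{context}\n\nAnswer:\n{result['answer']}")
            }]
        ).choices[0].message.content.strip().lower()
        if verdict.startswith("yes"):
            break
        # Refine: fold the unsupported draft into the query to steer retrieval
        result = rag_query(f"{question}\nDraft answer to verify: {result['answer']}", vector_db)
    return result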

Why It Matters

When to Use RAG:

  • ✅ Knowledge changes frequently (docs, policies, products)
  • ✅ Need source citations and provenance
  • ✅ Privacy concerns prevent fine-tuning on data
  • ✅ Want to update knowledge without retraining
  • ✅ Have structured knowledge base (docs, wikis, databases)

When NOT to Use RAG:

  • ❌ Knowledge is static and fits in context window
  • ❌ Need model to internalize reasoning patterns (use fine-tuning)
  • ❌ No source documents available (pure generation task)
  • ❌ Response-time budget is tighter than retrieval allows (e.g., sub-100ms responses required)