Part 5: Multimodal, Video & Voice
Chapter 26 — Multimodal Systems (VLMs)
Overview
Build systems that combine vision and language for grounded understanding and responses. Vision-Language Models (VLMs) enable systems to understand and reason about visual content alongside textual information—from intelligent document processing to visual question answering and product discovery.
Multimodal Processing Architecture
```mermaid
graph TB
    A[Input Sources] --> B{Input Type}
    B -->|Images| C[Image Preprocessing]
    B -->|Documents| D[Document Layout Analysis]
    B -->|Video| E[Frame Extraction]
    C --> F[Vision Encoder]
    D --> G[OCR + Layout Model]
    E --> F
    F --> H[Multimodal Fusion]
    G --> H
    H --> I[Vision-Language Model]
    I --> J[Response Generator]
    K[Text Query] --> I
    J --> L{Output Type}
    L -->|Captions| M[Descriptive Text]
    L -->|Answers| N[Grounded Responses]
    L -->|Extraction| O[Structured Data]
    P[Privacy Filter] -.-> C
    P -.-> D
    Q[Quality Gate] -.-> F
    R[Citation Validator] -.-> N
```
Complete VLM Pipeline
```mermaid
graph LR
    A[Visual Input] --> B[Preprocessing]
    B --> C[Privacy Masking]
    C --> D[Vision Encoder]
    E[Text Query] --> F[Text Encoder]
    D --> G[Cross-Modal Attention]
    F --> G
    G --> H[VLM Reasoning]
    H --> I[Response Generation]
    I --> J{Quality Check}
    J -->|High Confidence| K[Grounded Answer]
    J -->|Low Confidence| L[Request Clarification]
    M[Citation Engine] --> K
    N[Hallucination Filter] -.-> I
```
Model Selection Framework
VLM Comparison Matrix
| Model | Best For | Context Length | Latency | Cost | Accuracy |
|---|---|---|---|---|---|
| GPT-4 Vision | Complex reasoning, general VQA | 128K tokens | High (2-5s) | $$$$ | Excellent (95%+) |
| Claude 3.5 Sonnet | Document understanding, code in images | 200K tokens | Medium (1-3s) | $$$ | Excellent (94%+) |
| Gemini 1.5 Pro | Long context, video understanding | 1M tokens | Medium (1-4s) | $$$ | Excellent (93%+) |
| LLaVA 1.6 | Open-source, customizable | 4K tokens | Low (200-800ms) | $ | Good (87%) |
| BLIP-2 | Captioning, image-text retrieval | 77 tokens | Very Low (100-300ms) | $ | Good (85%) |
| PaliGemma | Fine-tuning, specialized tasks | 256 tokens | Low (300-600ms) | $ | Variable |
| Specialized OCR | Document extraction only | N/A | Very Low (<100ms) | $ | Excellent (98%+ for text) |
Vision Encoder Comparison
| Encoder | Resolution | Parameters | Speed | Best For |
|---|---|---|---|---|
| CLIP ViT-L/14 | 224×224 | 428M | 45ms | General vision-language alignment |
| SigLIP | 384×384 | 400M | 62ms | Improved training efficiency |
| EVA-CLIP | 224×224 | 1.1B | 120ms | Highest accuracy tasks |
| ConvNeXt | 224×224 | 197M | 38ms | Balance of speed and quality |
OCR + VLM Hybrid Strategy
| Approach | Accuracy | Latency | Cost | Best For |
|---|---|---|---|---|
| OCR + Layout + VLM | 96% | 2.5s | $$ | Structured documents |
| Pure VLM (GPT-4V) | 91% | 3.5s | $$$$ | Complex visual reasoning |
| Pure VLM (Claude 3.5) | 94% | 2.2s | $$$ | Documents with tables |
| Specialized OCR Only | 98% text | 0.8s | $ | Text extraction only |
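The hybrid rows above boil down to a simple pattern: run a specialized OCR pass first, then hand both the image and the transcript to the VLM so it can reason over layout while quoting exact strings. Below is a minimal sketch of that pattern, assuming Tesseract (via `pytesseract`) for OCR and the same Anthropic client used in this chapter's minimal example; the helper name and prompt wording are illustrative, not a prescribed API.

```python
# Hybrid OCR + VLM: OCR gives cheap, high-precision text; the VLM reasons over
# both the raw image and the transcript and quotes text verbatim.
import base64

import anthropic
import pytesseract
from PIL import Image

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def hybrid_document_qa(image_path: str, question: str) -> str:
    # Step 1: specialized OCR extraction (fast, inexpensive).
    ocr_text = pytesseract.image_to_string(Image.open(image_path))

    # Step 2: pass the image plus the OCR transcript to the VLM.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text",
                 "text": f"OCR transcript:\n{ocr_text}\n\nQuestion: {question}\n"
                         "Answer using only text visible in the document and quote it verbatim."},
            ],
        }],
    )
    return message.content[0].text
```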
Decision Tree: Choosing the Right VLM
```mermaid
graph TD
    A[VLM Selection] --> B{Primary Use Case?}
    B -->|Document Processing| C{Need Layout Understanding?}
    C -->|Yes| D[Claude 3.5 Sonnet]
    C -->|No| E[Specialized OCR + LLM]
    B -->|Video Analysis| F{Context Length Needed?}
    F -->|Long| G[Gemini 1.5 Pro]
    F -->|Short| H[LLaVA + Temporal Model]
    B -->|General VQA| I{Budget Constraint?}
    I -->|High Budget| J[GPT-4 Vision]
    I -->|Cost Sensitive| K{Need Customization?}
    K -->|Yes| L[Fine-tune LLaVA/PaliGemma]
    K -->|No| M[BLIP-2]
    B -->|Real-time Processing| N{Latency Requirement?}
    N -->|<500ms| O[Quantized LLaVA on GPU]
    N -->|<100ms| P[Specialized Vision Models]
```
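The same decision tree can live in code as a small routing function, which keeps the selection logic testable and easy to revisit as models change. The sketch below mirrors the branches in the diagram; the string labels and thresholds are placeholders to adapt to your own benchmarks.

```python
# Decision tree above expressed as a routing function (labels are illustrative).
def select_vlm(use_case: str, *, needs_layout=False, long_context=False,
               high_budget=False, needs_customization=False, latency_ms=None) -> str:
    if use_case == "document":
        return "claude-3.5-sonnet" if needs_layout else "specialized ocr + llm"
    if use_case == "video":
        return "gemini-1.5-pro" if long_context else "llava + temporal model"
    if use_case == "realtime":
        if latency_ms is not None and latency_ms < 100:
            return "specialized vision model"
        return "quantized llava on gpu"
    # Default branch: general VQA.
    if high_budget:
        return "gpt-4-vision"
    return "fine-tuned llava/paligemma" if needs_customization else "blip-2"

print(select_vlm("document", needs_layout=True))   # claude-3.5-sonnet
```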
Use Case Architectures
Document Question Answering Pipeline
```mermaid
graph LR
    A[Document Image] --> B[Quality Assessment]
    B --> C{Quality OK?}
    C -->|No| D[Enhancement]
    C -->|Yes| E[OCR Extraction]
    D --> E
    E --> F[Layout Detection]
    F --> G[Entity Recognition]
    G --> H[VLM Processing]
    I[User Question] --> H
    H --> J[Answer Generation]
    J --> K[Citation Extraction]
    K --> L[Confidence Scoring]
    L --> M[Grounded Response]
```
Performance Targets:
| Metric | Target | Measurement |
|---|---|---|
| OCR Accuracy | > 98% | Character-level accuracy |
| Layout Detection | > 95% | IoU for bounding boxes |
| Answer Relevance | > 90% | Human evaluation score |
| Citation Precision | > 92% | Exact text match rate |
| End-to-End Latency | < 3s | P95 response time |
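Citation precision in the table is defined as an exact text match rate: the fraction of cited spans that actually appear in the source document. A minimal checker might look like the sketch below; the whitespace/case normalization is an assumption you may want to tighten or relax for your documents.

```python
import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so OCR line breaks don't break matching.
    return re.sub(r"\s+", " ", text).strip().lower()

def citation_precision(citations: list[str], source_text: str) -> float:
    """Fraction of cited spans found verbatim in the source document."""
    if not citations:
        return 0.0
    source = normalize(source_text)
    hits = sum(1 for c in citations if normalize(c) in source)
    return hits / len(citations)

# Example: two of three quoted spans appear in the OCR output -> 0.67.
print(citation_precision(["Total: $1,240.00", "Net 30", "Total: $9,999"],
                         "Invoice #42\nTerms: Net 30\nTotal: $1,240.00"))
```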
Visual Search Architecture
```mermaid
graph TB
    A[Image Collection] --> B[Embedding Generation]
    B --> C[Vector Index]
    D[Query Input] --> E{Query Type}
    E -->|Text| F[Text Embedding]
    E -->|Image| G[Image Embedding]
    E -->|Multimodal| H[Joint Embedding]
    F --> I[Vector Search]
    G --> I
    H --> I
    C --> I
    I --> J[Top-K Retrieval]
    J --> K[Reranking]
    K --> L[VLM Description]
    L --> M[Enriched Results]
```
Embedding Strategy Comparison:
| Approach | Precision@10 | Recall@100 | Latency | Best For |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.72 | 0.89 | 45ms | General images |
| OpenCLIP ConvNext | 0.78 | 0.92 | 62ms | High accuracy |
| SigLIP | 0.76 | 0.91 | 38ms | Efficiency |
| Custom Fine-tuned | 0.85 | 0.95 | 55ms | Domain-specific |
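All of these approaches share one mechanic: embed images and text queries into the same vector space, then rank by cosine similarity. The sketch below uses the Hugging Face `transformers` CLIP checkpoint as a stand-in (swap the model name for SigLIP or a fine-tuned checkpoint); the image filenames are hypothetical and a real system would store the vectors in an ANN index rather than a tensor.

```python
# Text-to-image search with a shared CLIP embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def embed_images(paths: list[str]) -> torch.Tensor:
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize for cosine

def embed_text(query: str) -> torch.Tensor:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Cosine similarity is a dot product of unit vectors; keep the top-k as candidates.
index = embed_images(["img_001.jpg", "img_002.jpg", "img_003.jpg"])  # hypothetical files
scores = (embed_text("red leather handbag") @ index.T).squeeze(0)
top_k = scores.topk(k=2)
print(list(zip(top_k.indices.tolist(), top_k.values.tolist())))
```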
Multimodal RAG System
```mermaid
graph TB
    A[User Query] --> B[Query Analysis]
    B --> C{Requires Visual Context?}
    C -->|Yes| D[Multimodal Retrieval]
    C -->|No| E[Text-only Retrieval]
    D --> F[Image Vector DB]
    D --> G[Text Vector DB]
    E --> G
    F --> H[Image Results]
    G --> I[Text Results]
    H --> J[Cross-Modal Reranking]
    I --> J
    J --> K[VLM with Retrieved Context]
    K --> L{Confidence Check}
    L -->|High| M[Final Answer with Citations]
    L -->|Low| N[Gather More Context]
    N --> D
```
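The control flow in this diagram (route the query, retrieve, answer, and loop back for more context when confidence is low) can be captured in a short orchestration skeleton. Everything below is a placeholder: the keyword-based routing heuristic, the retriever callables, and the confidence score are stand-ins for your own query classifier, vector stores, and VLM scorer.

```python
# Skeleton of the multimodal RAG loop above; all components are injected.
from typing import Callable

VISUAL_HINTS = ("image", "photo", "diagram", "chart", "screenshot", "figure")

def multimodal_rag(query: str,
                   retrieve_text: Callable[[str], list],
                   retrieve_images: Callable[[str], list],
                   answer: Callable[[str, list], tuple[str, float]],
                   min_confidence: float = 0.7,
                   max_rounds: int = 2) -> str:
    # Naive routing: a real system would use a classifier, not keywords.
    needs_visual = any(hint in query.lower() for hint in VISUAL_HINTS)
    context = retrieve_text(query)
    if needs_visual:
        context += retrieve_images(query)

    response = ""
    for _ in range(max_rounds):
        response, confidence = answer(query, context)
        if confidence >= min_confidence:
            return response
        # Low confidence: gather more context from both stores and retry.
        context += retrieve_images(query) + retrieve_text(query)
    return response
```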
Privacy and Safety Controls
PII Detection and Redaction Flow
```mermaid
graph LR
    A[Input Image] --> B[PII Detection]
    B --> C{PII Found?}
    C -->|Faces| D[Face Blur]
    C -->|Text| E[Text Analysis]
    C -->|Plates| F[Plate Masking]
    C -->|None| G[Safe to Process]
    E --> H{Contains PII?}
    H -->|Yes| I[Redact Regions]
    H -->|No| G
    D --> J[Masked Image]
    F --> J
    I --> J
    J --> K[Privacy Audit Log]
    K --> L[VLM Processing]
    G --> L
```
Privacy Control Framework
| PII Type | Detection Method | Redaction Strategy | Accuracy | Performance |
|---|---|---|---|---|
| Faces | RetinaFace, MTCNN | Gaussian blur (99×99) | 98.5% | 45ms/image |
| License Plates | YOLOv8 + OCR | Black box overlay | 96.2% | 65ms/image |
| Text (SSN, CC) | Regex + NER | Pixelation or masking | 94.7% | 120ms/page |
| Signatures | Custom CNN | Blur or removal | 92.3% | 35ms/image |
| Barcodes/QR | OpenCV detection | Black box | 99.1% | 15ms/image |
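Face redaction is the most common of these controls, and the mechanics are simple: detect face regions, then blur them in place before the image ever reaches a cloud VLM. The table assumes RetinaFace or MTCNN as the detector; the sketch below substitutes OpenCV's bundled Haar cascade as a lightweight stand-in so it runs with no extra model downloads, and uses the 99×99 Gaussian kernel from the redaction strategy above.

```python
# Face redaction sketch (Haar cascade stands in for RetinaFace/MTCNN).
import cv2

def blur_faces(input_path: str, output_path: str) -> int:
    image = cv2.imread(input_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        roi = image[y:y + h, x:x + w]
        # 99x99 Gaussian kernel, matching the redaction strategy in the table.
        image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (99, 99), 0)

    cv2.imwrite(output_path, image)
    return len(faces)   # record the count in the privacy audit log
```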
Content Moderation Pipeline
```mermaid
graph TB
    A[Image Upload] --> B[Safety Classifier]
    B --> C{Content Type}
    C -->|Safe| D[Allow Processing]
    C -->|NSFW| E[Block + Log]
    C -->|Violence| E
    C -->|Hate Symbols| E
    C -->|Borderline| F[Human Review Queue]
    F --> G{Moderator Decision}
    G -->|Approve| H[Allow with Flag]
    G -->|Reject| E
    D --> I[VLM Processing]
    H --> I
    I --> J[Output Safety Check]
    J --> K{Output Safe?}
    K -->|Yes| L[Deliver Response]
    K -->|No| M[Filter + Alert]
```
Safety Thresholds:
| Category | Block Threshold | Review Threshold | Action |
|---|---|---|---|
| NSFW Content | > 0.85 | 0.60 - 0.85 | Immediate block / Human review |
| Violence | > 0.80 | 0.55 - 0.80 | Block / Review |
| Hate Symbols | > 0.75 | 0.50 - 0.75 | Block / Review |
| Medical Imagery | > 0.90 | 0.70 - 0.90 | Flag for compliance |
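The threshold table maps directly to a three-way decision per category: block above the block threshold, queue for human review inside the review band, otherwise allow. A minimal sketch is below; the classifier scores are assumed to arrive in the 0-1 range, and the medical-imagery "block" outcome should be interpreted as a compliance flag rather than a hard rejection.

```python
# Safety threshold routing derived from the table above.
THRESHOLDS = {
    "nsfw":         {"block": 0.85, "review": 0.60},
    "violence":     {"block": 0.80, "review": 0.55},
    "hate_symbols": {"block": 0.75, "review": 0.50},
    "medical":      {"block": 0.90, "review": 0.70},  # "block" here = flag for compliance
}

def moderate(scores: dict[str, float]) -> str:
    """Return 'block', 'review', or 'allow' for a set of classifier scores."""
    decision = "allow"
    for category, score in scores.items():
        limits = THRESHOLDS.get(category)
        if limits is None:
            continue
        if score > limits["block"]:
            return "block"
        if score >= limits["review"]:
            decision = "review"
    return decision

print(moderate({"nsfw": 0.12, "violence": 0.63}))   # review
```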
Preprocessing Strategy
| Input Type | Preprocessing Steps | Tools | Performance Impact |
|---|---|---|---|
| Scanned Documents | Deskew → Denoise → Enhance → OCR | OpenCV, Tesseract | +15% accuracy |
| Photos | Resize → Color correction → Compression | PIL, ImageMagick | +8% accuracy |
| Low Light | Histogram equalization → Denoise | CLAHE, BM3D | +22% accuracy |
| Multi-page PDFs | Split → Page ordering → Format normalize | pdf2image, PyPDF2 | Enables batch processing |
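For the low-light and scanned-document rows, the core operations are contrast equalization and denoising before OCR. The sketch below uses OpenCV's CLAHE (named in the table) and non-local-means denoising as a stand-in for BM3D; the parameter values are starting points, not tuned settings.

```python
# Scanned-page / low-light cleanup before OCR.
import cv2

def enhance_for_ocr(input_path: str, output_path: str) -> None:
    gray = cv2.imread(input_path, cv2.IMREAD_GRAYSCALE)

    # Contrast-limited adaptive histogram equalization for uneven lighting.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(gray)

    # Remove scan/sensor noise while preserving character edges.
    denoised = cv2.fastNlMeansDenoising(equalized, h=10)

    cv2.imwrite(output_path, denoised)
```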
Minimal Code Example
```python
# Document QA with a VLM: send an invoice image plus an extraction prompt.
import base64

import anthropic

client = anthropic.Anthropic(api_key="...")

# Images are passed to the API as base64-encoded content blocks.
with open("invoice.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
        {"type": "text",
         "text": "Extract all line items with prices. Cite specific text."},
    ]}],
)
print(message.content[0].text)
```
Case Study: Insurance Claims Processing
Challenge
A major insurance company processed 50,000+ claims monthly, with an average processing time of 45 minutes per claim. Manual document review created backlogs and an 18% error rate.
Solution Architecture
```mermaid
graph TB
    A[Claim Submission] --> B[Document Triage]
    B --> C[Quality Check]
    C --> D[PII Redaction]
    D --> E[Parallel Processing]
    E --> F[OCR Engine]
    E --> G[VLM Analysis]
    F --> H[Entity Extraction]
    G --> I[Visual Verification]
    H --> J[Cross-Validation]
    I --> J
    J --> K{Confidence > 0.9?}
    K -->|Yes| L[Auto-Approve]
    K -->|No| M[Human Review]
    L --> N[Claims System]
    M --> O[Review Queue]
    O --> N
```
Results & Impact
| Metric | Before | After | Improvement |
|---|---|---|---|
| Processing Time | 45 min | 17 min | 62% reduction |
| Automation Rate | 0% | 72% | 72% automated |
| Extraction Accuracy | 82% | 96% | +14 pp |
| Error Rate | 18% | 4% | 78% reduction |
| Monthly Cost | $1.3M | $425K | $875K savings |
| Throughput | 50K/month | 125K/month | 150% increase |
Key Success Factors
- Multi-Model Strategy: Combined OCR + Layout Detection + VLM for 96% accuracy
- Privacy-First Design: Automated PII redaction achieved HIPAA compliance
- Grounded Citations: 92% citation accuracy enabled rapid human verification
- Progressive Enhancement: Edge cases improved from 78% → 94% with 5K samples
Cost-Benefit Analysis
- Initial Investment: $500K (6 months of development)
- Annual Costs: $420K total
  - Infrastructure: $180K (GPU, storage)
  - API Costs: $96K (VLM calls)
  - Operations: $144K (monitoring, maintenance)
- Annual Savings: $2.4M total
  - Labor Reduction: $1.05M (reduced manual review hours)
  - Error Reduction: $750K (fewer claim disputes)
  - Fraud Prevention: $600K (better detection)
- ROI: 371% in the first year
- Payback Period: 3.8 months
Evaluation Framework
Quality Metrics Dashboard
| Dimension | Metric | Target | Measurement Method |
|---|---|---|---|
| Accuracy | Exact Match | > 85% | Direct comparison with ground truth |
| Relevance | F1 Score | > 0.90 | Precision × Recall harmonic mean |
| Grounding | Citation Accuracy | > 92% | Text found in source verification |
| Spatial | IoU for Boxes | > 0.75 | Intersection over Union |
| Robustness | Performance on Edge Cases | > 80% | Noise, rotation, quality variations |
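The spatial metric in the table, intersection-over-union (IoU), compares a predicted bounding box against a ground-truth box; a localization counts as correct when IoU exceeds the 0.75 target. A small reference implementation (boxes as `(x1, y1, x2, y2)` tuples, a convention chosen for this sketch):

```python
# Intersection-over-union for two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.333... -> below the 0.75 target
```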
Evaluation Rubric
```mermaid
graph LR
    A[VLM Response] --> B[Accuracy Check]
    A --> C[Faithfulness Check]
    A --> D[Completeness Check]
    B --> E{Factually Correct?}
    E -->|Yes| F[+1.0]
    E -->|Partial| G[+0.5]
    E -->|No| H[0.0]
    C --> I{Grounded in Image?}
    I -->|Fully| J[+1.0]
    I -->|Partially| K[+0.5]
    I -->|Hallucination| L[0.0]
    D --> M{All Elements Covered?}
    M -->|Complete| N[+1.0]
    M -->|Partial| O[+0.5]
    M -->|Missing Key Info| P[0.0]
    F --> Q[Final Score]
    G --> Q
    H --> Q
    J --> Q
    K --> Q
    L --> Q
    N --> Q
    O --> Q
    P --> Q
```
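As a scoring helper, the rubric reduces to three judgments (full, partial, none) averaged into a single score. How the judgments are produced, by human raters or an LLM judge, is outside this sketch; the averaging scheme itself is one reasonable choice, not the only one.

```python
# Rubric above as code: each dimension scores 1.0 / 0.5 / 0.0; final score is the mean.
RUBRIC_POINTS = {"full": 1.0, "partial": 0.5, "none": 0.0}

def rubric_score(accuracy: str, faithfulness: str, completeness: str) -> float:
    dims = (accuracy, faithfulness, completeness)
    return sum(RUBRIC_POINTS[d] for d in dims) / len(dims)

# Correct and fully grounded, but missing one requested field.
print(rubric_score("full", "full", "partial"))   # 0.833...
```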
Deployment Checklist
Pre-Production Validation
- Data Quality
  - Curated eval set with 1,000+ diverse examples
  - Edge cases: rotated, blurred, low-resolution, handwritten
  - Representative of the production distribution
- Model Selection
  - Benchmarked 3+ models on the eval set
  - Latency profiling under load
  - Cost modeling for production scale
- Privacy & Safety
  - PII detection > 95% recall
  - Content moderation integrated
  - Audit logging enabled
- Performance
  - P95 latency < 3s
  - Throughput: 100+ requests/min
  - Error rate < 5%
- Monitoring
  - Quality metrics dashboard
  - Cost tracking per request
  - Failure alerting configured
Key Takeaways
- Model Selection Matters: Choose based on use case, not just benchmark scores
- Privacy First: Implement PII detection before any cloud processing
- Ground Everything: Citations and confidence scores build trust
- Measure Faithfulness: Accuracy alone isn't enough—check hallucinations
- Plan for Edge Cases: 80% of effort goes to the last 20% of accuracy
- Hybrid Approaches Win: OCR + VLM often beats pure VLM for documents