Chapter 23 — Customization: Fine-Tuning & Adaptation

Overview

Adapt models for domain language, formats, and tasks via fine-tuning, adapters, or preference optimization.

While prompt engineering achieves remarkable results, some applications require models that inherently understand domain-specific language, adhere to particular formats, or align with specific preferences. Fine-tuning and model adaptation specialize foundation models for unique requirements, improving performance beyond what prompting alone can achieve.

The Customization Decision Tree

graph TB
    A[Need Better Performance?] --> B{What's the Issue?}

    B -->|Format Inconsistency| C[Few-Shot Examples<br/>or Fine-Tuning]
    B -->|Domain Knowledge| D[RAG or Fine-Tuning]
    B -->|Style/Tone| E[Prompt Engineering<br/>or DPO]
    B -->|Task Capability| F[Fine-Tuning]

    C --> G{Budget & Scale?}
    D --> G
    E --> G
    F --> G

    G -->|<$5K| H[Prompt Engineering<br/>Few-Shot Learning]
    G -->|$5K-50K| I[LoRA Fine-Tuning<br/>Single A100, Days]
    G -->|>$50K| J[Full Fine-Tuning<br/>Multiple GPUs, Weeks]

Customization Methods Comparison

Method	Data Required	Compute	Cost	Flexibility	Best For
Prompt Engineering	0-10 examples	None	$0	Very High	Quick iteration, testing
Few-Shot Learning	5-100 examples	None	Low	High	Task demonstration
LoRA	1K-100K+	1x A100	Medium	High	Efficient specialization
Full Fine-Tuning	10K-100K+	4-8x A100	High	Low	Maximum performance
DPO/RLHF	1K-10K pairs	2-4x A100	Medium-High	Medium	Alignment, style

##When to Fine-Tune vs When to Prompt

graph TB
    A[Performance Gap] --> B{Root Cause}

    B -->|Format Adherence| C{Consistency?}
    C -->|>95% needed| D[Fine-Tune]
    C -->|<95% okay| E[Prompt + Validation]

    B -->|Domain Knowledge| F{Knowledge Type?}
    F -->|Factual/Changing| G[RAG]
    F -->|Reasoning Patterns| H[Fine-Tune]

    B -->|Style/Tone| I{Examples Available?}
    I -->|<100| J[Few-Shot Prompting]
    I -->|>1000| K[DPO or Fine-Tune]

    B -->|Task Capability| L{New Task?}
    L -->|Similar to Existing| M[Prompt Engineering]
    L -->|Novel Task| N[Fine-Tune]

Decision Matrix

Scenario	Prompting	RAG	Fine-Tuning
Strict JSON format (99%+)	⚠️ (85-90%)	⚠️ (85-90%)	✅ (98%+)
Updated product info	❌	✅	⚠️ (stale quickly)
Domain-specific terminology	⚠️ (with examples)	✅ (with docs)	✅
Consistent brand voice	⚠️ (variable)	❌	✅
Novel task type	❌	❌	✅
Budget <$1K	✅	✅	❌
Need fast iteration	✅	✅	❌

Fine-Tuning Approaches

1. Full Fine-Tuning vs LoRA

Aspect	Full Fine-Tuning	LoRA (Low-Rank Adaptation)
Parameters Trained	All (7B-70B)	<1% (4M-50M)
GPU Memory	80-320GB	16-40GB
Training Time	Days-Weeks	Hours-Days
Storage	14-140GB	20-200MB
Performance	100% baseline	95-99% of full FT
Cost	$1K-10K	$100-1K
Multiple Adapters	No	Yes (swap on same base)

2. Supervised Fine-Tuning (SFT)

Purpose: Teach model new tasks or domain knowledge

Data Format:

{
  "instruction": "Extract key financial metrics from the earnings report.",
  "input": "Q3 revenue was $2.5B, up 15% YoY. Net income reached $500M.",
  "output": "{\"revenue\": \"$2.5B\", \"growth\": \"15% YoY\", \"net_income\": \"$500M\"}"
}

Essential Implementation:

# Minimal LoRA fine-tuning (15 lines)
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Configure LoRA - only train tiny adapter matrices
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
# Result: Only 0.06% of parameters trainable (4M vs 7B)

trainer = Trainer(model=model, train_dataset=dataset)
trainer.train()
model.save_pretrained("./lora_adapters")  # Only 20MB!

3. Preference Optimization (DPO/RLHF)

Purpose: Align model outputs with human preferences (style, safety, helpfulness)

Data Format (Preference Pairs):

{
  "prompt": "Explain quantum computing to a 10-year-old",
  "chosen": "Imagine regular computers count on fingers. Quantum computers have magic hands that try all finger combinations at once!",
  "rejected": "Quantum computing leverages superposition and entanglement to perform computations using qubits."
}

When to Use DPO:

✅ Have preference data (chosen/rejected pairs)
✅ Want to improve style, tone, helpfulness
✅ Need alignment without complex RL setup
✅ Budget allows 1K-10K preference examples

Data Preparation

Quality Over Quantity

Data Volume	Quality	Typical Results	Best Practices
100-500	Very High (human-curated)	Good for narrow tasks	Expert curation, multiple reviews
500-5K	High (filtered synthetic)	Good for most tasks	LLM generation + validation
5K-50K	Medium (mixed sources)	Best performance	Combine human + synthetic + filtering
>50K	Variable (web-scraped)	Diminishing returns	Strong deduplication, quality filtering

Data Collection Strategies

Source	Volume	Quality	Cost	Best For
Human Annotation	Low (100s)	Highest	$5K-50K	Critical tasks, gold standards
Expert Curation	Medium (1000s)	High	$1K-10K	Domain specialization
LLM Synthesis	High (10K+)	Medium	$100-1K	Data augmentation, scaling
User Interactions	High (10K+)	Variable	Low	Real-world patterns
Public Datasets	Very High (100K+)	Variable	Free	General capability

Synthetic Data Generation

Essential Example:

# Generate training data with GPT-4 (10 lines)
async def generate_training_examples(seed_examples: list, count: int) -> list:
    """Scale up training data with LLM"""
    examples = []
    for seed in seed_examples:
        prompt = f"Create 5 similar examples:\n{seed}\n\nVary details but keep task type."
        response = await openai.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        examples.extend(parse_examples(response))
        if len(examples) >= count: break
    return examples[:count]

Training Configuration

Hyperparameter Guide

Parameter	Typical Range	Impact	Tuning Strategy
Learning Rate	2e-5 to 5e-5	Speed & stability	Start 2e-5, increase if slow
Batch Size	4-16 per GPU	Memory & stability	Largest that fits in VRAM
Epochs	1-5	Overfitting risk	Stop when validation plateaus
LoRA Rank (r)	8-64	Capacity vs efficiency	8 for simple, 64 for complex
Warmup Steps	100-500	Training stability	10% of total steps

Preventing Overfitting

graph TB
    A[Training Progress] --> B{Monitor Metrics}

    B --> C[Training Loss<br/>Decreasing]
    B --> D[Validation Loss<br/>Increasing?]

    D -->|Yes| E[OVERFITTING]
    D -->|No| F[Keep Training]

    E --> G[Solutions]
    G --> H[Early Stopping]
    G --> I[More Data]
    G --> J[Lower Learning Rate]
    G --> K[Reduce Epochs]

Evaluation & Safety

Before/After Comparison Framework

Metric Category	Specific Metrics	Target
Task Performance	Accuracy, F1, ROUGE	>85% on test set
Format Compliance	JSON validity, schema adherence	>95%
Regression Testing	Performance on general tasks	Within 5% of base model
Safety	Jailbreak resistance, toxic content	>95% pass rate
Latency	Inference speed	<10% slowdown

Safety Validation Checklist

Test jailbreak resistance (20+ attempts)
Check for bias amplification (demographic parity tests)
Verify toxic content filtering still works
Confirm PII handling hasn't degraded
Test refusal capabilities maintained

Case Study: Customer Support Ticket Summarization

Challenge: Support platform with 50+ agents, inconsistent summary quality, 20-30 min/ticket.

Initial State (Few-Shot GPT-4):

Format compliance: 82%
Factual accuracy: 91%
Consistency: Medium
Cost: $0.30/summary
Latency: 2.8s

Solution: Fine-Tuned Llama 2 7B with LoRA

graph TB
    A[5,000 Historical Tickets<br/>Human Summaries] --> B[Filter Quality<br/>Remove Incomplete]
    B --> C[3,500 High-Quality<br/>Examples]
    C --> D[Generate 2,000<br/>Synthetic with GPT-4]

    D --> E[5,500 Total Examples<br/>Balanced by Category]

    E --> F[LoRA Fine-Tuning<br/>r=16, 3 epochs<br/>6 hours on A100]

    F --> G[Validation<br/>96% Format Compliance]
    G --> H[Production Deployment]

Training Configuration:

Method: LoRA (r=16, alpha=32)
Data: 5,500 examples (3,500 real + 2,000 synthetic)
Hardware: Single A100, 6 hours
Validation: 20% holdout set

Results After 2 Months:

Metric	Few-Shot GPT-4	Fine-Tuned Llama 2 7B	Improvement
Format Compliance	82%	96%	+14%
Factual Accuracy	91%	94%	+3%
Agent Consistency	Medium	High	Qualitative
Cost per 1K	$30	$0.30	-99%
Latency P95	2.8s	0.7s	-75%
Inference Cost	$450/day	$4.50/day	-99%

ROI Analysis:

Setup Cost: $8K (1 week eng + GPU time)
Daily Savings: $445.50 (inference cost reduction)
Payback Period: 18 days
Annual Savings: $163K

Key Learnings:

LoRA sufficient: 96% performance of full fine-tuning at 10x lower cost
Synthetic data crucial: Scaled from 3.5K to 5.5K examples cheaply
Format tasks ideal: 96% compliance vs 82% with prompting
Regression testing essential: Caught edge case failures early
Self-hosting wins: 99% cost reduction for high-volume use case

Model Versioning & Management

Version Control Strategy

graph LR
    A[v1.0<br/>Base Model] --> B[Fine-Tune]
    B --> C[v2.0<br/>Eval + Safety]

    C --> D{Pass Gates?}
    D -->|Yes| E[v2.0 Registered]
    D -->|No| F[Rejected]

    E --> G[A/B Test<br/>10% Traffic]
    G --> H{Performance?}

    H -->|Better| I[v2.0 Production<br/>100% Traffic]
    H -->|Worse| J[Rollback to v1.0]

    I --> K[v1.0 Deprecated]

Quality Gates

Gate	Threshold	Blocker?
Task Accuracy	>85% on test set	Yes
Format Compliance	>95%	Yes
Safety Pass Rate	>95%	Yes
No Regression	Within 5% on general tasks	Yes
Latency	<2x base model	No (warning)

Implementation Checklist

Phase 1: Planning (Week 1)

Define fine-tuning goals (format, domain, style, capability)
Determine if fine-tuning necessary vs prompting/RAG
Choose method (LoRA vs full vs DPO)
Estimate data requirements (1K-100K examples)
Plan compute resources and budget ($100-10K)

Phase 2: Data Collection (Week 2-3)

Collect initial examples from target domain
Filter for quality (completeness, correctness, consistency)
Generate synthetic data if needed (GPT-4 synthesis)
Balance dataset across categories/tasks
Create holdout validation (20%) and test (10%) sets
Check for PII and sensitive content

Phase 3: Training (Week 3-4)

Set up GPU infrastructure (A100 recommended)
Configure hyperparameters (LR, batch size, epochs)
Implement logging and checkpointing
Run training with validation monitoring
Watch for overfitting (validation loss diverging)
Save best checkpoint

Phase 4: Evaluation (Week 4-5)

Run before/after comparison on test set
Check regression on general capabilities
Validate safety properties maintained
Get human evaluation on sample outputs (50-100)
Document performance metrics

Phase 5: Deployment (Week 5-6)

Create model card with metadata and performance
Version and register model
Set up A/B testing (10% traffic initially)
Implement monitoring (latency, accuracy, cost)
Plan rollback strategy
Document usage guidelines

Common Pitfalls & Solutions

Pitfall	Impact	Solution
Too little data	Poor generalization	Generate synthetic examples, use LoRA
Low-quality data	Model learns bad patterns	Strong filtering, human validation
Overfitting	Great on train, poor on test	Early stopping, more data, regularization
No safety testing	Jailbreak vulnerabilities	Comprehensive safety test suite
Ignoring regression	Lost general capabilities	Test on diverse general tasks
Wrong base model	Poor starting point	Match base model to domain

When to Fine-Tune (Decision Framework)

Fine-Tune When:

✅ Need >95% format compliance (JSON, templates)
✅ Domain-specific reasoning patterns (legal, medical, code)
✅ Consistent style/tone critical (brand voice)
✅ Novel task type not in base model training
✅ High-volume use case (cost reduction important)
✅ Have 1K+ high-quality training examples
✅ Budget allows $5K+ for training

Don't Fine-Tune When:

❌ Prompting achieves >90% accuracy
❌ RAG solves the problem (factual/changing knowledge)
❌ <500 examples available
❌ Need fast iteration (<1 week cycles)
❌ Budget <$5K total
❌ Task changes frequently (>monthly)

Advanced Topics

Multi-Task Fine-Tuning

graph TB
    A[Base Model] --> B[Task 1 Data<br/>Classification]
    A --> C[Task 2 Data<br/>Extraction]
    A --> D[Task 3 Data<br/>Generation]

    B --> E[Combined Dataset<br/>Balanced Sampling]
    C --> E
    D --> E

    E --> F[Single Fine-Tuned Model<br/>Multi-Capable]

Benefits:

One model handles multiple tasks
Better generalization across tasks
Lower deployment complexity

Challenges:

Need balanced data across tasks
Risk of task interference
Harder to optimize per-task

Continual Learning

Problem: How to add new capabilities without forgetting old ones?

Solutions:

Replay: Mix old training data with new
Elastic Weight Consolidation: Protect important weights
Adapter Stacking: Add new LoRA for new tasks
Periodic Retraining: Full retrain every N months

Why It Matters

Business Impact:

Accuracy: 10-30% improvement over prompting for specialized tasks
Cost: 90-99% reduction for high-volume use cases (self-hosted)
Consistency: 95%+ format compliance vs 80-90% with prompting
Differentiation: Proprietary models as competitive advantage
Privacy: Keep sensitive training data in-house

Technical Impact:

Specialization: Models internalize domain knowledge
Efficiency: Smaller fine-tuned models outperform larger general ones
Control: Full ownership of model weights and behavior
Scalability: LoRA enables multiple specialized models efficiently

Chapter 23: Customization: FT & Adaptation

23. Customization: FT & Adaptation