23. Customization: FT & Adaptation
Chapter 23 — Customization: Fine-Tuning & Adaptation
Overview
Adapt models for domain language, formats, and tasks via fine-tuning, adapters, or preference optimization.
While prompt engineering achieves remarkable results, some applications require models that inherently understand domain-specific language, adhere to particular formats, or align with specific preferences. Fine-tuning and model adaptation specialize foundation models for unique requirements, improving performance beyond what prompting alone can achieve.
The Customization Decision Tree
graph TB A[Need Better Performance?] --> B{What's the Issue?} B -->|Format Inconsistency| C[Few-Shot Examples<br/>or Fine-Tuning] B -->|Domain Knowledge| D[RAG or Fine-Tuning] B -->|Style/Tone| E[Prompt Engineering<br/>or DPO] B -->|Task Capability| F[Fine-Tuning] C --> G{Budget & Scale?} D --> G E --> G F --> G G -->|<$5K| H[Prompt Engineering<br/>Few-Shot Learning] G -->|$5K-50K| I[LoRA Fine-Tuning<br/>Single A100, Days] G -->|>$50K| J[Full Fine-Tuning<br/>Multiple GPUs, Weeks]
Customization Methods Comparison
| Method | Data Required | Compute | Cost | Flexibility | Best For |
|---|---|---|---|---|---|
| Prompt Engineering | 0-10 examples | None | $0 | Very High | Quick iteration, testing |
| Few-Shot Learning | 5-100 examples | None | Low | High | Task demonstration |
| LoRA | 1K-100K+ | 1x A100 | Medium | High | Efficient specialization |
| Full Fine-Tuning | 10K-100K+ | 4-8x A100 | High | Low | Maximum performance |
| DPO/RLHF | 1K-10K pairs | 2-4x A100 | Medium-High | Medium | Alignment, style |
##When to Fine-Tune vs When to Prompt
graph TB A[Performance Gap] --> B{Root Cause} B -->|Format Adherence| C{Consistency?} C -->|>95% needed| D[Fine-Tune] C -->|<95% okay| E[Prompt + Validation] B -->|Domain Knowledge| F{Knowledge Type?} F -->|Factual/Changing| G[RAG] F -->|Reasoning Patterns| H[Fine-Tune] B -->|Style/Tone| I{Examples Available?} I -->|<100| J[Few-Shot Prompting] I -->|>1000| K[DPO or Fine-Tune] B -->|Task Capability| L{New Task?} L -->|Similar to Existing| M[Prompt Engineering] L -->|Novel Task| N[Fine-Tune]
Decision Matrix
| Scenario | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Strict JSON format (99%+) | ⚠️ (85-90%) | ⚠️ (85-90%) | ✅ (98%+) |
| Updated product info | ❌ | ✅ | ⚠️ (stale quickly) |
| Domain-specific terminology | ⚠️ (with examples) | ✅ (with docs) | ✅ |
| Consistent brand voice | ⚠️ (variable) | ❌ | ✅ |
| Novel task type | ❌ | ❌ | ✅ |
| Budget <$1K | ✅ | ✅ | ❌ |
| Need fast iteration | ✅ | ✅ | ❌ |
Fine-Tuning Approaches
1. Full Fine-Tuning vs LoRA
| Aspect | Full Fine-Tuning | LoRA (Low-Rank Adaptation) |
|---|---|---|
| Parameters Trained | All (7B-70B) | <1% (4M-50M) |
| GPU Memory | 80-320GB | 16-40GB |
| Training Time | Days-Weeks | Hours-Days |
| Storage | 14-140GB | 20-200MB |
| Performance | 100% baseline | 95-99% of full FT |
| Cost | $1K-10K | $100-1K |
| Multiple Adapters | No | Yes (swap on same base) |
2. Supervised Fine-Tuning (SFT)
Purpose: Teach model new tasks or domain knowledge
Data Format:
{
"instruction": "Extract key financial metrics from the earnings report.",
"input": "Q3 revenue was $2.5B, up 15% YoY. Net income reached $500M.",
"output": "{\"revenue\": \"$2.5B\", \"growth\": \"15% YoY\", \"net_income\": \"$500M\"}"
}
Essential Implementation:
# Minimal LoRA fine-tuning (15 lines)
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
from peft import LoraConfig, get_peft_model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
# Configure LoRA - only train tiny adapter matrices
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
# Result: Only 0.06% of parameters trainable (4M vs 7B)
trainer = Trainer(model=model, train_dataset=dataset)
trainer.train()
model.save_pretrained("./lora_adapters") # Only 20MB!
3. Preference Optimization (DPO/RLHF)
Purpose: Align model outputs with human preferences (style, safety, helpfulness)
Data Format (Preference Pairs):
{
"prompt": "Explain quantum computing to a 10-year-old",
"chosen": "Imagine regular computers count on fingers. Quantum computers have magic hands that try all finger combinations at once!",
"rejected": "Quantum computing leverages superposition and entanglement to perform computations using qubits."
}
When to Use DPO:
- ✅ Have preference data (chosen/rejected pairs)
- ✅ Want to improve style, tone, helpfulness
- ✅ Need alignment without complex RL setup
- ✅ Budget allows 1K-10K preference examples
Data Preparation
Quality Over Quantity
| Data Volume | Quality | Typical Results | Best Practices |
|---|---|---|---|
| 100-500 | Very High (human-curated) | Good for narrow tasks | Expert curation, multiple reviews |
| 500-5K | High (filtered synthetic) | Good for most tasks | LLM generation + validation |
| 5K-50K | Medium (mixed sources) | Best performance | Combine human + synthetic + filtering |
| >50K | Variable (web-scraped) | Diminishing returns | Strong deduplication, quality filtering |
Data Collection Strategies
| Source | Volume | Quality | Cost | Best For |
|---|---|---|---|---|
| Human Annotation | Low (100s) | Highest | $5K-50K | Critical tasks, gold standards |
| Expert Curation | Medium (1000s) | High | $1K-10K | Domain specialization |
| LLM Synthesis | High (10K+) | Medium | $100-1K | Data augmentation, scaling |
| User Interactions | High (10K+) | Variable | Low | Real-world patterns |
| Public Datasets | Very High (100K+) | Variable | Free | General capability |
Synthetic Data Generation
Essential Example:
# Generate training data with GPT-4 (10 lines)
async def generate_training_examples(seed_examples: list, count: int) -> list:
"""Scale up training data with LLM"""
examples = []
for seed in seed_examples:
prompt = f"Create 5 similar examples:\n{seed}\n\nVary details but keep task type."
response = await openai.chat.completions.create(
model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
examples.extend(parse_examples(response))
if len(examples) >= count: break
return examples[:count]
Training Configuration
Hyperparameter Guide
| Parameter | Typical Range | Impact | Tuning Strategy |
|---|---|---|---|
| Learning Rate | 2e-5 to 5e-5 | Speed & stability | Start 2e-5, increase if slow |
| Batch Size | 4-16 per GPU | Memory & stability | Largest that fits in VRAM |
| Epochs | 1-5 | Overfitting risk | Stop when validation plateaus |
| LoRA Rank (r) | 8-64 | Capacity vs efficiency | 8 for simple, 64 for complex |
| Warmup Steps | 100-500 | Training stability | 10% of total steps |
Preventing Overfitting
graph TB A[Training Progress] --> B{Monitor Metrics} B --> C[Training Loss<br/>Decreasing] B --> D[Validation Loss<br/>Increasing?] D -->|Yes| E[OVERFITTING] D -->|No| F[Keep Training] E --> G[Solutions] G --> H[Early Stopping] G --> I[More Data] G --> J[Lower Learning Rate] G --> K[Reduce Epochs]
Evaluation & Safety
Before/After Comparison Framework
| Metric Category | Specific Metrics | Target |
|---|---|---|
| Task Performance | Accuracy, F1, ROUGE | >85% on test set |
| Format Compliance | JSON validity, schema adherence | >95% |
| Regression Testing | Performance on general tasks | Within 5% of base model |
| Safety | Jailbreak resistance, toxic content | >95% pass rate |
| Latency | Inference speed | <10% slowdown |
Safety Validation Checklist
- Test jailbreak resistance (20+ attempts)
- Check for bias amplification (demographic parity tests)
- Verify toxic content filtering still works
- Confirm PII handling hasn't degraded
- Test refusal capabilities maintained
Case Study: Customer Support Ticket Summarization
Challenge: Support platform with 50+ agents, inconsistent summary quality, 20-30 min/ticket.
Initial State (Few-Shot GPT-4):
- Format compliance: 82%
- Factual accuracy: 91%
- Consistency: Medium
- Cost: $0.30/summary
- Latency: 2.8s
Solution: Fine-Tuned Llama 2 7B with LoRA
graph TB A[5,000 Historical Tickets<br/>Human Summaries] --> B[Filter Quality<br/>Remove Incomplete] B --> C[3,500 High-Quality<br/>Examples] C --> D[Generate 2,000<br/>Synthetic with GPT-4] D --> E[5,500 Total Examples<br/>Balanced by Category] E --> F[LoRA Fine-Tuning<br/>r=16, 3 epochs<br/>6 hours on A100] F --> G[Validation<br/>96% Format Compliance] G --> H[Production Deployment]
Training Configuration:
- Method: LoRA (r=16, alpha=32)
- Data: 5,500 examples (3,500 real + 2,000 synthetic)
- Hardware: Single A100, 6 hours
- Validation: 20% holdout set
Results After 2 Months:
| Metric | Few-Shot GPT-4 | Fine-Tuned Llama 2 7B | Improvement |
|---|---|---|---|
| Format Compliance | 82% | 96% | +14% |
| Factual Accuracy | 91% | 94% | +3% |
| Agent Consistency | Medium | High | Qualitative |
| Cost per 1K | $30 | $0.30 | -99% |
| Latency P95 | 2.8s | 0.7s | -75% |
| Inference Cost | $450/day | $4.50/day | -99% |
ROI Analysis:
- Setup Cost: $8K (1 week eng + GPU time)
- Daily Savings: $445.50 (inference cost reduction)
- Payback Period: 18 days
- Annual Savings: $163K
Key Learnings:
- LoRA sufficient: 96% performance of full fine-tuning at 10x lower cost
- Synthetic data crucial: Scaled from 3.5K to 5.5K examples cheaply
- Format tasks ideal: 96% compliance vs 82% with prompting
- Regression testing essential: Caught edge case failures early
- Self-hosting wins: 99% cost reduction for high-volume use case
Model Versioning & Management
Version Control Strategy
graph LR A[v1.0<br/>Base Model] --> B[Fine-Tune] B --> C[v2.0<br/>Eval + Safety] C --> D{Pass Gates?} D -->|Yes| E[v2.0 Registered] D -->|No| F[Rejected] E --> G[A/B Test<br/>10% Traffic] G --> H{Performance?} H -->|Better| I[v2.0 Production<br/>100% Traffic] H -->|Worse| J[Rollback to v1.0] I --> K[v1.0 Deprecated]
Quality Gates
| Gate | Threshold | Blocker? |
|---|---|---|
| Task Accuracy | >85% on test set | Yes |
| Format Compliance | >95% | Yes |
| Safety Pass Rate | >95% | Yes |
| No Regression | Within 5% on general tasks | Yes |
| Latency | <2x base model | No (warning) |
Implementation Checklist
Phase 1: Planning (Week 1)
- Define fine-tuning goals (format, domain, style, capability)
- Determine if fine-tuning necessary vs prompting/RAG
- Choose method (LoRA vs full vs DPO)
- Estimate data requirements (1K-100K examples)
- Plan compute resources and budget ($100-10K)
Phase 2: Data Collection (Week 2-3)
- Collect initial examples from target domain
- Filter for quality (completeness, correctness, consistency)
- Generate synthetic data if needed (GPT-4 synthesis)
- Balance dataset across categories/tasks
- Create holdout validation (20%) and test (10%) sets
- Check for PII and sensitive content
Phase 3: Training (Week 3-4)
- Set up GPU infrastructure (A100 recommended)
- Configure hyperparameters (LR, batch size, epochs)
- Implement logging and checkpointing
- Run training with validation monitoring
- Watch for overfitting (validation loss diverging)
- Save best checkpoint
Phase 4: Evaluation (Week 4-5)
- Run before/after comparison on test set
- Check regression on general capabilities
- Validate safety properties maintained
- Get human evaluation on sample outputs (50-100)
- Document performance metrics
Phase 5: Deployment (Week 5-6)
- Create model card with metadata and performance
- Version and register model
- Set up A/B testing (10% traffic initially)
- Implement monitoring (latency, accuracy, cost)
- Plan rollback strategy
- Document usage guidelines
Common Pitfalls & Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Too little data | Poor generalization | Generate synthetic examples, use LoRA |
| Low-quality data | Model learns bad patterns | Strong filtering, human validation |
| Overfitting | Great on train, poor on test | Early stopping, more data, regularization |
| No safety testing | Jailbreak vulnerabilities | Comprehensive safety test suite |
| Ignoring regression | Lost general capabilities | Test on diverse general tasks |
| Wrong base model | Poor starting point | Match base model to domain |
When to Fine-Tune (Decision Framework)
Fine-Tune When:
- ✅ Need >95% format compliance (JSON, templates)
- ✅ Domain-specific reasoning patterns (legal, medical, code)
- ✅ Consistent style/tone critical (brand voice)
- ✅ Novel task type not in base model training
- ✅ High-volume use case (cost reduction important)
- ✅ Have 1K+ high-quality training examples
- ✅ Budget allows $5K+ for training
Don't Fine-Tune When:
- ❌ Prompting achieves >90% accuracy
- ❌ RAG solves the problem (factual/changing knowledge)
- ❌ <500 examples available
- ❌ Need fast iteration (<1 week cycles)
- ❌ Budget <$5K total
- ❌ Task changes frequently (>monthly)
Advanced Topics
Multi-Task Fine-Tuning
graph TB A[Base Model] --> B[Task 1 Data<br/>Classification] A --> C[Task 2 Data<br/>Extraction] A --> D[Task 3 Data<br/>Generation] B --> E[Combined Dataset<br/>Balanced Sampling] C --> E D --> E E --> F[Single Fine-Tuned Model<br/>Multi-Capable]
Benefits:
- One model handles multiple tasks
- Better generalization across tasks
- Lower deployment complexity
Challenges:
- Need balanced data across tasks
- Risk of task interference
- Harder to optimize per-task
Continual Learning
Problem: How to add new capabilities without forgetting old ones?
Solutions:
- Replay: Mix old training data with new
- Elastic Weight Consolidation: Protect important weights
- Adapter Stacking: Add new LoRA for new tasks
- Periodic Retraining: Full retrain every N months
Why It Matters
Business Impact:
- Accuracy: 10-30% improvement over prompting for specialized tasks
- Cost: 90-99% reduction for high-volume use cases (self-hosted)
- Consistency: 95%+ format compliance vs 80-90% with prompting
- Differentiation: Proprietary models as competitive advantage
- Privacy: Keep sensitive training data in-house
Technical Impact:
- Specialization: Models internalize domain knowledge
- Efficiency: Smaller fine-tuned models outperform larger general ones
- Control: Full ownership of model weights and behavior
- Scalability: LoRA enables multiple specialized models efficiently