Part 4: Generative AI & LLM Consulting

Chapter 23 — Customization: Fine-Tuning & Adaptation

Overview

Adapt models for domain language, formats, and tasks via fine-tuning, adapters, or preference optimization.

While prompt engineering achieves remarkable results, some applications require models that inherently understand domain-specific language, adhere to particular formats, or align with specific preferences. Fine-tuning and model adaptation specialize foundation models for unique requirements, improving performance beyond what prompting alone can achieve.

The Customization Decision Tree

graph TB
    A[Need Better Performance?] --> B{What's the Issue?}
    B -->|Format Inconsistency| C[Few-Shot Examples<br/>or Fine-Tuning]
    B -->|Domain Knowledge| D[RAG or Fine-Tuning]
    B -->|Style/Tone| E[Prompt Engineering<br/>or DPO]
    B -->|Task Capability| F[Fine-Tuning]
    C --> G{Budget & Scale?}
    D --> G
    E --> G
    F --> G
    G -->|<$5K| H[Prompt Engineering<br/>Few-Shot Learning]
    G -->|$5K-50K| I[LoRA Fine-Tuning<br/>Single A100, Days]
    G -->|>$50K| J[Full Fine-Tuning<br/>Multiple GPUs, Weeks]

Customization Methods Comparison

| Method | Data Required | Compute | Cost | Flexibility | Best For |
|---|---|---|---|---|---|
| Prompt Engineering | 0-10 examples | None | $0 | Very High | Quick iteration, testing |
| Few-Shot Learning | 5-100 examples | None | Low | High | Task demonstration |
| LoRA | 1K-100K+ | 1x A100 | Medium | High | Efficient specialization |
| Full Fine-Tuning | 10K-100K+ | 4-8x A100 | High | Low | Maximum performance |
| DPO/RLHF | 1K-10K pairs | 2-4x A100 | Medium-High | Medium | Alignment, style |

When to Fine-Tune vs When to Prompt

graph TB
    A[Performance Gap] --> B{Root Cause}
    B -->|Format Adherence| C{Consistency?}
    C -->|>95% needed| D[Fine-Tune]
    C -->|<95% okay| E[Prompt + Validation]
    B -->|Domain Knowledge| F{Knowledge Type?}
    F -->|Factual/Changing| G[RAG]
    F -->|Reasoning Patterns| H[Fine-Tune]
    B -->|Style/Tone| I{Examples Available?}
    I -->|<100| J[Few-Shot Prompting]
    I -->|>1000| K[DPO or Fine-Tune]
    B -->|Task Capability| L{New Task?}
    L -->|Similar to Existing| M[Prompt Engineering]
    L -->|Novel Task| N[Fine-Tune]

Decision Matrix

| Scenario | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Strict JSON format (99%+) | ⚠️ (85-90%) | ⚠️ (85-90%) | ✅ (98%+) |
| Updated product info | ❌ | ✅ | ⚠️ (stale quickly) |
| Domain-specific terminology | ⚠️ (with examples) | ✅ (with docs) | ✅ |
| Consistent brand voice | ⚠️ (variable) | ❌ | ✅ |
| Novel task type | ❌ | ❌ | ✅ |
| Budget <$1K | ✅ | ⚠️ | ❌ |
| Need fast iteration | ✅ | ✅ | ❌ |

Fine-Tuning Approaches

1. Full Fine-Tuning vs LoRA

| Aspect | Full Fine-Tuning | LoRA (Low-Rank Adaptation) |
|---|---|---|
| Parameters Trained | All (7B-70B) | <1% (4M-50M) |
| GPU Memory | 80-320GB | 16-40GB |
| Training Time | Days-Weeks | Hours-Days |
| Storage | 14-140GB | 20-200MB |
| Performance | 100% baseline | 95-99% of full FT |
| Cost | $1K-10K | $100-1K |
| Multiple Adapters | No | Yes (swap on same base) |
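
Because LoRA leaves the base weights untouched, several adapters can live alongside one shared base model and be swapped at request time. A minimal sketch with the PEFT library; the adapter paths and names ("./adapters/support", "./adapters/legal") are illustrative placeholders:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the first adapter (wraps the frozen base weights)
model = PeftModel.from_pretrained(base, "./adapters/support", adapter_name="support")

# Register a second adapter on the same base and switch between them as needed
model.load_adapter("./adapters/legal", adapter_name="legal")
model.set_adapter("legal")    # activate the legal adapter
model.set_adapter("support")  # or switch back; the multi-GB base is loaded only once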

2. Supervised Fine-Tuning (SFT)

Purpose: Teach model new tasks or domain knowledge

Data Format:

{
  "instruction": "Extract key financial metrics from the earnings report.",
  "input": "Q3 revenue was $2.5B, up 15% YoY. Net income reached $500M.",
  "output": "{\"revenue\": \"$2.5B\", \"growth\": \"15% YoY\", \"net_income\": \"$500M\"}"
}
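
At training time each record is flattened into a single prompt/response string. A minimal sketch of one common (Alpaca-style) template; the exact delimiters are a project choice, not a requirement of the method:

def format_example(record: dict) -> str:
    """Flatten an instruction/input/output record into one training string."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )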

Essential Implementation:

# Minimal LoRA fine-tuning (assumes `dataset` is a tokenized instruction dataset)
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Configure LoRA - only train tiny adapter matrices on the attention projections
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
# Result: only ~0.06% of parameters trainable (~4M vs 7B)

trainer = Trainer(model=model, train_dataset=dataset)
trainer.train()
model.save_pretrained("./lora_adapters")  # Only ~20MB of adapter weights!

3. Preference Optimization (DPO/RLHF)

Purpose: Align model outputs with human preferences (style, safety, helpfulness)

Data Format (Preference Pairs):

{
  "prompt": "Explain quantum computing to a 10-year-old",
  "chosen": "Imagine regular computers count on fingers. Quantum computers have magic hands that try all finger combinations at once!",
  "rejected": "Quantum computing leverages superposition and entanglement to perform computations using qubits."
}
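
Under the hood, DPO trains on these pairs with a simple contrastive objective: raise the policy's log-probability of the chosen response relative to the rejected one, measured against a frozen reference model. A minimal PyTorch sketch of that loss; the inputs are summed per-response token log-probs, and libraries such as TRL's DPOTrainer handle batching and the reference model for you:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective over a batch of (chosen, rejected) response log-probs."""
    # How much more (or less) likely each response is under the policy vs the reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that widens the margin between chosen and rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()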

When to Use DPO:

  • ✅ Have preference data (chosen/rejected pairs)
  • ✅ Want to improve style, tone, helpfulness
  • ✅ Need alignment without complex RL setup
  • ✅ Budget allows 1K-10K preference examples

Data Preparation

Quality Over Quantity

| Data Volume | Quality | Typical Results | Best Practices |
|---|---|---|---|
| 100-500 | Very High (human-curated) | Good for narrow tasks | Expert curation, multiple reviews |
| 500-5K | High (filtered synthetic) | Good for most tasks | LLM generation + validation |
| 5K-50K | Medium (mixed sources) | Best performance | Combine human + synthetic + filtering |
| >50K | Variable (web-scraped) | Diminishing returns | Strong deduplication, quality filtering |
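
Deduplication and basic quality filtering are easy to automate. A minimal sketch; the minimum-length threshold and field names match the SFT record format above and are otherwise arbitrary:

import hashlib

def filter_and_dedupe(examples: list[dict], min_output_chars: int = 20) -> list[dict]:
    """Drop near-empty records and exact duplicates before training."""
    seen, kept = set(), []
    for ex in examples:
        if len(ex.get("output", "")) < min_output_chars:
            continue  # incomplete or trivially short target
        key = hashlib.sha256(
            (ex["instruction"] + ex.get("input", "") + ex["output"]).encode()
        ).hexdigest()
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        kept.append(ex)
    return kept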

Data Collection Strategies

| Source | Volume | Quality | Cost | Best For |
|---|---|---|---|---|
| Human Annotation | Low (100s) | Highest | $5K-50K | Critical tasks, gold standards |
| Expert Curation | Medium (1000s) | High | $1K-10K | Domain specialization |
| LLM Synthesis | High (10K+) | Medium | $100-1K | Data augmentation, scaling |
| User Interactions | High (10K+) | Variable | Low | Real-world patterns |
| Public Datasets | Very High (100K+) | Variable | Free | General capability |

Synthetic Data Generation

Essential Example:

# Generate training data with GPT-4 via the async OpenAI client
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def generate_training_examples(seed_examples: list, count: int) -> list:
    """Scale up training data with LLM-generated variations of seed examples."""
    examples = []
    for seed in seed_examples:
        prompt = f"Create 5 similar examples:\n{seed}\n\nVary details but keep task type."
        response = await client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        # parse_examples: project-specific helper that splits the reply into records
        examples.extend(parse_examples(response.choices[0].message.content))
        if len(examples) >= count:
            break
    return examples[:count]

Training Configuration

Hyperparameter Guide

| Parameter | Typical Range | Impact | Tuning Strategy |
|---|---|---|---|
| Learning Rate | 2e-5 to 5e-5 | Speed & stability | Start at 2e-5, increase if slow |
| Batch Size | 4-16 per GPU | Memory & stability | Largest that fits in VRAM |
| Epochs | 1-5 | Overfitting risk | Stop when validation plateaus |
| LoRA Rank (r) | 8-64 | Capacity vs efficiency | 8 for simple, 64 for complex |
| Warmup Steps | 100-500 | Training stability | ~10% of total steps |
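
These knobs map directly onto Hugging Face TrainingArguments. A sketch using mid-range values from the table; the output directory is arbitrary, and the object is passed to the Trainer shown earlier via its args parameter:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",        # arbitrary checkpoint directory
    learning_rate=2e-5,                # start low; increase if loss falls too slowly
    per_device_train_batch_size=8,     # largest size that fits in VRAM
    num_train_epochs=3,                # stop earlier if validation loss plateaus
    warmup_steps=200,                  # roughly 10% of total optimizer steps
    logging_steps=50,
)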

Preventing Overfitting

graph TB
    A[Training Progress] --> B{Monitor Metrics}
    B --> C[Training Loss<br/>Decreasing]
    B --> D[Validation Loss<br/>Increasing?]
    D -->|Yes| E[OVERFITTING]
    D -->|No| F[Keep Training]
    E --> G[Solutions]
    G --> H[Early Stopping]
    G --> I[More Data]
    G --> J[Lower Learning Rate]
    G --> K[Reduce Epochs]
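
Early stopping is the cheapest of these fixes and is built into the Hugging Face Trainer. A sketch; it assumes a held-out eval_dataset and that the TrainingArguments above also enable periodic evaluation and load_best_model_at_end:

from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,            # needs an eval strategy and load_best_model_at_end=True
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,     # held-out split used to detect divergence
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no improvement
)
trainer.train()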

Evaluation & Safety

Before/After Comparison Framework

| Metric Category | Specific Metrics | Target |
|---|---|---|
| Task Performance | Accuracy, F1, ROUGE | >85% on test set |
| Format Compliance | JSON validity, schema adherence | >95% |
| Regression Testing | Performance on general tasks | Within 5% of base model |
| Safety | Jailbreak resistance, toxic content | >95% pass rate |
| Latency | Inference speed | <10% slowdown |
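
Format compliance is the easiest of these metrics to automate: parse every model output and count the valid ones. A minimal sketch; how you produce base_outputs and finetuned_outputs depends on your inference setup:

import json

def json_compliance_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0

# Compare base vs fine-tuned on the same test prompts, e.g.:
# json_compliance_rate(base_outputs), json_compliance_rate(finetuned_outputs)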

Safety Validation Checklist

  • Test jailbreak resistance (20+ attempts)
  • Check for bias amplification (demographic parity tests)
  • Verify toxic content filtering still works
  • Confirm PII handling hasn't degraded
  • Test refusal capabilities maintained
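
The jailbreak item in this checklist can be scripted as a fixed suite of adversarial prompts scored by refusal rate. A rough sketch; the refusal markers are a crude placeholder heuristic and generate() stands in for whatever inference wrapper you use:

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")  # placeholder heuristic

def jailbreak_pass_rate(adversarial_prompts: list[str], generate) -> float:
    """Fraction of adversarial prompts the model refuses (target: >95% over 20+ attempts)."""
    passed = 0
    for prompt in adversarial_prompts:
        reply = generate(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            passed += 1
    return passed / len(adversarial_prompts)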

Case Study: Customer Support Ticket Summarization

Challenge: Support platform with 50+ agents, inconsistent summary quality, 20-30 min/ticket.

Initial State (Few-Shot GPT-4):

  • Format compliance: 82%
  • Factual accuracy: 91%
  • Consistency: Medium
  • Cost: $0.30/summary
  • Latency: 2.8s

Solution: Fine-Tuned Llama 2 7B with LoRA

graph TB
    A[5,000 Historical Tickets<br/>Human Summaries] --> B[Filter Quality<br/>Remove Incomplete]
    B --> C[3,500 High-Quality<br/>Examples]
    C --> D[Generate 2,000<br/>Synthetic with GPT-4]
    D --> E[5,500 Total Examples<br/>Balanced by Category]
    E --> F[LoRA Fine-Tuning<br/>r=16, 3 epochs<br/>6 hours on A100]
    F --> G[Validation<br/>96% Format Compliance]
    G --> H[Production Deployment]

Training Configuration:

  • Method: LoRA (r=16, alpha=32)
  • Data: 5,500 examples (3,500 real + 2,000 synthetic)
  • Hardware: Single A100, 6 hours
  • Validation: 20% holdout set

Results After 2 Months:

| Metric | Few-Shot GPT-4 | Fine-Tuned Llama 2 7B | Improvement |
|---|---|---|---|
| Format Compliance | 82% | 96% | +14 pts |
| Factual Accuracy | 91% | 94% | +3 pts |
| Agent Consistency | Medium | High | Qualitative |
| Cost per 1K summaries | $30 | $0.30 | -99% |
| Latency P95 | 2.8s | 0.7s | -75% |
| Inference Cost | $450/day | $4.50/day | -99% |

ROI Analysis:

  • Setup Cost: $8K (1 week eng + GPU time)
  • Daily Savings: $445.50 (inference cost reduction)
  • Payback Period: 18 days
  • Annual Savings: $163K
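
The arithmetic behind these figures, spelled out with the numbers from the table above:

setup_cost = 8_000                          # one week of engineering time plus GPU rental
daily_savings = 450.00 - 4.50               # $445.50/day in avoided inference spend
payback_days = setup_cost / daily_savings   # ~18 days
annual_savings = daily_savings * 365        # ~$163K per year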

Key Learnings:

  1. LoRA sufficient: 96% performance of full fine-tuning at 10x lower cost
  2. Synthetic data crucial: Scaled from 3.5K to 5.5K examples cheaply
  3. Format tasks ideal: 96% compliance vs 82% with prompting
  4. Regression testing essential: Caught edge case failures early
  5. Self-hosting wins: 99% cost reduction for high-volume use case

Model Versioning & Management

Version Control Strategy

graph LR
    A[v1.0<br/>Base Model] --> B[Fine-Tune]
    B --> C[v2.0<br/>Eval + Safety]
    C --> D{Pass Gates?}
    D -->|Yes| E[v2.0 Registered]
    D -->|No| F[Rejected]
    E --> G[A/B Test<br/>10% Traffic]
    G --> H{Performance?}
    H -->|Better| I[v2.0 Production<br/>100% Traffic]
    H -->|Worse| J[Rollback to v1.0]
    I --> K[v1.0 Deprecated]

Quality Gates

| Gate | Threshold | Blocker? |
|---|---|---|
| Task Accuracy | >85% on test set | Yes |
| Format Compliance | >95% | Yes |
| Safety Pass Rate | >95% | Yes |
| No Regression | Within 5% on general tasks | Yes |
| Latency | <2x base model | No (warning) |
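
These gates are simple enough to encode directly in a release pipeline. A sketch; the metric names mirror the table, and the metrics dict is assumed to come from your evaluation run:

QUALITY_GATES = {
    "task_accuracy":     (0.85, True),   # (threshold, blocking?)
    "format_compliance": (0.95, True),
    "safety_pass_rate":  (0.95, True),
    "regression_delta":  (-0.05, True),  # at most a 5% drop vs the base model on general tasks
}

def passes_gates(metrics: dict) -> bool:
    """Return True only if every blocking gate clears its threshold."""
    ok = True
    for name, (threshold, blocking) in QUALITY_GATES.items():
        if metrics.get(name, 0.0) < threshold:
            print(f"{'BLOCKED' if blocking else 'WARNING'}: {name} below {threshold}")
            ok = ok and not blocking
    return ok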

Implementation Checklist

Phase 1: Planning (Week 1)

  • Define fine-tuning goals (format, domain, style, capability)
  • Determine if fine-tuning necessary vs prompting/RAG
  • Choose method (LoRA vs full vs DPO)
  • Estimate data requirements (1K-100K examples)
  • Plan compute resources and budget ($100-10K)

Phase 2: Data Collection (Week 2-3)

  • Collect initial examples from target domain
  • Filter for quality (completeness, correctness, consistency)
  • Generate synthetic data if needed (GPT-4 synthesis)
  • Balance dataset across categories/tasks
  • Create holdout validation (20%) and test (10%) sets (see the split sketch after this checklist)
  • Check for PII and sensitive content
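
A minimal sketch of that split; it assumes the filtered example list from the data-preparation step and a fixed seed so the split is reproducible:

import random

def split_dataset(examples: list, seed: int = 42):
    """Shuffle and split into 70% train / 20% validation / 10% test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(0.7 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])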

Phase 3: Training (Week 3-4)

  • Set up GPU infrastructure (A100 recommended)
  • Configure hyperparameters (LR, batch size, epochs)
  • Implement logging and checkpointing
  • Run training with validation monitoring
  • Watch for overfitting (validation loss diverging)
  • Save best checkpoint

Phase 4: Evaluation (Week 4-5)

  • Run before/after comparison on test set
  • Check regression on general capabilities
  • Validate safety properties maintained
  • Get human evaluation on sample outputs (50-100)
  • Document performance metrics

Phase 5: Deployment (Week 5-6)

  • Create model card with metadata and performance
  • Version and register model
  • Set up A/B testing (10% traffic initially)
  • Implement monitoring (latency, accuracy, cost)
  • Plan rollback strategy
  • Document usage guidelines

Common Pitfalls & Solutions

PitfallImpactSolution
Too little dataPoor generalizationGenerate synthetic examples, use LoRA
Low-quality dataModel learns bad patternsStrong filtering, human validation
OverfittingGreat on train, poor on testEarly stopping, more data, regularization
No safety testingJailbreak vulnerabilitiesComprehensive safety test suite
Ignoring regressionLost general capabilitiesTest on diverse general tasks
Wrong base modelPoor starting pointMatch base model to domain

When to Fine-Tune (Decision Framework)

Fine-Tune When:

  • ✅ Need >95% format compliance (JSON, templates)
  • ✅ Domain-specific reasoning patterns (legal, medical, code)
  • ✅ Consistent style/tone critical (brand voice)
  • ✅ Novel task type not in base model training
  • ✅ High-volume use case (cost reduction important)
  • ✅ Have 1K+ high-quality training examples
  • ✅ Budget allows $5K+ for training

Don't Fine-Tune When:

  • ❌ Prompting achieves >90% accuracy
  • ❌ RAG solves the problem (factual/changing knowledge)
  • ❌ <500 examples available
  • ❌ Need fast iteration (<1 week cycles)
  • ❌ Budget <$5K total
  • ❌ Task changes frequently (>monthly)

Advanced Topics

Multi-Task Fine-Tuning

graph TB
    A[Base Model] --> B[Task 1 Data<br/>Classification]
    A --> C[Task 2 Data<br/>Extraction]
    A --> D[Task 3 Data<br/>Generation]
    B --> E[Combined Dataset<br/>Balanced Sampling]
    C --> E
    D --> E
    E --> F[Single Fine-Tuned Model<br/>Multi-Capable]
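
The "balanced sampling" step in this diagram can be handled with the Hugging Face datasets library. A sketch assuming three per-task Dataset objects (classification_ds, extraction_ds, generation_ds) have already been prepared; equal probabilities are an assumption, not a rule:

from datasets import interleave_datasets

combined = interleave_datasets(
    [classification_ds, extraction_ds, generation_ds],
    probabilities=[1/3, 1/3, 1/3],  # equal sampling so no single task dominates training
    seed=42,
)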

Benefits:

  • One model handles multiple tasks
  • Better generalization across tasks
  • Lower deployment complexity

Challenges:

  • Need balanced data across tasks
  • Risk of task interference
  • Harder to optimize per-task

Continual Learning

Problem: How to add new capabilities without forgetting old ones?

Solutions:

  1. Replay: Mix old training data with new
  2. Elastic Weight Consolidation: Protect important weights
  3. Adapter Stacking: Add new LoRA for new tasks
  4. Periodic Retraining: Full retrain every N months

Why It Matters

Business Impact:

  • Accuracy: 10-30% improvement over prompting for specialized tasks
  • Cost: 90-99% reduction for high-volume use cases (self-hosted)
  • Consistency: 95%+ format compliance vs 80-90% with prompting
  • Differentiation: Proprietary models as competitive advantage
  • Privacy: Keep sensitive training data in-house

Technical Impact:

  • Specialization: Models internalize domain knowledge
  • Efficiency: Smaller fine-tuned models outperform larger general ones
  • Control: Full ownership of model weights and behavior
  • Scalability: LoRA enables multiple specialized models efficiently