Part 6: Solution Patterns (Classical & Applied AI)
Chapter 34 — Computer Vision
Overview
Vision systems for detection, OCR, visual QA, and edge deployment. This chapter covers production computer vision architectures—from task selection and model deployment to edge optimization and continuous monitoring. We focus on practical patterns that balance accuracy, latency, cost, and privacy requirements.
Computer Vision Architecture Patterns
End-to-End CV Pipeline
```mermaid
graph TB
    A[Image Input] --> B[Preprocessing]
    B --> C{Deployment Target}
    C -->|Cloud| D[High-Accuracy Models]
    C -->|Edge| E[Optimized Models]
    D --> F[Batch Processing]
    E --> G[Real-time Inference]
    F --> H[Results Storage]
    G --> H
    H --> I[Post-processing]
    I --> J[Business Logic]
    J --> K[Monitoring & Feedback]
    K --> L[Model Retraining]
    L --> B
    style D fill:#e1f5fe
    style E fill:#f3e5f5
    style K fill:#fff3e0
```
Task Selection Framework
```mermaid
graph TD
    A[CV Problem] --> B{Input Type?}
    B -->|Single Object| C[Classification]
    B -->|Multiple Objects| D{Need Locations?}
    D -->|Yes| E{Need Boundaries?}
    E -->|Boxes| F[Object Detection]
    E -->|Pixels| G[Segmentation]
    D -->|No| H[Multi-label Classification]
    B -->|Text in Image| I[OCR Pipeline]
    B -->|Question About Image| J[Visual QA]
    style F fill:#c8e6c9
    style G fill:#ffccbc
    style I fill:#b3e5fc
```
Core CV Tasks Comparison
| Task | Accuracy Target | Latency | Model Size | Use Case |
|---|---|---|---|---|
| Image Classification | 85-95% | <50ms | 5-50MB | Quality control, content moderation |
| Object Detection | mAP 40-55 | <200ms | 20-100MB | Autonomous vehicles, surveillance |
| Semantic Segmentation | mIoU 60-80% | <500ms | 50-200MB | Medical imaging, satellite analysis |
| Instance Segmentation | mAP 35-50 | <500ms | 100-300MB | Robotics, counting objects |
| OCR | 90-98% accuracy | <100ms | 30-100MB | Document processing, license plates |
| Pose Estimation | PCK 85-95% | <100ms | 20-80MB | Sports analytics, AR/VR |
Data Strategy & Quality
Annotation Quality Framework
| Aspect | Best Practice | Common Pitfall | Impact |
|---|---|---|---|
| Label Consistency | Clear guidelines, regular audits | Ambiguous definitions | 10-20% accuracy drop |
| Inter-annotator Agreement | Multiple annotators per image, consensus | Single annotator | Poor generalization |
| Edge Cases | Explicitly annotate hard examples | Only annotate easy cases | Fails on real data |
| Class Balance | Sample to balance classes | Natural distribution | Biased predictions |
| Quality Control | Review random samples, track metrics | No validation | Garbage in, garbage out |
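Inter-annotator agreement is worth measuring, not just mandating. Below is a minimal sketch, assuming two annotators labeled the same images (the label lists are illustrative), that computes raw agreement and Cohen's kappa with scikit-learn:

```python
# Minimal sketch: measuring inter-annotator agreement on classification labels.
# The two label lists are illustrative; in practice, load them from your annotation tool.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["intact", "torn", "crushed", "intact", "torn"]
annotator_b = ["intact", "torn", "intact", "intact", "torn"]

# Raw agreement: fraction of identical labels
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for agreement expected by chance
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"raw agreement: {raw:.2f}, kappa: {kappa:.2f}")
```

Raw agreement overstates quality on imbalanced label sets; kappa is the safer gate for the >85% style thresholds used later in this chapter.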
Data Augmentation Strategy
Essential Augmentations:
```yaml
geometric:
  - random_crop: 0.8-1.0 scale
  - horizontal_flip: 50% probability
  - rotation: ±15 degrees
photometric:
  - brightness: ±20%
  - contrast: ±20%
  - hue/saturation: ±15%
noise:
  - gaussian_noise: 20% probability
  - blur: 20% probability
```
Implementation Pattern:
```python
# Minimal augmentation pipeline (albumentations)
import albumentations as A

transform = A.Compose([
    # older albumentations versions take height, width positionally instead of size=
    A.RandomResizedCrop(size=(640, 640), scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
], bbox_params=A.BboxParams(format='coco', label_fields=['labels']))
```
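A hedged usage example for the pipeline above; the image array and the single COCO-format box are placeholders:

```python
# Usage sketch: apply the pipeline to an image plus COCO-format boxes.
import numpy as np

image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)  # placeholder image
bboxes = [[100, 120, 200, 150]]  # COCO format: [x_min, y_min, width, height]
labels = [1]                     # class ids, matched to label_fields above

out = transform(image=image, bboxes=bboxes, labels=labels)
print(out['image'].shape, out['bboxes'])  # boxes are remapped to the crop
```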
Model Selection & Performance
Object Detection Models Comparison
| Model | Speed (FPS) | mAP | Parameters | Latency | Best For |
|---|---|---|---|---|---|
| YOLOv8n | 80 | 37.3 | 3.2M | 12ms | Edge, mobile, real-time |
| YOLOv8s | 60 | 44.9 | 11.2M | 18ms | Balanced performance |
| YOLOv8m | 35 | 50.2 | 25.9M | 35ms | High accuracy needed |
| Faster R-CNN | 15 | 42.0 | 41.8M | 80ms | Accuracy over speed |
| EfficientDet-D0 | 45 | 34.6 | 3.9M | 25ms | Edge deployment |
Image Classification Models Comparison
| Model | Top-1 Acc | Parameters | Inference (ms) | Model Size | Best For |
|---|---|---|---|---|---|
| MobileNetV3-Small | 67.4% | 2.5M | 10 | 9MB | Mobile, edge |
| EfficientNet-B0 | 77.1% | 5.3M | 15 | 20MB | Balanced |
| ResNet50 | 76.1% | 25.6M | 20 | 98MB | Standard baseline |
| ViT-Base | 84.5% | 86M | 45 | 330MB | High accuracy |
| ConvNeXt-Tiny | 82.1% | 28M | 25 | 110MB | Modern architecture |
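For completeness alongside the detection and OCR snippets below, here is a minimal classification sketch using torchvision's pretrained MobileNetV3 (the weights API shown requires torchvision >= 0.13; the image path is a placeholder):

```python
# Minimal image classification with pretrained MobileNetV3-Small
import torch
from PIL import Image
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights

weights = MobileNet_V3_Small_Weights.DEFAULT
model = mobilenet_v3_small(weights=weights).eval()
preprocess = weights.transforms()   # the resize/normalize the weights were trained with

img = preprocess(Image.open('image.jpg')).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=1)

top = probs[0].argmax().item()
print(f"{weights.meta['categories'][top]}: {probs[0, top]:.2f}")
```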
Minimal Detection Implementation
```python
from ultralytics import YOLO

# Quick object detection with a pretrained nano model
model = YOLO('yolov8n.pt')
results = model.predict('image.jpg', conf=0.5)

# Parse results: class name and confidence per detected box
for r in results:
    for box in r.boxes:
        print(f"{r.names[int(box.cls[0])]}: {box.conf[0]:.2f}")
```
Minimal OCR Implementation
```python
import easyocr

# Quick OCR (downloads the English model on first run)
reader = easyocr.Reader(['en'], gpu=True)
results = reader.readtext('document.jpg')

# Each result is (bounding box, text, confidence)
for bbox, text, conf in results:
    print(f"{text} ({conf:.2f})")
```
Edge vs Cloud Deployment
Deployment Decision Framework
```mermaid
graph TD
    A[CV Deployment] --> B{Latency Requirement}
    B -->|<100ms| C[Edge Required]
    B -->|>500ms| D{Privacy Sensitive?}
    B -->|100-500ms| E{Network Reliability?}
    C --> F[Edge Deployment]
    D -->|Yes| F
    D -->|No| G{Volume?}
    E -->|Poor| F
    E -->|Good| G
    G -->|High >1M/day| H[Cloud + CDN]
    G -->|Low| I[Cloud Serverless]
    F --> J[Optimize: TensorRT, ONNX]
    H --> K[Batch Processing]
    I --> L[On-demand Scaling]
    style F fill:#c8e6c9
    style H fill:#b3e5fc
    style I fill:#f3e5f5
```
Edge vs Cloud Comparison
| Aspect | Edge | Cloud | Decision Factor |
|---|---|---|---|
| Latency | 10-50ms | 100-500ms (+ network) | Real-time apps → Edge |
| Privacy | Data stays local | Data transmitted | Sensitive data → Edge |
| Cost (high volume) | Lower (one-time hardware) | Higher (per-inference) | >1M inferences/day → Edge |
| Scalability | Limited by hardware | Virtually unlimited | Variable load → Cloud |
| Updates | Requires device update | Instant | Frequent updates → Cloud |
| Compute | Limited (mobile/embedded) | Powerful GPUs | Complex models → Cloud |
| Bandwidth | No network needed | Requires connectivity | Offline use → Edge |
| Reliability | Works offline | Network dependent | Unreliable network → Edge |
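The same rules of thumb can be written down as a small helper. This is an illustrative sketch, not a library API; the function name and thresholds simply mirror the table and diagram above:

```python
# Illustrative helper encoding the edge-vs-cloud rules of thumb above.
def choose_deployment(latency_budget_ms: float,
                      privacy_sensitive: bool,
                      reliable_network: bool,
                      daily_inferences: int) -> str:
    if latency_budget_ms < 100:
        return "edge"              # hard real-time: network RTT alone blows the budget
    if privacy_sensitive or not reliable_network:
        return "edge"              # keep data local / tolerate disconnects
    if daily_inferences > 1_000_000:
        return "cloud-batch"       # high volume: amortize with batching and CDN
    return "cloud-serverless"      # low, bursty volume: pay per inference

print(choose_deployment(50, False, True, 10_000))       # -> edge
print(choose_deployment(300, False, True, 5_000_000))   # -> cloud-batch
```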
Edge Optimization Techniques
Quantization Impact:
```yaml
INT8_quantization:
  speed_improvement: 3-5x
  model_size_reduction: 4x
  accuracy_drop: 0.5-2%
  best_for: Embedded devices, mobile
FP16_quantization:
  speed_improvement: ~2x
  model_size_reduction: 2x
  accuracy_drop: <0.5%
  best_for: Edge servers with GPU support
```
Optimization Stack:
```python
import torch
import torch.nn as nn

# Export to ONNX (dummy_input must match the model's expected input shape)
dummy_input = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy_input, 'model.onnx')

# Convert to TensorRT on NVIDIA hardware, e.g. via the trtexec CLI:
#   trtexec --onnx=model.onnx --int8 --saveEngine=model.engine

# Dynamic INT8 quantization for CPU deployment; note it only covers
# Linear/LSTM-style layers, so Conv2d needs static or quantization-aware training
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```
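Before shipping the exported model, it is worth confirming that ONNX outputs match the PyTorch original. A minimal sketch with onnxruntime follows; it assumes `model` from the export step returns a single tensor, and the tolerance is an illustrative choice:

```python
# Sketch: sanity-check the exported ONNX model against the PyTorch original.
import numpy as np
import onnxruntime as ort
import torch

dummy_input = torch.randn(1, 3, 640, 640)
session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name

onnx_out = session.run(None, {input_name: dummy_input.numpy()})[0]
with torch.no_grad():
    torch_out = model(dummy_input).numpy()  # 'model' from the export step above

assert np.allclose(onnx_out, torch_out, atol=1e-3), "ONNX/PyTorch outputs diverge"
```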
Performance Benchmarks
Detection Model Benchmarks (640x640 input)
| Model | Hardware | Latency | Throughput | Power | Cost/1M inferences |
|---|---|---|---|---|---|
| YOLOv8n | Jetson Nano | 45ms | 22 FPS | 5W | $0.12 |
| YOLOv8n | Raspberry Pi 4 | 180ms | 5.5 FPS | 3W | $0.25 |
| YOLOv8s | Tesla T4 (Cloud) | 8ms | 125 FPS | 70W | $2.50 |
| YOLOv8m | A100 (Cloud) | 6ms | 166 FPS | 250W | $8.00 |
| Faster R-CNN | A100 (Cloud) | 22ms | 45 FPS | 250W | $12.00 |
OCR Performance Benchmarks
| System | Language | Character Accuracy | Latency | Best For |
|---|---|---|---|---|
| EasyOCR | English | 96% | 150ms | General purpose |
| Tesseract | English | 94% | 80ms | Documents |
| PaddleOCR | Multi-lang | 95% | 120ms | Asian languages |
| Google Vision API | Multi-lang | 98% | 200ms (+ network) | High accuracy needed |
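CER here is the character error rate: the Levenshtein edit distance between predicted and reference text divided by the reference length, so character accuracy is 1 - CER. A self-contained sketch:

```python
# Character error rate: edit distance / reference length.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edits to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

print(f"CER: {cer('invoice 42', 'invoice 4Z'):.2%}")  # one substitution -> 10.00%
```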
Case Study: Warehouse Damage Detection
Business Context
- Industry: Logistics and warehousing
- Scale: 20 warehouses, 100K packages/day
- Problem: Manual inspection slow, inconsistent (78% accuracy)
- Goal: Automated damage detection with <200ms latency
- Constraints: Edge deployment (no cloud), privacy, 95%+ accuracy
Solution Architecture
```mermaid
graph TB
    A[Camera Feed 30 FPS] --> B[NVIDIA Jetson Xavier NX]
    B --> C[Frame Selection 1/4 frames]
    C --> D[Preprocessing Pipeline]
    D --> E[YOLOv8s-INT8 TensorRT]
    E --> F{Damage Detected?}
    F -->|Yes conf>0.8| G[Alert + Save Image]
    F -->|No| H[Continue Monitoring]
    G --> I[Daily Batch Sync]
    H --> I
    I --> J[Cloud Analytics]
    J --> K[Model Retraining Pipeline]
    K -.Weekly Updates.-> B
    style E fill:#c8e6c9
    style G fill:#ffccbc
    style K fill:#b3e5fc
```
Implementation & Results
Technical Stack:
```yaml
hardware:
  device: NVIDIA Jetson Xavier NX
  power: 23W (< 30W requirement)
  cost: $399/unit
model:
  architecture: YOLOv8-small
  optimization: TensorRT INT8
  size: 47MB (< 100MB requirement)
  classes: [intact, torn, crushed, wet_damage]
data:
  training_images: 50,000
  warehouses: 20
  augmentation: rotation, brightness, blur, occlusion
```
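The frame-selection and alerting logic from the architecture diagram might look like the following sketch; the camera index, the custom weights file (damage_yolov8s.pt), and the alerts/ directory are illustrative assumptions:

```python
# Sketch of the edge loop: sample 1 frame in 4, detect, alert on conf > 0.8.
import cv2
from ultralytics import YOLO

model = YOLO('damage_yolov8s.pt')   # hypothetical custom-trained weights
DAMAGE_CLASSES = {'torn', 'crushed', 'wet_damage'}

cap = cv2.VideoCapture(0)           # illustrative camera index
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 4:               # keep every 4th frame: 30 FPS -> ~7.5 FPS
        continue
    for r in model.predict(frame, conf=0.8, verbose=False):
        for box in r.boxes:
            name = r.names[int(box.cls[0])]
            if name in DAMAGE_CLASSES:
                # assumes alerts/ exists; in production, also enqueue for batch sync
                cv2.imwrite(f'alerts/frame_{frame_idx}.jpg', frame)
                print(f"ALERT: {name} ({box.conf[0]:.2f})")
cap.release()
```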
Performance Results:
| Metric | Baseline (Manual) | Requirement | Achieved | Improvement |
|---|---|---|---|---|
| Accuracy | 78% | >95% | 96.8% | +24% |
| Latency | 12s (human) | <200ms | 127ms | 99% faster |
| FPS | N/A | >7 | 7.9 | Real-time |
| False Positive Rate | 15% | <2% | 1.4% | -91% |
| Throughput | 50 pkgs/hr/worker | 1000 pkgs/hr/cam | 1,140 pkgs/hr | 23x |
| Cost per inspection | $0.18 | <$0.05 | $0.008 | -96% |
ROI Analysis:
```yaml
investment:
  hardware: $7,980 (20 units)
  development: $45,000
  training: $8,000
  total: $60,980
annual_savings:
  labor_reduction: $180,000
  damage_claims_reduction: $95,000
  throughput_increase: $120,000
  total: $395,000
roi: 548%
payback_period: ~1.9 months ($60,980 / $395,000 × 12)
```
Key Learnings
- Edge essential for latency + privacy: Cloud would add 100-200ms network latency; on-device processing eliminated privacy concerns
- Quantization ROI strong: INT8 gave 5x speedup with only 0.3% accuracy loss
- Domain data critical: pre-trained models reached 68% accuracy out of the box; custom training on warehouse data reached 96.8%
- Monitoring prevents drift: Weekly retraining improved accuracy from 94% → 96.8% over 6 months
- Confidence thresholds matter: Using conf>0.8 reduced false positives from 3.2% → 1.4%
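Choosing that confidence threshold is typically a sweep over a held-out set rather than a guess; a sketch with illustrative scores and labels:

```python
# Sketch: sweep confidence thresholds on validation data to pick an operating point.
import numpy as np

scores = np.array([0.95, 0.85, 0.72, 0.60, 0.91, 0.55])  # placeholder confidences
labels = np.array([1, 1, 0, 0, 1, 0])                    # 1 = true damage

for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    print(f"conf>{t:.1f}: precision={precision:.2f} recall={recall:.2f}")
```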
Implementation Checklist
Phase 1: Problem Definition (Week 1)
- Define CV task type (detection, classification, OCR)
- Establish accuracy and latency requirements
- Choose deployment target (edge vs cloud)
- Document data collection strategy
- Create annotation guidelines with examples
Phase 2: Data & Training (Weeks 2-4)
- Collect 5,000+ diverse images (10K+ for complex tasks)
- Annotate with multiple reviewers for quality
- Split data: 70% train, 15% val, 15% test
- Train baseline model, evaluate on validation
- Implement augmentation, retrain, compare
Phase 3: Optimization (Weeks 5-6)
- Benchmark latency on target hardware (see the timing sketch after this list)
- Apply quantization if edge deployment
- Export to deployment format (ONNX/TensorRT)
- Validate accuracy post-optimization (< 2% drop acceptable)
- Load test for throughput requirements
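A minimal latency-benchmark harness for the first item above; `infer` is a placeholder for the real deployed inference call, and the warmup and repeat counts are assumptions:

```python
# Sketch: wall-clock latency benchmark on target hardware.
import statistics
import time
import numpy as np

def infer(x):
    time.sleep(0.01)   # placeholder: swap in the real model call

x = np.random.rand(1, 3, 640, 640).astype(np.float32)

for _ in range(10):    # warmup: JIT compilation, caches, GPU clocks
    infer(x)

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    infer(x)
    latencies.append((time.perf_counter() - t0) * 1000)

lat = sorted(latencies)
print(f"p50={statistics.median(lat):.1f}ms  p95={lat[int(0.95 * len(lat))]:.1f}ms  max={lat[-1]:.1f}ms")
```

Report p95/p99 rather than the mean; tail latency is what breaks real-time budgets.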
Phase 4: Deployment (Weeks 7-8)
- Deploy in shadow mode, validate with production data
- Set up monitoring (accuracy, latency, errors)
- Create alerting for model drift (a simple sketch follows this list)
- Plan weekly/monthly retraining
- Document model card and limitations
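One lightweight drift signal is a shift in the detector's confidence distribution over a rolling window; the window size, baseline, and alert threshold below are illustrative assumptions:

```python
# Sketch: crude drift check comparing rolling mean confidence to a baseline.
from collections import deque

BASELINE_MEAN_CONF = 0.87        # measured on validation data at deployment time
recent = deque(maxlen=1000)      # rolling window of production confidences

def record(conf: float) -> None:
    recent.append(conf)
    if len(recent) == recent.maxlen:
        mean = sum(recent) / len(recent)
        if abs(mean - BASELINE_MEAN_CONF) > 0.05:   # illustrative alert threshold
            print(f"DRIFT ALERT: mean conf {mean:.2f} vs baseline {BASELINE_MEAN_CONF:.2f}")
```

Confidence drift is only a proxy; periodic labeled spot-checks remain the ground truth for accuracy.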
Common Pitfalls & Solutions
| Pitfall | Symptom | Solution | Prevention |
|---|---|---|---|
| Annotation bias | High train, low test accuracy | Multiple annotators, guidelines | Inter-annotator agreement >85% |
| Insufficient diversity | Poor real-world performance | Collect varied conditions, augment | Test on holdout domains |
| Wrong metrics | Good mAP, poor business impact | Optimize business metrics | Define success criteria upfront |
| Class imbalance | Bias toward majority class | Weighted loss, balanced sampling | Keep class ratios within roughly 1:4 (20/80) |
| Quantization shock | Model breaks after optimization | Quantization-aware training | Validate accuracy post-quant |
| Deployment mismatch | Test >> production accuracy | Match test to production | Test on production-like data |
Key Takeaways
- Task selection drives architecture: Object detection needs location; classification needs labels; match model to task
- Data quality > model size: 10K clean, diverse images outperform 100K biased images
- Edge optimization critical: INT8 quantization: 4x smaller, 3-5x faster, <2% accuracy loss
- Deployment context matters: Edge for latency/privacy; cloud for complex models and scalability
- Monitor and retrain: CV models drift; plan weekly/monthly retraining from day one
- Benchmarks guide decisions: Test on target hardware early; latency surprises are expensive