Part 6: Solution Patterns (Classical & Applied AI)
Chapter 35 — Recommenders & Personalization
Overview
Build recommendation systems with content-based filtering, collaborative filtering, and bandit-based approaches while mitigating cold-start problems. This chapter covers production recommender architectures, from multi-stage candidate generation to ranking, diversity re-ranking, and continuous A/B testing. We focus on balancing business objectives (engagement, revenue) with user experience (relevance, diversity, fairness).
Recommendation System Architecture
Multi-Stage Ranking Pipeline
graph TB
    A[User Request] --> B[Candidate Generation]
    B --> C[Content-Based<br/>~300 items]
    B --> D[Collaborative Filtering<br/>~400 items]
    B --> E[Trending/Popular<br/>~300 items]
    C --> F[Candidate Pool<br/>~1000 items<br/>50ms]
    D --> F
    E --> F
    F --> G[Feature Engineering]
    G --> H[ML Ranking Model<br/>XGBoost/Neural]
    H --> I[Top 100 Scored<br/>100ms total]
    I --> J[Diversity Re-ranking<br/>MMR Algorithm]
    J --> K[Business Rules<br/>Filters]
    K --> L[Final Top-N<br/>10-20 items<br/>180ms total]
    L --> M[A/B Test Assignment]
    M --> N[Serve to User]
    N --> O[Log Interaction]
    O --> P[Feedback Loop]
    P -.Retrain.-> H
    style F fill:#e1f5fe
    style H fill:#c8e6c9
    style J fill:#fff3e0
    style P fill:#f3e5f5
Pipeline Stage Performance
| Stage | Purpose | Scale | Latency Budget | Accuracy Impact |
|---|---|---|---|---|
| Candidate Generation | Fast filtering | 1M → 1K | <50ms | Recall-focused |
| Scoring/Ranking | Precise relevance | 1K → 100 | <100ms | Precision-focused |
| Re-ranking | Diversity, business | 100 → 20 | <20ms | User satisfaction |
| Post-processing | Final filters | 20 → 10 | <10ms | Quality gates |
| Total | End-to-end | 1M → 10 | <180ms | User-facing |
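How the stages compose is mostly plumbing. Below is a minimal sketch of the orchestration, assuming the four stage functions (candidate generation, ranking, re-ranking, business rules) are injected as hypothetical callables that respect the budgets above:
import time
def recommend(user_id, context, generate_candidates, rank, rerank, apply_rules, top_n=10):
    """Compose the funnel; the four stage callables are hypothetical stand-ins."""
    t0 = time.perf_counter()
    candidates = generate_candidates(user_id, limit=1000)   # 1M -> ~1K, recall-focused
    scored = rank(user_id, candidates, context)[:100]       # 1K -> 100, precision-focused
    diversified = rerank(scored, top_n=20)                  # 100 -> 20, diversity re-rank
    final = apply_rules(diversified)[:top_n]                # 20 -> ~10, quality gates
    elapsed_ms = (time.perf_counter() - t0) * 1000          # compare against the ~180ms budget
    return final, elapsed_ms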
Candidate Generation Strategies
Algorithm Comparison
| Algorithm | Strengths | Weaknesses | Best For | Complexity |
|---|---|---|---|---|
| Content-Based | No cold-start for items, explainable | Filter bubble, needs features | New items, transparency | O(n log n) |
| Collaborative Filtering | Discovers patterns, no features needed | Cold-start, sparsity issues | Established catalog | O(k*n) |
| Matrix Factorization | Scalable, captures latent factors | Black box, cold-start | Large scale | O(k*iterations) |
| Hybrid | Best of both worlds | Complex, slower | Production systems | Combined |
| Bandits | Explores new items, adapts | Needs traffic, slower learning | Dynamic inventory | O(arms) |
Implementation Patterns
Content-Based (TF-IDF Similarity):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Build feature vectors
vectorizer = TfidfVectorizer(max_features=5000)
item_features = vectorizer.fit_transform(items['description'])
# Compute similarity
similarity_matrix = cosine_similarity(item_features)
# Get the top 20 similar items; the final slice drops the item itself (always its own best match)
similar_items = similarity_matrix[item_idx].argsort()[-21:][::-1][1:]
Collaborative Filtering (Item-Item):
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# Create user-item matrix
user_item_matrix = csr_matrix((ratings, (users, items)))
# Find similar items via nearest neighbors in rating space
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(user_item_matrix.T)  # Transpose so rows are items (item-item similarity)
# n_neighbors=21 because the closest neighbor is the query item itself
distances, indices = knn.kneighbors(user_item_matrix.T[item_idx], n_neighbors=21)
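Matrix factorization appears in the comparison table but has no snippet; a minimal sketch using truncated SVD on the same user-item matrix (the latent dimension of 64 and the `user_idx` variable are assumptions, not tuned or defined above):
from sklearn.decomposition import TruncatedSVD
import numpy as np
# Factorize the sparse user-item matrix into latent user and item factors
svd = TruncatedSVD(n_components=64, random_state=42)
user_factors = svd.fit_transform(user_item_matrix)   # shape: (n_users, 64)
item_factors = svd.components_.T                     # shape: (n_items, 64)
# Score every item for one user and take the top 20 (filter already-seen items in practice)
scores = item_factors @ user_factors[user_idx]
top_items = np.argsort(scores)[-20:][::-1]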
Hybrid Fusion:
# Weighted combination
def hybrid_score(content_score, collab_score, alpha=0.3):
return alpha * content_score + (1 - alpha) * collab_score
# Re-rank candidates
candidates['score'] = hybrid_score(
candidates['content_score'],
candidates['collab_score'],
alpha=0.3 # 30% content, 70% collaborative
)
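One caveat the weighted sum glosses over: content and collaborative scores usually live on different scales, so normalize before blending or the alpha weight is meaningless. A minimal min-max sketch, reusing the column names from the snippet above:
import numpy as np
def minmax(scores):
    """Scale a score array to [0, 1] so the alpha weight behaves as intended."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
candidates['content_score'] = minmax(candidates['content_score'])
candidates['collab_score'] = minmax(candidates['collab_score'])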
Ranking & Re-ranking
Learning-to-Rank Features
Feature Categories:
user_features:
- age, tenure_days, avg_rating
- total_interactions, engagement_score
- preferences, demographics
item_features:
- avg_rating, popularity_score
- recency_days, category, price
- inventory_status
interaction_features:
- content_similarity, cf_score
- time_since_last_view
- historical_affinity
context_features:
- time_of_day, day_of_week
- device_type, location
- session_duration
Minimal XGBoost Ranker:
import xgboost as xgb
# Train a pointwise ranker: a binary click model whose predicted probability orders the items
model = xgb.XGBClassifier(
objective='binary:logistic',
max_depth=6,
learning_rate=0.1,
n_estimators=100
)
model.fit(X_train, y_train)
# Predict click probability
scores = model.predict_proba(features)[:, 1]
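The classifier above is a pointwise formulation. If training rows are grouped by session or query, a pairwise/listwise objective is a near drop-in alternative; a sketch assuming a hypothetical `group_sizes` list giving the number of rows per session, in order:
import xgboost as xgb
ranker = xgb.XGBRanker(
    objective='rank:ndcg',   # listwise objective; 'rank:pairwise' also works
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100
)
# group_sizes, e.g. [20, 20, 15, ...]: rows per user session, matching the order of X_train
ranker.fit(X_train, y_train, group=group_sizes)
scores = ranker.predict(features)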
Diversity Re-ranking
Maximal Marginal Relevance (MMR):
import numpy as np

def mmr_rerank(candidates, similarity_matrix, lambda_=0.7, top_n=10):
"""
Balance relevance and diversity
lambda_: relevance weight (0-1)
"""
selected = []
remaining = list(range(len(candidates)))
# Select first (highest relevance)
selected.append(max(remaining, key=lambda i: candidates[i]['score']))
remaining.remove(selected[0])
# Iteratively add diverse items
while len(selected) < top_n and remaining:
mmr_scores = [
lambda_ * candidates[i]['score'] -
(1 - lambda_) * max([similarity_matrix[i][s] for s in selected])
for i in remaining
]
next_idx = remaining[np.argmax(mmr_scores)]
selected.append(next_idx)
remaining.remove(next_idx)
return [candidates[i] for i in selected]
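A small worked call, using made-up scores and similarities to show the trade-off the function makes:
candidates = [
    {'item_id': 101, 'score': 0.92},
    {'item_id': 102, 'score': 0.90},
    {'item_id': 103, 'score': 0.85},
]
# Items 101 and 102 are near-duplicates; 103 is dissimilar to both
item_sims = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.12],
    [0.10, 0.12, 1.00],
])
top = mmr_rerank(candidates, item_sims, lambda_=0.7, top_n=2)
# Although 102 scores higher than 103, MMR fills the second slot with 103
# because 102 is nearly identical to the already-selected 101.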
Multi-Armed Bandits
Bandit Algorithm Comparison
| Algorithm | Exploration | Regret Bound | Best For |
|---|---|---|---|
| ε-Greedy | Random ε% of the time | Linear for fixed ε; O(log T) with decaying ε | Simple, fast baseline |
| Thompson Sampling | Posterior (Bayesian) sampling | O(log T) | Strong empirical performance |
| UCB | Optimistic confidence bounds | O(log T) | Theoretical guarantees |
| LinUCB | Contextual features | Õ(d√T) | Personalized, contextual bandits |
Thompson Sampling Implementation:
# Minimal Thompson Sampling: one Beta(successes, failures) posterior per item
import numpy as np
arms = {item_id: {'successes': 1, 'failures': 1} for item_id in items}  # Beta(1, 1) = uniform prior
def select_item(candidates):
samples = {
item: np.random.beta(arms[item]['successes'], arms[item]['failures'])
for item in candidates
}
return max(samples, key=samples.get)
def update(item_id, clicked):
if clicked:
arms[item_id]['successes'] += 1
else:
arms[item_id]['failures'] += 1
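LinUCB from the table adds context (user/session features) to the bandit. A minimal disjoint-LinUCB sketch, where the context dimension d and exploration strength alpha are placeholder choices:
d, alpha = 10, 1.0  # context dimension and exploration strength (assumptions)
A = {item_id: np.eye(d) for item_id in items}    # per-arm design matrices
b = {item_id: np.zeros(d) for item_id in items}  # per-arm reward vectors
def select_item_linucb(candidates, context):
    """Pick the arm with the highest upper confidence bound for this context vector."""
    ucb = {}
    for item in candidates:
        A_inv = np.linalg.inv(A[item])
        theta = A_inv @ b[item]                                   # per-arm linear model
        ucb[item] = context @ theta + alpha * np.sqrt(context @ A_inv @ context)
    return max(ucb, key=ucb.get)
def update_linucb(item_id, context, reward):
    A[item_id] += np.outer(context, context)
    b[item_id] += reward * context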
Evaluation Metrics
Metric Categories
| Metric Type | Examples | Measures | Business Impact |
|---|---|---|---|
| Accuracy | Precision@K, Recall@K, NDCG | Relevance | User satisfaction |
| Diversity | Intra-list diversity, Coverage | Variety | Discovery, long-tail |
| Business | CTR, CVR, Revenue | Outcomes | Direct ROI |
| Engagement | Session time, Return rate | Stickiness | Retention |
Key Implementations:
import numpy as np
# Hit Rate@K
def hit_rate_at_k(predictions, ground_truth, k=10):
hits = sum(
1 for preds, truth in zip(predictions, ground_truth)
if any(item in set(preds[:k]) for item in truth)
)
return hits / len(predictions)
# Coverage (catalog diversity)
def coverage(predictions, total_items):
recommended = set(item for recs in predictions for item in recs)
return len(recommended) / total_items
# Personalization (user diversity)
def personalization(predictions):
pairs = [(predictions[i], predictions[j])
for i in range(len(predictions))
for j in range(i+1, len(predictions))]
jaccard_dists = [
1 - len(set(p1) & set(p2)) / len(set(p1) | set(p2))
for p1, p2 in pairs
]
return np.mean(jaccard_dists)
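NDCG@K and intra-list diversity (ILD) appear in the tables and results but not in code; minimal sketches below, assuming binary relevance for NDCG and that recommended items are indices into a precomputed item-item similarity matrix for ILD:
# NDCG@K with binary relevance (1 if a recommended item is in the user's ground truth)
def ndcg_at_k(predictions, ground_truth, k=10):
    def dcg(recs, truth):
        return sum(1 / np.log2(rank + 2)
                   for rank, item in enumerate(recs[:k]) if item in truth)
    scores = []
    for recs, truth in zip(predictions, ground_truth):
        truth = set(truth)
        ideal = dcg(list(truth), truth)  # best achievable ordering
        scores.append(dcg(recs, truth) / ideal if ideal > 0 else 0.0)
    return np.mean(scores)
# Intra-list diversity: average pairwise dissimilarity within one recommendation list
def intra_list_diversity(rec_items, similarity_matrix):
    pairs = [(i, j) for idx, i in enumerate(rec_items) for j in rec_items[idx + 1:]]
    return np.mean([1 - similarity_matrix[i][j] for i, j in pairs])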
Case Study: E-commerce Product Recommendations
Business Context
- Industry: E-commerce marketplace
- Scale: 10M users, 1M products, 100M monthly sessions
- Problem: 2.1% CTR, poor long-tail exposure (15% coverage)
- Goal: Increase CTR, conversion, and product diversity
- Constraints: <100ms latency, real-time personalization
Hybrid Multi-Stage Architecture
graph TB
    A[User Session Start] --> B{User Type?}
    B -->|New| C[Content-Based + Popular]
    B -->|Returning| D[Collaborative Filtering]
    C --> E[Candidate Pool ~1000]
    D --> E
    E --> F[Feature Engineering<br/>25 features]
    F --> G[XGBoost Ranker<br/>50ms]
    G --> H[Top 100 Scored]
    H --> I[MMR Diversification<br/>λ=0.7]
    I --> J[Business Rules Filter]
    J --> K[Final 20 Recs<br/>Total: 78ms]
    K --> L[A/B Test<br/>Traffic Split]
    L --> M[Variant A: Hybrid]
    L --> N[Variant B: CF Only]
    M --> O[User Interaction]
    N --> O
    O --> P[Event Logging]
    P -.Daily Retrain.-> G
    style E fill:#e1f5fe
    style G fill:#c8e6c9
    style I fill:#fff3e0
    style L fill:#ffccbc
Implementation & Results
Technical Stack:
candidate_generation:
content_based: TF-IDF + Cosine Similarity (~300 items)
collaborative: Item-Item KNN (~400 items)
trending: Redis sorted set (~200 items)
personalized: User embedding similarity (~100 items)
ranking:
model: XGBoost (100 trees, depth=6)
features: 25 (user + item + interaction + context)
training: Daily on 7-day window
serving: Python + Redis feature store
diversification:
algorithm: MMR
lambda: 0.7 (70% relevance, 30% diversity)
similarity: Pre-computed item embeddings
infrastructure:
api: FastAPI + Uvicorn
cache: Redis (candidate pool + features)
monitoring: Prometheus + Grafana
ab_testing: Custom framework (10% exploration)
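A minimal serving-endpoint sketch matching this stack, using only a cached candidate pool and the trending sorted set as a fallback; the route path and Redis key names (`recs:{user_id}`, `trending:items`) are illustrative assumptions, not the case study's actual schema:
import json
import redis
from fastapi import FastAPI
app = FastAPI()
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)
@app.get('/recommendations/{user_id}')
def get_recommendations(user_id: str, n: int = 20):
    # Personalized recs are precomputed and cached by the candidate/ranking pipeline
    cached = cache.get(f'recs:{user_id}')
    if cached:
        return {'user_id': user_id, 'items': json.loads(cached)[:n], 'source': 'cache'}
    # Fallback: time-decayed trending items kept in a Redis sorted set
    trending = cache.zrevrange('trending:items', 0, n - 1)
    return {'user_id': user_id, 'items': trending, 'source': 'trending_fallback'}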
Performance Results:
| Metric | Baseline (Popular) | CF Only | Hybrid System | Improvement |
|---|---|---|---|---|
| CTR | 2.1% | 3.4% | 4.8% | +129% |
| Conversion Rate | 1.2% | 1.8% | 2.6% | +117% |
| Avg Order Value | $45 | $52 | $58 | +29% |
| Revenue/Session | $0.54 | $0.94 | $1.51 | +180% |
| Coverage | 15% | 35% | 68% | +353% |
| Diversity (ILD) | 0.32 | 0.51 | 0.74 | +131% |
| p95 Latency | 15ms | 45ms | 78ms | Within 100ms SLA |
| NDCG@10 | 0.42 | 0.58 | 0.71 | +69% |
ROI Analysis:
investment:
development: $120,000 (3 engineers × 4 months)
infrastructure: $8,000/month (Redis, compute)
total_first_year: $216,000
annual_impact:
revenue_increase: $4.2M (attributed to the $0.97 revenue-per-session lift)
operational_savings: $180,000 (reduced manual curation)
total_benefit: $4.38M
roi: 1,928%
payback_period: 18 days
Key Learnings
- Multi-stage essential for scale: Narrowing 1M → 20 in <100ms required a three-stage funnel (candidate → rank → rerank)
- Hybrid beats pure methods: 70% collaborative + 30% content captured diverse user intents better than either alone
- Diversity drives revenue: MMR re-ranking increased long-tail exposure 280% AND boosted AOV by 29%
- Real-time context critical: Adding session features (cart, recent views) improved CTR 35% over static user profiles
- A/B testing reveals counter-intuitive results: showing 20 recs outperformed 30 recs (choice paralysis), and cold-start users preferred popular items over personalized ones
Implementation Checklist
Phase 1: Foundation (Week 1)
- Define business objectives (CTR, revenue, engagement)
- Collect interaction data: clicks, purchases, ratings, dwell time
- Establish baseline (popular items or random)
- Define success metrics (offline: NDCG, coverage; online: CTR, CVR)
- Create temporal train/val/test splits to avoid data leakage (see the sketch below)
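A minimal temporal-split sketch for the last item above, assuming an `interactions` DataFrame with a 'timestamp' column; the cutoff dates are placeholders:
import pandas as pd
interactions = interactions.sort_values('timestamp')
train = interactions[interactions['timestamp'] < '2024-05-01']
val = interactions[(interactions['timestamp'] >= '2024-05-01') &
                   (interactions['timestamp'] < '2024-05-15')]
test = interactions[interactions['timestamp'] >= '2024-05-15']
# Never split randomly: future interactions leaking into training inflate offline metrics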
Phase 2: Candidate Generation (Week 2-3)
- Implement content-based filtering (TF-IDF or embeddings)
- Implement collaborative filtering (user-item or item-item)
- Add trending/popularity signals (time-decayed)
- Generate ~1000 candidates per user in <50ms
- Measure recall@100, coverage, and diversity
Phase 3: Ranking & Reranking (Week 4-5)
- Engineer 20-30 features (user, item, interaction, context)
- Train XGBoost/LightGBM ranker on click data
- Implement MMR or similar diversity re-ranker
- Add business rules (inventory, margins, promotions)
- Optimize to <100ms end-to-end latency
Phase 4: Deployment (Week 6-7)
- Deploy with A/B testing (10% traffic to new system)
- Monitor online metrics: CTR, CVR, revenue, engagement
- Implement multi-armed bandits for exploration (ε=0.1 or Thompson)
- Set up daily/weekly retraining pipeline
- Create dashboards and alerting
Phase 5: Optimization (Ongoing)
- Analyze user segments (new vs returning, demographics)
- Test new features and algorithms (contextual bandits, deep learning)
- Monitor for filter bubbles and bias
- Address cold-start with hybrid and default strategies
- Document learnings, update runbooks
Common Pitfalls & Solutions
| Pitfall | Symptom | Solution | Prevention |
|---|---|---|---|
| Filter bubble | Same item types every time | MMR re-ranking, bandits | Monitor diversity metrics |
| Popularity bias | Only popular items shown | Long-tail boosting, exploration | Track coverage |
| Cold start | Poor recs for new users/items | Content-based, popularity fallback | Hybrid approach |
| Feedback loop | Rich get richer, poor get poorer | Randomization, bandits | Regular audits |
| Offline/online gap | Good NDCG, poor CTR | Optimize for business metrics | Online A/B tests |
| Context ignored | Same recs always | Add session, time, device features | Feature importance |
| Latency creep | >100ms response | Cache candidates, async features | Load testing |
Key Takeaways
- Multi-stage pipeline is mandatory: going from 1M items to 10 recs within a ~180ms budget requires candidate generation (<50ms) → ranking (<100ms) → re-ranking (<20ms)
- Hybrid > pure algorithms: Content-based + collaborative + context captures more user intent than any single method (70/30 split typical)
- Diversity drives long-term value: MMR re-ranking increases coverage 3-4x and prevents filter bubbles
- Explore to learn: Multi-armed bandits (10% exploration) discover new winners; pure exploitation misses opportunities
- Business metrics trump ML metrics: Optimize CTR, revenue, retention—not just NDCG or precision
- Cold-start needs hybrid: New users → popular + content; new items → content + bandits; established users → collaborative