Chapter 4 — Roles, Teams & Operating Models
Overview
Define the organizational patterns that enable effective, safe AI delivery at scale.
Successful AI initiatives require more than technical brilliance—they demand well-structured teams, clear roles and responsibilities, and organizational models that balance innovation with governance. This chapter provides blueprints for structuring AI teams, from individual roles to enterprise-wide operating models.
Whether you're a startup building your first AI capability or an enterprise scaling across multiple initiatives, the patterns here will help you avoid common pitfalls and accelerate time-to-value.
Roles & Responsibilities
AI teams blend traditional technology roles with specialized AI capabilities. Here's a comprehensive breakdown:
```mermaid
graph TD
    A[AI Team Structure] --> B[Leadership]
    A --> C[Product & Design]
    A --> D[Engineering]
    A --> E[Governance]
    B --> B1[Partner/Principal]
    B --> B2[Engagement Lead]
    C --> C1[Product Manager]
    C --> C2[UX/Content Designer]
    D --> D1[Tech Lead/Architect]
    D --> D2[ML/LLM Engineer]
    D --> D3[Data Engineer]
    D --> D4[Platform Engineer]
    E --> E1[Security/Compliance]
    E --> E2[Ethics/Responsible AI]
```
Leadership Roles
Partner / Principal
Responsibilities:
- Executive stakeholder alignment and sponsor management
- Commercial model design (pricing, contracts, IP)
- Strategic direction and portfolio governance
- Risk management and escalation
- Business development and client relationships
Key Skills:
- Business acumen and financial modeling
- Stakeholder management and influence
- AI landscape knowledge (breadth over depth)
- Risk assessment and mitigation
- Communication and storytelling
Typical Background: Management consulting, senior product leadership, or technology executive with 10+ years experience
Decision Rights:
- Investment allocation across AI portfolio
- Go/no-go on high-risk initiatives
- Commercial terms and partnerships
- Escalation path for major issues
Engagement Lead / Program Manager
Responsibilities:
- End-to-end engagement delivery (scope, schedule, budget)
- RAID (Risks, Assumptions, Issues, Dependencies) management
- Cross-functional coordination and stakeholder communication
- Resource allocation and team health
- Quality assurance and client satisfaction
Key Skills:
- Project/program management methodologies (Agile, SAFe)
- Risk management and issue resolution
- Communication and facilitation
- Resource planning and budgeting
- Client management
Typical Background: Program management, delivery management, or technical project management with 5-8 years experience
Decision Rights:
- Day-to-day prioritization and resource allocation
- Escalation of risks and issues
- Sprint planning and milestone adjustments
- Team composition changes
Tools & Artifacts:
- Project plan with milestones and dependencies
- RAID log (updated weekly)
- Status reports and stakeholder communications
- Resource allocation matrix
- Budget tracking and forecasts
Product & Design Roles
Product Manager
Responsibilities:
- User research and Jobs-to-Be-Done (JTBD) mapping
- Product vision and roadmap
- Feature prioritization and backlog management
- Success metrics definition and tracking
- Adoption and value realization
Key Skills:
- User-centered design thinking
- Data-driven decision making
- Understanding of AI capabilities and limitations
- A/B testing and experimentation
- Change management
Typical Background: Product management with 3-7 years experience, ideally with AI/ML exposure
Decision Rights:
- Feature prioritization within approved scope
- User experience and acceptance criteria
- A/B test design and interpretation
- Adoption strategies and tactics
AI-Specific Competencies:
| Competency | Description | Why It Matters |
|---|---|---|
| Probabilistic Thinking | Understanding uncertainty and confidence intervals | AI outputs aren't deterministic; need to design for edge cases |
| Evaluation Design | Defining success metrics and test strategies | Traditional product metrics insufficient for AI |
| Human-in-the-Loop | Designing workflows blending AI and human judgment | Most AI systems augment rather than replace humans |
| Adoption Measurement | Tracking actual usage beyond deployment | AI value requires user adoption, not just deployment |
PRD Structure for AI Features:
```mermaid
graph TD
    A[AI Product Requirements] --> B[Problem Statement]
    A --> C[Success Metrics]
    A --> D[User Stories]
    A --> E[Acceptance Criteria]
    A --> F[Risks & Mitigations]
    C --> C1[Primary:<br/>Business metric target]
    C --> C2[Secondary:<br/>User satisfaction]
    C --> C3[Guardrails:<br/>Safety thresholds]
    E --> E1[Performance SLAs]
    E --> E2[UX Requirements]
    E --> E3[Logging & Monitoring]
    F --> F1[Over-reliance Risk]
    F --> F2[Quality Risk]
    F --> F3[Adoption Risk]
```
PRD Example: Customer Support AI Assistant
| Component | Specification | Measurement |
|---|---|---|
| Problem | Agents spend 40% of time searching 2,500+ KB articles | Time studies, agent surveys |
| Primary Metric | Reduce average handle time (AHT) by 20% (from 12 min to <10 min) | Production metrics |
| Secondary Metrics | Maintain CSAT >4.0/5, first-contact resolution (FCR) +15% | Post-interaction surveys |
| Guardrails | Zero PII leakage, hallucination <5% | Automated monitoring |
| Performance SLA | Suggestions appear within 2 seconds | P95 latency tracking |
| Out of Scope | Fully autonomous responses, customer-facing deployment, multilingual support | MVP boundaries |
Risk-Mitigation Matrix:
| Risk | Likelihood | Impact | Mitigation | Cost |
|---|---|---|---|---|
| Over-reliance | Medium | High | Continuous training, sampling | $20K |
| Hallucinations | Low | Critical | RAG grounding, monitoring | $30K |
| Adoption resistance | High | High | Co-design, early involvement | $40K |
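The guardrails and mitigations above are only credible if they are checked continuously in production. Below is a minimal monitoring sketch, assuming hypothetical per-interaction log records with illustrative `pii_detected` and `hallucination_flag` fields produced by upstream detectors; the thresholds mirror the PRD guardrails (zero PII leakage, hallucination rate below 5%).

```python
from dataclasses import dataclass

# Hypothetical per-interaction record; field names are illustrative,
# not a logging schema defined in this playbook.
@dataclass
class InteractionLog:
    interaction_id: str
    pii_detected: bool          # output of an upstream PII detector
    hallucination_flag: bool    # output of a grounding/faithfulness check

def check_guardrails(logs: list[InteractionLog],
                     max_hallucination_rate: float = 0.05) -> list[str]:
    """Return guardrail violations for a batch of logged interactions."""
    violations: list[str] = []
    if not logs:
        return violations

    pii_count = sum(1 for log in logs if log.pii_detected)
    if pii_count > 0:  # PRD guardrail: zero PII leakage
        violations.append(f"PII leakage detected in {pii_count} interaction(s)")

    hallucination_rate = sum(1 for log in logs if log.hallucination_flag) / len(logs)
    if hallucination_rate >= max_hallucination_rate:  # PRD guardrail: <5%
        violations.append(f"Hallucination rate {hallucination_rate:.1%} exceeds threshold")

    return violations

# Example: run against a sampled batch and page on any violation.
sample = [InteractionLog("a1", False, False), InteractionLog("a2", False, True)]
for violation in check_guardrails(sample):
    print("ALERT:", violation)
```

In practice a check like this would run on a scheduled sample of traffic and feed the automated monitoring called out in the PRD.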
UX / Content Designer
Responsibilities:
- Conversation design for chatbots and AI assistants
- UI/UX for AI-augmented workflows
- Content safety and tone guidelines
- Explainability and transparency design
- Accessibility compliance
Key Skills:
- Conversational design and NLP understanding
- Human-AI interaction patterns
- Content strategy and guidelines
- Prototyping and user testing
- Accessibility standards (WCAG)
AI-Specific Challenges:
- Designing for probabilistic outputs (showing confidence, uncertainty)
- Error handling and graceful degradation
- Explaining AI decisions to non-technical users
- Managing user expectations (disclosure, limitations)
- Handling edge cases and failures
Example: Conversation Flow Design:
User: "I haven't received my order"
Agent UI:
┌─────────────────────────────────────────────────────┐
│ AI Suggestion (Confidence: HIGH) │
│ │
│ "I'm sorry to hear that. Let me look into your │
│ order status. Can you provide your order number?" │
│ │
│ [Accept] [Edit] [Reject] [Escalate] │
│ │
│ Suggested Actions: │
│ • Search order by customer email │
│ • Check shipping status │
│ • Review recent support tickets │
└─────────────────────────────────────────────────────┘
If Confidence: LOW
┌─────────────────────────────────────────────────────┐
│ ⚠️ Low Confidence - Verify Before Sending │
│ │
│ Suggested response may not be accurate. │
│ Consider: │
│ • Consulting knowledge base │
│ • Escalating to senior agent │
│ • Using standard template │
└─────────────────────────────────────────────────────┘
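Behind a mockup like this sits a small piece of gating logic that maps model confidence to the agent-facing state. A minimal sketch is below, assuming the model or a calibration layer returns a numeric confidence score; the thresholds and the `SuggestionState` names are illustrative, not recommended values.

```python
from enum import Enum

class SuggestionState(Enum):
    READY = "ready"            # show Accept / Edit / Reject / Escalate
    NEEDS_REVIEW = "review"    # show the low-confidence warning, require verification
    SUPPRESSED = "suppressed"  # do not surface the suggestion at all

def route_suggestion(confidence: float,
                     high_threshold: float = 0.75,
                     low_threshold: float = 0.40) -> SuggestionState:
    """Map a model confidence score to the agent-facing presentation state."""
    if confidence >= high_threshold:
        return SuggestionState.READY
    if confidence >= low_threshold:
        return SuggestionState.NEEDS_REVIEW
    return SuggestionState.SUPPRESSED

# Example: a 0.55 score falls into the "verify before sending" path shown above.
print(route_suggestion(0.55))  # SuggestionState.NEEDS_REVIEW
```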
Engineering Roles
Tech Lead / Solution Architect
Responsibilities:
- Technical architecture and design decisions
- Non-functional requirements (performance, scalability, security)
- Integration patterns and API design
- Technology selection and evaluation
- Technical risk assessment
- Code quality and engineering standards
Key Skills:
- System design and architecture patterns
- Cloud platforms (AWS, Azure, GCP)
- API design and microservices
- Performance optimization
- Security and compliance
Typical Background: Software engineering or ML engineering with 7-10+ years, including 2+ years in AI/ML
Decision Rights:
- Technology stack within approved options
- Architecture patterns and design principles
- NFR targets (latency, throughput, availability)
- Technical debt management
Example: Architecture Decision Record (ADR):
## ADR-015: Vector Database Selection for RAG System
**Status**: Accepted
**Context**:
We need a vector database to store 500K document embeddings for RAG-based
customer support. Requirements:
- <200ms query latency at P95
- Support for metadata filtering
- Scalable to 5M+ documents
- Integration with existing AWS infrastructure
**Options Considered**:
1. Pinecone (managed service)
2. Weaviate (self-hosted)
3. pgvector (PostgreSQL extension)
4. Elasticsearch with vector search
**Decision**: Pinecone
**Rationale**:
- Meets latency requirements (benchmarked at 120ms P95)
- Managed service reduces ops burden
- Strong metadata filtering support
- Pricing competitive for our volume ($150/mo projected)
**Tradeoffs Accepted**:
- Vendor lock-in (mitigated by abstraction layer)
- Higher cost than self-hosted (~2x)
- Data sent to third party (acceptable for non-PII KB articles)
**Consequences**:
- Faster time to market (no ops setup)
- Monthly operational cost
- Dependency on Pinecone availability SLA (99.9%)
- Will re-evaluate if volume exceeds 10M documents
**Review Date**: 2025-06-01 (6 months)
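The abstraction layer mentioned under tradeoffs can be as thin as a shared interface that application code depends on, so a later move away from Pinecone (for example to pgvector) only requires a new adapter. Below is a minimal sketch with an in-memory reference implementation for local development and tests; the `VectorStore` protocol and class names are hypothetical, not part of the ADR.

```python
from typing import Protocol
import numpy as np

class VectorStore(Protocol):
    """Interface the application depends on; concrete adapters (e.g. for Pinecone
    or pgvector) would implement the same two methods."""
    def upsert(self, doc_id: str, embedding: list[float], metadata: dict) -> None: ...
    def query(self, embedding: list[float], top_k: int = 5) -> list[tuple[str, float]]: ...

class InMemoryVectorStore:
    """Reference implementation using cosine similarity; useful for tests and local dev."""
    def __init__(self) -> None:
        self._docs: dict[str, tuple[np.ndarray, dict]] = {}

    def upsert(self, doc_id: str, embedding: list[float], metadata: dict) -> None:
        self._docs[doc_id] = (np.asarray(embedding, dtype=float), metadata)

    def query(self, embedding: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        q = np.asarray(embedding, dtype=float)
        scores = []
        for doc_id, (vec, _meta) in self._docs.items():
            sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-9))
            scores.append((doc_id, sim))
        return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Application code is written against VectorStore, so re-evaluating the ADR later
# (e.g. past 10M documents) means writing a new adapter, not rewriting callers.
store: VectorStore = InMemoryVectorStore()
store.upsert("kb-001", [0.1, 0.9, 0.0], {"source": "kb"})
print(store.query([0.1, 0.8, 0.1], top_k=1))
```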
ML / LLM Engineer
Responsibilities:
- Model development, training, and evaluation
- Prompt engineering and optimization
- RAG pipeline design and implementation
- Fine-tuning and model adaptation
- Evaluation framework design
- Model performance monitoring
Key Skills:
- Machine learning fundamentals and algorithms
- Deep learning frameworks (PyTorch, TensorFlow)
- LLM APIs and open-source models
- Prompt engineering and RAG architecture
- Evaluation science and metrics
- Python and ML libraries
Typical Background: ML engineering, data science, or research with 3-7 years experience
Decision Rights:
- Model selection within architecture constraints
- Evaluation metrics and thresholds
- Prompt design and optimization
- Training data preparation and augmentation
Skill Ladder Example:
| Level | Experience | Capabilities | Autonomy |
|---|---|---|---|
| Junior | 0-2 years | Implement models from specs, run evals, tune prompts | Supervised by senior |
| Mid | 2-5 years | Design evaluation frameworks, optimize RAG pipelines, fine-tune models | Owns features end-to-end |
| Senior | 5-10 years | Architecture design, novel approaches, evaluation strategy | Owns system design |
| Staff+ | 10+ years | Cross-system optimization, research, technical strategy | Sets technical direction |
Data Engineer
Responsibilities:
- Data pipeline development and orchestration
- Data quality monitoring and validation
- Data contracts and schema management
- Data lineage and catalog maintenance
- Privacy and compliance controls
- Feature engineering and feature store
Key Skills:
- ETL/ELT pipeline development
- SQL and data modeling
- Workflow orchestration (Airflow, Prefect)
- Data quality frameworks
- Privacy engineering
- Cloud data platforms
Typical Background: Data engineering, software engineering, or analytics engineering with 3-7 years experience
Decision Rights:
- Data pipeline architecture and tools
- Data quality thresholds and monitoring
- Schema evolution and versioning
- Data access patterns and optimization
Data Contract Components:
```mermaid
graph TD
    A[Data Contract] --> B[Schema Definition]
    A --> C[Quality Rules]
    A --> D[SLA Commitments]
    A --> E[Privacy Controls]
    B --> B1[Field definitions<br/>Types & constraints<br/>PII flags]
    C --> C1[Completeness: <5% null]
    C --> C2[Timeliness: <24hr lag]
    C --> C3[Uniqueness: Key constraints]
    D --> D1[Freshness: 24 hours]
    D --> D2[Availability: 99.5%]
    D --> D3[Support: On-call team]
    E --> E1[Retention: 730 days]
    E --> E2[Access: Role-based]
    E --> E3[PII handling: Redact/encrypt]
```
Data Contract Template (Simplified):
| Component | Example Specification | Enforcement |
|---|---|---|
| Schema | customer_id (string, required, unique) | Validation on ingestion |
| Quality Rules | <5% NULL for required fields, <24hr freshness | Automated monitoring, alerts |
| SLA | 99.5% availability, 24hr refresh | Dashboard tracking |
| Privacy | No PII, 730-day retention, role-based access | Access logs, auto-deletion |
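A contract is only as good as its enforcement on ingestion. The sketch below shows one way the simplified template could be validated automatically, assuming a pandas DataFrame with the illustrative `customer_id` and `updated_at` columns; the thresholds mirror the contract above (<5% nulls, <24-hour freshness, unique keys).

```python
import pandas as pd

def validate_contract(df: pd.DataFrame,
                      key_column: str = "customer_id",
                      freshness_column: str = "updated_at",
                      max_null_rate: float = 0.05,
                      max_lag_hours: float = 24.0) -> list[str]:
    """Return contract violations for a batch; an empty list means the batch passes."""
    violations: list[str] = []

    # Quality rule: required field must be <5% NULL.
    null_rate = df[key_column].isna().mean()
    if null_rate >= max_null_rate:
        violations.append(f"{key_column} null rate {null_rate:.1%} >= {max_null_rate:.0%}")

    # Schema rule: key must be unique.
    if df[key_column].dropna().duplicated().any():
        violations.append(f"{key_column} contains duplicate keys")

    # SLA rule: data must be fresher than 24 hours.
    latest = pd.to_datetime(df[freshness_column], utc=True).max()
    lag_hours = (pd.Timestamp.now(tz="UTC") - latest).total_seconds() / 3600
    if lag_hours >= max_lag_hours:
        violations.append(f"freshness lag {lag_hours:.1f}h >= {max_lag_hours:.0f}h")

    return violations
```

A check like this would typically run as the enforcement step listed in the table (validation on ingestion), with violations routed to the automated monitoring and alerting channel.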
Platform Engineer (MLOps)
Responsibilities:
- ML infrastructure and tooling
- CI/CD pipelines for ML
- Model registry and versioning
- Model serving and deployment
- Monitoring and observability
- Cost optimization
Key Skills:
- DevOps and CI/CD
- Kubernetes and container orchestration
- ML serving frameworks (TensorFlow Serving, Seldon)
- Monitoring and alerting (Prometheus, Grafana)
- Infrastructure as Code (Terraform, Pulumi)
- Cloud platforms
Typical Background: DevOps, SRE, or platform engineering with 3-7 years, plus ML exposure
Decision Rights:
- Platform architecture and tooling
- Deployment strategies (canary, blue-green)
- SLOs and monitoring strategy
- Resource allocation and scaling policies
MLOps Maturity Ladder:
```mermaid
graph TD
    A[Level 0: Manual] --> B[Level 1: Automated Training]
    B --> C[Level 2: Automated Deployment]
    C --> D[Level 3: Full CI/CD]
    D --> E[Level 4: Automated Monitoring & Retraining]
    A1[Notebooks, manual deploy] --> A
    B1[Automated pipelines, manual deploy] --> B
    C1[Automated deploy, manual monitoring] --> C
    D1[Full automation, manual retraining] --> D
    E1[Fully automated MLOps] --> E
```
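The deployment strategies listed under the platform engineer's decision rights (canary, blue-green) ultimately reduce to automatable promotion gates. Below is a minimal canary gate sketch, assuming hypothetical metric names collected from the monitoring stack during the canary window; the tolerances are illustrative, not prescribed targets.

```python
def canary_gate(baseline: dict, canary: dict,
                max_relative_error_increase: float = 0.10,
                max_p95_latency_ms: float = 500.0) -> bool:
    """Return True if the canary may be promoted, False if it should be rolled back.

    `baseline` and `canary` are illustrative metric dicts, e.g.
    {"error_rate": 0.02, "p95_latency_ms": 310.0}.
    """
    error_regression = canary["error_rate"] - baseline["error_rate"]
    if error_regression > max_relative_error_increase * max(baseline["error_rate"], 1e-9):
        return False  # error rate regressed by more than 10% relative to baseline
    if canary["p95_latency_ms"] > max_p95_latency_ms:
        return False  # violates the platform P95 latency target
    return True

# Example: promote only if the canary is within tolerance on both dimensions.
ok = canary_gate({"error_rate": 0.020, "p95_latency_ms": 310.0},
                 {"error_rate": 0.021, "p95_latency_ms": 340.0})
print("promote" if ok else "rollback")
```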
Governance Roles
Security / Compliance Specialist
Responsibilities:
- Threat modeling for AI systems
- Security architecture review
- Privacy impact assessments (DPIA/PIA)
- Compliance verification (GDPR, CCPA, AI Act)
- Audit support and evidence collection
- Incident response
Key Skills:
- Security frameworks (NIST, ISO 27001)
- Privacy regulations (GDPR, CCPA, HIPAA)
- Threat modeling methodologies
- Risk assessment
- Audit and compliance
Typical Background: Security engineering, compliance, or risk management with 5-10 years experience
Decision Rights:
- Security controls and requirements
- Compliance sign-off on deployments
- Incident response procedures
- Third-party vendor assessment
Ethics / Responsible AI Lead
Responsibilities:
- Ethical risk assessment
- Fairness testing and bias mitigation
- Explainability requirements
- AI governance framework design
- Ethics training and awareness
- Stakeholder engagement
Key Skills:
- AI ethics frameworks and principles
- Fairness metrics and mitigation techniques
- Stakeholder engagement
- Policy development
- Communication and training
Typical Background: Ethics, policy, social science, or technical background with ethics focus; 3-7 years
Decision Rights:
- Ethical requirements and guardrails
- Fairness metric selection and thresholds
- Stakeholder consultation approach
- Escalation for ethical concerns
Team Topologies
How you structure AI teams significantly impacts velocity, quality, and alignment.
Topology 1: Cross-Functional Pods
Structure: Small, autonomous teams owning specific AI products or use cases end-to-end.
```mermaid
graph TD
    A[AI Team Pods] --> B[Customer Support Pod]
    A --> C[Fraud Detection Pod]
    A --> D[Recommendation Pod]
    B --> B1[Product Manager]
    B --> B2[ML Engineer x2]
    B --> B3[Data Engineer]
    B --> B4[Platform Engineer]
    C --> C1[Product Manager]
    C --> C2[ML Engineer x2]
    C --> C3[Data Engineer]
    D --> D1[Product Manager]
    D --> D2[ML Engineer x3]
    D --> D3[Data Engineer]
```
Pros:
- Fast decision making and iteration
- Clear ownership and accountability
- Direct connection to business value
- High autonomy and team satisfaction
Cons:
- Risk of duplicated infrastructure
- Inconsistent standards across pods
- Knowledge silos
- Difficulty sharing resources
Best For:
- Startups and scale-ups (10-100 people)
- Organizations with 2-5 distinct AI products
- High innovation priority, acceptable redundancy
Example: Customer Support Pod:
Team Size: 6 people
Reporting: Dotted line to AI Platform Lead, solid line to Support VP
Roles:
- 1 Product Manager (20% time from Support team)
- 2 ML Engineers (full-time, dedicated)
- 1 Data Engineer (50% time, shared with another pod)
- 1 Platform Engineer (30% time, shared with platform team)
- 1 UX Designer (20% time, shared resource)
Owns:
- Customer support AI assistant
- Agent training and adoption
- Knowledge base improvement pipeline
- Metrics and monitoring
- A/B testing and iteration
Dependencies:
- Shared ML platform (deployment, monitoring)
- Shared data infrastructure
- Security/compliance review (central team)
Topology 2: Central AI Platform COE
Structure: Centralized team provides AI capabilities and platform to business units.
```mermaid
graph TD
    A[Central AI Platform Team] --> B[Core Platform]
    A --> C[Shared Services]
    A --> D[Governance]
    B --> B1[MLOps Infrastructure]
    B --> B2[Model Registry]
    B --> B3[Feature Store]
    C --> C1[Evaluation Framework]
    C --> C2[RAG Pipeline]
    C --> C3[LLM Gateway]
    D --> D1[Standards & Policies]
    D --> D2[Security & Compliance]
    D --> D3[Training & Enablement]
    E[Business Unit A] -.requests.-> A
    F[Business Unit B] -.requests.-> A
    G[Business Unit C] -.requests.-> A
```
Pros:
- Strong governance and standards
- Efficient resource utilization
- Deep AI expertise concentration
- Consistent quality and security
Cons:
- Can become bottleneck
- Slower response to business needs
- Risk of disconnect from business context
- Queue management challenges
Best For:
- Large enterprises (1000+ people)
- Highly regulated industries
- Early-stage AI capability building
- Organizations prioritizing control over speed
Operating Model:
## AI Platform COE Service Catalog
### Tier 1: Self-Service (SLA: Immediate)
- Model deployment via platform UI
- Standard RAG pipeline template
- Pre-built evaluation frameworks
- Documentation and tutorials
### Tier 2: Guided Service (SLA: 2 weeks)
- Custom model development support
- Advanced RAG optimization
- Custom evaluation design
- Performance tuning
### Tier 3: Full Service (SLA: 8-12 weeks)
- End-to-end AI solution delivery
- Novel architecture design
- Research and prototyping
- Staffing: 2-5 people embedded with business unit
### Intake Process:
1. Business unit submits intake form
2. COE reviews and triages (T-shirt sizing)
3. If Tier 1/2: Self-service or guided
4. If Tier 3: Proposal with scope, timeline, resources
5. Prioritization committee approves (monthly)
6. Kickoff and delivery
Topology 3: Federated Model
Structure: Domain-aligned pods with centralized platform and standards.
```mermaid
graph TD
    A[AI Operating Model] --> B[Central Platform Team]
    A --> C[Domain Pods]
    B --> B1[Shared Infrastructure]
    B --> B2[Standards & Governance]
    B --> B3[Enablement & Training]
    C --> C1[Marketing AI Pod]
    C --> C2[Operations AI Pod]
    C --> C3[Product AI Pod]
    C1 -.uses.-> B1
    C2 -.uses.-> B1
    C3 -.uses.-> B1
    C1 -.adheres to.-> B2
    C2 -.adheres to.-> B2
    C3 -.adheres to.-> B2
```
Pros:
- Balances autonomy and consistency
- Domain expertise close to business
- Shared infrastructure reduces redundancy
- Scales well to large organizations
Cons:
- Requires strong governance without micromanagement
- Can be complex to coordinate
- Potential for standards drift
- Requires investment in platform team
Best For:
- Large organizations (500+ people) with multiple business units
- Organizations with 5+ AI initiatives
- Mature AI practice (18+ months experience)
Governance Structure:
## Federated AI Governance
### Central Platform Team (Enables)
- **Size**: 8-12 people
- **Responsibilities**:
- Maintain shared ML platform
- Define technical standards (e.g., model versioning, monitoring)
- Provide training and enablement
- Shared services (e.g., LLM gateway, eval frameworks)
- **Does NOT**:
- Prioritize domain pod work
- Own domain-specific models or products
### Domain Pods (Execute)
- **Size**: 4-8 people each
- **Responsibilities**:
- Build and operate AI products for their domain
- Meet company AI standards
- Contribute learnings to central team
- Participate in community of practice
- **Autonomy**:
- Full ownership of roadmap and priorities
- Choice of models and techniques (within guardrails)
- Direct reporting to domain leadership
### AI Governance Council (Aligns)
- **Members**: Central platform lead + domain pod leads + CISO + Chief Data Officer
- **Frequency**: Monthly
- **Responsibilities**:
- Set AI strategy and standards
- Prioritize cross-cutting initiatives
- Resolve conflicts and dependencies
- Share best practices
- Review high-risk initiatives
### Community of Practice (Shares)
- **Members**: All AI practitioners
- **Frequency**: Weekly office hours, monthly demos
- **Activities**:
- Knowledge sharing (brown bags, demos)
- Template and pattern library
- Peer code reviews
- Working groups (e.g., LLM evaluation, fairness)
RACI & Decision Rights
Clear decision-making authority prevents delays and conflicts.
RACI Matrix for AI Initiatives
| Decision | Product | Architecture | Data | Security | ML Eng | Platform |
|---|---|---|---|---|---|---|
| Feature Prioritization | A | C | C | I | C | I |
| Success Metrics | A/R | C | C | I | C | I |
| Model Selection | C | R | C | C | A | C |
| Architecture Design | C | A/R | C | C | C | C |
| Data Usage | C | I | A/R | C | C | I |
| Privacy/Security Controls | I | C | C | A/R | C | C |
| Deployment Approval | C | R | I | R | R | A |
| SLO Definition | R | R | I | C | R | A |
Legend:
- R: Responsible (does the work)
- A: Accountable (single decision-maker)
- C: Consulted (input sought)
- I: Informed (kept in the loop)
Decision-Making Framework
```mermaid
graph TD
    A[Decision Type] --> B{Scope}
    B -->|Individual Feature| C[Product Manager]
    B -->|Technical Design| D[Tech Lead/Architect]
    B -->|Data Usage| E[Data Lead]
    B -->|Security/Compliance| F[Security Lead]
    B -->|Cross-Team| G[Governance Council]
    B -->|Strategic| H[Executive Sponsor]
    C --> C1{Within Approved Scope?}
    C1 -->|Yes| C2[Decide & Inform]
    C1 -->|No| C3[Escalate to Engagement Lead]
    D --> D1{Follows Standards?}
    D1 -->|Yes| D2[Decide & Document ADR]
    D1 -->|No| D3[Escalate to Architecture Review Board]
```
Enablement & Hiring
Building AI capability requires both hiring and upskilling.
Skill Matrix by Role
ML/LLM Engineer Skill Matrix
| Skill Domain | Junior (L2) | Mid (L3) | Senior (L4) | Staff+ (L5+) |
|---|---|---|---|---|
| ML Fundamentals | Understand common algorithms | Apply algorithms to new problems | Design novel approaches | Research & innovation |
| LLM Capabilities | Use LLM APIs, basic prompting | Advanced prompting, RAG implementation | RAG optimization, fine-tuning | Architecture innovation |
| Evaluation | Run standard evals | Design custom evals | Define evaluation strategy | Novel evaluation methods |
| Production | Deploy models with guidance | Own deployment end-to-end | Design serving architecture | Optimize cost/perf at scale |
| Communication | Explain work to team | Present to stakeholders | Influence across org | Set technical direction |
Hiring Considerations
Build vs. Hire vs. Partner:
| Capability | Build (Train Existing) | Hire (Full-Time) | Partner (Consultant/Contractor) |
|---|---|---|---|
| Domain Expertise | Preferred (context important) | When can't build fast enough | For short-term projects |
| Core AI/ML Skills | For mid-level roles | For senior specialized roles | For short-term, specialized needs |
| MLOps/Platform | If have DevOps background | For dedicated platform team | For initial platform setup |
| Ethics/Governance | Supplement with training | For dedicated role at scale | For assessments and audits |
Hiring Rubric Example (ML Engineer):
## ML Engineer Hiring Rubric
### Technical Depth (40%)
- [ ] ML fundamentals: Can explain bias-variance tradeoff, overfitting, regularization
- [ ] Practical experience: Has built and deployed 2+ production ML systems
- [ ] LLM knowledge: Understands prompting, RAG, fine-tuning tradeoffs
- [ ] Evaluation: Can design appropriate metrics for different tasks
### Problem Solving (30%)
- [ ] Breaks down ambiguous problems into actionable steps
- [ ] Considers multiple approaches and tradeoffs
- [ ] Asks clarifying questions
- [ ] Realistic about constraints (data, time, resources)
### Communication (20%)
- [ ] Explains technical concepts clearly to non-technical audience
- [ ] Writes clear documentation
- [ ] Collaborates effectively in group settings
- [ ] Gives and receives feedback constructively
### Culture Fit (10%)
- [ ] Curiosity and learning orientation
- [ ] Ethical awareness and responsibility
- [ ] Team-first mindset
- [ ] Resilience and adaptability
### Interview Process:
1. Phone screen (30 min): Background, motivations, high-level technical
2. Technical interview (60 min): ML problem solving, coding
3. System design (60 min): Design RAG system for given use case
4. Behavioral (45 min): Past projects, collaboration, ethics
5. Hiring committee review
Upskilling Paths
GenAI Upskilling for Traditional ML Engineers:
## 12-Week GenAI Upskilling Program
### Weeks 1-2: LLM Foundations
- How LLMs work (transformers, attention, pretraining)
- Prompting basics (zero-shot, few-shot, chain-of-thought)
- Hands-on: Use OpenAI/Anthropic APIs for classification tasks
- Project: Convert existing ML classification model to LLM-based
### Weeks 3-4: Advanced Prompting
- Prompt engineering techniques
- Few-shot learning and in-context learning
- Output parsing and structured generation
- Hands-on: Build a data extraction pipeline
- Project: Summarization or Q&A system
### Weeks 5-6: RAG Systems
- Embeddings and vector search
- Chunking strategies
- Retrieval optimization
- Hands-on: Build end-to-end RAG pipeline
- Project: Internal knowledge base Q&A
### Weeks 7-8: Fine-Tuning & Adaptation
- When to fine-tune vs. prompt
- Full fine-tuning vs. LoRA
- Dataset preparation and evaluation
- Hands-on: Fine-tune a small model
- Project: Domain-specific model adaptation
### Weeks 9-10: Evaluation & Safety
- LLM evaluation frameworks
- Hallucination detection and mitigation
- Safety and red-teaming
- Hands-on: Design evaluation for a use case
- Project: Comprehensive evaluation of RAG system
### Weeks 11-12: Production & Optimization
- Serving and scaling LLMs
- Cost optimization techniques
- Monitoring and debugging
- Hands-on: Deploy and optimize a system
- Final project: Presentation to team
### Assessments:
- Weekly quizzes (20%)
- Hands-on projects (50%)
- Final project (30%)
### Resources:
- Online courses (Coursera, DeepLearning.AI)
- Internal documentation and templates
- 1-on-1 mentorship with senior LLM engineer
- Community of practice participation
Metrics & Success Measurement
Track what matters across delivery, product, and platform.
Delivery Metrics (Team Health)
| Metric | Target | Calculation | Purpose |
|---|---|---|---|
| Lead Time | <4 weeks | Idea to production | Measure delivery speed |
| Deployment Frequency | Weekly+ | Deployments per week | Measure iteration speed |
| Change Failure Rate | <15% | Failed deploys / total deploys | Measure quality |
| MTTR | <2 hours | Time to resolve incidents | Measure reliability |
| Cycle Time | <2 weeks | Start to done for features | Measure efficiency |
DORA Metrics Calculation Framework:
```mermaid
graph LR
    A[DORA Metrics Dashboard] --> B[Lead Time]
    A --> C[Deployment Frequency]
    A --> D[Change Failure Rate]
    A --> E[MTTR]
    B --> B1[Ticket creation → Deployment<br/>Median & P95]
    C --> C1[Deployments per week<br/>Target: Weekly+]
    D --> D1[Failed deploys / Total<br/>Target: <15%]
    E --> E1[Incident start → Resolution<br/>Target: <2 hours]
```
DORA Metrics Tracking Summary:
| Metric | Calculation | Data Sources | Target (High Performers) | Alert Threshold |
|---|---|---|---|---|
| Lead Time | Ticket creation → Production deployment (median) | JIRA + GitHub/GitLab | <1 week | >4 weeks |
| Deployment Frequency | Deployments per week | CI/CD pipeline logs | Daily+ | <1 per month |
| Change Failure Rate | Failed deployments / Total deployments | Incident logs + deployment logs | <15% | >30% |
| MTTR | Incident detection → Resolution (median) | Monitoring system + incident tracker | <1 hour | >4 hours |
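The table above can be computed with a small amount of glue code once deployment and incident records are exported from the ticketing, CI/CD, and incident systems. A minimal sketch, assuming the records have already been joined into simple dictionaries with illustrative field names:

```python
from datetime import datetime, timedelta
from statistics import median

def dora_metrics(deployments: list[dict], incidents: list[dict], weeks: float) -> dict:
    """Compute the four DORA metrics from pre-joined records.

    deployments: [{"created_at": datetime, "deployed_at": datetime, "failed": bool}, ...]
    incidents:   [{"detected_at": datetime, "resolved_at": datetime}, ...]
    """
    lead_times = [(d["deployed_at"] - d["created_at"]).days for d in deployments]
    mttr_hours = [(i["resolved_at"] - i["detected_at"]).total_seconds() / 3600
                  for i in incidents]
    return {
        "lead_time_days_median": median(lead_times) if lead_times else None,
        "deploys_per_week": len(deployments) / weeks if weeks else 0.0,
        "change_failure_rate": (sum(d["failed"] for d in deployments) / len(deployments)
                                if deployments else 0.0),
        "mttr_hours_median": median(mttr_hours) if mttr_hours else None,
    }

# Example over a 4-week window with two deployments and one incident.
now = datetime(2025, 1, 28)
print(dora_metrics(
    deployments=[{"created_at": now - timedelta(days=6), "deployed_at": now, "failed": False},
                 {"created_at": now - timedelta(days=3), "deployed_at": now, "failed": True}],
    incidents=[{"detected_at": now - timedelta(hours=2), "resolved_at": now}],
    weeks=4,
))
```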
Product Metrics (Business Value)
| Metric | Example Target | Measurement Approach |
|---|---|---|
| Adoption Rate | 80% of target users within 3 months | Active users / eligible users |
| User Satisfaction | CSAT >4.0/5 | Post-interaction surveys |
| Task Success Rate | >90% of tasks completed successfully | Task completion tracking |
| Time Savings | 20% reduction in task time | Before/after time measurement |
| Business Impact | $500K annual cost savings | ROI calculation |
Platform Metrics (Operational Excellence)
| Metric | Target | Purpose |
|---|---|---|
| API Availability | 99.9% | Service reliability |
| P95 Latency | <500ms | User experience |
| Cost per Request | <$0.01 | Economic efficiency |
| GPU Utilization | >70% | Resource efficiency |
| Model Drift | <5% degradation/month | Model health |
Platform Dashboard Design Framework:
```mermaid
graph TD
    A[AI Platform Dashboard] --> B[Real-Time Metrics]
    A --> C[Historical Trends]
    A --> D[Alerts & Incidents]
    B --> B1[Availability: 99.9%]
    B --> B2[P95 Latency: <500ms]
    B --> B3[Cost/Request: $0.01]
    B --> B4[GPU Utilization: 70%]
    C --> C1[Latency Over Time]
    C --> C2[Cost Breakdown]
    C --> C3[Model Performance Trends]
    D --> D1[Active Alerts]
    D --> D2[Recent Incidents]
    D --> D3[SLA Compliance]
```
Dashboard Components Checklist:
| Component | Metrics Displayed | Update Frequency | Alert Threshold |
|---|---|---|---|
| Real-Time Status | Availability, latency, cost, utilization | Every 1 minute | Availability <99%, Latency >500ms |
| Historical Trends | 7-day & 30-day charts | Every 5 minutes | Degradation >10% week-over-week |
| Cost Analysis | Breakdown by component, trends | Daily | Budget exceeded |
| Model Performance | Accuracy, drift, feature importance | Hourly | Drift >5%, Accuracy <threshold |
| Incidents | Active alerts, recent issues, MTTR | Real-time | Any P0/P1 active |
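The alert thresholds in the checklist can be evaluated as a small rule set rather than bespoke logic per panel. A sketch follows, assuming a hypothetical snapshot dictionary of current platform metrics; interpreting "drift" as relative accuracy degradation is an assumption, and the field names are illustrative.

```python
def evaluate_alerts(snapshot: dict) -> list[str]:
    """Compare a metrics snapshot against the dashboard alert thresholds.

    `snapshot` is an illustrative dict, e.g.:
    {"availability": 0.9992, "p95_latency_ms": 420, "accuracy": 0.91,
     "baseline_accuracy": 0.93, "monthly_cost": 9200, "monthly_budget": 10000}
    """
    alerts: list[str] = []
    if snapshot["availability"] < 0.99:
        alerts.append("Availability below 99%")
    if snapshot["p95_latency_ms"] > 500:
        alerts.append("P95 latency above 500ms")
    # Model drift expressed as relative accuracy degradation (>5% triggers an alert).
    degradation = (snapshot["baseline_accuracy"] - snapshot["accuracy"]) / snapshot["baseline_accuracy"]
    if degradation > 0.05:
        alerts.append(f"Model degradation {degradation:.1%} exceeds 5%")
    if snapshot["monthly_cost"] > snapshot["monthly_budget"]:
        alerts.append("Monthly budget exceeded")
    return alerts

print(evaluate_alerts({"availability": 0.9992, "p95_latency_ms": 420,
                       "accuracy": 0.91, "baseline_accuracy": 0.93,
                       "monthly_cost": 9200, "monthly_budget": 10000}))
```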
Anti-Patterns
Common failure modes and how to avoid them.
1. Platform Without Customers
Symptom: Building extensive ML infrastructure before validating demand.
Impact:
- Wasted investment in unused features
- Missed business opportunities
- Team frustration and low morale
Example: A company built a full-featured ML platform with feature store, experiment tracking, and deployment automation. After 9 months and $2M investment, they realized they only had 2 simple use cases that could have used existing tools.
Prevention:
- Start with 1-2 concrete use cases
- Build minimum viable platform iteratively
- Validate demand before building
- Follow "you aren't gonna need it" (YAGNI) principle
Recovery:
- Identify current needs vs. speculative features
- Deprecate unused components
- Pivot to support actual use cases
- Co-create with actual users
2. Siloed DS/ML Teams
Symptom: Data scientists disconnected from product, engineering, and operations.
Impact:
- Models that don't solve real problems
- Deployment bottlenecks and handoff failures
- Low production adoption rate
- Frustration on all sides
Example: A data science team built 20+ models over 18 months, but only 3 made it to production. The disconnect: they optimized for accuracy, not business impact or operational feasibility.
Prevention:
- Embed DS/ML in cross-functional teams
- Require product alignment before modeling
- Include ops in design reviews
- Measure success by production impact, not model accuracy
Recovery:
- Reorganize into cross-functional pods
- Establish product-first culture
- Create handoff processes with engineering and ops
- Jointly define success metrics (business + technical)
3. No Clear Safety Ownership
Symptom: Safety and ethics treated as "someone else's problem."
Impact:
- Safety issues discovered late (expensive to fix)
- Compliance violations and legal risk
- Reputational damage
- User harm
Example: A hiring AI was deployed without fairness testing because "everyone assumed someone else was handling it." Six months later, a bias audit revealed significant gender disparities, leading to a lawsuit and public embarrassment.
Prevention:
- Explicit responsible AI role or embedded responsibility
- Safety reviews as deployment gate
- Clear RACI for ethics and fairness
- Regular training on responsible AI
Recovery:
- Immediate safety audit of all deployed systems
- Appoint responsible AI lead
- Implement governance process
- Retrain team on ethical AI practices
Summary
Effective AI teams require systematic approaches to roles, structure, and operations:
Team Design Framework
```mermaid
graph TD
    A[AI Team Design] --> B[Roles & Skills]
    A --> C[Team Topology]
    A --> D[Decision Rights]
    A --> E[Metrics]
    B --> B1[Leadership: Partner, PM]
    B --> B2[Technical: ML Eng, Data Eng, MLOps]
    B --> B3[Governance: Security, Ethics]
    C --> C1[Pods: Agile, 4-8 people]
    C --> C2[COE: Control, 8-15 people]
    C --> C3[Federated: Balance, multiple pods + platform]
    D --> D1[RACI Matrix]
    D --> D2[Decision Framework]
    D --> D3[Escalation Paths]
    E --> E1[DORA Metrics]
    E --> E2[Business Impact]
    E --> E3[Platform Health]
```
Key Takeaways Matrix
| Dimension | Best Practice | Typical Cost | Impact of Skipping | ROI |
|---|---|---|---|---|
| Role Clarity | Defined responsibilities, RACI matrix | $20K (documentation) | Delays, conflicts | 5-10x (time saved) |
| Right Topology | Match structure to scale | $200K (reorganization) | Bottlenecks, inefficiency | 3-7x (productivity) |
| Decision Rights | Clear decision framework | $30K (setup) | Slow decisions, escalations | 4-8x (speed) |
| Upskilling | 12-week programs per person | $15K per person | Talent gaps, hiring costs | 2-5x (vs. hiring) |
| Metrics & Dashboards | DORA + business + platform metrics | $80K (setup) | Blind spots, reactive mode | 6-12x (issue prevention) |
Team Size & Structure Guidelines
Startup (10-50 people):
- Structure: 1-2 cross-functional pods
- Team size: 4-8 people per pod
- Cost: $1.6M annually (fully loaded)
- Capability: 2-5 AI products
Scale-up (50-500 people):
- Structure: 3-8 specialized pods OR Central COE
- Team size: 20-40 AI practitioners
- Cost: $8M annually
- Capability: 10-30 AI products
Enterprise (500+ people):
- Structure: Federated model (Central platform + domain pods)
- Team size: 50-150+ AI practitioners
- Cost: $30M+ annually
- Capability: 50-200+ AI products
Hiring vs. Build vs. Partner Decision Matrix
| Capability Need | Build (Upskill) | Hire (Full-time) | Partner (Consultant) | Typical Outcome |
|---|---|---|---|---|
| Domain Expertise | ✅ Preferred | When too slow | Short-term needs | 80% build, 15% hire, 5% partner |
| Core AI/ML Skills | For mid-level | ✅ Senior specialists | Specialized projects | 40% build, 50% hire, 10% partner |
| MLOps/Platform | If DevOps exists | ✅ Dedicated team | Initial setup | 30% build, 60% hire, 10% partner |
| Ethics/Governance | Training supplements | At scale | Assessments, audits | 50% build, 30% hire, 20% partner |
Cost Comparison (Per Capability):
- Upskilling: $15K + 3 months → Retain existing talent
- Hiring: $250K/year + 3-6 months ramp → New capability
- Partnering: $300/hour ($150K per project) → Immediate expertise
Critical Success Factors
- Clear Roles: Specialized AI roles (ML engineer, LLM engineer, MLOps) alongside traditional roles (PM, engineering, security)
- Right Topology: Choose structure based on scale—pods for agility (startups), COE for control (early AI), federated for balance (enterprises)
- Decision Clarity: RACI matrices and decision rights prevent delays and conflicts (30-50% faster decisions)
- Continuous Learning: Upskilling existing talent (about 3 months) is faster and more sustainable than hiring for every skill (6+ months including ramp-up)
- Balanced Metrics: Track delivery speed (DORA), business impact (ROI), and operational excellence (SLAs)
Common Failure Modes & Prevention
| Anti-Pattern | Cost of Failure | Prevention | Recovery Time |
|---|---|---|---|
| Platform without customers | $2M+ wasted investment | Start with 1-2 use cases, build incrementally | 6-12 months |
| Siloed DS/ML teams | 85% of models never deployed | Cross-functional pods, product alignment | 3-6 months |
| No clear safety ownership | $5M+ lawsuits + reputational damage | Explicit responsible AI role, gated reviews | 9-18 months |
| Unclear decision rights | 50% slower delivery | RACI matrix, decision framework | 1-3 months |
Key Insight: Effective team design is not about perfect organizational charts—it's about clear roles, aligned incentives, rapid decision-making, and continuous learning. Invest in structure and processes early to avoid expensive reorganizations later.
The next chapter explores the end-to-end AI lifecycle from discovery through value realization.