Chapter 68 — Process & Operating Procedures
Overview
Operationalize standard operating procedures (SOPs) for safe, reliable AI operations, from model updates to incident response.
Policies define what must be done, but procedures define how to do it. Standard Operating Procedures (SOPs) translate governance requirements into executable workflows that teams can follow consistently. Well-designed SOPs ensure AI systems operate safely and reliably, enable rapid incident response, and maintain compliance. This chapter provides frameworks and templates for creating, testing, and maintaining operational procedures across the AI lifecycle.
Why It Matters
Clear procedures reduce incidents and align teams in high-velocity environments. SOPs translate policy into action.
Why SOPs are critical for AI operations:
- Consistency: Everyone follows the same process, reducing variance and errors
- Speed: Clear procedures enable fast decision-making without escalation
- Safety: Critical safety checks are embedded in workflows, not left to memory
- Compliance: Audit requirements are built into procedures, not added afterward
- Onboarding: New team members can operate effectively by following documented procedures
- Continuous Improvement: Standardized processes can be measured and optimized
Costs of missing or poor SOPs:
- Inconsistent practices lead to quality issues and incidents
- Slow response times because teams debate what to do
- Compliance failures from missed steps or undocumented actions
- Knowledge loss when experienced operators leave
- Difficult post-incident analysis due to lack of standard procedures
- Inability to scale operations beyond small teams who "just know what to do"
SOP Framework
```mermaid
graph TD
    A[SOP Development] --> B[Identify Critical Workflows]
    A --> C[Document Procedures]
    A --> D[Assign Ownership]
    A --> E[Test & Validate]
    A --> F[Maintain & Improve]
    B --> B1[Model Lifecycle]
    B --> B2[Data Management]
    B --> B3[Incident Response]
    B --> B4[Access & Rights]
    C --> C1[Step-by-Step Instructions]
    C --> C2[Decision Trees]
    C --> C3[Checklists]
    D --> D1[RACI Matrix]
    D --> D2[Escalation Paths]
    D --> D3[Handoff Procedures]
    E --> E1[Tabletop Exercises]
    E --> E2[Simulation Drills]
    E --> E3[Shadow Operations]
    F --> F1[Regular Reviews]
    F --> F2[Incident Learnings]
    F --> F3[Process Metrics]
```
Core AI Operating Procedures
Model Update & Deployment SOP
Scope: This SOP covers the process for updating AI models in production, including testing, approval, deployment, and rollback procedures.
Roles & Responsibilities (RACI):
| Activity | Model Owner | Tech Lead | Security | QA | Operations | Approver |
|---|---|---|---|---|---|---|
| Update trigger & justification | R | C | I | I | I | I |
| Evaluation on test sets | A | R | I | C | I | I |
| Security & safety review | C | C | R | C | I | I |
| Approval decision | I | I | C | C | I | R/A |
| Deployment execution | C | C | I | I | R/A | I |
| Monitoring & validation | A | R | I | C | R | I |
| Rollback (if needed) | A | R | I | I | R | C |
R = Responsible, A = Accountable, C = Consulted, I = Informed
Procedure Steps:
1. Update Initiation
- Model owner documents update reason (bug fix, performance improvement, new capability)
- Assess impact scope (which applications/users affected)
- Create update ticket with justification and success criteria
- Obtain preliminary approval from tech lead
2. Evaluation & Testing
- Run automated evaluation suite on golden test set
- Pass criteria: All metrics within 5% of baseline or better
- Execute safety testing on adversarial set
- Pass criteria: Zero critical safety failures
- Perform A/B test in staging environment
- Duration: Minimum 48 hours, 1000+ requests
- Pass criteria: Statistical significance (p<0.05) favoring new model
- Document all test results with screenshots and logs
3. Review & Approval
- Security review: Scan for vulnerabilities, validate data handling
- QA review: Verify test coverage and results
- Present findings to approval committee
- Required quorum: 2 of 3 approvers (Tech Lead, Security, Product)
- Obtain signed approval or document rejection reasons
4. Deployment
- Schedule deployment window (low-traffic period)
- Notify stakeholders 24 hours in advance
- Deploy to canary environment (5% traffic)
- Monitor for 1 hour: error rates, latency, quality metrics
- Rollback trigger: Error rate >2x baseline OR critical safety issue
- If canary successful, gradual rollout: 25% → 50% → 100%
- Monitor each stage: 30 minutes minimum
- Update configuration management system with new version
5. Post-Deployment Validation
- Monitor production metrics for 24 hours
- Track: Error rate, latency P50/P99, quality scores, user feedback
- Compare against baseline and success criteria
- If degradation detected, execute rollback procedure
- Document final deployment status and any issues
6. Rollback Procedure (if needed)
- Pause traffic to new model immediately
- Route 100% traffic to previous stable version
- Verify metrics return to baseline within 15 minutes
- Log incident and root cause analysis
- Communicate rollback to stakeholders
- Schedule postmortem within 48 hours
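The A/B gate in step 2 ("statistical significance (p<0.05) favoring the new model") can be checked with a one-sided two-proportion z-test over request success counts. A minimal, stdlib-only sketch; the function names and the use of success counts as the metric are illustrative assumptions, not the chapter's prescribed implementation:

```python
import math

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> float:
    """One-sided p-value for H1: variant B's success rate exceeds A's."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance, no evidence of a difference
    z = (p_b - p_a) / se
    # Standard normal CDF via math.erf; one-sided p = 1 - Phi(z)
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ab_test_passes(success_a: int, n_a: int,
                   success_b: int, n_b: int, alpha: float = 0.05) -> bool:
    """Step-2 gate: the new model (B) must beat the baseline (A) at p < alpha."""
    return two_proportion_z_test(success_a, n_a, success_b, n_b) < alpha
```

For example, 900/1000 successes against an 850/1000 baseline clears the gate, while a 905-vs-900 difference over the same volume does not.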
Approval Criteria:
| Criterion | Threshold | Measured By |
|---|---|---|
| Evaluation performance | ≥ baseline on all metrics | Automated eval suite |
| Safety testing | Zero critical issues | Adversarial test set |
| A/B test significance | p < 0.05 improvement | Statistical analysis |
| Security clearance | No high/critical vulnerabilities | Security scan + manual review |
| Stakeholder approval | 2 of 3 approvers | Approval meeting |
Rollback Triggers:
| Trigger | Threshold | Action |
|---|---|---|
| Error rate spike | >2x baseline for 5+ minutes | Immediate automatic rollback |
| Latency degradation | P99 >1.5x baseline for 10+ minutes | Manual rollback decision |
| Quality drop | Key metric <80% of baseline | Manual rollback decision |
| Safety incident | Any critical safety issue | Immediate manual rollback |
| User complaints | >5 escalated complaints in 1 hour | Investigation + potential rollback |
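The trigger table maps naturally onto an automated check that monitoring can run each evaluation window. A sketch under stated assumptions: the sustained-duration conditions (5+/10+ minutes) are enforced by the caller's alerting window, and the `Metrics` fields and function name are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # errors per request
    latency_p99_ms: float  # 99th-percentile latency
    quality_score: float   # key quality metric, e.g. eval score in [0, 1]

def rollback_action(current: Metrics, baseline: Metrics,
                    critical_safety_issue: bool = False) -> str:
    """Map the rollback-trigger table to an action:
    "auto_rollback", "manual_review", or "ok"."""
    if critical_safety_issue:
        return "auto_rollback"          # any critical safety issue: immediate
    if current.error_rate > 2 * baseline.error_rate:
        return "auto_rollback"          # error rate >2x baseline
    if current.latency_p99_ms > 1.5 * baseline.latency_p99_ms:
        return "manual_review"          # P99 >1.5x baseline
    if current.quality_score < 0.8 * baseline.quality_score:
        return "manual_review"          # key metric <80% of baseline
    return "ok"
```

The user-complaint trigger is omitted here because it requires an investigation step rather than a direct metric comparison.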
Documentation Requirements:
- Update justification and expected impact
- Test results (all evaluations, A/B tests)
- Security and safety review sign-offs
- Approval meeting notes and decisions
- Deployment timeline and traffic splits
- Post-deployment metrics and analysis
- Any incidents and mitigations
Data Access & Management SOP
Scope: Request, approval, provisioning, and revocation of access to sensitive data for AI development and operations.
Roles & Responsibilities (RACI):
| Activity | Requestor | Manager | Data Owner | Security | Privacy | DBA |
|---|---|---|---|---|---|---|
| Access request | R/A | I | I | I | I | I |
| Manager approval | I | R/A | I | I | I | I |
| Data classification check | I | I | R | C | C | I |
| Privacy review | I | I | C | C | R/A | I |
| Security review | I | I | C | R/A | C | I |
| Access provisioning | I | I | I | I | I | R/A |
| Access audit | I | C | R | R | R | R |
| Access revocation | R | A | C | C | C | R |
Procedure Steps:
1. Access Request
- Submit request via access management portal
- Include: Data set name, business justification, duration, use case
- Specify access type: read-only, read-write, export, API
- Estimate data volume and query patterns
- Manager approval (auto-notification)
2. Data Classification & Review
- Data owner classifies data sensitivity (Public, Internal, Confidential, Restricted)
- Privacy team reviews for PII/sensitive data
- If PII present, require data minimization justification
- Assess if anonymization/pseudonymization required
- Security team reviews for compliance requirements
- Check against data handling policies
- Verify requestor has required certifications/training
3. Approval Decision
- All required reviews complete
- For Confidential/Restricted data: Escalate to Data Governance Board
- SLA: Decision within 3 business days
- Approval granted with conditions (e.g., access expiry, usage restrictions)
- Rejection: Document reason and provide alternative if available
4. Access Provisioning
- DBA creates account with least-privilege permissions
- Set access expiration (default 90 days, max 1 year)
- Configure audit logging for all access
- Provide access credentials via secure channel
- Send confirmation and usage guidelines to requestor
5. Ongoing Compliance
- Automated monthly audit of active access
- Flag access >90 days for renewal review
- Flag inactive access (no queries in 30 days) for revocation
- Quarterly access review by data owners
- Log analysis for anomalous access patterns
- Manager notified of team member access for awareness
6. Access Revocation
- Trigger: Access expiry, employee departure, policy violation, or manual request
- Immediate revocation of credentials
- Verify no data exports or copies remain
- Log revocation with reason
- Notify stakeholders (manager, data owner, security)
Access Classification & Approval Requirements:
| Data Classification | Review Requirements | Max Duration | Conditions |
|---|---|---|---|
| Public | Manager approval only | No limit | Standard usage terms |
| Internal | Manager + Data Owner | 1 year | Training certification required |
| Confidential | Manager + Data Owner + Security | 90 days | NDA, audit logging, no export |
| Restricted | All reviews + Board approval | 30 days | Purpose-specific, strict audit, encrypted access |
Audit & Monitoring:
| Check | Frequency | Action if Non-Compliant |
|---|---|---|
| Active access review | Monthly | Flag for renewal or revocation |
| Inactive access (no use) | Monthly | Auto-revoke after 30 days inactive |
| Access pattern anomalies | Daily | Alert security team, investigate |
| Data export tracking | Real-time | Alert if restricted data exported |
| Compliance training | Quarterly | Suspend access until training complete |
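The monthly audit checks above are simple enough to automate as a sweep over active grants. A minimal sketch, assuming grant records carry issue and last-use timestamps; the type and field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AccessGrant:
    user: str
    granted_at: datetime
    last_used_at: datetime

def audit_flags(grant: AccessGrant, now: datetime) -> list:
    """Monthly audit checks per the table: flag stale and unused access."""
    flags = []
    if now - grant.granted_at > timedelta(days=90):
        flags.append("renewal_review")   # access older than 90 days
    if now - grant.last_used_at > timedelta(days=30):
        flags.append("auto_revoke")      # no use in 30 days
    return flags
```

A real sweep would feed `auto_revoke` flags into the revocation procedure (step 6) and `renewal_review` flags to the data owner's quarterly review.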
Incident Response SOP
Scope: Detection, triage, response, resolution, and post-incident review for AI system incidents.
Incident Severity Levels:
| Severity | Impact | Examples | Response Time | Escalation |
|---|---|---|---|---|
| SEV 1 (Critical) | User-facing failure, safety risk, data breach | Production down, harmful outputs, PII leak | Immediate (24/7) | VP + Legal + PR |
| SEV 2 (High) | Major degradation, compliance risk | High error rate, bias issues, SLA miss | <30 minutes | Director + Compliance |
| SEV 3 (Medium) | Minor degradation, workaround exists | Elevated latency, non-critical feature down | <2 hours | Manager + Tech Lead |
| SEV 4 (Low) | No user impact, internal issue | Logging issue, minor monitoring gap | <1 business day | Tech Lead |
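A first-pass triage helper can encode the severity table so the on-call responder starts from a consistent level; the impact flags and function name are illustrative assumptions, and the Incident Commander confirms or adjusts the final level:

```python
def classify_severity(*, user_facing_failure: bool = False,
                      safety_risk: bool = False,
                      data_breach: bool = False,
                      major_degradation: bool = False,
                      minor_degradation: bool = False) -> int:
    """Rough triage per the severity table; the IC makes the final call."""
    if user_facing_failure or safety_risk or data_breach:
        return 1
    if major_degradation:
        return 2
    if minor_degradation:
        return 3
    return 4

# Initial response targets from the table, keyed by severity
RESPONSE_TARGET = {
    1: "immediate (24/7)",
    2: "30 minutes",
    3: "2 hours",
    4: "1 business day",
}
```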
Incident Response Roles:
| Role | Responsibilities |
|---|---|
| Incident Commander (IC) | Leads response, makes decisions, coordinates teams |
| Technical Lead | Diagnoses issue, implements fixes, manages technical team |
| Communications Lead | Stakeholder updates, user communication, status page |
| Scribe | Documents timeline, decisions, actions in real-time |
| Subject Matter Experts (SMEs) | Provide specialized knowledge (security, privacy, etc.) |
Procedure Steps:
1. Detection & Alerting
- Incident detected via: Automated monitoring, user report, or internal discovery
- Create incident ticket with initial details
- Assess severity based on impact criteria
- Page on-call responder immediately (SEV 1-2) or assign (SEV 3-4)
2. Initial Response (First 15 Minutes)
- On-call responder acknowledges within 5 minutes
- Gather initial facts: What's broken? Since when? How many users?
- Confirm severity level (escalate/de-escalate as needed)
- For SEV 1-2: Page Incident Commander and assemble response team
- Establish communication channel (dedicated Slack channel, war room)
3. Triage & Diagnosis (First Hour)
- IC assigns roles (Tech Lead, Comms Lead, Scribe)
- Tech Lead investigates root cause
- Check recent changes (model updates, config changes, dependency updates)
- Review logs and metrics
- Isolate affected components
- Scribe documents timeline and findings in real-time
- IC decides on immediate actions: rollback, failover, disable feature, etc.
4. Mitigation & Resolution
- Implement mitigation (e.g., rollback to last known good version)
- Verify mitigation effective (metrics return to baseline)
- If not resolved, escalate to next level SMEs
- For SEV 1: Provide status updates every 30 min to stakeholders
- Continue until root cause found and permanent fix deployed
5. Communication
- Comms Lead drafts initial stakeholder notification within 1 hour (SEV 1-2)
- Update status page and affected users
- Provide regular updates:
- SEV 1: Every 30 minutes until resolved
- SEV 2: Every 1 hour
- SEV 3: Every 4 hours
- Final notification when incident resolved with summary
6. Resolution & Handoff
- Confirm system fully recovered and stable (monitor 1-2 hours)
- Update incident ticket with resolution details
- IC declares incident resolved
- Hand off any follow-up items to responsible teams
- Thank and dismiss response team
7. Post-Incident Review (Within 48 Hours)
- Schedule postmortem meeting with all participants
- Scribe prepares incident timeline and analysis
- Blameless discussion: What happened? Why? How to prevent?
- Document action items with owners and due dates
- Publish postmortem report to relevant stakeholders
- Track action items to completion
Incident Communication Templates:
Initial Notification Template (SEV 1-2):
| Component | Content Guidelines | Example |
|---|---|---|
| Subject Line | SEV level + brief description | "[SEV 1] Incident: Production AI Service Unavailable" |
| Status | Current state | "INVESTIGATING" / "MITIGATING" / "RESOLVED" |
| Severity | Impact level | SEV 1 / SEV 2 / SEV 3 / SEV 4 |
| Start Time | When incident began | "2024-03-15 14:32 UTC" |
| Affected | What's impacted | "All customer-facing AI features, 10K users" |
| Impact Description | User-facing consequences | "Users unable to access AI assistant, receiving error messages" |
| Current Actions | Mitigation steps | "Team investigating root cause, failover initiated to backup" |
| Next Update | Communication cadence | "Next update in 30 minutes at 15:15 UTC" |
| Incident Commander | Point of contact | "Jane Smith - @jsmith" |
| Incident Channel | Where to follow | "#incident-2024-03-15-001 or war room bridge link" |
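The template above can be rendered mechanically so no field is forgotten under pressure. A plain-text sketch following the table's field order; the function name and parameter names are assumptions:

```python
def initial_notification(sev: int, description: str, status: str,
                         start_utc: str, affected: str, impact: str,
                         actions: str, next_update: str,
                         commander: str, channel: str) -> str:
    """Render the initial-notification template as a plain-text message."""
    return "\n".join([
        f"[SEV {sev}] Incident: {description}",
        f"Status: {status}",
        f"Severity: SEV {sev}",
        f"Start Time: {start_utc}",
        f"Affected: {affected}",
        f"Impact Description: {impact}",
        f"Current Actions: {actions}",
        f"Next Update: {next_update}",
        f"Incident Commander: {commander}",
        f"Incident Channel: {channel}",
    ])
```

Because every field is a required argument, a notification cannot be sent with a section silently missing.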
Resolution Notification Template:
| Component | Content Guidelines | Example |
|---|---|---|
| Subject Line | RESOLVED + incident description | "[RESOLVED] SEV 1: Production AI Service Unavailable" |
| Status | Confirmed resolution | "RESOLVED" |
| Duration | Total time impacted | "1 hour 23 minutes (14:32 - 15:55 UTC)" |
| Root Cause | Brief technical summary | "Database connection pool exhaustion due to query spike" |
| Users Affected | Impact scope | "10,234 users (35% of active users)" |
| Impact Duration | How long affected | "1 hour 23 minutes of degraded service" |
| Data Integrity | Confirm no data loss | "Confirmed: No data loss or corruption" |
| Resolution | What fixed it | "Increased connection pool size, restarted affected services" |
| Prevention | Future mitigation | "Implementing auto-scaling, enhanced monitoring alerts" |
| Postmortem Link | Detailed analysis | "Full postmortem available within 48 hours at [link]" |
Incident Response Checklist (Incident Commander):
Immediate (0-15 min):
- Acknowledge incident and confirm severity
- Assemble response team and assign roles
- Establish communication channel
- Begin timeline documentation
Active Response (15 min - Resolution):
- Direct technical investigation and mitigation
- Make go/no-go decisions (rollback, failover, etc.)
- Ensure regular stakeholder communication
- Escalate as needed (higher severity, additional SMEs)
- Monitor mitigation effectiveness
Resolution (After Fix):
- Confirm system stable and recovered
- Send resolution notification
- Schedule postmortem within 48 hours
- Capture lessons learned
- Assign follow-up action items
Post-Incident (48 hours):
- Facilitate blameless postmortem
- Publish incident report
- Track action items to completion
- Update runbooks and SOPs based on learnings
Data Subject Rights & Consent SOP
Scope: Handle data subject access requests (DSAR), consent management, and data retention/deletion per privacy regulations (GDPR, CCPA, etc.).
Roles & Responsibilities (RACI):
| Activity | Privacy Team | Legal | Engineering | Data Owner | Requestor |
|---|---|---|---|---|---|
| DSAR receipt | R/A | I | I | I | - |
| Request validation | R/A | C | I | I | - |
| Data identification | C | I | R | A | I |
| Data retrieval | I | I | R/A | C | I |
| Legal review | C | R/A | I | C | I |
| Response delivery | R/A | C | C | I | - |
Procedure Steps:
1. Data Subject Access Request (DSAR)
Request Receipt:
- DSAR received via privacy portal, email, or support ticket
- Log request in privacy management system within 24 hours
- Assign case ID and notify privacy team
Identity Verification:
- Request identity verification from requester
- Method: Government ID, account authentication, or challenge questions
- Verify request legitimacy (no suspicious indicators)
- If identity cannot be verified, document and close request
Request Categorization:
- Classify request type:
- Access: Provide copy of personal data
- Rectification: Correct inaccurate data
- Erasure (Right to be forgotten): Delete data
- Portability: Export data in machine-readable format
- Objection: Stop processing for specific purpose
2. Data Identification & Retrieval
Data Mapping:
- Identify all systems containing requester's data
- Check: Production databases, AI training data, logs, backups, archives
- For AI systems: Check if data used for training, fine-tuning, or RAG
- Document data locations and formats
Data Retrieval:
- Engineering team extracts relevant data
- SLA: Within 15 days for GDPR, 30 days for CCPA
- Compile data into structured format
- Redact any third-party personal data (privacy of others)
- Legal review of data package before delivery
3. Response Preparation & Delivery
Response Package:
- Prepare response document including:
- Data categories collected
- Sources of data
- Purposes of processing
- Third parties with access
- Retention periods
- Data copy (if access request)
Delivery:
- Send via secure channel (encrypted email, secure portal)
- Request delivery confirmation
- Log response sent with timestamp
- Archive request and response for audit (7 years)
4. Data Erasure / Right to be Forgotten
Erasure Assessment:
- Legal team reviews if erasure legally required
- Exemptions: Legal obligation, public interest, vital interests, contract performance
- If exempt, document justification and notify requester
- If erasure required, proceed with deletion
Deletion Execution:
- Identify all data copies across systems
- Include: Production DBs, backups, logs, AI models, cache
- For AI models trained on data:
- Option 1: Retrain without the data (expensive)
- Option 2: Document the model as in-scope and sunset it at the next retraining cycle
- Option 3: If contribution negligible, document and monitor
- Execute deletion with verification
- Hard delete (not just soft delete/archive)
- Verify deletion with checksums or audits
Deletion Verification:
- Engineering confirms deletion across all systems
- Audit logs capture deletion events
- Document systems where deletion completed
- Send confirmation to requester
- Retain deletion record for compliance audit
5. Consent Management
Consent Capture:
- Explicit consent required for:
- Sensitive personal data (health, biometric, etc.)
- AI model training using user data
- Third-party data sharing
- Consent must be:
- Freely given: No coercion
- Specific: Purpose-specific
- Informed: Clear explanation of use
- Unambiguous: Affirmative action (no pre-checked boxes)
- Log consent with timestamp, version, and scope
Consent Withdrawal:
- User can withdraw consent at any time
- Withdrawal must be as easy as giving consent (e.g., one click)
- Upon withdrawal:
- Stop processing data for that purpose immediately
- Option to delete data (offer to user)
- Update consent records
- Notify user of withdrawal confirmation
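One common way to satisfy "log consent with timestamp, version, and scope" and make withdrawal auditable is an append-only log in which the latest record per user and purpose wins. A sketch under that assumption; the class and field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ConsentRecord:
    user_id: str
    purpose: str          # e.g. "model_training"
    policy_version: str   # version of the consent text the user saw
    granted: bool = True
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class ConsentLog:
    """Append-only consent log; latest record per (user, purpose) wins."""
    def __init__(self):
        self._records = []

    def record(self, rec: ConsentRecord) -> None:
        self._records.append(rec)

    def withdraw(self, user_id: str, purpose: str, policy_version: str) -> None:
        # Withdrawal is just another record, so the full history is preserved
        self.record(ConsentRecord(user_id, purpose, policy_version,
                                  granted=False))

    def has_consent(self, user_id: str, purpose: str) -> bool:
        for rec in reversed(self._records):
            if rec.user_id == user_id and rec.purpose == purpose:
                return rec.granted
        return False  # no record means no consent
```

Appending a withdrawal record rather than deleting the grant keeps the audit trail intact while stopping processing immediately.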
DSAR Response Timelines:
| Regulation | Response Deadline | Extension Possible | Notification Requirement |
|---|---|---|---|
| GDPR | 30 days | +60 days if complex | Must notify within 30 days |
| CCPA | 45 days | +45 days if needed | Must notify within 45 days |
| UK GDPR | 30 days | +60 days if complex | Must notify within 30 days |
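Response deadlines from the table can be computed at intake so each case carries its due date from day one. A sketch that counts calendar days from receipt, which is a simplification (GDPR's "one month" is approximated as 30 days, per the table); the dictionary and function names are assumptions:

```python
from datetime import date, timedelta

# (base deadline, maximum extension) in calendar days, from the table
DSAR_DEADLINES = {
    "GDPR": (30, 60),
    "CCPA": (45, 45),
    "UK GDPR": (30, 60),
}

def dsar_deadline(received: date, regulation: str,
                  extended: bool = False) -> date:
    """Due date for a DSAR received on `received` under `regulation`."""
    base, extension = DSAR_DEADLINES[regulation]
    return received + timedelta(days=base + (extension if extended else 0))
```

Note that extensions still require notifying the requester within the base window.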
Data Retention & Deletion Schedule:
| Data Type | Retention Period | Deletion Method | Exceptions |
|---|---|---|---|
| User Account Data | Duration of account + 90 days | Hard delete from production + backups | Legal hold, fraud investigation |
| Training Data | Model lifetime or 3 years max | Remove from training sets, document in model | Aggregated/anonymized data OK to retain |
| Logs (with PII) | 90 days | Automated purge | Incident investigation (180 days) |
| Logs (no PII) | 1 year | Automated purge | - |
| Backups | 30 days | Overwrite/encrypt | - |
| Audit Records | 7 years | Secure archive then delete | Regulatory requirement |
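The schedule's fixed-period rows lend themselves to an automated purge-eligibility check; a sketch covering the log and backup rows, with the exception flags modeled as booleans (the names are illustrative, and variable-period rows like account data need extra inputs):

```python
from datetime import datetime, timedelta

RETENTION_DAYS = {        # from the schedule above
    "logs_pii": 90,
    "logs_no_pii": 365,
    "backups": 30,
}

def purge_due(data_type: str, created_at: datetime, now: datetime,
              legal_hold: bool = False,
              incident_investigation: bool = False) -> bool:
    """True when the record has exceeded retention and no exception applies."""
    if legal_hold:
        return False                      # legal hold suspends deletion
    days = RETENTION_DAYS[data_type]
    if data_type == "logs_pii" and incident_investigation:
        days = 180                        # investigation exception from the table
    return now - created_at > timedelta(days=days)
```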
Change Management SOP
Scope: Managing changes to AI systems, infrastructure, and configurations to minimize risk and ensure proper approvals.
Change Types & Approval Requirements:
| Change Type | Examples | Approval Required | Testing Required | Rollback Plan |
|---|---|---|---|---|
| Emergency | Production incident fix, security patch | Post-approval within 24h | Minimal (verified in staging) | Mandatory |
| Standard | Model update, feature addition, config change | CAB approval | Full test suite | Mandatory |
| Minor | Documentation, logging, monitoring | Tech Lead approval | Unit tests | Optional |
| Pre-Approved | Scaling resources, routine maintenance | Auto-approved | Validation checks | Optional |
Change Advisory Board (CAB):
| Role | Responsibility | Frequency |
|---|---|---|
| Chair | Lead meeting, make final decisions | Weekly |
| Technical Rep | Assess technical risk and feasibility | Weekly |
| Security Rep | Review security implications | Weekly |
| Compliance Rep | Ensure regulatory compliance | Weekly |
| Product Rep | Assess business impact and timing | Weekly |
Change Request Process:
1. Request Submission:
- Submit change request (CR) with details:
- What is changing and why
- Business justification and urgency
- Affected systems and users
- Testing plan and results
- Rollback plan and timeline
- Deployment window preference
2. Impact Assessment:
- Technical review: Dependencies, risks, resource needs
- Security review: Vulnerabilities, data exposure
- Compliance review: Regulatory requirements
- Business review: User impact, timing considerations
3. CAB Review & Approval:
- Present change request to CAB
- CAB reviews risk vs. benefit
- Decision: Approve, Reject, Defer (with reasons)
- If approved: Assign deployment window
- If rejected: Document reasons and next steps
4. Implementation:
- Implement change during approved window
- Follow deployment SOP (testing, canary, gradual rollout)
- Monitor metrics closely during and after deployment
- Document actual vs. planned execution
5. Post-Implementation Review:
- Verify change achieved intended outcome
- Check for any unintended side effects
- Update CR with actual results
- Close CR or log follow-up items
Change Risk Assessment Matrix:
| Risk Level | Impact | Likelihood | Approval Needed | Testing Scope |
|---|---|---|---|---|
| Low | Minimal | Unlikely | Tech Lead | Unit tests |
| Medium | Moderate | Possible | Manager + CAB review | Integration tests |
| High | Significant | Likely | CAB + Director | Full test suite + canary |
| Critical | Severe | Highly likely | CAB + VP | All tests + staged rollout |
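One hedged reading of the matrix scores impact and likelihood separately and lets the higher of the two set the risk level, which keeps a severe-but-unlikely change at Critical. A sketch under that interpretation; the scoring rule itself is an assumption, since the table pairs the dimensions row by row:

```python
IMPACT = {"minimal": 0, "moderate": 1, "significant": 2, "severe": 3}
LIKELIHOOD = {"unlikely": 0, "possible": 1, "likely": 2, "highly likely": 3}
RISK_LEVELS = ["Low", "Medium", "High", "Critical"]

def change_risk(impact: str, likelihood: str) -> str:
    """Risk level for a change request; the higher dimension drives the level."""
    return RISK_LEVELS[max(IMPACT[impact], LIKELIHOOD[likelihood])]
```

Usage: `change_risk("moderate", "likely")` yields `"High"`, which routes the change to CAB plus Director approval and a full test suite with canary per the matrix.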
SOP Development & Maintenance
SOP Creation Template
SOP Document Structure:
| Section | Purpose | Required Elements | Best Practices |
|---|---|---|---|
| Document Control | Version tracking and ownership | Version number, owner, approver, review dates, related policies | Use semantic versioning (X.Y), set review reminders |
| Purpose | Why SOP exists | 1-2 sentence summary of objective | Clear value statement, measurable outcome |
| Scope | Boundaries and applicability | In-scope, out-of-scope, who must follow | Explicit boundaries, no ambiguity |
| RACI Matrix | Role clarity | Activities mapped to R/A/C/I by role | One R and one A per activity, no gaps |
| Prerequisites | Required conditions | Access, tools, training, initial state | Explicit, verifiable requirements |
| Procedure Steps | How to execute | Step-by-step instructions with decisions | Numbered, actionable, with verification |
| Escalation | Exception handling | Scenarios, actions, contacts | Clear triggers, response paths |
| Success Criteria | Completion validation | Metrics, checks, outcomes | Measurable, observable indicators |
| Audit Trail | Compliance evidence | What to log, where, retention | Complete traceability, compliance-ready |
| Related Documents | References | Links to policies, SOPs, runbooks | Easy navigation, context connection |
| Revision History | Change tracking | Version, date, author, changes | Complete audit trail of evolution |
Procedure Step Template:
| Element | Description | Example |
|---|---|---|
| Step Number | Sequential identifier | "Step 3: Validate Test Results" |
| Responsible Role | Who executes | "QA Engineer" |
| Estimated Duration | Time required | "30 minutes" |
| Instructions | Detailed steps | "1. Run automated test suite 2. Review results dashboard 3. Check all metrics meet thresholds" |
| Decision Points | Conditional logic | "If test pass rate < 95%, escalate to Tech Lead; Else, proceed to deployment" |
| Verification | Success check | "Confirm: Test dashboard shows 'All Passed' status" |
| Outputs | What's produced | "Test results report saved to [location], approval ticket updated" |
SOP Testing & Validation
Testing Methods:
| Method | Purpose | Frequency | Participants |
|---|---|---|---|
| Tabletop Exercise | Walk through SOP, identify gaps | Annually + after major changes | Cross-functional team |
| Simulation Drill | Execute SOP in test environment | Quarterly | Operations team |
| Shadow Operation | New person follows SOP with guidance | During onboarding | New hire + mentor |
| Audit | Verify SOP followed correctly | Monthly sample | Compliance team |
Tabletop Exercise Framework:
Exercise Structure:
| Phase | Duration | Activities | Participants | Outputs |
|---|---|---|---|---|
| Setup | 5 min | Introduce scenario, objectives, ground rules | All | Shared understanding |
| Walkthrough | 30 min | Step through SOP, discuss each action | All | Identified ambiguities |
| Decision Points | 20 min | Deep-dive on critical decisions | All | Clarified criteria |
| Gap Analysis | 15 min | Capture findings, missing elements | All | Improvement list |
| Wrap-Up | 10 min | Prioritize actions, assign owners | All | Action plan |
Scenario Design Guidelines:
| Element | Description | Best Practice |
|---|---|---|
| Realism | Based on actual or plausible events | Use past incidents or realistic projections |
| Complexity | Appropriate challenge level | Match to team experience, increase gradually |
| Time Pressure | Include urgency elements | Mirror real incident pressure |
| Ambiguity | Some unclear elements | Test decision-making under uncertainty |
| Multiple Paths | Decision points with alternatives | Explore different response options |
Discussion Framework:
| Discussion Area | Key Questions | What to Validate |
|---|---|---|
| Clarity | Is each step clear? Any jargon or assumptions? | Understanding, no ambiguity |
| Completeness | Any missing steps? All scenarios covered? | No gaps, comprehensive |
| Roles | Who does what? Any overlaps or gaps? | RACI accuracy, no confusion |
| Decisions | What are the criteria? Who decides? | Clear thresholds, authority |
| Escalation | When to escalate? To whom? How? | Clear triggers, paths, contacts |
| Tools | What's needed? Is it accessible? | Readiness, permissions |
| Coordination | How do teams communicate? Hand-offs? | Clear communication flow |
Findings Documentation:
| Finding Type | Description | Action Required | Priority Criteria |
|---|---|---|---|
| Gap | Missing step or information | Add to SOP | High if blocks execution |
| Ambiguity | Unclear instruction or decision | Clarify language | High if causes confusion |
| Tool Issue | Missing access or capability | Provision or document | Medium if workaround exists |
| Training Need | Knowledge or skill deficit | Update training | Medium if affects quality |
| Process Improvement | Efficiency or quality opportunity | Evaluate and implement | Low if enhancement only |
SOP Metrics & Compliance
SOP Effectiveness Metrics:
| Metric | Definition | Target | Frequency |
|---|---|---|---|
| SOP Coverage | % of critical workflows with documented SOPs | 100% | Quarterly |
| SOP Compliance | % of operations following SOP (via audit) | >95% | Monthly |
| SOP Freshness | % of SOPs reviewed within refresh cycle | 100% | Monthly |
| Incident Attribution | % of incidents where SOP was/wasn't followed | Track trend | Per incident |
| Time to Execute | Actual vs. estimated SOP completion time | Within 20% | Per execution |
| User Feedback | Clarity and usability rating from practitioners | >4.0/5.0 | Quarterly |
Compliance Tracking:
```mermaid
graph TD
    A[SOP Execution] --> B[Automated Logging]
    A --> C[Manual Documentation]
    B --> D[Compliance Dashboard]
    C --> D
    D --> E{SOP Followed?}
    E -->|Yes| F[Track Metrics]
    E -->|No| G[Investigate Reason]
    G --> H{Intentional?}
    H -->|Emergency| I[Post-Approval Process]
    H -->|Training Gap| J[Additional Training]
    H -->|SOP Issue| K[Update SOP]
    H -->|Non-Compliance| L[Corrective Action]
    F --> M[Continuous Improvement]
    I --> M
    J --> M
    K --> M
```
SOP Review & Update Cycle:
| Trigger | Action | SLA |
|---|---|---|
| Scheduled Review | Annual review by owner + stakeholders | Complete 30 days before anniversary |
| Incident Learnings | Update SOP based on postmortem findings | Within 1 week of postmortem |
| Policy Change | Align SOP to new/updated policy | Within 2 weeks of policy update |
| Tool/Platform Change | Update procedures for new tools | Before new tool goes live |
| User Feedback | Address clarity or usability issues | Quarterly batch update |
| Audit Findings | Correct gaps identified in audit | Within 2 weeks of finding |
Case Study: Financial Services AI Platform
Context:
- Large financial institution deploying AI for fraud detection and customer service
- Regulatory requirements (SOX, PCI-DSS, GDPR) demand strict operational controls
- Previous AI pilot failed due to undocumented processes and compliance violations
SOP Implementation:
Phase 1: Critical SOPs (Months 1-2)
- Developed 5 core SOPs:
- Model deployment and rollback
- Data access and handling
- Incident response
- Data subject rights (DSAR)
- Change management
- Conducted tabletop exercises for each SOP
- Refined based on feedback before launch
Phase 2: Operationalization (Months 3-4)
- Trained all teams on SOPs (200+ people)
- Integrated SOPs into deployment tooling (automated checks)
- Set up compliance dashboard and audit trails
- Established CAB with weekly cadence
Phase 3: Continuous Improvement (Month 5+)
- Monthly compliance audits and metrics review
- Quarterly SOP refresh based on learnings
- Incident postmortems fed into SOP updates
- Expanded SOP library to 15+ procedures
Results:
| Metric | Before SOPs | After SOPs | Improvement |
|---|---|---|---|
| Deployment incidents | 12 per month | 2 per month | -83% |
| Mean time to recovery (MTTR) | 4.5 hours | 45 minutes | -83% |
| Compliance violations | 8 per quarter | 0 per quarter | -100% |
| Change failure rate | 22% | 4% | -82% |
| Audit findings (ops procedures) | 15 major | 0 major, 2 minor | -87% |
| Time to onboard new operator | 6 weeks | 2 weeks | -67% |
Key Success Factors:
- Executive Mandate: CISO and CTO required SOPs for all AI operations
- Cross-Functional Development: SOPs co-created by ops, security, legal, compliance
- Realistic Testing: Tabletop exercises and drills identified gaps before launch
- Tool Integration: Automated SOP enforcement where possible (checklists, gates)
- Continuous Learning: Postmortems and audits fed into SOP improvements
- Cultural Shift: From "we know what to do" to "documented, repeatable processes"
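The "Tool Integration" success factor above, automating SOP enforcement with checklists and gates, can be sketched as a pre-deployment gate that blocks a release until every required SOP step has a named sign-off. This is an illustrative shape only; the step names and class API are assumptions, not the institution's actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class DeploymentGate:
    """Blocks deployment until every required SOP step is signed off."""
    required_steps: list          # SOP steps that must be completed
    completed: dict = field(default_factory=dict)  # step -> verifier name

    def sign_off(self, step: str, verifier: str) -> None:
        if step not in self.required_steps:
            raise ValueError(f"unknown SOP step: {step}")
        self.completed[step] = verifier

    def missing_steps(self) -> list:
        return [s for s in self.required_steps if s not in self.completed]

    def can_deploy(self) -> bool:
        return not self.missing_steps()

# Hypothetical steps from a model-deployment SOP.
gate = DeploymentGate(["rollback_plan", "eval_suite_passed", "cab_approval"])
gate.sign_off("rollback_plan", "alice")
gate.sign_off("eval_suite_passed", "bob")
print(gate.can_deploy(), gate.missing_steps())  # False ['cab_approval']
```

Recording the verifier with each sign-off also gives the audit trail the RACI "Responsible" party for free.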
Implementation Checklist
SOP Development (Weeks 1-4)
Identify Critical Workflows
- Map AI lifecycle (build, deploy, operate, monitor, update)
- Identify high-risk operations requiring SOPs
- Prioritize based on risk, frequency, and compliance needs
- Assign SOP owners for each procedure
Develop Initial SOPs
- Use SOP template for consistency
- Include RACI matrix, step-by-step procedures, decision trees
- Define success criteria and verification steps
- Document escalation paths and exception handling
- Review with cross-functional stakeholders
Approval & Baseline
- Present SOPs to governance board for approval
- Incorporate feedback and finalize
- Publish in central SOP repository
- Version control and change tracking
Testing & Training (Weeks 5-8)
SOP Validation
- Conduct tabletop exercises for each SOP
- Identify gaps, ambiguities, or missing steps
- Refine SOPs based on findings
- Repeat validation until satisfactory
Team Training
- Train all relevant teams on SOPs
- Provide quick reference guides and checklists
- Conduct simulation drills for incident response
- Certify competency (test or observed execution)
Tool & System Readiness
- Configure tools to support SOP execution (forms, workflows, alerts)
- Set up audit logging for SOP compliance
- Build compliance dashboard for monitoring
- Test end-to-end with real scenarios
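The audit-logging item above benefits from tamper evidence: if each SOP-execution record is hash-chained to the previous one, auditors can verify the log was not edited after the fact. A minimal sketch, not a production audit system; real deployments would add timestamps, signing, and durable storage:

```python
import hashlib
import json

class AuditLog:
    """Append-only log of SOP step executions, hash-chained for tamper evidence."""

    def __init__(self):
        self.entries = []

    def record(self, sop_id: str, step: str, operator: str, outcome: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"sop": sop_id, "step": step, "operator": operator,
                "outcome": outcome, "prev": prev_hash}
        # Hash covers all fields except the hash itself.
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks a hash or link."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Run `verify()` as part of the monthly compliance audit: a clean chain confirms the sampled records reflect what operators actually logged.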
Operationalization (Month 3+)
Go Live
- Announce SOP go-live date and expectations
- Make SOPs mandatory for all operations
- Provide office hours for questions and support
- Monitor closely for compliance and issues
Compliance Monitoring
- Conduct monthly compliance audits (sample checks)
- Track SOP metrics (coverage, compliance, effectiveness)
- Address non-compliance: training, SOP updates, or corrective action
- Report metrics to governance board
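The compliance-monitoring steps above reduce to a small computation over audit samples: overall compliance rate plus a per-SOP breakdown to show where training or SOP updates are needed. A sketch under the assumption that each monthly sample records which SOP an operation fell under and whether it was followed; the field names are illustrative:

```python
def compliance_metrics(samples):
    """Compute overall and per-SOP compliance rates from audit samples.

    Each sample is a dict like {"sop": "deploy", "followed_sop": True}.
    """
    total = len(samples)
    compliant = sum(1 for s in samples if s["followed_sop"])
    by_sop = {}
    for s in samples:
        stats = by_sop.setdefault(s["sop"], {"total": 0, "compliant": 0})
        stats["total"] += 1
        stats["compliant"] += int(s["followed_sop"])
    return {
        "overall_rate": compliant / total if total else 0.0,
        "by_sop": {k: v["compliant"] / v["total"] for k, v in by_sop.items()},
    }

samples = [
    {"sop": "deploy", "followed_sop": True},
    {"sop": "deploy", "followed_sop": False},
    {"sop": "incident", "followed_sop": True},
]
print(compliance_metrics(samples))  # deploy 0.5, incident 1.0
```

Reporting the per-SOP breakdown to the governance board distinguishes a broadly ignored procedure from one SOP with a clarity problem.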
Continuous Improvement
- Quarterly SOP review and refresh
- Update SOPs based on incident learnings
- Incorporate new regulations or policy changes
- Collect user feedback and improve clarity
- Expand SOP library as new workflows emerge
Deliverables
Core SOPs
- Model update and deployment SOP with rollback procedures
- Data access and management SOP with DSAR handling
- Incident response SOP with severity levels and escalation
- Data subject rights and consent management SOP
- Change management SOP with CAB process
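The incident response SOP's "severity levels and escalation" can be captured as an executable decision tree, so triage is consistent across operators. The severity names and thresholds below are assumptions for illustration, not a standard taxonomy:

```python
def classify_severity(user_impact_pct: float, data_exposure: bool,
                      safety_risk: bool) -> str:
    """Map incident signals to a severity level (illustrative thresholds)."""
    if safety_risk or data_exposure:
        return "SEV1"   # page on-call, notify execs, engage legal/compliance
    if user_impact_pct >= 10:
        return "SEV2"   # page on-call, assign incident commander
    if user_impact_pct >= 1:
        return "SEV3"   # file ticket, fix within one business day
    return "SEV4"       # log and batch into the next review

print(classify_severity(15, False, False))  # SEV2
```

Encoding the tree in the incident tooling means the escalation path in the SOP fires automatically instead of relying on the operator's recall.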
Supporting Materials
- SOP template and writing guidelines
- RACI matrix for each procedure
- Escalation paths and contact lists
- Checklists and quick reference guides
- Decision trees for complex procedures
Testing & Training
- Tabletop exercise templates and scenarios
- Simulation drill scripts
- Training materials and certification tests
- New hire onboarding checklist
Monitoring & Governance
- SOP compliance dashboard
- Audit procedures and checklists
- Metrics tracking and reporting
- SOP review and update schedule
Key Takeaways
- SOPs translate policy into action: Governance policies define "what"; SOPs define "how" to do it consistently and safely.
- RACI clarity prevents chaos: Explicitly defining who is Responsible, Accountable, Consulted, and Informed eliminates confusion during critical operations.
- Test before you need it: Tabletop exercises and drills identify gaps when stakes are low, not during a real incident.
- Automate enforcement where possible: Build SOP steps into tools and workflows to prevent human error and ensure compliance.
- Incident learnings drive improvement: Every incident is an opportunity to update SOPs and prevent recurrence.
- Living documents, not binders: SOPs must evolve with the system, regulations, and learnings; regular review and updates are essential.
- Measure compliance, not just existence: Having SOPs means nothing if teams don't follow them. Monitor compliance and address gaps.
- Culture matters: SOPs work when teams see them as helpful guardrails, not bureaucratic overhead. Make them practical, clear, and accessible.