Build conversational AI systems with automatic speech recognition (ASR), text-to-speech (TTS), speaker diarization, and real-time dialogue management. Voice interfaces demand low end-to-end latency (the budget used in this guide is 1.2 s), natural turn-taking, explicit consent for recording, and seamless handoff to human agents.
System Architecture
graph TB
A[Phone/VoIP Input] --> B[Telephony Gateway]
B --> C[Audio Preprocessing]
C --> D[Streaming ASR]
D --> E[NLU & Intent Detection]
E --> F{Intent Type}
F -->|Information Retrieval| G[RAG System]
F -->|Tool Use| H[API Calls]
F -->|Escalation| I[Human Handoff]
G --> J[Response Generation]
H --> J
J --> K[TTS Synthesis]
K --> L[Audio Delivery]
L --> M[User]
N[Consent Manager] -.-> C
O[Latency Monitor] -.-> D
O -.-> K
P[Call Analytics] -.-> E
P -.-> J
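
The three-way branch at the Intent Type node can be sketched as a dispatch table. This is only an illustration: the handler functions, intent labels, and return strings below are hypothetical placeholders for the RAG system, the API calls, and the human handoff shown in the diagram.

```python
from typing import Callable, Dict

# Hypothetical handlers standing in for the RAG system, API calls, and human handoff.
def handle_retrieval(utterance: str) -> str:
    return f"[rag] answer for: {utterance}"

def handle_tool_use(utterance: str) -> str:
    return f"[api] result for: {utterance}"

def handle_escalation(utterance: str) -> str:
    return "[handoff] transferring to a human agent"

# Assumed intent-type labels; the diagram names the branches but not their identifiers.
ROUTES: Dict[str, Callable[[str], str]] = {
    "information_retrieval": handle_retrieval,
    "tool_use": handle_tool_use,
    "escalation": handle_escalation,
}

def route(intent_type: str, utterance: str) -> str:
    # Unknown intent types fall back to escalation rather than failing the call.
    handler = ROUTES.get(intent_type, handle_escalation)
    return handler(utterance)

print(route("tool_use", "what is my balance"))
```

Falling back to escalation for unrecognized intent types is a design assumption here, chosen so that a classifier error never leaves the caller without a response.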
Voice Processing Pipeline
graph LR
A[Audio Input] --> B[Voice Activity Detection]
B --> C{Speech Detected?}
C -->|Yes| D[Streaming ASR]
C -->|No| E[Silence Handler]
D --> F[Partial Hypotheses]
F --> G[Intent Classification]
G --> H{Confidence > 0.7?}
H -->|Yes| I[Execute Intent]
H -->|No| J[Clarification Request]
I --> K[Response Generation]
J --> K
K --> L[TTS Streaming]
L --> M[Audio Output]
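
The confidence gate in the pipeline above can be expressed in a few lines. A minimal sketch, assuming the intent classifier returns a label with a score between 0 and 1; the `IntentResult` type and the branch names are hypothetical, while the 0.7 threshold comes from the diagram.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # value taken from the diagram

@dataclass
class IntentResult:
    name: str
    confidence: float  # assumed to be in the range 0.0 to 1.0

def next_step(result: IntentResult) -> str:
    """Return which branch of the pipeline to take for a classified intent."""
    if result.confidence > CONFIDENCE_THRESHOLD:
        return "execute_intent"
    return "clarification_request"

print(next_step(IntentResult("check_balance", 0.92)))
print(next_step(IntentResult("check_balance", 0.55)))
```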
Model Selection Framework
ASR Model Comparison
| Model | Language Support | WER | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Whisper Large-v3 | 99 languages | 5.2% | 800ms | $$$ | Multi-lingual, high accuracy |
| Wav2Vec2 Large | English | 3.8% | 450ms | $$ | English-only, on-device |
| Conformer-Transducer | English | 4.1% | 300ms | $$ | Streaming, low latency |
| Google STT | 125 languages | 6.0% | 600ms | $$$$ | Cloud, multi-lingual |
| Amazon Transcribe | 37 languages | 5.8% | 700ms | $$$ | AWS ecosystem |
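
The WER column is word error rate: word-level substitutions, deletions, and insertions divided by the number of reference words. The sketch below computes it with a standard edit-distance table so the figures can be reproduced on your own test calls. It assumes whitespace tokenization and no text normalization, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of four reference words.
print(word_error_rate("check my account balance", "check my count balance"))  # 0.25
```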
TTS Model Comparison
| Model | Naturalness (MOS) | Latency (TTFB) | Voices | Cost | Best For |
|---|---|---|---|---|---|
| ElevenLabs | 4.8/5.0 | 250ms | 1000+ | $$$$ | Highest quality |
| Azure Neural TTS | 4.5/5.0 | 180ms | 400+ | $$$ | Enterprise, multi-lingual |
| Google WaveNet | 4.6/5.0 | 300ms | 200+ | $$$ | Natural prosody |
| Coqui XTTS | 4.2/5.0 | 450ms | Custom | $ | Open-source, on-prem |
| FastSpeech 2 | 4.0/5.0 | 120ms | Limited | $ | Low latency |
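
TTFB (time to first byte) is the delay between requesting synthesis and receiving the first audio chunk. A rough way to measure it for any streaming TTS client is sketched below. The `fake_tts_stream` generator is a hypothetical stand-in; swap in whatever callable returns an iterator of audio chunks for your provider.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

def time_to_first_byte(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return time.perf_counter() - started, chunk
    raise RuntimeError("stream ended without producing audio")

def fake_tts_stream() -> Iterator[bytes]:
    # Hypothetical stand-in for a streaming TTS client: about 50 ms of "synthesis", then audio.
    time.sleep(0.05)
    yield b"\x00\x01"
    yield b"\x02\x03"

ttfb_seconds, first_chunk = time_to_first_byte(fake_tts_stream)
print(f"TTFB: {ttfb_seconds * 1000:.0f} ms, first chunk: {len(first_chunk)} bytes")
```

Measuring from the moment the request is issued, rather than from when the connection opens, keeps the number comparable with the latency budget used elsewhere in this guide.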
Intent Classification Performance
| Approach | Accuracy | Latency | Training Data Required | Best For |
|---|---|---|---|---|
| Fine-tuned BERT | 94% | 45ms | 1000+ examples | Custom domains |
| OpenAI Embeddings | 91% | 80ms | 100+ examples | Quick deployment |
| FastText | 87% | 15ms | 500+ examples | Low latency |
| Rasa NLU | 89% | 35ms | 500+ examples | On-premise |
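
One way to turn the comparison above into a selection rule is to filter by hard constraints and then take the most accurate remaining option. The sketch below uses the accuracy, latency, and training-data figures from the table. The `pick_approach` helper and its two constraints are an illustrative assumption, not a complete selection framework (it ignores cost and deployment model, for example).

```python
from typing import NamedTuple, Optional

class Approach(NamedTuple):
    name: str
    accuracy: float    # fraction correct, from the table
    latency_ms: int    # from the table
    min_examples: int  # lower bound of "Training Data Required"

APPROACHES = [
    Approach("Fine-tuned BERT", 0.94, 45, 1000),
    Approach("OpenAI Embeddings", 0.91, 80, 100),
    Approach("FastText", 0.87, 15, 500),
    Approach("Rasa NLU", 0.89, 35, 500),
]

def pick_approach(latency_budget_ms: int, labeled_examples: int) -> Optional[Approach]:
    """Most accurate approach that fits the latency budget and the available data."""
    feasible = [
        a for a in APPROACHES
        if a.latency_ms <= latency_budget_ms and a.min_examples <= labeled_examples
    ]
    return max(feasible, key=lambda a: a.accuracy, default=None)

choice = pick_approach(latency_budget_ms=100, labeled_examples=200)
print(choice.name if choice else "no feasible approach")
```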
Latency Architecture
graph TB
A[User Utterance] --> B[VAD: 50ms]
B --> C[ASR Partial: 200ms]
C --> D[Intent: 100ms]
D --> E[Retrieval: 300ms]
E --> F[Generation: 400ms]
F --> G[TTS First Token: 300ms]
H[Total Budget: 1.2s] -.-> B
H -.-> C
H -.-> D
H -.-> E
H -.-> F
H -.-> G
H -.-> I{Budget Exceeded?}
I -->|Yes| J[Degradation Mode]
J --> K[Faster Models]
J --> L[Cached Responses]
J --> M[Partial Responses]
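
Taken strictly in sequence, the stage figures in the diagram add up to 1350 ms, which is above the 1.2 s total, so the budget presumably relies on some overlap between streaming stages. The sketch below makes that arithmetic explicit and shows one possible trigger for degradation mode. The stage keys and the `should_degrade` rule are assumptions for illustration.

```python
from typing import List

# Per-stage figures from the diagram, in milliseconds.
STAGE_BUDGET_MS = {
    "vad": 50,
    "asr_partial": 200,
    "intent": 100,
    "retrieval": 300,
    "generation": 400,
    "tts_first_token": 300,
}
TOTAL_BUDGET_MS = 1200

def should_degrade(elapsed_ms: float, remaining_stages: List[str]) -> bool:
    """True when elapsed time plus the remaining stage budgets exceeds the total."""
    remaining_ms = sum(STAGE_BUDGET_MS[stage] for stage in remaining_stages)
    return elapsed_ms + remaining_ms > TOTAL_BUDGET_MS

print(sum(STAGE_BUDGET_MS.values()))  # 1350
# 400 ms already spent, with retrieval, generation, and TTS still to run.
print(should_degrade(400, ["retrieval", "generation", "tts_first_token"]))  # True
```

When the check returns true, the pipeline would switch to one of the degradation options in the diagram: a faster model, a cached response, or a partial response.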
Performance Targets:
| Component | Target Latency | P95 Latency | P99 Latency | Optimization Strategy |
|---|---|---|---|---|
| ASR (partial) | < 200ms | 250ms | 400ms | Streaming inference |
| ASR (final) | < 500ms | 700ms | 1000ms | Batch processing |
| NLU + Intent | < 100ms | 150ms | 250ms | Cached embeddings |
| Retrieval | < 300ms | 500ms | 800ms | Vector index optimization |
| Response Gen | < 400ms | 600ms | 1000ms | Prompt caching |
| TTS (first token) | < 300ms | 450ms | 700ms | Streaming synthesis |
| Total E2E | < 1.2s | 1.8s | 3.0s | End-to-end monitoring |
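
P95 and P99 are the latencies that 95% and 99% of requests stay at or below. A minimal sketch of checking recorded samples against the targets above, using the nearest-rank definition of a percentile (monitoring tools may interpolate instead, so their values can differ slightly); the sample data is made up for illustration.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up ASR partial latencies in milliseconds: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99

# Compare against the ASR (partial) P95 limit of 250 ms from the table.
print(percentile(latencies_ms, 95) <= 250)  # True
```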
Dialogue State Machine
graph LR
A[Greeting] --> B[Intent Capture]
B --> C{Intent Clear?}
C -->|Yes| D{Auth Required?}
C -->|No| E[Clarification]
D -->|Yes| F[Authentication]
D -->|No| G[Fulfillment]
F --> H{Authenticated?}
H -->|Yes| G
H -->|No| I[Retry or Escalate]
G --> J{Success?}
J -->|Yes| K[Confirmation]
J -->|No| L[Error Handling]
K --> M[Closing]
L --> N{Retry?}
N -->|Yes| B
N -->|No| O[Human Handoff]
E --> B
I --> O
O --> P[Transfer to Agent]
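
The state machine above maps naturally onto a transition table keyed by (state, event). In this sketch the decision diamonds are folded into event names, all of which are hypothetical; only the states and the edges between them come from the diagram.

```python
from typing import Dict, Iterable, Tuple

TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("greeting", "start"): "intent_capture",
    ("intent_capture", "intent_clear_auth_required"): "authentication",
    ("intent_capture", "intent_clear_no_auth"): "fulfillment",
    ("intent_capture", "intent_unclear"): "clarification",
    ("clarification", "user_replied"): "intent_capture",
    ("authentication", "auth_ok"): "fulfillment",
    ("authentication", "auth_failed"): "retry_or_escalate",
    ("retry_or_escalate", "escalate"): "human_handoff",
    ("fulfillment", "success"): "confirmation",
    ("fulfillment", "failure"): "error_handling",
    ("confirmation", "done"): "closing",
    ("error_handling", "retry"): "intent_capture",
    ("error_handling", "give_up"): "human_handoff",
    ("human_handoff", "agent_available"): "transfer_to_agent",
}

def step(state: str, event: str) -> str:
    """Advance one transition, rejecting events the current state does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None

def run(events: Iterable[str], state: str = "greeting") -> str:
    for event in events:
        state = step(state, event)
    return state

print(run(["start", "intent_clear_no_auth", "success", "done"]))  # closing
```

Keeping the transitions in data rather than nested conditionals makes it easy to verify that every failure path eventually reaches a human handoff.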
Privacy and Consent Flow
graph TB
A[Call Start] --> B[Consent Prompt]
B --> C[Play: Call may be recorded]
C --> D{User Response}
D -->|Accept| E[Record Consent]
D -->|Decline| F[Limited Processing]
E --> G[Full Service]
F --> H[Basic Service Only]
G --> I[Store Transcript]
H --> J[No Storage]
I --> K[Retention: 30 days]
J --> L[Real-time Only]
M[User Request] --> N[Delete Data]
N --> O[Confirm Deletion]
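
The consent branch above determines whether anything is persisted at all. A minimal sketch, assuming a boolean consent flag captured at call start; the `StoredTranscript` record and function names are hypothetical, and the 30-day retention comes from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # retention period from the diagram

@dataclass
class StoredTranscript:
    call_id: str
    text: str
    expires_at: datetime

def store_transcript(call_id: str, text: str, consent_given: bool,
                     now: Optional[datetime] = None) -> Optional[StoredTranscript]:
    """Return a record to persist, or None when the caller declined (real-time only)."""
    if not consent_given:
        return None
    now = now or datetime.now(timezone.utc)
    return StoredTranscript(call_id, text, now + RETENTION)

def is_expired(record: StoredTranscript, now: datetime) -> bool:
    """True once the retention window has passed and the record should be purged."""
    return now >= record.expires_at

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = store_transcript("call-001", "what is my balance", consent_given=True, now=start)
print(record.expires_at.date())  # 2024-01-31
print(store_transcript("call-002", "hello", consent_given=False, now=start))  # None
```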
Consent Compliance Matrix:
| Jurisdiction | Consent Required | Recording Disclosure | Retention Limit | Deletion Right |
|---|---|---|---|---|
| EU (GDPR) | Explicit opt-in | Before recording | Purpose-limited | Yes, within 30 days |
| California (CCPA) | Opt-out available | At call start | 12 months default | Yes, within 45 days |
| Two-Party States | Both parties | Before recording | Varies | State law dependent |
| Healthcare (HIPAA) | Written consent | Encrypted storage | 6 years | Limited exceptions |
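
The deletion deadlines in the matrix can be tracked with a small lookup. This sketch encodes only the two rows that state a fixed number of days and returns no deadline for the others. It illustrates the bookkeeping and is not legal guidance; the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# Days allowed to complete a deletion request, per the matrix above.
# None means the matrix gives no fixed deadline for that row.
DELETION_DEADLINE_DAYS: Dict[str, Optional[int]] = {
    "EU (GDPR)": 30,
    "California (CCPA)": 45,
    "Two-Party States": None,
    "Healthcare (HIPAA)": None,
}

def deletion_due(jurisdiction: str, requested_on: date) -> Optional[date]:
    """Date by which a deletion request must be completed, if the matrix defines one."""
    days = DELETION_DEADLINE_DAYS[jurisdiction]
    if days is None:
        return None
    return requested_on + timedelta(days=days)

print(deletion_due("EU (GDPR)", date(2024, 1, 1)))          # 2024-01-31
print(deletion_due("California (CCPA)", date(2024, 1, 1)))  # 2024-02-15
```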
Minimal Code Example
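# Voice assistant: file-based ASR, streamed LLM response, sentence-level TTS
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def voice_assistant(audio_file):
    # ASR: whisper-1 transcribes a complete audio file, not a live stream
    transcript = await client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
    )

    # Generate the response, streaming tokens as they arrive
    response = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": transcript}],
        stream=True,
    )

    async def speak(text):
        return await client.audio.speech.create(model="tts-1", voice="alloy", input=text)

    # TTS: buffer tokens into sentences so each synthesis call gets a whole phrase
    buffer = ""
    async for chunk in response:
        if not chunk.choices:
            continue
        buffer += chunk.choices[0].delta.content or ""
        if buffer.rstrip().endswith((".", "!", "?")):
            yield await speak(buffer)  # binary audio response from the SDK
            buffer = ""
    if buffer.strip():
        yield await speak(buffer)  # flush any trailing text without end punctuation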
Case Study: Banking Voice Bot
Challenge
A major bank needed to automate 40% of its call center volume for account inquiries, balance checks, and fraud alerts while maintaining security and compliance.
Solution Architecture
graph TB
A[Incoming Call] --> B[IVR Menu]
B --> C{Intent}
C -->|Balance| D[Voice Auth]
C -->|Fraud Alert| E[High Priority]
C -->|General| F[FAQ Bot]
D --> G{Authenticated?}
G -->|Yes| H[Account API]
G -->|No| I[Retry 2x]
H --> J[TTS Response]
I --> K{Max Retries?}
K -->|Yes| L[Human Agent]
K -->|No| D
E --> M[Urgent Queue]
F --> N[Knowledge Base]
N --> O{Resolved?}
O -->|Yes| P[Closing]
O -->|No| L
Q[Compliance Logging] -.-> D
Q -.-> H
Q -.-> J
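
The authentication loop in the diagram (one attempt, up to two retries, then a human agent) can be sketched as follows. The `verify` callable is a hypothetical placeholder for the voice authentication step, and the returned strings simply name the next node in the diagram.

```python
from typing import Callable

MAX_RETRIES = 2  # "Retry 2x" in the diagram

def authenticate_or_escalate(verify: Callable[[], bool]) -> str:
    """Try voice auth once plus up to MAX_RETRIES retries, then hand off to a human."""
    for _attempt in range(1 + MAX_RETRIES):
        if verify():
            return "account_api"
    return "human_agent"

# Simulated caller who fails twice and passes on the final retry.
outcomes = iter([False, False, True])
print(authenticate_or_escalate(lambda: next(outcomes)))  # account_api
```

Capping retries keeps a failing caller from looping indefinitely and bounds the number of authentication attempts that compliance logging has to record per call.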