Part 3: Data Foundations
Chapter 18 — Real-Time Streaming Data
Overview
Stream processing unlocks low-latency personalization, alerting, and analytics. Streaming systems process data continuously as it arrives, enabling real-time decision-making for AI applications. This chapter covers the architecture, patterns, and best practices for building reliable streaming data pipelines that power real-time AI.
Why It Matters
Streaming amplifies both value and risk. Controls on schemas, state, and backpressure prevent outages and surprises.
Value of Real-Time Processing:
- Fraud Detection: Detect and block fraudulent transactions in milliseconds
- Personalization: Update recommendations based on immediate user behavior
- Alerting: Trigger notifications for critical events (system failures, security threats)
- Monitoring: Real-time dashboards for operations and business metrics
- Dynamic Pricing: Adjust prices based on real-time demand and inventory
Risks of Streaming:
- Complexity: Significantly more complex than batch processing
- Failure Modes: Data loss, duplicate processing, out-of-order events
- Cost: Can be expensive if not properly optimized
- Debugging: Harder to debug and test than batch pipelines
- State Management: Managing state at scale is challenging
Real-world impact:
- E-commerce company increased conversion by 12% with real-time personalization
- Bank prevented $50M in fraud with real-time transaction monitoring
- Streaming infrastructure costs reduced by 80% after proper optimization
- Ad-tech company processing 10M events/sec for real-time bidding
Streaming vs. Batch Processing
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Process complete datasets | Continuous, unbounded streams |
| Complexity | Lower | Higher |
| Cost | Lower (process once) | Higher (always running) |
| Use Cases | Reports, ML training, ETL | Fraud detection, real-time dashboards, alerting |
| Fault Tolerance | Restart from checkpoint | Requires careful state management |
| Testing | Easier (deterministic) | Harder (time-dependent, stateful) |
| Tools | Spark, Airflow, DBT | Kafka, Flink, Spark Streaming |
Streaming Architecture Patterns
End-to-End Streaming Platform
graph LR subgraph "Data Sources" S1[Application Events] S2[Database CDC] S3[IoT Devices] S4[API Streams] end subgraph "Message Broker" MB1[Kafka/Pulsar] MB2[Topic Partitions] MB3[Schema Registry] end subgraph "Stream Processing" SP1[Filtering] SP2[Windowing] SP3[Aggregation] SP4[Stateful Joins] end subgraph "State Store" SS1[RocksDB/Redis] SS2[Checkpoints] SS3[Recovery] end subgraph "Sinks" SK1[Feature Store<br/>Redis] SK2[Analytics DB<br/>ClickHouse] SK3[Alerts<br/>PagerDuty] SK4[ML Inference] end S1 & S2 & S3 & S4 --> MB1 MB1 --> MB2 --> MB3 MB3 --> SP1 --> SP2 --> SP3 --> SP4 SP4 <-.-> SS1 & SS2 & SS3 SP4 --> SK1 & SK2 & SK3 & SK4 style MB1 fill:#bbf,stroke:#333,stroke-width:2px style SS1 fill:#f96,stroke:#333,stroke-width:2px
Event Time vs. Processing Time
graph TB subgraph "Event Generation" E1[Event Occurs<br/>Event Time: T0] E2[Event Sent<br/>Ingestion Time: T1] end subgraph "Processing" P1[Event Received<br/>Processing Time: T2] P2[Watermark Check] P3{Event Time<br/>< Watermark?} end subgraph "Windowing" W1[Assign to Window] W2[Late Event Handling] W3[Window Complete] end E1 --> E2 --> P1 P1 --> P2 --> P3 P3 -->|On Time| W1 --> W3 P3 -->|Late| W2 style P3 fill:#fff3cd,stroke:#333,stroke-width:2px style W2 fill:#f96,stroke:#333,stroke-width:2px
Windowing Patterns
graph TB subgraph "Tumbling Window" T1[Window 1<br/>00:00-00:05] --> T2[Window 2<br/>00:05-00:10] T2 --> T3[Window 3<br/>00:10-00:15] T4[Non-overlapping<br/>Fixed Size] end subgraph "Sliding Window" S1[Window 1<br/>00:00-00:10] --> S2[Window 2<br/>00:01-00:11] S2 --> S3[Window 3<br/>00:02-00:12] S4[Overlapping<br/>Moving Average] end subgraph "Session Window" SS1[User Activity] --> SS2[30s Gap] --> SS3[New Session] SS4[Variable Size<br/>Gap-based] end style T4 fill:#bbf,stroke:#333,stroke-width:2px style S4 fill:#9f9,stroke:#333,stroke-width:2px style SS4 fill:#f96,stroke:#333,stroke-width:2px
Stateful Processing Architecture
graph LR subgraph "Event Stream" ES1[Event 1] --> ES2[Event 2] --> ES3[Event 3] end subgraph "Stateful Operator" SO1[Read State] SO2[Process Event] SO3[Update State] SO4[Emit Result] end subgraph "State Backend" SB1[In-Memory State] SB2[RocksDB Disk] SB3[Periodic Checkpoint] end ES1 --> SO1 SO1 --> SO2 --> SO3 --> SO4 SO1 <--> SB1 SB1 <--> SB2 SB2 --> SB3 style SO3 fill:#f96,stroke:#333,stroke-width:2px style SB3 fill:#bbf,stroke:#333,stroke-width:2px
Technology Comparison
Stream Processing Frameworks
| Framework | Best For | Strengths | Limitations | Latency | When to Use |
|---|---|---|---|---|---|
| Apache Flink | Complex stateful processing | True streaming, exactly-once, mature | Steep learning curve | < 100ms | Choose for complex event processing |
| Kafka Streams | Kafka-native applications | Simple, library (not cluster), Kafka integration | Kafka-only, limited ops features | < 500ms | Choose for Kafka-centric stacks |
| Spark Streaming | Batch + streaming hybrid | Unified API, existing Spark expertise | Micro-batch (not true streaming) | 1-5 seconds | Choose for batch/stream hybrid |
| AWS Kinesis | AWS-native streaming | Managed, serverless, simple | AWS lock-in, limited processing | < 1 second | Choose for AWS serverless |
| Google Dataflow | GCP-native streaming | Managed, Apache Beam API | GCP lock-in, can be expensive | < 1 second | Choose for GCP environments |
Message Broker Comparison
| Broker | Best For | Strengths | Limitations | Throughput | Decision Factor |
|---|---|---|---|---|---|
| Apache Kafka | High-throughput, event streaming | Battle-tested, ecosystem, scalable | Complex ops, Java-centric | 1M+ msg/sec | Choose for event streaming |
| Apache Pulsar | Multi-tenancy, geo-replication | Cloud-native, unified queue+stream | Smaller ecosystem, newer | 500K+ msg/sec | Choose for multi-region |
| AWS Kinesis | AWS-native, managed | Fully managed, simple | AWS lock-in, throughput limits | 1MB/sec per shard | Choose for AWS managed |
| Redis Streams | Simple streaming, caching | Fast, simple, combined with cache | Not designed for high-volume | 100K msg/sec | Choose for simple use cases |
| RabbitMQ | Traditional messaging | AMQP support, flexible routing | Lower throughput vs. Kafka | 50K msg/sec | Choose for messaging patterns |
Backpressure Management Strategies
Backpressure Handling Patterns
graph TB subgraph "Detection" D1[Monitor Buffer Usage] D2[Track Processing Lag] D3[Measure Throughput] end subgraph "Response Strategies" R1[Scale Up Resources] R2[Apply Rate Limiting] R3[Drop Low-Priority Events] R4[Buffer & Batch] end subgraph "Prevention" P1[Capacity Planning] P2[Auto-scaling Rules] P3[Circuit Breakers] end D1 & D2 & D3 --> R1 & R2 & R3 & R4 R1 & R2 & R3 & R4 --> P1 & P2 & P3 style D2 fill:#fff3cd,stroke:#333,stroke-width:2px style R1 fill:#bbf,stroke:#333,stroke-width:2px
Backpressure Strategies Comparison
| Strategy | When to Use | Pros | Cons | Implementation Complexity |
|---|---|---|---|---|
| Scale Up | Resources constrained, budget available | Increases throughput, no data loss | Costs money, may not solve root cause | Low |
| Drop Events | Some data loss acceptable | Prevents cascading failures, simple | Loses data, may violate SLAs | Low |
| Buffer & Batch | Temporary spikes, smooth workload | Smooths load, no data loss | Increases latency, memory pressure | Medium |
| Sampling | Approximation acceptable | Reduces load significantly | Less accurate, may miss important events | Medium |
| Load Shedding | Prioritize critical events | Keeps system alive, preserves critical path | Loses non-critical data | High |
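As a concrete example of the buffer-and-batch and load-shedding rows, here is a sketch built on a bounded in-process queue; the priority field, queue size, and batch limit are illustrative assumptions.

```python
# Buffer-and-batch with load shedding, sketched with a bounded queue.
import queue

buffer = queue.Queue(maxsize=10_000)  # bounded: a full buffer IS the backpressure signal

def ingest(event: dict) -> bool:
    """Enqueue an event; shed low-priority events when the buffer is full."""
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        if event.get("priority") != "critical":
            return False       # load shedding: drop a low-priority event
        buffer.put(event)      # block: backpressure propagates to the caller
        return True

def drain_batch(max_batch: int = 500) -> list:
    """Consumer side: drain up to one batch to smooth downstream writes."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch
```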
Exactly-Once Processing Guarantees
Exactly-Once Architecture
graph LR subgraph "Source" SR1[Kafka Consumer] SR2[Track Offsets] SR3[Commit After Checkpoint] end subgraph "Processing" PR1[Stateful Operations] PR2[Periodic Checkpoints] PR3[State Snapshots] end subgraph "Sink" SK1[Two-Phase Commit] SK2[Pre-commit Write] SK3[Commit After Checkpoint] end SR1 --> SR2 --> PR1 PR1 --> PR2 --> PR3 PR3 --> SK1 --> SK2 --> SK3 SK3 --> SR3 style PR2 fill:#f96,stroke:#333,stroke-width:2px style SK1 fill:#bbf,stroke:#333,stroke-width:2px
Processing Guarantee Comparison
| Guarantee | How It Works | Use When | Trade-offs | Performance Impact |
|---|---|---|---|---|
| At-Most-Once | No retries, may lose data | Non-critical metrics, high throughput needed | Simple, fast, data loss OK | None |
| At-Least-Once | Retry on failure, may duplicate | Most use cases, idempotent processing | Simple, no data loss, duplicates possible | Low (~5%) |
| Exactly-Once | Checkpoints + 2-phase commit | Financial transactions, compliance | Complex, no data loss, no duplicates | Medium (~20%) |
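A common middle path implied by this table: run the pipeline at-least-once and make the sink idempotent, so duplicate deliveries collapse into a single effect. A minimal sketch, assuming every event carries a unique event_id:

```python
# Idempotent sink sketch: at-least-once delivery plus deduplication by
# event_id yields exactly-once *effects*. The in-memory set stands in
# for a keyed store with a TTL (e.g. Redis) in production.
processed_ids = set()
results = {}

def apply_once(event: dict):
    event_id = event["event_id"]      # assumed unique per logical event
    if event_id in processed_ids:
        return                        # duplicate delivery: no second effect
    acct = event["account"]
    results[acct] = results.get(acct, 0) + event["amount"]
    processed_ids.add(event_id)       # record only after the write succeeds

# A redelivered event changes nothing:
e = {"event_id": "evt-42", "account": "a1", "amount": 100}
apply_once(e)
apply_once(e)                         # retry: deduplicated
assert results["a1"] == 100
```

In production the dedup marker and the write should be committed atomically (e.g. in one transaction); a crash between the two can otherwise still produce a duplicate or a loss.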
Real-World Case Study: E-commerce Real-Time Personalization
Challenge
An e-commerce company wanted to personalize its homepage in real time based on user behavior. The existing batch system updated recommendations once daily, leaving them stale.
Solution Architecture
Technology Stack:
- Source: User events (clicks, views, searches) → Kafka (24 partitions)
- Processing: Apache Flink for windowing and feature computation
- State: RocksDB for user session state (2GB per task)
- Sink: Redis for real-time feature serving (sub-ms reads)
- ML: Inference service reads features from Redis (<50ms total)
Implementation Timeline:
| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Event Collection | Week 1 | Instrument apps, set up Kafka, schema registry | Event pipeline |
| Stream Processing | Weeks 2-3 | Flink jobs, windowing logic, feature engineering | Feature computation |
| Feature Serving | Week 4 | Redis deployment, ML integration, API | Real-time serving |
| Monitoring | Week 5 | Dashboards, alerts, chaos testing | Observability |
Feature Computation:
- 5-minute tumbling windows per user
- Features: last 5 products viewed, categories browsed, time on site, search queries, cart adds
- Sliding window for 30-minute moving averages
- Session window for user session analytics (30-second gap)
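To show how these computed features reach the serving layer, here is a minimal write/read sketch with redis-py; the key layout, field names, and TTL are illustrative assumptions rather than the team's actual schema.

```python
# Feature-serving sketch with redis-py: the stream job writes a feature
# hash per user; the inference service reads it at request time.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_user_features(user_id: str, features: dict):
    key = f"features:user:{user_id}"
    r.hset(key, mapping={k: json.dumps(v) for k, v in features.items()})
    r.expire(key, 3600)  # stale features age out if the pipeline stalls

def read_user_features(user_id: str) -> dict:
    raw = r.hgetall(f"features:user:{user_id}")
    return {k: json.loads(v) for k, v in raw.items()}

write_user_features("u-123", {
    "last_products_viewed": ["p1", "p2"],
    "cart_adds_5m": 2,
})
print(read_user_features("u-123"))
```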
Results After 6 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Personalization latency | 24 hours | 5 minutes | 99.7% reduction |
| Conversion rate | Baseline | +12% | Real-time relevance |
| Infrastructure costs | N/A | $15K/month | Kafka + Flink + Redis |
| System reliability | N/A | 99.95% uptime | Robust architecture |
| Processing capacity | N/A | 50K events/sec | Room to scale to 200K |
| Feature freshness | Daily batch | < 5 minutes | Real-time |
Key Success Factors
- Gradual Rollout: Started with 1% traffic, scaled to 100% over 6 weeks (a rollout-gate sketch follows this list)
- Comprehensive Monitoring: 40+ metrics tracked from day one
- Fallback to Batch: Automatic fallback if streaming pipeline fails
- Extensive Testing: Chaos engineering, load testing, failure injection
- Clear Runbooks: Documented procedures for 15 common failure scenarios
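The gradual rollout can be implemented as a deterministic hash gate, so each user is consistently in or out of the streaming path. A sketch, with an illustrative salt and bucket count:

```python
# Deterministic rollout gate: route a stable percentage of users to the
# streaming path, the rest to the batch fallback.
import hashlib

def in_rollout(user_id: str, percent: float, salt: str = "rt-personalization") -> bool:
    """Same user always gets the same answer, so experiences are stable."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100  # percent=1.0 -> 100 of 10,000 buckets = 1%

if in_rollout("u-123", percent=1.0):   # week 1: 1% of traffic
    features_source = "streaming"
else:
    features_source = "batch_fallback"  # the documented fallback path
```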
Monitoring & Observability
Streaming Metrics Dashboard
| Metric Category | Key Metrics | Alert Threshold | Action |
|---|---|---|---|
| Lag | Processing lag, consumer lag | > 1 minute | Scale up, investigate bottleneck |
| Throughput | Events/second, bytes/second | < 80% of expected | Check upstream, alert stakeholders |
| Errors | Failed events, exceptions | > 1% | Inspect logs, fix bugs |
| Latency | End-to-end latency, window latency | > 2x baseline | Optimize processing, scale resources |
| Backpressure | Buffer utilization, backpressure time | > 10% | Increase parallelism, optimize code |
| State | State size, checkpoint duration | Approaching limits | Archive old state, tune checkpoints |
| Resources | CPU, memory, network | > 80% utilization | Scale horizontally |
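Consumer lag, the first row of this table, can be measured directly from the broker: lag is the high watermark minus the committed offset, per partition. A sketch with the confluent-kafka client; the group, topic, and alert threshold are illustrative assumptions.

```python
# Consumer-lag sketch (confluent-kafka): lag = high watermark minus
# committed offset for one partition.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "feature-pipeline",  # hypothetical consumer group
})

def partition_lag(topic: str, partition: int) -> int:
    tp = TopicPartition(topic, partition)
    committed = consumer.committed([tp])[0].offset
    low, high = consumer.get_watermark_offsets(tp)
    if committed < 0:        # no commit yet: the whole backlog is lag
        committed = low
    return high - committed

lag = partition_lag("user-events", 0)
if lag > 60_000:             # proxy for "> 1 minute behind" at ~1K events/sec
    print(f"ALERT: consumer lag {lag} on user-events[0]")
```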
Implementation Checklist
Design Phase (Week 1)
□ Define streaming use case and requirements
□ Identify data sources and event schema
□ Choose streaming framework (Flink/Kafka Streams/Spark)
□ Design windowing and aggregation logic
□ Plan state management strategy
□ Define exactly-once vs. at-least-once requirements
Development Phase (Weeks 2-3)
□ Set up Kafka topics with appropriate partitions
□ Implement schema registry for event validation
□ Build stream processing jobs
□ Implement stateful operations with checkpoints
□ Add error handling and dead letter queues (sketched after this checklist)
□ Write unit tests for processing logic
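For the dead-letter-queue item above, a minimal sketch: events that fail processing are published to a side topic with error context, instead of crashing the job or being dropped silently. The topic names and processing logic are illustrative assumptions.

```python
# Dead-letter-queue sketch: failed events go to a side topic for replay.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def process(event: dict) -> dict:
    # Placeholder business logic; raises on malformed events.
    return {"user_id": event["user_id"], "score": event["amount"] * 0.1}

def handle(raw_value: bytes):
    try:
        event = json.loads(raw_value)
        result = process(event)
        producer.produce("features", value=json.dumps(result))
    except Exception as exc:
        # Preserve the original payload plus the failure reason for replay.
        producer.produce("features-dlq", value=json.dumps({
            "original": raw_value.decode("utf-8", errors="replace"),
            "error": str(exc),
        }))
    producer.poll(0)

handle(b'{"user_id": "u-1", "amount": 50}')  # happy path
handle(b"not json at all")                    # routed to the DLQ
producer.flush()
```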
Testing Phase (Week 4)
□ Integration test with production-like data
□ Load test with 2x expected throughput
□ Chaos test failure scenarios
□ Validate exactly-once semantics
□ Test backpressure handling
□ Verify state recovery after failures
Deployment Phase (Week 5)
□ Deploy to production with 1% traffic
□ Set up monitoring dashboards (Grafana)
□ Configure alerting (PagerDuty)
□ Document runbooks for common issues
□ Gradually increase traffic to 100%
□ Conduct post-deployment review
Operations (Ongoing)
□ Monitor lag and throughput daily
□ Review error rates and investigate anomalies
□ Optimize costs monthly (right-size cluster)
□ Update schemas with backward compatibility
□ Quarterly disaster recovery drills
□ Annual architecture review
Best Practices
- Start Simple: Begin with batch, add streaming only when latency matters
- Schema Management: Use schema registry, enforce compatibility
- Handle Late Data: Configure watermarks and allowed lateness
- Monitor Backpressure: Alert before it causes cascading failures
- Test Failure Modes: Chaos testing for resilience
- Optimize State: Keep state size manageable, use RocksDB for large state
- Exactly-Once When Needed: At-least-once is simpler and often sufficient
- Cost Optimization: Right-size clusters, use auto-scaling
Common Pitfalls
- Over-Engineering: Building real-time when batch would suffice
- Ignoring Backpressure: Letting backpressure cascade to failures
- No Schema Evolution: Breaking consumers with schema changes
- Unbounded State: State growing forever, causing OOM
- Inadequate Monitoring: Can't detect or debug issues
- No Failure Testing: Surprised when things break in production
- Wrong Time Semantics: Using processing time when event time needed
- Expensive Joins: Large-state joins that don't scale