Part 3: Data Foundations
Chapter 18 — Real-Time Streaming Data
Overview
Stream processing unlocks low-latency personalization, alerting, and analytics. Streaming systems process data continuously as it arrives, enabling real-time decision-making for AI applications. This chapter covers the architecture, patterns, and best practices for building reliable streaming data pipelines that power real-time AI.
Why It Matters
Streaming amplifies both value and risk. Controls on schemas, state, and backpressure prevent outages and surprises.
Value of Real-Time Processing:
- Fraud Detection: Detect and block fraudulent transactions in milliseconds
- Personalization: Update recommendations based on immediate user behavior
- Alerting: Trigger notifications for critical events (system failures, security threats)
- Monitoring: Real-time dashboards for operations and business metrics
- Dynamic Pricing: Adjust prices based on real-time demand and inventory
Risks of Streaming:
- Complexity: Significantly more complex than batch processing
- Failure Modes: Data loss, duplicate processing, out-of-order events
- Cost: Can be expensive if not properly optimized
- Debugging: Harder to debug and test than batch pipelines
- State Management: Managing state at scale is challenging
Real-world impact:
- E-commerce company increased conversion by 12% with real-time personalization
- Bank prevented $50M in fraud with real-time transaction monitoring
- Streaming infrastructure costs reduced by 80% after proper optimization
- Ad-tech company processing 10M events/sec for real-time bidding
Streaming vs. Batch Processing
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Process complete datasets | Continuous, unbounded streams |
| Complexity | Lower | Higher |
| Cost | Lower (process once) | Higher (always running) |
| Use Cases | Reports, ML training, ETL | Fraud detection, real-time dashboards, alerting |
| Fault Tolerance | Restart from checkpoint | Requires careful state management |
| Testing | Easier (deterministic) | Harder (time-dependent, stateful) |
| Tools | Spark, Airflow, DBT | Kafka, Flink, Spark Streaming |
Streaming Architecture Patterns
End-to-End Streaming Platform
graph LR subgraph "Data Sources" S1[Application Events] S2[Database CDC] S3[IoT Devices] S4[API Streams] end subgraph "Message Broker" MB1[Kafka/Pulsar] MB2[Topic Partitions] MB3[Schema Registry] end subgraph "Stream Processing" SP1[Filtering] SP2[Windowing] SP3[Aggregation] SP4[Stateful Joins] end subgraph "State Store" SS1[RocksDB/Redis] SS2[Checkpoints] SS3[Recovery] end subgraph "Sinks" SK1[Feature Store<br/>Redis] SK2[Analytics DB<br/>ClickHouse] SK3[Alerts<br/>PagerDuty] SK4[ML Inference] end S1 & S2 & S3 & S4 --> MB1 MB1 --> MB2 --> MB3 MB3 --> SP1 --> SP2 --> SP3 --> SP4 SP4 <-.-> SS1 & SS2 & SS3 SP4 --> SK1 & SK2 & SK3 & SK4 style MB1 fill:#bbf,stroke:#333,stroke-width:2px style SS1 fill:#f96,stroke:#333,stroke-width:2px
Event Time vs. Processing Time
graph TB subgraph "Event Generation" E1[Event Occurs<br/>Event Time: T0] E2[Event Sent<br/>Ingestion Time: T1] end subgraph "Processing" P1[Event Received<br/>Processing Time: T2] P2[Watermark Check] P3{Event Time<br/>< Watermark?} end subgraph "Windowing" W1[Assign to Window] W2[Late Event Handling] W3[Window Complete] end E1 --> E2 --> P1 P1 --> P2 --> P3 P3 -->|On Time| W1 --> W3 P3 -->|Late| W2 style P3 fill:#fff3cd,stroke:#333,stroke-width:2px style W2 fill:#f96,stroke:#333,stroke-width:2px
Windowing Patterns
graph TB subgraph "Tumbling Window" T1[Window 1<br/>00:00-00:05] --> T2[Window 2<br/>00:05-00:10] T2 --> T3[Window 3<br/>00:10-00:15] T4[Non-overlapping<br/>Fixed Size] end subgraph "Sliding Window" S1[Window 1<br/>00:00-00:10] --> S2[Window 2<br/>00:01-00:11] S2 --> S3[Window 3<br/>00:02-00:12] S4[Overlapping<br/>Moving Average] end subgraph "Session Window" SS1[User Activity] --> SS2[30s Gap] --> SS3[New Session] SS4[Variable Size<br/>Gap-based] end style T4 fill:#bbf,stroke:#333,stroke-width:2px style S4 fill:#9f9,stroke:#333,stroke-width:2px style SS4 fill:#f96,stroke:#333,stroke-width:2px
Stateful Processing Architecture
graph LR subgraph "Event Stream" ES1[Event 1] --> ES2[Event 2] --> ES3[Event 3] end subgraph "Stateful Operator" SO1[Read State] SO2[Process Event] SO3[Update State] SO4[Emit Result] end subgraph "State Backend" SB1[In-Memory State] SB2[RocksDB Disk] SB3[Periodic Checkpoint] end ES1 --> SO1 SO1 --> SO2 --> SO3 --> SO4 SO1 <--> SB1 SB1 <--> SB2 SB2 --> SB3 style SO3 fill:#f96,stroke:#333,stroke-width:2px style SB3 fill:#bbf,stroke:#333,stroke-width:2px
Technology Comparison
Stream Processing Frameworks
| Framework | Best For | Strengths | Limitations | Latency | When to Use |
|---|---|---|---|---|---|
| Apache Flink | Complex stateful processing | True streaming, exactly-once, mature | Steep learning curve | < 100ms | Choose for complex event processing |
| Kafka Streams | Kafka-native applications | Simple, library (not cluster), Kafka integration | Kafka-only, limited ops features | < 500ms | Choose for Kafka-centric stacks |
| Spark Streaming | Batch + streaming hybrid | Unified API, existing Spark expertise | Micro-batch (not true streaming) | 1-5 seconds | Choose for batch/stream hybrid |
| AWS Kinesis | AWS-native streaming | Managed, serverless, simple | AWS lock-in, limited processing | < 1 second | Choose for AWS serverless |
| Google Dataflow | GCP-native streaming | Managed, Apache Beam API | GCP lock-in, can be expensive | < 1 second | Choose for GCP environments |
Message Broker Comparison
| Broker | Best For | Strengths | Limitations | Throughput | Decision Factor |
|---|---|---|---|---|---|
| Apache Kafka | High-throughput, event streaming | Battle-tested, ecosystem, scalable | Complex ops, Java-centric | 1M+ msg/sec | Choose for event streaming |
| Apache Pulsar | Multi-tenancy, geo-replication | Cloud-native, unified queue+stream | Smaller ecosystem, newer | 500K+ msg/sec | Choose for multi-region |
| AWS Kinesis | AWS-native, managed | Fully managed, simple | AWS lock-in, throughput limits | 1MB/sec per shard | Choose for AWS managed |
| Redis Streams | Simple streaming, caching | Fast, simple, combined with cache | Not designed for high-volume | 100K msg/sec | Choose for simple use cases |
| RabbitMQ | Traditional messaging | AMQP support, flexible routing | Lower throughput vs. Kafka | 50K msg/sec | Choose for messaging patterns |
Backpressure Management Strategies
Backpressure Handling Patterns
graph TB subgraph "Detection" D1[Monitor Buffer Usage] D2[Track Processing Lag] D3[Measure Throughput] end subgraph "Response Strategies" R1[Scale Up Resources] R2[Apply Rate Limiting] R3[Drop Low-Priority Events] R4[Buffer & Batch] end subgraph "Prevention" P1[Capacity Planning] P2[Auto-scaling Rules] P3[Circuit Breakers] end D1 & D2 & D3 --> R1 & R2 & R3 & R4 R1 & R2 & R3 & R4 --> P1 & P2 & P3 style D2 fill:#fff3cd,stroke:#333,stroke-width:2px style R1 fill:#bbf,stroke:#333,stroke-width:2px
Backpressure Strategies Comparison
| Strategy | When to Use | Pros | Cons | Implementation Complexity |
|---|---|---|---|---|
| Scale Up | Resources constrained, budget available | Increases throughput, no data loss | Costs money, may not solve root cause | Low |
| Drop Events | Some data loss acceptable | Prevents cascading failures, simple | Loses data, may violate SLAs | Low |
| Buffer & Batch | Temporary spikes, smooth workload | Smooths load, no data loss | Increases latency, memory pressure | Medium |
| Sampling | Approximation acceptable | Reduces load significantly | Less accurate, may miss important events | Medium |
| Load Shedding | Prioritize critical events | Keeps system alive, preserves critical path | Loses non-critical data | High |
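As a concrete example of the buffer-and-batch and load-shedding rows, here is a sketch built on a bounded in-process queue; the priority field, queue size, and batch limit are illustrative assumptions.

```python
# Buffer-and-batch with load shedding, sketched with a bounded queue.
import queue

buffer = queue.Queue(maxsize=10_000)  # bounded: a full buffer IS the backpressure signal

def ingest(event: dict) -> bool:
    """Enqueue an event; shed low-priority events when the buffer is full."""
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        if event.get("priority") != "critical":
            return False       # load shedding: drop a low-priority event
        buffer.put(event)      # block: backpressure propagates to the caller
        return True

def drain_batch(max_batch: int = 500) -> list:
    """Consumer side: drain up to one batch to smooth downstream writes."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch
```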
Exactly-Once Processing Guarantees
Exactly-Once Architecture
graph LR subgraph "Source" SR1[Kafka Consumer] SR2[Track Offsets] SR3[Commit After Checkpoint] end subgraph "Processing" PR1[Stateful Operations] PR2[Periodic Checkpoints] PR3[State Snapshots] end subgraph "Sink" SK1[Two-Phase Commit] SK2[Pre-commit Write] SK3[Commit After Checkpoint] end SR1 --> SR2 --> PR1 PR1 --> PR2 --> PR3 PR3 --> SK1 --> SK2 --> SK3 SK3 --> SR3 style PR2 fill:#f96,stroke:#333,stroke-width:2px style SK1 fill:#bbf,stroke:#333,stroke-width:2px
Processing Guarantee Comparison
| Guarantee | How It Works | Use When | Trade-offs | Performance Impact |
|---|---|---|---|---|
| At-Most-Once | No retries, may lose data | Non-critical metrics, high throughput needed | Simple, fast, data loss OK | None |
| At-Least-Once | Retry on failure, may duplicate | Most use cases, idempotent processing | Simple, no data loss, duplicates possible | Low (~5%) |
| Exactly-Once | Checkpoints + 2-phase commit | Financial transactions, compliance | Complex, no data loss, no duplicates | Medium (~20%) |
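A common middle path implied by this table: run the pipeline at-least-once and make the sink idempotent, so duplicate deliveries collapse into a single effect. A minimal sketch, assuming every event carries a unique event_id:

```python
# Idempotent sink sketch: at-least-once delivery plus deduplication by
# event_id yields exactly-once *effects*. The in-memory set stands in
# for a keyed store with a TTL (e.g. Redis) in production.
processed_ids = set()
results = {}

def apply_once(event: dict):
    event_id = event["event_id"]      # assumed unique per logical event
    if event_id in processed_ids:
        return                        # duplicate delivery: no second effect
    acct = event["account"]
    results[acct] = results.get(acct, 0) + event["amount"]
    processed_ids.add(event_id)       # record only after the write succeeds

# A redelivered event changes nothing:
e = {"event_id": "evt-42", "account": "a1", "amount": 100}
apply_once(e)
apply_once(e)                         # retry: deduplicated
assert results["a1"] == 100
```

In production the dedup marker and the write should be committed atomically (e.g. in one transaction); a crash between the two can otherwise still produce a duplicate or a loss.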
Real-World Case Study: E-commerce Real-Time Personalization
Challenge
An e-commerce company wanted to personalize its homepage in real time based on user behavior. The existing batch system updated recommendations once daily, leaving them stale.
Solution Architecture
Technology Stack:
- Source: User events (clicks, views, searches) → Kafka (24 partitions)
- Processing: Apache Flink for windowing and feature computation
- State: RocksDB for user session state (2GB per task)
- Sink: Redis for real-time feature serving (sub-ms reads)
- ML: Inference service reads features from Redis (<50ms total)
Implementation Timeline:
| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Event Collection | Week 1 | Instrument apps, set up Kafka, schema registry | Event pipeline |
| Stream Processing | Weeks 2-3 | Flink jobs, windowing logic, feature engineering | Feature computation |
| Feature Serving | Week 4 | Redis deployment, ML integration, API | Real-time serving |
| Monitoring | Week 5 | Dashboards, alerts, chaos testing | Observability |
Feature Computation:
- 5-minute tumbling windows per user
- Features: last 5 products viewed, categories browsed, time on site, search queries, cart adds
- Sliding window for 30-minute moving averages
- Session window for user session analytics (30-second gap)
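To show how these computed features reach the serving layer, here is a minimal write/read sketch with redis-py; the key layout, field names, and TTL are illustrative assumptions rather than the team's actual schema.

```python
# Feature-serving sketch with redis-py: the stream job writes a feature
# hash per user; the inference service reads it at request time.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_user_features(user_id: str, features: dict):
    key = f"features:user:{user_id}"
    r.hset(key, mapping={k: json.dumps(v) for k, v in features.items()})
    r.expire(key, 3600)  # stale features age out if the pipeline stalls

def read_user_features(user_id: str) -> dict:
    raw = r.hgetall(f"features:user:{user_id}")
    return {k: json.loads(v) for k, v in raw.items()}

write_user_features("u-123", {
    "last_products_viewed": ["p1", "p2"],
    "cart_adds_5m": 2,
})
print(read_user_features("u-123"))
```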
Results After 6 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Personalization latency | 24 hours | 5 minutes | 99.7% reduction |
| Conversion rate | Baseline | +12% | Real-time relevance |
| Infrastructure costs | N/A | $15K/month | Kafka + Flink + Redis |
| System reliability | N/A | 99.95% uptime | Robust architecture |
| Processing capacity | N/A | 50K events/sec | Room to scale to 200K |
| Feature freshness | Daily batch | < 5 minutes | Real-time |
Key Success Factors
- Gradual Rollout: Started with 1% traffic, scaled to 100% over 6 weeks (a rollout-gate sketch follows this list)
- Comprehensive Monitoring: 40+ metrics tracked from day one
- Fallback to Batch: Automatic fallback if streaming pipeline fails
- Extensive Testing: Chaos engineering, load testing, failure injection
- Clear Runbooks: Documented procedures for 15 common failure scenarios
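The gradual rollout can be implemented as a deterministic hash gate, so each user is consistently in or out of the streaming path. A sketch, with an illustrative salt and bucket count:

```python
# Deterministic rollout gate: route a stable percentage of users to the
# streaming path, the rest to the batch fallback.
import hashlib

def in_rollout(user_id: str, percent: float, salt: str = "rt-personalization") -> bool:
    """Same user always gets the same answer, so experiences are stable."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100  # percent=1.0 -> 100 of 10,000 buckets = 1%

if in_rollout("u-123", percent=1.0):   # week 1: 1% of traffic
    features_source = "streaming"
else:
    features_source = "batch_fallback"  # the documented fallback path
```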
Monitoring & Observability
Streaming Metrics Dashboard
| Metric Category | Key Metrics | Alert Threshold | Action |
|---|---|---|---|
| Lag | Processing lag, consumer lag | > 1 minute | Scale up, investigate bottleneck |
| Throughput | Events/second, bytes/second | < 80% of expected | Check upstream, alert stakeholders |
| Errors | Failed events, exceptions | > 1% | Inspect logs, fix bugs |
| Latency | End-to-end latency, window latency | > 2x baseline | Optimize processing, scale resources |
| Backpressure | Buffer utilization, backpressure time | > 10% | Increase parallelism, optimize code |
| State | State size, checkpoint duration | Approaching limits | Archive old state, tune checkpoints |
| Resources | CPU, memory, network | > 80% utilization | Scale horizontally |
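Consumer lag, the first row of this table, can be measured directly from the broker: lag is the high watermark minus the committed offset, per partition. A sketch with the confluent-kafka client; the group, topic, and alert threshold are illustrative assumptions.

```python
# Consumer-lag sketch (confluent-kafka): lag = high watermark minus
# committed offset for one partition.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "feature-pipeline",  # hypothetical consumer group
})

def partition_lag(topic: str, partition: int) -> int:
    tp = TopicPartition(topic, partition)
    committed = consumer.committed([tp])[0].offset
    low, high = consumer.get_watermark_offsets(tp)
    if committed < 0:        # no commit yet: the whole backlog is lag
        committed = low
    return high - committed

lag = partition_lag("user-events", 0)
if lag > 60_000:             # proxy for "> 1 minute behind" at ~1K events/sec
    print(f"ALERT: consumer lag {lag} on user-events[0]")
```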
Implementation Checklist
Design Phase (Week 1)
□ Define streaming use case and requirements
□ Identify data sources and event schema
□ Choose streaming framework (Flink/Kafka Streams/Spark)
□ Design windowing and aggregation logic
□ Plan state management strategy
□ Define exactly-once vs. at-least-once requirements
Development Phase (Weeks 2-3)
□ Set up Kafka topics with appropriate partitions
□ Implement schema registry for event validation
□ Build stream processing jobs
□ Implement stateful operations with checkpoints
□ Add error handling and dead letter queues (sketched after this checklist)
□ Write unit tests for processing logic
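For the dead-letter-queue item above, a minimal sketch: events that fail processing are published to a side topic with error context, instead of crashing the job or being dropped silently. The topic names and processing logic are illustrative assumptions.

```python
# Dead-letter-queue sketch: failed events go to a side topic for replay.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def process(event: dict) -> dict:
    # Placeholder business logic; raises on malformed events.
    return {"user_id": event["user_id"], "score": event["amount"] * 0.1}

def handle(raw_value: bytes):
    try:
        event = json.loads(raw_value)
        result = process(event)
        producer.produce("features", value=json.dumps(result))
    except Exception as exc:
        # Preserve the original payload plus the failure reason for replay.
        producer.produce("features-dlq", value=json.dumps({
            "original": raw_value.decode("utf-8", errors="replace"),
            "error": str(exc),
        }))
    producer.poll(0)

handle(b'{"user_id": "u-1", "amount": 50}')  # happy path
handle(b"not json at all")                    # routed to the DLQ
producer.flush()
```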
Testing Phase (Week 4)
□ Integration test with production-like data
□ Load test with 2x expected throughput
□ Chaos test failure scenarios
□ Validate exactly-once semantics
□ Test backpressure handling
□ Verify state recovery after failures
Deployment Phase (Week 5)
□ Deploy to production with 1% traffic
□ Set up monitoring dashboards (Grafana)
□ Configure alerting (PagerDuty)
□ Document runbooks for common issues
□ Gradually increase traffic to 100%
□ Conduct post-deployment review
Operations (Ongoing)
□ Monitor lag and throughput daily
□ Review error rates and investigate anomalies
□ Optimize costs monthly (right-size cluster)
□ Update schemas with backward compatibility
□ Quarterly disaster recovery drills
□ Annual architecture review
Best Practices
- Start Simple: Begin with batch, add streaming only when latency matters
- Schema Management: Use schema registry, enforce compatibility
- Handle Late Data: Configure watermarks and allowed lateness
- Monitor Backpressure: Alert before it causes cascading failures
- Test Failure Modes: Chaos testing for resilience
- Optimize State: Keep state size manageable, use RocksDB for large state
- Exactly-Once When Needed: At-least-once is simpler and often sufficient
- Cost Optimization: Right-size clusters, use auto-scaling
Common Pitfalls
- Over-Engineering: Building real-time when batch would suffice
- Ignoring Backpressure: Letting backpressure cascade to failures
- No Schema Evolution: Breaking consumers with schema changes
- Unbounded State: State growing forever, causing OOM
- Inadequate Monitoring: Can't detect or debug issues
- No Failure Testing: Surprised when things break in production
- Wrong Time Semantics: Using processing time when event time needed
- Expensive Joins: Large-state joins that don't scale