Part 3: Data Foundations

Chapter 18: Real-Time Streaming Data

Overview

Streaming systems process data continuously as it arrives, unlocking low-latency personalization, alerting, analytics, and real-time decision-making for AI applications. This chapter covers the architecture, patterns, and best practices for building reliable streaming data pipelines that power real-time AI.

Why It Matters

Streaming amplifies both value and risk: the same low latency that enables real-time features leaves little room for error. Disciplined controls on schemas, state, and backpressure prevent outages and surprises.

Value of Real-Time Processing:

  • Fraud Detection: Detect and block fraudulent transactions in milliseconds
  • Personalization: Update recommendations based on immediate user behavior
  • Alerting: Trigger notifications for critical events (system failures, security threats)
  • Monitoring: Real-time dashboards for operations and business metrics
  • Dynamic Pricing: Adjust prices based on real-time demand and inventory

Risks of Streaming:

  • Complexity: Significantly more complex than batch processing
  • Failure Modes: Data loss, duplicate processing, out-of-order events
  • Cost: Can be expensive if not properly optimized
  • Debugging: Harder to debug and test than batch pipelines
  • State Management: Managing state at scale is challenging

Real-world impact:

  • An e-commerce company increased conversion by 12% with real-time personalization
  • A bank prevented $50M in fraud with real-time transaction monitoring
  • Streaming infrastructure costs were cut by 80% after proper optimization
  • An ad-tech company processes 10M events/sec for real-time bidding

Streaming vs. Batch Processing

| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Process complete datasets | Continuous, unbounded streams |
| Complexity | Lower | Higher |
| Cost | Lower (process once) | Higher (always running) |
| Use Cases | Reports, ML training, ETL | Fraud detection, real-time dashboards, alerting |
| Fault Tolerance | Restart from checkpoint | Requires careful state management |
| Testing | Easier (deterministic) | Harder (time-dependent, stateful) |
| Tools | Spark, Airflow, dbt | Kafka, Flink, Spark Streaming |

Streaming Architecture Patterns

End-to-End Streaming Platform

```mermaid
graph LR
    subgraph "Data Sources"
        S1[Application Events]
        S2[Database CDC]
        S3[IoT Devices]
        S4[API Streams]
    end
    subgraph "Message Broker"
        MB1[Kafka/Pulsar]
        MB2[Topic Partitions]
        MB3[Schema Registry]
    end
    subgraph "Stream Processing"
        SP1[Filtering]
        SP2[Windowing]
        SP3[Aggregation]
        SP4[Stateful Joins]
    end
    subgraph "State Store"
        SS1[RocksDB/Redis]
        SS2[Checkpoints]
        SS3[Recovery]
    end
    subgraph "Sinks"
        SK1[Feature Store<br/>Redis]
        SK2[Analytics DB<br/>ClickHouse]
        SK3[Alerts<br/>PagerDuty]
        SK4[ML Inference]
    end
    S1 & S2 & S3 & S4 --> MB1
    MB1 --> MB2 --> MB3
    MB3 --> SP1 --> SP2 --> SP3 --> SP4
    SP4 <-.-> SS1 & SS2 & SS3
    SP4 --> SK1 & SK2 & SK3 & SK4
    style MB1 fill:#bbf,stroke:#333,stroke-width:2px
    style SS1 fill:#f96,stroke:#333,stroke-width:2px
```
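As a concrete starting point, the sketch below shows the left edge of this diagram: an application publishing click events to the message broker. It is a minimal example, assuming a local Kafka broker, the confluent-kafka Python client, and a hypothetical user-events topic keyed by user ID so each user's events stay in order within one partition.

```python
# Minimal event-producer sketch (assumes a local Kafka broker on localhost:9092,
# the confluent-kafka client, and a hypothetical "user-events" topic).
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to surface broker-side failures.
    if err is not None:
        print(f"delivery failed: {err}")

event = {
    "user_id": "u-123",
    "event_type": "product_view",
    "product_id": "p-456",
    "event_time": time.time(),   # event time, stamped at the source
}

# Keying by user_id keeps each user's events ordered within one partition.
producer.produce(
    "user-events",
    key=event["user_id"],
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()
```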

Event Time vs. Processing Time

```mermaid
graph TB
    subgraph "Event Generation"
        E1[Event Occurs<br/>Event Time: T0]
        E2[Event Sent<br/>Ingestion Time: T1]
    end
    subgraph "Processing"
        P1[Event Received<br/>Processing Time: T2]
        P2[Watermark Check]
        P3{Event Time<br/>< Watermark?}
    end
    subgraph "Windowing"
        W1[Assign to Window]
        W2[Late Event Handling]
        W3[Window Complete]
    end
    E1 --> E2 --> P1
    P1 --> P2 --> P3
    P3 -->|On Time| W1 --> W3
    P3 -->|Late| W2
    style P3 fill:#fff3cd,stroke:#333,stroke-width:2px
    style W2 fill:#f96,stroke:#333,stroke-width:2px
```
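The watermark check above can be captured in a few lines: the watermark trails the maximum event time seen so far by an allowed lateness, and events whose timestamps fall behind it are routed to late-event handling. This is a framework-agnostic sketch; the 30-second bound is an illustrative assumption.

```python
# Framework-agnostic watermark sketch: the watermark trails the largest
# event time seen so far by an allowed lateness (30 s here, an assumption).
ALLOWED_LATENESS_S = 30.0

class WatermarkTracker:
    def __init__(self, allowed_lateness_s: float):
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> bool:
        """Return True if the event is on time, False if it is late."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness_s
        return event_time >= watermark

tracker = WatermarkTracker(ALLOWED_LATENESS_S)
for event_time in [100.0, 105.0, 140.0, 95.0]:   # 95.0 arrives after 140.0
    if tracker.observe(event_time):
        print(f"{event_time}: assign to window")
    else:
        print(f"{event_time}: late, route to late-event handling")
```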

Windowing Patterns

```mermaid
graph TB
    subgraph "Tumbling Window"
        T1[Window 1<br/>00:00-00:05] --> T2[Window 2<br/>00:05-00:10]
        T2 --> T3[Window 3<br/>00:10-00:15]
        T4[Non-overlapping<br/>Fixed Size]
    end
    subgraph "Sliding Window"
        S1[Window 1<br/>00:00-00:10] --> S2[Window 2<br/>00:01-00:11]
        S2 --> S3[Window 3<br/>00:02-00:12]
        S4[Overlapping<br/>Moving Average]
    end
    subgraph "Session Window"
        SS1[User Activity] --> SS2[30s Gap] --> SS3[New Session]
        SS4[Variable Size<br/>Gap-based]
    end
    style T4 fill:#bbf,stroke:#333,stroke-width:2px
    style S4 fill:#9f9,stroke:#333,stroke-width:2px
    style SS4 fill:#f96,stroke:#333,stroke-width:2px
```
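Window assignment itself is just timestamp arithmetic. The sketch below shows how a single event time maps to exactly one tumbling window and to several overlapping sliding windows; the 5-minute and 10-minute/1-minute sizes mirror the diagram and are assumptions, not requirements.

```python
# Window-assignment sketch: a tumbling window maps an event to exactly one
# window; a sliding window maps it to size/step overlapping windows.
TUMBLE_SIZE_S = 300   # 5-minute tumbling windows (assumed)
SLIDE_SIZE_S = 600    # 10-minute sliding windows ...
SLIDE_STEP_S = 60     # ... advancing every minute (assumed)

def tumbling_window(event_time: float, size: int = TUMBLE_SIZE_S) -> tuple:
    start = int(event_time // size) * size
    return (start, start + size)

def sliding_windows(event_time: float, size: int = SLIDE_SIZE_S,
                    step: int = SLIDE_STEP_S) -> list:
    # Earliest window that still contains the event, then step forward.
    first_start = int((event_time - size) // step + 1) * step
    return [(s, s + size)
            for s in range(max(first_start, 0), int(event_time) + 1, step)
            if s <= event_time < s + size]

print(tumbling_window(905.0))    # (900, 1200): exactly one window
print(sliding_windows(905.0))    # ten overlapping windows containing t=905
```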

Stateful Processing Architecture

```mermaid
graph LR
    subgraph "Event Stream"
        ES1[Event 1] --> ES2[Event 2] --> ES3[Event 3]
    end
    subgraph "Stateful Operator"
        SO1[Read State]
        SO2[Process Event]
        SO3[Update State]
        SO4[Emit Result]
    end
    subgraph "State Backend"
        SB1[In-Memory State]
        SB2[RocksDB Disk]
        SB3[Periodic Checkpoint]
    end
    ES1 --> SO1
    SO1 --> SO2 --> SO3 --> SO4
    SO1 <--> SB1
    SB1 <--> SB2
    SB2 --> SB3
    style SO3 fill:#f96,stroke:#333,stroke-width:2px
    style SB3 fill:#bbf,stroke:#333,stroke-width:2px
```
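As a minimal illustration of the read-state / update-state / checkpoint loop above, the sketch below keeps per-key counts in an in-memory dict and periodically snapshots it to disk. A real engine such as Flink does this with RocksDB and distributed checkpoints, so treat this purely as a mental model.

```python
# Toy stateful operator: per-key running counts with periodic checkpoints.
# A real engine (e.g. Flink + RocksDB) does this durably and distributed;
# this only models the read/update/checkpoint loop.
import json
import os
import tempfile

class CountingOperator:
    def __init__(self, checkpoint_path: str, checkpoint_every: int = 100):
        self.checkpoint_path = checkpoint_path
        self.checkpoint_every = checkpoint_every
        self.state = self._restore()
        self.events_since_checkpoint = 0

    def _restore(self) -> dict:
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)          # recover after a crash/restart
        return {}

    def process(self, key: str) -> int:
        self.state[key] = self.state.get(key, 0) + 1   # read + update state
        self.events_since_checkpoint += 1
        if self.events_since_checkpoint >= self.checkpoint_every:
            self._checkpoint()
        return self.state[key]                          # emit result

    def _checkpoint(self):
        # Write atomically (temp file + rename) so a crash never leaves
        # a half-written snapshot behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.checkpoint_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.checkpoint_path)
        self.events_since_checkpoint = 0

op = CountingOperator("counts.ckpt", checkpoint_every=2)
for user in ["u1", "u2", "u1"]:
    print(user, op.process(user))
```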

Technology Comparison

Stream Processing Frameworks

| Framework | Best For | Strengths | Limitations | Latency | When to Use |
|---|---|---|---|---|---|
| Apache Flink | Complex stateful processing | True streaming, exactly-once, mature | Steep learning curve | < 100ms | Choose for complex event processing |
| Kafka Streams | Kafka-native applications | Simple, library (not cluster), Kafka integration | Kafka-only, limited ops features | < 500ms | Choose for Kafka-centric stacks |
| Spark Streaming | Batch + streaming hybrid | Unified API, existing Spark expertise | Micro-batch (not true streaming) | 1-5 seconds | Choose for batch/stream hybrid |
| AWS Kinesis | AWS-native streaming | Managed, serverless, simple | AWS lock-in, limited processing | < 1 second | Choose for AWS serverless |
| Google Dataflow | GCP-native streaming | Managed, Apache Beam API | GCP lock-in, can be expensive | < 1 second | Choose for GCP environments |

Message Broker Comparison

| Broker | Best For | Strengths | Limitations | Throughput | Decision Factor |
|---|---|---|---|---|---|
| Apache Kafka | High-throughput event streaming | Battle-tested, ecosystem, scalable | Complex ops, Java-centric | 1M+ msg/sec | Choose for event streaming |
| Apache Pulsar | Multi-tenancy, geo-replication | Cloud-native, unified queue+stream | Smaller ecosystem, newer | 500K+ msg/sec | Choose for multi-region |
| AWS Kinesis | AWS-native, managed | Fully managed, simple | AWS lock-in, throughput limits | 1MB/sec per shard | Choose for AWS managed |
| Redis Streams | Simple streaming, caching | Fast, simple, combined with cache | Not designed for high volume | 100K msg/sec | Choose for simple use cases |
| RabbitMQ | Traditional messaging | AMQP support, flexible routing | Lower throughput vs. Kafka | 50K msg/sec | Choose for messaging patterns |
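For the "simple use cases" row, Redis Streams can be tried in a few lines with the redis-py client; the sketch below assumes a local Redis instance and a hypothetical clicks stream.

```python
# Redis Streams sketch (assumes a local Redis server and the redis-py client).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Produce: XADD appends an entry and returns its auto-generated ID.
entry_id = r.xadd("clicks", {"user_id": "u-123", "page": "/home"})

# Consume: XREAD blocks up to 1 s for entries newer than ID 0 (i.e. all of them).
for stream, entries in r.xread({"clicks": "0"}, block=1000, count=10):
    for eid, fields in entries:
        print(stream, eid, fields)
```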

Backpressure Management Strategies

Backpressure Handling Patterns

```mermaid
graph TB
    subgraph "Detection"
        D1[Monitor Buffer Usage]
        D2[Track Processing Lag]
        D3[Measure Throughput]
    end
    subgraph "Response Strategies"
        R1[Scale Up Resources]
        R2[Apply Rate Limiting]
        R3[Drop Low-Priority Events]
        R4[Buffer & Batch]
    end
    subgraph "Prevention"
        P1[Capacity Planning]
        P2[Auto-scaling Rules]
        P3[Circuit Breakers]
    end
    D1 & D2 & D3 --> R1 & R2 & R3 & R4
    R1 & R2 & R3 & R4 --> P1 & P2 & P3
    style D2 fill:#fff3cd,stroke:#333,stroke-width:2px
    style R1 fill:#bbf,stroke:#333,stroke-width:2px
```

Backpressure Strategies Comparison

| Strategy | When to Use | Pros | Cons | Implementation Complexity |
|---|---|---|---|---|
| Scale Up | Resources constrained, budget available | Increases throughput, no data loss | Costs money, may not solve root cause | Low |
| Drop Events | Some data loss acceptable | Prevents cascading failures, simple | Loses data, may violate SLAs | Low |
| Buffer & Batch | Temporary spikes, smooth workload | Smooths load, no data loss | Increases latency, memory pressure | Medium |
| Sampling | Approximation acceptable | Reduces load significantly | Less accurate, may miss important events | Medium |
| Load Shedding | Prioritize critical events | Keeps system alive, preserves critical path | Loses non-critical data | High |
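Several of these strategies reduce to the same primitive: a bounded buffer plus a policy for what happens when it fills. The sketch below is a framework-agnostic illustration of "drop events" versus priority-based load shedding; the capacity and priority threshold are assumptions chosen for the example.

```python
# Bounded-buffer sketch illustrating two backpressure policies:
# "drop events" (reject everything when full) vs. "load shedding"
# (when full, admit only events at or above a priority threshold).
from collections import deque
from typing import Optional

class BoundedBuffer:
    def __init__(self, capacity: int, shed_below_priority: Optional[int] = None):
        self.capacity = capacity
        self.shed_below_priority = shed_below_priority
        self.buffer = deque()
        self.dropped = 0

    def offer(self, event: dict) -> bool:
        if len(self.buffer) < self.capacity:
            self.buffer.append(event)
            return True
        # Buffer full: apply the configured policy.
        if (self.shed_below_priority is not None
                and event.get("priority", 0) >= self.shed_below_priority):
            self.buffer.popleft()            # shed oldest to admit a critical event
            self.buffer.append(event)
            return True
        self.dropped += 1                    # plain drop
        return False

buf = BoundedBuffer(capacity=2, shed_below_priority=5)
for e in [{"priority": 1}, {"priority": 1}, {"priority": 1}, {"priority": 9}]:
    buf.offer(e)
print(len(buf.buffer), "buffered,", buf.dropped, "dropped")   # 2 buffered, 1 dropped
```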

Exactly-Once Processing Guarantees

Exactly-Once Architecture

```mermaid
graph LR
    subgraph "Source"
        SR1[Kafka Consumer]
        SR2[Track Offsets]
        SR3[Commit After Checkpoint]
    end
    subgraph "Processing"
        PR1[Stateful Operations]
        PR2[Periodic Checkpoints]
        PR3[State Snapshots]
    end
    subgraph "Sink"
        SK1[Two-Phase Commit]
        SK2[Pre-commit Write]
        SK3[Commit After Checkpoint]
    end
    SR1 --> SR2 --> PR1
    PR1 --> PR2 --> PR3
    PR3 --> SK1 --> SK2 --> SK3
    SK3 --> SR3
    style PR2 fill:#f96,stroke:#333,stroke-width:2px
    style SK1 fill:#bbf,stroke:#333,stroke-width:2px
```

Processing Guarantee Comparison

| Guarantee | How It Works | Use When | Trade-offs | Performance Impact |
|---|---|---|---|---|
| At-Most-Once | No retries, may lose data | Non-critical metrics, high throughput needed | Simple, fast, data loss OK | None |
| At-Least-Once | Retry on failure, may duplicate | Most use cases, idempotent processing | Simple, no data loss, duplicates possible | Low (~5%) |
| Exactly-Once | Checkpoints + 2-phase commit | Financial transactions, compliance | Complex, no data loss, no duplicates | Medium (~20%) |
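When exactly-once really is required and both source and sink are Kafka, a transactional consume-transform-produce loop is one way to get it with the confluent-kafka client: consumed offsets are committed inside the same transaction as the output records, so after a crash the batch is either replayed or skipped as a unit. The sketch below is a simplified illustration; topic names and the transactional ID are placeholders.

```python
# Exactly-once consume-transform-produce sketch using Kafka transactions
# (confluent-kafka client; topic names and transactional.id are placeholders).
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "payments-processor",
    "enable.auto.commit": False,          # offsets are committed via the transaction
    "isolation.level": "read_committed",  # never read aborted transactions
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments-raw"])

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-processor-1",
})
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    try:
        payment = json.loads(msg.value())
        payment["validated"] = True
        producer.produce("payments-validated", value=json.dumps(payment))
        # Commit the consumed offsets inside the same transaction as the output.
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()   # nothing committed; reprocessed after restart/rebalance
```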

Real-World Case Study: E-commerce Real-Time Personalization

Challenge

An e-commerce company wanted to personalize its homepage in real time based on user behavior. The existing batch system refreshed recommendations only once a day, leaving them stale.

Solution Architecture

Technology Stack:

  • Source: User events (clicks, views, searches) → Kafka (24 partitions)
  • Processing: Apache Flink for windowing and feature computation
  • State: RocksDB for user session state (2GB per task)
  • Sink: Redis for real-time feature serving (sub-ms reads)
  • ML: Inference service reads features from Redis (<50ms total)

Implementation Timeline:

| Phase | Duration | Activities | Deliverables |
|---|---|---|---|
| Event Collection | Week 1 | Instrument apps, set up Kafka, schema registry | Event pipeline |
| Stream Processing | Weeks 2-3 | Flink jobs, windowing logic, feature engineering | Feature computation |
| Feature Serving | Week 4 | Redis deployment, ML integration, API | Real-time serving |
| Monitoring | Week 5 | Dashboards, alerts, chaos testing | Observability |

Feature Computation:

  • 5-minute tumbling windows per user
  • Features: last 5 products viewed, categories browsed, time on site, search queries, cart adds
  • Sliding window for 30-minute moving averages
  • Session window for user session analytics (30-second gap)
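A minimal sketch of this per-user feature computation, assuming the same 5-minute tumbling windows and a Redis feature store keyed per user. The key layout, TTL, and field names are illustrative; in the case study this logic ran inside a Flink job rather than plain Python.

```python
# Per-user 5-minute tumbling-window features written to Redis (sketch).
# Key layout, TTL, and field names are illustrative assumptions.
from collections import defaultdict

import redis

WINDOW_S = 300
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# (user_id, window_start) -> accumulated features
windows = defaultdict(lambda: {"views": 0, "cart_adds": 0, "last_products": []})

def process(event: dict) -> None:
    window_start = int(event["event_time"] // WINDOW_S) * WINDOW_S
    feats = windows[(event["user_id"], window_start)]
    if event["event_type"] == "product_view":
        feats["views"] += 1
        feats["last_products"] = (feats["last_products"] + [event["product_id"]])[-5:]
    elif event["event_type"] == "cart_add":
        feats["cart_adds"] += 1

def flush_window(user_id: str, window_start: int) -> None:
    # Called when the window closes (watermark passes window end).
    feats = windows.pop((user_id, window_start))
    key = f"features:{user_id}"
    r.hset(key, mapping={
        "views_5m": feats["views"],
        "cart_adds_5m": feats["cart_adds"],
        "last_products": ",".join(feats["last_products"]),
    })
    r.expire(key, 3600)   # stale feature rows age out
```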

Results After 6 Months

| Metric | Before | After | Improvement |
|---|---|---|---|
| Personalization latency | 24 hours | 5 minutes | 99.7% reduction |
| Conversion rate | Baseline | +12% | Real-time relevance |
| Infrastructure costs | N/A | $15K/month | Kafka + Flink + Redis |
| System reliability | N/A | 99.95% uptime | Robust architecture |
| Processing capacity | N/A | 50K events/sec | Room to scale to 200K |
| Feature freshness | Daily batch | < 5 minutes | Real-time |

Key Success Factors

  1. Gradual Rollout: Started with 1% traffic, scaled to 100% over 6 weeks
  2. Comprehensive Monitoring: 40+ metrics tracked from day one
  3. Fallback to Batch: Automatic fallback if streaming pipeline fails
  4. Extensive Testing: Chaos engineering, load testing, failure injection
  5. Clear Runbooks: Documented procedures for 15 common failure scenarios

Monitoring & Observability

Streaming Metrics Dashboard

| Metric Category | Key Metrics | Alert Threshold | Action |
|---|---|---|---|
| Lag | Processing lag, consumer lag | > 1 minute | Scale up, investigate bottleneck |
| Throughput | Events/second, bytes/second | < 80% of expected | Check upstream, alert stakeholders |
| Errors | Failed events, exceptions | > 1% | Inspect logs, fix bugs |
| Latency | End-to-end latency, window latency | > 2x baseline | Optimize processing, scale resources |
| Backpressure | Buffer utilization, backpressure time | > 10% | Increase parallelism, optimize code |
| State | State size, checkpoint duration | Approaching limits | Archive old state, tune checkpoints |
| Resources | CPU, memory, network | > 80% utilization | Scale horizontally |
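The lag row is usually the first alert to wire up. The sketch below computes consumer lag per partition as high watermark minus committed offset using the confluent-kafka client; the group ID, topic name, and alert threshold are placeholders.

```python
# Consumer-lag sketch: lag = log-end offset (high watermark) - committed offset.
# Group and topic names are placeholders; confluent-kafka client assumed.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "lag-checker",
    "enable.auto.commit": False,
})

topic = "user-events"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    current = tp.offset if tp.offset >= 0 else low   # no commit yet -> assume start
    lag = high - current
    print(f"partition {tp.partition}: lag={lag}")
    if lag > 10_000:                                 # illustrative threshold
        print("  ALERT: consumer lag exceeds threshold")

consumer.close()
```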

Implementation Checklist

Design Phase (Week 1)

□ Define streaming use case and requirements
□ Identify data sources and event schema
□ Choose streaming framework (Flink/Kafka Streams/Spark)
□ Design windowing and aggregation logic
□ Plan state management strategy
□ Define exactly-once vs. at-least-once requirements

Development Phase (Week 2-3)

□ Set up Kafka topics with appropriate partitions
□ Implement schema registry for event validation
□ Build stream processing jobs
□ Implement stateful operations with checkpoints
□ Add error handling and dead letter queues (see the sketch after this checklist)
□ Write unit tests for processing logic
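The dead-letter-queue item above can be as simple as a second topic that receives anything the processing function raises on, together with enough metadata to inspect and replay it later. A sketch with assumed topic names and a placeholder enrichment step:

```python
# Dead-letter-queue sketch: failed events go to a side topic with error
# metadata so they can be inspected and replayed (topic names assumed).
import json
import traceback

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

def enrich(event: dict) -> dict:
    # Placeholder business logic that may raise on malformed events.
    return {**event, "country": event["geo"]["country"]}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        result = enrich(json.loads(msg.value()))
        producer.produce("user-events-enriched", value=json.dumps(result))
    except Exception as exc:
        producer.produce("user-events-dlq", value=json.dumps({
            "original_value": msg.value().decode("utf-8", "replace"),
            "error": repr(exc),
            "stacktrace": traceback.format_exc(),
            "source_partition": msg.partition(),
            "source_offset": msg.offset(),
        }))
    producer.poll(0)   # serve delivery callbacks without blocking
```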

Testing Phase (Week 4)

□ Integration test with production-like data
□ Load test with 2x expected throughput
□ Chaos test failure scenarios
□ Validate exactly-once semantics
□ Test backpressure handling
□ Verify state recovery after failures

Deployment Phase (Week 5)

□ Deploy to production with 1% traffic
□ Set up monitoring dashboards (Grafana)
□ Configure alerting (PagerDuty)
□ Document runbooks for common issues
□ Gradually increase traffic to 100%
□ Conduct post-deployment review

Operations (Ongoing)

□ Monitor lag and throughput daily
□ Review error rates and investigate anomalies
□ Optimize costs monthly (right-size cluster)
□ Update schemas with backward compatibility
□ Quarterly disaster recovery drills
□ Annual architecture review

Best Practices

  1. Start Simple: Begin with batch, add streaming only when latency matters
  2. Schema Management: Use schema registry, enforce compatibility
  3. Handle Late Data: Configure watermarks and allowed lateness
  4. Monitor Backpressure: Alert before it causes cascading failures
  5. Test Failure Modes: Chaos testing for resilience
  6. Optimize State: Keep state size manageable, use RocksDB for large state
  7. Exactly-Once When Needed: At-least-once is simpler and often sufficient
  8. Cost Optimization: Right-size clusters, use auto-scaling

Common Pitfalls

  1. Over-Engineering: Building real-time when batch would suffice
  2. Ignoring Backpressure: Letting backpressure cascade to failures
  3. No Schema Evolution: Breaking consumers with schema changes
  4. Unbounded State: State growing forever, causing OOM
  5. Inadequate Monitoring: Can't detect or debug issues
  6. No Failure Testing: Surprised when things break in production
  7. Wrong Time Semantics: Using processing time when event time needed
  8. Expensive Joins: Large-state joins that don't scale