Final Project 3: Real-Time Analytics Engine
Problem Statement
You're the lead engineer at a company that provides real-time analytics for IoT and monitoring platforms. Your customers include:
- Smart City: 100,000 traffic sensors sending location data every second
- E-commerce: Real-time fraud detection analyzing millions of transactions
- Gaming Company: Live player analytics for 10 million concurrent users
- FinTech: Real-time trading analytics processing stock market feeds
Current Pain Points with Batch Analytics
- Latency: Results available in minutes or hours, not seconds
- Stale Data: Dashboard shows data from 5 minutes ago
- Alert Delays: Fraud detected 30 minutes after transaction completes
- Resource Waste: Computing aggregations on all historical data repeatedly
- Complex Setup: Requires data engineering team to build and maintain pipelines
The Business Challenge
Build a real-time analytics platform that processes streaming data with sub-second latency:
Customer Requirements:
- Ingest 1M+ events per second with <100ms latency
- Support complex windowing (tumbling, sliding, and session windows)
- Complex Event Processing for pattern matching
- Real-time aggregations updated every second
- WebSocket dashboards showing live metrics
- SQL-like query language for easy analytics
- Horizontal scaling
- Fault tolerance
- Time-series optimization
Real-World Scenario
A retail company wants real-time fraud detection:
```sql
-- Detect suspicious pattern: User makes 5+ purchases in different cities within 1 hour
SELECT
    user_id,
    COUNT(*) as purchase_count,
    COUNT(DISTINCT city) as city_count,
    COLLECT_LIST(city) as cities
FROM purchases
GROUP BY
    user_id,
    HOP(timestamp, INTERVAL '1' HOUR, INTERVAL '5' MINUTE) -- Sliding window
HAVING
    COUNT(*) >= 5 AND
    COUNT(DISTINCT city) >= 3
-- Real-time alert when pattern matches
EMIT CHANGES;
```
When a pattern matches, the system:
- Triggers real-time alert to fraud team
- Updates live dashboard showing fraud attempts per minute
- Stores pattern match in time-series database for historical analysis
- Sends webhook to payment processor to block card
This requires:
- Stream processing: Continuous query execution on unbounded data
- Windowing: Group events by time intervals
- Aggregation: COUNT, DISTINCT across windows
- CEP: Pattern matching
- Real-time output: WebSocket updates to dashboard
- Low latency: <1 second from event ingestion to alert
Requirements
Functional Requirements
Must Have:
- Stream Ingestion: Consume from Kafka, NATS, HTTP endpoints, WebSockets
- Windowing Functions: Tumbling, sliding, and session windows
- Aggregations: COUNT, SUM, AVG, MIN, MAX, percentiles
- Complex Event Processing: Pattern matching across event streams
- SQL-Like Query Language: Familiar syntax for analysts
- Real-Time Joins: Stream-to-stream and stream-to-table joins
- State Management: Windowed state with fault tolerance
- Time-Series Storage: Efficient storage and querying of historical data
- WebSocket Streaming: Push updates to dashboards in real-time
- Horizontal Scaling: Partition streams across multiple workers
- Exactly-Once Semantics: No duplicate or lost events
- Late Event Handling: Handle out-of-order events with watermarks
- Metrics & Monitoring: Track throughput, latency, backpressure
- REST API: Query historical data and stream metadata
- Real-Time Dashboards: Web UI showing live metrics
Should Have:
- User-Defined Functions: Custom transformations in Go
- Materialized Views: Pre-computed aggregations for fast queries
- Backpressure Handling: Slow down ingestion when consumers can't keep up
- Multi-Tenancy: Isolated streams per customer
- Schema Registry: Track event schemas with evolution
- Alerting: Trigger webhooks, emails, Slack notifications
- Data Enrichment: Join streams with reference data
- Sampling: Process a subset of data for large-scale analytics
Nice to Have:
- Machine Learning Integration: Anomaly detection with online learning
- Graph Analytics: Track relationships between entities
- Geo-Spatial Queries: Process location data with geofencing
- Multi-Language UDFs: JavaScript and Python UDFs via WASM
- Query Federation: Join data from external databases
Non-Functional Requirements
Performance:
- Ingestion: 1M+ events/sec per node
- Latency: End-to-end <100ms target, <200ms maximum
- Query Latency: Sub-second for real-time queries
- Throughput: 100k+ queries/sec for dashboard updates
Scalability:
- Horizontal Scaling: Linear throughput increase with more nodes
- Stream Partitioning: 1000+ partitions per topic
- Concurrent Queries: 10,000+ simultaneous queries
- Data Retention: 1 year+ of time-series data
Reliability:
- Availability: 99.9% uptime
- Fault Tolerance: Survive node failures with <5 second recovery
- Exactly-Once: Zero duplicate events in aggregations
- Data Durability: Replicate state to prevent data loss
Consistency:
- Event Time Processing: Process events by their timestamp, not arrival time
- Watermarks: Handle late events up to 1 hour delay
- Out-of-Order Handling: Recompute windows when late events arrive
Constraints
Technical Constraints:
- Language: Go for high performance and low GC overhead
- Message Queue: Kafka or NATS Streaming for durability
- Time-Series DB: InfluxDB, TimescaleDB, or custom storage
- WebSocket Library: gorilla/websocket for real-time updates
- Query Language: SQL-like syntax
Resource Constraints:
- Memory: Assume 32 GB RAM per node
- CPU: Multi-core servers
- Network: 10 Gbps for inter-node communication
Business Constraints:
- Development Time: 10 weeks for MVP
- Team Size: 2-3 developers
- Learning Curve: Must be usable by analysts with SQL knowledge
Design Considerations
Stream Processing Model
Choose between micro-batching and true streaming:
Micro-Batching:
- Simpler to implement with batch processing optimizations
- Higher throughput but latency bounded by batch interval
- Easier exactly-once semantics
True Streaming:
- Ultra-low latency processing one event at a time
- More complex implementation
- Better for sub-second analytics requirements
Recommendation: Use adaptive micro-batching that adjusts batch size based on load to balance throughput and latency while targeting <100ms end-to-end latency.
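As a concrete illustration of one possible adaptation policy, here is a minimal Go sketch (the `Event` type and the downstream `flush` stage are hypothetical, not part of any real package): batches are emitted on a size or time trigger, and the target batch size shrinks when a flush exceeds its latency budget and grows when flushes finish quickly.

```go
package batcher

import "time"

// Event is a placeholder for the engine's event type.
type Event struct {
	Key   string
	Value map[string]interface{}
	Time  time.Time
}

// AdaptiveBatcher groups events into micro-batches, flushing on size or timeout,
// and adjusts the target batch size to keep flush latency under a budget.
type AdaptiveBatcher struct {
	target   int           // current target batch size
	min, max int           // bounds on the target
	budget   time.Duration // per-flush latency budget
	buf      []Event
	flush    func([]Event) // downstream processing stage
}

func NewAdaptiveBatcher(flush func([]Event)) *AdaptiveBatcher {
	return &AdaptiveBatcher{target: 1000, min: 100, max: 100000, budget: 50 * time.Millisecond, flush: flush}
}

// Run consumes events until the channel closes, emitting a batch whenever the
// target size is reached or the ticker fires, whichever comes first.
func (b *AdaptiveBatcher) Run(in <-chan Event) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case e, ok := <-in:
			if !ok {
				b.emit()
				return
			}
			b.buf = append(b.buf, e)
			if len(b.buf) >= b.target {
				b.emit()
			}
		case <-ticker.C:
			b.emit()
		}
	}
}

func (b *AdaptiveBatcher) emit() {
	if len(b.buf) == 0 {
		return
	}
	start := time.Now()
	b.flush(b.buf)
	b.buf = b.buf[:0]

	// Adapt: shrink the batch when flushes exceed the budget, grow when they are fast.
	if elapsed := time.Since(start); elapsed > b.budget && b.target > b.min {
		b.target /= 2
	} else if elapsed < b.budget/2 && b.target < b.max {
		b.target *= 2
	}
}
```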
Windowing Strategy
Choose between processing-time windows and event-time windows:
Processing Time:
- Simpler implementation
- No late events or watermarks needed
- Non-deterministic results if events arrive out of order
Event Time:
- Correct results for analytics
- Handles out-of-order events
- Requires watermarking to manage memory
Recommendation: Use event-time windowing with watermarks to ensure accurate analytics. Set allowed lateness to bound memory usage while handling most out-of-order events.
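A minimal sketch of the event-time approach, assuming a hypothetical `Event` type: the watermark trails the maximum observed event time by the allowed lateness, tumbling windows are assigned by truncating the event timestamp, and events whose window already closed relative to the watermark are flagged as late so they can be dropped or routed to a side output.

```go
package window

import "time"

// Event is a placeholder carrying the producer-assigned event time.
type Event struct {
	Key  string
	Time time.Time
}

// WatermarkTracker emits a watermark that trails the maximum observed event
// time by a fixed allowed lateness, bounding how long window state is retained.
type WatermarkTracker struct {
	maxEventTime    time.Time
	allowedLateness time.Duration
}

func (w *WatermarkTracker) Observe(e Event) {
	if e.Time.After(w.maxEventTime) {
		w.maxEventTime = e.Time
	}
}

// Watermark asserts that no event older than this timestamp is expected.
func (w *WatermarkTracker) Watermark() time.Time {
	return w.maxEventTime.Add(-w.allowedLateness)
}

// TumblingWindowStart maps an event time onto the start of its tumbling window.
func TumblingWindowStart(t time.Time, size time.Duration) time.Time {
	return t.Truncate(size)
}

// Assign routes an event to its window, returning false if the event is too
// late (its window closed before the current watermark) and should be dropped
// or sent to a side output.
func Assign(e Event, size time.Duration, w *WatermarkTracker) (time.Time, bool) {
	w.Observe(e)
	start := TumblingWindowStart(e.Time, size)
	if start.Add(size).Before(w.Watermark()) {
		return start, false // late beyond allowed lateness
	}
	return start, true
}
```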
State Management
Choose between in-memory state and persistent state:
In-Memory:
- Extremely fast with no disk I/O
- Lost on node failures requiring full reprocessing
Persistent:
- Survives failures with incremental checkpointing
- Handles large state exceeding RAM
- Slower due to disk operations
Recommendation: Hybrid approach—keep hot state in memory with periodic checkpoints to RocksDB. On failure, restore from last checkpoint and replay message queue from recorded offsets.
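A simplified sketch of the checkpoint/restore flow (hypothetical types; a flat file written with `encoding/gob` stands in for an embedded store such as RocksDB): the checkpoint captures both the windowed state and the source offsets, so recovery can restore the state and replay the message queue from the recorded positions.

```go
package state

import (
	"encoding/gob"
	"os"
)

// Checkpoint bundles windowed aggregation state with the source offsets that
// produced it, so recovery can restore state and replay the queue from there.
type Checkpoint struct {
	WindowState map[string]float64 // e.g. running sums keyed by "window|key"
	Offsets     map[int32]int64    // Kafka partition -> next offset to consume
}

// Save writes the checkpoint atomically: serialize to a temp file, then rename.
// (A production build would checkpoint incrementally to an embedded store; a
// flat file keeps this sketch dependency-free.)
func Save(path string, cp Checkpoint) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if err := gob.NewEncoder(f).Encode(cp); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// Restore loads the last checkpoint; the caller seeks the consumer to
// cp.Offsets and replays from there to rebuild any state written since.
func Restore(path string) (Checkpoint, error) {
	var cp Checkpoint
	f, err := os.Open(path)
	if err != nil {
		return cp, err
	}
	defer f.Close()
	err = gob.NewDecoder(f).Decode(&cp)
	return cp, err
}
```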
Time-Series Storage
Choose between row-oriented storage and columnar storage:
Row-Oriented:
- Simple schema and fast writes
- Slow analytical queries requiring full table scans
Columnar:
- Fast analytical queries scanning only needed columns
- High compression ratios
- Vectorized execution for aggregations
Recommendation: Use columnar storage for analytics workloads with monthly partitioning and clustering by user/time for optimal query performance.
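To make the columnar trade-off concrete, here is an illustrative Go sketch (types and methods are hypothetical, not a real storage API) of a struct-of-arrays write buffer: an aggregation such as SUM(amount) scans a single contiguous slice instead of deserializing whole rows, which is the same property a columnar backend exploits.

```go
package tsstore

import "time"

// PurchaseColumns holds a batch of purchase events in a struct-of-arrays
// (columnar) layout before it is flushed to the time-series backend.
type PurchaseColumns struct {
	Timestamps []int64 // unix nanoseconds
	UserIDs    []int64
	Amounts    []float64
	Cities     []string
}

func (c *PurchaseColumns) Append(ts time.Time, userID int64, amount float64, city string) {
	c.Timestamps = append(c.Timestamps, ts.UnixNano())
	c.UserIDs = append(c.UserIDs, userID)
	c.Amounts = append(c.Amounts, amount)
	c.Cities = append(c.Cities, city)
}

// SumAmounts illustrates a column scan: only the Amounts slice is touched.
func (c *PurchaseColumns) SumAmounts() float64 {
	var sum float64
	for _, a := range c.Amounts {
		sum += a
	}
	return sum
}

// Flush would hand the batch to the columnar backend (e.g. an insert into a
// table partitioned by month); the wire format is backend-specific and omitted.
func (c *PurchaseColumns) Flush() {
	*c = PurchaseColumns{} // reset buffers after a (hypothetical) write
}
```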
Real-Time Broadcasting
Choose between direct WebSocket connections and pub/sub systems:
Direct WebSocket:
- Simple implementation with low latency
- Doesn't scale beyond a single server
- Connections are lost during server restarts
Pub/Sub:
- Scales horizontally across multiple web servers
- Handles reconnections gracefully
- Efficient fan-out to thousands of clients
Recommendation: Use Redis Pub/Sub to enable multiple web servers to push updates and handle client reconnections automatically.
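A rough sketch of the fan-out path, assuming go-redis (`github.com/redis/go-redis/v9`) alongside the gorilla/websocket library named above; the `Hub` type is illustrative, not part of either library. Each web server subscribes to the same Redis channel and rebroadcasts every payload to its local WebSocket clients, which keeps the web tier stateless and lets it scale behind a load balancer.

```go
package broadcast

import (
	"context"
	"sync"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

// Hub fans messages received from Redis out to every WebSocket client
// connected to this web server; other servers run their own Hub.
type Hub struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]struct{}
}

func NewHub() *Hub { return &Hub{conns: make(map[*websocket.Conn]struct{})} }

func (h *Hub) Add(c *websocket.Conn)    { h.mu.Lock(); h.conns[c] = struct{}{}; h.mu.Unlock() }
func (h *Hub) Remove(c *websocket.Conn) { h.mu.Lock(); delete(h.conns, c); h.mu.Unlock() }

// Run subscribes to the metrics channel and rebroadcasts each payload to all
// local WebSocket clients, dropping connections that fail to write.
func (h *Hub) Run(ctx context.Context, rdb *redis.Client) {
	sub := rdb.Subscribe(ctx, "dashboard:metrics")
	defer sub.Close()

	for msg := range sub.Channel() {
		h.mu.Lock()
		for c := range h.conns {
			if err := c.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
				c.Close()
				delete(h.conns, c)
			}
		}
		h.mu.Unlock()
	}
}
```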
Acceptance Criteria
The project is considered complete when it demonstrates:
Core Functionality:
- Ingests events from Kafka/NATS/HTTP at 1M+ events/second
- Implements tumbling, sliding, and session windows correctly
- Processes SQL-like queries with aggregations
- Detects complex event patterns across multiple streams
- Stores time-series data efficiently with sub-second query latency
- Pushes real-time updates via WebSocket to connected dashboards
- Handles exactly-once semantics with no data loss or duplication
Performance Requirements:
- End-to-end latency <100ms from ingestion to query result
- Scales linearly with added worker nodes
- Handles late events correctly using watermark strategies
- Memory usage remains bounded with proper window eviction
- Recovers from node failures in <30 seconds with state restoration
Production Readiness:
- Comprehensive unit tests for windowing and aggregation logic
- Integration tests validating end-to-end stream processing
- Load tests demonstrating 1M+ events/sec throughput
- Monitoring dashboard showing throughput, latency, and backpressure metrics
- Docker deployment configuration for local development
- Complete README with architecture diagrams and setup instructions
User Experience:
- Web dashboard displays live metrics updated in real-time
- SQL query interface allows ad-hoc analytics queries
- System handles graceful shutdowns without data loss
- Clear error messages and logging for debugging
- API documentation for integration with external systems
Usage Examples
Example 1: Real-Time Fraud Detection
Detect users making multiple purchases in different cities within a short time window:
```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/yourorg/analytics-engine/pkg/cep"
	"github.com/yourorg/analytics-engine/pkg/stream"
)

func main() {
	// Initialize stream processor
	processor := stream.NewProcessor(stream.Config{
		KafkaBrokers: []string{"localhost:9092"},
		Topic:        "purchases",
	})

	// Define fraud detection pattern
	pattern := &cep.Pattern{
		Name: "Multi-City Fraud Detection",
		Conditions: []cep.Condition{
			{
				EventType: "purchase",
				Predicate: func(e *stream.Event) bool {
					amount, _ := e.Value["amount"].(float64)
					return amount > 100 // Purchases over $100
				},
				MinCount: 5, // At least 5 purchases
			},
		},
		TimeWindow: 1 * time.Hour,
		Actions: []cep.Action{
			&cep.AlertAction{
				Name: "fraud-alert",
				Handler: func(matches []*stream.Event) {
					cities := make(map[string]bool)
					for _, event := range matches {
						city, _ := event.Value["city"].(string)
						cities[city] = true
					}

					// Alert if 3+ different cities
					if len(cities) >= 3 {
						userID := matches[0].Value["user_id"]
						fmt.Printf("FRAUD ALERT: User %v made %d purchases in %d cities\n",
							userID, len(matches), len(cities))

						// Send to webhook, Slack, etc.
						sendFraudAlert(userID, matches, cities)
					}
				},
			},
		},
	}

	// Start pattern matching
	matcher := cep.NewPatternMatcher(pattern)
	processor.AddMatcher(matcher)

	processor.Run(context.Background())
}

// sendFraudAlert is a placeholder for delivering the alert to the fraud
// team's webhook, Slack channel, or payment processor.
func sendFraudAlert(userID interface{}, matches []*stream.Event, cities map[string]bool) {}
```
Example 2: Real-Time Metrics Dashboard
Create live dashboard showing events per second and average latency:
```go
package main

import (
	"context"
	"time"

	"github.com/yourorg/analytics-engine/pkg/stream"
	"github.com/yourorg/analytics-engine/pkg/websocket"
	"github.com/yourorg/analytics-engine/pkg/window"
)

func main() {
	// Initialize WebSocket server
	wsServer := websocket.NewServer(":8081")
	go wsServer.Run()

	// Create windowing operator for 1-minute tumbling windows
	operator := window.NewOperator(window.Config{
		Type: window.Tumbling,
		Size: 1 * time.Minute,
		Aggregator: &window.MultiAggregator{
			Aggregators: map[string]window.Aggregator{
				"count":       &window.CountAggregator{},
				"avg_latency": &window.AvgAggregator{Field: "latency"},
				"p95_latency": &window.PercentileAggregator{
					Field:      "latency",
					Percentile: 0.95,
				},
			},
		},
	})

	// Process events and broadcast results
	processor := stream.NewProcessor(stream.Config{
		KafkaBrokers: []string{"localhost:9092"},
		Topic:        "api-requests",
	})

	processor.OnWindowClose(func(result window.Result) {
		// Broadcast metrics to all connected WebSocket clients
		wsServer.Broadcast(map[string]interface{}{
			"timestamp":      time.Now(),
			"events_per_sec": result.Aggregates["count"].(int) / 60,
			"avg_latency_ms": result.Aggregates["avg_latency"].(float64),
			"p95_latency_ms": result.Aggregates["p95_latency"].(float64),
		})
	})

	processor.AddOperator(operator)
	processor.Run(context.Background())
}
```
Example 3: SQL-Like Stream Queries
Execute continuous queries using SQL-like syntax:
```sql
-- Create stream from Kafka topic
CREATE STREAM purchases (
    user_id BIGINT,
    amount DECIMAL(10,2),
    city STRING,
    timestamp TIMESTAMP
) WITH (
    kafka_topic = 'purchases',
    value_format = 'JSON'
);

-- Real-time aggregation with tumbling window
SELECT
    city,
    COUNT(*) as purchase_count,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM purchases
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY city
EMIT CHANGES;

-- Complex pattern matching with session window
SELECT
    user_id,
    SESSION_START() as session_start,
    SESSION_END() as session_end,
    COUNT(*) as page_views,
    COLLECT_LIST(page_url) as visited_pages
FROM page_views
WINDOW SESSION (30 MINUTES)
GROUP BY user_id
HAVING COUNT(*) > 10
EMIT CHANGES;
```
Example 4: Client Integration
Connect to the analytics engine from external applications:
```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/gorilla/websocket"
)

func main() {
	// Connect to WebSocket endpoint
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8081/ws", nil)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Subscribe to real-time metrics
	subscription := map[string]interface{}{
		"action": "subscribe",
		"topics": []string{"dashboard:metrics", "alerts:fraud"},
	}

	if err := conn.WriteJSON(subscription); err != nil {
		panic(err)
	}

	// Receive real-time updates
	for {
		_, message, err := conn.ReadMessage()
		if err != nil {
			fmt.Printf("Read error: %v\n", err)
			break
		}

		var update map[string]interface{}
		if err := json.Unmarshal(message, &update); err != nil {
			continue
		}

		fmt.Printf("Real-time update: %+v\n", update)

		// Update your dashboard UI here
		updateDashboard(update)
	}
}

// updateDashboard is a placeholder for pushing the update into your UI layer.
func updateDashboard(update map[string]interface{}) {}
```
Key Takeaways
Technical Skills Mastered
Stream Processing Fundamentals:
- Designed high-throughput data pipelines processing 1M+ events/second
- Implemented windowing operators for time-based aggregations
- Built complex event processing engines for multi-event pattern detection
- Created watermark strategies to handle late and out-of-order events
- Achieved sub-100ms end-to-end latency for real-time analytics
Distributed Systems Architecture:
- Implemented horizontal scaling with stream partitioning across worker nodes
- Built fault-tolerant state management with checkpointing and recovery
- Achieved exactly-once processing semantics preventing data loss or duplication
- Designed backpressure mechanisms for flow control and system stability
- Created distributed query execution across multiple processing nodes
Data Engineering Excellence:
- Optimized time-series storage with columnar databases
- Implemented efficient aggregation algorithms for real-time computation
- Built SQL-like query engines for stream processing
- Created incremental aggregations updating with each new event
- Designed schema evolution and data enrichment pipelines
Real-Time Systems Design:
- Built WebSocket broadcasting for live dashboard updates
- Implemented Redis Pub/Sub for scalable message distribution
- Created low-latency query APIs with sub-second response times
- Designed alerting systems with real-time pattern detection
- Built monitoring and observability for stream processing metrics
Production Engineering Skills
Performance Optimization:
- Profiled and optimized hot paths for maximum throughput
- Implemented memory-efficient data structures for windowing
- Used adaptive micro-batching to balance latency and throughput
- Optimized serialization/deserialization for network efficiency
- Designed zero-copy data processing where possible
Testing and Quality Assurance:
- Created comprehensive unit tests for windowing and aggregation logic
- Built integration tests validating end-to-end stream processing
- Designed load tests demonstrating 1M+ events/sec capacity
- Implemented chaos testing for fault tolerance validation
- Created benchmarks measuring latency percentiles
Operations and Deployment:
- Containerized applications with Docker for reproducible deployments
- Created Kubernetes manifests for production orchestration
- Implemented health checks and graceful shutdown mechanisms
- Built comprehensive monitoring with Prometheus and Grafana
- Designed alert rules for system anomalies and degradation
Industry Applications
This project demonstrates skills directly applicable to:
Real-Time Analytics Platforms:
- Building systems like Apache Flink, Kafka Streams, or Apache Storm
- Creating custom analytics engines for specific use cases
- Implementing fraud detection and anomaly detection systems
- Designing IoT data processing pipelines
Data Engineering Roles:
- Stream processing pipeline development
- Time-series database optimization
- Real-time ETL and data transformation
- Event-driven architecture design
DevOps and SRE:
- Operating high-throughput distributed systems
- Performance tuning and optimization
- Building monitoring and alerting infrastructure
- Ensuring system reliability and fault tolerance
Architectural Patterns Learned
Event-Driven Architecture:
- Loose coupling through event streams
- Asynchronous communication patterns
- Event sourcing and CQRS principles
- Stream-table duality concepts
Distributed Computing:
- Consistent hashing for data partitioning (sketched after this list)
- Leader election and coordination
- Distributed state management
- Replication and fault tolerance
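To make the consistent-hashing item concrete, here is a minimal Go ring sketch (illustrative only, not the engine's actual partitioner): routing stream keys this way means adding or removing a worker only remaps the keys adjacent to it on the ring, rather than reshuffling every partition.

```go
package partition

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each node is placed at several
// virtual points, and a key is routed to the first node clockwise from its hash.
type Ring struct {
	points []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> node name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each node at vnodes virtual positions on the ring.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Node returns the worker responsible for a stream key (e.g. user_id).
func (r *Ring) Node(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```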
Performance Patterns:
- Batching for throughput optimization
- Pipelining for latency reduction
- Caching for query acceleration
- Incremental computation for efficiency
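A minimal illustration of the incremental-computation pattern (hypothetical aggregator type, not a real package API): the running average is updated once per event, so reading a window result is O(1) instead of rescanning the window's events.

```go
package agg

// IncrementalAvg maintains a running average that is updated per event,
// so a window result can be read without revisiting stored events.
type IncrementalAvg struct {
	count int64
	sum   float64
}

func (a *IncrementalAvg) Add(v float64) {
	a.count++
	a.sum += v
}

func (a *IncrementalAvg) Value() float64 {
	if a.count == 0 {
		return 0
	}
	return a.sum / float64(a.count)
}
```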
Next Steps
Immediate Enhancements
Advanced Features:
- Machine Learning Integration: Add anomaly detection with online learning algorithms
- Graph Analytics: Track relationships between entities
- Geo-Spatial Queries: Process location data with geofencing and proximity alerts
- Query Optimization: Implement cost-based optimizer for complex queries
- Multi-Tenant Isolation: Add quota management and resource isolation per customer
Performance Improvements:
- GPU Acceleration: Offload aggregations to GPUs for massive parallelism
- Zero-Copy Networking: Use shared memory for inter-process communication
- Columnar Processing: Vectorize operations for SIMD optimization
- Compression: Implement streaming compression for network and storage
- Custom Allocators: Reduce GC pressure with object pooling
Operational Features:
- Auto-Scaling: Automatically add/remove workers based on load
- Multi-Region Deployment: Replicate streams across geographic regions
- Backup and Restore: Snapshot state for disaster recovery
- Security: Add authentication, authorization, and encryption
- Compliance: Implement audit logging and data retention policies
Production Deployment
Infrastructure Setup:
- Deploy to Kubernetes cluster with StatefulSets for workers
- Configure Kafka cluster with replication factor 3 for durability
- Set up ClickHouse cluster with distributed tables
- Deploy Redis cluster for high-availability pub/sub
- Configure load balancers for WebSocket and API endpoints
Monitoring and Observability:
- Set up Prometheus for metrics collection
- Configure Grafana dashboards for visualization
- Implement distributed tracing with Jaeger
- Set up alerting rules for anomalies
- Create runbooks for incident response
Testing at Scale:
- Run load tests with realistic traffic patterns
- Perform chaos engineering experiments
- Validate recovery time objectives and recovery point objectives
- Test backpressure handling under extreme load
- Measure cost per million events processed
Further Learning
Advanced Stream Processing:
- Study Apache Flink internals and architecture
- Read "Streaming Systems" by Tyler Akidau for theoretical foundations
- Explore ksqlDB source code for SQL stream processing patterns
- Learn about watermark strategies in distributed systems
- Investigate stateful stream processing with Chandy-Lamport snapshots
Distributed Systems:
- Study consensus algorithms for coordination
- Learn about distributed transactions and exactly-once semantics
- Explore distributed tracing and observability patterns
- Investigate service mesh architectures
- Study CAP theorem implications for stream processing
Data Engineering:
- Master Apache Kafka internals and operations
- Learn columnar storage formats
- Study query optimization techniques
- Explore data lake architectures
- Investigate real-time OLAP databases
Career Development
Roles This Project Prepares You For:
- Staff/Senior Data Engineer: Building production data platforms
- Stream Processing Engineer: Developing real-time analytics systems
- Platform Engineer: Operating high-scale distributed systems
- Solutions Architect: Designing event-driven architectures
- Site Reliability Engineer: Ensuring reliability of data infrastructure
Companies Using Similar Systems:
- Confluent: Kafka Streams, ksqlDB development
- Databricks: Structured Streaming, Delta Lake
- Snowflake: Real-time data warehousing
- Datadog: Log aggregation and real-time monitoring
- Stripe: Fraud detection and analytics
- Uber: Real-time pricing and dispatch systems
Competitive Advantages:
- Proven ability to build systems processing millions of events/second
- Deep understanding of stream processing internals
- Experience with production-grade fault tolerance and scalability
- Hands-on expertise with modern data infrastructure
- Portfolio demonstrating end-to-end system design and implementation
Download Complete Solution
Get the complete real-time analytics engine with all source code, tests, and deployment configurations:
Includes: Stream processing engine, windowing operators, CEP pattern matcher, time-series storage integration, WebSocket server, real-time dashboard, SQL query parser, comprehensive test suite, Docker deployment, and complete README with architecture diagrams and implementation guide.
Note: The README contains detailed architecture diagrams, implementation phases, project structure breakdown, and step-by-step development guide.
Congratulations! You've completed one of the most challenging projects in modern data engineering. You've built a production-grade real-time analytics engine capable of processing millions of events per second with sub-100ms latency. This project demonstrates mastery of stream processing, windowing algorithms, complex event processing, and distributed systems architecture—skills that are highly valued at companies building modern data platforms like Confluent, Databricks, Snowflake, and Uber.
The patterns and techniques you've learned here—watermarking, exactly-once semantics, fault-tolerant state management, and real-time aggregations—are fundamental to building any large-scale event-driven system. You're now equipped to tackle the most demanding real-time data challenges in the industry.