Final Project 3: Real-Time Analytics Engine

Problem Statement

You're the lead engineer at a company that provides real-time analytics for IoT and monitoring platforms. Your customers include:

  • Smart City: 100,000 traffic sensors sending location data every second
  • E-commerce: Real-time fraud detection analyzing millions of transactions
  • Gaming Company: Live player analytics for 10 million concurrent users
  • FinTech: Real-time trading analytics processing stock market feeds

Current Pain Points with Batch Analytics

  • Latency: Results available in minutes or hours, not seconds
  • Stale Data: Dashboard shows data from 5 minutes ago
  • Alert Delays: Fraud detected 30 minutes after transaction completes
  • Resource Waste: Computing aggregations on all historical data repeatedly
  • Complex Setup: Requires data engineering team to build and maintain pipelines

The Business Challenge

Build a real-time analytics platform that processes streaming data with sub-second latency:

Customer Requirements:

  • Ingest 1M+ events per second with <100ms latency
  • Complex windowing: tumbling, sliding, and session windows
  • Complex Event Processing for pattern matching
  • Real-time aggregations updated every second
  • WebSocket dashboards showing live metrics
  • SQL-like query language for easy analytics
  • Horizontal scaling
  • Fault tolerance
  • Time-series optimization

Real-World Scenario

A retail company wants real-time fraud detection:

-- Detect a suspicious pattern: a user makes 5+ purchases across 3+ cities within 1 hour
SELECT
  user_id,
  COUNT(*) as purchase_count,
  COUNT(DISTINCT city) as city_count,
  COLLECT_LIST(city) as cities
FROM purchases
GROUP BY
  user_id,
  HOP(timestamp, INTERVAL '5' MINUTE, INTERVAL '1' HOUR)  -- 1-hour window sliding every 5 minutes
HAVING
  COUNT(*) >= 5 AND
  COUNT(DISTINCT city) >= 3
-- Real-time alert when the pattern matches
EMIT CHANGES;

When a pattern matches, the system:

  1. Triggers a real-time alert to the fraud team
  2. Updates the live dashboard showing fraud attempts per minute
  3. Stores the pattern match in a time-series database for historical analysis
  4. Sends a webhook to the payment processor to block the card

This requires:

  • Stream processing: Continuous query execution on unbounded data
  • Windowing: Group events by time intervals
  • Aggregation: COUNT, DISTINCT across windows
  • CEP: Pattern matching
  • Real-time output: WebSocket updates to dashboard
  • Low latency: <1 second from event ingestion to alert

Requirements

Functional Requirements

Must Have:

  1. Stream Ingestion: Consume from Kafka, NATS, HTTP endpoints, WebSockets
  2. Windowing Functions: Tumbling, sliding, and session windows (a window-assignment sketch follows this list)
  3. Aggregations: COUNT, SUM, AVG, MIN, MAX, percentiles
  4. Complex Event Processing: Pattern matching across event streams
  5. SQL-Like Query Language: Familiar syntax for analysts
  6. Real-Time Joins: Stream-to-stream and stream-to-table joins
  7. State Management: Windowed state with fault tolerance
  8. Time-Series Storage: Efficient storage and querying of historical data
  9. WebSocket Streaming: Push updates to dashboards in real-time
  10. Horizontal Scaling: Partition streams across multiple workers
  11. Exactly-Once Semantics: No duplicate or lost events
  12. Late Event Handling: Handle out-of-order events with watermarks
  13. Metrics & Monitoring: Track throughput, latency, backpressure
  14. REST API: Query historical data and stream metadata
  15. Real-Time Dashboards: Web UI showing live metrics
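
To illustrate requirement 2, here is a small, self-contained sketch of how an event timestamp maps to tumbling and sliding windows. The types and helpers are hypothetical, not the engine's API; session windows differ in that they grow with activity and close after a gap of inactivity rather than at fixed boundaries.

package main

import (
    "fmt"
    "time"
)

// A window is a half-open interval [Start, End).
type window struct {
    Start, End time.Time
}

// tumblingWindow assigns an event time to exactly one fixed-size window.
func tumblingWindow(ts time.Time, size time.Duration) window {
    start := ts.Truncate(size)
    return window{Start: start, End: start.Add(size)}
}

// slidingWindows assigns an event time to every overlapping window of the
// given size that advances by slide (size is a multiple of slide here).
func slidingWindows(ts time.Time, size, slide time.Duration) []window {
    var out []window
    first := ts.Truncate(slide).Add(slide - size) // earliest window containing ts
    for start := first; !start.After(ts); start = start.Add(slide) {
        out = append(out, window{Start: start, End: start.Add(size)})
    }
    return out
}

func main() {
    ts := time.Date(2024, 1, 1, 12, 7, 30, 0, time.UTC)
    fmt.Println("tumbling 1m:", tumblingWindow(ts, time.Minute))
    for _, w := range slidingWindows(ts, time.Hour, 5*time.Minute) {
        fmt.Println("sliding 1h/5m:", w.Start.Format("15:04"), "-", w.End.Format("15:04"))
    }
}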

Should Have:

  1. User-Defined Functions: Custom transformations in Go
  2. Materialized Views: Pre-computed aggregations for fast queries
  3. Backpressure Handling: Slow down ingestion when consumers can't keep up
  4. Multi-Tenancy: Isolated streams per customer
  5. Schema Registry: Track event schemas with evolution
  6. Alerting: Trigger webhooks, emails, Slack notifications
  7. Data Enrichment: Join streams with reference data
  8. Sampling: Process a subset of data for large-scale analytics

Nice to Have:

  1. Machine Learning Integration: Anomaly detection with online learning
  2. Graph Analytics: Track relationships between entities
  3. Geo-Spatial Queries: Process location data with geofencing
  4. Multi-Language UDFs: JavaScript, Python UDFs via WASM
  5. Query Federation: Join data from external databases

Non-Functional Requirements

Performance:

  • Ingestion: 1M+ events/sec per node
  • Latency: End-to-end <100ms target, <200ms worst case
  • Query Latency: Sub-second for real-time queries
  • Throughput: 100k+ queries/sec for dashboard updates

Scalability:

  • Horizontal Scaling: Linear throughput increase with more nodes
  • Stream Partitioning: 1000+ partitions per topic
  • Concurrent Queries: 10,000+ simultaneous queries
  • Data Retention: 1 year+ of time-series data

Reliability:

  • Availability: 99.9% uptime
  • Fault Tolerance: Survive node failures with <5 second recovery
  • Exactly-Once: Zero duplicate events in aggregations
  • Data Durability: Replicate state to prevent data loss

Consistency:

  • Event Time Processing: Process events by their timestamp, not arrival time
  • Watermarks: Handle late events up to 1 hour delay
  • Out-of-Order Handling: Recompute windows when late events arrive

Constraints

Technical Constraints:

  1. Language: Go for high performance and low GC overhead
  2. Message Queue: Kafka or NATS Streaming for durability
  3. Time-Series DB: InfluxDB, TimescaleDB, or custom storage
  4. WebSocket Library: gorilla/websocket for real-time updates
  5. Query Language: SQL-like syntax

Resource Constraints:

  1. Memory: Assume 32 GB RAM per node
  2. CPU: Multi-core servers
  3. Network: 10 Gbps for inter-node communication

Business Constraints:

  1. Development Time: 10 weeks for MVP
  2. Team Size: 2-3 developers
  3. Learning Curve: Query language must be usable by analysts with only SQL knowledge

Design Considerations

Stream Processing Model

Choose between micro-batching and true streaming:

Micro-Batching:

  • Simpler to implement with batch processing optimizations
  • Higher throughput but latency bounded by batch interval
  • Easier exactly-once semantics

True Streaming:

  • Ultra-low latency processing one event at a time
  • More complex implementation
  • Better for sub-second analytics requirements

Recommendation: Use adaptive micro-batching that adjusts batch size based on load to balance throughput and latency while targeting <100ms end-to-end latency.
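
To make the recommendation concrete, here is a minimal sketch of an adaptive micro-batcher in Go. The Event type, the channel-based input, and the tuning constants are illustrative assumptions rather than the project's API: the loop flushes a batch when it fills up or when a short timer fires, then grows or shrinks the batch size based on how long the last flush took relative to a latency budget.

package main

import (
    "fmt"
    "time"
)

// Event is a placeholder for a decoded stream record (hypothetical type).
type Event struct {
    Key   string
    Value float64
}

// adaptiveBatcher drains events from in, flushing a batch either when it is
// full or when the flush timer fires, then adapts the batch size based on
// how long processing took relative to the latency budget.
func adaptiveBatcher(in <-chan Event, process func([]Event)) {
    const (
        minBatch      = 64
        maxBatch      = 65536
        latencyBudget = 50 * time.Millisecond // leave headroom under the 100ms target
    )
    batchSize := 1024
    ticker := time.NewTicker(10 * time.Millisecond)
    defer ticker.Stop()

    batch := make([]Event, 0, batchSize)
    flush := func() {
        if len(batch) == 0 {
            return
        }
        start := time.Now()
        process(batch)
        elapsed := time.Since(start)

        // Adapt: halve the batch when we blow the budget, grow it when we
        // have plenty of headroom.
        switch {
        case elapsed > latencyBudget && batchSize > minBatch:
            batchSize /= 2
        case elapsed < latencyBudget/2 && batchSize < maxBatch:
            batchSize *= 2
        }
        batch = batch[:0]
    }

    for {
        select {
        case ev, ok := <-in:
            if !ok {
                flush()
                return
            }
            batch = append(batch, ev)
            if len(batch) >= batchSize {
                flush()
            }
        case <-ticker.C:
            flush() // bound latency even under low traffic
        }
    }
}

func main() {
    in := make(chan Event, 1024)
    go func() {
        for i := 0; i < 10000; i++ {
            in <- Event{Key: "sensor-1", Value: float64(i)}
        }
        close(in)
    }()
    adaptiveBatcher(in, func(b []Event) { fmt.Printf("processed batch of %d events\n", len(b)) })
}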

Windowing Strategy

Choose between processing-time windows and event-time windows:

Processing Time:

  • Simpler implementation
  • No late events or watermarks needed
  • Non-deterministic results if events arrive out of order

Event Time:

  • Correct results for analytics
  • Handles out-of-order events
  • Requires watermarking to manage memory

Recommendation: Use event-time windowing with watermarks to ensure accurate analytics. Set allowed lateness to bound memory usage while handling most out-of-order events.
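
A minimal sketch of the idea, assuming a bounded-out-of-orderness watermark and hypothetical helper types (nothing here is the project's real API): the watermark trails the maximum observed event time by a fixed bound, and a window's state is only evicted once the watermark passes the window end plus the allowed lateness.

package main

import (
    "fmt"
    "time"
)

// windowStart assigns an event timestamp to its tumbling window of the given size.
func windowStart(ts time.Time, size time.Duration) time.Time {
    return ts.Truncate(size)
}

type tracker struct {
    maxEventTime    time.Time
    maxOutOfOrder   time.Duration // bounded out-of-orderness
    allowedLateness time.Duration // extra grace before state is dropped
}

// observe advances event time and returns the current watermark.
func (t *tracker) observe(ts time.Time) time.Time {
    if ts.After(t.maxEventTime) {
        t.maxEventTime = ts
    }
    return t.maxEventTime.Add(-t.maxOutOfOrder)
}

// shouldClose reports whether a window ending at end can be finalized and its
// state evicted, i.e. the watermark has passed end plus the allowed lateness.
func (t *tracker) shouldClose(end, watermark time.Time) bool {
    return watermark.After(end.Add(t.allowedLateness))
}

func main() {
    size := time.Minute
    tr := &tracker{maxOutOfOrder: 10 * time.Second, allowedLateness: 30 * time.Second}

    base := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
    events := []time.Time{
        base.Add(5 * time.Second),
        base.Add(70 * time.Second),
        base.Add(40 * time.Second), // late but within bounds: still lands in the first window
        base.Add(3 * time.Minute),
    }
    for _, ts := range events {
        wm := tr.observe(ts)
        ws := windowStart(ts, size)
        fmt.Printf("event=%s window=[%s,%s) watermark=%s close-first-window=%v\n",
            ts.Format("15:04:05"), ws.Format("15:04:05"), ws.Add(size).Format("15:04:05"),
            wm.Format("15:04:05"), tr.shouldClose(base.Add(size), wm))
    }
}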

State Management

Choose between in-memory state and persistent state:

In-Memory:

  • Extremely fast with no disk I/O
  • Lost on node failures requiring full reprocessing

Persistent:

  • Survives failures with incremental checkpointing
  • Handles large state exceeding RAM
  • Slower due to disk operations

Recommendation: Hybrid approach—keep hot state in memory with periodic checkpoints to RocksDB. On failure, restore from last checkpoint and replay message queue from recorded offsets.
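
The checkpoint itself only needs two things: the window state and the source offsets that produced it. The sketch below illustrates that pairing with a gob snapshot written to a local file; a production system would use RocksDB or object storage instead, and the field layout shown is an assumption for illustration.

package main

import (
    "encoding/gob"
    "fmt"
    "os"
)

// checkpoint pairs in-memory window state with the source offsets that
// produced it, so recovery can replay the message queue from the right spot.
type checkpoint struct {
    WindowState map[string]float64 // e.g. running sums keyed by "window|key"
    Offsets     map[int32]int64    // partition -> next offset to consume
}

func save(path string, cp checkpoint) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(cp)
}

func restore(path string) (checkpoint, error) {
    var cp checkpoint
    f, err := os.Open(path)
    if err != nil {
        return cp, err
    }
    defer f.Close()
    return cp, gob.NewDecoder(f).Decode(&cp)
}

func main() {
    cp := checkpoint{
        WindowState: map[string]float64{"2024-01-01T12:00|NYC": 4217.50},
        Offsets:     map[int32]int64{0: 1_000_000, 1: 998_431},
    }
    if err := save("checkpoint.gob", cp); err != nil {
        panic(err)
    }
    // On failure: restore the last snapshot, then replay the queue from the
    // recorded offsets to rebuild any state written since the checkpoint.
    recovered, err := restore("checkpoint.gob")
    if err != nil {
        panic(err)
    }
    fmt.Printf("restored %d window entries, resume partition 0 at offset %d\n",
        len(recovered.WindowState), recovered.Offsets[0])
}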

Time-Series Storage

Choose between row-oriented storage and columnar storage:

Row-Oriented:

  • Simple schema and fast writes
  • Slow analytical queries requiring full table scans

Columnar:

  • Fast analytical queries scanning only needed columns
  • High compression ratios
  • Vectorized execution for aggregations

Recommendation: Use columnar storage for analytics workloads with monthly partitioning and clustering by user/time for optimal query performance.
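
The gap between the two layouts comes down to memory access: columnar storage keeps each field contiguous, so an aggregation reads only the column it needs. A toy struct-of-arrays sketch in Go (no compression, partitioning, or real storage engine) shows the idea:

package main

import "fmt"

// columnBatch stores each field in its own contiguous slice (struct of arrays),
// so an aggregate over "amount" touches only that column.
type columnBatch struct {
    Timestamps []int64 // epoch millis
    UserIDs    []int64
    Amounts    []float64
    Cities     []string
}

// sumAmounts scans a single column; a row-oriented layout would have to walk
// every field of every row to do the same work.
func sumAmounts(b *columnBatch) float64 {
    var total float64
    for _, a := range b.Amounts {
        total += a
    }
    return total
}

func main() {
    b := &columnBatch{
        Timestamps: []int64{1700000000000, 1700000001000, 1700000002000},
        UserIDs:    []int64{42, 42, 7},
        Amounts:    []float64{120.50, 89.99, 310.00},
        Cities:     []string{"NYC", "Boston", "Chicago"},
    }
    fmt.Printf("total revenue in batch: %.2f\n", sumAmounts(b))
}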

Real-Time Broadcasting

Choose between direct WebSocket connections and pub/sub systems:

Direct WebSocket:

  • Simple implementation with low latency
  • Doesn't scale beyond a single server
  • Connections are lost during server restarts

Pub/Sub:

  • Scales horizontally across multiple web servers
  • Handles reconnections gracefully
  • Efficient fan-out to thousands of clients

Recommendation: Use Redis Pub/Sub to enable multiple web servers to push updates and handle client reconnections automatically.
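
A rough sketch of that fan-out path using github.com/redis/go-redis/v9 (the channel name and payload shape are assumptions): aggregation workers publish each window result once, and every web server subscribes and forwards messages to its own WebSocket clients.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    // Each web server subscribes to the metrics channel and forwards every
    // message to its locally connected WebSocket clients.
    sub := rdb.Subscribe(ctx, "dashboard:metrics")
    defer sub.Close()
    if _, err := sub.Receive(ctx); err != nil { // wait for subscription confirmation
        panic(err)
    }
    go func() {
        for msg := range sub.Channel() {
            // In the real service this would call wsServer.Broadcast(...).
            fmt.Printf("forwarding to local WebSocket clients: %s\n", msg.Payload)
        }
    }()

    // Aggregation workers publish each window result once; Redis fans it out
    // to every subscribed web server, regardless of which node computed it.
    payload := `{"events_per_sec": 15234, "avg_latency_ms": 12.7}`
    if err := rdb.Publish(ctx, "dashboard:metrics", payload).Err(); err != nil {
        panic(err)
    }

    time.Sleep(500 * time.Millisecond) // give the subscriber goroutine time to print (demo only)
}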


Acceptance Criteria

The project is considered complete when it demonstrates:

Core Functionality:

  • Ingests events from Kafka/NATS/HTTP at 1M+ events/second
  • Implements tumbling, sliding, and session windows correctly
  • Processes SQL-like queries with aggregations
  • Detects complex event patterns across multiple streams
  • Stores time-series data efficiently with sub-second query latency
  • Pushes real-time updates via WebSocket to connected dashboards
  • Handles exactly-once semantics with no data loss or duplication

Performance Requirements:

  • End-to-end latency <100ms from ingestion to query result
  • Scales linearly with added worker nodes
  • Handles late events correctly using watermark strategies
  • Memory usage remains bounded with proper window eviction
  • Recovers from node failures in <30 seconds with state restoration

Production Readiness:

  • Comprehensive unit tests for windowing and aggregation logic
  • Integration tests validating end-to-end stream processing
  • Load tests demonstrating 1M+ events/sec throughput
  • Monitoring dashboard showing throughput, latency, and backpressure metrics
  • Docker deployment configuration for local development
  • Complete README with architecture diagrams and setup instructions

User Experience:

  • Web dashboard displays live metrics updated in real-time
  • SQL query interface allows ad-hoc analytics queries
  • System handles graceful shutdowns without data loss
  • Clear error messages and logging for debugging
  • API documentation for integration with external systems

Usage Examples

Example 1: Real-Time Fraud Detection

Detect users making multiple purchases in different cities within a short time window:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/yourorg/analytics-engine/pkg/cep"
    "github.com/yourorg/analytics-engine/pkg/stream"
)

func main() {
    // Initialize stream processor
    processor := stream.NewProcessor(stream.Config{
        KafkaBrokers: []string{"localhost:9092"},
        Topic:        "purchases",
    })

    // Define fraud detection pattern
    pattern := &cep.Pattern{
        Name: "Multi-City Fraud Detection",
        Conditions: []cep.Condition{
            {
                EventType: "purchase",
                Predicate: func(e *stream.Event) bool {
                    amount, _ := e.Value["amount"].(float64)
                    return amount > 100 // Purchases over $100
                },
                MinCount: 5, // At least 5 purchases
            },
        },
        TimeWindow: 1 * time.Hour,
        Actions: []cep.Action{
            &cep.AlertAction{
                Name: "fraud-alert",
                Handler: func(matches []*stream.Event) {
                    cities := make(map[string]bool)
                    for _, event := range matches {
                        city, _ := event.Value["city"].(string)
                        cities[city] = true
                    }

                    // Alert if 3+ different cities
                    if len(cities) >= 3 {
                        userID := matches[0].Value["user_id"]
                        fmt.Printf("FRAUD ALERT: User %v made %d purchases in %d cities\n",
                            userID, len(matches), len(cities))

                        // Send to webhook, Slack, etc.
                        sendFraudAlert(userID, matches, cities)
                    }
                },
            },
        },
    }

    // Start pattern matching
    matcher := cep.NewPatternMatcher(pattern)
    processor.AddMatcher(matcher)

    processor.Run(context.Background())
}

Example 2: Real-Time Metrics Dashboard

Create a live dashboard showing events per second and latency statistics:

package main

import (
    "context"
    "time"

    "github.com/yourorg/analytics-engine/pkg/stream"
    "github.com/yourorg/analytics-engine/pkg/websocket"
    "github.com/yourorg/analytics-engine/pkg/window"
)

func main() {
    // Initialize WebSocket server
    wsServer := websocket.NewServer(":8081")
    go wsServer.Run()

    // Create windowing operator for 1-minute tumbling windows
    operator := window.NewOperator(window.Config{
        Type: window.Tumbling,
        Size: 1 * time.Minute,
        Aggregator: &window.MultiAggregator{
            Aggregators: map[string]window.Aggregator{
                "count":       &window.CountAggregator{},
                "avg_latency": &window.AvgAggregator{Field: "latency"},
                "p95_latency": &window.PercentileAggregator{
                    Field:      "latency",
                    Percentile: 0.95,
                },
            },
        },
    })

    // Process events and broadcast results
    processor := stream.NewProcessor(stream.Config{
        KafkaBrokers: []string{"localhost:9092"},
        Topic:        "api-requests",
    })

    processor.OnWindowClose(func(result window.Result) {
        // Broadcast metrics to all connected WebSocket clients
        wsServer.Broadcast(map[string]interface{}{
            "timestamp":      time.Now(),
            "events_per_sec": result.Aggregates["count"].(int) / 60,
            "avg_latency_ms": result.Aggregates["avg_latency"].(float64),
            "p95_latency_ms": result.Aggregates["p95_latency"].(float64),
        })
    })

    processor.AddOperator(operator)
    processor.Run(context.Background())
}

Example 3: SQL-Like Stream Queries

Execute continuous queries using SQL-like syntax:

-- Create stream from Kafka topic
CREATE STREAM purchases (
    user_id BIGINT,
    amount DECIMAL(10,2),
    city STRING,
    timestamp TIMESTAMP
) WITH (
    kafka_topic = 'purchases',
    value_format = 'JSON'
);

-- Real-time aggregation with a 1-minute tumbling window
SELECT
    city,
    COUNT(*) as purchase_count,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM purchases
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY city
EMIT CHANGES;

-- Session-window aggregation of user activity (30-minute inactivity gap)
SELECT
    user_id,
    SESSION_START() as session_start,
    SESSION_END() as session_end,
    COUNT(*) as page_views,
    COLLECT_LIST(page_url) as visited_pages
FROM page_views
WINDOW SESSION (30 MINUTES)
GROUP BY user_id
HAVING COUNT(*) > 10
EMIT CHANGES;

Example 4: Client Integration

Connect to the analytics engine from external applications:

package main

import (
    "encoding/json"
    "fmt"

    "github.com/gorilla/websocket"
)

func main() {
    // Connect to WebSocket endpoint
    conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8081/ws", nil)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Subscribe to real-time metrics
    subscription := map[string]interface{}{
        "action": "subscribe",
        "topics": []string{"dashboard:metrics", "alerts:fraud"},
    }

    if err := conn.WriteJSON(subscription); err != nil {
        panic(err)
    }

    // Receive real-time updates
    for {
        _, message, err := conn.ReadMessage()
        if err != nil {
            fmt.Printf("Read error: %v\n", err)
            break
        }

        var update map[string]interface{}
        if err := json.Unmarshal(message, &update); err != nil {
            continue
        }

        fmt.Printf("Real-time update: %+v\n", update)

        // Update your dashboard UI here
        updateDashboard(update)
    }
}

Key Takeaways

Technical Skills Mastered

Stream Processing Fundamentals:

  • Designed high-throughput data pipelines processing 1M+ events/second
  • Implemented windowing operators for time-based aggregations
  • Built complex event processing engines for multi-event pattern detection
  • Created watermark strategies to handle late and out-of-order events
  • Achieved sub-100ms end-to-end latency for real-time analytics

Distributed Systems Architecture:

  • Implemented horizontal scaling with stream partitioning across worker nodes
  • Built fault-tolerant state management with checkpointing and recovery
  • Achieved exactly-once processing semantics preventing data loss or duplication
  • Designed backpressure mechanisms for flow control and system stability
  • Created distributed query execution across multiple processing nodes

Data Engineering Excellence:

  • Optimized time-series storage with columnar databases
  • Implemented efficient aggregation algorithms for real-time computation
  • Built SQL-like query engines for stream processing
  • Created incremental aggregations updating with each new event
  • Designed schema evolution and data enrichment pipelines

Real-Time Systems Design:

  • Built WebSocket broadcasting for live dashboard updates
  • Implemented Redis Pub/Sub for scalable message distribution
  • Created low-latency query APIs with sub-second response times
  • Designed alerting systems with real-time pattern detection
  • Built monitoring and observability for stream processing metrics

Production Engineering Skills

Performance Optimization:

  • Profiled and optimized hot paths for maximum throughput
  • Implemented memory-efficient data structures for windowing
  • Used adaptive micro-batching to balance latency and throughput
  • Optimized serialization/deserialization for network efficiency
  • Designed zero-copy data processing where possible

Testing and Quality Assurance:

  • Created comprehensive unit tests for windowing and aggregation logic
  • Built integration tests validating end-to-end stream processing
  • Designed load tests demonstrating 1M+ events/sec capacity
  • Implemented chaos testing for fault tolerance validation
  • Created benchmarks measuring latency percentiles

Operations and Deployment:

  • Containerized applications with Docker for reproducible deployments
  • Created Kubernetes manifests for production orchestration
  • Implemented health checks and graceful shutdown mechanisms
  • Built comprehensive monitoring with Prometheus and Grafana
  • Designed alert rules for system anomalies and degradation

Industry Applications

This project demonstrates skills directly applicable to:

Real-Time Analytics Platforms:

  • Building systems like Apache Flink, Kafka Streams, or Apache Storm
  • Creating custom analytics engines for specific use cases
  • Implementing fraud detection and anomaly detection systems
  • Designing IoT data processing pipelines

Data Engineering Roles:

  • Stream processing pipeline development
  • Time-series database optimization
  • Real-time ETL and data transformation
  • Event-driven architecture design

DevOps and SRE:

  • Operating high-throughput distributed systems
  • Performance tuning and optimization
  • Building monitoring and alerting infrastructure
  • Ensuring system reliability and fault tolerance

Architectural Patterns Learned

Event-Driven Architecture:

  • Loose coupling through event streams
  • Asynchronous communication patterns
  • Event sourcing and CQRS principles
  • Stream-table duality concepts

Distributed Computing:

  • Consistent hashing for data partitioning
  • Leader election and coordination
  • Distributed state management
  • Replication and fault tolerance

Performance Patterns:

  • Batching for throughput optimization
  • Pipelining for latency reduction
  • Caching for query acceleration
  • Incremental computation for efficiency

Next Steps

Immediate Enhancements

Advanced Features:

  1. Machine Learning Integration: Add anomaly detection with online learning algorithms
  2. Graph Analytics: Track relationships between entities
  3. Geo-Spatial Queries: Process location data with geofencing and proximity alerts
  4. Query Optimization: Implement cost-based optimizer for complex queries
  5. Multi-Tenant Isolation: Add quota management and resource isolation per customer

Performance Improvements:

  1. GPU Acceleration: Offload aggregations to GPUs for massive parallelism
  2. Zero-Copy Networking: Use shared memory for inter-process communication
  3. Columnar Processing: Vectorize operations for SIMD optimization
  4. Compression: Implement streaming compression for network and storage
  5. Custom Allocators: Reduce GC pressure with object pooling

Operational Features:

  1. Auto-Scaling: Automatically add/remove workers based on load
  2. Multi-Region Deployment: Replicate streams across geographic regions
  3. Backup and Restore: Snapshot state for disaster recovery
  4. Security: Add authentication, authorization, and encryption
  5. Compliance: Implement audit logging and data retention policies

Production Deployment

Infrastructure Setup:

  1. Deploy to Kubernetes cluster with StatefulSets for workers
  2. Configure Kafka cluster with replication factor 3 for durability
  3. Set up ClickHouse cluster with distributed tables
  4. Deploy Redis cluster for high-availability pub/sub
  5. Configure load balancers for WebSocket and API endpoints

Monitoring and Observability:

  1. Set up Prometheus for metrics collection
  2. Configure Grafana dashboards for visualization
  3. Implement distributed tracing with Jaeger
  4. Set up alerting rules for anomalies
  5. Create runbooks for incident response

Testing at Scale:

  1. Run load tests with realistic traffic patterns
  2. Perform chaos engineering experiments
  3. Validate recovery time objectives and recovery point objectives
  4. Test backpressure handling under extreme load
  5. Measure cost per million events processed

Further Learning

Advanced Stream Processing:

  • Study Apache Flink internals and architecture
  • Read "Streaming Systems" by Tyler Akidau for theoretical foundations
  • Explore ksqlDB source code for SQL stream processing patterns
  • Learn about watermark strategies in distributed systems
  • Investigate stateful stream processing with Chandy-Lamport snapshots

Distributed Systems:

  • Study consensus algorithms for coordination
  • Learn about distributed transactions and exactly-once semantics
  • Explore distributed tracing and observability patterns
  • Investigate service mesh architectures
  • Study CAP theorem implications for stream processing

Data Engineering:

  • Master Apache Kafka internals and operations
  • Learn columnar storage formats
  • Study query optimization techniques
  • Explore data lake architectures
  • Investigate real-time OLAP databases

Career Development

Roles This Project Prepares You For:

  • Staff/Senior Data Engineer: Building production data platforms
  • Stream Processing Engineer: Developing real-time analytics systems
  • Platform Engineer: Operating high-scale distributed systems
  • Solutions Architect: Designing event-driven architectures
  • Site Reliability Engineer: Ensuring reliability of data infrastructure

Companies Using Similar Systems:

  • Confluent: Kafka Streams, ksqlDB development
  • Databricks: Structured Streaming, Delta Lake
  • Snowflake: Real-time data warehousing
  • Datadog: Log aggregation and real-time monitoring
  • Stripe: Fraud detection and analytics
  • Uber: Real-time pricing and dispatch systems

Competitive Advantages:

  • Proven ability to build systems processing millions of events/second
  • Deep understanding of stream processing internals
  • Experience with production-grade fault tolerance and scalability
  • Hands-on expertise with modern data infrastructure
  • Portfolio demonstrating end-to-end system design and implementation

Download Complete Solution

Get the complete real-time analytics engine with all source code, tests, and deployment configurations:

⬇️ Download Complete Project

Includes: Stream processing engine, windowing operators, CEP pattern matcher, time-series storage integration, WebSocket server, real-time dashboard, SQL query parser, comprehensive test suite, Docker deployment, and complete README with architecture diagrams and implementation guide.

Note: The README contains detailed architecture diagrams, implementation phases, project structure breakdown, and step-by-step development guide.


Congratulations! You've completed one of the most challenging projects in modern data engineering. You've built a production-grade real-time analytics engine capable of processing millions of events per second with sub-100ms latency. This project demonstrates mastery of stream processing, windowing algorithms, complex event processing, and distributed systems architecture—skills that are highly valued at companies building modern data platforms like Confluent, Databricks, Snowflake, and Uber.

The patterns and techniques you've learned here—watermarking, exactly-once semantics, fault-tolerant state management, and real-time aggregations—are fundamental to building any large-scale event-driven system. You're now equipped to tackle the most demanding real-time data challenges in the industry.