Final Project 2: Distributed Database System

Learning Objectives

After completing this project, you will master:

Core Database Engineering:

  • Design and implement distributed storage systems with sharding and replication
  • Build LSM-tree and B-tree storage engines from scratch with compression
  • Implement ACID transaction processing with distributed consensus protocols
  • Create SQL query engines with parsers, optimizers, and executors
  • Design write-ahead logging and crash recovery mechanisms

Distributed Systems Mastery:

  • Implement Raft consensus for leader election and data replication
  • Handle distributed transactions with two-phase and three-phase commit protocols
  • Design partitioning strategies for horizontal scaling
  • Build automatic failover and recovery systems with quorum-based consistency
  • Implement geo-replication for multi-region deployments

Performance Engineering:

  • Optimize query performance with cost-based optimizers and execution plans
  • Design caching strategies and connection pooling for high throughput
  • Implement compaction algorithms for space reclamation in LSM-trees
  • Build monitoring and profiling systems for performance analysis
  • Handle 100,000+ queries per second with sub-10ms latency requirements

Production Systems Architecture:

  • Design backup and restore systems with point-in-time recovery
  • Implement schema management with DDL operations and migrations
  • Build multi-tenant isolation and security mechanisms
  • Create monitoring and alerting for distributed system health
  • Design systems that scale from gigabytes to petabytes seamlessly

Progressive Complexity Phases

Phase 1: Foundation

  • Basic storage engine with key-value operations
  • Simple query parser and executor for basic SQL
  • Single-node transactions with MVCC implementation
  • Basic in-memory storage and persistence to disk
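
To make Phase 1 concrete, here is a minimal sketch of a versioned in-memory key-value store, the simplest form of the MVCC behaviour this phase asks for. All type and method names are illustrative rather than a required API; persistence and the SQL layer are deliberately left out.

package main

import (
    "fmt"
    "sync"
)

// version pairs a commit timestamp with a value, enabling snapshot reads.
type version struct {
    ts    uint64
    value []byte
}

// versionedStore keeps a version chain per key: the core of single-node MVCC.
type versionedStore struct {
    mu       sync.RWMutex
    clock    uint64
    versions map[string][]version // newest version appended last
}

func newVersionedStore() *versionedStore {
    return &versionedStore{versions: make(map[string][]version)}
}

// Put records a new version of key at the next commit timestamp.
func (s *versionedStore) Put(key string, value []byte) uint64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.clock++
    s.versions[key] = append(s.versions[key], version{ts: s.clock, value: value})
    return s.clock
}

// GetAt returns the newest version of key visible at snapshot timestamp ts.
func (s *versionedStore) GetAt(key string, ts uint64) ([]byte, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    vs := s.versions[key]
    for i := len(vs) - 1; i >= 0; i-- {
        if vs[i].ts <= ts {
            return vs[i].value, true
        }
    }
    return nil, false
}

func main() {
    s := newVersionedStore()
    t1 := s.Put("user:1", []byte("alice"))
    s.Put("user:1", []byte("alice-v2"))
    if v, ok := s.GetAt("user:1", t1); ok {
        fmt.Printf("snapshot at t1 sees %s\n", v) // older readers still see "alice"
    }
}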

Phase 2: Distribution

  • Raft consensus implementation for leader election
  • Data sharding and partitioning across multiple nodes
  • Basic replication with follower nodes
  • Simple distributed transaction coordination

Phase 3: Advanced Features

  • Two-phase commit for distributed transactions
  • Query optimization with cost-based planner
  • LSM-tree storage engine with compaction
  • Secondary indexes and efficient query execution

Phase 4: Production Hardening

  • Automatic failover and recovery mechanisms
  • Backup and restore with point-in-time recovery
  • Performance monitoring and optimization
  • Security and multi-tenancy features

Scope Boundaries

Included in Scope:

  • Building distributed storage and replication systems
  • Implementing consensus protocols for consistency
  • Creating SQL query engines with optimizers
  • Designing storage engines
  • Building distributed transaction systems
  • Implementing backup/restore and recovery mechanisms

Out of Scope:

  • Building custom network protocols for node communication
  • Implementing advanced cryptographic security features
  • Creating machine learning query optimization
  • Building graphical database administration tools
  • Implementing stored procedures and user-defined functions
  • Advanced features like full-text search or time-series optimization

Prerequisites

Technical Skills Required:

  • Advanced Go programming
  • Understanding of database concepts
  • Knowledge of distributed systems fundamentals
  • Familiarity with data structures and algorithms
  • Experience with system programming and memory management

Recommended Background:

  • Previous experience building database systems or storage engines
  • Understanding of network programming and RPC frameworks
  • Knowledge of operating systems concepts
  • Familiarity with performance profiling and optimization
  • Experience with large-scale system design

Problem Statement

You're the founding engineer at a startup building a next-generation database system. Your customers are growing rapidly: one SaaS company went from 10 GB to 10 TB of data in 6 months, and their existing database can't keep up. They're experiencing:

  • Performance Degradation: Queries that took 50ms now take 5+ seconds
  • Storage Limitations: Single server disk is full; can't add more storage
  • Scalability Wall: Can't scale reads or writes horizontally
  • No Geographic Distribution: Customers in Asia experience 300ms+ latency
  • Single Point of Failure: Database downtime means entire application is down

The Business Challenge:

Build a distributed database that solves these problems while maintaining strong consistency guarantees:

Customer Requirements:

  • Handle petabyte-scale data
  • Support 100,000+ queries per second across all nodes
  • Sub-10ms latency for point queries
  • ACID transactions
  • SQL-like query interface
  • Geographic replication with automatic failover
  • Zero-downtime upgrades and maintenance
  • Point-in-time recovery

Current Market Gap:

  • PostgreSQL/MySQL: Great for single-node, but hard to scale horizontally
  • MongoDB: Scales well but weak transaction guarantees
  • Cassandra: Scales amazingly but eventual consistency is hard for developers
  • Google Spanner: Strong consistency at global scale, but costs $1,000+/month even for small deployments
  • Your Solution: Strong consistency + horizontal scaling + affordable

Real-World Scenario:

An e-commerce company is migrating from PostgreSQL to your database:

-- This transaction must be ACID
BEGIN;
  -- Deduct inventory
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'laptop-123';

  -- Create order
  INSERT INTO orders (order_id, product_id, status)
  VALUES ('order-456', 'laptop-123', 'created');

  -- Charge payment
  INSERT INTO payments (order_id, amount, status)
  VALUES ('order-456', 1299.00, 'pending');
COMMIT;  -- Either all succeed or all fail

This transaction spans 3 shards in 3 different regions. Your database must guarantee:

  1. Atomicity: All 3 operations succeed or all fail
  2. Consistency: Inventory never goes negative
  3. Isolation: Concurrent transactions don't see partial updates
  4. Durability: Once committed, data survives node failures

Requirements

Functional Requirements

Must Have:

  1. Distributed Storage: Data sharded across multiple nodes for horizontal scaling
  2. Replication: Each shard replicated 3x for fault tolerance
  3. ACID Transactions: Full support for distributed transactions with 2PC/3PC
  4. Query Engine: SQL-like query language with JOIN, GROUP BY, aggregations
  5. Query Optimizer: Cost-based optimization
  6. Storage Engine: LSM-tree or B-tree with compression and caching
  7. Consensus Protocol: Raft or Paxos for leader election and replication
  8. Automatic Sharding: Data distributed by hash or range-based partitioning
  9. Read/Write Isolation: Snapshot isolation or serializable isolation
  10. Write-Ahead Log: Durability with crash recovery
  11. Automatic Failover: Detect node failures and promote replica to leader
  12. Backup & Restore: Full and incremental backups with point-in-time recovery
  13. Monitoring: Metrics for query latency, throughput, replication lag
  14. Multi-Tenancy: Logical databases with isolated data
  15. Schema Management: DDL support

Should Have:

  1. Secondary Indexes: B-tree indexes for fast lookups
  2. Query Caching: Cache frequent queries in memory
  3. Connection Pooling: Reuse connections to reduce overhead
  4. Prepared Statements: Pre-compile queries for performance
  5. Batch Operations: Bulk inserts/updates for efficiency
  6. Geo-Replication: Multi-region deployment with low-latency reads
  7. Compaction: LSM-tree compaction for space reclamation
  8. Statistics Collection: Table/column statistics for query optimization
  9. Deadlock Detection: Detect and resolve distributed deadlocks
  10. Rate Limiting: Prevent runaway queries from consuming resources

Nice to Have:

  1. Materialized Views: Pre-computed aggregations for fast analytics
  2. Full-Text Search: Inverted indexes for text search
  3. Time-Series Optimization: Efficient storage for time-series data
  4. Change Data Capture: Stream changes to external systems
  5. Multi-Version Concurrency Control: Non-blocking reads

Non-Functional Requirements

Performance:

  • Point Queries: p95 < 10ms, p99 < 20ms
  • Range Scans: 10,000 rows/sec per node
  • Write Throughput: 50,000 writes/sec per node
  • Transaction Latency: p95 < 50ms for distributed transactions
  • Join Performance: 1M rows joined in < 1 second

Scalability:

  • Horizontal Scaling: Add nodes to increase capacity linearly
  • Data Size: Support 1 PB+ of data across cluster
  • Concurrent Connections: 10,000+ simultaneous connections
  • Shard Count: 1,000+ shards in a single cluster
  • Replica Lag: < 100ms replication lag under normal load

Reliability:

  • Availability: 99.99% uptime
  • Durability: 99.9999999% (lose at most 1 write per billion)
  • Fault Tolerance: Survive 2 simultaneous node failures
  • Recovery Time: < 30 seconds for automatic failover
  • Data Integrity: Zero data corruption

Consistency:

  • Default: Snapshot isolation
  • Strict Mode: Serializable isolation
  • Tunable Consistency: Allow relaxed consistency for specific queries

Constraints

Technical Constraints:

  1. Programming Language: Go for performance and concurrency
  2. Storage Format: Column-oriented for analytics, row-oriented for OLTP
  3. Network Protocol: gRPC for inter-node communication
  4. Consensus: Raft
  5. Wire Protocol: PostgreSQL-compatible protocol

Resource Constraints:

  1. Memory: Assume 16 GB RAM per node minimum
  2. Disk: NVMe SSD for low-latency writes
  3. Network: 10 Gbps minimum for inter-node traffic

Business Constraints:

  1. Development Time: 12 weeks for MVP
  2. Team Size: 3-4 developers
  3. Compatibility: Must support PostgreSQL wire protocol for easy migration

Design Considerations

Storage Engine Strategy

When building a distributed database, the choice of storage engine significantly impacts performance and scalability. Consider these fundamental tradeoffs:

LSM-Tree:

  • Optimized for write-heavy workloads with sequential I/O
  • Natural fit for modern SSDs which prefer sequential writes
  • Requires compaction to maintain read performance
  • Used by systems like RocksDB, LevelDB, and Cassandra

B-Tree:

  • Optimized for read-heavy workloads with in-place updates
  • Efficient range scans and point queries
  • Can suffer from fragmentation over time
  • Used by traditional databases like PostgreSQL and MySQL

Decision Point: For modern cloud workloads, LSM-trees generally provide better write throughput and storage efficiency. Use B-trees for secondary indexes to optimize specific query patterns.
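
To make the LSM write path concrete, here is a minimal sketch (all type and method names are illustrative): writes land in a sorted in-memory memtable and are flushed as immutable sorted runs, which is what gives LSM engines their sequential-write behaviour. Compaction, bloom filters, and the on-disk encoding are omitted.

package main

import (
    "fmt"
    "sort"
)

// entry is a single key-value pair inside a memtable or SSTable.
type entry struct {
    key   string
    value string
}

// sstable is an immutable, sorted run of entries, standing in for an on-disk file.
type sstable []entry

// lsmTree buffers recent writes in a memtable and flushes them to sorted runs.
type lsmTree struct {
    memtable  map[string]string
    threshold int
    levels    []sstable // newest run first; a real engine also compacts these
}

func (t *lsmTree) Put(key, value string) {
    t.memtable[key] = value
    if len(t.memtable) >= t.threshold {
        t.flush()
    }
}

// flush turns the memtable into a sorted immutable run (a sequential write on disk).
func (t *lsmTree) flush() {
    run := make(sstable, 0, len(t.memtable))
    for k, v := range t.memtable {
        run = append(run, entry{k, v})
    }
    sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
    t.levels = append([]sstable{run}, t.levels...)
    t.memtable = make(map[string]string)
}

// Get checks the memtable first, then runs from newest to oldest.
func (t *lsmTree) Get(key string) (string, bool) {
    if v, ok := t.memtable[key]; ok {
        return v, true
    }
    for _, run := range t.levels {
        i := sort.Search(len(run), func(i int) bool { return run[i].key >= key })
        if i < len(run) && run[i].key == key {
            return run[i].value, true
        }
    }
    return "", false
}

func main() {
    t := &lsmTree{memtable: map[string]string{}, threshold: 2}
    t.Put("a", "1")
    t.Put("b", "2") // triggers a flush
    t.Put("a", "3") // newer value shadows the flushed one
    v, _ := t.Get("a")
    fmt.Println(v) // 3
}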

Sharding and Partitioning

Distributing data across nodes requires careful consideration of access patterns:

Hash-Based Sharding:

  • Ensures uniform data distribution across shards
  • Simple to implement but range queries require querying all shards
  • Difficult to rebalance when adding/removing nodes

Range-Based Sharding:

  • Efficient for range queries
  • Can create hotspots if data access is skewed
  • Easier to rebalance through shard splitting/merging

Decision Point: Range-based sharding with automatic split/merge provides the best balance for general-purpose workloads. Use load monitoring to detect and split hot shards automatically.
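
A rough sketch of how range routing and shard splitting might look. The `shard` type and helpers below are hypothetical; they assume shards are kept sorted by start key and that the first shard's range is unbounded below.

package main

import (
    "fmt"
    "sort"
)

// shard describes a contiguous key range [start, end) served by one replica group.
type shard struct {
    start, end string // end == "" means unbounded above
    id         int
}

// routeKey finds the shard whose range contains key; shards must be sorted by start.
func routeKey(shards []shard, key string) shard {
    i := sort.Search(len(shards), func(i int) bool { return shards[i].start > key })
    return shards[i-1]
}

// splitShard splits a hot shard at splitKey, the basic rebalancing primitive
// that range partitioning enables.
func splitShard(shards []shard, id int, splitKey string, newID int) []shard {
    out := make([]shard, 0, len(shards)+1)
    for _, s := range shards {
        if s.id == id {
            out = append(out,
                shard{start: s.start, end: splitKey, id: s.id},
                shard{start: splitKey, end: s.end, id: newID})
            continue
        }
        out = append(out, s)
    }
    return out
}

func main() {
    shards := []shard{{start: "", id: 1}, {start: "m", id: 2}}
    fmt.Println(routeKey(shards, "laptop-123").id) // 1
    shards = splitShard(shards, 1, "g", 3)         // shard 1 got hot: split it
    fmt.Println(routeKey(shards, "laptop-123").id) // 3
}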

Distributed Transaction Protocol

Maintaining ACID guarantees across distributed nodes presents unique challenges:

Two-Phase Commit:

  • Simple protocol with prepare and commit phases
  • Coordinator can become single point of failure
  • Can block if coordinator fails during transaction

Three-Phase Commit:

  • Non-blocking variant of 2PC
  • More complex implementation with additional network round-trip
  • Better fault tolerance but higher latency

Decision Point: Use 2PC with a replicated coordinator to eliminate single point of failure while maintaining simplicity. The coordinator state is persisted to write-ahead log for crash recovery.
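
The sketch below shows only the coordinator side of 2PC under stated assumptions: the `participant` interface is hypothetical (a real system would speak gRPC to shard leaders), and the coordinator's own durability, the WAL write mentioned above, is noted only in comments.

package main

import (
    "errors"
    "fmt"
)

// participant is the coordinator's view of one shard in a transaction.
type participant interface {
    Prepare(txID string) error // nil vote means "yes, changes are durable and locked"
    Commit(txID string) error
    Abort(txID string) error
}

// twoPhaseCommit drives prepare/commit across all shards touched by the transaction.
// A production coordinator persists its decision to a replicated WAL before phase 2.
func twoPhaseCommit(txID string, parts []participant) error {
    // Phase 1: prepare. Any "no" vote or error aborts everywhere.
    for _, p := range parts {
        if err := p.Prepare(txID); err != nil {
            for _, q := range parts {
                q.Abort(txID) // best-effort here; retried from the coordinator log in practice
            }
            return fmt.Errorf("tx %s aborted: %w", txID, err)
        }
    }
    // Phase 2: commit. The decision is now final and must be retried until acknowledged.
    for _, p := range parts {
        if err := p.Commit(txID); err != nil {
            return fmt.Errorf("tx %s committed but participant needs retry: %w", txID, err)
        }
    }
    return nil
}

// fakeShard is a stand-in participant used to exercise the coordinator.
type fakeShard struct {
    name   string
    voteNo bool
}

func (s *fakeShard) Prepare(txID string) error {
    if s.voteNo {
        return errors.New(s.name + " votes no")
    }
    return nil
}
func (s *fakeShard) Commit(txID string) error { return nil }
func (s *fakeShard) Abort(txID string) error  { return nil }

func main() {
    parts := []participant{
        &fakeShard{name: "inventory"},
        &fakeShard{name: "orders"},
        &fakeShard{name: "payments"},
    }
    fmt.Println(twoPhaseCommit("tx-1", parts)) // <nil>
}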

Replication Strategy

Ensuring data availability and consistency requires choosing the right replication approach:

Single-Leader Replication:

  • All writes go through designated leader node
  • Strong consistency with simpler conflict resolution
  • Leader can become bottleneck under high write load

Multi-Leader Replication:

  • Multiple nodes can accept writes simultaneously
  • Better write throughput but requires conflict resolution
  • Complex consistency guarantees

Decision Point: Single-leader replication with Raft consensus provides the best balance of consistency, availability, and implementation complexity. Raft handles automatic leader election and ensures replicas stay synchronized.
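
As a sketch of the single-leader write path, the fragment below pushes every write through a toy Raft-like propose/apply interface. A real implementation would delegate to a library such as etcd/raft or hashicorp/raft; the types here are stand-ins, not their APIs.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// command is a single state-machine operation replicated through the Raft log.
type command struct {
    Op    string `json:"op"` // "put" or "delete"
    Key   string `json:"key"`
    Value string `json:"value,omitempty"`
}

// raftNode is a deliberately tiny stand-in for a real Raft implementation.
type raftNode struct {
    isLeader bool
    applyCh  chan []byte // entries the cluster has committed, in log order
}

// Propose appends an entry to the replicated log if this node is the leader.
func (r *raftNode) Propose(entry []byte) error {
    if !r.isLeader {
        return errors.New("not leader: redirect the client to the current leader")
    }
    // A real node replicates to a quorum before delivering on applyCh.
    r.applyCh <- entry
    return nil
}

// writeThroughRaft serializes the command, proposes it, and applies it to the
// local key-value state only once it comes back as committed.
func writeThroughRaft(r *raftNode, state map[string]string, cmd command) error {
    buf, err := json.Marshal(cmd)
    if err != nil {
        return err
    }
    if err := r.Propose(buf); err != nil {
        return err
    }
    committed := <-r.applyCh // every replica applies the same sequence
    var applied command
    if err := json.Unmarshal(committed, &applied); err != nil {
        return err
    }
    state[applied.Key] = applied.Value
    return nil
}

func main() {
    leader := &raftNode{isLeader: true, applyCh: make(chan []byte, 1)}
    state := map[string]string{}
    if err := writeThroughRaft(leader, state, command{Op: "put", Key: "user:1", Value: "alice"}); err != nil {
        fmt.Println("write failed:", err)
        return
    }
    fmt.Println(state["user:1"]) // alice
}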

Query Optimization Approach

Generating efficient query execution plans is critical for performance:

Rule-Based Optimization:

  • Fast planning time with predefined transformation rules
  • Predictable but may miss optimal plans for complex queries
  • No dependency on runtime statistics

Cost-Based Optimization:

  • Evaluates multiple execution plans using statistics
  • Generates optimal plans but requires statistics collection
  • Planning time increases with query complexity

Decision Point: Use a hybrid approach—rule-based optimization for simple queries and cost-based optimization for complex multi-way joins. Maintain table statistics through periodic sampling.
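
A minimal sketch of the cost-based half of that hybrid: given sampled table statistics, estimate the selectivity of an equality predicate and compare a sequential scan against an index lookup. The cost constants and type names are illustrative only.

package main

import "fmt"

// tableStats is the kind of statistic maintained through periodic sampling.
type tableStats struct {
    rowCount       float64
    distinctValues map[string]float64 // column -> approximate number of distinct values
}

// plan is a candidate access path with an estimated cost in abstract I/O units.
type plan struct {
    name string
    cost float64
}

// choosePlan compares a full table scan against an index lookup on an equality
// predicate, using the usual 1/NDV selectivity estimate.
func choosePlan(stats tableStats, column string, hasIndex bool) plan {
    const (
        seqRowCost   = 1.0 // sequential read of one row, relative units
        indexRowCost = 4.0 // random access through the index is pricier per row
    )
    scan := plan{name: "SeqScan", cost: stats.rowCount * seqRowCost}
    if !hasIndex {
        return scan
    }
    selectivity := 1.0 / stats.distinctValues[column]
    matching := stats.rowCount * selectivity
    index := plan{name: "IndexScan(" + column + ")", cost: matching * indexRowCost}
    if index.cost < scan.cost {
        return index
    }
    return scan
}

func main() {
    stats := tableStats{
        rowCount:       1_000_000,
        distinctValues: map[string]float64{"email": 950_000, "status": 3},
    }
    for _, col := range []string{"email", "status"} {
        p := choosePlan(stats, col, true)
        fmt.Printf("WHERE %s = ? -> %s (est. cost %.0f)\n", col, p.name, p.cost)
    }
    // email (high cardinality) picks the index; status (3 values) falls back to SeqScan.
}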


Acceptance Criteria

Your distributed database system is complete when it meets these criteria:

Functional Completeness:

  • Successfully executes distributed transactions across 3+ shards
  • Handles node failures with automatic failover within 30 seconds
  • Supports SQL queries including JOINs, aggregations, and subqueries
  • Implements automatic data sharding with configurable partitioning
  • Provides point-in-time recovery for backup restoration
  • Includes monitoring dashboard with key performance metrics

Performance Targets:

  • Point queries complete in < 10ms
  • System sustains 100,000+ queries per second across cluster
  • Write operations achieve 50,000+ ops/sec per node
  • Storage compression achieves 80%+ space savings
  • Query optimizer improves execution time by 50%+ for complex queries

Reliability Verification:

  • Zero data loss when nodes fail
  • Transactions maintain ACID properties under concurrent load
  • System recovers from complete cluster failure within 5 minutes
  • No split-brain scenarios occur during network partitions
  • All data includes checksums to detect corruption

Code Quality:

  • Unit test coverage exceeds 70% for core components
  • Integration tests cover distributed transaction scenarios
  • Performance benchmarks document latency and throughput
  • Code follows Go best practices and includes comprehensive comments
  • README provides clear setup and deployment instructions

Usage Examples

Basic SQL Operations

// Connect to distributed database
db, err := sql.Open("distdb", "host=localhost port=5432 user=admin")
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// Create table
_, err = db.Exec(`
    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE,
        name VARCHAR(255),
        created_at TIMESTAMP DEFAULT NOW()
    ) WITH (
        SHARDING = 'hash',
        SHARDS = 8,
        REPLICATION = 3
    )
`)

// Insert data
result, err := db.Exec(`
    INSERT INTO users (email, name)
    VALUES ($1, $2)
`, "alice@example.com", "Alice")

// Query with automatic shard routing
var user User
err = db.QueryRow(`
    SELECT id, email, name, created_at
    FROM users
    WHERE email = $1
`, "alice@example.com").Scan(&user.ID, &user.Email, &user.Name, &user.CreatedAt)

Distributed Transactions

// Begin distributed transaction
tx, err := db.Begin()
if err != nil {
    log.Fatal(err)
}
defer tx.Rollback()

// Transfer money between accounts
_, err = tx.Exec(`
    UPDATE accounts
    SET balance = balance - $1
    WHERE user_id = $2
`, 100.00, senderID)

_, err = tx.Exec(`
    UPDATE accounts
    SET balance = balance + $1
    WHERE user_id = $2
`, 100.00, receiverID)

// Record transaction history
_, err = tx.Exec(`
    INSERT INTO transactions (sender_id, receiver_id, amount)
    VALUES ($1, $2, $3)
`, senderID, receiverID, 100.00)

// Commit with 2PC across all involved shards
if err := tx.Commit(); err != nil {
    log.Printf("Transaction failed: %v", err)
    return err
}

Query Optimization

// Complex query with JOIN and aggregation
query := `
    SELECT
        u.name,
        COUNT(o.id) AS order_count,
        SUM(o.total) AS total_spent
    FROM users u
    JOIN orders o ON u.id = o.user_id
    WHERE u.country = $1
      AND o.created_at > $2
    GROUP BY u.id, u.name
    HAVING SUM(o.total) > $3
    ORDER BY total_spent DESC
    LIMIT 10`
rows, err := db.Query(query, "US", time.Now().AddDate(0, -1, 0), 1000.00)

// Explain query plan
var plan string
db.QueryRow("EXPLAIN " + query).Scan(&plan)
fmt.Println(plan)
// Output:
// Exchange(Gather) -> Aggregate(SUM) -> HashJoin(users, orders)
// -> IndexScan(users.country) -> IndexScan(orders.created_at)

Cluster Management

// Add new node to cluster
client := admin.NewClient("localhost:9090")

err := client.AddNode(ctx, &admin.NodeConfig{
    NodeID:   4,
    Address:  "192.168.1.4:9001",
    DataDir:  "/data/node4",
    ShardIDs: []int{3, 4}, // Shards to host
})

// Rebalance shards across nodes
err = client.RebalanceCluster(ctx, &admin.RebalanceConfig{
    MaxShardSize: 10 * 1024 * 1024 * 1024, // 10 GB
    MinShardSize: 1 * 1024 * 1024 * 1024,  // 1 GB
})

// Check cluster health
status, err := client.ClusterStatus(ctx)
fmt.Printf("Nodes: %d, Shards: %d, Health: %s\n",
    status.NodeCount, status.ShardCount, status.Health)

Backup and Recovery

// Create full backup
backupClient := backup.NewClient("localhost:9090")

err := backupClient.CreateBackup(ctx, &backup.BackupConfig{
    Type:        backup.FullBackup,
    Destination: "s3://my-bucket/backups/",
    Compression: backup.Zstd,
})

// Create incremental backup
err = backupClient.CreateBackup(ctx, &backup.BackupConfig{
    Type:        backup.IncrementalBackup,
    Destination: "s3://my-bucket/backups/",
    SinceBackup: lastBackupID,
})

// Restore to specific point in time
pointInTime, _ := time.Parse(time.RFC3339, "2024-01-15T11:30:00Z")
err = backupClient.RestoreBackup(ctx, &backup.RestoreConfig{
    BackupID:    "backup-20240115-120000",
    PointInTime: pointInTime,
    Target:      "new-cluster",
})

Performance Monitoring

// Query performance metrics
metrics := monitoring.NewClient("localhost:9090")

stats, err := metrics.GetQueryStats(ctx, &monitoring.QueryFilter{
    TimeRange:  monitoring.Last24Hours,
    MinLatency: 100 * time.Millisecond, // Slow queries
})

for _, stat := range stats.Queries {
    fmt.Printf("Query: %s\n", stat.SQL)
    fmt.Printf("Executions: %d\n", stat.ExecutionCount)
    fmt.Printf("Avg Latency: %v\n", stat.AvgLatency)
    fmt.Printf("P95 Latency: %v\n", stat.P95Latency)
    fmt.Printf("P99 Latency: %v\n", stat.P99Latency)
}

// Storage metrics
storage, err := metrics.GetStorageStats(ctx)
fmt.Printf("Total Data: %s\n", storage.TotalSize)
fmt.Printf("Compression Ratio: %.2f\n", storage.CompressionRatio)
fmt.Printf("Disk Usage: %.2f%%\n", storage.DiskUsagePercent)

Key Takeaways

Database Internals Mastery:

  1. Storage Engines: LSM-trees provide superior write performance for modern workloads through sequential I/O and efficient compaction strategies
  2. Distributed Consensus: Raft consensus ensures strong consistency and automatic failover without complex conflict resolution
  3. Transaction Protocols: Two-phase commit with replicated coordinator provides ACID guarantees across distributed shards
  4. Query Optimization: Cost-based optimization using table statistics generates execution plans 10-100x faster than naive approaches

Distributed Systems Principles:

  1. CAP Theorem: This system prioritizes Consistency and Partition tolerance, accepting temporary unavailability during network splits
  2. Quorum-Based Replication: 3-way replication with quorum reads/writes ensures data survives node failures while maintaining consistency
  3. Sharding Strategies: Range-based partitioning with automatic splitting provides better query locality than hash-based approaches
  4. Failure Handling: Write-ahead logging combined with Raft consensus enables crash recovery without data loss
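
As an illustration of the write-ahead logging half of that point, here is a minimal sketch (the file format and names are invented for the example): every change is appended and fsynced before being acknowledged, and recovery replays the log to rebuild state after a crash.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// walRecord is one logical change; real WALs use binary framing plus checksums.
type walRecord struct {
    Key, Value string
}

// appendRecord writes the record and fsyncs before the write is acknowledged,
// which is what lets committed data survive a crash.
func appendRecord(f *os.File, rec walRecord) error {
    if _, err := fmt.Fprintf(f, "%s\t%s\n", rec.Key, rec.Value); err != nil {
        return err
    }
    return f.Sync()
}

// replay rebuilds the in-memory state from the log after a restart.
func replay(path string) (map[string]string, error) {
    state := make(map[string]string)
    f, err := os.Open(path)
    if err != nil {
        if os.IsNotExist(err) {
            return state, nil // no log yet: fresh node
        }
        return nil, err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        parts := strings.SplitN(sc.Text(), "\t", 2)
        if len(parts) == 2 {
            state[parts[0]] = parts[1]
        }
    }
    return state, sc.Err()
}

func main() {
    const path = "demo.wal"
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        panic(err)
    }
    appendRecord(f, walRecord{Key: "user:1", Value: "alice"})
    f.Close()

    state, _ := replay(path) // what crash recovery would do on restart
    fmt.Println(state["user:1"])
    os.Remove(path)
}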

Performance Engineering:

  1. Write Amplification: LSM-trees trade write amplification during compaction for better write throughput on SSDs
  2. Read Optimization: Bloom filters and block caches reduce unnecessary disk I/O by 90%+ for hot data (a bloom filter sketch follows this list)
  3. Query Planning: Predicate pushdown and join reordering can reduce query execution time from minutes to milliseconds
  4. Caching Strategies: Multi-level caching is essential for sub-10ms latency
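
To illustrate the bloom-filter technique mentioned above, here is a small self-contained sketch; the bit-set size and hash scheme are arbitrary choices for the example. A "no" answer is definitive, so the read path can skip SSTables that cannot contain the key.

package main

import (
    "fmt"
    "hash/fnv"
)

// bloomFilter is a fixed-size bit set probed by k derived hash functions.
type bloomFilter struct {
    bits []bool
    k    uint32
}

func newBloomFilter(size int, k uint32) *bloomFilter {
    return &bloomFilter{bits: make([]bool, size), k: k}
}

// positions derives k probe positions from one 64-bit FNV hash (double hashing).
func (b *bloomFilter) positions(key string) []uint32 {
    h := fnv.New64a()
    h.Write([]byte(key))
    sum := h.Sum64()
    h1, h2 := uint32(sum), uint32(sum>>32)
    out := make([]uint32, b.k)
    for i := uint32(0); i < b.k; i++ {
        out[i] = (h1 + i*h2) % uint32(len(b.bits))
    }
    return out
}

func (b *bloomFilter) Add(key string) {
    for _, p := range b.positions(key) {
        b.bits[p] = true
    }
}

// MightContain returns false only when the key was definitely never added.
func (b *bloomFilter) MightContain(key string) bool {
    for _, p := range b.positions(key) {
        if !b.bits[p] {
            return false
        }
    }
    return true
}

func main() {
    f := newBloomFilter(1024, 3)
    f.Add("laptop-123")
    fmt.Println(f.MightContain("laptop-123")) // true
    fmt.Println(f.MightContain("phone-999"))  // false (almost certainly)
}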

Production Considerations:

  1. Monitoring: Track latency, throughput, replication lag, and compaction metrics for early problem detection
  2. Backup Strategy: Combine full backups with incremental backups for point-in-time recovery
  3. Capacity Planning: Plan for 3x data growth headroom and monitor shard sizes to trigger automatic splitting
  4. Operational Excellence: Automated failover, rolling upgrades, and chaos testing are essential for 99.99% uptime

Career Impact:

  • Understanding distributed database internals positions you for roles at database companies
  • This project demonstrates systems programming skills valued at cloud providers
  • Experience with consensus algorithms and distributed transactions is rare and highly sought after
  • The ability to design petabyte-scale systems is a distinguishing factor for senior/staff engineer roles

Next Steps

Immediate Extensions:

  1. Implement secondary indexes with B-tree structures for faster lookups
  2. Add query caching layer to reduce load on storage engine
  3. Build connection pooling to handle 10,000+ concurrent clients
  4. Implement prepared statements for query performance optimization
  5. Add deadlock detection and resolution for distributed transactions

Advanced Features:

  1. Geo-replication for multi-region deployments with tunable consistency
  2. Materialized views for pre-computed aggregations
  3. Change Data Capture to stream changes to Kafka or other systems
  4. Time-series optimization for IoT and monitoring workloads
  5. Full-text search using inverted indexes

Production Hardening:

  1. Chaos engineering tests
  2. Load testing to verify 100k+ QPS target with realistic workloads
  3. Security audit including authentication, authorization, and encryption
  4. Performance profiling to identify and eliminate bottlenecks
  5. Documentation for operators and contributors

Learning Resources:

  1. Read "Designing Data-Intensive Applications" by Martin Kleppmann
  2. Study open-source databases: CockroachDB, TiDB, YugabyteDB
  3. Review research papers: Bigtable, Spanner, Raft, Percolator
  4. Explore database internals: storage formats, query execution, optimization
  5. Join database engineering communities and contribute to open source

Career Development:

  1. Write technical blog posts about your implementation choices and tradeoffs
  2. Present your project at local meetups or conferences
  3. Contribute to open-source database projects
  4. Apply for database engineer or distributed systems roles
  5. Consider specializing in query optimization, storage engines, or distributed consensus

Download Complete Solution

The complete implementation includes:

  • Full LSM-tree storage engine with compaction, bloom filters, and block cache
  • Raft consensus implementation for leader election and log replication
  • Distributed transaction coordinator using two-phase commit protocol
  • Range-based sharding with automatic split/merge functionality
  • SQL query parser and optimizer with cost-based optimization
  • Write-ahead logging for crash recovery and durability
  • Comprehensive test suite including unit, integration, and chaos tests
  • Docker deployment configuration for multi-node clusters
  • Monitoring dashboards with Prometheus and Grafana integration
  • Complete README with architecture documentation, setup instructions, and deployment guide

The downloaded package includes all source code, comprehensive tests, Docker configuration, and a detailed README with an implementation guide, architecture decisions, and deployment instructions. Read the README first to understand the complete system architecture and the step-by-step implementation details.


References

Academic Papers:

  • "Bigtable: A Distributed Storage System for Structured Data" (Chang et al., OSDI 2006)
  • "Spanner: Google's Globally-Distributed Database" (Corbett et al., OSDI 2012)
  • "In Search of an Understandable Consensus Algorithm" (Ongaro and Ousterhout, USENIX ATC 2014), the Raft paper
  • "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (Peng and Dabek, OSDI 2010), the Percolator paper

Books:

  • "Designing Data-Intensive Applications" by Martin Kleppmann - Essential distributed systems knowledge
  • "Database Internals" by Alex Petrov - Deep dive into storage engines and query processing
  • "The Art of PostgreSQL" by Dimitri Fontaine - SQL optimization and best practices

Open Source References:

  • CockroachDB - Distributed SQL database in Go
  • TiDB - MySQL-compatible distributed database
  • etcd - Production Raft implementation in Go
  • BadgerDB - LSM-tree storage engine in Go
  • LevelDB - Original LSM-tree implementation

Congratulations! You've built a production-grade distributed database with strong consistency guarantees, horizontal scalability, and performance comparable to commercial systems. This project demonstrates deep understanding of distributed systems, storage engines, and database internals—skills highly valued at companies like Google, Amazon, Snowflake, and MongoDB. Use this project as a foundation for exploring advanced topics in distributed databases or as a portfolio piece for systems engineering roles.