Final Project 2: Distributed Database System

Learning Objectives

After completing this project, you will master:

Core Database Engineering:

  • Design and implement distributed storage systems with sharding and replication
  • Build LSM-tree and B-tree storage engines from scratch with compression
  • Implement ACID transaction processing with distributed consensus protocols
  • Create SQL query engines with parsers, optimizers, and executors
  • Design write-ahead logging and crash recovery mechanisms

Distributed Systems Mastery:

  • Implement Raft consensus for leader election and data replication
  • Handle distributed transactions with two-phase and three-phase commit protocols
  • Design partitioning strategies for horizontal scaling
  • Build automatic failover and recovery systems with quorum-based consistency
  • Implement geo-replication for multi-region deployments

Performance Engineering:

  • Optimize query performance with cost-based optimizers and execution plans
  • Design caching strategies and connection pooling for high throughput
  • Implement compaction algorithms for space reclamation in LSM-trees
  • Build monitoring and profiling systems for performance analysis
  • Handle 100,000+ queries per second with sub-10ms latency requirements

Production Systems Architecture:

  • Design backup and restore systems with point-in-time recovery
  • Implement schema management with DDL operations and migrations
  • Build multi-tenant isolation and security mechanisms
  • Create monitoring and alerting for distributed system health
  • Design systems that scale from gigabytes to petabytes seamlessly

Progressive Complexity Phases

Phase 1: Foundation

  • Basic storage engine with key-value operations
  • Simple query parser and executor for basic SQL
  • Single-node transactions with MVCC implementation
  • Basic in-memory storage and persistence to disk
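
To make Phase 1 concrete, here is a minimal sketch of a versioned in-memory key-value store, the simplest form of the MVCC behaviour this phase asks for. All type and method names are illustrative rather than a required API; persistence and the SQL layer are deliberately left out.

package main

import (
    "fmt"
    "sync"
)

// version pairs a commit timestamp with a value, enabling snapshot reads.
type version struct {
    ts    uint64
    value []byte
}

// versionedStore keeps a version chain per key: the core of single-node MVCC.
type versionedStore struct {
    mu       sync.RWMutex
    clock    uint64
    versions map[string][]version // newest version appended last
}

func newVersionedStore() *versionedStore {
    return &versionedStore{versions: make(map[string][]version)}
}

// Put records a new version of key at the next commit timestamp.
func (s *versionedStore) Put(key string, value []byte) uint64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.clock++
    s.versions[key] = append(s.versions[key], version{ts: s.clock, value: value})
    return s.clock
}

// GetAt returns the newest version of key visible at snapshot timestamp ts.
func (s *versionedStore) GetAt(key string, ts uint64) ([]byte, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    vs := s.versions[key]
    for i := len(vs) - 1; i >= 0; i-- {
        if vs[i].ts <= ts {
            return vs[i].value, true
        }
    }
    return nil, false
}

func main() {
    s := newVersionedStore()
    t1 := s.Put("user:1", []byte("alice"))
    s.Put("user:1", []byte("alice-v2"))
    if v, ok := s.GetAt("user:1", t1); ok {
        fmt.Printf("snapshot at t1 sees %s\n", v) // older readers still see "alice"
    }
}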

Phase 2: Distribution

  • Raft consensus implementation for leader election
  • Data sharding and partitioning across multiple nodes
  • Basic replication with follower nodes
  • Simple distributed transaction coordination

Phase 3: Advanced Features

  • Two-phase commit for distributed transactions
  • Query optimization with cost-based planner
  • LSM-tree storage engine with compaction
  • Secondary indexes and efficient query execution

Phase 4: Production Hardening

  • Automatic failover and recovery mechanisms
  • Backup and restore with point-in-time recovery
  • Performance monitoring and optimization
  • Security and multi-tenancy features

Scope Boundaries

Included in Scope:

  • Building distributed storage and replication systems
  • Implementing consensus protocols for consistency
  • Creating SQL query engines with optimizers
  • Designing storage engines
  • Building distributed transaction systems
  • Implementing backup/restore and recovery mechanisms

Out of Scope:

  • Building custom network protocols for node communication
  • Implementing advanced cryptographic security features
  • Creating machine learning query optimization
  • Building graphical database administration tools
  • Implementing stored procedures and user-defined functions
  • Advanced features like full-text search or time-series optimization

Prerequisites

Technical Skills Required:

  • Advanced Go programming
  • Understanding of database concepts
  • Knowledge of distributed systems fundamentals
  • Familiarity with data structures and algorithms
  • Experience with system programming and memory management

Recommended Background:

  • Previous experience building database systems or storage engines
  • Understanding of network programming and RPC frameworks
  • Knowledge of operating systems concepts
  • Familiarity with performance profiling and optimization
  • Experience with large-scale system design

Problem Statement

You're the founding engineer at a startup building a next-generation database system. Your customers are growing rapidly: one SaaS company went from 10 GB to 10 TB of data in 6 months, and their existing database can't keep up. They're experiencing:

  • Performance Degradation: Queries that took 50ms now take 5+ seconds
  • Storage Limitations: Single server disk is full; can't add more storage
  • Scalability Wall: Can't scale reads or writes horizontally
  • No Geographic Distribution: Customers in Asia experience 300ms+ latency
  • Single Point of Failure: Database downtime means entire application is down

The Business Challenge:

Build a distributed database that solves these problems while maintaining strong consistency guarantees:

Customer Requirements:

  • Handle petabyte-scale data
  • Support 100,000+ queries per second across all nodes
  • Sub-10ms latency for point queries
  • ACID transactions
  • SQL-like query interface
  • Geographic replication with automatic failover
  • Zero-downtime upgrades and maintenance
  • Point-in-time recovery

Current Market Gap:

  • PostgreSQL/MySQL: Great for single-node, but hard to scale horizontally
  • MongoDB: Scales well but weak transaction guarantees
  • Cassandra: Scales amazingly but eventual consistency is hard for developers
  • Google Spanner: Strong consistency at global scale, but costs $1,000+/month even for small deployments
  • Your Solution: Strong consistency + horizontal scaling + affordable

Real-World Scenario:

An e-commerce company is migrating from PostgreSQL to your database:

-- This transaction must be ACID
BEGIN;
  -- Deduct inventory
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'laptop-123';

  -- Create order
  INSERT INTO orders (order_id, product_id, status)
  VALUES ('order-456', 'laptop-123', 'created');

  -- Charge payment
  INSERT INTO payments (order_id, amount, status)
  VALUES ('order-456', 1299.00, 'pending');
COMMIT;  -- Either all succeed or all fail

This transaction spans 3 shards in 3 different regions. Your database must guarantee:

  1. Atomicity: All 3 operations succeed or all fail
  2. Consistency: Inventory never goes negative
  3. Isolation: Concurrent transactions don't see partial updates
  4. Durability: Once committed, data survives node failures

Requirements

Functional Requirements

Must Have:

  1. Distributed Storage: Data sharded across multiple nodes for horizontal scaling
  2. Replication: Each shard replicated 3x for fault tolerance
  3. ACID Transactions: Full support for distributed transactions with 2PC/3PC
  4. Query Engine: SQL-like query language with JOIN, GROUP BY, aggregations
  5. Query Optimizer: Cost-based optimization
  6. Storage Engine: LSM-tree or B-tree with compression and caching
  7. Consensus Protocol: Raft or Paxos for leader election and replication
  8. Automatic Sharding: Data distributed by hash or range-based partitioning
  9. Read/Write Isolation: Snapshot isolation or serializable isolation
  10. Write-Ahead Log: Durability with crash recovery
  11. Automatic Failover: Detect node failures and promote replica to leader
  12. Backup & Restore: Full and incremental backups with point-in-time recovery
  13. Monitoring: Metrics for query latency, throughput, replication lag
  14. Multi-Tenancy: Logical databases with isolated data
  15. Schema Management: DDL support

Should Have:

  1. Secondary Indexes: B-tree indexes for fast lookups
  2. Query Caching: Cache frequent queries in memory
  3. Connection Pooling: Reuse connections to reduce overhead
  4. Prepared Statements: Pre-compile queries for performance
  5. Batch Operations: Bulk inserts/updates for efficiency
  6. Geo-Replication: Multi-region deployment with low-latency reads
  7. Compaction: LSM-tree compaction for space reclamation
  8. Statistics Collection: Table/column statistics for query optimization
  9. Deadlock Detection: Detect and resolve distributed deadlocks
  10. Rate Limiting: Prevent runaway queries from consuming resources

Nice to Have:

  1. Materialized Views: Pre-computed aggregations for fast analytics
  2. Full-Text Search: Inverted indexes for text search
  3. Time-Series Optimization: Efficient storage for time-series data
  4. Change Data Capture: Stream changes to external systems
  5. Multi-Version Concurrency Control: Non-blocking reads

Non-Functional Requirements

Performance:

  • Point Queries: p95 < 10ms, p99 < 20ms
  • Range Scans: 10,000 rows/sec per node
  • Write Throughput: 50,000 writes/sec per node
  • Transaction Latency: p95 < 50ms for distributed transactions
  • Join Performance: 1M rows joined in < 1 second

Scalability:

  • Horizontal Scaling: Add nodes to increase capacity linearly
  • Data Size: Support 1 PB+ of data across cluster
  • Concurrent Connections: 10,000+ simultaneous connections
  • Shard Count: 1,000+ shards in a single cluster
  • Replica Lag: < 100ms replication lag under normal load

Reliability:

  • Availability: 99.99% uptime
  • Durability: 99.9999999% (lose at most 1 write per billion)
  • Fault Tolerance: Survive 2 simultaneous node failures
  • Recovery Time: < 30 seconds for automatic failover
  • Data Integrity: Zero data corruption

Consistency:

  • Default: Snapshot isolation
  • Strict Mode: Serializable isolation
  • Tunable Consistency: Allow relaxed consistency for specific queries

Constraints

Technical Constraints:

  1. Programming Language: Go for performance and concurrency
  2. Storage Format: Column-oriented for analytics, row-oriented for OLTP
  3. Network Protocol: gRPC for inter-node communication
  4. Consensus: Raft
  5. Wire Protocol: PostgreSQL-compatible protocol

Resource Constraints:

  1. Memory: Assume 16 GB RAM per node minimum
  2. Disk: NVMe SSD for low-latency writes
  3. Network: 10 Gbps minimum for inter-node traffic

Business Constraints:

  1. Development Time: 12 weeks for MVP
  2. Team Size: 3-4 developers
  3. Compatibility: Must support PostgreSQL wire protocol for easy migration

Design Considerations

Storage Engine Strategy

When building a distributed database, the choice of storage engine significantly impacts performance and scalability. Consider these fundamental tradeoffs:

LSM-Tree:

  • Optimized for write-heavy workloads with sequential I/O
  • Natural fit for modern SSDs which prefer sequential writes
  • Requires compaction to maintain read performance
  • Used by systems like RocksDB, LevelDB, and Cassandra

B-Tree:

  • Optimized for read-heavy workloads with in-place updates
  • Efficient range scans and point queries
  • Can suffer from fragmentation over time
  • Used by traditional databases like PostgreSQL and MySQL

Decision Point: For modern cloud workloads, LSM-trees generally provide better write throughput and storage efficiency. Use B-trees for secondary indexes to optimize specific query patterns.
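
To make the LSM write path concrete, here is a minimal sketch (all type and method names are illustrative): writes land in a sorted in-memory memtable and are flushed as immutable sorted runs, which is what gives LSM engines their sequential-write behaviour. Compaction, bloom filters, and the on-disk encoding are omitted.

package main

import (
    "fmt"
    "sort"
)

// entry is a single key-value pair inside a memtable or SSTable.
type entry struct {
    key   string
    value string
}

// sstable is an immutable, sorted run of entries, standing in for an on-disk file.
type sstable []entry

// lsmTree buffers recent writes in a memtable and flushes them to sorted runs.
type lsmTree struct {
    memtable  map[string]string
    threshold int
    levels    []sstable // newest run first; a real engine also compacts these
}

func (t *lsmTree) Put(key, value string) {
    t.memtable[key] = value
    if len(t.memtable) >= t.threshold {
        t.flush()
    }
}

// flush turns the memtable into a sorted immutable run (a sequential write on disk).
func (t *lsmTree) flush() {
    run := make(sstable, 0, len(t.memtable))
    for k, v := range t.memtable {
        run = append(run, entry{k, v})
    }
    sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
    t.levels = append([]sstable{run}, t.levels...)
    t.memtable = make(map[string]string)
}

// Get checks the memtable first, then runs from newest to oldest.
func (t *lsmTree) Get(key string) (string, bool) {
    if v, ok := t.memtable[key]; ok {
        return v, true
    }
    for _, run := range t.levels {
        i := sort.Search(len(run), func(i int) bool { return run[i].key >= key })
        if i < len(run) && run[i].key == key {
            return run[i].value, true
        }
    }
    return "", false
}

func main() {
    t := &lsmTree{memtable: map[string]string{}, threshold: 2}
    t.Put("a", "1")
    t.Put("b", "2") // triggers a flush
    t.Put("a", "3") // newer value shadows the flushed one
    v, _ := t.Get("a")
    fmt.Println(v) // 3
}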

Sharding and Partitioning

Distributing data across nodes requires careful consideration of access patterns:

Hash-Based Sharding:

  • Ensures uniform data distribution across shards
  • Simple to implement but range queries require querying all shards
  • Difficult to rebalance when adding/removing nodes

Range-Based Sharding:

  • Efficient for range queries
  • Can create hotspots if data access is skewed
  • Easier to rebalance through shard splitting/merging

Decision Point: Range-based sharding with automatic split/merge provides the best balance for general-purpose workloads. Use load monitoring to detect and split hot shards automatically.
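
A rough sketch of how range routing and shard splitting might look. The `shard` type and helpers below are hypothetical; they assume shards are kept sorted by start key and that the first shard's range is unbounded below.

package main

import (
    "fmt"
    "sort"
)

// shard describes a contiguous key range [start, end) served by one replica group.
type shard struct {
    start, end string // end == "" means unbounded above
    id         int
}

// routeKey finds the shard whose range contains key; shards must be sorted by start.
func routeKey(shards []shard, key string) shard {
    i := sort.Search(len(shards), func(i int) bool { return shards[i].start > key })
    return shards[i-1]
}

// splitShard splits a hot shard at splitKey, the basic rebalancing primitive
// that range partitioning enables.
func splitShard(shards []shard, id int, splitKey string, newID int) []shard {
    out := make([]shard, 0, len(shards)+1)
    for _, s := range shards {
        if s.id == id {
            out = append(out,
                shard{start: s.start, end: splitKey, id: s.id},
                shard{start: splitKey, end: s.end, id: newID})
            continue
        }
        out = append(out, s)
    }
    return out
}

func main() {
    shards := []shard{{start: "", id: 1}, {start: "m", id: 2}}
    fmt.Println(routeKey(shards, "laptop-123").id) // 1
    shards = splitShard(shards, 1, "g", 3)         // shard 1 got hot: split it
    fmt.Println(routeKey(shards, "laptop-123").id) // 3
}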

Distributed Transaction Protocol

Maintaining ACID guarantees across distributed nodes presents unique challenges:

Two-Phase Commit:

  • Simple protocol with prepare and commit phases
  • Coordinator can become single point of failure
  • Can block if coordinator fails during transaction

Three-Phase Commit:

  • Non-blocking variant of 2PC
  • More complex implementation with additional network round-trip
  • Better fault tolerance but higher latency

Decision Point: Use 2PC with a replicated coordinator to eliminate single point of failure while maintaining simplicity. The coordinator state is persisted to write-ahead log for crash recovery.
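
The sketch below shows only the coordinator side of 2PC under stated assumptions: the `participant` interface is hypothetical (a real system would speak gRPC to shard leaders), and the coordinator's own durability, the WAL write mentioned above, is noted only in comments.

package main

import (
    "errors"
    "fmt"
)

// participant is the coordinator's view of one shard in a transaction.
type participant interface {
    Prepare(txID string) error // nil vote means "yes, changes are durable and locked"
    Commit(txID string) error
    Abort(txID string) error
}

// twoPhaseCommit drives prepare/commit across all shards touched by the transaction.
// A production coordinator persists its decision to a replicated WAL before phase 2.
func twoPhaseCommit(txID string, parts []participant) error {
    // Phase 1: prepare. Any "no" vote or error aborts everywhere.
    for _, p := range parts {
        if err := p.Prepare(txID); err != nil {
            for _, q := range parts {
                q.Abort(txID) // best-effort here; retried from the coordinator log in practice
            }
            return fmt.Errorf("tx %s aborted: %w", txID, err)
        }
    }
    // Phase 2: commit. The decision is now final and must be retried until acknowledged.
    for _, p := range parts {
        if err := p.Commit(txID); err != nil {
            return fmt.Errorf("tx %s committed but participant needs retry: %w", txID, err)
        }
    }
    return nil
}

// fakeShard is a stand-in participant used to exercise the coordinator.
type fakeShard struct {
    name   string
    voteNo bool
}

func (s *fakeShard) Prepare(txID string) error {
    if s.voteNo {
        return errors.New(s.name + " votes no")
    }
    return nil
}
func (s *fakeShard) Commit(txID string) error { return nil }
func (s *fakeShard) Abort(txID string) error  { return nil }

func main() {
    parts := []participant{
        &fakeShard{name: "inventory"},
        &fakeShard{name: "orders"},
        &fakeShard{name: "payments"},
    }
    fmt.Println(twoPhaseCommit("tx-1", parts)) // <nil>
}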

Replication Strategy

Ensuring data availability and consistency requires choosing the right replication approach:

Single-Leader Replication:

  • All writes go through designated leader node
  • Strong consistency with simpler conflict resolution
  • Leader can become bottleneck under high write load

Multi-Leader Replication:

  • Multiple nodes can accept writes simultaneously
  • Better write throughput but requires conflict resolution
  • Complex consistency guarantees

Decision Point: Single-leader replication with Raft consensus provides the best balance of consistency, availability, and implementation complexity. Raft handles automatic leader election and ensures replicas stay synchronized.
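
As a sketch of the single-leader write path, the fragment below pushes every write through a toy Raft-like propose/apply interface. A real implementation would delegate to a library such as etcd/raft or hashicorp/raft; the types here are stand-ins, not their APIs.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// command is a single state-machine operation replicated through the Raft log.
type command struct {
    Op    string `json:"op"` // "put" or "delete"
    Key   string `json:"key"`
    Value string `json:"value,omitempty"`
}

// raftNode is a deliberately tiny stand-in for a real Raft implementation.
type raftNode struct {
    isLeader bool
    applyCh  chan []byte // entries the cluster has committed, in log order
}

// Propose appends an entry to the replicated log if this node is the leader.
func (r *raftNode) Propose(entry []byte) error {
    if !r.isLeader {
        return errors.New("not leader: redirect the client to the current leader")
    }
    // A real node replicates to a quorum before delivering on applyCh.
    r.applyCh <- entry
    return nil
}

// writeThroughRaft serializes the command, proposes it, and applies it to the
// local key-value state only once it comes back as committed.
func writeThroughRaft(r *raftNode, state map[string]string, cmd command) error {
    buf, err := json.Marshal(cmd)
    if err != nil {
        return err
    }
    if err := r.Propose(buf); err != nil {
        return err
    }
    committed := <-r.applyCh // every replica applies the same sequence
    var applied command
    if err := json.Unmarshal(committed, &applied); err != nil {
        return err
    }
    state[applied.Key] = applied.Value
    return nil
}

func main() {
    leader := &raftNode{isLeader: true, applyCh: make(chan []byte, 1)}
    state := map[string]string{}
    if err := writeThroughRaft(leader, state, command{Op: "put", Key: "user:1", Value: "alice"}); err != nil {
        fmt.Println("write failed:", err)
        return
    }
    fmt.Println(state["user:1"]) // alice
}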

Query Optimization Approach

Generating efficient query execution plans is critical for performance:

Rule-Based Optimization:

  • Fast planning time with predefined transformation rules
  • Predictable but may miss optimal plans for complex queries
  • No dependency on runtime statistics

Cost-Based Optimization:

  • Evaluates multiple execution plans using statistics
  • Generates optimal plans but requires statistics collection
  • Planning time increases with query complexity

Decision Point: Use a hybrid approach—rule-based optimization for simple queries and cost-based optimization for complex multi-way joins. Maintain table statistics through periodic sampling.
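
A minimal sketch of the cost-based half of that hybrid: given sampled table statistics, estimate the selectivity of an equality predicate and compare a sequential scan against an index lookup. The cost constants and type names are illustrative only.

package main

import "fmt"

// tableStats is the kind of statistic maintained through periodic sampling.
type tableStats struct {
    rowCount       float64
    distinctValues map[string]float64 // column -> approximate number of distinct values
}

// plan is a candidate access path with an estimated cost in abstract I/O units.
type plan struct {
    name string
    cost float64
}

// choosePlan compares a full table scan against an index lookup on an equality
// predicate, using the usual 1/NDV selectivity estimate.
func choosePlan(stats tableStats, column string, hasIndex bool) plan {
    const (
        seqRowCost   = 1.0 // sequential read of one row, relative units
        indexRowCost = 4.0 // random access through the index is pricier per row
    )
    scan := plan{name: "SeqScan", cost: stats.rowCount * seqRowCost}
    if !hasIndex {
        return scan
    }
    selectivity := 1.0 / stats.distinctValues[column]
    matching := stats.rowCount * selectivity
    index := plan{name: "IndexScan(" + column + ")", cost: matching * indexRowCost}
    if index.cost < scan.cost {
        return index
    }
    return scan
}

func main() {
    stats := tableStats{
        rowCount:       1_000_000,
        distinctValues: map[string]float64{"email": 950_000, "status": 3},
    }
    for _, col := range []string{"email", "status"} {
        p := choosePlan(stats, col, true)
        fmt.Printf("WHERE %s = ? -> %s (est. cost %.0f)\n", col, p.name, p.cost)
    }
    // email (high cardinality) picks the index; status (3 values) falls back to SeqScan.
}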


Acceptance Criteria

Your distributed database system is complete when it meets these criteria:

Functional Completeness:

  • Successfully executes distributed transactions across 3+ shards
  • Handles node failures with automatic failover within 30 seconds
  • Supports SQL queries including JOINs, aggregations, and subqueries
  • Implements automatic data sharding with configurable partitioning
  • Provides point-in-time recovery for backup restoration
  • Includes monitoring dashboard with key performance metrics

Performance Targets:

  • Point queries complete in < 10ms
  • System sustains 100,000+ queries per second across cluster
  • Write operations achieve 50,000+ ops/sec per node
  • Storage compression achieves 80%+ space savings
  • Query optimizer improves execution time by 50%+ for complex queries

Reliability Verification:

  • Zero data loss when nodes fail
  • Transactions maintain ACID properties under concurrent load
  • System recovers from complete cluster failure within 5 minutes
  • No split-brain scenarios occur during network partitions
  • All data includes checksums to detect corruption

Code Quality:

  • Unit test coverage exceeds 70% for core components
  • Integration tests cover distributed transaction scenarios
  • Performance benchmarks document latency and throughput
  • Code follows Go best practices and includes comprehensive comments
  • README provides clear setup and deployment instructions

Usage Examples

Basic SQL Operations

// Connect to distributed database
db, err := sql.Open("distdb", "host=localhost port=5432 user=admin")
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// Create table
_, err = db.Exec(`
    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE,
        name VARCHAR(255),
        created_at TIMESTAMP DEFAULT NOW()
    ) WITH (
        SHARDING = 'hash',
        SHARDS = 8,
        REPLICATION = 3
    )
`)

// Insert data
result, err := db.Exec(`
    INSERT INTO users (email, name)
    VALUES ($1, $2)
`, "alice@example.com", "Alice")

// Query with automatic shard routing
var user User
err = db.QueryRow(`
    SELECT id, email, name, created_at
    FROM users
    WHERE email = $1
`, "alice@example.com").Scan(&user.ID, &user.Email, &user.Name, &user.CreatedAt)

Distributed Transactions

// Begin distributed transaction
tx, err := db.Begin()
if err != nil {
    log.Fatal(err)
}
defer tx.Rollback()

// Transfer money between accounts
_, err = tx.Exec(`
    UPDATE accounts
    SET balance = balance - $1
    WHERE user_id = $2
`, 100.00, senderID)

_, err = tx.Exec(`
    UPDATE accounts
    SET balance = balance + $1
    WHERE user_id = $2
`, 100.00, receiverID)

// Record transaction history
_, err = tx.Exec(`
    INSERT INTO transactions (sender_id, receiver_id, amount)
    VALUES ($1, $2, $3)
`, senderID, receiverID, 100.00)

// Commit with 2PC across all involved shards
if err := tx.Commit(); err != nil {
    log.Printf("Transaction failed: %v", err)
    return err
}

Query Optimization

// Complex query with JOIN and aggregation
query := `
    SELECT
        u.name,
        COUNT(o.id) AS order_count,
        SUM(o.total) AS total_spent
    FROM users u
    JOIN orders o ON u.id = o.user_id
    WHERE u.country = $1
      AND o.created_at > $2
    GROUP BY u.id, u.name
    HAVING SUM(o.total) > $3
    ORDER BY total_spent DESC
    LIMIT 10`
rows, err := db.Query(query, "US", time.Now().AddDate(0, -1, 0), 1000.00)

// Explain query plan
var plan string
db.QueryRow("EXPLAIN " + query).Scan(&plan)
fmt.Println(plan)
// Output:
// Exchange(Gather) -> Aggregate(SUM) -> HashJoin(users, orders)
// -> IndexScan(users.country) -> IndexScan(orders.created_at)

Cluster Management

// Add new node to cluster
client := admin.NewClient("localhost:9090")

err := client.AddNode(ctx, &admin.NodeConfig{
    NodeID:   4,
    Address:  "192.168.1.4:9001",
    DataDir:  "/data/node4",
    ShardIDs: []int{3, 4}, // Shards to host
})

// Rebalance shards across nodes
err = client.RebalanceCluster(ctx, &admin.RebalanceConfig{
    MaxShardSize: 10 * 1024 * 1024 * 1024, // 10 GB
    MinShardSize: 1 * 1024 * 1024 * 1024,  // 1 GB
})

// Check cluster health
status, err := client.ClusterStatus(ctx)
fmt.Printf("Nodes: %d, Shards: %d, Health: %s\n",
    status.NodeCount, status.ShardCount, status.Health)

Backup and Recovery

// Create full backup
backupClient := backup.NewClient("localhost:9090")

err := backupClient.CreateBackup(ctx, &backup.BackupConfig{
    Type:        backup.FullBackup,
    Destination: "s3://my-bucket/backups/",
    Compression: backup.Zstd,
})

// Create incremental backup
err = backupClient.CreateBackup(ctx, &backup.BackupConfig{
    Type:        backup.IncrementalBackup,
    Destination: "s3://my-bucket/backups/",
    SinceBackup: lastBackupID,
})

// Restore to specific point in time
pointInTime, _ := time.Parse(time.RFC3339, "2024-01-15T11:30:00Z")
err = backupClient.RestoreBackup(ctx, &backup.RestoreConfig{
    BackupID:    "backup-20240115-120000",
    PointInTime: pointInTime,
    Target:      "new-cluster",
})

Performance Monitoring

// Query performance metrics
metrics := monitoring.NewClient("localhost:9090")

stats, err := metrics.GetQueryStats(ctx, &monitoring.QueryFilter{
    TimeRange:  monitoring.Last24Hours,
    MinLatency: 100 * time.Millisecond, // Slow queries
})

for _, stat := range stats.Queries {
    fmt.Printf("Query: %s\n", stat.SQL)
    fmt.Printf("Executions: %d\n", stat.ExecutionCount)
    fmt.Printf("Avg Latency: %v\n", stat.AvgLatency)
    fmt.Printf("P95 Latency: %v\n", stat.P95Latency)
    fmt.Printf("P99 Latency: %v\n", stat.P99Latency)
}

// Storage metrics
storage, err := metrics.GetStorageStats(ctx)
fmt.Printf("Total Data: %s\n", storage.TotalSize)
fmt.Printf("Compression Ratio: %.2f\n", storage.CompressionRatio)
fmt.Printf("Disk Usage: %.2f%%\n", storage.DiskUsagePercent)

Key Takeaways

Database Internals Mastery:

  1. Storage Engines: LSM-trees provide superior write performance for modern workloads through sequential I/O and efficient compaction strategies
  2. Distributed Consensus: Raft consensus ensures strong consistency and automatic failover without complex conflict resolution
  3. Transaction Protocols: Two-phase commit with replicated coordinator provides ACID guarantees across distributed shards
  4. Query Optimization: Cost-based optimization using table statistics generates execution plans 10-100x faster than naive approaches

Distributed Systems Principles:

  1. CAP Theorem: This system prioritizes Consistency and Partition tolerance, accepting temporary unavailability during network splits
  2. Quorum-Based Replication: 3-way replication with quorum reads/writes ensures data survives node failures while maintaining consistency
  3. Sharding Strategies: Range-based partitioning with automatic splitting provides better query locality than hash-based approaches
  4. Failure Handling: Write-ahead logging combined with Raft consensus enables crash recovery without data loss
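
As an illustration of the write-ahead logging half of that point, here is a minimal sketch (the file format and names are invented for the example): every change is appended and fsynced before being acknowledged, and recovery replays the log to rebuild state after a crash.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// walRecord is one logical change; real WALs use binary framing plus checksums.
type walRecord struct {
    Key, Value string
}

// appendRecord writes the record and fsyncs before the write is acknowledged,
// which is what lets committed data survive a crash.
func appendRecord(f *os.File, rec walRecord) error {
    if _, err := fmt.Fprintf(f, "%s\t%s\n", rec.Key, rec.Value); err != nil {
        return err
    }
    return f.Sync()
}

// replay rebuilds the in-memory state from the log after a restart.
func replay(path string) (map[string]string, error) {
    state := make(map[string]string)
    f, err := os.Open(path)
    if err != nil {
        if os.IsNotExist(err) {
            return state, nil // no log yet: fresh node
        }
        return nil, err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        parts := strings.SplitN(sc.Text(), "\t", 2)
        if len(parts) == 2 {
            state[parts[0]] = parts[1]
        }
    }
    return state, sc.Err()
}

func main() {
    const path = "demo.wal"
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        panic(err)
    }
    appendRecord(f, walRecord{Key: "user:1", Value: "alice"})
    f.Close()

    state, _ := replay(path) // what crash recovery would do on restart
    fmt.Println(state["user:1"])
    os.Remove(path)
}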

Performance Engineering:

  1. Write Amplification: LSM-trees trade write amplification during compaction for better write throughput on SSDs
  2. Read Optimization: Bloom filters and block caches reduce unnecessary disk I/O by 90%+ for hot data (a bloom filter sketch follows this list)
  3. Query Planning: Predicate pushdown and join reordering can reduce query execution time from minutes to milliseconds
  4. Caching Strategies: Multi-level caching is essential for sub-10ms latency
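
To illustrate the bloom-filter technique mentioned above, here is a small self-contained sketch; the bit-set size and hash scheme are arbitrary choices for the example. A "no" answer is definitive, so the read path can skip SSTables that cannot contain the key.

package main

import (
    "fmt"
    "hash/fnv"
)

// bloomFilter is a fixed-size bit set probed by k derived hash functions.
type bloomFilter struct {
    bits []bool
    k    uint32
}

func newBloomFilter(size int, k uint32) *bloomFilter {
    return &bloomFilter{bits: make([]bool, size), k: k}
}

// positions derives k probe positions from one 64-bit FNV hash (double hashing).
func (b *bloomFilter) positions(key string) []uint32 {
    h := fnv.New64a()
    h.Write([]byte(key))
    sum := h.Sum64()
    h1, h2 := uint32(sum), uint32(sum>>32)
    out := make([]uint32, b.k)
    for i := uint32(0); i < b.k; i++ {
        out[i] = (h1 + i*h2) % uint32(len(b.bits))
    }
    return out
}

func (b *bloomFilter) Add(key string) {
    for _, p := range b.positions(key) {
        b.bits[p] = true
    }
}

// MightContain returns false only when the key was definitely never added.
func (b *bloomFilter) MightContain(key string) bool {
    for _, p := range b.positions(key) {
        if !b.bits[p] {
            return false
        }
    }
    return true
}

func main() {
    f := newBloomFilter(1024, 3)
    f.Add("laptop-123")
    fmt.Println(f.MightContain("laptop-123")) // true
    fmt.Println(f.MightContain("phone-999"))  // false (almost certainly)
}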

Production Considerations:

  1. Monitoring: Track latency, throughput, replication lag, and compaction metrics for early problem detection
  2. Backup Strategy: Combine full backups with incremental backups for point-in-time recovery
  3. Capacity Planning: Plan for 3x data growth headroom and monitor shard sizes to trigger automatic splitting
  4. Operational Excellence: Automated failover, rolling upgrades, and chaos testing are essential for 99.99% uptime

Career Impact:

  • Understanding distributed database internals positions you for roles at database companies
  • This project demonstrates systems programming skills valued at cloud providers
  • Experience with consensus algorithms and distributed transactions is rare and highly sought after
  • The ability to design petabyte-scale systems is a distinguishing factor for senior/staff engineer roles

Next Steps

Immediate Extensions:

  1. Implement secondary indexes with B-tree structures for faster lookups
  2. Add query caching layer to reduce load on storage engine
  3. Build connection pooling to handle 10,000+ concurrent clients
  4. Implement prepared statements for query performance optimization
  5. Add deadlock detection and resolution for distributed transactions

Advanced Features:

  1. Geo-replication for multi-region deployments with tunable consistency
  2. Materialized views for pre-computed aggregations
  3. Change Data Capture to stream changes to Kafka or other systems
  4. Time-series optimization for IoT and monitoring workloads
  5. Full-text search using inverted indexes

Production Hardening:

  1. Chaos engineering tests
  2. Load testing to verify 100k+ QPS target with realistic workloads
  3. Security audit including authentication, authorization, and encryption
  4. Performance profiling to identify and eliminate bottlenecks
  5. Documentation for operators and contributors

Learning Resources:

  1. Read "Designing Data-Intensive Applications" by Martin Kleppmann
  2. Study open-source databases: CockroachDB, TiDB, YugabyteDB
  3. Review research papers: Bigtable, Spanner, Raft, Percolator
  4. Explore database internals: storage formats, query execution, optimization
  5. Join database engineering communities and contribute to open source

Career Development:

  1. Write technical blog posts about your implementation choices and tradeoffs
  2. Present your project at local meetups or conferences
  3. Contribute to open-source database projects
  4. Apply for database engineer or distributed systems roles
  5. Consider specializing in query optimization, storage engines, or distributed consensus

Download Complete Solution

The complete implementation includes:

  • Full LSM-tree storage engine with compaction, bloom filters, and block cache
  • Raft consensus implementation for leader election and log replication
  • Distributed transaction coordinator using two-phase commit protocol
  • Range-based sharding with automatic split/merge functionality
  • SQL query parser and optimizer with cost-based optimization
  • Write-ahead logging for crash recovery and durability
  • Comprehensive test suite including unit, integration, and chaos tests
  • Docker deployment configuration for multi-node clusters
  • Monitoring dashboards with Prometheus and Grafana integration
  • Complete README with architecture documentation, setup instructions, and deployment guide

The downloaded package includes all source code, comprehensive tests, Docker configuration, and a detailed README with an implementation guide, architecture decisions, and deployment instructions. Read the README first to understand the complete system architecture and the step-by-step implementation details.


References

Academic Papers:

  • "Bigtable: A Distributed Storage System for Structured Data" (Chang et al., OSDI 2006)
  • "Spanner: Google's Globally-Distributed Database" (Corbett et al., OSDI 2012)
  • "In Search of an Understandable Consensus Algorithm" (Ongaro and Ousterhout, USENIX ATC 2014), the Raft paper
  • "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (Peng and Dabek, OSDI 2010), the Percolator paper

Books:

  • "Designing Data-Intensive Applications" by Martin Kleppmann - Essential distributed systems knowledge
  • "Database Internals" by Alex Petrov - Deep dive into storage engines and query processing
  • "The Art of PostgreSQL" by Dimitri Fontaine - SQL optimization and best practices

Open Source References:

  • CockroachDB - Distributed SQL database in Go
  • TiDB - MySQL-compatible distributed database
  • etcd - Production Raft implementation in Go
  • BadgerDB - LSM-tree storage engine in Go
  • LevelDB - Original LSM-tree implementation

Congratulations! You've built a production-grade distributed database with strong consistency guarantees, horizontal scalability, and performance comparable to commercial systems. This project demonstrates deep understanding of distributed systems, storage engines, and database internals—skills highly valued at companies like Google, Amazon, Snowflake, and MongoDB. Use this project as a foundation for exploring advanced topics in distributed databases or as a portfolio piece for systems engineering roles.