# Performance Tuning Guide for Distributed Database

This guide provides detailed instructions for optimizing the distributed database for your specific workload.

## Table of Contents

- [Workload Analysis](#workload-analysis)
- [Storage Engine Tuning](#storage-engine-tuning)
- [Raft Consensus Tuning](#raft-consensus-tuning)
- [Transaction Tuning](#transaction-tuning)
- [Sharding Tuning](#sharding-tuning)
- [Monitoring and Profiling](#monitoring-and-profiling)
- [Capacity Planning](#capacity-planning)
- [Performance Optimization Checklist](#performance-optimization-checklist)
- [Advanced Tuning Scenarios](#advanced-tuning-scenarios)

## Workload Analysis

Before tuning, understand your workload characteristics:

```bash
# Profile your workload
# 1. Measure read/write ratio
# 2. Check key access patterns
# 3. Monitor transaction length
# 4. Track concurrent connections
```

### Workload Classification

#### Write-Heavy (90% writes, 10% reads)

Examples:
- Logging systems
- Time-series databases
- Event streaming
- IoT sensor data

Characteristics:
- Throughput matters more than latency
- Most writes land in the same (most recent) time range
- Large batch operations
- Sequential key patterns

Optimization goals:
- Maximize write throughput (100k+ ops/sec)
- Minimize CPU usage
- Efficient WAL batching

#### Read-Heavy (10% writes, 90% reads)

Examples:
- Analytics databases
- Reporting systems
- Data warehouses
- Social media feeds

Characteristics:
- Need low-latency reads
- Random access patterns
- Large working sets
- Mostly cacheable data

Optimization goals:
- Maximize cache hit rate (>90%)
- Minimize disk I/O
- Optimize for memory usage

#### Balanced (50% reads, 50% writes)

Examples:
- OLTP databases
- Web applications
- E-commerce platforms
- CRM systems

Characteristics:
- Both throughput and latency matter
- Mixed access patterns
- Medium working sets
- Multi-shard operations

Optimization goals:
- Balance throughput and latency
- Optimize transaction throughput
- Efficient resource utilization

#### Low-Latency (tail latency matters)

Examples:
- Real-time trading systems
- Gaming databases
- Chat applications
- High-frequency trading

Characteristics:
- p99 latency critical (< 100ms)
- Predictable latencies
- Small working sets
- Consistent performance

Optimization goals:
- Minimize p99 latency
- Reduce jitter
- Consistent performance
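
The read/write split behind the first three classes can be estimated from operation counters. The following is a minimal Python sketch; `classify_workload` and its 70% cut-off are illustrative assumptions, not part of the database or its tooling. Low-latency is a separate axis (tail latency) and cannot be derived from the ratio alone.

```python
def classify_workload(reads: int, writes: int, heavy_share: float = 0.7) -> str:
    """Classify a workload by its share of write operations.

    heavy_share is an assumed cut-off for "heavy"; tune it to your own
    definition. The guide's archetypes are 90/10, 10/90 and 50/50.
    """
    total = reads + writes
    if total == 0:
        raise ValueError("no operations observed")
    write_share = writes / total
    if write_share >= heavy_share:
        return "write-heavy"
    if write_share <= 1 - heavy_share:
        return "read-heavy"
    return "balanced"


# Counts would normally come from counters such as reads/writes totals.
print(classify_workload(reads=10_000, writes=90_000))
print(classify_workload(reads=50_000, writes=50_000))
```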

## Storage Engine Tuning

### MemTable Size Configuration

The MemTable is an in-memory skip list that buffers writes before flushing to disk.

```yaml
storage:
  memtable_size: 67108864  # Default: 64 MB
```

**Tuning Strategy:**

```
Size vs. Flush Frequency Trade-off:

Larger MemTable (128 MB)
├─ Pros:
│  ├─ Fewer flushes = lower write latency spikes
│  ├─ Better batch efficiency
│  └─ Lower compaction overhead
└─ Cons:
   ├─ Higher memory usage
   ├─ Longer crash recovery
   └─ Larger flush operations

Smaller MemTable (32 MB)
├─ Pros:
│  ├─ Lower memory footprint
│  ├─ Faster crash recovery
│  └─ More frequent, smaller compactions
└─ Cons:
   ├─ More frequent flushes
   ├─ Write latency spikes
   └─ Higher compaction overhead
```

**Recommendations by Workload:**

| Workload | Size | Reasoning |
|----------|------|-----------|
| Write-heavy | 128-256 MB | Reduce flush frequency |
| Read-heavy | 32-64 MB | Save memory for block cache |
| Balanced | 64-96 MB | Middle ground |
| Low-latency | 32-48 MB | Frequent, consistent flushes |

**Monitoring MemTable Health:**

```bash
# Check flush frequency
curl http://localhost:9001/metrics | grep memtable_flushes_total

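# memtable_flushes_total is a cumulative counter, so derive flushes/sec
# from two samples: (later - earlier) / seconds between them.
# If flushes/sec > 10:
#   → MemTable is too small, increase size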

# If flush_bytes > 1GB/sec:
#   → Flush operations are heavy, increase size further
```
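
The thresholds above are rates, while a `_total` metric is normally a cumulative counter. The sketch below shows one way to turn two scrapes into a per-second rate. It assumes the endpoint returns Prometheus-style text lines of the form `name value` without timestamps; `parse_counter` and `rate_per_second` are hypothetical helper names.

```python
def parse_counter(metrics_text: str, name_fragment: str) -> float:
    """Return the value of the first sample whose name contains name_fragment."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name_fragment in name:
            return float(value)
    raise KeyError(name_fragment)


def rate_per_second(previous: float, current: float, elapsed_seconds: float) -> float:
    """Per-second rate between two samples of a cumulative counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if current < previous:
        # Counter went backwards: assume the process restarted from zero.
        return current / elapsed_seconds
    return (current - previous) / elapsed_seconds


# Two scrapes taken 60 seconds apart (sample text, not real output).
first = "# TYPE lsm_memtable_flushes_total counter\nlsm_memtable_flushes_total 120\n"
second = "lsm_memtable_flushes_total 720\n"
flushes = rate_per_second(
    parse_counter(first, "memtable_flushes_total"),
    parse_counter(second, "memtable_flushes_total"),
    60,
)
print(f"{flushes:.1f} flushes/sec")
```

The same two helpers apply to any other counter in this guide, such as election or compaction totals.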

### Block Cache Size Configuration

The block cache is an LRU cache for frequently accessed data blocks.

```yaml
storage:
  block_cache_size: 1073741824  # Default: 1 GB
```

**Tuning Strategy:**

```
Cache Size vs. Hit Rate Trade-off:

Larger Cache (2-4 GB)
├─ Pros:
│  ├─ Higher hit rate (90%+)
│  ├─ Reduced disk I/O
│  └─ Faster read operations
└─ Cons:
   ├─ Higher memory usage
   ├─ Longer eviction scanning
   └─ Less memory for OS/buffers

Smaller Cache (256-512 MB)
├─ Pros:
│  ├─ Lower memory footprint
│  ├─ Reduced eviction overhead
│  └─ More memory for other uses
└─ Cons:
   ├─ Lower hit rate (50-70%)
   ├─ More disk reads
   └─ Slower reads
```

**Recommended Cache Hit Rates:**

| Workload | Target Hit Rate | Cache Size |
|----------|-----------------|------------|
| Write-heavy | 50-60% | 512 MB - 1 GB |
| Read-heavy | 85-95% | 2-4 GB |
| Balanced | 70-80% | 1-2 GB |
| Low-latency | 90%+ | 3-4 GB |

**Warming the Cache:**

Before putting the database into production:

```bash
# 1. Access entire working set
./bin/dbclient --addr=localhost:9001

db> SCAN 0x00 0xff

# 2. Monitor cache statistics
watch -n 1 'curl -s http://localhost:9001/metrics | grep cache'

# 3. Verify hit rate
# Goal: >80% hit rate after warmup
```

### Bloom Filter Configuration

Bloom filters quickly reject non-existent keys without disk access.

```yaml
storage:
  bloom_filter_bits: 10  # Default: 10 bits per key
```

**Tuning Strategy:**

```
Bloom Filter Bits vs. False Positive Rate:

Formula: False Positive Rate ≈ (0.6185)^(bits/key)
(assumes the optimal number of hash functions)

Bits/Key | False Positive Rate | Size for 1M keys
---------|---------------------|-----------------
5        | 9.1%                | 625 KB
10       | 0.82%               | 1.25 MB
15       | 0.074%              | 1.88 MB
20       | 0.0067%             | 2.5 MB

Higher bits = Lower false positive rate
Lower bits = Smaller memory, faster computation
```

**Recommendations:**

| Workload | Bits/Key | Reasoning |
|----------|----------|-----------|
| Write-heavy | 8-10 | Speed is critical, accept roughly 1-2% false positives |
| Read-heavy | 12-15 | Reduce disk reads, roughly 0.07-0.3% false positives |
| Balanced | 10-12 | Balance between speed and accuracy |
| Low-latency | 15+ | Minimize disk reads at all costs |

**Impact Analysis:**

```bash
# With 10 bits/key (~0.8% false positive rate):
# For 1 million keys in SSTable:
# - Bloom filter size: 1.25 MB
# - False positives per 1000 lookups of non-existent keys: ~8
# - Disk I/O saved: ~99.2% for non-existent keys

# With 15 bits/key (~0.07% false positive rate):
# - Bloom filter size: 1.88 MB
# - False positives per 1000 lookups of non-existent keys: ~0.7
# - Disk I/O saved: ~99.9% for non-existent keys
```
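
The `(0.6185)^(bits/key)` approximation can be evaluated directly when choosing `bloom_filter_bits`. This is a small sketch of that arithmetic; the function names are illustrative, and the approximation holds only when the filter uses the optimal number of hash functions.

```python
import math


def false_positive_rate(bits_per_key: float) -> float:
    """Approximate Bloom filter false positive rate: 0.6185 ** bits_per_key."""
    return 0.6185 ** bits_per_key


def filter_size_bytes(num_keys: int, bits_per_key: int) -> int:
    """Filter size in bytes, rounded up to a whole byte."""
    return (num_keys * bits_per_key + 7) // 8


def bits_for_target(target_rate: float) -> int:
    """Smallest whole bits/key whose approximate rate is at or below target_rate."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be between 0 and 1")
    return math.ceil(math.log(target_rate) / math.log(0.6185))


for bits in (5, 10, 15, 20):
    rate = false_positive_rate(bits)
    size = filter_size_bytes(1_000_000, bits)
    print(f"{bits:>2} bits/key: {rate:.4%} false positives, {size:,} bytes per 1M keys")
```

`bits_for_target` works in the other direction: given an acceptable false positive rate, it returns the setting to try first.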

### Compression Configuration

Choose a compression algorithm based on your workload:

```yaml
storage:
  compression: "snappy"  # Options: "snappy", "zstd", "lz4"
```

**Algorithm Comparison:**

| Algorithm | Space Savings | Speed | Use Case |
|-----------|---------------|-------|----------|
| snappy | 50-60% | Very fast (~500 MB/s) | Write-heavy, low-latency |
| zstd | 60-70% | Fast (~200 MB/s) | Balanced, space-constrained |
| lz4 | 40-50% | Fastest (~2 GB/s) | Real-time streaming |
| none | 0% | No CPU cost | Rare: pre-compressed data |

**Recommendations:**

| Workload | Algorithm | Reasoning |
|----------|-----------|-----------|
| Write-heavy | snappy | Fast compression, good ratio |
| Read-heavy | zstd | Better compression, CPU acceptable |
| Balanced | snappy | Balance between speed and ratio |
| Low-latency | snappy | Minimize CPU, accept higher disk usage |
| Space-constrained | zstd | Best compression ratio |

**Monitoring Compression:**

```bash
# Check compression overhead
curl http://localhost:9001/metrics | grep compression

# Calculate effective compression ratio
# ratio = uncompressed_bytes / compressed_bytes

# If ratio < 1.5:
#   → Data doesn't compress well, consider disabling

# If ratio > 3.0:
#   → Compression is very effective, good choice
```
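
The ratio check above can be scripted once the two byte counts are available. The sketch below is illustrative: the function names are hypothetical, and the guidance for ratios between 1.5 and 3.0 is an assumption, since the thresholds above only cover the two extremes.

```python
def compression_ratio(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """ratio = uncompressed_bytes / compressed_bytes."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return uncompressed_bytes / compressed_bytes


def assess_compression(ratio: float) -> str:
    """Map a ratio onto the 1.5 / 3.0 thresholds used in this section."""
    if ratio < 1.5:
        return "poor: data does not compress well, consider disabling"
    if ratio > 3.0:
        return "very effective: keep the current algorithm"
    # Assumed middle case; not spelled out by the thresholds above.
    return "moderate: keep it unless CPU is the bottleneck"


ratio = compression_ratio(uncompressed_bytes=12_000_000, compressed_bytes=5_000_000)
print(f"ratio {ratio:.2f}: {assess_compression(ratio)}")
```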

### Write-Ahead Log (WAL) Tuning

The WAL ensures durability by writing sequentially to disk.

```yaml
storage:
  wal_buffer_size: 4194304      # 4 MB
  wal_sync_interval: 100        # milliseconds
  wal_enable_direct_io: false   # true for direct I/O
```

**Tuning Strategy:**

```
WAL Buffer Sync Timing:

fsync every 1ms (wal_sync_interval: 1)
├─ Highest durability (lose up to ~1 ms of writes on crash, ~50 ops at 50k ops/sec)
├─ Lower throughput (~50k ops/sec)
└─ Use for: Critical financial data

fsync every 100ms (wal_sync_interval: 100)
├─ Good durability (lose up to ~100 ms of writes on crash, ~10,000 ops at 100k ops/sec)
├─ Good throughput (~100k ops/sec)
└─ Use for: Default / balanced workloads

fsync every 1000ms (wal_sync_interval: 1000)
├─ Lower durability (lose up to ~1 s of writes on crash, ~150,000 ops at 150k ops/sec)
├─ Maximum throughput (~150k ops/sec)
└─ Use for: Real-time analytics (data loss acceptable)
```
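
The number of writes exposed to a crash depends on both the sync interval and the write rate: everything accepted since the last fsync is at risk. A rough upper bound, assuming a steady write rate and a single node losing its unsynced buffer (`max_ops_at_risk` is an illustrative name):

```python
def max_ops_at_risk(ops_per_sec: float, sync_interval_ms: float) -> float:
    """Upper bound on operations written since the last fsync."""
    if ops_per_sec < 0 or sync_interval_ms < 0:
        raise ValueError("inputs must be non-negative")
    return ops_per_sec * sync_interval_ms / 1000.0


# Throughput figures are the approximate ones quoted in the tree above.
for interval_ms, throughput in ((1, 50_000), (100, 100_000), (1000, 150_000)):
    at_risk = max_ops_at_risk(throughput, interval_ms)
    print(f"wal_sync_interval={interval_ms} ms at {throughput:,} ops/sec: "
          f"up to {at_risk:,.0f} ops at risk")
```

Use your own measured throughput rather than the nominal figures when deciding whether a given interval is acceptable.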

**Buffer Size Tuning:**

```yaml
# Large buffer (8 MB)
wal_buffer_size: 8388608
# Pros: Fewer fsync calls, better batching
# Cons: More memory, longer crash recovery

# Small buffer (1 MB)
wal_buffer_size: 1048576
# Pros: Lower memory, faster recovery
# Cons: More fsync calls, lower throughput
```

**Direct I/O Configuration:**

```yaml
# Direct I/O (bypass OS cache)
wal_enable_direct_io: true
# Pros: More predictable performance
# Cons: May be slower on systems with slow I/O

# OS Cache
wal_enable_direct_io: false
# Pros: Faster on systems with fast disk cache
# Cons: Less predictable, can cause stalls
```

## Raft Consensus Tuning

### Election Timeout Configuration

The election timeout determines how quickly failed leaders are replaced.

```yaml
raft:
  election_timeout_min: 300     # milliseconds
  election_timeout_max: 500     # milliseconds
```
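
Raft implementations typically have each node pick its election timeout at random between the minimum and maximum, so that followers rarely time out at the same instant and split the vote. Assuming this database uses `election_timeout_min` and `election_timeout_max` that way (the guide does not say so explicitly), the selection can be sketched as follows; `pick_election_timeout_ms` is a hypothetical name.

```python
import random


def pick_election_timeout_ms(min_ms: int, max_ms: int, rng: random.Random) -> int:
    """Pick a timeout uniformly from [min_ms, max_ms], both ends included."""
    if not 0 < min_ms <= max_ms:
        raise ValueError("require 0 < min_ms <= max_ms")
    return rng.randint(min_ms, max_ms)


rng = random.Random()
# Each node draws independently, so timeouts are spread across the range.
timeouts = [pick_election_timeout_ms(300, 500, rng) for _ in range(3)]
print(timeouts)
```

A wider gap between the two values spreads the timeouts further apart, at the cost of a longer worst-case failover.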

**Tuning Strategy:**

```
Shorter Timeout (150-250ms)
├─ Pros:
│  ├─ Faster failure detection
│  ├─ Lower max downtime
│  └─ Better for latency-sensitive systems
└─ Cons:
   ├─ More frequent elections (jitter)
   ├─ Network hiccups trigger elections
   └─ Higher CPU usage

Longer Timeout (500-1000ms)
├─ Pros:
│  ├─ Fewer spurious elections
│  ├─ More stable during network blips
│  └─ Lower CPU usage
└─ Cons:
   ├─ Slower failure detection
   ├─ Higher max downtime
   └─ Less responsive to failures
```

**Recommendations by Network:**

| Network | Timeout Range | Notes |
|---------|---------------|-------|
| LAN (low latency) | 150-300ms | Can afford short timeouts |
| WAN (high latency) | 500-1000ms | Need longer for network jitter |
| Unstable | 800-1500ms | High jitter, need stability |
| Cloud (WAN) | 400-800ms | Typical cloud network latency |

**Monitoring Elections:**

```bash
# Check election frequency
curl http://localhost:9001/metrics | grep leader_elections_total

# If elections > 1 per minute:
#   → Timeout may be too short, increase

# If elections < 1 per hour:
#   → Leadership is stable; shorten the timeout only if failover is too slow
```

### Heartbeat Interval Configuration

The heartbeat interval determines how often the leader sends "keep alive" messages.

```yaml
raft:
  heartbeat_interval: 50  # milliseconds
```

**Tuning Strategy:**

```
Heartbeat Interval vs. Election Timeout:

Rule: heartbeat_interval <= election_timeout / 5

Example:
election_timeout = 300ms
heartbeat_interval = 60ms or lower (300/5)

More frequent heartbeats:
├─ Better leader liveness detection
├─ More network traffic
└─ Higher CPU usage

Less frequent heartbeats:
├─ Lower network overhead
├─ Higher failure detection latency
└─ Higher risk of spurious elections when heartbeats are delayed or dropped
```
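
The one-fifth rule is easy to check before deploying a config change. The sketch below treats the bound as inclusive, matching the 300 ms / 60 ms example, and checks against the minimum election timeout since that is the tightest case. Both function names are illustrative.

```python
def max_heartbeat_interval_ms(election_timeout_min_ms: int, ratio: int = 5) -> int:
    """Largest heartbeat interval allowed by the timeout / ratio rule."""
    return election_timeout_min_ms // ratio


def heartbeat_is_safe(heartbeat_ms: int, election_timeout_min_ms: int, ratio: int = 5) -> bool:
    """True when at least `ratio` heartbeats fit into one election timeout."""
    return heartbeat_ms * ratio <= election_timeout_min_ms


print(max_heartbeat_interval_ms(300))   # 60
print(heartbeat_is_safe(50, 300))       # True
print(heartbeat_is_safe(100, 300))      # False
```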

**Recommended Ratios:**

| Network Type | Timeout | Heartbeat | Ratio |
|--------------|---------|-----------|-------|
| LAN | 250ms | 50ms | 1:5 |
| WAN | 500ms | 100ms | 1:5 |
| Cloud | 600ms | 120ms | 1:5 |
| Unstable | 1000ms | 150ms | 1:6-1:7 |

## Transaction Tuning

### Transaction Timeout Configuration

```yaml
transaction:
  timeout: 30  # seconds
```

**Tuning Guidelines:**

```
Shorter Timeout (5-10 seconds)
├─ Pros:
│  ├─ Faster deadlock resolution
│  ├─ Less resource waste on failed txns
│  └─ Lower memory pressure
└─ Cons:
   ├─ Long-running transactions abort
   ├─ May impact batch operations
   └─ Requires retry logic

Longer Timeout (60-300 seconds)
├─ Pros:
│  ├─ Accommodates slow transactions
│  ├─ Batch operations can run
│  └─ Simpler application logic
└─ Cons:
   ├─ Dead transactions hold resources
   ├─ Higher memory usage
   └─ Slower deadlock detection
```

**Recommendations:**

| Workload | Timeout | Reasoning |
|----------|---------|-----------|
| Short transactions | 5-10s | Fast online transaction processing |
| Batch operations | 60-120s | Allow time for large transfers |
| Analytics queries | 300-600s | Long-running aggregations |
| User-facing | 30s | Balance between UX and resource usage |

### Maximum Concurrent Transactions

```yaml
transaction:
  max_concurrent: 10000  # Concurrent transactions
```

**Tuning Strategy:**

```
Concurrent Transactions vs. Memory/CPU:

Each transaction needs:
├─ 1 KB for transaction state
├─ Locks on keys being modified
├─ Undo log for rollback
└─ Average: ~5-10 KB per transaction

Memory estimate:
max_concurrent * 10 KB = memory needed

Example:
10,000 transactions * 10 KB = 100 MB
100,000 transactions * 10 KB = 1 GB
```
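
The estimate above is a single multiplication, and it can also be inverted to derive `max_concurrent` from a memory budget. A small sketch using the upper figure of 10 KB per transaction and decimal units (1 KB = 1000 bytes) to match the worked example; the function names are illustrative.

```python
KB = 1000  # decimal units, matching the 10,000 * 10 KB = 100 MB example


def transaction_memory_bytes(max_concurrent: int, per_txn_kb: int = 10) -> int:
    """Worst-case memory for transaction state at full concurrency."""
    return max_concurrent * per_txn_kb * KB


def max_concurrent_for_budget(budget_bytes: int, per_txn_kb: int = 10) -> int:
    """Largest max_concurrent that fits into the given memory budget."""
    return budget_bytes // (per_txn_kb * KB)


print(transaction_memory_bytes(10_000))          # 100000000 (100 MB)
print(max_concurrent_for_budget(250_000_000))    # 25000
```

Transactions that touch many keys hold more locks and a larger undo log, so measure real per-transaction usage before relying on the 10 KB figure.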

**Recommendations:**

| System RAM | Max Concurrent | Estimated Memory |
|-----------|-----------------|------------------|
| 4 GB | 10,000 | ~100 MB |
| 8 GB | 25,000 | ~250 MB |
| 16 GB | 50,000 | ~500 MB |
| 32+ GB | 100,000+ | ~1 GB |

**Monitoring:**

```bash
# Check active transaction count
curl http://localhost:9001/metrics | grep txn_active

# If txn_active regularly approaches max_concurrent:
#   → Increase max_concurrent or improve throughput
```

## Sharding Tuning

### Replication Factor Configuration

```yaml
sharding:
  replication_factor: 3  # Replicas per shard
```

**Tuning Strategy:**

```
Replication Factor vs. Durability/Availability:

Factor 1 (No replication)
├─ Pros: No network overhead, minimal latency
└─ Cons: Any node failure = data loss

Factor 3 (Standard)
├─ Pros: Can tolerate 1 node failure
├─ Cons: 3x network traffic, 3x storage
└─ Use for: Default, most systems

Factor 5 (High availability)
├─ Pros: Can tolerate 2 node failures
├─ Cons: 5x network overhead
└─ Use for: Critical data, 5+ node clusters
```
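
The failure tolerance figures above follow from majority quorums. Assuming each shard's replicas form a standard Raft group in which every replica votes, the arithmetic is:

```python
def quorum_size(replication_factor: int) -> int:
    """Replicas that must acknowledge a write: a strict majority."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can fail while a majority remains available."""
    return replication_factor - quorum_size(replication_factor)


for factor in (1, 2, 3, 4, 5):
    print(f"factor {factor}: quorum {quorum_size(factor)}, "
          f"tolerates {tolerated_failures(factor)} failure(s)")
```

Under this assumption an even factor tolerates no more failures than the odd factor below it (4 tolerates 1, the same as 3), which is why odd factors are the usual choice.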

**Recommendations:**

| Scenario | Factor | Reasoning |
|----------|--------|-----------|
| Development | 1 | Cost savings; no fault tolerance |
| Production | 3 | Good balance |
| Mission-critical | 5 | High durability |
| Multi-region | 5-7 | Across regions; use odd counts, since an even count adds cost without tolerating more failures |

### Shard Size Configuration

```yaml
sharding:
  shard_size: 10737418240  # 10 GB per shard
```

**Tuning Strategy:**

```
Shard Size vs. Rebalancing Overhead:

Larger Shards (50-100 GB)
├─ Fewer shards needed
├─ Less rebalancing overhead
├─ Simpler routing logic
└─ But: Slower rebalancing when needed

Smaller Shards (1-10 GB)
├─ More shards, more parallelism
├─ Finer-grained load balancing
├─ Faster individual shard moves
└─ But: More rebalancing operations
```

**Recommendations:**

| Deployment | Size | Reasoning |
|------------|------|-----------|
| Small (< 100 GB) | 10-20 GB | Simple balancing |
| Medium (100 GB - 1 TB) | 20-50 GB | Manageable rebalancing |
| Large (1-10 TB) | 50-100 GB | Reduce rebalancing frequency |
| Huge (> 10 TB) | 100-200 GB | Very large shards |
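
To see what a given `shard_size` means for a deployment, estimate the number of shards and replicas it produces. A minimal sketch with illustrative function names, using binary units to match the `10737418240` default:

```python
GB = 1024 ** 3  # 10 * GB == 10737418240, the default shard_size


def shard_count(total_bytes: int, shard_size_bytes: int) -> int:
    """Shards needed to hold total_bytes, rounded up, never fewer than one."""
    if shard_size_bytes <= 0:
        raise ValueError("shard_size_bytes must be positive")
    return max(1, -(-total_bytes // shard_size_bytes))  # ceiling division


def replica_count(shards: int, replication_factor: int) -> int:
    """Total shard replicas the cluster has to place and keep in sync."""
    return shards * replication_factor


shards = shard_count(500 * GB, 20 * GB)
print(shards, replica_count(shards, 3))  # 25 75
```

The replica count is a useful proxy for metadata and rebalancing overhead when comparing two candidate shard sizes.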

### Auto-Split Threshold

```yaml
sharding:
  auto_split: true
  split_threshold: 0.8  # Split at 80% capacity
```

**Tuning Strategy:**

```
Split Threshold Trade-off:

Lower threshold (0.6 = 60%)
├─ More frequent splits
├─ Better load distribution
├─ More network overhead
└─ More metadata management

Higher threshold (0.9 = 90%)
├─ Fewer splits
├─ More imbalanced loading
├─ Less overhead
└─ Risk of hotspots
```

**Recommendations:**

| Workload | Threshold | Notes |
|----------|-----------|-------|
| Balanced | 0.75-0.80 | Good default |
| Write-heavy | 0.70-0.75 | More frequent balancing |
| Read-heavy | 0.80-0.85 | Less frequent moves |
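
The split decision can be expressed as a single comparison. This sketch assumes that "capacity" means the configured `shard_size`, which the guide implies but does not state; `should_split` is a hypothetical name and not the database's own function.

```python
GB = 1024 ** 3


def should_split(current_bytes: int, shard_size_bytes: int,
                 split_threshold: float, auto_split: bool = True) -> bool:
    """True when auto-split is on and the shard has reached the threshold."""
    if not 0 < split_threshold <= 1:
        raise ValueError("split_threshold must be in (0, 1]")
    return auto_split and current_bytes >= shard_size_bytes * split_threshold


# With the defaults (10 GB shards, 0.8 threshold) a split starts near 8 GB.
print(should_split(7 * GB, 10 * GB, 0.8))   # False
print(should_split(9 * GB, 10 * GB, 0.8))   # True
```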

## Monitoring and Profiling

### Key Metrics to Monitor

```bash
# LSM-Tree Health
curl http://localhost:9001/metrics | grep -E "lsm_"

# Key metrics:
# - lsm_memtable_flushes_total: How often MemTable flushes
# - lsm_compaction_bytes_total: How much data compacted
# - lsm_block_cache_hit_rate: Cache effectiveness
# - lsm_reads_total: Total read count
# - lsm_writes_total: Total write count
```

### CPU Profiling

```bash
# Collect CPU profile (30 seconds)
curl "http://localhost:9001/debug/pprof/profile?seconds=30" > cpu.prof

# Analyze
go tool pprof cpu.prof

# Commands:
# top10 - Top 10 functions by CPU time
# list functionName - Show source code
# web - Generate SVG graph
```

### Memory Profiling

```bash
# Collect heap snapshot
curl http://localhost:9001/debug/pprof/heap > heap.prof

# Find memory leaks
go tool pprof heap.prof
# In pprof:
# top10 - Top allocators
# list functionName - See allocations
# sample_index=alloc_space - Switch to total allocated bytes
# sample_index=inuse_space - Switch to current in-use memory
```

### Benchmarking

```bash
# Run benchmarks before optimization
make benchmark > baseline.txt

# Make configuration changes

# Run benchmarks after optimization
make benchmark > tuned.txt

# Compare results
benchstat baseline.txt tuned.txt
```

## Capacity Planning

### Hardware Sizing

#### Write-Heavy System (500k ops/sec)

```
CPU:
  ├─ 5 cores for storage (100k ops/sec per core)
  ├─ 2 cores for Raft
  └─ 1 core for overhead
  Total: 8 cores minimum

RAM:
  ├─ MemTable: 256 MB
  ├─ Block cache: 2 GB
  ├─ Compaction buffers: 500 MB
  ├─ WAL buffers: 100 MB
  └─ OS/other: 1 GB
  Total: 4-5 GB

Disk:
  ├─ Write rate: 500k ops/sec * 1 KB = 500 MB/sec
  ├─ Daily: 500 MB/s * 86,400s = 43 TB/day
  ├─ Retention: 30 days = 1.3 PB (with 3x replication: 4 PB)
  └─ Need: NVMe SSD, 10+ GB/s throughput
```
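
The disk figures above come from a chain of multiplications that is worth re-running with your own numbers. A sketch using decimal units (1 KB = 1000 bytes) as in the estimate; `storage_estimate` is an illustrative name, and the result ignores compression and compaction overhead.

```python
def storage_estimate(ops_per_sec: int, bytes_per_op: int,
                     retention_days: int, replication_factor: int) -> dict:
    """Raw write volume per second and per day, plus retained totals."""
    per_second = ops_per_sec * bytes_per_op
    per_day = per_second * 86_400
    retained = per_day * retention_days
    return {
        "bytes_per_sec": per_second,
        "bytes_per_day": per_day,
        "retained_bytes": retained,
        "replicated_bytes": retained * replication_factor,
    }


TB = 1000 ** 4
est = storage_estimate(500_000, 1000, retention_days=30, replication_factor=3)
print(f"{est['bytes_per_day'] / TB:.1f} TB/day, "
      f"{est['retained_bytes'] / TB:,.0f} TB retained, "
      f"{est['replicated_bytes'] / TB:,.0f} TB with replication")
```

A total in the petabyte range is far beyond a single node, so at this write rate the estimate also indicates how many shards and nodes the retention window requires.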

#### Read-Heavy System (1M ops/sec)

```
CPU:
  ├─ 10 cores for reads (100k cached reads per core)
  ├─ 2 cores for Raft
  └─ 2 cores for overhead
  Total: 14 cores minimum (round up to 16)

RAM:
  ├─ Block cache: 8-16 GB
  ├─ MemTable: 128 MB
  ├─ Other: 2 GB
  └─ OS: 2 GB
  Total: 12-20 GB

Disk:
  ├─ Read-only mostly
  ├─ Need: Capacity for working set
  └─ 500 GB - 2 TB SSD
```
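
The CPU totals in both sizing examples follow the same pattern: cores for the data path, plus fixed allowances for Raft and overhead. A sketch of that sum, taking the guide's planning figure of roughly 100k ops/sec per core as an input; `cores_needed` is an illustrative name.

```python
def cores_needed(target_ops_per_sec: int, ops_per_core: int,
                 raft_cores: int, overhead_cores: int) -> int:
    """Data-path cores (rounded up) plus fixed Raft and overhead cores."""
    if ops_per_core <= 0:
        raise ValueError("ops_per_core must be positive")
    data_path = -(-target_ops_per_sec // ops_per_core)  # ceiling division
    return data_path + raft_cores + overhead_cores


print(cores_needed(500_000, 100_000, raft_cores=2, overhead_cores=1))    # 8
print(cores_needed(1_000_000, 100_000, raft_cores=2, overhead_cores=2))  # 14
```

Treat the result as a minimum and round up to the next available instance size; measure the real per-core throughput with your own benchmark before committing to hardware.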

### Scaling Strategy

```
Phase 1: Single node (< 10 GB data)
├─ Simple setup
├─ Use all resources
└─ Monitor: throughput, latency, CPU, memory

Phase 2: 3-node cluster (10 GB - 100 GB)
├─ Add replication for durability
├─ Monitor: replication lag, leader stability
└─ Plan: when to start sharding

Phase 3: Multiple shards (100 GB - 1 TB)
├─ 3 nodes per shard
├─ Monitor: hotspots, shard sizes
└─ Implement: auto-balancing

Phase 4: Large cluster (> 1 TB)
├─ Many shards (10+)
├─ Multiple datacenters
├─ Sophisticated monitoring
└─ Advanced optimization needed
```

## Performance Optimization Checklist

### Pre-Production

- [ ] **Define SLOs** (Service Level Objectives)
  - Target throughput: ____ ops/sec
  - Target latency: ____ ms (p50, p95, p99)
  - Target availability: ____ %

- [ ] **Analyze Workload**
  - Read/write ratio: ____/____
  - Key distribution: uniform / skewed
  - Transaction size: ____ KB avg
  - Concurrent users: ____

- [ ] **Hardware Planning**
  - CPU cores: ____
  - RAM: ____ GB
  - Disk: ____ GB / type: ____
  - Network: ____ Gbps

- [ ] **Storage Tuning**
  - [ ] MemTable size optimized
  - [ ] Block cache sized appropriately
  - [ ] Bloom filter bits tuned
  - [ ] Compression algorithm selected

- [ ] **Raft Tuning**
  - [ ] Election timeout appropriate for network
  - [ ] Heartbeat interval configured
  - [ ] Replication factor set

- [ ] **Monitoring Setup**
  - [ ] Prometheus metrics configured
  - [ ] Grafana dashboard imported
  - [ ] Alerts configured (SLO violations)

### Production Deployment

- [ ] **Performance Baseline**
  - Throughput measured: ____ ops/sec
  - Latency measured: p50/p95/p99
  - Resource usage: CPU ___%, RAM ___%, Disk ___%

- [ ] **Capacity Planning**
  - [ ] Growth trajectory modeled (1, 3, 6, 12 months)
  - [ ] Scaling plan documented
  - [ ] Hardware budget estimated

- [ ] **Backup & Recovery**
  - [ ] Backup strategy implemented
  - [ ] Backup frequency: ____
  - [ ] Recovery tested and timed

- [ ] **Observability**
  - [ ] Key metrics being monitored
  - [ ] Slow query logging enabled
  - [ ] Error tracking configured

- [ ] **Optimization Cycle**
  - [ ] Baseline performance recorded
  - [ ] Top bottlenecks identified
  - [ ] Tuning plan created
  - [ ] Changes deployed and measured
  - [ ] Impact documented

### Ongoing Optimization

```
Weekly:
├─ Review metrics dashboards
├─ Check for anomalies
└─ Note any performance degradation

Monthly:
├─ Analyze access patterns
├─ Identify hot keys/shards
├─ Review slow query logs
└─ Plan optimizations

Quarterly:
├─ Capacity planning review
├─ Hardware upgrade planning
├─ Workload characterization
└─ Optimization effectiveness review
```

## Advanced Tuning Scenarios

### Scenario 1: Too Much GC Pause Time

**Symptom:** Occasional 100-500ms latency spikes

**Diagnosis:**
```bash
curl http://localhost:9001/debug/pprof/heap > heap.prof
go tool pprof heap.prof
# Look for large allocations in compaction
```

**Solutions:**
1. Reduce block cache size
2. Reduce MemTable size
3. Disable compression temporarily
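4. Give the garbage collector more headroom (if the server is a Go binary, as the pprof endpoints suggest, raise `GOGC` or set `GOMEMLIMIT`)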

### Scenario 2: Write Stalls During Compaction

**Symptom:** Write latency spikes every 10-20 seconds

**Diagnosis:**
```bash
curl http://localhost:9001/metrics | grep compaction
# If compaction_bytes grows by > 100 MB/s, compaction is competing with writes for disk bandwidth
```

**Solutions:**
1. Increase MemTable size
2. Reduce compaction parallelism
3. Use faster compression (snappy over zstd)
4. Add more disk bandwidth

### Scenario 3: Raft Election Instability

**Symptom:** Frequent leader changes (> 1 per minute)

**Diagnosis:**
```bash
curl http://localhost:9001/metrics | grep leader_elections
# Check network latency
ping -c 5 other-node  # replace other-node with a peer's hostname
```

**Solutions:**
1. Increase election timeout (if network latency high)
2. Send heartbeats more often (decrease the heartbeat interval; keep it at or below election_timeout / 5)
3. Check network quality
4. Check CPU load on nodes

---

**Remember**: Always test changes in staging before production!
