# Distributed KV Store - Troubleshooting Guide

Troubleshooting guide covering common issues and their solutions.

## Table of Contents

- [Startup Issues](#startup-issues)
- [Cluster Formation Issues](#cluster-formation-issues)
- [Replication Problems](#replication-problems)
- [Performance Issues](#performance-issues)
- [Network Issues](#network-issues)
- [Data Loss Prevention](#data-loss-prevention)
- [Debugging Tools and Techniques](#debugging-tools-and-techniques)
- [FAQ](#faq)

---

## Startup Issues

### Issue 1: Node Fails to Start - "Address Already in Use"

**Error Message**:
```
bind: address already in use
```

**Cause**: Another process is using the port, or the port is in TIME_WAIT state.

**Solutions**:

Option 1: Use a different port
```bash
./kvstore --node-id=node1 --raft-addr=localhost:7010 --http-addr=:8090
```

Option 2: Stop the existing process
```bash
# Find process using port 8080
lsof -i :8080
# Stop it gracefully (SIGTERM); use kill -9 only if it does not exit
kill <PID>

# Or stop all kvstore processes
pkill -f kvstore
```

Option 3: Wait for the port to be released (about 60 seconds)
```bash
# Check if port is in TIME_WAIT
netstat -an | grep 8080

# Wait and try again after 60 seconds
```

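Option 4: Force address reuse (in code)
```go
// In node.go, the TCP transport can be configured to set SO_REUSEADDR.
// Note: on Unix-like systems Go's net.Listen already sets SO_REUSEADDR on
// listening sockets, so if bind still fails, another process holds the port.
```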

---

### Issue 2: "Permission Denied" on Data Directory

**Error Message**:
```
Failed to create node: mkdir ./data: permission denied
```

**Cause**: Insufficient permissions to create data directory.

**Solutions**:

```bash
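# Check permissions on the parent directory and on ./data (if it exists)
ls -ld . ./data

# If ./data exists but is not accessible to the current user, fix its mode
chmod 755 ./data

# Or create the directory first with proper permissions
mkdir -p ./data/node1
chmod 755 ./data/node1

# Or run with an explicit path that the current user can write to
./kvstore --data-dir=/tmp/kvstore-data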
```

---

### Issue 3: Out of Memory During Startup

**Error Message**:
```
Cannot allocate memory
```

**Cause**: Insufficient system memory or memory limits.

**Solutions**:

```bash
# Check available memory
free -h

# Reduce snapshot buffer size
# Modify snapshotThreshold in node.go

# Run with memory limits (Docker)
docker run -m 512m kvstore:latest

# Or Kubernetes limits (YAML, in the container spec):
# resources:
#   limits:
#     memory: "512Mi"
```

---

## Cluster Formation Issues

### Issue 1: Node Won't Become Leader

**Symptoms**:
- Cluster stuck with no leader
- All nodes in follower state
- Writes fail with "not leader"

**Check Raft State**:
```bash
curl http://localhost:8081/api/stats | jq '.stats.state'
# Should show "Leader" or "Follower"
```

**Causes and Solutions**:

1. **Incorrect Bootstrap Configuration**
```bash
# First node must bootstrap
./kvstore --node-id=node1 --bootstrap

# Verify bootstrap worked
ls -la ./data/node1/*.db
```

2. **Network Connectivity Issues**
```bash
# Test connectivity between nodes
nc -zv localhost 7000  # from node2
nc -zv localhost 7001  # from node1

# Check firewall rules
sudo ufw status
sudo firewall-cmd --list-all

# If behind NAT, verify external addresses
```

3. **Clock Skew Between Nodes**
```bash
# Check system time on each node
date

# Sync time if needed (requires root)
sudo ntpdate -s time.nist.gov

# Or use chrony
sudo chronyc makestep
```

4. **Log Divergence**
```bash
# WARNING: this deletes node1's Raft log and state; back up ./data/node1 first
# Remove corrupted logs and restart
rm -rf ./data/node1/*.db
./kvstore --node-id=node1 --bootstrap

# For other nodes, just restart (they'll catch up)
./kvstore --node-id=node2
```

---

### Issue 2: Nodes Cannot Join Cluster

**Error When Joining**:
```
Error: not leader
Error: node already in cluster
Error: timeout
```

**Solutions**:

**Check 1: Verify Leader is Elected**
```bash
# Check each node
for port in 8081 8082 8083; do
  echo "Node on port $port:"
  curl -s http://localhost:$port/api/stats | jq '.is_leader'
done

# Should have exactly one leader
```
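The "exactly one leader" check can also be automated. The sketch below assumes each `/api/stats` response carries a boolean `is_leader` field, as the `jq` filter above suggests; `nodeStatus` and `countLeaders` are hypothetical names, and fetching the response bodies is left to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeStatus holds the one field this check needs from /api/stats.
// The field name is an assumption taken from the jq filter '.is_leader'.
type nodeStatus struct {
	IsLeader bool `json:"is_leader"`
}

// countLeaders decodes one stats body per node and returns how many nodes
// claim leadership. A healthy cluster should yield exactly 1.
func countLeaders(bodies map[string][]byte) (int, error) {
	leaders := 0
	for node, body := range bodies {
		var st nodeStatus
		if err := json.Unmarshal(body, &st); err != nil {
			return 0, fmt.Errorf("node %s: decode stats: %w", node, err)
		}
		if st.IsLeader {
			leaders++
		}
	}
	return leaders, nil
}
```

A result of 0 points to a leaderless cluster, and a result above 1 usually means the nodes were sampled during an election or across a partition.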

**Check 2: Use Correct Leader Address**
```bash
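# Find leader (if `.leader` reports the Raft address rather than the HTTP
# address, map it to that node's HTTP port before using it below)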
LEADER=$(curl -s http://localhost:8081/api/stats | jq -r '.leader')
echo "Leader is at: $LEADER"

# Join to leader
curl -X POST http://$LEADER/api/join \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2","addr":"localhost:7001"}'
```

**Check 3: Verify Raft Address Format**
```bash
# Potentially inconsistent: a hostname here while other nodes register IPs
-d '{"node_id":"node2","addr":"localhost:7001"}'

# Mixed formats can cause issues if nodes resolve addresses differently.
# Use one consistent format across all nodes:
# - All IPs: 192.168.1.10:7001
# - All hostnames: node2.example.com:7001
```
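A consistency check like the one described above can be scripted. This is a minimal sketch using only the Go standard library; `usesIP` and `consistentAddrStyle` are hypothetical helpers, not kvstore functions.

```go
package main

import (
	"fmt"
	"net"
)

// usesIP reports whether the host part of a host:port address is an IP literal.
func usesIP(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, fmt.Errorf("invalid address %q: %w", addr, err)
	}
	return net.ParseIP(host) != nil, nil
}

// consistentAddrStyle reports whether all addresses are IP-based or all are
// hostname-based. An empty list counts as consistent.
func consistentAddrStyle(addrs []string) (bool, error) {
	var firstIsIP bool
	for i, addr := range addrs {
		isIP, err := usesIP(addr)
		if err != nil {
			return false, err
		}
		if i == 0 {
			firstIsIP = isIP
			continue
		}
		if isIP != firstIsIP {
			return false, nil
		}
	}
	return true, nil
}
```

Running it over the Raft addresses of all nodes before issuing join requests catches a mixed hostname/IP configuration early.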

**Check 4: Node Already in Cluster**
```bash
# If getting "already in cluster" error:
# The node was previously joined but not cleanly removed
# Solutions:

# Option A: Remove from cluster first
curl -X POST http://localhost:8081/api/remove-voter \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2"}'

# Option B: Clear data and rejoin
rm -rf ./data/node2
# Restart node2
./kvstore --node-id=node2 --raft-addr=localhost:7001
# Then join to cluster
```

---

### Issue 3: Cluster Goes Leaderless After Node Failure

**Symptoms**:
- Nodes become "candidate" state
- Continuous election timeouts
- No leader elected

**Cause**: Quorum lost (a majority of the nodes is no longer running)

**Solutions**:

For a 3-node cluster:
- Need at least 2 nodes running
- If only 1 node is alive, the cluster is stuck

For a 5-node cluster:
- Need at least 3 nodes running
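The numbers above follow from the majority rule: a cluster of n voters needs floor(n/2) + 1 of them running. A small sketch of that arithmetic (the function names are illustrative only):

```go
package main

// quorum returns the minimum number of voting nodes that must be running
// for a cluster of n voters to elect a leader and commit writes.
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many voters can fail before quorum is lost.
func faultTolerance(n int) int {
	return n - quorum(n)
}
```

For example, `quorum(3)` is 2 and `quorum(5)` is 3, matching the lists above, while `faultTolerance(4)` is 1, the same as for 3 nodes.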

**Recovery**:
```bash
# If you have 2 out of 3 nodes running:
# They should form a quorum and elect leader
curl http://localhost:8081/api/stats | jq '.stats.state'

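# If still stuck, reset both remaining nodes as a last resort.
# WARNING: deleting the *.db files discards all Raft state on these nodes;
# back up ./data first.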
pkill kvstore
rm -rf ./data/node1/*.db ./data/node2/*.db

# Restart node1 with bootstrap
./kvstore --node-id=node1 --raft-addr=localhost:7000 --bootstrap

# Restart node2
./kvstore --node-id=node2 --raft-addr=localhost:7001

# Rejoin
curl -X POST http://localhost:8081/api/join \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2","addr":"localhost:7001"}'
```

---

## Replication Problems

### Issue 1: Data Not Replicating to All Nodes

**Symptoms**:
- Write succeeds on leader
- Read returns value on leader only
- Follower nodes have stale data

**Diagnostics**:
```bash
# Check commit index on each node
echo "Node 1:"
curl -s http://localhost:8081/api/stats | jq '.stats.commit_index'

echo "Node 2:"
curl -s http://localhost:8082/api/stats | jq '.stats.commit_index'

echo "Node 3:"
curl -s http://localhost:8083/api/stats | jq '.stats.commit_index'

# All nodes should report the same commit_index
# (a brief difference under write load is normal)
```

**Solutions**:

**1. Network Partition**
```bash
# Check if follower can reach leader
nc -zv <leader-ip> 7000

# If using Docker, check network
docker network ls
docker network inspect kvstore-network

# Check DNS resolution
nslookup node1.kvstore
```

**2. Follower Lagging**
```bash
# Check last_log_index vs applied_index
curl -s http://localhost:8082/api/stats | jq '.stats | {last_log_index, applied_index}'

# Large gap indicates replication lag
# Wait a few seconds and recheck
sleep 5
curl -s http://localhost:8082/api/stats | jq '.stats | {last_log_index, applied_index}'
```

**3. Node Not Joined to Cluster**
```bash
# Verify node is in cluster configuration
curl -s http://localhost:8081/api/stats | jq '.stats'

# If node not listed, rejoin it:
curl -X POST http://localhost:8081/api/join \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2","addr":"localhost:7001"}'
```

**4. Follower Config Mismatch**
```bash
# Remove and re-add follower
curl -X POST http://localhost:8081/api/remove-voter \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2"}'

sleep 2

curl -X POST http://localhost:8081/api/join \
  -H "Content-Type: application/json" \
  -d '{"node_id":"node2","addr":"localhost:7001"}'
```

---

### Issue 2: High Replication Lag

**Symptoms**:
- Followers far behind leader
- Slow data propagation
- High network latency

**Diagnostics**:
```bash
# Calculate lag (-r strips quotes in case the API returns indices as strings)
LEADER_INDEX=$(curl -s http://localhost:8081/api/stats | jq -r '.stats.last_log_index')
FOLLOWER_INDEX=$(curl -s http://localhost:8082/api/stats | jq -r '.stats.applied_index')
LAG=$((LEADER_INDEX - FOLLOWER_INDEX))
echo "Replication lag: $LAG entries"
```
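The same lag calculation can be done in Go, which makes it easier to tolerate indices encoded either as JSON numbers or as JSON strings. The sketch assumes the `{"stats": {...}}` response shape implied by the `jq` paths above; `statsEnvelope`, `indexField`, and `replicationLag` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statsEnvelope mirrors the assumed {"stats": {...}} shape of /api/stats.
type statsEnvelope struct {
	Stats map[string]json.RawMessage `json:"stats"`
}

// indexField extracts an index that may be encoded as a JSON number (42)
// or as a JSON string ("42").
func indexField(body []byte, field string) (uint64, error) {
	var env statsEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return 0, fmt.Errorf("decode stats: %w", err)
	}
	raw, ok := env.Stats[field]
	if !ok {
		return 0, fmt.Errorf("stats field %q missing", field)
	}
	text := string(raw)
	if unquoted, err := strconv.Unquote(text); err == nil {
		text = unquoted // value was a quoted string
	}
	return strconv.ParseUint(text, 10, 64)
}

// replicationLag returns leader last_log_index minus follower applied_index.
// It returns 0 if the follower sample is ahead, which can happen because the
// two responses are fetched at slightly different times.
func replicationLag(leaderBody, followerBody []byte) (uint64, error) {
	leaderIdx, err := indexField(leaderBody, "last_log_index")
	if err != nil {
		return 0, err
	}
	followerIdx, err := indexField(followerBody, "applied_index")
	if err != nil {
		return 0, err
	}
	if followerIdx > leaderIdx {
		return 0, nil
	}
	return leaderIdx - followerIdx, nil
}
```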

**Solutions**:

**1. Increase Batch Size**
```go
// In node.go initialization
raftConfig.MaxAppendEntries = 128  // Increase from 64
```

**2. Reduce Heartbeat Frequency**
```go
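// hashicorp/raft rejects a config whose ElectionTimeout is smaller than
// HeartbeatTimeout, so raise both together
raftConfig.HeartbeatTimeout = 2000 * time.Millisecond  // Increase from 1000
raftConfig.ElectionTimeout = 2000 * time.Millisecond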
```

**3. Check Network Bandwidth**
```bash
# Monitor network usage
iftop
nethogs

# If the link is saturated, consider reducing the size of log entries
```

**4. Snapshot More Frequently**
```go
raftConfig.SnapshotThreshold = 4096  // Reduce from 8192
```

---

## Performance Issues

### Issue 1: High CPU Usage

**Cause Analysis**:

1. **Excessive Elections**
```bash
# Check if node state changes frequently
for i in {1..10}; do
  curl -s http://localhost:8081/api/stats | jq '.stats.state'
  sleep 1
done

# Should stay consistent (mostly "Leader" or "Follower")
# If changing, indicates election thrashing
```

2. **High Log Replication Traffic**
```bash
# Monitor Raft metrics
curl -s http://localhost:8081/api/stats | jq '.stats | {last_log_index, last_log_term}'

# If indices increasing rapidly, high write load
```
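The sampling loop under "Excessive Elections" leaves the judgement to the reader. A minimal sketch of how the sampled states could be evaluated automatically follows; the threshold is an assumption to tune per deployment, and both function names are illustrative.

```go
package main

// stateChanges counts how often consecutive state samples differ.
func stateChanges(samples []string) int {
	changes := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] != samples[i-1] {
			changes++
		}
	}
	return changes
}

// isThrashing reports whether the node changed state more than maxChanges
// times within the sampled window.
func isThrashing(samples []string, maxChanges int) bool {
	return stateChanges(samples) > maxChanges
}
```

A single change within ten one-second samples is consistent with one ordinary election; several changes in the same window suggest election thrashing.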

**Solutions**:

```go
// Reduce Raft tick frequency
raftConfig.HeartbeatTimeout = 2000 * time.Millisecond
raftConfig.ElectionTimeout = 2000 * time.Millisecond

// Increase snapshot threshold
raftConfig.SnapshotThreshold = 16384

// Batch writes if possible (application-level)
```

---

### Issue 2: High Memory Usage

**Symptoms**:
- Memory grows over time
- Eventually OOM kill

**Causes**:

1. **Large Log Accumulation**
```bash
# Check log size
du -sh ./data/node1/

# If larger than several GB, enable or tune snapshots
```

2. **Memory Leaks**
```bash
# Monitor memory trend
free -h && sleep 10 && free -h

# If usage keeps growing while the workload is idle, a leak is likely
```

**Solutions**:

```go
// Enable snapshots
raftConfig.SnapshotInterval = 60 * time.Second
raftConfig.SnapshotThreshold = 2048

// Reduce memory storage size (application-level):
// implement data cleanup/archival policies

// Or switch to disk-based storage:
// replace MemoryStorage with BadgerStorage
```

---

### Issue 3: Slow Reads/Writes

**Diagnostics**:
```bash
# Measure latency (quote the URL so the shell does not interpret "?")
time curl "http://localhost:8081/api/get?key=test"

# Measure throughput: time 100 sequential writes
time (for i in {1..100}; do
  curl -s -o /dev/null -X POST http://localhost:8081/api/set \
    -H "Content-Type: application/json" \
    -d "{\"key\":\"test$i\",\"value\":\"value\"}"
done)
```

**Solutions**:

```go
// 1. Reduce network latency
//    - Move nodes geographically closer
//    - Use local network instead of internet

// 2. Reduce Raft commit timeout
raftConfig.CommitTimeout = 25 * time.Millisecond // Reduce from 50

// 3. Increase write batching:
//    at application level, batch multiple sets into a single RPC

// 4. Consider a faster storage engine than BoltDB,
//    such as BadgerDB or RocksDB
```

---

## Network Issues

### Issue 1: Node Cannot Find Other Nodes

**Error**: Connection refused on Raft port

**Causes**:

1. **Firewall Blocking**
```bash
# Check firewall rules
sudo ufw status
sudo firewall-cmd --list-all

# Allow ports
sudo ufw allow 7000:7010/tcp
sudo firewall-cmd --add-port=7000-7010/tcp --permanent

# For Docker
docker network inspect kvstore-network
```

2. **DNS Resolution**
```bash
# If using hostnames instead of IPs:
nslookup node1.example.com
nslookup node2.example.com

# Ensure all nodes can resolve hostnames
# Or use IP addresses instead
```

3. **Address Binding Issue**
```bash
# Node binding to localhost only (127.0.0.1)
# But trying to connect from different host

# Fix: Bind to all interfaces
./kvstore --raft-addr=0.0.0.0:7000

# Or specify the actual IP (preferable if the Raft transport rejects
# 0.0.0.0 as a non-advertisable address)
./kvstore --raft-addr=192.168.1.10:7000
```

---

### Issue 2: Docker Network Issues

**Problem**: Containers can't communicate

**Solutions**:

```bash
# Check network connectivity
docker exec kvstore-node1 ping -c 1 kvstore-node2

# If fails, check network driver
docker network inspect kvstore-network | grep -E "Driver|Name"

# Recreate network if needed
docker network rm kvstore-network
docker network create --driver bridge kvstore-network

# Restart containers on network
docker-compose down
docker-compose up -d
```

---

## Data Loss Prevention

### Issue 1: Data Disappears After Node Restart

**Cause**: In-memory storage is not persistent

**Solution**: Switch to persistent storage
```go
// In node.go, replace:
store := storage.NewMemoryStorage()

// With:
store, err := storage.NewBadgerStorage("./data/kvstore")
if err != nil {
    return nil, err
}
```

---

### Issue 2: Incomplete Snapshots

**Symptoms**: Data inconsistent after recovery

**Prevention**:

```bash
# Ensure graceful shutdown
# Send SIGTERM, not SIGKILL
kill -15 <pid>  # Graceful
# NOT: kill -9 <pid>  # Forceful

# For Docker
docker stop kvstore-node1  # Graceful
# NOT: docker kill kvstore-node1  # Forceful

# In Kubernetes, set a termination grace period in the pod spec (YAML):
# terminationGracePeriodSeconds: 30
```

---

### Issue 3: Backup Corruption

**Verification**:
```bash
# Check backup integrity
tar -tzf backup.tar.gz > /dev/null && echo "Valid" || echo "Corrupted"

# Test restoration into a separate directory
mkdir -p test-restore
tar -xzf backup.tar.gz -C test-restore
./kvstore --data-dir=./test-restore/data/node1
```

---

## Debugging Tools and Techniques

### Enable Debug Logging

**In node.go**:
```go
import (
    "os"

    hclog "github.com/hashicorp/go-hclog"
)

raftConfig.Logger = hclog.New(&hclog.LoggerOptions{
    Name:   "raft",
    Level:  hclog.Debug,
    Output: os.Stderr,
})
```

Recompile and run:
```bash
go build -o kvstore ./cmd/kvstore
./kvstore --bootstrap 2>&1 | tee debug.log
```

### Monitor Live State

**Script to monitor cluster**:
```bash
#!/bin/bash
while true; do
  clear
  echo "=== Cluster Status ==="
  for port in 8081 8082 8083; do
    echo -n "Port $port: "
    curl -s http://localhost:$port/api/stats | jq -r '.stats.state'
  done
  echo ""
  echo "=== Leader ==="
  curl -s http://localhost:8081/api/stats | jq '.leader'
  echo ""
  echo "=== Log Indices ==="
  for port in 8081 8082 8083; do
    echo -n "Port $port: "
    curl -s http://localhost:$port/api/stats | jq -r '.stats | "\(.last_log_index)/\(.applied_index)"'
  done
  sleep 2
done
```

### Inspect Raw Raft Logs

**BoltDB Inspection**:
```bash
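# Install the bbolt CLI (maintained fork of the archived boltdb/bolt;
# `go get` no longer installs binaries in current Go releases)
go install go.etcd.io/bbolt/cmd/bbolt@latest

# Browse logs (stop the node or work on a copy: the file is locked while in use)
bbolt buckets ./data/node1/raft-log.db
# Use a bucket name printed by the previous command
bbolt keys ./data/node1/raft-log.db log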
```

---

## FAQ

### Q: How many nodes do I need for production?

**A**: Minimum 3 nodes:
- 3 nodes: tolerate 1 failure
- 5 nodes: tolerate 2 failures
- 7 nodes: tolerate 3 failures

Use odd cluster sizes: an even-sized cluster tolerates no more failures than the next smaller odd size (2 nodes tolerate 0 failures; 4 nodes tolerate 1, the same as 3).

### Q: What happens if 2 out of 3 nodes fail?

**A**: Cluster becomes unavailable:
- Only 1 of 3 nodes remains, so quorum (2 nodes) is lost
- Cannot elect leader
- Writes fail
- Reads may fail or return stale data, depending on how reads are served

Recovery: bring at least one failed node back online.

### Q: How long does leader election take?

**A**: Typically 1-3 seconds:
- Election timeout: 1000ms
- Candidate waits for votes
- Once majority reached, elected

Can be tuned:
```go
raftConfig.ElectionTimeout = 1000 * time.Millisecond
raftConfig.HeartbeatTimeout = 1000 * time.Millisecond
```

### Q: Can I change cluster size?

**A**: Yes, but carefully:
- Adding nodes: use `/api/join`
- Removing nodes: use `/api/remove-voter` (the endpoint used earlier in this guide)
- Changing quorum: Raft handles automatically

### Q: What if leader keeps changing?

**A**: Check:
1. Network connectivity between nodes
2. System clock skew (NTP sync)
3. CPU/memory constraints (add resources if nodes are starved)
4. Election timeout (increase it to make elections less sensitive to brief delays)

### Q: Can I backup while cluster is running?

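**A**: Yes, with care. A plain copy of a live node's files can capture a write in progress, so verify each backup by restoring it on a separate node (see Backup Corruption above), or stop the node before copying:
```bash
# Archive the node's data directory
tar -czf backup.tar.gz ./data/node1/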
```

### Q: What's the maximum cluster size?

**A**: Practical limit: 7-9 nodes
- More nodes = slower consensus
- Network partitions more likely
- Recommended: 3, 5, or 7 nodes only

---

For additional help, check:
- README.md - Complete documentation
- DEPLOYMENT_GUIDE.md - Deployment procedures
- EXAMPLES.md - Working examples

Last updated: 2025-10-22
