# Container Orchestrator - Complete Implementation Guide

A lightweight, production-ready container orchestration system for managing Docker containers across multiple nodes with health checking, automatic restarts, and service discovery. This guide covers how to understand, build, and deploy the orchestrator.

## Table of Contents

1. [Overview](#overview)
2. [Architecture Deep Dive](#architecture-deep-dive)
3. [Features](#features)
4. [Prerequisites](#prerequisites)
5. [Quick Start](#quick-start)
6. [Project Structure](#project-structure)
7. [Implementation Guide](#implementation-guide)
8. [API Reference](#api-reference)
9. [Testing Strategy](#testing-strategy)
10. [Deployment Guide](#deployment-guide)
11. [Performance Tuning](#performance-tuning)
12. [Troubleshooting](#troubleshooting)
13. [Future Enhancements](#future-enhancements)

---

## Overview

The Container Orchestrator is a simplified yet powerful container management system similar to Kubernetes or Docker Swarm, designed to teach core orchestration concepts while remaining production-capable. It manages the complete container lifecycle, from deployment through health monitoring to automatic recovery.

### What You'll Learn

- Container orchestration fundamentals
- Docker SDK integration in Go
- Resource-aware scheduling algorithms
- Health monitoring and auto-recovery patterns
- Service discovery and registry patterns
- REST API design for infrastructure management
- Graceful shutdown and signal handling
- Integration testing with Docker

### Use Cases

- **Learning Platform**: Understand orchestration internals
- **Development Environments**: Lightweight local container management
- **Testing Infrastructure**: Isolated container testing
- **Microservices Management**: Small-scale service orchestration
- **CI/CD Pipelines**: Container lifecycle automation

---

## Architecture Deep Dive

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       REST API Layer                        │
│  (HTTP Endpoints for Deploy, Scale, Monitor, Logs)         │
└───────────────────┬─────────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────────┐
│                    Orchestrator Core                        │
├──────────────┬──────────────┬──────────────┬───────────────┤
│   Scheduler  │   Registry   │    Health    │  Docker Client│
│              │              │   Checker    │               │
└──────┬───────┴──────┬───────┴──────┬───────┴───────┬───────┘
       │              │              │               │
       ▼              ▼              ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌──────────┐ ┌─────────────┐
│   Node      │ │  Service    │ │Container │ │   Docker    │
│  Selection  │ │  Discovery  │ │ Monitor  │ │   Engine    │
└─────────────┘ └─────────────┘ └──────────┘ └─────────────┘
```

### Component Interactions

1. **API Layer → Scheduler**: Receives deployment requests, asks scheduler for optimal node
2. **Scheduler → Docker Client**: Selects node, instructs Docker to create containers
3. **Docker Client → Docker Engine**: Manages container lifecycle (create, start, stop, remove)
4. **Registry → Service Discovery**: Tracks all services and their containers
5. **Health Checker → Docker Client**: Monitors containers, triggers restarts on failure
6. **Event Log**: Records all orchestrator activities for auditing

### Core Components Explained

#### 1. Scheduler (internal/scheduler/scheduler.go)

The scheduler is responsible for intelligent container placement across available nodes.

**Key Responsibilities:**
- Node registration and management
- Resource availability tracking
- Container placement decisions
- Load distribution

**Scheduling Algorithm:**
```
FOR each deployment request:
  1. Validate resource requirements (CPU, Memory)
  2. Filter available nodes (status = available)
  3. Check each node's capacity against requirements
  4. Select the eligible node with the largest memory capacity (the implementation compares each node's total MemoryMB; allocations are not tracked)
  5. Return selected node or error if no suitable node exists
```

**Why "Most Available Memory" Strategy:**
- Simple and effective for small-scale deployments
- Steers new containers toward the node with the most memory headroom
- Leaves room for CPU-heavy workloads, since CPU is not part of the placement decision
- Easy to understand and debug

**Concurrency Model:**
- Read-Write mutex (sync.RWMutex) for thread-safe operations
- Multiple readers allowed simultaneously
- Exclusive locking for writes (node registration/removal)

#### 2. Docker Client (internal/docker/client.go)

Wraps the official Docker SDK to provide orchestrator-specific operations.

**Key Responsibilities:**
- Container lifecycle management
- Image pulling and caching
- Port binding configuration
- Resource limit enforcement
- Log retrieval

**Container Creation Flow:**
```
1. Check if image exists locally
2. Pull image if not present (with progress tracking)
3. Configure container settings:
   - Environment variables
   - Command and arguments
   - Labels for orchestrator tracking
4. Configure host settings:
   - CPU shares (relative weight)
   - Memory limits (hard cap)
   - Port bindings (host:container mapping)
   - Restart policy (on-failure with max retries)
5. Create container with Docker API
6. Return container ID for tracking
```

**Resource Management:**
- CPU Shares: Relative weight that only matters under CPU contention (1024 is Docker's default weight, not a hard limit of one CPU)
- Memory Limits: Hard cap in bytes (triggers OOM if exceeded)
- Restart Policy: Automatic restart on failure (max 3 retries)
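
To make the relative-weight semantics of CPU shares concrete, here is a small sketch. It is not part of the orchestrator; `cpuFraction` is a hypothetical helper that estimates the share of CPU time a container would get when every listed container is busy at the same time.

```go
package main

// cpuFraction returns the fraction of CPU time a container with the given
// shares would receive when every container in allShares competes for CPU.
// Shares are relative weights only: on an idle host a container may use all
// available CPU regardless of its share value.
func cpuFraction(shares int64, allShares []int64) float64 {
    var total int64
    for _, s := range allShares {
        total += s
    }
    if total == 0 {
        return 0
    }
    return float64(shares) / float64(total)
}
```

For example, with three busy containers holding 1024, 512, and 512 shares, the first would be entitled to roughly half of the CPU time and the other two to a quarter each.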

**Label-Based Tracking:**
All containers get labels:
- `orchestrator.service`: Service name
- `orchestrator.managed`: "true" (for filtering)

This enables the orchestrator to:
- List only managed containers
- Group containers by service
- Avoid interfering with user containers

#### 3. Health Checker (internal/health/checker.go)

Continuous health monitoring with automatic recovery.

**Supported Check Types:**

1. **HTTP Health Check**
   - Sends GET request to specified endpoint
   - Success: HTTP 200-299 status code
   - Use for: Web applications, REST APIs, microservices
   - Example: `http://localhost:8080/health`

2. **TCP Health Check**
   - Attempts TCP connection to host:port
   - Success: Connection established
   - Use for: Databases, TCP services, message queues
   - Example: `localhost:5432` (PostgreSQL)

3. **Exec Health Check**
   - Executes command inside container
   - Success: Exit code 0
   - Use for: Custom health logic, batch jobs
   - Example: `["/bin/check-health.sh"]`

**Health Check Lifecycle:**

```
Start → Check (Interval) → Evaluate → Action
  ▲                                      │
  │                                      ▼
  └───────────── Repeat ─────────────────┘

Actions:
- Healthy: Reset failure count, continue monitoring
- Unhealthy (< retries): Increment failure count, wait
- Failed (>= retries): Mark unhealthy, trigger callback
```

**Failure Handling:**
```go
if !healthy {
    monitor.FailureCount++
    if monitor.FailureCount >= monitor.Config.Retries {
        monitor.Status = "unhealthy"
        onFailure(monitor.ContainerID) // restart callback
    }
}
```

**Concurrency:**
- Each container gets its own monitoring goroutine
- Independent timers prevent cascading delays
- Graceful shutdown via stop channels
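
The stop-channel pattern can be sketched as follows. The types here are stripped-down stand-ins (the real `HealthMonitor` carries more fields), and `stopMonitoring` is a hypothetical lowercase counterpart of `StopMonitoring` written for this example only.

```go
package main

import "sync"

// monitor holds only the stop channel; the real HealthMonitor has more fields.
type monitor struct {
    stopCh chan struct{}
}

// checker tracks one monitor per container ID.
type checker struct {
    mu     sync.Mutex
    checks map[string]*monitor
}

// stopMonitoring signals the container's goroutine to exit and forgets the
// monitor. Deleting the entry under the lock makes a second call a harmless
// no-op instead of a double-close panic. It reports whether a monitor was
// actually stopped.
func (c *checker) stopMonitoring(containerID string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()

    m, ok := c.checks[containerID]
    if !ok {
        return false
    }
    close(m.stopCh) // wakes the goroutine blocked in its select loop
    delete(c.checks, containerID)
    return true
}
```

Closing the channel, rather than sending on it, means the monitoring goroutine is released even if it is busy running a check at that moment and only reads the channel later.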

#### 4. Service Registry (internal/registry/service_registry.go)

Central service discovery and container tracking.

**Data Structures:**
```
Registry
├── services (map[string]*Service)
│   ├── "web-app" → Service
│   │   ├── Name: "web-app"
│   │   ├── Image: "nginx:alpine"
│   │   ├── Replicas: 3
│   │   └── Containers: [Container1, Container2, Container3]
│   └── "api-server" → Service
│       └── ...
```

**Operations:**
- **Register**: Add/update service
- **Unregister**: Remove service completely
- **Get**: Retrieve service by name
- **List**: Get all services
- **AddContainer**: Add container to service
- **RemoveContainer**: Remove container from service
- **Discover**: Find healthy containers for service

**Service Discovery Pattern:**
```go
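endpoints, err := registry.Discover("web-app")
if err != nil || len(endpoints) == 0 {
    return errors.New("no healthy endpoints for web-app")
}
// endpoints: ["container-id-1", "container-id-2"] (only healthy)

// Load balancing logic (rand.Intn panics on an empty slice, hence the check above):
selectedEndpoint := endpoints[rand.Intn(len(endpoints))]
makeRequest(selectedEndpoint)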
```

#### 5. API Handler (internal/api/handlers.go)

REST API for orchestrator management.

**Request Flow:**
```
HTTP Request → Router (Gorilla Mux) → Handler → Core Components
                                          ↓
                                    JSON Response
```

**Handler Architecture:**
- Single API struct holds references to all components
- Each handler is a method on API struct
- Stateless request processing (no session state)
- Event logging for audit trail

**Deployment Handler Deep Dive:**

```
POST /api/deploy
├── 1. Parse and validate deployment request
├── 2. Create service in registry
├── 3. For each replica:
│   ├── 3a. Select node via scheduler
│   ├── 3b. Create container via Docker client
│   ├── 3c. Start container
│   ├── 3d. Register container in service
│   └── 3e. Start health monitoring (if configured)
├── 4. Log deployment event
└── 5. Return service details (201 Created)
```

**Error Handling:**
- Validation errors: 400 Bad Request
- Not found: 404 Not Found
- Scheduling failures: 500 Internal Server Error
- Docker errors: 500 with descriptive message

---

## Features

### Core Features

- **Container Lifecycle Management**: Complete CRUD operations on containers
- **Multi-Node Scheduling**: Intelligent placement across available nodes
- **Health Monitoring**: HTTP, TCP, and exec-based health checks
- **Auto-Recovery**: Automatic restart of failed containers
- **Service Registry**: Service discovery and container tracking
- **REST API**: Full programmatic access
- **Event Logging**: Audit trail of all operations
- **Resource Management**: CPU and memory limits
- **Port Mapping**: Flexible container port exposure
- **Graceful Shutdown**: Clean termination with context cancellation

### Advanced Features

- **Label-Based Filtering**: Track only managed containers
- **Concurrent Operations**: Thread-safe component design
- **Image Caching**: Pull images only once
- **Log Streaming**: Real-time container logs
- **Service Scaling**: Dynamic replica adjustment
- **Failure Retry Logic**: Configurable retry thresholds
- **Event History**: Last 100 events stored

---

## Prerequisites

### Required Software

- **Go 1.23 or higher**: For building the application
  ```bash
  go version
  # Should output: go version go1.23.0 or higher
  ```

- **Docker**: For container management
  ```bash
  docker --version
  # Should output: Docker version 20.10.0 or higher
  ```

- **Docker Socket Access**: Read/write access to `/var/run/docker.sock`
  ```bash
  ls -la /var/run/docker.sock
  # Should show socket file exists
  ```

### Optional Tools

- **Make**: For build automation
- **golangci-lint**: For code linting
- **Docker Compose**: For simplified deployment
- **curl** or **httpie**: For API testing

### System Requirements

- **CPU**: 2+ cores recommended
- **RAM**: 2GB+ available memory
- **Disk**: 1GB+ free space
- **OS**: Linux, macOS, or Windows with WSL2

---

## Quick Start

### Option 1: Using Make (Recommended)

```bash
# Clone or extract the project
cd container-orchestrator

# Download dependencies
go mod download

# Build the binary
make build

# Run the orchestrator
make run

# In another terminal, deploy a test service
curl -X POST http://localhost:8080/api/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test-nginx",
    "image": "nginx:alpine",
    "replicas": 2
  }'

# View running services
curl http://localhost:8080/api/services
```

### Option 2: Using Docker Compose (Easiest)

```bash
# Start the orchestrator
docker-compose up -d

# View logs
docker-compose logs -f orchestrator

# The API is now available at http://localhost:8080

# Stop when done
docker-compose down
```

### Option 3: Manual Build and Run

```bash
# Download dependencies
go mod download

# Build binary
go build -o bin/orchestrator cmd/orchestrator/main.go

# Run orchestrator
./bin/orchestrator

# The API starts on port 8080 by default
```

### Verify Installation

```bash
# Check if API is running
curl http://localhost:8080/api/services

# Should return: []

# Deploy a test service
curl -X POST http://localhost:8080/api/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "name": "hello",
    "image": "nginx:alpine",
    "replicas": 1,
    "ports": [{
      "container_port": 80,
      "host_port": 8081,
      "protocol": "tcp"
    }]
  }'

# Access the service
curl http://localhost:8081
# Should return: nginx welcome page
```

---

## Project Structure

```
container-orchestrator/
├── cmd/
│   └── orchestrator/
│       └── main.go                    # Application entry point
│
├── internal/                          # Private application code
│   ├── models/
│   │   └── types.go                   # Data structures and types
│   │
│   ├── docker/
│   │   └── client.go                  # Docker SDK wrapper
│   │
│   ├── scheduler/
│   │   └── scheduler.go               # Container placement logic
│   │
│   ├── health/
│   │   └── checker.go                 # Health monitoring
│   │
│   ├── registry/
│   │   └── service_registry.go        # Service discovery
│   │
│   └── api/
│       └── handlers.go                # REST API handlers
│
├── tests/
│   └── integration_test.go            # Integration tests
│
├── Dockerfile                         # Multi-stage container build
├── Makefile                           # Build automation
├── docker-compose.yml                 # Docker Compose config
├── go.mod                             # Go module definition
├── go.sum                             # Dependency checksums
├── .gitignore                         # Git ignore patterns
└── README.md                          # This file
```

### File-by-File Breakdown

#### cmd/orchestrator/main.go (96 lines)

**Purpose**: Application bootstrap and initialization

**Key Functions:**
- `main()`: Entry point, calls run() and handles errors
- `run()`: Initializes all components and starts HTTP server
- Signal handling for graceful shutdown

**Dependencies:**
- Gorilla Mux (HTTP routing)
- All internal components (docker, scheduler, registry, health, api)

**Initialization Flow:**
```
1. Create Docker client
2. Create scheduler
3. Create registry
4. Create health checker
5. Register local node
6. Create API handler
7. Setup HTTP routes
8. Start HTTP server
9. Wait for interrupt signal
10. Graceful shutdown
```

#### internal/models/types.go (91 lines)

**Purpose**: Define all data structures used throughout the application

**Key Types:**

1. **DeploymentRequest**: Client request to deploy a service
   - Service name, image, replica count
   - Environment variables, commands
   - Port mappings, resource limits
   - Health check configuration

2. **Container**: Running container instance
   - Unique ID, name, status
   - Node assignment
   - Health status, timestamps

3. **Service**: Logical service with multiple replicas
   - Name, image, desired replicas
   - Container instances
   - Health check config
   - Creation timestamp

4. **Node**: Host machine for containers
   - ID, address, capacity (CPU, RAM)
   - Availability status
   - Last heartbeat timestamp

5. **Event**: Audit log entry
   - Timestamp, type (deployment, health_check, restart)
   - Service and container references
   - Message description

**Design Patterns:**
- JSON tags for API serialization
- `time.Duration` for intervals (encoded in JSON as integer nanoseconds unless a custom unmarshaler is added)
- Omitempty for optional fields

#### internal/docker/client.go (200 lines)

**Purpose**: Wrap Docker SDK with orchestrator-specific logic

**Key Methods:**

1. **NewClient()**: Initialize Docker client with API negotiation
2. **CreateContainer()**: Create container from deployment request
3. **StartContainer()**: Start a created container
4. **StopContainer()**: Gracefully stop container with timeout
5. **RemoveContainer()**: Delete container
6. **GetContainerLogs()**: Stream container logs
7. **InspectContainer()**: Get detailed container info
8. **ListContainers()**: List all managed containers
9. **ensureImage()**: Pull image if not cached locally
10. **buildPortBindings()**: Convert port mappings to Docker format

**Error Handling:**
- Wrap errors with context using fmt.Errorf
- Return descriptive error messages
- Propagate Docker SDK errors

**Resource Management:**
- Convert MB to bytes for memory limits
- Configure restart policies
- Set container labels for tracking

#### internal/scheduler/scheduler.go (106 lines)

**Purpose**: Select optimal nodes for container placement

**Key Methods:**

1. **NewScheduler()**: Initialize scheduler with empty node map
2. **RegisterNode()**: Add node to available pool
3. **UnregisterNode()**: Remove node from pool
4. **SelectNode()**: Choose best node for deployment
5. **GetNodes()**: List all nodes
6. **GetNode()**: Get specific node by ID

**Scheduling Algorithm:**
```go
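func (s *Scheduler) SelectNode(req models.DeploymentRequest) (*models.Node, error) {
    s.mu.RLock()
    defer s.mu.RUnlock()

    // Set defaults (cpuReq is computed but not yet used for placement)
    cpuReq := req.Resources.CPUShares
    if cpuReq == 0 {
        cpuReq = 1024
    }
    memReq := req.Resources.MemoryMB
    if memReq == 0 {
        memReq = 512
    }

    // Find the available node with the largest memory capacity
    var bestNode *models.Node
    for _, node := range s.nodes {
        if !node.Available || node.MemoryMB < memReq {
            continue
        }
        if bestNode == nil || node.MemoryMB > bestNode.MemoryMB {
            bestNode = node
        }
    }

    if bestNode == nil {
        return nil, ErrNoSuitableNode
    }
    return bestNode, nil
}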
```

**Concurrency Safety:**
- RWMutex for thread-safe access
- Read locks for queries (GetNodes, SelectNode)
- Write locks for modifications (RegisterNode, UnregisterNode)

#### internal/health/checker.go (165 lines)

**Purpose**: Monitor container health and trigger recovery

**Key Methods:**

1. **NewChecker()**: Initialize health checker
2. **StartMonitoring()**: Begin health checks for container
3. **StopMonitoring()**: Stop health checks
4. **GetStatus()**: Query current health status
5. **runHealthChecks()**: Background monitoring loop
6. **performCheck()**: Execute single health check
7. **checkHTTP()**: HTTP health check implementation
8. **checkTCP()**: TCP health check implementation
9. **checkExec()**: Exec health check (simplified)

**Health Check Loop:**
```go
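// Simplified: locking is omitted here; see Step 5 for the full version.
func (c *Checker) runHealthChecks(monitor *HealthMonitor, onFailure func(string)) {
    ticker := time.NewTicker(monitor.Config.Interval)
    defer ticker.Stop()

    for {
        select {
        case <-monitor.StopCh:
            return // Graceful shutdown

        case <-ticker.C:
            if c.performCheck(monitor) {
                monitor.Status = "healthy"
                monitor.FailureCount = 0
                continue
            }
            monitor.FailureCount++
            if monitor.FailureCount >= monitor.Config.Retries {
                monitor.Status = "unhealthy"
                onFailure(monitor.ContainerID)
            }
        }
    }
}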
```

**Check Implementations:**

1. **HTTP**: GET request with timeout
   - Success: 200-299 status code
   - Failure: Network error or non-2xx status

2. **TCP**: Connection attempt with timeout
   - Success: Connection established
   - Failure: Connection refused or timeout

3. **Exec**: Command execution (simplified)
   - Production: Use Docker Exec API
   - Current: Always returns true (placeholder)

#### internal/registry/service_registry.go (118 lines)

**Purpose**: Track services and enable discovery

**Key Methods:**

1. **NewRegistry()**: Initialize empty registry
2. **Register()**: Add or update service
3. **Unregister()**: Remove service
4. **Get()**: Retrieve service by name
5. **List()**: Get all services
6. **AddContainer()**: Add container to service
7. **RemoveContainer()**: Remove container from service
8. **Discover()**: Find healthy endpoints for service

**Service Discovery:**
```go
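// Assumes Get returns (*models.Service, error) and handles its own locking.
func (r *Registry) Discover(serviceName string) ([]string, error) {
    service, err := r.Get(serviceName)
    if err != nil {
        return nil, err
    }

    var endpoints []string
    for _, c := range service.Containers {
        if c.Status == "running" && c.Health == "healthy" {
            endpoints = append(endpoints, c.ID)
        }
    }
    return endpoints, nil
}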
```

**Thread Safety:**
- RWMutex for concurrent access
- Write lock for Register, Unregister, AddContainer, RemoveContainer
- Read lock for Get, List, Discover

#### internal/api/handlers.go (242 lines)

**Purpose**: HTTP API for orchestrator management

**Key Methods:**

1. **NewAPI()**: Create API with component dependencies
2. **DeployHandler()**: Handle service deployments
3. **ListServicesHandler()**: Return all services
4. **ScaleHandler()**: Scale service replicas
5. **LogsHandler()**: Stream container logs
6. **EventsHandler()**: Return event history
7. **handleContainerFailure()**: Recovery callback
8. **logEvent()**: Record events (max 100)
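
The 100-event cap used by `logEvent()` can be implemented as a bounded, mutex-protected slice. The sketch below uses a reduced `event` type and hypothetical names (`eventLog`, `add`, `snapshot`); it shows the trimming idea rather than the handler's actual code.

```go
package main

import "sync"

// event is a stand-in for models.Event with only the fields used here.
type event struct {
    Type    string
    Message string
}

// eventLog keeps the most recent limit events, discarding the oldest first.
type eventLog struct {
    mu     sync.Mutex
    limit  int
    events []event
}

func newEventLog(limit int) *eventLog {
    return &eventLog{limit: limit}
}

// add appends e and trims the log back down to limit entries.
func (l *eventLog) add(e event) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.events = append(l.events, e)
    if over := len(l.events) - l.limit; over > 0 {
        // Copy into a fresh slice so dropped events can be garbage collected.
        l.events = append([]event(nil), l.events[over:]...)
    }
}

// snapshot returns a copy that callers can read without holding the lock.
func (l *eventLog) snapshot() []event {
    l.mu.Lock()
    defer l.mu.Unlock()
    return append([]event(nil), l.events...)
}
```

Returning a copy from `snapshot` lets `EventsHandler` serialize the history while other handlers keep logging.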

**Handler Implementations:**

**DeployHandler (POST /api/deploy):**
```
1. Parse JSON deployment request
2. Validate required fields (name, image)
3. Set default replicas (1 if not specified)
4. Create service in registry
5. For each replica:
   a. Select node via scheduler
   b. Create container via Docker client
   c. Start container
   d. Add container to service
   e. Start health monitoring (if configured)
6. Log deployment event
7. Return service details (201 Created)
```
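
Steps 1 to 3 of the deploy flow (parse, validate, default) can be sketched as a small pure function. The `deployRequest` type mirrors only three fields of `models.DeploymentRequest`, and both `parseDeployRequest` and the negative-replica check are assumptions added for this example.

```go
package main

import (
    "encoding/json"
    "errors"
)

// deployRequest mirrors only the fields of models.DeploymentRequest needed here.
type deployRequest struct {
    Name     string `json:"name"`
    Image    string `json:"image"`
    Replicas int    `json:"replicas"`
}

// parseDeployRequest decodes the body, checks required fields, and applies
// the default replica count.
func parseDeployRequest(body []byte) (deployRequest, error) {
    var req deployRequest
    if err := json.Unmarshal(body, &req); err != nil {
        return req, errors.New("invalid JSON: " + err.Error())
    }
    if req.Name == "" {
        return req, errors.New("name is required")
    }
    if req.Image == "" {
        return req, errors.New("image is required")
    }
    if req.Replicas < 0 {
        return req, errors.New("replicas must not be negative") // assumption, not in the original flow
    }
    if req.Replicas == 0 {
        req.Replicas = 1 // default when omitted
    }
    return req, nil
}
```

Keeping validation separate from the HTTP handler makes it easy to unit test without `httptest` and to return 400 Bad Request for any error it reports.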

**ScaleHandler (POST /api/services/{name}/scale):**
```
1. Extract service name from URL path
2. Parse desired replica count
3. Get current service
4. Compare current vs desired replicas
5. If scaling up:
   - Create and start new containers
   - Add to service
6. If scaling down:
   - Stop and remove excess containers
   - Remove from service
7. Return 202 Accepted
```

**LogsHandler (GET /api/services/{name}/logs):**
```
1. Extract service name
2. Get service from registry
3. Get first container
4. Stream logs via Docker client
5. Return as text/plain
```

#### tests/integration_test.go (131 lines)

**Purpose**: End-to-end integration testing

**Test Cases:**

1. **TestDeployService**: Full deployment flow
   - Setup components
   - Register test node
   - Deploy 2 nginx replicas
   - Verify service registered
   - Cleanup containers

2. **TestSchedulerNodeSelection**: Scheduling logic
   - Register 2 nodes (different capacities)
   - Request high-memory deployment
   - Verify node with more memory selected

3. **TestHealthChecking**: Health monitoring
   - Start monitoring non-existent endpoint
   - Verify failures detected
   - Verify status becomes "unhealthy"

**Test Helpers:**
- httptest.NewRequest for HTTP testing
- httptest.NewRecorder for response capture
- Cleanup in defer statements
- require vs assert (fail-fast vs continue)

---

## Implementation Guide

This section provides step-by-step guidance for implementing the orchestrator from scratch.

### Step 1: Project Setup

```bash
# Create project directory
mkdir container-orchestrator
cd container-orchestrator

# Initialize Go module
go mod init github.com/yourusername/container-orchestrator

# Create directory structure
mkdir -p cmd/orchestrator
mkdir -p internal/{models,docker,scheduler,health,registry,api}
mkdir -p tests

# Install dependencies
go get github.com/docker/docker@v28.5.1+incompatible
go get github.com/docker/go-connections@v0.6.0
go get github.com/gorilla/mux@v1.8.1
go get github.com/stretchr/testify@v1.11.1
```

### Step 2: Define Data Models

Create `internal/models/types.go`:

```go
package models

import "time"

// DeploymentRequest represents a service deployment request
type DeploymentRequest struct {
    Name        string            `json:"name"`
    Image       string            `json:"image"`
    Replicas    int               `json:"replicas"`
    Command     []string          `json:"command,omitempty"`
    Env         map[string]string `json:"env,omitempty"`
    Ports       []PortMapping     `json:"ports,omitempty"`
    Resources   ResourceLimits    `json:"resources,omitempty"`
    HealthCheck HealthCheckConfig `json:"health_check,omitempty"`
}

// PortMapping defines container port exposure
type PortMapping struct {
    ContainerPort int    `json:"container_port"`
    HostPort      int    `json:"host_port"`
    Protocol      string `json:"protocol"` // tcp, udp
}

// ResourceLimits defines CPU and memory constraints
type ResourceLimits struct {
    CPUShares    int64 `json:"cpu_shares"`
    MemoryMB     int64 `json:"memory_mb"`
    MemorySwapMB int64 `json:"memory_swap_mb,omitempty"`
}

// HealthCheckConfig defines health monitoring
type HealthCheckConfig struct {
    Type     string        `json:"type"`              // http, tcp, exec
    Endpoint string        `json:"endpoint"`          // URL or address
    Command  []string      `json:"command,omitempty"` // For exec
    Interval time.Duration `json:"interval"`
    Timeout  time.Duration `json:"timeout"`
    Retries  int           `json:"retries"`
}

// Container represents a running container instance
type Container struct {
    ID          string    `json:"id"`
    Name        string    `json:"name"`
    ServiceName string    `json:"service_name"`
    NodeID      string    `json:"node_id"`
    Image       string    `json:"image"`
    Status      string    `json:"status"` // running, stopped, failed
    Health      string    `json:"health"` // healthy, unhealthy, unknown
    CreatedAt   time.Time `json:"created_at"`
    StartedAt   time.Time `json:"started_at"`
}

// Service represents a deployed service with replicas
type Service struct {
    Name        string            `json:"name"`
    Image       string            `json:"image"`
    Replicas    int               `json:"replicas"`
    Containers  []Container       `json:"containers"`
    HealthCheck HealthCheckConfig `json:"health_check"`
    CreatedAt   time.Time         `json:"created_at"`
}

// Node represents a host machine
type Node struct {
    ID        string    `json:"id"`
    Address   string    `json:"address"`
    CPUCores  int       `json:"cpu_cores"`
    MemoryMB  int64     `json:"memory_mb"`
    Available bool      `json:"available"`
    LastSeen  time.Time `json:"last_seen"`
}

// Event represents a system event
type Event struct {
    Timestamp time.Time `json:"timestamp"`
    Type      string    `json:"type"` // deployment, health_check, restart
    Service   string    `json:"service"`
    Container string    `json:"container,omitempty"`
    Message   string    `json:"message"`
}
```

**Key Design Decisions:**

1. **JSON Tags**: Enable automatic marshaling/unmarshaling
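2. **time.Duration**: Type-safe intervals. Note that `encoding/json` reads and writes a plain `time.Duration` as integer nanoseconds; accepting strings such as "10s" or "5m" requires a custom `UnmarshalJSON`
3. **Omitempty**: Optional slices, maps, and scalars are left out of the JSON when empty. It has no effect on struct-valued fields such as `Resources` and `HealthCheck` unless they are changed to pointers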
4. **Separation**: Clear distinction between request and runtime models

### Step 3: Implement Scheduler

Create `internal/scheduler/scheduler.go`:

```go
package scheduler

import (
    "errors"
    "sync"
    "github.com/yourusername/container-orchestrator/internal/models"
)

var ErrNoSuitableNode = errors.New("no suitable node found")

// Scheduler handles container placement
type Scheduler struct {
    nodes map[string]*models.Node
    mu    sync.RWMutex
}

// NewScheduler creates a new scheduler
func NewScheduler() *Scheduler {
    return &Scheduler{
        nodes: make(map[string]*models.Node),
    }
}

// RegisterNode adds a node
func (s *Scheduler) RegisterNode(node *models.Node) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.nodes[node.ID] = node
}

// SelectNode chooses the best node
func (s *Scheduler) SelectNode(req models.DeploymentRequest) (*models.Node, error) {
    s.mu.RLock()
    defer s.mu.RUnlock()

    // Set default resource requirements.
    // Note: cpuReq is computed but not yet used in the placement decision below.
    cpuReq := req.Resources.CPUShares
    if cpuReq == 0 {
        cpuReq = 1024 // Docker's default CPU share weight
    }

    memReq := req.Resources.MemoryMB
    if memReq == 0 {
        memReq = 512 // Default 512MB
    }

    // Find the available node with the largest memory capacity
    // (current allocations are not tracked)
    var bestNode *models.Node
    var maxAvailableMemory int64

    for _, node := range s.nodes {
        if !node.Available {
            continue
        }

        if node.MemoryMB >= memReq {
            if bestNode == nil || node.MemoryMB > maxAvailableMemory {
                bestNode = node
                maxAvailableMemory = node.MemoryMB
            }
        }
    }

    if bestNode == nil {
        return nil, ErrNoSuitableNode
    }

    return bestNode, nil
}
```

**Implementation Notes:**

1. **Thread Safety**: RWMutex allows concurrent reads
2. **Simple Algorithm**: Most available memory (easy to understand)
3. **Default Resources**: Prevent zero-value issues
4. **Error Handling**: Sentinel error (`ErrNoSuitableNode`) that callers can test with `errors.Is`
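
The scheduler above compares each node's total `MemoryMB`, so repeated deployments keep landing on the largest node. A possible extension, sketched here under the assumption of a hypothetical `AllocatedMB` field and `reserve`/`release` methods, is to track reservations so that "most available memory" reflects what is actually free.

```go
package main

import (
    "errors"
    "sync"
)

var errNoSuitableNode = errors.New("no suitable node found")

// node is a stand-in for models.Node plus a hypothetical AllocatedMB field.
type node struct {
    ID          string
    MemoryMB    int64 // total capacity
    AllocatedMB int64 // memory already reserved by placed containers
    Available   bool
}

type scheduler struct {
    mu    sync.Mutex
    nodes map[string]*node
}

// reserve picks the available node with the most free memory and records
// the reservation so later placements see the reduced headroom.
func (s *scheduler) reserve(memReqMB int64) (*node, error) {
    s.mu.Lock() // exclusive lock: this method mutates AllocatedMB
    defer s.mu.Unlock()

    var best *node
    var bestFree int64
    for _, n := range s.nodes {
        free := n.MemoryMB - n.AllocatedMB
        if !n.Available || free < memReqMB {
            continue
        }
        if best == nil || free > bestFree {
            best, bestFree = n, free
        }
    }
    if best == nil {
        return nil, errNoSuitableNode
    }
    best.AllocatedMB += memReqMB
    return best, nil
}

// release returns memory to a node when a container is removed.
func (s *scheduler) release(nodeID string, memMB int64) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if n, ok := s.nodes[nodeID]; ok {
        n.AllocatedMB -= memMB
        if n.AllocatedMB < 0 {
            n.AllocatedMB = 0
        }
    }
}
```

Selection and reservation happen under one lock so two concurrent deployments cannot both claim the same free memory.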

### Step 4: Implement Docker Client

Create `internal/docker/client.go` (partial example):

```go
package docker

import (
    "context"
    "fmt"
    "io"

    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/api/types/image"
    "github.com/docker/docker/client"
    "github.com/yourusername/container-orchestrator/internal/models"
)

// Client wraps Docker SDK
type Client struct {
    cli *client.Client
}

// NewClient creates a Docker client
func NewClient() (*Client, error) {
    cli, err := client.NewClientWithOpts(
        client.FromEnv,
        client.WithAPIVersionNegotiation(),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create docker client: %w", err)
    }
    return &Client{cli: cli}, nil
}

// CreateContainer creates a new container
func (c *Client) CreateContainer(
    ctx context.Context,
    req models.DeploymentRequest,
    name string,
) (string, error) {
    // Pull image first
    if err := c.ensureImage(ctx, req.Image); err != nil {
        return "", err
    }

    // Configure container
    config := &container.Config{
        Image: req.Image,
        Cmd:   req.Command,
        Labels: map[string]string{
            "orchestrator.service": req.Name,
            "orchestrator.managed": "true",
        },
    }

    // Configure resources
    hostConfig := &container.HostConfig{
        Resources: container.Resources{
            CPUShares: req.Resources.CPUShares,
            Memory:    req.Resources.MemoryMB * 1024 * 1024,
        },
        RestartPolicy: container.RestartPolicy{
            Name:              "on-failure",
            MaximumRetryCount: 3,
        },
    }

    // Create container
    resp, err := c.cli.ContainerCreate(
        ctx, config, hostConfig, nil, nil, name,
    )
    if err != nil {
        return "", fmt.Errorf("failed to create: %w", err)
    }

    return resp.ID, nil
}

// ensureImage pulls image if not present
func (c *Client) ensureImage(ctx context.Context, imageName string) error {
    // Check if image exists
    images, err := c.cli.ImageList(ctx, image.ListOptions{})
    if err != nil {
        return err
    }

    for _, img := range images {
        for _, tag := range img.RepoTags {
            if tag == imageName {
                return nil // Already exists
            }
        }
    }

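    // Pull image
    reader, err := c.cli.ImagePull(ctx, imageName, image.PullOptions{})
    if err != nil {
        return fmt.Errorf("failed to pull image %s: %w", imageName, err)
    }
    defer reader.Close()

    // Consume output (required to complete pull)
    if _, err := io.Copy(io.Discard, reader); err != nil {
        return fmt.Errorf("failed to read pull output for %s: %w", imageName, err)
    }
    return nil
}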
```

**Key Points:**

1. **API Version Negotiation**: The client downgrades to the API version the Docker daemon supports
2. **Image Caching**: Pull once, use many times
3. **Labels**: Track managed containers
4. **Resource Limits**: Hard memory cap plus a relative CPU weight (CPU shares are not a cap)
5. **Error Wrapping**: Provide context for debugging

### Step 5: Implement Health Checker

Create `internal/health/checker.go` (partial):

```go
package health

import (
    "context"
    "net/http"
    "sync"
    "time"
    "github.com/yourusername/container-orchestrator/internal/models"
)

// Checker monitors container health
type Checker struct {
    checks map[string]*HealthMonitor
    mu     sync.RWMutex
}

// HealthMonitor tracks health for one container
type HealthMonitor struct {
    ContainerID  string
    Config       models.HealthCheckConfig
    Status       string
    FailureCount int
    LastCheck    time.Time
    StopCh       chan struct{}
}

// NewChecker creates a health checker
func NewChecker() *Checker {
    return &Checker{
        checks: make(map[string]*HealthMonitor),
    }
}

// StartMonitoring begins health checks
func (c *Checker) StartMonitoring(
    containerID string,
    config models.HealthCheckConfig,
    onFailure func(string),
) {
    c.mu.Lock()
    defer c.mu.Unlock()

    monitor := &HealthMonitor{
        ContainerID: containerID,
        Config:      config,
        Status:      "unknown",
        StopCh:      make(chan struct{}),
    }

    c.checks[containerID] = monitor
    go c.runHealthChecks(monitor, onFailure)
}

// runHealthChecks is the monitoring loop
func (c *Checker) runHealthChecks(
    monitor *HealthMonitor,
    onFailure func(string),
) {
    ticker := time.NewTicker(monitor.Config.Interval)
    defer ticker.Stop()

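    for {
        select {
        case <-monitor.StopCh:
            return
        case <-ticker.C:
            healthy := c.performCheck(monitor)

            // Update state under the lock, but invoke the callback after
            // releasing it so onFailure can safely call back into the Checker.
            c.mu.Lock()
            monitor.LastCheck = time.Now()
            thresholdReached := false
            if healthy {
                monitor.Status = "healthy"
                monitor.FailureCount = 0
            } else {
                monitor.FailureCount++
                if monitor.FailureCount >= monitor.Config.Retries {
                    monitor.Status = "unhealthy"
                    thresholdReached = true
                }
            }
            c.mu.Unlock()

            if thresholdReached {
                onFailure(monitor.ContainerID)
            }
        }
    }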
}

// performCheck executes a health check
func (c *Checker) performCheck(monitor *HealthMonitor) bool {
    ctx, cancel := context.WithTimeout(
        context.Background(),
        monitor.Config.Timeout,
    )
    defer cancel()

    switch monitor.Config.Type {
    case "http":
        return c.checkHTTP(ctx, monitor.Config.Endpoint)
    case "tcp":
        return c.checkTCP(ctx, monitor.Config.Endpoint)
    default:
        return false
    }
}

// checkHTTP performs HTTP health check
func (c *Checker) checkHTTP(ctx context.Context, endpoint string) bool {
    req, err := http.NewRequestWithContext(ctx, "GET", endpoint, nil)
    if err != nil {
        return false
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    return resp.StatusCode >= 200 && resp.StatusCode < 300
}
```

**Implementation Highlights:**

1. **Goroutine per Container**: Independent monitoring
2. **Stop Channel**: Graceful shutdown
3. **Failure Threshold**: Prevent flapping
4. **Context Timeout**: Prevent hanging checks
5. **Callback Pattern**: Decouple health from recovery
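
`performCheck` calls `checkTCP`, which the partial listing above does not show. One possible shape for it, written here as a standalone function rather than a `Checker` method, is sketched below. It assumes the endpoint is a plain `host:port` address; `probeLocalListener` exists only to demonstrate the behavior.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// checkTCP reports whether a TCP connection to addr ("host:port") can be
// opened before ctx expires.
func checkTCP(ctx context.Context, addr string) bool {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// probeLocalListener checks a throwaway local listener while it is open and
// again after it has been closed.
func probeLocalListener() (whileOpen, afterClose bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	whileOpen = checkTCP(ctx, addr)
	ln.Close()
	afterClose = checkTCP(ctx, addr)
	return whileOpen, afterClose, nil
}

func main() {
	open, closed, err := probeLocalListener()
	fmt.Println(open, closed, err)
}
```

Because the dial goes through `DialContext`, the timeout created in `performCheck` bounds the TCP check the same way it bounds the HTTP one.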

### Step 6: Implement Service Registry

Create `internal/registry/service_registry.go`:

```go
package registry

import (
    "errors"
    "sync"
    "github.com/yourusername/container-orchestrator/internal/models"
)

var ErrServiceNotFound = errors.New("service not found")

// Registry manages services
type Registry struct {
    services map[string]*models.Service
    mu       sync.RWMutex
}

// NewRegistry creates a registry
func NewRegistry() *Registry {
    return &Registry{
        services: make(map[string]*models.Service),
    }
}

// Register adds or updates a service
func (r *Registry) Register(service *models.Service) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.services[service.Name] = service
}

// Get retrieves a service
func (r *Registry) Get(serviceName string) (*models.Service, error) {
    r.mu.RLock()
    defer r.mu.RUnlock()

    service, exists := r.services[serviceName]
    if !exists {
        return nil, ErrServiceNotFound
    }
    return service, nil
}

// AddContainer adds a container to a service
func (r *Registry) AddContainer(
    serviceName string,
    container models.Container,
) error {
    r.mu.Lock()
    defer r.mu.Unlock()

    service, exists := r.services[serviceName]
    if !exists {
        return ErrServiceNotFound
    }

    service.Containers = append(service.Containers, container)
    return nil
}
```

**Design Principles:**

1. **Simple Map**: Fast O(1) lookups
2. **Thread Safe**: RWMutex for concurrency
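3. **Shared References**: `Get` returns the stored `*models.Service` pointer, so callers must not mutate it without synchronization; return a copy if you need isolation
4. **Error Handling**: A sentinel error (`ErrServiceNotFound`) that callers can match with `errors.Is`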
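
If callers should not be able to change registry state through a returned value, a copying accessor is one option. The sketch below uses simplified stand-ins for the `models` types, and `Snapshot` is a hypothetical method name rather than part of the registry shown above.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the models package types.
type Container struct{ ID string }

type Service struct {
	Name       string
	Containers []Container
}

type Registry struct {
	mu       sync.RWMutex
	services map[string]*Service
}

// Snapshot returns a copy of the named service, including its container
// slice, so callers cannot mutate registry state by accident.
func (r *Registry) Snapshot(name string) (Service, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()

	svc, ok := r.services[name]
	if !ok {
		return Service{}, false
	}
	cp := *svc
	cp.Containers = append([]Container(nil), svc.Containers...)
	return cp, true
}

func main() {
	reg := &Registry{services: map[string]*Service{
		"web": {Name: "web", Containers: []Container{{ID: "c1"}}},
	}}
	snap, _ := reg.Snapshot("web")
	snap.Containers[0].ID = "changed"
	fmt.Println(reg.services["web"].Containers[0].ID) // registry still holds "c1"
}
```

The slice must be copied explicitly: copying the struct alone would still share the underlying container array with the registry.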

### Step 7: Implement API Handlers

Create `internal/api/handlers.go`:

```go
package api

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
    "github.com/gorilla/mux"
    "github.com/yourusername/container-orchestrator/internal/docker"
    "github.com/yourusername/container-orchestrator/internal/health"
    "github.com/yourusername/container-orchestrator/internal/models"
    "github.com/yourusername/container-orchestrator/internal/registry"
    "github.com/yourusername/container-orchestrator/internal/scheduler"
)

// API handles HTTP requests
type API struct {
    dockerClient  *docker.Client
    scheduler     *scheduler.Scheduler
    registry      *registry.Registry
    healthChecker *health.Checker
    eventLog      []models.Event
}

// NewAPI creates an API handler
func NewAPI(
    dockerClient *docker.Client,
    sched *scheduler.Scheduler,
    reg *registry.Registry,
    checker *health.Checker,
) *API {
    return &API{
        dockerClient:  dockerClient,
        scheduler:     sched,
        registry:      reg,
        healthChecker: checker,
        eventLog:      make([]models.Event, 0),
    }
}

// DeployHandler handles service deployments
func (api *API) DeployHandler(w http.ResponseWriter, r *http.Request) {
    var req models.DeploymentRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "Invalid request", http.StatusBadRequest)
        return
    }

    // Validate
    if req.Name == "" || req.Image == "" {
        http.Error(w, "Name and image required", http.StatusBadRequest)
        return
    }

    if req.Replicas <= 0 {
        req.Replicas = 1
    }

    // Create service
    service := &models.Service{
        Name:        req.Name,
        Image:       req.Image,
        Replicas:    req.Replicas,
        Containers:  make([]models.Container, 0),
        HealthCheck: req.HealthCheck,
        CreatedAt:   time.Now(),
    }
    api.registry.Register(service)

    // Deploy containers
    for i := 0; i < req.Replicas; i++ {
        containerName := fmt.Sprintf("%s-%d", req.Name, i)

        // Select node
        node, err := api.scheduler.SelectNode(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // Create and start
        containerID, err := api.dockerClient.CreateContainer(
            r.Context(), req, containerName,
        )
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        if err := api.dockerClient.StartContainer(r.Context(), containerID); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // Add to service
        container := models.Container{
            ID:          containerID,
            Name:        containerName,
            ServiceName: req.Name,
            NodeID:      node.ID,
            Image:       req.Image,
            Status:      "running",
            Health:      "unknown",
            CreatedAt:   time.Now(),
            StartedAt:   time.Now(),
        }
        if err := api.registry.AddContainer(req.Name, container); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // Start health monitoring
        if req.HealthCheck.Type != "" {
            api.healthChecker.StartMonitoring(
                containerID,
                req.HealthCheck,
                api.handleContainerFailure,
            )
        }
    }

    // Log event
    api.logEvent(models.Event{
        Timestamp: time.Now(),
        Type:      "deployment",
        Service:   req.Name,
        Message:   fmt.Sprintf("Deployed %d replicas", req.Replicas),
    })

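    // Headers must be set before WriteHeader is called.
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusCreated)
    json.NewEncoder(w).Encode(service)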
}
```

**Handler Best Practices:**

1. **Early Validation**: Fail fast on invalid input
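2. **Fail Fast on Deploy Errors**: The handler stops at the first failed replica; containers started before the failure are not rolled back, so add cleanup if you need all-or-nothing deployments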
3. **Error Responses**: HTTP status + message
4. **Event Logging**: Audit trail
5. **JSON Encoding**: Automatic serialization
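
The handler above returns on the first error without removing replicas that already started. A minimal sketch of all-or-nothing deployment follows, with container creation and removal passed in as plain functions. `deployAll` and `failAt` are hypothetical names; in the handler, `create` would wrap `CreateContainer` plus `StartContainer`, and `remove` would stop and remove the container.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoCapacity = errors.New("no capacity")

// deployAll starts n replicas via create. If any creation fails, it calls
// remove on every replica started so far and returns the error.
func deployAll(
	n int,
	create func(i int) (string, error),
	remove func(id string),
) ([]string, error) {
	ids := make([]string, 0, n)
	for i := 0; i < n; i++ {
		id, err := create(i)
		if err != nil {
			// Roll back in reverse order of creation.
			for j := len(ids) - 1; j >= 0; j-- {
				remove(ids[j])
			}
			return nil, fmt.Errorf("replica %d: %w", i, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// failAt returns a fake create function that fails on replica k
// (use a negative k for a function that never fails).
func failAt(k int) func(int) (string, error) {
	return func(i int) (string, error) {
		if i == k {
			return "", errNoCapacity
		}
		return fmt.Sprintf("web-%d", i), nil
	}
}

func main() {
	var removed []string
	_, err := deployAll(3, failAt(2), func(id string) {
		removed = append(removed, id)
	})
	fmt.Println(err, removed)
}
```

Registering the service only after `deployAll` succeeds would also keep a failed deployment out of the registry.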

### Step 8: Wire Everything Together

Create `cmd/orchestrator/main.go`:

```go
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
    "github.com/gorilla/mux"
    "github.com/yourusername/container-orchestrator/internal/api"
    "github.com/yourusername/container-orchestrator/internal/docker"
    "github.com/yourusername/container-orchestrator/internal/health"
    "github.com/yourusername/container-orchestrator/internal/models"
    "github.com/yourusername/container-orchestrator/internal/registry"
    "github.com/yourusername/container-orchestrator/internal/scheduler"
)

func main() {
    if err := run(); err != nil {
        log.Fatalf("Application error: %v", err)
    }
}

func run() error {
    // Initialize components
    dockerClient, err := docker.NewClient()
    if err != nil {
        return err
    }
    defer dockerClient.Close()

    sched := scheduler.NewScheduler()
    reg := registry.NewRegistry()
    checker := health.NewChecker()

    // Register local node
    localNode := &models.Node{
        ID:        "local-node",
        Address:   "localhost",
        CPUCores:  4,
        MemoryMB:  8192,
        Available: true,
        LastSeen:  time.Now(),
    }
    sched.RegisterNode(localNode)

    // Create API
    apiHandler := api.NewAPI(dockerClient, sched, reg, checker)

    // Setup routes
    router := mux.NewRouter()
    router.HandleFunc("/api/deploy", apiHandler.DeployHandler).Methods("POST")
    router.HandleFunc("/api/services", apiHandler.ListServicesHandler).Methods("GET")
    router.HandleFunc("/api/services/{name}/scale", apiHandler.ScaleHandler).Methods("POST")
    router.HandleFunc("/api/services/{name}/logs", apiHandler.LogsHandler).Methods("GET")
    router.HandleFunc("/api/events", apiHandler.EventsHandler).Methods("GET")

    // Start server
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }

    srv := &http.Server{
        Addr:         ":" + port,
        Handler:      router,
        ReadTimeout:  15 * time.Second,
        WriteTimeout: 15 * time.Second,
        IdleTimeout:  60 * time.Second,
    }

    // Start in goroutine
    go func() {
        log.Printf("Listening on :%s", port)
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Server error: %v", err)
        }
    }()

    // Wait for interrupt
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
    <-sigCh

    // Graceful shutdown
    log.Println("Shutting down...")
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    return srv.Shutdown(ctx)
}
```

**Application Structure:**

1. **Initialization**: Create all components
2. **Configuration**: Register nodes, setup routes
3. **Server Start**: Listen in background
4. **Signal Handling**: Wait for interrupt
5. **Graceful Shutdown**: 30s timeout for cleanup
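
The start, wait, and shutdown steps can also be folded into one context-driven function, which makes the sequence easier to test. This is a sketch using only the standard library; `serveUntil` and `runOnce` are hypothetical names. In `main.go` the context could come from `signal.NotifyContext` instead of a timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serveUntil serves on ln until ctx is cancelled, then gives in-flight
// requests up to grace to finish.
func serveUntil(ctx context.Context, srv *http.Server, ln net.Listener, grace time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // server failed before shutdown was requested
	case <-ctx.Done():
	}

	shutdownCtx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return err
	}
	// Serve returns http.ErrServerClosed after a clean Shutdown.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// runOnce starts a server on a random local port and stops it shortly after.
func runOnce() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return serveUntil(ctx, &http.Server{Handler: http.NotFoundHandler()}, ln, time.Second)
}

func main() {
	fmt.Println("shutdown error:", runOnce())
}
```

Returning the serve error instead of calling `log.Fatalf` inside the goroutine lets deferred cleanup, such as closing the Docker client, still run.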

---

## API Reference

### Base URL

```
http://localhost:8080
```

### Authentication

Currently no authentication (add JWT or API keys for production).

### Endpoints

#### 1. Deploy Service

Deploy a new service with specified replicas.

**Request:**
```http
POST /api/deploy
Content-Type: application/json

{
  "name": "web-app",
  "image": "nginx:alpine",
  "replicas": 3,
  "command": ["nginx", "-g", "daemon off;"],
  "env": {
    "ENV": "production",
    "LOG_LEVEL": "info"
  },
  "ports": [
    {
      "container_port": 80,
      "host_port": 8080,
      "protocol": "tcp"
    }
  ],
  "resources": {
    "cpu_shares": 1024,
    "memory_mb": 512
  },
  "health_check": {
    "type": "http",
    "endpoint": "http://localhost:80/health",
    "interval": "10s",
    "timeout": "5s",
    "retries": 3
  }
}
```

**Response:**
```http
HTTP/1.1 201 Created
Content-Type: application/json

{
  "name": "web-app",
  "image": "nginx:alpine",
  "replicas": 3,
  "containers": [
    {
      "id": "abc123...",
      "name": "web-app-0",
      "service_name": "web-app",
      "node_id": "local-node",
      "image": "nginx:alpine",
      "status": "running",
      "health": "unknown",
      "created_at": "2024-01-15T10:00:00Z",
      "started_at": "2024-01-15T10:00:01Z"
    }
  ],
  "health_check": {...},
  "created_at": "2024-01-15T10:00:00Z"
}
```

**Field Descriptions:**
- `name`: Unique service identifier
- `image`: Docker image (e.g., "nginx:alpine", "redis:7")
- `replicas`: Number of container instances (default: 1)
- `command`: Override container command (optional)
- `env`: Environment variables (optional)
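- `ports`: Port mappings (optional). A fixed `host_port` can be bound by only one container per host, so combine it with a single replica or choose a distinct host port per replica; it must also differ from the orchestrator's own API port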
- `resources`: CPU/memory limits (optional)
- `health_check`: Health monitoring config (optional)
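
The request body gives `interval` and `timeout` as strings such as `"10s"`, but `encoding/json` does not decode a JSON string into a `time.Duration` on its own. If the models do not already handle this, a small wrapper type is one way to bridge the gap. The `Duration` and `healthCheck` types below are hypothetical stand-ins, not the guide's actual `models` definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Duration wraps time.Duration so JSON strings such as "10s" decode into it.
// Hypothetical type; the models package may already provide an equivalent.
type Duration struct{ time.Duration }

func (d *Duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return fmt.Errorf("duration must be a string like \"10s\": %w", err)
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	d.Duration = parsed
	return nil
}

// healthCheck mirrors the health_check object from the request body.
type healthCheck struct {
	Type     string   `json:"type"`
	Endpoint string   `json:"endpoint"`
	Interval Duration `json:"interval"`
	Timeout  Duration `json:"timeout"`
	Retries  int      `json:"retries"`
}

// parseHealthCheck decodes a health_check JSON object.
func parseHealthCheck(data string) (healthCheck, error) {
	var hc healthCheck
	err := json.Unmarshal([]byte(data), &hc)
	return hc, err
}

func main() {
	hc, err := parseHealthCheck(`{"type":"http","interval":"10s","timeout":"5s","retries":3}`)
	fmt.Println(hc.Interval.Duration, hc.Timeout.Duration, err)
}
```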

#### 2. List Services

Get all deployed services.

**Request:**
```http
GET /api/services
```

**Response:**
```http
HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "name": "web-app",
    "image": "nginx:alpine",
    "replicas": 3,
    "containers": [...]
  },
  {
    "name": "api-server",
    "image": "node:18-alpine",
    "replicas": 2,
    "containers": [...]
  }
]
```

#### 3. Scale Service

Change the number of replicas for a service.

**Request:**
```http
POST /api/services/web-app/scale
Content-Type: application/json

{
  "replicas": 5
}
```

**Response:**
```http
HTTP/1.1 202 Accepted
Content-Type: application/json

{
  "message": "Scaling up..."
}
```

**Scaling Behavior:**
- **Scale Up**: Creates new containers to reach desired count
- **Scale Down**: Stops and removes excess containers
- **No Change**: Returns "Already at desired scale"
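
The three cases reduce to comparing the current and desired replica counts. A minimal sketch of that decision follows; `planScale` is a hypothetical helper, and the handler would map its result to the container operations and response messages.

```go
package main

import "fmt"

// planScale compares the current and desired replica counts and reports
// which action to take and how many containers it affects.
func planScale(current, desired int) (action string, count int) {
	switch {
	case desired > current:
		return "up", desired - current
	case desired < current:
		return "down", current - desired
	default:
		return "none", 0
	}
}

func main() {
	action, n := planScale(3, 5)
	fmt.Println(action, n)
}
```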

#### 4. Get Service Logs

Retrieve logs from a service's first container.

**Request:**
```http
GET /api/services/web-app/logs
```

**Response:**
```http
HTTP/1.1 200 OK
Content-Type: text/plain

2024-01-15T10:00:00.000Z Starting nginx...
2024-01-15T10:00:01.000Z Listening on port 80
```

**Note:** Currently returns logs from first container only. Production implementation should aggregate logs from all containers.

#### 5. View Events

Get recent orchestrator events (last 100).

**Request:**
```http
GET /api/events
```

**Response:**
```http
HTTP/1.1 200 OK
Content-Type: application/json

[
  {
    "timestamp": "2024-01-15T10:00:00Z",
    "type": "deployment",
    "service": "web-app",
    "message": "Deployed 3 replicas"
  },
  {
    "timestamp": "2024-01-15T10:05:30Z",
    "type": "health_check",
    "service": "web-app",
    "container": "abc123",
    "message": "Container failed health check, restarting..."
  }
]
```

**Event Types:**
- `deployment`: Service deployed
- `health_check`: Health check failure/recovery
- `restart`: Container restarted
- `scale`: Service scaled up/down
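
Events are appended from request handlers and from health-check goroutines, so the log needs both a size bound and a lock. The sketch below shows one way to keep only the most recent entries. `EventLog` is a hypothetical type with a simplified `Event`, not the `eventLog` slice used by the `API` struct.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for models.Event.
type Event struct {
	Type    string
	Message string
}

// EventLog keeps the most recent max events and is safe for concurrent use.
type EventLog struct {
	mu     sync.Mutex
	max    int
	events []Event
}

func NewEventLog(max int) *EventLog { return &EventLog{max: max} }

// Add appends an event and drops the oldest entries beyond the limit.
func (l *EventLog) Add(e Event) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
	if over := len(l.events) - l.max; over > 0 {
		l.events = append([]Event(nil), l.events[over:]...)
	}
}

// Recent returns a copy of the retained events, oldest first.
func (l *EventLog) Recent() []Event {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]Event(nil), l.events...)
}

func main() {
	events := NewEventLog(100)
	for i := 0; i < 150; i++ {
		events.Add(Event{Type: "deployment", Message: fmt.Sprintf("event %d", i)})
	}
	recent := events.Recent()
	fmt.Println(len(recent), recent[0].Message) // 100 event 50
}
```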

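### Error Responses

The bodies below assume errors are encoded as JSON. The `DeployHandler` from Step 7 calls `http.Error`, which sends the message as plain text, so adapt one or the other to keep them consistent.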

#### 400 Bad Request
```json
{
  "error": "Name and image are required"
}
```

#### 404 Not Found
```json
{
  "error": "Service not found"
}
```

#### 500 Internal Server Error
```json
{
  "error": "Scheduling failed: no suitable node found"
}
```
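
A small helper is enough to produce error bodies in this shape. The sketch below uses only the standard library; `writeJSONError` and `record` are hypothetical names, and `record` exists only to show what a client would receive.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// writeJSONError sends {"error": msg} with the given status code.
func writeJSONError(w http.ResponseWriter, status int, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]string{"error": msg})
}

// record runs writeJSONError against an in-memory recorder and returns
// the status, content type, and body a client would see.
func record(status int, msg string) (code int, contentType, body string) {
	rec := httptest.NewRecorder()
	writeJSONError(rec, status, msg)
	return rec.Code, rec.Header().Get("Content-Type"), strings.TrimSpace(rec.Body.String())
}

func main() {
	fmt.Println(record(http.StatusNotFound, "Service not found"))
}
```

Replacing the `http.Error` calls in the handlers with such a helper would make the responses match the examples above.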

### Example Usage with curl

```bash
# Deploy nginx
curl -X POST http://localhost:8080/api/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "name": "nginx",
    "image": "nginx:alpine",
    "replicas": 2,
    "ports": [{"container_port": 80, "host_port": 8080, "protocol": "tcp"}]
  }'

# List services
curl http://localhost:8080/api/services

# Scale service
curl -X POST http://localhost:8080/api/services/nginx/scale \
  -H "Content-Type: application/json" \
  -d '{"replicas": 5}'

# Get logs
curl http://localhost:8080/api/services/nginx/logs

# View events
curl http://localhost:8080/api/events
```

---

## Testing Strategy

### Testing Philosophy

- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test component interactions
- **End-to-End Tests**: Test complete workflows
- **Coverage Goal**: 70%+ code coverage

### Test Structure

```
tests/
├── unit/
│   ├── scheduler_test.go      # Scheduler logic tests
│   ├── registry_test.go       # Registry operations tests
│   └── health_test.go         # Health check tests
├── integration/
│   └── integration_test.go    # Full workflow tests
└── helpers/
    └── test_helpers.go        # Shared test utilities
```

### Running Tests

```bash
# All tests
go test ./...

# Specific package
go test ./internal/scheduler

# With coverage
go test -cover ./...

# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
open coverage.html  # macOS; use xdg-open on Linux

# Verbose output
go test -v ./...

# Integration tests only (requires Docker)
go test -v ./tests/integration/...

# Run with race detector
go test -race ./...
```

### Unit Test Examples

#### Scheduler Test

```go
func TestSchedulerNodeSelection(t *testing.T) {
    sched := scheduler.NewScheduler()

    // Register nodes with different capacities
    sched.RegisterNode(&models.Node{
        ID:        "node1",
        CPUCores:  2,
        MemoryMB:  1024,
        Available: true,
    })

    sched.RegisterNode(&models.Node{
        ID:        "node2",
        CPUCores:  4,
        MemoryMB:  4096,
        Available: true,
    })

    // Request deployment
    req := models.DeploymentRequest{
        Resources: models.ResourceLimits{
            MemoryMB: 2048,
        },
    }

    // Should select node2 (more memory)
    node, err := sched.SelectNode(req)
    require.NoError(t, err)
    assert.Equal(t, "node2", node.ID)
}
```

#### Registry Test

```go
func TestServiceRegistration(t *testing.T) {
    reg := registry.NewRegistry()

    // Register service
    service := &models.Service{
        Name:     "test-service",
        Image:    "nginx:alpine",
        Replicas: 2,
    }
    reg.Register(service)

    // Retrieve service
    retrieved, err := reg.Get("test-service")
    require.NoError(t, err)
    assert.Equal(t, "test-service", retrieved.Name)
    assert.Equal(t, "nginx:alpine", retrieved.Image)

    // Test not found
    _, err = reg.Get("nonexistent")
    assert.ErrorIs(t, err, registry.ErrServiceNotFound)
}
```

#### Health Check Test

```go
func TestHealthCheckFailure(t *testing.T) {
    checker := health.NewChecker()

    // The callback runs on the checker's goroutine, so count failures
    // atomically (sync/atomic) to keep the test clean under -race.
    var failureCount atomic.Int32
    onFailure := func(containerID string) {
        failureCount.Add(1)
    }

    // Monitor non-existent endpoint
    checker.StartMonitoring("test-container", models.HealthCheckConfig{
        Type:     "http",
        Endpoint: "http://localhost:9999/health",
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Retries:  2,
    }, onFailure)

    // Wait for failures
    time.Sleep(3 * time.Second)

    // Verify failure detected
    assert.Greater(t, failureCount.Load(), int32(0))
    assert.Equal(t, "unhealthy", checker.GetStatus("test-container"))

    // Cleanup
    checker.StopMonitoring("test-container")
}
```

### Integration Test Example

```go
func TestFullDeploymentWorkflow(t *testing.T) {
    // Setup
    dockerClient, err := docker.NewClient()
    require.NoError(t, err)
    defer dockerClient.Close()

    sched := scheduler.NewScheduler()
    reg := registry.NewRegistry()
    checker := health.NewChecker()

    // Register node
    sched.RegisterNode(&models.Node{
        ID:        "test-node",
        CPUCores:  2,
        MemoryMB:  2048,
        Available: true,
    })

    apiHandler := api.NewAPI(dockerClient, sched, reg, checker)

    // Deploy service
    deployReq := models.DeploymentRequest{
        Name:     "test-nginx",
        Image:    "nginx:alpine",
        Replicas: 2,
    }

    body, _ := json.Marshal(deployReq)
    req := httptest.NewRequest("POST", "/api/deploy", bytes.NewReader(body))
    w := httptest.NewRecorder()

    apiHandler.DeployHandler(w, req)

    // Verify deployment
    assert.Equal(t, http.StatusCreated, w.Code)

    service, err := reg.Get("test-nginx")
    require.NoError(t, err)
    assert.Equal(t, 2, len(service.Containers))

    // Cleanup
    ctx := context.Background()
    for _, container := range service.Containers {
        dockerClient.StopContainer(ctx, container.ID, 5*time.Second)
        dockerClient.RemoveContainer(ctx, container.ID)
    }
}
```

### Test Coverage Goals

| Component | Target Coverage | Rationale |
|-----------|----------------|-----------|
| Scheduler | 90%+ | Critical path, simple logic |
| Registry | 85%+ | Core functionality |
| Docker Client | 70%+ | Integration with external SDK |
| Health Checker | 80%+ | Important but timing-dependent |
| API Handlers | 75%+ | Many edge cases |

### Mocking Strategy

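For external dependencies, replace the real client with a test double. Note that `NewAPI` as shown takes a concrete `*docker.Client`, so it would first need to accept an interface that both the real client and the mock satisfy: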

```go
// Mock Docker client for unit tests
type MockDockerClient struct {
    CreateFunc func(context.Context, models.DeploymentRequest, string) (string, error)
    StartFunc  func(context.Context, string) error
    // ...
}

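// CreateContainer delegates to the injected CreateFunc.
func (m *MockDockerClient) CreateContainer(
    ctx context.Context,
    req models.DeploymentRequest,
    name string,
) (string, error) {
    return m.CreateFunc(ctx, req, name)
}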
```

### Continuous Integration

Example GitHub Actions workflow:

```yaml
name: Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest  # GitHub-hosted Ubuntu runners ship with Docker preinstalled
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.23'
      - run: go test -race -coverprofile=coverage.out ./...
      - run: go tool cover -func=coverage.out
```

---

## Deployment Guide

### Local Development

```bash
# Install dependencies
go mod download

# Build binary
make build

# Run locally
./bin/orchestrator

# Or with hot reload (using air)
go install github.com/air-verse/air@latest
air
```

### Docker Deployment

#### Single Container

```bash
# Build image
docker build -t container-orchestrator:latest .

# Run container
docker run -d \
  --name orchestrator \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --restart unless-stopped \
  container-orchestrator:latest

# View logs
docker logs -f orchestrator

# Stop container
docker stop orchestrator
docker rm orchestrator
```

**Important:** The `-v /var/run/docker.sock:/var/run/docker.sock` flag is required for the orchestrator to communicate with the Docker daemon.

#### Docker Compose (Recommended)

```yaml
# docker-compose.yml
version: '3.8'

services:
  orchestrator:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - PORT=8080
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/api/services"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

**Usage:**
```bash
# Start
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

# Rebuild
docker-compose up -d --build
```

### Production Deployment

#### AWS ECS

1. **Build and push image:**
```bash
# Build for linux/amd64 and load the image into the local Docker image store
docker buildx build --platform linux/amd64 --load -t orchestrator:latest .

# Push to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin your-account.dkr.ecr.us-east-1.amazonaws.com
docker tag orchestrator:latest your-account.dkr.ecr.us-east-1.amazonaws.com/orchestrator:latest
docker push your-account.dkr.ecr.us-east-1.amazonaws.com/orchestrator:latest
```

2. **Create ECS task definition** (EC2 launch type, because Fargate does not support host bind mounts such as the Docker socket):
```json
{
  "family": "orchestrator",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["EC2"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "orchestrator",
      "image": "your-account.dkr.ecr.us-east-1.amazonaws.com/orchestrator:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "mountPoints": [
        {
          "sourceVolume": "docker-socket",
          "containerPath": "/var/run/docker.sock"
        }
      ],
      "environment": [
        {
          "name": "PORT",
          "value": "8080"
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "docker-socket",
      "host": {
        "sourcePath": "/var/run/docker.sock"
      }
    }
  ]
}
```

#### Kubernetes

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orchestrator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orchestrator
  template:
    metadata:
      labels:
        app: orchestrator
    spec:
      containers:
      - name: orchestrator
        image: your-registry/orchestrator:latest
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        volumeMounts:
        - name: docker-socket
          mountPath: /var/run/docker.sock
      volumes:
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
---
apiVersion: v1
kind: Service
metadata:
  name: orchestrator
spec:
  selector:
    app: orchestrator
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer
```

**Deploy:**
```bash
kubectl apply -f deployment.yaml
kubectl get pods
kubectl logs -f deployment/orchestrator
```

**Note:** This manifest assumes the node runs Docker Engine and exposes `/var/run/docker.sock`. Nodes that use containerd or CRI-O directly have no Docker socket to mount.

#### Systemd Service (Linux)

```ini
# /etc/systemd/system/orchestrator.service
[Unit]
Description=Container Orchestrator
After=docker.service
Requires=docker.service

[Service]
Type=simple
User=orchestrator
Group=docker
ExecStart=/usr/local/bin/orchestrator
Restart=always
RestartSec=10
Environment="PORT=8080"

[Install]
WantedBy=multi-user.target
```

**Setup:**
```bash
# Create user
sudo useradd -r -s /bin/false orchestrator
sudo usermod -aG docker orchestrator

# Install binary
sudo cp bin/orchestrator /usr/local/bin/
sudo chmod +x /usr/local/bin/orchestrator

# Enable service
sudo systemctl daemon-reload
sudo systemctl enable orchestrator
sudo systemctl start orchestrator

# Check status
sudo systemctl status orchestrator
sudo journalctl -u orchestrator -f
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | 8080 | HTTP API port |

### Security Considerations

1. **Docker Socket Access**: Mounting `/var/run/docker.sock` gives full Docker access. Use with caution.

2. **Network Isolation**: In production, use private networks and API gateway.

3. **Authentication**: Add JWT or API key authentication:
```go
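// Requires "crypto/subtle", "net/http", and "os".
func authMiddleware(next http.Handler) http.Handler {
    expected := []byte(os.Getenv("API_KEY"))
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        provided := []byte(r.Header.Get("X-API-Key"))
        // Reject when no key is configured; otherwise an empty header would match.
        // ConstantTimeCompare avoids leaking the key through timing differences.
        if len(expected) == 0 || subtle.ConstantTimeCompare(provided, expected) != 1 {
            http.Error(w, "Unauthorized", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}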
```

4. **TLS**: Use reverse proxy (nginx, Caddy) for HTTPS:
```nginx
server {
    listen 443 ssl;
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8080;
    }
}
```

5. **Resource Limits**: Enforce per-container CPU and memory limits (the `resources` field of the deployment request) so that one service cannot exhaust the host.

---

## Performance Tuning

### Benchmarking

```bash
# Benchmark scheduler
go test -bench=BenchmarkScheduler -benchmem ./internal/scheduler

# Profile CPU
go test -cpuprofile=cpu.prof -bench=. ./internal/scheduler
go tool pprof cpu.prof

# Profile memory
go test -memprofile=mem.prof -bench=. ./internal/scheduler
go tool pprof mem.prof
```

### Optimization Tips

1. **Connection Pooling**: Reuse HTTP connections for health checks
2. **Batch Operations**: Deploy multiple containers in parallel
3. **Cache Image Lists**: Reduce Docker API calls
4. **Async Logging**: Non-blocking event logging
5. **Goroutine Limits**: Pool goroutines for health checks
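
Tips 2 and 5 both come down to bounded parallelism: run work concurrently, but cap how much runs at once. A minimal sketch using a buffered channel as a semaphore is shown below. `runLimited` and `peakConcurrency` are hypothetical names, and the tasks stand in for container creation or health checks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited runs every task, at most limit at a time, and returns the
// errors in task order (nil for tasks that succeeded).
func runLimited(limit int, tasks []func() error) []error {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	errs := make([]error, len(tasks))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are already running
		go func(i int, task func() error) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = task()
		}(i, task)
	}
	wg.Wait()
	return errs
}

// peakConcurrency runs n short tasks through runLimited and reports the
// highest number observed running at once.
func peakConcurrency(limit, n int) int {
	var mu sync.Mutex
	running, peak := 0, 0
	tasks := make([]func() error, n)
	for i := range tasks {
		tasks[i] = func() error {
			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()
			time.Sleep(5 * time.Millisecond)
			mu.Lock()
			running--
			mu.Unlock()
			return nil
		}
	}
	runLimited(limit, tasks)
	return peak
}

func main() {
	fmt.Println("peak concurrency:", peakConcurrency(3, 12))
}
```

Each goroutine writes only to its own slot in `errs`, so the result slice needs no extra locking.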

### Monitoring

**Prometheus Metrics (Add to production):**

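```go
import "github.com/prometheus/client_golang/prometheus"

var ( // metric names are illustrative; register them with prometheus.MustRegister
    deploymentsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "orchestrator_deployments_total",
        Help: "Total number of service deployments.",
    })
    containersRunning = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "orchestrator_containers_running",
        Help: "Number of containers currently running.",
    })
    healthCheckDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "orchestrator_health_check_duration_seconds",
        Help: "Duration of health checks in seconds.",
    })
)
```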

**Health Endpoint:**

```go
router.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
})
```

---

## Troubleshooting

### Common Issues

#### 1. Docker Socket Permission Denied

**Error:**
```
failed to create docker client: permission denied
```

**Solution:**
```bash
# Add user to docker group
sudo usermod -aG docker $USER

# Logout and login again
# Or use newgrp
newgrp docker

# Verify access
docker ps
```

#### 2. Port Already in Use

**Error:**
```
listen tcp :8080: bind: address already in use
```

**Solution:**
```bash
# Find process using port
lsof -i :8080

# Stop the process (add -9 only if it ignores the default SIGTERM)
kill <PID>

# Or use different port
PORT=9090 ./bin/orchestrator
```

#### 3. Cannot Pull Images

**Error:**
```
failed to ensure image: pull access denied
```

**Solution:**
```bash
# Login to Docker Hub
docker login

# Or use private registry
docker login your-registry.com
```

#### 4. Container Fails to Start

**Debugging:**
```bash
# Check container logs
docker logs <container-id>

# Inspect container
docker inspect <container-id>

# Check orchestrator logs
docker logs orchestrator
```

#### 5. Health Checks Always Failing

**Debugging:**
```go
// Add logging to health checker
fmt.Printf("Health check %s for %s: %v\n",
    monitor.Config.Type,
    monitor.Config.Endpoint,
    healthy)
```

**Common causes:**
- Incorrect endpoint URL
- Container not exposing port
- Health endpoint not ready yet (add initial delay)

### Debug Mode

Enable verbose logging:

```go
// In main.go
log.SetFlags(log.LstdFlags | log.Lshortfile)

// Add debug logging
log.Printf("Deploying service: %+v", req)
log.Printf("Selected node: %s", node.ID)
log.Printf("Created container: %s", containerID)
```

---

## Future Enhancements

### Short-Term (1-2 months)

- [ ] **Persistent State**: SQLite database for service registry
- [ ] **Volume Management**: Support for container volumes
- [ ] **Network Isolation**: Custom Docker networks per service
- [ ] **Config Management**: Secrets and config injection
- [ ] **Web Dashboard**: Simple UI for visualization
- [ ] **Metrics Export**: Prometheus integration
- [ ] **Logs Aggregation**: Collect logs from all replicas

### Medium-Term (3-6 months)

- [ ] **Multi-Node Support**: Distributed orchestration
- [ ] **Advanced Scheduling**: Bin-packing, affinity rules
- [ ] **Rolling Updates**: Zero-downtime deployments
- [ ] **Load Balancing**: Built-in reverse proxy
- [ ] **Service Mesh**: Inter-service communication
- [ ] **Auto-Scaling**: HPA based on metrics
- [ ] **Backup/Restore**: State persistence

### Long-Term (6+ months)

- [ ] **Plugin System**: Extensible architecture
- [ ] **Multi-Tenancy**: Namespace isolation
- [ ] **GitOps Integration**: Declarative deployments
- [ ] **Chaos Engineering**: Failure injection
- [ ] **Cost Optimization**: Resource usage analytics
- [ ] **Compliance**: Audit logs, access control
- [ ] **Federation**: Multi-cluster management

### Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

**Areas needing help:**
- Test coverage improvements
- Documentation enhancements
- Performance optimizations
- Bug fixes
- Feature implementations

---

## License

MIT License - see LICENSE file for details.

## Support

- **Issues**: Open an issue on GitHub
- **Discussions**: GitHub Discussions
- **Email**: support@example.com

## Acknowledgments

- Docker SDK for Go
- Gorilla Mux router
- Testify testing toolkit
- The Kubernetes project (inspiration)

---

## Appendix

### A. Docker SDK Reference

**Official Docs**: https://docs.docker.com/engine/api/sdk/

**Key Types:**
- `types.Container`: Container information (`container.Summary` in recent SDK releases)
- `types.ContainerJSON`: Detailed container inspect (`container.InspectResponse` in recent SDK releases)
- `container.Config`: Container configuration
- `container.HostConfig`: Host-specific config

### B. Scheduling Algorithms

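**Current**: Most Available Memory (MAM), which places each container on the node with the most free memory
- Simple and effective
- O(N) complexity, where N is the number of nodes
- No state tracking needed between scheduling decisions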
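The selection rule above can be sketched in a few lines. This is a minimal illustration, not the code in `internal/scheduler/scheduler.go`: the `Node` type, its fields, `SelectNode`, and `ErrNoCapacity` are hypothetical names chosen for the example, and the fit check against the requested memory is an added assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical, simplified view of a host. The guide's real
// model in internal/models/types.go may use different fields.
type Node struct {
	Name            string
	AvailableMemory int64 // bytes currently free on the node
}

// ErrNoCapacity is returned when no node can fit the request.
var ErrNoCapacity = errors.New("no node with enough available memory")

// SelectNode scans every node once (O(N)) and returns the node with the
// most available memory that can still fit the requested amount.
// It keeps no state between calls.
func SelectNode(nodes []Node, requested int64) (Node, error) {
	best := -1
	for i, n := range nodes {
		if n.AvailableMemory < requested {
			continue // the container would not fit here
		}
		if best == -1 || n.AvailableMemory > nodes[best].AvailableMemory {
			best = i
		}
	}
	if best == -1 {
		return Node{}, ErrNoCapacity
	}
	return nodes[best], nil
}

func main() {
	nodes := []Node{
		{Name: "node-a", AvailableMemory: 512 << 20},
		{Name: "node-b", AvailableMemory: 2048 << 20},
		{Name: "node-c", AvailableMemory: 1024 << 20},
	}
	n, err := SelectNode(nodes, 256<<20)
	if err != nil {
		fmt.Println("scheduling failed:", err)
		return
	}
	fmt.Println("selected:", n.Name)
}
```

Because every call recomputes the answer from the node list it is given, the function is safe to call concurrently as long as the slice is not modified during the scan.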

**Alternatives:**
1. **Round Robin**: Distribute evenly
2. **Least Loaded**: Consider CPU+Memory
3. **Bin Packing**: Maximize resource usage
4. **Affinity-Based**: Place related services together
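As a point of comparison, the round-robin alternative might look like the sketch below. Unlike the memory-based strategy it has to remember its position, so it carries state behind a mutex. `RoundRobin`, `NewRoundRobin`, and `Next` are hypothetical names for illustration and are not part of the orchestrator.

```go
package main

import (
	"fmt"
	"sync"
)

// RoundRobin hands out node names in rotation. It ignores resource
// usage entirely and only tracks which node comes next.
type RoundRobin struct {
	mu    sync.Mutex
	nodes []string
	next  int
}

// NewRoundRobin creates a scheduler over a fixed list of node names.
func NewRoundRobin(nodes []string) *RoundRobin {
	return &RoundRobin{nodes: nodes}
}

// Next returns the next node in rotation, or false if there are none.
func (r *RoundRobin) Next() (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.nodes) == 0 {
		return "", false
	}
	n := r.nodes[r.next]
	r.next = (r.next + 1) % len(r.nodes)
	return n, true
}

func main() {
	rr := NewRoundRobin([]string{"node-a", "node-b", "node-c"})
	for i := 0; i < 4; i++ {
		n, _ := rr.Next()
		fmt.Println(n)
	}
}
```

Round robin spreads containers evenly by count, which works well only when containers and nodes are roughly uniform in size.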

### C. Health Check Best Practices

1. **HTTP**: Return 200 from `/health` endpoint
2. **TCP**: Listen on expected port
3. **Exec**: Fast-running health script
4. **Timing**: Keep the interval longer than the timeout so checks never overlap
5. **Retries**: 2-3 retries before failure
6. **Initial Delay**: Wait for app startup

### D. Glossary

- **Container**: Isolated runtime environment
- **Service**: Logical grouping of containers
- **Replica**: Individual container instance
- **Node**: Host machine for containers
- **Scheduler**: Component that selects nodes
- **Registry**: Service discovery system that maps service names to running replicas
- **Health Check**: Periodic probe that verifies a container is still responding
- **Event**: Audit log entry

### E. Architecture Patterns

**Orchestrator Pattern**: Central controller manages distributed workers

**Registry Pattern**: Service discovery via central registry

**Health Check Pattern**: Periodic polling with failure callbacks

**Event Sourcing**: Log all state changes for audit
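To make the registry pattern concrete, here is a minimal in-memory sketch. It is illustrative only and does not reproduce `internal/registry/service_registry.go`; the `Registry` type and its `Register`, `Deregister`, and `Lookup` methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry maps a service name to the addresses of its replicas.
type Registry struct {
	mu        sync.RWMutex
	endpoints map[string][]string
}

// NewRegistry returns an empty registry ready for use.
func NewRegistry() *Registry {
	return &Registry{endpoints: make(map[string][]string)}
}

// Register records a replica address under a service name.
func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.endpoints[service] = append(r.endpoints[service], addr)
}

// Deregister removes the first matching address, if present.
func (r *Registry) Deregister(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	addrs := r.endpoints[service]
	for i, a := range addrs {
		if a == addr {
			r.endpoints[service] = append(addrs[:i], addrs[i+1:]...)
			break
		}
	}
}

// Lookup returns a copy of the addresses so callers cannot mutate
// the registry's internal slice.
func (r *Registry) Lookup(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.endpoints[service]...)
}

func main() {
	reg := NewRegistry()
	reg.Register("web", "10.0.0.1:8080")
	reg.Register("web", "10.0.0.2:8080")
	reg.Deregister("web", "10.0.0.1:8080")
	fmt.Println(reg.Lookup("web"))
}
```

A health check failure callback would typically call `Deregister` so that clients stop discovering a replica before it is restarted.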

---

**Last Updated**: 2024-01-15
**Version**: 1.0.0
**Contributors**: Your Name

---

## Quick Reference Card

### Build Commands
```bash
make build          # Build binary
make run            # Build and run
make test           # Run tests
make docker-build   # Build Docker image
docker-compose up   # Run with Docker Compose
```

### API Endpoints
```
POST   /api/deploy                 # Deploy service
GET    /api/services               # List services
POST   /api/services/{name}/scale  # Scale service
GET    /api/services/{name}/logs   # Get logs
GET    /api/events                 # View events
```

### Environment Variables
```
PORT=8080          # API server port
```
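One way to read this variable at startup is sketched below. It assumes 8080 is the fallback when `PORT` is unset, which this reference card suggests but does not state outright. `ListenAddr` is a hypothetical helper; taking the lookup function as a parameter keeps it easy to test.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ListenAddr builds the API server's listen address from PORT.
// It falls back to 8080 (an assumed default) when PORT is empty and
// rejects values that are not valid TCP port numbers.
func ListenAddr(getenv func(string) string) (string, error) {
	port := getenv("PORT")
	if port == "" {
		port = "8080"
	}
	n, err := strconv.Atoi(port)
	if err != nil || n < 1 || n > 65535 {
		return "", fmt.Errorf("invalid PORT %q", port)
	}
	return ":" + strconv.Itoa(n), nil
}

func main() {
	addr, err := ListenAddr(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("API server would listen on", addr)
}
```

Validating the value up front gives a clear configuration error instead of a less obvious failure when the server tries to bind.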

### File Structure
```
cmd/orchestrator/main.go               # Entry point
internal/models/types.go               # Data models
internal/docker/client.go              # Docker SDK
internal/scheduler/scheduler.go        # Scheduling
internal/health/checker.go             # Health checks
internal/registry/service_registry.go  # Registry
internal/api/handlers.go               # API handlers
```

