Observability and Monitoring

Learning Objectives

By the end of this tutorial, you will be able to:

  • Design observability strategies for production systems using the three pillars approach
  • Implement structured logging with Go's log/slog package for machine-readable insights
  • Build comprehensive metrics collection using Prometheus with custom collectors
  • Create distributed tracing systems with OpenTelemetry for request flow analysis
  • Define SLIs and SLOs, and build alerting around service-level objectives
  • Prevent production incidents through proactive monitoring and early detection

Hook - The Production Blindness Problem

Consider driving a car with no dashboard—no speedometer, no fuel gauge, no engine lights. You'd be flying blind, unable to tell if you're speeding, running out of gas, or if the engine is about to fail.

Production software without observability is exactly like that car.

💡 Real-World Impact:

  • Amazon Prime Day 2018 - Reportedly lost up to $100M during 63 minutes of glitches, while engineers struggled to pinpoint which services were failing
  • Target Black Friday 2018 - The website reportedly buckled under peak load after connection pool exhaustion went undetected
  • Cloudflare's 2019 outage - A single bad regex in a WAF rule exhausted CPUs worldwide; it took roughly 30 minutes to diagnose and roll back

These incidents share a common theme: teams were flying blind without proper observability.

The Cost of Being Blind (commonly cited figures):

  • MTTR (mean time to recovery): roughly 75% lower with proper observability
  • MTTD (mean time to detection): roughly 85% lower with real-time alerting
  • Incident count: about 50% fewer incidents with proactive monitoring
  • Developer productivity: around 30% less time spent debugging

⚠️ Important: Without observability, you're not just debugging in the dark—you're losing money, customers, and developer time with every incident.

Core Concepts - The Three Pillars

Think of observability like a doctor's diagnostic toolkit. You need multiple instruments to get a complete picture of your system's health.

Pillar 1: Structured Logs

Logs capture discrete events with rich context. Structured logs are far more valuable than unstructured text because machines can parse, search, and analyze them efficiently.

Why Structured > Unstructured:

    // Unstructured
    log.Printf("User %s from %s logged in at %v", userID, ip, time.Now())

    // Structured
    logger.Info("user_login",
        slog.String("user_id", userID),
        slog.String("ip_address", ip),
        slog.Time("login_time", time.Now()))

Pillar 2: Metrics

Metrics provide numerical insights about system behavior over time. They answer questions like "How fast are we?" and "How much traffic?"

Key Metric Types:

  • Counters: Monotonically increasing values
  • Gauges: Values that go up and down
  • Histograms: Distribution of values
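
A minimal sketch of these three types using the prometheus/client_golang library; the metric names here are illustrative placeholders, not part of the later examples:

    package main

    import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    var (
        // Counter: only ever goes up.
        jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
            Name: "jobs_processed_total",
            Help: "Total jobs processed.",
        })
        // Gauge: goes up and down.
        queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
            Name: "queue_depth",
            Help: "Items currently waiting.",
        })
        // Histogram: records a distribution of observed values.
        jobLatency = promauto.NewHistogram(prometheus.HistogramOpts{
            Name:    "job_duration_seconds",
            Help:    "Distribution of job durations.",
            Buckets: prometheus.DefBuckets,
        })
    )

    func main() {
        jobsProcessed.Inc()
        queueDepth.Set(7)
        jobLatency.Observe(0.42)
    }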

Pillar 3: Traces

Traces show the complete journey of a request through multiple services, helping you understand performance bottlenecks and failure points.

💡 Real-World Example: When Netflix experiences streaming issues:

  1. Logs show "CDN cache miss for movie ID 12345"
  2. Metrics show "CDN miss rate increased from 0.1% to 15%"
  3. Traces reveal exactly where the time goes along the CDN → origin storage → user path

This combination tells them: what happened, how much, and why.

Structured Logging with slog

Practical Examples - Structured Logging

Example 1: Production-Ready Request Logger

Problem: You need to track all API requests with unique IDs, timing, and structured context for debugging production issues.

Solution: Build a middleware that logs every request with rich context using Go's log/slog.

    package main

    import (
        "context"
        "fmt"
        "log/slog"
        "net/http"
        "os"
        "time"

        "github.com/google/uuid"
    )

    type contextKey string

    const requestIDKey contextKey = "request_id"
    const userIDKey contextKey = "user_id"

    // RequestLogger middleware adds comprehensive request logging.
    func RequestLogger(logger *slog.Logger) func(http.Handler) http.Handler {
        return func(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // Generate a unique request ID.
                requestID := uuid.New().String()

                // Add it to the context for downstream handlers.
                ctx := context.WithValue(r.Context(), requestIDKey, requestID)
                r = r.WithContext(ctx)

                // Create a logger carrying the request context.
                reqLogger := logger.With(
                    slog.String("request_id", requestID),
                    slog.String("method", r.Method),
                    slog.String("path", r.URL.Path),
                    slog.String("remote_addr", r.RemoteAddr),
                    slog.String("user_agent", r.UserAgent()),
                )

                // Log request start.
                start := time.Now()
                reqLogger.Info("request_started")

                // Wrap the response writer to capture status and bytes written.
                rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

                // Call the handler.
                next.ServeHTTP(rw, r)

                // Log request completion with timing.
                duration := time.Since(start)
                reqLogger.Info("request_completed",
                    slog.Int("status_code", rw.statusCode),
                    slog.Int64("duration_ms", duration.Milliseconds()),
                    slog.Int("response_size", rw.bytesWritten),
                )
            })
        }
    }

    type responseWriter struct {
        http.ResponseWriter
        statusCode   int
        bytesWritten int
    }

    func (rw *responseWriter) WriteHeader(code int) {
        rw.statusCode = code
        rw.ResponseWriter.WriteHeader(code)
    }

    func (rw *responseWriter) Write(b []byte) (int, error) {
        n, err := rw.ResponseWriter.Write(b)
        rw.bytesWritten += n
        return n, err
    }

    func main() {
        logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

        // Create a router with the request logging middleware.
        mux := http.NewServeMux()
        mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(50 * time.Millisecond) // Simulate work
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(http.StatusOK)
            w.Write([]byte(`{"users": []}`))
        })

        // Apply the middleware.
        handler := RequestLogger(logger)(mux)

        fmt.Println("Server running on :8080")
        http.ListenAndServe(":8080", handler)
    }
Output:

    {"time":"2024-01-15T10:30:45Z","level":"INFO","msg":"request_started","request_id":"123e4567-89ab-4fcd-8a9b-426614174000","method":"GET","path":"/api/users","remote_addr":"127.0.0.1:52344","user_agent":"curl/7.68.0"}
    {"time":"2024-01-15T10:30:45Z","level":"INFO","msg":"request_completed","request_id":"123e4567-89ab-4fcd-8a9b-426614174000","method":"GET","path":"/api/users","remote_addr":"127.0.0.1:52344","user_agent":"curl/7.68.0","status_code":200,"duration_ms":52,"response_size":13}

Key Insights:

  • Request ID: Enables log correlation across services
  • Structured Format: JSON logs are easily searchable and analyzed
  • Performance Metrics: Built-in timing for every request
  • Context Preservation: User IDs and other context flow naturally
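
Downstream handlers can read the request ID back out of the context for their own log lines. A small helper along these lines (hypothetical, not part of the example above) would do it:

    // getRequestID reads the ID stored by the middleware above,
    // falling back to "unknown" when the middleware did not run.
    func getRequestID(ctx context.Context) string {
        if id, ok := ctx.Value(requestIDKey).(string); ok {
            return id
        }
        return "unknown"
    }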

Example 2: Context-Aware Logger with Operation Tracking

Problem: You need to track multi-step operations with proper error context and timing for debugging complex workflows.

Solution: Create a logger that automatically extracts context values and tracks operation completion.

    package main

    import (
        "context"
        "log/slog"
        "os"
        "time"
    )

    type contextKey string

    const (
        requestIDKey contextKey = "request_id"
        userIDKey    contextKey = "user_id"
    )

    type OperationLogger struct {
        *slog.Logger
    }

    func NewOperationLogger() *OperationLogger {
        handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
            Level:     slog.LevelInfo,
            AddSource: true,
        })
        return &OperationLogger{
            Logger: slog.New(handler),
        }
    }

    // WithContext automatically extracts common context values.
    func (ol *OperationLogger) WithContext(ctx context.Context) *OperationLogger {
        logger := ol.Logger

        // Extract the request ID if available.
        if requestID, ok := ctx.Value(requestIDKey).(string); ok {
            logger = logger.With(slog.String("request_id", requestID))
        }

        // Extract the user ID if available.
        if userID, ok := ctx.Value(userIDKey).(string); ok {
            logger = logger.With(slog.String("user_id", userID))
        }

        return &OperationLogger{Logger: logger}
    }

    // LogOperation tracks an operation with automatic timing and error handling.
    func (ol *OperationLogger) LogOperation(ctx context.Context, operation string, fn func() error) error {
        opLog := ol.Logger.With(slog.String("operation", operation))

        start := time.Now()
        opLog.InfoContext(ctx, "operation_started")

        err := fn()
        duration := time.Since(start)

        if err != nil {
            opLog.ErrorContext(ctx, "operation_failed",
                slog.Duration("duration", duration),
                slog.String("error", err.Error()),
                slog.String("status", "failed"),
            )
        } else {
            opLog.InfoContext(ctx, "operation_completed",
                slog.Duration("duration", duration),
                slog.String("status", "completed"),
            )
        }

        return err
    }

    // Usage example in a service.
    func processUserOrder(ctx context.Context, orderID string) error {
        logger := NewOperationLogger()
        opLogger := logger.WithContext(ctx)

        return opLogger.LogOperation(ctx, "process_order", func() error {
            // Simulate order processing steps.
            if err := validateOrder(orderID); err != nil {
                return err
            }
            if err := reserveInventory(orderID); err != nil {
                return err
            }
            if err := processPayment(orderID); err != nil {
                return err
            }
            return shipOrder(orderID)
        })
    }

    func validateOrder(orderID string) error    { /* ... */ return nil }
    func reserveInventory(orderID string) error { /* ... */ return nil }
    func processPayment(orderID string) error   { /* ... */ return nil }
    func shipOrder(orderID string) error        { /* ... */ return nil }

    func main() {
        ctx := context.WithValue(context.Background(), requestIDKey, "req-abc-123")
        ctx = context.WithValue(ctx, userIDKey, "user-456")

        processUserOrder(ctx, "order-789")
    }

Progressive Learning: Each example builds on the previous one, showing how structured logging evolves from basic JSON logging to sophisticated context-aware operation tracking that's essential for production debugging.

Example 3: Advanced Log Processing and Correlation

Problem: In microservices, you need to correlate logs across services and detect patterns in real-time.

Solution: Implement log aggregation with correlation and alerting.

    package main

    import (
        "context"
        "encoding/json"
        "fmt"
        "log/slog"
        "strings"
        "time"
    )

    // LogEntry represents a structured log entry.
    type LogEntry struct {
        Level     string                 `json:"level"`
        Message   string                 `json:"msg"`
        Time      time.Time              `json:"time"`
        RequestID string                 `json:"request_id,omitempty"`
        UserID    string                 `json:"user_id,omitempty"`
        Duration  time.Duration          `json:"duration,omitempty"`
        Error     string                 `json:"error,omitempty"`
        Fields    map[string]interface{} `json:"-"`
    }

    // LogProcessor processes and correlates logs.
    type LogProcessor struct {
        entries   []LogEntry
        patterns  map[string]int
        alertChan chan LogEntry
    }

    func NewLogProcessor() *LogProcessor {
        return &LogProcessor{
            patterns:  make(map[string]int),
            alertChan: make(chan LogEntry, 100),
        }
    }

    // ProcessEntry processes a single log entry.
    func (lp *LogProcessor) ProcessEntry(entry LogEntry) {
        lp.entries = append(lp.entries, entry)

        // Detect error patterns.
        if entry.Level == "ERROR" {
            lp.detectErrorPattern(entry)

            // Send to the alert channel.
            select {
            case lp.alertChan <- entry:
            default:
                // Channel full, drop the alert.
            }
        }

        // Keep only the last 1000 entries.
        if len(lp.entries) > 1000 {
            lp.entries = lp.entries[1:]
        }
    }

    // detectErrorPattern identifies common error patterns.
    func (lp *LogProcessor) detectErrorPattern(entry LogEntry) {
        pattern := extractErrorPattern(entry.Error)
        if pattern != "" {
            lp.patterns[pattern]++

            // Alert on frequent errors.
            if lp.patterns[pattern] > 10 {
                fmt.Printf("ALERT: Pattern '%s' occurred %d times\n",
                    pattern, lp.patterns[pattern])
            }
        }
    }

    // extractErrorPattern extracts the error pattern from an error message.
    func extractErrorPattern(errMsg string) string {
        switch {
        case strings.Contains(errMsg, "timeout"):
            return "timeout_error"
        case strings.Contains(errMsg, "connection refused"):
            return "connection_error"
        case strings.Contains(errMsg, "database"):
            return "database_error"
        default:
            return "unknown_error"
        }
    }

    // CorrelateRequest finds all logs for a given request.
    func (lp *LogProcessor) CorrelateRequest(requestID string) []LogEntry {
        var requestLogs []LogEntry
        for _, entry := range lp.entries {
            if entry.RequestID == requestID {
                requestLogs = append(requestLogs, entry)
            }
        }
        return requestLogs
    }

    // processingHandler is a custom slog.Handler that feeds every record into
    // a LogProcessor. log/slog has no HandlerFunc helper, so we implement the
    // four-method Handler interface directly.
    type processingHandler struct {
        processor *LogProcessor
    }

    func (h *processingHandler) Enabled(ctx context.Context, level slog.Level) bool { return true }

    func (h *processingHandler) Handle(ctx context.Context, record slog.Record) error {
        // Convert to our custom log entry.
        entry := LogEntry{
            Level:   record.Level.String(),
            Message: record.Message,
            Time:    record.Time,
            Fields:  make(map[string]interface{}),
        }

        // Extract attributes.
        record.Attrs(func(attr slog.Attr) bool {
            switch attr.Key {
            case "request_id":
                entry.RequestID = attr.Value.String()
            case "user_id":
                entry.UserID = attr.Value.String()
            case "error":
                entry.Error = attr.Value.String()
            default:
                entry.Fields[attr.Key] = attr.Value.Any()
            }
            return true
        })

        // Process the entry.
        h.processor.ProcessEntry(entry)

        // Print to stdout for the demo.
        data, _ := json.Marshal(entry)
        fmt.Println(string(data))

        return nil
    }

    // WithAttrs and WithGroup are simplified for the demo:
    // pre-set attributes and groups are ignored.
    func (h *processingHandler) WithAttrs(attrs []slog.Attr) slog.Handler { return h }
    func (h *processingHandler) WithGroup(name string) slog.Handler       { return h }

    // setupProductionLogger wires the correlating handler into a logger.
    func setupProductionLogger() (*slog.Logger, *LogProcessor) {
        processor := NewLogProcessor()
        logger := slog.New(&processingHandler{processor: processor})
        return logger, processor
    }

    func main() {
        logger, processor := setupProductionLogger()

        // Simulate request processing with correlation.
        requestID := "req-12345"

        logger.Info("request_started",
            slog.String("request_id", requestID),
            slog.String("user_id", "user-67890"),
        )

        // Simulate processing with an error.
        time.Sleep(100 * time.Millisecond)
        logger.Error("database_connection_failed",
            slog.String("request_id", requestID),
            slog.String("error", "connection timeout to database at localhost:5432"),
            slog.Duration("retry_after", 5*time.Second),
        )

        logger.Info("request_completed",
            slog.String("request_id", requestID),
            slog.Duration("duration", 110*time.Millisecond),
            slog.Int("status_code", 500),
        )

        // Correlate and analyze the logs.
        time.Sleep(1 * time.Second)
        requestLogs := processor.CorrelateRequest(requestID)
        fmt.Printf("\nCorrelated logs for request %s: %d entries\n",
            requestID, len(requestLogs))
    }

Advanced Insights:

  • Pattern Detection: Automatically identify recurring error patterns
  • Real-time Alerting: Detect issues as they happen
  • Log Correlation: Group all logs by request ID for debugging
  • Memory Management: Automatically clean up old entries

Practical Examples - Prometheus Metrics

Example 1: Complete HTTP Metrics Middleware

Problem: You need comprehensive HTTP metrics for performance monitoring and alerting.

Solution: Build a middleware that captures all key HTTP metrics with proper labeling.

    package main

    import (
        "fmt"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // HTTPMetrics tracks all HTTP-related metrics.
    type HTTPMetrics struct {
        // Request metrics
        requestTotal    *prometheus.CounterVec
        requestDuration *prometheus.HistogramVec
        requestSize     *prometheus.HistogramVec
        responseSize    *prometheus.HistogramVec

        // Connection metrics
        activeConnections prometheus.Gauge
        connectionErrors  prometheus.Counter

        // Performance metrics
        slowRequests    prometheus.Counter
        timeoutRequests prometheus.Counter
    }

    func NewHTTPMetrics(namespace string) *HTTPMetrics {
        return &HTTPMetrics{
            requestTotal: promauto.NewCounterVec(
                prometheus.CounterOpts{
                    Namespace: namespace,
                    Name:      "http_requests_total",
                    Help:      "Total number of HTTP requests",
                },
                // Caution: labeling by raw user agent can explode label
                // cardinality in production; normalize it or drop the label.
                []string{"method", "path", "status", "user_agent"},
            ),

            requestDuration: promauto.NewHistogramVec(
                prometheus.HistogramOpts{
                    Namespace: namespace,
                    Name:      "http_request_duration_seconds",
                    Help:      "HTTP request duration",
                    Buckets:   []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0},
                },
                []string{"method", "path", "status"},
            ),

            requestSize: promauto.NewHistogramVec(
                prometheus.HistogramOpts{
                    Namespace: namespace,
                    Name:      "http_request_size_bytes",
                    Help:      "HTTP request size in bytes",
                    Buckets:   []float64{100, 1000, 10000, 100000, 1000000},
                },
                []string{"method", "path"},
            ),

            responseSize: promauto.NewHistogramVec(
                prometheus.HistogramOpts{
                    Namespace: namespace,
                    Name:      "http_response_size_bytes",
                    Help:      "HTTP response size in bytes",
                    Buckets:   []float64{100, 1000, 10000, 100000, 1000000},
                },
                []string{"method", "path", "status"},
            ),

            activeConnections: promauto.NewGauge(
                prometheus.GaugeOpts{
                    Namespace: namespace,
                    Name:      "http_active_connections",
                    Help:      "Number of currently active HTTP connections",
                },
            ),

            connectionErrors: promauto.NewCounter(
                prometheus.CounterOpts{
                    Namespace: namespace,
                    Name:      "http_connection_errors_total",
                    Help:      "Total number of HTTP connection errors",
                },
            ),

            slowRequests: promauto.NewCounter(
                prometheus.CounterOpts{
                    Namespace: namespace,
                    Name:      "http_slow_requests_total",
                    Help:      "Total number of slow HTTP requests",
                },
            ),

            timeoutRequests: promauto.NewCounter(
                prometheus.CounterOpts{
                    Namespace: namespace,
                    Name:      "http_timeout_requests_total",
                    Help:      "Total number of HTTP timeouts",
                },
            ),
        }
    }

    // TrackRequest tracks a single HTTP request.
    func (hm *HTTPMetrics) TrackRequest(
        method, path string,
        statusCode int,
        duration time.Duration,
        requestSize, responseSize int64,
        userAgent string,
    ) {
        status := fmt.Sprintf("%d", statusCode)

        // Record request count.
        hm.requestTotal.WithLabelValues(method, path, status, userAgent).Inc()

        // Record request duration.
        hm.requestDuration.WithLabelValues(method, path, status).Observe(duration.Seconds())

        // Record request/response sizes.
        hm.requestSize.WithLabelValues(method, path).Observe(float64(requestSize))
        hm.responseSize.WithLabelValues(method, path, status).Observe(float64(responseSize))

        // Track performance issues.
        if duration > time.Second {
            hm.slowRequests.Inc()
        }
        if duration > 30*time.Second {
            hm.timeoutRequests.Inc()
        }
    }

    // IncrementConnections increments active connections.
    func (hm *HTTPMetrics) IncrementConnections() {
        hm.activeConnections.Inc()
    }

    // DecrementConnections decrements active connections.
    func (hm *HTTPMetrics) DecrementConnections() {
        hm.activeConnections.Dec()
    }

    // RecordConnectionError records a connection error.
    func (hm *HTTPMetrics) RecordConnectionError() {
        hm.connectionErrors.Inc()
    }

    // Middleware records metrics for every request.
    func (hm *HTTPMetrics) Middleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()

            // Track active connections.
            hm.IncrementConnections()
            defer hm.DecrementConnections()

            // Wrap the response writer to capture status and size.
            rw := &metricsResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

            // Record the request size.
            requestSize := r.ContentLength

            // Call the handler.
            next.ServeHTTP(rw, r)

            // Calculate duration and record metrics.
            duration := time.Since(start)

            hm.TrackRequest(
                r.Method,
                r.URL.Path,
                rw.statusCode,
                duration,
                requestSize,
                int64(rw.bytesWritten),
                r.UserAgent(),
            )
        })
    }

    type metricsResponseWriter struct {
        http.ResponseWriter
        statusCode   int
        bytesWritten int
    }

    func (mrw *metricsResponseWriter) WriteHeader(code int) {
        mrw.statusCode = code
        mrw.ResponseWriter.WriteHeader(code)
    }

    func (mrw *metricsResponseWriter) Write(b []byte) (int, error) {
        n, err := mrw.ResponseWriter.Write(b)
        mrw.bytesWritten += n
        return n, err
    }

    func main() {
        metrics := NewHTTPMetrics("myapp")

        // Example handlers.
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(100 * time.Millisecond) // Simulate work
            w.Write([]byte("Hello World"))
        })

        mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(2 * time.Second) // Slow endpoint
            w.Write([]byte("Slow response"))
        })

        // Expose the metrics endpoint on the same mux so it is actually served.
        mux.Handle("/metrics", promhttp.Handler())

        // Apply the metrics middleware.
        handler := metrics.Middleware(mux)

        fmt.Println("Server running on :8080")
        fmt.Println("Metrics available at http://localhost:8080/metrics")
        http.ListenAndServe(":8080", handler)
    }

Key Metrics Collected:

  • Request Count: By method, path, status, and user agent
  • Response Time: Detailed histogram with performance buckets
  • Payload Sizes: Both request and response sizes
  • Connection Metrics: Active connections and errors
  • Performance Issues: Automatic detection of slow requests and timeouts
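
Once Prometheus scrapes these metrics, standard queries fall out directly; for example, p95 latency can be read with histogram_quantile(0.95, sum by (le) (rate(myapp_http_request_duration_seconds_bucket[5m]))).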

Example 2: Custom Business Metrics for E-commerce

Problem: You need to track business metrics like order processing times, cart abandonment, and conversion rates.

Solution: Implement custom metrics that capture business-critical KPIs.

    package main

    import (
        "fmt"
        "sync"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // BusinessMetrics tracks e-commerce business metrics.
    type BusinessMetrics struct {
        // Order metrics
        ordersTotal   *prometheus.CounterVec
        orderValue    *prometheus.HistogramVec
        orderDuration *prometheus.HistogramVec

        // Cart metrics
        cartAdditions    prometheus.Counter
        cartAbandonments prometheus.Counter
        cartConversions  prometheus.Counter

        // User metrics
        activeUsers    prometheus.Gauge
        userSessions   prometheus.Counter
        conversionRate prometheus.Gauge

        // Inventory metrics
        stockLevel *prometheus.GaugeVec
        outOfStock *prometheus.CounterVec

        // Prometheus counters cannot be read back by the application, so we
        // keep our own tallies for the conversion-rate calculation.
        mu           sync.Mutex
        conversions  float64
        abandonments float64
    }

    func NewBusinessMetrics() *BusinessMetrics {
        return &BusinessMetrics{
            ordersTotal: promauto.NewCounterVec(
                prometheus.CounterOpts{
                    Name: "orders_total",
                    Help: "Total number of orders by status",
                },
                []string{"status", "payment_method", "region"},
            ),

            orderValue: promauto.NewHistogramVec(
                prometheus.HistogramOpts{
                    Name:    "order_value_dollars",
                    Help:    "Order value distribution",
                    Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
                },
                []string{"product_category"},
            ),

            orderDuration: promauto.NewHistogramVec(
                prometheus.HistogramOpts{
                    Name:    "order_processing_duration_seconds",
                    Help:    "Time to process orders",
                    Buckets: []float64{1, 5, 10, 30, 60, 120},
                },
                []string{"order_type"},
            ),

            cartAdditions: promauto.NewCounter(
                prometheus.CounterOpts{
                    Name: "cart_additions_total",
                    Help: "Total items added to carts",
                },
            ),

            cartAbandonments: promauto.NewCounter(
                prometheus.CounterOpts{
                    Name: "cart_abandonments_total",
                    Help: "Total abandoned carts",
                },
            ),

            cartConversions: promauto.NewCounter(
                prometheus.CounterOpts{
                    Name: "cart_conversions_total",
                    Help: "Total completed checkouts",
                },
            ),

            activeUsers: promauto.NewGauge(
                prometheus.GaugeOpts{
                    Name: "active_users_current",
                    Help: "Number of currently active users",
                },
            ),

            userSessions: promauto.NewCounter(
                prometheus.CounterOpts{
                    Name: "user_sessions_total",
                    Help: "Total user sessions",
                },
            ),

            conversionRate: promauto.NewGauge(
                prometheus.GaugeOpts{
                    Name: "conversion_rate",
                    Help: "Cart to order conversion rate",
                },
            ),

            stockLevel: promauto.NewGaugeVec(
                prometheus.GaugeOpts{
                    Name: "inventory_stock_level",
                    Help: "Current stock levels",
                },
                []string{"product_id", "category"},
            ),

            outOfStock: promauto.NewCounterVec(
                prometheus.CounterOpts{
                    Name: "inventory_out_of_stock_total",
                    Help: "Out of stock events",
                },
                []string{"product_id", "category"},
            ),
        }
    }

    // TrackOrder tracks order completion.
    func (bm *BusinessMetrics) TrackOrder(status, paymentMethod, region string, value float64, duration time.Duration, category string) {
        bm.ordersTotal.WithLabelValues(status, paymentMethod, region).Inc()
        bm.orderValue.WithLabelValues(category).Observe(value)
        bm.orderDuration.WithLabelValues("standard").Observe(duration.Seconds())
    }

    // TrackCartAddition tracks an item added to a cart.
    func (bm *BusinessMetrics) TrackCartAddition() {
        bm.cartAdditions.Inc()
    }

    // TrackCartAbandonment tracks an abandoned cart.
    func (bm *BusinessMetrics) TrackCartAbandonment() {
        bm.cartAbandonments.Inc()
        bm.mu.Lock()
        bm.abandonments++
        bm.updateConversionRateLocked()
        bm.mu.Unlock()
    }

    // TrackConversion tracks a completed checkout.
    func (bm *BusinessMetrics) TrackConversion() {
        bm.cartConversions.Inc()
        bm.mu.Lock()
        bm.conversions++
        bm.updateConversionRateLocked()
        bm.mu.Unlock()
    }

    // updateConversionRateLocked recalculates the conversion-rate gauge.
    // The caller must hold bm.mu.
    func (bm *BusinessMetrics) updateConversionRateLocked() {
        total := bm.conversions + bm.abandonments
        if total > 0 {
            bm.conversionRate.Set(bm.conversions / total)
        }
    }

    // UserActive indicates user activity.
    func (bm *BusinessMetrics) UserActive(active bool) {
        if active {
            bm.activeUsers.Inc()
        } else {
            bm.activeUsers.Dec()
        }
    }

    // TrackSession tracks a new user session.
    func (bm *BusinessMetrics) TrackSession() {
        bm.userSessions.Inc()
    }

    // UpdateInventory updates stock levels.
    func (bm *BusinessMetrics) UpdateInventory(productID, category string, stockLevel int) {
        bm.stockLevel.WithLabelValues(productID, category).Set(float64(stockLevel))

        if stockLevel == 0 {
            bm.outOfStock.WithLabelValues(productID, category).Inc()
        }
    }

    // CalculateConversionRate returns current conversion metrics.
    func (bm *BusinessMetrics) CalculateConversionRate() map[string]float64 {
        bm.mu.Lock()
        defer bm.mu.Unlock()

        total := bm.abandonments + bm.conversions
        rate := 0.0
        if total > 0 {
            rate = bm.conversions / total
        }

        return map[string]float64{
            "conversion_rate":    rate * 100, // Percentage
            "total_conversions":  bm.conversions,
            "total_abandonments": bm.abandonments,
            "total_carts":        total,
        }
    }

    // Usage example
    func main() {
        metrics := NewBusinessMetrics()

        // Simulate e-commerce activity.
        var wg sync.WaitGroup

        // Simulate user sessions.
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func(userID int) {
                defer wg.Done()

                // User session starts.
                metrics.TrackSession()
                metrics.UserActive(true)

                // Simulate shopping.
                time.Sleep(time.Duration(100+userID) * time.Millisecond)

                // Add items to the cart.
                metrics.TrackCartAddition()
                metrics.TrackCartAddition()

                // 70% chance of conversion.
                if userID%10 < 7 {
                    // Complete the order.
                    metrics.TrackConversion()
                    metrics.TrackOrder("completed", "credit_card", "US", 89.99, 5*time.Second, "electronics")
                } else {
                    // Abandon the cart.
                    metrics.TrackCartAbandonment()
                }

                // User session ends.
                metrics.UserActive(false)
            }(i)
        }

        // Simulate inventory updates.
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func(productID int) {
                defer wg.Done()

                // Update inventory levels.
                stock := 100 - productID
                metrics.UpdateInventory(
                    fmt.Sprintf("prod-%d", productID),
                    "electronics",
                    stock,
                )
            }(i)
        }

        wg.Wait()

        // Print conversion metrics.
        convMetrics := metrics.CalculateConversionRate()
        fmt.Println("E-commerce Metrics:")
        for k, v := range convMetrics {
            fmt.Printf("  %s: %.2f\n", k, v)
        }
    }

Business Insights:

  • Conversion Tracking: Real-time cart abandonment and conversion rates
  • Order Analytics: Value distribution, processing times, and regional performance
  • User Behavior: Session tracking and active user counts
  • Inventory Monitoring: Real-time stock levels and out-of-stock alerts
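
Note that the conversion-rate gauge above is computed client-side for convenience; the same figure can be derived at query time instead, e.g. rate(cart_conversions_total[5m]) / (rate(cart_conversions_total[5m]) + rate(cart_abandonments_total[5m])), which avoids keeping extra state in the application.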

Practical Examples - Distributed Tracing

Example 1: Complete Microservice Tracing

Problem: You need to trace requests across multiple microservices to identify bottlenecks and failures.

Solution: Implement comprehensive distributed tracing with OpenTelemetry.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "time"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/codes"
        "go.opentelemetry.io/otel/exporters/jaeger"
        "go.opentelemetry.io/otel/propagation"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
        "go.opentelemetry.io/otel/trace"
    )

    // UserService handles user-related operations.
    type UserService struct {
        tracer trace.Tracer
    }

    func NewUserService() *UserService {
        return &UserService{
            tracer: otel.Tracer("user-service"),
        }
    }

    func (us *UserService) GetUser(ctx context.Context, userID string) (*User, error) {
        ctx, span := us.tracer.Start(ctx, "get-user")
        defer span.End()

        span.SetAttributes(
            attribute.String("user_id", userID),
            attribute.String("operation", "database_query"),
        )

        // Simulate a database query.
        time.Sleep(50 * time.Millisecond)

        // Simulate a cache check.
        us.checkCache(ctx, userID)

        // Return user data.
        return &User{
            ID:    userID,
            Name:  "John Doe",
            Email: "john@example.com",
        }, nil
    }

    func (us *UserService) checkCache(ctx context.Context, userID string) {
        _, span := us.tracer.Start(ctx, "cache-lookup")
        defer span.End()

        span.SetAttributes(
            attribute.String("cache.key", fmt.Sprintf("user:%s", userID)),
            attribute.String("cache.result", "miss"),
        )

        time.Sleep(5 * time.Millisecond)
    }

    // OrderService handles order operations.
    type OrderService struct {
        tracer      trace.Tracer
        userService *UserService
    }

    func NewOrderService(userService *UserService) *OrderService {
        return &OrderService{
            tracer:      otel.Tracer("order-service"),
            userService: userService,
        }
    }

    func (s *OrderService) CreateOrder(ctx context.Context, userID string, items []OrderItem) (*Order, error) {
        ctx, span := s.tracer.Start(ctx, "create-order")
        defer span.End()

        span.SetAttributes(
            attribute.String("user_id", userID),
            attribute.Int("item_count", len(items)),
            attribute.String("order_type", "standard"),
        )

        // Validate the user.
        user, err := s.userService.GetUser(ctx, userID)
        if err != nil {
            span.RecordError(err)
            span.SetStatus(codes.Error, "user_validation_failed")
            return nil, err
        }

        // Calculate totals.
        s.calculateOrderTotal(ctx, items)

        // Process payment.
        if err := s.processPayment(ctx, items); err != nil {
            span.RecordError(err)
            span.SetStatus(codes.Error, "payment_failed")
            return nil, err
        }

        // Create the order.
        order := &Order{
            ID:     fmt.Sprintf("order-%d", time.Now().Unix()),
            User:   *user,
            Items:  items,
            Status: "created",
        }

        span.SetAttributes(
            attribute.String("order_id", order.ID),
            attribute.String("order_status", order.Status),
        )
        span.SetStatus(codes.Ok, "order_created_successfully")

        return order, nil
    }

    func (s *OrderService) calculateOrderTotal(ctx context.Context, items []OrderItem) {
        _, span := s.tracer.Start(ctx, "calculate-total")
        defer span.End()

        span.SetAttributes(
            attribute.String("calculation_type", "order_total"),
        )

        total := 0.0
        for _, item := range items {
            total += item.Price * float64(item.Quantity)
        }

        span.SetAttributes(
            attribute.Float64("order_total", total),
        )

        time.Sleep(10 * time.Millisecond)
    }

    func (s *OrderService) processPayment(ctx context.Context, items []OrderItem) error {
        _, span := s.tracer.Start(ctx, "process-payment")
        defer span.End()

        span.SetAttributes(
            attribute.String("payment_processor", "stripe"),
            attribute.String("payment_type", "credit_card"),
        )

        // Simulate payment processing.
        time.Sleep(100 * time.Millisecond)

        // Simulate a payment failure.
        if len(items) > 5 {
            return errors.New("payment declined: too many items")
        }

        span.SetAttributes(
            attribute.String("payment_status", "completed"),
        )

        return nil
    }

    // Data structures
    type User struct {
        ID    string
        Name  string
        Email string
    }

    type OrderItem struct {
        Name     string
        Price    float64
        Quantity int
    }

    type Order struct {
        ID     string
        User   User
        Items  []OrderItem
        Status string
    }

    // Initialize tracing. Note: the Jaeger exporter is deprecated in recent
    // OpenTelemetry releases in favor of OTLP, but it still works for demos.
    func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
        exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
            jaeger.WithEndpoint("http://localhost:14268/api/traces"),
        ))
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String(serviceName),
                semconv.ServiceVersionKey.String("1.0.0"),
                attribute.String("environment", "production"),
            )),
        )

        otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{},
            propagation.Baggage{},
        ))

        otel.SetTracerProvider(tp)
        return tp, nil
    }

    // HTTP server with tracing.
    func main() {
        tp, err := initTracer("order-service")
        if err != nil {
            panic(err)
        }
        defer tp.Shutdown(context.Background())

        // Create the services.
        userService := NewUserService()
        orderService := NewOrderService(userService)

        // HTTP handler with tracing.
        http.HandleFunc("/api/orders", func(w http.ResponseWriter, r *http.Request) {
            // Extract the incoming trace context.
            ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

            // Create the root span.
            tracer := otel.Tracer("http-server")
            ctx, span := tracer.Start(ctx, "http-create-order",
                trace.WithAttributes(
                    attribute.String("http.method", r.Method),
                    attribute.String("http.url", r.URL.String()),
                    attribute.String("http.user_agent", r.UserAgent()),
                ),
            )
            defer span.End()

            // Parse the request.
            userID := r.URL.Query().Get("user_id")
            if userID == "" {
                span.SetAttributes(attribute.String("error", "missing_user_id"))
                w.WriteHeader(http.StatusBadRequest)
                return
            }

            items := []OrderItem{
                {"Product 1", 29.99, 2},
                {"Product 2", 19.99, 1},
            }

            // Create the order.
            order, err := orderService.CreateOrder(ctx, userID, items)
            if err != nil {
                span.RecordError(err)
                span.SetStatus(codes.Error, "order_creation_failed")
                w.WriteHeader(http.StatusInternalServerError)
                return
            }

            // Success response.
            span.SetAttributes(
                attribute.String("order_id", order.ID),
                attribute.Int("http.status_code", http.StatusCreated),
            )
            span.SetStatus(codes.Ok, "request_completed")

            w.WriteHeader(http.StatusCreated)
            fmt.Fprintf(w, "Order %s created successfully", order.ID)
        })

        fmt.Println("Order service running on :8080")
        fmt.Println("Traces available at: http://localhost:16686")
        http.ListenAndServe(":8080", nil)
    }

Tracing Insights:

  • Request Flow: Complete journey from API to database
  • Performance Analysis: Identify slow operations and bottlenecks
  • Error Debugging: Root cause analysis with detailed error context
  • Service Dependencies: Understand relationships between services

Common Pitfalls and Solutions

Pitfall 1: Over-logging Everything

Problem: Logging everything creates noise and storage costs.
Solution: Implement smart logging levels and sampling.

    // Smart logger with sampling
    type SmartLogger struct {
        *slog.Logger
        sampleRate float64 // fraction of debug logs to keep, e.g. 0.05
        counter    int64   // updated atomically (sync/atomic)
    }

    func (sl *SmartLogger) Debug(msg string, args ...any) {
        n := atomic.AddInt64(&sl.counter, 1)
        // Keep roughly sampleRate of messages and skip the rest.
        if float64(n%100) >= sl.sampleRate*100 {
            return // Skip this log message
        }
        sl.Logger.Debug(msg, args...)
    }

Pitfall 2: Too Many Metrics

Problem: Metric overload makes it hard to identify what matters.
Solution: Focus on RED and USE methodologies.

    // RED: Rate, Errors, Duration
    // USE: Utilization, Saturation, Errors

    type EssentialMetrics struct {
        requestRate prometheus.Counter   // Rate
        errorRate   prometheus.Counter   // Errors
        latency     prometheus.Histogram // Duration
        utilization prometheus.Gauge     // Utilization
        saturation  prometheus.Gauge     // Saturation
    }

Pitfall 3: Missing Context in Traces

Problem: Traces without context are useless.
Solution: Always add meaningful attributes and events.

    // Good trace with context
    span.SetAttributes(
        attribute.String("user.id", userID),
        attribute.String("operation.type", "database_query"),
        attribute.String("table", "users"),
        attribute.Bool("cache_hit", false),
    )

    span.AddEvent("query_executed", trace.WithAttributes(
        attribute.Int("rows_returned", 42),
        // The attribute package has no Duration type; record a number instead.
        attribute.Int64("query_time_ms", 50),
    ))

Integration and Mastery

Production Observability Stack

┌─────────────────────────────────────────┐
│           OBSERVABILITY STACK           │
└─────────────────────────────────────────┘

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   LOGGING    │  │   METRICS    │  │   TRACING    │
│              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │   App    │ │  │ │   App    │ │  │ │   App    │ │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │
│      │       │  │      │       │  │      │       │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │   slog   │ │  │ │Prometheus│ │  │ │ Otel SDK │ │
│ │          │ │  │ │  client  │ │  │ │          │ │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │
│      │       │  │      │       │  │      │       │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │  Loki /  │ │  │ │Prometheus│ │  │ │ Jaeger / │ │
│ │ Elastic  │ │  │ │  server  │ │  │ │  Tempo   │ │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │
└──────────────┘  └──────────────┘  └──────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                   ┌─────────────┐
                   │   Grafana   │
                   │  Dashboard  │
                   └─────────────┘

Key Success Patterns

  1. Start Small: Begin with basic logging and key metrics
  2. Build Incrementally: Add tracing as your system grows
  3. Automate: Use auto-instrumentation where possible
  4. Monitor Costs: Implement log/metric retention policies
  5. Train Teams: Ensure everyone understands observability practices

Summary

What We Learned

  • Structured Logging: Go's slog package for production-ready logging
  • Prometheus Metrics: Comprehensive collection with business KPIs
  • Distributed Tracing: OpenTelemetry for microservice visibility
  • Production Patterns: Common pitfalls and best practices
  • Integration: Complete observability stack architecture

Key Takeaways

  1. Context is King: Always add request IDs, user IDs, and operation context
  2. Measure What Matters: Focus on user-facing metrics and business KPIs
  3. Automation Wins: Use middleware and auto-instrumentation consistently
  4. Cost Awareness: Implement sampling and retention from day one

Next Steps

  • Implement RED/USE Metrics: Start with Rate, Errors, Duration and Utilization, Saturation, Errors
  • Set Up Dashboards: Create operational dashboards for your services
  • Configure Alerting: Implement SLO-based alerting with error budgets
  • Practice Incident Response: Run regular incident response drills using your observability data

Production Checklist

  • Structured logging with request correlation
  • HTTP middleware for automatic metrics collection
  • Distributed tracing across all services
  • SLOs defined for critical operations
  • Dashboards for system health and business metrics
  • Alerting based on error budgets
  • Log/metric retention policies
  • Regular incident response practice

Remember: Observability is not an option, it's a requirement for production systems.

Structured Logging Fundamentals

💡 Key Takeaway: Structured logs aren't just for machines—they make debugging faster for humans too. Imagine searching for "all errors for user 123 in the last hour" across millions of log lines.

Basic Structured Logging

    package main

    import (
        "log/slog"
        "os"
    )

    func main() {
        // Default JSON logger
        logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

        // Simple log
        logger.Info("server started", "port", 8080, "env", "production")

        // With attributes
        logger.Info("user logged in",
            slog.String("user_id", "user-123"),
            slog.Int("session_duration", 3600),
            slog.Bool("admin", false),
        )

        // Error with context
        logger.Error("database connection failed",
            slog.String("error", "connection timeout"),
            slog.String("database", "postgres"),
            slog.Int("retry_count", 3),
        )
    }

Output:

    {"time":"2024-01-15T10:30:45Z","level":"INFO","msg":"server started","port":8080,"env":"production"}
    {"time":"2024-01-15T10:30:46Z","level":"INFO","msg":"user logged in","user_id":"user-123","session_duration":3600,"admin":false}
    {"time":"2024-01-15T10:30:47Z","level":"ERROR","msg":"database connection failed","error":"connection timeout","database":"postgres","retry_count":3}

Production Logger with Context

    package main

    import (
        "context"
        "log/slog"
        "os"
        "time"
    )

    type contextKey string

    const (
        requestIDKey contextKey = "request_id"
        userIDKey    contextKey = "user_id"
    )

    // Logger wraps slog with context support.
    type Logger struct {
        *slog.Logger
    }

    func NewLogger() *Logger {
        handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
            Level:     slog.LevelInfo,
            AddSource: true, // Add file:line to logs
        })

        return &Logger{
            Logger: slog.New(handler),
        }
    }

    // WithContext extracts context values and adds them to the log.
    func (l *Logger) WithContext(ctx context.Context) *Logger {
        logger := l.Logger

        // Add the request ID.
        if requestID, ok := ctx.Value(requestIDKey).(string); ok {
            logger = logger.With(slog.String("request_id", requestID))
        }

        // Add the user ID.
        if userID, ok := ctx.Value(userIDKey).(string); ok {
            logger = logger.With(slog.String("user_id", userID))
        }

        return &Logger{Logger: logger}
    }

    // LogOperation logs an operation with timing.
    func (l *Logger) LogOperation(ctx context.Context, operation string, fn func() error) error {
        start := time.Now()

        err := fn()

        duration := time.Since(start)

        if err != nil {
            l.WithContext(ctx).Error(operation+" failed",
                slog.Duration("duration", duration),
                slog.String("error", err.Error()),
            )
        } else {
            l.WithContext(ctx).Info(operation+" completed",
                slog.Duration("duration", duration),
            )
        }

        return err
    }

    func main() {
        logger := NewLogger()

        // Create a context with request and user IDs.
        ctx := context.WithValue(context.Background(), requestIDKey, "req-abc-123")
        ctx = context.WithValue(ctx, userIDKey, "user-456")

        // Log with context.
        logger.WithContext(ctx).Info("processing request")

        // Log an operation with timing.
        logger.LogOperation(ctx, "fetch_user", func() error {
            time.Sleep(50 * time.Millisecond)
            return nil
        })
    }

Log Levels and Filtering

    package main

    import (
        "log/slog"
        "os"
    )

    func main() {
        // Production: INFO and above
        prodHandler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
            Level: slog.LevelInfo,
        })
        prodLogger := slog.New(prodHandler)

        // Development: DEBUG and above
        devHandler := slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
            Level: slog.LevelDebug,
        })
        devLogger := slog.New(devHandler)

        // Log at different levels
        prodLogger.Debug("debug message")  // Not logged
        prodLogger.Info("info message")    // Logged
        prodLogger.Warn("warning message") // Logged
        prodLogger.Error("error message")  // Logged

        devLogger.Debug("debug message") // Logged
        devLogger.Info("info message")   // Logged
    }
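
If you also want package-level calls such as slog.Info(...) to go through your configured handler, you can promote it to the process-wide default:

    // Make the JSON production logger the default for the whole process.
    slog.SetDefault(prodLogger)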

Prometheus Metrics

If structured logging is like keeping a detailed diary, Prometheus metrics are like the vital signs dashboard on a medical monitor. They give you real-time, numerical data about your system's health that you can track over time, set alerts on, and use to predict problems before they happen.

Prometheus is the de facto standard for metrics in modern systems. It's like having a fitness tracker for your application—it constantly monitors key metrics and tells you when something's wrong.

💡 Real-world Example: Uber uses Prometheus to monitor over 1,000 microservices. When ride requests spike during rush hour:

  • Gauge shows "active drivers: 45,000"
  • Counter shows "rides_completed_total: 125,000"
  • Histogram shows "ride_duration_seconds"

If active drivers drop while requests increase, they can dispatch more drivers before riders even notice delays.

Basic Metrics Types

    package main

    import (
        "fmt"
        "math/rand"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        // Counter: Monotonically increasing value
        requestsTotal = promauto.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "path", "status"},
        )

        // Gauge: Value that can go up or down
        activeConnections = promauto.NewGauge(
            prometheus.GaugeOpts{
                Name: "active_connections",
                Help: "Number of active connections",
            },
        )

        // Histogram: Distribution of values
        requestDuration = promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "HTTP request latency",
                Buckets: prometheus.DefBuckets, // Default: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
            },
            []string{"method", "path"},
        )

        // Summary: Like a histogram, but calculates quantiles client-side
        requestSize = promauto.NewSummaryVec(
            prometheus.SummaryOpts{
                Name:       "http_request_size_bytes",
                Help:       "HTTP request size in bytes",
                Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
            },
            []string{"method"},
        )
    )

    // Middleware to record metrics
    func metricsMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()

            // Increment active connections.
            activeConnections.Inc()
            defer activeConnections.Dec()

            // Wrap the response writer to capture the status code.
            rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

            // Call the handler.
            next.ServeHTTP(rw, r)

            // Record metrics.
            duration := time.Since(start).Seconds()

            requestsTotal.WithLabelValues(r.Method, r.URL.Path, fmt.Sprintf("%d", rw.statusCode)).Inc()
            requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
            if r.ContentLength >= 0 { // ContentLength is -1 when unknown
                requestSize.WithLabelValues(r.Method).Observe(float64(r.ContentLength))
            }
        })
    }

    type responseWriter struct {
        http.ResponseWriter
        statusCode int
    }

    func (rw *responseWriter) WriteHeader(code int) {
        rw.statusCode = code
        rw.ResponseWriter.WriteHeader(code)
    }

    func main() {
        // API handlers
        http.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
            // Simulate work.
            time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
            w.WriteHeader(http.StatusOK)
            w.Write([]byte(`{"users": []}`))
        })

        http.HandleFunc("/api/products", func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(time.Duration(rand.Intn(200)) * time.Millisecond)
            w.WriteHeader(http.StatusOK)
            w.Write([]byte(`{"products": []}`))
        })

        // Metrics endpoint
        http.Handle("/metrics", promhttp.Handler())

        // Apply the metrics middleware.
        handler := metricsMiddleware(http.DefaultServeMux)

        fmt.Println("Server running on :8080")
        fmt.Println("Metrics available at http://localhost:8080/metrics")
        http.ListenAndServe(":8080", handler)
    }
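
Hitting /metrics after a few requests returns the Prometheus text exposition format; abridged here, with illustrative values:

    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{method="GET",path="/api/users",status="200"} 42
    # HELP active_connections Number of active connections
    # TYPE active_connections gauge
    active_connections 3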

Custom Business Metrics

 1package main
 2
 3import (
 4    "github.com/prometheus/client_golang/prometheus"
 5    "github.com/prometheus/client_golang/prometheus/promauto"
 6)
 7
 8var (
 9    // Business metrics
10    ordersTotal = promauto.NewCounterVec(
11        prometheus.CounterOpts{
12            Name: "orders_total",
13            Help: "Total number of orders",
14        },
15        []string{"status"}, // pending, completed, failed
16    )
17
18    orderValue = promauto.NewHistogramVec(
19        prometheus.HistogramOpts{
20            Name:    "order_value_dollars",
21            Help:    "Order value distribution",
22            Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
23        },
24        []string{"product_category"},
25    )
26
27    cartAbandonment = promauto.NewCounter(
28        prometheus.CounterOpts{
29            Name: "cart_abandonments_total",
30            Help: "Total number of cart abandonments",
31        },
32    )
33
34    activeUsers = promauto.NewGauge(
35        prometheus.GaugeOpts{
36            Name: "active_users",
37            Help: "Number of currently active users",
38        },
39    )
40
41    checkoutDuration = promauto.NewHistogram(
42        prometheus.HistogramOpts{
43            Name:    "checkout_duration_seconds",
44            Help:    "Time to complete checkout",
45            Buckets: []float64{1, 5, 10, 30, 60, 120},
46        },
47    )
48)
49
50// OrderService with metrics
51type OrderService struct{}
52
 53func (s *OrderService) CreateOrder(category string, value float64) error {
54    // Record order
55    ordersTotal.WithLabelValues("pending").Inc()
56    orderValue.WithLabelValues(category).Observe(value)
57
58    // ... business logic ...
59
60    ordersTotal.WithLabelValues("completed").Inc()
61    return nil
62}
63
64func AbandonCart() {
65    cartAbandonment.Inc()
66}
67
68func UserLogin() {
69    activeUsers.Inc()
70}
71
72func UserLogout() {
73    activeUsers.Dec()
74}

OpenTelemetry Distributed Tracing

If logs are like individual diary entries and metrics are like vital signs, distributed tracing is like having a GPS tracker that follows a package from warehouse to doorstep. It shows you the complete journey of a request as it travels through different services, databases, and caches.

OpenTelemetry is the standard for distributed tracing and observability. It's like having a universal passport for your requests—no matter which service they visit, they carry their identity and timing information with them.

💡 Real-world Example: When you order from Amazon:

  1. Your request hits the API gateway
  2. Inventory service checks stock
  3. Payment service processes payment
  4. Shipping service schedules delivery

Without tracing: "Order is slow" 😕
With tracing: "Payment service is taking 200ms instead of usual 50ms" 🎯

⚠️ Common Pitfall: Don't trace everything! Trace user-initiated operations and critical paths. Tracing every database query can overwhelm your tracing system.
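
A common way to act on this advice is probabilistic head sampling. The sketch below is illustrative (the 10% ratio is an assumption, not a recommendation) and uses the SDK's built-in samplers, which the setup examples later in this section configure in full:

    package main

    import (
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        // Keep roughly 10% of new root traces. ParentBased honors the
        // parent span's decision, so a sampled request stays sampled
        // across every service it touches.
        sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))
        _ = sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
    }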

Basic Tracing Setup

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "log"
  7    "time"
  8
  9    "go.opentelemetry.io/otel"
 10    "go.opentelemetry.io/otel/attribute"
 11    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
 12    "go.opentelemetry.io/otel/sdk/resource"
 13    "go.opentelemetry.io/otel/sdk/trace"
 14    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
 15)
 16
 17var tracer = otel.Tracer("example-service")
 18
 19func initTracer() func() {
 20    // Create stdout exporter
 21    exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
 22    if err != nil {
 23        log.Fatal(err)
 24    }
 25
 26    // Create resource
 27    res := resource.NewWithAttributes(
 28        semconv.SchemaURL,
 29        semconv.ServiceName("my-service"),
 30        semconv.ServiceVersion("1.0.0"),
 31        attribute.String("environment", "production"),
 32    )
 33
 34    // Create trace provider
 35    tp := trace.NewTracerProvider(
 36        trace.WithBatcher(exporter),
 37        trace.WithResource(res),
 38    )
 39
 40    otel.SetTracerProvider(tp)
 41
 42    return func() {
 43        tp.Shutdown(context.Background())
 44    }
 45}
 46
 47// API handler with tracing
 48func handleRequest(ctx context.Context, userID string) error {
 49    // Start span
 50    ctx, span := tracer.Start(ctx, "handle_request")
 51    defer span.End()
 52
 53    // Add attributes
 54    span.SetAttributes(
 55        attribute.String("user_id", userID),
 56        attribute.String("request_type", "get_user"),
 57    )
 58
 59    // Call downstream services
 60    if err := fetchUserFromDB(ctx, userID); err != nil {
 61        span.RecordError(err)
 62        return err
 63    }
 64
 65    if err := fetchUserPreferences(ctx, userID); err != nil {
 66        span.RecordError(err)
 67        return err
 68    }
 69
 70    return nil
 71}
 72
 73func fetchUserFromDB(ctx context.Context, userID string) error {
 74    ctx, span := tracer.Start(ctx, "fetch_user_db")
 75    defer span.End()
 76
 77    span.SetAttributes(
 78        attribute.String("db.system", "postgresql"),
 79        attribute.String("db.operation", "SELECT"),
 80    )
 81
 82    // Simulate DB call
 83    time.Sleep(50 * time.Millisecond)
 84
 85    return nil
 86}
 87
 88func fetchUserPreferences(ctx context.Context, userID string) error {
 89    ctx, span := tracer.Start(ctx, "fetch_preferences")
 90    defer span.End()
 91
 92    span.SetAttributes(
 93        attribute.String("cache.hit", "false"),
 94    )
 95
 96    // Simulate cache miss + DB call
 97    time.Sleep(20 * time.Millisecond)
 98
 99    return nil
100}
101
102func main() {
103    cleanup := initTracer()
104    defer cleanup()
105
106    ctx := context.Background()
107
108    // Handle request
109    if err := handleRequest(ctx, "user-123"); err != nil {
110        fmt.Printf("Error: %v\n", err)
111    }
112
113    // Allow time for export
114    time.Sleep(time.Second)
115}

HTTP Instrumentation

 1package main
 2
 3import (
 4    "context"
 5    "fmt"
 6    "net/http"
 7    "time"
 8
 9    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
10    "go.opentelemetry.io/otel"
11    "go.opentelemetry.io/otel/attribute"
12    "go.opentelemetry.io/otel/exporters/jaeger"
13    "go.opentelemetry.io/otel/sdk/resource"
14    "go.opentelemetry.io/otel/sdk/trace"
15    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
16)
17
18func initJaegerTracer() (*trace.TracerProvider, error) {
19    // Create Jaeger exporter
20    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
21        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
22    ))
23    if err != nil {
24        return nil, err
25    }
26
27    tp := trace.NewTracerProvider(
28        trace.WithBatcher(exp),
29        trace.WithResource(resource.NewWithAttributes(
30            semconv.SchemaURL,
31            semconv.ServiceName("api-service"),
32        )),
33    )
34
35    otel.SetTracerProvider(tp)
36    return tp, nil
37}
38
39func main() {
40    tp, err := initJaegerTracer()
41    if err != nil {
42        panic(err)
43    }
44    defer tp.Shutdown(context.Background())
45
46    // Handler with custom tracing
47    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
48        ctx := r.Context()
49        tracer := otel.Tracer("api-handler")
50
51        // Custom span
52        ctx, span := tracer.Start(ctx, "process_request")
53        defer span.End()
54
55        // Add custom attributes
56        span.SetAttributes(
57            attribute.String("http.route", r.URL.Path),
58            attribute.String("user_agent", r.UserAgent()),
59        )
60
61        // Simulate processing
62        time.Sleep(100 * time.Millisecond)
63
64        w.WriteHeader(http.StatusOK)
65        w.Write([]byte("OK"))
66    })
67
68    // Wrap with OpenTelemetry HTTP instrumentation
69    instrumentedHandler := otelhttp.NewHandler(handler, "http-server")
70
71    fmt.Println("Server running on :8080")
72    fmt.Println("Traces sent to Jaeger at http://localhost:16686")
73    http.ListenAndServe(":8080", instrumentedHandler)
74}

SLIs, SLOs, and Alerting

If observability data is like your car's dashboard, SLIs, SLOs, and SLAs are like the service schedule and warranty. They define what "good performance" means, when you need to take action, and what you've promised to your users.

Service Level Indicators, Service Level Objectives, and Service Level Agreements define reliability targets.

The Hierarchy:

  • SLI: What you measure
  • SLO: Your target
  • SLA: What you promise customers

💡 Real-world Example: Google Search SLOs:

  • SLI: the fraction of search queries served in under 200ms
  • SLO: 95% of queries under 200ms
  • Error budget: 5% of queries may be slower

⚠️ Important: 100% reliability is impossible and expensive. Even Google aims for 99.99% uptime. The "error budget" approach gives you room to innovate while maintaining reliability.
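
The error budget math is simple enough to sanity-check in a few lines of Go (a minimal sketch; the 30-day window is just a conventional choice):

    package main

    import (
        "fmt"
        "time"
    )

    // errorBudget returns the downtime allowed by an availability target
    // over a window, e.g. 0.999 over 30 days allows about 43 minutes.
    func errorBudget(target float64, window time.Duration) time.Duration {
        return time.Duration((1 - target) * float64(window))
    }

    func main() {
        window := 30 * 24 * time.Hour
        for _, slo := range []float64{0.99, 0.999, 0.9999} {
            fmt.Printf("%.2f%% SLO -> %v downtime budget per 30 days\n",
                slo*100, errorBudget(slo, window).Round(time.Minute))
        }
    }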

Defining SLIs and SLOs

  1package main
  2
  3import (
  4    "fmt"
  5    "sync"
  6    "time"
  7
  8    "github.com/prometheus/client_golang/prometheus"
  9    "github.com/prometheus/client_golang/prometheus/promauto"
 10)
 11
 12// SLI metrics
 13var (
 14    requestLatency = promauto.NewHistogramVec(
 15        prometheus.HistogramOpts{
 16            Name:    "sli_request_latency_seconds",
 17            Help:    "Request latency SLI",
 18            Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0}, // SLO thresholds
 19        },
 20        []string{"endpoint"},
 21    )
 22
 23    requestSuccess = promauto.NewCounterVec(
 24        prometheus.CounterOpts{
 25            Name: "sli_requests_total",
 26            Help: "Total requests by success/failure",
 27        },
 28        []string{"endpoint", "success"},
 29    )
 30
 31    availabilityGauge = promauto.NewGaugeVec(
 32        prometheus.GaugeOpts{
 33            Name: "slo_availability_ratio",
 34            Help: "Current availability ratio",
 35        },
 36        []string{"endpoint"},
 37    )
 38)
 39
 40// SLOTracker tracks SLO compliance
 41type SLOTracker struct {
 42    mu sync.RWMutex
 43
 44    // Configuration
 45    latencyTarget      time.Duration // e.g., 200ms
 46    availabilityTarget float64       // e.g., 0.999
 47
 48    // Metrics
 49    totalRequests    int64
 50    successRequests  int64
 51    fastRequests     int64 // Requests under latency target
 52}
 53
 54func NewSLOTracker(latencyTarget time.Duration, availabilityTarget float64) *SLOTracker {
 55    return &SLOTracker{
 56        latencyTarget:      latencyTarget,
 57        availabilityTarget: availabilityTarget,
 58    }
 59}
 60
 61// RecordRequest records a request for SLO tracking
 62func (st *SLOTracker) RecordRequest(duration time.Duration, success bool) {
 63    st.mu.Lock()
 64    defer st.mu.Unlock()
 65
 66    st.totalRequests++
 67
 68    if success {
 69        st.successRequests++
 70    }
 71
 72    if duration <= st.latencyTarget {
 73        st.fastRequests++
 74    }
 75
 76    // Update Prometheus metrics
 77    if success {
 78        requestSuccess.WithLabelValues("api", "true").Inc()
 79    } else {
 80        requestSuccess.WithLabelValues("api", "false").Inc()
 81    }
 82
 83    requestLatency.WithLabelValues("api").Observe(duration.Seconds())
 84
 85    // Calculate and expose availability ratio
 86    availability := float64(st.successRequests) / float64(st.totalRequests)
 87    availabilityGauge.WithLabelValues("api").Set(availability)
 88}
 89
 90// Metrics returns current SLO metrics
 91func (st *SLOTracker) Metrics() map[string]interface{} {
 92    st.mu.RLock()
 93    defer st.mu.RUnlock()
 94
 95    availability := float64(st.successRequests) / float64(st.totalRequests)
 96    latencyCompliance := float64(st.fastRequests) / float64(st.totalRequests)
 97
 98    return map[string]interface{}{
 99        "total_requests":      st.totalRequests,
100        "success_requests":    st.successRequests,
101        "availability":        fmt.Sprintf("%.4f%%", availability*100),
102        "availability_slo":    fmt.Sprintf("%.4f%%", st.availabilityTarget*100),
103        "availability_met":    availability >= st.availabilityTarget,
104        "latency_compliance":  fmt.Sprintf("%.4f%%", latencyCompliance*100),
105        "latency_target":      st.latencyTarget,
106    }
107}
108
109func main() {
110    tracker := NewSLOTracker(200*time.Millisecond, 0.999)
111
112    // Simulate requests
113    for i := 0; i < 1000; i++ {
114        duration := time.Duration(50+i/5) * time.Millisecond
115        success := i%100 != 0 // 99% success rate
116
117        tracker.RecordRequest(duration, success)
118    }
119
120    // Print SLO metrics
121    metrics := tracker.Metrics()
122    fmt.Println("SLO Metrics:")
123    for k, v := range metrics {
124        fmt.Printf("  %s: %v\n", k, v)
125    }
126}

Common Pitfalls and When to Use What

When to Use Logs vs Metrics vs Traces

Scenario                                    Best Tool   Why
"Why did this specific transaction fail?"   Logs        Detailed context for debugging
"Are we meeting our performance targets?"   Metrics     Trends and aggregations over time
"Why is the checkout flow slow?"            Traces      See the entire request journey
"How many users signed up today?"           Metrics     Simple counting and aggregation
"What error happened at 3:15 AM?"           Logs        Point-in-time investigation

Common Pitfalls

❌ Over-logging everything

  • Problem: High storage costs, hard to find relevant information
  • Solution: Log at appropriate levels, use structured logging

❌ Too many metrics

  • Problem: Metric overload, hard to identify what matters
  • Solution: Start with the RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methodologies

❌ Tracing everything

  • Problem: Performance overhead, expensive storage
  • Solution: Trace a sample of requests and focus on user-facing operations

❌ Setting alerts without context

  • Problem: Alert fatigue, teams ignore notifications
  • Solution: Use SLO-based alerting and include runbooks with alerts (see the burn-rate sketch below)
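
To make the last point concrete, SLO-based alerting usually pages on the error budget burn rate rather than on raw error counts. A minimal sketch of the idea (thresholds are illustrative; production setups typically use multi-window burn rates):

    package main

    import "fmt"

    // burnRate compares the observed error ratio against the error
    // budget implied by an SLO target: 1.0 means the budget is being
    // consumed exactly on pace; much higher values warrant a page.
    func burnRate(errorRatio, sloTarget float64) float64 {
        budget := 1 - sloTarget // e.g. 0.001 for a 99.9% SLO
        return errorRatio / budget
    }

    func main() {
        // 0.5% errors against a 99.9% SLO burns the budget 5x too fast.
        fmt.Printf("burn rate: %.1fx\n", burnRate(0.005, 0.999))
    }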

💡 Key Takeaway: Start small and iterate. Begin with basic logging and key metrics, then add tracing as your system grows. The best observability setup is one you actually use and understand.

Further Reading

Tools

  • Prometheus - Metrics collection and alerting
  • Grafana - Visualization and dashboards
  • Jaeger - Distributed tracing
  • ELK Stack - Elasticsearch, Logstash, Kibana for logs

Books

  • Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda
  • Site Reliability Engineering by Google
  • The Art of Monitoring by James Turnbull

Advanced Pattern: OpenTelemetry Integration

OpenTelemetry (OTel) is the industry-standard observability framework that unifies metrics, logs, and traces into a single instrumentation layer. Unlike vendor-specific solutions, OTel provides portable, standardized telemetry that works across any backend.

Why OpenTelemetry Matters:

  • Vendor-neutral: Switch backends (Jaeger, Zipkin, Prometheus) without changing code
  • Unified API: Single SDK for all three pillars of observability
  • Auto-instrumentation: Automatic tracing for popular frameworks
  • Context propagation: Seamless correlation across distributed systems

💡 Real-World Impact: Shopify migrated to OpenTelemetry and reduced instrumentation code by 60% while gaining unified observability across 2,000+ services.

Complete OpenTelemetry Setup

This example demonstrates a production-ready OpenTelemetry setup with all three pillars integrated:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "log"
  7    "net/http"
  8    "os"
  9    "time"
 10
 11    "go.opentelemetry.io/otel"
 12    "go.opentelemetry.io/otel/attribute"
 13    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
 14    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
 15    "go.opentelemetry.io/otel/metric"
 16    "go.opentelemetry.io/otel/propagation"
 17    "go.opentelemetry.io/otel/sdk/metric"
 18    "go.opentelemetry.io/otel/sdk/resource"
 19    "go.opentelemetry.io/otel/sdk/trace"
 20    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
 21    "go.opentelemetry.io/otel/trace"
 22)
 23
 24// OTelProvider manages all OpenTelemetry components
 25type OTelProvider struct {
 26    tracerProvider *sdktrace.TracerProvider
 27    meterProvider  *sdkmetric.MeterProvider
 28    resource       *resource.Resource
 29}
 30
 31// NewOTelProvider initializes a complete OpenTelemetry setup
 32func NewOTelProvider(ctx context.Context, serviceName, serviceVersion string) (*OTelProvider, error) {
 33    // Create resource with service identification
 34    res, err := resource.New(ctx,
 35        resource.WithAttributes(
 36            semconv.ServiceName(serviceName),
 37            semconv.ServiceVersion(serviceVersion),
 38            semconv.DeploymentEnvironment(os.Getenv("ENV")),
 39        ),
 40        resource.WithHost(),        // Add host information
 41        resource.WithProcess(),     // Add process information
 42        resource.WithOS(),          // Add OS information
 43        resource.WithContainer(),   // Add container information if running in one
 44    )
 45    if err != nil {
 46        return nil, fmt.Errorf("failed to create resource: %w", err)
 47    }
 48
 49    // Setup trace exporter
 50    traceExporter, err := otlptracegrpc.New(ctx,
 51        otlptracegrpc.WithEndpoint("localhost:4317"),
 52        otlptracegrpc.WithInsecure(),
 53    )
 54    if err != nil {
 55        return nil, fmt.Errorf("failed to create trace exporter: %w", err)
 56    }
 57
 58    // Create tracer provider with sampling strategy
 59    tracerProvider := sdktrace.NewTracerProvider(
 60        sdktrace.WithBatcher(traceExporter,
 61            sdktrace.WithMaxExportBatchSize(512),
 62            sdktrace.WithBatchTimeout(5*time.Second),
 63        ),
 64        sdktrace.WithResource(res),
 65        // Sample 10% of traces in production, 100% in dev
 66        sdktrace.WithSampler(sdktrace.ParentBased(
 67            sdktrace.TraceIDRatioBased(getSamplingRate()),
 68        )),
 69    )
 70
 71    // Setup metric exporter
 72    metricExporter, err := otlpmetricgrpc.New(ctx,
 73        otlpmetricgrpc.WithEndpoint("localhost:4317"),
 74        otlpmetricgrpc.WithInsecure(),
 75    )
 76    if err != nil {
 77        return nil, fmt.Errorf("failed to create metric exporter: %w", err)
 78    }
 79
 80    // Create meter provider
 81    meterProvider := sdkmetric.NewMeterProvider(
 82        sdkmetric.WithReader(
 83            sdkmetric.NewPeriodicReader(metricExporter,
 84                sdkmetric.WithInterval(30*time.Second),
 85            ),
 86        ),
 87        sdkmetric.WithResource(res),
 88    )
 89
 90    // Set global providers
 91    otel.SetTracerProvider(tracerProvider)
 92    otel.SetMeterProvider(meterProvider)
 93
 94    // Set global propagator for context propagation across services
 95    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
 96        propagation.TraceContext{},
 97        propagation.Baggage{},
 98    ))
 99
100    return &OTelProvider{
101        tracerProvider: tracerProvider,
102        meterProvider:  meterProvider,
103        resource:       res,
104    }, nil
105}
106
107func getSamplingRate() float64 {
108    if os.Getenv("ENV") == "production" {
109        return 0.1 // 10% sampling in production
110    }
111    return 1.0 // 100% sampling in development
112}
113
114// Shutdown gracefully shuts down all providers
115func (p *OTelProvider) Shutdown(ctx context.Context) error {
116    if err := p.tracerProvider.Shutdown(ctx); err != nil {
117        return fmt.Errorf("failed to shutdown tracer provider: %w", err)
118    }
119    if err := p.meterProvider.Shutdown(ctx); err != nil {
120        return fmt.Errorf("failed to shutdown meter provider: %w", err)
121    }
122    return nil
123}
124
125// InstrumentedService demonstrates a fully instrumented service
126type InstrumentedService struct {
127    tracer trace.Tracer
128    meter  metric.Meter
129
130    // Metrics
131    requestCounter  metric.Int64Counter
132    requestDuration metric.Float64Histogram
133    activeRequests  metric.Int64UpDownCounter
134}
135
136// NewInstrumentedService creates a service with complete OTel instrumentation
137func NewInstrumentedService(serviceName string) (*InstrumentedService, error) {
138    tracer := otel.Tracer(serviceName)
139    meter := otel.Meter(serviceName)
140
141    // Create metrics
142    requestCounter, err := meter.Int64Counter(
143        "http.server.requests",
144        metric.WithDescription("Total number of HTTP requests"),
145        metric.WithUnit("{request}"),
146    )
147    if err != nil {
148        return nil, err
149    }
150
151    requestDuration, err := meter.Float64Histogram(
152        "http.server.duration",
153        metric.WithDescription("HTTP request duration"),
154        metric.WithUnit("ms"),
155    )
156    if err != nil {
157        return nil, err
158    }
159
160    activeRequests, err := meter.Int64UpDownCounter(
161        "http.server.active_requests",
162        metric.WithDescription("Number of active HTTP requests"),
163        metric.WithUnit("{request}"),
164    )
165    if err != nil {
166        return nil, err
167    }
168
169    return &InstrumentedService{
170        tracer:          tracer,
171        meter:           meter,
172        requestCounter:  requestCounter,
173        requestDuration: requestDuration,
174        activeRequests:  activeRequests,
175    }, nil
176}
177
178// HandleRequest demonstrates complete request instrumentation
179func (s *InstrumentedService) HandleRequest(w http.ResponseWriter, r *http.Request) {
180    ctx := r.Context()
181    start := time.Now()
182
183    // Start a span for this request
184    ctx, span := s.tracer.Start(ctx, "HandleRequest",
185        trace.WithSpanKind(trace.SpanKindServer),
186        trace.WithAttributes(
187            semconv.HTTPMethod(r.Method),
188            semconv.HTTPRoute(r.URL.Path),
189            semconv.HTTPScheme(r.URL.Scheme),
190            semconv.NetHostName(r.Host),
191        ),
192    )
193    defer span.End()
194
195    // Track active requests
196    s.activeRequests.Add(ctx, 1)
197    defer s.activeRequests.Add(ctx, -1)
198
199    // Simulate processing with child spans
200    if err := s.processRequest(ctx); err != nil {
201        span.RecordError(err)
202        span.SetAttributes(attribute.Bool("error", true))
203        http.Error(w, err.Error(), http.StatusInternalServerError)
204        return
205    }
206
207    // Record metrics
208    duration := time.Since(start).Milliseconds()
209    attrs := []attribute.KeyValue{
210        attribute.String("method", r.Method),
211        attribute.String("route", r.URL.Path),
212        attribute.Int("status", http.StatusOK),
213    }
214
215    s.requestCounter.Add(ctx, 1, metric.WithAttributes(attrs...))
216    s.requestDuration.Record(ctx, float64(duration), metric.WithAttributes(attrs...))
217
218    // Set span attributes for success
219    span.SetAttributes(
220        semconv.HTTPStatusCode(http.StatusOK),
221        attribute.Int64("response.size", 100),
222    )
223
224    w.WriteHeader(http.StatusOK)
225    fmt.Fprintf(w, "Request processed successfully\n")
226}
227
228func (s *InstrumentedService) processRequest(ctx context.Context) error {
229    // Create child span for database operation
230    ctx, dbSpan := s.tracer.Start(ctx, "database.query",
231        trace.WithSpanKind(trace.SpanKindClient),
232        trace.WithAttributes(
233            attribute.String("db.system", "postgresql"),
234            attribute.String("db.operation", "SELECT"),
235            attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
236        ),
237    )
238    defer dbSpan.End()
239
240    // Simulate database query
241    time.Sleep(50 * time.Millisecond)
242
243    // Create child span for cache operation
244    ctx, cacheSpan := s.tracer.Start(ctx, "cache.get",
245        trace.WithSpanKind(trace.SpanKindClient),
246        trace.WithAttributes(
247            attribute.String("cache.system", "redis"),
248            attribute.String("cache.key", "user:123"),
249        ),
250    )
251    defer cacheSpan.End()
252
253    // Simulate cache lookup
254    time.Sleep(10 * time.Millisecond)
255
256    return nil
257}
258
259func main() {
260    ctx := context.Background()
261
262    // Initialize OpenTelemetry
263    provider, err := NewOTelProvider(ctx, "my-service", "1.0.0")
264    if err != nil {
265        log.Fatalf("Failed to initialize OpenTelemetry: %v", err)
266    }
267    defer func() {
268        if err := provider.Shutdown(ctx); err != nil {
269            log.Printf("Error shutting down provider: %v", err)
270        }
271    }()
272
273    // Create instrumented service
274    service, err := NewInstrumentedService("my-service")
275    if err != nil {
276        log.Fatalf("Failed to create service: %v", err)
277    }
278
279    // Setup HTTP server
280    http.HandleFunc("/api/users", service.HandleRequest)
281
282    fmt.Println("Server starting on :8080")
283    fmt.Println("Try: curl http://localhost:8080/api/users")
284    log.Fatal(http.ListenAndServe(":8080", nil))
285}

Key Patterns:

  • Resource attributes: Service identification and environment context
  • Sampling strategies: Balance observability with performance
  • Context propagation: Trace requests across service boundaries
  • Semantic conventions: Standard attribute naming for interoperability
  • Graceful shutdown: Ensure all telemetry is flushed before exit (see the sketch below)
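
On the last point, it also helps to bound shutdown with a timeout so a slow or unreachable collector cannot hang process exit. A minimal sketch (the 5-second budget is an assumption):

    package main

    import (
        "context"
        "log"
        "time"

        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        tp := sdktrace.NewTracerProvider()

        // Give exporters at most five seconds to flush buffered spans;
        // anything not flushed by then is dropped rather than blocking exit.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if err := tp.Shutdown(ctx); err != nil {
            log.Printf("telemetry shutdown: %v", err)
        }
    }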

Context Propagation Across Services

When building distributed systems, propagating trace context across service boundaries is critical for end-to-end visibility:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "io"
  7    "net/http"
  8    "time"
  9
 10    "go.opentelemetry.io/otel"
 11    "go.opentelemetry.io/otel/attribute"
 12    "go.opentelemetry.io/otel/propagation"
 13    "go.opentelemetry.io/otel/trace"
 14)
 15
 16// Service A - Makes HTTP call to Service B
 17type ServiceA struct {
 18    tracer     trace.Tracer
 19    propagator propagation.TextMapPropagator
 20    client     *http.Client
 21}
 22
 23func NewServiceA() *ServiceA {
 24    return &ServiceA{
 25        tracer:     otel.Tracer("service-a"),
 26        propagator: otel.GetTextMapPropagator(),
 27        client:     &http.Client{Timeout: 30 * time.Second},
 28    }
 29}
 30
 31func (s *ServiceA) MakeRequest(ctx context.Context, userID string) error {
 32    // Start span in Service A
 33    ctx, span := s.tracer.Start(ctx, "ServiceA.MakeRequest",
 34        trace.WithAttributes(
 35            attribute.String("user.id", userID),
 36        ),
 37    )
 38    defer span.End()
 39
 40    // Create HTTP request
 41    req, err := http.NewRequestWithContext(ctx, "GET", "http://service-b:8081/process", nil)
 42    if err != nil {
 43        span.RecordError(err)
 44        return err
 45    }
 46
 47    // CRITICAL: Inject trace context into HTTP headers
 48    s.propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
 49
 50    // Add custom headers
 51    req.Header.Set("X-User-ID", userID)
 52
 53    span.AddEvent("Calling Service B")
 54
 55    // Make the call
 56    resp, err := s.client.Do(req)
 57    if err != nil {
 58        span.RecordError(err)
 59        return err
 60    }
 61    defer resp.Body.Close()
 62
 63    body, _ := io.ReadAll(resp.Body)
 64    span.SetAttributes(
 65        attribute.Int("http.status_code", resp.StatusCode),
 66        attribute.Int("response.size", len(body)),
 67    )
 68
 69    return nil
 70}
 71
 72// Service B - Receives request from Service A
 73type ServiceB struct {
 74    tracer     trace.Tracer
 75    propagator propagation.TextMapPropagator
 76}
 77
 78func NewServiceB() *ServiceB {
 79    return &ServiceB{
 80        tracer:     otel.Tracer("service-b"),
 81        propagator: otel.GetTextMapPropagator(),
 82    }
 83}
 84
 85func (s *ServiceB) HandleRequest(w http.ResponseWriter, r *http.Request) {
 86    // CRITICAL: Extract trace context from HTTP headers
 87    ctx := s.propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
 88
 89    // Start span in Service B - this will be a child of Service A's span
 90    ctx, span := s.tracer.Start(ctx, "ServiceB.HandleRequest",
 91        trace.WithSpanKind(trace.SpanKindServer),
 92    )
 93    defer span.End()
 94
 95    userID := r.Header.Get("X-User-ID")
 96    span.SetAttributes(attribute.String("user.id", userID))
 97
 98    // Process the request
 99    if err := s.processData(ctx, userID); err != nil {
100        span.RecordError(err)
101        http.Error(w, err.Error(), http.StatusInternalServerError)
102        return
103    }
104
105    w.WriteHeader(http.StatusOK)
106    fmt.Fprintf(w, "Processed by Service B")
107}
108
109func (s *ServiceB) processData(ctx context.Context, userID string) error {
110    // This span will be a child of HandleRequest
111    _, span := s.tracer.Start(ctx, "ServiceB.processData")
112    defer span.End()
113
114    // Simulate processing
115    time.Sleep(100 * time.Millisecond)
116    return nil
117}
118
119func main() {
120    // In a real application, you would run Service A and Service B separately
121    // This demonstrates the trace context propagation pattern
122    fmt.Println("OpenTelemetry context propagation example")
123    fmt.Println("In production, Service A and B would be separate processes")
124}

Context Propagation Best Practices:

  • Always use propagation.TextMapPropagator for HTTP/gRPC
  • Inject context on client side before making calls
  • Extract context on server side as first operation
  • Use trace.SpanKind to distinguish client vs server spans
  • Include relevant attributes at each service boundary

Advanced Pattern: Custom Metrics and Instrumentation

While standard metrics like request count and duration are essential, production systems need custom business metrics and advanced instrumentation patterns to gain deep operational insights.

Business Metrics and Custom Collectors

Business metrics track domain-specific KPIs that matter to your organization:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "sync"
  7    "time"
  8
  9    "github.com/prometheus/client_golang/prometheus"
 10    "github.com/prometheus/client_golang/prometheus/promhttp"
 11    "net/http"
 12)
 13
 14// BusinessMetrics tracks domain-specific KPIs
 15type BusinessMetrics struct {
 16    // Revenue metrics
 17    revenueTotal prometheus.Counter
 18    revenueByPlan *prometheus.CounterVec
 19
 20    // User engagement
 21    activeUsers prometheus.Gauge
 22    sessionDuration prometheus.Histogram
 23    featureUsage *prometheus.CounterVec
 24
 25    // Business operations
 26    checkoutsStarted prometheus.Counter
 27    checkoutsCompleted prometheus.Counter
 28    checkoutsFailed *prometheus.CounterVec
 29    cartValue prometheus.Histogram
 30
 31    // SaaS-specific
 32    trialConversions prometheus.Counter
 33    churnEvents prometheus.Counter
 34    lifetimeValue prometheus.Histogram
 35}
 36
 37// NewBusinessMetrics initializes business-level metrics
 38func NewBusinessMetrics(reg prometheus.Registerer) *BusinessMetrics {
 39    bm := &BusinessMetrics{
 40        revenueTotal: prometheus.NewCounter(prometheus.CounterOpts{
 41            Name: "business_revenue_total_dollars",
 42            Help: "Total revenue in dollars",
 43        }),
 44
 45        revenueByPlan: prometheus.NewCounterVec(
 46            prometheus.CounterOpts{
 47                Name: "business_revenue_by_plan_dollars",
 48                Help: "Revenue by subscription plan",
 49            },
 50            []string{"plan_name", "billing_cycle"},
 51        ),
 52
 53        activeUsers: prometheus.NewGauge(prometheus.GaugeOpts{
 54            Name: "business_active_users",
 55            Help: "Current number of active users",
 56        }),
 57
 58        sessionDuration: prometheus.NewHistogram(prometheus.HistogramOpts{
 59            Name: "business_session_duration_seconds",
 60            Help: "User session duration",
 61            Buckets: []float64{60, 300, 900, 1800, 3600, 7200}, // 1m, 5m, 15m, 30m, 1h, 2h
 62        }),
 63
 64        featureUsage: prometheus.NewCounterVec(
 65            prometheus.CounterOpts{
 66                Name: "business_feature_usage_total",
 67                Help: "Usage count by feature",
 68            },
 69            []string{"feature_name", "user_plan"},
 70        ),
 71
 72        checkoutsStarted: prometheus.NewCounter(prometheus.CounterOpts{
 73            Name: "business_checkouts_started_total",
 74            Help: "Number of checkout processes started",
 75        }),
 76
 77        checkoutsCompleted: prometheus.NewCounter(prometheus.CounterOpts{
 78            Name: "business_checkouts_completed_total",
 79            Help: "Number of successful checkouts",
 80        }),
 81
 82        checkoutsFailed: prometheus.NewCounterVec(
 83            prometheus.CounterOpts{
 84                Name: "business_checkouts_failed_total",
 85                Help: "Failed checkouts by reason",
 86            },
 87            []string{"failure_reason"},
 88        ),
 89
 90        cartValue: prometheus.NewHistogram(prometheus.HistogramOpts{
 91            Name: "business_cart_value_dollars",
 92            Help: "Shopping cart value distribution",
 93            Buckets: prometheus.ExponentialBuckets(10, 2, 10), // 10, 20, 40, ..., 5120
 94        }),
 95
 96        trialConversions: prometheus.NewCounter(prometheus.CounterOpts{
 97            Name: "business_trial_conversions_total",
 98            Help: "Number of trial users who converted to paid",
 99        }),
100
101        churnEvents: prometheus.NewCounter(prometheus.CounterOpts{
102            Name: "business_churn_events_total",
103            Help: "Number of user churn events",
104        }),
105
106        lifetimeValue: prometheus.NewHistogram(prometheus.HistogramOpts{
107            Name: "business_customer_lifetime_value_dollars",
108            Help: "Customer lifetime value",
109            Buckets: []float64{100, 500, 1000, 5000, 10000, 50000},
110        }),
111    }
112
113    // Register all metrics
114    reg.MustRegister(
115        bm.revenueTotal,
116        bm.revenueByPlan,
117        bm.activeUsers,
118        bm.sessionDuration,
119        bm.featureUsage,
120        bm.checkoutsStarted,
121        bm.checkoutsCompleted,
122        bm.checkoutsFailed,
123        bm.cartValue,
124        bm.trialConversions,
125        bm.churnEvents,
126        bm.lifetimeValue,
127    )
128
129    return bm
130}
131
132// TrackRevenue records a revenue event
133func (bm *BusinessMetrics) TrackRevenue(amount float64, plan, billingCycle string) {
134    bm.revenueTotal.Add(amount)
135    bm.revenueByPlan.WithLabelValues(plan, billingCycle).Add(amount)
136}
137
138// TrackCheckout records checkout funnel progression
139func (bm *BusinessMetrics) TrackCheckout(stage string, cartValue float64, failureReason string) {
140    switch stage {
141    case "started":
142        bm.checkoutsStarted.Inc()
143        bm.cartValue.Observe(cartValue)
144    case "completed":
145        bm.checkoutsCompleted.Inc()
146    case "failed":
147        bm.checkoutsFailed.WithLabelValues(failureReason).Inc()
148    }
149}
150
151// TrackFeatureUsage records feature usage by plan
152func (bm *BusinessMetrics) TrackFeatureUsage(feature, userPlan string) {
153    bm.featureUsage.WithLabelValues(feature, userPlan).Inc()
154}
155
156// Custom Collector for complex metrics
157type DatabaseStatsCollector struct {
158    db *DatabasePool // Your database connection pool
159
160    // Descriptors
161    connTotal *prometheus.Desc
162    connIdle  *prometheus.Desc
163    connInUse *prometheus.Desc
164    queryDuration *prometheus.Desc
165}
166
167type DatabasePool struct {
168    mu sync.RWMutex
169    totalConns int
170    idleConns  int
171    inUseConns int
172    queryStats map[string]time.Duration
173}
174
175// NewDatabaseStatsCollector creates a custom collector
176func NewDatabaseStatsCollector(db *DatabasePool) *DatabaseStatsCollector {
177    return &DatabaseStatsCollector{
178        db: db,
179        connTotal: prometheus.NewDesc(
180            "db_connections_total",
181            "Total number of database connections",
182            nil, nil,
183        ),
184        connIdle: prometheus.NewDesc(
185            "db_connections_idle",
186            "Number of idle connections",
187            nil, nil,
188        ),
189        connInUse: prometheus.NewDesc(
190            "db_connections_in_use",
191            "Number of connections in use",
192            nil, nil,
193        ),
194        queryDuration: prometheus.NewDesc(
195            "db_query_duration_seconds",
196            "Query execution time by query type",
197            []string{"query_type"}, nil,
198        ),
199    }
200}
201
202// Describe implements prometheus.Collector
203func (c *DatabaseStatsCollector) Describe(ch chan<- *prometheus.Desc) {
204    ch <- c.connTotal
205    ch <- c.connIdle
206    ch <- c.connInUse
207    ch <- c.queryDuration
208}
209
210// Collect implements prometheus.Collector
211func (c *DatabaseStatsCollector) Collect(ch chan<- prometheus.Metric) {
212    c.db.mu.RLock()
213    defer c.db.mu.RUnlock()
214
215    // Collect connection pool metrics
216    ch <- prometheus.MustNewConstMetric(
217        c.connTotal,
218        prometheus.GaugeValue,
219        float64(c.db.totalConns),
220    )
221
222    ch <- prometheus.MustNewConstMetric(
223        c.connIdle,
224        prometheus.GaugeValue,
225        float64(c.db.idleConns),
226    )
227
228    ch <- prometheus.MustNewConstMetric(
229        c.connInUse,
230        prometheus.GaugeValue,
231        float64(c.db.inUseConns),
232    )
233
234    // Collect query duration metrics
235    for queryType, duration := range c.db.queryStats {
236        ch <- prometheus.MustNewConstMetric(
237            c.queryDuration,
238            prometheus.GaugeValue,
239            duration.Seconds(),
240            queryType,
241        )
242    }
243}
244
245func main() {
246    // Create a custom registry
247    reg := prometheus.NewRegistry()
248
249    // Initialize business metrics
250    businessMetrics := NewBusinessMetrics(reg)
251
252    // Initialize database pool and custom collector
253    dbPool := &DatabasePool{
254        totalConns: 100,
255        idleConns:  80,
256        inUseConns: 20,
257        queryStats: map[string]time.Duration{
258            "SELECT": 50 * time.Millisecond,
259            "INSERT": 100 * time.Millisecond,
260            "UPDATE": 75 * time.Millisecond,
261        },
262    }
263    reg.MustRegister(NewDatabaseStatsCollector(dbPool))
264
265    // Simulate business events
266    go func() {
267        for {
268            // Track revenue
269            businessMetrics.TrackRevenue(99.99, "premium", "monthly")
270            businessMetrics.TrackRevenue(999.99, "enterprise", "yearly")
271
272            // Track checkout funnel
273            businessMetrics.TrackCheckout("started", 149.99, "")
274            businessMetrics.TrackCheckout("completed", 149.99, "")
275            businessMetrics.TrackCheckout("failed", 299.99, "payment_declined")
276
277            // Track feature usage
278            businessMetrics.TrackFeatureUsage("export_pdf", "premium")
279            businessMetrics.TrackFeatureUsage("api_access", "enterprise")
280
281            // Update active users
282            businessMetrics.activeUsers.Set(1250)
283
284            time.Sleep(5 * time.Second)
285        }
286    }()
287
288    // Expose metrics endpoint
289    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
290        Registry: reg,
291    }))
292
293    fmt.Println("Business metrics available at http://localhost:9090/metrics")
294    http.ListenAndServe(":9090", nil)
295}

Custom Metrics Best Practices:

  • Business alignment: Track metrics that matter to stakeholders
  • Cardinality control: Avoid high-cardinality labels (user IDs, timestamps); see the sketch after this list
  • Custom collectors: Use for complex metrics computed on-demand
  • Naming conventions: Follow Prometheus naming guidelines
  • Documentation: Add detailed help text to all metrics
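
For cardinality control in particular, a simple allowlist keeps label values bounded even when inputs are not. A minimal sketch (route names are hypothetical):

    package main

    import "fmt"

    // knownRoutes is the fixed set of label values we accept; anything
    // else collapses to "other" so label cardinality stays bounded.
    var knownRoutes = map[string]bool{
        "/api/users":    true,
        "/api/products": true,
        "/checkout":     true,
    }

    func boundedRoute(path string) string {
        if knownRoutes[path] {
            return path
        }
        return "other"
    }

    func main() {
        fmt.Println(boundedRoute("/api/users"))       // /api/users
        fmt.Println(boundedRoute("/api/users/12345")) // other
    }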

Advanced Instrumentation Patterns

Production systems need sophisticated instrumentation beyond basic counters:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "sync"
  7    "time"
  8
  9    "github.com/prometheus/client_golang/prometheus"
 10    "github.com/prometheus/client_golang/prometheus/promauto"
 11)
 12
 13// ExemplarInstrumentation demonstrates exemplar support for linking metrics to traces
 14type ExemplarInstrumentation struct {
 15    requestDuration *prometheus.HistogramVec
 16}
 17
 18func NewExemplarInstrumentation() *ExemplarInstrumentation {
 19    return &ExemplarInstrumentation{
 20        requestDuration: promauto.NewHistogramVec(
 21            prometheus.HistogramOpts{
 22                Name: "http_request_duration_seconds",
 23                Help: "Request duration with trace exemplars",
 24                Buckets: prometheus.DefBuckets,
 25                // Enable native histograms for better precision
 26                NativeHistogramBucketFactor: 1.1,
 27            },
 28            []string{"method", "endpoint"},
 29        ),
 30    }
 31}
 32
 33// RecordRequest records a request with trace exemplar
 34func (ei *ExemplarInstrumentation) RecordRequest(method, endpoint string, duration float64, traceID string) {
 35    // Record observation with exemplar linking to trace
 36    observer := ei.requestDuration.WithLabelValues(method, endpoint)
 37    observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
 38        duration,
 39        prometheus.Labels{"trace_id": traceID},
 40    )
 41}
 42
 43// RateLimitMetrics tracks rate limiting with sliding windows
 44type RateLimitMetrics struct {
 45    mu sync.RWMutex
 46
 47    allowed  *prometheus.CounterVec
 48    rejected *prometheus.CounterVec
 49    current  *prometheus.GaugeVec
 50    limit    *prometheus.GaugeVec
 51
 52    // Sliding window state
 53    windows map[string]*slidingWindow
 54}
 55
 56type slidingWindow struct {
 57    requests []time.Time
 58    limit    int
 59    window   time.Duration
 60}
 61
 62func NewRateLimitMetrics() *RateLimitMetrics {
 63    return &RateLimitMetrics{
 64        allowed: promauto.NewCounterVec(
 65            prometheus.CounterOpts{
 66                Name: "ratelimit_requests_allowed_total",
 67                Help: "Requests allowed through rate limiter",
 68            },
 69            []string{"client_id", "endpoint"},
 70        ),
 71        rejected: promauto.NewCounterVec(
 72            prometheus.CounterOpts{
 73                Name: "ratelimit_requests_rejected_total",
 74                Help: "Requests rejected by rate limiter",
 75            },
 76            []string{"client_id", "endpoint", "reason"},
 77        ),
 78        current: promauto.NewGaugeVec(
 79            prometheus.GaugeOpts{
 80                Name: "ratelimit_current_usage",
 81                Help: "Current rate limit usage",
 82            },
 83            []string{"client_id", "endpoint"},
 84        ),
 85        limit: promauto.NewGaugeVec(
 86            prometheus.GaugeOpts{
 87                Name: "ratelimit_limit",
 88                Help: "Rate limit threshold",
 89            },
 90            []string{"client_id", "endpoint"},
 91        ),
 92        windows: make(map[string]*slidingWindow),
 93    }
 94}
 95
 96func (rlm *RateLimitMetrics) CheckAndRecord(clientID, endpoint string, limit int, window time.Duration) bool {
 97    rlm.mu.Lock()
 98    defer rlm.mu.Unlock()
 99
100    key := fmt.Sprintf("%s:%s", clientID, endpoint)
101
102    // Initialize window if not exists
103    if _, exists := rlm.windows[key]; !exists {
104        rlm.windows[key] = &slidingWindow{
105            requests: make([]time.Time, 0),
106            limit:    limit,
107            window:   window,
108        }
109        rlm.limit.WithLabelValues(clientID, endpoint).Set(float64(limit))
110    }
111
112    sw := rlm.windows[key]
113    now := time.Now()
114
115    // Remove old requests outside window
116    cutoff := now.Add(-window)
117    validRequests := make([]time.Time, 0)
118    for _, t := range sw.requests {
119        if t.After(cutoff) {
120            validRequests = append(validRequests, t)
121        }
122    }
123    sw.requests = validRequests
124
125    // Check if under limit
126    currentUsage := len(sw.requests)
127    rlm.current.WithLabelValues(clientID, endpoint).Set(float64(currentUsage))
128
129    if currentUsage >= limit {
130        rlm.rejected.WithLabelValues(clientID, endpoint, "rate_exceeded").Inc()
131        return false
132    }
133
134    // Allow request
135    sw.requests = append(sw.requests, now)
136    rlm.allowed.WithLabelValues(clientID, endpoint).Inc()
137    return true
138}
139
140// CacheMetrics demonstrates advanced cache instrumentation
141type CacheMetrics struct {
142    hits        *prometheus.CounterVec
143    misses      *prometheus.CounterVec
144    evictions   *prometheus.CounterVec
145    size        prometheus.Gauge
146    capacity    prometheus.Gauge
147    hitRate     *prometheus.GaugeVec
148    avgLoadTime prometheus.Histogram
149
150    // State for hit rate calculation
151    mu          sync.RWMutex
152    hitCount    map[string]int64
153    totalCount  map[string]int64
154}
155
156func NewCacheMetrics() *CacheMetrics {
157    cm := &CacheMetrics{
158        hits: promauto.NewCounterVec(
159            prometheus.CounterOpts{
160                Name: "cache_hits_total",
161                Help: "Cache hits by operation",
162            },
163            []string{"cache_name", "operation"},
164        ),
165        misses: promauto.NewCounterVec(
166            prometheus.CounterOpts{
167                Name: "cache_misses_total",
168                Help: "Cache misses by operation",
169            },
170            []string{"cache_name", "operation"},
171        ),
172        evictions: promauto.NewCounterVec(
173            prometheus.CounterOpts{
174                Name: "cache_evictions_total",
175                Help: "Cache evictions by reason",
176            },
177            []string{"cache_name", "reason"},
178        ),
179        size: promauto.NewGauge(
180            prometheus.GaugeOpts{
181                Name: "cache_size_bytes",
182                Help: "Current cache size in bytes",
183            },
184        ),
185        capacity: promauto.NewGauge(
186            prometheus.GaugeOpts{
187                Name: "cache_capacity_bytes",
188                Help: "Maximum cache capacity",
189            },
190        ),
191        hitRate: promauto.NewGaugeVec(
192            prometheus.GaugeOpts{
193                Name: "cache_hit_rate",
194                Help: "Cache hit rate (0-1)",
195            },
196            []string{"cache_name"},
197        ),
198        avgLoadTime: promauto.NewHistogram(
199            prometheus.HistogramOpts{
200                Name: "cache_load_duration_seconds",
201                Help: "Time to load cache misses",
202                Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
203            },
204        ),
205        hitCount:   make(map[string]int64),
206        totalCount: make(map[string]int64),
207    }
208
209    // Start background goroutine to compute hit rates
210    go cm.computeHitRates()
211
212    return cm
213}
214
215func (cm *CacheMetrics) RecordLookup(cacheName, operation string, hit bool, loadTime time.Duration) {
216    cm.mu.Lock()
217    cm.totalCount[cacheName]++
218    if hit {
219        cm.hitCount[cacheName]++
220        cm.hits.WithLabelValues(cacheName, operation).Inc()
221    } else {
222        cm.misses.WithLabelValues(cacheName, operation).Inc()
223        cm.avgLoadTime.Observe(loadTime.Seconds())
224    }
225    cm.mu.Unlock()
226}
227
228func (cm *CacheMetrics) computeHitRates() {
229    ticker := time.NewTicker(10 * time.Second)
230    defer ticker.Stop()
231
232    for range ticker.C {
233        cm.mu.RLock()
234        for cacheName, total := range cm.totalCount {
235            if total > 0 {
236                hitRate := float64(cm.hitCount[cacheName]) / float64(total)
237                cm.hitRate.WithLabelValues(cacheName).Set(hitRate)
238            }
239        }
240        cm.mu.RUnlock()
241    }
242}
243
244func main() {
245    // Initialize instrumentation
246    exemplarInst := NewExemplarInstrumentation()
247    rateLimitMetrics := NewRateLimitMetrics()
248    cacheMetrics := NewCacheMetrics()
249
250    // Simulate exemplar recording
251    exemplarInst.RecordRequest("GET", "/api/users", 0.125, "trace-id-123")
252
253    // Simulate rate limiting
254    allowed := rateLimitMetrics.CheckAndRecord("client-1", "/api", 100, time.Minute)
255    fmt.Printf("Request allowed: %v\n", allowed)
256
257    // Simulate cache operations
258    cacheMetrics.RecordLookup("user-cache", "get", true, 0)
259    cacheMetrics.RecordLookup("user-cache", "get", false, 50*time.Millisecond)
260    cacheMetrics.size.Set(1024 * 1024 * 50) // 50MB
261    cacheMetrics.capacity.Set(1024 * 1024 * 100) // 100MB
262
263    // Serve the default registry so the metrics above are scrapeable
264    http.Handle("/metrics", promhttp.Handler())
265    fmt.Println("Advanced instrumentation patterns initialized")
266    fmt.Println("Metrics available at http://localhost:2112/metrics")
267    http.ListenAndServe(":2112", nil)
268}

Advanced Pattern: Distributed Tracing Best Practices

Distributed tracing is powerful but can introduce performance overhead and storage costs. These patterns help you maximize value while minimizing impact.

Intelligent Sampling Strategies

Not all traces are equally valuable. Intelligent sampling captures the traces you need while reducing overhead:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "math/rand"
  7    "strings"
  8    "time"
  9
 10    "go.opentelemetry.io/otel/attribute"
 11    "go.opentelemetry.io/otel/sdk/trace"
 12    "go.opentelemetry.io/otel/trace"
 13)
 14
 15// AdaptiveSampler adjusts sampling rate based on system load
 16type AdaptiveSampler struct {
 17    baseSampleRate    float64
 18    errorSampleRate   float64
 19    slowSampleRate    float64
 20    slowThreshold     time.Duration
 21    loadThreshold     float64
 22
 23    currentLoad       float64
 24}
 25
 26func NewAdaptiveSampler(baseRate, errorRate, slowRate float64, slowThreshold time.Duration) *AdaptiveSampler {
 27    return &AdaptiveSampler{
 28        baseSampleRate:  baseRate,
 29        errorSampleRate: errorRate,
 30        slowSampleRate:  slowRate,
 31        slowThreshold:   slowThreshold,
 32        loadThreshold:   0.8, // 80% load
 33    }
 34}
 35
 36// ShouldSample implements custom sampling logic
 37func (as *AdaptiveSampler) ShouldSample(params trace.SamplingParameters) trace.SamplingResult {
 38    // Always sample errors
 39    if hasError(params.Attributes) {
 40        return trace.SamplingResult{
 41            Decision:   trace.RecordAndSample,
 42            Attributes: []attribute.KeyValue{attribute.Bool("sampled.error", true)},
 43        }
 44    }
 45
 46    // Always sample slow requests
 47    if duration := getDuration(params.Attributes); duration > as.slowThreshold {
 48        return trace.SamplingResult{
 49            Decision:   trace.RecordAndSample,
 50            Attributes: []attribute.KeyValue{attribute.Bool("sampled.slow", true)},
 51        }
 52    }
 53
 54    // Sample high-value endpoints more frequently
 55    if isHighValueEndpoint(params.Name) {
 56        if rand.Float64() < as.slowSampleRate {
 57            return trace.SamplingResult{
 58                Decision:   trace.RecordAndSample,
 59                Attributes: []attribute.KeyValue{attribute.Bool("sampled.high_value", true)},
 60            }
 61        }
 62    }
 63
 64    // Reduce sampling under high load
 65    effectiveRate := as.baseSampleRate
 66    if as.currentLoad > as.loadThreshold {
 67        effectiveRate = as.baseSampleRate * 0.5 // Cut sampling in half
 68    }
 69
 70    if rand.Float64() < effectiveRate {
 71        return trace.SamplingResult{
 72            Decision: trace.RecordAndSample,
 73        }
 74    }
 75
 76    return trace.SamplingResult{
 77        Decision: trace.Drop,
 78    }
 79}
 80
 81func (as *AdaptiveSampler) Description() string {
 82    return "AdaptiveSampler"
 83}
 84
 85func hasError(attrs []attribute.KeyValue) bool {
 86    for _, attr := range attrs {
 87        if attr.Key == "error" && attr.Value.AsBool() {
 88            return true
 89        }
 90    }
 91    return false
 92}
 93
 94func getDuration(attrs []attribute.KeyValue) time.Duration {
 95    for _, attr := range attrs {
 96        if attr.Key == "duration" {
 97            return time.Duration(attr.Value.AsInt64())
 98        }
 99    }
100    return 0
101}
102
103func isHighValueEndpoint(name string) bool {
104    highValue := []string{"/checkout", "/payment", "/api/orders"}
105    for _, endpoint := range highValue {
106        if strings.Contains(name, endpoint) {
107            return true
108        }
109    }
110    return false
111}
112
113// PrioritySampler samples based on business priority
114type PrioritySampler struct {
115    strategies map[string]float64 // endpoint -> sample rate
116    defaultRate float64
117}
118
119func NewPrioritySampler() *PrioritySampler {
120    return &PrioritySampler{
121        strategies: map[string]float64{
122            // Critical business paths - 100% sampling
123            "/checkout":       1.0,
124            "/payment":        1.0,
125            "/api/orders":     1.0,
126
127            // Important but high volume - 50% sampling
128            "/api/users":      0.5,
129            "/api/products":   0.5,
130
131            // Background jobs - 10% sampling
132            "/jobs/cleanup":   0.1,
133            "/jobs/reports":   0.1,
134
135            // Health checks - 1% sampling
136            "/health":         0.01,
137            "/ready":          0.01,
138        },
139        defaultRate: 0.1, // 10% for unknown endpoints
140    }
141}
142
143func (ps *PrioritySampler) ShouldSample(params trace.SamplingParameters) trace.SamplingResult {
144    rate := ps.defaultRate
145
146    // Find matching strategy
147    for endpoint, r := range ps.strategies {
148        if strings.Contains(params.Name, endpoint) {
149            rate = r
150            break
151        }
152    }
153
154    if rand.Float64() < rate {
155        return trace.SamplingResult{
156            Decision: trace.RecordAndSample,
157            Attributes: []attribute.KeyValue{
158                attribute.Float64("sample.rate", rate),
159            },
160        }
161    }
162
163    return trace.SamplingResult{
164        Decision: trace.Drop,
165    }
166}
167
168func (ps *PrioritySampler) Description() string {
169    return "PrioritySampler"
170}
171
172// Example: Using custom samplers
173func main() {
174    // Create adaptive sampler
175    adaptiveSampler := NewAdaptiveSampler(
176        0.1,              // 10% base rate
177        1.0,              // 100% error sampling
178        0.5,              // 50% slow request sampling
179        500*time.Millisecond, // 500ms slow threshold
180    )
181
182    // Create priority sampler
183    prioritySampler := NewPrioritySampler()
184
185    fmt.Printf("Adaptive sampler: %s\n", adaptiveSampler.Description())
186    fmt.Printf("Priority sampler: %s\n", prioritySampler.Description())
187
188    // In production, you would use these with:
189    // trace.WithSampler(adaptiveSampler)
190}

Trace Correlation and Root Cause Analysis

When incidents occur, quickly correlating traces with logs and metrics is critical:

  1package main
  2
  3import (
  4    "context"
  5    "fmt"
  6    "log/slog"
  7    "time"
  8
  9    "go.opentelemetry.io/otel"
 10    "go.opentelemetry.io/otel/attribute"
 11    "go.opentelemetry.io/otel/codes"
 12    "go.opentelemetry.io/otel/trace"
 13)
 14
 15// CorrelatedObservability ties together traces, logs, and metrics
 16type CorrelatedObservability struct {
 17    logger *slog.Logger
 18    tracer trace.Tracer
 19}
 20
 21func NewCorrelatedObservability() *CorrelatedObservability {
 22    return &CorrelatedObservability{
 23        logger: slog.Default(),
 24        tracer: otel.Tracer("correlation-example"),
 25    }
 26}
 27
 28// ProcessRequest demonstrates full correlation
 29func (co *CorrelatedObservability) ProcessRequest(ctx context.Context, requestID string) error {
 30    // Start trace
 31    ctx, span := co.tracer.Start(ctx, "ProcessRequest",
 32        trace.WithAttributes(
 33            attribute.String("request.id", requestID),
 34        ),
 35    )
 36    defer span.End()
 37
 38    // Extract trace context for correlation
 39    traceID := span.SpanContext().TraceID().String()
 40    spanID := span.SpanContext().SpanID().String()
 41
 42    // Create logger with trace context
 43    logger := co.logger.With(
 44        slog.String("trace_id", traceID),
 45        slog.String("span_id", spanID),
 46        slog.String("request_id", requestID),
 47    )
 48
 49    logger.Info("Processing request started")
 50
 51    // Simulate processing steps
 52    if err := co.validateInput(ctx, logger, requestID); err != nil {
 53        span.RecordError(err)
 54        span.SetStatus(codes.Error, "Validation failed")
 55        logger.Error("Validation failed",
 56            slog.String("error", err.Error()),
 57            slog.String("stage", "validation"),
 58        )
 59        return err
 60    }
 61
 62    if err := co.processData(ctx, logger, requestID); err != nil {
 63        span.RecordError(err)
 64        span.SetStatus(codes.Error, "Processing failed")
 65        logger.Error("Processing failed",
 66            slog.String("error", err.Error()),
 67            slog.String("stage", "processing"),
 68        )
 69        return err
 70    }
 71
 72    logger.Info("Processing request completed")
 73    span.SetStatus(codes.Ok, "Success")
 74    return nil
 75}
 76
 77func (co *CorrelatedObservability) validateInput(ctx context.Context, logger *slog.Logger, requestID string) error {
 78    ctx, span := co.tracer.Start(ctx, "ValidateInput")
 79    defer span.End()
 80
 81    logger.Debug("Validating input")
 82    time.Sleep(10 * time.Millisecond)
 83
 84    // Add business context to span
 85    span.SetAttributes(
 86        attribute.String("validation.type", "schema"),
 87        attribute.Int("validation.fields", 5),
 88    )
 89
 90    return nil
 91}
 92
 93func (co *CorrelatedObservability) processData(ctx context.Context, logger *slog.Logger, requestID string) error {
 94    ctx, span := co.tracer.Start(ctx, "ProcessData")
 95    defer span.End()
 96
 97    // Add span events for important milestones
 98    span.AddEvent("Starting database query")
 99    logger.Debug("Querying database")
100    time.Sleep(50 * time.Millisecond)
101
102    span.AddEvent("Database query complete",
103        trace.WithAttributes(
104            attribute.Int("rows.returned", 10),
105        ),
106    )
107
108    span.AddEvent("Starting cache update")
109    logger.Debug("Updating cache")
110    time.Sleep(20 * time.Millisecond)
111
112    span.AddEvent("Cache update complete")
113
114    return nil
115}
116
117// AnomalyDetector identifies unusual trace patterns
118type AnomalyDetector struct {
119    baselineLatency map[string]time.Duration
120    baselineErrors  map[string]float64
121}
122
123func NewAnomalyDetector() *AnomalyDetector {
124    return &AnomalyDetector{
125        baselineLatency: map[string]time.Duration{
126            "/api/users":    100 * time.Millisecond,
127            "/api/orders":   200 * time.Millisecond,
128            "/checkout":     500 * time.Millisecond,
129        },
130        baselineErrors: map[string]float64{
131            "/api/users":    0.01,  // 1% error rate
132            "/api/orders":   0.005, // 0.5% error rate
133            "/checkout":     0.02,  // 2% error rate
134        },
135    }
136}
137
138// AnalyzeTrace detects anomalies in trace data
139func (ad *AnomalyDetector) AnalyzeTrace(operation string, duration time.Duration, errorOccurred bool) []string {
140    anomalies := make([]string, 0)
141
142    // Check latency anomaly
143    if baseline, exists := ad.baselineLatency[operation]; exists {
144        // Flag if 3x slower than baseline
145        if duration > baseline*3 {
146            anomalies = append(anomalies, fmt.Sprintf(
147                "LATENCY_ANOMALY: %s took %v (baseline: %v, ratio: %.2fx)",
148                operation, duration, baseline, float64(duration)/float64(baseline),
149            ))
150        }
151    }
152
153    // Check error rate anomaly
154    if errorOccurred {
155        if baseline, exists := ad.baselineErrors[operation]; exists {
156            // This would typically track error rate over time
157            // Simplified for example
158            anomalies = append(anomalies, fmt.Sprintf(
159                "ERROR_ANOMALY: %s failed (baseline error rate: %.2f%%)",
160                operation, baseline*100,
161            ))
162        }
163    }
164
165    return anomalies
166}
167
168func main() {
169    ctx := context.Background()
170
171    // Initialize correlated observability
172    co := NewCorrelatedObservability()
173
174    // Process request with full correlation
175    if err := co.ProcessRequest(ctx, "req-12345"); err != nil {
176        fmt.Printf("Request failed: %v\n", err)
177    }
178
179    // Demonstrate anomaly detection
180    detector := NewAnomalyDetector()
181    anomalies := detector.AnalyzeTrace("/api/users", 350*time.Millisecond, false)
182    if len(anomalies) > 0 {
183        fmt.Println("Anomalies detected:")
184        for _, a := range anomalies {
185            fmt.Printf("  - %s\n", a)
186        }
187    }
188
189    fmt.Println("Distributed tracing best practices demonstrated")
190}

Trace Optimization Best Practices:

  • Intelligent sampling: Don't trace everything—be selective
  • Span attributes: Add rich context but avoid PII
  • Span events: Mark important milestones within spans
  • Error context: Always include error details in span status
  • Correlation IDs: Link traces across system boundaries
  • Tail-based sampling: Keep all traces for errors, sample successes (see the sketch after this list)
  • Storage optimization: Set retention policies based on trace value
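
Tail-based sampling defers the keep/drop decision until a trace completes, when the interesting signals (errors, latency) are known. A minimal sketch of the decision function, assuming completed traces are buffered and rolled up into a hypothetical TraceSummary:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// TraceSummary is a hypothetical rollup of a completed trace.
type TraceSummary struct {
    TraceID  string
    HasError bool
    Duration time.Duration
}

// keepTrace keeps every failed or slow trace and samples the rest.
func keepTrace(t TraceSummary, slowThreshold time.Duration, sampleRate float64) bool {
    if t.HasError {
        return true // never drop failed traces
    }
    if t.Duration > slowThreshold {
        return true // never drop slow traces
    }
    return rand.Float64() < sampleRate // keep a small fraction of healthy traces
}

func main() {
    t := TraceSummary{TraceID: "abc123", Duration: 800 * time.Millisecond}
    fmt.Println("keep:", keepTrace(t, 500*time.Millisecond, 0.05)) // slow, so kept
}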

Practice Exercises

Exercise 1: Build a Request Logger

🎯 Learning Objectives:

  • Master structured logging with Go's log/slog package
  • Implement HTTP middleware pattern for cross-cutting concerns
  • Practice context propagation for request tracking
  • Learn proper log level usage and filtering

🌍 Real-World Context:
Every production service needs comprehensive request logging for debugging, auditing, and analytics. Companies like Stripe and PayPal use structured request logging to detect fraud patterns, debug API issues, and maintain compliance. This middleware pattern is used by thousands of Go services to ensure every request is properly tracked from ingress to egress.

⏱️ Time Estimate: 45-60 minutes
📊 Difficulty: Intermediate

Create middleware that logs all HTTP requests with context. The middleware should generate unique request IDs, track request timing, capture response status codes, and use structured logging with appropriate fields for production monitoring.

 1package main
 2
 3import (
 4    "context"
 5    "log/slog"
 6    "net/http"
 7    "os"
 8    "time"
 9
10    "github.com/google/uuid"
11)
12
13type contextKey string
14
15const requestIDKey contextKey = "request_id"
16
17// RequestLogger middleware
18func RequestLogger(logger *slog.Logger) func(http.Handler) http.Handler {
19    return func(next http.Handler) http.Handler {
20        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
21            // Generate request ID
22            requestID := uuid.New().String()
23
24            // Add to context
25            ctx := context.WithValue(r.Context(), requestIDKey, requestID)
26            r = r.WithContext(ctx)
27
28            // Create logger with request context
29            reqLogger := logger.With(
30                slog.String("request_id", requestID),
31                slog.String("method", r.Method),
32                slog.String("path", r.URL.Path),
33                slog.String("remote_addr", r.RemoteAddr),
34                slog.String("user_agent", r.UserAgent()),
35            )
36
37            // Log request start
38            start := time.Now()
39            reqLogger.Info("request started")
40
41            // Wrap response writer to capture status
42            rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
43
44            // Call handler
45            next.ServeHTTP(rw, r)
46
47            // Log request completion
48            duration := time.Since(start)
49            reqLogger.Info("request completed",
50                slog.Int("status_code", rw.statusCode),
51                slog.Duration("duration", duration),
52                slog.Int("bytes_written", rw.bytesWritten),
53            )
54        })
55    }
56}
57
58type responseWriter struct {
59    http.ResponseWriter
60    statusCode   int
61    bytesWritten int
62}
63
64func (rw *responseWriter) WriteHeader(code int) {
65    rw.statusCode = code
66    rw.ResponseWriter.WriteHeader(code)
67}
68
69func (rw *responseWriter) Write(b []byte) (int, error) {
70    n, err := rw.ResponseWriter.Write(b)
71    rw.bytesWritten += n
72    return n, err
73}
74
75func main() {
76    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
77
78    mux := http.NewServeMux()
79    mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
80        time.Sleep(50 * time.Millisecond)
81        w.WriteHeader(http.StatusOK)
82        w.Write([]byte(`{"users": []}`))
83    })
84
85    handler := RequestLogger(logger)(mux)
86
87    http.ListenAndServe(":8080", handler)
88}

Exercise 2: Implement Error Budget Tracking

🎯 Learning Objectives:

  • Understand Service Level Objectives and error budgets
  • Implement real-time availability calculations
  • Learn rate limiting and burn rate alerting concepts
  • Practice atomic operations for thread-safe metrics

🌍 Real-World Context:
Google pioneered error budget tracking to balance reliability with innovation. When your SLO is 99.9% uptime, you have an "error budget" of 0.1% that can be used for deployments, experiments, and inevitable failures. Companies like Spotify use error budgets to determine when to pause releases versus when to push forward, ensuring user experience isn't compromised while maintaining development velocity.

⏱️ Time Estimate: 60-75 minutes
📊 Difficulty: Advanced

Build an error budget tracker based on SLOs that monitors service availability in real-time, tracks error consumption rates, calculates remaining error budget, and provides alerts when the budget is being exhausted too quickly. This system helps teams make data-driven decisions about deployments and incident response.

 1package main
 2
 3import (
 4    "fmt"
 5    "sync"
 6    "time"
 7)
 8
 9// ErrorBudget tracks SLO error budget
10type ErrorBudget struct {
11    mu sync.RWMutex
12
13    slo               float64       // e.g., 0.999 = 99.9%
14    window            time.Duration // e.g., 30 days
15    startTime         time.Time
16    totalRequests     int64
17    failedRequests    int64
18    budgetExhausted   bool
19}
20
21func NewErrorBudget(slo float64, window time.Duration) *ErrorBudget {
22    return &ErrorBudget{
23        slo:       slo,
24        window:    window,
25        startTime: time.Now(),
26    }
27}
28
29// RecordRequest records a request outcome
30func (eb *ErrorBudget) RecordRequest(success bool) {
31    eb.mu.Lock()
32    defer eb.mu.Unlock()
33
34    eb.totalRequests++
35    if !success {
36        eb.failedRequests++
37    }
38
39    // Check if budget is exhausted
40    currentAvailability := eb.availability()
41    eb.budgetExhausted = currentAvailability < eb.slo
42}
43
44func (eb *ErrorBudget) availability() float64 {
45    if eb.totalRequests == 0 {
46        return 1.0
47    }
48    return float64(eb.totalRequests-eb.failedRequests) / float64(eb.totalRequests)
49}
50
51// Status returns error budget status
52func (eb *ErrorBudget) Status() map[string]interface{} {
53    eb.mu.RLock()
54    defer eb.mu.RUnlock()
55
56    availability := eb.availability()
57    errorRate := 1.0 - availability
58
59    // Calculate budget
60    allowedErrors := float64(eb.totalRequests) * (1.0 - eb.slo)
61    remainingBudget := allowedErrors - float64(eb.failedRequests)
62    budgetPercent := remainingBudget / allowedErrors * 100 // meaningful once requests have been recorded
63
64    return map[string]interface{}{
65        "slo":                fmt.Sprintf("%.3f%%", eb.slo*100),
66        "current_availability": fmt.Sprintf("%.3f%%", availability*100),
67        "error_rate":          fmt.Sprintf("%.3f%%", errorRate*100),
68        "total_requests":      eb.totalRequests,
69        "failed_requests":     eb.failedRequests,
70        "budget_remaining":    fmt.Sprintf("%.1f%%", budgetPercent),
71        "budget_exhausted":    eb.budgetExhausted,
72        "time_in_window":      time.Since(eb.startTime),
73    }
74}
75
76func main() {
77    budget := NewErrorBudget(0.999, 30*24*time.Hour)
78
79    // Simulate requests
80    for i := 0; i < 10000; i++ {
81        success := i%200 != 0 // 99.5% success rate
82        budget.RecordRequest(success)
83    }
84
85    status := budget.Status()
86    fmt.Println("Error Budget Status:")
87    for k, v := range status {
88        fmt.Printf("  %s: %v\n", k, v)
89    }
90}
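
The tracker above reports cumulative budget consumption; burn rate adds the time dimension mentioned in the learning objectives. Burn rate is the observed error rate divided by the error rate the SLO allows (1 - SLO): a burn rate of 1 exhausts the budget exactly at the end of the window, and higher values exhaust it proportionally faster. A minimal sketch, assuming counts are taken over a short lookback window:

package main

import "fmt"

// burnRate compares the observed error rate in a window to the rate the
// SLO permits. The SRE literature suggests paging around a burn rate of
// 14.4 for a 1-hour window against a 30-day 99.9% SLO.
func burnRate(failed, total int64, slo float64) float64 {
    if total == 0 {
        return 0
    }
    observed := float64(failed) / float64(total)
    return observed / (1.0 - slo)
}

func main() {
    // 50 failures in 10,000 requests = 0.5% errors; a 99.9% SLO allows
    // 0.1%, so the budget is burning 5x faster than sustainable.
    fmt.Printf("burn rate: %.1f\n", burnRate(50, 10000, 0.999))
}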

Exercise 3: Custom Prometheus Collector

🎯 Learning Objectives:

  • Master the Prometheus Collector interface
  • Implement custom metrics beyond basic counters/gauges
  • Learn thread-safe metric collection patterns
  • Understand metric types and when to use each one

🌍 Real-World Context:
While Prometheus provides basic metric types, complex systems often need custom collectors that expose detailed state. Database connection pools, cache systems, and message queues all benefit from custom collectors that can surface operational insights. Uber uses custom collectors to monitor their massive connection pools across thousands of services, enabling proactive scaling and performance optimization.

⏱️ Time Estimate: 75-90 minutes
📊 Difficulty: Advanced

Implement a custom Prometheus collector for a connection pool with advanced metrics that go beyond simple counters. Your collector should implement the full prometheus.Collector interface, track detailed pool statistics including wait times, connection churn, and utilization patterns, and provide both real-time gauges and cumulative counters for comprehensive monitoring.

Requirements:

  1. Implement prometheus.Collector interface with proper Describe() and Collect() methods
  2. Track pool size, active connections, idle connections, and wait times
  3. Expose histograms for connection acquisition times and utilization gauges
  4. Ensure thread-safe metric updates using proper synchronization
  5. Include calculated metrics like average wait time and connection churn rate

Solution with Explanation
  1// run
  2package main
  3
  4import (
  5    "fmt"
  6    "math/rand"
  7    "sync"
  8    "time"
  9
 10    "github.com/prometheus/client_golang/prometheus"
 11)
 12
 13// ConnectionPool represents a database connection pool
 14type ConnectionPool struct {
 15    mu sync.RWMutex
 16
 17    maxSize         int
 18    activeConns     int
 19    idleConns       int
 20    totalConns      int
 21    waitingRequests int
 22
 23    // Statistics
 24    totalAcquired uint64
 25    totalReleased uint64
 26    totalWaitTime time.Duration
 27}
 28
 29// NewConnectionPool creates a new connection pool
 30func NewConnectionPool(maxSize int) *ConnectionPool {
 31    return &ConnectionPool{
 32        maxSize:    maxSize,
 33        totalConns: maxSize,
 34        idleConns:  maxSize,
 35    }
 36}
 37
 38// Acquire gets a connection from the pool
 39func (cp *ConnectionPool) Acquire() (*Connection, error) {
 40    cp.mu.Lock()
 41    defer cp.mu.Unlock()
 42
 43    start := time.Now()
 44
 45    // Wait if no idle connections
 46    for cp.idleConns == 0 {
 47        cp.waitingRequests++
 48        cp.mu.Unlock()
 49        time.Sleep(10 * time.Millisecond)
 50        cp.mu.Lock()
 51        cp.waitingRequests--
 52    }
 53
 54    cp.idleConns--
 55    cp.activeConns++
 56    cp.totalAcquired++
 57    cp.totalWaitTime += time.Since(start)
 58
 59    return &Connection{pool: cp}, nil
 60}
 61
 62// Release returns a connection to the pool
 63func (cp *ConnectionPool) Release() {
 64    cp.mu.Lock()
 65    defer cp.mu.Unlock()
 66
 67    cp.activeConns--
 68    cp.idleConns++
 69    cp.totalReleased++
 70}
 71
 72// Stats returns current pool statistics
 73func (cp *ConnectionPool) Stats() (maxSize, active, idle, waiting int) {
 74    cp.mu.RLock()
 75    defer cp.mu.RUnlock()
 76
 77    return cp.maxSize, cp.activeConns, cp.idleConns, cp.waitingRequests
 78}
 79
 80// Connection represents a database connection
 81type Connection struct {
 82    pool *ConnectionPool
 83}
 84
 85// Close returns the connection to the pool
 86func (c *Connection) Close() {
 87    c.pool.Release()
 88}
 89
 90// ConnectionPoolCollector implements prometheus.Collector
 91type ConnectionPoolCollector struct {
 92    pool *ConnectionPool
 93
 94    // Metrics descriptors
 95    poolSize        *prometheus.Desc
 96    activeConns     *prometheus.Desc
 97    idleConns       *prometheus.Desc
 98    waitingRequests *prometheus.Desc
 99    totalAcquired   *prometheus.Desc
100    totalReleased   *prometheus.Desc
101    avgWaitTime     *prometheus.Desc
102}
103
104// NewConnectionPoolCollector creates a custom collector
105func NewConnectionPoolCollector(pool *ConnectionPool, namespace string) *ConnectionPoolCollector {
106    return &ConnectionPoolCollector{
107        pool: pool,
108        poolSize: prometheus.NewDesc(
109            prometheus.BuildFQName(namespace, "pool", "size"),
110            "Maximum number of connections in the pool",
111            nil, nil,
112        ),
113        activeConns: prometheus.NewDesc(
114            prometheus.BuildFQName(namespace, "pool", "active_connections"),
115            "Number of currently active connections",
116            nil, nil,
117        ),
118        idleConns: prometheus.NewDesc(
119            prometheus.BuildFQName(namespace, "pool", "idle_connections"),
120            "Number of idle connections available",
121            nil, nil,
122        ),
123        waitingRequests: prometheus.NewDesc(
124            prometheus.BuildFQName(namespace, "pool", "waiting_requests"),
125            "Number of requests waiting for a connection",
126            nil, nil,
127        ),
128        totalAcquired: prometheus.NewDesc(
129            prometheus.BuildFQName(namespace, "pool", "acquired_total"),
130            "Total number of connections acquired",
131            nil, nil,
132        ),
133        totalReleased: prometheus.NewDesc(
134            prometheus.BuildFQName(namespace, "pool", "released_total"),
135            "Total number of connections released",
136            nil, nil,
137        ),
138        avgWaitTime: prometheus.NewDesc(
139            prometheus.BuildFQName(namespace, "pool", "avg_wait_seconds"),
140            "Average wait time to acquire a connection",
141            nil, nil,
142        ),
143    }
144}
145
146// Describe implements prometheus.Collector
147func (c *ConnectionPoolCollector) Describe(ch chan<- *prometheus.Desc) {
148    ch <- c.poolSize
149    ch <- c.activeConns
150    ch <- c.idleConns
151    ch <- c.waitingRequests
152    ch <- c.totalAcquired
153    ch <- c.totalReleased
154    ch <- c.avgWaitTime
155}
156
157// Collect implements prometheus.Collector
158func (c *ConnectionPoolCollector) Collect(ch chan<- prometheus.Metric) {
159    c.pool.mu.RLock()
160    defer c.pool.mu.RUnlock()
161
162    // Gauges
163    ch <- prometheus.MustNewConstMetric(
164        c.poolSize,
165        prometheus.GaugeValue,
166        float64(c.pool.maxSize),
167    )
168
169    ch <- prometheus.MustNewConstMetric(
170        c.activeConns,
171        prometheus.GaugeValue,
172        float64(c.pool.activeConns),
173    )
174
175    ch <- prometheus.MustNewConstMetric(
176        c.idleConns,
177        prometheus.GaugeValue,
178        float64(c.pool.idleConns),
179    )
180
181    ch <- prometheus.MustNewConstMetric(
182        c.waitingRequests,
183        prometheus.GaugeValue,
184        float64(c.pool.waitingRequests),
185    )
186
187    // Counters
188    ch <- prometheus.MustNewConstMetric(
189        c.totalAcquired,
190        prometheus.CounterValue,
191        float64(c.pool.totalAcquired),
192    )
193
194    ch <- prometheus.MustNewConstMetric(
195        c.totalReleased,
196        prometheus.CounterValue,
197        float64(c.pool.totalReleased),
198    )
199
200    // Calculate average wait time
201    avgWait := 0.0
202    if c.pool.totalAcquired > 0 {
203        avgWait = c.pool.totalWaitTime.Seconds() / float64(c.pool.totalAcquired)
204    }
205
206    ch <- prometheus.MustNewConstMetric(
207        c.avgWaitTime,
208        prometheus.GaugeValue,
209        avgWait,
210    )
211}
212
213func main() {
214    // Create connection pool
215    pool := NewConnectionPool(10)
216
217    // Create custom collector
218    collector := NewConnectionPoolCollector(pool, "myapp")
219
220    // Register collector
221    registry := prometheus.NewRegistry()
222    registry.MustRegister(collector)
223
224    // Simulate workload
225    fmt.Println("Simulating connection pool workload...\n")
226
227    var wg sync.WaitGroup
228    for i := 0; i < 20; i++ {
229        wg.Add(1)
230        go func(id int) {
231            defer wg.Done()
232
233            for j := 0; j < 5; j++ {
234                // Acquire connection
235                conn, _ := pool.Acquire()
236
237                // Simulate work
238                time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
239
240                // Release connection
241                conn.Close()
242            }
243        }(i)
244    }
245
246    // Periodically print metrics
247    done := make(chan bool)
248    go func() {
249        ticker := time.NewTicker(500 * time.Millisecond)
250        defer ticker.Stop()
251
252        for {
253            select {
254            case <-ticker.C:
255                // Gather metrics
256                metrics, _ := registry.Gather()
257
258                fmt.Println("=== Connection Pool Metrics ===")
259                for _, mf := range metrics {
260                    for _, m := range mf.GetMetric() {
261                        fmt.Printf("%s: %.2f\n",
262                            mf.GetName(),
263                            m.GetGauge().GetValue()+m.GetCounter().GetValue())
264                    }
265                }
266                fmt.Println()
267
268            case <-done:
269                return
270            }
271        }
272    }()
273
274    wg.Wait()
275    close(done)
276
277    // Final metrics
278    fmt.Println("=== Final Metrics ===")
279    maxSize, active, idle, waiting := pool.Stats()
280    fmt.Printf("Pool Size: %d\n", maxSize)
281    fmt.Printf("Active: %d\n", active)
282    fmt.Printf("Idle: %d\n", idle)
283    fmt.Printf("Waiting: %d\n", waiting)
284    fmt.Printf("Total Acquired: %d\n", pool.totalAcquired)
285    fmt.Printf("Total Released: %d\n", pool.totalReleased)
286}

Explanation:

Custom Prometheus Collector Pattern:

  1. Implement prometheus.Collector interface:

    • Describe(chan<- *prometheus.Desc): Send metric descriptors
    • Collect(chan<- prometheus.Metric): Send metric values
  2. Create metric descriptors with prometheus.NewDesc():

    • Fully qualified name
    • Help text
    • Variable labels
    • Constant labels
  3. Send metrics with prometheus.MustNewConstMetric():

    • Descriptor
    • Metric type
    • Value

Metrics Exposed:

  1. Gauges:

    • pool_size: Maximum connections
    • active_connections: In-use connections
    • idle_connections: Available connections
    • waiting_requests: Blocked requests
  2. Counters:

    • acquired_total: Total acquires
    • released_total: Total releases
  3. Calculated Metrics:

    • avg_wait_seconds: Average wait time

Why Custom Collectors?

  • Complex metrics: Calculate derived values
  • External systems: Scrape metrics from non-Go systems
  • Efficiency: Collect all related metrics in one pass
  • Dynamic labels: Generate labels at scrape time

Best Practices:

  1. Thread Safety: Use mutexes when reading state
  2. Consistency: Collect related metrics atomically
  3. Performance: Keep Collect() fast
  4. Naming: Follow Prometheus conventions
  5. Metric types:
    • Gauge: Can go up/down
    • Counter: Only increases
    • Histogram: Distribution
    • Summary: Quantiles

PromQL Queries:

 1# Current utilization
 2myapp_pool_active_connections / myapp_pool_size
 3
 4# Connection churn rate
 5rate(myapp_pool_acquired_total[5m])
 6
 7# Average wait time
 8avg_over_time(myapp_pool_avg_wait_seconds[5m])
 9
10# Alert on pool exhaustion
11myapp_pool_idle_connections == 0

Real-World Use Cases:

  • Database pools: Track connection health
  • HTTP clients: Monitor request queues
  • Caches: Hit/miss rates, evictions
  • Worker pools: Job queue depth
  • Custom business metrics: Order processing, user sessions

Exercise 4: Distributed Tracing with OpenTelemetry

🎯 Learning Objectives:

  • Implement distributed tracing across multiple services
  • Master OpenTelemetry context propagation
  • Create meaningful spans with proper attributes
  • Learn trace sampling and export strategies

🌍 Real-World Context:
Microservices architecture makes distributed tracing essential for debugging. When a user request flows through 10+ services, traditional logging becomes insufficient for understanding performance bottlenecks. Companies like Airbnb and Uber rely heavily on distributed tracing to maintain their complex service ecosystems, reducing mean time to resolution for incidents from hours to minutes.

⏱️ Time Estimate: 90-120 minutes
📊 Difficulty: Advanced

Build a comprehensive distributed tracing system using OpenTelemetry that tracks requests across multiple services. Your implementation should include proper context propagation, meaningful span attributes, automatic instrumentation for common operations, and integration with a tracing backend like Jaeger for visualization.

Requirements:

  1. Implement tracing across at least 3 different services
  2. Use OpenTelemetry automatic and manual instrumentation
  3. Properly propagate trace context through HTTP and gRPC calls
  4. Add meaningful attributes, events, and links to spans
  5. Implement sampling strategies to balance observability with performance
  6. Export traces to Jaeger or another compatible backend
Solution with Explanation
  1// run
  2package main
  3
  4import (
  5	"context"
  6	"fmt"
  7	"io"
  8	"log"
  9	"net/http"
 10	"time"
 11
 12	"go.opentelemetry.io/otel"
 13	"go.opentelemetry.io/otel/attribute"
 14	"go.opentelemetry.io/otel/exporters/jaeger"
 15	"go.opentelemetry.io/otel/propagation"
 16	"go.opentelemetry.io/otel/sdk/resource"
 17	sdktrace "go.opentelemetry.io/otel/sdk/trace"
 18	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
 19	"go.opentelemetry.io/otel/trace"
 20)
 21
 22// ServiceA is the entry point service
 23type ServiceA struct {
 24	client *http.Client
 25	tracer trace.Tracer
 26}
 27
 28func NewServiceA() *ServiceA {
 29	return &ServiceA{
 30		client: &http.Client{},
 31		tracer: otel.Tracer("service-a"),
 32	}
 33}
 34
 35func (s *ServiceA) HandleRequest(ctx context.Context, userID string) error {
 36	// Start root span
 37	ctx, span := s.tracer.Start(ctx, "handle-user-request",
 38		trace.WithAttributes(
 39			attribute.String("user.id", userID),
 40			attribute.String("service.name", "service-a"),
 41		),
 42	)
 43	defer span.End()
 44
 45	// Log event
 46	span.AddEvent("request-started", trace.WithTimestamp(time.Now()))
 47
 48	// Call Service B
 49	if err := s.callServiceB(ctx, userID); err != nil {
 50		span.RecordError(err)
 51		span.SetStatus(codes.Error, "service-b-call-failed")
 52		return err
 53	}
 54
 55	// Call Service C
 56	if err := s.callServiceC(ctx, userID); err != nil {
 57		span.RecordError(err)
 58		span.SetStatus(codes.Error, "service-c-call-failed")
 59		return err
 60	}
 61
 62	span.SetStatus(codes.Ok, "request-completed")
 63	span.AddEvent("request-completed", trace.WithTimestamp(time.Now()))
 64	return nil
 65}
 66
 67func (s *ServiceA) callServiceB(ctx context.Context, userID string) error {
 68	ctx, span := s.tracer.Start(ctx, "call-service-b")
 69	defer span.End()
 70
 71	span.SetAttributes(
 72		attribute.String("target.service", "service-b"),
 73		attribute.String("operation", "get-user-data"),
 74	)
 75
 76	// Simulate HTTP call with context propagation
 77	req, err := http.NewRequestWithContext(ctx, "GET", "http://localhost:8081/user/"+userID, nil)
 78	if err != nil {
 79		return err
 80	}
 81
 82	// Inject trace context into headers
 83	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
 84
 85	resp, err := s.client.Do(req)
 86	if err != nil {
 87		return err
 88	}
 89	defer resp.Body.Close()
 90
 91	span.SetAttributes(
 92		attribute.Int("http.status_code", resp.StatusCode),
 93		attribute.String("http.method", req.Method),
 94		attribute.String("http.url", req.URL.String()),
 95	)
 96
 97	return nil
 98}
 99
100func (s *ServiceA) callServiceC(ctx context.Context, userID string) error {
101	ctx, span := s.tracer.Start(ctx, "call-service-c")
102	defer span.End()
103
104	span.SetAttributes(
105		attribute.String("target.service", "service-c"),
106		attribute.String("operation", "process-user-data"),
107	)
108
109	// Simulate processing time
110	time.Sleep(50 * time.Millisecond)
111
112	// Add business metric
113	span.SetAttributes(attribute.Int("data.records_processed", 42))
114
115	return nil
116}
117
118// ServiceB handles user data requests
119type ServiceB struct {
120	tracer trace.Tracer
121}
122
123func NewServiceB() *ServiceB {
124	return &ServiceB{
125		tracer: otel.Tracer("service-b"),
126	}
127}
128
129func (s *ServiceB) ServeHTTP(w http.ResponseWriter, r *http.Request) {
130	// Extract trace context from headers
131	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
132
133	ctx, span := s.tracer.Start(ctx, "handle-user-request")
134	defer span.End()
135
136	// Add service-specific attributes
137	span.SetAttributes(
138		attribute.String("service.name", "service-b"),
139		attribute.String("http.route", "/user/{id}"),
140		attribute.String("user.id", r.URL.Path[len("/user/"):]),
141	)
142
143	// Simulate database query
144	_, dbSpan := s.tracer.Start(ctx, "database-query")
145	dbSpan.SetAttributes(
146		attribute.String("db.system", "postgresql"),
147		attribute.String("db.operation", "SELECT"),
148		attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
149	)
150	time.Sleep(20 * time.Millisecond)
151	dbSpan.End()
152
153	span.AddEvent("data-retrieved", trace.WithAttributes(
154		attribute.Int("record_count", 1),
155	))
156
157	w.WriteHeader(http.StatusOK)
158	w.Write([]byte(`{"id": "123", "name": "John Doe"}`))
159}
160
161// ServiceC processes data
162type ServiceC struct {
163	tracer trace.Tracer
164}
165
166func NewServiceC() *ServiceC {
167	return &ServiceC{
168		tracer: otel.Tracer("service-c"),
169	}
170}
171
172func (s *ServiceC) ProcessData(ctx context.Context, userID string) error {
173	ctx, span := s.tracer.Start(ctx, "process-user-data")
174	defer span.End()
175
176	span.SetAttributes(
177		attribute.String("service.name", "service-c"),
178		attribute.String("user.id", userID),
179	)
180
181	// Simulate multiple processing steps
182	s.processStep1(ctx)
183	s.processStep2(ctx)
184	s.processStep3(ctx)
185
186	return nil
187}
188
189func (s *ServiceC) processStep1(ctx context.Context) {
190	_, span := s.tracer.Start(ctx, "validation-step")
191	defer span.End()
192
193	span.SetAttributes(attribute.String("step.name", "data-validation"))
194	time.Sleep(10 * time.Millisecond)
195}
196
197func (s *ServiceC) processStep2(ctx context.Context) {
198	_, span := s.tracer.Start(ctx, "transformation-step")
199	defer span.End()
200
201	span.SetAttributes(attribute.String("step.name", "data-transformation"))
202	time.Sleep(25 * time.Millisecond)
203}
204
205func (s *ServiceC) processStep3(ctx context.Context) {
206	_, span := s.tracer.Start(ctx, "enrichment-step")
207	defer span.End()
208
209	span.SetAttributes(attribute.String("step.name", "data-enrichment"))
210	time.Sleep(15 * time.Millisecond)
211}
212
213// Initialize OpenTelemetry
214func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
215	// Create Jaeger exporter
216	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
217		jaeger.WithEndpoint("http://localhost:14268/api/traces"),
218	))
219	if err != nil {
220		return nil, err
221	}
222
223	// Create resource with service metadata
224	res, err := resource.New(context.Background(),
225		resource.WithAttributes(
226			semconv.ServiceNameKey.String(serviceName),
227			semconv.ServiceVersionKey.String("1.0.0"),
228			attribute.String("environment", "development"),
229		),
230	)
231	if err != nil {
232		return nil, err
233	}
234
235	// Create tracer provider with sampling
236	tp := sdktrace.NewTracerProvider(
237		sdktrace.WithBatcher(exp),
238		sdktrace.WithResource(res),
239		sdktrace.WithSampler(sdktrace.AlwaysSample()),
240	)
241
242	// Register global propagator
243	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
244		propagation.TraceContext{},
245		propagation.Baggage{},
246	))
247
248	// Set global tracer provider
249	otel.SetTracerProvider(tp)
250
251	return tp, nil
252}
253
254func main() {
255	// Initialize tracing
256	tp, err := initTracer("service-a")
257	if err != nil {
258		log.Fatal(err)
259	}
260	defer func() {
261		if err := tp.Shutdown(context.Background()); err != nil {
262			log.Printf("Error shutting down tracer provider: %v", err)
263		}
264	}()
265
266	// Create services
267	serviceA := NewServiceA()
268	serviceB := NewServiceB()
269	serviceC := NewServiceC()
270
271	// Start Service B
272	go func() {
273		fmt.Println("Service B listening on :8081")
274		if err := http.ListenAndServe(":8081", serviceB); err != nil {
275			log.Fatal(err)
276		}
277	}()
278
279	// Give Service B time to start
280	time.Sleep(100 * time.Millisecond)
281
282	// Simulate distributed request
283	ctx := context.Background()
284	fmt.Println("Processing distributed request...")
285
286	start := time.Now()
287	if err := serviceA.HandleRequest(ctx, "user-123"); err != nil {
288		fmt.Printf("Request failed: %v\n", err)
289	} else {
290		fmt.Printf("Request completed in %v\n", time.Since(start))
291	}
292
293	// Exercise Service C's multi-step processing directly
294	if err := serviceC.ProcessData(ctx, "user-123"); err != nil {
295		fmt.Printf("Processing failed: %v\n", err)
296	}
297
298	fmt.Println("\nCheck traces at: http://localhost:16686")
299	time.Sleep(2 * time.Second) // give the batch exporter time to flush
300}

Explanation:

Distributed Tracing Implementation:

  1. Trace Context Propagation:

    • Uses OpenTelemetry propagators to inject/extract trace context
    • Context flows through HTTP headers automatically
    • Maintains trace continuity across service boundaries
  2. Span Design Best Practices:

    • Meaningful span names
    • Relevant attributes following semantic conventions
    • Events for important timestamps
    • Status codes for success/failure states
  3. Service Boundaries:

    • Each service creates its own spans
    • Clear separation of concerns
    • Proper error handling and status reporting
  4. Instrumentation Patterns:

    • Manual instrumentation for custom business logic
    • Automatic instrumentation for HTTP calls
    • Database query tracing
    • Multi-step process tracking

Key Components:

  1. ServiceA:

    • Creates root span for the entire request
    • Calls downstream services
    • Coordinates overall request flow
  2. ServiceB:

    • Extracts trace context from incoming requests
    • Handles database operations
    • Returns structured data
  3. ServiceC:

    • Demonstrates multi-step processing
    • Shows nested spans within a service
    • Business logic tracing

Real-World Benefits:

  • Performance Analysis: Identify bottlenecks across services
  • Error Debugging: Trace errors through the entire request path
  • Dependency Mapping: Understand service relationships
  • Capacity Planning: See which services are under load

Production Considerations:

  • Sampling: Use probability sampling in high-traffic scenarios (see the sketch after this list)
  • Resource Limits: Configure span limits and timeouts
  • Security: Don't include sensitive data in spans
  • Cost: Manage trace volume and storage
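
A minimal sketch of probability sampling using the SDK's built-in samplers:

package main

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Keep ~10% of root traces; child spans inherit the parent's
	// decision, so sampled traces stay complete across services.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))

	tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
	_ = tp // install globally with otel.SetTracerProvider(tp)
}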

Exercise 5: Real-time Metrics Dashboard

🎯 Learning Objectives:

  • Build real-time metric collection and visualization
  • Implement time-series data aggregation
  • Create WebSocket-based streaming dashboards
  • Learn metric query languages and alerting patterns

🌍 Real-World Context:
Real-time dashboards are essential for operations teams to monitor system health and detect anomalies instantly. Companies like Netflix and Twitter use sophisticated real-time dashboards to track billions of metrics, enabling them to detect issues before users notice. These dashboards transform raw metrics into actionable insights through aggregation, visualization, and intelligent alerting.

⏱️ Time Estimate: 120-150 minutes
📊 Difficulty: Expert

Build a complete real-time metrics dashboard system that collects, aggregates, and visualizes metrics from multiple sources. Your implementation should include a metrics collector, time-series data store, WebSocket-based real-time updates, and a web dashboard with charts and alerts.

Requirements:

  1. Collect metrics from multiple sources
  2. Implement time-series data aggregation
  3. Create a WebSocket server for real-time metric streaming
  4. Build a web dashboard with interactive charts and graphs
  5. Implement alerting rules with threshold and anomaly detection
  6. Support metric queries with filtering and time range selection
Solution with Explanation
  1// run
  2package main
  3
  4import (
  5	"context"
  6	"encoding/json"
  7	"fmt"
  8	"log"
  9	"net/http"
 10	"sync"
 11	"time"
 12
 13	"github.com/gorilla/websocket"
 14)
 15
 16// Metric represents a single metric data point
 17type Metric struct {
 18	Timestamp time.Time   `json:"timestamp"`
 19	Name      string      `json:"name"`
 20	Value     float64     `json:"value"`
 21	Tags      map[string]string `json:"tags"`
 22}
 23
 24// MetricSeries represents a time series of metrics
 25type MetricSeries struct {
 26	Name   string            `json:"name"`
 27	Tags   map[string]string `json:"tags"`
 28	Points []Metric          `json:"points"`
 29}
 30
 31// MetricsCollector collects and aggregates metrics
 32type MetricsCollector struct {
 33	mu            sync.RWMutex
 34	metrics       map[string]*MetricSeries // key: name+tags hash
 35	maxPoints     int
 36	subscribers   map[*websocket.Conn]bool
 37	alertRules    []*AlertRule
 38	lastUpdate    time.Time
 39}
 40
 41// AlertRule defines an alert condition
 42type AlertRule struct {
 43	ID          string  `json:"id"`
 44	Name        string  `json:"name"`
 45	MetricName  string  `json:"metric_name"`
 46	Condition   string  `json:"condition"` // ">", "<", "==", "!="
 47	Threshold   float64 `json:"threshold"`
 48	Duration    time.Duration `json:"duration"`
 49	LastTrigger time.Time `json:"last_trigger"`
 50	Active      bool    `json:"active"`
 51}
 52
 53func NewMetricsCollector(maxPoints int) *MetricsCollector {
 54	return &MetricsCollector{
 55		metrics:     make(map[string]*MetricSeries),
 56		maxPoints:   maxPoints,
 57		subscribers: make(map[*websocket.Conn]bool),
 58		lastUpdate:  time.Now(),
 59	}
 60}
 61
 62// RecordMetric adds a new metric data point
 63func (mc *MetricsCollector) RecordMetric(name string, value float64, tags map[string]string) {
 64	mc.mu.Lock()
 65	defer mc.mu.Unlock()
 66
 67	// Create series key
 68	key := mc.seriesKey(name, tags)
 69
 70	// Get or create series
 71	series, exists := mc.metrics[key]
 72	if !exists {
 73		series = &MetricSeries{
 74			Name:   name,
 75			Tags:   tags,
 76			Points: make([]Metric, 0, mc.maxPoints),
 77		}
 78		mc.metrics[key] = series
 79	}
 80
 81	// Add new point
 82	point := Metric{
 83		Timestamp: time.Now(),
 84		Name:      name,
 85		Value:     value,
 86		Tags:      tags,
 87	}
 88
 89	// Append point and maintain max size
 90	series.Points = append(series.Points, point)
 91	if len(series.Points) > mc.maxPoints {
 92		series.Points = series.Points[1:]
 93	}
 94
 95	// Check alerts
 96	mc.checkAlerts(name, value, tags)
 97
 98	// Update timestamp
 99	mc.lastUpdate = time.Now()
100}
101
102// seriesKey creates a unique key for a metric series
103func (mc *MetricsCollector) seriesKey(name string, tags map[string]string) string {
104	key := name
105	for k, v := range tags {
106		key += fmt.Sprintf(":%s=%s", k, v)
107	}
108	return key
109}
110
111// checkAlerts evaluates alert rules against the new metric
112func (mc *MetricsCollector) checkAlerts(name string, value float64, tags map[string]string) {
113	for _, rule := range mc.alertRules {
114		if rule.MetricName != name {
115			continue
116		}
117
118		// Check condition
119		triggered := false
120		switch rule.Condition {
121		case ">":
122			triggered = value > rule.Threshold
123		case "<":
124			triggered = value < rule.Threshold
125		case "==":
126			triggered = value == rule.Threshold
127		case "!=":
128			triggered = value != rule.Threshold
129		}
130
131		if triggered {
132			if !rule.Active {
133				rule.LastTrigger = time.Now()
134				rule.Active = true
135				mc.sendAlert(rule, value, tags)
136			}
137		} else {
138			rule.Active = false
139		}
140	}
141}
142
143// sendAlert sends an alert notification
144func (mc *MetricsCollector) sendAlert(rule *AlertRule, value float64, tags map[string]string) {
145	alert := map[string]interface{}{
146		"type":      "alert",
147		"rule":      rule.Name,
148		"metric":    rule.MetricName,
149		"value":     value,
150		"threshold": rule.Threshold,
151		"timestamp": time.Now(),
152		"tags":      tags,
153	}
154
155	go mc.broadcast(alert) // async: the caller holds the write lock, so broadcasting inline would self-deadlock
156}
157
158// GetMetrics returns metrics matching the query
159func (mc *MetricsCollector) GetMetrics(name string, tags map[string]string, duration time.Duration) []MetricSeries {
160	mc.mu.RLock()
161	defer mc.mu.RUnlock()
162
163	var results []MetricSeries
164	cutoff := time.Now().Add(-duration)
165
166	for _, series := range mc.metrics {
167		// Filter by name
168		if name != "" && series.Name != name {
169			continue
170		}
171
172		// Filter by tags
173		if !mc.matchTags(series.Tags, tags) {
174			continue
175		}
176
177		// Filter points by time
178		var points []Metric
179		for _, point := range series.Points {
180			if point.Timestamp.After(cutoff) {
181				points = append(points, point)
182			}
183		}
184
185		if len(points) > 0 {
186			results = append(results, MetricSeries{
187				Name:   series.Name,
188				Tags:   series.Tags,
189				Points: points,
190			})
191		}
192	}
193
194	return results
195}
196
197// matchTags checks if all query tags match series tags
198func (mc *MetricsCollector) matchTags(seriesTags, queryTags map[string]string) bool {
199	for k, v := range queryTags {
200		if seriesTags[k] != v {
201			return false
202		}
203	}
204	return true
205}
206
207// AddAlertRule adds a new alert rule
208func (mc *MetricsCollector) AddAlertRule(rule *AlertRule) {
209	mc.mu.Lock()
210	defer mc.mu.Unlock()
211	mc.alertRules = append(mc.alertRules, rule)
212}
213
214// Subscribe adds a WebSocket subscriber for real-time updates
215func (mc *MetricsCollector) Subscribe(conn *websocket.Conn) {
216	mc.mu.Lock()
217	defer mc.mu.Unlock()
218	mc.subscribers[conn] = true
219}
220
221// Unsubscribe removes a WebSocket subscriber
222func (mc *MetricsCollector) Unsubscribe(conn *websocket.Conn) {
223	mc.mu.Lock()
224	defer mc.mu.Unlock()
225	delete(mc.subscribers, conn)
226}
227
228// broadcast sends data to all subscribers
229func (mc *MetricsCollector) broadcast(data interface{}) {
230	mc.mu.RLock()
231	subscribers := make([]*websocket.Conn, 0, len(mc.subscribers))
232	for conn := range mc.subscribers {
233		subscribers = append(subscribers, conn)
234	}
235	mc.mu.RUnlock()
236
237	for _, conn := range subscribers {
238		if err := conn.WriteJSON(data); err != nil {
239			mc.Unsubscribe(conn)
240			conn.Close()
241		}
242	}
243}
244
245// GetStats returns collector statistics
246func (mc *MetricsCollector) GetStats() map[string]interface{} {
247	mc.mu.RLock()
248	defer mc.mu.RUnlock()
249
250	return map[string]interface{}{
251		"total_series":    len(mc.metrics),
252		"total_points":    mc.countTotalPoints(),
253		"active_alerts":   mc.countActiveAlerts(),
254		"subscribers":     len(mc.subscribers),
255		"last_update":     mc.lastUpdate,
256	}
257}
258
259func (mc *MetricsCollector) countTotalPoints() int {
260	total := 0
261	for _, series := range mc.metrics {
262		total += len(series.Points)
263	}
264	return total
265}
266
267func (mc *MetricsCollector) countActiveAlerts() int {
268	active := 0
269	for _, rule := range mc.alertRules {
270		if rule.Active {
271			active++
272		}
273	}
274	return active
275}
276
277// WebSocket upgrader
278var upgrader = websocket.Upgrader{
279	CheckOrigin: func(r *http.Request) bool {
280		return true // Allow all origins for demo
281	},
282}
283
284// HTTP handlers
285func (mc *MetricsCollector) handleWebSocket(w http.ResponseWriter, r *http.Request) {
286	conn, err := upgrader.Upgrade(w, r, nil)
287	if err != nil {
288		log.Printf("WebSocket upgrade error: %v", err)
289		return
290	}
291	defer conn.Close()
292
293	// Subscribe to updates
294	mc.Subscribe(conn)
295	defer mc.Unsubscribe(conn)
296
297	// Send initial stats
298	stats := mc.GetStats()
299	conn.WriteJSON(map[string]interface{}{
300		"type": "stats",
301		"data": stats,
302	})
303
304	// Keep connection alive
305	for {
306		_, _, err := conn.ReadMessage()
307		if err != nil {
308			break
309		}
310	}
311}
312
313func (mc *MetricsCollector) handleMetrics(w http.ResponseWriter, r *http.Request) {
314	// Parse query parameters
315	name := r.URL.Query().Get("name")
316	durationStr := r.URL.Query().Get("duration")
317	duration := time.Hour // default
318
319	if durationStr != "" {
320		if parsed, err := time.ParseDuration(durationStr); err == nil {
321			duration = parsed
322		}
323	}
324
325	// Parse tags
326	tags := make(map[string]string)
327	for key, values := range r.URL.Query() {
328		if key == "name" || key == "duration" {
329			continue
330		}
331		if len(values) > 0 {
332			tags[key] = values[0]
333		}
334	}
335
336	// Get metrics
337	metrics := mc.GetMetrics(name, tags, duration)
338
339	w.Header().Set("Content-Type", "application/json")
340	json.NewEncoder(w).Encode(metrics)
341}
342
343func (mc *MetricsCollector) handleStats(w http.ResponseWriter, r *http.Request) {
344	stats := mc.GetStats()
345	w.Header().Set("Content-Type", "application/json")
346	json.NewEncoder(w).Encode(stats)
347}
348
349func (mc *MetricsCollector) handleAlerts(w http.ResponseWriter, r *http.Request) {
350	mc.mu.RLock()
351	defer mc.mu.RUnlock()
352
353	w.Header().Set("Content-Type", "application/json")
354	json.NewEncoder(w).Encode(mc.alertRules)
355}
356
357// Metrics generator for simulation
358func (mc *MetricsCollector) startMetricsSimulation(ctx context.Context) {
359	ticker := time.NewTicker(1 * time.Second)
360	defer ticker.Stop()
361
362	counter := 0 // int so the % operator works below
363	for {
364		select {
365		case <-ctx.Done():
366			return
367		case <-ticker.C:
368			// Generate various metrics
369			mc.RecordMetric("http_requests_total", 100+counter%50, nil)
370			mc.RecordMetric("cpu_usage", 30+counter%40, nil)
371			mc.RecordMetric("memory_usage", 50+counter%30, nil)
372			mc.RecordMetric("response_time_ms", 50+counter%100, map[string]string{"endpoint": "/api/users"})
373			mc.RecordMetric("response_time_ms", 100+counter%150, map[string]string{"endpoint": "/api/products"})
374
375			// Simulate error rate spike occasionally
376			if counter%20 == 0 {
377				mc.RecordMetric("error_rate", 5.0, nil)
378			} else {
379				mc.RecordMetric("error_rate", 0.5, nil)
380			}
381
382			counter++
383		}
384	}
385}
386
387// Dashboard HTML
388const dashboardHTML = `
389<!DOCTYPE html>
390<html>
391<head>
392    <title>Real-time Metrics Dashboard</title>
393    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
394    <style>
395        body { font-family: Arial, sans-serif; margin: 20px; }
396        .container { max-width: 1200px; margin: 0 auto; }
397        .stats { display: flex; gap: 20px; margin-bottom: 20px; }
398        .stat-card { background: #f5f5f5; padding: 15px; border-radius: 5px; min-width: 150px; }
399        .charts { display: grid; grid-template-columns: 1fr 1fr; gap: 20px; }
400        .chart-container { background: white; padding: 20px; border-radius: 5px; box-shadow: 0 2px 5px rgba(0,0,0,0.1); }
401        .alerts { margin-top: 20px; background: #fff3cd; padding: 15px; border-radius: 5px; }
402        .alert { background: #f8d7da; padding: 10px; margin: 5px 0; border-radius: 3px; }
403    </style>
404</head>
405<body>
406    <div class="container">
407        <h1>Real-time Metrics Dashboard</h1>
408
409        <div class="stats" id="stats">
410            <!-- Stats will be populated here -->
411        </div>
412
413        <div class="charts">
414            <div class="chart-container">
415                <h3>CPU & Memory Usage</h3>
416                <canvas id="systemChart"></canvas>
417            </div>
418            <div class="chart-container">
419                <h3>Response Times</h3>
420                <canvas id="responseChart"></canvas>
421            </div>
422            <div class="chart-container">
423                <h3>Request Rate</h3>
424                <canvas id="requestChart"></canvas>
425            </div>
426            <div class="chart-container">
427                <h3>Error Rate</h3>
428                <canvas id="errorChart"></canvas>
429            </div>
430        </div>
431
432        <div class="alerts" id="alerts">
433            <h3>Active Alerts</h3>
434            <div id="alertList">No active alerts</div>
435        </div>
436    </div>
437
438    <script>
439        const ws = new WebSocket('ws://localhost:8080/ws');
440
441        const systemChart = new Chart(document.getElementById('systemChart'), {
442            type: 'line',
443            data: { labels: [], datasets: [
444                { label: 'CPU Usage', data: [], borderColor: 'rgb(75, 192, 192)' },
445                { label: 'Memory Usage', data: [], borderColor: 'rgb(255, 99, 132)' }
446            ]},
447            options: { scales: { y: { beginAtZero: true, max: 100 } } }
448        });
449
450        const responseChart = new Chart(document.getElementById('responseChart'), {
451            type: 'line',
452            data: { labels: [], datasets: [
453                { label: '/api/users', data: [], borderColor: 'rgb(54, 162, 235)' },
454                { label: '/api/products', data: [], borderColor: 'rgb(255, 205, 86)' }
455            ]},
456            options: { scales: { y: { beginAtZero: true } } }
457        });
458
459        const requestChart = new Chart(document.getElementById('requestChart'), {
460            type: 'line',
461            data: { labels: [], datasets: [
462                { label: 'Requests/sec', data: [], borderColor: 'rgb(153, 102, 255)' }
463            ]},
464            options: { scales: { y: { beginAtZero: true } } }
465        });
466
467        const errorChart = new Chart(document.getElementById('errorChart'), {
468            type: 'line',
469            data: { labels: [], datasets: [
470                { label: 'Error Rate %', data: [], borderColor: 'rgb(255, 159, 64)' }
471            ]},
472            options: { scales: { y: { beginAtZero: true, max: 10 } } }
473        });
474
475        ws.onmessage = function(event) {
476            const data = JSON.parse(event.data);
477
478            if (data.type === 'stats') {
479                updateStats(data.data);
480            } else if (data.type === 'alert') {
481                showAlert(data);
482            }
483        };
484
485        function updateStats(stats) {
486            document.getElementById('stats').innerHTML =
487                '<div class="stat-card"><h4>Total Series</h4><p>' + stats.total_series + '</p></div>' +
488                '<div class="stat-card"><h4>Total Points</h4><p>' + stats.total_points + '</p></div>' +
489                '<div class="stat-card"><h4>Active Alerts</h4><p>' + stats.active_alerts + '</p></div>' +
490                '<div class="stat-card"><h4>Subscribers</h4><p>' + stats.subscribers + '</p></div>';
491        }
492
493        function showAlert(alert) {
494            const alertList = document.getElementById('alertList');
495            const alertDiv = document.createElement('div');
496            alertDiv.className = 'alert';
497            alertDiv.innerHTML = '<strong>' + alert.rule + '</strong>: ' +
498                               alert.metric + ' is ' + alert.value +
499                               ' (threshold: ' + alert.threshold + ')';
500            alertList.appendChild(alertDiv);
501
502            // Remove old alerts
503            setTimeout(() => {
504                if (alertDiv.parentNode) {
505                    alertDiv.parentNode.removeChild(alertDiv);
506                }
507            }, 10000);
508        }
509
510        // Fetch initial metrics and update charts
511        async function updateCharts() {
512            const response = await fetch('/metrics?duration=5m');
513            const metrics = await response.json();
514
515            // Update charts with new data
516            // This is simplified - in production, you'd update with actual data points
517            const now = new Date().toLocaleTimeString();
518
519            // Add new labels
520            if (systemChart.data.labels.length > 30) { // keep a rolling window
521                systemChart.data.labels.shift();
522                systemChart.data.datasets[0].data.shift();
523                systemChart.data.datasets[1].data.shift();
524            }
525            systemChart.data.labels.push(now);
526            systemChart.data.datasets[0].data.push(Math.random() * 100);
527            systemChart.data.datasets[1].data.push(Math.random() * 100);
528            systemChart.update('none');
529
530            // Similar updates for other charts...
531        }
532
533        // Update charts every 2 seconds
534        setInterval(updateCharts, 2000);
535        updateCharts(); // Initial call
536    </script>
537</body>
538</html>
539`
540
541func (mc *MetricsCollector) handleDashboard(w http.ResponseWriter, r *http.Request) {
542	w.Header().Set("Content-Type", "text/html")
543	fmt.Fprint(w, dashboardHTML)
544}
545
546func main() {
547	// Create metrics collector
548	collector := NewMetricsCollector(1000) // Keep last 1000 points per series
549
550	// Add alert rules
551	collector.AddAlertRule(&AlertRule{
552		ID:         "high-cpu",
553		Name:       "High CPU Usage",
554		MetricName: "cpu_usage",
555		Condition:  ">",
556		Threshold:  80.0,
557		Duration:   time.Minute * 5,
558	})
559
560	collector.AddAlertRule(&AlertRule{
561		ID:         "high-error-rate",
562		Name:       "High Error Rate",
563		MetricName: "error_rate",
564		Condition:  ">",
565		Threshold:  2.0,
566		Duration:   time.Minute * 2,
567	})
568
569	// Start metrics simulation
570	ctx, cancel := context.WithCancel(context.Background())
571	defer cancel()
572	go collector.startMetricsSimulation(ctx)
573
574	// Setup HTTP routes
575	http.HandleFunc("/ws", collector.handleWebSocket)
576	http.HandleFunc("/metrics", collector.handleMetrics)
577	http.HandleFunc("/stats", collector.handleStats)
578	http.HandleFunc("/alerts", collector.handleAlerts)
579	http.HandleFunc("/", collector.handleDashboard)
580
581	fmt.Println("Dashboard running on http://localhost:8080")
582	fmt.Println("WebSocket endpoint: ws://localhost:8080/ws")
583	fmt.Println("Metrics API: http://localhost:8080/metrics")
584	fmt.Println("Stats API: http://localhost:8080/stats")
585
586	if err := http.ListenAndServe(":8080", nil); err != nil {
587		log.Fatal(err)
588	}
589}

Explanation:

Real-time Dashboard Architecture:

  1. Metrics Collection:

    • Thread-safe metric collection with mutex protection
    • Time-series data organized by metric name and tags
    • Automatic cleanup of old data points
    • Support for metric aggregation and querying
  2. Alerting System:

    • Configurable alert rules with thresholds
    • Real-time alert evaluation on metric updates
    • Alert notifications via WebSocket
    • Alert state management to prevent spam
  3. Real-time Communication:

    • WebSocket server for live updates
    • Broadcast mechanism for multiple subscribers
    • Automatic connection management
    • JSON-based message protocol
  4. Web Dashboard:

    • Interactive charts using Chart.js
    • Real-time metric visualization
    • Alert display and management
    • Responsive design for multiple screen sizes

Key Features:

  1. Metric Storage:

    • In-memory time-series database
    • Configurable retention periods
    • Efficient querying by name and tags
    • Statistical aggregations
  2. Alert Engine:

    • Rule-based alerting
    • Multiple comparison operators
    • Time-based alert duration
    • Alert deduplication
  3. API Endpoints:

    • /metrics - Query metrics with filtering
    • /stats - System statistics
    • /alerts - Alert rules and status
    • /ws - WebSocket for real-time updates
    • / - Dashboard web interface

Production Considerations:

  • Persistence: Replace in-memory storage with Redis or TimescaleDB
  • Scalability: Implement metric sharding and load balancing
  • Security: Add authentication and CORS configuration
  • Performance: Use binary protocols for high-frequency metrics
  • Monitoring: Monitor the dashboard system itself

Real-World Applications:

  • Application Performance Monitoring
  • Infrastructure Monitoring
  • Business Metrics Dashboards
  • DevOps Alerting Systems
  • IoT Data Visualization

Summary

Production Readiness Checklist

  • Structured Logging: All logs use JSON format with consistent fields
  • Request Tracing: Every user request has a unique trace ID
  • Key Metrics: Track RED (Rate, Errors, Duration) for all services, as sketched below
  • SLOs: Defined for critical user journeys
  • Dashboards: Available for system health and business metrics
  • Alerting: Based on SLO burn rate, not simple thresholds
  • Log Sampling: Configurable for high-traffic services
  • Retention Policies: Defined for logs, metrics, and traces
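
A minimal sketch of RED instrumentation with the Prometheus client; the metric names follow common conventions, not a required schema:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// RED: Rate and Errors come from one counter labeled by status;
// Duration comes from a latency histogram.
var (
    requests = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Requests by path and status (Rate, Errors)",
    }, []string{"path", "status"})

    latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency (Duration)",
        Buckets: prometheus.DefBuckets,
    }, []string{"path"})
)

func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        requests.WithLabelValues(path, "200").Inc() // real code would capture the written status
        latency.WithLabelValues(path).Observe(time.Since(start).Seconds())
    }
}

func main() {
    http.HandleFunc("/api/users", instrument("/api/users", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(`{"users": []}`))
    }))
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}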

Remember: Observability is a journey, not a destination. Start with what matters most to your users and expand from there.