Learning Objectives
By the end of this tutorial, you will be able to:
- Design observability strategies for production systems using the three pillars approach
- Implement structured logging with Go's log/slog package for machine-readable insights
- Build comprehensive metrics collection using Prometheus with custom collectors
- Create distributed tracing systems with OpenTelemetry for request flow analysis
- Establish SLIs/SLOs and alerting based on service level objectives
- Prevent production incidents through proactive monitoring and early detection
Hook - The Production Blindness Problem
Consider driving a car with no dashboard—no speedometer, no fuel gauge, no engine lights. You'd be flying blind, unable to tell if you're speeding, running out of gas, or if the engine is about to fail.
Production software without observability is exactly like that car.
💡 Real-World Impact:
- Amazon Prime Day 2018 - Lost $100M in 63 minutes due to lack of distributed tracing
- Target Black Friday 2018 - Website crashed from undetected connection pool exhaustion
- Cloudflare's 2019 outage - Memory ordering bug caused incorrect routing for 30 minutes
These incidents share a common theme: teams were flying blind without proper observability.
The Cost of Being Blind:
- MTTR (mean time to resolution): roughly 75% lower with proper observability
- MTTD (mean time to detection): roughly 85% lower with real-time alerting
- Incident count: roughly 50% fewer incidents with proactive monitoring
- Developer productivity: roughly 30% less time spent debugging
⚠️ Important: Without observability, you're not just debugging in the dark—you're losing money, customers, and developer time with every incident.
Core Concepts - The Three Pillars
Think of observability like a doctor's diagnostic toolkit. You need multiple instruments to get a complete picture of your system's health.
Pillar 1: Structured Logs
Logs capture discrete events with rich context. Structured logs are far more valuable than unstructured text because machines can parse, search, and analyze them efficiently.
Why Structured > Unstructured:
```go
// Unstructured
log.Printf("User %s from %s logged in at %v", userID, ip, time.Now())

// Structured
logger.Info("user_login",
    slog.String("user_id", userID),
    slog.String("ip_address", ip),
    slog.Time("login_time", time.Now()))
```
Pillar 2: Metrics
Metrics provide numerical insights about system behavior over time. They answer questions like "How fast are we?" and "How much traffic?"
Key Metric Types (a minimal sketch follows this list):
- Counters: Monotonically increasing values
- Gauges: Values that go up and down
- Histograms: Distribution of values
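To make the three types concrete, here is a minimal sketch using the Prometheus Go client (`github.com/prometheus/client_golang`); the metric and function names are illustrative placeholders, and the full middleware examples later in this tutorial build on the same primitives.

```go
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter: only ever goes up (e.g., total requests served).
    requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "app_requests_total",
        Help: "Total number of requests handled",
    })

    // Gauge: can go up and down (e.g., requests currently in flight).
    inflightRequests = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "app_inflight_requests",
        Help: "Requests currently being processed",
    })

    // Histogram: distribution of observed values (e.g., latencies).
    requestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "app_request_duration_seconds",
        Help:    "Request latency distribution",
        Buckets: prometheus.DefBuckets,
    })
)

func handleRequest() {
    inflightRequests.Inc()
    defer inflightRequests.Dec()

    // Time the work and observe it into the histogram.
    timer := prometheus.NewTimer(requestLatency)
    defer timer.ObserveDuration()

    // ... do the work ...
    requestsTotal.Inc()
}

func main() {
    handleRequest()
}
```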
Pillar 3: Traces
Traces show the complete journey of a request through multiple services, helping you understand performance bottlenecks and failure points.
💡 Real-World Example: When Netflix experiences streaming issues:
- Logs show "CDN cache miss for movie ID 12345"
- Metrics show "CDN miss rate increased from 0.1% to 15%"
- Traces reveal "CDN → storage → user"
This combination tells them: what happened, how much, and why.
Structured Logging with slog
Practical Examples - Structured Logging
Example 1: Production-Ready Request Logger
Problem: You need to track all API requests with unique IDs, timing, and structured context for debugging production issues.
Solution: Build a middleware that logs every request with rich context using Go's log/slog.
```go
package main

import (
    "context"
    "fmt"
    "log/slog"
    "net/http"
    "os"
    "time"

    "github.com/google/uuid"
)

type contextKey string

const requestIDKey contextKey = "request_id"
const userIDKey contextKey = "user_id"

// RequestLogger middleware adds comprehensive request logging
func RequestLogger(logger *slog.Logger) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Generate unique request ID
            requestID := uuid.New().String()

            // Add to context for downstream handlers
            ctx := context.WithValue(r.Context(), requestIDKey, requestID)
            r = r.WithContext(ctx)

            // Create logger with request context
            reqLogger := logger.With(
                slog.String("request_id", requestID),
                slog.String("method", r.Method),
                slog.String("path", r.URL.Path),
                slog.String("remote_addr", r.RemoteAddr),
                slog.String("user_agent", r.UserAgent()),
            )

            // Log request start
            start := time.Now()
            reqLogger.Info("request_started")

            // Wrap response writer to capture status and bytes
            rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

            // Call handler
            next.ServeHTTP(rw, r)

            // Log request completion with timing
            duration := time.Since(start)
            reqLogger.Info("request_completed",
                slog.Int("status_code", rw.statusCode),
                slog.Duration("duration_ms", duration),
                slog.Int("response_size", rw.bytesWritten),
            )
        })
    }
}

type responseWriter struct {
    http.ResponseWriter
    statusCode   int
    bytesWritten int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func (rw *responseWriter) Write(b []byte) (int, error) {
    n, err := rw.ResponseWriter.Write(b)
    rw.bytesWritten += n
    return n, err
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Create router with request logging middleware
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(50 * time.Millisecond) // Simulate work
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"users": []}`))
    })

    // Apply middleware
    handler := RequestLogger(logger)(mux)

    fmt.Println("Server running on :8080")
    http.ListenAndServe(":8080", handler)
}
```
Output:
1{"time":"2024-01-15T10:30:45Z","level":"INFO","msg":"request_started","request_id":"123e4567-89ab-4fcd-8a9b-8d6f-123456","method":"GET","path":"/api/users","remote_addr":"127.0.0.1","user_agent":"curl/7.68.0"}
2{"time":"2024-01-15T10:30:46Z","level":"INFO","msg":"request_completed","request_id":"123e4567-89ab-4fcd-8a9b-8d6f-123456","method":"GET","path":"/api/users","status_code":200,"duration_ms":52123456,"response_size":15}
Key Insights:
- Request ID: Enables log correlation across services
- Structured Format: JSON logs are easily searchable and analyzed
- Performance Metrics: Built-in timing for every request
- Context Preservation: User IDs and other context flow naturally
Example 2: Context-Aware Logger with Operation Tracking
Problem: You need to track multi-step operations with proper error context and timing for debugging complex workflows.
Solution: Create a logger that automatically extracts context values and tracks operation completion.
```go
package main

import (
    "context"
    "log/slog"
    "os"
    "time"
)

type OperationLogger struct {
    *slog.Logger
}

func NewOperationLogger() *OperationLogger {
    handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level:     slog.LevelInfo,
        AddSource: true,
    })
    return &OperationLogger{
        Logger: slog.New(handler),
    }
}

// WithContext automatically extracts common context values
func (ol *OperationLogger) WithContext(ctx context.Context) *OperationLogger {
    logger := ol.Logger

    // Extract request ID if available
    if requestID, ok := ctx.Value("request_id").(string); ok {
        logger = logger.With(slog.String("request_id", requestID))
    }

    // Extract user ID if available
    if userID, ok := ctx.Value("user_id").(string); ok {
        logger = logger.With(slog.String("user_id", userID))
    }

    return &OperationLogger{Logger: logger}
}

// LogOperation tracks operations with automatic timing and error handling
func (ol *OperationLogger) LogOperation(ctx context.Context, operation string, fn func() error) error {
    opLogger := ol.With(slog.String("operation", operation))

    start := time.Now()
    err := fn()
    duration := time.Since(start)

    if err != nil {
        opLogger.Error("operation_failed",
            slog.Duration("duration_ms", duration),
            slog.String("error", err.Error()),
            slog.String("status", "failed"),
        )
    } else {
        opLogger.Info("operation_completed",
            slog.Duration("duration_ms", duration),
            slog.String("status", "completed"),
        )
    }

    return err
}

// Usage example in a service
func processUserOrder(ctx context.Context, orderID string) error {
    logger := NewOperationLogger()
    opLogger := logger.WithContext(ctx)

    return opLogger.LogOperation(ctx, "process_order", func() error {
        // Simulate order processing steps
        if err := validateOrder(orderID); err != nil {
            return err
        }
        if err := reserveInventory(orderID); err != nil {
            return err
        }
        if err := processPayment(orderID); err != nil {
            return err
        }
        return shipOrder(orderID)
    })
}

func validateOrder(orderID string) error    { /* ... */ return nil }
func reserveInventory(orderID string) error { /* ... */ return nil }
func processPayment(orderID string) error   { /* ... */ return nil }
func shipOrder(orderID string) error        { /* ... */ return nil }

func main() {
    // String keys are used here for brevity; production code should use
    // typed context keys as in the previous example.
    ctx := context.WithValue(context.Background(), "request_id", "req-abc-123")
    ctx = context.WithValue(ctx, "user_id", "user-456")

    processUserOrder(ctx, "order-789")
}
```
Progressive Learning: Each example builds on the previous one, showing how structured logging evolves from basic JSON logging to sophisticated context-aware operation tracking that's essential for production debugging.
Example 3: Advanced Log Processing and Correlation
Problem: In microservices, you need to correlate logs across services and detect patterns in real-time.
Solution: Implement log aggregation with correlation and alerting.
```go
package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log/slog"
    "strings"
    "time"
)

// LogEntry represents a structured log entry
type LogEntry struct {
    Level     string                 `json:"level"`
    Message   string                 `json:"msg"`
    Time      time.Time              `json:"time"`
    RequestID string                 `json:"request_id,omitempty"`
    UserID    string                 `json:"user_id,omitempty"`
    Duration  time.Duration          `json:"duration,omitempty"`
    Error     string                 `json:"error,omitempty"`
    Fields    map[string]interface{} `json:"-"`
}

// LogProcessor processes and correlates logs
type LogProcessor struct {
    entries   []LogEntry
    patterns  map[string]int
    alertChan chan LogEntry
}

func NewLogProcessor() *LogProcessor {
    return &LogProcessor{
        patterns:  make(map[string]int),
        alertChan: make(chan LogEntry, 100),
    }
}

// ProcessEntry processes a single log entry
func (lp *LogProcessor) ProcessEntry(entry LogEntry) {
    lp.entries = append(lp.entries, entry)

    // Detect error patterns
    if entry.Level == "ERROR" {
        lp.detectErrorPattern(entry)

        // Send to alert channel
        select {
        case lp.alertChan <- entry:
        default:
            // Channel full, drop alert
        }
    }

    // Keep only last 1000 entries
    if len(lp.entries) > 1000 {
        lp.entries = lp.entries[1:]
    }
}

// detectErrorPattern identifies common error patterns
func (lp *LogProcessor) detectErrorPattern(entry LogEntry) {
    pattern := extractErrorPattern(entry.Error)
    if pattern != "" {
        lp.patterns[pattern]++

        // Alert on frequent errors
        if lp.patterns[pattern] > 10 {
            fmt.Printf("ALERT: Pattern '%s' occurred %d times\n",
                pattern, lp.patterns[pattern])
        }
    }
}

// extractErrorPattern extracts the error pattern from error message
func extractErrorPattern(errMsg string) string {
    if strings.Contains(errMsg, "timeout") {
        return "timeout_error"
    }
    if strings.Contains(errMsg, "connection refused") {
        return "connection_error"
    }
    if strings.Contains(errMsg, "database") {
        return "database_error"
    }
    return "unknown_error"
}

// CorrelateRequest finds all logs for a given request
func (lp *LogProcessor) CorrelateRequest(requestID string) []LogEntry {
    var requestLogs []LogEntry
    for _, entry := range lp.entries {
        if entry.RequestID == requestID {
            requestLogs = append(requestLogs, entry)
        }
    }
    return requestLogs
}

// processingHandler is a custom slog.Handler that feeds every record
// into a LogProcessor (the standard library has no slog.HandlerFunc,
// so we implement the Handler interface explicitly).
type processingHandler struct {
    processor *LogProcessor
    attrs     []slog.Attr
}

func (h *processingHandler) Enabled(_ context.Context, _ slog.Level) bool { return true }

func (h *processingHandler) Handle(_ context.Context, record slog.Record) error {
    // Convert to our custom log entry
    entry := LogEntry{
        Level:   record.Level.String(),
        Message: record.Message,
        Time:    record.Time,
        Fields:  make(map[string]interface{}),
    }

    // Extract attributes (both pre-attached and per-record)
    collect := func(attr slog.Attr) {
        switch attr.Key {
        case "request_id":
            entry.RequestID = attr.Value.String()
        case "user_id":
            entry.UserID = attr.Value.String()
        case "error":
            entry.Error = attr.Value.String()
        default:
            entry.Fields[attr.Key] = attr.Value.Any()
        }
    }
    for _, attr := range h.attrs {
        collect(attr)
    }
    record.Attrs(func(attr slog.Attr) bool {
        collect(attr)
        return true
    })

    // Process the entry
    h.processor.ProcessEntry(entry)

    // Print to stdout for demo
    data, _ := json.Marshal(entry)
    fmt.Println(string(data))

    return nil
}

func (h *processingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
    combined := append(append([]slog.Attr{}, h.attrs...), attrs...)
    return &processingHandler{processor: h.processor, attrs: combined}
}

func (h *processingHandler) WithGroup(_ string) slog.Handler { return h }

// Production logger with correlation
func setupProductionLogger() (*slog.Logger, *LogProcessor) {
    processor := NewLogProcessor()
    logger := slog.New(&processingHandler{processor: processor})
    return logger, processor
}

func main() {
    logger, processor := setupProductionLogger()

    // Simulate request processing with correlation
    requestID := "req-12345"

    logger.Info("request_started",
        slog.String("request_id", requestID),
        slog.String("user_id", "user-67890"),
    )

    // Simulate processing with an error
    time.Sleep(100 * time.Millisecond)
    logger.Error("database_connection_failed",
        slog.String("request_id", requestID),
        slog.String("error", "connection timeout to database at localhost:5432"),
        slog.Duration("retry_after", 5*time.Second),
    )

    logger.Info("request_completed",
        slog.String("request_id", requestID),
        slog.Duration("duration", 110*time.Millisecond),
        slog.Int("status_code", 500),
    )

    // Correlate and analyze logs
    time.Sleep(1 * time.Second)
    requestLogs := processor.CorrelateRequest(requestID)
    fmt.Printf("\nCorrelated logs for request %s: %d entries\n",
        requestID, len(requestLogs))
}
```
Advanced Insights:
- Pattern Detection: Automatically identify recurring error patterns
- Real-time Alerting: Detect issues as they happen
- Log Correlation: Group all logs by request ID for debugging
- Memory Management: Automatically clean up old entries
Practical Examples - Prometheus Metrics
Example 1: Complete HTTP Metrics Middleware
Problem: You need comprehensive HTTP metrics for performance monitoring and alerting.
Solution: Build a middleware that captures all key HTTP metrics with proper labeling.
```go
package main

import (
    "fmt"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// HTTPMetrics tracks all HTTP-related metrics
type HTTPMetrics struct {
    // Request metrics
    requestTotal    *prometheus.CounterVec
    requestDuration *prometheus.HistogramVec
    requestSize     *prometheus.HistogramVec
    responseSize    *prometheus.HistogramVec

    // Connection metrics
    activeConnections prometheus.Gauge
    connectionErrors  prometheus.Counter

    // Performance metrics
    slowRequests    prometheus.Counter
    timeoutRequests prometheus.Counter
}

func NewHTTPMetrics(namespace string) *HTTPMetrics {
    return &HTTPMetrics{
        requestTotal: promauto.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "http_requests_total",
                Help:      "Total number of HTTP requests",
            },
            // Note: user_agent is a high-cardinality label; keep it only if
            // the values are normalized.
            []string{"method", "path", "status", "user_agent"},
        ),

        requestDuration: promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Namespace: namespace,
                Name:      "http_request_duration_seconds",
                Help:      "HTTP request duration",
                Buckets:   []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0},
            },
            []string{"method", "path", "status"},
        ),

        requestSize: promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Namespace: namespace,
                Name:      "http_request_size_bytes",
                Help:      "HTTP request size in bytes",
                Buckets:   []float64{100, 1000, 10000, 100000, 1000000},
            },
            []string{"method", "path"},
        ),

        responseSize: promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Namespace: namespace,
                Name:      "http_response_size_bytes",
                Help:      "HTTP response size in bytes",
                Buckets:   []float64{100, 1000, 10000, 100000, 1000000},
            },
            []string{"method", "path", "status"},
        ),

        activeConnections: promauto.NewGauge(
            prometheus.GaugeOpts{
                Namespace: namespace,
                Name:      "http_active_connections",
                Help:      "Number of currently active HTTP connections",
            },
        ),

        connectionErrors: promauto.NewCounter(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "http_connection_errors_total",
                Help:      "Total number of HTTP connection errors",
            },
        ),

        slowRequests: promauto.NewCounter(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "http_slow_requests_total",
                Help:      "Total number of slow HTTP requests",
            },
        ),

        timeoutRequests: promauto.NewCounter(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "http_timeout_requests_total",
                Help:      "Total number of HTTP timeouts",
            },
        ),
    }
}

// TrackRequest tracks a single HTTP request
func (hm *HTTPMetrics) TrackRequest(
    method, path string,
    statusCode int,
    duration time.Duration,
    requestSize, responseSize int64,
    userAgent string,
) {
    status := fmt.Sprintf("%d", statusCode)

    // Record request count
    hm.requestTotal.WithLabelValues(method, path, status, userAgent).Inc()

    // Record request duration
    hm.requestDuration.WithLabelValues(method, path, status).Observe(duration.Seconds())

    // Record request/response sizes
    hm.requestSize.WithLabelValues(method, path).Observe(float64(requestSize))
    hm.responseSize.WithLabelValues(method, path, status).Observe(float64(responseSize))

    // Track performance issues
    if duration > time.Second {
        hm.slowRequests.Inc()
    }

    if duration > 30*time.Second {
        hm.timeoutRequests.Inc()
    }
}

// IncrementConnections increments active connections
func (hm *HTTPMetrics) IncrementConnections() {
    hm.activeConnections.Inc()
}

// DecrementConnections decrements active connections
func (hm *HTTPMetrics) DecrementConnections() {
    hm.activeConnections.Dec()
}

// RecordConnectionError records a connection error
func (hm *HTTPMetrics) RecordConnectionError() {
    hm.connectionErrors.Inc()
}

// Middleware records metrics for every request
func (hm *HTTPMetrics) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Increment active connections
        hm.IncrementConnections()
        defer hm.DecrementConnections()

        // Wrap response writer to capture metrics
        rw := &metricsResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        // Record request size
        requestSize := r.ContentLength

        // Call handler
        next.ServeHTTP(rw, r)

        // Calculate duration and record metrics
        duration := time.Since(start)

        hm.TrackRequest(
            r.Method,
            r.URL.Path,
            rw.statusCode,
            duration,
            requestSize,
            int64(rw.bytesWritten),
            r.UserAgent(),
        )
    })
}

type metricsResponseWriter struct {
    http.ResponseWriter
    statusCode   int
    bytesWritten int
}

func (mrw *metricsResponseWriter) WriteHeader(code int) {
    mrw.statusCode = code
    mrw.ResponseWriter.WriteHeader(code)
}

func (mrw *metricsResponseWriter) Write(b []byte) (int, error) {
    n, err := mrw.ResponseWriter.Write(b)
    mrw.bytesWritten += n
    return n, err
}

func main() {
    metrics := NewHTTPMetrics("myapp")

    // Example handlers
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(100 * time.Millisecond) // Simulate work
        w.Write([]byte("Hello World"))
    })

    mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(2 * time.Second) // Slow endpoint
        w.Write([]byte("Slow response"))
    })

    // Expose metrics endpoint on the same mux
    mux.Handle("/metrics", promhttp.Handler())

    // Apply metrics middleware
    handler := metrics.Middleware(mux)

    fmt.Println("Server running on :8080")
    fmt.Println("Metrics available at http://localhost:8080/metrics")
    http.ListenAndServe(":8080", handler)
}
```
Key Metrics Collected:
- Request Count: By method, path, status, and user agent
- Response Time: Detailed histogram with performance buckets
- Payload Sizes: Both request and response sizes
- Connection Metrics: Active connections and errors
- Performance Issues: Automatic detection of slow requests and timeouts
Example 2: Custom Business Metrics for E-commerce
Problem: You need to track business metrics like order processing times, cart abandonment, and conversion rates.
Solution: Implement custom metrics that capture business-critical KPIs.
```go
package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// BusinessMetrics tracks e-commerce business metrics
type BusinessMetrics struct {
    // Order metrics
    ordersTotal   *prometheus.CounterVec
    orderValue    *prometheus.HistogramVec
    orderDuration *prometheus.HistogramVec

    // Cart metrics
    cartAdditions    prometheus.Counter
    cartAbandonments prometheus.Counter
    cartConversions  prometheus.Counter

    // User metrics
    activeUsers    prometheus.Gauge
    userSessions   prometheus.Counter
    conversionRate prometheus.Gauge

    // Inventory metrics
    stockLevel *prometheus.GaugeVec
    outOfStock *prometheus.CounterVec

    // Internal counts: Prometheus counters cannot be read back,
    // so the conversion rate is derived from these.
    mu               sync.Mutex
    conversionCount  float64
    abandonmentCount float64
}

func NewBusinessMetrics() *BusinessMetrics {
    return &BusinessMetrics{
        ordersTotal: promauto.NewCounterVec(
            prometheus.CounterOpts{
                Name: "orders_total",
                Help: "Total number of orders by status",
            },
            []string{"status", "payment_method", "region"},
        ),

        orderValue: promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "order_value_dollars",
                Help:    "Order value distribution",
                Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
            },
            []string{"product_category"},
        ),

        orderDuration: promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "order_processing_duration_seconds",
                Help:    "Time to process orders",
                Buckets: []float64{1, 5, 10, 30, 60, 120},
            },
            []string{"order_type"},
        ),

        cartAdditions: promauto.NewCounter(
            prometheus.CounterOpts{
                Name: "cart_additions_total",
                Help: "Total items added to carts",
            },
        ),

        cartAbandonments: promauto.NewCounter(
            prometheus.CounterOpts{
                Name: "cart_abandonments_total",
                Help: "Total abandoned carts",
            },
        ),

        cartConversions: promauto.NewCounter(
            prometheus.CounterOpts{
                Name: "cart_conversions_total",
                Help: "Total completed checkouts",
            },
        ),

        activeUsers: promauto.NewGauge(
            prometheus.GaugeOpts{
                Name: "active_users_current",
                Help: "Number of currently active users",
            },
        ),

        userSessions: promauto.NewCounter(
            prometheus.CounterOpts{
                Name: "user_sessions_total",
                Help: "Total user sessions",
            },
        ),

        conversionRate: promauto.NewGauge(
            prometheus.GaugeOpts{
                Name: "conversion_rate",
                Help: "Cart to order conversion rate",
            },
        ),

        stockLevel: promauto.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "inventory_stock_level",
                Help: "Current stock levels",
            },
            []string{"product_id", "category"},
        ),

        outOfStock: promauto.NewCounterVec(
            prometheus.CounterOpts{
                Name: "inventory_out_of_stock_total",
                Help: "Out of stock events",
            },
            []string{"product_id", "category"},
        ),
    }
}

// TrackOrder tracks order completion
func (bm *BusinessMetrics) TrackOrder(status, paymentMethod, region string, value float64, duration time.Duration, category string) {
    bm.ordersTotal.WithLabelValues(status, paymentMethod, region).Inc()
    bm.orderValue.WithLabelValues(category).Observe(value)
    bm.orderDuration.WithLabelValues("standard").Observe(duration.Seconds())
}

// TrackCartAddition tracks item added to cart
func (bm *BusinessMetrics) TrackCartAddition() {
    bm.cartAdditions.Inc()
}

// TrackCartAbandonment tracks abandoned cart
func (bm *BusinessMetrics) TrackCartAbandonment() {
    bm.cartAbandonments.Inc()

    bm.mu.Lock()
    bm.abandonmentCount++
    bm.mu.Unlock()

    bm.updateConversionRate()
}

// TrackConversion tracks completed checkout
func (bm *BusinessMetrics) TrackConversion() {
    bm.cartConversions.Inc()

    bm.mu.Lock()
    bm.conversionCount++
    bm.mu.Unlock()

    bm.updateConversionRate()
}

// updateConversionRate calculates and updates conversion rate
func (bm *BusinessMetrics) updateConversionRate() {
    bm.mu.Lock()
    defer bm.mu.Unlock()

    total := bm.conversionCount + bm.abandonmentCount
    if total > 0 {
        bm.conversionRate.Set(bm.conversionCount / total)
    }
}

// UserActive indicates user activity
func (bm *BusinessMetrics) UserActive(active bool) {
    if active {
        bm.activeUsers.Inc()
    } else {
        bm.activeUsers.Dec()
    }
}

// TrackSession tracks new user session
func (bm *BusinessMetrics) TrackSession() {
    bm.userSessions.Inc()
}

// UpdateInventory updates stock levels
func (bm *BusinessMetrics) UpdateInventory(productID, category string, stockLevel int) {
    bm.stockLevel.WithLabelValues(productID, category).Set(float64(stockLevel))

    if stockLevel == 0 {
        bm.outOfStock.WithLabelValues(productID, category).Inc()
    }
}

// CalculateConversionRate returns current conversion metrics
func (bm *BusinessMetrics) CalculateConversionRate() map[string]float64 {
    bm.mu.Lock()
    defer bm.mu.Unlock()

    total := bm.conversionCount + bm.abandonmentCount

    rate := 0.0
    if total > 0 {
        rate = bm.conversionCount / total
    }

    return map[string]float64{
        "conversion_rate":    rate * 100, // Percentage
        "total_conversions":  bm.conversionCount,
        "total_abandonments": bm.abandonmentCount,
        "total_carts":        total,
    }
}

// Usage example
func main() {
    metrics := NewBusinessMetrics()

    // Simulate e-commerce activity
    var wg sync.WaitGroup

    // Simulate user sessions
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(userID int) {
            defer wg.Done()

            // User session starts
            metrics.TrackSession()
            metrics.UserActive(true)

            // Simulate shopping
            time.Sleep(time.Duration(100+userID) * time.Millisecond)

            // Add items to cart
            metrics.TrackCartAddition()
            metrics.TrackCartAddition()

            // 70% chance of conversion
            if userID%10 < 7 {
                // Complete order
                metrics.TrackConversion()
                metrics.TrackOrder("completed", "credit_card", "US", 89.99, 5*time.Second, "electronics")
            } else {
                // Abandon cart
                metrics.TrackCartAbandonment()
            }

            // User session ends
            metrics.UserActive(false)
        }(i)
    }

    // Simulate inventory updates
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(productID int) {
            defer wg.Done()

            // Update inventory levels
            stock := 100 - productID
            metrics.UpdateInventory(
                fmt.Sprintf("prod-%d", productID),
                "electronics",
                stock,
            )
        }(i)
    }

    wg.Wait()

    // Print conversion metrics
    convMetrics := metrics.CalculateConversionRate()
    fmt.Println("E-commerce Metrics:")
    for k, v := range convMetrics {
        fmt.Printf("  %s: %.2f\n", k, v)
    }
}
```
Business Insights:
- Conversion Tracking: Real-time cart abandonment and conversion rates
- Order Analytics: Value distribution, processing times, and regional performance
- User Behavior: Session tracking and active user counts
- Inventory Monitoring: Real-time stock levels and out-of-stock alerts
Practical Examples - Distributed Tracing
Example 1: Complete Microservice Tracing
Problem: You need to trace requests across multiple microservices to identify bottlenecks and failures.
Solution: Implement comprehensive distributed tracing with OpenTelemetry.
```go
package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    "go.opentelemetry.io/otel/trace"
)

// UserService handles user-related operations
type UserService struct {
    tracer trace.Tracer
}

func NewUserService() *UserService {
    return &UserService{
        tracer: otel.Tracer("user-service"),
    }
}

func (us *UserService) GetUser(ctx context.Context, userID string) (*User, error) {
    ctx, span := us.tracer.Start(ctx, "get-user")
    defer span.End()

    span.SetAttributes(
        attribute.String("user_id", userID),
        attribute.String("operation", "database_query"),
    )

    // Simulate database query
    time.Sleep(50 * time.Millisecond)

    // Simulate cache check
    us.checkCache(ctx, userID)

    // Return user data
    return &User{
        ID:    userID,
        Name:  "John Doe",
        Email: "john@example.com",
    }, nil
}

func (us *UserService) checkCache(ctx context.Context, userID string) {
    _, span := us.tracer.Start(ctx, "cache-lookup")
    defer span.End()

    span.SetAttributes(
        attribute.String("cache.key", fmt.Sprintf("user:%s", userID)),
        attribute.String("cache.result", "miss"),
    )

    time.Sleep(5 * time.Millisecond)
}

// OrderService handles order operations
type OrderService struct {
    tracer      trace.Tracer
    userService *UserService
}

func NewOrderService(userService *UserService) *OrderService {
    return &OrderService{
        tracer:      otel.Tracer("order-service"),
        userService: userService,
    }
}

func (s *OrderService) CreateOrder(ctx context.Context, userID string, items []OrderItem) (*Order, error) {
    ctx, span := s.tracer.Start(ctx, "create-order")
    defer span.End()

    span.SetAttributes(
        attribute.String("user_id", userID),
        attribute.Int("item_count", len(items)),
        attribute.String("order_type", "standard"),
    )

    // Validate user
    user, err := s.userService.GetUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "user_validation_failed")
        return nil, err
    }

    // Calculate totals
    s.calculateOrderTotal(ctx, items)

    // Process payment
    if err := s.processPayment(ctx, items); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "payment_failed")
        return nil, err
    }

    // Create order
    order := &Order{
        ID:     fmt.Sprintf("order-%d", time.Now().Unix()),
        User:   *user,
        Items:  items,
        Status: "created",
    }

    span.SetAttributes(
        attribute.String("order_id", order.ID),
        attribute.String("order_status", order.Status),
    )
    span.SetStatus(codes.Ok, "order_created_successfully")

    return order, nil
}

func (s *OrderService) calculateOrderTotal(ctx context.Context, items []OrderItem) {
    _, span := s.tracer.Start(ctx, "calculate-total")
    defer span.End()

    span.SetAttributes(
        attribute.String("calculation_type", "order_total"),
    )

    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }

    span.SetAttributes(
        attribute.Float64("order_total", total),
    )

    time.Sleep(10 * time.Millisecond)
}

func (s *OrderService) processPayment(ctx context.Context, items []OrderItem) error {
    _, span := s.tracer.Start(ctx, "process-payment")
    defer span.End()

    span.SetAttributes(
        attribute.String("payment_processor", "stripe"),
        attribute.String("payment_type", "credit_card"),
    )

    // Simulate payment processing
    time.Sleep(100 * time.Millisecond)

    // Simulate payment failure
    if len(items) > 5 {
        return errors.New("payment declined: too many items")
    }

    span.SetAttributes(
        attribute.String("payment_status", "completed"),
    )

    return nil
}

// Data structures
type User struct {
    ID    string
    Name  string
    Email string
}

type OrderItem struct {
    Name     string
    Price    float64
    Quantity int
}

type Order struct {
    ID     string
    User   User
    Items  []OrderItem
    Status string
}

// Initialize tracing
func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
            semconv.ServiceVersionKey.String("1.0.0"),
            attribute.String("environment", "production"),
        )),
    )

    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    otel.SetTracerProvider(tp)
    return tp, nil
}

// HTTP server with tracing
func main() {
    tp, err := initTracer("order-service")
    if err != nil {
        panic(err)
    }
    defer tp.Shutdown(context.Background())

    // Create services
    userService := NewUserService()
    orderService := NewOrderService(userService)

    // HTTP handler with tracing
    http.HandleFunc("/api/orders", func(w http.ResponseWriter, r *http.Request) {
        // Extract trace context
        ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

        // Create root span
        tracer := otel.Tracer("http-server")
        ctx, span := tracer.Start(ctx, "http-create-order",
            trace.WithAttributes(
                attribute.String("http.method", r.Method),
                attribute.String("http.url", r.URL.String()),
                attribute.String("http.user_agent", r.UserAgent()),
            ),
        )
        defer span.End()

        // Parse request
        userID := r.URL.Query().Get("user_id")
        if userID == "" {
            span.SetAttributes(attribute.String("error", "missing_user_id"))
            w.WriteHeader(http.StatusBadRequest)
            return
        }

        items := []OrderItem{
            {"Product 1", 29.99, 2},
            {"Product 2", 19.99, 1},
        }

        // Create order
        order, err := orderService.CreateOrder(ctx, userID, items)
        if err != nil {
            span.RecordError(err)
            span.SetStatus(codes.Error, "order_creation_failed")
            w.WriteHeader(http.StatusInternalServerError)
            return
        }

        // Success response
        span.SetAttributes(
            attribute.String("order_id", order.ID),
            attribute.Int("http.status_code", http.StatusCreated),
        )
        span.SetStatus(codes.Ok, "request_completed")

        w.WriteHeader(http.StatusCreated)
        fmt.Fprintf(w, "Order %s created successfully", order.ID)
    })

    fmt.Println("Order service running on :8080")
    fmt.Println("Traces available at: http://localhost:16686")
    http.ListenAndServe(":8080", nil)
}
```
Tracing Insights:
- Request Flow: Complete journey from API to database
- Performance Analysis: Identify slow operations and bottlenecks
- Error Debugging: Root cause analysis with detailed error context
- Service Dependencies: Understand relationships between services
Common Pitfalls and Solutions
Pitfall 1: Over-logging Everything
Problem: Logging everything creates noise and storage costs.
Solution: Implement smart logging levels and sampling.
```go
// Smart logger with sampling
type SmartLogger struct {
    *slog.Logger
    sampleRate float64
    counter    int64
}

func (sl *SmartLogger) Debug(msg string, args ...any) {
    // Note: counter is not goroutine-safe; use sync/atomic in production.
    sl.counter++
    if float64(sl.counter%100) > sl.sampleRate*100 {
        return // Skip this log message
    }
    sl.Logger.Debug(msg, args...)
}
```
Pitfall 2: Too Many Metrics
Problem: Metric overload makes it hard to identify what matters.
Solution: Focus on RED and USE methodologies.
```go
// RED: Rate, Errors, Duration
// USE: Utilization, Saturation, Errors

type EssentialMetrics struct {
    requestRate prometheus.Counter   // Rate
    errorRate   prometheus.Counter   // Errors
    latency     prometheus.Histogram // Duration
    utilization prometheus.Gauge     // Utilization
    saturation  prometheus.Gauge     // Saturation
}
```
Pitfall 3: Missing Context in Traces
Problem: Traces without context are useless.
Solution: Always add meaningful attributes and events.
```go
// Good trace with context
span.SetAttributes(
    attribute.String("user.id", userID),
    attribute.String("operation.type", "database_query"),
    attribute.String("table", "users"),
    attribute.Bool("cache_hit", false),
)

span.AddEvent("query_executed", trace.WithAttributes(
    attribute.Int("rows_returned", 42),
    attribute.Int64("query_time_ms", 50), // the attribute package has no Duration type
))
```
Integration and Mastery
Production Observability Stack
```
┌─────────────────────────────────────────┐
│ OBSERVABILITY STACK │
└─────────────────────────────────────────┘
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ LOGGING │ │ METRICS │ │ TRACING │
│ │ │ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ App │ │ │ │ App │ │ │ │ App │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ slog │ │ │ │Prometheus│ │ │ │Otel SDK │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Loki │ │ │ │Prometheus│ │ │ │ Jaeger │ │
│ │Elastic │ │ │ │ Server │ │ │ │Tempo │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌─────────────┐
│ Grafana │
│ Dashboard │
   └─────────────┘
```
Key Success Patterns
- Start Small: Begin with basic logging and key metrics
- Build Incrementally: Add tracing as your system grows
- Automate: Use auto-instrumentation where possible
- Monitor Costs: Implement log/metric retention policies
- Train Teams: Ensure everyone understands observability practices
Summary
What We Learned
✅ Structured Logging: Go's slog package for production-ready logging
✅ Prometheus Metrics: Comprehensive collection with business KPIs
✅ Distributed Tracing: OpenTelemetry for microservice visibility
✅ Production Patterns: Common pitfalls and best practices
✅ Integration: Complete observability stack architecture
Key Takeaways
- Context is King: Always add request IDs, user IDs, and operation context
- Measure What Matters: Focus on user-facing metrics and business KPIs
- Automation Wins: Use middleware and auto-instrumentation consistently
- Cost Awareness: Implement sampling and retention from day one
Next Steps
- Implement RED/USE Metrics: Start with Rate, Errors, Duration and Utilization, Saturation, Errors
- Set Up Dashboards: Create operational dashboards for your services
- Configure Alerting: Implement SLO-based alerting with error budgets
- Practice Incident Response: Run regular incident response drills using your observability data
Production Checklist
- Structured logging with request correlation
- HTTP middleware for automatic metrics collection
- Distributed tracing across all services
- SLOs defined for critical operations
- Dashboards for system health and business metrics
- Alerting based on error budgets
- Log/metric retention policies
- Regular incident response practice
Remember: Observability is not an option, it's a requirement for production systems.
Practice Exercises
Attaching attributes such as `slog.String("ip_address", ip)` creates structured JSON that machines can easily parse and query.
💡 **Key Takeaway**: Structured logs aren't just for machines—they make debugging faster for humans too. Imagine searching for "all errors for user 123 in the last hour" across millions of log lines.
### Basic Structured Logging
```go
package main

import (
    "log/slog"
    "os"
)

func main() {
    // Default JSON logger
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Simple log
    logger.Info("server started", "port", 8080, "env", "production")

    // With attributes
    logger.Info("user logged in",
        slog.String("user_id", "user-123"),
        slog.Int("session_duration", 3600),
        slog.Bool("admin", false),
    )

    // Error with context
    logger.Error("database connection failed",
        slog.String("error", "connection timeout"),
        slog.String("database", "postgres"),
        slog.Int("retry_count", 3),
    )
}
```
Output:
1{"time":"2024-01-15T10:30:45Z","level":"INFO","msg":"server started","port":8080,"env":"production"}
2{"time":"2024-01-15T10:30:46Z","level":"INFO","msg":"user logged in","user_id":"user-123","session_duration":3600,"admin":false}
3{"time":"2024-01-15T10:30:47Z","level":"ERROR","msg":"database connection failed","error":"connection timeout","database":"postgres","retry_count":3}
Production Logger with Context
```go
package main

import (
    "context"
    "log/slog"
    "os"
    "time"
)

type contextKey string

const (
    requestIDKey contextKey = "request_id"
    userIDKey    contextKey = "user_id"
)

// Logger wraps slog with context support
type Logger struct {
    *slog.Logger
}

func NewLogger() *Logger {
    handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level:     slog.LevelInfo,
        AddSource: true, // Add file:line to logs
    })

    return &Logger{
        Logger: slog.New(handler),
    }
}

// WithContext extracts context values and adds to log
func (l *Logger) WithContext(ctx context.Context) *Logger {
    logger := l.Logger

    // Add request ID
    if requestID, ok := ctx.Value(requestIDKey).(string); ok {
        logger = logger.With(slog.String("request_id", requestID))
    }

    // Add user ID
    if userID, ok := ctx.Value(userIDKey).(string); ok {
        logger = logger.With(slog.String("user_id", userID))
    }

    return &Logger{Logger: logger}
}

// LogOperation logs an operation with timing
func (l *Logger) LogOperation(ctx context.Context, operation string, fn func() error) error {
    start := time.Now()

    err := fn()

    duration := time.Since(start)

    if err != nil {
        l.WithContext(ctx).Error(operation+" failed",
            slog.Duration("duration", duration),
            slog.String("error", err.Error()),
        )
    } else {
        l.WithContext(ctx).Info(operation+" completed",
            slog.Duration("duration", duration),
        )
    }

    return err
}

func main() {
    logger := NewLogger()

    // Create context with request ID
    ctx := context.WithValue(context.Background(), requestIDKey, "req-abc-123")
    ctx = context.WithValue(ctx, userIDKey, "user-456")

    // Log with context
    logger.WithContext(ctx).Info("processing request")

    // Log operation with timing
    logger.LogOperation(ctx, "fetch_user", func() error {
        time.Sleep(50 * time.Millisecond)
        return nil
    })
}
```
Log Levels and Filtering
```go
package main

import (
    "log/slog"
    "os"
)

func main() {
    // Production: INFO and above
    prodHandler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    })
    prodLogger := slog.New(prodHandler)

    // Development: DEBUG and above
    devHandler := slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelDebug,
    })
    devLogger := slog.New(devHandler)

    // Log at different levels
    prodLogger.Debug("debug message")  // Not logged
    prodLogger.Info("info message")    // Logged
    prodLogger.Warn("warning message") // Logged
    prodLogger.Error("error message")  // Logged

    devLogger.Debug("debug message") // Logged
    devLogger.Info("info message")   // Logged
}
```
Prometheus Metrics
If structured logging is like keeping a detailed diary, Prometheus metrics are like the vital signs dashboard on a medical monitor. They give you real-time, numerical data about your system's health that you can track over time, set alerts on, and use to predict problems before they happen.
Prometheus is the de facto standard for metrics in modern systems. It's like having a fitness tracker for your application—it constantly monitors key metrics and tells you when something's wrong.
💡 Real-world Example: Uber uses Prometheus to monitor over 1,000 microservices. When ride requests spike during rush hour:
- Gauge shows "active drivers: 45,000"
- Counter shows "rides_completed_total: 125,000"
- Histogram shows "ride_duration_seconds"
If active drivers drop while requests increase, they can dispatch more drivers before riders even notice delays.
Basic Metrics Types
```go
package main

import (
    "fmt"
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter: Monotonically increasing value
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    // Gauge: Value that can go up or down
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Histogram: Distribution of values
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency",
            Buckets: prometheus.DefBuckets, // Default: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        },
        []string{"method", "path"},
    )

    // Summary: Like histogram, but calculates quantiles client-side
    requestSize = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_request_size_bytes",
            Help:       "HTTP request size in bytes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method"},
    )
)

// Middleware to record metrics
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Increment active connections
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap response writer to capture status code
        rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        // Call handler
        next.ServeHTTP(rw, r)

        // Record metrics
        duration := time.Since(start).Seconds()

        requestsTotal.WithLabelValues(r.Method, r.URL.Path, fmt.Sprintf("%d", rw.statusCode)).Inc()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        requestSize.WithLabelValues(r.Method).Observe(float64(r.ContentLength))
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func main() {
    // API handlers
    http.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
        // Simulate work
        time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"users": []}`))
    })

    http.HandleFunc("/api/products", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(time.Duration(rand.Intn(200)) * time.Millisecond)
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"products": []}`))
    })

    // Metrics endpoint
    http.Handle("/metrics", promhttp.Handler())

    // Apply metrics middleware
    handler := metricsMiddleware(http.DefaultServeMux)

    fmt.Println("Server running on :8080")
    fmt.Println("Metrics available at http://localhost:8080/metrics")
    http.ListenAndServe(":8080", handler)
}
```
Custom Business Metrics
```go
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Business metrics
    ordersTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_total",
            Help: "Total number of orders",
        },
        []string{"status"}, // pending, completed, failed
    )

    orderValue = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "order_value_dollars",
            Help:    "Order value distribution",
            Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
        },
        []string{"product_category"},
    )

    cartAbandonment = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "cart_abandonments_total",
            Help: "Total number of cart abandonments",
        },
    )

    activeUsers = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_users",
            Help: "Number of currently active users",
        },
    )

    checkoutDuration = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "checkout_duration_seconds",
            Help:    "Time to complete checkout",
            Buckets: []float64{1, 5, 10, 30, 60, 120},
        },
    )
)

// OrderService with metrics
type OrderService struct{}

func (s *OrderService) CreateOrder(category string, value float64) error {
    // Record order
    ordersTotal.WithLabelValues("pending").Inc()
    orderValue.WithLabelValues(category).Observe(value)

    // ... business logic ...

    ordersTotal.WithLabelValues("completed").Inc()
    return nil
}

func AbandonCart() {
    cartAbandonment.Inc()
}

func UserLogin() {
    activeUsers.Inc()
}

func UserLogout() {
    activeUsers.Dec()
}

// Minimal usage demo
func main() {
    svc := &OrderService{}
    UserLogin()
    svc.CreateOrder("electronics", 99.99)
    checkoutDuration.Observe(12) // seconds to complete checkout
    AbandonCart()
    UserLogout()
}
```
OpenTelemetry Distributed Tracing
If logs are like individual diary entries and metrics are like vital signs, distributed tracing is like having a GPS tracker that follows a package from warehouse to doorstep. It shows you the complete journey of a request as it travels through different services, databases, and caches.
OpenTelemetry is the standard for distributed tracing and observability. It's like having a universal passport for your requests—no matter which service they visit, they carry their identity and timing information with them.
💡 Real-world Example: When you order from Amazon:
- Your request hits the API gateway
- Inventory service checks stock
- Payment service processes payment
- Shipping service schedules delivery
Without tracing: "Order is slow" 😕
With tracing: "Payment service is taking 200ms instead of usual 50ms" 🎯
⚠️ Common Pitfall: Don't trace everything! Trace user-initiated operations and critical paths. Tracing every database query can overwhelm your tracing system.
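One common way to keep trace volume under control is head-based sampling in the OpenTelemetry SDK. The sketch below is illustrative and only assumes the same SDK packages used in the examples that follow; the 10% ratio is an arbitrary example value.

```go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Keep roughly 10% of traces; ParentBased ensures a request that was
    // sampled upstream stays sampled in this service too.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
    )
    defer tp.Shutdown(context.Background())

    otel.SetTracerProvider(tp)

    // ... register exporters and instrument handlers as in the examples below ...
}
```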
Basic Tracing Setup
```go
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

var tracer = otel.Tracer("example-service")

func initTracer() func() {
    // Create stdout exporter
    exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
    if err != nil {
        log.Fatal(err)
    }

    // Create resource
    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("my-service"),
        semconv.ServiceVersion("1.0.0"),
        attribute.String("environment", "production"),
    )

    // Create trace provider
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
    )

    otel.SetTracerProvider(tp)

    return func() {
        tp.Shutdown(context.Background())
    }
}

// API handler with tracing
func handleRequest(ctx context.Context, userID string) error {
    // Start span
    ctx, span := tracer.Start(ctx, "handle_request")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        attribute.String("user_id", userID),
        attribute.String("request_type", "get_user"),
    )

    // Call downstream services
    if err := fetchUserFromDB(ctx, userID); err != nil {
        span.RecordError(err)
        return err
    }

    if err := fetchUserPreferences(ctx, userID); err != nil {
        span.RecordError(err)
        return err
    }

    return nil
}

func fetchUserFromDB(ctx context.Context, userID string) error {
    _, span := tracer.Start(ctx, "fetch_user_db")
    defer span.End()

    span.SetAttributes(
        attribute.String("db.system", "postgresql"),
        attribute.String("db.operation", "SELECT"),
    )

    // Simulate DB call
    time.Sleep(50 * time.Millisecond)

    return nil
}

func fetchUserPreferences(ctx context.Context, userID string) error {
    _, span := tracer.Start(ctx, "fetch_preferences")
    defer span.End()

    span.SetAttributes(
        attribute.String("cache.hit", "false"),
    )

    // Simulate cache miss + DB call
    time.Sleep(20 * time.Millisecond)

    return nil
}

func main() {
    cleanup := initTracer()
    defer cleanup()

    ctx := context.Background()

    // Handle request
    if err := handleRequest(ctx, "user-123"); err != nil {
        fmt.Printf("Error: %v\n", err)
    }

    // Allow time for export
    time.Sleep(time.Second)
}
```
HTTP Instrumentation
```go
package main

import (
    "context"
    "fmt"
    "net/http"
    "time"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initJaegerTracer() (*trace.TracerProvider, error) {
    // Create Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exp),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("api-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    tp, err := initJaegerTracer()
    if err != nil {
        panic(err)
    }
    defer tp.Shutdown(context.Background())

    // Handler with custom tracing
    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()
        tracer := otel.Tracer("api-handler")

        // Custom span
        _, span := tracer.Start(ctx, "process_request")
        defer span.End()

        // Add custom attributes
        span.SetAttributes(
            attribute.String("http.route", r.URL.Path),
            attribute.String("user_agent", r.UserAgent()),
        )

        // Simulate processing
        time.Sleep(100 * time.Millisecond)

        w.WriteHeader(http.StatusOK)
        w.Write([]byte("OK"))
    })

    // Wrap with OpenTelemetry HTTP instrumentation
    instrumentedHandler := otelhttp.NewHandler(handler, "http-server")

    fmt.Println("Server running on :8080")
    fmt.Println("Traces sent to Jaeger at http://localhost:16686")
    http.ListenAndServe(":8080", instrumentedHandler)
}
```
SLIs, SLOs, and Alerting
If observability data is like your car's dashboard, SLIs, SLOs, and SLAs are like the service schedule and warranty. They define what "good performance" means, when you need to take action, and what you've promised to your users.
Service Level Indicators, Service Level Objectives, and Service Level Agreements define reliability targets.
The Hierarchy:
- SLI: What you measure
- SLO: Your target
- SLA: What you promise customers
💡 Real-world Example: Google Search SLOs:
- SLI: 99th percentile query latency
- SLO: 95% of queries under 200ms
- Error budget: 5% can be slow
⚠️ Important: 100% reliability is impossible and expensive. Even Google aims for 99.99% uptime. The "error budget" approach gives you room to innovate while maintaining reliability.
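The error budget is simply the complement of the SLO applied to your traffic (or time) window. A rough sketch of the arithmetic, with purely illustrative numbers:

```go
package main

import "fmt"

func main() {
    const (
        sloTarget     = 0.999     // 99.9% availability objective
        totalRequests = 1_000_000 // requests served this window (illustrative)
        failedSoFar   = 600       // failed requests observed so far
    )

    // The error budget is everything the SLO allows to fail.
    budget := (1 - sloTarget) * totalRequests // 1,000 failures allowed
    remaining := budget - failedSoFar

    fmt.Printf("error budget: %.0f requests\n", budget)
    fmt.Printf("consumed: %d (%.1f%%), remaining: %.0f\n",
        failedSoFar, float64(failedSoFar)/budget*100, remaining)
}
```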
Defining SLIs and SLOs
1package main
2
3import (
4 "fmt"
5 "sync"
6 "time"
7
8 "github.com/prometheus/client_golang/prometheus"
9 "github.com/prometheus/client_golang/prometheus/promauto"
10)
11
12// SLI metrics
13var (
14 requestLatency = promauto.NewHistogramVec(
15 prometheus.HistogramOpts{
16 Name: "sli_request_latency_seconds",
17 Help: "Request latency SLI",
18 Buckets: []float64{0.01, 0.05, 0.1, 0.2, 0.5, 1.0}, // SLO thresholds
19 },
20 []string{"endpoint"},
21 )
22
23 requestSuccess = promauto.NewCounterVec(
24 prometheus.CounterOpts{
25 Name: "sli_requests_total",
26 Help: "Total requests by success/failure",
27 },
28 []string{"endpoint", "success"},
29 )
30
31 availabilityGauge = promauto.NewGaugeVec(
32 prometheus.GaugeOpts{
33 Name: "slo_availability_ratio",
34 Help: "Current availability ratio",
35 },
36 []string{"endpoint"},
37 )
38)
39
40// SLOTracker tracks SLO compliance
41type SLOTracker struct {
42 mu sync.RWMutex
43
44 // Configuration
45 latencyTarget time.Duration // e.g., 200ms
46 availabilityTarget float64 // e.g., 0.999
47
48 // Metrics
49 totalRequests int64
50 successRequests int64
51 fastRequests int64 // Requests under latency target
52}
53
54func NewSLOTracker(latencyTarget time.Duration, availabilityTarget float64) *SLOTracker {
55 return &SLOTracker{
56 latencyTarget: latencyTarget,
57 availabilityTarget: availabilityTarget,
58 }
59}
60
61// RecordRequest records a request for SLO tracking
62func (st *SLOTracker) RecordRequest(duration time.Duration, success bool) {
63 st.mu.Lock()
64 defer st.mu.Unlock()
65
66 st.totalRequests++
67
68 if success {
69 st.successRequests++
70 }
71
72 if duration <= st.latencyTarget {
73 st.fastRequests++
74 }
75
76 // Update Prometheus metrics
77 if success {
78 requestSuccess.WithLabelValues("api", "true").Inc()
79 } else {
80 requestSuccess.WithLabelValues("api", "false").Inc()
81 }
82
83 requestLatency.WithLabelValues("api").Observe(duration.Seconds())
84
85 // Calculate and expose availability ratio
86 availability := float64(st.successRequests) / float64(st.totalRequests)
87 availabilityGauge.WithLabelValues("api").Set(availability)
88}
89
90// Metrics returns current SLO metrics
91func (st *SLOTracker) Metrics() map[string]interface{} {
92 st.mu.RLock()
93 defer st.mu.RUnlock()
94
95 availability := float64(st.successRequests) / float64(st.totalRequests)
96 latencyCompliance := float64(st.fastRequests) / float64(st.totalRequests)
97
98 return map[string]interface{}{
99 "total_requests": st.totalRequests,
100 "success_requests": st.successRequests,
101 "availability": fmt.Sprintf("%.4f%%", availability*100),
102 "availability_slo": fmt.Sprintf("%.4f%%", st.availabilityTarget*100),
103 "availability_met": availability >= st.availabilityTarget,
104 "latency_compliance": fmt.Sprintf("%.4f%%", latencyCompliance*100),
105 "latency_target": st.latencyTarget,
106 }
107}
108
109func main() {
110 tracker := NewSLOTracker(200*time.Millisecond, 0.999)
111
112 // Simulate requests
113 for i := 0; i < 1000; i++ {
114 duration := time.Duration(50+i/5) * time.Millisecond
115 success := i%100 != 0 // 99% success rate
116
117 tracker.RecordRequest(duration, success)
118 }
119
120 // Print SLO metrics
121 metrics := tracker.Metrics()
122 fmt.Println("SLO Metrics:")
123 for k, v := range metrics {
124 fmt.Printf(" %s: %v\n", k, v)
125 }
126}
Common Pitfalls and When to Use What
When to Use Logs vs Metrics vs Traces
| Scenario | Best Tool | Why |
|---|---|---|
| "Why did this specific transaction fail?" | Logs | Detailed context for debugging |
| "Are we meeting our performance targets?" | Metrics | Trends and aggregations over time |
| "Why is the checkout flow slow?" | Traces | See the entire request journey |
| "How many users signed up today?" | Metrics | Simple counting and aggregation |
| "What error happened at 3:15 AM?" | Logs | Point-in-time investigation |
Common Pitfalls
❌ Over-logging everything
- Problem: High storage costs, hard to find relevant information
- Solution: Log at appropriate levels, use structured logging
❌ Too many metrics
- Problem: Metric overload, hard to identify what matters
- Solution: Start with the RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methodologies; see the sketch after this list
❌ Tracing everything
- Problem: Performance overhead, expensive storage
- Solution: Trace a sample of requests and focus on user-facing operations
❌ Setting alerts without context
- Problem: Alert fatigue, teams ignore notifications
- Solution: Use SLO-based alerting, include runbooks with alerts
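As a starting point for the RED method, here is a hedged sketch of the three request-centric metrics for an HTTP service using the Prometheus client; the metric names are illustrative, not prescribed by the methodology:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// RED: Rate, Errors, Duration - three request-centric metrics per service.
var (
	// Rate: total requests, broken down by route.
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total", Help: "Requests received"},
		[]string{"route"},
	)
	// Errors: failed requests, broken down by route and status class.
	httpErrors = promauto.NewCounterVec(
		prometheus.CounterOpts{Name: "http_request_errors_total", Help: "Failed requests"},
		[]string{"route", "status_class"},
	)
	// Duration: latency distribution per route.
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Request latency",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"route"},
	)
)

func main() {
	// Example observations for a single request to /api/users.
	httpRequests.WithLabelValues("/api/users").Inc()
	httpDuration.WithLabelValues("/api/users").Observe(0.042)
	httpErrors.WithLabelValues("/api/users", "5xx").Inc()
}
```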
💡 Key Takeaway: Start small and iterate. Begin with basic logging and key metrics, then add tracing as your system grows. The best observability setup is one you actually use and understand.
Further Reading
Tools
- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Jaeger - Distributed tracing
- ELK Stack - Elasticsearch, Logstash, Kibana for logs
Books
- Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda
- Site Reliability Engineering by Google
- The Art of Monitoring by James Turnbull
Standards
- OpenTelemetry - CNCF standard APIs and SDKs for traces, metrics, and logs
- W3C Trace Context - Standard HTTP headers for propagating trace context between services
Advanced Pattern: OpenTelemetry Integration
OpenTelemetry (OTel) is the industry-standard observability framework that unifies metrics, logs, and traces into a single instrumentation layer. Unlike vendor-specific solutions, OTel provides portable, standardized telemetry that works across any backend.
Why OpenTelemetry Matters:
- Vendor-neutral: Switch backends (Jaeger, Zipkin, Prometheus) without changing code
- Unified API: Single SDK for all three pillars of observability
- Auto-instrumentation: Automatic tracing for popular frameworks
- Context propagation: Seamless correlation across distributed systems
💡 Real-World Impact: Shopify migrated to OpenTelemetry and reduced instrumentation code by 60% while gaining unified observability across 2,000+ services.
Complete OpenTelemetry Setup
This example demonstrates a production-ready OpenTelemetry setup with all three pillars integrated:
1package main
2
3import (
4 "context"
5 "fmt"
6 "log"
7 "net/http"
8 "os"
9 "time"
10
11 "go.opentelemetry.io/otel"
12 "go.opentelemetry.io/otel/attribute"
13 "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
14 "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
15 "go.opentelemetry.io/otel/metric"
16 "go.opentelemetry.io/otel/propagation"
17 sdkmetric "go.opentelemetry.io/otel/sdk/metric"
18 "go.opentelemetry.io/otel/sdk/resource"
19 sdktrace "go.opentelemetry.io/otel/sdk/trace"
20 semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
21 "go.opentelemetry.io/otel/trace"
22)
23
24// OTelProvider manages all OpenTelemetry components
25type OTelProvider struct {
26 tracerProvider *sdktrace.TracerProvider
27 meterProvider  *sdkmetric.MeterProvider
28 resource *resource.Resource
29}
30
31// NewOTelProvider initializes a complete OpenTelemetry setup
32func NewOTelProvider(ctx context.Context, serviceName, serviceVersion string) (*OTelProvider, error) {
33 // Create resource with service identification
34 res, err := resource.New(ctx,
35 resource.WithAttributes(
36 semconv.ServiceName(serviceName),
37 semconv.ServiceVersion(serviceVersion),
38 semconv.DeploymentEnvironment(os.Getenv("ENV")),
39 ),
40 resource.WithHost(), // Add host information
41 resource.WithProcess(), // Add process information
42 resource.WithOS(), // Add OS information
43 resource.WithContainer(), // Add container information if running in one
44 )
45 if err != nil {
46 return nil, fmt.Errorf("failed to create resource: %w", err)
47 }
48
49 // Setup trace exporter
50 traceExporter, err := otlptracegrpc.New(ctx,
51 otlptracegrpc.WithEndpoint("localhost:4317"),
52 otlptracegrpc.WithInsecure(),
53 )
54 if err != nil {
55 return nil, fmt.Errorf("failed to create trace exporter: %w", err)
56 }
57
58 // Create tracer provider with sampling strategy
59 tracerProvider := sdktrace.NewTracerProvider(
60 sdktrace.WithBatcher(traceExporter,
61 sdktrace.WithMaxExportBatchSize(512),
62 sdktrace.WithBatchTimeout(5*time.Second),
63 ),
64 sdktrace.WithResource(res),
65 // Sample 10% of traces in production, 100% in dev
66 sdktrace.WithSampler(sdktrace.ParentBased(
67 sdktrace.TraceIDRatioBased(getSamplingRate()),
68 )),
69 )
70
71 // Setup metric exporter
72 metricExporter, err := otlpmetricgrpc.New(ctx,
73 otlpmetricgrpc.WithEndpoint("localhost:4317"),
74 otlpmetricgrpc.WithInsecure(),
75 )
76 if err != nil {
77 return nil, fmt.Errorf("failed to create metric exporter: %w", err)
78 }
79
80 // Create meter provider
81 meterProvider := sdkmetric.NewMeterProvider(
82 sdkmetric.WithReader(
83 sdkmetric.NewPeriodicReader(metricExporter,
84 sdkmetric.WithInterval(30*time.Second),
85 ),
86 ),
87 sdkmetric.WithResource(res),
88 )
89
90 // Set global providers
91 otel.SetTracerProvider(tracerProvider)
92 otel.SetMeterProvider(meterProvider)
93
94 // Set global propagator for context propagation across services
95 otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
96 propagation.TraceContext{},
97 propagation.Baggage{},
98 ))
99
100 return &OTelProvider{
101 tracerProvider: tracerProvider,
102 meterProvider: meterProvider,
103 resource: res,
104 }, nil
105}
106
107func getSamplingRate() float64 {
108 if os.Getenv("ENV") == "production" {
109 return 0.1 // 10% sampling in production
110 }
111 return 1.0 // 100% sampling in development
112}
113
114// Shutdown gracefully shuts down all providers
115func (p *OTelProvider) Shutdown(ctx context.Context) error {
116 if err := p.tracerProvider.Shutdown(ctx); err != nil {
117 return fmt.Errorf("failed to shutdown tracer provider: %w", err)
118 }
119 if err := p.meterProvider.Shutdown(ctx); err != nil {
120 return fmt.Errorf("failed to shutdown meter provider: %w", err)
121 }
122 return nil
123}
124
125// InstrumentedService demonstrates a fully instrumented service
126type InstrumentedService struct {
127 tracer trace.Tracer
128 meter metric.Meter
129
130 // Metrics
131 requestCounter metric.Int64Counter
132 requestDuration metric.Float64Histogram
133 activeRequests metric.Int64UpDownCounter
134}
135
136// NewInstrumentedService creates a service with complete OTel instrumentation
137func NewInstrumentedService(serviceName string) (*InstrumentedService, error) {
138 tracer := otel.Tracer(serviceName)
139 meter := otel.Meter(serviceName)
140
141 // Create metrics
142 requestCounter, err := meter.Int64Counter(
143 "http.server.requests",
144 metric.WithDescription("Total number of HTTP requests"),
145 metric.WithUnit("{request}"),
146 )
147 if err != nil {
148 return nil, err
149 }
150
151 requestDuration, err := meter.Float64Histogram(
152 "http.server.duration",
153 metric.WithDescription("HTTP request duration"),
154 metric.WithUnit("ms"),
155 )
156 if err != nil {
157 return nil, err
158 }
159
160 activeRequests, err := meter.Int64UpDownCounter(
161 "http.server.active_requests",
162 metric.WithDescription("Number of active HTTP requests"),
163 metric.WithUnit("{request}"),
164 )
165 if err != nil {
166 return nil, err
167 }
168
169 return &InstrumentedService{
170 tracer: tracer,
171 meter: meter,
172 requestCounter: requestCounter,
173 requestDuration: requestDuration,
174 activeRequests: activeRequests,
175 }, nil
176}
177
178// HandleRequest demonstrates complete request instrumentation
179func (s *InstrumentedService) HandleRequest(w http.ResponseWriter, r *http.Request) {
180 ctx := r.Context()
181 start := time.Now()
182
183 // Start a span for this request
184 ctx, span := s.tracer.Start(ctx, "HandleRequest",
185 trace.WithSpanKind(trace.SpanKindServer),
186 trace.WithAttributes(
187 semconv.HTTPMethod(r.Method),
188 semconv.HTTPRoute(r.URL.Path),
189 semconv.HTTPScheme(r.URL.Scheme),
190 semconv.NetHostName(r.Host),
191 ),
192 )
193 defer span.End()
194
195 // Track active requests
196 s.activeRequests.Add(ctx, 1)
197 defer s.activeRequests.Add(ctx, -1)
198
199 // Simulate processing with child spans
200 if err := s.processRequest(ctx); err != nil {
201 span.RecordError(err)
202 span.SetAttributes(attribute.Bool("error", true))
203 http.Error(w, err.Error(), http.StatusInternalServerError)
204 return
205 }
206
207 // Record metrics
208 duration := time.Since(start).Milliseconds()
209 attrs := []attribute.KeyValue{
210 attribute.String("method", r.Method),
211 attribute.String("route", r.URL.Path),
212 attribute.Int("status", http.StatusOK),
213 }
214
215 s.requestCounter.Add(ctx, 1, metric.WithAttributes(attrs...))
216 s.requestDuration.Record(ctx, float64(duration), metric.WithAttributes(attrs...))
217
218 // Set span attributes for success
219 span.SetAttributes(
220 semconv.HTTPStatusCode(http.StatusOK),
221 attribute.Int64("response.size", 100),
222 )
223
224 w.WriteHeader(http.StatusOK)
225 fmt.Fprintf(w, "Request processed successfully\n")
226}
227
228func (s *InstrumentedService) processRequest(ctx context.Context) error {
229 // Create child span for database operation
230 ctx, dbSpan := s.tracer.Start(ctx, "database.query",
231 trace.WithSpanKind(trace.SpanKindClient),
232 trace.WithAttributes(
233 attribute.String("db.system", "postgresql"),
234 attribute.String("db.operation", "SELECT"),
235 attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
236 ),
237 )
238 defer dbSpan.End()
239
240 // Simulate database query
241 time.Sleep(50 * time.Millisecond)
242
243 // Create child span for cache operation
244 ctx, cacheSpan := s.tracer.Start(ctx, "cache.get",
245 trace.WithSpanKind(trace.SpanKindClient),
246 trace.WithAttributes(
247 attribute.String("cache.system", "redis"),
248 attribute.String("cache.key", "user:123"),
249 ),
250 )
251 defer cacheSpan.End()
252
253 // Simulate cache lookup
254 time.Sleep(10 * time.Millisecond)
255
256 return nil
257}
258
259func main() {
260 ctx := context.Background()
261
262 // Initialize OpenTelemetry
263 provider, err := NewOTelProvider(ctx, "my-service", "1.0.0")
264 if err != nil {
265 log.Fatalf("Failed to initialize OpenTelemetry: %v", err)
266 }
267 defer func() {
268 if err := provider.Shutdown(ctx); err != nil {
269 log.Printf("Error shutting down provider: %v", err)
270 }
271 }()
272
273 // Create instrumented service
274 service, err := NewInstrumentedService("my-service")
275 if err != nil {
276 log.Fatalf("Failed to create service: %v", err)
277 }
278
279 // Setup HTTP server
280 http.HandleFunc("/api/users", service.HandleRequest)
281
282 fmt.Println("Server starting on :8080")
283 fmt.Println("Try: curl http://localhost:8080/api/users")
284 log.Fatal(http.ListenAndServe(":8080", nil))
285}
Key Patterns:
- Resource attributes: Service identification and environment context
- Sampling strategies: Balance observability with performance
- Context propagation: Trace requests across service boundaries
- Semantic conventions: Standard attribute naming for interoperability
- Graceful shutdown: Ensure all telemetry is flushed before exit (see the sketch below)
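A common refinement of the graceful-shutdown point, shown here as a minimal sketch rather than part of the setup above, is to bound shutdown with a deadline so a slow or unreachable collector cannot block process exit while batched spans are flushed:

```go
package main

import (
	"context"
	"log"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// shutdownWithTimeout flushes and stops the tracer provider, but gives up
// after the deadline so a hung exporter cannot stall process exit forever.
func shutdownWithTimeout(tp *sdktrace.TracerProvider, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := tp.Shutdown(ctx); err != nil {
		log.Printf("tracer provider shutdown: %v", err)
	}
}

func main() {
	// No exporter is configured in this sketch; in a real service this would
	// be the provider created during initialization.
	tp := sdktrace.NewTracerProvider()
	defer shutdownWithTimeout(tp, 5*time.Second)
}
```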
Context Propagation Across Services
When building distributed systems, propagating trace context across service boundaries is critical for end-to-end visibility:
1package main
2
3import (
4 "context"
5 "fmt"
6 "io"
7 "net/http"
8 "time"
9
10 "go.opentelemetry.io/otel"
11 "go.opentelemetry.io/otel/attribute"
12 "go.opentelemetry.io/otel/propagation"
13 "go.opentelemetry.io/otel/trace"
14)
15
16// Service A - Makes HTTP call to Service B
17type ServiceA struct {
18 tracer trace.Tracer
19 propagator propagation.TextMapPropagator
20 client *http.Client
21}
22
23func NewServiceA() *ServiceA {
24 return &ServiceA{
25 tracer: otel.Tracer("service-a"),
26 propagator: otel.GetTextMapPropagator(),
27 client: &http.Client{Timeout: 30 * time.Second},
28 }
29}
30
31func (s *ServiceA) MakeRequest(ctx context.Context, userID string) error {
32 // Start span in Service A
33 ctx, span := s.tracer.Start(ctx, "ServiceA.MakeRequest",
34 trace.WithAttributes(
35 attribute.String("user.id", userID),
36 ),
37 )
38 defer span.End()
39
40 // Create HTTP request
41 req, err := http.NewRequestWithContext(ctx, "GET", "http://service-b:8081/process", nil)
42 if err != nil {
43 span.RecordError(err)
44 return err
45 }
46
47 // CRITICAL: Inject trace context into HTTP headers
48 s.propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))
49
50 // Add custom headers
51 req.Header.Set("X-User-ID", userID)
52
53 span.AddEvent("Calling Service B")
54
55 // Make the call
56 resp, err := s.client.Do(req)
57 if err != nil {
58 span.RecordError(err)
59 return err
60 }
61 defer resp.Body.Close()
62
63 body, _ := io.ReadAll(resp.Body)
64 span.SetAttributes(
65 attribute.Int("http.status_code", resp.StatusCode),
66 attribute.Int("response.size", len(body)),
67 )
68
69 return nil
70}
71
72// Service B - Receives request from Service A
73type ServiceB struct {
74 tracer trace.Tracer
75 propagator propagation.TextMapPropagator
76}
77
78func NewServiceB() *ServiceB {
79 return &ServiceB{
80 tracer: otel.Tracer("service-b"),
81 propagator: otel.GetTextMapPropagator(),
82 }
83}
84
85func (s *ServiceB) HandleRequest(w http.ResponseWriter, r *http.Request) {
86 // CRITICAL: Extract trace context from HTTP headers
87 ctx := s.propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
88
89 // Start span in Service B - this will be a child of Service A's span
90 ctx, span := s.tracer.Start(ctx, "ServiceB.HandleRequest",
91 trace.WithSpanKind(trace.SpanKindServer),
92 )
93 defer span.End()
94
95 userID := r.Header.Get("X-User-ID")
96 span.SetAttributes(attribute.String("user.id", userID))
97
98 // Process the request
99 if err := s.processData(ctx, userID); err != nil {
100 span.RecordError(err)
101 http.Error(w, err.Error(), http.StatusInternalServerError)
102 return
103 }
104
105 w.WriteHeader(http.StatusOK)
106 fmt.Fprintf(w, "Processed by Service B")
107}
108
109func (s *ServiceB) processData(ctx context.Context, userID string) error {
110 // This span will be a child of HandleRequest
111 _, span := s.tracer.Start(ctx, "ServiceB.processData")
112 defer span.End()
113
114 // Simulate processing
115 time.Sleep(100 * time.Millisecond)
116 return nil
117}
118
119func main() {
120 // In a real application, you would run Service A and Service B separately
121 // This demonstrates the trace context propagation pattern
122 fmt.Println("OpenTelemetry context propagation example")
123 fmt.Println("In production, Service A and B would be separate processes")
124}
Context Propagation Best Practices:
- Always use `propagation.TextMapPropagator` for HTTP/gRPC (or let instrumentation such as `otelhttp` handle it, as sketched below)
- Inject context on the client side before making calls
- Extract context on the server side as the first operation
- Use `trace.SpanKind` to distinguish client vs server spans
- Include relevant attributes at each service boundary
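The example above injects and extracts headers by hand to make the mechanism visible. As a hedged alternative sketch, the same `otelhttp` contrib package used for the server earlier can also wrap the HTTP client's transport so injection happens automatically (the service-b URL below is the placeholder host from the example, not a real endpoint):

```go
package main

import (
	"context"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Wrapping the transport makes every outgoing request carry the current
	// span context in its headers, so no manual propagator.Inject is needed.
	client := &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
		Timeout:   10 * time.Second,
	}

	// The request context must carry the active span for propagation to work.
	req, err := http.NewRequestWithContext(context.Background(),
		http.MethodGet, "http://service-b:8081/process", nil)
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		// The call fails unless a service is actually listening at service-b.
		return
	}
	defer resp.Body.Close()
}
```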
Advanced Pattern: Custom Metrics and Instrumentation
While standard metrics like request count and duration are essential, production systems need custom business metrics and advanced instrumentation patterns to gain deep operational insights.
Business Metrics and Custom Collectors
Business metrics track domain-specific KPIs that matter to your organization:
1package main
2
3import (
4 "fmt"
5 "net/http"
6 "sync"
7 "time"
8
9 "github.com/prometheus/client_golang/prometheus"
10 "github.com/prometheus/client_golang/prometheus/promhttp"
11 // Metrics below are registered on a custom registry created in main.
12)
13
14// BusinessMetrics tracks domain-specific KPIs
15type BusinessMetrics struct {
16 // Revenue metrics
17 revenueTotal prometheus.Counter
18 revenueByPlan *prometheus.CounterVec
19
20 // User engagement
21 activeUsers prometheus.Gauge
22 sessionDuration prometheus.Histogram
23 featureUsage *prometheus.CounterVec
24
25 // Business operations
26 checkoutsStarted prometheus.Counter
27 checkoutsCompleted prometheus.Counter
28 checkoutsFailed *prometheus.CounterVec
29 cartValue prometheus.Histogram
30
31 // SaaS-specific
32 trialConversions prometheus.Counter
33 churnEvents prometheus.Counter
34 lifetimeValue prometheus.Histogram
35}
36
37// NewBusinessMetrics initializes business-level metrics
38func NewBusinessMetrics(reg prometheus.Registerer) *BusinessMetrics {
39 bm := &BusinessMetrics{
40 revenueTotal: prometheus.NewCounter(prometheus.CounterOpts{
41 Name: "business_revenue_total_dollars",
42 Help: "Total revenue in dollars",
43 }),
44
45 revenueByPlan: prometheus.NewCounterVec(
46 prometheus.CounterOpts{
47 Name: "business_revenue_by_plan_dollars",
48 Help: "Revenue by subscription plan",
49 },
50 []string{"plan_name", "billing_cycle"},
51 ),
52
53 activeUsers: prometheus.NewGauge(prometheus.GaugeOpts{
54 Name: "business_active_users",
55 Help: "Current number of active users",
56 }),
57
58 sessionDuration: prometheus.NewHistogram(prometheus.HistogramOpts{
59 Name: "business_session_duration_seconds",
60 Help: "User session duration",
61 Buckets: []float64{60, 300, 900, 1800, 3600, 7200}, // 1m, 5m, 15m, 30m, 1h, 2h
62 }),
63
64 featureUsage: prometheus.NewCounterVec(
65 prometheus.CounterOpts{
66 Name: "business_feature_usage_total",
67 Help: "Usage count by feature",
68 },
69 []string{"feature_name", "user_plan"},
70 ),
71
72 checkoutsStarted: prometheus.NewCounter(prometheus.CounterOpts{
73 Name: "business_checkouts_started_total",
74 Help: "Number of checkout processes started",
75 }),
76
77 checkoutsCompleted: prometheus.NewCounter(prometheus.CounterOpts{
78 Name: "business_checkouts_completed_total",
79 Help: "Number of successful checkouts",
80 }),
81
82 checkoutsFailed: prometheus.NewCounterVec(
83 prometheus.CounterOpts{
84 Name: "business_checkouts_failed_total",
85 Help: "Failed checkouts by reason",
86 },
87 []string{"failure_reason"},
88 ),
89
90 cartValue: prometheus.NewHistogram(prometheus.HistogramOpts{
91 Name: "business_cart_value_dollars",
92 Help: "Shopping cart value distribution",
93 Buckets: prometheus.ExponentialBuckets(10, 2, 10), // 10, 20, 40, ..., 5120
94 }),
95
96 trialConversions: prometheus.NewCounter(prometheus.CounterOpts{
97 Name: "business_trial_conversions_total",
98 Help: "Number of trial users who converted to paid",
99 }),
100
101 churnEvents: prometheus.NewCounter(prometheus.CounterOpts{
102 Name: "business_churn_events_total",
103 Help: "Number of user churn events",
104 }),
105
106 lifetimeValue: prometheus.NewHistogram(prometheus.HistogramOpts{
107 Name: "business_customer_lifetime_value_dollars",
108 Help: "Customer lifetime value",
109 Buckets: []float64{100, 500, 1000, 5000, 10000, 50000},
110 }),
111 }
112
113 // Register all metrics
114 reg.MustRegister(
115 bm.revenueTotal,
116 bm.revenueByPlan,
117 bm.activeUsers,
118 bm.sessionDuration,
119 bm.featureUsage,
120 bm.checkoutsStarted,
121 bm.checkoutsCompleted,
122 bm.checkoutsFailed,
123 bm.cartValue,
124 bm.trialConversions,
125 bm.churnEvents,
126 bm.lifetimeValue,
127 )
128
129 return bm
130}
131
132// TrackRevenue records a revenue event
133func (bm *BusinessMetrics) TrackRevenue(amount float64, plan, billingCycle string) {
134 bm.revenueTotal.Add(amount)
135 bm.revenueByPlan.WithLabelValues(plan, billingCycle).Add(amount)
136}
137
138// TrackCheckout records checkout funnel progression
139func (bm *BusinessMetrics) TrackCheckout(stage string, cartValue float64, failureReason string) {
140 switch stage {
141 case "started":
142 bm.checkoutsStarted.Inc()
143 bm.cartValue.Observe(cartValue)
144 case "completed":
145 bm.checkoutsCompleted.Inc()
146 case "failed":
147 bm.checkoutsFailed.WithLabelValues(failureReason).Inc()
148 }
149}
150
151// TrackFeatureUsage records feature usage by plan
152func (bm *BusinessMetrics) TrackFeatureUsage(feature, userPlan string) {
153 bm.featureUsage.WithLabelValues(feature, userPlan).Inc()
154}
155
156// Custom Collector for complex metrics
157type DatabaseStatsCollector struct {
158 db *DatabasePool // Your database connection pool
159
160 // Descriptors
161 connTotal *prometheus.Desc
162 connIdle *prometheus.Desc
163 connInUse *prometheus.Desc
164 queryDuration *prometheus.Desc
165}
166
167type DatabasePool struct {
168 mu sync.RWMutex
169 totalConns int
170 idleConns int
171 inUseConns int
172 queryStats map[string]time.Duration
173}
174
175// NewDatabaseStatsCollector creates a custom collector
176func NewDatabaseStatsCollector(db *DatabasePool) *DatabaseStatsCollector {
177 return &DatabaseStatsCollector{
178 db: db,
179 connTotal: prometheus.NewDesc(
180 "db_connections_total",
181 "Total number of database connections",
182 nil, nil,
183 ),
184 connIdle: prometheus.NewDesc(
185 "db_connections_idle",
186 "Number of idle connections",
187 nil, nil,
188 ),
189 connInUse: prometheus.NewDesc(
190 "db_connections_in_use",
191 "Number of connections in use",
192 nil, nil,
193 ),
194 queryDuration: prometheus.NewDesc(
195 "db_query_duration_seconds",
196 "Query execution time by query type",
197 []string{"query_type"}, nil,
198 ),
199 }
200}
201
202// Describe implements prometheus.Collector
203func (c *DatabaseStatsCollector) Describe(ch chan<- *prometheus.Desc) {
204 ch <- c.connTotal
205 ch <- c.connIdle
206 ch <- c.connInUse
207 ch <- c.queryDuration
208}
209
210// Collect implements prometheus.Collector
211func (c *DatabaseStatsCollector) Collect(ch chan<- prometheus.Metric) {
212 c.db.mu.RLock()
213 defer c.db.mu.RUnlock()
214
215 // Collect connection pool metrics
216 ch <- prometheus.MustNewConstMetric(
217 c.connTotal,
218 prometheus.GaugeValue,
219 float64(c.db.totalConns),
220 )
221
222 ch <- prometheus.MustNewConstMetric(
223 c.connIdle,
224 prometheus.GaugeValue,
225 float64(c.db.idleConns),
226 )
227
228 ch <- prometheus.MustNewConstMetric(
229 c.connInUse,
230 prometheus.GaugeValue,
231 float64(c.db.inUseConns),
232 )
233
234 // Collect query duration metrics
235 for queryType, duration := range c.db.queryStats {
236 ch <- prometheus.MustNewConstMetric(
237 c.queryDuration,
238 prometheus.GaugeValue,
239 duration.Seconds(),
240 queryType,
241 )
242 }
243}
244
245func main() {
246 // Create a custom registry
247 reg := prometheus.NewRegistry()
248
249 // Initialize business metrics
250 businessMetrics := NewBusinessMetrics(reg)
251
252 // Initialize database pool and custom collector
253 dbPool := &DatabasePool{
254 totalConns: 100,
255 idleConns: 80,
256 inUseConns: 20,
257 queryStats: map[string]time.Duration{
258 "SELECT": 50 * time.Millisecond,
259 "INSERT": 100 * time.Millisecond,
260 "UPDATE": 75 * time.Millisecond,
261 },
262 }
263 reg.MustRegister(NewDatabaseStatsCollector(dbPool))
264
265 // Simulate business events
266 go func() {
267 for {
268 // Track revenue
269 businessMetrics.TrackRevenue(99.99, "premium", "monthly")
270 businessMetrics.TrackRevenue(999.99, "enterprise", "yearly")
271
272 // Track checkout funnel
273 businessMetrics.TrackCheckout("started", 149.99, "")
274 businessMetrics.TrackCheckout("completed", 149.99, "")
275 businessMetrics.TrackCheckout("failed", 299.99, "payment_declined")
276
277 // Track feature usage
278 businessMetrics.TrackFeatureUsage("export_pdf", "premium")
279 businessMetrics.TrackFeatureUsage("api_access", "enterprise")
280
281 // Update active users
282 businessMetrics.activeUsers.Set(1250)
283
284 time.Sleep(5 * time.Second)
285 }
286 }()
287
288 // Expose metrics endpoint
289 http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
290 Registry: reg,
291 }))
292
293 fmt.Println("Business metrics available at http://localhost:9090/metrics")
294 http.ListenAndServe(":9090", nil)
295}
Custom Metrics Best Practices:
- Business alignment: Track metrics that matter to stakeholders
- Cardinality control: Avoid high-cardinality labels such as user IDs or timestamps (see the sketch after this list)
- Custom collectors: Use for complex metrics computed on-demand
- Naming conventions: Follow Prometheus naming guidelines
- Documentation: Add detailed help text to all metrics
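To illustrate the cardinality point: every distinct label combination becomes its own time series, so label values should come from small, bounded sets. A hedged sketch with illustrative metric names:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Risky: user_id is unbounded, so this can create one series per user
	// and overwhelm Prometheus memory and query performance.
	loginsByUser = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "logins_total_by_user", Help: "Logins per user"},
		[]string{"user_id"},
	)

	// Safer: label values come from small, fixed sets; per-user detail
	// belongs in logs or traces, not in metric labels.
	logins = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "logins_total", Help: "Logins by plan and method"},
		[]string{"plan", "auth_method"},
	)
)

func main() {
	// Only the bounded-label metric is registered in this sketch.
	prometheus.MustRegister(logins)
	logins.WithLabelValues("premium", "password").Inc()
	_ = loginsByUser // kept only to contrast the anti-pattern
}
```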
Advanced Instrumentation Patterns
Production systems need sophisticated instrumentation beyond basic counters:
1package main
2
3import (
4 "fmt"
5 "sync"
6 "time"
7
8 // promauto registers these metrics on the default Prometheus registry.
9 "github.com/prometheus/client_golang/prometheus"
10 "github.com/prometheus/client_golang/prometheus/promauto"
11)
12
13// ExemplarInstrumentation demonstrates exemplar support for linking metrics to traces
14type ExemplarInstrumentation struct {
15 requestDuration *prometheus.HistogramVec
16}
17
18func NewExemplarInstrumentation() *ExemplarInstrumentation {
19 return &ExemplarInstrumentation{
20 requestDuration: promauto.NewHistogramVec(
21 prometheus.HistogramOpts{
22 Name: "http_request_duration_seconds",
23 Help: "Request duration with trace exemplars",
24 Buckets: prometheus.DefBuckets,
25 // Enable native histograms for better precision
26 NativeHistogramBucketFactor: 1.1,
27 },
28 []string{"method", "endpoint"},
29 ),
30 }
31}
32
33// RecordRequest records a request with trace exemplar
34func (ei *ExemplarInstrumentation) RecordRequest(method, endpoint string, duration float64, traceID string) {
35 // Record observation with exemplar linking to trace
36 observer := ei.requestDuration.WithLabelValues(method, endpoint)
37 observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
38 duration,
39 prometheus.Labels{"trace_id": traceID},
40 )
41}
42
43// RateLimitMetrics tracks rate limiting with sliding windows
44type RateLimitMetrics struct {
45 mu sync.RWMutex
46
47 allowed *prometheus.CounterVec
48 rejected *prometheus.CounterVec
49 current *prometheus.GaugeVec
50 limit *prometheus.GaugeVec
51
52 // Sliding window state
53 windows map[string]*slidingWindow
54}
55
56type slidingWindow struct {
57 requests []time.Time
58 limit int
59 window time.Duration
60}
61
62func NewRateLimitMetrics() *RateLimitMetrics {
63 return &RateLimitMetrics{
64 allowed: promauto.NewCounterVec(
65 prometheus.CounterOpts{
66 Name: "ratelimit_requests_allowed_total",
67 Help: "Requests allowed through rate limiter",
68 },
69 []string{"client_id", "endpoint"},
70 ),
71 rejected: promauto.NewCounterVec(
72 prometheus.CounterOpts{
73 Name: "ratelimit_requests_rejected_total",
74 Help: "Requests rejected by rate limiter",
75 },
76 []string{"client_id", "endpoint", "reason"},
77 ),
78 current: promauto.NewGaugeVec(
79 prometheus.GaugeOpts{
80 Name: "ratelimit_current_usage",
81 Help: "Current rate limit usage",
82 },
83 []string{"client_id", "endpoint"},
84 ),
85 limit: promauto.NewGaugeVec(
86 prometheus.GaugeOpts{
87 Name: "ratelimit_limit",
88 Help: "Rate limit threshold",
89 },
90 []string{"client_id", "endpoint"},
91 ),
92 windows: make(map[string]*slidingWindow),
93 }
94}
95
96func (rlm *RateLimitMetrics) CheckAndRecord(clientID, endpoint string, limit int, window time.Duration) bool {
97 rlm.mu.Lock()
98 defer rlm.mu.Unlock()
99
100 key := fmt.Sprintf("%s:%s", clientID, endpoint)
101
102 // Initialize window if not exists
103 if _, exists := rlm.windows[key]; !exists {
104 rlm.windows[key] = &slidingWindow{
105 requests: make([]time.Time, 0),
106 limit: limit,
107 window: window,
108 }
109 rlm.limit.WithLabelValues(clientID, endpoint).Set(float64(limit))
110 }
111
112 sw := rlm.windows[key]
113 now := time.Now()
114
115 // Remove old requests outside window
116 cutoff := now.Add(-window)
117 validRequests := make([]time.Time, 0)
118 for _, t := range sw.requests {
119 if t.After(cutoff) {
120 validRequests = append(validRequests, t)
121 }
122 }
123 sw.requests = validRequests
124
125 // Check if under limit
126 currentUsage := len(sw.requests)
127 rlm.current.WithLabelValues(clientID, endpoint).Set(float64(currentUsage))
128
129 if currentUsage >= limit {
130 rlm.rejected.WithLabelValues(clientID, endpoint, "rate_exceeded").Inc()
131 return false
132 }
133
134 // Allow request
135 sw.requests = append(sw.requests, now)
136 rlm.allowed.WithLabelValues(clientID, endpoint).Inc()
137 return true
138}
139
140// CacheMetrics demonstrates advanced cache instrumentation
141type CacheMetrics struct {
142 hits *prometheus.CounterVec
143 misses *prometheus.CounterVec
144 evictions *prometheus.CounterVec
145 size prometheus.Gauge
146 capacity prometheus.Gauge
147 hitRate *prometheus.GaugeVec
148 avgLoadTime prometheus.Histogram
149
150 // State for hit rate calculation
151 mu sync.RWMutex
152 hitCount map[string]int64
153 totalCount map[string]int64
154}
155
156func NewCacheMetrics() *CacheMetrics {
157 cm := &CacheMetrics{
158 hits: promauto.NewCounterVec(
159 prometheus.CounterOpts{
160 Name: "cache_hits_total",
161 Help: "Cache hits by operation",
162 },
163 []string{"cache_name", "operation"},
164 ),
165 misses: promauto.NewCounterVec(
166 prometheus.CounterOpts{
167 Name: "cache_misses_total",
168 Help: "Cache misses by operation",
169 },
170 []string{"cache_name", "operation"},
171 ),
172 evictions: promauto.NewCounterVec(
173 prometheus.CounterOpts{
174 Name: "cache_evictions_total",
175 Help: "Cache evictions by reason",
176 },
177 []string{"cache_name", "reason"},
178 ),
179 size: promauto.NewGauge(
180 prometheus.GaugeOpts{
181 Name: "cache_size_bytes",
182 Help: "Current cache size in bytes",
183 },
184 ),
185 capacity: promauto.NewGauge(
186 prometheus.GaugeOpts{
187 Name: "cache_capacity_bytes",
188 Help: "Maximum cache capacity",
189 },
190 ),
191 hitRate: promauto.NewGaugeVec(
192 prometheus.GaugeOpts{
193 Name: "cache_hit_rate",
194 Help: "Cache hit rate (0-1)",
195 },
196 []string{"cache_name"},
197 ),
198 avgLoadTime: promauto.NewHistogram(
199 prometheus.HistogramOpts{
200 Name: "cache_load_duration_seconds",
201 Help: "Time to load cache misses",
202 Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
203 },
204 ),
205 hitCount: make(map[string]int64),
206 totalCount: make(map[string]int64),
207 }
208
209 // Start background goroutine to compute hit rates
210 go cm.computeHitRates()
211
212 return cm
213}
214
215func (cm *CacheMetrics) RecordLookup(cacheName, operation string, hit bool, loadTime time.Duration) {
216 cm.mu.Lock()
217 cm.totalCount[cacheName]++
218 if hit {
219 cm.hitCount[cacheName]++
220 cm.hits.WithLabelValues(cacheName, operation).Inc()
221 } else {
222 cm.misses.WithLabelValues(cacheName, operation).Inc()
223 cm.avgLoadTime.Observe(loadTime.Seconds())
224 }
225 cm.mu.Unlock()
226}
227
228func (cm *CacheMetrics) computeHitRates() {
229 ticker := time.NewTicker(10 * time.Second)
230 defer ticker.Stop()
231
232 for range ticker.C {
233 cm.mu.RLock()
234 for cacheName, total := range cm.totalCount {
235 if total > 0 {
236 hitRate := float64(cm.hitCount[cacheName]) / float64(total)
237 cm.hitRate.WithLabelValues(cacheName).Set(hitRate)
238 }
239 }
240 cm.mu.RUnlock()
241 }
242}
243
244func main() {
245 // Initialize instrumentation
246 exemplarInst := NewExemplarInstrumentation()
247 rateLimitMetrics := NewRateLimitMetrics()
248 cacheMetrics := NewCacheMetrics()
249
250 // Simulate exemplar recording
251 exemplarInst.RecordRequest("GET", "/api/users", 0.125, "trace-id-123")
252
253 // Simulate rate limiting
254 allowed := rateLimitMetrics.CheckAndRecord("client-1", "/api", 100, time.Minute)
255 fmt.Printf("Request allowed: %v\n", allowed)
256
257 // Simulate cache operations
258 cacheMetrics.RecordLookup("user-cache", "get", true, 0)
259 cacheMetrics.RecordLookup("user-cache", "get", false, 50*time.Millisecond)
260 cacheMetrics.size.Set(1024 * 1024 * 50) // 50MB
261 cacheMetrics.capacity.Set(1024 * 1024 * 100) // 100MB
262
263 fmt.Println("Advanced instrumentation patterns initialized")
264 fmt.Println("Expose the default registry via promhttp to scrape these metrics")
265
266 select {} // Keep running
267}
Advanced Pattern: Distributed Tracing Best Practices
Distributed tracing is powerful but can introduce performance overhead and storage costs. These patterns help you maximize value while minimizing impact.
Intelligent Sampling Strategies
Not all traces are equally valuable. Intelligent sampling captures the traces you need while reducing overhead:
1package main
2
3import (
4 "fmt"
5 "math/rand"
6 "strings"
7 "time"
8
9 "go.opentelemetry.io/otel/attribute"
10 // The samplers below implement the SDK's trace.Sampler interface,
11 // so only the sdk/trace package is needed here.
12 "go.opentelemetry.io/otel/sdk/trace"
13)
14
15// AdaptiveSampler adjusts sampling rate based on system load
16type AdaptiveSampler struct {
17 baseSampleRate float64
18 errorSampleRate float64
19 slowSampleRate float64
20 slowThreshold time.Duration
21 loadThreshold float64
22
23 currentLoad float64
24}
25
26func NewAdaptiveSampler(baseRate, errorRate, slowRate float64, slowThreshold time.Duration) *AdaptiveSampler {
27 return &AdaptiveSampler{
28 baseSampleRate: baseRate,
29 errorSampleRate: errorRate,
30 slowSampleRate: slowRate,
31 slowThreshold: slowThreshold,
32 loadThreshold: 0.8, // 80% load
33 }
34}
35
36// ShouldSample implements custom sampling logic
37func (as *AdaptiveSampler) ShouldSample(params trace.SamplingParameters) trace.SamplingResult {
38 // Always sample errors
39 if hasError(params.Attributes) {
40 return trace.SamplingResult{
41 Decision: trace.RecordAndSample,
42 Attributes: []attribute.KeyValue{attribute.Bool("sampled.error", true)},
43 }
44 }
45
46 // Always sample slow requests
47 if duration := getDuration(params.Attributes); duration > as.slowThreshold {
48 return trace.SamplingResult{
49 Decision: trace.RecordAndSample,
50 Attributes: []attribute.KeyValue{attribute.Bool("sampled.slow", true)},
51 }
52 }
53
54 // Sample high-value endpoints more frequently
55 if isHighValueEndpoint(params.Name) {
56 if rand.Float64() < as.slowSampleRate {
57 return trace.SamplingResult{
58 Decision: trace.RecordAndSample,
59 Attributes: []attribute.KeyValue{attribute.Bool("sampled.high_value", true)},
60 }
61 }
62 }
63
64 // Reduce sampling under high load
65 effectiveRate := as.baseSampleRate
66 if as.currentLoad > as.loadThreshold {
67 effectiveRate = as.baseSampleRate * 0.5 // Cut sampling in half
68 }
69
70 if rand.Float64() < effectiveRate {
71 return trace.SamplingResult{
72 Decision: trace.RecordAndSample,
73 }
74 }
75
76 return trace.SamplingResult{
77 Decision: trace.Drop,
78 }
79}
80
81func (as *AdaptiveSampler) Description() string {
82 return "AdaptiveSampler"
83}
84
85func hasError(attrs []attribute.KeyValue) bool {
86 for _, attr := range attrs {
87 if attr.Key == "error" && attr.Value.AsBool() {
88 return true
89 }
90 }
91 return false
92}
93
94func getDuration(attrs []attribute.KeyValue) time.Duration {
95 for _, attr := range attrs {
96 if attr.Key == "duration" {
97 return time.Duration(attr.Value.AsInt64())
98 }
99 }
100 return 0
101}
102
103func isHighValueEndpoint(name string) bool {
104 highValue := []string{"/checkout", "/payment", "/api/orders"}
105 for _, endpoint := range highValue {
106 if strings.Contains(name, endpoint) {
107 return true
108 }
109 }
110 return false
111}
112
113// PrioritySampler samples based on business priority
114type PrioritySampler struct {
115 strategies map[string]float64 // endpoint -> sample rate
116 defaultRate float64
117}
118
119func NewPrioritySampler() *PrioritySampler {
120 return &PrioritySampler{
121 strategies: map[string]float64{
122 // Critical business paths - 100% sampling
123 "/checkout": 1.0,
124 "/payment": 1.0,
125 "/api/orders": 1.0,
126
127 // Important but high volume - 50% sampling
128 "/api/users": 0.5,
129 "/api/products": 0.5,
130
131 // Background jobs - 10% sampling
132 "/jobs/cleanup": 0.1,
133 "/jobs/reports": 0.1,
134
135 // Health checks - 1% sampling
136 "/health": 0.01,
137 "/ready": 0.01,
138 },
139 defaultRate: 0.1, // 10% for unknown endpoints
140 }
141}
142
143func (ps *PrioritySampler) ShouldSample(params trace.SamplingParameters) trace.SamplingResult {
144 rate := ps.defaultRate
145
146 // Find matching strategy
147 for endpoint, r := range ps.strategies {
148 if strings.Contains(params.Name, endpoint) {
149 rate = r
150 break
151 }
152 }
153
154 if rand.Float64() < rate {
155 return trace.SamplingResult{
156 Decision: trace.RecordAndSample,
157 Attributes: []attribute.KeyValue{
158 attribute.Float64("sample.rate", rate),
159 },
160 }
161 }
162
163 return trace.SamplingResult{
164 Decision: trace.Drop,
165 }
166}
167
168func (ps *PrioritySampler) Description() string {
169 return "PrioritySampler"
170}
171
172// Example: Using custom samplers
173func main() {
174 // Create adaptive sampler
175 adaptiveSampler := NewAdaptiveSampler(
176 0.1, // 10% base rate
177 1.0, // 100% error sampling
178 0.5, // 50% slow request sampling
179 500*time.Millisecond, // 500ms slow threshold
180 )
181
182 // Create priority sampler
183 prioritySampler := NewPrioritySampler()
184
185 fmt.Printf("Adaptive sampler: %s\n", adaptiveSampler.Description())
186 fmt.Printf("Priority sampler: %s\n", prioritySampler.Description())
187
188 // In production, you would use these with:
189 // trace.WithSampler(adaptiveSampler)
190}
Trace Correlation and Root Cause Analysis
When incidents occur, quickly correlating traces with logs and metrics is critical:
1package main
2
3import (
4 "context"
5 "fmt"
6 "log/slog"
7 "time"
8
9 "go.opentelemetry.io/otel"
10 "go.opentelemetry.io/otel/attribute"
11 "go.opentelemetry.io/otel/codes"
12 "go.opentelemetry.io/otel/trace"
13)
14
15// CorrelatedObservability ties together traces, logs, and metrics
16type CorrelatedObservability struct {
17 logger *slog.Logger
18 tracer trace.Tracer
19}
20
21func NewCorrelatedObservability() *CorrelatedObservability {
22 return &CorrelatedObservability{
23 logger: slog.Default(),
24 tracer: otel.Tracer("correlation-example"),
25 }
26}
27
28// ProcessRequest demonstrates full correlation
29func (co *CorrelatedObservability) ProcessRequest(ctx context.Context, requestID string) error {
30 // Start trace
31 ctx, span := co.tracer.Start(ctx, "ProcessRequest",
32 trace.WithAttributes(
33 attribute.String("request.id", requestID),
34 ),
35 )
36 defer span.End()
37
38 // Extract trace context for correlation
39 traceID := span.SpanContext().TraceID().String()
40 spanID := span.SpanContext().SpanID().String()
41
42 // Create logger with trace context
43 logger := co.logger.With(
44 slog.String("trace_id", traceID),
45 slog.String("span_id", spanID),
46 slog.String("request_id", requestID),
47 )
48
49 logger.Info("Processing request started")
50
51 // Simulate processing steps
52 if err := co.validateInput(ctx, logger, requestID); err != nil {
53 span.RecordError(err)
54 span.SetStatus(codes.Error, "Validation failed")
55 logger.Error("Validation failed",
56 slog.String("error", err.Error()),
57 slog.String("stage", "validation"),
58 )
59 return err
60 }
61
62 if err := co.processData(ctx, logger, requestID); err != nil {
63 span.RecordError(err)
64 span.SetStatus(codes.Error, "Processing failed")
65 logger.Error("Processing failed",
66 slog.String("error", err.Error()),
67 slog.String("stage", "processing"),
68 )
69 return err
70 }
71
72 logger.Info("Processing request completed")
73 span.SetStatus(codes.Ok, "Success")
74 return nil
75}
76
77func (co *CorrelatedObservability) validateInput(ctx context.Context, logger *slog.Logger, requestID string) error {
78 ctx, span := co.tracer.Start(ctx, "ValidateInput")
79 defer span.End()
80
81 logger.Debug("Validating input")
82 time.Sleep(10 * time.Millisecond)
83
84 // Add business context to span
85 span.SetAttributes(
86 attribute.String("validation.type", "schema"),
87 attribute.Int("validation.fields", 5),
88 )
89
90 return nil
91}
92
93func (co *CorrelatedObservability) processData(ctx context.Context, logger *slog.Logger, requestID string) error {
94 ctx, span := co.tracer.Start(ctx, "ProcessData")
95 defer span.End()
96
97 // Add span events for important milestones
98 span.AddEvent("Starting database query")
99 logger.Debug("Querying database")
100 time.Sleep(50 * time.Millisecond)
101
102 span.AddEvent("Database query complete",
103 trace.WithAttributes(
104 attribute.Int("rows.returned", 10),
105 ),
106 )
107
108 span.AddEvent("Starting cache update")
109 logger.Debug("Updating cache")
110 time.Sleep(20 * time.Millisecond)
111
112 span.AddEvent("Cache update complete")
113
114 return nil
115}
116
117// AnomalyDetector identifies unusual trace patterns
118type AnomalyDetector struct {
119 baselineLatency map[string]time.Duration
120 baselineErrors map[string]float64
121}
122
123func NewAnomalyDetector() *AnomalyDetector {
124 return &AnomalyDetector{
125 baselineLatency: map[string]time.Duration{
126 "/api/users": 100 * time.Millisecond,
127 "/api/orders": 200 * time.Millisecond,
128 "/checkout": 500 * time.Millisecond,
129 },
130 baselineErrors: map[string]float64{
131 "/api/users": 0.01, // 1% error rate
132 "/api/orders": 0.005, // 0.5% error rate
133 "/checkout": 0.02, // 2% error rate
134 },
135 }
136}
137
138// AnalyzeTrace detects anomalies in trace data
139func (ad *AnomalyDetector) AnalyzeTrace(operation string, duration time.Duration, errorOccurred bool) []string {
140 anomalies := make([]string, 0)
141
142 // Check latency anomaly
143 if baseline, exists := ad.baselineLatency[operation]; exists {
144 // Flag if 3x slower than baseline
145 if duration > baseline*3 {
146 anomalies = append(anomalies, fmt.Sprintf(
147 "LATENCY_ANOMALY: %s took %v (baseline: %v, ratio: %.2fx)",
148 operation, duration, baseline, float64(duration)/float64(baseline),
149 ))
150 }
151 }
152
153 // Check error rate anomaly
154 if errorOccurred {
155 if baseline, exists := ad.baselineErrors[operation]; exists {
156 // This would typically track error rate over time
157 // Simplified for example
158 anomalies = append(anomalies, fmt.Sprintf(
159 "ERROR_ANOMALY: %s failed (baseline error rate: %.2f%%)",
160 operation, baseline*100,
161 ))
162 }
163 }
164
165 return anomalies
166}
167
168func main() {
169 ctx := context.Background()
170
171 // Initialize correlated observability
172 co := NewCorrelatedObservability()
173
174 // Process request with full correlation
175 if err := co.ProcessRequest(ctx, "req-12345"); err != nil {
176 fmt.Printf("Request failed: %v\n", err)
177 }
178
179 // Demonstrate anomaly detection
180 detector := NewAnomalyDetector()
181 anomalies := detector.AnalyzeTrace("/api/users", 350*time.Millisecond, false)
182 if len(anomalies) > 0 {
183 fmt.Println("Anomalies detected:")
184 for _, a := range anomalies {
185 fmt.Printf(" - %s\n", a)
186 }
187 }
188
189 fmt.Println("Distributed tracing best practices demonstrated")
190}
Trace Optimization Best Practices:
- Intelligent sampling: Don't trace everything—be selective
- Span attributes: Add rich context but avoid PII (see the sketch after this list)
- Span events: Mark important milestones within spans
- Error context: Always include error details in span status
- Correlation IDs: Link traces across system boundaries
- Tail-based sampling: Keep all traces for errors, sample successes
- Storage optimization: Set retention policies based on trace value
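On the PII point, one hedged approach is to pseudonymize identifiers before attaching them as attributes, so spans can still be grouped per user without exposing personal data (salt handling is omitted here but would matter in production):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"go.opentelemetry.io/otel/attribute"
)

// pseudonymize hashes a raw identifier so spans can be correlated by user
// without carrying PII; a real deployment would mix in a secret salt.
func pseudonymize(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:8]) // a short prefix is enough for grouping
}

func main() {
	email := "alice@example.com"

	// Instead of attribute.String("user.email", email) ...
	attr := attribute.String("user.hash", pseudonymize(email))
	fmt.Println(attr.Key, "=", attr.Value.AsString())
}
```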
Practice Exercises
Exercise 1: Build a Request Logger
🎯 Learning Objectives:
- Master structured logging with Go's `log/slog` package
- Implement the HTTP middleware pattern for cross-cutting concerns
- Practice context propagation for request tracking
- Learn proper log level usage and filtering
🌍 Real-World Context:
Every production service needs comprehensive request logging for debugging, auditing, and analytics. Companies like Stripe and PayPal use structured request logging to detect fraud patterns, debug API issues, and maintain compliance. This middleware pattern is used by thousands of Go services to ensure every request is properly tracked from ingress to egress.
⏱️ Time Estimate: 45-60 minutes
📊 Difficulty: Intermediate
Create middleware that logs all HTTP requests with context. The middleware should generate unique request IDs, track request timing, capture response status codes, and use structured logging with appropriate fields for production monitoring.
1package main
2
3import (
4 "context"
5 "log/slog"
6 "net/http"
7 "os"
8 "time"
9
10 "github.com/google/uuid"
11)
12
13type contextKey string
14
15const requestIDKey contextKey = "request_id"
16
17// RequestLogger middleware
18func RequestLogger(logger *slog.Logger) func(http.Handler) http.Handler {
19 return func(next http.Handler) http.Handler {
20 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
21 // Generate request ID
22 requestID := uuid.New().String()
23
24 // Add to context
25 ctx := context.WithValue(r.Context(), requestIDKey, requestID)
26 r = r.WithContext(ctx)
27
28 // Create logger with request context
29 reqLogger := logger.With(
30 slog.String("request_id", requestID),
31 slog.String("method", r.Method),
32 slog.String("path", r.URL.Path),
33 slog.String("remote_addr", r.RemoteAddr),
34 slog.String("user_agent", r.UserAgent()),
35 )
36
37 // Log request start
38 start := time.Now()
39 reqLogger.Info("request started")
40
41 // Wrap response writer to capture status
42 rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
43
44 // Call handler
45 next.ServeHTTP(rw, r)
46
47 // Log request completion
48 duration := time.Since(start)
49 reqLogger.Info("request completed",
50 slog.Int("status_code", rw.statusCode),
51 slog.Duration("duration", duration),
52 slog.Int("bytes_written", rw.bytesWritten),
53 )
54 })
55 }
56}
57
58type responseWriter struct {
59 http.ResponseWriter
60 statusCode int
61 bytesWritten int
62}
63
64func (rw *responseWriter) WriteHeader(code int) {
65 rw.statusCode = code
66 rw.ResponseWriter.WriteHeader(code)
67}
68
69func (rw *responseWriter) Write(b []byte) (int, error) {
70 n, err := rw.ResponseWriter.Write(b)
71 rw.bytesWritten += n
72 return n, err
73}
74
75func main() {
76 logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
77
78 mux := http.NewServeMux()
79 mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
80 time.Sleep(50 * time.Millisecond)
81 w.WriteHeader(http.StatusOK)
82 w.Write([]byte(`{"users": []}`))
83 })
84
85 handler := RequestLogger(logger)(mux)
86
87 http.ListenAndServe(":8080", handler)
88}
Exercise 2: Implement Error Budget Tracking
🎯 Learning Objectives:
- Understand Service Level Objectives and error budgets
- Implement real-time availability calculations
- Learn rate limiting and burn-rate alerting concepts (a burn-rate sketch follows the starter code)
- Practice atomic operations for thread-safe metrics
🌍 Real-World Context:
Google pioneered error budget tracking to balance reliability with innovation. When your SLO is 99.9% uptime, you have an "error budget" of 0.1% that can be used for deployments, experiments, and inevitable failures. Companies like Spotify use error budgets to determine when to pause releases vs when to push forward, ensuring user experience isn't compromised while maintaining development velocity.
⏱️ Time Estimate: 60-75 minutes
📊 Difficulty: Advanced
Build an error budget tracker based on SLOs that monitors service availability in real-time, tracks error consumption rates, calculates remaining error budget, and provides alerts when the budget is being exhausted too quickly. This system helps teams make data-driven decisions about deployments and incident response.
1package main
2
3import (
4 "fmt"
5 "sync"
6 "time"
7)
8
9// ErrorBudget tracks SLO error budget
10type ErrorBudget struct {
11 mu sync.RWMutex
12
13 slo float64 // e.g., 0.999 = 99.9%
14 window time.Duration // e.g., 30 days
15 startTime time.Time
16 totalRequests int64
17 failedRequests int64
18 budgetExhausted bool
19}
20
21func NewErrorBudget(slo float64, window time.Duration) *ErrorBudget {
22 return &ErrorBudget{
23 slo: slo,
24 window: window,
25 startTime: time.Now(),
26 }
27}
28
29// RecordRequest records a request outcome
30func (eb *ErrorBudget) RecordRequest(success bool) {
31 eb.mu.Lock()
32 defer eb.mu.Unlock()
33
34 eb.totalRequests++
35 if !success {
36 eb.failedRequests++
37 }
38
39 // Check if budget is exhausted
40 currentAvailability := eb.availability()
41 eb.budgetExhausted = currentAvailability < eb.slo
42}
43
44func (eb *ErrorBudget) availability() float64 {
45 if eb.totalRequests == 0 {
46 return 1.0
47 }
48 return float64(eb.totalRequests-eb.failedRequests) / float64(eb.totalRequests)
49}
50
51// Status returns error budget status
52func (eb *ErrorBudget) Status() map[string]interface{} {
53 eb.mu.RLock()
54 defer eb.mu.RUnlock()
55
56 availability := eb.availability()
57 errorRate := 1.0 - availability
58
59 // Calculate budget: the SLO allows (1 - slo) of all requests to fail
60 allowedErrors := float64(eb.totalRequests) * (1.0 - eb.slo)
61 remainingBudget := allowedErrors - float64(eb.failedRequests)
62 budgetPercent := remainingBudget / allowedErrors * 100
63
64 return map[string]interface{}{
65 "slo": fmt.Sprintf("%.3f%%", eb.slo*100),
66 "current_availability": fmt.Sprintf("%.3f%%", availability*100),
67 "error_rate": fmt.Sprintf("%.3f%%", errorRate*100),
68 "total_requests": eb.totalRequests,
69 "failed_requests": eb.failedRequests,
70 "budget_remaining": fmt.Sprintf("%.1f%%", budgetPercent),
71 "budget_exhausted": eb.budgetExhausted,
72 "time_in_window": time.Since(eb.startTime),
73 }
74}
75
76func main() {
77 budget := NewErrorBudget(0.999, 30*24*time.Hour)
78
79 // Simulate requests
80 for i := 0; i < 10000; i++ {
81 success := i%200 != 0 // 99.5% success rate
82 budget.RecordRequest(success)
83 }
84
85 status := budget.Status()
86 fmt.Println("Error Budget Status:")
87 for k, v := range status {
88 fmt.Printf(" %s: %v\n", k, v)
89 }
90}
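For the burn-rate alerting part of the exercise, here is a minimal sketch of the idea; the 1-hour window and the 14x/2x thresholds are illustrative values borrowed from common multi-window burn-rate alerting practice, not requirements:

```go
package main

import (
	"fmt"
	"time"
)

// burnRate compares the error rate observed in a recent window against the
// rate the SLO allows. A burn rate of 1.0 consumes the budget exactly on
// schedule; much higher values mean the budget will run out early.
func burnRate(errorsInWindow, requestsInWindow int64, slo float64) float64 {
	if requestsInWindow == 0 {
		return 0
	}
	observed := float64(errorsInWindow) / float64(requestsInWindow)
	allowed := 1.0 - slo
	return observed / allowed
}

func main() {
	// Example: 99.9% SLO, 50 errors out of 10,000 requests in the last hour.
	rate := burnRate(50, 10_000, 0.999)
	fmt.Printf("burn rate over the last %v: %.1fx\n", time.Hour, rate)

	switch {
	case rate > 14:
		fmt.Println("ALERT: page - budget is burning far too fast")
	case rate > 2:
		fmt.Println("WARN: open a ticket - elevated burn rate")
	}
}
```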
Exercise 3: Custom Prometheus Collector
🎯 Learning Objectives:
- Master the Prometheus Collector interface
- Implement custom metrics beyond basic counters/gauges
- Learn thread-safe metric collection patterns
- Understand metric types and when to use each one
🌍 Real-World Context:
While Prometheus provides basic metric types, complex systems often need custom collectors that expose detailed state. Database connection pools, cache systems, and message queues all benefit from custom collectors that can surface operational insights. Uber uses custom collectors to monitor their massive connection pools across thousands of services, enabling proactive scaling and performance optimization.
⏱️ Time Estimate: 75-90 minutes
📊 Difficulty: Advanced
Implement a custom Prometheus collector for a connection pool that goes beyond simple counters. Your collector should implement the full prometheus.Collector interface, track detailed pool statistics including wait times, connection churn, and utilization patterns, and expose both real-time gauges and cumulative counters for comprehensive monitoring.
Requirements:
- Implement prometheus.Collector interface with proper Describe() and Collect() methods
- Track pool size, active connections, idle connections, and wait times
- Expose histograms for connection acquisition times and utilization gauges
- Ensure thread-safe metric updates using proper synchronization
- Include calculated metrics like average wait time and connection churn rate
1// run
2package main
3
4import (
5 "fmt"
6 "math/rand"
7 "sync"
8 "time"
9
10 "github.com/prometheus/client_golang/prometheus"
11)
12
13// ConnectionPool represents a database connection pool
14type ConnectionPool struct {
15 mu sync.RWMutex
16
17 maxSize int
18 activeConns int
19 idleConns int
20 totalConns int
21 waitingRequests int
22
23 // Statistics
24 totalAcquired uint64
25 totalReleased uint64
26 totalWaitTime time.Duration
27}
28
29// NewConnectionPool creates a new connection pool
30func NewConnectionPool(maxSize int) *ConnectionPool {
31 return &ConnectionPool{
32 maxSize: maxSize,
33 totalConns: maxSize,
34 idleConns: maxSize,
35 }
36}
37
38// Acquire gets a connection from the pool
39func (cp *ConnectionPool) Acquire() (*Connection, error) {
40 cp.mu.Lock()
41 defer cp.mu.Unlock()
42
43 start := time.Now()
44
45 // Wait if no idle connections
46 for cp.idleConns == 0 {
47 cp.waitingRequests++
48 cp.mu.Unlock()
49 time.Sleep(10 * time.Millisecond)
50 cp.mu.Lock()
51 cp.waitingRequests--
52 }
53
54 cp.idleConns--
55 cp.activeConns++
56 cp.totalAcquired++
57 cp.totalWaitTime += time.Since(start)
58
59 return &Connection{pool: cp}, nil
60}
61
62// Release returns a connection to the pool
63func (cp *ConnectionPool) Release() {
64 cp.mu.Lock()
65 defer cp.mu.Unlock()
66
67 cp.activeConns--
68 cp.idleConns++
69 cp.totalReleased++
70}
71
72// Stats returns current pool statistics
73func (cp *ConnectionPool) Stats() (maxSize, active, idle, waiting int) {
74 cp.mu.RLock()
75 defer cp.mu.RUnlock()
76
77 return cp.maxSize, cp.activeConns, cp.idleConns, cp.waitingRequests
78}
79
80// Connection represents a database connection
81type Connection struct {
82 pool *ConnectionPool
83}
84
85// Close returns the connection to the pool
86func (c *Connection) Close() {
87 c.pool.Release()
88}
89
90// ConnectionPoolCollector implements prometheus.Collector
91type ConnectionPoolCollector struct {
92 pool *ConnectionPool
93
94 // Metrics descriptors
95 poolSize *prometheus.Desc
96 activeConns *prometheus.Desc
97 idleConns *prometheus.Desc
98 waitingRequests *prometheus.Desc
99 totalAcquired *prometheus.Desc
100 totalReleased *prometheus.Desc
101 avgWaitTime *prometheus.Desc
102}
103
104// NewConnectionPoolCollector creates a custom collector
105func NewConnectionPoolCollector(pool *ConnectionPool, namespace string) *ConnectionPoolCollector {
106 return &ConnectionPoolCollector{
107 pool: pool,
108 poolSize: prometheus.NewDesc(
109 prometheus.BuildFQName(namespace, "pool", "size"),
110 "Maximum number of connections in the pool",
111 nil, nil,
112 ),
113 activeConns: prometheus.NewDesc(
114 prometheus.BuildFQName(namespace, "pool", "active_connections"),
115 "Number of currently active connections",
116 nil, nil,
117 ),
118 idleConns: prometheus.NewDesc(
119 prometheus.BuildFQName(namespace, "pool", "idle_connections"),
120 "Number of idle connections available",
121 nil, nil,
122 ),
123 waitingRequests: prometheus.NewDesc(
124 prometheus.BuildFQName(namespace, "pool", "waiting_requests"),
125 "Number of requests waiting for a connection",
126 nil, nil,
127 ),
128 totalAcquired: prometheus.NewDesc(
129 prometheus.BuildFQName(namespace, "pool", "acquired_total"),
130 "Total number of connections acquired",
131 nil, nil,
132 ),
133 totalReleased: prometheus.NewDesc(
134 prometheus.BuildFQName(namespace, "pool", "released_total"),
135 "Total number of connections released",
136 nil, nil,
137 ),
138 avgWaitTime: prometheus.NewDesc(
139 prometheus.BuildFQName(namespace, "pool", "avg_wait_seconds"),
140 "Average wait time to acquire a connection",
141 nil, nil,
142 ),
143 }
144}
145
146// Describe implements prometheus.Collector
147 func (c *ConnectionPoolCollector) Describe(ch chan<- *prometheus.Desc) {
148 ch <- c.poolSize
149 ch <- c.activeConns
150 ch <- c.idleConns
151 ch <- c.waitingRequests
152 ch <- c.totalAcquired
153 ch <- c.totalReleased
154 ch <- c.avgWaitTime
155}
156
157// Collect implements prometheus.Collector
158 func (c *ConnectionPoolCollector) Collect(ch chan<- prometheus.Metric) {
159 c.pool.mu.RLock()
160 defer c.pool.mu.RUnlock()
161
162 // Gauges
163 ch <- prometheus.MustNewConstMetric(
164 c.poolSize,
165 prometheus.GaugeValue,
166 float64(c.pool.maxSize),
167 )
168
169 ch <- prometheus.MustNewConstMetric(
170 c.activeConns,
171 prometheus.GaugeValue,
172 float64(c.pool.activeConns),
173 )
174
175 ch <- prometheus.MustNewConstMetric(
176 c.idleConns,
177 prometheus.GaugeValue,
178 float64(c.pool.idleConns),
179 )
180
181 ch <- prometheus.MustNewConstMetric(
182 c.waitingRequests,
183 prometheus.GaugeValue,
184 float64(c.pool.waitingRequests),
185 )
186
187 // Counters
188 ch <- prometheus.MustNewConstMetric(
189 c.totalAcquired,
190 prometheus.CounterValue,
191 float64(c.pool.totalAcquired),
192 )
193
194 ch <- prometheus.MustNewConstMetric(
195 c.totalReleased,
196 prometheus.CounterValue,
197 float64(c.pool.totalReleased),
198 )
199
200 // Calculate average wait time
201 avgWait := 0.0
202 if c.pool.totalAcquired > 0 {
203 avgWait = c.pool.totalWaitTime.Seconds() / float64(c.pool.totalAcquired)
204 }
205
206 ch <- prometheus.MustNewConstMetric(
207 c.avgWaitTime,
208 prometheus.GaugeValue,
209 avgWait,
210 )
211}
212
213func main() {
214 // Create connection pool
215 pool := NewConnectionPool(10)
216
217 // Create custom collector
218 collector := NewConnectionPoolCollector(pool, "myapp")
219
220 // Register collector
221 registry := prometheus.NewRegistry()
222 registry.MustRegister(collector)
223
224 // Simulate workload
225 fmt.Println("Simulating connection pool workload...\n")
226
227 var wg sync.WaitGroup
228 for i := 0; i < 20; i++ {
229 wg.Add(1)
230 go func(id int) {
231 defer wg.Done()
232
233 for j := 0; j < 5; j++ {
234 // Acquire connection
235 conn, _ := pool.Acquire()
236
237 // Simulate work
238 time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
239
240 // Release connection
241 conn.Close()
242 }
243 }(i)
244 }
245
246 // Periodically print metrics
247 done := make(chan bool)
248 go func() {
249 ticker := time.NewTicker(500 * time.Millisecond)
250 defer ticker.Stop()
251
252 for {
253 select {
254 case <-ticker.C:
255 // Gather metrics
256 metrics, _ := registry.Gather()
257
258 fmt.Println("=== Connection Pool Metrics ===")
259 for _, mf := range metrics {
260 for _, m := range mf.GetMetric() {
261 fmt.Printf("%s: %.2f\n",
262 mf.GetName(),
263 m.GetGauge().GetValue()+m.GetCounter().GetValue())
264 }
265 }
266 fmt.Println()
267
268 case <-done:
269 return
270 }
271 }
272 }()
273
274 wg.Wait()
275 close(done)
276
277 // Final metrics
278 fmt.Println("=== Final Metrics ===")
279 maxSize, active, idle, waiting := pool.Stats()
280 fmt.Printf("Pool Size: %d\n", maxSize)
281 fmt.Printf("Active: %d\n", active)
282 fmt.Printf("Idle: %d\n", idle)
283 fmt.Printf("Waiting: %d\n", waiting)
284 fmt.Printf("Total Acquired: %d\n", pool.totalAcquired)
285 fmt.Printf("Total Released: %d\n", pool.totalReleased)
286}
Explanation:
Custom Prometheus Collector Pattern:
- Implement the prometheus.Collector interface:
  - Describe(chan<- *prometheus.Desc): send metric descriptors
  - Collect(chan<- prometheus.Metric): send metric values
- Create metric descriptors with prometheus.NewDesc():
  - Fully qualified name
  - Help text
  - Variable labels
  - Constant labels
- Send metrics with prometheus.MustNewConstMetric():
  - Descriptor
  - Metric type
  - Value
Metrics Exposed:
- Gauges:
  - pool_size: Maximum connections
  - active_connections: In-use connections
  - idle_connections: Available connections
  - waiting_requests: Blocked requests
- Counters:
  - acquired_total: Total acquires
  - released_total: Total releases
- Calculated Metrics:
  - avg_wait_seconds: Average wait time
Why Custom Collectors?
- Complex metrics: Calculate derived values
- External systems: Scrape metrics from non-Go systems
- Efficiency: Collect all related metrics in one pass
- Dynamic labels: Generate labels at scrape time
Best Practices:
- Thread Safety: Use mutexes when reading state
- Consistency: Collect related metrics atomically
- Performance: Keep Collect() fast
- Naming: Follow Prometheus conventions
- Metric types:
  - Gauge: Can go up/down
  - Counter: Only increases
  - Histogram: Distribution (see the histogram sketch below)
  - Summary: Quantiles
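The exercise requirements also ask for histograms of connection acquisition times, while the solution above exposes only an average gauge. A minimal sketch of how such a histogram could be added with the same client library; the metric name and bucket layout are illustrative, not part of the solution:

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Histogram of connection acquisition latency; buckets from 1ms to ~0.5s (illustrative).
	acquireDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: "myapp",
		Subsystem: "pool",
		Name:      "acquire_duration_seconds",
		Help:      "Time spent waiting to acquire a connection",
		Buckets:   prometheus.ExponentialBuckets(0.001, 2, 10),
	})

	registry := prometheus.NewRegistry()
	registry.MustRegister(acquireDuration)

	// Inside Acquire(), the observed wait would be recorded like this:
	start := time.Now()
	time.Sleep(5 * time.Millisecond) // stand-in for waiting on an idle connection
	acquireDuration.Observe(time.Since(start).Seconds())

	mfs, _ := registry.Gather()
	for _, mf := range mfs {
		fmt.Println(mf.GetName(), "samples:", mf.GetMetric()[0].GetHistogram().GetSampleCount())
	}
}

Unlike the average gauge, a histogram keeps the full latency distribution, so percentiles can be derived at query time with histogram_quantile().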
PromQL Queries:
1# Current utilization
2myapp_pool_active_connections / myapp_pool_size
3
4# Connection churn rate
5rate(myapp_pool_acquired_total[5m])
6
7# Average wait time
8avg_over_time(myapp_pool_avg_wait_seconds[5m])
9
10# Alert on pool exhaustion
11myapp_pool_idle_connections == 0
Real-World Use Cases:
- Database pools: Track connection health
- HTTP clients: Monitor request queues
- Caches: Hit/miss rates, evictions
- Worker pools: Job queue depth
- Custom business metrics: Order processing, user sessions
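The demo prints gathered metrics to stdout; in a real service the registry would normally be exposed over HTTP so Prometheus can scrape it. A brief sketch using the promhttp handler; the port and path are arbitrary choices:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	registry := prometheus.NewRegistry()
	// In the exercise above, the custom collector would be registered here:
	// registry.MustRegister(NewConnectionPoolCollector(pool, "myapp"))

	// Serve only this registry's metrics at /metrics.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Println("metrics exposed at http://localhost:9090/metrics")
	log.Fatal(http.ListenAndServe(":9090", nil))
}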
Exercise 4: Distributed Tracing with OpenTelemetry
🎯 Learning Objectives:
- Implement distributed tracing across multiple services
- Master OpenTelemetry context propagation
- Create meaningful spans with proper attributes
- Learn trace sampling and export strategies
🌍 Real-World Context:
Microservices architecture makes distributed tracing essential for debugging. When a user request flows through 10+ services, traditional logging becomes insufficient for understanding performance bottlenecks. Companies like Airbnb and Uber rely heavily on distributed tracing to maintain their complex service ecosystems, reducing mean time to resolution for incidents from hours to minutes.
⏱️ Time Estimate: 90-120 minutes
📊 Difficulty: Advanced
Build a comprehensive distributed tracing system using OpenTelemetry that tracks requests across multiple services. Your implementation should include proper context propagation, meaningful span attributes, automatic instrumentation for common operations, and integration with a tracing backend like Jaeger for visualization.
Requirements:
- Implement tracing across at least 3 different services
- Use OpenTelemetry automatic and manual instrumentation
- Properly propagate trace context through HTTP and gRPC calls
- Add meaningful attributes, events, and links to spans
- Implement sampling strategies to balance observability with performance
- Export traces to Jaeger or another compatible backend
Solution with Explanation
1// run
2package main
3
4import (
5 "context"
6 "fmt"
7 "log"
8 "net/http"
9 "time"
10
11 "go.opentelemetry.io/otel"
12 "go.opentelemetry.io/otel/attribute"
13 "go.opentelemetry.io/otel/codes"
14 "go.opentelemetry.io/otel/exporters/jaeger"
15 "go.opentelemetry.io/otel/propagation"
16 "go.opentelemetry.io/otel/sdk/resource"
17 sdktrace "go.opentelemetry.io/otel/sdk/trace"
18 semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
19 "go.opentelemetry.io/otel/trace"
20)
21
22// ServiceA is the entry point service
23type ServiceA struct {
24 client *http.Client
25 tracer trace.Tracer
26}
27
28func NewServiceA() *ServiceA {
29 return &ServiceA{
30 client: &http.Client{},
31 tracer: otel.Tracer("service-a"),
32 }
33}
34
35 func (s *ServiceA) HandleRequest(ctx context.Context, userID string) error {
36 // Start root span
37 ctx, span := s.tracer.Start(ctx, "handle-user-request",
38 trace.WithAttributes(
39 attribute.String("user.id", userID),
40 attribute.String("service.name", "service-a"),
41 ),
42 )
43 defer span.End()
44
45 // Log event
46 span.AddEvent("request-started", trace.WithTimestamp(time.Now()))
47
48 // Call Service B
49 if err := s.callServiceB(ctx, userID); err != nil {
50 span.RecordError(err)
51 span.SetStatus(codes.Error, "service-b-call-failed")
52 return err
53 }
54
55 // Call Service C
56 if err := s.callServiceC(ctx, userID); err != nil {
57 span.RecordError(err)
58 span.SetStatus(codes.Error, "service-c-call-failed")
59 return err
60 }
61
62 span.SetStatus(codes.Ok, "request-completed")
63 span.AddEvent("request-completed", trace.WithTimestamp(time.Now()))
64 return nil
65}
66
67 func (s *ServiceA) callServiceB(ctx context.Context, userID string) error {
68 ctx, span := s.tracer.Start(ctx, "call-service-b")
69 defer span.End()
70
71 span.SetAttributes(
72 attribute.String("target.service", "service-b"),
73 attribute.String("operation", "get-user-data"),
74 )
75
76 // Simulate HTTP call with context propagation
77 req, err := http.NewRequestWithContext(ctx, "GET", "http://localhost:8081/user/"+userID, nil)
78 if err != nil {
79 return err
80 }
81
82 // Inject trace context into headers
83 otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
84
85 resp, err := s.client.Do(req)
86 if err != nil {
87 return err
88 }
89 defer resp.Body.Close()
90
91 span.SetAttributes(
92 attribute.Int("http.status_code", resp.StatusCode),
93 attribute.String("http.method", req.Method),
94 attribute.String("http.url", req.URL.String()),
95 )
96
97 return nil
98}
99
100 func (s *ServiceA) callServiceC(ctx context.Context, userID string) error {
101 ctx, span := s.tracer.Start(ctx, "call-service-c")
102 defer span.End()
103
104 span.SetAttributes(
105 attribute.String("target.service", "service-c"),
106 attribute.String("operation", "process-user-data"),
107 )
108
109 // Simulate processing time
110 time.Sleep(50 * time.Millisecond)
111
112 // Add business metric
113 span.SetAttributes(attribute.Int("data.records_processed", 42))
114
115 return nil
116}
117
118// ServiceB handles user data requests
119type ServiceB struct {
120 tracer trace.Tracer
121}
122
123func NewServiceB() *ServiceB {
124 return &ServiceB{
125 tracer: otel.Tracer("service-b"),
126 }
127}
128
129 func (s *ServiceB) ServeHTTP(w http.ResponseWriter, r *http.Request) {
130 // Extract trace context from headers
131 ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
132
133 ctx, span := s.tracer.Start(ctx, "handle-user-request")
134 defer span.End()
135
136 // Add service-specific attributes
137 span.SetAttributes(
138 attribute.String("service.name", "service-b"),
139 attribute.String("http.route", "/user/{id}"),
140 attribute.String("user.id", r.URL.Path[len("/user/"):]),
141 )
142
143 // Simulate database query
144 _, dbSpan := s.tracer.Start(ctx, "database-query")
145 dbSpan.SetAttributes(
146 attribute.String("db.system", "postgresql"),
147 attribute.String("db.operation", "SELECT"),
148 attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
149 )
150 time.Sleep(20 * time.Millisecond)
151 dbSpan.End()
152
153 span.AddEvent("data-retrieved", trace.WithAttributes(
154 attribute.Int("record_count", 1),
155 ))
156
157 w.WriteHeader(http.StatusOK)
158 w.Write([]byte(`{"id": "123", "name": "John Doe"}`))
159}
160
161// ServiceC processes data
162type ServiceC struct {
163 tracer trace.Tracer
164}
165
166func NewServiceC() *ServiceC {
167 return &ServiceC{
168 tracer: otel.Tracer("service-c"),
169 }
170}
171
172 func (s *ServiceC) ProcessData(ctx context.Context, userID string) error {
173 ctx, span := s.tracer.Start(ctx, "process-user-data")
174 defer span.End()
175
176 span.SetAttributes(
177 attribute.String("service.name", "service-c"),
178 attribute.String("user.id", userID),
179 )
180
181 // Simulate multiple processing steps
182 s.processStep1(ctx)
183 s.processStep2(ctx)
184 s.processStep3(ctx)
185
186 return nil
187}
188
189 func (s *ServiceC) processStep1(ctx context.Context) {
190 _, span := s.tracer.Start(ctx, "validation-step")
191 defer span.End()
192
193 span.SetAttributes(attribute.String("step.name", "data-validation"))
194 time.Sleep(10 * time.Millisecond)
195}
196
197 func (s *ServiceC) processStep2(ctx context.Context) {
198 _, span := s.tracer.Start(ctx, "transformation-step")
199 defer span.End()
200
201 span.SetAttributes(attribute.String("step.name", "data-transformation"))
202 time.Sleep(25 * time.Millisecond)
203}
204
205 func (s *ServiceC) processStep3(ctx context.Context) {
206 _, span := s.tracer.Start(ctx, "enrichment-step")
207 defer span.End()
208
209 span.SetAttributes(attribute.String("step.name", "data-enrichment"))
210 time.Sleep(15 * time.Millisecond)
211}
212
213// Initialize OpenTelemetry
214 func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
215 // Create Jaeger exporter
216 exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
217 jaeger.WithEndpoint("http://localhost:14268/api/traces"),
218 ))
219 if err != nil {
220 return nil, err
221 }
222
223 // Create resource with service metadata
224 res, err := resource.New(context.Background(),
225 resource.WithAttributes(
226 semconv.ServiceNameKey.String(serviceName),
227 semconv.ServiceVersionKey.String("1.0.0"),
228 attribute.String("environment", "development"),
229 ),
230 )
231 if err != nil {
232 return nil, err
233 }
234
235 // Create tracer provider with sampling
236 tp := sdktrace.NewTracerProvider(
237 sdktrace.WithBatcher(exp),
238 sdktrace.WithResource(res),
239 sdktrace.WithSampler(sdktrace.AlwaysSample()),
240 )
241
242 // Register global propagator
243 otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
244 propagation.TraceContext{},
245 propagation.Baggage{},
246 ))
247
248 // Set global tracer provider
249 otel.SetTracerProvider(tp)
250
251 return tp, nil
252}
253
254func main() {
255 // Initialize tracing
256 tp, err := initTracer("service-a")
257 if err != nil {
258 log.Fatal(err)
259 }
260 defer func() {
261 if err := tp.Shutdown(context.Background()); err != nil {
262 log.Printf("Error shutting down tracer provider: %v", err)
263 }
264 }()
265
266 // Create services
267 serviceA := NewServiceA()
268 serviceB := NewServiceB()
269 _ = NewServiceC() // ServiceC is defined above for illustration; it is not invoked directly in this demo
270
271 // Start Service B
272 go func() {
273 fmt.Println("Service B listening on :8081")
274 if err := http.ListenAndServe(":8081", serviceB); err != nil {
275 log.Fatal(err)
276 }
277 }()
278
279 // Give Service B time to start
280 time.Sleep(100 * time.Millisecond)
281
282 // Simulate distributed request
283 ctx := context.Background()
284 fmt.Println("Processing distributed request...")
285
286 start := time.Now()
287 if err := serviceA.HandleRequest(ctx, "user-123"); err != nil {
288 fmt.Printf("Request failed: %v\n", err)
289 } else {
290 fmt.Printf("Request completed in %v\n", time.Since(start))
291 }
292
293 fmt.Println("\nCheck traces at: http://localhost:16686")
294 time.Sleep(2 * time.Second)
295}
Explanation:
Distributed Tracing Implementation:
- Trace Context Propagation:
  - Uses OpenTelemetry propagators to inject/extract trace context
  - Context flows through HTTP headers automatically
  - Maintains trace continuity across service boundaries
- Span Design Best Practices:
  - Meaningful span names
  - Relevant attributes following semantic conventions
  - Events for important timestamps
  - Status codes for success/failure states
- Service Boundaries:
  - Each service creates its own spans
  - Clear separation of concerns
  - Proper error handling and status reporting
- Instrumentation Patterns:
  - Manual instrumentation for custom business logic
  - Automatic instrumentation for HTTP calls (see the otelhttp sketch after this list)
  - Database query tracing
  - Multi-step process tracking
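The solution wires propagation by hand with Inject/Extract; the "automatic instrumentation" mentioned above typically comes from the otelhttp contrib package, which wraps handlers and clients so spans and header propagation happen implicitly. A hedged sketch, assuming the module go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp is added to go.mod:

package main

import (
	"io"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Server side: otelhttp starts a server span per request and extracts
	// any incoming trace context from the headers automatically.
	handler := otelhttp.NewHandler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}), "handle-user-request")
	go func() { log.Fatal(http.ListenAndServe(":8081", handler)) }()
	time.Sleep(100 * time.Millisecond) // give the listener a moment to start

	// Client side: the instrumented transport creates a client span and
	// injects the trace context into outgoing request headers.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	resp, err := client.Get("http://localhost:8081/")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("status:", resp.Status)
}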
Key Components:
- ServiceA:
  - Creates root span for the entire request
  - Calls downstream services
  - Coordinates overall request flow
- ServiceB:
  - Extracts trace context from incoming requests
  - Handles database operations
  - Returns structured data
- ServiceC:
  - Demonstrates multi-step processing
  - Shows nested spans within a service
  - Business logic tracing
Real-World Benefits:
- Performance Analysis: Identify bottlenecks across services
- Error Debugging: Trace errors through the entire request path
- Dependency Mapping: Understand service relationships
- Capacity Planning: See which services are under load
Production Considerations:
- Sampling: Use probability sampling in high-traffic scenarios (see the sampler sketch below)
- Resource Limits: Configure span limits and timeouts
- Security: Don't include sensitive data in spans
- Cost: Manage trace volume and storage
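For the sampling point above, a common production configuration combines a parent-based sampler with a trace-ID ratio, so downstream services respect upstream decisions while only a fraction of new traces are recorded. The 10% ratio below is just an example:

package main

import (
	"context"
	"fmt"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Record roughly 10% of new root traces; child spans inherit the
	// decision already made by their parent.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))

	// In initTracer this would replace sdktrace.AlwaysSample().
	tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
	defer func() { _ = tp.Shutdown(context.Background()) }()

	fmt.Println("sampler:", sampler.Description())
}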
Exercise 5: Real-time Metrics Dashboard
🎯 Learning Objectives:
- Build real-time metric collection and visualization
- Implement time-series data aggregation
- Create WebSocket-based streaming dashboards
- Learn metric query languages and alerting patterns
🌍 Real-World Context:
Real-time dashboards are essential for operations teams to monitor system health and detect anomalies instantly. Companies like Netflix and Twitter use sophisticated real-time dashboards to track billions of metrics, enabling them to detect issues before users notice. These dashboards transform raw metrics into actionable insights through aggregation, visualization, and intelligent alerting.
⏱️ Time Estimate: 120-150 minutes
📊 Difficulty: Expert
Build a complete real-time metrics dashboard system that collects, aggregates, and visualizes metrics from multiple sources. Your implementation should include a metrics collector, time-series data store, WebSocket-based real-time updates, and a web dashboard with charts and alerts.
Requirements:
- Collect metrics from multiple sources
- Implement time-series data aggregation
- Create a WebSocket server for real-time metric streaming
- Build a web dashboard with interactive charts and graphs
- Implement alerting rules with threshold and anomaly detection
- Support metric queries with filtering and time range selection
Solution with Explanation
1// run
2package main
3
4import (
5 "context"
6 "encoding/json"
7 "fmt"
8 "log"
9 "net/http"
10 "sync"
11 "time"
12
13 "github.com/gorilla/websocket"
14)
15
16// Metric represents a single metric data point
17type Metric struct {
18 Timestamp time.Time `json:"timestamp"`
19 Name string `json:"name"`
20 Value float64 `json:"value"`
21 Tags map[string]string `json:"tags"`
22}
23
24// MetricSeries represents a time series of metrics
25type MetricSeries struct {
26 Name string `json:"name"`
27 Tags map[string]string `json:"tags"`
28 Points []Metric `json:"points"`
29}
30
31// MetricsCollector collects and aggregates metrics
32type MetricsCollector struct {
33 mu sync.RWMutex
34 metrics map[string]*MetricSeries // key: name+tags hash
35 maxPoints int
36 subscribers map[*websocket.Conn]bool
37 alertRules []*AlertRule
38 lastUpdate time.Time
39}
40
41// AlertRule defines an alert condition
42type AlertRule struct {
43 ID string `json:"id"`
44 Name string `json:"name"`
45 MetricName string `json:"metric_name"`
46 Condition string `json:"condition"` // ">", "<", "==", "!="
47 Threshold float64 `json:"threshold"`
48 Duration time.Duration `json:"duration"`
49 LastTrigger time.Time `json:"last_trigger"`
50 Active bool `json:"active"`
51}
52
53func NewMetricsCollector(maxPoints int) *MetricsCollector {
54 return &MetricsCollector{
55 metrics: make(map[string]*MetricSeries),
56 maxPoints: maxPoints,
57 subscribers: make(map[*websocket.Conn]bool),
58 lastUpdate: time.Now(),
59 }
60}
61
62// RecordMetric adds a new metric data point
63 func (mc *MetricsCollector) RecordMetric(name string, value float64, tags map[string]string) {
64 mc.mu.Lock()
65 defer mc.mu.Unlock()
66
67 // Create series key
68 key := mc.seriesKey(name, tags)
69
70 // Get or create series
71 series, exists := mc.metrics[key]
72 if !exists {
73 series = &MetricSeries{
74 Name: name,
75 Tags: tags,
76 Points: make([]Metric, 0, mc.maxPoints),
77 }
78 mc.metrics[key] = series
79 }
80
81 // Add new point
82 point := Metric{
83 Timestamp: time.Now(),
84 Name: name,
85 Value: value,
86 Tags: tags,
87 }
88
89 // Append point and maintain max size
90 series.Points = append(series.Points, point)
91 if len(series.Points) > mc.maxPoints {
92 series.Points = series.Points[1:]
93 }
94
95 // Check alerts
96 mc.checkAlerts(name, value, tags)
97
98 // Update timestamp
99 mc.lastUpdate = time.Now()
100}
101
102// seriesKey creates a unique key for a metric series
103 func (mc *MetricsCollector) seriesKey(name string, tags map[string]string) string {
104 key := name
105 for k, v := range tags {
106 key += fmt.Sprintf(":%s=%s", k, v)
107 }
108 return key
109}
110
111// checkAlerts evaluates alert rules against the new metric
112 func (mc *MetricsCollector) checkAlerts(name string, value float64, tags map[string]string) {
113 for _, rule := range mc.alertRules {
114 if rule.MetricName != name {
115 continue
116 }
117
118 // Check condition
119 triggered := false
120 switch rule.Condition {
121 case ">":
122 triggered = value > rule.Threshold
123 case "<":
124 triggered = value < rule.Threshold
125 case "==":
126 triggered = value == rule.Threshold
127 case "!=":
128 triggered = value != rule.Threshold
129 }
130
131 if triggered {
132 if !rule.Active {
133 rule.LastTrigger = time.Now()
134 rule.Active = true
135 mc.sendAlert(rule, value, tags)
136 }
137 } else {
138 rule.Active = false
139 }
140 }
141}
142
143// sendAlert sends an alert notification
144 func (mc *MetricsCollector) sendAlert(rule *AlertRule, value float64, tags map[string]string) {
145 alert := map[string]interface{}{
146 "type": "alert",
147 "rule": rule.Name,
148 "metric": rule.MetricName,
149 "value": value,
150 "threshold": rule.Threshold,
151 "timestamp": time.Now(),
152 "tags": tags,
153 }
154
155 go mc.broadcast(alert) // async: RecordMetric still holds mc.mu here, so a direct call would deadlock
156}
157
158// GetMetrics returns metrics matching the query
159 func (mc *MetricsCollector) GetMetrics(name string, tags map[string]string, duration time.Duration) []MetricSeries {
160 mc.mu.RLock()
161 defer mc.mu.RUnlock()
162
163 var results []MetricSeries
164 cutoff := time.Now().Add(-duration)
165
166 for _, series := range mc.metrics {
167 // Filter by name
168 if name != "" && series.Name != name {
169 continue
170 }
171
172 // Filter by tags
173 if !mc.matchTags(series.Tags, tags) {
174 continue
175 }
176
177 // Filter points by time
178 var points []Metric
179 for _, point := range series.Points {
180 if point.Timestamp.After(cutoff) {
181 points = append(points, point)
182 }
183 }
184
185 if len(points) > 0 {
186 results = append(results, MetricSeries{
187 Name: series.Name,
188 Tags: series.Tags,
189 Points: points,
190 })
191 }
192 }
193
194 return results
195}
196
197// matchTags checks if all query tags match series tags
198 func (mc *MetricsCollector) matchTags(seriesTags, queryTags map[string]string) bool {
199 for k, v := range queryTags {
200 if seriesTags[k] != v {
201 return false
202 }
203 }
204 return true
205}
206
207// AddAlertRule adds a new alert rule
208 func (mc *MetricsCollector) AddAlertRule(rule *AlertRule) {
209 mc.mu.Lock()
210 defer mc.mu.Unlock()
211 mc.alertRules = append(mc.alertRules, rule)
212}
213
214// Subscribe adds a WebSocket subscriber for real-time updates
215 func (mc *MetricsCollector) Subscribe(conn *websocket.Conn) {
216 mc.mu.Lock()
217 defer mc.mu.Unlock()
218 mc.subscribers[conn] = true
219}
220
221// Unsubscribe removes a WebSocket subscriber
222 func (mc *MetricsCollector) Unsubscribe(conn *websocket.Conn) {
223 mc.mu.Lock()
224 defer mc.mu.Unlock()
225 delete(mc.subscribers, conn)
226}
227
228// broadcast sends data to all subscribers
229 func (mc *MetricsCollector) broadcast(data interface{}) {
230 mc.mu.RLock()
231 subscribers := make([]*websocket.Conn, 0, len(mc.subscribers))
232 for conn := range mc.subscribers {
233 subscribers = append(subscribers, conn)
234 }
235 mc.mu.RUnlock()
236
237 for _, conn := range subscribers {
238 if err := conn.WriteJSON(data); err != nil {
239 mc.Unsubscribe(conn)
240 conn.Close()
241 }
242 }
243}
244
245// GetStats returns collector statistics
246 func (mc *MetricsCollector) GetStats() map[string]interface{} {
247 mc.mu.RLock()
248 defer mc.mu.RUnlock()
249
250 return map[string]interface{}{
251 "total_series": len(mc.metrics),
252 "total_points": mc.countTotalPoints(),
253 "active_alerts": mc.countActiveAlerts(),
254 "subscribers": len(mc.subscribers),
255 "last_update": mc.lastUpdate,
256 }
257}
258
259 func (mc *MetricsCollector) countTotalPoints() int {
260 total := 0
261 for _, series := range mc.metrics {
262 total += len(series.Points)
263 }
264 return total
265}
266
267 func (mc *MetricsCollector) countActiveAlerts() int {
268 active := 0
269 for _, rule := range mc.alertRules {
270 if rule.Active {
271 active++
272 }
273 }
274 return active
275}
276
277// WebSocket upgrader
278var upgrader = websocket.Upgrader{
279 CheckOrigin: func(r *http.Request) bool {
280 return true // Allow all origins for demo
281 },
282}
283
284// HTTP handlers
285 func (mc *MetricsCollector) handleWebSocket(w http.ResponseWriter, r *http.Request) {
286 conn, err := upgrader.Upgrade(w, r, nil)
287 if err != nil {
288 log.Printf("WebSocket upgrade error: %v", err)
289 return
290 }
291 defer conn.Close()
292
293 // Subscribe to updates
294 mc.Subscribe(conn)
295 defer mc.Unsubscribe(conn)
296
297 // Send initial stats
298 stats := mc.GetStats()
299 conn.WriteJSON(map[string]interface{}{
300 "type": "stats",
301 "data": stats,
302 })
303
304 // Keep connection alive
305 for {
306 _, _, err := conn.ReadMessage()
307 if err != nil {
308 break
309 }
310 }
311}
312
313 func (mc *MetricsCollector) handleMetrics(w http.ResponseWriter, r *http.Request) {
314 // Parse query parameters
315 name := r.URL.Query().Get("name")
316 durationStr := r.URL.Query().Get("duration")
317 duration := time.Hour // default
318
319 if durationStr != "" {
320 if parsed, err := time.ParseDuration(durationStr); err == nil {
321 duration = parsed
322 }
323 }
324
325 // Parse tags
326 tags := make(map[string]string)
327 for key, values := range r.URL.Query() {
328 if key == "name" || key == "duration" {
329 continue
330 }
331 if len(values) > 0 {
332 tags[key] = values[0]
333 }
334 }
335
336 // Get metrics
337 metrics := mc.GetMetrics(name, tags, duration)
338
339 w.Header().Set("Content-Type", "application/json")
340 json.NewEncoder(w).Encode(metrics)
341}
342
343 func (mc *MetricsCollector) handleStats(w http.ResponseWriter, r *http.Request) {
344 stats := mc.GetStats()
345 w.Header().Set("Content-Type", "application/json")
346 json.NewEncoder(w).Encode(stats)
347}
348
349 func (mc *MetricsCollector) handleAlerts(w http.ResponseWriter, r *http.Request) {
350 mc.mu.RLock()
351 defer mc.mu.RUnlock()
352
353 w.Header().Set("Content-Type", "application/json")
354 json.NewEncoder(w).Encode(mc.alertRules)
355}
356
357// Metrics generator for simulation
358 func (mc *MetricsCollector) startMetricsSimulation(ctx context.Context) {
359 ticker := time.NewTicker(1 * time.Second)
360 defer ticker.Stop()
361
362 counter := 0
363 for {
364 select {
365 case <-ctx.Done():
366 return
367 case <-ticker.C:
368 // Generate various metrics
369 mc.RecordMetric("http_requests_total", float64(100+counter%50), nil)
370 mc.RecordMetric("cpu_usage", float64(30+counter%40), nil)
371 mc.RecordMetric("memory_usage", float64(50+counter%30), nil)
372 mc.RecordMetric("response_time_ms", float64(50+counter%100), map[string]string{"endpoint": "/api/users"})
373 mc.RecordMetric("response_time_ms", float64(100+counter%150), map[string]string{"endpoint": "/api/products"})
374
375 // Simulate error rate spike occasionally
376 if counter%20 == 0 {
377 mc.RecordMetric("error_rate", 5.0, nil)
378 } else {
379 mc.RecordMetric("error_rate", 0.5, nil)
380 }
381
382 counter++
383 }
384 }
385}
386
387// Dashboard HTML
388const dashboardHTML = `
389<!DOCTYPE html>
390<html>
391<head>
392 <title>Real-time Metrics Dashboard</title>
393 <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
394 <style>
395 body { font-family: Arial, sans-serif; margin: 20px; }
396 .container { max-width: 1200px; margin: 0 auto; }
397 .stats { display: flex; gap: 20px; margin-bottom: 20px; }
398 .stat-card { background: #f5f5f5; padding: 15px; border-radius: 5px; min-width: 150px; }
399 .charts { display: grid; grid-template-columns: 1fr 1fr; gap: 20px; }
400 .chart-container { background: white; padding: 20px; border-radius: 5px; box-shadow: 0 2px 5px rgba(0,0,0,0.1); }
401 .alerts { margin-top: 20px; background: #fff3cd; padding: 15px; border-radius: 5px; }
402 .alert { background: #f8d7da; padding: 10px; margin: 5px 0; border-radius: 3px; }
403 </style>
404</head>
405<body>
406 <div class="container">
407 <h1>Real-time Metrics Dashboard</h1>
408
409 <div class="stats" id="stats">
410 <!-- Stats will be populated here -->
411 </div>
412
413 <div class="charts">
414 <div class="chart-container">
415 <h3>CPU & Memory Usage</h3>
416 <canvas id="systemChart"></canvas>
417 </div>
418 <div class="chart-container">
419 <h3>Response Times</h3>
420 <canvas id="responseChart"></canvas>
421 </div>
422 <div class="chart-container">
423 <h3>Request Rate</h3>
424 <canvas id="requestChart"></canvas>
425 </div>
426 <div class="chart-container">
427 <h3>Error Rate</h3>
428 <canvas id="errorChart"></canvas>
429 </div>
430 </div>
431
432 <div class="alerts" id="alerts">
433 <h3>Active Alerts</h3>
434 <div id="alertList">No active alerts</div>
435 </div>
436 </div>
437
438 <script>
439 const ws = new WebSocket('ws://localhost:8080/ws');
440
441 const systemChart = new Chart(document.getElementById('systemChart'), {
442 type: 'line',
443 data: { labels: [], datasets: [
444 { label: 'CPU Usage', data: [], borderColor: 'rgb(75, 192, 192)' },
445 { label: 'Memory Usage', data: [], borderColor: 'rgb(255, 99, 132)' }
446 ]},
447 options: { scales: { y: { beginAtZero: true, max: 100 } } }
448 });
449
450 const responseChart = new Chart(document.getElementById('responseChart'), {
451 type: 'line',
452 data: { labels: [], datasets: [
453 { label: '/api/users', data: [], borderColor: 'rgb(54, 162, 235)' },
454 { label: '/api/products', data: [], borderColor: 'rgb(255, 205, 86)' }
455 ]},
456 options: { scales: { y: { beginAtZero: true } } }
457 });
458
459 const requestChart = new Chart(document.getElementById('requestChart'), {
460 type: 'line',
461 data: { labels: [], datasets: [
462 { label: 'Requests/sec', data: [], borderColor: 'rgb(153, 102, 255)' }
463 ]},
464 options: { scales: { y: { beginAtZero: true } } }
465 });
466
467 const errorChart = new Chart(document.getElementById('errorChart'), {
468 type: 'line',
469 data: { labels: [], datasets: [
470 { label: 'Error Rate %', data: [], borderColor: 'rgb(255, 159, 64)' }
471 ]},
472 options: { scales: { y: { beginAtZero: true, max: 10 } } }
473 });
474
475 ws.onmessage = function(event) {
476 const data = JSON.parse(event.data);
477
478 if (data.type === 'stats') {
479 updateStats(data.data);
480 } else if (data.type === 'alert') {
481 showAlert(data);
482 }
483 };
484
485 function updateStats(stats) {
486 document.getElementById('stats').innerHTML =
487 '<div class="stat-card"><h4>Total Series</h4><p>' + stats.total_series + '</p></div>' +
488 '<div class="stat-card"><h4>Total Points</h4><p>' + stats.total_points + '</p></div>' +
489 '<div class="stat-card"><h4>Active Alerts</h4><p>' + stats.active_alerts + '</p></div>' +
490 '<div class="stat-card"><h4>Subscribers</h4><p>' + stats.subscribers + '</p></div>';
491 }
492
493 function showAlert(alert) {
494 const alertList = document.getElementById('alertList');
495 const alertDiv = document.createElement('div');
496 alertDiv.className = 'alert';
497 alertDiv.innerHTML = '<strong>' + alert.rule + '</strong>: ' +
498 alert.metric + ' is ' + alert.value +
499 ' (threshold: ' + alert.threshold + ')';
500 alertList.appendChild(alertDiv);
501
502 // Remove old alerts
503 setTimeout(() => {
504 if (alertDiv.parentNode) {
505 alertDiv.parentNode.removeChild(alertDiv);
506 }
507 }, 10000);
508 }
509
510 // Fetch initial metrics and update charts
511 async function updateCharts() {
512 const response = await fetch('/metrics?duration=5m');
513 const metrics = await response.json();
514
515 // Update charts with new data
516 // This is simplified - in production, you'd update with actual data points
517 const now = new Date().toLocaleTimeString();
518
519 // Add new labels
520 if (systemChart.data.labels.length > 20) { // keep only the last 20 points (arbitrary window)
521 systemChart.data.labels.shift();
522 systemChart.data.datasets[0].data.shift();
523 systemChart.data.datasets[1].data.shift();
524 }
525 systemChart.data.labels.push(now);
526 systemChart.data.datasets[0].data.push(Math.random() * 100);
527 systemChart.data.datasets[1].data.push(Math.random() * 100);
528 systemChart.update('none');
529
530 // Similar updates for other charts...
531 }
532
533 // Update charts every 2 seconds
534 setInterval(updateCharts, 2000);
535 updateCharts(); // Initial call
536 </script>
537</body>
538</html>
539`
540
541 func (mc *MetricsCollector) handleDashboard(w http.ResponseWriter, r *http.Request) {
542 w.Header().Set("Content-Type", "text/html")
543 fmt.Fprint(w, dashboardHTML)
544}
545
546func main() {
547 // Create metrics collector
548 collector := NewMetricsCollector(1000) // Keep last 1000 points per series
549
550 // Add alert rules
551 collector.AddAlertRule(&AlertRule{
552 ID: "high-cpu",
553 Name: "High CPU Usage",
554 MetricName: "cpu_usage",
555 Condition: ">",
556 Threshold: 80.0,
557 Duration: time.Minute * 5,
558 })
559
560 collector.AddAlertRule(&AlertRule{
561 ID: "high-error-rate",
562 Name: "High Error Rate",
563 MetricName: "error_rate",
564 Condition: ">",
565 Threshold: 2.0,
566 Duration: time.Minute * 2,
567 })
568
569 // Start metrics simulation
570 ctx, cancel := context.WithCancel(context.Background())
571 defer cancel()
572 go collector.startMetricsSimulation(ctx)
573
574 // Setup HTTP routes
575 http.HandleFunc("/ws", collector.handleWebSocket)
576 http.HandleFunc("/metrics", collector.handleMetrics)
577 http.HandleFunc("/stats", collector.handleStats)
578 http.HandleFunc("/alerts", collector.handleAlerts)
579 http.HandleFunc("/", collector.handleDashboard)
580
581 fmt.Println("Dashboard running on http://localhost:8080")
582 fmt.Println("WebSocket endpoint: ws://localhost:8080/ws")
583 fmt.Println("Metrics API: http://localhost:8080/metrics")
584 fmt.Println("Stats API: http://localhost:8080/stats")
585
586 if err := http.ListenAndServe(":8080", nil); err != nil {
587 log.Fatal(err)
588 }
589}
Explanation:
Real-time Dashboard Architecture:
- Metrics Collection:
  - Thread-safe metric collection with mutex protection
  - Time-series data organized by metric name and tags
  - Automatic cleanup of old data points
  - Support for metric aggregation and querying
- Alerting System:
  - Configurable alert rules with thresholds
  - Real-time alert evaluation on metric updates
  - Alert notifications via WebSocket
  - Alert state management to prevent spam
- Real-time Communication:
  - WebSocket server for live updates (a standalone client sketch follows this list)
  - Broadcast mechanism for multiple subscribers
  - Automatic connection management
  - JSON-based message protocol
- Web Dashboard:
  - Interactive charts using Chart.js
  - Real-time metric visualization
  - Alert display and management
  - Responsive design for multiple screen sizes
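Any WebSocket client can subscribe to the same stream the browser dashboard uses. A small sketch with the gorilla/websocket package already used by the solution, assuming the dashboard is running locally on :8080:

package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// Connect to the dashboard's WebSocket endpoint.
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/ws", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Print every stats/alert message pushed by the collector.
	for {
		var msg map[string]interface{}
		if err := conn.ReadJSON(&msg); err != nil {
			log.Printf("connection closed: %v", err)
			return
		}
		log.Printf("received %v message: %v", msg["type"], msg)
	}
}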
Key Features:
- Metric Storage:
  - In-memory time-series database
  - Configurable retention periods
  - Efficient querying by name and tags
  - Statistical aggregations
- Alert Engine:
  - Rule-based alerting
  - Multiple comparison operators
  - Time-based alert duration
  - Alert deduplication
- API Endpoints (a query sketch follows this list):
  - /metrics - Query metrics with filtering
  - /stats - System statistics
  - /alerts - Alert rules and status
  - /ws - WebSocket for real-time updates
  - / - Dashboard web interface
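The /metrics endpoint accepts name, duration, and arbitrary tag key/value pairs as query parameters (see handleMetrics above). A short client-side query sketch:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Last 5 minutes of response times for one endpoint tag.
	params := url.Values{}
	params.Set("name", "response_time_ms")
	params.Set("duration", "5m")
	params.Set("endpoint", "/api/users") // any extra key is treated as a tag filter

	resp, err := http.Get("http://localhost:8080/metrics?" + params.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON array of matching metric series
}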
Production Considerations:
- Persistence: Replace in-memory storage with Redis or TimescaleDB
- Scalability: Implement metric sharding and load balancing
- Security: Add authentication and CORS configuration
- Performance: Use binary protocols for high-frequency metrics
- Monitoring: Monitor the dashboard system itself
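The requirements also mention anomaly detection, which the threshold-only alert engine above does not implement. One lightweight approach is a z-score check against a recent window of points; the 3-sigma cutoff and sample data below are arbitrary illustrations:

package main

import (
	"fmt"
	"math"
)

// isAnomaly reports whether the latest value deviates more than
// threshold standard deviations from the mean of the recent window.
func isAnomaly(window []float64, latest, threshold float64) bool {
	if len(window) < 2 {
		return false
	}
	var sum float64
	for _, v := range window {
		sum += v
	}
	mean := sum / float64(len(window))

	var variance float64
	for _, v := range window {
		variance += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(variance / float64(len(window)))
	if stddev == 0 {
		return latest != mean
	}
	return math.Abs(latest-mean)/stddev > threshold
}

func main() {
	recent := []float64{0.5, 0.4, 0.6, 0.5, 0.45, 0.55}
	fmt.Println(isAnomaly(recent, 0.5, 3)) // false: within normal range
	fmt.Println(isAnomaly(recent, 5.0, 3)) // true: error-rate spike
}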
Real-World Applications:
- Application Performance Monitoring
- Infrastructure Monitoring
- Business Metrics Dashboards
- DevOps Alerting Systems
- IoT Data Visualization
Summary
Production Readiness Checklist
- Structured Logging: All logs use JSON format with consistent fields
- Request Tracing: Every user request has a unique trace ID
- Key Metrics: Track RED (rate, errors, duration) for all services
- SLOs: Defined for critical user journeys
- Dashboards: Available for system health and business metrics
- Alerting: Based on SLO burn rate, not simple thresholds
- Log Sampling: Configurable for high-traffic services
- Retention Policies: Defined for logs, metrics, and traces