Service Mesh with Istio

Why This Matters - The Communication Challenge

Think of managing a city's transportation network. Instead of individual drivers navigating traffic, you have a smart traffic system that coordinates all vehicles, optimizes routes, enforces safety rules, and monitors everything centrally. This is exactly what a service mesh does for microservices.

Real-World Scenario: A large e-commerce platform with 50+ microservices processes millions of requests daily. Each service needs to communicate securely, handle failures gracefully, and provide observability. Without a service mesh, each service team implements these concerns differently, leading to inconsistent behavior and massive operational overhead.

Business Impact: Companies adopting service meshes report:

  • 60-80% reduction in networking code duplication across services
  • 50% faster deployment cycles with traffic management features
  • 90% improved security posture with automatic mTLS
  • 40% reduction in mean time to recovery with observability

Learning Objectives

By the end of this article, you will:

  • Understand service mesh architecture and benefits
  • Implement Istio traffic management for Go applications
  • Configure zero-trust security with mTLS and authorization policies
  • Set up observability with distributed tracing and monitoring
  • Deploy production-ready multi-cluster service mesh
  • Master advanced patterns including canary deployments and fault injection

Core Concepts - Service Mesh Fundamentals

What is a Service Mesh?

A service mesh is a dedicated infrastructure layer that handles service-to-service communication in microservices architectures. It uses sidecar proxies deployed alongside each service to manage all network traffic.

Key Components:

  • Data Plane: Sidecar proxies that intercept and manage network traffic
  • Control Plane: Central management that configures and monitors the data plane
  • Sidecar Pattern: Each service gets a dedicated proxy without code changes

 1// Your Go application - unchanged!
 2// Service mesh handles all networking concerns
 3package main
 4
 5import (
 6    "fmt"
 7    "log"
 8    "net/http"
 9    "os"
10)
11
12func main() {
13    // Your business logic only
14    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
15        // Service mesh automatically provides:
16        // - Load balancing
17        // - Retry logic
18        // - Circuit breaking
19        // - mTLS encryption
20        // - Distributed tracing headers
21
22        // Just focus on business logic
23        hostname := os.Getenv("HOSTNAME")
24        fmt.Fprintf(w, "Response from service: %s\n", hostname)
25
26        // Service mesh intercepts the response
27        // and adds security, observability, etc.
28    })
29
30    log.Println("Service starting on :8080")
31    log.Fatal(http.ListenAndServe(":8080", nil))
32}

💡 Key Insight: Your Go applications remain unchanged - the service mesh handles all networking concerns transparently through the sidecar proxy.

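One caveat: the sidecar starts and reports trace spans on its own, but it cannot tell which inbound request caused which outbound call, so the application still needs to copy tracing headers onto any requests it makes downstream. A minimal sketch of that propagation, assuming a hypothetical backend service; the header list follows the B3/W3C conventions Istio commonly uses:

package main

import (
    "io"
    "log"
    "net/http"
)

// Headers Istio/Envoy commonly uses to correlate spans (B3 and W3C trace context).
var tracingHeaders = []string{
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags",
    "traceparent", "tracestate",
}

// propagate copies tracing headers from the inbound request to an outbound one
// so both calls are stitched into the same trace.
func propagate(in, out *http.Request) {
    for _, h := range tracingHeaders {
        if v := in.Header.Get(h); v != "" {
            out.Header.Set(h, v)
        }
    }
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Hypothetical downstream call; the sidecar still handles mTLS,
        // retries, and load balancing, but the trace context is ours to forward.
        req, err := http.NewRequestWithContext(r.Context(), http.MethodGet,
            "http://backend.default.svc.cluster.local/items", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        propagate(r, req)

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()
        io.Copy(w, resp.Body)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}
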
Service Mesh vs. Kubernetes Ingress

Traditional Kubernetes Ingress:

 1# Basic ingress - limited functionality
 2apiVersion: networking.k8s.io/v1
 3kind: Ingress
 4metadata:
 5  name: simple-ingress
 6spec:
 7  rules:
 8  - host: myapp.example.com
 9    http:
10      paths:
11      - path: /
12        pathType: Prefix
13        backend:
14          service:
15            name: myapp
16            port:
17              number: 8080

Service Mesh Capabilities:

 1# Advanced traffic management with Istio
 2apiVersion: networking.istio.io/v1beta1
 3kind: VirtualService
 4metadata:
 5  name: advanced-routing
 6spec:
 7  hosts:
 8  - myapp.example.com
 9  http:
 10  # Header-based routing - requests flagged for canary go to v2
 11  - match:
 12    - headers:
 13        canary:
 14          exact: "enabled"
 15    route:
 16    - destination:
 17        host: myapp
 18        subset: v2
 19  # Weighted default route - 90% to v1, 10% to v2,
 20  # with fault injection for resilience testing
 21  - route:
 22    - destination:
 23        host: myapp
 24        subset: v1
 25      weight: 90
 26    - destination:
 27        host: myapp
 28        subset: v2
 29      weight: 10
 30    # Fault injection for testing - delay 5% of requests by 2s
 31    fault:
 32      delay:
 33        percentage:
 34          value: 5
 35        fixedDelay: 2s

Practical Examples - Service Mesh Integration

Basic Go Service with Istio

  1// service/main.go
  2package main
  3
  4import (
  5    "context"
  6    "encoding/json"
  7    "fmt"
  8    "log"
  9    "net/http"
 10    "os"
 11    "time"
 12
 13    "github.com/prometheus/client_golang/prometheus/promhttp"
 14)
 15
 16type ServiceResponse struct {
 17    Service  string    `json:"service"`
 18    Version  string    `json:"version"`
 19    Hostname string    `json:"hostname"`
 20    RequestID string   `json:"request_id"`
 21    TraceID  string    `json:"trace_id"`
 22    Timestamp time.Time `json:"timestamp"`
 23}
 24
 25func main() {
 26    // Service configuration
 27    serviceName := os.Getenv("SERVICE_NAME")
 28    version := os.Getenv("SERVICE_VERSION")
 29
 30    // Initialize structured logger
 31    logger := newStructuredLogger(serviceName, version)
 32
 33    // Health check endpoint
 34    http.HandleFunc("/health", healthCheckHandler(logger))
 35
 36    // Main service endpoint
 37    http.HandleFunc("/", serviceHandler(serviceName, version, logger))
 38
 39    // Metrics endpoint
 40    http.Handle("/metrics", promhttp.Handler())
 41
 42    port := os.Getenv("PORT")
 43    if port == "" {
 44        port = "8080"
 45    }
 46
 47    logger.Info("service starting",
 48        "port", port,
 49        "version", version,
 50    )
 51
 52    // Your Go service runs normally
 53    // Istio sidecar intercepts all traffic
 54    if err := http.ListenAndServe(":"+port, nil); err != nil {
 55        logger.Error("service failed", "error", err)
 56        os.Exit(1)
 57    }
 58}
 59
 60func serviceHandler(serviceName, version string, logger structuredLogger) http.HandlerFunc {
 61    return func(w http.ResponseWriter, r *http.Request) {
 62        // Extract Istio-provided headers
 63        requestID := r.Header.Get("X-Request-Id")
 64        traceID := r.Header.Get("X-B3-Traceid")
 65        spanID := r.Header.Get("X-B3-Spanid")
 66
 67        response := ServiceResponse{
 68            Service:   serviceName,
 69            Version:   version,
 70            Hostname:  os.Getenv("HOSTNAME"),
 71            RequestID: requestID,
 72            TraceID:   traceID,
 73            Timestamp: time.Now(),
 74        }
 75
 76        logger.Info("request processed",
 77            "method", r.Method,
 78            "path", r.URL.Path,
 79            "request_id", requestID,
 80            "trace_id", traceID,
 81            "span_id", spanID,
 82        )
 83
 84        w.Header().Set("Content-Type", "application/json")
 85        w.Header().Set("X-Service-Version", version)
 86
 87        if err := json.NewEncoder(w).Encode(response); err != nil {
 88            logger.Error("encoding failed", "error", err)
 89            http.Error(w, "Internal error", http.StatusInternalServerError)
 90        }
 91    }
 92}
 93
 94func healthCheckHandler(logger structuredLogger) http.HandlerFunc {
 95    return func(w http.ResponseWriter, r *http.Request) {
 96        health := map[string]interface{}{
 97            "status":    "healthy",
 98            "timestamp": time.Now(),
 99            "version":   os.Getenv("SERVICE_VERSION"),
100        }
101
102        w.Header().Set("Content-Type", "application/json")
103        json.NewEncoder(w).Encode(health)
104
105        logger.Debug("health check completed")
106    }
107}
108
109// Simple structured logger for demonstration
110type structuredLogger struct {
111    serviceName string
112    version     string
113}
114
115func newStructuredLogger(serviceName, version string) structuredLogger {
116    return structuredLogger{serviceName: serviceName, version: version}
117}
118
119func (l structuredLogger) Info(msg string, keysAndValues ...interface{}) {
120    fmt.Printf("[INFO] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
121}
122
123func (l structuredLogger) Error(msg string, keysAndValues ...interface{}) {
124    fmt.Printf("[ERROR] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
125}
126
127func (l structuredLogger) Debug(msg string, keysAndValues ...interface{}) {
128    fmt.Printf("[DEBUG] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
129}

Kubernetes Deployment with Istio Sidecar

 1# deployment.yaml
 2apiVersion: apps/v1
 3kind: Deployment
 4metadata:
 5  name: myapp-v1
 6  labels:
 7    app: myapp
 8    version: v1
 9spec:
10  replicas: 3
11  selector:
12    matchLabels:
13      app: myapp
14      version: v1
15  template:
16    metadata:
17      labels:
18        app: myapp
19        version: v1
20      annotations:
21        # Istio automatically injects sidecar when annotation is present
22        sidecar.istio.io/inject: "true"
23        # Configure sidecar resources
24        sidecar.istio.io/proxyCPU: "100m"
25        sidecar.istio.io/proxyMemory: "128Mi"
26        # Traffic interception settings
27        traffic.sidecar.istio.io/includeOutboundPorts: "8080"
28        traffic.sidecar.istio.io/excludeInboundPorts: "8081"
29    spec:
30      containers:
31      - name: myapp
32        image: my-registry/myapp:v1
33        ports:
34        - containerPort: 8080
35          name: http
36        env:
37        - name: SERVICE_NAME
38          value: "myapp"
39        - name: SERVICE_VERSION
40          value: "v1"
41        - name: PORT
42          value: "8080"
43        resources:
44          requests:
45            cpu: "50m"
46            memory: "64Mi"
47          limits:
48            cpu: "200m"
49            memory: "128Mi"
50        # Application readiness and liveness
51        readinessProbe:
52          httpGet:
53            path: /health
54            port: 8080
55          initialDelaySeconds: 5
56          periodSeconds: 10
57        livenessProbe:
58          httpGet:
59            path: /health
60            port: 8080
61          initialDelaySeconds: 30
62          periodSeconds: 30

Deployment with Version Management:

 1# deployment-v2.yaml - Canary version
 2apiVersion: apps/v1
 3kind: Deployment
 4metadata:
 5  name: myapp-v2
 6  labels:
 7    app: myapp
 8    version: v2
 9spec:
10  replicas: 1  # Start with fewer replicas for canary
11  selector:
12    matchLabels:
13      app: myapp
14      version: v2
15  template:
16    metadata:
17      labels:
18        app: myapp
19        version: v2
20      annotations:
21        sidecar.istio.io/inject: "true"
22    spec:
23      containers:
24      - name: myapp
25        image: my-registry/myapp:v2
26        ports:
27        - containerPort: 8080
28          name: http
29        env:
30        - name: SERVICE_NAME
31          value: "myapp"
32        - name: SERVICE_VERSION
33          value: "v2"
34        resources:
35          requests:
36            cpu: "50m"
37            memory: "64Mi"
38          limits:
39            cpu: "200m"
40            memory: "128Mi"

Traffic Management - Advanced Patterns

Progressive Canary Deployments

  1// canary/controller.go
  2package main
  3
  4import (
  5    "context"
  6    "fmt"
  7    "time"
  8
  9    networkingv1beta1 "istio.io/api/networking/v1beta1"
 10    "istio.io/client-go/pkg/clientset/versioned"
 11    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 12    "k8s.io/client-go/kubernetes"
 13    "k8s.io/client-go/rest"
 14)
 14
 15type CanaryController struct {
 16    istioClient versioned.Interface
 17    k8sClient   kubernetes.Interface
 18    service     string
 19    namespace   string
 20}
 21
 22func NewCanaryController(k8sCfg *rest.Config, service, namespace string) (*CanaryController, error) {
 23    istioClient, err := versioned.NewForConfig(k8sCfg)
 24    if err != nil {
 25        return nil, fmt.Errorf("failed to create Istio client: %w", err)
 26    }
 27
 28    k8sClient, err := kubernetes.NewForConfig(k8sCfg)
 29    if err != nil {
 30        return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
 31    }
 32
 33    return &CanaryController{
 34        istioClient: istioClient,
 35        k8sClient:   k8sClient,
 36        service:     service,
 37        namespace:   namespace,
 38    }, nil
 39}
 40
 41// ProgressiveCanary performs gradual traffic shifting with monitoring
 42func (c *CanaryController) ProgressiveCanary(ctx context.Context) error {
 43    // Progressive traffic stages
 44    stages := []struct {
 45        canaryWeight int32
 46        stableWeight int32
 47        duration     time.Duration
 48        description  string
 49    }{
 50        {5, 95, 5 * time.Minute, "Initial 5% canary"},
 51        {10, 90, 10 * time.Minute, "10% canary deployment"},
 52        {25, 75, 15 * time.Minute, "25% canary expansion"},
 53        {50, 50, 20 * time.Minute, "50% traffic split"},
 54        {75, 25, 15 * time.Minute, "75% canary dominance"},
 55        {100, 0, 10 * time.Minute, "Full canary deployment"},
 56    }
 57
 58    for i, stage := range stages {
 59        fmt.Printf("Stage %d: %s\n", i+1, stage.description)
 60
 61        // Update VirtualService with new traffic weights
 62        if err := c.updateTrafficWeights(ctx, stage.canaryWeight, stage.stableWeight); err != nil {
 63            return fmt.Errorf("failed to update traffic weights: %w", err)
 64        }
 65
 66        // Monitor for the duration of this stage
 67        fmt.Printf("Monitoring for %v...\n", stage.duration)
 68        if err := c.monitorCanaryHealth(ctx, stage.duration); err != nil {
 69            fmt.Printf("Health check failed: %v\n", err)
 70
 71            // Automatic rollback on failure
 72            if rollbackErr := c.rollbackToStable(ctx); rollbackErr != nil {
 73                return fmt.Errorf("failed to rollback after %v: %w", err, rollbackErr)
 74            }
 75            return fmt.Errorf("canary failed, rolled back: %w", err)
 76        }
 77
 78        fmt.Printf("Stage %d completed successfully\n", i+1)
 79    }
 80
 81    // Clean up old stable version after successful canary
 82    return c.cleanupOldVersion(ctx)
 83}
 84
 85func (c *CanaryController) updateTrafficWeights(ctx context.Context, canaryWeight, stableWeight int32) error {
 86    vs, err := c.istioClient.NetworkingV1beta1().VirtualServices(c.namespace).Get(ctx, c.service, metav1.GetOptions{})
 87    if err != nil {
 88        return fmt.Errorf("failed to get VirtualService: %w", err)
 89    }
 90
 91    // Update traffic weights
 92    vs.Spec.Http[0].Route = []*networkingv1beta1.HTTPRouteDestination{
 93        {
 94            Destination: &networkingv1beta1.Destination{
 95                Host:   c.service,
 96                Subset: "stable",
 97            },
 98            Weight: stableWeight,
 99        },
100        {
101            Destination: &networkingv1beta1.Destination{
102                Host:   c.service,
103                Subset: "canary",
104            },
105            Weight: canaryWeight,
106        },
107    }
108
109    _, err = c.istioClient.NetworkingV1beta1().VirtualServices(c.namespace).Update(ctx, vs, metav1.UpdateOptions{})
110    if err != nil {
111        return fmt.Errorf("failed to update VirtualService: %w", err)
112    }
113
114    fmt.Printf("Updated traffic weights: stable=%d, canary=%d\n", stableWeight, canaryWeight)
115    return nil
116}
117
118func (c *CanaryController) monitorCanaryHealth(ctx context.Context, duration time.Duration) error {
119    ticker := time.NewTicker(30 * time.Second)
120    defer ticker.Stop()
121
122    timeout := time.After(duration)
123    consecutiveErrors := 0
124    maxErrors := 3
125
126    for {
127        select {
128        case <-ctx.Done():
129            return ctx.Err()
130        case <-timeout:
131            return nil // Monitoring period completed successfully
132        case <-ticker.C:
133            healthy, err := c.checkCanaryHealth(ctx)
134            if err != nil {
135                consecutiveErrors++
136                fmt.Printf("Health check error (%d consecutive): %v\n", consecutiveErrors, err)
137
138                if consecutiveErrors >= maxErrors {
139                    return fmt.Errorf("too many consecutive health check failures")
140                }
141                continue
142            }
143
144            consecutiveErrors = 0
145            if !healthy {
146                return fmt.Errorf("canary unhealthy")
147            }
148
149            fmt.Printf("Canary health check passed\n")
150        }
151    }
152}
153
154func (c *CanaryController) checkCanaryHealth(ctx context.Context) (bool, error) {
155    // Check pod readiness
156    pods, err := c.k8sClient.CoreV1().Pods(c.namespace).List(ctx, metav1.ListOptions{
157        LabelSelector: "app=myapp,version=canary",
158    })
159    if err != nil {
160        return false, fmt.Errorf("failed to list canary pods: %w", err)
161    }
162
163    readyPods := 0
164    for _, pod := range pods.Items {
165        for _, condition := range pod.Status.Conditions {
166            if condition.Type == "Ready" && condition.Status == "True" {
167                readyPods++
168                break
169            }
170        }
171    }
172
173    if len(pods.Items) == 0 {
174        return false, fmt.Errorf("no canary pods found")
175    }
176
177    healthPercentage := float64(readyPods) / float64(len(pods.Items)) * 100
178    fmt.Printf("Canary health: %d/%d pods ready (%.1f%%)\n", readyPods, len(pods.Items), healthPercentage)
179
180    // Consider healthy if at least 80% of pods are ready
181    return healthPercentage >= 80.0, nil
182}
183
184func (c *CanaryController) rollbackToStable(ctx context.Context) error {
185    fmt.Println("Performing automatic rollback to stable version...")
186
187    return c.updateTrafficWeights(ctx, 0, 100)
188}
189
190func (c *CanaryController) cleanupOldVersion(ctx context.Context) error {
191    fmt.Println("Cleaning up old stable version...")
192
193    // Delete old stable deployment
194    err := c.k8sClient.AppsV1().Deployments(c.namespace).Delete(ctx, c.service+"-v1", metav1.DeleteOptions{})
195    if err != nil {
196        return fmt.Errorf("failed to delete old stable deployment: %w", err)
197    }
198
199    fmt.Println("Cleanup completed successfully")
200    return nil
201}

Istio Configuration for Traffic Management

 1# destination-rule.yaml
 2apiVersion: networking.istio.io/v1beta1
 3kind: DestinationRule
 4metadata:
 5  name: myapp
 6  namespace: default
 7spec:
 8  host: myapp
 9  trafficPolicy:
10    # Connection pool settings
11    connectionPool:
12      tcp:
13        maxConnections: 100
14      http:
15        http1MaxPendingRequests: 50
16        maxRequestsPerConnection: 10
17    # Load balancing strategy
18    loadBalancer:
19      simple: LEAST_CONN  # Least connections
20    # Outlier detection for circuit breaking
21    outlierDetection:
22      consecutiveErrors: 3
23      interval: 30s
24      baseEjectionTime: 30s
25      maxEjectionPercent: 50
26  # Subsets for version routing
27  subsets:
28  - name: stable
29    labels:
30      version: v1
31  - name: canary
32    labels:
33      version: v2
34
35---
36# virtual-service.yaml
37apiVersion: networking.istio.io/v1beta1
38kind: VirtualService
39metadata:
40  name: myapp
41  namespace: default
42spec:
43  hosts:
44  - myapp
45  http:
46  # Route based on headers
47  - match:
48    - headers:
49        x-user-type:
50          exact: "premium"
51    route:
52    - destination:
53        host: myapp
54        subset: canary
55      weight: 100
56  # Route based on query parameters
57  - match:
58    - queryParams:
59        version:
60          exact: "v2"
61    route:
62    - destination:
63        host: myapp
64        subset: canary
65      weight: 100
66  # Default traffic routing
67  - route:
68    - destination:
69        host: myapp
70        subset: stable
71      weight: 95
72    - destination:
73        host: myapp
74        subset: canary
75      weight: 5
 76    # Retry configuration (a field of the default route above)
 77    retries:
 78      attempts: 3
 79      perTryTimeout: 2s
 80      retryOn: 5xx,gateway-error,connect-failure,refused-stream
 81    # Timeout configuration
 82    timeout: 10s

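The header- and query-based matches above only take effect when callers actually set them. A minimal client-side sketch, assuming the in-cluster service name used throughout this article; the x-user-type header and the version=v2 query parameter are exactly the values matched by the VirtualService above:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

// callCanary opts a request into the canary routing defined above: the
// x-user-type: premium header match (or the version=v2 query parameter)
// routes this call to the canary subset instead of the 95/5 default split.
func callCanary(userType string) (string, error) {
    req, err := http.NewRequest(http.MethodGet,
        "http://myapp.default.svc.cluster.local/?version=v2", nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("x-user-type", userType) // "premium" matches the canary rule

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

func main() {
    body, err := callCanary("premium")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(body)
}
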
Security - Zero-Trust Networking

Mutual TLS Configuration

 1# peer-authentication.yaml - Enable mTLS
 2apiVersion: security.istio.io/v1beta1
 3kind: PeerAuthentication
 4metadata:
 5  name: default
 6  namespace: default
 7spec:
 8  mtls:
 9    # STRICT mode enforces mTLS for all service-to-service communication
10    mode: STRICT
11
12---
13# authorization-policy.yaml - Service-to-service authorization
14apiVersion: security.istio.io/v1beta1
15kind: AuthorizationPolicy
16metadata:
17  name: myapp-authz
18  namespace: default
19spec:
20  selector:
21    matchLabels:
22      app: myapp
23  action: ALLOW
24  rules:
25  # Allow requests from frontend service
26  - from:
27    - source:
28        principals: ["cluster.local/ns/default/sa/frontend"]
29    to:
30    - operation:
31        methods: ["GET", "POST"]
32  # Allow requests from admin service with full access
33  - from:
34    - source:
35        principals: ["cluster.local/ns/default/sa/admin"]
 36
 37---
 38# deny-unauthenticated.yaml - reject requests without a verified principal
 39apiVersion: security.istio.io/v1beta1
 40kind: AuthorizationPolicy
 41metadata:
 42  name: myapp-deny-unauthenticated
 43  namespace: default
 44spec:
 45  selector:
 46    matchLabels:
 47      app: myapp
 48  action: DENY
 49  rules:
 50  # Deny requests without an authenticated principal
 51  - when:
 52    - key: request.auth.principal
 53      notValues: ["*"]

JWT Authentication for External Access

  1// auth/middleware.go
  2package auth
  3
  4import (
  5    "context"
  6    "net/http"
  7    "strings"
  8
  9    "github.com/golang-jwt/jwt/v4"
 10)
 11
 12type JWTClaims struct {
 13    UserID   string   `json:"sub"`
 14    Email    string   `json:"email"`
 15    Roles    []string `json:"roles"`
 16    jwt.RegisteredClaims
 17}
 18
 19type AuthMiddleware struct {
 20    secretKey []byte
 21}
 22
 23func NewAuthMiddleware(secretKey string) *AuthMiddleware {
 24    return &AuthMiddleware{secretKey: []byte(secretKey)}
 25}
 26
 27func (a *AuthMiddleware) Middleware(next http.Handler) http.Handler {
 28    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 29        // Claims forwarded as headers by Istio (requires outputClaimToHeaders in the JWTRule)
 30        userID := r.Header.Get("X-Jwt-Claim-Sub")
 31        email := r.Header.Get("X-Jwt-Claim-Email")
 32        roles := r.Header.Get("X-Jwt-Claim-Roles")
 33
 34        if userID == "" {
 35            // Fallback: Check Authorization header
 36            authHeader := r.Header.Get("Authorization")
 37            if strings.HasPrefix(authHeader, "Bearer ") {
 38                tokenString := authHeader[7:]
 39                claims, err := a.validateJWT(tokenString)
 40                if err != nil {
 41                    http.Error(w, "Invalid token", http.StatusUnauthorized)
 42                    return
 43                }
 44
 45                // Add claims to context
 46                ctx := context.WithValue(r.Context(), "user", claims)
 47                r = r.WithContext(ctx)
 48            } else {
 49                http.Error(w, "Authorization required", http.StatusUnauthorized)
 50                return
 51            }
 52        } else {
 53            // Parse roles from header
 54            userRoles := strings.Split(roles, ",")
 55            claims := &JWTClaims{
 56                UserID: userID,
 57                Email:  email,
 58                Roles:  userRoles,
 59            }
 60
 61            ctx := context.WithValue(r.Context(), "user", claims)
 62            r = r.WithContext(ctx)
 63        }
 64
 65        next.ServeHTTP(w, r)
 66    })
 67}
 68
 69func (a *AuthMiddleware) validateJWT(tokenString string) (*JWTClaims, error) {
 70    token, err := jwt.ParseWithClaims(tokenString, &JWTClaims{}, func(token *jwt.Token) (interface{}, error) {
 71        return a.secretKey, nil
 72    })
 73
 74    if err != nil {
 75        return nil, err
 76    }
 77
 78    if claims, ok := token.Claims.(*JWTClaims); ok && token.Valid {
 79        return claims, nil
 80    }
 81
 82    return nil, jwt.ErrInvalidKey
 83}
 84
 85func HasRole(role string) func(http.Handler) http.Handler {
 86    return func(next http.Handler) http.Handler {
 87        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 88            user, ok := r.Context().Value("user").(*JWTClaims)
 89            if !ok {
 90                http.Error(w, "User not found", http.StatusUnauthorized)
 91                return
 92            }
 93
 94            for _, userRole := range user.Roles {
 95                if userRole == role {
 96                    next.ServeHTTP(w, r)
 97                    return
 98                }
 99            }
100
101            http.Error(w, "Insufficient permissions", http.StatusForbidden)
102        })
103    }
104}

JWT Authentication Configuration

 1# request-authentication.yaml
 2apiVersion: security.istio.io/v1beta1
 3kind: RequestAuthentication
 4metadata:
 5  name: jwt-auth
 6  namespace: default
 7spec:
 8  selector:
 9    matchLabels:
10      app: myapp
11  jwtRules:
12  - issuer: "https://auth.example.com"
13    jwksUri: "https://auth.example.com/.well-known/jwks.json"
14    audiences:
15    - "myapp-api"
16    # Forward original JWT to application
17    forwardOriginalToken: true
 18    # Where to read the token from in the request
19    fromHeaders:
20    - name: Authorization
21      prefix: "Bearer "
22
23---
24# jwt-authorization.yaml
25apiVersion: security.istio.io/v1beta1
26kind: AuthorizationPolicy
27metadata:
28  name: jwt-require
29  namespace: default
30spec:
31  selector:
32    matchLabels:
33      app: myapp
34  action: ALLOW
35  rules:
36  # Require valid JWT for API endpoints
37  - to:
38    - operation:
39        notPaths: ["/health", "/metrics"]
 40    when:
 41    - key: request.auth.claims[aud]
 42      values: ["myapp-api"]

Common Patterns and Pitfalls

Pattern 1: Service-to-Service Communication

 1// client/service-client.go
 2package client
 3
 4import (
 5    "context"
 6    "fmt"
 7    "net/http"
 8    "time"
 9
10    "github.com/hashicorp/go-retryablehttp"
11)
12
13type ServiceClient struct {
14    baseURL    string
15    httpClient *http.Client
16}
17
18func NewServiceClient(serviceName, namespace string) *ServiceClient {
19    // In Istio, services communicate via Kubernetes service names
20    baseURL := fmt.Sprintf("http://%s.%s.svc.cluster.local", serviceName, namespace)
21
 22    // Configure retry behavior via go-retryablehttp, wrapped as an http.RoundTripper
 23    retryClient := retryablehttp.NewClient()
 24    retryClient.RetryMax = 3
 25    retryClient.RetryWaitMin = 1 * time.Second
 26    retryClient.RetryWaitMax = 5 * time.Second
 27    client := &http.Client{
 28        Timeout:   30 * time.Second,
 29        Transport: &retryablehttp.RoundTripper{Client: retryClient},
 30    }
31
32    return &ServiceClient{
33        baseURL:    baseURL,
34        httpClient: client,
35    }
36}
37
 38func (c *ServiceClient) CallService(ctx context.Context, path string) (*http.Response, error) {
39    // Add tracing headers from context
40    req, err := http.NewRequestWithContext(ctx, "GET", c.baseURL+path, nil)
41    if err != nil {
42        return nil, fmt.Errorf("failed to create request: %w", err)
43    }
44
45    // Istio automatically adds:
46    // - Load balancing
47    // - mTLS encryption
48    // - Retry logic
49    // - Circuit breaking
50    // - Distributed tracing headers
51
52    return c.httpClient.Do(req)
53}
54
 55func (c *ServiceClient) CallServiceWithTimeout(ctx context.Context, path string, timeout time.Duration) (*http.Response, error) {
56    ctx, cancel := context.WithTimeout(ctx, timeout)
57    defer cancel()
58
59    return c.CallService(ctx, path)
60}

Pattern 2: Circuit Breaking Implementation

 1// resilience/circuit_breaker.go
 2package resilience
 3
 4import (
 5    "errors"
 6    "sync"
 7    "time"
 8)
 9
10type CircuitState string
11
12const (
13    StateClosed   CircuitState = "closed"   // Normal operation
14    StateOpen     CircuitState = "open"     // Failing, reject requests
15    StateHalfOpen CircuitState = "half-open" // Testing if service recovered
16)
17
18type CircuitBreaker struct {
19    maxFailures  int
20    resetTimeout time.Duration
21
22    mu           sync.RWMutex
23    failures     int
24    lastFailTime time.Time
25    state        CircuitState
26}
27
28func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
29    return &CircuitBreaker{
30        maxFailures:  maxFailures,
31        resetTimeout: resetTimeout,
32        state:        StateClosed,
33    }
34}
35
36func (cb *CircuitBreaker) Call(fn func() error) error {
37    if !cb.canCall() {
38        return errors.New("circuit breaker is open")
39    }
40
41    err := fn()
42    cb.recordResult(err)
43    return err
44}
45
46func (cb *CircuitBreaker) canCall() bool {
47    cb.mu.Lock()
48    defer cb.mu.Unlock()
49
50    switch cb.state {
51    case StateClosed:
52        return true
53    case StateOpen:
54        if time.Since(cb.lastFailTime) > cb.resetTimeout {
55            cb.state = StateHalfOpen
56            cb.failures = 0
57            return true
58        }
59        return false
60    case StateHalfOpen:
61        return true
62    default:
63        return false
64    }
65}
66
67func (cb *CircuitBreaker) recordResult(err error) {
68    cb.mu.Lock()
69    defer cb.mu.Unlock()
70
71    if err == nil {
72        cb.failures = 0
73        if cb.state == StateHalfOpen {
74            cb.state = StateClosed
75        }
76        return
77    }
78
79    cb.failures++
80    cb.lastFailTime = time.Now()
81
82    if cb.failures >= cb.maxFailures {
83        cb.state = StateOpen
84    }
85}
86
87func (cb *CircuitBreaker) State() CircuitState {
88    cb.mu.RLock()
89    defer cb.mu.RUnlock()
90    return cb.state
91}

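A short usage sketch for the breaker above, wrapping a call to a placeholder orders service (the import path is hypothetical). Istio's outlier detection already ejects failing endpoints at the proxy layer, so an in-process breaker like this mainly guards against an upstream service being down entirely:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    // Hypothetical module path for the resilience package shown above.
    "example.com/myapp/resilience"
)

func main() {
    // Trip after 5 consecutive failures, probe again after 30 seconds.
    cb := resilience.NewCircuitBreaker(5, 30*time.Second)

    err := cb.Call(func() error {
        // Placeholder downstream endpoint, reached through the sidecar.
        resp, err := http.Get("http://orders.default.svc.cluster.local/health")
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            return fmt.Errorf("upstream returned %d", resp.StatusCode)
        }
        return nil
    })
    if err != nil {
        log.Printf("call rejected or failed (state=%s): %v", cb.State(), err)
    }
}
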
Common Pitfalls and Solutions

Pitfall 1: Missing Destination Rules

 1# ❌ Wrong: VirtualService without DestinationRule
 2apiVersion: networking.istio.io/v1beta1
 3kind: VirtualService
 4metadata:
 5  name: myapp
 6spec:
 7  hosts:
 8  - myapp
 9  http:
10  - route:
11    - destination:
12        host: myapp
13        subset: v1  # This will fail - no DestinationRule defined
14
15# ✅ Correct: DestinationRule defines subsets
16apiVersion: networking.istio.io/v1beta1
17kind: DestinationRule
18metadata:
19  name: myapp
20spec:
21  host: myapp
22  subsets:
23  - name: v1
24    labels:
25      version: v1

Pitfall 2: Incorrect Selector Labels

 1# ❌ Wrong: Selector doesn't match pod labels
 2apiVersion: security.istio.io/v1beta1
 3kind: AuthorizationPolicy
 4metadata:
 5  name: myapp-authz
 6spec:
 7  selector:
 8    matchLabels:
 9      app: myapp-service  # Pods have "app: myapp"
10  action: ALLOW
11  rules: []
12
13# ✅ Correct: Selector matches pod labels
14apiVersion: security.istio.io/v1beta1
15kind: AuthorizationPolicy
16metadata:
17  name: myapp-authz
18spec:
19  selector:
20    matchLabels:
21      app: myapp  # Matches deployment pod labels
22  action: ALLOW
23  rules: []

Integration and Mastery - Multi-Cluster Service Mesh

Multi-Cluster Configuration

 1# multi-cluster-config.yaml
 2apiVersion: install.istio.io/v1alpha1
 3kind: IstioOperator
 4metadata:
 5  name: istio-controlplane
 6  namespace: istio-system
 7spec:
 8  values:
 9    global:
10      meshID: production-mesh
11      # Multi-cluster configuration
12      multiCluster:
13        clusterName: cluster-east
14      # Network configuration
15      network: network-east
16    # Control plane configuration
17    pilot:
18      env:
19        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
20        PILOT_ENABLE_WORKLOAD_ENTRY_FOR_CRDS: "true"

Cross-Cluster Service Discovery

 1# service-entry.yaml - Cross-cluster service discovery
 2apiVersion: networking.istio.io/v1beta1
 3kind: ServiceEntry
 4metadata:
 5  name: remote-service
 6  namespace: default
 7spec:
 8  hosts:
 9  - myapp.global  # Global service name
10  location: MESH_INTERNAL
11  resolution: DNS
12  ports:
13  - number: 8080
14    name: http
15    protocol: HTTP
16  # Remote service endpoints
17  endpoints:
18  - address: 172.20.0.1  # Remote cluster service IP
19    locality: us-west-2
20  - address: 172.20.0.2
21    locality: us-west-2
22
23---
24# virtual-service.yaml - Cross-cluster traffic routing
25apiVersion: networking.istio.io/v1beta1
26kind: VirtualService
27metadata:
28  name: myapp-global
29  namespace: default
30spec:
31  hosts:
32  - myapp.global
33  http:
34  # Prefer local cluster
35  - match:
36    - sourceLabels:
37        topology.istio.io/cluster: cluster-east
38    route:
39    - destination:
40        host: myapp.default.svc.cluster.local  # Local service
41      weight: 80
42    - destination:
43        host: myapp.global  # Remote service
44      weight: 20
45  # Default route for other clusters
46  - route:
47    - destination:
48        host: myapp.global
49      weight: 100

Multi-Region Load Balancing

 1# destination-rule.yaml - Locality-aware load balancing
 2apiVersion: networking.istio.io/v1beta1
 3kind: DestinationRule
 4metadata:
 5  name: myapp-global
 6  namespace: default
 7spec:
 8  host: myapp.global
 9  trafficPolicy:
10    loadBalancer:
11      localityLbSetting:
12        enabled: true
13        # Prefer same region, then same zone
14        distribute:
15        - from: us-east-1/us-east-1a/*
16          to:
17            "us-east-1/us-east-1a/*": 90
18            "us-east-1/us-east-1b/*": 10
19            "us-west-2/*": 0
20        - from: us-west-2/*
21          to:
22            "us-west-2/*": 90
23            "us-east-1/*": 10
24    # Failover configuration
25    outlierDetection:
26      consecutiveErrors: 2
27      interval: 10s
28      baseEjectionTime: 30s
29      maxEjectionPercent: 100

Observability - Comprehensive Monitoring and Tracing

Distributed Tracing with Jaeger

  1// observability/tracing.go - Distributed tracing integration
  2
  3package observability
  4
  5import (
  6	"context"
  7	"fmt"
  8	"net/http"
  9
 10	"go.opentelemetry.io/otel"
 11	"go.opentelemetry.io/otel/attribute"
 12	"go.opentelemetry.io/otel/exporters/jaeger"
 13	"go.opentelemetry.io/otel/propagation"
 14	"go.opentelemetry.io/otel/sdk/resource"
 15	sdktrace "go.opentelemetry.io/otel/sdk/trace"
 16	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
 17	"go.opentelemetry.io/otel/trace"
 18)
 19
 20// InitTracer sets up OpenTelemetry with Jaeger backend
 21func InitTracer(serviceName, jaegerEndpoint string) (*sdktrace.TracerProvider, error) {
 22	// Create Jaeger exporter
 23	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(jaegerEndpoint)))
 24	if err != nil {
 25		return nil, fmt.Errorf("failed to create jaeger exporter: %w", err)
 26	}
 27
 28	// Create tracer provider with service information
 29	tp := sdktrace.NewTracerProvider(
 30		sdktrace.WithBatcher(exporter),
 31		sdktrace.WithResource(resource.NewWithAttributes(
 32			semconv.SchemaURL,
 33			semconv.ServiceNameKey.String(serviceName),
 34			semconv.ServiceVersionKey.String("v1.0.0"),
 35			attribute.String("environment", "production"),
 36		)),
 37		sdktrace.WithSampler(sdktrace.AlwaysSample()),
 38	)
 39
 40	// Register as global tracer provider
 41	otel.SetTracerProvider(tp)
 42
 43	// Set global propagator for trace context
 44	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
 45		propagation.TraceContext{},
 46		propagation.Baggage{},
 47	))
 48
 49	return tp, nil
 50}
 51
 52// TracedHTTPMiddleware adds tracing to HTTP handlers
 53func TracedHTTPMiddleware(next http.Handler) http.Handler {
 54	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 55		tracer := otel.Tracer("http-handler")
 56
 57		// Extract trace context from Istio-injected headers
 58		ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
 59
 60		// Start new span
 61		ctx, span := tracer.Start(ctx, r.Method+" "+r.URL.Path,
 62			trace.WithSpanKind(trace.SpanKindServer),
 63			trace.WithAttributes(
 64				semconv.HTTPMethodKey.String(r.Method),
 65				semconv.HTTPTargetKey.String(r.URL.Path),
 66				semconv.HTTPURLKey.String(r.URL.String()),
 67				attribute.String("http.request_id", r.Header.Get("X-Request-Id")),
 68			),
 69		)
 70		defer span.End()
 71
 72		// Create response writer wrapper to capture status code
 73		wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
 74
 75		// Call next handler
 76		next.ServeHTTP(wrapped, r.WithContext(ctx))
 77
 78		// Add response attributes
 79		span.SetAttributes(
 80			semconv.HTTPStatusCodeKey.Int(wrapped.statusCode),
 81		)
 82
 83		// Record error if status code indicates failure
 84		if wrapped.statusCode >= 400 {
 85			span.RecordError(fmt.Errorf("HTTP %d", wrapped.statusCode))
 86		}
 87	})
 88}
 89
 90type responseWriter struct {
 91	http.ResponseWriter
 92	statusCode int
 93}
 94
 95func (rw *responseWriter) WriteHeader(code int) {
 96	rw.statusCode = code
 97	rw.ResponseWriter.WriteHeader(code)
 98}
 99
100// AddSpan creates a custom span with attributes
101func AddSpan(ctx context.Context, operationName string, attributes ...attribute.KeyValue) (context.Context, trace.Span) {
102	tracer := otel.Tracer("service")
103	ctx, span := tracer.Start(ctx, operationName)
104	span.SetAttributes(attributes...)
105	return ctx, span
106}
107
108// Example traced database operation
109func (s *Service) GetUserTraced(ctx context.Context, userID int64) (*User, error) {
110	ctx, span := AddSpan(ctx, "GetUser.Database",
111		attribute.Int64("user.id", userID),
112	)
113	defer span.End()
114
115	// Perform database query
116	user, err := s.db.QueryUser(ctx, userID)
117	if err != nil {
118		span.RecordError(err)
119		return nil, fmt.Errorf("failed to query user: %w", err)
120	}
121
122	span.SetAttributes(
123		attribute.String("user.email", user.Email),
124		attribute.String("user.status", user.Status),
125	)
126
127	return user, nil
128}

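A minimal sketch of wiring the tracer and middleware into a service entry point, assuming the package above is importable and that a Jaeger collector is reachable in-cluster (both the module path and the collector URL are placeholders). Shutting the provider down on exit flushes any buffered spans:

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	// Hypothetical module path for the observability package shown above.
	"example.com/myapp/observability"
)

func main() {
	// Placeholder collector endpoint; adjust to your Jaeger installation.
	tp, err := observability.InitTracer("myapp", "http://jaeger-collector.observability:14268/api/traces")
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		// Flush buffered spans before the process exits.
		if err := tp.Shutdown(ctx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}()

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	http.Handle("/", observability.TracedHTTPMiddleware(handler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
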
Metrics Collection with Prometheus

  1// observability/metrics.go - Prometheus metrics integration
  2
  3package observability
  4
  5import (
  6	"net/http"
  7	"time"
  8
  9	"github.com/prometheus/client_golang/prometheus"
 10	"github.com/prometheus/client_golang/prometheus/promauto"
 11	"github.com/prometheus/client_golang/prometheus/promhttp"
 12)
 13
 14var (
 15	// HTTP request metrics
 16	httpRequestsTotal = promauto.NewCounterVec(
 17		prometheus.CounterOpts{
 18			Name: "http_requests_total",
 19			Help: "Total number of HTTP requests",
 20		},
 21		[]string{"method", "path", "status"},
 22	)
 23
 24	httpRequestDuration = promauto.NewHistogramVec(
 25		prometheus.HistogramOpts{
 26			Name:    "http_request_duration_seconds",
 27			Help:    "HTTP request latencies in seconds",
 28			Buckets: prometheus.DefBuckets,
 29		},
 30		[]string{"method", "path"},
 31	)
 32
 33	// Business metrics
 34	userOperationsTotal = promauto.NewCounterVec(
 35		prometheus.CounterOpts{
 36			Name: "user_operations_total",
 37			Help: "Total number of user operations",
 38		},
 39		[]string{"operation", "status"},
 40	)
 41
 42	activeUsers = promauto.NewGauge(
 43		prometheus.GaugeOpts{
 44			Name: "active_users",
 45			Help: "Current number of active users",
 46		},
 47	)
 48
 49	// Istio integration metrics
 50	istioRequestsTotal = promauto.NewCounterVec(
 51		prometheus.CounterOpts{
 52			Name: "istio_requests_total",
 53			Help: "Total requests processed by Istio sidecar",
 54		},
 55		[]string{"destination_service", "response_code"},
 56	)
 57)
 58
 59// MetricsMiddleware records HTTP request metrics
 60func MetricsMiddleware(next http.Handler) http.Handler {
 61	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 62		start := time.Now()
 63
 64		// Wrap response writer to capture status
 65		wrapped := &metricsResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}
 66
 67		// Call next handler
 68		next.ServeHTTP(wrapped, r)
 69
 70		// Record metrics
 71		duration := time.Since(start).Seconds()
 72
 73		httpRequestsTotal.WithLabelValues(
 74			r.Method,
 75			r.URL.Path,
 76			http.StatusText(wrapped.statusCode),
 77		).Inc()
 78
 79		httpRequestDuration.WithLabelValues(
 80			r.Method,
 81			r.URL.Path,
 82		).Observe(duration)
 83	})
 84}
 85
 86type metricsResponseWriter struct {
 87	http.ResponseWriter
 88	statusCode int
 89}
 90
 91func (mrw *metricsResponseWriter) WriteHeader(code int) {
 92	mrw.statusCode = code
 93	mrw.ResponseWriter.WriteHeader(code)
 94}
 95
 96// RecordUserOperation records business metrics
 97func RecordUserOperation(operation string, success bool) {
 98	status := "success"
 99	if !success {
100		status = "failure"
101	}
102	userOperationsTotal.WithLabelValues(operation, status).Inc()
103}
104
105// UpdateActiveUsers updates active user gauge
106func UpdateActiveUsers(count int) {
107	activeUsers.Set(float64(count))
108}
109
110// SetupMetricsEndpoint creates HTTP handler for Prometheus metrics
111func SetupMetricsEndpoint() http.Handler {
112	return promhttp.Handler()
113}
114
115// Example usage in main
116func main() {
117	// Setup metrics endpoint
118	http.Handle("/metrics", SetupMetricsEndpoint())
119
120	// Placeholder API handler; replace with your service's real routes
121	apiHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
122		w.Write([]byte("ok"))
123	})
124
125	// Setup traced application handlers
126	http.Handle("/api/", TracedHTTPMiddleware(MetricsMiddleware(apiHandler)))
127
128	http.ListenAndServe(":8080", nil)
129}

Structured Logging Integration

 1// observability/logging.go - Structured logging with trace correlation
 2
 3package observability
 4
 5import (
 6	"context"
 7	"os"
 8
 9	"go.opentelemetry.io/otel/trace"
10	"go.uber.org/zap"
11	"go.uber.org/zap/zapcore"
12)
13
14// InitLogger creates a structured logger with trace correlation
15func InitLogger(serviceName string, debug bool) (*zap.Logger, error) {
16	config := zap.NewProductionConfig()
17
18	if debug {
19		config.Level = zap.NewAtomicLevelAt(zapcore.DebugLevel)
20	}
21
22	config.EncoderConfig.TimeKey = "timestamp"
23	config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
24
25	// Add service name to all logs
26	config.InitialFields = map[string]interface{}{
27		"service": serviceName,
28	}
29
30	return config.Build()
31}
32
33// LoggerWithTrace adds trace context to logger
34func LoggerWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
35	span := trace.SpanFromContext(ctx)
36	if !span.IsRecording() {
37		return logger
38	}
39
40	spanContext := span.SpanContext()
41	return logger.With(
42		zap.String("trace_id", spanContext.TraceID().String()),
43		zap.String("span_id", spanContext.SpanID().String()),
44	)
45}
46
47// Example usage with trace correlation
48func (s *Service) ProcessRequest(ctx context.Context, request *Request) error {
49	logger := LoggerWithTrace(ctx, s.logger)
50
51	logger.Info("processing request",
52		zap.String("request_id", request.ID),
53		zap.String("user_id", request.UserID),
54	)
55
56	// Process request
57	if err := s.validateRequest(request); err != nil {
58		logger.Error("request validation failed",
59			zap.Error(err),
60			zap.String("request_id", request.ID),
61		)
62		return err
63	}
64
65	logger.Info("request processed successfully",
66		zap.String("request_id", request.ID),
67	)
68
69	return nil
70}

Practice Exercises

Exercise 1: Implement Progressive Canary Deployment

Objective: Create a canary deployment that automatically progresses based on health metrics.

Solution
  1package canary
  2
  3import (
  4    "context"
  5    "fmt"
  6    "time"
  7
  8    networkingv1beta1 "istio.io/api/networking/v1beta1"
  9    "istio.io/client-go/pkg/clientset/versioned"
 10    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 11    "k8s.io/client-go/kubernetes"
 12    "k8s.io/client-go/rest"
 13    "k8s.io/client-go/tools/clientcmd"
 14)
 15
 16type CanaryDeployment struct {
 17    istioClient versioned.Interface
 18    k8sClient   kubernetes.Interface
 19    service     string
 20    namespace   string
 21}
 22
 23func NewCanaryDeployment(kubeconfigPath, service, namespace string) (*CanaryDeployment, error) {
 24    var config *rest.Config
 25    var err error
 26
 27    if kubeconfigPath != "" {
 28        config, err = clientcmd.BuildConfigFromFlags("", kubeconfigPath)
 29    } else {
 30        config, err = rest.InClusterConfig()
 31    }
 32
 33    if err != nil {
 34        return nil, fmt.Errorf("failed to get kubernetes config: %w", err)
 35    }
 36
 37    istioClient, err := versioned.NewForConfig(config)
 38    if err != nil {
 39        return nil, fmt.Errorf("failed to create Istio client: %w", err)
 40    }
 41
 42    k8sClient, err := kubernetes.NewForConfig(config)
 43    if err != nil {
 44        return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
 45    }
 46
 47    return &CanaryDeployment{
 48        istioClient: istioClient,
 49        k8sClient:   k8sClient,
 50        service:     service,
 51        namespace:   namespace,
 52    }, nil
 53}
 54
 55func (cd *CanaryDeployment) ExecuteCanary(ctx context.Context) error {
 56    // Progressive traffic stages
 57    stages := []struct {
 58        canaryWeight int32
 59        duration     time.Duration
 60        description  string
 61    }{
 62        {5, 5 * time.Minute, "Initial 5% traffic to canary"},
 63        {10, 10 * time.Minute, "Increase to 10%"},
 64        {25, 15 * time.Minute, "Scale to 25%"},
 65        {50, 20 * time.Minute, "50-50 split"},
 66        {75, 15 * time.Minute, "Scale to 75%"},
 67        {100, 10 * time.Minute, "Full traffic to canary"},
 68    }
 69
 70    for i, stage := range stages {
 71        fmt.Printf("Stage %d: %s\n", i+1, stage.description)
 72
 73        // Update VirtualService traffic weights
 74        if err := cd.updateTrafficWeights(ctx, stage.canaryWeight); err != nil {
 75            return fmt.Errorf("failed to update traffic weights: %w", err)
 76        }
 77
 78        // Monitor canary health
 79        if err := cd.monitorCanaryHealth(ctx, stage.duration); err != nil {
 80            fmt.Printf("Health check failed: %v\n", err)
 81
 82            // Automatic rollback
 83            if rollbackErr := cd.rollbackToStable(ctx); rollbackErr != nil {
 84                return fmt.Errorf("failed to rollback after %v: %w", err, rollbackErr)
 85            }
 86            return fmt.Errorf("canary failed, rolled back: %w", err)
 87        }
 88
 89        fmt.Printf("Stage %d completed successfully\n", i+1)
 90    }
 91
 92    return nil
 93}
 94
 95func (cd *CanaryDeployment) updateTrafficWeights(ctx context.Context, canaryWeight int32) error {
 96    vs, err := cd.istioClient.NetworkingV1beta1().VirtualServices(cd.namespace).Get(ctx, cd.service, metav1.GetOptions{})
 97    if err != nil {
 98        return fmt.Errorf("failed to get VirtualService: %w", err)
 99    }
100
101    stableWeight := 100 - canaryWeight
102
103    // Update traffic weights
104    vs.Spec.Http[0].Route = []*networkingv1beta1.HTTPRouteDestination{
105        {
106            Destination: &networkingv1beta1.Destination{
107                Host:   cd.service,
108                Subset: "stable",
109            },
110            Weight: stableWeight,
111        },
112        {
113            Destination: &networkingv1beta1.Destination{
114                Host:   cd.service,
115                Subset: "canary",
116            },
117            Weight: canaryWeight,
118        },
119    }
120
121    _, err = cd.istioClient.NetworkingV1beta1().VirtualServices(cd.namespace).Update(ctx, vs, metav1.UpdateOptions{})
122    if err != nil {
123        return fmt.Errorf("failed to update VirtualService: %w", err)
124    }
125
126    fmt.Printf("Updated traffic weights: stable=%d, canary=%d\n", stableWeight, canaryWeight)
127    return nil
128}
129
130func (cd *CanaryDeployment) monitorCanaryHealth(ctx context.Context, duration time.Duration) error {
131    ticker := time.NewTicker(30 * time.Second)
132    defer ticker.Stop()
133
134    timeout := time.After(duration)
135    consecutiveErrors := 0
136    maxErrors := 3
137
138    for {
139        select {
140        case <-ctx.Done():
141            return ctx.Err()
142        case <-timeout:
143            return nil
144        case <-ticker.C:
145            healthy, err := cd.checkCanaryHealth(ctx)
146            if err != nil {
147                consecutiveErrors++
148                fmt.Printf("Health check error (%d consecutive): %v\n", consecutiveErrors, err)
149                if consecutiveErrors >= maxErrors {
150                    return fmt.Errorf("too many consecutive failures")
151                }
152                continue
153            }
154
155            consecutiveErrors = 0
156            if !healthy {
157                return fmt.Errorf("canary is unhealthy")
158            }
159
160            fmt.Printf("Health check passed\n")
161        }
162    }
163}
164
165func (cd *CanaryDeployment) checkCanaryHealth(ctx context.Context) (bool, error) {
166    pods, err := cd.k8sClient.CoreV1().Pods(cd.namespace).List(ctx, metav1.ListOptions{
167        LabelSelector: fmt.Sprintf("app=%s,version=canary", cd.service),
168    })
169    if err != nil {
170        return false, fmt.Errorf("failed to list canary pods: %w", err)
171    }
172
173    if len(pods.Items) == 0 {
174        return false, fmt.Errorf("no canary pods found")
175    }
176
177    readyPods := 0
178    for _, pod := range pods.Items {
179        for _, condition := range pod.Status.Conditions {
180            if condition.Type == "Ready" && condition.Status == "True" {
181                readyPods++
182                break
183            }
184        }
185    }
186
187    healthPercentage := float64(readyPods) / float64(len(pods.Items)) * 100
188    fmt.Printf("Canary health: %d/%d pods ready (%.1f%%)\n", readyPods, len(pods.Items), healthPercentage)
189
190    return healthPercentage >= 90.0, nil
191}
192
193func (cd *CanaryDeployment) rollbackToStable(ctx context.Context) error {
194    fmt.Println("Rolling back to stable version...")
195    return cd.updateTrafficWeights(ctx, 0)
196}
197
198// Usage example
199func main() {
200    cd, err := NewCanaryDeployment("", "myapp", "default")
201    if err != nil {
202        panic(err)
203    }
204
205    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
206    defer cancel()
207
208    if err := cd.ExecuteCanary(ctx); err != nil {
209        panic(err)
210    }
211
212    fmt.Println("Canary deployment completed successfully!")
213}

Exercise 2: Implement Zero-Trust Security

Objective: Configure comprehensive security policies including mTLS, JWT authentication, and fine-grained authorization.

Solution
  1# 1. Enable strict mTLS for the mesh
  2apiVersion: security.istio.io/v1beta1
  3kind: PeerAuthentication
  4metadata:
  5  name: default
  6  namespace: production
  7spec:
  8  mtls:
  9    mode: STRICT
 10
 11---
 12# 2. Configure JWT authentication for external access
 13apiVersion: security.istio.io/v1beta1
 14kind: RequestAuthentication
 15metadata:
 16  name: jwt-authentication
 17  namespace: production
 18spec:
 19  selector:
 20    matchLabels:
 21      app: api-service
 22  jwtRules:
 23  - issuer: "https://auth.example.com"
 24    jwksUri: "https://auth.example.com/.well-known/jwks.json"
 25    audiences:
 26    - "api.example.com"
 27    forwardOriginalToken: true
 28    fromHeaders:
 29    - name: Authorization
 30      prefix: "Bearer "
 31
 32---
 33# 3. Role-based authorization for API endpoints
 34apiVersion: security.istio.io/v1beta1
 35kind: AuthorizationPolicy
 36metadata:
 37  name: api-authorization
 38  namespace: production
 39spec:
 40  selector:
 41    matchLabels:
 42      app: api-service
 43  action: ALLOW
 44  rules:
 45  # Admin users have full access
 46  - from:
 47    - source:
 48        requestPrincipals: ["*"]
 49    when:
 50    - key: request.auth.claims[role]
 51      values: ["admin"]
 52    to:
 53    - operation:
 54        methods: ["*"]
 55  # Regular users can only read/write their own data
 56  - from:
 57    - source:
 58        requestPrincipals: ["*"]
 59    when:
 60    - key: request.auth.claims[role]
 61      values: ["user"]
 62    to:
 63    - operation:
 64        methods: ["GET", "POST", "PUT"]
 65        paths: ["/api/users/*", "/api/orders/*"]
 66  # Allow health checks without authentication
 67  - to:
 68    - operation:
 69        paths: ["/health", "/metrics"]
 70
 71---
 72# 4. Service-to-service authorization
 73apiVersion: security.istio.io/v1beta1
 74kind: AuthorizationPolicy
 75metadata:
 76  name: service-to-service-authz
 77  namespace: production
 78spec:
 79  action: ALLOW
 80  rules:
 81  # Database service only accessible from API and worker services
 82  - from:
 83    - source:
 84        principals:
 85        - "cluster.local/ns/production/sa/api-service"
 86        - "cluster.local/ns/production/sa/worker-service"
 87    to:
 88    - operation:
 89        methods: ["GET", "POST", "PUT", "DELETE"]
 90  # Cache service accessible from all internal services
 91  - from:
 92    - source:
 93        principals: ["cluster.local/ns/production/sa/*"]
 94    to:
 95    - operation:
 96        methods: ["GET", "POST", "DELETE"]
 97
 98---
 99# 5. Deny by default with an "allow nothing" policy
100apiVersion: security.istio.io/v1beta1
101kind: AuthorizationPolicy
102metadata:
103  name: allow-nothing
104  namespace: production
105# An empty spec matches no requests, so any request that is not
106# explicitly permitted by the ALLOW policies above is rejected.
107spec: {}

Exercise 3: Multi-Region Service Mesh

Objective: Deploy a multi-region service mesh with cross-cluster service discovery, traffic routing, and failover capabilities.

Solution with Explanation

Implementation Steps:

  1. Configure both clusters with shared mesh ID
  2. Set up cross-cluster service discovery
  3. Implement region-aware traffic routing
  4. Configure automatic failover
  5. Add monitoring and observability

 1# cluster-east/istio-operator.yaml
 2apiVersion: install.istio.io/v1alpha1
 3kind: IstioOperator
 4metadata:
 5  name: istio-controlplane-east
 6  namespace: istio-system
 7spec:
 8  values:
 9    global:
10      meshID: production-mesh
11      multiCluster:
12        clusterName: cluster-east
13      network: network-east
14      # Enable cross-cluster communication
15    pilot:
16      env:
17        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
18    # Ingress gateway configuration
19    components:
20      ingressGateways:
21      - name: istio-ingressgateway
22        enabled: true
23        k8s:
24          service:
25            type: LoadBalancer
26            ports:
27            - port: 15443  # Multi-cluster port
28              name: tls
29            - port: 80
30              name: http2
31            - port: 443
32              name: https
33
34---
35# cluster-west/istio-operator.yaml
36apiVersion: install.istio.io/v1alpha1
37kind: IstioOperator
38metadata:
39  name: istio-controlplane-west
40  namespace: istio-system
41spec:
42  values:
43    global:
44      meshID: production-mesh
45      multiCluster:
46        clusterName: cluster-west
47      network: network-west
48    pilot:
49      env:
50        PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
  1# service-entry.yaml - Cross-cluster service discovery
  2apiVersion: networking.istio.io/v1beta1
  3kind: ServiceEntry
  4metadata:
  5  name: remote-api-service
  6  namespace: production
  7spec:
  8  hosts:
  9  - api-service.global  # Global service name
 10  location: MESH_INTERNAL
 11  resolution: DNS
 12  ports:
 13  - number: 8080
 14    name: http
 15    protocol: HTTP
 16  endpoints:
 17  - address: 172.20.0.1  # West cluster ingress IP
 18    locality: us-west-2
 19    network: network-west
 20  - address: 172.20.0.2
 21    locality: us-west-2
 22    network: network-west
 23
 24---
 25# virtual-service.yaml - Region-aware routing
 26apiVersion: networking.istio.io/v1beta1
 27kind: VirtualService
 28metadata:
 29  name: api-service-routing
 30  namespace: production
 31spec:
 32  hosts:
 33  - api-service.production.svc.cluster.local
 34  - api-service.global
 35  http:
 36  # Route requests from east region to east cluster
 37  - match:
 38    - sourceLabels:
 39        topology.istio.io/cluster: cluster-east
 40    route:
 41    - destination:
 42        host: api-service.production.svc.cluster.local  # Local service
 43      weight: 80
 44    - destination:
 45        host: api-service.global  # Remote service
 46      weight: 20
 47  # Route requests from west region to west cluster
 48  - match:
 49    - sourceLabels:
 50        topology.istio.io/cluster: cluster-west
 51    route:
 52    - destination:
 53        host: api-service.production.svc.cluster.local
 54      weight: 80
 55    - destination:
 56        host: api-service.global
 57      weight: 20
 58  # Client preference routing
 59  - match:
 60    - headers:
 61        x-preferred-region:
 62          exact: "west"
 63    route:
 64    - destination:
 65        host: api-service.global
 66      weight: 100
 67  # Default load balancing
 68  - route:
 69    - destination:
 70        host: api-service.production.svc.cluster.local
 71      weight: 50
 72    - destination:
 73        host: api-service.global
 74      weight: 50
 75
 76---
 77# destination-rule.yaml - Locality-aware load balancing
 78apiVersion: networking.istio.io/v1beta1
 79kind: DestinationRule
 80metadata:
 81  name: api-service-lb
 82  namespace: production
 83spec:
 84  host: api-service.global
 85  trafficPolicy:
 86    loadBalancer:
 87      localityLbSetting:
 88        enabled: true
 89        distribute:
 90        - from: us-east-1/*
 91          to:
 92            "us-east-1/*": 80
 93            "us-west-2/*": 20
 94        - from: us-west-2/*
 95          to:
 96            "us-west-2/*": 80
 97            "us-east-1/*": 20
 98    outlierDetection:
 99      consecutive5xxErrors: 3
100      interval: 30s
101      baseEjectionTime: 30s
102      maxEjectionPercent: 50

Explanation: This multi-region setup provides:

  • Geographic distribution for reduced latency
  • Automatic failover between regions
  • Locality-aware routing to prefer local services
  • Load balancing across regions
  • Resilience to regional failures

Because HTTP route rules are evaluated in order, explicit x-preferred-region requests are honored before the locality splits. For all other traffic, roughly 80% stays within the local region for low latency while 20% is spread cross-region for load distribution. When a region degrades, outlier detection ejects the failing endpoints and traffic shifts to the healthy region automatically.
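
To see the client-preference rule in action, a Go caller only needs to set the x-preferred-region header on its outbound request; the sidecar does the rest. The sketch below is illustrative - the URL, path, and header value are assumptions based on the VirtualService above.

 1// region_client.go - hypothetical client opting into the west region
 2package main
 3
 4import (
 5	"fmt"
 6	"io"
 7	"log"
 8	"net/http"
 9)
10
11func main() {
12	// Illustrative in-cluster URL; adjust to your service and path
13	req, err := http.NewRequest(http.MethodGet,
14		"http://api-service.production.svc.cluster.local:8080/", nil)
15	if err != nil {
16		log.Fatal(err)
17	}
18	// Matched by the x-preferred-region rule, evaluated before the locality splits
19	req.Header.Set("x-preferred-region", "west")
20
21	resp, err := http.DefaultClient.Do(req)
22	if err != nil {
23		log.Fatal(err)
24	}
25	defer resp.Body.Close()
26
27	body, _ := io.ReadAll(resp.Body)
28	fmt.Printf("status=%d body=%s\n", resp.StatusCode, body)
29}

Because HTTP route rules match top to bottom, keeping the header match above the sourceLabels rules is what guarantees the preference is honored.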

Exercise 4: Implement Fault Injection for Chaos Testing

Objective: Create fault injection policies to test service resilience and implement automated chaos testing.

Requirements:

  1. Configure delay injection for latency testing
  2. Implement HTTP abort injection
  3. Create gradual fault rollout
  4. Monitor service behavior under fault conditions
  5. Validate automated recovery once faults are removed
Solution
 1# fault-injection.yaml - Comprehensive fault injection for chaos testing
 2apiVersion: networking.istio.io/v1beta1
 3kind: VirtualService
 4metadata:
 5  name: chaos-testing
 6  namespace: production
 7spec:
 8  hosts:
 9  - api-service
10  http:
11  # Inject delays for 10% of requests
12  - match:
13    - headers:
14        x-chaos-test:
15          exact: "latency"
16    fault:
17      delay:
18        percentage:
19          value: 10.0
20        fixedDelay: 5s
21    route:
22    - destination:
23        host: api-service
24        subset: v1
25
26  # Inject HTTP 503 errors for 5% of requests
27  - match:
28    - headers:
29        x-chaos-test:
30          exact: "error"
31    fault:
32      abort:
33        percentage:
34          value: 5.0
35        httpStatus: 503
36    route:
37    - destination:
38        host: api-service
39        subset: v1
40
41  # Combined fault injection - delays and errors
42  - match:
43    - headers:
44        x-chaos-test:
45          exact: "combined"
46    fault:
47      delay:
48        percentage:
49          value: 15.0
50        fixedDelay: 3s
51      abort:
52        percentage:
53          value: 10.0
54        httpStatus: 500
55    route:
56    - destination:
57        host: api-service
58        subset: v1
59
60  # Default route without faults
61  - route:
62    - destination:
63        host: api-service
64        subset: v1
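
Before automating anything, you can drive traffic at these routes by hand: tag requests with the x-chaos-test header defined above and count slow or failed responses. The following Go sketch is a minimal, hypothetical load driver - the URL, request count, and "slow" threshold are assumptions.

 1// chaosload.go - hypothetical driver for the fault-injection routes above
 2package main
 3
 4import (
 5	"fmt"
 6	"net/http"
 7	"time"
 8)
 9
10func main() {
11	const total = 200
12	client := &http.Client{Timeout: 10 * time.Second}
13	var slow, failed int
14
15	for i := 0; i < total; i++ {
16		req, err := http.NewRequest(http.MethodGet,
17			"http://api-service.production.svc.cluster.local:8080/", nil) // illustrative URL
18		if err != nil {
19			panic(err)
20		}
21		// Select the latency scenario defined in the VirtualService above
22		req.Header.Set("x-chaos-test", "latency")
23
24		start := time.Now()
25		resp, err := client.Do(req)
26		elapsed := time.Since(start)
27
28		switch {
29		case err != nil || resp.StatusCode >= 500:
30			failed++
31		case elapsed > 2*time.Second: // assumed "slow" threshold
32			slow++
33		}
34		if resp != nil {
35			resp.Body.Close()
36		}
37	}
38	fmt.Printf("total=%d slow=%d failed=%d\n", total, slow, failed)
39}

The Go framework below automates the same idea: it creates the fault-injection VirtualService, watches service health, and cleans up afterwards.
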
  1// chaos/testing.go - Automated chaos testing framework
  2
  3package chaos
  4
  5import (
  6	"context"
  7	"fmt"
  8	"time"
  9	durationpb "google.golang.org/protobuf/types/known/durationpb"
 10	networkingv1beta1 "istio.io/api/networking/v1beta1" // Istio API message types (routes, faults)
 11	clientv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1" // VirtualService CRD type
 12	"istio.io/client-go/pkg/clientset/versioned"
 13	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 14	"k8s.io/client-go/kubernetes"
 15)
 16
 17type ChaosTest struct {
 18	istioClient versioned.Interface
 19	k8sClient   kubernetes.Interface
 20	serviceName string
 21	namespace   string
 22}
 23
 24func NewChaosTest(istioClient versioned.Interface, k8sClient kubernetes.Interface,
 25	serviceName, namespace string) *ChaosTest {
 26	return &ChaosTest{
 27		istioClient: istioClient,
 28		k8sClient:   k8sClient,
 29		serviceName: serviceName,
 30		namespace:   namespace,
 31	}
 32}
 33
 34// RunLatencyTest injects delays and monitors service behavior
 35func (ct *ChaosTest) RunLatencyTest(ctx context.Context, delayMs int, percentage float64, duration time.Duration) error {
 36	fmt.Printf("Starting latency test: %dms delay for %.1f%% of requests\n", delayMs, percentage)
 37
 38	// Create fault injection VirtualService
 39	vs := &clientv1beta1.VirtualService{
 40		ObjectMeta: metav1.ObjectMeta{
 41			Name:      ct.serviceName + "-latency-test",
 42			Namespace: ct.namespace,
 43		},
 44		Spec: networkingv1beta1.VirtualService{
 45			Hosts: []string{ct.serviceName},
 46			Http: []*networkingv1beta1.HTTPRoute{
 47				{
 48					Fault: &networkingv1beta1.HTTPFaultInjection{
 49						Delay: &networkingv1beta1.HTTPFaultInjection_Delay{
 50							Percentage: &networkingv1beta1.Percent{
 51								Value: percentage,
 52							},
 53							HttpDelayType: &networkingv1beta1.HTTPFaultInjection_Delay_FixedDelay{
 54								// durationpb.Duration is the delay type in current istio.io/api releases
 55								FixedDelay: durationpb.New(
 56									time.Duration(delayMs) * time.Millisecond),
 57							},
 58						},
 59					},
 60					Route: []*networkingv1beta1.HTTPRouteDestination{
 61						{
 62							Destination: &networkingv1beta1.Destination{
 63								Host: ct.serviceName,
 64							},
 65						},
 66					},
 67				},
 68			},
 69		},
 70	}
 71
 72	// Create VirtualService
 73	_, err := ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Create(ctx, vs, metav1.CreateOptions{})
 74	if err != nil {
 75		return fmt.Errorf("failed to create fault injection: %w", err)
 76	}
 77
 78	// Monitor for specified duration
 79	fmt.Printf("Monitoring service for %v...\n", duration)
 80	time.Sleep(duration)
 81
 82	// Check metrics
 83	healthy, err := ct.checkServiceHealth(ctx)
 84	if err != nil {
 85		return fmt.Errorf("health check failed: %w", err)
 86	}
 87
 88	if !healthy {
 89		fmt.Println("⚠️  Service degraded under latency injection")
 90	} else {
 91		fmt.Println("✅ Service remained healthy under latency injection")
 92	}
 93
 94	// Cleanup
 95	err = ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Delete(
 96		ctx, ct.serviceName+"-latency-test", metav1.DeleteOptions{})
 97	if err != nil {
 98		return fmt.Errorf("failed to cleanup: %w", err)
 99	}
100
101	return nil
102}
103
104// RunErrorInjectionTest injects HTTP errors
105func (ct *ChaosTest) RunErrorInjectionTest(ctx context.Context, httpStatus int32, percentage float64, duration time.Duration) error {
106	fmt.Printf("Starting error injection test: HTTP %d for %.1f%% of requests\n", httpStatus, percentage)
107
108	vs := &clientv1beta1.VirtualService{
109		ObjectMeta: metav1.ObjectMeta{
110			Name:      ct.serviceName + "-error-test",
111			Namespace: ct.namespace,
112		},
113		Spec: networkingv1beta1.VirtualService{
114			Hosts: []string{ct.serviceName},
115			Http: []*networkingv1beta1.HTTPRoute{
116				{
117					Fault: &networkingv1beta1.HTTPFaultInjection{
118						Abort: &networkingv1beta1.HTTPFaultInjection_Abort{
119							Percentage: &networkingv1beta1.Percent{
120								Value: percentage,
121							},
122							ErrorType: &networkingv1beta1.HTTPFaultInjection_Abort_HttpStatus{
123								HttpStatus: httpStatus,
124							},
125						},
126					},
127					Route: []*networkingv1beta1.HTTPRouteDestination{
128						{
129							Destination: &networkingv1beta1.Destination{
130								Host: ct.serviceName,
131							},
132						},
133					},
134				},
135			},
136		},
137	}
138
139	_, err := ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Create(ctx, vs, metav1.CreateOptions{})
140	if err != nil {
141		return fmt.Errorf("failed to create error injection: %w", err)
142	}
143
144	// Monitor and validate retry behavior
145	fmt.Printf("Monitoring retry and circuit breaker behavior...\n")
146	time.Sleep(duration)
147
148	// Cleanup
149	err = ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Delete(
150		ctx, ct.serviceName+"-error-test", metav1.DeleteOptions{})
151	if err != nil {
152		return fmt.Errorf("failed to cleanup: %w", err)
153	}
154
155	fmt.Println("✅ Error injection test completed")
156	return nil
157}
158
159func (ct *ChaosTest) checkServiceHealth(ctx context.Context) (bool, error) {
160	// Check pod health
161	pods, err := ct.k8sClient.CoreV1().Pods(ct.namespace).List(ctx, metav1.ListOptions{
162		LabelSelector: fmt.Sprintf("app=%s", ct.serviceName),
163	})
164	if err != nil {
165		return false, err
166	}
167
168	readyPods := 0
169	for _, pod := range pods.Items {
170		for _, condition := range pod.Status.Conditions {
171			if condition.Type == "Ready" && condition.Status == "True" {
172				readyPods++
173				break
174			}
175		}
176	}
177
178	// Healthy when at least 80% of the labeled pods report Ready (guards empty pod lists)
179	return len(pods.Items) > 0 && float64(readyPods)/float64(len(pods.Items)) >= 0.8, nil
180}
181
182// RunAll demonstrates a typical chaos test sequence. Callers construct the Istio
183// and Kubernetes clientsets (e.g. with versioned.NewForConfig / kubernetes.NewForConfig).
184func RunAll(ctx context.Context, istioClient versioned.Interface, k8sClient kubernetes.Interface) error {
185	// Target the production api-service used throughout the examples
186	chaosTest := NewChaosTest(istioClient, k8sClient, "api-service", "production")
187
188	// Run latency test - 500ms delay for 20% of requests
189	if err := chaosTest.RunLatencyTest(ctx, 500, 20.0, 5*time.Minute); err != nil {
190		return err
191	}
192
193	// Run error injection test - HTTP 503 for 10% of requests
194	if err := chaosTest.RunErrorInjectionTest(ctx, 503, 10.0, 5*time.Minute); err != nil {
195		return err
196	}
197
198	fmt.Println("All chaos tests completed successfully!")
199	return nil
200}
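
Wiring the framework into a runnable command needs an Istio clientset and a Kubernetes clientset. A hedged sketch using in-cluster configuration follows; the module import path is a placeholder, and only the standard rest / versioned / kubernetes constructors are used.

 1// cmd/chaos/main.go - hypothetical entry point for the chaos package above
 2package main
 3
 4import (
 5	"context"
 6
 7	"istio.io/client-go/pkg/clientset/versioned"
 8	"k8s.io/client-go/kubernetes"
 9	"k8s.io/client-go/rest"
10
11	"example.com/yourmodule/chaos" // hypothetical module path for the package above
12)
13
14func main() {
15	// In-cluster config; use clientcmd to load a kubeconfig when running locally
16	cfg, err := rest.InClusterConfig()
17	if err != nil {
18		panic(err)
19	}
20
21	istioClient, err := versioned.NewForConfig(cfg)
22	if err != nil {
23		panic(err)
24	}
25	k8sClient, err := kubernetes.NewForConfig(cfg)
26	if err != nil {
27		panic(err)
28	}
29
30	if err := chaos.RunAll(context.Background(), istioClient, k8sClient); err != nil {
31		panic(err)
32	}
33}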

Exercise 5: Implement Complete Observability Stack

Objective: Set up comprehensive observability with distributed tracing, metrics, and logging across the service mesh.

Requirements:

  1. Deploy Jaeger for distributed tracing
  2. Configure Prometheus for metrics collection
  3. Set up Grafana dashboards
  4. Implement trace sampling strategies
  5. Create alerts for SLO violations
Solution
  1# observability-stack.yaml - Complete observability deployment
  2
  3---
  4# Jaeger deployment for distributed tracing
  5apiVersion: apps/v1
  6kind: Deployment
  7metadata:
  8  name: jaeger
  9  namespace: istio-system
 10spec:
 11  replicas: 1
 12  selector:
 13    matchLabels:
 14      app: jaeger
 15  template:
 16    metadata:
 17      labels:
 18        app: jaeger
 19    spec:
 20      containers:
 21      - name: jaeger
 22        image: jaegertracing/all-in-one:latest  # pin a specific image version in production
 23        ports:
 24        - containerPort: 16686  # UI
 25        - containerPort: 14268  # Collector
 26        - containerPort: 9411   # Zipkin compatible
 27        env:
 28        - name: COLLECTOR_ZIPKIN_HOST_PORT
 29          value: ":9411"
 30        resources:
 31          requests:
 32            cpu: "200m"
 33            memory: "512Mi"
 34          limits:
 35            cpu: "500m"
 36            memory: "1Gi"
 37
 38---
 39# Jaeger Service
 40apiVersion: v1
 41kind: Service
 42metadata:
 43  name: jaeger
 44  namespace: istio-system
 45spec:
 46  selector:
 47    app: jaeger
 48  ports:
 49  - name: ui
 50    port: 16686
 51    targetPort: 16686
 52  - name: collector
 53    port: 14268
 54    targetPort: 14268
 55  - name: zipkin
 56    port: 9411
 57    targetPort: 9411
 58
 59---
 60# Prometheus ServiceMonitor for application metrics
 61apiVersion: monitoring.coreos.com/v1
 62kind: ServiceMonitor
 63metadata:
 64  name: app-metrics
 65  namespace: production
 66spec:
 67  selector:
 68    matchLabels:
 69      monitoring: "enabled"
 70  endpoints:
 71  - port: metrics
 72    interval: 30s
 73    path: /metrics
 74
 75---
 76# Istio telemetry configuration
 77apiVersion: telemetry.istio.io/v1alpha1
 78kind: Telemetry
 79metadata:
 80  name: mesh-telemetry
 81  namespace: istio-system
 82spec:
 83  # Enable distributed tracing
 84  tracing:
 85  - providers:
 86    - name: jaeger
 87    randomSamplingPercentage: 100.0  # Sample 100% for testing, reduce in production
 88    customTags:
 89      environment:
 90        literal:
 91          value: "production"
 92      version:
 93        literal:
 94          value: "v1.0.0"
 95
 96  # Metrics configuration
 97  metrics:
 98  - providers:
 99    - name: prometheus
100    overrides:
101    - match:
102        metric: REQUEST_COUNT
103      tagOverrides:
104        request_protocol:
105          value: "api.protocol | unknown"
106        response_code:
107          value: "response.code | 000"
108
109---
110# Grafana Dashboard ConfigMap
111apiVersion: v1
112kind: ConfigMap
113metadata:
114  name: grafana-dashboard-istio
115  namespace: monitoring
116data:
117  istio-dashboard.json: |
118    {
119      "dashboard": {
120        "title": "Istio Service Mesh Dashboard",
121        "panels": [
122          {
123            "title": "Request Rate",
124            "targets": [
125              {
126                "expr": "sum(rate(istio_requests_total[5m])) by (destination_service)"
127              }
128            ]
129          },
130          {
131            "title": "Error Rate",
132            "targets": [
133              {
134                "expr": "sum(rate(istio_requests_total{response_code=~\"5.*\"}[5m])) / sum(rate(istio_requests_total[5m]))"
135              }
136            ]
137          },
138          {
139            "title": "Request Duration P99",
140            "targets": [
141              {
142                "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service))"
143              }
144            ]
145          }
146        ]
147      }
148    }    
  1// observability/setup.go - Complete observability setup
  2
  3package observability
  4
  5import (
  6	"context"
  7	"fmt"
  8	"net/http"
  9
 10	"go.opentelemetry.io/otel"
 11	"go.opentelemetry.io/otel/exporters/jaeger"
 12	"go.opentelemetry.io/otel/sdk/resource"
 13	sdktrace "go.opentelemetry.io/otel/sdk/trace"
 14	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
 15	// promhttp serves the default Prometheus registry on /metrics
 16	"github.com/prometheus/client_golang/prometheus/promhttp"
 17	"go.uber.org/zap"
 18)
 19
 20type ObservabilityStack struct {
 21	logger         *zap.Logger
 22	tracerProvider *sdktrace.TracerProvider
 23	metricsServer  *http.Server
 24}
 25
 26func NewObservabilityStack(serviceName, jaegerEndpoint string, metricsPort string) (*ObservabilityStack, error) {
 27	// Initialize structured logging
 28	logger, err := initLogger(serviceName)
 29	if err != nil {
 30		return nil, fmt.Errorf("failed to initialize logger: %w", err)
 31	}
 32
 33	// Initialize distributed tracing
 34	tracerProvider, err := initTracing(serviceName, jaegerEndpoint)
 35	if err != nil {
 36		return nil, fmt.Errorf("failed to initialize tracing: %w", err)
 37	}
 38
 39	// Initialize metrics server
 40	metricsServer := initMetrics(metricsPort)
 41
 42	logger.Info("observability stack initialized",
 43		zap.String("service", serviceName),
 44		zap.String("jaeger_endpoint", jaegerEndpoint),
 45		zap.String("metrics_port", metricsPort),
 46	)
 47
 48	return &ObservabilityStack{
 49		logger:         logger,
 50		tracerProvider: tracerProvider,
 51		metricsServer:  metricsServer,
 52	}, nil
 53}
 54
 55func initLogger(serviceName string) (*zap.Logger, error) {
 56	config := zap.NewProductionConfig()
 57	config.InitialFields = map[string]interface{}{
 58		"service": serviceName,
 59	}
 60	return config.Build()
 61}
 62
 63func initTracing(serviceName, jaegerEndpoint string) (*sdktrace.TracerProvider, error) {
 64	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(jaegerEndpoint)))
 65	if err != nil {
 66		return nil, err
 67	}
 68
 69	tp := sdktrace.NewTracerProvider(
 70		sdktrace.WithBatcher(exporter),
 71		sdktrace.WithResource(resource.NewWithAttributes(
 72			semconv.SchemaURL,
 73			semconv.ServiceNameKey.String(serviceName),
 74		)),
 75		// Sample 10% of traces in production
 76		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
 77	)
 78
 79	otel.SetTracerProvider(tp)
 80	return tp, nil
 81}
 82
 83func initMetrics(port string) *http.Server {
 84	mux := http.NewServeMux()
 85	mux.Handle("/metrics", promhttp.Handler())
 86
 87	return &http.Server{
 88		Addr:    ":" + port,
 89		Handler: mux,
 90	}
 91}
 92
 93func (o *ObservabilityStack) Start() error {
 94	// Start metrics server
 95	go func() {
 96		if err := o.metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 97			o.logger.Fatal("metrics server failed", zap.Error(err))
 98		}
 99	}()
100
101	o.logger.Info("observability stack started")
102	return nil
103}
104
105func (o *ObservabilityStack) Shutdown(ctx context.Context) error {
106	// Shutdown tracer provider
107	if err := o.tracerProvider.Shutdown(ctx); err != nil {
108		return fmt.Errorf("failed to shutdown tracer: %w", err)
109	}
110
111	// Shutdown metrics server
112	if err := o.metricsServer.Shutdown(ctx); err != nil {
113		return fmt.Errorf("failed to shutdown metrics server: %w", err)
114	}
115
116	o.logger.Info("observability stack shutdown complete")
117	return nil
118}
119
120	// Example usage: place this wiring in your application's package main
121func main() {
122	obs, err := NewObservabilityStack(
123		"api-service",
124		"http://jaeger-collector:14268/api/traces",
125		"9090",
126	)
127	if err != nil {
128		panic(err)
129	}
130
131	if err := obs.Start(); err != nil {
132		panic(err)
133	}
134
135	// Your application code here
136	// ...
137
138	// Graceful shutdown
139	ctx := context.Background()
140	if err := obs.Shutdown(ctx); err != nil {
141		panic(err)
142	}
143}
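
With the stack started, handlers record spans through the global OpenTelemetry tracer and metrics through the default Prometheus registry that promhttp already serves on /metrics. A minimal sketch - the handler name, metric name, and labels are assumptions:

 1// instrumented_handler.go - hypothetical handler using the stack above
 2package observability
 3
 4import (
 5	"net/http"
 6
 7	"github.com/prometheus/client_golang/prometheus"
 8	"github.com/prometheus/client_golang/prometheus/promauto"
 9	"go.opentelemetry.io/otel"
10)
11
12// Registered on the default registry, which the /metrics endpoint exposes
13var ordersProcessed = promauto.NewCounterVec(
14	prometheus.CounterOpts{
15		Name: "orders_processed_total",
16		Help: "Number of processed orders by status.",
17	},
18	[]string{"status"},
19)
20
21func OrdersHandler(w http.ResponseWriter, r *http.Request) {
22	// Start a child span under the request context propagated by the mesh
23	ctx, span := otel.Tracer("api-service").Start(r.Context(), "process-order")
24	defer span.End()
25	_ = ctx // pass ctx into downstream calls so the trace context propagates
26
27	// ... business logic ...
28
29	ordersProcessed.WithLabelValues("ok").Inc()
30	w.WriteHeader(http.StatusOK)
31}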

PrometheusRule for Alerting:

 1# alerting-rules.yaml - SLO-based alerts
 2apiVersion: monitoring.coreos.com/v1
 3kind: PrometheusRule
 4metadata:
 5  name: istio-alerts
 6  namespace: monitoring
 7spec:
 8  groups:
 9  - name: istio-slo-alerts
10    interval: 30s
11    rules:
12    # High error rate alert
13    - alert: HighErrorRate
14      expr: |
15        (sum(rate(istio_requests_total{response_code=~"5.*"}[5m])) by (destination_service)
16        / sum(rate(istio_requests_total[5m])) by (destination_service)) > 0.05        
17      for: 5m
18      labels:
19        severity: critical
20      annotations:
21        summary: "High error rate detected"
22        description: "Service {{ $labels.destination_service }} has error rate above 5%"
23
24    # High latency alert (P99 > 1s)
25    - alert: HighLatency
26      expr: |
27        histogram_quantile(0.99,
28          sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service)
29        ) > 1000        
30      for: 10m
31      labels:
32        severity: warning
33      annotations:
34        summary: "High latency detected"
35        description: "Service {{ $labels.destination_service }} P99 latency > 1s"
36
37    # Circuit breaker open
38    - alert: CircuitBreakerOpen
39      expr: |
 40        istio_tcp_connections_opened_total{} - istio_tcp_connections_closed_total{} > 100
41      for: 5m
42      labels:
43        severity: warning
44      annotations:
45        summary: "Circuit breaker may be triggered"
46        description: "Unusual connection pattern detected for {{ $labels.destination_service }}"
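
These same expressions can back the automated recovery validation called for in Exercise 4: after a chaos run, query Prometheus and fail the pipeline if the error-rate expression is still above the SLO threshold. A hedged sketch using the Prometheus Go client - the Prometheus address is an assumption:

 1// slocheck.go - hypothetical post-chaos SLO validation against Prometheus
 2package main
 3
 4import (
 5	"context"
 6	"fmt"
 7	"time"
 8
 9	"github.com/prometheus/client_golang/api"
10	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
11)
12
13func main() {
14	// Assumed in-cluster Prometheus address
15	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
16	if err != nil {
17		panic(err)
18	}
19	promAPI := promv1.NewAPI(client)
20
21	// Same expression as the HighErrorRate alert above
22	query := `sum(rate(istio_requests_total{response_code=~"5.*"}[5m])) by (destination_service)
23	          / sum(rate(istio_requests_total[5m])) by (destination_service)`
24
25	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
26	defer cancel()
27
28	result, warnings, err := promAPI.Query(ctx, query, time.Now())
29	if err != nil {
30		panic(err)
31	}
32	if len(warnings) > 0 {
33		fmt.Println("warnings:", warnings)
34	}
35	// Inspect per-service error rates; a real check compares each sample to 0.05
36	fmt.Println("error-rate by service:", result)
37}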

Summary

Key Takeaways

💡 Core Benefits:

  1. Zero-Code Networking: Sidecar pattern handles all network concerns transparently
  2. Advanced Traffic Management: Canary deployments, A/B testing, traffic splitting
  3. Zero-Trust Security: Automatic mTLS and fine-grained authorization policies
  4. Comprehensive Observability: Distributed tracing, metrics, and access logs
  5. Resilience: Circuit breaking, retries, and automatic failover
  6. Multi-Cluster Support: Global service mesh across regions and clouds

🚀 Production Patterns:

  • Progressive Delivery: Automated canary deployments with health-based progression
  • Security-First: Default deny policies with mTLS encryption
  • Observability: OpenTelemetry integration with distributed tracing
  • High Availability: Multi-region deployment with locality-aware routing

When to Use Service Mesh

✅ Ideal for:

  • Production microservices with >5 services
  • Multi-language environments requiring consistent networking
  • Compliance and security requirements
  • Complex traffic management
  • Multi-region or multi-cloud deployments

❌ Overkill for:

  • Single monolithic applications
  • Simple 2-3 service architectures
  • Development or testing environments
  • Applications with minimal networking requirements

Next Steps

  1. Start Simple: Begin with basic mTLS and telemetry
  2. Gradual Adoption: Enable features progressively
  3. Monitor Everything: Set up observability from day one
  4. Test Failures: Use fault injection to validate resilience
  5. Plan for Scale: Design for multi-cluster from the start