Why This Matters - The Communication Challenge
Think of managing a city's transportation network. Instead of individual drivers navigating traffic, you have a smart traffic system that coordinates all vehicles, optimizes routes, enforces safety rules, and monitors everything centrally. This is exactly what a service mesh does for microservices.
Real-World Scenario: A large e-commerce platform with 50+ microservices processes millions of requests daily. Each service needs to communicate securely, handle failures gracefully, and provide observability. Without a service mesh, each service team implements these concerns differently, leading to inconsistent behavior and massive operational overhead.
Business Impact: Companies adopting service meshes report:
- 60-80% reduction in networking code duplication across services
- 50% faster deployment cycles with traffic management features
- 90% improved security posture with automatic mTLS
- 40% reduction in mean time to recovery with observability
Learning Objectives
By the end of this article, you will:
- Understand service mesh architecture and benefits
- Implement Istio traffic management for Go applications
- Configure zero-trust security with mTLS and authorization policies
- Set up observability with distributed tracing and monitoring
- Deploy production-ready multi-cluster service mesh
- Master advanced patterns including canary deployments and fault injection
Core Concepts - Service Mesh Fundamentals
What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that handles service-to-service communication in microservices architectures. It uses sidecar proxies deployed alongside each service to manage all network traffic.
Key Components:
- Data Plane: Sidecar proxies that intercept and manage network traffic
- Control Plane: Central management that configures and monitors the data plane
- Sidecar Pattern: Each service gets a dedicated proxy without code changes
1// Your Go application - unchanged!
2// Service mesh handles all networking concerns
3package main
4
5import (
6 "fmt"
7 "log"
8 "net/http"
9 "os"
10)
11
12func main() {
13 // Your business logic only
14 http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
15 // Service mesh automatically provides:
16 // - Load balancing
17 // - Retry logic
18 // - Circuit breaking
19 // - mTLS encryption
20 // - Distributed tracing headers
21
22 // Just focus on business logic
23 hostname := os.Getenv("HOSTNAME")
24 fmt.Fprintf(w, "Response from service: %s\n", hostname)
25
26 // Service mesh intercepts the response
27 // and adds security, observability, etc.
28 })
29
30 log.Println("Service starting on :8080")
31 log.Fatal(http.ListenAndServe(":8080", nil))
32}
💡 Key Insight: Your Go applications remain unchanged - the service mesh handles all networking concerns transparently through the sidecar proxy.
Service Mesh vs. Kubernetes Ingress
Traditional Kubernetes Ingress:
1# Basic ingress - limited functionality
2apiVersion: networking.k8s.io/v1
3kind: Ingress
4metadata:
5 name: simple-ingress
6spec:
7 rules:
8 - host: myapp.example.com
9 http:
10 paths:
11 - path: /
12 pathType: Prefix
13 backend:
14 service:
15 name: myapp
16 port:
17 number: 8080
Service Mesh Capabilities:
1# Advanced traffic management with Istio
2apiVersion: networking.istio.io/v1beta1
3kind: VirtualService
4metadata:
5 name: advanced-routing
6spec:
7 hosts:
8 - myapp.example.com
9 http:
10 # Canary deployment - route 10% to v2
11 - match:
12 - headers:
13 canary:
14 exact: "enabled"
15 route:
16 - destination:
17 host: myapp
18 subset: v2
19 weight: 10
20 # Default route - 90% to v1
21 - route:
22 - destination:
23 host: myapp
24 subset: v1
25 weight: 90
26 # Fault injection for testing
27 - fault:
28 delay:
29 percentage:
30 value: 5
31 fixedDelay: 2s
Practical Examples - Service Mesh Integration
Basic Go Service with Istio
1// service/main.go
2package main
3
4import (
5 "context"
6 "encoding/json"
7 "fmt"
8 "log"
9 "net/http"
10 "os"
11 "time"
12
13 "github.com/prometheus/client_golang/prometheus/promhttp"
14)
15
16type ServiceResponse struct {
17 Service string `json:"service"`
18 Version string `json:"version"`
19 Hostname string `json:"hostname"`
20 RequestID string `json:"request_id"`
21 TraceID string `json:"trace_id"`
22 Timestamp time.Time `json:"timestamp"`
23}
24
25func main() {
26 // Service configuration
27 serviceName := os.Getenv("SERVICE_NAME")
28 version := os.Getenv("SERVICE_VERSION")
29
30 // Initialize structured logger
31 logger := newStructuredLogger(serviceName, version)
32
33 // Health check endpoint
34 http.HandleFunc("/health", healthCheckHandler(logger))
35
36 // Main service endpoint
37 http.HandleFunc("/", serviceHandler(serviceName, version, logger))
38
39 // Metrics endpoint
40 http.Handle("/metrics", promhttp.Handler())
41
42 port := os.Getenv("PORT")
43 if port == "" {
44 port = "8080"
45 }
46
47 logger.Info("service starting",
48 "port", port,
49 "version", version,
50 )
51
52 // Your Go service runs normally
53 // Istio sidecar intercepts all traffic
54 if err := http.ListenAndServe(":"+port, nil); err != nil {
55 logger.Error("service failed", "error", err)
56 os.Exit(1)
57 }
58}
59
60func serviceHandler(serviceName, version string, logger structuredLogger) http.HandlerFunc {
61 return func(w http.ResponseWriter, r *http.Request) {
62 // Extract Istio-provided headers
63 requestID := r.Header.Get("X-Request-Id")
64 traceID := r.Header.Get("X-B3-Traceid")
65 spanID := r.Header.Get("X-B3-Spanid")
66
67 response := ServiceResponse{
68 Service: serviceName,
69 Version: version,
70 Hostname: os.Getenv("HOSTNAME"),
71 RequestID: requestID,
72 TraceID: traceID,
73 Timestamp: time.Now(),
74 }
75
76 logger.Info("request processed",
77 "method", r.Method,
78 "path", r.URL.Path,
79 "request_id", requestID,
80 "trace_id", traceID,
81 "span_id", spanID,
82 )
83
84 w.Header().Set("Content-Type", "application/json")
85 w.Header().Set("X-Service-Version", version)
86
87 if err := json.NewEncoder(w).Encode(response); err != nil {
88 logger.Error("encoding failed", "error", err)
89 http.Error(w, "Internal error", http.StatusInternalServerError)
90 }
91 }
92}
93
94func healthCheckHandler(logger structuredLogger) http.HandlerFunc {
95 return func(w http.ResponseWriter, r *http.Request) {
96 health := map[string]interface{}{
97 "status": "healthy",
98 "timestamp": time.Now(),
99 "version": os.Getenv("SERVICE_VERSION"),
100 }
101
102 w.Header().Set("Content-Type", "application/json")
103 json.NewEncoder(w).Encode(health)
104
105 logger.Debug("health check completed")
106 }
107}
108
109// Simple structured logger for demonstration
110type structuredLogger struct {
111 serviceName string
112 version string
113}
114
115func newStructuredLogger(serviceName, version string) structuredLogger {
116 return structuredLogger{serviceName: serviceName, version: version}
117}
118
119func (l structuredLogger) Info(msg string, keysAndValues ...interface{}) {
120 fmt.Printf("[INFO] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
121}
122
123func (l structuredLogger) Error(msg string, keysAndValues ...interface{}) {
124 fmt.Printf("[ERROR] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
125}
126
127func (l structuredLogger) Debug(msg string, keysAndValues ...interface{}) {
128 fmt.Printf("[DEBUG] %s.%s: %s %v\n", l.serviceName, l.version, msg, keysAndValues)
129}
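One thing the sidecar cannot do for you is stitch multi-hop traces together: Envoy generates the spans, but the application has to copy the tracing headers from the incoming request onto any outgoing requests it makes. Below is a minimal sketch of that propagation; the header list follows the B3/W3C set Istio documents, and the downstream `inventory` URL and `/aggregate` route are illustrative, not part of the example above.
package main

import (
	"io"
	"net/http"
)

// Headers Istio/Envoy uses to correlate spans across services.
var traceHeaders = []string{
	"x-request-id",
	"x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags",
	"x-ot-span-context",
	"traceparent", "tracestate",
}

// callDownstream forwards the tracing headers of the inbound request so the
// downstream call shows up in the same distributed trace.
func callDownstream(in *http.Request, url string) (*http.Response, error) {
	out, err := http.NewRequestWithContext(in.Context(), http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, h := range traceHeaders {
		if v := in.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
	return http.DefaultClient.Do(out)
}

func main() {
	http.HandleFunc("/aggregate", func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical downstream service reachable through its Kubernetes name.
		resp, err := callDownstream(r, "http://inventory:8080/items")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		io.Copy(w, resp.Body)
	})
	http.ListenAndServe(":8081", nil)
}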
Kubernetes Deployment with Istio Sidecar
1# deployment.yaml
2apiVersion: apps/v1
3kind: Deployment
4metadata:
5 name: myapp-v1
6 labels:
7 app: myapp
8 version: v1
9spec:
10 replicas: 3
11 selector:
12 matchLabels:
13 app: myapp
14 version: v1
15 template:
16 metadata:
17 labels:
18 app: myapp
19 version: v1
20 annotations:
21 # Request sidecar injection for this pod (labeling the namespace with istio-injection=enabled also works)
22 sidecar.istio.io/inject: "true"
23 # Configure sidecar resources
24 sidecar.istio.io/proxyCPU: "100m"
25 sidecar.istio.io/proxyMemory: "128Mi"
26 # Traffic interception settings
27 traffic.sidecar.istio.io/includeOutboundPorts: "8080"
28 traffic.sidecar.istio.io/excludeInboundPorts: "8081"
29 spec:
30 containers:
31 - name: myapp
32 image: my-registry/myapp:v1
33 ports:
34 - containerPort: 8080
35 name: http
36 env:
37 - name: SERVICE_NAME
38 value: "myapp"
39 - name: SERVICE_VERSION
40 value: "v1"
41 - name: PORT
42 value: "8080"
43 resources:
44 requests:
45 cpu: "50m"
46 memory: "64Mi"
47 limits:
48 cpu: "200m"
49 memory: "128Mi"
50 # Application readiness and liveness
51 readinessProbe:
52 httpGet:
53 path: /health
54 port: 8080
55 initialDelaySeconds: 5
56 periodSeconds: 10
57 livenessProbe:
58 httpGet:
59 path: /health
60 port: 8080
61 initialDelaySeconds: 30
62 periodSeconds: 30
Deployment with Version Management:
1# deployment-v2.yaml - Canary version
2apiVersion: apps/v1
3kind: Deployment
4metadata:
5 name: myapp-v2
6 labels:
7 app: myapp
8 version: v2
9spec:
10 replicas: 1 # Start with fewer replicas for canary
11 selector:
12 matchLabels:
13 app: myapp
14 version: v2
15 template:
16 metadata:
17 labels:
18 app: myapp
19 version: v2
20 annotations:
21 sidecar.istio.io/inject: "true"
22 spec:
23 containers:
24 - name: myapp
25 image: my-registry/myapp:v2
26 ports:
27 - containerPort: 8080
28 name: http
29 env:
30 - name: SERVICE_NAME
31 value: "myapp"
32 - name: SERVICE_VERSION
33 value: "v2"
34 resources:
35 requests:
36 cpu: "50m"
37 memory: "64Mi"
38 limits:
39 cpu: "200m"
40 memory: "128Mi"
Traffic Management - Advanced Patterns
Progressive Canary Deployments
1// canary/controller.go
2package main
3
4import (
5 "context"
6 "fmt"
7 "time"
8
9 networkingv1beta1 "istio.io/api/networking/v1beta1" // route and destination types live in the istio.io/api module
10 "istio.io/client-go/pkg/clientset/versioned"
11 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
12 "k8s.io/client-go/kubernetes"
13)
14
15type CanaryController struct {
16 istioClient versioned.Interface
17 k8sClient kubernetes.Interface
18 service string
19 namespace string
20}
21
22func NewCanaryController(k8sCfg *rest.Config, service, namespace string) (*CanaryController, error) {
23 istioClient, err := versioned.NewForConfig(k8sCfg)
24 if err != nil {
25 return nil, fmt.Errorf("failed to create Istio client: %w", err)
26 }
27
28 k8sClient, err := kubernetes.NewForConfig(k8sCfg)
29 if err != nil {
30 return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
31 }
32
33 return &CanaryController{
34 istioClient: istioClient,
35 k8sClient: k8sClient,
36 service: service,
37 namespace: namespace,
38 }, nil
39}
40
41// ProgressiveCanary performs gradual traffic shifting with monitoring
42func (c *CanaryController) ProgressiveCanary(ctx context.Context) error {
43 // Progressive traffic stages
44 stages := []struct {
45 canaryWeight int32
46 stableWeight int32
47 duration time.Duration
48 description string
49 }{
50 {5, 95, 5 * time.Minute, "Initial 5% canary"},
51 {10, 90, 10 * time.Minute, "10% canary deployment"},
52 {25, 75, 15 * time.Minute, "25% canary expansion"},
53 {50, 50, 20 * time.Minute, "50% traffic split"},
54 {75, 25, 15 * time.Minute, "75% canary dominance"},
55 {100, 0, 10 * time.Minute, "Full canary deployment"},
56 }
57
58 for i, stage := range stages {
59 fmt.Printf("Stage %d: %s\n", i+1, stage.description)
60
61 // Update VirtualService with new traffic weights
62 if err := c.updateTrafficWeights(ctx, stage.canaryWeight, stage.stableWeight); err != nil {
63 return fmt.Errorf("failed to update traffic weights: %w", err)
64 }
65
66 // Monitor for the duration of this stage
67 fmt.Printf("Monitoring for %v...\n", stage.duration)
68 if err := c.monitorCanaryHealth(ctx, stage.duration); err != nil {
69 fmt.Printf("Health check failed: %v\n", err)
70
71 // Automatic rollback on failure
72 if rollbackErr := c.rollbackToStable(ctx); rollbackErr != nil {
73 return fmt.Errorf("failed to rollback: %w", rollbackErr, err)
74 }
75 return fmt.Errorf("canary failed, rolled back: %w", err)
76 }
77
78 fmt.Printf("Stage %d completed successfully\n", i+1)
79 }
80
81 // Clean up old stable version after successful canary
82 return c.cleanupOldVersion(ctx)
83}
84
85func (c *CanaryController) updateTrafficWeights(ctx context.Context, canaryWeight, stableWeight int32) error {
86 vs, err := c.istioClient.NetworkingV1beta1().VirtualServices(c.namespace).Get(ctx, c.service, metav1.GetOptions{})
87 if err != nil {
88 return fmt.Errorf("failed to get VirtualService: %w", err)
89 }
90
91 // Update traffic weights
92 vs.Spec.Http[0].Route = []*networkingv1beta1.HTTPRouteDestination{
93 {
94 Destination: &networkingv1beta1.Destination{
95 Host: c.service,
96 Subset: "stable",
97 },
98 Weight: stableWeight,
99 },
100 {
101 Destination: &networkingv1beta1.Destination{
102 Host: c.service,
103 Subset: "canary",
104 },
105 Weight: canaryWeight,
106 },
107 }
108
109 _, err = c.istioClient.NetworkingV1beta1().VirtualServices(c.namespace).Update(ctx, vs, metav1.UpdateOptions{})
110 if err != nil {
111 return fmt.Errorf("failed to update VirtualService: %w", err)
112 }
113
114 fmt.Printf("Updated traffic weights: stable=%d, canary=%d\n", stableWeight, canaryWeight)
115 return nil
116}
117
118func (c *CanaryController) monitorCanaryHealth(ctx context.Context, duration time.Duration) error {
119 ticker := time.NewTicker(30 * time.Second)
120 defer ticker.Stop()
121
122 timeout := time.After(duration)
123 consecutiveErrors := 0
124 maxErrors := 3
125
126 for {
127 select {
128 case <-ctx.Done():
129 return ctx.Err()
130 case <-timeout:
131 return nil // Monitoring period completed successfully
132 case <-ticker.C:
133 healthy, err := c.checkCanaryHealth(ctx)
134 if err != nil {
135 consecutiveErrors++
136 fmt.Printf("Health check error: %v\n", err, consecutiveErrors)
137
138 if consecutiveErrors >= maxErrors {
139 return fmt.Errorf("too many consecutive health check failures")
140 }
141 continue
142 }
143
144 consecutiveErrors = 0
145 if !healthy {
146 return fmt.Errorf("canary unhealthy")
147 }
148
149 fmt.Printf("Canary health check passed\n")
150 }
151 }
152}
153
154func (c *CanaryController) checkCanaryHealth(ctx context.Context) (bool, error) {
155 // Check pod readiness
156 pods, err := c.k8sClient.CoreV1().Pods(c.namespace).List(ctx, metav1.ListOptions{
157 LabelSelector: "app=myapp,version=canary",
158 })
159 if err != nil {
160 return false, fmt.Errorf("failed to list canary pods: %w", err)
161 }
162
163 readyPods := 0
164 for _, pod := range pods.Items {
165 for _, condition := range pod.Status.Conditions {
166 if condition.Type == "Ready" && condition.Status == "True" {
167 readyPods++
168 break
169 }
170 }
171 }
172
173 if len(pods.Items) == 0 {
174 return false, fmt.Errorf("no canary pods found")
175 }
176
177 healthPercentage := float64(readyPods) / float64(len(pods.Items)) * 100
178 fmt.Printf("Canary health: %d/%d pods ready\n", readyPods, len(pods.Items), healthPercentage)
179
180 // Consider healthy if at least 80% of pods are ready
181 return healthPercentage >= 80.0, nil
182}
183
184func (c *CanaryController) rollbackToStable(ctx context.Context) error {
185 fmt.Println("Performing automatic rollback to stable version...")
186
187 return c.updateTrafficWeights(ctx, 0, 100)
188}
189
190func (c *CanaryController) cleanupOldVersion(ctx context.Context) error {
191 fmt.Println("Cleaning up old stable version...")
192
193 // Delete old stable deployment
194 err := c.k8sClient.AppsV1().Deployments(c.namespace).Delete(ctx, c.service+"-v1", metav1.DeleteOptions{})
195 if err != nil {
196 return fmt.Errorf("failed to delete old stable deployment: %w", err)
197 }
198
199 fmt.Println("Cleanup completed successfully")
200 return nil
201}
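A short usage sketch for the controller above, assuming it is compiled into the same package and runs in-cluster with RBAC permissions to read and update VirtualServices and Deployments; the service and namespace names are illustrative.
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; the pod's service account must be allowed
	// to manage VirtualServices and Deployments.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}

	c, err := NewCanaryController(cfg, "myapp", "default")
	if err != nil {
		log.Fatalf("create canary controller: %v", err)
	}

	// The full progression takes over an hour, so give it a generous deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
	defer cancel()

	if err := c.ProgressiveCanary(ctx); err != nil {
		log.Fatalf("canary rollout failed: %v", err)
	}
	log.Println("canary rollout completed")
}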
Istio Configuration for Traffic Management
1# destination-rule.yaml
2apiVersion: networking.istio.io/v1beta1
3kind: DestinationRule
4metadata:
5 name: myapp
6 namespace: default
7spec:
8 host: myapp
9 trafficPolicy:
10 # Connection pool settings
11 connectionPool:
12 tcp:
13 maxConnections: 100
14 http:
15 http1MaxPendingRequests: 50
16 maxRequestsPerConnection: 10
17 # Load balancing strategy
18 loadBalancer:
19 simple: LEAST_CONN # Least connections
20 # Outlier detection for circuit breaking
21 outlierDetection:
22 consecutiveErrors: 3
23 interval: 30s
24 baseEjectionTime: 30s
25 maxEjectionPercent: 50
26 # Subsets for version routing
27 subsets:
28 - name: stable
29 labels:
30 version: v1
31 - name: canary
32 labels:
33 version: v2
34
35---
36# virtual-service.yaml
37apiVersion: networking.istio.io/v1beta1
38kind: VirtualService
39metadata:
40 name: myapp
41 namespace: default
42spec:
43 hosts:
44 - myapp
45 http:
46 # Route based on headers
47 - match:
48 - headers:
49 x-user-type:
50 exact: "premium"
51 route:
52 - destination:
53 host: myapp
54 subset: canary
55 weight: 100
56 # Route based on query parameters
57 - match:
58 - queryParams:
59 version:
60 exact: "v2"
61 route:
62 - destination:
63 host: myapp
64 subset: canary
65 weight: 100
66 # Default traffic routing
67 - route:
68 - destination:
69 host: myapp
70 subset: stable
71 weight: 95
72 - destination:
73 host: myapp
74 subset: canary
75 weight: 5
76 # Retry configuration
77 retries:
78 attempts: 3
79 perTryTimeout: 2s
80 retryOn: 5xx,gateway-error,connect-failure,refused-stream
81 # Timeout configuration
82 timeout: 10s
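The header and query-parameter matches above are easy to exercise from a test client: any request carrying `x-user-type: premium` should land on the canary subset. A minimal sketch, assuming the client runs inside the mesh so the plain service name resolves, and reusing the X-Service-Version response header set by the earlier Go service.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "http://myapp:8080/", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Matches the x-user-type rule in the VirtualService, so the sidecar
	// routes 100% of this request's traffic to the canary subset.
	req.Header.Set("x-user-type", "premium")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("served by version %s: %s\n", resp.Header.Get("X-Service-Version"), body)
}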
Security - Zero-Trust Networking
Mutual TLS Configuration
1# peer-authentication.yaml - Enable mTLS
2apiVersion: security.istio.io/v1beta1
3kind: PeerAuthentication
4metadata:
5 name: default
6 namespace: default
7spec:
8 mtls:
9 # STRICT mode enforces mTLS for all service-to-service communication
10 mode: STRICT
11
12---
13# authorization-policy.yaml - Service-to-service authorization
14apiVersion: security.istio.io/v1beta1
15kind: AuthorizationPolicy
16metadata:
17 name: myapp-authz
18 namespace: default
19spec:
20 selector:
21 matchLabels:
22 app: myapp
23 action: ALLOW
24 rules:
25 # Allow requests from frontend service
26 - from:
27 - source:
28 principals: ["cluster.local/ns/default/sa/frontend"]
29 to:
30 - operation:
31 methods: ["GET", "POST"]
32 # Allow requests from admin service with full access
33 - from:
34 - source:
35 principals: ["cluster.local/ns/default/sa/admin"]
36
37---
38# deny-unauthenticated.yaml - `action` applies to a whole AuthorizationPolicy,
39# not to an individual rule, so denying unauthenticated requests needs its own policy
40apiVersion: security.istio.io/v1beta1
41kind: AuthorizationPolicy
42metadata:
43 name: myapp-deny-unauthenticated
44 namespace: default
45spec:
46 selector:
47 matchLabels:
48 app: myapp
49 action: DENY
50 rules:
51 - from:
52 - source:
53 notRequestPrincipals: ["*"]
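With STRICT mTLS in place, the sidecar terminates TLS and, by default, forwards the verified peer identity to the application in the X-Forwarded-Client-Cert header. The sketch below logs that SPIFFE identity without the application doing any TLS work itself; the parsing is deliberately simplified and assumes a single certificate element in the header.
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

// peerIdentity extracts the SPIFFE URI from the X-Forwarded-Client-Cert
// header added by the Istio sidecar when mTLS is enabled.
func peerIdentity(r *http.Request) string {
	xfcc := r.Header.Get("X-Forwarded-Client-Cert")
	for _, part := range strings.Split(xfcc, ";") {
		if strings.HasPrefix(part, "URI=") {
			return strings.TrimPrefix(part, "URI=")
		}
	}
	return "unknown"
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// e.g. spiffe://cluster.local/ns/default/sa/frontend
		fmt.Fprintf(w, "hello %s\n", peerIdentity(r))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}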
JWT Authentication for External Access
1// auth/middleware.go
2package auth
3
4import (
5 "context"
6 "net/http"
7 "strings"
8
9 "github.com/golang-jwt/jwt/v4"
10)
11
12type JWTClaims struct {
13 UserID string `json:"sub"`
14 Email string `json:"email"`
15 Roles []string `json:"roles"`
16 jwt.RegisteredClaims
17}
18
19type AuthMiddleware struct {
20 secretKey []byte
21}
22
23func NewAuthMiddleware(secretKey string) *AuthMiddleware {
24 return &AuthMiddleware{secretKey: []byte(secretKey)}
25}
26
27func (a *AuthMiddleware) Middleware(next http.Handler) http.Handler {
28 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
29 // Istio validates JWT and adds claims as headers
30 userID := r.Header.Get("X-Jwt-Claim-Sub")
31 email := r.Header.Get("X-Jwt-Claim-Email")
32 roles := r.Header.Get("X-Jwt-Claim-Roles")
33
34 if userID == "" {
35 // Fallback: Check Authorization header
36 authHeader := r.Header.Get("Authorization")
37 if strings.HasPrefix(authHeader, "Bearer ") {
38 tokenString := authHeader[7:]
39 claims, err := a.validateJWT(tokenString)
40 if err != nil {
41 http.Error(w, "Invalid token", http.StatusUnauthorized)
42 return
43 }
44
45 // Add claims to context
46 ctx := context.WithValue(r.Context(), "user", claims)
47 r = r.WithContext(ctx)
48 } else {
49 http.Error(w, "Authorization required", http.StatusUnauthorized)
50 return
51 }
52 } else {
53 // Parse roles from header
54 userRoles := strings.Split(roles, ",")
55 claims := &JWTClaims{
56 UserID: userID,
57 Email: email,
58 Roles: userRoles,
59 }
60
61 ctx := context.WithValue(r.Context(), "user", claims)
62 r = r.WithContext(ctx)
63 }
64
65 next.ServeHTTP(w, r)
66 })
67}
68
69func (a *AuthMiddleware) validateJWT(tokenString string) (*JWTClaims, error) {
70 token, err := jwt.ParseWithClaims(tokenString, &JWTClaims{}, func(token *jwt.Token) (interface{}, error) {
71 return a.secretKey, nil
72 })
73
74 if err != nil {
75 return nil, err
76 }
77
78 if claims, ok := token.Claims.(*JWTClaims); ok && token.Valid {
79 return claims, nil
80 }
81
82 return nil, jwt.ErrInvalidKey
83}
84
85func HasRole(role string) func(http.Handler) http.Handler {
86 return func(next http.Handler) http.Handler {
87 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
88 user, ok := r.Context().Value("user").(*JWTClaims)
89 if !ok {
90 http.Error(w, "User not found", http.StatusUnauthorized)
91 return
92 }
93
94 for _, userRole := range user.Roles {
95 if userRole == role {
96 next.ServeHTTP(w, r)
97 return
98 }
99 }
100
101 http.Error(w, "Insufficient permissions", http.StatusForbidden)
102 })
103 }
104}
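Wiring the middleware into a service is a one-liner per route. A minimal sketch, assuming the auth package above lives at a hypothetical module path and the secret comes from an environment variable; the handlers are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"example.com/myapp/auth" // hypothetical module path for the auth package above
)

func main() {
	a := auth.NewAuthMiddleware(os.Getenv("JWT_SECRET"))

	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "authenticated request")
	})

	// Every /api/ request needs a valid identity; /api/admin additionally
	// requires the "admin" role claim.
	http.Handle("/api/", a.Middleware(api))
	http.Handle("/api/admin", a.Middleware(auth.HasRole("admin")(api)))

	log.Fatal(http.ListenAndServe(":8080", nil))
}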
JWT Authentication Configuration
1# request-authentication.yaml
2apiVersion: security.istio.io/v1beta1
3kind: RequestAuthentication
4metadata:
5 name: jwt-auth
6 namespace: default
7spec:
8 selector:
9 matchLabels:
10 app: myapp
11 jwtRules:
12 - issuer: "https://auth.example.com"
13 jwksUri: "https://auth.example.com/.well-known/jwks.json"
14 audiences:
15 - "myapp-api"
16 # Forward original JWT to application
17 forwardOriginalToken: true
18 # Extract claims as headers
19 fromHeaders:
20 - name: Authorization
21 prefix: "Bearer "
22
23---
24# jwt-authorization.yaml
25apiVersion: security.istio.io/v1beta1
26kind: AuthorizationPolicy
27metadata:
28 name: jwt-require
29 namespace: default
30spec:
31 selector:
32 matchLabels:
33 app: myapp
34 action: ALLOW
35 rules:
36 # Require valid JWT for API endpoints
37 - to:
38 - operation:
39 notPaths: ["/health", "/metrics"]
40 when:
41 - key: request.auth.claims[aud]
42 values: ["myapp-api"]
Common Patterns and Pitfalls
Pattern 1: Service-to-Service Communication
1// client/service-client.go
2package client
3
4import (
5 "context"
6 "fmt"
7 "net/http"
8 "time"
9
10 "github.com/hashicorp/go-retryablehttp"
11)
12
13type ServiceClient struct {
14 baseURL string
15 httpClient *http.Client
16}
17
18func NewServiceClient(serviceName, namespace string) *ServiceClient {
19 // In Istio, services communicate via Kubernetes service names
20 baseURL := fmt.Sprintf("http://%s.%s.svc.cluster.local", serviceName, namespace)
21
22 // Configure HTTP client with timeouts and retries
23 // (retry settings live on retryablehttp.Client, which its RoundTripper wraps)
24 retryClient := retryablehttp.NewClient()
25 retryClient.RetryMax = 3
26 retryClient.RetryWaitMin = 1 * time.Second
27 retryClient.RetryWaitMax = 5 * time.Second
28
29 client := &http.Client{
30 Timeout: 30 * time.Second,
31 Transport: &retryablehttp.RoundTripper{Client: retryClient},
32 }
31
32 return &ServiceClient{
33 baseURL: baseURL,
34 httpClient: client,
35 }
36}
37
38func (c *ServiceClient) CallService(ctx context.Context, path string) (*http.Response, error) {
39 // Add tracing headers from context
40 req, err := http.NewRequestWithContext(ctx, "GET", c.baseURL+path, nil)
41 if err != nil {
42 return nil, fmt.Errorf("failed to create request: %w", err)
43 }
44
45 // Istio automatically adds:
46 // - Load balancing
47 // - mTLS encryption
48 // - Retry logic
49 // - Circuit breaking
50 // - Distributed tracing headers
51
52 return c.httpClient.Do(req)
53}
54
55func (c *ServiceClient) CallServiceWithTimeout(ctx context.Context, path string, timeout time.Duration) (*http.Response, error) {
56 ctx, cancel := context.WithTimeout(ctx, timeout)
57 defer cancel()
58
59 return c.CallService(ctx, path)
60}
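A usage sketch for the client above, assuming a hypothetical module path for the client package and a downstream service named user-service that exposes the same /health endpoint as the earlier example.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"time"

	"example.com/myapp/client" // hypothetical module path for the client package above
)

func main() {
	users := client.NewServiceClient("user-service", "default")

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := users.CallService(ctx, "/health")
	if err != nil {
		log.Fatalf("call failed: %v", err)
	}
	defer resp.Body.Close()

	var health map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	fmt.Printf("user-service health: %v\n", health)
}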
Pattern 2: Circuit Breaking Implementation
1// resilience/circuit_breaker.go
2package resilience
3
4import (
5 "errors"
6 "sync"
7 "time"
8)
9
10type CircuitState string
11
12const (
13 StateClosed CircuitState = "closed" // Normal operation
14 StateOpen CircuitState = "open" // Failing, reject requests
15 StateHalfOpen CircuitState = "half-open" // Testing if service recovered
16)
17
18type CircuitBreaker struct {
19 maxFailures int
20 resetTimeout time.Duration
21
22 mu sync.RWMutex
23 failures int
24 lastFailTime time.Time
25 state CircuitState
26}
27
28func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
29 return &CircuitBreaker{
30 maxFailures: maxFailures,
31 resetTimeout: resetTimeout,
32 state: StateClosed,
33 }
34}
35
36func (cb *CircuitBreaker) Call(fn func() error) error {
37 if !cb.canCall() {
38 return errors.New("circuit breaker is open")
39 }
40
41 err := fn()
42 cb.recordResult(err)
43 return err
44}
45
46func (cb *CircuitBreaker) canCall() bool {
47 cb.mu.Lock()
48 defer cb.mu.Unlock()
49
50 switch cb.state {
51 case StateClosed:
52 return true
53 case StateOpen:
54 if time.Since(cb.lastFailTime) > cb.resetTimeout {
55 cb.state = StateHalfOpen
56 cb.failures = 0
57 return true
58 }
59 return false
60 case StateHalfOpen:
61 return true
62 default:
63 return false
64 }
65}
66
67func (cb *CircuitBreaker) recordResult(err error) {
68 cb.mu.Lock()
69 defer cb.mu.Unlock()
70
71 if err == nil {
72 cb.failures = 0
73 if cb.state == StateHalfOpen {
74 cb.state = StateClosed
75 }
76 return
77 }
78
79 cb.failures++
80 cb.lastFailTime = time.Now()
81
82 if cb.failures >= cb.maxFailures {
83 cb.state = StateOpen
84 }
85}
86
87func (cb *CircuitBreaker) State() CircuitState {
88 cb.mu.RLock()
89 defer cb.mu.RUnlock()
90 return cb.state
91}
Common Pitfalls and Solutions
Pitfall 1: Missing Destination Rules
1# ❌ Wrong: VirtualService without DestinationRule
2apiVersion: networking.istio.io/v1beta1
3kind: VirtualService
4metadata:
5 name: myapp
6spec:
7 hosts:
8 - myapp
9 http:
10 - route:
11 - destination:
12 host: myapp
13 subset: v1 # This will fail - no DestinationRule defined
14
15# ✅ Correct: DestinationRule defines subsets
16apiVersion: networking.istio.io/v1beta1
17kind: DestinationRule
18metadata:
19 name: myapp
20spec:
21 host: myapp
22 subsets:
23 - name: v1
24 labels:
25 version: v1
Pitfall 2: Incorrect Selector Labels
1# ❌ Wrong: Selector doesn't match pod labels
2apiVersion: security.istio.io/v1beta1
3kind: AuthorizationPolicy
4metadata:
5 name: myapp-authz
6spec:
7 selector:
8 matchLabels:
9 app: myapp-service # Pods have "app: myapp"
10 action: ALLOW
11 rules: []
12
13# ✅ Correct: Selector matches pod labels
14apiVersion: security.istio.io/v1beta1
15kind: AuthorizationPolicy
16metadata:
17 name: myapp-authz
18spec:
19 selector:
20 matchLabels:
21 app: myapp # Matches deployment pod labels
22 action: ALLOW
23 rules: []
Integration and Mastery - Multi-Cluster Service Mesh
Multi-Cluster Configuration
1# multi-cluster-config.yaml
2apiVersion: install.istio.io/v1alpha1
3kind: IstioOperator
4metadata:
5 name: istio-controlplane
6 namespace: istio-system
7spec:
8 values:
9 global:
10 meshID: production-mesh
11 # Multi-cluster configuration
12 multiCluster:
13 clusterName: cluster-east
14 # Network configuration
15 network: network-east
16 # Control plane configuration
17 pilot:
18 env:
19 PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
20 PILOT_ENABLE_WORKLOAD_ENTRY_FOR_CRDS: "true"
Cross-Cluster Service Discovery
1# service-entry.yaml - Cross-cluster service discovery
2apiVersion: networking.istio.io/v1beta1
3kind: ServiceEntry
4metadata:
5 name: remote-service
6 namespace: default
7spec:
8 hosts:
9 - myapp.global # Global service name
10 location: MESH_INTERNAL
11 resolution: DNS
12 ports:
13 - number: 8080
14 name: http
15 protocol: HTTP
16 # Remote service endpoints
17 endpoints:
18 - address: 172.20.0.1 # Remote cluster service IP
19 locality: us-west-2
20 - address: 172.20.0.2
21 locality: us-west-2
22
23---
24# virtual-service.yaml - Cross-cluster traffic routing
25apiVersion: networking.istio.io/v1beta1
26kind: VirtualService
27metadata:
28 name: myapp-global
29 namespace: default
30spec:
31 hosts:
32 - myapp.global
33 http:
34 # Prefer local cluster
35 - match:
36 - sourceLabels:
37 topology.istio.io/cluster: cluster-east
38 route:
39 - destination:
40 host: myapp.default.svc.cluster.local # Local service
41 weight: 80
42 - destination:
43 host: myapp.global # Remote service
44 weight: 20
45 # Default route for other clusters
46 - route:
47 - destination:
48 host: myapp.global
49 weight: 100
Multi-Region Load Balancing
1# destination-rule.yaml - Locality-aware load balancing
2apiVersion: networking.istio.io/v1beta1
3kind: DestinationRule
4metadata:
5 name: myapp-global
6 namespace: default
7spec:
8 host: myapp.global
9 trafficPolicy:
10 loadBalancer:
11 localityLbSetting:
12 enabled: true
13 # Prefer same region, then same zone
14 distribute:
15 - from: us-east-1/us-east-1a/*
16 to:
17 "us-east-1/us-east-1a/*": 90
18 "us-east-1/us-east-1b/*": 10
19 "us-west-2/*": 0
20 - from: us-west-2/*
21 to:
22 "us-west-2/*": 90
23 "us-east-1/*": 10
24 # Failover configuration
25 outlierDetection:
26 consecutiveErrors: 2
27 interval: 10s
28 baseEjectionTime: 30s
29 maxEjectionPercent: 100
Observability - Comprehensive Monitoring and Tracing
Distributed Tracing with Jaeger
1// observability/tracing.go - Distributed tracing integration
2
3package observability
4
5import (
6 "context"
7 "fmt"
8 "net/http"
9
10 "go.opentelemetry.io/otel"
11 "go.opentelemetry.io/otel/attribute"
12 "go.opentelemetry.io/otel/exporters/jaeger"
13 "go.opentelemetry.io/otel/propagation"
14 "go.opentelemetry.io/otel/sdk/resource"
15 sdktrace "go.opentelemetry.io/otel/sdk/trace"
16 semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
17 "go.opentelemetry.io/otel/trace"
18)
19
20// InitTracer sets up OpenTelemetry with Jaeger backend
21func InitTracer(serviceName, jaegerEndpoint string) (*sdktrace.TracerProvider, error) {
22 // Create Jaeger exporter
23 exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(jaegerEndpoint)))
24 if err != nil {
25 return nil, fmt.Errorf("failed to create jaeger exporter: %w", err)
26 }
27
28 // Create tracer provider with service information
29 tp := sdktrace.NewTracerProvider(
30 sdktrace.WithBatcher(exporter),
31 sdktrace.WithResource(resource.NewWithAttributes(
32 semconv.SchemaURL,
33 semconv.ServiceNameKey.String(serviceName),
34 semconv.ServiceVersionKey.String("v1.0.0"),
35 attribute.String("environment", "production"),
36 )),
37 sdktrace.WithSampler(sdktrace.AlwaysSample()),
38 )
39
40 // Register as global tracer provider
41 otel.SetTracerProvider(tp)
42
43 // Set global propagator for trace context
44 otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
45 propagation.TraceContext{},
46 propagation.Baggage{},
47 ))
48
49 return tp, nil
50}
51
52// TracedHTTPMiddleware adds tracing to HTTP handlers
53func TracedHTTPMiddleware(next http.Handler) http.Handler {
54 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
55 tracer := otel.Tracer("http-handler")
56
57 // Extract trace context from Istio-injected headers
58 ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
59
60 // Start new span
61 ctx, span := tracer.Start(ctx, r.Method+" "+r.URL.Path,
62 trace.WithSpanKind(trace.SpanKindServer),
63 trace.WithAttributes(
64 semconv.HTTPMethodKey.String(r.Method),
65 semconv.HTTPTargetKey.String(r.URL.Path),
66 semconv.HTTPURLKey.String(r.URL.String()),
67 attribute.String("http.request_id", r.Header.Get("X-Request-Id")),
68 ),
69 )
70 defer span.End()
71
72 // Create response writer wrapper to capture status code
73 wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
74
75 // Call next handler
76 next.ServeHTTP(wrapped, r.WithContext(ctx))
77
78 // Add response attributes
79 span.SetAttributes(
80 semconv.HTTPStatusCodeKey.Int(wrapped.statusCode),
81 )
82
83 // Record error if status code indicates failure
84 if wrapped.statusCode >= 400 {
85 span.RecordError(fmt.Errorf("HTTP %d", wrapped.statusCode))
86 }
87 })
88}
89
90type responseWriter struct {
91 http.ResponseWriter
92 statusCode int
93}
94
95func (rw *responseWriter) WriteHeader(code int) {
96 rw.statusCode = code
97 rw.ResponseWriter.WriteHeader(code)
98}
99
100// AddSpan creates a custom span with attributes
101func AddSpan(ctx context.Context, operationName string, attributes ...attribute.KeyValue) (context.Context, trace.Span) {
102 tracer := otel.Tracer("service")
103 ctx, span := tracer.Start(ctx, operationName)
104 span.SetAttributes(attributes...)
105 return ctx, span
106}
107
108// Example traced database operation
109func (s *Service) GetUserTraced(ctx context.Context, userID int64) (*User, error) {
110 ctx, span := AddSpan(ctx, "GetUser.Database",
111 attribute.Int64("user.id", userID),
112 )
113 defer span.End()
114
115 // Perform database query
116 user, err := s.db.QueryUser(ctx, userID)
117 if err != nil {
118 span.RecordError(err)
119 return nil, fmt.Errorf("failed to query user: %w", err)
120 }
121
122 span.SetAttributes(
123 attribute.String("user.email", user.Email),
124 attribute.String("user.status", user.Status),
125 )
126
127 return user, nil
128}
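Wiring the tracer into a service entry point looks like this sketch, which assumes a hypothetical module path for the observability package and the default Jaeger collector address inside the mesh; spans are flushed on shutdown via the provider's Shutdown method.
package main

import (
	"context"
	"log"
	"net/http"

	"example.com/myapp/observability" // hypothetical module path for the package above
)

func main() {
	// Assumed collector endpoint; adjust to where Jaeger runs in your mesh.
	tp, err := observability.InitTracer("myapp", "http://jaeger-collector.istio-system:14268/api/traces")
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer func() {
		// Flush any buffered spans before the process exits.
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}()

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("traced\n"))
	})

	http.Handle("/", observability.TracedHTTPMiddleware(handler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}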
Metrics Collection with Prometheus
1// observability/metrics.go - Prometheus metrics integration
2
3package observability
4
5import (
6 "net/http"
7 "time"
8
9 "github.com/prometheus/client_golang/prometheus"
10 "github.com/prometheus/client_golang/prometheus/promauto"
11 "github.com/prometheus/client_golang/prometheus/promhttp"
12)
13
14var (
15 // HTTP request metrics
16 httpRequestsTotal = promauto.NewCounterVec(
17 prometheus.CounterOpts{
18 Name: "http_requests_total",
19 Help: "Total number of HTTP requests",
20 },
21 []string{"method", "path", "status"},
22 )
23
24 httpRequestDuration = promauto.NewHistogramVec(
25 prometheus.HistogramOpts{
26 Name: "http_request_duration_seconds",
27 Help: "HTTP request latencies in seconds",
28 Buckets: prometheus.DefBuckets,
29 },
30 []string{"method", "path"},
31 )
32
33 // Business metrics
34 userOperationsTotal = promauto.NewCounterVec(
35 prometheus.CounterOpts{
36 Name: "user_operations_total",
37 Help: "Total number of user operations",
38 },
39 []string{"operation", "status"},
40 )
41
42 activeUsers = promauto.NewGauge(
43 prometheus.GaugeOpts{
44 Name: "active_users",
45 Help: "Current number of active users",
46 },
47 )
48
49 // Istio integration metrics
50 istioRequestsTotal = promauto.NewCounterVec(
51 prometheus.CounterOpts{
52 Name: "istio_requests_total",
53 Help: "Total requests processed by Istio sidecar",
54 },
55 []string{"destination_service", "response_code"},
56 )
57)
58
59// MetricsMiddleware records HTTP request metrics
60func MetricsMiddleware(next http.Handler) http.Handler {
61 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
62 start := time.Now()
63
64 // Wrap response writer to capture status
65 wrapped := &metricsResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}
66
67 // Call next handler
68 next.ServeHTTP(wrapped, r)
69
70 // Record metrics
71 duration := time.Since(start).Seconds()
72
73 httpRequestsTotal.WithLabelValues(
74 r.Method,
75 r.URL.Path,
76 http.StatusText(wrapped.statusCode),
77 ).Inc()
78
79 httpRequestDuration.WithLabelValues(
80 r.Method,
81 r.URL.Path,
82 ).Observe(duration)
83 })
84}
85
86type metricsResponseWriter struct {
87 http.ResponseWriter
88 statusCode int
89}
90
91func (mrw *metricsResponseWriter) WriteHeader(code int) {
92 mrw.statusCode = code
93 mrw.ResponseWriter.WriteHeader(code)
94}
95
96// RecordUserOperation records business metrics
97func RecordUserOperation(operation string, success bool) {
98 status := "success"
99 if !success {
100 status = "failure"
101 }
102 userOperationsTotal.WithLabelValues(operation, status).Inc()
103}
104
105// UpdateActiveUsers updates active user gauge
106func UpdateActiveUsers(count int) {
107 activeUsers.Set(float64(count))
108}
109
110// SetupMetricsEndpoint creates HTTP handler for Prometheus metrics
111func SetupMetricsEndpoint() http.Handler {
112 return promhttp.Handler()
113}
114
115// Example usage in main
116func main() {
117 // Setup metrics endpoint
118 http.Handle("/metrics", SetupMetricsEndpoint())
119
120 // Setup traced application handlers (apiHandler is the application's own http.Handler)
121 http.Handle("/api/", TracedHTTPMiddleware(MetricsMiddleware(apiHandler)))
122
123 http.ListenAndServe(":8080", nil)
124}
Structured Logging Integration
1// observability/logging.go - Structured logging with trace correlation
2
3package observability
4
5import (
6 "context"
7 "os"
8
9 "go.opentelemetry.io/otel/trace"
10 "go.uber.org/zap"
11 "go.uber.org/zap/zapcore"
12)
13
14// InitLogger creates a structured logger with trace correlation
15func InitLogger(serviceName string, debug bool) (*zap.Logger, error) {
16 config := zap.NewProductionConfig()
17
18 if debug {
19 config.Level = zap.NewAtomicLevelAt(zapcore.DebugLevel)
20 }
21
22 config.EncoderConfig.TimeKey = "timestamp"
23 config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
24
25 // Add service name to all logs
26 config.InitialFields = map[string]interface{}{
27 "service": serviceName,
28 }
29
30 return config.Build()
31}
32
33// LoggerWithTrace adds trace context to logger
34func LoggerWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
35 span := trace.SpanFromContext(ctx)
36 if !span.IsRecording() {
37 return logger
38 }
39
40 spanContext := span.SpanContext()
41 return logger.With(
42 zap.String("trace_id", spanContext.TraceID().String()),
43 zap.String("span_id", spanContext.SpanID().String()),
44 )
45}
46
47// Example usage with trace correlation
48func (s *Service) ProcessRequest(ctx context.Context, request *Request) error {
49 logger := LoggerWithTrace(ctx, s.logger)
50
51 logger.Info("processing request",
52 zap.String("request_id", request.ID),
53 zap.String("user_id", request.UserID),
54 )
55
56 // Process request
57 if err := s.validateRequest(request); err != nil {
58 logger.Error("request validation failed",
59 zap.Error(err),
60 zap.String("request_id", request.ID),
61 )
62 return err
63 }
64
65 logger.Info("request processed successfully",
66 zap.String("request_id", request.ID),
67 )
68
69 return nil
70}
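A short wiring sketch for the logger, assuming a hypothetical module path for the observability package; LoggerWithTrace picks up the trace extracted by the tracing middleware shown earlier.
package main

import (
	"log"
	"net/http"

	"go.uber.org/zap"

	"example.com/myapp/observability" // hypothetical module path for the package above
)

func main() {
	logger, err := observability.InitLogger("myapp", false)
	if err != nil {
		log.Fatalf("init logger: %v", err)
	}
	defer logger.Sync()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Correlate this log line with the active trace, if any.
		observability.LoggerWithTrace(r.Context(), logger).Info("handling request",
			zap.String("path", r.URL.Path),
		)
		w.Write([]byte("ok\n"))
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}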
Practice Exercises
Exercise 1: Implement Progressive Canary Deployment
Objective: Create a canary deployment that automatically progresses based on health metrics.
Solution
1package canary
2
3import (
4 "context"
5 "fmt"
6 "time"
7
8 networkingv1beta1 "istio.io/api/networking/v1beta1" // route and destination types live in the istio.io/api module
9 "istio.io/client-go/pkg/clientset/versioned"
10 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
11 "k8s.io/client-go/kubernetes"
12 "k8s.io/client-go/rest"
13 "k8s.io/client-go/tools/clientcmd"
14)
15
16type CanaryDeployment struct {
17 istioClient versioned.Interface
18 k8sClient kubernetes.Interface
19 service string
20 namespace string
21}
22
23func NewCanaryDeployment(kubeconfigPath, service, namespace string) (*CanaryDeployment, error) {
24 var config *rest.Config
25 var err error
26
27 if kubeconfigPath != "" {
28 config, err = clientcmd.BuildConfigFromFlags("", kubeconfigPath)
29 } else {
30 config, err = rest.InClusterConfig()
31 }
32
33 if err != nil {
34 return nil, fmt.Errorf("failed to get kubernetes config: %w", err)
35 }
36
37 istioClient, err := versioned.NewForConfig(config)
38 if err != nil {
39 return nil, fmt.Errorf("failed to create Istio client: %w", err)
40 }
41
42 k8sClient, err := kubernetes.NewForConfig(config)
43 if err != nil {
44 return nil, fmt.Errorf("failed to create Kubernetes client: %w", err)
45 }
46
47 return &CanaryDeployment{
48 istioClient: istioClient,
49 k8sClient: k8sClient,
50 service: service,
51 namespace: namespace,
52 }, nil
53}
54
55func (cd *CanaryDeployment) ExecuteCanary(ctx context.Context) error {
56 // Progressive traffic stages
57 stages := []struct {
58 canaryWeight int32
59 duration time.Duration
60 description string
61 }{
62 {5, 5 * time.Minute, "Initial 5% traffic to canary"},
63 {10, 10 * time.Minute, "Increase to 10%"},
64 {25, 15 * time.Minute, "Scale to 25%"},
65 {50, 20 * time.Minute, "50-50 split"},
66 {75, 15 * time.Minute, "Scale to 75%"},
67 {100, 10 * time.Minute, "Full traffic to canary"},
68 }
69
70 for i, stage := range stages {
71 fmt.Printf("Stage %d: %s\n", i+1, stage.description)
72
73 // Update VirtualService traffic weights
74 if err := cd.updateTrafficWeights(ctx, stage.canaryWeight); err != nil {
75 return fmt.Errorf("failed to update traffic weights: %w", err)
76 }
77
78 // Monitor canary health
79 if err := cd.monitorCanaryHealth(ctx, stage.duration); err != nil {
80 fmt.Printf("Health check failed: %v\n", err)
81
82 // Automatic rollback
83 if rollbackErr := cd.rollbackToStable(ctx); rollbackErr != nil {
84 return fmt.Errorf("failed to rollback: %w", rollbackErr, err)
85 }
86 return fmt.Errorf("canary failed, rolled back: %w", err)
87 }
88
89 fmt.Printf("Stage %d completed successfully\n", i+1)
90 }
91
92 return nil
93}
94
95func (cd *CanaryDeployment) updateTrafficWeights(ctx context.Context, canaryWeight int32) error {
96 vs, err := cd.istioClient.NetworkingV1beta1().VirtualServices(cd.namespace).Get(ctx, cd.service, metav1.GetOptions{})
97 if err != nil {
98 return fmt.Errorf("failed to get VirtualService: %w", err)
99 }
100
101 stableWeight := 100 - canaryWeight
102
103 // Update traffic weights
104 vs.Spec.Http[0].Route = []*networkingv1beta1.HTTPRouteDestination{
105 {
106 Destination: &networkingv1beta1.Destination{
107 Host: cd.service,
108 Subset: "stable",
109 },
110 Weight: stableWeight,
111 },
112 {
113 Destination: &networkingv1beta1.Destination{
114 Host: cd.service,
115 Subset: "canary",
116 },
117 Weight: canaryWeight,
118 },
119 }
120
121 _, err = cd.istioClient.NetworkingV1beta1().VirtualServices(cd.namespace).Update(ctx, vs, metav1.UpdateOptions{})
122 if err != nil {
123 return fmt.Errorf("failed to update VirtualService: %w", err)
124 }
125
126 fmt.Printf("Updated traffic weights: stable=%d, canary=%d\n", stableWeight, canaryWeight)
127 return nil
128}
129
130func (cd *CanaryDeployment) monitorCanaryHealth(ctx context.Context, duration time.Duration) error {
131 ticker := time.NewTicker(30 * time.Second)
132 defer ticker.Stop()
133
134 timeout := time.After(duration)
135 consecutiveErrors := 0
136 maxErrors := 3
137
138 for {
139 select {
140 case <-ctx.Done():
141 return ctx.Err()
142 case <-timeout:
143 return nil
144 case <-ticker.C:
145 healthy, err := cd.checkCanaryHealth(ctx)
146 if err != nil {
147 consecutiveErrors++
148 fmt.Printf("Health check error: %v\n", err, consecutiveErrors)
149 if consecutiveErrors >= maxErrors {
150 return fmt.Errorf("too many consecutive failures")
151 }
152 continue
153 }
154
155 consecutiveErrors = 0
156 if !healthy {
157 return fmt.Errorf("canary is unhealthy")
158 }
159
160 fmt.Printf("Health check passed\n")
161 }
162 }
163}
164
165func (cd *CanaryDeployment) checkCanaryHealth(ctx context.Context) (bool, error) {
166 pods, err := cd.k8sClient.CoreV1().Pods(cd.namespace).List(ctx, metav1.ListOptions{
167 LabelSelector: fmt.Sprintf("app=%s,version=canary", cd.service),
168 })
169 if err != nil {
170 return false, fmt.Errorf("failed to list canary pods: %w", err)
171 }
172
173 if len(pods.Items) == 0 {
174 return false, fmt.Errorf("no canary pods found")
175 }
176
177 readyPods := 0
178 for _, pod := range pods.Items {
179 for _, condition := range pod.Status.Conditions {
180 if condition.Type == "Ready" && condition.Status == "True" {
181 readyPods++
182 break
183 }
184 }
185 }
186
187 healthPercentage := float64(readyPods) / float64(len(pods.Items)) * 100
188 fmt.Printf("Canary health: %d/%d pods ready\n", readyPods, len(pods.Items), healthPercentage)
189
190 return healthPercentage >= 90.0, nil
191}
192
193func (cd *CanaryDeployment) rollbackToStable(ctx context.Context) error {
194 fmt.Println("Rolling back to stable version...")
195 return cd.updateTrafficWeights(ctx, 0)
196}
197
198// Usage example
199func main() {
200 cd, err := NewCanaryDeployment("", "myapp", "default")
201 if err != nil {
202 panic(err)
203 }
204
205 ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
206 defer cancel()
207
208 if err := cd.ExecuteCanary(ctx); err != nil {
209 panic(err)
210 }
211
212 fmt.Println("Canary deployment completed successfully!")
213}
Exercise 2: Implement Zero-Trust Security
Objective: Configure comprehensive security policies including mTLS, JWT authentication, and fine-grained authorization.
Solution
1# 1. Enable strict mTLS for the mesh
2apiVersion: security.istio.io/v1beta1
3kind: PeerAuthentication
4metadata:
5 name: default
6 namespace: production
7spec:
8 mtls:
9 mode: STRICT
10
11---
12# 2. Configure JWT authentication for external access
13apiVersion: security.istio.io/v1beta1
14kind: RequestAuthentication
15metadata:
16 name: jwt-authentication
17 namespace: production
18spec:
19 selector:
20 matchLabels:
21 app: api-service
22 jwtRules:
23 - issuer: "https://auth.example.com"
24 jwksUri: "https://auth.example.com/.well-known/jwks.json"
25 audiences:
26 - "api.example.com"
27 forwardOriginalToken: true
28 fromHeaders:
29 - name: Authorization
30 prefix: "Bearer "
31
32---
33# 3. Role-based authorization for API endpoints
34apiVersion: security.istio.io/v1beta1
35kind: AuthorizationPolicy
36metadata:
37 name: api-authorization
38 namespace: production
39spec:
40 selector:
41 matchLabels:
42 app: api-service
43 action: ALLOW
44 rules:
45 # Admin users have full access
46 - from:
47 - source:
48 requestPrincipals: ["*"]
49 when:
50 - key: request.auth.claims[role]
51 values: ["admin"]
52 to:
53 - operation:
54 methods: ["*"]
55 # Regular users can only read/write their own data
56 - from:
57 - source:
58 requestPrincipals: ["*"]
59 when:
60 - key: request.auth.claims[role]
61 values: ["user"]
62 to:
63 - operation:
64 methods: ["GET", "POST", "PUT"]
65 paths: ["/api/users/*", "/api/orders/*"]
66 # Allow health checks without authentication
67 - to:
68 - operation:
69 paths: ["/health", "/metrics"]
70
71---
72# 4. Service-to-service authorization
73apiVersion: security.istio.io/v1beta1
74kind: AuthorizationPolicy
75metadata:
76 name: service-to-service-authz
77 namespace: production
78spec:
79 action: ALLOW
80 rules:
81 # Database service only accessible from API and worker services
82 - from:
83 - source:
84 principals:
85 - "cluster.local/ns/production/sa/api-service"
86 - "cluster.local/ns/production/sa/worker-service"
87 - to:
88 - operation:
89 methods: ["GET", "POST", "PUT", "DELETE"]
90 # Cache service accessible from all internal services
91 - from:
92 - source:
93 principals: ["cluster.local/ns/production/sa/*"]
94 - to:
95 - operation:
96 methods: ["GET", "POST", "DELETE"]
97
98---
99# 5. Default deny: an ALLOW policy with no rules matches nothing, so any
100# request not explicitly allowed by the policies above is rejected.
101# (A DENY policy with no rules would never match and would deny nothing.)
102apiVersion: security.istio.io/v1beta1
103kind: AuthorizationPolicy
104metadata:
105 name: deny-all
106 namespace: production
107spec: {}
Exercise 3: Multi-Region Service Mesh
Objective: Deploy a multi-region service mesh with cross-cluster service discovery, traffic routing, and failover capabilities.
Solution with Explanation
Implementation Steps:
- Configure both clusters with shared mesh ID
- Set up cross-cluster service discovery
- Implement region-aware traffic routing
- Configure automatic failover
- Add monitoring and observability
1# cluster-east/istio-operator.yaml
2apiVersion: install.istio.io/v1alpha1
3kind: IstioOperator
4metadata:
5 name: istio-controlplane-east
6 namespace: istio-system
7spec:
8 values:
9 global:
10 meshID: production-mesh
11 multiCluster:
12 clusterName: cluster-east
13 network: network-east
14 # Enable cross-cluster communication
15 pilot:
16 env:
17 PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
18 # Ingress gateway configuration
19 components:
20 ingressGateways:
21 - name: istio-ingressgateway
22 enabled: true
23 k8s:
24 service:
25 type: LoadBalancer
26 ports:
27 - port: 15443 # Multi-cluster port
28 name: tls
29 - port: 80
30 name: http2
31 - port: 443
32 name: https
33
34---
35# cluster-west/istio-operator.yaml
36apiVersion: install.istio.io/v1alpha1
37kind: IstioOperator
38metadata:
39 name: istio-controlplane-west
40 namespace: istio-system
41spec:
42 values:
43 global:
44 meshID: production-mesh
45 multiCluster:
46 clusterName: cluster-west
47 network: network-west
48 pilot:
49 env:
50 PILOT_ENABLE_CROSS_CLUSTER_WORKLOAD_ENTRY: "true"
1# service-entry.yaml - Cross-cluster service discovery
2apiVersion: networking.istio.io/v1beta1
3kind: ServiceEntry
4metadata:
5 name: remote-api-service
6 namespace: production
7spec:
8 hosts:
9 - api-service.global # Global service name
10 location: MESH_INTERNAL
11 resolution: DNS
12 ports:
13 - number: 8080
14 name: http
15 protocol: HTTP
16 endpoints:
17 - address: 172.20.0.1 # West cluster ingress IP
18 locality: us-west-2
19 network: network-west
20 - address: 172.20.0.2
21 locality: us-west-2
22 network: network-west
23
24---
25# virtual-service.yaml - Region-aware routing
26apiVersion: networking.istio.io/v1beta1
27kind: VirtualService
28metadata:
29 name: api-service-routing
30 namespace: production
31spec:
32 hosts:
33 - api-service.production.svc.cluster.local
34 - api-service.global
35 http:
36 # Route requests from east region to east cluster
37 - match:
38 - sourceLabels:
39 topology.istio.io/cluster: cluster-east
40 route:
41 - destination:
42 host: api-service.production.svc.cluster.local # Local service
43 weight: 80
44 - destination:
45 host: api-service.global # Remote service
46 weight: 20
47 # Route requests from west region to west cluster
48 - match:
49 - sourceLabels:
50 topology.istio.io/cluster: cluster-west
51 route:
52 - destination:
53 host: api-service.production.svc.cluster.local
54 weight: 80
55 - destination:
56 host: api-service.global
57 weight: 20
58 # Client preference routing
59 - match:
60 - headers:
61 x-preferred-region:
62 exact: "west"
63 route:
64 - destination:
65 host: api-service.global
66 weight: 100
67 # Default load balancing
68 - route:
69 - destination:
70 host: api-service.production.svc.cluster.local
71 weight: 50
72 - destination:
73 host: api-service.global
74 weight: 50
75
76---
77# destination-rule.yaml - Locality-aware load balancing
78apiVersion: networking.istio.io/v1beta1
79kind: DestinationRule
80metadata:
81 name: api-service-lb
82 namespace: production
83spec:
84 host: api-service.global
85 trafficPolicy:
86 loadBalancer:
87 localityLbSetting:
88 enabled: true
89 distribute:
90 - from: us-east-1/*
91 to:
92 "us-east-1/*": 80
93 "us-west-2/*": 20
94 - from: us-west-2/*
95 to:
96 "us-west-2/*": 80
97 "us-east-1/*": 20
98 outlierDetection:
99 consecutiveErrors: 3
100 interval: 30s
101 baseEjectionTime: 30s
102 maxEjectionPercent: 50
Explanation: This multi-region setup provides:
- Geographic distribution for reduced latency
- Automatic failover between regions
- Locality-aware routing to prefer local services
- Load balancing across regions
- Resilience to regional failures
The configuration ensures that 80% of traffic stays within the same region for performance, while 20% is distributed across regions for load balancing. When a region experiences issues, traffic automatically fails over to the healthy region.
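A client can also opt out of the locality preference by setting the x-preferred-region header that the VirtualService above matches on. A minimal sketch, assuming the client runs inside the mesh so api-service.global resolves through the ServiceEntry, and using an illustrative /api/status path.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "http://api-service.global:8080/api/status", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Matches the "client preference" rule and pins the call to the west cluster.
	req.Header.Set("x-preferred-region", "west")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("routed with status:", resp.Status)
}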
Exercise 4: Implement Fault Injection for Chaos Testing
Objective: Create fault injection policies to test service resilience and implement automated chaos testing.
Requirements:
- Configure delay injection for latency testing
- Implement HTTP abort injection
- Create gradual fault rollout
- Monitor service behavior under fault conditions
- Automated recovery validation
Solution
1# fault-injection.yaml - Comprehensive fault injection for chaos testing
2apiVersion: networking.istio.io/v1beta1
3kind: VirtualService
4metadata:
5 name: chaos-testing
6 namespace: production
7spec:
8 hosts:
9 - api-service
10 http:
11 # Inject delays for 10% of requests
12 - match:
13 - headers:
14 x-chaos-test:
15 exact: "latency"
16 fault:
17 delay:
18 percentage:
19 value: 10.0
20 fixedDelay: 5s
21 route:
22 - destination:
23 host: api-service
24 subset: v1
25
26 # Inject HTTP 503 errors for 5% of requests
27 - match:
28 - headers:
29 x-chaos-test:
30 exact: "error"
31 fault:
32 abort:
33 percentage:
34 value: 5.0
35 httpStatus: 503
36 route:
37 - destination:
38 host: api-service
39 subset: v1
40
41 # Combined fault injection - delays and errors
42 - match:
43 - headers:
44 x-chaos-test:
45 exact: "combined"
46 fault:
47 delay:
48 percentage:
49 value: 15.0
50 fixedDelay: 3s
51 abort:
52 percentage:
53 value: 10.0
54 httpStatus: 500
55 route:
56 - destination:
57 host: api-service
58 subset: v1
59
60 # Default route without faults
61 - route:
62 - destination:
63 host: api-service
64 subset: v1
1// chaos/testing.go - Automated chaos testing framework
2
3package chaos
4
5import (
6 "context"
7 "fmt"
8 "net/http"
9 "time"
10
11 "k8s.io/client-go/kubernetes"
12 "istio.io/client-go/pkg/clientset/versioned"
13 networkingv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1"
14 metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
15)
16
17type ChaosTest struct {
18 istioClient versioned.Interface
19 k8sClient kubernetes.Interface
20 serviceName string
21 namespace string
22}
23
24func NewChaosTest(istioClient versioned.Interface, k8sClient kubernetes.Interface,
25 serviceName, namespace string) *ChaosTest {
26 return &ChaosTest{
27 istioClient: istioClient,
28 k8sClient: k8sClient,
29 serviceName: serviceName,
30 namespace: namespace,
31 }
32}
33
34// RunLatencyTest injects delays and monitors service behavior
35func (ct *ChaosTest) RunLatencyTest(ctx context.Context, delayMs int, percentage float64, duration time.Duration) error {
36 fmt.Printf("Starting latency test: %dms delay for %.1f%% of requests\n", delayMs, percentage)
37
38 // Create fault injection VirtualService
39 vs := &networkingv1beta1.VirtualService{
40 ObjectMeta: metav1.ObjectMeta{
41 Name: ct.serviceName + "-latency-test",
42 Namespace: ct.namespace,
43 },
44 Spec: networkingv1beta1.VirtualService{ // the CRD's Spec embeds the istio.io/api VirtualService type
45 Hosts: []string{ct.serviceName},
46 Http: []*networkingv1beta1.HTTPRoute{
47 {
48 Fault: &networkingv1beta1.HTTPFaultInjection{
49 Delay: &networkingv1beta1.HTTPFaultInjection_Delay{
50 Percentage: &networkingv1beta1.Percent{
51 Value: percentage,
52 },
53 HttpDelayType: &networkingv1beta1.HTTPFaultInjection_Delay_FixedDelay{
54 FixedDelay: &duration.Duration{
55 Seconds: int64(delayMs / 1000),
56 },
57 },
58 },
59 },
60 Route: []*networkingv1beta1.HTTPRouteDestination{
61 {
62 Destination: &networkingv1beta1.Destination{
63 Host: ct.serviceName,
64 },
65 },
66 },
67 },
68 },
69 },
70 }
71
72 // Create VirtualService
73 _, err := ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Create(ctx, vs, metav1.CreateOptions{})
74 if err != nil {
75 return fmt.Errorf("failed to create fault injection: %w", err)
76 }
77
78 // Monitor for specified duration
79 fmt.Printf("Monitoring service for %v...\n", duration)
80 time.Sleep(duration)
81
82 // Check metrics
83 healthy, err := ct.checkServiceHealth(ctx)
84 if err != nil {
85 return fmt.Errorf("health check failed: %w", err)
86 }
87
88 if !healthy {
89 fmt.Println("⚠️ Service degraded under latency injection")
90 } else {
91 fmt.Println("✅ Service remained healthy under latency injection")
92 }
93
94 // Cleanup
95 err = ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Delete(
96 ctx, ct.serviceName+"-latency-test", metav1.DeleteOptions{})
97 if err != nil {
98 return fmt.Errorf("failed to cleanup: %w", err)
99 }
100
101 return nil
102}
103
104// RunErrorInjectionTest injects HTTP errors
105func (ct *ChaosTest) RunErrorInjectionTest(ctx context.Context, httpStatus int32, percentage float64, duration time.Duration) error {
106 fmt.Printf("Starting error injection test: HTTP %d for %.1f%% of requests\n", httpStatus, percentage)
107
108 vs := &networkingv1beta1.VirtualService{
109 ObjectMeta: metav1.ObjectMeta{
110 Name: ct.serviceName + "-error-test",
111 Namespace: ct.namespace,
112 },
113 Spec: networkingv1beta1.VirtualService{
114 Hosts: []string{ct.serviceName},
115 Http: []*networkingv1beta1.HTTPRoute{
116 {
117 Fault: &networkingv1beta1.HTTPFaultInjection{
118 Abort: &networkingv1beta1.HTTPFaultInjection_Abort{
119 Percentage: &networkingv1beta1.Percent{
120 Value: percentage,
121 },
122 ErrorType: &networkingv1beta1.HTTPFaultInjection_Abort_HttpStatus{
123 HttpStatus: httpStatus,
124 },
125 },
126 },
127 Route: []*networkingv1beta1.HTTPRouteDestination{
128 {
129 Destination: &networkingv1beta1.Destination{
130 Host: ct.serviceName,
131 },
132 },
133 },
134 },
135 },
136 },
137 }
138
139 _, err := ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Create(ctx, vs, metav1.CreateOptions{})
140 if err != nil {
141 return fmt.Errorf("failed to create error injection: %w", err)
142 }
143
144 // Monitor and validate retry behavior
145 fmt.Printf("Monitoring retry and circuit breaker behavior...\n")
146 time.Sleep(duration)
147
148 // Cleanup
149 err = ct.istioClient.NetworkingV1beta1().VirtualServices(ct.namespace).Delete(
150 ctx, ct.serviceName+"-error-test", metav1.DeleteOptions{})
151 if err != nil {
152 return fmt.Errorf("failed to cleanup: %w", err)
153 }
154
155 fmt.Println("✅ Error injection test completed")
156 return nil
157}
158
159func (ct *ChaosTest) checkServiceHealth(ctx context.Context) (bool, error) {
160 // Check pod health
161 pods, err := ct.k8sClient.CoreV1().Pods(ct.namespace).List(ctx, metav1.ListOptions{
162 LabelSelector: fmt.Sprintf("app=%s", ct.serviceName),
163 })
164 if err != nil {
165 return false, err
166 }
167
168 readyPods := 0
169 for _, pod := range pods.Items {
170 for _, condition := range pod.Status.Conditions {
171 if condition.Type == "Ready" && condition.Status == "True" {
172 readyPods++
173 break
174 }
175 }
176 }
177
178 healthPercentage := float64(readyPods) / float64(len(pods.Items)) * 100
179 return healthPercentage >= 80.0, nil
180}
181
182// Example usage
183func main() {
184 // Assumes istioClient and k8sClient have already been constructed - see the sketch after this block
185 chaosTest := NewChaosTest(istioClient, k8sClient, "api-service", "production")
186
187 ctx := context.Background()
188
189 // Run latency test - 500ms delay for 20% of requests
190 if err := chaosTest.RunLatencyTest(ctx, 500, 20.0, 5*time.Minute); err != nil {
191 panic(err)
192 }
193
194 // Run error injection test - HTTP 503 for 10% of requests
195 if err := chaosTest.RunErrorInjectionTest(ctx, 503, 10.0, 5*time.Minute); err != nil {
196 panic(err)
197 }
198
199 fmt.Println("All chaos tests completed successfully!")
200}
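The example main above assumes istioClient and k8sClient already exist. A minimal sketch of how they might be built from a kubeconfig with Kubernetes client-go and Istio's client-go - the buildClients helper and the kubeconfig lookup are illustrative, not part of the chaos test itself:

package main

import (
	"log"
	"os"
	"path/filepath"

	versionedclient "istio.io/client-go/pkg/clientset/versioned"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// buildClients returns the Istio and Kubernetes clientsets that NewChaosTest expects.
func buildClients() (*versionedclient.Clientset, *kubernetes.Clientset) {
	// Resolve a kubeconfig: KUBECONFIG if set, otherwise the default location.
	kubeconfig := os.Getenv("KUBECONFIG")
	if kubeconfig == "" {
		home, _ := os.UserHomeDir()
		kubeconfig = filepath.Join(home, ".kube", "config")
	}
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatalf("failed to load kubeconfig: %v", err)
	}

	istioClient, err := versionedclient.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create Istio client: %v", err)
	}
	k8sClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create Kubernetes client: %v", err)
	}
	return istioClient, k8sClient
}

When the chaos test runs inside the cluster, rest.InClusterConfig() replaces the clientcmd lookup; everything else stays the same.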
Exercise 5: Implement Complete Observability Stack
Objective: Set up comprehensive observability with distributed tracing, metrics, and logging across the service mesh.
Requirements:
- Deploy Jaeger for distributed tracing
- Configure Prometheus for metrics collection
- Set up Grafana dashboards
- Implement trace sampling strategies
- Create alerts for SLO violations
Solution
1# observability-stack.yaml - Complete observability deployment
2
3---
4# Jaeger deployment for distributed tracing
5apiVersion: apps/v1
6kind: Deployment
7metadata:
8 name: jaeger
9 namespace: istio-system
10spec:
11 replicas: 1
12 selector:
13 matchLabels:
14 app: jaeger
15 template:
16 metadata:
17 labels:
18 app: jaeger
19 spec:
20 containers:
21 - name: jaeger
22 image: jaegertracing/all-in-one:latest # all-in-one suits demos; pin a version and use a production Jaeger deployment for real clusters
23 ports:
24 - containerPort: 16686 # UI
25 - containerPort: 14268 # Collector
26 - containerPort: 9411 # Zipkin compatible
27 env:
28 - name: COLLECTOR_ZIPKIN_HOST_PORT
29 value: ":9411"
30 resources:
31 requests:
32 cpu: "200m"
33 memory: "512Mi"
34 limits:
35 cpu: "500m"
36 memory: "1Gi"
37
38---
39# Jaeger Service
40apiVersion: v1
41kind: Service
42metadata:
43 name: jaeger
44 namespace: istio-system
45spec:
46 selector:
47 app: jaeger
48 ports:
49 - name: ui
50 port: 16686
51 targetPort: 16686
52 - name: collector
53 port: 14268
54 targetPort: 14268
55 - name: zipkin
56 port: 9411
57 targetPort: 9411
58
59---
60# Prometheus ServiceMonitor for application metrics
61apiVersion: monitoring.coreos.com/v1
62kind: ServiceMonitor
63metadata:
64 name: app-metrics
65 namespace: production
66spec:
67 selector:
68 matchLabels:
69 monitoring: "enabled"
70 endpoints:
71 - port: metrics
72 interval: 30s
73 path: /metrics
74
75---
76# Istio telemetry configuration
77apiVersion: telemetry.istio.io/v1alpha1
78kind: Telemetry
79metadata:
80 name: mesh-telemetry
81 namespace: istio-system
82spec:
83 # Enable distributed tracing
84 tracing:
85 - providers:
86 - name: jaeger
87 randomSamplingPercentage: 100.0 # Sample 100% for testing, reduce in production
88 customTags:
89 environment:
90 literal:
91 value: "production"
92 version:
93 literal:
94 value: "v1.0.0"
95
96 # Metrics configuration
97 metrics:
98 - providers:
99 - name: prometheus
100 overrides:
101 - match:
102 metric: REQUEST_COUNT
103 tagOverrides:
104 request_protocol:
105 value: "api.protocol | unknown"
106 response_code:
107 value: "response.code | 000"
108
109---
110# Grafana Dashboard ConfigMap
111apiVersion: v1
112kind: ConfigMap
113metadata:
114 name: grafana-dashboard-istio
115 namespace: monitoring
116data:
117 istio-dashboard.json: |
118 {
119 "dashboard": {
120 "title": "Istio Service Mesh Dashboard",
121 "panels": [
122 {
123 "title": "Request Rate",
124 "targets": [
125 {
126 "expr": "sum(rate(istio_requests_total[5m])) by (destination_service)"
127 }
128 ]
129 },
130 {
131 "title": "Error Rate",
132 "targets": [
133 {
134 "expr": "sum(rate(istio_requests_total{response_code=~\"5.*\"}[5m])) / sum(rate(istio_requests_total[5m]))"
135 }
136 ]
137 },
138 {
139 "title": "Request Duration P99",
140 "targets": [
141 {
142 "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service))"
143 }
144 ]
145 }
146 ]
147 }
148 }
The jaeger tracing provider named in the Telemetry resource must also be registered under meshConfig.extensionProviders in the Istio installation; with the mesh side in place, the application configures its own tracing, metrics, and structured logging:
1// observability/setup.go - Complete observability setup
2
3package observability
4
5import (
6 "context"
7 "fmt"
8 "net/http"
9
10 "go.opentelemetry.io/otel"
11 "go.opentelemetry.io/otel/exporters/jaeger"
12 "go.opentelemetry.io/otel/sdk/resource"
13 sdktrace "go.opentelemetry.io/otel/sdk/trace"
14 semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
15 "github.com/prometheus/client_golang/prometheus"
16 "github.com/prometheus/client_golang/prometheus/promhttp"
17 "go.uber.org/zap"
18)
19
20type ObservabilityStack struct {
21 logger *zap.Logger
22 tracerProvider *sdktrace.TracerProvider
23 metricsServer *http.Server
24}
25
26func NewObservabilityStack(serviceName, jaegerEndpoint string, metricsPort string) (*ObservabilityStack, error) {
27 // Initialize structured logging
28 logger, err := initLogger(serviceName)
29 if err != nil {
30 return nil, fmt.Errorf("failed to initialize logger: %w", err)
31 }
32
33 // Initialize distributed tracing
34 tracerProvider, err := initTracing(serviceName, jaegerEndpoint)
35 if err != nil {
36 return nil, fmt.Errorf("failed to initialize tracing: %w", err)
37 }
38
39 // Initialize metrics server
40 metricsServer := initMetrics(metricsPort)
41
42 logger.Info("observability stack initialized",
43 zap.String("service", serviceName),
44 zap.String("jaeger_endpoint", jaegerEndpoint),
45 zap.String("metrics_port", metricsPort),
46 )
47
48 return &ObservabilityStack{
49 logger: logger,
50 tracerProvider: tracerProvider,
51 metricsServer: metricsServer,
52 }, nil
53}
54
55func initLogger(serviceName string) (*zap.Logger, error) {
56 config := zap.NewProductionConfig()
57 config.InitialFields = map[string]interface{}{
58 "service": serviceName,
59 }
60 return config.Build()
61}
62
63func initTracing(serviceName, jaegerEndpoint string) (*sdktrace.TracerProvider, error) {
64 exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(jaegerEndpoint)))
65 if err != nil {
66 return nil, err
67 }
68
69 tp := sdktrace.NewTracerProvider(
70 sdktrace.WithBatcher(exporter),
71 sdktrace.WithResource(resource.NewWithAttributes(
72 semconv.SchemaURL,
73 semconv.ServiceNameKey.String(serviceName),
74 )),
75 // Sample 10% of traces in production
76 sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
77 )
78
79 otel.SetTracerProvider(tp)
80 return tp, nil
81}
82
83func initMetrics(port string) *http.Server {
84 mux := http.NewServeMux()
85 mux.Handle("/metrics", promhttp.Handler())
86
87 return &http.Server{
88 Addr: ":" + port,
89 Handler: mux,
90 }
91}
92
93func (o *ObservabilityStack) Start() error {
94 // Start metrics server
95 go func() {
96 if err := o.metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
97 o.logger.Fatal("metrics server failed", zap.Error(err))
98 }
99 }()
100
101 o.logger.Info("observability stack started")
102 return nil
103}
104
105func (o *ObservabilityStack) Shutdown(ctx context.Context) error {
106 // Shutdown tracer provider
107 if err := o.tracerProvider.Shutdown(ctx); err != nil {
108 return fmt.Errorf("failed to shutdown tracer: %w", err)
109 }
110
111 // Shutdown metrics server
112 if err := o.metricsServer.Shutdown(ctx); err != nil {
113 return fmt.Errorf("failed to shutdown metrics server: %w", err)
114 }
115
116 o.logger.Info("observability stack shutdown complete")
117 return nil
118}
119
120// Example usage - call this from your application's main package
121func main() {
122 obs, err := NewObservabilityStack(
123 "api-service",
124 "http://jaeger-collector:14268/api/traces",
125 "9090",
126 )
127 if err != nil {
128 panic(err)
129 }
130
131 if err := obs.Start(); err != nil {
132 panic(err)
133 }
134
135 // Your application code here
136 // ...
137
138 // Graceful shutdown
139 ctx := context.Background()
140 if err := obs.Shutdown(ctx); err != nil {
141 panic(err)
142 }
143}
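The ServiceMonitor defined earlier scrapes whatever the application publishes on its metrics port, and promhttp.Handler() in initMetrics serves Prometheus' default registry. A minimal sketch of adding an application-level counter to that registry - the metric name api_requests_total and the InstrumentHandler wrapper are illustrative:

package observability

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// requestCounter lives in the default registry, which is exactly what
// promhttp.Handler() in initMetrics exposes on /metrics.
var requestCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "api_requests_total", // illustrative metric name
		Help: "Total HTTP requests handled, labeled by path and status class.",
	},
	[]string{"path", "status_class"},
)

func init() {
	prometheus.MustRegister(requestCounter)
}

// InstrumentHandler wraps an http.Handler and counts every request it serves.
func InstrumentHandler(path string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		// Status class is simplified; a real wrapper would capture the
		// response code via a ResponseWriter wrapper.
		requestCounter.WithLabelValues(path, "2xx").Inc()
	})
}

Istio's sidecar already reports istio_requests_total for every hop, so application metrics like this are best reserved for business-level signals the proxy cannot see.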
PrometheusRule for Alerting:
1# alerting-rules.yaml - SLO-based alerts
2apiVersion: monitoring.coreos.com/v1
3kind: PrometheusRule
4metadata:
5 name: istio-alerts
6 namespace: monitoring
7spec:
8 groups:
9 - name: istio-slo-alerts
10 interval: 30s
11 rules:
12 # High error rate alert
13 - alert: HighErrorRate
14 expr: |
15 (sum(rate(istio_requests_total{response_code=~"5.*"}[5m])) by (destination_service)
16 / sum(rate(istio_requests_total[5m])) by (destination_service)) > 0.05
17 for: 5m
18 labels:
19 severity: critical
20 annotations:
21 summary: "High error rate detected"
22 description: "Service {{ $labels.destination_service }} has error rate above 5%"
23
24 # High latency alert (P99 > 1s)
25 - alert: HighLatency
26 expr: |
27 histogram_quantile(0.99,
28 sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service)
29 ) > 1000
30 for: 10m
31 labels:
32 severity: warning
33 annotations:
34 summary: "High latency detected"
35 description: "Service {{ $labels.destination_service }} P99 latency > 1s"
36
37 # Circuit breaker open
38 - alert: CircuitBreakerOpen
39 expr: |
40 istio_tcp_connections_opened_total{} - istio_tcp_connections_closed_total{} > 100
41 for: 5m
42 labels:
43 severity: warning
44 annotations:
45 summary: "Circuit breaker may be triggered"
46 description: "Unusual connection pattern detected for {{ $labels.destination_service }}"
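The 5% threshold in HighErrorRate is a static cutoff; the same signal can be framed as an SLO error-budget burn rate, which is what the "alerts for SLO violations" requirement usually means in practice. A minimal sketch of the arithmetic, assuming an illustrative 99.5% availability target:

package main

import "fmt"

// burnRate reports how fast an error budget is being consumed:
// 1.0 means errors arrive exactly at the budgeted rate, higher values
// mean the budget will be exhausted before the SLO window ends.
func burnRate(errorRatio, sloTarget float64) float64 {
	budget := 1 - sloTarget // a 99.5% SLO leaves a 0.5% error budget
	if budget <= 0 {
		return 0
	}
	return errorRatio / budget
}

func main() {
	// A sustained 5% error ratio against a 99.5% SLO burns the budget
	// 10x faster than allowed - clearly worth paging on.
	fmt.Printf("burn rate: %.1f\n", burnRate(0.05, 0.995))
}

Prometheus can compute the same ratio directly in the alert expression; the Go version only makes the arithmetic explicit.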
Further Reading
- Istio Documentation
- Istio by Example
- Envoy Proxy Documentation
- Service Mesh Patterns
- Kubernetes Multi-Cluster Patterns
Summary
Key Takeaways
💡 Core Benefits:
- Zero-Code Networking: Sidecar pattern handles all network concerns transparently
- Advanced Traffic Management: Canary deployments, A/B testing, traffic splitting
- Zero-Trust Security: Automatic mTLS and fine-grained authorization policies
- Comprehensive Observability: Distributed tracing, metrics, and access logs
- Resilience: Circuit breaking, retries, and automatic failover
- Multi-Cluster Support: Global service mesh across regions and clouds
🚀 Production Patterns:
- Progressive Delivery: Automated canary deployments with health-based progression
- Security-First: Default deny policies with mTLS encryption
- Observability: OpenTelemetry integration with distributed tracing
- High Availability: Multi-region deployment with locality-aware routing
When to Use Service Mesh
✅ Ideal for:
- Production microservices with >5 services
- Multi-language environments requiring consistent networking
- Compliance and security requirements
- Complex traffic management
- Multi-region or multi-cloud deployments
❌ Overkill for:
- Single monolithic applications
- Simple 2-3 service architectures
- Development or testing environments
- Applications with minimal networking requirements
Next Steps
- Start Simple: Begin with basic mTLS and telemetry
- Gradual Adoption: Enable features progressively
- Monitor Everything: Set up observability from day one
- Test Failures: Use fault injection to validate resilience
- Plan for Scale: Design for multi-cluster from the start