Exercise: Retry Mechanism with Exponential Backoff

Difficulty - Intermediate

Learning Objectives

  • Implement exponential backoff retry logic
  • Add jitter to prevent thundering herd problems
  • Handle context cancellation during retries
  • Build circuit breaker patterns
  • Create retry strategies for different failure types

Problem Statement

Create a robust retry mechanism that handles transient failures gracefully using exponential backoff and jitter. Your implementation should support context cancellation, configurable retry policies, and intelligent failure classification to determine retry eligibility.

Requirements

1. Basic Retry Function

Implement a retry function that:

  • Accepts a context, configuration, and a function to execute
  • Retries the function up to MaxAttempts times on failure
  • Uses exponential backoff between retries
  • Caps the maximum delay to prevent excessive waiting
  • Respects context cancellation at any point
  • Returns the last error if all attempts fail

Example Usage:

cfg := RetryConfig{
    MaxAttempts:  5,
    InitialDelay: 100 * time.Millisecond,
    MaxDelay:     10 * time.Second,
    Multiplier:   2.0,
}

err := Retry(ctx, cfg, func(ctx context.Context) error {
    return makeAPICall(ctx)
})
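
With this configuration the waits between attempts (before any jitter) grow as 100 ms, 200 ms, 400 ms, and 800 ms (four waits for five attempts), each capped at MaxDelay.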

2. Retry with Jitter

Add randomized jitter to prevent thundering herd:

  • Implements full jitter: delay = random(0, exponentialDelay)
  • Implements decorrelated jitter for better distribution
  • Prevents multiple clients from retrying simultaneously
  • Configurable jitter strategy

Example Usage:

cfg := RetryConfig{
    MaxAttempts:  3,
    InitialDelay: 1 * time.Second,
    MaxDelay:     30 * time.Second,
    Multiplier:   2.0,
    Jitter:       FullJitter,
}

err := Retry(ctx, cfg, func(ctx context.Context) error {
    return fetchFromDatabase(ctx)
})
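
Because full jitter draws each wait uniformly from zero up to the exponential delay, the two waits in this example land somewhere in [0, 1s) and [0, 2s) rather than at exactly 1s and 2s.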

3. Retryable Error Classification

Create a system to classify errors as retryable or permanent:

  • Implements IsRetryable(error) bool function
  • Supports custom retryable error predicates
  • Handles wrapped errors correctly
  • Includes common retryable error types
  • Stops retrying immediately on permanent errors

Example Usage:

classifier := NewErrorClassifier()
classifier.AddRetryable(IsNetworkError)
classifier.AddRetryable(IsTemporaryDatabaseError)

cfg := RetryConfig{
    MaxAttempts:     5,
    InitialDelay:    100 * time.Millisecond,
    MaxDelay:        10 * time.Second,
    Multiplier:      2.0,
    ErrorClassifier: classifier,
}

err := Retry(ctx, cfg, func(ctx context.Context) error {
    return doWork(ctx)
})
// Only retries if error is classified as retryable
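
The two predicates passed above stand in for whatever classification fits your system: the solution below defines IsNetworkError, while IsTemporaryDatabaseError is never defined in this exercise. As a rough sketch of such a predicate (assumes the standard errors and fmt packages; the dbError type and the error codes are purely illustrative), using errors.As so that wrapped errors are still recognized:

// Purely illustrative error type; not part of the exercise's API.
type dbError struct{ code int }

func (e *dbError) Error() string { return fmt.Sprintf("db error %d", e.code) }

// IsTemporaryDatabaseError treats certain (illustrative) codes as retryable.
// errors.As walks the wrapped-error chain, so errors wrapped with
// fmt.Errorf("query failed: %w", err) are still matched.
func IsTemporaryDatabaseError(err error) bool {
    var de *dbError
    return errors.As(err, &de) && (de.code == 1205 || de.code == 1213)
}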

4. Retry with Callback Hooks

Support lifecycle hooks for monitoring and debugging:

  • OnRetry(attempt int, err error, delay time.Duration) - called before each retry
  • OnSuccess(attempt int) - called when operation succeeds
  • OnFailure(err error) - called when all retries are exhausted
  • Useful for logging, metrics, and alerting

Example Usage:

cfg := RetryConfig{
    MaxAttempts:  3,
    InitialDelay: 1 * time.Second,
    MaxDelay:     5 * time.Second,
    Multiplier:   2.0,
    OnRetry: func(attempt int, err error, delay time.Duration) {
        log.Printf("Retry %d after %v: %v", attempt, delay, err)
    },
    OnSuccess: func(attempt int) {
        log.Printf("Success on attempt %d", attempt)
    },
}

5. Circuit Breaker Integration

Implement a circuit breaker to prevent retry storms:

  • Opens circuit after N consecutive failures
  • Stays open for a cooldown period
  • Transitions to half-open state to test recovery
  • Closes circuit after successful operations in half-open state
  • Prevents retries when circuit is open

Example Usage:

breaker := NewCircuitBreaker(CircuitBreakerConfig{
    FailureThreshold: 5,
    SuccessThreshold: 2,
    Timeout:          30 * time.Second,
})

cfg := RetryConfig{
    MaxAttempts:    3,
    InitialDelay:   1 * time.Second,
    CircuitBreaker: breaker,
}

err := Retry(ctx, cfg, func(ctx context.Context) error {
    if breaker.State() == CircuitOpen {
        return ErrCircuitOpen
    }
    return callExternalService(ctx)
})

Function Signatures

package retry

import (
    "context"
    "errors"
    "time"
)

// RetryConfig configures retry behavior
type RetryConfig struct {
    MaxAttempts     int
    InitialDelay    time.Duration
    MaxDelay        time.Duration
    Multiplier      float64
    Jitter          JitterStrategy
    ErrorClassifier *ErrorClassifier
    CircuitBreaker  *CircuitBreaker
    OnRetry         func(attempt int, err error, delay time.Duration)
    OnSuccess       func(attempt int)
    OnFailure       func(err error)
}

// JitterStrategy defines how jitter is applied
type JitterStrategy int

const (
    NoJitter JitterStrategy = iota
    FullJitter
    EqualJitter
    DecorrelatedJitter
)

// ErrCircuitOpen is returned when the circuit breaker is open
var ErrCircuitOpen = errors.New("circuit breaker is open")

// Retry executes fn with exponential backoff retry logic
func Retry(ctx context.Context, cfg RetryConfig, fn func(context.Context) error) error

// ErrorClassifier determines if errors are retryable
type ErrorClassifier struct {
    predicates []func(error) bool
}

// NewErrorClassifier creates a new error classifier
func NewErrorClassifier() *ErrorClassifier

// AddRetryable adds a predicate to classify errors as retryable
func (ec *ErrorClassifier) AddRetryable(predicate func(error) bool)

// IsRetryable checks if an error should trigger a retry
func (ec *ErrorClassifier) IsRetryable(err error) bool

// CircuitBreakerState represents the state of a circuit breaker
type CircuitBreakerState int

const (
    CircuitClosed CircuitBreakerState = iota
    CircuitOpen
    CircuitHalfOpen
)

// CircuitBreakerConfig configures circuit breaker behavior
type CircuitBreakerConfig struct {
    FailureThreshold int
    SuccessThreshold int
    Timeout          time.Duration
}

// CircuitBreaker implements the circuit breaker pattern
type CircuitBreaker struct {
    config          CircuitBreakerConfig
    state           CircuitBreakerState
    failures        int
    successes       int
    lastFailureTime time.Time
}

// NewCircuitBreaker creates a new circuit breaker
func NewCircuitBreaker(cfg CircuitBreakerConfig) *CircuitBreaker

// State returns the current circuit breaker state
func (cb *CircuitBreaker) State() CircuitBreakerState

// RecordSuccess records a successful operation
func (cb *CircuitBreaker) RecordSuccess()

// RecordFailure records a failed operation
func (cb *CircuitBreaker) RecordFailure()

Test Cases

Your implementation should pass these test scenarios:

// Test successful retry after failures
func TestRetrySuccessAfterFailures(t *testing.T) {
    attempt := 0
    cfg := RetryConfig{
        MaxAttempts:  3,
        InitialDelay: 10 * time.Millisecond,
        MaxDelay:     1 * time.Second,
        Multiplier:   2.0,
    }

    err := Retry(context.Background(), cfg, func(ctx context.Context) error {
        attempt++
        if attempt < 3 {
            return errors.New("temporary error")
        }
        return nil
    })

    assert.NoError(t, err)
    assert.Equal(t, 3, attempt)
}

// Test context cancellation stops retries
func TestRetryContextCancellation(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    cfg := RetryConfig{
        MaxAttempts:  10,
        InitialDelay: 100 * time.Millisecond,
        MaxDelay:     1 * time.Second,
        Multiplier:   2.0,
    }

    attempt := 0
    go func() {
        time.Sleep(150 * time.Millisecond)
        cancel()
    }()

    err := Retry(ctx, cfg, func(ctx context.Context) error {
        attempt++
        return errors.New("always fails")
    })

    assert.Error(t, err)
    assert.True(t, errors.Is(err, context.Canceled))
    assert.Less(t, attempt, 10) // Should stop before max attempts
}

// Test exponential backoff timing
func TestExponentialBackoff(t *testing.T) {
    cfg := RetryConfig{
        MaxAttempts:  4,
        InitialDelay: 100 * time.Millisecond,
        MaxDelay:     10 * time.Second,
        Multiplier:   2.0,
    }

    start := time.Now()
    delays := []time.Duration{}
    calls := 0

    Retry(context.Background(), cfg, func(ctx context.Context) error {
        if calls > 0 {
            // Record the wait that preceded this attempt (the first call has none)
            delays = append(delays, time.Since(start))
        }
        calls++
        start = time.Now()
        return errors.New("fail")
    })

    // Expected delays: ~100ms, ~200ms, ~400ms
    assert.InDelta(t, 100, delays[0].Milliseconds(), 50)
    assert.InDelta(t, 200, delays[1].Milliseconds(), 50)
    assert.InDelta(t, 400, delays[2].Milliseconds(), 50)
}

// Test non-retryable errors stop immediately
func TestNonRetryableError(t *testing.T) {
    classifier := NewErrorClassifier()
    classifier.AddRetryable(func(err error) bool {
        return err.Error() != "permanent error"
    })

    cfg := RetryConfig{
        MaxAttempts:     5,
        InitialDelay:    10 * time.Millisecond,
        ErrorClassifier: classifier,
    }

    attempt := 0
    err := Retry(context.Background(), cfg, func(ctx context.Context) error {
        attempt++
        return errors.New("permanent error")
    })

    assert.Error(t, err)
    assert.Equal(t, 1, attempt) // Should not retry
}
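
The circuit breaker from requirement 5 has no test above. A possible additional test, sketched against the NewCircuitBreaker, RecordFailure, and State signatures from the Function Signatures section:

// Test circuit breaker opens after consecutive failures
func TestCircuitBreakerOpensAfterFailures(t *testing.T) {
    breaker := NewCircuitBreaker(CircuitBreakerConfig{
        FailureThreshold: 3,
        SuccessThreshold: 1,
        Timeout:          time.Minute,
    })

    assert.Equal(t, CircuitClosed, breaker.State())

    for i := 0; i < 3; i++ {
        breaker.RecordFailure()
    }

    assert.Equal(t, CircuitOpen, breaker.State())
}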

Common Pitfalls

⚠️ Watch out for these common mistakes:

  1. Thundering herd: Without jitter, all clients retry at the same time, overwhelming the service
  2. Infinite retries: Always have a maximum attempt limit to prevent infinite loops
  3. Not respecting context: Check ctx.Done() during delays, not just in the function
  4. Retrying permanent errors: Don't retry 4xx errors or validation failures
  5. Delay calculation overflow: Cap delays to prevent time.Duration overflow with large multipliers (see the sketch after this list)
  6. Missing defer in circuit breaker: Always record the result in the circuit breaker
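
For pitfall 5, one way to avoid overflow is to compare in float64 space before converting back to time.Duration. A minimal sketch, assuming the RetryConfig from this exercise (the helper name calculateCappedDelay is mine, not part of the exercise's API):

// calculateCappedDelay computes InitialDelay * Multiplier^attempt, but caps
// the value while it is still a float64 so huge exponents never overflow
// time.Duration (an int64 of nanoseconds).
func calculateCappedDelay(cfg RetryConfig, attempt int) time.Duration {
    d := float64(cfg.InitialDelay) * math.Pow(cfg.Multiplier, float64(attempt))
    if d > float64(cfg.MaxDelay) || math.IsInf(d, 1) || math.IsNaN(d) {
        return cfg.MaxDelay
    }
    return time.Duration(d)
}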

Hints

💡 Hint 1: Exponential Backoff Calculation

Calculate exponential backoff with capping:

delay := time.Duration(float64(cfg.InitialDelay) * math.Pow(cfg.Multiplier, float64(attempt)))
if delay > cfg.MaxDelay {
    delay = cfg.MaxDelay
}

💡 Hint 2: Full Jitter Implementation

Full jitter uses random value from 0 to calculated delay:

func applyFullJitter(delay time.Duration) time.Duration {
    if delay <= 0 {
        return 0
    }
    return time.Duration(rand.Int63n(int64(delay)))
}
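
The requirements also ask for decorrelated jitter, which the hints do not cover. A rough sketch of the commonly used formula sleep = min(cap, random(base, prev*3)), where prev is the previous sleep; the function name and signature here are my own framing, not part of the exercise's API:

// decorrelatedSleep picks the next sleep from [base, prev*3), capped at max.
// Unlike full jitter it depends on the previous sleep, so callers feed the
// returned value back in as prev on the next attempt.
func decorrelatedSleep(prev, base, max time.Duration) time.Duration {
    if base <= 0 {
        return 0
    }
    if prev < base {
        prev = base
    }
    next := base + time.Duration(rand.Int63n(int64(prev*3-base)))
    if next > max {
        next = max
    }
    return next
}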
💡 Hint 3: Context-Aware Sleep

Use select to respect context cancellation during delays:

select {
case <-time.After(delay):
    // Continue to next retry
case <-ctx.Done():
    return ctx.Err()
}
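
One caveat: in Go versions before 1.23, time.After keeps its underlying timer alive until it fires, even if the context is cancelled first. With long delays and frequent cancellations, an explicit timer that you stop avoids holding that memory; a minimal sketch:

timer := time.NewTimer(delay)
select {
case <-timer.C:
    // Continue to next retry
case <-ctx.Done():
    timer.Stop() // release the timer instead of waiting for it to fire
    return ctx.Err()
}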
💡 Hint 4: Error Wrapping Detection

Use errors.Is and errors.As to check wrapped errors:

func (ec *ErrorClassifier) IsRetryable(err error) bool {
    for _, predicate := range ec.predicates {
        // Walk the wrapped-error chain so predicates also see underlying errors
        for e := err; e != nil; e = errors.Unwrap(e) {
            if predicate(e) {
                return true
            }
        }
    }
    return false
}

Solution

package retry

import (
    "context"
    "errors"
    "math"
    "math/rand"
    "sync"
    "time"
)

// RetryConfig configures retry behavior
type RetryConfig struct {
    MaxAttempts     int
    InitialDelay    time.Duration
    MaxDelay        time.Duration
    Multiplier      float64
    Jitter          JitterStrategy
    ErrorClassifier *ErrorClassifier
    CircuitBreaker  *CircuitBreaker
    OnRetry         func(attempt int, err error, delay time.Duration)
    OnSuccess       func(attempt int)
    OnFailure       func(err error)
}

// JitterStrategy defines how jitter is applied
type JitterStrategy int

const (
    NoJitter JitterStrategy = iota
    FullJitter
    EqualJitter
    DecorrelatedJitter
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

// Retry executes fn with exponential backoff retry logic
func Retry(ctx context.Context, cfg RetryConfig, fn func(context.Context) error) error {
    var lastErr error

    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        // Check circuit breaker
        if cfg.CircuitBreaker != nil && cfg.CircuitBreaker.State() == CircuitOpen {
            return ErrCircuitOpen
        }

        // Execute function
        err := fn(ctx)

        // Success
        if err == nil {
            if cfg.CircuitBreaker != nil {
                cfg.CircuitBreaker.RecordSuccess()
            }
            if cfg.OnSuccess != nil {
                cfg.OnSuccess(attempt + 1)
            }
            return nil
        }

        lastErr = err

        // Record failure in circuit breaker
        if cfg.CircuitBreaker != nil {
            cfg.CircuitBreaker.RecordFailure()
        }

        // Check if error is retryable
        if cfg.ErrorClassifier != nil && !cfg.ErrorClassifier.IsRetryable(err) {
            if cfg.OnFailure != nil {
                cfg.OnFailure(err)
            }
            return err
        }

        // Don't sleep after last attempt
        if attempt == cfg.MaxAttempts-1 {
            break
        }

        // Calculate delay with exponential backoff
        delay := calculateDelay(cfg, attempt)

        // Call OnRetry hook
        if cfg.OnRetry != nil {
            cfg.OnRetry(attempt+1, err, delay)
        }

        // Wait with context awareness
        select {
        case <-time.After(delay):
            // Continue to next attempt
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    if cfg.OnFailure != nil {
        cfg.OnFailure(lastErr)
    }

    return lastErr
}

func calculateDelay(cfg RetryConfig, attempt int) time.Duration {
    // Calculate exponential backoff
    delay := time.Duration(float64(cfg.InitialDelay) * math.Pow(cfg.Multiplier, float64(attempt)))

    // Cap at max delay
    if delay > cfg.MaxDelay {
        delay = cfg.MaxDelay
    }

    // Apply jitter
    switch cfg.Jitter {
    case FullJitter:
        delay = applyFullJitter(delay)
    case EqualJitter:
        delay = applyEqualJitter(delay)
    case DecorrelatedJitter:
        delay = applyDecorrelatedJitter(delay, cfg.InitialDelay)
    }

    return delay
}

func applyFullJitter(delay time.Duration) time.Duration {
    if delay <= 0 {
        return 0
    }
    return time.Duration(rand.Int63n(int64(delay)))
}

func applyEqualJitter(delay time.Duration) time.Duration {
    if delay <= 0 {
        return 0
    }
    half := delay / 2
    jitter := time.Duration(rand.Int63n(int64(half)))
    return half + jitter
}

func applyDecorrelatedJitter(delay, initial time.Duration) time.Duration {
    if delay <= 0 {
        return initial
    }
    return time.Duration(rand.Int63n(int64(delay*3))) + initial
}

// ErrorClassifier determines if errors are retryable
type ErrorClassifier struct {
    predicates []func(error) bool
}

// NewErrorClassifier creates a new error classifier
func NewErrorClassifier() *ErrorClassifier {
    return &ErrorClassifier{
        predicates: make([]func(error) bool, 0),
    }
}

// AddRetryable adds a predicate to classify errors as retryable
func (ec *ErrorClassifier) AddRetryable(predicate func(error) bool) {
    ec.predicates = append(ec.predicates, predicate)
}

// IsRetryable checks if an error should trigger a retry
func (ec *ErrorClassifier) IsRetryable(err error) bool {
    if err == nil {
        return false
    }

    for _, predicate := range ec.predicates {
        if predicate(err) {
            return true
        }
    }

    return false
}

// CircuitBreakerState represents the state of a circuit breaker
type CircuitBreakerState int

const (
    CircuitClosed CircuitBreakerState = iota
    CircuitOpen
    CircuitHalfOpen
)

// CircuitBreakerConfig configures circuit breaker behavior
type CircuitBreakerConfig struct {
    FailureThreshold int
    SuccessThreshold int
    Timeout          time.Duration
}

// CircuitBreaker implements the circuit breaker pattern
type CircuitBreaker struct {
    mu              sync.Mutex
    config          CircuitBreakerConfig
    state           CircuitBreakerState
    failures        int
    successes       int
    lastFailureTime time.Time
}

// NewCircuitBreaker creates a new circuit breaker
func NewCircuitBreaker(cfg CircuitBreakerConfig) *CircuitBreaker {
    return &CircuitBreaker{
        config: cfg,
        state:  CircuitClosed,
    }
}

// State returns the current circuit breaker state
func (cb *CircuitBreaker) State() CircuitBreakerState {
    // A full lock (not a read lock) is required because this method can
    // transition the breaker from open to half-open after the timeout.
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if cb.state == CircuitOpen {
        if time.Since(cb.lastFailureTime) > cb.config.Timeout {
            cb.state = CircuitHalfOpen
            cb.successes = 0
        }
    }

    return cb.state
}

// RecordSuccess records a successful operation
func (cb *CircuitBreaker) RecordSuccess() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    cb.failures = 0

    if cb.state == CircuitHalfOpen {
        cb.successes++
        if cb.successes >= cb.config.SuccessThreshold {
            cb.state = CircuitClosed
            cb.successes = 0
        }
    }
}

// RecordFailure records a failed operation
func (cb *CircuitBreaker) RecordFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    cb.failures++
    cb.lastFailureTime = time.Now()

    if cb.state == CircuitHalfOpen {
        cb.state = CircuitOpen
        cb.successes = 0
    } else if cb.failures >= cb.config.FailureThreshold {
        cb.state = CircuitOpen
    }
}

// Common retryable error predicates

// IsNetworkError checks if an error looks like a transient network error
func IsNetworkError(err error) bool {
    if err == nil {
        return false
    }
    // In a real implementation, check for specific network error types
    // (for example net.Error); deadline errors are used here as a stand-in
    return errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, context.Canceled)
}

// IsTemporaryError checks if an error reports itself as temporary
func IsTemporaryError(err error) bool {
    type temporary interface {
        Temporary() bool
    }

    // errors.As also matches errors wrapped with %w
    var te temporary
    return errors.As(err, &te) && te.Temporary()
}

Key Takeaways

  • Exponential backoff prevents overwhelming failing services during recovery
  • Jitter prevents thundering herd problems when multiple clients retry simultaneously
  • Context cancellation is critical for graceful shutdown and timeout handling
  • Error classification ensures you don't retry permanent failures
  • Circuit breakers prevent cascade failures and give systems time to recover
  • Callback hooks enable observability and debugging of retry behavior
  • Maximum delay caps prevent excessive waiting times