Compiler Optimizations in Go

Why Compiler Optimizations Matter

The Go compiler is a performance multiplier: understanding what it can optimize is often the difference between code that runs at half speed and code that runs at full speed, with zero extra effort.

Real-World Impact:

Inlining - often a 2-5x speedup for tiny hot functions:

// Small function: compiler inlines automatically
func add(a, b int) int { return a + b }

// Hot path: inlined = no function call overhead
sum += add(x, y)  // Becomes: sum += x + y

Bounds Check Elimination - up to 20-30% speedup in tight loops:

// ❌ Bounds check every iteration
for i := 0; i < len(data); i++ {
    process(data[i])  // Compiler checks: i < len(data)?
}

// ✅ Range loop: compiler proves the index is in bounds
for i := range data {
    process(data[i])  // No bounds check!
}

Compiler Optimization Impact:

Optimization Speedups:
├─ Function inlining:     2-5x faster
├─ Bounds check elim:     1.2-1.3x faster
├─ Escape analysis:       2-10x faster
├─ Dead code elim:        Binary 30% smaller
└─ Constant folding:      Compile-time math = instant
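These speedups are workload-dependent, so it pays to measure them yourself. The sketch below (helper names are illustrative) uses `//go:noinline` as a control to compare an inlinable function against a forced real call:

```go
package main

import (
	"fmt"
	"testing"
)

// addInline is small enough for the compiler to inline automatically.
func addInline(a, b int) int { return a + b }

// addNoInline is the control: the directive forces a real function call.
//
//go:noinline
func addNoInline(a, b int) int { return a + b }

func main() {
	inlined := testing.Benchmark(func(b *testing.B) {
		sum := 0
		for i := 0; i < b.N; i++ {
			sum += addInline(i, i)
		}
		_ = sum
	})
	called := testing.Benchmark(func(b *testing.B) {
		sum := 0
		for i := 0; i < b.N; i++ {
			sum += addNoInline(i, i)
		}
		_ = sum
	})
	fmt.Println("inlined:", inlined.NsPerOp(), "ns/op, call:", called.NsPerOp(), "ns/op")
}
```

The exact ratio varies by CPU and Go version; the point is that `testing.Benchmark` lets you verify inlining claims without writing a separate `_test.go` file.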

The Go compiler performs numerous optimizations to generate efficient machine code. Understanding these optimizations helps you write code that the compiler can optimize effectively, leading to better performance without sacrificing readability—often achieving 2-10x speedups with simple patterns.

In this tutorial, we'll explore:

  • Compiler architecture: How the Go compiler works
  • Inlining: Automatic function call elimination
  • Compiler directives: Controlling optimization behavior
  • Bounds check elimination: Removing array bounds checks
  • Dead code elimination: Removing unused code
  • Escape analysis: Optimizing memory allocation
  • Build flags: Controlling optimization levels
  • Assembly analysis: Understanding generated code

Why Compiler Optimizations Matter:

  • Performance: Optimized code runs faster
  • Efficiency: Better use of CPU resources
  • Predictability: Understanding optimizations helps predict performance
  • Code quality: Write compiler-friendly code without sacrificing clarity

Go Compiler Architecture

Compilation Pipeline

// Go compilation pipeline:
// 1. Parsing: Source code → Abstract Syntax Tree
// 2. Type checking: Verify types and semantics
// 3. SSA generation: AST → Static Single Assignment form
// 4. Optimization: SSA transformations
// 5. Code generation: SSA → machine code
// 6. Linking: Combine compiled packages

package main

import "fmt"

func main() {
    // View compilation phases:
    // go build -x main.go

    // View SSA passes:
    // GOSSAFUNC=main go build main.go
    // Writes ssa.html showing each SSA optimization pass

    fmt.Println("Understanding compiler phases")
}

Compiler Flags Overview

# View all compiler flags
go tool compile -h

# Common optimization flags:
go build -gcflags="-m"        # Show optimization decisions
go build -gcflags="-m -m"     # Verbose optimization details
go build -gcflags="-S"        # Show assembly output
go build -ldflags="-s -w"     # Strip debug info, reduce binary size

# Disable optimizations:
go build -gcflags="-N -l"     # -N = no optimizations, -l = no inlining

Inlining

Inlining replaces function calls with the function body, eliminating call overhead.

What's Happening: When the compiler inlines a function, it copies the function's code directly into the caller instead of generating a CALL instruction. This eliminates the overhead of: setting up the call stack, jumping to function address, returning from function. For small, frequently-called functions, this can provide 2-5x speedup. The compiler uses a cost-based heuristic to decide what to inline.

Basic Inlining

package main

import "fmt"

// Small functions are automatically inlined
func add(a, b int) int {
    return a + b
}

// Inlined in optimized builds
func square(x int) int {
    return x * x
}

func main() {
    result := add(3, square(4))
    fmt.Println(result)
}

// View inlining decisions:
// go build -gcflags="-m" main.go
// Output resembles:
// ./main.go:6:6: can inline add
// ./main.go:11:6: can inline square
// ./main.go:15:6: can inline main

Inline Budget

The compiler uses a "cost budget" to decide what to inline:

package main

// Cost: ~5
func cheap(x int) int {
    return x * 2
}

// Cost: ~40
func moderate(x, y int) int {
    if x > y {
        return x
    }
    return y
}

// Cost: >80 (over the default inline budget)
func expensive(arr []int) int {
    sum := 0
    for _, v := range arr {
        if v > 0 {
            sum += v * v
        } else {
            sum += v
        }
    }
    return sum
}

// View inlining costs:
// go build -gcflags="-m -m" main.go

Preventing Inlining

package main

//go:noinline
func noInline(x int) int {
    return x * 2
}

// Note: there is no //go:inline directive to force inlining.
// The compiler decides based on its cost heuristic; you can only
// prevent inlining, or keep functions small so they qualify.
func smallEnough(x int) int {
    return x * 2
}

// Mid-stack inlining: inline calls within inlined functions
func outer(x int) int {
    return inner(x) + inner(x+1) // inner() may be inlined here
}

func inner(x int) int {
    return x * x
}

func main() {
    _ = noInline(5)
    _ = smallEnough(5)
    _ = outer(5)
}

Inlining Strategies

package main

// Good: Small, simple functions inline well
func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

// Good: Early returns help keep the cost low
func isPositive(x int) bool {
    if x <= 0 {
        return false
    }
    return true
}

// Bad: Loops usually prevent inlining
func sum(arr []int) int {
    total := 0
    for _, v := range arr {
        total += v
    }
    return total // Won't inline due to loop
}

// Costly: panic raises inlining cost (and prevented inlining in older Go versions)
func divide(a, b int) int {
    if b == 0 {
        panic("division by zero")
    }
    return a / b
}

// Optimization: Split function to keep the hot path cheap
func divideOptimized(a, b int) int {
    if b == 0 {
        divideByZeroPanic() // Separate panic into a cold path
    }
    return a / b // This part can inline
}

//go:noinline
func divideByZeroPanic() {
    panic("division by zero")
}

Bounds Check Elimination

The compiler eliminates array bounds checks when it can prove safety.

Why This Matters: Every time you access arr[i], Go inserts a runtime check: "is i < len(arr)?". If not, it panics. While safe, this check costs ~1-3 CPU cycles per access. When the compiler can mathematically prove an index is always valid, it eliminates the check entirely—providing 20-30% speedup in tight loops.

Automatic BCE

package main

import "fmt"

func sumArray(arr []int) int {
    sum := 0

    // Bounds check on each iteration
    for i := 0; i < len(arr); i++ {
        sum += arr[i] // Bounds check here
    }

    return sum
}

func sumArrayOptimized(arr []int) int {
    sum := 0

    // No bounds check: range guarantees safety
    for _, v := range arr {
        sum += v // No bounds check needed
    }

    return sum
}

func main() {
    arr := []int{1, 2, 3, 4, 5}
    fmt.Println(sumArray(arr))
    fmt.Println(sumArrayOptimized(arr))
}

// View bounds checks:
// go build -gcflags="-d=ssa/check_bce/debug=1" main.go

Manual BCE Optimization

package main

// Bounds check on every access
func processSlow(arr []int) {
    for i := 0; i < len(arr); i++ {
        _ = arr[i] // Bounds check
    }
}

// Compiler eliminates bounds check after first access
func processFast(arr []int) {
    if len(arr) == 0 {
        return
    }

    // First access with bounds check
    _ = arr[0]

    // Subsequent accesses within proven bounds
    for i := 1; i < len(arr); i++ {
        _ = arr[i] // No bounds check after first
    }
}

// Use range to avoid bounds checks entirely
func processFastest(arr []int) {
    for i := range arr {
        _ = arr[i] // No bounds check
    }
}

// BCE with slice tricks
func copyOptimized(dst, src []int) {
    // Explicit length check
    if len(dst) < len(src) {
        panic("dst too small")
    }

    // Compiler knows dst is large enough
    for i := range src {
        dst[i] = src[i] // No bounds check on dst
    }
}

Proving Bounds Safety

What's Happening: The compiler tracks value ranges through the program. When you explicitly check i < len(arr), the compiler records "in the true branch, i is definitely < len(arr)". All subsequent accesses to arr[i] within that scope skip the bounds check because safety is already proven. This is called "flow-sensitive analysis".

package main

// Bad: Compiler can't prove safety
func accessBad(arr []int, i, j int) int {
    return arr[i] + arr[j] // Two bounds checks
}

// Good: Prove bounds before access
func accessGood(arr []int, i, j int) int {
    if i >= len(arr) || j >= len(arr) {
        panic("index out of bounds")
    }

    // Compiler knows i and j are safe
    return arr[i] + arr[j] // No bounds checks
}

// Better: Use max to prove bounds
func accessBetter(arr []int, i, j int) int {
    maxIdx := max(i, j)
    if maxIdx >= len(arr) {
        panic("index out of bounds")
    }

    // Single comparison proves both indices safe
    return arr[i] + arr[j] // No bounds checks
}

func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}

Compiler Directives

Compiler directives control optimization behavior.

Common Directives

package main

//go:noinline
func noInlineFunc() {
    // Prevent inlining for benchmarking or debugging
}

//go:nosplit
func noSplitFunc() {
    // Prevent the stack-growth check
    // Use with caution - can cause stack overflow
}

// //go:noescape is only valid on function declarations without bodies
// (typically implemented in assembly); it tells the compiler the
// pointer arguments do not escape. Example signature:
//
//	//go:noescape
//	func memclr(p unsafe.Pointer, n uintptr)

// //go:linkname localName importPath.name
// Link to an unexported function in another package
// (requires `import _ "unsafe"`); used for runtime/syscall interfaces

// //go:build linux && amd64
// Build constraint (must appear before the package clause)

// //go:generate command
// Run command during go generate

func main() {
    noInlineFunc()
    noSplitFunc()
}

Practical Directive Usage

package main

import (
    "fmt"
    "time"
)

// Example: Preventing inlining for stable benchmarks
//go:noinline
func computeExpensive(n int) int {
    result := 0
    for i := 0; i < n; i++ {
        result += i * i
    }
    return result
}

// Example: Force escape for testing
//go:noinline
func allocateEscape() *int {
    x := 42
    return &x // Force heap allocation
}

// Example: Generate test data
//go:generate go run generate_data.go

// Example: Platform-specific implementations live in separate files,
// selected by build constraints at the top of each file:
//
//	// fastpath.go
//	//go:build !race
//	func fastPath() { fmt.Println("Fast path") }
//
//	// fastpath_race.go
//	//go:build race
//	func fastPath() { fmt.Println("Race-safe path") }

func fastPath() {
    fmt.Println("Fast path")
}

func main() {
    start := time.Now()
    result := computeExpensive(1000000)
    fmt.Printf("Result: %d, Time: %v\n", result, time.Since(start))

    ptr := allocateEscape()
    fmt.Printf("Escaped value: %d\n", *ptr)

    fastPath()
}

Dead Code Elimination

The compiler removes unreachable or unused code.

What's Happening: During compilation, the compiler builds a control flow graph showing all possible execution paths. Code that's provably unreachable is completely removed from the binary. Similarly, unused variables and computations are eliminated. This happens after inlining, so constants folded from inlined functions can reveal more dead code.

Basic DCE

package main

import "fmt"

func example() {
    x := 10 // Used
    y := 20 // Unused except for the blank assignment below - eliminated

    if false {
        // Dead code - eliminated
        fmt.Println("Never executed")
    }

    if true {
        fmt.Println("Always executed")
    } else {
        // Dead code - eliminated
        fmt.Println("Never executed")
    }

    _ = x // Use x
    _ = y // Satisfies the "declared and not used" check
}

// View what's eliminated:
// go build -gcflags="-m" main.go

Constant Folding

Why This Works: When the compiler sees operations on constants or compile-time-known values, it performs the calculation during compilation rather than at runtime. const c = 10 + 20 doesn't generate addition code—the compiler just embeds 30 directly in the binary. This extends to complex expressions, saving runtime CPU cycles completely.

package main

import "fmt"

func constantFolding() {
    // Computed at compile time
    const a = 10
    const b = 20
    const c = a + b // c = 30 at compile time

    x := 5 + 3      // Folded to 8
    y := x * 2      // May be folded to 16
    z := y << 1     // May be folded to 32

    fmt.Println(c, x, y, z)
}

func conditionalElimination() {
    const debug = false

    if debug {
        // Eliminated at compile time
        fmt.Println("Debug info")
    }

    // Compiler knows this is always true
    if !debug {
        fmt.Println("Production")
    }
}

Optimizing with DCE

package main

import "fmt"

// Bad: Keeping unnecessary variables
func processDataBad(data []int) int {
    sum := 0
    count := 0
    avg := 0 // Assigned below but never read

    for _, v := range data {
        sum += v
        count++
        avg = sum / count // Computed but never used
    }
    _ = avg // Needed only to satisfy the unused-variable check

    return sum
}

// Good: Remove unused computations
func processDataGood(data []int) int {
    sum := 0
    for _, v := range data {
        sum += v
    }
    return sum
}

// Feature flags for DCE
const (
    FeatureA = true
    FeatureB = false
)

func handleRequest() {
    if FeatureA {
        fmt.Println("Feature A enabled")
    }

    if FeatureB {
        // Dead code since FeatureB = false
        fmt.Println("Feature B enabled")
    }
}

Loop Optimizations

The compiler applies various loop optimizations.

Loop Unrolling

package main

// Compiler may unroll small, fixed-iteration loops
func unrolledLoop() {
    arr := [4]int{1, 2, 3, 4}

    // May be unrolled to:
    // sum = arr[0] + arr[1] + arr[2] + arr[3]
    sum := 0
    for i := 0; i < 4; i++ {
        sum += arr[i]
    }

    _ = sum
}

// Explicit unrolling for better performance
func manualUnroll(arr []int) int {
    n := len(arr)
    sum := 0

    // Process 4 elements at a time
    i := 0
    for i+3 < n {
        sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3]
        i += 4
    }

    // Handle remaining elements
    for ; i < n; i++ {
        sum += arr[i]
    }

    return sum
}
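Manual unrolling is easy to get wrong at the boundaries, so it is worth verifying against a reference loop. A quick sanity check (function names are illustrative) on a slice whose length is not a multiple of four exercises the tail loop:

```go
package main

import "fmt"

// naiveSum is the straightforward reference implementation.
func naiveSum(arr []int) int {
	sum := 0
	for _, v := range arr {
		sum += v
	}
	return sum
}

// unrolledSum processes four elements per iteration, then a tail loop.
func unrolledSum(arr []int) int {
	sum, i, n := 0, 0, len(arr)
	for i+3 < n {
		sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3]
		i += 4
	}
	for ; i < n; i++ { // remaining 0-3 elements
		sum += arr[i]
	}
	return sum
}

func main() {
	data := []int{1, 2, 3, 4, 5, 6, 7} // length 7 exercises the tail loop
	fmt.Println(naiveSum(data), unrolledSum(data)) // both print 28
}
```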

Loop Invariant Code Motion

What's Happening: Loop-invariant code motion identifies expressions that produce the same result on every iteration and hoists them outside the loop. Instead of computing len(arr)/2 1000 times in a 1000-iteration loop, it's computed once before the loop. Modern Go compilers do this automatically for simple cases, but explicit hoisting ensures optimization and improves code clarity.

package main

// Bad: Recomputing the invariant inside the loop
func computeBad(arr []int, multiplier int) {
    for i := range arr {
        // len(arr)/2 recomputed on each iteration
        // (unless the compiler hoists it)
        if i < len(arr)/2 {
            arr[i] *= multiplier
        }
    }
}

// Good: Hoist the invariant out of the loop
func computeGood(arr []int, multiplier int) {
    half := len(arr) / 2 // Computed once

    for i := range arr {
        if i < half {
            arr[i] *= multiplier
        }
    }
}

// The compiler may do this automatically, but explicit hoisting is clearer

Loop Fusion

package main

// Bad: Multiple passes over data
func processMultiPass(arr []int) {
    // First pass: multiply by 2
    for i := range arr {
        arr[i] *= 2
    }

    // Second pass: add 10
    for i := range arr {
        arr[i] += 10
    }
}

// Good: Single pass
func processSinglePass(arr []int) {
    for i := range arr {
        arr[i] = arr[i]*2 + 10
    }
}

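Fusing passes must preserve the result, and order matters: multiply-then-add is not the same as add-then-multiply. A minimal check (helper names are illustrative) confirms the fused form matches the two-pass form:

```go
package main

import "fmt"

// twoPass applies the transforms in separate loops (two traversals).
func twoPass(arr []int) {
	for i := range arr {
		arr[i] *= 2
	}
	for i := range arr {
		arr[i] += 10
	}
}

// onePass fuses both transforms into a single traversal,
// keeping the multiply-before-add ordering.
func onePass(arr []int) {
	for i := range arr {
		arr[i] = arr[i]*2 + 10
	}
}

func main() {
	a := []int{1, 2, 3}
	b := []int{1, 2, 3}
	twoPass(a)
	onePass(b)
	fmt.Println(a, b) // both [12 14 16]
}
```

Beyond saving loop overhead, the single pass touches each cache line once instead of twice, which is often the bigger win on large slices.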
Build Flags and Optimization Levels

Common Build Flags

# Production build
go build

# Debug build
go build -gcflags="-N -l"

# Optimized build
go build -ldflags="-s -w"
# -s: Strip symbol table
# -w: Strip DWARF debugging info

# Profile-guided optimization - Go 1.21+
go build -pgo=default.pgo

# Show optimization decisions
go build -gcflags="-m -m"

# Generate assembly
go build -gcflags="-S" > assembly.txt

# Cross-compilation
GOOS=linux GOARCH=amd64 go build

Build Tags for Optimization

// fast.go
//go:build !debug

package myapp

func Process() {
    // Fast implementation
    fastAlgorithm()
}

// slow.go
//go:build debug

package myapp

func Process() {
    // Slower but with extra checks
    slowAlgorithmWithChecks()
}

// Build commands:
// go build                    # Uses fast.go
// go build -tags debug        # Uses slow.go

Reducing Binary Size

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// Use -ldflags to reduce binary size:
// go build -ldflags="-s -w" -o app

// Bad: pulls in the large regexp package
func matchBad(pattern, text string) bool {
    re := regexp.MustCompile(pattern)
    return re.MatchString(text)
}

// Good: use the strings package for simple cases
// (dropping the regexp import entirely shrinks the binary)
func matchGood(pattern, text string) bool {
    return strings.Contains(text, pattern)
}

func main() {
    fmt.Println("Optimized binary")
}

Analyzing Generated Assembly

Viewing Assembly Output

package main

func add(a, b int) int {
    return a + b
}

func main() {
    result := add(3, 4)
    _ = result
}

// Generate assembly:
// go tool compile -S main.go > main.s
// Or: go build -gcflags="-S" main.go

// Assembly for add() on AMD64:
// TEXT "".add(SB), NOSPLIT, $0-24
//     MOVQ    a+0(FP), AX
//     ADDQ    b+8(FP), AX
//     MOVQ    AX, ret+16(FP)
//     RET

Understanding Assembly Patterns

package main

// Example 1: Simple arithmetic
func multiply(a, b int) int {
    return a * b
}
// Assembly: IMULQ instruction

// Example 2: Bitwise operations
func double(x int) int {
    return x << 1 // Same as x * 2
}
// Assembly: SHLQ instruction

// Example 3: Bounds check
func accessArray(arr []int, i int) int {
    return arr[i]
}
// Assembly: CMPQ (bounds check), followed by MOVQ

// Example 4: Inlined function
func square(x int) int {
    return x * x
}

func useSquare(x int) int {
    return square(x) + 1
}
// Assembly for useSquare: no CALL to square

Benchmark with Assembly Analysis

package main

import "testing"

func addSlow(a, b int) int {
    result := 0
    for i := 0; i < a; i++ {
        result++
    }
    for i := 0; i < b; i++ {
        result++
    }
    return result
}

func addFast(a, b int) int {
    return a + b
}

func BenchmarkAddSlow(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = addSlow(10, 20)
    }
}

func BenchmarkAddFast(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = addFast(10, 20)
    }
}

// Run: go test -bench=. -benchmem
// Compare assembly: go test -gcflags="-S" -bench=Add

Best Practices for Optimizable Code

Write Compiler-Friendly Code

package main

// Good: Small, simple functions inline well
func abs(x int) int {
    if x < 0 {
        return -x
    }
    return x
}

// Good: Use range for slice iteration
func sumSlice(arr []int) int {
    sum := 0
    for _, v := range arr {
        sum += v
    }
    return sum
}

// Good: Prove bounds to eliminate checks
func copySlice(dst, src []int) {
    if len(dst) < len(src) {
        panic("dst too small")
    }
    for i := range src {
        dst[i] = src[i] // No bounds checks
    }
}

// Good: Use constants for dead code elimination
const DebugMode = false

func log(msg string) {
    if DebugMode {
        println(msg) // Eliminated if DebugMode = false
    }
}

// Good: Avoid unnecessary allocations
func processInPlace(data []byte) {
    for i := range data {
        data[i] = data[i] * 2 // No allocation
    }
}

Profile-Guided Optimization Quick Start

package main

import (
    "fmt"
    "os"
    "runtime/pprof"
)

func main() {
    // Step 1: Generate profile (error handling elided)
    f, _ := os.Create("cpu.prof")
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    // Run your application...
    workload()
}

func workload() {
    // Representative workload
    for i := 0; i < 1000000; i++ {
        _ = expensiveFunction(i)
    }
}

func expensiveFunction(n int) int {
    sum := 0
    for i := 0; i < n%100; i++ {
        sum += i * i
    }
    fmt.Sprint(sum) // keep fmt import used
    return sum
}

// Step 2: Build with PGO
// mv cpu.prof default.pgo
// go build -pgo=default.pgo

// The compiler uses profile data to optimize hot paths

Profile-Guided Optimization

Profile-Guided Optimization uses runtime profiles to optimize frequently executed code paths.

Why This Works: Traditional optimization is "blind"—the compiler guesses which code paths are hot. PGO uses actual runtime data to guide decisions. When you provide a CPU profile, the compiler knows which functions are called most frequently and optimizes those more aggressively. This can provide 5-20% speedup for real-world workloads.

Complete PGO Workflow

// main.go - Example application for PGO
package main

import (
    "flag"
    "fmt"
    "log"
    "os"
    "runtime/pprof"
    "time"
)

// Hot path function
func processRequest(data []byte) int {
    sum := 0
    for _, b := range data {
        sum += int(b)
        if sum%2 == 0 {
            sum = expensiveComputation(sum)
        }
    }
    return sum
}

// Expensive function that benefits from PGO
func expensiveComputation(n int) int {
    result := n
    for i := 0; i < 100; i++ {
        result = (result * result) % 1000000007
        if result%3 == 0 {
            result += fibonacci(10)
        }
    }
    return result
}

func fibonacci(n int) int {
    if n <= 1 {
        return n
    }
    return fibonacci(n-1) + fibonacci(n-2)
}

func main() {
    profile := flag.Bool("profile", false, "enable CPU profiling")
    flag.Parse()

    // Start CPU profiling if requested
    if *profile {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }

    // Simulate workload
    start := time.Now()

    for i := 0; i < 10000; i++ {
        data := make([]byte, 1000)
        for j := range data {
            data[j] = byte(j % 256)
        }
        _ = processRequest(data)
    }

    elapsed := time.Since(start)
    fmt.Printf("Completed in %v\n", elapsed)
}

/*
PGO Workflow:

1. Generate profile:
   go run main.go -profile
   # Creates cpu.prof

2. Rename profile to default location:
   mv cpu.prof default.pgo

3. Build with PGO:
   go build -pgo=default.pgo -o app_pgo

4. Build without PGO for comparison:
   go build -pgo=off -o app_no_pgo

5. Compare performance:
   time ./app_pgo
   time ./app_no_pgo

Expected improvements: 5-20% performance gain in hot paths
*/

Analyzing PGO Impact

#!/bin/bash
# pgo-analysis.sh - Script to analyze PGO impact

echo "=== Profile-Guided Optimization Analysis ==="
echo

# Step 1: Build baseline
echo "Building baseline..."
go build -pgo=off -o app_baseline
baseline_size=$(stat -f%z app_baseline 2>/dev/null || stat -c%s app_baseline)
echo "Baseline binary size: $baseline_size bytes"

# Step 2: Generate profile
echo
echo "Generating profile..."
./app_baseline -profile > /dev/null 2>&1
mv cpu.prof default.pgo

# Step 3: Build with PGO
echo
echo "Building with PGO..."
go build -pgo=default.pgo -o app_pgo
pgo_size=$(stat -f%z app_pgo 2>/dev/null || stat -c%s app_pgo)
echo "PGO binary size: $pgo_size bytes"

# Step 4: Benchmark
echo
echo "Benchmarking baseline..."
/usr/bin/time -l ./app_baseline 2>&1 | grep "user\|real"

echo
echo "Benchmarking PGO..."
/usr/bin/time -l ./app_pgo 2>&1 | grep "user\|real"

echo
echo "=== Analysis Complete ==="

Advanced PGO with Multiple Profiles

// merge-profiles.go - Merge multiple CPU profiles
package main

import (
    "fmt"
    "log"
    "os"

    "github.com/google/pprof/profile"
)

func mergeProfiles(files []string, output string) error {
    var merged *profile.Profile

    for i, file := range files {
        f, err := os.Open(file)
        if err != nil {
            return fmt.Errorf("open %s: %w", file, err)
        }

        p, err := profile.Parse(f)
        f.Close()
        if err != nil {
            return fmt.Errorf("parse %s: %w", file, err)
        }

        if i == 0 {
            merged = p
        } else {
            if err := merged.Merge(p, 1); err != nil {
                return fmt.Errorf("merge %s: %w", file, err)
            }
        }
    }

    // Write merged profile
    out, err := os.Create(output)
    if err != nil {
        return err
    }
    defer out.Close()

    return merged.Write(out)
}

func main() {
    profiles := []string{
        "profile1.prof",
        "profile2.prof",
        "profile3.prof",
    }

    if err := mergeProfiles(profiles, "merged.pgo"); err != nil {
        log.Fatal(err)
    }

    fmt.Println("Merged profiles into merged.pgo")
}

Assembly Analysis

Understanding generated assembly helps identify optimization opportunities.

Generating and Analyzing Assembly

// assembly-example.go
package main

// Simple function for assembly analysis
func Add(a, b int) int {
    return a + b
}

// Function with a bounds check
func SumArray(arr []int) int {
    sum := 0
    for i := 0; i < len(arr); i++ {
        sum += arr[i] // Bounds check here
    }
    return sum
}

// Optimized version
func SumArrayOptimized(arr []int) int {
    sum := 0
    for _, v := range arr { // Range eliminates the bounds check
        sum += v
    }
    return sum
}

// Kept out-of-line so its call shows up in the assembly
//go:noinline
func NotInlined(x int) int {
    return x * 2
}

Assembly Generation Commands

#!/bin/bash
# generate-assembly.sh

# Generate assembly for specific function
go tool compile -S assembly-example.go > assembly.s

# Generate with optimization levels
go tool compile -S -N -l assembly-example.go > assembly-no-opt.s  # No optimization
go tool compile -S assembly-example.go > assembly-opt.s           # With optimization

# Disassemble compiled binary
go build -o app assembly-example.go
go tool objdump -S app > disassembly.txt

# Generate assembly with symbols
go build -gcflags="-S" assembly-example.go 2> assembly-with-symbols.s

Reading Assembly Output

# Assembly output for Add function:
"".Add STEXT nosplit size=10 args=0x18 locals=0x0
        TEXT    "".Add(SB), NOSPLIT|ABIInternal, $0-24
        FUNCDATA        $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        FUNCDATA        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        MOVQ    "".b+16(SP), AX    # Load b into AX register
        ADDQ    "".a+8(SP), AX     # Add a to AX
        MOVQ    AX, "".~r2+24(SP)  # Store result
        RET

# Analysis:
# - NOSPLIT: Function doesn't need stack growth check
# - 3 instructions only
# - Direct register operations

Assembly Optimization Patterns

package main

import "testing"

// Example 1: Loop for unrolling analysis
func ProcessLoop(data []int) int {
    sum := 0
    for i := 0; i < len(data); i++ {
        sum += data[i]
    }
    return sum
}

// Example 2: Manually unrolled, SIMD-friendly layout
// (note: gc does little auto-vectorization; the unrolled form
// mainly reduces loop overhead)
func ProcessSIMD(data []float64) float64 {
    sum := 0.0
    i := 0
    for ; i+3 < len(data); i += 4 {
        sum += data[i] + data[i+1] + data[i+2] + data[i+3]
    }
    for ; i < len(data); i++ { // handle remaining elements
        sum += data[i]
    }
    return sum
}

// Benchmark to verify assembly optimizations
func BenchmarkProcessLoop(b *testing.B) {
    data := make([]int, 1000)
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        ProcessLoop(data)
    }
}

Inline Analysis Tool

// inline-analyzer.go - Rough analysis of inlining candidates
package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "log"
)

func analyzeFunctionSize(file string) {
    fset := token.NewFileSet()
    node, err := parser.ParseFile(fset, file, nil, 0)
    if err != nil {
        log.Fatal(err)
    }

    ast.Inspect(node, func(n ast.Node) bool {
        if fn, ok := n.(*ast.FuncDecl); ok {
            if fn.Body != nil {
                lines := fset.Position(fn.Body.End()).Line - fset.Position(fn.Body.Pos()).Line
                fmt.Printf("Function %s: %d lines", fn.Name.Name, lines)

                // Crude heuristic: the real budget is ~80 AST nodes, not lines
                if lines < 40 {
                    fmt.Println(" [possibly inlined]")
                } else {
                    fmt.Println(" [likely too large for inlining]")
                }
            }
        }
        return true
    })
}

func main() {
    analyzeFunctionSize("assembly-example.go")
}

Compiler Intrinsics

The Go compiler recognizes certain patterns and replaces them with optimized assembly:

What's Happening: Compiler intrinsics are special functions that the compiler recognizes and replaces with direct CPU instructions. Instead of compiling bits.OnesCount64() as function call + loop logic, the compiler emits a single POPCNT instruction. This transforms ~50 instructions into 1 instruction—a dramatic optimization for common bit manipulation operations.

package main

import (
    "math/bits"
)

// Compiler intrinsics - optimized to a single CPU instruction

func PopCount(x uint64) int {
    return bits.OnesCount64(x) // Optimized to POPCNT instruction
}

func LeadingZeros(x uint64) int {
    return bits.LeadingZeros64(x) // Optimized to CLZ/BSR
}

func TrailingZeros(x uint64) int {
    return bits.TrailingZeros64(x) // Optimized to CTZ/BSF
}

func RotateLeft(x uint64, k int) uint64 {
    return bits.RotateLeft64(x, k) // Optimized to ROL
}

// Check assembly to verify intrinsics:
// go tool compile -S intrinsics.go | grep -A5 "PopCount"

Escape Analysis Visualization

// escape-analysis.go
package main

// Escapes to heap
func EscapesToHeap() *int {
    x := 42
    return &x // x escapes to heap
}

// Stays on stack
func StaysOnStack() int {
    x := 42
    return x // x stays on stack
}

// Analyze escape decisions:
// go build -gcflags="-m -m" escape-analysis.go
//
// Output resembles:
// ./escape-analysis.go:6:2: x escapes to heap:
// ./escape-analysis.go:6:2:   flow: ~r0 = &x:
// ./escape-analysis.go:6:2:     from &x at ./escape-analysis.go:7:9
// ./escape-analysis.go:6:2:     from return &x at ./escape-analysis.go:7:2
// ./escape-analysis.go:6:2: moved to heap: x

Performance Comparison Tool

// compare-opts.go - Compare optimization levels
package main

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

func buildAndBench(opts string, name string) time.Duration {
    // Build
    cmd := exec.Command("go", "build", "-gcflags="+opts, "-o", name, "main.go")
    if err := cmd.Run(); err != nil {
        fmt.Printf("Build failed: %v\n", err)
        return 0
    }

    // Benchmark
    start := time.Now()
    cmd = exec.Command("./" + name)
    cmd.Run()
    return time.Since(start)
}

func main() {
    tests := map[string]string{
        "no-opt":     "-N -l", // No optimization, no inlining
        "default":    "",      // Default optimization
        "aggressive": "-l=4",  // Aggressive inlining
    }

    for name, opts := range tests {
        duration := buildAndBench(opts, "app_"+name)
        fmt.Printf("%-12s: %v\n", name, duration)
    }

    // Cleanup
    os.Remove("app_no-opt")
    os.Remove("app_default")
    os.Remove("app_aggressive")
}

Advanced Compiler Optimization Patterns

Leveraging Escape Analysis

Understanding escape analysis helps you write code that the compiler can optimize better. Variables that don't escape can be stack-allocated instead of heap-allocated:

package main

import "fmt"

// Good: x doesn't escape, allocated on stack
func StackAllocation() int {
    x := make([]int, 100)
    return len(x)
}

// Bad: slice escapes, allocated on heap
func HeapAllocation() []int {
    x := make([]int, 100)
    return x // Escapes!
}

// Excellent: Reuse existing slice to avoid allocations
func ReuseSlice(buf []int) {
    for i := 0; i < len(buf); i++ {
        buf[i] = i * 2
    }
}

// Compiler Analysis:
// StackAllocation: No escapes - stack allocated
// HeapAllocation: Escapes return - heap allocated
// ReuseSlice: No allocations - reuses caller's memory

func main() {
    // Stack-efficient version
    x := make([]int, 1000)
    ReuseSlice(x) // No new allocations
    fmt.Println(x[0])
}

Interface Optimization Techniques

Interfaces are powerful but can be optimized:

package main

import (
    "fmt"
    "io"
)

// Heavy interface - lots of method calls
type HeavyReader interface {
    Read(p []byte) (n int, err error)
    Close() error
    Stat() (name string, size int64)
    IsEOF() bool
    Reset()
}

// Lightweight interface - only essentials
type LightReader interface {
    Read(p []byte) (n int, err error)
}

// Specialized interface - better inlining
type ByteReader interface {
    ReadByte() (byte, error)
}

// Concrete implementation - often faster than interface
type FastBuffer struct {
    data []byte
    pos  int
}

func (b *FastBuffer) ReadByte() (byte, error) {
    if b.pos >= len(b.data) {
        return 0, io.EOF
    }
    c := b.data[b.pos]
    b.pos++
    return c, nil
}

// Compare performance:
// - Using FastBuffer directly: All calls inlined
// - Using ByteReader interface: Some inlining depending on usage
// - Using HeavyReader interface: Less inlining opportunities

// Optimization tips:
// 1. Minimize methods in interfaces
// 2. Use concrete types when performance critical
// 3. Small interfaces inline better
// 4. Consider type assertions for fast path

func main() {
    buf := &FastBuffer{data: []byte("hello")}
    for buf.pos < len(buf.data) {
        b, _ := buf.ReadByte()
        fmt.Print(string(b))
    }
}

Build-Time Optimizations

The build process itself offers optimization opportunities:

// version.go - inject values at link time:
//   go build -ldflags "-X main.version=1.0.0"
package main

var (
    version = "dev"
    commit  = "unknown"
)

// Platform-specific code lives in separate files selected by build
// constraints (the //go:build line goes above the package clause):
//
//   opt_linux.go:   //go:build linux && amd64
//   opt_darwin.go:  //go:build darwin && amd64

// Strip the symbol table and DWARF debug info to reduce binary size:
//   go build -ldflags "-s -w"

// Disable cgo for fully static builds:
//   CGO_ENABLED=0 go build

// Profile-guided optimization:
//   1. Collect a profile:  go test -cpuprofile=cpu.prof ./...
//   2. Build with it:      go build -pgo=cpu.prof

Integration and Mastery - Optimization Workflow

The Optimization Mindset

Successful optimization follows a disciplined approach:

  1. Measure First: Use profiling, benchmarking, and real metrics
  2. Identify Bottlenecks: Focus on what actually matters
  3. Understand Trade-offs: Every optimization has costs
  4. Test Thoroughly: Verify improvements don't break functionality
  5. Document Changes: Explain why optimizations exist
  6. Monitor Results: Ensure improvements persist in production

Common Optimization Patterns

package optimization_patterns

import (
	"sync"
)

// Pattern 1: Object pooling for frequent allocations
type BufferPool struct {
	sync.Pool
}

func NewBufferPool() *BufferPool {
	return &BufferPool{
		Pool: sync.Pool{
			New: func() interface{} {
				return make([]byte, 0, 4096)
			},
		},
	}
}

func (p *BufferPool) Get() []byte {
	return p.Pool.Get().([]byte)[:0]
}

func (p *BufferPool) Put(buf []byte) {
	if cap(buf) <= 65536 { // Reuse only reasonable sizes
		p.Pool.Put(buf)
	}
}

// Pattern 2: Avoiding repeated JSON unmarshaling by caching parsed data
type CachedConfig struct {
	mu      sync.RWMutex
	data    map[string]interface{}
	rawJSON []byte
	lastMod int64
}

func (c *CachedConfig) GetValue(key string) interface{} {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

// Pattern 3: Preallocated slices for known sizes
func ProcessBatch(items []string) []int {
	// Allocate exact size needed
	results := make([]int, len(items))
	for i, item := range items {
		results[i] = len(item)
	}
	return results
}

// Pattern 4: Context-aware optimizations
type OptimizedHandler struct {
	pool *BufferPool
}

func (h *OptimizedHandler) Process(data []byte) (string, error) {
	buf := h.pool.Get()
	// Defer a closure so the grown buffer (not the original header)
	// is returned to the pool if append reallocates below.
	defer func() { h.pool.Put(buf) }()

	// Reuse buffer
	buf = append(buf, data...)
	return string(buf), nil
}

Monitoring and Profiling Best Practices

Always measure in production-like conditions:

  • Use realistic data sizes
  • Run on similar hardware
  • Account for concurrent usage
  • Monitor memory and CPU separately
  • Track metrics over time

Tools for different scenarios:

  • CPU profiling: go test -cpuprofile=cpu.prof
  • Memory profiling: go test -memprofile=mem.prof
  • Allocations: go tool pprof -alloc_objects (or -alloc_space) on a memory profile
  • Goroutines: the /debug/pprof/goroutine endpoint from net/http/pprof
  • Benchmarks: go test -bench with -benchmem

Remember: "The best optimization is not doing unnecessary work." - Rob Pike

Practice Exercises

Exercise 1: Optimize Function Inlining

Learning Objective: Master function inlining techniques to eliminate call overhead and improve performance in hot code paths.

Context: Function inlining is critical for performance-sensitive applications like game engines and high-frequency trading systems. Google's Go team optimized the standard library by identifying key functions that benefit from inlining, resulting in 2-5x performance improvements in string operations and JSON parsing.

Difficulty: Intermediate | Time: 15-20 minutes

Refactor the following code to improve compiler inlining opportunities and eliminate unnecessary function call overhead:

package main

import "math"

type Point struct {
    X, Y int
}

func (p Point) Distance(other Point) float64 {
    dx := float64(p.X - other.X)
    dy := float64(p.Y - other.Y)
    return math.Sqrt(dx*dx + dy*dy)
}

func processPoints(points []Point) float64 {
    total := 0.0
    for i := 0; i < len(points)-1; i++ {
        total += points[i].Distance(points[i+1])
    }
    return total
}

Task: Optimize the code to make Distance() more likely to inline and reduce call overhead in the hot path.

Solution
package main

import "math"

type Point struct {
    X, Y int
}

// Split into smaller functions for better inlining
func (p Point) DistanceSquared(other Point) int {
    dx := p.X - other.X
    dy := p.Y - other.Y
    return dx*dx + dy*dy // Inlines well
}

func (p Point) Distance(other Point) float64 {
    return math.Sqrt(float64(p.DistanceSquared(other)))
}

// Optimized: Avoid repeated function calls
func processPoints(points []Point) float64 {
    if len(points) < 2 {
        return 0
    }

    total := 0.0
    for i := range points[:len(points)-1] {
        // Inline-friendly: simple computation
        dx := float64(points[i].X - points[i+1].X)
        dy := float64(points[i].Y - points[i+1].Y)
        total += math.Sqrt(dx*dx + dy*dy)
    }
    return total
}

// Alternative: If you don't need exact distances
func processPointsFast(points []Point) int {
    if len(points) < 2 {
        return 0
    }

    total := 0
    for i := range points[:len(points)-1] {
        // Use squared distance
        total += points[i].DistanceSquared(points[i+1])
    }
    return total
}

Exercise 2: Eliminate Bounds Checks

Learning Objective: Master bounds check elimination techniques to achieve maximum performance in array and slice operations.

Context: Bounds check elimination is crucial for numerical computing and data processing applications. Companies like Wolfram Research and financial modeling firms optimize matrix operations by eliminating bounds checks, achieving 20-30% performance improvements in computational kernels.

Difficulty: Intermediate | Time: 20-25 minutes

Optimize this matrix multiplication function by eliminating bounds checks in the inner loops where they cause the most performance impact:

package main

func matrixMultiply(a, b [][]int) [][]int {
    n := len(a)
    result := make([][]int, n)

    for i := 0; i < n; i++ {
        result[i] = make([]int, n)
        for j := 0; j < n; j++ {
            for k := 0; k < n; k++ {
                result[i][j] += a[i][k] * b[k][j]
            }
        }
    }

    return result
}

Task: Apply bounds check elimination techniques to improve performance in this computational hot path.

Solution
package main

func matrixMultiplyOptimized(a, b [][]int) [][]int {
    n := len(a)
    if n == 0 {
        return nil
    }

    result := make([][]int, n)

    for i := range a { // Use range to eliminate outer check
        result[i] = make([]int, n)
        rowA := a[i] // Hoist bounds check

        for j := range result[i] { // Use range for middle loop
            sum := 0

            // Compiler can eliminate bounds checks on rowA here
            for k := range rowA {
                sum += rowA[k] * b[k][j]
            }

            result[i][j] = sum
        }
    }

    return result
}

// Further optimization: transpose b once so the inner loop walks both
// operands sequentially - bounds checks vanish and cache locality improves
func matrixMultiplyFastest(a, b [][]int) [][]int {
    n := len(a)
    if n == 0 {
        return nil
    }

    // Build the transpose of b: bT[j][k] == b[k][j]
    bT := make([][]int, n)
    for j := range bT {
        bT[j] = make([]int, n)
        for k := 0; k < n; k++ {
            bT[j][k] = b[k][j]
        }
    }

    result := make([][]int, n)

    for i := range a {
        rowA := a[i]
        if len(rowA) != n {
            panic("invalid matrix dimensions")
        }
        result[i] = make([]int, n)

        for j := 0; j < n; j++ {
            colB := bT[j]
            sum := 0

            // rowA and colB both have length n: no bounds checks
            for k := range rowA {
                sum += rowA[k] * colB[k]
            }

            result[i][j] = sum
        }
    }

    return result
}

Exercise 3: Dead Code Elimination

Learning Objective: Implement compile-time dead code elimination techniques to create optimized production builds without debug overhead.

Context: Dead code elimination is essential for creating high-performance production builds. Companies like Google and Netflix use build tags and constants to eliminate debug code entirely from production binaries, reducing binary size by 30% and improving runtime performance by removing unnecessary checks.

Difficulty: Intermediate | Time: 15-20 minutes

Use build tags and compile-time constants to eliminate debug code completely from production builds:

package main

import (
    "log"
    "time"
)

func processRequest(data []byte) []byte {
    start := time.Now()

    log.Printf("Processing %d bytes", len(data))

    result := transform(data)

    log.Printf("Transformed in %v", time.Since(start))

    return result
}

func transform(data []byte) []byte {
    // Transformation logic
    return data
}

Task: Eliminate all logging overhead and debug code from production builds using compile-time techniques.

Solution
// config_release.go - compiled unless the "debug" tag is set
//go:build !debug

package main

const DebugMode = false

// Empty body: calls inline to nothing and are eliminated entirely
func debugLog(format string, args ...interface{}) {}

// config_debug.go - compiled only with: go build -tags debug
//go:build debug

package main

import "log"

const DebugMode = true

func debugLog(format string, args ...interface{}) {
    log.Printf(format, args...)
}

// main.go - shared between both builds
package main

import (
    "log"
    "time"
)

func processRequest(data []byte) []byte {
    var start time.Time
    if DebugMode { // constant false in release builds: branch is dead code
        start = time.Now()
        log.Printf("Processing %d bytes", len(data))
    }

    result := transform(data)

    if DebugMode {
        log.Printf("Transformed in %v", time.Since(start))
    }

    return result
}

// Better approach: route all debug output through debugLog; in release
// builds the empty function compiles away completely
func processRequestOptimized(data []byte) []byte {
    start := time.Now()
    debugLog("Processing %d bytes", len(data))

    result := transform(data)

    debugLog("Transformed in %v", time.Since(start))
    return result
}

func transform(data []byte) []byte {
    return data
}

// Build:
// Production: go build
// Debug:      go build -tags debug

Exercise 4: Loop Optimization

Learning Objective: Master advanced loop optimization techniques including fusion, unrolling, and invariant code motion.

Context: Loop optimization is critical for data processing and machine learning applications. Companies like TensorFlow and PyTorch spend significant engineering effort optimizing loop patterns, achieving 3-5x performance improvements in numerical computations through careful loop restructuring.

Difficulty: Advanced | Time: 25-30 minutes

Optimize this loop-heavy function by applying advanced loop optimization techniques to minimize memory access and maximize CPU cache efficiency:

package main

func processMatrix(matrix [][]int) {
    rows := len(matrix)
    cols := len(matrix[0])

    // Pass 1: Multiply by 2
    for i := 0; i < rows; i++ {
        for j := 0; j < cols; j++ {
            matrix[i][j] *= 2
        }
    }

    // Pass 2: Add row index
    for i := 0; i < rows; i++ {
        for j := 0; j < cols; j++ {
            matrix[i][j] += i
        }
    }

    // Pass 3: Add column index
    for i := 0; i < rows; i++ {
        for j := 0; j < cols; j++ {
            matrix[i][j] += j
        }
    }
}

Task: Apply loop fusion, unrolling, and other optimizations to reduce memory bandwidth and improve cache locality.

Solution
package main

// Optimized: Loop fusion - single pass
func processMatrixOptimized(matrix [][]int) {
    if len(matrix) == 0 {
        return
    }

    for i := range matrix {
        row := matrix[i] // Hoist bounds check

        for j := range row {
            // Fuse all three operations
            row[j] = row[j]*2 + i + j
        }
    }
}

// Further optimization: Explicit unrolling for small matrices
func processMatrixUnrolled(matrix [][]int) {
    if len(matrix) == 0 {
        return
    }
    cols := len(matrix[0])

    for i := range matrix {
        row := matrix[i]
        j := 0

        // Process 4 elements at a time
        for j+3 < cols {
            row[j] = row[j]*2 + i + j
            row[j+1] = row[j+1]*2 + i + j + 1
            row[j+2] = row[j+2]*2 + i + j + 2
            row[j+3] = row[j+3]*2 + i + j + 3
            j += 4
        }

        // Handle remaining elements
        for ; j < cols; j++ {
            row[j] = row[j]*2 + i + j
        }
    }
}

Exercise 5: PGO-Driven Optimization

Learning Objective: Master profile-guided optimization to let the compiler make intelligent optimization decisions based on runtime profiling data.

Context: PGO is a game-changer for performance-critical applications. Go's PGO implementation helped optimize the Go compiler itself, achieving 5-15% performance improvements in real workloads. Companies like Uber and Dropbox use PGO to optimize their critical path code based on production traffic patterns.

Difficulty: Advanced | Time: 30-35 minutes

Implement a function that benefits significantly from profile-guided optimization by creating realistic workload patterns:

package main

func processData(data []int) int {
    result := 0

    for _, v := range data {
        if v%2 == 0 {
            result += expensiveEven(v)
        } else {
            result += expensiveOdd(v)
        }
    }

    return result
}

func expensiveEven(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i * i
    }
    return sum
}

func expensiveOdd(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i * i * i
    }
    return sum
}

Task: Set up PGO workflow, generate representative profiles, and demonstrate optimization impact on hot path performance.

Solution
// main.go
package main

import (
    "fmt"
    "os"
    "runtime/pprof"
)

func processData(data []int) int {
    result := 0

    for _, v := range data {
        if v%2 == 0 {
            result += expensiveEven(v)
        } else {
            result += expensiveOdd(v)
        }
    }

    return result
}

func expensiveEven(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i * i
    }
    return sum
}

func expensiveOdd(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i * i * i
    }
    return sum
}

func main() {
    // Generate profile
    if len(os.Args) > 1 && os.Args[1] == "profile" {
        f, err := os.Create("cpu.prof")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()
    }

    // Representative workload: skewed so PGO sees a clear hot path
    data := make([]int, 10000)
    for i := range data {
        if i%10 < 8 {
            data[i] = i * 2 // 80% even
        } else {
            data[i] = i*2 + 1 // 20% odd
        }
    }

    // Run many times for profiling
    for i := 0; i < 100; i++ {
        _ = processData(data)
    }

    fmt.Println("Done")
}

// Step 1: Generate profile
// go run main.go profile

// Step 2: Prepare PGO file
// mv cpu.prof default.pgo

// Step 3: Build with PGO
// go build -pgo=default.pgo -o app_pgo

// Step 4: Build without PGO for comparison
// go build -pgo=off -o app_no_pgo

// Step 5: Benchmark both
// time ./app_pgo
// time ./app_no_pgo

// PGO will optimize the hot path (expensiveEven) more aggressively

Summary