Why Compiler Optimizations Matter
The Go compiler is your performance multiplier: understanding what it can (and cannot) optimize is often the difference between code that runs at half speed and code that runs at full speed, with zero extra effort.
Real-World Impact:
Inlining - up to 2-5x speedup on tiny hot functions:
// Small function: compiler inlines automatically
func add(a, b int) int { return a + b }

// Hot path: inlined = no function call overhead
sum += add(x, y) // Becomes: sum += x + y
Bounds Check Elimination - up to 20-30% speedup in tight loops:
// ❌ Bounds check every iteration
for i := 0; i < len(data); i++ {
    process(data[i]) // Compiler checks: i < len(data)?
}

// ✅ Range loop: compiler proves the index is always in bounds
for i := range data {
    process(data[i]) // No bounds check!
}
Compiler Optimization Impact (typical ranges):
├─ Function inlining: up to 2-5x on tiny hot functions
├─ Bounds check elim: 1.2-1.3x in tight loops
├─ Escape analysis: 2-10x in allocation-heavy code
├─ Dead code elim: binaries up to ~30% smaller
└─ Constant folding: compile-time math = zero runtime cost
The Go compiler performs numerous optimizations to generate efficient machine code. Understanding them helps you write code the compiler can optimize effectively, leading to better performance without sacrificing readability; simple patterns can unlock multi-x speedups in hot paths.
In this tutorial, we'll explore:
- Compiler architecture: How the Go compiler works
- Inlining: Automatic function call elimination
- Compiler directives: Controlling optimization behavior
- Bounds check elimination: Removing array bounds checks
- Dead code elimination: Removing unused code
- Escape analysis: Optimizing memory allocation
- Build flags: Controlling optimization levels
- Assembly analysis: Understanding generated code
Why Compiler Optimizations Matter:
- Performance: Optimized code runs faster
- Efficiency: Better use of CPU resources
- Predictability: Understanding optimizations helps predict performance
- Code quality: Write compiler-friendly code without sacrificing clarity
Go Compiler Architecture
Compilation Pipeline
// Go compilation pipeline:
// 1. Parsing: Source code → Abstract Syntax Tree
// 2. Type checking: Verify types and semantics
// 3. SSA generation: AST → Static Single Assignment form
// 4. Optimization: SSA transformations
// 5. Code generation: SSA → machine code
// 6. Linking: Combine compiled packages

package main

import "fmt"

func main() {
    // View compilation steps:
    // go build -x main.go

    // View SSA passes:
    // GOSSAFUNC=main go build main.go
    // Writes ssa.html showing each SSA optimization pass

    fmt.Println("Understanding compiler phases")
}
Compiler Flags Overview
# View all compiler flags
go tool compile -h

# Common optimization flags:
go build -gcflags="-m"       # Show optimization decisions
go build -gcflags="-m -m"    # Verbose optimization details
go build -gcflags="-S"       # Show assembly output
go build -ldflags="-s -w"    # Strip symbol table and debug info, reduce binary size

# Disable optimizations:
go build -gcflags="-N -l"    # -N = no optimizations, -l = no inlining
Inlining
Inlining replaces function calls with the function body, eliminating call overhead.
What's Happening: When the compiler inlines a function, it copies the function's body directly into the caller instead of emitting a CALL instruction. This removes the overhead of setting up the call frame, jumping to the function's address, and returning. For small, frequently called functions this can yield a 2-5x speedup on the call itself. The compiler uses a cost-based heuristic to decide what to inline.
Basic Inlining
package main

import "fmt"

// Small functions are automatically inlined
func add(a, b int) int {
    return a + b
}

// Inlined in optimized builds
func square(x int) int {
    return x * x
}

func main() {
    result := add(3, square(4))
    fmt.Println(result)
}

// View inlining decisions:
// go build -gcflags="-m" main.go
// Output:
// ./main.go:6:6: can inline add
// ./main.go:11:6: can inline square
// ./main.go:16:6: can inline main
Inline Budget
The compiler uses a "cost budget" to decide what to inline:
package main

// Cost: ~5
func cheap(x int) int {
    return x * 2
}

// Cost: ~40
func moderate(x, y int) int {
    if x > y {
        return x
    }
    return y
}

// Cost: >80 (over the default budget)
func expensive(arr []int) int {
    sum := 0
    for _, v := range arr {
        if v > 0 {
            sum += v * v
        } else {
            sum += v
        }
    }
    return sum
}

// View inlining costs:
// go build -gcflags="-m -m" main.go
Forcing and Preventing Inlining
package main

//go:noinline
func noInline(x int) int {
    return x * 2
}

// Note: there is no //go:inline directive. Go provides no way to force
// inlining; small functions like this one are inlined automatically
// when they fit within the inliner's cost budget.
func autoInline(x int) int {
    return x * 2
}

// Mid-stack inlining: inline calls within inlined functions
func outer(x int) int {
    return inner(x) + inner(x+1) // inner() may be inlined here
}

func inner(x int) int {
    return x * x
}

func main() {
    _ = noInline(5)
    _ = autoInline(5)
    _ = outer(5)
}
Inlining Strategies
package main

// Good: Small, simple functions inline well
func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

// Good: Early returns help inlining
func isPositive(x int) bool {
    if x <= 0 {
        return false
    }
    return true
}

// Harder: loops raise the inlining cost (older Go compilers rejected
// them outright; newer ones inline only cheap loops)
func sum(arr []int) int {
    total := 0
    for _, v := range arr {
        total += v
    }
    return total
}

// Harder: panic adds cost and can prevent inlining
func divide(a, b int) int {
    if b == 0 {
        panic("division by zero")
    }
    return a / b
}

// Optimization: Split function to enable partial inlining
func divideOptimized(a, b int) int {
    if b == 0 {
        divideByZeroPanic() // Separate panic to cold path
    }
    return a / b // This part can inline
}

//go:noinline
func divideByZeroPanic() {
    panic("division by zero")
}
Bounds Check Elimination
The compiler eliminates array bounds checks when it can prove safety.
Why This Matters: Every time you access arr[i], Go inserts a runtime check: is i within bounds? If not, the program panics. While safe, this check costs roughly 1-3 CPU cycles per access. When the compiler can prove an index is always valid, it eliminates the check entirely, which can yield a 20-30% speedup in tight loops.
Automatic BCE
package main

import "fmt"

func sumArray(arr []int) int {
    sum := 0

    // Bounds check on each iteration
    for i := 0; i < len(arr); i++ {
        sum += arr[i] // Bounds check here
    }

    return sum
}

func sumArrayOptimized(arr []int) int {
    sum := 0

    // No bounds check: range guarantees safety
    for _, v := range arr {
        sum += v // No bounds check needed
    }

    return sum
}

func main() {
    arr := []int{1, 2, 3, 4, 5}
    fmt.Println(sumArray(arr))
    fmt.Println(sumArrayOptimized(arr))
}

// View bounds checks:
// go build -gcflags="-d=ssa/check_bce/debug=1" main.go
Manual BCE Optimization
package main

// Bounds check on every access
func processSlow(arr []int) {
    for i := 0; i < len(arr); i++ {
        _ = arr[i] // Bounds check
    }
}

// Compiler eliminates bounds check after first access
func processFast(arr []int) {
    if len(arr) == 0 {
        return
    }

    // First access with bounds check
    _ = arr[0]

    // Subsequent accesses within proven bounds
    for i := 1; i < len(arr); i++ {
        _ = arr[i]
    }
}

// Use range to avoid bounds checks entirely
func processFastest(arr []int) {
    for i := range arr {
        _ = arr[i] // No bounds check
    }
}

// BCE with slice tricks
func copyOptimized(dst, src []int) {
    // Explicit length check
    if len(dst) < len(src) {
        panic("dst too small")
    }

    // Compiler knows dst is large enough
    for i := range src {
        dst[i] = src[i] // No bounds check on dst
    }
}
Proving Bounds Safety
What's Happening: The compiler tracks value ranges through the program. When you explicitly check i < len(arr), the compiler records "in the true branch, i is definitely < len(arr)". All subsequent accesses to arr[i] within that scope skip the bounds check because safety is already proven. This is called "flow-sensitive analysis".
package main

// Bad: Compiler can't prove safety
func accessBad(arr []int, i, j int) int {
    return arr[i] + arr[j] // Two bounds checks
}

// Good: Prove bounds before access. The compiler needs to know
// 0 <= i < len(arr), so the negative case must be ruled out too
func accessGood(arr []int, i, j int) int {
    if i < 0 || j < 0 || i >= len(arr) || j >= len(arr) {
        panic("index out of bounds")
    }

    // Compiler knows i and j are safe
    return arr[i] + arr[j] // No bounds checks
}

// Better: one unsigned comparison per index proves both
// non-negativity and the upper bound at once
func accessBetter(arr []int, i, j int) int {
    if uint(i) >= uint(len(arr)) || uint(j) >= uint(len(arr)) {
        panic("index out of bounds")
    }

    return arr[i] + arr[j] // No bounds checks
}
Compiler Directives
Compiler directives control optimization behavior.
Common Directives
package main

//go:noinline
func noInlineFunc() {
    // Prevent inlining for benchmarking or debugging
}

//go:nosplit
func noSplitFunc() {
    // Prevent stack-growth checks
    // Use with caution - can overflow the stack
}

// //go:noescape may only be placed on a function declaration without a
// body (i.e. one implemented in assembly). It tells the compiler that
// pointers passed to the function do not escape:
//
//	//go:noescape
//	func memmoveAsm(dst, src *byte, n uintptr)

// //go:linkname localName importPath.name
// Link to an unexported function in another package
// (requires importing "unsafe"; used for runtime/syscall interfaces)

// //go:build linux && amd64
// Build constraint (must appear before the package clause)

// //go:generate command
// Run command during go generate

func main() {
    noInlineFunc()
    noSplitFunc()
}
Practical Directive Usage
package main

import (
    "fmt"
    "time"
)

// Example: Preventing inlining for stable benchmarks
//go:noinline
func computeExpensive(n int) int {
    result := 0
    for i := 0; i < n; i++ {
        result += i * i
    }
    return result
}

// Example: Force escape for testing
//go:noinline
func allocateEscape() *int {
    x := 42
    return &x // Force heap allocation
}

// Example: Generate test data
//go:generate go run generate_data.go

// Example: Mode-specific implementations. A build constraint must sit
// at the top of its own file, so the two variants below would live in
// separate files rather than side by side:
//
// fast_norace.go:
//	//go:build !race
//	func fastPath() { fmt.Println("Fast path") } // fast, not race-detector safe
//
// fast_race.go:
//	//go:build race
//	func fastPath() { fmt.Println("Race-safe path") } // slower but race-safe

func fastPath() {
    fmt.Println("Fast path")
}

func main() {
    start := time.Now()
    result := computeExpensive(1000000)
    fmt.Printf("Result: %d, Time: %v\n", result, time.Since(start))

    ptr := allocateEscape()
    fmt.Printf("Escaped value: %d\n", *ptr)

    fastPath()
}
Dead Code Elimination
The compiler removes unreachable or unused code.
What's Happening: During compilation, the compiler builds a control flow graph showing all possible execution paths. Code that's provably unreachable is completely removed from the binary. Similarly, unused variables and computations are eliminated. This happens after inlining, so constants folded from inlined functions can reveal more dead code.
Basic DCE
package main

import "fmt"

func example() {
    x := 10 // Used
    y := 20 // Only trivially used below - its computation is eliminated

    if false {
        // Dead code - eliminated
        fmt.Println("Never executed")
    }

    if true {
        fmt.Println("Always executed")
    } else {
        // Dead code - eliminated
        fmt.Println("Never executed")
    }

    _ = x // Use x (a completely unused local would not even compile in Go)
    _ = y
}

// View what's eliminated:
// go build -gcflags="-m" main.go
Constant Folding
Why This Works: When the compiler sees operations on constants or compile-time-known values, it performs the calculation during compilation rather than at runtime. const c = 10 + 20 doesn't generate addition code—the compiler just embeds 30 directly in the binary. This extends to complex expressions, saving runtime CPU cycles completely.
package main

import "fmt"

func constantFolding() {
    // Computed at compile time
    const a = 10
    const b = 20
    const c = a + b // c = 30 at compile time

    x := 5 + 3  // Folded to 8
    y := x * 2  // May be folded to 16
    z := y << 1 // May be folded to 32

    fmt.Println(c, x, y, z)
}

func conditionalElimination() {
    const debug = false

    if debug {
        // Eliminated at compile time
        fmt.Println("Debug info")
    }

    // Compiler knows this is always true
    if !debug {
        fmt.Println("Production")
    }
}
Optimizing with DCE
package main

import "fmt"

// Bad: Keeping unnecessary computations alive
func processDataBad(data []int) int {
    sum := 0
    count := 0
    avg := 0

    for _, v := range data {
        sum += v
        count++
        avg = sum / count // Computed but never read
    }

    _ = avg // Without this, Go refuses to compile the unused variable
    return sum
}

// Good: Remove unused computations
func processDataGood(data []int) int {
    sum := 0
    for _, v := range data {
        sum += v
    }
    return sum
}

// Feature flags for DCE
const (
    FeatureA = true
    FeatureB = false
)

func handleRequest() {
    if FeatureA {
        fmt.Println("Feature A enabled")
    }

    if FeatureB {
        // Dead code if FeatureB = false
        fmt.Println("Feature B enabled")
    }
}
Loop Optimizations
The compiler applies various loop optimizations.
Loop Unrolling
package main

// Compiler may unroll small, fixed-iteration loops
func unrolledLoop() {
    arr := [4]int{1, 2, 3, 4}

    // May be unrolled to:
    // sum = arr[0] + arr[1] + arr[2] + arr[3]
    sum := 0
    for i := 0; i < 4; i++ {
        sum += arr[i]
    }

    _ = sum
}

// Explicit unrolling for better performance
func manualUnroll(arr []int) int {
    n := len(arr)
    sum := 0

    // Process 4 elements at a time
    i := 0
    for i+3 < n {
        sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3]
        i += 4
    }

    // Handle remaining elements
    for ; i < n; i++ {
        sum += arr[i]
    }

    return sum
}
Loop Invariant Code Motion
What's Happening: Loop-invariant code motion identifies expressions that produce the same result on every iteration and hoists them outside the loop. Instead of computing len(arr)/2 1000 times in a 1000-iteration loop, it's computed once before the loop. Modern Go compilers do this automatically for simple cases, but explicit hoisting ensures optimization and improves code clarity.
package main

// Bad: Recomputing invariant inside loop
func computeBad(arr []int, multiplier int) {
    for i := range arr {
        // len(arr)/2 recomputed on each iteration
        // (unless the compiler hoists it)
        if i < len(arr)/2 {
            arr[i] *= multiplier
        }
    }
}

// Good: Hoist invariant out of loop
func computeGood(arr []int, multiplier int) {
    half := len(arr) / 2 // Computed once

    for i := range arr {
        if i < half {
            arr[i] *= multiplier
        }
    }
}

// Compiler may do this automatically, but explicit is clearer
Loop Fusion
package main

// Bad: Multiple passes over data
func processMultiPass(arr []int) {
    // First pass: multiply by 2
    for i := range arr {
        arr[i] *= 2
    }

    // Second pass: add 10
    for i := range arr {
        arr[i] += 10
    }
}

// Good: Single pass
func processSinglePass(arr []int) {
    for i := range arr {
        arr[i] = arr[i]*2 + 10
    }
}
Build Flags and Optimization Levels
Common Build Flags
# Production build
go build

# Debug build
go build -gcflags="-N -l"

# Optimized build
go build -ldflags="-s -w"
# -s: Strip symbol table
# -w: Strip DWARF debugging info

# Profile-guided optimization - Go 1.21+
go build -pgo=default.pgo

# Show optimization decisions
go build -gcflags="-m -m"

# Generate assembly (written to stderr)
go build -gcflags="-S" 2> assembly.txt

# Cross-compilation
GOOS=linux GOARCH=amd64 go build
Build Tags for Optimization
// fast.go
//go:build !debug

package myapp

func Process() {
    // Fast implementation
    fastAlgorithm()
}

// slow.go
//go:build debug

package myapp

func Process() {
    // Slower but with extra checks
    slowAlgorithmWithChecks()
}

// Build commands:
// go build              # Uses fast.go
// go build -tags debug  # Uses slow.go
Reducing Binary Size
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// Use -ldflags to reduce binary size:
// go build -ldflags="-s -w" -o app

// Bad: Pulls the entire regexp package into the binary
func matchBad(pattern, text string) bool {
    re := regexp.MustCompile(pattern)
    return re.MatchString(text)
}

// Good: Use the strings package for simple cases;
// dropping the regexp import keeps the binary smaller
func matchGood(pattern, text string) bool {
    return strings.Contains(text, pattern)
}

func main() {
    fmt.Println(matchBad("binary", "Optimized binary"))
    fmt.Println(matchGood("binary", "Optimized binary"))
}
Analyzing Generated Assembly
Viewing Assembly Output
package main

func add(a, b int) int {
    return a + b
}

func main() {
    result := add(3, 4)
    _ = result
}

// Generate assembly:
// go tool compile -S main.go > main.s
// Or: go build -gcflags="-S" main.go

// Assembly for add() on AMD64 (stack-based ABI):
// TEXT "".add(SB), NOSPLIT, $0-24
//     MOVQ a+0(FP), AX
//     ADDQ b+8(FP), AX
//     MOVQ AX, ret+16(FP)
//     RET
Understanding Assembly Patterns
package main

// Example 1: Simple arithmetic
func multiply(a, b int) int {
    return a * b
}

// Assembly: IMULQ instruction

// Example 2: Bitwise operations
func powerOfTwo(x int) int {
    return x << 1 // Same as x * 2
}

// Assembly: SHLQ instruction

// Example 3: Bounds check
func accessArray(arr []int, i int) int {
    return arr[i]
}

// Assembly: CMPQ (bounds check), followed by MOVQ

// Example 4: Inlined function
func square(x int) int {
    return x * x
}

func useSquare(x int) int {
    return square(x) + 1
}

// Assembly for useSquare: No CALL to square
Benchmark with Assembly Analysis
package main

import "testing"

func addSlow(a, b int) int {
    result := 0
    for i := 0; i < a; i++ {
        result++
    }
    for i := 0; i < b; i++ {
        result++
    }
    return result
}

func addFast(a, b int) int {
    return a + b
}

func BenchmarkAddSlow(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = addSlow(10, 20)
    }
}

func BenchmarkAddFast(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = addFast(10, 20)
    }
}

// Run: go test -bench=. -benchmem
// Compare assembly: go test -gcflags="-S" -bench=Add
Best Practices for Optimizable Code
Write Compiler-Friendly Code
package main

// Good: Small, simple functions inline well
func abs(x int) int {
    if x < 0 {
        return -x
    }
    return x
}

// Good: Use range for slice iteration
func sumSlice(arr []int) int {
    sum := 0
    for _, v := range arr {
        sum += v
    }
    return sum
}

// Good: Prove bounds to eliminate checks
func copySlice(dst, src []int) {
    if len(dst) < len(src) {
        panic("dst too small")
    }
    for i := range src {
        dst[i] = src[i] // No bounds checks
    }
}

// Good: Use constants for dead code elimination
const DebugMode = false

func log(msg string) {
    if DebugMode {
        println(msg) // Eliminated if DebugMode = false
    }
}

// Good: Avoid unnecessary allocations
func processInPlace(data []byte) {
    for i := range data {
        data[i] = data[i] * 2 // No allocation
    }
}
Profile-Guided Optimization
package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    // Step 1: Generate profile
    f, _ := os.Create("cpu.prof")
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()

    // Run your application...
    workload()
}

func workload() {
    // Representative workload
    for i := 0; i < 1000000; i++ {
        _ = expensiveFunction(i)
    }
}

func expensiveFunction(n int) int {
    sum := 0
    for i := 0; i < n%100; i++ {
        sum += i * i
    }
    return sum
}

// Step 2: Build with PGO
// mv cpu.prof default.pgo
// go build -pgo=default.pgo

// The compiler uses profile data to optimize hot paths
Profile-Guided Optimization
Profile-Guided Optimization uses runtime profiles to optimize frequently executed code paths.
Why This Works: Traditional optimization is "blind": the compiler guesses which code paths are hot. PGO uses actual runtime data to guide decisions. When you provide a CPU profile, the compiler knows which functions are called most frequently and optimizes those more aggressively. The Go release notes report typical gains of 2-7% for real-world workloads, sometimes more for hot-path-heavy programs.
Complete PGO Workflow
// main.go - Example application for PGO
package main

import (
    "flag"
    "fmt"
    "log"
    "os"
    "runtime/pprof"
    "time"
)

// Hot path function
func processRequest(data []byte) int {
    sum := 0
    for _, b := range data {
        sum += int(b)
        if sum%2 == 0 {
            sum = expensiveComputation(sum)
        }
    }
    return sum
}

// Expensive function that benefits from PGO
func expensiveComputation(n int) int {
    result := n
    for i := 0; i < 100; i++ {
        result = (result * result) % 1000000007
        if result%3 == 0 {
            result += fibonacci(10)
        }
    }
    return result
}

func fibonacci(n int) int {
    if n <= 1 {
        return n
    }
    return fibonacci(n-1) + fibonacci(n-2)
}

func main() {
    profile := flag.Bool("profile", false, "enable CPU profiling")
    flag.Parse()

    // Start CPU profiling if requested
    if *profile {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()
    }

    // Simulate workload
    start := time.Now()

    for i := 0; i < 10000; i++ {
        data := make([]byte, 1000)
        for j := range data {
            data[j] = byte(j % 256)
        }
        _ = processRequest(data)
    }

    elapsed := time.Since(start)
    fmt.Printf("Completed in %v\n", elapsed)
}

/*
PGO Workflow:

1. Generate profile:
   go run main.go -profile
   # Creates cpu.prof

2. Rename profile to default location:
   mv cpu.prof default.pgo

3. Build with PGO:
   go build -pgo=default.pgo -o app_pgo

4. Build without PGO for comparison:
   go build -pgo=off -o app_no_pgo

5. Compare performance:
   time ./app_pgo
   time ./app_no_pgo

Typical improvement: roughly 2-7% in hot paths (per the Go release notes)
*/
Analyzing PGO Impact
#!/bin/bash
# pgo-analysis.sh - Script to analyze PGO impact

echo "=== Profile-Guided Optimization Analysis ==="
echo

# Step 1: Build baseline
echo "Building baseline..."
go build -pgo=off -o app_baseline
baseline_size=$(stat -f%z app_baseline 2>/dev/null || stat -c%s app_baseline)
echo "Baseline binary size: $baseline_size bytes"

# Step 2: Generate profile
echo
echo "Generating profile..."
./app_baseline -profile > /dev/null 2>&1
mv cpu.prof default.pgo

# Step 3: Build with PGO
echo
echo "Building with PGO..."
go build -pgo=default.pgo -o app_pgo
pgo_size=$(stat -f%z app_pgo 2>/dev/null || stat -c%s app_pgo)
echo "PGO binary size: $pgo_size bytes"

# Step 4: Benchmark
# Note: "time -l" is BSD/macOS; use "time -v" on GNU/Linux
echo
echo "Benchmarking baseline..."
/usr/bin/time -l ./app_baseline 2>&1 | grep "user\|real"

echo
echo "Benchmarking PGO..."
/usr/bin/time -l ./app_pgo 2>&1 | grep "user\|real"

echo
echo "=== Analysis Complete ==="
Advanced PGO with Multiple Profiles
// merge-profiles.go - Merge multiple CPU profiles
package main

import (
    "fmt"
    "log"
    "os"

    "github.com/google/pprof/profile"
)

func mergeProfiles(files []string, output string) error {
    var merged *profile.Profile

    for i, file := range files {
        f, err := os.Open(file)
        if err != nil {
            return fmt.Errorf("open %s: %w", file, err)
        }

        p, err := profile.Parse(f)
        f.Close()
        if err != nil {
            return fmt.Errorf("parse %s: %w", file, err)
        }

        if i == 0 {
            merged = p
        } else {
            if err := merged.Merge(p, 1); err != nil {
                return fmt.Errorf("merge %s: %w", file, err)
            }
        }
    }

    // Write merged profile
    out, err := os.Create(output)
    if err != nil {
        return err
    }
    defer out.Close()

    return merged.Write(out)
}

func main() {
    profiles := []string{
        "profile1.prof",
        "profile2.prof",
        "profile3.prof",
    }

    if err := mergeProfiles(profiles, "merged.pgo"); err != nil {
        log.Fatal(err)
    }

    fmt.Println("Merged profiles into merged.pgo")
}
Assembly Analysis
Understanding generated assembly helps identify optimization opportunities.
Generating and Analyzing Assembly
// assembly-example.go
package main

// Simple function for assembly analysis
func Add(a, b int) int {
    return a + b
}

// Function with bounds check
func SumArray(arr []int) int {
    sum := 0
    for i := 0; i < len(arr); i++ {
        sum += arr[i] // Bounds check here
    }
    return sum
}

// Optimized version
func SumArrayOptimized(arr []int) int {
    sum := 0
    for _, v := range arr { // Range eliminates bounds check
        sum += v
    }
    return sum
}

// Kept out-of-line so its call shows up in the generated assembly
//go:noinline
func NoInlineTest(x int) int {
    return x * 2
}
Assembly Generation Commands
#!/bin/bash
# generate-assembly.sh

# Generate assembly for a source file
go tool compile -S assembly-example.go > assembly.s

# Generate with and without optimization
go tool compile -S -N -l assembly-example.go > assembly-no-opt.s  # No optimization
go tool compile -S assembly-example.go > assembly-opt.s           # With optimization

# Disassemble compiled binary
go build -o app assembly-example.go
go tool objdump -S app > disassembly.txt

# Generate assembly via go build (compiler writes it to stderr)
go build -gcflags="-S" assembly-example.go 2> assembly-with-symbols.s
Reading Assembly Output
# Assembly output for Add function:
"".Add STEXT nosplit size=10 args=0x18 locals=0x0
TEXT "".Add(SB), NOSPLIT|ABIInternal, $0-24
FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
MOVQ "".b+16(SP), AX # Load b into AX register
ADDQ "".a+8(SP), AX # Add a to AX
MOVQ AX, "".~r2+24(SP) # Store result
RET
# Analysis:
# - NOSPLIT: Function doesn't need a stack-growth check
# - Only four instructions (MOVQ/ADDQ/MOVQ/RET), no branches
# - Direct register operations
Assembly Optimization Patterns
package main

import "testing"

// Example 1: Loop unrolling analysis
func ProcessLoop(data []int) int {
    sum := 0
    for i := 0; i < len(data); i++ {
        sum += data[i]
    }
    return sum
}

// Example 2: Unrolled, SIMD-friendly code
// (the Go compiler does little auto-vectorization, but the unrolled
// body keeps data flowing through registers)
func ProcessSIMD(data []float64) float64 {
    sum := 0.0
    i := 0
    for ; i+3 < len(data); i += 4 {
        sum += data[i] + data[i+1] + data[i+2] + data[i+3]
    }
    // Handle the remaining tail elements
    for ; i < len(data); i++ {
        sum += data[i]
    }
    return sum
}

// Benchmark to verify assembly optimizations
func BenchmarkProcessLoop(b *testing.B) {
    data := make([]int, 1000)
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        ProcessLoop(data)
    }
}
Inline Analysis Tool
// inline-analyzer.go - Analyze inlining decisions
package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "log"
)

func analyzeFunctionSize(file string) {
    fset := token.NewFileSet()
    node, err := parser.ParseFile(fset, file, nil, 0)
    if err != nil {
        log.Fatal(err)
    }

    ast.Inspect(node, func(n ast.Node) bool {
        if fn, ok := n.(*ast.FuncDecl); ok {
            if fn.Body != nil {
                lines := fset.Position(fn.Body.End()).Line - fset.Position(fn.Body.Pos()).Line
                fmt.Printf("Function %s: %d lines", fn.Name.Name, lines)

                // Crude heuristic: the real inliner uses an
                // instruction-cost budget, not line count
                if lines < 40 {
                    fmt.Println(" [possibly inlined]")
                } else {
                    fmt.Println(" [likely too large for inlining]")
                }
            }
        }
        return true
    })
}

func main() {
    analyzeFunctionSize("assembly-example.go")
}
Compiler Intrinsics
The Go compiler recognizes certain patterns and replaces them with optimized machine instructions:
What's Happening: Compiler intrinsics are special functions that the compiler recognizes and replaces with direct CPU instructions. Instead of compiling bits.OnesCount64() as function call + loop logic, the compiler emits a single POPCNT instruction. This transforms ~50 instructions into 1 instruction—a dramatic optimization for common bit manipulation operations.
package main

import (
    "math/bits"
)

// Compiler intrinsics - optimized to single CPU instructions

func PopCount(x uint64) int {
    return bits.OnesCount64(x) // Optimized to POPCNT instruction
}

func LeadingZeros(x uint64) int {
    return bits.LeadingZeros64(x) // Optimized to CLZ/BSR
}

func TrailingZeros(x uint64) int {
    return bits.TrailingZeros64(x) // Optimized to CTZ/BSF
}

func RotateLeft(x uint64, k int) uint64 {
    return bits.RotateLeft64(x, k) // Optimized to ROL
}

// Check assembly to verify intrinsics:
// go tool compile -S intrinsics.go | grep -A5 "PopCount"
Escape Analysis Visualization
// escape-analysis.go
package main

// Escapes to heap
func EscapesToHeap() *int {
    x := 42
    return &x // x escapes to heap
}

// Stays on stack
func StaysOnStack() int {
    x := 42
    return x // x stays on stack
}

// Analyze escape decisions:
// go build -gcflags="-m -m" escape-analysis.go
//
// Output:
// ./escape-analysis.go:6:2: x escapes to heap:
// ./escape-analysis.go:6:2:   flow: ~r0 = &x:
// ./escape-analysis.go:6:2:     from &x at ./escape-analysis.go:7:9
// ./escape-analysis.go:6:2:     from return &x at ./escape-analysis.go:7:2
// ./escape-analysis.go:6:2: moved to heap: x
Performance Comparison Tool
// compare-opts.go - Compare optimization levels
package main

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

func buildAndBench(opts string, name string) time.Duration {
    // Build
    cmd := exec.Command("go", "build", "-gcflags="+opts, "-o", name, "main.go")
    if err := cmd.Run(); err != nil {
        fmt.Printf("Build failed: %v\n", err)
        return 0
    }

    // Benchmark
    start := time.Now()
    cmd = exec.Command("./" + name)
    cmd.Run()
    return time.Since(start)
}

func main() {
    tests := map[string]string{
        "no-opt":     "-N -l", // No optimizations, no inlining
        "default":    "",      // Default optimization
        "aggressive": "-l=4",  // Historically: more aggressive inlining (recent compilers may ignore levels above 1)
    }

    for name, opts := range tests {
        duration := buildAndBench(opts, "app_"+name)
        fmt.Printf("%-12s: %v\n", name, duration)
    }

    // Cleanup
    os.Remove("app_no-opt")
    os.Remove("app_default")
    os.Remove("app_aggressive")
}
Advanced Compiler Optimization Patterns
Leveraging Escape Analysis
Understanding escape analysis helps you write code that the compiler can optimize better. Variables that don't escape can be stack-allocated instead of heap-allocated:
package main

import "fmt"

// Good: x doesn't escape, allocated on stack
func StackAllocation() int {
    x := make([]int, 100)
    return len(x)
}

// Bad: slice escapes, allocated on heap
func HeapAllocation() []int {
    x := make([]int, 100)
    return x // Escapes!
}

// Excellent: Reuse existing slice to avoid allocations
func ReuseSlice(buf []int) {
    for i := 0; i < len(buf); i++ {
        buf[i] = i * 2
    }
}

// Compiler analysis (go build -gcflags="-m"):
// StackAllocation: no escape - stack allocated
// HeapAllocation: escapes via return - heap allocated
// ReuseSlice: no allocations - reuses caller's memory

func main() {
    // Stack-efficient version
    x := make([]int, 1000)
    ReuseSlice(x) // No new allocations
    fmt.Println(x[0])
}
Interface Optimization Techniques
Interfaces are powerful but can be optimized:
1package interface_optimization
2
3import "fmt"
4
5// Heavy interface - lots of method calls
6type HeavyReader interface {
7 Read(p []byte) (n int, err error)
8 Close() error
9 Stat() (name string, size int64)
10 IsEOF() bool
11 Reset()
12}
13
14// Lightweight interface - only essentials
15type LightReader interface {
16 Read(p []byte) (n int, err error)
17}
18
19// Specialized interface - better inlining
20type ByteReader interface {
21 ReadByte() (byte, error)
22}
23
24 // Concrete implementation - often faster than interface
25type FastBuffer struct {
26 data []byte
27 pos int
28}
29
30var errEOF = fmt.Errorf("EOF") // allocated once, not on every call
31func (b *FastBuffer) ReadByte() (byte, error) {
32 if b.pos >= len(b.data) {
33 return 0, errEOF
34 }
35 c := b.data[b.pos]
36 b.pos++
37 return c, nil
38}
39// Compare performance:
40// - Using FastBuffer directly: All calls inlined
41// - Using ByteReader interface: Some inlining depending on usage
42// - Using HeavyReader interface: Less inlining opportunities
43
44// Optimization tips:
45// 1. Minimize methods in interfaces
46// 2. Use concrete types when performance critical
47// 3. Small interfaces inline better
48// 4. Consider type assertions for fast path
49
50func main() {
51 buf := &FastBuffer{data: []byte("hello")}
52 for buf.pos < len(buf.data) {
53 b, _ := buf.ReadByte()
54 fmt.Print(string(b))
55 }
56}
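Tip 4 (type assertions for the fast path) can be sketched like this — `countBytes` is a hypothetical helper of ours, and the fast path assumes the common concrete type:

```go
package main

import (
	"fmt"
	"io"
)

type ByteReader interface {
	ReadByte() (byte, error)
}

type FastBuffer struct {
	data []byte
	pos  int
}

func (b *FastBuffer) ReadByte() (byte, error) {
	if b.pos >= len(b.data) {
		return 0, io.EOF
	}
	c := b.data[b.pos]
	b.pos++
	return c, nil
}

// countBytes drains r. When the concrete type is *FastBuffer it
// skips per-byte dynamic dispatch entirely.
func countBytes(r ByteReader) int {
	if fb, ok := r.(*FastBuffer); ok { // fast path: known concrete type
		n := len(fb.data) - fb.pos
		fb.pos = len(fb.data)
		return n
	}
	n := 0 // slow path: one interface call per byte
	for {
		if _, err := r.ReadByte(); err != nil {
			return n
		}
		n++
	}
}

func main() {
	fmt.Println(countBytes(&FastBuffer{data: []byte("hello")})) // 5
}
```

The trade-off: the fast path only helps if the asserted type really is the common case, so measure before adding one.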
Build-Time Optimizations
The build process itself offers optimization opportunities:
1// version.go
2package main
3
4// Inject values at link time:
5// go build -ldflags "-X main.version=1.0.0"
6
7var (
8 version = "dev"
9 commit = "unknown"
10)
11
12// Platform-specific code goes in separate files guarded by build
13// constraints. The //go:build line must appear before the package
14// clause, followed by a blank line:
15
16// opt_linux.go
17//go:build linux && amd64
18
19package main
20
21// Linux-specific optimizations here
22
23// opt_darwin.go
24//go:build darwin && amd64
25
26package main
27
28// macOS-specific optimizations here
29
30// Strip the symbol table and DWARF debug info to shrink the binary:
31// go build -ldflags "-s -w"
32
33// Static builds without C dependencies: disable cgo
34// CGO_ENABLED=0 go build
35
36// Profile-guided optimization:
37// 1. go test -cpuprofile=cpu.prof ./...
38// 2. go build -pgo=cpu.prof // use the profile to guide optimization
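A quick way to confirm the `-X` flag works: build this sketch with and without the flag; unmodified, it prints the fallback value.

```go
package main

import "fmt"

// Default build:   go run .                                     -> version: dev
// With injection:  go build -ldflags "-X main.version=1.0.0"    -> version: 1.0.0
var version = "dev"

func main() {
	fmt.Println("version:", version)
}
```

Note that `-X` only works on package-level string variables, not constants.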
Integration and Mastery - Optimization Workflow
The Optimization Mindset
Successful optimization follows a disciplined approach:
- Measure First: Use profiling, benchmarking, and real metrics
- Identify Bottlenecks: Focus on what actually matters
- Understand Trade-offs: Every optimization has costs
- Test Thoroughly: Verify improvements don't break functionality
- Document Changes: Explain why optimizations exist
- Monitor Results: Ensure improvements persist in production
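The "measure first" step doesn't require a test file: `testing.Benchmark` runs a benchmark from any ordinary program. A sketch comparing two string-concatenation strategies (the function names are ours):

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Naive: += copies the whole string on every iteration (O(n^2)).
func concatNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// strings.Builder appends into one growing buffer (amortized O(n)).
func concatBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := make([]string, 1000)
	for i := range parts {
		parts[i] = "x"
	}

	naive := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			concatNaive(parts)
		}
	})
	builder := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			concatBuilder(parts)
		}
	})
	fmt.Println("naive:  ", naive)
	fmt.Println("builder:", builder)
}
```

Absolute timings vary by machine; what matters for the decision is the relative gap.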
Common Optimization Patterns
1package optimization_patterns
2
3import (
4 "sync"
5)
6
7
8
9// Pattern 1: Object pooling for frequent allocations
10type BufferPool struct {
11 sync.Pool
12}
13
14func NewBufferPool() *BufferPool {
15 return &BufferPool{
16 Pool: sync.Pool{
17 New: func() interface{} {
18 return make([]byte, 0, 4096)
19 },
20 },
21 }
22}
23
24func (p *BufferPool) Get() []byte {
25 return p.Pool.Get().([]byte)[:0]
26}
27
28func (p *BufferPool) Put(buf []byte) {
29 if cap(buf) <= 65536 { // Reuse only reasonable sizes
30 p.Pool.Put(buf)
31 }
32}
33
34// Pattern 2: Avoiding JSON unmarshaling overhead
35type CachedConfig struct {
36 mu sync.RWMutex
37 data map[string]interface{}
38 rawJSON []byte
39 lastMod int64
40}
41
42func (c *CachedConfig) GetValue(key string) interface{} {
43 c.mu.RLock()
44 defer c.mu.RUnlock()
45 return c.data[key]
46}
47
48// Pattern 3: Preallocated slices for known sizes
49func ProcessBatch(items []string) []int {
50 // Allocate exact size needed
51 results := make([]int, len(items))
52 for i, item := range items {
53 results[i] = len(item)
54 }
55 return results
56}
57
58// Pattern 4: Context-aware optimizations
59type OptimizedHandler struct {
60 pool *BufferPool
61}
62
63func (h *OptimizedHandler) Process(data []byte) (string, error) {
64 buf := h.pool.Get()
65 // defer captures this buf value: if append below grows the
66 // slice, the original buffer is what returns to the pool
67 defer h.pool.Put(buf)
68 buf = append(buf, data...)
69 return string(buf), nil
70}
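A usage sketch for the BufferPool pattern above (the types are repeated so the example stands alone):

```go
package main

import (
	"fmt"
	"sync"
)

type BufferPool struct {
	sync.Pool
}

func NewBufferPool() *BufferPool {
	return &BufferPool{
		Pool: sync.Pool{
			New: func() interface{} {
				return make([]byte, 0, 4096)
			},
		},
	}
}

func (p *BufferPool) Get() []byte {
	return p.Pool.Get().([]byte)[:0]
}

func (p *BufferPool) Put(buf []byte) {
	if cap(buf) <= 65536 { // reuse only reasonable sizes
		p.Pool.Put(buf)
	}
}

func main() {
	pool := NewBufferPool()

	// Each iteration reuses a pooled buffer instead of allocating.
	for i := 0; i < 3; i++ {
		buf := pool.Get()
		buf = append(buf, "request payload"...)
		fmt.Println(len(buf), cap(buf) >= 4096)
		pool.Put(buf)
	}
}
```

Keep in mind sync.Pool gives no reuse guarantee: the runtime may drop pooled buffers at any GC, so treat it purely as an allocation amortizer.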
Monitoring and Profiling Best Practices
Always measure in production-like conditions:
- Use realistic data sizes
- Run on similar hardware
- Account for concurrent usage
- Monitor memory and CPU separately
- Track metrics over time
Tools for different scenarios:
- CPU profiling: go test -cpuprofile=cpu.prof, then go tool pprof cpu.prof
- Memory profiling: go test -memprofile=mem.prof
- Allocation sites: go tool pprof -alloc_objects (or -alloc_space) mem.prof
- Goroutines: the /debug/pprof/goroutine endpoint from net/http/pprof
- Benchmarks: go test -bench=. -benchmem
Remember: "The best optimization is not doing unnecessary work." - Rob Pike
Practice Exercises
Exercise 1: Optimize Function Inlining
Learning Objective: Master function inlining techniques to eliminate call overhead and improve performance in hot code paths.
Context: Function inlining is critical for performance-sensitive applications like game engines and high-frequency trading systems. Google's Go team optimized the standard library by identifying key functions that benefit from inlining, resulting in 2-5x performance improvements in string operations and JSON parsing.
Difficulty: Intermediate | Time: 15-20 minutes
Refactor the following code to improve compiler inlining opportunities and eliminate unnecessary function call overhead:
1package main
2
3import "math"
4
5type Point struct {
6 X, Y int
7}
8
9func (p Point) Distance(other Point) float64 {
10 dx := float64(p.X - other.X)
11 dy := float64(p.Y - other.Y)
12 return math.Sqrt(dx*dx + dy*dy)
13}
14
15func processPoints(points []Point) float64 {
16 total := 0.0
17 for i := 0; i < len(points)-1; i++ {
18 total += points[i].Distance(points[i+1])
19 }
20 return total
21}
Task: Optimize the code to make Distance() more likely to inline and reduce call overhead in the hot path.
Solution
1package main
2
3import "math"
4
5type Point struct {
6 X, Y int
7}
8
9// Split into smaller functions for better inlining
10func (p Point) DistanceSquared(other Point) int {
11 dx := p.X - other.X
12 dy := p.Y - other.Y
13 return dx*dx + dy*dy // Inlines well
14}
15
16func (p Point) Distance(other Point) float64 {
17 return math.Sqrt(float64(p.DistanceSquared(other)))
18}
19
20// Optimized: Avoid repeated function calls
21func processPoints(points []Point) float64 {
22 if len(points) < 2 {
23 return 0
24 }
25
26 total := 0.0
27 for i := range points[:len(points)-1] {
28 // Inline-friendly: simple computation
29 dx := float64(points[i].X - points[i+1].X)
30 dy := float64(points[i].Y - points[i+1].Y)
31 total += math.Sqrt(dx*dx + dy*dy)
32 }
33 return total
34}
35
36// Alternative: If you don't need exact distances
37func processPointsFast(points []Point) int {
38 if len(points) < 2 {
39 return 0
40 }
41
42 total := 0
43 for i := range points[:len(points)-1] {
44 // Use squared distance
45 total += points[i].DistanceSquared(points[i+1])
46 }
47 return total
48}
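To confirm the refactor behaves identically, and to see what actually inlines, build with `-gcflags="-m"`; small leaf methods like `DistanceSquared` are typically reported as "can inline". A self-contained check using a 3-4-5 triangle:

```go
package main

import (
	"fmt"
	"math"
)

type Point struct {
	X, Y int
}

// Small leaf method: a good inlining candidate.
func (p Point) DistanceSquared(other Point) int {
	dx := p.X - other.X
	dy := p.Y - other.Y
	return dx*dx + dy*dy
}

func (p Point) Distance(other Point) float64 {
	return math.Sqrt(float64(p.DistanceSquared(other)))
}

func main() {
	a, b := Point{0, 0}, Point{3, 4}
	fmt.Println(a.DistanceSquared(b)) // 25
	fmt.Println(a.Distance(b))        // 5
}
```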
Exercise 2: Eliminate Bounds Checks
Learning Objective: Master bounds check elimination techniques to achieve maximum performance in array and slice operations.
Context: Bounds check elimination is crucial for numerical computing and data processing applications. Companies like Wolfram Research and financial modeling firms optimize matrix operations by eliminating bounds checks, achieving 20-30% performance improvements in computational kernels.
Difficulty: Intermediate | Time: 20-25 minutes
Optimize this matrix multiplication function by eliminating bounds checks in the inner loops where they cause the most performance impact:
1package main
2
3func matrixMultiply(a, b [][]int) [][]int {
4 n := len(a)
5 result := make([][]int, n)
6
7 for i := 0; i < n; i++ {
8 result[i] = make([]int, n)
9 for j := 0; j < n; j++ {
10 for k := 0; k < n; k++ {
11 result[i][j] += a[i][k] * b[k][j]
12 }
13 }
14 }
15
16 return result
17}
Task: Apply bounds check elimination techniques to improve performance in this computational hot path.
Solution
1package main
2
3func matrixMultiplyOptimized(a, b [][]int) [][]int {
4 n := len(a)
5 if n == 0 {
6 return nil
7 }
8
9 result := make([][]int, n)
10
11 for i := range a { // Use range to eliminate outer check
12 result[i] = make([]int, n)
13 rowA := a[i] // Hoist bounds check
14
15 for j := range result[i] { // Use range for middle loop
16 sum := 0
17
18 // rowA[k] is provably in range; b[k][j] still gets checked
19 for k := range rowA {
20 sum += rowA[k] * b[k][j]
21 }
22
23 result[i][j] = sum
24 }
25 }
26
27 return result
28}
29
30// Further optimization: transpose b so the inner loop walks both
31// operands sequentially and bounds checks disappear
32func matrixMultiplyFastest(a, b [][]int) [][]int {
33 n := len(a)
34 if n == 0 {
35 return nil
36 }
37
38 // Transpose b once up front (cache-friendly column access)
39 bt := make([][]int, n)
40 for j := range bt {
41 col := make([]int, n)
42 for k := 0; k < n; k++ {
43 col[k] = b[k][j]
44 }
45 bt[j] = col
46 }
47
48 result := make([][]int, n)
49 for i := range a {
50 rowA := a[i]
51 row := make([]int, n)
52 for j := range bt {
53 colB := bt[j]
54 sum := 0
55 // For square input both slices have length n: checks eliminated
56 for k := range rowA {
57 sum += rowA[k] * colB[k]
58 }
59 row[j] = sum
60 }
61 result[i] = row
62 }
63
64 return result
65}
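Bounds-check work is only worth it if the answer stays the same; a quick equivalence check of the naive and range-based versions on a 2x2 example:

```go
package main

import "fmt"

// Naive triple loop: bounds check on every access.
func matrixMultiply(a, b [][]int) [][]int {
	n := len(a)
	result := make([][]int, n)
	for i := 0; i < n; i++ {
		result[i] = make([]int, n)
		for j := 0; j < n; j++ {
			for k := 0; k < n; k++ {
				result[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return result
}

// Range-based version: hoists the row and lets the compiler drop
// the rowA[k] check.
func matrixMultiplyOptimized(a, b [][]int) [][]int {
	n := len(a)
	if n == 0 {
		return nil
	}
	result := make([][]int, n)
	for i := range a {
		result[i] = make([]int, n)
		rowA := a[i]
		for j := range result[i] {
			sum := 0
			for k := range rowA {
				sum += rowA[k] * b[k][j]
			}
			result[i][j] = sum
		}
	}
	return result
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{5, 6}, {7, 8}}
	fmt.Println(matrixMultiply(a, b))          // [[19 22] [43 50]]
	fmt.Println(matrixMultiplyOptimized(a, b)) // [[19 22] [43 50]]
}
```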
Exercise 3: Dead Code Elimination
Learning Objective: Implement compile-time dead code elimination techniques to create optimized production builds without debug overhead.
Context: Dead code elimination is essential for creating high-performance production builds. Companies like Google and Netflix use build tags and constants to eliminate debug code entirely from production binaries, reducing binary size by 30% and improving runtime performance by removing unnecessary checks.
Difficulty: Intermediate | Time: 15-20 minutes
Use build tags and compile-time constants to eliminate debug code completely from production builds:
1package main
2
3import (
4 "log"
5 "time"
6)
7
8
9func processRequest(data []byte) []byte {
10 start := time.Now()
11
12 log.Printf("Processing %d bytes", len(data))
13
14 result := transform(data)
15
16 log.Printf("Transformed in %v", time.Since(start))
17
18 return result
19}
20
21func transform(data []byte) []byte {
22 // Transformation logic
23 return data
24}
Task: Eliminate all logging overhead and debug code from production builds using compile-time techniques.
Solution
1// config_release.go (built by default)
2//go:build !debug
3
4package main
5
6const DebugMode = false
7
8// config_debug.go (built with -tags debug)
9//go:build debug
10
11package main
12
13const DebugMode = true
14
15// main.go
16package main
17
18import (
19 "log"
20 "time"
21)
22
23func processRequest(data []byte) []byte {
24 var start time.Time
25 if DebugMode {
26 start = time.Now()
27 log.Printf("Processing %d bytes", len(data))
28 }
29
30 result := transform(data)
31
32 if DebugMode {
33 // Dead code when DebugMode == false: eliminated entirely
34 log.Printf("Transformed in %v", time.Since(start))
35 }
36
37 return result
38}
39
40func transform(data []byte) []byte {
41 return data
42}
43
44// Better approach: compile-time function selection. Build
45// constraints apply to whole files, not single functions, so each
46// variant lives in its own file.
47
48// log_release.go (built by default)
49//go:build !debug
50
51package main
52
53// Empty body: calls inline to nothing in production builds
54func debugLog(format string, args ...interface{}) {}
55
56// log_debug.go (built with -tags debug)
57//go:build debug
58
59package main
60
61import "log"
62
63func debugLog(format string, args ...interface{}) {
64 log.Printf(format, args...)
65}
66
67func processRequestOptimized(data []byte) []byte {
68 start := time.Now()
69 debugLog("Processing %d bytes", len(data))
70
71 result := transform(data)
72
73 debugLog("Transformed in %v", time.Since(start))
74 return result
75}
76
77// Caveat: arguments to debugLog are still evaluated in release
78// builds; the if DebugMode pattern above avoids even that cost.
79
80// Build:
81// Production: go build
82// Debug:      go build -tags debug
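When separate files feel heavyweight, a single compile-time constant already gets you dead-code elimination: the compiler removes a constant-false branch entirely. A minimal single-file sketch:

```go
package main

import (
	"fmt"
	"log"
)

// Flip to true to re-enable the debug path (or generate this
// constant per build as shown in the solution above).
const DebugMode = false

func process(data []byte) []byte {
	if DebugMode { // constant-false: branch and log call are eliminated
		log.Printf("processing %d bytes", len(data))
	}
	return data
}

func main() {
	fmt.Println(string(process([]byte("abc")))) // abc
}
```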
Exercise 4: Loop Optimization
Learning Objective: Master advanced loop optimization techniques including fusion, unrolling, and invariant code motion.
Context: Loop optimization is critical for data processing and machine learning applications. Frameworks like TensorFlow and PyTorch invest significant engineering effort in optimizing loop patterns, achieving 3-5x performance improvements in numerical computations through careful loop restructuring.
Difficulty: Advanced | Time: 25-30 minutes
Optimize this loop-heavy function by applying advanced loop optimization techniques to minimize memory access and maximize CPU cache efficiency:
1package main
2
3func processMatrix(matrix [][]int) {
4 rows := len(matrix)
5 cols := len(matrix[0])
6
7 // Pass 1: Multiply by 2
8 for i := 0; i < rows; i++ {
9 for j := 0; j < cols; j++ {
10 matrix[i][j] *= 2
11 }
12 }
13
14 // Pass 2: Add row index
15 for i := 0; i < rows; i++ {
16 for j := 0; j < cols; j++ {
17 matrix[i][j] += i
18 }
19 }
20
21 // Pass 3: Add column index
22 for i := 0; i < rows; i++ {
23 for j := 0; j < cols; j++ {
24 matrix[i][j] += j
25 }
26 }
27}
Task: Apply loop fusion, unrolling, and other optimizations to reduce memory bandwidth and improve cache locality.
Solution
1package main
2
3// Optimized: Loop fusion - single pass
4func processMatrixOptimized(matrix [][]int) {
5 rows := len(matrix)
6 if rows == 0 {
7 return
8 }
9 // no cols variable needed: range drives both loops below
10
11 for i := range matrix {
12 row := matrix[i] // Hoist bounds check
13
14 for j := range row {
15 // Fuse all three operations
16 row[j] = row[j]*2 + i + j
17 }
18 }
19}
20
21// Further optimization: Explicit unrolling for small matrices
22func processMatrixUnrolled(matrix [][]int) {
23 rows := len(matrix)
24 if rows == 0 {
25 return
26 }
27 cols := len(matrix[0])
28
29 for i := range matrix {
30 row := matrix[i]
31 j := 0
32
33 // Process 4 elements at a time
34 for j+3 < cols {
35 row[j] = row[j]*2 + i + j
36 row[j+1] = row[j+1]*2 + i + j + 1
37 row[j+2] = row[j+2]*2 + i + j + 2
38 row[j+3] = row[j+3]*2 + i + j + 3
39 j += 4
40 }
41
42 // Handle remaining elements
43 for ; j < cols; j++ {
44 row[j] = row[j]*2 + i + j
45 }
46 }
47}
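Loop fusion must preserve the result of applying the three passes in order; a quick check on a 2x2 matrix:

```go
package main

import "fmt"

// Three separate passes, as in the original code.
func threePass(m [][]int) {
	for i := range m {
		for j := range m[i] {
			m[i][j] *= 2
		}
	}
	for i := range m {
		for j := range m[i] {
			m[i][j] += i
		}
	}
	for i := range m {
		for j := range m[i] {
			m[i][j] += j
		}
	}
}

// Fused single pass, as in the optimized solution.
func fused(m [][]int) {
	for i := range m {
		row := m[i]
		for j := range row {
			row[j] = row[j]*2 + i + j
		}
	}
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{1, 2}, {3, 4}}
	threePass(a)
	fused(b)
	fmt.Println(a) // [[2 5] [7 10]]
	fmt.Println(b) // [[2 5] [7 10]]
}
```

Fusion is safe here because the three updates commute per element; passes with cross-element dependencies cannot always be fused this way.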
Exercise 5: PGO-Driven Optimization
Learning Objective: Master profile-guided optimization to let the compiler make intelligent optimization decisions based on runtime profiling data.
Context: PGO is a game-changer for performance-critical applications. Go's PGO implementation helped optimize the Go compiler itself, achieving 5-15% performance improvements in real workloads. Companies like Uber and Dropbox use PGO to optimize their critical path code based on production traffic patterns.
Difficulty: Advanced | Time: 30-35 minutes
Implement a function that benefits significantly from profile-guided optimization by creating realistic workload patterns:
1package main
2
3func processData(data []int) int {
4 result := 0
5
6 for _, v := range data {
7 if v%2 == 0 {
8 result += expensiveEven(v)
9 } else {
10 result += expensiveOdd(v)
11 }
12 }
13
14 return result
15}
16
17func expensiveEven(n int) int {
18 sum := 0
19 for i := 0; i < n; i++ {
20 sum += i * i
21 }
22 return sum
23}
24
25func expensiveOdd(n int) int {
26 sum := 0
27 for i := 0; i < n; i++ {
28 sum += i * i * i
29 }
30 return sum
31}
Task: Set up PGO workflow, generate representative profiles, and demonstrate optimization impact on hot path performance.
Solution
1// main.go
2package main
3
4import (
5 "fmt"
6 "os"
7 "runtime/pprof"
8)
9
10func processData(data []int) int {
11 result := 0
12
13 for _, v := range data {
14 if v%2 == 0 {
15 result += expensiveEven(v)
16 } else {
17 result += expensiveOdd(v)
18 }
19 }
20
21 return result
22}
23
24func expensiveEven(n int) int {
25 sum := 0
26 for i := 0; i < n; i++ {
27 sum += i * i
28 }
29 return sum
30}
31
32func expensiveOdd(n int) int {
33 sum := 0
34 for i := 0; i < n; i++ {
35 sum += i * i * i
36 }
37 return sum
38}
39
40func main() {
41 // Generate profile
42 if len(os.Args) > 1 && os.Args[1] == "profile" {
43 f, _ := os.Create("cpu.prof")
44 pprof.StartCPUProfile(f)
45 defer pprof.StopCPUProfile()
46 }
47
48 // Representative workload
49 data := make([]int, 10000)
50 for i := range data {
51 if i%10 < 8 {
52 data[i] = i * 2 // 80% even
53 } else {
54 data[i] = i*2 + 1 // 20% odd
55 }
56 }
57
58 // Run many times for profiling
59 for i := 0; i < 100; i++ {
60 _ = processData(data)
61 }
62
63 fmt.Println("Done")
64}
65
66// Step 1: Generate profile
67// go run main.go profile
68
69// Step 2: Name the profile default.pgo (go build then finds it automatically)
70// mv cpu.prof default.pgo
71
72// Step 3: Build with PGO
73// go build -pgo=default.pgo -o app_pgo
74
75// Step 4: Build without PGO for comparison
76// go build -o app_no_pgo
77
78// Step 5: Benchmark both
79// time ./app_pgo
80// time ./app_no_pgo
81
82// PGO will optimize hot path more aggressively