Text Processing and Unicode

Why This Matters

Consider building a global social media platform that needs to handle posts in every language - from English and Chinese to Arabic and emoji. Or a search engine that must understand user queries across different scripts and languages. Or a data pipeline that processes customer feedback from around the world.

Text processing is the foundation of modern software applications. It enables:

  • Global Applications: Handle text in hundreds of languages and scripts
  • Search & Discovery: Build intelligent search and recommendation systems
  • Data Processing: Parse and analyze unstructured text data
  • User Interfaces: Create responsive, multilingual user experiences
  • Communication: Build chatbots, translation services, and messaging systems
  • Content Management: Process documents, articles, and user-generated content

In today's connected world, proper text processing isn't optional - it's essential for reaching global audiences and handling the complexity of human language.

Learning Objectives

After completing this article, you will be able to:

  1. Master Unicode: Understand UTF-8 encoding and proper Unicode handling
  2. Process Text: Use Go's string manipulation functions effectively
  3. Normalize Content: Handle text normalization and canonical forms
  4. Parse & Transform: Build text parsers and transformers
  5. Handle Internationalization: Work with multiple languages and scripts
  6. Optimize Performance: Write efficient text processing algorithms
  7. Apply Regex: Master regular expressions for pattern matching
  8. Implement Security: Avoid common text processing vulnerabilities

Core Concepts

Unicode and UTF-8 in Go

Go was designed with Unicode support from the ground up. Understanding how Go handles text is crucial for building robust applications:

  • Unicode: Universal character encoding supporting all writing systems
  • UTF-8: Variable-width encoding that stores Unicode characters efficiently
  • Runes: Go's representation of Unicode code points (int32 values)
  • Strings: Immutable sequences of bytes interpreted as UTF-8 text
  • Code Points: Numerical values representing characters in Unicode
  • Normalization: Converting text to canonical forms for consistent comparison

Text Processing Challenges

Text processing presents unique challenges compared to other data types:

  1. Encoding Issues: Different character encodings and byte order marks
  2. Normalization: Same text can have multiple byte representations
  3. Language Rules: Different languages have different sorting, case, and punctuation rules
  4. Performance: String operations can be expensive for large texts
  5. Security: Text processing can introduce security vulnerabilities (injection attacks)
  6. Directionality: Right-to-left languages require special handling
  7. Grapheme Clusters: Some "characters" are composed of multiple code points, as the sketch below shows
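
For example, "é" can be written as one code point or two, and a national flag emoji is always two. The minimal sketch below (standard library only) makes the mismatch visible; counting true grapheme clusters requires a dedicated library such as github.com/rivo/uniseg.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // One on-screen glyph, two code points: 'e' + combining acute accent.
        s := "e\u0301"
        fmt.Println(utf8.RuneCountInString(s)) // 2, even though it renders as "é"

        // Flag emoji are pairs of regional-indicator code points.
        flag := "🇺🇸"
        fmt.Println(utf8.RuneCountInString(flag)) // 2
    }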

Unicode Fundamentals

Understanding UTF-8 Encoding

UTF-8 is a variable-width character encoding that uses 1-4 bytes per character. Understanding its structure is essential for proper text handling.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== UTF-8 Encoding Deep Dive ===")

        // 1. UTF-8 byte patterns
        fmt.Println("\n1. UTF-8 Byte Structure:")

        chars := []struct {
            char  rune
            name  string
            bytes int
        }{
            {'A', "ASCII Letter", 1},
            {'£', "Pound Sign", 2},
            {'€', "Euro Sign", 3},
            {'世', "Chinese Character", 3},
            {'🌍', "Earth Emoji", 4},
            {'𝕳', "Mathematical Bold", 4},
        }

        for _, c := range chars {
            fmt.Printf("\nCharacter: %c (U+%04X)\n", c.char, c.char)
            fmt.Printf("  Name: %s\n", c.name)
            fmt.Printf("  Expected bytes: %d\n", c.bytes)
            fmt.Printf("  Actual bytes: %d\n", utf8.RuneLen(c.char))

            // Show byte representation
            str := string(c.char)
            fmt.Printf("  UTF-8 bytes: ")
            for i := 0; i < len(str); i++ {
                fmt.Printf("%08b ", str[i])
            }
            fmt.Printf("\n  Hex bytes: ")
            for i := 0; i < len(str); i++ {
                fmt.Printf("%02X ", str[i])
            }
            fmt.Println()
        }

        // 2. String vs byte vs rune length
        fmt.Println("\n2. String Length Comparisons:")

        texts := []string{
            "Hello",          // ASCII only
            "Café",           // Latin with accent
            "你好世界",       // Chinese
            "مرحبا",          // Arabic
            "Hello 🌍 World", // Mixed with emoji
            "Ñoño",           // Spanish with tildes
        }

        for _, text := range texts {
            fmt.Printf("\nText: %s\n", text)
            fmt.Printf("  len(string): %d bytes\n", len(text))
            fmt.Printf("  utf8.RuneCountInString: %d runes\n", utf8.RuneCountInString(text))
            fmt.Printf("  []rune length: %d runes\n", len([]rune(text)))

            // Show each character's byte size
            fmt.Printf("  Character breakdown: ")
            for i, r := range text {
                size := utf8.RuneLen(r)
                fmt.Printf("%c(%d bytes at pos %d) ", r, size, i)
            }
            fmt.Println()
        }

        // 3. Valid UTF-8 checking
        fmt.Println("\n3. UTF-8 Validation:")

        testStrings := []struct {
            bytes []byte
            desc  string
        }{
            {[]byte("Hello"), "Valid ASCII"},
            {[]byte{0x48, 0x65, 0x6C, 0x6C, 0x6F}, "Valid ASCII (hex)"},
            {[]byte{0xE4, 0xB8, 0x96}, "Valid 3-byte Chinese"},
            {[]byte{0xFF, 0xFE}, "Invalid UTF-8"},
            {[]byte{0xC0, 0x80}, "Invalid overlong encoding"},
            {[]byte{0xE4, 0xB8}, "Incomplete UTF-8 sequence"}, // first 2 bytes of 世
        }

        for _, test := range testStrings {
            valid := utf8.Valid(test.bytes)
            fmt.Printf("%s: valid=%t, bytes=%v\n", test.desc, valid, test.bytes)

            if valid {
                fmt.Printf("  String: %s\n", string(test.bytes))
            } else {
                fmt.Printf("  Invalid UTF-8 - cannot convert safely\n")
            }
        }

        // 4. Rune decoding and encoding
        fmt.Println("\n4. Manual Rune Decoding:")

        str := "Hello, 世界!"
        for i := 0; i < len(str); {
            r, size := utf8.DecodeRuneInString(str[i:])
            fmt.Printf("Position %d: rune=%c (U+%04X), size=%d bytes\n", i, r, r, size)
            i += size
        }

        // 5. Encoding runes to bytes
        fmt.Println("\n5. Encoding Runes to UTF-8:")

        runes := []rune{'H', 'e', 'l', 'l', 'o', ',', ' ', '世', '界'}
        for _, r := range runes {
            buf := make([]byte, utf8.UTFMax) // Max UTF-8 size (4 bytes)
            size := utf8.EncodeRune(buf, r)
            fmt.Printf("Rune %c: bytes=%v (size=%d)\n", r, buf[:size], size)
        }
    }

Real-World Applications:

  • File Processing: Validate UTF-8 encoding in uploaded files
  • API Validation: Ensure API requests contain valid UTF-8 text
  • Database Storage: Store text efficiently with proper encoding
  • Protocol Handling: Parse text protocols with mixed encodings

⚠️ Critical Warning: Never use byte indices to split strings containing non-ASCII characters. Always use rune-based operations or utf8 package functions.
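
Here is a minimal sketch of that failure mode, using only the standard library:

    package main

    import "fmt"

    func main() {
        s := "né" // 'é' occupies 2 bytes in UTF-8

        // Byte slicing can cut a rune in half, leaving invalid UTF-8:
        fmt.Println(s[:2]) // "n\xc3" - a garbled trailing byte

        // Rune slicing respects character boundaries:
        r := []rune(s)
        fmt.Println(string(r[:2])) // "né"
    }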

Unicode and Rune Manipulation

Understanding how Go handles Unicode characters is fundamental to proper text processing.

    package main

    import (
        "fmt"
        "unicode"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== Unicode and Rune Manipulation ===")

        // 1. String vs Rune slice
        fmt.Println("\n1. String and Rune Slices:")
        text := "Hello, 世界! 🌍"

        fmt.Printf("Original string: %s\n", text)
        fmt.Printf("String length (bytes): %d\n", len(text))
        fmt.Printf("Rune count: %d\n", utf8.RuneCountInString(text))

        // Convert to rune slice
        runes := []rune(text)
        fmt.Printf("Rune slice length: %d\n", len(runes))
        fmt.Printf("Runes: %U\n", runes)

        // 2. Rune iteration and properties
        fmt.Println("\n2. Rune Properties:")
        for i, r := range text {
            fmt.Printf("Byte index %d: %c (U+%04X)\n", i, r, r)
            fmt.Printf("  Size in bytes: %d\n", utf8.RuneLen(r))
            fmt.Printf("  Is letter: %t\n", unicode.IsLetter(r))
            fmt.Printf("  Is digit: %t\n", unicode.IsDigit(r))
            fmt.Printf("  Is space: %t\n", unicode.IsSpace(r))
            fmt.Printf("  Is punctuation: %t\n", unicode.IsPunct(r))
            fmt.Printf("  Is symbol: %t\n", unicode.IsSymbol(r))
        }

        // 3. Unicode categories
        fmt.Println("\n3. Unicode Categories:")
        testRunes := []rune{'A', 'a', '中', 'ى', '5', '€', '🌍', '\n', '\t'}

        for _, r := range testRunes {
            fmt.Printf("\nCharacter: %c (U+%04X)\n", r, r)

            if unicode.IsLetter(r) {
                if unicode.IsUpper(r) {
                    fmt.Printf("  → Uppercase letter\n")
                } else if unicode.IsLower(r) {
                    fmt.Printf("  → Lowercase letter\n")
                } else {
                    fmt.Printf("  → Other letter (no case)\n")
                }
            }

            if unicode.IsNumber(r) {
                fmt.Printf("  → Number character\n")
            }

            if unicode.IsPunct(r) {
                fmt.Printf("  → Punctuation\n")
            }

            if unicode.IsSymbol(r) {
                fmt.Printf("  → Symbol\n")
            }

            if unicode.IsSpace(r) {
                fmt.Printf("  → Whitespace\n")
            }

            if unicode.IsControl(r) {
                fmt.Printf("  → Control character\n")
            }

            fmt.Printf("  Category: %s\n", getUnicodeCategory(r))
        }

        // 4. Case transformations
        fmt.Println("\n4. Unicode Case Transformations:")
        caseTests := []rune{'A', 'a', 'Σ', 'σ', 'ς', 'İ', 'i'}

        for _, r := range caseTests {
            fmt.Printf("\nOriginal: %c (U+%04X)\n", r, r)
            fmt.Printf("  ToUpper: %c (U+%04X)\n", unicode.ToUpper(r), unicode.ToUpper(r))
            fmt.Printf("  ToLower: %c (U+%04X)\n", unicode.ToLower(r), unicode.ToLower(r))
            fmt.Printf("  ToTitle: %c (U+%04X)\n", unicode.ToTitle(r), unicode.ToTitle(r))
        }

        // 5. String building with runes
        fmt.Println("\n5. Safe String Building:")

        // Building from raw bytes is only safe for ASCII; arbitrary byte
        // edits can corrupt multi-byte UTF-8 sequences
        badBytes := []byte{72, 101, 108, 108, 111} // "Hello"
        badString := string(badBytes)
        fmt.Printf("Byte manipulation: %s\n", badString)

        // Good approach: rune manipulation works for any Unicode text
        goodRunes := []rune{'H', 'e', 'l', 'l', 'o', ',', ' ', '世', '界'}
        goodString := string(goodRunes)
        fmt.Printf("Rune manipulation: %s\n", goodString)

        // 6. Unicode script detection
        fmt.Println("\n6. Unicode Script Detection:")
        multiScript := "Hello мир 世界 مرحبا"

        for _, r := range multiScript {
            if unicode.IsLetter(r) {
                script := detectScript(r)
                fmt.Printf("Character %c: %s script\n", r, script)
            }
        }
    }

    func getUnicodeCategory(r rune) string {
        switch {
        case unicode.IsLetter(r):
            return "Letter"
        case unicode.IsNumber(r):
            return "Number"
        case unicode.IsPunct(r):
            return "Punctuation"
        case unicode.IsSymbol(r):
            return "Symbol"
        case unicode.IsSpace(r):
            return "Space"
        case unicode.IsControl(r):
            return "Control"
        default:
            return "Other"
        }
    }

    func detectScript(r rune) string {
        switch {
        case r >= 'A' && r <= 'Z' || r >= 'a' && r <= 'z':
            return "Latin"
        case r >= 0x0400 && r <= 0x04FF:
            return "Cyrillic"
        case r >= 0x4E00 && r <= 0x9FFF:
            return "CJK (Chinese/Japanese/Korean)"
        case r >= 0x0600 && r <= 0x06FF:
            return "Arabic"
        case r >= 0x0370 && r <= 0x03FF:
            return "Greek"
        case r >= 0x0590 && r <= 0x05FF:
            return "Hebrew"
        default:
            return "Unknown"
        }
    }

Real-World Applications:

  • Input Validation: Ensure user input contains only allowed characters
  • Password Policies: Check for character variety in passwords
  • Text Analysis: Categorize characters for language detection
  • Security: Detect suspicious character sequences such as homograph attacks (see the sketch below)
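
One defensive check, sketched below with the standard library's script range tables: a string that mixes Latin and Cyrillic letters (such as a Cyrillic 'а' hidden in a Latin brand name) is a common homograph red flag.

    package main

    import (
        "fmt"
        "unicode"
    )

    // mixedScripts reports whether s contains both Latin and Cyrillic letters,
    // a frequent sign of a homograph (lookalike) spoofing attempt.
    func mixedScripts(s string) bool {
        hasLatin, hasCyrillic := false, false
        for _, r := range s {
            switch {
            case unicode.Is(unicode.Latin, r):
                hasLatin = true
            case unicode.Is(unicode.Cyrillic, r):
                hasCyrillic = true
            }
        }
        return hasLatin && hasCyrillic
    }

    func main() {
        fmt.Println(mixedScripts("paypal"))  // false: all Latin
        fmt.Println(mixedScripts("pаypal")) // true: second letter is Cyrillic U+0430
    }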

💡 Best Practice: Always use utf8.RuneCountInString() to count characters (code points), not len(), which returns the byte count.

String Manipulation and Transformation

Core String Operations

Go provides powerful string manipulation functions that work properly with Unicode.

    package main

    import (
        "fmt"
        "strings"
        "unicode"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== String Manipulation and Transformation ===")

        // 1. Basic string operations
        fmt.Println("\n1. Basic String Operations:")
        text := "Hello, 世界! Welcome to Go! 🚀"

        fmt.Printf("Original: %s\n", text)
        fmt.Printf("Character count: %d\n", utf8.RuneCountInString(text))
        fmt.Printf("Byte count: %d\n", len(text))

        // Contains, prefix, suffix
        fmt.Printf("Contains 'World': %t\n", strings.Contains(text, "World"))
        fmt.Printf("Contains '世界': %t\n", strings.Contains(text, "世界"))
        fmt.Printf("Starts with 'Hello': %t\n", strings.HasPrefix(text, "Hello"))
        fmt.Printf("Ends with '🚀': %t\n", strings.HasSuffix(text, "🚀"))

        // Index finding
        fmt.Printf("Index of 'World': %d\n", strings.Index(text, "World"))
        fmt.Printf("Index of '世界': %d\n", strings.Index(text, "世界"))
        fmt.Printf("Last index of 'o': %d\n", strings.LastIndex(text, "o"))

        // 2. String splitting and joining
        fmt.Println("\n2. Splitting and Joining:")
        sentence := "The quick brown fox jumps over the lazy dog"

        // Split by whitespace
        words := strings.Fields(sentence)
        fmt.Printf("Words: %v\n", words)
        fmt.Printf("Word count: %d\n", len(words))

        // Split by specific delimiter
        csv := "apple,banana,cherry,date,elderberry"
        fruits := strings.Split(csv, ",")
        fmt.Printf("Fruits: %v\n", fruits)

        // Split with limit
        limited := strings.SplitN(csv, ",", 3)
        fmt.Printf("Limited split (3): %v\n", limited)

        // Join strings
        joined := strings.Join(words, " | ")
        fmt.Printf("Joined with separator: %s\n", joined)

        // 3. Case transformation with Unicode awareness
        fmt.Println("\n3. Unicode Case Transformation:")
        unicodeText := "hello WORLD! 你好世界 Straße İstanbul"

        fmt.Printf("Original: %s\n", unicodeText)
        fmt.Printf("Upper: %s\n", strings.ToUpper(unicodeText))
        fmt.Printf("Lower: %s\n", strings.ToLower(unicodeText))
        // Note: strings.Title is deprecated (its word-boundary rules mishandle
        // Unicode); prefer golang.org/x/text/cases or toProperTitle below
        fmt.Printf("Title: %s\n", strings.Title(unicodeText))

        // Custom title case
        fmt.Printf("Custom title: %s\n", toProperTitle(unicodeText))

        // 4. String trimming and cleaning
        fmt.Println("\n4. Trimming and Cleaning:")
        messyText := "   Hello, World!   \n\t  "

        fmt.Printf("Original: %q\n", messyText)
        fmt.Printf("TrimSpace: %q\n", strings.TrimSpace(messyText))
        fmt.Printf("TrimPrefix: %q\n", strings.TrimPrefix(messyText, "   "))
        fmt.Printf("TrimSuffix: %q\n", strings.TrimSuffix(messyText, "  \n\t  "))

        // Trim specific characters
        punctuationText := "!!!Hello, World!!!"
        fmt.Printf("Trim '!': %q\n", strings.Trim(punctuationText, "!"))

        // Trim function
        trimmed := strings.TrimFunc(messyText, unicode.IsSpace)
        fmt.Printf("TrimFunc (spaces): %q\n", trimmed)

        // 5. String replacement
        fmt.Println("\n5. String Replacement:")
        template := "Hello, {{name}}! Welcome to {{app}}."

        // Simple replacement
        result := strings.Replace(template, "{{name}}", "Alice", -1)
        result = strings.Replace(result, "{{app}}", "Go Tutorial", -1)
        fmt.Printf("Simple replacement: %s\n", result)

        // Multiple replacements with Replacer
        replacer := strings.NewReplacer(
            "{{name}}", "Bob",
            "{{app}}", "Awesome App",
        )
        result = replacer.Replace(template)
        fmt.Printf("Replacer: %s\n", result)

        // 6. String comparison
        fmt.Println("\n6. String Comparison:")
        compareTests := []struct {
            s1, s2 string
        }{
            {"hello", "Hello"},
            {"café", "cafe"},
            {"test", "test"},
        }

        for _, test := range compareTests {
            fmt.Printf("\n%q vs %q:\n", test.s1, test.s2)
            fmt.Printf("  Equal: %t\n", test.s1 == test.s2)
            fmt.Printf("  EqualFold: %t\n", strings.EqualFold(test.s1, test.s2))
            fmt.Printf("  Compare: %d\n", strings.Compare(test.s1, test.s2))
        }

        // 7. Unicode-safe substring extraction
        fmt.Println("\n7. Unicode-Safe Substring:")
        complexText := "Héllo, 世界! 👋"

        // Dangerous: byte-based substring can split a rune mid-sequence
        fmt.Printf("Original: %s\n", complexText)
        if len(complexText) >= 10 {
            dangerous := complexText[0:10]
            fmt.Printf("Dangerous (bytes 0-10): %s\n", dangerous)
        }

        // Safe: rune-based substring
        safe := safeSubstring(complexText, 0, 7)
        fmt.Printf("Safe (runes 0-7): %s\n", safe)

        safe = safeSubstring(complexText, 7, 3)
        fmt.Printf("Safe (runes 7-10): %s\n", safe)
    }

    func toProperTitle(s string) string {
        words := strings.Fields(s)
        for i, word := range words {
            if len(word) > 0 {
                runes := []rune(word)
                if unicode.IsLetter(runes[0]) {
                    runes[0] = unicode.ToTitle(runes[0])
                    for j := 1; j < len(runes); j++ {
                        runes[j] = unicode.ToLower(runes[j])
                    }
                    words[i] = string(runes)
                }
            }
        }
        return strings.Join(words, " ")
    }

    func safeSubstring(s string, start, length int) string {
        runes := []rune(s)
        if start < 0 {
            start = 0
        }
        if start >= len(runes) {
            return ""
        }
        end := start + length
        if end > len(runes) {
            end = len(runes)
        }
        return string(runes[start:end])
    }

Real-World Applications:

  • Template Engines: Render dynamic content with user data
  • Data Cleaning: Normalize and clean messy text data
  • URL Handling: Parse and manipulate URLs and paths
  • Configuration Files: Process configuration values and settings

Efficient String Building

String concatenation in loops is a common performance bottleneck. Learn the efficient approaches.

    package main

    import (
        "fmt"
        "strings"
        "time"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== Efficient String Building ===")

        // 1. Performance comparison
        fmt.Println("\n1. String Concatenation Performance:")

        iterations := 10000

        // Bad: repeated concatenation
        start := time.Now()
        var bad string
        for i := 0; i < iterations; i++ {
            bad += "x" // Creates new string each time
        }
        badTime := time.Since(start)

        // Good: strings.Builder
        start = time.Now()
        var builder strings.Builder
        builder.Grow(iterations) // Pre-allocate capacity
        for i := 0; i < iterations; i++ {
            builder.WriteByte('x')
        }
        good := builder.String()
        goodTime := time.Since(start)

        fmt.Printf("Concatenation: %v\n", badTime)
        fmt.Printf("Builder: %v\n", goodTime)
        fmt.Printf("Speedup: %.2fx\n", float64(badTime)/float64(goodTime))
        fmt.Printf("Results match: %t\n", len(bad) == len(good))

        // 2. strings.Builder methods
        fmt.Println("\n2. strings.Builder Methods:")

        var b strings.Builder
        b.Grow(100) // Pre-allocate

        b.WriteString("Hello, ")
        b.WriteString("世界! ")
        b.WriteRune('🌍')
        b.WriteByte(' ')

        result := b.String()
        fmt.Printf("Result: %s\n", result)
        fmt.Printf("Length: %d bytes, %d runes\n", b.Len(), utf8.RuneCountInString(result))

        // 3. Building complex strings
        fmt.Println("\n3. Building Complex Strings:")

        var complex strings.Builder
        complex.Grow(200)

        // HTML-like template
        complex.WriteString("<html>\n")
        complex.WriteString("  <body>\n")

        items := []string{"Apple", "Banana", "Cherry"}
        for _, item := range items {
            complex.WriteString("    <li>")
            complex.WriteString(item)
            complex.WriteString("</li>\n")
        }

        complex.WriteString("  </body>\n")
        complex.WriteString("</html>\n")

        fmt.Println(complex.String())

        // 4. Memory efficiency
        fmt.Println("4. Memory Efficiency:")

        var efficient strings.Builder
        efficient.Grow(1000) // Pre-allocate exactly what we need

        fmt.Printf("Initial capacity: %d\n", efficient.Cap())

        for i := 0; i < 100; i++ {
            efficient.WriteString("test ")
        }

        fmt.Printf("After writes, capacity: %d\n", efficient.Cap())
        fmt.Printf("Length: %d\n", efficient.Len())

        // 5. Building with formatting
        fmt.Println("\n5. Formatted String Building:")

        var formatted strings.Builder
        formatted.Grow(100)

        for i := 1; i <= 5; i++ {
            formatted.WriteString(fmt.Sprintf("Line %d: Value = %d\n", i, i*10))
        }

        fmt.Print(formatted.String())
    }

Real-World Applications:

  • Log Aggregation: Build large log messages efficiently
  • JSON/XML Generation: Construct structured data formats
  • Report Generation: Build large text reports
  • Template Rendering: Efficient template engines

💡 Performance Tip: Always use strings.Builder for building strings iteratively. Call Grow() if you know the approximate size to avoid reallocations.

Text Normalization and Comparison

Unicode Normalization Forms

Text normalization is crucial for consistent comparison and processing of Unicode text.

    package main

    import (
        "fmt"
        "strings"
        "unicode"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== Text Normalization and Comparison ===")

        // 1. Unicode normalization problem
        fmt.Println("\n1. Unicode Normalization Problem:")

        // Different ways to write "café"
        cafe := []string{
            "café",       // Combined form: é (U+00E9)
            "cafe\u0301", // Decomposed form: e + combining acute (U+0301)
        }

        fmt.Println("Visual comparison:")
        for i, s := range cafe {
            fmt.Printf("  Version %d: %s\n", i+1, s)
            fmt.Printf("    Bytes: %v\n", []byte(s))
            fmt.Printf("    Runes: %U\n", []rune(s))
            fmt.Printf("    Byte length: %d\n", len(s))
            fmt.Printf("    Rune count: %d\n", utf8.RuneCountInString(s))
        }

        // Direct comparison fails
        fmt.Printf("\nDirect comparison: %q == %q: %t\n", cafe[0], cafe[1], cafe[0] == cafe[1])

        // Normalized comparison
        norm0 := simpleNormalize(cafe[0])
        norm1 := simpleNormalize(cafe[1])
        fmt.Printf("Normalized comparison: %t\n", norm0 == norm1)

        // 2. Case folding for case-insensitive comparison
        fmt.Println("\n2. Case Folding:")

        caseTests := []struct {
            s1, s2 string
        }{
            {"Hello", "hello"},
            {"Straße", "STRASSE"},    // German ß vs SS
            {"İstanbul", "istanbul"}, // Turkish dotted I
        }

        for _, test := range caseTests {
            fmt.Printf("\nComparing %q vs %q:\n", test.s1, test.s2)
            fmt.Printf("  Direct: %t\n", test.s1 == test.s2)
            fmt.Printf("  EqualFold: %t\n", strings.EqualFold(test.s1, test.s2))
            fmt.Printf("  Case folded: %t\n", caseFold(test.s1) == caseFold(test.s2))
        }

        // 3. Whitespace normalization
        fmt.Println("\n3. Whitespace Normalization:")

        whitespaceText := "  Hello\t\nWorld!   \n\t  "
        fmt.Printf("Original: %q\n", whitespaceText)
        fmt.Printf("Normalized: %q\n", normalizeWhitespace(whitespaceText))

        // Multiple spaces and tabs
        messy := "Hello    \t\t   World  \n\n  !"
        fmt.Printf("Messy: %q\n", messy)
        fmt.Printf("Cleaned: %q\n", normalizeWhitespace(messy))

        // 4. Accent removal for simple matching
        fmt.Println("\n4. Accent Removal:")

        accentedTexts := []string{
            "café",
            "résumé",
            "naïve",
            "Zürich",
            "São Paulo",
            "Москва", // Moscow in Cyrillic
        }

        for _, text := range accentedTexts {
            fmt.Printf("Original: %-15s → No accents: %s\n", text, removeAccents(text))
        }

        // 5. Text similarity
        fmt.Println("\n5. Text Similarity (Levenshtein Distance):")

        similarityTests := []struct {
            s1, s2 string
        }{
            {"color", "colour"},
            {"organize", "organise"},
            {"catalog", "catalogue"},
            {"mississippi", "misisipi"},
            {"kitten", "sitting"},
        }

        for _, test := range similarityTests {
            distance := levenshteinDistance(test.s1, test.s2)
            maxLen := max(len(test.s1), len(test.s2))
            similarity := 1.0 - float64(distance)/float64(maxLen)

            fmt.Printf("%s vs %s:\n", test.s1, test.s2)
            fmt.Printf("  Distance: %d\n", distance)
            fmt.Printf("  Similarity: %.2f%%\n", similarity*100)
        }

        // 6. Advanced text cleaning
        fmt.Println("\n6. Advanced Text Cleaning:")

        messyTexts := []string{
            "  HÉLLÓ,   World! 123!!! \n\t🚀  ",
            "Remove   extra    spaces",
            "\t\tTabs\t\tand\t\tnewlines\n\n\n",
        }

        for _, messy := range messyTexts {
            fmt.Printf("\nOriginal: %q\n", messy)
            fmt.Printf("Cleaned: %q\n", cleanText(messy))
            fmt.Printf("For search: %q\n", normalizeForSearch(messy))
        }
    }

    func simpleNormalize(s string) string {
        // Toy normalization for this demo: lowercase, then compose the one
        // decomposed sequence used above (e + combining acute → é).
        // Production code should use golang.org/x/text/unicode/norm.
        return strings.ReplaceAll(strings.ToLower(s), "e\u0301", "é")
    }

    func caseFold(s string) string {
        // Simple case folding for demonstration
        // In production, use golang.org/x/text/cases
        runes := []rune(strings.ToLower(s))
        for i, r := range runes {
            // Handle special cases
            if r == 'ß' || r == 'ẞ' {
                // German sharp S → ss
                return string(runes[:i]) + "ss" + caseFold(string(runes[i+1:]))
            }
        }
        return string(runes)
    }

    func normalizeWhitespace(s string) string {
        // Replace all whitespace sequences with single space and trim
        runes := []rune(s)
        var normalized []rune
        var inWhitespace bool

        for _, r := range runes {
            if unicode.IsSpace(r) {
                if !inWhitespace {
                    normalized = append(normalized, ' ')
                    inWhitespace = true
                }
            } else {
                normalized = append(normalized, r)
                inWhitespace = false
            }
        }

        return strings.TrimSpace(string(normalized))
    }

    func removeAccents(s string) string {
        // Simple accent removal using mapping
        // In production, use golang.org/x/text/transform
        accentMap := map[rune]rune{
            'á': 'a', 'à': 'a', 'â': 'a', 'ä': 'a', 'ã': 'a', 'å': 'a',
            'é': 'e', 'è': 'e', 'ê': 'e', 'ë': 'e',
            'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
            'ó': 'o', 'ò': 'o', 'ô': 'o', 'ö': 'o', 'õ': 'o', 'ø': 'o',
            'ú': 'u', 'ù': 'u', 'û': 'u', 'ü': 'u',
            'ý': 'y', 'ÿ': 'y',
            'ñ': 'n', 'ç': 'c',
            'Á': 'A', 'À': 'A', 'Â': 'A', 'Ä': 'A', 'Ã': 'A', 'Å': 'A',
            'É': 'E', 'È': 'E', 'Ê': 'E', 'Ë': 'E',
            'Í': 'I', 'Ì': 'I', 'Î': 'I', 'Ï': 'I',
            'Ó': 'O', 'Ò': 'O', 'Ô': 'O', 'Ö': 'O', 'Õ': 'O', 'Ø': 'O',
            'Ú': 'U', 'Ù': 'U', 'Û': 'U', 'Ü': 'U',
            'Ý': 'Y', 'Ÿ': 'Y',
            'Ñ': 'N', 'Ç': 'C',
        }

        runes := []rune(s)
        for i, r := range runes {
            if replacement, exists := accentMap[r]; exists {
                runes[i] = replacement
            }
        }

        return string(runes)
    }

    func levenshteinDistance(s1, s2 string) int {
        r1, r2 := []rune(s1), []rune(s2)
        n, m := len(r1), len(r2)

        if n == 0 {
            return m
        }
        if m == 0 {
            return n
        }

        // Initialize distance matrix
        matrix := make([][]int, n+1)
        for i := range matrix {
            matrix[i] = make([]int, m+1)
            matrix[i][0] = i
        }
        for j := 1; j <= m; j++ {
            matrix[0][j] = j
        }

        // Calculate distances
        for i := 1; i <= n; i++ {
            for j := 1; j <= m; j++ {
                cost := 0
                if r1[i-1] != r2[j-1] {
                    cost = 1
                }

                deletion := matrix[i-1][j] + 1
                insertion := matrix[i][j-1] + 1
                substitution := matrix[i-1][j-1] + cost

                matrix[i][j] = min(deletion, min(insertion, substitution))
            }
        }

        return matrix[n][m]
    }

    func min(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    func max(a, b int) int {
        if a > b {
            return a
        }
        return b
    }

    func cleanText(s string) string {
        // Remove non-printable characters, normalize whitespace
        runes := []rune(s)
        var cleaned []rune

        for _, r := range runes {
            if unicode.IsGraphic(r) && r != '\n' && r != '\t' {
                cleaned = append(cleaned, r)
            }
        }

        return normalizeWhitespace(string(cleaned))
    }

    func normalizeForSearch(s string) string {
        // Normalize text for search: lowercase, remove accents, normalize whitespace
        return normalizeWhitespace(removeAccents(strings.ToLower(s)))
    }

Real-World Applications:

  • Search Engines: Handle different ways users might spell queries
  • Usernames: Normalize usernames for comparison and validation
  • Data Deduplication: Identify similar or duplicate records
  • International SEO: Handle different character encodings

⚠️ Production Note: For production applications, use golang.org/x/text/unicode/norm for proper Unicode normalization (NFC, NFD, NFKC, NFKD forms).
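
A minimal sketch of what that looks like (requires fetching the module with go get golang.org/x/text):

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        composed := "café"         // é as one code point (U+00E9)
        decomposed := "cafe\u0301" // 'e' followed by a combining acute accent

        fmt.Println(composed == decomposed) // false: different byte sequences

        // NFC composes both strings into the same canonical form.
        fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true
    }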

Regular Expressions and Pattern Matching

Advanced Regex Patterns

Regular expressions are powerful tools for pattern matching and text extraction.

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // run
    func main() {
        fmt.Println("=== Regular Expressions and Pattern Matching ===")

        // 1. Basic pattern matching
        fmt.Println("\n1. Basic Pattern Matching:")

        emailRegex := regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)

        texts := []string{
            "Contact me at john@example.com",
            "Email: alice@company.org",
            "Invalid: not-an-email",
            "Multiple: bob@test.com and charlie@demo.net",
        }

        for _, text := range texts {
            matches := emailRegex.FindAllString(text, -1)
            fmt.Printf("Text: %s\n", text)
            fmt.Printf("  Emails found: %v\n", matches)
        }

        // 2. Capturing groups
        fmt.Println("\n2. Capturing Groups:")

        phoneRegex := regexp.MustCompile(`\((\d{3})\)\s*(\d{3})-(\d{4})`)
        phoneText := "Call me at (555) 123-4567 or (800) 555-0199"

        matches := phoneRegex.FindAllStringSubmatch(phoneText, -1)
        for _, match := range matches {
            fmt.Printf("Full match: %s\n", match[0])
            fmt.Printf("  Area code: %s\n", match[1])
            fmt.Printf("  Exchange: %s\n", match[2])
            fmt.Printf("  Number: %s\n", match[3])
        }

        // 3. Named groups
        fmt.Println("\n3. Named Capturing Groups:")

        urlRegex := regexp.MustCompile(`(?P<scheme>https?)://(?P<host>[^/]+)(?P<path>/.*)?`)
        urls := []string{
            "https://www.example.com/path/to/page",
            "http://api.service.com",
            "https://localhost:8080/admin",
        }

        for _, url := range urls {
            match := urlRegex.FindStringSubmatch(url)
            if match != nil {
                fmt.Printf("\nURL: %s\n", url)
                for i, name := range urlRegex.SubexpNames() {
                    if i != 0 && name != "" {
                        fmt.Printf("  %s: %s\n", name, match[i])
                    }
                }
            }
        }

        // 4. Pattern replacement
        fmt.Println("\n4. Pattern Replacement:")

        // Redact email addresses
        text := "Contact john@example.com or alice@company.org for details."
        redacted := emailRegex.ReplaceAllString(text, "[EMAIL]")
        fmt.Printf("Original: %s\n", text)
        fmt.Printf("Redacted: %s\n", redacted)

        // Replace with captured groups
        phoneText2 := "Call (555) 123-4567"
        formatted := phoneRegex.ReplaceAllString(phoneText2, "$1-$2-$3")
        fmt.Printf("Original: %s\n", phoneText2)
        fmt.Printf("Formatted: %s\n", formatted)

        // 5. Complex patterns
        fmt.Println("\n5. Complex Patterns:")

        // Match different date formats
        dateRegex := regexp.MustCompile(`(\d{4}-\d{2}-\d{2})|(\d{2}/\d{2}/\d{4})|(\d{2}\.\d{2}\.\d{4})`)
        dateText := "Events: 2024-03-15, 12/31/2023, and 01.01.2024"

        dates := dateRegex.FindAllString(dateText, -1)
        fmt.Printf("Text: %s\n", dateText)
        fmt.Printf("Dates found: %v\n", dates)

        // Match hashtags and mentions
        socialRegex := regexp.MustCompile(`(#\w+|@\w+)`)
        socialText := "Check out #golang and follow @gophers for updates!"

        tags := socialRegex.FindAllString(socialText, -1)
        fmt.Printf("Text: %s\n", socialText)
        fmt.Printf("Tags/Mentions: %v\n", tags)

        // 6. Validation patterns
        fmt.Println("\n6. Validation Patterns:")

        validators := map[string]*regexp.Regexp{
            "Username": regexp.MustCompile(`^[a-zA-Z0-9_]{3,20}$`),
            // Naive IPv4 check: also accepts out-of-range octets like 256
            "IPv4":       regexp.MustCompile(`^(\d{1,3}\.){3}\d{1,3}$`),
            "HexColor":   regexp.MustCompile(`^#[0-9A-Fa-f]{6}$`),
            "CreditCard": regexp.MustCompile(`^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$`),
        }

        testValues := map[string][]string{
            "Username":   {"john_doe", "a", "user@name", "valid_user123"},
            "IPv4":       {"192.168.1.1", "256.1.1.1", "10.0.0.1"},
            "HexColor":   {"#FF5733", "#GG0000", "#abc", "#FFFFFF"},
            "CreditCard": {"1234-5678-9012-3456", "1234567890123456", "1234-5678"},
        }

        for validatorName, validator := range validators {
            fmt.Printf("\n%s validation:\n", validatorName)
            for _, value := range testValues[validatorName] {
                isValid := validator.MatchString(value)
                fmt.Printf("  %s: %t\n", value, isValid)
            }
        }

        // 7. Performance: compile once, reuse everywhere
        fmt.Println("\n7. Regex Performance:")

        pattern := `\b[A-Za-z]+\b`
        testText := strings.Repeat("The quick brown fox jumps over the lazy dog ", 100)

        // Compiled once (good); recompiling inside a loop would repeat this work
        compiledRegex := regexp.MustCompile(pattern)
        fmt.Printf("Using compiled regex: %d matches\n",
            len(compiledRegex.FindAllString(testText, -1)))
    }

Real-World Applications:

  • Input Validation: Validate email, phone, credit card formats
  • Web Scraping: Extract structured data from HTML/text
  • Log Parsing: Extract information from log files
  • Data Extraction: Parse structured text formats

💡 Performance Tip: Always compile regex patterns once with regexp.MustCompile() at package level, not inside loops.
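
In practice that means hoisting the pattern into a package-level variable, as in this sketch:

    package main

    import (
        "fmt"
        "regexp"
    )

    // Compiled once at program start and reused by every call.
    var wordRE = regexp.MustCompile(`\b[A-Za-z]+\b`)

    func countWords(s string) int {
        return len(wordRE.FindAllString(s, -1))
    }

    func main() {
        fmt.Println(countWords("The quick brown fox")) // 4
    }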

Text Pattern Extraction

Extract structured information from unstructured text.

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // run
    func main() {
        fmt.Println("=== Text Pattern Extraction ===")

        // 1. Extract all patterns from text
        fmt.Println("\n1. Multi-Pattern Extraction:")

        text := `
        Contact Information:
        Email: support@example.com
        Phone: (555) 123-4567
        Website: https://www.example.com
        IP: 192.168.1.1
        Date: 2024-03-15
        Price: $99.99
        `

        patterns := extractAllPatterns(text)

        for patternType, matches := range patterns {
            if len(matches) > 0 {
                fmt.Printf("%s: %v\n", patternType, matches)
            }
        }

        // 2. Parse structured text
        fmt.Println("\n2. Parse Log Entries:")

        logText := `
        [2024-03-15 10:30:45] INFO: Server started on port 8080
        [2024-03-15 10:31:12] ERROR: Database connection failed
        [2024-03-15 10:31:15] WARN: Retrying connection...
        [2024-03-15 10:31:20] INFO: Connected to database
        `

        logs := parseLogEntries(logText)
        for _, log := range logs {
            fmt.Printf("%s [%s]: %s\n", log.Timestamp, log.Level, log.Message)
        }

        // 3. Extract code blocks
        fmt.Println("\n3. Extract Code Blocks:")

        markdown := `
        Here is some code:
        ` + "```go" + `
        func main() {
            fmt.Println("Hello")
        }
        ` + "```" + `

        And another:
        ` + "```python" + `
        print("Hello")
        ` + "```" + `
        `

        codeBlocks := extractCodeBlocks(markdown)
        for _, block := range codeBlocks {
            fmt.Printf("Language: %s\n", block.Language)
            fmt.Printf("Code:\n%s\n", block.Code)
        }

        // 4. Parse key-value pairs
        fmt.Println("\n4. Parse Configuration:")

        config := `
        host=localhost
        port=8080
        debug=true
        max_connections=100
        `

        settings := parseKeyValue(config)
        for key, value := range settings {
            fmt.Printf("%s = %s\n", key, value)
        }

        // 5. Extract mentions and hashtags
        fmt.Println("\n5. Social Media Parsing:")

        tweet := "Great talk by @john_doe about #golang and #webdev! Check out #goconf2024 @gophercon"

        mentions := extractMentions(tweet)
        hashtags := extractHashtags(tweet)

        fmt.Printf("Tweet: %s\n", tweet)
        fmt.Printf("Mentions: %v\n", mentions)
        fmt.Printf("Hashtags: %v\n", hashtags)
    }

    func extractAllPatterns(text string) map[string][]string {
        patterns := map[string]*regexp.Regexp{
            "Email": regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`),
            "Phone": regexp.MustCompile(`\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}`),
            "URL":   regexp.MustCompile(`https?://[^\s]+`),
            "IPv4":  regexp.MustCompile(`\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`),
            "Date":  regexp.MustCompile(`\d{4}-\d{2}-\d{2}`),
            "Price": regexp.MustCompile(`\$\d+\.?\d*`),
        }

        results := make(map[string][]string)
        for name, regex := range patterns {
            matches := regex.FindAllString(text, -1)
            if len(matches) > 0 {
                results[name] = matches
            }
        }

        return results
    }

    type LogEntry struct {
        Timestamp string
        Level     string
        Message   string
    }

    func parseLogEntries(text string) []LogEntry {
        // Pattern: [timestamp] LEVEL: message
        logRegex := regexp.MustCompile(`\[([^\]]+)\]\s+(\w+):\s+(.+)`)
        matches := logRegex.FindAllStringSubmatch(text, -1)

        var entries []LogEntry
        for _, match := range matches {
            if len(match) >= 4 {
                entries = append(entries, LogEntry{
                    Timestamp: match[1],
                    Level:     match[2],
                    Message:   match[3],
                })
            }
        }

        return entries
    }

    type CodeBlock struct {
        Language string
        Code     string
    }

    func extractCodeBlocks(markdown string) []CodeBlock {
        // Pattern: ```language\ncode\n```
        codeRegex := regexp.MustCompile("```(\\w+)\\n([^`]+)```")
        matches := codeRegex.FindAllStringSubmatch(markdown, -1)

        var blocks []CodeBlock
        for _, match := range matches {
            if len(match) >= 3 {
                blocks = append(blocks, CodeBlock{
                    Language: match[1],
                    Code:     strings.TrimSpace(match[2]),
                })
            }
        }

        return blocks
    }

    func parseKeyValue(text string) map[string]string {
        kvRegex := regexp.MustCompile(`(\w+)\s*=\s*(.+)`)
        matches := kvRegex.FindAllStringSubmatch(text, -1)

        settings := make(map[string]string)
        for _, match := range matches {
            if len(match) >= 3 {
                settings[match[1]] = strings.TrimSpace(match[2])
            }
        }

        return settings
    }

    func extractMentions(text string) []string {
        mentionRegex := regexp.MustCompile(`@(\w+)`)
        matches := mentionRegex.FindAllStringSubmatch(text, -1)

        var mentions []string
        for _, match := range matches {
            if len(match) >= 2 {
                mentions = append(mentions, match[1])
            }
        }

        return mentions
    }

    func extractHashtags(text string) []string {
        hashtagRegex := regexp.MustCompile(`#(\w+)`)
        matches := hashtagRegex.FindAllStringSubmatch(text, -1)

        var hashtags []string
        for _, match := range matches {
            if len(match) >= 2 {
                hashtags = append(hashtags, match[1])
            }
        }

        return hashtags
    }

Real-World Applications:

  • Log Analysis: Parse and analyze application logs
  • Content Parsing: Extract data from social media posts
  • Configuration: Parse configuration files
  • Code Analysis: Extract code snippets from documentation

Internationalization and Localization

Multi-Language Text Processing

Building applications that work across languages and cultures.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // run
    func main() {
        fmt.Println("=== Internationalization and Localization ===")

        // 1. Language detection
        fmt.Println("\n1. Basic Language Detection:")

        texts := []string{
            "Hello, how are you?",
            "Привет, как дела?",
            "你好,你好吗?",
            "مرحبا، كيف حالك؟",
            "こんにちは、元気ですか?",
            "Bonjour, comment allez-vous?",
        }

        for _, text := range texts {
            lang := detectLanguage(text)
            fmt.Printf("Text: %s\n", text)
            fmt.Printf("  Detected: %s\n", lang)
        }

        // 2. Sorting with locale awareness
        fmt.Println("\n2. Locale-Aware Sorting:")

        // Simple ASCII sorting
        words := []string{"apple", "Banana", "cherry", "Date"}
        sorted := make([]string, len(words))
        copy(sorted, words)
        sort.Strings(sorted)

        fmt.Printf("Original: %v\n", words)
        fmt.Printf("ASCII sorted: %v\n", sorted)

        // Case-insensitive sorting
        sorted = make([]string, len(words))
        copy(sorted, words)
        sort.Slice(sorted, func(i, j int) bool {
            return strings.ToLower(sorted[i]) < strings.ToLower(sorted[j])
        })
        fmt.Printf("Case-insensitive: %v\n", sorted)

        // 3. Text direction detection
        fmt.Println("\n3. Text Direction Detection:")

        directionTests := []string{
            "Hello World",
            "مرحبا بالعالم",
            "שלום עולם",
            "Hello مرحبا World",
        }

        for _, text := range directionTests {
            direction := detectDirection(text)
            fmt.Printf("Text: %s\n", text)
            fmt.Printf("  Direction: %s\n", direction)
        }

        // 4. Number formatting
        fmt.Println("\n4. Number Formatting:")

        num := 1234567.89

        locales := []struct {
            name      string
            thousands string
            decimal   string
        }{
            {"US/UK", ",", "."},
            {"Europe", ".", ","},
            {"India", ",", "."}, // simplified: real Indian grouping is 2-2-3 (12,34,567.89)
        }

        for _, locale := range locales {
            formatted := formatNumber(num, locale.thousands, locale.decimal)
            fmt.Printf("%s: %s\n", locale.name, formatted)
        }

        // 5. Plural forms
        fmt.Println("\n5. Pluralization:")

        counts := []int{0, 1, 2, 5, 21, 100}

        for _, count := range counts {
            fmt.Printf("%d: %s\n", count, pluralize(count, "item", "items"))
            fmt.Printf("%d: %s\n", count, pluralizeRussian(count, "предмет", "предмета", "предметов"))
        }

        // 6. Date and time formatting
        fmt.Println("\n6. Date/Time Format Patterns:")

        dateFormats := map[string]string{
            "US":     "01/02/2006",
            "Europe": "02.01.2006",
            "ISO":    "2006-01-02",
            "Long":   "January 2, 2006",
        }

        for locale, format := range dateFormats {
            fmt.Printf("%s format: %s\n", locale, format)
        }

        // 7. Currency formatting
        fmt.Println("\n7. Currency Formatting:")

        amount := 1234.56

        currencies := []struct {
            code   string
            symbol string
            format string
        }{
            {"USD", "$", "%.2f"},
            {"EUR", "€", "%.2f"},
            {"GBP", "£", "%.2f"},
            {"JPY", "¥", "%.0f"}, // yen has no minor unit
        }

        for _, curr := range currencies {
            formatted := fmt.Sprintf(curr.format, amount)
            fmt.Printf("%s: %s%s\n", curr.code, curr.symbol, formatted)
        }
    }

    func detectLanguage(text string) string {
        // Simple script-based language detection
        latinCount := 0
        cyrillicCount := 0
        cjkCount := 0
        arabicCount := 0

        for _, r := range text {
            switch {
            case (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z'):
                latinCount++
            case r >= 0x0400 && r <= 0x04FF:
                cyrillicCount++
            case r >= 0x4E00 && r <= 0x9FFF:
                cjkCount++
            case r >= 0x3040 && r <= 0x309F || r >= 0x30A0 && r <= 0x30FF:
                cjkCount++ // Japanese kana
            case r >= 0x0600 && r <= 0x06FF:
                arabicCount++
            }
        }

        // Determine dominant script
        max := latinCount
        lang := "Latin (English/French/German/etc.)"

        if cyrillicCount > max {
            max = cyrillicCount
            lang = "Cyrillic (Russian/Ukrainian/etc.)"
        }
        if cjkCount > max {
            max = cjkCount
            lang = "CJK (Chinese/Japanese/Korean)"
        }
        if arabicCount > max {
            lang = "Arabic"
        }

        return lang
    }

    func detectDirection(text string) string {
        // Detect text direction (LTR or RTL)
        rtlCount := 0
        ltrCount := 0

        for _, r := range text {
            switch {
            case (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z'):
                ltrCount++
            case r >= 0x0590 && r <= 0x05FF: // Hebrew
                rtlCount++
            case r >= 0x0600 && r <= 0x06FF: // Arabic
                rtlCount++
            }
        }

        if rtlCount > ltrCount {
            return "RTL (Right-to-Left)"
        }
        return "LTR (Left-to-Right)"
    }

    func formatNumber(num float64, thousands, decimal string) string {
        // Simple number formatting
        str := fmt.Sprintf("%.2f", num)
        parts := strings.Split(str, ".")

        // Add thousands separators
        integer := parts[0]
        var formatted []rune
        for i, r := range integer {
            if i > 0 && (len(integer)-i)%3 == 0 {
                formatted = append(formatted, []rune(thousands)...)
            }
            formatted = append(formatted, r)
        }

        result := string(formatted)
        if len(parts) > 1 {
            result += decimal + parts[1]
        }

        return result
    }

    func pluralize(count int, singular, plural string) string {
        if count == 1 {
            return fmt.Sprintf("%d %s", count, singular)
        }
        return fmt.Sprintf("%d %s", count, plural)
    }

    func pluralizeRussian(count int, one, few, many string) string {
        // Russian pluralization rules
        n := count % 100
        n1 := count % 10

        if n >= 11 && n <= 19 {
            return fmt.Sprintf("%d %s", count, many)
        }

        switch n1 {
        case 1:
            return fmt.Sprintf("%d %s", count, one)
        case 2, 3, 4:
            return fmt.Sprintf("%d %s", count, few)
        default:
            return fmt.Sprintf("%d %s", count, many)
        }
    }

Real-World Applications:

  • Global Applications: Build apps that work in any language
  • E-commerce: Display prices and dates in local formats
  • Content Management: Handle multi-language content
  • Search: Implement language-aware search

⚠️ Production Note: For production i18n, use golang.org/x/text package which provides proper collation, normalization, and locale support.
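
As a taste of what the package provides, here is a minimal sketch (requires go get golang.org/x/text) of locale-aware collation and number formatting:

    package main

    import (
        "fmt"

        "golang.org/x/text/collate"
        "golang.org/x/text/language"
        "golang.org/x/text/message"
    )

    func main() {
        // Locale-aware collation: German sorts 'Ä' near 'A', not after 'Z'.
        words := []string{"Zebra", "Äpfel", "Apfel"}
        collate.New(language.German).SortStrings(words)
        fmt.Println(words) // [Apfel Äpfel Zebra]

        // Locale-aware number formatting with grouping separators.
        p := message.NewPrinter(language.German)
        p.Printf("%d\n", 1234567) // 1.234.567
    }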

Performance and Optimization

Efficient Text Processing

Optimize text processing for performance-critical applications.

    package main

    import (
        "fmt"
        "strings"
        "time"
        "unicode"
        "unicode/utf8"
    )

    // run
    func main() {
        fmt.Println("=== Performance and Optimization ===")

        // 1. String building performance
        fmt.Println("\n1. String Building Benchmark:")

        iterations := 10000

        // Bad: concatenation
        start := time.Now()
        var bad string
        for i := 0; i < iterations; i++ {
            bad += "x"
        }
        badTime := time.Since(start)

        // Good: strings.Builder
        start = time.Now()
        var builder strings.Builder
        builder.Grow(iterations)
        for i := 0; i < iterations; i++ {
            builder.WriteByte('x')
        }
        _ = builder.String() // materialize the result for a fair comparison
        goodTime := time.Since(start)

        fmt.Printf("Concatenation: %v\n", badTime)
        fmt.Printf("Builder: %v\n", goodTime)
        fmt.Printf("Speedup: %.0fx faster\n", float64(badTime)/float64(goodTime))

        // 2. Substring operations
        fmt.Println("\n2. Substring Performance:")

        text := strings.Repeat("Hello, 世界! ", 1000)

        // Rune-based (safe)
        start = time.Now()
        for i := 0; i < 1000; i++ {
            runes := []rune(text)
            _ = string(runes[0:10])
        }
        runeTime := time.Since(start)

        // Byte-based (unsafe for Unicode, but shown for comparison)
        start = time.Now()
        for i := 0; i < 1000; i++ {
            _ = text[0:20] // Arbitrary byte count
        }
        byteTime := time.Since(start)

        fmt.Printf("Rune-based: %v\n", runeTime)
        fmt.Printf("Byte-based: %v\n", byteTime)
        fmt.Printf("Note: Byte-based is faster but UNSAFE for Unicode!\n")

        // 3. Character counting optimization
        fmt.Println("\n3. Character Counting:")

        longText := strings.Repeat("Hello, 世界! ", 10000)

        // Method 1: utf8.RuneCountInString
        start = time.Now()
        count1 := utf8.RuneCountInString(longText)
        time1 := time.Since(start)

        // Method 2: range loop
        start = time.Now()
        count2 := 0
        for range longText {
            count2++
        }
        time2 := time.Since(start)

        // Method 3: convert to rune slice
        start = time.Now()
        count3 := len([]rune(longText))
        time3 := time.Since(start)

        fmt.Printf("RuneCountInString: %d chars in %v\n", count1, time1)
        fmt.Printf("Range loop: %d chars in %v\n", count2, time2)
        fmt.Printf("Rune slice: %d chars in %v\n", count3, time3)

        // 4. Search optimization
        fmt.Println("\n4. Search Optimization:")

        haystack := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 1000)
        needle := "lazy"

        // Method 1: strings.Contains
        start = time.Now()
        for i := 0; i < 1000; i++ {
            _ = strings.Contains(haystack, needle)
        }
        time1 = time.Since(start)

        // Method 2: strings.Index
        start = time.Now()
        for i := 0; i < 1000; i++ {
            _ = strings.Index(haystack, needle) != -1
        }
        time2 = time.Since(start)

        fmt.Printf("Contains: %v\n", time1)
        fmt.Printf("Index: %v\n", time2)

        // 5. Case-insensitive comparison
        fmt.Println("\n5. Case-Insensitive Comparison:")

        s1 := strings.Repeat("Hello World ", 1000)
        s2 := strings.Repeat("hello world ", 1000)

        // Method 1: Convert to lower
        start = time.Now()
        for i := 0; i < 100; i++ {
            _ = strings.ToLower(s1) == strings.ToLower(s2)
        }
        time1 = time.Since(start)

        // Method 2: EqualFold
        start = time.Now()
        for i := 0; i < 100; i++ {
            _ = strings.EqualFold(s1, s2)
        }
        time2 = time.Since(start)

        fmt.Printf("ToLower: %v\n", time1)
        fmt.Printf("EqualFold: %v (%.0fx faster)\n", time2, float64(time1)/float64(time2))

        // 6. Memory-efficient character filtering
        fmt.Println("\n6. Character Filtering:")

        messyText := strings.Repeat("Hello123World456", 1000)

        // Filter only letters
        start = time.Now()
        filtered := filterLetters(messyText)
        filterTime := time.Since(start)

        fmt.Printf("Original length: %d\n", len(messyText))
        fmt.Printf("Filtered length: %d\n", len(filtered))
        fmt.Printf("Time: %v\n", filterTime)
    }

    func filterLetters(s string) string {
        var builder strings.Builder
        builder.Grow(len(s)) // Pre-allocate

        for _, r := range s {
            if unicode.IsLetter(r) {
                builder.WriteRune(r)
            }
        }

        return builder.String()
    }

Performance Best Practices:

  1. Pre-allocate: Use builder.Grow() when size is known
  2. Avoid Conversions: Minimize string ↔ rune slice conversions
  3. Use Built-ins: Built-in functions are optimized
  4. Cache Results: Don't repeat expensive operations
  5. Profile: Use pprof and go test benchmarks to find bottlenecks (see the sketch below)
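
A minimal benchmark sketch (save it in a _test.go file and run with go test -bench=. -benchmem) that puts numbers behind the Builder advice:

    package main

    import (
        "strings"
        "testing"
    )

    func BenchmarkConcat(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var s string
            for j := 0; j < 1000; j++ {
                s += "x" // allocates a new string on every iteration
            }
            _ = s
        }
    }

    func BenchmarkBuilder(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var sb strings.Builder
            sb.Grow(1000) // one up-front allocation
            for j := 0; j < 1000; j++ {
                sb.WriteByte('x')
            }
            _ = sb.String()
        }
    }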

💡 Key Insight: strings.Builder is often 100x or more faster than naive += concatenation when building large strings, because concatenation re-copies the entire string on every append.

Advanced Text Processing

Complete Text Processing Pipeline

Build a production-ready text processing system.

  1package main
  2
  3import (
  4    "fmt"
  5    "regexp"
  6    "sort"
  7    "strings"
  8    "unicode"
  9    "unicode/utf8"
 10)
 11
 12// run
 13func main() {
 14    fmt.Println("=== Complete Text Processing Pipeline ===")
 15
 16    // Sample documents
 17    documents := []string{
 18        "Hello World! This is a test document about Go programming.",
 19        "The quick brown fox jumps over the lazy dog.",
 20        "Artificial Intelligence is transforming modern technology.",
 21        "Cloud computing enables scalable applications worldwide.",
 22        "Machine learning algorithms process data efficiently.",
 23    }
 24
 25    // 1. Document preprocessing
 26    fmt.Println("\n1. Document Preprocessing:")
 27
 28    preprocessed := make([]string, len(documents))
 29    for i, doc := range documents {
 30        preprocessed[i] = preprocessText(doc)
 31        fmt.Printf("Doc %d:\n", i+1)
 32        fmt.Printf("  Original: %s\n", doc)
 33        fmt.Printf("  Processed: %s\n", preprocessed[i])
 34    }
 35
 36    // 2. Text statistics
 37    fmt.Println("\n2. Text Analysis:")
 38
 39    for i, doc := range preprocessed {
 40        stats := analyzeText(doc)
 41        fmt.Printf("\nDocument %d Statistics:\n", i+1)
 42        fmt.Printf("  Words: %d\n", stats.WordCount)
 43        fmt.Printf("  Characters: %d\n", stats.CharCount)
 44        fmt.Printf("  Unique words: %d\n", stats.UniqueWords)
 45        fmt.Printf("  Avg word length: %.2f\n", stats.AvgWordLength)
 46        fmt.Printf("  Complexity: %s\n", stats.Complexity)
 47    }
 48
 49    // 3. Term frequency analysis
 50    fmt.Println("\n3. Term Frequency Analysis:")
 51
 52    termFreq := calculateTermFrequency(preprocessed)
 53
 54    // Sort by frequency
 55    type termCount struct {
 56        term  string
 57        count int
 58    }
 59
 60    var terms []termCount
 61    for term, count := range termFreq {
 62        if count > 1 { // Only show terms appearing multiple times
 63            terms = append(terms, termCount{term, count})
 64        }
 65    }
 66
 67    sort.Slice(terms, func(i, j int) bool {
 68        return terms[i].count > terms[j].count
 69    })
 70
 71    fmt.Printf("Top terms:\n")
 72    for i, tc := range terms {
 73        if i >= 10 {
 74            break
 75        }
 76        fmt.Printf("  %d. %s: %d occurrences\n", i+1, tc.term, tc.count)
 77    }
 78
 79    // 4. Document similarity
 80    fmt.Println("\n4. Document Similarity:")
 81
 82    for i := 0; i < len(preprocessed); i++ {
 83        for j := i + 1; j < len(preprocessed); j++ {
 84            similarity := calculateJaccardSimilarity(preprocessed[i], preprocessed[j])
 85            fmt.Printf("Doc %d vs Doc %d: %.3f\n", i+1, j+1, similarity)
 86        }
 87    }
 88
 89    // 5. Pattern extraction
 90    fmt.Println("\n5. Pattern Extraction:")
 91
 92    sampleText := "Contact us at support@example.com or call +1-555-123-4567. Visit https://example.com"
 93    patterns := extractPatterns(sampleText)
 94
 95    fmt.Printf("Text: %s\n", sampleText)
 96    fmt.Printf("Extracted patterns:\n")
 97    for patternType, matches := range patterns {
 98        if len(matches) > 0 {
 99            fmt.Printf("  %s: %v\n", patternType, matches)
100        }
101    }
102}
103
104type TextStats struct {
105    WordCount     int
106    CharCount     int
107    UniqueWords   int
108    AvgWordLength float64
109    Complexity    string
110}
111
112// Compile regexes once at package level; recompiling on every call is wasteful.
113var (
114    punctRe = regexp.MustCompile(`[^\w\s]`) // punctuation: anything not a word char or space
115    spaceRe = regexp.MustCompile(`\s+`)     // runs of whitespace
116)
117
118func preprocessText(text string) string {
119    text = strings.ToLower(text)               // 1. Lowercase
120    text = punctRe.ReplaceAllString(text, "")  // 2. Remove punctuation (keep spaces)
121    text = spaceRe.ReplaceAllString(text, " ") // 3. Normalize whitespace
122    return strings.TrimSpace(text)             // 4. Trim
123}
124
125
126func analyzeText(text string) TextStats {
127    words := strings.Fields(text)
128
129    stats := TextStats{
130        WordCount: len(words),
131        CharCount: utf8.RuneCountInString(text),
132    }
133
134    // Unique words
135    wordSet := make(map[string]bool)
136    totalLength := 0
137    for _, word := range words {
138        wordSet[word] = true
139        totalLength += utf8.RuneCountInString(word)
140    }
141    stats.UniqueWords = len(wordSet)
142
143    // Average word length
144    if len(words) > 0 {
145        stats.AvgWordLength = float64(totalLength) / float64(len(words))
146    }
147
148    // Complexity estimation
149    if stats.AvgWordLength < 4 {
150        stats.Complexity = "Simple"
151    } else if stats.AvgWordLength < 6 {
152        stats.Complexity = "Moderate"
153    } else {
154        stats.Complexity = "Complex"
155    }
156
157    return stats
158}
159
160func calculateTermFrequency(documents []string) map[string]int {
161    termFreq := make(map[string]int)
162
163    for _, doc := range documents {
164        words := strings.Fields(doc)
165        for _, word := range words {
166            if len(word) > 2 { // Ignore very short words
167                termFreq[word]++
168            }
169        }
170    }
171
172    return termFreq
173}
174
175func calculateJaccardSimilarity(doc1, doc2 string) float64 {
176    words1 := make(map[string]bool)
177    words2 := make(map[string]bool)
178
179    for _, word := range strings.Fields(doc1) {
180        words1[word] = true
181    }
182
183    for _, word := range strings.Fields(doc2) {
184        words2[word] = true
185    }
186
187    intersection := 0
188    for word := range words1 {
189        if words2[word] {
190            intersection++
191        }
192    }
193
194    union := len(words1) + len(words2) - intersection
195
196    if union == 0 {
197        return 0
198    }
199
200    return float64(intersection) / float64(union)
201}
202
203func extractPatterns(text string) map[string][]string {
204    patterns := make(map[string][]string)
205
206    // Email
207    emailRegex := regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
208    patterns["emails"] = emailRegex.FindAllString(text, -1)
209
210    // Phone
211    phoneRegex := regexp.MustCompile(`\+?\d{1,3}[-.]?\(?\d{3}\)?[-.]?\d{3}[-.]?\d{4}`)
212    patterns["phones"] = phoneRegex.FindAllString(text, -1)
213
214    // URL
215    urlRegex := regexp.MustCompile(`https?://[^\s]+`)
216    patterns["urls"] = urlRegex.FindAllString(text, -1)
217
218    return patterns
219}

Real-World Applications:

  • Search Engines: Index and rank documents
  • Content Analysis: Analyze text for insights
  • Recommendation Systems: Find similar content
  • Data Mining: Extract patterns from text

Practice Exercises

Exercise 1: Universal Text Search Engine

Learning Objectives: Master advanced text processing, pattern matching, and building search functionality that works across multiple languages.

Real-World Context: Search engines are the backbone of modern applications - from web search and document search to autocomplete and recommendation systems. Companies like Google and Microsoft, and projects like Elasticsearch, build sophisticated text processing systems that handle billions of queries across hundreds of languages.

Difficulty: Advanced | Time Estimate: 4-5 hours

Description: Build a multilingual text search engine that can index documents, handle complex queries, support different languages, and provide relevant results with proper ranking.

Requirements:

  • Document Indexing: Parse and index text documents with proper tokenization
  • Unicode Support: Handle text in multiple languages and scripts
  • Text Normalization: Implement case folding, accent removal, and whitespace normalization
  • Query Processing: Support boolean operators and phrase searching
  • Relevance Ranking: Implement TF-IDF scoring
  • Performance: Efficient indexing and searching with proper data structures
  • Fuzzy Matching: Support approximate string matching
  • Caching: Cache frequently used queries
Solution
  1package main
  2
  3import (
  4    "fmt"
  5    "math"
  6    "regexp"
  7    "sort"
  8    "strings"
  9    "unicode"
 10    "unicode/utf8"
 11)
 12
 13// Document represents a searchable document
 14type Document struct {
 15    ID       int
 16    Title    string
 17    Content  string
 18    Language string
 19}
 20
 21// SearchQuery represents a search query
 22type SearchQuery struct {
 23    Terms     []string
 24    Phrase    string
 25    Required  []string
 26    Excluded  []string
 27    Fuzzy     bool
 28}
 29
 30// SearchResult represents a search result
 31type SearchResult struct {
 32    Document *Document
 33    Score    float64
 34    Matches  []string
 35}
 36
 37// SearchEngine represents our search engine
 38type SearchEngine struct {
 39    documents []Document
 40    index     map[string][]Posting
 41    cache     map[string][]SearchResult
 42}
 43
 44// Posting contains document info
 45type Posting struct {
 46    DocID     int
 47    TF        int
 48    Positions []int
 49}
 50
 51// run
 52func main() {
 53    fmt.Println("=== Universal Text Search Engine ===")
 54
 55    engine := NewSearchEngine()
 56
 57    // Add documents
 58    documents := []Document{
 59        {1, "Go Programming", "Go is a programming language by Google", "en"},
 60        {2, "Data Structures", "Understanding data structures and algorithms", "en"},
 61        {3, "数据库设计", "数据库是现代应用的核心", "zh"},
 62        {4, "Machine Learning", "ML algorithms process data efficiently", "en"},
 63    }
 64
 65    for _, doc := range documents {
 66        engine.AddDocument(doc)
 67    }
 68
 69    engine.BuildIndex()
 70    fmt.Printf("Indexed %d documents\n", len(documents))
 71
 72    // Example searches
 73    queries := []string{
 74        "programming",
 75        "data structures",
 76        "数据",
 77    }
 78
 79    for _, query := range queries {
 80        fmt.Printf("\n=== Query: %s ===\n", query)
 81        results := engine.Search(parseQuery(query), 5)
 82
 83        for i, result := range results {
 84            fmt.Printf("%d. %s (score: %.2f)\n", i+1, result.Document.Title, result.Score)
 85        }
 86    }
 87}
 88
 89func NewSearchEngine() *SearchEngine {
 90    return &SearchEngine{
 91        index: make(map[string][]Posting),
 92        cache: make(map[string][]SearchResult),
 93    }
 94}
 95
 96func (se *SearchEngine) AddDocument(doc Document) {
 97    se.documents = append(se.documents, doc)
 98}
 99
100func (se *SearchEngine) BuildIndex() {
101    for _, doc := range se.documents {
102        tokens := tokenize(doc.Content + " " + doc.Title)
103
104        for i, token := range tokens {
105            normalized := normalizeToken(token)
106            if normalized == "" {
107                continue
108            }
109
110            postings := se.index[normalized]
111            found := false
112
113            for j := range postings {
114                if postings[j].DocID == doc.ID {
115                    postings[j].TF++
116                    postings[j].Positions = append(postings[j].Positions, i)
117                    found = true
118                    break
119                }
120            }
121
122            if !found {
123                postings = append(postings, Posting{
124                    DocID:     doc.ID,
125                    TF:        1,
126                    Positions: []int{i},
127                })
128            }
129
130            se.index[normalized] = postings
131        }
132    }
133}
134
135func (se *SearchEngine) Search(query SearchQuery, limit int) []SearchResult {
136    var results []SearchResult
137
138    if query.Phrase != "" {
139        results = se.phraseSearch(query.Phrase, limit)
140    } else {
141        results = se.termSearch(query, limit)
142    }
143
144    return results
145}
146
147func (se *SearchEngine) termSearch(query SearchQuery, limit int) []SearchResult {
148    docScores := make(map[int]float64)
149    docMatches := make(map[int][]string)
150
151    for _, term := range query.Terms {
152        normalized := normalizeToken(term)
153
154        if postings, exists := se.index[normalized]; exists {
155            for _, posting := range postings {
156                score := se.calculateTFIDF(posting.DocID, normalized)
157                docScores[posting.DocID] += score
158                docMatches[posting.DocID] = append(docMatches[posting.DocID], term)
159            }
160        }
161    }
162
163    var results []SearchResult
164    for docID, score := range docScores {
165        doc := se.findDocument(docID)
166        if doc != nil {
167            results = append(results, SearchResult{
168                Document: doc,
169                Score:    score,
170                Matches:  docMatches[docID],
171            })
172        }
173    }
174
175    sort.Slice(results, func(i, j int) bool {
176        return results[i].Score > results[j].Score
177    })
178
179    if len(results) > limit {
180        results = results[:limit]
181    }
182
183    return results
184}
185
186func (se *SearchEngine) phraseSearch(phrase string, limit int) []SearchResult {
187    var results []SearchResult
188    normalized := strings.ToLower(phrase)
189
190    for i := range se.documents {
191        content := strings.ToLower(se.documents[i].Content + " " + se.documents[i].Title)
192        if strings.Contains(content, normalized) {
193            results = append(results, SearchResult{
194                Document: &se.documents[i],
195                Score:    1.0,
196                Matches:  []string{phrase},
197            })
198        }
199    }
200
201    if len(results) > limit {
202        results = results[:limit]
203    }
204
205    return results
206}
207
208func (se *SearchEngine) calculateTFIDF(docID int, term string) float64 {
209    postings := se.index[term]
210
211    var tf int
212    for _, posting := range postings {
213        if posting.DocID == docID {
214            tf = posting.TF
215            break
216        }
217    }
218
219    if tf == 0 {
220        return 0
221    }
222
223    df := len(postings)
224    if df == 0 {
225        return 0
226    }
227
228    N := len(se.documents)
229    idf := math.Log(float64(N) / float64(df))
230
231    return float64(tf) * idf
232}
233
234func (se *SearchEngine) findDocument(docID int) *Document {
235    for i := range se.documents {
236        if se.documents[i].ID == docID {
237            return &se.documents[i]
238        }
239    }
240    return nil
241}
242
243// Compiled once. Go's \w is ASCII-only; \p{L}\p{N} also matches non-Latin scripts.
244var tokenRe = regexp.MustCompile(`[\p{L}\p{N}_]+`)
245
246func tokenize(text string) []string { return tokenRe.FindAllString(text, -1) }
247
248func normalizeToken(token string) string {
249    token = strings.ToLower(token)
250    if utf8.RuneCountInString(token) < 2 { // count runes, not bytes
251        return ""
252    }
253    return token
254}
255
256func parseQuery(query string) SearchQuery {
257    q := SearchQuery{}
258
259    if strings.Contains(query, `"`) {
260        re := regexp.MustCompile(`"([^"]+)"`)
261        matches := re.FindStringSubmatch(query)
262        if len(matches) > 1 {
263            q.Phrase = matches[1]
264        }
265    } else {
266        q.Terms = strings.Fields(query)
267    }
268
269    return q
270}

Exercise 2: Text Sanitizer and Validator

Learning Objectives: Implement comprehensive text validation, sanitization, and security measures.

Real-World Context: Text sanitization is crucial for preventing XSS attacks, SQL injection, and other security vulnerabilities; every piece of user input must be validated and sanitized before use.

Difficulty: Intermediate | Time Estimate: 2-3 hours

Description: Build a text sanitizer that validates and cleans user input, removes malicious content, and ensures text meets specific requirements.

Requirements:

  • XSS Prevention: Remove or escape HTML/JavaScript
  • SQL Injection Prevention: Detect and prevent SQL injection attempts
  • Input Validation: Validate email, phone, URL formats
  • Content Filtering: Remove profanity and unwanted content
  • Length Limits: Enforce text length constraints
  • Unicode Safety: Handle Unicode properly
  • Whitespace Normalization: Clean up excessive whitespace
Solution
  1package main
  2
  3import (
  4    "fmt"
  5    "regexp"
  6    "strings"
  7    "unicode"
  8    "unicode/utf8"
  9)
 10
 11// Sanitizer handles text sanitization and validation
 12type Sanitizer struct {
 13    maxLength      int
 14    allowedChars   *regexp.Regexp
 15    profanityList  map[string]bool
 16    sqlPatterns    []*regexp.Regexp
 17    xssPatterns    []*regexp.Regexp
 18}
 19
 20// ValidationResult contains validation results
 21type ValidationResult struct {
 22    IsValid bool
 23    Errors  []string
 24    Cleaned string
 25}
 26
 27// run
 28func main() {
 29    fmt.Println("=== Text Sanitizer and Validator ===")
 30
 31    sanitizer := NewSanitizer()
 32
 33    // Test cases
 34    tests := []struct {
 35        name  string
 36        input string
 37    }{
 38        {"Normal text", "Hello, World! This is a test."},
 39        {"XSS attempt", "<script>alert('XSS')</script>"},
 40        {"SQL injection", "'; DROP TABLE users; --"},
 41        {"Excessive whitespace", "Hello    World   \n\n\n   Test"},
 42        {"Long text", strings.Repeat("x", 1100)},
 43        {"Mixed content", "Email me@example.com or call 555-1234"},
 44        {"Unicode", "Hello 世界 🌍"},
 45    }
 46
 47    for _, test := range tests {
 48        fmt.Printf("\n=== %s ===\n", test.name)
 49        fmt.Printf("Input: %s\n", truncate(test.input, 50))
 50
 51        result := sanitizer.Sanitize(test.input)
 52        fmt.Printf("Valid: %t\n", result.IsValid)
 53
 54        if len(result.Errors) > 0 {
 55            fmt.Printf("Errors: %v\n", result.Errors)
 56        }
 57
 58        fmt.Printf("Cleaned: %s\n", truncate(result.Cleaned, 50))
 59    }
 60
 61    // Validation examples
 62    fmt.Println("\n=== Format Validation ===")
 63
 64    emails := []string{"valid@example.com", "invalid@", "test@test"}
 65    phones := []string{"555-1234", "(555) 123-4567", "invalid"}
 66
 67    for _, email := range emails {
 68        fmt.Printf("Email %s: valid=%t\n", email, sanitizer.ValidateEmail(email))
 69    }
 70
 71    for _, phone := range phones {
 72        fmt.Printf("Phone %s: valid=%t\n", phone, sanitizer.ValidatePhone(phone))
 73    }
 74}
 75
 76func NewSanitizer() *Sanitizer {
 77    return &Sanitizer{
 78        maxLength:    1000,
 79        allowedChars: regexp.MustCompile(`^[\w\s\p{L}\p{N}.,!?'"()-]+$`),
 80        profanityList: map[string]bool{
 81            "badword": true,
 82            "profanity": true,
 83        },
 84        sqlPatterns: []*regexp.Regexp{
 85            regexp.MustCompile(`(?i)(DROP|DELETE|INSERT|UPDATE)\s+(TABLE|FROM|INTO)`),
 86            regexp.MustCompile(`(?i)(SELECT.*FROM|UNION.*SELECT)`),
 87            regexp.MustCompile(`(?i)(--|;|'|"|\*|=)`), // deliberately aggressive: flags any quote, semicolon, or equals
 88        },
 89        xssPatterns: []*regexp.Regexp{
 90            regexp.MustCompile(`(?i)<script[^>]*>.*?</script>`),
 91            regexp.MustCompile(`(?i)<iframe[^>]*>.*?</iframe>`),
 92            regexp.MustCompile(`(?i)javascript:`),
 93            regexp.MustCompile(`(?i)on\w+\s*=`),
 94        },
 95    }
 96}
 97
 98func (s *Sanitizer) Sanitize(text string) ValidationResult {
 99    result := ValidationResult{
100        IsValid: true,
101        Cleaned: text,
102    }
103
104    // Check length
105    if utf8.RuneCountInString(text) > s.maxLength {
106        result.Errors = append(result.Errors, fmt.Sprintf("Text exceeds %d characters", s.maxLength))
107        result.IsValid = false
108        runes := []rune(text)
109        result.Cleaned = string(runes[:s.maxLength])
110    }
111
112    // Check for XSS
113    for _, pattern := range s.xssPatterns {
114        if pattern.MatchString(result.Cleaned) {
115            result.Errors = append(result.Errors, "XSS content detected")
116            result.IsValid = false
117            result.Cleaned = pattern.ReplaceAllString(result.Cleaned, "")
118        }
119    }
120
121    // Check for SQL injection
122    for _, pattern := range s.sqlPatterns {
123        if pattern.MatchString(result.Cleaned) {
124            result.Errors = append(result.Errors, "SQL injection attempt detected")
125            result.IsValid = false
126        }
127    }
128
129    // Normalize whitespace
130    result.Cleaned = normalizeWhitespace(result.Cleaned)
131
132    // Remove profanity
133    result.Cleaned = s.removeProfanity(result.Cleaned)
134
135    // Trim
136    result.Cleaned = strings.TrimSpace(result.Cleaned)
137
138    return result
139}
140
141func (s *Sanitizer) ValidateEmail(email string) bool {
142    emailRegex := regexp.MustCompile(`^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$`)
143    return emailRegex.MatchString(email)
144}
145
146func (s *Sanitizer) ValidatePhone(phone string) bool {
147    phoneRegex := regexp.MustCompile(`^(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}$`) // area code optional, so 555-1234 also validates
148    return phoneRegex.MatchString(phone)
149}
150
151func (s *Sanitizer) removeProfanity(text string) string {
152    words := strings.Fields(text)
153    for i, word := range words {
154        if s.profanityList[strings.ToLower(word)] { // compare case-insensitively, keep original casing
155            words[i] = strings.Repeat("*", len(word))
156        }
157    }
158    return strings.Join(words, " ")
159}
160
161func normalizeWhitespace(s string) string {
162    runes := []rune(s)
163    var normalized []rune
164    var inWhitespace bool
165
166    for _, r := range runes {
167        if unicode.IsSpace(r) {
168            if !inWhitespace {
169                normalized = append(normalized, ' ')
170                inWhitespace = true
171            }
172        } else {
173            normalized = append(normalized, r)
174            inWhitespace = false
175        }
176    }
177
178    return string(normalized)
179}
180
181func truncate(s string, maxLen int) string {
182    runes := []rune(s)
183    if len(runes) <= maxLen {
184        return s
185    }
186    return string(runes[:maxLen]) + "..."
187}

Exercise 3: Multilingual Tokenizer

Learning Objectives: Build a tokenizer that handles multiple languages and scripts correctly.

Real-World Context: Tokenization is the foundation of NLP. Different languages have different rules for word boundaries, making multilingual tokenization challenging.

Difficulty: Advanced | Time Estimate: 3-4 hours

Description: Create a tokenizer that correctly splits text into tokens (words) across multiple languages including CJK languages, Arabic, and European languages.

Requirements:

  • Language Detection: Detect the primary language
  • Script-Aware Tokenization: Different rules for different scripts
  • CJK Support: Handle Chinese, Japanese, Korean character-by-character
  • Arabic Support: Handle RTL text and diacritics
  • Compound Words: Handle German/Dutch compound words
  • Punctuation: Proper punctuation handling
  • Performance: Efficient tokenization
Solution
  1package main
  2
  3import (
  4    "fmt"
  5    "regexp"
  6    "strings"
  7    "unicode"
  8)
  9
 10// Tokenizer handles multilingual tokenization
 11type Tokenizer struct {
 12    preservePunctuation bool
 13    lowercaseTokens     bool
 14}
 15
 16// Token represents a text token
 17type Token struct {
 18    Text     string
 19    Type     TokenType
 20    Language string
 21    Position int
 22}
 23
 24// TokenType represents token types
 25type TokenType int
 26
 27const (
 28    Word TokenType = iota
 29    Number
 30    Punctuation
 31    Whitespace
 32    Symbol
 33)
 34
 35// run
 36func main() {
 37    fmt.Println("=== Multilingual Tokenizer ===")
 38
 39    tokenizer := NewTokenizer()
 40
 41    // Test texts in different languages
 42    texts := []struct {
 43        name string
 44        text string
 45    }{
 46        {"English", "Hello, world! How are you?"},
 47        {"Chinese", "你好世界,今天天气很好。"},
 48        {"Japanese", "こんにちは、世界!元気ですか?"},
 49        {"Arabic", "مرحبا بالعالم! كيف حالك؟"},
 50        {"German", "Guten Tag! Donaudampfschifffahrtsgesellschaft."},
 51        {"Mixed", "Hello 世界! مرحبا 123."},
 52    }
 53
 54    for _, test := range texts {
 55        fmt.Printf("\n=== %s ===\n", test.name)
 56        fmt.Printf("Text: %s\n", test.text)
 57
 58        tokens := tokenizer.Tokenize(test.text)
 59        fmt.Printf("Tokens (%d):\n", len(tokens))
 60
 61        for i, token := range tokens {
 62            fmt.Printf("  %d. %q [%s, %s]\n", i+1, token.Text,
 63                getTokenTypeName(token.Type), token.Language)
 64        }
 65    }
 66
 67    // Language-specific tokenization
 68    fmt.Println("\n=== Language-Specific Rules ===")
 69
 70    chineseText := "我爱编程。"
 71    tokens := tokenizer.TokenizeChinese(chineseText)
 72    fmt.Printf("Chinese: %s\n", chineseText)
 73    fmt.Printf("Tokens: %v\n", tokens)
 74
 75    arabicText := "مرحبا بالعالم"
 76    tokens = tokenizer.TokenizeArabic(arabicText)
 77    fmt.Printf("Arabic: %s\n", arabicText)
 78    fmt.Printf("Tokens: %v\n", tokens)
 79}
 80
 81func NewTokenizer() *Tokenizer {
 82    return &Tokenizer{
 83        preservePunctuation: false,
 84        lowercaseTokens:     true,
 85    }
 86}
 87
 88func (t *Tokenizer) Tokenize(text string) []Token {
 89    var tokens []Token
 90    runes := []rune(text)
 91    position := 0
 92
 93    for i := 0; i < len(runes); i++ {
 94        r := runes[i]
 95        lang := detectRuneLanguage(r)
 96
 97        // Tokenization strategy depends on the script
 98        switch {
 99        case lang == "CJK" || lang == "Japanese":
100            // CJK: character-by-character
101            tokens = append(tokens, Token{
102                Text:     string(r),
103                Type:     detectTokenType(r),
104                Language: lang,
105                Position: position,
106            })
107            position++
108
109        case unicode.IsLetter(r) || unicode.IsNumber(r):
110            // Alphabetic scripts: collect a run of word characters
111            start := i
112            for i+1 < len(runes) {
113                next := runes[i+1]
114                nl := detectRuneLanguage(next)
115                if (unicode.IsLetter(next) || unicode.IsNumber(next) || next == '\'') && nl != "CJK" && nl != "Japanese" {
116                    i++ // advance the loop index directly; a range loop would reset it
117                } else {
118                    break
119                }
120            }
121            word := string(runes[start : i+1])
122            if t.lowercaseTokens {
123                word = strings.ToLower(word)
124            }
125
126            tokens = append(tokens, Token{
127                Text:     word,
128                Type:     Word,
129                Language: lang,
130                Position: position,
131            })
132            position++
133
134        default:
135            // Whitespace is dropped; punctuation kept only when requested
136            if t.preservePunctuation && unicode.IsPunct(r) {
137                tokens = append(tokens, Token{
138                    Text:     string(r),
139                    Type:     Punctuation,
140                    Language: lang,
141                    Position: position,
142                })
143                position++
144            }
145        }
146    }
147
148    return tokens
149}
150
151func (t *Tokenizer) TokenizeChinese(text string) []string {
152    var tokens []string
153
154    for _, r := range text {
155        if unicode.Is(unicode.Han, r) {
156            tokens = append(tokens, string(r))
157        }
158    }
159
160    return tokens
161}
162
163func (t *Tokenizer) TokenizeArabic(text string) []string {
164    // Simple word-based tokenization for Arabic
165    re := regexp.MustCompile(`[\p{Arabic}]+`)
166    return re.FindAllString(text, -1)
167}
168
169func detectRuneLanguage(r rune) string {
170    switch {
171    case r >= 'A' && r <= 'Z' || r >= 'a' && r <= 'z':
172        return "Latin"
173    case r >= 0x4E00 && r <= 0x9FFF:
174        return "CJK"
175    case r >= 0x3040 && r <= 0x309F || r >= 0x30A0 && r <= 0x30FF:
176        return "Japanese"
177    case r >= 0x0600 && r <= 0x06FF:
178        return "Arabic"
179    case r >= 0x0400 && r <= 0x04FF:
180        return "Cyrillic"
181    default:
182        return "Other"
183    }
184}
185
186func detectTokenType(r rune) TokenType {
187    switch {
188    case unicode.IsLetter(r):
189        return Word
190    case unicode.IsNumber(r):
191        return Number
192    case unicode.IsPunct(r):
193        return Punctuation
194    case unicode.IsSpace(r):
195        return Whitespace
196    default:
197        return Symbol
198    }
199}
200
201func getTokenTypeName(t TokenType) string {
202    switch t {
203    case Word:
204        return "Word"
205    case Number:
206        return "Number"
207    case Punctuation:
208        return "Punctuation"
209    case Whitespace:
210        return "Whitespace"
211    default:
212        return "Symbol"
213    }
214}

Exercise 4: Text Diff and Patch System

Learning Objectives: Implement algorithms for comparing texts and generating patches.

Real-World Context: Version control systems like Git use diff algorithms to show changes between file versions. Understanding text diffs is essential for building collaboration tools.

Difficulty: Advanced | Time Estimate: 4-5 hours

Description: Build a system that can compare two texts, generate a diff showing changes, and apply patches to reconstruct modified versions.

Requirements:

  • Line-by-Line Diff: Compare texts line by line
  • Word-Level Diff: Show word-level changes
  • Character-Level Diff: Show character-level changes
  • Unified Format: Output in unified diff format
  • Patch Application: Apply patches to original text
  • Conflict Detection: Detect and report conflicts
  • Performance: Handle large texts efficiently
Solution
  1package main
  2
  3import (
  4    "fmt"
  5    "strings"
  6)
  7
  8// DiffType represents the type of difference
  9type DiffType int
 10
 11const (
 12    Equal DiffType = iota
 13    Insert
 14    Delete
 15    Replace
 16)
 17
 18// Diff represents a difference between two texts
 19type Diff struct {
 20    Type    DiffType
 21    OldText string
 22    NewText string
 23    Line    int
 24}
 25
 26// Differ handles text comparison
 27type Differ struct{}
 28
 29// run
 30func main() {
 31    fmt.Println("=== Text Diff and Patch System ===")
 32
 33    differ := NewDiffer()
 34
 35    // Test cases
 36    original := `Line 1: Hello World
 37Line 2: This is a test
 38Line 3: Some content here
 39Line 4: More text`
 40
 41    modified := `Line 1: Hello World
 42Line 2: This is a modified test
 43Line 3: New content added
 44Line 4: More text
 45Line 5: Additional line`
 46
 47    // Generate diff
 48    fmt.Println("=== Line-by-Line Diff ===")
 49    diffs := differ.DiffLines(original, modified)
 50
 51    for _, diff := range diffs {
 52        switch diff.Type {
 53        case Equal:
 54            fmt.Printf("  %s\n", diff.OldText)
 55        case Insert:
 56            fmt.Printf("+ %s\n", diff.NewText)
 57        case Delete:
 58            fmt.Printf("- %s\n", diff.OldText)
 59        case Replace:
 60            fmt.Printf("- %s\n", diff.OldText)
 61            fmt.Printf("+ %s\n", diff.NewText)
 62        }
 63    }
 64
 65    // Word-level diff
 66    fmt.Println("\n=== Word-Level Diff ===")
 67    line1 := "The quick brown fox jumps"
 68    line2 := "The fast brown fox leaps"
 69
 70    wordDiffs := differ.DiffWords(line1, line2)
 71    for _, diff := range wordDiffs {
 72        switch diff.Type {
 73        case Equal:
 74            fmt.Printf("%s ", diff.OldText)
 75        case Delete:
 76            fmt.Printf("[-%s] ", diff.OldText)
 77        case Insert:
 78            fmt.Printf("[+%s] ", diff.NewText)
 79        }
 80    }
 81    fmt.Println()
 82
 83    // Generate patch
 84    fmt.Println("\n=== Unified Diff Format ===")
 85    patch := differ.GeneratePatch(original, modified, "original.txt", "modified.txt")
 86    fmt.Println(patch)
 87
 88    // Calculate similarity
 89    fmt.Println("\n=== Similarity Analysis ===")
 90    similarity := differ.CalculateSimilarity(original, modified)
 91    fmt.Printf("Similarity: %.2f%%\n", similarity*100)
 92}
 93
 94func NewDiffer() *Differ {
 95    return &Differ{}
 96}
 97
 98func (d *Differ) DiffLines(text1, text2 string) []Diff {
 99    lines1 := strings.Split(text1, "\n")
100    lines2 := strings.Split(text2, "\n")
101
102    return d.diff(lines1, lines2)
103}
104
105func (d *Differ) DiffWords(text1, text2 string) []Diff {
106    words1 := strings.Fields(text1)
107    words2 := strings.Fields(text2)
108
109    return d.diff(words1, words2)
110}
111
112func (d *Differ) diff(seq1, seq2 []string) []Diff {
113    var diffs []Diff
114    n, m := len(seq1), len(seq2)
115
116    // Simple LCS-based diff algorithm
117    lcs := d.longestCommonSubsequence(seq1, seq2)
118
119    i, j, k := 0, 0, 0
120
121    for k < len(lcs) {
122        // Find deletions
123        for i < n && seq1[i] != lcs[k] {
124            diffs = append(diffs, Diff{
125                Type:    Delete,
126                OldText: seq1[i],
127                Line:    i,
128            })
129            i++
130        }
131
132        // Find insertions
133        for j < m && seq2[j] != lcs[k] {
134            diffs = append(diffs, Diff{
135                Type:    Insert,
136                NewText: seq2[j],
137                Line:    j,
138            })
139            j++
140        }
141
142        // Equal element
143        if i < n && j < m && seq1[i] == seq2[j] {
144            diffs = append(diffs, Diff{
145                Type:    Equal,
146                OldText: seq1[i],
147                NewText: seq2[j],
148                Line:    i,
149            })
150            i++
151            j++
152            k++
153        }
154    }
155
156    // Remaining deletions
157    for i < n {
158        diffs = append(diffs, Diff{
159            Type:    Delete,
160            OldText: seq1[i],
161            Line:    i,
162        })
163        i++
164    }
165
166    // Remaining insertions
167    for j < m {
168        diffs = append(diffs, Diff{
169            Type:    Insert,
170            NewText: seq2[j],
171            Line:    j,
172        })
173        j++
174    }
175
176    return diffs
177}
178
179func (d *Differ) longestCommonSubsequence(seq1, seq2 []string) []string {
180    n, m := len(seq1), len(seq2)
181
182    // DP table
183    dp := make([][]int, n+1)
184    for i := range dp {
185        dp[i] = make([]int, m+1)
186    }
187
188    // Fill DP table
189    for i := 1; i <= n; i++ {
190        for j := 1; j <= m; j++ {
191            if seq1[i-1] == seq2[j-1] {
192                dp[i][j] = dp[i-1][j-1] + 1
193            } else {
194                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
195            }
196        }
197    }
198
199    // Backtrack to find LCS
200    var lcs []string
201    i, j := n, m
202
203    for i > 0 && j > 0 {
204        if seq1[i-1] == seq2[j-1] {
205            lcs = append([]string{seq1[i-1]}, lcs...)
206            i--
207            j--
208        } else if dp[i-1][j] > dp[i][j-1] {
209            i--
210        } else {
211            j--
212        }
213    }
214
215    return lcs
216}
217
218func (d *Differ) GeneratePatch(text1, text2, file1, file2 string) string { // simplified unified-style output (no @@ hunk headers)
219    var patch strings.Builder
220
221    patch.WriteString(fmt.Sprintf("--- %s\n", file1))
222    patch.WriteString(fmt.Sprintf("+++ %s\n", file2))
223
224    diffs := d.DiffLines(text1, text2)
225
226    for _, diff := range diffs {
227        switch diff.Type {
228        case Delete:
229            patch.WriteString(fmt.Sprintf("- %s\n", diff.OldText))
230        case Insert:
231            patch.WriteString(fmt.Sprintf("+ %s\n", diff.NewText))
232        case Equal:
233            patch.WriteString(fmt.Sprintf("  %s\n", diff.OldText))
234        }
235    }
236
237    return patch.String()
238}
239
240func (d *Differ) CalculateSimilarity(text1, text2 string) float64 {
241    lines1 := strings.Split(text1, "\n")
242    lines2 := strings.Split(text2, "\n")
243
244    lcs := d.longestCommonSubsequence(lines1, lines2)
245
246    maxLen := max(len(lines1), len(lines2))
247    if maxLen == 0 {
248        return 1.0
249    }
250
251    return float64(len(lcs)) / float64(maxLen)
252}
253
254func max(a, b int) int {
255    if a > b {
256        return a
257    }
258    return b
259}

Exercise 5: Smart Text Autocomplete

Learning Objectives: Build an intelligent autocomplete system using trie data structures and frequency analysis.

Real-World Context: Autocomplete is ubiquitous in modern applications - search engines, IDEs, messaging apps. Understanding how to build efficient autocomplete systems is valuable.

Difficulty: Intermediate | Time Estimate: 3-4 hours

Description: Create an autocomplete system that suggests completions based on partial input, using frequency-based ranking and supporting multiple languages.

Requirements:

  • Trie Data Structure: Efficient prefix matching
  • Frequency Ranking: Rank suggestions by usage frequency
  • Multi-Language Support: Handle different scripts
  • Case-Insensitive: Ignore case for matching
  • Real-Time Updates: Update suggestions as user types
  • Fuzzy Matching: Suggest corrections for typos
  • Performance: Fast lookups (< 10ms)
Solution
  1package main
  2
  3import (
  4    "fmt"
  5    "sort"
  6    "strings"
  7    "unicode"
  8)
  9
 10// TrieNode represents a node in the trie
 11type TrieNode struct {
 12    children  map[rune]*TrieNode
 13    isEnd     bool
 14    frequency int
 15    word      string
 16}
 17
 18// Autocomplete handles text autocomplete
 19type Autocomplete struct {
 20    root *TrieNode
 21}
 22
 23// Suggestion represents an autocomplete suggestion
 24type Suggestion struct {
 25    Word      string
 26    Frequency int
 27    Score     float64
 28}
 29
 30// run
 31func main() {
 32    fmt.Println("=== Smart Text Autocomplete ===")
 33
 34    ac := NewAutocomplete()
 35
 36    // Build dictionary with frequencies
 37    words := []struct {
 38        word string
 39        freq int
 40    }{
 41        {"hello", 100},
 42        {"help", 50},
 43        {"helpful", 30},
 44        {"world", 80},
 45        {"word", 60},
 46        {"work", 70},
 47        {"programming", 90},
 48        {"program", 85},
 49        {"programmer", 75},
 50        {"python", 40},
 51        {"go", 95},
 52        {"golang", 88},
 53    }
 54
 55    for _, w := range words {
 56        ac.AddWord(w.word, w.freq)
 57    }
 58
 59    fmt.Printf("Dictionary built with %d words\n", len(words))
 60
 61    // Test autocomplete
 62    queries := []string{
 63        "hel",
 64        "wor",
 65        "pro",
 66        "go",
 67        "py",
 68    }
 69
 70    for _, query := range queries {
 71        fmt.Printf("\n=== Query: %s ===\n", query)
 72        suggestions := ac.Suggest(query, 5)
 73
 74        if len(suggestions) == 0 {
 75            fmt.Println("No suggestions found")
 76        } else {
 77            for i, sugg := range suggestions {
 78                fmt.Printf("%d. %s (freq: %d, score: %.2f)\n",
 79                    i+1, sugg.Word, sugg.Frequency, sugg.Score)
 80            }
 81        }
 82    }
 83
 84    // Fuzzy matching
 85    fmt.Println("\n=== Fuzzy Matching ===")
 86    fuzzyQueries := []string{
 87        "helo",    // typo: hello
 88        "wrld",    // typo: world
 89        "progam",  // typo: program
 90    }
 91
 92    for _, query := range fuzzyQueries {
 93        fmt.Printf("\nQuery: %s (typo)\n", query)
 94        suggestions := ac.FuzzySuggest(query, 3)
 95
 96        for i, sugg := range suggestions {
 97            fmt.Printf("%d. %s (suggested correction)\n", i+1, sugg.Word)
 98        }
 99    }
100}
101
102func NewAutocomplete() *Autocomplete {
103    return &Autocomplete{
104        root: &TrieNode{
105            children: make(map[rune]*TrieNode),
106        },
107    }
108}
109
110func (ac *Autocomplete) AddWord(word string, frequency int) {
111    node := ac.root
112    word = strings.ToLower(word)
113
114    for _, ch := range word {
115        if node.children[ch] == nil {
116            node.children[ch] = &TrieNode{
117                children: make(map[rune]*TrieNode),
118            }
119        }
120        node = node.children[ch]
121    }
122
123    node.isEnd = true
124    node.frequency = frequency
125    node.word = word
126}
127
128func (ac *Autocomplete) Suggest(prefix string, limit int) []Suggestion {
129    prefix = strings.ToLower(prefix)
130
131    // Find the prefix node
132    node := ac.root
133    for _, ch := range prefix {
134        if node.children[ch] == nil {
135            return nil
136        }
137        node = node.children[ch]
138    }
139
140    // Collect all completions
141    var suggestions []Suggestion
142    ac.collectSuggestions(node, &suggestions)
143
144    // Sort by frequency and score
145    sort.Slice(suggestions, func(i, j int) bool {
146        if suggestions[i].Frequency != suggestions[j].Frequency {
147            return suggestions[i].Frequency > suggestions[j].Frequency
148        }
149        return suggestions[i].Word < suggestions[j].Word
150    })
151
152    // Calculate scores
153    for i := range suggestions {
154        suggestions[i].Score = float64(suggestions[i].Frequency) / 100.0
155    }
156
157    if len(suggestions) > limit {
158        suggestions = suggestions[:limit]
159    }
160
161    return suggestions
162}
163
164func (ac *Autocomplete) collectSuggestions(node *TrieNode, suggestions *[]Suggestion) {
165    if node.isEnd {
166        *suggestions = append(*suggestions, Suggestion{
167            Word:      node.word,
168            Frequency: node.frequency,
169        })
170    }
171
172    for _, child := range node.children {
173        ac.collectSuggestions(child, suggestions)
174    }
175}
176
177func (ac *Autocomplete) FuzzySuggest(query string, limit int) []Suggestion {
178    query = strings.ToLower(query)
179
180    // Find all words and calculate edit distance
181    var candidates []Suggestion
182    ac.collectAllWords(ac.root, &candidates)
183
184    // Calculate Levenshtein distance
185    for i := range candidates {
186        distance := levenshteinDistance(query, candidates[i].Word)
187        candidates[i].Score = 1.0 / (1.0 + float64(distance))
188    }
189
190    // Sort by edit distance and frequency
191    sort.Slice(candidates, func(i, j int) bool {
192        if candidates[i].Score != candidates[j].Score {
193            return candidates[i].Score > candidates[j].Score
194        }
195        return candidates[i].Frequency > candidates[j].Frequency
196    })
197
198    if len(candidates) > limit {
199        candidates = candidates[:limit]
200    }
201
202    return candidates
203}
204
205func (ac *Autocomplete) collectAllWords(node *TrieNode, words *[]Suggestion) {
206    if node.isEnd {
207        *words = append(*words, Suggestion{
208            Word:      node.word,
209            Frequency: node.frequency,
210        })
211    }
212
213    for _, child := range node.children {
214        ac.collectAllWords(child, words)
215    }
216}
217
218func levenshteinDistance(s1, s2 string) int {
219    r1, r2 := []rune(s1), []rune(s2)
220    n, m := len(r1), len(r2)
221
222    if n == 0 {
223        return m
224    }
225    if m == 0 {
226        return n
227    }
228
229    matrix := make([][]int, n+1)
230    for i := range matrix {
231        matrix[i] = make([]int, m+1)
232        matrix[i][0] = i
233    }
234    for j := 1; j <= m; j++ {
235        matrix[0][j] = j
236    }
237
238    for i := 1; i <= n; i++ {
239        for j := 1; j <= m; j++ {
240            cost := 0
241            if r1[i-1] != r2[j-1] {
242                cost = 1
243            }
244
245            deletion := matrix[i-1][j] + 1
246            insertion := matrix[i][j-1] + 1
247            substitution := matrix[i-1][j-1] + cost
248
249            matrix[i][j] = min(deletion, min(insertion, substitution))
250        }
251    }
252
253    return matrix[n][m]
254}
255
256func min(a, b int) int {
257    if a < b {
258        return a
259    }
260    return b
261}

Common Pitfalls and Best Practices

Text Processing Mistakes to Avoid

  1. Using len() for character count: Always use utf8.RuneCountInString() for Unicode text
  2. Byte-based substring: Never slice strings by byte index with non-ASCII text (both pitfalls are illustrated in the sketch after this list)
  3. Ignoring normalization: Always normalize text before comparison
  4. Inefficient concatenation: Use strings.Builder for building strings in loops
  5. Uncompiled regex: Always compile regex patterns once, not in loops
  6. Ignoring locale: Case conversion and sorting are locale-dependent
  7. Missing validation: Always validate and sanitize user input
  8. Improper encoding: Ensure consistent UTF-8 encoding throughout
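A minimal sketch of the first two pitfalls in action:

  1package main
  2
  3import (
  4    "fmt"
  5    "unicode/utf8"
  6)
  7
  8// run
  9func main() {
 10    s := "héllo" // 'é' occupies 2 bytes in UTF-8
 11
 12    fmt.Println(len(s))                    // 6: byte length, not characters
 13    fmt.Println(utf8.RuneCountInString(s)) // 5: actual character count
 14
 15    fmt.Println(s[:2])                 // "h\xc3": byte slicing cuts 'é' in half
 16    fmt.Println(string([]rune(s)[:2])) // "hé": slice runes instead
 17}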

Security Considerations

  1. Input Validation: Always validate and sanitize user input
  2. XSS Prevention: Escape HTML/JavaScript in user content
  3. SQL Injection: Use parameterized queries, not string concatenation (see the sketch after this list)
  4. Path Traversal: Validate file paths to prevent directory traversal
  5. Buffer Overflow: Enforce length limits on user input
  6. Unicode Attacks: Be aware of homograph and normalization attacks
  7. Regular Expression DoS: Limit regex complexity and execution time
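To make the SQL injection point concrete, here is a minimal sketch using the standard database/sql placeholder mechanism; the driver name, DSN, and users table are illustrative assumptions, not part of this article's examples:

  1package main
  2
  3import (
  4    "database/sql"
  5    "log"
  6)
  7
  8// findUserID passes user input as a bound parameter, so it is sent as
  9// data and can never change the structure of the SQL statement.
 10func findUserID(db *sql.DB, name string) (int, error) {
 11    var id int
 12    err := db.QueryRow("SELECT id FROM users WHERE name = ?", name).Scan(&id)
 13    return id, err
 14}
 15
 16func main() {
 17    // A real connection needs a driver import, e.g.
 18    // _ "github.com/go-sql-driver/mysql" (assumed for illustration).
 19    db, err := sql.Open("mysql", "user:pass@/appdb")
 20    if err != nil {
 21        log.Fatal(err)
 22    }
 23    defer db.Close()
 24
 25    // The injection attempt below is treated as a literal name value.
 26    id, err := findUserID(db, "alice'; DROP TABLE users; --")
 27    log.Println(id, err)
 28}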

Summary

Key Takeaways

  1. Unicode Handling: Always work with runes for character operations
  2. String Building: Use strings.Builder for efficient string concatenation
  3. Normalization: Normalize text for consistent comparison and processing
  4. Performance: Understand performance implications of operations
  5. Safety: Never split strings by byte indices with UTF-8 text
  6. Internationalization: Consider language-specific rules and cultural nuances
  7. Security: Validate and sanitize text input to prevent attacks
  8. Regex: Compile patterns once and reuse them across matches

Best Practices

  1. Unicode Awareness: Always assume text contains Unicode characters
  2. Normalization: Normalize text before comparison or storage
  3. Efficient Operations: Use appropriate data structures
  4. Memory Management: Reuse buffers and avoid unnecessary allocations
  5. Validation: Validate text input and handle edge cases
  6. Testing: Test with various languages and character sets
  7. Documentation: Document encoding requirements

When to Use What

  • Basic Operations: strings package for simple manipulation
  • Unicode Handling: unicode and utf8 packages for character processing
  • Regular Expressions: regexp for pattern matching
  • Internationalization: golang.org/x/text for advanced i18n support
  • Text Building: strings.Builder for concatenation
  • Character Iteration: Range over strings or use utf8.DecodeRuneInString
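Both iteration styles from the last item, side by side:

  1package main
  2
  3import (
  4    "fmt"
  5    "unicode/utf8"
  6)
  7
  8// run
  9func main() {
 10    s := "Go語"
 11
 12    // range decodes one rune per iteration; i is the starting byte offset
 13    for i, r := range s {
 14        fmt.Printf("offset %d: %c\n", i, r)
 15    }
 16
 17    // utf8.DecodeRuneInString gives explicit control over the cursor
 18    for i := 0; i < len(s); {
 19        r, size := utf8.DecodeRuneInString(s[i:])
 20        fmt.Printf("offset %d: %c (%d bytes)\n", i, r, size)
 21        i += size
 22    }
 23}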

Production Libraries

  • golang.org/x/text: Unicode normalization, collation, encoding (example below)
  • golang.org/x/text/language: Language tag support
  • golang.org/x/text/cases: Locale-aware case mapping
  • github.com/rivo/uniseg: Unicode text segmentation
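As a small example of the first library, golang.org/x/text/unicode/norm reconciles the two byte-level spellings of the same character (this sketch assumes the module has been fetched with go get golang.org/x/text):

  1package main
  2
  3import (
  4    "fmt"
  5
  6    "golang.org/x/text/unicode/norm"
  7)
  8
  9func main() {
 10    composed := "\u00e9"    // é as one precomposed code point
 11    decomposed := "e\u0301" // 'e' plus a combining acute accent
 12
 13    // Byte-for-byte these differ, even though they render identically.
 14    fmt.Println(composed == decomposed) // false
 15
 16    // Normalizing both to NFC makes the comparison meaningful.
 17    fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true
 18}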

Closing Thoughts

Text processing transforms your Go applications from single-language tools into global platforms that can handle the complexity of human communication across all languages and cultures.