Why This Matters
Consider building a global social media platform that needs to handle posts in every language - from English and Chinese to Arabic and emoji. Or a search engine that must understand user queries across different scripts and languages. Or a data pipeline that processes customer feedback from around the world.
Text processing is the foundation of modern software applications. It enables:
- Global Applications: Handle text in hundreds of languages and scripts
- Search & Discovery: Build intelligent search and recommendation systems
- Data Processing: Parse and analyze unstructured text data
- User Interfaces: Create responsive, multilingual user experiences
- Communication: Build chatbots, translation services, and messaging systems
- Content Management: Process documents, articles, and user-generated content
In today's connected world, proper text processing isn't optional - it's essential for reaching global audiences and handling the complexity of human language.
Learning Objectives
After completing this article, you will be able to:
- Master Unicode: Understand UTF-8 encoding and proper Unicode handling
- Process Text: Use Go's string manipulation functions effectively
- Normalize Content: Handle text normalization and canonical forms
- Parse & Transform: Build text parsers and transformers
- Handle Internationalization: Work with multiple languages and scripts
- Optimize Performance: Write efficient text processing algorithms
- Apply Regex: Master regular expressions for pattern matching
- Implement Security: Avoid common text processing vulnerabilities
Core Concepts
Unicode and UTF-8 in Go
Go was designed with Unicode support from the ground up. Understanding how Go handles text is crucial for building robust applications:
- Unicode: Universal character encoding supporting all writing systems
- UTF-8: Variable-width encoding that stores Unicode characters efficiently
- Runes: Go's representation of Unicode code points (int32 values)
- Strings: Immutable sequences of bytes interpreted as UTF-8 text
- Code Points: Numerical values representing characters in Unicode
- Normalization: Converting text to canonical forms for consistent comparison
Text Processing Challenges
Text processing presents unique challenges compared to other data types:
- Encoding Issues: Different character encodings and byte order marks
- Normalization: Same text can have multiple byte representations
- Language Rules: Different languages have different sorting, case, and punctuation rules
- Performance: String operations can be expensive for large texts
- Security: Text processing can introduce security vulnerabilities (injection attacks)
- Directionality: Right-to-left languages require special handling
- Grapheme Clusters: Some "characters" are composed of multiple code points
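For example, "é" can be a single code point or an 'e' followed by a combining accent, and a flag emoji is two regional-indicator code points. A minimal sketch of the difference using only the standard library (counting true grapheme clusters needs a text-segmentation package beyond the standard library):
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	composed := "é"         // one code point: U+00E9
	decomposed := "e\u0301" // two code points: 'e' + combining acute accent
	flag := "🇺🇸"            // two regional-indicator code points

	for _, s := range []string{composed, decomposed, flag} {
		// Each of these looks like "one character" but differs in bytes and code points.
		fmt.Printf("%q: %d bytes, %d code points\n", s, len(s), utf8.RuneCountInString(s))
	}
}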
Unicode Fundamentals
Understanding UTF-8 Encoding
UTF-8 is a variable-width character encoding that uses 1-4 bytes per character. Understanding its structure is essential for proper text handling.
1package main
2
3import (
4 "fmt"
5 "unicode/utf8"
6)
7
8// run
9func main() {
10 fmt.Println("=== UTF-8 Encoding Deep Dive ===")
11
12 // 1. UTF-8 byte patterns
13 fmt.Println("\n1. UTF-8 Byte Structure:")
14
15 chars := []struct {
16 char rune
17 name string
18 bytes int
19 }{
20 {'A', "ASCII Letter", 1},
21 {'£', "Pound Sign", 2},
22 {'€', "Euro Sign", 3},
23 {'世', "Chinese Character", 3},
24 {'🌍', "Earth Emoji", 4},
25 {'𝕳', "Mathematical Bold", 4},
26 }
27
28 for _, c := range chars {
29 fmt.Printf("\nCharacter: %c (U+%04X)\n", c.char, c.char)
30 fmt.Printf(" Name: %s\n", c.name)
31 fmt.Printf(" Expected bytes: %d\n", c.bytes)
32 fmt.Printf(" Actual bytes: %d\n", utf8.RuneLen(c.char))
33
34 // Show byte representation
35 str := string(c.char)
36 fmt.Printf(" UTF-8 bytes: ")
37 for i := 0; i < len(str); i++ {
38 fmt.Printf("%08b ", str[i])
39 }
40 fmt.Printf("\n Hex bytes: ")
41 for i := 0; i < len(str); i++ {
42 fmt.Printf("%02X ", str[i])
43 }
44 fmt.Println()
45 }
46
47 // 2. String vs byte vs rune length
48 fmt.Println("\n2. String Length Comparisons:")
49
50 texts := []string{
51 "Hello", // ASCII only
52 "Café", // Latin with accent
53 "你好世界", // Chinese
54 "مرحبا", // Arabic
55 "Hello 🌍 World", // Mixed with emoji
56 "Ñoño", // Spanish with tildes
57 }
58
59 for _, text := range texts {
60 fmt.Printf("\nText: %s\n", text)
61 fmt.Printf(" len(string): %d bytes\n", len(text))
62 fmt.Printf(" utf8.RuneCountInString: %d runes\n", utf8.RuneCountInString(text))
63 fmt.Printf(" []rune length: %d runes\n", len([]rune(text)))
64
65 // Show each character's byte size
66 fmt.Printf(" Character breakdown: ")
67 for i, r := range text {
68 size := utf8.RuneLen(r)
69 fmt.Printf("%c(%d bytes at pos %d) ", r, size, i)
70 }
71 fmt.Println()
72 }
73
74 // 3. Valid UTF-8 checking
75 fmt.Println("\n3. UTF-8 Validation:")
76
77 testStrings := []struct {
78 bytes []byte
79 desc string
80 }{
81 {[]byte("Hello"), "Valid ASCII"},
82 {[]byte{0x48, 0x65, 0x6C, 0x6C, 0x6F}, "Valid ASCII (hex)"},
83 {[]byte{0xE4, 0xB8, 0x96}, "Valid 3-byte Chinese"},
84 {[]byte{0xFF, 0xFE}, "Invalid UTF-8"},
85 {[]byte{0xC0, 0x80}, "Invalid overlong encoding"},
86 {[]byte("Hell"), "Incomplete UTF-8 sequence"},
87 }
88
89 for _, test := range testStrings {
90 valid := utf8.Valid(test.bytes)
91 fmt.Printf("%s: valid=%t, bytes=%v\n", test.desc, valid, test.bytes)
92
93 if valid {
94 fmt.Printf(" String: %s\n", string(test.bytes))
95 } else {
96 fmt.Printf(" Invalid UTF-8 - cannot convert safely\n")
97 }
98 }
99
100 // 4. Rune decoding and encoding
101 fmt.Println("\n4. Manual Rune Decoding:")
102
103 str := "Hello, 世界!"
104 for i := 0; i < len(str); {
105 r, size := utf8.DecodeRuneInString(str[i:])
106 fmt.Printf("Position %d: rune=%c (U+%04X), size=%d bytes\n", i, r, r, size)
107 i += size
108 }
109
110 // 5. Encoding runes to bytes
111 fmt.Println("\n5. Encoding Runes to UTF-8:")
112
113 runes := []rune{'H', 'e', 'l', 'l', 'o', ',', ' ', '世', '界'}
114 for _, r := range runes {
115 buf := make([]byte, 4) // Max UTF-8 size
116 size := utf8.EncodeRune(buf, r)
117 fmt.Printf("Rune %c: bytes=%v (size=%d)\n", r, buf[:size], size)
118 }
119}
Real-World Applications:
- File Processing: Validate UTF-8 encoding in uploaded files
- API Validation: Ensure API requests contain valid UTF-8 text
- Database Storage: Store text efficiently with proper encoding
- Protocol Handling: Parse text protocols with mixed encodings
⚠️ Critical Warning: Never use byte indices to split strings containing non-ASCII characters. Always use rune-based operations or utf8 package functions.
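When input may arrive with broken byte sequences (file uploads, network payloads), it is often better to repair than to reject. A minimal sketch using the standard library; replacing invalid sequences with U+FFFD is an assumption about the desired policy:
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// Bytes containing an invalid UTF-8 sequence in the middle.
	raw := []byte{'H', 'i', 0xFF, 0xFE, '!'}

	if !utf8.Valid(raw) {
		// strings.ToValidUTF8 replaces each run of invalid bytes with the
		// given replacement string (here the Unicode replacement character).
		cleaned := strings.ToValidUTF8(string(raw), "\uFFFD")
		fmt.Printf("sanitized: %q\n", cleaned)
	}
}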
Unicode and Rune Manipulation
Understanding how Go handles Unicode characters is fundamental to proper text processing.
1package main
2
3import (
4 "fmt"
5 "unicode"
6 "unicode/utf8"
7)
8
9// run
10func main() {
11 fmt.Println("=== Unicode and Rune Manipulation ===")
12
13 // 1. String vs Rune slice
14 fmt.Println("\n1. String and Rune Slices:")
15 text := "Hello, 世界! 🌍"
16
17 fmt.Printf("Original string: %s\n", text)
18 fmt.Printf("String length (bytes): %d\n", len(text))
19 fmt.Printf("Rune count: %d\n", utf8.RuneCountInString(text))
20
21 // Convert to rune slice
22 runes := []rune(text)
23 fmt.Printf("Rune slice length: %d\n", len(runes))
24 fmt.Printf("Runes: %U\n", runes)
25
26 // 2. Rune iteration and properties
27 fmt.Println("\n2. Rune Properties:")
28 for i, r := range text {
29 fmt.Printf("Byte index %d: %c (U+%04X)\n", i, r, r)
30 fmt.Printf(" Size in bytes: %d\n", utf8.RuneLen(r))
31 fmt.Printf(" Is letter: %t\n", unicode.IsLetter(r))
32 fmt.Printf(" Is digit: %t\n", unicode.IsDigit(r))
33 fmt.Printf(" Is space: %t\n", unicode.IsSpace(r))
34 fmt.Printf(" Is punctuation: %t\n", unicode.IsPunct(r))
35 fmt.Printf(" Is symbol: %t\n", unicode.IsSymbol(r))
36 }
37
38 // 3. Unicode categories
39 fmt.Println("\n3. Unicode Categories:")
40 testRunes := []rune{'A', 'a', '中', 'ى', '5', '€', '🌍', '\n', '\t'}
41
42 for _, r := range testRunes {
43 fmt.Printf("\nCharacter: %c (U+%04X)\n", r, r)
44
45 if unicode.IsLetter(r) {
46 if unicode.IsUpper(r) {
47 fmt.Printf(" → Uppercase letter\n")
48 } else if unicode.IsLower(r) {
49 fmt.Printf(" → Lowercase letter\n")
50 } else {
51 fmt.Printf(" → Other letter (no case)\n")
52 }
53 }
54
55 if unicode.IsNumber(r) {
56 fmt.Printf(" → Number character\n")
57 }
58
59 if unicode.IsPunct(r) {
60 fmt.Printf(" → Punctuation\n")
61 }
62
63 if unicode.IsSymbol(r) {
64 fmt.Printf(" → Symbol\n")
65 }
66
67 if unicode.IsSpace(r) {
68 fmt.Printf(" → Whitespace\n")
69 }
70
71 if unicode.IsControl(r) {
72 fmt.Printf(" → Control character\n")
73 }
74
75 fmt.Printf(" Category: %s\n", getUnicodeCategory(r))
76 }
77
78 // 4. Case transformations
79 fmt.Println("\n4. Unicode Case Transformations:")
80 caseTests := []rune{'A', 'a', 'Σ', 'σ', 'ς', 'İ', 'i'}
81
82 for _, r := range caseTests {
83 fmt.Printf("\nOriginal: %c (U+%04X)\n", r, r)
84 fmt.Printf(" ToUpper: %c (U+%04X)\n", unicode.ToUpper(r), unicode.ToUpper(r))
85 fmt.Printf(" ToLower: %c (U+%04X)\n", unicode.ToLower(r), unicode.ToLower(r))
86 fmt.Printf(" ToTitle: %c (U+%04X)\n", unicode.ToTitle(r), unicode.ToTitle(r))
87 }
88
89 // 5. String building with runes
90 fmt.Println("\n5. Safe String Building:")
91
92 // Bad approach: byte manipulation can corrupt UTF-8
93 badBytes := []byte{72, 101, 108, 108, 111} // "Hello"
94 badString := string(badBytes)
95 fmt.Printf("Byte manipulation: %s\n", badString)
96
97 // Good approach: rune manipulation
98 goodRunes := []rune{'H', 'e', 'l', 'l', 'o', ',', ' ', '世', '界'}
99 goodString := string(goodRunes)
100 fmt.Printf("Rune manipulation: %s\n", goodString)
101
102 // 6. Unicode script detection
103 fmt.Println("\n6. Unicode Script Detection:")
104 multiScript := "Hello мир 世界 مرحبا"
105
106 for _, r := range multiScript {
107 if unicode.IsLetter(r) {
108 script := detectScript(r)
109 fmt.Printf("Character %c: %s script\n", r, script)
110 }
111 }
112}
113
114func getUnicodeCategory(r rune) string {
115 switch {
116 case unicode.IsLetter(r):
117 return "Letter"
118 case unicode.IsNumber(r):
119 return "Number"
120 case unicode.IsPunct(r):
121 return "Punctuation"
122 case unicode.IsSymbol(r):
123 return "Symbol"
124 case unicode.IsSpace(r):
125 return "Space"
126 case unicode.IsControl(r):
127 return "Control"
128 default:
129 return "Other"
130 }
131}
132
133func detectScript(r rune) string {
134 switch {
135 case r >= 'A' && r <= 'Z' || r >= 'a' && r <= 'z':
136 return "Latin"
137 case r >= 0x0400 && r <= 0x04FF:
138 return "Cyrillic"
139 case r >= 0x4E00 && r <= 0x9FFF:
140 return "CJK (Chinese/Japanese/Korean)"
141 case r >= 0x0600 && r <= 0x06FF:
142 return "Arabic"
143 case r >= 0x0370 && r <= 0x03FF:
144 return "Greek"
145 case r >= 0x0590 && r <= 0x05FF:
146 return "Hebrew"
147 default:
148 return "Unknown"
149 }
150}
Real-World Applications:
- Input Validation: Ensure user input contains only allowed characters
- Password Policies: Check for character variety in passwords
- Text Analysis: Categorize characters for language detection
- Security: Detect suspicious character sequences (homograph attacks)
💡 Best Practice: Always use utf8.RuneCountInString() to count characters (code points), not len(), which returns the number of bytes.
String Manipulation and Transformation
Core String Operations
Go provides powerful string manipulation functions that work properly with Unicode.
1package main
2
3import (
4 "fmt"
5 "strings"
6 "unicode"
7 "unicode/utf8"
8)
9
10// run
11func main() {
12 fmt.Println("=== String Manipulation and Transformation ===")
13
14 // 1. Basic string operations
15 fmt.Println("\n1. Basic String Operations:")
16 text := "Hello, 世界! Welcome to Go! 🚀"
17
18 fmt.Printf("Original: %s\n", text)
19 fmt.Printf("Character count: %d\n", utf8.RuneCountInString(text))
20 fmt.Printf("Byte count: %d\n", len(text))
21
22 // Contains, prefix, suffix
23 fmt.Printf("Contains 'World': %t\n", strings.Contains(text, "World"))
24 fmt.Printf("Contains '世界': %t\n", strings.Contains(text, "世界"))
25 fmt.Printf("Starts with 'Hello': %t\n", strings.HasPrefix(text, "Hello"))
26 fmt.Printf("Ends with '🚀': %t\n", strings.HasSuffix(text, "🚀"))
27
28 // Index finding
29 fmt.Printf("Index of 'World': %d\n", strings.Index(text, "World"))
30 fmt.Printf("Index of '世界': %d\n", strings.Index(text, "世界"))
31 fmt.Printf("Last index of 'o': %d\n", strings.LastIndex(text, "o"))
32
33 // 2. String splitting and joining
34 fmt.Println("\n2. Splitting and Joining:")
35 sentence := "The quick brown fox jumps over the lazy dog"
36
37 // Split by whitespace
38 words := strings.Fields(sentence)
39 fmt.Printf("Words: %v\n", words)
40 fmt.Printf("Word count: %d\n", len(words))
41
42 // Split by specific delimiter
43 csv := "apple,banana,cherry,date,elderberry"
44 fruits := strings.Split(csv, ",")
45 fmt.Printf("Fruits: %v\n", fruits)
46
47 // Split with limit
48 limited := strings.SplitN(csv, ",", 3)
49 fmt.Printf("Limited split (3): %v\n", limited)
50
51 // Join strings
52 joined := strings.Join(words, " | ")
53 fmt.Printf("Joined with separator: %s\n", joined)
54
55 // 3. Case transformation with Unicode awareness
56 fmt.Println("\n3. Unicode Case Transformation:")
57 unicodeText := "hello WORLD! 你好世界 Straße İstanbul"
58
59 fmt.Printf("Original: %s\n", unicodeText)
60 fmt.Printf("Upper: %s\n", strings.ToUpper(unicodeText))
61 fmt.Printf("Lower: %s\n", strings.ToLower(unicodeText))
62 fmt.Printf("Title: %s\n", strings.Title(unicodeText))
63
64 // Custom title case
65 fmt.Printf("Custom title: %s\n", toProperTitle(unicodeText))
66
67 // 4. String trimming and cleaning
68 fmt.Println("\n4. Trimming and Cleaning:")
69 messyText := " Hello, World! \n\t "
70
71 fmt.Printf("Original: %q\n", messyText)
72 fmt.Printf("TrimSpace: %q\n", strings.TrimSpace(messyText))
73 fmt.Printf("TrimPrefix: %q\n", strings.TrimPrefix(messyText, " "))
74 fmt.Printf("TrimSuffix: %q\n", strings.TrimSuffix(messyText, " \n\t "))
75
76 // Trim specific characters
77 punctuationText := "!!!Hello, World!!!"
78 fmt.Printf("Trim '!': %q\n", strings.Trim(punctuationText, "!"))
79
80 // Trim function
81 trimmed := strings.TrimFunc(messyText, unicode.IsSpace)
82 fmt.Printf("TrimFunc (spaces): %q\n", trimmed)
83
84 // 5. String replacement
85 fmt.Println("\n5. String Replacement:")
86 template := "Hello, {{name}}! Welcome to {{app}}."
87
88 // Simple replacement
89 result := strings.Replace(template, "{{name}}", "Alice", -1)
90 result = strings.Replace(result, "{{app}}", "Go Tutorial", -1)
91 fmt.Printf("Simple replacement: %s\n", result)
92
93 // Multiple replacements with Replacer
94 replacer := strings.NewReplacer(
95 "{{name}}", "Bob",
96 "{{app}}", "Awesome App",
97 )
98 result = replacer.Replace(template)
99 fmt.Printf("Replacer: %s\n", result)
100
101 // 6. String comparison
102 fmt.Println("\n6. String Comparison:")
103 compareTests := []struct {
104 s1, s2 string
105 }{
106 {"hello", "Hello"},
107 {"café", "cafe"},
108 {"test", "test"},
109 }
110
111 for _, test := range compareTests {
112 fmt.Printf("\n%q vs %q:\n", test.s1, test.s2)
113 fmt.Printf(" Equal: %t\n", test.s1 == test.s2)
114 fmt.Printf(" EqualFold: %t\n", strings.EqualFold(test.s1, test.s2))
115 fmt.Printf(" Compare: %d\n", strings.Compare(test.s1, test.s2))
116 }
117
118 // 7. Unicode-safe substring extraction
119 fmt.Println("\n7. Unicode-Safe Substring:")
120 complexText := "Héllo, 世界! 👋"
121
122 // Dangerous: byte-based substring
123 fmt.Printf("Original: %s\n", complexText)
124 if len(complexText) >= 10 {
125 dangerous := complexText[0:10]
126 fmt.Printf("Dangerous (bytes 0-10): %s\n", dangerous)
127 }
128
129 // Safe: rune-based substring
130 safe := safeSubstring(complexText, 0, 7)
131 fmt.Printf("Safe (runes 0-7): %s\n", safe)
132
133 safe = safeSubstring(complexText, 7, 3)
134 fmt.Printf("Safe (runes 7-10): %s\n", safe)
135}
136
137func toProperTitle(s string) string {
138 words := strings.Fields(s)
139 for i, word := range words {
140 if len(word) > 0 {
141 runes := []rune(word)
142 if unicode.IsLetter(runes[0]) {
143 runes[0] = unicode.ToTitle(runes[0])
144 for j := 1; j < len(runes); j++ {
145 runes[j] = unicode.ToLower(runes[j])
146 }
147 words[i] = string(runes)
148 }
149 }
150 }
151 return strings.Join(words, " ")
152}
153
154func safeSubstring(s string, start, length int) string {
155 runes := []rune(s)
156 if start < 0 {
157 start = 0
158 }
159 if start >= len(runes) {
160 return ""
161 }
162 end := start + length
163 if end > len(runes) {
164 end = len(runes)
165 }
166 return string(runes[start:end])
167}
Real-World Applications:
- Template Engines: Render dynamic content with user data
- Data Cleaning: Normalize and clean messy text data
- URL Handling: Parse and manipulate URLs and paths
- Configuration Files: Process configuration values and settings
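The placeholder replacement above works for simple cases; for real template rendering the standard library's text/template package handles data binding for you. A minimal sketch (the template string and field names are illustrative):
package main

import (
	"os"
	"text/template"
)

func main() {
	// Hypothetical template and data, just to show the API shape.
	tmpl := template.Must(template.New("greeting").Parse("Hello, {{.Name}}! Welcome to {{.App}}.\n"))

	data := struct {
		Name string
		App  string
	}{Name: "Alice", App: "Go Tutorial"}

	// Execute renders the template into any io.Writer.
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}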
Efficient String Building
String concatenation in loops is a common performance bottleneck. Learn the efficient approaches.
1package main
2
3import (
4 "fmt"
5 "strings"
6 "time"
7)
8
9// run
10func main() {
11 fmt.Println("=== Efficient String Building ===")
12
13 // 1. Performance comparison
14 fmt.Println("\n1. String Concatenation Performance:")
15
16 iterations := 10000
17
18 // Bad: repeated concatenation
19 start := time.Now()
20 var bad string
21 for i := 0; i < iterations; i++ {
22 bad += "x" // Creates new string each time
23 }
24 badTime := time.Since(start)
25
26 // Good: strings.Builder
27 start = time.Now()
28 var builder strings.Builder
29 builder.Grow(iterations) // Pre-allocate capacity
30 for i := 0; i < iterations; i++ {
31 builder.WriteByte('x')
32 }
33 good := builder.String()
34 goodTime := time.Since(start)
35
36 fmt.Printf("Concatenation: %v\n", badTime)
37 fmt.Printf("Builder: %v\n", goodTime)
38 fmt.Printf("Speedup: %.2fx\n", float64(badTime)/float64(goodTime))
39 fmt.Printf("Results match: %t\n", len(bad) == len(good))
40
41 // 2. strings.Builder methods
42 fmt.Println("\n2. strings.Builder Methods:")
43
44 var b strings.Builder
45 b.Grow(100) // Pre-allocate
46
47 b.WriteString("Hello, ")
48 b.WriteString("世界! ")
49 b.WriteRune('🌍')
50 b.WriteByte(' ')
51
52 result := b.String()
53 fmt.Printf("Result: %s\n", result)
54 fmt.Printf("Length: %d bytes, %d runes\n", b.Len(), utf8.RuneCountInString(result))
55
56 // 3. Building complex strings
57 fmt.Println("\n3. Building Complex Strings:")
58
59 var complex strings.Builder
60 complex.Grow(200)
61
62 // HTML-like template
63 complex.WriteString("<html>\n")
64 complex.WriteString(" <body>\n")
65
66 items := []string{"Apple", "Banana", "Cherry"}
67 for _, item := range items {
68 complex.WriteString(" <li>")
69 complex.WriteString(item)
70 complex.WriteString("</li>\n")
71 }
72
73 complex.WriteString(" </body>\n")
74 complex.WriteString("</html>\n")
75
76 fmt.Println(complex.String())
77
78 // 4. Memory efficiency
79 fmt.Println("4. Memory Efficiency:")
80
81 var efficient strings.Builder
82 efficient.Grow(1000) // Pre-allocate exactly what we need
83
84 fmt.Printf("Initial capacity: %d\n", efficient.Cap())
85
86 for i := 0; i < 100; i++ {
87 efficient.WriteString("test ")
88 }
89
90 fmt.Printf("After writes, capacity: %d\n", efficient.Cap())
91 fmt.Printf("Length: %d\n", efficient.Len())
92
93 // 5. Building with formatting
94 fmt.Println("\n5. Formatted String Building:")
95
96 var formatted strings.Builder
97 formatted.Grow(100)
98
99 for i := 1; i <= 5; i++ {
100 formatted.WriteString(fmt.Sprintf("Line %d: Value = %d\n", i, i*10))
101 }
102
103 fmt.Print(formatted.String())
104}
Real-World Applications:
- Log Aggregation: Build large log messages efficiently
- JSON/XML Generation: Construct structured data formats
- Report Generation: Build large text reports
- Template Rendering: Efficient template engines
💡 Performance Tip: Always use strings.Builder for building strings iteratively. Call Grow() if you know the approximate size to avoid reallocations.
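Because *strings.Builder implements io.Writer, formatted output can also be written into it directly with fmt.Fprintf, avoiding the intermediate string that fmt.Sprintf allocates. A minimal sketch:
package main

import (
	"fmt"
	"strings"
)

func main() {
	var b strings.Builder
	b.Grow(64) // optional pre-allocation

	// *strings.Builder is an io.Writer, so Fprintf writes straight into it.
	for i := 1; i <= 3; i++ {
		fmt.Fprintf(&b, "Line %d: Value = %d\n", i, i*10)
	}

	fmt.Print(b.String())
}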
Text Normalization and Comparison
Unicode Normalization Forms
Text normalization is crucial for consistent comparison and processing of Unicode text.
1package main
2
3import (
4 "fmt"
5 "strings"
6 "unicode"
7)
8
9// run
10func main() {
11 fmt.Println("=== Text Normalization and Comparison ===")
12
13 // 1. Unicode normalization problem
14 fmt.Println("\n1. Unicode Normalization Problem:")
15
16 // Different ways to write "café"
17 cafe := []string{
18 "café", // Combined form: é (U+00E9)
19 "cafe\u0301", // Decomposed form: e + combining acute (U+0301)
20 }
21
22 fmt.Println("Visual comparison:")
23 for i, s := range cafe {
24 fmt.Printf(" Version %d: %s\n", i+1, s)
25 fmt.Printf(" Bytes: %v\n", []byte(s))
26 fmt.Printf(" Runes: %U\n", []rune(s))
27 fmt.Printf(" Byte length: %d\n", len(s))
28 fmt.Printf(" Rune count: %d\n", utf8.RuneCountInString(s))
29 }
30
31 // Direct comparison fails
32 fmt.Printf("\nDirect comparison: %q == %q: %t\n", cafe[0], cafe[1], cafe[0] == cafe[1])
33
34 // Normalized comparison
35 norm0 := simpleNormalize(cafe[0])
36 norm1 := simpleNormalize(cafe[1])
37 fmt.Printf("Normalized comparison: %t\n", norm0 == norm1)
38
39 // 2. Case folding for case-insensitive comparison
40 fmt.Println("\n2. Case Folding:")
41
42 caseTests := []struct {
43 s1, s2 string
44 }{
45 {"Hello", "hello"},
46 {"Straße", "STRASSE"}, // German ß vs SS
47 {"İstanbul", "istanbul"}, // Turkish dotted I
48 }
49
50 for _, test := range caseTests {
51 fmt.Printf("\nComparing %q vs %q:\n", test.s1, test.s2)
52 fmt.Printf(" Direct: %t\n", test.s1 == test.s2)
53 fmt.Printf(" EqualFold: %t\n", strings.EqualFold(test.s1, test.s2))
54 fmt.Printf(" Case folded: %t\n", caseFold(test.s1) == caseFold(test.s2))
55 }
56
57 // 3. Whitespace normalization
58 fmt.Println("\n3. Whitespace Normalization:")
59
60 whitespaceText := " Hello\t\nWorld! \n\t "
61 fmt.Printf("Original: %q\n", whitespaceText)
62 fmt.Printf("Normalized: %q\n", normalizeWhitespace(whitespaceText))
63
64 // Multiple spaces and tabs
65 messy := "Hello \t\t World \n\n !"
66 fmt.Printf("Messy: %q\n", messy)
67 fmt.Printf("Cleaned: %q\n", normalizeWhitespace(messy))
68
69 // 4. Accent removal for simple matching
70 fmt.Println("\n4. Accent Removal:")
71
72 accentedTexts := []string{
73 "café",
74 "résumé",
75 "naïve",
76 "Zürich",
77 "São Paulo",
78 "Москва", // Moscow in Cyrillic
79 }
80
81 for _, text := range accentedTexts {
82 fmt.Printf("Original: %-15s → No accents: %s\n", text, removeAccents(text))
83 }
84
85 // 5. Text similarity
86 fmt.Println("\n5. Text Similarity (Levenshtein Distance):")
87
88 similarityTests := []struct {
89 s1, s2 string
90 }{
91 {"color", "colour"},
92 {"organize", "organise"},
93 {"catalog", "catalogue"},
94 {"mississippi", "misisipi"},
95 {"kitten", "sitting"},
96 }
97
98 for _, test := range similarityTests {
99 distance := levenshteinDistance(test.s1, test.s2)
100 maxLen := max(len(test.s1), len(test.s2))
101 similarity := 1.0 - float64(distance)/float64(maxLen)
102
103 fmt.Printf("%s vs %s:\n", test.s1, test.s2)
104 fmt.Printf(" Distance: %d\n", distance)
105 fmt.Printf(" Similarity: %.2f%%\n", similarity*100)
106 }
107
108 // 6. Advanced text cleaning
109 fmt.Println("\n6. Advanced Text Cleaning:")
110
111 messyTexts := []string{
112 " HÉLLÓ, World! 123!!! \n\t🚀 ",
113 "Remove extra spaces",
114 "\t\tTabs\t\tand\t\tnewlines\n\n\n",
115 }
116
117 for _, messy := range messyTexts {
118 fmt.Printf("\nOriginal: %q\n", messy)
119 fmt.Printf("Cleaned: %q\n", cleanText(messy))
120 fmt.Printf("For search: %q\n", normalizeForSearch(messy))
121 }
122}
123
124func simpleNormalize(s string) string {
125 // Simple normalization: lowercase
126 return strings.ToLower(s)
127}
128
129func caseFold(s string) string {
130 // Simple case folding for demonstration
131 // In production, use golang.org/x/text/cases
132 runes := []rune(strings.ToLower(s))
133 for i, r := range runes {
134 // Handle special cases
135 if r == 'ß' || r == 'ẞ' {
136 // German sharp S → ss
137 return string(runes[:i]) + "ss" + caseFold(string(runes[i+1:]))
138 }
139 }
140 return string(runes)
141}
142
143func normalizeWhitespace(s string) string {
144 // Replace all whitespace sequences with single space and trim
145 runes := []rune(s)
146 var normalized []rune
147 var inWhitespace bool
148
149 for _, r := range runes {
150 if unicode.IsSpace(r) {
151 if !inWhitespace {
152 normalized = append(normalized, ' ')
153 inWhitespace = true
154 }
155 } else {
156 normalized = append(normalized, r)
157 inWhitespace = false
158 }
159 }
160
161 return strings.TrimSpace(string(normalized))
162}
163
164func removeAccents(s string) string {
165 // Simple accent removal using mapping
166 // In production, use golang.org/x/text/transform
167 accentMap := map[rune]rune{
168 'á': 'a', 'à': 'a', 'â': 'a', 'ä': 'a', 'ã': 'a', 'å': 'a',
169 'é': 'e', 'è': 'e', 'ê': 'e', 'ë': 'e',
170 'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
171 'ó': 'o', 'ò': 'o', 'ô': 'o', 'ö': 'o', 'õ': 'o', 'ø': 'o',
172 'ú': 'u', 'ù': 'u', 'û': 'u', 'ü': 'u',
173 'ý': 'y', 'ÿ': 'y',
174 'ñ': 'n', 'ç': 'c',
175 'Á': 'A', 'À': 'A', 'Â': 'A', 'Ä': 'A', 'Ã': 'A', 'Å': 'A',
176 'É': 'E', 'È': 'E', 'Ê': 'E', 'Ë': 'E',
177 'Í': 'I', 'Ì': 'I', 'Î': 'I', 'Ï': 'I',
178 'Ó': 'O', 'Ò': 'O', 'Ô': 'O', 'Ö': 'O', 'Õ': 'O', 'Ø': 'O',
179 'Ú': 'U', 'Ù': 'U', 'Û': 'U', 'Ü': 'U',
180 'Ý': 'Y', 'Ÿ': 'Y',
181 'Ñ': 'N', 'Ç': 'C',
182 }
183
184 runes := []rune(s)
185 for i, r := range runes {
186 if replacement, exists := accentMap[r]; exists {
187 runes[i] = replacement
188 }
189 }
190
191 return string(runes)
192}
193
194func levenshteinDistance(s1, s2 string) int {
195 r1, r2 := []rune(s1), []rune(s2)
196 n, m := len(r1), len(r2)
197
198 if n == 0 {
199 return m
200 }
201 if m == 0 {
202 return n
203 }
204
205 // Initialize distance matrix
206 matrix := make([][]int, n+1)
207 for i := range matrix {
208 matrix[i] = make([]int, m+1)
209 matrix[i][0] = i
210 }
211 for j := 1; j <= m; j++ {
212 matrix[0][j] = j
213 }
214
215 // Calculate distances
216 for i := 1; i <= n; i++ {
217 for j := 1; j <= m; j++ {
218 cost := 0
219 if r1[i-1] != r2[j-1] {
220 cost = 1
221 }
222
223 deletion := matrix[i-1][j] + 1
224 insertion := matrix[i][j-1] + 1
225 substitution := matrix[i-1][j-1] + cost
226
227 matrix[i][j] = min(deletion, min(insertion, substitution))
228 }
229 }
230
231 return matrix[n][m]
232}
233
234func min(a, b int) int {
235 if a < b {
236 return a
237 }
238 return b
239}
240
241func max(a, b int) int {
242 if a > b {
243 return a
244 }
245 return b
246}
247
248func cleanText(s string) string {
249 // Remove non-printable characters, normalize whitespace
250 runes := []rune(s)
251 var cleaned []rune
252
253 for _, r := range runes {
254 if unicode.IsGraphic(r) || unicode.IsSpace(r) { // keep whitespace so it can be normalized below
255 cleaned = append(cleaned, r)
256 }
257 }
258
259 return normalizeWhitespace(string(cleaned))
260}
261
262func normalizeForSearch(s string) string {
263 // Normalize text for search: lowercase, remove accents, normalize whitespace
264 return normalizeWhitespace(removeAccents(strings.ToLower(s)))
265}
Real-World Applications:
- Search Engines: Handle different ways users might spell queries
- Usernames: Normalize usernames for comparison and validation
- Data Deduplication: Identify similar or duplicate records
- International SEO: Handle different character encodings
⚠️ Production Note: For production applications, use golang.org/x/text/unicode/norm for proper Unicode normalization (NFC, NFD, NFKC, NFKD forms).
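A minimal sketch of that approach, assuming the golang.org/x/text module has been added to the project:
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "café"         // é as a single code point
	decomposed := "cafe\u0301" // 'e' + combining acute accent

	fmt.Println(composed == decomposed) // false: different byte sequences

	// NFC composes characters into their canonical combined form,
	// so both spellings compare equal after normalization.
	fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true
}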
Regular Expressions and Pattern Matching
Advanced Regex Patterns
Regular expressions are powerful tools for pattern matching and text extraction.
1package main
2
3import (
4 "fmt"
5 "regexp"
6 "strings"
7)
8
9// run
10func main() {
11 fmt.Println("=== Regular Expressions and Pattern Matching ===")
12
13 // 1. Basic pattern matching
14 fmt.Println("\n1. Basic Pattern Matching:")
15
16 emailRegex := regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
17
18 texts := []string{
19 "Contact me at john@example.com",
20 "Email: alice@company.org",
21 "Invalid: not-an-email",
22 "Multiple: bob@test.com and charlie@demo.net",
23 }
24
25 for _, text := range texts {
26 matches := emailRegex.FindAllString(text, -1)
27 fmt.Printf("Text: %s\n", text)
28 fmt.Printf(" Emails found: %v\n", matches)
29 }
30
31 // 2. Capturing groups
32 fmt.Println("\n2. Capturing Groups:")
33
34 phoneRegex := regexp.MustCompile(`\((\d{3})\)\s*(\d{3})-(\d{4})`)
35 phoneText := "Call me at (555) 123-4567 or (800) 555-0199"
36
37 matches := phoneRegex.FindAllStringSubmatch(phoneText, -1)
38 for _, match := range matches {
39 fmt.Printf("Full match: %s\n", match[0])
40 fmt.Printf(" Area code: %s\n", match[1])
41 fmt.Printf(" Exchange: %s\n", match[2])
42 fmt.Printf(" Number: %s\n", match[3])
43 }
44
45 // 3. Named groups
46 fmt.Println("\n3. Named Capturing Groups:")
47
48 urlRegex := regexp.MustCompile(`(?P<scheme>https?)://(?P<host>[^/]+)(?P<path>/.*)?`)
49 urls := []string{
50 "https://www.example.com/path/to/page",
51 "http://api.service.com",
52 "https://localhost:8080/admin",
53 }
54
55 for _, url := range urls {
56 match := urlRegex.FindStringSubmatch(url)
57 if match != nil {
58 fmt.Printf("\nURL: %s\n", url)
59 for i, name := range urlRegex.SubexpNames() {
60 if i != 0 && name != "" {
61 fmt.Printf(" %s: %s\n", name, match[i])
62 }
63 }
64 }
65 }
66
67 // 4. Pattern replacement
68 fmt.Println("\n4. Pattern Replacement:")
69
70 // Redact email addresses
71 text := "Contact john@example.com or alice@company.org for details."
72 redacted := emailRegex.ReplaceAllString(text, "[EMAIL]")
73 fmt.Printf("Original: %s\n", text)
74 fmt.Printf("Redacted: %s\n", redacted)
75
76 // Replace with captured groups
77 phoneText2 := "Call (555) 123-4567"
78 formatted := phoneRegex.ReplaceAllString(phoneText2, "$1-$2-$3")
79 fmt.Printf("Original: %s\n", phoneText2)
80 fmt.Printf("Formatted: %s\n", formatted)
81
82 // 5. Complex patterns
83 fmt.Println("\n5. Complex Patterns:")
84
85 // Match different date formats
86 dateRegex := regexp.MustCompile(`(\d{4}-\d{2}-\d{2})|(\d{2}/\d{2}/\d{4})|(\d{2}\.\d{2}\.\d{4})`)
87 dateText := "Events: 2024-03-15, 12/31/2023, and 01.01.2024"
88
89 dates := dateRegex.FindAllString(dateText, -1)
90 fmt.Printf("Text: %s\n", dateText)
91 fmt.Printf("Dates found: %v\n", dates)
92
93 // Match hashtags and mentions
94 socialRegex := regexp.MustCompile(`(#\w+|@\w+)`)
95 socialText := "Check out #golang and follow @gophers for updates!"
96
97 tags := socialRegex.FindAllString(socialText, -1)
98 fmt.Printf("Text: %s\n", socialText)
99 fmt.Printf("Tags/Mentions: %v\n", tags)
100
101 // 6. Validation patterns
102 fmt.Println("\n6. Validation Patterns:")
103
104 validators := map[string]*regexp.Regexp{
105 "Username": regexp.MustCompile(`^[a-zA-Z0-9_]{3,20}$`),
106 "IPv4": regexp.MustCompile(`^(\d{1,3}\.){3}\d{1,3}$`),
107 "HexColor": regexp.MustCompile(`^#[0-9A-Fa-f]{6}$`),
108 "CreditCard": regexp.MustCompile(`^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$`),
109 }
110
111 testValues := map[string][]string{
112 "Username": {"john_doe", "a", "user@name", "valid_user123"},
113 "IPv4": {"192.168.1.1", "256.1.1.1", "10.0.0.1"},
114 "HexColor": {"#FF5733", "#GG0000", "#abc", "#FFFFFF"},
115 "CreditCard": {"1234-5678-9012-3456", "1234567890123456", "1234-5678"},
116 }
117
118 for validatorName, validator := range validators {
119 fmt.Printf("\n%s validation:\n", validatorName)
120 for _, value := range testValues[validatorName] {
121 isValid := validator.MatchString(value)
122 fmt.Printf(" %s: %t\n", value, isValid)
123 }
124 }
125
126 // 7. Performance: compiled vs non-compiled
127 fmt.Println("\n7. Regex Performance:")
128
129 pattern := `\b[A-Za-z]+\b`
130 testText := strings.Repeat("The quick brown fox jumps over the lazy dog ", 100)
131
132 // Compiled (good)
133 compiledRegex := regexp.MustCompile(pattern)
134 fmt.Printf("Using compiled regex: %d matches\n",
135 len(compiledRegex.FindAllString(testText, -1)))
136
137 // Note: compile once with regexp.MustCompile and reuse it; recompiling inside a loop is far slower
138}
Real-World Applications:
- Input Validation: Validate email, phone, credit card formats
- Web Scraping: Extract structured data from HTML/text
- Log Parsing: Extract information from log files
- Data Extraction: Parse structured text formats
💡 Performance Tip: Always compile regex patterns once with regexp.MustCompile() at package level, not inside loops.
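A minimal sketch of the compile-once pattern (the variable name and pattern are illustrative); compiled *regexp.Regexp values are safe for concurrent use:
package main

import (
	"fmt"
	"regexp"
)

// Compiled once at package initialization and reused everywhere.
var wordRe = regexp.MustCompile(`\b[A-Za-z]+\b`)

func countWords(lines []string) int {
	total := 0
	for _, line := range lines {
		// Reuses the compiled pattern instead of recompiling per iteration.
		total += len(wordRe.FindAllString(line, -1))
	}
	return total
}

func main() {
	lines := []string{"The quick brown fox", "jumps over the lazy dog"}
	fmt.Println(countWords(lines)) // 9
}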
Text Pattern Extraction
Extract structured information from unstructured text.
1package main
2
3import (
4 "fmt"
5 "regexp"
6 "strings"
7)
8
9// run
10func main() {
11 fmt.Println("=== Text Pattern Extraction ===")
12
13 // 1. Extract all patterns from text
14 fmt.Println("\n1. Multi-Pattern Extraction:")
15
16 text := `
17 Contact Information:
18 Email: support@example.com
19 Phone: (555) 123-4567
20 Website: https://www.example.com
21 IP: 192.168.1.1
22 Date: 2024-03-15
23 Price: $99.99
24 `
25
26 patterns := extractAllPatterns(text)
27
28 for patternType, matches := range patterns {
29 if len(matches) > 0 {
30 fmt.Printf("%s: %v\n", patternType, matches)
31 }
32 }
33
34 // 2. Parse structured text
35 fmt.Println("\n2. Parse Log Entries:")
36
37 logText := `
38 [2024-03-15 10:30:45] INFO: Server started on port 8080
39 [2024-03-15 10:31:12] ERROR: Database connection failed
40 [2024-03-15 10:31:15] WARN: Retrying connection...
41 [2024-03-15 10:31:20] INFO: Connected to database
42 `
43
44 logs := parseLogEntries(logText)
45 for _, log := range logs {
46 fmt.Printf("%s [%s]: %s\n", log.Timestamp, log.Level, log.Message)
47 }
48
49 // 3. Extract code blocks
50 fmt.Println("\n3. Extract Code Blocks:")
51
52 markdown := `
53 Here is some code:
54 ` + "```go" + `
55 func main() {
56 fmt.Println("Hello")
57 }
58 ` + "```" + `
59
60 And another:
61 ` + "```python" + `
62 print("Hello")
63 ` + "```" + `
64 `
65
66 codeBlocks := extractCodeBlocks(markdown)
67 for _, block := range codeBlocks {
68 fmt.Printf("Language: %s\n", block.Language)
69 fmt.Printf("Code:\n%s\n", block.Code)
70 }
71
72 // 4. Parse key-value pairs
73 fmt.Println("\n4. Parse Configuration:")
74
75 config := `
76 host=localhost
77 port=8080
78 debug=true
79 max_connections=100
80 `
81
82 settings := parseKeyValue(config)
83 for key, value := range settings {
84 fmt.Printf("%s = %s\n", key, value)
85 }
86
87 // 5. Extract mentions and hashtags
88 fmt.Println("\n5. Social Media Parsing:")
89
90 tweet := "Great talk by @john_doe about #golang and #webdev! Check out #goconf2024 @gophercon"
91
92 mentions := extractMentions(tweet)
93 hashtags := extractHashtags(tweet)
94
95 fmt.Printf("Tweet: %s\n", tweet)
96 fmt.Printf("Mentions: %v\n", mentions)
97 fmt.Printf("Hashtags: %v\n", hashtags)
98}
99
100func extractAllPatterns(text string) map[string][]string {
101 patterns := map[string]*regexp.Regexp{
102 "Email": regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b`),
103 "Phone": regexp.MustCompile(`\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}`),
104 "URL": regexp.MustCompile(`https?://[^\s]+`),
105 "IPv4": regexp.MustCompile(`\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`),
106 "Date": regexp.MustCompile(`\d{4}-\d{2}-\d{2}`),
107 "Price": regexp.MustCompile(`\$\d+\.?\d*`),
108 }
109
110 results := make(map[string][]string)
111 for name, regex := range patterns {
112 matches := regex.FindAllString(text, -1)
113 if len(matches) > 0 {
114 results[name] = matches
115 }
116 }
117
118 return results
119}
120
121type LogEntry struct {
122 Timestamp string
123 Level string
124 Message string
125}
126
127func parseLogEntries(text string) []LogEntry {
128 // Pattern: [timestamp] LEVEL: message
129 logRegex := regexp.MustCompile(`\[([^\]]+)\]\s+(\w+):\s+(.+)`)
130 matches := logRegex.FindAllStringSubmatch(text, -1)
131
132 var entries []LogEntry
133 for _, match := range matches {
134 if len(match) >= 4 {
135 entries = append(entries, LogEntry{
136 Timestamp: match[1],
137 Level: match[2],
138 Message: match[3],
139 })
140 }
141 }
142
143 return entries
144}
145
146type CodeBlock struct {
147 Language string
148 Code string
149}
150
151func extractCodeBlocks(markdown string) []CodeBlock {
152 // Pattern: ```language\ncode\n```
153 codeRegex := regexp.MustCompile("```(\\w+)\\n([^`]+)```")
154 matches := codeRegex.FindAllStringSubmatch(markdown, -1)
155
156 var blocks []CodeBlock
157 for _, match := range matches {
158 if len(match) >= 3 {
159 blocks = append(blocks, CodeBlock{
160 Language: match[1],
161 Code: strings.TrimSpace(match[2]),
162 })
163 }
164 }
165
166 return blocks
167}
168
169func parseKeyValue(text string) map[string]string {
170 kvRegex := regexp.MustCompile(`(\w+)\s*=\s*(.+)`)
171 matches := kvRegex.FindAllStringSubmatch(text, -1)
172
173 settings := make(map[string]string)
174 for _, match := range matches {
175 if len(match) >= 3 {
176 settings[match[1]] = strings.TrimSpace(match[2])
177 }
178 }
179
180 return settings
181}
182
183func extractMentions(text string) []string {
184 mentionRegex := regexp.MustCompile(`@(\w+)`)
185 matches := mentionRegex.FindAllStringSubmatch(text, -1)
186
187 var mentions []string
188 for _, match := range matches {
189 if len(match) >= 2 {
190 mentions = append(mentions, match[1])
191 }
192 }
193
194 return mentions
195}
196
197func extractHashtags(text string) []string {
198 hashtagRegex := regexp.MustCompile(`#(\w+)`)
199 matches := hashtagRegex.FindAllStringSubmatch(text, -1)
200
201 var hashtags []string
202 for _, match := range matches {
203 if len(match) >= 2 {
204 hashtags = append(hashtags, match[1])
205 }
206 }
207
208 return hashtags
209}
Real-World Applications:
- Log Analysis: Parse and analyze application logs
- Content Parsing: Extract data from social media posts
- Configuration: Parse configuration files
- Code Analysis: Extract code snippets from documentation
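In practice, log files are processed line by line rather than loaded as one string. A minimal sketch combining bufio.Scanner with the log pattern used above; the input literal stands in for a real *os.File:
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var logLine = regexp.MustCompile(`\[([^\]]+)\]\s+(\w+):\s+(.+)`)

func main() {
	input := `[2024-03-15 10:30:45] INFO: Server started on port 8080
[2024-03-15 10:31:12] ERROR: Database connection failed`

	// bufio.Scanner reads one line at a time, so memory use stays flat
	// even for very large files (swap strings.NewReader for an *os.File).
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		if m := logLine.FindStringSubmatch(scanner.Text()); m != nil {
			fmt.Printf("%s [%s]: %s\n", m[1], m[2], m[3])
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("scan error:", err)
	}
}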
Internationalization and Localization
Multi-Language Text Processing
Building applications that work across languages and cultures.
1package main
2
3import (
4 "fmt"
5 "sort"
6 "strings"
7 "unicode"
8)
9
10// run
11func main() {
12 fmt.Println("=== Internationalization and Localization ===")
13
14 // 1. Language detection
15 fmt.Println("\n1. Basic Language Detection:")
16
17 texts := []string{
18 "Hello, how are you?",
19 "Привет, как дела?",
20 "你好,你好吗?",
21 "مرحبا، كيف حالك؟",
22 "こんにちは、元気ですか?",
23 "Bonjour, comment allez-vous?",
24 }
25
26 for _, text := range texts {
27 lang := detectLanguage(text)
28 fmt.Printf("Text: %s\n", text)
29 fmt.Printf(" Detected: %s\n", lang)
30 }
31
32 // 2. Sorting with locale awareness
33 fmt.Println("\n2. Locale-Aware Sorting:")
34
35 // Simple ASCII sorting
36 words := []string{"apple", "Banana", "cherry", "Date"}
37 sorted := make([]string, len(words))
38 copy(sorted, words)
39 sort.Strings(sorted)
40
41 fmt.Printf("Original: %v\n", words)
42 fmt.Printf("ASCII sorted: %v\n", sorted)
43
44 // Case-insensitive sorting
45 sorted = make([]string, len(words))
46 copy(sorted, words)
47 sort.Slice(sorted, func(i, j int) bool {
48 return strings.ToLower(sorted[i]) < strings.ToLower(sorted[j])
49 })
50 fmt.Printf("Case-insensitive: %v\n", sorted)
51
52 // 3. Text direction detection
53 fmt.Println("\n3. Text Direction Detection:")
54
55 directionTests := []string{
56 "Hello World",
57 "مرحبا بالعالم",
58 "שלום עולם",
59 "Hello مرحبا World",
60 }
61
62 for _, text := range directionTests {
63 direction := detectDirection(text)
64 fmt.Printf("Text: %s\n", text)
65 fmt.Printf(" Direction: %s\n", direction)
66 }
67
68 // 4. Number formatting
69 fmt.Println("\n4. Number Formatting:")
70
71 num := 1234567.89
72
73 locales := []struct {
74 name string
75 thousands string
76 decimal string
77 }{
78 {"US/UK", ",", "."},
79 {"Europe", ".", ","},
80 {"India", ",", "."},
81 }
82
83 for _, locale := range locales {
84 formatted := formatNumber(num, locale.thousands, locale.decimal)
85 fmt.Printf("%s: %s\n", locale.name, formatted)
86 }
87
88 // 5. Plural forms
89 fmt.Println("\n5. Pluralization:")
90
91 counts := []int{0, 1, 2, 5, 21, 100}
92
93 for _, count := range counts {
94 fmt.Printf("%d: %s\n", count, pluralize(count, "item", "items"))
95 fmt.Printf("%d: %s\n", count, pluralizeRussian(count, "предмет", "предмета", "предметов"))
96 }
97
98 // 6. Date and time formatting
99 fmt.Println("\n6. Date/Time Format Patterns:")
100
101 dateFormats := map[string]string{
102 "US": "01/02/2006",
103 "Europe": "02.01.2006",
104 "ISO": "2006-01-02",
105 "Long": "January 2, 2006",
106 }
107
108 for locale, format := range dateFormats {
109 fmt.Printf("%s format: %s\n", locale, format)
110 }
111
112 // 7. Currency formatting
113 fmt.Println("\n7. Currency Formatting:")
114
115 amount := 1234.56
116
117 currencies := []struct {
118 code string
119 symbol string
120 format string
121 }{
122 {"USD", "$", "%.2f"},
123 {"EUR", "€", "%.2f"},
124 {"GBP", "£", "%.2f"},
125 {"JPY", "¥", "%.0f"},
126 }
127
128 for _, curr := range currencies {
129 formatted := fmt.Sprintf(curr.format, amount)
130 fmt.Printf("%s: %s%s\n", curr.code, curr.symbol, formatted)
131 }
132}
133
134func detectLanguage(text string) string {
135 // Simple script-based language detection
136 latinCount := 0
137 cyrillicCount := 0
138 cjkCount := 0
139 arabicCount := 0
140
141 for _, r := range text {
142 switch {
143 case (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z'):
144 latinCount++
145 case r >= 0x0400 && r <= 0x04FF:
146 cyrillicCount++
147 case r >= 0x4E00 && r <= 0x9FFF:
148 cjkCount++
149 case r >= 0x3040 && r <= 0x309F || r >= 0x30A0 && r <= 0x30FF:
150 cjkCount++ // Japanese
151 case r >= 0x0600 && r <= 0x06FF:
152 arabicCount++
153 }
154 }
155
156 // Determine dominant script
157 max := latinCount
158 lang := "Latin (English/French/German/etc.)"
159
160 if cyrillicCount > max {
161 max = cyrillicCount
162 lang = "Cyrillic (Russian/Ukrainian/etc.)"
163 }
164 if cjkCount > max {
165 max = cjkCount
166 lang = "CJK (Chinese/Japanese/Korean)"
167 }
168 if arabicCount > max {
169 lang = "Arabic"
170 }
171
172 return lang
173}
174
175func detectDirection(text string) string {
176 // Detect text direction (LTR or RTL)
177 rtlCount := 0
178 ltrCount := 0
179
180 for _, r := range text {
181 switch {
182 case (r >= 'A' && r <= 'Z') || (r >= 'a' && r <= 'z'):
183 ltrCount++
184 case r >= 0x0590 && r <= 0x05FF: // Hebrew
185 rtlCount++
186 case r >= 0x0600 && r <= 0x06FF: // Arabic
187 rtlCount++
188 }
189 }
190
191 if rtlCount > ltrCount {
192 return "RTL (Right-to-Left)"
193 }
194 return "LTR (Left-to-Right)"
195}
196
197func formatNumber(num float64, thousands, decimal string) string {
198 // Simple number formatting
199 str := fmt.Sprintf("%.2f", num)
200 parts := strings.Split(str, ".")
201
202 // Add thousands separators
203 integer := parts[0]
204 var formatted []rune
205 for i, r := range integer {
206 if i > 0 && (len(integer)-i)%3 == 0 {
207 formatted = append(formatted, []rune(thousands)...)
208 }
209 formatted = append(formatted, r)
210 }
211
212 result := string(formatted)
213 if len(parts) > 1 {
214 result += decimal + parts[1]
215 }
216
217 return result
218}
219
220func pluralize(count int, singular, plural string) string {
221 if count == 1 {
222 return fmt.Sprintf("%d %s", count, singular)
223 }
224 return fmt.Sprintf("%d %s", count, plural)
225}
226
227func pluralizeRussian(count int, one, few, many string) string {
228 // Russian pluralization rules
229 n := count % 100
230 n1 := count % 10
231
232 if n >= 11 && n <= 19 {
233 return fmt.Sprintf("%d %s", count, many)
234 }
235
236 switch n1 {
237 case 1:
238 return fmt.Sprintf("%d %s", count, one)
239 case 2, 3, 4:
240 return fmt.Sprintf("%d %s", count, few)
241 default:
242 return fmt.Sprintf("%d %s", count, many)
243 }
244}
Real-World Applications:
- Global Applications: Build apps that work in any language
- E-commerce: Display prices and dates in local formats
- Content Management: Handle multi-language content
- Search: Implement language-aware search
⚠️ Production Note: For production i18n, use golang.org/x/text package which provides proper collation, normalization, and locale support.
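A minimal sketch of locale-aware sorting with that package, assuming the golang.org/x/text module is available; German is used as an example locale:
package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	words := []string{"zebra", "Äpfel", "apple", "Öl"}

	// A collator sorts according to the rules of a specific locale,
	// unlike sort.Strings, which compares raw byte values.
	c := collate.New(language.German)
	c.SortStrings(words)

	fmt.Println(words) // accented letters sort next to their base letters
}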
Performance and Optimization
Efficient Text Processing
Optimize text processing for performance-critical applications.
1package main
2
3import (
4 "fmt"
5 "strings"
6 "time"
7 "unicode"
8 "unicode/utf8"
9)
10
11// run
12func main() {
13 fmt.Println("=== Performance and Optimization ===")
14
15 // 1. String building performance
16 fmt.Println("\n1. String Building Benchmark:")
17
18 iterations := 10000
19
20 // Bad: concatenation
21 start := time.Now()
22 var bad string
23 for i := 0; i < iterations; i++ {
24 bad += "x"
25 }
26 badTime := time.Since(start)
27
28 // Good: strings.Builder
29 start = time.Now()
30 var builder strings.Builder
31 builder.Grow(iterations)
32 for i := 0; i < iterations; i++ {
33 builder.WriteByte('x')
34 }
35 good := builder.String()
36 goodTime := time.Since(start)
37
38 fmt.Printf("Concatenation: %v\n", badTime)
39 fmt.Printf("Builder: %v\n", goodTime)
40 fmt.Printf("Speedup: %.0fx faster\n", float64(badTime)/float64(goodTime))
41
42 // 2. Substring operations
43 fmt.Println("\n2. Substring Performance:")
44
45 text := strings.Repeat("Hello, 世界! ", 1000)
46
47 // Rune-based (safe)
48 start = time.Now()
49 for i := 0; i < 1000; i++ {
50 runes := []rune(text)
51 _ = string(runes[0:10])
52 }
53 runeTime := time.Since(start)
54
55 // Byte-based (unsafe for Unicode, but shown for comparison)
56 start = time.Now()
57 for i := 0; i < 1000; i++ {
58 _ = text[0:20] // Arbitrary byte count
59 }
60 byteTime := time.Since(start)
61
62 fmt.Printf("Rune-based: %v\n", runeTime)
63 fmt.Printf("Byte-based: %v\n", byteTime)
64 fmt.Printf("Note: Byte-based is faster but UNSAFE for Unicode!\n")
65
66 // 3. Character counting optimization
67 fmt.Println("\n3. Character Counting:")
68
69 longText := strings.Repeat("Hello, 世界! ", 10000)
70
71 // Method 1: utf8.RuneCountInString
72 start = time.Now()
73 count1 := utf8.RuneCountInString(longText)
74 time1 := time.Since(start)
75
76 // Method 2: range loop
77 start = time.Now()
78 count2 := 0
79 for range longText {
80 count2++
81 }
82 time2 := time.Since(start)
83
84 // Method 3: convert to rune slice
85 start = time.Now()
86 count3 := len([]rune(longText))
87 time3 := time.Since(start)
88
89 fmt.Printf("RuneCountInString: %d chars in %v\n", count1, time1)
90 fmt.Printf("Range loop: %d chars in %v\n", count2, time2)
91 fmt.Printf("Rune slice: %d chars in %v\n", count3, time3)
92
93 // 4. Search optimization
94 fmt.Println("\n4. Search Optimization:")
95
96 haystack := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 1000)
97 needle := "lazy"
98
99 // Method 1: strings.Contains
100 start = time.Now()
101 for i := 0; i < 1000; i++ {
102 _ = strings.Contains(haystack, needle)
103 }
104 time1 = time.Since(start)
105
106 // Method 2: strings.Index
107 start = time.Now()
108 for i := 0; i < 1000; i++ {
109 _ = strings.Index(haystack, needle) != -1
110 }
111 time2 = time.Since(start)
112
113 fmt.Printf("Contains: %v\n", time1)
114 fmt.Printf("Index: %v\n", time2)
115
116 // 5. Case-insensitive comparison
117 fmt.Println("\n5. Case-Insensitive Comparison:")
118
119 s1 := strings.Repeat("Hello World ", 1000)
120 s2 := strings.Repeat("hello world ", 1000)
121
122 // Method 1: Convert to lower
123 start = time.Now()
124 for i := 0; i < 100; i++ {
125 _ = strings.ToLower(s1) == strings.ToLower(s2)
126 }
127 time1 = time.Since(start)
128
129 // Method 2: EqualFold
130 start = time.Now()
131 for i := 0; i < 100; i++ {
132 _ = strings.EqualFold(s1, s2)
133 }
134 time2 = time.Since(start)
135
136 fmt.Printf("ToLower: %v\n", time1)
137 fmt.Printf("EqualFold: %v (%.0fx faster)\n", time2, float64(time1)/float64(time2))
138
139 // 6. Memory-efficient character filtering
140 fmt.Println("\n6. Character Filtering:")
141
142 messyText := strings.Repeat("Hello123World456", 1000)
143
144 // Filter only letters
145 start = time.Now()
146 filtered := filterLetters(messyText)
147 filterTime := time.Since(start)
148
149 fmt.Printf("Original length: %d\n", len(messyText))
150 fmt.Printf("Filtered length: %d\n", len(filtered))
151 fmt.Printf("Time: %v\n", filterTime)
152}
153
154func filterLetters(s string) string {
155 var builder strings.Builder
156 builder.Grow(len(s)) // Pre-allocate
157
158 for _, r := range s {
159 if unicode.IsLetter(r) {
160 builder.WriteRune(r)
161 }
162 }
163
164 return builder.String()
165}
Performance Best Practices:
- Pre-allocate: Use builder.Grow() when the size is known
- Avoid Conversions: Minimize string ↔ rune slice conversions
- Use Built-ins: Built-in functions are optimized
- Cache Results: Don't repeat expensive operations
- Profile: Use pprof to find bottlenecks
💡 Key Insight: strings.Builder is typically 100x+ faster than concatenation for building large strings.
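The timing loops above use time.Now for illustration; for reliable numbers, write benchmarks with the testing package and run them with go test -bench=. . A minimal sketch (placed in a _test.go file):
package main

import (
	"strings"
	"testing"
)

func BenchmarkConcat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var s string
		for j := 0; j < 1000; j++ {
			s += "x" // allocates a new string on every iteration
		}
		_ = s
	}
}

func BenchmarkBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		sb.Grow(1000) // pre-allocate the final size
		for j := 0; j < 1000; j++ {
			sb.WriteByte('x')
		}
		_ = sb.String()
	}
}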
Advanced Text Processing
Complete Text Processing Pipeline
Build a production-ready text processing system.
1package main
2
3import (
4 "fmt"
5 "regexp"
6 "sort"
7 "strings"
8 "unicode"
9 "unicode/utf8"
10)
11
12// run
13func main() {
14 fmt.Println("=== Complete Text Processing Pipeline ===")
15
16 // Sample documents
17 documents := []string{
18 "Hello World! This is a test document about Go programming.",
19 "The quick brown fox jumps over the lazy dog.",
20 "Artificial Intelligence is transforming modern technology.",
21 "Cloud computing enables scalable applications worldwide.",
22 "Machine learning algorithms process data efficiently.",
23 }
24
25 // 1. Document preprocessing
26 fmt.Println("\n1. Document Preprocessing:")
27
28 preprocessed := make([]string, len(documents))
29 for i, doc := range documents {
30 preprocessed[i] = preprocessText(doc)
31 fmt.Printf("Doc %d:\n", i+1)
32 fmt.Printf(" Original: %s\n", doc)
33 fmt.Printf(" Processed: %s\n", preprocessed[i])
34 }
35
36 // 2. Text statistics
37 fmt.Println("\n2. Text Analysis:")
38
39 for i, doc := range preprocessed {
40 stats := analyzeText(doc)
41 fmt.Printf("\nDocument %d Statistics:\n", i+1)
42 fmt.Printf(" Words: %d\n", stats.WordCount)
43 fmt.Printf(" Characters: %d\n", stats.CharCount)
44 fmt.Printf(" Unique words: %d\n", stats.UniqueWords)
45 fmt.Printf(" Avg word length: %.2f\n", stats.AvgWordLength)
46 fmt.Printf(" Complexity: %s\n", stats.Complexity)
47 }
48
49 // 3. Term frequency analysis
50 fmt.Println("\n3. Term Frequency Analysis:")
51
52 termFreq := calculateTermFrequency(preprocessed)
53
54 // Sort by frequency
55 type termCount struct {
56 term string
57 count int
58 }
59
60 var terms []termCount
61 for term, count := range termFreq {
62 if count > 1 { // Only show terms appearing multiple times
63 terms = append(terms, termCount{term, count})
64 }
65 }
66
67 sort.Slice(terms, func(i, j int) bool {
68 return terms[i].count > terms[j].count
69 })
70
71 fmt.Printf("Top terms:\n")
72 for i, tc := range terms {
73 if i >= 10 {
74 break
75 }
76 fmt.Printf(" %d. %s: %d occurrences\n", i+1, tc.term, tc.count)
77 }
78
79 // 4. Document similarity
80 fmt.Println("\n4. Document Similarity:")
81
82 for i := 0; i < len(preprocessed); i++ {
83 for j := i + 1; j < len(preprocessed); j++ {
84 similarity := calculateJaccardSimilarity(preprocessed[i], preprocessed[j])
85 fmt.Printf("Doc %d vs Doc %d: %.3f\n", i+1, j+1, similarity)
86 }
87 }
88
89 // 5. Pattern extraction
90 fmt.Println("\n5. Pattern Extraction:")
91
92 sampleText := "Contact us at support@example.com or call +1-555-123-4567. Visit https://example.com"
93 patterns := extractPatterns(sampleText)
94
95 fmt.Printf("Text: %s\n", sampleText)
96 fmt.Printf("Extracted patterns:\n")
97 for patternType, matches := range patterns {
98 if len(matches) > 0 {
99 fmt.Printf(" %s: %v\n", patternType, matches)
100 }
101 }
102}
103
104type TextStats struct {
105 WordCount int
106 CharCount int
107 UniqueWords int
108 AvgWordLength float64
109 Complexity string
110}
111
112func preprocessText(text string) string {
113 // 1. Convert to lowercase
114 text = strings.ToLower(text)
115
116 // 2. Remove punctuation (keep spaces)
117 text = regexp.MustCompile(`[^\w\s]`).ReplaceAllString(text, "")
118
119 // 3. Normalize whitespace
120 text = regexp.MustCompile(`\s+`).ReplaceAllString(text, " ")
121
122 // 4. Trim
123 return strings.TrimSpace(text)
124}
125
126func analyzeText(text string) TextStats {
127 words := strings.Fields(text)
128
129 stats := TextStats{
130 WordCount: len(words),
131 CharCount: utf8.RuneCountInString(text),
132 }
133
134 // Unique words
135 wordSet := make(map[string]bool)
136 totalLength := 0
137 for _, word := range words {
138 wordSet[word] = true
139 totalLength += utf8.RuneCountInString(word)
140 }
141 stats.UniqueWords = len(wordSet)
142
143 // Average word length
144 if len(words) > 0 {
145 stats.AvgWordLength = float64(totalLength) / float64(len(words))
146 }
147
148 // Complexity estimation
149 if stats.AvgWordLength < 4 {
150 stats.Complexity = "Simple"
151 } else if stats.AvgWordLength < 6 {
152 stats.Complexity = "Moderate"
153 } else {
154 stats.Complexity = "Complex"
155 }
156
157 return stats
158}
159
160func calculateTermFrequency(documents []string) map[string]int {
161 termFreq := make(map[string]int)
162
163 for _, doc := range documents {
164 words := strings.Fields(doc)
165 for _, word := range words {
166 if len(word) > 2 { // Ignore very short words
167 termFreq[word]++
168 }
169 }
170 }
171
172 return termFreq
173}
174
175func calculateJaccardSimilarity(doc1, doc2 string) float64 {
176 words1 := make(map[string]bool)
177 words2 := make(map[string]bool)
178
179 for _, word := range strings.Fields(doc1) {
180 words1[word] = true
181 }
182
183 for _, word := range strings.Fields(doc2) {
184 words2[word] = true
185 }
186
187 intersection := 0
188 for word := range words1 {
189 if words2[word] {
190 intersection++
191 }
192 }
193
194 union := len(words1) + len(words2) - intersection
195
196 if union == 0 {
197 return 0
198 }
199
200 return float64(intersection) / float64(union)
201}
202
203func extractPatterns(text string) map[string][]string {
204 patterns := make(map[string][]string)
205
206 // Email
207 emailRegex := regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
208 patterns["emails"] = emailRegex.FindAllString(text, -1)
209
210 // Phone
211 phoneRegex := regexp.MustCompile(`\+?\d{1,3}[-.]?\(?\d{3}\)?[-.]?\d{3}[-.]?\d{4}`)
212 patterns["phones"] = phoneRegex.FindAllString(text, -1)
213
214 // URL
215 urlRegex := regexp.MustCompile(`https?://[^\s]+`)
216 patterns["urls"] = urlRegex.FindAllString(text, -1)
217
218 return patterns
219}
Real-World Applications:
- Search Engines: Index and rank documents
- Content Analysis: Analyze text for insights
- Recommendation Systems: Find similar content
- Data Mining: Extract patterns from text
Practice Exercises
Exercise 1: Universal Text Search Engine
Learning Objectives: Master advanced text processing, pattern matching, and building search functionality that works across multiple languages.
Real-World Context: Search engines are the backbone of modern applications - from web search and document search to autocomplete and recommendation systems. Companies like Google, Microsoft, and Elasticsearch build sophisticated text processing systems that handle billions of queries across hundreds of languages.
Difficulty: Advanced | Time Estimate: 4-5 hours
Description: Build a multilingual text search engine that can index documents, handle complex queries, support different languages, and provide relevant results with proper ranking.
Requirements:
- Document Indexing: Parse and index text documents with proper tokenization
- Unicode Support: Handle text in multiple languages and scripts
- Text Normalization: Implement case folding, accent removal, and whitespace normalization
- Query Processing: Support boolean operators and phrase searching
- Relevance Ranking: Implement TF-IDF scoring (a minimal scoring sketch follows this list)
- Performance: Efficient indexing and searching with proper data structures
- Fuzzy Matching: Support approximate string matching
- Caching: Cache frequently used queries
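The relevance-ranking requirement calls for TF-IDF: a term counts for more the more often it appears in a document (term frequency) and the fewer documents it appears in overall (inverse document frequency). A minimal scoring sketch; the function name and signature are hypothetical and not part of the solution below:
package main

import (
	"fmt"
	"math"
)

// tfidf scores one term for one document.
// tf: occurrences of the term in the document
// df: number of documents containing the term
// n:  total number of documents in the index
func tfidf(tf, df, n int) float64 {
	if tf == 0 || df == 0 || n == 0 {
		return 0
	}
	return float64(tf) * math.Log(float64(n)/float64(df))
}

func main() {
	// A term appearing 3 times in a document, present in 2 of 10 documents.
	fmt.Printf("score: %.3f\n", tfidf(3, 2, 10))
}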
Solution
1package main
2
3import (
4 "fmt"
5 "math"
6 "regexp"
7 "sort"
8 "strings"
9 "unicode"
10 "unicode/utf8"
11)
12
13// Document represents a searchable document
14type Document struct {
15 ID int
16 Title string
17 Content string
18 Language string
19}
20
21// SearchQuery represents a search query
22type SearchQuery struct {
23 Terms []string
24 Phrase string
25 Required []string
26 Excluded []string
27 Fuzzy bool
28}
29
30// SearchResult represents a search result
31type SearchResult struct {
32 Document *Document
33 Score float64
34 Matches []string
35}
36
37// SearchEngine represents our search engine
38type SearchEngine struct {
39 documents []Document
40 index map[string][]Posting
41 cache map[string][]SearchResult
42}
43
44// Posting contains document info
45type Posting struct {
46 DocID int
47 TF int
48 Positions []int
49}
50
51// run
52func main() {
53 fmt.Println("=== Universal Text Search Engine ===")
54
55 engine := NewSearchEngine()
56
57 // Add documents
58 documents := []Document{
59 {1, "Go Programming", "Go is a programming language by Google", "en"},
60 {2, "Data Structures", "Understanding data structures and algorithms", "en"},
61 {3, "数据库设计", "数据库是现代应用的核心", "zh"},
62 {4, "Machine Learning", "ML algorithms process data efficiently", "en"},
63 }
64
65 for _, doc := range documents {
66 engine.AddDocument(doc)
67 }
68
69 engine.BuildIndex()
70 fmt.Printf("Indexed %d documents\n", len(documents))
71
72 // Example searches
73 queries := []string{
74 "programming",
75 "data structures",
76 "数据",
77 }
78
79 for _, query := range queries {
80 fmt.Printf("\n=== Query: %s ===\n", query)
81 results := engine.Search(parseQuery(query), 5)
82
83 for i, result := range results {
84 fmt.Printf("%d. %s (score: %.2f)\n", i+1, result.Document.Title, result.Score)
85 }
86 }
87}
88
89func NewSearchEngine() *SearchEngine {
90 return &SearchEngine{
91 index: make(map[string][]Posting),
92 cache: make(map[string][]SearchResult),
93 }
94}
95
96func (se *SearchEngine) AddDocument(doc Document) {
97 se.documents = append(se.documents, doc)
98}
99
100func (se *SearchEngine) BuildIndex() {
101 for _, doc := range se.documents {
102 tokens := tokenize(doc.Content + " " + doc.Title)
103
104 for i, token := range tokens {
105 normalized := normalizeToken(token)
106 if normalized == "" {
107 continue
108 }
109
110 postings := se.index[normalized]
111 found := false
112
113 for j := range postings {
114 if postings[j].DocID == doc.ID {
115 postings[j].TF++
116 postings[j].Positions = append(postings[j].Positions, i)
117 found = true
118 break
119 }
120 }
121
122 if !found {
123 postings = append(postings, Posting{
124 DocID: doc.ID,
125 TF: 1,
126 Positions: []int{i},
127 })
128 }
129
130 se.index[normalized] = postings
131 }
132 }
133}
134
135func (se *SearchEngine) Search(query SearchQuery, limit int) []SearchResult {
136 var results []SearchResult
137
138 if query.Phrase != "" {
139 results = se.phraseSearch(query.Phrase, limit)
140 } else {
141 results = se.termSearch(query, limit)
142 }
143
144 return results
145}
146
147func (se *SearchEngine) termSearch(query SearchQuery, limit int) []SearchResult {
148 docScores := make(map[int]float64)
149 docMatches := make(map[int][]string)
150
151 for _, term := range query.Terms {
152 normalized := normalizeToken(term)
153
154 if postings, exists := se.index[normalized]; exists {
155 for _, posting := range postings {
156 score := se.calculateTFIDF(posting.DocID, normalized)
157 docScores[posting.DocID] += score
158 docMatches[posting.DocID] = append(docMatches[posting.DocID], term)
159 }
160 }
161 }
162
163 var results []SearchResult
164 for docID, score := range docScores {
165 doc := se.findDocument(docID)
166 if doc != nil {
167 results = append(results, SearchResult{
168 Document: doc,
169 Score: score,
170 Matches: docMatches[docID],
171 })
172 }
173 }
174
175 sort.Slice(results, func(i, j int) bool {
176 return results[i].Score > results[j].Score
177 })
178
179 if len(results) > limit {
180 results = results[:limit]
181 }
182
183 return results
184}
185
186func (se *SearchEngine) phraseSearch(phrase string, limit int) []SearchResult {
187 var results []SearchResult
188 normalized := strings.ToLower(phrase)
189
190 for i := range se.documents {
191 content := strings.ToLower(se.documents[i].Content + " " + se.documents[i].Title)
192 if strings.Contains(content, normalized) {
193 results = append(results, SearchResult{
194 Document: &se.documents[i],
195 Score: 1.0,
196 Matches: []string{phrase},
197 })
198 }
199 }
200
201 if len(results) > limit {
202 results = results[:limit]
203 }
204
205 return results
206}
207
208func (se *SearchEngine) calculateTFIDF(docID int, term string) float64 {
209 postings := se.index[term]
210
211 var tf int
212 for _, posting := range postings {
213 if posting.DocID == docID {
214 tf = posting.TF
215 break
216 }
217 }
218
219 if tf == 0 {
220 return 0
221 }
222
223 df := len(postings)
224 if df == 0 {
225 return 0
226 }
227
228 N := len(se.documents)
229 idf := math.Log(float64(N) / float64(df))
230
231 return float64(tf) * idf
232}
233
234func (se *SearchEngine) findDocument(docID int) *Document {
235 for i := range se.documents {
236 if se.documents[i].ID == docID {
237 return &se.documents[i]
238 }
239 }
240 return nil
241}
242
243func tokenize(text string) []string {
244	// Split on any rune that is not a Unicode letter or digit, so non-ASCII scripts are indexed too
245	return strings.FieldsFunc(text, func(r rune) bool { return !unicode.IsLetter(r) && !unicode.IsNumber(r) })
246}
247
248func normalizeToken(token string) string {
249 token = strings.ToLower(token)
250	if utf8.RuneCountInString(token) < 2 {
251 return ""
252 }
253 return token
254}
255
256func parseQuery(query string) SearchQuery {
257 q := SearchQuery{}
258
259 if strings.Contains(query, `"`) {
260 re := regexp.MustCompile(`"([^"]+)"`)
261 matches := re.FindStringSubmatch(query)
262 if len(matches) > 1 {
263 q.Phrase = matches[1]
264 }
265 } else {
266 q.Terms = strings.Fields(query)
267 }
268
269 return q
270}
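The solution declares a cache field on SearchEngine but leaves the caching and fuzzy-matching requirements as extensions. One possible sketch of query caching, assuming the raw query string is an acceptable cache key and that cached entries never need invalidation, is:

// CachedSearch is a hypothetical wrapper around Search that memoizes
// results per raw query string using the existing cache field.
func (se *SearchEngine) CachedSearch(raw string, limit int) []SearchResult {
	if hit, ok := se.cache[raw]; ok {
		return hit // served from the cache
	}
	results := se.Search(parseQuery(raw), limit)
	se.cache[raw] = results
	return results
}

In a real system the cache would also have to be cleared whenever BuildIndex runs again.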
Exercise 2: Text Sanitizer and Validator
Learning Objectives: Implement comprehensive text validation, sanitization, and security measures.
Real-World Context: Text sanitization is crucial for preventing XSS attacks, SQL injection, and other security vulnerabilities. Every user input must be validated and sanitized.
Difficulty: Intermediate | Time Estimate: 2-3 hours
Description: Build a text sanitizer that validates and cleans user input, removes malicious content, and ensures text meets specific requirements.
Requirements:
- XSS Prevention: Remove or escape HTML/JavaScript
- SQL Injection Prevention: Detect and prevent SQL injection attempts
- Input Validation: Validate email, phone, URL formats
- Content Filtering: Remove profanity and unwanted content
- Length Limits: Enforce text length constraints
- Unicode Safety: Handle Unicode properly
- Whitespace Normalization: Clean up excessive whitespace
Solution
1package main
2
3import (
4 "fmt"
5 "regexp"
6 "strings"
7 "unicode"
8 "unicode/utf8"
9)
10
11// Sanitizer handles text sanitization and validation
12type Sanitizer struct {
13 maxLength int
14 allowedChars *regexp.Regexp
15 profanityList map[string]bool
16 sqlPatterns []*regexp.Regexp
17 xssPatterns []*regexp.Regexp
18}
19
20// ValidationResult contains validation results
21type ValidationResult struct {
22 IsValid bool
23 Errors []string
24 Cleaned string
25}
26
27// run
28func main() {
29 fmt.Println("=== Text Sanitizer and Validator ===")
30
31 sanitizer := NewSanitizer()
32
33 // Test cases
34 tests := []struct {
35 name string
36 input string
37 }{
38 {"Normal text", "Hello, World! This is a test."},
39 {"XSS attempt", "<script>alert('XSS')</script>"},
40 {"SQL injection", "'; DROP TABLE users; --"},
41 {"Excessive whitespace", "Hello World \n\n\n Test"},
42 {"Long text", strings.Repeat("x", 1100)},
43 {"Mixed content", "Email me@example.com or call 555-1234"},
44 {"Unicode", "Hello 世界 🌍"},
45 }
46
47 for _, test := range tests {
48 fmt.Printf("\n=== %s ===\n", test.name)
49 fmt.Printf("Input: %s\n", truncate(test.input, 50))
50
51 result := sanitizer.Sanitize(test.input)
52 fmt.Printf("Valid: %t\n", result.IsValid)
53
54 if len(result.Errors) > 0 {
55 fmt.Printf("Errors: %v\n", result.Errors)
56 }
57
58 fmt.Printf("Cleaned: %s\n", truncate(result.Cleaned, 50))
59 }
60
61 // Validation examples
62 fmt.Println("\n=== Format Validation ===")
63
64 emails := []string{"valid@example.com", "invalid@", "test@test"}
65 phones := []string{"555-1234", "(555) 123-4567", "invalid"}
66
67 for _, email := range emails {
68 fmt.Printf("Email %s: valid=%t\n", email, sanitizer.ValidateEmail(email))
69 }
70
71 for _, phone := range phones {
72 fmt.Printf("Phone %s: valid=%t\n", phone, sanitizer.ValidatePhone(phone))
73 }
74}
75
76func NewSanitizer() *Sanitizer {
77 return &Sanitizer{
78 maxLength: 1000,
79 allowedChars: regexp.MustCompile(`^[\w\s\p{L}\p{N}.,!?'"()-]+$`),
80 profanityList: map[string]bool{
81 "badword": true,
82 "profanity": true,
83 },
84 sqlPatterns: []*regexp.Regexp{
85 regexp.MustCompile(`(?i)(DROP|DELETE|INSERT|UPDATE)\s+(TABLE|FROM|INTO)`),
86 regexp.MustCompile(`(?i)(SELECT.*FROM|UNION.*SELECT)`),
87 regexp.MustCompile(`(?i)(--|;|'|"|\*|=)`),
88 },
89 xssPatterns: []*regexp.Regexp{
90 regexp.MustCompile(`(?i)<script[^>]*>.*?</script>`),
91 regexp.MustCompile(`(?i)<iframe[^>]*>.*?</iframe>`),
92 regexp.MustCompile(`(?i)javascript:`),
93 regexp.MustCompile(`(?i)on\w+\s*=`),
94 },
95 }
96}
97
98func (s *Sanitizer) Sanitize(text string) ValidationResult {
99 result := ValidationResult{
100 IsValid: true,
101 Cleaned: text,
102 }
103
104 // Check length
105 if utf8.RuneCountInString(text) > s.maxLength {
106 result.Errors = append(result.Errors, fmt.Sprintf("Text exceeds %d characters", s.maxLength))
107 result.IsValid = false
108 runes := []rune(text)
109 result.Cleaned = string(runes[:s.maxLength])
110 }
111
112 // Check for XSS
113 for _, pattern := range s.xssPatterns {
114 if pattern.MatchString(result.Cleaned) {
115 result.Errors = append(result.Errors, "XSS content detected")
116 result.IsValid = false
117 result.Cleaned = pattern.ReplaceAllString(result.Cleaned, "")
118 }
119 }
120
121 // Check for SQL injection
122 for _, pattern := range s.sqlPatterns {
123 if pattern.MatchString(result.Cleaned) {
124 result.Errors = append(result.Errors, "SQL injection attempt detected")
125 result.IsValid = false
126 }
127 }
128
129 // Normalize whitespace
130 result.Cleaned = normalizeWhitespace(result.Cleaned)
131
132 // Remove profanity
133 result.Cleaned = s.removeProfanity(result.Cleaned)
134
135 // Trim
136 result.Cleaned = strings.TrimSpace(result.Cleaned)
137
138 return result
139}
140
141func (s *Sanitizer) ValidateEmail(email string) bool {
142	emailRegex := regexp.MustCompile(`^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$`)
143 return emailRegex.MatchString(email)
144}
145
146func (s *Sanitizer) ValidatePhone(phone string) bool {
147	phoneRegex := regexp.MustCompile(`^(\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}$`) // optional 3-digit area code
148 return phoneRegex.MatchString(phone)
149}
150
151func (s *Sanitizer) removeProfanity(text string) string {
152	words := strings.Fields(text)
153	for i, word := range words {
154		if s.profanityList[strings.ToLower(word)] {
155			words[i] = strings.Repeat("*", utf8.RuneCountInString(word))
156 }
157 }
158 return strings.Join(words, " ")
159}
160
161func normalizeWhitespace(s string) string {
162 runes := []rune(s)
163 var normalized []rune
164 var inWhitespace bool
165
166 for _, r := range runes {
167 if unicode.IsSpace(r) {
168 if !inWhitespace {
169 normalized = append(normalized, ' ')
170 inWhitespace = true
171 }
172 } else {
173 normalized = append(normalized, r)
174 inWhitespace = false
175 }
176 }
177
178 return string(normalized)
179}
180
181func truncate(s string, maxLen int) string {
182 runes := []rune(s)
183 if len(runes) <= maxLen {
184 return s
185 }
186 return string(runes[:maxLen]) + "..."
187}
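The requirements also call for URL validation, which the solution above leaves out. One possible sketch using the standard library's net/url package (it assumes "net/url" is added to the import block) is:

// ValidateURL is a hypothetical addition that accepts only absolute
// http/https URLs with a non-empty host.
func (s *Sanitizer) ValidateURL(raw string) bool {
	u, err := url.ParseRequestURI(raw)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}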
Exercise 3: Multilingual Tokenizer
Learning Objectives: Build a tokenizer that handles multiple languages and scripts correctly.
Real-World Context: Tokenization is the foundation of NLP. Different languages have different rules for word boundaries, making multilingual tokenization challenging.
Difficulty: Advanced | Time Estimate: 3-4 hours
Description: Create a tokenizer that correctly splits text into tokens (words) across multiple languages including CJK languages, Arabic, and European languages.
Requirements:
- Language Detection: Detect the primary language
- Script-Aware Tokenization: Different rules for different scripts
- CJK Support: Handle Chinese, Japanese, Korean character-by-character
- Arabic Support: Handle RTL text and diacritics
- Compound Words: Handle German/Dutch compound words
- Punctuation: Proper punctuation handling
- Performance: Efficient tokenization
Solution
1package main
2
3import (
4 "fmt"
5 "regexp"
6 "strings"
7 "unicode"
8)
9
10// Tokenizer handles multilingual tokenization
11type Tokenizer struct {
12 preservePunctuation bool
13 lowercaseTokens bool
14}
15
16// Token represents a text token
17type Token struct {
18 Text string
19 Type TokenType
20 Language string
21 Position int
22}
23
24// TokenType represents token types
25type TokenType int
26
27const (
28 Word TokenType = iota
29 Number
30 Punctuation
31 Whitespace
32 Symbol
33)
34
35// run
36func main() {
37 fmt.Println("=== Multilingual Tokenizer ===")
38
39 tokenizer := NewTokenizer()
40
41 // Test texts in different languages
42 texts := []struct {
43 name string
44 text string
45 }{
46 {"English", "Hello, world! How are you?"},
47 {"Chinese", "你好世界,今天天气很好。"},
48 {"Japanese", "こんにちは、世界!元気ですか?"},
49 {"Arabic", "مرحبا بالعالم! كيف حالك؟"},
50 {"German", "Guten Tag! Donaudampfschifffahrtsgesellschaft."},
51 {"Mixed", "Hello 世界! مرحبا 123."},
52 }
53
54 for _, test := range texts {
55 fmt.Printf("\n=== %s ===\n", test.name)
56 fmt.Printf("Text: %s\n", test.text)
57
58 tokens := tokenizer.Tokenize(test.text)
59 fmt.Printf("Tokens (%d):\n", len(tokens))
60
61 for i, token := range tokens {
62 fmt.Printf(" %d. %q [%s, %s]\n", i+1, token.Text,
63 getTokenTypeName(token.Type), token.Language)
64 }
65 }
66
67 // Language-specific tokenization
68 fmt.Println("\n=== Language-Specific Rules ===")
69
70 chineseText := "我爱编程。"
71 tokens := tokenizer.TokenizeChinese(chineseText)
72 fmt.Printf("Chinese: %s\n", chineseText)
73 fmt.Printf("Tokens: %v\n", tokens)
74
75 arabicText := "مرحبا بالعالم"
76 tokens = tokenizer.TokenizeArabic(arabicText)
77 fmt.Printf("Arabic: %s\n", arabicText)
78 fmt.Printf("Tokens: %v\n", tokens)
79}
80
81func NewTokenizer() *Tokenizer {
82 return &Tokenizer{
83 preservePunctuation: false,
84 lowercaseTokens: true,
85 }
86}
87
88func (t *Tokenizer) Tokenize(text string) []Token {
89	var tokens []Token
90	runes := []rune(text)
91	position := 0
92
93	for i := 0; i < len(runes); i++ {
94	r := runes[i]
95	lang := detectRuneLanguage(r)
96	tokenType := detectTokenType(r)
97
98	// Different tokenization strategies based on script
99	switch {
100	case lang == "CJK":
101	// CJK: one token per character
102	tokens = append(tokens, Token{
103	Text: string(r),
104	Type: tokenType,
105	Language: lang,
106	Position: position,
107	})
108	position++
109
110	case unicode.IsLetter(r) || unicode.IsNumber(r):
111	// Alphabetic scripts: collect a run of word characters
112	start := i
113	for i+1 < len(runes) {
114	next := runes[i+1]
115	if !unicode.IsLetter(next) && !unicode.IsNumber(next) && next != '\'' {
116	break
117	}
118	if detectRuneLanguage(next) == "CJK" {
119	break // keep CJK characters out of word runs
120	}
121	i++
122	}
123
124	word := string(runes[start : i+1])
125	if t.lowercaseTokens {
126	word = strings.ToLower(word)
127	}
128
129	tokens = append(tokens, Token{
130	Text: word,
131	Type: Word,
132	Language: lang,
133	Position: position,
134	})
135	position++
136
137	case !unicode.IsSpace(r) && t.preservePunctuation:
138	tokens = append(tokens, Token{
139	Text: string(r),
140	Type: tokenType,
141	Language: lang,
142	Position: position,
143	})
144	position++
145	}
146	}
147
148	return tokens
149}
150
151func (t *Tokenizer) TokenizeChinese(text string) []string {
152 var tokens []string
153
154 for _, r := range text {
155 if unicode.Is(unicode.Han, r) {
156 tokens = append(tokens, string(r))
157 }
158 }
159
160 return tokens
161}
162
163func (t *Tokenizer) TokenizeArabic(text string) []string {
164 // Simple word-based tokenization for Arabic
165 re := regexp.MustCompile(`[\p{Arabic}]+`)
166 return re.FindAllString(text, -1)
167}
168
169func detectRuneLanguage(r rune) string {
170 switch {
171 case r >= 'A' && r <= 'Z' || r >= 'a' && r <= 'z':
172 return "Latin"
173 case r >= 0x4E00 && r <= 0x9FFF:
174 return "CJK"
175 case r >= 0x3040 && r <= 0x309F || r >= 0x30A0 && r <= 0x30FF:
176 return "Japanese"
177 case r >= 0x0600 && r <= 0x06FF:
178 return "Arabic"
179 case r >= 0x0400 && r <= 0x04FF:
180 return "Cyrillic"
181 default:
182 return "Other"
183 }
184}
185
186func detectTokenType(r rune) TokenType {
187 switch {
188 case unicode.IsLetter(r):
189 return Word
190 case unicode.IsNumber(r):
191 return Number
192 case unicode.IsPunct(r):
193 return Punctuation
194 case unicode.IsSpace(r):
195 return Whitespace
196 default:
197 return Symbol
198 }
199}
200
201func getTokenTypeName(t TokenType) string {
202 switch t {
203 case Word:
204 return "Word"
205 case Number:
206 return "Number"
207 case Punctuation:
208 return "Punctuation"
209 case Whitespace:
210 return "Whitespace"
211 default:
212 return "Symbol"
213 }
214}
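The requirements also ask for detecting a text's primary language. A rough sketch that reuses detectRuneLanguage and simply picks the most common script among the letters (a script heuristic, not true language identification) might look like this:

// DetectPrimaryScript is a hypothetical helper that returns the script
// covering the largest number of letters in the text.
func DetectPrimaryScript(text string) string {
	counts := make(map[string]int)
	for _, r := range text {
		if unicode.IsLetter(r) {
			counts[detectRuneLanguage(r)]++
		}
	}
	best, bestCount := "Other", 0
	for script, count := range counts {
		if count > bestCount {
			best, bestCount = script, count
		}
	}
	return best
}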
Exercise 4: Text Diff and Patch System
Learning Objectives: Implement algorithms for comparing texts and generating patches.
Real-World Context: Version control systems like Git use diff algorithms to show changes between file versions. Understanding text diffs is essential for building collaboration tools.
Difficulty: Advanced | Time Estimate: 4-5 hours
Description: Build a system that can compare two texts, generate a diff showing changes, and apply patches to reconstruct modified versions.
Requirements:
- Line-by-Line Diff: Compare texts line by line
- Word-Level Diff: Show word-level changes
- Character-Level Diff: Show character-level changes
- Unified Format: Output in unified diff format
- Patch Application: Apply patches to original text
- Conflict Detection: Detect and report conflicts
- Performance: Handle large texts efficiently
Solution
1package main
2
3import (
4 "fmt"
5 "strings"
6)
7
8// DiffType represents the type of difference
9type DiffType int
10
11const (
12 Equal DiffType = iota
13 Insert
14 Delete
15 Replace
16)
17
18// Diff represents a difference between two texts
19type Diff struct {
20 Type DiffType
21 OldText string
22 NewText string
23 Line int
24}
25
26// Differ handles text comparison
27type Differ struct{}
28
29// run
30func main() {
31 fmt.Println("=== Text Diff and Patch System ===")
32
33 differ := NewDiffer()
34
35 // Test cases
36 original := `Line 1: Hello World
37Line 2: This is a test
38Line 3: Some content here
39Line 4: More text`
40
41 modified := `Line 1: Hello World
42Line 2: This is a modified test
43Line 3: New content added
44Line 4: More text
45Line 5: Additional line`
46
47 // Generate diff
48 fmt.Println("=== Line-by-Line Diff ===")
49 diffs := differ.DiffLines(original, modified)
50
51 for _, diff := range diffs {
52 switch diff.Type {
53 case Equal:
54 fmt.Printf(" %s\n", diff.OldText)
55 case Insert:
56 fmt.Printf("+ %s\n", diff.NewText)
57 case Delete:
58 fmt.Printf("- %s\n", diff.OldText)
59 case Replace:
60 fmt.Printf("- %s\n", diff.OldText)
61 fmt.Printf("+ %s\n", diff.NewText)
62 }
63 }
64
65 // Word-level diff
66 fmt.Println("\n=== Word-Level Diff ===")
67 line1 := "The quick brown fox jumps"
68 line2 := "The fast brown fox leaps"
69
70 wordDiffs := differ.DiffWords(line1, line2)
71 for _, diff := range wordDiffs {
72 switch diff.Type {
73 case Equal:
74 fmt.Printf("%s ", diff.OldText)
75 case Delete:
76 fmt.Printf("[-%s] ", diff.OldText)
77 case Insert:
78 fmt.Printf("[+%s] ", diff.NewText)
79 }
80 }
81 fmt.Println()
82
83 // Generate patch
84 fmt.Println("\n=== Unified Diff Format ===")
85 patch := differ.GeneratePatch(original, modified, "original.txt", "modified.txt")
86 fmt.Println(patch)
87
88 // Calculate similarity
89 fmt.Println("\n=== Similarity Analysis ===")
90 similarity := differ.CalculateSimilarity(original, modified)
91 fmt.Printf("Similarity: %.2f%%\n", similarity*100)
92}
93
94func NewDiffer() *Differ {
95 return &Differ{}
96}
97
98func (d *Differ) DiffLines(text1, text2 string) []Diff {
99 lines1 := strings.Split(text1, "\n")
100 lines2 := strings.Split(text2, "\n")
101
102 return d.diff(lines1, lines2)
103}
104
105func (d *Differ) DiffWords(text1, text2 string) []Diff {
106 words1 := strings.Fields(text1)
107 words2 := strings.Fields(text2)
108
109 return d.diff(words1, words2)
110}
111
112func (d *Differ) diff(seq1, seq2 []string) []Diff {
113 var diffs []Diff
114 n, m := len(seq1), len(seq2)
115
116 // Simple LCS-based diff algorithm
117 lcs := d.longestCommonSubsequence(seq1, seq2)
118
119 i, j, k := 0, 0, 0
120
121 for k < len(lcs) {
122 // Find deletions
123 for i < n && seq1[i] != lcs[k] {
124 diffs = append(diffs, Diff{
125 Type: Delete,
126 OldText: seq1[i],
127 Line: i,
128 })
129 i++
130 }
131
132 // Find insertions
133 for j < m && seq2[j] != lcs[k] {
134 diffs = append(diffs, Diff{
135 Type: Insert,
136 NewText: seq2[j],
137 Line: j,
138 })
139 j++
140 }
141
142 // Equal element
143 if i < n && j < m && seq1[i] == seq2[j] {
144 diffs = append(diffs, Diff{
145 Type: Equal,
146 OldText: seq1[i],
147 NewText: seq2[j],
148 Line: i,
149 })
150 i++
151 j++
152 k++
153 }
154 }
155
156 // Remaining deletions
157 for i < n {
158 diffs = append(diffs, Diff{
159 Type: Delete,
160 OldText: seq1[i],
161 Line: i,
162 })
163 i++
164 }
165
166 // Remaining insertions
167 for j < m {
168 diffs = append(diffs, Diff{
169 Type: Insert,
170 NewText: seq2[j],
171 Line: j,
172 })
173 j++
174 }
175
176 return diffs
177}
178
179func (d *Differ) longestCommonSubsequence(seq1, seq2 []string) []string {
180 n, m := len(seq1), len(seq2)
181
182 // DP table
183 dp := make([][]int, n+1)
184 for i := range dp {
185 dp[i] = make([]int, m+1)
186 }
187
188 // Fill DP table
189 for i := 1; i <= n; i++ {
190 for j := 1; j <= m; j++ {
191 if seq1[i-1] == seq2[j-1] {
192 dp[i][j] = dp[i-1][j-1] + 1
193 } else {
194 dp[i][j] = max(dp[i-1][j], dp[i][j-1])
195 }
196 }
197 }
198
199 // Backtrack to find LCS
200 var lcs []string
201 i, j := n, m
202
203 for i > 0 && j > 0 {
204 if seq1[i-1] == seq2[j-1] {
205 lcs = append([]string{seq1[i-1]}, lcs...)
206 i--
207 j--
208 } else if dp[i-1][j] > dp[i][j-1] {
209 i--
210 } else {
211 j--
212 }
213 }
214
215 return lcs
216}
217
218func (d *Differ) GeneratePatch(text1, text2, file1, file2 string) string {
219 var patch strings.Builder
220
221 patch.WriteString(fmt.Sprintf("--- %s\n", file1))
222 patch.WriteString(fmt.Sprintf("+++ %s\n", file2))
223
224 diffs := d.DiffLines(text1, text2)
225
226 for _, diff := range diffs {
227 switch diff.Type {
228 case Delete:
229 patch.WriteString(fmt.Sprintf("- %s\n", diff.OldText))
230 case Insert:
231 patch.WriteString(fmt.Sprintf("+ %s\n", diff.NewText))
232 case Equal:
233 patch.WriteString(fmt.Sprintf(" %s\n", diff.OldText))
234 }
235 }
236
237 return patch.String()
238}
239
240func (d *Differ) CalculateSimilarity(text1, text2 string) float64 {
241 lines1 := strings.Split(text1, "\n")
242 lines2 := strings.Split(text2, "\n")
243
244 lcs := d.longestCommonSubsequence(lines1, lines2)
245
246 maxLen := max(len(lines1), len(lines2))
247 if maxLen == 0 {
248 return 1.0
249 }
250
251 return float64(len(lcs)) / float64(maxLen)
252}
253
254func max(a, b int) int {
255 if a > b {
256 return a
257 }
258 return b
259}
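The requirements also include patch application and conflict detection, which the solution above leaves as extensions. A minimal sketch that replays a patch produced by GeneratePatch, assuming the simple "- ", "+ ", and " " line prefixes used there, could look like this:

// ApplyPatch is a hypothetical extension: it replays a patch produced by
// GeneratePatch against the original text and reports a conflict when a
// context or deletion line no longer matches the original.
func (d *Differ) ApplyPatch(original, patch string) (string, error) {
	origLines := strings.Split(original, "\n")
	var out []string
	i := 0

	for _, line := range strings.Split(patch, "\n") {
		switch {
		case strings.HasPrefix(line, "--- "), strings.HasPrefix(line, "+++ "), line == "":
			continue // file headers and the trailing blank line
		case strings.HasPrefix(line, "+ "):
			out = append(out, line[2:]) // inserted line
		case strings.HasPrefix(line, "- "):
			if i >= len(origLines) || origLines[i] != line[2:] {
				return "", fmt.Errorf("conflict near original line %d", i+1)
			}
			i++ // deleted line: consume it from the original
		case strings.HasPrefix(line, " "):
			if i >= len(origLines) || origLines[i] != line[1:] {
				return "", fmt.Errorf("conflict near original line %d", i+1)
			}
			out = append(out, origLines[i]) // unchanged context line
			i++
		}
	}

	return strings.Join(out, "\n"), nil
}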
Exercise 5: Smart Text Autocomplete
Learning Objectives: Build an intelligent autocomplete system using trie data structures and frequency analysis.
Real-World Context: Autocomplete is ubiquitous in modern applications - search engines, IDEs, messaging apps. Understanding how to build efficient autocomplete systems is valuable.
Difficulty: Intermediate | Time Estimate: 3-4 hours
Description: Create an autocomplete system that suggests completions based on partial input, using frequency-based ranking and supporting multiple languages.
Requirements:
- Trie Data Structure: Efficient prefix matching
- Frequency Ranking: Rank suggestions by usage frequency
- Multi-Language Support: Handle different scripts
- Case-Insensitive: Ignore case for matching
- Real-Time Updates: Update suggestions as user types
- Fuzzy Matching: Suggest corrections for typos
- Performance: Fast lookups (< 10ms)
Solution
1package main
2
3import (
4 "fmt"
5 "sort"
6 "strings"
7)
8
9
10// TrieNode represents a node in the trie
11type TrieNode struct {
12 children map[rune]*TrieNode
13 isEnd bool
14 frequency int
15 word string
16}
17
18// Autocomplete handles text autocomplete
19type Autocomplete struct {
20 root *TrieNode
21}
22
23// Suggestion represents an autocomplete suggestion
24type Suggestion struct {
25 Word string
26 Frequency int
27 Score float64
28}
29
30// run
31func main() {
32 fmt.Println("=== Smart Text Autocomplete ===")
33
34 ac := NewAutocomplete()
35
36 // Build dictionary with frequencies
37 words := []struct {
38 word string
39 freq int
40 }{
41 {"hello", 100},
42 {"help", 50},
43 {"helpful", 30},
44 {"world", 80},
45 {"word", 60},
46 {"work", 70},
47 {"programming", 90},
48 {"program", 85},
49 {"programmer", 75},
50 {"python", 40},
51 {"go", 95},
52 {"golang", 88},
53 }
54
55 for _, w := range words {
56 ac.AddWord(w.word, w.freq)
57 }
58
59 fmt.Printf("Dictionary built with %d words\n", len(words))
60
61 // Test autocomplete
62 queries := []string{
63 "hel",
64 "wor",
65 "pro",
66 "go",
67 "py",
68 }
69
70 for _, query := range queries {
71 fmt.Printf("\n=== Query: %s ===\n", query)
72 suggestions := ac.Suggest(query, 5)
73
74 if len(suggestions) == 0 {
75 fmt.Println("No suggestions found")
76 } else {
77 for i, sugg := range suggestions {
78 fmt.Printf("%d. %s (freq: %d, score: %.2f)\n",
79 i+1, sugg.Word, sugg.Frequency, sugg.Score)
80 }
81 }
82 }
83
84 // Fuzzy matching
85 fmt.Println("\n=== Fuzzy Matching ===")
86 fuzzyQueries := []string{
87 "helo", // typo: hello
88 "wrld", // typo: world
89 "progam", // typo: program
90 }
91
92 for _, query := range fuzzyQueries {
93 fmt.Printf("\nQuery: %s (typo)\n", query)
94 suggestions := ac.FuzzySuggest(query, 3)
95
96 for i, sugg := range suggestions {
97 fmt.Printf("%d. %s (suggested correction)\n", i+1, sugg.Word)
98 }
99 }
100}
101
102func NewAutocomplete() *Autocomplete {
103 return &Autocomplete{
104 root: &TrieNode{
105 children: make(map[rune]*TrieNode),
106 },
107 }
108}
109
110func (ac *Autocomplete) AddWord(word string, frequency int) {
111 node := ac.root
112 word = strings.ToLower(word)
113
114 for _, ch := range word {
115 if node.children[ch] == nil {
116 node.children[ch] = &TrieNode{
117 children: make(map[rune]*TrieNode),
118 }
119 }
120 node = node.children[ch]
121 }
122
123 node.isEnd = true
124 node.frequency = frequency
125 node.word = word
126}
127
128func (ac *Autocomplete) Suggest(prefix string, limit int) []Suggestion {
129 prefix = strings.ToLower(prefix)
130
131 // Find the prefix node
132 node := ac.root
133 for _, ch := range prefix {
134 if node.children[ch] == nil {
135 return nil
136 }
137 node = node.children[ch]
138 }
139
140 // Collect all completions
141 var suggestions []Suggestion
142 ac.collectSuggestions(node, &suggestions)
143
144 // Sort by frequency and score
145 sort.Slice(suggestions, func(i, j int) bool {
146 if suggestions[i].Frequency != suggestions[j].Frequency {
147 return suggestions[i].Frequency > suggestions[j].Frequency
148 }
149 return suggestions[i].Word < suggestions[j].Word
150 })
151
152 // Calculate scores
153 for i := range suggestions {
154 suggestions[i].Score = float64(suggestions[i].Frequency) / 100.0
155 }
156
157 if len(suggestions) > limit {
158 suggestions = suggestions[:limit]
159 }
160
161 return suggestions
162}
163
164func (ac *Autocomplete) collectSuggestions(node *TrieNode, suggestions *[]Suggestion) {
165 if node.isEnd {
166 *suggestions = append(*suggestions, Suggestion{
167 Word: node.word,
168 Frequency: node.frequency,
169 })
170 }
171
172 for _, child := range node.children {
173 ac.collectSuggestions(child, suggestions)
174 }
175}
176
177func (ac *Autocomplete) FuzzySuggest(query string, limit int) []Suggestion {
178 query = strings.ToLower(query)
179
180 // Find all words and calculate edit distance
181 var candidates []Suggestion
182 ac.collectAllWords(ac.root, &candidates)
183
184 // Calculate Levenshtein distance
185 for i := range candidates {
186 distance := levenshteinDistance(query, candidates[i].Word)
187 candidates[i].Score = 1.0 / (1.0 + float64(distance))
188 }
189
190 // Sort by edit distance and frequency
191 sort.Slice(candidates, func(i, j int) bool {
192 if candidates[i].Score != candidates[j].Score {
193 return candidates[i].Score > candidates[j].Score
194 }
195 return candidates[i].Frequency > candidates[j].Frequency
196 })
197
198 if len(candidates) > limit {
199 candidates = candidates[:limit]
200 }
201
202 return candidates
203}
204
205func (ac *Autocomplete) collectAllWords(node *TrieNode, words *[]Suggestion) {
206 if node.isEnd {
207 *words = append(*words, Suggestion{
208 Word: node.word,
209 Frequency: node.frequency,
210 })
211 }
212
213 for _, child := range node.children {
214 ac.collectAllWords(child, words)
215 }
216}
217
218func levenshteinDistance(s1, s2 string) int {
219 r1, r2 := []rune(s1), []rune(s2)
220 n, m := len(r1), len(r2)
221
222 if n == 0 {
223 return m
224 }
225 if m == 0 {
226 return n
227 }
228
229 matrix := make([][]int, n+1)
230 for i := range matrix {
231 matrix[i] = make([]int, m+1)
232 matrix[i][0] = i
233 }
234 for j := 1; j <= m; j++ {
235 matrix[0][j] = j
236 }
237
238 for i := 1; i <= n; i++ {
239 for j := 1; j <= m; j++ {
240 cost := 0
241 if r1[i-1] != r2[j-1] {
242 cost = 1
243 }
244
245 deletion := matrix[i-1][j] + 1
246 insertion := matrix[i][j-1] + 1
247 substitution := matrix[i-1][j-1] + cost
248
249 matrix[i][j] = min(deletion, min(insertion, substitution))
250 }
251 }
252
253 return matrix[n][m]
254}
255
256func min(a, b int) int {
257 if a < b {
258 return a
259 }
260 return b
261}
Common Pitfalls and Best Practices
Text Processing Mistakes to Avoid
- Using len() for character count: Always use utf8.RuneCountInString() for Unicode text (see the snippet after this list)
- Byte-based substring: Never slice strings by byte index with non-ASCII text
- Ignoring normalization: Always normalize text before comparison
- Inefficient concatenation: Use strings.Builder for building strings in loops
- Uncompiled regex: Always compile regex patterns once, not in loops
- Ignoring locale: Case conversion and sorting are locale-dependent
- Missing validation: Always validate and sanitize user input
- Improper encoding: Ensure consistent UTF-8 encoding throughout
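A short, self-contained sketch of the first few pitfalls (byte length versus rune count, byte slicing, string building, and precompiled patterns):

package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// Compile the pattern once, at package level, rather than inside a loop.
var wordRe = regexp.MustCompile(`\p{L}+`)

func main() {
	s := "café 世界"

	// len counts bytes; utf8.RuneCountInString counts characters (12 vs. 7 here).
	fmt.Println(len(s), utf8.RuneCountInString(s))

	// Slice by runes, not bytes, so multi-byte characters are never cut in half.
	runes := []rune(s)
	fmt.Println(string(runes[:4])) // "café"

	// Build strings with strings.Builder instead of repeated concatenation.
	var b strings.Builder
	for i := 0; i < 3; i++ {
		b.WriteString("go")
	}
	fmt.Println(b.String()) // "gogogo"

	// Reuse the precompiled pattern.
	fmt.Println(wordRe.FindAllString(s, -1)) // [café 世界]
}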
Security Considerations
- Input Validation: Always validate and sanitize user input
- XSS Prevention: Escape HTML/JavaScript in user content (see the sketch after this list)
- SQL Injection: Use parameterized queries, not string concatenation
- Path Traversal: Validate file paths to prevent directory traversal
- Buffer Overflow: Enforce length limits on user input
- Unicode Attacks: Be aware of homograph and normalization attacks
- Regular Expression DoS: Limit regex complexity and execution time
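A brief sketch of the first three points using only the standard library; the table schema and database handle are placeholders, not part of any earlier example:

package sanitize

import (
	"database/sql"
	"html"
	"unicode/utf8"
)

const maxCommentLen = 1000

// storeComment is an illustrative sketch: it enforces a length limit, escapes
// HTML so the stored text is safe to echo back (escaping at render time with
// html/template is the more common production pattern), and binds the value
// with a query placeholder instead of concatenating it into the SQL.
func storeComment(db *sql.DB, userID int, comment string) error {
	if utf8.RuneCountInString(comment) > maxCommentLen {
		comment = string([]rune(comment)[:maxCommentLen])
	}
	safe := html.EscapeString(comment)

	// "?" is the placeholder style used by MySQL/SQLite drivers; PostgreSQL uses $1.
	_, err := db.Exec("INSERT INTO comments (user_id, body) VALUES (?, ?)", userID, safe)
	return err
}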
Summary
Key Takeaways
- Unicode Handling: Always work with runes for character operations
- String Building: Use strings.Builder for efficient string concatenation
- Normalization: Normalize text for consistent comparison and processing
- Performance: Understand the performance implications of string operations
- Safety: Never split strings by byte indices with UTF-8 text
- Internationalization: Consider language-specific rules and cultural nuances
- Security: Validate and sanitize text input to prevent attacks
- Regex: Compile patterns once and reuse them for pattern matching
Best Practices
- Unicode Awareness: Always assume text contains Unicode characters
- Normalization: Normalize text before comparison or storage
- Efficient Operations: Use appropriate data structures
- Memory Management: Reuse buffers and avoid unnecessary allocations
- Validation: Validate text input and handle edge cases
- Testing: Test with various languages and character sets
- Documentation: Document encoding requirements
When to Use What
- Basic Operations: strings package for simple manipulation
- Unicode Handling: unicode and utf8 packages for character processing
- Regular Expressions: regexp for pattern matching
- Internationalization: golang.org/x/text for advanced i18n support
- Text Building: strings.Builder for concatenation
- Character Iteration: Range over strings or use utf8.DecodeRuneInString (see the snippet after this list)
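A quick illustration of the two iteration styles mentioned in the last item:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo"

	// range decodes the string rune by rune; i is the byte offset.
	for i, r := range s {
		fmt.Printf("byte %d: %c\n", i, r)
	}

	// utf8.DecodeRuneInString gives explicit control over the same walk.
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("byte %d: %c (%d bytes)\n", i, r, size)
		i += size
	}
}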
Production Libraries
- golang.org/x/text: Unicode normalization, collation, encoding (see the sketch after this list)
- golang.org/x/text/language: Language tag support
- golang.org/x/text/cases: Locale-aware case mapping
- github.com/rivo/uniseg: Unicode text segmentation
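For example, golang.org/x/text/unicode/norm (installed with go get golang.org/x/text) can collapse the two Unicode spellings of "é" (the precomposed code point, or "e" plus a combining accent) into one canonical form before comparison; a minimal sketch:

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "caf\u00e9"    // é as a single code point
	decomposed := "cafe\u0301" // e followed by a combining acute accent

	fmt.Println(composed == decomposed)                                   // false: different byte sequences
	fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true after NFC normalization
}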
Further Reading
- Go strings package
- Go unicode package
- Go utf8 package
- Go regexp package
- golang.org/x/text
- Unicode Standard
- UTF-8 Specification
- Regular Expression Syntax
Text processing transforms your Go applications from single-language tools into global platforms that can handle the complexity of human communication across all languages and cultures.