Why This Matters - Pattern Matching as a Production Superpower
💡 Real-world Context: Regular expressions are the unsung heroes of production systems. From validating user input in registration forms to parsing log files for security monitoring, from routing URLs in web frameworks to extracting data from text streams - pattern matching silently powers countless critical operations that keep applications running smoothly.
⚠️ Production Reality: Poor regex usage causes real system failures:
- Security Vulnerabilities: Catastrophic backtracking allowing DoS attacks (ReDoS)
- Performance Disasters: Regular expressions that take exponential time on certain inputs
- Data Loss: Malformed patterns silently dropping valid data during extraction
- Maintenance Hell: Complex patterns that no one can understand or modify
- False Positives/Negatives: Validation patterns that block legitimate users or allow malicious input
- Integration Failures: Incompatible regex flavors breaking when migrating between systems
Go's RE2-based regexp package provides linear-time guarantees, preventing catastrophic backtracking and making regex safe for production use.
Learning Objectives
By the end of this article, you will:
- Master regex syntax with production-ready patterns for common use cases
- Implement efficient compilation and caching strategies for high-performance systems
- Build validation systems with clear error messages and edge case handling
- Create extractors for structured data parsing
- Understand Go's regex limitations and when to use alternatives
- Develop security-aware pattern matching that prevents ReDoS attacks
- Build maintainable regex with documentation and testing strategies
- Integrate regex with real-world applications efficiently
Core Concepts - Understanding the Pattern Language
The Foundation: Literals, Wildcards, and Quantifiers
Regular expressions are a declarative language for describing text patterns. Think of them as blueprints for text matching rather than imperative code that searches character by character.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Literals match exact characters
	literalRe := regexp.MustCompile(`hello`)
	fmt.Printf("Literal 'hello' in 'hello world': %v\n", literalRe.MatchString("hello world"))
	fmt.Printf("Literal 'hello' in 'Hello world': %v\n", literalRe.MatchString("Hello world"))

	// Wildcard . matches any single character except newline
	wildcardRe := regexp.MustCompile(`h.llo`)
	fmt.Printf("\nWildcard 'h.llo' in 'hallo': %v\n", wildcardRe.MatchString("hallo"))
	fmt.Printf("Wildcard 'h.llo' in 'h?llo': %v\n", wildcardRe.MatchString("h?llo"))
	fmt.Printf("Wildcard 'h.llo' in 'hello': %v\n", wildcardRe.MatchString("hello"))

	// Quantifiers specify repetition
	// * = zero or more, + = one or more, ? = zero or one, {n,m} = between n and m
	quantifierRe := regexp.MustCompile(`colou?r`)
	fmt.Printf("\nQuantifier 'colou?r' matches 'color': %v\n", quantifierRe.MatchString("color"))
	fmt.Printf("Quantifier 'colou?r' matches 'colour': %v\n", quantifierRe.MatchString("colour"))
	fmt.Printf("Quantifier 'colou?r' matches 'colouur': %v\n", quantifierRe.MatchString("colouur"))

	// Plus quantifier (one or more)
	plusRe := regexp.MustCompile(`go+gle`)
	fmt.Printf("\nPlus 'go+gle' matches 'gogle': %v\n", plusRe.MatchString("gogle"))
	fmt.Printf("Plus 'go+gle' matches 'google': %v\n", plusRe.MatchString("google"))
	fmt.Printf("Plus 'go+gle' matches 'gooogle': %v\n", plusRe.MatchString("gooogle"))

	// Star quantifier (zero or more)
	starRe := regexp.MustCompile(`go*gle`)
	fmt.Printf("\nStar 'go*gle' matches 'ggle': %v\n", starRe.MatchString("ggle"))
	fmt.Printf("Star 'go*gle' matches 'google': %v\n", starRe.MatchString("google"))

	// Specific repetition counts
	countRe := regexp.MustCompile(`\d{3}-\d{4}`)
	fmt.Printf("\nCount pattern '\\d{3}-\\d{4}' matches '555-1234': %v\n", countRe.MatchString("555-1234"))
	fmt.Printf("Count pattern '\\d{3}-\\d{4}' matches '55-1234': %v\n", countRe.MatchString("55-1234"))

	// Character classes match sets of characters
	classRe := regexp.MustCompile(`[aeiou]`)
	fmt.Printf("\nVowel class in 'hello': %v\n", classRe.MatchString("hello"))
	fmt.Printf("Vowel class in 'rhythm': %v\n", classRe.MatchString("rhythm"))

	// Negated character classes
	negatedRe := regexp.MustCompile(`[^aeiou]`)
	fmt.Printf("\nNon-vowel class in 'hello': %v\n", negatedRe.MatchString("hello"))
}
// run
💡 Key Insight: Regular expressions are greedy by default - they match as much as possible while still allowing the overall pattern to match. This can cause unexpected behavior if not carefully controlled.
Character Classes and Predefined Sets
Character classes are one of the most powerful features in regex, allowing you to match specific sets of characters efficiently.
package main

import (
	"fmt"
	"regexp"
)

func demonstrateCharacterClasses() {
	fmt.Println("=== Predefined Character Classes ===")

	// \d = digits [0-9]
	digitRe := regexp.MustCompile(`\d+`)
	fmt.Printf("Digits in 'abc123def456': %v\n", digitRe.FindAllString("abc123def456", -1))

	// \D = non-digits [^0-9]
	nonDigitRe := regexp.MustCompile(`\D+`)
	fmt.Printf("Non-digits in 'abc123def456': %v\n", nonDigitRe.FindAllString("abc123def456", -1))

	// \w = word characters [A-Za-z0-9_]
	wordRe := regexp.MustCompile(`\w+`)
	fmt.Printf("Words in 'hello_world 123!': %v\n", wordRe.FindAllString("hello_world 123!", -1))

	// \W = non-word characters [^A-Za-z0-9_]
	nonWordRe := regexp.MustCompile(`\W+`)
	fmt.Printf("Non-words in 'hello_world 123!': %v\n", nonWordRe.FindAllString("hello_world 123!", -1))

	// \s = whitespace [ \t\n\r\f]
	spaceRe := regexp.MustCompile(`\s+`)
	fmt.Printf("Whitespace in 'hello world\\n': %v\n", spaceRe.FindAllString("hello world\n", -1))

	// \S = non-whitespace [^ \t\n\r\f]
	nonSpaceRe := regexp.MustCompile(`\S+`)
	fmt.Printf("Non-whitespace in 'hello world': %v\n", nonSpaceRe.FindAllString("hello world", -1))

	fmt.Println("\n=== Custom Character Classes ===")

	// Range-based classes
	lowerRe := regexp.MustCompile(`[a-z]+`)
	fmt.Printf("Lowercase in 'Hello World': %v\n", lowerRe.FindAllString("Hello World", -1))

	upperRe := regexp.MustCompile(`[A-Z]+`)
	fmt.Printf("Uppercase in 'Hello World': %v\n", upperRe.FindAllString("Hello World", -1))

	// Multiple ranges
	alphanumericRe := regexp.MustCompile(`[A-Za-z0-9]+`)
	fmt.Printf("Alphanumeric in 'Test123!': %v\n", alphanumericRe.FindAllString("Test123!", -1))

	// Negated classes
	notVowelRe := regexp.MustCompile(`[^aeiouAEIOU]`)
	fmt.Printf("First non-vowel in 'apple': %s\n", notVowelRe.FindString("apple"))

	// Special characters in classes (need escaping)
	specialRe := regexp.MustCompile(`[\[\]{}()]`)
	fmt.Printf("Brackets in 'test[123]{456}': %v\n", specialRe.FindAllString("test[123]{456}", -1))
}

func main() {
	demonstrateCharacterClasses()
}
// run
💡 Production Pattern: Use character classes instead of alternation when possible. [abc] is more efficient than (a|b|c).
Anchors and Word Boundaries
Anchors don't match characters - they match positions in the string. This is crucial for precise pattern matching.
🎯 Production Pattern: Use anchors to prevent partial matches that could cause security issues.
package main

import (
	"fmt"
	"regexp"
)

func demonstrateAnchors() {
	fmt.Println("=== Start and End Anchors ===")

	// Without anchors - matches anywhere
	partialRe := regexp.MustCompile(`admin`)
	text := "useradmin123"
	fmt.Printf("Partial match '%s' in '%s': %v\n", "admin", text, partialRe.MatchString(text))

	// Start anchor ^ - matches at beginning
	startRe := regexp.MustCompile(`^admin`)
	fmt.Printf("Start anchor '^admin' in '%s': %v\n", text, startRe.MatchString(text))
	fmt.Printf("Start anchor '^admin' in 'admin123': %v\n", startRe.MatchString("admin123"))

	// End anchor $ - matches at end
	endRe := regexp.MustCompile(`admin$`)
	fmt.Printf("End anchor 'admin$' in '%s': %v\n", text, endRe.MatchString(text))
	fmt.Printf("End anchor 'admin$' in 'useradmin': %v\n", endRe.MatchString("useradmin"))

	// Both anchors - exact match only
	exactRe := regexp.MustCompile(`^admin$`)
	fmt.Printf("Exact match '^admin$' in '%s': %v\n", text, exactRe.MatchString(text))
	fmt.Printf("Exact match '^admin$' in 'admin': %v\n", exactRe.MatchString("admin"))

	fmt.Println("\n=== Word Boundaries ===")

	// Word boundaries - matches whole words only
	wordRe := regexp.MustCompile(`\badmin\b`)
	text2 := "admin user and superadmin"
	matches := wordRe.FindAllString(text2, -1)
	fmt.Printf("Word boundary '\\badmin\\b' in '%s': %v\n", text2, matches)

	// Extract whole words
	wholeWordRe := regexp.MustCompile(`\b\w+\b`)
	text3 := "hello, world! how are you?"
	words := wholeWordRe.FindAllString(text3, -1)
	fmt.Printf("All words in '%s': %v\n", text3, words)

	fmt.Println("\n=== Email Validation with Anchors ===")

	// Email validation with word boundaries
	emailRe := regexp.MustCompile(`\b[\w.%+-]+@[\w.-]+\.[a-z]{2,}\b`)
	text4 := "Contact user@example.com or admin@test.org for help"
	emails := emailRe.FindAllString(text4, -1)
	fmt.Printf("Emails found in '%s': %v\n", text4, emails)

	// Without word boundaries (can match partial emails)
	badEmailRe := regexp.MustCompile(`[\w.%+-]+@[\w.-]+\.[a-z]{2,}`)
	text5 := "email:user@example.com,admin@test.org"
	badEmails := badEmailRe.FindAllString(text5, -1)
	fmt.Printf("Partial email matches: %v\n", badEmails)
}

func main() {
	demonstrateAnchors()
}
// run
⚠️ Security Warning: Always use anchors or word boundaries for validation to prevent partial matches that could bypass security checks. For example, ^admin$ ensures exact match, while admin would match "administrator", "admins", "useradmin", etc.
Capture Groups and Extraction
Capture groups are parentheses that not only group patterns but also capture the matched text for extraction.
package main

import (
	"fmt"
	"regexp"
)

func demonstrateCaptureGroups() {
	fmt.Println("=== Basic Capture Groups ===")

	// Simple capture
	dateRe := regexp.MustCompile(`(\d{4})-(\d{2})-(\d{2})`)
	date := "Today is 2024-03-15"
	matches := dateRe.FindStringSubmatch(date)
	fmt.Printf("Full match: %s\n", matches[0])
	fmt.Printf("Year: %s, Month: %s, Day: %s\n", matches[1], matches[2], matches[3])

	fmt.Println("\n=== Named Capture Groups ===")

	// Named groups for clarity
	namedDateRe := regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)
	namedMatches := namedDateRe.FindStringSubmatch(date)
	names := namedDateRe.SubexpNames()

	result := make(map[string]string)
	for i, name := range names {
		if i > 0 && name != "" {
			result[name] = namedMatches[i]
		}
	}

	fmt.Printf("Named captures: %v\n", result)

	fmt.Println("\n=== Non-Capturing Groups ===")

	// Non-capturing groups (?:...) for grouping without capture
	urlRe := regexp.MustCompile(`^(?:https?://)?([^/]+)(/.*)?$`)
	url := "https://example.com/path/to/page"
	urlMatches := urlRe.FindStringSubmatch(url)
	fmt.Printf("Domain: %s\n", urlMatches[1])
	if urlMatches[2] != "" { // the path group is empty when the URL has no path
		fmt.Printf("Path: %s\n", urlMatches[2])
	}

	fmt.Println("\n=== Multiple Matches with Groups ===")

	// Find all emails with user and domain parts
	emailRe := regexp.MustCompile(`(\w+)@([\w.]+)`)
	text := "Contact: john@example.com or jane@test.org"
	allMatches := emailRe.FindAllStringSubmatch(text, -1)

	for i, match := range allMatches {
		fmt.Printf("Email %d: Full=%s, User=%s, Domain=%s\n",
			i+1, match[0], match[1], match[2])
	}

	fmt.Println("\n=== Extracting Structured Data ===")

	// Parse log entries
	logRe := regexp.MustCompile(`(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)`)
	logLine := "2024-03-15 10:30:45 [ERROR] Database connection failed"
	logMatches := logRe.FindStringSubmatch(logLine)
	logNames := logRe.SubexpNames()

	logEntry := make(map[string]string)
	for i, name := range logNames {
		if i > 0 && name != "" && i < len(logMatches) {
			logEntry[name] = logMatches[i]
		}
	}

	fmt.Printf("Parsed log entry:\n")
	fmt.Printf("  Timestamp: %s\n", logEntry["timestamp"])
	fmt.Printf("  Level: %s\n", logEntry["level"])
	fmt.Printf("  Message: %s\n", logEntry["message"])
}

func main() {
	demonstrateCaptureGroups()
}
// run
💡 Production Pattern: Use named capture groups (?P<name>...) for complex patterns to make code maintainable and self-documenting.
Practical Examples - From Patterns to Production Code
Email Validation with Clear Error Messages
🎯 Production Pattern: Create comprehensive validation that provides specific feedback for common mistakes.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type EmailValidator struct {
	pattern *regexp.Regexp
}

func NewEmailValidator() *EmailValidator {
	// Pattern explanation:
	//   ^          - Start of string
	//   [\w.%+-]+  - Local part: word chars, dots, percents, pluses, hyphens
	//   @          - Literal @
	//   [\w.-]+    - Domain: word chars, dots, hyphens
	//   \.         - Literal dot before TLD
	//   [a-z]{2,}  - TLD: 2+ letters
	//   $          - End of string
	pattern := regexp.MustCompile(`^[\w.%+-]+@[\w.-]+\.[a-z]{2,}$`)

	return &EmailValidator{pattern: pattern}
}

func (ev *EmailValidator) ValidateWithFeedback(email string) (bool, []string) {
	var errors []string

	// Normalize
	email = strings.TrimSpace(strings.ToLower(email))

	// Length checks first
	if len(email) < 5 {
		errors = append(errors, "Email too short (minimum 5 characters)")
	}
	if len(email) > 254 {
		errors = append(errors, "Email too long (maximum 254 characters)")
	}

	// Character checks
	atCount := strings.Count(email, "@")
	if atCount == 0 {
		errors = append(errors, "Email must contain @ symbol")
	} else if atCount > 1 {
		errors = append(errors, "Email must contain exactly one @ symbol")
	}

	// Format validation
	if !ev.pattern.MatchString(email) {
		errors = append(errors, "Invalid email format")

		// Specific format issues
		if regexp.MustCompile(`\.\.`).MatchString(email) {
			errors = append(errors, "Cannot contain consecutive dots")
		}
		if regexp.MustCompile(`^\.`).MatchString(email) {
			errors = append(errors, "Cannot start with a dot")
		}
		if regexp.MustCompile(`\.$`).MatchString(email) {
			errors = append(errors, "Cannot end with a dot")
		}
		if regexp.MustCompile(`@\.`).MatchString(email) || regexp.MustCompile(`\.@`).MatchString(email) {
			errors = append(errors, "Dot cannot be adjacent to @ symbol")
		}

		// Check for spaces
		if strings.Contains(email, " ") {
			errors = append(errors, "Email cannot contain spaces")
		}

		// Check for invalid characters
		if regexp.MustCompile(`[^A-Za-z0-9._%+\-@]`).MatchString(email) {
			errors = append(errors, "Email contains invalid characters")
		}
	}

	// Domain-specific checks
	if strings.Contains(email, "@") {
		parts := strings.Split(email, "@")
		domain := parts[1]

		// Check for valid TLD
		if !regexp.MustCompile(`\.[a-z]{2,}$`).MatchString(domain) {
			errors = append(errors, "Domain must have a valid top-level domain (e.g., .com, .org)")
		}

		// Check local part length
		localPart := parts[0]
		if len(localPart) > 64 {
			errors = append(errors, "Local part (before @) exceeds 64 characters")
		}
		if len(localPart) == 0 {
			errors = append(errors, "Local part (before @) cannot be empty")
		}
	}

	return len(errors) == 0, errors
}

func main() {
	validator := NewEmailValidator()

	testEmails := []string{
		"valid@example.com",
		"user.name+tag@example.co.uk",
		"invalid",                // Missing @ and domain
		"user@",                  // Missing domain
		"@example.com",           // Missing local part
		"user..name@example.com", // Consecutive dots
		"user@.example.com",      // Domain starts with dot
		".user@example.com",      // Local part starts with dot
		"user@example",           // Missing TLD
		"user name@example.com",  // Contains space
		"user@@example.com",      // Multiple @ symbols
		"a@b.c",                  // Single-letter TLD
		strings.Repeat("a", 65) + "@example.com", // Local part too long
	}

	for _, email := range testEmails {
		fmt.Printf("\nEmail: %s\n", email)
		valid, errors := validator.ValidateWithFeedback(email)
		if valid {
			fmt.Println("  ✓ Valid")
		} else {
			fmt.Println("  ✗ Invalid:")
			for _, err := range errors {
				fmt.Printf("    - %s\n", err)
			}
		}
	}
}
// run
Log File Parser with Multiple Formats
🎯 Production Pattern: Build flexible parsers that handle multiple log formats without breaking on unknown patterns.
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

type LogEntry struct {
	Timestamp time.Time
	Level     string
	Message   string
	Component string
	UserID    string
	RequestID string
}

type LogParser struct {
	patterns []*regexp.Regexp
}

func NewLogParser() *LogParser {
	// Multiple log formats commonly found in production
	patterns := []string{
		// Apache/Nginx style: IP - - [timestamp] "method path" status size
		`^(?P<ip>[\d\.]+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>[^"]+)" (?P<status>\d+) (?P<size>\d+)$`,

		// JSON style: {"timestamp":"...","level":"...","message":"..."}
		`^{.*?"timestamp":"(?P<timestamp>[^"]+)".*?"level":"(?P<level>[^"]+)".*?"message":"(?P<message>[^"]+)".*}$`,

		// Structured style: timestamp [level] component: message
		`^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<component>\w+): (?P<message>.*)$`,

		// Application style: timestamp|level|user|request_id|message
		`^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[^|]*)\|(?P<level>\w+)\|(?P<user>[^|]*)\|(?P<request_id>[^|]*)\|(?P<message>.*)$`,

		// Syslog style: timestamp hostname service[pid]: message
		`^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<hostname>\S+) (?P<service>\w+)\[(?P<pid>\d+)\]: (?P<message>.*)$`,
	}

	compiled := make([]*regexp.Regexp, len(patterns))
	for i, pattern := range patterns {
		compiled[i] = regexp.MustCompile(pattern)
	}

	return &LogParser{patterns: compiled}
}

func (lp *LogParser) Parse(line string) (*LogEntry, error) {
	for _, re := range lp.patterns {
		matches := re.FindStringSubmatch(line)
		if matches == nil {
			continue
		}

		entry := &LogEntry{}
		names := re.SubexpNames()

		for i, name := range names {
			if i == 0 || name == "" {
				continue
			}

			value := matches[i]
			switch name {
			case "timestamp":
				// Try different timestamp formats
				formats := []string{
					"2006-01-02 15:04:05",
					time.RFC3339, // "2006-01-02T15:04:05Z07:00"
					"02/Jan/2006:15:04:05 -0700",
					"Jan 2 15:04:05",
				}
				for _, format := range formats {
					if t, err := time.Parse(format, value); err == nil {
						entry.Timestamp = t
						break
					}
				}
				if entry.Timestamp.IsZero() {
					entry.Timestamp = time.Now() // Fallback
				}
			case "level":
				entry.Level = strings.ToUpper(value)
			case "message":
				entry.Message = value
			case "component", "service":
				entry.Component = value
			case "user":
				entry.UserID = value
			case "request_id":
				entry.RequestID = value
			}
		}

		return entry, nil
	}

	// No pattern matched - create basic entry
	return &LogEntry{
		Timestamp: time.Now(),
		Level:     "UNKNOWN",
		Message:   line,
	}, fmt.Errorf("no pattern matched")
}

func main() {
	parser := NewLogParser()

	logLines := []string{
		`192.168.1.1 - - [15/Oct/2024:10:30:45 +0000] "GET /api/users" 200 1234`,
		`{"timestamp":"2024-10-15T10:30:45Z","level":"INFO","message":"User logged in"}`,
		`2024-10-15 10:30:45 [INFO] auth: User authentication successful`,
		`2024-10-15T10:30:45.123Z|INFO|user123|req_456|Payment processed successfully`,
		`Oct 15 10:30:45 server app[12345]: Database connection established`,
		`unstructured log line that doesn't match any pattern`,
	}

	for i, line := range logLines {
		fmt.Printf("Line %d: %s\n", i+1, line)
		entry, err := parser.Parse(line)
		if err != nil {
			fmt.Printf("  Parse warning: %v\n", err)
		}

		fmt.Printf("  Timestamp: %s\n", entry.Timestamp.Format("2006-01-02 15:04:05"))
		fmt.Printf("  Level: %s\n", entry.Level)
		if entry.Component != "" {
			fmt.Printf("  Component: %s\n", entry.Component)
		}
		if entry.UserID != "" {
			fmt.Printf("  User ID: %s\n", entry.UserID)
		}
		if entry.RequestID != "" {
			fmt.Printf("  Request ID: %s\n", entry.RequestID)
		}
		fmt.Printf("  Message: %s\n", entry.Message)
		fmt.Println()
	}
}
// run
Security-Focused URL Parser
🎯 Production Pattern: Extract and validate URLs with security considerations to prevent SSRF and injection attacks.
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

type URLInfo struct {
	Original string
	Protocol string
	Host     string
	Port     string
	Path     string
	Query    string
	Fragment string
	IsValid  bool
	Warnings []string
}

type SecureURLParser struct {
	allowedProtocols []string
	blockedHosts     []string
	maxURLLength     int
}

func NewSecureURLParser() *SecureURLParser {
	return &SecureURLParser{
		allowedProtocols: []string{"http", "https"},
		blockedHosts:     []string{"localhost", "127.0.0.1", "0.0.0.0", "169.254.169.254"}, // Including AWS metadata service
		maxURLLength:     2048,
	}
}

func (sup *SecureURLParser) Parse(inputURL string) URLInfo {
	info := URLInfo{
		Original: inputURL,
		IsValid:  true,
		Warnings: []string{},
	}

	// Length check
	if len(inputURL) > sup.maxURLLength {
		info.IsValid = false
		info.Warnings = append(info.Warnings, fmt.Sprintf("URL exceeds maximum length of %d characters", sup.maxURLLength))
		return info
	}

	if len(inputURL) == 0 {
		info.IsValid = false
		info.Warnings = append(info.Warnings, "URL is empty")
		return info
	}

	// Basic URL pattern with named groups
	urlPattern := regexp.MustCompile(`^(?P<protocol>https?)://(?P<host>[^:/\s]+)(?::(?P<port>\d+))?(?P<path>/[^\s?#]*)?(?:\?(?P<query>[^\s#]*))?(?:#(?P<fragment>[\w-]+))?$`)

	matches := urlPattern.FindStringSubmatch(inputURL)
	if matches == nil {
		info.IsValid = false
		info.Warnings = append(info.Warnings, "Invalid URL format - must be http:// or https:// with valid structure")
		return info
	}

	names := urlPattern.SubexpNames()
	for i, name := range names {
		if i == 0 || name == "" {
			continue
		}

		value := matches[i]
		switch name {
		case "protocol":
			info.Protocol = value
			// Check allowed protocols
			allowed := false
			for _, protocol := range sup.allowedProtocols {
				if strings.EqualFold(value, protocol) {
					allowed = true
					break
				}
			}
			if !allowed {
				info.IsValid = false
				info.Warnings = append(info.Warnings, fmt.Sprintf("Protocol '%s' not allowed (only http, https)", value))
			}

		case "host":
			info.Host = strings.ToLower(value)

			// Check blocked hosts (SSRF protection)
			for _, blocked := range sup.blockedHosts {
				if strings.Contains(info.Host, blocked) {
					info.IsValid = false
					info.Warnings = append(info.Warnings, fmt.Sprintf("Host '%s' is blocked (potential SSRF vulnerability)", blocked))
				}
			}

			// Check for IP address ranges (private networks)
			if sup.isPrivateIP(info.Host) {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Private IP addresses are blocked")
			}

			// Check for suspicious patterns
			if strings.Contains(value, "..") {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Host contains path traversal pattern (..)")
			}

			// Validate hostname format
			if !regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)*$`).MatchString(info.Host) {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Invalid hostname format")
			}

		case "port":
			info.Port = value
			if value != "" {
				port, err := strconv.Atoi(value)
				if err != nil || port < 1 || port > 65535 {
					info.IsValid = false
					info.Warnings = append(info.Warnings, fmt.Sprintf("Invalid port: %s (must be 1-65535)", value))
				}
			}

		case "path":
			info.Path = value
			// Security checks for path
			if strings.Contains(value, "..") {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Path contains directory traversal (..)")
			}
			if regexp.MustCompile(`<script|javascript:|on\w+=`).MatchString(strings.ToLower(value)) {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Path contains potential XSS payload")
			}
			// Check for null bytes
			if strings.Contains(value, "\x00") {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Path contains null byte")
			}

		case "query":
			info.Query = value
			// Check query string for common injection patterns
			if regexp.MustCompile(`<script|javascript:|on\w+=`).MatchString(strings.ToLower(value)) {
				info.IsValid = false
				info.Warnings = append(info.Warnings, "Query string contains potential XSS payload")
			}

		case "fragment":
			info.Fragment = value
		}
	}

	return info
}

func (sup *SecureURLParser) isPrivateIP(host string) bool {
	// Check for common private IP ranges
	privateRanges := []string{
		"10.",
		"172.16.", "172.17.", "172.18.", "172.19.", "172.20.", "172.21.", "172.22.", "172.23.",
		"172.24.", "172.25.", "172.26.", "172.27.", "172.28.", "172.29.", "172.30.", "172.31.",
		"192.168.",
	}

	for _, prefix := range privateRanges {
		if strings.HasPrefix(host, prefix) {
			return true
		}
	}

	return false
}

func main() {
	parser := NewSecureURLParser()

	testURLs := []string{
		"https://example.com/path/to/resource",
		"http://localhost:8080/admin",
		"https://192.168.1.1/internal",
		"https://example.com/../../etc/passwd",
		"https://example.com/path?param=<script>alert('xss')</script>",
		"https://169.254.169.254/latest/meta-data/", // AWS metadata
		"https://example.com:99999/invalid-port",
		"ftp://example.com/file",
		strings.Repeat("a", 3000) + ".com",
		"https://valid-domain.com/safe/path?id=123",
	}

	for _, testURL := range testURLs {
		fmt.Printf("Parsing: %s\n", testURL)
		info := parser.Parse(testURL)

		if info.Protocol != "" {
			fmt.Printf("  Protocol: %s\n", info.Protocol)
		}
		if info.Host != "" {
			fmt.Printf("  Host: %s\n", info.Host)
		}
		if info.Port != "" {
			fmt.Printf("  Port: %s\n", info.Port)
		}
		if info.Path != "" {
			fmt.Printf("  Path: %s\n", info.Path)
		}
		fmt.Printf("  Valid: %v\n", info.IsValid)

		if len(info.Warnings) > 0 {
			fmt.Println("  Warnings:")
			for _, warning := range info.Warnings {
				fmt.Printf("    ⚠ %s\n", warning)
			}
		}
		fmt.Println()
	}
}
// run
Common Patterns and Pitfalls
Performance Optimization: Compile Once, Use Many Times
⚠️ Critical Performance Issue: Compiling regex patterns is expensive. Never compile in hot paths.
package main

import (
	"fmt"
	"regexp"
	"sync"
	"time"
)

// BAD: Compiling regex in loop
func badValidation(emails []string) []bool {
	results := make([]bool, len(emails))

	start := time.Now()
	for i, email := range emails {
		// Compiles regex EVERY iteration - very slow!
		re := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
		results[i] = re.MatchString(email)
	}
	duration := time.Since(start)

	fmt.Printf("BAD approach: %v for %d emails\n", duration, len(emails))
	return results
}

// GOOD: Pre-compile regex
func goodValidation(emails []string) []bool {
	// Compile once
	emailRe := regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
	results := make([]bool, len(emails))

	start := time.Now()
	for i, email := range emails {
		results[i] = emailRe.MatchString(email) // Reuse compiled regex
	}
	duration := time.Since(start)

	fmt.Printf("GOOD approach: %v for %d emails\n", duration, len(emails))
	return results
}

// Production-ready validator with caching
type RegexCache struct {
	cache map[string]*regexp.Regexp
	mu    sync.RWMutex
}

func NewRegexCache() *RegexCache {
	return &RegexCache{
		cache: make(map[string]*regexp.Regexp),
	}
}

func (rc *RegexCache) Get(pattern string) (*regexp.Regexp, error) {
	// Check cache with read lock
	rc.mu.RLock()
	if re, exists := rc.cache[pattern]; exists {
		rc.mu.RUnlock()
		return re, nil
	}
	rc.mu.RUnlock()

	// Compile and cache with write lock
	rc.mu.Lock()
	defer rc.mu.Unlock()

	// Double-check after acquiring write lock
	if re, exists := rc.cache[pattern]; exists {
		return re, nil
	}

	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid regex pattern '%s': %w", pattern, err)
	}

	rc.cache[pattern] = re
	return re, nil
}

// Best practice: Use package-level variables for common patterns
var (
	emailRegex = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)
	phoneRegex = regexp.MustCompile(`^\+?1?\d{10}$`)
	urlRegex   = regexp.MustCompile(`^https?://[^\s]+$`)
)

func bestPracticeValidation(emails []string) []bool {
	results := make([]bool, len(emails))

	start := time.Now()
	for i, email := range emails {
		results[i] = emailRegex.MatchString(email)
	}
	duration := time.Since(start)

	fmt.Printf("BEST approach (package-level): %v for %d emails\n", duration, len(emails))
	return results
}

func main() {
	// Generate test emails
	emails := make([]string, 1000)
	for i := range emails {
		emails[i] = fmt.Sprintf("user%d@example.com", i)
	}

	// Benchmark approaches
	fmt.Printf("=== Performance Comparison ===\n\n")

	badValidation(emails[:10]) // Only 10 to avoid slowness
	goodValidation(emails)
	bestPracticeValidation(emails)

	// Demonstrate cache usage
	fmt.Println("\n=== Regex Cache Example ===")
	cache := NewRegexCache()

	patterns := []string{
		`^\+?1?\d{10}$`,
		`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`,
		`^\+?1?\d{10}$`, // Duplicate - will use cached version
	}

	for i, pattern := range patterns {
		re, err := cache.Get(pattern)
		if err != nil {
			fmt.Printf("Pattern %d: Error - %v\n", i+1, err)
		} else {
			fmt.Printf("Pattern %d: Compiled successfully (from cache: %v)\n",
				i+1, i == 2) // Third pattern is duplicate
			_ = re
		}
	}
}
// run
💡 Production Pattern:
- Compile regex at package initialization time
- Use sync.Once for lazy initialization if needed
- Implement a cache for dynamic patterns
- Never compile regex in request handlers or loops
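The sync.Once option can be sketched as follows; the pattern is compiled on first use only, which keeps startup fast when the regex may never be needed (the UUID pattern here is just an illustrative choice):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	uuidOnce sync.Once
	uuidRe   *regexp.Regexp
)

// uuidRegex compiles the pattern exactly once, even under concurrent callers.
func uuidRegex() *regexp.Regexp {
	uuidOnce.Do(func() {
		uuidRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)
	})
	return uuidRe
}

func main() {
	fmt.Println(uuidRegex().MatchString("123e4567-e89b-12d3-a456-426614174000")) // true
	fmt.Println(uuidRegex().MatchString("not-a-uuid"))                           // false
}
```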
Avoiding Catastrophic Backtracking
⚠️ Security Critical: Certain regex patterns can cause exponential time complexity, leading to DoS attacks (ReDoS).
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// Demonstrate catastrophic backtracking (or lack thereof in Go's RE2)
func demonstrateBacktracking() {
	fmt.Println("=== Go's RE2 Linear-Time Guarantee ===\n")

	// Patterns that WOULD be vulnerable in other regex engines
	vulnerablePatterns := []struct {
		name    string
		pattern string
	}{
		{"Nested quantifiers", `^(a+)+$`},
		{"Alternation overlap", `^(.*|a.*|ab.*)$`},
		{"Repeated groups", `^(a|a)*$`},
		{"Complex nesting", `^(a+)+(b+)+(c+)+$`},
	}

	testInputs := []string{
		strings.Repeat("a", 20),
		strings.Repeat("a", 30),
		strings.Repeat("a", 40) + "b", // Won't match
	}

	for _, vp := range vulnerablePatterns {
		fmt.Printf("Testing '%s' pattern: %s\n", vp.name, vp.pattern)
		re := regexp.MustCompile(vp.pattern)

		for _, input := range testInputs {
			start := time.Now()
			result := re.MatchString(input)
			duration := time.Since(start)

			fmt.Printf("  Input length %d: match=%v, time=%v\n",
				len(input), result, duration)
		}
		fmt.Println()
	}
}

// Safe pattern alternatives
func demonstrateSafeAlternatives() {
	fmt.Println("=== Safe Pattern Alternatives ===\n")

	examples := []struct {
		name   string
		unsafe string
		safe   string
		test   string
	}{
		{
			name:   "Repeated groups",
			unsafe: `(a+)*`,
			safe:   `a*`,
			test:   "aaaaaaaaaa",
		},
		{
			name:   "Alternation with overlap",
			unsafe: `(.*|a.*|ab.*)`,
			safe:   `(ab.*|a.*|.*)`,
			test:   "abcdefghij",
		},
		{
			name:   "Wildcard repetition",
			unsafe: `.+.+`,
			safe:   `.{2,}`,
			test:   "hello world",
		},
		{
			name:   "Optional repetition",
			unsafe: `(x+)?(x+)?x`,
			safe:   `x+`,
			test:   "xxxxxxxxxx",
		},
	}

	for _, ex := range examples {
		fmt.Printf("%s:\n", ex.name)
		fmt.Printf("  Unsafe pattern: %s\n", ex.unsafe)
		fmt.Printf("  Safe pattern: %s\n", ex.safe)

		unsafeRe := regexp.MustCompile(ex.unsafe)
		safeRe := regexp.MustCompile(ex.safe)

		// Test with progressively longer strings
		for length := 10; length <= 50; length += 10 {
			testStr := strings.Repeat(ex.test[:1], length)

			start := time.Now()
			unsafeRe.MatchString(testStr)
			unsafeDuration := time.Since(start)

			start = time.Now()
			safeRe.MatchString(testStr)
			safeDuration := time.Since(start)

			fmt.Printf("  Length %d: unsafe=%v, safe=%v\n",
				length, unsafeDuration, safeDuration)
		}
		fmt.Println()
	}
}

// Best practices for safe patterns
func demonstrateBestPractices() {
	fmt.Println("=== Best Practices for Safe Patterns ===\n")

	tips := []struct {
		tip     string
		bad     string
		good    string
		example string
	}{
		{
			tip:     "Use specific quantifiers instead of +/*",
			bad:     `\w+`,
			good:    `\w{1,50}`,
			example: "username",
		},
		{
			tip:     "Avoid nested quantifiers",
			bad:     `(a+)+`,
			good:    `a+`,
			example: "aaaa",
		},
		{
			tip:     "Use character classes instead of alternation",
			bad:     `(a|b|c|d)`,
			good:    `[a-d]`,
			example: "abcd",
		},
		{
			tip:     "Be specific with wildcards",
			bad:     `.*`,
			good:    `[^\n]*`,
			example: "text\nmore",
		},
		{
			tip:     "Anchor patterns when possible",
			bad:     `\d+`,
			good:    `^\d+$`,
			example: "12345",
		},
	}

	for i, tip := range tips {
		fmt.Printf("%d. %s\n", i+1, tip.tip)
		fmt.Printf("   Bad: %s\n", tip.bad)
		fmt.Printf("   Good: %s\n", tip.good)

		goodRe := regexp.MustCompile(tip.good)
		fmt.Printf("   Example '%s' matches: %v\n\n", tip.example, goodRe.MatchString(tip.example))
	}
}

func main() {
	demonstrateBacktracking()
	demonstrateSafeAlternatives()
	demonstrateBestPractices()
}
// run
💡 Go's Safety: Go's regexp package uses the RE2 engine, which guarantees linear time complexity. However, poorly written patterns can still be slow - just not exponentially slow like in other engines.
Greedy vs Non-Greedy Matching
Understanding quantifier behavior is crucial for correct pattern matching.
package main

import (
	"fmt"
	"regexp"
)

func demonstrateGreedy() {
	fmt.Println("=== Greedy Matching (default) ===\n")

	// Greedy quantifiers match as much as possible
	greedyRe := regexp.MustCompile(`<.*>`)
	html := "<div>content</div><span>more</span>"

	match := greedyRe.FindString(html)
	fmt.Printf("Pattern: %s\n", `<.*>`)
	fmt.Printf("Text: %s\n", html)
	fmt.Printf("Greedy match: %s\n", match)
	fmt.Printf("  → Matched from first < to last >\n\n")

	fmt.Println("=== Non-Greedy Matching ===\n")

	// Non-greedy quantifiers (? suffix) match as little as possible
	nonGreedyRe := regexp.MustCompile(`<.*?>`)
	matches := nonGreedyRe.FindAllString(html, -1)

	fmt.Printf("Pattern: %s\n", `<.*?>`)
	fmt.Printf("Text: %s\n", html)
	fmt.Printf("Non-greedy matches: %v\n", matches)
	fmt.Printf("  → Matched smallest possible strings\n\n")

	fmt.Println("=== Practical Examples ===\n")

	// Example 1: Extracting quoted strings
	text1 := `He said "hello" and she said "goodbye"`

	greedyQuote := regexp.MustCompile(`".*"`)
	nonGreedyQuote := regexp.MustCompile(`".*?"`)

	fmt.Printf("Text: %s\n", text1)
	fmt.Printf("Greedy \".*\": %s\n", greedyQuote.FindString(text1))
	fmt.Printf("Non-greedy \".*?\": %v\n\n", nonGreedyQuote.FindAllString(text1, -1))

	// Example 2: HTML tag extraction
	html2 := "<b>bold</b> and <i>italic</i>"

	greedyTag := regexp.MustCompile(`<.+>`)
	nonGreedyTag := regexp.MustCompile(`<.+?>`)

	fmt.Printf("HTML: %s\n", html2)
	fmt.Printf("Greedy <.+>: %s\n", greedyTag.FindString(html2))
	fmt.Printf("Non-greedy <.+?>: %v\n\n", nonGreedyTag.FindAllString(html2, -1))

	// Example 3: Path extraction
	path := "/users/123/posts/456/comments/789"

	greedyPath := regexp.MustCompile(`/\w+/\d+`)
	nonGreedyPath := regexp.MustCompile(`/\w+?/\d+`)

	fmt.Printf("Path: %s\n", path)
	fmt.Printf("Greedy pattern: %v\n", greedyPath.FindAllString(path, -1))
	fmt.Printf("Non-greedy pattern: %v\n\n", nonGreedyPath.FindAllString(path, -1))

	fmt.Println("=== Quantifier Reference ===")
	fmt.Println("  *      = greedy (0 or more)")
	fmt.Println("  *?     = non-greedy (0 or more)")
	fmt.Println("  +      = greedy (1 or more)")
	fmt.Println("  +?     = non-greedy (1 or more)")
	fmt.Println("  ?      = greedy (0 or 1)")
	fmt.Println("  ??     = non-greedy (0 or 1)")
	fmt.Println("  {n,m}  = greedy (n to m)")
	fmt.Println("  {n,m}? = non-greedy (n to m)")
}

func main() {
	demonstrateGreedy()
}
// run
💡 Production Pattern: Use non-greedy quantifiers when extracting content between delimiters to avoid matching too much.
Integration and Mastery
Production-Ready Input Validation Framework
🎯 Production Pattern: Create a comprehensive validation system that handles multiple input types with consistent error handling.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type ValidationError struct {
	Field   string
	Value   string
	Message string
	Code    string
}

type ValidationRule struct {
	Pattern   *regexp.Regexp
	Required  bool
	MinLength int
	MaxLength int
	Message   string
	Code      string
}

type Validator struct {
	rules map[string]*ValidationRule
}

func NewValidator() *Validator {
	return &Validator{
		rules: make(map[string]*ValidationRule),
	}
}

func (v *Validator) AddRule(field, pattern string, options map[string]interface{}) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return fmt.Errorf("invalid pattern for field '%s': %w", field, err)
	}

	rule := &ValidationRule{
		Pattern:  re,
		Required: true,
		Code:     "INVALID_FORMAT",
	}

	// Apply options
	for key, value := range options {
		switch key {
		case "required":
			rule.Required = value.(bool)
		case "min_length":
			rule.MinLength = value.(int)
		case "max_length":
			rule.MaxLength = value.(int)
		case "message":
			rule.Message = value.(string)
		case "code":
			rule.Code = value.(string)
		}
	}

	v.rules[field] = rule
	return nil
}

func (v *Validator) Validate(data map[string]string) []ValidationError {
	var errors []ValidationError

	// Validate present fields
	for field, value := range data {
		rule, exists := v.rules[field]
		if !exists {
			continue
		}

		errors = append(errors, v.validateField(field, value, rule)...)
	}

	// Check for missing required fields
	for field, rule := range v.rules {
		if rule.Required {
			if _, exists := data[field]; !exists {
				errors = append(errors, ValidationError{
					Field:   field,
					Value:   "",
					Message: fmt.Sprintf("%s is required", field),
					Code:    "REQUIRED",
				})
			}
		}
	}

	return errors
}

func (v *Validator) validateField(field, value string, rule *ValidationRule) []ValidationError {
	var errors []ValidationError

	// Trim whitespace
	value = strings.TrimSpace(value)

	// Required check
	if rule.Required && value == "" {
		errors = append(errors, ValidationError{
			Field:   field,
			Value:   value,
			Message: fmt.Sprintf("%s is required", field),
			Code:    "REQUIRED",
		})
		return errors // No point checking other rules
	}

	// Skip other validations if empty and not required
	if value == "" && !rule.Required {
		return errors
	}

	// Length checks
	if rule.MinLength > 0 && len(value) < rule.MinLength {
		errors = append(errors, ValidationError{
			Field:   field,
			Value:   value,
			Message: fmt.Sprintf("%s must be at least %d characters", field, rule.MinLength),
			Code:    "TOO_SHORT",
		})
	}

	if rule.MaxLength > 0 && len(value) > rule.MaxLength {
		errors = append(errors, ValidationError{
			Field:   field,
			Value:   value,
			Message: fmt.Sprintf("%s must be at most %d characters", field, rule.MaxLength),
			Code:    "TOO_LONG",
		})
	}

	// Pattern validation
	if !rule.Pattern.MatchString(value) {
		message := rule.Message
		if message == "" {
			message = fmt.Sprintf("%s has invalid format", field)
		}

		errors = append(errors, ValidationError{
			Field:   field,
			Value:   value,
			Message: message,
			Code:    rule.Code,
		})
	}

	return errors
}

func main() {
	// Create validator with common patterns
	validator := NewValidator()

	// Add validation rules (AddRule errors ignored here for brevity)
	validator.AddRule("email", `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`, map[string]interface{}{
		"required":   true,
		"max_length": 254,
		"message":    "Please enter a valid email address",
		"code":       "INVALID_EMAIL",
	})

	validator.AddRule("username", `^[a-zA-Z0-9_-]{3,16}$`, map[string]interface{}{
		"required":   true,
		"min_length": 3,
		"max_length": 16,
		"message":    "Username must be 3-16 alphanumeric characters, hyphens, or underscores",
		"code":       "INVALID_USERNAME",
	})

	validator.AddRule("phone", `^\+?[1-9]\d{1,14}$`, map[string]interface{}{
		"required": false,
		"message":  "Please enter a valid international phone number",
		"code":     "INVALID_PHONE",
	})

	// Note: Go's RE2 engine does not support lookahead assertions such as
	// (?=.*[A-Z]), so "must contain an uppercase letter / digit / symbol"
	// rules need separate checks. This pattern only restricts the allowed
	// character set and enforces the minimum length.
	validator.AddRule("password", `^[A-Za-z\d@$!%*?&]{8,}$`, map[string]interface{}{
		"required":   true,
		"min_length": 8,
		"message":    "Password must be at least 8 characters using letters, digits, and @$!%*?&",
		"code":       "WEAK_PASSWORD",
	})

	validator.AddRule("zipcode", `^\d{5}(-\d{4})?$`, map[string]interface{}{
		"required": false,
		"message":  "ZIP code must be 5 digits or 5+4 format",
		"code":     "INVALID_ZIPCODE",
	})

	// Test validation
	testCases := []map[string]string{
		{
			"email":    "user@example.com",
			"username": "validuser123",
			"phone":    "+1234567890",
			"password": "SecurePass123!",
			"zipcode":  "12345-6789",
		},
		{
			"email":    "invalid-email",
			"username": "ab",
			"phone":    "123",
			"password": "weak",
		},
		{
			"username": "toolongusername12345",
			"password": "NoNumber!",
		},
		{}, // Missing required fields
	}

	for i, testCase := range testCases {
		fmt.Printf("=== Test Case %d ===\n", i+1)
		fmt.Printf("Input: %v\n", testCase)

		errors := validator.Validate(testCase)

		if len(errors) == 0 {
			fmt.Println("✓ All validations passed")
		} else {
			fmt.Printf("✗ %d validation error(s):\n", len(errors))
			for _, err := range errors {
				fmt.Printf("  Field: %s\n", err.Field)
				if err.Value != "" {
					fmt.Printf("  Value: %s\n", err.Value)
				}
				fmt.Printf("  Error: %s\n", err.Message)
				fmt.Printf("  Code: %s\n", err.Code)
				fmt.Println()
			}
		}
		fmt.Println()
	}
}
// run
Text Processing Pipeline with Regex
Build a complete text processing system demonstrating real-world regex integration.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type TextProcessor struct {
	linkRe      *regexp.Regexp
	emailRe     *regexp.Regexp
	hashtagRe   *regexp.Regexp
	mentionRe   *regexp.Regexp
	codeBlockRe *regexp.Regexp
}

func NewTextProcessor() *TextProcessor {
	return &TextProcessor{
		linkRe:      regexp.MustCompile(`https?://[^\s]+`),
		emailRe:     regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`),
		hashtagRe:   regexp.MustCompile(`#\w+`),
		mentionRe:   regexp.MustCompile(`@\w+`),
		codeBlockRe: regexp.MustCompile("`([^`]+)`"),
	}
}

// Extract all entities from text
func (tp *TextProcessor) ExtractEntities(text string) map[string][]string {
	return map[string][]string{
		"links":    tp.linkRe.FindAllString(text, -1),
		"emails":   tp.emailRe.FindAllString(text, -1),
		"hashtags": tp.hashtagRe.FindAllString(text, -1),
		"mentions": tp.mentionRe.FindAllString(text, -1),
		"code":     tp.extractCodeBlocks(text),
	}
}

func (tp *TextProcessor) extractCodeBlocks(text string) []string {
	matches := tp.codeBlockRe.FindAllStringSubmatch(text, -1)
	result := make([]string, len(matches))
	for i, match := range matches {
		result[i] = match[1]
	}
	return result
}

// Sanitize text by removing potentially harmful content
func (tp *TextProcessor) Sanitize(text string) string {
	// Remove script tags; the s flag lets .*? span newlines inside the tag
	scriptRe := regexp.MustCompile(`(?is)<script[^>]*>.*?</script>`)
	text = scriptRe.ReplaceAllString(text, "")

	// Remove event handlers
	eventRe := regexp.MustCompile(`(?i)\son\w+\s*=\s*["'][^"']*["']`)
	text = eventRe.ReplaceAllString(text, "")

	// Remove javascript: URLs
	jsUrlRe := regexp.MustCompile(`(?i)javascript:`)
	text = jsUrlRe.ReplaceAllString(text, "")

	return strings.TrimSpace(text)
}

// Format text by converting entities to HTML
func (tp *TextProcessor) FormatHTML(text string) string {
	// Convert links to anchor tags
	text = tp.linkRe.ReplaceAllStringFunc(text, func(link string) string {
		return fmt.Sprintf(`<a href="%s" target="_blank">%s</a>`, link, link)
	})

	// Convert code blocks to code tags
	text = tp.codeBlockRe.ReplaceAllStringFunc(text, func(code string) string {
		inner := tp.codeBlockRe.FindStringSubmatch(code)[1]
		return fmt.Sprintf("<code>%s</code>", inner)
	})

	// Convert hashtags to links
	text = tp.hashtagRe.ReplaceAllStringFunc(text, func(hashtag string) string {
		tag := strings.TrimPrefix(hashtag, "#")
		return fmt.Sprintf(`<a href="/tags/%s">%s</a>`, tag, hashtag)
	})

	// Convert mentions to links
	text = tp.mentionRe.ReplaceAllStringFunc(text, func(mention string) string {
		username := strings.TrimPrefix(mention, "@")
		return fmt.Sprintf(`<a href="/users/%s">%s</a>`, username, mention)
	})

	return text
}

// Highlight search terms in text
func (tp *TextProcessor) Highlight(text, searchTerm string) string {
	if searchTerm == "" {
		return text
	}

	// Escape special regex characters in search term
	escapedTerm := regexp.QuoteMeta(searchTerm)
	highlightRe := regexp.MustCompile(`(?i)(` + escapedTerm + `)`)

	return highlightRe.ReplaceAllString(text, "<mark>$1</mark>")
}

// Extract summary (first N words)
func (tp *TextProcessor) ExtractSummary(text string, maxWords int) string {
	// Remove extra whitespace
	text = regexp.MustCompile(`\s+`).ReplaceAllString(text, " ")
	text = strings.TrimSpace(text)

	// Split into words
	words := strings.Fields(text)

	if len(words) <= maxWords {
		return text
	}

	return strings.Join(words[:maxWords], " ") + "..."
}

func main() {
	processor := NewTextProcessor()

	// Example text (backticks cannot be escaped inside a Go raw string
	// literal, so the sample is built from interpreted strings)
	text := "Check out https://example.com for more info!\n" +
		"Contact support@example.com or visit our site.\n" +
		"Use code: `npm install package` to get started.\n" +
		"Follow @johndoe and use #golang for questions.\n" +
		"<script>alert('xss')</script>\n" +
		"Visit javascript:void(0) for nothing."

	fmt.Println("=== Original Text ===")
	fmt.Println(text)

	fmt.Println("\n=== Extracted Entities ===")
	entities := processor.ExtractEntities(text)
	for entityType, values := range entities {
		if len(values) > 0 {
			fmt.Printf("%s: %v\n", entityType, values)
		}
	}

	fmt.Println("\n=== Sanitized Text ===")
	sanitized := processor.Sanitize(text)
	fmt.Println(sanitized)

	fmt.Println("\n=== Formatted HTML ===")
	formatted := processor.FormatHTML(sanitized)
	fmt.Println(formatted)

	fmt.Println("\n=== With Highlighting (search: 'golang') ===")
	highlighted := processor.Highlight(formatted, "golang")
	fmt.Println(highlighted)

	fmt.Println("\n=== Summary (15 words) ===")
	summary := processor.ExtractSummary(sanitized, 15)
	fmt.Println(summary)
}
// run
Practice Exercises
The exercises from the original article are preserved below with enhanced learning objectives and real-world context, plus two new exercises.
Exercise 1: Advanced Log Parser with Performance Optimization
Learning Objectives: Master complex pattern matching, implement multi-format parsing strategies, and handle structured data extraction with regular expressions while optimizing for high-volume log processing.
Real-World Context: Log parsing is fundamental to observability and monitoring systems. Tools like Splunk, ELK Stack, and Fluentd process millions of log entries daily to extract meaningful insights. This exercise teaches you patterns used in production log analysis for troubleshooting, security monitoring, and business intelligence.
Difficulty: Advanced | Time Estimate: 60 minutes
Write a log parser that extracts structured information from log lines with multiple formats while handling edge cases, malformed entries, and performance optimization for high-volume processing.
Solution
package main

import (
	"fmt"
	"regexp"
	"time"
)

type LogEntry struct {
	Timestamp time.Time
	Level     string
	Service   string
	Message   string
	RequestID string
	UserID    string
}

type LogParser struct {
	patterns []*regexp.Regexp
}

func NewLogParser() *LogParser {
	patterns := []string{
		// Format 1: 2024-03-15T10:30:45Z [INFO] service=api msg="Request received" request_id=abc123 user_id=456
		`(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) \[(?P<level>\w+)\] service=(?P<service>\w+) msg="(?P<message>[^"]*)"(?: request_id=(?P<request_id>\w+))?(?: user_id=(?P<user_id>\w+))?`,

		// Format 2: 2024-03-15 10:30:45 INFO [api] Request received (rid:abc123, uid:456)
		`(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) \[(?P<service>\w+)\] (?P<message>.*?)(?: \(rid:(?P<request_id>\w+)(?:, uid:(?P<user_id>\w+))?\))?$`,
	}

	compiled := make([]*regexp.Regexp, len(patterns))
	for i, pattern := range patterns {
		compiled[i] = regexp.MustCompile(pattern)
	}

	return &LogParser{patterns: compiled}
}

func (lp *LogParser) Parse(line string) (*LogEntry, error) {
	for _, re := range lp.patterns {
		matches := re.FindStringSubmatch(line)
		if matches == nil {
			continue
		}

		names := re.SubexpNames()
		entry := &LogEntry{}

		for i, name := range names {
			if i == 0 || name == "" {
				continue
			}

			value := matches[i]

			switch name {
			case "timestamp":
				// Try different time formats
				formats := []string{
					"2006-01-02T15:04:05Z",
					"2006-01-02 15:04:05",
				}
				for _, format := range formats {
					if t, err := time.Parse(format, value); err == nil {
						entry.Timestamp = t
						break
					}
				}
			case "level":
				entry.Level = value
			case "service":
				entry.Service = value
			case "message":
				entry.Message = value
			case "request_id":
				entry.RequestID = value
			case "user_id":
				entry.UserID = value
			}
		}

		return entry, nil
	}

	return nil, fmt.Errorf("no pattern matched")
}

func main() {
	parser := NewLogParser()

	logs := []string{
		`2024-03-15T10:30:45Z [INFO] service=api msg="Request received" request_id=abc123 user_id=456`,
		`2024-03-15T10:30:46Z [ERROR] service=db msg="Connection failed"`,
		`2024-03-15 10:30:47 INFO [api] Request received (rid:xyz789, uid:123)`,
		`2024-03-15 10:30:48 WARN [cache] Cache miss`,
	}

	for _, log := range logs {
		entry, err := parser.Parse(log)
		if err != nil {
			fmt.Printf("Failed to parse: %s\n", log)
			continue
		}

		fmt.Printf("Parsed: %+v\n", entry)
	}
}
Exercise 2: URL Router with Pattern Matching
Learning Objectives: Build routing systems, implement parameter extraction, and understand pattern matching in web frameworks.
Real-World Context: URL routing is the foundation of all web frameworks, from Express.js to Ruby on Rails to Go's Gin and Chi. Understanding how routing works under the hood helps you build more efficient APIs and debug routing issues. This exercise reveals patterns used in production web servers handling millions of requests.
Difficulty: Intermediate | Time Estimate: 45 minutes
Create a simple URL router using regex for path pattern matching while supporting parameter extraction, middleware integration, and performance optimization for high-traffic applications.
Solution
package main

import (
	"fmt"
	"regexp"
)

type Route struct {
	Pattern *regexp.Regexp
	Names   []string
	Handler func(map[string]string)
}

type Router struct {
	routes []Route
}

func NewRouter() *Router {
	return &Router{routes: make([]Route, 0)}
}

func (r *Router) AddRoute(pattern string, handler func(map[string]string)) error {
	// Convert pattern to regex:
	// /users/:id -> /users/(?P<id>[^/]+)
	// /posts/:id/comments/:cid -> /posts/(?P<id>[^/]+)/comments/(?P<cid>[^/]+)
	regexPattern := regexp.MustCompile(`:(\w+)`).ReplaceAllString(pattern, `(?P<$1>[^/]+)`)
	regexPattern = "^" + regexPattern + "$"

	re, err := regexp.Compile(regexPattern)
	if err != nil {
		return err
	}

	route := Route{
		Pattern: re,
		Names:   re.SubexpNames(),
		Handler: handler,
	}

	r.routes = append(r.routes, route)
	return nil
}

func (r *Router) Match(path string) bool {
	for _, route := range r.routes {
		matches := route.Pattern.FindStringSubmatch(path)
		if matches == nil {
			continue
		}

		// Extract parameters
		params := make(map[string]string)
		for i, name := range route.Names {
			if i > 0 && name != "" {
				params[name] = matches[i]
			}
		}

		// Call handler
		route.Handler(params)
		return true
	}

	return false
}

func main() {
	router := NewRouter()

	// Define routes
	router.AddRoute("/users/:id", func(params map[string]string) {
		fmt.Printf("User handler: ID=%s\n", params["id"])
	})

	router.AddRoute("/posts/:id/comments/:cid", func(params map[string]string) {
		fmt.Printf("Comment handler: Post ID=%s, Comment ID=%s\n",
			params["id"], params["cid"])
	})

	router.AddRoute("/api/v:version/:resource", func(params map[string]string) {
		fmt.Printf("API handler: Version=%s, Resource=%s\n",
			params["version"], params["resource"])
	})

	// Test routes
	paths := []string{
		"/users/123",
		"/posts/456/comments/789",
		"/api/v2/products",
		"/not/found",
	}

	for _, path := range paths {
		fmt.Printf("\nMatching: %s\n", path)
		if !router.Match(path) {
			fmt.Println("  No route matched")
		}
	}
}
Exercise 3: Email Validator with Custom Rules
Learning Objectives: Implement advanced validation patterns, handle business rules with regex, and build robust input validation systems.
Real-World Context: Email validation is critical for user registration systems, marketing campaigns, and communication platforms. From preventing spam to ensuring deliverability, proper email validation saves businesses millions in lost revenue and prevents security issues. This exercise teaches production-level validation patterns used by services like Mailchimp and SendGrid.
Difficulty: Advanced | Time Estimate: 55 minutes
Build an email validator that enforces custom rules beyond basic format validation while handling internationalization, disposable email detection, and domain validation.
Requirements:
- Valid email format
- Blacklist certain domains
- Require corporate email addresses
- Check for disposable email providers
- Validate TLD length and format
Solution with Explanation
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// EmailValidator validates email addresses with custom rules
type EmailValidator struct {
	emailPattern       *regexp.Regexp
	blacklistedDomains map[string]bool
	allowedDomains     map[string]bool
	disposableDomains  map[string]bool
	requireCorporate   bool
}

type ValidationResult struct {
	Valid  bool
	Errors []string
}

func NewEmailValidator(requireCorporate bool) *EmailValidator {
	return &EmailValidator{
		// RFC 5322 compliant email pattern
		emailPattern: regexp.MustCompile(`^[a-zA-Z0-9.!#$%&'*+/=?^_` + "`" + `{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$`),
		blacklistedDomains: map[string]bool{
			"spam.com":       true,
			"mailinator.com": true,
			"tempmail.com":   true,
		},
		allowedDomains: map[string]bool{
			"company.com": true,
			"corp.com":    true,
		},
		disposableDomains: map[string]bool{
			"guerrillamail.com": true,
			"10minutemail.com":  true,
			"throwaway.email":   true,
		},
		requireCorporate: requireCorporate,
	}
}

func (v *EmailValidator) Validate(email string) ValidationResult {
	result := ValidationResult{Valid: true, Errors: []string{}}

	email = strings.TrimSpace(strings.ToLower(email))

	// Rule 1: Basic format validation
	if !v.emailPattern.MatchString(email) {
		result.Valid = false
		result.Errors = append(result.Errors, "Invalid email format")
		return result // No point checking further if format is wrong
	}

	// Extract domain
	parts := strings.Split(email, "@")
	if len(parts) != 2 {
		result.Valid = false
		result.Errors = append(result.Errors, "Invalid email structure")
		return result
	}

	domain := parts[1]

	// Rule 2: Check blacklisted domains
	if v.blacklistedDomains[domain] {
		result.Valid = false
		result.Errors = append(result.Errors, fmt.Sprintf("Domain '%s' is blacklisted", domain))
	}

	// Rule 3: Check disposable email providers
	if v.disposableDomains[domain] {
		result.Valid = false
		result.Errors = append(result.Errors, "Disposable email addresses are not allowed")
	}

	// Rule 4: Corporate email requirement
	if v.requireCorporate && !v.allowedDomains[domain] {
		result.Valid = false
		result.Errors = append(result.Errors, "Only corporate email addresses are allowed")
	}

	// Rule 5: Validate TLD
	// (in production, hoist this compile out of the hot path)
	tldPattern := regexp.MustCompile(`\.([a-z]{2,})$`)
	matches := tldPattern.FindStringSubmatch(domain)
	if len(matches) < 2 {
		result.Valid = false
		result.Errors = append(result.Errors, "Invalid top-level domain")
	} else {
		tld := matches[1]
		if len(tld) < 2 || len(tld) > 6 {
			result.Valid = false
			result.Errors = append(result.Errors, "TLD must be between 2 and 6 characters")
		}
	}

	// Rule 6: Check for common typos in popular domains
	commonDomains := map[string]string{
		"gmial.com":   "gmail.com",
		"gmai.com":    "gmail.com",
		"yahooo.com":  "yahoo.com",
		"hotmial.com": "hotmail.com",
	}

	if correctDomain, found := commonDomains[domain]; found {
		result.Valid = false
		result.Errors = append(result.Errors,
			fmt.Sprintf("Did you mean '%s' instead of '%s'?", correctDomain, domain))
	}

	// Rule 7: Check local part length
	localPart := parts[0]
	if len(localPart) > 64 {
		result.Valid = false
		result.Errors = append(result.Errors, "Local part exceeds 64 characters")
	}

	// Rule 8: Check for consecutive dots
	if strings.Contains(email, "..") {
		result.Valid = false
		result.Errors = append(result.Errors, "Consecutive dots are not allowed")
	}

	return result
}

func main() {
	// Test without corporate requirement
	validator := NewEmailValidator(false)

	testEmails := []string{
		"user@company.com",
		"test@gmail.com",
		"invalid@spam.com",
		"user@10minutemail.com",
		"bad..email@domain.com",
		"toolonglocalpartaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaart@domain.com",
		"user@gmial.com",
		"valid.user+tag@example.co.uk",
		"not-an-email",
		"user@domain.c",
	}

	fmt.Println("=== Email Validation (No Corporate Requirement) ===")
	for _, email := range testEmails {
		result := validator.Validate(email)
		fmt.Printf("\nEmail: %s\n", email)
		if result.Valid {
			fmt.Println("  Status: ✓ Valid")
		} else {
			fmt.Println("  Status: ✗ Invalid")
			for _, err := range result.Errors {
				fmt.Printf("    - %s\n", err)
			}
		}
	}

	// Test with corporate requirement
	fmt.Println("\n=== Email Validation (Corporate Requirement) ===")
	corporateValidator := NewEmailValidator(true)

	corporateEmails := []string{
		"employee@company.com",
		"user@gmail.com",
		"contractor@corp.com",
	}

	for _, email := range corporateEmails {
		result := corporateValidator.Validate(email)
		fmt.Printf("\nEmail: %s\n", email)
		if result.Valid {
			fmt.Println("  Status: ✓ Valid")
		} else {
			fmt.Println("  Status: ✗ Invalid")
			for _, err := range result.Errors {
				fmt.Printf("    - %s\n", err)
			}
		}
	}
}
Explanation:
This email validator demonstrates several regex techniques and validation patterns:
- RFC-Compliant Pattern: Uses a comprehensive regex that handles most valid email formats, including special characters, hyphens in domains, and multi-part TLDs.
- Domain Extraction: Splits the email to validate the local part and domain separately.
- TLD Validation: Uses regex to extract and validate top-level domain length.
- Multi-Rule Validation: Applies multiple validation rules in sequence, collecting all errors rather than failing on the first one.
- Practical Checks:
  - Blacklist/whitelist domain checking
  - Disposable email detection
  - Common typo detection
  - Consecutive dot prevention
  - Length constraints
- User-Friendly Error Messages: Provides specific error messages for each validation failure, including suggestions for common typos.
This pattern is useful for production email validation where you need more than basic format checking, such as preventing spam signups, enforcing corporate email policies, or improving user experience by catching common mistakes.
Exercise 4: Phone Number Validator (NEW)
Learning Objectives: Handle international phone number formats, implement flexible validation for multiple countries, and manage formatting variations.
Real-World Context: Phone number validation is essential for user verification, SMS notifications, and communication systems. From e-commerce checkouts to two-factor authentication, proper phone validation prevents failed deliveries and improves user experience. This exercise teaches patterns used by Twilio, AWS SNS, and other messaging platforms.
Difficulty: Intermediate | Time Estimate: 40 minutes
Build a phone number validator that supports multiple international formats (US, UK, Germany, Japan) with proper country code handling and format normalization.
Requirements:
- Support multiple country formats
- Validate country codes
- Normalize phone numbers to E.164 format
- Detect and extract country/area codes
Solution
```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type PhoneFormat struct {
	Country string
	Pattern *regexp.Regexp
	Example string
}

type PhoneValidator struct {
	formats []PhoneFormat
}

// digitsRe is compiled once at package level rather than on every call.
var digitsRe = regexp.MustCompile(`\d+`)

func NewPhoneValidator() *PhoneValidator {
	formats := []PhoneFormat{
		{
			Country: "US",
			Pattern: regexp.MustCompile(`^\+?1?[-.\s]?\(?([2-9]\d{2})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$`),
			Example: "+1 (555) 123-4567",
		},
		{
			Country: "UK",
			Pattern: regexp.MustCompile(`^\+?44[-.\s]?(\d{4})[-.\s]?(\d{6})$`),
			Example: "+44 7700 900123",
		},
		{
			Country: "Germany",
			Pattern: regexp.MustCompile(`^\+?49[-.\s]?(\d{3})[-.\s]?(\d{7,8})$`),
			Example: "+49 30 12345678",
		},
		{
			Country: "Japan",
			Pattern: regexp.MustCompile(`^\+?81[-.\s]?(\d{1,4})[-.\s]?(\d{1,4})[-.\s]?(\d{4})$`),
			Example: "+81 3-1234-5678",
		},
	}

	return &PhoneValidator{formats: formats}
}

// Validate reports whether phone matches a known format and, if so,
// returns the country and the E.164 normalization. The patterns already
// tolerate spaces, dots, dashes, and parentheses, so the raw input is
// matched directly.
func (pv *PhoneValidator) Validate(phone string) (bool, string, string) {
	for _, format := range pv.formats {
		if format.Pattern.MatchString(phone) {
			e164 := pv.toE164(phone, format)
			return true, format.Country, e164
		}
	}

	return false, "", ""
}

func (pv *PhoneValidator) toE164(phone string, format PhoneFormat) string {
	// Keep digits only; any leading "+" is dropped here, so the country
	// code check below works on bare digits.
	digits := strings.Join(digitsRe.FindAllString(phone, -1), "")

	// Prepend the country code if the caller omitted it.
	switch format.Country {
	case "US":
		if len(digits) == 10 {
			digits = "1" + digits
		}
	case "UK":
		if !strings.HasPrefix(digits, "44") {
			digits = "44" + digits
		}
	case "Germany":
		if !strings.HasPrefix(digits, "49") {
			digits = "49" + digits
		}
	case "Japan":
		if !strings.HasPrefix(digits, "81") {
			digits = "81" + digits
		}
	}

	return "+" + digits
}

func main() {
	validator := NewPhoneValidator()

	testPhones := []string{
		"+1 (555) 123-4567",
		"555-123-4567",
		"+44 7700 900123",
		"+49 30 12345678",
		"+81 3-1234-5678",
		"invalid-phone",
		"123",
	}

	for _, phone := range testPhones {
		valid, country, e164 := validator.Validate(phone)
		fmt.Printf("Phone: %-20s Valid: %-5v", phone, valid)
		if valid {
			fmt.Printf(" Country: %-10s E.164: %s", country, e164)
		}
		fmt.Println()
	}
}
```
Exercise 5: Data Extraction from Mixed Format Documents
Learning Objectives: Build complex extraction patterns, handle multiple data formats in a single document, and implement robust error handling for malformed data.
Real-World Context: Real-world documents often contain mixed formats - invoices with dates, amounts, and reference numbers; contracts with parties, dates, and clauses; or resumes with contact info, dates, and skills. This exercise teaches patterns used by document processing systems like DocuSign, legal tech platforms, and HR systems.
Difficulty: Advanced | Time Estimate: 65 minutes
Create a document parser that extracts structured data from text containing multiple types of information (dates, amounts, emails, phone numbers, reference codes) while handling formatting inconsistencies and validation.
Requirements:
- Extract multiple entity types from single text
- Handle various date formats
- Parse currency amounts with symbols
- Extract and validate reference codes
- Build structured output from unstructured text
Solution
```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

type DocumentData struct {
	Dates      []string
	Amounts    []string
	Emails     []string
	Phones     []string
	References []string
	Names      []string
}

type DocumentParser struct {
	dateRe   *regexp.Regexp
	amountRe *regexp.Regexp
	emailRe  *regexp.Regexp
	phoneRe  *regexp.Regexp
	refRe    *regexp.Regexp
	nameRe   *regexp.Regexp
}

// digitRe is shared by extractPhones and compiled once at package level.
var digitRe = regexp.MustCompile(`\d`)

func NewDocumentParser() *DocumentParser {
	return &DocumentParser{
		// Date patterns: MM/DD/YYYY, YYYY-MM-DD, Month DD, YYYY
		dateRe: regexp.MustCompile(`\b(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4})\b`),

		// Amount patterns: $1,234.56, USD 1234.56, €1.234,56
		amountRe: regexp.MustCompile(`(?:USD|EUR|GBP|\$|€|£)\s*[\d,]+\.?\d*`),

		// Email pattern. Note [A-Za-z] for the TLD: writing [A-Z|a-z]
		// would add a literal "|" to the class, a common mistake.
		emailRe: regexp.MustCompile(`\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`),

		// Phone patterns
		phoneRe: regexp.MustCompile(`\+?\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}`),

		// Reference codes: INV-12345, REF#98765, ORD-2024-001
		refRe: regexp.MustCompile(`\b(?:INV|REF|ORD|PO|ID)[-#]?\d{3,}[-]?\d*\b`),

		// Name patterns: Mr./Ms./Dr. FirstName LastName
		nameRe: regexp.MustCompile(`\b(?:Mr\.|Ms\.|Mrs\.|Dr\.) [A-Z][a-z]+ [A-Z][a-z]+\b`),
	}
}

func (dp *DocumentParser) Parse(text string) DocumentData {
	return DocumentData{
		Dates:      dp.extractDates(text),
		Amounts:    dp.amountRe.FindAllString(text, -1),
		Emails:     dp.emailRe.FindAllString(text, -1),
		Phones:     dp.extractPhones(text),
		References: dp.refRe.FindAllString(text, -1),
		Names:      dp.nameRe.FindAllString(text, -1),
	}
}

func (dp *DocumentParser) extractDates(text string) []string {
	matches := dp.dateRe.FindAllString(text, -1)
	validated := make([]string, 0, len(matches))

	for _, match := range matches {
		// Parse each candidate to confirm it is a real calendar date,
		// not just something date-shaped.
		formats := []string{
			"01/02/2006",
			"2006-01-02",
			"Jan 2, 2006",
			"January 2, 2006",
		}

		for _, format := range formats {
			if _, err := time.Parse(format, match); err == nil {
				validated = append(validated, match)
				break
			}
		}
	}

	return validated
}

func (dp *DocumentParser) extractPhones(text string) []string {
	matches := dp.phoneRe.FindAllString(text, -1)
	filtered := make([]string, 0, len(matches))

	for _, match := range matches {
		// Filter out things that look like phone numbers but aren't
		// (e.g., dates, amounts) by checking the digit count.
		digits := digitRe.FindAllString(match, -1)
		if len(digits) >= 7 && len(digits) <= 15 {
			filtered = append(filtered, match)
		}
	}

	return filtered
}

func (dp *DocumentParser) FormatOutput(data DocumentData) string {
	var sb strings.Builder

	sb.WriteString("=== Extracted Document Data ===\n\n")

	if len(data.Dates) > 0 {
		sb.WriteString("Dates:\n")
		for _, date := range data.Dates {
			sb.WriteString(fmt.Sprintf("  - %s\n", date))
		}
		sb.WriteString("\n")
	}

	if len(data.Amounts) > 0 {
		sb.WriteString("Amounts:\n")
		for _, amount := range data.Amounts {
			sb.WriteString(fmt.Sprintf("  - %s\n", amount))
		}
		sb.WriteString("\n")
	}

	if len(data.Emails) > 0 {
		sb.WriteString("Emails:\n")
		for _, email := range data.Emails {
			sb.WriteString(fmt.Sprintf("  - %s\n", email))
		}
		sb.WriteString("\n")
	}

	if len(data.Phones) > 0 {
		sb.WriteString("Phones:\n")
		for _, phone := range data.Phones {
			sb.WriteString(fmt.Sprintf("  - %s\n", phone))
		}
		sb.WriteString("\n")
	}

	if len(data.References) > 0 {
		sb.WriteString("Reference Codes:\n")
		for _, ref := range data.References {
			sb.WriteString(fmt.Sprintf("  - %s\n", ref))
		}
		sb.WriteString("\n")
	}

	if len(data.Names) > 0 {
		sb.WriteString("Names:\n")
		for _, name := range data.Names {
			sb.WriteString(fmt.Sprintf("  - %s\n", name))
		}
		sb.WriteString("\n")
	}

	return sb.String()
}

func main() {
	parser := NewDocumentParser()

	// Sample document text
	document := `
INVOICE

Invoice Number: INV-2024-12345
Date: March 15, 2024
Due Date: 04/15/2024

Bill To:
Dr. Jane Smith
Email: jane.smith@example.com
Phone: +1 (555) 123-4567

Ship To:
Mr. John Doe
Email: john.doe@company.com
Phone: +44 20 7123 4567

Items:
- Product A: $1,234.56
- Product B: USD 987.65
- Service C: €456.78

Total Amount: $2,679.99

Please reference order number ORD-2024-001 in your payment.
For questions, contact support@example.com or call 1-800-555-0199.

Payment due by 2024-04-15.
Thank you for your business!
`

	data := parser.Parse(document)
	output := parser.FormatOutput(data)

	fmt.Println(output)

	// Additional analysis
	fmt.Println("=== Analysis ===")
	fmt.Printf("Total entities extracted: %d\n",
		len(data.Dates)+len(data.Amounts)+len(data.Emails)+
			len(data.Phones)+len(data.References)+len(data.Names))
}
```
Summary
💡 Key Takeaways:
- Compile Once, Use Many Times - Pattern compilation is expensive, do it at startup or cache it
- Anchors Are Essential - Use ^ and $ to match complete strings, not substrings, for security
- Simple Beats Complex - A 95% solution that runs fast beats a 100% solution that crawls
- RE2 Limitations Matter - No lookahead/lookbehind, but guaranteed linear time prevents ReDoS
- Security First - Always validate inputs and prevent ReDoS attacks with safe patterns
- Named Groups Aid Maintenance - Use (?P<name>...) for complex patterns to improve readability
- Test Edge Cases - Regex bugs often surface with unusual input like empty strings, Unicode, or very long text
⚠️ Production Considerations:
- Cache Compiled Patterns - Never compile regex in hot paths or request handlers
- Validate User Patterns - Never compile regex patterns from untrusted input without validation
- Profile Performance - Regex can be a bottleneck in high-throughput systems, measure before optimizing
- Document Complex Patterns - Future you will thank present you for explaining what (?:(?:[^"\\]|\\.)*) means
- Test Edge Cases - Regex bugs often surface with unusual input - test empty strings, Unicode, very long strings
- Security Audits - Review regex patterns during security audits for potential DoS vectors
- Monitor Performance - Track regex execution time in production to catch performance regressions
Real-world Wisdom: Regular expressions are like power tools - incredibly useful when used appropriately, but dangerous when misapplied. The famous quote "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems" exists for a reason. Use regex for pattern matching and validation, but reach for proper parsers when you're dealing with structured data like JSON, XML, or programming languages.
When to use regex:
- Validating user input (emails, phone numbers, usernames)
- Text searching and simple extraction
- Log file analysis and filtering
- Simple data transformation and replacement
- URL routing and pattern matching
- Quick prototypes and one-off scripts
When NOT to use regex:
- Parsing HTML/XML (use proper parsers like golang.org/x/net/html)
- Complex programming language parsing (use go/parser or similar)
- When performance is critical and simpler alternatives exist (strings package)
- For highly nested or recursive structures (regex isn't designed for this)
- When the pattern becomes unreadable and unmaintainable
- Configuration file parsing (use structured formats like JSON, YAML, TOML)
Next Steps:
- Advanced Patterns: Study advanced techniques like conditional patterns and subroutines (if your regex engine supports them)
- Performance Optimization: Learn to measure and optimize regex performance with benchmarks
- Security Patterns: Master secure validation and ReDoS prevention techniques
- Alternative Libraries: Explore specialized libraries for complex parsing tasks
- Testing Strategies: Build comprehensive test suites for regex patterns with edge cases
- Documentation Tools: Learn to use tools like regex101.com for pattern documentation and sharing
Mastering Go's regexp package transforms you from a developer who struggles with text processing to one who builds robust, performant pattern matching systems that handle real-world complexity safely and efficiently. Remember: the best regex is often the one you didn't write - if a simpler solution exists using the strings package or structured parsing, prefer that.