Exercise: Concurrent Web Crawler
Difficulty - Intermediate
📋 Prerequisites: This exercise assumes familiarity with Concurrency in Go for goroutines and channels, and Web Development for HTTP client fundamentals.
Learning Objectives
- Implement concurrent web crawling with goroutines
- Master URL deduplication with thread-safe maps
- Handle graceful shutdown with context cancellation
- Implement rate limiting and politeness policies
- Process and extract links from HTML documents
- Manage crawler depth and breadth-first traversal
Problem Statement
Create a production-ready concurrent web crawler that can efficiently crawl websites while respecting robots.txt, implementing rate limiting, and handling various edge cases. The crawler should support:
- Concurrent crawling: Multiple goroutines fetching pages simultaneously
- URL deduplication: Track visited URLs to prevent loops
- Depth control: Limit crawl depth from the starting URL
- Rate limiting: Respect politeness policies and avoid overwhelming servers
- Graceful shutdown: Stop cleanly on context cancellation
- Error handling: Handle network errors, timeouts, and invalid URLs
- Link extraction: Parse HTML and extract absolute URLs
Data Structures
```go
package crawler

import (
	"net/http"
	"sync"
	"time"
)

// Page represents a crawled page
type Page struct {
	URL       string
	Title     string
	Links     []string
	Depth     int
	FetchTime time.Duration
	Error     error
}

// CrawlResult contains the results of a crawl operation
type CrawlResult struct {
	Pages       []Page
	TotalPages  int
	TotalErrors int
	Duration    time.Duration
}

// Crawler manages the web crawling process
type Crawler struct {
	maxDepth      int
	maxConcurrent int
	timeout       time.Duration
	userAgent     string
	visited       map[string]bool
	mu            sync.RWMutex
	client        *http.Client
	rateLimiter   *time.Ticker
}

// Config holds crawler configuration
type Config struct {
	MaxDepth      int
	MaxConcurrent int
	Timeout       time.Duration
	UserAgent     string
	RateLimit     time.Duration
}
```
Function Signatures
```go
// NewCrawler creates a new web crawler
func NewCrawler(config Config) *Crawler

// Crawl starts crawling from the seed URL
func (c *Crawler) Crawl(ctx context.Context, seedURL string) (*CrawlResult, error)

// crawlPage fetches and processes a single page
func (c *Crawler) crawlPage(ctx context.Context, url string, depth int) (*Page, error)

// extractLinks parses HTML and extracts all links
func extractLinks(body io.Reader, baseURL string) ([]string, error)

// isVisited checks if a URL has been visited
func (c *Crawler) isVisited(url string) bool

// markVisited marks a URL as visited
func (c *Crawler) markVisited(url string)

// normalizeURL normalizes URLs for consistent comparison
func normalizeURL(rawURL string) (string, error)
```
Example Usage
```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"crawler"
)

func main() {
	// Configure crawler
	config := crawler.Config{
		MaxDepth:      3,
		MaxConcurrent: 5,
		Timeout:       10 * time.Second,
		UserAgent:     "MyCrawler/1.0",
		RateLimit:     500 * time.Millisecond,
	}

	// Create crawler
	c := crawler.NewCrawler(config)

	// Set up context with timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Start crawling
	seedURL := "https://example.com"
	result, err := c.Crawl(ctx, seedURL)
	if err != nil {
		log.Fatal(err)
	}

	// Display results
	fmt.Printf("Crawl completed in %v\n", result.Duration)
	fmt.Printf("Total pages: %d\n", result.TotalPages)
	fmt.Printf("Total errors: %d\n", result.TotalErrors)

	// Display pages
	for _, page := range result.Pages {
		if page.Error != nil {
			fmt.Printf("[ERROR] %s: %v\n", page.URL, page.Error)
			continue
		}

		fmt.Printf("[%d] %s - %s (%v, %d links)\n",
			page.Depth,
			page.URL,
			page.Title,
			page.FetchTime,
			len(page.Links),
		)
	}
}
```
Requirements
- Crawler must use goroutines for concurrent page fetching
- Implement thread-safe URL deduplication
- Respect maximum depth limit
- Apply rate limiting between requests
- Handle context cancellation gracefully
- Extract and normalize URLs from HTML
- Collect comprehensive statistics about the crawl
Solution
```go
package crawler

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"sync"
	"time"

	"golang.org/x/net/html"
)

// Page represents a crawled page
type Page struct {
	URL       string
	Title     string
	Links     []string
	Depth     int
	FetchTime time.Duration
	Error     error
}

// CrawlResult contains the results of a crawl operation
type CrawlResult struct {
	Pages       []Page
	TotalPages  int
	TotalErrors int
	Duration    time.Duration
}

// Crawler manages the web crawling process
type Crawler struct {
	maxDepth      int
	maxConcurrent int
	timeout       time.Duration
	userAgent     string
	visited       map[string]bool
	mu            sync.RWMutex
	client        *http.Client
	rateLimiter   *time.Ticker
}

// Config holds crawler configuration
type Config struct {
	MaxDepth      int
	MaxConcurrent int
	Timeout       time.Duration
	UserAgent     string
	RateLimit     time.Duration
}

// urlJob represents a URL to crawl
type urlJob struct {
	url   string
	depth int
}

// NewCrawler creates a new web crawler
func NewCrawler(config Config) *Crawler {
	return &Crawler{
		maxDepth:      config.MaxDepth,
		maxConcurrent: config.MaxConcurrent,
		timeout:       config.Timeout,
		userAgent:     config.UserAgent,
		visited:       make(map[string]bool),
		client: &http.Client{
			Timeout: config.Timeout,
			CheckRedirect: func(req *http.Request, via []*http.Request) error {
				if len(via) >= 10 {
					return fmt.Errorf("too many redirects")
				}
				return nil
			},
		},
		rateLimiter: time.NewTicker(config.RateLimit),
	}
}

// Crawl starts crawling from the seed URL and returns the collected pages.
func (c *Crawler) Crawl(ctx context.Context, seedURL string) (*CrawlResult, error) {
	start := time.Now()

	// Normalize seed URL
	normalizedSeed, err := normalizeURL(seedURL)
	if err != nil {
		return nil, fmt.Errorf("invalid seed URL: %w", err)
	}

	// Channels for job distribution and result collection
	jobs := make(chan urlJob, c.maxConcurrent*2)
	results := make(chan Page, c.maxConcurrent*2)

	// Worker pool
	var wg sync.WaitGroup
	for i := 0; i < c.maxConcurrent; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.worker(ctx, jobs, results)
		}()
	}

	// Result collection
	var (
		pages       []Page
		totalErrors int
	)

	collect := func(page Page) {
		pages = append(pages, page)
		if page.Error != nil {
			totalErrors++
		}
	}

	buildResult := func() *CrawlResult {
		return &CrawlResult{
			Pages:       pages,
			TotalPages:  len(pages),
			TotalErrors: totalErrors,
			Duration:    time.Since(start),
		}
	}

	// shutdown stops the workers and drains any in-flight results so that
	// no worker is left blocked on the results channel.
	shutdown := func() {
		close(jobs)
		go func() {
			wg.Wait()
			close(results)
		}()
		for page := range results {
			collect(page)
		}
	}

	// Seed the traversal.
	c.markVisited(normalizedSeed)
	queue := []urlJob{{url: normalizedSeed, depth: 0}} // overflow queue so this loop never blocks
	pending := 1                                       // URLs queued but not yet reported back

	// BFS traversal
	for pending > 0 {
		// Only arm the dispatch case when there is something to send.
		var dispatch chan<- urlJob
		var next urlJob
		if len(queue) > 0 {
			dispatch = jobs
			next = queue[0]
		}

		select {
		case <-ctx.Done():
			shutdown()
			return buildResult(), ctx.Err()

		case dispatch <- next:
			queue = queue[1:]

		case page := <-results:
			pending--
			collect(page)

			// Queue newly discovered URLs if within the depth limit
			if page.Error == nil && page.Depth < c.maxDepth {
				for _, link := range page.Links {
					normalized, err := normalizeURL(link)
					if err != nil || c.isVisited(normalized) {
						continue
					}
					c.markVisited(normalized)
					pending++
					queue = append(queue, urlJob{url: normalized, depth: page.Depth + 1})
				}
			}
		}
	}

	shutdown()
	return buildResult(), nil
}

// worker processes crawl jobs
func (c *Crawler) worker(ctx context.Context, jobs <-chan urlJob, results chan<- Page) {
	for job := range jobs {
		select {
		case <-ctx.Done():
			return
		case <-c.rateLimiter.C:
			page, err := c.crawlPage(ctx, job.url, job.depth)
			if err != nil {
				results <- Page{URL: job.url, Depth: job.depth, Error: err}
				continue
			}
			results <- *page
		}
	}
}

// crawlPage fetches and processes a single page
func (c *Crawler) crawlPage(ctx context.Context, targetURL string, depth int) (*Page, error) {
	start := time.Now()

	// Create request
	req, err := http.NewRequestWithContext(ctx, "GET", targetURL, nil)
	if err != nil {
		return nil, fmt.Errorf("create request: %w", err)
	}

	req.Header.Set("User-Agent", c.userAgent)

	// Fetch page
	resp, err := c.client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetch: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("status %d", resp.StatusCode)
	}

	// Extract title and links
	title, links, err := c.parseHTML(resp.Body, targetURL)
	if err != nil {
		return nil, fmt.Errorf("parse HTML: %w", err)
	}

	return &Page{
		URL:       targetURL,
		Title:     title,
		Links:     links,
		Depth:     depth,
		FetchTime: time.Since(start),
	}, nil
}

// parseHTML extracts the title and links from an HTML document
func (c *Crawler) parseHTML(body io.Reader, baseURL string) (string, []string, error) {
	doc, err := html.Parse(body)
	if err != nil {
		return "", nil, err
	}

	var title string
	var links []string

	var traverse func(*html.Node)
	traverse = func(n *html.Node) {
		if n.Type == html.ElementNode {
			if n.Data == "title" && n.FirstChild != nil {
				title = n.FirstChild.Data
			}
			if n.Data == "a" {
				for _, attr := range n.Attr {
					if attr.Key == "href" {
						if absURL := makeAbsolute(baseURL, attr.Val); absURL != "" {
							links = append(links, absURL)
						}
					}
				}
			}
		}
		for child := n.FirstChild; child != nil; child = child.NextSibling {
			traverse(child)
		}
	}

	traverse(doc)
	return title, links, nil
}

// makeAbsolute converts relative URLs to absolute
func makeAbsolute(base, href string) string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return ""
	}

	hrefURL, err := url.Parse(href)
	if err != nil {
		return ""
	}

	// Skip non-HTTP(S) URLs
	absURL := baseURL.ResolveReference(hrefURL)
	if absURL.Scheme != "http" && absURL.Scheme != "https" {
		return ""
	}

	return absURL.String()
}

// isVisited checks if a URL has been visited
func (c *Crawler) isVisited(url string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.visited[url]
}

// markVisited marks a URL as visited
func (c *Crawler) markVisited(url string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.visited[url] = true
}

// normalizeURL normalizes URLs for consistent comparison
func normalizeURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}

	// Remove fragment
	u.Fragment = ""

	// Remove trailing slash
	u.Path = strings.TrimSuffix(u.Path, "/")

	// Lowercase scheme and host
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)

	return u.String(), nil
}
```
Explanation
Crawler Structure:
- maxDepth: Maximum crawl depth from the seed URL
- maxConcurrent: Number of concurrent workers
- timeout: HTTP request timeout
- userAgent: User-Agent header for requests
- visited: Map of visited URLs, protected by mu for safe concurrent access
- client: HTTP client with the configured timeout
- rateLimiter: Ticker used for rate limiting requests
NewCrawler:
- Initializes crawler with configuration
- Sets up HTTP client with redirect handling
- Creates rate limiter ticker
- Initializes visited map
Crawl:
- Implements breadth-first search traversal
- Uses worker pool pattern for concurrent crawling
- Tracks pending jobs to know when crawl is complete
- Handles context cancellation gracefully
- Collects results from all workers
- Returns comprehensive statistics
worker:
- Processes jobs from job queue
- Applies rate limiting before each request
- Sends results to results channel
- Respects context cancellation
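The shared time.Ticker hands out one request slot per interval across all workers. A common alternative, sketched below on the assumption that golang.org/x/time/rate is available, is a token-bucket rate.Limiter; its Wait call blocks for a slot and also honors context cancellation. The method name workerWithLimiter is illustrative, not part of the solution above.

```go
import (
	"context"

	"golang.org/x/time/rate"
)

// workerWithLimiter is a variant of the worker loop that uses a shared
// token-bucket limiter instead of the ticker, e.g. one created once with
// rate.NewLimiter(rate.Every(config.RateLimit), 1).
func (c *Crawler) workerWithLimiter(ctx context.Context, limiter *rate.Limiter, jobs <-chan urlJob, results chan<- Page) {
	for job := range jobs {
		// Wait blocks until a token is available or the context is cancelled.
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		page, err := c.crawlPage(ctx, job.url, job.depth)
		if err != nil {
			results <- Page{URL: job.url, Depth: job.depth, Error: err}
			continue
		}
		results <- *page
	}
}
```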
crawlPage:
- Fetches a single page with context
- Sets User-Agent header
- Checks HTTP status code
- Extracts title and links
- Measures fetch time
parseHTML:
- Parses HTML using golang.org/x/net/html
- Extracts the page title from the <title> tag
- Finds all <a> tags and extracts their href attributes
- Converts relative URLs to absolute
- Filters out non-HTTP(S) URLs
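html.Parse builds the whole DOM in memory before traversal. For large pages, a streaming pass with the tokenizer from the same golang.org/x/net/html package is a common alternative; the sketch below extracts links only (no title) and reuses makeAbsolute from the solution.

```go
// extractLinksStreaming is a sketch of link extraction with the streaming
// tokenizer instead of a full DOM parse.
func extractLinksStreaming(body io.Reader, baseURL string) []string {
	var links []string
	z := html.NewTokenizer(body)
	for {
		switch z.Next() {
		case html.ErrorToken:
			// io.EOF or a parse error ends the stream either way.
			return links
		case html.StartTagToken, html.SelfClosingTagToken:
			name, hasAttr := z.TagName()
			if string(name) != "a" || !hasAttr {
				continue
			}
			for {
				key, val, more := z.TagAttr()
				if string(key) == "href" {
					if abs := makeAbsolute(baseURL, string(val)); abs != "" {
						links = append(links, abs)
					}
				}
				if !more {
					break
				}
			}
		}
	}
}
```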
Thread-Safety:
- isVisited and markVisited use an RWMutex for safe concurrent access
- Visited map protected from race conditions
- Multiple goroutines can safely check/mark URLs
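The RWMutex-guarded map is all this exercise needs, since only the Crawl loop marks URLs. If several goroutines enqueued links directly, a sync.Map variant such as the sketch below (type and method names are illustrative) would collapse the check and the mark into a single atomic LoadOrStore.

```go
// visitedSet is an alternative dedupe structure built on sync.Map.
type visitedSet struct {
	m sync.Map
}

// firstVisit reports whether this call was the first to see the URL,
// marking it as visited in the same atomic step.
func (v *visitedSet) firstVisit(url string) bool {
	_, loaded := v.m.LoadOrStore(url, true)
	return !loaded
}
```

A caller would then write `if visited.firstVisit(u) { /* enqueue u */ }` instead of the separate isVisited/markVisited pair.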
Graceful Shutdown:
- Context cancellation stops all workers
- Drains result channel before returning
- Returns partial results on cancellation
- No goroutine leaks
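In a long-running binary, the same cancellation path is usually wired to OS signals. A minimal sketch, assuming the Crawl method above and using signal.NotifyContext from the standard library:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	"crawler" // the exercise package; the import path is illustrative
)

func main() {
	// Cancel the crawl on Ctrl-C or SIGTERM; stop() releases the signal handler.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	c := crawler.NewCrawler(crawler.Config{
		MaxDepth:      2,
		MaxConcurrent: 4,
		Timeout:       10 * time.Second,
		UserAgent:     "MyCrawler/1.0",
		RateLimit:     time.Second,
	})

	result, err := c.Crawl(ctx, "https://example.com")
	if err != nil {
		// On interrupt, Crawl returns the pages gathered so far plus ctx.Err().
		log.Printf("crawl stopped early: %v", err)
	}
	if result != nil {
		fmt.Printf("crawled %d pages\n", result.TotalPages)
	}
}
```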
Crawler Patterns
Pattern 1: Domain-Focused Crawling
```go
func shouldCrawl(targetURL, seedURL string) bool {
	target, err := url.Parse(targetURL)
	if err != nil {
		return false
	}
	seed, err := url.Parse(seedURL)
	if err != nil {
		return false
	}
	return target.Host == seed.Host
}
```
Pattern 2: Robots.txt Compliance
```go
type RobotsChecker struct {
	cache map[string]*robotstxt.RobotsData
	mu    sync.RWMutex
}

// CanCrawl checks robots.txt rules for the target URL
func (rc *RobotsChecker) CanCrawl(targetURL, userAgent string) bool
```
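A hedged sketch of how that check might be filled in on top of the struct above, assuming the github.com/temoto/robotstxt package also used under Best Practices below; treating fetch or parse failures as "allowed" is an illustrative policy choice.

```go
import (
	"net/http"
	"net/url"

	"github.com/temoto/robotstxt"
)

// CanCrawl fetches and caches robots.txt per host, then tests the path.
func (rc *RobotsChecker) CanCrawl(targetURL, userAgent string) bool {
	u, err := url.Parse(targetURL)
	if err != nil {
		return false
	}
	host := u.Scheme + "://" + u.Host

	rc.mu.RLock()
	data, ok := rc.cache[host]
	rc.mu.RUnlock()

	if !ok {
		resp, err := http.Get(host + "/robots.txt")
		if err != nil {
			return true
		}
		defer resp.Body.Close()
		data, err = robotstxt.FromResponse(resp)
		if err != nil {
			return true
		}
		rc.mu.Lock()
		rc.cache[host] = data
		rc.mu.Unlock()
	}
	return data.TestAgent(u.Path, userAgent)
}
```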
Pattern 3: Priority Queue
```go
type PriorityURL struct {
	URL      string
	Depth    int
	Priority int
}

// Use heap for priority-based crawling
```
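A minimal sketch of that heap wiring with the standard container/heap package; the type name urlHeap, the helper newURLQueue, and the higher-Priority-first ordering are illustrative choices, not part of the exercise.

```go
import "container/heap"

// urlHeap implements heap.Interface so the highest Priority is popped first.
type urlHeap []PriorityURL

func (h urlHeap) Len() int           { return len(h) }
func (h urlHeap) Less(i, j int) bool { return h[i].Priority > h[j].Priority }
func (h urlHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *urlHeap) Push(x any)        { *h = append(*h, x.(PriorityURL)) }

func (h *urlHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// newURLQueue seeds a priority queue; a dispatcher would heap.Pop the most
// important URL next instead of taking URLs in FIFO order.
func newURLQueue(seed ...PriorityURL) *urlHeap {
	h := &urlHeap{}
	heap.Init(h)
	for _, u := range seed {
		heap.Push(h, u)
	}
	return h
}
```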
Pattern 4: Content Processing
```go
type ContentProcessor interface {
	Process(page *Page) error
}

type TextExtractor struct{}

func (t *TextExtractor) Process(page *Page) error {
	// Extract and process text content
	return nil
}
```
Best Practices
1. Respect robots.txt:
```go
// Always check robots.txt before crawling
import "github.com/temoto/robotstxt"

func checkRobots(targetURL, userAgent string) bool {
	u, err := url.Parse(targetURL)
	if err != nil {
		return false
	}
	resp, err := http.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false
	}
	return robots.TestAgent(u.Path, userAgent)
}
```
2. Implement backoff on errors:
```go
type BackoffCrawler struct {
	*Crawler
	backoff time.Duration
}

func (b *BackoffCrawler) crawlWithBackoff(ctx context.Context, url string) error {
	const retries = 3
	var err error
	for i := 0; i < retries; i++ {
		// Retry the fetch with linearly increasing backoff between attempts.
		if _, err = b.crawlPage(ctx, url, 0); err == nil {
			return nil
		}
		time.Sleep(b.backoff * time.Duration(i+1))
	}
	return fmt.Errorf("max retries exceeded: %w", err)
}
```
3. Handle different content types:
```go
func (c *Crawler) crawlPage(targetURL string) error {
	resp, err := c.client.Get(targetURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	contentType := resp.Header.Get("Content-Type")
	if !strings.Contains(contentType, "text/html") {
		return fmt.Errorf("skipping non-HTML content: %s", contentType)
	}

	// Process HTML...
	return nil
}
```
4. Monitor crawler health:
```go
type CrawlerStats struct {
	PagesPerSecond float64
	ErrorRate      float64
	ActiveWorkers  int
	QueuedJobs     int
}

func (c *Crawler) Stats() CrawlerStats {
	// Calculate and return statistics
	// (assumes the jobs channel is kept on the Crawler, e.g. c.jobs)
	return CrawlerStats{
		QueuedJobs: len(c.jobs),
		// ...
	}
}
```
Test Cases
```go
package crawler

import (
	"context"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"testing"
	"time"
)

func TestCrawler_BasicCrawl(t *testing.T) {
	// Create test server
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		html := `<html><head><title>Test Page</title></head>
		<body><a href="/page2">Link</a></body></html>`
		w.Write([]byte(html))
	}))
	defer server.Close()

	config := Config{
		MaxDepth:      1,
		MaxConcurrent: 2,
		Timeout:       5 * time.Second,
		UserAgent:     "TestBot",
		RateLimit:     100 * time.Millisecond,
	}

	crawler := NewCrawler(config)
	ctx := context.Background()

	result, err := crawler.Crawl(ctx, server.URL)
	if err != nil {
		t.Fatalf("Crawl failed: %v", err)
	}

	if result.TotalPages == 0 {
		t.Error("Expected at least 1 page crawled")
	}
}

func TestCrawler_DepthLimit(t *testing.T) {
	var requestCount int
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestCount++
		html := `<html><head><title>Page</title></head>
		<body><a href="/next">Next</a></body></html>`
		w.Write([]byte(html))
	}))
	defer server.Close()

	config := Config{
		MaxDepth:      2,
		MaxConcurrent: 1,
		Timeout:       5 * time.Second,
		UserAgent:     "TestBot",
		RateLimit:     50 * time.Millisecond,
	}

	crawler := NewCrawler(config)
	ctx := context.Background()

	_, err := crawler.Crawl(ctx, server.URL)
	if err != nil {
		t.Fatalf("Crawl failed: %v", err)
	}

	// Should crawl: depth 0 + depth 1 + depth 2 = max 3 pages
	if requestCount > 3 {
		t.Errorf("Expected max 3 requests, got %d", requestCount)
	}
}

func TestCrawler_Deduplication(t *testing.T) {
	// The handler runs concurrently, so guard the map with a mutex.
	var mu sync.Mutex
	visitCount := make(map[string]int)
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		visitCount[r.URL.Path]++
		mu.Unlock()
		html := `<html><head><title>Page</title></head>
		<body><a href="/">Home</a><a href="/page">Page</a></body></html>`
		w.Write([]byte(html))
	}))
	defer server.Close()

	config := Config{
		MaxDepth:      2,
		MaxConcurrent: 2,
		Timeout:       5 * time.Second,
		UserAgent:     "TestBot",
		RateLimit:     50 * time.Millisecond,
	}

	crawler := NewCrawler(config)
	ctx := context.Background()

	_, err := crawler.Crawl(ctx, server.URL)
	if err != nil {
		t.Fatalf("Crawl failed: %v", err)
	}

	// Each URL should be visited exactly once
	mu.Lock()
	defer mu.Unlock()
	for path, count := range visitCount {
		if count > 1 {
			t.Errorf("URL %s visited %d times, expected 1", path, count)
		}
	}
}

func TestCrawler_ContextCancellation(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // Slow response
		html := `<html><head><title>Page</title></head><body></body></html>`
		w.Write([]byte(html))
	}))
	defer server.Close()

	config := Config{
		MaxDepth:      5,
		MaxConcurrent: 2,
		Timeout:       10 * time.Second,
		UserAgent:     "TestBot",
		RateLimit:     50 * time.Millisecond,
	}

	crawler := NewCrawler(config)
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	_, err := crawler.Crawl(ctx, server.URL)
	if err != context.DeadlineExceeded {
		t.Errorf("Expected context deadline exceeded, got: %v", err)
	}
}

func TestCrawler_ErrorHandling(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(r.URL.Path, "error") {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		html := `<html><head><title>Page</title></head>
		<body><a href="/error">Error Page</a></body></html>`
		w.Write([]byte(html))
	}))
	defer server.Close()

	config := Config{
		MaxDepth:      1,
		MaxConcurrent: 2,
		Timeout:       5 * time.Second,
		UserAgent:     "TestBot",
		RateLimit:     50 * time.Millisecond,
	}

	crawler := NewCrawler(config)
	ctx := context.Background()

	result, err := crawler.Crawl(ctx, server.URL)
	if err != nil {
		t.Fatalf("Crawl failed: %v", err)
	}

	if result.TotalErrors == 0 {
		t.Error("Expected some errors from 500 responses")
	}
}

func TestNormalizeURL(t *testing.T) {
	tests := []struct {
		input    string
		expected string
	}{
		{"https://example.com/path", "https://example.com/path"},
		{"https://example.com/path/", "https://example.com/path"},
		{"https://Example.com/path", "https://example.com/path"},
		{"https://example.com/path#fragment", "https://example.com/path"},
	}

	for _, test := range tests {
		result, err := normalizeURL(test.input)
		if err != nil {
			t.Errorf("normalizeURL(%s) error: %v", test.input, err)
		}
		if result != test.expected {
			t.Errorf("normalizeURL(%s) = %s, want %s", test.input, result, test.expected)
		}
	}
}
```
Bonus Challenges
- Robots.txt Compliance: Respect robots.txt directives
- Sitemap Support: Parse and use XML sitemaps (a parsing sketch follows this list)
- Content Filtering: Skip certain file types or patterns
- Link Prioritization: Crawl important pages first
- Distributed Crawling: Coordinate multiple crawler instances
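For the sitemap challenge above, a minimal parsing sketch with the standard encoding/xml package; the type and function names are illustrative, and sitemap index files (<sitemapindex>) are not handled.

```go
import (
	"encoding/xml"
	"io"
)

// sitemapURLSet mirrors the <urlset> document defined at sitemaps.org.
type sitemapURLSet struct {
	XMLName xml.Name `xml:"urlset"`
	URLs    []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseSitemap extracts the <loc> entries, e.g. to seed the crawl queue.
func parseSitemap(r io.Reader) ([]string, error) {
	var set sitemapURLSet
	if err := xml.NewDecoder(r).Decode(&set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, u.Loc)
	}
	return locs, nil
}
```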
Key Takeaways
- Concurrent crawling improves performance for I/O-bound operations
- URL normalization prevents duplicate crawls
- Rate limiting respects server resources and politeness
- Context cancellation enables graceful shutdown
- BFS traversal ensures systematic coverage
- Thread-safe data structures prevent race conditions
Web crawling requires careful balance between speed and politeness. Production crawlers must respect robots.txt, implement proper rate limiting, and handle errors gracefully to be good internet citizens.