Exercise: Concurrent Web Crawler

Difficulty - Intermediate

📋 Prerequisites: This exercise assumes familiarity with Concurrency in Go for goroutines and channels, and Web Development for HTTP client fundamentals.

Learning Objectives

  • Implement concurrent web crawling with goroutines
  • Master URL deduplication with thread-safe maps
  • Handle graceful shutdown with context cancellation
  • Implement rate limiting and politeness policies
  • Process and extract links from HTML documents
  • Manage crawler depth and breadth-first traversal

Problem Statement

Create a production-ready concurrent web crawler that can efficiently crawl websites while respecting robots.txt, implementing rate limiting, and handling various edge cases. The crawler should support:

  1. Concurrent crawling: Multiple goroutines fetching pages simultaneously
  2. URL deduplication: Track visited URLs to prevent loops
  3. Depth control: Limit crawl depth from the starting URL
  4. Rate limiting: Respect politeness policies and avoid overwhelming servers
  5. Graceful shutdown: Stop cleanly on context cancellation
  6. Error handling: Handle network errors, timeouts, and invalid URLs
  7. Link extraction: Parse HTML and extract absolute URLs

Data Structures

package crawler

import (
    "context"
    "net/http"
    "sync"
    "time"
)

// Page represents a crawled page
type Page struct {
    URL       string
    Title     string
    Links     []string
    Depth     int
    FetchTime time.Duration
    Error     error
}

// CrawlResult contains the results of a crawl operation
type CrawlResult struct {
    Pages       []Page
    TotalPages  int
    TotalErrors int
    Duration    time.Duration
}

// Crawler manages the web crawling process
type Crawler struct {
    maxDepth      int
    maxConcurrent int
    timeout       time.Duration
    userAgent     string
    visited       map[string]bool
    mu            sync.RWMutex
    client        *http.Client
    rateLimiter   *time.Ticker
}

// Config holds crawler configuration
type Config struct {
    MaxDepth      int
    MaxConcurrent int
    Timeout       time.Duration
    UserAgent     string
    RateLimit     time.Duration
}

Function Signatures

// NewCrawler creates a new web crawler
func NewCrawler(config Config) *Crawler

// Crawl starts crawling from the seed URL
func (c *Crawler) Crawl(ctx context.Context, seedURL string) (*CrawlResult, error)

// crawlPage fetches and processes a single page
func (c *Crawler) crawlPage(ctx context.Context, url string, depth int) (*Page, error)

// extractLinks parses HTML and extracts all links
func extractLinks(body io.Reader, baseURL string) ([]string, error)

// isVisited checks if a URL has been visited
func (c *Crawler) isVisited(url string) bool

// markVisited marks a URL as visited
func (c *Crawler) markVisited(url string)

// normalizeURL normalizes URLs for consistent comparison
func normalizeURL(rawURL string) (string, error)

Example Usage

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "crawler"
)

func main() {
    // Configure crawler
    config := crawler.Config{
        MaxDepth:      3,
        MaxConcurrent: 5,
        Timeout:       10 * time.Second,
        UserAgent:     "MyCrawler/1.0",
        RateLimit:     500 * time.Millisecond,
    }

    // Create crawler
    c := crawler.NewCrawler(config)

    // Set up context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    // Start crawling
    seedURL := "https://example.com"
    result, err := c.Crawl(ctx, seedURL)
    if err != nil {
        log.Fatal(err)
    }

    // Display results
    fmt.Printf("Crawl completed in %v\n", result.Duration)
    fmt.Printf("Total pages: %d\n", result.TotalPages)
    fmt.Printf("Total errors: %d\n", result.TotalErrors)

    // Display pages
    for _, page := range result.Pages {
        if page.Error != nil {
            fmt.Printf("[ERROR] %s: %v\n", page.URL, page.Error)
            continue
        }

        fmt.Printf("[%d] %s - %s (%v, %d links)\n",
            page.Depth,
            page.URL,
            page.Title,
            page.FetchTime,
            len(page.Links),
        )
    }
}

Requirements

  1. Crawler must use goroutines for concurrent page fetching
  2. Implement thread-safe URL deduplication
  3. Respect maximum depth limit
  4. Apply rate limiting between requests
  5. Handle context cancellation gracefully
  6. Extract and normalize URLs from HTML
  7. Collect comprehensive statistics about the crawl

Solution

package crawler

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strings"
    "sync"
    "time"

    "golang.org/x/net/html"
)

// Page represents a crawled page
type Page struct {
    URL       string
    Title     string
    Links     []string
    Depth     int
    FetchTime time.Duration
    Error     error
}

// CrawlResult contains the results of a crawl operation
type CrawlResult struct {
    Pages       []Page
    TotalPages  int
    TotalErrors int
    Duration    time.Duration
}

// Crawler manages the web crawling process
type Crawler struct {
    maxDepth      int
    maxConcurrent int
    timeout       time.Duration
    userAgent     string
    visited       map[string]bool
    mu            sync.RWMutex
    client        *http.Client
    rateLimiter   *time.Ticker
}

// Config holds crawler configuration
type Config struct {
    MaxDepth      int
    MaxConcurrent int
    Timeout       time.Duration
    UserAgent     string
    RateLimit     time.Duration
}

// urlJob represents a URL to crawl
type urlJob struct {
    url   string
    depth int
}

// NewCrawler creates a new web crawler
func NewCrawler(config Config) *Crawler {
    return &Crawler{
        maxDepth:      config.MaxDepth,
        maxConcurrent: config.MaxConcurrent,
        timeout:       config.Timeout,
        userAgent:     config.UserAgent,
        visited:       make(map[string]bool),
        client: &http.Client{
            Timeout: config.Timeout,
            CheckRedirect: func(req *http.Request, via []*http.Request) error {
                if len(via) >= 10 {
                    return fmt.Errorf("too many redirects")
                }
                return nil
            },
        },
        rateLimiter: time.NewTicker(config.RateLimit),
    }
}

// Crawl starts crawling from the seed URL
func (c *Crawler) Crawl(ctx context.Context, seedURL string) (*CrawlResult, error) {
    start := time.Now()

    // Normalize seed URL
    normalizedSeed, err := normalizeURL(seedURL)
    if err != nil {
        return nil, fmt.Errorf("invalid seed URL: %w", err)
    }

    // Channels for job distribution and result collection
    jobs := make(chan urlJob, c.maxConcurrent*2)
    results := make(chan Page, c.maxConcurrent*2)

    // Worker pool
    var wg sync.WaitGroup
    for i := 0; i < c.maxConcurrent; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.worker(ctx, jobs, results)
        }()
    }

    // Crawl state. The main loop is the only consumer of results; newly
    // discovered URLs are buffered in a local queue so sending on jobs
    // never deadlocks against busy workers.
    var pages []Page
    var totalErrors int
    queue := []urlJob{{url: normalizedSeed, depth: 0}}
    c.markVisited(normalizedSeed)
    pending := 0

    // finish shuts the pool down, drains in-flight results, and builds the
    // final CrawlResult. Used for both normal completion and cancellation.
    finish := func(err error) (*CrawlResult, error) {
        close(jobs)
        go func() {
            wg.Wait()
            close(results)
        }()
        for page := range results {
            pages = append(pages, page)
            if page.Error != nil {
                totalErrors++
            }
        }
        return &CrawlResult{
            Pages:       pages,
            TotalPages:  len(pages),
            TotalErrors: totalErrors,
            Duration:    time.Since(start),
        }, err
    }

    // BFS traversal
    for pending > 0 || len(queue) > 0 {
        // Enable the send case only when a job is queued.
        var sendCh chan<- urlJob
        var next urlJob
        if len(queue) > 0 {
            sendCh = jobs
            next = queue[0]
        }

        select {
        case <-ctx.Done():
            return finish(ctx.Err())

        case sendCh <- next:
            queue = queue[1:]
            pending++

        case page := <-results:
            pending--
            pages = append(pages, page)
            if page.Error != nil {
                totalErrors++
            }

            // Queue new URLs if within depth limit
            if page.Error == nil && page.Depth < c.maxDepth {
                for _, link := range page.Links {
                    normalized, err := normalizeURL(link)
                    if err != nil {
                        continue
                    }
                    if !c.isVisited(normalized) {
                        c.markVisited(normalized)
                        queue = append(queue, urlJob{url: normalized, depth: page.Depth + 1})
                    }
                }
            }
        }
    }

    return finish(nil)
}

// worker processes crawl jobs
func (c *Crawler) worker(ctx context.Context, jobs <-chan urlJob, results chan<- Page) {
    for job := range jobs {
        select {
        case <-ctx.Done():
            return
        case <-c.rateLimiter.C:
            page, err := c.crawlPage(ctx, job.url, job.depth)
            if err != nil {
                results <- Page{URL: job.url, Depth: job.depth, Error: err}
                continue
            }
            results <- *page
        }
    }
}

// crawlPage fetches and processes a single page
func (c *Crawler) crawlPage(ctx context.Context, targetURL string, depth int) (*Page, error) {
    start := time.Now()

    // Create request
    req, err := http.NewRequestWithContext(ctx, "GET", targetURL, nil)
    if err != nil {
        return nil, fmt.Errorf("create request: %w", err)
    }

    req.Header.Set("User-Agent", c.userAgent)

    // Fetch page
    resp, err := c.client.Do(req)
    if err != nil {
        return nil, fmt.Errorf("fetch: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("status %d", resp.StatusCode)
    }

    // Extract title and links
    title, links, err := c.parseHTML(resp.Body, targetURL)
    if err != nil {
        return nil, fmt.Errorf("parse HTML: %w", err)
    }

    return &Page{
        URL:       targetURL,
        Title:     title,
        Links:     links,
        Depth:     depth,
        FetchTime: time.Since(start),
    }, nil
}

// parseHTML extracts title and links from HTML
func (c *Crawler) parseHTML(body io.Reader, baseURL string) (string, []string, error) {
    doc, err := html.Parse(body)
    if err != nil {
        return "", nil, err
    }

    var title string
    var links []string

    var traverse func(*html.Node)
    traverse = func(n *html.Node) {
        if n.Type == html.ElementNode {
            if n.Data == "title" && n.FirstChild != nil {
                title = n.FirstChild.Data
            }
            if n.Data == "a" {
                for _, attr := range n.Attr {
                    if attr.Key == "href" {
                        if absURL := makeAbsolute(baseURL, attr.Val); absURL != "" {
                            links = append(links, absURL)
                        }
                    }
                }
            }
        }
        for child := n.FirstChild; child != nil; child = child.NextSibling {
            traverse(child)
        }
    }

    traverse(doc)
    return title, links, nil
}

// makeAbsolute converts relative URLs to absolute
func makeAbsolute(base, href string) string {
    baseURL, err := url.Parse(base)
    if err != nil {
        return ""
    }

    hrefURL, err := url.Parse(href)
    if err != nil {
        return ""
    }

    // Skip non-HTTP(S) URLs
    absURL := baseURL.ResolveReference(hrefURL)
    if absURL.Scheme != "http" && absURL.Scheme != "https" {
        return ""
    }

    return absURL.String()
}

// isVisited checks if a URL has been visited
func (c *Crawler) isVisited(url string) bool {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.visited[url]
}

// markVisited marks a URL as visited
func (c *Crawler) markVisited(url string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.visited[url] = true
}

// normalizeURL normalizes URLs for consistent comparison
func normalizeURL(rawURL string) (string, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return "", err
    }

    // Remove fragment
    u.Fragment = ""

    // Remove trailing slash
    u.Path = strings.TrimSuffix(u.Path, "/")

    // Lowercase scheme and host
    u.Scheme = strings.ToLower(u.Scheme)
    u.Host = strings.ToLower(u.Host)

    return u.String(), nil
}

Explanation

Crawler Structure:

  • maxDepth: Maximum crawl depth from seed URL
  • maxConcurrent: Number of concurrent workers
  • timeout: HTTP request timeout
  • userAgent: User-Agent header for requests
  • visited: Thread-safe map tracking visited URLs
  • client: HTTP client with configured timeout
  • rateLimiter: Ticker for rate limiting requests

NewCrawler:

  • Initializes crawler with configuration
  • Sets up HTTP client with redirect handling
  • Creates rate limiter ticker
  • Initializes visited map

Crawl:

  • Implements breadth-first search traversal
  • Uses worker pool pattern for concurrent crawling
  • Tracks pending jobs to know when crawl is complete
  • Handles context cancellation gracefully
  • Collects results from all workers
  • Returns comprehensive statistics

worker:

  • Processes jobs from job queue
  • Applies rate limiting before each request
  • Sends results to results channel
  • Respects context cancellation

crawlPage:

  • Fetches a single page with context
  • Sets User-Agent header
  • Checks HTTP status code
  • Extracts title and links
  • Measures fetch time

parseHTML:

  • Parses HTML using golang.org/x/net/html
  • Extracts page title from <title> tag
  • Finds all <a> tags and extracts href attributes
  • Converts relative URLs to absolute
  • Filters out non-HTTP(S) URLs

Thread-Safety:

  • isVisited and markVisited use RWMutex for safe concurrent access
  • Visited map protected from race conditions
  • Multiple goroutines can safely check/mark URLs

Graceful Shutdown:

  • Context cancellation stops all workers
  • Drains result channel before returning
  • Returns partial results on cancellation
  • No goroutine leaks
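
In practice, shutdown is often tied to OS signals as well as a deadline. A minimal sketch, assuming the crawler package above (signal.NotifyContext is in the standard library since Go 1.16):

package main

import (
    "context"
    "fmt"
    "os/signal"
    "syscall"
    "time"

    "crawler"
)

func main() {
    // Cancel the crawl on Ctrl+C (SIGINT) or SIGTERM...
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    // ...and still bound the total crawl time.
    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
    defer cancel()

    c := crawler.NewCrawler(crawler.Config{
        MaxDepth:      2,
        MaxConcurrent: 4,
        Timeout:       10 * time.Second,
        UserAgent:     "MyCrawler/1.0",
        RateLimit:     500 * time.Millisecond,
    })

    // Partial results are still returned when the context is cancelled.
    result, err := c.Crawl(ctx, "https://example.com")
    if err != nil {
        fmt.Println("crawl stopped early:", err)
    }
    fmt.Printf("crawled %d pages\n", result.TotalPages)
}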

Crawler Patterns

Pattern 1: Domain-Focused Crawling

func shouldCrawl(targetURL, seedURL string) bool {
    target, _ := url.Parse(targetURL)
    seed, _ := url.Parse(seedURL)
    return target.Host == seed.Host
}
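
This filter slots into the link-queueing step of Crawl; a sketch, where normalized and normalizedSeed are the variables from the solution:

// Inside Crawl's link-discovery loop: only follow links on the seed's host.
if !shouldCrawl(normalized, normalizedSeed) {
    continue
}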

Pattern 2: Robots.txt Compliance

type RobotsChecker struct {
    cache map[string]*robotstxt.RobotsData
    mu    sync.RWMutex
}

func (rc *RobotsChecker) CanCrawl(targetURL, userAgent string) bool {
    // Check robots.txt rules
    // ...
}
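
One way to flesh out CanCrawl, assuming the github.com/temoto/robotstxt package shown above (FromResponse parses a fetched robots.txt; TestAgent checks a path) and treating an unreachable robots.txt as permissive:

func NewRobotsChecker() *RobotsChecker {
    return &RobotsChecker{cache: make(map[string]*robotstxt.RobotsData)}
}

func (rc *RobotsChecker) CanCrawl(targetURL, userAgent string) bool {
    u, err := url.Parse(targetURL)
    if err != nil {
        return false
    }
    key := u.Scheme + "://" + u.Host

    rc.mu.RLock()
    robots, ok := rc.cache[key]
    rc.mu.RUnlock()

    if !ok {
        resp, err := http.Get(key + "/robots.txt")
        if err != nil {
            return true // robots.txt unreachable; assume allowed
        }
        defer resp.Body.Close()

        robots, err = robotstxt.FromResponse(resp)
        if err != nil {
            return true
        }

        rc.mu.Lock()
        rc.cache[key] = robots
        rc.mu.Unlock()
    }

    path := u.Path
    if path == "" {
        path = "/"
    }
    return robots.TestAgent(path, userAgent)
}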

Pattern 3: Priority Queue

type PriorityURL struct {
    URL      string
    Depth    int
    Priority int
}

// Use heap for priority-based crawling
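
The standard library's container/heap can back this pattern. A sketch, assuming PriorityURL as defined above and treating larger Priority values as more urgent:

import "container/heap"

// urlHeap orders PriorityURLs so the highest Priority is popped first.
type urlHeap []PriorityURL

func (h urlHeap) Len() int            { return len(h) }
func (h urlHeap) Less(i, j int) bool  { return h[i].Priority > h[j].Priority }
func (h urlHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *urlHeap) Push(x interface{}) { *h = append(*h, x.(PriorityURL)) }
func (h *urlHeap) Pop() interface{} {
    old := *h
    n := len(old)
    item := old[n-1]
    *h = old[:n-1]
    return item
}

// Example: a priority frontier in place of Crawl's FIFO queue.
func exampleFrontier() {
    frontier := &urlHeap{}
    heap.Push(frontier, PriorityURL{URL: "https://example.com/", Priority: 10})
    heap.Push(frontier, PriorityURL{URL: "https://example.com/about", Depth: 1, Priority: 3})

    next := heap.Pop(frontier).(PriorityURL) // the Priority 10 entry comes out first
    _ = next
}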

Pattern 4: Content Processing

type ContentProcessor interface {
    Process(page *Page) error
}

type TextExtractor struct{}

func (t *TextExtractor) Process(page *Page) error {
    // Extract and process text content
    return nil
}
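
Processors can then be run over the pages returned by Crawl. A small sketch; runProcessors is an illustrative helper, not part of the exercise API:

// runProcessors applies each ContentProcessor to every crawled page.
func runProcessors(result *CrawlResult, procs ...ContentProcessor) error {
    for i := range result.Pages {
        for _, p := range procs {
            if err := p.Process(&result.Pages[i]); err != nil {
                return err
            }
        }
    }
    return nil
}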

Best Practices

1. Respect robots.txt:

// Always check robots.txt before crawling
import "github.com/temoto/robotstxt"

func checkRobots(targetURL, userAgent string) bool {
    resp, err := http.Get(getRobotsURL(targetURL)) // e.g. https://host/robots.txt
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    robots, err := robotstxt.FromResponse(resp)
    if err != nil {
        return false
    }

    u, _ := url.Parse(targetURL)
    return robots.TestAgent(u.Path, userAgent)
}

2. Implement backoff on errors:

type BackoffCrawler struct {
    *Crawler
    backoff time.Duration
}

func (b *BackoffCrawler) crawlWithBackoff(url string) error {
    retries := 3
    for i := 0; i < retries; i++ {
        err := b.crawl(url) // b.crawl: whatever single-URL fetch the crawler exposes
        if err == nil {
            return nil
        }
        time.Sleep(b.backoff * time.Duration(i+1))
    }
    return fmt.Errorf("max retries exceeded")
}

3. Handle different content types:

func (c *Crawler) crawlPage(targetURL string) error {
    resp, err := c.client.Get(targetURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    contentType := resp.Header.Get("Content-Type")
    if !strings.Contains(contentType, "text/html") {
        return fmt.Errorf("skipping non-HTML content: %s", contentType)
    }

    // Process HTML...
    return nil
}

4. Monitor crawler health:

type CrawlerStats struct {
    PagesPerSecond float64
    ErrorRate      float64
    ActiveWorkers  int
    QueuedJobs     int
}

func (c *Crawler) Stats() CrawlerStats {
    // Calculate and return statistics
    // (assumes the Crawler keeps a reference to its jobs channel)
    return CrawlerStats{
        QueuedJobs: len(c.jobs),
        // ...
    }
}
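
Rates such as PagesPerSecond and ErrorRate have to be derived from counters the crawler maintains itself. A sketch using sync/atomic, with a hypothetical statsCollector that is not part of the exercise's Crawler type:

// statsCollector tracks crawl progress with atomic counters.
type statsCollector struct {
    start      time.Time
    pagesDone  int64
    errorsSeen int64
}

func newStatsCollector() *statsCollector {
    return &statsCollector{start: time.Now()}
}

// record is called once per completed Page (successful or not).
func (s *statsCollector) record(p Page) {
    atomic.AddInt64(&s.pagesDone, 1)
    if p.Error != nil {
        atomic.AddInt64(&s.errorsSeen, 1)
    }
}

// snapshot derives the rate-based fields of CrawlerStats.
func (s *statsCollector) snapshot() CrawlerStats {
    pages := atomic.LoadInt64(&s.pagesDone)
    errs := atomic.LoadInt64(&s.errorsSeen)
    elapsed := time.Since(s.start).Seconds()

    stats := CrawlerStats{}
    if elapsed > 0 {
        stats.PagesPerSecond = float64(pages) / elapsed
    }
    if pages > 0 {
        stats.ErrorRate = float64(errs) / float64(pages)
    }
    return stats
}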

Test Cases

package crawler

import (
    "context"
    "net/http"
    "net/http/httptest"
    "strings"
    "sync"
    "testing"
    "time"
)

func TestCrawler_BasicCrawl(t *testing.T) {
    // Create test server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        html := `<html><head><title>Test Page</title></head>
                 <body><a href="/page2">Link</a></body></html>`
        w.Write([]byte(html))
    }))
    defer server.Close()

    config := Config{
        MaxDepth:      1,
        MaxConcurrent: 2,
        Timeout:       5 * time.Second,
        UserAgent:     "TestBot",
        RateLimit:     100 * time.Millisecond,
    }

    crawler := NewCrawler(config)
    ctx := context.Background()

    result, err := crawler.Crawl(ctx, server.URL)
    if err != nil {
        t.Fatalf("Crawl failed: %v", err)
    }

    if result.TotalPages == 0 {
        t.Error("Expected at least 1 page crawled")
    }
}

func TestCrawler_DepthLimit(t *testing.T) {
    var requestCount int
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestCount++
        html := `<html><head><title>Page</title></head>
                 <body><a href="/next">Next</a></body></html>`
        w.Write([]byte(html))
    }))
    defer server.Close()

    config := Config{
        MaxDepth:      2,
        MaxConcurrent: 1,
        Timeout:       5 * time.Second,
        UserAgent:     "TestBot",
        RateLimit:     50 * time.Millisecond,
    }

    crawler := NewCrawler(config)
    ctx := context.Background()

    _, err := crawler.Crawl(ctx, server.URL)
    if err != nil {
        t.Fatalf("Crawl failed: %v", err)
    }

    // Depths 0, 1, and 2 are crawled, and the only discovered link (/next)
    // repeats, so at most 3 distinct requests are expected.
    if requestCount > 3 {
        t.Errorf("Expected max 3 requests, got %d", requestCount)
    }
}

func TestCrawler_Deduplication(t *testing.T) {
    var mu sync.Mutex
    visitCount := make(map[string]int)
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Handlers may run concurrently; guard the map.
        mu.Lock()
        visitCount[r.URL.Path]++
        mu.Unlock()
        html := `<html><head><title>Page</title></head>
                 <body><a href="/">Home</a><a href="/page">Page</a></body></html>`
        w.Write([]byte(html))
    }))
    defer server.Close()

    config := Config{
        MaxDepth:      2,
        MaxConcurrent: 2,
        Timeout:       5 * time.Second,
        UserAgent:     "TestBot",
        RateLimit:     50 * time.Millisecond,
    }

    crawler := NewCrawler(config)
    ctx := context.Background()

    _, err := crawler.Crawl(ctx, server.URL)
    if err != nil {
        t.Fatalf("Crawl failed: %v", err)
    }

    // Each URL should be visited exactly once
    mu.Lock()
    defer mu.Unlock()
    for path, count := range visitCount {
        if count > 1 {
            t.Errorf("URL %s visited %d times, expected 1", path, count)
        }
    }
}

func TestCrawler_ContextCancellation(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(2 * time.Second) // Slow response
        html := `<html><head><title>Page</title></head><body></body></html>`
        w.Write([]byte(html))
    }))
    defer server.Close()

    config := Config{
        MaxDepth:      5,
        MaxConcurrent: 2,
        Timeout:       10 * time.Second,
        UserAgent:     "TestBot",
        RateLimit:     50 * time.Millisecond,
    }

    crawler := NewCrawler(config)
    ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
    defer cancel()

    _, err := crawler.Crawl(ctx, server.URL)
    if err != context.DeadlineExceeded {
        t.Errorf("Expected context deadline exceeded, got: %v", err)
    }
}

func TestCrawler_ErrorHandling(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if strings.Contains(r.URL.Path, "error") {
            w.WriteHeader(http.StatusInternalServerError)
            return
        }
        html := `<html><head><title>Page</title></head>
                 <body><a href="/error">Error Page</a></body></html>`
        w.Write([]byte(html))
    }))
    defer server.Close()

    config := Config{
        MaxDepth:      1,
        MaxConcurrent: 2,
        Timeout:       5 * time.Second,
        UserAgent:     "TestBot",
        RateLimit:     50 * time.Millisecond,
    }

    crawler := NewCrawler(config)
    ctx := context.Background()

    result, err := crawler.Crawl(ctx, server.URL)
    if err != nil {
        t.Fatalf("Crawl failed: %v", err)
    }

    if result.TotalErrors == 0 {
        t.Error("Expected some errors from 500 responses")
    }
}

func TestNormalizeURL(t *testing.T) {
    tests := []struct {
        input    string
        expected string
    }{
        {"https://example.com/path", "https://example.com/path"},
        {"https://example.com/path/", "https://example.com/path"},
        {"https://Example.com/path", "https://example.com/path"},
        {"https://example.com/path#fragment", "https://example.com/path"},
    }

    for _, test := range tests {
        result, err := normalizeURL(test.input)
        if err != nil {
            t.Errorf("normalizeURL(%s) error: %v", test.input, err)
        }
        if result != test.expected {
            t.Errorf("normalizeURL(%s) = %s, want %s", test.input, result, test.expected)
        }
    }
}

Bonus Challenges

  1. Robots.txt Compliance: Respect robots.txt directives
  2. Sitemap Support: Parse and use XML sitemaps
  3. Content Filtering: Skip certain file types or patterns
  4. Link Prioritization: Crawl important pages first
  5. Distributed Crawling: Coordinate multiple crawler instances

Key Takeaways

  • Concurrent crawling improves performance for I/O-bound operations
  • URL normalization prevents duplicate crawls
  • Rate limiting respects server resources and politeness
  • Context cancellation enables graceful shutdown
  • BFS traversal ensures systematic coverage
  • Thread-safe data structures prevent race conditions

Web crawling requires careful balance between speed and politeness. Production crawlers must respect robots.txt, implement proper rate limiting, and handle errors gracefully to be good internet citizens.