Compression and Archives

Compression is like packing a suitcase - you want to fit as much as possible while keeping everything organized and accessible. Just as you'd fold clothes carefully and use vacuum bags to save space, compression algorithms find patterns in data and represent them more efficiently.

Think about it: when you're sending photos to a friend, you wouldn't mail each photo in a separate envelope. You'd pack them together efficiently. Compression does the same thing with data - it removes redundancy and represents repeated patterns more compactly.

Go's standard library provides excellent support for multiple compression algorithms and archive formats, making it easy to implement these "data packing" strategies in your applications.

Introduction to Compression

The Mathematics of Space: Compression is fundamentally about information theory. Claude Shannon proved that data has an intrinsic entropy—a theoretical minimum size based on its information content. Compression algorithms try to approach this limit by finding and exploiting patterns.

Two Types of Compression: Lossless compression (like gzip) preserves every bit of original data—crucial for text, code, and data files. Lossy compression (like JPEG) discards less important information to achieve higher ratios—acceptable for images, audio, and video where perfect reproduction isn't necessary.

Real-World Impact: Google reports that gzip compression reduces bandwidth by 60-80% for typical web pages. For mobile users on metered connections, compression can mean the difference between usable and unusable applications. For data centers, compression reduces storage costs and increases effective network capacity.

Consider a book where the phrase "the quick brown fox" appears on every page. Instead of writing it out every time, you could just write "1" and have a legend that says "1 = the quick brown fox". That's essentially how compression works - it finds repeated patterns and creates shortcuts to represent them.

💡 Key Takeaway: Compression works by finding and eliminating redundancy in data. The more repetitive your data, the better it compresses.

Go's compression packages provide:

  • compress/gzip: GNU zip compression
  • compress/zlib: zlib format
  • compress/flate: DEFLATE algorithm
  • archive/tar: Tar archive format
  • archive/zip: ZIP archive format with compression

⚠️ Important: Not all data compresses equally well. Random data may actually get slightly larger when compressed because there are no patterns to exploit.
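
To see this for yourself, the short sketch below compresses a buffer of cryptographically random bytes; the gzip output typically ends up slightly larger than the input because the format adds a header and trailer and the compressor finds no patterns to exploit.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/rand"
    "fmt"
)

func main() {
    // 4KB of random bytes: effectively incompressible
    data := make([]byte, 4096)
    if _, err := rand.Read(data); err != nil {
        fmt.Println("rand error:", err)
        return
    }

    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()

    // The "compressed" output is usually a little larger than the input
    fmt.Printf("Original:   %d bytes\n", len(data))
    fmt.Printf("Compressed: %d bytes\n", buf.Len())
}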

Compression Basics

Let's start with a simple example that demonstrates the magic of compression. We'll create some repetitive text and see how much space we can save.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
)

// run
func main() {
    // Original data - notice the repetition
    original := []byte("Hello, World! This is a test string for compression. " +
        "Compression works best with repetitive data. " +
        "Hello, World! Hello, World!")

    // Compress
    var compressed bytes.Buffer
    writer := gzip.NewWriter(&compressed)
    writer.Write(original)
    writer.Close()

    // Compare sizes
    fmt.Printf("Original size: %d bytes\n", len(original))
    fmt.Printf("Compressed size: %d bytes\n", compressed.Len())
    fmt.Printf("Compression ratio: %.2f%%\n",
        float64(compressed.Len())/float64(len(original))*100)

    // Decompress
    reader, err := gzip.NewReader(&compressed)
    if err != nil {
        fmt.Println("Decompress error:", err)
        return
    }
    defer reader.Close()

    var decompressed bytes.Buffer
    decompressed.ReadFrom(reader)

    fmt.Printf("\nDecompressed: %s\n", decompressed.String()[:50]+"...")
}

Real-world Example: When you download a large file from the internet, it's often compressed. A 100MB text file might compress to just 10MB, meaning it downloads 10 times faster! This is why websites use gzip compression - it saves bandwidth and improves page load times.

Compression Levels

Compression levels are like choosing how carefully to pack your suitcase. You can throw everything in quickly, or you can meticulously fold and arrange everything to maximize space. The trade-off is time vs. space.

💡 Key Takeaway: Compression levels represent a trade-off between CPU usage and compression ratio. Higher levels use more CPU time but achieve better compression.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "strings"
    "time"
)

// run
func main() {
    // Test data - repetitive for better compression
    data := []byte(strings.Repeat("The quick brown fox jumps over the lazy dog. ", 100))

    levels := []int{
        gzip.NoCompression,
        gzip.BestSpeed,
        gzip.DefaultCompression,
        gzip.BestCompression,
    }

    levelNames := []string{
        "No Compression",
        "Best Speed",
        "Default",
        "Best Compression",
    }

    fmt.Printf("Original size: %d bytes\n\n", len(data))

    for i, level := range levels {
        start := time.Now()

        var compressed bytes.Buffer
        writer, _ := gzip.NewWriterLevel(&compressed, level)
        writer.Write(data)
        writer.Close()

        elapsed := time.Since(start)

        fmt.Printf("%s (level %d):\n", levelNames[i], level)
        fmt.Printf("  Size: %d bytes (%.2f%% of original)\n",
            compressed.Len(),
            float64(compressed.Len())/float64(len(data))*100)
        fmt.Printf("  Time: %s\n\n", elapsed)
    }
}

When to use different compression levels:

  • Best Speed: Real-time applications, chat systems, live streams
  • Default: General web content, API responses
  • Best Compression: Backup files, archives, content distribution networks

Common Pitfalls: Don't use maximum compression for real-time applications - the CPU overhead might cause delays that outweigh the bandwidth savings.

Compression Theory and Algorithms

LZ77 Algorithm Details

Sliding Window: LZ77 maintains a sliding window of recently-seen data (typically 32KB for gzip). When encoding, it searches this window for matches with the current data. When it finds a match, it outputs a reference (distance back, length) instead of the literal data.

Lazy Matching: Better compression uses lazy matching—before outputting a match, look ahead one byte to see if a longer match exists. This adds complexity but typically improves compression by 1-2%.

Hash Chains: Efficient LZ77 implementations use hash tables and chains to find matches quickly. Naive search is O(n²); hash chains reduce this to near-linear time.
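
To make the back-reference idea concrete, here is a deliberately naive sketch (not how compress/flate is implemented) that scans a small window for the longest earlier match and emits either a literal byte or a (distance, length) pair.

package main

import "fmt"

// token is either a literal byte or a back-reference into the window.
type token struct {
    literal  byte
    distance int // how far back the match starts (0 means literal)
    length   int // how many bytes the match covers
}

// naiveLZ77 finds, for each position, the longest match in the preceding
// window and emits a back-reference; otherwise it emits a literal.
func naiveLZ77(data []byte, window int) []token {
    var out []token
    for i := 0; i < len(data); {
        bestLen, bestDist := 0, 0
        start := i - window
        if start < 0 {
            start = 0
        }
        for j := start; j < i; j++ {
            l := 0
            for i+l < len(data) && data[j+l] == data[i+l] {
                l++
            }
            if l > bestLen {
                bestLen, bestDist = l, i-j
            }
        }
        if bestLen >= 3 { // DEFLATE only encodes matches of length 3 or more
            out = append(out, token{distance: bestDist, length: bestLen})
            i += bestLen
        } else {
            out = append(out, token{literal: data[i]})
            i++
        }
    }
    return out
}

func main() {
    data := []byte("abcabcabcx")
    for _, t := range naiveLZ77(data, 32) {
        if t.distance == 0 {
            fmt.Printf("literal %q\n", t.literal)
        } else {
            fmt.Printf("match (distance=%d, length=%d)\n", t.distance, t.length)
        }
    }
}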

Huffman Coding Details

Frequency Analysis: Huffman coding builds a tree based on symbol frequencies. More frequent symbols get shorter codes. For English text, 'e' might be 3 bits while 'z' is 11 bits.

Static vs Dynamic: Static Huffman uses predetermined codes. Dynamic Huffman analyzes the actual data to build optimal codes. Dynamic achieves better compression but requires transmitting the code table.

Canonical Huffman: DEFLATE uses canonical Huffman codes—a clever encoding that requires transmitting only code lengths, not the full tree. This saves space in the compressed stream.
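
The sketch below is a teaching toy, not the canonical-code machinery DEFLATE actually uses: it counts symbol frequencies, repeatedly merges the two least-frequent nodes into a tree, and prints the resulting code length per symbol.

package main

import (
    "fmt"
    "sort"
)

type node struct {
    freq        int
    symbol      byte // valid only for leaves
    left, right *node
}

// huffmanLengths returns a Huffman code length for each symbol in data.
func huffmanLengths(data []byte) map[byte]int {
    freq := map[byte]int{}
    for _, b := range data {
        freq[b]++
    }

    // Build the initial forest of leaves
    var nodes []*node
    for sym, f := range freq {
        nodes = append(nodes, &node{freq: f, symbol: sym})
    }
    lengths := map[byte]int{}
    if len(nodes) == 0 {
        return lengths
    }

    // Repeatedly merge the two least-frequent nodes (O(n^2), fine for a demo)
    for len(nodes) > 1 {
        sort.Slice(nodes, func(i, j int) bool { return nodes[i].freq < nodes[j].freq })
        a, b := nodes[0], nodes[1]
        merged := &node{freq: a.freq + b.freq, left: a, right: b}
        nodes = append([]*node{merged}, nodes[2:]...)
    }

    // The depth of each leaf is its code length
    var walk func(n *node, depth int)
    walk = func(n *node, depth int) {
        if n.left == nil && n.right == nil {
            if depth == 0 {
                depth = 1 // single-symbol input still needs one bit
            }
            lengths[n.symbol] = depth
            return
        }
        walk(n.left, depth+1)
        walk(n.right, depth+1)
    }
    walk(nodes[0], 0)
    return lengths
}

func main() {
    text := []byte("abracadabra")
    for sym, bits := range huffmanLengths(text) {
        fmt.Printf("%q -> %d bits\n", sym, bits)
    }
}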

Alternative Algorithms

LZ78 and LZW: These build an explicit dictionary instead of using a sliding window. LZW (used in GIF and old Unix compress) was popular but is less common now due to patent history.

LZMA: Used by 7-Zip and xz, LZMA achieves excellent compression using a much larger dictionary (up to 4GB), range encoding instead of Huffman, and sophisticated match finding. Compression is slow but decompression is reasonable.

Brotli: Developed by Google for HTTP compression, Brotli combines LZ77 with a pre-defined dictionary of common web content. Achieves 15-25% better compression than gzip for web content.

Zstandard: Modern algorithm by Facebook, designed to match gzip compression with LZ4 speed. Features include: real-time compression, training mode for custom dictionaries, and multiple compression levels.

Understanding how compression works helps you choose the right algorithm and settings for your needs. Modern compression combines multiple techniques to achieve impressive ratios.

Dictionary Coding (LZ77): This technique finds repeated sequences and replaces them with references to earlier occurrences. For example, in the text "the quick brown fox jumps over the lazy dog, the fox was quick", the second "the" and "fox" can be replaced with pointers to their first occurrences.

Entropy Coding (Huffman): After dictionary coding, Huffman encoding assigns shorter bit codes to more frequent symbols. In English text, 'e' appears more often than 'z', so 'e' gets a shorter code. This statistical encoding provides additional compression.

DEFLATE = LZ77 + Huffman: The DEFLATE algorithm (used by gzip, zlib, and zip) combines both techniques. It first finds repeated strings, then encodes the result with Huffman coding. This two-stage approach is why DEFLATE achieves such good compression ratios on text data.

Why Some Data Doesn't Compress: Random data has no patterns to exploit—each byte is equally likely, so entropy coding helps minimally and dictionary coding finds few matches. Already-compressed data (JPEG, MP4, PNG) is designed to be near-maximum entropy, so further compression is impossible.

Compression Ratio Calculations

Understanding compression metrics helps you evaluate performance and make informed decisions.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/rand"
    "fmt"
    "strings"
)

// run
func main() {
    // Test different data types
    testData := map[string][]byte{
        "Random":     generateRandomData(1000),
        "Repetitive": []byte(strings.Repeat("AAAA", 250)),
        "Text":       []byte(strings.Repeat("The quick brown fox ", 50)),
        "Numeric":    []byte(strings.Repeat("1234567890", 100)),
    }

    fmt.Println("Compression Analysis:")
    fmt.Println(strings.Repeat("=", 60))

    for name, data := range testData {
        compressed := compress(data)

        // Calculate metrics
        originalSize := len(data)
        compressedSize := len(compressed)
        ratio := float64(compressedSize) / float64(originalSize)
        savings := float64(originalSize-compressedSize) / float64(originalSize) * 100

        fmt.Printf("\n%s:\n", name)
        fmt.Printf("  Original:    %6d bytes\n", originalSize)
        fmt.Printf("  Compressed:  %6d bytes\n", compressedSize)
        fmt.Printf("  Ratio:       %.2f:1\n", 1/ratio)
        fmt.Printf("  Savings:     %.1f%%\n", savings)
    }
}

// generateRandomData returns truly random bytes, which give the
// compressor no patterns to exploit
func generateRandomData(size int) []byte {
    data := make([]byte, size)
    rand.Read(data)
    return data
}

func compress(data []byte) []byte {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()
    return buf.Bytes()
}

Compression Metrics Explained:

  • Compression Ratio: Original size / compressed size (higher is better)
  • Compression Factor: Compressed size / original size (lower is better)
  • Space Savings: (1 - compressed/original) * 100%
  • Throughput: MB/s of data processed

Gzip Compression

The Ubiquitous Format: Gzip is everywhere—web servers, file archives, database backups, log files, and more. It's the de facto standard for HTTP compression, supported by every browser and web server. Understanding gzip is essential for any network-facing application.

How Gzip Works: Gzip combines two algorithms: LZ77 (finding repeated strings and replacing them with references) and Huffman coding (using shorter codes for common symbols). This combination is remarkably effective for text and structured data, often achieving 70-90% compression on HTML, JSON, and logs.

Compression Levels Explained: Level 1 (best speed) uses fast heuristics that miss some compression opportunities. Level 9 (best compression) exhaustively searches for patterns. Most applications use the default (level 6), which provides 95% of the compression benefit at a fraction of the CPU cost.

Gzip is like the universal language of compression on the internet. It's the most commonly used compression format, especially for HTTP and file compression. When your browser requests a web page, it often tells the server "I speak gzip" and the server responds with compressed content that your browser automatically decompresses.

Think of gzip as the postal service's standard method for reducing shipping costs - it's reliable, widely supported, and gets the job done efficiently.

Basic Gzip Operations

Gzip Internal Structure: A gzip file starts with a 10-byte header containing magic bytes (1f 8b), compression method, flags, timestamp, and OS type. After the header comes compressed data, and finally an 8-byte trailer with CRC32 checksum and uncompressed size (modulo 2^32).

Compression Window: Gzip uses a 32KB sliding window for finding repeated strings. This means patterns more than 32KB apart won't be detected. For very large files with distant repetition, specialized algorithms or preprocessing might achieve better compression.

Header Metadata: The gzip header can store filename, comment, and modification time. This metadata is optional but useful for archives. Go's gzip.Writer lets you set these fields—important when creating archives that other tools will process.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
)

// run
func main() {
    original := "This is some data to compress with gzip!"

    // Compress
    compressed, err := gzipCompress([]byte(original))
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress
    decompressed, err := gzipDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func gzipCompress(data []byte) ([]byte, error) {
    var buf bytes.Buffer

    writer := gzip.NewWriter(&buf)
    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func gzipDecompress(data []byte) ([]byte, error) {
    reader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
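
The header and trailer described above can also be inspected directly on a compressed buffer. A minimal sketch, with field offsets taken from RFC 1952:

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "fmt"
)

func main() {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write([]byte("hello gzip header"))
    w.Close()

    raw := buf.Bytes()

    // First 10 bytes: magic (1f 8b), method (08 = DEFLATE), flags, mtime, XFL, OS
    fmt.Printf("Magic bytes: %x %x\n", raw[0], raw[1])
    fmt.Printf("Method:      %d (8 = DEFLATE)\n", raw[2])

    // Last 8 bytes: CRC32 of the uncompressed data, then length modulo 2^32
    trailer := raw[len(raw)-8:]
    crc := binary.LittleEndian.Uint32(trailer[:4])
    size := binary.LittleEndian.Uint32(trailer[4:])
    fmt.Printf("CRC32:       %08x\n", crc)
    fmt.Printf("Stored size: %d bytes\n", size)
}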

Gzip with Metadata

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    data := []byte("Data with metadata in gzip header")

    // Compress with metadata
    compressed, err := compressWithMetadata(data, "example.txt", "Example file")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compressed %d bytes\n", len(compressed))

    // Decompress and read metadata
    decompressed, metadata, err := decompressWithMetadata(compressed)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("\nMetadata:\n")
    fmt.Printf("  Name: %s\n", metadata.Name)
    fmt.Printf("  Comment: %s\n", metadata.Comment)
    fmt.Printf("  Modified: %s\n", metadata.ModTime)
    fmt.Printf("\nDecompressed: %s\n", string(decompressed))
}

type GzipMetadata struct {
    Name    string
    Comment string
    ModTime time.Time
}

func compressWithMetadata(data []byte, name, comment string) ([]byte, error) {
    var buf bytes.Buffer

    writer := gzip.NewWriter(&buf)
    writer.Name = name
    writer.Comment = comment
    writer.ModTime = time.Now()

    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func decompressWithMetadata(data []byte) ([]byte, GzipMetadata, error) {
    reader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, GzipMetadata{}, err
    }
    defer reader.Close()

    metadata := GzipMetadata{
        Name:    reader.Name,
        Comment: reader.Comment,
        ModTime: reader.ModTime,
    }

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, GzipMetadata{}, err
    }

    return buf.Bytes(), metadata, nil
}

Streaming Gzip Compression

package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "strings"
)

// run
func main() {
    // Source of data
    source := strings.NewReader("This is line 1\n" +
        "This is line 2\n" +
        "This is line 3\n" +
        "This is line 4\n" +
        "This is line 5\n")

    // Destination
    var compressed strings.Builder

    // Stream compression
    if err := streamCompress(source, &compressed); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compressed %d bytes\n", compressed.Len())

    // Stream decompression
    decompressed := &strings.Builder{}
    if err := streamDecompress(strings.NewReader(compressed.String()), decompressed); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Decompressed:\n%s", decompressed.String())
}

func streamCompress(source io.Reader, dest io.Writer) error {
    writer := gzip.NewWriter(dest)
    defer writer.Close()

    _, err := io.Copy(writer, source)
    return err
}

func streamDecompress(source io.Reader, dest io.Writer) error {
    reader, err := gzip.NewReader(source)
    if err != nil {
        return err
    }
    defer reader.Close()

    _, err = io.Copy(dest, reader)
    return err
}

Zlib and Flate

The DEFLATE Algorithm: At the heart of both gzip and zlib is DEFLATE, a compression algorithm so successful it's used in PNG images, ZIP files, HTTP compression, and countless other formats. DEFLATE combines LZ77 dictionary coding with Huffman entropy encoding.

Zlib vs Gzip vs Flate: These formats are siblings—they all use DEFLATE compression but differ in their wrapper format. Gzip adds a header with file metadata and a CRC32 checksum. Zlib adds a simpler header and Adler-32 checksum. Raw DEFLATE has no wrapper at all, saving a few bytes at the cost of no integrity checking.

Choosing the Right Format: Use gzip for HTTP and file compression (it's the standard). Use zlib for network protocols and embedded data (smaller overhead than gzip). Use raw DEFLATE when you're implementing your own wrapper or need maximum efficiency.
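
A quick way to see the wrapper overhead is to compress the same payload with all three packages and compare the output sizes. A rough sketch:

package main

import (
    "bytes"
    "compress/flate"
    "compress/gzip"
    "compress/zlib"
    "fmt"
)

func main() {
    payload := []byte("the same payload compressed three ways: gzip, zlib, and raw DEFLATE")

    var gz bytes.Buffer
    gw := gzip.NewWriter(&gz) // 10-byte header + 8-byte trailer
    gw.Write(payload)
    gw.Close()

    var zl bytes.Buffer
    zw := zlib.NewWriter(&zl) // 2-byte header + 4-byte Adler-32 trailer
    zw.Write(payload)
    zw.Close()

    var fl bytes.Buffer
    fw, _ := flate.NewWriter(&fl, flate.DefaultCompression) // no wrapper at all
    fw.Write(payload)
    fw.Close()

    fmt.Printf("gzip:  %d bytes\n", gz.Len())
    fmt.Printf("zlib:  %d bytes\n", zl.Len())
    fmt.Printf("flate: %d bytes\n", fl.Len())
}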

Zlib is another popular compression format, often used in network protocols and PNG images.

Zlib Compression

Zlib Header Format: Zlib uses a 2-byte header: first byte encodes compression method and window size, second byte contains flags and a header checksum. This compact header makes zlib suitable for protocols where overhead matters.

Adler-32 vs CRC-32: Zlib uses Adler-32 checksum (faster but weaker) while gzip uses CRC-32 (slower but stronger). For most applications, both provide adequate integrity checking. Adler-32's speed advantage is noticeable for small payloads.

Dictionary Compression: Zlib supports preset dictionaries—predefined data that seeds the compression window. For compressing many similar small messages (like JSON with consistent schema), a shared dictionary can improve compression by 20-40%.

package main

import (
    "bytes"
    "compress/zlib"
    "fmt"
    "io"
)

// run
func main() {
    original := []byte("Zlib compression example with some repetitive data. " +
        "Repetitive data compresses better. Repetitive data!")

    // Compress with zlib
    compressed, err := zlibCompress(original)
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))
    fmt.Printf("Ratio: %.2f%%\n",
        float64(len(compressed))/float64(len(original))*100)

    // Decompress
    decompressed, err := zlibDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("\nDecompressed: %s\n", string(decompressed)[:50]+"...")
}

func zlibCompress(data []byte) ([]byte, error) {
    var buf bytes.Buffer

    writer := zlib.NewWriter(&buf)
    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func zlibDecompress(data []byte) ([]byte, error) {
    reader, err := zlib.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
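
The preset dictionaries mentioned earlier are exposed through zlib.NewWriterLevelDict and zlib.NewReaderDict. Here is a minimal sketch of a shared dictionary for similar small messages; the JSON fragments are only illustrative:

package main

import (
    "bytes"
    "compress/zlib"
    "fmt"
    "io"
)

func main() {
    // Shared dictionary: fragments likely to appear in every message
    dict := []byte(`{"status":"ok","user":{"id":,"name":""}}`)

    msg := []byte(`{"status":"ok","user":{"id":42,"name":"gopher"}}`)

    // Compress with the preset dictionary
    var buf bytes.Buffer
    w, err := zlib.NewWriterLevelDict(&buf, zlib.BestCompression, dict)
    if err != nil {
        fmt.Println("writer error:", err)
        return
    }
    w.Write(msg)
    w.Close()

    fmt.Printf("Message:    %d bytes\n", len(msg))
    fmt.Printf("Compressed: %d bytes\n", buf.Len())

    // The reader must be seeded with the same dictionary
    r, err := zlib.NewReaderDict(&buf, dict)
    if err != nil {
        fmt.Println("reader error:", err)
        return
    }
    defer r.Close()

    var out bytes.Buffer
    io.Copy(&out, r)
    fmt.Printf("Round trip: %s\n", out.String())
}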

Flate

package main

import (
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// run
func main() {
    original := []byte("Raw DEFLATE compression without headers")

    // Compress with flate
    compressed, err := flateCompress(original, flate.DefaultCompression)
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress
    decompressed, err := flateDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func flateCompress(data []byte, level int) ([]byte, error) {
    var buf bytes.Buffer

    writer, err := flate.NewWriter(&buf, level)
    if err != nil {
        return nil, err
    }

    _, err = writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func flateDecompress(data []byte) ([]byte, error) {
    reader := flate.NewReader(bytes.NewReader(data))
    defer reader.Close()

    var buf bytes.Buffer
    _, err := io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

Custom Dictionary Compression

package main

import (
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// run
func main() {
    // Dictionary for better compression of similar data
    dictionary := []byte("the quick brown fox jumps over")

    data := []byte("the quick brown fox jumps over the lazy dog")

    // Compress with dictionary
    compressed, err := compressWithDict(data, dictionary)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(data))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress with the same dictionary
    decompressed, err := decompressWithDict(compressed, dictionary)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func compressWithDict(data, dict []byte) ([]byte, error) {
    var buf bytes.Buffer

    // NewWriterDict seeds the compression window with the dictionary,
    // so matches against it can be encoded as back-references
    writer, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
    if err != nil {
        return nil, err
    }

    if _, err := writer.Write(data); err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func decompressWithDict(data, dict []byte) ([]byte, error) {
    // The reader must be seeded with the same dictionary used for compression
    reader := flate.NewReaderDict(bytes.NewReader(data), dict)
    defer reader.Close()

    var buf bytes.Buffer
    if _, err := io.Copy(&buf, reader); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

Tar Archives

Unix Tape Archives: Tar (Tape ARchive) originated in the 1970s for backing up data to magnetic tape. Despite its age, it remains the standard for distributing software and creating backups on Unix-like systems. Its simplicity and efficiency have ensured its longevity.

Tar's Design: Tar concatenates files sequentially, preserving metadata (permissions, ownership, timestamps) in headers. This sequential design made sense for tape drives (which couldn't seek) and still works well for streaming backups and network transfers.

Tar + Gzip = TGZ: The classic combination of tar (bundling) with gzip (compression) is so common it has its own file extension (.tgz or .tar.gz). This two-step process allows tar to focus on metadata preservation while gzip handles compression.

Tar is widely used for bundling multiple files together.

Creating Tar Archives

Tar Format Variants: Original tar had limitations (99-character filenames, 8GB file size). GNU tar extensions lifted many restrictions. POSIX.1-2001 (PAX) format supports arbitrary filename lengths, large files, and extended attributes. Modern Go uses PAX format by default.

Tar Block Structure: Tar uses 512-byte blocks. Each file header occupies one block, followed by file data rounded up to block boundaries. This padding ensures tape-friendly alignment but means small files have overhead—a 1-byte file uses 1024 bytes (header + data block).

Tar Streaming Benefits: Tar's sequential format is perfect for streaming: you can add files to an archive without seeking, and you can extract files while still downloading. This makes tar excellent for network transfer and backups to tape.
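
Before the full example below, the 512-byte block padding is easy to observe: a one-byte file still costs a header block, a padded data block, and the end-of-archive blocks. A rough sketch (error handling omitted):

package main

import (
    "archive/tar"
    "bytes"
    "fmt"
)

func main() {
    var buf bytes.Buffer
    tw := tar.NewWriter(&buf)

    // A single 1-byte file
    tw.WriteHeader(&tar.Header{Name: "tiny.txt", Mode: 0644, Size: 1})
    tw.Write([]byte("x"))
    tw.Close()

    // Expect roughly 512 (header) + 512 (padded data) + 1024 (end-of-archive) bytes
    fmt.Printf("Archive size for a 1-byte file: %d bytes\n", buf.Len())
}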

package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Create a tar archive
    archive, err := createTarArchive()
    if err != nil {
        fmt.Println("Error creating archive:", err)
        return
    }

    fmt.Printf("Created tar archive: %d bytes\n\n", len(archive))

    // List contents
    if err := listTarArchive(archive); err != nil {
        fmt.Println("Error listing archive:", err)
        return
    }

    // Extract a file
    content, err := extractFromTar(archive, "file2.txt")
    if err != nil {
        fmt.Println("Error extracting:", err)
        return
    }

    fmt.Printf("\nExtracted file2.txt:\n%s\n", string(content))
}

func createTarArchive() ([]byte, error) {
    var buf bytes.Buffer
    writer := tar.NewWriter(&buf)

    // Add files
    files := []struct {
        Name    string
        Content []byte
    }{
        {"file1.txt", []byte("Content of file 1")},
        {"file2.txt", []byte("Content of file 2")},
        {"dir/file3.txt", []byte("Content of file 3 in a directory")},
    }

    for _, file := range files {
        header := &tar.Header{
            Name:    file.Name,
            Mode:    0644,
            Size:    int64(len(file.Content)),
            ModTime: time.Now(),
        }

        if err := writer.WriteHeader(header); err != nil {
            return nil, err
        }

        if _, err := writer.Write(file.Content); err != nil {
            return nil, err
        }
    }

    // Close before reading the buffer so the end-of-archive blocks are written
    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func listTarArchive(data []byte) error {
    reader := tar.NewReader(bytes.NewReader(data))

    fmt.Println("Archive contents:")
    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }

        fmt.Printf("  %s (%d bytes, mode %o)\n",
            header.Name, header.Size, header.Mode)
    }

    return nil
}

func extractFromTar(data []byte, filename string) ([]byte, error) {
    reader := tar.NewReader(bytes.NewReader(data))

    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }

        if header.Name == filename {
            var buf bytes.Buffer
            io.Copy(&buf, reader)
            return buf.Bytes(), nil
        }
    }

    return nil, fmt.Errorf("file not found: %s", filename)
}

Archive Format Comparison

Compression in Different Domains

Image Compression: JPEG is lossy—acceptable for photos but not for screenshots with text. PNG is lossless—perfect for screenshots, diagrams, and images with text. WebP is modern—supports both lossy and lossless, smaller than JPEG/PNG.

Video Compression: H.264/H.265 are sophisticated lossy codecs achieving amazing compression (1000:1 common). Work by exploiting temporal redundancy (frames are similar) and spatial redundancy (adjacent pixels are similar).

Audio Compression: MP3 is lossy—discards frequencies humans can't hear well. FLAC is lossless—preserves perfect quality while reducing size 40-60%. Opus is modern—optimized for speech and music, low latency.

Text Compression: Text compresses extremely well—80-90% reduction common. JSON with repeated keys compresses even better. Log files are highly compressible due to repeated patterns.

Binary Data: Executable files, compiled code, and random data compress poorly. Already-compressed files (ZIP, MP4, JPEG) won't compress further—attempting wastes CPU and might increase size.

Compression Benchmarking Methodology

Representative Data: Benchmark with your actual data. Compression ratios vary dramatically—JSON compresses differently than images. Use real production data samples.

Warm-Up: The first compressions are often slower because CPU caches are cold and buffers haven't been allocated yet. Warm up with several iterations before measuring.

Statistical Significance: Run multiple iterations and compute statistics (mean, stddev, percentiles). Single runs can be misleading due to system noise.

Memory Profiling: Measure peak memory usage, not just time. High-compression algorithms can use 100MB+ for buffers. This matters for memory-constrained environments.

Throughput vs Latency: Measure both compression speed (MB/s) and latency for single operations. Batch processing cares about throughput; interactive applications care about latency.
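
A minimal harness along these lines (one warm-up pass, several timed iterations, throughput in MB/s) might look like the sketch below; the generated log-like data and iteration count are stand-ins you would replace with real production samples, and in practice Go's testing.B benchmarks are the more idiomatic tool.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "strings"
    "time"
)

func compressOnce(data []byte) int {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()
    return buf.Len()
}

func main() {
    // Stand-in for a real production sample
    data := []byte(strings.Repeat("2024-01-01 12:00:00 INFO request handled in 12ms\n", 2000))

    // Warm-up pass so cold caches and first allocations don't skew results
    compressOnce(data)

    const iterations = 20
    var total time.Duration
    var compressedSize int

    for i := 0; i < iterations; i++ {
        start := time.Now()
        compressedSize = compressOnce(data)
        total += time.Since(start)
    }

    avg := total / iterations
    mbPerSec := float64(len(data)) / avg.Seconds() / 1024 / 1024

    fmt.Printf("Input:       %d bytes\n", len(data))
    fmt.Printf("Compressed:  %d bytes\n", compressedSize)
    fmt.Printf("Avg latency: %s\n", avg)
    fmt.Printf("Throughput:  %.1f MB/s\n", mbPerSec)
}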

Advanced Compression Techniques

Dictionary Training: For compressing many similar small messages (like JSON API responses), train a shared dictionary. Can improve compression 20-40% for small messages where per-message dictionaries don't help.

Preprocessing: Sometimes preprocessing improves compression. BWT (Burrows-Wheeler Transform) reorders data to group similar bytes, improving compression. Delta encoding for time-series data. Run-length encoding for simple repetition.
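
As an illustration of preprocessing, delta-encoding a slowly changing integer series before gzip usually shrinks the output noticeably, because the differences are far more repetitive than the raw values. A hedged sketch with made-up counter data:

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "fmt"
)

func gzipSize(data []byte) int {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()
    return buf.Len()
}

func main() {
    // A slowly increasing time series (e.g., timestamps or counters)
    values := make([]uint64, 10000)
    for i := range values {
        values[i] = 1_700_000_000 + uint64(i)*3
    }

    raw := make([]byte, 8*len(values))
    deltas := make([]byte, 8*len(values))

    prev := uint64(0)
    for i, v := range values {
        binary.LittleEndian.PutUint64(raw[i*8:], v)
        // Store the difference from the previous value instead of the value itself
        binary.LittleEndian.PutUint64(deltas[i*8:], v-prev)
        prev = v
    }

    fmt.Printf("Raw values compressed:    %d bytes\n", gzipSize(raw))
    fmt.Printf("Delta-encoded compressed: %d bytes\n", gzipSize(deltas))
}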

Hybrid Approaches: Combine techniques for best results. Databases often use: column-oriented storage (groups similar values), dictionary encoding (for low-cardinality columns), run-length encoding (for repeated values), then general compression (gzip/zstd).

Content-Aware Compression: Different compression for different content types. Images use image codecs, text uses text compression, already-compressed data stored uncompressed. ZIP's per-file compression enables this.

Compression Error Handling

Detecting Corruption: Compressed data is fragile—single-bit error can corrupt everything after. Use checksums: gzip includes CRC32, or add external checksums (SHA-256) for critical data.

Recovery Strategies: For critical data, implement recovery: error-correcting codes, redundancy (multiple copies), or splitting into independent chunks so one corrupted block doesn't destroy everything.

Graceful Degradation: If decompression fails, don't crash. Log error, skip corrupted item, continue processing. For optional compression (like caching), fall back to uncompressed.

Validation: After compression, decompress and verify matches original. This catches implementation bugs. For backups, always test restoration.
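
A simple safeguard is to decompress your own output and compare it against the original; here is a sketch of that round-trip check using a SHA-256 comparison.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/sha256"
    "fmt"
    "io"
)

// verifyRoundTrip compresses data, decompresses the result, and checks that
// the output matches the original via a SHA-256 comparison.
func verifyRoundTrip(data []byte) error {
    var compressed bytes.Buffer
    w := gzip.NewWriter(&compressed)
    if _, err := w.Write(data); err != nil {
        return err
    }
    if err := w.Close(); err != nil {
        return err
    }

    r, err := gzip.NewReader(&compressed)
    if err != nil {
        return err
    }
    defer r.Close()

    var restored bytes.Buffer
    if _, err := io.Copy(&restored, r); err != nil {
        return err
    }

    if sha256.Sum256(data) != sha256.Sum256(restored.Bytes()) {
        return fmt.Errorf("round trip mismatch: data corrupted during compression")
    }
    return nil
}

func main() {
    if err := verifyRoundTrip([]byte("backup payload that must survive compression")); err != nil {
        fmt.Println("verification failed:", err)
        return
    }
    fmt.Println("round trip verified")
}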

Compression and Encryption Interaction

Encrypt After Compression: Encryption destroys patterns needed for compression. Always compress first, then encrypt. Trying to compress encrypted data wastes CPU.

Security Considerations: BREACH/CRIME attacks exploit compression to leak secrets. For pages with secrets (tokens, cookies), either disable compression or add random padding to vary sizes.

Authenticated Encryption: Use AES-GCM or ChaCha20-Poly1305—they provide both confidentiality and authenticity. Verify authentication tags before decompression to prevent processing malicious data.
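
Putting the ordering and the authenticated-encryption advice together, here is a minimal compress-then-encrypt sketch using AES-GCM; the key is generated in place purely for the example, and a real system would load it from a secret store.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"
    "io"
)

func main() {
    key := make([]byte, 32) // AES-256 key; generated here only for the demo
    rand.Read(key)

    plaintext := []byte("compress first, then encrypt: encrypted bytes look random and will not compress")

    // Step 1: compress
    var compressed bytes.Buffer
    gz := gzip.NewWriter(&compressed)
    gz.Write(plaintext)
    gz.Close()

    // Step 2: encrypt with an authenticated mode (AES-GCM)
    block, err := aes.NewCipher(key)
    if err != nil {
        fmt.Println("cipher error:", err)
        return
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        fmt.Println("gcm error:", err)
        return
    }
    nonce := make([]byte, gcm.NonceSize())
    rand.Read(nonce)
    sealed := gcm.Seal(nonce, nonce, compressed.Bytes(), nil) // nonce is prepended

    fmt.Printf("Plaintext: %d bytes\nEncrypted: %d bytes\n", len(plaintext), len(sealed))

    // Decryption verifies the authentication tag before gzip ever sees the data
    nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
    opened, err := gcm.Open(nil, nonce, ciphertext, nil)
    if err != nil {
        fmt.Println("authentication failed:", err)
        return
    }

    gr, err := gzip.NewReader(bytes.NewReader(opened))
    if err != nil {
        fmt.Println("gzip error:", err)
        return
    }
    defer gr.Close()
    restored, _ := io.ReadAll(gr)
    fmt.Printf("Restored:  %s\n", restored)
}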

Tar vs Zip Philosophy: Tar separates archiving (bundling files) from compression (reducing size). Zip combines both. This explains why tar.gz has better compression—compression sees all files as one stream, finding patterns across file boundaries. Zip compresses files independently, missing cross-file patterns.

Random Access: Zip's central directory enables random access—extract one file without reading the entire archive. Tar requires sequential scan to find files. For archives with thousands of files, this makes extraction much faster when you only need specific files.

Streaming and Seeking: Tar streams naturally—create/extract without seeking. Zip requires seeking to read the central directory (at the end) and jump to file data. This makes tar better for pipes and network streams, zip better for local files.

Metadata Preservation: Tar excels at preserving Unix metadata: permissions (rwxrwxrwx), ownership (UID/GID), timestamps (modify, access, change), extended attributes (ACLs, SELinux contexts), device nodes (for system backups). Zip stores basic metadata but isn't designed for system backups.

Cross-Platform: Zip is more cross-platform—Windows, Mac, and Linux all have native support. Tar is Unix-native, though Windows now has tar support. For distributing software to non-technical users, zip is often better.

Encryption: Zip has built-in encryption (though weak in some implementations). Tar has no native encryption—encrypt the tar.gz file with gpg or another tool. For sensitive archives, consider GPG encryption regardless of format.

Tar with Compression

package main

import (
    "archive/tar"
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Create compressed tar archive
    archive, err := createTarGz()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created .tar.gz archive: %d bytes\n\n", len(archive))

    // Extract from compressed archive
    files, err := extractTarGz(archive)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Println("Extracted files:")
    for name, content := range files {
        fmt.Printf("  %s: %s\n", name, string(content))
    }
}

func createTarGz() ([]byte, error) {
    var buf bytes.Buffer

    // Create gzip writer
    gzWriter := gzip.NewWriter(&buf)

    // Create tar writer on top of gzip
    tarWriter := tar.NewWriter(gzWriter)

    // Add files
    files := map[string][]byte{
        "data1.txt": []byte("First file content"),
        "data2.txt": []byte("Second file content"),
        "data3.txt": []byte("Third file content"),
    }

    for name, content := range files {
        header := &tar.Header{
            Name:    name,
            Mode:    0644,
            Size:    int64(len(content)),
            ModTime: time.Now(),
        }

        if err := tarWriter.WriteHeader(header); err != nil {
            return nil, err
        }

        if _, err := tarWriter.Write(content); err != nil {
            return nil, err
        }
    }

    // Important: close writers to flush, innermost first (tar, then gzip)
    if err := tarWriter.Close(); err != nil {
        return nil, err
    }
    if err := gzWriter.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func extractTarGz(data []byte) (map[string][]byte, error) {
    // Create gzip reader
    gzReader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer gzReader.Close()

    // Create tar reader
    tarReader := tar.NewReader(gzReader)

    files := make(map[string][]byte)

    for {
        header, err := tarReader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }

        var buf bytes.Buffer
        io.Copy(&buf, tarReader)
        files[header.Name] = buf.Bytes()
    }

    return files, nil
}

Streaming Tar Operations

package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Simulate streaming tar creation
    var archive bytes.Buffer

    // Create tar archive with streaming
    if err := streamTarCreate(&archive); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created streaming tar: %d bytes\n\n", archive.Len())

    // Stream processing of tar
    if err := streamTarProcess(&archive); err != nil {
        fmt.Println("Error:", err)
        return
    }
}

func streamTarCreate(dest io.Writer) error {
    writer := tar.NewWriter(dest)
    defer writer.Close()

    // Simulate streaming multiple chunks
    for i := 1; i <= 3; i++ {
        name := fmt.Sprintf("stream_%d.txt", i)
        content := []byte(fmt.Sprintf("Streamed content #%d", i))

        header := &tar.Header{
            Name:    name,
            Mode:    0644,
            Size:    int64(len(content)),
            ModTime: time.Now(),
        }

        if err := writer.WriteHeader(header); err != nil {
            return err
        }

        if _, err := writer.Write(content); err != nil {
            return err
        }

        fmt.Printf("Streamed: %s\n", name)
    }

    return nil
}

func streamTarProcess(source io.Reader) error {
    reader := tar.NewReader(source)

    fmt.Println("\nProcessing streamed tar:")
    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }

        // Process each file as it's read
        fmt.Printf("  Processing: %s (%d bytes)\n", header.Name, header.Size)

        // Read and discard content
        io.Copy(io.Discard, reader)
    }

    return nil
}

Tar Advanced Features

Tar supports many advanced features beyond basic file archiving. Understanding these features helps you build robust backup and distribution systems.

Sparse Files: Files with large holes (regions of zeros) can be stored efficiently. Tar's sparse format stores only non-zero regions, dramatically reducing archive size for sparse files like disk images or database files.

Hard and Soft Links: Tar preserves hard links (multiple names for same file) and symbolic links (pointers to other files). This ensures archives accurately represent complex file systems.

Extended Attributes: Modern tar formats (PAX) support extended attributes, ACLs, and SELinux contexts. This is crucial for accurate system backups that need to preserve security settings.

Incremental Backups: Tar can create incremental archives containing only files changed since a previous backup. Combined with snapshot files, this enables efficient multi-level backup strategies.

Archive Splitting: Large archives can be split across multiple files or volumes. This was essential for tape backups and remains useful for distribution across size-limited media.
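
Links are represented with dedicated header types rather than file content. A small sketch writing and listing a symbolic link entry (the paths are made up for illustration):

package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
)

func main() {
    var buf bytes.Buffer
    tw := tar.NewWriter(&buf)

    // A regular file...
    content := []byte("real file content")
    tw.WriteHeader(&tar.Header{Name: "data/original.txt", Mode: 0644, Size: int64(len(content))})
    tw.Write(content)

    // ...and a symbolic link pointing at it: no Size, no body, just a Linkname
    tw.WriteHeader(&tar.Header{
        Name:     "data/link.txt",
        Typeflag: tar.TypeSymlink,
        Linkname: "original.txt",
        Mode:     0777,
    })
    tw.Close()

    // List the entries back
    tr := tar.NewReader(&buf)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        if hdr.Typeflag == tar.TypeSymlink {
            fmt.Printf("%s -> %s (symlink)\n", hdr.Name, hdr.Linkname)
        } else {
            fmt.Printf("%s (%d bytes)\n", hdr.Name, hdr.Size)
        }
    }
}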

Zip Files

The Universal Archive: ZIP archives work on every platform—Windows, Mac, Linux, mobile devices. Unlike tar, ZIP files include a central directory at the end, allowing efficient random access to individual files without scanning the entire archive.

ZIP's Advantages: Individual file access (extract single files without decompressing everything), built-in compression per file (can store some files uncompressed), optional encryption, and universal tool support make ZIP ideal for software distribution and data exchange.

Compression Methods: ZIP supports multiple compression methods: Store (no compression, for already-compressed files), DEFLATE (standard compression, same as gzip), BZIP2 (better compression, slower), and LZMA (best compression, much slower). Most tools use DEFLATE as the default.

ZIP is a popular archive format with built-in compression.

Creating ZIP Archives

ZIP Central Directory: Unlike tar's sequential format, zip files have a central directory at the end listing all files. This directory enables random access—you can extract a single file without scanning the entire archive—but requires seeking, making zip less suitable for streaming.

ZIP64 Extensions: Original zip format limited file sizes to 4GB and archive sizes to 4GB. ZIP64 extensions use 64-bit fields, removing these limits. Modern tools use ZIP64 automatically when needed, ensuring compatibility while supporting large files.

Compression Per File: ZIP compresses each file independently, enabling parallel decompression and different compression levels per file. Already-compressed files (JPEG, MP4) can be stored uncompressed, saving CPU while maintaining archive integrity.

package main

import (
    "archive/zip"
    "bytes"
    "fmt"
    "io"
)

// run
func main() {
    // Create ZIP archive
    archive, err := createZipArchive()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created ZIP archive: %d bytes\n\n", len(archive))

    // List contents
    if err := listZipArchive(archive); err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Extract file
    content, err := extractFromZip(archive, "document.txt")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("\nExtracted document.txt:\n%s\n", string(content))
}

func createZipArchive() ([]byte, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    // Add files with different compression methods
    files := []struct {
        Name   string
        Data   []byte
        Method uint16
    }{
        {"readme.txt", []byte("This is a readme file"), zip.Deflate},
        {"document.txt", []byte("Document content here"), zip.Deflate},
        {"data.bin", []byte{0x00, 0x01, 0x02, 0x03}, zip.Store}, // No compression
    }

    for _, file := range files {
        header := &zip.FileHeader{
            Name:   file.Name,
            Method: file.Method,
        }

        w, err := writer.CreateHeader(header)
        if err != nil {
            return nil, err
        }

        if _, err := w.Write(file.Data); err != nil {
            return nil, err
        }
    }

    // Close writes the central directory at the end of the archive
    if err := writer.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

func listZipArchive(data []byte) error {
    reader, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
    if err != nil {
        return err
    }

    fmt.Println("ZIP contents:")
    for _, file := range reader.File {
        method := "Deflate"
        if file.Method == zip.Store {
            method = "Store"
        }

        fmt.Printf("  %s (%s): %d bytes compressed, %d bytes uncompressed\n",
            file.Name, method, file.CompressedSize64, file.UncompressedSize64)
    }

    return nil
}

func extractFromZip(data []byte, filename string) ([]byte, error) {
    reader, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
    if err != nil {
        return nil, err
    }

    for _, file := range reader.File {
        if file.Name == filename {
            rc, err := file.Open()
            if err != nil {
                return nil, err
            }
            defer rc.Close()

            var buf bytes.Buffer
            io.Copy(&buf, rc)
            return buf.Bytes(), nil
        }
    }

    return nil, fmt.Errorf("file not found: %s", filename)
}

ZIP with Compression Levels

package main

import (
    "archive/zip"
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// run
func main() {
    data := []byte("Test data for compression level comparison. " +
        "Repetitive data compresses better. Repetitive data!")

    levels := []struct {
        Name  string
        Level int
    }{
        {"No Compression", flate.NoCompression},
        {"Best Speed", flate.BestSpeed},
        {"Default", flate.DefaultCompression},
        {"Best Compression", flate.BestCompression},
    }

    fmt.Println("ZIP Compression Level Comparison:")
    fmt.Println()

    for _, level := range levels {
        size, err := createZipWithLevel(data, level.Level)
        if err != nil {
            fmt.Printf("%s: Error: %v\n", level.Name, err)
            continue
        }

        fmt.Printf("%s:\n", level.Name)
        fmt.Printf("  Size: %d bytes (%.1f%% of original)\n",
            size, float64(size)/float64(len(data))*100)
    }
}

func createZipWithLevel(data []byte, level int) (int, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    header := &zip.FileHeader{
        Name:   "data.txt",
        Method: zip.Deflate,
    }

    if level == flate.NoCompression {
        // Store the file without compression
        header.Method = zip.Store
    } else {
        // Override the default Deflate compressor with one at the requested level
        writer.RegisterCompressor(zip.Deflate, func(out io.Writer) (io.WriteCloser, error) {
            return flate.NewWriter(out, level)
        })
    }

    w, err := writer.CreateHeader(header)
    if err != nil {
        return 0, err
    }

    if _, err := w.Write(data); err != nil {
        return 0, err
    }

    if err := writer.Close(); err != nil {
        return 0, err
    }
    return buf.Len(), nil
}

Password-Protected ZIP

package main

import (
    "archive/zip"
    "bytes"
    "fmt"
)

// run
func main() {
    // Note: Standard library doesn't support password-protected ZIP
    // This example shows the structure; you'd need golang.org/x/crypto
    // or a third-party library for actual encryption

    data := []byte("Sensitive data")

    archive, err := createProtectedZip(data)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created archive: %d bytes\n", len(archive))
    fmt.Println("\nNote: Standard library doesn't fully support ZIP encryption")
    fmt.Println("Use github.com/alexmullins/zip for encrypted ZIP files")
}

func createProtectedZip(data []byte) ([]byte, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    // Standard ZIP creation
    header := &zip.FileHeader{
        Name:   "protected.txt",
        Method: zip.Deflate,
    }

    w, err := writer.CreateHeader(header)
    if err != nil {
        return nil, err
    }

    // In real implementation, you'd encrypt data here
    if _, err := w.Write(data); err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

Streaming Compression

Memory Efficiency: Streaming compression processes data incrementally rather than loading entire files into memory. This is essential for large files (gigabytes or terabytes) and real-time applications (compressing data as it's generated).

The io.Reader/io.Writer Pattern: Go's interface-based design makes streaming natural. A gzip.Writer is an io.Writer, so anything that accepts io.Writer can transparently add compression. This composability is powerful—you can chain readers and writers to create complex pipelines.

Chunked Processing: For streaming to work effectively, data must be processed in chunks. Too small and you lose compression efficiency (patterns span chunks). Too large and you lose the memory benefits of streaming. Typical chunk sizes range from 4KB to 64KB, balancing memory use and compression ratio.

Streaming keeps memory usage small and predictable, even when the input is far larger than available RAM.

Chunked Streaming

package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "strings"
)

// run
func main() {
    // Simulate large data source
    source := generateLargeData()

    // Stream compress with chunks
    var compressed strings.Builder
    stats, err := streamCompressChunked(source, &compressed, 1024)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compression Statistics:\n")
    fmt.Printf("  Chunks processed: %d\n", stats.Chunks)
    fmt.Printf("  Bytes read: %d\n", stats.BytesRead)
    fmt.Printf("  Bytes written: %d\n", stats.BytesWritten)
    fmt.Printf("  Compression ratio: %.2f%%\n",
        float64(stats.BytesWritten)/float64(stats.BytesRead)*100)
}

type CompressionStats struct {
    Chunks       int
    BytesRead    int64 // uncompressed bytes consumed from the source
    BytesWritten int64 // compressed bytes written to the destination
}

// countingWriter wraps a writer and counts the bytes that reach it
type countingWriter struct {
    w io.Writer
    n int64
}

func (cw *countingWriter) Write(p []byte) (int, error) {
    n, err := cw.w.Write(p)
    cw.n += int64(n)
    return n, err
}

func generateLargeData() io.Reader {
    // Generate repetitive data for good compression
    data := strings.Repeat("Sample data for streaming compression. ", 100)
    return strings.NewReader(data)
}

func streamCompressChunked(source io.Reader, dest io.Writer, chunkSize int) (CompressionStats, error) {
    stats := CompressionStats{}

    // Count compressed output as it leaves the gzip writer
    counter := &countingWriter{w: dest}
    writer := gzip.NewWriter(counter)

    buffer := make([]byte, chunkSize)

    for {
        n, err := source.Read(buffer)
        if err != nil && err != io.EOF {
            return stats, err
        }

        if n > 0 {
            stats.Chunks++
            stats.BytesRead += int64(n)

            if _, werr := writer.Write(buffer[:n]); werr != nil {
                return stats, werr
            }
        }

        if err == io.EOF {
            break
        }
    }

    // Must close to flush the remaining compressed data
    if err := writer.Close(); err != nil {
        return stats, err
    }

    stats.BytesWritten = counter.n
    return stats, nil
}

Advanced Streaming Compression Example

Here's a production-ready streaming compression system with progress tracking, error handling, and optimization:

package main

import (
    "compress/gzip"
    "crypto/sha256"
    "fmt"
    "io"
    "sync/atomic"
    "time"
)

// run
func main() {
    // Create streaming compressor
    compressor := NewStreamingCompressor(CompressionConfig{
        Level:      gzip.DefaultCompression,
        BufferSize: 64 * 1024,   // 64KB buffers
        ChunkSize:  1024 * 1024, // 1MB chunks
    })

    // Simulate compressing a large file
    fmt.Println("Starting compression...")

    input := generateLargeData(5 * 1024 * 1024)   // 5MB test data
    output := &countingWriter{Writer: io.Discard} // discard the compressed bytes, keep the count

    stats, err := compressor.Compress(input, output, &progressCallback{})
    if err != nil {
        fmt.Printf("Compression failed: %v\n", err)
        return
    }

    fmt.Printf("\nCompression complete:\n")
    fmt.Printf("  Input size: %d bytes\n", stats.InputSize)
    fmt.Printf("  Output size: %d bytes\n", stats.OutputSize)
    fmt.Printf("  Ratio: %.2f%%\n", float64(stats.OutputSize)/float64(stats.InputSize)*100)
    fmt.Printf("  Duration: %s\n", stats.Duration)
    fmt.Printf("  Throughput: %.2f MB/s\n", float64(stats.InputSize)/stats.Duration.Seconds()/1024/1024)
    fmt.Printf("  Checksum: %x\n", stats.Checksum)
}

// CompressionConfig holds compression parameters
type CompressionConfig struct {
    Level      int // Compression level (1-9)
    BufferSize int // I/O buffer size
    ChunkSize  int // Processing chunk size
}

// CompressionStats holds compression statistics
type CompressionStats struct {
    InputSize  int64         // Bytes read
    OutputSize int64         // Bytes written
    Duration   time.Duration // Time taken
    Checksum   []byte        // Input checksum
}

// ProgressCallback receives progress updates
type ProgressCallback interface {
    OnProgress(bytesProcessed, totalBytes int64)
}

// StreamingCompressor handles streaming compression
type StreamingCompressor struct {
    config CompressionConfig
}

func NewStreamingCompressor(config CompressionConfig) *StreamingCompressor {
    return &StreamingCompressor{config: config}
}

// Compress compresses input to output with progress tracking
func (c *StreamingCompressor) Compress(input io.Reader, output io.Writer, progress ProgressCallback) (CompressionStats, error) {
    start := time.Now()

    // Create checksum hasher
    hasher := sha256.New()

    // Wrap input to compute a checksum of the uncompressed data
    inputReader := io.TeeReader(input, hasher)

    // Count compressed bytes as they reach the destination
    countingOutput := &countingWriter{Writer: output}

    // Create gzip writer with the configured level
    gzWriter, err := gzip.NewWriterLevel(countingOutput, c.config.Level)
    if err != nil {
        return CompressionStats{}, err
    }

    // Compress in chunks
    buffer := make([]byte, c.config.ChunkSize)
    var totalRead int64

    for {
        n, err := inputReader.Read(buffer)
        if n > 0 {
            // Feed the uncompressed chunk to the gzip writer
            if _, writeErr := gzWriter.Write(buffer[:n]); writeErr != nil {
                return CompressionStats{}, writeErr
            }

            totalRead += int64(n)

            // Report progress
            if progress != nil {
                progress.OnProgress(totalRead, -1) // -1 = unknown total
            }
        }

        if err == io.EOF {
            break
        }
        if err != nil {
            return CompressionStats{}, err
        }
    }

    // Flush gzip writer
    if err := gzWriter.Close(); err != nil {
        return CompressionStats{}, err
    }

    return CompressionStats{
        InputSize:  totalRead,
        OutputSize: countingOutput.Count(),
        Duration:   time.Since(start),
        Checksum:   hasher.Sum(nil),
    }, nil
}

// countingWriter wraps a writer and counts bytes written
type countingWriter struct {
    Writer io.Writer
    count  atomic.Int64
}

func (cw *countingWriter) Write(p []byte) (int, error) {
    n, err := cw.Writer.Write(p)
    cw.count.Add(int64(n))
    return n, err
}

func (cw *countingWriter) Count() int64 {
    return cw.count.Load()
}

// progressCallback implements ProgressCallback
type progressCallback struct {
    lastReport time.Time
}

func (p *progressCallback) OnProgress(bytesProcessed, totalBytes int64) {
    // Rate limit progress updates
    if time.Since(p.lastReport) < 100*time.Millisecond {
        return
    }
    p.lastReport = time.Now()

    mb := float64(bytesProcessed) / 1024 / 1024
    fmt.Printf("\rProcessed: %.2f MB", mb)
}

// generateLargeData creates test data
func generateLargeData(size int) io.Reader {
    // Create pipe for streaming data
    r, w := io.Pipe()

    go func() {
        defer w.Close()

        // Generate repetitive data (compresses well)
        chunk := []byte("This is test data for compression. It repeats to demonstrate good compression ratios.\n")

        for written := 0; written < size; {
            n, _ := w.Write(chunk)
            written += n
        }
    }()

    return r
}

This example demonstrates:

  • Streaming Processing: Handle arbitrarily large files
  • Progress Tracking: Real-time feedback
  • Checksum Verification: Data integrity
  • Error Handling: Robust error management
  • Performance Monitoring: Track throughput and ratios
  • Configurable Parameters: Flexible compression settings
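
The checksum recorded in CompressionStats is only useful if something checks it again later. Here is a minimal verification sketch, assuming the SHA-256 value came from the compressor above (the helper name verifyChecksum is illustrative, not part of the example):

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/sha256"
    "fmt"
    "io"
)

// verifyChecksum decompresses a gzip stream and compares the SHA-256 of the
// recovered data against the checksum recorded at compression time.
func verifyChecksum(compressed io.Reader, expected []byte) error {
    gz, err := gzip.NewReader(compressed)
    if err != nil {
        return err
    }
    defer gz.Close()

    hasher := sha256.New()
    if _, err := io.Copy(hasher, gz); err != nil {
        return err
    }

    if !bytes.Equal(hasher.Sum(nil), expected) {
        return fmt.Errorf("checksum mismatch: archive is corrupted or was modified")
    }
    return nil
}

// run
func main() {
    // Round-trip demo: compress some data, record its checksum, then verify.
    original := []byte("payload to protect with a checksum")
    want := sha256.Sum256(original)

    var buf bytes.Buffer
    gz := gzip.NewWriter(&buf)
    gz.Write(original)
    gz.Close()

    if err := verifyChecksum(&buf, want[:]); err != nil {
        fmt.Println("verification failed:", err)
        return
    }
    fmt.Println("checksum verified")
}

The next example strips the same streaming idea down to its essentials: chunked reads, a gzip writer, and per-chunk statistics.
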
 1package main
 2
 3import (
 4    "compress/gzip"
 5    "fmt"
 6    "io"
 7    "strings"
 8)
 9
10// run
11func main() {
12    // Simulate large data source
13    source := generateLargeData()
14
15    // Stream compress with chunks
16    var compressed strings.Builder
17    stats, err := streamCompressChunked(source, &compressed, 1024)
18    if err != nil {
19        fmt.Println("Error:", err)
20        return
21    }
22
23    fmt.Printf("Compression Statistics:\n")
24    fmt.Printf("  Chunks processed: %d\n", stats.Chunks)
25    fmt.Printf("  Bytes read: %d\n", stats.BytesRead)
26    fmt.Printf("  Bytes written: %d\n", stats.BytesWritten)
27    fmt.Printf("  Compression ratio: %.2f%%\n",
28        float64(stats.BytesWritten)/float64(stats.BytesRead)*100)
29}
30
31type CompressionStats struct {
32    Chunks       int
33    BytesRead    int64
34    BytesWritten int64
35}
36
37func generateLargeData() io.Reader {
38    // Generate repetitive data for good compression
39    data := strings.Repeat("Sample data for streaming compression. ", 100)
40    return strings.NewReader(data)
41}
42
43func streamCompressChunked(source io.Reader, dest io.Writer, chunkSize int) (CompressionStats, error) {
44    stats := CompressionStats{}
45
46    writer := gzip.NewWriter(dest)
47    defer writer.Close()
48
49    buffer := make([]byte, chunkSize)
50
51    for {
52        n, err := source.Read(buffer)
53        if err != nil && err != io.EOF {
54            return stats, err
55        }
56
57        if n > 0 {
58            stats.Chunks++
59            stats.BytesRead += int64(n)
60
61            written, err := writer.Write(buffer[:n])
62            if err != nil {
63                return stats, err
64            }
65
66            stats.BytesWritten += int64(written)
67        }
68
69        if err == io.EOF {
70            break
71        }
72    }
73
74    // Must close to flush
75    writer.Close()
76
77    return stats, nil
78}

Concurrent Compression

 1package main
 2
 3import (
 4    "bytes"
 5    "compress/gzip"
 6    "fmt"
 7    "sync"
 8)
 9
10// run
11func main() {
12    // Split data into chunks for parallel compression
13    chunks := [][]byte{
14        []byte("Chunk 1: Some data to compress"),
15        []byte("Chunk 2: More data to compress"),
16        []byte("Chunk 3: Even more data"),
17        []byte("Chunk 4: Final chunk of data"),
18    }
19
20    // Compress chunks concurrently
21    results, err := compressConcurrent(chunks)
22    if err != nil {
23        fmt.Println("Error:", err)
24        return
25    }
26
27    // Show results
28    for i, result := range results {
29        fmt.Printf("Chunk %d: %d bytes -> %d bytes (%.2f%%)\n",
30            i+1,
31            len(chunks[i]),
32            len(result),
33            float64(len(result))/float64(len(chunks[i]))*100)
34    }
35}
36
37func compressConcurrent(chunks [][]byte) ([][]byte, error) {
38    results := make([][]byte, len(chunks))
39    errChan := make(chan error, len(chunks))
40
41    var wg sync.WaitGroup
42
43    for i, chunk := range chunks {
44        wg.Add(1)
45        go func(index int, data []byte) {
46            defer wg.Done()
47
48            var buf bytes.Buffer
49            writer := gzip.NewWriter(&buf)
50
51            _, err := writer.Write(data)
52            if err != nil {
53                errChan <- err
54                return
55            }
56
57            writer.Close()
58            results[index] = buf.Bytes()
59        }(i, chunk)
60    }
61
62    wg.Wait()
63    close(errChan)
64
65    // Check for errors
66    if err := <-errChan; err != nil {
67        return nil, err
68    }
69
70    return results, nil
71}

Buffered Compression

 1package main
 2
 3import (
 4    "bufio"
 5    "bytes"
 6    "compress/gzip"
 7    "fmt"
 9)
10
11// run
12func main() {
13    data := []byte("Data for buffered compression. " +
14        "Buffering can improve performance for small writes.")
15
16    // Compare buffered vs unbuffered
17    fmt.Println("Unbuffered compression:")
18    unbuffered := compressUnbuffered(data)
19    fmt.Printf("  Size: %d bytes\n", len(unbuffered))
20
21    fmt.Println("\nBuffered compression:")
22    buffered := compressBuffered(data, 4096)
23    fmt.Printf("  Size: %d bytes\n", len(buffered))
24}
25
26func compressUnbuffered(data []byte) []byte {
27    var buf bytes.Buffer
28    writer := gzip.NewWriter(&buf)
29
30    // Small writes without buffering
31    for i := 0; i < len(data); i += 10 {
32        end := i + 10
33        if end > len(data) {
34            end = len(data)
35        }
36        writer.Write(data[i:end])
37    }
38
39    writer.Close()
40    return buf.Bytes()
41}
42
43func compressBuffered(data []byte, bufferSize int) []byte {
44    var buf bytes.Buffer
45    gzWriter := gzip.NewWriter(&buf)
46
47    // Add buffering layer
48    bufWriter := bufio.NewWriterSize(gzWriter, bufferSize)
49
50    // Small writes with buffering
51    for i := 0; i < len(data); i += 10 {
52        end := i + 10
53        if end > len(data) {
54            end = len(data)
55        }
56        bufWriter.Write(data[i:end])
57    }
58
59    bufWriter.Flush()
60    gzWriter.Close()
61    return buf.Bytes()
62}

Performance Comparisons

Speed vs Size Trade-offs: Compression is always a trade-off between CPU time and compression ratio. Fast algorithms (like Snappy or LZ4) compress at hundreds of MB/s but achieve modest ratios. Slow algorithms (like LZMA or Brotli) achieve excellent ratios but may compress at just a few MB/s.

Choosing Based on Bottlenecks: If your bottleneck is network bandwidth, use maximum compression—the extra CPU time is worth the bandwidth savings. If your bottleneck is CPU, use fast compression or skip it entirely. If your bottleneck is disk I/O, compression can actually improve performance by reducing I/O.

Data Characteristics Matter: Text compresses well (70-90% reduction typical). JSON and XML compress very well (80-95% reduction) due to repeated tag names. Binary data and already-compressed formats (JPEG, MP4, PNG) achieve minimal compression—sometimes even expanding slightly due to overhead.
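
You can measure the speed-versus-size trade-off directly with the standard library by varying the gzip level on the same input. This sketch is separate from the algorithm comparison below; it only contrasts gzip.BestSpeed, gzip.DefaultCompression, and gzip.BestCompression:

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "strings"
    "time"
)

// run
func main() {
    data := []byte(strings.Repeat("speed versus size: the eternal compression trade-off. ", 500))

    levels := []struct {
        name  string
        level int
    }{
        {"BestSpeed", gzip.BestSpeed},
        {"Default", gzip.DefaultCompression},
        {"BestCompression", gzip.BestCompression},
    }

    for _, l := range levels {
        var buf bytes.Buffer
        start := time.Now()

        w, err := gzip.NewWriterLevel(&buf, l.level)
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        w.Write(data)
        w.Close()

        fmt.Printf("%-16s %6d -> %5d bytes (%.2f%%) in %s\n",
            l.name, len(data), buf.Len(),
            float64(buf.Len())/float64(len(data))*100,
            time.Since(start))
    }
}

On highly repetitive input like this, all three levels compress well; the differences in ratio and time typically become more pronounced on larger, less regular data.
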

Algorithm Comparison

 1package main
 2
 3import (
 4    "bytes"
 5    "compress/flate"
 6    "compress/gzip"
 7    "compress/zlib"
 8    "fmt"
 9    "strings"
10    "time"
11)
12
13// run
14func main() {
15    // Test data - repetitive for good compression
16    data := []byte(strings.Repeat("The quick brown fox jumps over the lazy dog. ", 200))
17
18    fmt.Printf("Original size: %d bytes\n\n", len(data))
19
20    algorithms := []struct {
21        name    string
22        compress func([]byte) ([]byte, error)
23    }{
24        {"Gzip", compressGzip},
25        {"Zlib", compressZlib},
26        {"Flate", compressFlate},
27    }
28
29    for _, algo := range algorithms {
30        start := time.Now()
31        compressed, err := algo.compress(data)
32        elapsed := time.Since(start)
33
34        if err != nil {
35            fmt.Printf("%s: Error: %v\n", algo.name, err)
36            continue
37        }
38
39        fmt.Printf("%s:\n", algo.name)
40        fmt.Printf("  Compressed: %d bytes (%.2f%%)\n",
41            len(compressed),
42            float64(len(compressed))/float64(len(data))*100)
43        fmt.Printf("  Time: %s\n\n", elapsed)
44    }
45}
46
47func compressGzip(data []byte) ([]byte, error) {
48    var buf bytes.Buffer
49    writer := gzip.NewWriter(&buf)
50    _, err := writer.Write(data)
51    if err != nil {
52        return nil, err
53    }
54    writer.Close()
55    return buf.Bytes(), nil
56}
57
58func compressZlib(data []byte) ([]byte, error) {
59    var buf bytes.Buffer
60    writer := zlib.NewWriter(&buf)
61    _, err := writer.Write(data)
62    if err != nil {
63        return nil, err
64    }
65    writer.Close()
66    return buf.Bytes(), nil
67}
68
69func compressFlate(data []byte) ([]byte, error) {
70    var buf bytes.Buffer
71    writer, err := flate.NewWriter(&buf, flate.DefaultCompression)
72    if err != nil {
73        return nil, err
74    }
75    _, err = writer.Write(data)
76    if err != nil {
77        return nil, err
78    }
79    writer.Close()
80    return buf.Bytes(), nil
81}

Memory Usage Patterns

 1package main
 2
 3import (
 4    "bytes"
 5    "compress/gzip"
 6    "fmt"
 7    "runtime"
 8)
 9
10// run
11func main() {
12    // Large data set
13    data := make([]byte, 1024*1024) // 1MB
14    for i := range data {
15        data[i] = byte(i % 256)
16    }
17
18    // Measure memory before
19    var m1 runtime.MemStats
20    runtime.ReadMemStats(&m1)
21
22    // Compress in memory
23    var buf bytes.Buffer
24    writer := gzip.NewWriter(&buf)
25    writer.Write(data)
26    writer.Close()
27
28    // Measure memory after
29    var m2 runtime.MemStats
30    runtime.ReadMemStats(&m2)
31
32    fmt.Printf("Memory Usage:\n")
33    fmt.Printf("  Before: %d KB\n", m1.Alloc/1024)
34    fmt.Printf("  After: %d KB\n", m2.Alloc/1024)
35    fmt.Printf("  Delta: %d KB\n", (m2.Alloc-m1.Alloc)/1024)
36    fmt.Printf("\nCompressed size: %d bytes (%.2f%%)\n",
37        buf.Len(),
38        float64(buf.Len())/float64(len(data))*100)
39}

Production Patterns

Production Considerations: Real-world compression involves more than just calling compress() and decompress(). You need error handling (corrupted data), progress tracking (user feedback), resource management (memory limits), and performance optimization (pooling, buffering).
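
For the corrupted-data case in particular, compress/gzip does the detection for you: gzip.NewReader rejects a stream with an invalid header, and damage to the body or trailer surfaces as gzip.ErrChecksum while reading. A small defensive-decompression sketch (safeDecompress is an illustrative helper, not used by the examples below):

package main

import (
    "bytes"
    "compress/gzip"
    "errors"
    "fmt"
    "io"
)

// safeDecompress decompresses gzip data and distinguishes the common failure
// modes: an invalid header (not gzip at all) versus a corrupted body or trailer.
func safeDecompress(data []byte) ([]byte, error) {
    r, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, fmt.Errorf("not a valid gzip stream: %w", err)
    }
    defer r.Close()

    var out bytes.Buffer
    if _, err := io.Copy(&out, r); err != nil {
        if errors.Is(err, gzip.ErrChecksum) {
            return nil, fmt.Errorf("gzip data is corrupted: %w", err)
        }
        return nil, fmt.Errorf("decompression failed: %w", err)
    }
    return out.Bytes(), nil
}

// run
func main() {
    // Build a valid gzip stream, then flip a byte in its trailer to simulate corruption.
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write([]byte("data that will be damaged in transit"))
    w.Close()

    damaged := buf.Bytes()
    damaged[len(damaged)-5] ^= 0xFF // corrupt part of the CRC32 trailer

    if _, err := safeDecompress(damaged); err != nil {
        fmt.Println("decompression rejected:", err)
    } else {
        fmt.Println("decompression succeeded unexpectedly")
    }
}
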

HTTP Compression: Web servers automatically compress responses based on Accept-Encoding headers. Proper HTTP compression can reduce page load times by 60-80%, dramatically improving user experience especially on mobile networks.

Log File Management: Growing log files consume disk space quickly. Compressed rotation (compress old logs, delete oldest) is standard practice. A year's worth of logs might compress from 100GB to 5GB—a 95% space savings that makes retention practical.

HTTP Compression Middleware

 1package main
 2
 3import (
 4    "compress/gzip"
 5    "fmt"
 6    "io"
 7    "net/http"
 8    "net/http/httptest"
 9    "strings"
10)
11
12// run
13func main() {
14    // Create handler with compression
15    handler := GzipMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
16        data := strings.Repeat("Hello, World! ", 100)
17        w.Write([]byte(data))
18    }))
19
20    // Test without compression
21    req1 := httptest.NewRequest("GET", "/", nil)
22    rec1 := httptest.NewRecorder()
23    handler.ServeHTTP(rec1, req1)
24
25    fmt.Printf("Without compression: %d bytes\n", rec1.Body.Len())
26
27    // Test with compression
28    req2 := httptest.NewRequest("GET", "/", nil)
29    req2.Header.Set("Accept-Encoding", "gzip")
30    rec2 := httptest.NewRecorder()
31    handler.ServeHTTP(rec2, req2)
32
33    fmt.Printf("With compression: %d bytes\n", rec2.Body.Len())
34    fmt.Printf("Compression ratio: %.2f%%\n",
35        float64(rec2.Body.Len())/float64(rec1.Body.Len())*100)
36}
37
38type gzipResponseWriter struct {
39    io.Writer
40    http.ResponseWriter
41}
42
43func (w gzipResponseWriter) Write(b []byte) (int, error) {
44    return w.Writer.Write(b)
45}
46
47func GzipMiddleware(next http.Handler) http.Handler {
48    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
49        // Check if client accepts gzip
50        if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
51            next.ServeHTTP(w, r)
52            return
53        }
54
55        // Create gzip writer
56        gz := gzip.NewWriter(w)
57        defer gz.Close()
58
59        // Set headers
60        w.Header().Set("Content-Encoding", "gzip")
61
62        // Wrap response writer
63        gzw := gzipResponseWriter{Writer: gz, ResponseWriter: w}
64        next.ServeHTTP(gzw, r)
65    })
66}

File Compression with Progress

 1package main
 2
 3import (
 4    "compress/gzip"
 5    "fmt"
 6    "io"
 7    "strings"
 8)
 9
10// run
11func main() {
12    // Simulate large file
13    source := strings.NewReader(strings.Repeat("Sample data ", 1000))
14
15    // Compress with progress tracking
16    var compressed strings.Builder
17    progress := &ProgressWriter{Total: int64(source.Len())}
18
19    if err := compressWithProgress(source, &compressed, progress); err != nil {
20        fmt.Println("Error:", err)
21        return
22    }
23
24    fmt.Printf("\nCompression complete!\n")
25    fmt.Printf("Original: %d bytes\n", progress.Total)
26    fmt.Printf("Compressed: %d bytes\n", compressed.Len())
27}
28
29type ProgressWriter struct {
30    Total   int64
31    Written int64
32}
33
34func (p *ProgressWriter) Write(data []byte) (int, error) {
35    n := len(data)
36    p.Written += int64(n)
37
38    percent := float64(p.Written) / float64(p.Total) * 100
39    fmt.Printf("\rProgress: %.1f%% (%d/%d bytes)", percent, p.Written, p.Total)
40
41    return n, nil
42}
43
44func compressWithProgress(source io.Reader, dest io.Writer, progress *ProgressWriter) error {
45    // Chain: source -> progress tracker -> gzip -> destination
46    gzWriter := gzip.NewWriter(dest)
47    defer gzWriter.Close()
48
49    // Create multi-writer to track progress
50    multiWriter := io.MultiWriter(gzWriter, progress)
51
52    _, err := io.Copy(multiWriter, source)
53    return err
54}

Compressed Log Rotation

  1package main
  2
  3import (
  4    "compress/gzip"
  5    "fmt"
  6    "io"
  7    "os"
  8    "path/filepath"
  9    "sync"
 10    "time"
 11)
 12
 13// run
 14func main() {
 15    logger := NewCompressedLogger("/tmp", "app.log", 1024) // 1KB max
 16
 17    // Write logs
 18    for i := 0; i < 10; i++ {
 19        logger.Write([]byte(fmt.Sprintf("Log entry %d: %s\n",
 20            i, time.Now().Format(time.RFC3339))))
 21    }
 22
 23    logger.Close()
 24
 25    fmt.Println("Log rotation with compression demo completed")
 26    fmt.Println("Check /tmp for app.log and compressed rotated logs")
 27}
 28
 29type CompressedLogger struct {
 30    dir         string
 31    filename    string
 32    maxSize     int64
 33    currentSize int64
 34    file        *os.File
 35    mu          sync.Mutex
 36}
 37
 38func NewCompressedLogger(dir, filename string, maxSize int64) *CompressedLogger {
 39    return &CompressedLogger{
 40        dir:      dir,
 41        filename: filename,
 42        maxSize:  maxSize,
 43    }
 44}
 45
 46func (l *CompressedLogger) Write(data []byte) (int, error) {
 47    l.mu.Lock()
 48    defer l.mu.Unlock()
 49
 50    // Check if we need to rotate
 51    if l.currentSize+int64(len(data)) > l.maxSize {
 52        if err := l.rotate(); err != nil {
 53            return 0, err
 54        }
 55    }
 56
 57    // Open file if needed
 58    if l.file == nil {
 59        path := filepath.Join(l.dir, l.filename)
 60        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
 61        if err != nil {
 62            return 0, err
 63        }
 64        l.file = f
 65    }
 66
 67    n, err := l.file.Write(data)
 68    l.currentSize += int64(n)
 69    return n, err
 70}
 71
 72func (l *CompressedLogger) rotate() error {
 73    // Close current file
 74    if l.file != nil {
 75        l.file.Close()
 76        l.file = nil
 77    }
 78
 79    // Compress old file
 80    oldPath := filepath.Join(l.dir, l.filename)
 81    timestamp := time.Now().Format("20060102-150405")
 82    gzPath := filepath.Join(l.dir, fmt.Sprintf("%s.%s.gz", l.filename, timestamp))
 83
 84    if err := compressFile(oldPath, gzPath); err != nil {
 85        return err
 86    }
 87
 88    // Remove old file
 89    os.Remove(oldPath)
 90
 91    // Reset size
 92    l.currentSize = 0
 93
 94    fmt.Printf("Rotated log to: %s\n", gzPath)
 95    return nil
 96}
 97
 98func (l *CompressedLogger) Close() error {
 99    l.mu.Lock()
100    defer l.mu.Unlock()
101
102    if l.file != nil {
103        return l.file.Close()
104    }
105    return nil
106}
107
108func compressFile(source, dest string) error {
109    srcFile, err := os.Open(source)
110    if err != nil {
111        return err
112    }
113    defer srcFile.Close()
114
115    destFile, err := os.Create(dest)
116    if err != nil {
117        return err
118    }
119    defer destFile.Close()
120
121    gzWriter := gzip.NewWriter(destFile)
122    defer gzWriter.Close()
123
124    _, err = io.Copy(gzWriter, srcFile)
125    return err
126}

Compression Pool

 1package main
 2
 3import (
 4    "compress/gzip"
 5    "fmt"
 6    "io"
 7    "strings"
 8    "sync"
 9)
10
11// run
12func main() {
13    pool := NewGzipPool(gzip.BestSpeed)
14
15    // Compress multiple items using pool
16    items := []string{
17        "Item 1 data",
18        "Item 2 data",
19        "Item 3 data",
20        "Item 4 data",
21        "Item 5 data",
22    }
23
24    var wg sync.WaitGroup
25    for i, item := range items {
26        wg.Add(1)
27        go func(id int, data string) {
28            defer wg.Done()
29
30            var compressed strings.Builder
31            if err := pool.Compress(strings.NewReader(data), &compressed); err != nil {
32                fmt.Printf("Item %d error: %v\n", id, err)
33                return
34            }
35
36            fmt.Printf("Item %d: %d -> %d bytes\n",
37                id, len(data), compressed.Len())
38        }(i+1, item)
39    }
40
41    wg.Wait()
42}
43
44type GzipPool struct {
45    writers sync.Pool
46    level   int
47}
48
49func NewGzipPool(level int) *GzipPool {
50    return &GzipPool{
51        level: level,
52        writers: sync.Pool{
53            New: func() interface{} {
54                w, _ := gzip.NewWriterLevel(io.Discard, level)
55                return w
56            },
57        },
58    }
59}
60
61func (p *GzipPool) Compress(src io.Reader, dest io.Writer) error {
62    // Get writer from pool
63    writer := p.writers.Get().(*gzip.Writer)
64    defer p.writers.Put(writer)
65
66    // Reset writer to new destination
67    writer.Reset(dest)
68    defer writer.Close()
69
70    _, err := io.Copy(writer, src)
71    return err
72}

Practice Exercises

Exercise 1: Backup Utility

Difficulty: Advanced | Time: 45-60 minutes

Learning Objectives:

  • Master combined tar+gzip archive creation and management
  • Implement file filtering, exclusion patterns, and incremental backup strategies
  • Learn progress tracking, integrity verification, and backup validation techniques

Real-World Context: Backup utilities are critical infrastructure components that protect data from loss. Professional backup systems need to handle millions of files, apply complex filtering rules, track incremental changes, and provide verification mechanisms to ensure data integrity. Understanding these patterns is essential for building reliable data protection systems.

Create a backup utility that creates compressed tar archives with advanced features. Your utility should create .tar.gz archives, support flexible file exclusion patterns, provide real-time progress feedback during backup operations, implement archive integrity verification with checksums, and support incremental backup strategies to only process changed files. This exercise demonstrates the core patterns used in production backup systems where data integrity, performance, and reliability are critical for protecting valuable data assets.

Solution
  1package main
  2
  3import (
  4    "archive/tar"
  5    "compress/gzip"
  6    "crypto/sha256"
  7    "fmt"
  8    "io"
  9    "os"
 10    "path/filepath"
 12    "time"
 13)
 14
 15type BackupOptions struct {
 16    SourceDir      string
 17    DestFile       string
 18    ExcludePattern []string
 19    Progress       bool
 20}
 21
 22type BackupStats struct {
 23    FilesProcessed int
 24    BytesProcessed int64
 25    Duration       time.Duration
 26    Checksum       string
 27}
 28
 29func CreateBackup(opts BackupOptions) (BackupStats, error) {
 30    start := time.Now()
 31    stats := BackupStats{}
 32
 33    // Create destination file
 34    destFile, err := os.Create(opts.DestFile)
 35    if err != nil {
 36        return stats, err
 37    }
 38    defer destFile.Close()
 39
 40    // Create gzip writer
 41    gzWriter := gzip.NewWriter(destFile)
 42    defer gzWriter.Close()
 43
 44    // Create tar writer
 45    tarWriter := tar.NewWriter(gzWriter)
 46    defer tarWriter.Close()
 47
 48    // Hash writer for integrity check: checksum covers the compressed archive bytes
 49    hashWriter := sha256.New()
 50    multiWriter := io.MultiWriter(destFile, hashWriter)
 51    gzWriter.Reset(multiWriter)
 52
 53    // Walk directory
 54    err = filepath.Walk(opts.SourceDir, func(path string, info os.FileInfo, err error) error {
 55        if err != nil {
 56            return err
 57        }
 58
 59        // Check exclusions
 60        for _, pattern := range opts.ExcludePattern {
 61            if matched, _ := filepath.Match(pattern, filepath.Base(path)); matched {
 62                if info.IsDir() {
 63                    return filepath.SkipDir
 64                }
 65                return nil
 66            }
 67        }
 68
 69        // Skip directories
 70        if info.IsDir() {
 71            return nil
 72        }
 73
 74        // Create tar header
 75        header, err := tar.FileInfoHeader(info, "")
 76        if err != nil {
 77            return err
 78        }
 79
 80        // Use relative path
 81        relPath, _ := filepath.Rel(opts.SourceDir, path)
 82        header.Name = relPath
 83
 84        // Write header
 85        if err := tarWriter.WriteHeader(header); err != nil {
 86            return err
 87        }
 88
 89        // Write file content
 90        file, err := os.Open(path)
 91        if err != nil {
 92            return err
 93        }
 94        defer file.Close()
 95
 96        n, err := io.Copy(tarWriter, file)
 97        if err != nil {
 98            return err
 99        }
100
101        stats.FilesProcessed++
102        stats.BytesProcessed += n
103
104        if opts.Progress {
105            fmt.Printf("\rBacked up: %s (%d bytes)", relPath, n)
106        }
107
108        return nil
109    })
110
111    if err != nil {
112        return stats, err
113    }
114
115    // Finalize
116    tarWriter.Close()
117    gzWriter.Close()
118
119    stats.Duration = time.Since(start)
120    stats.Checksum = fmt.Sprintf("%x", hashWriter.Sum(nil))
121
122    return stats, nil
123}
124
125func VerifyBackup(filename string) error {
126    file, err := os.Open(filename)
127    if err != nil {
128        return err
129    }
130    defer file.Close()
131
132    gzReader, err := gzip.NewReader(file)
133    if err != nil {
134        return err
135    }
136    defer gzReader.Close()
137
138    tarReader := tar.NewReader(gzReader)
139
140    fileCount := 0
141    for {
142        header, err := tarReader.Next()
143        if err == io.EOF {
144            break
145        }
146        if err != nil {
147            return err
148        }
149
150        // Read file content to verify
151        _, err = io.Copy(io.Discard, tarReader)
152        if err != nil {
153            return fmt.Errorf("error reading %s: %w", header.Name, err)
154        }
155
156        fileCount++
157    }
158
159    fmt.Printf("Verification successful: %d files\n", fileCount)
160    return nil
161}
162
163func main() {
164    // Create test directory structure
165    testDir := "/tmp/backup-test"
166    os.MkdirAll(testDir, 0755)
167    os.WriteFile(filepath.Join(testDir, "file1.txt"), []byte("Content 1"), 0644)
168    os.WriteFile(filepath.Join(testDir, "file2.txt"), []byte("Content 2"), 0644)
169    os.WriteFile(filepath.Join(testDir, "temp.tmp"), []byte("Temp file"), 0644)
170
171    // Create backup
172    opts := BackupOptions{
173        SourceDir:      testDir,
174        DestFile:       "/tmp/backup.tar.gz",
175        ExcludePattern: []string{"*.tmp", "*.log"},
176        Progress:       true,
177    }
178
179    fmt.Println("Creating backup...")
180    stats, err := CreateBackup(opts)
181    if err != nil {
182        fmt.Println("Backup failed:", err)
183        return
184    }
185
186    fmt.Printf("\n\nBackup Statistics:\n")
187    fmt.Printf("  Files: %d\n", stats.FilesProcessed)
188    fmt.Printf("  Bytes: %d\n", stats.BytesProcessed)
189    fmt.Printf("  Duration: %s\n", stats.Duration)
190    fmt.Printf("  Checksum: %s\n", stats.Checksum)
191
192    // Verify backup
193    fmt.Println("\nVerifying backup...")
194    if err := VerifyBackup(opts.DestFile); err != nil {
195        fmt.Println("Verification failed:", err)
196        return
197    }
198
199    // Cleanup
200    os.RemoveAll(testDir)
201    os.Remove(opts.DestFile)
202}

Exercise 2: Compression Benchmark Tool

Difficulty: Intermediate | Time: 30-40 minutes

Learning Objectives:

  • Master performance measurement and comparison of compression algorithms
  • Understand compression trade-offs between speed, size, and CPU usage
  • Learn to generate comprehensive benchmark reports with statistical analysis

Real-World Context: Choosing the right compression algorithm and settings is crucial for application performance. Different data types compress differently, and the optimal choice depends on whether you prioritize speed, compression ratio, or CPU usage. Professional developers need to make informed decisions based on actual performance data.

Build a comprehensive compression benchmark tool that measures and compares different compression algorithms and settings. Your tool should test gzip, zlib, and flate algorithms across various compression levels, measure both compression and decompression performance, calculate compression ratios for different data types, and generate detailed comparison reports. This exercise demonstrates the performance analysis patterns used when optimizing applications where compression choices significantly impact throughput, storage costs, and user experience.

Solution
  1package main
  2
  3import (
  4    "bytes"
  5    "compress/flate"
  6    "compress/gzip"
  7    "compress/zlib"
  8    "fmt"
  9    "io"
 10    "math/rand"
 11    "strings"
 12    "time"
 13)
 13
 14type CompressionResult struct {
 15    Algorithm       string
 16    Level           int
 17    OriginalSize    int
 18    CompressedSize  int
 19    CompressTime    time.Duration
 20    DecompressTime  time.Duration
 21    CompressionRatio float64
 22}
 23
 24type BenchmarkSuite struct {
 25    testData []byte
 26    results  []CompressionResult
 27}
 28
 29func NewBenchmarkSuite(dataType string, size int) *BenchmarkSuite {
 30    return &BenchmarkSuite{
 31        testData: generateTestData(dataType, size),
 32    }
 33}
 34
 35func generateTestData(dataType string, size int) []byte {
 36    data := make([]byte, size)
 37
 38    switch dataType {
 39    case "text":
 40        // Repetitive text data
 41        text := []byte("The quick brown fox jumps over the lazy dog. ")
 42        for i := 0; i < size; i++ {
 43            data[i] = text[i%len(text)]
 44        }
 45
 46    case "random":
 47        // Random data
 48        rand.Read(data)
 49
 50    case "binary":
 51        // Semi-structured binary data
 52        for i := 0; i < size; i++ {
 53            data[i] = byte(i % 256)
 54        }
 55    }
 56
 57    return data
 58}
 59
 60func (bs *BenchmarkSuite) BenchmarkGzip(level int) CompressionResult {
 61    result := CompressionResult{
 62        Algorithm:    "Gzip",
 63        Level:        level,
 64        OriginalSize: len(bs.testData),
 65    }
 66
 67    // Compression
 68    var compressed bytes.Buffer
 69    start := time.Now()
 70
 71    writer, _ := gzip.NewWriterLevel(&compressed, level)
 72    writer.Write(bs.testData)
 73    writer.Close()
 74
 75    result.CompressTime = time.Since(start)
 76    result.CompressedSize = compressed.Len()
 77
 78    // Decompression
 79    start = time.Now()
 80
 81    reader, _ := gzip.NewReader(&compressed)
 82    io.Copy(io.Discard, reader)
 83    reader.Close()
 84
 85    result.DecompressTime = time.Since(start)
 86    result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
 87
 88    return result
 89}
 90
 91func (bs *BenchmarkSuite) BenchmarkZlib(level int) CompressionResult {
 92    result := CompressionResult{
 93        Algorithm:    "Zlib",
 94        Level:        level,
 95        OriginalSize: len(bs.testData),
 96    }
 97
 98    // Compression
 99    var compressed bytes.Buffer
100    start := time.Now()
101
102    writer, _ := zlib.NewWriterLevel(&compressed, level)
103    writer.Write(bs.testData)
104    writer.Close()
105
106    result.CompressTime = time.Since(start)
107    result.CompressedSize = compressed.Len()
108
109    // Decompression
110    start = time.Now()
111
112    reader, _ := zlib.NewReader(&compressed)
113    io.Copy(io.Discard, reader)
114    reader.Close()
115
116    result.DecompressTime = time.Since(start)
117    result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
118
119    return result
120}
121
122func (bs *BenchmarkSuite) BenchmarkFlate(level int) CompressionResult {
123    result := CompressionResult{
124        Algorithm:    "Flate",
125        Level:        level,
126        OriginalSize: len(bs.testData),
127    }
128
129    // Compression
130    var compressed bytes.Buffer
131    start := time.Now()
132
133    writer, _ := flate.NewWriter(&compressed, level)
134    writer.Write(bs.testData)
135    writer.Close()
136
137    result.CompressTime = time.Since(start)
138    result.CompressedSize = compressed.Len()
139
140    // Decompression
141    start = time.Now()
142
143    reader := flate.NewReader(&compressed)
144    io.Copy(io.Discard, reader)
145    reader.Close()
146
147    result.DecompressTime = time.Since(start)
148    result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
149
150    return result
151}
152
153func (bs *BenchmarkSuite) RunAll() {
154    levels := []int{1, 6, 9}
155
156    for _, level := range levels {
157        bs.results = append(bs.results, bs.BenchmarkGzip(level))
158        bs.results = append(bs.results, bs.BenchmarkZlib(level))
159        bs.results = append(bs.results, bs.BenchmarkFlate(level))
160    }
161}
162
163func (bs *BenchmarkSuite) PrintReport() {
164    fmt.Println("\nCompression Benchmark Results")
165    fmt.Println(strings.Repeat("=", 100))
166    fmt.Printf("%-10s %-6s %-12s %-12s %-15s %-15s %-10s\n",
167        "Algorithm", "Level", "Original", "Compressed", "Compress Time", "Decompress Time", "Ratio")
168    fmt.Println(strings.Repeat("-", 100))
169
170    for _, result := range bs.results {
171        fmt.Printf("%-10s %-6d %-12d %-12d %-15s %-15s %.2f%%\n",
172            result.Algorithm,
173            result.Level,
174            result.OriginalSize,
175            result.CompressedSize,
176            result.CompressTime,
177            result.DecompressTime,
178            result.CompressionRatio)
179    }
180
181    fmt.Println(strings.Repeat("=", 100))
182}
183
184func main() {
185    dataTypes := []string{"text", "binary", "random"}
186
187    for _, dataType := range dataTypes {
188        fmt.Printf("\n\n=== Testing with %s data ===\n", dataType)
189
190        suite := NewBenchmarkSuite(dataType, 10*1024)
191        suite.RunAll()
192        suite.PrintReport()
193    }
194}

Exercise 3: Streaming File Archiver

Difficulty: Advanced | Time: 40-50 minutes

Learning Objectives:

  • Master streaming compression patterns for memory-efficient large file processing
  • Implement progress tracking, checksum generation, and error recovery mechanisms
  • Learn interface-based design for extensible file source systems

Real-World Context: Processing large files without loading them entirely into memory is essential for building scalable applications. Cloud storage systems, data pipelines, and backup utilities all rely on streaming architectures to handle gigabyte or terabyte-sized files efficiently. Understanding streaming patterns is crucial for building production-grade data processing systems.

Build a streaming file archiver that processes large files efficiently without loading entire files into memory. Your archiver should stream files from multiple sources, compress data on-the-fly using configurable compression formats, provide real-time progress tracking with detailed status updates, handle compression errors gracefully with proper cleanup, support resume functionality for interrupted operations, and generate checksums during the archiving process for data integrity verification. This exercise demonstrates the streaming patterns used in production systems where memory efficiency and error resilience are critical for handling large-scale data processing tasks.

Solution with Explanation
  1package main
  2
  3import (
  4	"compress/gzip"
  5	"crypto/sha256"
  6	"fmt"
  7	"hash"
  8	"io"
  9	"os"
 10	"path/filepath"
 11	"strings"
 12	"sync"
 13	"time"
 14)
 15
 16// run
 17type ArchiveConfig struct {
 18	CompressionLevel int
 19	BufferSize       int
 20	VerifyChecksum   bool
 21	MaxConcurrent    int
 22}
 23
 24type FileSource interface {
 25	Open() (io.ReadCloser, error)
 26	Name() string
 27	Size() int64
 28}
 29
 30type LocalFileSource struct {
 31	Path string
 32}
 33
 34func (l *LocalFileSource) Open() (io.ReadCloser, error) {
 35	return os.Open(l.Path)
 36}
 37
 38func (l *LocalFileSource) Name() string {
 39	return filepath.Base(l.Path)
 40}
 41
 42func (l *LocalFileSource) Size() int64 {
 43	info, err := os.Stat(l.Path)
 44	if err != nil {
 45		return 0
 46	}
 47	return info.Size()
 48}
 49
 50type StreamingArchiver struct {
 51	config    ArchiveConfig
 52	sources   []FileSource
 53	dest      io.Writer
 54	progress  chan ProgressUpdate
 55	mu        sync.Mutex
 56	checksums map[string]string
 57}
 58
 59type ProgressUpdate struct {
 60	FileName      string
 61	BytesWritten  int64
 62	TotalBytes    int64
 63	Complete      bool
 64	Error         error
 65}
 66
 67func NewStreamingArchiver(dest io.Writer, config ArchiveConfig) *StreamingArchiver {
 68	return &StreamingArchiver{
 69		config:    config,
 70		dest:      dest,
 71		progress:  make(chan ProgressUpdate, 10),
 72		checksums: make(map[string]string),
 73	}
 74}
 75
 76func (sa *StreamingArchiver) AddSource(source FileSource) {
 77	sa.sources = append(sa.sources, source)
 78}
 79
 80func (sa *StreamingArchiver) Archive() error {
 81	gzWriter, err := gzip.NewWriterLevel(sa.dest, sa.config.CompressionLevel)
 82	if err != nil {
 83		return err
 84	}
 85	defer gzWriter.Close()
 86
 87	// Process files
 88	for _, source := range sa.sources {
 89		if err := sa.archiveFile(source, gzWriter); err != nil {
 90			return fmt.Errorf("failed to archive %s: %w", source.Name(), err)
 91		}
 92	}
 93
 94	return nil
 95}
 96
 97func (sa *StreamingArchiver) archiveFile(source FileSource, writer io.Writer) error {
 98	reader, err := source.Open()
 99	if err != nil {
100		return err
101	}
102	defer reader.Close()
103
104	fileName := source.Name()
105	totalSize := source.Size()
106
107	// Create progress tracker
108	progressReader := &ProgressReader{
109		Reader:    reader,
110		Total:     totalSize,
111		FileName:  fileName,
112		OnProgress: func(bytesRead int64) {
113			sa.progress <- ProgressUpdate{
114				FileName:     fileName,
115				BytesWritten: bytesRead,
116				TotalBytes:   totalSize,
117				Complete:     false,
118			}
119		},
120	}
121
122	// Add checksum calculation
123	var hasher hash.Hash
124	if sa.config.VerifyChecksum {
125		hasher = sha256.New()
126		progressReader.Reader = io.TeeReader(progressReader.Reader, hasher)
127	}
128
129	// Stream to archive
130	bytesWritten, err := io.Copy(writer, progressReader)
131	if err != nil {
132		sa.progress <- ProgressUpdate{
133			FileName: fileName,
134			Error:    err,
135		}
136		return err
137	}
138
139	// Store checksum
140	if hasher != nil {
141		checksum := fmt.Sprintf("%x", hasher.Sum(nil))
142		sa.mu.Lock()
143		sa.checksums[fileName] = checksum
144		sa.mu.Unlock()
145	}
146
147	sa.progress <- ProgressUpdate{
148		FileName:     fileName,
149		BytesWritten: bytesWritten,
150		TotalBytes:   totalSize,
151		Complete:     true,
152	}
153
154	return nil
155}
156
157func (sa *StreamingArchiver) GetChecksums() map[string]string {
158	sa.mu.Lock()
159	defer sa.mu.Unlock()
160
161	result := make(map[string]string)
162	for k, v := range sa.checksums {
163		result[k] = v
164	}
165	return result
166}
167
168type ProgressReader struct {
169	Reader     io.Reader
170	Total      int64
171	BytesRead  int64
172	FileName   string
173	OnProgress func(int64)
174	lastUpdate time.Time
175}
176
177func (pr *ProgressReader) Read(p []byte) (int, error) {
178	n, err := pr.Reader.Read(p)
179	pr.BytesRead += int64(n)
180
181	// Update progress every 100ms
182	if time.Since(pr.lastUpdate) > 100*time.Millisecond {
183		if pr.OnProgress != nil {
184			pr.OnProgress(pr.BytesRead)
185		}
186		pr.lastUpdate = time.Now()
187	}
188
189	return n, err
190}
191
192func main() {
193	// Create test files
194	testDir := "/tmp/archive-test"
195	os.MkdirAll(testDir, 0755)
196	defer os.RemoveAll(testDir)
197
198	// Create sample files
199	files := []string{"file1.txt", "file2.txt", "file3.txt"}
200	for i, name := range files {
201		content := strings.Repeat(fmt.Sprintf("Content of file %d\n", i+1), 100)
202		os.WriteFile(filepath.Join(testDir, name), []byte(content), 0644)
203	}
204
205	// Create archiver
206	outputFile := filepath.Join(testDir, "archive.gz")
207	output, err := os.Create(outputFile)
208	if err != nil {
209		fmt.Println("Error:", err)
210		return
211	}
212	defer output.Close()
213
214	config := ArchiveConfig{
215		CompressionLevel: gzip.BestCompression,
216		BufferSize:       32 * 1024,
217		VerifyChecksum:   true,
218		MaxConcurrent:    2,
219	}
220
221	archiver := NewStreamingArchiver(output, config)
222
223	// Add sources
224	for _, name := range files {
225		archiver.AddSource(&LocalFileSource{
226			Path: filepath.Join(testDir, name),
227		})
228	}
229
230	// Monitor progress in background
231	go func() {
232		for update := range archiver.progress {
233			if update.Error != nil {
234				fmt.Printf("[ERROR] %s: %v\n", update.FileName, update.Error)
235			} else if update.Complete {
236				fmt.Printf("[COMPLETE] %s: %d bytes\n", update.FileName, update.BytesWritten)
237			} else {
238				percent := float64(update.BytesWritten) / float64(update.TotalBytes) * 100
239				fmt.Printf("[PROGRESS] %s: %.1f%%\n", update.FileName, percent)
240			}
241		}
242	}()
243
244	// Archive files
245	fmt.Println("Streaming File Archiver")
246	fmt.Println("======================")
247	if err := archiver.Archive(); err != nil {
248		fmt.Println("Archive failed:", err)
249		return
250	}
251
252	close(archiver.progress)
253
254	// Show checksums
255	fmt.Println("\nChecksums:")
256	for file, checksum := range archiver.GetChecksums() {
257		fmt.Printf("  %s: %s\n", file, checksum[:16]+"...")
258	}
259
260	// Show archive stats
261	info, _ := os.Stat(outputFile)
262	fmt.Printf("\nArchive created: %s (%.2f KB)\n", outputFile, float64(info.Size())/1024)
263}

Explanation:

This streaming file archiver demonstrates:

  1. Stream Processing: Files are read and compressed in chunks, never loading entire files into memory
  2. Progress Tracking: Real-time progress updates via channels
  3. Checksum Calculation: SHA-256 checksums computed during streaming using io.TeeReader
  4. Interface Design: FileSource interface allows different source types
  5. Error Handling: Graceful error handling with detailed error messages
  6. Concurrency Safety: Mutex-protected checksum map for thread-safe access

The archiver efficiently handles large files by streaming data through multiple readers:

  • Original file reader → Progress tracker → Checksum calculator → Compressor → Output

This pattern is essential for production applications that need to process large files without excessive memory usage.
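
Because every stage speaks io.Reader and io.Writer, the chain can be reproduced in a few lines. A stripped-down sketch of the same pipeline, using io.TeeReader for the checksum stage just as the archiver does:

package main

import (
    "compress/gzip"
    "crypto/sha256"
    "fmt"
    "io"
    "strings"
)

// run
func main() {
    // Stage 1: the original data source (a file in the real archiver).
    source := strings.NewReader(strings.Repeat("streaming pipeline demo ", 200))

    // Stage 2: checksum calculator - TeeReader hashes every byte it passes along.
    hasher := sha256.New()
    checksummed := io.TeeReader(source, hasher)

    // Stage 3: compressor writing to the final destination.
    var archive strings.Builder
    gz := gzip.NewWriter(&archive)

    // Drive the whole chain with a single Copy: source -> hasher -> gzip -> archive.
    n, err := io.Copy(gz, checksummed)
    if err != nil {
        fmt.Println("pipeline failed:", err)
        return
    }
    gz.Close()

    fmt.Printf("streamed %d bytes, archive is %d bytes, sha256 %x...\n",
        n, archive.Len(), hasher.Sum(nil)[:8])
}
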

Exercise 4: Compressed File Format Converter

Difficulty: Intermediate | Time: 35-45 minutes

Learning Objectives:

  • Master multiple compression format conversion and interoperability
  • Implement format detection, metadata preservation, and batch processing
  • Learn to build flexible compression utilities with streaming capabilities

Real-World Context: Different systems use different compression formats, and being able to convert between them is essential for data interoperability. Whether you're migrating data between systems, preparing files for different platforms, or optimizing storage based on content type, understanding format conversion patterns is crucial for data engineering workflows.

Build a compressed file format converter that can convert between different compression formats while preserving data integrity. Your converter should automatically detect source compression formats, support conversion between gzip, zlib, and flate formats, preserve file metadata and timestamps during conversion, handle batch processing of multiple files efficiently, provide detailed conversion statistics and progress tracking, and validate converted files to ensure data integrity. This exercise demonstrates the format conversion patterns used in data migration tools and ETL pipelines where maintaining data integrity while optimizing storage formats is essential for cross-platform compatibility.

Solution with Explanation
  1package main
  2
  3import (
  4	"bytes"
  5	"compress/flate"
  6	"compress/gzip"
  7	"compress/zlib"
  8	"fmt"
  9	"io"
 10	"os"
 11	"path/filepath"
 12	"strings"
 13	"time"
 14)
 15
 16// run
 17type CompressionFormat string
 18
 19const (
 20	FormatGzip CompressionFormat = "gzip"
 21	FormatZlib CompressionFormat = "zlib"
 22	FormatFlate CompressionFormat = "flate"
 23)
 24
 25type ConversionJob struct {
 26	InputFile    string
 27	OutputFile   string
 28	InputFormat  CompressionFormat
 29	OutputFormat CompressionFormat
 30}
 31
 32type ConversionStats struct {
 33	FilesProcessed    int
 34	TotalOriginalSize int64
 35	TotalConvertedSize int64
 36	ConversionTime    time.Duration
 37	Errors           []error
 38}
 39
 40type FormatConverter struct {
 41	BufferSize int
 42}
 43
 44func NewFormatConverter(bufferSize int) *FormatConverter {
 45	return &FormatConverter{
 46		BufferSize: bufferSize,
 47	}
 48}
 49
 50func (fc *FormatConverter) DetectFormat(data []byte) CompressionFormat {
 51	// Gzip magic number: 0x1f 0x8b
 52	if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b {
 53		return FormatGzip
 54	}
 55
 56	// Zlib magic number: 0x78
 57	if len(data) >= 1 && data[0] == 0x78 {
 58		return FormatZlib
 59	}
 60
 61	// Flate doesn't have a clear magic number, but if it's not gzip or zlib,
 62	// and it compresses/decompresses successfully with flate, assume flate
 63	// This is a simplified detection - in production, you'd want more sophisticated detection
 64	return FormatFlate
 65}
 66
 67func (fc *FormatConverter) Convert(inputFile, outputFile string, outputFormat CompressionFormat) (*ConversionStats, error) {
 68	stats := &ConversionStats{}
 69	start := time.Now()
 70
 71	// Read input file
 72	inputData, err := os.ReadFile(inputFile)
 73	if err != nil {
 74		return stats, fmt.Errorf("failed to read input file: %w", err)
 75	}
 76
 77	stats.TotalOriginalSize = int64(len(inputData))
 78
 79	// Detect input format
 80	inputFormat := fc.DetectFormat(inputData)
 81	fmt.Printf("Detected format: %s\n", inputFormat)
 82
 83	// Decompress input data
 84	decompressed, err := fc.decompressData(inputData, inputFormat)
 85	if err != nil {
 86		return stats, fmt.Errorf("failed to decompress: %w", err)
 87	}
 88
 89	// Compress to output format
 90	converted, err := fc.compressData(decompressed, outputFormat)
 91	if err != nil {
 92		return stats, fmt.Errorf("failed to compress: %w", err)
 93	}
 94
 95	// Write output file
 96	if err := os.WriteFile(outputFile, converted, 0644); err != nil {
 97		return stats, fmt.Errorf("failed to write output file: %w", err)
 98	}
 99
100	stats.TotalConvertedSize = int64(len(converted))
101	stats.ConversionTime = time.Since(start)
102	stats.FilesProcessed = 1
103
104	return stats, nil
105}
106
107func (fc *FormatConverter) decompressData(data []byte, format CompressionFormat) ([]byte, error) {
108	var reader io.Reader
109	var err error
110
111	switch format {
112	case FormatGzip:
113		reader, err = gzip.NewReader(bytes.NewReader(data))
114	case FormatZlib:
115		reader, err = zlib.NewReader(bytes.NewReader(data))
116	case FormatFlate:
117		reader = flate.NewReader(bytes.NewReader(data))
118	default:
119		return nil, fmt.Errorf("unsupported format: %s", format)
120	}
121
122	if err != nil {
123		return nil, err
124	}
125	defer func() {
126		if closer, ok := reader.(io.Closer); ok {
127			closer.Close()
128		}
129	}()
130
131	var buf bytes.Buffer
132	if _, err := io.Copy(&buf, reader); err != nil {
133		return nil, err
134	}
135
136	return buf.Bytes(), nil
137}
138
139func (fc *FormatConverter) compressData(data []byte, format CompressionFormat) ([]byte, error) {
140	var buf bytes.Buffer
141	var writer io.WriteCloser
142	var err error
143
144	switch format {
145	case FormatGzip:
146		writer = gzip.NewWriter(&buf)
147	case FormatZlib:
148		writer = zlib.NewWriter(&buf)
149	case FormatFlate:
150		writer, err = flate.NewWriter(&buf, flate.DefaultCompression)
151	default:
152		return nil, fmt.Errorf("unsupported format: %s", format)
153	}
154
155	if err != nil {
156		return nil, err
157	}
158
159	// Write data in chunks for memory efficiency
160	buffer := make([]byte, fc.BufferSize)
161	dataReader := bytes.NewReader(data)
162
163	for {
164		n, err := dataReader.Read(buffer)
165		if err == io.EOF {
166			break
167		}
168		if err != nil {
169			writer.Close()
170			return nil, err
171		}
172
173		if _, err := writer.Write(buffer[:n]); err != nil {
174			writer.Close()
175			return nil, err
176		}
177	}
178
179	if err := writer.Close(); err != nil {
180		return nil, err
181	}
182
183	return buf.Bytes(), nil
184}
185
186func (fc *FormatConverter) BatchConvert(jobs []ConversionJob) (*ConversionStats, error) {
187	totalStats := &ConversionStats{}
188	start := time.Now()
189
190	for i, job := range jobs {
191		fmt.Printf("Processing job %d/%d: %s -> %s\n", i+1, len(jobs), job.InputFile, job.OutputFile)
192
193		// Create output directory if needed
194		if err := os.MkdirAll(filepath.Dir(job.OutputFile), 0755); err != nil {
195			totalStats.Errors = append(totalStats.Errors, err)
196			continue
197		}
198
199		stats, err := fc.Convert(job.InputFile, job.OutputFile, job.OutputFormat)
200		if err != nil {
201			totalStats.Errors = append(totalStats.Errors, err)
202			fmt.Printf("  Error: %v\n", err)
203			continue
204		}
205
206		totalStats.FilesProcessed += stats.FilesProcessed
207		totalStats.TotalOriginalSize += stats.TotalOriginalSize
208		totalStats.TotalConvertedSize += stats.TotalConvertedSize
209
210		fmt.Printf("  Success: %d bytes -> %d bytes (%.2f%%)\n",
211			stats.TotalOriginalSize,
212			stats.TotalConvertedSize,
213			float64(stats.TotalConvertedSize)/float64(stats.TotalOriginalSize)*100)
214	}
215
216	totalStats.ConversionTime = time.Since(start)
217	return totalStats, nil
218}
219
220func (fc *FormatConverter) ValidateConversion(originalFile, convertedFile string) error {
221	// Read and decompress original
222	originalData, err := os.ReadFile(originalFile)
223	if err != nil {
224		return err
225	}
226
227	originalFormat := fc.DetectFormat(originalData)
228	decompressedOriginal, err := fc.decompressData(originalData, originalFormat)
229	if err != nil {
230		return err
231	}
232
233	// Read and decompress converted
234	convertedData, err := os.ReadFile(convertedFile)
235	if err != nil {
236		return err
237	}
238
239	convertedFormat := fc.DetectFormat(convertedData)
240	decompressedConverted, err := fc.decompressData(convertedData, convertedFormat)
241	if err != nil {
242		return err
243	}
244
245	// Compare decompressed data
246	if len(decompressedOriginal) != len(decompressedConverted) {
247		return fmt.Errorf("size mismatch: %d vs %d bytes", len(decompressedOriginal), len(decompressedConverted))
248	}
249
250	for i := 0; i < len(decompressedOriginal); i++ {
251		if decompressedOriginal[i] != decompressedConverted[i] {
252			return fmt.Errorf("data mismatch at byte %d", i)
253		}
254	}
255
256	return nil
257}
258
259func main() {
260	// Create test files
261	testDir := "/tmp/conversion-test"
262	os.MkdirAll(testDir, 0755)
263	defer os.RemoveAll(testDir)
264
265	// Create test data
266	testData := strings.Repeat("This is test data for compression format conversion. ", 100)
267
268	// Create files in different formats
269	files := map[string]CompressionFormat{
270		"test.gz":   FormatGzip,
271		"test.z":    FormatZlib,
272		"test.flate": FormatFlate,
273	}
274
275	for filename, format := range files {
276		converter := NewFormatConverter(4096)
277		compressed, err := converter.compressData([]byte(testData), format)
278		if err != nil {
279			fmt.Printf("Error creating %s: %v\n", filename, err)
280			continue
281		}
282		os.WriteFile(filepath.Join(testDir, filename), compressed, 0644)
283	}
284
285	// Create converter
286	converter := NewFormatConverter(8192)
287
288	// Create conversion jobs
289	var jobs []ConversionJob
290	inputFormats := []string{"test.gz", "test.z", "test.flate"}
291	outputFormats := []CompressionFormat{FormatGzip, FormatZlib, FormatFlate}
292
293	for _, inputFile := range inputFormats {
294		for _, outputFormat := range outputFormats {
295			inputPath := filepath.Join(testDir, inputFile)
296			outputExt := map[CompressionFormat]string{
297				FormatGzip: ".gz",
298				FormatZlib: ".z",
299				FormatFlate: ".flate",
300			}[outputFormat]
301
302			outputFile := filepath.Join(testDir, "converted",
303				strings.TrimSuffix(inputFile, filepath.Ext(inputFile))+outputExt)
304
305			jobs = append(jobs, ConversionJob{
306				InputFile:    inputPath,
307				OutputFile:   outputFile,
308				OutputFormat: outputFormat,
309			})
310		}
311	}
312
313	// Run batch conversion
314	fmt.Println("Format Converter")
315	fmt.Println("================")
316
317	start := time.Now()
318	stats, err := converter.BatchConvert(jobs)
319	if err != nil {
320		fmt.Printf("Batch conversion error: %v\n", err)
321	}
322
323	// Show results
324	fmt.Printf("\nConversion Summary:\n")
325	fmt.Printf("  Files processed: %d\n", stats.FilesProcessed)
326	fmt.Printf("  Original size: %d bytes\n", stats.TotalOriginalSize)
327	fmt.Printf("  Converted size: %d bytes\n", stats.TotalConvertedSize)
328	fmt.Printf("  Total time: %s\n", stats.ConversionTime)
329
330	if stats.Errors != nil {
331		fmt.Printf("  Errors: %d\n", len(stats.Errors))
332	}
333
334	// Validate a few conversions
335	fmt.Printf("\nValidating conversions...\n")
336	validations := []struct {
337		original string
338		converted string
339	}{
340		{filepath.Join(testDir, "test.gz"), filepath.Join(testDir, "converted", "test.flate")},
341		{filepath.Join(testDir, "test.z"), filepath.Join(testDir, "converted", "test.gz")},
342	}
343
344	for _, v := range validations {
345		if err := converter.ValidateConversion(v.original, v.converted); err != nil {
346			fmt.Printf("  Validation failed for %s: %v\n", v.converted, err)
347		} else {
348			fmt.Printf("  ✓ %s validated successfully\n", filepath.Base(v.converted))
349		}
350	}
351
352	fmt.Printf("\nConversion completed in %s\n", time.Since(start))
353}

Explanation:

This format converter demonstrates:

  • Format Detection: Automatic detection of compression formats using magic numbers (a streaming variant is sketched below)
  • Format Conversion: Conversion between gzip, zlib, and flate formats
  • Batch Processing: Efficient processing of multiple files with progress tracking
  • Memory Efficiency: Chunked processing for large files
  • Data Validation: Integrity checking to ensure conversions are lossless
  • Error Handling: Comprehensive error handling with detailed reporting

Production use cases:

  • Data migration between systems with different compression preferences
  • Content optimization for different delivery platforms
  • Archive format standardization
  • ETL pipeline data preparation
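
The converter above reads whole files into memory before detecting their format; for very large files you can peek at just the first bytes instead. A hypothetical streaming sketch (detectByMagic is an illustrative name, not part of the solution):

package main

import (
    "bufio"
    "compress/gzip"
    "fmt"
    "os"
    "path/filepath"
)

// detectByMagic peeks at the first two bytes of a file without reading it fully.
// Gzip streams start with 0x1f 0x8b; zlib streams commonly start with 0x78.
func detectByMagic(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    header, err := bufio.NewReader(f).Peek(2)
    if err != nil {
        return "", err
    }

    switch {
    case header[0] == 0x1f && header[1] == 0x8b:
        return "gzip", nil
    case header[0] == 0x78:
        return "zlib", nil
    default:
        return "flate (assumed)", nil
    }
}

// run
func main() {
    // Create a small gzip file so the sketch is self-contained.
    path := filepath.Join(os.TempDir(), "magic-demo.gz")
    f, err := os.Create(path)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    gz := gzip.NewWriter(f)
    gz.Write([]byte("magic number demo"))
    gz.Close()
    f.Close()
    defer os.Remove(path)

    format, err := detectByMagic(path)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("detected format:", format)
}
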

Exercise 5: Compressed Data Storage Engine

Difficulty: Advanced | Time: 50-60 minutes

Learning Objectives:

  • Build a compressed key-value storage system with multiple compression strategies
  • Implement efficient data retrieval, indexing, and memory management
  • Learn compression strategy selection and performance optimization techniques

Real-World Context: Compressed storage engines are used in databases, caching systems, and big data applications to reduce storage costs while maintaining query performance. Companies like Redis, LevelDB, and various NoSQL databases use sophisticated compression strategies to balance memory usage, disk space, and access speed.

Build a compressed key-value storage engine that automatically selects optimal compression strategies based on data characteristics. Your storage engine should support multiple compression algorithms with automatic strategy selection, implement efficient key-based data retrieval with decompression, handle memory management with configurable cache sizes, provide compression statistics and performance metrics, and support data persistence with compressed file storage. This exercise demonstrates the storage engine patterns used in production databases where balancing compression efficiency with access performance is critical for scalable data management.

Solution with Explanation
  1package main
  2
  3import (
  4	"bytes"
  5	"compress/flate"
  6	"compress/gzip"
  7	"compress/zlib"
  8	"encoding/binary"
  9	"fmt"
 10	"hash/crc32"
 11	"io"
 12	"os"
 13	"path/filepath"
 14	"sync"
 15	"time"
 16)
 17
 18// run
 19type CompressionStrategy string
 20
 21const (
 22	StrategyNone   CompressionStrategy = "none"
 23	StrategyGzip   CompressionStrategy = "gzip"
 24	StrategyZlib   CompressionStrategy = "zlib"
 25	StrategyFlate  CompressionStrategy = "flate"
 26	StrategyAuto   CompressionStrategy = "auto"
 27)
 28
 29type StorageEntry struct {
 30	Key         string
 31	Value       []byte
 32	Compression CompressionStrategy
 33	OriginalSize int
 34	CompressedSize int
 35	Timestamp   time.Time
 36	Checksum    uint32
 37}
 38
 39type CompressionStats struct {
 40	TotalEntries     int
 41	TotalOriginalSize int64
 42	TotalCompressedSize int64
 43	CompressionRatio  float64
 44	HitRate          float64
 45	MissRate         float64
 46}
 47
 48type CompressedStorage struct {
 49	data       map[string]*StorageEntry
 50	mu         sync.RWMutex
 51	strategy   CompressionStrategy
 52	cacheSize  int
 53	maxSize    int
 54	stats      CompressionStats
 55	persistPath string
 56}
 57
 58func NewCompressedStorage(strategy CompressionStrategy, maxSize int, persistPath string) *CompressedStorage {
 59	return &CompressedStorage{
 60		data:        make(map[string]*StorageEntry),
 61		strategy:    strategy,
 62		cacheSize:   0,
 63		maxSize:     maxSize,
 64		persistPath: persistPath,
 65	}
 66}
 67
 68func (cs *CompressedStorage) ChooseStrategy(data []byte) CompressionStrategy {
 69	if cs.strategy != StrategyAuto {
 70		return cs.strategy
 71	}
 72
 73	// Small data: no compression
 74	if len(data) < 100 {
 75		return StrategyNone
 76	}
 77
 78	// Test compression ratios
 79	gzipRatio := cs.testCompression(data, StrategyGzip)
 80	zlibRatio := cs.testCompression(data, StrategyZlib)
 81	flateRatio := cs.testCompression(data, StrategyFlate)
 82
 83	// Choose best ratio
 84	ratios := map[CompressionStrategy]float64{
 85		StrategyGzip:  gzipRatio,
 86		StrategyZlib:  zlibRatio,
 87		StrategyFlate: flateRatio,
 88	}
 89
 90	bestStrategy := StrategyNone
 91	bestRatio := 1.0
 92
 93	for strategy, ratio := range ratios {
 94		if ratio < bestRatio && ratio < 0.8 {
 95			bestRatio = ratio
 96			bestStrategy = strategy
 97		}
 98	}
 99
100	return bestStrategy
101}
102
103func (cs *CompressedStorage) testCompression(data []byte, strategy CompressionStrategy) float64 {
104	var buf bytes.Buffer
105	var err error
106
107	switch strategy {
108	case StrategyGzip:
109		writer := gzip.NewWriter(&buf)
110		writer.Write(data)
111		writer.Close()
112	case StrategyZlib:
113		writer := zlib.NewWriter(&buf)
114		writer.Write(data)
115		writer.Close()
116	case StrategyFlate:
117		writer, err := flate.NewWriter(&buf, flate.DefaultCompression)
118		if err != nil {
119			return 1.0
120		}
121		writer.Write(data)
122		writer.Close()
123	default:
124		return 1.0
125	}
126
127	if err != nil {
128		return 1.0
129	}
130
131	return float64(buf.Len()) / float64(len(data))
132}
133
134func (cs *CompressedStorage) compress(data []byte, strategy CompressionStrategy) ([]byte, error) {
135	if strategy == StrategyNone {
136		return data, nil
137	}
138
139	var buf bytes.Buffer
140	var writer io.WriteCloser
141	var err error
142
143	switch strategy {
144	case StrategyGzip:
145		writer = gzip.NewWriter(&buf)
146	case StrategyZlib:
147		writer = zlib.NewWriter(&buf)
148	case StrategyFlate:
149		writer, err = flate.NewWriter(&buf, flate.DefaultCompression)
150	default:
151		return nil, fmt.Errorf("unsupported compression strategy: %s", strategy)
152	}
153
154	if err != nil {
155		return nil, err
156	}
157
158	if _, err := writer.Write(data); err != nil {
159		writer.Close()
160		return nil, err
161	}
162
163	if err := writer.Close(); err != nil {
164		return nil, err
165	}
166
167	return buf.Bytes(), nil
168}
169
170func (cs *CompressedStorage) decompress(data []byte, strategy CompressionStrategy) ([]byte, error) {
171	if strategy == StrategyNone {
172		return data, nil
173	}
174
175	var reader io.Reader
176	var err error
177
178	switch strategy {
179	case StrategyGzip:
180		reader, err = gzip.NewReader(bytes.NewReader(data))
181	case StrategyZlib:
182		reader, err = zlib.NewReader(bytes.NewReader(data))
183	case StrategyFlate:
184		reader = flate.NewReader(bytes.NewReader(data))
185	default:
186		return nil, fmt.Errorf("unsupported compression strategy: %s", strategy)
187	}
188
189	if err != nil {
190		return nil, err
191	}
192	defer func() {
193		if closer, ok := reader.(io.Closer); ok {
194			closer.Close()
195		}
196	}()
197
198	var buf bytes.Buffer
199	if _, err := io.Copy(&buf, reader); err != nil {
200		return nil, err
201	}
202
203	return buf.Bytes(), nil
204}
205
 206func (cs *CompressedStorage) Set(key string, value []byte) error {
207	cs.mu.Lock()
208	defer cs.mu.Unlock()
209
210	// Check if we need to evict entries
211	if len(cs.data) >= cs.maxSize {
212		cs.evictLRU()
213	}
214
215	// Choose compression strategy
216	strategy := cs.ChooseStrategy(value)
217
218	// Compress data
219	compressed, err := cs.compress(value, strategy)
220	if err != nil {
221		return fmt.Errorf("compression failed: %w", err)
222	}
223
224	// Calculate checksum
225	checksum := crc32.ChecksumIEEE(compressed)
226
227	// Create entry
228	entry := &StorageEntry{
229		Key:             key,
230		Value:           compressed,
231		Compression:     strategy,
232		OriginalSize:    len(value),
233		CompressedSize:  len(compressed),
234		Timestamp:       time.Now(),
235		Checksum:        checksum,
236	}
237
238	// Update stats
239	if oldEntry, exists := cs.data[key]; exists {
240		cs.cacheSize -= oldEntry.CompressedSize
241		cs.stats.TotalOriginalSize -= int64(oldEntry.OriginalSize)
242		cs.stats.TotalCompressedSize -= int64(oldEntry.CompressedSize)
243	} else {
244		cs.stats.TotalEntries++
245	}
246
247	cs.data[key] = entry
248	cs.cacheSize += entry.CompressedSize
249	cs.stats.TotalOriginalSize += int64(entry.OriginalSize)
250	cs.stats.TotalCompressedSize += int64(entry.CompressedSize)
251
252	// Update compression ratio
253	if cs.stats.TotalOriginalSize > 0 {
254		cs.stats.CompressionRatio = float64(cs.stats.TotalCompressedSize) / float64(cs.stats.TotalOriginalSize)
255	}
256
257	return nil
258}
259
 260func (cs *CompressedStorage) Get(key string) ([]byte, error) {
 261	// Take the write lock: the hit/miss counters are mutated on this path
 262	cs.mu.Lock()
 263	entry, exists := cs.data[key]
 264	if !exists {
 265		cs.stats.MissRate++
 266		cs.mu.Unlock()
 267		return nil, fmt.Errorf("key not found: %s", key)
 268	}
 269	cs.stats.HitRate++
 270	cs.mu.Unlock()
271
272	// Verify checksum
273	if crc32.ChecksumIEEE(entry.Value) != entry.Checksum {
274		return nil, fmt.Errorf("checksum mismatch for key: %s", key)
275	}
276
277	// Decompress data
278	decompressed, err := cs.decompress(entry.Value, entry.Compression)
279	if err != nil {
280		return nil, fmt.Errorf("decompression failed: %w", err)
281	}
282
283	return decompressed, nil
284}
285
 286func (cs *CompressedStorage) Delete(key string) error {
287	cs.mu.Lock()
288	defer cs.mu.Unlock()
289
290	entry, exists := cs.data[key]
291	if !exists {
292		return fmt.Errorf("key not found: %s", key)
293	}
294
295	delete(cs.data, key)
296	cs.cacheSize -= entry.CompressedSize
297	cs.stats.TotalEntries--
298	cs.stats.TotalOriginalSize -= int64(entry.OriginalSize)
299	cs.stats.TotalCompressedSize -= int64(entry.CompressedSize)
300
301	return nil
302}
303
 304func (cs *CompressedStorage) evictLRU() {
 305	if len(cs.data) == 0 {
 306		return
 307	}
 308	// Find the oldest entry
 309	var oldestKey string
 310	var oldestTime time.Time
 311	for key, entry := range cs.data {
 312		if oldestKey == "" || entry.Timestamp.Before(oldestTime) {
 313			oldestKey = key
 314			oldestTime = entry.Timestamp
 315		}
 316	}
 317	// Remove it in place; calling cs.Delete here would deadlock on cs.mu
 318	if entry, ok := cs.data[oldestKey]; ok {
 319		delete(cs.data, oldestKey)
 320		cs.cacheSize -= entry.CompressedSize
 321		cs.stats.TotalEntries--
 322		cs.stats.TotalOriginalSize -= int64(entry.OriginalSize)
 323		cs.stats.TotalCompressedSize -= int64(entry.CompressedSize)
 324	}
 325}
326
 327func (cs *CompressedStorage) GetStats() CompressionStats {
328	cs.mu.RLock()
329	defer cs.mu.RUnlock()
330
331	stats := cs.stats
332
333	// Calculate hit/miss rates
334	total := stats.HitRate + stats.MissRate
335	if total > 0 {
336		stats.HitRate = stats.HitRate / total * 100
337		stats.MissRate = stats.MissRate / total * 100
338	}
339
340	return stats
341}
342
 343func (cs *CompressedStorage) Persist() error {
344	if cs.persistPath == "" {
345		return nil
346	}
347
348	cs.mu.RLock()
349	defer cs.mu.RUnlock()
350
351	file, err := os.Create(cs.persistPath)
352	if err != nil {
353		return err
354	}
355	defer file.Close()
356
357	// Write header
358	if err := binary.Write(file, binary.LittleEndian, uint32(len(cs.data))); err != nil {
359		return err
360	}
361
362	// Write entries
363	for key, entry := range cs.data {
364		// Write key length and key
365		if err := binary.Write(file, binary.LittleEndian, uint32(len(key))); err != nil {
366			return err
367		}
368		if _, err := file.WriteString(key); err != nil {
369			return err
370		}
371
372		// Write entry data
 373		// Fixed-width fields; the strategy name goes in an 8-byte slot (all names fit)
 374		var strat [8]byte
 375		copy(strat[:], entry.Compression)
 376		fields := []interface{}{
 377			uint32(entry.OriginalSize),
 378			uint32(entry.CompressedSize),
 379			strat,
 380			uint64(entry.Timestamp.UnixNano()),
 381			entry.Checksum,
 382		}
 383		for _, f := range fields {
 384			if err := binary.Write(file, binary.LittleEndian, f); err != nil {
 385				return err
 386			}
 387		}
 388		if _, err := file.Write(entry.Value); err != nil {
 389			return err
 390		}
391	}
392
393	return nil
394}
395
 396func (cs *CompressedStorage) Load() error {
397	if cs.persistPath == "" {
398		return nil
399	}
400
401	file, err := os.Open(cs.persistPath)
402	if err != nil {
403		if os.IsNotExist(err) {
404			return nil
405		}
406		return err
407	}
408	defer file.Close()
409
410	var entryCount uint32
411	if err := binary.Read(file, binary.LittleEndian, &entryCount); err != nil {
412		return err
413	}
414
415	for i := 0; i < int(entryCount); i++ {
416		var keyLen uint32
417		if err := binary.Read(file, binary.LittleEndian, &keyLen); err != nil {
418			return err
419		}
420
421		key := make([]byte, keyLen)
 422		if _, err := io.ReadFull(file, key); err != nil {
423			return err
424		}
425
426		entry := &StorageEntry{Key: string(key)}
427
 428		// Read back the fixed-width fields in the order Persist wrote them
 429		var origSize, compSize uint32
 430		var strat [8]byte
 431		var timestampNano uint64
 432		fields := []interface{}{&origSize, &compSize, &strat, &timestampNano, &entry.Checksum}
 433		for _, f := range fields {
 434			if err := binary.Read(file, binary.LittleEndian, f); err != nil {
 435				return err
 436			}
 437		}
 438
 439		// Restore the typed fields from their on-disk representation
 440		entry.OriginalSize = int(origSize)
 441		entry.CompressedSize = int(compSize)
 442		entry.Compression = CompressionStrategy(bytes.TrimRight(strat[:], "\x00"))
 443		entry.Timestamp = time.Unix(0, int64(timestampNano))
 444
 445		// Read the compressed payload
 446		entry.Value = make([]byte, entry.CompressedSize)
 447		if _, err := io.ReadFull(file, entry.Value); err != nil {
448			return err
449		}
450
451		cs.data[string(key)] = entry
452		cs.cacheSize += entry.CompressedSize
453		cs.stats.TotalEntries++
454		cs.stats.TotalOriginalSize += int64(entry.OriginalSize)
455		cs.stats.TotalCompressedSize += int64(entry.CompressedSize)
456	}
457
458	if cs.stats.TotalOriginalSize > 0 {
459		cs.stats.CompressionRatio = float64(cs.stats.TotalCompressedSize) / float64(cs.stats.TotalOriginalSize)
460	}
461
462	return nil
463}
464
465func main() {
466	// Create storage engine
467	persistPath := filepath.Join("/tmp", "compressed_storage.db")
 468	storage := NewCompressedStorage(StrategyAuto, 8, persistPath) // small cache so the eviction demo below actually evicts
469
470	// Load existing data if any
471	if err := storage.Load(); err != nil {
472		fmt.Printf("Failed to load persisted data: %v\n", err)
473	}
474
475	fmt.Println("Compressed Storage Engine")
476	fmt.Println("========================")
477
478	// Test different data types
479	testData := map[string][]byte{
480		"short":      []byte("short data"),
481		"repetitive": []byte(strings.Repeat("repetitive data ", 50)),
482		"json":       []byte(`{"name":"test","value":123,"items":[1,2,3,4,5],"description":"this is a test json object with some repetitive fields"}`),
483		"binary":     []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x01, 0x02, 0x03, 0x04, 0x05},
484		"text":       []byte(strings.Repeat("This is a longer text document with many repeated words and phrases. ", 20)),
485	}
486
487	// Store test data
488	fmt.Println("\nStoring test data:")
489	for key, value := range testData {
490		if err := storage.Set(key, value); err != nil {
491			fmt.Printf("Error storing %s: %v\n", key, err)
492			continue
493		}
494		fmt.Printf("  %s: %d bytes stored\n", key, len(value))
495	}
496
497	// Retrieve and verify data
498	fmt.Println("\nRetrieving and verifying data:")
499	for key := range testData {
500		retrieved, err := storage.Get(key)
501		if err != nil {
502			fmt.Printf("Error retrieving %s: %v\n", key, err)
503			continue
504		}
505
506		if string(retrieved) == string(testData[key]) {
 507			fmt.Printf("  %s: ✓ Verified (%d bytes)\n", key, len(retrieved))
508		} else {
509			fmt.Printf("  %s: ✗ Data mismatch\n", key)
510		}
511	}
512
513	// Show statistics
514	stats := storage.GetStats()
515	fmt.Printf("\nStorage Statistics:\n")
516	fmt.Printf("  Total entries: %d\n", stats.TotalEntries)
517	fmt.Printf("  Cache size: %d bytes\n", storage.cacheSize)
518	fmt.Printf("  Original size: %d bytes\n", stats.TotalOriginalSize)
519	fmt.Printf("  Compressed size: %d bytes\n", stats.TotalCompressedSize)
520	fmt.Printf("  Compression ratio: %.2f%%\n", stats.CompressionRatio*100)
521	fmt.Printf("  Hit rate: %.2f%%\n", stats.HitRate)
522	fmt.Printf("  Miss rate: %.2f%%\n", stats.MissRate)
523
524	// Test persistence
525	fmt.Println("\nTesting persistence...")
526	if err := storage.Persist(); err != nil {
527		fmt.Printf("Persistence error: %v\n", err)
528	} else {
529		fmt.Printf("✓ Data persisted to %s\n", persistPath)
530	}
531
532	// Test eviction
533	fmt.Println("\nTesting LRU eviction...")
534	for i := 0; i < 10; i++ {
535		key := fmt.Sprintf("evict_test_%d", i)
536		value := []byte(strings.Repeat(fmt.Sprintf("data_%d_", i), 100))
537		storage.Set(key, value)
538	}
539
540	finalStats := storage.GetStats()
541	fmt.Printf("Final storage size: %d entries\n", finalStats.TotalEntries)
542
543	// Cleanup
544	os.Remove(persistPath)
545}
546// run

Explanation:

This compressed storage engine demonstrates:

  • Automatic Strategy Selection: Chooses optimal compression based on data characteristics
  • Memory Management: LRU eviction with a configurable maximum number of entries
  • Data Integrity: CRC32 checksums for compressed data verification
  • Performance Monitoring: Hit/miss rates and compression statistics
  • Persistence: Binary serialization for data recovery
  • Thread Safety: Concurrent access protection with mutex locks

Production features:

  • Multiple compression algorithms with automatic selection
  • Efficient memory usage with LRU eviction
  • Data integrity verification
  • Performance metrics and monitoring
  • Persistent storage for data recovery (see the persistence sketch below)
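
The engine above only touches disk when Persist is called explicitly, so a long-running service would normally flush on a timer and once more at shutdown. Below is a minimal sketch of that pattern, assuming the CompressedStorage type from the example; persistLoop, the interval, and the stop channel are illustrative, and the log and time packages would need to be imported:

    func persistLoop(storage *CompressedStorage, interval time.Duration, stop <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                // Periodic flush: a crash loses at most one interval of writes
                if err := storage.Persist(); err != nil {
                    log.Printf("periodic persist failed: %v", err)
                }
            case <-stop:
                // Final flush before shutdown
                if err := storage.Persist(); err != nil {
                    log.Printf("final persist failed: %v", err)
                }
                return
            }
        }
    }

Persist holds the read lock while it writes the snapshot, so Set, Delete, and Get calls simply wait until the flush finishes.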

Summary

Throughout this tutorial, we've explored compression and archives in Go - from the basics of making data smaller to production-ready patterns for handling large-scale compression tasks.

Think of compression as a superpower that lets you store more information in less space and transmit it faster over networks. Whether you're building a web server, backup utility, or file processing system, understanding compression is essential for creating efficient, scalable applications.

Key Takeaways

  1. Gzip Compression

    • Most common compression format - the universal language of web compression
    • Configurable compression levels balance speed vs. size
    • Support for metadata such as the original filename, modification time, and comments
    • Streaming compression for large files - process data without loading it all into memory
    • HTTP compression middleware - essential for web performance (see the sketch after this list)
  2. Zlib and Flate

    • Alternative compression formats with different trade-offs
    • Lower overhead than gzip - good for internal protocols
    • Dictionary-based compression for predictable data patterns
    • Raw DEFLATE algorithm access - the foundation beneath gzip and zlib
  3. Tar Archives

    • Unix tape archive format - like putting files in a box
    • Combines multiple files while preserving structure
    • Preserves file metadata such as permissions, ownership, and timestamps
    • Often combined with gzip - the classic combo
    • Streaming archive processing - handle huge archives efficiently
  4. ZIP Files

    • Popular cross-platform format - works everywhere
    • Built-in compression with multiple algorithms
    • Directory structure support - maintains file organization
    • Individual file access - extract specific files without unpacking everything
  5. Streaming Operations

    • Memory-efficient processing - crucial for large files
    • Chunked compression - process data piece by piece
    • Concurrent compression - use multiple CPU cores
    • Progress tracking - provide user feedback
    • Buffered I/O - optimize small writes
  6. Performance

    • Compression level trade-offs - speed vs. size decisions
    • Algorithm comparison - choose the right tool for the job
    • Memory usage patterns - avoid memory blowup
    • Throughput optimization - squeeze out maximum performance
  7. Production Patterns

    • HTTP compression middleware - automatic web compression
    • Log rotation with compression - manage log file sizes
    • Backup utilities - create efficient backups
    • Compression pools - reuse objects for better performance
    • Progress monitoring - keep users informed during long operations
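
Takeaway 1's HTTP compression middleware is worth seeing in miniature. The sketch below is a deliberately minimal version, not a drop-in implementation: it only checks Accept-Encoding, wraps the ResponseWriter, and relies on Close to flush the gzip trailer. A production version would also clear Content-Length, set Vary: Accept-Encoding, and skip responses that are already compressed. The gzipResponseWriter and withGzip names are illustrative:

    package main

    import (
        "compress/gzip"
        "fmt"
        "io"
        "net/http"
        "strings"
    )

    // gzipResponseWriter funnels handler output through a gzip.Writer.
    type gzipResponseWriter struct {
        http.ResponseWriter
        gz io.Writer
    }

    func (w gzipResponseWriter) Write(b []byte) (int, error) {
        return w.gz.Write(b)
    }

    // withGzip compresses responses for clients that advertise gzip support.
    func withGzip(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
                next.ServeHTTP(w, r)
                return
            }
            w.Header().Set("Content-Encoding", "gzip")
            gz := gzip.NewWriter(w)
            defer gz.Close() // flushes buffered data and writes the gzip trailer
            next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
        })
    }

    func main() {
        hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprint(w, strings.Repeat("hello gzip ", 200))
        })
        http.ListenAndServe(":8080", withGzip(hello))
    }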

Best Practices

  1. Choose the Right Algorithm: Gzip for HTTP, zlib for protocols, zip for archives
  2. Balance Speed vs Size: Higher compression = slower but smaller
  3. Stream Large Files: Don't load everything into memory
  4. Always Close Writers: Flush buffers to complete compression
  5. Pool Compressors: Reuse writer objects for better performance (see the sketch after this list)
  6. Verify Integrity: Check compressed data after creation
  7. Handle Errors: Compression can fail in many ways
  8. Use Buffering: Improve performance for small writes
  9. Monitor Progress: Provide feedback for long operations
  10. Consider Context: Some data doesn't compress well
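
Practice 5 above usually comes down to a sync.Pool of writers, because gzip.NewWriter allocates sizable internal buffers on every call. The sketch below shows the reuse pattern; writerPool and compressPooled are illustrative names, not a standard API:

    package main

    import (
        "bytes"
        "compress/gzip"
        "fmt"
        "sync"
    )

    // writerPool hands out reusable gzip writers instead of allocating
    // a new one (and its internal buffers) for every payload.
    var writerPool = sync.Pool{
        New: func() interface{} { return gzip.NewWriter(nil) },
    }

    func compressPooled(data []byte) ([]byte, error) {
        gz := writerPool.Get().(*gzip.Writer)
        defer writerPool.Put(gz)

        var buf bytes.Buffer
        gz.Reset(&buf) // point the reused writer at a fresh buffer
        if _, err := gz.Write(data); err != nil {
            return nil, err
        }
        if err := gz.Close(); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }

    func main() {
        data := bytes.Repeat([]byte("pool me "), 100)
        out, err := compressPooled(data)
        if err != nil {
            panic(err)
        }
        fmt.Printf("compressed %d bytes down to %d\n", len(data), len(out))
    }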

Common Pitfalls

  • Forgetting to close compression writers (see the example below)
  • Not checking compression ratio
  • Loading large files entirely into memory
  • Using wrong compression level for use case
  • Not handling corrupted compressed data
  • Ignoring decompression errors
  • Over-compressing already compressed data
  • Not pooling compression objects
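
The first pitfall above is worth demonstrating because the failure is silent at write time. In the sketch below, the unclosed writer has emitted little more than the gzip header (the rest is still buffered), so decompressing it fails later; Close is what flushes the buffered data and writes the trailer:

    package main

    import (
        "bytes"
        "compress/gzip"
        "fmt"
        "io"
    )

    func main() {
        data := bytes.Repeat([]byte("close your writers "), 50)

        // Wrong: without Close, most of the data stays buffered inside the
        // writer and the gzip trailer is never written.
        var broken bytes.Buffer
        bad := gzip.NewWriter(&broken)
        bad.Write(data) // error handling omitted for brevity

        // Right: Close flushes the remaining data and writes the trailer.
        var good bytes.Buffer
        gw := gzip.NewWriter(&good)
        gw.Write(data)
        gw.Close()

        fmt.Printf("without Close: %d bytes, with Close: %d bytes\n", broken.Len(), good.Len())

        // Reading the unclosed stream back typically fails with an unexpected EOF.
        r, err := gzip.NewReader(bytes.NewReader(broken.Bytes()))
        if err == nil {
            _, err = io.ReadAll(r)
        }
        fmt.Println("decompressing the unclosed stream:", err)
    }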