Compression is like packing a suitcase - you want to fit as much as possible while keeping everything organized and accessible. Just as you'd fold clothes carefully and use vacuum bags to save space, compression algorithms find patterns in data and represent them more efficiently.
Think about it: when you're sending photos to a friend, you wouldn't mail each photo in a separate envelope. You'd pack them together efficiently. Compression does the same thing with data - it removes redundancy and represents repeated patterns more compactly.
Go's standard library provides excellent support for multiple compression algorithms and archive formats, making it easy to implement these "data packing" strategies in your applications.
Introduction to Compression
The Mathematics of Space: Compression is fundamentally about information theory. Claude Shannon proved that data has an intrinsic entropy—a theoretical minimum size based on its information content. Compression algorithms try to approach this limit by finding and exploiting patterns.
Two Types of Compression: Lossless compression (like gzip) preserves every bit of original data—crucial for text, code, and data files. Lossy compression (like JPEG) discards less important information to achieve higher ratios—acceptable for images, audio, and video where perfect reproduction isn't necessary.
Real-World Impact: Google reports that gzip compression reduces bandwidth by 60-80% for typical web pages. For mobile users on metered connections, compression can mean the difference between usable and unusable applications. For data centers, compression reduces storage costs and increases effective network capacity.
Consider a book where the phrase "the quick brown fox" appears on every page. Instead of writing it out every time, you could just write "1" and have a legend that says "1 = the quick brown fox". That's essentially how compression works - it finds repeated patterns and creates shortcuts to represent them.
💡 Key Takeaway: Compression works by finding and eliminating redundancy in data. The more repetitive your data, the better it compresses.
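Shannon's idea can be made concrete in a few lines. The sketch below (a minimal illustration, not part of any compression library) estimates the per-byte entropy of a byte slice from its symbol frequencies; highly repetitive data scores near zero bits per byte, while ordinary text scores several bits per byte.

package main

import (
    "fmt"
    "math"
    "strings"
)

// shannonEntropy estimates the per-byte entropy (in bits) of data
// from its byte frequencies. It is a rough lower bound on how many
// bits an ideal entropy coder would need per byte.
func shannonEntropy(data []byte) float64 {
    var counts [256]int
    for _, b := range data {
        counts[b]++
    }

    entropy := 0.0
    total := float64(len(data))
    for _, c := range counts {
        if c == 0 {
            continue
        }
        p := float64(c) / total
        entropy -= p * math.Log2(p)
    }
    return entropy
}

func main() {
    repetitive := []byte(strings.Repeat("AAAA", 100))
    text := []byte("the quick brown fox jumps over the lazy dog")

    fmt.Printf("Repetitive: %.2f bits/byte\n", shannonEntropy(repetitive))
    fmt.Printf("Text:       %.2f bits/byte\n", shannonEntropy(text))
}

If a compressor's output size approaches entropy times length, it is doing about as well as the data allows.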
Go's compression packages provide:
- compress/gzip: GNU zip compression
- compress/zlib: zlib format
- compress/flate: DEFLATE algorithm
- archive/tar: Tar archive format
- archive/zip: ZIP archive format with compression
⚠️ Important: Not all data compresses equally well. Random data may actually get slightly larger when compressed because there are no patterns to exploit.
Compression Basics
Let's start with a simple example that demonstrates the magic of compression. We'll create some repetitive text and see how much space we can save.
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
)

// run
func main() {
    // Original data - notice the repetition
    original := []byte("Hello, World! This is a test string for compression. " +
        "Compression works best with repetitive data. " +
        "Hello, World! Hello, World!")

    // Compress
    var compressed bytes.Buffer
    writer := gzip.NewWriter(&compressed)
    writer.Write(original)
    writer.Close()

    // Compare sizes
    fmt.Printf("Original size: %d bytes\n", len(original))
    fmt.Printf("Compressed size: %d bytes\n", compressed.Len())
    fmt.Printf("Compression ratio: %.2f%%\n",
        float64(compressed.Len())/float64(len(original))*100)

    // Decompress
    reader, err := gzip.NewReader(&compressed)
    if err != nil {
        fmt.Println("Decompress error:", err)
        return
    }
    defer reader.Close()

    var decompressed bytes.Buffer
    decompressed.ReadFrom(reader)

    fmt.Printf("\nDecompressed: %s\n", decompressed.String()[:50]+"...")
}
Real-world Example: When you download a large file from the internet, it's often compressed. A 100MB text file might compress to just 10MB, meaning it downloads 10 times faster! This is why websites use gzip compression - it saves bandwidth and improves page load times.
Compression Levels
Compression levels are like choosing how carefully to pack your suitcase. You can throw everything in quickly, or you can meticulously fold and arrange everything to maximize space. The trade-off is time vs. space.
💡 Key Takeaway: Compression levels represent a trade-off between CPU usage and compression ratio. Higher levels use more CPU time but achieve better compression.
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "strings"
    "time"
)

// run
func main() {
    // Test data - repetitive for better compression
    data := []byte(strings.Repeat("The quick brown fox jumps over the lazy dog. ", 100))

    levels := []int{
        gzip.NoCompression,
        gzip.BestSpeed,
        gzip.DefaultCompression,
        gzip.BestCompression,
    }

    levelNames := []string{
        "No Compression",
        "Best Speed",
        "Default",
        "Best Compression",
    }

    fmt.Printf("Original size: %d bytes\n\n", len(data))

    for i, level := range levels {
        start := time.Now()

        var compressed bytes.Buffer
        writer, _ := gzip.NewWriterLevel(&compressed, level)
        writer.Write(data)
        writer.Close()

        elapsed := time.Since(start)

        fmt.Printf("%s (level %d):\n", levelNames[i], level)
        fmt.Printf("  Size: %d bytes (%.2f%% of original)\n",
            compressed.Len(),
            float64(compressed.Len())/float64(len(data))*100)
        fmt.Printf("  Time: %s\n\n", elapsed)
    }
}
When to use different compression levels:
- Best Speed: Real-time applications, chat systems, live streams
- Default: General web content, API responses
- Best Compression: Backup files, archives, content distribution networks
Common Pitfalls: Don't use maximum compression for real-time applications - the CPU overhead might cause delays that outweigh the bandwidth savings.
Compression Theory and Algorithms
LZ77 Algorithm Details
Sliding Window: LZ77 maintains a sliding window of recently-seen data (typically 32KB for gzip). When encoding, it searches this window for matches with the current data. When it finds a match, it outputs a reference (distance back, length) instead of the literal data.
Lazy Matching: Better compression uses lazy matching—before outputting a match, look ahead one byte to see if a longer match exists. This adds complexity but typically improves compression by 1-2%.
Hash Chains: Efficient LZ77 implementations use hash tables and chains to find matches quickly. Naive search is O(n²); hash chains reduce this to near-linear time.
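To make the (distance, length) idea concrete, here is a toy match finder. It is deliberately naive, using the O(n²) scan mentioned above rather than hash chains, and the longestMatch helper is purely illustrative, not anything from compress/flate.

package main

import (
    "fmt"
    "strings"
)

// longestMatch scans backwards through a window of already-seen data
// and returns the (distance, length) of the longest match for the
// bytes starting at pos. A real DEFLATE encoder uses hash chains
// instead of this quadratic scan.
func longestMatch(data []byte, pos, window int) (distance, length int) {
    start := pos - window
    if start < 0 {
        start = 0
    }
    for cand := start; cand < pos; cand++ {
        l := 0
        for pos+l < len(data) && data[cand+l] == data[pos+l] {
            l++
        }
        if l > length {
            length = l
            distance = pos - cand
        }
    }
    return distance, length
}

func main() {
    s := "the quick brown fox, the quick cat"
    data := []byte(s)

    // Find the start of the repeated phrase and look for its match.
    pos := strings.LastIndex(s, "the quick")
    dist, length := longestMatch(data, pos, 32)
    fmt.Printf("match: distance=%d length=%d -> %q\n",
        dist, length, data[pos:pos+length])
}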
Huffman Coding Details
Frequency Analysis: Huffman coding builds a tree based on symbol frequencies. More frequent symbols get shorter codes. For English text, 'e' might be 3 bits while 'z' is 11 bits.
Static vs Dynamic: Static Huffman uses predetermined codes. Dynamic Huffman analyzes the actual data to build optimal codes. Dynamic achieves better compression but requires transmitting the code table.
Canonical Huffman: DEFLATE uses canonical Huffman codes—a clever encoding that requires transmitting only code lengths, not the full tree. This saves space in the compressed stream.
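The canonical-code trick is easy to demonstrate. The sketch below follows the RFC 1951 procedure: given only a code length per symbol, it reconstructs the exact codes by handing out consecutive values within each length, ordered by symbol value. The example lengths are made up for illustration.

package main

import "fmt"

// canonicalCodes assigns canonical Huffman codes given only the code
// length of each symbol, following the procedure in RFC 1951.
func canonicalCodes(lengths map[byte]int) map[byte]string {
    maxLen := 0
    countPerLen := map[int]int{}
    for _, l := range lengths {
        countPerLen[l]++
        if l > maxLen {
            maxLen = l
        }
    }

    // Compute the first code for each code length.
    firstCode := map[int]int{}
    code := 0
    for l := 1; l <= maxLen; l++ {
        code = (code + countPerLen[l-1]) << 1
        firstCode[l] = code
    }

    // Hand out consecutive codes to symbols in increasing order.
    codes := map[byte]string{}
    for sym := 0; sym < 256; sym++ {
        l, ok := lengths[byte(sym)]
        if !ok {
            continue
        }
        codes[byte(sym)] = fmt.Sprintf("%0*b", l, firstCode[l])
        firstCode[l]++
    }
    return codes
}

func main() {
    // Example code lengths (as if produced by a frequency analysis).
    lengths := map[byte]int{'a': 2, 'b': 1, 'c': 3, 'd': 3}
    for sym, code := range canonicalCodes(lengths) {
        fmt.Printf("%c -> %s\n", sym, code)
    }
}

Because the codes are fully determined by the lengths, DEFLATE only has to transmit one small integer per symbol.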
Alternative Algorithms
LZ78 and LZW: These build an explicit dictionary instead of using a sliding window. LZW (used in GIF and old Unix compress) was popular but is less common now due to patent history.
LZMA: Used by 7-Zip and xz, LZMA achieves excellent compression using a much larger dictionary (up to 4GB), range encoding instead of Huffman, and sophisticated match finding. Compression is slow but decompression is reasonable.
Brotli: Developed by Google for HTTP compression, Brotli combines LZ77 with a pre-defined dictionary of common web content. Achieves 15-25% better compression than gzip for web content.
Zstandard: Modern algorithm by Facebook, designed to match gzip compression with LZ4 speed. Features include: real-time compression, training mode for custom dictionaries, and multiple compression levels.
Understanding how compression works helps you choose the right algorithm and settings for your needs. Modern compression combines multiple techniques to achieve impressive ratios.
Dictionary Coding (LZ77): This technique finds repeated sequences and replaces them with references to earlier occurrences. For example, in the text "the quick brown fox jumps over the lazy dog, the fox was quick", the second "the" and "fox" can be replaced with pointers to their first occurrences.
Entropy Coding (Huffman): After dictionary coding, Huffman encoding assigns shorter bit codes to more frequent symbols. In English text, 'e' appears more often than 'z', so 'e' gets a shorter code. This statistical encoding provides additional compression.
DEFLATE = LZ77 + Huffman: The DEFLATE algorithm (used by gzip, zlib, and zip) combines both techniques. It first finds repeated strings, then encodes the result with Huffman coding. This two-stage approach is why DEFLATE achieves such good compression ratios on text data.
Why Some Data Doesn't Compress: Random data has no patterns to exploit—each byte is equally likely, so entropy coding helps minimally and dictionary coding finds few matches. Already-compressed data (JPEG, MP4, PNG) is designed to be near-maximum entropy, so further compression is impossible.
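You can see this directly by feeding gzip genuinely random bytes, as in this small sketch using crypto/rand; expect the output to be slightly larger than the input because of the gzip header, trailer, and stored-block overhead.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/rand"
    "fmt"
)

func main() {
    // Truly random bytes have no patterns for DEFLATE to exploit.
    data := make([]byte, 10000)
    rand.Read(data)

    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()

    fmt.Printf("Random input: %d bytes\n", len(data))
    fmt.Printf("Gzipped:      %d bytes (header + incompressible blocks)\n", buf.Len())
}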
Compression Ratio Calculations
Understanding compression metrics helps you evaluate performance and make informed decisions.
package main

import (
    "bytes"
    "compress/gzip"
    "crypto/rand"
    "fmt"
    "strings"
)

// run
func main() {
    // Test different data types
    testData := map[string][]byte{
        "Random":     generateRandomData(1000),
        "Repetitive": []byte(strings.Repeat("AAAA", 250)),
        "Text":       []byte(strings.Repeat("The quick brown fox ", 50)),
        "Numeric":    []byte(strings.Repeat("1234567890", 100)),
    }

    fmt.Println("Compression Analysis:")
    fmt.Println(strings.Repeat("=", 60))

    for name, data := range testData {
        compressed := compress(data)

        // Calculate metrics
        originalSize := len(data)
        compressedSize := len(compressed)
        ratio := float64(compressedSize) / float64(originalSize)
        savings := float64(originalSize-compressedSize) / float64(originalSize) * 100

        fmt.Printf("\n%s:\n", name)
        fmt.Printf("  Original:   %6d bytes\n", originalSize)
        fmt.Printf("  Compressed: %6d bytes\n", compressedSize)
        fmt.Printf("  Ratio:      %.2f:1\n", 1/ratio)
        fmt.Printf("  Savings:    %.1f%%\n", savings)
    }
}

func generateRandomData(size int) []byte {
    // Genuinely random bytes have no patterns to exploit, so this
    // sample should compress poorly or even grow slightly.
    data := make([]byte, size)
    rand.Read(data)
    return data
}

func compress(data []byte) []byte {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()
    return buf.Bytes()
}
Compression Metrics Explained:
- Compression Ratio: Original size / compressed size (higher is better)
- Compression Factor: Compressed size / original size (lower is better)
- Space Savings: (1 - compressed/original) * 100%
- Throughput: MB/s of data processed
Gzip Compression
The Ubiquitous Format: Gzip is everywhere—web servers, file archives, database backups, log files, and more. It's the de facto standard for HTTP compression, supported by every browser and web server. Understanding gzip is essential for any network-facing application.
How Gzip Works: Gzip combines two algorithms: LZ77 (finding repeated strings and replacing them with references) and Huffman coding (using shorter codes for common symbols). This combination is remarkably effective for text and structured data, often achieving 70-90% compression on HTML, JSON, and logs.
Compression Levels Explained: Level 1 (best speed) uses fast heuristics that miss some compression opportunities. Level 9 (best compression) exhaustively searches for patterns. Most applications use the default (level 6), which provides 95% of the compression benefit at a fraction of the CPU cost.
Gzip is like the universal language of compression on the internet. It's the most commonly used compression format, especially for HTTP and file compression. When your browser requests a web page, it often tells the server "I speak gzip" and the server responds with compressed content that your browser automatically decompresses.
Think of gzip as the postal service's standard method for reducing shipping costs - it's reliable, widely supported, and gets the job done efficiently.
Basic Gzip Operations
Gzip Internal Structure: A gzip file starts with a 10-byte header containing magic bytes (1f 8b), compression method, flags, timestamp, and OS type. After the header comes compressed data, and finally an 8-byte trailer with CRC32 checksum and uncompressed size (modulo 2^32).
Compression Window: Gzip uses a 32KB sliding window for finding repeated strings. This means patterns more than 32KB apart won't be detected. For very large files with distant repetition, specialized algorithms or preprocessing might achieve better compression.
Header Metadata: The gzip header can store filename, comment, and modification time. This metadata is optional but useful for archives. Go's gzip.Writer lets you set these fields—important when creating archives that other tools will process.
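Before the full compress/decompress helpers below, here is a quick sketch that dumps the raw header and trailer bytes of a gzip stream, so you can see the 1f 8b magic number and the trailing CRC32/size fields for yourself.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
)

func main() {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write([]byte("hello gzip header"))
    w.Close()

    raw := buf.Bytes()

    // The first two bytes are the gzip magic number 1f 8b,
    // the third is the compression method (08 = DEFLATE).
    fmt.Printf("Header:  % x\n", raw[:10])

    // The last 8 bytes are the CRC32 and the uncompressed size (mod 2^32).
    fmt.Printf("Trailer: % x\n", raw[len(raw)-8:])
}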
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
)

// run
func main() {
    original := "This is some data to compress with gzip!"

    // Compress
    compressed, err := gzipCompress([]byte(original))
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress
    decompressed, err := gzipDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func gzipCompress(data []byte) ([]byte, error) {
    var buf bytes.Buffer

    writer := gzip.NewWriter(&buf)
    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func gzipDecompress(data []byte) ([]byte, error) {
    reader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
Gzip with Metadata
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    data := []byte("Data with metadata in gzip header")

    // Compress with metadata
    compressed, err := compressWithMetadata(data, "example.txt", "Example file")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compressed %d bytes\n", len(compressed))

    // Decompress and read metadata
    decompressed, metadata, err := decompressWithMetadata(compressed)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("\nMetadata:\n")
    fmt.Printf("  Name: %s\n", metadata.Name)
    fmt.Printf("  Comment: %s\n", metadata.Comment)
    fmt.Printf("  Modified: %s\n", metadata.ModTime)
    fmt.Printf("\nDecompressed: %s\n", string(decompressed))
}

type GzipMetadata struct {
    Name    string
    Comment string
    ModTime time.Time
}

func compressWithMetadata(data []byte, name, comment string) ([]byte, error) {
    var buf bytes.Buffer

    writer := gzip.NewWriter(&buf)
    writer.Name = name
    writer.Comment = comment
    writer.ModTime = time.Now()

    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func decompressWithMetadata(data []byte) ([]byte, GzipMetadata, error) {
    reader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, GzipMetadata{}, err
    }
    defer reader.Close()

    metadata := GzipMetadata{
        Name:    reader.Name,
        Comment: reader.Comment,
        ModTime: reader.ModTime,
    }

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, GzipMetadata{}, err
    }

    return buf.Bytes(), metadata, nil
}
Streaming Gzip Compression
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "strings"
)

// run
func main() {
    // Source of data
    source := strings.NewReader("This is line 1\n" +
        "This is line 2\n" +
        "This is line 3\n" +
        "This is line 4\n" +
        "This is line 5\n")

    // Destination
    var compressed strings.Builder

    // Stream compression
    if err := streamCompress(source, &compressed); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compressed %d bytes\n", compressed.Len())

    // Stream decompression
    decompressed := &strings.Builder{}
    if err := streamDecompress(strings.NewReader(compressed.String()), decompressed); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Decompressed:\n%s", decompressed.String())
}

func streamCompress(source io.Reader, dest io.Writer) error {
    writer := gzip.NewWriter(dest)
    defer writer.Close()

    _, err := io.Copy(writer, source)
    return err
}

func streamDecompress(source io.Reader, dest io.Writer) error {
    reader, err := gzip.NewReader(source)
    if err != nil {
        return err
    }
    defer reader.Close()

    _, err = io.Copy(dest, reader)
    return err
}
Zlib and Flate
The DEFLATE Algorithm: At the heart of both gzip and zlib is DEFLATE, a compression algorithm so successful it's used in PNG images, ZIP files, HTTP compression, and countless other formats. DEFLATE combines LZ77 dictionary coding with Huffman entropy encoding.
Zlib vs Gzip vs Flate: These formats are siblings—they all use DEFLATE compression but differ in their wrapper format. Gzip adds a header with file metadata and a CRC32 checksum. Zlib adds a simpler header and Adler-32 checksum. Raw DEFLATE has no wrapper at all, saving a few bytes at the cost of no integrity checking.
Choosing the Right Format: Use gzip for HTTP and file compression (it's the standard). Use zlib for network protocols and embedded data (smaller overhead than gzip). Use raw DEFLATE when you're implementing your own wrapper or need maximum efficiency.
Zlib is another popular compression format, often used in network protocols and PNG images.
Zlib Compression
Zlib Header Format: Zlib uses a 2-byte header: first byte encodes compression method and window size, second byte contains flags and a header checksum. This compact header makes zlib suitable for protocols where overhead matters.
Adler-32 vs CRC-32: Zlib uses Adler-32 checksum (faster but weaker) while gzip uses CRC-32 (slower but stronger). For most applications, both provide adequate integrity checking. Adler-32's speed advantage is noticeable for small payloads.
Dictionary Compression: Zlib supports preset dictionaries—predefined data that seeds the compression window. For compressing many similar small messages (like JSON with consistent schema), a shared dictionary can improve compression by 20-40%.
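Go's compress/zlib exposes preset dictionaries through zlib.NewWriterLevelDict and zlib.NewReaderDict. The sketch below compresses a small JSON message with and without a shared dictionary; the exact savings depend on how much of the message the dictionary covers, and the sample payload here is purely illustrative.

package main

import (
    "bytes"
    "compress/zlib"
    "fmt"
    "io"
)

func main() {
    // A shared dictionary seeded with strings that appear in every message.
    dict := []byte(`{"status":"ok","result":{"id":,"name":""}}`)

    msg := []byte(`{"status":"ok","result":{"id":42,"name":"gopher"}}`)

    // Compress with and without the preset dictionary.
    withDict := zlibCompress(msg, dict)
    without := zlibCompress(msg, nil)
    fmt.Printf("Without dictionary: %d bytes\n", len(without))
    fmt.Printf("With dictionary:    %d bytes\n", len(withDict))

    // The reader must be given the same dictionary.
    r, err := zlib.NewReaderDict(bytes.NewReader(withDict), dict)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer r.Close()
    var out bytes.Buffer
    io.Copy(&out, r)
    fmt.Printf("Round trip: %s\n", out.String())
}

func zlibCompress(data, dict []byte) []byte {
    var buf bytes.Buffer
    w, _ := zlib.NewWriterLevelDict(&buf, zlib.BestCompression, dict)
    w.Write(data)
    w.Close()
    return buf.Bytes()
}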
package main

import (
    "bytes"
    "compress/zlib"
    "fmt"
    "io"
)

// run
func main() {
    original := []byte("Zlib compression example with some repetitive data. " +
        "Repetitive data compresses better. Repetitive data!")

    // Compress with zlib
    compressed, err := zlibCompress(original)
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))
    fmt.Printf("Ratio: %.2f%%\n",
        float64(len(compressed))/float64(len(original))*100)

    // Decompress
    decompressed, err := zlibDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("\nDecompressed: %s\n", string(decompressed)[:50]+"...")
}

func zlibCompress(data []byte) ([]byte, error) {
    var buf bytes.Buffer

    writer := zlib.NewWriter(&buf)
    _, err := writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func zlibDecompress(data []byte) ([]byte, error) {
    reader, err := zlib.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    var buf bytes.Buffer
    _, err = io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
Flate
package main

import (
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// run
func main() {
    original := []byte("Raw DEFLATE compression without headers")

    // Compress with flate
    compressed, err := flateCompress(original, flate.DefaultCompression)
    if err != nil {
        fmt.Println("Compression error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(original))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress
    decompressed, err := flateDecompress(compressed)
    if err != nil {
        fmt.Println("Decompression error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func flateCompress(data []byte, level int) ([]byte, error) {
    var buf bytes.Buffer

    writer, err := flate.NewWriter(&buf, level)
    if err != nil {
        return nil, err
    }

    _, err = writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func flateDecompress(data []byte) ([]byte, error) {
    reader := flate.NewReader(bytes.NewReader(data))
    defer reader.Close()

    var buf bytes.Buffer
    _, err := io.Copy(&buf, reader)
    if err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
Custom Dictionary Compression
package main

import (
    "bytes"
    "compress/flate"
    "fmt"
)

// run
func main() {
    // Dictionary for better compression of similar data
    dictionary := []byte("the quick brown fox jumps over")

    data := []byte("the quick brown fox jumps over the lazy dog")

    // Compress with dictionary
    compressed, err := compressWithDict(data, dictionary)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Original: %d bytes\n", len(data))
    fmt.Printf("Compressed: %d bytes\n", len(compressed))

    // Decompress with the same dictionary
    decompressed, err := decompressWithDict(compressed, dictionary)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Decompressed: %s\n", string(decompressed))
}

func compressWithDict(data, dict []byte) ([]byte, error) {
    var buf bytes.Buffer

    // flate.NewWriterDict seeds the compression window with the dictionary
    writer, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
    if err != nil {
        return nil, err
    }

    _, err = writer.Write(data)
    if err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func decompressWithDict(data, dict []byte) ([]byte, error) {
    // The reader must be given the same dictionary
    reader := flate.NewReaderDict(bytes.NewReader(data), dict)
    defer reader.Close()

    var buf bytes.Buffer
    if _, err := buf.ReadFrom(reader); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}
Tar Archives
Unix Tape Archives: Tar (Tape ARchive) originated in the 1970s for backing up data to magnetic tape. Despite its age, it remains the standard for distributing software and creating backups on Unix-like systems. Its simplicity and efficiency have ensured its longevity.
Tar's Design: Tar concatenates files sequentially, preserving metadata (permissions, ownership, timestamps) in headers. This sequential design made sense for tape drives (which couldn't seek) and still works well for streaming backups and network transfers.
Tar + Gzip = TGZ: The classic combination of tar (bundling) with gzip (compression) is so common it has its own file extension (.tgz or .tar.gz). This two-step process allows tar to focus on metadata preservation while gzip handles compression.
Tar is widely used for bundling multiple files together.
Creating Tar Archives
Tar Format Variants: Original tar had limitations (99-character filenames, 8GB file size). GNU tar extensions lifted many restrictions. POSIX.1-2001 (PAX) format supports arbitrary filename lengths, large files, and extended attributes. Modern Go uses PAX format by default.
Tar Block Structure: Tar uses 512-byte blocks. Each file header occupies one block, followed by file data rounded up to block boundaries. This padding ensures tape-friendly alignment but means small files have overhead—a 1-byte file uses 1024 bytes (header + data block).
Tar Streaming Benefits: Tar's sequential format is perfect for streaming: you can add files to an archive without seeking, and you can extract files while still downloading. This makes tar excellent for network transfer and backups to tape.
package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Create a tar archive
    archive, err := createTarArchive()
    if err != nil {
        fmt.Println("Error creating archive:", err)
        return
    }

    fmt.Printf("Created tar archive: %d bytes\n\n", len(archive))

    // List contents
    if err := listTarArchive(archive); err != nil {
        fmt.Println("Error listing archive:", err)
        return
    }

    // Extract a file
    content, err := extractFromTar(archive, "file2.txt")
    if err != nil {
        fmt.Println("Error extracting:", err)
        return
    }

    fmt.Printf("\nExtracted file2.txt:\n%s\n", string(content))
}

func createTarArchive() ([]byte, error) {
    var buf bytes.Buffer
    writer := tar.NewWriter(&buf)

    // Add files
    files := []struct {
        Name    string
        Content []byte
    }{
        {"file1.txt", []byte("Content of file 1")},
        {"file2.txt", []byte("Content of file 2")},
        {"dir/file3.txt", []byte("Content of file 3 in a directory")},
    }

    for _, file := range files {
        header := &tar.Header{
            Name:    file.Name,
            Mode:    0644,
            Size:    int64(len(file.Content)),
            ModTime: time.Now(),
        }

        if err := writer.WriteHeader(header); err != nil {
            return nil, err
        }

        if _, err := writer.Write(file.Content); err != nil {
            return nil, err
        }
    }

    // Close before returning so the end-of-archive blocks are flushed
    if err := writer.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func listTarArchive(data []byte) error {
    reader := tar.NewReader(bytes.NewReader(data))

    fmt.Println("Archive contents:")
    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }

        fmt.Printf("  %s (%d bytes, mode %o)\n",
            header.Name, header.Size, header.Mode)
    }

    return nil
}

func extractFromTar(data []byte, filename string) ([]byte, error) {
    reader := tar.NewReader(bytes.NewReader(data))

    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }

        if header.Name == filename {
            var buf bytes.Buffer
            if _, err := io.Copy(&buf, reader); err != nil {
                return nil, err
            }
            return buf.Bytes(), nil
        }
    }

    return nil, fmt.Errorf("file not found: %s", filename)
}
Archive Format Comparison
Compression in Different Domains
Image Compression: JPEG is lossy—acceptable for photos but not for screenshots with text. PNG is lossless—perfect for screenshots, diagrams, and images with text. WebP is modern—supports both lossy and lossless, smaller than JPEG/PNG.
Video Compression: H.264/H.265 are sophisticated lossy codecs achieving amazing compression (1000:1 common). Work by exploiting temporal redundancy (frames are similar) and spatial redundancy (adjacent pixels are similar).
Audio Compression: MP3 is lossy—discards frequencies humans can't hear well. FLAC is lossless—preserves perfect quality while reducing size 40-60%. Opus is modern—optimized for speech and music, low latency.
Text Compression: Text compresses extremely well—80-90% reduction common. JSON with repeated keys compresses even better. Log files are highly compressible due to repeated patterns.
Binary Data: Executable files, compiled code, and random data compress poorly. Already-compressed files (ZIP, MP4, JPEG) won't compress further—attempting wastes CPU and might increase size.
Compression Benchmarking Methodology
Representative Data: Benchmark with your actual data. Compression ratios vary dramatically—JSON compresses differently than images. Use real production data samples.
Warm-Up: First compressions are slower due to CPU caches, memory allocation, and JIT compilation. Warm up with several iterations before measuring.
Statistical Significance: Run multiple iterations and compute statistics (mean, stddev, percentiles). Single runs can be misleading due to system noise.
Memory Profiling: Measure peak memory usage, not just time. High-compression algorithms can use 100MB+ for buffers. This matters for memory-constrained environments.
Throughput vs Latency: Measure both compression speed (MB/s) and latency for single operations. Batch processing cares about throughput; interactive applications care about latency.
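A minimal Go benchmark along these lines might look like the sketch below; it belongs in a _test.go file and is run with go test -bench. The testdata/sample.json path is a placeholder for your own production data sample.

package compressbench

import (
    "bytes"
    "compress/gzip"
    "os"
    "testing"
)

// BenchmarkGzipDefault measures gzip compression throughput on a
// sample file. b.SetBytes makes 'go test -bench' report MB/s.
func BenchmarkGzipDefault(b *testing.B) {
    data, err := os.ReadFile("testdata/sample.json")
    if err != nil {
        b.Skip("no sample data:", err)
    }

    b.SetBytes(int64(len(data)))
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        var buf bytes.Buffer
        w, _ := gzip.NewWriterLevel(&buf, gzip.DefaultCompression)
        w.Write(data)
        w.Close()
    }
}

Run it with go test -bench=. -benchmem to see both throughput and allocation behavior; the benchmark framework handles warm-up iterations for you.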
Advanced Compression Techniques
Dictionary Training: For compressing many similar small messages (like JSON API responses), train a shared dictionary. Can improve compression 20-40% for small messages where per-message dictionaries don't help.
Preprocessing: Sometimes preprocessing improves compression. BWT (Burrows-Wheeler Transform) reorders data to group similar bytes, improving compression. Delta encoding for time-series data. Run-length encoding for simple repetition.
Hybrid Approaches: Combine techniques for best results. Databases often use: column-oriented storage (groups similar values), dictionary encoding (for low-cardinality columns), run-length encoding (for repeated values), then general compression (gzip/zstd).
Content-Aware Compression: Different compression for different content types. Images use image codecs, text uses text compression, already-compressed data stored uncompressed. ZIP's per-file compression enables this.
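As a small illustration of the preprocessing idea, the sketch below delta-encodes a monotonically increasing series of timestamps before gzipping it; the deltas are mostly identical small values, which DEFLATE handles far better than the raw 8-byte integers. The numbers are synthetic.

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "fmt"
)

func gzipSize(data []byte) int {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    w.Write(data)
    w.Close()
    return buf.Len()
}

func main() {
    // Simulated time series: 1000 timestamps, one second apart.
    var raw, deltas []byte
    var tmp [8]byte
    prev := uint64(0)
    for i := 0; i < 1000; i++ {
        ts := uint64(1700000000 + i)

        binary.BigEndian.PutUint64(tmp[:], ts)
        raw = append(raw, tmp[:]...)

        // Store the difference from the previous sample instead.
        binary.BigEndian.PutUint64(tmp[:], ts-prev)
        deltas = append(deltas, tmp[:]...)
        prev = ts
    }

    fmt.Printf("Raw timestamps, gzipped: %d bytes\n", gzipSize(raw))
    fmt.Printf("Delta-encoded, gzipped:  %d bytes\n", gzipSize(deltas))
}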
Compression Error Handling
Detecting Corruption: Compressed data is fragile—single-bit error can corrupt everything after. Use checksums: gzip includes CRC32, or add external checksums (SHA-256) for critical data.
Recovery Strategies: For critical data, implement recovery: error-correcting codes, redundancy (multiple copies), or splitting into independent chunks so one corrupted block doesn't destroy everything.
Graceful Degradation: If decompression fails, don't crash. Log error, skip corrupted item, continue processing. For optional compression (like caching), fall back to uncompressed.
Validation: After compression, decompress and verify matches original. This catches implementation bugs. For backups, always test restoration.
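A compress-then-verify helper is only a few lines; the sketch below (the compressVerified name is just for this example) round-trips the data and compares it with bytes.Equal before returning the compressed form.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
)

// compressVerified compresses data and immediately verifies that the
// result decompresses back to the original before returning it.
func compressVerified(data []byte) ([]byte, error) {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    if _, err := w.Write(data); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }

    // Verify the round trip.
    r, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
    if err != nil {
        return nil, err
    }
    defer r.Close()
    restored, err := io.ReadAll(r)
    if err != nil {
        return nil, err
    }
    if !bytes.Equal(restored, data) {
        return nil, fmt.Errorf("verification failed: round trip does not match original")
    }
    return buf.Bytes(), nil
}

func main() {
    out, err := compressVerified([]byte("verify me before storing"))
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Printf("Verified archive: %d bytes\n", len(out))
}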
Compression and Encryption Interaction
Encrypt After Compression: Encryption destroys patterns needed for compression. Always compress first, then encrypt. Trying to compress encrypted data wastes CPU.
Security Considerations: BREACH/CRIME attacks exploit compression to leak secrets. For pages with secrets (tokens, cookies), either disable compression or add random padding to vary sizes.
Authenticated Encryption: Use AES-GCM or ChaCha20-Poly1305—they provide both confidentiality and authenticity. Verify authentication tags before decompression to prevent processing malicious data.
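Putting these points together, the sketch below gzips a payload and then seals it with AES-GCM from the standard library, prepending the nonce to the ciphertext. Key management is deliberately omitted; the key here is a stand-in.

package main

import (
    "bytes"
    "compress/gzip"
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "fmt"
)

// sealCompressed gzips plaintext and then encrypts it with AES-GCM.
// The nonce is prepended to the ciphertext.
func sealCompressed(plaintext, key []byte) ([]byte, error) {
    var buf bytes.Buffer
    w := gzip.NewWriter(&buf)
    if _, err := w.Write(plaintext); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }

    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }

    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, buf.Bytes(), nil), nil
}

func main() {
    key := make([]byte, 32) // AES-256; use a real key from a KMS or KDF
    rand.Read(key)

    sealed, err := sealCompressed([]byte("compress first, then encrypt"), key)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Printf("Sealed payload: %d bytes\n", len(sealed))
}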
Tar vs Zip Philosophy: Tar separates archiving (bundling files) from compression (reducing size). Zip combines both. This explains why tar.gz has better compression—compression sees all files as one stream, finding patterns across file boundaries. Zip compresses files independently, missing cross-file patterns.
Random Access: Zip's central directory enables random access—extract one file without reading the entire archive. Tar requires sequential scan to find files. For archives with thousands of files, this makes extraction much faster when you only need specific files.
Streaming and Seeking: Tar streams naturally—create/extract without seeking. Zip requires seeking to read the central directory (at the end) and jump to file data. This makes tar better for pipes and network streams, zip better for local files.
Metadata Preservation: Tar excels at preserving Unix metadata: permissions (rwxrwxrwx), ownership (UID/GID), timestamps (modify, access, change), extended attributes (ACLs, SELinux contexts), device nodes (for system backups). Zip stores basic metadata but isn't designed for system backups.
Cross-Platform: Zip is more cross-platform—Windows, Mac, and Linux all have native support. Tar is Unix-native, though Windows now has tar support. For distributing software to non-technical users, zip is often better.
Encryption: Zip has built-in encryption (though weak in some implementations). Tar has no native encryption—encrypt the tar.gz file with gpg or another tool. For sensitive archives, consider GPG encryption regardless of format.
Tar with Compression
package main

import (
    "archive/tar"
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Create compressed tar archive
    archive, err := createTarGz()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created .tar.gz archive: %d bytes\n\n", len(archive))

    // Extract from compressed archive
    files, err := extractTarGz(archive)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Println("Extracted files:")
    for name, content := range files {
        fmt.Printf("  %s: %s\n", name, string(content))
    }
}

func createTarGz() ([]byte, error) {
    var buf bytes.Buffer

    // Create gzip writer
    gzWriter := gzip.NewWriter(&buf)

    // Create tar writer on top of gzip
    tarWriter := tar.NewWriter(gzWriter)

    // Add files
    files := map[string][]byte{
        "data1.txt": []byte("First file content"),
        "data2.txt": []byte("Second file content"),
        "data3.txt": []byte("Third file content"),
    }

    for name, content := range files {
        header := &tar.Header{
            Name:    name,
            Mode:    0644,
            Size:    int64(len(content)),
            ModTime: time.Now(),
        }

        if err := tarWriter.WriteHeader(header); err != nil {
            return nil, err
        }

        if _, err := tarWriter.Write(content); err != nil {
            return nil, err
        }
    }

    // Important: close the tar writer first, then the gzip writer, to flush everything
    if err := tarWriter.Close(); err != nil {
        return nil, err
    }
    if err := gzWriter.Close(); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func extractTarGz(data []byte) (map[string][]byte, error) {
    // Create gzip reader
    gzReader, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return nil, err
    }
    defer gzReader.Close()

    // Create tar reader
    tarReader := tar.NewReader(gzReader)

    files := make(map[string][]byte)

    for {
        header, err := tarReader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }

        var buf bytes.Buffer
        if _, err := io.Copy(&buf, tarReader); err != nil {
            return nil, err
        }
        files[header.Name] = buf.Bytes()
    }

    return files, nil
}
Streaming Tar Operations
package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
    "time"
)

// run
func main() {
    // Simulate streaming tar creation
    var archive bytes.Buffer

    // Create tar archive with streaming
    if err := streamTarCreate(&archive); err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created streaming tar: %d bytes\n\n", archive.Len())

    // Stream processing of tar
    if err := streamTarProcess(&archive); err != nil {
        fmt.Println("Error:", err)
        return
    }
}

func streamTarCreate(dest io.Writer) error {
    writer := tar.NewWriter(dest)
    defer writer.Close()

    // Simulate streaming multiple chunks
    for i := 1; i <= 3; i++ {
        name := fmt.Sprintf("stream_%d.txt", i)
        content := []byte(fmt.Sprintf("Streamed content #%d", i))

        header := &tar.Header{
            Name:    name,
            Mode:    0644,
            Size:    int64(len(content)),
            ModTime: time.Now(),
        }

        if err := writer.WriteHeader(header); err != nil {
            return err
        }

        if _, err := writer.Write(content); err != nil {
            return err
        }

        fmt.Printf("Streamed: %s\n", name)
    }

    return nil
}

func streamTarProcess(source io.Reader) error {
    reader := tar.NewReader(source)

    fmt.Println("\nProcessing streamed tar:")
    for {
        header, err := reader.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }

        // Process each file as it's read
        fmt.Printf("  Processing: %s (%d bytes)\n", header.Name, header.Size)

        // Read and discard content
        io.Copy(io.Discard, reader)
    }

    return nil
}
Tar Advanced Features
Tar supports many advanced features beyond basic file archiving. Understanding these features helps you build robust backup and distribution systems.
Sparse Files: Files with large holes (regions of zeros) can be stored efficiently. Tar's sparse format stores only non-zero regions, dramatically reducing archive size for sparse files like disk images or database files.
Hard and Soft Links: Tar preserves hard links (multiple names for same file) and symbolic links (pointers to other files). This ensures archives accurately represent complex file systems.
Extended Attributes: Modern tar formats (PAX) support extended attributes, ACLs, and SELinux contexts. This is crucial for accurate system backups that need to preserve security settings.
Incremental Backups: Tar can create incremental archives containing only files changed since a previous backup. Combined with snapshot files, this enables efficient multi-level backup strategies.
Archive Splitting: Large archives can be split across multiple files or volumes. This was essential for tape backups and remains useful for distribution across size-limited media.
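Links are easy to see in code: a link entry is just a header with a Typeflag and a Linkname and no data. The sketch below writes a regular file plus a symbolic link to it and then lists both entries; the file names are made up for illustration.

package main

import (
    "archive/tar"
    "bytes"
    "fmt"
    "io"
    "time"
)

func main() {
    var buf bytes.Buffer
    tw := tar.NewWriter(&buf)

    // A regular file entry with content.
    content := []byte("real file content")
    tw.WriteHeader(&tar.Header{
        Name:     "data/current.log",
        Typeflag: tar.TypeReg,
        Mode:     0644,
        Size:     int64(len(content)),
        ModTime:  time.Now(),
    })
    tw.Write(content)

    // A symbolic link pointing at it: no data, just Typeflag and Linkname.
    tw.WriteHeader(&tar.Header{
        Name:     "data/latest.log",
        Typeflag: tar.TypeSymlink,
        Linkname: "current.log",
        Mode:     0777,
        ModTime:  time.Now(),
    })
    tw.Close()

    // List the archive to show both entry types.
    tr := tar.NewReader(&buf)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        fmt.Printf("%-18s type=%c link=%q\n", hdr.Name, hdr.Typeflag, hdr.Linkname)
    }
}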
Zip Files
The Universal Archive: ZIP archives work on every platform—Windows, Mac, Linux, mobile devices. Unlike tar, ZIP files include a central directory at the end, allowing efficient random access to individual files without scanning the entire archive.
ZIP's Advantages: Individual file access (extract single files without decompressing everything), built-in compression per file (can store some files uncompressed), optional encryption, and universal tool support make ZIP ideal for software distribution and data exchange.
Compression Methods: ZIP supports multiple compression methods: Store (no compression, for already-compressed files), DEFLATE (standard compression, same as gzip), BZIP2 (better compression, slower), and LZMA (best compression, much slower). Most tools use DEFLATE as the default.
ZIP is a popular archive format with built-in compression.
Creating ZIP Archives
ZIP Central Directory: Unlike tar's sequential format, zip files have a central directory at the end listing all files. This directory enables random access—you can extract a single file without scanning the entire archive—but requires seeking, making zip less suitable for streaming.
ZIP64 Extensions: Original zip format limited file sizes to 4GB and archive sizes to 4GB. ZIP64 extensions use 64-bit fields, removing these limits. Modern tools use ZIP64 automatically when needed, ensuring compatibility while supporting large files.
Compression Per File: ZIP compresses each file independently, enabling parallel decompression and different compression levels per file. Already-compressed files (JPEG, MP4) can be stored uncompressed, saving CPU while maintaining archive integrity.
package main

import (
    "archive/zip"
    "bytes"
    "fmt"
    "io"
)

// run
func main() {
    // Create ZIP archive
    archive, err := createZipArchive()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created ZIP archive: %d bytes\n\n", len(archive))

    // List contents
    if err := listZipArchive(archive); err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Extract file
    content, err := extractFromZip(archive, "document.txt")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("\nExtracted document.txt:\n%s\n", string(content))
}

func createZipArchive() ([]byte, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    // Add files with different compression methods
    files := []struct {
        Name   string
        Data   []byte
        Method uint16
    }{
        {"readme.txt", []byte("This is a readme file"), zip.Deflate},
        {"document.txt", []byte("Document content here"), zip.Deflate},
        {"data.bin", []byte{0x00, 0x01, 0x02, 0x03}, zip.Store}, // No compression
    }

    for _, file := range files {
        header := &zip.FileHeader{
            Name:   file.Name,
            Method: file.Method,
        }

        w, err := writer.CreateHeader(header)
        if err != nil {
            return nil, err
        }

        if _, err := w.Write(file.Data); err != nil {
            return nil, err
        }
    }

    // Close to write the central directory
    if err := writer.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

func listZipArchive(data []byte) error {
    reader, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
    if err != nil {
        return err
    }

    fmt.Println("ZIP contents:")
    for _, file := range reader.File {
        method := "Deflate"
        if file.Method == zip.Store {
            method = "Store"
        }

        fmt.Printf("  %s [%s] %d -> %d bytes\n",
            file.Name, method, file.UncompressedSize64, file.CompressedSize64)
    }

    return nil
}

func extractFromZip(data []byte, filename string) ([]byte, error) {
    reader, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
    if err != nil {
        return nil, err
    }

    for _, file := range reader.File {
        if file.Name == filename {
            rc, err := file.Open()
            if err != nil {
                return nil, err
            }
            defer rc.Close()

            var buf bytes.Buffer
            io.Copy(&buf, rc)
            return buf.Bytes(), nil
        }
    }

    return nil, fmt.Errorf("file not found: %s", filename)
}
ZIP with Compression Levels
package main

import (
    "archive/zip"
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// run
func main() {
    data := []byte("Test data for compression level comparison. " +
        "Repetitive data compresses better. Repetitive data!")

    levels := []struct {
        Name  string
        Level int
    }{
        {"No Compression", flate.NoCompression},
        {"Best Speed", flate.BestSpeed},
        {"Default", flate.DefaultCompression},
        {"Best Compression", flate.BestCompression},
    }

    fmt.Printf("ZIP Compression Level Comparison:\n\n")

    for _, level := range levels {
        size, err := createZipWithLevel(data, level.Level)
        if err != nil {
            fmt.Printf("%s: Error: %v\n", level.Name, err)
            continue
        }

        fmt.Printf("%s:\n", level.Name)
        fmt.Printf("  Size: %d bytes (%.2f%% of original)\n",
            size, float64(size)/float64(len(data))*100)
    }
}

func createZipWithLevel(data []byte, level int) (int, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    // Register a Deflate compressor that honors the requested level.
    // This must happen before CreateHeader.
    writer.RegisterCompressor(zip.Deflate, func(out io.Writer) (io.WriteCloser, error) {
        return flate.NewWriter(out, level)
    })

    header := &zip.FileHeader{
        Name:   "data.txt",
        Method: zip.Deflate,
    }

    // For "no compression", skip DEFLATE framing entirely and store the file.
    if level == flate.NoCompression {
        header.Method = zip.Store
    }

    w, err := writer.CreateHeader(header)
    if err != nil {
        return 0, err
    }

    if _, err := w.Write(data); err != nil {
        return 0, err
    }

    if err := writer.Close(); err != nil {
        return 0, err
    }
    return buf.Len(), nil
}
Password-Protected ZIP
package main

import (
    "archive/zip"
    "bytes"
    "fmt"
)

// run
func main() {
    // Note: Standard library doesn't support password-protected ZIP
    // This example shows the structure; you'd need golang.org/x/crypto
    // or a third-party library for actual encryption

    data := []byte("Sensitive data")

    archive, err := createProtectedZip(data)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Created archive: %d bytes\n", len(archive))
    fmt.Println("\nNote: Standard library doesn't fully support ZIP encryption")
    fmt.Println("Use github.com/alexmullins/zip for encrypted ZIP files")
}

func createProtectedZip(data []byte) ([]byte, error) {
    var buf bytes.Buffer
    writer := zip.NewWriter(&buf)

    // Standard ZIP creation
    header := &zip.FileHeader{
        Name:   "protected.txt",
        Method: zip.Deflate,
    }

    w, err := writer.CreateHeader(header)
    if err != nil {
        return nil, err
    }

    // In a real implementation, you'd encrypt data here
    if _, err := w.Write(data); err != nil {
        return nil, err
    }

    if err := writer.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}
Streaming Compression
Memory Efficiency: Streaming compression processes data incrementally rather than loading entire files into memory. This is essential for large files (gigabytes or terabytes) and real-time applications (compressing data as it's generated).
The io.Reader/io.Writer Pattern: Go's interface-based design makes streaming natural. A gzip.Writer is an io.Writer, so anything that accepts io.Writer can transparently add compression. This composability is powerful—you can chain readers and writers to create complex pipelines.
Chunked Processing: For streaming to work effectively, data must be processed in chunks. Too small and you lose compression efficiency (patterns span chunks). Too large and you lose the memory benefits of streaming. Typical chunk sizes range from 4KB to 64KB, balancing memory use and compression ratio.
Streaming keeps memory usage low and predictable, even for inputs far larger than available RAM.
Chunked Streaming
package main

import (
    "compress/gzip"
    "fmt"
    "io"
    "strings"
)

// run
func main() {
    // Simulate large data source
    source := generateLargeData()

    // Stream compress with chunks
    var compressed strings.Builder
    stats, err := streamCompressChunked(source, &compressed, 1024)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Compression Statistics:\n")
    fmt.Printf("  Chunks processed: %d\n", stats.Chunks)
    fmt.Printf("  Bytes read: %d\n", stats.BytesRead)
    fmt.Printf("  Bytes written: %d\n", stats.BytesWritten)
    fmt.Printf("  Compression ratio: %.2f%%\n",
        float64(stats.BytesWritten)/float64(stats.BytesRead)*100)
}

type CompressionStats struct {
    Chunks       int
    BytesRead    int64
    BytesWritten int64
}

func generateLargeData() io.Reader {
    // Generate repetitive data for good compression
    data := strings.Repeat("Sample data for streaming compression. ", 100)
    return strings.NewReader(data)
}

// countingWriter counts the compressed bytes passed through to dest.
type countingWriter struct {
    dest  io.Writer
    count int64
}

func (cw *countingWriter) Write(p []byte) (int, error) {
    n, err := cw.dest.Write(p)
    cw.count += int64(n)
    return n, err
}

func streamCompressChunked(source io.Reader, dest io.Writer, chunkSize int) (CompressionStats, error) {
    stats := CompressionStats{}

    // Count compressed bytes as they reach the destination
    counter := &countingWriter{dest: dest}
    writer := gzip.NewWriter(counter)

    buffer := make([]byte, chunkSize)

    for {
        n, err := source.Read(buffer)
        if err != nil && err != io.EOF {
            return stats, err
        }

        if n > 0 {
            stats.Chunks++
            stats.BytesRead += int64(n)

            if _, err := writer.Write(buffer[:n]); err != nil {
                return stats, err
            }
        }

        if err == io.EOF {
            break
        }
    }

    // Must close to flush the remaining compressed data
    if err := writer.Close(); err != nil {
        return stats, err
    }

    stats.BytesWritten = counter.count
    return stats, nil
}
Advanced Streaming Compression Example
Here's a production-ready streaming compression system with progress tracking, error handling, and optimization:
package main

import (
    "compress/gzip"
    "crypto/sha256"
    "fmt"
    "io"
    "sync/atomic"
    "time"
)

// run
func main() {
    // Create streaming compressor
    compressor := NewStreamingCompressor(CompressionConfig{
        Level:      gzip.DefaultCompression,
        BufferSize: 64 * 1024,   // 64KB buffers
        ChunkSize:  1024 * 1024, // 1MB chunks
    })

    // Simulate compressing a large file
    fmt.Println("Starting compression...")

    input := generateLargeData(5 * 1024 * 1024) // 5MB test data
    output := io.Discard                        // discard output; we only need the statistics

    stats, err := compressor.Compress(input, output, &progressCallback{})
    if err != nil {
        fmt.Printf("Compression failed: %v\n", err)
        return
    }

    fmt.Printf("\nCompression complete:\n")
    fmt.Printf("  Input size: %d bytes\n", stats.InputSize)
    fmt.Printf("  Output size: %d bytes\n", stats.OutputSize)
    fmt.Printf("  Ratio: %.2f%%\n", float64(stats.OutputSize)/float64(stats.InputSize)*100)
    fmt.Printf("  Duration: %s\n", stats.Duration)
    fmt.Printf("  Throughput: %.2f MB/s\n", float64(stats.InputSize)/stats.Duration.Seconds()/1024/1024)
    fmt.Printf("  Checksum: %x\n", stats.Checksum)
}

// CompressionConfig holds compression parameters
type CompressionConfig struct {
    Level      int // Compression level (1-9)
    BufferSize int // I/O buffer size
    ChunkSize  int // Processing chunk size
}

// CompressionStats holds compression statistics
type CompressionStats struct {
    InputSize  int64         // Bytes read
    OutputSize int64         // Compressed bytes written
    Duration   time.Duration // Time taken
    Checksum   []byte        // Input checksum
}

// ProgressCallback receives progress updates
type ProgressCallback interface {
    OnProgress(bytesProcessed, totalBytes int64)
}

// StreamingCompressor handles streaming compression
type StreamingCompressor struct {
    config CompressionConfig
}

func NewStreamingCompressor(config CompressionConfig) *StreamingCompressor {
    return &StreamingCompressor{config: config}
}

// Compress compresses input to output with progress tracking
func (c *StreamingCompressor) Compress(input io.Reader, output io.Writer, progress ProgressCallback) (CompressionStats, error) {
    start := time.Now()

    // Create checksum hasher
    hasher := sha256.New()

    // Wrap input to compute checksum
    inputReader := io.TeeReader(input, hasher)

    // Count compressed bytes as they reach the output
    countingOutput := &countingWriter{Writer: output}

    // Create gzip writer with specified level
    gzWriter, err := gzip.NewWriterLevel(countingOutput, c.config.Level)
    if err != nil {
        return CompressionStats{}, err
    }
    defer gzWriter.Close()

    // Compress in chunks
    buffer := make([]byte, c.config.ChunkSize)
    var totalRead int64

    for {
        n, err := inputReader.Read(buffer)
        if n > 0 {
            // Write the chunk into the gzip writer
            if _, writeErr := gzWriter.Write(buffer[:n]); writeErr != nil {
                return CompressionStats{}, writeErr
            }

            totalRead += int64(n)

            // Report progress
            if progress != nil {
                progress.OnProgress(totalRead, -1) // -1 = unknown total
            }
        }

        if err == io.EOF {
            break
        }
        if err != nil {
            return CompressionStats{}, err
        }
    }

    // Flush gzip writer
    if err := gzWriter.Close(); err != nil {
        return CompressionStats{}, err
    }

    return CompressionStats{
        InputSize:  totalRead,
        OutputSize: countingOutput.Count(),
        Duration:   time.Since(start),
        Checksum:   hasher.Sum(nil),
    }, nil
}

// countingWriter wraps a writer and counts bytes written
type countingWriter struct {
    Writer io.Writer
    count  atomic.Int64
}

func (cw *countingWriter) Write(p []byte) (int, error) {
    n, err := cw.Writer.Write(p)
    cw.count.Add(int64(n))
    return n, err
}

func (cw *countingWriter) Count() int64 {
    return cw.count.Load()
}

// progressCallback implements ProgressCallback
type progressCallback struct {
    lastReport time.Time
}

func (p *progressCallback) OnProgress(bytesProcessed, totalBytes int64) {
    // Rate limit progress updates
    if time.Since(p.lastReport) < 100*time.Millisecond {
        return
    }
    p.lastReport = time.Now()

    mb := float64(bytesProcessed) / 1024 / 1024
    fmt.Printf("\rProcessed: %.2f MB", mb)
}

// generateLargeData creates test data
func generateLargeData(size int) io.Reader {
    // Create pipe for streaming data
    r, w := io.Pipe()

    go func() {
        defer w.Close()

        // Generate repetitive data (compresses well)
        chunk := []byte("This is test data for compression. It repeats to demonstrate good compression ratios.\n")

        for written := 0; written < size; {
            n, err := w.Write(chunk)
            if err != nil {
                return
            }
            written += n
        }
    }()

    return r
}
This example demonstrates:
- Streaming Processing: Handle arbitrarily large files
- Progress Tracking: Real-time feedback
- Checksum Verification: Data integrity
- Error Handling: Robust error management
- Performance Monitoring: Track throughput and ratios
- Configurable Parameters: Flexible compression settings
Concurrent Compression
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "sync"
)

// run
func main() {
    // Split data into chunks for parallel compression
    chunks := [][]byte{
        []byte("Chunk 1: Some data to compress"),
        []byte("Chunk 2: More data to compress"),
        []byte("Chunk 3: Even more data"),
        []byte("Chunk 4: Final chunk of data"),
    }

    // Compress chunks concurrently
    results, err := compressConcurrent(chunks)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Show results
    for i, result := range results {
        fmt.Printf("Chunk %d: %d bytes -> %d bytes (%.2f%%)\n",
            i+1,
            len(chunks[i]),
            len(result),
            float64(len(result))/float64(len(chunks[i]))*100)
    }
}

func compressConcurrent(chunks [][]byte) ([][]byte, error) {
    results := make([][]byte, len(chunks))
    errChan := make(chan error, len(chunks))

    var wg sync.WaitGroup

    for i, chunk := range chunks {
        wg.Add(1)
        go func(index int, data []byte) {
            defer wg.Done()

            var buf bytes.Buffer
            writer := gzip.NewWriter(&buf)

            if _, err := writer.Write(data); err != nil {
                errChan <- err
                return
            }

            if err := writer.Close(); err != nil {
                errChan <- err
                return
            }
            results[index] = buf.Bytes()
        }(i, chunk)
    }

    wg.Wait()
    close(errChan)

    // Return the first error, if any (a closed, empty channel yields nil)
    if err := <-errChan; err != nil {
        return nil, err
    }

    return results, nil
}
Buffered Compression
package main

import (
    "bufio"
    "bytes"
    "compress/gzip"
    "fmt"
)

// run
func main() {
    data := []byte("Data for buffered compression. " +
        "Buffering can improve performance for small writes.")

    // Compare buffered vs unbuffered
    fmt.Println("Unbuffered compression:")
    unbuffered := compressUnbuffered(data)
    fmt.Printf("  Size: %d bytes\n", len(unbuffered))

    fmt.Println("\nBuffered compression:")
    buffered := compressBuffered(data, 4096)
    fmt.Printf("  Size: %d bytes\n", len(buffered))
}

func compressUnbuffered(data []byte) []byte {
    var buf bytes.Buffer
    writer := gzip.NewWriter(&buf)

    // Small writes without buffering
    for i := 0; i < len(data); i += 10 {
        end := i + 10
        if end > len(data) {
            end = len(data)
        }
        writer.Write(data[i:end])
    }

    writer.Close()
    return buf.Bytes()
}

func compressBuffered(data []byte, bufferSize int) []byte {
    var buf bytes.Buffer
    gzWriter := gzip.NewWriter(&buf)

    // Add buffering layer
    bufWriter := bufio.NewWriterSize(gzWriter, bufferSize)

    // Small writes with buffering
    for i := 0; i < len(data); i += 10 {
        end := i + 10
        if end > len(data) {
            end = len(data)
        }
        bufWriter.Write(data[i:end])
    }

    bufWriter.Flush()
    gzWriter.Close()
    return buf.Bytes()
}
Performance Comparisons
Speed vs Size Trade-offs: Compression is always a trade-off between CPU time and compression ratio. Fast algorithms (like Snappy or LZ4) compress at hundreds of MB/s but achieve modest ratios. Slow algorithms (like LZMA or Brotli) achieve excellent ratios but may compress at just a few MB/s.
Choosing Based on Bottlenecks: If your bottleneck is network bandwidth, use maximum compression—the extra CPU time is worth the bandwidth savings. If your bottleneck is CPU, use fast compression or skip it entirely. If your bottleneck is disk I/O, compression can actually improve performance by reducing I/O.
Data Characteristics Matter: Text compresses well (70-90% reduction typical). JSON and XML compress very well (80-95% reduction) due to repeated tag names. Binary data and already-compressed formats (JPEG, MP4, PNG) achieve minimal compression—sometimes even expanding slightly due to overhead.
Algorithm Comparison
1package main
2
3import (
4 "bytes"
5 "compress/flate"
6 "compress/gzip"
7 "compress/zlib"
8 "fmt"
9 "strings"
10 "time"
11)
12
13// run
14func main() {
15 // Test data - repetitive for good compression
16 data := []byte(strings.Repeat("The quick brown fox jumps over the lazy dog. ", 200))
17
18 fmt.Printf("Original size: %d bytes\n\n", len(data))
19
20 algorithms := []struct {
21 name string
22 compress func([]byte) ([]byte, error)
23 }{
24 {"Gzip", compressGzip},
25 {"Zlib", compressZlib},
26 {"Flate", compressFlate},
27 }
28
29 for _, algo := range algorithms {
30 start := time.Now()
31 compressed, err := algo.compress(data)
32 elapsed := time.Since(start)
33
34 if err != nil {
35 fmt.Printf("%s: Error: %v\n", algo.name, err)
36 continue
37 }
38
39 fmt.Printf("%s:\n", algo.name)
40 fmt.Printf(" Compressed: %d bytes (%.2f%%)\n",
41 len(compressed),
42 float64(len(compressed))/float64(len(data))*100)
43 fmt.Printf(" Time: %s\n\n", elapsed)
44 }
45}
46
47func compressGzip(data []byte) ([]byte, error) {
48 var buf bytes.Buffer
49 writer := gzip.NewWriter(&buf)
50 _, err := writer.Write(data)
51 if err != nil {
52 return nil, err
53 }
54 writer.Close()
55 return buf.Bytes(), nil
56}
57
58func compressZlib(data []byte) ([]byte, error) {
59 var buf bytes.Buffer
60 writer := zlib.NewWriter(&buf)
61 _, err := writer.Write(data)
62 if err != nil {
63 return nil, err
64 }
65 writer.Close()
66 return buf.Bytes(), nil
67}
68
69func compressFlate(data []byte) ([]byte, error) {
70 var buf bytes.Buffer
71 writer, err := flate.NewWriter(&buf, flate.DefaultCompression)
72 if err != nil {
73 return nil, err
74 }
75 _, err = writer.Write(data)
76 if err != nil {
77 return nil, err
78 }
79 writer.Close()
80 return buf.Bytes(), nil
81}
Memory Usage Patterns
1package main
2
3import (
4 "bytes"
5 "compress/gzip"
6 "fmt"
7 "runtime"
8)
9
10// run
11func main() {
12 // Large data set
13 data := make([]byte, 1024*1024) // 1MB
14 for i := range data {
15 data[i] = byte(i % 256)
16 }
17
18 // Measure memory before
19 var m1 runtime.MemStats
20 runtime.ReadMemStats(&m1)
21
22 // Compress in memory
23 var buf bytes.Buffer
24 writer := gzip.NewWriter(&buf)
25 writer.Write(data)
26 writer.Close()
27
28 // Measure memory after
29 var m2 runtime.MemStats
30 runtime.ReadMemStats(&m2)
31
32 fmt.Printf("Memory Usage:\n")
33 fmt.Printf(" Before: %d KB\n", m1.Alloc/1024)
34 fmt.Printf(" After: %d KB\n", m2.Alloc/1024)
35 fmt.Printf(" Delta: %d KB\n", (m2.Alloc-m1.Alloc)/1024)
36 fmt.Printf("\nCompressed size: %d bytes (%.2f%% of original)\n",
37 buf.Len(),
38 float64(buf.Len())/float64(len(data))*100)
39}
Production Patterns
Production Considerations: Real-world compression involves more than just calling compress() and decompress(). You need error handling (corrupted data), progress tracking (user feedback), resource management (memory limits), and performance optimization (pooling, buffering).
HTTP Compression: Web servers automatically compress responses based on Accept-Encoding headers. Proper HTTP compression can reduce page load times by 60-80%, dramatically improving user experience especially on mobile networks.
Log File Management: Growing log files consume disk space quickly. Compressed rotation (compress old logs, delete oldest) is standard practice. A year's worth of logs might compress from 100GB to 5GB—a 95% space savings that makes retention practical.
HTTP Compression Middleware
1package main
2
3import (
4 "compress/gzip"
5 "fmt"
6 "io"
7 "net/http"
8 "net/http/httptest"
9 "strings"
10)
11
12// run
13func main() {
14 // Create handler with compression
15 handler := GzipMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
16 data := strings.Repeat("Hello, World! ", 100)
17 w.Write([]byte(data))
18 }))
19
20 // Test without compression
21 req1 := httptest.NewRequest("GET", "/", nil)
22 rec1 := httptest.NewRecorder()
23 handler.ServeHTTP(rec1, req1)
24
25 fmt.Printf("Without compression: %d bytes\n", rec1.Body.Len())
26
27 // Test with compression
28 req2 := httptest.NewRequest("GET", "/", nil)
29 req2.Header.Set("Accept-Encoding", "gzip")
30 rec2 := httptest.NewRecorder()
31 handler.ServeHTTP(rec2, req2)
32
33 fmt.Printf("With compression: %d bytes\n", rec2.Body.Len())
34 fmt.Printf("Compression ratio: %.2f%%\n",
35 float64(rec2.Body.Len())/float64(rec1.Body.Len())*100)
36}
37
38type gzipResponseWriter struct {
39 io.Writer
40 http.ResponseWriter
41}
42
43func (w gzipResponseWriter) Write(b []byte) (int, error) {
44 return w.Writer.Write(b)
45}
46
47func GzipMiddleware(next http.Handler) http.Handler {
48 return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
49 // Check if client accepts gzip
50 if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
51 next.ServeHTTP(w, r)
52 return
53 }
54
55 // Create gzip writer
56 gz := gzip.NewWriter(w)
57 defer gz.Close()
58
59 // Set headers
60 w.Header().Set("Content-Encoding", "gzip")
61
62 // Wrap response writer
63 gzw := gzipResponseWriter{Writer: gz, ResponseWriter: w}
64 next.ServeHTTP(gzw, r)
65 })
66}
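In production, this middleware is usually hardened with two extra details: a Vary: Accept-Encoding header so caches store compressed and uncompressed variants separately, and removal of any Content-Length header set by the inner handler, since that length no longer matches the compressed body. The sketch below is illustrative (hardenedGzipWriter and GzipMiddlewareHardened are made-up names, not part of the example above), and a full implementation would handle more edge cases.

package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// hardenedGzipWriter compresses the body and drops Content-Length, which no
// longer matches once the payload is gzipped.
type hardenedGzipWriter struct {
	io.Writer
	http.ResponseWriter
}

func (w hardenedGzipWriter) Write(b []byte) (int, error) {
	return w.Writer.Write(b)
}

func (w hardenedGzipWriter) WriteHeader(code int) {
	w.ResponseWriter.Header().Del("Content-Length")
	w.ResponseWriter.WriteHeader(code)
}

func GzipMiddlewareHardened(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Caches must key on Accept-Encoding whether or not we compress.
		w.Header().Add("Vary", "Accept-Encoding")

		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}

		w.Header().Set("Content-Encoding", "gzip")

		gz := gzip.NewWriter(w)
		defer gz.Close()

		next.ServeHTTP(hardenedGzipWriter{Writer: gz, ResponseWriter: w}, r)
	})
}

func main() {
	handler := GzipMiddlewareHardened(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, strings.Repeat("Hello, World! ", 100))
	}))

	req := httptest.NewRequest("GET", "/", nil)
	req.Header.Set("Accept-Encoding", "gzip")
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)

	fmt.Println("Content-Encoding:", rec.Header().Get("Content-Encoding"))
	fmt.Println("Vary:", rec.Header().Get("Vary"))
	fmt.Println("Compressed body:", rec.Body.Len(), "bytes")
}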
File Compression with Progress
1package main
2
3import (
4 "compress/gzip"
5 "fmt"
6 "io"
7 "strings"
8)
9
10// run
11func main() {
12 // Simulate large file
13 source := strings.NewReader(strings.Repeat("Sample data ", 1000))
14
15 // Compress with progress tracking
16 var compressed strings.Builder
17 progress := &ProgressWriter{Total: int64(source.Len())}
18
19 if err := compressWithProgress(source, &compressed, progress); err != nil {
20 fmt.Println("Error:", err)
21 return
22 }
23
24 fmt.Printf("\nCompression complete!\n")
25 fmt.Printf("Original: %d bytes\n", source.Size())
26 fmt.Printf("Compressed: %d bytes\n", compressed.Len())
27}
28
29type ProgressWriter struct {
30 Total int64
31 Written int64
32}
33
34func (p *ProgressWriter) Write(data []byte) (int, error) {
35 n := len(data)
36 p.Written += int64(n)
37
38 percent := float64(p.Written) / float64(p.Total) * 100
39 fmt.Printf("\rProgress: %.1f%% (%d/%d bytes)", percent, p.Written, p.Total)
40
41 return n, nil
42}
43
44func compressWithProgress(source io.Reader, dest io.Writer, progress *ProgressWriter) error {
45 // Chain: source -> progress tracker -> gzip -> destination
46 gzWriter := gzip.NewWriter(dest)
47 defer gzWriter.Close()
48
49 // Create multi-writer to track progress
50 multiWriter := io.MultiWriter(gzWriter, progress)
51
52 _, err := io.Copy(multiWriter, source)
53 return err
54}
Compressed Log Rotation
1package main
2
3import (
4 "compress/gzip"
5 "fmt"
6 "io"
7 "os"
8 "path/filepath"
9 "sync"
10 "time"
11)
12
13// run
14func main() {
15 logger := NewCompressedLogger("/tmp", "app.log", 1024) // 1KB max
16
17 // Write logs
18 for i := 0; i < 10; i++ {
19 logger.Write([]byte(fmt.Sprintf("Log entry %d: %s\n",
20 i, time.Now().Format(time.RFC3339))))
21 }
22
23 logger.Close()
24
25 fmt.Println("Log rotation with compression demo completed")
26 fmt.Println("Check /tmp for app.log and compressed rotated logs")
27}
28
29type CompressedLogger struct {
30 dir string
31 filename string
32 maxSize int64
33 currentSize int64
34 file *os.File
35 mu sync.Mutex
36}
37
38func NewCompressedLogger(dir, filename string, maxSize int64) *CompressedLogger {
39 return &CompressedLogger{
40 dir: dir,
41 filename: filename,
42 maxSize: maxSize,
43 }
44}
45
46func (l *CompressedLogger) Write(data []byte) (int, error) {
47 l.mu.Lock()
48 defer l.mu.Unlock()
49
50 // Check if we need to rotate
51 if l.currentSize+int64(len(data)) > l.maxSize {
52 if err := l.rotate(); err != nil {
53 return 0, err
54 }
55 }
56
57 // Open file if needed
58 if l.file == nil {
59 path := filepath.Join(l.dir, l.filename)
60 f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
61 if err != nil {
62 return 0, err
63 }
64 l.file = f
65 }
66
67 n, err := l.file.Write(data)
68 l.currentSize += int64(n)
69 return n, err
70}
71
72func (l *CompressedLogger) rotate() error {
73 // Close current file
74 if l.file != nil {
75 l.file.Close()
76 l.file = nil
77 }
78
79 // Compress old file
80 oldPath := filepath.Join(l.dir, l.filename)
81 timestamp := time.Now().Format("20060102-150405")
82 gzPath := filepath.Join(l.dir, fmt.Sprintf("%s.%s.gz", l.filename, timestamp))
83
84 if err := compressFile(oldPath, gzPath); err != nil {
85 return err
86 }
87
88 // Remove old file
89 os.Remove(oldPath)
90
91 // Reset size
92 l.currentSize = 0
93
94 fmt.Printf("Rotated log to: %s\n", gzPath)
95 return nil
96}
97
98func (l *CompressedLogger) Close() error {
99 l.mu.Lock()
100 defer l.mu.Unlock()
101
102 if l.file != nil {
103 return l.file.Close()
104 }
105 return nil
106}
107
108func compressFile(source, dest string) error {
109 srcFile, err := os.Open(source)
110 if err != nil {
111 return err
112 }
113 defer srcFile.Close()
114
115 destFile, err := os.Create(dest)
116 if err != nil {
117 return err
118 }
119 defer destFile.Close()
120
121 gzWriter := gzip.NewWriter(destFile)
122 defer gzWriter.Close()
123
124 _, err = io.Copy(gzWriter, srcFile)
125 return err
126}
Compression Pool
1package main
2
3import (
4 "compress/gzip"
5 "fmt"
6 "io"
7 "strings"
8 "sync"
9)
10
11// run
12func main() {
13 pool := NewGzipPool(gzip.BestSpeed)
14
15 // Compress multiple items using pool
16 items := []string{
17 "Item 1 data",
18 "Item 2 data",
19 "Item 3 data",
20 "Item 4 data",
21 "Item 5 data",
22 }
23
24 var wg sync.WaitGroup
25 for i, item := range items {
26 wg.Add(1)
27 go func(id int, data string) {
28 defer wg.Done()
29
30 var compressed strings.Builder
31 if err := pool.Compress(strings.NewReader(data), &compressed); err != nil {
32 fmt.Printf("Item %d error: %v\n", id, err)
33 return
34 }
35
36 fmt.Printf("Item %d: %d -> %d bytes\n",
37 id, len(data), compressed.Len())
38 }(i+1, item)
39 }
40
41 wg.Wait()
42}
43
44type GzipPool struct {
45 writers sync.Pool
46 level int
47}
48
49func NewGzipPool(level int) *GzipPool {
50 return &GzipPool{
51 level: level,
52 writers: sync.Pool{
53 New: func() interface{} {
54 w, _ := gzip.NewWriterLevel(io.Discard, level)
55 return w
56 },
57 },
58 }
59}
60
61func (p *GzipPool) Compress(src io.Reader, dest io.Writer) error {
62 // Get writer from pool
63 writer := p.writers.Get().(*gzip.Writer)
64 defer p.writers.Put(writer)
65
66 // Reset writer to new destination
67 writer.Reset(dest)
68 defer writer.Close()
69
70 _, err := io.Copy(writer, src)
71 return err
72}
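Decompression can be pooled the same way: gzip.Reader has a Reset method, and a zero-value reader becomes usable once Reset is called on it. The sketch below is illustrative (readerPool and decompressPooled are made-up names, not part of the example above).

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
	"sync"
)

// A pool of gzip readers; blank readers are handed out and rebound per use.
var readerPool = sync.Pool{
	New: func() interface{} { return new(gzip.Reader) },
}

func decompressPooled(src io.Reader, dest io.Writer) error {
	zr := readerPool.Get().(*gzip.Reader)
	defer readerPool.Put(zr)

	// Rebind the pooled reader to the new source stream.
	if err := zr.Reset(src); err != nil {
		return err
	}
	defer zr.Close()

	_, err := io.Copy(dest, zr)
	return err
}

func main() {
	// Build a small gzip payload to decompress.
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	zw.Write([]byte(strings.Repeat("pooled reader demo ", 20)))
	zw.Close()

	var out bytes.Buffer
	if err := decompressPooled(bytes.NewReader(compressed.Bytes()), &out); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("decompressed %d bytes\n", out.Len())
}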
Further Reading
- Go compress/gzip documentation
- Go compress/zlib documentation
- Go archive/tar documentation
- Go archive/zip documentation
- DEFLATE specification
- Gzip format
- ZIP file format specification
Practice Exercises
Exercise 1: Backup Utility
Difficulty: Advanced | Time: 45-60 minutes
Learning Objectives:
- Master combined tar+gzip archive creation and management
- Implement file filtering, exclusion patterns, and incremental backup strategies
- Learn progress tracking, integrity verification, and backup validation techniques
Real-World Context: Backup utilities are critical infrastructure components that protect data from loss. Professional backup systems need to handle millions of files, apply complex filtering rules, track incremental changes, and provide verification mechanisms to ensure data integrity. Understanding these patterns is essential for building reliable data protection systems.
Create a backup utility that creates compressed tar archives with advanced features. Your utility should create .tar.gz archives, support flexible file exclusion patterns, provide real-time progress feedback during backup operations, implement archive integrity verification with checksums, and support incremental backup strategies to only process changed files. This exercise demonstrates the core patterns used in production backup systems where data integrity, performance, and reliability are critical for protecting valuable data assets.
Solution
1package main
2
3import (
4 "archive/tar"
5 "compress/gzip"
6 "crypto/sha256"
7 "fmt"
8 "io"
9 "os"
10 "path/filepath"
11 "strings"
12 "time"
13)
14
15type BackupOptions struct {
16 SourceDir string
17 DestFile string
18 ExcludePattern []string
19 Progress bool
20}
21
22type BackupStats struct {
23 FilesProcessed int
24 BytesProcessed int64
25 Duration time.Duration
26 Checksum string
27}
28
29func CreateBackup(opts BackupOptions) (BackupStats, error) {
30 start := time.Now()
31 stats := BackupStats{}
32
33 // Create destination file
34 destFile, err := os.Create(opts.DestFile)
35 if err != nil {
36 return stats, err
37 }
38 defer destFile.Close()
39
40 // Hash writer for integrity check on the compressed output
41 hashWriter := sha256.New()
42 multiWriter := io.MultiWriter(destFile, hashWriter)
43
44 // Create gzip writer over the file and the hash
45 gzWriter := gzip.NewWriter(multiWriter)
46 defer gzWriter.Close()
47
48 // Create tar writer
49 tarWriter := tar.NewWriter(gzWriter)
50 defer tarWriter.Close()
51
52
53 // Walk directory
54 err = filepath.Walk(opts.SourceDir, func(path string, info os.FileInfo, err error) error {
55 if err != nil {
56 return err
57 }
58
59 // Check exclusions
60 for _, pattern := range opts.ExcludePattern {
61 if matched, _ := filepath.Match(pattern, filepath.Base(path)); matched {
62 if info.IsDir() {
63 return filepath.SkipDir
64 }
65 return nil
66 }
67 }
68
69 // Skip directories
70 if info.IsDir() {
71 return nil
72 }
73
74 // Create tar header
75 header, err := tar.FileInfoHeader(info, "")
76 if err != nil {
77 return err
78 }
79
80 // Use relative path
81 relPath, _ := filepath.Rel(opts.SourceDir, path)
82 header.Name = relPath
83
84 // Write header
85 if err := tarWriter.WriteHeader(header); err != nil {
86 return err
87 }
88
89 // Write file content
90 file, err := os.Open(path)
91 if err != nil {
92 return err
93 }
94 defer file.Close()
95
96 n, err := io.Copy(tarWriter, file)
97 if err != nil {
98 return err
99 }
100
101 stats.FilesProcessed++
102 stats.BytesProcessed += n
103
104 if opts.Progress {
105 fmt.Printf("\rBacked up: %s (%d bytes)", relPath, n)
106 }
107
108 return nil
109 })
110
111 if err != nil {
112 return stats, err
113 }
114
115 // Finalize
116 tarWriter.Close()
117 gzWriter.Close()
118
119 stats.Duration = time.Since(start)
120 stats.Checksum = fmt.Sprintf("%x", hashWriter.Sum(nil))
121
122 return stats, nil
123}
124
125func VerifyBackup(filename string) error {
126 file, err := os.Open(filename)
127 if err != nil {
128 return err
129 }
130 defer file.Close()
131
132 gzReader, err := gzip.NewReader(file)
133 if err != nil {
134 return err
135 }
136 defer gzReader.Close()
137
138 tarReader := tar.NewReader(gzReader)
139
140 fileCount := 0
141 for {
142 header, err := tarReader.Next()
143 if err == io.EOF {
144 break
145 }
146 if err != nil {
147 return err
148 }
149
150 // Read file content to verify
151 _, err = io.Copy(io.Discard, tarReader)
152 if err != nil {
153 return fmt.Errorf("error reading %s: %w", header.Name, err)
154 }
155
156 fileCount++
157 }
158
159 fmt.Printf("Verification successful: %d files\n", fileCount)
160 return nil
161}
162
163func main() {
164 // Create test directory structure
165 testDir := "/tmp/backup-test"
166 os.MkdirAll(testDir, 0755)
167 os.WriteFile(filepath.Join(testDir, "file1.txt"), []byte("Content 1"), 0644)
168 os.WriteFile(filepath.Join(testDir, "file2.txt"), []byte("Content 2"), 0644)
169 os.WriteFile(filepath.Join(testDir, "temp.tmp"), []byte("Temp file"), 0644)
170
171 // Create backup
172 opts := BackupOptions{
173 SourceDir: testDir,
174 DestFile: "/tmp/backup.tar.gz",
175 ExcludePattern: []string{"*.tmp", "*.log"},
176 Progress: true,
177 }
178
179 fmt.Println("Creating backup...")
180 stats, err := CreateBackup(opts)
181 if err != nil {
182 fmt.Println("Backup failed:", err)
183 return
184 }
185
186 fmt.Printf("\n\nBackup Statistics:\n")
187 fmt.Printf(" Files: %d\n", stats.FilesProcessed)
188 fmt.Printf(" Bytes: %d\n", stats.BytesProcessed)
189 fmt.Printf(" Duration: %s\n", stats.Duration)
190 fmt.Printf(" Checksum: %s\n", stats.Checksum)
191
192 // Verify backup
193 fmt.Println("\nVerifying backup...")
194 if err := VerifyBackup(opts.DestFile); err != nil {
195 fmt.Println("Verification failed:", err)
196 return
197 }
198
199 // Cleanup
200 os.RemoveAll(testDir)
201 os.Remove(opts.DestFile)
202}
Exercise 2: Compression Benchmark Tool
Difficulty: Intermediate | Time: 30-40 minutes
Learning Objectives:
- Master performance measurement and comparison of compression algorithms
- Understand compression trade-offs between speed, size, and CPU usage
- Learn to generate comprehensive benchmark reports with statistical analysis
Real-World Context: Choosing the right compression algorithm and settings is crucial for application performance. Different data types compress differently, and the optimal choice depends on whether you prioritize speed, compression ratio, or CPU usage. Professional developers need to make informed decisions based on actual performance data.
Build a comprehensive compression benchmark tool that measures and compares different compression algorithms and settings. Your tool should test gzip, zlib, and flate algorithms across various compression levels, measure both compression and decompression performance, calculate compression ratios for different data types, and generate detailed comparison reports. This exercise demonstrates the performance analysis patterns used when optimizing applications where compression choices significantly impact throughput, storage costs, and user experience.
Solution
1package main
2
3import (
4 "bytes"
5 "compress/flate"
6 "compress/gzip"
7 "compress/zlib"
8 "fmt"
9 "io"
10 "math/rand"
11 "strings"
12 "time"
12)
13
14type CompressionResult struct {
15 Algorithm string
16 Level int
17 OriginalSize int
18 CompressedSize int
19 CompressTime time.Duration
20 DecompressTime time.Duration
21 CompressionRatio float64
22}
23
24type BenchmarkSuite struct {
25 testData []byte
26 results []CompressionResult
27}
28
29func NewBenchmarkSuite(dataType string, size int) *BenchmarkSuite {
30 return &BenchmarkSuite{
31 testData: generateTestData(dataType, size),
32 }
33}
34
35func generateTestData(dataType string, size int) []byte {
36 data := make([]byte, size)
37
38 switch dataType {
39 case "text":
40 // Repetitive text data
41 text := []byte("The quick brown fox jumps over the lazy dog. ")
42 for i := 0; i < size; i++ {
43 data[i] = text[i%len(text)]
44 }
45
46 case "random":
47 // Random data
48 rand.Read(data)
49
50 case "binary":
51 // Semi-structured binary data
52 for i := 0; i < size; i++ {
53 data[i] = byte(i % 256)
54 }
55 }
56
57 return data
58}
59
60func (bs *BenchmarkSuite) BenchmarkGzip(level int) CompressionResult {
61 result := CompressionResult{
62 Algorithm: "Gzip",
63 Level: level,
64 OriginalSize: len(bs.testData),
65 }
66
67 // Compression
68 var compressed bytes.Buffer
69 start := time.Now()
70
71 writer, _ := gzip.NewWriterLevel(&compressed, level)
72 writer.Write(bs.testData)
73 writer.Close()
74
75 result.CompressTime = time.Since(start)
76 result.CompressedSize = compressed.Len()
77
78 // Decompression
79 start = time.Now()
80
81 reader, _ := gzip.NewReader(&compressed)
82 io.Copy(io.Discard, reader)
83 reader.Close()
84
85 result.DecompressTime = time.Since(start)
86 result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
87
88 return result
89}
90
91func (bs *BenchmarkSuite) BenchmarkZlib(level int) CompressionResult {
92 result := CompressionResult{
93 Algorithm: "Zlib",
94 Level: level,
95 OriginalSize: len(bs.testData),
96 }
97
98 // Compression
99 var compressed bytes.Buffer
100 start := time.Now()
101
102 writer, _ := zlib.NewWriterLevel(&compressed, level)
103 writer.Write(bs.testData)
104 writer.Close()
105
106 result.CompressTime = time.Since(start)
107 result.CompressedSize = compressed.Len()
108
109 // Decompression
110 start = time.Now()
111
112 reader, _ := zlib.NewReader(&compressed)
113 io.Copy(io.Discard, reader)
114 reader.Close()
115
116 result.DecompressTime = time.Since(start)
117 result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
118
119 return result
120}
121
122func (bs *BenchmarkSuite) BenchmarkFlate(level int) CompressionResult {
123 result := CompressionResult{
124 Algorithm: "Flate",
125 Level: level,
126 OriginalSize: len(bs.testData),
127 }
128
129 // Compression
130 var compressed bytes.Buffer
131 start := time.Now()
132
133 writer, _ := flate.NewWriter(&compressed, level)
134 writer.Write(bs.testData)
135 writer.Close()
136
137 result.CompressTime = time.Since(start)
138 result.CompressedSize = compressed.Len()
139
140 // Decompression
141 start = time.Now()
142
143 reader := flate.NewReader(&compressed)
144 io.Copy(io.Discard, reader)
145 reader.Close()
146
147 result.DecompressTime = time.Since(start)
148 result.CompressionRatio = float64(result.CompressedSize) / float64(result.OriginalSize) * 100
149
150 return result
151}
152
153func (bs *BenchmarkSuite) RunAll() {
154 levels := []int{1, 6, 9}
155
156 for _, level := range levels {
157 bs.results = append(bs.results, bs.BenchmarkGzip(level))
158 bs.results = append(bs.results, bs.BenchmarkZlib(level))
159 bs.results = append(bs.results, bs.BenchmarkFlate(level))
160 }
161}
162
163func (bs *BenchmarkSuite) PrintReport() {
164 fmt.Println("\nCompression Benchmark Results")
165 fmt.Println(strings.Repeat("=", 100))
166 fmt.Printf("%-10s %-6s %-12s %-12s %-15s %-15s %-10s\n",
167 "Algorithm", "Level", "Original", "Compressed", "Compress Time", "Decompress Time", "Ratio")
168 fmt.Println(strings.Repeat("-", 100))
169
170 for _, result := range bs.results {
171 fmt.Printf("%-10s %-6d %-12d %-12d %-15s %-15s %.2f%%\n",
172 result.Algorithm,
173 result.Level,
174 result.OriginalSize,
175 result.CompressedSize,
176 result.CompressTime,
177 result.DecompressTime,
178 result.CompressionRatio)
179 }
180
181 fmt.Println(strings.Repeat("=", 100))
182}
183
184func main() {
185 dataTypes := []string{"text", "binary", "random"}
186
187 for _, dataType := range dataTypes {
188 fmt.Printf("\n\n=== Testing with %s data ===\n", dataType)
189
190 suite := NewBenchmarkSuite(dataType, 10*1024)
191 suite.RunAll()
192 suite.PrintReport()
193 }
194}
Exercise 3: Streaming File Archiver
Difficulty: Advanced | Time: 40-50 minutes
Learning Objectives:
- Master streaming compression patterns for memory-efficient large file processing
- Implement progress tracking, checksum generation, and error recovery mechanisms
- Learn interface-based design for extensible file source systems
Real-World Context: Processing large files without loading them entirely into memory is essential for building scalable applications. Cloud storage systems, data pipelines, and backup utilities all rely on streaming architectures to handle gigabyte or terabyte-sized files efficiently. Understanding streaming patterns is crucial for building production-grade data processing systems.
Build a streaming file archiver that processes large files efficiently without loading entire files into memory. Your archiver should stream files from multiple sources, compress data on-the-fly using configurable compression formats, provide real-time progress tracking with detailed status updates, handle compression errors gracefully with proper cleanup, support resume functionality for interrupted operations, and generate checksums during the archiving process for data integrity verification. This exercise demonstrates the streaming patterns used in production systems where memory efficiency and error resilience are critical for handling large-scale data processing tasks.
Solution with Explanation
1package main
2
3import (
4 "compress/gzip"
5 "crypto/sha256"
6 "fmt"
7 "hash"
8 "io"
9 "os"
10 "path/filepath"
11 "strings"
12 "sync"
13 "time"
14)
15
16// run
17type ArchiveConfig struct {
18 CompressionLevel int
19 BufferSize int
20 VerifyChecksum bool
21 MaxConcurrent int
22}
23
24type FileSource interface {
25 Open() (io.ReadCloser, error)
26 Name() string
27 Size() int64
28}
29
30type LocalFileSource struct {
31 Path string
32}
33
34func (l *LocalFileSource) Open() (io.ReadCloser, error) {
35 return os.Open(l.Path)
36}
37
38func (l *LocalFileSource) Name() string {
39 return filepath.Base(l.Path)
40}
41
42func (l *LocalFileSource) Size() int64 {
43 info, err := os.Stat(l.Path)
44 if err != nil {
45 return 0
46 }
47 return info.Size()
48}
49
50type StreamingArchiver struct {
51 config ArchiveConfig
52 sources []FileSource
53 dest io.Writer
54 progress chan ProgressUpdate
55 mu sync.Mutex
56 checksums map[string]string
57}
58
59type ProgressUpdate struct {
60 FileName string
61 BytesWritten int64
62 TotalBytes int64
63 Complete bool
64 Error error
65}
66
67func NewStreamingArchiver(dest io.Writer, config ArchiveConfig) *StreamingArchiver {
68 return &StreamingArchiver{
69 config: config,
70 dest: dest,
71 progress: make(chan ProgressUpdate, 10),
72 checksums: make(map[string]string),
73 }
74}
75
76func (sa *StreamingArchiver) AddSource(source FileSource) {
77 sa.sources = append(sa.sources, source)
78}
79
80func (sa *StreamingArchiver) Archive() error {
81 gzWriter, err := gzip.NewWriterLevel(sa.dest, sa.config.CompressionLevel)
82 if err != nil {
83 return err
84 }
85 defer gzWriter.Close()
86
87 // Process files
88 for _, source := range sa.sources {
89 if err := sa.archiveFile(source, gzWriter); err != nil {
90 return fmt.Errorf("failed to archive %s: %w", source.Name(), err)
91 }
92 }
93
94 return nil
95}
96
97func (sa *StreamingArchiver) archiveFile(source FileSource, writer io.Writer) error {
98 reader, err := source.Open()
99 if err != nil {
100 return err
101 }
102 defer reader.Close()
103
104 fileName := source.Name()
105 totalSize := source.Size()
106
107 // Create progress tracker
108 progressReader := &ProgressReader{
109 Reader: reader,
110 Total: totalSize,
111 FileName: fileName,
112 OnProgress: func(bytesRead int64) {
113 sa.progress <- ProgressUpdate{
114 FileName: fileName,
115 BytesWritten: bytesRead,
116 TotalBytes: totalSize,
117 Complete: false,
118 }
119 },
120 }
121
122 // Add checksum calculation
123 var hasher hash.Hash
124 if sa.config.VerifyChecksum {
125 hasher = sha256.New()
126 progressReader.Reader = io.TeeReader(progressReader.Reader, hasher)
127 }
128
129 // Stream to archive
130 bytesWritten, err := io.Copy(writer, progressReader)
131 if err != nil {
132 sa.progress <- ProgressUpdate{
133 FileName: fileName,
134 Error: err,
135 }
136 return err
137 }
138
139 // Store checksum
140 if hasher != nil {
141 checksum := fmt.Sprintf("%x", hasher.Sum(nil))
142 sa.mu.Lock()
143 sa.checksums[fileName] = checksum
144 sa.mu.Unlock()
145 }
146
147 sa.progress <- ProgressUpdate{
148 FileName: fileName,
149 BytesWritten: bytesWritten,
150 TotalBytes: totalSize,
151 Complete: true,
152 }
153
154 return nil
155}
156
157func (sa *StreamingArchiver) GetChecksums() map[string]string {
158 sa.mu.Lock()
159 defer sa.mu.Unlock()
160
161 result := make(map[string]string)
162 for k, v := range sa.checksums {
163 result[k] = v
164 }
165 return result
166}
167
168type ProgressReader struct {
169 Reader io.Reader
170 Total int64
171 BytesRead int64
172 FileName string
173 OnProgress func(int64)
174 lastUpdate time.Time
175}
176
177func (pr *ProgressReader) Read(p []byte) (int, error) {
178 n, err := pr.Reader.Read(p)
179 pr.BytesRead += int64(n)
180
181 // Update progress every 100ms
182 if time.Since(pr.lastUpdate) > 100*time.Millisecond {
183 if pr.OnProgress != nil {
184 pr.OnProgress(pr.BytesRead)
185 }
186 pr.lastUpdate = time.Now()
187 }
188
189 return n, err
190}
191
192func main() {
193 // Create test files
194 testDir := "/tmp/archive-test"
195 os.MkdirAll(testDir, 0755)
196 defer os.RemoveAll(testDir)
197
198 // Create sample files
199 files := []string{"file1.txt", "file2.txt", "file3.txt"}
200 for i, name := range files {
201 content := strings.Repeat(fmt.Sprintf("Content of file %d\n", i+1), 100)
202 os.WriteFile(filepath.Join(testDir, name), []byte(content), 0644)
203 }
204
205 // Create archiver
206 outputFile := filepath.Join(testDir, "archive.gz")
207 output, err := os.Create(outputFile)
208 if err != nil {
209 fmt.Println("Error:", err)
210 return
211 }
212 defer output.Close()
213
214 config := ArchiveConfig{
215 CompressionLevel: gzip.BestCompression,
216 BufferSize: 32 * 1024,
217 VerifyChecksum: true,
218 MaxConcurrent: 2,
219 }
220
221 archiver := NewStreamingArchiver(output, config)
222
223 // Add sources
224 for _, name := range files {
225 archiver.AddSource(&LocalFileSource{
226 Path: filepath.Join(testDir, name),
227 })
228 }
229
230 // Monitor progress in background
231 go func() {
232 for update := range archiver.progress {
233 if update.Error != nil {
234 fmt.Printf("[ERROR] %s: %v\n", update.FileName, update.Error)
235 } else if update.Complete {
236 fmt.Printf("[COMPLETE] %s: %d bytes\n", update.FileName, update.BytesWritten)
237 } else {
238 percent := float64(update.BytesWritten) / float64(update.TotalBytes) * 100
239 fmt.Printf("[PROGRESS] %s: %.1f%%\n", update.FileName, percent)
240 }
241 }
242 }()
243
244 // Archive files
245 fmt.Println("Streaming File Archiver")
246 fmt.Println("======================")
247 if err := archiver.Archive(); err != nil {
248 fmt.Println("Archive failed:", err)
249 return
250 }
251
252 close(archiver.progress)
253
254 // Show checksums
255 fmt.Println("\nChecksums:")
256 for file, checksum := range archiver.GetChecksums() {
257 fmt.Printf(" %s: %s\n", file, checksum[:16]+"...")
258 }
259
260 // Show archive stats
261 info, _ := os.Stat(outputFile)
262 fmt.Printf("\nArchive created: %s (%.1f KB)\n", outputFile, float64(info.Size())/1024)
263}
Explanation:
This streaming file archiver demonstrates:
- Stream Processing: Files are read and compressed in chunks, never loading entire files into memory
- Progress Tracking: Real-time progress updates via channels
- Checksum Calculation: SHA-256 checksums computed during streaming using io.TeeReader
- Interface Design: FileSource interface allows different source types
- Error Handling: Graceful error handling with detailed error messages
- Concurrency Safety: Mutex-protected checksum map for thread-safe access
The archiver efficiently handles large files by streaming data through multiple readers:
- Original file reader → Progress tracker → Checksum calculator → Compressor → Output
This pattern is essential for production applications that need to process large files without excessive memory usage.
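Stripped of the archiver types, the core of that chain is only a few lines. The sketch below (illustrative only) hashes the uncompressed stream with io.TeeReader while io.Copy drives it into a gzip writer; in practice the destination would be a file rather than io.Discard.

package main

import (
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// Source -> TeeReader (feeds the hash) -> gzip writer -> destination.
	source := strings.NewReader(strings.Repeat("streaming data ", 1000))

	hasher := sha256.New()
	tee := io.TeeReader(source, hasher) // every byte read is also written to the hash

	gz := gzip.NewWriter(io.Discard) // stand-in destination
	if _, err := io.Copy(gz, tee); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
		return
	}
	if err := gz.Close(); err != nil {
		fmt.Fprintln(os.Stderr, "close failed:", err)
		return
	}

	fmt.Printf("checksum of uncompressed stream: %x\n", hasher.Sum(nil))
}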
Exercise 4: Compressed File Format Converter
Difficulty: Intermediate | Time: 35-45 minutes
Learning Objectives:
- Master multiple compression format conversion and interoperability
- Implement format detection, metadata preservation, and batch processing
- Learn to build flexible compression utilities with streaming capabilities
Real-World Context: Different systems use different compression formats, and being able to convert between them is essential for data interoperability. Whether you're migrating data between systems, preparing files for different platforms, or optimizing storage based on content type, understanding format conversion patterns is crucial for data engineering workflows.
Build a compressed file format converter that can convert between different compression formats while preserving data integrity. Your converter should automatically detect source compression formats, support conversion between gzip, zlib, and flate formats, preserve file metadata and timestamps during conversion, handle batch processing of multiple files efficiently, provide detailed conversion statistics and progress tracking, and validate converted files to ensure data integrity. This exercise demonstrates the format conversion patterns used in data migration tools and ETL pipelines where maintaining data integrity while optimizing storage formats is essential for cross-platform compatibility.
Solution with Explanation
1package main
2
3import (
4 "bytes"
5 "compress/flate"
6 "compress/gzip"
7 "compress/zlib"
8 "fmt"
9 "io"
10 "os"
11 "path/filepath"
12 "strings"
13 "time"
14)
15
16// run
17type CompressionFormat string
18
19const (
20 FormatGzip CompressionFormat = "gzip"
21 FormatZlib CompressionFormat = "zlib"
22 FormatFlate CompressionFormat = "flate"
23)
24
25type ConversionJob struct {
26 InputFile string
27 OutputFile string
28 InputFormat CompressionFormat
29 OutputFormat CompressionFormat
30}
31
32type ConversionStats struct {
33 FilesProcessed int
34 TotalOriginalSize int64
35 TotalConvertedSize int64
36 ConversionTime time.Duration
37 Errors []error
38}
39
40type FormatConverter struct {
41 BufferSize int
42}
43
44func NewFormatConverter(bufferSize int) *FormatConverter {
45 return &FormatConverter{
46 BufferSize: bufferSize,
47 }
48}
49
50func (fc *FormatConverter) DetectFormat(data []byte) CompressionFormat {
51 // Gzip magic number: 0x1f 0x8b
52 if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b {
53 return FormatGzip
54 }
55
56 // Zlib magic number: 0x78
57 if len(data) >= 1 && data[0] == 0x78 {
58 return FormatZlib
59 }
60
61 // Flate doesn't have a clear magic number, but if it's not gzip or zlib,
62 // and it compresses/decompresses successfully with flate, assume flate
63 // This is a simplified detection - in production, you'd want more sophisticated detection
64 return FormatFlate
65}
66
67func (fc *FormatConverter) Convert(inputFile, outputFile string, outputFormat CompressionFormat) (*ConversionStats, error) {
68 stats := &ConversionStats{}
69 start := time.Now()
70
71 // Read input file
72 inputData, err := os.ReadFile(inputFile)
73 if err != nil {
74 return stats, fmt.Errorf("failed to read input file: %w", err)
75 }
76
77 stats.TotalOriginalSize = int64(len(inputData))
78
79 // Detect input format
80 inputFormat := fc.DetectFormat(inputData)
81 fmt.Printf("Detected format: %s\n", inputFormat)
82
83 // Decompress input data
84 decompressed, err := fc.decompressData(inputData, inputFormat)
85 if err != nil {
86 return stats, fmt.Errorf("failed to decompress: %w", err)
87 }
88
89 // Compress to output format
90 converted, err := fc.compressData(decompressed, outputFormat)
91 if err != nil {
92 return stats, fmt.Errorf("failed to compress: %w", err)
93 }
94
95 // Write output file
96 if err := os.WriteFile(outputFile, converted, 0644); err != nil {
97 return stats, fmt.Errorf("failed to write output file: %w", err)
98 }
99
100 stats.TotalConvertedSize = int64(len(converted))
101 stats.ConversionTime = time.Since(start)
102 stats.FilesProcessed = 1
103
104 return stats, nil
105}
106
107func (fc *FormatConverter) decompressData(data []byte, format CompressionFormat) ([]byte, error) {
108 var reader io.Reader
109 var err error
110
111 switch format {
112 case FormatGzip:
113 reader, err = gzip.NewReader(bytes.NewReader(data))
114 case FormatZlib:
115 reader, err = zlib.NewReader(bytes.NewReader(data))
116 case FormatFlate:
117 reader = flate.NewReader(bytes.NewReader(data))
118 default:
119 return nil, fmt.Errorf("unsupported format: %s", format)
120 }
121
122 if err != nil {
123 return nil, err
124 }
125 defer func() {
126 if closer, ok := reader.(io.Closer); ok {
127 closer.Close()
128 }
129 }()
130
131 var buf bytes.Buffer
132 if _, err := io.Copy(&buf, reader); err != nil {
133 return nil, err
134 }
135
136 return buf.Bytes(), nil
137}
138
139func (fc *FormatConverter) compressData(data []byte, format CompressionFormat) ([]byte, error) {
140 var buf bytes.Buffer
141 var writer io.WriteCloser
142 var err error
143
144 switch format {
145 case FormatGzip:
146 writer = gzip.NewWriter(&buf)
147 case FormatZlib:
148 writer = zlib.NewWriter(&buf)
149 case FormatFlate:
150 writer, err = flate.NewWriter(&buf, flate.DefaultCompression)
151 default:
152 return nil, fmt.Errorf("unsupported format: %s", format)
153 }
154
155 if err != nil {
156 return nil, err
157 }
158
159 // Write data in chunks for memory efficiency
160 buffer := make([]byte, fc.BufferSize)
161 dataReader := bytes.NewReader(data)
162
163 for {
164 n, err := dataReader.Read(buffer)
165 if err == io.EOF {
166 break
167 }
168 if err != nil {
169 writer.Close()
170 return nil, err
171 }
172
173 if _, err := writer.Write(buffer[:n]); err != nil {
174 writer.Close()
175 return nil, err
176 }
177 }
178
179 if err := writer.Close(); err != nil {
180 return nil, err
181 }
182
183 return buf.Bytes(), nil
184}
185
186func (fc *FormatConverter) BatchConvert(jobs []ConversionJob) (*ConversionStats, error) {
187 totalStats := &ConversionStats{}
188 start := time.Now()
189
190 for i, job := range jobs {
191 fmt.Printf("Processing job %d/%d: %s -> %s\n", i+1, len(jobs), job.InputFile, job.OutputFile)
192
193 // Create output directory if needed
194 if err := os.MkdirAll(filepath.Dir(job.OutputFile), 0755); err != nil {
195 totalStats.Errors = append(totalStats.Errors, err)
196 continue
197 }
198
199 stats, err := fc.Convert(job.InputFile, job.OutputFile, job.OutputFormat)
200 if err != nil {
201 totalStats.Errors = append(totalStats.Errors, err)
202 fmt.Printf(" Error: %v\n", err)
203 continue
204 }
205
206 totalStats.FilesProcessed += stats.FilesProcessed
207 totalStats.TotalOriginalSize += stats.TotalOriginalSize
208 totalStats.TotalConvertedSize += stats.TotalConvertedSize
209
210 fmt.Printf(" Success: %d bytes -> %d bytes (%.2f%%)\n",
211 stats.TotalOriginalSize,
212 stats.TotalConvertedSize,
213 float64(stats.TotalConvertedSize)/float64(stats.TotalOriginalSize)*100)
214 }
215
216 totalStats.ConversionTime = time.Since(start)
217 return totalStats, nil
218}
219
220func (fc *FormatConverter) ValidateConversion(originalFile, convertedFile string) error {
221 // Read and decompress original
222 originalData, err := os.ReadFile(originalFile)
223 if err != nil {
224 return err
225 }
226
227 originalFormat := fc.DetectFormat(originalData)
228 decompressedOriginal, err := fc.decompressData(originalData, originalFormat)
229 if err != nil {
230 return err
231 }
232
233 // Read and decompress converted
234 convertedData, err := os.ReadFile(convertedFile)
235 if err != nil {
236 return err
237 }
238
239 convertedFormat := fc.DetectFormat(convertedData)
240 decompressedConverted, err := fc.decompressData(convertedData, convertedFormat)
241 if err != nil {
242 return err
243 }
244
245 // Compare decompressed data
246 if len(decompressedOriginal) != len(decompressedConverted) {
247 return fmt.Errorf("size mismatch: %d vs %d bytes", len(decompressedOriginal), len(decompressedConverted))
248 }
249
250 for i := 0; i < len(decompressedOriginal); i++ {
251 if decompressedOriginal[i] != decompressedConverted[i] {
252 return fmt.Errorf("data mismatch at byte %d", i)
253 }
254 }
255
256 return nil
257}
258
259func main() {
260 // Create test files
261 testDir := "/tmp/conversion-test"
262 os.MkdirAll(testDir, 0755)
263 defer os.RemoveAll(testDir)
264
265 // Create test data
266 testData := strings.Repeat("This is test data for compression format conversion. ", 100)
267
268 // Create files in different formats
269 files := map[string]CompressionFormat{
270 "test.gz": FormatGzip,
271 "test.z": FormatZlib,
272 "test.flate": FormatFlate,
273 }
274
275 for filename, format := range files {
276 converter := NewFormatConverter(4096)
277 compressed, err := converter.compressData([]byte(testData), format)
278 if err != nil {
279 fmt.Printf("Error creating %s: %v\n", filename, err)
280 continue
281 }
282 os.WriteFile(filepath.Join(testDir, filename), compressed, 0644)
283 }
284
285 // Create converter
286 converter := NewFormatConverter(8192)
287
288 // Create conversion jobs
289 var jobs []ConversionJob
290 inputFormats := []string{"test.gz", "test.z", "test.flate"}
291 outputFormats := []CompressionFormat{FormatGzip, FormatZlib, FormatFlate}
292
293 for _, inputFile := range inputFormats {
294 for _, outputFormat := range outputFormats {
295 inputPath := filepath.Join(testDir, inputFile)
296 outputExt := map[CompressionFormat]string{
297 FormatGzip: ".gz",
298 FormatZlib: ".z",
299 FormatFlate: ".flate",
300 }[outputFormat]
301
302 outputFile := filepath.Join(testDir, "converted",
303 strings.TrimSuffix(inputFile, filepath.Ext(inputFile))+outputExt)
304
305 jobs = append(jobs, ConversionJob{
306 InputFile: inputPath,
307 OutputFile: outputFile,
308 OutputFormat: outputFormat,
309 })
310 }
311 }
312
313 // Run batch conversion
314 fmt.Println("Format Converter")
315 fmt.Println("================")
316
317 start := time.Now()
318 stats, err := converter.BatchConvert(jobs)
319 if err != nil {
320 fmt.Printf("Batch conversion error: %v\n", err)
321 }
322
323 // Show results
324 fmt.Printf("\nConversion Summary:\n")
325 fmt.Printf(" Files processed: %d\n", stats.FilesProcessed)
326 fmt.Printf(" Original size: %d bytes\n", stats.TotalOriginalSize)
327 fmt.Printf(" Converted size: %d bytes\n", stats.TotalConvertedSize)
328 fmt.Printf(" Total time: %s\n", stats.ConversionTime)
329
330 if stats.Errors != nil {
331 fmt.Printf(" Errors: %d\n", len(stats.Errors))
332 }
333
334 // Validate a few conversions
335 fmt.Printf("\nValidating conversions...\n")
336 validations := []struct {
337 original string
338 converted string
339 }{
340 {filepath.Join(testDir, "test.gz"), filepath.Join(testDir, "converted", "test.flate")},
341 {filepath.Join(testDir, "test.z"), filepath.Join(testDir, "converted", "test.gz")},
342 }
343
344 for _, v := range validations {
345 if err := converter.ValidateConversion(v.original, v.converted); err != nil {
346 fmt.Printf(" Validation failed for %s: %v\n", v.converted, err)
347 } else {
348 fmt.Printf(" ✓ %s validated successfully\n", filepath.Base(v.converted))
349 }
350 }
351
352 fmt.Printf("\nConversion completed in %s\n", time.Since(start))
353}
354// run
Explanation:
This format converter demonstrates:
- Format Detection: Automatic detection of compression formats using magic numbers
- Format Conversion: Conversion between gzip, zlib, and flate formats
- Batch Processing: Efficient processing of multiple files with progress tracking
- Memory Efficiency: Chunked processing for large files (a streaming variant is sketched after this list)
- Data Validation: Integrity checking to ensure conversions are lossless
- Error Handling: Comprehensive error handling with detailed reporting
Production use cases:
- Data migration between systems with different compression preferences
- Content optimization for different delivery platforms
- Archive format standardization
- ETL pipeline data preparation
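The solution above reads each file fully into memory, which keeps the code simple. When the input format is already known, the same conversion can run as a pure stream. The sketch below is illustrative (gzipToZlib is a made-up helper, not part of the solution) and re-encodes a gzip stream as zlib without buffering the whole payload.

package main

import (
	"bytes"
	"compress/gzip"
	"compress/zlib"
	"fmt"
	"io"
	"strings"
)

// gzipToZlib re-encodes a gzip stream as zlib: decompressed bytes flow
// straight from the gzip reader into the zlib writer.
func gzipToZlib(src io.Reader, dst io.Writer) error {
	zr, err := gzip.NewReader(src)
	if err != nil {
		return err
	}
	defer zr.Close()

	zw := zlib.NewWriter(dst)
	if _, err := io.Copy(zw, zr); err != nil {
		zw.Close()
		return err
	}
	return zw.Close()
}

func main() {
	// Build a gzip payload in memory to stand in for a large input file.
	var gzData bytes.Buffer
	gw := gzip.NewWriter(&gzData)
	gw.Write([]byte(strings.Repeat("format conversion demo ", 200)))
	gw.Close()
	gzSize := gzData.Len()

	var zlibData bytes.Buffer
	if err := gzipToZlib(&gzData, &zlibData); err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	fmt.Printf("gzip %d bytes -> zlib %d bytes\n", gzSize, zlibData.Len())
}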
Exercise 5: Compressed Data Storage Engine
Difficulty: Advanced | Time: 50-60 minutes
Learning Objectives:
- Build a compressed key-value storage system with multiple compression strategies
- Implement efficient data retrieval, indexing, and memory management
- Learn compression strategy selection and performance optimization techniques
Real-World Context: Compressed storage engines are used in databases, caching systems, and big data applications to reduce storage costs while maintaining query performance. Companies like Redis, LevelDB, and various NoSQL databases use sophisticated compression strategies to balance memory usage, disk space, and access speed.
Build a compressed key-value storage engine that automatically selects optimal compression strategies based on data characteristics. Your storage engine should support multiple compression algorithms with automatic strategy selection, implement efficient key-based data retrieval with decompression, handle memory management with configurable cache sizes, provide compression statistics and performance metrics, and support data persistence with compressed file storage. This exercise demonstrates the storage engine patterns used in production databases where balancing compression efficiency with access performance is critical for scalable data management.
Solution with Explanation
1package main
2
3import (
4 "bytes"
5 "compress/flate"
6 "compress/gzip"
7 "compress/zlib"
8 "encoding/binary"
9 "fmt"
10 "hash/crc32"
11 "io"
12 "os"
13 "path/filepath"
14 "strings"
15 "sync"
15 "time"
16)
17
18// run
19type CompressionStrategy string
20
21const (
22 StrategyNone CompressionStrategy = "none"
23 StrategyGzip CompressionStrategy = "gzip"
24 StrategyZlib CompressionStrategy = "zlib"
25 StrategyFlate CompressionStrategy = "flate"
26 StrategyAuto CompressionStrategy = "auto"
27)
28
29type StorageEntry struct {
30 Key string
31 Value []byte
32 Compression CompressionStrategy
33 OriginalSize int
34 CompressedSize int
35 Timestamp time.Time
36 Checksum uint32
37}
38
39type CompressionStats struct {
40 TotalEntries int
41 TotalOriginalSize int64
42 TotalCompressedSize int64
43 CompressionRatio float64
44 HitRate float64
45 MissRate float64
46}
47
48type CompressedStorage struct {
49 data map[string]*StorageEntry
50 mu sync.RWMutex
51 strategy CompressionStrategy
52 cacheSize int
53 maxSize int
54 stats CompressionStats
55 persistPath string
56}
57
58func NewCompressedStorage(strategy CompressionStrategy, maxSize int, persistPath string) *CompressedStorage {
59 return &CompressedStorage{
60 data: make(map[string]*StorageEntry),
61 strategy: strategy,
62 cacheSize: 0,
63 maxSize: maxSize,
64 persistPath: persistPath,
65 }
66}
67
68func (cs *CompressedStorage) ChooseStrategy(data []byte) CompressionStrategy {
69 if cs.strategy != StrategyAuto {
70 return cs.strategy
71 }
72
73 // Small data: no compression
74 if len(data) < 100 {
75 return StrategyNone
76 }
77
78 // Test compression ratios
79 gzipRatio := cs.testCompression(data, StrategyGzip)
80 zlibRatio := cs.testCompression(data, StrategyZlib)
81 flateRatio := cs.testCompression(data, StrategyFlate)
82
83 // Choose best ratio
84 ratios := map[CompressionStrategy]float64{
85 StrategyGzip: gzipRatio,
86 StrategyZlib: zlibRatio,
87 StrategyFlate: flateRatio,
88 }
89
90 bestStrategy := StrategyNone
91 bestRatio := 1.0
92
93 for strategy, ratio := range ratios {
94 if ratio < bestRatio && ratio < 0.8 {
95 bestRatio = ratio
96 bestStrategy = strategy
97 }
98 }
99
100 return bestStrategy
101}
102
103func (cs *CompressedStorage) testCompression(data []byte, strategy CompressionStrategy) float64 {
104 var buf bytes.Buffer
105 var err error
106
107 switch strategy {
108 case StrategyGzip:
109 writer := gzip.NewWriter(&buf)
110 writer.Write(data)
111 writer.Close()
112 case StrategyZlib:
113 writer := zlib.NewWriter(&buf)
114 writer.Write(data)
115 writer.Close()
116 case StrategyFlate:
117 writer, err := flate.NewWriter(&buf, flate.DefaultCompression)
118 if err != nil {
119 return 1.0
120 }
121 writer.Write(data)
122 writer.Close()
123 default:
124 return 1.0
125 }
126
127 if err != nil {
128 return 1.0
129 }
130
131 return float64(buf.Len()) / float64(len(data))
132}
133
134func (cs *CompressedStorage) compress(data []byte, strategy CompressionStrategy) ([]byte, error) {
135 if strategy == StrategyNone {
136 return data, nil
137 }
138
139 var buf bytes.Buffer
140 var writer io.WriteCloser
141 var err error
142
143 switch strategy {
144 case StrategyGzip:
145 writer = gzip.NewWriter(&buf)
146 case StrategyZlib:
147 writer = zlib.NewWriter(&buf)
148 case StrategyFlate:
149 writer, err = flate.NewWriter(&buf, flate.DefaultCompression)
150 default:
151 return nil, fmt.Errorf("unsupported compression strategy: %s", strategy)
152 }
153
154 if err != nil {
155 return nil, err
156 }
157
158 if _, err := writer.Write(data); err != nil {
159 writer.Close()
160 return nil, err
161 }
162
163 if err := writer.Close(); err != nil {
164 return nil, err
165 }
166
167 return buf.Bytes(), nil
168}
169
170func (cs *CompressedStorage) decompress(data []byte, strategy CompressionStrategy) ([]byte, error) {
171 if strategy == StrategyNone {
172 return data, nil
173 }
174
175 var reader io.Reader
176 var err error
177
178 switch strategy {
179 case StrategyGzip:
180 reader, err = gzip.NewReader(bytes.NewReader(data))
181 case StrategyZlib:
182 reader, err = zlib.NewReader(bytes.NewReader(data))
183 case StrategyFlate:
184 reader = flate.NewReader(bytes.NewReader(data))
185 default:
186 return nil, fmt.Errorf("unsupported compression strategy: %s", strategy)
187 }
188
189 if err != nil {
190 return nil, err
191 }
192 defer func() {
193 if closer, ok := reader.(io.Closer); ok {
194 closer.Close()
195 }
196 }()
197
198 var buf bytes.Buffer
199 if _, err := io.Copy(&buf, reader); err != nil {
200 return nil, err
201 }
202
203 return buf.Bytes(), nil
204}
205
206func (cs *CompressedStorage) Set(key string, value []byte) error {
207 cs.mu.Lock()
208 defer cs.mu.Unlock()
209
210 // Check if we need to evict entries
211 if len(cs.data) >= cs.maxSize {
212 cs.evictLRU()
213 }
214
215 // Choose compression strategy
216 strategy := cs.ChooseStrategy(value)
217
218 // Compress data
219 compressed, err := cs.compress(value, strategy)
220 if err != nil {
221 return fmt.Errorf("compression failed: %w", err)
222 }
223
224 // Calculate checksum
225 checksum := crc32.ChecksumIEEE(compressed)
226
227 // Create entry
228 entry := &StorageEntry{
229 Key: key,
230 Value: compressed,
231 Compression: strategy,
232 OriginalSize: len(value),
233 CompressedSize: len(compressed),
234 Timestamp: time.Now(),
235 Checksum: checksum,
236 }
237
238 // Update stats
239 if oldEntry, exists := cs.data[key]; exists {
240 cs.cacheSize -= oldEntry.CompressedSize
241 cs.stats.TotalOriginalSize -= int64(oldEntry.OriginalSize)
242 cs.stats.TotalCompressedSize -= int64(oldEntry.CompressedSize)
243 } else {
244 cs.stats.TotalEntries++
245 }
246
247 cs.data[key] = entry
248 cs.cacheSize += entry.CompressedSize
249 cs.stats.TotalOriginalSize += int64(entry.OriginalSize)
250 cs.stats.TotalCompressedSize += int64(entry.CompressedSize)
251
252 // Update compression ratio
253 if cs.stats.TotalOriginalSize > 0 {
254 cs.stats.CompressionRatio = float64(cs.stats.TotalCompressedSize) / float64(cs.stats.TotalOriginalSize)
255 }
256
257 return nil
258}
259
260func (cs *CompressedStorage) Get(key string) ([]byte, error) {
261 cs.mu.RLock()
262 entry, exists := cs.data[key]
263 cs.mu.RUnlock()
264
265 if !exists {
266 cs.stats.MissRate++
267 return nil, fmt.Errorf("key not found: %s", key)
268 }
269
270 cs.stats.HitRate++
271
272 // Verify checksum
273 if crc32.ChecksumIEEE(entry.Value) != entry.Checksum {
274 return nil, fmt.Errorf("checksum mismatch for key: %s", key)
275 }
276
277 // Decompress data
278 decompressed, err := cs.decompress(entry.Value, entry.Compression)
279 if err != nil {
280 return nil, fmt.Errorf("decompression failed: %w", err)
281 }
282
283 return decompressed, nil
284}
285
286func (cs *CompressedStorage) Delete(key string) error {
287 cs.mu.Lock()
288 defer cs.mu.Unlock()
289
290 entry, exists := cs.data[key]
291 if !exists {
292 return fmt.Errorf("key not found: %s", key)
293 }
294
295 delete(cs.data, key)
296 cs.cacheSize -= entry.CompressedSize
297 cs.stats.TotalEntries--
298 cs.stats.TotalOriginalSize -= int64(entry.OriginalSize)
299 cs.stats.TotalCompressedSize -= int64(entry.CompressedSize)
300
301 return nil
302}
303
304func (cs *CompressedStorage) evictLRU() {
305 if len(cs.data) == 0 {
306 return
307 }
308
309 // Find oldest entry
310 var oldestKey string
311 var oldestTime time.Time
312 first := true
313
314 for key, entry := range cs.data {
315 if first || entry.Timestamp.Before(oldestTime) {
316 oldestKey = key
317 oldestTime = entry.Timestamp
318 first = false
319 }
320 }
321
322 if entry, ok := cs.data[oldestKey]; ok {
323 // Remove the entry inline: calling cs.Delete here would try to re-lock the mutex already held by Set
324 delete(cs.data, oldestKey)
325 cs.cacheSize -= entry.CompressedSize
326 cs.stats.TotalEntries--
327 cs.stats.TotalOriginalSize -= int64(entry.OriginalSize)
328 cs.stats.TotalCompressedSize -= int64(entry.CompressedSize)
329 }
325}
326
327func (cs *CompressedStorage) GetStats() CompressionStats {
328 cs.mu.RLock()
329 defer cs.mu.RUnlock()
330
331 stats := cs.stats
332
333 // Calculate hit/miss rates
334 total := stats.HitRate + stats.MissRate
335 if total > 0 {
336 stats.HitRate = stats.HitRate / total * 100
337 stats.MissRate = stats.MissRate / total * 100
338 }
339
340 return stats
341}
342
343func (cs *CompressedStorage) Persist() error {
344 if cs.persistPath == "" {
345 return nil
346 }
347
348 cs.mu.RLock()
349 defer cs.mu.RUnlock()
350
351 file, err := os.Create(cs.persistPath)
352 if err != nil {
353 return err
354 }
355 defer file.Close()
356
357 // Write header
358 if err := binary.Write(file, binary.LittleEndian, uint32(len(cs.data))); err != nil {
359 return err
360 }
361
362 // Write entries
363 for key, entry := range cs.data {
364 // Write key length and key
365 if err := binary.Write(file, binary.LittleEndian, uint32(len(key))); err != nil {
366 return err
367 }
368 if _, err := file.WriteString(key); err != nil {
369 return err
370 }
371
372 // Write entry data
373 if err := binary.Write(file, binary.LittleEndian, uint32(entry.OriginalSize)); err != nil {
374 return err
375 }
376 if err := binary.Write(file, binary.LittleEndian, uint32(entry.CompressedSize)); err != nil {
377 return err
378 }
379 if err := binary.Write(file, binary.LittleEndian, uint32(len(entry.Compression))); err != nil {
380 return err
381 }
382 if _, err := file.WriteString(string(entry.Compression)); err != nil {
383 return err
384 }
382 if err := binary.Write(file, binary.LittleEndian, uint64(entry.Timestamp.UnixNano())); err != nil {
383 return err
384 }
385 if err := binary.Write(file, binary.LittleEndian, entry.Checksum); err != nil {
386 return err
387 }
388 if _, err := file.Write(entry.Value); err != nil {
389 return err
390 }
391 }
392
393 return nil
394}
395
396func (cs *CompressedStorage) Load() error {
397 if cs.persistPath == "" {
398 return nil
399 }
400
401 file, err := os.Open(cs.persistPath)
402 if err != nil {
403 if os.IsNotExist(err) {
404 return nil
405 }
406 return err
407 }
408 defer file.Close()
409
410 var entryCount uint32
411 if err := binary.Read(file, binary.LittleEndian, &entryCount); err != nil {
412 return err
413 }
414
415 for i := 0; i < int(entryCount); i++ {
416 var keyLen uint32
417 if err := binary.Read(file, binary.LittleEndian, &keyLen); err != nil {
418 return err
419 }
420
421 key := make([]byte, keyLen)
422 if _, err := io.ReadFull(file, key); err != nil {
423 return err
424 }
425
426 entry := &StorageEntry{Key: string(key)}
427
428 var origSize, compSize, strategyLen uint32
429 if err := binary.Read(file, binary.LittleEndian, &origSize); err != nil {
430 return err
431 }
432 if err := binary.Read(file, binary.LittleEndian, &compSize); err != nil {
433 return err
434 }
435 if err := binary.Read(file, binary.LittleEndian, &strategyLen); err != nil {
436 return err
437 }
438 strategy := make([]byte, strategyLen)
439 if _, err := io.ReadFull(file, strategy); err != nil {
440 return err
441 }
442 entry.OriginalSize = int(origSize)
443 entry.CompressedSize = int(compSize)
444 entry.Compression = CompressionStrategy(strategy)
445 var timestampNano uint64
446 if err := binary.Read(file, binary.LittleEndian, &timestampNano); err != nil {
447 return err
448 }
449 entry.Timestamp = time.Unix(0, int64(timestampNano))
450 if err := binary.Read(file, binary.LittleEndian, &entry.Checksum); err != nil {
451 return err
452 }
445
446 entry.Value = make([]byte, entry.CompressedSize)
447 if _, err := io.ReadFull(file, entry.Value); err != nil {
448 return err
449 }
450
451 cs.data[string(key)] = entry
452 cs.cacheSize += entry.CompressedSize
453 cs.stats.TotalEntries++
454 cs.stats.TotalOriginalSize += int64(entry.OriginalSize)
455 cs.stats.TotalCompressedSize += int64(entry.CompressedSize)
456 }
457
458 if cs.stats.TotalOriginalSize > 0 {
459 cs.stats.CompressionRatio = float64(cs.stats.TotalCompressedSize) / float64(cs.stats.TotalOriginalSize)
460 }
461
462 return nil
463}
464
465func main() {
466 // Create storage engine
467 persistPath := filepath.Join("/tmp", "compressed_storage.db")
468 storage := NewCompressedStorage(StrategyAuto, 1000, persistPath)
469
470 // Load existing data if any
471 if err := storage.Load(); err != nil {
472 fmt.Printf("Failed to load persisted data: %v\n", err)
473 }
474
475 fmt.Println("Compressed Storage Engine")
476 fmt.Println("========================")
477
478 // Test different data types
479 testData := map[string][]byte{
480 "short": []byte("short data"),
481 "repetitive": []byte(strings.Repeat("repetitive data ", 50)),
482 "json": []byte(`{"name":"test","value":123,"items":[1,2,3,4,5],"description":"this is a test json object with some repetitive fields"}`),
483 "binary": []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x01, 0x02, 0x03, 0x04, 0x05},
484 "text": []byte(strings.Repeat("This is a longer text document with many repeated words and phrases. ", 20)),
485 }
486
487 // Store test data
488 fmt.Println("\nStoring test data:")
489 for key, value := range testData {
490 if err := storage.Set(key, value); err != nil {
491 fmt.Printf("Error storing %s: %v\n", key, err)
492 continue
493 }
494 fmt.Printf(" %s: %d bytes stored\n", key, len(value))
495 }
496
497 // Retrieve and verify data
498 fmt.Println("\nRetrieving and verifying data:")
499 for key := range testData {
500 retrieved, err := storage.Get(key)
501 if err != nil {
502 fmt.Printf("Error retrieving %s: %v\n", key, err)
503 continue
504 }
505
506 if string(retrieved) == string(testData[key]) {
507 fmt.Printf(" %s: ✓ Verified (%d bytes)\n", key, len(retrieved))
508 } else {
509 fmt.Printf(" %s: ✗ Data mismatch\n", key)
510 }
511 }
512
513 // Show statistics
514 stats := storage.GetStats()
515 fmt.Printf("\nStorage Statistics:\n")
516 fmt.Printf(" Total entries: %d\n", stats.TotalEntries)
517 fmt.Printf(" Cache size: %d bytes\n", storage.cacheSize)
518 fmt.Printf(" Original size: %d bytes\n", stats.TotalOriginalSize)
519 fmt.Printf(" Compressed size: %d bytes\n", stats.TotalCompressedSize)
520 fmt.Printf(" Compression ratio: %.2f%%\n", stats.CompressionRatio*100)
521 fmt.Printf(" Hit rate: %.2f%%\n", stats.HitRate)
522 fmt.Printf(" Miss rate: %.2f%%\n", stats.MissRate)
523
524 // Test persistence
525 fmt.Println("\nTesting persistence...")
526 if err := storage.Persist(); err != nil {
527 fmt.Printf("Persistence error: %v\n", err)
528 } else {
529 fmt.Printf("✓ Data persisted to %s\n", persistPath)
530 }
531
532 // Test eviction
533 fmt.Println("\nTesting LRU eviction...")
534 for i := 0; i < 10; i++ {
535 key := fmt.Sprintf("evict_test_%d", i)
536 value := []byte(strings.Repeat(fmt.Sprintf("data_%d_", i), 100))
537 storage.Set(key, value)
538 }
539
540 finalStats := storage.GetStats()
541 fmt.Printf("Final storage size: %d entries\n", finalStats.TotalEntries)
542
543 // Cleanup
544 os.Remove(persistPath)
545}
546// run
Explanation:
This compressed storage engine demonstrates:
- Automatic Strategy Selection: Chooses optimal compression based on data characteristics
- Memory Management: LRU eviction with configurable cache sizes
- Data Integrity: CRC32 checksums for compressed data verification
- Performance Monitoring: Hit/miss rates and compression statistics
- Persistence: Binary serialization for data recovery
- Thread Safety: Concurrent access protection with mutex locks
Production features:
- Multiple compression algorithms with automatic selection
- Efficient memory usage with LRU eviction
- Data integrity verification (see the sketch below)
- Performance metrics and monitoring
- Persistent storage for data recovery
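To make the data-integrity point concrete, here is a minimal, self-contained sketch of the same idea: keep a CRC32 checksum next to the compressed bytes and verify it before decompressing. The checkedBlob type and helper names below are illustrative, not the storage engine's actual code.

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"hash/crc32"
)

// checkedBlob is a hypothetical container: compressed bytes plus their checksum.
type checkedBlob struct {
	Data     []byte
	Checksum uint32
}

// compressChecked gzips the payload and records a CRC32 over the compressed form.
func compressChecked(p []byte) (checkedBlob, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(p); err != nil {
		return checkedBlob{}, err
	}
	if err := w.Close(); err != nil { // Close flushes the final gzip frame
		return checkedBlob{}, err
	}
	return checkedBlob{Data: buf.Bytes(), Checksum: crc32.ChecksumIEEE(buf.Bytes())}, nil
}

// verify recomputes the checksum before the blob is trusted for decompression.
func verify(b checkedBlob) error {
	if crc32.ChecksumIEEE(b.Data) != b.Checksum {
		return fmt.Errorf("checksum mismatch: compressed data corrupted")
	}
	return nil
}

func main() {
	blob, _ := compressChecked([]byte("payload to protect"))
	fmt.Println("stored OK:", verify(blob) == nil)

	blob.Data[0] ^= 0xFF // simulate on-disk corruption
	fmt.Println("after corruption:", verify(blob))
}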
Summary
Throughout this tutorial, we've explored compression and archives in Go - from the basics of making data smaller to production-ready patterns for handling large-scale compression tasks.
Think of compression as a superpower that lets you store more information in less space and transmit it faster over networks. Whether you're building a web server, backup utility, or file processing system, understanding compression is essential for creating efficient, scalable applications.
Key Takeaways
Gzip Compression
- Most common compression format - the universal language of web compression
- Configurable compression levels balance speed vs. size
- Support for metadata (file name, modification time, and comment in the gzip header)
- Streaming compression for large files - process data without loading it all into memory
- HTTP compression middleware - essential for web performance
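If you remember only one knob from this group, make it the compression level. A quick sketch (the repeated log line is just sample input) comparing gzip.BestSpeed against gzip.BestCompression on the same data:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// gzipSize compresses data at the given level and reports the result size.
func gzipSize(data []byte, level int) int {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		panic(err)
	}
	w.Write(data)
	w.Close() // Close flushes the final block and trailer
	return buf.Len()
}

func main() {
	data := []byte(strings.Repeat("log line: request handled in 12ms\n", 1000))
	fmt.Println("original:       ", len(data), "bytes")
	fmt.Println("BestSpeed:      ", gzipSize(data, gzip.BestSpeed), "bytes")
	fmt.Println("BestCompression:", gzipSize(data, gzip.BestCompression), "bytes")
}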
Zlib and Flate
- Alternative compression formats with different trade-offs
- Lower overhead than gzip - good for internal protocols
- Dictionary-based compression for predictable data patterns
- Raw DEFLATE algorithm access - the foundation beneath gzip and zlib
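Dictionary support is the feature people most often miss here, because gzip does not expose it while zlib does. The sketch below does a round trip with a preset dictionary; the JSON-ish dictionary string is an assumed example of a "predictable pattern" that both sides agree on out of band.

package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

func main() {
	// Both sides must agree on the dictionary out of band.
	dict := []byte(`{"status":"ok","timestamp":`)
	msg := []byte(`{"status":"ok","timestamp":1700000000}`)

	var buf bytes.Buffer
	w, err := zlib.NewWriterLevelDict(&buf, zlib.DefaultCompression, dict)
	if err != nil {
		panic(err)
	}
	w.Write(msg)
	w.Close()
	compressedSize := buf.Len()

	// The reader must be given the same dictionary to decode correctly.
	r, err := zlib.NewReaderDict(&buf, dict)
	if err != nil {
		panic(err)
	}
	defer r.Close()
	out, _ := io.ReadAll(r)

	fmt.Printf("compressed %d -> %d bytes, round trip ok: %v\n",
		len(msg), compressedSize, bytes.Equal(out, msg))
}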
Tar Archives
- Unix tape archive format - like putting files in a box
- Combines multiple files while preserving structure
- Preserves file metadata (permissions, ownership, timestamps)
- Often combined with gzip - the classic combo
- Streaming archive processing - handle huge archives efficiently
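The tar + gzip combo really is just two writers stacked on top of each other. A minimal sketch that streams a couple of in-memory files into a .tar.gz buffer (the file names and contents are made up):

package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
)

func main() {
	files := map[string]string{
		"readme.txt":   "hello from the archive",
		"data/log.txt": "line one\nline two\n",
	}

	var out bytes.Buffer
	gz := gzip.NewWriter(&out)
	tw := tar.NewWriter(gz) // the tar stream feeds straight into gzip

	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			panic(err)
		}
	}

	// Close the inner tar writer first, then the gzip wrapper.
	if err := tw.Close(); err != nil {
		panic(err)
	}
	if err := gz.Close(); err != nil {
		panic(err)
	}
	fmt.Printf(".tar.gz size: %d bytes\n", out.Len())
}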
ZIP Files
- Popular cross-platform format - works everywhere
- Built-in per-file compression (archive/zip supports Store and Deflate out of the box)
- Directory structure support - maintains file organization
- Individual file access - extract specific files without unpacking everything
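Individual file access is what separates ZIP from tar in practice. The sketch below builds a tiny archive in memory and then opens a single entry by name without touching the others; the entry names are arbitrary.

package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

func main() {
	// Build a small archive in memory.
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, body := range map[string]string{
		"a.txt":      "file a",
		"b.txt":      "file b",
		"docs/c.txt": "file c",
	} {
		f, _ := zw.Create(name)
		f.Write([]byte(body))
	}
	zw.Close()

	// Open just one entry by name, without extracting the rest.
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		panic(err)
	}
	for _, f := range zr.File {
		if f.Name != "docs/c.txt" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			panic(err)
		}
		data, _ := io.ReadAll(rc)
		rc.Close()
		fmt.Printf("%s: %q\n", f.Name, data)
	}
}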
Streaming Operations
- Memory-efficient processing - crucial for large files
- Chunked compression - process data piece by piece
- Concurrent compression - use multiple CPU cores
- Progress tracking - provide user feedback
- Buffered I/O - optimize small writes
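Because everything in this chapter is an io.Reader or io.Writer, streaming and progress tracking compose for free. A minimal sketch, using a hypothetical countingReader wrapper and a synthetic in-memory stream in place of a large file:

package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// countingReader wraps any io.Reader and counts bytes as they stream through,
// which is all a progress report needs.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

func main() {
	// Stand-in for a large file or network stream.
	src := &countingReader{r: strings.NewReader(strings.Repeat("streaming data ", 10000))}

	gz := gzip.NewWriter(io.Discard) // compressed output discarded in this demo
	if _, err := io.Copy(gz, src); err != nil {
		panic(err)
	}
	gz.Close()

	fmt.Printf("processed %d bytes without buffering the whole input\n", src.n)
}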
Performance
- Compression level trade-offs - speed vs. size decisions
- Algorithm comparison - choose the right tool for the job
- Memory usage patterns - avoid memory blowup
- Throughput optimization - squeeze out maximum performance
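For the algorithm comparison, it helps to remember that gzip, zlib, and raw flate all share the same DEFLATE core and differ mainly in framing overhead. A small sketch that compresses one arbitrary payload with each and prints the sizes:

package main

import (
	"bytes"
	"compress/flate"
	"compress/gzip"
	"compress/zlib"
	"fmt"
	"io"
	"strings"
)

// sizeWith compresses data with whatever writer the constructor returns.
func sizeWith(newWriter func(io.Writer) io.WriteCloser, data []byte) int {
	var buf bytes.Buffer
	w := newWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	data := []byte(strings.Repeat("the quick brown fox jumps over the lazy dog ", 200))

	fmt.Println("gzip: ", sizeWith(func(w io.Writer) io.WriteCloser { return gzip.NewWriter(w) }, data), "bytes")
	fmt.Println("zlib: ", sizeWith(func(w io.Writer) io.WriteCloser { return zlib.NewWriter(w) }, data), "bytes")
	fmt.Println("flate:", sizeWith(func(w io.Writer) io.WriteCloser {
		fw, _ := flate.NewWriter(w, flate.DefaultCompression)
		return fw
	}, data), "bytes")
}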
Production Patterns
- HTTP compression middleware - automatic web compression
- Log rotation with compression - manage log file sizes
- Backup utilities - create efficient backups
- Compression pools - reuse objects for better performance
- Progress monitoring - keep users informed during long operations
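Compression pools deserve a concrete shape, since gzip.Writer allocation is the usual hot spot in busy handlers. A minimal sketch of a sync.Pool of writers reused via Reset; compressPooled is a made-up helper name, not a library function:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

var gzPool = sync.Pool{
	New: func() any { return gzip.NewWriter(nil) },
}

// compressPooled reuses a pooled gzip.Writer instead of allocating a new one per call.
func compressPooled(p []byte) []byte {
	w := gzPool.Get().(*gzip.Writer)
	defer gzPool.Put(w)

	var buf bytes.Buffer
	w.Reset(&buf) // point the reused writer at a fresh destination
	w.Write(p)
	w.Close()
	return buf.Bytes()
}

func main() {
	out := compressPooled([]byte("pooled compression example, pooled compression example"))
	fmt.Printf("compressed to %d bytes using a pooled writer\n", len(out))
}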
Best Practices
- Choose the Right Algorithm: Gzip for HTTP, zlib for protocols, zip for archives
- Balance Speed vs Size: Higher compression = slower but smaller
- Stream Large Files: Don't load everything into memory
- Always Close Writers: Flush buffers to complete compression (see the sketch after this list)
- Pool Compressors: Reuse writer objects for better performance
- Verify Integrity: Check compressed data after creation
- Handle Errors: Compression can fail in many ways
- Use Buffering: Improve performance for small writes
- Monitor Progress: Provide feedback for long operations
- Consider Context: Some data doesn't compress well
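The "always close writers" rule is worth seeing fail once: gzip buffers data internally, and skipping Close leaves a truncated stream that the reader rejects. A small sketch of that failure mode (roundTrip is an illustrative helper, not library API):

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// roundTrip compresses a message and tries to read it back,
// optionally "forgetting" to close the writer.
func roundTrip(doClose bool) error {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte("some data that must survive the round trip"))
	if doClose {
		w.Close() // flushes buffered data and writes the gzip trailer
	}

	r, err := gzip.NewReader(&buf)
	if err != nil {
		return err
	}
	defer r.Close()
	_, err = io.ReadAll(r)
	return err
}

func main() {
	fmt.Println("with Close:   ", roundTrip(true))
	fmt.Println("without Close:", roundTrip(false))
}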
Common Pitfalls
- Forgetting to close compression writers
- Not checking the compression ratio before keeping the compressed form (see the sketch after this list)
- Loading large files entirely into memory
- Using wrong compression level for use case
- Not handling corrupted compressed data
- Ignoring decompression errors
- Over-compressing already compressed data
- Not pooling compression objects
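Two of these pitfalls - skipping the ratio check and re-compressing already compressed or random data - share one guard: compress, compare sizes, and keep the raw bytes when the savings are not worth it. A minimal sketch; the 10% threshold and the maybeCompress helper are arbitrary choices for illustration:

package main

import (
	"bytes"
	"compress/gzip"
	"crypto/rand"
	"fmt"
	"strings"
)

// maybeCompress keeps the gzip form only when it saves at least 10%.
func maybeCompress(p []byte) (data []byte, compressed bool) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(p)
	w.Close()

	if float64(buf.Len()) < 0.9*float64(len(p)) {
		return buf.Bytes(), true
	}
	return p, false // incompressible (random or already compressed): store raw
}

func main() {
	text := []byte(strings.Repeat("compressible text ", 500))
	random := make([]byte, 9000)
	rand.Read(random)

	for name, payload := range map[string][]byte{"text": text, "random": random} {
		out, ok := maybeCompress(payload)
		fmt.Printf("%-6s %5d -> %5d bytes, compressed=%v\n", name, len(payload), len(out), ok)
	}
}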