How software compression works

Table of contents

Software compression is everywhere, from network transmissions to data storage. But how does it reduce the size of data without losing information? This article discusses the basic principles used in compression algorithms.

An abstract example of compression

To understand compression, let's look at a very abstract example. Consider you were given the following text, with the task to reduce it's size:

the big red cat chased the small gray cat around the garden

One way to accomplish this goal is to replace recurring character sequences with an index from a dictionary. For example, the word "the" appears 3 times in the text, and the word " cat" (note the whitespace in front!) appears twice. So the text could be reformatted like this:

  1. "the"
  2. " cat"
1 big red2 chased 1 small gray2 around 1 garden.

This reduces the original text from 59 characters, to 55 (48 in the sentence, and 7 in the dictionary needed to decode it).

The difference between compression levels

Many compression software allow choosing a compression level, that will affect the compression ratio of the output file. The tradeoff here is that compression will take significantly more time and processing power by considering more options for replacing sequences in the data, in order to achieve the smallest possible output size. This tradeoff entirely depends on context: while long-term storage or archiving solutions greatly benefit from better compression, lower-latency use cases like voice calls or network requests often cannot entertain long delays between messages while one side compresses the data, and will opt to use lower comperssion levels to minimize delays between data transfers.

With higher compression levels, the algorithm will be more eager to consider smaller patterns for replacement, with less overall impact on filesize. As an example, if a text contained the word "apartment" twice and the word "as" twice, then "apartment" would yield up to 9 saved characters, while "as" would only save 2 characters, while causing the same work for the algorithm to go through the text and replace the words. A lower compression ratio may choose to skip patterns that are too small or don't appear often enough, whereas a higher compression level will be more eager to save as many characters as it can.

Why some files compress better than others

The type of data to compress can have a huge impact on the efficiency of the compression itself. Consider these examples:

A gray wolf ran through the forest.
aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa

The first sentence contains only "th" as a recurring character pattern, and only twice. The second one consists of a single pattern, which can be compressed much higher. Assuming the algorithm were tho pick "aaa aaa aaa " (note the whitespace at the end) as it's pattern, the substitution would look like this:

  1. "aaa aaa aaa "
11111111111111111111111111111aaa aaa aaa

With our simplified compression algorithm, the first sentence would stay at 35 characters (33 int ext and two from dictionary), but the second (which is considerably longer) compresses from 359 characters to only only 52 (40 in the text and 12 in the dictionary). This lesson can also be applied to most compression algorithms when handling text: structured or recurring patterns will compress better than texts with less recurring patterns. Things like human words will often compress well, as many words will repeat throughout longer texts, like "the", "as", "or", "to", "and" etc.

It is important to remember that many different compression algorithms exist, each built for different use cases. Image compression algorithms are often built into image formats like PNG or WEBP, more text-centric versions like flate are commonly used when handling log files or plaintext network traffic like HTTP, and some are more general-purpose such as gzip, zip, rar or 7z. Choosing the right algorithm can lead to significant savings in time or storage space, but only testing with actual data will tell which algorithm is right for your needs.




Of course, modern compression algorithms are much more optimized than the abstract sample in this article. They usually operate on bytes directly instead of words, give smaller dictionary keys to more frequent sequences, and can process a file much faster by splitting it into multiple chunks or through asynchronous processing. More specialized algorithms for images or videos will work tightly with the underlying frames and pixel encodings to encode the visual information more efficiently without losing any visible quality.

More articles

Checking disk health in Linux

Identify bad drives before they fail

Handling file downloads with PHP

Meeting user expectations with every download

Streamlined error handling in PHP

Handling all kinds of errors with a single function

Why boring software is a smart choice

Not everything is about excitement

Common pitfalls running docker in production

Avoiding the mistakes many make when embracing containerized deployments

Modern linux networking basics

Getting started with systemd-networkd, NetworkManager and the iproute2 suite