Data compression. That thing that magically makes data take up less space, somehow.
It might feel like it, and some very smart people came up with some very smart ways to do it, but it isn't magic.
At its heart, lossless data compression relies on reducing data redundancy. In other words, removing stuff that is repeated or not needed. Purely random data is incompressible because there is no redundancy to exploit.
Lossy compression relies on removing or simplifying data to reduce the storage requirements. Generally speaking, you can reduce the size needed by much more than lossless compression, at the expense of accuracy. A JPEG for instance, which is lossy, could be made smaller than a PNG file, which is lossless.
Needless to say, this is a huge simplification of a complicated topic.
Also see Texture Compression.
No compression
Let's imagine two people are playing a game of chess and we want to keep a record of the match. An easy way of fully describing a snapshot of the game state would be to list all 64 squares on the board along with the piece on each one (6 piece types in 2 colors, plus an "empty" state, gives 13 possibilities per square), plus whose turn it is. The turn indicator could fit in a single bit (black or white), but if we want a complete record of the game, we would store a turn number instead. 2 bytes for a turn counter would be enough to handle up to ~65 thousand moves (the longest human match was 269 moves, so we could squeeze this into 9 bits, up to 512 turns, if we wanted to maximise every bit of space).
Every single turn, we can store all this data.
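To put rough numbers on this, here's a minimal sketch (the variable names and bit widths are just one plausible way to lay it out) of how much space that naive per-turn snapshot takes:

```python
import math

SQUARES = 64
STATES_PER_SQUARE = 13  # 6 piece types x 2 colours, plus "empty"
bits_per_square = math.ceil(math.log2(STATES_PER_SQUARE))  # 4 bits each

turn_counter_bits = 16  # a 2-byte turn counter, good for ~65k moves

snapshot_bits = SQUARES * bits_per_square + turn_counter_bits
print(snapshot_bits)                  # 272 bits per turn
print(math.ceil(snapshot_bits / 8))   # 34 bytes per turn
```

Tiny by modern standards, but it's the baseline everything below improves on.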
You might see some benefits and some obvious drawbacks of this.
Let's start with the main benefit: there is redundancy, which means we can repair the data. Say the move for turn 25 gets lost in transmission or corrupted in storage. We know the game state on turns 24 and 26, so we can use those to work out what must have happened on turn 25. This ability is not endless; if we lost 3 turns in a row, we could end up in a state where one player might have moved piece A and then piece B, or piece B and then piece A, and we can't tell the difference. Given that even the best storage and transmission will occasionally have errors, this gives us some protection against loss. And even if we do end up losing a few turns of information, we can still pretty much tell how the game went. Awesome!
Another benefit is that we don't have to do anything to the data to convert it into a game state, so this would be quite quick to process; we would just have to read it into memory.
However, you will probably have figured out the downside: this is a lot of information to store (relatively speaking), and it takes a relatively long time to transmit (though by today's standards this is very little information).
So let's look at ways to improve this and see where this takes us.
Optimisation
Chess only really cares about pieces, not squares. Instead of listing all 64 squares and their state each turn, we could describe exactly where the 32 pieces are (or whether they've been captured). We've now halved the amount of data stored each turn. This isn't compression as such; it's optimisation (we are not removing any redundancy, we're just not wasting time transmitting and storing data that doesn't give us useful information), but it still reduces the amount of data needed, and for bonus points it will probably take less time to process as well.
What else can we do? We don't need to list every single piece every turn, especially since there will quickly be fewer than 32 pieces as the game goes on. We could instead do the equivalent of a radio call: "here is a list of all pieces, over". If there are an average of 8 pieces on the board per turn, this saves a lot of space over the course of the game, with some overhead for the 'radio call' (since we can't just assume a message lists all 32 pieces and move on, we need to be told when the transmission is finished because it's not a fixed size, and that takes information/space).
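A rough sketch of the savings from listing only the surviving pieces (the field widths here are one plausible choice, not the only one):

```python
# One entry per surviving piece: type + colour (4 bits) and square (6 bits).
BITS_PER_PIECE = 4 + 6
# The 'radio call' overhead: a count of how many pieces follow (0-32).
COUNT_PREFIX = 6

def per_turn_bits(pieces_on_board):
    return COUNT_PREFIX + pieces_on_board * BITS_PER_PIECE

print(per_turn_bits(32))  # full board: 326 bits
print(per_turn_bits(8))   # late game:   86 bits
```

The count prefix is the price of the variable-size message, and it pays for itself as soon as pieces start leaving the board.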
Data Differencing
In the chess example, we can do much better than listing the pieces or the state of the board. Players can only move their own pieces, and only make 1 move per turn, so we could store just the move made each turn. We can infer which side is moving from the turn number, so a move needs only the piece type (6 values), the start square (64 possibilities), and the destination square (64 possibilities - you could compress even further by listing and selecting from the valid moves each piece can make, but that would require a lot more processing). We're very quickly moving towards the system of Chess notation.

As you can see, we have greatly compressed the amount of data we need to transmit and store, but at the expense of two other considerations: the processing required (we have to apply the changes on our side, and if we want to know the game state after say 20 turns, we have no choice but to crunch the numbers and work it out), and what happens when data gets lost (the game data is completely corrupted from that point on because we can't fill in any gaps). This illustrates a general point: most optimizations trade processing time against memory space. Data differencing reduces how much space the information takes up, but in turn we have to spend more time processing to reconstruct any given state, and it gets worse the further out you go.
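A minimal sketch of packing one move into piece type plus start and destination squares (the bit layout and function names are just illustrative):

```python
def encode_move(piece_type, from_sq, to_sq):
    """Pack a move into 15 bits: 3 for piece type (0-5), 6 each for the squares (0-63)."""
    assert 0 <= piece_type < 6 and 0 <= from_sq < 64 and 0 <= to_sq < 64
    return (piece_type << 12) | (from_sq << 6) | to_sq

def decode_move(bits):
    return bits >> 12, (bits >> 6) & 0x3F, bits & 0x3F

# e.g. piece type 0 moving from square 12 to square 28 (think pawn e2-e4)
m = encode_move(0, 12, 28)
print(decode_move(m))  # (0, 12, 28)
```

15 bits per turn, versus the hundreds of bits per turn the full-board snapshot needed.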
This technique of storing the differences in a set of data is broadly known as data differencing, though it's often called "differential compression". Differential compression by itself is neither lossy nor lossless; that depends on the underlying algorithm it's paired with.
A major downside to differential compression is that the final state can become useless if even one step becomes corrupted. One compromise is to store the state of all pieces every 5 turns (adding some redundancy back in). It takes up more space, but we won't lose the entire game if a turn gets corrupted. Video compression uses this technique (amongst others): it stores differences, and every so many frames it provides a full frame (a key frame) before starting the process again.
Lossless compression
This is where we start looking into compressing data. Lossless compression, as it sounds, does not throw any information away. That is, you can perfectly reconstruct the original data from the compressed version. This is generally used for information that already comes in discrete data points, such as text, but it can be used on any digitized information, because digital information by its nature is made up of discrete data points. In our chess notation example, we performed a lossless compression on the information, since we can perfectly reconstruct the game state. The way we did this was by eliminating a lot of redundant information in how we describe the state.
There are several main ways to do lossless encoding:
Instruction encoding
At the heart of lossless encoding is a way to tell the program how to interpret the stored data to produce the desired result.
For example, we have an image where half of it is red and the other half is blue. We could store the data using something like "split the image in half vertically and make one side red, one side blue". This perfectly describes the image and takes up a lot less room, but we then have to decode (decompress) the data to actually get the image when we want to view it; in this case the computer would have to process these instructions to draw the image we want.
Or with numbers, if we have the number 10^1000, we can store this in shorthand rather than storing all 1000 digits separately (we can round numbers and do the same thing, but this would be lossy compression instead). In this case, you're not so much storing data as an instruction on how to calculate the data.
Run-Length Encoding
Run-length encoding, or RLE, finds repeats of a data value and, instead of storing each instance of the value, stores the value along with how many times it repeats. For example, take this sentence:
"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo"
If we ignore capitalization, we can store this as [buffalo, 8], which a computer could interpret as "repeat 'buffalo' eight times." RLE works less well on short runs or small data values - for single characters, or runs of fewer than three repeats, the value-and-count pair can take as much space as the data it replaces.
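A minimal RLE sketch over a list of tokens (the function names are just illustrative):

```python
from itertools import groupby

def rle_encode(tokens):
    """Collapse runs of identical tokens into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(tokens)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original token list."""
    return [value for value, count in pairs for _ in range(count)]

words = ("buffalo " * 8).split()
encoded = rle_encode(words)
print(encoded)  # [('buffalo', 8)]
assert rle_decode(encoded) == words
```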
Dictionary encoding
Dictionary encoding stores two things as the final set of data: a listing of indices or keys, where each entry is smaller than the actual value, and the titular dictionary that maps those indices or keys to the real values. For example, say we want to store a really large image consisting of only four colors using a four-byte color format. Rather than store four bytes per pixel, we store the values 0, 1, 2, or 3 to represent our colors and have the dictionary say what each number means. Now we only need two bits per pixel, which can be stored as one byte, a 75% compression ratio. For even more optimization, we can pack four 2-bit values into a byte, resulting in a staggering 93.75% compression ratio. This is how graphics on many 8-bit and 16-bit computer systems worked; the values in the image data were indices into a color palette, and the developer had to define which palette to use.
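A sketch of that palette-packing idea, squeezing four 2-bit indices into each byte (assuming, for simplicity, that the pixel count is a multiple of four):

```python
def pack_pixels(indices):
    """Pack 2-bit palette indices four to a byte (len(indices) must be a multiple of 4)."""
    out = bytearray()
    for i in range(0, len(indices), 4):
        a, b, c, d = indices[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)

# The titular dictionary: index -> four-byte RGBA colour (values are illustrative).
palette = [0xFF0000FF, 0x00FF00FF, 0x0000FFFF, 0xFFFFFFFF]

pixels = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_pixels(pixels)
print(len(packed), "bytes instead of", len(pixels) * 4)  # 2 bytes instead of 32
```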
Probably one of the most influential dictionary-based lossless algorithms is Lempel-Ziv, in its LZ77 and LZ78 variants. Unlike a naive approach to dictionary encoding that uses, for example, whole words, it looks for sequences of characters seen previously. When it finds the same sequence somewhere else in the data set, it encodes that sequence as "I saw this sequence X characters ago and it was Y characters long." This way, it only needs to store the first occurrence of the sequence. The most common application for LZ compression is file compression; the popular Zip format uses an LZ-based algorithm. LZ compression may also be used on top of other compression methods to help reduce their size further.
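A deliberately naive LZ77 sketch, illustrating the "(how far back, how long, next character)" idea rather than any real implementation:

```python
def lz77_encode(data, window=255):
    """Naive LZ77: emit (offset, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        for off in range(1, i - start + 1):
            length = 0
            # Matches may overlap the position being encoded; that's allowed.
            while i + length < len(data) and data[i + length - off] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decode(triples):
    s = []
    for off, length, nxt in triples:
        for _ in range(length):
            s.append(s[-off])  # copy from 'off' characters back
        if nxt:
            s.append(nxt)
    return "".join(s)

text = "abcabcabcabc"
assert lz77_decode(lz77_encode(text)) == text
```

Real LZ implementations use clever data structures to find matches quickly; this quadratic search is only for clarity.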
Entropy encoding
The main idea of entropy encoding is to find something in the data stream that's used often, then use fewer bits to represent that. It's one of the main methods of lossless encoding due to how space efficient it typically is.
For example, the letter "z" is rarely used in English. If we have some text, we could replace a common word such as "the" with "z", and this will save us space since we can represent 3 characters using 1 character. There's no such thing as a free lunch though, and the consequence in this example is that if we actually want a "z" we will need to use more space to represent this (such as using escape characters). The compression comes from the fact that "the" is more common than "z", so overall we would win out. For maximum compression we would want to map as much repeated data as possible into as little space as possible.
Two main entropy encoding methods are Huffman and Arithmetic coding.
Huffman coding
starts by gathering the frequency of each character in a string of data, then building a binary tree where the further down you go, the less frequently a character is used. A character is encoded as the binary sequence of branches taken through the tree. For example, "e" is one of the most frequently used letters in English, so we could encode it using just 0. From the previous example, "z" is the least used letter at less than a percent, so we'd have to encode it using something like 0101101. Something in the middle like "r" might be encoded as 101.
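A compact Huffman sketch using Python's heapq; real implementations build an explicit tree, but growing the bit strings directly as subtrees merge shows the same idea:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: frequent symbols get shorter bit strings."""
    # Each heap entry: (total frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # ...merge, prefixing
        merged.update({s: "1" + c for s, c in right.items()})  # their codes
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
# The space is the most common symbol here, so its code is no longer
# than that of a rare symbol like 'x'.
assert len(codes[" "]) <= len(codes["x"])
```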
Arithmetic coding
stores a number on a number line, normally between 0 and 1.0, but for the sake of this page we'll say between 0% and 100%; the result looks like the probability of the sequence occurring. Each character claims a portion of this number line proportional to its frequency, in some agreed order. For example, to encode the word "Hello", we note that it has five letters: H, e, and o each appear 20% of the time, while l appears 40% of the time. Sorting the letters by the order in which they first appear gives H the range 0% to 20%, e the range 20% to 40%, l the range 40% to 80%, and o the range 80% to 100%. We start with H's range, 0% to 20%, then for each following letter we subdivide the current range the same way. The e narrows the value down to between 4% and 8%. The two l's narrow it to between 5.6% and 7.2%, then between 6.24% and 6.88%. Finally, the o puts the value between 6.752% and 6.88%. Any number within the final range - say 6.8% - encodes the whole sequence, provided we also know how the letters were sorted.
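The interval narrowing can be sketched directly, with H, e, and o each at 20% and l at 40%, laid out in order of first appearance:

```python
# Each character owns a slice of the 0.0-1.0 number line.
ranges = {"H": (0.0, 0.2), "e": (0.2, 0.4), "l": (0.4, 0.8), "o": (0.8, 1.0)}

low, high = 0.0, 1.0
for ch in "Hello":
    width = high - low
    lo, hi = ranges[ch]
    # Shrink the current interval to this character's slice of it.
    low, high = low + width * lo, low + width * hi
    print(ch, round(low, 5), round(high, 5))

# Any number in the final interval, e.g. 0.068, encodes "Hello"
# (given the ranges table and the message length).
```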
Lossy compression
Lossy compression, as the name implies, throws away data to achieve better compression. That is, you cannot perfectly reconstruct the original data from the compressed version. Lossy compression is used to handle storing analog signals, such as what humans see and hear, in a digital format. While not exactly lossy compression per se, a characteristic of converting analog signals to digital is that data is captured in discrete steps, known as quantization. Ideally, we capture this data using the highest possible or practical number of bits to minimize the error from quantization. If we use enough bits to capture the information, then when we convert the signal back to analog it'll be more or less indistinguishable from the original.
A simple example is a compass rose. There are an infinite number of directions. If we go with a 16-point compass rose, we are quantizing an infinite number of directions into something we can store in 4 bits (2^4 = 16). The downside is we lose precision due to quantization error; a heading of 2 degrees would be represented simply as "North". But if we store this data in 32 bits, we can represent ~4 billion distinct directions.
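The 16-point quantization might look like this (the point list is the standard compass order):

```python
def to_compass_16(degrees):
    """Quantize a heading to one of 16 compass points (22.5 degrees apart)."""
    points = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
    return points[round(degrees % 360 / 22.5) % 16]

print(to_compass_16(2))   # N  - the 2-degree error is the quantization error
print(to_compass_16(40))  # NE
```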
The main problem is that this data can get really large. For example, 24 bits (3 bytes) are typically used to represent color these days. If we were to show a 24 FPS, 3840 x 2160 pixel resolution video (a typical 4K movie) without compression, we would need about 569.5 megabytes of data per second, and that's before adding in the audio. Lossy compression thus finds data that can be removed without the loss necessarily being noticed, by taking advantage of how humans perceive things.
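The arithmetic behind that figure:

```python
width, height = 3840, 2160  # 4K resolution
bytes_per_pixel = 3         # 24-bit colour
fps = 24

bytes_per_second = width * height * bytes_per_pixel * fps
print(bytes_per_second / 2**20)  # ~569.5 MiB of raw pixel data per second
```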
The main method of lossy compression, whether for audio or images, is the discrete cosine transform (DCT). This transforms the data into a sum of cosine functions at various frequencies, each with a scaling factor saying how much that frequency contributes. This is important for a few reasons: the scaling factors tend to have a smaller range of values than the raw values themselves; we only need to store the scaling factors, since the frequencies can be known ahead of time; and humans have poor perception of high frequency information. That last point means we can generally toss out or give very low weight to the high frequency information, saving space. For example, in the JPEG picture format, only about 32% of the scaling factors are saved; the rest get dropped.
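A minimal 1-D DCT-II sketch (ignoring the normalization factors real codecs apply); for a smooth signal, the energy concentrates in the low-frequency scaling factors, which is exactly what makes dropping the rest cheap:

```python
import math

def dct(xs):
    """Unnormalized 1-D DCT-II: one scaling factor per cosine frequency."""
    n = len(xs)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(xs))
            for k in range(n)]

# A smooth ramp: almost all the energy lands in the first few coefficients,
# so the high-frequency ones can be dropped or heavily quantized.
signal = [i / 7 for i in range(8)]
coeffs = dct(signal)
print([round(c, 3) for c in coeffs])
```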
For images, high frequency information can be in the form of a pixel-sized checkerboard pattern. If you look at it from far enough away or the pixels themselves are small enough, a black/white pattern starts looking more like gray. This was often exploited in digital art in a technique known as dithering. For sound, while humans are capable of hearing between 20Hz and 20KHz, our brains have adapted to focus between 500Hz and 6KHz (you need less volume to hear these frequencies).
Using DCT is often the first step for digital media, but there are other methods to augment the removal of information that may be deemed practically unnecessary.
- For audio, psychoacoustic analysis is performed. This takes advantage of the fact that certain tones have an absolute minimum threshold of perception, and that tones may be imperceptible if other, louder tones are playing at the same time. This allows discarding portions you likely won't hear anyway.
- For images and video, many lossy compression methods take advantage of the fact that human color perception is worse than brightness perception, as your eyes have fewer color-sensing cells than brightness-sensing ones. In addition, humans normally perceive green better than the other colors. This means you can reduce the resolution of just the color portion (known as chroma subsampling) and, if space is a concern, how many bits are used for red and blue.
- For video specifically, instead of storing whole frames, video codecs try to find only the changes between key frames and reconstruct the frames between them. This method uses motion or optical flow vectors to predict where a pixel goes.
And finally, the last piece of the puzzle in lossy compression for audio and video playback is how many bits per second of data the codec can use. Audio and video data is designed to be streamed from where it's stored and played immediately, rather than waiting for the entire thing to be transferred to faster media or RAM first. A higher bitrate means better quality regardless of the method used, though some codecs were developed to store information more efficiently at lower bitrates. To help with this, most codecs also work only on chunks of the data: for images, most algorithms split the image up into 8x8 or 16x16 pixel blocks, and for audio, it's split into chunks at even intervals.
