Data Compression
I'll be honest, data compression used to be an incredibly hot topic, but kids these days usually have powerful devices that do not seem to be memory-limited at all. Because of this, it seems odd to talk about data compression as a field of intense debate and development. It would naively seem that better hardware means fewer restrictions on programmers and less of a need to search for new and unique ways to compress their data; however, this is far from the case.
That said, there will always be new devices on the market that require minimizing data storage. In fact, some of the most revolutionary algorithms and methods in existence today fall into the category of data compression. From lossless data compression with Huffman encoding to genetic compression algorithms and machine learning, there is a lot to learn about this field, and we'll go through it piece-by-piece.
All that said, no discussion about data compression is complete without first discussing information itself -- specifically, how information is represented in computer systems. Now, we've discussed this in some depth before with bitlogic, but there is much more to the story than we let on before. Let's start with a working definition of information:
Information is a representation of certainty.
This might seem like a silly, hand-wavey definition, but hear me out. If I am uncertain about something, I will ask a question. The answer to this question could be any number of things, but it will contain information from an individual with some level of certainty. For example, let's pretend you have been furiously coding for months in your mother's basement (it happens). At some point, you realize you haven't gone outside and are completely unaware of the day, month, or even year! As your mother slips food under your door one day, you reach out to her and ask, "What's it like outside today?" Overjoyed at the prospect of her child finally leaving their room, your mother might say, "It's bright, sunny and warm. A perfect day to go outside and relax!" This provides you a lot of information, and to some degree of certainty you can conclude it is summer.
Of course, you do not acknowledge your mother's existence after her answer and simply go back to coding. You have the information you need. No reason to overcomplicate things with further human contact.
See, information is defined in a number of different places for a number of different reasons, so for our purposes, we will define the unit of information to be the bit, a simple binary 1 or 0. Taking the example mentioned before (assuming you can take your mother at face value), you were provided three true statements:
- It is bright
- It is sunny
- It is warm
With this information, you assumed that it was probably summer. Unfortunately, your assumption about it being summer is not information. It might be a logical conclusion, but it was not provided as a "fact." This is an important distinction between what we might colloquially describe as information and what information theory requires. Information theory works with measurements -- binary absolutes.
Now, we can clearly say that with some probability it is summer, but this is a different story altogether, which we will undoubtedly discuss in the future. For now, let's talk about a simple representation of information on computer systems. Imagine you have a simple alphabet with only 2 characters in it, a and b. In this case, there are plenty of ways you can represent these characters in bits, but the most obvious way might look like this:
Character | Bit Representation |
---|---|
a | 0 |
b | 1 |
So long as you don't add any new characters to the mix, this is a perfectly valid set of codewords. If you get the bitstring 0111101, you can easily decode it as abbbbab.
But what if you wanted to add a third character, c? Well, it's clear that c cannot be either 0 or 1, but because of the way we have defined the set of codewords above, it actually cannot be any combination of 0 or 1 either. For example, if we defined c to be 01 and we were provided the bitstring 0111101, we could interpret this string as either abbbbab or cbbbc! Now, we could use context or other information provided to distinguish these two possible cases, but it is clear that we need to think more deeply about our set of codewords in this case.
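To see the ambiguity concretely, here is a minimal Python sketch (the `codewords` dictionary and the `encode` helper are just illustrative names, not part of any standard library) showing that two different strings produce the exact same bitstring under this non-prefix-free code:

```python
# With these codewords, 'c' begins with the codeword for 'a', so two
# different strings can encode to the exact same bitstring.
codewords = {'a': '0', 'b': '1', 'c': '01'}

def encode(text):
    """Concatenate the codeword for each character."""
    return ''.join(codewords[ch] for ch in text)

print(encode("abbbbab"))  # 0111101
print(encode("cbbbc"))    # 0111101 -- identical, so decoding is ambiguous
```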
First, let's think a bit about decoding. For our purposes, we do not want to think when decoding. No matter what the bitstring is that we need to decode, we want to be able to read bit-by-bit until we find a match in our set of codewords and move on. Basically, we do not want any ambiguity in our set of codewords. The codeword for a or b should not appear at the start of the codeword for c! In this way, our set of codewords should be prefix-free: no codeword should appear as a prefix of another codeword.
If we wanted a good, prefix-free set of codewords for 4 characters (a, b, c, and d), it might look like this:
Character | Bit Representation |
---|---|
a | 00 |
b | 01 |
c | 10 |
d | 11 |
We can decode any even-length bitstring with this set. 0100101010110100 is bacccdba. This is great, but now we have another question: given a string of characters, can we construct a set of codewords that minimizes the number of bits in its corresponding bitstring?
This is the heart of data compression!
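Before going further, here is a small sketch of the "read bit-by-bit until you find a match" decoding described above, using the four codewords from the table. It's a minimal illustration, not a production decoder, and the names are purely illustrative:

```python
# Decode a bitstring by reading one bit at a time and emitting a character
# as soon as the accumulated bits match a codeword. This is unambiguous
# only because the codewords are prefix-free.
codewords = {'00': 'a', '01': 'b', '10': 'c', '11': 'd'}

def decode(bitstring):
    decoded = []
    buffer = ""
    for bit in bitstring:
        buffer += bit
        if buffer in codewords:
            decoded.append(codewords[buffer])
            buffer = ""
    return ''.join(decoded)

print(decode("0100101010110100"))  # bacccdba
```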
First things first, let's define a simple measure of how compressed the data is. Let's take the following string of characters: abbcccdddd. If we were to put all the letters in a bag and pull one out at random, we would have the following probabilities of pulling out any of the letters:
Character | Probability |
---|---|
a | .1 |
b | .2 |
c | .3 |
d | .4 |
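These probabilities are nothing more than relative character frequencies. A quick sketch of how you might compute them (using Python's standard `collections.Counter`; the variable names are just for illustration):

```python
from collections import Counter

# The probability of drawing each character from the "bag" is its
# count divided by the total number of characters.
text = "abbcccdddd"
counts = Counter(text)
probabilities = {ch: count / len(text) for ch, count in counts.items()}
print(probabilities)  # {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}
```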
This basically means that we are far more likely to pull out a d than an a, and if we are trying to minimize the length of our encoded bitstring, the bit representation for d should probably be shorter than that for a. Ultimately, to compress our encoded bitstring, we want to minimize the following quantity:
$$
L = \sum_{i=1}^{n} \ell(c_i)\, p_i
$$

where $n$ is the number of characters in our alphabet, $c_i$ is the codeword for the $i$th character, $\ell(c_i)$ is the length of that codeword in bits, $p_i$ is the probability of pulling that character out of the bag, and $L$ is a unitless quantity describing the average number of bits per encoded character, which is often used to compare compression schemes. To show how this works, let's consider encoding the word abbcccdddd with two different alphabets:
Character | Probability | Bit Representation 1 | Bit Representation 2 |
---|---|---|---|
a | .1 | 00 | 000 |
b | .2 | 01 | 001 |
c | .3 | 10 | 01 |
d | .4 | 11 | 1 |
In this case:

$$
L_1 = 2 \cdot 0.1 + 2 \cdot 0.2 + 2 \cdot 0.3 + 2 \cdot 0.4 = 2.0
$$

$$
L_2 = 3 \cdot 0.1 + 3 \cdot 0.2 + 2 \cdot 0.3 + 1 \cdot 0.4 = 1.9
$$

Here, it's clear that $L_2 < L_1$, and thus the second set of codewords compresses our data more than the first. This measure can be used as a direct test of certain simple data compression techniques, notably those created by Shannon, Fano, and Huffman, which will be covered soon!
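The same comparison can be done in a few lines of Python. This is just a sketch of the weighted-length formula above; the dictionaries and the `expected_length` function are illustrative names:

```python
# Average codeword length L = sum(len(codeword) * probability) over all
# characters; a smaller L means the encoding compresses this string better.
probabilities = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}
representation_1 = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
representation_2 = {'a': '000', 'b': '001', 'c': '01', 'd': '1'}

def expected_length(codewords, probabilities):
    return sum(len(codewords[ch]) * p for ch, p in probabilities.items())

print(expected_length(representation_1, probabilities))  # 2.0 (up to rounding)
print(expected_length(representation_2, probabilities))  # 1.9 (up to rounding)
```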
License
Code Examples
The code examples are licensed under the MIT license (found in LICENSE.md).
Text
The text of this chapter was written by James Schloss and is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
Pull Requests
After initial licensing (#560), the following pull requests have modified the text or graphics of this chapter:
- none