Loading

Module 1: Uncertainty, Compression and Entropy

Notes
Study Reminders
Support
Text Version

Compression Limits

Set your study reminders

We will email you at these times to remind you to study.
  • Monday

    -

    7am

    +

    Tuesday

    -

    7am

    +

    Wednesday

    -

    7am

    +

    Thursday

    -

    7am

    +

    Friday

    -

    7am

    +

    Saturday

    -

    7am

    +

    Sunday

    -

    7am

    +

Now that we have seen enough examples for compression, it is time to come up with some mathematically sound definitions of minimum limits of compression. So, we will present two different definitions; the first one here, the first one here tries to capture the minimum number of bits required, the minimum number of bits required to represent almost all symbols from the set, calligraphic X, remember we have the set calligraphic X which represents all possible symbols, all possible words that we are trying to store. We are looking for all subsets of this set which have a large probability and among those, we are looking for the subset with the smallest cardinality possible. That is what this min here says, so this definition here, it is a mathematical definition, it says that let L epsilon P, this quantity will come up again and again in this course, so please remember this, this L epsilon P is the minimum number of bits required to represent subset A of significant probability, probability greater than 1 minus epsilon. So, this epsilon is the same as some epsilon here, you can think of it as a number 0.01, so we are allowing you to ignore a symbol with probability less than 0.01. It is just an example but the definition can be for any arbitrary epsilon and among all, so this probability irrespective to this P, the model that you have formed, we saw one example of such P earlier and you want to find the subset which has large probability with respect to P and has smallest cardinality, what is the minimum number of bits required to the present, which is subset. That is a natural definition of the minimum number of bits required to represent a random variable generated from this distribution P. So this is actually, if you remember a table 1 example, this is actually the minimum number of bits required to represent a table 1. So we just formed a table and we, so remember this example we will form two tables. We kept all the more frequent symbols in table 1 and then we kept less frequent symbols in table 2. Table 2 was a set of small symbols which can be node, this is the minimum number of bits required to represent table 1. That is what this definition is. Now this is one possible definition, the second definition here is a, is related but slightly different, it is the minimum variable, minimum length of a variable length code. What do I mean by that? Instead of, in this previous definition, I represent all the symbols in the subset A with a same length vector, same length binary vector of log 2 cardinality, this is I use the same length binary sequences for all symbols and then I ask for minimum length of, minimum length of vectors required to store all elements of A for a subset A with significant probability. Now alternatively I can allow for mapping e, e for encoder because I encode my symbols from X into binary sequences, this is a 1 to 1 map, so that I can look at a binary sequence and come back to the symbol here and I can look at this symbol and come back to this right here, by the way this notation here, if you have not seen it, this is the set of all binary sequences, set of all binary sequences of finite lengths. So, 00, 01, 10, all finite length binary sequences, so this takes a symbol, this this encoded takes a symbol and represent, represents that as a binary sequence. And it is a 1 to 1 map, so that I can look at the sequence and come back to the symbol. So this is a notion of a compression scale. This mod of e x represents a length of a binary sequence. So, this guy here is length of a binary sequence e x. So, what is this quantity? It is the average length of the binary sequence, so what this L bar, bar for average, L bar P denotes, the minimum expected length of, of a 1 to 1 representation of the symbols in x. So, we do not ignore any symbols in this new notation, in this new notion of, notion of minimum length of a of a code that can be used to compress the source, so we do not ignore any symbols, but we only consider average length as a question of previous definition, which considered worst case length but allowed you to ignore symbols with probability less than epsilon. So, these two definitions are very similar, but they are slightly different too and we will see in what sense they are similar and in what sense they are differ. So, this L bar P is a second definition I want you to remember, it is a minimum average length of a code which represents X as bits. So, these quantities have been defined mathematically and they are being motivated by our search for measure of uncertainty, we are looking for a measure of uncertainty. That is why we have defined these two quantity. And we do not put any practical constraints, we do not say that this encoder e has to be (computed), computation in light, nothing like that. Because right now our goal is to not come up with a compression scheme, our goal is to come up with a measure of uncertainty. Later when we visit compression and, compression in detail we will look at schemes which are also practical and somehow match, have average length matching these fundamental quantities. So, we have defined these two fundamental quantities, L bar P and L epsilon P, L epsilon P is the minimum, worst case (proba), worst case length of a binary sequence that can represent almost all, a fraction 1 minus epsilon of symbols in X. And L bar P is the minimum average length of a mapping which represents all the symbols in X. So how do we, what do you think are the schemes which achieve these, these quantities? What schemes are performance that matches these fundamental quantities? I will urge you to take a pause and think about it, of course I will describe it right away. So now that you have thought about it, so we already saw such a scheme. Basically what we do is we arrange these probabilities P1 to Pk, let us say there are k symbols in a decreasing order. So, this bracket here represents the first, the highest probability symbol, P2 represents the second highest, Pk, note that it may not be P1, it can be P100 or P20 or P30, so this is the same sequence P1 to Pn. So, you have these probabilities P1 to Pk, the k symbols, so I am assuming here that x has k symbols. So, you arrange these probabilities in a decreasing order and now what you do? If, so this i denotes symbol with the ith highest probability that is my notation here. So, once you arrange it, what you do is, for the minimum, for a scheme which attains this, you take the symbol with highest probability, assign to it the shortest sequence, namely the sequence with length 0, sequence with length 1 which is let us say 0. Then again you have one more sequence with length 1, you assign it to this guy and you keep on going, each time assigning the shortest available sequence. We will see shortly that this scheme is actually optimal and attains L bar P. In fact, this scheme is also optimal for attaining L epsilon P, the scheme attains both of them, except that when you are striving to attain L epsilon P, you do not keep on going, you stop once you have collected enough symbols to have probability greater than 1 minus epsilon. But the principle is the same. Arrange the symbols in decreasing order of probabilities and start assigning shorter sequences to symbols with larger probabilities. That is the idea, so we just described this informally and now we will see formal statements claiming the optimality of the scheme. In fact, we will, a formal statement will be about these quantities, L epsilon P and L bar P. So here is the first theorem I am showing you in this course, I will also use this opportunity to write a formal proof. So get ready to get your mathematical hat and start writing a proof. So, what is the statement this statement characterizes epsilon P. One second, this characterize L epsilon P, here epsilon is some quantity, which is between 0 and 1 but is fixed. So, you can take examples like epsilon equal to some small quantity, like 0.1 0.01, it is customary in math to denote the small quantities by Greek alphabet, epsilon naught delta, if you have not seen that, so, just do not be surprised that I am using this Greek alphabet epsilon delta. It is customary to represent small quantities by this epsilon delta's. So, L epsilon P, this theorem claims is exactly equal to the seal, this is the seal. So, it is the next integer, next integer after this quantity log of N epsilon P. Well, what is N epsilon P well, N epsilon P collects these symbols in decreasing order of probability. Remember this notation, collect L of them, the first L of them till it crosses the probability 1 minus epsilon. So, it puts it is making a bucket, just I, just to pictorially represent what the statement is saying it is a bucket. You add symbol to this bucket one by one. So, first you add the symbol P1, namely the symbol with highest probability, then you see, you are trying to fill this bucket till here, this is the probability you wanted to have 1 minus epsilon probability, you add P1 to it. You see if has it filled up this bucket, if not, then you add the next symbol. So, you keep on doing this, you keep on adding one by one, and you add Pl, you stop at the smallest Pl, at which you exceed this volume 1 by epsilon. And that is when you stop. That is what is N epsilon P is, minimum number of symbols that you can fit in 1 minus epsilon probability that you require to cross that 1 minus epsilon probability. So, you will take all these symbols, you will look at the set of the symbols and just look at the number of bits required to represents just these symbols. That is what is N epsilon P, that log of N epsilon P seal is. And you will ignore all the other symbols and the claim is that this is optimal, this L epsilon P is equal to this. So, we will now try to prove this claim. Let me make some more space, so this is actually not a difficult proof. Nonetheless, I would like to give you some details for it. So, when you see an equality like this, and you are asked to prove that equality, keep in mind that it is hiding two statements. So, it is hiding a statement that this quantity on the left, the quantity that you defined operationally, so what do you mean by defining operationally? This is a quantity we gave some physical meaning to, it is the minimum number of bits required to the present a source with probability P, a probability distribution P, that is an operational quantity, it can be described in words and it has meaning in practice. So, this operational quantity, we are saying equals to this has a formula like this. So, we will have to establish two parts, first part is that this quantity. So proof, I only give a proof sketch, and you can go back and read this notes where all the details will be given. So, first part of the proof will claim that actually, L epsilon P, is less than this thing on the right. It is the first part of the claim that it is less than, of course, to show that it is exactly equal to, we will also have to show that it is we come to this later, we also have to show that L epsilon P is greater than, cannot be less than, is greater than this. Both are needed to claim this. So this is something which you have to keep in mind, whenever I make a claim like this for equality, you have to understand that there are two sides of this. There are two statements hidden there; one is that it is smaller and the other one it is greater. So, it cannot be, it is at most this much, and it cannot be less than this, either two statements, and we this is how we will prove. So, do not be surprised that why do we have two parts of the proof on equality statement, this is something which will happen again and again in this course. So, one more thing. Now what does this statement mean? L epsilon P is less than equal to this. Remember, L epsilon P was defined as minimum over all sets. So, this statement means that there is a set, so let me write a definition, just to refresh your memory L epsilon P was defined as a minimum of the number of bits required to represent a set, this is the cardinality of a set, this notation here is the cardinality of a set. We look at all possible subsets of our alphabet, such that P of A is greater than 1 minus epsilon. Among all subsets what is the minimum value of this, that is what our definition was. Now, what this says is there is a subset A, which has cardinality this much and therefore, the minimum can only be smaller. So, you have to just find a subset. So, this upper bound here says that we have to give a scheme, you have to give a subset which attains this. So, what is that subset? Well, that subset we already wrote down, this subset exactly is that subset, the minimum L which fills this bucket. So is it, so let us quickly check if it is, is there is such a subset? Yes, indeed, we are crossing the probability 1 minus epsilon. So, this subset here, so let me write, for A equal to, so remember this notation, it is 1 is the symbol with first the highest probability, 2 is the symbol with second highest probability. And you keep on going till till what number, till you go till this number. Again, take the symbol with this, let us say this is 100, hundredth highest probability. And the claim is this symbol will cross 1 minus epsilon probability. That is the claim. That is the definition actually of this N epsilon P and so the cardinality of this subset A is clearly N epsilon P, also the probability of this subset A by definition here is greater than or equal to 1 minus epsilon. So, implies that this, this gives you that, so if you look at all the observations there this implies that L epsilon P can only be smaller than, which is exactly equal to this guy here. So, this is the first part of the proof. So, we have shown this part, it is actually quite obvious once you define the subset. Now, for the other part of the proof, to claim that L epsilon P is greater than this, how do we show that? Now, this is an (interesting), this is an interesting part of the proof. This is asking you to show that no matter which subset you choose, it must have cardinality greater than equal to this, that is what it says. So, suppose, so A is any subset which has probability greater than 1 minus epsilon. So, what did we do? We take this i in A so this i goes from 1 to let us say, how do I write it? So, let us let us give some notation for symbols in A. So, let us say A has the symbols i1 to i L. So, A has L elements denoted by i1 to i L. So, what is P of A, then P of A is equal to summation i equal to 1 to L P of let me call it something else j equal to 1 to L P of ij, that is what this probability is and it crosses, it crosses 1 minus epsilon. Now, we would like to claim that so, let us see, so what do I put it? So, whatever these symbols are, they, the sum of their first, first probabilities cannot be less than this sum. This is some simple observation, but interesting. Remember, this is the jth highest probability, why? This is easy to see that if you sum any l symbols, their probability must be a, cannot exceed the sum of the probabilities of the same l symbols with highest probabilities but this inequality must hold. So, what it means is that, if you sum the first l highest probability symbol that, that that probability exceeds 1 minus epsilon, but then by a definition of N epsilon P, N epsilon P is a smallest l at which this probability exceeds 1 minus epsilon. So, this l can only be larger than N epsilon P. And therefore, log of, so from now on I will not put this subscript here, all the logs in this course are the, are to the base 2, because you want to think of representing thing in (big), things in bits. So, what it says is log of l epsilon P is less than what is this l, let me write it, log of l, but this l is exactly equal to, so this l is exactly equal to cardinality of A. So, any set A which has this probability must have this exceeding this and therefore, the minimum cardinality of such a set also exceeds this. That is the proof. So, since we showed that this any subset A with probability greater than 1 minus epsilon must have log of cardinality of A seal exceeding this guy. From that, it follows that, the minimums over all subsets of this cardinality must exceed form that is how we have shown this. So, this is the second part of the proof, so our proof is complete. So what have we shown, we have shown the, this theorem that L epsilon P is equal to log of N epsilon P where N epsilon P is the smallest l such that if you add l highest probability elements, that probability exceeds 1 minus epsilon. In fact, this tells you what set A to use to, to store if you are just wanting to store elements with probability 1 minus epsilon, and want to ensure that the number of bits required to store is the least possible. So, we have sort of a formula for N epsilon P, L epsilon P. Now, let us go to the second quantity we saw. So, the second part of this theorem claims to characterize this quantity L bar epsilon. Remember this L bar epsilon, L bar P, it is the minimum average length of a code that represents the symbols in P using the variable lengths that represents this random variable X generated from P using variable length sequences. And the claim is we again have this precise formula for it. Note that this P bracket i is a notation from earlier where screen up this here, so this is P i, So remember, this is the probability of the symbol with ith largest probability. So, the claim is that this formula can also be written precisely k is the number of symbols in our alphabet. So, we are throughout assuming that this is equal to k, multiplied with this guy here. So, what is this guy can, can someone think about this quantity here? So, so it is a good time to pause and think about it. Alright, let me, now, now that you thought about it, let me describe what this quantity is. So, this log of i by 2 plus 1, this quantity actually is the, it is the minimum integer j, such that, if you add, if you look at all binary sequences of length less than or equal to j, how many sub sequences are there well l equal to 1 to j, 2 to the power l. So, to this formula is very simple, there are 2 to the power l sequences of length l. If you look at all binary sequences of length 1 to j. And you are looking for these sequences and you want to collect how many sub sequences, you want to collect i sub sequences. So, so what does it mean? So, so this just once again, what is this formula here, this I am claiming is equal to this, it is the minimum number of binary sequences, minimum length of a binary sequences that I need to represent i symbols. So when i equals to 1, just as an example, then you can represent it by just one binary sequence 0, binary sequence of length 1 0, and so this j becomes the minimum j becomes 1, and i equal to 2. Once again, you can just represent both, both the symbols 1 and 2 using a single binary, binary sequence of length 1, so you can have two symbols 0 and 1. So in that case, again, for i equal to 2 again, this is 1, so we can check for i equal to 1, this is half plus 1, so it is again, this is just, so what is log of half plus 1 seal, well, that is just 1. What is log of, when i equal 2, so this becomes 1, 1 plus 1, 2. So one seal is again 1, so 3, well it goes to 2, 4, 4 by 2 is 2, 2 plus 1, 3 seal again 2. 5, 5 by 2 plus 1 seal, again 2. Now, 6, how many (bina), what is the maximum length of binary sequence that above the minimum length of binary sequences you need to represent 6 different symbols. So 6 by 2 is 3, 3 plus 1 4, the log of 4 with the base 2 is 2. So, this is 2. So, that is what this formula is. It is the minimum length of binary sequences required to represent i distinct symbols. It is a bit of a mouthful, but I hope you understand what can we, just try to think of this quantity and write some examples and you will come to this. So, the claim is this L bar P is exactly equal to this. So once again, we will try to prove it, that L bar P is exactly equal to this. So, let us make some more space for ourselves. All right, so here is proof. In fact, this proof I will not give you all the details, it is a little bit involved. But in my notes, I am providing all the details. So, one side of this proof is very easy. As I said the first part of the proof is to claim that there is a scheme which attains this in the sense that L bar P is less than equal to summation i equal to 1 to k Pi log i by 2 plus 1. So, how do we attain this, which scheme attains this? It is a very simple scheme. First, you arrange the symbols in their decreasing order probability, it is the scheme we discussed earlier and assign, so maybe I will say this way. So, you move in this direction and assign the shortest available binary sequence. So for symbol 1, you assign the sequence of length or the shortest available binary sequence, which is 0. So, you go from the symbol with largest probability and assign sequences, the shortest available sequence. So, this first symbol gets sequence 0 assigned to it, the next symbol gets sequence 1 assigned to it and you keep on going and then the last symbol will get some sequence assigned to it. Now, what is the length of the sequence that gets assigned to symbol with probability i, this is exactly that formula. The length of the sequence that in this order gets assigned to symbol with i largest probability is exactly this. That is why this formula came out. There is no magic to this formula. And this is the exact expression, you can convince yourself that is what it is. So, this shows that there is a scheme, there is an assignment, which gets you this thing on the right, so the best assignment can only be smaller. So this is simple, now we have to claim that no other assignment can do better. On this part of the proof is a little bit involved. So I will only give some high level idea. So what we want to claim is that the best you can do can never beat the scheme that we discussed, that is what we have to show, just writing it here. Alright, so how do we show this? Well, we use some induction to prove it, you can prove it in various ways, I will prove it using induction. So what I will claim is that consider any scheme, your favorite scheme. So what is the scheme means, scheme is a mapping e, which looks at this alphabet and assign to it binary sequences. And this e is 1 to 1, in the sense that it assigns a unique sequence to each symbol, so that you can look at the symbol and come back to the sequence. Of course, you can look at this symbol and go to the sequence, it should also be the other way around, you should be able to invert it. That is what, that is what we assumed about this e. Now, let li denote the length of the sequence that is assigned to the symbol with probability Pi the ith highest probability symbol, the length of sequence assigned to a symbol i. So, remember that this symbol i is the symbol with probability Pi. So, that is what this li is. So, any scheme we will also assign some sequence to this symbol. Of course, the scheme we saw earlier had assigned the shortest available sequence, short to the symbol. Now this any other scheme will also have some other sequence assigned to the symbol, let li be the length of that sequence. That is what that notation is. So claim is, so claim, claim is something we will show that is what we mean by a claim that if you look at the first t symbols with highest probability, these are the first t symbols with higher probability. And look at the average length for this guy, this is greater than i equal to 1 to t Pi log of, sorry, so log of i by 2 plus 1, for every t. So, why am I writing this in a strange way, of course, I only need to show this what t equal to k, because I want to prove this claim by induction. So, if you have not seen proof by induction, well, yeah, you probably need to read a little bit of, you need to brush up a little bit of proof writing. I assume you are familiar with these kind of things. So, how do we show it? So, this clearly holds for, this clearly holds for t equal to 1. Because P1, L1, whatever you are, so what happens with this guy for i equal to, for t equal to 1, this guy is just 1, So, P1, L1 must be greater than equal to P1, 1 because L1 must, is greater than equal to 1. So, this holds for t equal to 1. Now you suppose this holds for some t. And from there, you show that it, if it holds for t, then it must also hold for t plus 1. That is how he will do that proof. So, I will not give details in the class. But you can take a look at the notes later, I will provide complete notes on this. So, if it holds t, it must hold for t plus 1. That is how you show this. So what this proof, therefore says is that no matter what you do, no matter what scheme you use, you can never do better than this. And therefore, this quantity must equal to this. So, this also says in effect that this scheme that we saw, we just arranges the symbols in decreasing orders of their probability and assigns sequences one by one from starting from the left, and always assigned the shortest available sequence. This is optimal. Also, remember that even for L epsilon P, a similar scheme was optimal, you again, arrange these odd symbols in their decreasing order of probability, and took the symbol and put it in a bucket to the next symbol, put it in a bucket kept on doing this until your bucket had a proper symbols worst probability 1 minus epsilon. And then you just tore those symbols, look at the minimum length of (seq), binary sequences required to represent all those symbols. So, that is also very similar scheme. And that retains we claim L epsilon P. So what have we seen, we have seen just a quick summary where we are till now. So, so we saw these two quantities, L epsilon P and L bar P and schemes. So, I will say schemes that attain this, which means schemes that are optimal or the schemes that attain these. And generally, the template of both the schemes are are similar. And this is the principle they follow assign shortest or shorter sequences to symbols with largest or larger probability that is the template of our scheme, larger probability. These are the template of the schemes. This is, both the schemes follow this principle. It is a very natural principle. And now we have given a mathematical theorem, which shows that this indeed is the optimal scheme. All right, so just one more thing before we close this video. So, we saw these two different quantities, L epsilon P and L bar P, how are they related? So here is an exercise. You can think about this exercise. So, it claims that these quantities are closely related, it claims that L epsilon P is actually almost L bar P divided by epsilon. So, if you do not worry about dependence on epsilon, these quantities, quantities are similar. It is bounded above by this plus 1, we know this plus 1 is constant in (())(34:13). So, suppose this thing is 10,000. And epsilon is 0.1. So, this is well 10 times more than this, but no more. So this, there is a relation between these two. Also, this is not L epsilon P, cannot be too much less than L bar P either, it is L bar P by 1 minus epsilon and some correction. So, if you ignore this dependence on epsilon, these two quantities are very similar, which is something nice because otherwise you would always worry which definition should I keep in mind. Remember, these definitions were made up to mathematically capture the minimum number of bits required to represent symbols with probability distribution P, a source with probability distribution P and this exercise is very pleasing because it says that the two quantities we define will at the end of it have very similar behavior. Now, maybe you can take a pause and try to solve this exercise. And I will solve it here. But I asked you to, I urge you to again, take a pause and solve it. I will give a hint of how to solve this exercise right away. Now, this is a tough exercise. So, now that you have tried solving it, let me just give you a simple hint. I will also give a part of it as a homework. So, what is this inequality saying this inequality is telling you that one of them can be bounded by the L epsilon P is bounded by L bar P by epsilon. So it is, an upper bound for L epsilon P is a scheme. So, what it is saying is you can come up with a scheme of for L epsilon P, which has length less than equal to this. So, which symbols will you keep? That is the question which symbols will you keep? So suppose, so consider a scheme which attains L bar P. So suppose, I am showing this part of the inequality. L bar P, equals to summation i equal to 1 to k Pi li. So suppose, it is attain by this particular assignment li. Then you can think of this li as a random variable and this L bar P is the expected value of this random variable. So, I am writing it in that form. So, what is this random variable L? It takes value li with probability Pi. So, that is how we are viewing this L bar P. Now, if it is, this random variable, recall, Markov’s inequality. So, Markov inequality tells us that the probability that this random variable L, this is the length exceeds any fixed length l, small l. maybe too many L's around, so we will change the symbol to something else, let us say any fixed lambda, this is a non-negative random variable, lengths are always non-negative. This probability is less than equal to the expected value of this random variable divided by lambda. Equivalently, equivalently probability that L exceeds this expected value of this random variable divided by epsilon, that probability is less than equal to epsilon. So, what is this set here, this is the set we were looking for actually. So, if you look at all the i, for which, this random variable is li, remember for which li is less than equal to E L by epsilon, that set, that set has probability greater than 1 minus epsilon,