Module 1: Uncertainty, Compression and Entropy

Shannon Entropy and Random Hashing


So, in the previous video we saw two quantities, L_ε(P) and L̄(P). We assumed there is a random variable X with distribution P, a source whose symbols we want to compress. L_ε(P) is the minimum number of bits required to represent X so that it can be recovered correctly with probability at least 1 − ε.
L̄(P) was the average number of bits required when no error is allowed: you want a one-to-one mapping, but you count the average number of bits. What we saw towards the end of the previous video was that these two quantities are roughly the same; they differ only through the parameter ε, and that parameter is nothing very fundamental to the formulation.
It is an accuracy parameter, something we can agree on, say an accuracy of 0.99, so we do not worry about the exact dependence on ε at this point. We were looking for a conceptual definition of a measure of uncertainty, and both of these are good candidates. Since the two are similar, we can focus on either one, and we will focus on L_ε(P): the minimum number of bits required to express a random variable so that it can be recovered correctly with probability at least 1 − ε.
And we did characterize L_ε(P) exactly: we showed that L_ε(P) equals log2 of N_ε(P), where N_ε(P) is the least number N such that the N largest-probability symbols have total probability at least 1 − ε. The scheme attaining L_ε(P) is exactly that: it retains the fewest largest-probability symbols whose probability together reaches 1 − ε.
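To make this concrete, here is a minimal Python sketch of that characterization, assuming the distribution is given as a plain list of probabilities and using the rounding convention ⌈log2 N_ε(P)⌉; the function name L_eps and the example are my own, not from the lecture.
```python
import math

def L_eps(p, eps):
    """Minimum bits so that X is recovered with probability >= 1 - eps:
    L_eps(p) = ceil(log2 N_eps(p)), where N_eps(p) is the smallest number
    of largest-probability symbols whose total mass reaches 1 - eps."""
    total, n = 0.0, 0
    for q in sorted(p, reverse=True):   # largest probabilities first
        total += q
        n += 1
        if total >= 1 - eps:            # stop once the kept mass reaches 1 - eps
            break
    return math.ceil(math.log2(n))

# Uniform on 8 symbols with eps = 0.1: all 8 symbols must be kept -> 3 bits
print(L_eps([1/8] * 8, 0.1))            # -> 3
```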
So we had an exact expression for L_ε(P), but it may not look very useful: it is a somewhat complicated formula, and it still does not give us enough feeling for how to measure uncertainty. One thing we realized was that compression is related to uncertainty; that was our thesis up front. And what we realize is that a good compression scheme does the following: it assigns shorter sequences to symbols with larger probability.
However, this expression is still not very useful because, to begin with, it has ε built into it. There is some dependence on ε, and ε is just an accuracy parameter; it is not a fundamental quantity for measuring the uncertainty in a random variable. It is just something we came up with to have a mathematical formulation.
So what we will do today is find an approximate upper bound for L_ε(P). It is not as accurate as the exact formula we just saw, but it brings out a more fundamental notion of uncertainty: it will bring out a new feature of P that measures its uncertainty.
How do we come up with that? Here is an alternative scheme. Remember, our previous scheme kept the largest-probability symbols, with their number chosen so that their collective probability exceeds 1 − ε. Note that if you take all the symbols the probability is 1, and if you take no symbols it is 0, so as you add symbols one at a time, at some point you exceed 1 − ε, and that is when you stop. Those are the symbols you retain, because we are allowed to ignore a set of symbols with probability less than ε.
Alternatively, instead of looking for the largest-probability symbols, we can retain the symbols which are more likely to occur than a threshold. In particular, we consider the set A_λ, where the threshold for acceptable symbols is set to 2^(−λ): it is the set of all symbols x whose probability P(x) exceeds 2^(−λ).
These symbols have a reasonable probability, and we retain just them; so instead of the largest-probability symbols we retain these. You can take a log and re-express the condition as −log P(x) < λ. That gives another view of this set: you can think of −log P(x), a quantity which will come up again and again, as a measure of the rarity of a symbol. A symbol is very rare if −log P(x) is large, and we retain those symbols which are not too rare; that is what the set A_λ is. So we define the set A_λ and retain the symbols in A_λ.
Here is a very simple exercise. I would like you to pause and try to solve it, but I will solve it right away. We have to show that the cardinality of A_λ is at most 2^λ. Can you pause and try to show this? Now that you have tried the exercise, I will prove it in one line; it is a very simple but very basic and very interesting one.
If you look at all the symbols x in A_λ and add their probabilities, that sum cannot exceed 1, because the probabilities of the entire alphabet sum to 1. But the probability of each symbol in A_λ is greater than 2^(−λ); remember, that was our definition of A_λ, the symbols with probability greater than 2^(−λ). So you substitute this lower bound for each term.
Each term in the sum is greater than 2^(−λ), and how many terms are there? The cardinality of A_λ. Therefore the cardinality of this set cannot be more than 2^λ; equivalently, the log of the cardinality of A_λ is at most λ. So there are no more than 2^λ symbols in A_λ, which is what the exercise claims.
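As a quick sanity check, here is a small Python sketch of the set A_λ and the counting argument; the dictionary representation, the function name A_lambda, and the toy distribution are my own choices.
```python
# A_lambda = {x : P(x) > 2^(-lam)}, the symbols that are "not too rare"
def A_lambda(p, lam):
    return {x for x, px in p.items() if px > 2 ** (-lam)}

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # toy distribution
for lam in (1, 2, 3):
    A = A_lambda(p, lam)
    # Counting argument: |A| * 2^(-lam) < sum of P(x) over A <= 1, so |A| < 2^lam
    assert len(A) <= 2 ** lam
    print(lam, sorted(A), len(A), "<=", 2 ** lam)
```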
With this observation, here is a simple theorem, simple to state but very interesting; it gives us a nice handle on L_ε(P). For any λ, consider the set A_λ and suppose it has probability at least 1 − ε. We already saw that there are no more than 2^λ symbols in A_λ, so we can store all the symbols in A_λ using at most λ bits. So if its probability is at least 1 − ε, then storing all the elements of A_λ constitutes a valid scheme for L_ε(P).
And therefore L_ε(P), which is the minimum length over all such schemes, must be at most λ. So for any such λ, L_ε(P) ≤ λ. How do you find a λ which satisfies the probability condition? That is a different problem, but suppose you found one; then we have an upper bound for L_ε(P).
There is a way to say this bound in words: any large-probability upper bound, a (1 − ε)-probability upper bound, on the random variable −log P(X) works. You can think of −log P(X) as a random variable; this random variable is very fundamental for this course. Just remember that we call it a measure of rarity. It is a random variable because X is random, and it measures the rarity of an element.
Any large-probability upper bound on this random variable is an upper bound on L_ε(P): any bound which holds with probability at least 1 − ε, like the one here, is an upper bound for L_ε(P).
So now, can we give a more concrete upper bound? For that we need a large-probability upper bound on this random variable. And what is a good source of large-probability upper bounds on a random variable? That is where Markov's inequality, which we just saw, comes in.
Markov's inequality gives us such a bound: it tells us that the expected value of a nonnegative random variable divided by ε is a large-probability upper bound on that random variable; the random variable exceeds E[Z]/ε with probability at most ε.
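Here is a quick numerical illustration of this use of Markov's inequality on a made-up four-symbol distribution: the total mass of symbols whose rarity −log2 P(x) exceeds E[−log2 P(X)]/ε is at most ε. The distribution and variable names are mine, purely for illustration.
```python
import math

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # toy distribution
eps = 0.25

H = sum(px * math.log2(1 / px) for px in p.values())   # E[-log2 P(X)] = 1.75 bits
threshold = H / eps                                     # Markov threshold = 7.0

# Mass of symbols whose rarity exceeds the threshold; Markov says this is <= eps
tail = sum(px for px in p.values() if math.log2(1 / px) > threshold)
print(tail, "<=", eps)                                  # -> 0 <= 0.25
```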
So let us try that bound. By the theorem, the random variable we are looking at is −log P(X), so its expected value divided by ε, E[−log P(X)]/ε, serves as a (1 − ε)-probability upper bound on −log P(X), and hence as an upper bound on the length. In fact, this expected value is the fundamental measure of uncertainty we were looking for. You can expand it: E[−log P(X)] is the sum over x of P(x) log(1/P(x)), and it has a name, the Shannon entropy.
This is the first important quantity we see in this course, and it is denoted by H(P): the Shannon entropy of a random variable with pmf P. So H(P) is the sum over x of P(x) log(1/P(x)). Remember, log(1/P(x)) is the rarity of a symbol; it is sometimes also called the entropy density, and its expected value is the entropy. And what we have seen is that L_ε(P) ≤ H(P)/ε.
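In code, the Shannon entropy is a one-line sum; a minimal sketch, with log base 2 so that entropy is measured in bits, and with the function name shannon_entropy being my own.
```python
import math

def shannon_entropy(p):
    """H(P) = sum_x P(x) * log2(1/P(x)) -- the expected rarity, in bits."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(p))    # -> 1.75, and L_eps(P) <= H(P)/eps by the bound above
```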
Alright, great, so we have discovered entropy as a measure of uncertainty, and we have a formal theorem behind it. First we said that we will measure uncertainty by the minimum number of bits required to represent a random variable; the quantity L_ε(P) is the answer to that problem. We could get an exact expression for L_ε(P), but it was not very nice because it is difficult to interpret.
So we looked for approximations. We have not commented on how good these approximations are, but it will turn out that they are actually pretty good: we will show that L_ε(P) is approximately equal to the Shannon entropy itself once we understand the dependence on ε. Therefore Shannon entropy emerges as a measure of uncertainty and also as the answer to the fundamental limit of compression, L_ε(P). I would request you, after this lecture, to look at this quantity, play around with it and try to remember it.
Here are some very simple exercises to help us understand what the Shannon entropy is about. Of course we will spend a lot of time reviewing properties of Shannon entropy later, but here is a small warm-up. You can pause the video here and try it yourself; I will quickly solve it now. The first exercise is to compute the Shannon entropy of P for these distributions.
The first distribution is the uniform distribution on 1024 elements. One thing we will see later is that for P uniform on any set of cardinality k, the Shannon entropy H(P) is just log k; you can show that. So this is log2 of 1024, which is 10. The entropy is 10 bits.
Next, what is the Shannon entropy of the second distribution? It has one element with the very large probability 1/2, one element with the smaller probability 1/4, and the remaining mass spread over elements of probability 1/1024 each. Which one do you think has higher entropy, this one or the uniform one? The uniform one is completely random, so it should have higher uncertainty; if Shannon entropy is a reasonable measure of uncertainty, the entropy of this second distribution must be smaller than 10. Let us compute.
With probability 1/2 the rarity is log2 2 = 1; with probability 1/4 it is log2 4 = 2. What about the rest? All the other symbols have probability 1/1024 each, and the remaining mass is 1/4, so there are 256 such symbols, each with rarity log2 1024 = 10.
The total comes out to be (1/2)·1 + (1/4)·2 + (1/4)·10 = 14/4 = 7/2, just 3.5, which is less than 10. So this distribution has smaller entropy than the uniform one. How about the third distribution? It was easy to compare the first two, but a measure like this also lets you compare the second with the third: you can compute its entropy, check, and let me know. Once again, what I mean by 1024 throughout is that all the other symbols have probability 1/1024; the remaining mass of 1/4 divided by 1/1024 gives the number of remaining symbols, 256.
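These exercise computations are easy to verify numerically; a small sketch, assuming the second distribution is exactly one symbol of mass 1/2, one of mass 1/4, and 256 symbols of mass 1/1024 each, as described above.
```python
import math

def H(p):
    return sum(px * math.log2(1 / px) for px in p if px > 0)

p_a = [1 / 1024] * 1024                      # (a) uniform on 1024 symbols
print(H(p_a))                                # -> 10.0 bits

# (b) mass 1/2 on one symbol, 1/4 on another, remaining 1/4 spread over
#     256 symbols of probability 1/1024 each
p_b = [1 / 2, 1 / 4] + [1 / 1024] * 256
print(sum(p_b), H(p_b))                      # -> 1.0 (sanity check), 3.5 bits
```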
The second exercise is also very revealing. It asks you to plot the entropy of Bernoulli(p), which has just two probabilities: h(p) = p log(1/p) + (1 − p) log(1/(1 − p)). If you plot this function you will see that at p = 0 it is 0, at p = 1 it is again 0, and at p = 1/2 it peaks at its maximum value, which is 1. It is what is called a concave function, attaining its maximum at p = 1/2, and it is symmetric around 1/2, although my drawing is bad and does not look symmetric.
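You can also tabulate the binary entropy function to see this shape; a short sketch, with the function name h chosen by me.
```python
import math

def h(p):
    """Binary entropy of Bernoulli(p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"h({p}) = {h(p):.3f}")
# Zero at the endpoints, maximum h(0.5) = 1, and symmetric: h(p) = h(1 - p)
```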
Good, so this was a very important lecture and it more or less concludes our discussion for this week: we have discovered Shannon entropy as a measure of uncertainty. In many books this notion is introduced up front without any operational meaning. Instead, what I did in this lecture series is motivate the more philosophical problem of measuring uncertainty and slowly discover Shannon entropy as an answer to that problem.
And this will be our approach throughout this course: we will not introduce many of the quantities up front; we will slowly motivate them through operational problems, try to discover them, and later revisit them and study their properties.
We now introduce a very interesting concept, the random hash. It is a concept which is typically taught separately, not alongside a measure of uncertainty, but I think it is rightly placed here because it is a very basic concept and should be introduced up front.
So let us revisit our question-answer paradigm. In the way we were formulating the compression problem as a question-and-answer problem, there is a person who has asked the question; let us say that is you.
This person already has a list of guesses, and there is another person who actually knows the correct answer x and has to describe it to you. What we said was that instead of just making an arbitrary list, you can think of a model over all the answers, a distribution P, where P(x) is the probability that the answer is x; that is the model P.
And this list corresponds to the set A_λ we saw earlier: the set of likely answers which you can form. When we described this formulation previously, we assumed that the person describing the answer also knows P and also knows this list; therefore he can arrange the answers according to the list and just tell you where the answer sits in it.
In that paradigm, what we showed is that you can always form a list of size approximately 2^H(P) such that the probability that the answer belongs to the list is at least 1 − ε; with significant probability the answer belongs to a list of this size. That is what we saw.
The important point is that the size of the list can be about 2^H(P), and therefore, to describe the answer, you need only about H(P) bits, because that is how many answers the list contains. This is roughly what we have seen till now. But it is very important to notice what this requires. Here is the person who is giving you the answer to the question; we call him the answerer.
The scheme requires the answerer to know P, the same model, and also the list that you are forming. However, what can we do if the answerer does not know P? This is exactly what a random hash will be able to do for us.
What a random hash makes possible is that the answerer can still send roughly the same number of bits about the answer without knowing P. This is very important: without knowing P. And the person who is trying to guess, who has the guess list, recovers the answer with large probability. That is what we will be able to do with a random hash.
So let us introduce the formal notion of a random hash. Here is the definition: for a natural number l, an l-bit random hash is defined as follows. We consider a random function F from the alphabet X to l-bit strings. What is a random function? It is one which is chosen uniformly over the set of all such functions. How many such functions are there, by the way? Each element of X can be mapped to any of the 2^l possible outputs.
So you have 2^l choices for each element of X; multiplying them over all elements gives (2^l)^|X| functions, a very large number, and F is chosen uniformly at random from among them. Such an F(x) is called an l-bit random hash of the input x.
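In practice a uniformly random function over a large alphabet cannot be written down explicitly, but it can be sampled lazily: draw a fresh uniform l-bit value the first time each input is seen and remember it. A sketch of that idea, with the class name RandomHash and the parameter L (the hash length l) being my own choices.
```python
import random

class RandomHash:
    """A lazily sampled l-bit random hash: the first time an input is seen it
    is mapped to a fresh uniformly random l-bit value, which is then reused."""
    def __init__(self, L, seed=None):
        self.L = L
        self.rng = random.Random(seed)
        self.table = {}                      # remembers sampled values

    def __call__(self, x):
        if x not in self.table:
            self.table[x] = self.rng.getrandbits(self.L)
        return self.table[x]

F = RandomHash(L=16)
print(F("answer") == F("answer"))            # -> True: same input, same hash value
```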
We will observe a very interesting property of this random function, called the collision resistance property. It is very easy to state: for a random function F chosen uniformly over all functions, if you consider two distinct elements x and x′, then the probability that F maps both of them to the same value is very small, exactly 2^(−l). Remember, F is a random function from X to l-bit strings.
It is an l-bit random function, and that l is what shows up in the bound. You can show, and it is not a difficult exercise, that for any fixed x ≠ x′ this is the probability of a hash collision. When l is, say, 100, this probability is 2^(−100), an extremely small number; you can convince yourself of that. Just with 100 bits you get this property, which is called collision resistance.
In plain English, what we have just shown is that a random hash is almost a one-to-one map, in the sense that the probability that it maps two distinct elements to the same value is extremely small, 2^(−l). We will use this property of the random hash to answer the problem we raised earlier.
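A quick simulation of collision resistance: for two fixed distinct inputs, the hashes are independent uniform l-bit values, so they collide with probability 2^(−l). The trial count and the choice l = 8 below are arbitrary, just to make the estimate visible.
```python
import random

L = 8
trials = 200_000
# F(x) and F(x') for distinct x, x' are independent uniform L-bit values,
# so they coincide with probability 2^(-L) = 1/256.
collisions = sum(random.getrandbits(L) == random.getrandbits(L)
                 for _ in range(trials))
print(collisions / trials, "vs", 2 ** (-L))
```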
So suppose now that the guesser, the person trying to guess the answer, has a list A of guesses. Instead of requiring that x belongs to A with probability one, let us say the unknown answer x belongs to A with probability at least 1 − ε. Suppose that holds. The answerer, who earlier described the location of x in the list, which he could do because he knew the list, now does not know the list, but he can still send the random hash F(x) of the input x; assuming it can be computed efficiently, he just computes it and sends it.
Once again, to draw a picture, there are just two entities here: the answerer and the guesser. The guesser has the list A, the answerer knows the answer x, and he sends F(x) to the guesser. The claim is that F(x) allows the guesser to recover the answer with large probability.
What the guesser can do is declare as its answer x̂ an element of the list A whose hash matches the hash received from the answerer: the answerer has sent some hash value, and the guesser goes over all the elements x′ of its list looking for one with F(x′) equal to the received F(x). If there are multiple such candidates, it takes any one of them. So, how good is this guess?
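Here is a small end-to-end sketch of this protocol, under the standard assumption that the random hash is shared randomness available to both parties (here, the same RandomHash object); the function names answerer and guesser and the toy guess list are my own.
```python
import random

class RandomHash:
    """Shared l-bit random hash, sampled lazily (same idea as above)."""
    def __init__(self, L, seed=0):
        self.L, self.rng, self.table = L, random.Random(seed), {}
    def __call__(self, x):
        if x not in self.table:
            self.table[x] = self.rng.getrandbits(self.L)
        return self.table[x]

def answerer(F, x):
    return F(x)                   # knows only x; needs neither P nor the guess list

def guesser(F, A, received):
    for x_prime in A:             # search the guess list for a matching hash
        if F(x_prime) == received:
            return x_prime
    return None                   # only happens if the true answer was not in A

F = RandomHash(L=32)                              # shared randomness
A = {"paris", "london", "berlin", "madrid"}       # guesser's list, with x in A
x = "berlin"                                      # the true answer
print(guesser(F, A, answerer(F, x)))              # -> "berlin" with high probability
```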
We will show that this guess is accurate with large probability; we will bound the probability that the guess differs from the actual answer. You can split this probability into two parts using the total probability formula: the event that the unknown answer lies in the guess list, and the event that it does not. So P(x̂ ≠ x) = P(x̂ ≠ x, x ∈ A) + P(x̂ ≠ x, x ∉ A).
The second term is bounded by the probability that x does not lie in the guess list, which we assumed is at most ε. Now how about the first term? I can write it as P(x̂ ≠ x | x ∈ A) · P(x ∈ A), and since P(x ∈ A) is at most 1, the first term is bounded by P(x̂ ≠ x | x ∈ A).
So let us analyze the probability that x̂ ≠ x given that x is in A. Writing it out fully, it equals (1/P(x ∈ A)) times the sum over all a in A of P(x̂ ≠ x, X = a).
But when can the event x̂ ≠ x with X = a happen? The true answer a was in A and the answerer sent its hash, so this can only happen if there exists some other x′ ≠ a such that x′ is in A and F(x′) = F(a); only then can the guesser pick the wrong element.
There are at most |A| such candidates x′, and for any fixed x′ the probability of a hash collision is 2^(−l). Therefore you can show that the first term is bounded by |A| · 2^(−l). I will give the detailed proof in the notes, which I will put up, and I want you to try the bound yourself, but that is the main idea: the first term is bounded by the size of the list times 2^(−l).
Combining all the bounds, what we have shown is that the output differs from x with probability at most ε + |A| · 2^(−l). Now, l, the length of the random hash, is something you have to choose. By the way, we can choose A to be the list of size 2^(L_ε(P)) from our compression scheme, and then the bound becomes ε + 2^(L_ε(P) − l). If you choose the length of the random hash to be at least L_ε(P) + log(1/η) for some parameter η, the bound becomes ε + η.
For instance, if η is 10^(−6), the extra term log(1/η) = log2(10^6) is only about 20 bits.
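A tiny helper makes this choice of hash length explicit; the function name hash_length is my own, and it just evaluates l = ⌈L_ε(P) + log2(1/η)⌉.
```python
import math

def hash_length(L_eps_p, eta):
    """Smallest l with l >= L_eps(P) + log2(1/eta); then the error is <= eps + eta."""
    return math.ceil(L_eps_p + math.log2(1 / eta))

print(hash_length(10, 1e-6))     # -> 30: only about 20 extra bits for eta = 10^-6
```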
So if we choose l this large, the probability of disagreement becomes small. What is the conclusion? The conclusion is that the answerer did not actually need to know the guess list; all it needed to know was the accuracy with which you want the answer to be recovered.
With that, it can use a random hash of appropriate length and send the hash, and this gives reliable recovery with high probability. So the answerer does not even need to know the model P or the list in order to give the answer; it can just send a random hash. That is the idea of the random hash. It is a very powerful tool, for practice as well as for theory, and we will revisit it later in the course.
As a conclusion of this video, what you should remember is that the answerer does not need to know the model from which the guess list was formed; all it needs to know is the accuracy that the guesser is looking for, and it can send a hash of the true answer. The guesser, who has a guess list (the set A_λ we used in our scheme), can simply go over that list and select the answer whose hash matches the received hash. That is the conclusion.
Alright, that is all for this week. What we did this week was relate the measure of uncertainty to compression: we said that uncertainty can be measured as the minimum number of bits required to represent a random variable, and then we studied the problem of the minimum number of bits required to recover a random variable, which is exactly the compression problem.
What we discovered was that the fundamental limit of this problem can be expressed in terms of the quantity called the entropy of the random variable, which has a very concrete formula: with the notation H(P), it is the sum over x of P(x) log(1/P(x)), which you can compute. And what we claimed is that this entropy is a measure of uncertainty.
It can be computed explicitly. In the examples we saw, the uniform distribution has high uncertainty and, accordingly, larger entropy, while a distribution which is more peaky, with heavy mass at one point, has lower uncertainty, as one would expect, and its entropy is likewise smaller.
And finally we saw the notion of a random hash, which satisfies the basic property we call collision resistance: a random hash is almost a one-to-one map, and it is very rare that it maps two different elements to the same value. Using this property of the random hash we came up with a very interesting scheme which does not require the answerer to even know the guess list in order to answer the question.