In the final lecture for this week, I would like to discuss the notion of typical sets and their connection with entropy. This is a very basic, elementary notion, but a very powerful one, and it will be used throughout this course. So, I urge you to watch this lecture several times, and maybe try some exercises on your own.
If you notice, in both our proofs in the previous two lectures, we used sets which looked like this. We were calling them typical sets, and we had two different sets, which I am calling T lambda 1 and T lambda 2. They were associated with a parameter lambda, a positive real number: T lambda 1 is the set of those sequences x for which minus log p(x) is less than or equal to lambda, and T lambda 2 is the set where minus log p(x) is greater than or equal to lambda.
If you look at the first one, its condition is the same as saying that p(x) is greater than or equal to 2 to the power minus lambda, so probabilities are large; the second set is where probabilities are small. These are the sets we saw, and they gave us different results. We were looking for a lambda such that these sets have large probability under p, and one such choice of lambda, we said, is the entropy.
That is what brings in the role of entropy in these different problems. We saw that a good choice of lambda, such that these sets have large probability, is lambda equal to H(P), the Shannon entropy. That is how we saw entropy entering many of our problems. So, here are two properties of these two different sets; we showed this last time, at least partly.
At this point, you can pause the video and try to show these two properties; let me highlight them. The first one is that T lambda 1 has cardinality at most 2 to the power lambda. The second one is that if T lambda 2 has large probability, then its cardinality must also be large, roughly 2 to the power lambda again. Remember, in T lambda 1 the element probabilities are bounded below by 2 to the power minus lambda, while in T lambda 2 they are bounded above by it.
The first one is easy to see: each element has probability at least 2 to the power minus lambda, so you cannot have more than 2 to the power lambda elements, because otherwise the total probability would exceed one. For the second one, each element has small probability, but you have added enough elements to cross the probability 1 minus epsilon. Therefore, the number of elements must be at least 2 to the power lambda times 1 minus epsilon.
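The two counting arguments above can be checked numerically. Here is a minimal sketch in Python, using a small hypothetical distribution (the symbols and probabilities are my own illustrative choices, not from the lecture):

```python
import math

# A small example distribution (hypothetical, just for illustration).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

lam = 2.0  # the parameter lambda

# T lambda 1: symbols with -log2 p(x) <= lambda, i.e. p(x) >= 2^-lambda
T1 = {x for x in p if -math.log2(p[x]) <= lam}
# T lambda 2: symbols with -log2 p(x) >= lambda, i.e. p(x) <= 2^-lambda
T2 = {x for x in p if -math.log2(p[x]) >= lam}

# Property 1: |T1| <= 2^lambda, since each element of T1 has
# probability at least 2^-lambda and the total cannot exceed 1.
assert len(T1) <= 2 ** lam

# Property 2: if P(T2) >= 1 - eps, then |T2| >= 2^lambda * (1 - eps),
# since each element of T2 has probability at most 2^-lambda.
eps = 1 - sum(p[x] for x in T2)
assert len(T2) >= (2 ** lam) * (1 - eps)
```

With lambda = 2, both bounds hold with room to spare on this toy alphabet; the same check works for any distribution and any positive lambda.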
These are easy to show; I have just summarized the proofs. Such sets will be used throughout this course. They are called typical sets, since they contain elements which occur with large probability, and because we ensure that the probability of the set is greater than 1 minus epsilon. So, what is the role of typical sets?
Well, the probabilities of elements in a typical set have either a uniform lower bound or a uniform upper bound, and in that sense they look very much like a uniform distribution. The uniform distribution is easy to understand, and we can answer many of our questions easily for it. How much randomness is there in a uniform distribution? Well, the log of the cardinality of its support is a good answer.
How many bits do we need to compress a uniform random variable? Again, the log of the cardinality of its support. So for a uniform distribution, many of the questions that we have been raising can be easily answered. And conditioned on this typical set, our distribution also looks like a uniform distribution in some sense. This is just a heuristic statement: the probabilities of elements in these sets under our distribution are close, in some sense, to a uniform distribution on the set, in that each element's probability is either larger than some constant quantity or smaller than some constant quantity. That is the role of a typical set, and that is how we have been using it. These are the two properties of a typical set that you can remember.
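The claim that both questions for the uniform distribution are answered by the log of the support size can be checked directly. A small sketch, with a hypothetical support size k of my choosing:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum_x p(x) log2 p(x)."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# For a uniform distribution on k points,
# H = -k * (1/k) * log2(1/k) = log2(k).
k = 8
uniform = [1.0 / k] * k
assert abs(shannon_entropy(uniform) - math.log2(k)) < 1e-12  # log2(8) = 3 bits
```

Three bits also suffice to index any of the 8 equally likely outcomes, which is the compression answer.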
Okay, now, let us consider the case of an iid source, a discrete memoryless source. The source output comprises a sequence X1 to Xn generated iid, independent and identically distributed, from P.
Then set lambda equal to n times (H(P) plus eta), where eta is some small parameter. Consider the random variable Zi, which is minus log P(Xi). This has a name, and while the name may not be completely standard, I think it is a useful one: it is called the entropy density, because its expectation is the entropy.
Just as a probability density is something whose integral over a set gives a probability, this is a random variable whose expected value gives the entropy; that is one way to think of it. Now consider the probability of the first typical set, which was the set of sequences for which minus log p(x) is less than or equal to lambda. Note that minus log p(x) for an iid distribution is just the sum of the Zi, and we want that sum to be less than or equal to lambda.
You can divide by n and apply Chebyshev's inequality, and what you get is that the probability of T lambda 1 is at least 1 minus the variance of Z1 over n times eta squared; eta is a constant, and n is something we can take larger and larger. Similarly, for lambda equal to n times (H(P) minus eta), the probability of T lambda 2 is also at least this same quantity, again by Chebyshev's inequality.
Therefore, as n goes to infinity, the probability of falling outside these sets can be made as small as you want; it goes to zero. So, for an iid source, lambda equal to n times (H(P) plus or minus eta) is a good handle over both T lambda 1 and T lambda 2. In fact, in the Cover and Thomas book, one of the textbooks for this course, both these sets are combined into a single one; that is also fine.
They work with the intersection of T lambda 1 and T lambda 2; I have kept them separate to bring out which set plays a role in which proof. So, the lambda of choice for us in many cases will be n times entropy, plus or minus some correction factor. And essentially, when we do this, we are taking recourse to the law of large numbers.
That is what I wanted to say about typical sets: their cardinality is roughly 2 to the power lambda, and a good choice of lambda is n times entropy for iid sources. Therefore, the number of bits required to represent the typical set is roughly n times entropy for iid sources, as n goes to infinity.
Now, I will close this week's lectures with a very important property of entropy. We will spend a lot of time on other properties of entropy, but this one is very important, and we should see it up front.
Entropy is additive. If you take two independent random variables X1 and X2, then the entropy of the pair (X1, X2), which is called the joint entropy, equals the entropy of X1 plus the entropy of X2. And I believe this is actually the most important property of Shannon entropy: its additivity.
This is easy to show; if you want, you can pause the video and try it yourself. I will just write the proof here. The entropy of (X1, X2) is the expectation of minus log P(X1, X2), but we know that P(X1, X2) equals P(X1) times P(X2), by independence. So minus log P(X1, X2) splits into minus log P(X1) plus minus log P(X2), and when you take expectations, you get H(X1) plus H(X2). That is the proof.
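The additivity proof can be confirmed numerically. A small sketch, with two hypothetical independent distributions of my choosing:

```python
import math

def H(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# Two independent random variables X1 and X2 (illustrative distributions).
p1 = [0.5, 0.5]
p2 = [0.25, 0.25, 0.5]

# Joint distribution under independence: p(x1, x2) = p1(x1) * p2(x2).
joint = [a * b for a in p1 for b in p2]

# Additivity: H(X1, X2) = H(X1) + H(X2).
assert abs(H(joint) - (H(p1) + H(p2))) < 1e-12
```

Here H(X1) = 1 bit and H(X2) = 1.5 bits, and the joint entropy comes out to exactly 2.5 bits. Note that additivity in this exact form relies on independence; for dependent variables the joint entropy is strictly smaller than the sum.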
So, this week we saw another notion of uncertainty: we were trying to relate uncertainty to the randomness in a random variable. We saw that a good measure of randomness for a random variable, defined in two different ways, is this entropy, Shannon's entropy. And in giving these proofs, we introduced the notion of typical sets, which are sets of large probability that have uniform upper and lower bounds on the probabilities of their elements.
This basically reinforces entropy as the central answer to the measure of randomness, or measure of uncertainty, in a random variable. And we finally concluded by noticing that entropy is additive.