Module 1: Randomness and Entropy

Measuring Randomness in a Random Variable

So, now that we have defined a distance between distributions, namely the total variation distance, which is a good measure of the distance between two distributions, we return to the question of measuring the amount of randomness in a random variable. We start with the first question, which asks how many perfectly uniform random bits can be extracted from a given random variable. In fact, we will relax the requirement a little: instead of asking for an exactly uniform random variable, we will ask for an approximately uniform one, that is, almost-random bits.

Here is the formal version. Given a single sample X from a distribution P, we want to generate a uniform random variable U out of X by applying some function to X. You first choose your accuracy epsilon, a number between 0 and 1; it can even be something large, like 0.99. The question is: what is the largest value of L such that we can find a function f which takes X as input, produces L bits, and satisfies the bound below? And what is this bound? It says that the output f(X) is approximately uniform on L bits, so it is just like a random L-bit string. "Approximately" here means in distribution: the total variation distance between f(X) and U is at most epsilon. That is the requirement, and we ask for the largest such L.

You may be a bit worried about this error and how it will affect the quality of the randomness you have generated. Remember that the total variation distance can be expressed as the supremum, over all events A, of the difference between the probability of A under the first PMF and under the second. Now suppose you have designed some algorithm which uses randomness generated from P_U. That formula says that if you replace U with f(X), then all your error probabilities increase by at most epsilon. So it is not a big loss to replace U with f(X), and that is why it makes sense to seek random number generation in this approximate sense: you do not want perfectly uniform, you want almost uniform.

Suppose this is the question you want to solve; what will you do? Here is a very simple "scheme" (I am putting scheme in quotes because it is really only the idea for a scheme): divide the alphabet of X into parts with roughly equal probability. The word "roughly" is important here. You may not be able to divide the alphabet into parts with exactly equal probability, and this mismatch between the probabilities of the parts is what contributes to epsilon. Also, the more parts you have, the larger L is. So you want to divide the alphabet into as many parts as possible such that each part has roughly the same probability under P; if you can do that, you can make L very large. Keep in mind one small point, which is worth revisiting here: you can view a function of X as simply a partition of the alphabet, where each part corresponds to those x for which you declare a particular output.
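As an aside, here is a small Python helper for the total variation distance used throughout this discussion. This is my own illustration, not part of the lecture; it relies on the standard identity that the supremum-over-events form equals half the L1 distance between the two PMFs, and it will be reused in the later sketches.

```python
def total_variation(p, q):
    """Total variation distance between two PMFs given as dicts mapping
    outcomes to probabilities: sup_A |P(A) - Q(A)| = (1/2) * sum_x |P(x) - Q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# A biased bit versus a fair bit: the distance is 0.25.
print(total_variation({0: 0.75, 1: 0.25}, {0: 0.5, 1: 0.5}))
```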
So that is how we can view the function f. Alright, so how do we do this? Let us do it a little more formally. The first observation is that we cannot partition every alphabet into parts of this kind; we at least need some handle on how large the probability of each element is. For instance, if X has just two elements, with probabilities 3/4 and 1/4, then there is nothing you can do about it: the only partition you can create keeps the element of probability 3/4 in one part and the element of probability 1/4 in the other, and there is no way to make this random variable uniform to very high accuracy epsilon. So it is important to first take into account the largest probability of an element. In particular, we will work with a sort of typical set T_lambda, where lambda is a parameter that I will bring in soon.
It is the set of all elements x such that the probability of x is at most 2^(-lambda), that is, the elements which do not have very large probability. As a convention, instead of writing it this way, we will write it in a different form: log_2(1/P(x)) >= lambda. These are elements which have sufficient "entropy" in them (this is just an informal use of the word entropy), sufficiently rare elements; and the sufficiently rare elements are what contribute to randomness.

So suppose you have the set T_lambda. We will work only with this set, ignore the elements outside it, and make a partition of T_lambda: we divide it into M parts X_1, X_2, ..., X_M and retain all the remaining elements in a last part X_{M+1}. What does this notation denote? It denotes a union of disjoint parts: we call it a partition because we assume that X_1, ..., X_{M+1} are pairwise disjoint (their intersections are empty) and their union is exactly T_lambda. Such a thing is called a partition of the set. So we have the set T_lambda and we have divided it into different parts, and these parts are the X_i's.

What do we want out of these parts? Essentially two properties: the parts should have roughly equal probability, and there should be as many of them as possible. For the first property, we want in particular that each part has probability between N * 2^(-lambda) and (N+1) * 2^(-lambda) for some N, a parameter we will bring in later. Why can we do this? Maybe you can pause the video and think about it. The reason is that each element x in T_lambda has P(x) <= 2^(-lambda). So what you can do is keep adding one element at a time until you cross the threshold N * 2^(-lambda), and stop the first time you cross it. Since each element has probability at most 2^(-lambda), you will cross the threshold by at most 2^(-lambda) extra; at that point you take those elements as the first part, then collect more elements for the second part, and you keep doing this until you can no longer cross the threshold. Whatever remains at that point becomes the last part. So first you build the M parts, and then you have the last part.
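To make the greedy construction just described concrete, here is a hedged Python sketch. It is my own illustration with hypothetical names: `pmf` is a dict from outcomes to probabilities, and the function returns the balanced parts of T_lambda together with the leftover part.

```python
def balanced_partition(pmf, lam, N):
    """Greedily split T_lambda = {x : pmf[x] <= 2**-lam} into parts whose
    total mass lies between N * 2**-lam and (N + 1) * 2**-lam, plus a
    leftover part that never reached the threshold."""
    threshold = N * 2 ** (-lam)
    T = [x for x in pmf if pmf[x] <= 2 ** (-lam)]
    parts, current, mass = [], [], 0.0
    for x in T:
        current.append(x)
        mass += pmf[x]
        # Each element weighs at most 2**-lam, so the first time the mass
        # crosses the threshold it overshoots by at most 2**-lam.
        if mass >= threshold:
            parts.append(current)
            current, mass = [], 0.0
    return parts, current          # current = the leftover part
```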
How many such parts M can you generate? This is an important question. Each part has mass at least N * 2^(-lambda), and the total probability available is at most 1. So think of a big box representing the total probability: it has mass one, and we are placing the parts X_1, X_2, ..., X_M inside it. How many such parts can be added?

The total mass of these parts is at least M * N * 2^(-lambda), and this must be at most 1, because the total probability you can add up is 1 and the parts are all disjoint. So M <= 2^lambda / N; that is the best you can do. On the other hand, how many parts will we actually collect? We collect as many parts as are needed to fill up the probability of T_lambda, because these parts all sit inside T_lambda, so the sum of their probabilities must essentially add up to the probability of T_lambda. But each part has probability at most (N+1) * 2^(-lambda), so M times this bound must cover P(T_lambda). Dividing one by the other, you can show that you can form roughly P(T_lambda) * 2^lambda / (N+1) parts.

By the way, the presentation in this lecture may appear like a sudden switch of gears: we are suddenly into something very technical. But once again, our goal is to motivate entropy as a measure of randomness, and the randomness in a random variable is related to its uncertainty; we have already treated the compression problem. It may look a bit tedious, but it is all very elementary. I suggest that you go back and try to recreate these arguments from scratch; they are simple arguments, but they require some thought. So this is the claim: for a given N and lambda, you can form at least this many parts.

So what is the scheme now? Here is my scheme. Let me call the size of my output domain M'. If x falls in part X_i for some i, then I declare i. Note that M <= 2^lambda / N, and therefore M + 1 is at most the ceiling of 2^lambda / N, plus 1. If x falls outside the set T_lambda (recall that the parts together make up T_lambda), then you declare an error, or call it a special error symbol ⊥. That is my function f.

Now the claim is that the output distribution of this function f is very close to the uniform distribution. Indeed, all the parts have roughly the same probability: the upper and lower bounds above are roughly equal, so these are almost equal-probability parts. So let us analyse the total variation distance between the distribution of f(X) and the uniform random variable on M' symbols. We split the total variation distance formula into different pieces. First, look at the symbols between 1 and M: what is the probability of each such symbol?
Under f(X), the probability of a symbol i between 1 and M is just the probability of the part X_i, and from it we subtract 1/M', since under the uniform distribution every symbol has probability 1/M'. Then look at the symbols from M + 1 to M': the uniform distribution puts mass on these, while f essentially does not. Finally, the error symbol ⊥ does not appear under the uniform distribution at all, so its contribution is just the probability of ⊥.

For that last term we make an assumption: suppose that P(T_lambda) >= 1 - epsilon, that is, suppose you can find such a T_lambda whose probability is at least 1 - epsilon. Then the probability of ⊥ is the probability of falling outside T_lambda, which is at most epsilon. So the last term is handled.

Now look at the middle (green) term. The probability under f of the symbol M + 1 is at most N * 2^(-lambda), because the leftover part never reached the threshold, and the probability under f of anything beyond M + 1 is 0, because M + 1 is the largest value f declares. Under the uniform distribution, the probability of M + 1, or of any other symbol, is 1/M', which equals N * 2^(-lambda) since M' is set to 2^lambda / N. So each f-probability here is smaller than the corresponding uniform probability, and this term is at most (M' - M)/M' minus the probability of the leftover part, which is at most 1 - M/M'. Here we use the lower bound on M derived above, with one small correction: the M parts do not add up to all of P(T_lambda), but to P(T_lambda) minus the mass of the leftover part, and that leftover mass is at most N * 2^(-lambda). Substituting the lower bound on M gives the bound for this term. You should repeat this calculation; the way this course is done, every calculation is elementary, but you should redo them yourself.

Now let us focus on the last remaining piece, the contribution of the first M symbols to the total variation distance; it can be handled similarly. The probability that f(X) takes the value i is just the probability of the part X_i, and we sum the differences over all these parts. Remember that M' is set to 2^lambda / N, and by our condition on the balanced parts (this is where we use the balance property) the difference |P(X_i) - 1/M'| is at most 2^(-lambda); that is exactly how we constructed the balanced partition. So this piece is at most 2^(-lambda) times M, and M is at most 2^lambda / N, so it is at most 1/N. So if you combine all these bounds, this last piece is 1/N, the ⊥ piece is epsilon,
and the middle piece is 1 minus the quantity above. You can check that when you combine these bounds, you obtain the stated upper bound on the total variation distance: the distance between the random variable you have generated and the uniform distribution is at most this much. When N is large, the 1/N term is small (that is why we wanted the parameter N), and when lambda is large, the other term is small; that is why we have these two parameters, and you set both of them large to make the distance small. So we set N to be 1/epsilon - 1, and with that choice you get the final resulting bound.

Alright, so what have we shown? We have shown the following theorem; we did the proof first, so let us now focus on the statement, which looks very much like our source coding theorem. Take any lambda and any epsilon, and suppose (this is an assumption about the underlying distribution P) that you can find a lambda such that with large probability, probability exceeding 1 - epsilon, we have -log P(X) >= lambda. We saw similar conditions earlier, in the source coding (compression) problem. If this condition holds, then for U the uniform distribution over roughly lambda - log(1/epsilon) bits, the f we designed satisfies: the total variation distance between f(X) and P_U is at most 5/2 * epsilon plus a term of the order of 2^(-gamma), where gamma is set to lambda - log(1/epsilon). In particular, if you set lambda to be at least 2 * log(1/epsilon), then gamma is at least log(1/epsilon), so 2^(-gamma) is at most another epsilon, and the whole bound becomes smaller than 7/2 * epsilon. So you can extract lambda - log(1/epsilon) almost-random bits out of the random variable X; they are almost random because they are within total variation distance 7/2 * epsilon of uniform. In this course, and in practice as well, when we are analysing algorithms it is fine to ignore such constant factors in an error probability that is not too big; the order is what matters. You could take epsilon to be 10^(-6), say, and then the small constant in front can be ignored.

So we can generate lambda - log(1/epsilon) bits within distance epsilon, where lambda is any value satisfying this condition. That is the result we have shown, and it looks very much like the compression result which we saw in the previous week and which we used to bring in entropy. So can you now stare at this condition and see the role of entropy? The question is: what is the largest lambda we can have such that this condition is satisfied? Here is the answer, if you did not guess it already: it is roughly the entropy. You can use Chebyshev's inequality to show that, with large probability, -log P(X) is more than the entropy minus a correction term.
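Spelled out, the Chebyshev step alluded to here is the following standard calculation, written in the lecture's notation with H(X) = E[-log_2 P(X)]:

```latex
\Pr\bigl[-\log_2 P(X) \le H(X) - t\bigr]
  \;\le\; \Pr\bigl[\,\lvert -\log_2 P(X) - H(X)\rvert \ge t\,\bigr]
  \;\le\; \frac{\mathrm{Var}\bigl(-\log_2 P(X)\bigr)}{t^2}
```

so choosing t = sqrt(Var(-log_2 P(X)) / epsilon) makes the right-hand side equal to epsilon, and lambda = H(X) - t then satisfies the condition of the theorem with probability at least 1 - epsilon.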
So, with Chebyshev you get a correction involving 1/epsilon; there are sharper inequalities which give a log(1/epsilon)-type correction instead, but we can ignore the correction term for now. The main term is what we should focus on, and it is the Shannon entropy. How did Shannon entropy come in? It is the expected value of this quantity, -log P(X); so, with the right choice of lambda, the estimate for -log P(X) is exactly this expected value. What we have shown, then, is that you can extract roughly as many almost-uniform bits as the entropy of the random variable. So this is another, different interpretation of entropy: as a measure of randomness. Last week we showed the interpretation of entropy as a measure of how much you can compress a random variable; what we have seen in this lecture is the interpretation of entropy as a measure of randomness. And how did we define the randomness in a random variable? We defined it as the largest number of almost-uniform bits that can be extracted from it. In the next lecture we will see a slightly different notion of a measure of randomness: we will define randomness as the minimum number of uniform bits required to generate the random variable. That is a different notion, and we will again get entropy as the answer. This is the magic of this fundamental notion of entropy: it comes up as the answer to many related questions.
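Before moving on, here is a small end-to-end sketch tying together the pieces above. It is an illustration only: it reuses the hypothetical `balanced_partition` and `total_variation` helpers from the earlier sketches, the toy PMF and the parameter values are arbitrary, and the leftover part is lumped together with the error symbol.

```python
import random

def make_extractor(pmf, lam, N):
    """f(x) = index of the balanced part containing x, or 'err' otherwise
    (this covers both x outside T_lambda and x in the leftover part)."""
    parts, leftover = balanced_partition(pmf, lam, N)
    label = {x: i for i, part in enumerate(parts) for x in part}
    return (lambda x: label.get(x, 'err')), len(parts)

# Toy PMF on 200 outcomes with mild non-uniformity (arbitrary numbers).
random.seed(0)
weights = [1.0 + 0.5 * random.random() for _ in range(200)]
total = sum(weights)
pmf = {x: w / total for x, w in enumerate(weights)}

f, M = make_extractor(pmf, lam=6, N=4)      # illustrative parameter choices

# Compare the distribution of f(X) with the uniform distribution on M symbols.
induced = {}
for x, px in pmf.items():
    induced[f(x)] = induced.get(f(x), 0.0) + px
uniform = {i: 1.0 / M for i in range(M)}
print(M, total_variation(induced, uniform))
```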
In this video we will visit our second notion of a measure of randomness. Namely, we want to find the least number of bits L such that, if I give you a uniform random variable on L bits, that is, L independent unbiased coin tosses, then you can find a function f which generates a given distribution P from this U.
So the distribution of f(U) should be close to the given distribution P, close in total variation distance, within epsilon. What is the minimum number of bits required for this? This is the number of bits required to generate the random variable, and we will think of it as a measure of the randomness in it. It is in a sense the complementary definition to the one we saw in the previous video. This problem is very much related to the one we saw last time, so rather than building up slowly, I will jump directly into the scheme.
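In symbols, the quantity of interest is the following (this is my compact restatement of what was just said, not notation from the lecture):

```latex
L^{*}(\varepsilon) \;=\; \min\Bigl\{\, L \;:\; \exists\, f:\{0,1\}^{L}\to\mathcal{X}
   \ \text{such that}\ d_{\mathrm{TV}}\bigl(P_{f(U)},\,P\bigr)\le\varepsilon \Bigr\},
\qquad U \sim \mathrm{Unif}\{0,1\}^{L}.
```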
The scheme is very much like the one we saw last time. Last time we were given a random variable with a given distribution and we wanted to generate a uniform random variable out of it. What we did was divide the domain into small parts such that each part had roughly the same probability; some mass was left over, and that we dumped into a final part, which we called X_{M+1}.
But the first M parts were balanced: they had roughly the same probability. Therefore, if you declare a different value on each of these parts, the resulting random variable is almost uniform, and we showed this formally. The amazing part of the scheme was the number of bits you were able to extract: it was roughly the entropy of the random variable.
So now we will look at a complementary scheme, where we are given a uniform random variable and we want to generate a given PMF P out of it. This is something which may be very important in practice. You may have seen a similar question in a probability course, where you use the inverse of the cumulative distribution function to do this; but what we will do now is very different, because we are accounting for the number of random bits used.
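For contrast, the inverse-CDF approach mentioned here can be sketched in a few lines of Python (my own illustration); note that it consumes one real-valued uniform sample and does not account for the number of random bits.

```python
import bisect
import itertools
import random

def inverse_cdf_sample(pmf, rng=random):
    """Sample once from a finite PMF (dict outcome -> probability) by
    inverting the cumulative distribution with one real-valued uniform draw."""
    outcomes = list(pmf)
    cdf = list(itertools.accumulate(pmf[x] for x in outcomes))
    u = rng.random()
    i = bisect.bisect_left(cdf, u)
    return outcomes[min(i, len(outcomes) - 1)]   # guard against rounding at the top

print(inverse_cdf_sample({'a': 0.5, 'b': 0.3, 'c': 0.2}))
```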
So here is the scheme. We define a set, a new set T_lambda, different from the previous one. Unlike the previous one, where lambda was a large-probability lower bound on -log P(x) (with P the distribution you want to generate), here lambda is a large-probability upper bound on -log P(x). So this is my set T_lambda, and we assume that -log P(X) <= lambda holds with large probability.
You can very easily check that for each x in T_lambda, P(x) >= 2^(-lambda). What we do is set M to be greater than 2^lambda, with the same lambda, and we now consider a partition of {1, ..., M}. Since each probability in T_lambda is at least 2^(-lambda), we can also say something about the cardinality of this set.
If you sum the probabilities of all the elements of the set, the sum is at most 1, but each element has probability at least 2^(-lambda); therefore there are at most 2^lambda elements in this set. So consider a partition Y_1, ..., Y_{M'+1} of the set {1, ..., M}, where M' is equal to the cardinality of T_lambda.
Remember that the cardinality of T_lambda is at most 2^lambda. Why? Because each element has probability at least 2^(-lambda) and the probabilities sum to at most 1, so the total number of elements in T_lambda cannot exceed 2^lambda. Hence this M' is less than M, by our assumption that M is greater than 2^lambda.
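In one line, the cardinality argument just used is:

```latex
1 \;\ge\; P(T_\lambda) \;=\; \sum_{x\in T_\lambda} P(x)
  \;\ge\; \lvert T_\lambda\rvert\, 2^{-\lambda}
\quad\Longrightarrow\quad \lvert T_\lambda\rvert \;\le\; 2^{\lambda}.
```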
Therefore you can form this partition of {1, ..., M} of size M' + 1. And what do we want out of this partition? If we list the elements of T_lambda as x_1, ..., x_{M'} (remember, there are M' elements in T_lambda), then what we want is that each part Y_i has probability, under the uniform distribution on {1, ..., M}, between P(x_i) and P(x_i) + 1/M.
So the question is, why, or rather how, can we do this? Take each element: we know that each element of T_lambda has probability at least 2^(-lambda), and 1/M is smaller than that. The probability of a part under the uniform distribution is just the number of indices you put into that part divided by M, and since M is more than 2^lambda, these uniform probabilities come in steps of 1/M, which is fine enough to land in an interval of length 1/M above each P(x_i).
So you can convince yourself that this kind of partition Y_1, ..., Y_{M'+1} can be designed; it is just very simple bookkeeping. (This U is the uniform distribution on {1, ..., M}.) So this is what we have done: we have constructed one part for each element of T_lambda, plus just one more, and for the first M' parts the probability lies between the probability of the corresponding element and that probability plus 1/M.
So we roughly have parts whose probabilities match the probabilities of these elements. Now you can imagine what we are trying to do: whenever we see a value in the part Y_i, we will declare x_i, because we want to generate something from the PMF P. The output is off by at most 1/M per element, but that is the error we have to account for.
And the last part, Y_{M'+1}, is mapped to some element x_{M'+1} outside T_lambda, some arbitrary element you can choose. Now what is the function we use to generate the PMF? It is very simple: when you see U in Y_i, you declare x_i. That is the function.
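Here is a minimal Python sketch of this generation map, with hypothetical names and two simplifications that I should flag: each part is a block of floor(M * p(x)) consecutive indices, so its uniform mass sits within 1/M below p(x) rather than above it as in the lecture, and the leftover indices are mapped to a fixed outcome of the PMF instead of an element outside T_lambda. The error analysis is of the same order either way.

```python
import math

def make_generator(pmf, lam, M):
    """Map u in {0, ..., M-1} (the value of a log2(M)-bit uniform string) to an
    outcome of the target PMF.  Each x with pmf[x] >= 2**-lam gets a block of
    floor(M * pmf[x]) consecutive indices; leftover indices go to a fixed outcome."""
    likely = [x for x in pmf if pmf[x] >= 2 ** (-lam)]   # the set T_lambda
    assign, next_free = {}, 0
    for x in likely:
        size = math.floor(M * pmf[x])
        for j in range(next_free, next_free + size):
            assign[j] = x
        next_free += size
    default = likely[0]                                  # simplification, see above
    return lambda u: assign.get(u, default)

# Toy target PMF (arbitrary numbers); lam = 4 puts every outcome in T_lambda.
target = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
M = 2 ** 6                                               # six uniform bits, M > 2**lam
g = make_generator(target, lam=4, M=M)

# Distribution of g(U) for U uniform on {0, ..., M-1}.
induced = {}
for u in range(M):
    induced[g(u)] = induced.get(g(u), 0.0) + 1.0 / M
print(induced)   # close to the target; the default outcome also absorbs the leftovers
```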
So let us analyse the total variation distance between the generated output and the target distribution: this generated distribution and the target distribution P, and we want to say that they are close. For each element x in T_lambda, we compare the probability that we output that element with the probability of that element under the distribution P; we sum these differences over all the elements and claim that the sum is small. Also note that this function essentially never declares anything outside T_lambda;
the only thing it declares outside T_lambda is x_{M'+1}. We account for the error from that, and from the final part Y_{M'+1}, separately; you can check this bound. So we will bound each of these terms.
For the first term, this is exactly how we constructed the partition: the uniform probability of Y_i and P(x_i) (this is the green condition above) differ by at most 1/M. That is how we constructed the Y_i's. So the difference is at most 1/M for each part, and there are M' elements in T_lambda, which gives at most M'/M in total.
The first term was easy, and the contribution of the elements outside T_lambda is just the probability of the complement of T_lambda under P, which by assumption is at most epsilon. The remaining term, the one coming from the last part Y_{M'+1}, requires some thought. What we have from the construction of this partition is that, for each x_i, P(x_i) is at most the uniform probability of Y_i; we already used this bound for the first term, and now we sum it over all of these parts.
Summing, you get P(T_lambda) on one side, and on the other side the total uniform probability of the first M' parts. But remember that we have partitioned the whole of {1, ..., M} into exactly these M' + 1 parts, so the total uniform probability of the first M' parts is just 1 minus the uniform probability of the last part. Hence the uniform probability of the last part is at most 1 - P(T_lambda), which is the probability of the complement of T_lambda, and by our assumption this is at most epsilon.
So what we have shown is that the total variation distance involves the first term plus two terms, each at most the probability of the complement of T_lambda, and the whole thing is divided by 2. When you divide by 2, what you get is a bound of M'/(2M) plus epsilon, and then you substitute the bound on M'. I will put a question here for you to think about, and maybe you can pause and think about it, although I will answer it right away.
Remember that our construction tells us that each element of T_lambda has probability at least 2^(-lambda); we verified this earlier. Since this is true for all x in T_lambda, if you add the probabilities of all the elements of T_lambda, the sum must be at least the cardinality of T_lambda times 2^(-lambda). But this sum must also be at most 1.
Because probabilities sum to at most 1, T_lambda must have cardinality at most 2^lambda. So you get this bound, and you can choose any M of your choice; we choose M to be 2^lambda / epsilon. With that choice of M, when you substitute everything, the total variation distance is at most 3/2 * epsilon. And log M is the number of bits, the size of the uniform random variable, that we used.
So here is the theorem that we have shown. Suppose you can find a lambda for which -log P(X) <= lambda with probability greater than 1 - epsilon; that is, lambda is a large-probability upper bound on -log P(X), the quantity we have been playing with this week and also last week. So suppose that is true.
This lambda is associated with your PMF P, the one you want to generate. Then, if you use L = lambda + log(1/epsilon) bits, which is just log M for our choice of M, in other words if you generate a uniform random variable of this length, then you can find a function f, the one we specified above, such that the total variation distance between f(U) and P is at most 3/2 * epsilon.
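Stated compactly, in my notation for what was just described:

```latex
\text{If}\ \ \Pr\bigl[-\log_2 P(X)\le\lambda\bigr]\ \ge\ 1-\varepsilon,
\ \text{then with}\ L=\lambda+\log_2\tfrac{1}{\varepsilon}
\ \text{and}\ U\sim\mathrm{Unif}\{0,1\}^{L}
\ \text{there is an}\ f\ \text{with}\
d_{\mathrm{TV}}\bigl(P_{f(U)},\,P\bigr)\ \le\ \tfrac{3}{2}\,\varepsilon .
```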
So the number of bits needed is roughly lambda. Now, once again, what is a good estimate for lambda? Earlier we needed a large-probability lower bound on -log P(X); now we need a large-probability upper bound on it. Well, a good estimate is again the expected value of this quantity, which is the same quantity as before: the Shannon entropy.