Module 1: Randomness and Entropy

Randomness and Total Variation Distance

So, let us begin by introducing another concept related to uncertainty, namely that of randomness. In the previous two lectures, we identified entropy as a measure of uncertainty, and therefore as the information revealed when you observe a random variable X generated from a distribution P. This information is the entropy of P, namely H(P) = Σ_x P(x) log(1/P(x)), because information, as you remember, was the reduction in uncertainty.
Why is it H(P)? Because the uncertainty before the answer was revealed to you was H(P), and after the answer is revealed it is 0; therefore the information is H(P) minus 0, which is just H(P). We called this a measure of uncertainty, and we used a problem related to compression to give an operational meaning to uncertainty. But there is another natural notion of uncertainty, namely that of randomness in a random variable.
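As a quick side illustration (not part of the lecture), here is a minimal Python sketch of this entropy formula for a finite pmf; the function name and the example distribution are my own choices.

```python
import math

def entropy(pmf):
    """Shannon entropy H(P) = sum_x P(x) * log2(1 / P(x)), in bits."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

# The distribution (1/2, 1/4, 1/4) used in the example below has H(P) = 1.5 bits.
print(entropy([0.5, 0.25, 0.25]))  # 1.5
```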
So, is a random variable really random, and how does uncertainty relate to randomness? This is the question we set out to answer in this week's lectures. There are two possible ways to measure how much randomness there is in a random variable X. What are these two ways? First, you may ask how many random bits are needed to generate the random variable.
That is, the random variable has a specified distribution and you want to generate it, but you only have perfect, unbiased, independent random bits available to you. How many such bits do you need to generate this random variable X? If you write computer code that generates a sample from this random variable X with pmf P, how many times will that code access a function which generates one random bit? That is one way to think of it.
This is a reasonable measure of the randomness in a random variable X, or associated with a pmf P. Alternatively, we can consider a second definition: how many random bits can be generated from this random variable X? That is, how many independent random bits can you extract out of X? In the first definition, random bits are the input: you feed random bits into a function, and it gives a sample from the distribution as the output.
In the second definition, we are asking the reverse: you take X as the input and output random bits. How many bits can be extracted out of this X? The answers to these two questions can be thought of as natural measures of randomness in a random variable. We will examine both these definitions this week. To make things concrete, let us look at some examples.
Let us look at the first example: how many independent, unbiased coin flips are needed to generate samples from these two distributions? Before I proceed, I request you to pause and think about the answer for both examples; when you come back, I will tell you the answer. For the first question, you need to generate 3 symbols: a, b, c.
Here, a should have probability one half, b should have probability one fourth, and c should have probability one fourth. Here is one way of doing it using at most 2 flips of an unbiased coin. You flip a coin. If it is heads, you declare a; if it is tails, you flip a second coin. If the second coin is heads, you declare b; if the second coin is tails, you declare c. What is the probability of a? It is one half. What is the probability of b? It is one half times one half, which is one fourth. And the probability of c is likewise one half times one half, which is one fourth. So, with two coin tosses I am able to generate one sample from this random variable with distribution (1/2, 1/4, 1/4).
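Here is a minimal Python sketch of this two-flip scheme (the function name and the use of random.getrandbits are my own illustration, not the lecture's notation):

```python
import random

def sample_abc():
    """Generate one sample from (a: 1/2, b: 1/4, c: 1/4) using unbiased bits."""
    if random.getrandbits(1):                      # first flip: heads -> a
        return 'a'
    return 'b' if random.getrandbits(1) else 'c'   # second flip decides b vs c

# Expected number of flips: 1*(1/2) + 2*(1/2) = 1.5, matching H(P) for this pmf.
```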
Now, how about the second distribution? How many coin tosses do you need to generate a sample from it? If you think of a procedure like the one we just described, you will soon realize that to get a probability like 1/3 out of an unbiased coin, whose outcomes have probability one half, you may need infinitely many flips in the worst case. So, we have to pose this question better.
A better question is: how many unbiased coin flips do we need on average, in expectation, to generate a sample from this distribution? That will be our measure of randomness in these two random variables. Let us look at another question, which is rather famous; it was studied by von Neumann, maybe about 70 years back.
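To illustrate the "in expectation" point, here is a rejection-style sketch in Python (my own example, assuming for concreteness that the second distribution is uniform over three symbols): each round uses two flips and may fail, so the number of flips is unbounded in the worst case but finite in expectation.

```python
import random

def sample_uniform3():
    """Sample uniformly from {0, 1, 2} using unbiased bits, by rejection."""
    flips = 0
    while True:
        two_bits = (random.getrandbits(1) << 1) | random.getrandbits(1)
        flips += 2
        if two_bits < 3:        # 00, 01, 10 -> symbols 0, 1, 2
            return two_bits, flips
        # 11 -> reject this round and flip again (happens with probability 1/4).

# Each round uses 2 flips and succeeds with probability 3/4,
# so the expected number of flips is 2 / (3/4) = 8/3, roughly 2.67.
```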
Suppose I give you access to samples from the distribution Bernoulli(p), and you want to generate one sample of an unbiased coin. This is the second question, the one about extracting random bits; the example we just saw was for the first question. How do you do that, and what is the minimum number of flips of this biased coin that you need? In other words, you want to generate an unbiased coin from a biased one.
Here is a protocol which von Neumann gave. It is not optimal, but it gives you a first example. You toss the coin 2 times. If both outcomes are the same, you toss it twice again; you stop when the two outcomes are different, that is, when the pair is HT or TH. If it is HT you declare 0, and if it is TH you declare 1, or vice versa, whatever you prefer.
Then you can show that, conditioned on the fact that you stop, the probabilities of 0 and 1 are equal, each exactly one half. So, this is one way to generate an unbiased coin from a biased one. In fact, you can compute the expected number of tosses of this coin that you will take before you stop. You stop when a pair is HT or TH, and the probability of that is 2p(1 - p).
Therefore, the number of rounds after which you stop is a geometric random variable with success probability 2p(1 - p), so the expected number of rounds before you stop is 1/(2p(1 - p)), with each round costing two tosses. The question now is: is this optimal? The answer is related to how much randomness there is in this random variable.
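A minimal Python sketch of this von Neumann protocol (the function names and the toss counter are my own additions):

```python
import random

def biased_flip(p):
    """One toss of a coin with P(heads) = p; returns True for heads."""
    return random.random() < p

def von_neumann_bit(p):
    """Extract one unbiased bit from a coin of bias p (0 < p < 1), counting tosses."""
    tosses = 0
    while True:
        a, b = biased_flip(p), biased_flip(p)
        tosses += 2
        if a != b:                 # HT or TH: each occurs with probability p(1 - p),
            return int(a), tosses  # so, conditioned on stopping, the bit is unbiased
        # HH or TT: discard the pair and toss twice more.

# The number of rounds is geometric with success probability 2p(1 - p),
# so the expected number of rounds is 1 / (2 p (1 - p)), each costing two tosses.
```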
Similarly, the answer to the first question is related to how much randomness there is in the pmf P. What is very interesting, and what we will see, is that the answer to both these questions is roughly the same: it is the entropy of the random variable. So, entropy indeed also measures the randomness in a random variable. We will see that in the next part.

Before we move forward to characterizing the measure of randomness of a random variable, which we have related to the problem of either generating unbiased coins out of a random variable or generating the random variable from unbiased coins (both formulations make sense), we want to introduce an approximate version of that question.
The goal will no longer be to generate an exactly unbiased coin, or a random variable with exactly the given distribution P, but to generate a random variable with a distribution close to P. That raises the question: how do you measure distances between distributions? What do I mean by saying that a distribution is close to P? So, this is a bit of a digression; what I would like to do now is tell you how we measure distances between distributions.
We will revisit this topic later, but this is an appropriate time to introduce it for the first time. As you have probably noticed, in this course, rather than stating quantities upfront, I have been trying to first use a motivating question to set up a problem and then introduce quantities as a natural answer to that problem. In that context, a very good candidate for measuring distances between distributions is a quantity called the total variation distance. It has many names.
It is also called the statistical distance. I would say that statistical distance is the more popular name for this quantity in computer science and machine learning, while total variation distance is the more classic name. It is very easy to define this quantity for discrete distributions, namely pmfs. This is something we will encounter again and again in this course, just like entropy; it is one of the other main characters of this course.
The first main character was entropy; the second one, introduced today, is this total variation distance d(P, Q), d for distance between P and Q. It is defined as d(P, Q) = (1/2) Σ_{x in X} |P(x) - Q(x)|, where X is the alphabet. So, you go over all the realizations x and add up |P(x) - Q(x)|; it is well-defined. You can check that this is a reasonable measure of distance, in the sense that it satisfies the three properties of a metric, although that is not very important for our treatment.
If d(P, Q) is 0, then P(x) must equal Q(x) for all x, and therefore P equals Q; identical distributions have distance 0, and the distance is non-negative, which is also good. So it is 0 if and only if P equals Q. It also satisfies the triangle inequality, d(P, Q) ≤ d(P, Q') + d(Q', Q), which you can check as well. We will come back to these properties later, but as a quick sanity test they show that this is a reasonable notion of distance between distributions.
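A minimal Python sketch of this definition, with pmfs represented as dictionaries over a common alphabet (the representation and the example distributions are my own):

```python
def tv_distance(P, Q):
    """Total variation distance: (1/2) * sum_x |P(x) - Q(x)|."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
Q = {'a': 1/3, 'b': 1/3, 'c': 1/3}
print(tv_distance(P, Q))  # 1/6, about 0.1667
```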
I will review a few essential properties, which are essentially equivalent expressions for this d(P, Q). What you can show is that d(P, Q) is a supremum over all events. A supremum is similar to a maximum, but requires a slightly more formal definition.
Specifically, d(P, Q) = sup over all subsets A of the alphabet X of (P(A) - Q(A)). So it is the maximum difference between the probabilities of an event A under P and under Q; that is what d(P, Q) is. By symmetry, this is also equal to sup over B of (Q(B) - P(B)), and therefore you can also show that it is equal to sup over A of |P(A) - Q(A)|.
You can put the absolute value there because the supremum is attained by a non-negative quantity. I am not proving this property here, but it can be shown; in fact, we can write down exactly which set attains this supremum. A supremum is a maximum when it can be attained, and here it is indeed a maximum, because it is attained by the set A* = {x : P(x) > Q(x)}. That is the set for which P(A*) - Q(A*) equals the supremum, namely d(P, Q).
Similarly, the other supremum is attained by the set B* = {x : Q(x) > P(x)}. This is one property that will be useful for us; I urge you to go back and study these properties carefully. It is something you should be able to recall quickly throughout this course.
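A quick numerical check of this variational formula, by brute force over all events (a self-contained sketch; the distributions are the same hypothetical ones as above):

```python
from itertools import chain, combinations

P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
Q = {'a': 1/3, 'b': 1/3, 'c': 1/3}

def max_event_gap(P, Q):
    """max over all events A of (P(A) - Q(A)), enumerating every subset."""
    support = sorted(set(P) | set(Q))
    events = chain.from_iterable(combinations(support, r)
                                 for r in range(len(support) + 1))
    return max(sum(P.get(x, 0.0) for x in A) - sum(Q.get(x, 0.0) for x in A)
               for A in events)

# The maximum is attained by A* = {x : P(x) > Q(x)} = {'a'} here,
# and it equals the half-sum formula: 1/6.
print(max_event_gap(P, Q))  # about 0.1667
```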
Another property of this d(P, Q) is that it always lies between 0 and 1. It is 0 if and only if P equals Q, and it can also be 1. You can see it is at most 1 because it equals P(A*) - Q(A*), and P(A*) is at most 1. When will it equal 1? That is interesting. Equality on the right side holds if and only if the support of P and the support of Q are disjoint. So, what is the support of a pmf, or of a distribution?
The support of P is the set of points on which the distribution P puts mass, and similarly the support of Q is the set of points on which Q places mass. So, d(P, Q) is 1 if and only if these two supports are disjoint. If you picture the alphabet X with one region where P puts mass and another where Q puts mass, and these regions are disjoint with no common point between them, then d(P, Q) will be 1. And this is an if and only if: if the supports are disjoint then the distance is 1, and if the distance is 1 then the supports must be disjoint. You can also show this; we will come back to that proof later.
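A tiny self-contained check of this extreme case (the specific pmfs are my own example):

```python
# Two pmfs with disjoint supports: P lives on {a, b}, Q lives on {c, d}.
P = {'a': 0.5, 'b': 0.5, 'c': 0.0, 'd': 0.0}
Q = {'a': 0.0, 'b': 0.0, 'c': 0.25, 'd': 0.75}
tv = 0.5 * sum(abs(P[x] - Q[x]) for x in P)
print(tv)  # 1.0 -- the maximum possible value
```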
Finally, I would like to point out that this definition of d(P, Q) can be easily extended to continuous random variables and, more generally, to distributions with densities. Suppose the two distributions P and Q have densities f and g, respectively: f is the density of P and g is the density of Q.
These are densities with respect to the standard Lebesgue measure, but I do not really need to use that language: when you think of a continuous random variable, you can think of its density. If P and Q have densities f and g with respect to the length measure, then you can define d(P, Q) simply as (1/2) ∫ |f(x) - g(x)| dx, the integral being over x. Just like we had a summation earlier, now we have an integral of densities.
So, the pmf is replaced with a density and the summation with an integration. This, too, is equal to the supremum over A of (P(A) - Q(A)), and equality holds under a similar condition: the set which attains it is A* = {x : f(x) > g(x)}. This total variation distance d(P, Q) is what we would like to use as a measure of distance between distributions, and it satisfies various properties.
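As a small illustration of the continuous case (my own example, not from the lecture): for Uniform[0, 1] versus Uniform[0.5, 1.5], the densities disagree on a region of total length 1, so the distance is 1/2. A numerical sketch:

```python
import numpy as np

# Densities of Uniform[0, 1] and Uniform[0.5, 1.5].
f = lambda x: ((0.0 <= x) & (x <= 1.0)).astype(float)
g = lambda x: ((0.5 <= x) & (x <= 1.5)).astype(float)

# d(P, Q) = (1/2) * integral |f(x) - g(x)| dx, approximated on a fine grid.
x = np.linspace(-1.0, 2.5, 700_001)
dx = x[1] - x[0]
tv = 0.5 * np.sum(np.abs(f(x) - g(x))) * dx
print(tv)  # about 0.5; here A* = {x : f(x) > g(x)} = [0, 0.5)
```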
For instance, it always lies between 0 and 1. The important property that we saw in this part is the formula that d(P, Q) is the supremum over all A of (P(A) - Q(A)). This is very important. Such formulae will appear throughout this course: they take a quantity like this and relate it to the maximum of some other function.
In particular, they are useful because for any particular A, P(A) - Q(A) is at most d(P, Q), and for the best A, namely A*, equality holds. That is one way to interpret such formulae. Such formulae are sometimes called variational formulae, and that is why this distance is sometimes also called the variational distance: because it has this form.
The name total variation distance is in fact also associated with this form; it is related to the notion of total variation, and you can look up the reason for the name. But variational distance, total variation distance, statistical distance: these are all names for the same thing, namely this distance, and we will use it throughout this course. See you in the next lecture.