We just saw random variables and their expected values, and I assume that by the time you have reached this video, you have at least brushed up on these basics and are now comfortable with them.
Now, what is it that we are going to do next? We will try to understand how a probabilistic model provides an estimate for the unknown. When you model an unknown, you would also like to ensure that the model provides a good estimate for that unknown.
So, for instance, in our particular application, the unknown was the answer to a question, and we said that we will model that unknown answer as a random variable with, let us say, a given distribution. Now we would like to understand, once we put such a distribution on it, what kind of estimates we have for that unknown. There are various inequalities in probability that give you such estimates.
So, one basic estimate that we all know, maybe formally or informally, is that the expected value of a random variable is an estimate for the random variable; that is, of course, why it is called the expected value. Now, we can provide various accuracy guarantees for this estimate, and these guarantees constitute what are sometimes called concentration bounds.
These inequalities, these concentration bounds, vary in their degree of complexity. We start with the most basic such inequality, which is called Markov's inequality. It applies only to non-negative random variables.
So here is the statement of Markov's inequality. It says that if you have a non-negative random variable X, and non-negativity is very important here, then the probability that X exceeds any number a is at most the expected value of X divided by a: P(X ≥ a) ≤ E[X]/a, and this is true for any positive a.
That is the statement; before we prove it, let us see what it means. We can recast the statement by choosing a = E[X]/δ. Substituting this value of a, the statement says that the probability that the random variable is more than 1/δ times its expected value is at most δ. For instance, the probability that it is more than 10 times its expected value is at most 1/10.
Or: the probability that a non-negative random variable exceeds a million times its expected value is very small, at most 10⁻⁶. This is how one should read Markov's inequality: the expected value is accurate up to a multiplicative factor of 1/δ, except with probability δ. So we can now believe that no non-negative random variable can significantly exceed its expected value with large probability.
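Since Markov's inequality will be used again and again, a quick numerical sanity check may help. This is my own illustration, not from the lecture: a sketch assuming X ~ Exponential(1), so that E[X] = 1 and X is non-negative.

```python
import random

# Monte Carlo sanity check of Markov's inequality, P(X >= a) <= E[X]/a,
# for a non-negative random variable. Here X ~ Exponential(1), E[X] = 1.
random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

for a in (2, 5, 10):
    empirical = sum(1 for x in samples if x >= a) / n
    bound = mean / a  # Markov's bound E[X]/a
    print(f"a={a}: empirical P(X>=a) = {empirical:.4f} <= bound {bound:.4f}")
```

For an exponential the true tail e^(-a) is far below the Markov bound, which shows that Markov's inequality is valid but often loose.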
So with this understanding of the statement, let us try to prove Markov's inequality. This is the first proof I am doing in this course, and it is a rather simple one. Well, let me take that back: it is rather simple to follow, but coming up with it required a good idea. Here is how we prove Markov's inequality.
Remember, we have to use the fact that X is a non-negative random variable. We start from the expected value of X, and since we want to bound a probability, we will try to extract that probability out of the expected value. Recall that the expected value of a random variable with density p is E[X] = ∫ x p(x) dx.
And, as you will see, the proof as we write it applies to both the continuous and the discrete case if we write the expectation equivalently as the integral ∫ x dP(x). This is just notation, but it captures both the discrete and the continuous case.
We multiply each value x by the probability with which it appears, where dP(x) is the probability of that infinitesimally small element; that is how you can read this integral. Now I divide the integral into two ranges: first the set where x ≥ a, and second the set where x < a.
If you look at the first set, throughout that set the variable x is at least a, so I can lower bound x by a there, and the constant a comes out of the integral. We are simply splitting the range of the random variable into two parts.
What about the second part? The second part is interesting: we claim that it is non-negative. Well, X is always non-negative, so wherever we are integrating, x ≥ 0, and the entire second part must be non-negative.
So we can lower bound the second part by 0, and we are left with the first part. After the a comes out, what remains is a times the integral of the indicator of the event {x ≥ a}, and integrating the probability over that set is just the probability of the event. So E[X] ≥ a · P(X ≥ a), and taking a to the other side gives exactly what Markov's inequality was claiming. That is the proof. It is very simple, but it requires some thought. I would urge you to write this proof again using summation notation: split the range of summation into two parts, the first where x ≥ a and the second where x < a. When you write that proof, try to follow it in this notation as well, so that you become comfortable with the more general notation.
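For reference, the summation version that you are urged to write out might look like this, for a discrete non-negative X with pmf p:

```latex
\mathbb{E}[X] = \sum_{x} x\,p(x)
             = \sum_{x \ge a} x\,p(x) + \sum_{x < a} x\,p(x)
             \ge \sum_{x \ge a} a\,p(x) + 0
             = a\,\Pr[X \ge a],
```

and dividing both sides by a gives Pr[X ≥ a] ≤ E[X]/a.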
Once again, the reason I use this more general notation is so that one proof covers both the discrete and the continuous case. You can try to internalize the proof, but what you should remember completely is Markov's inequality itself; I will use it again and again in this course. It says that a non-negative random variable exceeds any quantity a with probability at most E[X]/a.
Alternatively, a non-negative random variable exceeds 1/δ times its expected value with probability at most δ. So it exceeds one million times its expected value with probability at most 10⁻⁶.
So this allows us to at least be convinced of the heuristic that the expected value is an estimate for the values of the random variable, in the following sense: with large probability, a non-negative random variable will not exceed its expected value by much.
But that only says it cannot exceed the expected value by much. Can it fall far below the expected value, or is it really close to the expected value? It turns out that the same inequality can be used to get even that kind of estimate.
Now, instead of applying the inequality to the random variable X itself, we apply it to the error X − E[X]; in fact, we apply it to the squared error. When we do that, applying Markov's inequality to (X − E[X])², we get P((X − E[X])² ≥ t²) ≤ E[(X − E[X])²]/t².
So it is the same inequality, applied to a slightly different random variable, the error random variable. It says that the probability that |X − E[X]| exceeds t is at most E[(X − E[X])²]/t². You may now recognize this quantity: it is exactly the variance of the random variable.
This gives us the following very useful bound, which is called Chebyshev's inequality; at least in this course, we will call it Chebyshev's inequality throughout.
It says that a random variable X lies between its expected value plus or minus some error. What is that error? It is the square root of the variance divided by δ, that is, √(Var(X)/δ), and X lies in this interval with probability greater than 1 − δ.
For instance, if you substitute δ = 10⁻⁶, it says that with probability greater than 1 − 10⁻⁶, the random variable X belongs to E[X] ± 10³ √Var(X); the factor 10³ is just the square root of 1/δ.
This bound is not tight in its dependence on δ; very often you can find bounds with a better dependence on δ. But it gives you a very good idea of why the expected value of a random variable is a good estimate for its value. Such bounds are sometimes called confidence bounds.
So what we can claim here is that with confidence 1 − δ, where δ is the error probability, the random variable X lies between its expected value plus or minus √(Var(X)/δ). That is a confidence bound.
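Here is a small numerical illustration of this confidence bound, again my own sketch rather than the lecturer's: assuming X ~ Exponential(1), so that E[X] = 1 and Var(X) = 1, we check that X lands inside E[X] ± √(Var(X)/δ) with frequency at least 1 − δ.

```python
import random

# Empirical check of Chebyshev's inequality: X lies in
# E[X] +/- sqrt(Var(X)/delta) with probability at least 1 - delta.
# Here X ~ Exponential(1), so E[X] = 1 and Var(X) = 1.
random.seed(1)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

delta = 0.1
half_width = (var / delta) ** 0.5  # sqrt(Var(X)/delta), about 3.16 here
inside = sum(1 for x in samples if abs(x - mean) <= half_width) / n
print(f"fraction inside E[X] +/- sqrt(Var/delta): {inside:.4f} (>= {1 - delta})")
```

As with Markov, the empirical coverage is well above the guaranteed 1 − δ, reflecting that Chebyshev is a worst-case bound.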
We just saw a confidence bound, which showed that with significant probability a random variable lies between its expected value plus or minus the square root of its variance divided by δ, where δ controlled the confidence. Now we will build on that result to obtain what are called limit theorems.
The previous result was valid for any single random variable, but these estimates can become very sharp when the random variable has a particular structure. The structure of interest to us in this course, and in many other places, is when the random variable is a sum of many independent random variables.
For example, the random variable can be the total number of heads you see when you toss a coin 1000 times. This particular random variable can be thought of as a sum of 1000 independent Bernoulli random variables. For such random variables, we can obtain very precise estimates. How do we do that?
The beginning of all this is an observation which I assume you have seen; if not, you should revise it, because it is very fundamental. It says that, just like expected value, which is linear, variance is also additive for a sum of independent random variables.
So if you have independent random variables X1, ..., Xn (independence is important here; actually the weaker assumption of uncorrelatedness will do, but independence implies uncorrelatedness), then the variance of their sum is equal to the sum of their variances. So variance just grows linearly.
How can this simple observation possibly help us? By the way, can you prove it? It is not difficult to show; maybe we prove it now. Let me just make some space. So, why is the variance of a sum equal to the sum of the variances?
It suffices to show this for two random variables. So suppose X and Y are independent, and look at Var(X + Y). This is equal to E[(X − E[X] + Y − E[Y])²]. You can expand this square, taking X − E[X] as one term and Y − E[Y] as the other.
When you do that, using (a + b)² = a² + b² + 2ab, the first term is E[(X − E[X])²], the second term is E[(Y − E[Y])²], and the third term is 2 E[(X − E[X])(Y − E[Y])].
A question here: the expression inside the expectation is a sum of these terms, but we are writing the expected value of the sum as the sum of the expectations of each term. How does that happen?
Well, that is where we use linearity of expectation: the expected value of a sum equals the sum of the expected values. That holds essentially by definition, because we can think of expectation as an integral, and an integral is linear.
But there is something very interesting here. Since X and Y are independent, or just uncorrelated, the expected value of the cross term can be split into a product: E[X − E[X]] times E[Y − E[Y]]. This is precisely where we use independence, and uncorrelatedness would have sufficed. But E[X − E[X]] = 0 and E[Y − E[Y]] = 0, so the cross term is 0.
And the two terms that are left are the variance of X and the variance of Y. So Var(X + Y) = Var(X) + Var(Y) plus a cross term which is 0, and therefore it is just the sum of the two variances. You can now apply this repeatedly to get the same result for multiple random variables, as long as they are independent: the variance of the sum equals the sum of the variances. It looks like a simple observation, but it is very powerful.
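The additivity argument above can be checked numerically; this is my own sketch with assumed distributions, summing independent Uniform(0, 1) variables, each with variance 1/12.

```python
import random

# Numerical check that Var(X1 + ... + Xn) = Var(X1) + ... + Var(Xn)
# for independent random variables: here 10 independent Uniform(0,1)
# variables, each with variance 1/12, so the sum should have variance 10/12.
random.seed(2)
n_vars, trials = 10, 200_000
sums = [sum(random.random() for _ in range(n_vars)) for _ in range(trials)]
mean = sum(sums) / trials
var_of_sum = sum((s - mean) ** 2 for s in sums) / trials

print(f"empirical Var(sum) = {var_of_sum:.4f}, predicted = {n_vars / 12:.4f}")
```

The empirical variance of the sum matches n/12 up to Monte Carlo noise, as the additivity result predicts.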
Now, what we can do is apply Chebyshev's inequality, which we saw in the previous video, to this sum. Here μ, the Greek letter mu, is my notation for the expected value of the sum of the Xi divided by n, so for the sum of the random variables we subtract n times μ. It is the per-symbol expected value: if the Xi are all identically distributed, it is also the expected value of each individual coordinate.
By Chebyshev's inequality, this difference exceeds the square root of 1/δ times the variance of the sum, which we just saw is the sum of the variances, with probability at most δ. When they are iid — this is an abbreviation I should pause and describe, since we will come back to it again in this course — iid stands for independent and identically distributed, so just copies of the same random variable.
In the coin toss example that I gave, each coin toss is independent and identically distributed. In this case μ is just the expected value of each individual toss, and the variance is the same for all of them; I am calling it σ². So the variance of the sum is n σ², and after dividing by n, the sample average is within √(σ²/(nδ)) of μ with probability at least 1 − δ.
This is very good: it says that the sample average of the random variables goes to the actual expected value of the distribution, and the accuracy is of order 1/√n. So the average is μ plus or minus something of order 1/√n; when n becomes large this error goes to 0, and it goes to 0 as 1/√n.
So suppose I ask you: if you toss a coin 1000 times, how many heads will you see? If it is an unbiased coin, the expected number of heads is half of 1000, which is 500, plus or minus a correction, and the fractional correction is like 1/√n. Here n is 1000, and the square root of 1000 is about three times the square root of 100, so about 30.
So the fraction of heads you expect to see is one half plus or minus about 1/30; that is the correction term. So not only can you give a nominal estimate, you can also give a confidence interval around it. And this confidence interval sharpens as the number of independent trials becomes large.
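The 1000-toss example can be simulated directly; this is my own sketch of the lecture's back-of-the-envelope calculation, checking how often the head count falls in 500 ± 30.

```python
import random

# The 1000-coin-toss example: the number of heads concentrates around
# n*mu = 500, with fluctuations on the order of sqrt(n), roughly 30.
random.seed(3)
n, trials = 1000, 5_000
counts = [sum(random.random() < 0.5 for _ in range(n)) for _ in range(trials)]
within = sum(1 for c in counts if abs(c - 500) <= 30) / trials
print(f"fraction of runs with heads in 500 +/- 30: {within:.3f}")
```

Most runs land inside the interval, consistent with the claim that the deviation is on the order of √n.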
One way to read this is that the average of the Xi converges to the mean, the expected value, in a sense called convergence in probability: for every δ, no matter how small, for large enough n the sample average will be very close to the mean with probability at least 1 − δ. I am not defining it formally, but you can look it up.
The resulting theorem is sometimes called the weak law of large numbers. It says that for every ε, whatever accuracy you want, in the limit as n goes to infinity, the probability that the sample average deviates from the mean, the actual expected value, by more than ε goes to 0.
In fact, we have a very precise estimate: we know that the deviation exceeds ε with probability less than δ whenever n is greater than σ²/(δ ε²); this is just from Chebyshev's inequality. What did we assume about the Xi? We assumed they are independent and identically distributed with variance σ². So we showed this under those assumptions.
So this gives us a good estimate of how far the sample mean can be from the expected value of the underlying distribution, but it is not precise. In particular, the dependence on δ is not very good: it gives a 1/δ dependence, and this can be improved to log(1/δ). Let us see how.
In fact, you may have heard of this theorem, but you probably do not know how to use it. Recently, I was talking to a friend of mine who uses data quite a bit in business modeling, and he was commenting that although he had heard about the central limit theorem in an undergraduate probability course, he never realized that it is a central idea in statistics.
What the central limit theorem gives is a precise estimate of how far the sample mean, (1/n) Σ Xi, can go from the expected value μ of the underlying distribution. We just saw from Chebyshev that the deviation is controlled by σ²/(ε² δ), but this is not a very tight bound. Actually, we can improve this 1/δ dependence to log(1/δ), and that is what the central limit theorem will give us.
So here is the statement. Suppose you have n independent random variables X1, ..., Xn, and here I am assuming they are iid, so they all have the same distribution, with mean E[Xi] = μ and variance σ². Consider the sum of these random variables minus n times μ, where n μ is the nominal value of the sum that you expect.
The probability that this difference exceeds t times √(n σ²), where the √n is the error scale that we saw earlier, converges, as n goes to infinity, to exactly Q(t). That is the central limit theorem.
What is this Q(t)? It corresponds to the Gaussian tail. So it is as if the error is behaving like a Gaussian random variable with mean 0 and variance n σ²; that is what this theorem is saying: in the limit as n goes to infinity, the error behaves as if it were Gaussian.
So this probability is exactly the Gaussian tail, given by the probability that a standard Gaussian random variable exceeds t. If you do not recognize this density, you really have to brush up your random variables: the standard Gaussian density is very important, and it is given by (1/√(2π)) e^(−x²/2).
So Q(t) has a precise formula, and it is approximately e^(−t²/2). If you want this probability to be δ, you set e^(−t²/2) = δ, so t comes out to roughly √(2 log(1/δ)).
This is another way to state the same result, and it can be interpreted as a confidence bound. It says that the sample average, (1/n) Σ_{i=1}^n Xi, is roughly the expected value plus or minus a correction term: it lies in μ ± √((2σ²/n) log(2/δ)) with probability greater than or equal to 1 − δ. That is the confidence interval.
So if you want the error probability to be 10⁻⁶, then this log(2/δ) is only about 14 (with the natural logarithm). Chebyshev's inequality gives a similar estimate, as we saw with the weak law of large numbers, but with 1/δ in place of log(2/δ), and 1/δ for δ = 10⁻⁶ is 10⁶, which even after taking the square root is 10³.
Now you just have this factor, which also sits inside the square root, at about 14. So you see how much sharper this bound is. But morally it is saying the same thing as the previous bound; it just improves the dependence on the probability. This is one implication of the central limit theorem.
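To see the Gaussian tail emerge, here is a simulation of my own, not from the lecture, assuming iid Uniform(0, 1) summands: the standardized sum's tail probability is compared with the exact Gaussian tail Q(t).

```python
import math
import random

# Empirical illustration of the central limit theorem: the standardized
# sum of n iid Uniform(0,1) variables has a tail probability close to the
# Gaussian tail Q(t) = P(Z >= t) for a standard Gaussian Z.
random.seed(4)
n, trials, t = 50, 50_000, 1.5
mu, sigma = 0.5, (1 / 12) ** 0.5  # mean and std of Uniform(0,1)

def standardized_sum():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (math.sqrt(n) * sigma)

tail = sum(standardized_sum() >= t for _ in range(trials)) / trials
q_t = 0.5 * math.erfc(t / math.sqrt(2))  # exact Gaussian tail Q(t)
print(f"empirical tail: {tail:.4f}, Q(t): {q_t:.4f}")
```

Even at n = 50 the two numbers agree to a few decimal places, which is the sense in which the error "behaves as if it is Gaussian".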
So that is what I wanted to say in this first unit. Let me just do a quick summary of what we saw.
So, unit one. What we saw first was that we tried to define information, and we agreed to say that information equals reduction in uncertainty; that is how we defined information.
Then we asked: what is uncertainty, and how do we model it? We said that we will put a probability distribution on our unknown, and somehow this probability distribution will lead to a measure of uncertainty. We have not seen that yet; it will come in the next unit.
We saw a very simple way of measuring uncertainty: we said that maybe the cardinality of the set of all possible guesses can be thought of as a measure of uncertainty. But now we are looking for a more refined measure that takes the probability distribution into account, not just the guess list but also the likelihood of different guesses. We will come to that.
And then, in addition, we started reviewing probability. In particular, I proved Markov's inequality and Chebyshev's inequality, and for the most part of this course, these two inequalities are good enough for us to obtain the estimates on unknown random variables that we need. That is the first unit for you; see you next time.