We will introduce to very basic problems in statistics, namely that of Hypothesis Testing and Estimation. So, essentially the entire statistics is about, roughly about these two problems. So, there is no way I can describe them completely in this single lecture which is just 20 minute, perhaps will be just about 20 minutes long. But I want to give you some basic idea about this problems because it is through this problem that we will be able to characterize how much information does some random variable Y contains about another random variable X. So, let us begin. So, let X and Y be two random variables, two jointly distributed random variables with joint distribution PXY. So, this is the notation which we will use throughout. So, P subscript X, Y. So, this is the joint distribution of X and Y. Suppose that X is an unknown random variable and Y is the observed random variable. You can think of some example, where X corresponds to, X corresponds to some binary random variable like a class and Y correspond to something that you see about it. So, for a very concrete example which is quite popular, which is quite important in practice, you can think of, you can think of the problem where you want to predict if a particular user will click an advertisement or not. In that case, there are two kinds of users; the users who click and users who do not click. And this is your unknown random variable X; you do not know which user is which. So, when X is 0 then this is corresponds to user who does not click. And when X is 1, it corresponds to a user who will click.
And Y can be features of this user. For instance, which country is the user logging in from? Where did it come to your website from? Which other website did it come to your website? What are the other things you know about the user from its, from the user’s cookies, and so on and so forth. What time zone the user in? Sometimes you have much more information. There are, several competitions on a website like Kaggle which, which ask you to solve this particular hypothesis testing problem. So, X is the unknown and Y is the observed random variable. So, before we proceed, we have to put down some basic notation. So, we have to bring in this notion of conditional distribution, P of Y given X, it is a distribution of Y once you conditioned it on X. So, remember X is your unknown and Y is your observed random variable. This distribution has a special name in information theory. It is often called a channel. So we just, this, this is called a channel it may sound like a strange name to you. But perhaps I will try to explain why this is called a channel. So, X is the thing that you do not know and it is the message that you want to figure out. And Y is the observation that you have. And this X is passed through some channel, this particular channel to give you Y. So, this channel is thought of like some communication channel and you observe X through it to see your Y. So, from now on we call this conditional distribution a channel which adds some noise to the input to give you the output. A simple way to think of a channel is that it is a randomized mapping from this input alphabet X to this output alphabet Y. Corresponding to each X, there is a distribution Y which we can abbreviate by W subscript x.
For discrete random variables to describe this distribution W subscript x, we only need to give its PMF, Probability Mass Function. And most of our this course will be about discrete random variables. So, for discrete X and Y, we will use this notation W of y given x. This W of y given x here. It will denote the probability of Y equal to y given that X is equal to x, is that conditional probability. So, just to be explicit here, this is by Bayes rule. This is equal to probability of Y equal to y, X equals to x, maybe I should make some more space for myself. This here just by Bayes rule, it is the probability Y equal to y, X equals to x divided by probability X equals to x. This is the conditional probability; this is valid as long as the probability of X equal to x is non zero. So this, this conditional distribution is what we will call a channel Now, so when we observe a y and we want to infer something about x, then we have a channel associated with y. And we can think of this family of, this this family of distributions, W subscript x for different x in x as representing the experiment that we are running to determine the unknown x. And in this experiment Y is the observed random variable and x is the unknown random.
Once again, the goal is to determine X by observing Y which is distributed, this notation, this symbol is use for distributed. So, which is distributed as Wx. This is a classic statistical inference problem, that is what people in statistics want to do. They want to, they want to figure out what is X by looking at Y where the distribution of Y depends on X. And this channel essentially represents that dependence. So, we will look at a specific kind of statistical problem which is sometimes called Bayesian, which is sometimes called a Bayesian formulation. So, what happens in a Bayesian formulation is that, you have a distribution which generates this X. So, we assume that X is generated from a fixed distribution, Px. That is what we will assume throughout. And such a formulation is called a Bayesian formulation. So, in this, in this part of the course, we will go with a Bayesian formulation. So, let us take a very concrete example. Let me introduce what is called binary hypothesis testing. So binary hypothesis testing is a problem, is a statistical problem when this X (())06:55) cardinality 2. Just like a click prediction example that I described. So, there are 2 hypotheses. This is my calligraphic H. So, there are 2 hypotheses; H0 and H1. Under H0, your Y has distribution W0., under H1 your Y has distribution W1.
And your goal is that upon observing Y, which has distribution WX and this X itself is random, we want to figure out what X was. So, what we can do is we can form an estimate X hat of the unknown X by looking at Y, by looking at Y. And we, how do we know that this is a good estimate? Well, what you want to do is we want to minimize the average probability of error. So how is this average probability of error defined?
We want to minimize the probability that X is not equal to this X hat that you have formed. So, what is this probability of error? Well, we can write the total probability formula. So how does it go, probability that X is equal to 0, multiplied with the probability that the output X hat is different from X given that X equal to 0. This will happen when g of Y is 1. So when g of Y is 1, the output will be 1 while input was only 0. So, this is the first part of the error probability, PX0 multiplied with probability that g Y equals to 1 given, given X equal to 0. So that is the distribution here, maybe I should write it more it more elaborately.
So, this will basically probability that g Y equals to 1 given X equals to 0, plus probability that X equal to 1 multiplied by the probability that g Y equal to 0 given X equal to 1. So, the probability that you declared 0 given that the X was 1. And probability that you declared, declared 1 given that X equal to 0. This is the probability of error.
So, if this mapping g is deterministic. If it is a fixed mapping which does not use any extra local coins. So, what is this notion of deterministic mapping versus randomized mapping. Well, in statistics it is common, statistics and algorithms, it is common to think of randomized mapping where the mapping g is not only a function of your Y but also an independent coin which you can generate in your computer and use to make a decision.
But for simplicity we just look at simple functions, simple mappings, which are deterministic. So, they just look at Y and declare 1. So, for such a mapping we can think of a set at A0 as a set of those y’s for which g of y is 0. We can always associate; we can always define the set A0. It is a set of those y for which this mapping g declares 0.
So, what is the probability of error in terms of the set A0, well it can be written as P of X0. Now under 0, under X equal to 0, the distribution of Y is W0. What is the probability that you do not see a y in A0, well that is this probability that y is in A0 compliment plus probability that X was 1 and you saw a y in A0. But this time, the measure that you use is W1; the probability that you use is W1. Here, you use the probability W0. It is a more detailed expression for probability of error.
So, the distribution PX has a name, it is called a prior. And, so whenever you may have already heard this phrase prior. It is your prior belief about the unknown X. Or about the unknown hypothesis. So, it is called prior. It is your prior belief before you saw Y. The distribution P of X given Y is the posterior distribution, it is a distribution of X after you have seen Y.
And this posterior distribution can be computed using Bayes rule. It can be written as this, W of X given Y is P of X given Y is just PX W of Y given X this is the conditional distribution by PY of y. So, when you have an input distribution, the prior PX and this channel W, you can also write the output distribution.
So, I am calling it the output distribution for the output of the channel. The input distribution is prior, output distribution is PY, the distribution of the observed output. It is given by the average distribution, where the averaging is done over this PX your prior. So we are given each X the distribution of Y is WY given X and you weigh the distribution by PX.
So, for a uniform prior, in this particular case, a uniform prior corresponds to having equal belief about both 0 and 1 to begin with. So you have no preference of two hypotheses. You have no other information to prefer one over the other. So, PX0 equals to PX1 one equals to half, that is a uniform prior. For this prior, the probability of error can be expressed as half times probability under W0 of A0 compliment plus half times probability under W1 of A0. So half, this half corresponds to X being 0 and under X being 0, the distribution of Y is under W0.
And this corresponds to you declaring 1. This set A0 remember corresponds to your rule g1. And again, with probability half, X is 1. And then when X is 1, then the probability distribution Y is W1 and this corresponds to you declaring 0. So, this is the probability of error in terms of the set A0.
Since W0 of A0 complement, is 1 minus W0 of A0. If you substitute this, if you substitute this here, you get half minus half into this. Remember now A0 is the same set used in both these measures. So, can you recognize this expression? So, now what can we do with this? Let us see what this expression is about. I think you have seen this expression before in this course.
So note that we can choose any A0 because we were looking for the best decision rule. So, we can choose any A0 each mapping g which is our decision rule corresponds to an A0. The probability, the least probability of error, which we will denote by PE star uniform, this uniform is for the uniform prior that we choose is given by this quantity here, P star is half minus half into, since you want to minimize probability of error, you will maximize this thing, which is negative max over A W0 A minus W1 A.
So, can you now recognize this quantity, is it something that you have seen before in this course? Why do not you pause and think about it. Indeed, this is a total variation distance. So, the probability of error is the total distance between W0 and W1. And therefore, the probability of error is half into 1 minus total variation distance between W0 and W1.
And the best rule which attains this minimum probability of error is the same as the set A which attains this total variation distance. And if you remember, this set A is given by a set of those y for which W0 y is greater than W1 y. So, this is something very interesting. We have given an explicit formula for minimum probability of error under uniform prior.
And this formula entails the total variation distance, some quantity we saw earlier. It allows us to specify the optimal test, the one which minimizes this probability of error under uniform prior. And that is a very natural test. Whenever we see a Y which is better explained by W0 and W1. In other words, which is higher probability under W0 than W1, then we declared 0. And whenever we see a Y which has higher probability under W1 than W0, then we declare 1. This is the rule.
Furthermore, this particular probability of error formula here says something very nice and heuristic. It says that this probability of error, this probability for a P star, it is large when the distance between W0 and W1 is small. So, hypotheses which are close to each other are harder to distinguish. This is very much like Euclidean geometry or our day to day life, when we know that if you want to, if you want to understand the difference between two phenomenon which are very close to each other, then it will take a greater effort.
For instance, if you want to measure a very small distance, the width of your hair for instance, it will take a very fine scale; it will take a greater effort. Similarly, to distinguish if your underlying hypothesis were W0 and W1, it will take greater effort if W0 and W1 were close in total variation distance.
Well, this heuristic is some very basic heuristic about, heuristic statistic. And I would require you to, I will request you to go back and think about it and try to remember it throughout this course. Hypothesis which are close to each other in total variation distance are harder to distinguish because its probability of error under uniform prior is larger.
So some terminology. This function g is called a test because it tests if the hypothesis was 0 or 1 by looking at y. And the function g star y, which is just this function corresponding to this A star, this one declares 0 if W0 is less than W1 and otherwise it declares 1, is a Bayes optimal test or sometimes it is just called Bayes. So, it is a Bayes Optimal test. And this particular test is optimal for uniform prior because that's the one for which we got this answer. So, it is called a Bayes optimal test, optimal point uniform prior.
So, this is one kind of hypothesis testing problem where you had two hypotheses except carnality 2. Next, we can look at what is called the M-ary hypothesis testing problem. This M here is a number, just like binary, I am just calling it M-ary. Hypothesis testing problem with the cardinality of the set X, the unknown X is M. And you form a test g which looks at y and outputs x. This is again test.
So, it looks at y and tests to decide which of the x was true. So, it gives out an X hat. And probability of error can again, is again defined as probability that X hat is not equal to x. And the optimal test is the one which minimizes this probability of error. We will come to analysing M-ary hypothesis testing later. But I just wanted to formulate the M-ary hypothesis testing problem.
Just like we have binary hypothesis testing problem, we have an M-ary hypothesis testing problem. So, to think of an example of this problem, you can think of basic digital communication. Think of modulation schemes, amplitude modulation scheme or QPSK, scheme where you send a symbol and the symbol gets corrupted by noise. And that is what you observe.
And by looking at it, this noisy version of the symbol, you would like to recover the symbol back, what was the symbol that was sent. Now, in this case, the unknown hypothesis, the unknown hypothesis class is this set of all different symbols that you could have sent. And this M corresponds to a number of different symbols in your constellation.
For instance, 16 qualm used in used popularly in LTE will have 16 symbols. And you observe Y which is a noisy version of the symbol set and you would like to determine which symbol was sent. That is an (())(18:43) M-ary that hypothesis testing problems. So, we saw a binary hypothesis testing problem and an M-ary hypothesis testing problem. And we looked at Bayesian versions of this problem where a distribution was put an X, the unknown X.
So, hypothesis testing is one kind of statistical problem where you have a hypothesis and you observe some random variable whose distribution depends on the unknown hypothesis. And you want to determine the unknown hypothesis. And you impose a probability of error criterion. And the goodness of your test, the so-called test will be determined by how small the probability of error is.
Another kind, another kind of of problem looked at in statistics is what is called the estimation problem. For estimation problem X need not be a discrete random variable. The set X need not be discrete, it can be continuous. But for simplicity, we will just mostly look at a discrete X. So, you note that this criterion here that we were looking at for our hypothesis testing problem probability that X hat is not equal to X.
You can view this as probability under PXY of indicator function, expectation under PXY of indicator function X hat, not equal to X. Remember this X hat is a function of Y. For a general X, we can consider an arbitrary loss function. This can be thought of as a loss function. This is the loss that you incur when you declare X instead of X hat instead of X. And this is the expected loss.
But why are we considering only this indicator function as loss function. This indicator function is very well suited for hypothesis testing. But in general, we can consider some other loss functions. So, what is a loss function, like here is a generic interpretation of a loss function. When your true hypothesis is x or your true unknown x the unknown x is x. But you declare some explain x prime, then you incur a loss which is a l of x, x prime. So, it is a positive, all we are telling about this is it is a non-negative number.
And we are interested in expected loss. This capital L under rule g and average using PX and it is given by this. If the expected value of loss between g y; that is what you declare and x the input that you have. This small notation, I use. Typically, loss functions that we will encounter in this course are symmetric, l of g y, x equal to l of x, g y But here just for convention I am putting the output g y first and then the unknown x.
So, what is the optimal loss, what is the minimum loss, well it is the one which is minimized by all your allowed tests or your allowed estimates g. Now, I should think of g as some estimate of X that I formed by looking at Y. So, in this case, a good example of this problem is what you can think what is called the regression problem.
So, you observe one quantity like temperature in Bangalore and from that you want to predict another quantity or estimate another quantity like temperature in Mysore. Or many other such things. And they will fall in the category of an estimation problem. And we are looking for the best rule g which minimizes this loss function. So, this minimum loss function is what we call L star of PX. And this PX is the distribution of the unknown X. So, we assume for now that this PX is fixed.
A popular loss function for d-dimensional vectors. So, when X equals to Rd, is this Euclidean loss function which is just the Euclidean distance squared, it is called the mean squared loss. It is summation i summation i equal to 1 to d, the distance the difference of xi minus x prime i whole square. Just a distance squared between these two points. So, this L g, PX is called the mean squared loss.
And the quantity L star PX in this case is called the; so what is L star PX, it is the minimum error that can be incurred when you measure loss according to this. So, in this case, this error has a very popular name, it is called the Minimum Mean Squared Error, MMSE. And we often look for estimates which are MMSE estimates. The estimates which minimize this mean squared error.
So, to conclude this video, we have we have seen three different statistical problem. The first one was that of binary hypothesis testing where there were two unknown values X 0 or 1, at hypothesis 0 and hypothesis 1. And you observe a Y which under 0 had a distribution W0 and under 1 had a distribution W1. And we wanted to figure out if the unknown X was 0 or 1. This is binary hypothesis testing problem because X can take two values. We also quickly define this generalization to M-ary hypothesis testing problem where the unknown X can take M values. This M can be 16, some number basically. And then finally, we looked at a statistical, we looked at a statistical problem estimation, where the unknown X can take not perhaps not even discrete many values but continuously many values. And the goal is to estimate X by looking at Y. And we, it and the goodness of our estimate will be determined by a loss function. A popular loss function, for example is the mean squared error loss function. In all these formulations, we are always putting a distribution on X which we are denoting as PX. This is the prior that we put on the unknown.
Now that we have seen all the basic formulations, I will give some examples to help you, to give you some concrete formulations for this, the general problem that being outlined in the previous lecture. So, the first example we consider is perhaps the most popular one in binary hypothesis testing. In this case, a unknown x can take two values; either s or minus s and what we observe is y. So, I am calling this thing x because this is the unknown and I observe y, and from y I would like to infer something about x.
So, this goes back to the basic question that we started with, namely how much information that does y reveal about x. So, there are two hypotheses we have, there are two possible hypotheses. First is that y is distributed as a Gaussian random variable, so this is a Gaussian random variable or a standard or a normal random variable. This is a Gaussian random variable with mean minus s and variance sigma square, that is my notation. This is the mean and this is the variance of my Gaussian random variable.
So, under H0 this mean is s, and under H1 this mean is, under H0 it is minus s and under H1 it is s. And the variance is same under both the hypotheses, sigma square. So, as we saw earlier, if we look at uniform prior and look at a Bayesian formulation, we have first s equal to, x equal to minus s or s is selected with probability half each and then you observe Y from this distribution in that formulation. The minimum probability of error that you can have is half into 1 minus total variation distance, this quantity here.
And this total variation distance is the total variation distance between the distribution Y and the minus s and distribution y under s, these are these two distributions. So, what is the total variation distance? It is the probability that if you recall, the total variation distance d minus W minus s W s. So, I will write it here, d P, Q has various expressions, the one’s I am using here is it is P of A star minus Q of A star, where A star is the set of those values for which P x exceeds Q x, that is the definition we had seen for the discrete case.
For the continuous case, this is the set of those values for which the density of Pth f, f is the density of P, is greater than the density of Q, which is g, let us say. So, for this particular example, denoting by W, so this seems slight W’s notation, W minus s here denotes the density of Y under hypothesis minus s and density of Y under s, so this is P of A star minus Q of A star. Pictorially, you can pictorially see this set, this green curve represents W s, so this is the Gaussian distribution with mean s. This blue curve represent W minus s, Gaussian distribution mean minus s.
And this set where W minus s exceeds W s is the left side of 0, you can see that in this part W minus s exceeds W s. And this set is the right side of 0, so we can exactly identify these sets, so this set here is equivalent to y that y is less than or equal to 0 and this set here is equivalent to y such that y is greater than 0. Where you put 0 does not matter, it has zero probability because the continuous random variable.
So, then what is the probability of A star under W minus s, this probability is given by, you can check this, so we can write this Y, this the probability that random variable Y is less than equal to 0 or less than 0. And this random variable Y is Z plus Z minus s, where Z is a standard Gaussian random, Z is a Gaussian random variable with mean 0 and variance sigma square. So Y was Gaussian random variable with mean under W minus S, Y is a Gaussian random variable with mean minus s in variance sigma square, so I took out that minus s, so I get this Z.
So, this first part of the probability is just 1 minus probability that Z, a Gaussian random variable with mean 0 and variance sigma square exceeds s. This, so this probability here is typically denoted by this Q function, this Q of t is simply the right tail of a Gaussian distribution. So, Q of t if you, if this is the Gaussian density is this area in the right part, that is what Q of t is.
And therefore, you can express this probability as Q of s by sigma, so you divide by sigma because this, this is the in this notation Q function corresponds to standard normal random variable but we are looking at a Gaussian random variable, normal random variable with variance sigma square, so we divide by sigma and then it is the same thing. So s by sigma, Q of s by sigma, this is how the Q function is defined. Similarly, W s of A star is equal to Q of s by sigma as well. So this is 1 minus Q of s by sigma, this is Q by s by sigma, you can verify this.
Therefore, this total variation distance between Gaussian with mean minus s and mean s is 1 minus 2 times Q of s by sigma. Therefore, the probability of error is 1 minus total direction distance by 2, which is Q of s by sigma, that is the probability of error. This expression is sort of very popular, it comes up in many places. For instance, communication engineers may have seen it in analysis of probability of error for BPSK scheme in presence of Gaussian noise.
So, this P e star uniform is Q of s by sigma and there is a very nice approximation for the Q function, which is also easy to remember, it just looks like the Gaussian density itself. So, Q of s by sigma is approximately equal to, you can find precise statement here but I am just giving a rough statement, it is approximately equal to e to the power minus s square by sigma square.
So, this approximation is useful to remember, Q of t the right tail of a Gaussian random variable, this is called the right tail, is approximately equal to e to the power minus t square by 2. So, it goes to 0 very fast as fast as density of Gaussian. So, this one is just this much. So this the probability of error for, this is the best probability of error for uniform try that you can attain for the first example.
So, let us look at another example now, the second example. Now it is very similar to the first example, but now x and y are vectors, the dimensional vectors and we assume that under both, so x and y the dimensional vector and y is R d as before and x is not a single value but x can be any vector here. So, this is an estimation problem.
And given x, the distribution of y is this normal distribution with mean x, remember this is the vector and variance sigma square I. So, again, y is a gaussian with mean x and variance sigma square I. Our goal is to estimate x by looking at y. So, you can consider this very simple estimator, this is like a sort of a naïve estimator. It just declares y, when it sees y it just declares y is estimate for x. So you can check that for this estimate the loss, the expected loss of this estimator g, given P x for any P x is d sigma square.
Remember, so by the way this loss we are going with is the mean square error loss, we define this in the previous lecture, mean square error loss expected, so maybe y z here. So what is the loss that we are computing it is the expected value of, so I observe y and I use my y to estimate x by using this function g. So, my estimate for x is just g of y and we do this error. And this expectation is over the distribution of x and then y given x is fixed to this channel.
So, then you can check that this guy, no matter what P x is, is exactly equal to for our choice of, so for our choice which is g y equals to y this is exactly equal to d sigma square. So, we saw one estimation problem, that mean that will be mean estimation by looking at a noisy version of it and proposed a very simple estimator and saw that what error performance, so what the performance we get.
So, this is another toy problem just to give you an example of an estimation problem. Let me now come to a very interesting problem, one of my favourite example for introducing hypothesis testing. This problem is that of testing the bias of coin. So here is a very simple version of the problem. How many coin tosses are needed to test if a coin is head heavy or tail heavy? So, what is a head coin? Head heavy coin is the one which shows heads with more probability than tails. Tail heavy coin is the one, which shows tails with more probability than heads. And an unbiased coin is the one which shows both heads and tails with equal probability.
So let me, let me try to articulate this problem. So here, it is a very simple problem to state, what I am saying here is that you have a coin and you want to figure out if it is head heavy or tail heavy, how many times you need to toss the coin to answer this question with good reliability? So, let us formulate this problem in our framework, the hypothesis testing framework that we have introduced the general fo