The Log-Likelihood Ratio Test So, till now we have been looking at hypothesis testing an estimation as examples of generic, as actually generic examples of estimation statistical inference problems. For hypothesis testing, what we introduced is a Bayesian formulation, where we have a prior a distribution before the experiment starts on the unknown. So, for the binary case, we know that H0 occurs with some probability P, let us say and H1 occurs with probability 1 minus P. And what we were able to do was that, using a simple calculation, we were able to show that for uniform prior, when you do not have any reason to prefer H0 over H1 to begin with, that is roughly what you can think of as a uniform prior, for uniform prior this optimal probability of error that you can attain is 1 minus, half into 1 minus total variation distance between the distribution under H0 and a distribution under H1. So, this test, the corresponding test was called Bayes Optimal test. And what you can see is that, is this simple form here, so the Bayes Optimal test declare 0 when the probability under 0, H0, is more than probability under 1 and it declares 1, when probability under 0 is less than probability under 1. This is an alternative form of the test that attained, attain this performance Pe star uniform, (())(01:48) was half into 1 minus total variation distance between W0, W1, attained by this test. Now, you can take a log of both of these, these ratios. And so equivalently you can express this test as, it will declare 0 the log likelihood ratio exceeds 0 and it will declare 1 if the log likelihood ratio is less than equal to 0. So in general, this form suggest the following form of tests which are very popular in practice, these are called log likelihood ratio test, thresholded log likelihood ratio test. So, here you will declare 0 if the log likelihood ratio of 0 versus 1 exceeds some threshold tau. This is some threshold tau, which is a feature of the test in the sense that you have to define it to use this test. And it will declare 1, if the log likelihood ratio of 0 by 1 is less than equal to tau. So, if you prefer W0 over W1, in this sense to declare 0, and if you prefer, if you do not prefer W0 over W1, then you declare 1. So, this is a, this is an important class of test which is used in practice. Now, how did this test come about or if or is there any theoretical guarantee for this test, that is what will show now. In fact, this class of test for different threshold are optimal for what is called the Neyman Pearson formulation. Let me describe that formulation. Before we do that, let us see how well these tests perform. So, we will denote this test by g tau, where tau is the threshold. So, under X 0, so when the, when the two distribution, so this is slightly using, when the true unknown hypothesis is 0, this is a binary hypothesis testing, then under that the distribution of observation is W0, so the probability of error under 0 is that firstly 0 occurs and then you declare 1, so you declare 1. And for any such y you declare 1, this will be an error. So, this is one part of the probability of error. The second part of the probability of error was that in fact, the true hypothesis was 1 but you ended up declaring 0. So, all such y’s, with the probability measured under 1 is the second part of probability of error. We just abbreviate this PX 1 as p, so the probability of error is 1 minus p into W0 of this test declaring 1, this, this so you measure this probability under W0. Time plus p, p is the probability 1, under W1 the probability that test declares 0. So, this is your probability of error. In fact, these two probability of errors here are of different style. So, this one where you declare 1 under 0 is typically called error given, this is the error given 0 this is typically called the type 1 error. And this is called the type 2 error, these are two different errors associated with the test. So, one way to think of hypothesis testing problem is that it is a decision making problem but the cost function is this pair, W0 verse comma W1. We have waited this with this p to come up with a single cost function, but the actual cost function is a pair. So, as tau increases, you can check, as a threshold increases, a type 1 error increases, but the type 2 error decreases. So as threshold increases, you decrease the number of point including 0, but you increase the number of point including 1. So, you can verify that, what happens is type 1 error increases, but type 2 error decreases. So, what Neyman Pearson, Neyman and Pearson did, they considered a slightly different formulation they (())(05:27) considered a slightly different criterion for optimality for a test, till now we were looking at average probability of error, the Bayesian formulation we looked at, had average probability of error. But here is another way to use this pair to define a cost function. So, we seek test, we seek test for which the error of type 1 is less than epsilon, error of type 1 is less than epsilon. So, this is the reasonable expectation from the vector set. No matter what the error of type 1 must be less than epsilon. Under this constraint, we seek the best test. So, what is the best test once you impose this constraint, you want to find a test that minimizes the error of type 2, so it minimizes the error of type 2. Subject to the constraint that error of type 1 is less than equal to epsilon. That is the optimal test. So, we call this quantity beta, epsilon W0, W1, so beta epsilon W0, W1. Now why beta, typically, in hypothesis testing, 1 minus epsilon is called alpha, it is sometimes called the power of the test, is it call it is called the size of the test, this is the power of the test. So, we are just calling in beta, it is just a convention. So, beta of epsilon W0, W1 is the minimum probability of error of type 2, this is the probability of error of type 2. Given that the probability of error of type 1 is less than equal to epsilon, that is what we are looking for, that is this beta epsilon W0, W1. So, why is this formulation, why does this formulation make sense, for many reasons. But here is a very simple heuristic reason. So, think of an application where H0 is a normal operation and H1 is an alarming situation. For example, H0 can be absence of a disease and H1 can be presence of a disease. So here, the terror of type 1 says that it was actually the disease was not present, but you declare that the disease is present. So, this is a false alarm; it is a less severe kind of error, so you do not want to declare false alarms. However, the error of type 2 is a missed detection, the disease was present, but you did not find it, this is a very severe error. And therefore, it makes sense to ask for a nominal guarantee of a small false alarm, while minimizing the missed detection problem, that is what Neyman Pearson formulation does. So, let us evaluate a threshold test, the threshold test gt that we defined earlier, the threshold test on log likelihood ratio for this setting. So, what is the probability of false alarm, it is given by this. So, if you substitute for gt y as the log likelihood ratio test, so it will declare 1 if this log likelihood ratio is small. So, probability of error of type 1, the false alarm probability is probability under W0 of those y's for which log of W0 by W1 is less than equal to tau. Similarly, probability of missed detection is W1 of y, of those y’s for which W0 by W1, log of that is greater than tau. By the way we can, so let us try to derive a bound for this probability of missed detection. So, here is what we can do, we can express this W1 as W1 by W0 into W0. So, this is sometimes called a measure change argument, the probability measure under which you wanted to measure things was W1, but you shifted it to W0. And there is a correction factor that you need W1 by W0 here. So, how does it help you, well this condition here tells you that W0 by W1 is more than tau, for all we are only summing over those y's. So, over all these y's, actually, you can see that this ratio, which is just 2 to the power minus log of W0 by W1, this ratio here is less than equal, is less than tau, you can see that. So, this is greater than tau, so therefore minus of this is less than tau. Therefore, this probability is less than summation over y, 2 to the power minus tau times W0 of y. So, we made a measure change argument, we were first measuring the probability of the set under W1, we shifted it to a probability W0 at this cost of this additional factor 2 to the power minus t. In fact, this submission is over the same set, it is over the same set. However, if we remove this restriction, it will only increase the probability, so I can remove this restriction. And so, this is not equality, it is only an upper bound. But, what can you say about submission over y, W0 of y, well that is 1 that is at most 1. That is exactly equal to 1, since W0 is a probability measure it is a probability distribution, it is a PMF. And therefore, the upper bound that you get for probability of missed detection for this threshold test with threshold tau is just 2 to the power minus tau, very simple. So, for this threshold test, the probability of missed detection is no more than 2 to the power minus tau. So, what have we shown, we have obtained a bound for beta epsilon. What we have shown is that, suppose you can find a lambda that satisfies this property. So this is much like the source coding theorem that we were seeing earlier, this lambda what it satisfices that under W0, the probability of log likelihood ratio W0 by W1 exceeding tau is at least 1 minus epsilon, this will ensure that, so this will ensure that the probability of error of type 1 is, so this is a probability of error of type 1, we want this to be less than epsilon. So, 1 minus this, the complement of this event is where this is greater than tau. And we want that even to have probability greater than 1 minus epsilon. So that is the condition that we impose here, suppose this condition holds, then the probability of missed detection is less than 2 to the power minus tau. And therefore, beta epsilon, which is the smallest probability of type 2, given that the test satisfy this constraint must be less than this, this is the smallest probability and giving you one test, which satisfies this probability of error guarantee, therefore the smallest can only be smaller. So, we have shown this, this is you can think about how the argument I just went through, how we have shown this. So, what is our test again, the threshold test. But we have additionally assume something about the distributions, so we have assume that this tau we have chosen in a very specific way, we assume that this tau is a large probability, this is my large probability lower bound on the log likelihood ratio, log likelihood ratio. So, this thing is called the log likelihood ratio, its likelihood is this guy, this is the likelihood ratio and this is the log of it, so Log Likelihood Ratio, LLR sometimes. And, so we have seen this very simple formulation. So, we have a handle over beta epsilon W0, W1 and it is at most 2 to the power minus tau. So, now let us look at the case of iid, this is the general result, let us look at the case of iid observations. So, when iid observations are there, so we can find such an estimate, all we have to find is an estimate for this tau to get this upper bound. So, if W0 and W1 are iid repetitions, as was a in a coin toss example, W0 will be this product distribution and W1 will be some other product distribution under q, maybe I should not use small q here, I have been using capital. I have been using capital P and Q. So, P this, P this and Q this. So, the ratio W0 by W1 is product these two, they show these two products. And there is a log, so the log makes product sums, so again make it capital P and Q. So, what is a good estimate for this guy, well we can use Chebyshev’s inequality, the good, a good estimate for this guy is expected value of this quantity plus some square root n times variance of this quantity. So, n times expected value plus variance. This is just Chebyshev’s inequality, since these guys are independent, this variance is just sum of variances. That is important property we saw earlier. This is what we have been doing right through. We have been finding estimates of log of 1 by, remember this part, we have been finding estimates of log of 1 by P Xn, iid random variables. It is a similar calculation. That is how we found entropy. So, what this calculation suggests is, there is something special about this expectation, just like we discovered entropy as expectation of log of 1 by P. Now, this log of, the expectation of log of likelihood ratio, by the way, this expectation is under P and this is very important because we were measuring this probability under W0 and W0 was under, in the numerator here. So, this expectation is under P. So, this quantity basically gives you an upper bound on beta epsilon W0, W1 because we have an estimate for this large probability of upper bound on log likelihood ratio. So, this for iid case, the upper bound that we have is 2 to the power minus n. The expected value under W0 of log of W0 y. Let me make some space for myself. So, expected value under W0 of log of W0 y, this y has distribution W0 y, W1 y and n times this. So, this probability of error goes to 0 very fast in, it is exponentially then and the exponent is this guy. And this guy has a name, this guy is called the Kullback-Leibler divergence. And the fact that this probability of error can go to 0 very fast, has a lot of interest.
Log in to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Sign up to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Please enter you email address and we will mail you a link to reset your password.