Alison's New App is now available on iOS and Android! Download Now
Kullback-Leibler Divergence and Stein’s Lemma
So, last time in the previous video, previous lecture, what we saw was that the, we saw a new formulation which we were calling the Neyman Pearson formulation, where was, where our goal was to find a test, which minimize the probability of error of type 2, while the probability of error of type 1, the missed detection, the missed and the false alarm probability is kept below some nominal epsilon.
And what we saw was, so we call this minimum probability of error of type 2, the minimum probability of error missed detection, we denoted this by this beta epsilon W0, W1. And we saw a very simple calculation, very simple, but actually quite useful and very powerful, which shows that if you have a threshold tau, which is a large probability upper bound on the log likelihood ratio W0 by W1 and large probability under W0, then a threshold test with this threshold has probability of error of type 2 less than 2 to the power minus tau.
In particular, for the case where these distributions were iid, what we saw was that tau, the probability of error of type 2 can be made to go to 0 very fast, as fast as roughly 2 to the power minus n times a quantity which was some expected value of log likelihood ratio. So, it can be made to go to 0 exponentially fast, exponentially in n and the exponent had a very simple formula.
So this exponent, which we discovered last time is called the Kullback-Leibler Divergence. It is a, the it is the, it is an alternative notion of distance between distributions, which will complement a definition of total variation distance. And this phenomenon that the probability of error can go to 0 exponentially rapidly is related to a very popular result called Stein’slemma. So, today in this lecture, what I will do, is I will discuss this Kullback-Leibler divergence and I will present Stein's lemma. So, Stein's lemma is a very important result for hypothesis testing as well as information theory. So, this is the quantity we saw last time, we saw that expected value of P of log, so here, by the way, some small confusion about this notation, I often switch to small p and small q. Basically, these are my proxy for densities which, which is just PMF in this case, so no difference between p, q and it is not much difference, and t is the distribution. So, I switch to small when I am thinking of densities, but I will not be consistent with this notation. So, please remain active and fix this notation in your head all the time.
So, the quantity Ep, this guy here, the expected value of log likelihood ratio, under p is called the Kullback-Leibler divergence, D P Q I will denote this by D P Q. So, just like small D, P Q it was total variation distance. This is the second distance that we will look at, we will just call it divergence for short, but it was used in statistics by Kullback and Leibler and it is called Kullback-Leibler divergence, KL divergence for short.
And I am not using a comma here, I am using this part this, this function here, a pipe here, mainly because a double pipe here, because it is not symmetric in P and Q, you can check that this definition is not symmetric in P and Q. If you expand this expected value, this will total, the Kullback-Leibler divergence will look like this.
It is summation over y, p y log of p y by q y. And it is the counter part of total variation distance, as I say, in the Neyman Pearson formulation, just like total variation, distance governs the probability of error in Bayesian formulation, at least for uniform prior, this Kullback-Leibler divergence controls the probability of error in the Neyman Pearson formulation.
What we show last time was that this probability of error can be smaller than 2 to the power minus n D P Q, that is what we showed last time. Well you can divide by n and take a limit, this limit is roughly defined. If you know some discussion on formal notion of limits, we have to be careful in defining this limit. But I assure you that this limit exists, and this limit exists. And in fact, what we have shown is that the lim inf, if you know what that means is bounded by this guy, no problem. But we can just think of this is limit, it is, I do not want to be so pedantic in this part.
So, as n goes to infinity, this can be made to, so this goes to 0 very fast, exponentially fast and the exponent can be made larger than D P Q, there is a minus sign here. Therefore, this is an upper bound on beta epsilon P and Q. So, this is the following statement, this is a very cute handle over the exponent. But is this any good, is this bound any good, we are happy that this is exponentially falling it can be made to fall exponentially rapidly, but can we improve over this D P Q.
And this is the content of Stein's lemma, which says no, you cannot improve. In fact, what it says is for iid distribution, this is exact; the best you can do is D P Q, so I have not shown this, so when, I have not shown the other side of this inequality. A quick remark on some proof whenever you show, whenever you see equality like this, you should imagine two inequalities hidden there, this one and this one, both hold and therefore equality holds, we have already shown one side of it, and this is the other side of it.
In computer science and related areas, these things are called lower bound and upper bound. An impossibility result will be called a lower bound in a scheme which achieves something is called an upper bound typically. But as you can see, in this example, this notion of lower bound and upper bound can flip based on application. So, here it is a scheme which gives a lower bound.
So, in information theory, we will call these things as achievability and converse results. So, what you, so achievability is not a real word but it is used quite commonly in information theory. So, so we have shown that a scheme can get you here and the second part which we have not shown will show later in the course, is say that no other scheme can do better than this. So indeed, this log likelihood ratio test with threshold being roughly n times D P Q there was a slightly, there was a correction it was this plus, I think, plus or minus (())(06:41) remember that part, something like that, some variance.
And there was some independence also there, which we had from Chebyshev’s 1 by epsilon, so therefore this is correct. So, this is roughly what we chose the threshold to be. And then it gives you D P Q, 2 to the power minus n D P Q is the rate at which probability of error can be made to go to 0 for this choice of threshold. And, what we are showing here is that nothing can beat this performance. We are claiming, we are not showing, it will show it to you. So, the largest exponential decay rate of beta epsilon Pn, Qn is D P Q and it is attained by a threshold test.
So, this D P Q is a very interesting quantity, this total variation distance, it has similar, this Kullback-Leibler divergence, it has similar interpretation as total variation distance. So, what do I mean by that, if D P Q is small, the hypothesis P and Q are difficult to distinguish. So, we can just treat, this is a heuristic statement, I just made. So, if D P Q is small, the hypothesis P and Q are difficult to distinguish. Just like we thought when total variation distance is small, hypothesis P and Q are difficult to distinguish.
Similarly, when Kullback-Leibler divergence is small, this hypothesis are difficult to distinguish. Stein's lemma gives an asymptotic justification of this fact. Stein's lemma shows us that this is indeed true because asymptotically the probability of error is roughly 2 to the power minus n D P Q n times 2 to the power minus n times Kullback-Leibler divergence. So, a small Kullback-Leibler divergence, let us say a Kullback-Leibler divergence is of the order 1 by n means that you cannot have small probability of error.
But later, we will see another justification, which will also instantiate this principle for a fixed n. So, but I want you to remember this principle, the difficulty of a binary hypothesis testing problem is related to the distance between P and Q, this distance can be measured most formally by total variation distance, but we will see other versions later, where you can also introduce other notions of distances, which are not really distances in true sense, like Kullback-Leibler divergence, which we saw now.
So, if P and Q are close in any of these distances, the hypothesis testing problem is hard to solve. It will take more samples to solve it, if you are thinking of independent samples or it will be harder to solve. So, let us take an example, let us go back to our example of coin toss and try to compute D P Q.
So, the total variation, look at this particular choice of P and Q, 1 where the P is Bernoulli half and Q is Bernoulli half plus epsilon. So, for this particular case, half plus epsilon by 2, for this particular case, total variation distance is epsilon by 2. But Kullback-Leibler divergence is half of log 1 by 1 plus epsilon you can check this plus half of log 1 minus epsilon, so it is roughly half of log 1 minus epsilon square. In fact, you can replace this, replace this log with natural logarithm by an additional factor of ln 2, so it is, this guy here.
So claim, now, we make is that this ln of 1 minus epsilon square is actually greater than epsilon square. Can you show this, the easy way to show it is to plot the graph and show it. So, what is this, this is equivalent to showing that log of 1 minus epsilon square minus sign is here, is less than equal to minus epsilon square. So, log 1 minus x, is less than equal to minus x, at least for non-negative x, x between 0 and 1.
So, what is log 1 minus x, how do we show this. You can try to plot the functions of, you can try to plot these functions, and then try to show this. But this is a very important inequality, I am not showing it now I am leaving, I will give you as a homework exercise to show several such an inequalities actually, you should be comfortable showing it So, here is something which you should remember, ln 1 plus x is less than equal to x for all x that are greater than minus 1.
So, this is the inequality which we have used here. But epsilon square was roughly total variation distance square. So, for this example, this D P Q, the total variation distance, the Kullback-Leibler divergence is square of the total variation distance. And in fact, this kind of relation holds in general, we will see that later.
Before I proceed, I would like to, so I kept most of my discussion about discrete distributions, but it is easy to extend all these notions to continuous distribution, in particular those with density. So, suppose P and Q have densities f and g when you can define D P Q, the Kullback-Leibler divergence as the expected value of log of ratio of densities, this, the theory for discrete and continuous distribution can be very different.
But there is a common ground for both theories, you can instead of densities and PMF, you can work with ratio of PMF, and ratio of densities, and ratio of densities can always replace ratio of PMF to come up with such a theory, that is a very high level description of formal probability. So, that is what we have done here. So, this is log of ratio of density, and it is an expectation but it can be written as this integral also because P has density f with respect to this (()(12:40) measure. So, this distribution serves the same purpose as for the discrete case, this definition of total, Kullback-Leibler divergence serves the same purpose.
So, you can actually try to see if you can recover some of those results for continuous case using this definition. In fact, this definition of Kullback-Leibler divergence, and the discrete distribution version can both be recovered from a more general definition, which perhaps I will allude to later in the course.
So, the log likelihood ratio test g tau can now be replaced with this kind of log likelihood ratio test. So, this is for the iid case, where it becomes the log likelihood ratio of Pn and Qn, so if P has density f, its iid copies have densities, f to the power n. And this one will have g to the power n, so the log likelihood ratio is just summation log of f by g, of xi. So instead of ratio of PMF, it is ratio of density is evaluated at xi. And we compute this and so these are sometimes called score functions. So, log likelihood ratios, these are called score function.
So, we compute these sum of score functions and check if the score exceeds some threshold tau or not. And a good choice of tau is n times D P Q which gives times lemma. So, in this part, I just quickly showed you how this definition of Kullback-Leibler divergence can be extended to distributions with densities, like costs in distributions, and how even for this extended definition, similar results will hold.
I just alluded to it, I gave you the threshold, the corresponding threshold test to recover Stein's lemma for this distribution with intensities. But for the most part of this, for most of this course, I will focus on discrete distributions, but I will give this kind of side remarks here and there, so that you can go back and convert all the results to distributions with densities, or maybe even more general distributions based on how much probability you know. But for discrete distribution, you should be able to follow everything in the class.
Properties of KL divergence
In the last video for this week, I would like to present some basic properties of this Kullback– Leibler divergence, a quantity we discovered in the previous video, previous two videos actually. And it is a very important quantity, one of the most important ones in Information Theory. Later we will see many of these, many of other information quantities will be related to Kullback– Leibler divergence, including our favorite entropy.
So, in this short video I will only present the properties without proof and later we will go, later when we study properties of information quantities, we will give you, I will try to, I will give you all these proofs. So, first property of Kullback–Leibler divergence is the data processing, is called the data processing inequality. In fact, any quantity which promises to measure distance between distributions must satisfy data processing inequality, this is a reasonable expectation.
So, what is the data processing inequality, it is very simple it says, distances between distribution decreases when you further process their samples. So, when you further process samples by applying any randomized mapping to the samples, by passing them to the same channel, the distributions, the underlying distributions will come closer.
So, any down line processing will actually make the hypothesis testing problem harder because the (())(01:47). So, what does it mean, it sounds a little bit counterintuitive because signal processing engineers would always process the signal before they apply some test to it. So, is that processing only making that problem harder, in fact not, the purpose of signal processing is something else. The purpose of signal processing is to look for this transformations which minimally reduce the distances, while making the problem of hypothesis testing computationally easier, this is a heuristic reason I am giving for signal processing, it is a broader field.
But this heuristic is somewhat surprising to begin with, so you should take some time to digest it. Any processing on signals which is independent of the underlying distribution will make will make the distributions closer. So, this agrees with our heuristic that these distances determine how difficult it is to test between two distributions, since we applied test to process samples. So, let me formalize this data processing inequality.
Let P and Q be two distributions on Y, so this P and Q are two distributions on the same alphabet Y. And consider this channel W this represents our down link, which represents the processing so what W does is looks at Y and produces something from this alphabet Z it is a fixed channel representing a data processing operation.
So, this data processing for instance can be that I add some random noise to the sample. So, if it is a continuous observation, I add just some Gaussian noise to it. So, then we can show that both total variation distance and Kullback–Leibler divergence satisfy this, these inequalities, these are called the Data Processing Inequalities, sometimes DPI for short.
The distance between W of P the output distribution when P is passed through W and the distance between W of Q which is the output distribution when Q is passed to the channel. So this W of P remember is a distribution on Z, this is given by summation over Y W of Z given Y P Y that is our definition of W0 W of P. So, W of P and W of Q are closer than D P Q. You can show this by just triangular equality.
And similarly, under Kullback–Leibler divergence as well W of P and W of Q are closer than D P Q, so any channel which will add, so channel can be thought of as a noise. This channel does not depend on P and Q, so it does not help you distinguish these P and Q right away and it will make these things closer. Operationally also you can verify these inequalities, we have already shown operational meaning of D P Q in terms of the expression for P star. So, any Bayesian test, so it is the, the minimum probability of error over all Bayesian test is related to D P Q.
Now, consider those tests where you first pass P through W and Q through W and then use the optimal test for W of P versus W of Q. The probability of error for this class of test can only be smaller, can only be larger than the probability of error if you work with P and Q directly because this allows more tests, it does not force you to pass through this W. And therefore D, this D can only be smaller than this.
But these are very easy to remember, these inequalities will give justification for them again, this just says that distances shrink when you process the data, these are called data processing inequalities and processing the data here refers to passing through the same channel W. This is the first property.
The second property which is very useful is, is the one which relates Kullback–Leibler divergence to total variation distance. Earlier in an example for binary case we saw that D P Q is at least 2 by ln 2 total variation distance square. In fact this bound holds for every distribution, so this bound this is called Pinsker’s inequality, it relates total variation distance to Kullback– Leibler divergence and if you ignore the constants of 2 l, 2 by ln2 which I make mistakes with all the time, if you ignore this constant what this says is that D P Q is more than total variation distance square.
So, D P Q you can think of it roughly as total variation distance square. So, D P Q behaves roughly the same as square of distance, so it is roughly like a square root distance. So, what is special about square of distances is in Euclidean space. So, here is the Euclidean distance summation over i 1 to n xi minus yi square this is a Euclidean distance between two n dimensional points, maybe I should call it d, so that you think of d dimensional points. This is a distance square; Euclidean distance is the square root of this.
What is special about the square of distance? Do we know what is special about this, so you can think about it. There one thing, there are many things such as special about it, but this is additive across dimensions. So, if you have multiple dimension this is additive. And in fact, Kullback– Leibler divergence has a similar behavior, it is additive for product distributions.
So, earlier we saw that total variation distance is sub-additive but Kullback–Leibler divergence is additive, so this is a very, very important property, it is the additivity of Kullback–Leibler divergence for product distribution. So, P1 these are product distribution P1 time, P2 time, Pn all coordinates are independent and ith coordinator distribution Pi, all coordinates are independent it is the second part and ith coordinate has distribution Qi. What is the Kullback–Leibler diverges between these two product distribution, well it is the sum of coordinate wise Kullback–Leibler divergence.
So, this is the activity of Kullback–Leibler divergence. So, three properties we have seen, Kullback–Leibler divergence as well as total variation distance satisfied data processing inequality. This Pinsker’s inequality which relates Kullback–Leibler divergence to total variation distance square. And then Kullback–Leibler divergence is additive unlike total variation distance which was sub-additive. So, if you combine these two properties there is something interesting which comes out. Earlier we said that Pn, Qn D the total variation distance between Pn and Qn grows like n, that is what we were able to show.
But we (())(08:03) that bound is weak. In fact if you combine Pinsker’s inequality with the additivity of Kullback–Leibler divergence you can show that total variation distance of Pn and Qn grows roughly like square root n, you can try to show it, I will show it later in the course. And basically, that bound that we had earlier for coin toss problem, how many coin tosses are needed to distinguish a coin with bias half plus epsilon from half minus epsilon, that bound can be improved to roughly 1 by epsilon square by combining this Pinsker’s inequality and additivity of total variation distance, additivity of Kullback–Leibler divergence.
This is the last thing I wanted to present this week. And in the next week we will continue with our discussion on connection between statistics and information theory, we will, in this week we basically defined different versions of hypothesis testing problem, we mostly focus on hypothesis testing, although we defined the estimation problem as well.
And for the first version which was Bayesian we discovered that the probability of error relates to this quantity called total variation distance. For the second one the Neyman–Pearson formulation, we saw that the probability of error is related to this quantity called Kullback– Leibler divergence D, capital D P Q.
And one principle that we projected was that hardness of a hypothesis testing problem, a binary hypothesis testing problem is related to the distance between the two distributions under two hypothesis and this distance can be Kullback–Leibler divergence for one formulation or total variation distance for another formulation. In fact, the two are related by Pinsker’s inequality.
Log in to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Sign up to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Please enter you email address and we will mail you a link to reset your password.