Module 1: Uncertainty, Compression and Entropy

Mathematical Measure of Uncertainty

We will start by discussing what is called a source model. Last time we decided that we will use probability as a means for modelling uncertainty, so we can now return to that goal and try to answer the question: how much information is revealed when a question is answered?
Recall the example that we saw last time. I am trying to compose an email on an email client, and I write something like: Dear all, there will be no information theory class today. A makeup class will be held on Tuesday. Then I type just the word "sorry", and my email client tries to complete the sentence for me, suggesting something like "for the confusion".
Of course, as we discussed, this is not the only possibility the email client may have in mind. It can have several other completions in mind: sorry for the confusion, sorry for the inconvenience, and perhaps others. What we know is that, using some rule, the email client chooses one of these options and reveals it to us.
So how does the email client model this? In this case I use Gmail, so let us ask how the Gmail server might model it. One simple approach it can follow is to model this unknown phrase using a random variable: it can assume that the user, which is me here, is generating these outputs from a probability distribution. We will call such a user, who is generating this information, a source; you will hear the word "source" many times in this course, since it is the source of information.
So this user can be thought of as a source, and the source generates outputs according to some probability distribution, which we capture by a probability mass function P on a finite alphabet denoted by the calligraphic X. We have options in modelling: we can either treat each word as a symbol from this alphabet X, or we can think of the entire phrase as a single symbol from X.
These choices will result in very different engineering implementations. However, for the theoretical treatment in this course, we do not worry about the semantic mapping of abstract symbols from an alphabet to the actual data, and we will simply regard the data as coming from this alphabet calligraphic X. So a source, which is a generator of information, generates a random variable X distributed according to a pmf P on a finite alphabet X.
As we saw in the previous example, a source does not generate just a single symbol; it can generate a sequence of symbols, for instance the words in a sentence, if you think of words as symbols. So we will think of a source as generating the sequence X1, X2, ..., Xn, which we abbreviate as Xt, where the index t runs from 1 to n.
The source model is specified by the assumptions we make on the joint distribution of this sequence X1 to Xn. Since X1 to Xn is a finite-length sequence of random variables, you can think of that finite-length sequence itself as a single random variable.
In this course we will look at various classes of sources. For the most part we will look at what is called a memoryless source. A memoryless source is a very simple probabilistic model, where we assume that the sequence Xt, for t from 1 to n, comprises independent and identically distributed random variables.
This is a phrase that will come up again and again, so we abbreviate it: i for independent, i for identically, d for distributed, giving iid random variables. For the most part we look at memoryless sources, which can be modelled using iid random variables.
If you do not remember what independence is and what independent random variables are, here is a very quick review, although I urge you to go back and revise these things.
Two random variables X and Y are independent if their joint distribution has a product form. In the discrete case, the probability that X takes the value x and Y takes the value y equals the probability that X takes the value x times the probability that Y takes the value y, that is, P(X = x, Y = y) = P(X = x) P(Y = y).
Such random variables are called independent random variables. It follows that, for independent random variables, the expected value of XY equals the expected value of X times the expected value of Y. There is a common misconception here, and I assume everyone in this class has studied probability carefully enough to know that it is not so: people think that if the expected value of XY equals the expected value of X times the expected value of Y, then the random variables are independent.
In fact, such random variables are merely uncorrelated. Uncorrelatedness does not imply, for instance, that the expected value of X squared times Y squared equals the expected value of X squared times the expected value of Y squared. So do revise the concept of independent random variables and the concept of uncorrelated random variables. For most of the models we look at in this course, we will assume that the source is memoryless and that the random variables Xt are independent and identically distributed.
What do we mean by identically distributed? Take two of these random variables, say X1 and X2. What is the probability that X1 equals x1 and X2 equals x2? Since they are independent, it is the product of the two probabilities, P(X1 = x1) times P(X2 = x2). And since they are identically distributed, this also equals PX(x1) times PX(x2), because all of them share the same common distribution PX. That is what independent and identically distributed random variables are.
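As a small illustration, here is a minimal sketch of a memoryless source in Python; the four-word alphabet and its pmf are made up purely for the example. Each output is drawn from the same fixed distribution, independently of everything drawn before, so the probability of a whole sequence factorizes into a product of single-symbol probabilities.

    import numpy as np

    # Hypothetical alphabet and pmf, for illustration only.
    alphabet = ["sorry", "for", "the", "confusion"]
    pmf = [0.4, 0.3, 0.2, 0.1]
    rng = np.random.default_rng(0)

    def memoryless_source(n):
        # iid draws: each symbol comes from the same pmf, independent of the past.
        return list(rng.choice(alphabet, size=n, p=pmf))

    sample = memoryless_source(5)
    prob = {a: p for a, p in zip(alphabet, pmf)}
    # Probability of the whole sequence = product of per-symbol probabilities.
    seq_prob = np.prod([prob[s] for s in sample])
    print(sample, seq_prob)
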
This is a very simple and somewhat unrealistic model. What does it say? It says that every time you see a word in the sentence, it has been drawn independently from a bag according to a fixed distribution. It is more like a monkey typing on a typewriter than a human speaking, because humans presumably keep some memory, and the words will not be independent of one another.
For instance, if you are speaking English, the language has structure: a sentence has a subject, a verb somewhere, and when you hear a word you can often already guess what the next word will be. So the words are probably not independent random variables. What is very surprising is that, in spite of being such a crude model, the memoryless source admits a rich class of compression algorithms, and just by using these models a lot of engineering innovations have been made. It is a very reasonable class of models, and it is very well suited for theoretical analysis.
A richer class of models is the Markov source of order m. In general, the conditional probability of the next symbol Xt may depend on the entire past X1, ..., Xt-1; in a Markov source of order m, each of these conditional probabilities depends only on the previous m symbols, and Xt is independent of everything earlier given those m symbols. That is, the probability of Xt given X1, ..., Xt-1 equals the probability of Xt given Xt-m, ..., Xt-1: the next symbol depends only on Xt-1, Xt-2, and so on, down to Xt-m. That is what a Markov source of order m is; each symbol depends only on the previous m symbols.
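To make this concrete, here is a minimal sketch of a Markov source of order m = 1 in Python, with a made-up two-symbol alphabet and a made-up transition matrix; each new symbol is drawn from a distribution that depends only on the previous symbol, not on the whole past.

    import numpy as np

    # Hypothetical order-1 Markov source over the alphabet {0, 1}.
    # transition[i][j] = P(X_t = j | X_{t-1} = i); each row sums to 1.
    transition = np.array([[0.9, 0.1],
                           [0.3, 0.7]])
    initial = np.array([0.5, 0.5])
    rng = np.random.default_rng(0)

    def markov_source(n):
        x = [rng.choice(2, p=initial)]
        for _ in range(n - 1):
            # The next symbol depends only on the previous one (order m = 1).
            x.append(rng.choice(2, p=transition[x[-1]]))
        return x

    print(markov_source(20))

For an order-m source one would condition on the last m symbols instead of just the last one.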
This kind of model makes sense, and it is much richer than the memoryless model. We will visit it once in a while in this course, but for the most part we will restrict our attention to memoryless models.
Now that we understand the various options for modelling a source, the producer of information, let us revise quickly: a source produces a sequence of random variables with a prescribed joint distribution. When these random variables are assumed to be independent and identically distributed, we call the source a memoryless source, because each symbol is generated without memory of the previous symbols. When we want to model memory in the sequence, we use Markov sources of order m.
So, now that we know how to model uncertainty, we come to the more important question.
The uncertainty we just saw is modelled through probability: there is obviously some uncertainty in these symbols, because each is a random variable and can take any of the values in its alphabet. We have represented our lack of knowledge of the exact value of the source output by a distribution, the probability mass function P, which represents our belief about the likelihood of the different values of the source.
So we know how to model uncertainty, but we still do not know how to measure uncertainty by a single number. If I ask you to measure volume, you answer with a single number; we want a similar answer as a measure of uncertainty.
Here is a thesis from information theory. Assuming there is a reasonable way of measuring uncertainty, the thesis says that "the uncertainty of a source is equal to the number of bits used to represent its output". This is a very data-centric, or computer-centric, view of measuring uncertainty. The idea is that we would like a quantitative answer, because we will use it to express this unknown in a fixed number of bits.
Going back to our Google example: somewhere in the Google server there must be a table that stores all the possible answers it can offer, and that table must be of finite length. So Google has to decide upfront how many bits to reserve for these answers. To be able to do that, it must capture the uncertainty in the source in some quantitative way, and this is that quantitative way: we measure the uncertainty of a source by the number of bits used to represent it. The measure of uncertainty of an unknown is the number of bits with which you can represent it.
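As a tiny worked example of that reservation, if the server's table can hold N possible answers, it needs ceil(log2 N) bits to index an entry; the value 256 below is simply the table size used in the running example later in this lecture.

    import math

    def bits_needed(n):
        # Number of bits needed to index a table with n entries.
        return math.ceil(math.log2(n))

    print(bits_needed(256))   # 8 bits suffice for 256 possible answers
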
You may object that the source is a random variable, so whatever representation you choose, its length will depend on the realization, and the number of bits is itself a random variable. What, then, do we mean by "uncertainty equals the number of bits used to represent the output"? That is a very valid point, and one answer is to take the expected value of this random variable, the average number of bits required to represent the output of the source.
We can take that as our measure of uncertainty, or we can take a likely number of bits, one that suffices with large probability. For example, if we can represent the source using 100 bits with probability greater than 99.99%, then the uncertainty of the source is 100 bits. It may also seem that the number of bits required to represent the source depends on the method used to represent it, so that this definition becomes quite subjective.
We fix this by adding a small phrase: the minimum number of bits, the most economical representation of the source, is what we call the uncertainty of the source. It can be either the minimum average number of bits or the minimum likely number of bits. What we saw in the previous week is that, for a random variable, its average (expected) value is not too different from its likely value.
In fact, the likely value of a random variable is its average value plus or minus a correction based on its variance, so these two answers will be very close to each other. That is how we want to measure uncertainty. We now have a somewhat philosophical definition of uncertainty, and we would like to make it mathematically concrete.
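As a quick numeric illustration of that last point (the mean and standard deviation below are made-up numbers), Chebyshev's inequality from last week says that a random length L with mean mu and standard deviation sigma satisfies P(|L - mu| >= k*sigma) <= 1/k^2, so the likely number of bits cannot stray far from the average number of bits.

    # Made-up numbers, purely for illustration.
    mu, sigma = 100.0, 2.0          # average number of bits and its standard deviation

    # Chebyshev: P(|L - mu| >= k*sigma) <= 1 / k**2.
    for k in (2, 5, 10):
        print(f"P(L outside [{mu - k*sigma:.0f}, {mu + k*sigma:.0f}]) <= {1/k**2:.4f}")
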
But now we have something concrete to work with: we have put down a tangible definition, the minimum number of bits required to represent the source, and that is our measure of uncertainty.
So, we just agreed that we will measure uncertainty by the minimum number of bits required to represent a source. Let us return to our running example, the Gmail client. How does the Google server actually complete our emails? Presumably it goes over all the emails I have composed, looks at the phrase I have just written, and finds the phrases that most often follow it.
Suppose it has built such a list, with a likelihood attached to each phrase. Among all the times it has seen my phrase, one completion follows half the time, the next one a fourth of the time, the next an eighth of the time, then 1/16, then 1/32, and there are 256 possible phrases in all, with the remaining phrases all appearing equally often. This is the probability model Google has built for my email: half the time, after "sorry", I complete with the first phrase, a fourth of the time with the second, and so on.
Let us check these probabilities. The first five phrases account for 1/2 + 1/4 + 1/8 + 1/16 + 1/32 = 31/32, and since probabilities sum to 1, the remaining probability is 1/32. We have used 5 of the 256 symbols, so 251 symbols remain, and that leftover 1/32 is spread equally among them; each of those phrases is therefore far less likely than the first five.
Given that we know these likelihoods, how should Google, or anyone, store these phrases so that the number of bits used is as small as possible? Let us start with the simplest scheme that comes to mind: a giant lookup table with two columns, record number and phrase. How many records are there? There are 256 possible phrases, so each record number can be written with 8 bits, that is, 1 byte. We associate each phrase with its 1-byte record number, so this method requires 8 bits to store any of the phrases. Can we do better?
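To keep the arithmetic straight, here is a minimal sketch in Python, with the distribution exactly as just described, that checks the probabilities sum to 1 and computes the average length of the single-lookup-table scheme.

    from fractions import Fraction as F

    # The pmf of the example: five heavy phrases, then 251 equally likely ones.
    heavy = [F(1, 2), F(1, 4), F(1, 8), F(1, 16), F(1, 32)]
    light = [(1 - sum(heavy)) / 251] * 251    # the leftover 1/32, split 251 ways
    pmf = heavy + light

    assert sum(pmf) == 1 and len(pmf) == 256  # sanity checks

    # Single lookup table: every phrase gets an 8-bit record number.
    print(sum(p * 8 for p in pmf))            # 8 bits, whatever the probabilities
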
Here is another very simple idea: why did we make just one table? Why not two? We first use 1 bit to indicate which of the two tables to go to. So there are two tables stored on the computer; for each symbol you first spend 1 bit to say which table it lives in, and then the tables hold different symbols. For instance, the symbols with probability 1/2 and 1/4 can go to table 1, and the remaining symbols go to table 2. Table 2 then holds 254 symbols, so you need 8 bits to index into it; you can check that it cannot be done with fewer than 8 bits. Table 1 has only two symbols, so 1 bit suffices to index into it.
What is the worst-case number of bits used here? In the worst case you spend 1 bit for the table plus 8 bits for the position, that is, 9 bits. Each symbol is represented by which table it is in and by its entry number in that table: 9 bits for every symbol in table 2, but only 2 bits for the two symbols in table 1.
What is the average number of bits used? The first symbol, with probability 1/2, uses 2 bits, one for the table and one for its location inside the table. The next symbol, with probability 1/4, again uses 2 bits. All the remaining symbols use 9 bits, and their total probability is the remaining 1/4. So the average is (1/2) times 2 plus (1/4) times 2 plus (1/4) times 9, which is 3.75 bits.
This is very interesting: this simple scheme is worse than the single-table scheme in the worst case, since it uses 9 bits instead of 8, but on average it uses far fewer bits, only 3.75. And that is exactly what we want to do: use as small a number of bits as possible on average.
In constructing these tables I just gave you the answer, but if you try to think about how you would place the symbols across the tables, it is a very interesting exercise. The heuristic I went by is this: take the symbols with large probability, here the first two, and put them in a very small table so that they can be represented with few bits; take the symbols with very small probability and club them together in a separate table. For every symbol in table 2 I end up using more bits, 9 instead of the 8 of the simple one-byte representation, but collectively their probability is small, just 1/4, so I pay a little extra there and save significantly on the heavy symbols. Overall the average comes down from 8 bits in the single-table case to 3.75 bits. That is the idea. Now, is this a good assignment?
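Before moving on, the 3.75-bit average of the two-table scheme can be checked in a couple of lines, reusing the pmf from the sketch above.

    # Two-table scheme: the symbols with probabilities 1/2 and 1/4 cost
    # 1 table bit + 1 position bit = 2 bits; the other 254 symbols cost
    # 1 table bit + 8 position bits = 9 bits.
    lengths = [2, 2] + [9] * 254
    print(sum(p * l for p, l in zip(pmf, lengths)))   # 15/4 = 3.75 bits on average
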
Well, we can try some more things. Why did we assign just 2 symbols to the first table? What if we put 4 symbols in the first table? Now those 4 symbols sit in the first table and the rest in the second, so the remaining symbols all require 9 bits and their collective probability is 1/16. If you do this calculation, the average length comes down even further, to 3.44 bits.
So where does this end? You can keep going. Can we use 3 bits for the first table, that is, put 8 symbols in it? If you do that, you will see that the average length is 4.15 bits.
So what is going on here? There are various options for how we arrange the symbols across the tables, how we store them, and different schemes have different average lengths. In fact, if you think a bit more deeply, you can convince yourself that the second scheme, the one that uses 3 bits for the symbols in the first table (1 for the table and 2 for the location inside it) and 9 bits for the symbols in the second table, is the best among these table-based schemes; it turns out you cannot do better than 2.44 bits this way, and this is not so difficult to see, so do think about it. But all the other schemes also look reasonable. One thing is certain: if we do not design the scheme keeping the probabilities in mind, our average performance equals the worst-case performance, which here is just 8 bits. When we design schemes keeping the probabilities, the frequencies of the symbols, in mind, we can do much better: we brought the average down to 3.75 bits just by trying something simple, and, being a little more careful as in the second scheme, down to 2.44 bits.
So that was one example; let us continue with it. The next thing to ask yourself is: why just two tables? Why not five, why not several? We can do that: make 4 different tables and use 2 bits to represent the table number, 00 for table 1, 01 for table 2, 10 for table 3, 11 for table 4, and then think about which symbols to put in these tables. At this point I urge you to pause the video and try to come up with this assignment yourself: how would you place the symbols across the tables so that the average number of bits used per symbol is as small as possible? I will give the answer now, but do pause and think about it first.
So, now that you have thought about it: there are 4 tables, and hopefully you agree with the following assignment. I keep the heavy symbols, with probabilities 1/2 and 1/4, in table 1; they need 1 bit to indicate their location inside the table, since there are just 2 of them. Another 2 symbols go in table 2, again with 1 bit for their position. In table 3 I put a single symbol; how many bits do I need to locate it? Zero extra bits: if you reach that table, you know it contains only that one symbol.
The fourth table then holds all the remaining symbols: we have already placed 5 of the 256, so 251 symbols remain, and we use 8 bits to index them. Once again the rationale is the same as before: take the heavy symbols and put them in one table, the next heaviest in the second table, and so on, and club all 251 light symbols together in a single table. Every symbol now needs 2 bits for the table number, plus extra bits depending on which table it is in. If you do this calculation, you can check that the average number of bits comes out to 3.25. That is still worse than our 2.44, but not bad, and it can be improved easily: instead of putting both 1/2 and 1/4 in table 1, put the symbol with probability 1/2 alone in table 1. With that small change the average improves to 2.72. I am just working through some example ideas here to show how different schemes can come about.
Let us take another view of this multiple-table scheme. We can think of the tables that contain multiple symbols as super symbols, each representing a group of symbols, and the probability of a super symbol is the sum of the probabilities of all the symbols inside it. In fact, this view leads to what is called the Huffman code, which we will come to later in this course. The Huffman code is a very popular code used for compression; it was the first optimal code for minimizing the average length and is, in fact, still the only such optimal code. There is a story behind it: Huffman was a master's student when Fano, I believe, raised the question of finding the optimal code in his class, and Huffman found this code as the answer to that question. So, for all the master's students taking this course: hopefully you will invent something new by the end of it.
How do we view the Huffman code? As I said, we will come back to it in greater detail later, but here is a simple way of seeing it, an alternative representation of those tables. We think of tables as symbols: the first table has as its first entry the symbol with probability 1/2, and as its second entry a super symbol, which is itself a table, namely table 2. So the first table holds just 2 entries, the symbol with probability 1/2 and the super symbol we call table 2. Inside table 2 there is again one ordinary symbol, with probability 1/4, and a super symbol, which is now table 3, and so on.
You can also view this entire arrangement as a graph. Start from the root node; on the left put the symbol with probability 1/2, and on the right put the super symbol, table 2, standing for all the other symbols. Go to that node: it again has two children, the symbol with probability 1/4 on the left and, on the right, the super symbol with all the remaining symbols. Again, 1/8 on the left and the super symbol, table 4, on the right; then 1/16, then 1/32; and finally you have the single table 6, which holds all the remaining symbols.
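We will see the Huffman code in detail later in the course; purely as a preview, here is a hedged sketch of its merge procedure in Python, run on a small made-up pmf (not the 256-phrase distribution above): repeatedly merge the two least likely (super-)symbols, pushing the merged symbols one level deeper in the code tree.

    import heapq
    from itertools import count

    def huffman_lengths(pmf):
        # pmf: dict {symbol: probability}. Returns optimal codeword lengths.
        tie = count()  # unique tie-breaker so heapq never compares the dicts
        heap = [(p, next(tie), {s: 0}) for s, p in pmf.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Merge the two least likely (super-)symbols; every symbol inside
            # them moves one level deeper, so its codeword gets one bit longer.
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (p1 + p2, next(tie), merged))
        return heap[0][2]

    # Toy distribution, purely for illustration.
    toy = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    lengths = huffman_lengths(toy)
    print(lengths)                                  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
    print(sum(toy[s] * lengths[s] for s in toy))    # average length 1.75 bits

The heap is only a convenient priority queue; the essential step is the repeated merging of the two smallest probabilities, which is exactly the super-symbol view described above.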
So you can represent the arrangement by this tree, and then all you do is walk over it, thinking of a left edge as 0 and a right edge as 1; the path from the root down to each of these so-called leaves gives the bit assignment for that symbol. In this way you get a binary representation of every symbol: the symbol with probability 1/2 has representation 0, the symbol with probability 1/4 has 10, the symbol with probability 1/8 has 110, and so on.
So, starting from the very simple lookup-table scheme, we have quickly arrived at a much more sophisticated one. When you calculate the average number of bits, it now turns out to be 2.03, which is better than anything we have seen so far. In fact, what we can show is that this is the minimum average length you can have for this particular probability distribution, and it is attained by the Huffman code. We will return to the Huffman code later; this example was meant to point out that if you think about the frequencies with which different symbols appear, you can design very clever compression schemes. Let us take another view of the same scheme. So, by the way, this is almost the Huffman code; you can do something slightly d
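Returning to the tree-walk just described: here is a small sketch, in Python, that reads codewords off that chain-shaped tree, taking 0 for a left edge and 1 for a right edge. It reproduces the codewords 0, 10 and 110 mentioned above for the three heaviest phrases, continues with 1110 and 11110, and reserves the all-ones prefix 11111 for the final table of 251 remaining phrases (each of those would then append its 8-bit index inside that table). The leaf labels are made up; only the tree shape comes from the example.

    # Chain-shaped tree from the example: at every level the left child is a
    # phrase and the right child is the super symbol for the next table; the
    # innermost right child stands for table 6 with the 251 remaining phrases.
    tree = ("p=1/2", ("p=1/4", ("p=1/8", ("p=1/16", ("p=1/32", "table6")))))

    def codewords(node, prefix=""):
        # Walk the tree: a left edge appends '0', a right edge appends '1'.
        if isinstance(node, str):          # leaf: its codeword is the path so far
            return {node: prefix}
        left, right = node
        return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

    code = codewords(tree)
    print(code)
    # {'p=1/2': '0', 'p=1/4': '10', 'p=1/8': '110',
    #  'p=1/16': '1110', 'p=1/32': '11110', 'table6': '11111'}

    # No codeword is a prefix of another, so a bit stream can be cut back into
    # symbols unambiguously.
    words = list(code.values())
    assert not any(a != b and b.startswith(a) for a in words for b in words)
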