We will start by discussing what is called a source model. Last time we decided to use probability as a means for modelling uncertainty; we can now return to that goal and try to answer the question "how much information is revealed when a question is answered?"
Recall this example that we saw last time. I am trying to compose an email on an email client, and I write something like: dear all, there will be no information theory class today. A makeup class will be held on Tuesday. Then I start typing "sorry for", and in fact I just type "sorry", and what my email client does is try to complete the sentence for me. It suggests something like "for the confusion."
Of course, as we discussed, this is not the only possibility the email client may have in mind. It can have several other possibilities in mind: sorry for the confusion, sorry for the inconvenience, and maybe something else. What we know is that, somehow, using some rule, this email client chooses one of these options and reveals it to us.
So, how does the email client model this? In this case I use Gmail, so let us ask how the Gmail server models this. One simple approach it can follow is to model this unknown phrase using a random variable: it can assume that the user (that is me, here) is generating these outputs from a probability distribution. We will call a user who generates this information a source. You will hear the phrase "source" several times in this course; it is the source of information.
So, this user can be thought of as a source, and this source generates outputs according to some probability distribution, which we capture by a probability mass function P on a finite alphabet, which we denote by calligraphic X. We have options in modelling: we can either treat each word as a symbol from this alphabet X, or we can think of the entire phrase as a single symbol from X.
These choices result in very different engineering implementations. However, for the theoretical treatment in this course, we do not worry about these semantic mappings of abstract symbols from an alphabet to the actual data; we will just treat the data as coming from this alphabet calligraphic X. So, a source, which is a generator of information, generates a random variable X distributed according to a pmf P on a finite alphabet X.
As we saw in the previous example, a source does not generate just a single symbol; it can generate a sequence of symbols, for instance the words in a sentence, if you think of words as symbols. So, we will think of sources generating the sequence X1, X2, ..., Xn, and we will abbreviate this sequence by Xt, where the index t runs from 1 to n.
The assumptions we make are on the joint distribution of this sequence X1 to Xn. Since X1 to Xn is a finite-length sequence of random variables, you can think of that sequence itself as a single random variable.
In this course, we will look at various classes of sources; for the most part, we will look at what is called a memoryless source. A memoryless source is a very simple probabilistic model, where we assume that the sequence Xt, for t from 1 to n, comprises independent and identically distributed random variables.
This is a phrase which will come up again and again, so we will abbreviate it: i for independent, i for identically, and d for distributed, iid random variables. For the most part, we will look at these memoryless sources, which can be modelled using iid random variables.
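The memoryless source described above can be sketched in a few lines of code. This is a minimal illustration, not anything from the lecture: the alphabet and probabilities below are made-up values, standing in for whatever pmf P on the alphabet X the model assumes.

```python
import random

# A minimal sketch of a memoryless (iid) source: every symbol is drawn
# from the same pmf P on a finite alphabet, independently of the past.
# The alphabet and probabilities here are hypothetical, for illustration only.
ALPHABET = ["confusion", "inconvenience", "delay", "trouble"]
PMF = [0.4, 0.3, 0.2, 0.1]

def memoryless_source(n, rng=random):
    """Generate a length-n sequence X1, ..., Xn of iid symbols from PMF."""
    return rng.choices(ALPHABET, weights=PMF, k=n)

print(memoryless_source(5))
```

Note that each call to `rng.choices` ignores everything generated before, which is exactly the "no memory" assumption.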
If you do not remember what independence and independent random variables are, here is a very quick review, although I urge you to go back and revise these things.
Two random variables X and Y are independent if the joint distribution of (X, Y) has a product form. For the discrete case, the probability that X takes the value x and Y takes the value y is equal to the probability that X takes the value x times the probability that Y takes the value y.
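The product-form condition above can be checked exactly when the joint pmf is given as a table. Here is a small sketch with made-up numbers (not from the lecture): the joint pmf is chosen so that it factorizes into its marginals.

```python
from itertools import product

# Toy joint pmf of (X, Y) on {0,1} x {0,1}; values are illustrative only.
joint = {
    (0, 0): 0.12, (0, 1): 0.28,
    (1, 0): 0.18, (1, 1): 0.42,
}

def marginal(joint, axis):
    """Sum the joint pmf over the other coordinate to get a marginal."""
    m = {}
    for pair, p in joint.items():
        m[pair[axis]] = m.get(pair[axis], 0.0) + p
    return m

def is_independent(joint, tol=1e-12):
    """X and Y are independent iff P(x, y) = P_X(x) * P_Y(y) for all x, y."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x, y in product(px, py))

print(is_independent(joint))  # True: e.g. 0.12 = 0.4 * 0.3
```

Here the marginals are P_X = (0.4, 0.6) and P_Y = (0.3, 0.7), and every joint entry is the product of the corresponding marginals.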
Such random variables are called independent random variables. It follows that for independent random variables, the expected value of XY must equal the expected value of X times the expected value of Y. There is a common misconception here, and I assume that everyone in this class has studied probability carefully enough to know that this is not so: people think that if the expected value of XY equals the expected value of X times the expected value of Y, then the random variables are independent.
In fact, such random variables are merely uncorrelated. Being uncorrelated does not mean, for instance, that the expected value of X squared Y squared equals the expected value of X squared times the expected value of Y squared. So, revise the concept of independent random variables, and revise the concept of uncorrelated random variables. For the most part, in the models we look at in this course, we will assume that the source is memoryless, so that the random variables Xt are independent and identically distributed.
So, what do we mean by identically distributed? Take two of these random variables, say X1 and X2. What is the probability of X1 being x1 and X2 being x2? Since they are independent, it is the product of the two probabilities: the probability of X1 being x1 times the probability of X2 being x2. And since they are identically distributed, this also equals PX(x1) times PX(x2), because all of them have the same common distribution PX. That is what independent and identically distributed random variables are.
So, this is a very simple model, and somewhat unrealistic. What does this model say? It says that every time you see a word in the sentence, it is just drawn independently from a bag with a fixed distribution. It is more like a monkey typing on a typewriter than a human speaking, because humans presumably keep some memory, and they will not have this kind of independence across words.
For instance, if you are speaking English, the English language has structure: a sentence has a structure, it starts with a subject, there is a verb somewhere. When you hear a word, you can often already guess what the next word will be; so the next word is probably not an independent random variable. What is very surprising is that, in spite of being such a bad model, the memoryless source model supports a rich class of compression algorithms, and just by using these models, a lot of engineering innovations have been made. It is a very reasonable class of models, and it is very well suited for theoretical analysis.
But in a Markov source of order m, each of these conditional distributions depends on just the previous m symbols: Xt is independent of all the past given the previous m symbols. So, the probability of Xt given X1 to Xt minus 1 is equal to the probability of Xt given Xt minus m to Xt minus 1. It depends only on the previous m symbols, Xt minus 1, Xt minus 2, and so on, down to Xt minus m. That is what a Markov source of order m is: it depends only on the previous m symbols.
So, this kind of model makes sense, and it is much richer than the simple memoryless model. We will visit this model once in a while in this course; for the most part, we will restrict our attention to memoryless models.
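A Markov source of order m can be sketched as follows. The transition rule here is a toy assumption of mine (a smoothed majority vote over the context), not anything specified in the lecture; the point is only that the next symbol's distribution is a function of the last m symbols alone.

```python
import random

# Sketch of a Markov source of order m over a two-letter alphabet.
ALPHABET = ["a", "b"]

def next_pmf(context):
    """Hypothetical order-m rule: favour the majority symbol in the context."""
    n_a = context.count("a")
    p_a = (n_a + 1) / (len(context) + 2)  # add-one smoothing
    return [p_a, 1 - p_a]

def markov_source(n, m, rng=random):
    """Generate n symbols; each new symbol depends only on the last m."""
    seq = rng.choices(ALPHABET, k=m)       # arbitrary initial m symbols
    while len(seq) < n:
        context = seq[-m:]                 # only the previous m symbols matter
        seq.append(rng.choices(ALPHABET, weights=next_pmf(context), k=1)[0])
    return seq

print(markov_source(10, m=2))
```

Setting m = 0 (an empty context) would collapse this back to a memoryless source, which is one way to see that Markov sources strictly generalize the iid model.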
So, now we understand the various options for modelling a source, the producer of information. To revise quickly: a source is just a sequence of random variables with a prescribed joint distribution. When these random variables are assumed to be independent and identically distributed, we call the source a memoryless source, because each symbol is generated without memory of the previous symbols. And when we want to model memory in these sequences, we use Markov sources of order m.
So, now that we know how to model uncertainty, we come to the more important question.
The uncertainty we just saw is modelled through probability: there is obviously some uncertainty in these symbols, because each is a random variable and can take any of the values that the random variable can take. We have represented our lack of knowledge of the exact value of the source by this distribution, the probability mass function P, which represents our belief about the likelihood of different values of the source.
So, we know how to model uncertainty, but we still do not know how to measure uncertainty by a single number. If I ask you to measure volume, you will answer with a single number; we want a similar answer as a measure of uncertainty.
Here is a thesis from information theory; it proposes a reasonable way of measuring uncertainty: "the uncertainty of a source is equal to the number of bits used to represent its output." This is a very data-centric, or computer-centric, view of measuring uncertainty. The idea is that we would like a quantitative answer, because we will use it, for instance, to express this unknown in a fixed number of bits.
Going back to our Gmail example: in the Google server, there must be a table which stores all possible answers it can suggest, and that table must be of finite length. So, Google has to decide upfront how many bits to reserve for these answers. To be able to do that, it must capture the uncertainty in the source in some quantitative way. And that is the quantitative way: we measure the uncertainty of a source by the number of bits used to represent it. The measure of uncertainty is the number of bits with which you can represent the unknown.
You may ask: the source is a random variable, and whatever representation you choose, its length will depend on the realization, so this number of bits is itself a random variable. So, what do we mean by "uncertainty equals the number of bits used to represent the output" when that number is random? That is a very valid point. One can think of the expected value of this random variable, the average number of bits required to represent the output of the source.
We can treat that as the measure of uncertainty. Or we can think of a likely number, one that holds with large probability. For example, suppose we can represent this source using 100 bits with probability greater than 0.9999; then the uncertainty of the source is 100 bits. It may also seem that the number of bits required to represent the source depends on the method used to represent it, and that, therefore, this definition becomes quite subjective.
We fix this by adding a small phrase: the minimum number of bits, that is, the most economical representation of the source, is what we call the uncertainty of the source. It can be either the minimum average number of bits or the minimum likely number of bits. Now, what we saw in the previous week was that for a random variable, its average (expected) value is not too different from its likely value.
In fact, the likely value of a random variable is its average value plus or minus some correction based on the variance of the random variable, so these two answers will be very close to each other. That is how we want to measure uncertainty. We now have a sort of philosophical definition of uncertainty, and we would like to make it mathematically concrete.
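The closeness of the average and the likely number of bits can be seen in a small simulation. This is my own illustration under an assumed toy model, not from the lecture: each symbol is (hypothetically) represented with 1, 2, or 3 bits with probabilities 0.5, 0.3, 0.2, and the total length over many symbols concentrates around n times the per-symbol mean.

```python
import random
import statistics

# Hypothetical per-symbol representation lengths and their probabilities.
lengths, probs = [1, 2, 3], [0.5, 0.3, 0.2]
mean_per_symbol = sum(l * p for l, p in zip(lengths, probs))  # 1.7 bits

def total_bits(n, rng=random):
    """Total (random) number of bits used to represent n source outputs."""
    return sum(rng.choices(lengths, weights=probs, k=n))

n = 10_000
samples = [total_bits(n) for _ in range(50)]
avg = statistics.mean(samples)

# The totals stay within a few standard deviations (order sqrt(n)) of
# n * 1.7, so "average bits" and "likely bits" nearly coincide.
print(n * mean_per_symbol, avg)
```

The per-symbol variance here is 0.61, so the total's standard deviation is about sqrt(0.61 * n) ≈ 78 bits, tiny compared with the mean of 17,000 bits; this is the variance-based correction mentioned above.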
But now we have something to work with: we have put down a very tangible definition, the minimum number of bits required to represent the source, as the measure of uncertainty.