So now that we have agreed that we will use probability to model our guess list and the likelihood of different gusses in our guess list, we need to review some basic probability. So I assume that you have all taken that some course in probability, at least an undergraduate-level course in probability, and have seen concept such as probability mass functions, probability density, probability distribution, random variables, expected value.
There will be some other inequalities which we will be covering in this review, Markov inequality, Chebyshev inequality, Law of Large Numbers, Central Limit Theorem and while I do not assume that you know these things thoroughly but I, I expect that you go ahead and read about these from some books. So I will cover these things very briefly and very quickly, mainly to agree on the notation and then I expect you to go and read them.
So in this first part of my review, I will basically cover this is the first part, and in this first part, I will cover notations for let us say probability and expectation. And let me start right away with the warning, this is not the course in probability, so here is a warning. So I will do very little things, I will do very few concepts and little knowledge may be dangerous. And so, I urge you to go ahead and read the basics on your own. So with that caveat, let me start my quick review of probability.
So as you may know, there are two kind of probabilities that we encounter. One is discrete probability and one is continuous probability. We see discrete distributions and we see continuous distributions, and we talk about probability mass functions and we talk about densities. So let me start by reviewing a common notation which will unify both of these into a single umbrella.
So for us, the probability, when we talk of probability we take a set; we talk of a set which is the universe where all things are happening. So for us, that universe is the set X. Let me clean up this in a little bit, let us see.
So for us, that set is this X and this is where all our events takes place, this is the set of guesses for us remember, and in A, which is a subset of this set is called an event and we assign probabilities to these events. A probability mass function assigns the probability mass P to each element in this event whereby the whole event gets probability which is sum of all the P x; that something which you may be familiar with.
So this P x is the probability mass function, it puts a probability mass on each x. For example, if this universe X is just the coin toss so it is just heads and tails, so then this x can either take the value heads which has some probability 0.1 let us say, and then tails will have probability 0.9 and so this x, this p x can have those probabilities, 0.1 or 0.9. In general, we use this notation small p x for probability mass function evaluated at x.
And the probability of a set A, which is a subset of x, so an event A is just the sum of probability mass function over all elements in A, that is the definition of probability. That is how this probability mass function can be used to calculate probabilities of events. So here is something which I assume you would know.
Let me write it in a different way so instead of writing this as summation over this x in A here, I write it as summation overall x but I multiply this thing with an indicator function, so this thing here has a name, it is the indicator function, indicator function of A.
So this function takes x as an input and output 0 or 1. When x belongs to A the output is 1, when x does not belongs to A the output is 0. So it is the indicator function of A. So what we notice here is something very simple that this summation can also be written as summation over x p x times indicator function of A, it can be written in this form. So this is one observation we make.
Now, let us move on to continuous probability distributions, in which case we talk about probability densities. So now X is a continuous set. For instance, X can be the d-dimensional Euclidean space Rd or it can be even more complicated set, such as set of functions, which have frequency domain response between some band limited thing.
So such function, for instance, appear when you are thinking of communication over real channels. So this can be a very complicated set X. But let us, for concreteness, think of a very simple set just a d-dimensional Euclidean space. So that is a continuous distribution.
For d equal to 1 this is just the real line and we again want to talk about probability of events. Events are subsets of this Rd, probability of an event now is the integral, instead of summation,of this density. So this p x will be called probability density. It is the integral of probability density over the set A.
So once again just like we did it here instead of writing it as integral over A, we can write it in this form, integral over the entire space but multiplied within indicator function of A, it is the same thing, it is the indicator of A. So in this sense, you can see that densities and probability mass function look very similar, except that here we are computing an integral with respect to the Euclidean space and sometimes it is called integral with respect to the Lebesgue measure.
And here we are (consider), we are computing a summation which is also a kind of integral by the way but over a very simple measure, the counting measure. So these two have a very similar form and that is the connection between density and probability mass function, they can both be used to compute probabilities of event by integrating p x times the indicator function of an event over a, over the, using the underlying measure, here it is the Lebesgue measure, here it is the counting measure.
For some notation, so as we said, both definitions of P A, probabilities of events, have a common expression. They look like this, P A equals to integral of indicator function with respect to some measure and we will use a small abbreviation for this, for this integral instead of every time writing integral P x d x, we will simply write d of P x, this is a standard notation actually. It means it is an infinitesimal probability measure. Just like we take very small length, this is the probability of very small event, that single X. And we can write this as integral of over A of d P x.
So this is some notation, P of A you can, we are writing as integral A over d P x, when x is, when the random variable is discrete then you should think of this integral as summation and you get this; when the random variable is continuous then you should think of this integral as an integral over a real line or Rd and you get this.
So this is a very quick review of a sort of a measure-theoretic view of probability but this notation is very important for us and it also helps us handle both discrete and continuous random variable in the common framework. So if you are little bit uncomfortable with it, I will suggest that you read on probability densities, and probability mass functions and they play a role in computing probabilities of events, which is what we are doing here.
So let us now, so we have agreed on this notation. Let us now move to the next concept I would like to review quickly, which is the concept of the random variable. So we already saw a random variable, by the way, we saw this indicator function of A. This function takes as input element x and output 0 or 1.
And we also define this, this particular integral, this particular integral which is the integral of indicator function A over d P, this is just a notation we introduced, we saw that this is equal to P of A that is how this integral was actually defined.
So you can think of this indicator function of A also as a random variable, this is the very simple random variable just these two values 0 or 1, and it tells you if that event occurred or not. So for instance, if your underlying probability space is that of a six-sided die and you are rolling a die, a random variable that we can think of is this indicator function of A, where A is just the even numbers 2 4 6.
And in that case, this random variable will be again binary, we will take value 0 or 1; when an even number appears the value will be 1, when odd number appears the value will be 0. And we can define this integral, which is that probability of an even number appearing, which is just half.
So if you agree with this definition and if you agree that we can define all such indicator functions for all such event A, then what we can do, we can also define this linear combination of such indicator function, so this is some weight Ai times indicator function Ai. So we can define all that.
And then you may ask, well, let us find, now what is this integral which was earlier just P of A? So let us assume that we can take this summation outside the integral, in fact, that how this integral is defined that you should be able to take some summation outside. So an integral is define in such a manner and it is a linear operation that is all we are seeing here.
So this summation can be taken outside, and therefore this integral here becomes summation i equal to 1 to l ai P of Ai, that is the integral of this guy. So my claim is this is also another random variable, so we started with the simple example of random variable, which was this indicator function which just takes two value and we can combine some of them, take their linear combination to get these new random variables, many of them now, each linear combination gives a new random variable to us. So you have done this.
Now we are allowed, we are not allowed to use any subsets A that we want in general, there is a subtlety to the risk and you have to take a formal course, real analysis, or a probability theory to figure out which subsets A are allowed. But a large class of subset A are allowed and all these are your events and you can take indicator function for all these events and take the linear combinations and get new random variables.
In fact, when is discrete then all subset are allowed. So if X is discrete then all subsets are allowed, the subtlety that I am talking about appears when X is not discrete. For instance, if you take X as the, as this interval 0 to 1, you can start by allowing all intervals.
And then this, for an interval the function looks like this. Indicator function, at 0 will be here and it is 1 here and you can take these linear combination, such functions will look like this. So all such functions, we will take up some space here. So such functions, so this look like this, summation over i ai indicator function Ai, such functions we are taking into account.
So what is a domain for this function, that from X and this is the same as the probability space X where our basic events are taking place and they take values in R. Since you are taking linear combination of R, you can make it complex numbers, for this course R suffices. Such functions are called random variables.
And so, till now we have, we went, we started the very elemental function, indicator functions of events then we take their linear combination we get a richer class of functions and now we can also do is we also can take limits. We can take limits so that this integral f can be defined as limit of such integrals here.
So we can even take limits and there is a technical way of doing all this but roughly speaking, all such functions which arise from such limits are called random variables. So what was this procedure? What did I walk you, what did I just walk you through? So let us draw some analogy with how you construct real numbers. We first learn natural numbers. So natural numbers are, well, formally they are some kind of an inductive set which contain 1, but one easy way to think of natural numbers is just, they are just counting number 1 2 3 4 and then from natural numbers we move on to fractions.
So we learn fractions, rational numbers and then we learn that these actual these rational numbers are dense inside the real and if we take the limits then we will fill all the gaps in the real line and we get the real number. It is a similar procedure, we first define a simplest of the numbers as such this indicator function of events, these are sort of like our natural numbers then we took these linear combinations, these are sort of like a rational numbers and then we took limits and those limits are real numbers.
So we have made these large place of functions which we call random variables, this way. Also, using the same procedure we were able to define this integral f of d P for any random variable f and it is just defined as limit, as a limit procedure.
And this function you have seen before, so I will come to this function this integral of a function f. So just as a summary, a random variable is such a function f, just like the one we got above, it takes values in this real line R. So but the probabilities, remember, were on the domain of f, probabilities were not on the range of f.
And we denote these random variables by capital X and this integral, so instead of f we will denote by capital X, this is same as your f, it is the f of whatever the input was. And, so now, remember, we were able to define this integral because that is how it constructed a random variable, this instruct, this integral is something which you may be familiar with, it is called the expected value of x.
So, whenever we have a random variable we can talk about its expected value. It is denoted by expected value of X, this notation here. So this notation is something I will use throughout this course. So I may have taken a concept which you already knew and have confused you about it, what I will request you to do is to read, brush up your probability, read about random variables, read about expected value of a random variable, and then come back to this lecture and try to see if you make sense of what I just said.
How, if you can connect the concepts that you already knew to the way I described them because this is perhaps an alternative view for some of them, this is the major theoretic view of probability, and it really helps us in writing formal proofs and generally it is a good way to unite both discrete and continuous probability.
So we were talking about these random variables, what about this elemental random variable, the random variable, the indicator function random variable? Turns out that this random variable has a very useful expected value, expected value of indicator function of A is simply the probability of A. Remember, we can only talk about probability of this event A but from this, we would like to talk about probabilities of this random variable taking some value.
So suppose, what is the probability that random variable X, which is now a function takes values
in B? That probability is the expected value that the domain of X is so that the output belongs to
B. So this is what we are writing as indicator function f inverse B. So it is just the probability of f
inverse of B, that is what that probability is.
So this is some quick review of random variables. So we have, I expect you to see some random variables to, I expect that you have seen some random variables, Bernoulli, Binomial, Poisson, Geometric, Gaussian, Exponential. The last two here are continuous, the first four are discrete. Maybe I will walk you over some examples so that I can make this, the discussion I had just concrete, so let me just do that, let me go to the side screen and walk you over some examples.
So let us just work out some examples. So examples of random variables. So I hope that all of you have seen most basic random variable for which the samples space X is just 0 1, and remember I said that this f of x is the random variable we have to talk about, it is just x; that is my random variable. This one is called the Bernoulli random variable, it is just a different view of Bernoulli random variable.
And when I talk about this, I also have to talk about probability of different events, so I only need to talk about probability of 0, let us say, it is 1 minus p; then the probability of 1, which is the complement of 0 is just 1, is just p because total probability must sum to 1. And from that, I can compute the probability of even the empty set, it must be always 0.
So this is, this random variable x is called the Bernoulli random variable and we, when we talk of random variable, we typically, we describe them by their distribution, so I will use the notation that X is distributed as Bernoulli p, so its takes two values 0 and 1, and the probability that it takes value 1 is just p, small p.
So what is the expected value of this X? So it is a discrete random variable, so this particular expectation which is an integral in general that is what we just established can be written as summation over different values that this random variable takes, so y and probability of that value. So this random variable can just take, we are writing a new form, so summation over y is a discrete random variable probability that x takes the value y times y.
So in this case, this is just takes two value. So probability that x takes the value 0, this is my notation for that times 0, so this guy does not matter, probability that x takes the value 1 times 1, so this is just p.
Another way to see the same formula is that this random variable X here, so this random variable here is simply the indicator function of an event, that event is what? What is subset A? Subset A is just 1. So this is 1 when X is in 1 and 0 otherwise, and therefore its expected value is simply probability of 1 and which is p; that is something we saw. So this is a very basic random variable, Bernoulli. Similarly, we can do another example.
You can think of Binomial random variable. Now, this (particular), so from now on we will not worry about underlying space x as such, we will directly think of the random variable and the place where it takes value.
So this particular random variable can be thought of as sum of n independent Bernoulli random variable. So n independent coin flips, so I will call it Y, it is just X1 plus X2 plus Xn and this X1 to Xn, all of them are independent and each Xi is distributed as Bernoulli p. Then we say that Y has a distribution, which is Binomial with two parameters n because I am adding n independent Bernoulli random variables and p because all of them are an expected value p.
So what is the expected value of Y? Well, expected value of y is expected value of X1 plus Xn. So this linearity of expectation is a property which I want you to use very frequently and this guy can be written as summation i equal to 1 to n expected value of Xi, each of this is p so it becomes np.
So I just wanted to do two examples quickly and another example which you can work out is that of a Gaussian random variable, which is the continuous random variable. So for now I will just continue, I just want to cover this two examples.
So we just learned of random variables. So now what we will do, we will model our unknown answers by random variables. Rather than thinking of X as inset and thinking of it having probabilities, we will think of the unknown answers as a random variable, that is what we will do.
So this is, so we learned about random variables, we learned about expectation, and then we decided that from here on, we will model our unknown answers, that guess list by a random variable. So we will behave as if that unknown answers that we all want to know is a random variable with certain distribution.
Now, who gave you that distribution, that distribution is a belief that you are putting on that random variable. So this is our premises from here. So I will stop with this part and then continue with the next part of the lecture.
Log in to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Sign up to save your progress and obtain a certificate in Alison’s free Diploma in the Foundations of Information Theory online course
Please enter you email address and we will mail you a link to reset your password.