Association between Numerical Variables

In this lecture, we study association between quantitative, or numerical, variables. In the previous couple of lectures, we looked at association between categorical variables. There we looked at measures like chi square and Cramer's V for categorical variables, and now we will try to find out what the equivalent measures are for quantitative or numerical variables.

Now, let us look at some data and let us say that we are trying to look at the association between salary and saving. So, we have 2 variables; one is the salary of the person and the other is the saving of the person. One of the variables acts as the x variable, or the x axis variable, and the other acts as the y variable, or the y axis variable. Generally, the explanatory variable is the x axis variable and the response variable is the y axis variable. So in this example, we can assume that the salary is the reason for saving, because people get income and salary from which they save. Therefore, the saving becomes the y axis variable and the salary becomes the x axis variable.

So, let us assume that we have 100 pieces of data for salary and saving, each a pair x, y, where x is the salary and y is the saving. And let us assume we have plotted these 100 points, and this is what we get.

Now, the same picture is shown on the left hand side. Let us do something else. We started with 100 sets of data, each having an x and a y. Now suppose we randomize the x and y, in the sense that we make a random sort of x and then a random sort of y, which means that in the new data the x and y are no longer paired as they were in the original data. We still get 100 pairs, but with x and y matched randomly, and then we plot that and get this kind of a plot. So, this is how the data looks and this is the random version.
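As a minimal sketch of the randomization just described, the following Python shuffles x and y independently so that the original pairing is lost while the individual values stay the same. The data is made up for illustration (saving taken as roughly a quarter of salary plus noise), and the function name `shuffle_pairing` is hypothetical, not something from the lecture.

```python
import random


def shuffle_pairing(xs, ys, seed=0):
    """Return copies of xs and ys, each shuffled independently.

    The values are unchanged; only the x-y pairing is broken.
    """
    rng = random.Random(seed)
    new_xs = list(xs)
    new_ys = list(ys)
    rng.shuffle(new_xs)
    rng.shuffle(new_ys)
    return new_xs, new_ys


# Hypothetical data: 100 (salary, saving) pairs, not the lecture's data set.
data_rng = random.Random(42)
salary = [data_rng.uniform(20_000, 300_000) for _ in range(100)]
saving = [0.25 * s + data_rng.gauss(0, 5_000) for s in salary]

rand_salary, rand_saving = shuffle_pairing(salary, saving)

# The same 100 values are still present on each axis,
# so the averages of x and y are unchanged by the shuffle.
print(sorted(rand_salary) == sorted(salary))
print(sorted(rand_saving) == sorted(saving))
```

Plotting `salary` against `saving` and `rand_salary` against `rand_saving` side by side would give the two pictures being compared.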
Now, the first impression is that the data looks very different from the random version, and therefore the first impression is that there is association between these 2 variables. If they look alike, or look reasonably similar to the eye, then one could say that there is no association. In this case they look different and therefore we can say that there is association. How different they are depends on how we read the pictures. But I am sure most of us would agree that these 2 pictures are not similar; they look different and therefore there is some association between the salary and the saving.

Now, we describe association in terms of several things. The trend: is there an upward or a downward association, which means as x increases does y increase, or as x increases does y decrease? We could look at curvature, which means is it linear, a straight line, or does it show a curve? Then we look at variation: are the points tightly clustered along the line, or are they further away? And then, are there outliers; are there points that lie completely away from the rest, are there surprises and so on? So, we will try to look at some of these in this lecture.

Now, let us look at another set of data on salary versus savings, and let us say this is how the data looks, with x as the salary and y as the response variable, which is the saving. Now we do the randomized picture of this, where we still have 100 points but the x and y are sorted completely randomly, which means they do not have the old x, y pairing. When we do a similar exercise, this is how the original data looked and this is how the randomized data looks. And do they look different?
Maybe they do not; if we look very, very carefully one might say they look different, but to the eye one might get a feeling that both of them look a little cluttered, and here we might conclude that there is actually no association, or very little association, between the two.

So, in the previous example we saw that there is a vast difference, so we said there is association. Here there is not so much of a difference, so we said maybe there is not so much association. But then, how do we compute this? Is there a measure or a metric that tells us whether there is association or no association among these variables? We will see those metrics as we move along.

Now, to do this let us take this kind of data and plot the x bar and y bar, which is shown in this picture. This point is our x bar, y bar that we can calculate. Now, we have already seen this measure called covariance, so we visit the covariance again. Covariance quantifies the strength of linear association between two numerical variables. It measures the degree to which data concentrates along the diagonal.

So, given the points x 1, y 1, x 2, y 2 and so on, and the means x bar, y bar, the covariance is given by the sum of x minus x bar into y minus y bar, divided by n. I have also raised a question: is it n or is it n minus 1? We can assume n, and later when we find correlations and other measures, we consistently use the same denominator so that there is no bias in the calculation. So, we use n in this case. The sum of x minus x bar into y minus y bar, by n, is the covariance between these two.

So again, we take the same pictures with the averages shown. We go back to this picture and try to find x bar and y bar. In this picture x bar is 125523 and y bar is 31978. In the other picture x bar is 113403 and y bar is 14382. This is not the random picture.
This is the second set of data. So, let us go back: this is one set of data, data set 1, and this is data set 2, and you can see these 2 pictures here. So, this is data set 1, which is reflected here, and this is our data set 2, which is expanded here. They show different x bars and y bars. If they had been the same data and its randomized version, the x bar and y bar would not change, because the same 100 values would be used. Therefore, these two represent two different sets of data. This is called data set 1; this is called data set 2.

Now, x bar is 125523 for data set 1 and 113403 for data set 2, because the data sets are different. y bar is 31978 for data set 1 and 14382 for data set 2. So, we find the covariance of both as the sum of x minus x bar into y minus y bar, by n, with the summation over all 100 values. The covariance in the first case becomes 6282326586. The covariance in the second case is 527175843. So, which has a higher covariance? We quickly count the number of digits and realize that there are 10 digits in the first number and 9 digits in the second. Therefore, data set 1 shows higher covariance compared to data set 2. So, what do we make out from these numbers? Generally, if the covariance is higher, there could be some association in the data, and between comparable data sets the one that has higher covariance seems to have more association.

So, for the first set of data x bar is 125523 and y bar is 31978. Covariance is 6282326586, and then we calculate the standard deviation of x, S x, which is 148035, and S y, which is 44851.54. Now, we compute the correlation. Correlation is equal to covariance divided by the product of the standard deviations, that is, covariance by S x S y. So, 6282326586 divided by 148035 into 44851.54 is 0.9462. So, the correlation for this data is 0.9462. We have already studied the correlation coefficient and its computation.
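The computation above can be sketched in Python as follows. This is only an illustration: the helper names are made up, the small x and y lists are toy values rather than the lecture's 100 data points, and the denominator is n as assumed in the lecture. The last lines simply redo the division with the summary figures quoted for data set 1.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance with n in the denominator, as assumed in the lecture."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    """Standard deviation with the same n denominator (variance = cov(x, x))."""
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Toy data (hypothetical): y is exactly 2x, so r should be essentially 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(covariance(xs, ys), correlation(xs, ys))

# Redoing the division with the summary numbers quoted for data set 1.
r_lecture = 6282326586 / (148035 * 44851.54)
print(round(r_lecture, 4))  # 0.9462
```

Using the same denominator in both the covariance and the standard deviations is what makes the ratio consistent; with n minus 1 in both places the correlation would come out the same.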
So, we are now doing it one more time to show that the correlation is 0.9462. Also remember that covariance can be negative, as we saw in an earlier lecture, whereas standard deviations are non-negative; they are either 0 or positive. And because the correlation coefficient is covariance divided by a positive quantity, the correlation can be negative as well as positive. The range is between minus 1 and plus 1, and in this case we have a correlation of 0.9462, which is very close to 1, so we could say that the data is indeed associated.

Now, before we look at the second set of data, let us look at these points. Correlation measures the strength of linear association between numerical variables. What is important is linear association between numerical variables. r is always between minus 1 and plus 1, and r does not have units. Standard deviation has units. So, S x has a unit; in this case it is money, or rupees. S y also has a unit, which is money or rupees, and the product of S x and S y is in rupees squared, or money squared. Covariance is the sum of x minus x bar into y minus y bar, by n, so it also has the unit money squared, and therefore correlation does not have a unit. It is a unitless quantity between minus 1 and plus 1.

So, when we look at the second set of data, x bar is 113403; y bar is 14382; S x is 134858; S y is 44243.6. Covariance is 527175843, and when we compute the correlation we get 0.088. So, the correlation is close to 0 and therefore there is no linear association, or very little, between salary and savings in this case.

Now, let us look at another example, where we have age of the person versus salary, and let us say we have picked up these 10 points. We first plot these 10 points. So, these 10 points are plotted.
We get a scatter plot of these 10 points, and let us say we also fit a line through them using software. We then want to ask a question: is there a positive association or a negative association in this data? The line seems to say that, at least for the data that we have looked at, there is a negative association between age and salary. This is also indicated by a computed correlation coefficient of minus 0.22705. Now, there is also an outlier, a point which is well away from the rest. So, this example touches on the 4 questions that we looked at, and there is an outlier in this example.

It is also possible to find the correlation when we have more than 2 numerical variables. So, we have age, we have height and we have weight. Let us say we have this data for boys in the age group of 11 to 20, and let us say we picked up one person with age 11, one person with age 12 and so on, with the height and weight of each. So, the correlation between age and age, that is age and itself, is 1. In this example, we can compute the correlation between age and height, which works out to be 0.9015, and you can do the rest. Correlation between height and height is also 1 and correlation between weight and weight is also 1. Correlation between age and height is the same as correlation between height and age, and therefore this matrix will be a symmetric matrix with 1 along the diagonal. So, effectively it is enough to find only 3 numbers: age versus height, height versus weight and age versus weight. You can do a similar exercise: let us say the marks obtained in 3 subjects, Mathematics, English and Science, by high school students are given. We could do the same and complete this table.

Now, let us look at another set of data. Here the age is given and the salary is also given. So, we compute the correlation and we get correlation equal to 0.8806.
So, if we fit a line, which is shown here, this line, as I mentioned in an earlier lecture, has something called r square, which is a goodness of fit and is 0.7756. We also realize that this r square is actually the square of the correlation coefficient of 0.8806. So, 0.8806 squared becomes 0.7756. But all this holds only if we decide to fit a straight line. If we fit a curve, then we have to have different measures of association. We already saw that covariance or correlation is a measure of linear association between numerical variables.

So, correlation measures the strength of linear association between variables. The larger the absolute value of r, the more closely the data clusters along the line. We can use r to find the equation of this line and we can predict y for a given x. We can do all this when we have the correlation coefficient. And consider the z score; we have not yet discussed the z score in detail, but the z score is the deviation from the mean divided by the standard deviation. Correlation converts the z score of one variable into the z score of another variable.

So, these are the equations that I have given. z x is x minus x bar by S x; z y is y minus y bar by S y; and then the equation of the line is: predicted z y is equal to r into z x. From this we can get a and b, the intercept and slope of the line. And we are just showing these computations. So, again, for this data the correlation is 0.8806. You can do this calculation from x bar, y bar, S x, S y and r, and from this we quickly get b, which is 9473.28, and then a, which is minus 222495. If we actually fit the line using software, we would get 9473.8. We got 9473.28 in our calculation, a small approximation, and the software's minus 222514 was calculated by us as minus 222495.
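One way to read the step from the z score equation to a and b is to expand "predicted z y equals r into z x" back into x and y, which gives slope b equal to r times S y by S x, and intercept a equal to y bar minus b times x bar. The Python sketch below works under that reading; the age and salary values are made up (the lecture's table is not reproduced here), and the function name is hypothetical.

```python
import math


def line_from_correlation(xs, ys):
    """Return (a, b, r) for the line y = a + b*x obtained from r, S x, S y.

    Uses n in every denominator, as in the lecture.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    r = cov / (s_x * s_y)
    b = r * s_y / s_x        # slope: from z_y = r * z_x
    a = y_bar - b * x_bar    # intercept: the line passes through (x bar, y bar)
    return a, b, r


# Hypothetical age (years) and salary values, for illustration only.
ages = [25, 30, 35, 40, 45, 50]
salaries = [20000, 52000, 110000, 148000, 215000, 240000]

a, b, r = line_from_correlation(ages, salaries)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
print(f"predicted salary at age 38: {a + b * 38:.0f}")
```

For a straight-line fit, the slope obtained this way is the same as the least squares slope, which is why the hand calculation and the software output in the lecture agree up to rounding.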
r square was 0.7756 and the correlation is 0.8806. I have shown similar pictures, and here I have actually shown how curves can also be fit for this kind of data. But we will restrict ourselves to linear association. We have to understand that the analysis with r is only for linear association between the variables.

Now, we also have to understand one more thing, which we look at through another set of data. Let us show 2 sets of data. Let us assume that these are CAT scores, or exam scores, of 6 students, and let us say these are scores made by a cricketer in 6 innings. So, we could treat one of them as the x variable and the other as the y variable, and if we only apply the math and try to find the correlation, we get r equal to 0.7834. So, if I had not told you that this set of values represents CAT scores and this set represents runs made by a cricketer, and had simply asked you to find the correlation coefficient, you would get 0.7834. And if I had asked whether there is an association, you might say yes, there is a reasonably high correlation; maybe there is an association between this x and this y.

But the moment we say that this represents CAT scores and this represents the runs made by a cricketer, we realize that there need not be, and will not be, a meaningful association. So, better knowledge of the variables can help us understand causation. Scatter plots and correlations reveal association. They do not tell us causation. For example, given this as the x variable and this as the y variable, I can apply the math and calculate a number.
But how well I interpret the number will actually depend on the variables that we are studying, and it is important to know those variables first, before we even attempt to find whether there is a cause and effect between these 2 variables. Just by themselves, without defining what these variables are, is there an association? Yes, there is an association, with a high value of r.

So, with this we complete this lecture, and in the next lecture we will look at some numerical examples of association among numerical variables, after which we will start studying probability in further detail.

In this lecture we continue to discuss association between numerical variables. So, we look at a few exercise questions, try to understand the concepts further, and then we will summarize what we have learnt in the 11 lectures and wind up the discussion on statistics in this course, and then go on to probability and start models for probability in the next lecture.

So, we have looked at association between numerical variables. We first looked at covariance as a measure of association, and we also said that covariance can be negative. Then from the covariance we moved on to the correlation coefficient, which is a more compact measure because it takes on values between minus 1 and plus 1. And because covariance can be negative while the individual standard deviations are positive, the correlation coefficient can also take negative values; it takes values between minus 1 and plus 1.

So, with this let us move on to some true or false questions. The x axis of the scatter plot has the explanatory variable. The answer is also given, and the answer is true. The x axis has the independent variable, the variable that tries to explain something happening, and the y axis has the variable on which the effect of the explanatory variable is felt. Therefore, the statement that the x axis has the explanatory variable is true.

Question number 2: the presence of a pattern indicates that the response variable
increases as the explanatory variable increases. The answer is not necessarily true, because we may have a pattern where as the x variable increases the y variable decreases; that happens when there is a negative correlation. So, it is not entirely true, though one might be tempted to think it is true because our mind normally makes us believe that there is a positive correlation. If there is a positive correlation, then as x increases y also increases; if there is a negative correlation, as x increases y decreases. Therefore, the statement that the presence of a pattern indicates that the response variable increases as the explanatory variable increases is not entirely true.

Third question: consider a situation where the net profit is about 10 percent of the sales. The scatter plot should be thought of as a line. So, the question is whether this looks like a line or would be non-linear. Now, net profit being about 10 percent of the sales gives us an indication that we have a line of the form y is equal to a plus 0.1 x. Roughly, the slope can be thought of as 0.1, and therefore one can believe that when we start plotting this data, it would approximate a line, so the answer could be true for this statement.

Statement number 4: if the correlation of a stock with the economy is 1, it is good to buy the stock when there is a recession. The answer given here is false, because the stock is entirely dependent on the economy and perfectly correlated with it, with a correlation of 1. So, when the economy is down the stock will also be down. It depends on what we want to do with the stock: if you want to trade it very regularly, buy and sell the next day and so on, then it is not a very good thing, and the answer is false. But if we have a person who simply buys the stock, keeps it for a very long time and waits for the economy to recover, so that the stock prices also go up, and then the
person wants to sell it, then the answer could be true. But in general the answer is false, because if the economy is down the stock price will also be down.

Question number 5: the covariance between the number of employees and the production quantity is computed with daily data; it is expected to increase if the data is aggregated to monthly. Yes; as we aggregate the data, we realize that the covariance increases.

So, these questions have helped us understand what correlation is, what covariance is, and how we model a linear relationship; they also help us understand what an explanatory variable is and what the dependent variable is, and so on.

So, let us move to the next, a simple question: find the explanatory variable and the response variable. The explanatory variable is the x variable and the response variable is the y variable. So, we have to look at these situations and find out which variable has an effect on the other, or which one can be explained by the other.

First, marks obtained in an exam and hours of study. As the student puts in more effort in terms of more hours of study, the mark is expected to increase. So, hours of study is the x variable, or the explanatory variable, while the marks obtained is the y variable, or the response variable.

Second, number of workers and quantity produced, or units produced. Here, as we put in more workers, we end up producing more quantity. Therefore, number of workers is the x variable, or the explanatory variable, and units produced is the y variable, or the response variable.

Third, time taken to run a particular distance and the weight of a person. There is a general assumption that a heavier person would take more time to run. Therefore, in this case weight of the person can be the x variable, or the explanatory variable, and the time taken to run is the response variable, or the y variable.

Fourth, total revenue and items sold. Again, the assumption is that as we sell more items, the revenue
increases; the revenue comes from the sale of items. So, items sold is the x variable, or the explanatory variable, while total revenue is the y variable, or the response variable.

Fifth, the amount of time spent on doing exercises and the body weight. Again, there is a general assumption here that as we spend more time exercising the body weight reduces, so the amount of time spent on exercise has an effect on the body weight. Therefore, the time spent on exercise would be the x variable, or the explanatory variable, and the weight of the person would be the y variable, or the response variable.

Move to the next question: the correlation between number of customers and sales in rupees is 0.8. Does the correlation change if the sale is measured in 1000s of rupees? The answer is that the correlation does not change when sales are measured in 1000s of rupees or in equivalent denominations. For example, you could have a set where the sale is given in rupees and we multiply by a constant to convert it into dollars or some other currency; as long as we multiply every value by the same positive constant, the correlation does not change. Measuring the sale in 1000s of rupees is equivalent to dividing it by 1000. So, it does not change.

Question number 3: would the correlation change if we add a constant to a variable, or if we multiply it by a constant? We will answer the first part first and then the second. The correlation would not change if we add a constant to a variable. Let us assume we are adding a constant to the y variable, and that the constant is positive. As we add the same constant to each of the y values, y bar would increase by the same constant, and therefore y minus y bar would remain the same in all these cases. When y minus y bar remains the same, the variance of y remains the same and the standard deviation of y remains the same. The covariance would also remain the same because y minus y
bar does not change. So the covariance remains the same, the standard deviation remains the same, and therefore the correlation coefficient would also remain the same.

What happens if we multiply by a constant? This was the case in question 2, where the sale is measured in 1000s of rupees. Let us say we multiply the x variable by a positive constant. Then x bar gets multiplied by the same constant, and so each x minus x bar gets multiplied by the same constant. Since x minus x bar gets multiplied by the constant, the covariance also gets multiplied by the same constant. With respect to the standard deviation, when we compute the variance we square x minus x bar, so the variance gets multiplied by the square of the constant; then to get the standard deviation we take the square root, and therefore the standard deviation gets multiplied by the same constant. So the covariance gets multiplied by the constant and the standard deviation gets multiplied by the constant, and therefore the correlation coefficient remains the same, because both the numerator and the denominator are multiplied by the same constant.

In the case of addition, the numerator and denominator remain the same, and therefore the ratio is the same. In the case of multiplication, both the numerator and the denominator get multiplied by the same constant, and therefore the ratio remains the same.

Question number 4: Cramer's V measures association between categorical variables, and correlation is used as a measure for numerical variables. Correlation can be between minus 1 and plus 1. Can Cramer's V be negative? Why or why not? From what we saw in the earlier lectures, Cramer's V is the square root of chi square divided by n times the minimum of the number of rows minus 1 and the number of columns minus 1. So, in Cramer's V the denominator is a positive
quantity, while the numerator, which is the value of chi square, is also positive or 0 because it is a sum of squared numbers. Therefore, the way we compute Cramer's V, it cannot take a negative value. The correlation coefficient also has a numerator and a denominator. The denominator, which is the product of the standard deviations, is either 0 or positive, whereas the numerator, which is the covariance, can be negative. We said correlation is between minus 1 and plus 1; plus 1 indicates a positive association and minus 1 indicates an association in the opposite direction.

Now, since we look at categorical variables in Cramer's V, we only check whether there is an association, and we do not further qualify the association as positive or negative. In categorical variables there is no question of a difference between the values; there is only a category. Therefore it is only fair that Cramer's V shows whether there is an association or not, but does not try to say whether it is a positive association. So, Cramer's V will take a non-negative value, whereas correlation can also show a negative association, where as x increases y decreases.

Next question: ten students took a test and, after studying for a week, took another test on the same portions; let us say the marks are given. Would you expect these scores to be associated? Most probably yes, because we assume that when they took the first test they were already reasonably good, and the extra study would help them get a slightly higher mark than what they got in the first test. So, we would expect an association.

Now, what is the relationship between the marks? We can calculate the correlation coefficient in this case, and we can also expect the marks to increase. If we actually compute the correlation coefficient, which you can do as an exercise, it would be
very close to 1; I think in this case we get about 0.98 as the correlation coefficient.

The student with the highest score in the first test has not got the highest in the second. Is it an indication that he has not performed very well? In some ways the answer lies in the correlation. If we look at the second column, the highest mark is 78, which is got by a person who got 72 in the first test, whereas the person who got 77 in the first also got 77 in the second. If the correlation had been exactly plus 1, then it is quite likely that there would be an increase for each one of them. Since it is not plus 1, only very close to 1, these things can happen, but certainly that is not an indication that the person who got the highest in the first has not performed well in the second.

So, with this we come to the end of our discussion on association among numerical variables. We will just spend a minute to summarize what we have seen in these 11 lectures. With this 11th lecture we complete the course content on statistics, or introduction to statistics, and from the next lecture we move on to probability.

So, we began with defining statistics and trying to understand why we study this subject. Then we started understanding data; we understood that data need not be numbers, data can also be text and information, and we learned to categorize data into 2 broad types and 4 types in all. Then we looked at each of these classifications, categorical data and numerical data, and tried to identify measures of central tendency. We said that for categorical data we have the mode; if the data is ordinal, then also the median; and if the data is numerical, interval or ratio, then we could have mean, median and mode. We also defined standard deviation and variance, so numerical data could have measures of dispersion as well.

For the categorical data we then looked at association, and before that we also looked at the inter quartile range if the data can
be sorted and ordered, and we also did the inter quartile range for the numerical data. Then we moved on to define measures of association between categorical data, and defined chi square and Cramer's V. And then we moved on to define measures of association for numerical data. In summarizing the data we also looked at the coefficient of variation, and as regards measures of association we looked at covariance and then at the correlation coefficient.

So, with this we come to the end of the course content for the statistics portion of this course, and in the next lecture we will start probability.
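Returning to the earlier exercise on adding a constant or multiplying by a constant, the claim can be checked numerically with a short Python sketch. The customer and sales figures below are made up for illustration and are not from the lecture, and the function name is hypothetical; the denominator is n throughout, as in the lecture.

```python
import math


def correlation(xs, ys):
    """Correlation as covariance over the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


# Hypothetical data: number of customers and sales in rupees.
customers = [12, 18, 25, 31, 40, 44]
sales_rupees = [15000, 21000, 26000, 39000, 41000, 52000]

r_original = correlation(customers, sales_rupees)

# Sales measured in 1000s of rupees: every value divided by 1000.
r_thousands = correlation(customers, [s / 1000 for s in sales_rupees])

# The same constant added to every sales value.
r_shifted = correlation(customers, [s + 5000 for s in sales_rupees])

print(r_original, r_thousands, r_shifted)
```

The three printed values should agree up to floating point rounding. Note that the argument relies on the multiplier being positive; multiplying one variable by a negative constant would keep the magnitude of r but flip its sign.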