Module 1: Relationships Between Data Sets: Scatterplots, Correlation and Regression

    Correlation
    Correlation refers to relationships that exist between data sets. There are many kinds of relationships that can exist between two sets of data. For example, there is probably a relationship, and a significant correlation, between the ages of children and their heights. Correlation does not imply causation, however, even though in some cases one set of data may cause the other. For example, an increase in a child's age may cause an increase in height.
    The price of a slice of pizza over time in NYC is correlated with the price of a subway ride in NYC over the same time period. However, the increase in the price of the subway ride does not cause the increase in the price of the slice of pizza, or vice versa. Rather, a third factor causes both prices to go up together: the cost of living.
    Correlation is examined in three steps:
    1. A scatterplot is created and examined.
    2. The Pearson Coefficient of Correlation, r, is calculated and examined.
    3. The Pearson Coefficient of Correlation, r, is compared to the critical values for r for a specified level of significance.
    Scatterplots
    A scatterplot is a graph of the ordered pairs (x, y) consisting of data from two data sets, such as age (x) and height (y) of children. The scatterplot provides a quick visual indication of a relationship.
    After the scatterplot is drawn, we can analyze the graph to see if there is a pattern. If there is a noticeable pattern, such as the points falling in an approximately straight line, then a relationship between the two variables may exist.
    The scatterplot of age vs. height might look like one of the four below.
     
    Plot A indicates a positive linear relationship exists between age (x) and height (y) of children. As age increases, so does height.
    Plot B indicates a negative linear relationship exists between age (x) and height (y) of children. As age increases, height decreases.
    Plot C indicates a non-linear relationship exists between age (x) and height (y) of children.
    Plot D indicates no relationship between age (x) and height (y) of children.
    If our plot looks like A or B above, then we would move on to the second step, which is to calculate the coefficient of correlation, r.
    The Pearson Coefficient of Correlation, r
    The correlation coefficient, r, is a number that describes how closely two data sets follow a linear relationship.
    Correlation coefficients range from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The closer this number is to +1 or -1, the stronger the linear relationship between the data sets. A correlation coefficient close to zero indicates that the data are most likely not linearly related; a non-linear relationship, as in Plot C, can still produce an r near zero.
    The formula for calculating r is:
    \[ r = \frac{n(\Sigma xy) - (\Sigma x)(\Sigma y)}{\sqrt{[n(\Sigma x^2) - (\Sigma x)^2][n(\Sigma y^2) - (\Sigma y)^2]}} \]
    Where n = the number of data pairs
    \(\Sigma x\) = the sum of the x values
    \(\Sigma y\) = the sum of the y values
    \(\Sigma xy\) = the sum of the products of the x and y values for each pair
    \(\Sigma x^2\) = the sum of the squares of the x values
    \(\Sigma y^2\) = the sum of the squares of the y values
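The formula for r can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the function name `pearson_r` and the sample data are assumptions made up for the example, and the code uses only the standard library.

```python
import math


def pearson_r(xs, ys):
    """Compute r from the sums formula (hypothetical helper name)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)                                  # number of data pairs
    sum_x = sum(xs)                              # Sigma x
    sum_y = sum(ys)                              # Sigma y
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # Sigma xy
    sum_x2 = sum(x * x for x in xs)              # Sigma x^2
    sum_y2 = sum(y * y for y in ys)              # Sigma y^2

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        # Happens when every x (or every y) is identical.
        raise ValueError("r is undefined when all x or all y values are equal")
    return numerator / denominator


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))  # 0.7746
```

For this sample, the numerator is 5(66) - (15)(20) = 30 and the denominator is the square root of (50)(30) = 1500, which gives r of about 0.7746.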
    Create a table to facilitate the calculation.
    Fill the table with the x and y values. Then find the values for the product xy, the values for \(x^2\), and the values for \(y^2\), and enter them into the table. Then sum the values and substitute them into the formula for r.
    | x              | y              | xy              | x^2              | y^2              |
    |                |                |                 |                  |                  |
    | \(\Sigma x\):  | \(\Sigma y\):  | \(\Sigma xy\):  | \(\Sigma x^2\):  | \(\Sigma y^2\):  |
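The table-filling step can also be sketched in Python. This is only an illustration under stated assumptions: the helper name `correlation_table` and the five data pairs are made up, not taken from the lesson.

```python
def correlation_table(xs, ys):
    """Return the table rows (x, y, xy, x^2, y^2) and the column sums."""
    rows = [(x, y, x * y, x * x, y * y) for x, y in zip(xs, ys)]
    # zip(*rows) regroups the rows into columns so each can be summed.
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
rows, totals = correlation_table(xs, ys)

header = ("x", "y", "xy", "x^2", "y^2")
print("".join(f"{name:>6}" for name in header))
for row in rows:
    print("".join(f"{value:>6}" for value in row))
print("".join(f"{total:>6}" for total in totals) + "   (column sums)")
```

For this sample the column sums are 15, 20, 66, 55 and 86, which are the five quantities the formula for r needs, together with n = 5.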
    When r is close to \(\pm 1\), statistical correlation is easy to claim. The problem occurs when the coefficient of correlation is somewhere around \(\pm 0.5\). Is there statistical correlation? To determine if the coefficient of correlation is significant, look up the computed r in the r table of critical levels.
    Critical Levels of r
    The table of critical levels of r enables a determination of statistical correlation with a specified level of confidence.
    To be 95% confident of statistical correlation, the computed r value is looked up in the column for a critical value of .05 with n pairs of data. If the absolute value of the computed r is greater than the value in the table, then it can be stated with 95% confidence that there is statistical correlation.
    To be 99% confident, the computed r value is looked up in the column for a critical value of .01 with n pairs of data. If the absolute value of the computed r is greater than the value in the table, then it can be stated with 99% confidence that there is statistical correlation.
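The comparison against the table can be sketched as follows. The function name `is_significant` is hypothetical, and the critical value 0.878 is an assumption: it is the two-tailed value commonly listed for 5 data pairs at the .05 level, and should be confirmed against the table used in this course. The sketch also assumes a negative r is compared by its magnitude.

```python
def is_significant(r, critical_value):
    """True when the magnitude of r exceeds the table's critical value."""
    return abs(r) > critical_value


# Assumed critical value for n = 5 pairs at the .05 level;
# confirm it against the r table in use.
critical_05 = 0.878

# Hypothetical computed r values for n = 5 pairs.
print(is_significant(0.7746, critical_05))  # False: cannot claim correlation
print(is_significant(-0.95, critical_05))   # True: magnitude exceeds 0.878
```

Under these assumptions, an r of about 0.77 looks fairly strong but, with only five pairs, does not clear the .05 critical value.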