
Module 1: Relationships Between Data Sets: Scatterplots, Correlation and Regression

    Regression
    Once it has been determined that there is a significant relationship between the data sets, the next step is to find the equation of the regression line that is drawn through the data pairs on the scatterplot.
    Although an infinite number of lines can be drawn through the points on the scatterplot, there is only one line that can be characterized as the line that best fits the data. This is the line for which the sum of the squared vertical distances between the points and the line is at a minimum. It is known as the regression line, or the least-squares line.
    Determining the regression line then enables predictions to be made.
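    To make "best fit" concrete, the sketch below scores a few candidate lines by the sum of their squared vertical distances to the points. The data and the helper name sum_squared_residuals are made up for illustration and are not part of the lesson; the first candidate is the least-squares line for this toy data, and it should score lowest.

```python
# Toy data, invented for illustration (not from the lesson).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of squared vertical distances from each point to the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Candidate lines as (b0, b1). The first is the least-squares line for the toy data.
candidates = {
    "least squares": (0.15, 1.94),
    "steeper": (0.0, 2.2),
    "flatter": (1.0, 1.5),
}

scores = {
    name: sum_squared_residuals(b0, b1, xs, ys)
    for name, (b0, b1) in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

    Any other choice of slope and intercept gives a larger total than the least-squares line, which is what singles that line out among the infinitely many that could be drawn.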
    Dependent and independent variables
    The equation of a straight line is y = mx + b, where b is the y-intercept and m is the slope. In statistics, the terms are often renamed and rearranged to be y = b_0 + b_1 x, where b_0 is the y-intercept and b_1 is the slope. y is the dependent variable and x is the independent variable.
    In the example of age vs. height, the dependent variable, y, is height and the independent variable, x, is age. Think of this as height (y) depends on age (x).
    The dependent and independent variables are known and consist of the values in the two data sets, y and x. What is unknown are the slope, b_1, and the y-intercept, b_0. Regression will find the slope and y-intercept, based on the best fit regression line. Then, the dependent variable, y, can be estimated, or predicted, by substituting a value for the independent variable, x, and then solving for y (the x value used must be within the range of values in the data set, x).
    Best Fit Line
    Slope of the best fit line
    The equation for finding the slope of the regression line is:
    b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}
    where n is the number of (x, y) data pairs.
    y-intercept of the best fit line
    The equation for finding the y-intercept of the regression line is:
    b_0 = \frac{\sum y - b_1 (\sum x)}{n}
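    The two formulas translate almost term for term into code. The following is a minimal sketch, not a reference implementation; the function name fit_line and the toy points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the best fit line y = b0 + b1*x.

    Uses the summation formulas for the slope and the y-intercept.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    b0 = (sum_y - b1 * sum_x) / n                    # y-intercept
    return b0, b1


# Toy points that lie exactly on y = 1 + 2x (made up for illustration).
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

    Note that the slope must be computed first, because the y-intercept formula uses b_1.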
    Example: Laying the best fit line on the scatterplot
    Assume the following data set of starting salaries of 10 students and their associated GPAs.
    GPA     STARTING SALARY ($)
    3.7     52,000
    3.9     55,269
    3.8     53,300
    3.4     44,119
    4.0     53,161
    3.3     43,500
    3.7     49,080
    3.1     43,500
    3.7     52,000
    3.5     50,700
    A scatterplot of the data, where Starting Salary is the dependent (y) variable, and GPA is the associated independent variable (x) is shown below. Starting Salary is plotted in the range from 35,000 to 60,000 and GPA from 2.5 to 4.5.
     
    The Pearson coefficient of correlation, r, is 0.909078, indicating a strong positive linear correlation, as it is very close to +1.
    Looking up the computed r of 0.909078 in an r table of critical values for n=10, the computed r is greater than the table r at both the .05 and .01 significance levels. The table r at significance level .05 is 0.632 and at .01 is 0.765. The correlation is therefore statistically significant at both levels.
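    The value of r can be checked directly from the table of GPAs and starting salaries. The sketch below assumes the standard computational formula for the Pearson coefficient, which is not restated in this section; the function name pearson_r is illustrative, and the critical values are the ones quoted above.

```python
import math

# Data from the GPA / starting salary table.
gpa = [3.7, 3.9, 3.8, 3.4, 4.0, 3.3, 3.7, 3.1, 3.7, 3.5]
salary = [52000, 55269, 53300, 44119, 53161, 43500, 49080, 43500, 52000, 50700]


def pearson_r(xs, ys):
    """Pearson correlation coefficient via the summation formula (assumed here)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(gpa, salary)

# Table r values quoted in the text for n = 10, keyed by level.
critical_values = {0.05: 0.632, 0.01: 0.765}
for level, table_r in critical_values.items():
    print(f"level {level}: r = {r:.6f}, table r = {table_r}, exceeds: {abs(r) > table_r}")
```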
    The dependent variable (y) is Starting Salary and the independent variable (x) is GPA. The form of the regression equation is:
    y = b_0 + b_1 x
    Using the formula for the slope of the best fit line, b_1, for n=10:
    b_1 = 14320.18
    Using the formula for the y-intercept of the best fit line, b_0, for n=10:
    b_0 = -2032.96
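    For readers who want to follow the arithmetic, the intermediate sums can be worked out from the ten data pairs in the table. The figures below were computed from that table rather than quoted from the lesson, and b_0 uses the unrounded value of b_1.

```latex
\begin{aligned}
\sum x &= 36.1, \qquad \sum y = 496{,}629, \qquad \sum xy = 1{,}802{,}983.7, \qquad \sum x^2 = 131.03 \\[4pt]
b_1 &= \frac{10(1{,}802{,}983.7) - (36.1)(496{,}629)}{10(131.03) - (36.1)^2}
     = \frac{101{,}530.1}{7.09} \approx 14320.18 \\[4pt]
b_0 &= \frac{496{,}629 - b_1(36.1)}{10} \approx -2032.96
\end{aligned}
```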
    The regression equation, the equation of the best fit line, is:
    y = -2032.96 + 14320.18 x
    Laying the best fit line on the scatterplot gives us the following:
     
    Making predictions
    The regression equation can now be used for predictions. We can estimate, for example, that a student with a GPA of 3.2 can be expected to have a starting salary of approximately $43,791.
    Salary = 14320(3.2) - 2033 = $43,791
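    The prediction step can be wrapped in a small function that also enforces the earlier rule that x must lie within the range of the data (here, GPAs from 3.1 to 4.0). This is a sketch under those assumptions; predict_salary and the constant names are hypothetical, and the coefficients are the two-decimal values from the regression equation.

```python
B0 = -2032.96   # y-intercept of the regression equation
B1 = 14320.18   # slope of the regression equation
GPA_MIN, GPA_MAX = 3.1, 4.0  # smallest and largest GPA in the data table


def predict_salary(gpa):
    """Estimate starting salary from GPA, refusing values outside the observed range."""
    if not GPA_MIN <= gpa <= GPA_MAX:
        raise ValueError(
            f"GPA {gpa} is outside the observed range {GPA_MIN} to {GPA_MAX}"
        )
    return B0 + B1 * gpa


estimate = predict_salary(3.2)
print(f"${estimate:,.2f}")
```

    With the two-decimal coefficients the estimate comes to about $43,791.62; rounding the coefficients to 14320 and 2033 first, as in the line above, gives $43,791.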