Defining Regression Line and Making Predictions/Free Course/Alison

Regression
Once it has been determined that there is a significant relationship between the two data sets, the next step is to find the equation of the regression line drawn through the data pairs on the scatterplot.
Although an infinite number of lines can be drawn through the points on the scatterplot, only one line best fits the data. This is the line for which the overall distance of the points from the line, measured as the sum of the squared vertical distances, is at a minimum. It is known as the regression line.
Determining the regression line then enables predictions to be made.
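To make the idea of "overall distance" concrete, the sketch below scores a few candidate lines by the sum of their squared vertical distances to the points, which is the usual least-squares reading of best fit. The three data points and the candidate lines are hypothetical, chosen only for illustration; they are not taken from the course example.

```python
# Hypothetical toy data, not taken from the course example.
points = [(1, 2), (2, 3), (3, 5)]


def sum_squared_error(b0, b1, data):
    """Sum of squared vertical distances from each point to the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in data)


# Candidate lines as (intercept, slope) pairs.
# The last one is the least-squares line for this toy data.
candidates = {
    "y = 1 + 1.0x": (1.0, 1.0),
    "y = 0 + 1.5x": (0.0, 1.5),
    "y = 1/3 + 1.5x": (1 / 3, 1.5),
}

errors = {
    name: sum_squared_error(b0, b1, points)
    for name, (b0, b1) in candidates.items()
}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: sum of squared errors = {err:.4f}")
print("smallest error:", best)
```

Many lines pass near the points, but only one has the smallest total squared error. Regression finds that line directly rather than by trial and error.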
Dependent and independent variables
The equation of a straight line is y = mx + b, where b is the y-intercept and m is the slope. In statistics, the terms are often renamed and rearranged to be y = b_0 + b_1 x, where b_0 is the y-intercept and b_1 is the slope. y is the dependent variable and x is the independent variable.
In the example of age vs. height, the dependent variable, y, is height and the independent variable, x, is age. Think of this as height (y) depends on age (x).
The dependent and independent variables are known and consist of the values in the two data sets, y and x. What is unknown are the slope, b_1, and the y-intercept, b_0. Regression will find the slope and y-intercept, based on the best fit regression line. Then, the dependent variable, y, can be estimated, or predicted, by substituting a value for the independent variable, x, and then solving for y (the x value used must be within the range of the x values in the data set).
Best Fit Line
Slope of the best fit line
The equation for finding the slope of the regression line is:
b_1 = \frac{n(\Sigma xy) - (\Sigma x)(\Sigma y)}{n(\Sigma x^2) - (\Sigma x)^2}
where n is the number of (x, y) data pairs.
y-intercept of the best fit line
The equation for finding the y-intercept of the regression line is:
b_0 = \frac{\Sigma y - b_1 (\Sigma x)}{n}
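The two formulas can be sketched as a small Python function. The name fit_line and the toy data are assumptions made for illustration; the arithmetic follows the slope and y-intercept formulas above term by term.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the line y = b0 + b1*x using the summation formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    b0 = (sum_y - b1 * sum_x) / n                    # y-intercept
    return b0, b1


# Toy data that lies exactly on y = 1 + 2x (hypothetical, for checking only).
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the toy points lie exactly on a line, the function should recover that line's intercept and slope, which makes it a quick sanity check before using real data.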
Example: Laying the best fit line on the scatterplot
Assume the following data set of starting salaries of 10 students and their associated GPAs.
GPA    STARTING SALARY
3.7    52,000
3.9    55,269
3.8    53,300
3.4    44,119
4.0    53,161
3.3    43,500
3.7    49,080
3.1    43,500
3.7    52,000
3.5    50,700
A scatterplot of the data, where Starting Salary is the dependent (y) variable, and GPA is the associated independent variable (x) is shown below. Starting Salary is plotted in the range from 35,000 to 60,000 and GPA from 2.5 to 4.5.

The Pearson coefficient of correlation, r, is 0.909078, indicating a strong positive correlation, as it is very close to +1.
Looking up the computed r of 0.909078 in a table of critical values of r for n = 10, the computed r is greater than the critical value at both the .05 and .01 significance levels (0.632 and 0.765, respectively). The correlation is therefore statistically significant.
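The value of r can be reproduced from the table of GPAs and starting salaries. The sketch below assumes the standard computational formula for the Pearson coefficient, which is not restated in this lesson, and compares the result with the two critical values quoted above.

```python
import math

# Data from the example table.
gpa = [3.7, 3.9, 3.8, 3.4, 4.0, 3.3, 3.7, 3.1, 3.7, 3.5]
salary = [52000, 55269, 53300, 44119, 53161, 43500, 49080, 43500, 52000, 50700]

n = len(gpa)
sum_x = sum(gpa)
sum_y = sum(salary)
sum_xy = sum(x * y for x, y in zip(gpa, salary))
sum_x2 = sum(x * x for x in gpa)
sum_y2 = sum(y * y for y in salary)

# Pearson r written in terms of the same sums used by the slope formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Critical values of r for n = 10, as quoted in the text.
critical_05 = 0.632
critical_01 = 0.765

print(f"r = {r:.6f}")
print("significant at .05:", r > critical_05)
print("significant at .01:", r > critical_01)
```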
The dependent variable (y) is Starting Salary and the independent variable (x) is GPA. The form of the regression equation is:
y = b_0 + b_1 x
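Using the formula for the slope of the best fit line with n = 10, \Sigma x = 36.1, \Sigma y = 496,629, \Sigma xy = 1,802,983.7 and \Sigma x^2 = 131.03:
b_1 = \frac{10(1802983.7) - (36.1)(496629)}{10(131.03) - (36.1)^2} = \frac{101530.1}{7.09} = 14320.18
Using the formula for the y-intercept of the best fit line with the unrounded value of b_1:
b_0 = \frac{496629 - b_1 (36.1)}{10} = -2032.96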
The regression equation, the equation of the best fit line, is:
y = -2032.96 + 14320.18 x
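As a check on the arithmetic, the following sketch recomputes the four sums and both coefficients from the table of GPAs and starting salaries. The variable names are illustrative; the formulas are the slope and y-intercept formulas given earlier.

```python
# Data from the example table.
gpa = [3.7, 3.9, 3.8, 3.4, 4.0, 3.3, 3.7, 3.1, 3.7, 3.5]
salary = [52000, 55269, 53300, 44119, 53161, 43500, 49080, 43500, 52000, 50700]

n = len(gpa)
sum_x = sum(gpa)                                   # sum of GPAs
sum_y = sum(salary)                                # sum of salaries
sum_xy = sum(x * y for x, y in zip(gpa, salary))   # sum of GPA * salary
sum_x2 = sum(x * x for x in gpa)                   # sum of squared GPAs

# Slope and y-intercept from the summation formulas.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

print(f"n = {n}, sum x = {sum_x:.1f}, sum y = {sum_y}")
print(f"sum xy = {sum_xy:.1f}, sum x^2 = {sum_x2:.2f}")
print(f"y = {b0:.2f} + {b1:.2f} x")
```

The intercept is computed from the unrounded slope. Substituting the rounded slope 14320.18 instead shifts the intercept by about one cent.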
Laying the best fit line on the scatterplot gives us the following:

Making predictions
The regression equation can now be used for predictions. We can estimate, for example, that a student with a GPA of 3.2 can be expected to have a starting salary of approximately $43,791.
Salary = 14320(3.2) - 2033 = $43,791
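The prediction step can be wrapped in a small helper that also enforces the rule that x must lie within the range of the data. The function name predict_salary and the constant names are assumptions for illustration; the coefficients are the ones from the regression equation above, and the limits 3.1 and 4.0 are the smallest and largest GPAs in the table.

```python
# Coefficients of the regression equation from the example.
B0 = -2032.96   # y-intercept
B1 = 14320.18   # slope

# Smallest and largest GPA in the example data set.
GPA_MIN = 3.1
GPA_MAX = 4.0


def predict_salary(gpa):
    """Estimate starting salary for a GPA inside the range of the observed data."""
    if not GPA_MIN <= gpa <= GPA_MAX:
        raise ValueError(
            f"GPA {gpa} is outside the observed range {GPA_MIN} to {GPA_MAX}"
        )
    return B0 + B1 * gpa


print(f"{predict_salary(3.2):.2f}")
```

The result differs from the figure in the text by less than a dollar, because the text rounds the coefficients to whole numbers before substituting. A GPA such as 2.5 raises an error rather than returning an extrapolated estimate.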
