2. Exploring Two-Variable Data

Least Squares Regression 📈

Introduction: Why do we care about a line that fits data?

Students, in AP Statistics, one of the most useful tools for studying two quantitative variables is the least squares regression line. Imagine a teacher wants to predict a student’s test score from the number of hours studied, or a coach wants to estimate how sprint time changes with practice time. A scatterplot can show the relationship, but sometimes we want a single line that summarizes the pattern and helps us make predictions. That is where least squares regression comes in.

The big idea is simple: when points in a scatterplot show a roughly linear pattern, we can draw a line through the data that gets as close as possible to all the points. This line is called the least squares regression line. It gives predicted values for $y$ based on values of $x$ and helps us understand how the two variables are related.

Objectives for this lesson

By the end of this lesson, you should be able to:

  • explain the main ideas and vocabulary behind least squares regression,
  • use regression to make predictions in AP Statistics problems,
  • interpret slope, intercept, and residuals in context,
  • connect regression to scatterplots, correlation, and departures from linearity,
  • summarize how regression fits into the study of two-variable data.

What least squares regression does

Suppose we have paired data $(x, y)$, where $x$ is the explanatory variable and $y$ is the response variable. A regression line predicts values of $y$ from values of $x$. The line is written as

$$\hat{y} = a + bx$$

where $\hat{y}$ means the predicted value of $y$, $a$ is the $y$-intercept, and $b$ is the slope.

The least squares regression line is special because it minimizes the sum of squared residuals. A residual is the difference between an observed value and a predicted value:

$$\text{residual} = y - \hat{y}$$

Residuals can be positive or negative. If a point is above the line, its residual is positive. If a point is below the line, its residual is negative. Squaring the residuals makes them all positive and gives larger errors more weight. The line with the smallest total of squared residuals is the least squares regression line.

Why square the residuals? If we simply added the residuals, positive and negative values would cancel each other out. Squaring avoids that problem and gives a fair measure of overall prediction error.
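The minimization idea can be checked numerically. The sketch below uses a small hypothetical data set, computes the slope and intercept from the standard least squares formulas $b = \sum(x_i - \bar{x})(y_i - \bar{y}) / \sum(x_i - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$, and then confirms that nudging either coefficient makes the sum of squared residuals larger:

```python
# Hypothetical data: hours of tutoring (x) and test scores (y).
x = [1, 2, 3, 4, 5]
y = [65, 71, 74, 79, 82]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares formulas: b = Sxy / Sxx and a = y_bar - b * x_bar.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

def ssr(a0, b0):
    """Sum of squared residuals for the candidate line y-hat = a0 + b0 * x."""
    return sum((yi - (a0 + b0 * xi)) ** 2 for xi, yi in zip(x, y))

print(round(a, 2), round(b, 2))       # fitted intercept and slope: 61.6 4.2
print(ssr(a, b) < ssr(a + 1, b))      # True: shifting the intercept is worse
print(ssr(a, b) < ssr(a, b + 0.1))    # True: tilting the slope is worse
```

No other line, however we perturb it, can beat the least squares line on total squared error.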

Real-world example 🌍

Imagine a school district studying the relationship between hours of tutoring $x$ and math test score $y$. If the regression line is

$$\hat{y} = 62 + 4x$$

then each extra hour of tutoring is associated with an increase of about $4$ points in the predicted test score. If a student receives $5$ hours of tutoring, the predicted score is

$$\hat{y} = 62 + 4(5) = 82$$

That prediction is useful, but it is still only an estimate.

Understanding slope and intercept in context

The slope $b$ tells how much the predicted response changes for each 1-unit increase in the explanatory variable. In AP Statistics, the slope should always be interpreted in context. If the slope is $4$, that means for every additional hour of tutoring, the predicted math score increases by about $4$ points.

The intercept $a$ is the predicted value of $y$ when $x = 0$. Sometimes this is meaningful, but often it is not. For example, if $x$ represents hours of tutoring, the intercept is the predicted score when no tutoring occurs. That may be useful. But if $x$ represents time spent studying over a long period, $x = 0$ might be outside the range of the data or unrealistic. In that case, the intercept should not be overinterpreted.

Students, a common AP Stats idea is that the regression line is only reliable within the range of the data. Predicting far beyond the observed $x$-values is called extrapolation, and it can be risky because the relationship may change outside the data range.
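One way to make that caution concrete is to guard predictions in code. This minimal sketch reuses the tutoring line $\hat{y} = 62 + 4x$ and an assumed observed range of $1$ to $5$ hours (both hypothetical), refusing to predict outside that range:

```python
# Hypothetical observed tutoring hours and the fitted line y-hat = 62 + 4x.
x_data = [1, 2, 3, 4, 5]
a, b = 62, 4

def predict(x_new):
    """Return the predicted score, or None when x_new would be extrapolation."""
    if not (min(x_data) <= x_new <= max(x_data)):
        return None  # outside the observed x range: extrapolation
    return a + b * x_new

print(predict(5))    # 82: inside the data range
print(predict(40))   # None: 40 hours is far beyond the data
```

Returning `None` here is just one design choice; the point is that the model itself gives no warranty outside the observed $x$-values.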

Example of interpretation

Suppose a regression line for car value is

$$\hat{y} = 24{,}000 - 1{,}500x$$

where $x$ is the number of years since purchase and $y$ is the car’s value in dollars. The slope means the car loses about $1{,}500$ dollars in predicted value per year. The intercept means the predicted value at purchase is $24{,}000$. If the data include new cars, that intercept makes sense. If the data start at $x = 3$, then using the intercept to describe a brand-new car would be extrapolation.

Residuals: checking how well the line fits

Residuals are very important in regression because they show how far each point is from the line. A point with a small residual is close to the predicted line. A point with a large residual is far from it.

Residuals help us judge whether a linear model is a good choice. If the residuals are scattered randomly above and below zero, a linear model may fit well. If residuals show a pattern, such as a curve, that suggests the relationship is not truly linear.

You may also see a residual plot, which graphs residuals on the vertical axis and $x$ on the horizontal axis. A good residual plot shows no obvious pattern and roughly equal spread around $0$.
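A residual pattern can be seen even without plotting software. In this sketch, a hypothetical perfectly curved data set ($y = x^2$) is fit with the least squares formulas; the residuals come out positive at both ends and negative in the middle, the classic sign of a relationship that is not linear:

```python
# Hypothetical curved data: y equals x squared, so a line underfits the ends.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# Residual = observed y minus predicted y, one value per data point.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print([round(r, 1) for r in residuals])   # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The $+,\ -,\ -,\ -,\ +$ sign pattern is exactly the curved residual plot described above: the line is too high in the middle and too low at the ends.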

Example of a residual

If the regression equation predicts a student score of $82$, but the actual score is $88$, then the residual is

$$88 - 82 = 6$$

This means the student scored $6$ points above the predicted value.

Residuals are also useful for finding unusual points. A point with a very large residual may be an outlier. Outliers can strongly affect a regression line, especially when they are far from the rest of the data in the $x$ direction.

Correlation and regression: related but not the same

Correlation and regression are connected, but they are not identical.

The correlation coefficient $r$ measures the strength and direction of a linear relationship between two quantitative variables. It always lies between $-1$ and $1$:

$$-1 \le r \le 1$$

A value of $r$ near $1$ means a strong positive linear relationship, and a value near $-1$ means a strong negative linear relationship. A value near $0$ means little or no linear relationship.

Regression goes one step further by giving an equation for prediction. Correlation describes the relationship; regression predicts values.

The sign of the slope in the regression line matches the sign of $r$. If $r > 0$, the slope is positive. If $r < 0$, the slope is negative.

Another key fact is that regression does not mean causation. Even if two variables move together, that does not prove that one causes the other. For example, ice cream sales and drowning deaths may both increase in summer, but ice cream does not cause drowning. A third variable, like hot weather, may explain both.

A quick AP Stats connection

The value of $r^2$ is called the coefficient of determination. It tells the proportion of the variation in $y$ explained by the linear relationship with $x$. For example, if $r^2 = 0.81$, then about $81\%$ of the variation in $y$ is explained by the linear model. That still leaves $19\%$ unexplained.
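The "proportion of variation explained" reading of $r^2$ can be verified directly: $r^2$ equals $1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total variation $\sum(y_i - \bar{y})^2$. The sketch below checks this identity on a small hypothetical tutoring data set:

```python
import math

# Hypothetical data: hours of tutoring (x) and test scores (y).
x = [1, 2, 3, 4, 5]
y = [65, 71, 74, 79, 82]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient

# Least squares fit, then the unexplained and total variation.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
sst = s_yy                                                   # total

print(round(r ** 2, 3), round(1 - sse / sst, 3))   # both are about 0.987
```

Here about $98.7\%$ of the variation in scores is explained by the linear model, and the two ways of computing that proportion agree.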

When the line is useful and when it is not

A regression line is useful when the scatterplot shows a clear linear trend and the points are not too scattered. But if the pattern is curved, the least squares line may miss important features.

This is why AP Statistics emphasizes form when describing scatterplots. The relationship may be linear, curved, or something else. It may also have unusual features like clusters or outliers. A line is best when the form is roughly linear.

Important things to check before using regression:

  • Is the relationship roughly linear?
  • Are there any outliers or influential points?
  • Is the spread about the same across the graph?
  • Are predictions being made within the data range?

An influential point is a point that changes the regression line a lot when removed. Points with extreme $x$-values can have strong influence because they affect the slope.
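Influence is easy to demonstrate numerically. In this hypothetical sketch, four points line up neatly, and a fifth point with an extreme $x$-value drags the slope far from the pattern of the rest:

```python
def slope(x, y):
    """Least squares slope b = Sxy / Sxx for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# First four points follow y = 2x; the fifth has an extreme x-value.
x = [1, 2, 3, 4, 20]
y = [2, 4, 6, 8, 10]

print(round(slope(x, y), 2))            # 0.32: with the influential point
print(round(slope(x[:4], y[:4]), 2))    # 2.0: without it
```

Removing one point changes the slope from $0.32$ to $2.0$, which is what makes that point influential.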

Example of a non-linear situation

Suppose performance improves quickly with the first few hours of studying but levels off later. That pattern is curved, not linear, so a straight line describes it poorly. In that case, a regression line may give poor predictions, and residuals may show a curved pattern.

How AP Statistics uses least squares regression

In AP Statistics, least squares regression fits into the broader study of exploring two-variable data. First, you describe the scatterplot. Then you check the direction, form, and strength of the association. If the pattern looks linear, regression can summarize it with an equation.

A typical AP Stats workflow looks like this:

  1. make a scatterplot,
  2. describe the association,
  3. calculate or use the regression line,
  4. interpret slope and intercept in context,
  5. examine residuals for fit,
  6. use the line for prediction only when appropriate.

This process connects regression to earlier ideas like scatterplots and correlation and later ideas like residual analysis and model checking.

Mini example

Suppose a line predicts weekly sleep hours $y$ from homework hours $x$:

$$\hat{y} = 9 - 0.4x$$

If a student has $x = 10$, then

$$\hat{y} = 9 - 0.4(10) = 5$$

The model predicts about $5$ hours of sleep. If the student actually sleeps $6$ hours, the residual is

$$6 - 5 = 1$$

That student slept 1 hour more than predicted.

Conclusion

Students, least squares regression is one of the most important tools for analyzing bivariate quantitative data in AP Statistics. It gives a line that best fits the data by minimizing the sum of squared residuals. The slope tells how the predicted response changes for each unit increase in the explanatory variable, and the intercept gives the predicted value when $x = 0$, if that makes sense in context. Residuals help us see how well the line fits and whether a linear model is appropriate.

Least squares regression belongs to the larger topic of exploring two-variable data because it builds on scatterplots, association, and correlation. It helps us describe relationships, make predictions, and evaluate whether a linear model is useful. Understanding when to use it and how to interpret it is a major AP Statistics skill. ✅

Study Notes

  • A least squares regression line is the line that minimizes the sum of squared residuals.
  • A regression equation has the form $\hat{y} = a + bx$.
  • The slope $b$ tells the predicted change in $y$ for each 1-unit increase in $x$.
  • The intercept $a$ is the predicted value of $y$ when $x = 0$, if that value is meaningful.
  • A residual is $y - \hat{y}$.
  • Positive residuals mean the actual value is above the line; negative residuals mean it is below the line.
  • Residual plots help check whether a linear model fits well.
  • Correlation $r$ measures strength and direction of linear association, while regression gives a prediction equation.
  • The sign of the slope matches the sign of $r$.
  • The coefficient of determination $r^2$ tells how much of the variation in $y$ is explained by the linear model.
  • Do not use regression for extrapolation unless there is a strong reason and the context supports it.
  • Regression is part of exploring two-variable data because it builds on scatterplots and helps describe quantitative relationships.
