Correlation

Have you ever noticed that two things seem to move together? For example, when the temperature goes up, ice cream sales often go up too 🍦☀️. In AP Statistics, one way to measure how strongly two quantitative variables move together is correlation. This lesson will help you understand what correlation tells us, what it does not tell us, and how it fits into the bigger picture of exploring two-variable data.

By the end of this lesson, you should be able to:

explain the main ideas and vocabulary of correlation,
describe the direction, strength, and form of a relationship,
recognize when correlation is appropriate,
connect correlation to scatterplots and regression,
use correlation in AP Statistics reasoning with real data examples.

What Correlation Measures

Correlation describes the strength and direction of a linear relationship between two quantitative variables. In AP Statistics, the most common correlation coefficient is $r$, called the sample correlation coefficient.

A correlation value is always between $-1$ and $1$, so $-1\le r\le 1.$ A value near $1$ means a strong positive linear relationship. A value near $-1$ means a strong negative linear relationship. A value near $0$ means little or no linear relationship.

Here is the basic idea:

If $x$ increases and $y$ tends to increase, the association is positive.
If $x$ increases and $y$ tends to decrease, the association is negative.
If points cluster closely around a straight line, the relationship is strong.
If points are spread out, the relationship is weak.

For example, suppose you are comparing hours studied $x$ and test score $y$ for a class. If students who study more usually score higher, that is a positive association. If the points fall close to a straight upward line, the correlation may be fairly close to $1$.
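To make the idea concrete, one standard formula for the sample correlation is $r=\frac{1}{n-1}\sum z_x z_y$: standardize each $x$ and $y$ value, multiply the pairs, add the products, and divide by $n-1$. The sketch below applies it in Python. The `correlation` helper and the six data points are made up for illustration and do not come from a real class.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)

# Hypothetical data: hours studied and test score for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 71, 80, 85, 91]

r = correlation(hours, scores)
print(round(r, 3))  # close to 1: strong positive linear association
```

Because the made-up scores rise almost steadily with hours, the products of the z-scores are mostly positive and $r$ comes out close to $1$.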

Correlation is one part of exploring two-variable data because it helps summarize scatterplots with a single number. However, that number does not replace the scatterplot. The scatterplot shows shape, outliers, clusters, and nonlinearity that correlation alone can hide.

Reading Correlation from a Scatterplot

Before calculating anything, AP Statistics expects you to look at the scatterplot. Correlation is only useful for quantitative variables, and it works best when the pattern is roughly linear.

A strong linear pattern means the points are close to a line. A weak linear pattern means the points are more spread out. If the pattern is curved, $r$ may be misleading because it measures only linear association.

Think about these examples:

Height and weight: often positive, moderate to strong, and roughly linear.
Temperature and heater use: often negative.
Age and car value: usually negative, but the pattern may curve downward rather than stay linear.

A few important terms help describe scatterplots:

Direction: positive or negative.
Form: linear or nonlinear.
Strength: weak, moderate, or strong.
Outlier: a point far from the overall pattern.
Influential point: a point that has a big effect on the line of best fit or correlation.

Outliers matter because correlation is not resistant. That means one unusual point can change $r$ a lot. For example, if a class has a scatterplot of study time and grades, one student who studied very little but still earned a perfect score could weaken the correlation.
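The lack of resistance can be checked with a quick experiment. The sketch below uses hypothetical study-time data, then adds one made-up student who studied $0$ hours and scored $100$. The `correlation` helper is an illustrative function, not part of any AP formula sheet.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical class data: hours studied and grade.
hours = [1, 2, 3, 4, 5, 6]
grades = [62, 70, 71, 80, 85, 91]
r_without = correlation(hours, grades)

# Add one unusual student: no studying, perfect score.
hours_with = hours + [0]
grades_with = grades + [100]
r_with = correlation(hours_with, grades_with)

print(round(r_without, 2), round(r_with, 2))
```

With these invented numbers, a single extra point pulls $r$ from nearly $1$ down to a weak positive value, even though the other six students still follow a clear line.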

A key AP Statistics habit is to describe the scatterplot first, then interpret $r$. Never use correlation by itself without checking the graph 📊.

What Makes Correlation Big or Small

The closer $r$ is to $1$ or $-1$, the stronger the linear relationship. The closer $r$ is to $0$, the weaker the linear relationship.

Here is a helpful way to think about it:

$r\approx 1$: strong positive linear relationship
$r\approx -1$: strong negative linear relationship
$r\approx 0$: little or no linear relationship

But remember, “close to $0$” does not always mean “no relationship.” It only means no linear relationship. There could still be a curved pattern.

For example, suppose you graph the hours of daylight $x$ and the number of umbrellas sold $y$ throughout a year. The data may rise and fall in a seasonal curve. In that case, the correlation could be near $0$, even though the variables are clearly related in a nonlinear way.
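A simpler, fully artificial case shows the same effect. In the sketch below, $y$ is determined exactly by $x$ through $y=x^2$, yet the correlation is essentially $0$ because the pattern is a symmetric curve rather than a line. The data and the `correlation` helper are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Artificial data: y depends perfectly on x, but along a U-shaped curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = correlation(xs, ys)
print(abs(round(r_curve, 6)))  # essentially 0 despite a perfect relationship
```

The left half of the curve contributes negative products and the right half contributes matching positive products, so they cancel. This is why a scatterplot is needed before trusting a value of $r$ near $0$.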

Correlation does not depend on the units of measurement. It does not change if we add a constant to a variable or multiply a variable by a positive constant. (Multiplying by a negative constant flips the sign of $r$ but keeps its size.) That means if we convert temperatures from Celsius to Fahrenheit, the correlation stays the same. This is useful because it shows that correlation measures pattern, not units.
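The Celsius-to-Fahrenheit claim can be verified numerically. The sketch below uses hypothetical temperatures and ice cream sales, converts the temperatures with $F=\frac{9}{5}C+32$, and compares the two correlations. The data and the `correlation` helper are illustrative assumptions.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data: daily high temperature (Celsius) and ice cream sales.
temps_c = [18, 21, 24, 27, 30, 33]
sales = [120, 135, 160, 170, 195, 210]

# Same temperatures in Fahrenheit: multiply by 9/5 (positive) and add 32.
temps_f = [c * 9 / 5 + 32 for c in temps_c]

r_c = correlation(temps_c, sales)
r_f = correlation(temps_f, sales)

# Multiplying by a negative constant reverses the sign of r.
r_negated = correlation([-c for c in temps_c], sales)

print(round(r_c, 4), round(r_f, 4), round(r_negated, 4))
```

Standardizing removes both the shift and the positive scale factor, so the z-scores, and therefore $r$, are the same in either unit.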

A common AP Statistics idea is that correlation is unitless. It does not have a measurement unit like inches or dollars. It simply summarizes the linear relationship.

Correlation Is Not Causation

One of the most important ideas in statistics is that correlation does not prove causation. Even when two variables are strongly related, we cannot automatically say that one causes the other.

There are three major reasons:

Lurking variables may affect both variables.
Reverse causation may be possible.
The relationship may be a coincidence in the data.

For example, ice cream sales and drowning incidents may have a positive correlation, but ice cream does not cause drowning. A lurking variable, such as hot weather, affects both. During summer, more people swim and more people buy ice cream.

Another example: there may be a positive correlation between shoe size and reading level in young children. That does not mean bigger shoes make kids better readers. Age is the lurking variable.

In AP Statistics, you should always be careful with language. Instead of saying “$x$ causes $y$,” say “$x$ is associated with $y$” unless a well-designed experiment supports causation.

Correlation and Regression Work Together

Correlation is closely connected to regression. A regression line is a line used to predict one quantitative variable from another. Correlation helps describe how well that line fits the data.

If the relationship is strongly linear, the regression line will usually fit well. If the relationship is weak or curved, the line may not be useful.

The regression line often has the form $\hat{y}=a+bx,$ where $\hat{y}$ is the predicted value of $y$, $a$ is the intercept, and $b$ is the slope.
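Assuming the regression line is the least-squares line, its slope and intercept can be computed from summary statistics as $b=r\dfrac{s_y}{s_x}$ and $a=\bar{y}-b\bar{x}$. The sketch below applies these formulas to hypothetical study-time data. The variable names and the data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data: hours studied and test score.
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 71, 80, 85, 91]

r = correlation(hours, scores)
b = r * stdev(scores) / stdev(hours)   # slope of the least-squares line
a = mean(scores) - b * mean(hours)     # intercept

# Predicted score for a student who studies 4 hours.
predicted = a + b * 4
print(round(a, 2), round(b, 2), round(predicted, 2))
```

The slope formula shows the link between the two ideas: the sign of $b$ always matches the sign of $r$, and a change of one standard deviation in $x$ corresponds to a predicted change of $r$ standard deviations in $y$.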

A large absolute value of $r$ usually means points are close to the regression line. A smaller absolute value of $r$ usually means more scatter around the line. For a least-squares regression line, the square of the correlation, $r^2$, tells us the proportion of variation in $y$ explained by the linear relationship with $x$.

For example, if $r=0.8$, then $r^2=0.64.$ This means about $64\%$ of the variation in $y$ is explained by the linear model using $x$. The remaining $36\%$ is not explained by that line.

This does not mean the model is perfect. Even with a strong correlation, predictions can still be off. That is why residuals matter. A residual is $y-\hat{y},$ the difference between the observed value and the predicted value.

Correlation helps you understand whether regression is likely to be useful, but residuals help you see how far actual values are from the line.
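Residuals and $r^2$ can be tied together numerically. The sketch below, which assumes a least-squares line and uses hypothetical study-time data, computes each residual $y-\hat{y}$ and then checks that $r^2$ matches $1-\dfrac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}$, the proportion of variation in $y$ that the line explains. All names and numbers are illustrative.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: hours studied and test score.
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 71, 80, 85, 91]

x_bar, y_bar = mean(hours), mean(scores)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)  # total variation in y

r = s_xy / sqrt(s_xx * s_yy)
b = s_xy / s_xx            # least-squares slope
a = y_bar - b * x_bar      # least-squares intercept

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
sse = sum(e ** 2 for e in residuals)  # variation left unexplained
explained = 1 - sse / s_yy

print([round(e, 2) for e in residuals])
print(round(r ** 2, 4), round(explained, 4))
```

The residuals of a least-squares line add up to $0$ apart from rounding, so a residual plot is judged by its pattern, not by its total.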

Common Mistakes and AP Exam Thinking

AP Statistics questions often test whether you can interpret correlation correctly. Here are common mistakes to avoid:

Using correlation on categorical data: correlation is for two quantitative variables, not for categories like favorite color or type of phone.
Ignoring outliers: a single unusual point can strongly affect $r$.
Confusing association with causation: correlation alone does not prove a cause.
Assuming all patterns are linear: a curved scatterplot may have a low $r$ even if the variables are related.
Thinking a strong $r$ means perfect prediction: even strong linear relationships have residuals.

A typical AP Statistics response should mention direction, strength, and form. For example:

“The scatterplot shows a strong positive linear association between hours studied and exam score. There is one possible outlier at $x=0$ hours and $y=98$. The correlation is likely positive and fairly large in absolute value, but the outlier may affect it.”

That kind of answer shows evidence-based reasoning, which is exactly what the course expects.

Conclusion

Correlation is a powerful tool for summarizing the linear relationship between two quantitative variables. The sample correlation coefficient $r$ describes the direction and strength of that relationship, while the scatterplot shows its form. But correlation must be used carefully: it does not prove causation, it is sensitive to outliers, and it only measures linear patterns.

In the wider AP Statistics topic of Exploring Two-Variable Data, correlation connects scatterplots, regression, and residuals. First, the scatterplot reveals the pattern. Then, correlation summarizes the linear strength and direction. After that, regression uses the relationship to make predictions, and residuals show how well the model works.

If you remember to look at the graph, describe the relationship clearly, and avoid cause-and-effect mistakes, you will be ready to answer many AP Statistics questions about correlation ✅.

Study Notes

Correlation measures the strength and direction of a linear relationship between two quantitative variables.
The sample correlation coefficient is $r$, and $-1\le r\le 1$.
A positive $r$ means the variables tend to increase together; a negative $r$ means one tends to increase as the other decreases.
Values of $r$ near $1$ or $-1$ show a strong linear relationship; values near $0$ show little or no linear relationship.
Correlation is unitless and does not change when variables are rescaled by positive linear transformations.
Correlation is not resistant; outliers can affect it a lot.
Correlation does not prove causation.
Correlation works best with quantitative variables and a roughly linear scatterplot.
Correlation connects directly to regression: a stronger linear relationship usually means a better-fitting line.
The coefficient of determination is $r^2$, which describes the proportion of variation in $y$ explained by the linear model with $x$.
Residuals are $y-\hat{y},$ and they help check how well the regression line fits the data.
On the AP exam, always describe direction, strength, and form before making conclusions about correlation.