Introducing Statistics: Are Variables Related?

Welcome, students! In AP Statistics, one of the first big questions you learn to ask is: Are two variables related? 📊 This question shows up everywhere—grades and study time, height and shoe size, exercise and heart rate, screen time and sleep, or advertising and sales. Statistics helps us move beyond guesses and use data to look for patterns.

Today’s goals:

Understand what it means for two variables to be related.
Learn the basic language of bivariate data.
Recognize how statisticians study relationships using graphs, correlation, and regression.
Connect these ideas to the larger AP Statistics topic of Exploring Two-Variable Data.

By the end of this lesson, students, you should be able to describe how two variables may move together, explain what that relationship looks like, and identify when a pattern is strong, weak, positive, negative, linear, or not linear.

What Does It Mean for Variables to Be Related?

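A variable is any characteristic that can take different values. For example, a student’s number of hours studied, height, favorite sport, and zip code are all variables. Hours studied and height are quantitative variables because they are numerical measurements, while favorite sport and zip code are categorical variables because they place individuals into groups. In AP Statistics, we often study two variables at once to see whether one seems connected to the other.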

When two variables are related, the value of one variable tends to change in some pattern as the other changes. For example, if students who study more tend to score higher on a test, then study time and test score may be related. That does not automatically mean one causes the other. This is a key statistics idea: association does not imply causation.

There are two main situations:

Two categorical variables, such as grade level and preferred lunch choice.
Two quantitative variables, such as time spent studying and test score.

For this lesson, the main focus is on understanding the big idea of relationship and the vocabulary used to describe it. Later, you will use specific tools like scatterplots, correlation, and regression to study quantitative pairs more deeply.

Imagine a coach looking at practice time and sprint times. If more practice tends to go with faster sprint times, the coach may want to investigate further. But the pattern could also be influenced by other factors like natural athletic ability, sleep, or training background. Statistics helps us ask better questions and support claims with evidence.

Bivariate Data: Two Values for Each Individual

When we study two-variable data, each individual or case has two measurements. This is called bivariate data. For example, each student in a class might have a pair of values like:

hours of sleep and quiz score,
number of absences and final grade,
temperature and ice cream sales.

You can think of each data point as a pair. If the variables are quantitative, we often write the pair as $(x, y)$, where $x$ is one variable and $y$ is the other. In a scatterplot, each point shows one individual’s values for both variables.

The idea of pairing is important because we are not just studying how one variable behaves alone. We are asking whether the two variables move together in some way. If they do, the relationship may be useful for prediction or understanding patterns in the real world.

For example, suppose a school collects data on the number of minutes spent reading each day and the score on a reading assessment. If students who read more usually score higher, the relationship might help teachers support reading habits. Still, it would be a mistake to say reading time alone causes the score difference without more evidence.

Describing Relationships: Direction, Strength, and Form

When two quantitative variables are related, statisticians describe the relationship using three main features:

Direction
Strength
Form

Direction

A relationship can be positive or negative.

A positive association means that as one variable increases, the other tends to increase too.
A negative association means that as one variable increases, the other tends to decrease.

Example: If hours of exercise and calories burned are studied, the relationship is often positive. If speed and time needed to finish a fixed-distance race are studied, the relationship is often negative because faster speed means less time.

Strength

The strength of a relationship tells us how closely the points follow a pattern. A strong relationship has points close together around a pattern, while a weak relationship has more scatter or spread.

For example, height and shoe size for teenagers may show a moderate positive relationship, but it will not be perfect because people differ in body proportions. A very strong relationship is still not necessarily perfect.

Form

The form describes the shape of the relationship.

A relationship may be linear, which means it follows an approximate straight-line pattern.
It may be nonlinear, which means the pattern curves.

A linear pattern is important because many AP Statistics methods, especially correlation and regression, are designed for data that follow an approximately straight-line relationship.

For example, as hours studied increase, test scores may rise at first but then level off if students are already near the top of the score range. That is not perfectly linear. The form matters because it tells you what kind of model may be appropriate.

Tools for Seeing Relationships 📈

The first and most common graph for two quantitative variables is the scatterplot. In a scatterplot, each point represents one pair of values $(x, y)$. Scatterplots help you quickly see direction, strength, and form.

Here is what you look for:

Does the cloud of points slope upward or downward?
Are the points tightly grouped or widely spread?
Does the pattern look straight or curved?
Are there any unusual points far from the rest?

For example, if you plot temperature and ice cream sales, you might see that higher temperatures are linked to higher sales. That would suggest a positive association. If you plot outside temperature and heating bill, you might see a negative association.

A scatterplot can also reveal outliers, which are points that do not fit the overall pattern. An outlier might represent a special situation, measurement error, or a real but unusual case. Outliers matter because they can affect the direction, strength, and even the shape of a relationship.

Another useful idea is the residual, which measures how far a point lies above or below the line or pattern used to predict it. A residual is the observed value minus the predicted value, written $y - \hat{y}$, where $\hat{y}$ is the predicted value. If the observed value is higher than predicted, the residual is positive; if lower, the residual is negative. Residuals help you judge how well a model fits the data.
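To make the residual idea concrete, here is a minimal Python sketch. The prediction line $\hat{y} = 50 + 2.5x$ and the three students' values are invented for illustration (they are not from a real data set), and `predict` and `residual` are just hypothetical helper names.

```python
# Hypothetical prediction line: predicted score = 50 + 2.5 * (hours studied).
# The intercept, slope, and data below are made up for illustration only.
def predict(hours):
    return 50 + 2.5 * hours

def residual(observed, predicted):
    # residual = observed value - predicted value
    return observed - predicted

# (hours studied, observed test score) for three made-up students
students = [(2, 58), (4, 60), (6, 62)]

residuals = [residual(score, predict(hours)) for hours, score in students]

for (hours, score), res in zip(students, residuals):
    print(f"hours={hours}, observed={score}, predicted={predict(hours)}, residual={res}")
```

In this made-up example, the first student scored above the prediction (a residual of $3$), the second landed exactly on the line (a residual of $0$), and the third scored below it (a residual of $-3$).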

Correlation and Regression: A First Look

Once a scatterplot shows an approximately linear relationship, AP Statistics uses correlation and regression to describe it more precisely.

The correlation coefficient is written as $r$. It measures the strength and direction of a linear relationship between two quantitative variables. Its value always lies between $-1$ and $1$:

$$-1 \le r \le 1$$

If $r$ is close to $1$, the relationship is strongly positive.
If $r$ is close to $-1$, the relationship is strongly negative.
If $r$ is close to $0$, there is little or no linear relationship.

Correlation is useful, but it has limits. It only measures linear relationships. A curved pattern might have a low $r$ even when the variables are clearly related.
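Here is a small Python sketch of how $r$ can be computed, and of why a low $r$ does not always mean "no relationship." It assumes the formula $r = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$, which is algebraically equivalent to the z-score version many AP textbooks use. The function name `correlation` and both data sets are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data with a roughly linear, upward pattern
r_linear = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Made-up data that follow the curve y = x^2 exactly
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(r_linear, 3))
print(round(r_curved, 3))
```

The first value comes out to about $0.775$, a fairly strong positive linear relationship. The second comes out to $0$ even though $y$ is completely determined by $x$, because the pattern is curved rather than linear. This is why you should always look at the scatterplot before trusting $r$.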

Regression goes one step further by giving an equation for predicting one quantitative variable from another. The most common model is the least-squares regression line, written as

$$\hat{y}=a+bx$$

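where $\hat{y}$ is the predicted value of the response variable, $x$ is the value of the explanatory variable (the one used to make the prediction), $a$ is the $y$-intercept (the predicted value when $x=0$), and $b$ is the slope.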

The slope $b$ tells us the predicted change in $\hat{y}$ for each 1-unit increase in $x$. For example, if $b=2.5$, then each additional hour of study is associated with an increase of about $2.5$ points in predicted test score, assuming the model is appropriate.

Regression is powerful, but it should be used carefully. A linear regression model is appropriate only when the scatterplot shows an approximately linear pattern and a plot of the residuals shows no clear problems such as curvature or changing spread.
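The least-squares line can be sketched in a few lines of Python. This sketch assumes the standard formulas $b = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$ and $a = \bar{y} - b\bar{x}$. The function name `least_squares_line` and the data are made up for illustration; in practice you would usually let a calculator or software do this step.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Made-up paired data (not from a real study)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(f"y-hat = {a:.2f} + {b:.2f}x")
print("sum of residuals is essentially zero:", abs(sum(residuals)) < 1e-9)
```

For these numbers the line works out to $\hat{y} = 2.2 + 0.6x$, so each 1-unit increase in $x$ goes with a predicted increase of $0.6$ in $\hat{y}$. The residuals add up to essentially zero, which is always true for a least-squares line.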

Comparing Two Categorical Variables

Not every two-variable question uses scatterplots. If both variables are categorical, such as grade level and transportation method, the data are usually displayed in a two-way table. A two-way table helps compare how categories are distributed across groups.

For example, suppose a school wants to know whether club membership is related to participation in sports. The table could show how many students are in each combination of categories. Then you can compare conditional distributions to see whether the variables seem related.

If the percentages are similar across categories, there may be little association. If the percentages differ a lot, there may be a relationship. This is still the same core question: Are the variables related? The graph or table changes depending on the type of variables, but the statistical thinking is the same.
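Comparing conditional distributions can be sketched in Python as follows. The counts in this two-way table are invented for illustration, and `conditional_distribution` is a hypothetical helper name, not part of any library.

```python
# Made-up counts for illustration: rows are club membership,
# columns are sports participation.
table = {
    "in a club":     {"plays a sport": 30, "no sport": 20},
    "not in a club": {"plays a sport": 20, "no sport": 30},
}

def conditional_distribution(row_counts):
    """Convert one row of counts into proportions of that row's total."""
    total = sum(row_counts.values())
    return {category: count / total for category, count in row_counts.items()}

conditional = {
    group: conditional_distribution(counts) for group, counts in table.items()
}

for group, dist in conditional.items():
    print(group, {category: f"{p:.0%}" for category, p in dist.items()})
```

In this made-up table, 60% of club members play a sport compared with 40% of non-members. Because the percentages differ noticeably between the two groups, the data would suggest an association between club membership and sports participation.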

Conclusion

Students, the big idea of this lesson is that statistics is about finding evidence for relationships between variables. When you ask whether two variables are related, you are beginning a major AP Statistics skill: describing patterns in data and deciding what those patterns mean.

For quantitative variables, you will often use scatterplots to look for direction, strength, and form. You will learn how to describe positive and negative associations, recognize linear and nonlinear patterns, and use correlation and regression when appropriate. For categorical variables, you will use tables and compare distributions.

Most importantly, remember this: a relationship in data does not automatically prove cause and effect. Statistics helps you see patterns, ask better questions, and make careful conclusions based on evidence. That is the foundation of Exploring Two-Variable Data. 🌟

Study Notes

A variable is a characteristic that can take different values for different individuals.
Bivariate data means two variables are recorded for each individual.
The main AP Stats question here is: Are the variables related?
For two quantitative variables, use a scatterplot.
Describe scatterplots by direction, strength, and form.
A positive association means both variables tend to increase together.
A negative association means one variable tends to increase while the other decreases.
A linear relationship looks roughly like a straight line.
Outliers are unusual points that do not fit the pattern well.
The correlation coefficient $r$ measures the strength and direction of a linear relationship, and $-1 \le r \le 1$.
Regression uses an equation like $\hat{y}=a+bx$ to predict one variable from another.
A residual is the observed value minus the predicted value, $y - \hat{y}$.
For two categorical variables, use a two-way table and compare percentages.
Association does not imply causation.