Correlation

Introduction: spotting links in data 📈

Students, in statistics we often want to know whether two variables move together. For example, do taller students tend to weigh more? Does spending more time revising tend to improve test scores? When two variables show a pattern of moving together, we call that correlation. Correlation is a major idea in IB Mathematics: Analysis and Approaches HL because it helps us describe relationships in real data and decide whether those relationships are strong, weak, positive, or negative.

In this lesson, you will learn how to explain the main ideas and terminology behind correlation, interpret scatter diagrams, and connect correlation to regression and the wider statistics syllabus. You will also see why correlation matters in real life, from sports performance to business sales and climate data 🌍.

By the end of this lesson, you should be able to:

define correlation using clear statistical language,
describe the direction and strength of a relationship,
interpret scatter plots and the correlation coefficient,
understand why correlation does not prove causation,
connect correlation to regression and statistical modelling.

What correlation means

Correlation describes how two variables are related. If one variable tends to increase when the other increases, the correlation is positive. If one variable tends to decrease when the other increases, the correlation is negative. If there is no clear pattern, the correlation is zero or very weak.

Here are some simple examples:

Positive correlation: as the number of hours studied increases, exam marks may increase.
Negative correlation: as the number of hours spent on a phone before bed increases, sleep quality may decrease.
No clear correlation: shoe size and favourite music genre are usually unrelated.

Correlation is usually explored using a scatter diagram. This is a graph where each point represents a pair of values for two variables, such as $x$ and $y$. The shape of the cloud of points gives information about the relationship.

Important terminology includes:

Variables: the quantities being compared.
Bivariate data: data with two variables measured for each individual.
Positive association: larger values of one variable tend to go with larger values of the other.
Negative association: larger values of one variable tend to go with smaller values of the other.
Strength: how tightly the points cluster around a pattern.
Outlier: a data point far from the overall pattern.

A strong correlation means the points are close to a line or curve. A weak correlation means the points are more spread out. A correlation can be linear or non-linear, but in IB, the most common focus is on linear correlation.

Scatter diagrams and interpreting patterns

A scatter diagram is the first tool for studying correlation. It helps you see whether the relationship is positive, negative, or absent. It can also show whether the relationship is approximately linear.

Imagine a school collected data on the number of hours of revision and the mark on a test. If the scatter plot shows points rising from left to right, that suggests a positive correlation. If the points fall from left to right, that suggests a negative correlation. If the points look scattered with no clear direction, then there is little or no correlation.

When describing a scatter diagram, use precise language:

direction: positive, negative, or none,
form: linear or curved,
strength: strong, moderate, or weak,
unusual points: outliers or influential points.

For example, you might say: “The data show a strong positive linear correlation with one possible outlier.” This is better than saying simply “the graph goes up,” because statistical language must be accurate.

Outliers matter because they can affect the overall pattern. A single unusual point can make a correlation appear stronger or weaker, or even change the apparent direction of the relationship. In examination questions, students, you should always look for outliers before drawing conclusions.
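To see how much one point can matter, here is a minimal Python sketch. The data are made up for illustration, and `pearson_r` is a hypothetical helper written here (not something from the syllabus) that computes the correlation coefficient $r$ discussed in the next section.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up data: six points on a clear upward trend.
x = [1, 2, 3, 4, 5, 6]
y = [3, 4, 6, 7, 9, 10]

# The same data with one extra, very unusual point added at (12, 1).
x_with_outlier = x + [12]
y_with_outlier = y + [1]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with_outlier, y_with_outlier)

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

With these particular numbers, the coefficient falls from very close to $1$ to a small negative value, so a single point is enough to reverse the apparent direction.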

The correlation coefficient $r$

For linear relationships, the correlation coefficient is often written as $r$. It measures the strength and direction of the linear relationship between $x$ and $y$.
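For reference, one standard way to write this coefficient (Pearson's product-moment correlation coefficient) for $n$ data pairs $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ is shown below. The shorthand $S_{xy}$, $S_{xx}$, and $S_{yy}$ is notation introduced here for convenience; in practice the value of $r$ is usually found with a calculator or software.

$$r=\frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},\qquad S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),\qquad S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2,\qquad S_{yy}=\sum_{i=1}^{n}(y_i-\bar{y})^2$$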

The value of $r$ lies between $-1$ and $1$.

$r=1$ means perfect positive linear correlation.
$r=-1$ means perfect negative linear correlation.
$r=0$ means no linear correlation, although a non-linear relationship may still exist.

Values close to $1$ or $-1$ show a strong linear relationship. Values close to $0$ show a weak linear relationship.

A useful way to interpret $r$ is:

$0.7 \leq |r| \leq 1.0$ suggests a strong correlation,
$0.3 \leq |r| < 0.7$ suggests a moderate correlation,
$0 \leq |r| < 0.3$ suggests a weak correlation.

These boundaries are not exact laws, but they are useful for interpretation.
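The rough bands above can be turned into a short Python sketch. Everything named here (`pearson_r`, `describe_r`, and the revision data) is hypothetical and chosen for illustration, and the cut-offs $0.3$ and $0.7$ are only the guideline values from this lesson, not fixed rules.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return s_xy / math.sqrt(s_xx * s_yy)


def describe_r(r):
    """Put r into words using this lesson's rough bands (0.3 and 0.7)."""
    if r == 0:
        return "no linear correlation"
    size = abs(r)
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear correlation"


# Made-up data: hours revised and test mark for six students.
hours = [1, 2, 3, 4, 5, 6]
marks = [52, 55, 61, 60, 68, 72]

r = pearson_r(hours, marks)
print(f"r = {r:.3f}: {describe_r(r)}")
```

A description produced this way is only a starting point. It still needs to be checked against the scatter diagram and written in the context of the question.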

The coefficient $r$ is especially important because it gives a numerical summary of the scatter plot. However, you should never rely on $r$ alone. Always look at the actual graph too. Two data sets can have the same value of $r$ but very different shapes.
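As a small illustration of why the graph matters, consider five points chosen here for the purpose: $(-2,4)$, $(-1,1)$, $(0,0)$, $(1,1)$, $(2,4)$. They lie exactly on the curve $y=x^2$. Their means are $\bar{x}=0$ and $\bar{y}=2$, so the numerator of Pearson's correlation coefficient, $\sum (x_i-\bar{x})(y_i-\bar{y})$, works out as

$$(-2)(2)+(-1)(-1)+(0)(-2)+(1)(-1)+(2)(2)=-4+1+0-1+4=0$$

This gives $r=0$ even though $y$ is completely determined by $x$. The coefficient only detects linear association, and a scatter diagram would reveal the curved pattern immediately.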

A famous idea in statistics is that correlation does not imply causation. Just because two variables are related does not mean one causes the other. For example, ice cream sales and sunburn cases may rise together in summer, but ice cream does not cause sunburn. The hidden factor is hot weather ☀️.

Correlation and regression

Correlation is closely connected to regression. Correlation tells us how strong the linear relationship is and in which direction it runs, while regression gives a line or model that can be used to make predictions.

If the data are roughly linear, we may fit a line of best fit. A common form is

$$y=mx+c$$

where $m$ is the gradient and $c$ is the $y$-intercept.

The line of best fit tries to summarize the overall trend in the data. If the correlation is strong, the regression model is often more reliable for prediction. If the correlation is weak, predictions from the line are less trustworthy.

For example, suppose a biology class records the relationship between plant height and amount of fertilizer used. If the scatter plot shows a strong positive linear correlation, the line of best fit can help predict plant height for a given fertilizer amount. But if the data are very scattered, the prediction will have a larger error.
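The fertilizer example can be sketched in Python. This assumes the line of best fit is found by least squares, which is a common choice but is an assumption here, since the lesson does not name a method. The function `fit_line` and the measurements are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = m*x + c for paired lists xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x-values are equal, so no gradient can be found")
    m = s_xy / s_xx            # gradient
    c = mean_y - m * mean_x    # y-intercept: the line passes through the mean point
    return m, c


# Made-up data: fertilizer used (grams) and plant height (cm).
fertilizer_grams = [1, 2, 3, 4, 5]
height_cm = [12, 15, 17, 21, 22]

m, c = fit_line(fertilizer_grams, height_cm)
predicted_height = m * 3.5 + c  # 3.5 g lies inside the observed range

print(f"y = {m:.2f}x + {c:.2f}")
print(f"Predicted height for 3.5 g: {predicted_height:.1f} cm")
```

The fitted line always passes through the mean point $(\bar{x}, \bar{y})$, which is a useful check when drawing a line of best fit by hand.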

In IB Mathematics: Analysis and Approaches HL, you should understand that regression is based on the data and does not prove a cause-and-effect relationship. A model is only reliable within the range of the observed data unless there is strong reason to extend it. Predicting outside that range is called extrapolation, and it can be risky.
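One way to make the warning about extrapolation concrete is to flag any prediction whose $x$-value lies outside the observed data. The sketch below is illustrative only: `predict_with_check`, the line $y=2.6x+9.6$, and the data range are all hypothetical.

```python
def predict_with_check(m, c, x_new, observed_xs):
    """Predict y = m*x_new + c and flag whether x_new is outside the data range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    is_extrapolation = x_new < lowest or x_new > highest
    return m * x_new + c, is_extrapolation


# Hypothetical line of best fit and the x-values it was fitted to.
m, c = 2.6, 9.6
observed_xs = [1, 2, 3, 4, 5]

for x_new in (3.5, 12):
    y_hat, risky = predict_with_check(m, c, x_new, observed_xs)
    label = "extrapolation, treat with caution" if risky else "within the data range"
    print(f"x = {x_new}: predicted y = {y_hat:.1f} ({label})")
```

The flag does not say the extrapolated value is wrong. It only signals that the data give no evidence the linear trend continues that far.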

If a question gives you a scatter plot, you may be asked to state whether a line of best fit is appropriate. The answer depends on whether the points show a roughly linear trend and whether the data contain strong outliers or curvature.

Correlation in real-world data

Correlation appears in many everyday situations.

In education, hours studied and test performance may be positively correlated.
In health, smoking amount and lung function may be negatively correlated.
In economics, price and demand may have a negative relationship.
In sports, training time and performance may show a positive pattern.

Real data are rarely perfect. Human behavior, measurement error, and hidden factors all create scatter. That is why correlation is described using words like strong or weak rather than just yes or no.

Suppose a coach records sprint time and weekly training distance for each athlete. If the data show a strong negative correlation, it might suggest that athletes who train more tend to have lower, and therefore faster, sprint times. But the coach should be careful: other factors such as diet, rest, and natural ability may also matter. Statistics helps us describe patterns, but it does not automatically explain them.

A good statistical description is balanced. It should mention the direction, strength, form, and any unusual features. That is exactly the kind of reasoning IB expects.

How to answer exam-style questions

When a question asks about correlation, students, use a clear structure:

Describe the scatter plot.
State the direction of the correlation.
Comment on the strength.
Mention whether the relationship looks linear.
Note any outliers.
If relevant, interpret the value of $r$.

For example, you might write:

“The data show a moderate positive linear correlation. As $x$ increases, $y$ generally increases. There is one possible outlier that may affect the strength of the relationship.”

If a question asks whether one variable causes another, remember to be careful. Correlation alone cannot establish causation. To show causation, we would need a well-designed experiment or stronger evidence that rules out other factors.

If the question gives a correlation coefficient, interpret it in context. For example, if $r=-0.86$, this indicates a strong negative linear correlation. In plain language, that means as one variable increases, the other tends to decrease quite consistently.

Conclusion

Correlation is a key statistical idea for understanding relationships between two variables. It helps us read scatter diagrams, describe patterns, and judge whether a linear model is appropriate. The correlation coefficient $r$ gives a numerical measure of the strength and direction of linear association, but it must be interpreted carefully and alongside the graph.

For IB Mathematics: Analysis and Approaches HL, students, correlation is not just about memorizing a formula. It is about making sensible statistical judgments from data. It connects directly to regression, modelling, and the broader study of uncertainty in statistics and probability. When used carefully, correlation gives powerful insight into real-world situations, from school results to health data and beyond 📊.

Study Notes

Correlation describes how two variables are related.
A scatter diagram is used to visualize correlation.
Positive correlation means both variables tend to increase together.
Negative correlation means one variable tends to increase as the other decreases.
Weak or no correlation means there is no clear linear pattern.
The correlation coefficient is written as $r$.
The value of $r$ is always between $-1$ and $1$.
$r=1$ means perfect positive linear correlation.
$r=-1$ means perfect negative linear correlation.
$r=0$ means no linear correlation.
Correlation strength should be described using context, not just a number.
Outliers can strongly affect correlation.
Correlation does not imply causation.
Regression uses the relationship in data to make predictions.
A line of best fit is useful only when the data are roughly linear.
Extrapolation beyond the data range can be unreliable.
Good statistical answers mention direction, strength, form, and unusual points.