Scatter Diagrams and Correlation
Have you ever noticed that when one thing changes, another thing often changes too? 📈 For example, students who spend more time revising often score higher on tests, and taller students tend to weigh more. In statistics, we use scatter diagrams to show these relationships visually, and correlation to describe how strongly and in what direction two variables are connected.
In this lesson, students, you will learn how to read scatter diagrams, describe patterns, and understand what correlation really means. By the end, you should be able to explain the ideas clearly, use the correct terminology, and connect this topic to the wider study of statistics and probability.
What Is a Scatter Diagram?
A scatter diagram is a graph that shows paired data points for two numerical variables. Each point represents one observation with coordinates $(x,y)$. The variable on the horizontal axis is usually called the independent variable, and the variable on the vertical axis is usually called the dependent variable, although in statistics this choice is often based on context rather than a strict rule.
For example, if we study the relationship between hours of revision and test score, we might place revision time on the $x$-axis and test score on the $y$-axis. Each student gives one pair of values, such as $(2,65)$ or $(5,88)$.
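As a rough illustration of how paired data becomes points on a scatter diagram, here is a minimal Python sketch. The data values are made up for this example (only $(2,65)$ and $(5,88)$ come from the text above), and `text_scatter` is a hypothetical helper that draws a very coarse text version of the diagram rather than a proper graph.

```python
# Hypothetical data: hours of revision and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 65, 61, 74, 88, 85]

# Each student contributes one (x, y) point.
points = list(zip(hours, scores))


def text_scatter(points, rows=8, cols=24):
    """Return a rough text scatter diagram with one '*' per data point."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        # Scale each value to a column or row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (cols - 1))
        row = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("|" + "".join(line) for line in grid)


print(points)
print(text_scatter(points))
```

The stars drift upwards from left to right, which is the kind of pattern a real scatter diagram would make easier to see. In practice you would use graph paper, a graphing calculator, or plotting software instead of text output.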
Scatter diagrams are useful because they help us see patterns quickly. Instead of looking at a long table of numbers, you can often spot whether the data:
- rises from left to right,
- falls from left to right,
- appears random,
- clusters into groups,
- or contains unusual values called outliers.
A scatter diagram is one of the first tools used in analysing data because it gives a visual summary before more advanced methods are used.
Describing the Pattern in a Scatter Diagram
When you look at a scatter diagram, there are several features you should describe carefully.
Direction
The direction tells you whether the overall trend goes up or down.
- Positive association: as $x$ increases, $y$ tends to increase.
- Negative association: as $x$ increases, $y$ tends to decrease.
- No association: there is no clear pattern between the variables.
A real-world example of positive association is height and arm span. In many cases, taller people also have longer arm spans. A real-world example of negative association is speed and travel time for a fixed distance: as speed increases, time decreases.
Form
The form describes the shape of the pattern.
- Linear: the points follow a pattern close to a straight line.
- Non-linear: the points follow a curve or other shape.
In IB Mathematics Analysis and Approaches SL, you often start by checking whether the relationship looks roughly linear, because that affects the methods you may use later.
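To make the difference between direction and form concrete, the sketch below revisits the speed and travel time example. The distance of 120 km and the list of speeds are assumptions chosen only for illustration. It checks the direction (does time always fall as speed rises?) and the form (do equal steps in $x$ give equal steps in $y$, as they would on a straight line?).

```python
# Hypothetical journey: a fixed distance of 120 km at six different speeds.
distance_km = 120
speeds = [20, 40, 60, 80, 100, 120]        # km/h, equally spaced
times = [distance_km / s for s in speeds]  # hours

# Direction: does time fall every time speed rises?
always_decreasing = all(b < a for a, b in zip(times, times[1:]))

# Form: on a straight line, equal steps in x give equal steps in y.
steps = [round(b - a, 2) for a, b in zip(times, times[1:])]
is_linear = len(set(steps)) == 1

print(times)
print(steps)
print(always_decreasing, is_linear)
```

Here the association is negative, but the steps in time shrink as speed increases, so the points follow a curve rather than a straight line. A relationship can have a clear direction without being linear.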
Strength
The strength tells you how closely the points cluster around the overall pattern.
- A strong relationship means the points are close to a line or curve.
- A weak relationship means the points are more spread out.
Strength matters because two data sets can both show positive association, but one may be much clearer than the other.
Outliers
An outlier is a data point that lies unusually far from the rest of the data. Outliers can affect the appearance of a scatter diagram and may also influence measures of correlation and regression.
For example, if most students score between $40$ and $90$, but one student scores $5$, that point may stand out and should be checked carefully. It may be a genuine unusual value or it may be due to an error in recording.
Correlation: What It Means
Correlation describes the strength and direction of a linear relationship between two variables. It is most meaningful when the points on the scatter diagram follow a roughly straight-line pattern.
A common numerical measure is Pearson's product-moment correlation coefficient, usually written as $r$.
Its value lies between $-1$ and $1$:
$$-1 \leq r \leq 1$$
Interpretation:
- If $r$ is close to $1$, there is a strong positive linear correlation.
- If $r$ is close to $-1$, there is a strong negative linear correlation.
- If $r$ is close to $0$, there is little or no linear correlation.
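For reference, assuming $r$ here denotes Pearson's product-moment correlation coefficient (the usual meaning at this level), it can be computed from $n$ data pairs $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

The numerator is positive when above-average $x$ values tend to pair with above-average $y$ values, and negative when they tend to pair with below-average $y$ values. In practice, the value of $r$ is usually found with a calculator or software rather than by hand.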
It is important to understand that $r$ measures linear association, not every possible kind of association. Two variables might have a clear curved pattern and still have a correlation coefficient near $0$.
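The point about curved patterns can be checked with a small sketch. It assumes $r$ is Pearson's correlation coefficient. `pearson_r` is a hypothetical helper written for this illustration, and the data are made up. The first data set lies exactly on a parabola and the second lies exactly on a straight line.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs]     # perfect curved (quadratic) pattern
ys_line = [2 * x + 1 for x in xs]   # perfect straight-line pattern

r_curve = pearson_r(xs, ys_curve)
r_line = pearson_r(xs, ys_line)

print(r_curve)
print(r_line)
```

For the parabola, $y$ is completely determined by $x$, yet $r$ comes out as zero (up to rounding), because the falling left half cancels the rising right half. For the straight line, $r$ is $1$. A value of $r$ near $0$ therefore rules out a linear pattern, not every pattern.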
Also, correlation does not automatically mean that one variable causes the other. This idea is often summarised as "correlation does not imply causation." For example, ice cream sales and the number of sunburn cases may both rise in summer, but buying ice cream does not cause sunburn. A third variable, such as hot weather, helps explain both.
Reading and Interpreting Correlation in Context
In exams, you are often expected to interpret results in context, not just state a number. If a scatter diagram shows a strong positive relationship between study time and exam score, you should say something like:
“Students who spend more time studying tend to achieve higher exam scores, and the relationship appears strong and positive.”
This is better than simply saying “positive correlation,” because it explains what the variables are and what the pattern means.
When describing a correlation, use the correct terms:
- positive or negative,
- strong or weak,
- linear or non-linear,
- and mention any outliers or clusters.
A useful tip, students: always refer to the actual variables. Saying “there is a strong positive correlation” is less complete than saying “there is a strong positive correlation between hours of revision and test score.”
Example: Revision Time and Test Score
Imagine a teacher collects data from ten students on revision time in hours and test scores out of $100$. The scatter diagram shows points that rise from left to right, and most points lie near a straight line.
From this graph, we could conclude:
- the association is positive,
- the form is approximately linear,
- the strength is fairly strong,
- and there are no obvious outliers.
This suggests that students who revise more tend to score higher, although this does not prove that revision time is the only reason for better scores. Other factors could matter too, such as prior knowledge, sleep, or quality of study methods.
Now imagine a different data set comparing temperature and ice cream sales. The points may show a strong positive trend. This is another good example of a positive association in context. However, if we wanted to predict future sales from temperature, we would need to be careful because the relationship might change across seasons or locations.
Using Scatter Diagrams as a First Step in Data Analysis
Scatter diagrams are often used before building a line of best fit or a regression model. They help you decide whether a linear model is reasonable.
If the data points are scattered randomly with no clear pattern, then a line of best fit may not be useful. If the data follow a curved pattern, then a linear model may be a poor choice.
This links scatter diagrams to the broader topic of Statistics and Probability because statistics is about collecting, presenting, and interpreting data. A scatter diagram helps you summarise bivariate data, which means data involving two variables. This is part of learning how to describe real-world situations using mathematical tools.
The main ideas here connect to later work in regression, where you may use an equation of the form $y = mx + c$ to model a linear relationship. Before you fit a model, you should always inspect the scatter diagram to see whether the model is sensible.
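As a preview of that later work, here is a minimal sketch of fitting $y = mx + c$. It assumes the line of best fit is the least-squares regression line of $y$ on $x$. `fit_line` is a hypothetical helper name, and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx        # gradient
    c = y_bar - m * x_bar  # y-intercept
    return m, c


# Hypothetical data: hours of revision and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 65, 61, 74, 88, 85]

m, c = fit_line(hours, scores)
predicted = m * 3.5 + c  # estimate for 3.5 hours, inside the data range

print(round(m, 2), round(c, 2), round(predicted, 1))
```

The code will produce a line for any data set, including curved or patternless data, which is exactly why the scatter diagram should be inspected first. Predictions are also most trustworthy for $x$ values inside the range of the observed data.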
Common Mistakes to Avoid
Here are some mistakes students often make:
- Saying correlation proves cause and effect.
- Describing a relationship as “strong” without saying whether it is positive or negative.
- Ignoring outliers that may affect the pattern.
- Using a line of best fit when the data are clearly non-linear.
- Forgetting to interpret the graph in context.
Another important point is that the axes and units matter. A scatter diagram without labels is incomplete. To interpret data correctly, you need to know what the variables measure.
Conclusion
Scatter diagrams and correlation are core tools for understanding relationships between two numerical variables. A scatter diagram shows the data visually, while correlation describes the direction and strength of a linear relationship. In IB Mathematics Analysis and Approaches SL, these ideas help you analyse real data, judge whether relationships are linear, and prepare for regression and more advanced statistical modelling.
When you look at a scatter diagram, focus on direction, form, strength, and outliers. Then explain the relationship clearly in context. This skill is essential in statistics because it turns raw data into meaningful information. 😊
Study Notes
- A scatter diagram plots paired data as points $(x,y)$.
- It is used to explore the relationship between two numerical variables.
- Positive correlation means $y$ tends to increase as $x$ increases.
- Negative correlation means $y$ tends to decrease as $x$ increases.
- No correlation means there is no clear linear pattern.
- Correlation coefficient $r$ satisfies $-1 \leq r \leq 1$.
- $r$ close to $1$ means strong positive linear correlation.
- $r$ close to $-1$ means strong negative linear correlation.
- $r$ close to $0$ means little or no linear correlation.
- Correlation describes association, not causation.
- A scatter diagram should be described by direction, form, strength, and outliers.
- Scatter diagrams are a starting point for regression and other statistical analysis.
- Always interpret results in context using the names of the variables.
