4. Statistics and Probability

Outliers

Outliers in Statistics and Probability 📊

Imagine two classes take the same math test. In one class, most scores are around 70, but one student scores 99 and another scores 12. In the other class, the scores are all close to 70. Which class is “more typical”? Which values might be called outliers? In statistics, outliers are unusual data values that stand out from the rest of a dataset. Understanding them matters because they can change averages, spread, regression lines, and conclusions. In IB Mathematics: Analysis and Approaches HL, you need to know how to identify outliers, describe their effect, and decide whether they should be investigated rather than ignored. 🎯

By the end of this lesson, you should be able to:

  • explain what an outlier is and why it matters,
  • use statistical tools to identify possible outliers,
  • understand how outliers affect summary statistics and graphs,
  • connect outliers to regression, correlation, and probability ideas,
  • and summarize how outliers fit into the wider study of statistics and probability.

What is an outlier? 🔍

An outlier is a data value that is noticeably far from the rest of the data. It may be caused by a genuine unusual event, measurement error, or a data entry mistake. For example, if the heights of students in a class are mostly between $150\,\text{cm}$ and $190\,\text{cm}$, then a recorded height of $15\,\text{cm}$ is almost certainly an error, while a height of $210\,\text{cm}$ might be unusual but still possible.

Outliers are not defined by a single universal number. Whether a value is an outlier depends on context, the shape of the data, and the method used to detect it. In IB Statistics and Probability, the main idea is to use evidence, not just guesswork.

A good first check is the graph. On a dot plot, scatter plot, box-and-whisker plot, or histogram, an outlier may appear isolated from the main cluster of data. For example, if most exam marks lie between $50$ and $80$, but one mark is $2$, that point clearly deserves attention. However, a value far from the center is not automatically “wrong.” It may represent a real extreme case, like a very tall person in a height dataset or a very high-income household in an economics dataset.

Ways to identify outliers 📉

One common IB method uses the interquartile range, written as $\mathrm{IQR}$. The quartiles divide ordered data into four parts:

  • $Q_1$ is the lower quartile,
  • $Q_2$ is the median,
  • $Q_3$ is the upper quartile.

The interquartile range is

$$\mathrm{IQR} = Q_3 - Q_1$$

A common rule for potential outliers is the $1.5\,\mathrm{IQR}$ rule. Values below

$$Q_1 - 1.5\,\mathrm{IQR}$$

or above

$$Q_3 + 1.5\,\mathrm{IQR}$$

may be treated as outliers.

Example: suppose a dataset has $Q_1 = 18$ and $Q_3 = 30$. Then

$$\mathrm{IQR} = 30 - 18 = 12$$

The lower fence is

$$18 - 1.5(12) = 18 - 18 = 0$$

and the upper fence is

$$30 + 1.5(12) = 30 + 18 = 48$$

So values below $0$ or above $48$ would be flagged as possible outliers. Context then decides what a flagged value means: if the data are test scores, a score above $48$ is flagged by the rule but may be perfectly genuine, whereas if the data are ages, a negative value is impossible and must be an error.
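The fence calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than an official IB procedure: the function name `iqr_fences` and the parameter `k` are assumptions made for this example, and the quartiles are taken as given instead of being computed from raw data.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles from the worked example: Q1 = 18, Q3 = 30.
lower, upper = iqr_fences(18, 30)
print(lower, upper)  # 0.0 48.0
```

Any value below `lower` or above `upper` is then a candidate for closer inspection, not an automatic deletion.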

Another useful idea is the z-score. A z-score tells you how many standard deviations a value is from the mean. It is given by

$$z = \frac{x - \mu}{\sigma}$$

for a population, or

$$z = \frac{x - \bar{x}}{s}$$

for a sample.

A large absolute value of $z$ may indicate an outlier; a common rule of thumb treats $|z| > 2$ as unusual and $|z| > 3$ as extreme, although conventions vary. For example, if a test score is much larger than the mean, its z-score will be positive and possibly large. In practice, the $1.5\,\mathrm{IQR}$ rule is often more robust, because quartiles are less affected by extreme values than the mean and standard deviation are.
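As a rough illustration of the sample formula, the sketch below computes z-scores with Python's standard `statistics` module. The list `marks` is hypothetical data invented for this example, and the sample standard deviation $s$ (dividing by $n - 1$) is assumed.

```python
from statistics import mean, stdev

# Hypothetical sample of test marks; 99 is the value under suspicion.
marks = [62, 65, 68, 70, 71, 73, 75, 99]

x_bar = mean(marks)
s = stdev(marks)  # sample standard deviation (divides by n - 1)

# z = (x - x_bar) / s for every value in the sample
z_scores = [(x - x_bar) / s for x in marks]

for x, z in zip(marks, z_scores):
    print(f"{x:3d}  z = {z:+.2f}")
```

In this made-up sample the mark of 99 has a z-score only a little above 2, partly because the extreme value itself inflates $s$. That is one reason a quartile-based check can be more reliable when extreme values are present.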

How outliers affect summaries and graphs 🧠

Outliers can strongly change the mean. Since the mean uses every value in the sum,

$$\bar{x} = \frac{\sum x}{n}$$

a very large or very small value can pull the mean toward itself. The median is usually less affected because it depends on order rather than the exact size of every value.

Example: consider the data set $4, 5, 6, 6, 7, 8, 30$.

  • The median is $6$.
  • The mean is

$$\bar{x} = \frac{4+5+6+6+7+8+30}{7} = \frac{66}{7} \approx 9.43$$

The value $30$ pulls the mean upward a lot, even though most values are around $6$ or $7$. This is why, when a dataset has outliers, the median and interquartile range are often better measures of center and spread than the mean and standard deviation.
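The effect can be checked numerically. The short sketch below uses the same seven values and compares the mean and median with and without the value $30$; the variable names are chosen only for this illustration.

```python
from statistics import mean, median

data = [4, 5, 6, 6, 7, 8, 30]
without_30 = [x for x in data if x != 30]  # same data with the extreme value left out

print("with 30:    mean =", round(mean(data), 2), " median =", median(data))
print("without 30: mean =", round(mean(without_30), 2), " median =", median(without_30))
```

Leaving out the single value $30$ moves the mean from about $9.43$ to $6$, while the median stays at $6$ in both cases.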

Outliers also affect graphs. On a box plot, a possible outlier may appear as a separate point beyond the whiskers. In a histogram, one extreme value may stretch the horizontal axis and make the main cluster harder to read. In a scatter plot, an outlier may sit far from the general trend, which can make it look like the data are less consistent than they really are.

Outliers in regression and correlation 📈

Outliers matter a great deal in regression and correlation. Correlation describes the strength and direction of a linear relationship. A single extreme point can change the correlation coefficient, written as $r$, and it can also change the regression line.

If most points follow an upward trend, then a far-away point might make the trend look weaker or even misleading. For example, suppose you are studying hours studied $x$ and test score $y$. If most students fit a pattern where more study time means higher scores, but one student studied a lot and scored very low because of illness, that point may be an outlier. It should not automatically be deleted. It could represent an important real-world exception.

In regression, outliers can be especially influential if they are also high-leverage points, meaning they are far from the mean of the $x$-values. These points can pull the line toward themselves and change predictions. This is important in IB because you must not blindly trust a line of best fit without checking the data.
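To see how much one point can matter, the sketch below computes the correlation coefficient $r$ with and without a single unusual point. The data are hypothetical (invented hours and scores, plus one student who studied $9$ hours but scored $35$), and `pearson_r` is a helper written for this example from the usual formula for $r$.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: hours studied and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 63, 70, 74, 81]

r_without = pearson_r(hours, scores)
# Add one unusual student: many hours studied, very low score.
r_with = pearson_r(hours + [9], scores + [35])

print(f"r without the unusual point: {r_without:.3f}")
print(f"r with the unusual point:    {r_with:.3f}")
```

In this invented data the added point is far from the other $x$-values as well as the other $y$-values, so it has high leverage: a strong positive correlation turns negative once it is included.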

A good habit is to ask:

  • Is the point unusual in the $y$-direction only?
  • Is it unusual in the $x$-direction too?
  • Is there a reason for it, such as a recording error or a special circumstance? 🤔

Outliers and probability ideas 🎲

Outliers are mainly a statistics idea, but they connect to probability because unusual values can be thought of as rare events. In probability, we often ask how likely a value is under a model. If a result has very small probability, it may be considered surprising or unusual.

For example, if exam marks are approximately normal with mean $70$ and standard deviation $5$, then a mark of $95$ is very far from the mean. Under the model, such a value would have a very small probability. This does not prove the value is invalid, but it suggests further checking is sensible.
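The size of that probability can be estimated with `statistics.NormalDist` from Python's standard library. This is only a sketch under the stated assumption that marks follow a normal model with mean $70$ and standard deviation $5$; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model from the example: marks ~ N(70, 5^2).
model = NormalDist(mu=70, sigma=5)

mark = 95
z = (mark - model.mean) / model.stdev   # standard deviations above the mean
p_at_least = 1 - model.cdf(mark)        # P(X >= 95) under the model

print(f"z = {z}")
print(f"P(X >= {mark}) = {p_at_least:.2e}")
```

The mark is $5$ standard deviations above the mean, and the tail probability is well below one in a million, which is why the value deserves a second look even though the model does not rule it out.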

Outliers also connect to discrete and continuous distributions. In a discrete distribution, some values may have extremely small probabilities, while in a continuous distribution, extreme tail values are possible but rare. In both cases, a value may be unusual without being impossible. That is why context is essential.

This idea appears in data collection too. If a measurement device is faulty, the dataset may contain an outlier caused by error rather than chance. If the data come from a random sample, an outlier may simply be a rare but legitimate outcome. Good statistical practice is to investigate before removing any value.

Working with outliers in IB-style reasoning ✍️

In IB Mathematics: Analysis and Approaches HL, you are expected to use clear reasoning. If asked whether a data point is an outlier, you should show your method. For the $1.5\,\mathrm{IQR}$ rule, list $Q_1$, $Q_3$, calculate $\mathrm{IQR}$, find the fences, and compare the value. If asked about a scatter plot, describe whether the point is far from the overall pattern and whether it may influence correlation or regression.

Example reasoning: a company tracks delivery times in minutes. Most values are between $20$ and $35$, but one time is $90$. If $90$ lies above the upper fence $Q_3 + 1.5\,\mathrm{IQR}$, it is flagged as a possible outlier. In context, it might be caused by traffic, a vehicle breakdown, or a data entry mistake. A statistician would not just delete it automatically. They would investigate the cause and decide whether it belongs in the analysis.
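The full method (quartiles, $\mathrm{IQR}$, fences, comparison) can be sketched for the delivery-time scenario as follows. The data values are hypothetical, chosen to match the description, and the helper `quartiles` assumes one particular convention: $Q_1$ and $Q_3$ are the medians of the lower and upper halves, with the middle value excluded when $n$ is odd. Calculators and software may use slightly different quartile conventions.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Skip the middle value when n is odd.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return median(lower_half), median(upper_half)


# Hypothetical delivery times in minutes.
times = [20, 22, 24, 25, 27, 28, 30, 31, 33, 35, 90]

q1, q3 = quartiles(times)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are possible outliers to investigate.
flagged = [t for t in times if t < lower_fence or t > upper_fence]

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("fences:", lower_fence, upper_fence)
print("possible outliers:", flagged)
```

With these invented times, $Q_1 = 24$, $Q_3 = 33$ and $\mathrm{IQR} = 9$, so the fences are $10.5$ and $46.5$ and only the time of $90$ is flagged.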

When writing about outliers, use careful language like “possible outlier,” “unusual value,” and “requires investigation.” This is better than saying every extreme value is an error.

Conclusion ✅

Outliers are unusual data values that can strongly affect interpretation. They can change the mean, standard deviation, correlation, and regression line, while the median and interquartile range are often more resistant to their influence. The $1.5\,\mathrm{IQR}$ rule and z-scores are useful tools for identifying possible outliers, but context always matters. In IB Mathematics: Analysis and Approaches HL, outliers connect data description, graphical interpretation, regression, and probability thinking into one important skill: using evidence to decide what a data value means.

Study Notes

  • An outlier is a value that lies unusually far from the rest of the data.
  • Outliers may come from random variation, real extreme cases, measurement error, or data entry mistakes.
  • The interquartile range is $\mathrm{IQR} = Q_3 - Q_1$.
  • Possible outliers may lie below $Q_1 - 1.5\,\mathrm{IQR}$ or above $Q_3 + 1.5\,\mathrm{IQR}$.
  • A z-score is $z = \frac{x - \mu}{\sigma}$ or $z = \frac{x - \bar{x}}{s}$.
  • Outliers can strongly affect the mean, standard deviation, correlation coefficient $r$, and regression lines.
  • The median and interquartile range are usually less affected by outliers than the mean and standard deviation.
  • In a scatter plot, an outlier may weaken or distort the apparent linear relationship.
  • In probability, an outlier can be viewed as a rare event under a model.
  • Always interpret outliers in context before deciding what to do with them.
