Lesson 5.3: Box Plots and Comparing Distributions
Introduction
In statistics, understanding how data is distributed is crucial for interpreting it effectively. While averages provide a snapshot of the data, they often mask important characteristics such as how spread out the data points are. This lesson focuses on box plots, which display the distribution of a dataset through its five-number summary. By the end of this lesson, you will be able to construct box plots, identify outliers, and compare distributions using parallel box plots.
Learning Objectives
- Understand the five-number summary: minimum, lower quartile, median, upper quartile, maximum.
- Learn to construct a box-and-whisker plot from the five-number summary.
- Identify and mark clear outliers on a box plot.
- Compare two or more groups using parallel box plots.
- State the five-number summary of a dataset.
The Five-Number Summary
The five-number summary provides key insights into the distribution of a dataset. It consists of the following five values:
- Minimum: The smallest data point in the dataset.
- Lower Quartile (Q1): This value divides the lowest 25% of the data from the rest. It is a measure of the lower end of the dataset.
- Median (Q2): The middle value when the data is arranged in ascending order. It represents the midpoint of the dataset.
- Upper Quartile (Q3): This value divides the lowest 75% from the highest 25% of the data. It gives an idea of the upper end of the dataset.
- Maximum: The largest data point in the dataset.
Example of Finding the Five-Number Summary
Let's say we have the following dataset representing the ages of participants in a survey: [22, 25, 29, 22, 35, 33, 40].
- Minimum: The smallest number is 22.
- Lower Quartile (Q1): To find Q1, we first order the data: [22, 22, 25, 29, 33, 35, 40]. Q1 is the median of the first half (22, 22, 25), which is 22.
- Median: The median is the middle number of the ordered data. Here, it is 29.
- Upper Quartile (Q3): We find Q3 by looking at the second half of the ordered data, excluding the median (33, 35, 40). The middle value is 35.
- Maximum: The largest number is 40.
So the five-number summary for this dataset is: Minimum = 22, Q1 = 22, Median = 29, Q3 = 35, Maximum = 40.
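The calculation above can be sketched in Python. This is a minimal sketch, assuming the median-of-halves convention used in this lesson (the overall median is left out of both halves when the number of values is odd); other textbooks and software may use a different quartile convention and return slightly different values. The function name `five_number_summary` is chosen here for illustration.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: quartiles are the medians of the lower and upper halves,
    and the overall median is excluded from both halves when the number
    of values is odd. Requires at least two values.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]                  # values before the median position
    upper = values[len(values) - half:]    # values after the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])

ages = [22, 25, 29, 22, 35, 33, 40]
print(five_number_summary(ages))  # (22, 22, 29, 35, 40)
```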
Constructing a Box-and-Whisker Plot
A box-and-whisker plot is a standardized way of displaying the five-number summary. Here's how to construct it step-by-step:
- Draw a number line: This will form the basis of your box plot.
- Plot the Minimum and Maximum: Mark points for the minimum and maximum values on this line.
- Draw a Box: Draw a box from Q1 to Q3. This box represents the interquartile range (IQR).
- Mark the Median: Inside the box, draw a vertical line at the median (Q2).
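- Whiskers: From each end of the box (Q1 and Q3), draw lines (whiskers) extending to the minimum and maximum values respectively. If the data contains outliers, a whisker usually stops at the most extreme value that is not an outlier.
- Outliers: Identify outliers, typically defined as any data point below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$, and mark each one as a separate point beyond the whisker.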
Example of Constructing a Box Plot
Using the five-number summary we obtained earlier, we can now construct the box-and-whisker plot:
- Draw a number line that can accommodate the minimum (22) and maximum (40).
- Mark the minimum point (22) and maximum point (40).
- Draw a box from the lower quartile (22) to the upper quartile (35).
- Indicate the median (29) by drawing a vertical line inside the box.
- Extend the whiskers from Q1 to the minimum (22) and from Q3 to the maximum (40). Because Q1 and the minimum are both 22, the lower whisker has zero length.
Our constructed box plot visually represents the distribution of ages in the survey dataset. It shows that the middle 50% of ages lie between 22 and 35, and that the upper whisker extends to the maximum of 40.
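For readers who want to see the construction steps carried out programmatically, here is a rough text-only sketch in Python. The function `ascii_box_plot`, its parameters, and the characters used for the box and whiskers are all illustrative choices, not a standard notation; a plotting library would normally be used for a real chart.

```python
def ascii_box_plot(minimum, q1, median, q3, maximum, lo, hi, width=41):
    """Draw a one-line text box plot on a number line running from lo to hi."""
    def col(x):
        # Map a data value to a column index between 0 and width - 1.
        return round((x - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    for c in range(col(minimum), col(maximum) + 1):
        row[c] = "-"              # whiskers span minimum to maximum
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="              # box spans Q1 to Q3
    row[col(minimum)] = "|"       # whisker ends
    row[col(maximum)] = "|"
    row[col(q1)] = "["            # box edges (drawn over a whisker end if they coincide)
    row[col(q3)] = "]"
    row[col(median)] = "M"        # median line inside the box
    return "".join(row)

# Five-number summary from the example, on a number line from 20 to 40.
print(ascii_box_plot(22, 22, 29, 35, 40, lo=20, hi=40))
```

In the printed line, the `[` sits directly on the minimum because Q1 and the minimum are both 22, so no lower whisker is visible.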
Identifying Outliers
Outliers are values in a dataset that are significantly higher or lower than the rest. In our box plot example, let's determine if there are any outliers:
- Calculate the IQR: $IQR = Q3 - Q1 = 35 - 22 = 13$.
- Determine the outlier boundaries: $\text{Lower Bound} = Q1 - 1.5 \times IQR = 22 - 1.5(13) = 22 - 19.5 = 2.5$ and $\text{Upper Bound} = Q3 + 1.5 \times IQR = 35 + 1.5(13) = 35 + 19.5 = 54.5$.
- Any data points below 2.5 or above 54.5 are considered outliers. In our dataset, there are no values below 2.5 or above 54.5.
Hence, we conclude that there are no outliers in this dataset.
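The same check can be written as a short Python sketch. The helper names `outlier_fences` and `find_outliers` are hypothetical, and the quartiles are passed in directly so the code does not depend on any particular quartile convention.

```python
def outlier_fences(q1, q3):
    """Return (lower_bound, upper_bound) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall outside the outlier boundaries."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

ages = [22, 25, 29, 22, 35, 33, 40]
print(outlier_fences(22, 35))        # (2.5, 54.5)
print(find_outliers(ages, 22, 35))   # []
```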
Comparing Two or More Groups with Parallel Box Plots
One of the most powerful uses of box plots is to compare distributions between multiple groups. For example, we can compare two different datasets of ages from two different surveys:
- Dataset A: [22, 25, 29, 22, 35, 33, 40]
- Dataset B: [18, 20, 21, 24, 36, 37, 50]
Steps to Compare Using Box Plots
- Calculate five-number summaries for both datasets and construct individual box plots for each.
- Plot both box plots side by side on the same number line for visual comparison. This helps to see differences in medians and spreads at a glance.
- Analyze and compare the box plots: Compare the medians to see which group has the higher typical value, compare the interquartile ranges and overall ranges to examine variability, and check for outliers.
Example of Comparing Box Plots
Using our two datasets:
- For Dataset A, we previously found the five-number summary: Minimum = 22, Q1 = 22, Median = 29, Q3 = 35, Maximum = 40.
- For Dataset B: The data is already in order, [18, 20, 21, 24, 36, 37, 50], which gives:
  - Minimum = 18
  - Q1 = 20 (the median of 18, 20, 21)
  - Median = 24
  - Q3 = 37 (the median of 36, 37, 50)
  - Maximum = 50
After constructing both box plots side by side, we can analyze them. Dataset A has a median of 29, while Dataset B's median is 24, so participants in Dataset A are typically older. Dataset B, however, shows the greater spread: its range is $50 - 18 = 32$ compared with $40 - 22 = 18$ for Dataset A, and its IQR is $37 - 20 = 17$ compared with $35 - 22 = 13$. Neither dataset has outliers, since the outlier boundaries for Dataset B are $20 - 1.5(17) = -5.5$ and $37 + 1.5(17) = 62.5$.
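A short Python sketch can recompute each group's median, interquartile range, and range, so the comparison rests on numbers as well as on the picture. It assumes the median-of-halves quartile convention used in this lesson; the function names `five_number_summary` and `describe` are illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum), median-of-halves method."""
    values = sorted(data)
    half = len(values) // 2
    return (values[0], median(values[:half]), median(values),
            median(values[len(values) - half:]), values[-1])

def describe(data):
    """Summarize center and spread for comparing groups."""
    lo, q1, med, q3, hi = five_number_summary(data)
    return {"median": med, "iqr": q3 - q1, "range": hi - lo}

dataset_a = [22, 25, 29, 22, 35, 33, 40]
dataset_b = [18, 20, 21, 24, 36, 37, 50]
print("A:", describe(dataset_a))  # A: {'median': 29, 'iqr': 13, 'range': 18}
print("B:", describe(dataset_b))  # B: {'median': 24, 'iqr': 17, 'range': 32}
```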
Conclusion
Box plots provide a clear and effective way to summarize and visualize data distributions. They allow us to represent the five-number summary succinctly and compare distributions effectively. In practice, understanding box plots is vital as it helps to identify trends, understand variability, and make informed decisions based on data. As we move forward, remember that the distribution and spread of data are just as essential as the averages in accurately interpreting data.
Study Notes
- The five-number summary consists of the minimum, lower quartile (Q1), median, upper quartile (Q3), and maximum.
- A box plot visually depicts the five-number summary, showcasing data spread and potential outliers.
- A data point is an outlier if it lies below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$.
- Parallel box plots enable effective comparison of distributions across multiple datasets.
- The median gives the central location of the data, while the interquartile range indicates its spread.
