Lesson 5.5: Comparing Datasets Fairly
Introduction
In this lesson, we will explore how to compare two or more datasets fairly by considering both their averages (centres) and their spreads (variability). By the end of this lesson, you will understand the importance of reporting data in a way that accurately reflects its characteristics, allowing for clear and reliable comparisons.
Learning Objectives:
- Compare two or more datasets on both their average and their spread.
- Understand why a fair comparison reports centre and variability together.
- Informally compare the spread of groups of different sizes or units.
- Write a clear, two-part comparison in words.
Understanding Measures of Centre and Spread
To make meaningful comparisons between datasets, we first need to understand what is meant by the measures of centre and spread.
Measures of Centre
The measure of centre typically refers to the average or the mean. The mean is calculated by summing all data points and dividing by the number of points:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $\bar{x}$ is the mean, $x_i$ represents each individual data point, and $n$ is the number of data points.
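The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `mean` and the sample values are assumptions chosen for the example, not part of the lesson's data.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical sample values, used only to show the call.
sample = [3, 5, 7]
sample_mean = mean(sample)
print(sample_mean)  # 5.0
```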
Measures of Spread
The spread of a dataset indicates how much variability there is in the data. Common measures of spread include:
- Range: The difference between the maximum and minimum values in a dataset.
$$\text{Range} = \text{Max}(x) - \text{Min}(x)$$
- Variance: Roughly the average of the squared differences from the mean. For a sample of $n$ values, the sample variance $s^2$ divides by $n - 1$ rather than $n$:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
- Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the original data:
$$s = \sqrt{s^2}$$
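As a rough sketch, the three measures above could be written as small Python functions. The function names and the sample values are assumptions made for illustration; the variance follows the $n - 1$ formula given above.

```python
import math


def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


def sample_variance(values):
    """Sample variance: sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample values, used only to show the calls.
sample = [3, 5, 7]
print(data_range(sample))       # 4
print(sample_variance(sample))  # 4.0
print(sample_sd(sample))        # 2.0
```

Because the standard deviation is in the same units as the data, it is usually the easier of the two to put into a written comparison.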
Example: Comparing Two Datasets
Let's consider two datasets:
- Dataset A: 2, 4, 6, 8, 10
- Dataset B: 1, 1, 1, 1, 1, 20
To compare the two datasets, we will calculate their means and spreads.
Step 1: Calculate the Mean
For Dataset A:
$$\bar{x}_A = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$$
For Dataset B:
$$\bar{x}_B = \frac{1 + 1 + 1 + 1 + 1 + 20}{6} = \frac{25}{6} \approx 4.17$$
Step 2: Calculate the Range
For Dataset A:
$$\text{Range}_A = 10 - 2 = 8$$
For Dataset B:
$$\text{Range}_B = 20 - 1 = 19$$
Step 3: Calculate the Variance and Standard Deviation
For Dataset A:
- Compute the squared differences from the mean:
  - $(2 - 6)^2 = 16$
  - $(4 - 6)^2 = 4$
  - $(6 - 6)^2 = 0$
  - $(8 - 6)^2 = 4$
  - $(10 - 6)^2 = 16$
- Sum of squared differences: $16 + 4 + 0 + 4 + 16 = 40$
- Variance:
$$s^2_A = \frac{40}{5-1} = \frac{40}{4} = 10$$
- Standard Deviation:
$$s_A = \sqrt{10} \approx 3.16$$
For Dataset B:
- Compute the squared differences from the mean (using $\bar{x}_B \approx 4.17$):
  - For each of the five values equal to 1: $(1 - 4.17)^2 \approx 10.05$
  - For the value 20: $(20 - 4.17)^2 \approx 250.59$
- Sum of squared differences: $10.05 + 10.05 + 10.05 + 10.05 + 10.05 + 250.59 \approx 300.83$
- Variance:
$$s^2_B = \frac{300.83}{6-1} = \frac{300.83}{5} \approx 60.17$$
- Standard Deviation:
$$s_B = \sqrt{60.17} \approx 7.76$$
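The hand calculations in Steps 1 to 3 can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the same $n - 1$ denominator. The helper name `summarise` is an assumption for this sketch. The code works from the unrounded mean, so its last digit can differ slightly from a hand calculation that rounds along the way.

```python
import statistics

dataset_a = [2, 4, 6, 8, 10]
dataset_b = [1, 1, 1, 1, 1, 20]


def summarise(data):
    """Return the mean, range, sample variance and sample standard deviation."""
    return {
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # divides by n - 1
        "sd": statistics.stdev(data),           # square root of the sample variance
    }


summary_a = summarise(dataset_a)
summary_b = summarise(dataset_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"Dataset {name}: mean={s['mean']:.2f}, range={s['range']}, "
          f"variance={s['variance']:.2f}, sd={s['sd']:.2f}")
```

Run as written, this should report a mean of 6.00, range 8, variance 10.00 and standard deviation 3.16 for Dataset A, and a mean of 4.17, range 19, variance 60.17 and standard deviation 7.76 for Dataset B.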
Summary of the Comparison
- Mean:
  - Dataset A: $\bar{x}_A = 6$
  - Dataset B: $\bar{x}_B \approx 4.17$
- Range:
  - Dataset A: $\text{Range}_A = 8$
  - Dataset B: $\text{Range}_B = 19$
- Standard Deviation:
  - Dataset A: $s_A \approx 3.16$
  - Dataset B: $s_B \approx 7.76$
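From these calculations, we see that Dataset A has the higher mean (6 versus about 4.17), but Dataset B is far more spread out: its range and its standard deviation are each more than twice those of Dataset A. Almost all of that variability comes from the single outlier at 20, since the other five values in Dataset B are identical.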
Writing a Clear, Two-Part Comparison
When comparing two datasets, it is essential to address both their average and their spread. This combined view allows for a more comprehensive understanding of the datasets.
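A short sketch of why the mean alone is not enough: the two groups below are hypothetical values, chosen only so that they share the same mean. A comparison based on centre alone would call them identical, while the standard deviation shows they behave very differently.

```python
import statistics

# Hypothetical values chosen only so that both groups have the same mean.
steady = [4, 5, 6]
erratic = [0, 5, 10]

steady_mean = statistics.mean(steady)
erratic_mean = statistics.mean(erratic)
steady_sd = statistics.stdev(steady)    # sample standard deviation (n - 1)
erratic_sd = statistics.stdev(erratic)

print(f"steady:  mean={steady_mean}, sd={steady_sd:.1f}")    # steady:  mean=5, sd=1.0
print(f"erratic: mean={erratic_mean}, sd={erratic_sd:.1f}")  # erratic: mean=5, sd=5.0
```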
Example
In our comparison of Dataset A and Dataset B, we can say:
- Dataset A has a mean of 6 and a standard deviation of approximately 3.16, indicating a relatively low spread, so its values are fairly consistent.
- Dataset B, on the other hand, has a lower mean of approximately 4.17 but a much larger standard deviation of approximately 7.76, reflecting the variability introduced by its single outlier at 20.
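One possible way to turn the numbers into a two-part sentence is sketched below. The function `two_part_comparison` and its wording are assumptions for illustration, not a prescribed template. It states the centre first and the spread second, and it uses the unrounded mean, so the last digit may differ slightly from a hand calculation.

```python
import statistics


def _relation(x, y, higher, lower, same):
    """Pick the phrase that matches how x compares with y."""
    if x > y:
        return higher
    if x < y:
        return lower
    return same


def two_part_comparison(name_a, data_a, name_b, data_b):
    """Build one sentence comparing centre (mean), then spread (sample sd)."""
    mean_a, mean_b = statistics.mean(data_a), statistics.mean(data_b)
    sd_a, sd_b = statistics.stdev(data_a), statistics.stdev(data_b)

    centre = _relation(mean_a, mean_b,
                       "a higher mean than", "a lower mean than", "the same mean as")
    spread = _relation(sd_a, sd_b,
                       "more variable", "less variable", "equally variable")

    return (f"{name_a} has {centre} {name_b} ({mean_a:.2f} vs {mean_b:.2f}), "
            f"and it is {spread} (standard deviation {sd_a:.2f} vs {sd_b:.2f}).")


sentence = two_part_comparison("Dataset A", [2, 4, 6, 8, 10],
                               "Dataset B", [1, 1, 1, 1, 1, 20])
print(sentence)
```

A generated sentence like this is only a starting point. A good written comparison also explains the cause of the difference, such as the outlier at 20.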
Conclusion
In this lesson, we have learned how to compare datasets not only based on their averages but also by accounting for their spread. This dual approach is crucial in providing an accurate portrayal of the datasets, enabling us to draw meaningful conclusions.
Study Notes
- The centre (mean) and spread (variability) are crucial for meaningful comparisons.
- The measures of spread include range, variance, and standard deviation.
- Always consider both centre and spread when comparing datasets.
- Datasets can have the same mean but different spreads, leading to different interpretations of the data.
- Write clear comparisons that address both average and variability.
