Topic 1: Collecting And Describing Data

Lesson 1.5: Skewness, Outliers And Interpreting Diagrams

Official syllabus section covering Lesson 1.5 (Skewness, outliers and interpreting diagrams) within Topic 1 (Collecting and Describing Data): describing the shape of a distribution as symmetrical, positively skewed or negatively skewed; identifying outliers by inspection and by appropriate calculation, and judging the nature of an outlier with reference to the population and the original data collection process.

Introduction

In this lesson, we will explore the concepts of skewness and outliers in data distributions. Understanding the shape of a distribution is essential for interpreting statistical data accurately. We will also learn to identify outliers and interpret various statistical diagrams, including bar charts, stem and leaf diagrams, box and whisker plots, cumulative frequency diagrams, histograms, time series, and scatter diagrams. By the end of this lesson, you will be able to describe the skewness of distributions, identify outliers, and critically assess different statistical representations.

Learning Objectives:

  • Describe the shape of a distribution as symmetrical, positively skewed, or negatively skewed.
  • Identify outliers through inspection and appropriate calculations, considering their nature in relation to the population and original data collection process.
  • Interpret and critically assess a range of statistical diagrams, noting instances of misrepresentation.
  • Relate skewness to the positions of the mean, median, and mode within a distribution.
  • Use a stated rule to identify outliers and evaluate whether they should be retained or removed in context.

1. Understanding Skewness

Skewness refers to the asymmetry of a distribution. We can classify the skewness of a distribution into three categories:

  • Symmetrical: When the left and right sides of the distribution are mirror images of each other. The mean, median, and mode are all approximately equal.
  • Positively Skewed (Right Skewed): When the right tail (the higher values) of the distribution is longer or fatter than the left tail. In this case, the mean is usually greater than the median, which is greater than the mode ($\text{mean} > \text{median} > \text{mode}$).
  • Negatively Skewed (Left Skewed): When the left tail (the lower values) is longer or fatter than the right tail. In this distribution, the mean is usually less than the median, which is less than the mode ($\text{mean} < \text{median} < \text{mode}$).

Example 1: Determining Skewness

Let's consider the following set of data representing the ages of a group of people:

Ages: 22, 23, 23, 24, 25, 26, 30, 32, 35, 50

  1. Calculate the mean:

$$\text{Mean} = \frac{22 + 23 + 23 + 24 + 25 + 26 + 30 + 32 + 35 + 50}{10} = \frac{290}{10} = 29$$

  2. Calculate the median:

The median is the average of the 5th and 6th values in this ordered set, which are 25 and 26:

$$\text{Median} = \frac{25 + 26}{2} = \frac{51}{2} = 25.5$$

  3. Identify the mode:

The mode is 23, as it occurs most frequently.

Now we can identify skewness:

  • Mean = 29
  • Median = 25.5
  • Mode = 23

Given that $\text{mean} > \text{median} > \text{mode}$, we conclude that the distribution is positively skewed. The high value of 50 stretches the right tail and pulls the mean above the median.
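The three measures in Example 1 can be recomputed with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the lesson.

```python
from statistics import mean, median, mode

# Ages from Example 1.
ages = [22, 23, 23, 24, 25, 26, 30, 32, 35, 50]

age_mean = mean(ages)      # sum of the ages divided by how many there are
age_median = median(ages)  # average of the 5th and 6th ordered values
age_mode = mode(ages)      # most frequently occurring value

# Rule of thumb from this section: mean > median > mode suggests positive skew.
looks_positively_skewed = age_mean > age_median > age_mode

print(age_mean, age_median, age_mode, looks_positively_skewed)
```

The ordering check is only a rule of thumb, which is why the lesson says the mean is "usually" greater than the median in a positively skewed distribution.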

2. Identifying Outliers

Outliers are data points that deviate significantly from the overall pattern of a distribution. They can arise due to variability in the data, measurement errors, or other factors. Identifying outliers is crucial since they can affect statistics such as the mean.

Common Methods for Identifying Outliers

  1. Visual Inspection: By plotting data using simple diagrams, we can spot any data points that appear distant from the rest.
  2. Using IQR (Interquartile Range):
  • Calculate the first quartile (Q1) and the third quartile (Q3).
  • Determine the IQR:

$$\text{IQR} = Q3 - Q1$$

  • Calculate the lower and upper bounds for outliers:

$$\text{Lower Bound} = Q1 - 1.5 \times \text{IQR}$$

$$\text{Upper Bound} = Q3 + 1.5 \times \text{IQR}$$

Any data points outside of these bounds are considered outliers.

Example 2: Identifying Outliers Using IQR

Consider the following data set: 10, 12, 12, 13, 14, 14, 15, 20, 21, 100.

  1. Order the Data: 10, 12, 12, 13, 14, 14, 15, 20, 21, 100
  2. Determine Q1 and Q3:
  • The lower half is: 10, 12, 12, 13, 14.
  • $Q1 = 12$ (the median of this half).
  • The upper half is: 14, 15, 20, 21, 100.
  • $Q3 = 20$ (the median of this half).
  3. Calculate IQR:

$$\text{IQR} = Q3 - Q1 = 20 - 12 = 8$$

  4. Calculate the Bounds:

$$\text{Lower Bound} = Q1 - 1.5 \times \text{IQR} = 12 - 1.5 \times 8 = 12 - 12 = 0$$

$$\text{Upper Bound} = Q3 + 1.5 \times \text{IQR} = 20 + 1.5 \times 8 = 20 + 12 = 32$$

  5. Identify Outliers: The value 100 is greater than 32, so it is considered an outlier. No value is less than 0, so there are no low outliers.
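The steps of Example 2 can be sketched in code as follows. The function names `quartiles_by_halves` and `iqr_outliers` are hypothetical, and the quartile convention (medians of the lower and upper halves, leaving out the middle value when the count is odd) is an assumption chosen to match the worked example; other conventions can give slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the middle value is
    excluded from both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form two halves")
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

def iqr_outliers(values, k=1.5):
    """Return (lower bound, upper bound, outliers) using the k * IQR rule."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers

# Data from Example 2.
data = [10, 12, 12, 13, 14, 14, 15, 20, 21, 100]
lower_bound, upper_bound, outliers = iqr_outliers(data)
print(lower_bound, upper_bound, outliers)  # 0.0 32.0 [100]
```

Whether a flagged value such as 100 should be removed still depends on context: it may be a recording error, or it may be a genuine member of the population.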

3. Interpreting Statistical Diagrams

Interpreting diagrams is a skill that enhances data comprehension. We will consider various types of statistical diagrams and how to interpret them effectively.

3.1 Bar Charts

Bar charts are used to represent categorical data. Each category is represented by a bar whose height is proportional to the frequency or value it represents. Common misleading practices include using inconsistent or truncated scales and omitting categories, both of which distort comparisons.

Example 3: Analyzing a Bar Chart

A bar chart shows sales figures for different fruits:

  • Apples: 50
  • Bananas: 30
  • Oranges: 20

If the chart omits oranges but keeps the other values, the comparison becomes misleading: apples and bananas appear to account for all fruit sales, so each looks like a larger share of the total than it really is. It is crucial to display all relevant information.
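One way to quantify the effect of leaving a category out is to compare each fruit's share of the displayed total. The sketch below uses the figures from Example 3; the helper name `shares` is hypothetical.

```python
# Sales figures from Example 3.
sales = {"Apples": 50, "Bananas": 30, "Oranges": 20}

def shares(counts):
    """Return each category's percentage share of the displayed total."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}

full_chart = shares(sales)
without_oranges = shares({k: v for k, v in sales.items() if k != "Oranges"})

print(full_chart["Apples"])       # 50.0
print(without_oranges["Apples"])  # 62.5
```

The apple sales have not changed, yet apples appear to make up 62.5% of sales instead of 50% once oranges are dropped from the chart.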

3.2 Box and Whisker Plots

Box plots give a visual five-number summary of a data set: minimum, Q1, median, Q3, and maximum. They show the spread of the data and, when outliers are plotted as separate points, make them easy to spot.

Example 4: Analyzing a Box Plot

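If a box plot shows a wide box with short whiskers, the IQR is large compared to the overall range: the middle 50% of values is spread out, while the lowest and highest quarters are tightly packed. A narrow box with one long whisker, by contrast, suggests skew in the direction of that whisker, and any points plotted beyond the whiskers are potential outliers.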
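The numbers behind a box plot can be computed directly. The following is a minimal sketch with a hypothetical data set and a hypothetical helper `five_number_summary`; it assumes quartiles are the medians of the lower and upper halves, which is one of several conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value excluded when the count is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical data, chosen so that no value is an outlier.
values = [2, 4, 5, 7, 8, 10, 11, 18]
minimum, q1, med, q3, maximum = five_number_summary(values)

box_width = q3 - q1            # the IQR, drawn as the box
lower_whisker = q1 - minimum   # length of the left whisker
upper_whisker = maximum - q3   # length of the right whisker

print(minimum, q1, med, q3, maximum)
print(box_width, lower_whisker, upper_whisker)
```

Here the upper whisker is three times as long as the lower one, which is the kind of pattern a box plot of positively skewed data would show.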

4. Relating Skewness to Mean, Median, and Mode

Understanding the relationships between skewness and measures of central tendency gives insights into the data’s distribution. As discussed earlier, in positively skewed distributions, the order is typically $\text{mean} > \text{median} > \text{mode}$.

Conversely, in negatively skewed distributions, we observe $\text{mean} < \text{median} < \text{mode}$. This knowledge helps when making inferences about a dataset from its measures.
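The comparison of mean and median can be turned into a rough classifier. This is a sketch of the rule of thumb only, not a formal measure of skewness; the function `describe_skew`, its `tolerance` parameter, and the three small data sets are all hypothetical.

```python
from statistics import mean, median

def describe_skew(values, tolerance=0.0):
    """Label a data set by comparing its mean with its median.

    `tolerance` is the largest gap between mean and median that is
    still treated as "roughly symmetrical".
    """
    gap = mean(values) - median(values)
    if gap > tolerance:
        return "positively skewed"
    if gap < -tolerance:
        return "negatively skewed"
    return "roughly symmetrical"

print(describe_skew([1, 2, 2, 3, 10]))  # positively skewed
print(describe_skew([1, 8, 9, 9, 10]))  # negatively skewed
print(describe_skew([1, 2, 3, 4, 5]))   # roughly symmetrical
```

For real data a non-zero tolerance is usually sensible, since the mean and median of a sample are rarely exactly equal even when the distribution is close to symmetrical.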

Recap Example

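In a classroom with test scores of 50, 60, 70, 80, 90, 95, and 100:

  • Mean = $\frac{545}{7} \approx 77.86$
  • Median = 80 (the 4th of the 7 ordered values)
  • Mode = none (every score occurs once)

The mean is slightly below the median, and the lowest score lies 30 marks below the median while the highest lies only 20 marks above it. This suggests a slight negative skew rather than perfect symmetry, although with only seven values the shape cannot be judged with confidence.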

Conclusion

In this lesson, we examined skewness and outliers. We learned how to describe the skewness of a distribution, interpret statistical diagrams, and use the IQR to detect outliers. These concepts are fundamental to understanding the shape and spread of data distributions. Recognizing outliers and understanding skewness allows for better-informed decisions and interpretations in statistical contexts.

Study Notes

  • Skewness types: symmetrical, positively skewed, negatively skewed.
  • Mean, median, mode relationships help determine skewness.
  • Outliers identified through visual inspection or IQR method.
  • Box plots summarize data and showcase outliers effectively.
  • Critical assessment of diagrams important to avoid misrepresentation.
