Lesson 3.4: Describing the shape of a distribution

Introduction

In this lesson, you will explore the shapes that distributions of data can take. Understanding the shape of a distribution is crucial for analyzing data and making informed decisions based on statistical analysis. By the end of this lesson, you will:

Understand what symmetry in data means and how to identify left and right skew.
Recognize the modes or peaks in a distribution and where the bulk of the data lies.
Identify outliers and gaps in data visualization.
Grasp how the shape of a distribution influences which average measure (mean, median, or mode) you might use later in your analyses.
Accurately describe the shape of a distribution based on given charts.

Understanding Distribution Shapes

What is a Distribution?

A distribution in statistics refers to the way in which the values of a dataset are spread out. It is often visualized using charts and graphs, with frequency plotted against the value of the variable. This graphical display helps in identifying patterns and trends in the data.

Symmetry in Distributions

A distribution is symmetric if its two halves are mirror images of each other. For instance, if you can fold a chart down the middle and the two sides match, then that distribution is symmetric. In a symmetric distribution the mean and median coincide, and if the distribution has a single peak, the mode coincides with them as well, which reflects a balance in the data.

Example of a Symmetric Distribution: The Normal Distribution

The Bell Curve, also known as the normal distribution, is a classic example of a symmetric distribution. In a normal distribution:

The mean ($\mu$) is equal to the median, which is equal to the mode.
The distribution is shaped like a bell, and its probability density function is:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
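The density formula can be checked numerically. The following is a minimal sketch in Python; the function name `normal_pdf` and the values chosen for the mean and standard deviation are illustrative assumptions, not part of the lesson.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x, following the formula for f(x)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative parameters (assumed, not taken from the lesson).
mu, sigma = 70.0, 10.0

# Points the same distance either side of the mean have the same density,
# which is what symmetry means for this curve.
left = normal_pdf(mu - 15.0, mu, sigma)
right = normal_pdf(mu + 15.0, mu, sigma)
peak = normal_pdf(mu, mu, sigma)

print(math.isclose(left, right))  # symmetry about the mean
print(peak > left)                # the highest point is at the mean
```

Because the curve is highest at $\mu$ and identical on both sides of it, the mean, median, and mode all sit at the same value.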

Skewness in Distributions

Skewness refers to the degree of asymmetry of a distribution around its mean. When a distribution is not symmetric, it can be skewed to the left or the right:

Left Skew: Data has a longer tail on the left side. The mean is typically less than the median, which is less than the mode ($\text{mean} < \text{median} < \text{mode}$).
Right Skew: Data has a longer tail on the right side. The mean is usually greater than the median, which is greater than the mode ($\text{mean} > \text{median} > \text{mode}$).
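One common way to put a number on skewness is the moment coefficient of skewness: the average cubed deviation from the mean divided by the cube of the standard deviation. The lesson does not prescribe a formula, so treat the sketch below as one possible measure; the function name `skewness` and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

# Hypothetical samples chosen to show each case.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [11 - v for v in right_tailed]  # mirror image of right_tailed

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(name, round(skewness(data), 3))
```

A positive result points to a longer right tail, a negative result to a longer left tail, and a value near zero to a roughly symmetric shape.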

Worked Example: Consider a dataset of exam scores:

Scores	Frequency
40-49	2
50-59	5
60-69	12
70-79	15
80-89	8
90-100	3

In this case, the peak is in the 70-79 class. The tail on the left stretches over three classes (40-69), while the tail on the right covers only two (80-100), so the distribution is close to symmetric with a slight left skew. Estimating from class midpoints, the mean is about 71.4, slightly below the median of about 72, because the few low scores between 40 and 59 pull the average down.
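The direction of skew in a grouped table can be checked directly. The sketch below estimates the mean from class midpoints, locates the class containing the median, and computes the sign of the skewness from the midpoints. Treating every score in a class as if it sat at the midpoint is an assumption, and the function names are hypothetical.

```python
# (lower limit, upper limit, frequency) for each class in the exam-score table.
classes = [(40, 49, 2), (50, 59, 5), (60, 69, 12),
           (70, 79, 15), (80, 89, 8), (90, 100, 3)]

def grouped_mean(classes):
    """Estimate the mean by treating each score as its class midpoint."""
    total = sum(freq for _, _, freq in classes)
    return sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total

def median_class(classes):
    """Return the limits of the class in which the middle observation falls."""
    total = sum(freq for _, _, freq in classes)
    cumulative = 0
    for lo, hi, freq in classes:
        cumulative += freq
        if cumulative >= total / 2:
            return (lo, hi)

def grouped_skewness(classes):
    """Moment coefficient of skewness computed from class midpoints."""
    total = sum(freq for _, _, freq in classes)
    mean = grouped_mean(classes)
    m2 = sum(freq * ((lo + hi) / 2 - mean) ** 2 for lo, hi, freq in classes) / total
    m3 = sum(freq * ((lo + hi) / 2 - mean) ** 3 for lo, hi, freq in classes) / total
    return m3 / m2 ** 1.5

print(round(grouped_mean(classes), 1))  # estimated mean
print(median_class(classes))            # class containing the median
print(grouped_skewness(classes) < 0)    # True means the longer tail is on the left
```

With this table the estimated mean falls in the lower part of the 70-79 class and the skewness is slightly negative, which matches a mild left skew rather than a right skew.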

Identifying the Mode

The mode is the value that appears most frequently in a dataset. In graphical terms, it is the location of the peak(s) in a distribution. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal (several peaks).

Worked Example: Consider the following dataset of ages:

Ages	Frequency
18	2
19	3
20	8
21	15
22	10
23	4
24	1

In this case, the age 21 has the highest frequency and is the mode. The data is unimodal since it has only one peak.
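The same reading of the table can be sketched in Python. The function names `modes` and `count_peaks` are hypothetical, and the peak counter uses a simplification: it counts a bar as a peak only when it is strictly taller than both neighbours, so flat-topped peaks are not handled.

```python
# Frequency table from the ages example: age -> number of people.
ages = {18: 2, 19: 3, 20: 8, 21: 15, 22: 10, 23: 4, 24: 1}

def modes(freq):
    """Return every value that shares the highest frequency."""
    top = max(freq.values())
    return [value for value, count in freq.items() if count == top]

def count_peaks(freq):
    """Count bars that are strictly taller than both of their neighbours."""
    counts = [freq[value] for value in sorted(freq)]
    padded = [0] + counts + [0]  # so the first and last bars can be peaks
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

print(modes(ages))        # [21]
print(count_peaks(ages))  # 1, so the distribution is unimodal
```

A result of 2 from `count_peaks` would indicate a bimodal shape, and anything higher a multimodal one.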

Spotting Outliers and Gaps

Outliers are data points that differ significantly from other observations. They can distort statistical analyses and interpretations if not handled properly. Gaps are ranges of values in which no observations fall; they can also indicate anomalies, missing information, or separate subgroups within the data.

Example: In a dataset representing the ages of people attending a class, if most ages range from 18 to 25 but a few data points are around 70, then those points could be considered outliers. Their presence might suggest a more complex picture: perhaps parents attending with their children, or a recording error.

When drawing graphs, one can often detect outliers as points that fall far away from the majority of data points.
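Outliers can also be flagged numerically. The lesson does not name a rule, so the sketch below swaps in one widely used convention, the 1.5 x IQR rule, which flags values lying more than 1.5 interquartile ranges below the first quartile or above the third. The function name `find_outliers` and the list of ages are hypothetical.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical class ages: mostly 18-25, plus two attendees in their 70s.
ages = [18, 19, 19, 20, 20, 21, 21, 21, 22, 22, 23, 24, 25, 70, 71]

print(find_outliers(ages))  # [70, 71]
```

A flagged value is a prompt to investigate, not an instruction to delete it: the two older attendees may be genuine data.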

Choosing the Right Average

The shape of the distribution affects which average (mean, median, or mode) is most representative of the data.

In symmetric distributions, the mean is a good measure of central tendency.
In left-skewed distributions, the median is preferred over the mean because it is less affected by the low values in the tail.
In right-skewed distributions, using the median provides a better central tendency measure than the mean, which can be skewed by high values.
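A short comparison shows why the median is preferred for skewed data. The two lists below are hypothetical values invented for illustration; only functions from Python's standard `statistics` module are used.

```python
import statistics

# Hypothetical data sets (invented for illustration).
symmetric = [34, 35, 36, 37, 38]
right_skewed = [30, 32, 35, 36, 38, 40, 250]  # one very high value in the tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean:.1f}, median = {median}")
```

In the symmetric list the mean and median agree at 36. In the right-skewed list the median is still 36, but the single value of 250 drags the mean to roughly 66, well above what most of the data looks like.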

Conclusion

Understanding the shape of a distribution provides insight into the characteristics and tendencies of data. Recognizing whether a dataset is symmetric, left-skewed, or right-skewed lets you choose an appropriate measure of central tendency and gives context for the data in further analyses.

Study Notes

Symmetric Distribution: Data where left and right halves are mirror images. Mean = Median, and with a single peak, Mean = Median = Mode.
Left Skewed: Longer tail on the left; typically mean < median < mode.
Right Skewed: Longer tail on the right; typically mean > median > mode.
Mode: The most frequently occurring value in the dataset.
Outliers: Data points significantly different from others; they can distort the mean.
Choosing Averages: Use the mean for symmetric distributions; the median for skewed ones.