Lesson 6.1: Chi-squared Test for Independence (Contingency Tables)
Introduction
In this lesson, we will explore the chi-squared test for independence, particularly within the context of contingency tables. This test allows us to determine whether there is a significant association between two categorical variables. We will accomplish this by constructing contingency tables from real data, applying the chi-squared test, and interpreting the results in context. By the end of this lesson, you will be able to construct contingency tables, compute expected frequencies and degrees of freedom, and carry out chi-squared tests successfully.
Learning Objectives
- Constructing contingency tables from real data and combining categories where appropriate.
- Using a chi-squared test with the correct number of degrees of freedom to test for independence and interpreting the result.
- Understanding the requirement that expected frequencies are at least 5, and knowing how to pool classes when they are not (no Yates' correction required).
- Forming a contingency table and calculating the expected frequencies and degrees of freedom.
- Carrying out a chi-squared test for independence, stating hypotheses, significance level, and making a contextual conclusion.
Contingency Tables
What is a Contingency Table?
A contingency table is a two-way table that displays the joint frequency distribution of two categorical variables: the rows list the categories of one variable and the columns list the categories of the other. Each cell holds the number of observations that fall into that combination of categories. For example, if we want to study the relationship between gender (male or female) and drink preference (coffee or tea), our contingency table might look like this:
| | Coffee | Tea | Total |
|---|---|---|---|
| Male | 30 | 10 | 40 |
| Female | 20 | 40 | 60 |
| Total | 50 | 50 | 100 |
Constructing a Contingency Table
To construct a contingency table from real data, follow these steps:
- Collect Data: Gather data on the two categorical variables of interest.
- Tabulate Frequencies: Count the frequency of each combination of categories and fill in the table.
- Calculate Totals: Compute row and column totals, and ensure they are included in your table.
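The tabulation step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `records` list and the helper `build_contingency_table` are hypothetical names, and the raw records are constructed here so that their counts match the drinks table above.

```python
from collections import Counter

# Hypothetical raw records: one (gender, drink) pair per respondent,
# built so that the counts match the drinks table above.
records = (
    [("Male", "Coffee")] * 30
    + [("Male", "Tea")] * 10
    + [("Female", "Coffee")] * 20
    + [("Female", "Tea")] * 40
)


def build_contingency_table(pairs):
    """Tally (row, column) pairs into a nested dict of observed counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


table = build_contingency_table(records)
for row_label, cells in table.items():
    # Print each row with its row total appended.
    print(row_label, cells, "Total:", sum(cells.values()))
```

A nested dictionary keeps the category labels attached to the counts, which makes it easier to check the table against the original data.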
Example of Constructing a Contingency Table
Problem: Consider a survey of 100 students which asked them about their favorite subject: Mathematics, Science, or Literature, and whether they like to study in groups. Here are the responses:
- 15 students like Mathematics and study in groups.
- 25 students like Mathematics and do not study in groups.
- 30 students like Science and study in groups.
- 10 students like Science and do not study in groups.
- 5 students like Literature and study in groups.
- 15 students like Literature and do not study in groups.
Solution:
We can summarize this information in a contingency table:
| | Study in Groups | Do Not Study in Groups | Total |
|---|---|---|---|
| Mathematics | 15 | 25 | 40 |
| Science | 30 | 10 | 40 |
| Literature | 5 | 15 | 20 |
| Total | 50 | 50 | 100 |
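As a quick check on the totals, the row and column sums can be recomputed from the six survey counts. The sketch below uses only the numbers given in the problem; the variable names and the short column labels are illustrative choices.

```python
# Observed counts from the survey, keyed by subject, then study preference.
observed = {
    "Mathematics": {"Groups": 15, "No groups": 25},
    "Science": {"Groups": 30, "No groups": 10},
    "Literature": {"Groups": 5, "No groups": 15},
}

# Row totals: one per favorite subject.
row_totals = {subject: sum(cells.values()) for subject, cells in observed.items()}

# Column totals: one per study preference.
col_totals = {}
for cells in observed.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count

grand_total = sum(row_totals.values())

# Both sets of margins must add up to the sample size of 100.
assert grand_total == sum(col_totals.values()) == 100
print(row_totals, col_totals, grand_total)
```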
Chi-squared Test for Independence
What is the Chi-squared Test?
The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables arranged in a contingency table. The null hypothesis states that the two variables are independent, while the alternative hypothesis states that they are not independent.
Steps to Perform the Chi-squared Test
- State the Hypotheses:
- Null Hypothesis ($H_0$): The two categorical variables are independent.
- Alternative Hypothesis ($H_a$): The two categorical variables are not independent.
- Determine the Expected Frequencies:
The expected frequency for each cell in the contingency table is calculated using the formula:
$$E_{ij} = \frac{(\text{Row Total})(\text{Column Total})}{\text{Sample Size}}$$
where $E_{ij}$ is the expected frequency for cell $ij$.
- Calculate the Chi-squared Statistic:
The chi-squared statistic is calculated using the formula:
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency for cell $ij$.
- Determine the Degrees of Freedom:
The degrees of freedom for a contingency table are calculated as:
$$df = (r - 1)(c - 1)$$
where $r$ is the number of rows and $c$ is the number of columns, not counting the totals row and totals column.
- Find the Critical Value: Use the chi-squared distribution table to find the critical value based on the significance level (typically $\alpha = 0.05$) and the calculated degrees of freedom.
- Make the Decision:
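If the calculated chi-squared statistic is greater than the critical value, reject the null hypothesis. Otherwise, do not reject the null hypothesis: there is insufficient evidence of an association at the chosen significance level.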
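The calculation steps above can be sketched as three small Python functions. The function names are illustrative rather than taken from any library, the observed table is passed as a list of rows without the totals, and the demonstration uses the drinks table from earlier in the lesson. No continuity correction is applied.

```python
def expected_frequencies(observed):
    """E_ij = (row total * column total) / sample size, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(r - 1)(c - 1), where the table excludes totals."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Drinks table: rows are Male, Female; columns are Coffee, Tea.
drinks = [[30, 10], [20, 40]]
expected = expected_frequencies(drinks)
statistic = chi_squared_statistic(drinks, expected)
df = degrees_of_freedom(drinks)
print(expected, round(statistic, 3), df)
```

Keeping the three quantities in separate functions mirrors the steps of the hand calculation, so each intermediate result can be checked against written working.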
Example of a Chi-squared Test
Problem: Using the contingency table from the previous example, test if favorite subject and group study preference are independent at the $\alpha = 0.05$ significance level.
Solution:
- State the Hypotheses:
- $H_0$: Favorite subject and group study preference are independent.
- $H_a$: Favorite subject and group study preference are not independent.
- Calculate Expected Frequencies:
- For Mathematics, Study in Groups:
$$E_{11} = \frac{(40)(50)}{100} = 20$$
- For Mathematics, Do Not Study in Groups:
$$E_{12} = \frac{(40)(50)}{100} = 20$$
- For Science, Study in Groups:
$$E_{21} = \frac{(40)(50)}{100} = 20$$
- For Science, Do Not Study in Groups:
$$E_{22} = \frac{(40)(50)}{100} = 20$$
- For Literature, Study in Groups:
$$E_{31} = \frac{(20)(50)}{100} = 10$$
- For Literature, Do Not Study in Groups:
$$E_{32} = \frac{(20)(50)}{100} = 10$$
- Calculate Chi-squared Statistic:
- Using the formula:
$$\chi^2 = \frac{(15 - 20)^2}{20} + \frac{(25 - 20)^2}{20} + \frac{(30 - 20)^2}{20} + \frac{(10 - 20)^2}{20} + \frac{(5 - 10)^2}{10} + \frac{(15 - 10)^2}{10}$$
- Compute each term:
$$= \frac{25}{20} + \frac{25}{20} + \frac{100}{20} + \frac{100}{20} + \frac{25}{10} + \frac{25}{10}$$
$$= 1.25 + 1.25 + 5 + 5 + 2.5 + 2.5 = 17.5$$
- Calculate Degrees of Freedom:
- $df = (3 - 1)(2 - 1) = 2$.
- Find the Critical Value:
- From the chi-squared distribution table, the critical value for $df = 2$ at $\alpha = 0.05$ is approximately 5.991.
- Make the Decision:
Since $17.5 > 5.991$, we reject the null hypothesis. At the 5% significance level, there is sufficient evidence of an association between favorite subject and group study preference.
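The arithmetic in this example can be verified with a short script. One assumption to flag: the critical value and p-value below rely on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-squared distribution is $e^{-x/2}$. This shortcut does not apply to other degrees of freedom, where a table or statistical software is needed.

```python
import math

# Observed counts: rows are Mathematics, Science, Literature;
# columns are "study in groups", "do not study in groups".
observed = [[15, 25], [30, 10], [5, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-squared statistic: sum of (O - E)^2 / E with E = row * column / n.
chi2 = sum(
    (o - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for o, c in zip(row, col_totals)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 2 only: P(X > x) = exp(-x / 2).
alpha = 0.05
critical_value = -2 * math.log(alpha)  # about 5.991
p_value = math.exp(-chi2 / 2)

reject_h0 = chi2 > critical_value
print(chi2, df, round(critical_value, 3), reject_h0)
```

The p-value is far below 0.05, which agrees with the decision reached by comparing the statistic with the critical value.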
Conclusion
The chi-squared test for independence is a powerful tool for analyzing the relationship between two categorical variables. By constructing contingency tables and calculating expected frequencies, we can apply this test to derive insights from real-world data. Remember that one key assumption of the chi-squared test is that the expected frequencies must be at least 5. If they are not, consider pooling categories. Always interpret your results within the context of your data.
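To illustrate pooling, the sketch below checks the expected frequencies of a table and merges two rows when a cell falls below 5. The data are entirely hypothetical (an invented fourth subject, Art, with very few responses), and the helper names are illustrative. Categories should only be pooled when the combined category still makes sense in context.

```python
# Hypothetical counts, invented for illustration only.
labels = ["Mathematics", "Science", "Literature", "Art"]
observed = [[18, 22], [25, 15], [9, 7], [1, 3]]


def expected_frequencies(table):
    """E_ij = (row total * column total) / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def min_expected(table):
    """Smallest expected frequency in the table."""
    return min(min(row) for row in expected_frequencies(table))


def pool_rows(table, i, j):
    """Return a new table with rows i and j merged into one final row."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    kept = [row for k, row in enumerate(table) if k not in (i, j)]
    return kept + [merged]


print(min_expected(observed))  # below 5, so the test condition fails

# Pool Literature (index 2) with Art (index 3).
pooled = pool_rows(observed, 2, 3)
pooled_labels = labels[:2] + ["Literature or Art"]
print(min_expected(pooled))  # now at least 5

# Pooling removes a row, so the degrees of freedom drop from 3 to 2.
df_pooled = (len(pooled) - 1) * (len(pooled[0]) - 1)
```

Note that the degrees of freedom must be recalculated from the pooled table, not the original one.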
Study Notes
- Contingency Table: Displays the frequency distributions of two categorical variables.
- Chi-squared Test: Tests independence between two categorical variables.
- Null Hypothesis ($H_0$): Assumes no association between variables.
- Expected Frequency: Calculated using $\frac{(\text{Row Total})(\text{Column Total})}{\text{Sample Size}}$.
- Degrees of Freedom: Determined by $(r-1)(c-1)$, where $r$ is rows and $c$ is columns.
- Chi-squared Statistic: Calculated using $\sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$.
- Critical Value: Found using the chi-squared distribution table at a specified significance level.
- Conclusion: If $\chi^2 > $ critical value, reject $H_0$; this indicates a significant association.
