Lesson 4.5: Residuals and Significance of a Correlation Coefficient
Introduction
In this lesson, we will explore the concepts of residuals and the significance of correlation coefficients within the framework of linear regression. Our objectives include calculating residuals, using them to evaluate regression models, and commenting on scatter graphs with fitted lines. Furthermore, we will touch on how to test the significance of a correlation coefficient, all of which are crucial for interpreting statistical data correctly.
Learning Objectives
By the end of this lesson, students should be able to:
- Calculate a residual using the formula: residual = observed value - fitted value and evaluate the model using these residuals to identify outliers.
- Comment on residuals visually from a scatter graph where a line of best fit is drawn.
- Use tables to test the significance of a correlation coefficient, stating hypotheses and conclusions in context.
- Calculate a specified residual and use it to assess the fit of the regression model.
- Identify a likely outlier based on residuals or a scatter graph with a fitted line.
Understanding Residuals
What is a Residual?
A residual is the difference between the observed value of a dependent variable (let's call it $y$) and the value predicted by a regression model (fitted value). The formula for calculating the residual for each observation is:
$$\text{Residual} = y - \hat{y}$$
where:
- $y$ is the actual observed value.
- $\hat{y}$ is the predicted value obtained from the regression equation.
Importance of Residuals
Residuals are important because they help us understand how well our regression model is performing. By analyzing the residuals, we can:
- Determine whether our model is adequate for explaining the relationship between the variables.
- Identify outliers that might influence our conclusions.
Example of Calculating Residuals
Consider the following dataset containing the variables $X$ (independent variable) and $Y$ (dependent variable):
| $X$ | $Y$ |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
| 4 | 4 |
| 5 | 6 |
For ease of calculation, we use the following simple line as our fitted model (the exact least-squares line for these data is $\hat{y} = 1.3 + 0.9x$, which gives similar residuals):
$$\hat{y} = 1 + 1.0x$$
Now, let’s calculate the residuals:
- For $X = 1$: $y = 2$, $\hat{y} = 2$, Residual = $2 - 2 = 0$.
- For $X = 2$: $y = 3$, $\hat{y} = 3$, Residual = $3 - 3 = 0$.
- For $X = 3$: $y = 5$, $\hat{y} = 4$, Residual = $5 - 4 = 1$.
- For $X = 4$: $y = 4$, $\hat{y} = 5$, Residual = $4 - 5 = -1$.
- For $X = 5$: $y = 6$, $\hat{y} = 6$, Residual = $6 - 6 = 0$.
The residuals are $0, 0, 1, -1, 0$. The third data point has a positive residual, so its observed value lies above the fitted value; the fourth has a negative residual, so its observed value lies below the fitted value.
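The calculation above can be sketched in a few lines of Python. This is only an illustration using the table's data and the line $\hat{y} = 1 + 1.0x$ from this example; the names `xs`, `ys` and `fitted` are chosen here and are not part of the lesson.

```python
# Data from the table in this example
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

def fitted(x):
    """Fitted value from the example's line: y-hat = 1 + 1.0x."""
    return 1 + 1.0 * x

# residual = observed value - fitted value
residuals = [y - fitted(x) for x, y in zip(xs, ys)]
print(residuals)  # [0.0, 0.0, 1.0, -1.0, 0.0]
```

A positive entry means the point lies above the line, and a negative entry means it lies below.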
Visual Representation of Residuals
Scatter Graph with a Line of Best Fit
When we plot the points of the dataset on a scatter graph with the regression line, the residuals can be assessed visually. Each point represents an observed value, and the vertical distance between the point and the line of best fit is the size of its residual.
To illustrate, consider a scatter graph where:
- The point for $X = 3$ which has a residual of $1$ is above the line, indicating that the model underestimates the actual $y$ value for this observation.
- Conversely, the point for $X = 4$, with a residual of $-1$, is below the line, showing that the model overestimates the actual $y$ value.
Commenting on Residuals
When interpreting the residuals visually, we can make several statements:
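- If the residuals are scattered randomly above and below the line of best fit (equivalently, around zero on a plot of residuals against $x$), with no pattern, the linear model is a reasonable fit.
- If there is a discernible pattern, the linear model may not be appropriate. A curved pattern suggests the relationship is not linear, while a funnel shape suggests that the spread of the residuals changes with $x$. In such cases, a transformation of the data or a different model may be necessary.
- A point whose residual is much larger in size than the others is a likely outlier.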
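A visual judgment about a likely outlier can be backed up numerically. The sketch below compares each residual with the typical residual size. Both the residual values and the cut-off of twice the root mean square are assumptions made for illustration; the lesson does not prescribe a specific rule.

```python
import math

# Hypothetical residuals (not from the lesson's dataset); they sum to zero
residuals = [0.4, -0.6, 0.3, -0.5, 3.0, -0.9, -0.8, -0.9]

# Typical residual size: root mean square of the residuals
rms = math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))

# Assumed rule of thumb: flag residuals more than 2 * rms away from zero
THRESHOLD = 2
flagged = [(i, e) for i, e in enumerate(residuals) if abs(e) > THRESHOLD * rms]
print(flagged)  # [(4, 3.0)]
```

Here only the fifth observation (index 4) is flagged. Whether a flagged point really is an outlier should still be judged in the context of the data.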
Significance of Correlation Coefficient
Understanding Correlation Coefficients
The correlation coefficient measures the strength and direction of the linear relationship between two variables. In A-Level statistics, we typically use Pearson's or Spearman's correlation coefficients.
- Pearson's Correlation ($r$) measures the strength and direction of the linear relationship between two quantitative variables. The significance test for $r$ assumes that the data come from a bivariate normal distribution. The value of $r$ ranges from -1 to 1, where:
  - $r = 1$ indicates a perfect positive linear relationship.
  - $r = -1$ indicates a perfect negative linear relationship.
  - $r = 0$ indicates no linear relationship.
- **Spearman's Rank Correlation ($\rho$)** is a non-parametric measure used for ordinal data or when the assumptions of Pearson’s correlation do not hold. It assesses how well the relationship between two variables can be described by a monotonic function.
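To make the two coefficients concrete, here is a minimal Python sketch that computes both for the five data pairs used in the residuals example. The function names are chosen for illustration, and the simple `ranks` helper assumes there are no tied values.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

def pearson(xs, ys):
    """Pearson's r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1 (smallest); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's coefficient: Pearson's r applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

print(round(pearson(xs, ys), 3), round(spearman(xs, ys), 3))  # 0.9 0.9
```

For this small dataset the two coefficients happen to agree. In general they differ, because Spearman's coefficient depends only on the order of the values.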
Testing the Significance
To test whether the correlation coefficient is statistically significant, we can follow these steps:
- State Hypotheses:
  - Null Hypothesis ($H_0$): There is no correlation between the variables ($\rho = 0$).
  - Alternative Hypothesis ($H_a$): There is a correlation between the variables ($\rho \neq 0$).
- Calculate the test statistic using the relevant formula according to the correlation coefficient being used.
For Pearson’s correlation, the test statistic $t$ is given by:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
where $r$ is the sample correlation coefficient, and $n$ is the number of pairs.
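- Determine the critical value from $t$-distribution tables, using the degrees of freedom ($df = n-2$) and the significance level (commonly $\alpha = 0.05$). Alternatively, compare $r$ directly with the critical value given in a correlation coefficient significance table for sample size $n$.
- Make a decision: If the absolute value of the calculated statistic exceeds the critical value, we reject the null hypothesis; otherwise, there is insufficient evidence to reject it. State the conclusion in the context of the data.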
Example of Testing Significance
Suppose we have calculated a Pearson correlation coefficient of $r = 0.8$ from a sample of $n = 30$ pairs. To test its significance:
- Hypotheses:
  - $H_0$: $\rho = 0$
  - $H_a$: $\rho \neq 0$
- Calculate the test statistic:
$$t = \frac{0.8\sqrt{30-2}}{\sqrt{1 - 0.8^2}} = \frac{0.8\sqrt{28}}{\sqrt{0.36}} = \frac{0.8 \cdot 5.2915}{0.6} \approx 7.055$$
- The degrees of freedom = $n-2 = 28$. From tables, the critical value for $df = 28$ at $\alpha = 0.05$ (two-tailed) is approximately $2.048$. Since $7.055 > 2.048$, we reject $H_0$.
- Thus, we conclude that there is a statistically significant correlation between the two variables at the 5% level.
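The same test can be sketched in Python as a check on the arithmetic. The function name `t_statistic` is chosen for illustration. The critical value is taken from tables as quoted in this example, and is not computed by the code.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for a sample correlation r from n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.8, 30
t = t_statistic(r, n)

# Critical value read from t-tables: df = 28, alpha = 0.05, two-tailed
CRITICAL_VALUE = 2.048

# Two-tailed test: compare the size of t with the critical value
reject_h0 = abs(t) > CRITICAL_VALUE
print(round(t, 3), reject_h0)
```

With these inputs the statistic is well above the critical value, so the decision is to reject $H_0$. For a different sample size, the critical value must be looked up again for the new degrees of freedom.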
Conclusion
In this lesson, we have explored the critical concepts of residuals and the significance of correlation coefficients in the context of linear regression analysis. We learned how to calculate residuals, assess the fit of a model visually, and test the significance of a correlation coefficient through hypothesis testing. Mastery of these concepts is essential for effective evaluation of statistical models and for making informed decisions based on data analysis.
Study Notes
- A residual is calculated as: residual = observed value - fitted value.
- A positive residual means the model underestimates the observed value; a negative residual means the model overestimates it.
- A good model will exhibit randomly dispersed residuals with no discernible patterns.
- To test the significance of a correlation coefficient, formulate hypotheses and use the relevant test statistic.
- Reject the null hypothesis (no correlation) if the absolute value of your calculated statistic is greater than the critical value from tables.
