Topic 4: Correlation And Regression

Lesson 4.5: Residuals And Significance Of A Correlation Coefficient

Official syllabus section for Lesson 4.5 (Residuals and significance of a correlation coefficient) within Topic 4 (Correlation and Regression): calculating a residual using residual = y minus the fitted value, and using residuals to evaluate the model and identify outliers; commenting on residuals visually from a scatter graph with a line of best fit drawn.

Introduction

In this lesson, we will explore the concepts of residuals and the significance of correlation coefficients within the framework of linear regression. Our objectives include calculating residuals, using them to evaluate regression models, and commenting on scatter graphs with fitted lines. Furthermore, we will touch on how to test the significance of a correlation coefficient, all of which are crucial for interpreting statistical data correctly.

Learning Objectives

By the end of this lesson, students should be able to:

  • Calculate a residual using the formula: residual = observed value - fitted value and evaluate the model using these residuals to identify outliers.
  • Comment on residuals visually from a scatter graph where a line of best fit is drawn.
  • Use tables to test the significance of a correlation coefficient, stating hypotheses and conclusions in context.
  • Calculate a specified residual and use it to assess the fit of the regression model.
  • Identify a likely outlier based on residuals or a scatter graph with a fitted line.

Understanding Residuals

What is a Residual?

A residual is the difference between the observed value of a dependent variable (let's call it $y$) and the value predicted by a regression model (fitted value). The formula for calculating the residual for each observation is:

$$\text{Residual} = y - \hat{y}$$

where:

  • $y$ is the actual observed value.
  • $\hat{y}$ is the predicted value obtained from the regression equation.

Importance of Residuals

Residuals are important because they help us understand how well our regression model is performing. By analyzing the residuals, we can:

  • Determine whether our model is adequate for explaining the relationship between the variables.
  • Identify outliers that might influence our conclusions.

Example of Calculating Residuals

Consider the following dataset containing the variables $X$ (independent variable) and $Y$ (dependent variable):

| $X$ | $Y$ |
|-----|-----|
| 1   | 2   |
| 2   | 3   |
| 3   | 5   |
| 4   | 4   |
| 5   | 6   |

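Suppose the line of best fit is taken to be the following (a rounded version of the exact least-squares line for these data, $\hat{y} = 1.3 + 0.9x$, used here to keep the arithmetic simple):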

$$\hat{y} = 1 + 1.0x$$

Now, let’s calculate the residuals:

  1. For $X = 1$: $y = 2$, $\hat{y} = 2$, Residual = $2 - 2 = 0$.
  2. For $X = 2$: $y = 3$, $\hat{y} = 3$, Residual = $3 - 3 = 0$.
  3. For $X = 3$: $y = 5$, $\hat{y} = 4$, Residual = $5 - 4 = 1$.
  4. For $X = 4$: $y = 4$, $\hat{y} = 5$, Residual = $4 - 5 = -1$.
  5. For $X = 5$: $y = 6$, $\hat{y} = 6$, Residual = $6 - 6 = 0$.

The residuals are $[0, 0, 1, -1, 0]$. The third data point has a positive residual, so its observed value lies above the fitted value; the fourth has a negative residual, so its observed value lies below the fitted value.
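The calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the line $\hat{y} = 1 + 1.0x$ used in this example; the function names `fitted_value` and `residuals` are chosen here for illustration and are not part of the syllabus.

```python
def fitted_value(x, intercept=1.0, slope=1.0):
    """Predicted y for a given x on the line y-hat = intercept + slope * x."""
    return intercept + slope * x


def residuals(xs, ys, intercept=1.0, slope=1.0):
    """Residual for each observation: observed y minus fitted y."""
    return [y - fitted_value(x, intercept, slope) for x, y in zip(xs, ys)]


# Data from the table in this example
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

lesson_residuals = residuals(xs, ys)
print(lesson_residuals)  # [0.0, 0.0, 1.0, -1.0, 0.0]
```

Passing a different `intercept` and `slope` lets the same function be reused for any other fitted line.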

Visual Representation of Residuals

Scatter Graph with a Line of Best Fit

When we plot the points of the dataset on a scatter graph with the regression line, the residuals can be visually assessed. Each point represents an observed value, and the vertical distance between the point and the line of best fit is the size of the residual: points above the line have positive residuals and points below it have negative residuals.

To illustrate, consider a scatter graph where:

  • The point for $X = 3$ which has a residual of $1$ is above the line, indicating that the model underestimates the actual $y$ value for this observation.
  • Conversely, the point for $X = 4$, with a residual of $-1$, is below the line, showing that the model overestimates the actual $y$ value.

Commenting on Residuals

When interpreting the residuals visually, we can make several statements:

  • If the points are scattered randomly above and below the line of best fit (equivalently, the residuals are randomly dispersed around zero on a residual plot), this indicates a good fit for the model.
  • If there is a discernible pattern in the residuals (for instance, a curve, or a funnel shape in which the spread grows with $x$), it suggests that the linear model may not be appropriate for the data. In such cases, transformation of data or a different model may be necessary.
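The visual comments above can be mirrored numerically. The sketch below labels each point as above, below, or on the fitted line, and flags any residual that is large compared with the typical residual size. The cutoff `k = 2` and the second list of residuals are hypothetical, chosen only to illustrate the idea; the syllabus does not prescribe a numerical rule for outliers.

```python
from math import sqrt


def describe_residuals(residuals):
    """Label each point relative to the fitted line."""
    labels = []
    for e in residuals:
        if e > 0:
            labels.append("above")  # model underestimates the observed y
        elif e < 0:
            labels.append("below")  # model overestimates the observed y
        else:
            labels.append("on")
    return labels


def flag_outliers(residuals, k=2.0):
    """Indices whose |residual| exceeds k times the root-mean-square residual.

    k = 2 is an illustrative rule of thumb, not a syllabus requirement.
    """
    rms = sqrt(sum(e ** 2 for e in residuals) / len(residuals))
    if rms == 0:
        return []
    return [i for i, e in enumerate(residuals) if abs(e) > k * rms]


# Residuals from the worked example in this lesson
lesson_residuals = [0, 0, 1, -1, 0]
print(describe_residuals(lesson_residuals))
print(flag_outliers(lesson_residuals))

# Hypothetical residuals with one point far from the line
hypothetical_residuals = [0.2, -0.3, 0.1, 4.0, -0.2, 0.1, -0.1]
print(flag_outliers(hypothetical_residuals))
```

In the lesson data no residual stands out, whereas the hypothetical list has one point (index 3) well away from the line. A flagged point is only a candidate outlier and should still be checked on the scatter graph.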

Significance of Correlation Coefficient

Understanding Correlation Coefficients

The correlation coefficient measures the strength and direction of the linear relationship between two variables. In A-Level statistics, we typically use Pearson's or Spearman's correlation coefficients.

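  1. Pearson's Correlation ($r$) measures the strength and direction of the linear relationship between two quantitative variables. The significance test for $r$ additionally assumes that the data come from a bivariate normal distribution. The value of $r$ ranges from -1 to 1, where: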
  • $r = 1$ indicates a perfect positive linear relationship.
  • $r = -1$ indicates a perfect negative linear relationship.
  • $r = 0$ indicates no linear relationship.
  2. Spearman's Rank Correlation ($\rho$) is a non-parametric measure used for ordinal data or when the assumptions of Pearson’s correlation do not hold. It assesses how well the relationship between two variables can be described by a monotonic function.
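Both coefficients can be computed directly from their standard definitions. The sketch below is a minimal illustration using the five data points from the residuals example earlier in this lesson. The function names are chosen here for illustration, and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank each value from 1 (smallest). Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rank correlation via 1 - 6 * sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
print(round(pearson_r(xs, ys), 3))
print(round(spearman_rho(xs, ys), 3))
```

For this small dataset both coefficients come out at 0.9. In general the two values differ, because Spearman's coefficient depends only on the ordering of the data.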

Testing the Significance

To test whether the correlation coefficient is statistically significant, we can follow these steps:

  1. State Hypotheses:
  • Null Hypothesis ($H_0$): There is no correlation between the variables ($\rho = 0$).
  • Alternative Hypothesis ($H_a$): There is a correlation between the variables ($\rho \neq 0$).

  2. Calculate the test statistic using the relevant formula according to the correlation coefficient being used.

For Pearson’s correlation, the test statistic $t$ is given by:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

where $r$ is the sample correlation coefficient, and $n$ is the number of pairs.

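  3. Determine the critical value. For the $t$ statistic above, use the $t$-distribution with $n-2$ degrees of freedom ($df = n-2$) at the chosen significance level (commonly $\alpha = 0.05$). Alternatively, compare $r$ directly with the entry for sample size $n$ in a table of critical values for the correlation coefficient.
  4. Make a decision: If the absolute value of the calculated statistic exceeds the critical value (for a two-tailed test), we reject the null hypothesis and state the conclusion in context.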
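The steps above can be sketched as a short Python helper. This is an illustration only: the critical value is assumed to be read from a $t$-table and passed in by hand, and the figures in the usage lines ($r = 0.5$, $n = 18$, critical value $2.120$ for $df = 16$ at the 5% two-tailed level) are hypothetical.

```python
from math import sqrt


def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) for Pearson's r."""
    if n <= 2:
        raise ValueError("need more than two pairs of data")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def is_significant(r, n, critical_value):
    """Two-tailed decision: reject H0 when |t| exceeds the tabulated value."""
    return abs(t_statistic(r, n)) > critical_value


# Hypothetical sample: r = 0.5 from n = 18 pairs, so df = 16
t = t_statistic(0.5, 18)
print(round(t, 3))
print(is_significant(0.5, 18, critical_value=2.120))
```

Here $t \approx 2.309$, which exceeds $2.120$, so $H_0$ would be rejected for this hypothetical sample. Keeping the critical value as an argument mirrors the exam procedure of looking it up in tables.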

Example of Testing Significance

Suppose we have calculated a Pearson correlation coefficient $r = 0.8$ from a sample of $n = 30$ pairs. To test its significance:

  1. Hypotheses:
  • $H_0$: $\rho = 0$
  • $H_a$: $\rho \neq 0$

  2. Calculate the test statistic:

$$t = \frac{0.8\sqrt{30-2}}{\sqrt{1 - 0.8^2}} = \frac{0.8\sqrt{28}}{\sqrt{0.36}} = \frac{0.8 \cdot 5.2915}{0.6} \approx 7.055$$

  3. The degrees of freedom = $n-2 = 28$. From tables, the critical value for $df = 28$ at $\alpha = 0.05$ (two-tailed) is approximately $2.048$. Since $7.055 > 2.048$, we reject $H_0$.
  4. Thus, we conclude that, at the 5% significance level, there is evidence of a non-zero correlation between the two variables.
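The same decision can be reached by comparing $r$ directly with a critical value of the correlation coefficient. Rearranging the formula for $t$ gives $r_{\text{crit}} = t_{\text{crit}} / \sqrt{t_{\text{crit}}^2 + df}$. The sketch below applies this to the critical value $2.048$ and $df = 28$ from this example; the function name `critical_r` is chosen here for illustration.

```python
from math import sqrt


def critical_r(t_critical, df):
    """Critical |r| equivalent to a critical t value with df = n - 2."""
    return t_critical / sqrt(t_critical ** 2 + df)


# Values from the worked example: df = 28, two-tailed critical t = 2.048
r_crit = critical_r(2.048, 28)
print(round(r_crit, 3))

# Compare the sample correlation with the critical value
r = 0.8
print(abs(r) > r_crit)
```

This gives a critical value of about $0.361$. Since $0.8 > 0.361$, $H_0$ is rejected, matching the conclusion of the $t$ test. Tables of critical values for the correlation coefficient list this kind of value directly for each sample size.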

Conclusion

In this lesson, we have explored the critical concepts of residuals and the significance of correlation coefficients in the context of linear regression analysis. We learned how to calculate residuals, assess the fit of a model visually, and test the significance of a correlation coefficient through hypothesis testing. Mastery of these concepts is essential for effective evaluation of statistical models and for making informed decisions based on data analysis.

Study Notes

  • A residual is calculated as: residual = observed value - fitted value.
  • A positive residual means the model underestimates the observed value (the point lies above the line); a negative residual means the model overestimates it (the point lies below the line).
  • A good model will exhibit randomly dispersed residuals with no discernible patterns.
  • To test the significance of a correlation coefficient, formulate hypotheses and use the relevant test statistic.
  • Reject the null hypothesis (no correlation) if the absolute value of your calculated statistic is greater than the critical value from tables, and state the conclusion in context.

Lesson 4.5: Residuals And Significance Of A Correlation Coefficient — A-Level Statistics | A-Warded