Lesson 4.5: Residuals and Testing a Correlation Coefficient

Introduction

In this lesson, we explore residuals and the testing of correlation coefficients. We will define residuals, see why they matter when evaluating regression models, and learn how to test the significance of a correlation coefficient using critical-value tables. By the end of this lesson, you will be able to calculate residuals, identify outliers in data, state hypotheses for statistical tests, and interpret the results.

Learning Objectives

Understand residuals as the difference between observed and predicted values in a regression model.
Learn how to test the significance of a correlation coefficient with critical-value tables.
Formulate hypotheses for one-tailed or two-tailed tests concerning a population correlation coefficient and interpret testing results.
Calculate specified residuals and analyze residuals presented on scatter graphs to evaluate regression models and identify potential outliers.
Utilize critical value tables to determine the significance of Pearson's product moment correlation coefficient.

Section 1: Understanding Residuals

What Are Residuals?

In the context of regression analysis, residuals are the differences between observed values and the values predicted by our regression line. Mathematically, we can express this as:

$$\text{Residual} = y - \hat{y} = y - (a + bx)$$

Here:

$y$ represents the observed value.
$\hat{y}$ is the predicted value based on our regression model.
$a$ is the y-intercept of the regression line.
$b$ is the slope of the regression line.
$x$ is the independent variable.

Importance of Residuals

Residuals are crucial in evaluating how well a regression model fits the data. If the model predicted every observation perfectly, all residuals would be zero. In practice they are not, and their sizes and patterns help us judge the model’s accuracy and detect issues such as non-linearity or outliers in the data.

Example 1: Calculating Residuals

Let’s consider a dataset showing the relationship between the number of hours studied and the score on a test. Assume we have the following data points:

Hours Studied ($x$)	Test Score ($y$)
1	60
2	65
3	68
4	70
5	75

Suppose we model this data with the line below. It is an approximate fit; the exact least-squares line for these five points is $\hat{y} = 57.1 + 3.5x$.

$$\hat{y} = 58 + 3.4x$$

To calculate the residuals for each data point, we will substitute the values of $x$ into the regression equation to find $\hat{y}$ and then calculate the residuals.

Hours Studied ($x$)	Test Score ($y$)	Predicted Score ($\hat{y}$)	Residual ($y - \hat{y}$)
1	60	61.4	-1.4
2	65	64.8	0.2
3	68	68.2	-0.2
4	70	71.6	-1.6
5	75	75.0	0.0

A positive residual means the line underestimates the observed value, and a negative residual means it overestimates. In this example, one residual is positive (0.2), three are negative, and one is exactly zero.
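The residual column can be reproduced with a few lines of Python. This is a minimal sketch that assumes the five data points and the line $\hat{y} = 58 + 3.4x$ from Example 1; the function name `residuals` is illustrative and not part of any library.

```python
# Data from Example 1: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [60, 65, 68, 70, 75]


def residuals(xs, ys, a, b):
    """Return y - (a + b*x) for each pair, rounded to 1 decimal place."""
    return [round(y - (a + b * x), 1) for x, y in zip(xs, ys)]


# Intercept a = 58 and slope b = 3.4 are the values used in Example 1.
res = residuals(hours, scores, a=58, b=3.4)

for x, y, e in zip(hours, scores, res):
    print(x, y, e)
```

Rounding to one decimal place is only there to hide floating-point noise so the output matches the table.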

Identifying Outliers with Residuals

Outliers can be detected by examining the residuals. If a residual is much larger in magnitude than the others, the corresponding observation may be an outlier. A common rule of thumb flags any residual that lies more than 2 standard deviations from the mean residual.
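The 2-standard-deviation rule can be sketched in Python as follows. The residual values here are hypothetical, invented only to illustrate the rule, and the helper name `flag_outliers` is an assumption. The sketch uses the sample standard deviation (dividing by $n - 1$); the lesson does not specify which version to use.

```python
from statistics import mean, stdev


def flag_outliers(residuals, k=2.0):
    """Return indices of residuals more than k sample standard
    deviations away from the mean residual."""
    m = mean(residuals)
    s = stdev(residuals)  # sample standard deviation (n - 1 denominator)
    return [i for i, e in enumerate(residuals) if abs(e - m) > k * s]


# Hypothetical residuals, invented for illustration only.
example_residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, 6.0, -0.1, -0.1]

flagged = flag_outliers(example_residuals)
print(flagged)  # indices of the flagged observations
```

A flagged point is only a candidate outlier. It should be checked against the context of the data before it is removed or treated differently.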

Section 2: Testing Correlation Coefficients

What is a Correlation Coefficient?

The correlation coefficient quantifies the strength and direction of a linear relationship between two variables. The most commonly used correlation coefficient is Pearson's product moment correlation coefficient, denoted as $r$. Its value ranges from -1 to +1:

$r = 1$: Perfect positive correlation
$r = -1$: Perfect negative correlation
$r = 0$: No linear correlation
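The lesson does not state a formula for $r$. The sketch below assumes the standard definition $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where $S_{xy}$, $S_{xx}$ and $S_{yy}$ are sums of products of deviations from the means, and applies it to the hours and scores data from Example 1. The function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Data from Example 1.
hours = [1, 2, 3, 4, 5]
scores = [60, 65, 68, 70, 75]

r = pearson_r(hours, scores)
print(round(r, 3))  # about 0.989, a strong positive linear correlation
```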

Significance Testing of Correlation Coefficients

To determine whether a calculated correlation coefficient is statistically significant, we perform a hypothesis test about the population correlation coefficient, denoted $\rho$. The null hypothesis ($H_0$) typically states that there is no correlation in the population, while the alternative hypothesis ($H_a$) states that a correlation exists.

Formulating Hypotheses

For a two-tailed test:
$H_0: \rho = 0$ (there is no correlation in the population)
$H_a: \rho \neq 0$ (there is a correlation in the population)

For a one-tailed test:
$H_0: \rho \leq 0$ (no positive correlation)
$H_a: \rho > 0$ (there is a positive correlation)

Critical-Value Tables

Once we have our correlation coefficient $r$ calculated from a sample, we can use critical-value tables to determine the significance.

Identify the degrees of freedom: $df = n - 2$, where $n$ is the number of pairs of observations.
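Find the critical value for the specified significance level (e.g., $\alpha = 0.05$), the type of test (one-tailed or two-tailed), and the corresponding $df$.
Compare $r$ with the critical value. For a two-tailed test, reject the null hypothesis if $|r|$ is greater than the critical value; for a one-tailed test for positive correlation, reject it if $r$ itself is greater than the one-tailed critical value. Rejecting the null hypothesis suggests that the correlation is significant.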
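The table lookup and comparison can be sketched in Python. The dictionary below is only a short excerpt of a two-tailed table at $\alpha = 0.05$, keyed by $df$. Its name, the helper `is_significant`, and the sample values of $r$ and $n$ are assumptions made for illustration; a real printed table covers many more rows and significance levels.

```python
# Excerpt of two-tailed critical values of r at alpha = 0.05,
# keyed by degrees of freedom (df = n - 2). Not a complete table.
CRITICAL_R_05_TWO_TAILED = {3: 0.878, 8: 0.632, 10: 0.576, 18: 0.444, 28: 0.361}


def is_significant(r, n, table=CRITICAL_R_05_TWO_TAILED):
    """Two-tailed test: reject H0 (rho = 0) if |r| exceeds the critical value."""
    df = n - 2
    if df not in table:
        raise KeyError(f"no critical value stored for df = {df}")
    return abs(r) > table[df]


# Illustrative values: a strong correlation in a sample of 30 pairs,
# and a moderate correlation in a sample of only 10 pairs.
print(is_significant(0.85, 30))  # df = 28, critical value 0.361
print(is_significant(0.60, 10))  # df = 8, critical value 0.632
```

The second call shows why sample size matters: $r = 0.60$ looks sizeable, but with only 10 pairs it does not exceed the critical value of 0.632.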

Example 2: Testing a Correlation Coefficient

Consider a scenario where we calculated the correlation coefficient between hours studied and test scores and obtained $r = 0.85$. We had a sample of 30 students, thus:

$n = 30 \Rightarrow df = 30 - 2 = 28$
For $\alpha = 0.05$ (two-tailed), the critical value from the table for $df = 28$ is approximately $0.361$.
Since $|0.85| > 0.361$, we reject the null hypothesis $H_0: \rho = 0$ at the 5% significance level and conclude that there is a significant correlation between hours studied and test scores; because $r$ is positive, the correlation is positive.

Conclusion

In this lesson, we learned about residuals (the differences between observed and predicted values) and their role in assessing regression models. We also explored how to test the significance of a correlation coefficient using hypotheses and critical-value tables. Mastering these concepts is essential for analyzing bivariate data effectively.

Study Notes

Residuals: $y - \hat{y}$, where $\hat{y} = a + bx$.
Outliers are indicated by residuals of unusually large magnitude (e.g., more than 2 standard deviations from the mean residual).
Pearson's correlation coefficient $r$ ranges from -1 to +1.
Hypotheses for correlation tests (two-tailed): $H_0: \rho = 0$, $H_a: \rho \neq 0$.
Critical value tables used to test the significance of $r$ with degrees of freedom.
Reject $H_0$ if $|r|$ exceeds the critical value at the desired significance level.