Lesson 3.2: Model Specification and Violations
Introduction
In this lesson, we will explore the critical concepts of model specification and the common violations that can compromise the integrity of regression models. As you progress through your study of quantitative methods, it's vital to understand not just how to run regression analyses, but also the assumptions behind these models and what happens when these assumptions are violated.
Learning Objectives
By the end of this lesson, you will be able to:
- Understand heteroskedasticity, serial correlation, and multicollinearity: how to detect them and their consequences.
- Identify model specification errors and know the corrective measures to take.
- Diagnose violations of regression assumptions using diagnostic statistics.
- Recommend corrections and understand their implications for inference.
- Explain the main ideas and terminology related to model specification and its violations.
Section 1: Understanding Heteroskedasticity
What is Heteroskedasticity?
Heteroskedasticity refers to a condition in regression analysis where the variability of the errors differs across observations. Ordinary least squares regression assumes that the variance of the error terms is constant for all values of the independent variable(s), a property called homoskedasticity. When this assumption is violated, the usual formulas for the standard errors of the coefficients are no longer valid, which can lead to inaccurate statistical inferences.
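In symbols, writing $\epsilon_i$ for the error of observation $i$ and $x_i$ for its regressors (notation assumed here for illustration), the two cases can be contrasted as:
$$\text{Homoskedastic: } \operatorname{Var}(\epsilon_i \mid x_i) = \sigma^2 \qquad \text{Heteroskedastic: } \operatorname{Var}(\epsilon_i \mid x_i) = \sigma_i^2$$
In the first case a single variance $\sigma^2$ applies to every observation; in the second, the variance is allowed to change from one observation to the next.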
Detection of Heteroskedasticity
One of the common ways to detect heteroskedasticity is through visual inspection of residual plots. After fitting a regression model, you can plot the residuals against the predicted values. If the residuals fan out or display a pattern, it suggests the presence of heteroskedasticity.
Alternatively, statistical tests such as the Breusch-Pagan test can be employed, which tests the null hypothesis that the variance of the errors is constant (homoskedastic).
Example
Consider a regression model that predicts house prices from their size:
$$\text{Price} = \beta_0 + \beta_1 \times \text{Size} + \epsilon$$
After fitting this model, you might observe that the residual plot shows increasing spread as the size of the house increases, indicating a potential heteroskedasticity problem.
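The test idea can be sketched in a few lines of Python. The sketch below uses the Koenker form of the Breusch-Pagan statistic (sample size times the R-squared from regressing the squared residuals on the regressor), which is one common variant and not necessarily the one your software reports. All data values and function names are made up for illustration, and the statistic is compared with 3.84, the 5% chi-square critical value for one degree of freedom.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared from regressing y on x with an intercept."""
    intercept, slope = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """Koenker-style statistic: n * R^2 of squared residuals regressed on x."""
    intercept, slope = ols_fit(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, squared_resid)


# Hypothetical data for 20 houses (sizes in arbitrary units).
size = list(range(1, 21))
sign = [1 if s % 2 else -1 for s in size]  # deterministic stand-in for noise

# Errors whose magnitude grows with size (the "fanning out" pattern).
price_fanning = [50 + 3 * s + 0.5 * s * sg for s, sg in zip(size, sign)]
# Errors with the same magnitude at every size.
price_constant = [50 + 3 * s + 2.0 * sg for s, sg in zip(size, sign)]

lm_fanning = breusch_pagan_lm(size, price_fanning)
lm_constant = breusch_pagan_lm(size, price_constant)

CHI2_CRIT_1DF_5PCT = 3.84  # 5% critical value, chi-square with 1 df
print("fanning errors reject constant variance:", lm_fanning > CHI2_CRIT_1DF_5PCT)
print("constant errors reject constant variance:", lm_constant > CHI2_CRIT_1DF_5PCT)
```

The alternating sign pattern stands in for random noise so that the example is reproducible; with real data you would use the residuals from your fitted model.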
Consequences of Heteroskedasticity
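The primary consequence of ignoring heteroskedasticity is that the usual standard errors are biased. The coefficient estimates themselves remain unbiased, but they are no longer efficient. In practice the standard errors are often underestimated, which can lead to:
- Overly optimistic results, where we reject a null hypothesis that should not be rejected.
- Confidence intervals that are narrower than they should be.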
Section 2: Serial Correlation
What is Serial Correlation?
Serial correlation (or autocorrelation) occurs when the residuals from a regression model are correlated with each other. This usually arises in time series data, where observations are collected sequentially over time.
Detection of Serial Correlation
The presence of serial correlation can be checked using the Durbin-Watson statistic, which tests for first-order autocorrelation. The statistic ranges from 0 to 4: a value near 2 indicates no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.
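For residuals $e_1, \dots, e_T$ (symbols assumed here for illustration), the statistic is computed as:
$$DW = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \approx 2(1 - \hat{\rho})$$
where $\hat{\rho}$ is the sample first-order autocorrelation of the residuals. The approximation shows why the statistic sits near 2 when $\hat{\rho}$ is near 0 and falls toward 0 as $\hat{\rho}$ approaches 1.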
Example
If we are analyzing monthly sales data and find the Durbin-Watson statistic to be 1.2, this suggests positive serial correlation among the residuals: this month's residual tends to have the same sign as last month's.
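To make the calculation concrete, here is a minimal Python sketch of the Durbin-Watson statistic: the sum of squared changes between successive residuals divided by the sum of squared residuals. The residual series are made-up illustrative values, not real sales data, and the function name is hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that keep the same sign for several periods
# (positive serial correlation).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

# Hypothetical residuals that flip sign every period
# (negative serial correlation).
alternating = [1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)    # 4 / 6, well below 2
dw_alternating = durbin_watson(alternating)  # 12 / 4, above 2

print(dw_persistent, dw_alternating)
```

In practice the residuals would come from the fitted regression, ordered in time, and the result would be compared with tabulated Durbin-Watson bounds rather than judged by eye.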
Consequences of Serial Correlation
Ignoring serial correlation can lead to:
- Inefficient estimates of the regression coefficients, because the estimation ignores the information contained in the correlation between errors.
- Underestimated standard errors when the serial correlation is positive, resulting in inflated t-statistics and invalid statistical tests (e.g., incorrect p-values).
Section 3: Multicollinearity
What is Multicollinearity?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other. This makes it difficult to isolate the individual effect of each predictor on the dependent variable.
Detection of Multicollinearity
Multicollinearity can be detected through:
- Variance Inflation Factor (VIF): A VIF value greater than 10 is often taken as a sign of serious multicollinearity.
- Correlation matrices: Checking the correlation coefficients between pairs of independent variables. High correlation (close to +1 or -1) suggests multicollinearity.
Example
In a model attempting to predict employee performance based on hours worked and sales figures, if hours worked and sales figures have a correlation coefficient of 0.9 with each other, it suggests strong multicollinearity.
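The two detection tools are linked. The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors; with only two predictors, that R² is simply the squared pairwise correlation. The Python sketch below applies this to the 0.9 correlation from the example and to a small made-up data set (the `hours` and `sales` values and the function names are hypothetical).

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_from_r_squared(r_squared):
    """Variance inflation factor given the auxiliary-regression R^2."""
    return 1.0 / (1.0 - r_squared)


# Two-predictor case from the example: auxiliary R^2 = 0.9 ** 2.
vif_example = vif_from_r_squared(0.9 ** 2)

# Hypothetical observations for five employees.
hours = [1, 2, 3, 4, 5]
sales = [1, 3, 2, 5, 4]
r = pearson_r(hours, sales)
vif_data = vif_from_r_squared(r ** 2)

print(round(vif_example, 2), round(r, 2), round(vif_data, 2))
```

Note that a pairwise correlation of 0.9 corresponds to a VIF of about 5.26, which is below the common cutoff of 10. The two rules of thumb can therefore disagree, so it is worth checking both.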
Consequences of Multicollinearity
The effects of multicollinearity include:
- Increased standard errors for coefficients, which can make predictors that truly matter appear statistically insignificant.
- Difficulty in determining the effect of each independent variable on the dependent variable, leading to misinterpretation of results.
Section 4: Model Specification Errors
What are Model Specification Errors?
Model specification errors arise when the model is incorrectly specified due to the exclusion of relevant variables, inclusion of irrelevant variables, or incorrect functional forms.
Identification of Specification Errors
These errors can be diagnosed through:
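- Specification tests such as the Ramsey RESET test, which adds powers of the fitted values to the regression and checks whether they are significant. Significance suggests omitted higher-order or interaction terms, or an incorrect functional form.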
- Graphical plots that compare the fitted model against an alternative.
Example
If a linear model is used when the true relationship is quadratic, as can happen with some demand functions, the residuals will follow a systematic pattern and the predictions will be systematically off.
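A RESET-style check of this situation can be sketched in Python. The version below is a simplified illustration under stated assumptions: one regressor, only the squared fitted values added (implementations often add the cube as well), made-up data, and hypothetical function names. It compares the F statistic with 5.12, the approximate 5% critical value for F with 1 and 9 degrees of freedom.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Residuals from regressing y on x with an intercept."""
    intercept, slope = simple_ols(x, y)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def reset_f_stat(x, y):
    """F statistic for adding squared fitted values to y = a + b * x."""
    intercept, slope = simple_ols(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    ssr_restricted = sum(e ** 2 for e in resid)

    # Frisch-Waugh-Lovell step: strip from fitted**2 the part explained by x,
    # then regress the original residuals on what is left.
    z = residuals(x, [f ** 2 for f in fitted])
    gamma = sum(zi * ei for zi, ei in zip(z, resid)) / sum(zi ** 2 for zi in z)
    ssr_unrestricted = sum((ei - gamma * zi) ** 2 for zi, ei in zip(z, resid))

    dof = len(x) - 3  # parameters: intercept, x, fitted**2
    return (ssr_restricted - ssr_unrestricted) / (ssr_unrestricted / dof)


# Hypothetical data: 12 observations with a small deterministic "noise" term.
x = list(range(1, 13))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1] * 2
y_quadratic = [2 + 0.5 * xi ** 2 + ni for xi, ni in zip(x, noise)]
y_linear = [2 + 3 * xi + ni for xi, ni in zip(x, noise)]

f_quadratic = reset_f_stat(x, y_quadratic)
f_linear = reset_f_stat(x, y_linear)

F_CRIT_1_9_5PCT = 5.12  # approximate 5% critical value, F(1, 9)
print("quadratic data flags misspecification:", f_quadratic > F_CRIT_1_9_5PCT)
print("linear data flags misspecification:", f_linear > F_CRIT_1_9_5PCT)
```

The linear fit to the quadratic data should be flagged, while the linear fit to the linear data should not. With several regressors, a statistics package would normally handle the augmented regression.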
Corrective Measures
When facing model specification errors, corrective actions can include:
- Adding missing variables that may have a significant relationship with the dependent variable.
- Removing irrelevant variables to simplify the model and reduce the variance of the remaining estimates.
- Testing different functional forms to identify a more suitable model.
Conclusion
Understanding model specification and violations is crucial for building reliable regression models. By diagnosing issues like heteroskedasticity, serial correlation, and multicollinearity, you can improve the accuracy of your models. Correcting model specification errors is equally important to ensure that your analysis reflects true relationships within the data.
Study Notes
- Heteroskedasticity means the error variance differs across observations.
- Serial correlation implies correlation between residuals across time.
- Multicollinearity complicates the estimation of individual variable effects.
- Check for model specification errors with specification tests such as the Ramsey RESET test.
- Correcting these issues improves inference and the validity of statistical tests.
