Regression
Hey students! 👋 Welcome to one of the most powerful tools in statistics - regression analysis! In this lesson, you'll discover how we can use mathematical relationships to predict one variable based on another, understand the strength of these relationships, and even work with multiple variables at once. By the end of this lesson, you'll be able to perform linear regression, interpret correlation coefficients, apply least squares estimation, diagnose model quality, and understand the basics of multiple regression. Get ready to unlock the secrets hidden in data patterns! 📊
Understanding Linear Regression and Correlation
Linear regression is like being a detective who looks for patterns in data to make predictions about the future! 🔍 Imagine you're trying to predict how much ice cream a shop will sell based on the temperature outside. You might notice that on hotter days, more ice cream is sold, and on cooler days, less ice cream is sold. Linear regression helps us create a mathematical equation that describes this relationship.
The foundation of regression analysis lies in correlation, which measures how strongly two variables are related to each other. The correlation coefficient, denoted as $r$, ranges from -1 to +1. When $r = +1$, we have a perfect positive correlation (as one variable increases, the other increases perfectly). When $r = -1$, we have a perfect negative correlation (as one variable increases, the other decreases perfectly). When $r = 0$, there's no linear relationship between the variables.
For example, a study of students might find a strong positive correlation (say, $r = 0.85$) between hours studied and exam scores. This would mean that, in general, students who study more hours tend to achieve higher exam scores, though the relationship isn't perfect since other factors also influence performance.
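To make this concrete, here's a tiny sketch in Python (using NumPy, with made-up study-hours data purely for illustration) showing how the correlation coefficient can be computed:

```python
import numpy as np

# Made-up data: hours studied and exam scores for 8 students (illustration only)
hours = np.array([2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([55, 60, 62, 70, 72, 78, 83, 88], dtype=float)

# Pearson correlation coefficient r, taken from the 2x2 correlation matrix
r = np.corrcoef(hours, scores)[0, 1]
print(f"r = {r:.3f}")  # close to +1 here, i.e. a strong positive linear relationship
```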
The simple linear regression model takes the form: $$y = \beta_0 + \beta_1x + \epsilon$$
Where $y$ is the dependent variable (what we're trying to predict), $x$ is the independent variable (what we're using to make predictions), $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\epsilon$ represents the random error term that accounts for variability not explained by our model.
Least Squares Estimation Method
The least squares method is the mathematical technique we use to find the "best fit" line through our data points! 📏 Think of it like trying to draw a straight line through a cloud of points on a graph in such a way that the line is as close as possible to all the points.
Mathematically, we want to minimize the sum of squared differences between our observed values and the values predicted by our line. The least squares estimates for our parameters are:
$$\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$
Where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$ respectively.
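Here is a minimal sketch (again with made-up data, assuming NumPy is available) that computes these estimates directly from the formulas above:

```python
import numpy as np

# Made-up data for illustration
x = np.array([2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([55, 60, 62, 70, 72, 78, 83, 88], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates, computed exactly as in the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"slope     (beta1_hat) = {beta1_hat:.3f}")
print(f"intercept (beta0_hat) = {beta0_hat:.3f}")

# Fitted (predicted) values from the estimated line
y_hat = beta0_hat + beta1_hat * x
```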
Let's consider a real-world example: A study of house prices might find that for every additional square foot of living space, the house price increases by £150 on average. If we have data showing that houses with 1,000 square feet sell for an average of £200,000, our regression equation might be: House Price = £50,000 + £150 × (Square Feet). This means a 1,200 square foot house would be predicted to cost £50,000 + £150 × 1,200 = £230,000.
The beauty of least squares estimation is that it gives us the Best Linear Unbiased Estimators (BLUE) under the Gauss–Markov conditions (errors with zero mean, constant variance, and no correlation between them), meaning that among all linear unbiased estimators, ours have the smallest variance.
Model Diagnostics and Assessment
Just like a doctor needs to check if a treatment is working, we need to check if our regression model is doing a good job! 🩺 Model diagnostics help us determine whether our model assumptions are met and whether our predictions are reliable.
The coefficient of determination, denoted as $R^2$, tells us what proportion of the variability in our dependent variable is explained by our independent variable. It's calculated as:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Where $SS_{res}$ is the sum of squares of residuals and $SS_{tot}$ is the total sum of squares. An $R^2$ value of 0.75 means that 75% of the variability in our dependent variable is explained by our model - that's pretty good! However, be careful not to assume that a high $R^2$ automatically means a good model.
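A quick sketch of the calculation (same made-up data as before; NumPy's `polyfit` is used here just to obtain the fitted line):

```python
import numpy as np

# Same made-up data as before
x = np.array([2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([55, 60, 62, 70, 72, 78, 83, 88], dtype=float)

# np.polyfit with degree 1 returns [slope, intercept]
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
y_hat = beta0_hat + beta1_hat * x

# Coefficient of determination
ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```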
Residual analysis is crucial for checking our model assumptions. Residuals are the differences between observed and predicted values: $e_i = y_i - \hat{y_i}$. We should look for the following (the short plotting sketch after this list shows one way to check them visually):
- Linearity: The relationship should be approximately linear
- Independence: Observations should be independent of each other
- Homoscedasticity: The variance of residuals should be constant
- Normality: Residuals should be approximately normally distributed
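One common way to check linearity and homoscedasticity is a residuals-versus-fitted plot. Here's a minimal sketch (Matplotlib, same made-up data as before); what matters is the shape of the scatter, not the exact numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same made-up data as before
x = np.array([2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([55, 60, 62, 70, 72, 78, 83, 88], dtype=float)

beta1_hat, beta0_hat = np.polyfit(x, y, 1)
fitted = beta0_hat + beta1_hat * x
residuals = y - fitted

# Look for a random scatter around zero: a curve suggests non-linearity,
# a funnel shape suggests non-constant variance (heteroscedasticity)
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted")
plt.show()
```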
Real-world example: In analyzing the relationship between advertising spending and sales revenue, a company found $R^2 = 0.82$, suggesting that 82% of sales variation could be explained by advertising spending. However, residual plots revealed that the relationship was actually curved rather than linear, indicating that a simple linear model wasn't appropriate.
Introduction to Multiple Regression
While simple linear regression uses one independent variable, multiple regression allows us to use several independent variables to make predictions! 🎯 This is much more realistic since most real-world phenomena are influenced by multiple factors.
The multiple regression model is: $$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon$$
Where $p$ is the number of independent variables. For example, house prices might depend on square footage, number of bedrooms, age of the house, and neighborhood quality.
In multiple regression, we interpret each coefficient as the expected change in the dependent variable for a one-unit increase in that independent variable, holding all other variables constant. This is crucial! If $\hat{\beta_1} = 150$ in our house price example, it means that for every additional square foot, we expect the price to increase by £150, assuming the number of bedrooms, age, and neighborhood quality remain the same.
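As a rough sketch (made-up house data, and NumPy's general least squares solver rather than a dedicated statistics package), the coefficients can be estimated by adding a column of ones for the intercept and solving the least squares problem:

```python
import numpy as np

# Made-up data: columns are square feet, bedrooms, age in years
X = np.array([
    [1000, 2, 30],
    [1200, 3, 20],
    [1500, 3, 15],
    [1800, 4, 10],
    [2000, 4,  5],
], dtype=float)
price = np.array([200_000, 230_000, 275_000, 320_000, 355_000], dtype=float)

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, price, rcond=None)

for name, c in zip(["intercept", "sq_feet", "bedrooms", "age"], coefs):
    print(f"{name:>9}: {c:,.2f}")
```

Each printed coefficient is read "holding the other variables constant", exactly as described above.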
The adjusted R-squared becomes more important in multiple regression because regular $R^2$ always increases when we add more variables, even if they're not actually useful. Adjusted R-squared penalizes the addition of irrelevant variables:
$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$$
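The formula is easy to compute by hand or in code. In this small sketch, `adjusted_r_squared` is just an illustrative helper name, and the numbers are made up to show how adding predictors drags the adjusted value down:

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# The same raw R^2 of 0.82 looks less impressive as predictors pile up
print(adjusted_r_squared(0.82, n=20, p=5))   # about 0.756
print(adjusted_r_squared(0.82, n=20, p=10))  # about 0.620
```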
Multiple regression also introduces the concept of multicollinearity - when independent variables are highly correlated with each other. This can make it difficult to determine the individual effect of each variable. For instance, if we're predicting house prices using both square footage and number of rooms, these variables are likely highly correlated since bigger houses tend to have more rooms.
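A simple first check for multicollinearity is the pairwise correlation between predictors; a minimal sketch (made-up numbers) is shown below. More formal diagnostics, such as variance inflation factors, build on the same idea:

```python
import numpy as np

# Made-up predictors: square footage and number of rooms
sq_feet = np.array([1000, 1200, 1500, 1800, 2000], dtype=float)
rooms = np.array([4, 5, 6, 7, 8], dtype=float)

# A correlation near +1 or -1 between predictors is a warning sign
# that their individual effects will be hard to separate
r_predictors = np.corrcoef(sq_feet, rooms)[0, 1]
print(f"correlation between predictors: {r_predictors:.3f}")
```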
Conclusion
Regression analysis is a powerful statistical tool that helps us understand and predict relationships between variables. We've explored how linear regression uses correlation to establish mathematical relationships, how least squares estimation finds the best-fit line through data points, how model diagnostics ensure our models are reliable and valid, and how multiple regression extends these concepts to handle multiple independent variables simultaneously. These techniques form the foundation for much of modern statistical analysis and are essential tools for making data-driven decisions in fields ranging from economics and psychology to engineering and medicine.
Study Notes
• Correlation coefficient (r): Measures linear relationship strength; ranges from -1 to +1
• Simple linear regression model: $y = \beta_0 + \beta_1x + \epsilon$
• Least squares estimates: $\hat{\beta_1} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$ and $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$
• Coefficient of determination: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ (proportion of variance explained)
• Model assumptions: Linearity, Independence, Normality, Equal variance (homoscedasticity), remembered as LINE
• Residuals: $e_i = y_i - \hat{y_i}$ (difference between observed and predicted values)
• Multiple regression model: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon$
• Adjusted R-squared: $R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$ (accounts for number of variables)
• Multicollinearity: High correlation between independent variables in multiple regression
• Coefficient interpretation: Expected change in dependent variable for one-unit increase in independent variable, holding others constant
