Regression Methods
Welcome to this essential lesson on regression methods in public health, students! This lesson will equip you with the fundamental statistical tools that public health professionals use every day to understand disease patterns, identify risk factors, and make evidence-based decisions that protect communities. By the end of this lesson, you'll understand how linear, logistic, and Poisson regression work, and how to build effective models, interpret coefficients meaningfully, and adjust for confounding variables that could otherwise mislead your conclusions.
Understanding Regression in Public Health Context
Regression analysis is like being a detective - you're trying to figure out which factors are truly responsible for health outcomes in populations. In public health, we constantly ask questions like: "Does smoking really increase lung cancer risk?" or "How much does air pollution contribute to asthma rates?" Regression methods help us answer these questions by quantifying relationships between variables while accounting for other factors that might confuse our conclusions.
Think of regression as a mathematical way to draw the "best fit" line through data points, but it's much more sophisticated than the simple lines you might remember from algebra class. In public health, we deal with different types of outcomes - some are continuous (like blood pressure readings), some are yes/no (like whether someone develops diabetes), and some are counts (like number of disease cases in a community). Each type requires a different regression approach.
The power of regression lies in its ability to isolate the effect of one variable while holding others constant. For example, if we want to study how exercise affects heart disease risk, we need to account for age, diet, genetics, and other factors. Regression lets us say, "Among people of the same age, diet, and genetic background, those who exercise have X% lower risk of heart disease."
Linear Regression: The Foundation
Linear regression is your starting point for understanding all regression methods, students. It's used when your outcome variable is continuous - things you can measure on a scale like weight, blood pressure, cholesterol levels, or air quality index. The basic equation looks like this: $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \epsilon$$
Let's break this down with a real example. Suppose researchers want to understand how socioeconomic factors affect childhood obesity rates across different neighborhoods. They might collect data on average BMI (Body Mass Index) for children in 100 neighborhoods, along with information about median household income, education levels, and access to parks.
In this scenario, $Y$ represents the average childhood BMI in each neighborhood. $\beta_0$ is the intercept - the expected BMI when all other variables are zero. $\beta_1$ might represent the change in BMI for every $1,000 increase in median household income. If $\beta_1 = -0.2$, this means that for every $1,000 increase in household income, childhood BMI decreases by 0.2 points on average, holding other factors constant.
The beauty of linear regression is in its interpretability. Each coefficient tells you exactly how much the outcome changes for a one-unit change in that predictor. However, linear regression assumes your outcome can theoretically take any value, which isn't always realistic in public health. You can't have negative disease rates or BMI values above certain biological limits.
Model building in linear regression involves selecting which variables to include. Too few variables might miss important relationships (underfitting), while too many might create a model that's too complex and doesn't generalize well (overfitting). Public health researchers often use subject matter knowledge combined with statistical criteria to build parsimonious yet comprehensive models.
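The neighborhood example above can be sketched numerically. The data below are entirely made up (and deliberately constructed so that ordinary least squares recovers the coefficients exactly); only NumPy is assumed:

```python
import numpy as np

# Hypothetical data for 6 neighborhoods: median income (in $1,000s),
# a park-access score, and average childhood BMI as the outcome.
income = np.array([30.0, 45.0, 52.0, 61.0, 75.0, 90.0])
parks = np.array([4.0, 1.0, 5.0, 2.0, 3.0, 1.0])
bmi = np.array([19.0, 19.0, 17.92, 18.16, 17.4, 17.2])

# Design matrix with an intercept column: Y = b0 + b1*income + b2*parks
X = np.column_stack([np.ones_like(income), income, parks])
coef, *_ = np.linalg.lstsq(X, bmi, rcond=None)
b0, b1, b2 = coef

# b1 is the expected change in average BMI per $1,000 of income,
# holding park access constant; b2 is the change per unit of park access,
# holding income constant.
print(f"intercept={b0:.2f}, income coef={b1:.3f}, parks coef={b2:.3f}")
# intercept=21.00, income coef=-0.040, parks coef=-0.200
```

Here a coefficient of -0.040 would mean each additional $1,000 of median income is associated with a 0.04-point lower average BMI, among neighborhoods with the same park access.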
Logistic Regression: When Outcomes Are Binary
Logistic regression becomes your go-to method when dealing with yes/no outcomes, students - things like whether someone develops a disease, survives a treatment, or engages in a health behavior. Unlike linear regression, logistic regression uses the logistic function to ensure predicted probabilities stay between 0 and 1.
The logistic regression equation is: $$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ...$$
Here, $p$ represents the probability of the outcome occurring, and the left side is called the "log-odds" or "logit." This mathematical transformation allows us to use linear methods for binary outcomes.
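The logit and its inverse (the logistic function) can be written in a few lines, which makes it easy to see why predicted probabilities can never escape the (0, 1) interval:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Logistic function: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

# The transformation is invertible, and no matter how extreme the
# linear predictor gets, the probability stays strictly inside (0, 1).
print(logit(0.5))             # 0.0 -- even odds
print(inv_logit(logit(0.8)))  # back to 0.8
print(inv_logit(10))          # ~0.99995, close to but below 1
```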
Consider a study examining factors that influence whether teenagers start smoking. Researchers might collect data on 1,000 teenagers, recording whether they started smoking (yes/no) and various risk factors like peer smoking, parental smoking, academic performance, and socioeconomic status.
Interpreting logistic regression coefficients requires understanding odds ratios. If the coefficient for "having smoking friends" is 1.2, then $e^{1.2} = 3.32$ represents the odds ratio. This means teenagers with smoking friends have 3.32 times the odds of starting smoking compared to those without smoking friends, controlling for other factors.
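This interpretation can be checked directly. The coefficients below are illustrative values for the teen-smoking example, not estimates from a real study:

```python
import math

# Hypothetical fitted model:
# logit(p) = b0 + b1 * has_smoking_friends + b2 * parent_smokes
b0, b1, b2 = -2.5, 1.2, 0.7  # made-up coefficients for illustration

# Exponentiating a coefficient gives the odds ratio for that predictor.
or_friends = math.exp(b1)
print(f"OR for smoking friends: {or_friends:.2f}")  # 3.32

# Predicted probability for a teen with smoking friends
# and non-smoking parents:
linear_pred = b0 + b1 * 1 + b2 * 0
p = 1 / (1 + math.exp(-linear_pred))
print(f"predicted probability: {p:.3f}")  # about 0.214
```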
Illustrative example: studies of COVID-19 vaccine hesitancy have used logistic regression to identify associated factors. A typical finding might be that individuals with higher education levels have lower odds of hesitancy (e.g., OR = 0.68, meaning 32% lower odds for each additional education level), while those who primarily get health information from social media have higher odds (e.g., OR = 1.45).
Poisson Regression: Modeling Count Data
Poisson regression is perfect for count data - situations where you're measuring how many times something happens, students. In public health, this could be the number of disease cases in a community, hospital admissions per month, or accidents per year. Unlike linear regression, Poisson regression assumes your outcome follows a Poisson distribution, which is appropriate for rare events.
The Poisson regression model is: $$\ln(\lambda) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ...$$
where $\lambda$ (lambda) represents the expected count, and we model its natural logarithm.
Imagine studying traffic accidents in different city districts. Your outcome might be the number of accidents per district per year, and your predictors could include population density, number of traffic lights, average income, and road quality. If the coefficient for population density is 0.003, this means that for every additional person per square kilometer, the expected number of accidents increases by a factor of $e^{0.003} = 1.003$, or about 0.3%.
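Because Poisson regression models the log of the expected count, exponentiated coefficients act multiplicatively. A sketch with made-up coefficients for the traffic-accident example:

```python
import math

# Hypothetical Poisson fit: ln(expected accidents) = b0 + b1 * density
b0, b1 = 1.5, 0.003  # illustrative values, not from a real model

# Each extra person per km^2 multiplies the expected count by exp(b1):
rate_ratio = math.exp(b1)
print(f"rate ratio per unit density: {rate_ratio:.4f}")  # 1.0030

# Effects compound multiplicatively: a 100-unit increase in density
# scales the expected count by exp(100 * b1), not by 100 * 0.3%.
print(f"rate ratio per 100 units: {math.exp(100 * b1):.3f}")  # 1.350
```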
A fascinating real-world application comes from disease surveillance. During the 2014-2016 Ebola outbreak in West Africa, epidemiologists used Poisson regression to model the number of new cases per day in different regions. They found that factors like population density, healthcare infrastructure, and cultural practices significantly influenced transmission rates. This analysis helped guide resource allocation and intervention strategies.
One important consideration with Poisson regression is overdispersion - when your data shows more variability than the Poisson distribution predicts. In such cases, researchers might use negative binomial regression, which is more flexible and can handle extra variability.
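A simple diagnostic for overdispersion is to compare the sample variance of the counts to their mean, since a Poisson distribution implies the two are equal. The counts below are invented for illustration:

```python
# Hypothetical district accident counts; under a Poisson model
# the variance should roughly equal the mean.
counts = [2, 0, 1, 3, 15, 1, 0, 22, 2, 4]

n = len(counts)
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance

dispersion = var / mean
print(f"mean={mean:.1f}, variance={var:.1f}, ratio={dispersion:.1f}")
# A ratio well above 1 suggests overdispersion, pointing toward
# negative binomial regression instead of Poisson.
```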
Model Building and Variable Selection
Building effective regression models is both an art and a science, students! The process involves balancing statistical rigor with practical considerations and subject matter expertise. You can't just throw all available variables into a model and hope for the best - this leads to overfitting and poor generalizability.
The "change-in-estimate" approach is widely used in epidemiology. You start with your main exposure variable, then add potential confounders one by one. If adding a variable changes your main coefficient by more than 10%, it's considered an important confounder and should stay in the model. This method prioritizes biological plausibility over purely statistical criteria.
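The change-in-estimate procedure can be simulated. Below, data are generated so that age confounds a coffee-outcome relationship that is truly null; the coefficient names, effect sizes, and seed are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: age affects both coffee intake and the outcome,
# while coffee has NO true effect (coefficient 0.0).
age = rng.normal(50, 10, n)
coffee = 3 - 0.03 * age + rng.normal(0, 1, n)          # older -> less coffee
outcome = 0.05 * age + 0.0 * coffee + rng.normal(0, 1, n)

def ols_coef(y, *predictors):
    """Least-squares coefficient on the first predictor, with intercept."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

crude = ols_coef(outcome, coffee)          # ignores age
adjusted = ols_coef(outcome, coffee, age)  # holds age constant

change = abs((crude - adjusted) / crude)
print(f"crude={crude:.3f}, adjusted={adjusted:.3f}, change={change:.0%}")
# Change > 10% -> age is an important confounder; keep it in the model.
```

Because the true coffee effect is zero, the crude coefficient is almost entirely confounding by age, and adjusting for age shifts it far more than the 10% threshold.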
Stepwise regression, while controversial, is still used in exploratory analyses. Forward selection starts with no variables and adds them based on statistical significance. Backward elimination starts with all variables and removes non-significant ones. However, these automated approaches can miss important confounders that aren't statistically significant but are biologically important.
Sample size considerations are crucial. A general rule for logistic regression is having at least 10 events per predictor variable. If you're studying a rare disease with only 50 cases, you shouldn't include more than 5 predictors, or your model will be unstable.
Confounder Adjustment: Separating True Effects from Noise
Confounding is one of the biggest challenges in observational public health research, students. A confounder is a variable that's associated with both your exposure and outcome, potentially creating a spurious relationship. Imagine studying whether coffee consumption prevents Parkinson's disease. Age is a confounder because older people drink less coffee AND have higher Parkinson's risk. Without adjusting for age, you might incorrectly conclude that coffee is more protective than it actually is.
The classic example is the apparent protective effect of hormone replacement therapy (HRT) on heart disease observed in early studies. Women who used HRT tended to be healthier, wealthier, and more health-conscious than non-users. These factors, not HRT itself, explained much of the apparent benefit. Proper confounder adjustment revealed that HRT actually increased heart disease risk, leading to major changes in clinical practice.
Directed Acyclic Graphs (DAGs) help visualize confounding relationships. These diagrams show causal pathways between variables, helping researchers identify which variables to include as confounders and which might be mediators (variables in the causal pathway) that shouldn't be adjusted for.
Conclusion
Regression methods are the backbone of quantitative public health research, providing powerful tools to understand complex relationships between risk factors and health outcomes. Linear regression handles continuous outcomes, logistic regression tackles binary outcomes, and Poisson regression addresses count data. Success in applying these methods depends on thoughtful model building, appropriate variable selection, and careful attention to confounding. These statistical techniques, when properly applied, transform raw data into actionable insights that guide public health policy and improve population health outcomes.
Study Notes
• Linear regression equation: $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \epsilon$ (for continuous outcomes)
• Logistic regression equation: $\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ...$ (for binary outcomes)
• Poisson regression equation: $\ln(\lambda) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ...$ (for count data)
• Odds ratio interpretation: $OR = e^{\beta}$ in logistic regression
• Rate ratio interpretation: $RR = e^{\beta}$ in Poisson regression
• Change-in-estimate criterion: Include confounders that change the main coefficient by >10%
• Sample size rule: At least 10 events per predictor variable in logistic regression
• Confounders: Variables associated with both exposure and outcome
• Overfitting: Using too many predictors relative to sample size
• Underfitting: Missing important predictors, leading to biased estimates
• Model assumptions: Linear regression assumes linearity, constant variance, and normally distributed errors; logistic and Poisson regression relax these but carry their own assumptions (e.g., Poisson assumes the variance equals the mean)
• Goodness of fit: Assess using R-squared (linear), the Hosmer-Lemeshow test (logistic), or deviance (Poisson)
