4. Statistics and Probability


Linear Regression πŸ“ˆ

Introduction: Why straight lines matter in data

Imagine you track the number of hours you study each week and your test score on the next exam. At first, the data may look messy, but there may still be a pattern hiding inside it. Linear regression is a statistical method for describing the relationship between two variables and finding the line that best fits the data. This line can help us understand trends, make predictions, and judge how strong the relationship is.

In IB Mathematics Analysis and Approaches SL, linear regression belongs to the wider study of statistics and probability because it helps us analyze data collection, correlation, and the meaning of relationships between variables. In this lesson, you will learn the key ideas and terminology, how to interpret regression results, and how linear regression is used in real life. By the end, you should be able to explain what a regression line does, read its equation, and decide whether the model is useful. βœ…

Learning objectives

  • Explain the main ideas and terminology behind linear regression.
  • Apply IB Mathematics Analysis and Approaches SL procedures related to linear regression.
  • Connect linear regression to statistics and probability.
  • Summarize how linear regression fits into the broader topic.
  • Use evidence and examples related to linear regression.

What linear regression is

Linear regression is used when we want to study the relationship between two variables, usually called $x$ and $y$. The variable $x$ is the explanatory variable, also called the independent variable, and $y$ is the response variable, also called the dependent variable. The goal is to model how $y$ changes when $x$ changes.

When the relationship is approximately linear, the data points on a scatter plot tend to cluster around a straight line. We then use a regression line, often written as $y=mx+c$, where $m$ is the gradient and $c$ is the $y$-intercept. In some textbooks and software, the line may also be written as $y=ax+b$. Both forms mean the same thing: a straight-line model that estimates $y$ from $x$.

The phrase β€œbest fit” does not mean the line passes through every point. Instead, it means the line is chosen so that the overall prediction errors are as small as possible. The most common method is the least squares method, which chooses the line that minimizes the sum of the squared vertical residuals.

A residual is the difference between an observed value and the predicted value from the line. If a point is $(x,y)$ and the predicted value is $\hat{y}$, then the residual is $y-\hat{y}$. Residuals can be positive, negative, or zero. A positive residual means the actual value is above the line, and a negative residual means it is below the line.
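To make residuals concrete, here is a short Python sketch (an illustration, not an IB-required method). It uses the study-time data from the worked example later in this lesson and the candidate line $y=4x+48$:

```python
def residuals(points, m, c):
    """Return the residuals y - y_hat, where y_hat = m*x + c, for each (x, y) pair."""
    return [y - (m * x + c) for x, y in points]

# Study-time data: (hours studied, test score).
data = [(1, 52), (2, 56), (3, 60), (4, 65), (5, 68)]
print(residuals(data, 4, 48))   # [0, 0, 0, 1, 0]: only (4, 65) lies above the line y = 4x + 48
```

Four residuals are zero and one is positive, so this line already fits the data very closely.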

Scatter plots, correlation, and the regression line

Before calculating a regression line, we usually begin with a scatter plot. A scatter plot shows paired data and lets us see the direction, form, and strength of the relationship. For linear regression, we want the points to show an approximately straight-line pattern.

Correlation measures the strength and direction of the linear relationship between two variables. The correlation coefficient is usually written as $r$, and it always lies between $-1$ and $1$. A value of $r$ close to $1$ means a strong positive linear relationship, a value close to $-1$ means a strong negative linear relationship, and a value near $0$ means little or no linear relationship.
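Your calculator computes $r$ for you, but the formula behind it can be sketched in a few lines of Python (illustrative only): $r$ is $S_{xy}$ divided by $\sqrt{S_{xx}S_{yy}}$, so it compares how $x$ and $y$ vary together with how much each varies on its own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))   # 1.0  (perfect positive linear relationship)
print(pearson_r([1, 2, 3], [6, 4, 2]))   # -1.0 (perfect negative linear relationship)
```

Points that fall exactly on an increasing line give $r=1$, and points on a decreasing line give $r=-1$.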

A strong correlation does not prove that one variable causes the other. For example, ice cream sales and sunburn cases may both increase in summer, but one does not directly cause the other. A third variable, such as hot weather, may influence both. This is an important idea in statistics: correlation is not causation.

The regression line is related to correlation because a stronger linear pattern usually makes the line a better summary of the data. However, if the scatter plot is curved, contains extreme outliers, or shows no pattern, then a straight line may not be a good model.

Example: study time and marks

Suppose a teacher records data for five students:

$$(1,52), (2,56), (3,60), (4,65), (5,68)$$

Here, $x$ is the number of hours studied and $y$ is the test score. The points suggest a positive linear relationship: as study time increases, marks also increase. The least squares regression line for these points is approximately $\hat{y}=4.1x+47.9$, which we can round to $\hat{y}=4x+48$ for simplicity. The rounded line says that for each extra hour studied, the predicted score increases by about $4$ marks.

If a student studies $6$ hours, the model predicts

$$\hat{y}=4(6)+48=72$$

So the estimated score is $72$. This is useful, but only as a prediction based on the pattern in the data.
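On a GDC you would read the coefficients from the statistics menu, but they can also be checked with the least squares formulas $m=S_{xy}/S_{xx}$ and $c=\bar{y}-m\bar{x}$. The Python sketch below (an illustration, not an IB-required method) applies them to the five data points:

```python
def least_squares(points):
    """Least squares line: slope m = Sxy / Sxx, intercept c = y_bar - m * x_bar."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    m = sxy / sxx
    return m, y_bar - m * x_bar

data = [(1, 52), (2, 56), (3, 60), (4, 65), (5, 68)]
m, c = least_squares(data)
print(m, c)        # slope about 4.1, intercept about 47.9
print(m * 6 + c)   # unrounded prediction for 6 hours, about 72.5
```

The unrounded line predicts about $72.5$ for six hours of study, close to the $72$ obtained from the rounded equation.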

Working with the regression equation

A regression equation gives a mathematical rule for prediction. The slope $m$ tells us the average change in $y$ for each increase of $1$ unit in $x$. The intercept $c$ gives the predicted value of $y$ when $x=0$, if that value makes sense in context.

For example, if a regression line modeling height as a function of age is $\hat{y}=6x+90$, then the slope $6$ means the predicted height increases by about $6$ cm for every extra year of age. The intercept $90$ means the model predicts a height of $90$ cm when $x=0$, but this may not be realistic if age $0$ is outside the data range or the context does not support it.

This is why interpretation matters. A regression equation should not be used blindly. If the data were collected for ages $12$ to $16$, then predicting for age $30$ is not reliable. This is called extrapolation, and it can be misleading because the relationship may change outside the observed range.

The safer use is interpolation, which means predicting within the range of the data. Interpolation is usually more trustworthy than extrapolation because it stays close to the evidence.
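One way to build this habit is to flag any prediction made outside the observed $x$-range. The helper below is a hypothetical illustration (using the height-age line from earlier), not a standard library function:

```python
def predict(m, c, x, x_min, x_max):
    """Predict y_hat = m*x + c, warning when x lies outside the observed range."""
    y_hat = m * x + c
    if not (x_min <= x <= x_max):
        print(f"warning: x = {x} is outside [{x_min}, {x_max}] (extrapolation)")
    return y_hat

# Height-age line y_hat = 6x + 90, with data observed only for ages 12 to 16.
print(predict(6, 90, 14, 12, 16))   # interpolation: 174
print(predict(6, 90, 30, 12, 16))   # extrapolation: prints a warning, returns 270
```

The second call still returns a number, which is exactly the danger: the arithmetic works even when the prediction is unreliable.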

Example: predicting using a line

If the regression equation for temperature and electricity use is $\hat{y}=2.5x+10$, where $x$ is temperature in degrees and $y$ is electricity use in kilowatt-hours, then at $x=20$ the prediction is

$$\hat{y}=2.5(20)+10=60$$

This means the model predicts $60$ kilowatt-hours. In real life, this kind of model can help energy companies estimate demand during hot weather. 🌞

How to judge if a regression model is useful

Not every data set should be modeled with a line. A good regression model should match the shape of the scatter plot reasonably well, and the residuals should not show a clear pattern. If the residuals are randomly scattered around $0$, the line is likely a reasonable fit.

The value of $r^2$, called the coefficient of determination, is also important. It tells us the proportion of variation in $y$ that can be explained by the linear relationship with $x$. For example, if $r^2=0.81$, then about $81\%$ of the variation in $y$ is explained by the linear model. The remaining $19\%$ is due to other factors, random variation, or limitations of the model.
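Behind the scenes, $r^2$ compares the squared residuals to the total variation in $y$: $r^2 = 1 - SS_{res}/SS_{tot}$. A small Python sketch (illustrative only), applied to the study-time data and its unrounded regression line:

```python
def r_squared(points, m, c):
    """Coefficient of determination: 1 - SS_res / SS_tot for the line y_hat = m*x + c."""
    ys = [y for _, y in points]
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in points)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

data = [(1, 52), (2, 56), (3, 60), (4, 65), (5, 68)]
print(r_squared(data, 4.1, 47.9))   # about 0.996: the line explains almost all the variation
```

A value this close to $1$ reflects how tightly the five points cluster around the line.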

A higher $r^2$ often means a better fit, but context still matters. A high value does not guarantee the model is appropriate if the data contain outliers or if the relationship is not actually linear.

Example: checking residuals

Suppose the predicted values from a line are close to the actual values except for one very large point far above the others. That point may be an outlier. Outliers can affect the regression line a lot because the least squares method gives more influence to large residuals, since they are squared. This is one reason why data should be inspected carefully before drawing conclusions.
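We can see this sensitivity numerically. The sketch below (with one invented outlier added to the study-time data) shows how a single extreme point shifts the least squares slope:

```python
def slope(points):
    """Least squares slope Sxy / Sxx."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    return sxy / sxx

clean = [(1, 52), (2, 56), (3, 60), (4, 65), (5, 68)]
with_outlier = clean + [(6, 120)]          # one artificial point far above the pattern
print(slope(clean), slope(with_outlier))   # slope jumps from about 4.1 to roughly 10.9
```

One point more than doubles the slope, because its large residual is squared and dominates the fit.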

Linear regression in the wider statistics topic

Linear regression connects to several other areas of statistics and probability in IB Mathematics Analysis and Approaches SL.

  • Data collection: The quality of the regression depends on how the data were collected. Random sampling and careful measurement improve reliability.
  • Statistical description: Scatter plots, means, and measures of spread help us summarize the data before fitting a line.
  • Correlation and regression: Correlation describes the strength and direction of the relationship, while regression gives a predictive model.
  • Interpretation: Statistics is not only about calculating values, but also about making sensible conclusions from data.

In probability, regression can be connected to uncertainty. Even when a line fits well, predictions are not exact. Real-world data always include variation. That is why statistical models give estimates rather than guaranteed outcomes.

For example, a sports coach might use regression to estimate how training hours affect sprint times. A line can provide a useful prediction, but it cannot capture every factor, such as sleep, nutrition, or injury. This reminds us that statistics models patterns, not certainty. πŸƒ

Conclusion

Linear regression is a powerful tool for finding and describing straight-line relationships in data. It helps us predict values, understand trends, and measure how strongly two variables are connected. By the end of this topic in IB Mathematics Analysis and Approaches SL, you should be able to read a scatter plot, interpret a regression equation, explain residuals, and judge whether a linear model is appropriate. Linear regression fits naturally into statistics and probability because it uses data to make informed estimates while also recognizing uncertainty.

Study Notes

  • Linear regression models the relationship between two variables using a straight line.
  • The explanatory variable is usually $x$, and the response variable is usually $y$.
  • A regression line is often written as $\hat{y}=mx+c$.
  • The slope $m$ shows the predicted change in $y$ for each increase of $1$ in $x$.
  • The intercept $c$ is the predicted value when $x=0$, if that makes sense in context.
  • A residual is $y-\hat{y}$, the difference between actual and predicted values.
  • Least squares chooses the line that minimizes the sum of the squared residuals.
  • The correlation coefficient $r$ measures the direction and strength of the linear relationship, with $-1\le r\le 1$.
  • Correlation does not prove causation.
  • Coefficient of determination $r^2$ shows the proportion of variation in $y$ explained by the linear model.
  • Prefer interpolation; use extrapolation with caution, since the relationship may change outside the data range.
  • Check for outliers and non-linear patterns before trusting a regression line.
  • Linear regression is an important part of statistics because it turns data into a model for prediction and interpretation.

Practice Quiz
