Linear Regression 📈

In this lesson, you will learn how linear regression helps us describe and predict relationships between two variables in real life. Imagine you are studying the relationship between hours spent revising and test scores, or between advertising budget and sales. A straight line may not fit perfectly, but it can still give a useful summary of the pattern. That is the main idea behind linear regression: finding the line that best represents the trend in a set of data.

Objectives

Explain the main ideas and terminology behind linear regression.
Apply IB Mathematics: Applications and Interpretation HL reasoning and procedures related to linear regression.
Connect linear regression to the broader topic of functions.
Summarize how linear regression fits within functions.
Use evidence from data to interpret a linear model correctly.

Linear regression is a key part of functional modelling because it turns data into a function model. Instead of just drawing points on a scatter diagram, we use technology and mathematical reasoning to create a line that describes the relationship between variables. This is especially useful when the relationship is roughly linear and we want to make predictions. 🧠

1. What Linear Regression Means

Linear regression is a method for finding the line of best fit for a set of paired data. If one variable is called the explanatory variable and the other is called the response variable, then the regression line helps us predict the response from the explanatory variable.

A regression line usually has the form

$$y = mx + c$$

where $m$ is the gradient and $c$ is the $y$-intercept.

In data analysis, the regression line is not chosen just by eye. Instead, it is calculated so that the overall vertical distance between the data points and the line is as small as possible. The most common method is least squares regression, which chooses the line that minimizes the sum of the squared residuals.

A residual is the vertical difference between an observed value and the value predicted by the line:

$$\text{residual} = y - \hat{y}$$

Here, $y$ is the actual data value and $\hat{y}$ is the predicted value from the regression line.

Squaring residuals makes all differences positive and gives larger errors more influence. This is useful because a model should not ignore big mismatches. 📊
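
To make residuals concrete, here is a minimal Python sketch. The five data points, the two candidate lines, and the helper names are all assumptions chosen for illustration. The sketch computes the residuals for each line and compares the sums of squared residuals.

```python
# Illustrative paired data (x, y); these values are assumptions for the sketch.
xs = [1, 2, 3, 4, 5]
ys = [18, 20, 22, 27, 30]

def residuals(m, c, xs, ys):
    """Return the residuals y - y_hat for the line y_hat = m*x + c."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]

def sum_squared_residuals(m, c, xs, ys):
    """Return the sum of squared residuals for the line y_hat = m*x + c."""
    return sum(r ** 2 for r in residuals(m, c, xs, ys))

# Two hypothetical candidate lines, chosen by eye.
ssr_a = sum_squared_residuals(3, 14, xs, ys)  # y_hat = 3x + 14
ssr_b = sum_squared_residuals(2, 18, xs, ys)  # y_hat = 2x + 18

print(residuals(3, 14, xs, ys))  # [1, 0, -1, 1, 1]
print(ssr_a, ssr_b)              # 4 17
```

The first line has the smaller sum of squared residuals, so by the least squares criterion it is the better of these two candidates. The least squares line itself is the one that makes this sum as small as possible among all lines.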

2. Scatter Diagrams and the Need for a Model

Before fitting a line, you should look at a scatter diagram. The pattern of the points tells you whether linear regression is appropriate. If the points cluster around a straight-line trend, then a linear model may work well. If the pattern curves strongly, then a line may not be suitable.

For example, suppose a school records the number of hours students revise before an exam and the scores they achieve. A scatter plot may show that as revision time increases, scores tend to increase too. This positive association suggests that a linear model could be helpful.

However, not every set of data should be forced into a line. If the relationship bends, levels off, or changes direction, then a different model may be needed. In IB Mathematics: Applications and Interpretation HL, you are expected to interpret the context and decide whether the model is reasonable.

Important vocabulary includes:

Independent (explanatory) variable: the variable used to make the prediction.
Dependent (response) variable: the variable being predicted.
Correlation: the strength and direction of the relationship between two variables.
Positive correlation: both variables tend to increase together.
Negative correlation: one variable tends to increase while the other decreases.
Residual: the difference between observed and predicted values.

Even a strong correlation does not justify predicting outside the range of the observed data. Doing so is called extrapolation, and it can be unreliable. 🚦

3. The Regression Line and Its Interpretation

Once a regression line is found, it can be used to make predictions. Suppose the model is

$$\hat{y} = 4.2x + 15$$

If $x$ represents hours of revision, then the model says that each extra hour of revision is associated with an increase of about $4.2$ marks in the predicted score. The value $15$ is the predicted score when $x = 0$.
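
As a quick worked check of this interpretation, the model can be evaluated at two inputs one hour apart. The inputs of $3$ and $4$ hours are chosen here only for illustration:

$$\hat{y}(3) = 4.2(3) + 15 = 27.6, \qquad \hat{y}(4) = 4.2(4) + 15 = 31.8$$

The difference is $31.8 - 27.6 = 4.2$, which matches the gradient: one extra hour of revision changes the predicted score by $4.2$ marks.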

But interpretation matters. In a real context, $x = 0$ may or may not make sense. If nobody in the data studied for zero hours, then $15$ is still part of the mathematical model, but it may not be meaningful in the real world.

This is a very important IB skill: always interpret the line in context, not just as an algebraic equation.

For a data table like the one below, a regression line is chosen to represent the general trend:

| $x$ | $y$ |
|---|---|
| $1$ | $18$ |
| $2$ | $20$ |
| $3$ | $22$ |
| $4$ | $27$ |
| $5$ | $30$ |

The points are not exactly on a line, but they rise overall. A regression line would give a reasonable approximate model.
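
The least squares line for this table can be computed with a short Python sketch. It uses the standard least squares formulas, where the gradient is the sum of products of deviations divided by the sum of squared $x$-deviations. The function name `least_squares_line` is hypothetical, and in an exam you would normally get the same values from a graphing calculator.

```python
# Data from the table above.
xs = [1, 2, 3, 4, 5]
ys = [18, 20, 22, 27, 30]

def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y_hat = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    c = y_bar - m * x_bar
    return m, c

m, c = least_squares_line(xs, ys)
print(round(m, 2), round(c, 2))  # 3.1 14.1
```

So for this data the regression line is approximately $\hat{y} = 3.1x + 14.1$.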

How closely the data points follow a straight line is often measured using Pearson's product-moment correlation coefficient, $r$. Its value lies between $-1$ and $1$ inclusive.

If $r$ is close to $1$, there is a strong positive linear relationship.
If $r$ is close to $-1$, there is a strong negative linear relationship.
If $r$ is close to $0$, there is little or no linear relationship.

The coefficient of determination, $r^2$, shows how much of the variation in the response variable is explained by the linear model. For example, if $r^2 = 0.81$, then about $81\%$ of the variation is explained by the model. This does not mean the model is perfect; it means it explains a large portion of the pattern.
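
As an illustration, the sketch below computes $r$ and $r^2$ for the five-point table in this section using the standard formula for Pearson's correlation coefficient. The function name `correlation_coefficient` is an assumption for this example; a calculator's regression output should report the same values.

```python
from math import sqrt

# Data from the table in this section.
xs = [1, 2, 3, 4, 5]
ys = [18, 20, 22, 27, 30]

def correlation_coefficient(xs, ys):
    """Return Pearson's correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r = correlation_coefficient(xs, ys)
print(round(r, 3), round(r ** 2, 3))  # 0.984 0.969
```

Here $r \approx 0.984$ indicates a strong positive linear relationship, and $r^2 \approx 0.969$ means about $97\%$ of the variation in $y$ is explained by the linear model.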

4. Technology and Least Squares Regression

In IB Mathematics: Applications and Interpretation HL, technology plays an important role in regression analysis. Calculators and software can compute the line of best fit quickly, along with values of $r$ and $r^2$.

The least squares method works by minimizing

$$\sum (y - \hat{y})^2$$

where the sum is taken over all data points.

This means the line is chosen so that the total squared error is as small as possible. The actual calculations can be lengthy, so technology is usually used to find the regression equation. However, you still need to understand what the output means.
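
For reference, minimizing this sum has a standard closed-form solution. Writing $\bar{x}$ and $\bar{y}$ for the means of the data (notation introduced here for this illustration), the gradient and intercept of the line $\hat{y} = mx + c$ are:

$$m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad c = \bar{y} - m\bar{x}$$

One consequence of the second formula is that the least squares line always passes through the mean point $(\bar{x}, \bar{y})$, which is a useful check on calculator output.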

A common mistake is to trust the equation without checking the graph. For example, if there is an outlier, it may strongly affect the regression line. An outlier is a point far from the rest of the data. One unusual point can pull the line in its direction and reduce how well the model reflects the rest of the data.
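
The effect of an outlier can be seen in a small Python sketch. The first five points are illustrative values with a clear upward trend, and the sixth point $(6, 5)$ is a hypothetical outlier invented for this example. The function name `least_squares_line` is also an assumption.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y_hat = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_bar - m * x_bar

# Illustrative data with a clear upward trend.
xs = [1, 2, 3, 4, 5]
ys = [18, 20, 22, 27, 30]

# The same data plus one hypothetical outlier at (6, 5).
xs_out = xs + [6]
ys_out = ys + [5]

m_clean, c_clean = least_squares_line(xs, ys)
m_out, c_out = least_squares_line(xs_out, ys_out)

print(round(m_clean, 2), round(c_clean, 2))  # 3.1 14.1
print(round(m_out, 2), round(c_out, 2))      # -0.86 23.33
```

In this example a single unusual point changes the gradient from positive to negative, even though five of the six points rise steadily. This is why the scatter diagram should be checked before the equation is trusted.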

Another important idea is interpolation versus extrapolation:

Interpolation means predicting within the range of the observed data.
Extrapolation means predicting outside the range of the observed data.

Interpolation is usually more reliable because it stays close to the data used to build the model. Extrapolation can be risky because the pattern may change beyond the observed range. 📱
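
The distinction can be expressed as a simple range check. The sketch below is illustrative only: the function name `prediction_type` and the observed $x$-values are assumptions for this example.

```python
def prediction_type(x_new, observed_xs):
    """Classify a prediction at x_new as interpolation or extrapolation."""
    if min(observed_xs) <= x_new <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

# Illustrative observed x-values (for example, hours of revision).
observed_hours = [1, 2, 3, 4, 5]

print(prediction_type(3.5, observed_hours))  # interpolation
print(prediction_type(9, observed_hours))    # extrapolation
```

A check like this only says whether a prediction lies inside the observed range. It does not guarantee that an interpolated value is accurate.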

5. Linear Regression as a Function Model

Linear regression fits directly into the topic of functions because it creates a function that maps values of $x$ to predicted values of $y$.

If the model is

$$\hat{y} = mx + c$$

then the function takes an input $x$ and gives an output $\hat{y}$. This is a clear example of functional thinking: one quantity depends on another.
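
The idea that a regression line is a function can be shown directly in code. In this minimal sketch, `make_linear_model` is a hypothetical helper that takes a gradient and intercept and returns the corresponding function. The values $4.2$ and $15$ are reused from the revision example in section 3.

```python
def make_linear_model(m, c):
    """Return a function that maps x to y_hat = m*x + c."""
    def model(x):
        return m * x + c
    return model

# Revision model from section 3: y_hat = 4.2x + 15.
revision_model = make_linear_model(4.2, 15)

print(revision_model(0))            # 15.0
print(round(revision_model(3), 1))  # 27.6
```

Once the model is a function, it can be evaluated at any input, which is what makes prediction possible.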

In the Functions topic, you study how functions behave, how they transform, and how they represent real relationships. Regression extends this idea by using data to build the function rather than starting from a known rule.

For example, a company might record the number of products sold at different advertising levels. The regression function can help estimate future sales for a given advertising budget. In this way, a function becomes a tool for decision-making.

This connection to functions is important because the regression line is not just a picture. It is a mathematical model that can be evaluated, interpreted, and tested against data.

You may also be asked to compare different models. If a linear model is not suitable, a quadratic or exponential model may fit better. Choosing the right function is part of mathematical modelling. In IB, good modelling means using the context, not just chasing the largest value of $r^2$.

6. How to Interpret Regression in Context

When answering IB-style questions, always explain what the model says in the real situation.

Suppose a regression line for the relationship between temperature $x$ and ice cream sales $\hat{y}$ is

$$\hat{y} = 12x + 80$$

This means that for each increase of $1^\circ\text{C}$ in temperature, predicted sales increase by about $12$ units. The value $80$ is the predicted sales when the temperature is $0^\circ\text{C}$.

Now consider the question: Is this reasonable? If the data were collected only for temperatures between $15^\circ\text{C}$ and $30^\circ\text{C}$, then predicting at $0^\circ\text{C}$ is extrapolation and may not be valid.
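
By contrast, a prediction inside the observed range is interpolation. As a worked illustration, take $x = 20$, a temperature inside that range chosen here only as an example:

$$\hat{y} = 12(20) + 80 = 320$$

The model predicts about $320$ units of sales at this temperature. Because $x = 20$ lies within the observed data, this prediction is more trustworthy than the value of $80$ at $x = 0$.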

Also, a regression model gives predictions, not exact answers. Real-world data always has variation because of many factors not included in the model. That is why residuals matter. They show how far the model is from the actual data.

A residual closer to zero means the prediction is closer to the observed value. A clear pattern in the residuals, such as a run of positive values followed by a run of negative values, suggests that a linear model is not ideal.
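
The following sketch shows what a residual pattern can look like when a line is fitted to curved data. The data are hypothetical, generated from $y = x^2$ purely for illustration, and `least_squares_line` is an assumed helper name.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y_hat = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_bar - m * x_bar

# Hypothetical curved data: y = x^2 for x = 0, 1, 2, 3, 4.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

m, c = least_squares_line(xs, ys)
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]

print(m, c)       # 4.0 -2.0
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than a random scatter around zero, is a sign that a curved model may suit the data better than a line.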

Conclusion

Linear regression is a powerful way to turn data into a function model. It helps you find a line that best fits a scatter diagram, measure how strong the linear relationship is, and make predictions carefully. In IB Mathematics: Applications and Interpretation HL, you must not only calculate a regression line with technology, but also interpret it in context, check whether the model is suitable, and understand its limits. When used correctly, linear regression connects data, functions, and real-world decision-making in a clear and practical way. ✅

Study Notes

Linear regression finds a line of best fit for paired data.
The standard regression model is $\hat{y} = mx + c$.
Residuals are calculated using $y - \hat{y}$.
Least squares regression minimizes $\sum (y - \hat{y})^2$.
A scatter diagram helps decide whether a linear model is appropriate.
The correlation coefficient $r$ measures the strength and direction of a linear relationship.
The coefficient of determination $r^2$ shows how much variation is explained by the model.
Interpolation is usually more reliable than extrapolation.
Outliers can strongly affect the regression line.
Regression connects to functions because it creates a rule that predicts $\hat{y}$ from $x$.
Always interpret the slope and intercept in the context of the problem.
Technology is used to compute regression, but understanding the meaning is essential.