Linear Regression 📈

Introduction

In many real situations, two quantities seem to move together. For example, as study time increases, test scores may improve; as outside temperature rises, ice cream sales may increase; as a car gets older, its value may decrease. In mathematics, we often want to describe this relationship with a function so we can understand the pattern, make predictions, and test how well the model fits the data.

In this lesson, you will learn the main ideas behind linear regression, a tool used to find the best-fitting straight line for a set of data points. By the end, you should be able to explain key terms, interpret regression results, connect linear regression to the topic of functions, and use technology-supported analysis in an IB-style context. 🚀

What Is Linear Regression?

Linear regression is a method used to model the relationship between two variables with a straight line. One variable is called the independent variable and is usually written as $x$. The other is called the dependent variable and is usually written as $y$.

A linear regression model has the form

$$y = mx + c$$

where $m$ is the gradient and $c$ is the $y$-intercept.

The goal is to choose values of $m$ and $c$ so that the line fits the data as well as possible. In statistics, this is usually done using the least squares method, which finds the line that minimizes the sum of the squared vertical distances between the data points and the line.

The signed vertical differences between the data points and the line are called residuals. For a data point $(x, y)$, the residual is

$$\text{residual} = y - \hat{y}$$

where $\hat{y}$ is the predicted value from the regression line.

If a point lies above the line, the residual is positive. If it lies below the line, the residual is negative. If the point lies exactly on the line, the residual is $0$.
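The sign rule above can be checked with a short Python sketch. The data points and the line $y = 2x + 1$ are invented purely for illustration; they do not come from a real data set.

```python
# Hypothetical data and a hypothetical line y = 2x + 1 (illustration only).
xs = [1, 2, 3, 4]
ys = [3.5, 4.0, 7.0, 9.5]

def predict(x, m=2.0, c=1.0):
    """Return the predicted value y-hat on the line y = mx + c."""
    return m * x + c

# residual = actual y minus predicted y-hat
residuals = [y - predict(x) for x, y in zip(xs, ys)]

for x, y, r in zip(xs, ys, residuals):
    position = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"({x}, {y}): residual = {r:+.1f}, point lies {position} the line")
```

Here the first and last points lie above the line, the second lies below it, and the third lies exactly on it.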

Why Linear Regression Matters in Functions

In the topic of Functions, you study rules that connect inputs to outputs. A regression line is a function because it assigns each input $x$ exactly one predicted output $y$. This makes linear regression part of functional modeling.

Linear regression is useful when data show an approximate linear trend. It helps answer questions such as:

How does one variable change when the other changes?
Can we predict a value using the model?
How strong is the relationship?
Is a straight-line model suitable, or should another model be used?

For IB Mathematics: Applications and Interpretation SL, this is important because real-world data are rarely perfectly neat. A regression line is not claiming that every point lies exactly on a line. Instead, it gives a sensible mathematical summary of the relationship. 📊

For example, suppose a school records the number of hours students revise and their final scores. A regression line might show that more revision is associated with higher scores. The model can then be used to estimate a likely score for a given number of hours, while remembering that real results can still vary.

Key Terms and Ideas

To work confidently with linear regression, you need to know the language used around it.

Scatter plot: A graph showing paired data points $(x, y)$. It helps reveal patterns.

Correlation: A measure of how strongly and in what direction two variables are related.

Positive correlation: As $x$ increases, $y$ tends to increase.
Negative correlation: As $x$ increases, $y$ tends to decrease.
No correlation: No clear pattern is visible.

Regression line: The line that best fits the data according to the chosen method, usually least squares.

Residual: The difference between the actual value and the predicted value.

Outlier: A point far from the general pattern of the data. Outliers can affect the regression line a lot.

Interpolation: Predicting a value within the range of the data.

Extrapolation: Predicting beyond the range of the data. This is riskier because the pattern may change outside the observed values.

For IB analysis, these terms help you explain not only what the line is, but also what it means in context.

How the Best-Fit Line Is Chosen

Imagine a set of data points and several possible straight lines. Which one is best? The least squares method answers this by comparing how far each point is from a line.

For each point, calculate the residual $y - \hat{y}$. Then square each residual so that no value is negative and larger errors have more influence. The best-fitting line is the one that minimizes

$$\sum (y - \hat{y})^2$$

This sum is called the sum of squared residuals.

Why square the residuals? Because positive and negative errors should not cancel each other out, and larger mistakes should matter more. This creates a line that is mathematically balanced for the whole data set.
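A small Python sketch can show why squaring matters. The five data points and the two candidate lines below are assumptions made up for this illustration. For both lines the plain residuals add up to zero, so only the squared sum tells them apart.

```python
# Hypothetical data set (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(m, c):
    """Residuals y - y-hat for the candidate line y = mx + c."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]

def sum_squared_residuals(m, c):
    """Sum of squared residuals for the candidate line y = mx + c."""
    return sum(r ** 2 for r in residuals(m, c))

# Two candidate lines: A is y = 0.6x + 2.2, B is y = x + 1.
sum_a = sum(residuals(0.6, 2.2))
sum_b = sum(residuals(1.0, 1.0))
ssr_a = sum_squared_residuals(0.6, 2.2)
ssr_b = sum_squared_residuals(1.0, 1.0)

print(f"Line A: sum of residuals = {sum_a:.2f}, sum of squares = {ssr_a:.2f}")
print(f"Line B: sum of residuals = {sum_b:.2f}, sum of squares = {ssr_b:.2f}")
```

Line A has the smaller sum of squared residuals ($2.4$ compared with $4$), so by the least squares criterion it is the better fit of the two.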

The least squares regression line always passes through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$.
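In IB work the line is normally found with technology, but the mean-point property can be checked by hand. The sketch below uses the standard closed-form least squares results $m = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $c = \bar{y} - m\bar{x}$, which this lesson does not derive. The revision data and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = mx + c; returns (m, c)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    c = y_bar - m * x_bar
    return m, c

# Hypothetical data: hours revised and test scores (invented for illustration).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

m, c = fit_line(hours, scores)
x_bar = sum(hours) / len(hours)
y_bar = sum(scores) / len(scores)

print(f"regression line: y = {m:.1f}x + {c:.1f}")
print(f"mean point: ({x_bar}, {y_bar})")
print(f"prediction at the mean of x: {m * x_bar + c:.1f}")
```

For this data the fitted line is $y = 6x + 46$, and substituting $\bar{x} = 3$ gives $64$, which equals $\bar{y}$.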

This is useful in IB because it shows that regression is not random guessing. It is a structured method based on the data’s overall pattern.

Interpreting the Regression Equation

Suppose a regression equation is

$$y = 2.5x + 10$$

How should you interpret it?

The gradient $2.5$ means that for each increase of $1$ unit in $x$, the predicted value of $y$ increases by $2.5$ units.
The intercept $10$ means that when $x = 0$, the model predicts $y = 10$.

In context, however, the intercept may or may not be meaningful. For example, if $x$ represents age and $y$ represents height, and the data were collected only from teenagers, then the predicted height at age $0$ is an extrapolation far outside the data and should not be taken literally.

Always interpret the regression equation in context. The numbers are not just symbols; they represent a real situation.

For example, if a model predicts the amount of water a plant needs based on temperature, the gradient tells how much the predicted water amount changes for each degree increase in temperature. That is a meaningful functional relationship 🌱

Using Technology in Regression Analysis

In IB Mathematics: Applications and Interpretation SL, technology is an important part of working with regression. A graphing calculator or statistical software can quickly produce a regression line, the correlation coefficient, and sometimes the coefficient of determination.

The correlation coefficient, often written as $r$, measures the strength and direction of a linear relationship. Its value lies between $-1$ and $1$.

$r$ close to $1$ means strong positive linear correlation.
$r$ close to $-1$ means strong negative linear correlation.
$r$ close to $0$ means weak or no linear correlation, although a non-linear relationship may still exist.

The coefficient of determination, written as $r^2$, tells us the proportion of variation in $y$ explained by the linear model. For example, if $r^2 = 0.81$, then $81\%$ of the variation in $y$ is explained by the model.

However, a high $r$ or $r^2$ does not automatically mean the model is appropriate. You should still inspect the scatter plot, look for outliers, and check whether a straight line is reasonable.
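As a rough sketch of what a calculator does when it reports $r$, the Python code below applies the standard formula $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, which is not stated in this lesson. The data and the function name `correlation` are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: hours revised and test scores (invented for illustration).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_squared = r ** 2

print(f"r = {r:.3f}")
print(f"r^2 = {r_squared:.3f}")
```

For this invented data, $r$ is about $0.98$ and $r^2$ is about $0.96$, which indicates a strong positive linear correlation. The numbers alone still say nothing about outliers or curvature, so the scatter plot remains worth checking.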

Technology helps with calculation, but human interpretation is still needed. 📱

Example in Context

Suppose a sports coach records the number of training sessions $x$ and the sprint time $y$ in seconds. A regression model gives

$$y = -0.4x + 12$$

This means that each extra training session is associated with a predicted decrease of $0.4$ seconds in sprint time. Since lower sprint times are better, the negative gradient makes sense.

If a student has completed $8$ sessions, the model predicts

$$y = -0.4(8) + 12 = 8.8$$

So the predicted sprint time is $8.8$ seconds.

This is interpolation if $8$ is within the range of the data collected. If the data only covered $1$ to $6$ sessions, then using $8$ sessions would be extrapolation, which is less reliable.
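The prediction and the range check can be combined in a short Python sketch. The model $y = -0.4x + 12$ comes from the example above; the data range of $1$ to $6$ sessions is the hypothetical range mentioned in the text, and the function name `predict_sprint_time` is made up for illustration.

```python
def predict_sprint_time(sessions, x_min=1, x_max=6):
    """Predict sprint time from y = -0.4x + 12 and label the prediction.

    x_min and x_max are the assumed smallest and largest x-values in the data.
    """
    time = -0.4 * sessions + 12
    kind = "interpolation" if x_min <= sessions <= x_max else "extrapolation"
    return time, kind

for sessions in (4, 8, 30):
    time, kind = predict_sprint_time(sessions)
    print(f"{sessions} sessions: predicted {time:.1f} s ({kind})")
```

At $30$ sessions the same model predicts a sprint time of $0$ seconds, which is impossible. This shows why extrapolated values need extra caution.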

Notice how this regression line acts as a function. Input $x$ gives output $y$. The model is a simplified mathematical version of the real-world relationship.

Strengths and Limits of Linear Regression

Linear regression is powerful because it is simple, clear, and useful for prediction. It can reveal trends that are hard to see at first glance. It is especially helpful when the data points form a pattern close to a straight line.

But it also has limits:

It only works well when a linear model is appropriate.
It can be distorted by outliers.
It does not prove cause and effect.
Extrapolation can be unreliable.
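The effect of an outlier can be seen in a minimal Python sketch. The data are invented: five points that lie exactly on $y = 2x$, followed by one extra point $(6, 0)$ far from the pattern. The helper `fit_line` is hypothetical and uses the standard closed-form least squares formulas, which this lesson does not derive.

```python
def fit_line(xs, ys):
    """Least squares fit of y = mx + c; returns (m, c)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_bar - m * x_bar

# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
m_clean, c_clean = fit_line(xs, ys)

# The same data with one outlier added at (6, 0).
m_outlier, c_outlier = fit_line(xs + [6], ys + [0])

print(f"without outlier: y = {m_clean:.2f}x + {c_clean:.2f}")
print(f"with outlier:    y = {m_outlier:.2f}x + {c_outlier:.2f}")
```

A single unusual point reduces the gradient from $2$ to about $0.29$, so it is worth examining outliers on the scatter plot before trusting the line.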

For example, if a graph of population growth curves upward, a straight line may not be a good model. In that case, another function type, such as exponential regression, may fit better.

This is an important IB idea: choosing the right model matters as much as calculating it.

Conclusion

Linear regression is a key part of Functions because it turns data into a mathematical model that can be used for interpretation and prediction. It describes relationships with a line of the form $y = mx + c$, where the parameters are chosen to minimize the sum of the squared residuals. By studying the scatter plot, gradient, intercept, correlation, and fit, you can decide whether the model is appropriate and explain what it means in context.

In real life, linear regression helps analyze trends in science, economics, sports, health, and many other areas. In IB Mathematics: Applications and Interpretation SL, it is a practical example of using functions to model the world. ✅

Study Notes

Linear regression finds the straight line that best fits data.
The model usually has the form $y = mx + c$.
The independent variable is $x$ and the dependent variable is $y$.
Residuals are given by $y - \hat{y}$.
The least squares method minimizes $\sum (y - \hat{y})^2$.
The regression line is a function because each $x$ gives a predicted $y$.
A scatter plot helps show whether a linear model is reasonable.
Positive correlation means both variables tend to increase together.
Negative correlation means one variable tends to decrease as the other increases.
The correlation coefficient $r$ measures strength and direction of linear association.
The coefficient of determination $r^2$ shows how much variation is explained by the model.
Interpolation is safer than extrapolation.
Outliers can change the regression line significantly.
Always interpret the equation in context, not just algebraically.
Technology is useful for calculating regression, but understanding the graph and context is still essential.