Regression 📈
Introduction: Why do we need regression?
Imagine you are tracking how many hours students study and their test scores. The points on a graph will probably not lie on a perfect line, but there may still be a clear pattern. Regression is the statistical method that helps us describe that pattern and use it to make predictions. In IB Mathematics: Analysis and Approaches HL, regression is a key part of Statistics and Probability because it turns data into a model that can be analyzed, interpreted, and used responsibly.
Learning objectives
By the end of this lesson, you should be able to:
- explain the main ideas and terminology behind regression,
- apply IB Mathematics: Analysis and Approaches HL reasoning and procedures related to regression,
- connect regression to the wider topic of Statistics and Probability,
- summarize how regression fits into data analysis,
- use examples and evidence related to regression in real-life contexts.
Regression is used everywhere: predicting house prices from size, estimating temperature from altitude, or examining whether screen time is related to sleep. But regression is not just about drawing a line. It is about understanding the relationship between variables, measuring how well a model fits, and knowing when prediction is reasonable.
What regression means
In statistics, regression studies the relationship between a dependent variable and one or more independent variables. In the simplest IB setting, we often use two variables: one explanatory variable and one response variable. The explanatory variable is usually placed on the $x$-axis, and the response variable is on the $y$-axis.
A regression model tries to find an equation that describes the average pattern in the data. For a linear model, the equation is often written as $y=mx+c$, where $m$ is the gradient and $c$ is the intercept. In regression, this line is not chosen just by looking at the graph. It is usually calculated using a method that makes the model fit the data as well as possible.
The most common model in school mathematics is the line of best fit. This line does not pass through every point. Instead, it balances the data points so that the overall prediction error is as small as possible. The standard method used is least squares regression, which chooses the line that minimizes the sum of the squared residuals.
A residual is the difference between an observed value and the predicted value from the model. If a point has coordinates $(x,y)$ and the model predicts $\hat{y}$, then the residual is $y-\hat{y}$. Residuals show how far off the model is for each data point.
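To make the least squares idea concrete, here is a minimal sketch using made-up study data (the numbers are invented for illustration, not taken from the lesson). It computes the gradient and intercept from the standard least squares formulas and then checks a key property: the residuals of a least squares line sum to zero.

```python
# Hypothetical data: hours studied (x) and test scores (y)
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 73]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares formulas: m = Sxy / Sxx, c = y_bar - m * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
m = s_xy / s_xx
c = y_bar - m * x_bar

# Residuals: observed y minus predicted y-hat
residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]

print(round(m, 3), round(c, 3))   # gradient and intercept of the fitted line
print(round(sum(residuals), 9))   # residuals of a least squares line sum to 0
```

On a calculator or in software, the same fit is produced automatically; the point of writing out the formulas is to see where the line comes from.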
Linear regression and the line of best fit
Linear regression is used when the relationship between two variables looks roughly straight-line shaped. For example, if the number of hours studied increases and test scores tend to rise at a fairly steady rate, a linear model may be suitable.
The regression line can be used for interpolation, which means predicting values within the range of the data. This is usually safer than extrapolation, which means predicting values outside the range of the data. Extrapolation can be risky because the pattern may change beyond the data collected.
For a data set, the calculator or software may produce a regression equation such as $y=2.4x+58$. This means that, on average, when $x$ increases by $1$, the predicted value of $y$ increases by $2.4$. The intercept $58$ is the predicted value when $x=0$, although that value may or may not make sense in context.
Example
Suppose a teacher records the relationship between the number of revision sessions $x$ and a student’s score $y$. If the regression equation is $y=5x+40$, then a student who attends $6$ sessions is predicted to score $y=5(6)+40=70$.
This prediction is useful, but it is not exact. Another student with the same number of sessions might score higher or lower because of motivation, memory, exam stress, or many other factors. Regression gives a model, not certainty.
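The prediction above can be sketched as a small function. The observed score of $74$ below is invented purely to show how a residual compares an actual result with the model's prediction.

```python
# The example model from the text, y = 5x + 40, written as a function
def predicted_score(sessions):
    return 5 * sessions + 40

# Prediction for a student who attends 6 revision sessions
print(predicted_score(6))  # 70

# A (hypothetical) student who actually scored 74 has residual 74 - 70 = 4
residual = 74 - predicted_score(6)
print(residual)  # 4
```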
Correlation, strength, and direction
Regression is closely connected to correlation, which measures the strength and direction of a linear relationship. The correlation coefficient is often written as $r$, and it satisfies $-1\le r\le 1$.
- If $r$ is close to $1$, there is a strong positive linear relationship.
- If $r$ is close to $-1$, there is a strong negative linear relationship.
- If $r$ is close to $0$, there is little or no linear relationship.
A positive correlation means that as $x$ increases, $y$ tends to increase too. A negative correlation means that as $x$ increases, $y$ tends to decrease.
However, correlation and causation are not the same thing. A strong correlation does not prove that one variable causes the other. For example, ice cream sales and sunburn cases may both rise in summer, but buying ice cream does not cause sunburn. A third variable, such as hot weather, explains both.
In IB questions, you may need to interpret $r$ carefully. A value such as $r=0.92$ suggests a strong positive linear relationship, but it does not guarantee that the model is perfect or that predictions will always be accurate.
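For a concrete (and again hypothetical) data set, $r$ can be computed from the standard Pearson formula $r=\dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$, as in this sketch:

```python
import math

# Hypothetical paired data (e.g. hours studied vs test score)
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 73]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Pearson correlation coefficient: r = Sxy / sqrt(Sxx * Syy)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / math.sqrt(s_xx * s_yy)

print(round(r, 3))  # close to 1, so a strong positive linear relationship
```

A value of $r$ this close to $1$ reflects the fact that this toy data was chosen to rise steadily; real data would usually be noisier.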
Measuring the quality of a regression model
A regression model is useful only if it fits the data reasonably well. One way to judge this is by looking at residuals. If the residuals are scattered randomly around $0$, that suggests the model may be appropriate. If the residuals show a pattern, such as a curve, then a straight-line model may not be suitable.
The coefficient of determination, written as $r^2$ in linear regression, is also important. It tells us the proportion of the variation in $y$ explained by the linear model. For example, if $r^2=0.81$, then about $81\%$ of the variation in $y$ can be explained by the model, and about $19\%$ is due to other factors or random variation.
In context, this helps students judge whether a regression model is strong enough to be useful. A high $r^2$ suggests a better fit, but even then, the model should still be checked for appropriateness and realism.
Example with interpretation
If a model for house price and size gives $r^2=0.64$, then $64\%$ of the variation in house prices is explained by house size alone. That still leaves a lot unexplained, such as location, number of bathrooms, age of the property, and local demand. So the model may be helpful, but it is incomplete.
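Using the same kind of toy data as before, $r^2$ can be computed either by squaring $r$ or as the proportion of variation explained, $1-\dfrac{SS_{\text{res}}}{SS_{\text{tot}}}$. For a least squares line the two agree, which this sketch checks:

```python
import math

# Hypothetical paired data
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 73]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares line
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
m = s_xy / s_xx
c = y_bar - m * x_bar

# r^2 as the proportion of variation in y explained by the line
ss_res = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Same value as squaring the correlation coefficient
r = s_xy / math.sqrt(s_xx * ss_tot)
print(round(r_squared, 3))
print(abs(r_squared - r ** 2) < 1e-9)  # True
```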
Beyond straight lines: other regression models
Not all relationships are linear. In IB Mathematics: Analysis and Approaches HL, you may also meet non-linear regression models, depending on the data and the calculator tools used. Common examples include quadratic, exponential, and power models.
A quadratic model might have the form $y=ax^2+bx+c$. This can be useful when the data curves upward or downward. An exponential model may have the form $y=ab^x$, which is useful for growth or decay patterns, such as population increase or radioactive decay. A power model has the form $y=ax^b$, which may suit relationships like area versus side length in some contexts.
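One common calculator technique for the exponential model $y=ab^x$ is to take logarithms: $\ln y=\ln a+x\ln b$, which is linear in $x$, so ordinary least squares can be applied to $(x,\ln y)$. The sketch below uses invented data that follows $y=3\cdot 2^x$ exactly, so the fit recovers $a=3$ and $b=2$:

```python
import math

# Hypothetical growth data generated by y = 3 * 2^x
x = [0, 1, 2, 3, 4]
y = [3.0, 6.0, 12.0, 24.0, 48.0]

# Fit y = a * b^x by regressing ln(y) on x: ln(y) = ln(a) + x * ln(b)
ln_y = [math.log(v) for v in y]
n = len(x)
x_bar = sum(x) / n
ly_bar = sum(ln_y) / n

slope = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, ln_y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
intercept = ly_bar - slope * x_bar

# Transform the linear parameters back: a = e^intercept, b = e^slope
a, b = math.exp(intercept), math.exp(slope)
print(round(a, 3), round(b, 3))  # 3.0 2.0
```

With real, noisy data the recovered $a$ and $b$ would only approximate the true values, and the residuals should still be checked on the original scale.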
Choosing a model depends on the shape of the scatterplot, the context of the data, and the residuals. A model should make sense mathematically and in the real world. For instance, a model predicting height from age may work for children but not forever, because growth slows and eventually stops.
In IB work, it is important not just to calculate a regression equation but also to explain why the model is suitable. That means students should look at the trend, the residuals, and the context before trusting a prediction.
Regression in the wider topic of Statistics and Probability
Regression belongs to Statistics and Probability because it uses data to describe and predict real-world outcomes. It connects closely to other ideas in the topic:
- Data collection: regression depends on reliable data, so sampling method matters.
- Statistical description: scatterplots, means, and spread help summarize the relationship.
- Correlation: this shows whether the relationship is positive, negative, strong, or weak.
- Probability: while regression is not the same as probability, both help make informed decisions under uncertainty.
- Modeling: regression is one of the main tools for building mathematical models from data.
A good statistical investigation often starts with collecting data, then describing it, then checking patterns, and finally building a regression model if appropriate. This process is important in science, economics, medicine, sport, and social studies.
Real-world example
A sports scientist might use regression to compare training hours $x$ with sprint time $y$. If the line of best fit is $y=-0.3x+12$, then more training hours are associated with lower sprint times, which means faster performance. But the scientist must still be careful: the data may be influenced by diet, age, injury, or talent.
Conclusion
Regression helps students turn data into a mathematical model that can be used for interpretation and prediction. In IB Mathematics: Analysis and Approaches HL, the main goals are to identify the type of relationship, choose an appropriate model, interpret the equation, and judge how reliable the model is. Regression is powerful because it connects graphs, equations, and real-life decision-making. At the same time, it must be used carefully, with attention to context, residuals, and the limits of prediction.
Study Notes
- Regression is a method for modeling the relationship between variables.
- In simple linear regression, the model often has the form $y=mx+c$.
- The residual is $y-\hat{y}$, the difference between observed and predicted values.
- The line of best fit is usually found using least squares regression.
- Correlation coefficient $r$ measures direction and strength of a linear relationship, with $-1\le r\le 1$.
- Correlation does not prove causation.
- The coefficient of determination $r^2$ shows the proportion of variation explained by the model.
- Interpolation is usually safer than extrapolation.
- Residual plots help check whether a model is appropriate.
- Regression can be linear or non-linear, depending on the data.
- Regression is an important part of Statistics and Probability because it supports data analysis, modeling, and prediction.
