Hypothesis Testing
In statistics, we often have to make decisions using data that is incomplete or noisy. Imagine a school wants to know whether a new revision app really improves exam scores, or a company wants to know whether a new bottle design reduces leaks. We cannot test every person or every product, so we rely on samples and statistical reasoning. Hypothesis testing is the formal way to use sample evidence to judge whether a claim about a population is believable.
Learning objectives:
- Explain the main ideas and terminology behind hypothesis testing.
- Apply the reasoning about hypothesis testing expected in IB Mathematics: Applications and Interpretation HL.
- Connect hypothesis testing to data analysis, distributions, and inferential reasoning.
- Summarize how hypothesis testing fits into Statistics and Probability.
- Use examples to interpret results in real-world situations.
What hypothesis testing is and why it matters
Hypothesis testing is a process for checking a claim about a population using sample data. A population is the full group being studied, such as all students in a school or all items made by a factory. A sample is a smaller group taken from the population. Because samples vary, results are never perfectly certain. Hypothesis testing helps us decide whether the sample gives strong enough evidence to support a claim.
The starting point is usually a pair of statements:
- the null hypothesis, written as $H_0$,
- the alternative hypothesis, written as $H_1$ or $H_a$.
The null hypothesis is the default idea, often saying that there is no effect, no difference, or no change. The alternative hypothesis is what we are trying to find evidence for. For example, if a school thinks a new tutoring program raises mean test scores, one possible setup is:
$$H_0: \mu = 70$$
$$H_1: \mu > 70$$
Here, $\mu$ is the population mean score. The null says the mean is still $70$, while the alternative says the mean is greater than $70$.
A key idea is that hypothesis testing does not prove something with absolute certainty. Instead, it measures whether the sample result would be unusual if $H_0$ were true. If the result is very unusual, we may reject $H_0$.
The logic of the test step by step
A hypothesis test usually follows a clear sequence.
First, state the hypotheses. These must be written carefully and should refer to a population parameter such as $\mu$, $p$, or $\sigma$.
Second, choose a significance level, written as $\alpha$. This is the threshold for deciding whether evidence is strong enough. Common values are $\alpha = 0.05$ or $\alpha = 0.01$. The significance level is the probability of rejecting $H_0$ when $H_0$ is actually true.
Third, calculate a test statistic from the sample. The test statistic is a number that measures how far the sample result is from what $H_0$ predicts. In a one-sample $z$-test for a mean, a common test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.
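To see the formula in action, here is a minimal Python sketch; the numbers ($\bar{x} = 73.2$, $\sigma = 8$, $n = 36$) are invented for illustration, not taken from a real data set.

```python
from math import sqrt

# One-sample z-statistic for a mean, with sigma assumed known.
x_bar = 73.2   # sample mean (illustrative)
mu_0 = 70.0    # hypothesized mean under H0
sigma = 8.0    # assumed known population standard deviation
n = 36         # sample size

z = (x_bar - mu_0) / (sigma / sqrt(n))
print(f"z = {z:.3f}")  # z = 2.400
```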
Fourth, find the $p$-value or compare the statistic to a critical value. The $p$-value is the probability of getting a result at least as extreme as the sample result, assuming $H_0$ is true. A small $p$-value means the sample result would be rare if the null hypothesis were correct.
Fifth, make a decision:
- if $p \le \alpha$, reject $H_0$;
- if $p > \alpha$, fail to reject $H_0$.
It is important to say "fail to reject" rather than "accept": the sample may not provide enough evidence to reject the null, but that does not prove the null is true.
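Continuing the illustrative sketch above, the $p$-value and the decision rule can be checked in a few lines (here using SciPy's normal distribution for a one-tailed test):

```python
from scipy.stats import norm

alpha = 0.05
z = 2.400  # test statistic from the sketch above

# One-tailed p-value for H1: mu > 70 is P(Z >= z) under H0.
p_value = norm.sf(z)
print(f"p = {p_value:.4f}")  # p ≈ 0.0082

if p_value <= alpha:
    print("Reject H0: sufficient evidence that mu > 70.")
else:
    print("Fail to reject H0: not enough evidence.")
```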
One-tailed and two-tailed tests
Hypothesis tests can be one-tailed or two-tailed.
A one-tailed test checks for a change in only one direction. For example, if a manufacturer wants to know whether a new machine reduces the defect rate, the alternative may be $H_1: p < 0.08$, where $p$ is the true defect proportion. The direction matters because the manufacturer only cares about improvement in one direction.
A two-tailed test checks for a difference in either direction. For example, if a scientist wants to know whether a medicine changes blood pressure, the alternative may be $H_1: \mu \ne 120$. This means the mean could be higher or lower than $120$.
The shape of the decision region depends on the test type. In a two-tailed test, the rejection area is split between both tails of the distribution. In a one-tailed test, all of $\alpha$ is placed in one tail.
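As a quick check of how the two cases differ, this sketch computes the critical values of a $z$-test at $\alpha = 0.05$:

```python
from scipy.stats import norm

alpha = 0.05

z_one = norm.ppf(1 - alpha)      # one-tailed: all of alpha in one tail, ≈ 1.645
z_two = norm.ppf(1 - alpha / 2)  # two-tailed: alpha/2 in each tail, ≈ 1.960

print(f"one-tailed critical value:  {z_one:.3f}")
print(f"two-tailed critical values: ±{z_two:.3f}")
```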
A useful real-world example is sports performance. Suppose a coach believes a new training plan changes sprint times. If the coach only cares whether times improve, a one-tailed test may be suitable. If the coach cares about any change, positive or negative, a two-tailed test is better.
Understanding evidence, errors, and decisions
Hypothesis testing is about making decisions with uncertainty, so errors can happen.
A Type I error happens when $H_0$ is rejected even though $H_0$ is true. The probability of a Type I error is $\alpha$.
A Type II error happens when $H_0$ is not rejected even though $H_1$ is true. The probability of a Type II error is often written as $\beta$.
The power of a test is $1-\beta$. It is the probability of correctly rejecting $H_0$ when the alternative is true. Larger sample sizes usually increase power, making it easier to detect real effects.
For example, imagine a school testing whether a new revision method improves average scores. If the school sets $\alpha = 0.05$, then there is a 5% chance of a false alarm when the method actually has no effect. If the sample size is small, a real improvement might still be missed, creating a Type II error.
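To make the sample-size effect concrete, the sketch below computes the power of the one-tailed $z$-test from the earlier example, assuming (purely for illustration) that the true mean is $73$:

```python
from math import sqrt
from scipy.stats import norm

# Power of the one-tailed z-test of H0: mu = 70 vs H1: mu > 70.
mu_0, sigma, alpha = 70.0, 8.0, 0.05
mu_true = 73.0  # assumed true mean if the revision method works

for n in (16, 36, 100):
    se = sigma / sqrt(n)
    crit = mu_0 + norm.ppf(1 - alpha) * se   # reject H0 when x_bar > crit
    power = norm.sf((crit - mu_true) / se)   # P(reject H0 | mu = mu_true)
    print(f"n = {n:3d}: power ≈ {power:.2f}")
```

Under these assumptions the power rises from about $0.44$ at $n = 16$ to about $0.98$ at $n = 100$, which is exactly the sense in which larger samples make real effects easier to detect.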
This balance is important in real life. In medicine, a Type I error might mean approving an ineffective treatment. A Type II error might mean missing a treatment that really works. In business, it might mean launching a weak product or failing to notice a strong one. Decisions should consider the consequences of both types of error.
Hypothesis testing and probability models
Hypothesis testing connects closely to probability distributions. The test statistic is often compared with a theoretical distribution, such as the normal distribution or the binomial distribution.
For example, if a company claims that only $10\%$ of its products are defective, and a sample of $n=50$ items finds $9$ defects, a binomial model may help judge whether that number is unusual. The random variable $X$ could represent the number of defective items, with
$$X \sim \text{Bin}(50, 0.10)$$
If the observed result is very unlikely under this model, that gives evidence against the claim.
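Whether $9$ defects out of $50$ really is unusual can be checked from the binomial tail; a short sketch:

```python
from scipy.stats import binom

# P(X >= 9) when X ~ Bin(50, 0.10); sf(8) gives P(X > 8).
p_tail = binom.sf(8, 50, 0.10)
print(f"P(X >= 9) ≈ {p_tail:.3f}")  # ≈ 0.058
```

A tail probability of about $0.058$ sits just above the usual $\alpha = 0.05$, so this sample alone would not quite reach significance in a one-tailed test.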
In many IB-style questions, students must decide whether a binomial model is appropriate. This usually requires:
- a fixed number of trials,
- two outcomes for each trial,
- independence,
- constant probability of success.
When these conditions hold, the binomial distribution can be used to calculate probabilities and support hypothesis testing. In other situations, a normal approximation may be used when sample sizes are large enough and the conditions are suitable.
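One common rule of thumb (there are several) asks for $np$ and $n(1-p)$ to both be at least about $5$ before a normal approximation is trusted; the sketch below applies the approximation to the defect example, with a continuity correction:

```python
from math import sqrt
from scipy.stats import norm

n, p = 50, 0.10
mean, sd = n * p, sqrt(n * p * (1 - p))  # np = 5 (borderline), sd ≈ 2.12

# Approximate P(X >= 9) with a continuity correction at 8.5.
approx = norm.sf((8.5 - mean) / sd)
print(f"normal approximation: P(X >= 9) ≈ {approx:.3f}")  # ≈ 0.049
```

Here $np = 5$ is borderline by the rule of thumb itself, which is why the approximation ($\approx 0.049$) drifts noticeably from the exact binomial value ($\approx 0.058$).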
A quick example: suppose a company states that the defect rate is $p = 0.10$. If the sample result is much larger than expected, such as $12$ defects in $20$ items, the probability of that happening by chance may be very small. That small probability gives evidence that the true defect rate might be higher than claimed.
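SciPy's exact binomial test quantifies this directly; the sketch assumes the $12$-out-of-$20$ numbers from the example:

```python
from scipy.stats import binomtest

# Exact test of H0: p = 0.10 against H1: p > 0.10,
# given 12 defects in a sample of 20 items.
result = binomtest(12, n=20, p=0.10, alternative="greater")
print(f"p-value ≈ {result.pvalue:.1e}")  # far below any usual alpha
```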
How to write and interpret conclusions
In IB Mathematics, interpretation matters as much as calculation. A good conclusion should mention the context, the decision, and the meaning of the result.
A strong conclusion might look like this:
"Since the $p$-value is less than $\alpha = 0.05$, there is sufficient evidence to reject $H_0$. The data suggest that the mean exam score after the new tutoring program is greater than $70$."
Notice that this conclusion does not say the claim is proven. It says the data provide sufficient evidence. This language is statistically accurate and important in real-world reasoning.
If the result is not significant, the conclusion should say that there is not enough evidence to support the alternative hypothesis. For example:
"Since the $p$-value is greater than $\alpha = 0.05$, there is insufficient evidence to reject $H_0$. The sample does not show a clear increase in the mean score."
This careful wording helps avoid overstating what the data can prove.
Conclusion
Hypothesis testing is one of the central tools in statistics because it turns sample data into informed decisions. It connects the ideas of sampling, probability, distributions, and uncertainty. By setting up $H_0$ and $H_1$, choosing $\alpha$, calculating a test statistic, and interpreting the $p$-value, you can judge whether the evidence supports a claim. In IB Mathematics: Applications and Interpretation HL, hypothesis testing is not just a calculation method; it is a way to reason carefully about the real world, from education and medicine to manufacturing and sports.
Study Notes
- Hypothesis testing uses sample data to evaluate a claim about a population.
- The null hypothesis is written as $H_0$ and represents the default statement.
- The alternative hypothesis is written as $H_1$ or $H_a$ and represents the claim being tested.
- A significance level $\alpha$ is the cutoff for deciding whether evidence is strong enough.
- A $p$-value is the probability of getting a result at least as extreme as the sample result, assuming $H_0$ is true.
- If $p \le \alpha$, reject $H_0$; if $p > \alpha$, fail to reject $H_0$.
- A Type I error means rejecting a true $H_0$.
- A Type II error means not rejecting $H_0$ when $H_1$ is true.
- Power is $1-\beta$ and measures how well a test detects a real effect.
- One-tailed tests look for an effect in one direction; two-tailed tests look for a difference in either direction.
- Hypothesis testing often uses the normal or binomial distribution.
- Good conclusions must be stated in context and should not claim certainty.
- Hypothesis testing is a major part of inferential reasoning in Statistics and Probability.
