Floating-Point Numbers
Introduction
Students, when a computer stores a number, it usually does not keep the number exactly the way humans write it on paper. Instead, it stores an approximation in a special format called a floating-point number. This idea is one of the most important parts of numerical analysis because it explains why computers can sometimes give answers like $0.30000000000000004$ instead of exactly $0.3$.
In this lesson, you will learn how floating-point numbers work, why they are used, and how they connect to numerical error and computation. By the end, you should be able to explain the basic terminology, recognize why rounding happens, and understand how floating-point representation affects calculations in the real world.
Learning goals
- Understand what floating-point numbers are and why computers use them.
- Learn the main terms such as sign, significand (also called the mantissa), exponent, base, precision, and rounding.
- See how floating-point numbers create absolute and relative error.
- Understand how errors can grow during computation.
- Connect floating-point arithmetic to real situations like science, finance, and engineering.
What a floating-point number is
A floating-point number is a way to represent real numbers using a fixed number of digits. The idea is similar to scientific notation. For example, the number $5320$ can be written as $5.32 \times 10^3$. A computer stores numbers in a similar form, but with binary digits instead of decimal digits.
In a general system, a floating-point number has the form
$$x = \pm m \times b^e$$
where $m$ is the significand or mantissa, $b$ is the base, and $e$ is the exponent.
For computers, the base is usually $b = 2$, not $10$. That means the number is written using powers of $2$. A typical floating-point number is normalized, meaning the significand is scaled so that its leading digit is nonzero; in base $2$ that leading digit is always $1$, which makes the representation unique.
For example, in base $10$, $0.00452$ can be written as $4.52 \times 10^{-3}$. In base $2$, a similar idea might look like $1.011_2 \times 2^3$. The computer stores the significand and exponent separately, then reconstructs the value when needed.
This matters because computers have limited storage. They cannot keep infinitely many digits, so many numbers must be rounded. That is the heart of floating-point error.
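To make this concrete, here is a small Python sketch using the standard `math` module; `math.frexp` splits a stored number into its base-2 significand and exponent, and `float.hex()` shows the normalized binary form directly:

```python
import math

x = 5320.0

# math.frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1,
# a base-2 analogue of scientific notation.
m, e = math.frexp(x)
print(m, e)           # 0.6494140625 13
print(m * 2**e == x)  # True

# float.hex() prints a hexadecimal significand times a power of 2.
print(x.hex())        # 0x1.4c80000000000p+12
```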
Why computers use floating-point numbers
Computers need a number system that can handle both very large and very small values. A fixed decimal system with only a few places after the point would fail badly for many tasks. Floating-point numbers solve this by letting the decimal point "float" according to the exponent.
Here is a real-world example: imagine measuring the distance to a star and also the thickness of a sheet of paper. The first is enormous, the second is tiny. Floating-point representation allows a computer to store both kinds of numbers using the same general rule.
The trade-off is that the computer cannot store every real number exactly. This is similar to using a ruler with marks only every $1$ millimeter. You can measure a length to the nearest mark, but not perfectly. Floating-point arithmetic is like that ruler, except the spacing depends on the size of the number.
Because the number of stored digits is limited, some values are rounded to the nearest representable floating-point number. This is why numerical analysis studies not just answers, but also how those answers are approximated.
Main parts of a floating-point system
A floating-point system is usually described by a few key features:
- Base $b$: the numeral system used, often $b = 2$ in computers.
- Precision: the number of digits available for the significand.
- Exponent range: the smallest and largest powers of $b$ allowed.
- Rounding rule: how a number is chosen when it cannot be represented exactly.
A simplified decimal floating-point system might store numbers in the form
$$\pm d_0.d_1d_2\times 10^e$$
where only a certain number of digits are kept. A binary system uses the same idea but with digits $0$ and $1$.
The exponent lets the same system represent values across a huge range. If the exponent is too small, the number may underflow, meaning it becomes too close to $0$ to be represented normally. If the exponent is too large, the number may overflow, meaning it is too large for the system to store.
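These limits are easy to inspect in Python. The sketch below uses the standard `sys.float_info` constants; the printed values are those of IEEE 754 double precision:

```python
import sys

print(sys.float_info.max)  # 1.7976931348623157e+308, the largest finite double
print(sys.float_info.min)  # 2.2250738585072014e-308, the smallest normal double

# Overflow: pushing past the largest exponent produces infinity.
print(sys.float_info.max * 2)      # inf

# Underflow: below the smallest normal number, values lose precision
# gradually (subnormals) and finally round to zero.
print(sys.float_info.min / 2**52)  # 5e-324, the smallest subnormal
print(sys.float_info.min / 2**53)  # 0.0
```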
Another key idea is spacing. Floating-point numbers are not evenly spaced across all real numbers. Near $1$, the gap between neighboring numbers is tiny. For very large numbers, the gap is much bigger. This means the precision is relative, not absolute.
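Python's `math.ulp` (available in Python 3.9 and later) reports exactly this gap, the distance from a number to its next representable neighbor:

```python
import math

print(math.ulp(1.0))     # 2.220446049250313e-16
print(math.ulp(1e6))     # 1.1641532182693481e-10
print(math.ulp(1e15))    # 0.125
print(math.ulp(1e16))    # 2.0

# With a gap of 2.0 at 1e16, adding 1 changes nothing:
print(1e16 + 1 == 1e16)  # True
```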
Rounding and approximation
Since many real numbers cannot be stored exactly, computers round them. Suppose a number must be represented with only three significant digits. Then $12.345$ might be stored as $12.3$ or $12.4$, depending on the rounding rule.
Common rounding modes include:
- Round to nearest: choose the closest representable value.
- Round toward zero: chop off extra digits.
- Round upward: choose the next larger value.
- Round downward: choose the next smaller value.
The most common rule in many systems is round to nearest, usually with ties broken by rounding to an even last digit (round half to even). This helps keep errors small and unbiased on average.
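Binary floating-point hardware fixes one mode, but Python's `decimal` module exposes these rounding rules by name, which makes them easy to compare on the $12.345$ example from above:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_UP, ROUND_CEILING

x = Decimal("12.345")
one_place = Decimal("0.1")  # keep three significant digits of 12.345

print(x.quantize(one_place, rounding=ROUND_HALF_EVEN))  # 12.3 (nearest)
print(x.quantize(one_place, rounding=ROUND_DOWN))       # 12.3 (toward zero)
print(x.quantize(one_place, rounding=ROUND_UP))         # 12.4 (away from zero)
print(x.quantize(one_place, rounding=ROUND_CEILING))    # 12.4 (upward)
```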
Example: if a computer stores $0.1$ in binary floating-point, it cannot represent it exactly because $0.1$ in base $10$ is a repeating binary fraction. So the stored value is a nearby approximation. That is why simple-looking calculations can produce slightly surprising answers.
For example, if $a = 0.1$ and $b = 0.2$, then mathematically $a + b = 0.3$. But in a floating-point system, the stored versions of $a$ and $b$ may cause the computed sum to be slightly off. This is not a mistake in the computer; it is a consequence of limited precision.
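You can verify this in any Python session. Converting the float to `Decimal` even reveals the exact value the computer actually stored for $0.1$:

```python
from decimal import Decimal

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Decimal(float) converts exactly, exposing the stored approximation:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```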
Floating-point numbers and numerical error
Floating-point representation creates rounding error, which is the difference between the exact number and the stored approximation. If the exact value is $x$ and the computed value is $\hat{x}$, then the absolute error is
$$|x - \hat{x}|$$
and the relative error is
$$\frac{|x - \hat{x}|}{|x|}$$
when $x \neq 0$.
These ideas are central to numerical analysis because they tell us how accurate a result is. A small absolute error may still be important if the true number is tiny. A small relative error often gives a better sense of practical accuracy.
For example, if the true value is $1000000$ and the computed value is $1000001$, then the absolute error is $1$, which might seem large at first. But the relative error is
$$\frac{1}{1000000} = 10^{-6}$$
which is very small.
On the other hand, if the true value is $0.001$ and the computed value is $0.002$, the absolute error is also $0.001$, but the relative error is
$$\frac{0.001}{0.001} = 1$$
which is much worse. This shows why relative error is often more informative.
Error propagation in computation
Errors from floating-point representation do not always stay small. When computations involve many steps, errors can propagate and sometimes grow. This is called error propagation.
Suppose a calculation uses the stored value $\hat{x}$ instead of the exact value $x$. If that result is then used in later steps, new rounding errors may be added. Over time, the final answer may differ noticeably from the exact answer.
For example, consider repeated addition of a tiny number. If a computer adds $0.000001$ many times, each step may involve a rounded value. The total effect can be significant after many operations.
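Running this experiment in Python shows the effect: the stored value of $0.000001$ is already inexact, and a million rounded additions leave a visible residue.

```python
total = 0.0
for _ in range(1_000_000):
    total += 0.000001  # each addition rounds to the nearest double

print(total)           # close to 1.0, but not exactly 1.0
print(total == 1.0)    # False
```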
A famous numerical issue is catastrophic cancellation, which happens when two nearly equal numbers are subtracted. If $a$ and $b$ are close, then $a-b$ may lose many correct digits. For instance, if
$$a = 1.234567$$
$$b = 1.234560$$
then the difference is
$$a-b = 0.000007$$
The original numbers share many digits, so subtracting them can reveal only a few meaningful digits and magnify the effect of rounding. This is a major concern in numerical analysis.
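A short experiment makes the digit loss visible. The value $x = 10^{16}$ below is chosen deliberately, because $\sqrt{x+1}$ and $\sqrt{x}$ round to the same double, so the naive subtraction returns $0$ while an algebraically equivalent form keeps the answer:

```python
import math

x = 1e16

# Naive form: sqrt(x + 1) and sqrt(x) are the same stored double
# (1e16 + 1 rounds back to 1e16), so all significant digits cancel.
naive = math.sqrt(x + 1) - math.sqrt(x)

# Equivalent form with no subtraction of nearly equal numbers:
# sqrt(x+1) - sqrt(x) == 1 / (sqrt(x+1) + sqrt(x))
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)   # 0.0
print(stable)  # 5e-09, close to the true value
```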
Computations are most reliable when algorithms are designed to be numerically stable, meaning they do not amplify small errors too much. Stable algorithms are especially important in science, engineering, and data analysis.
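One classic example of a stable technique is compensated (Kahan) summation, which carries a small correction term alongside the running total so rounding errors largely cancel instead of accumulating. A minimal sketch:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation of a sequence of floats."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c             # apply the stored correction
        t = total + y         # low-order digits of y may be lost here...
        c = (t - total) - y   # ...and this recovers them for the next step
        total = t
    return total

values = [0.000001] * 1_000_000
print(sum(values))            # plain summation: slightly off from 1.0
print(kahan_sum(values))      # compensated: 1.0, or within one ulp of it
```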
Real-world importance
Floating-point numbers appear everywhere.
- In engineering, they are used to simulate motion, forces, and stress.
- In medicine, they help process signals from scanners and monitors.
- In finance, they are used in models, although exact decimal handling is often preferred for money because rounding errors can matter.
- In graphics and games, they are used to position objects and calculate movement.
- In machine learning, they are used to store large matrices and perform fast calculations.
Because floating-point numbers are approximations, professionals must think carefully about whether a result is accurate enough for the task. For example, a tiny rounding error may be harmless in a video game, but unacceptable in a satellite navigation system.
This is why numerical analysis is not only about computing answers. It is also about understanding how those answers are produced and how reliable they are.
Conclusion
Floating-point numbers are the standard way computers represent real numbers, but they are not exact representations of all values. They use a sign, significand, and exponent to store numbers across a wide range, usually in base $2$. Because storage is limited, computers round many values, which causes rounding error, absolute error, and relative error.
Students, understanding floating-point numbers is essential for numerical analysis because it explains why computation can introduce small inaccuracies and how those inaccuracies may spread through a calculation. This knowledge helps you judge the quality of computed results and choose better methods when accuracy matters.
Study Notes
- Floating-point numbers store numbers in a form like $\pm m \times b^e$.
- Computers usually use base $2$, not base $10$.
- Floating-point systems have limited precision, so not every real number can be stored exactly.
- Numbers are rounded to the nearest representable value, or by another rounding rule.
- The difference between an exact value $x$ and a computed value $\hat{x}$ is measured by absolute error $|x-\hat{x}|$.
- Relative error is $\frac{|x-\hat{x}|}{|x|}$ for $x \neq 0$.
- Floating-point arithmetic can cause error propagation when results from one step are used in later steps.
- Catastrophic cancellation can happen when subtracting nearly equal numbers.
- Numerical analysis studies how to measure, control, and reduce these errors.
- Floating-point numbers are essential in real-world computing, but their limitations must always be considered.
