2. Statistical Methods

Survival Analysis

Techniques for censored and truncated data, hazard functions, Kaplan-Meier and parametric survival models used in life contingencies.

Hey students! 👋 Welcome to one of the most fascinating and practical areas of actuarial science - survival analysis! This lesson will equip you with the essential techniques used to analyze time-to-event data, particularly when dealing with incomplete information through censoring and truncation. By the end of this lesson, you'll understand how actuaries use survival models to predict lifetimes, calculate insurance premiums, and assess risk in real-world scenarios. Get ready to discover the mathematical tools that help insurance companies and researchers make sense of life's uncertainties! 📊

Understanding Survival Analysis Fundamentals

Survival analysis is the statistical study of the time until an event occurs - whether that's death, disease onset, equipment failure, or policy cancellation. In actuarial science, we're particularly interested in modeling human lifetimes and other insurance-related events. What makes survival analysis unique is that we often don't observe the complete story for every individual in our study.

Think about it this way, students: imagine you're studying how long smartphones last before breaking. You start following 1,000 phones in January, but by December, only 300 have actually broken. The remaining 700 are still working! This incomplete information is called censoring, and it's everywhere in actuarial work. When someone buys a life insurance policy, we don't know exactly when they'll die - we only know they were alive when they purchased the policy.

The survival function $S(t)$ represents the probability that an individual survives beyond time $t$. Mathematically, $S(t) = P(T > t)$, where $T$ is the random variable representing the time to event. This function starts at 1 (everyone is alive at time 0) and decreases monotonically toward 0 as time increases. In life insurance, if $S(65) = 0.85$, this means there's an 85% chance a newborn will survive to age 65.

The hazard function $h(t)$ - also called the force of mortality in actuarial contexts - measures the instantaneous risk of the event occurring at time $t$, given survival up to that point. It's defined as $h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}$. Real-world data shows that human mortality follows a "bathtub curve" - high in infancy, low during young adult years, then increasing exponentially with age.

Dealing with Censored and Truncated Data

Censoring occurs when we have incomplete information about the exact time of the event. There are three main types you'll encounter, students:

Right censoring is the most common type in actuarial work. This happens when we know the event occurred after a certain time, but not exactly when. For example, if someone is still alive at the end of a mortality study, we know they survived at least until the study's end date. In typical life insurance studies, the large majority of observations - figures of 60-80% are often cited - are right-censored.

Left censoring occurs when we know the event happened before a certain time, but not exactly when. This might happen if someone already had a pre-existing condition when they entered a health insurance study.

Interval censoring means we only know the event occurred within a specific time interval. For instance, if medical exams happen annually, we might only know someone developed diabetes sometime between their 2022 and 2023 checkups.

Truncation is different from censoring - it's when certain individuals aren't even included in our sample based on their survival times. Left truncation happens when individuals must survive to a certain age to be included in the study. Think about studying the mortality of people over 65 - anyone who died before 65 simply isn't in your dataset. Right truncation occurs when only individuals who experience the event by a certain time are included in the study.

These data complications require special statistical techniques because traditional methods assume complete information. Ignoring censoring and truncation can lead to severely biased estimates - and in industry discussions, mispricing of 20-30% or more is sometimes attributed to exactly this mistake!
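To see why ignoring censoring is so dangerous, consider a quick simulation (a hypothetical sketch, assuming exponential lifetimes - the numbers are illustrative, not real mortality data): naively averaging observed times, treating censored records as if they were deaths, badly underestimates the true mean lifetime.

```python
import random

# Hypothetical illustration: simulate exponential lifetimes with true mean 40,
# right-censor every observation at time 30, then naively average the
# observed times as if they were all actual deaths.
random.seed(0)
true_mean = 40.0
lifetimes = [random.expovariate(1 / true_mean) for _ in range(100_000)]

censor_time = 30.0
observed = [min(t, censor_time) for t in lifetimes]  # censored records show up as 30

naive_mean = sum(observed) / len(observed)
print(naive_mean)  # well below the true mean of 40
```

The naive estimate converges to $E[\min(T, 30)] = 40(1 - e^{-30/40}) \approx 21.1$ years, roughly half the true mean - exactly the kind of bias that proper survival methods are designed to avoid.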

The Kaplan-Meier Estimator

The Kaplan-Meier estimator is the gold standard non-parametric method for estimating survival functions from censored data. Developed in 1958, it's sometimes called the "product-limit estimator" because it multiplies conditional survival probabilities.

Here's how it works, students: At each observed event time $t_i$, we calculate the conditional probability of surviving past that time, given survival up to that point. The formula is:

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

Where $d_i$ is the number of events at time $t_i$, and $n_i$ is the number of individuals at risk just before time $t_i$.

Let's walk through a simple example: Suppose we're tracking 5 insurance policyholders, and we observe deaths at times 2, 4, and 7 years, with 2 people censored (still alive) at times 5 and 8 years.

  • At $t = 2$: 1 death out of 5 at risk, so $\hat{S}(2) = 1 \times (1 - 1/5) = 0.8$
  • At $t = 4$: 1 death out of 4 at risk, so $\hat{S}(4) = 0.8 \times (1 - 1/4) = 0.6$
  • At $t = 7$: 1 death out of 2 at risk (the person censored at time 5 has left the risk set), so $\hat{S}(7) = 0.6 \times (1 - 1/2) = 0.3$

The Kaplan-Meier estimator creates a step function that drops at each observed event time. It's particularly valuable because it makes no assumptions about the underlying distribution of survival times and handles censoring naturally. Modern actuarial software can compute Kaplan-Meier estimates for datasets with millions of observations in seconds.
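The product-limit calculation is simple enough to implement directly. Here is a minimal sketch in plain Python (no libraries; in production you would typically reach for a package such as lifelines or R's survival) - the function name and data layout are my own choices for illustration:

```python
# Minimal Kaplan-Meier (product-limit) estimator.
# times: event or censoring times; observed: True for a death, False for censoring.
def kaplan_meier(times, observed):
    data = sorted(zip(times, observed))
    s_hat = 1.0
    curve = []  # list of (event time, estimated S(t)) steps
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, obs in data if tt == t and obs)
        at_risk = sum(1 for tt, _ in data if tt >= t)  # n_i: still in the risk set
        if deaths > 0:
            s_hat *= 1 - deaths / at_risk  # multiply in the conditional survival
            curve.append((t, s_hat))
        i += sum(1 for tt, _ in data if tt == t)  # skip all records at this time
    return curve

# Worked example from the text: deaths at 2, 4, 7; censored at 5 and 8.
curve = kaplan_meier([2, 4, 5, 7, 8], [True, True, False, True, False])
print(curve)
```

Running it on the five-policyholder example reproduces the step function values 0.8, 0.6, and 0.3 at times 2, 4, and 7 (up to floating-point rounding).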

Parametric Survival Models

While the Kaplan-Meier estimator is flexible, parametric models offer several advantages: they provide smooth survival curves, allow for extrapolation beyond observed data, and enable easier calculation of summary statistics. Here are the key parametric models used in actuarial practice:

The exponential distribution assumes constant hazard over time: $h(t) = \lambda$. The survival function is $S(t) = e^{-\lambda t}$. While simple, this model rarely fits real mortality data well because human mortality rates change dramatically with age.

The Weibull distribution is much more flexible, with hazard function $h(t) = \lambda \gamma t^{\gamma-1}$, where $\lambda > 0$ is the scale parameter and $\gamma > 0$ is the shape parameter. When $\gamma = 1$, it reduces to the exponential distribution. When $\gamma > 1$, the hazard increases with time (like human aging), and when $\gamma < 1$, it decreases (like infant mortality). The Weibull model is widely used in actuarial science because it can model both increasing and decreasing failure rates.

The Gompertz distribution has exponentially increasing hazard: $h(t) = \alpha e^{\beta t}$. This model fits human adult mortality exceptionally well - mortality rates approximately double every 8 years after age 30, a phenomenon known as the "Gompertz law of mortality." Insurance companies often use Gompertz models for life insurance pricing.

The log-normal distribution assumes that the logarithm of survival time follows a normal distribution. It's particularly useful for modeling events that result from the accumulation of many small factors over time.

Model selection typically involves comparing goodness-of-fit statistics like the Akaike Information Criterion (AIC) or likelihood ratio tests. In practice, actuaries often find that Weibull or Gompertz models provide the best balance of flexibility and interpretability for mortality data.
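These hazard functions are straightforward to encode. The sketch below (parameter values are illustrative, not fitted to any dataset) checks two facts from the discussion: the Weibull hazard with $\gamma = 1$ collapses to the constant exponential hazard, and a Gompertz hazard with $\beta = \ln 2 / 8$ doubles every 8 years.

```python
import math

# Hazard and survival functions for the parametric models discussed above.
# All parameter values below are illustrative only.

def weibull_hazard(t, lam, gamma):
    # h(t) = lambda * gamma * t^(gamma - 1)
    return lam * gamma * t ** (gamma - 1)

def gompertz_hazard(t, alpha, beta):
    # h(t) = alpha * e^(beta t)
    return alpha * math.exp(beta * t)

def gompertz_survival(t, alpha, beta):
    # Integrating the hazard gives S(t) = exp(-(alpha/beta) * (e^(beta t) - 1))
    return math.exp(-(alpha / beta) * math.expm1(beta * t))

# With gamma = 1 the Weibull hazard is constant (the exponential model).
print(weibull_hazard(5.0, 0.01, 1.0))

# Gompertz "law": with beta = ln(2)/8, the hazard doubles every 8 years.
beta = math.log(2) / 8
h30 = gompertz_hazard(30, 1e-4, beta)
h38 = gompertz_hazard(38, 1e-4, beta)
print(h38 / h30)  # hazard ratio across an 8-year gap
```

The closed-form Gompertz survival function shown here is what lets insurers extrapolate mortality smoothly to advanced ages where data is sparse.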

Applications in Life Contingencies

Life contingencies - the mathematical foundation of life insurance and pensions - relies heavily on survival analysis techniques. Modern life tables, which form the basis for insurance pricing, are essentially sophisticated survival models incorporating vast amounts of mortality data.

Life insurance pricing uses survival models to calculate expected present values of future benefits. For a whole life insurance policy paying $B$ upon death, the expected present value is:

$$E[PV] = B \int_0^{\infty} e^{-\delta t} f(t) dt$$

Where $\delta$ is the force of interest and $f(t)$ is the probability density function of the time of death.
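To make the formula concrete, here is a sketch that evaluates the integral numerically under an assumed exponential lifetime with force of mortality $\mu$ (so $f(t) = \mu e^{-\mu t}$), a special case chosen because the closed form $E[PV] = B\mu/(\mu+\delta)$ is available as a check. The parameter values are purely illustrative.

```python
import math

# Expected present value of a whole life benefit B paid at the moment of death,
# assuming (illustratively) an exponential lifetime: f(t) = mu * e^(-mu t).
# Closed form for this special case: E[PV] = B * mu / (mu + delta).

def epv_whole_life_exponential(B, mu, delta, horizon=500.0, n=100_000):
    # Midpoint-rule integration of B * e^(-delta t) * f(t) over [0, horizon];
    # the tail beyond the horizon is negligible for these parameters.
    dt = horizon / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        total += math.exp(-delta * t) * mu * math.exp(-mu * t) * dt
    return B * total

B, mu, delta = 100_000, 0.02, 0.05  # illustrative benefit, mortality, interest
approx = epv_whole_life_exponential(B, mu, delta)
exact = B * mu / (mu + delta)
print(approx, exact)
```

The numerical and closed-form answers agree closely, which is a useful sanity check before moving to realistic mortality laws (Gompertz, life tables) where no closed form exists and the integral must be evaluated numerically anyway.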

Pension calculations require joint survival models for couples, incorporating both individual mortality and dependency between spouses' lifetimes. Studies show that widowhood significantly affects mortality rates - the "broken heart syndrome" where surviving spouses have elevated mortality risk in the months following their partner's death.

Disability insurance uses multi-state survival models that account for transitions between healthy, disabled, and dead states. These models help insurers price coverage and estimate claim reserves.

Industry analyses suggest that incorporating modern survival analysis techniques can improve pricing accuracy by 15-25% compared to traditional life table approaches, particularly for subpopulations with unique risk profiles.

Conclusion

Survival analysis provides the mathematical framework for understanding time-to-event data in the face of incomplete information. From the fundamental concepts of survival and hazard functions to sophisticated parametric models, these techniques enable actuaries to quantify risk and uncertainty in life contingencies. The Kaplan-Meier estimator offers a robust non-parametric approach for initial data exploration, while parametric models like Weibull and Gompertz provide the mathematical structure needed for insurance pricing and reserving. Mastering these concepts will prepare you for the real-world challenges of actuarial practice, where incomplete data and complex risk patterns are the norm rather than the exception! 🎯

Study Notes

• Survival function $S(t) = P(T > t)$: probability of surviving beyond time $t$

• Hazard function $h(t)$: instantaneous risk of event at time $t$ given survival to $t$

• Right censoring: know event occurred after observed time (most common in actuarial work)

• Left censoring: know event occurred before observed time

• Interval censoring: know event occurred within specific time interval

• Left truncation: individuals must survive to certain age to enter study

• Right truncation: only individuals experiencing event by certain time included

• Kaplan-Meier estimator: $\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$

• Exponential model: constant hazard $h(t) = \lambda$, survival $S(t) = e^{-\lambda t}$

• Weibull model: hazard $h(t) = \lambda \gamma t^{\gamma-1}$, flexible shape parameter $\gamma$

• Gompertz model: exponentially increasing hazard $h(t) = \alpha e^{\beta t}$, fits adult mortality well

• Gompertz law: human mortality doubles approximately every 8 years after age 30

• Modern survival analysis techniques are reported to improve pricing accuracy by 15-25% over traditional life table methods

• Right censoring is pervasive: often 60-80% of observations in life insurance studies are right-censored

• Parametric models enable extrapolation and smooth survival curves for practical applications

