5. Optimization and Control

Unconstrained Optimization

Study gradient descent, Newton's method, and quasi-Newton methods for unconstrained minimization, along with practical implementation details.

Hey students! šŸ‘‹ Welcome to one of the most powerful tools in applied mathematics - unconstrained optimization! This lesson will teach you how to find the minimum (or maximum) of functions without any constraints holding you back. By the end of this lesson, you'll understand gradient descent, Newton's method, and quasi-Newton methods like BFGS, plus you'll see how these algorithms power everything from machine learning to engineering design. Get ready to discover the mathematical engines that help Netflix recommend movies and help engineers design better airplanes! āœˆļø

Understanding the Optimization Problem

Imagine you're hiking in a foggy mountain range and you want to find the lowest valley. You can't see very far, but you can feel the slope of the ground beneath your feet. This is exactly what unconstrained optimization is like! We're trying to find the minimum value of a function $f(x)$ where $x$ can be a single variable or a vector of many variables.

In mathematical terms, we want to solve:

$$\min_{x \in \mathbb{R}^n} f(x)$$

The function $f(x)$ could represent anything - the cost of manufacturing a product, the error in a machine learning model, or the energy in a physical system. For example, when training a neural network to recognize cats in photos, we're minimizing an error function that measures how wrong our predictions are.

The key insight is that at the minimum point, the gradient (the slope) must be zero. Think about it - if there's still a slope, you could roll a little further downhill! This gives us our fundamental condition:

$$\nabla f(x^*) = 0$$

where $x^*$ is our optimal point and $\nabla f(x)$ is the gradient vector containing all partial derivatives. One caveat: a zero gradient is necessary but not sufficient for a minimum - saddle points and maxima are stationary too, so we rely on curvature (or convexity) to confirm we've truly found a valley floor.
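To make this concrete, here's a tiny worked example (an illustrative quadratic, not tied to any particular application):

$$f(x_1, x_2) = (x_1 - 1)^2 + 3x_2^2, \qquad \nabla f(x) = \begin{pmatrix} 2(x_1 - 1) \\ 6x_2 \end{pmatrix}$$

Setting both components to zero gives $x^* = (1, 0)$, and since the function curves upward in every direction, this stationary point really is the minimum. We'll reuse this little function in the code sketches below.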

Gradient Descent: Following the Steepest Path

Gradient descent is like being that hiker in the fog - you feel the steepest downhill direction and take a step that way. Mathematically, we update our position using:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$$

Here, $\alpha_k$ is the step size (how big a step you take), and $\nabla f(x_k)$ points in the steepest uphill direction, so we subtract it to go downhill.

The beauty of gradient descent lies in its simplicity, but it has some quirks. Picture a long, narrow valley - gradient descent will zigzag back and forth across the valley walls instead of heading straight down the middle. This happens because the algorithm only uses local slope information and doesn't understand the overall shape of the function.

In practice, choosing the step size $\alpha_k$ is crucial. Too small, and you'll crawl toward the minimum like a snail 🐌. Too large, and you might overshoot and bounce around like a ping-pong ball! Modern implementations often use adaptive step sizes or line search methods to automatically find good step sizes.
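To see the whole loop in code, here's a minimal sketch of fixed-step gradient descent applied to the quadratic example from earlier (the step size, tolerance, and iteration cap are illustrative choices, not canonical values):

```python
import numpy as np

def grad_f(x):
    # Gradient of the example quadratic f(x) = (x1 - 1)^2 + 3*x2^2
    return np.array([2.0 * (x[0] - 1.0), 6.0 * x[1]])

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g
    return x, k

x_min, iters = gradient_descent(grad_f, x0=[5.0, 5.0])
print(x_min, iters)   # roughly (1, 0) after about 90 iterations
```

Try raising alpha: the steepest direction here has curvature 6, so any fixed step above $2/6 \approx 0.33$ makes the iterates diverge - exactly the ping-pong behavior described above.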

Google uses variations of gradient descent to train their massive neural networks. With billions of parameters to optimize, gradient descent's simplicity and low memory requirements make it practical even for problems that would overwhelm more sophisticated methods.

Newton's Method: Using Curvature Information

Newton's method is like having a mathematical crystal ball that can see the local shape of your function. Instead of just using the slope (first derivative), it also uses the curvature (second derivative) to make smarter steps.

The Newton update formula is:

$$x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)$$

where $H_k$ is the Hessian matrix containing all second partial derivatives of $f$ at point $x_k$.

Think of it this way: if gradient descent is like walking downhill with your eyes closed, Newton's method is like having a detailed topographic map. The Hessian matrix tells us not just which way is downhill, but how the slope is changing - is the valley getting steeper or flatter?

This extra information allows Newton's method to take much more intelligent steps. In our narrow valley example, Newton's method would recognize the valley's shape and head straight down the middle instead of zigzagging.

The downside? Computing the Hessian and solving with it is expensive! For a function with $n$ variables, the Hessian is an $n \times n$ matrix, and solving the Newton system (in practice you factor $H_k$ rather than explicitly invert it) takes $O(n^3)$ operations. When $n$ is large (think thousands or millions of variables), this becomes computationally prohibitive.

However, when Newton's method works, it's incredibly fast - near the solution it has quadratic convergence, meaning the number of correct digits roughly doubles with each iteration! šŸš€
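A minimal sketch of the pure Newton iteration is below; note that it solves the linear system $H_k p = -\nabla f(x_k)$ rather than forming $H_k^{-1}$ explicitly, which is both cheaper and numerically safer (the test function is the same illustrative quadratic as before):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton's method: solve H_k p = -grad(x_k), then step to x_k + p."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)   # Newton direction, no explicit inverse
        x = x + p
    return x

# The example quadratic f(x) = (x1 - 1)^2 + 3*x2^2 has constant Hessian diag(2, 6).
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 6.0 * x[1]])
hess_f = lambda x: np.diag([2.0, 6.0])
print(newton(grad_f, hess_f, x0=[5.0, 5.0]))   # one step lands exactly on (1, 0)
```

Because the example is quadratic, the very first Newton step hits the minimum exactly - quadratic convergence at its most extreme.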

Quasi-Newton Methods: The Best of Both Worlds

Quasi-Newton methods are the clever compromise between the simplicity of gradient descent and the power of Newton's method. They build up an approximation of the Hessian matrix using only gradient information, avoiding expensive second derivative calculations.

The most famous quasi-Newton method is BFGS (Broyden-Fletcher-Goldfarb-Shanno), named after its four inventors. The BFGS algorithm maintains an approximation $B_k$ of the Hessian and updates it each iteration using the formula:

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$$

where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.

This might look intimidating, but the idea is beautiful: we're using the change in position ($s_k$) and the change in gradient ($y_k$) to learn about the function's curvature. It's like a detective gathering clues about the crime scene!
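The formula is engineered so that the new approximation satisfies the secant condition $B_{k+1} s_k = y_k$: it maps the step to the observed gradient change, exactly as the true Hessian would on a quadratic. Here's a minimal numpy sketch of one update (the matrix $A$, the step $s$, and the safeguard threshold are illustrative choices):

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B from the pair (s_k, y_k)."""
    sy = y @ s
    if sy <= 1e-12:            # curvature condition y^T s > 0 must hold;
        return B               # otherwise skip the update (standard safeguard)
    Bs = B @ s
    return B + np.outer(y, y) / sy - np.outer(Bs, Bs) / (s @ Bs)

# Check on a quadratic f(x) = 0.5 x^T A x, where the gradient change is y = A s.
A = np.array([[2.0, 1.0], [1.0, 6.0]])
B = np.eye(2)                  # initial Hessian guess
s = np.array([1.0, -0.5])      # a hypothetical step x_{k+1} - x_k
y = A @ s                      # the corresponding gradient change
B = bfgs_update(B, s, y)
print(np.allclose(B @ s, y))   # True: the secant condition B_{k+1} s_k = y_k holds
```

The guard on $y_k^T s_k > 0$ is standard practice: it keeps $B_k$ positive definite, and in a full implementation the Wolfe line search conditions guarantee it holds.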

BFGS has become the workhorse of optimization because it combines several advantages:

  • It only needs gradient information (no expensive Hessian calculations)
  • It achieves superlinear convergence (faster than gradient descent, though not quite as fast as Newton's method)
  • It's robust and works well on a wide variety of problems

Major software packages rely on it: MATLAB's fminunc uses a BFGS-based quasi-Newton algorithm by default, and Python's scipy.optimize.minimize defaults to BFGS for unconstrained problems. The same methods show up across industry, from tuning battery management systems to fitting molecular models in drug design.
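In practice you'd call a library rather than roll your own. A minimal scipy example, using its built-in Rosenbrock test function and analytic gradient:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])   # classic starting point for the Rosenbrock function

# BFGS with an analytic gradient; omit jac= and scipy falls back to
# finite-difference gradients (convenient, but slower and less accurate).
result = minimize(rosen, x0, method="BFGS", jac=rosen_der, options={"gtol": 1e-8})

print(result.x)     # approximately [1., 1.], the known minimizer
print(result.nit)   # number of iterations taken
```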

Practical Implementation Considerations

Real-world optimization isn't just about the core algorithm - there are many practical details that can make or break your success! šŸ”§

Line Search Methods: Instead of using a fixed step size, most practical implementations use a line search to automatically find good step sizes. The Armijo (sufficient decrease) and Wolfe (curvature) conditions provide mathematical criteria for accepting or rejecting proposed steps.
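Here's a sketch of the simplest variant: a backtracking search enforcing only the Armijo sufficient-decrease part (the constant $c = 10^{-4}$ and the halving factor are conventional defaults, not requirements; a full Wolfe search also checks a curvature condition):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, c=1e-4, shrink=0.5):
    """Shrink alpha until the Armijo condition holds:
    f(x + alpha*p) <= f(x) + c * alpha * grad_f(x)^T p.
    p must be a descent direction, or this loop never terminates."""
    alpha = alpha0
    fx, slope = f(x), grad_f(x) @ p
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= shrink   # step was too ambitious: cut it down
    return alpha

# Illustrative use on the example quadratic with a steepest-descent direction.
f = lambda x: (x[0] - 1.0) ** 2 + 3.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 6.0 * x[1]])
x = np.array([5.0, 5.0])
p = -grad_f(x)
print(backtracking_line_search(f, grad_f, x, p))   # accepts alpha = 0.25 here
```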

Convergence Criteria: How do you know when to stop? Common criteria include (a combined test is sketched after this list):

  • $\|\nabla f(x_k)\| < \epsilon$ (gradient is small enough)
  • $|f(x_{k+1}) - f(x_k)| < \epsilon$ (function value isn't changing much)
  • $\|x_{k+1} - x_k\| < \epsilon$ (position isn't changing much)
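A practical stopping test usually combines these and adds relative scaling so the tolerances stay meaningful when $f$ or $x$ is large. A sketch (the particular mix of absolute and relative terms is a design choice, not a standard):

```python
import numpy as np

def converged(g, f_old, f_new, x_old, x_new, eps=1e-8):
    """Stop on a small gradient, or when both f and x have stopped moving.
    The (1 + |.|) factors blend absolute and relative tolerances."""
    small_grad = np.linalg.norm(g) < eps
    flat_f = abs(f_new - f_old) < eps * (1.0 + abs(f_old))
    flat_x = np.linalg.norm(x_new - x_old) < eps * (1.0 + np.linalg.norm(x_old))
    return small_grad or (flat_f and flat_x)
```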

Scaling and Conditioning: Some optimization problems are naturally harder than others. A function might change rapidly in one direction but slowly in another, creating numerical difficulties. Proper scaling of variables and preconditioning can dramatically improve performance.

Memory Management: For very large problems, even storing the BFGS approximation matrix becomes expensive. Limited-memory BFGS (L-BFGS) keeps only the most recent information, making it practical for problems with millions of variables.
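The core trick in L-BFGS is the two-loop recursion, which applies the approximate inverse Hessian to the gradient using only the last few stored pairs $(s_k, y_k)$ and never forms a matrix. Here's a sketch following the standard textbook recursion (the initial scaling $\gamma$ is the usual choice; how many pairs to keep, often 5-20, is up to you):

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    """Two-loop recursion: returns the L-BFGS search direction -H_k * g
    using only the stored (s, y) pairs, ordered oldest to newest."""
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Scale by a curvature estimate along the most recent step (gamma * I).
    if s_hist:
        gamma = (s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1])
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return -r
```

Each direction costs only $O(mn)$ time and memory for $m$ stored pairs, which is what makes L-BFGS viable with millions of variables.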

Conclusion

Unconstrained optimization gives us powerful tools for finding the best solutions to mathematical problems. Gradient descent provides a simple, reliable foundation that works even for massive problems. Newton's method offers blazing speed when you can afford the computational cost. Quasi-Newton methods like BFGS give you the best balance of speed, reliability, and practicality. Whether you're training AI models, designing aircraft, or optimizing financial portfolios, these methods are the mathematical engines that make it all possible! šŸŽÆ

Study Notes

• Optimization Goal: Find $x^*$ such that $\nabla f(x^*) = 0$ (gradient equals zero at minimum)

• Gradient Descent Update: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$

  • Uses only first derivatives (gradient)
  • Simple but can be slow on poorly conditioned problems
  • Linear convergence rate

• Newton's Method Update: $x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)$

  • Uses both gradient and Hessian matrix $H_k$
  • Quadratic convergence (very fast when close to solution)
  • Expensive: requires $O(n^3)$ operations per iteration

• BFGS Quasi-Newton Method:

  • Approximates Hessian using only gradient information
  • Update formula: $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$
  • Superlinear convergence rate
  • Most practical method for medium-sized problems

• Key Implementation Details:

  • Use line search methods (Armijo sufficient-decrease and Wolfe curvature conditions) for step size selection
  • Common convergence criteria: $\|\nabla f(x_k)\| < \epsilon$
  • L-BFGS for large-scale problems (limited memory version)

• Method Selection Guide:

  • Small problems (n up to ~100): Newton's method
  • Medium problems (n roughly 100 to 10,000): BFGS
  • Large problems (n beyond ~10,000): L-BFGS or gradient descent
