Source: CMU 16-745 Study Notes, taught by Prof. Zac Manchester
Content
Review
Root Finding
- Newton’s Method
- Minimization
- Regularization/Damped Newton’s Method
Lecture 4: Optimization Pt. 2
Overview
- Line Search (which can solve the “overshoot” problem)
- The trust-region method can also solve this problem.
- Constrained minimization
1. Line Search
Motivation
The step $\Delta \mathbf{x}$ from Newton's method may overshoot the minimum. To fix this, check $f(\mathbf{x} + \Delta \mathbf{x})$ and "backtrack" until we get a "good" reduction in $f$.
(1) Armijo Rule
There are many strategies for this, but we will focus on the Armijo rule, which is simple and effective:
- Set $\alpha = 1$ (step length).
- While $f(\mathbf{x} + \alpha \Delta \mathbf{x}) > f(\mathbf{x}) + b \alpha \nabla f(\mathbf{x})^\top \Delta \mathbf{x}$: update $\alpha \leftarrow c \alpha$, where $c \in (0, 1)$.
- $b \in (0, 1)$ is a tolerance.
- $b \alpha \nabla f(\mathbf{x})^\top \Delta \mathbf{x}$ is the expected reduction predicted by the gradient.
(2) Intuition
- Ensure the step agrees with the linearization within some tolerance $b$.
- Typical values: $b = 10^{-4}$ to $10^{-1}$, $c = 1/2$.
(3) Example: Backtracking Regularized Newton Step
```julia
using LinearAlgebra  # for I and isposdef

# Assumes f, ∇f, and ∇²f are defined elsewhere.
function backtracking_regularized_newton_step(x0)
    b = 0.1    # Armijo tolerance
    c = 0.5    # backtracking factor
    β = 1.0    # regularization increment
    # Regularize the Hessian until it is positive definite
    H = ∇²f(x0)
    while !isposdef(H)
        H = H + β*I
    end
    Δx = -H \ ∇f(x0)
    # Backtracking line search (Armijo rule)
    α = 1.0
    while f(x0 + α*Δx) > f(x0) + b*α*∇f(x0)'*Δx
        α = c*α
    end
    xn = x0 + α*Δx
    return xn
end
```
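As a concrete illustration, here is a minimal Python sketch of the same step (the lecture's code above is Julia), applied to a test function of our own choosing, $f(x) = (x_1^2 - 1)^2 + x_2^2$, whose Hessian is indefinite near $x_1 = 0$, so the regularization loop actually runs:

```python
import numpy as np

def f(x):
    return (x[0]**2 - 1)**2 + x[1]**2

def grad_f(x):
    return np.array([4*x[0]*(x[0]**2 - 1), 2*x[1]])

def hess_f(x):
    return np.array([[12*x[0]**2 - 4, 0.0], [0.0, 2.0]])

def newton_step(x, b=0.1, c=0.5, beta=1.0):
    H = hess_f(x)
    while np.any(np.linalg.eigvalsh(H) <= 0):   # regularize until positive definite
        H = H + beta*np.eye(len(x))
    dx = -np.linalg.solve(H, grad_f(x))
    a = 1.0                                     # Armijo backtracking
    while f(x + a*dx) > f(x) + b*a*grad_f(x) @ dx:
        a = c*a
    return x + a*dx

x = np.array([0.1, 1.0])
for _ in range(20):
    x = newton_step(x)
print(x)   # converges to the local minimum near (1, 0)
```

Starting near the saddle at the origin, the iterates descend to the local minimum at $(1, 0)$; without the regularization loop, the raw Newton step would move toward the saddle.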
(4) Takeaway Message
Newton’s method, with simple and cheap modifications (globalization strategy), is extremely effective at finding local minima.
2. Equality Constraints
Given $f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}$, $\mathbf{c}(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^m$.
Problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{c}(\mathbf{x}) = \mathbf{0}$$
(1) First-Order Necessary Conditions:
- $\nabla f(\mathbf{x}) = 0$ in free directions.
- $\mathbf{c}(\mathbf{x}) = \mathbf{0}$.
Explanation:
- If any component of $\nabla f(\mathbf{x})$ is not normal to the constraint surface, we can reduce $f(\mathbf{x})$ by moving along the surface.
(2) Lagrange Multiplier
$$\nabla f(\mathbf{x}) + \lambda \nabla \mathbf{c}(\mathbf{x}) = \mathbf{0}, \quad \lambda \in \mathbb{R}^m$$
- $\nabla f(\mathbf{x})$ and $\nabla \mathbf{c}(\mathbf{x})$ are parallel.
(3) Lagrangian
$$L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x}).$$
- Turns constrained minimization into an unconstrained problem:
$$\min_{\mathbf{x}} L(\mathbf{x}, \lambda).$$
Gradients (KKT conditions):
$$\nabla_x L(\mathbf{x}, \lambda) = \nabla f(\mathbf{x}) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda = \mathbf{0}, \qquad \nabla_\lambda L(\mathbf{x}, \lambda) = \mathbf{c}(\mathbf{x}) = \mathbf{0}.$$
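As a quick sanity check of these conditions (a standard textbook example, not from the lecture), take $f(\mathbf{x}) = x_1^2 + x_2^2$ with the single constraint $c(\mathbf{x}) = x_1 + x_2 - 1 = 0$:

$$L(\mathbf{x}, \lambda) = x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1)$$

$$\nabla_x L = \begin{bmatrix} 2x_1 + \lambda \\ 2x_2 + \lambda \end{bmatrix} = \mathbf{0} \;\Rightarrow\; x_1 = x_2 = -\frac{\lambda}{2}, \qquad \nabla_\lambda L = x_1 + x_2 - 1 = 0 \;\Rightarrow\; \lambda = -1,\; x_1 = x_2 = \frac{1}{2}$$

The optimum lies where $\nabla f$ is normal to the constraint line, matching the geometric intuition above.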
To solve these stationarity equations, we again use Newton's method: treat $(\mathbf{x}, \lambda)$ as the unknowns (two block variables, not necessarily two-dimensional), take a first-order Taylor expansion of the gradient equations, and set it to zero, since we are solving for roots. This yields an iterative update for the stationary points of the Lagrangian $L(x, \lambda)$, exactly analogous to Newton's method for a single-variable equation.
$$\nabla_x L(x + \Delta x, \lambda + \Delta \lambda) \approx \nabla_x L(x, \lambda) + \frac{\partial^2 L}{\partial x^2} \Delta x + \frac{\partial^2 L}{\partial x \partial \lambda} \Delta \lambda = 0$$
where:
$$\frac{\partial^2 L}{\partial x \partial \lambda} = \frac{\partial^2 L}{\partial \lambda \partial x} = \frac{\partial}{\partial x} \left( \frac{\partial L}{\partial \lambda} \right) = \left( \frac{\partial C}{\partial x} \right)^T$$
For $\nabla_\lambda L$:
$$\nabla_{\lambda} L(x + \Delta x, \lambda + \Delta \lambda) = C(x) + \frac{\partial C}{\partial x} \Delta x = 0 \;\Rightarrow\; \frac{\partial C}{\partial x} \Delta x = -C(x)$$
Explanation of Symbols
- $L(x, \lambda)$: the Lagrangian, the combination of the objective function and the constraints, $L(x, \lambda) = f(x) + \lambda^T C(x)$, where $\lambda$ are the Lagrange multipliers.
- $\nabla_x L(x, \lambda)$: the gradient of $L(x, \lambda)$ with respect to $x$, representing the rate of change along the directions of $x$.
- $\nabla_x L(x + \alpha \Delta x, \lambda + \Delta \lambda)$: the gradient of $L$ at the updated point $(x + \alpha \Delta x, \lambda + \Delta \lambda)$, where $\alpha$ is the step size and $\Delta x, \Delta \lambda$ are the increments of $x$ and $\lambda$, respectively.
- $\frac{\partial^2 L}{\partial x^2}$: the second derivative of $L$ with respect to $x$, indicating the curvature along $x$.
- $\frac{\partial^2 L}{\partial x \partial \lambda}$: the mixed second derivative of $L$ with respect to $x$ and $\lambda$, representing interactions between the variables.
- $\Delta x, \Delta \lambda$: the updates for the optimization variables $x$ and the Lagrange multipliers $\lambda$.
Derivation 1: $\nabla_x L(x + \Delta x, \lambda + \Delta \lambda)$
1. Taylor Expansion
Using the first-order Taylor expansion of $L(x, \lambda)$, where $x$ and $\lambda$ undergo small changes $\Delta x$ and $\Delta \lambda$:
$$L(x + \Delta x, \lambda + \Delta \lambda) \approx L(x, \lambda) + \nabla_x L(x, \lambda) \cdot \Delta x + \nabla_\lambda L(x, \lambda) \cdot \Delta \lambda$$
For the gradient with respect to $x$, we expand $\nabla_x L(x + \Delta x, \lambda + \Delta \lambda)$ as follows:
$$\nabla_x L(x + \Delta x, \lambda + \Delta \lambda) \approx \nabla_x L(x, \lambda) + \frac{\partial^2 L}{\partial x^2} \Delta x + \frac{\partial^2 L}{\partial x \partial \lambda} \Delta \lambda$$
Here:
- $\frac{\partial^2 L}{\partial x^2} \Delta x$: the contribution from the second-order derivative with respect to $x$, describing how the gradient changes due to $x$.
- $\frac{\partial^2 L}{\partial x \partial \lambda} \Delta \lambda$: the contribution from the mixed second-order derivative, describing the interaction between the two variables.
Setting this equation to zero (since we are solving for roots of the gradient):
$$\nabla_x L(x + \Delta x, \lambda + \Delta \lambda) = 0 \;\Rightarrow\; \nabla_x L(x, \lambda) + \frac{\partial^2 L}{\partial x^2} \Delta x + \frac{\partial^2 L}{\partial x \partial \lambda} \Delta \lambda = 0$$
Derivation 2: $\frac{\partial^2 L}{\partial x \partial \lambda}$
By the symmetry of mixed second-order derivatives:
$$\frac{\partial^2 L}{\partial x \partial \lambda} = \frac{\partial^2 L}{\partial \lambda \partial x}$$
From the definition of the Lagrangian:
$$L(x, \lambda) = f(x) + \lambda^T C(x)$$
where $f(x)$ is the objective function and $C(x)$ is the constraint function. The derivative of $L$ with respect to $\lambda$ is:
$$\frac{\partial L}{\partial \lambda} = C(x)$$
Taking the derivative of $\frac{\partial L}{\partial \lambda}$ with respect to $x$:
$$\frac{\partial}{\partial x} \left( \frac{\partial L}{\partial \lambda} \right) = \frac{\partial}{\partial x} \left( C(x) \right)$$
Thus:
$$\frac{\partial^2 L}{\partial x \partial \lambda} = \left( \frac{\partial C}{\partial x} \right)^T$$
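This identity is easy to verify numerically. The Python sketch below (with a toy constraint of our own choosing) finite-differences $\partial L / \partial \lambda = C(x)$ with respect to $x$ and compares the result against the analytic Jacobian:

```python
import numpy as np

# Toy choice of ours:  C(x) = [x1*x2 - 1]
# Since dL/dlam = C(x), the mixed block d^2L/(dx dlam) is just (dC/dx)^T.
def C(x): return np.array([x[0]*x[1] - 1.0])

x0, eps = np.array([0.7, 1.3]), 1e-6
J = np.zeros((1, 2))                      # finite-difference dC/dx
for j in range(2):
    e = np.zeros(2); e[j] = eps
    J[:, j] = (C(x0 + e) - C(x0 - e)) / (2*eps)

dC_dx = np.array([[x0[1], x0[0]]])        # analytic Jacobian [x2, x1]
print(np.allclose(J, dC_dx, atol=1e-5))   # True
```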
Derivation 3: $\nabla_\lambda L(x + \Delta x, \lambda + \Delta \lambda)$
From the definition of the Lagrangian, $L(x, \lambda) = f(x) + \lambda^T C(x)$, the gradient with respect to $\lambda$ is:
$$\nabla_\lambda L(x, \lambda) = C(x)$$
Expanding $\nabla_\lambda L$ using the first-order Taylor expansion:
$$\nabla_\lambda L(x + \Delta x, \lambda + \Delta \lambda) \approx \nabla_\lambda L(x, \lambda) + \frac{\partial}{\partial \lambda} \left( \nabla_\lambda L(x, \lambda) \right) \Delta \lambda + \frac{\partial}{\partial x} \left( \nabla_\lambda L(x, \lambda) \right) \Delta x$$
Since $C(x)$ depends only on $x$ and not on $\lambda$, we have $\frac{\partial C(x)}{\partial \lambda} = 0$. Therefore:
$$\nabla_\lambda L(x + \Delta x, \lambda + \Delta \lambda) \approx C(x) + \frac{\partial C}{\partial x} \Delta x$$
Setting this equation to zero:
$$C(x) + \frac{\partial C}{\partial x} \Delta x = 0 \;\Rightarrow\; \frac{\partial C}{\partial x} \Delta x = -C(x)$$
Iterative Formula and the KKT System
Combining the above, we obtain the iterative formula for $x$ and $\lambda$:
$$\begin{bmatrix} \frac{\partial^2 L}{\partial x^2} & \left(\frac{\partial C}{\partial x}\right)^T \\ \frac{\partial C}{\partial x} & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} -\nabla_x L(x, \lambda) \\ -C(x) \end{bmatrix}$$
This matrix system is referred to as the KKT system (Karush-Kuhn-Tucker system). The top-left block represents the Hessian matrix of the Lagrangian function.
Special Solvers for the KKT System
Because this matrix has a special (sparse, symmetric) structure and arises so frequently, many specialized solvers exist that exploit it to solve the KKT system efficiently, as noted in lecture.
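To make the iteration concrete, here is a minimal Python sketch of Newton's method on the KKT system for a toy problem of our own choosing (quadratic objective, parabolic constraint); the block matrix assembled below is exactly the KKT matrix above:

```python
import numpy as np

# Toy problem (ours, not the lecture's):
#   min  f(x) = x1^2 + x2^2   s.t.  c(x) = x1^2 + x2 - 1 = 0
def grad_f(x): return 2*x
def hess_f(x): return 2*np.eye(2)
def c(x):      return np.array([x[0]**2 + x[1] - 1])
def dc(x):     return np.array([[2*x[0], 1.0]])        # constraint Jacobian (1x2)

x, lam = np.array([1.0, 0.0]), np.zeros(1)
for _ in range(20):
    # Full Newton: the Lagrangian Hessian block includes constraint curvature
    H = hess_f(x) + lam[0]*np.array([[2.0, 0.0], [0.0, 0.0]])
    K = np.block([[H, dc(x).T], [dc(x), np.zeros((1, 1))]])   # KKT matrix
    rhs = -np.concatenate([grad_f(x) + dc(x).T @ lam, c(x)])
    step = np.linalg.solve(K, rhs)
    x, lam = x + step[:2], lam + step[2:]
print(x, lam)   # stationary point: x = (1/sqrt(2), 1/2), lam = -1
```

Each iteration solves one linear system in $(\Delta x, \Delta \lambda)$ and applies the update; the iterates converge quadratically to a point satisfying both stationarity and the constraint.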
3. Gauss-Newton Method
(1) Basic Idea
In the KKT system, the block $\frac{\partial^2 L}{\partial \mathbf{x}^2}$ contains two terms: the second derivative of the objective function and a curvature term from the constraints. In practical problems, $f$ is an objective we get to design, so it can be made relatively well-behaved, while $\mathbf{c}$ encodes constraints that come from real physical systems and is often more complex and difficult to differentiate. Therefore, when iterating on the KKT system, we can consider discarding the constraint-curvature term of $\frac{\partial^2 L}{\partial \mathbf{x}^2}$. This is known as the Gauss-Newton method. It is equivalent to first linearizing the constraints and then solving for the extremum of the linearized system.
$$\frac{\partial^2 L}{\partial \mathbf{x}^2} = \nabla^2 f(\mathbf{x}) + \frac{\partial}{\partial \mathbf{x}}\left[\left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda\right] \approx \nabla^2 f(\mathbf{x}).$$
- The second term is expensive to compute, so it is dropped.
- Gauss-Newton method has slower convergence than Newton’s method but is computationally cheaper per iteration.
(2) Example: Comparison of Newton vs. Gauss-Newton [1]
Both Newton's method and the Gauss-Newton method are used to minimize the problem depicted in the figure below. The concentric circles are the contour lines of the objective function, and the yellow parabola is the constraint. Starting from a poorly chosen initial point $(-3, 2)$, Newton's method may fail to converge to the local minimum, while the Gauss-Newton method does not have this issue: by dropping the constraint-curvature term, it keeps the Hessian block positive definite, so each iteration moves in a descent direction. Although Newton's method theoretically converges faster near the minimum, in practice the Gauss-Newton method is often more stable.
-
Newton’s Method: Typically fewer iterations but higher cost per iteration.
-
Gauss-Newton Method: Cheaper per iteration but slower convergence.
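The lecture's exact figure is not reproduced here, but a problem of the same shape (circular contours, a parabolic constraint, both chosen by us) shows the behavior. This sketch runs the Gauss-Newton variant from the poorly chosen start $(-3, 2)$; it converges steadily even without the constraint-curvature term:

```python
import numpy as np

# Same toy problem shape as above (our own choice):
#   min  x1^2 + x2^2   s.t.  x1^2 + x2 - 1 = 0   (parabolic constraint)
def grad_f(x): return 2*x
def c(x):      return np.array([x[0]**2 + x[1] - 1])
def dc(x):     return np.array([[2*x[0], 1.0]])

x, lam = np.array([-3.0, 2.0]), np.zeros(1)        # the "bad" starting point
for _ in range(100):
    H = 2*np.eye(2)        # Gauss-Newton: keep only the objective Hessian
    K = np.block([[H, dc(x).T], [dc(x), np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f(x) + dc(x).T @ lam, c(x)])
    step = np.linalg.solve(K, rhs)
    x, lam = x + step[:2], lam + step[2:]
print(x, lam)   # converges to x = (-1/sqrt(2), 1/2), lam = -1
```

The approximate Hessian block stays positive definite at every iterate, so the iteration is stable; the price is linear rather than quadratic convergence near the solution.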
(3) Takeaway Message
May still need to regularize the Hessian $\frac{\partial^2 L}{\partial \mathbf{x}^2}$ even if $\nabla^2 f(\mathbf{x}) \succ 0$.
The Gauss-Newton method is often used in practice.
4. Inequality Constraints
Minimize $f(\mathbf{x})$ subject to $\mathbf{c}(\mathbf{x}) \leq \mathbf{0}$:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0}$$
We'll look at just inequality constraints for now; problems with both types of constraints are handled by combining this with the previous methods.
First-Order Necessary Conditions
i. Need $\nabla f(\mathbf{x}) = \mathbf{0}$ in free directions (same as for equality constraints).
ii. Need $\mathbf{c}(\mathbf{x}) \leq \mathbf{0}$.
(i) KKT Conditions
The Karush-Kuhn-Tucker (KKT) conditions are:
$$\nabla f(\mathbf{x}) + \left( \frac{\partial \mathbf{c}}{\partial \mathbf{x}} \right)^\top \lambda = 0 \quad (\text{stationarity})$$
$$\mathbf{c}(\mathbf{x}) \leq 0 \quad (\text{primal feasibility})$$
$$\lambda \geq 0 \quad (\text{dual feasibility})$$
$$\lambda \odot \mathbf{c}(\mathbf{x}) = \lambda^\top \mathbf{c}(\mathbf{x}) = 0 \quad (\text{complementary slackness})$$
for some $\lambda \in \mathbb{R}^m$.
(ii) Intuition of KKT Conditions
- If the constraint is active ($c(\mathbf{x}) = 0$), then $\lambda > 0$.
- If the constraint is inactive ($c(\mathbf{x}) < 0$), then $\lambda = 0$.
(iii) Takeaway Message
The complementary slackness condition is like a switch:
- If the constraint is active, the switch is on ($\lambda > 0$).
- If the constraint is inactive, the switch is off ($\lambda = 0$).
Additionally, there is an edge case when the unconstrained minimum of the objective happens to lie exactly on the constraint boundary, i.e., when $c(\mathbf{x}) = \lambda = 0$ simultaneously.
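The switch logic can be seen on a one-dimensional toy problem of our own: minimize $(x - a)^2$ subject to $x - 1 \leq 0$. Depending on where the unconstrained minimizer $a$ sits, the constraint is active or inactive, and the multiplier switches accordingly:

```python
# KKT "switch" behaviour on  min (x - a)^2  s.t.  x - 1 <= 0.
def kkt_point(a):
    if a <= 1.0:          # unconstrained minimizer feasible: switch off
        return a, 0.0
    else:                 # constraint active: switch on
        x = 1.0
        lam = 2*(a - x)   # from stationarity: 2(x - a) + lam = 0
        return x, lam

for a in (0.5, 2.0):
    x, lam = kkt_point(a)
    assert 2*(x - a) + lam == 0    # stationarity
    assert x - 1 <= 0              # primal feasibility
    assert lam >= 0                # dual feasibility
    assert lam * (x - 1) == 0      # complementary slackness
print(kkt_point(0.5), kkt_point(2.0))   # (0.5, 0.0) (1.0, 2.0)
```

For $a = 0.5$ the constraint is inactive and $\lambda = 0$; for $a = 2$ the solution sits on the boundary with $\lambda = 2 > 0$. The edge case $c(\mathbf{x}) = \lambda = 0$ occurs exactly at $a = 1$.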
As the KKT conditions show, the first-order necessary conditions for inequality constraints are mathematically complex. In particular, the complementary slackness condition introduces a strong nonlinearity, effectively an if-else switch, which makes finding the minimum directly from the KKT conditions quite complicated, so alternative methods must be sought. Moreover, the KKT conditions only characterize what holds at an optimum; they do not by themselves provide an iterative method for reaching it. How these cases are handled in practice depends on the specific solver being used.
References
[1] 【2024 Optimal Control and Reinforcement Learning 16-745】【Lecture 4】Optimization (Part 2): Line Search (Armijo Rule) and Constrained Minimization
[2] 【Optimal Control (CMU 16-745)】Lecture 3 Optimization Part 2
[3] Lecture 4: Constrained Optimization Problems
[4] CMU Optimal Control 16-745 Video From Bilibili
[5] CMU Optimal Control 16-745
Thank you for reading these study notes! They record what I have been learning; if anything is wrong or inaccurate, please feel free to contact me so I can correct it, and we can make progress together!