Advanced Optimization Theory and Methods (4)
Review of the Previous Lecture
Fixed Stepsize
$x^{k+1}=x^k-\alpha\nabla f(x^k)$
Steepest Descent
$x^{k+1}=x^k-\alpha^k \nabla f(x^k)$, where $\alpha^k=\arg\min_{\alpha\geq 0} f(x^k-\alpha \nabla f(x^k))$
Gradient Method
Analysis
Theorem 3
Thm: Fixed Stepsize: $x^k\rightarrow x^*$ for any $x^0 \Leftrightarrow 0<\alpha <\frac{2}{\lambda_{max}(Q)}$
Pf: "$\Leftarrow$": Rayleigh's inequality: $\lambda_{min}(Q){g^k}^Tg^k\leq {g^k}^TQg^k\leq \lambda_{max}(Q){g^k}^Tg^k$
${g^k}^TQ^{-1}g^k\leq \frac{1}{\lambda_{min}(Q)}{g^k}^Tg^k$
$\Rightarrow r^k\geq \alpha\frac{\lambda_{min}(Q){g^k}^Tg^k}{\lambda_{max}(Q^{-1}){g^k}^Tg^k}\left(2\frac{{g^k}^Tg^k}{\lambda_{max}(Q){g^k}^Tg^k}-\alpha\right)=\alpha \lambda_{min}^2(Q)\left(\frac{2}{\lambda_{max}(Q)}-\alpha\right)\geq C>0$
$\Rightarrow \sum_{k=0}^{\infty} r^k=\infty$
$\Rightarrow x^k \rightarrow x^*$
Pf: "$\Rightarrow$" (by contraposition): Assume $\alpha\leq 0$ or $\alpha\geq\frac{2}{\lambda_{max}(Q)}$
Choose $x^0$ such that $x^0-x^*$ is an eigenvector of $Q$ corresponding to $\lambda_{max}(Q)$
$x^{k+1}=x^k-\alpha(Qx^k-b)=x^k-\alpha(Qx^k-Qx^*)$
$\Rightarrow x^{k+1}-x^*=x^k-x^*-\alpha(Qx^k-Qx^*)=(I_n-\alpha Q)(x^k-x^*)=(I_n-\alpha Q)^{k+1}(x^0-x^*)=(I_n-\alpha Q)^k(x^0-x^*-\alpha Q(x^0-x^*))=(I_n-\alpha Q)^k(x^0-x^*-\alpha \lambda_{max}(Q)(x^0-x^*))=(1-\alpha \lambda_{max}(Q))(I_n-\alpha Q)^k(x^0-x^*)=\cdots=(1-\alpha \lambda_{max}(Q))^{k+1}(x^0-x^*)$
$||x^{k+1}-x^*||=|1-\alpha \lambda_{max}(Q)|^{k+1}||x^0-x^*||$
$\because \alpha\leq 0$ or $\alpha\geq\frac{2}{\lambda_{max}(Q)}$
$\therefore |1-\alpha \lambda_{max}(Q)|\geq 1$
$\therefore x^k$ does not converge to $x^*$
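The step-size threshold in the theorem can be checked numerically. The following is a minimal sketch, assuming a diagonal $Q$ with eigenvalues 1 and 4 and $b=0$ (so $x^*=0$ and $\frac{2}{\lambda_{max}(Q)}=0.5$); the function names and the test values of $\alpha$ are illustrative choices, not part of the lecture.

```python
def fixed_step_gradient(q_diag, x0, alpha, steps):
    """Run x^{k+1} = x^k - alpha * Q x^k for diagonal Q and b = 0 (so x* = 0)."""
    x = list(x0)
    for _ in range(steps):
        # For diagonal Q the gradient Q x is computed componentwise.
        x = [xi - alpha * qi * xi for xi, qi in zip(x, q_diag)]
    return x


def norm(v):
    """Euclidean norm of a vector given as a list."""
    return sum(vi * vi for vi in v) ** 0.5


q_diag = [1.0, 4.0]   # eigenvalues of Q; lambda_max = 4, so 2 / lambda_max = 0.5
x0 = [1.0, 1.0]

inside = fixed_step_gradient(q_diag, x0, alpha=0.45, steps=200)    # alpha < 0.5
outside = fixed_step_gradient(q_diag, x0, alpha=0.55, steps=200)   # alpha > 0.5

print(norm(inside))    # very small: the iterates approach x* = 0
print(norm(outside))   # very large: the iterates move away from x*
```

With $\alpha=0.55$ the component along the eigenvector of $\lambda_{max}(Q)$ is multiplied by $1-0.55\cdot 4=-1.2$ at every step, which is exactly the factor $1-\alpha\lambda_{max}(Q)$ used in the proof.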
Order of convergence
Def: Given $x^k\rightarrow x^*$, i.e. $\lim_{k\to\infty} ||x^k-x^*||=0$. The order of convergence is $p \in \mathbb{R}$ if $0<\lim_{k\to\infty} \frac{||x^{k+1}-x^*||}{||x^k-x^*||^p}=c<\infty$
The order is $\infty$ if, for all $p>0$: $\lim_{k\to\infty} \frac{||x^{k+1}-x^*||}{||x^k-x^*||^p}=0$
sublinear: $p=1$, $\lim_{k\to\infty} \frac{||x^{k+1}-x^*||}{||x^k-x^*||}=1$
linear: $p=1$, $\lim_{k\to\infty} \frac{||x^{k+1}-x^*||}{||x^k-x^*||}<1$
superlinear: $p>1$
Note: the case $p=2$ is called quadratic convergence.
Example 1
$x^k=\frac{1}{k}$
$x^k\to 0=x^*$
$\frac{\frac{1}{k+1}}{(\frac{1}{k})^p}=\frac{k^p}{k+1}$
$p<1$: $\lim_{k\to\infty} \frac{k^p}{k+1}=0$
$p=1$: $\lim_{k\to\infty}\frac{k}{k+1}=1$
Hence the order of convergence is 1, and the convergence is sublinear.
Example 2
$x^k=r^k$ ($0<r<1$)
$x^*=0$
$\frac{r^{k+1}}{(r^k)^p}=r^{k(1-p)+1}$
$p<1$: $\lim_{k\to\infty} r^{k(1-p)+1}=0$
$p=1$: $\lim_{k\to\infty} r^{k(1-p)+1}=r<1$
Hence the order of convergence is 1, and the convergence is linear.
Example 3
$x^k=r^{q^k}$, $q>1$, $0<r<1$
$x^*=0$
$\frac{r^{q^{k+1}}}{(r^{q^k})^p}=r^{q^{k+1}-pq^k}=r^{(q-p)q^k}$
$p<q$: $\lim_{k\to\infty}r^{(q-p)q^k}=0$
$p=q$: $\lim_{k\to\infty}r^{(q-p)q^k}=1$
Hence the order of convergence is $q>1$, and the convergence is superlinear.
Example 4
$x^k=1$ for all $k$
$x^k\to x^*=1$
$||x^{k+1}-1||=0$ for every $k$, so the ratio $\frac{||x^{k+1}-1||}{||x^k-1||^p}$ is taken to be 0 for all $p>0$, and the order is $p=\infty$
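The limits in Examples 1 to 3 can be observed numerically by evaluating the ratio $\frac{|x^{k+1}-x^*|}{|x^k-x^*|^p}$ along each sequence. This is a small sketch; the helper name `ratios` and the concrete values $r=0.5$, $q=2$ are assumptions made for illustration.

```python
def ratios(seq, p, limit=0.0):
    """Return |x^{k+1} - x*| / |x^k - x*|^p for consecutive terms of seq."""
    return [abs(b - limit) / abs(a - limit) ** p for a, b in zip(seq, seq[1:])]


harmonic = [1.0 / k for k in range(1, 2001)]          # Example 1: x^k = 1/k
geometric = [0.5 ** k for k in range(1, 41)]          # Example 2 with r = 0.5
doubly_exp = [0.5 ** (2 ** k) for k in range(0, 7)]   # Example 3 with r = 0.5, q = 2

print(ratios(harmonic, p=1)[-1])     # close to 1: sublinear
print(ratios(geometric, p=1)[-1])    # equals r = 0.5: linear
print(ratios(doubly_exp, p=2)[-1])   # equals 1: order q = 2
```

Trying other values of `p` in the last call shows the pattern from the definition: for `p` below 2 the ratios shrink toward 0, and for `p` above 2 they blow up.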
Theorem
Notation: $||x^{k+1}-x^*||=O(||x^k-x^*||^p)$ means:
for large $k$, $\exists c \in \mathbb{R}$: $||x^{k+1}-x^*||\leq c||x^{k}-x^*||^p$
Thm: Let $x^k\to x^*$. If $||x^{k+1}-x^*||=O(||x^k-x^*||^p)$, then the order of convergence is at least $p$.
Pf: For large $k$, $\exists c$: $\frac{||x^{k+1}-x^*||}{||x^{k}-x^*||^p}\leq c$
$\frac{||x^{k+1}-x^*||}{||x^{k}-x^*||^s}=\frac{||x^{k+1}-x^*||}{||x^{k}-x^*||^p}||x^{k}-x^*||^{p-s}\leq c||x^{k}-x^*||^{p-s}$
If $s$ is the order of convergence, $\lim_{k\to\infty} \frac{||x^{k+1}-x^*||}{||x^k-x^*||^s}>0$
$\Rightarrow c\lim_{k\to\infty} ||x^k-x^*||^{p-s}>0$
$\because \lim_{k\to\infty} ||x^k-x^*||=0$
$\therefore$ if $p>s$, $c\lim_{k\to\infty} ||x^k-x^*||^{p-s}=0$, a contradiction
$\Rightarrow s \geq p$
Theorem
Thm: Steepest Descent: in the worst case the order of convergence is 1, i.e. there exist $f$ and $x^0$ for which the order is exactly 1.
Pf: Take a quadratic $f$ with $Q$: $\lambda_{max}(Q)>\lambda_{min}(Q)>0$
Suffices to prove: $\exists c>0$: $||x^{k+1}-x^*||\geq c||x^k-x^*||$ (then for any $p>1$, $\frac{||x^{k+1}-x^*||}{||x^k-x^*||^p}\geq c||x^k-x^*||^{1-p}\to\infty$, so the order cannot exceed 1)
$\because V(x^{k+1})=\frac{1}{2}(x^{k+1}-x^*)^TQ(x^{k+1}-x^*)\leq\frac{\lambda_{max}(Q)}{2}||x^{k+1}-x^*||^2$
$V(x^k)\geq\frac{\lambda_{min}(Q)}{2}||x^k-x^*||^2$
Combining these with $V(x^{k+1})=(1-r^k)V(x^k)$: $||x^{k+1}-x^*||\geq \sqrt{(1-r^k)\frac{\lambda_{min}(Q)}{\lambda_{max}(Q)}}\,||x^k-x^*||$
It remains to prove $r^k<1$. This uses: $g^k$ is not an eigenvector of $Q \Leftrightarrow r^k<1$
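The lower bound $||x^{k+1}-x^*||\geq c||x^k-x^*||$ can be seen on a small quadratic. The sketch below assumes a diagonal $Q$ with eigenvalues 1 and 4, $b=0$ (so $x^*=0$), and the starting point $x^0=[4,1]^T$; for a quadratic the exact line search has the closed form $\alpha^k=\frac{{g^k}^Tg^k}{{g^k}^TQg^k}$. The function name and test problem are illustrative choices.

```python
def steepest_descent_errors(q_diag, x0, steps):
    """Exact-line-search steepest descent on f(x) = 0.5 x^T Q x with diagonal Q.

    Returns the error norms ||x^k - x*|| for k = 0..steps (here x* = 0).
    """
    x = list(x0)
    errors = [sum(xi * xi for xi in x) ** 0.5]
    for _ in range(steps):
        g = [qi * xi for qi, xi in zip(q_diag, x)]             # gradient Q x
        gg = sum(gi * gi for gi in g)                          # g^T g
        gqg = sum(qi * gi * gi for qi, gi in zip(q_diag, g))   # g^T Q g
        alpha = gg / gqg                                       # exact minimizer along -g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        errors.append(sum(xi * xi for xi in x) ** 0.5)
    return errors


errors = steepest_descent_errors([1.0, 4.0], [4.0, 1.0], steps=20)
step_ratios = [b / a for a, b in zip(errors, errors[1:])]
print(step_ratios[:5])   # each ratio stays near 0.6 for this starting point
```

For this starting point the gradient is never an eigenvector of $Q$, and the error shrinks by the same factor $\frac{4-1}{4+1}=0.6$ at every step, so the ratio with $p=1$ has a positive finite limit and the order is exactly 1.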
Newton's Method
$f\in C^2$
$x^*$ satisfies the FONC $\Rightarrow \nabla f(x^*)=0$
One variable: $x^{k+1}=x^k-\frac{f'(x^k)}{f''(x^k)}$ $\Rightarrow$ several variables: $x^{k+1}=x^k-F^{-1}(x^k)\nabla f(x^k)$, where $F(x^k)$ is the Hessian of $f$ at $x^k$
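Pros and Cons
Advantages
Pro: simple; high order of convergence
Disadvantages
Con: $F(x^k)$ may fail to be positive definite (e.g. $F(x^k)<0$);
even if $F(x^k)>0$, the iteration may not be a descent method, i.e. $f(x^{k+1})<f(x^k)$ is not guaranteed;
$F^{-1}(x^k)$ has to be computed.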
Convergence Order
Example
$f(x)=\frac{1}{2}x^TQx-b^Tx$
$\nabla f(x)=Qx-b$
$F(x)=Q$
$x^1=x^0-Q^{-1}(Qx^0-b)=x^0-x^0+Q^{-1}b=Q^{-1}b=x^*$
Theorem
Thm: $f\in C^3$, $x^*$: $\nabla f(x^*)=0$ and $F(x^*)$ invertible. Then, for all $x^0$ sufficiently close to $x^*$, $x^k$ converges to $x^*$ with order at least 2.
Pf: To prove: $||x^{k+1}-x^*||=O(||x^k-x^*||^2)$
$||x^{1}-x^*||=||x^0-x^*-F^{-1}(x^0)\nabla f(x^0)||=||F^{-1}(x^0)(F(x^0)(x^0-x^*)-\nabla f(x^0))||\leq ||F^{-1}(x^0)||\cdot||F(x^0)(x^0-x^*)-\nabla f(x^0)||$
$\because F(x^*)$ invertible, $f\in C^3$, $x^0$ sufficiently close to $x^*$, $||F^{-1}(x^*)||$ constant $\Rightarrow ||F^{-1}(x^0)||<c_2$ for some $c_2\in \mathbb{R}$
Taylor expansion of $\nabla f$ about $x^0$: $\nabla f(x)-\nabla f(x^0)=F(x^0)(x-x^0)+O(||x-x^0||^2)$
$\forall x, x^0$ with $||x-x^*||<\epsilon$, $||x^0-x^*||<\epsilon$: $||\nabla f(x)-\nabla f(x^0)-F(x^0)(x-x^0)||\leq c_1 ||x-x^0||^2$ for some $c_1\in \mathbb{R}$
Taking $x=x^*$, with $x^0\in \{x:||x-x^*||<\epsilon\}$: $||\nabla f(x^*)-\nabla f(x^0)-F(x^0)(x^*-x^0)||\leq c_1 ||x^*-x^0||^2$
Since $\nabla f(x^*)=0$: $||F(x^0)(x^0-x^*)-\nabla f(x^0)||\leq c_1||x^*-x^0||^2$
$||x^1-x^*||\leq c_1c_2||x^0-x^*||^2$
By induction: $||x^{k+1}-x^*||\leq c_1c_2||x^k-x^*||^2$
Let $0<\alpha<1$ and choose $x^0$ with $||x^0-x^*||\leq\frac{\alpha}{c_1c_2}$. Then $||x^{k+1}-x^*||\leq \alpha||x^k-x^*||$ for all $k$,
so $||x^k-x^*||\leq \alpha^k ||x^0-x^*||\to 0$ as $k\to\infty$
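The bound $||x^{k+1}-x^*||\leq c_1c_2||x^k-x^*||^2$ can be watched on a one-variable example. The sketch below assumes the test function $f(x)=x-\ln x$ on $x>0$ (a hypothetical choice, minimized at $x^*=1$), for which the Newton update simplifies to $x^{k+1}=2x^k-(x^k)^2$ and therefore $x^{k+1}-1=-(x^k-1)^2$.

```python
def newton_1d(df, d2f, x0, steps):
    """Iterate x^{k+1} = x^k - f'(x^k) / f''(x^k) and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - df(x) / d2f(x))
    return xs


# Hypothetical test function f(x) = x - ln(x) on x > 0, minimized at x* = 1.
def df(x):
    return 1.0 - 1.0 / x      # f'(x)


def d2f(x):
    return 1.0 / x ** 2       # f''(x)


xs = newton_1d(df, d2f, x0=1.5, steps=5)
errors = [abs(x - 1.0) for x in xs]
print(errors)   # each error is approximately the square of the previous one
```

Starting from an error of 0.5, the error is squared at every step, which is the behaviour the theorem describes with $c_1c_2=1$ for this particular function.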
Theorem
Thm: $x^k$: sequence generated by Newton's method. If $F(x^k)>0$ and $\nabla f(x^k)\neq0$, then for $d^k=-F^{-1}(x^k)\nabla f(x^k)$ there exists an $\overline{\alpha}>0$ s.t. $\forall \alpha\in (0,\overline{\alpha}): f(x^k+\alpha d^k)<f(x^k)$.
Pf: Let $\phi(\alpha)=f(x^k+\alpha d^k)$
$\phi'(\alpha)=\nabla f(x^k+\alpha d^k)^Td^k$
$\phi'(0)=\nabla f(x^k)^Td^k=-\nabla f(x^k)^TF^{-1}(x^k)\nabla f(x^k)$
$\because F(x^k)>0$ (hence $F^{-1}(x^k)>0$), $\nabla f(x^k)\neq 0$
$\therefore \phi'(0)<0$
$\therefore \exists \overline{\alpha}>0$ s.t. $\forall \alpha\in (0,\overline{\alpha}): \phi(\alpha)<\phi(0)$
$\forall \alpha\in (0,\overline{\alpha}): f(x^k+\alpha d^k)<f(x^k)$
Modification
$x^{k+1}=x^k-\alpha^k F^{-1}(x^k)\nabla f(x^k)$, where $\alpha^k=\arg\min_{\alpha\geq 0} f(x^k-\alpha F^{-1}(x^k)\nabla f(x^k))$
If $F(x^k)$ is not positive definite:
Let $\lambda_1,\lambda_2,\cdots,\lambda_n$ be the eigenvalues of $F(x^k)$, with corresponding eigenvectors $v_1,v_2,\cdots,v_n$
Consider $G=F(x^k)+\mu I_n$, $\mu>0$
$Gv_i=(F(x^k)+\mu I_n)v_i=F(x^k)v_i+\mu v_i=\lambda_i v_i+\mu v_i=(\lambda_i+\mu)v_i\Rightarrow v_i$ is an eigenvector of $G$ corresponding to $\lambda_i+\mu$
Choose $\mu$ large enough s.t. $\lambda_i+\mu>0$ $\forall i\Rightarrow G$ positive definite
Modification: $x^{k+1}=x^k-\alpha^k(F(x^k)+\mu I_n)^{-1}\nabla f(x^k)$
$\alpha^k=\arg\min_{\alpha\geq 0} f(x^k-\alpha(F(x^k)+\mu I_n)^{-1}\nabla f(x^k))$
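The effect of the shift $\mu I_n$ can be illustrated in two variables. The sketch below compares the plain Newton direction with the shifted one at a point where the Hessian is indefinite. The test function $f(x,y)=\frac{x^4}{4}-\frac{x^2}{2}+\frac{y^2}{2}$, the helper names, and the rule $\mu=\max(0,\text{margin}-\lambda_{min})$ are all assumptions made for illustration; the notes only require $\mu$ to be large enough. The line search for $\alpha^k$ is omitted, and only the sign of $\phi'(0)=\nabla f(x^k)^Td^k$ is checked.

```python
import math


def min_eig_sym2(h):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = h
    return 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b * b)


def solve2(m, rhs):
    """Solve the 2x2 linear system m z = rhs by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def modified_newton_direction(grad, hess, margin=1.0):
    """Return (d, mu) with d = -(F + mu I)^{-1} grad and F + mu I positive definite.

    `margin` is a hypothetical tuning parameter: the shifted matrix has
    smallest eigenvalue at least `margin`.
    """
    mu = max(0.0, margin - min_eig_sym2(hess))
    shifted = [[hess[0][0] + mu, hess[0][1]], [hess[1][0], hess[1][1] + mu]]
    return solve2(shifted, [-grad[0], -grad[1]]), mu


def slope(grad, d):
    """phi'(0) = grad^T d; a negative value means d is a descent direction."""
    return grad[0] * d[0] + grad[1] * d[1]


# Hypothetical test function f(x, y) = x^4/4 - x^2/2 + y^2/2 at the point (0.1, 0).
x, y = 0.1, 0.0
grad = [x ** 3 - x, y]
hess = [[3 * x ** 2 - 1.0, 0.0], [0.0, 1.0]]   # indefinite: eigenvalues -0.97 and 1

newton_d = solve2(hess, [-grad[0], -grad[1]])
modified_d, mu = modified_newton_direction(grad, hess)

print(slope(grad, newton_d))     # positive: plain Newton direction points uphill here
print(slope(grad, modified_d))   # negative: shifted direction is a descent direction
```

At this point the plain Newton step heads toward the local maximizer at $x=0$, while the shifted matrix $F(x^k)+\mu I_n$ is positive definite and yields a descent direction, as the preceding theorem guarantees.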
Conjugate Direction Method
Def: $Q$: symmetric matrix in $\mathbb{R}^{n\times n}$. $d_0,\cdots,d_m$ are Q-conjugate if $\forall i\neq j$: $d_i^TQd_j=0$
Orthogonality is the special case $Q=I_n$: $x^Ty=x^T I_n y=0$
Lemma
Lem: $Q$ symmetric, positive definite. $d_0,\cdots,d_k$: non-zero, Q-conjugate. Then $d_0,\cdots,d_k$ are linearly independent.
Pf: Suppose $a_0d_0+\cdots+a_kd_k=0$. Then for each $j$: $d_j^TQ(a_0d_0+\cdots+a_kd_k)=0$
By Q-conjugacy this reduces to $a_jd_j^TQd_j=0$
$\because Q>0, d_j\neq0 \Rightarrow d_j^TQd_j>0$
$\therefore a_j=0$
Conjugate Direction Algorithm
Input: $f(x)=\frac{1}{2}x^TQx-b^Tx$, $x^0$, $d_0,\cdots,d_{n-1}$: Q-conjugate
For $k=0,\cdots,n-1$:
$g^k=\nabla f(x^k)=Qx^k-b$
$\alpha^k=-\frac{{g^k}^Td_k}{d_k^TQd_k}$
$x^{k+1}=x^k+\alpha^kd_k$
Thm: For any $x^0$, the CDA converges to $x^*$ in $n$ steps.
Pf: $\because d_0,\cdots,d_{n-1}$ Q-conjugate
$\therefore d_0,\cdots,d_{n-1}$ linearly independent
$\Rightarrow \exists \beta_0,\cdots,\beta_{n-1}: x^*-x^0=\beta_0d_0+\cdots+\beta_{n-1}d_{n-1}$
$\Rightarrow d_k^TQ(x^*-x^0)=\beta_kd_k^TQd_k$
$\Rightarrow \beta_k=\frac{d_k^TQ(x^*-x^0)}{d_k^TQd_k}=-\frac{d_k^Tg^k}{d_k^TQd_k}=\alpha^k$
The middle equality holds because $x^k=x^0+\alpha^0d_0+\cdots+\alpha^{k-1}d_{k-1}$, so $d_k^TQ(x^k-x^0)=0$ and
$d_k^TQ(x^*-x^0)=d_k^TQ(x^*-x^k+x^k-x^0)=d_k^TQ(x^*-x^k)=d_k^T(Qx^*-Qx^k)=d_k^T(b-Qx^k)=-d_k^Tg^k$
Hence $x^n=x^0+\alpha^0d_0+\cdots+\alpha^{n-1}d_{n-1}=x^0+(x^*-x^0)=x^*$
Example
$Q=\begin{bmatrix} 3&0&1 \\ 0&4&2\\ 1&2&3 \end{bmatrix}$
Compute $d_0,d_1,d_2$
$d_0=\begin{bmatrix} 1 \\ 0\\ 0 \end{bmatrix}$
$d_0^TQd_1=[1,0,0]\begin{bmatrix} 3&0&1 \\ 0&4&2\\ 1&2&3 \end{bmatrix}\begin{bmatrix} d_1^1 \\ d_1^2\\ d_1^3 \end{bmatrix}=3d_1^1+d_1^3=0$
$d_1=\begin{bmatrix} 1 \\ 0\\ -3 \end{bmatrix}$
Note: this shows how to obtain the Q-conjugate vectors $d_0$ and $d_1$. First pick a simple $d_0$, then substitute it into $d_0^TQd_1=0$ to obtain the condition $3d_1^1+d_1^3=0$ on $d_1$, and finally choose $d_1=[1,0,-3]^T$.
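The example asks for $d_0,d_1,d_2$ but works out only $d_0$ and $d_1$. One way to obtain a third direction, sketched below, is Gram-Schmidt orthogonalization in the $Q$-inner product: take a trial vector and subtract its components along $d_0$ and $d_1$. This is a swapped-in technique, not necessarily the one intended in the lecture, and the trial vector $[0,1,0]^T$ and helper names are arbitrary choices.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Standard inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def q_conjugate_next(q, directions, trial):
    """Remove from `trial` its Q-components along the given Q-conjugate directions."""
    d = list(trial)
    for prev in directions:
        q_prev = matvec(q, prev)
        coeff = dot(trial, q_prev) / dot(prev, q_prev)   # (trial^T Q prev) / (prev^T Q prev)
        d = [di - coeff * pi for di, pi in zip(d, prev)]
    return d


Q = [[3.0, 0.0, 1.0], [0.0, 4.0, 2.0], [1.0, 2.0, 3.0]]
d0 = [1.0, 0.0, 0.0]
d1 = [1.0, 0.0, -3.0]
d2 = q_conjugate_next(Q, [d0, d1], trial=[0.0, 1.0, 0.0])   # trial vector is an arbitrary choice

print(d2)                      # [0.25, 1.0, -0.75]
print(dot(d0, matvec(Q, d2)))  # 0.0, so d0 and d2 are Q-conjugate
print(dot(d1, matvec(Q, d2)))  # 0.0, so d1 and d2 are Q-conjugate
```

Any non-zero multiple works equally well, for instance $d_2=[1,4,-3]^T$. A different trial vector gives a different, equally valid $d_2$, since Q-conjugate sets are not unique.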
$f(x)=\frac{1}{2}x^T\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-[-1,1]x$
$g(x)=\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-\begin{bmatrix} -1 \\ 1 \end{bmatrix}$
$x^0=\begin{bmatrix} 0 \\ 0 \end{bmatrix}$
$d_0=\begin{bmatrix} 1 \\ 0 \end{bmatrix}$
$d_1=\begin{bmatrix} -\frac{3}{8} \\ \frac{3}{4} \end{bmatrix}$
$g^0=\begin{bmatrix} 1 \\ -1 \end{bmatrix}$
$\alpha^0=\frac{-{g^0}^Td_0}{d_0^TQd_0}=\frac{-[1,-1]\begin{bmatrix} 1 \\ 0 \end{bmatrix}}{[1,0]\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix}}=-\frac{1}{4}$
$x^1=x^0+\alpha^0d_0=\begin{bmatrix} -\frac{1}{4} \\ 0 \end{bmatrix}$
$g^1=Qx^1-b=\begin{bmatrix} 0 \\ -\frac{3}{2} \end{bmatrix}$
$\alpha^1=\frac{-{g^1}^Td_1}{d_1^TQd_1}=\frac{9/8}{9/16}=2$
$x^2=x^1+\alpha^1d_1=\begin{bmatrix} -1 \\ \frac{3}{2} \end{bmatrix}$
$g^2=\nabla f(x^2)=Qx^2-b=0$, so $x^2=x^*$
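Note: for a quadratic function of $n$ variables (with $Q$ positive definite), the minimizer is reached after at most $n$ iterations.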
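The two-variable example can be replayed in exact arithmetic. This is a minimal sketch of the conjugate direction algorithm using Python's `fractions.Fraction` so that the iterates come out as exact rationals; the function name `conjugate_direction` is an illustrative choice.

```python
from fractions import Fraction as Fr


def conjugate_direction(q, b, x0, directions):
    """Run x^{k+1} = x^k + alpha^k d_k with alpha^k = -(g^k)^T d_k / (d_k^T Q d_k)."""
    x = list(x0)
    iterates = [x]
    for d in directions:
        # Gradient g^k = Q x^k - b.
        g = [sum(qij * xj for qij, xj in zip(row, x)) - bi for row, bi in zip(q, b)]
        qd = [sum(qij * dj for qij, dj in zip(row, d)) for row in q]
        alpha = -sum(gi * di for gi, di in zip(g, d)) / sum(di * qdi for di, qdi in zip(d, qd))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        iterates.append(x)
    return iterates


Q = [[Fr(4), Fr(2)], [Fr(2), Fr(2)]]
b = [Fr(-1), Fr(1)]
d0 = [Fr(1), Fr(0)]
d1 = [Fr(-3, 8), Fr(3, 4)]

x0, x1, x2 = conjugate_direction(Q, b, [Fr(0), Fr(0)], [d0, d1])
g2 = [sum(qij * xj for qij, xj in zip(row, x2)) - bi for row, bi in zip(Q, b)]

print(x1)   # [Fraction(-1, 4), Fraction(0, 1)]
print(x2)   # [Fraction(-1, 1), Fraction(3, 2)]
print(g2)   # [Fraction(0, 1), Fraction(0, 1)]: the gradient vanishes after n = 2 steps
```

The gradient is exactly zero after two steps, in agreement with the theorem that the algorithm reaches $x^*$ in $n$ steps.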
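Summary
This lecture first continued the gradient methods from the previous lecture with some theoretical analysis. It then introduced the order of convergence, which makes it possible to compare how fast different methods converge. Finally, it introduced Newton's method and the conjugate direction method.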