Advanced Optimization Theory and Methods (7)

Solving Linear Equations

Case 2

$Ax=b,\ A\in \mathbb{R}^{m\times n},\ m\leq n,\ \operatorname{rank}A=m,\ x\in \mathbb{R}^n,\ b\in \mathbb{R}^m$

$\Rightarrow$ infinitely many solutions

$\Rightarrow \min \|x\|$
s.t. $Ax=b$

Note: In this case, since there are infinitely many solutions, $Ax=b$ can be treated as the constraint of an optimization problem, so the problem becomes a constrained optimization problem.

Theorem

Thm: The unique solution $x^*$ of $Ax=b$ that minimizes $\|x\|$ is given by $x^*=A^T(AA^T)^{-1}b$.
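One way to see where this formula comes from (a sketch added here via Lagrange multipliers; the lecture's own proof is not reproduced): minimizing $\frac{1}{2}\|x\|^2$ subject to $Ax=b$ gives

$$\nabla_x\Big(\tfrac{1}{2}\|x\|^2-\lambda^T(Ax-b)\Big)=x-A^T\lambda=0\ \Rightarrow\ x^*=A^T\lambda,\quad AA^T\lambda=b\ \Rightarrow\ x^*=A^T(AA^T)^{-1}b,$$

where $AA^T$ is invertible because $\operatorname{rank}A=m$.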

Kaczmarz’s Algorithm

To avoid computing a matrix inverse explicitly, we introduce Kaczmarz's algorithm.

  1. Set $i=0$ and initialize $x^0$
  2. For $j=1,\cdots,m$ do
    $x^{im+j}=x^{im+j-1}+\mu\left(b_j-a_j^Tx^{im+j-1}\right)\dfrac{a_j}{a_j^Ta_j}$
  3. $i\leftarrow i+1$, go to step 2

Note: $0<\mu<2$, and $a_j^T$ denotes the $j$-th row of $A$ (so $a_j$ is a column vector). Since constrained optimization problems are more involved, this course does not study the convergence rate of algorithms for constrained problems. A NumPy sketch of the iteration is given below.
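A minimal NumPy sketch of the iteration above (the function name `kaczmarz`, the zero default initial point, and stopping after a fixed number of sweeps are my own choices, not from the lecture):

```python
import numpy as np

def kaczmarz(A, b, mu=1.0, sweeps=100, x0=None):
    """Kaczmarz iteration for Ax = b with full row rank A and 0 < mu < 2."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(sweeps):            # step 3: repeat the sweep
        for j in range(m):             # step 2: cycle through the rows a_j^T
            a = A[j]                   # j-th row of A
            x = x + mu * (b[j] - a @ x) * a / (a @ a)
    return x
```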

Theorem

In Kaczmarz's Algorithm, if $x^0=0$, then $x^k\to x^*=A^T(AA^T)^{-1}b$ as $k\to \infty$.

Example

$A=\begin{bmatrix} 1&-1 \\ 0&1 \end{bmatrix}$

$b=\begin{bmatrix} 2 \\ 3 \end{bmatrix}$

$\mu=1,\quad x^0=\begin{bmatrix} 0 \\ 0 \end{bmatrix}$

$a_1=\begin{bmatrix} 1 \\ -1 \end{bmatrix}$

$a_2=\begin{bmatrix} 0 \\ 1 \end{bmatrix}$

$b_1=2,\quad b_2=3$

$i=0,\ j=1:\quad x^1=\begin{bmatrix} 0 \\ 0 \end{bmatrix}+\left(2-[1,-1]\begin{bmatrix} 0 \\ 0 \end{bmatrix}\right)\dfrac{\begin{bmatrix} 1 \\ -1 \end{bmatrix}}{[1,-1]\begin{bmatrix} 1 \\ -1 \end{bmatrix}}=\begin{bmatrix} 1 \\ -1 \end{bmatrix}$

$i=0,\ j=2:\quad x^2=\begin{bmatrix} 1 \\ -1 \end{bmatrix}+\left(3-[0,1]\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right)\dfrac{\begin{bmatrix} 0 \\ 1 \end{bmatrix}}{[0,1]\begin{bmatrix} 0 \\ 1 \end{bmatrix}}=\begin{bmatrix} 1 \\ 3 \end{bmatrix}$

$i=1,\ j=1:\quad x^3=\begin{bmatrix} 1 \\ 3 \end{bmatrix}+\left(2-[1,-1]\begin{bmatrix} 1 \\ 3 \end{bmatrix}\right)\dfrac{\begin{bmatrix} 1 \\ -1 \end{bmatrix}}{[1,-1]\begin{bmatrix} 1 \\ -1 \end{bmatrix}}=\begin{bmatrix} 3 \\ 1 \end{bmatrix}$

$\cdots$

$x^*=\begin{bmatrix} 5 \\ 3 \end{bmatrix}$
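Running the sketch above on this example (hypothetical usage of the `kaczmarz` helper defined earlier) reproduces this limit:

```python
A = np.array([[1., -1.],
              [0.,  1.]])
b = np.array([2., 3.])
print(kaczmarz(A, b, mu=1.0, sweeps=50))   # approaches [5., 3.]
```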

Pseudoinverse

Definition

$A^+\in\mathbb{R}^{n\times m}$ is a pseudoinverse of $A\in\mathbb{R}^{m\times n}$ if $AA^+A=A$ and $\exists\, U\in \mathbb{R}^{n\times n},\ V\in \mathbb{R}^{m\times m}$ s.t. $A^+=UA^T$ and $A^+=A^TV$.
Note: The pseudoinverse is a generalization of the matrix inverse.

Special Case 1

$m\geq n,\ \operatorname{rank}A=n$

$A^+=(A^TA)^{-1}A^T\ \Rightarrow\ AA^+A=A$

$U=(A^TA)^{-1},\quad V=A(A^TA)^{-1}(A^TA)^{-1}A^T$

$A^+=UA^T,\quad A^+=A^TV$

Special Case 2

$m\leq n,\ \operatorname{rank}A=m$

$A^+=A^T(AA^T)^{-1}\ \Rightarrow\ AA^+A=A$

$U=A^T(AA^T)^{-1}(AA^T)^{-1}A,\quad V=(AA^T)^{-1}$

$A^+=UA^T,\quad A^+=A^TV$
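As a quick NumPy check (my own illustration, not part of the notes), both special-case formulas agree with the Moore–Penrose pseudoinverse computed by `np.linalg.pinv` on randomly generated full-rank matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Special Case 1: m >= n, full column rank
A1 = rng.standard_normal((5, 3))
P1 = np.linalg.inv(A1.T @ A1) @ A1.T            # (A^T A)^{-1} A^T
print(np.allclose(P1, np.linalg.pinv(A1)))      # True

# Special Case 2: m <= n, full row rank
A2 = rng.standard_normal((3, 5))
P2 = A2.T @ np.linalg.inv(A2 @ A2.T)            # A^T (A A^T)^{-1}
print(np.allclose(P2, np.linalg.pinv(A2)))      # True
```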

Properties of Pseudoinverse

Lemma 1: Unique pseudoinverse

pf: Assume $A_1^+$ and $A_2^+$ are both pseudoinverses of $A$:
$AA_1^+A=AA_2^+A=A$
and there exist $U_1,U_2\in \mathbb{R}^{n\times n},\ V_1,V_2\in \mathbb{R}^{m\times m}$ with

$A_1^+=U_1A^T=A^TV_1,\quad A_2^+=U_2A^T=A^TV_2$

Let $D=A_2^+-A_1^+,\ U=U_2-U_1,\ V=V_2-V_1$

Then $ADA=O,\quad D=UA^T=A^TV$

$\Rightarrow (DA)^TDA=A^TD^TDA=A^TV^TADA=O$

$\Rightarrow DA=O$

$\Rightarrow DD^T=DAU^T=O\ \Rightarrow\ D=O$

Lemma 2: Full Rank Factorization

Let $A\in \mathbb{R}^{m\times n}$ with $\operatorname{rank}A=r\leq \min(m,n)$. Then there exist $B\in\mathbb{R}^{m\times r}$ and $C\in\mathbb{R}^{r\times n}$ with $\operatorname{rank}B=\operatorname{rank}C=r$ and $A=BC$.
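One way to compute such a factorization numerically (my own illustration using the truncated SVD; the lecture does not prescribe a particular construction):

```python
import numpy as np

def full_rank_factorization(A, tol=1e-10):
    """Return B (m x r) and C (r x n) with A = B C and rank B = rank C = r."""
    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > tol))       # numerical rank
    B = U[:, :r] * s[:r]           # m x r, full column rank
    C = Vt[:r, :]                  # r x n, full row rank
    return B, C
```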

Lemma 3

Let $A\in \mathbb{R}^{m\times n}$ have a full rank factorization $A=BC$. Then $A^+=C^+B^+$.

Example

$A=\begin{bmatrix} 2&1&-2&5 \\ 1&0&-3&2 \\ 3&-1&-13&5 \end{bmatrix}$

$\operatorname{rank}A=2$

$B=\begin{bmatrix} 2&1 \\ 1&0 \\ 3&-1 \end{bmatrix}$

$C=\begin{bmatrix} 1&0&-3&2 \\ 0&1&4&1 \end{bmatrix}$

$A=BC,\quad A^+=C^+B^+$
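Verifying this example numerically (my own check; `np.linalg.pinv` returns the Moore–Penrose pseudoinverse, which by Lemma 1 is the unique pseudoinverse):

```python
import numpy as np

A = np.array([[2., 1.,  -2., 5.],
              [1., 0.,  -3., 2.],
              [3., -1., -13., 5.]])
B = np.array([[2., 1.],
              [1., 0.],
              [3., -1.]])
C = np.array([[1., 0., -3., 2.],
              [0., 1.,  4., 1.]])

print(np.allclose(A, B @ C))                              # True: A = BC
B_plus = np.linalg.inv(B.T @ B) @ B.T                     # Special Case 1 formula
C_plus = C.T @ np.linalg.inv(C @ C.T)                     # Special Case 2 formula
print(np.allclose(C_plus @ B_plus, np.linalg.pinv(A)))    # True: A^+ = C^+ B^+
```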

Case 3

$Ax=b,\ A\in \mathbb{R}^{m\times n},\ \operatorname{rank}A\leq \min(m,n)$

Find $x$ minimizing $\|Ax-b\|^2$

and, among all such minimizers, the one minimizing $\|x\|$

Note: The main difference between Case 1 and Case 2 is the relation between $m$ and $n$, with $A$ of full rank in both. Case 3 covers the more general situation where $A$ may be rank-deficient. When $m=n$, both Case 1 and Case 2 apply, and when $A$ has full rank, Case 3 applies as well. So the case conditions are not a strict partition; each result is stated over the largest range on which it holds.

Theorem

Given $Ax=b$ with $\operatorname{rank}A=r$, the vector $x^*=A^+b$ minimizes $\|Ax-b\|^2$. Furthermore, among all vectors that minimize $\|Ax-b\|^2$, $x^*$ is the unique one of minimal norm.
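A small numerical illustration (mine, not from the notes): for a rank-deficient, inconsistent system, $A^+b$ coincides with the minimum-norm least-squares solution returned by `np.linalg.lstsq`:

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])          # rank 1
b = np.array([1., 0., 1.])        # b is not in the range of A

x_star = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_lstsq))   # True
```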

Neural Networks

[Figure: overall structure of a neural network]

Single neuron

[Figure: a single neuron (single-layer network)]

Task: data fitting

Training data: $\langle x^d, y^d\rangle$, $x^d\in \mathbb{R}^{n\times s}$, $y^d\in\mathbb{R}^s$, where the $s$ training samples are stored as the columns of $x^d$

$f(x)=\sum_{i=1}^n \omega_ix_i=\omega^Tx$

Find $\omega\in\mathbb{R}^n$ minimizing $\frac{1}{2}\sum_{i=1}^s \left(y_i^d-{x_i^d}^T\omega\right)^2=\frac{1}{2}\left\|y^d-{x^d}^T\omega\right\|^2$, where $x_i^d$ is the $i$-th column of $x^d$

Case 1

$\operatorname{rank}x^d=s\leq n$

$\Rightarrow$ there exist infinitely many $\omega$ with $y^d={x^d}^T\omega$

$\Rightarrow \min \|\omega\|$
s.t. $y^d={x^d}^T\omega$

$\Rightarrow \omega^*=x^d({x^d}^Tx^d)^{-1}y^d$

$\Rightarrow$ Kaczmarz's Algorithm

Case 2

$\operatorname{rank}x^d=n\leq s$

$\Rightarrow \omega^*=(x^d{x^d}^T)^{-1}x^dy^d$

$\Rightarrow$ Gradient algorithm

$\omega^{k+1}=\omega^k-\alpha^k \nabla f(\omega^k)$

$\Rightarrow \omega^{k+1}=\omega^k+\alpha^kx^de^k$, where $e^k=y^d-{x^d}^T\omega^k$
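A minimal sketch of this gradient iteration for the linear single neuron (the function name `train_linear_neuron` and the fixed step size are my own choices; the step size must be small enough for convergence):

```python
import numpy as np

def train_linear_neuron(Xd, yd, alpha=0.01, iters=1000):
    """Iterate w <- w + alpha * Xd @ e with e = yd - Xd^T w.
    Xd is n x s (training samples stored as columns), yd has length s."""
    n, s = Xd.shape
    w = np.zeros(n)
    for _ in range(iters):
        e = yd - Xd.T @ w          # residual e^k on all s samples
        w = w + alpha * Xd @ e     # w^{k+1} = w^k + alpha^k x^d e^k
    return w
```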

$y=f\left(\sum_{i=1}^n \omega_ix_i\right)\ \Rightarrow\ \min\ \frac{1}{2}\left\|y^d-f\left({x^d}^T\omega\right)\right\|^2$

$\omega^{k+1}=\omega^k+\alpha^k\dfrac{x^de^k}{\|x^d\|^2}$

[for a single training pair $\langle x^d,y^d\rangle$]

$e^k=y^d-f\left({x^d}^T\omega^k\right)$
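A sketch of this single-sample update with a sigmoid activation (the choice of sigmoid and the helper names are my own illustration; the lecture leaves $f$ generic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_sample(xd, yd, f=sigmoid, alpha=0.5, iters=200):
    """Normalized update w <- w + alpha * e * xd / ||xd||^2 for one pair (xd, yd)."""
    w = np.zeros_like(xd, dtype=float)
    for _ in range(iters):
        e = yd - f(xd @ w)                     # e^k = y^d - f(x^d^T w^k)
        w = w + alpha * e * xd / (xd @ xd)     # step scaled by ||x^d||^2
    return w
```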

Multiple Layers

backpropagation algorithm
Since the multi-layer case is considerably more involved, it is not expanded here.

Summary

The previous lecture covered the first case of solving linear equations; this one covers the second and third cases. To make the results more general, the pseudoinverse of a matrix was introduced. The lecture then turned to neural networks, simplified mathematically to ease theoretical study, focusing on the simplest single-neuron network. The backpropagation algorithm for multi-layer networks was mentioned, but being rather involved, it was not worked out in detail.
