Deep Learning 2023/07/11
Logistic Regression
Step 1: Function Set
- Logistic regression is a machine learning algorithm for binary classification, so it needs to output a probability. This gives the following decision rule:
$$\text{if } P_{w,b}(C_1|x)\geq 0.5,\ \text{output } C_1;\quad \text{otherwise, output } C_2$$
- Assuming Gaussian class-conditional distributions, the posterior probability takes the form of a sigmoid applied to a linear function:
$$P_{w,b}(C_1|x)=\sigma(z),\qquad z=w\cdot x+b=\sum_i w_i x_i+b$$
- Putting these together, the function set is:
$$f_{w,b}(x)=P_{w,b}(C_1|x)$$
$$\Rightarrow f_{w,b}(x)=\sigma\Big(\sum_i w_i x_i+b\Big)$$
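This function set can be sketched in plain Python (the function name `f` and the example numbers below are my own illustrative choices, not from the notes):

```python
import math

def f(w, b, x):
    """f_{w,b}(x) = sigma(sum_i w_i * x_i + b): the estimated P(C1 | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# With zero weights and bias, z = 0 and the model is maximally uncertain.
print(f([0.0, 0.0], 0.0, [1.0, 2.0]))  # → 0.5
```

Whatever the parameters, the output is always a valid probability in (0, 1), which is exactly what the decision rule in Step 1 needs.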
Step 2: Goodness of a Function
- Assume the training set is generated by a function from the Function Set above, i.e. it satisfies:
$$f_{w,b}(x)=P_{w,b}(C_1|x)$$
- From this equation, a given pair (w, b) fully determines the probability of generating the data, so the likelihood is (here $x^3$ is assumed to belong to class $C_2$, hence the $1-f$ factor):
$$L(w,b)=f_{w,b}(x^1)f_{w,b}(x^2)\big(1-f_{w,b}(x^3)\big)\cdots f_{w,b}(x^N)$$
- The pair (w, b) that maximizes L(w, b) is denoted (w*, b*):
$$w^*,b^*=\arg\max_{w,b}L(w,b)$$
- Maximizing *L(w,b)* is equivalent to minimizing the negative log-likelihood:
$$w^*,b^*=\arg\min_{w,b}\big(-\ln L(w,b)\big)$$
- Expanding *−ln L(w,b)*:
$$L(w,b)=f_{w,b}(x^1)f_{w,b}(x^2)\big(1-f_{w,b}(x^3)\big)\cdots f_{w,b}(x^N)$$
$$\Rightarrow -\ln L(w,b)=-\ln f_{w,b}(x^1)-\ln f_{w,b}(x^2)-\ln\big(1-f_{w,b}(x^3)\big)-\cdots-\ln f_{w,b}(x^N)$$
Introducing labels $\hat{y}^n$: 1 for class 1, 0 for class 2, the sum can be written compactly as
$$-\ln L(w,b)=\sum_n-\big[\hat{y}^n\ln f_{w,b}(x^n)+(1-\hat{y}^n)\ln\big(1-f_{w,b}(x^n)\big)\big]$$
$$\text{Distribution } p:\quad p(x=1)=\hat{y}^n,\quad p(x=0)=1-\hat{y}^n$$
$$\text{Distribution } q:\quad q(x=1)=f_{w,b}(x^n),\quad q(x=0)=1-f_{w,b}(x^n)$$
- Substituting these two distributions into the definition of cross entropy:
$$H(p,q)=-\sum_x p(x)\ln q(x)$$
- Cross entropy:
$$C\big(f(x^n),\hat{y}^n\big)=-\big[\hat{y}^n\ln f(x^n)+(1-\hat{y}^n)\ln\big(1-f(x^n)\big)\big]$$
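The per-example cross entropy above is straightforward to implement (the function name `cross_entropy` is my own; the inputs are a prediction in (0, 1) and a label in {0, 1}):

```python
import math

def cross_entropy(f_x, y_hat):
    """C(f(x^n), y_hat^n) = -[y ln f + (1 - y) ln(1 - f)] for one example."""
    return -(y_hat * math.log(f_x) + (1 - y_hat) * math.log(1 - f_x))

print(cross_entropy(0.9, 1))  # small loss: confident and correct
print(cross_entropy(0.9, 0))  # large loss: confident but wrong
```

The loss is small when the predicted distribution q matches the target distribution p, and grows without bound as a confident prediction contradicts the label, which is what makes it a useful training signal.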
Step 3: Find the best function
$$-\ln L(w,b)=\sum_n-\big[\hat{y}^n\ln f_{w,b}(x^n)+(1-\hat{y}^n)\ln\big(1-f_{w,b}(x^n)\big)\big]$$
Differentiating each term with the chain rule:
$$\frac{\partial \ln f_{w,b}(x)}{\partial w_i}=\frac{\partial \ln f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i}$$
$$\Rightarrow \frac{\partial z}{\partial w_i}=x_i$$
$$\Rightarrow \frac{\partial \ln\sigma(z)}{\partial z}=\frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z}=\frac{1}{\sigma(z)}\sigma(z)\big(1-\sigma(z)\big)=1-\sigma(z)$$
$$\Rightarrow \frac{\partial \ln\big(1-f_{w,b}(x)\big)}{\partial w_i}=\frac{\partial \ln\big(1-f_{w,b}(x)\big)}{\partial z}\frac{\partial z}{\partial w_i}$$
$$\Rightarrow \frac{\partial \ln\big(1-\sigma(z)\big)}{\partial z}=-\frac{1}{1-\sigma(z)}\frac{\partial \sigma(z)}{\partial z}=-\frac{1}{1-\sigma(z)}\sigma(z)\big(1-\sigma(z)\big)=-\sigma(z)$$
$$\frac{\partial\big(-\ln L(w,b)\big)}{\partial w_i}=\sum_n-\big[\hat{y}^n\big(1-f_{w,b}(x^n)\big)x_i^n-(1-\hat{y}^n)f_{w,b}(x^n)x_i^n\big]$$
$$=\sum_n-\big[\hat{y}^n-\hat{y}^nf_{w,b}(x^n)-f_{w,b}(x^n)+\hat{y}^nf_{w,b}(x^n)\big]x_i^n$$
$$=\sum_n-\big(\hat{y}^n-f_{w,b}(x^n)\big)x_i^n$$
- This yields the parameter update rule:
$$w_i \leftarrow w_i-\eta\sum_n-\big(\hat{y}^n-f_{w,b}(x^n)\big)x_i^n$$
- The update rule shows that each gradient step depends on three factors:
  - the learning rate $\eta$;
  - the inputs $x_i^n$ from the dataset;
  - the difference between the target and the prediction:
$$\hat{y}^n-f_{w,b}(x^n)$$
From the analysis above, logistic regression and linear regression share the same gradient-descent update rule. The difference lies in the ranges of the values involved: in logistic regression the targets $\hat{y}^n$ are 0 or 1 and the predictions $f_{w,b}(x^n)$ lie between 0 and 1, whereas in linear regression both the targets and the predictions can be any real number.
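The full Step 1–3 pipeline can be sketched as batch gradient descent in plain Python. The `train` helper, the learning rate, the epoch count, and the toy dataset below are all my own illustrative choices, not from the notes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression.
    data: list of (x, y_hat) pairs; x is a feature list, y_hat is 0 or 1.
    """
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y_hat in data:
            f_x = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y_hat - f_x              # y_hat^n - f_{w,b}(x^n)
            for i, xi in enumerate(x):
                grad_w[i] += -err * xi     # sum_n -(y_hat^n - f) x_i^n
            grad_b += -err                 # same rule for the bias (x_i = 1)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy separable data: class 1 when the feature sum is large.
data = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([1.0, 1.0], 1)]
w, b = train(data)
p = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.95, 0.9])) + b)
print(p)  # well above 0.5: the point is classified as class 1
```

Swapping the sigmoid for the identity function and the cross-entropy gradient for the squared-error gradient would turn this same loop into linear regression, which is exactly the similarity the notes point out.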