Building on the classification model, we now study a widely used classification algorithm: logistic regression. Despite "regression" in its name, it is actually an algorithm for classification problems. Briefly, the two problem types differ as follows:
- Regression: predict a continuous output.
- Classification: predict a discrete output, e.g. 0 or 1 in a binary classification problem.
- Logistic regression is commonly applied to spam filtering, weather prediction, disease diagnosis, and ad targeting.
1. Step 1: Function Set
- Again consider a binary classification problem. The function set is:

  f_{w,b}(x)=P_{w,b}\left(C_{1} \mid x\right)=\sigma(z)=\frac{1}{1+\exp \{-(w \cdot x+b)\}}

- If P_{w,b}\left(C_{1} \mid x\right)>0.5, the predicted class is C_{1}; otherwise it is C_{2}.
- \sigma(z) is the sigmoid function, which maps any real z into (0, 1), so the output can be read as a probability.
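As a minimal sketch of this function set and the 0.5 decision rule (the names `sigmoid`, `predict_prob`, and `classify` are illustrative, not from the original):

```python
import math

def sigmoid(z):
    """Squash a real number z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, b, x):
    """P(C1 | x) for the logistic model f_{w,b}(x) = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def classify(w, b, x):
    """Class C1 if P(C1 | x) > 0.5, otherwise class C2."""
    return "C1" if predict_prob(w, b, x) > 0.5 else "C2"
```

Note that the decision boundary P(C1 | x) = 0.5 corresponds exactly to the linear boundary w·x + b = 0, since sigmoid(0) = 0.5.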
2. Step 2: Goodness of a Function
- Assume the data is generated based on f_{w,b}(x)=P_{w,b}\left(C_{1} \mid x\right).
- Given a set of w and b, what is its probability of generating the data? For a training set where, say, x^{1} and x^{2} belong to C_{1} and x^{3} belongs to C_{2}:

  L(w, b)=f_{w,b}\left(x^{1}\right) f_{w,b}\left(x^{2}\right)\left(1-f_{w,b}\left(x^{3}\right)\right) \cdots f_{w,b}\left(x^{N}\right)

- The most likely w^{*}, b^{*} are the ones with the largest L(w, b):

  w^{*}, b^{*}=\arg \max _{w, b} L(w, b)
- Label class C_{1} examples with \hat{y}=1 and class C_{2} examples with \hat{y}=0. The likelihood and its log can then be written compactly:

  L(w, b)=\prod_{i=1}^{n} f_{w,b}\left(x^{i}\right)^{\hat{y}^{i}}\left(1-f_{w,b}\left(x^{i}\right)\right)^{1-\hat{y}^{i}}

  \ln L(w, b)=\sum_{i=1}^{n}\left[\hat{y}^{i} \ln f_{w,b}\left(x^{i}\right)+\left(1-\hat{y}^{i}\right) \ln \left(1-f_{w,b}\left(x^{i}\right)\right)\right]
By maximum likelihood estimation, maximizing L is equivalent to minimizing -\ln L, so the solution is

w^{*}, b^{*}=\operatorname{argmin}_{w, b} \sum_{i=1}^{n}-\left[\hat{y}^{i} \ln f_{w,b}\left(x^{i}\right)+\left(1-\hat{y}^{i}\right) \ln \left(1-f_{w,b}\left(x^{i}\right)\right)\right]
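The objective above, -\ln L, can be sketched directly in code (a minimal illustration; the helper name `neg_log_likelihood` and the plain-list data layout are assumptions, not from the original):

```python
import math

def neg_log_likelihood(w, b, xs, ys):
    """-ln L(w, b): the sum of per-example cross entropies.
    ys holds labels y_hat in {0, 1} (1 for class C1, 0 for class C2)."""
    def f(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    total = 0.0
    for x, y in zip(xs, ys):
        p = f(x)  # P(C1 | x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

For example, with w = [0] and b = 0 the model outputs 0.5 for every input, so each example contributes exactly ln 2 to the loss regardless of its label.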
- Each summand is a cross entropy:

  C\left(f\left(x^{n}\right), \hat{y}^{n}\right)=-\left[\hat{y}^{n} \ln f_{w,b}\left(x^{n}\right)+\left(1-\hat{y}^{n}\right) \ln \left(1-f_{w,b}\left(x^{n}\right)\right)\right]

  which is the cross entropy between two Bernoulli distributions: the label distribution and the model's predicted distribution.
- In general, the cross entropy between two distributions p and q is:

  H(p, q)=-\sum_{x} p(x) \ln q(x)
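A small sketch tying the two formulas together (the function name `cross_entropy` and the list-of-probabilities encoding are illustrative assumptions): for a Bernoulli label distribution p = (ŷ, 1−ŷ) and model distribution q = (f, 1−f), the general H(p, q) reduces to the per-example loss above.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) ln q(x) for two discrete distributions
    given as lists of probabilities over the same outcomes.
    Terms with p(x) = 0 contribute nothing, so they are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```

With label ŷ = 1 (so p = [1, 0]) and model output f = 0.8 (so q = [0.8, 0.2]), this gives −ln 0.8, matching the per-example formula.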
3. Step 3: Find the best function
- z=w \cdot x+b=\sum_{i} w_{i} x_{i}+b
- f_{w,b}(x)=\sigma(z)=\frac{1}{1+\exp (-z)}
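The best function is found by gradient descent on the cross-entropy loss; the gradient with respect to w_i works out to \sum_{n}-\left(\hat{y}^{n}-f_{w,b}(x^{n})\right) x_{i}^{n}. A minimal batch-gradient-descent sketch (the function name, learning rate, and epoch count are illustrative assumptions, not prescriptions from the original):

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Batch gradient descent on the cross-entropy loss.
    Gradient w.r.t. w_i is sum_n -(y_hat^n - f(x^n)) * x_i^n,
    i.e. (f(x^n) - y_hat^n) * x_i^n per example."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            f = 1.0 / (1.0 + math.exp(-z))
            err = f - y  # -(y_hat - f)
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

Note the error term (f − ŷ) has the same form as in linear regression; only the model output f passes through the sigmoid.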