Logistic regression
Purpose: classification or regression? Logistic regression is a classic binary classification algorithm!
Choosing a machine learning algorithm: try logistic regression first, then move on to more complex models; keep things as simple as possible.
Decision boundary of logistic regression: it can be nonlinear (by feeding in nonlinear features such as polynomial terms).
The Sigmoid function
Formula:

$$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1 + e^{-\theta^{T}x}}$$
The input can be any real number; the output lies in the interval [0, 1].
Explanation: it maps an arbitrary input into the [0, 1] interval. Linear regression gives us a raw prediction value; passing that value through the Sigmoid function converts it into a probability, which is exactly what a classification task needs.
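As a minimal sketch of this mapping (the function name is illustrative, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -- the decision boundary
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```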
Prediction function:

$$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1 + e^{-\theta^{T}x}}, \quad \text{where } \theta_{0} + \theta_{1}x_{1} + \cdots + \theta_{n}x_{n} = \sum_{i=0}^{n}\theta_{i}x_{i} = \theta^{T}x \ (x_{0} = 1)$$
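A sketch of the prediction function, assuming the data matrix already carries the constant feature $x_{0} = 1$ so that `theta[0]` plays the role of the intercept $\theta_{0}$ (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    """h_theta(x) = g(theta^T x), evaluated for each row of X."""
    return sigmoid(X @ theta)

X = np.array([[1.0, 2.0, 3.0],
              [1.0, -1.0, 0.5]])     # first column is x_0 = 1
theta = np.array([0.1, 0.2, -0.3])
print(h(theta, X))                   # one probability per sample
```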
Classification task:

$$\begin{cases} P(y=1\mid x;\theta) = h_{\theta}(x) \\ P(y=0\mid x;\theta) = 1 - h_{\theta}(x) \end{cases}$$
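Both class probabilities come straight from $h_{\theta}(x)$; thresholding at 0.5 to pick a label is a common convention I am assuming here, not something stated in the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])        # x_0 = 1 included
theta = np.array([0.1, 0.2, -0.3])

p1 = sigmoid(theta @ x)              # P(y = 1 | x; theta) = h_theta(x)
p0 = 1.0 - p1                        # P(y = 0 | x; theta) = 1 - h_theta(x)
print(p1, p0, int(p1 >= 0.5))        # probabilities and the predicted label
```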
Combined into a single expression:

$$P(y\mid x;\theta) = (h_{\theta}(x))^{y}\,(1 - h_{\theta}(x))^{1-y}$$
Explanation: for a binary task with labels (0, 1), the combined expression reduces to $(1-h_{\theta}(x))^{1-y}$ when $y=0$ (the other factor becomes 1), and to $(h_{\theta}(x))^{y}$ when $y=1$.
Likelihood function:

$$L(\theta) = \prod_{i=1}^{m} P(y_{i}\mid x_{i};\theta) = \prod_{i=1}^{m}(h_{\theta}(x_{i}))^{y_{i}}(1 - h_{\theta}(x_{i}))^{1-y_{i}}$$
Log-likelihood:

$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m}\left(y_{i}\log h_{\theta}(x_{i}) + (1-y_{i})\log\left(1 - h_{\theta}(x_{i})\right)\right)$$
Maximizing $l(\theta)$ would call for gradient ascent. Instead, we introduce

$$J(\theta) = -\frac{1}{m}\,l(\theta)$$

which turns the problem into a minimization task that (mini-batch) gradient descent can solve.
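A sketch of $J(\theta)$ as the averaged negative log-likelihood; the small `eps` inside the logs is my addition for numerical safety, not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * l(theta), the averaged negative log-likelihood."""
    m = len(y)
    h = sigmoid(X @ theta)
    # y_i log h + (1 - y_i) log(1 - h), summed over the m samples
    ll = y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)
    return -ll.sum() / m

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])  # x_0 = 1 column
y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(2), X, y))   # log(2) ~ 0.693 at theta = 0
```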
Derivation (using the sigmoid derivative $g'(z) = g(z)(1 - g(z))$):

$$\begin{aligned}
\frac{\partial}{\partial\theta_{j}}J(\theta)
&= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}\frac{1}{h_{\theta}(x_{i})}\frac{\partial}{\partial\theta_{j}}h_{\theta}(x_{i}) - (1-y_{i})\frac{1}{1-h_{\theta}(x_{i})}\frac{\partial}{\partial\theta_{j}}h_{\theta}(x_{i})\right) \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}\frac{1}{g(\theta^{T}x_{i})} - (1-y_{i})\frac{1}{1-g(\theta^{T}x_{i})}\right)\frac{\partial}{\partial\theta_{j}}g(\theta^{T}x_{i}) \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}\frac{1}{g(\theta^{T}x_{i})} - (1-y_{i})\frac{1}{1-g(\theta^{T}x_{i})}\right)g(\theta^{T}x_{i})\left(1-g(\theta^{T}x_{i})\right)\frac{\partial}{\partial\theta_{j}}\theta^{T}x_{i} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i}\left(1-g(\theta^{T}x_{i})\right) - (1-y_{i})\,g(\theta^{T}x_{i})\right)x_{i}^{j} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left(y_{i} - g(\theta^{T}x_{i})\right)x_{i}^{j} \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x_{i}) - y_{i}\right)x_{i}^{j}
\end{aligned}$$
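The final line of the derivation translates directly into code; a vectorized sketch (names are mine), computing all components $j$ at once as $X^{T}(h - y)/m$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_i^j for all j."""
    m = len(y)
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(gradient(np.zeros(2), X, y))
```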
Parameter update:

$$\theta_{j} := \theta_{j} - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x_{i}) - y_{i}\right)x_{i}^{j}$$
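Putting the pieces together, a minimal mini-batch gradient descent loop; the learning rate, batch size, epoch count, and toy data are arbitrary illustrative choices, not values from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=200, batch_size=2, seed=0):
    """Apply theta_j := theta_j - alpha * (1/m) * sum (h - y) * x^j
    to one shuffled mini-batch at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))          # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]  # indices of this mini-batch
            h = sigmoid(X[b] @ theta)
            theta -= alpha * X[b].T @ (h - y[b]) / len(b)
    return theta

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = train(X, y)
print(theta, sigmoid(X @ theta).round(2))  # fitted weights and probabilities
```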
Softmax for multi-class classification:

$$h_{\theta}(x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1\mid x^{(i)};\theta) \\ p(y^{(i)}=2\mid x^{(i)};\theta) \\ \vdots \\ p(y^{(i)}=k\mid x^{(i)};\theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k}e^{\theta_{j}^{T}x^{(i)}}} \begin{bmatrix} e^{\theta_{1}^{T}x^{(i)}} \\ e^{\theta_{2}^{T}x^{(i)}} \\ \vdots \\ e^{\theta_{k}^{T}x^{(i)}} \end{bmatrix}$$
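A sketch of the softmax output for one sample; subtracting the maximum score before exponentiating is a standard numerical-stability trick that I am adding, not part of the formula above:

```python
import numpy as np

def softmax_probs(Theta, x):
    """Return [p(y=1|x), ..., p(y=k|x)] for one sample x.

    Theta holds one parameter row theta_j per class, so the class
    scores are theta_j^T x for j = 1..k.
    """
    scores = Theta @ x
    scores -= scores.max()   # stability only: ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()

Theta = np.array([[0.2, -0.1],   # theta_1
                  [0.0, 0.3],    # theta_2
                  [-0.4, 0.1]])  # theta_3
x = np.array([1.0, 2.0])         # x_0 = 1 included
p = softmax_probs(Theta, x)
print(p, p.sum())                # k probabilities summing to 1
```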
Summary: logistic regression really is an excellent, highly practical algorithm.