Mathematical Formulas
(1) Logarithm rules
\log(M \cdot N) = \log M + \log N
\log M^N = N \log M
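Both identities can be sanity-checked numerically with NumPy (a minimal sketch; M and N are arbitrary positive numbers):
import numpy as np

M, N = 3.0, 7.0
assert np.isclose(np.log(M * N), np.log(M) + np.log(N))  # log(M*N) = logM + logN
assert np.isclose(np.log(M ** N), N * np.log(M))         # log(M^N) = NlogM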
Logistic Regression
Logistic Regression is a kind of generalized linear model; the classification hyperplane can be represented by a linear function:
Wx + b = y
where W is the weight and b is the bias term; with multi-dimensional inputs, W is a vector.
By learning from the training samples we obtain the hyperplane, and a threshold function then maps samples to the two classes (0 or 1).
A commonly used threshold function is the Sigmoid function:
f(x) = \frac{1}{1+e^{-x}}
As the formula shows, the function's range is (0, 1), and it changes most rapidly near 0.
The derivative of the Sigmoid is:
\sigma'(x) = \left(\frac{1}{1+e^{-x}}\right)' = \frac{-(1+e^{-x})'}{(1+e^{-x})^2} = \frac{-(1)'-(e^{-x})'}{(1+e^{-x})^2}
= \frac{0-(-x)'e^{-x}}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2}
= \left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)
= \sigma(x)\left(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)
= \sigma(x)(1 - \sigma(x))
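The closed form \sigma'(x) = \sigma(x)(1-\sigma(x)) can be verified against a central finite-difference approximation (a minimal sketch, reusing the sig function from the implementation section below):
import numpy as np

def sig(x):
    return 1.0 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sig(x) * (1 - sig(x))                     # closed-form derivative
eps = 1e-6
numeric = (sig(x + eps) - sig(x - eps)) / (2 * eps)  # central difference
assert np.allclose(analytic, numeric)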
Loss Function
For an input vector x, the probability of belonging to the positive class is:
P(y=1) = \sigma(wx+b) = \frac{1}{1+e^{-(wx+b)}}
The probability of belonging to the negative class is:
P(y=0) = 1 - \sigma(wx+b)
By the Bernoulli probability function, the probability of belonging to class y is:
P(y) = \sigma(wx+b)^y(1-\sigma(wx+b))^{1-y}, \quad y = 0, 1
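The exponents simply select one of the two cases: y = 1 gives \sigma(wx+b)^1(1-\sigma(wx+b))^0 = \sigma(wx+b), and y = 0 gives 1 - \sigma(wx+b). A small illustration, where p stands for \sigma(wx+b) and is chosen arbitrarily:
p = 0.8  # sigma(wx + b) for some hypothetical sample
for y in (0, 1):
    print(y, p**y * (1 - p)**(1 - y))  # y=0 -> 0.2, y=1 -> 0.8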
Given the class-membership probability of each training sample, we multiply the per-sample class probabilities together and estimate the parameters by maximum likelihood. The likelihood function is:
L_\theta = \prod_{i=1}^m P_i(y) = \prod_{i=1}^m\left[h_\theta(x^i)^{y^i}(1-h_\theta(x^i))^{1-y^i}\right]
where h_\theta(x^i) = \sigma(wx^i + b).
To maximize the likelihood, it is convenient to work with the log-likelihood, which turns the product into a sum. Taking the negative log-likelihood (NLL) as the loss function, the task becomes minimizing the NLL. The loss function is:
-\log(L_\theta) = -\log\left(\prod_{i=1}^m\left[h_\theta(x^i)^{y^i}(1-h_\theta(x^i))^{1-y^i}\right]\right)
= -\sum_{i=1}^m \log\left(h_\theta(x^i)^{y^i}(1-h_\theta(x^i))^{1-y^i}\right)
= -\sum_{i=1}^m \left[\log\left(h_\theta(x^i)^{y^i}\right) + \log\left((1-h_\theta(x^i))^{1-y^i}\right)\right]
= -\sum_{i=1}^m \left[y^i\log(h_\theta(x^i)) + (1-y^i)\log(1-h_\theta(x^i))\right]
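In NumPy, this NLL can be written in vectorized form (a hedged sketch; h and y are assumed to be arrays of predicted probabilities and 0/1 labels):
import numpy as np

def nll(h, y):
    # negative log-likelihood summed over the m samples;
    # clipping avoids log(0) for saturated predictions
    h = np.clip(h, 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))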
To find the minimum of the loss function, we use gradient descent.
Gradient Descent
The loss function (the NLL averaged over the m samples) is:
J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^i\log(h_\theta(x^i)) + (1-y^i)\log(1-h_\theta(x^i))\right]
The gradient descent update rule is:
\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)
Substituting the loss function and working through the derivation: first, take the partial derivative of \theta^T x^{(i)} with respect to \theta_j:
\theta^T x^{(i)} = [\theta_0, \theta_1, \ldots, \theta_j, \ldots] \cdot x^{(i)} = \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_j x_j^{(i)} + \cdots
The result is x_j^{(i)}. Applying the chain rule with \sigma'(z) = \sigma(z)(1-\sigma(z)) and simplifying then gives
\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^i\right)x_j^{(i)}
so the update rule becomes \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^i\right)x_j^{(i)}.
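This analytic gradient, \frac{1}{m}X^T(h_\theta(X) - y) in matrix form, can be checked against finite differences (a minimal sketch with made-up data):
import numpy as np

def sig(z):
    return 1.0 / (1 + np.exp(-z))

def loss(theta, X, y):
    h = sig(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))               # 20 samples, 3 features
y = (rng.random(20) > 0.5).astype(float)   # random 0/1 labels
theta = rng.normal(size=3)

analytic = X.T @ (sig(X @ theta) - y) / len(y)
eps = 1e-6
numeric = np.array([(loss(theta + eps * e, X, y) -
                     loss(theta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-6)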
Key Derivation Points
- Differentiation passes through constant factors, e.g. (3x)' = 3(x)'.
- The logarithm with base e is the natural logarithm, written \ln; (\ln x)' = 1/x.
- The derivative of the Sigmoid function is \sigma'(x) = \sigma(x)(1-\sigma(x))\,(x)' (the trailing (x)' is the inner derivative from the chain rule).
Python Implementation
# Code from the book 《Python机器学习算法》 (Python Machine Learning Algorithms)
import numpy as np

def sig(x):
    return 1.0 / (1 + np.exp(-x))

def lr_train_bgd(feature, label, maxCycle, alpha):
    '''Train an LR model with batch gradient descent
    input:  feature(mat)  feature matrix
            label(mat)    class labels
            maxCycle(int) maximum number of iterations
            alpha(float)  learning rate
    output: w(mat) weights
    '''
    n = np.shape(feature)[1]      # number of features
    w = np.mat(np.ones((n, 1)))   # initialize the weights to ones
    i = 0
    while i <= maxCycle:          # loop up to the maximum number of iterations
        i += 1                    # current iteration count
        h = sig(feature * w)      # Sigmoid of the linear scores
        err = label - h           # prediction error y - h
        if i % 100 == 0:
            print("\t---------iter=" + str(i) +
                  " , train error rate= " + str(error_rate(h, label)))
        w = w + alpha * feature.T * err  # weight update
    return w
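Note that error_rate is defined elsewhere in the book's source and is not shown here; judging by how it is called (and by the loss code further below), a stand-in consistent with that usage might be:
def error_rate(h, label):
    # mean negative log-likelihood over the m samples (assumed definition)
    m = np.shape(h)[0]
    sum_err = 0.0
    for i in range(m):
        sum_err -= (label[i, 0] * np.log(h[i, 0])
                    + (1 - label[i, 0]) * np.log(1 - h[i, 0]))
    return sum_err / m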
Code analysis:
(1) feature is the training data, with the bias term's feature value set to 1. The data looks like this:
(Pdb) feature[:10]
matrix([[1. , 4.459, 8.225],
[1. , 0.043, 6.307],
[1. , 6.997, 9.313],
[1. , 4.755, 9.26 ],
[1. , 8.662, 9.768],
[1. , 7.174, 8.695],
[1. , 0.134, 1.969],
[1. , 2.959, 5.805],
[1. , 0.162, 2.596],
[1. , 3.996, 8.833]])
label holds the class labels, with values 0 or 1:
(Pdb) label[:10]
matrix([[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]])
maxCycle is the maximum number of iterations, set to 1000; alpha is the learning rate, set to 0.01.
There are 3 features, so the weights are initialized as a vector of ones:
(Pdb) p w
matrix([[1.],
[1.],
[1.]])
(2) h = sig(feature * w) is the predicted value, corresponding to the expression:
h_\theta(x^i) = \sigma(wx^i + b) = \frac{1}{1+e^{-(wx^i+b)}}
(Pdb) p h[:10]
matrix([[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]])
(3) err = label - h corresponds to the expression:
y^i - h_\theta(x^i)
and feature.T * err corresponds to the expression:
x^i \cdot (y^i - h_\theta(x^i))
Stacked over all samples in matrix form this is exactly feature.T * err, so w = w + alpha * feature.T * err performs gradient descent on the loss (the 1/m factor is absorbed into the learning rate alpha).
(4) After updating the weights w, the loss value can be computed. The loss function is:
J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^i\log(h_\theta(x^i)) + (1-y^i)\log(1-h_\theta(x^i))\right]
The code implementation is:
sum_err = 0.0
for i in range(m):
    y_i = label[i, 0]
    sum_err -= (y_i * np.log(h[i, 0]) + (1 - y_i) * np.log(1 - h[i, 0]))
sum_err /= m
(5) After training finishes, we obtain the final weights:
(Pdb) p w
matrix([[ 1.394],
[ 4.527],
[-4.794]])
Prediction
Plug the test data into the prediction function h = sig(feature * w) to obtain the predicted values; a value < 0.5 is predicted as a negative example, and a value >= 0.5 as a positive example.
(Pdb) p h[:10]
matrix([[0. ],
[0. ],
[0.002],
[0. ],
[0.001],
[0. ],
[0.001],
[0.001],
[0. ],
[0. ]])
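Thresholding these probabilities at 0.5 yields the hard class labels (a one-line sketch):
pred = (h >= 0.5).astype(int)  # 1 = positive class, 0 = negative class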