1. Logistic Regression Formulas
Hypothesis function:

$$\hat y=h_{\theta}(x)=g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}}$$

where

$$g(z)=\frac{1}{1+e^{-z}}$$

is the sigmoid function.
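The hypothesis above can be sketched in NumPy; the function names `sigmoid` and `hypothesis` are my own, not from the original post:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): the predicted probability that y = 1."""
    return sigmoid(np.dot(theta, x))
```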
Cross-entropy loss (loss function), for a single sample:

$$L(\hat{y}, y)=-y \log \hat{y}-(1-y) \log (1-\hat{y})$$

- $y$: the sample's label, 1 for the positive class and 0 for the negative class
- $\hat{y}$: the predicted probability that the sample is positive; $1-\hat{y}$: the predicted probability that it is negative (see the linked reference on loss functions)
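A minimal sketch of this per-sample loss (the function name is my own):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one sample:
    L(y_hat, y) = -y*log(y_hat) - (1 - y)*log(1 - y_hat).
    Small when y_hat agrees with the label y in {0, 1}."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
```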
Cost function:

$$\begin{aligned} J(\theta) &=\frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right) \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \hat{y}^{(i)}-\left(1-y^{(i)}\right) \log \left(1-\hat{y}^{(i)}\right)\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \frac{1}{1+e^{-\theta^T x^{(i)}}}-\left(1-y^{(i)}\right) \log \left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)\right] \end{aligned}$$
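The cost function can be vectorized over all $m$ samples; a sketch, where `X` is the $m \times n$ design matrix and the names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * sum of per-sample cross-entropy losses.
    X: (m, n) design matrix; y: (m,) labels in {0, 1}."""
    m = len(y)
    y_hat = sigmoid(X @ theta)  # predicted probabilities for all samples
    return (1.0 / m) * np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
```

With $\theta = 0$ every prediction is 0.5, so the cost is $\log 2$ regardless of the labels.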
- Use gradient descent (similar to linear regression; see link) to minimize the cost function:
$$\begin{aligned} \frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \frac{1}{1+e^{-\theta^{T} x^{(i)}}}-\left(1-y^{(i)}\right) \log \left(1-\frac{1}{1+e^{-\theta^{T} x^{(i)}}}\right)\right] \\ &=\frac{\partial}{\partial \theta_{j}} \frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(1+e^{-\theta^{T} x^{(i)}}\right)+\left(1-y^{(i)}\right) \log \left(1+e^{\theta^{T} x^{(i)}}\right)\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{-x_{j}^{(i)} e^{-\theta^{T} x^{(i)}}}{1+e^{-\theta^{T} x^{(i)}}}+\left(1-y^{(i)}\right) \frac{x_{j}^{(i)} e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}}\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{-x_{j}^{(i)}}{1+e^{\theta^{T} x^{(i)}}}+\left(1-y^{(i)}\right) \frac{x_{j}^{(i)} e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}}\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[\frac{-x_{j}^{(i)} y^{(i)}+x_{j}^{(i)} e^{\theta^{T} x^{(i)}}-y^{(i)} x_{j}^{(i)} e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}}\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[\frac{-y^{(i)}\left(1+e^{\theta^{T} x^{(i)}}\right)+e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}} x_{j}^{(i)}\right]=\frac{1}{m} \sum_{i=1}^{m}\left[\left(-y^{(i)}+\frac{e^{\theta^{T} x^{(i)}}}{1+e^{\theta^{T} x^{(i)}}}\right) x_{j}^{(i)}\right] \\ &=\frac{1}{m} \sum_{i=1}^{m}\left[\left(-y^{(i)}+\frac{1}{1+e^{-\theta^{T} x^{(i)}}}\right) x_{j}^{(i)}\right] = \frac{1}{m} \sum_{i=1}^{m}\left[\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}\right] \end{aligned}$$
- Note: here $h_{\theta}\left(x^{(i)}\right)=\frac{1}{1+e^{-\theta^{T} x^{(i)}}}$, which differs from linear regression.
- Update rule: $\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}[(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}]$, identical in form to the linear regression update; only the hypothesis $h_\theta$ differs.
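One simultaneous gradient-descent step over all $\theta_j$ can be sketched as follows (the vectorized form `X.T @ (h - y)` computes every component of the gradient at once; names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of all parameters:
    theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(y)
    grad = (1.0 / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta - alpha * grad
```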
2. Worked Example of Gradient Descent
Using the gradient update rule above, $\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}[(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}]$, we work through a concrete example. The table below has two features $x_1, x_2$ and one output $y$; following the hypothesis function, we add a feature $x_0$ whose value is always 1, so the number of features is $n=3$. The table has 2 rows of data, so the number of samples is $m=2$.
| Added feature $x_0$ | House area $x_1$ | House orientation $x_2$ | Class $y$ |
|---|---|---|---|
| 1 | 200 | 1 | 1 |
| 1 | 120 | 2 | 0 |
Hypothesis: $h_{\theta}\left(x^{(i)}\right)=\frac{1}{1+e^{-\theta^{T} x^{(i)}}}=\frac{1}{1+e^{-(\theta_0 x_0^{(i)}+\theta_{1} x_{1}^{(i)}+\theta_{2} x_{2}^{(i)})}}$. Randomly initialize $\theta_{0}=0.01,\ \theta_{1}=0.03,\ \theta_{2}=0.06$, set the learning rate $\alpha=0.01$, and iterate to update $\theta$.
$$\theta_0 = \theta_0-\alpha\times\frac{1}{m}\left[\left(\frac{1}{1+e^{-(\theta_0 x_0^{(1)}+\theta_{1} x_1^{(1)}+\theta_{2} x_2^{(1)})}}-y^{(1)}\right)\times x_0^{(1)} + \left(\frac{1}{1+e^{-(\theta_0 x_0^{(2)}+\theta_{1} x_1^{(2)}+\theta_{2} x_2^{(2)})}}-y^{(2)}\right)\times x_0^{(2)}\right]$$
$\theta_1, \theta_2$ are updated the same way as $\theta_0$; for details, see the parameter-update walkthrough in linear regression (link).
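The first iteration with the table's data can be sketched numerically (a vectorized version of the update above, using the initial values given in the text):

```python
import numpy as np

# Data from the table; x0 = 1 is the added bias feature.
X = np.array([[1.0, 200.0, 1.0],
              [1.0, 120.0, 2.0]])
y = np.array([1.0, 0.0])

theta = np.array([0.01, 0.03, 0.06])  # initial values from the text
alpha = 0.01                          # learning rate from the text

h = 1.0 / (1.0 + np.exp(-(X @ theta)))          # h_theta(x^(i)) for both rows
theta_new = theta - alpha / len(y) * (X.T @ (h - y))  # one simultaneous update
```

Note that $\theta_1$ takes a much larger step than the others because $x_1$ is unscaled (values like 200 and 120), which is why feature scaling is usually applied in practice.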
3. Regularization of Logistic Regression
L2-norm regularization mitigates overfitting:
$$J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log \frac{1}{1+e^{-\theta^{T} x^{(i)}}}-\left(1-y^{(i)}\right) \log \left(1-\frac{1}{1+e^{-\theta^{T} x^{(i)}}}\right)\right]+\frac{\lambda}{2m}\|\theta\|_{2}^{2}$$
Update rule: $\theta_{j}:=\theta_{j}-\frac{\alpha }{m} \left(\sum_{i=1}^{m}[(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}]+\lambda\theta_{j}\right)$
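A sketch of one regularized step. Note one assumption on my part: in many treatments the bias $\theta_0$ is left out of the penalty, and that variant is shown here; dropping the `reg[0] = 0.0` line recovers the formula exactly as written above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    """One L2-regularized update:
    theta_j := theta_j - (alpha/m) * (sum_i (h - y) * x_j + lam * theta_j)."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y)
    reg = lam * theta
    reg[0] = 0.0  # assumed convention: do not penalize the bias term
    return theta - (alpha / m) * (grad + reg)
```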
4. Multiclass Classification with Logistic Regression
One-Vs-All (Rest)
As illustrated in the figure above, a three-class problem yields 3 binary classifiers. At prediction time, each classifier outputs the probability that the test sample belongs to its positive class, i.e. $P(y=i \mid x;\theta),\ i=1,2,3$. The classifier with the highest output wins, and its positive class is the predicted label.
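The One-vs-All prediction rule reduces to an argmax over the per-class classifiers; a sketch, assuming the $K$ trained parameter vectors are stacked into one array (names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ova_predict(thetas, x):
    """One-vs-All prediction: `thetas` is a (K, n) array holding one
    trained parameter vector per class; return the index of the class
    whose classifier assigns the highest probability P(y = k | x)."""
    probs = sigmoid(thetas @ x)  # K probabilities, one per binary classifier
    return int(np.argmax(probs))
```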
One-Vs-One
Many-Vs-Many
- Encoding: make $M$ partitions of the $N$ classes; each partition marks some classes as positive and the rest as negative, forming a binary training set. This gives $M$ training sets, from which $M$ classifiers are trained.
- Decoding: the $M$ classifiers' predictions on a test sample form a codeword; compare this predicted codeword with each class's own codeword and return the class with the smallest distance.
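The decoding step can be sketched as follows, assuming ±1 codes and Hamming distance (one common choice; the original text does not fix a specific distance):

```python
import numpy as np

def mvm_decode(codeword, class_codes):
    """Many-vs-Many (ECOC-style) decoding: `codeword` holds the M binary
    classifiers' predicted labels for one sample; `class_codes` is an
    (N, M) array of +/-1 entries, one row per class. Return the class
    whose code is closest to the prediction in Hamming distance."""
    dists = np.sum(codeword != class_codes, axis=1)  # per-class Hamming distance
    return int(np.argmin(dists))
```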
5. Code Examples
| Name | Code link |
|---|---|
| Logistic regression implementation | code |
| Logistic regression on the iris dataset | code |
| Handwritten digit recognition with logistic regression | code |