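Classification Problems
1. Collect data and select suitable features
This walkthrough uses the iris dataset as its example. The features are: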
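- sepal length (cm): length of the sepal, in centimeters
- sepal width (cm): width of the sepal, in centimeters
- petal length (cm): length of the petal, in centimeters
- petal width (cm): width of the petal, in centimeters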
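2. Choose a metric for measuring model performance
Because the dependent variable in a classification problem is discrete, the evaluation metrics differ from those used for regression. They are built from four prediction outcomes: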
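- True positive (TP): the predicted value and the true value are both positive;
- True negative (TN): the predicted value and the true value are both negative;
- False positive (FP): the predicted value is positive, the true value is negative;
- False negative (FN): the predicted value is negative, the true value is positive.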
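The four outcomes above can be tallied directly from a list of true labels and a list of predicted labels. The following is a minimal sketch, not part of the original notes; the function name `confusion_counts`, the `positive` parameter, and the toy labels are all hypothetical choices made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Toy labels made up for this example (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```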
Metric definitions:
- Accuracy: the fraction of all samples that are classified correctly, i.e. $ACC=\frac{TP+TN}{FP+FN+TP+TN}$
- Precision: (predicted positive ∩ classified correctly) / predicted positive, i.e. $PRE=\frac{TP}{TP+FP}$
- Recall: (predicted positive ∩ classified correctly) / actually positive, i.e. $REC=\frac{TP}{TP+FN}$
- F1 score: a combined measure of precision and recall: $F1=2\frac{PRE\times REC}{PRE+REC}$
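- ROC curve: the curve drawn with the false positive rate on the horizontal axis and the true positive rate on the vertical axis
This example uses ROC as its evaluation metric.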
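To make the metric definitions concrete, here is a small pure-Python sketch that computes accuracy, precision, recall and F1 from the four counts, and traces ROC points by sweeping a decision threshold over predicted scores. The function names, the zero-denominator convention (return 0.0), the toy scores, and the trapezoidal area-under-curve summary are assumptions added for illustration; they do not come from the original notes. The ROC helper assumes labels are 0/1 and that both classes are present.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (ACC, PRE, REC, F1) from confusion counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return acc, pre, rec, f1


def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0).

    A sample is predicted positive when its score >= threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Counts from a made-up example: TP=3, TN=3, FP=1, FN=1.
acc, pre, rec, f1 = classification_metrics(3, 3, 1, 1)
print(acc, pre, rec, f1)

# Made-up labels and scores for the ROC sweep.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, scores)
print(curve)
print(area_under_curve(curve))
```

A curve that hugs the top-left corner (high true positive rate at low false positive rate) indicates a better classifier; the area under the curve is one common way to summarize that in a single number.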
3. Choose a model and train it
- Logistic regression
- Logistic function
$$p(x)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
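As a quick illustration of the logistic function, the sketch below evaluates $p(x)$ for a single feature. The function name `logistic` and the sample coefficients are hypothetical. The branch on the sign of the exponent is an implementation choice, added here to avoid overflow in `math.exp` for large inputs; it is algebraically the same formula.

```python
import math


def logistic(beta0, beta1, x):
    """Evaluate p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) without overflow."""
    z = beta0 + beta1 * x
    if z >= 0:
        # Equivalent form 1 / (1 + e^(-z)); e^(-z) cannot overflow here.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so e^z lies in (0, 1)
    return ez / (1.0 + ez)


# Made-up coefficients: the output always lies between 0 and 1.
for x in (-5.0, 0.0, 5.0):
    print(x, logistic(0.0, 1.0, x))
```

Whatever the value of the linear term, the output stays in the interval from 0 to 1, which is why it can be read as a class probability.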
Direct derivation, writing $z_i=\omega^Tx_i$, $\sigma(z)=\frac{e^z}{1+e^z}$ and $p_i=P(y_i=1|x_i)=\sigma(z_i)$:
$$
\begin{aligned}
\hat{\omega}&=\arg\max_\omega \log P(Y|X)=\arg\max_\omega \log\prod_{i=1}^{N}P(y_i|x_i)=\arg\max_\omega \sum_{i=1}^{N}\log P(y_i|x_i)\\
&=\arg\max_\omega \sum_{i=1}^N\log\left(p_i^{y_i}(1-p_i)^{1-y_i}\right)=\arg\max_\omega\sum_{i=1}^N\left(y_i\log p_i+(1-y_i)\log(1-p_i)\right)
\end{aligned}
$$
Let $L(\omega)=\sum_{i=1}^N\left(y_i\log p_i+(1-y_i)\log(1-p_i)\right)$. Using $\frac{\partial \sigma(z)}{\partial z}=\sigma(z)-\sigma(z)^2$ and $\frac{\partial z_i}{\partial \omega_k}=x_{ik}$, where $x_{ik}$ is the $k$-th component of $x_i$:
$$
\begin{aligned}
\frac{\partial L}{\partial \omega_k}&=\sum_{i=1}^N\left[y_i\frac{1}{p_i}\frac{\partial p_i}{\partial z_i}\frac{\partial z_i}{\partial \omega_k}+(1-y_i)\frac{1}{1-p_i}\left(-\frac{\partial p_i}{\partial z_i}\frac{\partial z_i}{\partial \omega_k}\right)\right]\\
&=\sum_{i=1}^N\left[y_i\frac{1}{\sigma(z_i)}\left(\sigma(z_i)-\sigma(z_i)^2\right)x_{ik}+(1-y_i)\frac{1}{1-\sigma(z_i)}\left(-\left(\sigma(z_i)-\sigma(z_i)^2\right)x_{ik}\right)\right]\\
&=\sum_{i=1}^N\left[\left(y_i-y_i\sigma(z_i)\right)x_{ik}+(1-y_i)\left(-\sigma(z_i)\right)x_{ik}\right]\\
&=\sum_{i=1}^N\left(y_ix_{ik}-\sigma(z_i)x_{ik}\right)=\sum_{i=1}^N\left(y_i-\sigma(z_i)\right)x_{ik}
\end{aligned}
$$
Stacking the components gives the gradient in vector form: $\frac{\partial L}{\partial \omega}=\sum_{i=1}^N\left(y_i-\sigma(z_i)\right)x_i$.
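The gradient of the log-likelihood, the sum over samples of $(y_i-\sigma(z_i))x_i$, can be checked numerically and then used for plain gradient ascent. The sketch below is only an illustration under stated assumptions: the tiny dataset, the step size 0.1, the iteration count, and all function names are made up, and the first column of `X` is a constant 1 standing in for the intercept. It is not presented as the training procedure used in the original notes.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))


def log_likelihood(w, X, y):
    """L(w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def gradient(w, X, y):
    """Analytic gradient: sum_i (y_i - sigmoid(z_i)) * x_i."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(dot(w, xi))
        for k, xk in enumerate(xi):
            grad[k] += err * xk
    return grad


def numerical_gradient(w, X, y, eps=1e-6):
    """Central-difference approximation, used only as a sanity check."""
    grad = []
    for k in range(len(w)):
        up = list(w)
        down = list(w)
        up[k] += eps
        down[k] -= eps
        grad.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))
    return grad


# Made-up, non-separable data; first column is the constant intercept term.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 1, 0, 1]

w0 = [0.1, -0.2]
analytic = gradient(w0, X, y)
numeric = numerical_gradient(w0, X, y)
print(analytic, numeric)

# Gradient ascent: step along the gradient to increase L(w).
w = [0.0, 0.0]
start_ll = log_likelihood(w, X, y)
for _ in range(5000):
    g = gradient(w, X, y)
    w = [wk + 0.1 * gk for wk, gk in zip(w, g)]
final_ll = log_likelihood(w, X, y)
print(w, start_ll, final_ll)
```

The agreement between the analytic and numerical gradients is a cheap way to catch sign or index mistakes in a hand derivation before relying on it for optimization.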
- Logistic function