Tumor Prediction with Logistic Regression

1 Background Knowledge

1.1 Cross-Entropy and Log-Likelihood

1.1.1 Entropy

Let $X$ be a discrete random variable taking a finite number of values, with probability distribution (probability mass function)
$$P(X=x_i)=p_i,\quad i=1,2,\cdots,n$$

The higher the probability of an event, the less information its occurrence carries, so the information content (self-information) of the event $X=x_i$ is defined as
$$I(x_i)=-\log p_i$$

In information theory and statistics, entropy is a measure of the uncertainty of a random variable. The entropy of the random variable $X$ is defined as
$$H(X)=-\sum_{i=1}^n p_i \log p_i$$

In the formula above, if $p_i=0$ we adopt the convention $0\log 0=0$. By definition, the entropy depends only on the distribution of $X$ and not on the actual values $X$ takes, so it can also be written as $H(p)$:
$$H(p)=-\sum_{i=1}^n p_i \log p_i$$
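As a quick sanity check, here is a minimal NumPy sketch (the helper name `entropy` is ours, not from the original) that evaluates $H(p)$ in bits; a fair coin gives 1 bit, and the distribution $(\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8})$ used in the next subsection gives 1.75 bits.

import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i * log2(p_i), with the convention 0*log(0) = 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))            # 1.0 bit
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits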

1.1.2 Cross-Entropy

Now suppose we have two probability distributions over the same sample set, $p(x)$ and $q(x)$, where $p(x)$ is the true distribution and $q(x)$ a non-true (approximating) distribution. If the true distribution $p(x)$ is used, the expected code length needed to identify a sample (the average code length) is exactly the entropy $H(p)$ defined above. If the non-true distribution $q(x)$ is used to encode samples coming from the true distribution $p(x)$, the average code length becomes the cross-entropy:
$$H(p,q)=-\sum_i p_i \log q_i$$

For example, consider a random variable $X$ with true distribution $p(x)=(\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8})$ and non-true distribution $q(x)=(\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4})$. Then
$$H(p,q)=-\left(\tfrac{1}{2}\log_2\tfrac{1}{4}+\tfrac{1}{4}\log_2\tfrac{1}{4}+\tfrac{1}{8}\log_2\tfrac{1}{4}+\tfrac{1}{8}\log_2\tfrac{1}{4}\right)=2\ \text{bits}$$

1.1.3 KL Divergence

The KL divergence measures the difference between two distributions: the amount of information lost when $q$ is used to approximate $p$.
$$\text{KL}(p,q)=H(p,q)-H(p)=\sum_i p_i \log \frac{p_i}{q_i}$$
Applied to machine learning, if the true distribution is $p_r(y|\boldsymbol{x})$ and the predicted distribution is $p_{\theta}(y|\boldsymbol{x})$, the KL divergence measures the difference between the two, so the learning objective is to minimize the gap between the predicted and true distributions:
$$\min \text{KL}(p_r(y|\boldsymbol{x}),p_{\theta}(y|\boldsymbol{x}))=\sum p_r(y|\boldsymbol{x}) \log \frac{p_r(y|\boldsymbol{x})}{p_{\theta}(y|\boldsymbol{x})} \propto -\sum p_r(y|\boldsymbol{x}) \log p_{\theta}(y|\boldsymbol{x})$$
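A minimal NumPy sketch (variable names ours) that verifies the numbers from the example above: $H(p)=1.75$ bits, $H(p,q)=2$ bits, and therefore $\text{KL}(p,q)=0.25$ bits.

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/8])  # true distribution
q = np.array([1/4, 1/4, 1/4, 1/4])  # approximating distribution

H_p  = -np.sum(p * np.log2(p))      # entropy H(p)              -> 1.75
H_pq = -np.sum(p * np.log2(q))      # cross-entropy H(p, q)     -> 2.0
kl   = np.sum(p * np.log2(p / q))   # KL(p, q) = H(p,q) - H(p)  -> 0.25
print(H_p, H_pq, kl)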

1.2 The Sigmoid Function

The sigmoid function is an S-shaped function commonly seen in biology, also known as the S-shaped growth curve. In deep learning, because it is monotonically increasing and its inverse is also monotonically increasing, the sigmoid function is often used as the activation function of a neural network, mapping a real-valued variable into the interval $(0,1)$.

$$\sigma(x)=\frac{1}{1+e^{-x}}$$

The derivative of the sigmoid function can be expressed in terms of the function itself:

$$\sigma'(x)=\sigma(x)\left(1-\sigma(x)\right)$$
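A short numerical sketch of both formulas; the finite-difference check (names ours) simply confirms that the closed-form derivative matches a numerical derivative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z, eps = 0.5, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central finite difference
print(sigmoid_grad(z), numeric)  # the two values agree to roughly 1e-10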

Plotted on two different horizontal scales, the sigmoid function looks quite different: when the x-axis runs from -5 to 5 the curve changes smoothly, whereas on a sufficiently large scale the sigmoid looks very much like a step function around $x=0$.

[Figure: the sigmoid function at two coordinate scales]

To implement a logistic regression classifier, we multiply each feature by a regression coefficient, add up all the results, and feed the sum into the sigmoid function, which yields a value between 0 and 1. Any value greater than 0.5 is assigned to class 1, and any value less than 0.5 to class 0. In this sense, logistic regression can also be viewed as a form of probability estimation.

Gradient ascent

The idea behind gradient ascent is that, to find the maximum of a function, the best strategy is to search along the direction of the function's gradient. If the gradient operator is written $\nabla$, the gradient of a function $f(x,y)$ is
$$\nabla f(x,y)=\begin{pmatrix}\dfrac{\partial f(x,y)}{\partial x}\\[2mm]\dfrac{\partial f(x,y)}{\partial y}\end{pmatrix}$$

2 The Logistic Regression Model

We first derive the logistic regression model, then minimize its loss function and learn the optimal parameters with gradient descent. This article will answer the following questions:

  1. Why use the sigmoid function?
  2. The logistic distribution, the binomial logistic regression model, and the log-odds
  3. The cross-entropy loss is exactly the log-likelihood loss
  4. Why choose the cross-entropy loss rather than the squared loss?

Logistic regression is a classic classification method that belongs to the family of log-linear models. The idea is to fit a regression formula to the classification boundary from the available data and then classify with it. Concretely, for a binary classification problem we only need a linear discriminant function $f(\pmb x)=\pmb{w}^\mathrm{T}\pmb{x}+b$. The points of the feature space $\mathbb{R}^D$ satisfying $f(\pmb x)=0$ form a separating hyperplane, called the decision boundary, which splits the feature space into two regions, one per class. A linear decision boundary for a binary problem is shown in the figure below, where the sample feature vector is $\pmb x=[x_1,x_2]$ and the weight vector is $\pmb w=[w_1,w_2]$.

There is a problem, however. A separating hyperplane splits the feature space into two parts, and a sample is assigned to the class of the part it falls in. But with such a hard classifier (the perceptron, for instance) we only know which class a sample is assigned to, not the probability that it belongs to that class. In other words, a sample may lie very far from the separating hyperplane or very close to it: if it is far away, it is very likely to belong to the class on its side, but if it is very close, there is also a substantial chance that it belongs to the other class, say 51% for class A and 49% for class B. A hard linear classifier would assign it to class A and ignore the 49% chance that it is class B. To obtain this probability, we introduce the sigmoid function.

$$\text{sigmoid}(z)=g(z)=\frac{1}{1+e^{-z}}$$

On top of the linear model, the sigmoid function squashes the prediction from $(-\infty,+\infty)$ into $(0,1)$, giving it a probabilistic meaning, i.e. converting a score into a probability[1]. This tells us the probability that a sample belongs to a given class. For an input sample $\pmb x$, the conditional probability of the positive class is the first formula below; the negative class could be used just as well, since $P(Y=0|\pmb x)=1-P(Y=1|\pmb x)$, which gives the second formula. Together they are the conditional distribution of the binomial logistic regression model.
$$P(Y=1|\pmb x)=\text{sigmoid}(f(\pmb x))=\frac{1}{1+e^{-(\pmb{w}^\mathrm{T}\pmb{x}+b)}}=\frac{e^{\pmb{w}^\mathrm{T}\pmb x+b}}{1+e^{\pmb{w}^\mathrm{T}\pmb x+b}}$$
$$P(Y=0|\pmb x)=\frac{1}{1+e^{\pmb{w}^\mathrm{T}\pmb x+b}}$$
Here $\pmb{x}\in\mathbb{R}^d$ is the input, $Y\in\{0,1\}$ is the output, and $\pmb{w}\in\mathbb{R}^d$ and $b\in\mathbb{R}$ are the parameters: $\pmb w$ is called the weight, $b$ the bias, and $\pmb{w}^\mathrm{T}\pmb x$ is an inner product. For a given input instance $\pmb x$, these two formulas give $P(Y=1|\pmb x)$ and $P(Y=0|\pmb x)$; logistic regression compares the two conditional probabilities and assigns $\pmb x$ to the class with the larger one.

For example, take class A as the positive class and class B as the negative class. If a sample's probability of belonging to class A is greater than 50% we classify it as A, and if it is less than 50% we classify it as B. In the expression below, $\hat{y}$ is the predicted label and $f(\pmb x)=0$ is the separating hyperplane: when a sample lies on the positive side of the hyperplane (the direction of the normal vector, $f(\pmb x)>0$), the probability of that class is the larger one ($\text{sigmoid}(f(\pmb x))>0.5$):
$$\hat{y}=\begin{cases}1 & \text{if }\ f(\pmb x)>0\Leftrightarrow\text{sigmoid}(f(\pmb x))>0.5\\-1 & \text{if }\ f(\pmb x)<0\Leftrightarrow\text{sigmoid}(f(\pmb x))<0.5\end{cases}$$

Given a training set of $N$ samples $\mathcal{D}=\{\pmb x^{(i)},y^{(i)}\}_{i=1}^N$ with $y^{(i)}\in\{+1,-1\}$, the linear model tries to learn parameters $\pmb w^*$ such that every sample $(\pmb x^{(i)},y^{(i)})$ satisfies, as far as possible,
$$f_{w^*}(\pmb x^{(i)})>0 \quad \text{if}\quad y^{(i)}=1$$
$$f_{w^*}(\pmb x^{(i)})<0 \quad \text{if}\quad y^{(i)}=-1$$
The two conditions can be combined: the parameters $\pmb w^*$ should satisfy, as far as possible,
$$y^{(i)}f_{w^*}(\pmb x^{(i)})>0,\quad \forall i\in[1,N]$$

3 Model Training

The conditional probabilities predicted by the model are:
$$P_w(Y=1|\pmb{x}^{(i)})=\hat{y}^{(i)}$$
$$P_w(Y=0|\pmb{x}^{(i)})=1-\hat{y}^{(i)}$$

For a single sample $(\pmb{x}^{(i)},y^{(i)})$ with label $y^{(i)}\in\{0,1\}$, the true conditional probabilities are:
$$P_r(Y=1|\pmb{x}^{(i)})=y^{(i)}$$
$$P_r(Y=0|\pmb{x}^{(i)})=1-y^{(i)}$$

3.1 Loss Function

To take advantage of the efficient, mature methods of convex optimization, such as conjugate gradient and quasi-Newton methods, many machine learning methods choose the model and loss function so that the optimization objective is convex. Many models, however (neural networks, for example), have non-convex objectives and can only settle for a local optimum.

Linear regression uses the squared loss, but logistic regression uses the cross-entropy loss[2]. Many references mention the log loss alongside the cross-entropy loss; the two are in fact the same thing[3].

3.1.1 Squared Loss

The squared loss is
$$\mathcal{L}(\boldsymbol w)=\frac{1}{2N}\sum_{i=1}^N\left(y^{(i)}-\hat y^{(i)}\right)^2$$
with gradient
$$\frac{\partial\mathcal{L}(\boldsymbol w)}{\partial\boldsymbol w}=-\frac{1}{N}\sum_{i=1}^N\left(y^{(i)}-\hat y^{(i)}\right)\hat y^{(i)}\left(1-\hat y^{(i)}\right)\boldsymbol x^{(i)}$$
Here $\hat{y}^{(i)}$ is a sigmoid function of $\boldsymbol w$, so the squared loss is a non-convex function of $\boldsymbol w$ with many local minima. For this reason it is a poor choice of loss function for logistic regression trained by gradient descent.

3.1.2 Cross-Entropy Loss

From the cross-entropy formula and the KL divergence, the per-sample loss is
$$H(p_r,p_w)=-\left(y^{(i)}\log\hat{y}^{(i)}+\left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right)\right)$$

Averaging over the training set gives the cross-entropy loss function:
$$\mathcal{L}(\pmb{w})=-\frac{1}{N}\sum_{i=1}^N\left(y^{(i)}\log\hat{y}^{(i)}+\left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right)\right)$$

3.1.3 Log-Likelihood

The cross-entropy loss is exactly the (negative) log-likelihood. First write down the likelihood function:
$$L(\boldsymbol w)=\prod_{i=1}^{N}\left[P(Y=1|\boldsymbol x^{(i)})\right]^{y^{(i)}}\left[1-P(Y=1|\boldsymbol x^{(i)})\right]^{1-y^{(i)}}=\prod_{i=1}^{N}\left(\hat y^{(i)}\right)^{y^{(i)}}\left(1-\hat y^{(i)}\right)^{1-y^{(i)}}$$
For convenience we take the logarithm, obtaining the log-likelihood:
$$\log L(\boldsymbol w)=\sum_{i=1}^N y^{(i)}\log\hat y^{(i)}+\left(1-y^{(i)}\right)\log\left(1-\hat y^{(i)}\right)$$
Maximizing the log-likelihood is therefore equivalent to minimizing the cross-entropy loss above, up to the sign and the constant factor $1/N$.
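A tiny numerical sketch of this equivalence on random toy data (all names and values here are illustrative): the averaged negative log-likelihood and the cross-entropy loss coincide.

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10)           # labels in {0, 1}
y_hat = rng.uniform(0.01, 0.99, size=10)  # model probabilities

cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
neg_log_likelihood = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / len(y)
print(np.isclose(cross_entropy, neg_log_likelihood))  # True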

3.2 Gradient

The gradient of the cross-entropy loss is:

$$\begin{aligned}\frac{\partial \mathcal{L}(\boldsymbol{w})}{\partial \boldsymbol{w}} & =-\frac{1}{N} \sum_{i=1}^{N}\left(y^{(i)} \frac{\hat{y}^{(i)}\left(1-\hat{y}^{(i)}\right)}{\hat{y}^{(i)}} \boldsymbol{x}^{(i)}-\left(1-y^{(i)}\right) \frac{\hat{y}^{(i)}\left(1-\hat{y}^{(i)}\right)}{1-\hat{y}^{(i)}} \boldsymbol{x}^{(i)}\right) \\ & =-\frac{1}{N} \sum_{i=1}^{N}\left(y^{(i)}\left(1-\hat{y}^{(i)}\right) \boldsymbol{x}^{(i)}-\left(1-y^{(i)}\right) \hat{y}^{(i)} \boldsymbol{x}^{(i)}\right) \\ & =-\frac{1}{N} \sum_{i=1}^{N} \boldsymbol{x}^{(i)}\left(y^{(i)}-\hat{y}^{(i)}\right)\end{aligned}$$

Training logistic regression with gradient descent then proceeds as follows: initialize $\boldsymbol{w}_0\leftarrow 0$ and iteratively update the parameters with
$$\boldsymbol{w}_{t+1}\leftarrow\boldsymbol{w}_{t}+\alpha\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}\left(y^{(i)}-\hat{y}^{(i)}\right)$$

3.3 Optimization

With the gradient in hand, we can learn the parameters $\boldsymbol w$ by gradient descent, as in the sketch below.
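A minimal, self-contained sketch of this training loop on synthetic data (the data, helper names, and hyperparameters are illustrative, not from the original):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N, d = 200, 2
X = rng.normal(size=(N, d))
X_aug = np.hstack([np.ones((N, 1)), X])     # prepend a bias column
true_w = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=N) < sigmoid(X_aug @ true_w)).astype(float)  # labels drawn from the logistic model

w = np.zeros(d + 1)                          # w_0 <- 0
alpha = 0.5
for _ in range(1000):
    y_hat = sigmoid(X_aug @ w)
    grad = -X_aug.T @ (y - y_hat) / N        # gradient of the cross-entropy loss
    w -= alpha * grad                        # i.e. w <- w + alpha * mean(x * (y - y_hat))
print(w)                                     # should land reasonably close to true_w, up to sampling noise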

II. Predicting Student Admission with Logistic Regression

Problem (Andrew Ng's Machine Learning, exercise ex2data1)
Andrew Ng's Machine Learning — logistic regression
Use a logistic regression model to predict whether a student will be admitted to a university. Suppose you are the administrator of a university department and want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set: for each training sample you have the applicant's scores on the two exams and the admission decision.
Your task is to build a classification model that estimates an applicant's probability of admission from the scores on these two exams.

Andrew Ng's dataset ex2data1.txt:

34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1

Hypothesis function
$$h_\theta(x)=\frac{1}{1+e^{-\theta^\mathrm{T}x}}$$
Cross-entropy loss function
$$\begin{aligned} J(\theta) & = \frac{1}{m} \sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}),y^{(i)}\right) \\ & = -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\log{h_\theta(x^{(i)})}+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] \end{aligned}$$

Gradient
$$\frac{\partial}{\partial \theta_{j}} J(\theta)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$$
Gradient descent update
$$\theta:=\theta-\alpha \frac{\partial}{\partial\theta_j}J(\theta)$$

1.1 Reading the data

data = pd.read_csv("./ex2data1.txt", header=None, sep=",", names=['score1', 'score2', 'admit'])

1.2 Data preprocessing

Build the augmented (design) matrix and normalize the features.

data = np.asarray(data)
X = np.insert(
        data[:, 0:2],
        obj=0,
        values=1.,
        axis=1
    )
y = data[:, -1]

This constructs the feature matrix $X_{100\times 3}=[\vec{x_0},\vec{x_1},\vec{x_2}]$, where $\vec{x_0}$ is the all-ones bias column; the normalization step is shown below.
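The feature normalization referred to above (standardization by the column mean and standard deviation, exactly as in the complete program later in this section):

# standardize the two exam-score columns; the bias column stays equal to 1
mean = np.mean(X[:, 1:], axis=0)
std = np.std(X[:, 1:], axis=0)
X[:, 1:] = (X[:, 1:] - mean) / std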

1.3 Defining the hypothesis function

$$h_\theta(x)=\frac{1}{1+e^{-\theta^\mathrm{T}x}}$$

def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def h(theta, X):
    return sigmoid(X.dot(theta))

1.4 Loss function

The cross-entropy loss function:
$$\begin{aligned} J(\theta) & = \frac{1}{m} \sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}),y^{(i)}\right) \\ & = -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\log{h_\theta(x^{(i)})}+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] \end{aligned}$$
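Its implementation is the `lossFunction` helper that also appears in the complete program below:

def lossFunction(theta, X, y):
    m = X.shape[0]
    return -1 / m * np.sum(y * np.log(h(theta, X)) + (1 - y) * np.log(1 - h(theta, X)))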

1.5 Gradient descent
def gradient(theta, X, y):
    m = X.shape[0]
    return 1 / m * np.dot((h(theta, X) - y), X)


def gradientDescent(theta, X, y):
    for i in range(maxIter):
        theta -= alpha * gradient(theta, X, y)
    return theta

1.6 Plotting the decision boundary

[Figure: training data with the fitted decision boundary]

1.7 Computing classification accuracy

print(classification_report(y, predict(theta, X)))
classification_report(
         y_true,
         y_pred,
         labels=None,
         target_names=None,
         sample_weight=None,
         digits=2,
         output_dict=False,
         zero_division="warn"
)
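For reference, `predict` simply thresholds the sigmoid output at 0.5, in line with the decision rule described earlier (its definition is part of the complete program below):

def predict(theta, X):
    return [1 if p > 0.5 else 0 for p in sigmoid(X.dot(theta))]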

[Figure: classification report output]

1.8 Complete program

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report


def lossFunction(theta, X, y):
    m = X.shape[0]
    return -1 / m * np.sum(y * np.log(h(theta, X)) + (1 - y) * np.log(1 - h(theta, X)))


def gradient(theta, X, y):
    m = X.shape[0]
    return 1 / m * np.dot((h(theta, X) - y), X)


def gradientDescent(theta, X, y):
    for i in range(maxIter):
        theta -= alpha * gradient(theta, X, y)
    return theta


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def h(theta, X):
    return sigmoid(X.dot(theta))


def plotData(data):
    positive = data[data.admit == 1]  # samples with admit == 1
    negative = data[data.admit == 0]  # samples with admit == 0
    plt.scatter(positive.score1, positive.score2, c='r', marker='+', label='Admitted')
    plt.scatter(negative.score1, negative.score2, c='b', marker='o', label='Not Admitted')
    plt.legend(loc=1)
    plt.xlabel('Exam1 Score')
    plt.ylabel('Exam2 Score')
    # plt.show()


def plotBD(theta):
    x1 = np.linspace(20, 100, 2)  # two points are enough to draw a straight line; both exam scores lie roughly in [20, 100]
    # the features were standardized, so undo the scaling when solving theta0 + theta1*z1 + theta2*z2 = 0 for x2
    x2 = mean[1] - std[1] * (theta[0] + theta[1] * (x1 - mean[0]) / std[0]) / theta[2]
    plt.plot(x1, x2, color="black", label="decision boundary")
    plt.legend(loc=0)
    plt.show()


def predict(theta, X):
    # classify as 1 when the predicted probability exceeds 0.5, matching the decision rule in the text
    return [1 if p > 0.5 else 0 for p in sigmoid(X.dot(theta))]


if __name__ == "__main__":
    data = pd.read_csv("./ex2data1.txt", header=None, sep=",", names=['score1', 'score2', 'admit'])
    print(data)
    plotData(data)
    data = np.asarray(data)
    X = np.insert(
        data[:, 0:2],
        obj=0,
        values=1.,
        axis=1
    )
    y = data[:, -1]
    print(X)
    print(y)
    mean = np.mean(X[:, 1:], axis=0)
    std = np.std(X[:, 1:], axis=0)
    X[:, 1:] = (X[:, 1:] - mean) / std

    alpha = 0.2
    maxIter = 10000

    theta = np.zeros(shape=(3))

    theta = gradientDescent(theta, X, y)
    print("梯度下降,得出参数:", theta)
    plotBD(theta)
    print(classification_report(y, predict(theta, X)))

III. Tumor Prediction with Logistic Regression

The basic workflow for solving a machine learning problem (the code below follows the same flow):

  1. Load the dataset
  2. Preprocess the data (data cleaning, dimensionality reduction, etc.)
  3. Exploratory data analysis (EDA)
  4. Feature engineering (goal: turn the data into the form the algorithm requires — encode non-numeric data numerically and rescale numeric data)
  5. Split the data into training and test sets
  6. Train the model on the training set
  7. Run the trained model on the test set
  8. Evaluate the model (accuracy)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

Loading the dataset

# column names for the Wisconsin breast cancer dataset
feature_names = ['Sample code number',          # sample ID
                 'Clump Thickness',
                 'Uniformity of Cell Size',
                 'Uniformity of Cell Shape',
                 'Marginal Adhesion',
                 'Single Epithelial Cell Size',
                 'Bare Nuclei',
                 'Bland Chromatin',
                 'Normal Nucleoli',
                 'Mitoses',
                 'Class'                        # tumor class: 2 = benign, 4 = malignant
                ]
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", 
                   header=None, 
                   names=feature_names
                  )
data.head()
|   | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
# shape of the data
data.shape
(699, 11)
# overall information about the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB

Data cleaning

# no missing values
data.isnull().sum()
Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
# check the dtypes: the Bare Nuclei column has dtype object
data.dtypes
Sample code number              int64
Clump Thickness                 int64
Uniformity of Cell Size         int64
Uniformity of Cell Shape        int64
Marginal Adhesion               int64
Single Epithelial Cell Size     int64
Bare Nuclei                    object
Bland Chromatin                 int64
Normal Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object
# a closer look shows this column contains '?'; treat these as missing values and drop the rows containing '?'
data = data.drop(index=data[data["Bare Nuclei"] == '?'].index)
data.shape
(683, 11)
# convert the column to a numeric type
data[ ["Bare Nuclei"] ] = data[ ["Bare Nuclei"] ].apply(pd.to_numeric)
data.dtypes
Sample code number             int64
Clump Thickness                int64
Uniformity of Cell Size        int64
Uniformity of Cell Shape       int64
Marginal Adhesion              int64
Single Epithelial Cell Size    int64
Bare Nuclei                    int64
Bland Chromatin                int64
Normal Nucleoli                int64
Mitoses                        int64
Class                          int64
dtype: object

Exploratory data analysis

# descriptive statistics
data.describe()
|       | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|-------|---|---|---|---|---|---|---|---|---|---|---|
| count | 6.830000e+02 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 | 683.000000 |
| mean  | 1.076720e+06 | 4.442167 | 3.150805 | 3.215227 | 2.830161 | 3.234261 | 3.544656 | 3.445095 | 2.869693 | 1.603221 | 2.699854 |
| std   | 6.206440e+05 | 2.820761 | 3.065145 | 2.988581 | 2.864562 | 2.223085 | 3.643857 | 2.449697 | 3.052666 | 1.732674 | 0.954592 |
| min   | 6.337500e+04 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
| 25%   | 8.776170e+05 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 2.000000 |
| 50%   | 1.171795e+06 | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 1.000000 | 2.000000 |
| 75%   | 1.238705e+06 | 6.000000 | 5.000000 | 5.000000 | 4.000000 | 4.000000 | 6.000000 | 5.000000 | 4.000000 | 1.000000 | 4.000000 |
| max   | 1.345435e+07 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 4.000000 |

Checking whether the classes are balanced

Class imbalance means that the sample counts of the different classes (labels) differ dramatically, clearly exceeding a 4:1 ratio; it mainly arises in classification problems. When classes are imbalanced, the minority class contains too few samples to extract reliable patterns from. Even if a model is obtained, it tends to rely too heavily on those limited samples and overfit, so its accuracy and robustness on new data will be poor.

By data size, class imbalance can be divided into:

  • Imbalance in large datasets: for example, a minority class of 500,000 rows in a dataset of 10 million.
  • Imbalance in small datasets: for example, a minority class of 10 rows in a dataset of 1,000 — a case of severe imbalance.
data["Class"].value_counts()
2    444
4    239
Name: Class, dtype: int64
plt.figure(figsize=(2,2))
sns.countplot(x=data["Class"])
<Axes: xlabel='Class', ylabel='count'>

[Figure: count plot of the two tumor classes]

Correlation analysis

A heat map, also called a correlation matrix plot, colors each cell according to the correlation coefficient between a pair of variables; the closer the value is to 1, the stronger the correlation.

sns.heatmap(data[feature_names[1:-1]].corr(), annot=True)
<Axes: >

[Figure: correlation heat map of the nine features]

Splitting the dataset

x_train, x_test, y_train, y_test = train_test_split(data[feature_names[1:-1]], data["Class"], test_size=0.2, random_state=47)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
((546, 9), (137, 9), (546,), (137,))

Feature engineering (standardization)

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # reuse the scaler fitted on the training set instead of refitting on the test set
x_train[:5]
array([[-0.53068778, -0.69906483, -0.74519367, -0.64248271, -0.56547165,
        -0.70539241, -0.15961065, -0.60997878, -0.33834478],
       [-0.17428492, -0.69906483, -0.06763207, -0.64248271, -0.56547165,
        -0.70539241, -0.58061266, -0.60997878, -0.33834478],
       [ 0.18211795, -0.03755808,  0.94871035, -0.64248271, -0.56547165,
        -0.70539241, -1.00161466, -0.60997878, -0.33834478],
       [-0.17428492, -0.36831145, -0.06763207,  0.77059104, -0.11474534,
         1.20938727,  1.52439737,  1.02421364, -0.33834478],
       [-1.24349351, -0.69906483, -0.74519367, -0.64248271, -0.56547165,
        -0.70539241, -1.00161466, -0.60997878, -0.33834478]])

Model training

model = LogisticRegression()
model.fit(x_train, y_train)
LogisticRegression()
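It can also be instructive to inspect the fitted parameters; this extra step is not in the original notebook, but `coef_` and `intercept_` are standard attributes of a fitted LogisticRegression:

# weights for the nine standardized features, plus the bias term
print(model.coef_)
print(model.intercept_)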

Model evaluation

y_predict = model.predict(x_test)
y_predict==y_test
563    True
557    True
640    True
638    True
179    True
       ... 
279    True
647    True
648    True
214    True
327    True
Name: Class, Length: 137, dtype: bool
# compute the accuracy on the test set
score = model.score(x_test, y_test)
score
0.9781021897810219
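Because accuracy alone can be misleading when the classes are not perfectly balanced, one might additionally print a per-class report — an extra step not in the original notebook, using sklearn's `classification_report` and `confusion_matrix`:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict, target_names=["benign (2)", "malignant (4)"]))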

References:

  • https://zhuanlan.zhihu.com/p/74874291

  1. [Machine Learning] Logistic regression: this one article is all you need — answers to most of the common questions about logistic regression, plus a code implementation ↩︎

  2. Logistic regression explained in detail ↩︎

  3. Log loss and cross-entropy loss ↩︎
