Introduction
Although logistic regression has "regression" in its name, it does not actually perform regression; it is a classic binary classification algorithm. That said, it can also handle multi-class problems, for example by building a separate binary classifier for each class (one-vs-rest), or by generalizing logistic regression to softmax regression.
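As a small aside on the softmax generalization just mentioned (my own sketch, not part of this article's example), softmax maps a vector of raw class scores to a probability distribution over the classes:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Three-class scores -> probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```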
I. Scenario Simulation
Suppose there is an exam with two subjects, A and B, and candidates who meet some requirement pass, but we do not know the exact passing criterion.
We want to use the scores and pass/fail outcomes of past candidates to learn a rule for deciding who passes. The available data are as follows:
Score_A | Score_B | Passed (1 = pass, 0 = fail) |
---|---|---|
34.62365962451697 | 78.0246928153624 | 0 |
30.28671076822607 | 43.894997524001006 | 0 |
35.84740876993872 | 72.90219802708364 | 0 |
60.18259938620975 | 86.30855209546827 | 1 |
79.0327360507101 | 75.3443764369103 | 1 |
45.083277476683385 | 56.316371781530506 | 0 |
61.10666453684766 | 96.51142588489624 | 1 |
75.02474556738889 | 46.55401354116538 | 1 |
76.09878670226256 | 87.42056971926803 | 1 |
84.43281996120034 | 43.533393310721095 | 1 |
95.86155507093572 | 38.22527805795094 | 0 |
75.01365838958247 | 30.60326323428011 | 0 |
82.3070533739948 | 76.48196330235605 | 1 |
69.36458875970939 | 97.71869196188608 | 1 |
39.538339143672225 | 76.0368108511588 | 0 |
53.9710521485623 | 89.20735013750205 | 1 |
69.07014406283025 | 52.740469730167646 | 1 |
67.94685547711617 | 46.67857410673128 | 0 |
70.66150955499434 | 92.92713789364832 | 1 |
76.97878372747499 | 47.575963649755316 | 1 |
67.37202754570876 | 42.83843832029179 | 0 |
89.67677575072081 | 65.79936592745237 | 1 |
50.534788289883004 | 48.85581152764205 | 0 |
34.21206097786789 | 44.209528598662885 | 0 |
77.9240914545704 | 68.9723599933059 | 1 |
62.27101367004632 | 69.95445795447587 | 1 |
80.19018075095659 | 44.821628932183536 | 1 |
93.114388797442 | 38.80067033713209 | 0 |
61.83020602312595 | 50.25610789244621 | 0 |
38.78580379679423 | 64.99568095539578 | 0 |
61.379289447425 | 72.80788731317097 | 1 |
85.40451939411645 | 57.05198397627122 | 1 |
52.10797973193984 | 63.127623768817145 | 0 |
52.04540476831827 | 69.43286012045222 | 1 |
40.236893735451105 | 71.16774802184875 | 0 |
54.63510555424817 | 52.21388588061123 | 0 |
33.91550010906887 | 98.86943574220612 | 0 |
64.17698887494485 | 80.90806058670817 | 1 |
74.78925295941542 | 41.57341522824434 | 0 |
34.1836400264419 | 75.2377203360134 | 0 |
83.90239366249155 | 56.30804621605328 | 1 |
51.54772026906181 | 46.85629026349976 | 0 |
94.44336776917852 | 65.56892160559052 | 1 |
82.36875375713919 | 40.61825515970618 | 0 |
51.04775177128865 | 45.82270145776001 | 0 |
62.222675761201884 | 52.060991948366784 | 0 |
77.19303492601364 | 70.4582000018096 | 1 |
97.7715992800023 | 86.7278223300282 | 1 |
62.073063796676465 | 96.76882412413984 | 1 |
91.56497449807442 | 88.69629254546598 | 1 |
79.94481794066932 | 74.16311935043757 | 1 |
99.2725269292572 | 60.99903099844988 | 1 |
90.5467141139985 | 43.39060180650027 | 1 |
34.52451385320009 | 60.39634245837173 | 0 |
50.286496118990705 | 49.80453881323059 | 0 |
49.58667721632031 | 59.80895099453265 | 0 |
97.64563396007769 | 68.86157272420604 | 1 |
32.57720016809309 | 95.59854761387876 | 0 |
74.24869136721598 | 69.82457122657193 | 1 |
71.7964620586338 | 78.45356224515052 | 1 |
75.39561146568029 | 85.75993667331619 | 1 |
35.28611281526193 | 47.02051394723416 | 0 |
56.253817497116245 | 39.261472510580184 | 0 |
30.05882244669796 | 49.59297386723685 | 0 |
44.66826172480893 | 66.45008614558913 | 0 |
66.56089447242954 | 41.09209807936973 | 0 |
40.45755098375164 | 97.53518548909935 | 1 |
49.07256321908844 | 51.88321182073966 | 0 |
80.27957401466998 | 92.11606081344084 | 1 |
66.74671856944039 | 60.99139402740989 | 1 |
32.722833040603234 | 43.307173064300635 | 0 |
64.0393204150601 | 78.03168802018232 | 1 |
72.34649422579923 | 96.22759296761404 | 1 |
60.4578857391896 | 73.09499809758037 | 1 |
58.840956217268015 | 75.85844831279043 | 1 |
99.82785779692128 | 72.36925193383884 | 1 |
47.26426910848174 | 88.47586499559782 | 1 |
50.45815980285988 | 75.80985952982456 | 1 |
60.45555629271532 | 42.50840943572217 | 0 |
82.2266615778557 | 42.719878537164576 | 0 |
88.91389641665329 | 69.80378889835472 | 1 |
94.83450672430195 | 45.69430680250754 | 1 |
67.31925746917527 | 66.58935317747915 | 1 |
57.238706315698614 | 59.51428198012957 | 1 |
80.36675600171273 | 90.96014789746954 | 1 |
68.46852178591112 | 85.59430710452014 | 1 |
42.0754545384731 | 78.84478600148044 | 0 |
75.47770200533905 | 90.42453899753963 | 1 |
78.6354243489802 | 96.64742716885642 | 1 |
52.34800398794108 | 60.76950525602592 | 0 |
94.09433112516793 | 77.15910509073893 | 1 |
90.44855097096365 | 87.50879176484702 | 1 |
55.48216114069585 | 35.570703472288656 | 0 |
74.49269241843041 | 84.84513684930135 | 1 |
89.8458067072098 | 45.35828361091658 | 1 |
83.48916274498238 | 48.38028579728175 | 1 |
42.261700809981704 | 87.10385094025456 | 1 |
99.31500880510393 | 68.77540947206617 | 1 |
55.34001756003703 | 64.9319380069486 | 1 |
74.77589300092767 | 89.52981289513276 | 1 |
We will treat the samples that passed the exam as positive examples and those that failed as negative examples.
II. Building the Classifier
Given this scenario, we can start by modeling it with a linear model:
$$y = Score_A \times \theta_A + Score_B \times \theta_B + \theta_0$$
If $y$ is greater than some threshold, we classify the sample as positive, meaning the candidate passes; otherwise we classify it as negative, meaning the candidate fails.
A purely linear model has a problem, however: when the input $x$ is large, the output $y$ is unbounded, which makes it hard to set a decision boundary.
We therefore pass the linear output through an activation function that maps it into a bounded range. Here we use the sigmoid function.
1. The sigmoid function
$$g(z) = \frac{1}{1+e^{-z}}$$
The sigmoid function has several convenient properties: it is continuous and differentiable everywhere, it is defined for any real input, and its range is the open interval (0, 1). Its graph is shown below:
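Since the original figure is not reproduced here, the following minimal matplotlib sketch (my own addition) regenerates the characteristic S-shaped curve:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
g = 1 / (1 + np.exp(-z))

plt.plot(z, g)
plt.axhline(0.5, linestyle="--", linewidth=0.8)  # the 0.5 decision threshold
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("The sigmoid function")
plt.show()
```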
We can use this function to map the linear model's predictions into (0, 1), which makes classification convenient:
$$\begin{aligned} h_\theta(x) &= g(\theta x)\\ &=\frac{1}{1+e^{-\theta x}} \end{aligned}$$
For example, setting the threshold to 0.5, we predict a positive example (the candidate passes) when $h_\theta(x)>0.5$, and a negative example (the candidate fails) when $h_\theta(x)<0.5$. The threshold can be tuned up or down depending on how the classifier performs in practice.
2. The loss function
Readers fresh from studying linear regression might immediately think of reusing its loss function:
$$\begin{aligned} J(\theta) &= \sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)^2 \\ &= \sum_{i=1}^{m}\left(y^{(i)}-\frac{1}{1+e^{-\theta x^{(i)}}}\right)^2 \end{aligned}$$
Using least squares as the loss function of logistic regression is feasible in theory, but awkward in practice: the squared-error loss built this way is not a convex function of $\theta$ in logistic regression, so it may have many local optima, as illustrated below:
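The original figure is not reproduced here; as a stand-in (my own toy illustration, with made-up one-feature data), the sketch below plots the squared-error loss of a one-parameter sigmoid model. The flat plateaus at the extremes already violate convexity:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])   # hypothetical single feature
y = np.array([0, 0, 1, 1, 1])               # hypothetical labels

def squared_loss(theta):
    h = 1 / (1 + np.exp(-theta * x))        # sigmoid of the linear score
    return np.sum((y - h) ** 2)

thetas = np.linspace(-10, 10, 400)
plt.plot(thetas, [squared_loss(t) for t in thetas])
plt.xlabel("theta")
plt.ylabel("squared-error loss")
plt.title("Squared error composed with sigmoid is not convex")
plt.show()
```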
For this reason, logistic regression does not use least squares for its loss function; it uses maximum likelihood instead.
Since this is a binary classification problem, we write the probabilities of a sample being a positive or a negative example as:
$$\begin{aligned} & P(y=1|x;\theta)= h_\theta(x) \\ & P(y=0|x;\theta)= 1-h_\theta(x) \end{aligned}$$
We can merge the two expressions into one:
$$P(y|x;\theta) = (h_\theta(x))^y(1-h_\theta(x))^{1-y} \tag{1}$$
This merge works because in a binary problem $y$ only takes the values 1 and 0: when $y=1$ the left factor remains, and when $y=0$ the right factor remains.
The likelihood function $L$ over the whole sample set is:
$$\begin{aligned} L(\theta) &= \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}} \end{aligned} \tag{2}$$
To make taking derivatives easier, we work with the log-likelihood:
$$\begin{aligned} \ln L(\theta) &= \ln \prod_{i=1}^m(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}} \\ &= \sum_{i=1}^{m} \ln\left[(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\right] \\ &= \sum_{i=1}^{m}\left( y^{(i)}\ln h_\theta(x^{(i)}) +(1-y^{(i)})\ln (1-h_{\theta}(x^{(i)}))\right) \end{aligned}$$
With the log-likelihood in hand, our goal becomes finding the value of $\theta$ that maximizes it.
We could maximize it directly with gradient ascent, but to connect with the gradient descent covered earlier, we introduce:
$$J(\theta)=-\frac{1}{m}\ln L(\theta)$$
This converts the problem into a gradient descent task on the objective function $J(\theta)$.
3. Differentiating the objective function
$$\begin{aligned} \nabla J(\theta) &=\nabla\left(-\frac{1}{m}\ln L(\theta)\right) \\ &=-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\frac{1}{h_{\theta}(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_{\theta}(x^{(i)})}\right)\nabla g(\theta x^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\frac{1}{g(\theta x^{(i)})}-(1-y^{(i)})\frac{1}{1-g(\theta x^{(i)})}\right)g(\theta x^{(i)})(1-g(\theta x^{(i)}))\nabla(\theta x^{(i)})\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}(1-g(\theta x^{(i)}))-(1-y^{(i)})g(\theta x^{(i)})\right)x^{(i)}\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x^{(i)} \end{aligned}$$
The derivation uses the basic rules for differentiating sums, products, and quotients together with the chain rule for composite functions; the key identity is the sigmoid derivative $g'(z)=g(z)(1-g(z))$, applied in the second step. We do not expand every step here.
In the final expression, the $y^{(i)}$ and $x^{(i)}$ are all known from the data, so the gradient can be computed directly in code.
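Before coding it up, the closed-form gradient can also be verified numerically (my own sanity check, not in the original) by comparing it against a central finite-difference estimate of $J(\theta)$ on random data:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def J(thetas, X, y):
    h = sigmoid(X @ thetas)                      # predicted probabilities
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(thetas, X, y):
    return -(X.T @ (y - sigmoid(X @ thetas))) / X.shape[0]

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # bias column + 2 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)
thetas = rng.normal(size=(3, 1))

eps = 1e-6
numeric = np.zeros_like(thetas)
for j in range(thetas.size):
    d = np.zeros_like(thetas)
    d[j] = eps
    numeric[j] = (J(thetas + d, X, y) - J(thetas - d, X, y)) / (2 * eps)

print(np.allclose(analytic_grad(thetas, X, y), numeric, atol=1e-6))  # expect True
```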
III. Implementing Logistic Regression by Hand
1. Implementing the sigmoid function
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))  # maps any real input into (0, 1)
2. Building the prediction model
def model(X, thetas):
    return sigmoid(np.dot(X, thetas))  # linear score passed through sigmoid
3. Implementing the classifier
def classify(X, thetas, threshold):
    return model(X, thetas) > threshold  # boolean array: True = predicted positive
4. Implementing gradient descent
def gradient_function(thetas, X, y):
    '''Compute the gradient of J(theta).
    '''
    diff = y - model(X, thetas)  # residuals y - h_theta(x)
    return -(1/X.shape[0])*np.dot(diff.reshape(1, X.shape[0]), X).T
def gradient_descent(X, y, alpha):
    '''Iterate gradient descent until convergence.
    '''
    thetas = np.zeros(X.shape[1]).reshape(X.shape[1], 1)  # start from all-zero parameters
    gradient = gradient_function(thetas, X, y)
    while not np.all(np.absolute(gradient) < 1e-2):  # stop once every gradient component is small
        thetas = thetas-alpha*gradient
        gradient = gradient_function(thetas, X, y)
    return thetas
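One practical note (my own addition, not in the original): with raw scores in the 30-100 range and the tight 1e-2 stopping criterion, a fixed learning rate like 0.001 needs very many iterations. A common remedy is to standardize the feature columns first. A hypothetical sketch, assuming X and y are laid out as in the test section below (bias column first):

```python
# Standardize the two feature columns (skip column 0, the bias column of ones);
# gradient descent then typically converges much faster with a larger alpha.
X_std = X.astype(float).copy()
X_std[:, 1:] = (X_std[:, 1:] - X_std[:, 1:].mean(axis=0)) / X_std[:, 1:].std(axis=0)
thetas_std = gradient_descent(X_std, y, alpha=0.1)
```

Note that the thetas learned on standardized features are on a different scale from those learned on raw scores.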
5. Evaluation metric functions
def accuracy(prediction, reality):
    '''Accuracy: fraction of predictions matching the labels.
    '''
    count = 0
    arr = (prediction == reality).ravel()  # flatten the column vector
    for item in arr:
        if item == True:
            count += 1
    return count/len(arr)
def recall(prediction, reality):
    '''Recall: TP / (TP + FN).
    '''
    prediction, reality = prediction.ravel(), reality.ravel()
    count_tp = 0
    count_fn = 0
    for i in range(len(prediction)):
        if reality[i] == True and prediction[i] == True:
            count_tp += 1
        if reality[i] == True and prediction[i] == False:
            count_fn += 1
    return count_tp/(count_tp+count_fn)
def precision(prediction, reality):
    '''Precision: TP / (TP + FP).
    '''
    prediction, reality = prediction.ravel(), reality.ravel()
    count_tp = 0
    count_fp = 0
    for i in range(len(prediction)):
        if reality[i] == True and prediction[i] == True:
            count_tp += 1
        if reality[i] == False and prediction[i] == True:
            count_fp += 1
    return count_tp/(count_fp+count_tp)
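To double-check these hand-written metrics (my own addition, assuming scikit-learn is installed and that prediction and reality are 0/1 column vectors as produced by the test section below), the same quantities can be computed with sklearn.metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = reality.ravel().astype(int)    # flatten column vectors to 1-D int arrays
y_pred = prediction.ravel().astype(int)
print(accuracy_score(y_true, y_pred))   # should match accuracy(prediction, reality)
print(recall_score(y_true, y_pred))     # should match recall(prediction, reality)
print(precision_score(y_true, y_pred))  # should match precision(prediction, reality)
```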
IV. Complete Code and Test Results
We wrap the core logistic regression functions into a class and test them on the dataset from this article:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
class LogisticRegression:
    @classmethod
    def sigmoid(cls, x):
        return 1/(1+np.exp(-x))
    @classmethod
    def model(cls, X, thetas):
        return cls.sigmoid(np.dot(X, thetas))
    @classmethod
    def classify(cls, X, thetas, threshold):
        return cls.model(X, thetas) > threshold
    @classmethod
    def gradient_function(cls, thetas, X, y):
        '''Compute the gradient of J(theta).
        '''
        diff = y - cls.model(X, thetas)  # residuals y - h_theta(x)
        return -(1/X.shape[0])*np.dot(diff.reshape(1, X.shape[0]), X).T
    @classmethod
    def gradient_descent(cls, X, y, alpha):
        '''Iterate gradient descent until every gradient component is small.
        '''
        thetas = np.zeros(X.shape[1]).reshape(X.shape[1], 1)
        gradient = cls.gradient_function(thetas, X, y)
        while not np.all(np.absolute(gradient) < 1e-2):
            thetas = thetas-alpha*gradient
            gradient = cls.gradient_function(thetas, X, y)
        return thetas
    @classmethod
    def accuracy(cls, prediction, reality):
        '''Accuracy: fraction of predictions matching the labels.
        '''
        count = 0
        arr = (prediction == reality).ravel()
        for item in arr:
            if item == True:
                count += 1
        return count/len(arr)
    @classmethod
    def recall(cls, prediction, reality):
        '''Recall: TP / (TP + FN).
        '''
        prediction, reality = prediction.ravel(), reality.ravel()
        count_tp = 0
        count_fn = 0
        for i in range(len(prediction)):
            if reality[i] == True and prediction[i] == True:
                count_tp += 1
            if reality[i] == True and prediction[i] == False:
                count_fn += 1
        return count_tp/(count_tp+count_fn)
    @classmethod
    def precision(cls, prediction, reality):
        '''Precision: TP / (TP + FP).
        '''
        prediction, reality = prediction.ravel(), reality.ravel()
        count_tp = 0
        count_fp = 0
        for i in range(len(prediction)):
            if reality[i] == True and prediction[i] == True:
                count_tp += 1
            if reality[i] == False and prediction[i] == True:
                count_fp += 1
        return count_tp/(count_fp+count_tp)
if __name__ == "__main__":
    # Load the dataset shown above, saved as LogiReg_data.csv with no header row
    pdData = pd.read_csv("LogiReg_data.csv", header=None,
                         names=["Score_A", "Score_B", "Admitted"])
    pdData.insert(0, "one", 1)  # prepend a bias column of ones
    orig_data = pdData.values
    cols = orig_data.shape[1]
    X = orig_data[:, 0:cols-1]
    y = orig_data[:, cols-1:cols]
    thetas = LogisticRegression.gradient_descent(X, y, 0.001)
    print(thetas)  # print the fitted thetas
    prediction = LogisticRegression.classify(X, thetas, 0.5)  # classifier threshold set to 0.5
    print(prediction)  # print the predicted classes
    accuracy = LogisticRegression.accuracy(prediction, y)
    print("accuracy:" + str(accuracy))  # print the accuracy
    recall = LogisticRegression.recall(prediction, y)
    print("recall:" + str(recall))  # print the recall
    precision = LogisticRegression.precision(prediction, y)
    print("precision:" + str(precision))  # print the precision
Output:
thetas:
[[-11.83740615]
[ 0.09998542]
[ 0.09388469]]
accuracy:0.91
recall:0.95
precision:0.9047619047619048
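As a final cross-check (my own addition, assuming scikit-learn is available and reusing orig_data from the main block above), we can compare against scikit-learn's logistic regression. The coefficients will not match exactly, since scikit-learn regularizes by default and uses a different optimizer, but with weak regularization they should be close:

```python
from sklearn.linear_model import LogisticRegression as SkLR

features = orig_data[:, 1:3]            # the two score columns (sklearn adds its own intercept)
labels = orig_data[:, 3].astype(int)
clf = SkLR(C=1e6)                       # very large C ~ almost unregularized maximum likelihood
clf.fit(features, labels)
print(clf.intercept_, clf.coef_)        # compare with the thetas printed above
print(clf.score(features, labels))      # mean accuracy, compare with 0.91
```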