Introduction
Although logistic regression has "regression" in its name, it does not actually perform regression; it is a classic binary classification algorithm. That said, it can also handle multi-class problems, for example by building a separate binary classifier for each class (one-vs-rest), or by generalizing logistic regression to softmax regression.
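As a small aside on the softmax generalization just mentioned (my own sketch, not part of this article's example), softmax maps a vector of raw class scores to a probability distribution over the classes:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Three-class scores -> probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```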
I. Scenario Simulation
Suppose there is an exam with two subjects, A and B, and candidates who meet some requirement pass, but we do not know the exact passing criterion.
We want to use the scores and pass/fail outcomes of past candidates to learn a rule for deciding who passes. The available data are as follows:
Score_A | Score_B | Passed (1 = pass, 0 = fail) |
---|---|---|
34.62365962451697 | 78.0246928153624 | 0 |
30.28671076822607 | 43.894997524001006 | 0 |
35.84740876993872 | 72.90219802708364 | 0 |
60.18259938620975 | 86.30855209546827 | 1 |
79.0327360507101 | 75.3443764369103 | 1 |
45.083277476683385 | 56.316371781530506 | 0 |
61.10666453684766 | 96.51142588489624 | 1 |
75.02474556738889 | 46.55401354116538 | 1 |
76.09878670226256 | 87.42056971926803 | 1 |
84.43281996120034 | 43.533393310721095 | 1 |
95.86155507093572 | 38.22527805795094 | 0 |
75.01365838958247 | 30.60326323428011 | 0 |
82.3070533739948 | 76.48196330235605 | 1 |
69.36458875970939 | 97.71869196188608 | 1 |
39.538339143672225 | 76.0368108511588 | 0 |
53.9710521485623 | 89.20735013750205 | 1 |
69.07014406283025 | 52.740469730167646 | 1 |
67.94685547711617 | 46.67857410673128 | 0 |
70.66150955499434 | 92.92713789364832 | 1 |
76.97878372747499 | 47.575963649755316 | 1 |
67.37202754570876 | 42.83843832029179 | 0 |
89.67677575072081 | 65.79936592745237 | 1 |
50.534788289883004 | 48.85581152764205 | 0 |
34.21206097786789 | 44.209528598662885 | 0 |
77.9240914545704 | 68.9723599933059 | 1 |
62.27101367004632 | 69.95445795447587 | 1 |
80.19018075095659 | 44.821628932183536 | 1 |
93.114388797442 | 38.80067033713209 | 0 |
61.83020602312595 | 50.25610789244621 | 0 |
38.78580379679423 | 64.99568095539578 | 0 |
61.379289447425 | 72.80788731317097 | 1 |
85.40451939411645 | 57.05198397627122 | 1 |
52.10797973193984 | 63.127623768817145 | 0 |
52.04540476831827 | 69.43286012045222 | 1 |
40.236893735451105 | 71.16774802184875 | 0 |
54.63510555424817 | 52.21388588061123 | 0 |
33.91550010906887 | 98.86943574220612 | 0 |
64.17698887494485 | 80.90806058670817 | 1 |
74.78925295941542 | 41.57341522824434 | 0 |
34.1836400264419 | 75.2377203360134 | 0 |
83.90239366249155 | 56.30804621605328 | 1 |
51.54772026906181 | 46.85629026349976 | 0 |
94.44336776917852 | 65.56892160559052 | 1 |
82.36875375713919 | 40.61825515970618 | 0 |
51.04775177128865 | 45.82270145776001 | 0 |
62.222675761201884 | 52.060991948366784 | 0 |
77.19303492601364 | 70.4582000018096 | 1 |
97.7715992800023 | 86.7278223300282 | 1 |
62.073063796676465 | 96.76882412413984 | 1 |
91.56497449807442 | 88.69629254546598 | 1 |
79.94481794066932 | 74.16311935043757 | 1 |
99.2725269292572 | 60.99903099844988 | 1 |
90.5467141139985 | 43.39060180650027 | 1 |
34.52451385320009 | 60.39634245837173 | 0 |
50.286496118990705 | 49.80453881323059 | 0 |
49.58667721632031 | 59.80895099453265 | 0 |
97.64563396007769 | 68.86157272420604 | 1 |
32.57720016809309 | 95.59854761387876 | 0 |
74.24869136721598 | 69.82457122657193 | 1 |
71.7964620586338 | 78.45356224515052 | 1 |
75.39561146568029 | 85.75993667331619 | 1 |
35.28611281526193 | 47.02051394723416 | 0 |
56.253817497116245 | 39.261472510580184 | 0 |
30.05882244669796 | 49.59297386723685 | 0 |
44.66826172480893 | 66.45008614558913 | 0 |
66.56089447242954 | 41.09209807936973 | 0 |
40.45755098375164 | 97.53518548909935 | 1 |
49.07256321908844 | 51.88321182073966 | 0 |
80.27957401466998 | 92.11606081344084 | 1 |
66.74671856944039 | 60.99139402740989 | 1 |
32.722833040603234 | 43.307173064300635 | 0 |
64.0393204150601 | 78.03168802018232 | 1 |
72.34649422579923 | 96.22759296761404 | 1 |
60.4578857391896 | 73.09499809758037 | 1 |
58.840956217268015 | 75.85844831279043 | 1 |
99.82785779692128 | 72.36925193383884 | 1 |
47.26426910848174 | 88.47586499559782 | 1 |
50.45815980285988 | 75.80985952982456 | 1 |
60.45555629271532 | 42.50840943572217 | 0 |
82.2266615778557 | 42.719878537164576 | 0 |
88.91389641665329 | 69.80378889835472 | 1 |
94.83450672430195 | 45.69430680250754 | 1 |
67.31925746917527 | 66.58935317747915 | 1 |
57.238706315698614 | 59.51428198012957 | 1 |
80.36675600171273 | 90.96014789746954 | 1 |
68.46852178591112 | 85.59430710452014 | 1 |
42.0754545384731 | 78.84478600148044 | 0 |
75.47770200533905 | 90.42453899753963 | 1 |
78.6354243489802 | 96.64742716885642 | 1 |
52.34800398794108 | 60.76950525602592 | 0 |
94.09433112516793 | 77.15910509073893 | 1 |
90.44855097096365 | 87.50879176484702 | 1 |
55.48216114069585 | 35.570703472288656 | 0 |
74.49269241843041 | 84.84513684930135 | 1 |
89.8458067072098 | 45.35828361091658 | 1 |
83.48916274498238 | 48.38028579728175 | 1 |
42.261700809981704 | 87.10385094025456 | 1 |
99.31500880510393 | 68.77540947206617 | 1 |
55.34001756003703 | 64.9319380069486 | 1 |
74.77589300092767 | 89.52981289513276 | 1 |
We will treat the samples that passed the exam as positive examples and those that failed as negative examples.
II. Building the Classifier
Given this scenario, we can start by modeling it with a linear model:
$$y = Score_A \times \theta_A + Score_B \times \theta_B + \theta_0$$
If $y$ is greater than some threshold, we classify the sample as positive, meaning the candidate passes; otherwise we classify it as negative, meaning the candidate fails.
A purely linear model has a problem, however: when the input $x$ is large, the output $y$ is unbounded, which makes it hard to set a decision boundary.
We therefore pass the linear output through an activation function that maps it into a bounded range. Here we use the sigmoid function.
1. The sigmoid function
$$g(z) = \frac{1}{1+e^{-z}}$$
The sigmoid function has several convenient properties: it is continuous and differentiable everywhere, it is defined for any real input, and its range is the open interval (0, 1). Its graph is shown below:
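Since the original figure is not reproduced here, the following minimal matplotlib sketch (my own addition) regenerates the characteristic S-shaped curve:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
g = 1 / (1 + np.exp(-z))

plt.plot(z, g)
plt.axhline(0.5, linestyle="--", linewidth=0.8)  # the 0.5 decision threshold
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("The sigmoid function")
plt.show()
```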
We can use this function to map the linear model's predictions into (0, 1), which makes classification convenient:
$$\begin{aligned} h_\theta(x) &= g(\theta x)\\ &=\frac{1}{1+e^{-\theta x}} \end{aligned}$$
For example, setting the threshold to 0.5, we predict a positive example (the candidate passes) when $h_\theta(x)>0.5$, and a negative example (the candidate fails) when $h_\theta(x)<0.5$. The threshold can be tuned up or down depending on how the classifier performs in practice.
2. The loss function
Readers fresh from studying linear regression might immediately think of reusing its loss function:
$$\begin{aligned} J(\theta) &= \sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)^2 \\ &= \sum_{i=1}^{m}\left(y^{(i)}-\frac{1}{1+e^{-\theta x^{(i)}}}\right)^2 \end{aligned}$$
Using least squares as the loss function of logistic regression is feasible in theory, but awkward in practice: the squared-error loss built this way is not a convex function of $\theta$ in logistic regression, so it may have many local optima, as illustrated below:
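The original figure is not reproduced here; as a stand-in (my own toy illustration, with made-up one-feature data), the sketch below plots the squared-error loss of a one-parameter sigmoid model. The flat plateaus at the extremes already violate convexity:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])   # hypothetical single feature
y = np.array([0, 0, 1, 1, 1])               # hypothetical labels

def squared_loss(theta):
    h = 1 / (1 + np.exp(-theta * x))        # sigmoid of the linear score
    return np.sum((y - h) ** 2)

thetas = np.linspace(-10, 10, 400)
plt.plot(thetas, [squared_loss(t) for t in thetas])
plt.xlabel("theta")
plt.ylabel("squared-error loss")
plt.title("Squared error composed with sigmoid is not convex")
plt.show()
```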
For this reason, logistic regression does not use least squares for its loss function; it uses maximum likelihood instead.
Since this is a binary classification problem, we write the probabilities of a sample being a positive or a negative example as:
$$\begin{aligned} & P(y=1|x;\theta)= h_\theta(x) \\ & P(y=0|x;\theta)= 1-h_\theta(x) \end{aligned}$$
We can merge the two expressions into one:
$$P(y|x;\theta) = (h_\theta(x))^y(1-h_\theta(x))^{1-y} \tag{1}$$
This merge works because in a binary problem $y$ only takes the values 1 and 0: when $y=1$ the left factor remains, and when $y=0$ the right factor remains.
The likelihood function $L$ over the whole sample set is:
$$\begin{aligned} L(\theta) &= \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}} \end{aligned} \tag{2}$$
To make taking derivatives easier, we work with the log-likelihood:
$$\begin{aligned} \ln L(\theta) &= \ln \prod_{i=1}^m(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}} \\ &= \sum_{i=1}^{m} \ln\left[(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\right] \\ &= \sum_{i=1}^{m}\left( y^{(i)}\ln h_\theta(x^{(i)}) +(1-y^{(i)})\ln (1-h_{\theta}(x^{(i)}))\right) \end{aligned}$$
With the log-likelihood in hand, our goal becomes finding the value of $\theta$ that maximizes it.
We could maximize it directly with gradient ascent, but to connect with the gradient descent covered earlier, we introduce:
$$J(\theta)=-\frac{1}{m}\ln L(\theta)$$
This converts the problem into a gradient descent task on the objective function $J(\theta)$.
3. Differentiating the objective function
$$\begin{aligned} \nabla J(\theta) &=\nabla\left(-\frac{1}{m}\ln L(\theta)\right) \\ &=-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\frac{1}{h_{\theta}(x^{(i)})}-(1-y^{(i)})\frac{1}{1-h_{\theta}(x^{(i)})}\right)\nabla g(\theta x^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\frac{1}{g(\theta x^{(i)})}-(1-y^{(i)})\frac{1}{1-g(\theta x^{(i)})}\right)g(\theta x^{(i)})(1-g(\theta x^{(i)}))\nabla(\theta x^{(i)})\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}(1-g(\theta x^{(i)}))-(1-y^{(i)})g(\theta x^{(i)})\right)x^{(i)}\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}(x^{(i)})\right)x^{(i)} \end{aligned}$$
The derivation uses the basic rules for differentiating sums, products, and quotients together with the chain rule for composite functions; the key identity is the sigmoid derivative $g'(z)=g(z)(1-g(z))$, applied in the second step. We do not expand every step here.
In the final expression, the $y^{(i)}$ and $x^{(i)}$ are all known from the data, so the gradient can be computed directly in code.
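Before coding it up, the closed-form gradient can also be verified numerically (my own sanity check, not in the original) by comparing it against a central finite-difference estimate of $J(\theta)$ on random data:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def J(thetas, X, y):
    h = sigmoid(X @ thetas)                      # predicted probabilities
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(thetas, X, y):
    return -(X.T @ (y - sigmoid(X @ thetas))) / X.shape[0]

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # bias column + 2 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)
thetas = rng.normal(size=(3, 1))

eps = 1e-6
numeric = np.zeros_like(thetas)
for j in range(thetas.size):
    d = np.zeros_like(thetas)
    d[j] = eps
    numeric[j] = (J(thetas + d, X, y) - J(thetas - d, X, y)) / (2 * eps)

print(np.allclose(analytic_grad(thetas, X, y), numeric, atol=1e-6))  # expect True
```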
III. Implementing Logistic Regression by Hand
1. Implementing the sigmoid function
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))  # maps any real input into (0, 1)
2. Building the prediction model
def model(X, thetas):
    return sigmoid(np.dot(X, thetas))  # linear score passed through sigmoid
3. Implementing the classifier
def classify(X, thetas, threshold):
    return model(X, thetas) > threshold  # boolean array: True = predicted positive
4. Implementing gradient descent
def gradient_function(thetas, X, y):
    '''Compute the gradient of J(theta).
    '''
    diff = y - model(X, thetas)  # residuals y - h_theta(x)
    return -(1/X.shape[0])*np.dot(diff.reshape(1, X.shape[0]), X).T
def gradient_descent(X, y, alpha):
    '''Iterate gradient descent until convergence.
    '''
    thetas = np.zeros(X.shape[1]).reshape(X.shape[1], 1)  # start from all-zero parameters
    gradient = gradient_function(thetas, X, y)
    while not np.all(np.absolute(gradient) < 1e-2):  # stop once every gradient component is small
        thetas = thetas-alpha*gradient
        gradient = gradient_function(thetas, X, y)
    return thetas
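One practical note (my own addition, not in the original): with raw scores in the 30-100 range and the tight 1e-2 stopping criterion, a fixed learning rate like 0.001 needs very many iterations. A common remedy is to standardize the feature columns first. A hypothetical sketch, assuming X and y are laid out as in the test section below (bias column first):

```python
# Standardize the two feature columns (skip column 0, the bias column of ones);
# gradient descent then typically converges much faster with a larger alpha.
X_std = X.astype(float).copy()
X_std[:, 1:] = (X_std[:, 1:] - X_std[:, 1:].mean(axis=0)) / X_std[:, 1:].std(axis=0)
thetas_std = gradient_descent(X_std, y, alpha=0.1)
```

Note that the thetas learned on standardized features are on a different scale from those learned on raw scores.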
5. Evaluation metric functions
def accuracy(prediction, reality):
    '''Accuracy: fraction of predictions matching the labels.
    '''
    count = 0
    arr = (prediction == reality).ravel()  # flatten the column vector
    for item in arr:
        if item == True:
            count += 1
    return count/len(arr)
def recall(prediction, reality):
    '''Recall: TP / (TP + FN).
    '''
    prediction, reality = prediction.ravel(), reality.ravel()
    count_tp = 0
    count_fn = 0
    for i in range(len(prediction)):
        if reality[i] == True and prediction[i] == True:
            count_tp += 1
        if reality[i] == True and prediction[i] == False:
            count_fn += 1
    return count_tp/(count_tp+count_fn)
def precision(prediction, reality):
    '''Precision: TP / (TP + FP).
    '''
    prediction, reality = prediction.ravel(), reality.ravel()
    count_tp = 0
    count_fp = 0
    for i in range(len(prediction)):
        if reality[i] == True and prediction[i] == True:
            count_tp += 1
        if reality[i] == False and prediction[i] == True:
            count_fp += 1
    return count_tp/(count_fp+count_tp)
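To double-check these hand-written metrics (my own addition, assuming scikit-learn is installed and that prediction and reality are 0/1 column vectors as produced by the test section below), the same quantities can be computed with sklearn.metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = reality.ravel().astype(int)    # flatten column vectors to 1-D int arrays
y_pred = prediction.ravel().astype(int)
print(accuracy_score(y_true, y_pred))   # should match accuracy(prediction, reality)
print(recall_score(y_true, y_pred))     # should match recall(prediction, reality)
print(precision_score(y_true, y_pred))  # should match precision(prediction, reality)
```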
IV. Complete Code and Test Results
We wrap the core logistic regression functions into a class and test them on the dataset from this article:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
class LogisticRegression:
    @classmethod
    def sigmoid(cls, x):
        return 1/(1+np.exp(-x))
    @classmethod
    def model(cls, X, thetas):
        return cls.sigmoid(np.dot(X, thetas))
    @classmethod
    def classify(cls, X, thetas, threshold):
        return cls.model(X, thetas) > threshold
    @classmethod
    def gradient_function(cls, thetas, X, y):
        '''Compute the gradient of J(theta).
        '''
        diff = y - cls.model(X, thetas)  # residuals y - h_theta(x)
        return -(1/X.shape[0])*np.dot(diff.reshape(1, X.shape[0]), X).T
    @classmethod
    def gradient_descent(cls, X, y, alpha):
        '''Iterate gradient descent until every gradient component is small.
        '''
        thetas = np.zeros(X.shape[1]).reshape(X.shape[1], 1)
        gradient = cls.gradient_function(thetas, X, y)
        while not np.all(np.absolute(gradient) < 1e-2):
            thetas = thetas-alpha*gradient
            gradient = cls.gradient_function(thetas, X, y)
        return thetas
    @classmethod
    def accuracy(cls, prediction, reality):
        '''Accuracy: fraction of predictions matching the labels.
        '''
        count = 0
        arr = (prediction == reality).ravel()
        for item in arr:
            if item == True:
                count += 1
        return count/len(arr)
    @classmethod
    def recall(cls, prediction, reality):
        '''Recall: TP / (TP + FN).
        '''
        prediction, reality = prediction.ravel(), reality.ravel()
        count_tp = 0
        count_fn = 0
        for i in range(len(prediction)):
            if reality[i] == True and prediction[i] == True:
                count_tp += 1
            if reality[i] == True and prediction[i] == False:
                count_fn += 1
        return count_tp/(count_tp+count_fn)
    @classmethod
    def precision(cls, prediction, reality):
        '''Precision: TP / (TP + FP).
        '''
        prediction, reality = prediction.ravel(), reality.ravel()
        count_tp = 0
        count_fp = 0
        for i in range(len(prediction)):
            if reality[i] == True and prediction[i] == True:
                count_tp += 1
            if reality[i] == False and prediction[i] == True:
                count_fp += 1
        return count_tp/(count_fp+count_tp)
if __name__ == "__main__":
    # Load the dataset shown above, saved as LogiReg_data.csv with no header row
    pdData = pd.read_csv("LogiReg_data.csv", header=None,
                         names=["Score_A", "Score_B", "Admitted"])
    pdData.insert(0, "one", 1)  # prepend a bias column of ones
    orig_data = pdData.values
    cols = orig_data.shape[1]
    X = orig_data[:, 0:cols-1]
    y = orig_data[:, cols-1:cols]
    thetas = LogisticRegression.gradient_descent(X, y, 0.001)
    print(thetas)  # print the fitted thetas
    prediction = LogisticRegression.classify(X, thetas, 0.5)  # classifier threshold set to 0.5
    print(prediction)  # print the predicted classes
    accuracy = LogisticRegression.accuracy(prediction, y)
    print("accuracy:" + str(accuracy))  # print the accuracy
    recall = LogisticRegression.recall(prediction, y)
    print("recall:" + str(recall))  # print the recall
    precision = LogisticRegression.precision(prediction, y)
    print("precision:" + str(precision))  # print the precision
Output:
thetas:
[[-11.83740615]
[ 0.09998542]
[ 0.09388469]]
accuracy:0.91
recall:0.95
precision:0.9047619047619048
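As a final cross-check (my own addition, assuming scikit-learn is available and reusing orig_data from the main block above), we can compare against scikit-learn's logistic regression. The coefficients will not match exactly, since scikit-learn regularizes by default and uses a different optimizer, but with weak regularization they should be close:

```python
from sklearn.linear_model import LogisticRegression as SkLR

features = orig_data[:, 1:3]            # the two score columns (sklearn adds its own intercept)
labels = orig_data[:, 3].astype(int)
clf = SkLR(C=1e6)                       # very large C ~ almost unregularized maximum likelihood
clf.fit(features, labels)
print(clf.intercept_, clf.coef_)        # compare with the thetas printed above
print(clf.score(features, labels))      # mean accuracy, compare with 0.91
```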