Machine Learning in Action ---- Logistic Regression

1. Introduction:

This post implements logistic regression from Machine Learning in Action: first gradient ascent from scratch on a toy dataset, then the same kind of classifier built with sklearn.


 

References:

Logistic regression: loss function and gradient descent (formula derivations):

https://blog.csdn.net/jediael_lu/article/details/77852060

 

Logistic regression fundamentals with Python implementations (likelihood intuition, derivatives, and several implementation variants):

https://blog.csdn.net/csqazwsxedc/article/details/69690655

 

 

2. Gradient ascent:

Function: f(x) = -x^2 + 4x

Derivative: f'(x) = -2x + 4

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.linspace(0, 4)
y = -x**2 + 4*x

plt.plot(x, y, '-k')

# Work through one update by hand and the idea becomes clear

def grad_ascent():
    def f_prime(x_old):
        # derivative of f(x) = -x**2 + 4x
        return -2 * x_old + 4
    x_old = 4                  # just needs to differ from x_new at the start
    x_new = 0                  # first trial point
    alpha = 0.01               # step size
    precision = 0.00000001     # stop when successive steps barely move
    while abs(x_new - x_old) > precision:
        x_old = x_new
        x_new = x_old + alpha * f_prime(x_old)
    print(x_new)

grad_ascent()
1.999999515279857

 

Mathematical expression:

x_new = x_old + alpha * f'(x_old), iterated until |x_new - x_old| drops below the stopping precision.

3. Logistic regression formulas

The hypothesis is the sigmoid of a linear combination of the features: h(x) = 1 / (1 + e^(-theta^T x)).

Likelihood function:

L(theta) = prod_i h(x_i)^(y_i) * (1 - h(x_i))^(1 - y_i)

Taking logs gives the log-likelihood l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ].

Gradient iteration formula for gradient ascent (maximize l(theta)):

theta_j := theta_j + alpha * sum_i (y_i - h(x_i)) * x_ij

Iteration formula for gradient descent (minimize -l(theta)):

theta_j := theta_j - alpha * sum_i (h(x_i) - y_i) * x_ij

Gradient ascent and gradient descent are really the same formula; the only difference is the sign, because the descent form differentiates the negative log-likelihood while the ascent form does not take the negative.
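To make the sign relationship concrete, here is a minimal sketch with toy random data (not the book's dataset): one gradient-ascent step on the log-likelihood equals one gradient-descent step on the negative log-likelihood.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # data, constant column assumed included
y = rng.integers(0, 2, size=(100, 1)).astype(float)   # 0/1 labels
w = np.ones((3, 1))
alpha = 0.001

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

h = sigmoid(X @ w)
w_ascent = w + alpha * X.T @ (y - h)     # ascent step on l(theta)
w_descent = w - alpha * X.T @ (h - y)    # descent step on -l(theta)
print(np.allclose(w_ascent, w_descent))  # True: identical updates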

 

4. Code implementation

Notes:

1. loadDataSet():

Prepend a constant 1.0 to every sample so the intercept becomes an ordinary weight.

For the computation, convert the Python lists to NumPy matrices with mat().
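A quick sketch of what the constant term buys us (the row below is the first line of testSet.txt; the all-ones weights are placeholders):

import numpy as np

# With the leading 1.0, the intercept is just another weight:
# w0*1 + w1*x1 + w2*x2 collapses into a single matrix product.
row = [1.0, -0.017612, 14.053064]    # [constant, x1, x2], first row of testSet.txt
w = np.mat(np.ones((3, 1)))          # placeholder weights
print(np.mat([row]) * w)             # [[15.035452]] = w0 + w1*x1 + w2*x2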

 

2. The computation, gradAscent():

Both the data and the labels are converted to NumPy matrices.

" * ": matrix multiplication (for NumPy matrices)

Dimensions:

  • Data: 100 rows x 3 columns (with the constant term added)
  • Labels: 100 rows x 1 column
  • Initial weights: 3 rows x 1 column

 

Steps in each loop iteration (a shape walk-through follows this list):

  1. Data matrix (100x3) * weight matrix (3x1): the result is 100x1.
  2. Pass the product (100x1) through sigmoid(): the result is 100x1, the predictions.
  3. Labels (100x1) minus predictions (100x1): the result is 100x1, the error.
  4. Weight matrix (3x1) plus step size * transposed data matrix (3x100) * error (100x1): the result is 3x1, the updated weight matrix.
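Below is a minimal shape walk-through of one iteration, with hypothetical random data standing in for testSet.txt:

import numpy as np

data = np.mat(np.random.rand(100, 3))                # 100x3, constant column included
labels = np.mat(np.random.randint(0, 2, (100, 1)))   # 100x1 class labels
weights = np.ones((3, 1))                            # 3x1 initial weights

h = 1.0 / (1 + np.exp(-(data * weights)))   # (100x3)*(3x1) -> 100x1 predictions
error = labels - h                          # 100x1 error
weights = weights + 0.001 * data.T * error  # (3x100)*(100x1) -> 3x1 updated weights
print(h.shape, error.shape, weights.shape)  # (100, 1) (100, 1) (3, 1)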

 

from numpy import *
filename = 'testSet.txt'
def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open(filename)
    for line in fr.readlines():
        # strip() removes leading/trailing characters (whitespace/newlines by default);
        # split() slices the string on a separator (whitespace by default)
        lineArr = line.strip().split()
        # The leading 1.0 is the constant term: with two features X1, X2 we need
        # three parameters, W0 + W1*X1 + W2*X2
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))

    return dataMat, labelMat

def sigmoid(inX):
    return 1.0/(1 + exp(-inX))

def gradAscent(dataMat, labelMat):
    # mat() converts a list to a matrix so that linear-algebra operations apply.
    # A list converts to a row matrix by default, so transpose() is needed
    # to get a column matrix.
    # ones() is an all-ones matrix, zeros() all-zeros, eye() the identity.
    # For matrices, * is true matrix multiplication (rows times columns).
    dataMatrix = mat(dataMat)
    classLabels = mat(labelMat).transpose()
    m, n = shape(dataMatrix)
    alpha = 0.001       # step size
    maxCycles = 500     # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix*weights)                          # 100x1 predictions
        error = (classLabels - h)                                # 100x1 error
        weights = weights + alpha*dataMatrix.transpose()*error   # 3x1 update
    return weights

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # Decision boundary: 0 = w0 + w1*x1 + w2*x2, solved for x2
    y = (-weights[0] - weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()


def main():
    dataMat, labelMat = loadDataSet()
    # getA() converts the weight matrix to a plain ndarray for plotting
    weights = gradAscent(dataMat, labelMat).getA()
    plotBestFit(weights)

if __name__ == "__main__":
    main()

    

Data (testSet.txt):

-0.017612   14.053064   0
-1.395634   4.662541    1
-0.752157   6.538620    0
-1.322371   7.152853    0
0.423363    11.054677   0
0.406704    7.067335    1
0.667394    12.741452   0
-2.460150   6.866805    1
0.569411    9.548755    0
-0.026632   10.427743   0
0.850433    6.920334    1
1.347183    13.175500   0
1.176813    3.167020    1
-1.781871   9.097953    0
-0.566606   5.749003    1
0.931635    1.589505    1
-0.024205   6.151823    1
-0.036453   2.690988    1
-0.196949   0.444165    1
1.014459    5.754399    1
1.985298    3.230619    1
-1.693453   -0.557540   1
-0.576525   11.778922   0
-0.346811   -1.678730   1
-2.124484   2.672471    1
1.217916    9.597015    0
-0.733928   9.098687    0
-3.642001   -1.618087   1
0.315985    3.523953    1
1.416614    9.619232    0
-0.386323   3.989286    1
0.556921    8.294984    1
1.224863    11.587360   0
-1.347803   -2.406051   1
1.196604    4.951851    1
0.275221    9.543647    0
0.470575    9.332488    0
-1.889567   9.542662    0
-1.527893   12.150579   0
-1.185247   11.309318   0
-0.445678   3.297303    1
1.042222    6.105155    1
-0.618787   10.320986   0
1.152083    0.548467    1
0.828534    2.676045    1
-1.237728   10.549033   0
-0.683565   -2.166125   1
0.229456    5.921938    1
-0.959885   11.555336   0
0.492911    10.993324   0
0.184992    8.721488    0
-0.355715   10.325976   0
-0.397822   8.058397    0
0.824839    13.730343   0
1.507278    5.027866    1
0.099671    6.835839    1
-0.344008   10.717485   0
1.785928    7.718645    1
-0.918801   11.560217   0
-0.364009   4.747300    1
-0.841722   4.119083    1
0.490426    1.960539    1
-0.007194   9.075792    0
0.356107    12.447863   0
0.342578    12.281162   0
-0.810823   -1.466018   1
2.530777    6.476801    1
1.296683    11.607559   0
0.475487    12.040035   0
-0.783277   11.009725   0
0.074798    11.023650   0
-1.337472   0.468339    1
-0.102781   13.763651   0
-0.147324   2.874846    1
0.518389    9.887035    0
1.015399    7.571882    0
-1.658086   -0.027255   1
1.319944    2.171228    1
2.056216    5.019981    1
-0.851633   4.375691    1
-1.510047   6.061992    0
-1.076637   -3.181888   1
1.821096    10.283990   0
3.010150    8.401766    1
-1.099458   1.688274    1
-0.834872   -1.733869   1
-0.846637   3.849075    1
1.400102    12.628781   0
1.752842    5.468166    1
0.078557    0.059736    1
0.089392    -0.715300   1
1.825662    12.693808   0
0.197445    9.744638    0
0.126117    0.922311    1
-0.679797   1.220530    1
0.677983    2.556666    1
0.761349    10.693862   0
-2.168791   0.143632    1
1.388610    9.341997    0
0.317029    14.739025   0

 

 

5. Building logistic regression with sklearn

 

Reference: https://blog.csdn.net/c406495762/article/details/77851973

 

The task: predict whether a horse with colic will die, based on its symptoms.

Original dataset download: http://archive.ics.uci.edu/ml/datasets/Horse+Colic

The data contains 368 samples with 28 features.

The original dataset has been preprocessed and saved as two files: horseColicTest.txt and horseColicTraining.txt.


 

As you can see, the core sklearn code is just two lines:

classifier = LogisticRegression(solver='liblinear', max_iter=10).fit(trainingSet, trainingLabels)
test_accuracy = classifier.score(testSet, testLabels) * 100

 

from sklearn.linear_model import LogisticRegression

def colicSklearn():
    frTrain = open('horseColicTraining.txt')    # open the training set
    frTest = open('horseColicTest.txt')         # open the test set
    trainingSet = []; trainingLabels = []
    testSet = []; testLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(len(currLine)-1):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[-1]))
    for line in frTest.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(len(currLine)-1):
            lineArr.append(float(currLine[i]))
        testSet.append(lineArr)
        testLabels.append(float(currLine[-1]))
    classifier = LogisticRegression(solver='liblinear', max_iter=10).fit(trainingSet, trainingLabels)
    test_accuracy = classifier.score(testSet, testLabels) * 100
    print('Accuracy: %f%%' % test_accuracy)

if __name__ == '__main__':
    colicSklearn()

Result:

Accuracy: 73.134328%
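Note that max_iter=10 is very small, and liblinear may warn about not converging. A sketch for experimenting with other settings on the same files (the solver/max_iter combinations below are illustrative choices, not from the original post):

from sklearn.linear_model import LogisticRegression

def load(path):
    # Same tab-separated layout as colicSklearn() above: last column is the label.
    X, y = [], []
    with open(path) as f:
        for line in f:
            cols = line.strip().split('\t')
            X.append([float(v) for v in cols[:-1]])
            y.append(float(cols[-1]))
    return X, y

X_train, y_train = load('horseColicTraining.txt')
X_test, y_test = load('horseColicTest.txt')

for solver, max_iter in [('liblinear', 10), ('liblinear', 100), ('sag', 5000)]:
    clf = LogisticRegression(solver=solver, max_iter=max_iter).fit(X_train, y_train)
    print(solver, max_iter, '%.2f%%' % (clf.score(X_test, y_test) * 100))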

 
