CP05 Logistic regression

DB架构

Category: Machine Learning. Tags: machine learning

Article link: https://blog.csdn.net/u011868279/article/details/124263890

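This article introduces the basics of logistic regression and the sigmoid function, and explains in detail how two optimization algorithms, gradient ascent and stochastic gradient ascent, are used to train a logistic regression model. Worked examples show how to implement these algorithms in Python and plot the decision boundary. Finally, the article discusses strategies for handling missing values in the data and applies logistic regression to predicting horse fatalities from colic.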

This article covers

The sigmoid function and the logistic regression classifier
Our first look at optimization
The gradient descent optimization algorithm
Dealing with missing values in our data

General approach to logistic regression

Collect: Any method
Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
Analyze: Any method
Train: We'll spend most of the time training, where we try to find optimal coefficients to classify our data.
Test: Classification is quick and easy once the training step is done.
Use: This application needs to get some input data and output structured numeric values. Next, the application applies the simple regression calculation on this input data and determines which class the input data should belong to. The application then takes some action on the calculated class.

We'll look at two optimization algorithms, gradient ascent and stochastic gradient ascent. These algorithms will be used to train our classifier.

5.1 Classification with logistic regression and the sigmoid function: a tractable step function

Logistic regression

Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values

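We'd like a function that takes all of our features and predicts the class. With two classes, the function should spit out a 0 or a 1. A step function does exactly that, but it jumps from 0 to 1 instantly, which is hard to deal with mathematically. A function with similar behavior that is easier to deal with mathematically is the sigmoid. The sigmoid is given by the following equation: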

$\sigma(z)=\frac{1}{1+e^{-z}}$
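
To get a feel for the shape of the sigmoid, $1/(1+e^{-z})$, here is a minimal sketch that evaluates it at a few points. The sample inputs in `zs` are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate the sigmoid at a few points to see the step-like shape.
zs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
values = sigmoid(zs)
for z, s in zip(zs, values):
    print(f"sigmoid({z:+.1f}) = {s:.4f}")
```

The value is exactly 0.5 at z = 0, approaches 0 for large negative inputs, and approaches 1 for large positive inputs, so on a large enough scale it looks like a step function.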

5.2 Using optimization to find the best regression coefficients

The input to the sigmoid function described above will be z, where z is given by the following:

$z = w_{0}x_{0} + w_{1}x_{1} + \dots + w_{n}x_{n}$
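
As a small worked example of this weighted sum, the sketch below computes z for one input vector and passes it through the sigmoid. The weights `w` and the input `x` are made-up numbers, not values learned from any dataset.

```python
import numpy as np

# Hypothetical weights and one input vector; x[0] = 1.0 plays the role of the
# constant feature x0, so w[0] acts as an intercept.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 1.5])

z = np.dot(w, x)                    # z = w0*x0 + w1*x1 + w2*x2
prob = 1.0 / (1.0 + np.exp(-z))     # sigmoid(z), read as the probability of class 1
label = 1 if prob > 0.5 else 0      # threshold at 0.5 to pick a class

print(z, prob, label)
```

Training a classifier then amounts to finding the values of `w` that make these predictions match the known labels as closely as possible.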

5.2.1 Gradient ascent

The first optimization algorithm we're going to look at is called gradient ascent. Gradient ascent is based on the idea that if we want to find the maximum point on a function, then the best way to move is in the direction of the gradient.

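The gradient is written with the symbol $\nabla$. The gradient of a function $f(x,y)$ is the vector of its partial derivatives:

$\nabla f(x,y) = \begin{pmatrix} \frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y} \end{pmatrix}$

Gradient ascent repeatedly moves the weights a small step $\alpha$ in the direction of the gradient: $w := w + \alpha \nabla_{w} f(w)$.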
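
Before applying gradient ascent to logistic regression, it may help to see it on a function whose maximum is known. The sketch below uses a hypothetical function f(x, y) = -(x - 1)^2 - (y + 2)^2, which peaks at (1, -2). The function, the step size `alpha`, and the iteration count are choices made for this illustration only.

```python
import numpy as np

def f(w):
    """Toy function with a single maximum at (1, -2)."""
    x, y = w
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2

def grad_f(w):
    """Gradient of f: the vector of partial derivatives (df/dx, df/dy)."""
    x, y = w
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

alpha = 0.1                  # step size
w = np.array([0.0, 0.0])     # starting point
for _ in range(100):
    w = w + alpha * grad_f(w)    # move uphill along the gradient

print(w, f(w))
```

Each step moves the point a fraction of the way toward the peak, so after enough iterations `w` settles at the maximum.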

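Let's put this into action on our logistic regression classifier with some Python.

First, we need a dataset. The code below reads it from testSet.txt, where each line holds two feature values followed by a class label.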

5.2.2 Train: using gradient ascent to find the best parameters

'''
Author: Maxwell Pan
Date: 2022-04-19 06:46:59
LastEditTime: 2022-04-19 10:10:03
FilePath: \cp05\logRegres.py
Description: Logistic regression
Software:VSCode,env:
'''

import numpy as np

# Logistic regression gradient ascent optimization functions
def loadDataSet():
    dataMat = []; labelMat = []
    # The context manager closes the file once reading is done.
    with open('testSet.txt') as fr:
        for line in fr:
            lineArr = line.strip().split()
            # Prepend x0 = 1.0 so that w0 acts as the intercept term.
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))
    
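def gradAscent(dataMatIn, classLabels):
    # Convert the inputs to NumPy matrices so that * means matrix multiplication.
    # np.asmatrix also works on NumPy 2.x, where the np.mat alias is no longer available.
    dataMatrix = np.asmatrix(dataMatIn)                 # m x n
    labelMat = np.asmatrix(classLabels).transpose()     # m x 1 column vector
    m, n = np.shape(dataMatrix)
    alpha = 0.001       # step size
    maxCycles = 500     # number of iterations
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)   # predictions for all m samples
        error = labelMat - h                # difference between labels and predictions
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights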

Run the following in your notebook:

import logRegres
dataArr,labelMat=logRegres.loadDataSet()
logRegres.gradAscent(dataArr,labelMat)
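
Why does the update in gradAscent multiply the transposed data matrix by the error? A short derivation makes this explicit. It assumes the quantity being maximized is the log-likelihood of the labels, which is the standard objective for logistic regression but is not stated explicitly in this article. The symbols $X$ (data matrix), $y$ (label vector), and $h$ (vector of predictions) are introduced here for illustration.

```latex
\begin{aligned}
h_i &= \sigma(w^{T}x_i) = \frac{1}{1+e^{-w^{T}x_i}}, \qquad \sigma'(z) = \sigma(z)\,(1-\sigma(z)) \\
\ell(w) &= \sum_{i=1}^{m} \left[ y_i \log h_i + (1-y_i)\log(1-h_i) \right] \\
\frac{\partial \ell}{\partial w_j} &= \sum_{i=1}^{m} \left( \frac{y_i}{h_i} - \frac{1-y_i}{1-h_i} \right) h_i (1-h_i)\, x_{ij} = \sum_{i=1}^{m} (y_i - h_i)\, x_{ij} \\
\nabla_{w}\ell(w) &= X^{T}(y-h), \qquad w := w + \alpha X^{T}(y-h)
\end{aligned}
```

The last line corresponds to `weights + alpha * dataMatrix.transpose() * error` in the code, with `error = labelMat - h`.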

5.2.3 Analyze: plotting the decision boundary

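We're solving for a set of weights used to make a line that separates the different classes of data. The sigmoid returns 0.5 when its input is 0, so the boundary is the set of points where $w_{0} + w_{1}x_{1} + w_{2}x_{2} = 0$ (with $x_{0} = 1$). Solving for $x_{2}$ gives the line that the code plots: $x_{2} = (-w_{0} - w_{1}x_{1})/w_{2}$.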

# Plotting the logistic regression best-fit line and dataset.
def plotBestFit(wei):
    import matplotlib.pyplot as plt
    # Accept either a NumPy matrix (from gradAscent) or a 1-D array
    # (from the stochastic versions) and flatten it to three numbers.
    weights = np.asarray(wei).flatten()
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    xcord1 = []; ycord1 = []    # points with class label 1
    xcord2 = []; ycord2 = []    # points with class label 0
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    # Decision boundary: 0 = w0 + w1*x1 + w2*x2, solved for x2.
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

import logRegres
import importlib   # the imp module was removed in Python 3.12
importlib.reload(logRegres)
weights = logRegres.gradAscent(dataArr,labelMat)
logRegres.plotBestFit(weights)

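5.2.4 Train: stochastic gradient ascent

Gradient ascent uses the whole dataset on every update, which becomes expensive as the dataset grows. Stochastic gradient ascent updates the weights using only one instance at a time.

# Stochastic gradient ascent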

def stocGradAscent0(dataMatrix, classLabels):
    dataMatrix = np.array(dataMatrix)
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    # One pass over the data, updating the weights after every sample.
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

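def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    dataMatrix = np.array(dataMatrix)
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))      # samples not yet used in this pass
        for i in range(m):
            # The step size decreases with each update but never reaches 0.
            alpha = 4 / (1.0 + j + i) + 0.01
            # Pick a random position among the unused samples.
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del dataIndex[randIndex]    # do not reuse this sample in the current pass
    return weights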

import logRegres
dataArr,labelMat=logRegres.loadDataSet()
weights=logRegres.stocGradAscent1(dataArr,labelMat)
logRegres.plotBestFit(weights)
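
For readers who do not have testSet.txt at hand, the following is a self-contained sketch of the same idea on synthetic data. Everything about the data is an assumption made for illustration: two Gaussian blobs generated with fixed seeds, 50 points per class. The function name `stoc_grad_ascent` is likewise hypothetical; it follows the structure of stocGradAscent1 but uses a random permutation to visit each sample once per pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stoc_grad_ascent(X, y, num_iter=20, seed=0):
    """Stochastic gradient ascent: update the weights from one sample at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    weights = np.ones(n)
    for j in range(num_iter):
        # Visit every sample exactly once per pass, in random order.
        for i, idx in enumerate(rng.permutation(m)):
            alpha = 4.0 / (1.0 + j + i) + 0.01   # step size shrinks but never reaches 0
            h = sigmoid(np.dot(X[idx], weights))
            weights = weights + alpha * (y[idx] - h) * X[idx]
    return weights

# Synthetic data (an assumption for illustration): two Gaussian blobs.
rng = np.random.default_rng(42)
class0 = rng.normal(loc=[-2.0, 2.0], scale=1.0, size=(50, 2))
class1 = rng.normal(loc=[2.0, -2.0], scale=1.0, size=(50, 2))
features = np.vstack([class0, class1])
X = np.hstack([np.ones((100, 1)), features])     # prepend the constant feature x0 = 1
y = np.concatenate([np.zeros(50), np.ones(50)])

weights = stoc_grad_ascent(X, y)
predictions = (sigmoid(X @ weights) > 0.5).astype(float)
accuracy = np.mean(predictions == y)
print("weights:", weights, "training accuracy:", accuracy)
```

Because each update touches a single row of `X`, the cost per update does not depend on the size of the dataset, which is the main practical difference from the batch version.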

5.3 Example: estimating horse fatalities from colic

Example: using logistic regression to estimate horse fatalities from colic

1. Collect: Data file provided.

2. Prepare: Parse a text file in Python, and fill in missing values.

3. Analyze: Visually inspect the data.

4. Train: Use an optimization algorithm to find the best coefficients.

5. Test: To measure the success, we’ll look at error rate. Depending on the error rate, we may decide to go back to the training step to try to find better values for the regression coefficients by adjusting the number of iterations and step size.

6. Use: Building a simple command-line program to collect horse symptoms and output live/die diagnosis won’t be difficult. I’ll leave that up to you as an exercise.

5.3.1 Prepare: dealing with missing values in the data

Here are some options for handling a missing value:

■ Use the feature’s mean value from all the available data.

■ Fill in the unknown with a special value like -1.

■ Ignore the instance.

■ Use a mean value from similar items.

■ Use another machine learning algorithm to predict the value.
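
The first three options above can be sketched in a few lines of NumPy. The small `data` array is made up for illustration, and it assumes missing entries have already been encoded as `np.nan`. Which option suits a real dataset depends on the data and the algorithm.

```python
import numpy as np

# Toy feature matrix (made up for illustration); np.nan marks a missing value.
data = np.array([
    [1.0, 20.0],
    [2.0, np.nan],
    [np.nan, 40.0],
    [4.0, 60.0],
])

# Option 1: replace each missing entry with the mean of its column,
# computed from the values that are available.
col_means = np.nanmean(data, axis=0)
mean_filled = np.where(np.isnan(data), col_means, data)

# Option 2: replace each missing entry with a special value such as -1.
special_filled = np.where(np.isnan(data), -1.0, data)

# Option 3: ignore (drop) every instance that has a missing value.
complete_rows = data[~np.isnan(data).any(axis=1)]

print(mean_filled)
print(special_filled)
print(complete_rows)
```

Dropping instances is the simplest choice but throws away the known features of those rows, which matters when data is scarce.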

5.3.2 Test: classifying with logistic regression

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    trainingSet = []; trainingLabels = []
    # Each line holds 21 tab-separated features followed by the class label.
    with open('horseColicTraining.txt') as frTrain:
        for line in frTrain:
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]
            trainingSet.append(lineArr)
            trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 500)
    errorCount = 0; numTestVec = 0.0
    with open('horseColicTest.txt') as frTest:
        for line in frTest:
            numTestVec += 1.0
            currLine = line.strip().split('\t')
            lineArr = [float(currLine[i]) for i in range(21)]
            # Count a mistake when the predicted class differs from the label.
            # int(float(...)) also accepts labels written as '1.000000'.
            if int(classifyVector(np.array(lineArr), trainWeights)) != int(float(currLine[21])):
                errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum=0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" %(numTests,errorSum/float(numTests)))

import importlib   # the imp module was removed in Python 3.12
import logRegres
importlib.reload(logRegres)
logRegres.multiTest()

5.4 Summary

Logistic regression is finding best-fit parameters to a nonlinear function called the sigmoid. Methods of optimization can be used to find the best-fit parameters. Among the optimization algorithms, one of the most common algorithms is gradient ascent. Gradient ascent can be simplified with stochastic gradient ascent. Stochastic gradient ascent can do as well as gradient ascent using far fewer computing resources. In addition, stochastic gradient ascent is an online algorithm; it can update what it has learned as new data comes in rather than reloading all of the data as in batch processing.

One major problem in machine learning is how to deal with missing values in the data. There's no blanket answer to this question. It really depends on what you're doing with the data. There are a number of solutions, and each solution has its own advantages and disadvantages.

In the next chapter we're going to take a look at another classification algorithm similar to logistic regression. The algorithm is called support vector machines and is considered one of the best stock algorithms.