CP05 Logistic regression

This article covers

  • The sigmoid function and the logistic regression classifier
  • Our first look at optimization
  • The gradient ascent optimization algorithm
  • Dealing with missing values in our data

General approach to logistic regression

  • Collect: Any method
  • Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
  • Analyze: Any method
  • Train: We'll spend most of the time training, where we try to find optimal coefficients to classify our data.
  • Test: Classification is quick and easy once the training step is done.
  • Use: This application needs to get some input data and output structured numeric values. Next, the application applies the simple regression calculation to this input data and determines which class the input data belongs to. The application then takes some action based on the calculated class.

We'll look at two optimization algorithms, gradient ascent and stochastic gradient ascent. These optimization algorithms will be used to train our classifier.

5.1 Classification with logistic regression and the sigmoid function: a tractable step function

Logistic regression

  • Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
  • Cons: Prone to underfitting, may have low accuracy
  • Works with: Numeric values, nominal values

We want a function that takes all of our features as input and predicts the class: it should spit out a 0 or a 1, yet remain easy to deal with mathematically. The function that fits this description is called the sigmoid, and it is given by the following equation:

\sigma (z)=\frac{1}{1+e^{-z}}
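Note that \sigma(0) = 0.5; as z grows large, \sigma(z) approaches 1, and as z becomes very negative, \sigma(z) approaches 0. Anything above 0.5 we'll classify as a 1, anything below as a 0. A quick numeric check (a standalone sketch; the module version of sigmoid() appears in logRegres.py below):

import numpy as np

def sigmoid(z):
    return 1.0/(1 + np.exp(-z))

print(sigmoid(0.0))                    # 0.5 -- the classification threshold
print(sigmoid(np.array([-6.0, 6.0])))  # approximately [0.0025, 0.9975]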

5.2 Using optimization to find the best regression coefficients

The input to the sigmoid function will be z, where z is given by the following:

z = w_{0}x_{0}+ ...+ w_{n}x_{n}
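In vector notation that's z = w^{T}x: multiply each feature by its coefficient and add everything up. In NumPy terms (the numbers below are made-up illustration values, not fitted weights):

import numpy as np

w = np.array([4.0, 0.5, -0.6])  # hypothetical coefficients
x = np.array([1.0, 2.0, 3.0])   # x[0] = 1.0 so that w[0] acts as the constant term
z = np.dot(w, x)                # z = w0*x0 + w1*x1 + w2*x2
print(z)                        # 4.0 + 1.0 - 1.8 = 3.2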

5.2.1 Gradient ascent

The first optimization algorithm we're going to look at is called gradient ascent. Gradient ascent is based on the idea that if we want to find the maximum point on a function, then the best way to move is in the direction of the gradient.

The gradient is written with the symbol \nabla. The gradient of a function f(x,y) is given by

\nabla f(x,y)=\begin{pmatrix}\frac{\partial f(x,y)}{\partial x}\\ \frac{\partial f(x,y)}{\partial y}\end{pmatrix}

The gradient tells us the direction in which f increases fastest, but not how far to step; the step size is a separate parameter \alpha. With both pieces, the gradient ascent update rule is

w := w + \alpha \nabla_{w} f(w)

and we repeat it until a stopping condition is reached, such as a fixed number of iterations.
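To see the update rule on something tiny, here is a minimal sketch (the toy function is ours, not from the chapter) that climbs f(w) = -(w - 3)^2, whose maximum sits at w = 3:

# Gradient ascent on f(w) = -(w - 3)**2, where f'(w) = -2*(w - 3).
w = 0.0
alpha = 0.1                  # step size
for k in range(100):
    grad = -2.0 * (w - 3.0)  # gradient of f at the current w
    w = w + alpha * grad     # step in the direction of steepest increase
print(w)                     # converges toward 3.0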

Let's put this into action on our logistic regression classifier with some Python.

First, we need a dataset. The testSet.txt file loaded below has two numeric features and a class label on each line.

5.2.2 Train: using gradient ascent to find the best parameters

'''
Author: Maxwell Pan
Date: 2022-04-19 06:46:59
LastEditTime: 2022-04-19 10:10:03
FilePath: \cp05\logRegres.py
Description: Logistic regression
Software:VSCode,env:
'''

import numpy as np
import random   # used by stocGradAscent1 below

# Logistic regression gradient ascent optimization functions
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))
    
def gradAscent(dataMatIn,classLabels):
    dataMatrix = np.mat(dataMatIn)              # convert to the NumPy matrix data type
    labelMat = np.mat(classLabels).transpose()  # class labels as a column vector
    m,n = np.shape(dataMatrix)
    alpha = 0.001                               # step size
    maxCycles = 500                             # number of iterations
    weights = np.ones((n,1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix*weights)         # matrix multiplication: h is an m x 1 column vector
        error = (labelMat - h)                  # how far off we are for each instance
        weights = weights + alpha * dataMatrix.transpose()*error   # step in the gradient direction
    return weights
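The update line carries the whole algorithm. The text doesn't spell out why multiplying the transposed data matrix by the error vector is the right move, so here is a sketch: for the log-likelihood of the data under the sigmoid model,

\ell(w)=\sum_{i=1}^{m}\left[y_{i}\log h_{i}+(1-y_{i})\log(1-h_{i})\right],\qquad h_{i}=\sigma(w^{T}x_{i})

the gradient works out to

\nabla_{w}\,\ell(w)=\sum_{i=1}^{m}(y_{i}-h_{i})\,x_{i}=X^{T}(y-h)

so weights = weights + alpha * dataMatrix.transpose()*error is exactly the update rule w := w + \alpha\nabla_{w}\ell(w) from above.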

Type the following in your Python shell or notebook. It returns a 3 × 1 matrix of weights: one for the constant term and one for each of the two features.

import logRegres
dataArr,labelMat=logRegres.loadDataSet()
logRegres.gradAscent(dataArr,labelMat)

5.2.3 Analyze: plotting the decision boundary

We're solving for a set of weights used to draw a line that separates the two classes of data. The boundary sits where the sigmoid crosses 0.5, which happens at z = 0, so setting 0 = w_{0}x_{0} + w_{1}x_{1} + w_{2}x_{2} and solving for x_{2} gives the line that plotBestFit() draws.

 

# Plotting the logistic regression best-fit line and dataset.
def plotBestFit(wei):
    import matplotlib.pyplot as plt
    weights = np.asarray(wei).flatten()   # accept an np.matrix (gradAscent) or a plain array (stocGradAscent)
    dataMat,labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):                    # split the points by class for plotting
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')
    ax.scatter(xcord2,ycord2,s=30,c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]   # the line where 0 = w0 + w1*x1 + w2*x2
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
import imp
import logRegres
imp.reload(logRegres)
dataArr,labelMat = logRegres.loadDataSet()
weights = logRegres.gradAscent(dataArr,labelMat)
logRegres.plotBestFit(weights)
 

5.2.4 Train: stochastic gradient ascent

# Stochastic gradient ascent

def stocGradAscent0(dataMatrix, classLabels):
    dataMatrix = np.array(dataMatrix)
    m,n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):                            # one update per training instance
        h = sigmoid(sum(dataMatrix[i]*weights))   # h and error are now single numbers, not vectors
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
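Try it the same way as before (plotBestFit() above accepts the plain NumPy array this version returns):

import imp
import logRegres
imp.reload(logRegres)
dataArr,labelMat = logRegres.loadDataSet()
weights = logRegres.stocGradAscent0(dataArr,labelMat)
logRegres.plotBestFit(weights)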

 

The fit after a single pass isn't as good as gradAscent(), so the improved version below makes several passes, picks instances in random order to damp periodic fluctuations, and shrinks alpha as training proceeds:

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    dataMatrix = np.array(dataMatrix)
    m,n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))         # visit each instance once per pass, in random order
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.01       # alpha decreases as we go but never reaches 0
            randIndex = int(random.uniform(0,len(dataIndex)))   # random pick reduces periodic fluctuations
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])      # don't revisit this instance in the current pass
    return weights
Give the improved version a try:

import logRegres
dataArr,labelMat = logRegres.loadDataSet()
weights = logRegres.stocGradAscent1(dataArr,labelMat)
logRegres.plotBestFit(weights)

5.3 Example: estimating horse fatalities from colic

Example: using logistic regression to estimate horse fatalities from colic

1. Collect: Data file provided.

2. Prepare: Parse a text file in Python, and fill in missing values.

3. Analyze: Visually inspect the data.

4. Train: Use an optimization algorithm to find the best coefficients.

5. Test: To measure the success, we’ll look at error rate. Depending on the error rate, we may decide to go back to the training step to try to find better values for the regression coefficients by adjusting the number of iterations and step size.

6. Use: Building a simple command-line program to collect horse symptoms and output live/die diagnosis won’t be difficult. I’ll leave that up to you as an exercise.

5.3.1 Prepare: dealing with missing values in the data

Here are some options (a short imputation sketch follows the list):

■ Use the feature’s mean value from all the available data.

■ Fill in the unknown with a special value like -1.

■ Ignore the instance.

■ Use a mean value from similar items.

■ Use another machine learning algorithm to predict the value.
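For the horse colic data, missing feature values were replaced with the real number 0, which happens to suit stochastic gradient ascent: a feature of 0 leaves its weight untouched in weights = weights + alpha * error * dataMatrix[i], and sigmoid(0) = 0.5 carries no bias toward either class. If you'd rather fill gaps with the feature mean instead, here is a minimal sketch (the mean_impute helper and the 0-as-missing convention are assumptions for illustration, not part of the chapter's data files):

import numpy as np

def mean_impute(X, missing=0.0):
    # Hypothetical helper: replace the 'missing' sentinel in each column
    # with the mean of that column's non-missing entries.
    X = np.array(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j]
        present = col != missing
        if present.any() and not present.all():
            col[~present] = col[present].mean()
    return X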

5.3.2 Test: classifying with logistic regression

# Logistic regression classification function
def classifyVector(inX, weights):
    # Predict class 1 if sigmoid(w . x) > 0.5, else class 0.
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt')
    frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():          # 21 features per line; the label sits in column 22
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 500)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():           # count misclassified test instances
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):                 # average over several runs, since training is randomized
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))

 

Run it in your shell:

import imp
import logRegres
imp.reload(logRegres)
logRegres.multiTest()

5.4 Summary

Logistic regression is finding best-fit parameters to a nonlinear function called the sigmoid. Methods of optimization can be used to find the best-fit parameters. Among the optimization algorithms, one of the most common is gradient ascent, which can be simplified with stochastic gradient ascent.

Stochastic gradient ascent can do as well as gradient ascent using far fewer computing resources. In addition, stochastic gradient ascent is an online algorithm; it can update what it has learned as new data comes in, rather than reloading all of the data as in batch processing.

One major problem in machine learning is how to deal with missing values in the data. There's no blanket answer to this question; it really depends on what you're doing with the data. There are a number of solutions, and each has its own advantages and disadvantages.

In the next chapter we're going to look at another classification algorithm similar to logistic regression: the support vector machine, which is considered one of the best stock algorithms.
