Logistic regression is our first encounter with optimization. The general workflow of logistic regression:
Step 1: collect the data.
Step 2: prepare the data, i.e. convert it into the format the code expects; since logistic regression does numeric computation, the values must be numeric, and structured data works best.
Step 3: analyze the data.
Step 4: train the algorithm (this is where most of the time goes): find the best regression coefficients, discussed below.
Step 5: test the algorithm; this part is fast.
Step 6: use the algorithm: convert the data to be classified into the same structured form, then use the trained coefficients to decide which class each sample belongs to.
Now that we have a rough picture of the process, let's look at the underlying idea and how to implement it.
We only cover binomial logistic regression here, so we have to introduce the Sigmoid function:
    sigmoid(z) = 1 / (1 + e^(-z))
The core of logistic regression is finding the regression coefficients: multiply each feature by its coefficient, sum the products, and feed the result into the sigmoid function; outputs above 0.5 are classified as 1, below 0.5 as 0. The whole classification problem then reduces to: what are the best coefficients?
Here z is
    z = w0*x0 + w1*x1 + ... + wn*xn
where the x's are the features and the w's the corresponding coefficients (weights, loosely speaking, although "weight" may not be exactly the right term).
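To make the mapping concrete, here is a minimal sketch of the feature-to-label pipeline described above; the feature values and coefficients are hypothetical, chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical feature vector and coefficients (x0 is the constant 1.0)
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, 1.0, 2.0])
z = np.dot(w, x)             # z = w0*x0 + w1*x1 + w2*x2 = 0.5
p = sigmoid(z)               # probability of class 1
label = 1 if p > 0.5 else 0  # threshold at 0.5
```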
How do we find the best coefficients?
Gradient ascent (an optimization algorithm).
To find a function's maximum, keep moving in the direction of its gradient; the coefficient update becomes
    w := w + alpha * grad_w f(w)
Gradient ascent and gradient descent are essentially the same idea: ascent finds maxima, descent finds minima.
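The update rule above can be seen on a toy one-dimensional function; this sketch (my own example, not from the book) maximizes f(x) = -(x - 2)^2, whose maximum is at x = 2:

```python
def grad_ascent_1d(start=0.0, alpha=0.1, iters=100):
    # maximize f(x) = -(x - 2)^2, whose gradient is f'(x) = -2*(x - 2);
    # repeatedly step in the direction of the gradient: x := x + alpha * f'(x)
    x = start
    for _ in range(iters):
        x = x + alpha * (-2.0 * (x - 2.0))
    return x
```

With a small enough step size alpha the iterate converges to the maximizer; flipping the sign of the step turns this into gradient descent.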
Let's explore this through code; the example comes from the book Machine Learning in Action.
First, prepare the data; the loading code is:
def loadDataSet():  # load the data
    dataMat = []
    labelMat = []
    fr = open('./testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # the first feature column is the constant 1.0
        labelMat.append(int(lineArr[-1]))
    return dataMat, labelMat
Let's look at how the samples are distributed.
We want to find a line that separates the two classes well.
import numpy as np

def sigmoid(inX):
    return 1.0/(1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    '''
    :param dataMatIn: the feature data
    :param classLabels: labels for the data
    :return: weights
    '''
    dataMat = np.mat(dataMatIn)                 # convert to a matrix
    labelMat = np.mat(classLabels).transpose()  # turn the 1*m label row into an m*1 column
    m, n = dataMat.shape                        # m is the number of samples, n the number of features
    alpha = 0.001
    maxCycles = 500                             # number of iterations
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        # worth pausing over the shapes here:
        # dataMat is m*n, weights is n*1
        # h is m*1
        # error is m*1
        h = sigmoid(dataMat * weights)
        error = labelMat - h
        print('loss is ', np.sum(error)*1.0/m)  # report the mean error
        weights = weights + alpha * dataMat.transpose() * error
        # gradient ascent: in dataMat (m*n) each column is a feature, each row a sample
        # error (m*1) holds one residual per sample
        # weights (n*1) holds one weight per column of dataMat
        # the update for each weight is that feature column dotted with error,
        # hence the transpose: (n*m)*(m*1) = (n*1), the gradient step
    return weights
Once we have the coefficients, let's look at the result.
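Given a set of coefficients, the decision boundary is where the weighted sum z equals 0, i.e. where sigmoid(z) = 0.5. A minimal sketch of recovering that line from the weights (the coefficient values here are only an illustration):

```python
def boundary_x2(weights, x1):
    # on the boundary w0 + w1*x1 + w2*x2 = 0, so sigmoid(z) = 0.5;
    # solving for x2 gives the line to draw over the scatter plot
    w0, w1, w2 = weights
    return (-w0 - w1 * x1) / w2

# illustrative coefficients, similar in shape to a trained result
w = [4.12, 0.48, -0.6168]
x2_at_zero = boundary_x2(w, 0.0)  # where the boundary crosses x1 = 0
```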
Now let's look at stochastic gradient ascent. In pseudocode:
    Initialize the regression coefficients
    For each sample in the dataset:
        compute the gradient for that sample
        update the coefficients by alpha * grad
    Return the coefficients
Here is the code:
def stocGradAscent0(dataArr, classLabels):  # stochastic gradient ascent
    m, n = np.shape(dataArr)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        # one sample at a time: h and error are scalars here,
        # unlike the full-batch version where they are m*1 vectors
        h = sigmoid(sum(dataArr[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * np.array(dataArr[i])
    return weights
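Compared with gradAscent, which multiplies the entire m*n matrix on every one of its 500 iterations, the stochastic version makes a single pass and updates after every sample. A self-contained sketch of that single pass on synthetic data (the function name and data here are my own, not from the book):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stoc_pass(data, labels, alpha=0.01):
    # one stochastic pass: update the weights after every sample,
    # instead of once per full sweep as in the batch version
    m, n = data.shape
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(np.dot(data[i], weights))
        error = labels[i] - h
        weights = weights + alpha * error * data[i]
    return weights

# synthetic, linearly separable data: class 1 when x1 + x2 > 0
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
w = stoc_pass(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == (y == 1.0))
```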
There is also an improved version of stochastic gradient ascent, which decays the step size over time and visits samples in random order.
Let's see the details in code; here is the complete program:
#coding=utf-8
import numpy as np

def loadDataSet():  # load the data
    dataMat = []
    labelMat = []
    fr = open('./testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[-1]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0/(1 + np.exp(-inX))
def gradAscent(dataMatIn, classLabels):
    '''
    :param dataMatIn: the feature data
    :param classLabels: labels for the data
    :return: weights
    '''
    dataMat = np.mat(dataMatIn)                 # convert to a matrix
    labelMat = np.mat(classLabels).transpose()  # turn the 1*m label row into an m*1 column
    m, n = dataMat.shape                        # m is the number of samples, n the number of features
    alpha = 0.001
    maxCycles = 500                             # number of iterations
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        # worth pausing over the shapes here:
        # dataMat is m*n, weights is n*1
        # h is m*1
        # error is m*1
        h = sigmoid(dataMat * weights)
        error = labelMat - h
        print('loss is ', np.sum(error)*1.0/m)  # report the mean error
        weights = weights + alpha * dataMat.transpose() * error
        # gradient ascent: in dataMat (m*n) each column is a feature, each row a sample
        # error (m*1) holds one residual per sample
        # weights (n*1) holds one weight per column of dataMat
        # the update for each weight is that feature column dotted with error,
        # hence the transpose: (n*m)*(m*1) = (n*1), the gradient step
    return weights
def stocGradAscent0(dataArr, classLabels):  # stochastic gradient ascent
    m, n = np.shape(dataArr)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(sum(dataArr[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * np.array(dataArr[i])
    return weights
def stocGradAscent1(dataArr, classLabels, numIter=150):
    # improved stochastic gradient ascent: alpha decays over time,
    # and samples are visited in random order without replacement
    m, n = np.shape(dataArr)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))            # list() so used indices can be removed
        for i in range(m):
            alpha = 4/(1.0 + j + i) + 0.01    # alpha shrinks but never reaches 0
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid(sum(dataArr[sample] * weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * np.array(dataArr[sample])
            del(dataIndex[randIndex])         # don't pick the same sample twice in one pass
    return weights
def classify(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1
    else:
        return 0
def colicTest():
    frTrain = open('./horseColicTraining.txt')
    frTest = open('./horseColicTest.txt')
    trainSet = []; trainLabels = []
    for line in frTrain.readlines():
        curLine = line.strip().split('\t')
        trainSet.append(np.array(curLine[:-1]).astype('float64'))
        trainLabels.append(np.array(curLine[-1]).astype('float64'))
    trainWeights = stocGradAscent1(trainSet, trainLabels, numIter=150)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1
        curLine = line.strip().split('\t')
        testVec = np.array(curLine[:-1]).astype('float64')
        label = classify(testVec, trainWeights)
        if int(label) != int(float(curLine[-1])):  # labels may be stored as "0.0"/"1.0"
            errorCount += 1
    errRate = errorCount*1.0/numTestVec
    print("The error rate is", errRate)
def plot2D():
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    weights = stocGradAscent1(dataMat, labelMat)
    print(weights)
    n = np.shape(dataMat)[0]  # number of points to plot
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataMat[i][1]); ycord1.append(dataMat[i][2])
        else:
            xcord2.append(dataMat[i][1]); ycord2.append(dataMat[i][2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    type1 = ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    type2 = ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    # coefficients from earlier runs, kept for reference (the trained
    # weights above are used for the actual plot):
    # weights = [-2.9, 0.72, 1.29]
    # weights = [-5, 1.09, 1.42]
    # weights = [13.03822793, 1.32877317, -1.96702074]
    # weights = [4.12, 0.48, -0.6168]
    # the boundary is where w0 + w1*x1 + w2*x2 = 0; solve for x2
    y = (-weights[0] - weights[1]*x)/weights[2]
    ax.plot(x, y)
    ax.legend([type1, type2], ["Class 1", "Class 0"], loc=2)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == "__main__":
    plot2D()
    # colicTest()