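Answers to the programming exercises in the "Watermelon Book" (Machine Learning): Exercise 3.3
3.3 Implement logistic regression in code and report the results on watermelon dataset 3.0α
Building on Machine Learning in Action, this post implements Exercise 3.3 in Python 3. The code consists of the following parts:
Loading the data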
from numpy import *  # assumed import: the functions in this post call exp, mat, shape, ones, array and arange unqualified

def loadDataSet():
    """Parse tab-delimited floats: features first, class label in the last column."""
    dataMat = []; labelMat = []
    with open('3.0alpha.txt') as fr:
        for line in fr:
            curLine = line.strip().split('\t')
            # Prepend a constant 1.0 so that weights[0] acts as the bias term
            lineArr = [1.0] + [float(v) for v in curLine[:-1]]
            dataMat.append(lineArr)
            labelMat.append(float(curLine[-1]))
    return dataMat, labelMat
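As a more compact alternative, the same parsing can be sketched with `numpy.loadtxt`. This assumes the file is tab-delimited with the label in the last column, as `loadDataSet` expects; the name `load_with_loadtxt` is hypothetical and not part of the original code.

```python
import numpy as np

def load_with_loadtxt(path):
    """Hypothetical helper: load tab-delimited floats, label in the last column."""
    raw = np.loadtxt(path, delimiter='\t', ndmin=2)
    features = raw[:, :-1]
    labels = raw[:, -1]
    # Prepend a column of ones so the first weight acts as the bias term
    data = np.hstack([np.ones((raw.shape[0], 1)), features])
    return data, labels
```

Unlike `loadDataSet`, this sketch returns NumPy arrays rather than Python lists.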
The sigmoid function
def sigmoid(inX):
    return 1.0/(1+exp(-inX))
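For large negative inputs, `exp(-inX)` can overflow and trigger a NumPy runtime warning. A minimal sketch of a numerically stable variant follows; the name `stable_sigmoid` is an assumption introduced for illustration.

```python
import numpy as np

def stable_sigmoid(x):
    """Hypothetical helper: sigmoid that never exponentiates a positive number."""
    x = np.asarray(x, dtype=float)
    # exp(-|x|) always lies in [0, 1], so it cannot overflow
    e = np.exp(-np.abs(x))
    # For x >= 0: 1 / (1 + exp(-x)); for x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

Both branches are algebraically equal to `1.0/(1+exp(-x))`; only the evaluation order differs.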
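Computing the weights by gradient ascent (equivalently, gradient descent on the negative log-likelihood)
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to a NumPy matrix, shape (m, n), for the linear algebra below
    labelMat = mat(classLabels).transpose()  # convert to a NumPy matrix and transpose into a column vector, shape (m, 1)
    m, n = shape(dataMatrix)
    alpha = 0.001       # step size
    maxCycles = 1000    # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):               # heavy on matrix operations
        h = sigmoid(dataMatrix * weights)    # matrix multiplication: predicted probabilities
        error = labelMat - h                 # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error  # gradient-ascent step
    return weights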
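The update inside the loop follows the gradient of the log-likelihood. As a short derivation, write $X$ for `dataMatrix`, $y$ for `labelMat`, $w$ for `weights`, $\alpha$ for `alpha`, and $\sigma$ for the sigmoid (symbols introduced here for illustration):

```latex
\ell(w) = \sum_{i=1}^{m} \Big[ y_i \ln \sigma(w^{\top} x_i) + (1 - y_i) \ln\big(1 - \sigma(w^{\top} x_i)\big) \Big]

\nabla_w \ell(w) = \sum_{i=1}^{m} \big( y_i - \sigma(w^{\top} x_i) \big)\, x_i = X^{\top}\big( y - \sigma(Xw) \big)

w \leftarrow w + \alpha\, X^{\top}\big( y - \sigma(Xw) \big)
```

The last line corresponds to `weights + alpha * dataMatrix.transpose() * error`, with `error = labelMat - h` and `h = sigmoid(dataMatrix * weights)`.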
Classification
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
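To classify many samples at once, the same rule can be vectorized. The sketch below assumes plain NumPy arrays rather than `numpy.matrix`; `predict_all` is a hypothetical name, not part of the original code.

```python
import numpy as np

def predict_all(X, weights):
    """Hypothetical helper: label every row of X using the 0.5 probability threshold."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float).ravel()  # accept (n,) or (n, 1) weights
    scores = X @ w
    # sigmoid(score) > 0.5 is mathematically equivalent to score > 0, so exp() is not needed
    return (scores > 0).astype(float)
```

Each row of `X` is expected to start with the constant 1.0, in the same layout that `loadDataSet` produces.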
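This covers all the required functionality. To make the pieces easier to use, the functions above are combined into a single function.
Complete program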
def colicTest():
    # The dataset is small, so the training error is computed on the same data;
    # the data could also be split into a training set and a test set.
    trainingSet, trainingLabels = loadDataSet()
    trainWeights = gradAscent(array(trainingSet), trainingLabels)
    errorCount = 0; numTestVec = 0.0
    with open('3.0alpha.txt') as frTest:
        # Note: readline() consumes the first sample, so the loop below does not evaluate it;
        # calling frTest.seek(0) after this line would include it.
        numFeat = len(frTest.readline().split('\t'))
        for line in frTest.readlines():
            numTestVec += 1.0
            currLine = line.strip().split('\t')
            lineArr = [1.0]
            for i in range(numFeat - 1):
                lineArr.append(float(currLine[i]))
            if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[-1]):
                errorCount += 1
    errorRate = float(errorCount)/numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate
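The comment in `colicTest` mentions splitting the data into a training set and a test set. A minimal sketch of such a hold-out split follows, assuming NumPy arrays; `split_train_test`, `test_ratio` and `seed` are hypothetical names chosen for illustration.

```python
import numpy as np

def split_train_test(data, labels, test_ratio=0.3, seed=0):
    """Hypothetical helper: shuffle the samples and hold out a fraction for testing."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))  # random ordering of the sample indices
    n_test = max(1, int(round(len(labels) * test_ratio)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]
```

The training part could then be passed to `gradAscent` and the held-out part to `classifyVector`. With a dataset this small, the error estimated from a single split will likely vary noticeably from one seed to another.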
Plotting the classified points
def plot2D():
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)
    n = len(labelMat)
    xcord0 = []; ycord0 = []
    xcord1 = []; ycord1 = []
    for i in range(n):
        if labelMat[i] == 0:
            xcord0.append(dataMat[i][1]); ycord0.append(dataMat[i][2])
        else:
            xcord1.append(dataMat[i][1]); ycord1.append(dataMat[i][2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord0, ycord0, s=30, c='red', marker='s')   # samples labeled 0
    ax.scatter(xcord1, ycord1, s=30, c='green')             # samples labeled 1
    x = arange(0, 1.0, 0.05)
    y = (-weights[0] - weights[1] * x) / weights[2]  # decision boundary: w0 + w1*x + w2*y = 0
    ax.plot(x, y.transpose())
    plt.xlabel('Density')
    plt.ylabel('Sugar_content')
    plt.show()
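The line drawn by `plot2D` is the decision boundary, where the predicted probability equals 0.5 and the linear score is therefore zero. Writing $x_1$ for density and $x_2$ for sugar content (symbols chosen here for illustration):

```latex
w_0 + w_1 x_1 + w_2 x_2 = 0
\quad\Longrightarrow\quad
x_2 = \frac{-w_0 - w_1 x_1}{w_2}
```

This is the expression assigned to `y` in the code, with `weights[0]`, `weights[1]` and `weights[2]` playing the roles of $w_0$, $w_1$ and $w_2$.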
This completes the exercise; with minor changes, the program can be applied to other settings.
The resulting error rate is:
the error rate of this test is: 0.312500
0.3125
The complete code and dataset can be downloaded from this [link](https://download.csdn.net/download/yanying1113/10807706).