Machine Learning in Action: Study Notes (Chapters 4-5)

我的宠物不是小马

Published 2022-10-13 19:41:53


Article link: https://blog.csdn.net/weixin_52539595/article/details/127248609


References: Machine Learning in Action, Peter Harrington

Machine Learning in Action tutorial (13 parts), chenyanlong_v's blog, CSDN

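4. Naive Bayes: (1) Choose the decision with the highest probability. (Let p1(x,y) denote the probability that the data point (x,y) belongs to class 1 and p2(x,y) the probability that it belongs to class 2; assign the point to the class with the higher probability.)

Pros: remains effective with small amounts of data; can handle multi-class problems

Cons: sensitive to how the input data is prepared

Applicable data type: nominal data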

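→ Conditional probability: the probability that event A occurs given that event B has occurred, written P(A|B). Given that B has occurred, the probability of A is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Therefore,

$$P(A \cap B) = P(A|B)\,P(B)$$

Similarly,

$$P(A \cap B) = P(B|A)\,P(A)$$

So,

$$P(A|B)\,P(B) = P(B|A)\,P(A)$$

That is,

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

This is the formula for conditional probability.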
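As a quick numeric check of the conditional probability formula, the sketch below enumerates the outcomes of a fair six-sided die. The events A ("the roll is even") and B ("the roll is greater than 3") are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}   # event A: the roll is even
B = {x for x in outcomes if x > 3}        # event B: the roll is greater than 3

def prob(event):
    """Probability of an event under a uniform distribution over outcomes."""
    return Fraction(len(event), len(outcomes))

# Direct definition: P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Bayes form: P(A|B) = P(B|A) * P(A) / P(B)
p_b_given_a = prob(A & B) / prob(A)
p_a_given_b_bayes = p_b_given_a * prob(A) / prob(B)

print(p_a_given_b, p_a_given_b_bayes)   # both are 2/3
```

Both routes give the same value, which is what the derivation predicts.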

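→ Law of total probability: for an event A and its complement A',

$$P(B) = P(B \cap A) + P(B \cap A')$$

From the derivation in the previous section, we know that

$$P(B \cap A) = P(B|A)\,P(A)$$

So,

$$P(B) = P(B|A)\,P(A) + P(B|A')\,P(A')$$

This is the law of total probability. It says that if A and A' form a partition of the sample space, the probability of event B equals the sum of the probabilities of A and A', each multiplied by the conditional probability of B given that event.

Substituting this into the conditional probability formula from the previous section gives another way to write it:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A')\,P(A')}$$

The conditional probability formula can also be rearranged as:

$$P(A|B) = P(A)\,\frac{P(B|A)}{P(B)}$$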

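We call P(A) the "prior probability": our estimate of the probability of A before event B is observed.

P(A|B) is the "posterior probability": our revised estimate of the probability of A after event B has been observed.

P(B|A)/P(B) is called the "likelihood function" (Likelihood) in these notes; it is an adjustment factor that brings the estimated probability closer to the true probability.

So the conditional probability reads: posterior probability = prior probability * adjustment factor

If the adjustment factor P(B|A)/P(B) > 1, the prior probability is strengthened and event A becomes more likely; if it equals 1, event B tells us nothing about how likely A is; if it is < 1, the prior probability is weakened and event A becomes less likely.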
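To make the prior, adjustment factor, and posterior concrete, here is a minimal sketch with made-up numbers. The three input probabilities are hypothetical and chosen only for illustration; P(B) is obtained from the law of total probability.

```python
# Hypothetical numbers chosen only for illustration.
p_a = 0.3              # prior P(A)
p_b_given_a = 0.8      # P(B|A)
p_b_given_not_a = 0.2  # P(B|A')

# Total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Adjustment factor P(B|A)/P(B) and posterior P(A|B) = prior * factor
factor = p_b_given_a / p_b
posterior = p_a * factor

print(f"P(B)={p_b:.4f}, factor={factor:.4f}, posterior={posterior:.4f}")
```

Here the factor is greater than 1, so observing B raises the estimate for A above its prior of 0.3.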

(2) Naive Bayes assumes that the features are conditionally independent given the class.

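Example 1: Document classification with naive Bayes

Look at the words that appear in a document and treat the presence or absence of each word as a feature. If each feature needs N samples, 10 features would need N^10 samples; if the features are independent of one another, the required number drops to N*10.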
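The independence assumption is what lets the classifier multiply per-word probabilities instead of estimating a joint distribution over all words. A minimal sketch follows; the probability table and the helper name `joint_prob` are hypothetical.

```python
import math

# Hypothetical per-word conditional probabilities P(w_i | c) for one class c.
p_word_given_c = {"stupid": 0.15, "dog": 0.10, "my": 0.05}

def joint_prob(words, table):
    """Under the naive (conditional independence) assumption,
    P(w_1, ..., w_n | c) is the product of the individual P(w_i | c)."""
    return math.prod(table[w] for w in words)

p = joint_prob(["stupid", "dog"], p_word_given_c)
print(p)   # approximately 0.015
```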

# -*- coding: UTF-8 -*-
import numpy as np

def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                  ['maybe','not','take','him','to','dog','park','stupid'],
                  ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
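    classVec=[0,1,0,1,0,1]#1 = abusive text, 0 = normal speech
    return postingList,classVec#postingList holds the tokenized posts; classVec holds the class label of each post

def createVocabList(dataSet):#build a list of the unique words across all documents
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)#set union
    return list(vocabSet)

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)#vector of zeros, one entry per vocabulary word
    for word in inputSet:#iterate over each word of the document
        if word in vocabList:#if the word is in the vocabulary, set its entry to 1
            returnVec[vocabList.index(word)]=1
        else:print("the word:%s is not in my Vocabulary!"%word)#warn about out-of-vocabulary words
    return returnVec#return the document vector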

if __name__ =='__main__':
    postingList,classVec=loadDataSet()#postingList is the raw list of tokenized posts
    print('postingList:\n',postingList)
    myVocabList=createVocabList(postingList)#myVocabList is the list of all unique words
    print('myVocabList:\n',myVocabList)
    trainMat=[]#trainMat is the list of document vectors, each built against myVocabList (1 if the word occurs, 0 otherwise)
    for postinDoc in postingList:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    print('trainMat:\n',trainMat)
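To see the vectorization step in isolation, here is a small self-contained sketch with a fixed, sorted vocabulary (the script above builds its vocabulary from a set, so its word order can differ between runs). The function name `words_to_vec` and the toy vocabulary are hypothetical.

```python
def words_to_vec(vocab, words):
    """Set-of-words model: 1 if the vocabulary word occurs in `words`, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]

vocab = sorted({"my", "dog", "stupid", "love"})   # ['dog', 'love', 'my', 'stupid']
vec = words_to_vec(vocab, ["my", "dog", "my", "cat"])
print(vec)   # [1, 0, 1, 0]
```

Repeated words still map to 1, and "cat" is ignored because it is not in the vocabulary.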

Computing the probabilities:

# -*- coding: UTF-8 -*-
import numpy as np

def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                  ['maybe','not','take','him','to','dog','park','stupid'],
                  ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
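    classVec=[0,1,0,1,0,1]#1 = abusive text, 0 = normal speech
    return postingList,classVec#postingList holds the tokenized posts; classVec holds the class label of each post

def createVocabList(dataSet):#build a list of the unique words across all documents
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)#set union
    return list(vocabSet)

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)#vector of zeros, one entry per vocabulary word
    for word in inputSet:#iterate over each word of the document
        if word in vocabList:#if the word is in the vocabulary, set its entry to 1
            returnVec[vocabList.index(word)]=1
        else:print("the word:%s is not in my Vocabulary!"%word)#warn about out-of-vocabulary words
    return returnVec#return the document vector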

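def trainNB0(trainMatrix,trainCategory):
    numTrainDocs=len(trainMatrix)#number of training documents; trainMatrix is the matrix of document vectors
    numWords=len(trainMatrix[0])#number of entries per document vector (vocabulary size)
    pAbusive=sum(trainCategory)/float(numTrainDocs)#probability that a document is abusive; trainCategory is the label vector
    p0Num=np.zeros(numWords);p1Num=np.zeros(numWords)#per-word occurrence counts, initialized to 0
    p0Denom=0.0;p1Denom=0.0#denominators initialized to 0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:#accumulate counts for the abusive class, i.e. for P(w0|1),P(w1|1),P(w2|1)...
            p1Num+=trainMatrix[i]
            p1Denom+=sum(trainMatrix[i])
        else:                  #accumulate counts for the non-abusive class, i.e. for P(w0|0),P(w1|0),P(w2|0)...
            p0Num+=trainMatrix[i]
            p0Denom+=sum(trainMatrix[i])
    p1Vect=p1Num/p1Denom #p1Vect: conditional probabilities P(wi|1) for the abusive class
    p0Vect=p0Num/p0Denom #p0Vect: conditional probabilities P(wi|0) for the non-abusive class
    return p0Vect,p1Vect,pAbusive#returns the non-abusive conditional probabilities, the abusive conditional probabilities, and the probability that a document is abusive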


if __name__ =='__main__':
    postingList,classVec=loadDataSet()
    myVocabList=createVocabList(postingList)
    print('myVocabList:\n',myVocabList)
    trainMat=[]
    for postinDoc in postingList:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    p0V,p1V,pAb=trainNB0(trainMat,classVec)#pAb is the prior probability of the abusive class
    print('p0V:\n',p0V)
    print('p1V:\n',p1V)
    print('classVec:\n',classVec)
    print('pAb:\n',pAb)

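Classifying with the classifier:

① When classifying a document with the Bayes classifier, many probabilities are multiplied together; if any one of them is 0, the whole product is 0. To reduce this effect, initialize every word count to 1 and every denominator to 2.

p0Num=np.ones(numWords);p1Num=np.ones(numWords)

p0Denom=2.0;p1Denom=2.0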
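The effect of this initialization can be seen on a toy count vector. The counts below are hypothetical; the smoothing constants (1 in the numerator, 2 in the denominator) are the ones described above.

```python
import numpy as np

# Toy word counts for one class over a 3-word vocabulary (hypothetical data).
counts = np.array([3.0, 0.0, 1.0])

# Without smoothing: the second word gets probability 0, which zeroes any product.
p_raw = counts / counts.sum()

# With the initialization described above: counts start at 1, the denominator at 2.
p_smooth = (counts + 1.0) / (counts.sum() + 2.0)

print(p_raw)      # the second entry is exactly 0
print(p_smooth)   # every entry is now strictly positive
```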

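② Another problem is underflow, caused by multiplying many very small numbers (every probability P<1): the final result gets rounded to 0. One fix is to take the natural logarithm of the product, which turns it into a sum of logarithms.

p1Vect=np.log(p1Num/p1Denom)

p0Vect=np.log(p0Num/p0Denom)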
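A short demonstration of the underflow problem and the logarithm fix. The list of 400 identical probabilities is a hypothetical stand-in for the per-word probabilities of a long document.

```python
import math

# 400 hypothetical word probabilities, each equal to 0.01.
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p          # 1e-800 is far below the smallest double, so this underflows

# Summing logarithms keeps the value finite, so classes can still be compared.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing log-sums picks the same class as comparing the original products would.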

# -*- coding: UTF-8 -*-
import numpy as np
from functools import reduce
from math import log

def loadDataSet():
    postingList=[['my','dog','has','flea','problems','help','please'],
                  ['maybe','not','take','him','to','dog','park','stupid'],
                  ['my','dalmation','is','so','cute','I','love','him'],
                 ['stop','posting','stupid','worthless','garbage'],
                 ['mr','licks','ate','my','steak','how','to','stop','him'],
                 ['quit','buying','worthless','dog','food','stupid']]
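    classVec=[0,1,0,1,0,1]#1 = abusive text, 0 = normal speech
    return postingList,classVec#postingList holds the tokenized posts; classVec holds the class label of each post

def createVocabList(dataSet):#build a list of the unique words across all documents
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)#set union
    return list(vocabSet)

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)#vector of zeros, one entry per vocabulary word
    for word in inputSet:#iterate over each word of the document
        if word in vocabList:#if the word is in the vocabulary, set its entry to 1
            returnVec[vocabList.index(word)]=1
        else:print("the word:%s is not in my Vocabulary!"%word)#warn about out-of-vocabulary words
    return returnVec#return the document vector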

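def trainNB0(trainMatrix,trainCategory):
    numTrainDocs=len(trainMatrix)#number of training documents; trainMatrix is the matrix of document vectors
    numWords=len(trainMatrix[0])#number of entries per document vector (vocabulary size)
    pAbusive=sum(trainCategory)/float(numTrainDocs)#probability that a document is abusive; trainCategory is the label vector
    p0Num=np.ones(numWords);p1Num=np.ones(numWords)#initialize every word count to 1
    p0Denom=2.0;p1Denom=2.0#initialize the denominators to 2
    for i in range(numTrainDocs):
        if trainCategory[i]==1:#accumulate counts for the abusive class, i.e. for P(w0|1),P(w1|1),P(w2|1)...
            p1Num+=trainMatrix[i]
            p1Denom+=sum(trainMatrix[i])
        else:                  #accumulate counts for the non-abusive class, i.e. for P(w0|0),P(w1|0),P(w2|0)...
            p0Num+=trainMatrix[i]
            p0Denom+=sum(trainMatrix[i])
    p1Vect=np.log(p1Num/p1Denom)#log conditional probabilities for the abusive class
    p0Vect=np.log(p0Num/p0Denom)#log conditional probabilities for the non-abusive class
    return p0Vect,p1Vect,pAbusive#returns the non-abusive log probabilities, the abusive log probabilities, and the probability that a document is abusive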

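def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):#vec2Classify: word vector to classify; p0Vec: log conditional probabilities for the non-abusive class; p1Vec: same for the abusive class; pClass1: prior probability that a document is abusive
    p1=sum(vec2Classify*p1Vec)+log(pClass1)#log P(w|c1)+log P(c1): element-wise product, then a sum of logs
    p0=sum(vec2Classify*p0Vec)+log(1.0-pClass1)#log P(w|c0)+log P(c0)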
    print('p0:',p0)
    print('p1:',p1)
    if p1>p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses=loadDataSet()#create the sample data
    myVocabList=createVocabList(listOPosts)#create the vocabulary
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))#vectorize the samples
    p0V,p1V,pAb=trainNB0(np.array(trainMat),np.array(listClasses))#train the naive Bayes classifier
    testEntry=['love','my','dalmation']
    thisDoc=np.array(setOfWords2Vec(myVocabList,testEntry))#vectorize the test sample
    if classifyNB(thisDoc,p0V,p1V,pAb):#classify and print the result
        print(testEntry,'classified as abusive')
    else:
        print(testEntry,'classified as not abusive')
    testEntry=['stupid','garbage']

    thisDoc=np.array(setOfWords2Vec(myVocabList,testEntry))
    if classifyNB(thisDoc,p0V,p1V,pAb):
        print(testEntry,'classified as abusive')
    else:
        print(testEntry,'classified as not abusive')

if __name__=='__main__':
    testingNB()


Example 2:

Example 3:

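5. Logistic regression: (1) Build a regression formula for the classification boundary from the existing data. (The word "regression" comes from "best fit"; fitting a straight line to existing data points is called regression.) Logistic regression is typically used for binary classification.

Pros: computationally inexpensive, easy to implement

Cons: prone to underfitting; classification accuracy may be low

Applicable data types: numeric, nominal (yes/no)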

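→ Sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To implement a logistic regression classifier, multiply each feature by a regression coefficient, add up all the products, and feed the sum into the sigmoid function, which yields a value between 0 and 1. Anything above 0.5 is assigned to class 1 and anything below 0.5 to class 0, so logistic regression can also be viewed as a probability estimate.

The input to the sigmoid function is denoted z, with z = w0x0 + w1x1 + ... + wnxn; in vector form,

$$z = w^T x$$

This multiplies the corresponding elements of the two vectors and sums the results to give z, where the vector x is the input data and the vector w holds the best-fit regression coefficients we are looking for.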
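The weighted sum and the 0.5 threshold can be sketched in a few lines. The coefficient vector `w` and the input `x` below are hypothetical values, with `x[0] = 1.0` standing in for the constant term.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and input; x[0] = 1.0 plays the role of the constant term.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.4])

z = np.dot(w, x)              # z = w0*x0 + w1*x1 + w2*x2
p = sigmoid(z)                # estimated probability of class 1
label = 1 if p > 0.5 else 0   # threshold at 0.5
print(z, p, label)
```

Since sigmoid(0) is exactly 0.5, thresholding the probability at 0.5 is the same as checking the sign of z.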

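(2) Gradient ascent: to find the maximum of a function, the best approach is to search along the direction of its gradient.

Given the properties of the sigmoid function, we can make the following assumption (θ here is w):

$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x), \qquad h_\theta(x) = \sigma(\theta^T x)$$

These are the conditional probabilities that sample x is a positive sample (y=1) or a negative sample (y=0), given x and the parameters θ. Ideally, the probability computed for every point would be 1, meaning every sample is classified correctly. In practice, the closer a sample's probability is to 1, the better the classification. For example, if one sample has probability 0.51 of being positive, we label it positive; if another has probability 0.99, we also label it positive. Clearly the second is more convincing. The two probability formulas can be merged into one:

$$\text{Loss}(h_\theta(x), y) = h_\theta(x)^{y}\,\left(1 - h_\theta(x)\right)^{1-y}$$

We call the merged expression, Loss, the loss function (strictly speaking it is a likelihood, since it is maximized). When y equals 1, the exponent (1-y) is 0, so the second factor equals 1 and only h_θ(x) remains; when y equals 0, the exponent y is 0, so the first factor equals 1 and only 1-h_θ(x) remains. To simplify the problem, take the logarithm of the whole expression:

$$\text{Loss}(h_\theta(x), y) = y \log h_\theta(x) + (1-y)\log\left(1 - h_\theta(x)\right)$$

This loss function applies to a single sample. Given a sample, it yields the probability of the class the sample belongs to; the larger this probability the better, so the goal is to maximize the function. With these probabilities in hand, maximum likelihood estimation comes in. Assuming the samples are independent of one another, the probability of generating the whole sample set is the product of the individual sample probabilities; taking its logarithm gives:

$$J(\theta) = \sum_{i=1}^{m}\left[y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

Here m is the total number of samples, y(i) is the class of the i-th sample, and x(i) is the i-th sample. Note that θ is a multi-dimensional vector, and so is x(i).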
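The summed log form of J(θ) is straightforward to evaluate directly. The sketch below uses a tiny made-up dataset; the function name `log_likelihood` and the data are assumptions for illustration only.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """J(theta) = sum_i [ y_i*log(h_i) + (1-y_i)*log(1-h_i) ], with h_i = sigmoid(theta . x_i)."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return float(np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Tiny hypothetical dataset: the first column is the constant 1.0.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])

j_zero = log_likelihood(np.zeros(2), X, y)              # every h_i is 0.5
j_better = log_likelihood(np.array([0.0, 2.0]), X, y)   # separates the three points
print(j_zero, j_better)
```

Parameters that classify the samples more confidently give a larger (less negative) J(θ), which is why the fit is posed as a maximization.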

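This is like climbing a hill, approaching the extremum one small step at a time. This way of searching for the best-fit parameters is an optimization algorithm. For a function f(x), the hill-climbing step can be written as:

$$x_{i+1} = x_i + \alpha\,\frac{\partial f(x_i)}{\partial x_i}$$

Here α is the step size, i.e. the learning rate, which controls the size of each update.

Testing the gradient ascent algorithm: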

# -*- coding: UTF-8 -*-

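def Gradient_Ascent_test():
    def f_prime(x_old):
        return -2*x_old+4#derivative of f(x)=-x^2+4x
    x_old=-1#initial value; only needs to differ from x_new so that the loop starts
    x_new=0#starting point of gradient ascent, i.e. x=0
    alpha=0.01#learning rate
    precision=0.00000001#precision, i.e. the update threshold
    while abs(x_new-x_old)>precision:
        x_old=x_new
        x_new=x_old+alpha*f_prime(x_old)#x_{i+1}=x_i+alpha*f'(x_i)
    print(x_new)#print the approximate location of the maximum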

if __name__=='__main__':
    Gradient_Ascent_test()

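The result is already very close to the true maximizer x=2. This process is the gradient ascent algorithm. The extremum of J(θ) can be found in the same way; the update can be written as:

$$\theta_j := \theta_j + \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}$$

From the previous subsection, J(θ) is:

$$J(\theta) = \sum_{i=1}^{m}\left[y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

The sigmoid function is:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Now we only need the partial derivative of J(θ) to apply gradient ascent and find the maximum of J(θ). Using the chain rule together with σ'(z) = σ(z)(1-σ(z)), the terms simplify.

In summary:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$

Therefore, the gradient ascent iteration is:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$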
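One way to gain confidence in the gradient used by the iteration is to compare it with a finite-difference estimate. This is only a sanity-check sketch on made-up data; the function names and the test point `theta` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """J(theta) summed over all samples."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Analytic gradient in vector form: X^T (y - h)."""
    return X.T @ (y - sigmoid(X @ theta))

# Hypothetical data and evaluation point.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y) - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(theta))
])
analytic = gradient(theta, X, y)
print(numeric, analytic)
```

The two vectors should agree to several decimal places, which supports the vectorized update `weights + alpha * X^T (y - h)` used in the code that follows.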

Example 1:

# -*- coding: UTF-8 -*-
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet():
    dataMat=[]#list of data rows
    labelMat=[]#list of labels
    fr=open('testSet.txt')
    for line in fr.readlines():#read line by line
        lineArr=line.strip().split()#strip the newline and split into fields
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])#add the data row; the leading 1.0 is x0
        labelMat.append(int(lineArr[2]))#add the label
    fr.close()#close the file
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))#sigmoid formula

def gradAscent(dataMatIn,classLabels):#inputs: data set, class labels
    dataMatrix=np.mat(dataMatIn)#mat() converts the list to a NumPy matrix
    labelMat=np.mat(classLabels).transpose()#transpose the labels into a column vector
    m,n=np.shape(dataMatrix)#size of dataMatrix: m rows, n columns
    alpha=0.001#step size
    maxCycles=500#maximum number of iterations
    weights=np.ones((n,1))
    for k in range(maxCycles):
        h=sigmoid(dataMatrix*weights)#vectorized prediction for all samples
        error=(labelMat-h)
        weights=weights+alpha*dataMatrix.transpose()*error#gradient ascent update
    return weights.getA()#getA() is the inverse of mat(): it converts a NumPy matrix back to an array

if __name__=='__main__':
    dataMat,labelMat=loadDataSet()
    print(gradAscent(dataMat,labelMat))
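NumPy's documentation no longer recommends the `np.matrix` class, so the same batch update can also be written with plain arrays and the `@` operator. The sketch below is an alternative formulation, not the book's code; the function name `grad_ascent` is hypothetical, and because `testSet.txt` is not included here it runs on synthetic, linearly separable data generated with a fixed seed.

```python
import numpy as np

def grad_ascent(X, y, alpha=0.001, max_cycles=500):
    """Batch gradient ascent on the logistic log-likelihood using plain ndarrays."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.ones(X.shape[1])
    for _ in range(max_cycles):
        h = 1.0 / (1.0 + np.exp(-X @ weights))      # predictions for all samples
        weights = weights + alpha * X.T @ (y - h)   # w := w + alpha * X^T (y - h)
    return weights

# Synthetic, linearly separable data (hypothetical stand-in for testSet.txt).
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))
labels = (pts[:, 0] + pts[:, 1] > 0).astype(int)
X = np.column_stack([np.ones(100), pts])            # prepend the constant x0 = 1.0

w = grad_ascent(X, labels)
pred = (X @ w > 0).astype(int)                      # z > 0 is equivalent to sigmoid(z) > 0.5
accuracy = (pred == labels).mean()
print(w, accuracy)
```

The returned weights are a flat array of length 3, so they can be indexed as `w[0]`, `w[1]`, `w[2]` without calling `getA()`.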

Plotting the decision boundary:

# -*- coding: UTF-8 -*-
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet():
    dataMat=[]#list of data rows
    labelMat=[]#list of labels
    fr=open('testSet.txt')
    for line in fr.readlines():#read line by line
        lineArr=line.strip().split()#strip the newline and split into fields
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])#add the data row; the leading 1.0 is x0
        labelMat.append(int(lineArr[2]))#add the label
    fr.close()#close the file
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))#sigmoid formula

def gradAscent(dataMatIn,classLabels):#inputs: data set, class labels
    dataMatrix=np.mat(dataMatIn)#mat() converts the list to a NumPy matrix
    labelMat=np.mat(classLabels).transpose()#transpose the labels into a column vector
    m,n=np.shape(dataMatrix)#size of dataMatrix: m rows, n columns
    alpha=0.001#step size
    maxCycles=500#maximum number of iterations
    weights=np.ones((n,1))
    for k in range(maxCycles):
        h=sigmoid(dataMatrix*weights)#vectorized prediction for all samples
        error=(labelMat-h)
        weights=weights+alpha*dataMatrix.transpose()*error#gradient ascent update
    return weights.getA()#getA() is the inverse of mat(): it converts a NumPy matrix back to an array

def plotBestFit(weights):
    dataMat,labelMat=loadDataSet()
    dataArr=np.array(dataMat)#convert to a NumPy array
    n=np.shape(dataMat)[0]#number of samples
    xcord1=[];ycord1=[]#positive samples
    xcord2=[];ycord2=[]#negative samples
    for i in range(n):#split the points by label
        if int(labelMat[i])==1:#1 = positive sample
            xcord1.append(dataArr[i,1]);ycord1.append(dataArr[i,2])
        else:                  #0 = negative sample
            xcord2.append(dataArr[i,1]);ycord2.append(dataArr[i,2])
    fig=plt.figure()
    ax=fig.add_subplot(111)#add a subplot
    ax.scatter(xcord1,ycord1,s=20,c='yellow',marker='s',alpha=.5)#plot the positive samples
    ax.scatter(xcord2,ycord2,s=20,c='green',alpha=.5)#plot the negative samples
    x=np.arange(-3.0,3.0,0.1)#np.arange() returns evenly spaced values from start up to (not including) stop
    y=(-weights[0]-weights[1]*x)/weights[2]#set the sigmoid input z to 0 (z=0 is the boundary between class 1 and class 0): 0=w0x0+w1x1+w2x2 with x0=1; solving for x2 in terms of x1 gives the separating line
    ax.plot(x,y)
    plt.xlabel('x1');plt.ylabel('x2');#axis labels
    plt.show()

if __name__=='__main__':
    dataMat,labelMat=loadDataSet()
    weights=gradAscent(dataMat,labelMat)
    plotBestFit(weights)
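The line drawn by `plotBestFit` comes from solving z = 0 for x2. A minimal numeric check, using hypothetical weights rather than ones fitted on `testSet.txt`:

```python
import numpy as np

# Hypothetical weights [w0, w1, w2]; in the script above they come from gradAscent.
w = np.array([4.0, 0.5, -0.6])

x1 = np.array([-3.0, 0.0, 3.0])
x2 = (-w[0] - w[1] * x1) / w[2]     # solve 0 = w0*1 + w1*x1 + w2*x2 for x2

# Every point on the line gives z = 0, i.e. sigmoid(z) = 0.5.
z = w[0] + w[1] * x1 + w[2] * x2
print(x2, z)
```

Points on one side of this line have z > 0 and are assigned to class 1; points on the other side are assigned to class 0.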

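Example 2: Improved stochastic gradient ascent

Gradient ascent walks through the entire data set every time it updates the regression coefficients, so the computational cost becomes too high when the data set has many samples and features. One improvement is to update the coefficients using only one sample point at a time; this method is called stochastic gradient ascent.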

# -*- coding: UTF-8 -*-
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np
import random

def loadDataSet():
    dataMat=[]#list of data rows
    labelMat=[]#list of labels
    fr=open('testSet.txt')
    for line in fr.readlines():#read line by line
        lineArr=line.strip().split()#strip the newline and split into fields
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])#add the data row; the leading 1.0 is x0
        labelMat.append(int(lineArr[2]))#add the label
    fr.close()#close the file
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))#sigmoid formula


def plotBestFit(weights):
    dataMat,labelMat=loadDataSet()
    dataArr=np.array(dataMat)#convert to a NumPy array
    n=np.shape(dataMat)[0]#number of samples
    xcord1=[];ycord1=[]#positive samples
    xcord2=[];ycord2=[]#negative samples
    for i in range(n):#split the points by label
        if int(labelMat[i])==1:#1 = positive sample
            xcord1.append(dataArr[i,1]);ycord1.append(dataArr[i,2])
        else:                  #0 = negative sample
            xcord2.append(dataArr[i,1]);ycord2.append(dataArr[i,2])
    fig=plt.figure()
    ax=fig.add_subplot(111)#add a subplot
    ax.scatter(xcord1,ycord1,s=20,c='yellow',marker='s',alpha=.5)#plot the positive samples
    ax.scatter(xcord2,ycord2,s=20,c='green',alpha=.5)#plot the negative samples
    x=np.arange(-3.0,3.0,0.1)#np.arange() returns evenly spaced values from start up to (not including) stop
    y=(-weights[0]-weights[1]*x)/weights[2]#set the sigmoid input z to 0 (z=0 is the boundary between class 1 and class 0): 0=w0x0+w1x1+w2x2 with x0=1; solving for x2 in terms of x1 gives the separating line
    ax.plot(x,y)
    plt.xlabel('x1');plt.ylabel('x2');#axis labels
    plt.show()


def stocGradAscent1(dataMatrix, classLabels, numIter=150):
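    m, n=np.shape(dataMatrix)#size of dataMatrix: m rows, n columns
    weights=np.ones(n)#initialize the coefficients
    for j in range(numIter):
        dataIndex=list(range(m))
        for i in range(m):
            alpha=4/(1.0+j+i)+0.01#alpha shrinks as j and i grow but never drops below 0.01; j is the pass number, i is the position within the pass
            randIndex=int(random.uniform(0, len(dataIndex)))#pick one of the remaining samples at random
            h=sigmoid(sum(dataMatrix[dataIndex[randIndex]] * weights))#compute h for the chosen sample
            error=classLabels[dataIndex[randIndex]] - h#compute the error
            weights=weights + alpha * error * dataMatrix[dataIndex[randIndex]]#update the regression coefficients
            del(dataIndex[randIndex])#remove the sample that was just used
    return weights  #return the fitted coefficients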


if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = stocGradAscent1(np.array(dataMat), labelMat)
    plotBestFit(weights)
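The two ingredients of `stocGradAscent1`, a single-sample update and a decaying step size, can be looked at in isolation. The helper names `sgd_step` and `step_size` and the sample values are hypothetical; the step-size formula is the one used in the script above.

```python
import numpy as np

def sgd_step(weights, x_i, y_i, alpha):
    """One stochastic gradient ascent update using a single sample (x_i, y_i)."""
    h = 1.0 / (1.0 + np.exp(-np.dot(x_i, weights)))
    return weights + alpha * (y_i - h) * x_i

def step_size(j, i):
    """Decaying learning rate: shrinks as j and i grow but stays above 0.01."""
    return 4 / (1.0 + j + i) + 0.01

w0 = np.ones(3)
x_i = np.array([1.0, -2.0, 0.5])   # hypothetical sample; the first entry is the constant 1.0
w1 = sgd_step(w0, x_i, 1, step_size(0, 0))

# For a positive sample, the update moves z = x_i . w upward, toward predicting class 1.
print(np.dot(x_i, w0), np.dot(x_i, w1))
print(step_size(0, 0), step_size(100, 50))
```

The constant 0.01 keeps later samples from having no influence at all, while the decaying term damps the oscillations that a fixed step size would cause.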