【Machine Learning in Action】4. Classification Based on Probability Theory: Naive Bayes

    This chapter is about classification with probability theory. The "naive" in naive Bayes comes from the fact that the whole procedure makes only the simplest possible assumptions.

    It covers the following:

        1. Use Python's text-processing facilities to split documents into word vectors, and use those vectors to classify documents

        2. Build another classifier and see how well it filters a real spam email dataset

        3. Learn a classifier from a large number of personal ads and turn what it learns into human-readable information

-----

4.1 Classification based on Bayesian decision theory

        Pros: still effective with small amounts of data; can handle multi-class problems

        Cons: sensitive to how the input data is prepared

        Works with: nominal data


    Suppose a dataset is made up of two classes. Let p1(x,y) denote the probability that point (x,y) belongs to class 1, and p2(x,y) the probability that it belongs to class 2. A new data point is then classified as follows (this is the core idea of Bayesian decision theory; a small sketch follows this list):

        【1】If p1(x,y) > p2(x,y), it belongs to class 1

        【2】If p1(x,y) < p2(x,y), it belongs to class 2
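
A minimal sketch of this decision rule in Python; p1 and p2 below are hypothetical toy probability functions standing in for p1(x,y) and p2(x,y), not anything from the book's code:

def classify(x, y, p1, p2):
    # pick the class whose probability at (x, y) is larger
    return 1 if p1(x, y) > p2(x, y) else 2

# toy densities: class 1 concentrated near the origin, class 2 away from it
p1 = lambda x, y: max(0.0, 1.0 - (x**2 + y**2))
p2 = lambda x, y: min(1.0, (x**2 + y**2) / 4.0)
print(classify(0.1, 0.2, p1, p2))   # -> 1
print(classify(1.5, 1.5, p1, p2))   # -> 2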

Compared with other algorithms:

        【1】kNN would have to compute 1,000 distances, which is a lot of computation

        【2】A decision tree, splitting the data along the x-axis and then the y-axis, would not be very successful


--------

4.2 Conditional probability

Aside: the equally-likely model (classical probability model)

    【1】The sample space of the experiment contains only finitely many elements

    【2】Every elementary event of the experiment is equally likely




Given two events A and B, the probability that A occurs given that B has occurred is:

P(A|B) = P(AB)/P(B)



Bucket A holds 4 balls and bucket B holds 3 (7 balls in total; exactly one of the balls in B is white). What is the probability that a ball drawn from B is white?

P(W|B) = P(WB) / P(B) = (1/7) / (3/7) = 1/3
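
A quick numeric check of this example in Python, just redoing the arithmetic above:

p_B  = 3 / 7    # P(B): the drawn ball comes from bucket B (3 of the 7 balls)
p_WB = 1 / 7    # P(WB): the ball is white AND comes from bucket B (1 of the 7 balls)
print(p_WB / p_B)   # P(W|B) = 0.333...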

--------

4.3 Bayesian probability






Naive Bayes workflow:
0. Start from the document matrix and the class label vector
1. Iterate over all the documents and build a vocabulary list of every unique word that appears
2. Given the matrix and the vocabulary list, convert each row (document) into a vector: each position is 1 or 0 depending on whether the corresponding vocabulary word appears in that document

3. Naive Bayes training: given the matrix of document vectors and the label vector, iterate over the matrix and accumulate two large count vectors (one per class, by element-wise summation), along with each class's total word count; from these the per-class word probability vectors can be computed



#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 18/2/11 2:35 PM
# @Author  : 
# @Site    : 
# @File    : bayes.py
# @Software: PyCharm
import re
import random
from numpy import ones, log, array
import feedparser

#word-list-to-vector conversion functions
def loadDataSet():

    #sample data: forum posts already split into lists of tokens
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1] #class labels (manually annotated): 1 = abusive, 0 = normal
    return postingList, classVec


#build the vocabulary: a list of every unique word appearing in any document
def createVocabList(dataSet):
    vocabSet = set([])     #start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union with each document's words; the set keeps every word exactly once
    return list(vocabSet)


#set-of-words model: each word is recorded at most once.
#Converts an input document plus the vocabulary into a document vector such as [0,1,0,1,1,1,0,0,0,0],
#where each position corresponds to one word of the vocabulary
def setOfWords2Vec(vocabList, inputSet):
    print('------ entering setOfWords2Vec, input = %s, vocabulary = %s' % (inputSet, vocabList))
    returnVec = [0]*len(vocabList)  #create a vector with all elements set to 0
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1 #the word is in the vocabulary: set the corresponding position to 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    print('-------- converted vector', returnVec)
    return returnVec

#bag-of-words model: a word can be counted multiple times
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # create a vector with all elements set to 0
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
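
# Example (hypothetical toy data, for illustration only): with
#   vocabList = ['dog', 'my', 'stupid'] and inputSet = ['my', 'dog', 'my'],
# setOfWords2Vec returns [1, 1, 0] (presence only), while
# bagOfWords2VecMN returns [1, 2, 0] (occurrence counts).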

#naive Bayes classifier training function (document matrix, class label vector)
def trainNBO(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix) #number of documents (rows of the matrix)
    numWords = len(trainMatrix[0])  #vocabulary size (length of each document vector)
    pAbusive = sum(trainCategory)/float(numTrainDocs) #probability that a document belongs to class 1
    p0Num = ones(numWords) #word-count vector for class 0, initialized to ones (avoids zero probabilities)
    p1Num = ones(numWords) #word-count vector for class 1, initialized to ones
    p0Denom = 2.0; p1Denom = 2.0 #total word count per class, initialized to 2.0 (same smoothing idea)

    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  #this document is labelled class 1
            p1Num += trainMatrix[i] #add its word vector to the class-1 counts, element by element
            # print('---1----trainMatrix[%s]' % i, trainMatrix[i])
            # print('---sum----trainMatrix[%s]' % i, sum(trainMatrix[i]))
            p1Denom += sum(trainMatrix[i]) #add this document's word count to the class-1 total
        else:
            p0Num += trainMatrix[i]
            # print('---0----trainMatrix[%s]'%i,trainMatrix[i])
            # print('---sum----trainMatrix[%s]' % i, sum(trainMatrix[i]))
            p0Denom += sum(trainMatrix[i])

    p1Vect = log(p1Num/p1Denom)  #log of each word's conditional probability given class 1
    p0Vect = log(p0Num/p0Denom)  #log of each word's conditional probability given class 0
    # print('---- p1 word vector:', p1Num, '\n', '----- p1 / total words:', p1Vect)
    # print('---- p0 word vector:', p0Num, '\n', '----- p0 / total words:', p0Vect)
    return p0Vect, p1Vect, pAbusive

# print('----------------- loading data -------------')
# data, vec = loadDataSet()
# li = createVocabList(data)
# print('--- document matrix --', data)
# print('--- label vector ---', vec)
# print('--- vocabList (unique words across documents) ---', li)
#
# trainMat = []
# #iterate over the documents, converting each into a vector and appending it to the matrix
# for postinDoc in data:
#     print('\n', '-------- iterating documents, adding document vector to matrix', postinDoc)
#     trainMat.append(setOfWords2Vec(li, postinDoc))
#     print(trainMat)
#
# p0V,p1V,pAb = trainNBO(trainMat,vec)


#naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    #log of the class prior plus the sum of log word probabilities for the words present
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
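
# Why classifyNB works (a short derivation, not from the original text): under the naive
# independence assumption, P(c | w) is proportional to P(c) * prod_i P(w_i | c).
# Taking logs turns the product into a sum and avoids floating-point underflow:
#   log P(c) + sum_i log P(w_i | c)
# which for class 1 is exactly log(pClass1) + sum(vec2Classify * p1Vec),
# since p1Vec already holds log probabilities and vec2Classify selects/weights the words present.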

def testingNB():
    print('----------------- loading data -------------')
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    p0V, p1V, pAb = trainNBO(array(trainMat),array(listClasses))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as :', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['love', 'my','dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry,'classified as :', classifyNB(thisDoc, p0V, p1V, pAb))
#Step 1: prepare the data by splitting the text, using any character other than letters and digits as a delimiter

def textParse(inputText):
    regEx = re.compile('\\W+') #matches non-word characters, i.e. anything other than [A-Za-z0-9_]
    listOfTokens = regEx.split(inputText)
    #drop empty tokens and convert everything to lower case
    return [tok.lower() for tok in listOfTokens if len(tok) > 0]
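
# A quick illustration of textParse on a made-up sentence (not from the book's data files):
#   textParse('This book is the best book on Python!')
#   -> ['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python']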


def spamTest():
    docList=[]; classList=[]; fullText=[]
    #read in the .txt files
    for i in range(1, 26):
        wordList = textParse(open('./spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)

        wordList = textParse(open('./ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)

    #randomly build the training set; hold out 10 documents as the test set
    trainingSet = list(range(50)); testSet=[]
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))#pick a random index
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex]) #remove it from the training set so train and test do not overlap
    trainMat=[]; trainClasses = []

    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNBO(array(trainMat),array(trainClasses))
    errorCount = 0

    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V,pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:',float(errorCount)/len(testSet))


#-------------- using naive Bayes to find region-specific words --------------

#RSS feed classifier and high-frequency word removal

#count how often each vocabulary word occurs in fullText, sort, and return the 30 most frequent
def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.items(),key=operator.itemgetter(1),#sort by the count field of each (word, count) pair
                        reverse=True)
    return sortedFreq[:30]



def localWords(feed1,feed0):#takes two RSS feeds as parameters
    import feedparser
    docList=[]; classList=[]; fullText=[]
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList,fullText)#get the 30 most frequent words
    #remove the high-frequency words from the vocabulary
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = list(range(2*minLen)); testSet=[] #Python 2 vs 3: in Python 3, range() must be wrapped in list() so items can be deleted

    for i in range(20):#randomly hold out 20 documents as the test set
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])

    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSam = trainNBO(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount)/len(testSet))
    return vocabList,p0V,p1V

#analyze the data: display the region-specific words, i.e. the highest-ranked words for each feed
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY=[];topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0:topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print(item[0])

def main():
    import feedparser
    ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
    sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
    # vocabList,psF,pNY = localWords(ny,sf)
    # print(vocabList,'\n-----',psF,'\n-----',pNY)
    # vocabList,psF,pNY = localWords(ny,sf)
    getTopWords(ny,sf)



def test():
    a = list(range(5))
    print(a)
    a.append((-1,-2))
    print(a)

if __name__ == "__main__":
    # testingNB()
    main()
    # test()



Printed output, from a run of the basic workflow with the debug prints enabled (note: this particular run used the first, unsmoothed version of trainNBO, with counts initialized to zeros and no log transform, which is why zeros and raw ratios appear below):

----------------- loading data -------------
--- document matrix -- [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
--- label vector --- [0, 1, 0, 1, 0, 1]
--- vocabList (unique words across documents) --- ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']

 -------- iterating documents, adding document vector to matrix ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
------ entering setOfWords2Vec, input = ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]]

 -------- iterating documents, adding document vector to matrix ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid']
------ entering setOfWords2Vec, input = ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]

 -------- iterating documents, adding document vector to matrix ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him']
------ entering setOfWords2Vec, input = ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]]

 -------- iterating documents, adding document vector to matrix ['stop', 'posting', 'stupid', 'worthless', 'garbage']
------ entering setOfWords2Vec, input = ['stop', 'posting', 'stupid', 'worthless', 'garbage'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]

 -------- iterating documents, adding document vector to matrix ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']
------ entering setOfWords2Vec, input = ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]]

 -------- iterating documents, adding document vector to matrix ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
------ entering setOfWords2Vec, input = ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'], vocabulary = ['worthless', 'love', 'flea', 'mr', 'maybe', 'take', 'I', 'problems', 'dalmation', 'park', 'food', 'dog', 'is', 'stupid', 'so', 'posting', 'cute', 'how', 'ate', 'steak', 'help', 'buying', 'to', 'him', 'not', 'garbage', 'please', 'quit', 'licks', 'my', 'stop', 'has']
-------- converted vector [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]
---0----trainMatrix[0] [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
---sum----trainMatrix[0] 7
---1----trainMatrix[1] [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
---sum----trainMatrix[1] 8
---0----trainMatrix[2] [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
---sum----trainMatrix[2] 8
---1----trainMatrix[3] [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
---sum----trainMatrix[3] 5
---0----trainMatrix[4] [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
---sum----trainMatrix[4] 9
---1----trainMatrix[5] [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
---sum----trainMatrix[5] 6
---- p1 word vector: [ 2.  0.  0.  0.  1.  1.  0.  0.  0.  1.  1.  2.  0.  3.  0.  1.  0.  0.
  0.  0.  0.  1.  1.  1.  1.  1.  0.  1.  0.  0.  1.  0.] 
 ----- p1 / total words: [ 0.10526316  0.          0.          0.          0.05263158  0.05263158
  0.          0.          0.          0.05263158  0.05263158  0.10526316
  0.          0.15789474  0.          0.05263158  0.          0.          0.
  0.          0.          0.05263158  0.05263158  0.05263158  0.05263158
  0.05263158  0.          0.05263158  0.          0.          0.05263158
  0.        ]
---- p0 word vector: [ 0.  1.  1.  1.  0.  0.  1.  1.  1.  0.  0.  1.  1.  0.  1.  0.  1.  1.
  1.  1.  1.  0.  1.  2.  0.  0.  1.  0.  1.  3.  1.  1.] 
 ----- p0 / total words: [ 0.          0.04166667  0.04166667  0.04166667  0.          0.
  0.04166667  0.04166667  0.04166667  0.          0.          0.04166667
  0.04166667  0.          0.04166667  0.          0.04166667  0.04166667
  0.04166667  0.04166667  0.04166667  0.          0.04166667  0.08333333
  0.          0.          0.04166667  0.          0.04166667  0.125
  0.04166667  0.04166667]
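
Why the final code takes the log of the probability vectors: multiplying many small per-word probabilities underflows to 0.0 in floating point, while summing their logs stays representable. A minimal sketch with made-up numbers (not taken from the run above):

from numpy import full, prod, log
probs = full(300, 0.05)      # 300 small per-word probabilities
print(prod(probs))           # 0.0  -> the product underflows
print(sum(log(probs)))       # about -898.7 -> the log-sum is still fine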

