Principle:
If P(c1) > P(c0), the sample is assigned to class c1;
if P(c0) > P(c1), the sample is assigned to class c0.
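The decision rule above can be sketched as a small function. This is only an illustration: the name `classify` is hypothetical, and sending a tie to class 0 is an assumption, since the rule as stated does not say what happens when the two probabilities are equal.

```python
def classify(p_c0, p_c1):
    """Pick the class with the larger probability.

    Assumption: a tie (p_c0 == p_c1) falls back to class 0,
    which the original rule does not specify.
    """
    return 1 if p_c1 > p_c0 else 0


print(classify(0.7, 0.3))  # 0
print(classify(0.2, 0.8))  # 1
```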
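Community posts are split into abusive and non-abusive posts. The samples are stored in postingList and their labels in classVec, where 1 marks an abusive post.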
def loadDataSet():
    # Each inner list is one post, already split into words
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'garbage'],
        ['mr', 'licks', 'age', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 0 = normal, 1 = abusive
    return postingList, classVec
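Next comes building the training set: collect the words from all posts into a single list, remove duplicates, and return the vocabulary containing every word.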
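def createVocabList(dataSet):  # build the vocabulary from all documents
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with the words of this document
    return list(vocabSet)  # all words gathered into one list, duplicates removed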
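One caveat with returning `list(vocabSet)`: the iteration order of a set of strings may differ between interpreter runs, so the vector columns may not line up from one run to the next. A minimal sketch of a variant that sorts the vocabulary to keep the order reproducible is shown here; `create_sorted_vocab_list` is a hypothetical name, not part of the original code.

```python
def create_sorted_vocab_list(data_set):
    """Collect the unique words of all documents in a stable, sorted order."""
    vocab_set = set()
    for document in data_set:
        vocab_set.update(document)  # add every word of this document
    return sorted(vocab_set)        # sorting makes the order reproducible


posts = [['my', 'dog'], ['stupid', 'dog']]
vocab = create_sorted_vocab_list(posts)
print(vocab)  # ['dog', 'my', 'stupid']
```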
Each post must then be converted into a vector: 0 means the vocabulary word does not appear in the post, 1 means it does.
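def setOfWords2Vec(vocabList, inputSet):  # vocabulary (no duplicates), words of one post
    """
    For each word in the input set, set the matching vocabulary position to 1
    if the word is in the vocabulary; all other positions stay 0.
    Converts a list of words into a vector as long as the vocabulary.
    """
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary' % word)
    return returnVec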
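Applying this conversion to every post yields the training matrix. The sketch below shows one way to do that in a single pass; `build_train_matrix` is a hypothetical helper, and it swaps `list.index` for a dictionary lookup, which is an optimization not present in the original. It also skips unknown words silently instead of printing a message. The column order follows whatever vocabulary is passed in, so the positions of the 1s depend on that order.

```python
def build_train_matrix(vocab_list, posts):
    """Convert each post into a 0/1 vector over vocab_list."""
    # Map each vocabulary word to its column once, instead of
    # calling vocab_list.index(word) for every word of every post.
    index = {word: i for i, word in enumerate(vocab_list)}
    matrix = []
    for post in posts:
        vec = [0] * len(vocab_list)
        for word in post:
            if word in index:          # words outside the vocabulary are ignored
                vec[index[word]] = 1
        matrix.append(vec)
    return matrix


vocab = ['dog', 'garbage', 'my', 'stupid']
posts = [['my', 'dog'], ['stupid', 'garbage', 'stupid']]
train_matrix = build_train_matrix(vocab, posts)
print(train_matrix)  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```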
After the posts are processed, the resulting vectors look like this:
[
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,