Text Classification with a Naive Bayes Classifier (NBC)

Latest recommended article published on 2023-02-12 11:54:53

种种的迹象表明


Category column: Learning. Article tags: machine learning

Article link: https://blog.csdn.net/giraffe_kun/article/details/103586607


1. Objective

Classify documents with a naive Bayes classifier.

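2. Theoretical background

2.1 Bayes' theorem

Suppose there are two events, A and B. Let $p(A)$ be the probability of A, $p(B)$ the probability of B, $p(B|A)$ the probability of B given that A has occurred, $p(A|B)$ the probability of A given that B has occurred, and $p(AB)$ the probability that A and B occur together. Then

$p(AB)=p(A)p(B|A)=p(B)p(A|B)\tag{1}$

From Eq. (1), Bayes' theorem follows:
$p(B|A)=\frac{p(B)p(A|B)}{p(A)}\tag{2}$

Given a partition $\{B_1,B_2,…,B_n\}$ of the sample space, where $B_i$ and $B_j$ are disjoint for $i \neq j$, i.e. $B_iB_j=\varnothing$, the law of total probability gives, for any event A,
$p(A)=\sum_{i=1}^{n} {p(B_i)p(A|B_i)} \tag{3}$
The general form of Bayes' theorem is then
$p(B_i|A)=\frac {p(B_i)p(A|B_i)}{\sum_{i=1}^{n}{p(B_i)p(A|B_i)}}\tag{4}$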
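As a quick numerical check of Eq. (4), here is a minimal sketch in Python. The function name `posterior` and the prior and likelihood values are made up for illustration; they do not come from the experiment.

```python
def posterior(priors, likelihoods):
    """Return p(B_i | A) for every i, given p(B_i) and p(A | B_i)."""
    # Numerators of Eq. (4): p(B_i) * p(A | B_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator of Eq. (4): the law of total probability, Eq. (3)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical numbers: three equally likely classes, and the probability
# of observing some event A (say, a given word) under each class.
priors = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.6, 0.3, 0.1]
post = posterior(priors, likelihoods)
print([round(p, 3) for p in post])
```

With equal priors the posterior is just the normalised likelihood, so the class with the largest $p(A|B_i)$ wins.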

2.2 Basic principle of naive Bayes

Consider a training set $\{(X_1,y_1),(X_2,y_2),(X_3,y_3),…,(X_m,y_m)\}$, where m is the number of samples. Each sample has n features, i.e. $X_i=(x_{i1},x_{i2},…,x_{in})$. The set of class labels is $\{y_1,y_2,…,y_k\}$ (from here on the index of $y_i$ runs over classes, not over samples). Let $p(y=y_i|X=x)$ denote the probability that the output $y$ is $y_i$ when the input sample X is x.
Given a new sample x, we decide its class by computing $p(y=y_1|x), p(y=y_2|x), p(y=y_3|x), …, p(y=y_k|x)$ and picking the class with the largest value. In other words, we look for the maximum posterior probability $\arg\max_{y_i} p(y=y_i|x)$.
How are these posterior probabilities obtained? By Bayes' theorem,
$\mathrm{p}\left(\mathrm{y}=\mathrm{y}_{\mathrm{i}} | \mathrm{x}\right)=\frac{\mathrm{p}\left(\mathrm{y}_{\mathrm{i}}\right) \mathrm{p}\left(\mathrm{x} | \mathrm{y}_{\mathrm{i}}\right)}{\mathrm{p}(\mathrm{x})}\tag{5}$
Naive Bayes assumes that the features are conditionally independent given the class, so $p(x|y_i)=\prod_{j=1}^{n} p(x_j|y_i)$ and Eq. (5) can be written as
$p\left(y=y_{i} | x\right)=\frac{p\left(y_{i}\right) p\left(x | y_{i}\right)}{p(x)}=\frac{p\left(y_{i}\right) \prod_{j=1}^{n} p\left(x_{j} | y_{i}\right)}{p(x)}\tag{6}$
The denominator $p(x)$ in Eq. (6) is the same for every $p(y=y_i|x)$, so in practice it can be dropped. The decision rule of the naive Bayes classifier then becomes
$\mathrm{y}=\arg \max _{\mathrm{y}_{\mathrm{i}}} \mathrm{p}\left(\mathrm{y}_{\mathrm{i}}\right) \mathrm{p}\left(\mathrm{x} | \mathrm{y}_{\mathrm{i}}\right)=\arg \max _{\mathrm{y}_{\mathrm{i}}} \mathrm{p}\left(\mathrm{y}_{\mathrm{i}}\right) \prod_{\mathrm{j}=1}^{\mathrm{n}} \mathrm{p}\left(\mathrm{x}_{\mathrm{j}} | \mathrm{y}_{\mathrm{i}}\right) \tag{7}$
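The decision rule in Eq. (7) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the functions `train_nb` and `predict_nb`, the toy documents, the add-one (Laplace) smoothing parameter `alpha`, and the use of log-probabilities to avoid underflow are all additions that the text above does not specify.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log p(y) and log p(x_j | y) from tokenised documents."""
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    log_prior = {y: math.log(c / len(labels)) for y, c in class_counts.items()}
    log_like = {}
    for y in class_counts:
        # Assumption: add-alpha smoothing so unseen words do not give p = 0
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_like[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Eq. (7) in log space: argmax over y of log p(y) + sum_j log p(x_j | y)."""
    scores = {}
    for y in log_prior:
        # Words outside the training vocabulary are skipped
        scores[y] = log_prior[y] + sum(log_like[y][w] for w in doc if w in log_like[y])
    return max(scores, key=scores.get)

# Hypothetical toy corpus, one short document per class
docs = [["stock", "market", "bank"], ["match", "goal", "team"], ["code", "cpu", "bank"]]
labels = ["economy", "sports", "computer"]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb(["goal", "team", "market"], log_prior, log_like))  # prints: sports
```

Taking logs turns the product in Eq. (7) into a sum, which does not change the arg max but stays numerically stable for long documents.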

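3. Data and method

3.1 Dataset

Fifteen Chinese documents were collected from the internet for each of three categories (economy, sports, computers), 45 documents in total.

3.2 Experiments

3.2.1 Word segmentation with stop-word removal

1. Walk the corpus folder, collect the path of every text file, and store the paths in the list path_list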

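import os

# Raw string so the backslashes in the Windows path are not treated as escapes
corpus_dir = r'D:\workspace\机器学习实验(课堂)\基于NBC的文本分类\语料'
path_list = []
for dirpath, dirnames, filenames in os.walk(corpus_dir, topdown=False):
    for filename in filenames:
        # Full path of one document; its parent folder is the category
        path_list.append(os.path.join(dirpath, filename))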

2. Load the stop-word list. It is stored as a plain txt file with the following contents

# one word per line
!
"
#
$
%
&
'
....

3. After reading the file, strip() removes the trailing newline '\n' from each line, and the resulting words are stored in the list stopWords

stopWords = []
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"D:\workspace\机器学习实验(课堂)\基于NBC的文本分类\stop_word_list.txt", encoding='utf-8') as infile:
    for word in infile:
        stopWords.append(word.strip())
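Each `word not in stopWords` test on a list scans the whole list, so for a long stop-word file a set is usually the better container. The sketch below shows the idea; `load_stop_words` is a hypothetical helper, and the in-memory `io.StringIO` object with its five entries only stands in for the real stop-word file.

```python
import io

def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}

# Stand-in for the real stop-word file (hypothetical contents)
fake_file = io.StringIO("!\n\"\n#\n的\n了\n")
stop_words = load_stop_words(fake_file)

tokens = ["经济", "的", "增长", "!", "了"]
# Membership tests on a set take constant time on average
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['经济', '增长']
```

The rest of the pipeline does not need to change, because `in` works the same way on a set and on a list.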

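4. Segment the text into words and remove stop words

import jieba

X_list = []  # processed texts (the features), one space-joined string per document
y_list = []  # labels
for path in path_list:
    print(path)
    with open(path, mode='r', encoding='utf-8') as f:
        note = f.read()
    note_1 = note.replace('\n', '')  # remove the newlines from the text
    # print(note_1)
    seglist = jieba.lcut(note_1)  # word segmentation; returns a list
    # Remove stop words and keep the remaining words in newSent
    newSent = []
    for word in seglist:
        word = word.strip()
        # strip() turns whitespace-only tokens into '', so skip empty tokens too
        if word and word not in stopWords:
            newSent.append(word)
    # Split the path on '\'; the second-to-last component is the category folder
    label = path.split('\\')[-2]
    # print(label)
    # Join the remaining words into one string separated by spaces,
    # ready for the term-count and tf-idf steps that follow
    text_str = ' '.join(newSent)
    X_list.append(text_str)
    y_list.append(label)
    # print('category:', label)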
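Splitting the path on a literal backslash works for the Windows paths used here. A slightly more robust way to read the category folder is sketched below with `pathlib`; the helper name `label_from_path` and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath

def label_from_path(path):
    """Return the name of the folder that directly contains the file."""
    # PureWindowsPath parses Windows-style paths on any operating system
    # and accepts both '\\' and '/' as separators
    return PureWindowsPath(path).parent.name

print(label_from_path(r"D:\corpus\sports\001.txt"))  # sports
```

This returns the same value as `path.split('\\')[-2]` for backslash-separated paths, and it also copes with forward slashes.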

5. Use CountVectorizer to compute term counts

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_list)  # count how often each term occurs in each document
print(X.toarray())  # inspect the term counts
'''
Output:
the term counts, stored as a sparse matrix and printed here as a dense array
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
'''
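To make the content of that matrix concrete, here is a rough pure-Python sketch of the same bookkeeping. It is not scikit-learn's implementation; `count_matrix` and the three toy texts are hypothetical. One detail is copied from CountVectorizer's documented default: the token pattern `(?u)\b\w\w+\b` keeps only tokens of two or more characters, so single-character words such as 的 never enter the vocabulary.

```python
import re
from collections import Counter

def count_matrix(texts):
    """Build a document-term count matrix from space-separated token strings."""
    # Keep tokens of two or more word characters (CountVectorizer's default pattern)
    tokenised = [re.findall(r"(?u)\b\w\w+\b", t.lower()) for t in texts]
    # One column per distinct term, in sorted order
    vocab = sorted({w for doc in tokenised for w in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)  # Counter returns 0 for terms the document lacks
        rows.append([counts[w] for w in vocab])
    return vocab, rows

# Hypothetical pre-segmented texts, in the same format as X_list
texts = ["经济 增长 经济 的", "球队 比赛 的", "电脑 程序 比赛"]
vocab, rows = count_matrix(texts)
print(vocab)
for row in rows:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary term, which is the layout that the tf-idf step consumes next.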

6. Use TfidfTransformer to compute tf-idf values

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()  # create the transformer
tfidf = transformer.fit_transform(X)  # turn the term-count matrix X into tf-idf values
X_data = tfidf.toarray()  # dense array; X_data[i][j] is the tf-idf weight of term j in document i
print(tfidf.toarray())
'''
Output:
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.01676228 ... 0.         0.         0.        ]
 ...
 [0.         0.00381169 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
'''
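For reference, the weighting can be written out by hand. With scikit-learn's documented default settings, TfidfTransformer uses the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales every row to unit Euclidean length. The sketch below follows that formula in plain Python; the function name `tfidf_rows` and the small count matrix are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """tf-idf with smoothed idf and L2 row normalisation."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Leave all-zero rows as zeros instead of dividing by zero
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Hypothetical counts: 3 documents, 3 terms
counts = [[2, 1, 0], [0, 1, 1], [0, 1, 3]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

A term that occurs in every document (the middle column here) gets idf 1, the smallest possible value, so rarer terms weigh more within a row.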

7. A hand-written function computes the macro-average F1 and the micro-average F1
For details on macro- and micro-average F1 in multi-class classification, see:
"Evaluation metrics for multi-class classification, PRF (Macro-F1/MicroF1/weighted), explained"

import numpy as np

def strList2numList(y_pred, y_true):
    # Encode the labels as one-vs-rest 0/1 indicators, one pair of lists per class
    # The union keeps classes that occur in y_true but were never predicted
    labels = sorted(set(y_pred) | set(y_true))
    pred_num_list = [[1 if p == c else 0 for p in y_pred] for c in labels]
    true_num_list = [[1 if t == c else 0 for t in y_true] for c in labels]
    return pred_num_list, true_num_list

# TP: number of positives predicted as positive
# FN: number of positives predicted as negative
# FP: number of negatives predicted as positive
# TN: number of negatives predicted as negative
def showF1(y_pred, y_true):
    # Compute and print the micro- and macro-average F1
    pred_num_list, true_num_list = strList2numList(y_pred, y_true)
    TP_list, FP_list, FN_list = [], [], []
    P, R = [], []
    for pred, true in zip(pred_num_list, true_num_list):
        # Counts for one class against all the others
        TP = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
        FP = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)
        FN = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)
        TP_list.append(TP)
        FP_list.append(FP)
        FN_list.append(FN)
        # Guard against classes with no predictions or no true samples
        P.append(TP / (TP + FP) if TP + FP else 0.0)
        R.append(TP / (TP + FN) if TP + FN else 0.0)
    # Micro average: pool the counts over all classes, then compute P, R and F1
    micro_TP = np.sum(TP_list)
    micro_FP = np.sum(FP_list)
    micro_FN = np.sum(FN_list)
    micro_P = micro_TP / (micro_TP + micro_FP)
    micro_R = micro_TP / (micro_TP + micro_FN)
    micro_F1 = 2 * micro_P * micro_R / (micro_P + micro_R) if micro_P + micro_R else 0.0
    # Macro average: average the per-class P and R, then combine them into F1
    macro_P = np.sum(P) / len(P)
    macro_R = np.sum(R) / len(R)
    macro_F1 = 2 * macro_P * macro_R / (macro_R + macro_P) if macro_P + macro_R else 0.0
    print('micro-average F1:', micro_F1)
    print('macro-average F1:', macro_F1)
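A small worked example helps to see how the two averages differ. The sketch below is a separate, self-contained cross-check, not the function above: `f1_scores` and the 15-document test split are hypothetical, and the macro average is taken as the mean of the per-class F1 values, which is the convention behind scikit-learn's `f1_score(average='macro')` and is not the same formula as averaging P and R first.

```python
def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # wrongly assigned to class p
            fn[t] += 1  # missed for class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over classes; macro: average the per-class F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Hypothetical split: 5 documents per class, one sports document labelled economy
y_true = ["economy"] * 5 + ["sports"] * 5 + ["computer"] * 5
y_pred = ["economy"] * 5 + ["economy"] + ["sports"] * 4 + ["computer"] * 5
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)
```

In this example micro F1 equals the accuracy, 14/15, because every error counts once as a false positive and once as a false negative. The per-class F1 values are 10/11, 8/9 and 1, so the macro average comes out slightly lower, at about 0.9327.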

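Train naive Bayes with different random seeds

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

for n in range(1, 11):
    # Use the loop variable as the seed so that each run draws a different split
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_list, test_size=0.33, random_state=n)
    classifier = MultinomialNB()
    # Train the model
    classifier.fit(X_train, y_train)
    # Evaluate the trained model on the held-out documents
    print('random seed n:', n)
    print(classifier.score(X_test, y_test))
    y_pred = classifier.predict(X_test)
    y_true = np.array(y_test)
    # print(y_pred)
    # print(y_true)
    # print(y_pred == y_true)
# Report F1 for the predictions of the last run
showF1(y_pred, y_true)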
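What the seed controls is which documents end up in the test set. The sketch below imitates a seeded shuffle-and-cut split with the standard library only. It does not reproduce the splits that `train_test_split` draws; `split_indices` and the label list are hypothetical, and the only detail taken from the experiment is the size: with 45 documents and `test_size=0.33`, 15 documents are held out, which matches accuracies that are multiples of 1/15.

```python
import math
import random
from collections import Counter

def split_indices(n_samples, test_size, seed):
    """Shuffle the indices with a seeded generator and cut off the test part."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = math.ceil(test_size * n_samples)
    return idx[n_test:], idx[:n_test]

# Hypothetical label list with the same shape as the corpus: 3 classes x 15 documents
labels = ["economy"] * 15 + ["sports"] * 15 + ["computer"] * 15
for seed in (1, 2, 3):
    train_idx, test_idx = split_indices(len(labels), 0.33, seed)
    # The class balance of the test set generally changes with the seed
    print(seed, len(test_idx), dict(Counter(labels[i] for i in test_idx)))
```

Because the split is not stratified, a test set of only 15 documents can be noticeably unbalanced, which is one reason scores on such a small corpus may move from seed to seed.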

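Training results:
The results are shown below and are very stable.
Computing by hand gives micro-F1 = 14/15 and macro-F1 = 14/15, the same as the values returned by the function, which supports its correctness.

random seed n: 1
0.9333333333333333
random seed n: 2
0.9333333333333333
random seed n: 3
0.9333333333333333
random seed n: 4
0.9333333333333333
random seed n: 5
0.9333333333333333
random seed n: 6
0.9333333333333333
random seed n: 7
0.9333333333333333
random seed n: 8
0.9333333333333333
random seed n: 9
0.9333333333333333
random seed n: 10
0.9333333333333333
micro-average F1: 0.9333333333333333
macro-average F1: 0.9333333333333333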

3.2.2 Word segmentation without stop-word removal

Only step 4 needs a small change

...
X_list = []  # processed texts, one space-joined string per document
y_list = []  # labels
for path in path_list:
    # print(path)
    with open(path, mode='r', encoding='utf-8') as f:
        note = f.read()
    note_1 = note.replace('\n', '')
    # print(note_1)
    seglist = jieba.lcut(note_1)
    newSent = seglist  # keep every token; the rest is unchanged
    label = path.split('\\')[-2]
    # print(label)
    text_str = ' '.join(newSent)
    X_list.append(text_str)
    y_list.append(label)
...

Training results:
The output is identical to the run with stop words removed.

random seed n: 1
0.9333333333333333
random seed n: 2
0.9333333333333333
random seed n: 3
0.9333333333333333
random seed n: 4
0.9333333333333333
random seed n: 5
0.9333333333333333
random seed n: 6
0.9333333333333333
random seed n: 7
0.9333333333333333
random seed n: 8
0.9333333333333333
random seed n: 9
0.9333333333333333
random seed n: 10
0.9333333333333333

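3.2.3 Character-level tokens with stop-word removal

The preceding code is unchanged.
The remaining words are joined without spaces into the string word_str, and list(word_str) then splits that string into single characters. The resulting list is stored in char_str.
All documents together contain 99583 characters.

X_list = []  # processed texts, one list of characters per document
y_list = []  # labels
char_total = 0  # running character count (avoids shadowing the built-in sum)
for path in path_list:
    # print(path)
    with open(path, mode='r', encoding='utf-8') as f:
        note = f.read()
    note_1 = note.replace('\n', '')
    # print(list(note_1))
    seglist = jieba.lcut(note_1)
    newSent = []
    for word in seglist:
        word = word.strip()
        # strip() turns whitespace-only tokens into '', so skip empty tokens too
        if word and word not in stopWords:
            newSent.append(word)
    word_str = ''.join(newSent)  # join without spaces
    char_str = list(word_str)  # split into single characters
    char_total += len(char_str)
    # print('char_total:', char_total)
    label = path.split('\\')[-2]
    X_list.append(char_str)
    y_list.append(label)
# Final value of char_total: 99583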

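Flatten the nested list, i.e. turn the list of 45 per-document character lists into a single list. Of the 99583 characters, 2245 are distinct.

from itertools import chain
d = list(chain(*X_list))
print(len(d))  # total number of characters in all texts
print(len(set(d)))  # number of distinct characters
'''
99583
2245
'''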

The function all_list counts character frequencies

def all_list(arr):
    # Map each distinct element of arr to the number of times it occurs
    result = {}
    for i in set(arr):
        result[i] = arr.count(i)
    return result
# Example: all_list([0, 1, 1, 2, 2, 2, 3, 3]) gives {0: 1, 1: 2, 2: 3, 3: 2}
print(all_list(X_list[0]))
'''
{'冲': 1, '拓': 2, '埃': 1, '片': 3, '迎': 1, '期': 3, '比': 11, '静': 1,
 '献': 1, '2': 12, '阿': 47, '地': 5, '巴': 11, '量': 1, '摔': 1, '连': 3,
 ...}
'''

Compute the character counts of every document and store them in a count matrix with 45 rows and 2245 columns. Most entries are zero.

import numpy as np

vocab = sorted(set(d))  # fixed column order: one column per distinct character
char_num_all = []
for doc in X_list:
    char_dic = all_list(doc)  # character counts of one document
    # Store the count where the character occurs in the document, 0 otherwise
    char_num_all.append([char_dic.get(c, 0) for c in vocab])
# print(char_num_all)
char_num_all = np.array(char_num_all)
print(char_num_all.shape)
# Output: (45, 2245)
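Calling `arr.count(i)` for every distinct character rescans the document each time. `collections.Counter` gets the same counts in one pass. The sketch below builds the count matrix that way on two toy documents; `char_count_matrix` and the inputs are hypothetical.

```python
from collections import Counter

def char_count_matrix(docs):
    """docs: list of character lists. Returns (vocabulary, count rows)."""
    # Sorted vocabulary gives a reproducible column order
    vocab = sorted({ch for doc in docs for ch in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)  # one pass over the document
        # Counter returns 0 for characters the document does not contain
        rows.append([counts[ch] for ch in vocab])
    return vocab, rows

docs = [list("abca"), list("bcd")]
vocab, rows = char_count_matrix(docs)
print(vocab)  # ['a', 'b', 'c', 'd']
print(rows)   # [[2, 1, 1, 0], [0, 1, 1, 1]]
```

Passing `rows` to `np.array` yields a matrix in the same layout as char_num_all, with one row per document and one column per distinct character.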

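Training results:
The tf-idf computation and the naive Bayes classification use the same code as above.
The output is shown below. The character-level model is sensitive to the random split: accuracy ranges from 0.533 (seed 7) to 0.933 (seed 2).

random seed n: 1
accuracy: 0.6666666666666666
micro-average F1: 0.6666666666666666
macro-average F1: 0.6666666666666666
random seed n: 2
accuracy: 0.9333333333333333
micro-average F1: 0.9333333333333333
macro-average F1: 0.9333333333333333
random seed n: 3
accuracy: 0.8666666666666667
micro-average F1: 0.8666666666666667
macro-average F1: 0.8666666666666667
random seed n: 4
accuracy: 0.8666666666666667
micro-average F1: 0.8666666666666667
macro-average F1: 0.8666666666666667
random seed n: 5
accuracy: 0.8666666666666667
micro-average F1: 0.8666666666666667
macro-average F1: 0.8666666666666667
random seed n: 6
accuracy: 0.6
micro-average F1: 0.6
macro-average F1: 0.6
random seed n: 7
accuracy: 0.5333333333333333
micro-average F1: 0.5333333333333333
macro-average F1: 0.5333333333333333
random seed n: 8
accuracy: 0.8
micro-average F1: 0.8000000000000002
macro-average F1: 0.8000000000000002
random seed n: 9
accuracy: 0.8
micro-average F1: 0.8000000000000002
macro-average F1: 0.8000000000000002
random seed n: 10
accuracy: 0.6666666666666666
micro-average F1: 0.6666666666666666
macro-average F1: 0.6666666666666666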

3.2.4 Character-level tokens without stop-word removal

The code changes only in the following part

...
X_list = []  # processed texts, one list of characters per document
y_list = []  # labels
char_total = 0  # running character count (avoids shadowing the built-in sum)
for path in path_list:
    # print(path)
    with open(path, mode='r', encoding='utf-8') as f:
        note = f.read()
    note_1 = note.replace('\n', '')
    seglist = jieba.lcut(note_1)
    newSent = seglist  # keep every token
    word_str = ''.join(newSent)
    char_str = list(word_str)
    # print(len(char_str))
    char_total += len(char_str)
    # print('char_total:', char_total)
    label = path.split('\\')[-2]
    # print(label)
    X_list.append(char_str)
    y_list.append(label)
...

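Training results:
The output is shown below. Accuracy ranges from 0.4 (seed 7) to 0.933 (seeds 2 to 4). For seeds 1 and 7 the printed F1 values (7/11 and 4/7) differ from the accuracy (7/15 and 6/15), although micro-average F1 taken over all classes equals accuracy in single-label classification. These two values are consistent with at least one class never being predicted in those runs and F1 having been computed over the predicted classes only.

random seed n: 1
accuracy: 0.4666666666666667
micro-average F1: 0.6363636363636364
macro-average F1: 0.6363636363636364
random seed n: 2
accuracy: 0.9333333333333333
micro-average F1: 0.9333333333333333
macro-average F1: 0.9333333333333333
random seed n: 3
accuracy: 0.9333333333333333
micro-average F1: 0.9333333333333333
macro-average F1: 0.9333333333333333
random seed n: 4
accuracy: 0.9333333333333333
micro-average F1: 0.9333333333333333
macro-average F1: 0.9333333333333333
random seed n: 5
accuracy: 0.6
micro-average F1: 0.6
macro-average F1: 0.6
random seed n: 6
accuracy: 0.6666666666666666
micro-average F1: 0.6666666666666666
macro-average F1: 0.6666666666666666
random seed n: 7
accuracy: 0.4
micro-average F1: 0.5714285714285715
macro-average F1: 0.5714285714285715
random seed n: 8
accuracy: 0.7333333333333333
micro-average F1: 0.7333333333333333
macro-average F1: 0.7333333333333333
random seed n: 9
accuracy: 0.6666666666666666
micro-average F1: 0.6666666666666666
macro-average F1: 0.6666666666666666
random seed n: 10
accuracy: 0.6
micro-average F1: 0.6
macro-average F1: 0.6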
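To compare the two character-level settings at a glance, the ten accuracies listed in 3.2.3 and 3.2.4 can be summarised with the standard library. In the sketch below each accuracy is written as the number of correctly classified test documents out of 15, which is how the listed values decompose (for example 0.8666… = 13/15). The variable names are hypothetical.

```python
from statistics import mean, pstdev

# Correct test documents out of 15 for seeds 1 to 10,
# derived from the accuracies listed in 3.2.3 and 3.2.4
correct_with_removal = (10, 14, 13, 13, 13, 9, 8, 12, 12, 10)
correct_without_removal = (7, 14, 14, 14, 9, 10, 6, 11, 10, 9)

acc_with = [k / 15 for k in correct_with_removal]
acc_without = [k / 15 for k in correct_without_removal]

for name, acc in (("stop words removed", acc_with), ("stop words kept", acc_without)):
    # pstdev: the ten runs are treated as the whole population
    print(f"{name}: mean={mean(acc):.3f} std={pstdev(acc):.3f} "
          f"min={min(acc):.3f} max={max(acc):.3f}")
```

The mean accuracy is 0.760 with stop words removed and about 0.693 with stop words kept. With only ten splits of 15 test documents each, the difference should be read with caution.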