Today is 815 tut7, covering coursework part 2!!!
Let's dive straight into the code.
First, the imports:
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 7 19:01:54 2022
@author: Pamplemousse
"""
# set the figure size
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [9.0, 6.0]
import nltk
from sklearn.datasets import load_files  # utility for loading text files
from nltk.corpus import stopwords
import os
import string
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # linear discriminant analysis
from sklearn.naive_bayes import GaussianNB  # Naive Bayes assuming Gaussian feature distributions
from sklearn.svm import SVC  # support vector machine
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # evaluation tools
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes assuming multinomial feature distributions
from sklearn.pipeline import Pipeline
import random
from functools import partial
from tabulate import tabulate
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
from sklearn.neural_network import MLPClassifier  # artificial neural network
Some parameter setup:
default_stopwords = nltk.corpus.stopwords.words('english')  # stopword list
lemma = WordNetLemmatizer()  # lemmatizer
porter_stemmer = PorterStemmer()  # stemmer
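A note from me: if the NLTK corpora have never been downloaded on your machine, the stopword list and the lemmatizer will raise a LookupError; the one-time downloads below (my addition, not part of the tutorial code) fix that.

import nltk
nltk.download('stopwords')  # needed by nltk.corpus.stopwords
nltk.download('wordnet')    # needed by WordNetLemmatizer
# on some NLTK versions the lemmatizer also wants: nltk.download('omw-1.4')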
The text-cleaning function, identical to last week's tut6 parts D & E, so you could copy it straight over:
def clean_text(doc, rm_punctuation = True, rm_digits = True, lemmatize = False,
               norm_case = True, stem = False, rm_stopwords = True):
    if rm_punctuation:  # strip punctuation characters
        table = str.maketrans({key: None for key in string.punctuation})
        doc = str(doc).translate(table)
    if rm_digits:  # strip digit characters
        table = str.maketrans({key: None for key in string.digits})
        doc = str(doc).translate(table)
    if norm_case:  # lowercase everything
        doc = doc.lower()
    if lemmatize:  # reduce words to their dictionary form
        words = " ".join(lemma.lemmatize(word) for word in doc.split())
    else:
        words = " ".join(doc.split())
    if stem:  # reduce words to their stems
        words = " ".join(porter_stemmer.stem(word) for word in words.split())
    if rm_stopwords:  # drop common function words
        words = " ".join(i for i in words.split() if i not in default_stopwords)
    return words
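A minimal usage check (my own example sentence, not from the tutorial):

print(clean_text("The 3 dogs, running fast!", stem = True))
# -> 'dog run fast': punctuation and digits removed, lowercased, stemmed, stopwords dropped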
Next, a function to evaluate a model:
def evaluate_model(model):
    model.fit(X_train, y_train)  # train the model
    cr = ClassificationReport(model)  # classification-report visualizer for this model
    cr.score(X_test, y_test)  # evaluate on the test set and collect the scores
    cr.finalize()  # this should be the step that draws the report as a heatmap
    # in short, calling this function produces a heatmap
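A caveat from me: this function uses the globals X_train, X_test, y_train, y_test defined further down, so run the train/test split before calling it. Also, in yellowbrick 1.0+ the usual render call is show(), which runs finalize() and then displays the figure, so if no plot appears you could end the function with cr.show() instead.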
Loading the files:
movie_dataDir = os.path.realpath("Desktop/King/815/Tutorial Week 7-20220307/Week6 Tutorial/txt_sentoken")
movie_data = load_files(movie_dataDir)
# load_files reads a folder with one subdirectory per class; the returned bunch includes data, target, and target_names
print(movie_data.target)
print(movie_data.data[0])
The first print outputs:
[0 1 1 ... 1 0 0]
P.S. The ellipsis in the middle is just the console truncating output it cannot fully display, not a literal omission in the data.
movie_data.target stores the class label of each file (0/1).
The second print outputs the content of the first file.
Screenshot:
Next, clean the text, i.e., process movie_data.data:
documents = [clean_text(x, stem = False, lemmatize = False) for x in movie_data.data]
# call our custom clean_text() function
print(documents[0])  # print the first document to check the cleaning
Screenshot:
Then convert documents (the features) and movie_data.target (the labels) into numeric data the machine can process:
X, y = documents, movie_data.target
vectorizer = CountVectorizer(max_features = 1500, min_df = 5, max_df = 0.7, stop_words = stopwords.words('english'))  # term-count vectorizer
# after removing stopwords, keep the 1500 most frequent words that appear in at least 5 documents and in at most 70% of them
X = vectorizer.fit_transform(documents).toarray()  # build the count vectors
print(X[0][:10])  # first 10 entries of the first document's count vector
which gives
[0 0 0 0 0 0 0 0 0 5]
P.S. Some words appear many times in the corpus overall but zero times in a particular document; this count vector records how often each vocabulary word occurs in this one document.
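If you want to see which words those ten positions stand for, you could check the vocabulary (my addition; get_feature_names_out assumes scikit-learn >= 1.0, older versions use get_feature_names):

print(vectorizer.get_feature_names_out()[:10])  # the vocabulary words behind the first 10 positions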
Then reweight the term frequencies (tf) by the inverse document frequency (idf); look up TF-IDF if you want the details.
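For reference (my summary of scikit-learn's defaults, smooth_idf = True and norm = 'l2'): TfidfTransformer computes
idf(t) = ln((1 + n) / (1 + df(t))) + 1
where n is the number of documents and df(t) is the number of documents containing term t, multiplies each count by its idf, and finally L2-normalizes every row.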
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
print(X[0][:10])
which gives
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24686232]
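A quick sanity check (my addition, relying on that default norm = 'l2'): each row of X should now have unit length.

import numpy as np
print(np.linalg.norm(X[0]))  # ~1.0, because TfidfTransformer L2-normalizes each row by default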
Split into training and test sets 80:20:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Now run the different classifiers and print their results.
First, logistic regression:
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
logistic_prediction = logistic.predict(X_test)
# note: sklearn's convention is (y_true, y_pred); with (prediction, y_test) the accuracy is
# unchanged, but the confusion matrix is transposed and precision/recall swap in the report
print(accuracy_score(logistic_prediction, y_test))
print(confusion_matrix(logistic_prediction, y_test))
print(classification_report(logistic_prediction, y_test))
accuracy score:
0.835
confusion matrix:
[[168 26]
 [ 40 166]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.81 | 0.87 | 0.84 | 194 |
| 1 | 0.86 | 0.81 | 0.83 | 206 |
| accuracy | | | 0.83 | 400 |
| macro avg | 0.84 | 0.84 | 0.83 | 400 |
| weighted avg | 0.84 | 0.83 | 0.83 | 400 |
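Side note from me: because the arguments are passed as (prediction, y_test) rather than sklearn's documented (y_true, y_pred) order, the support column above counts predicted labels, not true ones. The conventional call would be:

print(confusion_matrix(y_test, logistic_prediction))  # rows = true classes, columns = predicted classes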
P.S. I typed this table out by hand, but the numbers are from the actual run~
Linear discriminant analysis (LinearDiscriminantAnalysis):
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
lda_prediction = lda.predict(X_test)
print(accuracy_score(lda_prediction, y_test))
print(confusion_matrix(lda_prediction, y_test))
print(classification_report(lda_prediction, y_test))
accuracy score:
0.61
confusion matrix:
[[115 63]
 [ 93 129]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.55 | 0.65 | 0.60 | 178 |
| 1 | 0.67 | 0.58 | 0.62 | 222 |
| accuracy | | | 0.61 | 400 |
| macro avg | 0.61 | 0.61 | 0.61 | 400 |
| weighted avg | 0.62 | 0.61 | 0.61 | 400 |
Naive Bayes (Gaussian):
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_prediction = nb.predict(X_test)
print(accuracy_score(nb_prediction, y_test))
print(confusion_matrix(nb_prediction, y_test))
print(classification_report(nb_prediction, y_test))
accuracy score:
0.7625
confusion matrix:
[[164 51]
 [ 44 141]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.79 | 0.76 | 0.78 | 215 |
| 1 | 0.73 | 0.76 | 0.75 | 185 |
| accuracy | | | 0.76 | 400 |
| macro avg | 0.76 | 0.76 | 0.76 | 400 |
| weighted avg | 0.76 | 0.76 | 0.76 | 400 |
Support vector machine:
SVC_model = SVC()
SVC_model.fit(X_train, y_train)
SVC_prediction = SVC_model.predict(X_test)
print(accuracy_score(SVC_prediction, y_test))
print(confusion_matrix(SVC_prediction, y_test))
print(classification_report(SVC_prediction, y_test))
accuracy score:
0.8275
confusion matrix:
[[167 28]
 [ 41 164]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.80 | 0.86 | 0.83 | 195 |
| 1 | 0.85 | 0.80 | 0.83 | 205 |
| accuracy | | | 0.83 | 400 |
| macro avg | 0.83 | 0.83 | 0.83 | 400 |
| weighted avg | 0.83 | 0.83 | 0.83 | 400 |
Next is the big Pipeline block, i.e., an assembly line of processing steps.
First, build a pipeline model:
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])  # the pipeline first does tf-idf vectorization, then trains a multinomial Naive Bayes on the result
model.fit(movie_data.data, movie_data.target)  # fit the model
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
Then we randomly pick one document from documents, run it through the model's tf-idf vectorizer, and predict:
rantdoc = random.choice(documents)
print(rantdoc)
target = model.named_steps['tfidf'].transform([rantdoc])
target
print(model.predict([rantdoc]))
I won't paste the rantdoc output here.
The target output:
<1x39659 sparse matrix of type '<class 'numpy.float64'>'
with 361 stored elements in Compressed Sparse Row format>
The prediction output:
[0]
But at this point we cannot know the actual class (0 or 1) without matching this document back against the originals one by one.
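One way around that (my own tweak, not in the tutorial code): sample an index instead of a document, so the true label stays paired with the text.

idx = random.randrange(len(documents))  # pick a random position rather than a random document
rantdoc = documents[idx]
print("true label:", movie_data.target[idx])  # ground truth for the sampled review
print("prediction:", model.predict([rantdoc]))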
Print the predicted probabilities:
tabulate = partial(tabulate, headers = 'firstrow', tablefmt = 'pipe')
probas = model.predict_proba([rantdoc])
table = [["Class", "Probability"]] + list(zip(model.classes_, probas[0]))
# build the probability table
print(tabulate(table))
which gives this table:
| Class | Probability |
|---|---|
| 0 | 0.689799 |
| 1 | 0.310201 |
P.S. Paste this output into Markdown and it renders as a table automatically. Nice.
So for this rantdoc, the table gives a 0.6898 probability of class 0, which is why the prediction above was [0].
Next, visualize model performance using the evaluate_model() function we defined earlier, calling it on each classifier in turn:
evaluate_model(LogisticRegression())
evaluate_model(LinearDiscriminantAnalysis())
evaluate_model(GaussianNB())
evaluate_model(MultinomialNB())
evaluate_model(SVC())
evaluate_model(MLPClassifier())
P.S. Run these calls one at a time, otherwise the output plots may come out wrong.
This produces the following heatmaps.
The darker a cell in these heatmaps, the higher the value and the better the performance.
Another way to visualize:
Here we do it for logistic regression and linear discriminant analysis; run the two blocks separately to get the two plots.
viz = ConfusionMatrix(LogisticRegression())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()  # draws the plot; in yellowbrick 1.0+ poof() was renamed show()

viz = ConfusionMatrix(LinearDiscriminantAnalysis())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
Here we want TP and TN to be large, so the darker the top-left and bottom-right cells, and the lighter the bottom-left and top-right, the better.
Good luck with the coursework, everyone~
If you can't get a package installed, come find me; tutorial questions are welcome, and so is debugging; just don't ask me about the coursework itself.
OK, clocking off!