815 Text Classification Example

Today's session is 815 tut7, covering the material for coursework part 2!!!

Let's dive straight into the code.
First, the imports:

# -*- coding: utf-8 -*-
"""
Created on Mon Mar  7 19:01:54 2022

@author: Pamplemousse
"""

# Set the default figure size
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [9.0, 6.0]

import nltk
from sklearn.datasets import load_files # utility for loading text files from folders
from nltk.corpus import stopwords
import os
import string
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis # linear discriminant analysis
from sklearn.naive_bayes import GaussianNB # naive Bayes with a Gaussian prior
from sklearn.svm import SVC # support vector machine
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score # evaluation utilities
from sklearn.naive_bayes import MultinomialNB # naive Bayes with a multinomial prior
from sklearn.pipeline import Pipeline
import random
from functools import partial
from tabulate import tabulate
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
from sklearn.neural_network import MLPClassifier # multi-layer perceptron (neural network)

Some parameter setup:

default_stopwords = nltk.corpus.stopwords.words('english')#stopwords

lemma = WordNetLemmatizer()#lemmatize
porter_stemmer = PorterStemmer()#stemming
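
If the stopword list or the lemmatizer complains about missing NLTK data, the corpora most likely need to be downloaded once first. A one-off setup step along these lines should be enough (not part of the tutorial code itself):

import nltk
nltk.download('stopwords')  # needed by nltk.corpus.stopwords
nltk.download('wordnet')    # needed by WordNetLemmatizer
nltk.download('omw-1.4')    # newer NLTK releases may also ask for this WordNet data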

The text-cleaning function is the same as last week's tut6 parts D & E; you could simply copy it over, so I won't annotate it again here.

def clean_text(doc, rm_punctuation = True, rm_digits = True, lemmatize = False, 
               norm_case = True, stem = False, rm_stopwords = True):
    
    if(rm_punctuation == True):
        table = str.maketrans({key: None for key in string.punctuation})
        doc = str(doc).translate(table)
    
    if(rm_digits == True):
        table = str.maketrans({key: None for key in string.digits})
        doc = str(doc).translate(table)
    
    if(norm_case == True):
        doc = doc.lower()
    
    if(lemmatize == True):
        words = " ".join(lemma.lemmatize(word) for word in doc.split())
    else:
        words = " ".join([i for i in doc.split()])
    
    if(stem == True):
        words = " ".join(porter_stemmer.stem(word) for word in words.split())
    
    if(rm_stopwords == True):
        words = " ".join([i for i in words.split() if i not in default_stopwords])
    
    return words
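
As a quick sanity check, here is what clean_text() does to a made-up sentence (the input string is just an illustration, assuming the NLTK stopword list is available):

print(clean_text("The 3 cats were sleeping, weren't they?"))
# with the default options this strips punctuation and digits, lowercases the text
# and drops stopwords, leaving something like: "cats sleeping werent"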

Next, a function for evaluating a model:

def evaluate_model(model):
    
    model.fit(X_train, y_train) # fit the model on the training data
    cr = ClassificationReport(model) # a Yellowbrick classification-report visualizer for this model
    cr.score(X_test, y_test) # score on the test set and collect the report numbers
    cr.finalize() # finalize() draws the report as a heat map
    # in short, calling this function produces a classification-report heat map
    # (it relies on the global X_train, X_test, y_train, y_test defined further down)

Loading the files:

movie_dataDir = os.path.realpath("Desktop/King/815/Tutorial Week 7-20220307/Week6 Tutorial/txt_sentoken")
movie_data = load_files(movie_dataDir)
# load_files reads a folder of text files; the returned bunch contains data, target and target_names

print(movie_data.target)

print(movie_data.data[0])

The first print outputs
[0 1 1 … 1 0 0]
P.S. the ellipsis in the middle is just the console truncating an array too long to display in full, not a literal ellipsis.
movie_data.target stores the class label (0/1) of each file.
The second print outputs the contents of the first file.
(Screenshot of the raw review text omitted.)
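
The 0/1 labels come from the names of the two sub-folders (load_files assigns them in sorted order). If you want to see which folder maps to which label, a small check like this should work:

print(movie_data.target_names) # e.g. ['neg', 'pos'] for the usual txt_sentoken layout
print(movie_data.target_names[movie_data.target[0]]) # folder name of the first document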
Next, clean the text, i.e. process movie_data.data:

documents = [clean_text(x, stem = False, lemmatize = False) for x in movie_data.data]
# call our custom clean_text() function

print(documents[0]) # print the first article to check the cleaning

(Screenshot of the cleaned text omitted.)
Next, turn documents (the features) and movie_data.target (the labels) into numeric data the models can handle:

X, y = documents, movie_data.target

vectorizer = CountVectorizer(max_features = 1500, min_df = 5, max_df = 0.7, stop_words = stopwords.words('english')) # term-frequency (bag-of-words) vectorizer
# after removing stopwords: keep at most 1500 words, each appearing in at least 5 documents and in no more than 70% of the documents
X = vectorizer.fit_transform(documents).toarray() # build the count matrix

print(X[0][:10]) # first 10 entries of the first article's count vector

This gives
[0 0 0 0 0 0 0 0 0 5]
P.S. a word can be frequent in the corpus as a whole yet appear 0 times in a particular document; this count vector records how often each selected word occurs in this article.
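
If you want to know which word each of the 1500 columns stands for, the vectorizer keeps the vocabulary; a check such as this should show it (get_feature_names_out needs a reasonably recent scikit-learn, older versions use get_feature_names):

feature_names = vectorizer.get_feature_names_out() # the 1500 selected words, in column order
print(feature_names[:10]) # the words behind the 10 counts printed above, e.g. the 5 is the count of feature_names[9]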

Next, reweight the term frequencies (tf) by the inverse document frequency (idf); look up tf-idf if you haven't met it before.

tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

print(X[0][:10])

The result is
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24686232]
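
If you'd rather not take the transformation on faith: with scikit-learn's defaults (smooth_idf=True, norm='l2'), TfidfTransformer multiplies each count by idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1 and then L2-normalises every row, which is how the raw count of 5 above ends up as a value like 0.24686232. A toy example (with made-up counts, not the movie data) shows the same mechanics:

import numpy as np
toy_counts = np.array([[3, 0, 1],
                       [0, 2, 0]])
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
print(toy_tfidf) # each row has unit L2 norm; rarer words get boosted by the idf term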

Split into training and test sets 80:20:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Now run the different classifiers and print their results.
First, logistic regression:

logistic = LogisticRegression()
logistic.fit(X_train, y_train)

logistic_prediction = logistic.predict(X_test)

print(accuracy_score(logistic_prediction, y_test))
print(confusion_matrix(logistic_prediction, y_test))
print(classification_report(logistic_prediction, y_test))

accuracy score:
0.835

confusion matrix:
[[168 26]
[ 40 166]]

report:

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       194
           1       0.86      0.81      0.83       206

    accuracy                           0.83       400
   macro avg       0.84      0.84      0.83       400
weighted avg       0.84      0.83      0.83       400

P.S. I typed this table up by hand, but the numbers are from the actual run~

Linear discriminant analysis (LinearDiscriminantAnalysis):

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

lda_prediction = lda.predict(X_test)

print(accuracy_score(lda_prediction, y_test))
print(confusion_matrix(lda_prediction, y_test))
print(classification_report(lda_prediction, y_test))

accuracy score:
0.61

confusion matrix:
[[115 63]
[ 93 129]]

report:

              precision    recall  f1-score   support

           0       0.55      0.65      0.60       178
           1       0.67      0.58      0.62       222

    accuracy                           0.61       400
   macro avg       0.61      0.61      0.61       400
weighted avg       0.62      0.61      0.61       400

Naive Bayes (Gaussian):

nb = GaussianNB()
nb.fit(X_train, y_train)

nb_prediction = nb.predict(X_test)

print(accuracy_score(nb_prediction, y_test))
print(confusion_matrix(nb_prediction, y_test))
print(classification_report(nb_prediction, y_test))

accuracy score:
0.7625

confusion matrix:
[[164 51]
[ 44 141]]

report:

              precision    recall  f1-score   support

           0       0.79      0.76      0.78       215
           1       0.73      0.76      0.75       185

    accuracy                           0.76       400
   macro avg       0.76      0.76      0.76       400
weighted avg       0.76      0.76      0.76       400

Support vector machine:

SVC_model = SVC()
SVC_model.fit(X_train, y_train)

SVC_prediction = SVC_model.predict(X_test)

print(accuracy_score(SVC_prediction, y_test))
print(confusion_matrix(SVC_prediction, y_test))
print(classification_report(SVC_prediction, y_test))

accuracy score:
0.8275

confusion matrix:
[[167 28]
[ 41 164]]

report:

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       195
           1       0.85      0.80      0.83       205

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400

Next comes the big Pipeline block, i.e. a processing pipeline.
First, build a pipeline model:

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
    ]) # this model first does tf-idf vectorisation, then trains a naive Bayes (multinomial) classifier

model.fit(movie_data.data, movie_data.target) # fit the model

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
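
Note that the pipeline is fitted on the whole of movie_data.data here, so the prediction below is made on a document the model has already seen. For an accuracy number comparable to the earlier classifiers, a minimal sketch like this scores the same pipeline on a held-out split of the raw text:

text_train, text_test, label_train, label_test = train_test_split(
    movie_data.data, movie_data.target, test_size = 0.2, random_state = 0)

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(text_train, label_train) # fit only on the training reviews
print(pipe.score(text_test, label_test)) # mean accuracy on the held-out reviews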

Then we pick one article at random from documents,
tf-idf-vectorise it with the pipeline, and predict its class:

rantdoc = random.choice(documents)

print(rantdoc)

target = model.named_steps['tfidf'].transform([rantdoc])
target

print(model.predict([rantdoc]))

I won't paste the rantdoc output here.
The output of target:
<1x39659 sparse matrix of type '<class 'numpy.float64'>'
with 361 stored elements in Compressed Sparse Row format>

The prediction output:
[0]

But at this point we have no way of knowing whether the actual class is 0 or 1, short of going back and matching the text against the original files one by one.
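
One way around that is to draw a random index instead of a random document, so the true label can be looked up in movie_data.target (clean_text keeps the documents in the same order as movie_data.data); a small sketch:

idx = random.randrange(len(documents))
rantdoc = documents[idx]
print(model.predict([rantdoc])) # the pipeline's prediction
print(movie_data.target[idx])   # the actual label of the same review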

Print the predicted probabilities:

tabulate = partial(tabulate, headers = 'firstrow', tablefmt = 'pipe')

probas = model.predict_proba([rantdoc])
table = [["Class", "Probability"]] + list(zip(model.classes_, probas[0]))
# build the probability table
print(tabulate(table))

This gives the following table:

|   Class |   Probability |
|--------:|--------------:|
|       0 |      0.689799 |
|       1 |      0.310201 |

P.S. paste this output into Markdown and it renders as a table automatically, nice.
So for this rantdoc, the table gives class 0 a probability of about 0.69, which is why the prediction above was 0.

Next, visualise how well the models fit, using the custom evaluate_model() function defined earlier.
We call it on each of the classifiers in turn:

evaluate_model(LogisticRegression())
evaluate_model(LinearDiscriminantAnalysis())
evaluate_model(GaussianNB())
evaluate_model(MultinomialNB())
evaluate_model(SVC())
evaluate_model(MLPClassifier())

P.S. run these lines one at a time, otherwise the output may come out wrong.
This produces heat-map classification reports for:
Logistic Regression, Linear Discriminant Analysis, GaussianNB, MultinomialNB, Support Vector Machine, MLP
(figures omitted)
The darker the cells in these heat maps, the higher the values and the better the model performs.

Then another way of visualising the results.
Here it's done for logistic regression and linear discriminant analysis; run the two blocks separately to get the two figures:

viz = ConfusionMatrix(LogisticRegression())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()

viz = ConfusionMatrix(LinearDiscriminantAnalysis())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()

(Confusion-matrix figures for Logistic Regression and Linear Discriminant Analysis omitted.)

Here we want TP and TN to be large, so the darker the top-left and bottom-right cells, and the lighter the bottom-left and top-right cells, the better.

Good luck with the coursework, everyone~
If you can't get a package installed, come find me; tutorial questions and debugging are fine too, but please don't ask me about the coursework itself.
Right, clocking off!
