Today is 815 tut7, covering coursework part 2!!!
Let's dive straight into the code.
First, the imports:
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 7 19:01:54 2022
@author: Pamplemousse
"""
# set the figure size
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [9.0, 6.0]
import nltk
from sklearn.datasets import load_files  # utility for loading text files
from nltk.corpus import stopwords
import os
import string
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # logistic regression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # linear discriminant analysis
from sklearn.naive_bayes import GaussianNB  # Naive Bayes assuming Gaussian feature distributions
from sklearn.svm import SVC  # support vector machine
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # evaluation tools
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes assuming multinomial feature distributions
from sklearn.pipeline import Pipeline
import random
from functools import partial
from tabulate import tabulate
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
from sklearn.neural_network import MLPClassifier  # artificial neural network
Some parameter setup:
default_stopwords = nltk.corpus.stopwords.words('english')  # stopword list
lemma = WordNetLemmatizer()  # lemmatizer
porter_stemmer = PorterStemmer()  # stemmer
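A note from me: if the NLTK corpora have never been downloaded on your machine, the stopword list and the lemmatizer will raise a LookupError; the one-time downloads below (my addition, not part of the tutorial code) fix that.

import nltk
nltk.download('stopwords')  # needed by nltk.corpus.stopwords
nltk.download('wordnet')    # needed by WordNetLemmatizer
# on some NLTK versions the lemmatizer also wants: nltk.download('omw-1.4')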
The text-cleaning function, identical to last week's tut6 parts D & E, so you could copy it straight over:
def clean_text(doc, rm_punctuation = True, rm_digits = True, lemmatize = False,
               norm_case = True, stem = False, rm_stopwords = True):
    if rm_punctuation:  # strip punctuation characters
        table = str.maketrans({key: None for key in string.punctuation})
        doc = str(doc).translate(table)
    if rm_digits:  # strip digit characters
        table = str.maketrans({key: None for key in string.digits})
        doc = str(doc).translate(table)
    if norm_case:  # lowercase everything
        doc = doc.lower()
    if lemmatize:  # reduce words to their dictionary form
        words = " ".join(lemma.lemmatize(word) for word in doc.split())
    else:
        words = " ".join(doc.split())
    if stem:  # reduce words to their stems
        words = " ".join(porter_stemmer.stem(word) for word in words.split())
    if rm_stopwords:  # drop common function words
        words = " ".join(i for i in words.split() if i not in default_stopwords)
    return words
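A minimal usage check (my own example sentence, not from the tutorial):

print(clean_text("The 3 dogs, running fast!", stem = True))
# -> 'dog run fast': punctuation and digits removed, lowercased, stemmed, stopwords dropped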
Next, a function to evaluate a model:
def evaluate_model(model):
    model.fit(X_train, y_train)  # train the model
    cr = ClassificationReport(model)  # classification-report visualizer for this model
    cr.score(X_test, y_test)  # evaluate on the test set and collect the scores
    cr.finalize()  # this should be the step that draws the report as a heatmap
    # in short, calling this function produces a heatmap
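A caveat from me: this function uses the globals X_train, X_test, y_train, y_test defined further down, so run the train/test split before calling it. Also, in yellowbrick 1.0+ the usual render call is show(), which runs finalize() and then displays the figure, so if no plot appears you could end the function with cr.show() instead.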
Loading the files:
movie_dataDir = os.path.realpath("Desktop/King/815/Tutorial Week 7-20220307/Week6 Tutorial/txt_sentoken")
movie_data = load_files(movie_dataDir)
# load_files reads a folder with one subdirectory per class; the returned bunch includes data, target, and target_names
print(movie_data.target)
print(movie_data.data[0])
The first print outputs:
[0 1 1 ... 1 0 0]
P.S. The ellipsis in the middle is just the console truncating output it cannot fully display, not a literal omission in the data.
movie_data.target stores the class label of each file (0/1).
The second print outputs the content of the first file.
Screenshot:
Next, clean the text, i.e., process movie_data.data:
documents = [clean_text(x, stem = False, lemmatize = False) for x in movie_data.data]
# call our custom clean_text() function
print(documents[0])  # print the first document to check the cleaning
Screenshot:
Then convert documents (the features) and movie_data.target (the labels) into numeric data the machine can process:
X, y = documents, movie_data.target
vectorizer = CountVectorizer(max_features = 1500, min_df = 5, max_df = 0.7, stop_words = stopwords.words('english'))  # term-count vectorizer
# after removing stopwords, keep the 1500 most frequent words that appear in at least 5 documents and in at most 70% of them
X = vectorizer.fit_transform(documents).toarray()  # build the count vectors
print(X[0][:10])  # first 10 entries of the first document's count vector
which gives
[0 0 0 0 0 0 0 0 0 5]
P.S. Some words appear many times in the corpus overall but zero times in a particular document; this count vector records how often each vocabulary word occurs in this one document.
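If you want to see which words those ten positions stand for, you could check the vocabulary (my addition; get_feature_names_out assumes scikit-learn >= 1.0, older versions use get_feature_names):

print(vectorizer.get_feature_names_out()[:10])  # the vocabulary words behind the first 10 positions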
Then reweight the term frequencies (tf) by the inverse document frequency (idf); look up TF-IDF if you want the details.
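For reference (my summary of scikit-learn's defaults, smooth_idf = True and norm = 'l2'): TfidfTransformer computes
idf(t) = ln((1 + n) / (1 + df(t))) + 1
where n is the number of documents and df(t) is the number of documents containing term t, multiplies each count by its idf, and finally L2-normalizes every row.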
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
print(X[0][:10])
which gives
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.24686232]
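A quick sanity check (my addition, relying on that default norm = 'l2'): each row of X should now have unit length.

import numpy as np
print(np.linalg.norm(X[0]))  # ~1.0, because TfidfTransformer L2-normalizes each row by default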
Split into training and test sets 80:20:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Now run the different classifiers and print their results.
First, logistic regression:
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
logistic_prediction = logistic.predict(X_test)
# note: sklearn's convention is (y_true, y_pred); with (prediction, y_test) the accuracy is
# unchanged, but the confusion matrix is transposed and precision/recall swap in the report
print(accuracy_score(logistic_prediction, y_test))
print(confusion_matrix(logistic_prediction, y_test))
print(classification_report(logistic_prediction, y_test))
accuracy score:
0.835
confusion matrix:
[[168 26]
 [ 40 166]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.81 | 0.87 | 0.84 | 194 |
| 1 | 0.86 | 0.81 | 0.83 | 206 |
| accuracy | | | 0.83 | 400 |
| macro avg | 0.84 | 0.84 | 0.83 | 400 |
| weighted avg | 0.84 | 0.83 | 0.83 | 400 |
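Side note from me: because the arguments are passed as (prediction, y_test) rather than sklearn's documented (y_true, y_pred) order, the support column above counts predicted labels, not true ones. The conventional call would be:

print(confusion_matrix(y_test, logistic_prediction))  # rows = true classes, columns = predicted classes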
P.S. I typed this table out by hand, but the numbers are from the actual run~
Linear discriminant analysis (LinearDiscriminantAnalysis):
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
lda_prediction = lda.predict(X_test)
print(accuracy_score(lda_prediction, y_test))
print(confusion_matrix(lda_prediction, y_test))
print(classification_report(lda_prediction, y_test))
accuracy score:
0.61
confusion matrix:
[[115 63]
 [ 93 129]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.55 | 0.65 | 0.60 | 178 |
| 1 | 0.67 | 0.58 | 0.62 | 222 |
| accuracy | | | 0.61 | 400 |
| macro avg | 0.61 | 0.61 | 0.61 | 400 |
| weighted avg | 0.62 | 0.61 | 0.61 | 400 |
Naive Bayes (Gaussian):
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_prediction = nb.predict(X_test)
print(accuracy_score(nb_prediction, y_test))
print(confusion_matrix(nb_prediction, y_test))
print(classification_report(nb_prediction, y_test))
accuracy score:
0.7625
confusion matrix:
[[164 51]
 [ 44 141]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.79 | 0.76 | 0.78 | 215 |
| 1 | 0.73 | 0.76 | 0.75 | 185 |
| accuracy | | | 0.76 | 400 |
| macro avg | 0.76 | 0.76 | 0.76 | 400 |
| weighted avg | 0.76 | 0.76 | 0.76 | 400 |
Support vector machine:
SVC_model = SVC()
SVC_model.fit(X_train, y_train)
SVC_prediction = SVC_model.predict(X_test)
print(accuracy_score(SVC_prediction, y_test))
print(confusion_matrix(SVC_prediction, y_test))
print(classification_report(SVC_prediction, y_test))
accuracy score:
0.8275
confusion matrix:
[[167 28]
 [ 41 164]]
report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.80 | 0.86 | 0.83 | 195 |
| 1 | 0.85 | 0.80 | 0.83 | 205 |
| accuracy | | | 0.83 | 400 |
| macro avg | 0.83 | 0.83 | 0.83 | 400 |
| weighted avg | 0.83 | 0.83 | 0.83 | 400 |
Next is the big Pipeline block, i.e., an assembly line of processing steps.
First, build a pipeline model:
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])  # the pipeline first does tf-idf vectorization, then trains a multinomial Naive Bayes on the result
model.fit(movie_data.data, movie_data.target)  # fit the model
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
Then we randomly pick one document from documents, run it through the model's tf-idf vectorizer, and predict:
rantdoc = random.choice(documents)
print(rantdoc)
target = model.named_steps['tfidf'].transform([rantdoc])
target
print(model.predict([rantdoc]))
I won't paste the rantdoc output here.
The target output:
<1x39659 sparse matrix of type '<class 'numpy.float64'>'
with 361 stored elements in Compressed Sparse Row format>
The prediction output:
[0]
But at this point we cannot know the actual class (0 or 1) without matching this document back against the originals one by one.
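One way around that (my own tweak, not in the tutorial code): sample an index instead of a document, so the true label stays paired with the text.

idx = random.randrange(len(documents))  # pick a random position rather than a random document
rantdoc = documents[idx]
print("true label:", movie_data.target[idx])  # ground truth for the sampled review
print("prediction:", model.predict([rantdoc]))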
Print the predicted probabilities:
tabulate = partial(tabulate, headers = 'firstrow', tablefmt = 'pipe')
probas = model.predict_proba([rantdoc])
table = [["Class", "Probability"]] + list(zip(model.classes_, probas[0]))
# build the probability table
print(tabulate(table))
which gives this table:
| Class | Probability |
|---|---|
| 0 | 0.689799 |
| 1 | 0.310201 |
P.S. Paste this output into Markdown and it renders as a table automatically. Nice.
So for this rantdoc, the table gives a 0.6898 probability of class 0, which is why the prediction above was [0].
Next, visualize model performance using the evaluate_model() function we defined earlier, calling it on each classifier in turn:
evaluate_model(LogisticRegression())
evaluate_model(LinearDiscriminantAnalysis())
evaluate_model(GaussianNB())
evaluate_model(MultinomialNB())
evaluate_model(SVC())
evaluate_model(MLPClassifier())
P.S. Run these calls one at a time, otherwise the output plots may come out wrong.
This produces the following heatmaps.
The darker a cell in these heatmaps, the higher the value and the better the performance.
Another way to visualize:
Here we do it for logistic regression and linear discriminant analysis; run the two blocks separately to get the two plots.
viz = ConfusionMatrix(LogisticRegression())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()  # draws the plot; in yellowbrick 1.0+ poof() was renamed show()

viz = ConfusionMatrix(LinearDiscriminantAnalysis())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()
Here we want TP and TN to be large, so the darker the top-left and bottom-right cells, and the lighter the bottom-left and top-right, the better.
Good luck with the coursework, everyone~
If you can't get a package installed, come find me; tutorial questions are welcome, and so is debugging; just don't ask me about the coursework itself.
OK, clocking off!