Bayesian text classification in Python: a practical summary of text classification with naive Bayes in scikit-learn

Latest recommended article published 2023-10-13 10:06:59

weixin_39938331

Article link: https://blog.csdn.net/weixin_39938331/article/details/112886103

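This article walks through using Python's scikit-learn library to do naive Bayes text classification, covering data preprocessing, feature extraction, model training and testing. Articles from a database are turned into a term-frequency matrix, features are selected, and the resulting model reaches an accuracy of over 80%.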

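Naive Bayes is a very efficient algorithm for classification. If you want the background, see this post: "Bayes from the basics to the details, explained with worked examples" (贝叶斯从浅入深详细解析，详细例子解释) on zwan0518's column, CSDN.NET blog channel.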

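First, a short summary.

Before this I had never written any Python. When I was handed this small project I crammed Python for a day: the basic syntax, package installation, and a few basic packages (mainly the database package, encoding, and the naive Bayes API in scikit-learn). Python turned out to be fairly easy to pick up. It took about a week to get things roughly done, with accuracy of eighty to ninety percent or more! (Haha, I may be bragging a little here.)

Right, let's get to work (fly with me...).

Step 1

First create a directory tlw (my initials) and write a script get_articke.py that dumps every article in our database into one file. The structure is: article id, article content, article category.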

#!/usr/bin/python
# -*- coding: utf-8 -*-
import MySQLdb
import re
import sys

reload(sys)
# Reset the default encoding (Python 2 only)
sys.setdefaultencoding("utf8")

conn = MySQLdb.connect(db='db', host='host', port=3339, user='user', passwd='xiemeijiao5201314', charset='utf8')
cursor = conn.cursor()
# '推荐' is the "Recommended" category, which is excluded
cursor.execute("select id,type,title, content_text, create_time, update_time,publish_time,origin_url,tag from sw_article_info where type is not null and replace(type,' ','')<>'' and type<>'推荐' and content_text is not null and replace(content_text,' ','')<>''")
rows = cursor.fetchall()
for row in rows:
    # Strip newlines, commas and '@' so each article fits on one comma-separated line
    contents = row[3].replace('\n','').replace(',','').replace('@','')
    print row[0],",",contents,",",row[1]
cursor.close()
conn.close()

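I dropped the "Recommended" (推荐) category because it is a freak: it has nothing to do with the content of the article, right? (I do think ahead.)

Now run ./get_articke.py > article.txt

Count down 3, 2, 1, then run ls and you can see a new article.txt file in the directory. I opened it straight away to check the line count: 3500 lines... (I was a little thrown for a moment.) Fine, 3500 it is; it is only for testing anyway.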
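The export above strips commas and newlines out of the content so that a plain split on commas works later. A minimal alternative sketch, assuming Python 3 and the standard csv module: quote the fields instead, so the text does not have to be altered. The function names `write_articles` and `read_articles` and the sample rows are hypothetical, not part of the original scripts.

```python
import csv
import io

def write_articles(rows, fileobj):
    """Write (article_id, content, category) tuples as fully quoted CSV."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    for article_id, content, category in rows:
        writer.writerow([article_id, content, category])

def read_articles(fileobj):
    """Yield (article_id, content, category), skipping incomplete rows."""
    for row in csv.reader(fileobj):
        if len(row) != 3 or any(not field.strip() for field in row):
            continue
        yield tuple(row)

# Hypothetical sample rows standing in for the database result set.
sample = [(1, "first line,\nsecond line", "tech"), (2, "  ", "news")]
buffer = io.StringIO(newline="")
write_articles(sample, buffer)
buffer.seek(0)
articles = list(read_articles(buffer))
```

The second sample row has blank content and is dropped on reading, which mirrors the empty-field checks the later scripts perform.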

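Step 2

Write a script trainTDF.py to generate our training data. You may want to read up on TfidfVectorizer first: it handles everything in one pass and produces the term-frequency (TF-IDF weight) matrix, which is the basis of all the calculations.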

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
import re
import string
import sys
# Bunch is the container for the Python objects we save to disk
from sklearn.datasets.base import Bunch
import cPickle as pickle  # used to store Python objects in files
from sklearn.feature_extraction.text import TfidfVectorizer  # builds the TF-IDF weight matrix
import pandas as pd
from pandas import Series, DataFrame
import jieba  # Python word segmentation module; remember to install it first

reload(sys)
sys.setdefaultencoding("utf8")

path = '/data/webapps/dataColl/sklearn/tlw/'

# Read a Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Write a Bunch object
def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

This imports the packages we need and defines two helpers for reading and writing objects. scikit-learn was already present in my environment (it is not part of the Python standard library, though some distributions bundle it), whereas the segmentation tool jieba has to be installed by hand, which is easy with pip.

# Lists that collect the per-article results
words_list = []
filename_list = []
category_list = []

# The stop-word list was downloaded from the web and put in the tlw directory; remember to convert it to UTF-8
stopwordContents = open(path + 'stopword.txt')
stopwords = stopwordContents.read().splitlines()
stopwordContents.close()

# These two are used to filter special characters out of the articles
delEStr = string.punctuation + ' ' + string.digits
identify = string.maketrans('', '')

# Segment one article into words. Some people online filter stop words in a loop here;
# I tried it and it was very slow, so I do it another way further down
def filewordprocess(contents):
    wordslist = []
    contents = re.sub(r'\s+', '', contents)  # remove all whitespace (spaces, tabs, newlines)
    contents = re.sub(r'\n', '', contents)  # remove newlines (already covered by the line above)
    contents = re.sub(r'[A-Za-z0-9]', '', contents)  # remove ASCII letters and digits
    contents = contents.translate(identify, delEStr)  # delete ASCII punctuation, spaces and digits
    #for w in jieba.cut(contents):
    #    if(w in stopwords) : continue
    #    wordslist.append(w)
    #file_string = ' '.join(wordslist)
    file_string = ' '.join(jieba.cut(contents))
    return file_string
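The cleaning part of `filewordprocess` relies on Python 2's `string.maketrans` and the two-argument form of `str.translate`, neither of which exists in Python 3. A rough Python 3 equivalent of just the cleaning (without the jieba call) might look like the sketch below; the name `clean_text` is made up for illustration.

```python
import re
import string

# Characters to delete: ASCII punctuation, space and digits (same set as delEStr).
DELETE_TABLE = str.maketrans("", "", string.punctuation + " " + string.digits)

def clean_text(contents):
    """Remove whitespace, ASCII letters and digits, and ASCII punctuation."""
    contents = re.sub(r"\s+", "", contents)
    contents = re.sub(r"[A-Za-z0-9]", "", contents)
    return contents.translate(DELETE_TABLE)

cleaned = clean_text("Python 3.8 发布了!\n很 好")
```

Note that `string.punctuation` only covers ASCII punctuation, so full-width Chinese punctuation marks would pass through this step untouched.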

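A word about the stop-word pitfall. The stop-word list is one I downloaded from the web and put straight into the tlw directory. At first I filtered with a loop (the commented-out lines above), but that was very slow. TfidfVectorizer does accept a stop-word list as a parameter when it builds the TF-IDF weight matrix. I do not rely on it because passing the list had no effect; it did not seem to work for Chinese. So I remove the stop words later, during feature-word selection.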
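One likely reason the commented-out loop was slow (an assumption; the article does not profile it) is that `stopwords` is a list, so every `w in stopwords` test scans the whole list. A small Python 3 sketch of the same filter using a set, where membership tests take roughly constant time. The stop words and tokens here are made up; the article loads stopword.txt and gets its tokens from `jieba.cut`.

```python
def remove_stopwords(tokens, stopword_set):
    """Return the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stopword_set]

# Hypothetical data standing in for stopword.txt and the jieba output.
stopword_list = ["的", "了", "是"]
stopword_set = frozenset(stopword_list)
tokens = ["今天", "的", "天气", "是", "晴天"]
file_string = " ".join(remove_stopwords(tokens, stopword_set))
```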

# Read the articles and process them one by one
contents = open(path + 'article.txt')
i = 0
for line in contents:
    row = line.split(',')
    if len(row) < 3: continue  # skip malformed lines
    if(row[0].strip() == ''):continue
    if(row[1].strip() == ''):continue
    if(row[2].strip() == ''):continue
    i += 1
    # Use the first 3200 articles for training
    if (i > 3200): break
    starttime = time.clock()
    wordProcessed = filewordprocess(row[1])  # segment the content into a space-separated string
    words_list.append(wordProcessed)
    filename_list.append(row[0].strip())
    category_list.append(row[2].strip())
    endtime = time.clock()
    print i,' file:%s>>>>category:%s>>>>import time:%.3f' % (row[0],row[2],endtime-starttime)
contents.close()

# See the documentation for the parameters. max_df=0.5, min_df=0.011 keeps the words that appear in between 1.1% and 50% of the articles
stop_words = stopwords  # the stop-word list is passed in here, but when I tried it, it had no effect on Chinese; a bit depressing, but luckily I am a clever guy
freWord = TfidfVectorizer(stop_words=stopwords, sublinear_tf=True, max_df=0.5, min_df=0.011)
# Build the TF-IDF matrix
fre_matrix = freWord.fit_transform(words_list)
feature_names = freWord.vocabulary_
freWordVector_df = pd.DataFrame(fre_matrix.toarray())  # weight matrix over the full vocabulary

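At this point the TF-IDF weight matrix is ready. Do remember to read the TfidfVectorizer documentation. For now, print freWordVector_df and take a look (the first time I printed it I got a little excited: wow, so this is a term-frequency matrix, and my confidence took off right away).

print freWordVector_df

I won't paste a screenshot here.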
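To make the `max_df=0.5, min_df=0.011` setting more concrete: as far as I understand TfidfVectorizer, float values are read as proportions of documents, so with roughly 3200 training articles a word has to appear in at least about 36 and at most about 1600 of them to be kept. The plain-Python sketch below imitates that filter on a toy corpus. It is an illustration of the idea, not scikit-learn's code, and `filter_by_document_frequency` is a made-up name.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )

# Toy corpus: "市场" occurs everywhere, "上涨" and friends only once.
docs = [
    ["股票", "上涨", "市场"],
    ["股票", "下跌", "市场"],
    ["足球", "比赛", "市场"],
    ["足球", "进球", "市场"],
]
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.75)
```

Words that occur in every document carry no information about the category, and words that occur once are too rare to learn from, which is why both ends are trimmed.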

# Feature selection: keep the 100 highest TF-IDF words of each article. There are a lot of words, but most of the words in any one article are useless
word_list = []
WORDS = Series(feature_names.keys(), index=feature_names.values())
for T in fre_matrix.toarray():
    T100 = Series(T).order()[-100:]
    words = WORDS[T100.index].values.tolist()
    word_list = list(set(word_list + words))

# Turn the list into a dict
i = 0
v = {}
for w in word_list:
    if w in stopwords: continue  # filter stop words (this is where I do it)
    v[w] = i
    i += 1

# Now rebuild the matrix from the words we selected (note the vocabulary parameter); the test set is built the same way so the two stay consistent
freWord = TfidfVectorizer(stop_words=stopwords, sublinear_tf=True, vocabulary=v)
fre_matrix = freWord.fit_transform(words_list)  # build the TF-IDF matrix
#tfidf = transformer.fit_transform(fre_matrix)  # build the word-vector matrix
#feature_names = freWord.get_feature_names()  # feature names
feature_names = freWord.vocabulary_
freWordVector_df = pd.DataFrame(fre_matrix.toarray())  # weight matrix over the selected vocabulary

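This step is mainly feature selection and stop-word filtering (taking the 100 highest TF-IDF words of each article), followed by building the matrix from the selected words. Note the vocabulary parameter: it tells TfidfVectorizer to build the matrix from a given dictionary, and the test data is later built from that same dictionary. (This step is my own invention; you can skip it, I'll understand.)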
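The selection described here can be sketched with NumPy in Python 3. This is a sketch under assumptions, not the original code: `select_vocabulary` is a made-up name, the weights are toy numbers, and unlike the article's loop it skips zero-weight columns so that short articles do not pull in arbitrary words.

```python
import numpy as np

def select_vocabulary(weights, terms, k, stopwords):
    """Union of each row's k highest-weighted terms, minus stop words,
    returned as a {term: column_index} dict."""
    selected = set()
    for row in np.asarray(weights):
        for col in np.argsort(row)[-k:]:   # columns of the k largest weights
            if row[col] > 0:               # assumption: ignore absent words
                selected.add(terms[col])
    kept = sorted(selected - set(stopwords))
    return {term: index for index, term in enumerate(kept)}

# Toy TF-IDF weights: 2 articles x 4 terms.
terms = ["股票", "市场", "足球", "的"]
weights = np.array([[0.9, 0.1, 0.0, 0.3],
                    [0.0, 0.2, 0.8, 0.4]])
vocabulary = select_vocabulary(weights, terms, k=2, stopwords=["的"])
```

A dict of this shape is what the `vocabulary` parameter of TfidfVectorizer expects: terms as keys and distinct column indices as values.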

The training data is now ready, so let's save it. Create a data directory under tlw; the Bunch we imported earlier is what we use to store Python objects, as follows.

tdfBunch = Bunch(target_name=[], label=[], filenames=[], contents=[], tdf=[], cateTDF=[])
vabularyBunch = Bunch(vabulary={})
tdfBunch.tdf = fre_matrix
tdfBunch.filenames = filename_list
tdfBunch.label = category_list
# Create a data directory under tlw to hold the output of this step
vabularyBunch.vabulary = freWord.vocabulary_
_writebunchobj(path+'data/train.dat', tdfBunch)
_writebunchobj(path+'data/words.dat', vabularyBunch)

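Run ./trainTDF.py and you are done. It is quick.

Now you can happily go and relax: flirt with a girl you like on WeChat, have a coffee, or admire yourself in the bathroom mirror. Anything goes. I'm off to look at some naughty pictures for a while (hehe).

Step 3

Write a script testTDF.py to generate our test data. I'll skip the small talk and go straight to the code.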

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
import re
import string
import sys
# Bunch is the container for the Python objects we save to disk
from sklearn.datasets.base import Bunch
import cPickle as pickle  # used to store Python objects in files
from sklearn.feature_extraction.text import TfidfVectorizer  # builds the TF-IDF weight matrix
import pandas as pd
from pandas import Series, DataFrame
import jieba  # Python word segmentation module; remember to install it first

reload(sys)
sys.setdefaultencoding("utf8")

path = '/data/webapps/dataColl/sklearn/tlw/'

# Read a Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Write a Bunch object
def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

words_list = []
filename_list = []
category_list = []
stopwords = []

delEStr = string.punctuation + ' ' + string.digits
identify = string.maketrans('', '')

def filewordprocess(contents):
    wordslist = []
    contents = re.sub(r'\s+', '', contents)  # remove all whitespace (spaces, tabs, newlines)
    contents = re.sub(r'\n', '', contents)  # remove newlines (already covered by the line above)
    contents = re.sub(r'[A-Za-z0-9]', '', contents)  # remove ASCII letters and digits
    contents = contents.translate(identify, delEStr)  # delete ASCII punctuation, spaces and digits
    #for w in jieba.cut(contents):
    #    if(w in stopwords) : continue
    #    wordslist.append(w)
    #file_string = ' '.join(wordslist)
    file_string = ' '.join(jieba.cut(contents))
    return file_string

contents = open(path + 'article.txt')
i = 0
for line in contents:
    row = line.split(',')
    if len(row) < 3: continue  # skip malformed lines
    if(row[0].strip() == ''):continue
    if(row[1].strip() == ''):continue
    if(row[2].strip() == ''):continue
    i += 1
    # Skip the first 3200 articles (the training set); the rest are the test set
    if (i <= 3200): continue
    starttime = time.clock()
    wordProcessed = filewordprocess(row[1])  # segment the content into a space-separated string
    words_list.append(wordProcessed)
    filename_list.append(row[0].strip())
    category_list.append(row[2].strip())
    endtime = time.clock()
    print i,' file:%s>>>>category:%s>>>>import time:%.3f' % (row[0],row[2],endtime-starttime)
contents.close()

Everything above is the same as when building the training matrix. Next we make use of the dictionary we saved earlier.

bunch = _readbunchobj(path+'data/words.dat')
freWord = TfidfVectorizer(stop_words=stopwords, sublinear_tf=True, vocabulary=bunch.vabulary)
fre_matrix = freWord.fit_transform(words_list)  # build the TF-IDF matrix
print pd.DataFrame(fre_matrix.toarray())
feature_names = freWord.vocabulary_
freWordVector_df = pd.DataFrame(fre_matrix.toarray())  # weight matrix over the saved vocabulary
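One thing worth knowing about the code above: it calls `fit_transform` on the test articles. As far as I understand TfidfVectorizer, a fixed `vocabulary` keeps the columns aligned, but the IDF weights are still recomputed from the test set, so they differ from the ones used in training. A common alternative (not what this article does) is to fit once on the training text and only call `transform` on the test text. A minimal sketch, assuming Python 3 and a current scikit-learn, with made-up pre-segmented documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already segmented documents (space-separated words).
train_docs = ["股票 市场 上涨", "股票 市场 下跌", "足球 比赛 进球"]
test_docs = ["足球 市场", "股票 比赛"]

vectorizer = TfidfVectorizer(sublinear_tf=True)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses both, no refit
```

With this approach it is the fitted vectorizer itself that would be saved between scripts, for example with pickle, rather than only its vocabulary.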

As before, you can print freWordVector_df to take a look first, then save it.

tdfBunch = Bunch(target_name=[], label=[], filenames=[], contents=[],tdf = [],cateTDF = [])

vabularyBunch = Bunch(vabulary = {})

tdfBunch.tdf = fre_matrix

tdfBunch.filenames = filename_list

tdfBunch.label = category_list

vabularyBunch.vabulary = feature_names

_writebunchobj(path+'data/test.dat', tdfBunch)

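Run ./testTDF.py. Good, now let's keep the momentum going and build a machine-learning model.

Step 4: train on the data and build the model

We use the multinomial naive Bayes algorithm, MultinomialNB; have a look at its API documentation.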

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
import sys
# Import the Bunch class
from sklearn.datasets.base import Bunch
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes
from sklearn.externals import joblib  # for saving the model
import cPickle as pickle
import pandas as pd
from pandas import Series, DataFrame

reload(sys)
sys.setdefaultencoding("utf8")

path = '/data/webapps/dataColl/sklearn/tlw/'

# Read a Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Write a Bunch object
def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

Load the training data:

bunchTDF = _readbunchobj(path+'data/train.dat')

TDF = bunchTDF.tdf

LABEL = bunchTDF.label

TDF = pd.DataFrame(TDF.toarray())

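At this point you could in fact train directly on TDF. That is exactly what I excitedly did at first, and the accuracy was only forty-something percent. I was thrown, lost all interest in the naughty pictures, and hunted everywhere for the cause until I found the approach below.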

UTDFIndex = TDF.sum(axis=0).order()[-300:].index

UTDF = TDF[UTDFIndex]

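The code above should be easy to follow: sum each column of the TDF matrix (one column per word) over all articles, sort the totals, and keep the 300 columns with the largest totals (you can tune this number yourself). The resulting matrix UTDF is then used for training, as follows: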

clf = MultinomialNB(alpha=0.0001).fit(UTDF, bunchTDF.label)

joblib.dump(clf, path + 'data/file.model')
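The column selection used before training calls `Series.order()`, which recent pandas versions no longer provide (`sort_values` replaced it). The same idea can be written with NumPy alone. The sketch below assumes Python 3, uses toy weights, and the helper name `top_columns_by_total` is hypothetical.

```python
import numpy as np

def top_columns_by_total(matrix, n):
    """Return the (sorted) indices of the n columns with the largest sums."""
    totals = np.asarray(matrix).sum(axis=0)  # one total per column (word)
    return np.sort(np.argsort(totals)[-n:])

# Toy weights: 3 articles x 4 words.
weights = np.array([[0.9, 0.1, 0.0, 0.3],
                    [0.0, 0.2, 0.8, 0.4],
                    [0.5, 0.0, 0.1, 0.3]])
keep = top_columns_by_total(weights, 2)
reduced = weights[:, keep]  # same rows, only the selected columns
```

Summing a column adds up one word's weight across all articles, so this keeps the words that matter most over the whole training set rather than within a single article.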

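Haha, wasn't that simple? A machine-learning model is born just like that, and the whole world is cheering for you. Go and tell the girl you like: "I've already entered the field of artificial intelligence." She is bound to admire you, at least for a moment...

Now that we have a model, the next thing is to see whether it is actually any good.

Step 5: test the model

We generated and saved our test matrix earlier; now it is time to use it.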

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
import sys
# Import the Bunch class
from sklearn.datasets.base import Bunch
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes
from sklearn.externals import joblib  # for loading the saved model
import cPickle as pickle
import pandas as pd
from pandas import Series, DataFrame

reload(sys)
sys.setdefaultencoding("utf8")

path = '/data/webapps/dataColl/sklearn/tlw/'

# Read a Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Write a Bunch object
def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

# TDF of the training set
bunchTDF = _readbunchobj(path+'data/train.dat')
TDF = bunchTDF.tdf

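During training we applied a dimensionality reduction to the data, remember? That is the part that sums the columns and keeps the top three hundred. So that the test data and the training data live in the same column space, we redo the reduction here to get the column index of the reduced matrix. You could of course save that index during the training step and load it here instead; I did not bother, since it runs quickly anyway and makes no real difference.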

TDF = pd.DataFrame(TDF.toarray())

UTDFIndex = TDF.sum(axis=0).order()[-300:].index
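The alternative mentioned in the text, saving the selected column index at training time and loading it in the test script, could look roughly like the Python 3 sketch below. The file name columns.dat, the helper names and the index values are all made up for illustration.

```python
import os
import pickle
import tempfile

def save_selected_columns(path, column_index):
    """Persist the column indices chosen at training time."""
    with open(path, "wb") as file_obj:
        pickle.dump(list(column_index), file_obj)

def load_selected_columns(path):
    """Load the column indices back in the test script."""
    with open(path, "rb") as file_obj:
        return pickle.load(file_obj)

with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "columns.dat")  # hypothetical file name
    save_selected_columns(index_path, [0, 3, 7])       # made-up indices
    restored_columns = load_selected_columns(index_path)
```

Besides saving a recomputation, this guarantees that training and testing use exactly the same columns even if the training data file is regenerated in between.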

Load the test data:

bunchTDF = _readbunchobj(path+'data/test.dat')

TDF = bunchTDF.tdf

LABEL = bunchTDF.label

TDF = pd.DataFrame(TDF.toarray())

UTDF = TDF[UTDFIndex]

The exciting moment has arrived:

# Use the model
clf = joblib.load(path + 'data/file.model')
accuracy = clf.score(UTDF, bunchTDF.label)  # mean accuracy on the test set
print accuracy

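Damn, it only comes out at sixty-something percent accuracy...

No matter. When this happens, we look at why the predictions went wrong, so let's print the ones that were mispredicted and have a look...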

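predicted_labels = clf.predict(UTDF)  # one predicted category per test article
m = -1
for i in predicted_labels:
    m += 1
    if(bunchTDF.label[m] != i):
        # article id, predicted category, human-assigned category
        print bunchTDF.filenames[m],",",i,",",bunchTDF.label[m]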

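This prints the article id, the predicted category and the human-assigned category. The good news is that I went through them, and for most of these articles it was the human who got the category wrong and the machine that got it right... To show how magnanimous I am, I'll just draw a circle and curse whoever categorized these articles (and if it was a girl, may she be infatuated with me for life, hahaha).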
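For reference, `score` returns a single accuracy figure while `predict` returns one label per article, and it is the latter you loop over to find the mistakes. A self-contained Python 3 sketch with a current scikit-learn; the counts, labels and ids are made up, and the last test label is deliberately questionable, like the human labels mentioned above.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy weights; think of the columns as 股票, 市场, 足球, 比赛.
train_x = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [0, 1, 3, 1], [0, 0, 2, 2]])
train_y = ["finance", "finance", "sports", "sports"]
test_x = np.array([[1, 1, 0, 0], [0, 0, 1, 3], [2, 0, 0, 1]])
test_y = ["finance", "sports", "sports"]
test_ids = ["a1", "a2", "a3"]

clf = MultinomialNB(alpha=0.0001).fit(train_x, train_y)
accuracy = clf.score(test_x, test_y)  # a single float
predicted = clf.predict(test_x)       # one label per row

# Collect (id, predicted, human label) for every disagreement.
mismatches = [
    (doc_id, guess, truth)
    for doc_id, guess, truth in zip(test_ids, predicted, test_y)
    if guess != truth
]
```

Reading through `mismatches` by hand is how you tell whether the model or the labels are at fault.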
