NLP12: Bayes and Text Classification

Latest recommended article published on 2024-05-02 15:55:51

happyprince


Categories: NLP, data mining, python. Tags: nlp, bayes, sklearn

Article link: https://blog.csdn.net/ld326/article/details/78524486


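Abstract: This post covers the basics of Bayes (the formula and the principle) and a small example of applying Bayes to text classification. After working through a hand-computed example, it uses scikit-learn to explore classifying Chinese text, with a little over 200 records in three classes. Accuracy with all three classes together is 83%; separating any two classes gives over 90%.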

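0. Definition of Bayes

Definitions of Bayes are easy to find online; see, for example, "从贝叶斯方法谈到贝叶斯网络" (From Bayesian Methods to Bayesian Networks),
http://blog.csdn.net/v_july_v/article/details/40984699.
The core idea: prior distribution f(a) + sample information X ==> posterior distribution f(a|x)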
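To make the prior-to-posterior step concrete, here is a minimal sketch for a discrete set of hypotheses. The helper name `posterior` and all the numbers are assumptions chosen for illustration; they do not come from the referenced article.

```python
def posterior(prior, likelihood):
    """Return P(a | x) for each hypothesis a, given P(a) and P(x | a)."""
    unnormalized = {a: prior[a] * likelihood[a] for a in prior}
    evidence = sum(unnormalized.values())  # P(x), the normalizing constant
    return {a: p / evidence for a, p in unnormalized.items()}


# Hypothetical numbers: 30% of documents are "health" and 70% are "news";
# a given word appears in 40% of health documents and 5% of news documents.
prior = {"health": 0.3, "news": 0.7}
likelihood = {"health": 0.4, "news": 0.05}

post = posterior(prior, likelihood)
print(post)
```

Seeing the word shifts most of the probability mass to "health" even though "news" had the larger prior, which is the prior + sample ==> posterior idea in miniature.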

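1. Example

Once the formula and principle of Bayes are clear, it helps to see how it is used in text classification by computing a simple example by hand. The example below comes from the post at http://blog.csdn.net/jteng/article/details/51499363:
[Figure: hand-worked example from the linked post]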
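Since the figure is not reproduced here, the following is a rough sketch of the same kind of hand calculation: multinomial naive Bayes with add-one (Laplace) smoothing on a tiny corpus. The corpus, the function names `fit` and `log_scores`, and the choice to skip out-of-vocabulary words are assumptions for illustration and are not necessarily what the linked post uses.

```python
import math
from collections import Counter

# Toy training set (an assumption): (document text, class label)
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]


def fit(docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def log_scores(text, class_docs, word_counts, vocab):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for w in text.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores


model = fit(train)
scores = log_scores("Chinese Chinese Chinese Tokyo Japan", *model)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Working in log space avoids underflow when many small probabilities are multiplied; the class with the largest score is the prediction.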

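2. Practice

After learning the definition and seeing how Bayes applies to text, the next question is how the computation is implemented. The scikit-learn user guide has a section on naive Bayes (http://scikit-learn.org/stable/modules/naive_bayes.html),
which explains it clearly:
[Figure: explanation of naive Bayes from the scikit-learn user guide]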

2.1 sklearn

There are three Naive Bayes models: Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes.
All three provide a partial_fit function for training on data sets too large to process in one pass.
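As a small sketch of what incremental training looks like, the snippet below feeds GaussianNB one batch at a time through `partial_fit` and compares the learned class means with a single `fit` on all the data. The synthetic data and the batch size of 50 are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (an assumption standing in for a large data set)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(3.0, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(y))
X, y = X[order], y[order]

clf = GaussianNB()
classes = np.unique(y)  # every class must be declared on the first call
for start in range(0, len(y), 50):
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=classes)

# Reference model trained on everything at once
full = GaussianNB().fit(X, y)
print(np.allclose(clf.theta_, full.theta_))
```

The `classes` argument matters because an early batch may not contain every label, so the model needs the full label set up front.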

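2.2 Constructor

def __init__(self, priors=None)
The GaussianNB constructor accepts prior probabilities. If no priors are given, they are estimated from the class counts like this: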

# Update if only no priors is provided
if self.priors is None:
    # Empirical prior, with sample_weight taken into account
    self.class_prior_ = self.class_count_ / self.class_count_.sum()
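A short sketch of the two cases, using a tiny made-up data set (the values are assumptions for illustration): without `priors`, `class_prior_` is the empirical class frequency; with `priors`, the supplied values are used as given.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Three samples of class 0 and two of class 1 (made-up values)
X = np.array([[0.0], [0.2], [0.1], [3.0], [3.1]])
y = np.array([0, 0, 0, 1, 1])

# No priors supplied: class_prior_ = class_count_ / class_count_.sum()
empirical = GaussianNB().fit(X, y)
print(empirical.class_prior_)

# Priors supplied: they are used instead of the empirical frequencies
fixed = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(fixed.class_prior_)
```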

2.3 Two training interfaces

def fit(self, X, y, sample_weight=None)
def partial_fit(self, X, y, classes=None, sample_weight=None)

Both training functions delegate to this internal function:

def _partial_fit(self, X, y, classes=None, _refit=False,
                 sample_weight=None)

Parameter updates: for Gaussian Naive Bayes the mean and variance can be updated online. The reference paper is "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances", http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
def _update_mean_variance(n_past, mu, var, X, sample_weight=None)
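The idea behind the online update can be sketched as follows: keep the running count, mean, and variance, and merge each new batch using the pairwise formula for combining sums of squared deviations. This is a simplified, unweighted version written for illustration under the hypothetical name `update_mean_variance`; it is not scikit-learn's actual implementation.

```python
import numpy as np


def update_mean_variance(n_past, mu, var, X):
    """Merge a new batch X into a running (count, mean, variance) per feature."""
    n_new = X.shape[0]
    if n_new == 0:
        return n_past, mu, var
    new_mu = X.mean(axis=0)
    new_var = X.var(axis=0)
    if n_past == 0:
        return n_new, new_mu, new_var

    n_total = n_past + n_new
    total_mu = (n_past * mu + n_new * new_mu) / n_total
    # Combine sums of squared deviations; the last term corrects for the
    # difference between the old mean and the batch mean.
    old_ssd = n_past * var
    new_ssd = n_new * new_var
    total_ssd = old_ssd + new_ssd + (n_past * n_new / n_total) * (mu - new_mu) ** 2
    return n_total, total_mu, total_ssd / n_total


# Feed made-up data in four uneven batches and compare with the direct result
rng = np.random.default_rng(1)
data = rng.normal(size=(57, 3))
n, mu, var = 0, np.zeros(3), np.zeros(3)
for chunk in np.array_split(data, 4):
    n, mu, var = update_mean_variance(n, mu, var, chunk)
print(np.allclose(mu, data.mean(axis=0)), np.allclose(var, data.var(axis=0)))
```

Because only three numbers per feature are carried between batches, the full data set never has to be held in memory.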

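3. Data

The data was scraped from three categories (chronic disease prevention, mother and baby, pharmaceutical industry news). The distribution of records across the categories:
[Figure: record distribution]

Labels: the three classes are labeled 0, 1, and 3. To see how well any two classes can be told apart, the three classes were split into pairs, giving three groups to evaluate. As in the previous post, LSI was used to generate the vector matrix, and the classifier was trained on that matrix with Gaussian Naive Bayes. One question remains open for me: are the samples normally distributed after LSI dimensionality reduction? If you know, please tell me.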
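One way to probe the normality question empirically is to run a Shapiro-Wilk test on each feature within each class, since Gaussian Naive Bayes models every feature as Gaussian per class. The sketch below uses synthetic stand-in data (one Gaussian column, one exponential column) rather than the real LSI matrix, and the helper name `normality_pvalues` is hypothetical.

```python
import numpy as np
from scipy import stats


def normality_pvalues(X, y):
    """Shapiro-Wilk p-value for every feature within every class."""
    return {
        label: np.array([stats.shapiro(X[y == label][:, j])[1]
                         for j in range(X.shape[1])])
        for label in np.unique(y).tolist()
    }


# Stand-in data: column 0 is Gaussian, column 1 is clearly non-Gaussian
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.exponential(size=200)])
y = np.array([0] * 100 + [1] * 100)

pvalues = normality_pvalues(X, y)
for label, p in pvalues.items():
    print(label, p)
```

A very small p-value suggests that the feature departs from a normal distribution for that class. With the real data, `X` would be the LSI matrix and `y` the label vector; a rejection does not by itself mean the classifier will perform badly.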

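4. Results

Classifying mother-and-baby against chronic disease articles gives a mean ROC area of 0.95, which is fairly good.
[Figure: ROC curves, mother and baby vs. chronic disease prevention]
Mother and baby vs. news

Chronic disease prevention vs. news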
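The pairwise evaluation described above can be sketched compactly with scikit-learn's cross-validation helpers. The data here is synthetic (three well-separated Gaussian blobs labeled 0, 1, and 3), and the function name `pairwise_auc` is hypothetical; the real experiment uses the LSI matrix and its label file.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB


def pairwise_auc(X, y, n_splits=6):
    """Mean cross-validated ROC AUC of GaussianNB for every pair of classes."""
    results = {}
    for a, b in combinations(np.unique(y).tolist(), 2):
        mask = np.isin(y, [a, b])
        y_pair = (y[mask] == b).astype(int)  # treat class b as the positive class
        cv = StratifiedKFold(n_splits=n_splits)
        scores = cross_val_score(GaussianNB(), X[mask], y_pair,
                                 cv=cv, scoring="roc_auc")
        results[(a, b)] = scores.mean()
    return results


# Synthetic stand-in for the three categories, using the labels 0, 1, 3
rng = np.random.default_rng(0)
centers = {0: 0.0, 1: 2.0, 3: 4.0}
X = np.vstack([rng.normal(c, 1.0, size=(60, 10)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 60)

results = pairwise_auc(X, y)
for pair, score in results.items():
    print(pair, round(score, 3))
```

This reports one number per pair instead of drawing the per-fold curves; the full script in the next section plots the ROC curves themselves.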

5. Code

# -*- coding:utf-8 -*-
import re
import string

import jieba
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
from gensim import corpora, models, matutils
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB


# Return True if the token is a number (integer or decimal)
def isXiaoShu(word):
    return word != '' and re.search(r'^\d*\.?\d*$', word) is not None


# Word segmentation
def cutPhase(inFile, outFile):
    # jieba.load_userdict("dict_all.txt")
    with open('config/stopwords.txt', 'r', encoding='utf-8') as f:
        stoplist = {line.strip() for line in f}
    with open(inFile, 'r', encoding='utf-8') as f1, \
            open(outFile, 'w', encoding='utf-8') as f2:
        for count, line in enumerate(f1, start=1):
            # Strip HTML markup and keep only the text
            line = BeautifulSoup(line, "lxml").text
            segs = (word.strip() for word in jieba.cut(line, cut_all=False))
            # Drop empty tokens, stop words, punctuation, and numbers
            segs = [word for word in segs
                    if word
                    and word not in stoplist
                    and word not in string.punctuation
                    and not isXiaoShu(word)]
            f2.write(" ".join(segs))
            f2.write('\n')
            if count % 100 == 0:
                print(count)


class MyNews(object):
    # Streams one bag-of-words vector per line of the segmented file
    def __init__(self, dictionary, in_file):
        self.dictionary = dictionary
        self.in_file = in_file

    def __iter__(self):
        with open(self.in_file, encoding='utf-8') as f:
            for line in f:
                # Lowercase to match how the dictionary was built
                yield self.dictionary.doc2bow(line.lower().split())


def trainBayes():
    print('Loading bag-of-words corpus')
    bows = corpora.MmCorpus('data/资讯文章数据.mm')
    print('Loading TF-IDF and LSI models')
    tfidf = models.TfidfModel.load('data/资讯文章数据.tfidf')
    lsi = models.LsiModel.load('data/资讯文章数据.lsi')
    # The LSI model was trained on TF-IDF vectors, so apply the same transform here
    bow_lsi = lsi[tfidf[bows]]
    # Convert the gensim corpus to a dense numpy array of shape (n_docs, 100)
    data = np.transpose(matutils.corpus2dense(bow_lsi, 100))
    # Label file: one label per document, restricted to the two classes being compared
    target = np.loadtxt("data/资讯文章数据_f.txt")
    print('data.shape:', data.shape)
    print('target.shape:', target.shape)
    classifier = GaussianNB()
    print(classifier.get_params())
    n_folds = 6
    cv = StratifiedKFold(n_splits=n_folds)
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    # SimHei provides CJK glyphs; only needed when plot text contains Chinese
    plt.rcParams["font.family"] = "SimHei"
    for i, (train, test) in enumerate(cv.split(data, target)):
        probas_ = classifier.fit(data[train], target[train]).predict_proba(data[test])
        # Column 1 is the probability of classifier.classes_[1], used as the positive class
        fpr, tpr, thresholds = roc_curve(target[test], probas_[:, 1],
                                         pos_label=classifier.classes_[1])
        # Interpolate this fold's TPR onto the shared mean_fpr grid
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0  # the curve starts at (0, 0)
        roc_auc = auc(fpr, tpr)
        # Plot this fold's ROC curve; roc_auc only appears in the legend
        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    # Diagonal reference line
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
    mean_tpr /= n_folds  # average the interpolated TPR over all folds
    mean_tpr[-1] = 1.0  # the curve ends at (1, 1)
    mean_auc = auc(mean_fpr, mean_tpr)  # mean AUC
    # Mean ROC curve
    plt.plot(mean_fpr, mean_tpr, 'k--',
             label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Two-class ROC: chronic disease prevention vs. news')
    plt.legend(loc="lower right")
    plt.show()


if __name__ == '__main__':
    is_train = True
    # Build the models, then train and evaluate
    if is_train:
        print("*** Word segmentation ***")
        cutPhase(inFile='data/资讯文章数据.txt', outFile='data/资讯文章数据.cut')

        print("*** Building dictionary ***")
        with open('data/资讯文章数据.cut', encoding='utf-8') as f:
            dictionary = corpora.Dictionary(line.lower().split() for line in f)
        dictionary.save('data/资讯文章数据.dic')

        # To reuse a saved dictionary instead:
        # dictionary = corpora.Dictionary.load('data/资讯文章数据.dic')
        print('================= dictionary info =============')
        print('Number of distinct tokens:', len(dictionary.keys()))
        print('Documents processed (num_docs):', dictionary.num_docs)
        print('Total tokens before deduplication (num_pos):', dictionary.num_pos)
        print('================= dictionary =============')
        bows = MyNews(dictionary, in_file='data/资讯文章数据.cut')

        print("*** Saving bag-of-words corpus ***")
        corpora.MmCorpus.serialize('data/资讯文章数据.mm', bows)

        print("*** Computing TF-IDF ***")
        tfidf = models.TfidfModel(dictionary=dictionary)
        corpus_tfidf = tfidf[bows]
        tfidf.save('data/资讯文章数据.tfidf')

        print("*** Computing and saving the LSI model ***")
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
        lsi.save('data/资讯文章数据.lsi')
        # Train and evaluate the classifier
        trainBayes()