Chinese Text Classification: A First Look at Feature Selection Algorithms

Seepen_L

Column: Text Feature Selection. Tags: python, machine learning, data mining

Article link: https://blog.csdn.net/qq_39496504/article/details/105375213

Feature Selection for Chinese Text Classification

0 Dataset recap
- A small change
1 Feature selection: which terms matter more?
2 Results
- MI_result
3 Reference

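0 Dataset recap

A small change

Readers of my previous post may remember that the training set we ended up with was one big txt file per category. Data in that shape is not unusable: it is convenient for feature selection methods based on term frequency. But as soon as document frequency is involved it gets awkward, because all the documents have been merged together and per-document counts can no longer be computed.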
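To make the difference concrete, here is a minimal sketch with three made-up, already segmented documents (the data is hypothetical, not taken from the corpus). Term frequency can be computed from the merged text, while document frequency needs the document boundaries.

```python
from collections import Counter

# Three toy documents, already segmented into words (hypothetical data).
docs = [
    ["股票", "上涨", "股票"],
    ["股票", "基金"],
    ["足球", "比赛"],
]

# Term frequency survives merging: count tokens in the concatenated text.
merged = [w for doc in docs for w in doc]
term_freq = Counter(merged)

# Document frequency needs the boundaries: count each term once per document.
doc_freq = Counter(w for doc in docs for w in set(doc))

print(term_freq["股票"])  # 3 occurrences in total
print(doc_freq["股票"])   # 2 documents contain it
```

Once the documents are merged into one file, only the first of these two numbers can still be recovered.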

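The tutorial I finally settled on uses MI, IG, and CHI for a first-pass selection, so I rebuilt the dataset to fit the input its functions expect: the same layout as the standard corpus in the previous post, with each small training document segmented into words and saved separately.

That is all for the dataset. (It is only a short paragraph, but going from figuring out the document format to changing it cost me half a day. I am still a beginner.)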

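1 Feature selection: which terms matter more?

1.1 Preprocessing

OK, it turns out that preprocessing is still needed before the real work starts.

This time the documents and their labels are combined into one nested list. This is also a large object, the all_text variable, in which every document is a separate entry with its own label.

After inspecting the result, I would summarize its data structure as [('label', {'term', 'term', ...}), ('label', {'term', 'term', ...}), ...].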
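As a small illustration of that structure, the sketch below builds a hand-made stand-in for all_text (the labels and terms are invented) and shows how each entry unpacks. The name vocab is only used here for illustration.

```python
# A hand-made stand-in for all_text (labels and terms are hypothetical).
all_text = [
    ("finance", {"股票", "基金", "上涨"}),
    ("sports", {"足球", "比赛"}),
    ("finance", {"股票", "银行"}),
]

# Each entry unpacks into a label and a set of distinct terms.
for label, word_set in all_text:
    print(label, len(word_set))

# Storing terms as a set makes "t in word_set" cheap, and the union of
# all sets gives the distinct terms of the whole corpus.
vocab = set().union(*(word_set for _, word_set in all_text))
print(len(vocab))  # 6 distinct terms
```

Using a set per document also means repeated words inside one document are counted once, which is what document-frequency based methods need.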

code

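def preprocess(raw_text):
    """
    Preprocess: split the str read from a file on whitespace and turn it into a list
    :param raw_text: the data returned by read()
    :return:   the processed list of words
    """
    # Split on whitespace into a list
    word_li = raw_text.split()
    # Drop empty or whitespace-only entries
    word_li = [w.strip() for w in word_li if w.strip()]
    # Remove single-character words
    word_li = [w for w in word_li if len(w) > 1]
    return word_li


def getDocuments(root_path, catoga_li):
    """
    Read the raw document collection and preprocess it
    :param root_path: path of the document collection
    :param catoga_li: list of document categories
    :return: list of preprocessed documents as (label, word_set) tuples
    """
    all_text = []
    print("start load files...")
    # load_files comes from sklearn.datasets (imported in the main script)
    all_data = load_files(container_path=root_path, categories=catoga_li,
                          encoding="utf-8", decode_error="ignore")
    print("load files over.")

    for label, raw_text in zip(all_data.target, all_data.data):
        word_li = preprocess(raw_text)
        # Map the numeric target back to its category name
        label = all_data.target_names[label]
        all_text.append((label, set(word_li)))
    return all_text


def getVocabulary(all_text):
    """
    Collect the vocabulary of the document collection
    :param all_text: list of (label, word_set) tuples from getDocuments
    :return: None; the module-level set `vocabulary` is updated in place
    """
    global vocabulary
    for label, word_set in all_text:
        vocabulary |= word_set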

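The resulting vocabulary is the word inventory: every distinct word is stored once in a set, which is iterated over later when computing chi-square and mutual information.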

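1.2 Three feature selection methods

The original author wrapped MI, DF, and CHI into a single function. Honestly, their underlying principles and formulas have never excited me; I will post them in detail in a few days when I write my paper. Today is only about debugging and adapting the code, so let's stay pragmatic: if it works, it is good enough.

Mutual Information

Theory and formulas to be filled in later; code first.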


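def multual_infomation(N_10, N_11, N_00, N_01):
    """
    Mutual information
    :param N_10: documents that contain term t but are not in category c
    :param N_11: documents that contain term t and are in category c
    :param N_00: documents that neither contain term t nor are in category c
    :param N_01: documents that do not contain term t but are in category c
    :return: mutual information value of term t
    """
    # All four counts must be non-zero, otherwise log2 receives 0;
    # selectFeatures skips such terms before calling this function.
    N = N_11 + N_10 + N_01 + N_00
    I_UC = (N_11 * 1.0 / N) * log2((N_11 * N * 1.0) / ((N_11 + N_10) * (N_11 + N_01))) + \
           (N_01 * 1.0 / N) * log2((N_01 * N * 1.0) / ((N_01 + N_00) * (N_01 + N_11))) + \
           (N_10 * 1.0 / N) * log2((N_10 * N * 1.0) / ((N_10 + N_11) * (N_10 + N_00))) + \
           (N_00 * 1.0 / N) * log2((N_00 * N * 1.0) / ((N_00 + N_10) * (N_00 + N_01)))
    return I_UC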
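Until the full derivation is written up, the formula can at least be read off the code above. The marginal symbols below are my own shorthand and do not appear in the original code: N_{1·} = N_{11} + N_{10} (documents containing t), N_{0·} = N_{01} + N_{00}, N_{·1} = N_{11} + N_{01} (documents in category c), N_{·0} = N_{10} + N_{00}, and N is the sum of all four counts.

```latex
I(U;C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1\cdot}\,N_{\cdot 1}}
       + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0\cdot}\,N_{\cdot 1}}
       + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1\cdot}\,N_{\cdot 0}}
       + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0\cdot}\,N_{\cdot 0}}
```

The four terms appear in the same order as the four lines of the I_UC expression.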

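Document Frequency

Theory and formulas to be filled in later; code first.
Well, this one is simple and has come up before, so I will just state it now. The principle fits in one sentence (the fraction of documents in a category that contain the term), and the basic formula is already in the comments.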


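def freq_select(t_doc_cnt, doc_cnt):
    """
    Document-frequency feature score
    :param t_doc_cnt: number of documents in category c that contain term t
    :param doc_cnt: total number of documents in category c
    :return: frequency feature value of term t
    """
    return t_doc_cnt * 1.0 / doc_cnt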

Chi-square Test (CHI)

Theory and formulas to be filled in later; code first.


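def chi_square(N_10, N_11, N_00, N_01):
    """
    Chi-square statistic
    :param N_10: documents that contain term t but are not in category c
    :param N_11: documents that contain term t and are in category c
    :param N_00: documents that neither contain term t nor are in category c
    :param N_01: documents that do not contain term t but are in category c
    :return: chi-square value of term t
    """
    # fenzi = numerator, fenmu = denominator
    fenzi = (N_11 + N_10 + N_01 + N_00) * (N_11 * N_00 - N_10 * N_01) * (N_11 * N_00 - N_10 * N_01)
    fenmu = (N_11 + N_01) * (N_11 + N_10) * (N_10 + N_00) * (N_01 + N_00)
    if fenmu == 0:
        return 0
    return fenzi * 1.0 / fenmu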
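Read off the code above, the statistic being computed is the usual chi-square for a 2x2 term/category table, where N = N_{11} + N_{10} + N_{01} + N_{00}. This is a transcription of the code, not a derivation.

```latex
\chi^2(t, c) =
\frac{N \,\bigl(N_{11} N_{00} - N_{10} N_{01}\bigr)^2}
     {(N_{11} + N_{01})\,(N_{11} + N_{10})\,(N_{10} + N_{00})\,(N_{01} + N_{00})}
```

The numerator corresponds to fenzi and the denominator to fenmu; the function returns 0 when the denominator is 0.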

Combined function

code

def selectFeatures(documents, category_name, top_k, select_type="chi"):
    """
    Feature extraction
    :param documents: the preprocessed document collection
    :param category_name: category name
    :param top_k:  number of best features to return
    :param select_type: feature selection method, one of chi, mi, freq; defaults to chi
    :return:  list of the best (term, score) pairs
    """
    L = []
    # Mutual information and chi-square feature extraction
    if select_type == "chi" or select_type == "mi":
        for t in vocabulary:
            N_11 = 0
            N_10 = 0
            N_01 = 0
            N_00 = 0
            N = 0
            for label, word_set in documents:
                if (t in word_set) and (category_name == label):
                    N_11 += 1
                elif (t in word_set) and (category_name != label):
                    N_10 += 1
                elif (t not in word_set) and (category_name == label):
                    N_01 += 1
                elif (t not in word_set) and (category_name != label):
                    N_00 += 1
                else:
                    print("N error")
                    exit(1)

            # Skip terms with any empty cell (this also keeps log2 away from 0)
            if N_00 == 0 or N_01 == 0 or N_10 == 0 or N_11 == 0:
                continue
            # Mutual information
            if select_type == "mi":
                A_tc = multual_infomation(N_10, N_11, N_00, N_01)
            # Chi-square
            else:
                A_tc = chi_square(N_10, N_11, N_00, N_01)
            L.append((t, A_tc))
    # Frequency-based feature extraction
    elif select_type == "freq":
        for t in vocabulary:
            # Total number of documents in category C
            doc_cnt = 0
            # Number of documents in category C that contain term t
            t_doc_cnt = 0
            for label, word_set in documents:
                if category_name == label:
                    doc_cnt += 1
                    if t in word_set:
                        t_doc_cnt += 1
            A_tc = freq_select(t_doc_cnt, doc_cnt)
            L.append((t, A_tc))
    else:
        print("error param select_type")
    return sorted(L, key=lambda x: x[1], reverse=True)[:top_k]
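To see what the four counters in selectFeatures hold for one term and one category, here is a minimal sketch on invented data. The helper name contingency_counts and the toy documents are hypothetical; the counting rule mirrors the if/elif chain above.

```python
def contingency_counts(documents, term, category):
    """Return (N_11, N_10, N_01, N_00) for one term and one category.

    First index: 1 if the document contains the term.
    Second index: 1 if the document belongs to the category.
    """
    n11 = n10 = n01 = n00 = 0
    for label, word_set in documents:
        has_term = term in word_set
        in_class = (label == category)
        if has_term and in_class:
            n11 += 1
        elif has_term:
            n10 += 1
        elif in_class:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


# Toy corpus in the same (label, word_set) shape as all_text.
docs = [
    ("finance", {"股票", "基金"}),
    ("finance", {"股票", "银行"}),
    ("finance", {"银行", "利率"}),
    ("sports", {"足球", "股票"}),
    ("sports", {"足球", "比赛"}),
]

counts = contingency_counts(docs, "股票", "finance")
print(counts)  # (2, 1, 1, 1)
```

Note that selectFeatures drops any term for which one of these four counts is 0, so a term that occurs only inside the target category never receives an MI or chi-square score.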

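Done for the day. Oh, one last step:

1.3 Calling from main

I wrote this up module by module, so the code structure may be a bit messy; bear with it.

One more remark: in the previous post my training set had 4,000 documents per category, 32,000 in total. When I ran MI on it in the evening, a single category took nearly 40 minutes. At that rate the full run would take five or six hours, which I could not afford, so I halved both the training set and the test set.

Maybe my computer is just too weak; if I get into grad school it will be time for a new one.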
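Hardware aside, part of the running time probably comes from selectFeatures rescanning every document for every vocabulary term. A possible alternative, sketched below under the assumption that documents keep the (label, word_set) shape, is to count document frequencies once and derive the four cells arithmetically. The function names build_doc_freq and counts_for are hypothetical, and I have not timed this on the real corpus.

```python
from collections import Counter, defaultdict


def build_doc_freq(documents):
    """One pass over the corpus: documents per class and per-class DF."""
    class_docs = Counter()                # label -> number of documents
    class_term_df = defaultdict(Counter)  # label -> term -> docs containing it
    for label, word_set in documents:
        class_docs[label] += 1
        class_term_df[label].update(word_set)
    return class_docs, class_term_df


def counts_for(term, category, class_docs, class_term_df, total_df):
    """Derive (N_11, N_10, N_01, N_00) from the precomputed counts."""
    n_docs = sum(class_docs.values())
    n11 = class_term_df[category][term]   # has term, in category
    n10 = total_df[term] - n11            # has term, other categories
    n01 = class_docs[category] - n11      # no term, in category
    n00 = n_docs - n11 - n10 - n01        # no term, other categories
    return n11, n10, n01, n00


docs = [
    ("finance", {"股票", "基金"}),
    ("finance", {"股票", "银行"}),
    ("finance", {"银行", "利率"}),
    ("sports", {"足球", "股票"}),
    ("sports", {"足球", "比赛"}),
]

class_docs, class_term_df = build_doc_freq(docs)

# Corpus-wide document frequency: sum of the per-class counters.
total_df = Counter()
for df in class_term_df.values():
    total_df.update(df)

fast_counts = counts_for("股票", "finance", class_docs, class_term_df, total_df)
print(fast_counts)  # (2, 1, 1, 1)
```

The resulting tuple can be passed to multual_infomation or chi_square in place of the counters accumulated in the inner loop.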


from sklearn.datasets import load_files

# from pyhanlp import *

train_path = "data_standard/train_splited/"
category_l = ['auto', 'education', 'finance', 'game', 'IT', 'politics', 'sports', 'yule']

#import json
#import codecs
import re
from math import log2
import time

time1 = time.time()

vocabulary = set()

'''
Insert the functions defined above here
'''

if __name__ == "__main__":
    # Category list (adjust to your own category names)
    category_name_li = category_l

    # Load the documents (adjust the root directory to your own data)
    all_text = getDocuments(train_path, category_name_li)
    print("all_text len = ", len(all_text))
    # Use a with-block so the file is flushed and closed
    with open("all_text.txt", 'w', encoding='utf-8') as f1:
        f1.write(str(all_text))

    # Build the vocabulary
    getVocabulary(all_text)
    print("vocabulary len = ", len(vocabulary))
    with open("vocabulary.txt", 'w', encoding='utf-8') as f2:
        f2.write(str(vocabulary))
    # Build the feature word lists
    '''
    print("=" * 20, '\n', "  Chi-square feature selection  \n", "=" * 20)
    feature_select_type = "chi"
    for category_name in category_name_li:
        # Feature extraction; last argument is "chi", "mi" (mutual information) or "freq"
        feature_li = selectFeatures(all_text, category_name, 1000, feature_select_type)
        print(category_name)
        for t, i_uc in feature_li:
            print(t, i_uc)
        print("=" * 10)
    '''
    print("=" * 20, '\n', "  Mutual information feature selection  \n", "=" * 20)
    feature_select_type = "mi"
    for category_name in category_name_li:
        # Feature extraction; last argument is "chi", "mi" (mutual information) or "freq"
        feature_li = selectFeatures(all_text, category_name, 1000, feature_select_type)
        print(category_name)
        # The result_MI/ directory must already exist.
        # encoding='utf-8' was missing in my original run (see section 2).
        with open("result_MI/" + category_name + '.txt', 'w', encoding='utf-8') as f:
            for t, i_uc in feature_li:
                print(t, i_uc)
                f.write(t + '\t' + str(i_uc) + '\n')
        print("=" * 10)
    print("program finished")

'''
    print("=" * 20, '\n', "  Frequency feature selection  \n", "=" * 20)
    feature_select_type = "freq"
    for category_name in category_name_li:
        # Feature extraction; last argument is "chi", "mi" (mutual information) or "freq"
        feature_li = selectFeatures(all_text, category_name, 1000, feature_select_type)
        print(category_name)
        for t, i_uc in feature_li:
            print(t, i_uc)
        print("=" * 10)
'''

time2 = time.time()
print(" run time is " + str(time2 - time1) + 's')

2 Results

MI_result

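So far I have only run part of MI, and I forgot encoding='utf-8' when opening the output files, so the results show up garbled in my editor. Opened in Notepad they look like this:
[Image: screenshot of the MI result file opened in Notepad]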

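The function lets you choose how many top-k features to output. These feature subsets can be filtered further with another feature selection algorithm, or vectorized directly to train a classifier.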
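As one way to picture the "vectorize directly" option, the sketch below turns a document's word set into a binary vector over a fixed feature list. The helper vectorize, the feature lists, and their scores are all invented for illustration; a real pipeline would take the lists from selectFeatures.

```python
def vectorize(word_set, feature_terms):
    """Binary bag-of-words vector over a fixed, ordered feature list."""
    return [1 if t in word_set else 0 for t in feature_terms]


# Pretend these came back from selectFeatures for two categories
# (terms and scores are made up).
top_finance = [("股票", 0.31), ("银行", 0.12)]
top_sports = [("足球", 0.40), ("比赛", 0.22)]

# Merge the per-category lists into one ordered list without duplicates.
feature_terms = list(dict.fromkeys(t for t, _ in top_finance + top_sports))

doc = {"股票", "足球", "利率"}
vec = vectorize(doc, feature_terms)
print(vec)  # [1, 0, 1, 0]
```

Words outside the selected features (here "利率") are simply ignored, which is how feature selection shrinks the input dimension for the classifier.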

3 Reference

Thanks to the column articles by Zhihu user @baiziyu.

Had I found this treasure earlier, I could have avoided many detours.
