A Basic Text Classification Demo

solumin

Category: Machine Learning Experiments

Article link: https://blog.csdn.net/solumin/article/details/90481582

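This post walks through the basic steps of text classification: reading a stop-word list, segmenting text with jieba, and then preprocessing the data and extracting features.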

Reading the stop-word list

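N = 4774

def stop_words():
    # Read the stop-word file (GBK-encoded, one word per line)
    stopwords_list = []
    # Decode while reading; str objects have no .decode() in Python 3
    with open('stop_words_ch.txt', 'r', encoding='gbk') as stop_words_file:
        for line in stop_words_file:
            # Drop the trailing line break
            stopwords_list.append(line.rstrip('\r\n'))
    return stopwords_list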
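
As a small illustration of the step above, here is a sketch that loads stop words into a set, which makes the later membership tests cheaper than scanning a list. The helper name `load_stop_words`, the sample file, and its three words are assumptions made for this example, not part of the original code.

```python
import os
import tempfile

def load_stop_words(path, encoding="gbk"):
    """Return the stop words in `path` as a set, one word per line."""
    with open(path, "r", encoding=encoding) as f:
        # Skip blank lines and strip surrounding whitespace
        return {line.strip() for line in f if line.strip()}

# Write a tiny GBK-encoded sample file so the sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), "stop_words_sample.txt")
with open(sample_path, "w", encoding="gbk") as f:
    f.write("的\n了\n是\n")

stop_set = load_stop_words(sample_path)
print(sorted(stop_set))
```

With the real file, the call would presumably be `load_stop_words('stop_words_ch.txt')`, assuming the file really is GBK-encoded as the original `decode('gbk')` suggests.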

Splitting the text with jieba segmentation

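import jieba

def jieba_fenci(raw, stopwords_list):
    # A set gives fast membership tests for the stop words
    stopwords = set(stopwords_list)
    # Segment in accurate mode and build a new list; removing items from a list
    # while iterating over it skips elements, and remove('\n') fails if '\n' is absent
    word_list = [word for word in jieba.cut(raw, cut_all=False)
                 if word not in stopwords and word != '\n']
    # word_set is used to count A[nClass]
    word_set = set(word_list)
    return word_list, word_set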
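
One pitfall in stop-word filtering deserves a short demonstration: calling `remove` on a list while looping over that same list skips the element right after each removal. The sketch below uses a hand-written token list instead of real jieba output, and the function names `filter_in_place` and `filter_copy` are hypothetical.

```python
def filter_in_place(words, stop):
    """Buggy pattern: removing while iterating skips the next element."""
    words = list(words)
    for w in words:
        if w in stop:
            words.remove(w)
    return words

def filter_copy(words, stop):
    """Safe pattern: build a new list with the stop words left out."""
    return [w for w in words if w not in stop]

tokens = ["的", "了", "文本", "分类", "\n"]
stop = {"的", "了", "\n"}

buggy = filter_in_place(tokens, stop)  # "了" survives: it was skipped
clean = filter_copy(tokens, stop)
print(buggy)
print(clean)
```

In the buggy version, removing "的" shifts "了" into the slot the loop has already passed, so it is never checked.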

Data preprocessing

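def process_file(train_path, test_path):
    '''
    Process every file in the sample set and return the resulting variables.
    :param train_path: path to the training samples
    :param test_path: path to the test samples
    :return: A: the A value from the CHI formula, a nested dict recording how many documents
                in a class contain the word t. The first level has 9 keys, one per news category;
                the second level maps each word in that class to the number of documents
                containing it (not its occurrence count),
                e.g. {1: {'hello': 8, 'hai': 7}, 2: {'apple': 8}}
             TFIDF: used to compute the TF-IDF weights, a three-level nested dict. The first-level
                key is the category, as in A; the second-level key is the file name (the file
                number 0-99 is used here); the third level maps each word to its number of
                occurrences in that file.
             train_set: training samples, split from the test samples at a 7:3 ratio.
                Triples of (word list of the document, category, file number)
             test_set: test samples. Triples of (word list of the document, category, file number)
    '''
    stopwords_list = stop_words()
    # Records the A value from the CHI formula
    A = {}
    tf = []
    i = 0
    # Storage for the training/test sets
    count = [0] * 11
    train_set = []
    test_set = []
    with open(train_path, 'r') as f:
        for line in f:
            tf.append({})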
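
The data structures described in the docstring can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper names `build_counts` and `chi_square` and the three sample documents are invented here, and the chi-square expression is the usual 2x2 contingency form, which the source does not spell out.

```python
from collections import Counter, defaultdict

def build_counts(docs):
    """docs: iterable of (word_list, label, doc_id) triples."""
    A = defaultdict(Counter)  # A[label][word] = docs of that class containing word
    tf = defaultdict(dict)    # tf[label][doc_id] = Counter of word occurrences
    for words, label, doc_id in docs:
        tf[label][doc_id] = Counter(words)
        A[label].update(set(words))  # a document counts each word at most once
    return A, tf

def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table (assumed form)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

docs = [
    (["apple", "pie", "apple"], 1, 0),
    (["apple", "juice"], 1, 1),
    (["stock", "market"], 2, 0),
]
A, tf = build_counts(docs)

# Contingency counts for the word "apple" and class 1
n_docs = {label: len(files) for label, files in tf.items()}
a = A[1]["apple"]                                 # class-1 docs containing the word
b = sum(A[k]["apple"] for k in A if k != 1)       # other-class docs containing it
c = n_docs[1] - a                                 # class-1 docs without the word
d = sum(n_docs[k] for k in n_docs if k != 1) - b  # other-class docs without it
score = chi_square(a, b, c, d)
print(a, b, c, d, score)
```

Passing `set(words)` to `update` is what makes `A` a document count rather than an occurrence count, matching the distinction the docstring draws between `A` and the per-file term counts.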
