Natural Language Processing (2): News Text Classification with a CNN

dayday学习

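Column: Natural Language Processing. Tags: Natural Language Processing (2), CNN, character-level Chinese text classification, movie review text classification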

Article link: https://blog.csdn.net/weixin_41781408/article/details/88082213

1. Task 1: Dataset exploration

1.1 Downloading the datasets

Datasets: one Chinese and one English.
Chinese dataset: THUCNews
THUCNews subset: https://pan.baidu.com/s/1hugrfRu (password: qfud)
English dataset: IMDB (sentiment analysis)
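CNN-based text classification is already a well-studied problem. For sentence classification with CNNs, see the paper Convolutional Neural Networks for Sentence Classification.
For character-level CNNs, see the paper Character-level Convolutional Networks for Text Classification.
Open-source implementations are also available online, for example Denny Britz's well-known blog post Implementing a CNN for Text Classification in TensorFlow, which is built on an early version of TensorFlow.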

1.2 Dataset description

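This article uses a subset of the THUCNews news classification dataset provided by the Tsinghua NLP group (the full dataset contains about 740,000 documents and takes a long time to train on).
Training uses 10 of its categories with 6,500 articles each, for 65,000 news articles in total.
The categories are: 体育 (sports), 财经 (finance), 房产 (real estate), 家居 (home), 教育 (education), 科技 (technology), 时尚 (fashion), 时政 (politics), 游戏 (games), 娱乐 (entertainment).
The data is split as follows: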

Training set: 5000*10
Validation set: 500*10
Test set: 1000*10
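
As a quick sanity check on the split, the per-category counts can be multiplied out and summed. This is only an arithmetic sketch; the variable names are illustrative and do not come from the original project.

```python
# Per-category article counts for each split, as listed above.
per_category = {"train": 5000, "val": 500, "test": 1000}
num_categories = 10

# Size of each split across all categories.
split_sizes = {name: n * num_categories for name, n in per_category.items()}
total = sum(split_sizes.values())

print(split_sizes)  # {'train': 50000, 'val': 5000, 'test': 10000}
print(total)        # 65000
```

Each category contributes 5000 + 500 + 1000 = 6500 articles, which matches the 6,500 files copied per category.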

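To see how the subset is generated from the original dataset, refer to the two scripts under helper. copy_data.sh copies 6,500 files from each category, and cnews_group.py merges the individual files into a single file. Running cnews_group.py produces three data files: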

cnews.train.txt: training set (50,000 articles)
cnews.val.txt: validation set (5,000 articles)
cnews.test.txt: test set (10,000 articles)
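
The merging step can be sketched roughly as follows. This is not the actual cnews_group.py: the function group_files(), the one-directory-per-category layout, and the demo file names are assumptions made for illustration. The only thing taken from the article is the output format, one "label<TAB>content" line per article, which is what read_file() in cnews_loader.py expects.

```python
import os
import tempfile


def group_files(category_dir, label, out_file):
    """Append every file in category_dir to out_file as 'label<TAB>content'."""
    count = 0
    for name in sorted(os.listdir(category_dir)):
        path = os.path.join(category_dir, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            # Remove newlines and tabs so each article fits on one line
            # and the tab separator stays unambiguous.
            content = f.read().replace("\n", "").replace("\t", "")
        if content:
            out_file.write(label + "\t" + content + "\n")
            count += 1
    return count


# Tiny demo on throwaway files (hypothetical layout: one directory per category).
with tempfile.TemporaryDirectory() as root:
    sports_dir = os.path.join(root, "体育")
    os.makedirs(sports_dir)
    for i, text in enumerate(["第一篇\n新闻", "第二篇\t新闻"]):
        with open(os.path.join(sports_dir, "%d.txt" % i), "w", encoding="utf-8") as f:
            f.write(text)

    merged_path = os.path.join(root, "cnews.demo.txt")
    with open(merged_path, "w", encoding="utf-8") as out:
        written = group_files(sports_dir, "体育", out)

    with open(merged_path, encoding="utf-8") as f:
        merged_lines = f.read().splitlines()

print(written, merged_lines)
```

Storing one article per line keeps the loader simple: a single split on the tab character recovers the label and the text.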

1.3 Data preprocessing

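data/cnews_loader.py handles data preprocessing.
read_file(): reads the data from a file;
build_vocab(): builds the vocabulary using a character-level representation and saves it to disk, so the work is not repeated on every run;
read_vocab(): reads the vocabulary saved in the previous step and converts it to a {word: id} mapping;
read_category(): fixes the list of categories and converts it to a {category: id} mapping;
to_words(): converts a sample represented as ids back into text;
process_file(): converts the dataset from text into fixed-length id sequences;
batch_iter(): prepares shuffled batches of data for training the neural network.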
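
To make the character-level pipeline concrete, here is a minimal, self-contained sketch of the vocabulary, encoding, and decoding steps. It is an illustration under stated assumptions rather than the code from cnews_loader.py: build_vocab() and to_words() borrow their names from the list above but take simplified arguments, encode() is a hypothetical stand-in for the per-sample work of process_file(), and dropping unknown characters, left-padding, and keeping the tail of long texts are choices made for this example.

```python
from collections import Counter

PAD = "<PAD>"  # reserved padding token; it always gets id 0 in this sketch


def build_vocab(texts, vocab_size):
    """Keep the (vocab_size - 1) most frequent characters, plus PAD at id 0."""
    counter = Counter(ch for text in texts for ch in text)
    words = [PAD] + [ch for ch, _ in counter.most_common(vocab_size - 1)]
    word_to_id = {w: i for i, w in enumerate(words)}
    return words, word_to_id


def encode(text, word_to_id, max_length):
    """Map characters to ids, drop unknown characters, then fix the length."""
    ids = [word_to_id[ch] for ch in text if ch in word_to_id]
    if len(ids) > max_length:
        ids = ids[len(ids) - max_length:]        # keep the tail of long texts
    return [0] * (max_length - len(ids)) + ids   # left-pad with the PAD id


def to_words(ids, words):
    """Turn a sequence of ids back into text, skipping padding."""
    return "".join(words[i] for i in ids if i != 0)


texts = ["体育新闻", "财经新闻"]
words, word_to_id = build_vocab(texts, vocab_size=5)
encoded = encode("体育新闻", word_to_id, max_length=6)
decoded = to_words(encoded, words)
print(encoded, decoded)
```

With vocab_size=5 only the four most frequent characters survive, so the characters 财 and 经 fall outside the vocabulary and are dropped when the second text is encoded.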

The source of cnews_loader.py is listed below.

# coding: utf-8

import sys
from collections import Counter

import numpy as np
# tf.contrib exists only in TensorFlow 1.x; it was removed in TensorFlow 2.x.
import tensorflow.contrib.keras as kr

if sys.version_info[0] > 2:
    is_py3 = True
else:
    reload(sys)
    sys.setdefaultencoding("utf-8")
    is_py3 = False


def native_word(word, encoding='utf-8'):
    """When a model trained under Python 3 is used from Python 2, call this to convert the string encoding."""
    if not is_py3:
        return word.encode(encoding)
    else:
        return word


def native_content(content):
    if not is_py3:
        return content.decode('utf-8')
    else:
        return content


def open_file(filename, mode='r'):
    """
    Common file-opening helper that works under both Python 2 and Python 3.
    mode: 'r' or 'w' for read or write
    """
    if is_py3:
        return open(filename, mode, encoding='utf-8', errors='ignore')
    else:
        return open(filename, mode)


def read_file(filename):
    """Read a data file and return (contents, labels)."""
    contents, labels = [], []
    with open_file(filename) as f:
        for line in f:
            try:
                label, content = line.strip().split('\t')
                if content:
                    contents.append(list(native_content(content)))
                    labels.append(native_content(label))
            except ValueError:
                # Skip lines that are not exactly "label<TAB>content".
                pass
    return contents, labels


def build_vocab(train_dir, vocab_dir, vocab_size=5000):
    """Build the vocabulary from the training set and save it to vocab_dir."""
    data_train, _ = read_file(train_dir)

    all_data = []
    for content in data_train:
        all_data.extend(content)

    counter = Counter(all_data)
    count_pairs = counter.most_common(vocab_size - 1)
    words, _ = list(zip(*count_pairs))
    # Add a <PAD> token so that all texts can be padded to the same length
    words = ['<PAD>'] + list(words)
    # Use a context manager so the file is flushed and closed
    with open_file(vocab_dir, mode='w') as f:
        f.write('\n'.join(words) + '\n')


def read_vocab(vocab_dir):
    """Read the saved vocabulary."""
    # words = open_file(vocab_dir).read().strip().split('\n')
    with open_file(vocab_dir) as fp:
        # Under Python 2, convert every entry to unicode
        words = [native_content(_.strip()) for _ in fp.readlines()]
    word_to_id = dict(zip(words, range(len(words))))
    return words, word_to_id


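def read_category():
    """Return the fixed category list and its {category: id} mapping."""
    categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']

    categories = [native_content(x) for x in categories]

    cat_to_id = dict(zip(categories, range(len(categories))))

    # Assumed return value (the source listing is cut off on the line above),
    # mirroring read_vocab().
    return categories, cat_to_id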
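
The listing stops before batch_iter(), which section 1.3 describes as preparing shuffled batches for training. The following is a minimal sketch of that idea using only the standard library. It is an assumption about the general approach, not the original function: the seed parameter is added here purely so the example is reproducible, and plain lists stand in for the arrays a real training loop would use.

```python
import random


def batch_iter(x, y, batch_size=64, seed=None):
    """Yield (x_batch, y_batch) pairs covering one shuffled pass over the data."""
    indices = list(range(len(x)))
    # Shuffle indices rather than the data, so inputs and labels stay aligned.
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [x[i] for i in batch], [y[i] for i in batch]


# Toy data: sample i is [i, i] and carries label i.
x = [[i, i] for i in range(10)]
y = list(range(10))

batches = list(batch_iter(x, y, batch_size=4, seed=0))
print([len(x_batch) for x_batch, _ in batches])  # [4, 4, 2]
```

The last batch is smaller when the dataset size is not a multiple of batch_size. Calling batch_iter() again at the start of each epoch gives a fresh ordering as long as seed is left as None.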
