FlyAI小课堂:【文本分类-中文】textCNN

摘要: 随着深度学习的快速发展,人们创建了一整套神经网络结构来解决各种各样的任务和问题。在英文分类基础上,中文文本分类的处理也同样重要...

人工智能学习离不开实践的验证,推荐大家可以多在FlyAI-AI竞赛服务平台多参加训练和竞赛,以此来提升自己的能力。FlyAI是为AI开发者提供数据竞赛并支持GPU离线训练的一站式服务平台。每周免费提供项目开源算法样例,支持算法能力变现以及快速的迭代算法模型。

目录

  1. 概述
  2. 数据集合
  3. 代码
  4. 结果展示

一、概述

在英文分类的基础上,再看看中文分类的,是一种10分类问题(体育,科技,游戏,财经,房产,家居等)的处理。

二、数据集合

数据集为新闻,总共有四个数据文件,在/data/cnews目录下,包括内容如下图所示测试集,训练集和验证集,和单词表(最后的单词表cnews.vocab.txt可以不要,因为训练可以自动产生)。数据格式:前面为类别,后面为描述内容。

训练数据地址:链接: https://pan.baidu.com/s/1ZHh98RrjQpG5Tm-yq73vBQ 提取码:2r04

其中训练集的格式:

vocab.txt的格式:每个字一行,其中前面加上PAD。

三、代码

3.1 数据采集cnews_loader.py

# coding: utf-8
    import sys
    from collections import Counter
    import numpy as np
    import tensorflow.contrib.keras as kr

    if sys.version_info[0] > 2:
        is_py3 = True
    else:
        reload(sys)
        sys.setdefaultencoding("utf-8")
        is_py3 = False

    def native_word(word, encoding='utf-8'):
        """如果在python2下面使用python3训练的模型,可考虑调用此函数转化一下字符编码"""
        if not is_py3:
            return word.encode(encoding)
        else:
            return word

    def native_content(content):
        if not is_py3:
            return content.decode('utf-8')
        else:
            return content

    def open_file(filename, mode='r'):
        """
        常用文件操作,可在python2和python3间切换.
        mode: 'r' or 'w' for read or write
        """
        if is_py3:
            return open(filename, mode, encoding='utf-8', errors='ignore')
        else:
            return open(filename, mode)

    def read_file(filename):
        """读取文件数据"""
        contents, labels = [], []
        with open_file(filename) as f:
            for line in f:
                try:
                    label, content = line.strip().split('\t')
                    if content:
                        contents.append(list(native_content(content)))
                        labels.append(native_content(label))
                except:
                    pass
        return contents, labels

    def build_vocab(train_dir, vocab_dir, vocab_size=5000):
        """根据训练集构建词汇表,存储"""
        data_train, _ = read_file(train_dir)
        all_data = []
        for content in data_train:
            all_data.extend(content)
        counter = Counter(all_data)
        count_pairs = counter.most_common(vocab_size - 1)
        words, _ = list(zip(*count_pairs))
        # 添加一个 <PAD> 来将所有文本pad为同一长度
        words = ['<PAD>'] + list(words)
        open_file(vocab_dir, mode='w').write('\n'.join(words) + '\n')

    def read_vocab(vocab_dir):
        """读取词汇表"""
        # words = open_file(vocab_dir).read().strip().split('\n')
        with open_file(vocab_dir) as fp:
            # 如果是py2 则每个值都转化为unicode
            words = [native_content(_.strip()) for _ in fp.readlines()]
        word_to_id = dict(zip(words, range(len(words))))
        return words, word_to_id

    def read_category():
        """读取分类目录,固定"""
        categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']
        categories = [native_content(x) for x in categories]
        cat_to_id = dict(zip(categories, range(len(categories))))
        return categories, cat_to_id

    def to_words(content, words):
        """将id表示的内容转换为文字"""
        return ''.join(words[x] for x in content)

    def process_file(filename, word_to_id, cat_to_id, max_length=600):
        """将文件转换为id表示"""
        contents, labels = read_file(filename)

        data_id, label_id = [], []
        for i in range(len(contents)):
            data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
            label_id.append(cat_to_id[labels[i]])
        # 使用keras提供的pad_sequences来将文本pad为固定长度
        x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
        y_pad = kr.utils.to_categorical(label_id, num_classes=len(cat_to_id))  # 将标签转换为one-hot表示
        return x_pad, y_pad

    def batch_iter(x, y, batch_size=64):
        """生成批次数据"""
        data_len = len(x)
        num_batch = int((data_len - 1) / batch_size) + 1
        indices = np.random.permutation(np.arange(data_len))
        x_shuffle = x[indices]
        y_shuffle = y[indices]

        for i in range(num_batch):
            start_id = i * batch_size
            end_id = min((i + 1) * batch_size, data_len)
            yield x_shuffle[start_id:end_id], y_shuffle[start_id:end_id]

3.2 模型搭建cnn_model.py

定义训练的参数,TextCNN()模型

# coding: utf-8
    import tensorflow as tf
    class TCNNConfig(object):
        """CNN配置参数"""
        embedding_dim = 64  # 词向量维度
        seq_length = 600  # 序列长度
        num_classes = 10  # 类别数
        num_filters = 256  # 卷积核数目
        kernel_size = 5  # 卷积核尺寸
        vocab_size = 5000  # 词汇表达小
        hidden_dim = 128  # 全连接层神经元
        dropout_keep_prob = 0.5  # dropout保留比例
        learning_rate = 1e-3  # 学习率
        batch_size = 64  # 每批训练大小
        num_epochs = 10  # 总迭代轮次
        print_per_batch = 100  # 每多少轮输出一次结果
        save_per_batch = 10  # 每多少轮存入tensorboard

    class TextCNN(object):
        """文本分类,CNN模型"""
        def __init__(self, config):
            self.config = config
            # 三个待输入的数据
            self.input_x = tf.placeholder(tf.int32, [None, self.config.seq_length], name='input_x')
            self.input_y = tf.placeholder(tf.float32, [None, self.config.num_classes], name='input_y')
            self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')
            self.cnn()

        def cnn(self):
            """CNN模型"""
            # 词向量映射
            with tf.device('/cpu:0'):
                embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
                embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)

            with tf.name_scope("cnn"):
                # CNN layer
                conv = tf.layers.conv1d(embedding_inputs, self.config.num_filters, self.config.kernel_size, name='conv')
                # global max pooling layer
                gmp = tf.reduce_max(conv, reduction_indices=[1], name='gmp')

            with tf.name_scope("score"):
                # 全连接层,后面接dropout以及relu激活
                fc = tf.layers.dense(gmp, self.config.hidden_dim, name='fc1')
                fc = tf.contrib.layers.dropout(fc, self.keep_prob)
                fc = tf.nn.relu(fc)
                # 分类器
                self.logits = tf.layers.dense(fc, self.config.num_classes, name='fc2')
                self.y_pred_cls = tf.argmax(tf.nn.softmax(self.logits), 1)  # 预测类别

            with tf.name_scope("optimize"):
                # 损失函数,交叉熵
                cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
                self.loss = tf.reduce_mean(cross_entropy)
                # 优化器
                self.optim = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate).minimize(self.loss)

            with tf.name_scope("accuracy"):
                # 准确率
                correct_pred = tf.equal(tf.argmax(self.input_y, 1), self.y_pred_cls)
                self.acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

3.3 运行代码run_cnn.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import sys
import time
from datetime import timedelta
import numpy as np
import tensorflow as tf
from sklearn import metrics
from cnn_model import TCNNConfig, TextCNN
from  cnews_loader import read_vocab, read_category, batch_iter, process_file, build_vocab

base_dir = '../data/cnews'
train_dir = os.path.join(base_dir, 'cnews.train.txt')
test_dir = os.path.join(base_dir, 'cnews.test.txt')
val_dir = os.path.join(base_dir, 'cnews.val.txt')
vocab_dir = os.path.join(base_dir, 'cnews.vocab.txt')
save_dir = 'checkpoints/textcnn'
save_path = os.path.join(save_dir, 'best_validation')  # 最佳验证结果保存路径

def get_time_dif(start_time):
    """获取已使用时间"""
    end_time = time.time()
    time_dif = end_time - start_time
    return timedelta(seconds=int(round(time_dif)))

def feed_data(x_batch, y_batch, keep_prob):
    feed_dict = {
        model.input_x: x_batch,
        model.input_y: y_batch,
        model.keep_prob: keep_prob
    }
    return feed_dict

def evaluate(sess, x_, y_):
    """评估在某一数据上的准确率和损失"""
    data_len = len(x_)
    batch_eval = batch_iter(x_, y_, 128)
    total_loss = 0.0
    total_acc = 0.0
    for x_batch, y_batch in batch_eval:
        batch_len = len(x_batch)
        feed_dict = feed_data(x_batch, y_batch, 1.0)
        loss, acc = sess.run([model.loss, model.acc], feed_dict=feed_dict)
        total_loss += loss * batch_len
        total_acc += acc * batch_len
    return total_loss / data_len, total_acc / data_len

def train():
    print("Configuring TensorBoard and Saver...")
    # 配置 Tensorboard,重新训练时,请将tensorboard文件夹删除,不然图会覆盖
    tensorboard_dir = '../tensorboard/textcnn'
    if not os.path.exists(tensorboard_dir):
        os.makedirs(tensorboard_dir)
    tf.summary.scalar("loss", model.loss)
    tf.summary.scalar("accuracy", model.acc)
    merged_summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(tensorboard_dir)

    # 配置 Saver
    saver = tf.train.Saver()
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    print("Loading training and validation data...")
    # 载入训练集与验证集
    start_time = time.time()
    x_train, y_train = process_file(train_dir, word_to_id, cat_to_id, config.seq_length)
    x_val, y_val = process_file(val_dir, word_to_id, cat_to_id, config.seq_length)
    time_dif = get_time_dif(start_time)
    print("Time usage:", time_dif)

    # 创建session
    session = tf.Session()
    session.run(tf.global_variables_initializer())
    writer.add_graph(session.graph)

    print('Training and evaluating...')
    start_time = time.time()
    total_batch = 0  # 总批次
    best_acc_val = 0.0  # 最佳验证集准确率
    last_improved = 0  # 记录上一次提升批次
    require_improvement = 1000  # 如果超过1000轮未提升,提前结束训练

    flag = False
    for epoch in range(config.num_epochs):
        print('Epoch:', epoch + 1)
        batch_train = batch_iter(x_train, y_train, config.batch_size)
        for x_batch, y_batch in batch_train:
            feed_dict = feed_data(x_batch, y_batch, config.dropout_keep_prob)
            #print("x_batch is {}".format(x_batch.shape))
            if total_batch % config.save_per_batch == 0:
                # 每多少轮次将训练结果写入tensorboard scalar
                s = session.run(merged_summary, feed_dict=feed_dict)
                writer.add_summary(s, total_batch)
            if total_batch % config.print_per_batch == 0:
                # 每多少轮次输出在训练集和验证集上的性能
                feed_dict[model.keep_prob] = 1.0
                loss_train, acc_train = session.run([model.loss, model.acc], feed_dict=feed_dict)
                loss_val, acc_val = evaluate(session, x_val, y_val)  # todo
                if acc_val > best_acc_val:
                    # 保存最好结果
                    best_acc_val = acc_val
                    last_improved = total_batch
                    saver.save(sess=session, save_path=save_path)
                    improved_str = '*'
                else:
                    improved_str = ''
                time_dif = get_time_dif(start_time)
                msg = 'Iter: {0:>6}, Train Loss: {1:>6.2}, Train Acc: {2:>7.2%},' \
                      + ' Val Loss: {3:>6.2}, Val Acc: {4:>7.2%}, Time: {5} {6}'
                print(msg.format(total_batch, loss_train, acc_train, loss_val, acc_val, time_dif, improved_str))

            session.run(model.optim, feed_dict=feed_dict)  # 运行优化
            total_batch += 1

            if total_batch - last_improved > require_improvement:
                # 验证集正确率长期不提升,提前结束训练
                print("No optimization for a long time, auto-stopping...")
                flag = True
                break  # 跳出循环
        if flag:  # 同上
            break

def test():
    print("Loading test data...")
    start_time = time.time()
    x_test, y_test = process_file(test_dir, word_to_id, cat_to_id, config.seq_length)

    session = tf.Session()
    session.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    saver.restore(sess=session, save_path=save_path)  # 读取保存的模型

    print('Testing...')
    loss_test, acc_test = evaluate(session, x_test, y_test)
    msg = 'Test Loss: {0:>6.2}, Test Acc: {1:>7.2%}'
    print(msg.format(loss_test, acc_test))

    batch_size = 128
    data_len = len(x_test)
    num_batch = int((data_len - 1) / batch_size) + 1

    y_test_cls = np.argmax(y_test, 1)
    y_pred_cls = np.zeros(shape=len(x_test), dtype=np.int32)  # 保存预测结果
    for i in range(num_batch):  # 逐批次处理
        start_id = i * batch_size
        end_id = min((i + 1) * batch_size, data_len)
        feed_dict = {
            model.input_x: x_test[start_id:end_id],
            model.keep_prob: 1.0
        }
        y_pred_cls[start_id:end_id] = session.run(model.y_pred_cls, feed_dict=feed_dict)

    # 评估
    print("Precision, Recall and F1-Score...")
    print(metrics.classification_report(y_test_cls, y_pred_cls, target_names=categories))

    # 混淆矩阵
    print("Confusion Matrix...")
    cm = metrics.confusion_matrix(y_test_cls, y_pred_cls)
    print(cm)

    time_dif = get_time_dif(start_time)
    print("Time usage:" , time_dif)

if __name__ == '__main__':

    config = TCNNConfig()
    if not os.path.exists(vocab_dir):  # 如果不存在词汇表,重建,这里存在,因此不用重建
        build_vocab(train_dir, vocab_dir, config.vocab_size)
    categories, cat_to_id = read_category()
    words, word_to_id = read_vocab(vocab_dir)
    config.vocab_size = len(words)
    model = TextCNN(config)
    option='train'
    if option == 'train':
        train()
    else:
        test()

3.4 预测predict.py

# coding: utf-8
    from __future__ import print_function
    import os
    import tensorflow as tf
    import tensorflow.contrib.keras as kr
    from cnn_model import TCNNConfig, TextCNN
    from cnews_loader import read_category, read_vocab
    try:
        bool(type(unicode))
    except NameError:
        unicode = str

    base_dir = '../data/cnews'
    vocab_dir = os.path.join(base_dir, 'cnews.vocab.txt')
    save_dir = '../checkpoints/textcnn'
    save_path = os.path.join(save_dir, 'best_validation')  # 最佳验证结果保存路径

    class CnnModel:
        def __init__(self):
            self.config = TCNNConfig()
            self.categories, self.cat_to_id = read_category()
            self.words, self.word_to_id = read_vocab(vocab_dir)
            self.config.vocab_size = len(self.words)
            self.model = TextCNN(self.config)
            self.session = tf.Session()
            self.session.run(tf.global_variables_initializer())
            saver = tf.train.Saver()
            saver.restore(sess=self.session, save_path=save_path)  # 读取保存的模型

        def predict(self, message):
            # 支持不论在python2还是python3下训练的模型都可以在2或者3的环境下运行
            content = unicode(message)
            data = [self.word_to_id[x] for x in content if x in self.word_to_id]

            feed_dict = {
                self.model.input_x: kr.preprocessing.sequence.pad_sequences([data], self.config.seq_length),
                self.model.keep_prob: 1.0
            }

            y_pred_cls = self.session.run(self.model.y_pred_cls, feed_dict=feed_dict)
            return self.categories[y_pred_cls[0]]

    if __name__ == '__main__':
        cnn_model = CnnModel()
        test_demo = ['三星ST550以全新的拍摄方式超越了以往任何一款数码相机',
                     '热火vs骑士前瞻:皇帝回乡二番战 东部次席唾手可得新浪体育讯北京时间3月30日7:00']
        for i in test_demo:
            print(cnn_model.predict(i))

四、结果展示

   

 

相关代码可见:https://github.com/yifanhunter/textClassifier_chinese


 

有关于‘文本分类’的竞赛项目,大家可移步官网进行查看和参赛!

更多精彩内容请访问FlyAI-AI竞赛服务平台;为AI开发者提供数据竞赛并支持GPU离线训练的一站式服务平台;每周免费提供项目开源算法样例,支持算法能力变现以及快速的迭代算法模型。

挑战者,都在FlyAI!!!

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值