26. Chinese Text Classification with TextCNN

1. Data Preprocessing

1.1 Loading Data in Batches

  • Define DatasetIterater to load the data in batches (implemented in Step 3 below)
    • Initialization function def __init__(), whose parameters are:
      • batches (the training, validation, or test set)
      • batch_size (the size of each training batch)
      • device (the device the program runs on: CPU or GPU)
    • Convert the data to tensors with def _to_tensor(), whose parameter is:
      • datas: the runtime data (one batch of samples)
    • Read the next batch with def __next__()

1.2 Network Structure of the TextCNN Model

  • Input layer, embedding layer, convolutional layer, pooling layer, and fully connected layer (see the shape sketch below)
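
To make the data flow concrete, here is a rough shape trace through those layers. The concrete numbers are illustrative assumptions (batch_size=128, pad_size=32, embed=300, num_filters=256, filter_sizes=(2, 3, 4)) and are not taken from this section:

# token ids            (128, 32)                  padded/truncated character indices
# Embedding            (128, 32, 300)             nn.Embedding lookup
# unsqueeze(1)         (128, 1, 32, 300)          add a channel dimension for Conv2d
# Conv2d(k) + ReLU     (128, 256, 32 - k + 1, 1)  one branch per kernel size k
# squeeze + max_pool1d (128, 256)                 global max pooling over the time dimension
# concat over k        (128, 256 * 3)             pooled vectors from all kernel sizes
# Dropout + Linear     (128, num_classes)         final class scores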

1.3 Code Implementation

  • Step 1: Build the TextCNN model (TextCNN.py)
# coding: UTF-8
import torch
import torch.nn as nn
import torch.nn.functional as F

"""上接配置参数信息"""

class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        self.embedding = nn.Embedding(config.n_vocab, config.embed,
                                      padding_idx=config.n_vocab - 1)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, config.num_filters, (k, config.embed))
             for k in config.filter_sizes])
        self.dropout = nn.Dropout(config.dropout)
        self.fc = nn.Linear(config.num_filters * len(config.filter_sizes), config.num_classes)

    def conv_and_pool(self, x, conv):
        # convolution + ReLU, then global max pooling over the time dimension
        x = F.relu(conv(x)).squeeze(3)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        # x is the (token_ids, seq_len) tuple produced by the data iterator
        out = self.embedding(x[0])
        out = out.unsqueeze(1)  # add a channel dimension for Conv2d
        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)
        out = self.dropout(out)
        out = self.fc(out)
        return out
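
The Model above reads all of its hyperparameters from a Config object defined in the preceding section ("configuration parameters continued from above"). For reference, a minimal stand-in inferred from the attributes the code accesses might look like the sketch below; the concrete values are assumptions, not the original settings:

import torch

class Config(object):
    """Minimal stand-in for the Config used by Model and DatasetIterater (values are illustrative)."""
    def __init__(self):
        self.n_vocab = 10000            # vocabulary size; reset to len(vocab) before building the model
        self.embed = 300                # embedding dimension
        self.num_filters = 256          # feature maps per kernel size
        self.filter_sizes = (2, 3, 4)   # convolution kernel heights
        self.dropout = 0.5              # dropout rate before the classifier
        self.num_classes = 10           # number of target classes
        self.batch_size = 128           # batch size used by DatasetIterater
        self.pad_size = 32              # every sequence is padded/truncated to this length
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')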

  • Step 2: Text vectorization, building the vocabulary (load_data.py)
# coding: UTF-8

from tqdm import tqdm
from unit26.TextCNN import Config

MAX_VOCAB_SIZE = 10000  # upper limit on the vocabulary size
UNK, PAD = '<UNK>', '<PAD>'  # unknown-character and padding symbols

def build_vocab(file_path, tokenizer, max_size, min_freq):
    vocab_dic = {}
    with open(file_path, 'r', encoding='UTF-8') as f:
        for line in tqdm(f):
            lin = line.strip()
            if not lin:
                continue
            content = lin.split('\t')[0]
            for word in tokenizer(content):
                # count how many times each character appears
                vocab_dic[word] = vocab_dic.get(word, 0) + 1
        # keep characters with frequency >= min_freq, sort by frequency in descending order,
        # and cap the vocabulary at max_size entries
        vocab_list = sorted([_ for _ in vocab_dic.items() if _[1] >= min_freq],
                            key=lambda x: x[1], reverse=True)[:max_size]
        # map each character to an index ordered by frequency
        vocab_dic = {word_count[0]: idx for idx, word_count in enumerate(vocab_list)}
        vocab_dic.update({UNK: len(vocab_dic), PAD: len(vocab_dic) + 1})
    return vocab_dic
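
For Chinese text the vocabulary is usually built at the character level, so the tokenizer simply splits each string into characters. A small usage sketch is shown below; the training-file path is a placeholder assumption, and each line of the file is expected to look like "content\tlabel":

if __name__ == "__main__":
    tokenizer = lambda x: [y for y in x]  # character-level tokenizer
    vocab = build_vocab('./data/train.txt', tokenizer=tokenizer,
                        max_size=MAX_VOCAB_SIZE, min_freq=1)
    print('Vocab size:', len(vocab))
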
  • Step 3: Load data in batches (load_data_iter.py)
# coding:utf-8
import torch
from unit26.TextCNN import Config
from unit26.TextCNN import Model
from unit26.load_data import build_dataset

# Load data in batches
class DatasetIterater(object):
    def __init__(self, batches, batch_size, device):
        self.batch_size = batch_size
        self.batches = batches
        self.n_batches = len(batches) // batch_size
        self.residue = False  # whether the last batch is a partial (leftover) batch
        if len(batches) % self.batch_size != 0:
            self.residue = True
        self.index = 0
        self.device = device

    def _to_tensor(self, datas):
        x = torch.LongTensor([_[0] for _ in datas]).to(self.device)
        y = torch.LongTensor([_[1] for _ in datas]).to(self.device)

        # sequence lengths before padding (values longer than pad_size are capped at pad_size)
        seq_len = torch.LongTensor([_[2] for _ in datas]).to(self.device)
        return (x, seq_len), y

    def __next__(self):
        # return the leftover partial batch at the end of the dataset
        if self.residue and self.index == self.n_batches:
            # e.g. 312 samples with batch_size 128: slice [2 * 128 : 312]
            batches = self.batches[self.index * self.batch_size: len(self.batches)]
            self.index += 1
            batches = self._to_tensor(batches)
            return batches
        elif self.index >= self.n_batches:
            self.index = 0
            raise StopIteration
        else:
            # read one full batch; the first batch is the slice [0 : 128]
            batches = self.batches[self.index * self.batch_size: (self.index + 1) * self.batch_size]
            self.index += 1
            batches = self._to_tensor(batches)
            return batches

    def __iter__(self):
        return self

    def __len__(self):
        if self.residue:
            return self.n_batches + 1
        else:
            return self.n_batches

def build_iterator(dataset, config, predict):
    # in prediction mode, process one sample per batch
    if predict is True:
        config.batch_size = 1
    data_iter = DatasetIterater(dataset, config.batch_size, config.device)
    return data_iter
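
From the indexing in _to_tensor (_[0], _[1], _[2]), each element of dataset is expected to be a (token_ids, label, seq_len) triple. A hedged sketch of one such sample, as build_dataset in load_data.py (not shown here) is assumed to produce it:

# One sample in the dataset handed to DatasetIterater (illustrative values):
#   ([14, 125, 55, 45, 35, 307, 4, 4, ...],  # character ids, padded/truncated to pad_size
#    3,                                      # class label index
#    6)                                      # true sequence length before padding
# _to_tensor stacks a list of such samples into the ((x, seq_len), y) batch format.
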
  • Step 4: Main function (load_data_iter.py)
if __name__ == "__main__":
    config = Config()
    print("Loading data...")
    vocab, train_data, dev_data, test_data = build_dataset(config, False)
    # 1. Load the data in batches
    train_iter = build_iterator(train_data, config, False)
    for batch, train in enumerate(train_iter):
        print(batch, train)
        break

    config.n_vocab = len(vocab)
    # 2. Build the model
    model = Model(config).to(config.device)
    print(model.parameters)
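
As a quick sanity check, a batch from train_iter can be fed straight into the model: forward only reads x[0], so the (x, seq_len) pair returned by the iterator is a valid input. A short sketch that could be appended inside the main block above (not part of the original code):

    # feed one batch through the untrained model to verify the output shape
    for trains, labels in train_iter:
        outputs = model(trains)  # shape: (batch_size, num_classes)
        print(outputs.shape, labels.shape)
        break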