Data
The data is borrowed from this post on a 4-class text classification problem: https://blog.csdn.net/qq_28626909/article/details/80382029
Code references
The torchtext preprocessing follows the posts of nlpuser and dendi_hust: https://blog.csdn.net/nlpuser/article/details/88067167 https://blog.csdn.net/dendi_hust/article/details/101221922
The model code follows dendi_hust's post: https://blog.csdn.net/dendi_hust/article/details/94435919
The source post only contains the model code, so I found a dataset, ran the model on it, filled in the rest of the pipeline, and added some comments to make the code easier to follow. The dataset is small, so without pretrained word vectors the model is bound to underfit; the results are for reference only.
Preprocessing
First, concatenate the files into a single DataFrame.
The rest of the preprocessing is done with the torchtext package.
torchtext官方教程:https://torchtext.readthedocs.io/en/latest/index.html
Field: defining the data format
Key parameters:

sequential: whether the data is sequential (e.g. string data); default True
use_vocab: whether to use a Vocab object; if False, the field must already be numeric; default True
tokenize: a function (e.g. str.split, jieba.cut) used to tokenize strings
batch_first: if True, the first dimension of the returned Tensor is the batch size; default False
fix_length: fixed length for this field; if None, examples are padded to the longest one in the batch
stop_words: a list of stop words, removed during the tokenize step

Key functions:

build_vocab: builds the Vocab for this Field
CONTENT = Field(sequential=True, tokenize=jieba.cut, batch_first=True, fix_length=200, stop_words=stopwords)
LABEL = Field(sequential=False, use_vocab=False)
Dataset: building the dataset
Key parameters:

examples: a list of Example objects
fields: a list of (str, Field) tuples, where str is the name of the Field

Key functions:

split(): splits the dataset into train, test, and valid sets. Its split_ratio parameter is a float or a list: a float in [0, 1] gives the proportion of the data that goes into the train set, with the rest going to valid; a list (e.g. [0.7, 0.2, 0.1]) gives the train, test, and valid proportions. The default is 0.7.
# get_dataset builds and returns the examples and fields needed by Dataset
def get_dataset(csv_data, text_field, label_field, test=False):
    # the id column is not used during training, so None is passed as its field
    fields = [("id", None),  # we won't be needing the id, so we pass in None as the field
              ("content", text_field),
              ("label", label_field)]
    examples = []
    if test:
        # for the test set, do not load labels
        for text in tqdm(csv_data['content']):
            examples.append(Example.fromlist([None, text, None], fields))
    else:
        for text, label in tqdm(zip(csv_data['content'], csv_data['label'])):
            examples.append(Example.fromlist([None, text, label], fields))
    return examples, fields
train_examples, train_fields = get_dataset(train_df, CONTENT, LABEL)
test_examples, test_fields = get_dataset(test_df, CONTENT, None, test=True)
# build the Dataset objects
train_dataset = Dataset(train_examples, train_fields)
test_dataset = Dataset(test_examples, test_fields)
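The split() behavior described above can be approximated by hand. Below is a minimal pure-Python sketch of the split_ratio logic; ratio_split is a hypothetical helper written for illustration, not part of torchtext:

```python
import random

def ratio_split(examples, split_ratio=0.7, seed=0):
    """Split a list of examples the way Dataset.split() interprets split_ratio:
    a float gives the train share (the rest goes to valid); a list of three
    floats gives the train/test/valid shares."""
    examples = examples[:]                  # work on a copy
    random.Random(seed).shuffle(examples)   # shuffle before splitting
    n = len(examples)
    if isinstance(split_ratio, float):
        cut = round(n * split_ratio)
        return examples[:cut], examples[cut:]             # train, valid
    train_r, test_r, valid_r = split_ratio
    a = round(n * train_r)
    b = a + round(n * test_r)
    return examples[:a], examples[a:b], examples[b:]      # train, test, valid

# e.g. 100 examples split 0.7 / 0.2 / 0.1
train, test, valid = ratio_split(list(range(100)), [0.7, 0.2, 0.1])
```

With the real torchtext Dataset, calling train_dataset.split(split_ratio=[0.7, 0.2, 0.1]) performs the equivalent partition over Example objects.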
Iterator: building iterators
Generates an Iterator over a Dataset according to batch_size. The common classes are Iterator and BucketIterator. BucketIterator is a subclass of Iterator; unlike Iterator, it batches examples of equal or similar length together (sorted by the sort_key attribute), which minimizes the amount of padding.
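The padding that bucketing saves can be seen with a small pure-PyTorch sketch; the sequence lengths and the pairing into batches below are made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# four variable-length sequences (lengths 2, 9, 3, 10)
seqs = [torch.ones(n, dtype=torch.long) for n in (2, 9, 3, 10)]

def padding_cells(batch):
    # pad_sequence pads every sequence to the longest one in the batch;
    # count how many cells are pure padding
    padded = pad_sequence(batch, batch_first=True)
    return padded.numel() - sum(s.numel() for s in batch)

# arrival order, batched in pairs: short and long sequences share a batch
unsorted = padding_cells(seqs[:2]) + padding_cells(seqs[2:])
# sorted by length first, then batched: similar lengths share a batch
by_len = sorted(seqs, key=len)
bucketed = padding_cells(by_len[:2]) + padding_cells(by_len[2:])
# bucketed < unsorted: sorting by length wastes far fewer padded cells
```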
Key parameters:

dataset: the dataset to build the Iterator over
batch_size: the size of each batch
sort_key: the field used to sort the Examples; default None
shuffle: whether to shuffle between epochs

Key functions:

splits(): builds Iterators for several datasets at once. Its datasets parameter is a tuple of Datasets whose first element should be the training Dataset; batch_sizes is a tuple matching datasets one-to-one, giving the batch size for each dataset.
# build iterators for the training and validation sets at the same time
train_iter, val_iter = BucketIterator.splits(
    (train_dataset, test_dataset),  # the datasets to build iterators over
    batch_sizes=(8, 8),
    device=-1,  # if using a GPU, replace -1 with the GPU id
    sort_key=lambda x: len(x.content),  # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False,
    repeat=False  # we pass repeat=False because we want to wrap this Iterator layer.
)
Vocab: building the vocabulary
Key parameters:

counter: a collections.Counter object holding the frequencies of the data (e.g. words)
vectors: pretrained word vectors, e.g. a torchtext.vocab.Vectors object, or another type
min_freq: the minimum frequency; words occurring fewer than min_freq times are not added to the vocabulary
CONTENT.build_vocab(train_dataset)
A look at a batch encoded with the built vocabulary:
e = list(train_iter)[0]
e.content
tensor([[ 317, 89, 569, ..., 1, 1, 1],
[15317, 673, 24260, ..., 1, 1, 1],
[ 157, 18588, 2441, ..., 1, 1, 1],
...,
[10065, 15698, 120, ..., 1, 1, 1],
[ 2691, 7237, 8676, ..., 1, 1, 1],
[ 390, 1667, 125, ..., 1, 1, 1]])
e.content.shape
torch.Size([8, 50])
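Conceptually, build_vocab counts tokens and assigns them indices, with the specials <unk> and <pad> at indices 0 and 1 (which is why the tensor above is padded with 1s). A rough stdlib sketch of that mapping; build_simple_vocab is a hypothetical stand-in, not torchtext's actual implementation:

```python
from collections import Counter

def build_simple_vocab(tokenized_texts, min_freq=1):
    # count token frequencies over the whole corpus
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # specials first: <unk> -> 0, <pad> -> 1 (torchtext's defaults)
    itos = ['<unk>', '<pad>']
    # most frequent tokens first; drop tokens rarer than min_freq
    for tok, freq in counter.most_common():
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# 'c' appears only once, so min_freq=2 drops it from the vocabulary
stoi, itos = build_simple_vocab([['a', 'b', 'a'], ['b', 'a', 'c']], min_freq=2)
```

The real CONTENT.vocab exposes the same two views as stoi (token to index) and itos (index to token).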
Running the model: BiLSTM + Attention
First, the bidirectional LSTM and the final fully connected layer:
def build_model(self):
    # initialize the embedding layer
    self.char_embeddings = nn.Embedding(self.char_size, self.char_embedding_size)
    # the embeddings are updated during training
    self.char_embeddings.weight.requires_grad = True
    # attention layer
    self.attention_layer = nn.Sequential(
        nn.Linear(self.hidden_dims, self.hidden_dims),
        nn.ReLU(inplace=True)
    )
    # self.attention_weights = self.attention_weights.view(self.hidden_dims, 1)
    # stacked bidirectional LSTM
    self.lstm_net = nn.LSTM(self.char_embedding_size, self.hidden_dims,
                            num_layers=self.rnn_layers, dropout=self.keep_dropout,
                            bidirectional=True)
    # fully connected output layer
    self.fc_out = nn.Sequential(
        # nn.Dropout(self.keep_dropout),
        # nn.Linear(self.hidden_dims, self.hidden_dims),
        # nn.ReLU(inplace=True),
        # nn.Dropout(self.keep_dropout),
        nn.Linear(self.hidden_dims, self.num_classes),
    )
The attention part
For background on attention, see: https://distill.pub/2016/augmented-rnns/
The attention layer is simplified here: the fully connected layer and ReLU are dropped to reduce the number of parameters.
def attention_net(self, lstm_out, lstm_hidden):
    '''
    :param lstm_out: [batch_size, len_seq, n_hidden * 2]
    :param lstm_hidden: [batch_size, num_layers * num_directions, n_hidden]
    :return: [batch_size, n_hidden]
    '''
    # torch.chunk splits a tensor into chunks along a dimension and returns a list of
    # tensors; if the dimension is not evenly divisible, the last chunk is smaller
    lstm_tmp_out = torch.chunk(lstm_out, 2, -1)
    # h: [batch_size, time_step, hidden_dims]
    # add the activations of the two LSTM directions
    h = lstm_tmp_out[0] + lstm_tmp_out[1]
    # sum the two directions of the last hidden state
    # [batch_size, n_hidden]
    lstm_hidden = torch.sum(lstm_hidden, dim=1)
    # [batch_size, 1, n_hidden]: insert a new dimension at position 1
    lstm_hidden = lstm_hidden.unsqueeze(1)
    # atten_score: [batch_size, 1, time_step], the attention scores
    atten_score = torch.bmm(lstm_hidden, h.transpose(1, 2))
    # atten_weight: [batch_size, 1, time_step], softmax-normalized (the attention distribution)
    atten_weight = F.softmax(atten_score, dim=-1)
    # context: [batch_size, 1, n_hidden], the weighted sum of h
    context = torch.bmm(atten_weight, h)
    result = context.squeeze(1)
    return result
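To show where attention_net fits into the overall flow, here is a self-contained toy version of the whole model with an illustrative forward pass. The class name, sizes, and batch_first layout are my assumptions for the sketch, not the author's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttention(nn.Module):
    # toy sizes chosen for illustration only
    def __init__(self, vocab_size=100, embed_size=16, hidden_dims=32,
                 rnn_layers=2, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_dims, num_layers=rnn_layers,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dims, num_classes)

    def attention_net(self, lstm_out, lstm_hidden):
        # split the 2*n_hidden outputs into the two directions and add them
        fwd, bwd = torch.chunk(lstm_out, 2, -1)
        h = fwd + bwd                                    # [B, T, H]
        # sum hidden states over layers/directions to form the query
        q = torch.sum(lstm_hidden, dim=1).unsqueeze(1)   # [B, 1, H]
        score = torch.bmm(q, h.transpose(1, 2))          # [B, 1, T]
        weight = F.softmax(score, dim=-1)                # attention distribution
        return torch.bmm(weight, h).squeeze(1)           # [B, H] context vector

    def forward(self, x):                                # x: [B, T] token ids
        emb = self.embed(x)
        out, (h_n, _) = self.lstm(emb)                   # out: [B, T, 2H]
        # h_n: [num_layers*2, B, H] -> [B, num_layers*2, H], the layout
        # attention_net's docstring expects
        context = self.attention_net(out, h_n.permute(1, 0, 2))
        return self.fc(context)

# a batch shaped like the one shown earlier: 8 examples of 50 token ids
logits = BiLSTMAttention()(torch.randint(0, 100, (8, 50)))
```

The permute on h_n is the one step the snippets above leave implicit: PyTorch's LSTM returns hidden states layer-first, while attention_net expects them batch-first.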
Final results: pretty weak.
Epoch: 1 | time in 0 minutes, 47 seconds
Loss: 1.2363(train) | Acc: 49.8%(train)
Loss: 1.0654(valid) | Acc: 70.5%(valid)
Epoch: 2 | time in 0 minutes, 49 seconds
Loss: 0.9987(train) | Acc: 74.4%(train)
Loss: 0.9925(valid) | Acc: 74.0%(valid)
Epoch: 3 | time in 0 minutes, 48 seconds
Loss: 0.8978(train) | Acc: 84.4%(train)
Loss: 0.9435(valid) | Acc: 80.0%(valid)
Epoch: 4 | time in 0 minutes, 48 seconds
Loss: 0.8553(train) | Acc: 88.3%(train)
Loss: 0.9059(valid) | Acc: 84.5%(valid)
Epoch: 5 | time in 0 minutes, 47 seconds
Loss: 0.8406(train) | Acc: 89.9%(train)
Loss: 0.9185(valid) | Acc: 83.0%(valid)
Full code: https://github.com/DiegoSong/nlp_on_RM/blob/master/text_cat/BiLSTM_attention.ipynb