Task 3: Faster Sentiment Analysis

郑不凡

Category: Sentiment Analysis. Tags: natural language processing, deep learning, neural networks

Article link: https://blog.csdn.net/m0_50896529/article/details/120418541

These are notes from the Datawhale sentiment analysis team-learning course.
GitHub: team-learning-nlp/Emotional_Analysis at master · datawhalechina/team-learning-nlp (github.com)

This article implements the model from the paper Bag of Tricks for Efficient Text Classification (FastText).
Its performance is comparable to that of the improved RNN, but it trains much faster.

1. Data Preprocessing

1.1 n-grams

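What most distinguishes the FastText classifier from other text classification models is that it computes the n-grams of the input sentence and appends them to the end of the token list as extra features, which capture some local word-order information.
The n-gram idea comes from language modeling: slide a window of size N over a sequence (bytes, characters, or, as in this article, tokens) and collect every contiguous fragment of length N.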

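def generate_bigrams(x):
    # Collect every pair of adjacent tokens; the set removes duplicate bigrams
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        # Append each bigram as one space-joined token (this modifies x in place)
        x.append(' '.join(n_gram))
    return x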

generate_bigrams(['This', 'film', 'is', 'terrible'])

# Example output (the order of the bigrams may vary because a set is used):
# ['This', 'film', 'is', 'terrible', 'film is', 'This film', 'is terrible']
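The same sliding-window idea extends to any window size. The following is a minimal sketch, not part of the original notes: `generate_ngrams` is a hypothetical helper that returns a new list instead of modifying its input and keeps the n-grams in sentence order.

```python
def generate_ngrams(tokens, n=2):
    """Return the tokens followed by their word-level n-grams (hypothetical helper)."""
    # Zipping n shifted copies of the list yields every window of n adjacent tokens
    windows = zip(*[tokens[i:] for i in range(n)])
    ngrams = []
    seen = set()
    for window in windows:
        gram = ' '.join(window)
        if gram not in seen:  # de-duplicate, as set() does in generate_bigrams
            seen.add(gram)
            ngrams.append(gram)
    # Build a new list so the caller's token list is left untouched
    return list(tokens) + ngrams


print(generate_ngrams(['This', 'film', 'is', 'terrible'], n=2))
# ['This', 'film', 'is', 'terrible', 'This film', 'film is', 'is terrible']
print(generate_ngrams(['This', 'film', 'is', 'terrible'], n=3))
# ['This', 'film', 'is', 'terrible', 'This film is', 'film is terrible']
```

Keeping the n-grams in a fixed order makes the output reproducible, which is convenient for testing. The model itself averages all vectors, so the order does not affect its predictions.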

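Advantages of n-gram features (these two points refer to the character-level n-grams used in FastText word vectors, whereas generate_bigrams above works on whole words):
1. Rare words get better word vectors, because their character n-grams are shared with other words.
2. Words outside the training vocabulary can still be given a vector by summing the vectors of their character n-grams.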
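To illustrate how character n-grams are shared between words, here is a small sketch. The function name `char_ngrams` is hypothetical, and the `<` and `>` word-boundary markers follow the convention used for FastText subword embeddings. The code in these notes does not use character n-grams; this is only to make the two points above concrete.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two different words can share several n-grams, so a rare word
# reuses parameters that were trained on more frequent words.
shared = set(char_ngrams('where')) & set(char_ngrams('there'))
print(sorted(shared))
# ['ere', 'her', 're>']
```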

1.2 The preprocessing argument of Field

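The TorchText 'Field' class takes a preprocessing argument. The function passed here is applied to a sentence after it has been tokenized (converted from a string into a list of tokens) but before it is numericalized (converted from a list of tokens into a list of indexes). We pass the generate_bigrams function here. (Since we are not using an RNN, we do not need packed padded sequences, so there is no need to set include_lengths=True.)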

import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  preprocessing = generate_bigrams)

LABEL = data.LabelField(dtype = torch.float)
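The order of operations (tokenize, then preprocessing, then numericalize) can be sketched without TorchText. This is a simplified stand-in, not TorchText's implementation: `toy_field_process` and the tiny `vocab` are hypothetical, and whitespace splitting replaces the spaCy tokenizer.

```python
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x


def toy_field_process(sentence, vocab, preprocessing=None):
    # 1. Tokenize: string -> list of tokens (whitespace split as a stand-in)
    tokens = sentence.split()
    # 2. Preprocessing: applied to the token list, before any index lookup
    if preprocessing is not None:
        tokens = preprocessing(tokens)
    # 3. Numericalize: list of tokens -> list of indexes, unknowns map to <unk>
    unk = vocab['<unk>']
    return [vocab.get(token, unk) for token in tokens]


# Toy vocabulary: bigrams are ordinary entries, 'film is' is deliberately missing
vocab = {'<unk>': 0, 'This': 1, 'film': 2, 'is': 3, 'terrible': 4,
         'This film': 5, 'is terrible': 6}
ids = toy_field_process('This film is terrible', vocab, preprocessing=generate_bigrams)
print(ids[:4])          # [1, 2, 3, 4]
print(sorted(ids[4:]))  # [0, 5, 6]
```

Because the bigrams are added before numericalization, each bigram is looked up in the vocabulary like any other token, and a bigram that is not in the vocabulary falls back to the unknown index.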

1.3 Data preprocessing

The remaining steps are the same as in the previous tasks.

import random

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

train_data, valid_data = train_data.split(random_state = random.seed(SEED))

MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

2. Building the Model

2.1 Model architecture

![[Pasted image 20210922152253.png]]

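The FastText model has only three layers: an input layer, a hidden layer, and an output layer.

Input layer: the vector representations of the words and their n-gram features, which together represent a single document
Hidden layer: the average of those vectors
Output layer: the predicted target label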
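The three layers can be traced with a framework-free toy forward pass. This is a sketch under stated assumptions, not the PyTorch model used for training: the function name `fasttext_forward`, the 2-dimensional embedding table, and the weights are all made up for illustration, and the output is a single sigmoid unit because the task here is binary.

```python
import math


def fasttext_forward(token_ids, embedding, weights, bias):
    """Toy forward pass: embed the tokens, average them, apply one linear unit."""
    dim = len(embedding[0])
    # Input layer: look up one vector per token (words and n-grams alike)
    vectors = [embedding[i] for i in token_ids]
    # Hidden layer: element-wise average over all token vectors
    doc_vector = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Output layer: linear unit followed by a sigmoid for a binary label
    logit = sum(w * x for w, x in zip(weights, doc_vector)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return doc_vector, prob


# Hypothetical embedding table with three entries of dimension 2
embedding = [
    [0.0, 0.0],
    [1.0, 3.0],
    [3.0, 1.0],
]
doc_vector, prob = fasttext_forward([1, 2], embedding, weights=[0.5, -0.5], bias=0.0)
print(doc_vector)  # [2.0, 2.0]
print(prob)        # 0.5
```

Since the hidden layer is just an average, the cost of a forward pass grows linearly with the number of tokens and has no sequential dependency, which is one reason the model trains faster than an RNN.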

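On the input side, FastText uses n-gram vectors as additional features (character-level n-grams in the FastText word-vector model; the paper implemented here adds word-level n-grams).
On the output side, FastText uses hierarchical softmax, which greatly reduces training time when the number of classes is large.
The core idea of FastText is to average the vectors of all words and n-grams in a document to obtain a document vector, and then classify that document vector with a softmax. Two tricks are involved: n-gram features and hierarchical softmax.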
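To show why hierarchical softmax is cheaper, the sketch below computes class probabilities as products of binary decisions along a tree path, so each class needs about log2(K) sigmoid evaluations instead of a normalization over all K classes. Everything here is an assumption for illustration: the balanced 4-class tree, the `PATHS` table, and the node weights are hypothetical, whereas the paper builds its tree from Huffman coding. The binary IMDB task in these notes does not need this trick.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical balanced tree over 4 classes:
#        n0
#      /    \
#    n1      n2
#   /  \    /  \
#  c0  c1  c2  c3
# Each class is reached by a path of (internal node, go_left) decisions.
PATHS = {
    0: [(0, True), (1, True)],
    1: [(0, True), (1, False)],
    2: [(0, False), (2, True)],
    3: [(0, False), (2, False)],
}


def class_probability(label, doc_vector, node_weights):
    """Multiply the left/right probabilities along the path to the class leaf."""
    prob = 1.0
    for node, go_left in PATHS[label]:
        p_left = sigmoid(dot(node_weights[node], doc_vector))
        prob *= p_left if go_left else 1.0 - p_left
    return prob


node_weights = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]  # one vector per internal node
doc_vector = [1.0, 2.0]
probs = [class_probability(c, doc_vector, node_weights) for c in range(4)]
print(probs)
print(sum(probs))  # the leaf probabilities sum to 1 (up to rounding)
```

Each internal node splits its probability mass between its two children, so the leaf probabilities form a valid distribution without ever computing a sum over all classes.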