第四节情感分析

最新推荐文章于 2023-09-27 13:01:02 发布

zsp1727988028

最新推荐文章于 2023-09-27 13:01:02 发布

阅读量119

点赞数

本文链接：https://blog.csdn.net/zsp1727988028/article/details/118540447

版权

本教程通过PyTorch和TorchText介绍情感分析，包括数据预处理、Word Averaging、RNN/LSTM、CNN模型的构建，并使用IMDb数据集进行训练和测试。

摘要由CSDN通过智能技术生成

学习目标

学习和训练文本分类模型
学习torchtext的基本使用方法
- BucketIterator
学习torch.nn的一些基本模型
- Conv2d

本notebook参考了https://github.com/bentrevett/pytorch-sentiment-analysis

在这份notebook中，我们会用PyTorch模型和TorchText再来做情感分析(检测一段文字的情感是正面的还是负面的)。我们会使用IMDb 数据集，即电影评论。

模型从简单到复杂，我们会依次构建：

Word Averaging模型
RNN/LSTM模型
CNN模型

准备数据

TorchText中的一个重要概念是Field。Field决定了你的数据会被怎样处理。在我们的情感分类任务中，我们所需要接触到的数据有文本字符串和两种情感，"pos"或者"neg"。
Field的参数制定了数据会被怎样处理。
我们使用TEXT field来定义如何处理电影评论，使用LABEL field来处理两个情感类别。
我们的TEXT field带有tokenize='spacy'，这表示我们会用spaCy tokenizer来tokenize英文句子。如果我们不特别声明tokenize这个参数，那么默认的分词方法是使用空格。
安装spaCy
LABEL由LabelField定义。这是一种特别的用来处理label的Field。我们后面会解释dtype。
更多关于Fields，参见https://github.com/pytorch/text/blob/master/torchtext/data/field.py
和之前一样，我们会设定random seeds使实验可以复现。

In [114]:

import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

TorchText支持很多常见的自然语言处理数据集。
下面的代码会自动下载IMDb数据集，然后分成train/test两个torchtext.datasets类别。数据被前面的Fields处理。IMDb数据集一共有50000电影评论，每个评论都被标注为正面的或负面的。

In [115]:

from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

查看每个数据split有多少条数据。

In [116]:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000

查看一个example。

In [117]:

print(vars(train_data.examples[0]))

{'text': ['Brilliant', 'adaptation', 'of', 'the', 'novel', 'that', 'made', 'famous', 'the', 'relatives', 'of', 'Chilean', 'President', 'Salvador', 'Allende', 'killed', '.', 'In', 'the', 'environment', 'of', 'a', 'large', 'estate', 'that', 'arises', 'from', 'the', 'ruins', ',', 'becoming', 'a', 'force', 'to', 'abuse', 'and', 'exploitation', 'of', 'outrage', ',', 'a', 'luxury', 'estate', 'for', 'the', 'benefit', 'of', 'the', 'upstart', 'Esteban', 'Trueba', 'and', 'his', 'undeserved', 'family', ',', 'the', 'brilliant', 'Danish', 'director', 'Bille', 'August', 'recreates', ',', 'in', 'micro', ',', 'which', 'at', 'the', 'time', 'would', 'be', 'the', 'process', 'leading', 'to', 'the', 'greatest', 'infamy', 'of', 'his', 'story', 'to', 'the', 'hardened', 'Chilean', 'nation', ',', 'and', 'whose', 'main', 'character', 'would', 'Augusto', 'Pinochet', '(', 'Stephen', 'similarities', 'with', 'it', 'are', 'inevitable', ':', 'recall', ',', 'as', 'an', 'example', ',', 'that', 'image', 'of', 'the', 'senator', 'with', 'dark', 'glasses', 'that', 'makes', 'him', 'the', 'wink', 'to', 'the', 'general', 'to', 'begin', 'making', 'the', 'palace).<br', '/><br', '/>Bille', 'August', 'attends', 'an', 'exceptional', 'cast', 'in', 'the', 'Jeremy', 'protruding', 'Irons', ',', 'whose', 'character', 'changes', 'from', 'arrogance', 'and', 'extreme', 'cruelty', ',', 'the', 'hard', 'lesson', 'that', 'life', 'always', 'brings', 'us', 'to', 'almost', 'force', 'us', 'to', 'change', '.', 'In', 'Esteban', 'fully', 'applies', 'the', 'law', 'of', 'resonance', ',', 'with', 'great', 'wisdom', ',', 'Solomon', 'describes', 'in', 'these', 'words:"The', 'things', 'that', 'freckles', 'are', 'the', 'same', 'punishment', 'that', 'will', 'serve', 'you', '.', '"', '<', 'br', '/><br', '/>Unforgettable', 'Glenn', 'Close', 'playing', 'splint', ',', 'the', 'tainted', 'sister', 'of', 'Stephen', ',', 'whose', 'sin', ',', 'driven', 'by', 'loneliness', ',', 'spiritual', 'and', 'platonic', 'love', 'was', 'the', 'wife', 'of', 'his', 'cruel', 'snowy', 'brother', '.', 'Meryl', 'Streep', 'also', 'brilliant', ',', 'a', 'woman', 'whose', 'name', 'came', 'to', 'him', 'like', 'a', 'glove', 'Clara', '.', 'With', 'telekinetic', 'powers', ',', 'cognitive', 'and', 'mediumistic', ',', 'this', 'hardened', 'woman', ',', 'loyal', 'to', 'his', 'blunt', ',', 'conservative', 'husband', ',', 'is', 'an', 'indicator', 'of', 'character', 'and', 'self', '-', 'control', 'that', 'we', 'wish', 'for', 'ourselves', 'and', 'for', 'all', 'human', 'beings', '.', '<', 'br', '/><br', '/>Every', 'character', 'is', 'a', 'portrait', 'of', 'virtuosity', '(', 'as', 'Blanca', 'worthy', 'rebel', 'leader', 'Pedro', 'Segundo', 'unhappy', '...', ')', 'or', 'a', 'portrait', 'of', 'humiliation', ',', 'like', 'Stephen', 'Jr.', ',', 'the', 'bastard', 'child', 'of', 'Senator', ',', 'who', 'serves', 'as', 'an', 'instrument', 'for', 'the', 'return', 'of', 'the', 'boomerang', '.', '<', 'br', '/><br', '/>The', 'film', 'moves', 'the', 'bowels', ',', 'we', 'recreated', 'some', 'facts', 'that', 'should', 'not', 'ever', 'be', 'repeated', ',', 'but', 'that', 'absurdly', 'still', 'happen', '(', 'Colombia', 'is', 'a', 'sad', 'example', ')', 'and', 'another', 'reminder', 'that', ',', 'against', 'all', ',', 'life', 'is', 'wonderful', 'because', 'there', 'are', 'always', 'people', 'like', 'Isabel', 'Allende', 'and', 'immortalize', 'just', 'Bille', 'August', '.'], 'label': 'pos'}

由于我们现在只有train/test这两个分类，所以我们需要创建一个新的validation set。我们可以使用.split()创建新的分类。
默认的数据分割是 70、30，如果我们声明split_ratio，可以改变split之间的比例，split_ratio=0.8表示80%的数据是训练集，20%是验证集。
我们还声明random_state这个参数，确保我们每次分割的数据集都是一样的。

In [118]:

import random
train_data, valid_data = train_data.split(random_state=random.seed(SEED))

检查一下现在每个部分有多少条数据。

In [119]:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000

下一步我们需要创建 vocabulary 。vocabulary 就是把每个单词一一映射到一个数字。
我们使用最常见的25k个单词来构建我们的单词表，用max_size这个参数可以做到这一点。
所有其他的单词都用<unk>来表示。

In [120]:

# TEXT.build_vocab(train_data, max_size=25000)
# LABEL.build_vocab(train_data)
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)

In [121]:

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2

当我们把句子传进模型的时候，我们是按照一个个 batch 穿进去的，也就是说，我们一次传入了好几个句子，而且每个batch中的句子必须是相同的长度。为了确保句子的长度相同，TorchText会把短的句子pad到和最长的句子等长。
下面我们来看看训练数据集中最常见的单词。

In [122]:

print(TEXT.vocab.freqs.most_common(20))

[('the', 201455), (',', 192552), ('.', 164402), ('a', 108963), ('and', 108649), ('of', 100010), ('to', 92873), ('is', 76046), ('in', 60904), ('I', 54486), ('it', 53405), ('that', 49155), ('"', 43890), ("'s", 43151), ('this', 42454), ('-', 36769), ('/><br', 35511), ('was', 34990), ('as', 30324), ('with', 29691)]

我们可以直接用 stoi(string to int) 或者 itos (int to string) 来查看我们的单词表。

In [123]:

print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']

查看labels。

In [124]:

print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7fbec39a79d8>, {'neg': 0, 'pos': 1})

最后一步数据的准备是创建iterators。每个itartion都会返回一个batch的examples。
我们会使用BucketIterator。BucketIterator会把长度差不多的句子放到同一个batch中，确保每个batch中不出现太多的padding。
严格来说，我们这份notebook中的模型代码都有一个问题，也就是我们把<pad>也当做了模型的输入进行训练。更好的做法是在模型中把由<pad>产生的输出给消除掉。在这节课中我们简单处理，直接把<pad>也用作模型输入了。由于<pad>数量不多，模型的效果也不差。
如果我们有GPU，还可以指定每个iteration返回的tensor都在GPU上。

In [125]:

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE,
    device=device)

zsp1727988028

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
第四节情感分析

学习目标学习和训练文本分类模型学习torchtext的基本使用方法 BucketIterator 学习torch.nn的一些基本模型 Conv2d 本notebook参考了https://github.com/bentrevett/pytorch-sentiment-analysis在这份notebook中，我们会用PyTorch模型和TorchText再来做情感分析(检测一段文字的情感是正面的还是负面的)。我们会使用IMDb 数据集，即电影评论。模型从简单到复杂，我...
复制链接

扫一扫