Table of Contents
- Transformers for Sentiment Analysis
- 1. Set the random seed
- 2. Load the transformer and tokenize
- 3. Set up the special tokens
- 4. Define the maximum sentence length for training
- 5. Define a tokenization function
- 6. Define the fields
- 7. Load the data
- 8. Build a vocabulary for the labels (numericalization)
- 9. Create the iterators + enable the GPU
- 10. Load the pre-trained BERT model + build the classifier
- 11. Instantiate the model
- 12. Count the parameters
- 13. Freeze the transformer so it is not trained
- 14. Build the optimizer and loss function
- 15. Define the accuracy function
- 16. Define the training function
- 17. Define the evaluation function
- 18. Define a timing function
- 19. Train the model
- 20. Test performance
- 21. Make actual predictions
Transformers for Sentiment Analysis
Here we will use the BERT model (Bidirectional Encoder Representations from Transformers).
We will use a pre-trained transformer as the embedding layer. Instead of training its internal parameters, we will freeze the transformer
and only train the rest of the model. Here the rest of the model is a multi-layer bi-directional GRU.
1. Set the random seed
import torch
import random
import numpy as np
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
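If you are running on a GPU and want full reproducibility, you can additionally seed all CUDA random number generators (optional, not in the original snippet):
torch.cuda.manual_seed_all(SEED)  # seed every CUDA device as well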
2. Load the transformer and tokenize
The transformer has been trained with a specific vocabulary, which means we need to use exactly the same vocabulary and tokenize our data the same way the transformer does.
For that we use the tokenizer that ships with the transformer. Here we use the uncased BERT model (the tokenizer automatically lowercases the text).
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Check how many words the tokenizer's vocabulary contains:
len(tokenizer.vocab)
30522
As an example, let's tokenize a sentence:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')
print(tokens)
['hello', 'world', 'how', 'are', 'you', '?']
Convert the tokens to indexes:
indexes = tokenizer.convert_tokens_to_ids(tokens)
print(indexes)
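To double-check the mapping, you can convert the indexes straight back to tokens (an optional, illustrative round trip):
print(tokenizer.convert_ids_to_tokens(indexes))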
3. Set up the special tokens
The transformer adds special tokens to the sentences: [CLS] at the start, [SEP] at the end, [PAD] for padding and [UNK] for unknown tokens.
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token
print(init_token, eos_token, pad_token, unk_token)
[CLS] [SEP] [PAD] [UNK]
Get the ids of these special tokens:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id
print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)
101 102 0 100
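These ids can also be looked up through the vocabulary itself; the following is an equivalent alternative:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)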
4. Define the maximum sentence length for training
Here we set it to the maximum sequence length the BERT model can handle by default.
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)
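For bert-base-uncased this should print 512. Depending on your version of transformers, the same limit may also be exposed as tokenizer.model_max_length (an alternative lookup, not used below):
print(tokenizer.model_max_length)  # should also give 512 in recent versions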
5. Define a tokenization function
Previous notebooks used the spaCy tokenizer. Now we need to define a function to pass into the TEXT field that tokenizes each sentence and also cuts the number of tokens down to the maximum length. Note that our maximum length is 2 less than the actual maximum length, because we add one special token to the start and one to the end of every sequence.
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    return tokens
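A quick, optional check that the function behaves like the tokenizer call shown earlier (illustrative only):
print(tokenize_and_cut('Hello WORLD how ARE yoU?'))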
6. Define the fields
from torchtext import data

TEXT = data.Field(batch_first = True,  # the transformer expects the batch dimension first
                  use_vocab = False,  # we use the transformer's own vocabulary, so tell torchtext not to build one
                  tokenize = tokenize_and_cut,  # the tokenization function defined above
                  preprocessing = tokenizer.convert_tokens_to_ids,  # convert the tokens to ids after tokenization
                  init_token = init_token_idx,  # the special tokens defined above
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)
7. Load the data
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")
Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
Check one of the examples to make sure the text has already been tokenized and numericalized:
print(vars(train_data.examples[6]))
{'text': [5949, 1997, 2026, 2166, 1010, 1012, 1012, 1012, 1012, 1996, 2472, 2323, 2022, 10339, 1012, 2339, 2111, 2514, 2027, 2342, 2000, 2191, 22692, 5691, 2097, 2196, 2191, 3168, 2000, 2033, 1012, 2043, 2016, 2351, 2012, 1996, 2203, 1010, 2009, 2081, 2033, 4756, 1012, 1045, 2018, 2000, 2689, 1996, 3149, 2116, 2335, 2802, 1996, 2143, 2138, 1045, 2001, 2893, 10339, 3666, 2107, 3532, 3772, 1012, 11504, 1996, 3124, 2040, 2209, 9895, 2196, 4152, 2147, 2153, 1012, 2006, 2327, 1997, 2008, 1045, 3246, 1996, 2472, 2196, 4152, 2000, 2191, 2178, 2143, 1010, 1998, 2038, 2010, 3477, 5403, 3600, 2579, 2067, 2005, 2023, 10231, 1012, 1063, 1012, 6185, 2041, 1997, 2184, 1065], 'label': 'neg'}
We can convert these ids back into readable tokens to take a look:
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[6])['text'])
print(tokens)
8. Build a vocabulary for the labels (numericalization)
Although the vocabulary for the text is already handled by the transformer, we still need to build a vocabulary for the labels, which maps 'neg' and 'pos' to integer ids.
LABEL.build_vocab(train_data)
print(LABEL.vocab.stoi)
9. Create the iterators + enable the GPU
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
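As an optional sanity check (not part of the original pipeline), you can pull a single batch and confirm that, with batch_first = True, its text tensor has shape [batch size, sent len]:
batch = next(iter(train_iterator))
print(batch.text.shape)  # expected: [BATCH_SIZE, sent len]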
10. Load the pre-trained BERT model + build the classifier
Load the pre-trained BERT model:
from transformers import BertTokenizer, BertModel
bert = BertModel.from_pretrained('bert-base-uncased')
Define the classification model
We use the pre-trained transformer model in place of a conventional embedding layer.
These embeddings are then fed into a GRU (Gated Recurrent Unit), a type of RNN similar to an LSTM, to predict the sentiment.
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']  # get the embedding dimension from bert.config
        self.rnn = nn.GRU(embedding_dim,  # input size
                          hidden_dim,
                          num_layers = n_layers,  # number of stacked layers
                          bidirectional = bidirectional,  # whether the GRU is bidirectional
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)  # dropout only applies between stacked layers
        # linear output layer on top of the GRU's final hidden state
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        #text = [batch size, sent len]
        with torch.no_grad():  # wrap the transformer in no_grad so no gradients are computed for it
            embedded = self.bert(text)[0]
        #embedded = [batch size, sent len, emb dim]
        _, hidden = self.rnn(embedded)
        #hidden = [n layers * n directions, batch size, hid dim]
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        #hidden = [batch size, hid dim]
        output = self.out(hidden)
        #output = [batch size, out dim]
        return output
11. Instantiate the model
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)
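As a minimal, optional sanity check (illustrative only, using random token ids rather than real data), you can confirm the output shape before training:
dummy = torch.randint(0, len(tokenizer.vocab), (2, 10))  # fake batch: 2 sequences of 10 token ids
print(model(dummy).shape)  # expected: torch.Size([2, 1])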
12. Count the parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 112,241,409 trainable parameters
About 110 million of these are the transformer's parameters, which we do not want to train.
13. Freeze the transformer so it is not trained
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False
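An equivalent way to do this (a minimal alternative sketch) is to iterate over the bert submodule directly:
for param in model.bert.parameters():
    param.requires_grad = False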
Count the trainable parameters again:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 2,759,169 trainable parameters
Check which parameters will actually be trained, to make sure none of them belong to bert:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
14. Build the optimizer and loss function
import torch.optim as optim
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
15. Define the accuracy function
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  #convert into float for division
    acc = correct.sum() / len(correct)
    return acc
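A quick, illustrative check with dummy logits and labels (not taken from the dataset):
dummy_preds = torch.tensor([2.0, -1.0, 0.5, -3.0])  # sigmoid + rounding gives [1, 0, 1, 0]
dummy_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(binary_accuracy(dummy_preds, dummy_labels))  # tensor(0.7500)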
16. Define the training function
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
17. Define the evaluation function
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
18. Define a timing function
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
19. Train the model
N_EPOCHS = 5
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
20. Test performance
model.load_state_dict(torch.load('tut6-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
21. Make actual predictions
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()
predict_sentiment(model, tokenizer, "This film is terrible")
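The returned value is close to 0 for a negative review and close to 1 for a positive one (assuming the label vocabulary built above maps 'neg' to 0 and 'pos' to 1). A second, illustrative call with a positive sentence:
predict_sentiment(model, tokenizer, "This film is great")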