Project Overview
Reposted and adapted from 唐国梁Tommy: 10-01 轻松学PyTorch 情感分类_Tensorflow_LSTM实现
Sentiment140 Dataset
- Official site: http://help.sentiment140.com/home
- Training data can be downloaded from two sources:
http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit

Unzipping trainingandtestdata.zip yields two files. Training uses training.1600000.processed.noemoticon.csv, which is about 238.8 MB and contains 1.6 million rows. testdata.manual.2009.06.14.csv contains 498 rows.
The data looks like this:

| polarity | id | date | query | user | text |
|---|---|---|---|---|---|
| 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | TheSpecialOne | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |
The CSV has emoticons removed. Each record has 6 fields:
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx); NO_QUERY if there is no query
4 - the user who tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
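Since the CSV ships without a header row, it can be convenient to attach names to the six columns when loading with pandas. A minimal sketch (the column names are my own labels, and encoding='latin-1' is an assumption about the file's encoding, not something stated above; the rest of this article keeps the default integer column names):

import pandas as pd

col_names = ['polarity', 'id', 'date', 'query', 'user', 'text']  # hypothetical labels
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 header=None, names=col_names, encoding='latin-1')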
代码实现
加载、查看数据
# Load the data
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Read the CSV (pandas' parsing engine defaults to 'c'; here the 'python' engine is used)
file_path = '/Users/luyi/Documents/repos/nlp_repos/10_PyTorch_情感分类_LSTM实现/training.1600000.processed.noemoticon.csv'
dataset = pd.read_csv(file_path, engine='python', header=None)
dataset.shape
# (1600000, 6)

dataset.info()  # DataFrame summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 1600000 non-null int64
1 1 1600000 non-null int64
2 2 1600000 non-null object
3 3 1600000 non-null object
4 4 1600000 non-null object
5 5 1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
dataset.describe()  # summary statistics of the numeric columns
|  | 0 | 1 |
|---|---|---|
| count | 1.600000e+06 | 1.600000e+06 |
| mean | 2.000000e+00 | 1.998818e+09 |
| std | 2.000001e+00 | 1.935761e+08 |
| min | 0.000000e+00 | 1.467810e+09 |
| 25% | 0.000000e+00 | 1.956916e+09 |
| 50% | 2.000000e+00 | 2.002102e+09 |
| 75% | 4.000000e+00 | 2.177059e+09 |
| max | 4.000000e+00 | 2.329206e+09 |
dataset.columns  # column names
# Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

dataset.head()  # first 5 rows by default
|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | TheSpecialOne | @switchfoot http://twitpic.com/2y1zl - Awww, t… |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by … |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man… |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all… |
# Count the number of examples per class
dataset[0].value_counts()
'''
4 800000
0 800000
Name: 0, dtype: int64
'''
dataset['sentiment_category'] = dataset[0].astype('category')  # cast to a categorical variable
dataset['sentiment_category'].value_counts()  # count per category
4 800000
0 800000
Name: sentiment_category, dtype: int64
dataset['sentiment'] = dataset['sentiment_category'].cat.codes  # recode the category values (0 and 4) as 0 and 1
dataset.head()
|  | 0 | 1 | 2 | 3 | 4 | 5 | sentiment_category | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | TheSpecialOne | @switchfoot http://twitpic.com/2y1zl - Awww, t… | 0 | 0 |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by … | 0 | 0 |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man… | 0 | 0 |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | 0 | 0 |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all… | 0 | 0 |
dataset['sentiment'].value_counts()  # class counts after recoding
'''
1 800000
0 800000
Name: sentiment, dtype: int64
'''
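cat.codes assigns integer codes in sorted category order, so polarity 0 becomes code 0 and polarity 4 becomes code 1. A quick cross-tabulation (a minimal sketch) confirms the mapping:

# rows: original polarity, columns: recoded label
pd.crosstab(dataset[0], dataset['sentiment'])
# expected: all 800000 rows with polarity 0 map to sentiment 0,
# and all 800000 rows with polarity 4 map to sentiment 1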
Splitting into Training, Validation, and Test Sets
dataset.to_csv('training-processed.csv', header=None, index=None)  # save the processed file
# Randomly sample 10,000 rows as a test file
dataset.sample(10000).to_csv("test_sample.csv", header=None, index=None)

# Define the label and text fields
# (note: Field/LabelField/TabularDataset are the legacy torchtext API;
#  in torchtext >= 0.9 they live under torchtext.legacy.data)
from torchtext import data

LABEL = data.LabelField()       # label
TWEET = data.Field(lower=True)  # tweet text, lowercased

# Map the CSV columns to fields (None = drop the column)
fields = [('score', None), ('id', None), ('date', None), ('query', None),
          ('name', None), ('tweet', TWEET), ('category', None), ('label', LABEL)]

# Load the dataset
twitterDataset = data.TabularDataset(
    path='training-processed.csv',
    format='CSV',
    fields=fields,
    skip_header=False
)
# Split into train, test, and validation sets, stratified by label
train, test, val = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1], stratified=True, strata_field='label')
len(train) # 1280000
len(test) # 160000
len(val) # 160000
# Show one example
vars(train.examples[11])
{'tweet': ['@monica2112',
'oh',
"don't",
'worry,',
'i',
"don't",
'mind',
'if',
'you',
'are.',
"i'm",
'just',
'happy',
'u',
'want',
'to',
'meet',
'me!'],
'label': '1'}
Building the Vocabulary
vocab_size = 20000
TWEET.build_vocab(train, max_size=vocab_size)
LABEL.build_vocab(train)

# Check the vocabulary size
len(TWEET.vocab)  # the two extra entries are <unk> (unknown word) and <pad> (padding)
# 20002

# The most common words in the vocabulary
TWEET.vocab.freqs.most_common(10)
[('i', 597446),
('to', 447324),
('the', 415058),
('a', 300964),
('my', 250409),
('and', 236538),
('you', 190004),
('is', 184795),
('for', 171218),
('in', 167840)]
TWEET.vocab.itos[:10]  # index --> word
# ['<unk>', '<pad>', 'i', 'to', 'the', 'a', 'my', 'and', 'you', 'is']

TWEET.vocab.stoi  # word --> index
defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f1cbe4a2520>>,
{'<unk>': 0,
'<pad>': 1,
'i': 2,
'to': 3,
'the': 4,
'a': 5,
'my': 6,
'and': 7,
'you': 8,
'is': 9,
'for': 10,
'in': 11,
...
'taken': 998,
'now...': 999,
...})
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

# Batch the text, i.e. read the data one batch at a time
train_iter, val_iter, test_iter = data.BucketIterator.splits((train, val, test),
                                                             batch_size=32,
                                                             device=device,
                                                             sort_within_batch=True,
                                                             sort_key=lambda x: len(x.tweet))

With sort_within_batch=True, the examples within a batch are sorted in descending order by sort_key. sort_key defines the sorting rule; here it is the length of the tweet, i.e. the number of tokens in each tweet, which groups sequences of similar length together and keeps padding per batch small.
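To see what the iterator yields, one batch can be pulled out and inspected (a minimal sketch; the sequence length varies from batch to batch):

batch = next(iter(train_iter))
batch.tweet.shape  # e.g. torch.Size([26, 32]) --> sequence length x batch size
batch.label.shape  # torch.Size([32])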
Building the Model
import torch.nn as nn

class simple_LSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(simple_LSTM, self).__init__()  # call the parent constructor
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # vocab_size: vocabulary size; embedding_dim: embedding dimension
        self.encoder = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=1)
        self.predictor = nn.Linear(hidden_size, 2)  # fully connected output layer

    def forward(self, seq):
        output, (hidden, cell) = self.encoder(self.embedding(seq))
        # output : torch.Size([24, 32, 100])
        # hidden : torch.Size([1, 32, 100])
        # cell   : torch.Size([1, 32, 100])
        preds = self.predictor(hidden.squeeze(0))  # classify from the LSTM's final hidden state
        return preds

lstm_model = simple_LSTM(hidden_size=100, embedding_dim=300, vocab_size=20002)
lstm_model.to(device)

simple_LSTM(
  (embedding): Embedding(20002, 300)
  (encoder): LSTM(300, 100)
  (predictor): Linear(in_features=100, out_features=2, bias=True)
)

Note that the classifier reads only the final hidden state (hidden.squeeze(0)), not the full output sequence, since the last hidden state summarizes the whole tweet.
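A quick sanity check of the output shape with a dummy batch of token indices (a sketch; 24 is an arbitrary sequence length):

dummy = torch.randint(0, 20002, (24, 32)).to(device)  # [seq_len, batch_size]
lstm_model(dummy).shape
# torch.Size([32, 2])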
Defining the Training, Validation, and Test Loop
from torch import optim

# Optimizer
optimizer = optim.Adam(lstm_model.parameters(), lr=0.001)
# Loss function: cross entropy (two classes here: negative, positive)
criterion = nn.CrossEntropyLoss()

def train_val_test(model, optimizer, criterion, train_iter, val_iter, test_iter, epochs):
    for epoch in range(1, epochs + 1):
        train_loss = 0.0  # training loss
        val_loss = 0.0    # validation loss
        model.train()     # switch to training mode
        for indices, batch in enumerate(train_iter):
            optimizer.zero_grad()  # reset the gradients
            outputs = model(batch.tweet)  # outputs shape : torch.Size([32, 2])
            # batch.label shape : torch.Size([32])
            loss = criterion(outputs, batch.label)  # compute the loss
            loss.backward()   # backpropagate
            optimizer.step()  # update the parameters
            # batch.tweet shape : torch.Size([26, 32]) --> 26: sequence length, 32: batch size
            train_loss += loss.data.item() * batch.tweet.size(0)  # accumulate the loss, scaled by sequence length
        train_loss /= len(train_iter)  # average over batches; len(train_iter) : 40000
        print("Epoch : {}, Train Loss : {:.2f}".format(epoch, train_loss))

        model.eval()  # switch to evaluation mode
        for indices, batch in enumerate(val_iter):
            context = batch.tweet.to(device)  # move to the device
            target = batch.label.to(device)
            pred = model(context)  # model prediction
            loss = criterion(pred, target)  # compute the loss; len(val_iter) : 5000
            val_loss += loss.item() * context.size(0)  # accumulate the loss, scaled by sequence length
        val_loss /= len(val_iter)  # average over batches
        print("Epoch : {}, Val Loss : {:.2f}".format(epoch, val_loss))

        model.eval()
        correct = 0.0    # number of correct predictions
        test_loss = 0.0  # test loss
        with torch.no_grad():  # no gradient computation
            for idx, batch in enumerate(test_iter):
                context = batch.tweet.to(device)  # move to the device
                target = batch.label.to(device)
                outputs = model(context)
                loss = criterion(outputs, target)  # compute the loss
                test_loss += loss.item() * context.size(0)  # accumulate the loss, scaled by sequence length
                # the index of the largest logit is the predicted class
                preds = outputs.argmax(1)
                # count the correct predictions
                correct += preds.eq(target.view_as(preds)).sum().item()
        test_loss /= len(test_iter)  # average over batches; len(test_iter) : 5000
        print("Epoch : {}, Test Loss : {:.2f}".format(epoch, test_loss))
        print("Accuracy : {}".format(100 * correct / (len(test_iter) * batch.tweet.size(1))))
Training and Validation
# Start training and evaluation
train_val_test(lstm_model, optimizer, criterion, train_iter, val_iter, test_iter, epochs=5)
Epoch : 1, Train Loss : 5.95
Epoch : 1, Val Loss : 5.58
Epoch : 1, Test Loss : 5.57
Accuracy : 81.628125
Epoch : 2, Train Loss : 5.36
Epoch : 2, Val Loss : 5.47
Epoch : 2, Test Loss : 5.48
Accuracy : 82.045
Epoch : 3, Train Loss : 5.11
Epoch : 3, Val Loss : 5.47
Epoch : 3, Test Loss : 5.48
Accuracy : 82.185625
Epoch : 4, Train Loss : 4.92
Epoch : 4, Val Loss : 5.51
Epoch : 4, Test Loss : 5.51
Accuracy : 82.220625
Epoch : 5, Train Loss : 4.77
Epoch : 5, Val Loss : 5.51
Epoch : 5, Test Loss : 5.53
Accuracy : 82.275
Extra: Text Data Augmentation

- random insertion
- random deletion
- random swap
- back translation

Reference paper: EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

A sketch of the first three operations follows; back translation is demonstrated after it.
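A minimal sketch of random deletion, swap, and insertion on a token list (simplified: the insertion here duplicates an existing word, whereas the EDA paper inserts a synonym):

import random

def random_deletion(words, p=0.1):
    # drop each word with probability p, but never return an empty list
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n=1):
    # swap two randomly chosen positions, n times (assumes at least 2 tokens)
    words = list(words)
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_insertion(words, n=1):
    # insert a copy of a random word at a random position
    words = list(words)
    for _ in range(n):
        words.insert(random.randrange(len(words) + 1), random.choice(words))
    return words

random_swap(['stay', 'hungry', 'stay', 'foolish'])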
Back translation example: English --> Chinese --> English
This requires installing google_trans_new:

pip install google_trans_new

from google_trans_new import google_translator

translator = google_translator()
sentence = ['stay hungry, stay foolish. -- spoken / said by Steve Jobs']

# English --> Chinese
translation_cn = translator.translate(sentence, lang_tgt='zh-cn')
# "['保持饥饿,保持愚蠢。 -史蒂夫·乔布斯(Steve Jobs)说的话/ "

# Chinese --> English
translation_en = translator.translate(translation_cn, lang_tgt='en')
# "['stay Hungry Stay Foolish. -What Steve Jobs said/ "
Back Translation via a Randomly Chosen Language

import random
import google_trans_new

languages = list(google_trans_new.LANGUAGES.keys())
len(languages)  # number of supported languages
# 108

object_lang = random.choice(languages)  # e.g. 'hu'

# forward translation
translations = translator.translate(sentence, lang_tgt=object_lang)
translations
# "['maradj éhes, maradj őrült. - Steve Jobs mondta / mondta "

# back translation
back_trans = translator.translate(translations, lang_tgt='en')
back_trans
# "['stay hungry, stay crazy. - Steve Jobs said "
伊织 2023-02-22