NLP - Sentiment Classification on Sentiment140 - LSTM


Project Notes

Adapted from Tang Guoliang (Tommy): 10-01 轻松学PyTorch 情感分类_LSTM实现 (Easy PyTorch, lesson 10-01: sentiment classification with an LSTM)


The Sentiment140 Data

Unzipping trainingandtestdata.zip yields two files. Training uses training.1600000.processed.noemoticon.csv, about 238.8 MB and 1.6 million tweets.
testdata.manual.2009.06.14.csv contains 498 manually labeled tweets.


Sample records look like this:

0, 1467810369, Mon Apr 06 22:19:45 PDT 2009, NO_QUERY, TheSpecialOne, @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0, 1467810672, Mon Apr 06 22:19:49 PDT 2009, NO_QUERY, scotthamilton, is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!
0, 1467810917, Mon Apr 06 22:19:53 PDT 2009, NO_QUERY, mattycus, @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
0, 1467811184, Mon Apr 06 22:19:57 PDT 2009, NO_QUERY, ElleCTF, my whole body feels itchy and like its on fire
0, 1467811193, Mon Apr 06 22:19:57 PDT 2009, NO_QUERY, Karoli, @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.

The data is a CSV with the emoticons removed. Each record has 6 fields:
0 - polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - tweet id (e.g. 2087)
2 - date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
3 - query (e.g. lyx); NO_QUERY if there was no query
4 - user who tweeted (e.g. robotickilldozr)
5 - text of the tweet (e.g. "Lyx is cool")
(A small naming sketch for these fields follows below.)
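
The file itself has no header row. For easier exploration you can attach your own column labels after loading; the names below are my own invention, not part of the dataset:

# Hypothetical, human-readable labels for the 6 fields
col_names = ['polarity', 'id', 'date', 'query', 'user', 'text']
# usage after loading with pandas (see the next section): dataset.columns = col_names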


Implementation

Loading and Inspecting the Data

# Load the data
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

# Read the data (the default engine is 'c'; 'python' is more tolerant)
file_path = '/Users/luyi/Documents/repos/nlp_repos/10_PyTorch_情感分类_LSTM实现/training.1600000.processed.noemoticon.csv'
dataset = pd.read_csv(file_path, engine='python', header=None)
dataset.shape
# (1600000, 6)
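
Note: depending on your environment, read_csv may raise a UnicodeDecodeError here, since the Sentiment140 CSV is commonly reported to be Latin-1 rather than UTF-8 encoded. Passing an explicit encoding is a usual workaround:

dataset = pd.read_csv(file_path, engine='python', header=None, encoding='latin-1')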

dataset.info() # DataFrame summary
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1600000 entries, 0 to 1599999
    Data columns (total 6 columns):
     #   Column  Non-Null Count    Dtype 
    ---  ------  --------------    ----- 
     0   0       1600000 non-null  int64 
     1   1       1600000 non-null  int64 
     2   2       1600000 non-null  object
     3   3       1600000 non-null  object
     4   4       1600000 non-null  object
     5   5       1600000 non-null  object
    dtypes: int64(2), object(4)
    memory usage: 73.2+ MB

dataset.describe() # summary statistics
                  0             1
count  1.600000e+06  1.600000e+06
mean   2.000000e+00  1.998818e+09
std    2.000001e+00  1.935761e+08
min    0.000000e+00  1.467810e+09
25%    0.000000e+00  1.956916e+09
50%    2.000000e+00  2.002102e+09
75%    4.000000e+00  2.177059e+09
max    4.000000e+00  2.329206e+09

dataset.columns # column names
# Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

dataset.head() # first 5 rows by default
   0           1                             2         3              4                                                  5
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  TheSpecialOne  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY  scotthamilton  is upset that he can't update his Facebook by ...
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY  mattycus       @Kenichan I dived many times for the ball. Man...
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  ElleCTF        my whole body feels itchy and like its on fire
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  Karoli         @nationwideclass no, it's not behaving at all...

# Class distribution
dataset[0].value_counts() 
'''
    4    800000
    0    800000
    Name: 0, dtype: int64
'''

dataset['sentiment_category'] = dataset[0].astype('category') # convert to a categorical variable

dataset['sentiment_category'].value_counts() # counts per category
4    800000
0    800000
Name: sentiment_category, dtype: int64

dataset['sentiment'] = dataset['sentiment_category'].cat.codes # map the category values to the two codes 0 and 1
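
As a sanity check, pandas orders the categories ascending, so polarity 0 becomes code 0 and polarity 4 becomes code 1:

dict(enumerate(dataset['sentiment_category'].cat.categories))
# {0: 0, 1: 4}  --> code 0 = polarity 0 (negative), code 1 = polarity 4 (positive)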

dataset.head()
   0           1                             2         3              4                                                  5 sentiment_category  sentiment
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  TheSpecialOne  @switchfoot http://twitpic.com/2y1zl - Awww, t...                  0          0
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY  scotthamilton  is upset that he can't update his Facebook by ...                  0          0
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY  mattycus       @Kenichan I dived many times for the ball. Man...                  0          0
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  ElleCTF        my whole body feels itchy and like its on fire                     0          0
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  Karoli         @nationwideclass no, it's not behaving at all...                   0          0


dataset['sentiment'].value_counts() # class distribution after recoding
'''
    1    800000
    0    800000
    Name: sentiment, dtype: int64
'''

Splitting into Training and Test Sets

dataset.to_csv('training-processed.csv', header=None, index=None) # save to file

# Randomly sample 10,000 rows as a test set
dataset.sample(10000).to_csv("test_sample.csv", header=None, index=None)
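
Note that sample() only copies rows: the 10,000 sampled tweets also remain in training-processed.csv, so the two files overlap. A minimal sketch of a disjoint split (the random_state seed is my own choice):

test_df = dataset.sample(10000, random_state=42)  # fixed seed, hypothetical choice
train_df = dataset.drop(test_df.index)            # remove the sampled rows from the training data
train_df.to_csv('training-processed.csv', header=None, index=None)
test_df.to_csv('test_sample.csv', header=None, index=None)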

# Define the label and text fields
from torchtext import data  # torchtext <= 0.8 API; 0.9+ moved Field/TabularDataset to torchtext.legacy

LABEL = data.LabelField() # label
TWEET = data.Field(lower=True) # tweet text, lowercased and whitespace-tokenized

# Map the 8 CSV columns to fields; columns with a None field are ignored
fields = [('score', None), ('id',None), ('date',None), ('query',None),
          ('name',None),('tweet',TWEET), ('category',None), ('label',LABEL)]

# Load the CSV
twitterDataset = data.TabularDataset(
    path = 'training-processed.csv',
    format = 'CSV',
    fields = fields,
    skip_header = False
)

# Split into train, test, val (stratified on the label so class balance is preserved)
train, test, val = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1], stratified=True, strata_field='label')

len(train) # 1280000
len(test) # 160000
len(val) # 160000

# Show one example
vars(train.examples[11])

    {'tweet': ['@monica2112',
      'oh',
      "don't",
      'worry,',
      'i',
      "don't",
      'mind',
      'if',
      'you',
      'are.',
      "i'm",
      'just',
      'happy',
      'u',
      'want',
      'to',
      'meet',
      'me!'],
     'label': '1'}

Building the Vocabulary

vocab_size = 20000
TWEET.build_vocab(train, max_size=vocab_size)
LABEL.build_vocab(train)

# Vocabulary size
len(TWEET.vocab) # +2 for the special tokens: <unk> (unknown word) and <pad> (padding)
# 20002


# Most common words in the vocabulary
TWEET.vocab.freqs.most_common(10)
    [('i', 597446),
     ('to', 447324),
     ('the', 415058),
     ('a', 300964),
     ('my', 250409),
     ('and', 236538),
     ('you', 190004),
     ('is', 184795),
     ('for', 171218),
     ('in', 167840)]

TWEET.vocab.itos[:10] # index --> word
# ['<unk>', '<pad>', 'i', 'to', 'the', 'a', 'my', 'and', 'you', 'is']

TWEET.vocab.stoi # word --> index
    defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f1cbe4a2520>>,
                {'<unk>': 0,
                 '<pad>': 1,
                 'i': 2,
                 'to': 3,
                 'the': 4,
                 'a': 5,
                 'my': 6,
                 'and': 7,
                 'you': 8,
                 'is': 9,
                 'for': 10,
                 'in': 11, 
                 ...
                 'taken': 998,
                 'now...': 999,
                 ...})
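
Because stoi is a defaultdict, any word missing from the 20,000-word vocabulary falls back to <unk> (index 0). A quick illustration:

tokens = ['i', 'to', 'qwertyuiopasdf']  # the last token is presumably out of vocabulary
ids = [TWEET.vocab.stoi[t] for t in tokens]
print(ids)  # [2, 3, 0] -- the OOV token maps to index 0 (<unk>)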


import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Batch the text data; BucketIterator groups examples of similar length to minimize padding
train_iter, val_iter, test_iter = data.BucketIterator.splits((train, val, test), 
                                                             batch_size=32, 
                                                             device = device,
                                                             sort_within_batch = True,
                                                             sort_key = lambda x : len(x.tweet))
 

With sort_within_batch = True, the examples inside each batch are sorted in descending order according to sort_key.
sort_key defines the sorting rule; here it is the length of the tweet, i.e. the number of tokens in each example.
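
A quick way to peek at one batch and verify the tensor layout (the sequence length varies from batch to batch):

batch = next(iter(train_iter))
print(batch.tweet.shape)  # torch.Size([seq_len, 32]) -- sequence first, batch second
print(batch.label.shape)  # torch.Size([32])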


Building the Model


import torch.nn as nn

class simple_LSTM(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(simple_LSTM, self).__init__() # call the parent constructor
        self.embedding = nn.Embedding(vocab_size, embedding_dim) # vocab_size: vocabulary size, embedding_dim: embedding dimension
        self.encoder = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, num_layers=1)
        self.predictor = nn.Linear(hidden_size, 2) # fully connected output layer (2 classes)

    def forward(self, seq):
        output, (hidden, cell) = self.encoder(self.embedding(seq))
        # output :  torch.Size([24, 32, 100])
        # hidden :  torch.Size([1, 32, 100])
        # cell :  torch.Size([1, 32, 100])
        preds = self.predictor(hidden.squeeze(0)) # classify from the final hidden state
        return preds

lstm_model = simple_LSTM(hidden_size=100, embedding_dim=300, vocab_size=20002)
lstm_model.to(device)

    simple_LSTM(
      (embedding): Embedding(20002, 300)
      (encoder): LSTM(300, 100)
      (predictor): Linear(in_features=100, out_features=2, bias=True)
    )
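
A shape check with random token IDs (the dummy sizes are my own choice) confirms that the model returns one logit pair per example:

dummy = torch.randint(0, 20002, (24, 32)).to(device)  # [seq_len=24, batch_size=32]
print(lstm_model(dummy).shape)                        # torch.Size([32, 2])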

Defining the Training Procedure

from torch import optim

# Optimizer
optimizer = optim.Adam(lstm_model.parameters(), lr=0.001)

# Loss function: cross-entropy over the 2 output classes (negative, positive)
criterion = nn.CrossEntropyLoss()

def train_val_test(model, optimizer, criterion, train_iter, val_iter, test_iter, epochs):
    for epoch in range(1, epochs+1):
        train_loss = 0.0 # training loss
        val_loss = 0.0 # validation loss
        model.train() # switch to training mode
        for indices, batch in enumerate(train_iter):
            optimizer.zero_grad() # zero the gradients
            outputs = model(batch.tweet) # outputs shape :  torch.Size([32, 2])
            # batch.label shape :  torch.Size([32])
            loss = criterion(outputs, batch.label) # compute the loss
            loss.backward() # backpropagate
            optimizer.step() # update the parameters
            # batch.tweet shape :  torch.Size([26, 32]) --> 26: sequence length, 32: batch size
            train_loss += loss.data.item() * batch.tweet.size(0) # accumulate, weighted by sequence length (size(0) is seq_len, not batch size)

        train_loss /= len(train_iter) # average over batches; len(train_iter) :  40000
        print("Epoch : {}, Train Loss : {:.2f}".format(epoch, train_loss))

        model.eval() # switch to evaluation mode
        with torch.no_grad(): # no gradients needed for validation
            for indices, batch in enumerate(val_iter):
                context = batch.tweet.to(device) # move to device
                target = batch.label.to(device)
                pred = model(context) # forward pass
                loss = criterion(pred, target) # compute the loss; len(val_iter) :  5000
                val_loss += loss.item() * context.size(0) # accumulate, weighted by sequence length
        val_loss /= len(val_iter) # average over batches
        print("Epoch : {}, Val Loss : {:.2f}".format(epoch, val_loss))

        correct = 0.0 # number of correct predictions
        test_loss = 0.0 # test loss

        with torch.no_grad(): # no gradient computation
            for idx, batch in enumerate(test_iter):
                context = batch.tweet.to(device) # move to device
                target = batch.label.to(device)
                outputs = model(context) # forward pass
                loss = criterion(outputs, target) # compute the loss
                test_loss += loss.item() * context.size(0) # accumulate, weighted by sequence length
                # index of the largest logit = predicted class
                preds = outputs.argmax(1)
                # count the correct predictions
                correct += preds.eq(target.view_as(preds)).sum().item()
            test_loss /= len(test_iter) # average over batches; len(test_iter) :  5000
            print("Epoch : {}, Test Loss : {:.2f}".format(epoch, test_loss))
            # len(test_iter) * batch size = number of test examples (160000 divides evenly by 32, so every batch is full)
            print("Accuracy : {}".format(100 * correct / (len(test_iter) * batch.tweet.size(1))))

Training and Validation

# Run training, validation and test
train_val_test(lstm_model,  optimizer, criterion, train_iter, val_iter, test_iter, epochs=5)

    Epoch : 1, Train Loss : 5.95
    Epoch : 1, Val Loss : 5.58
    Epoch : 1, Test Loss : 5.57
    Accuracy : 81.628125
    Epoch : 2, Train Loss : 5.36
    Epoch : 2, Val Loss : 5.47
    Epoch : 2, Test Loss : 5.48
    Accuracy : 82.045
    Epoch : 3, Train Loss : 5.11
    Epoch : 3, Val Loss : 5.47
    Epoch : 3, Test Loss : 5.48
    Accuracy : 82.185625
    Epoch : 4, Train Loss : 4.92
    Epoch : 4, Val Loss : 5.51
    Epoch : 4, Test Loss : 5.51
    Accuracy : 82.220625
    Epoch : 5, Train Loss : 4.77
    Epoch : 5, Val Loss : 5.51
    Epoch : 5, Test Loss : 5.53
    Accuracy : 82.275
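
After training, a single tweet can be classified with a small helper. predict_sentiment below is my own sketch, not part of the original tutorial; it reuses the whitespace tokenization of data.Field(lower=True) and the TWEET vocabulary:

def predict_sentiment(model, text):
    tokens = text.lower().split()  # same tokenization as data.Field(lower=True)
    ids = torch.tensor([TWEET.vocab.stoi[t] for t in tokens],
                       device=device).unsqueeze(1)  # shape [seq_len, 1]
    model.eval()
    with torch.no_grad():
        return model(ids).argmax(1).item()  # 0 = negative, 1 = positive

predict_sentiment(lstm_model, "i'm just happy u want to meet me!")  # a clearly positive tweet should return 1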

Extra: Text Data Augmentation

  1. Random insertion
  2. Random deletion
  3. Random swap
    Reference: EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (a minimal sketch of techniques 2 and 3 follows this list)
  4. Back translation
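
A minimal sketch of random deletion and random swap on a token list (the parameters p and n are my own choices; see the EDA paper for the full set of operations):

import random

def random_deletion(words, p=0.1):
    # drop each token independently with probability p
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]  # never return an empty list

def random_swap(words, n=1):
    # swap two randomly chosen positions, n times
    if len(words) < 2:
        return words
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

random_swap(random_deletion("my whole body feels itchy".split()))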

Example: English --> Chinese --> English


This requires installing google_trans_new:

pip install google_trans_new

from google_trans_new import google_translator

translator = google_translator() 

sentence = ['stay hungry, stay foolish. -- spoken / said by Steve Jobs'] # a list, which translate() stringifies (hence the "['" in the outputs below)

# English --> Chinese
translation_cn = translator.translate(sentence, lang_tgt='zh-cn')
# "['保持饥饿,保持愚蠢。 -史蒂夫·乔布斯(Steve Jobs)说的话/ "

# Chinese --> English
translation_en = translator.translate(translation_cn, lang_tgt='en')
# "['stay Hungry Stay Foolish. -What Steve Jobs said/ "

Back translation through a randomly chosen language


import random
import google_trans_new

languages = list(google_trans_new.LANGUAGES.keys()) 

len(languages) # number of supported languages: 108
# 108 

object_lang = random.choice(languages) # e.g. 'hu' (Hungarian)


# Forward translation
translations = translator.translate(sentence, lang_tgt=object_lang)
translations
# "['maradj éhes, maradj őrült. - Steve Jobs mondta / mondta "


# Back translation
back_trans = translator.translate(translations, lang_tgt='en')
back_trans  
# "['stay hungry, stay crazy. - Steve Jobs said "

伊织 2023-02-22
