Text Classification: the Word Average Model

This is part of a series on text classification that works through the task with methods ranging from simple to complex.
We use the Stanford Sentiment Treebank movie-review dataset (Socher et al. 2013). The dataset can be downloaded from here
Link: dataset
Extraction code: yeqw
For the code, see: text classification

The first and simplest model: the Word Average Model

Word Average Model

We denote a sentence by X = {x_1, x_2, x_3, ..., x_n}, where x_t is the t-th word of the sentence. We write emb for the word-embedding function, i.e. emb(x) returns a d-dimensional word vector.
First we define a word-averaging sentence encoder:
$h_{avg} = \frac{1}{|X|} \sum_{t} \mathrm{emb}(x_t)$
The probability that the sentence expresses positive sentiment is then:
$pos = \sigma(w^T h_{avg})$

Here $\sigma$ is the logistic (sigmoid) function and w is a d-dimensional vector. If pos >= 0.5 the classifier returns positive sentiment; otherwise it returns negative sentiment.

For training we use the binary log loss. The parameters of the whole model are the embedding function emb and the vector w; note that the embedding dimension d must match the dimension of w. Some words appear in the DEV and TEST sets but not in TRAIN. For those words we can use a randomly initialized vector (a special UNK vector). When initializing the embeddings, avoid too large a range, otherwise the large norms of these unknown words can hurt the model (so here we initialize the embeddings to random values between -0.1 and 0.1).
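For completeness, the binary log loss for a sentence with gold label y ∈ {0, 1} and predicted probability pos is:

$L(y, pos) = -\left[\, y \log(pos) + (1-y)\log(1-pos) \,\right]$

This is exactly what nn.BCEWithLogitsLoss computes below, with the sigmoid folded into the loss.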

import random
from collections import Counter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')
Reading the data
with open('senti.train.tsv','r') as rf:
    lines = rf.readlines()
print(lines[:10])

['hide new secretions from the parental units\t0\n', 'contains no wit , only labored gags\t0\n', 'that loves its characters and communicates something rather beautiful about human nature\t1\n', 'remains utterly satisfied to remain the same throughout\t0\n', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up\t0\n', "that 's far too tragic to merit such superficial treatment\t0\n", 'demonstrates that the director of such Hollywood blockbusters as Patriot Games can still turn out a small , personal film with an emotional wallop .\t1\n', 'of saucy\t1\n', "a depressed fifteen-year-old 's suicidal poetry\t0\n", "are more deeply thought through than in most ` right-thinking ' films\t1\n"]

def read_corpus(path):
    sentences = []
    labels = []
    with open(path,'r', encoding='utf-8') as f:
        for line in f:
            sentence, label = line.split('\t')
            sentences.append(sentence.lower().split())
            labels.append(label[0])  # label is like '0\n' or '1\n'; keep only the leading digit
    return sentences, labels
train_path,dev_path,test_path = 'senti.train.tsv','senti.dev.tsv','senti.test.tsv'
train_sentences, train_labels = read_corpus(train_path)
dev_sentences, dev_labels = read_corpus(dev_path)
test_sentences, test_labels = read_corpus(test_path)
print(len(train_sentences)), print(len(train_labels))

67349
67349

train_sentences[1], train_labels[1]

(['contains', 'no', 'wit', ',', 'only', 'labored', 'gags'], '0')

Building the vocabulary

def build_vocab(sentences, word_size=20000):
    c = Counter()
    for sent in sentences:
        for word in sent:
            c[word] += 1
    print('Total number of unique words:', len(c))
    words_most_common = c.most_common(word_size)
    ## adding unk, pad
    idx2word = ['<pad>','<unk>'] + [item[0] for item in words_most_common]
    word2dix = {w:i for i, w in enumerate(idx2word)}
    return idx2word, word2dix
WORD_SIZE=20000
idx2word, word2dix = build_vocab(train_sentences, word_size=WORD_SIZE)

Total number of unique words: 14828

idx2word[:10]

['<pad>', '<unk>', 'the', ',', 'a', 'and', 'of', '.', 'to', "'s"]
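As a quick check (a small illustrative snippet not in the original code; 'some-unseen-word' is just a made-up token), an in-vocabulary word maps to its index, and anything else can fall back to the <unk> index:

print(word2dix['the'])                                      # 2: 'the' is the most frequent word
print(word2dix.get('some-unseen-word', word2dix['<unk>']))  # 1: the <unk> index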

Building batches

def numeralization(sentences, labels, word2idx):
    'Convert sentences given as lists of words into lists of word indices'
    numeral_sent = [[word2idx.get(w, word2idx['<unk>']) for w in s] for s in sentences]
    numeral_label =[int(label) for label in labels]
    return list(zip(numeral_sent, numeral_label))
num_train_data = numeralization(train_sentences, train_labels, word2dix)
num_test_data = numeralization(test_sentences, test_labels, word2dix)
num_dev_data = numeralization(dev_sentences, dev_labels, word2dix)

def convert2tensor(batch_sentences):
    'Convert a batch of index lists into a single padded LongTensor'
    lengths = [len(s) for s in batch_sentences]
    max_len = max(lengths)
    batch_size = len(batch_sentences)
    batch = torch.zeros(batch_size, max_len, dtype=torch.long)
    for i, l in enumerate(lengths):
        batch[i, :l] = torch.tensor(batch_sentences[i])
    return batch
def generate_batch(numeral_sentences_labels, batch_size=32):
    '''Split the (index list, label) pairs into batches of tensors'''
    batches = []
    num_sample = len(numeral_sentences_labels)
    random.shuffle(numeral_sentences_labels)
    numeral_sent = [n[0] for n in numeral_sentences_labels]
    numeral_label = [n[1] for n in numeral_sentences_labels]
    for start in range(0, num_sample, batch_size):
        end = min(start + batch_size, num_sample)
        batch_sentences = numeral_sent[start:end]
        batch_labels = numeral_label[start:end]
        batch_sent_tensor = convert2tensor(batch_sentences)
        batch_label_tensor = torch.tensor(batch_labels, dtype=torch.float)
        batches.append((batch_sent_tensor.to(device), batch_label_tensor.to(device)))
    return batches
train_data = generate_batch(num_train_data)
a = train_data[4]
text,label=a
text

tensor([[    2,  1470,     0,  ...,     0,     0,     0],
        [ 3789,     0,     0,  ...,     0,     0,     0],
        [ 2056,    15,   283,  ...,     0,     0,     0],
        ...,
        [11711,     3, 12789,  ...,    42,  2365,     7],
        [ 1484,   524,     0,  ...,     0,     0,     0],
        [  308,    11,    10,  ...,     0,     0,     0]], device='cuda:0')
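To verify that numeralization and padding behave as expected, one row of the batch can be mapped back to words (a small sanity-check snippet using the variables defined above; the trailing <pad> tokens come from padding):

# decode the first sentence of the batch back into words
print([idx2word[i] for i in text[0].tolist()])
# and print its label
print(label[0].item())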

Building the model

class AVGModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_size, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc = nn.Linear(embed_dim, output_size)
    def forward(self, text):
        ## [batch_size, seq_len]->[batch_size, seq_len, embed_dim]
        embed = self.embedding(text)
        ## average the word embeddings over the sequence (word averaging via avg_pool2d)
        ##[batch_size, seq_len, embed_dim]->[batch_size, embed_dim]
        pooled = F.avg_pool2d(embed, (embed.size(1),1)).squeeze(1)
        ## [batch_size, embed_dim]->[batch_size, output_size]
        out = self.fc(pooled)
        return out
    def get_embed_weight(self):
        return self.embedding.weight.data
VOCAB_SIZE = len(word2dix)
EMBEDDING_DIM = 100
OUTPUT_SIZE = 1
PAD_IDX = word2dix['<pad>']
model = AVGModel(vocab_size=VOCAB_SIZE,
                 embed_dim=EMBEDDING_DIM,
                 output_size=OUTPUT_SIZE, 
                 pad_idx=PAD_IDX)
model.to(device)

AVGModel(
(embedding): Embedding(14830, 100, padding_idx=0)
(fc): Linear(in_features=100, out_features=1, bias=True)
)
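Note that F.avg_pool2d in forward divides by the padded length max_len rather than the true sentence length |X|; the <pad> embeddings are zero (thanks to padding_idx), but they still dilute the average for short sentences in a batch. If you want the exact word average from the formula above, a masked mean is one alternative. The following is only a sketch built on the model defined above, not the original implementation:

def masked_average_logits(model, text, pad_idx=PAD_IDX):
    ## [batch_size, seq_len] -> [batch_size, seq_len, embed_dim]
    embed = model.embedding(text)
    ## 1.0 for real tokens, 0.0 for <pad> positions
    mask = (text != pad_idx).unsqueeze(-1).float()
    ## sum over real tokens and divide by the true length |X| (clamped to avoid division by zero)
    pooled = (embed * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    ## [batch_size, embed_dim] -> [batch_size, output_size]
    return model.fc(pooled)

This reproduces h_avg exactly, at the cost of building an explicit mask; the classifier on top is unchanged.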

Defining the loss function and optimizer

criterion = nn.BCEWithLogitsLoss()  # binary log loss; the sigmoid is applied inside, so the model outputs raw logits
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)

Training the model

def get_accuracy(output, label):
    ## output: batch_size 
    y_hat = torch.round(torch.sigmoid(output)) ## convert logits into 0/1 predictions
    correct = (y_hat == label).float()
    acc = correct.sum()/len(correct)
    return acc
def evaluate(batch_data, model, criterion, get_accuracy):
    model.eval()
    num_epoch = epoch_loss = epoch_acc = 0
    with torch.no_grad():
        for text, label in batch_data:
            out = model(text).squeeze(1)
            loss = criterion(out, label)
            acc = get_accuracy(out, label)
            num_epoch +=1 
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    return epoch_loss/num_epoch, epoch_acc/num_epoch          
def train(batch_data, model, criterion, optimizer, get_accuracy):
    model.train()
    num_epoch = epoch_loss = epoch_acc = 0
    for text, label in batch_data:
        model.zero_grad()
        out = model(text).squeeze(1)
        loss = criterion(out, label)
        acc = get_accuracy(out, label)
        loss.backward()
        optimizer.step()
        num_epoch +=1 
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    
    return epoch_loss/num_epoch, epoch_acc/num_epoch
        
NUM_EPOCH = 30
best_valid_acc = -1

dev_data = generate_batch(num_dev_data)
for epoch in range(NUM_EPOCH):
    train_data = generate_batch(num_train_data)
    train_loss, train_acc = train(train_data, model, criterion, optimizer, get_accuracy)
    valid_loss, valid_acc = evaluate(dev_data, model, criterion, get_accuracy)
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(),'avg-model.pt')
    
    print(f'Epoch: {epoch+1:02} :')
    print(f'\t Train Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Valid Loss: {valid_loss:.4f} | Valid Acc: {valid_acc*100:.2f}%')
    

Epoch: 01 :
Train Loss: 0.1558 | Train Acc: 94.39%
Valid Loss: 0.6171 | Valid Acc: 82.25%
Epoch: 02 :
Train Loss: 0.1550 | Train Acc: 94.45%
Valid Loss: 0.6319 | Valid Acc: 81.47%
Epoch: 03 :
Train Loss: 0.1526 | Train Acc: 94.53%
Valid Loss: 0.6300 | Valid Acc: 82.59%
Epoch: 04 :
Train Loss: 0.1510 | Train Acc: 94.60%
Valid Loss: 0.6502 | Valid Acc: 81.25%
Epoch: 05 :
Train Loss: 0.1495 | Train Acc: 94.64%
Valid Loss: 0.6515 | Valid Acc: 82.37%

model.load_state_dict(torch.load('avg-model.pt'))

<All keys matched successfully>

test_data = generate_batch(num_test_data)
test_loss, test_acc = evaluate(test_data, model, criterion, get_accuracy)
print(f'Test Loss: {test_loss:.4f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.5369 | Test Acc: 81.23%
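With the trained model we can also score a raw sentence directly. The helper below is not part of the original code; it simply reuses word2dix, convert2tensor and device from above, and the example sentence is made up:

def predict_sentiment(model, sentence):
    model.eval()
    ## lowercase and whitespace-tokenize, mirroring read_corpus, then map words to indices
    tokens = [word2dix.get(w, word2dix['<unk>']) for w in sentence.lower().split()]
    tensor = convert2tensor([tokens]).to(device)
    with torch.no_grad():
        prob = torch.sigmoid(model(tensor).squeeze(1)).item()
    return prob  ## probability that the sentence is positive

print(predict_sentiment(model, 'a gorgeous and heartwarming film'))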

Inspecting the word embeddings

embed = model.get_embed_weight()
embed_norm = torch.norm(embed, p=None, dim=1)
sort_embed_norm, sort_embed_norm_idx = embed_norm.sort()
print('The 30 words with the smallest norms:')
for idx in sort_embed_norm_idx[:30].tolist():
    print(idx2word[idx], end=' / ')

The 30 words with the smallest norms:
par / holiday / pastiche / seedy / e-graveyard / quieter / home / captain / keeps / possibly / urge / aching / career / album / code / elegy / peculiar / squint / handheld / blown / quite / cops / miss / the / blush / judd / trip / appointed / make / themselves /

print('The 30 words with the largest norms:')
for idx in sort_embed_norm_idx[-30:].tolist():
    print(idx2word[idx], end=' / ')

The 30 words with the largest norms:
wonderfully / lousy / unlikable / choppy / badly / splendid / worst / dazzling / outstanding / inept / listless / lacking / playful / mesmerizing / unnecessary / amazing / stunning / irritating / unimaginative / refreshingly / heartwarming / devoid / riveting / suffers / tiresome / pointless / thought-provoking / poorly / mess / unfunny /

The 30 words with the largest norms are all sentiment-laden words used to evaluate films, while the 30 words with the smallest norms have little to do with sentiment.
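One way to see why the large-norm words carry the sentiment (an extra sketch built on the variables above, not in the original post) is to score each word on its own with the trained classifier, i.e. compute $\sigma(w^T \mathrm{emb}(word) + b)$ as if the word were a one-word sentence:

w_vec = model.fc.weight.data.squeeze(0)   # the d-dimensional vector w
b = model.fc.bias.data                    # the scalar bias
for word in ['wonderfully', 'lousy', 'the']:
    idx = word2dix.get(word, word2dix['<unk>'])
    score = torch.sigmoid(w_vec @ embed[idx] + b).item()
    print(word, round(score, 3))

Sentiment-bearing words should come out close to 0 or 1, while function words like 'the' should stay near 0.5.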
