Text Classification: Word Average Model with Attention

This is part of a series on text classification, working from simple methods up to more complex ones.
We use the Stanford Sentiment Treebank movie review dataset (Socher et al., 2013). The dataset can be downloaded here:
Link: dataset
Extraction code: yeqw

This post adds an attention mechanism on top of the word average model (WordAverageModel).

For the plain word average model, see my previous post: Text Classification with the Word Average Model.

Word Average Model with Attention

We now define an encoder based on a simple attention mechanism. The encoder produces a weight for every word in the sentence, and the weighted average of the word vectors is then used to represent the sentence:
$$\alpha_t \propto \exp\{\cos(u, \mathrm{emb}(x_t))\}$$
$$h_{att} = \sum_t \alpha_t \cdot \mathrm{emb}(x_t)$$

The probability that the sentence expresses positive sentiment is then computed as:
$$\sigma(W^T h_{att})$$

Here $\sigma$ is the logistic (sigmoid) function and $W$ is a d-dimensional vector that is a parameter of the model.

The difference from the word average model is the extra d-dimensional parameter vector $u$ used to compute the word weights; it is also part of the model's parameters.

In this model, a word's importance is the cosine similarity between $u$ and the word's embedding, and a softmax normalizes these scores into weights over the sentence.
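
Before wiring this into a full model, here is a minimal standalone sketch of the attention computation (my addition; the shapes and tensors are arbitrary): cosine scores against u, a softmax over the sentence, and a weighted average of the embeddings.

import torch
import torch.nn.functional as F

batch_size, seq_len, embed_dim = 2, 5, 8
emb = torch.randn(batch_size, seq_len, embed_dim)   # emb(x_t) for every position
u = torch.randn(embed_dim)                          # the vector u (learnable in the real model)

# cos(u, emb(x_t)) for every position -> [batch_size, seq_len]
cos = F.cosine_similarity(emb, u.expand(batch_size, seq_len, embed_dim), dim=2)
alpha = F.softmax(cos, dim=1)                       # normalize within each sentence
h_att = (alpha.unsqueeze(2) * emb).sum(dim=1)       # [batch_size, embed_dim]
print(alpha.sum(dim=1))   # each row sums to 1
print(h_att.shape)        # torch.Size([2, 8])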

import random
from collections import Counter
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')
Read the data
with open('senti.train.tsv', 'r', encoding='utf-8') as rf:
    lines = rf.readlines()
print(lines[:10])
['hide new secretions from the parental units\t0\n', 'contains no wit , only labored gags\t0\n', 'that loves its characters and communicates something rather beautiful about human nature\t1\n', 'remains utterly satisfied to remain the same throughout\t0\n', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up\t0\n', "that 's far too tragic to merit such superficial treatment\t0\n", 'demonstrates that the director of such Hollywood blockbusters as Patriot Games can still turn out a small , personal film with an emotional wallop .\t1\n', 'of saucy\t1\n', "a depressed fifteen-year-old 's suicidal poetry\t0\n", "are more deeply thought through than in most ` right-thinking ' films\t1\n"]
def read_corpus(path):
    'Read a TSV file of "sentence<TAB>label" lines into token lists and label strings.'
    sentences = []
    labels = []
    with open(path,'r', encoding='utf-8') as f:
        for line in f:
            sentence, label = line.split('\t')
            sentences.append(sentence.lower().split())
            labels.append(label.strip())
    return sentences, labels
train_path,dev_path,test_path = 'senti.train.tsv','senti.dev.tsv','senti.test.tsv'
train_sentences, train_labels = read_corpus(train_path)
dev_sentences, dev_labels = read_corpus(dev_path)
test_sentences, test_labels = read_corpus(test_path)
print(len(train_sentences)), print(len(train_labels))

67349
67349

train_sentences[1], train_labels[1]

(['contains', 'no', 'wit', ',', 'only', 'labored', 'gags'], '0')

Build the vocabulary

def build_vocab(sentences, word_size=20000):
    c = Counter()
    for sent in sentences:
        for word in sent:
            c[word] += 1
    print('Total number of unique words:', len(c))
    words_most_common = c.most_common(word_size)
    ## adding unk, pad
    idx2word = ['<pad>','<unk>'] + [item[0] for item in words_most_common]
    word2dix = {w:i for i, w in enumerate(idx2word)}
    return idx2word, word2dix
WORD_SIZE=20000
idx2word, word2dix = build_vocab(train_sentences, word_size=WORD_SIZE)
Total number of unique words: 14828
idx2word[:10]

['<pad>', '<unk>', 'the', ',', 'a', 'and', 'of', '.', 'to', "'s"]

Build batches

def numeralization(sentences, labels, word2idx):
    'Convert word-list sentences into lists of indices; unknown words map to <unk>.'
    numeral_sent = [[word2idx.get(w, word2idx['<unk>']) for w in s] for s in sentences]
    numeral_label =[int(label) for label in labels]
    return list(zip(numeral_sent, numeral_label))
num_train_data = numeralization(train_sentences, train_labels, word2dix)
num_test_data = numeralization(test_sentences, test_labels, word2dix)
num_dev_data = numeralization(dev_sentences, dev_labels, word2dix)

def convert2tensor(batch_sentences):
    'Convert a batch of index lists into a single LongTensor, padding shorter sentences with 0 (the <pad> index).'
    lengths = [len(s) for s in batch_sentences]
    max_len = max(lengths)
    batch_size = len(batch_sentences)
    batch = torch.zeros(batch_size, max_len, dtype=torch.long)
    for i, l in enumerate(lengths):
        batch[i, :l] = torch.tensor(batch_sentences[i])
    return batch
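
As a quick check of the padding behavior (an example I added; the index values are made up), shorter sentences are right-padded with 0, the <pad> index:

print(convert2tensor([[5, 6], [7, 8, 9, 10]]))
# tensor([[ 5,  6,  0,  0],
#         [ 7,  8,  9, 10]])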
def generate_batch(numeral_sentences_labels, batch_size=32):
    '''Shuffle the indexed data and split it into batches of (sentence tensor, label tensor).'''
    batches = []
    num_sample = len(numeral_sentences_labels)
    random.shuffle(numeral_sentences_labels)
    numeral_sent = [n[0] for n in numeral_sentences_labels]
    numeral_label = [n[1] for n in numeral_sentences_labels]
    for start in range(0, num_sample, batch_size):
        end = start + batch_size
        if end > num_sample:
            batch_sentences = numeral_sent[start : num_sample]
            batch_labels = numeral_label[start : num_sample]
            batch_sent_tensor = convert2tensor(batch_sentences)
            batch_label_tensor = torch.tensor(batch_labels, dtype=torch.float)
        else:
            batch_sentences = numeral_sent[start : end]
            batch_labels = numeral_label[start : end]
            batch_sent_tensor = convert2tensor(batch_sentences)
            batch_label_tensor = torch.tensor(batch_labels, dtype=torch.float)
        batches.append((batch_sent_tensor.to(device), batch_label_tensor.to(device)))
    return batches
train_data = generate_batch(num_train_data)

Build the model

class AVGAttenModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_size, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.u = nn.Parameter(torch.randn(embed_dim))
        self.fc = nn.Linear(embed_dim, output_size)
    def forward(self, text):
        ## [batch_size, seq_len] -> [batch_size, seq_len, embed_dim]
        embed = self.embedding(text)
        ## attention
        ## expand u: [embed_dim] -> [batch_size, seq_len, embed_dim]
        u = self.u.repeat(embed.size(0), embed.size(1), 1)
        ## cos: [batch_size, seq_len], cosine similarity between u and each word embedding
        cos = F.cosine_similarity(embed, u, dim=2)  ## computed along dim=2
        ## alpha: [batch_size, seq_len], weights normalized over the words of each sentence
        alpha = F.softmax(cos, dim=1)
        ## h_attn: [batch_size, embed_dim], embeddings weighted by their attention
        h_attn = torch.sum(embed * alpha.unsqueeze(2), dim=1)
        out = self.fc(h_attn)
        return out
    def get_embed_weigth(self):
        return self.embedding.weight.data
    def get_u(self):
        return self.u
VOCAB_SIZE = len(word2dix)
EMBEDDING_DIM = 100
OUTPUT_SIZE = 1
PAD_IDX = word2dix['<pad>']
model = AVGAttenModel(vocab_size=VOCAB_SIZE,
                 embed_dim=EMBEDDING_DIM,
                 output_size=OUTPUT_SIZE, 
                 pad_idx=PAD_IDX)
model.to(device)

AVGAttenModel(
(embedding): Embedding(14830, 100, padding_idx=0)
(fc): Linear(in_features=100, out_features=1, bias=True)
)
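
A quick sanity check of the output shape (my addition, using a random batch of token indices):

dummy_batch = torch.randint(0, VOCAB_SIZE, (4, 7)).to(device)  # 4 sentences of 7 tokens each
print(model(dummy_batch).shape)  # torch.Size([4, 1]): one raw score per sentence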

Define the loss function and the optimizer

criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
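
Note that the model's forward pass returns the raw score $W^T h_{att}$ without a sigmoid: nn.BCEWithLogitsLoss applies the sigmoid internally, which is also why get_accuracy below calls torch.sigmoid explicitly. A quick sanity check of that equivalence (my addition, with arbitrary numbers):

logits = torch.tensor([0.3, -1.2, 2.0])   # raw model outputs
labels = torch.tensor([1.0, 0.0, 1.0])
loss_with_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(torch.allclose(loss_with_logits, loss_manual))  # True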

Train the model

def get_accuracy(output, label):
    ## output: raw scores of shape [batch_size]
    y_hat = torch.round(torch.sigmoid(output)) ## convert logits into 0/1 predictions
    correct = (y_hat == label).float()
    acc = correct.sum()/len(correct)
    return acc
def evaluate(batch_data, model, criterion, get_accuracy):
    model.eval()
    num_epoch = epoch_loss = epoch_acc = 0
    with torch.no_grad():
        for text, label in batch_data:
            out = model(text).squeeze(1)
            loss = criterion(out, label)
            acc = get_accuracy(out, label)
            num_epoch +=1 
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    return epoch_loss/num_epoch, epoch_acc/num_epoch          
def train(batch_data, model, criterion, optimizer, get_accuracy):
    model.train()
    num_epoch = epoch_loss = epoch_acc = 0
    for text, label in batch_data:
        model.zero_grad()
        out = model(text).squeeze(1)
        loss = criterion(out, label)
        acc = get_accuracy(out, label)
        loss.backward()
        optimizer.step()
        num_epoch +=1 
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    
    return epoch_loss/num_epoch, epoch_acc/num_epoch
        
NUM_EPOCH = 20
best_valid_acc = -1

dev_data = generate_batch(num_dev_data)
for epoch in range(NUM_EPOCH):
    train_data = generate_batch(num_train_data)
    train_loss, train_acc = train(train_data, model, criterion, optimizer, get_accuracy)
    valid_loss, valid_acc = evaluate(dev_data, model, criterion, get_accuracy)
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        torch.save(model.state_dict(),'avg-atten-model.pt')
    
    print(f'Epoch: {epoch+1:02} :')
    print(f'\t Train Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Valid Loss: {valid_loss:.4f} | Valid Acc: {valid_acc*100:.2f}%')
    

Epoch: 01 :
Train Loss: 0.2159 | Train Acc: 92.08%
Valid Loss: 0.4092 | Valid Acc: 82.59%
Epoch: 02 :
Train Loss: 0.2047 | Train Acc: 92.44%
Valid Loss: 0.4172 | Valid Acc: 82.70%
Epoch: 03 :
Train Loss: 0.1956 | Train Acc: 92.80%
Valid Loss: 0.4296 | Valid Acc: 82.70%
Epoch: 04 :
Train Loss: 0.1872 | Train Acc: 93.02%
Valid Loss: 0.4389 | Valid Acc: 82.81%
Epoch: 05 :
Train Loss: 0.1802 | Train Acc: 93.29%
Valid Loss: 0.4473 | Valid Acc: 82.70%
Epoch: 06 :
Train Loss: 0.1740 | Train Acc: 93.51%
Valid Loss: 0.4600 | Valid Acc: 82.59%
Epoch: 07 :
Train Loss: 0.1688 | Train Acc: 93.70%
Valid Loss: 0.4731 | Valid Acc: 82.70%
Epoch: 08 :
Train Loss: 0.1640 | Train Acc: 93.93%
Valid Loss: 0.4810 | Valid Acc: 82.81%
Epoch: 09 :
Train Loss: 0.1589 | Train Acc: 94.09%
Valid Loss: 0.4955 | Valid Acc: 82.48%
Epoch: 10 :
Train Loss: 0.1559 | Train Acc: 94.19%
Valid Loss: 0.5087 | Valid Acc: 82.48%
Epoch: 11 :
Train Loss: 0.1518 | Train Acc: 94.33%
Valid Loss: 0.5186 | Valid Acc: 82.48%
Epoch: 12 :
Train Loss: 0.1489 | Train Acc: 94.45%
Valid Loss: 0.5310 | Valid Acc: 82.37%
Epoch: 13 :
Train Loss: 0.1457 | Train Acc: 94.55%
Valid Loss: 0.5434 | Valid Acc: 82.59%
Epoch: 14 :
Train Loss: 0.1431 | Train Acc: 94.66%
Valid Loss: 0.5581 | Valid Acc: 82.14%
Epoch: 15 :
Train Loss: 0.1403 | Train Acc: 94.75%
Valid Loss: 0.5686 | Valid Acc: 82.03%
Epoch: 16 :
Train Loss: 0.1378 | Train Acc: 94.84%
Valid Loss: 0.5810 | Valid Acc: 82.03%
Epoch: 17 :
Train Loss: 0.1357 | Train Acc: 94.99%
Valid Loss: 0.5932 | Valid Acc: 82.48%
Epoch: 18 :
Train Loss: 0.1337 | Train Acc: 95.02%
Valid Loss: 0.6074 | Valid Acc: 82.14%
Epoch: 19 :
Train Loss: 0.1315 | Train Acc: 95.06%
Valid Loss: 0.6212 | Valid Acc: 82.03%
Epoch: 20 :
Train Loss: 0.1302 | Train Acc: 95.13%
Valid Loss: 0.6346 | Valid Acc: 81.81%

model.load_state_dict(torch.load('avg-atten-model.pt'))
<All keys matched successfully>
test_data = generate_batch(num_test_data)
test_loss, test_acc = evaluate(test_data, model, criterion, get_accuracy)
print(f'Test Loss: {test_loss:.4f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.3854 | Test Acc: 82.89%

The test accuracy is about one percentage point higher than the word average model without attention, which suggests the attention mechanism helps.

Inspect the word vectors

torch.cuda.empty_cache()
embed = model.get_embed_weigth() ##  vocab_size * embed_size
u = model.get_u()
embed.shape

torch.Size([14830, 100])

u.shape

torch.Size([100])

u = u.repeat(embed.size(0), 1)
with torch.no_grad():
    cos = F.cosine_similarity(embed, u, dim=1)
sorted_score, sorted_idx = cos.sort()
sorted_score[-10:]

tensor([0.6202, 0.6207, 0.6222, 0.6338, 0.6354, 0.6356, 0.6402, 0.6418, 0.6988,
0.7162], device='cuda:0')

print('30 words with the lowest attention scores:')
for i,s in zip(sorted_idx[:30], sorted_score[:30]):
    print(idx2word[i], ' ', s.item(), end=' \n')

30 words with the lowest attention scores:
the -0.9967465400695801
-0.9959186315536499
's -0.9858197569847107
a -0.9832067489624023
for -0.9813275337219238
is -0.9744138121604919
your -0.9691771864891052
, -0.9663684964179993
very -0.9609249830245972
directed -0.9591320753097534
in -0.9578869938850403
time -0.9567098021507263
. -0.9532570838928223
can -0.9528145790100098
his -0.9518136382102966
sense -0.9483051300048828
of -0.9468281269073486
story -0.9467564821243286
to -0.9453473687171936
– -0.9451014399528503
it -0.944945752620697
that -0.9435582160949707
its -0.9423429369926453
-lrb- -0.938753068447113
about -0.9368602633476257
two -0.9318870902061462
are -0.9273205995559692
such -0.9272398948669434
words -0.9214128851890564
‘’ -0.9189384579658508

print('30 words with the highest attention scores:')
for i,s in zip(sorted_idx[-30:], sorted_score[-30:]):
    print(idx2word[i], ' ', s.item(), end=' \n')

30 words with the highest attention scores:
dim-witted 0.6067432761192322
slow 0.6074190139770508
oppressively 0.608052134513855
neither 0.6080911755561829
pointless 0.6082792282104492
saddest 0.608433723449707
mediocre 0.6101371049880981
drag 0.6104316115379333
tiresomely 0.6107774376869202
drab 0.6113846302032471
annoying 0.612328052520752
falls 0.6125514507293701
smug 0.6126384735107422
missed 0.6136465072631836
hack 0.6157366633415222
lacks 0.6180804371833801
horrible 0.6181895136833191
deadly 0.6193328499794006
off-putting 0.6193955540657043
clueless 0.6194455027580261
exit 0.6202219128608704
rolling 0.6207405924797058
misery 0.6222171187400818
inc. 0.633823812007904
not 0.6354070901870728
problem 0.6355876922607422
bad 0.6402461528778076
sanctimonious 0.6418283581733704
wrong 0.6987859606742859
trouble 0.7162488698959351

Conclusions

  • The 30 words with the highest attention are words directly related to evaluating the movie.

  • The 30 words with the lowest attention are words unrelated to the sentiment of the review.

Analyzing how the same word's attention changes in different contexts

# get the word embeddings and the attention parameter u
word_embedding = model.get_embed_weigth().cpu()
u = model.get_u().cpu()
## select words that appear at least 100 times
c = Counter()
for sent in train_sentences:
    for word in sent:
        c[word] += 1
print('Total number of unique words:', len(c))
words = []
for w in c:
    if c[w] >=100:
        words.append(w)
print('Number of words appearing at least 100 times: {}'.format(len(words)))

Total number of unique words: 14828
Number of words appearing at least 100 times: 668

' / '.join(words)

"new / from / the / no / wit / , / only / gags / that / its / characters / and / something / rather / beautiful / about / human / nature / remains / to / same / on / worst / clichés / filmmakers / could / up / 's / far / too / such / director / of / hollywood / as / can / still / turn / out / a / small / personal / film / with / an / emotional / . / are / more / deeply / through / than / in / most / ` / ' / films / goes / for / those / who / they / do / n't / make / movies / like / part / where / nothing / how / bad / this / movie / was / some / dumb / story / greatest / cold / his / usual / intelligence / concept / is / above / all / young / woman / face / by / whose / it / original / ways / even / if / anything / see / black / your / comes / performances / unfunny / cast / which / half / worse / : / or / world / cinema / very / good / plot / but / action / will / find / little / interest / often / year / sit / another / `` / best / man / '' / funny / adults / have / i / people / almost / horror / …"

def get_attention(sentence, word_embedding, u, word2idx):
    'Compute the attention weight of every word in a sentence; return a dict of word -> attention value.'
    num_sentence = [word2idx[w] for w in sentence]
    s_embed = word_embedding[num_sentence]
    u = u.repeat(s_embed.size(0),1)
    score = torch.cosine_similarity(s_embed, u, dim=1)
    attn = torch.softmax(score, dim=0)
    return {w:a for w, a in zip(sentence, attn.tolist())}
## initialize an empty attention list for every selected word, keyed by the word
word_attention_li = {w:[] for w in words}
print(word_attention_li['new'])
for s in train_sentences:
    s_w_attn = get_attention(s, word_embedding, u, word2dix)
    for w in s_w_attn:
        if w in word_attention_li:
            word_attention_li[w].append(s_w_attn[w])
print(word_attention_li['new'])

[]
[0.06510528922080994, 0.039851825684309006, 0.30102190375328064, 0.09700364619493484, 0.027896232903003693, 0.1262332648038864, 0.0618707612156868, 0.025379355996847153, 0.02642618678510189, 0.03359510004520416, 0.015629686415195465,…]

print(len(word_attention_li['new']))
674
def mean_std_list(word_attention_li, sort=True):
    'Compute the mean and std of attention values for each word; optionally return lists sorted by std and by mean.'
    import numpy as np
    word_mean_std_li=[]
    for w in word_attention_li:
        arr = np.array(word_attention_li[w])
        word_mean_std_li.append((w, arr.mean(), arr.std()))
    if sort:
        sorted_std_li = sorted(word_mean_std_li, key=lambda x:x[2], reverse=True)
        sorted_mean_li = sorted(word_mean_std_li, key=lambda x:x[1], reverse=True)
    else:
        return word_mean_std_li
    return sorted_std_li, sorted_mean_li
sorted_std_li, sorted_mean_li= mean_std_list(word_attention_li)
print('30 words with the largest attention std:')
for word, amean, astd in sorted_std_li[:30]:
    print('{} : {:.4}'.format(word, astd))

30 words with the largest attention std:
stupid : 0.2377
awful : 0.234
terrific : 0.2326
tedious : 0.2304
watchable : 0.2272
provocative : 0.222
flat : 0.2218
painful : 0.221
inventive : 0.2205
bland : 0.2201
boring : 0.2197
appealing : 0.2171
waste : 0.2168
gorgeous : 0.2158
remarkable : 0.2154
excellent : 0.2145
mess : 0.2145
worse : 0.2141
beautifully : 0.2127
unfunny : 0.2111
impressive : 0.209
brilliant : 0.2076
intriguing : 0.2066
convincing : 0.2036
slow : 0.2034
cool : 0.2032
engrossing : 0.2026
wonderful : 0.2015
delightful : 0.2012
bad : 0.2009

print('30 words with the smallest attention std:')
for word, amean, astd in sorted_std_li[-30:]:
    print('{} : {:.4}'.format(word, astd))

30 words with the smallest attention std:
of : 0.03817
if : 0.03788
adults : 0.03738
whose : 0.03656
about : 0.03644
i : 0.03643
but : 0.03612
– : 0.03554
though : 0.0352
. : 0.03507
at : 0.03471
while : 0.03463
filmmakers : 0.03399
had : 0.03391
to : 0.03372
ever : 0.03367
shows : 0.03364
de : 0.0335
they : 0.03286
we : 0.03205
now : 0.03194
mr. : 0.03096
into : 0.03038
: : 0.03024
that : 0.02741
because : 0.02642
which : 0.02581
who : 0.02491
when : 0.02489
; : 0.0237

print('30 words with the largest attention mean:')
for word, amean, astd in sorted_mean_li[:30]:
    print('{} : {:.4}'.format(word, amean))

30 words with the largest attention mean:
mess : 0.3365
wrong : 0.324
stupid : 0.3218
awful : 0.3215
waste : 0.3172
terrific : 0.3164
brilliant : 0.3059
tired : 0.3039
unfunny : 0.3021
bad : 0.3021
touching : 0.2984
worst : 0.2979
provocative : 0.2962
engrossing : 0.2907
slow : 0.2906
excellent : 0.2885
impressive : 0.288
remarkable : 0.2864
boring : 0.2827
beautifully : 0.2813
watchable : 0.281
flat : 0.2797
wonderful : 0.2792
hilarious : 0.2788
fascinating : 0.2783
painful : 0.2768
inventive : 0.2749
appealing : 0.2741
pretentious : 0.2741
delightful : 0.2739

print('30 words with the smallest attention mean:')
for word, amean, astd in sorted_mean_li[-30:]:
    print('{} : {:.4}'.format(word, amean))

30 words with the smallest attention mean:
shows : 0.04875
… : 0.04869
times : 0.04832
had : 0.04819
filmmakers : 0.04786
ever : 0.04786
-rrb- : 0.04785
i : 0.04767
though : 0.04661
de : 0.046
from : 0.04579
which : 0.0457
whose : 0.04494
. : 0.04483
about : 0.04473
they : 0.04469
to : 0.04457
‘’ : 0.04442
but : 0.04431
at : 0.04378
when : 0.04286
we : 0.04208
into : 0.04168
– : 0.04111
-lrb- : 0.04107
: : 0.04001
because : 0.03962
that : 0.03814
who : 0.03568
; : 0.03429

Analysis

The mean and standard deviation describe how a word's attention is distributed across sentences. For unimportant words, both the mean and the standard deviation of the attention are small: these words never receive much attention in any sentence. Words with a large attention mean are exactly the evaluative words, and they take a large share of the attention within a sentence. Their standard deviations are also large because the share depends on the sentence: if a review contains more than one important word, the attention is split among them and each word's share shrinks, whereas if the review contains only one important word, that word receives very high attention.
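
A toy softmax makes this splitting effect concrete (my own illustration, with made-up cosine scores):

one_important = torch.tensor([2.0, -1.0, -1.0, -1.0])  # one important word among neutral ones
two_important = torch.tensor([2.0, 2.0, -1.0, -1.0])   # two important words in the same sentence
print(torch.softmax(one_important, dim=0)[0])  # ~0.87: the single important word dominates
print(torch.softmax(two_important, dim=0)[0])  # ~0.48: the attention mass is shared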

For example, take 'stupid': we collect every sentence that contains 'stupid' and look at how much attention 'stupid' receives in each one.

## collect the sentences containing 'stupid' and (below) compute the attention of every word in them
stupid_sents = []
stupid_words = []
for sentence in train_sentences:
    if 'stupid' in sentence:
        stupid_sents.append(sentence)
        for word in sentence:
            stupid_words.append(word)


stupid_word_attention = {w:[] for w in stupid_words}
for s in stupid_sents:
    s_w_attn = get_attention(s, word_embedding, u, word2dix)
    for w in s_w_attn:
        if w in stupid_word_attention:
            stupid_word_attention[w].append(s_w_attn[w])
print(stupid_word_attention)
sorted_std_stupid, sorted_mean_stupid= mean_std_list(stupid_word_attention)
Sorting by the attention on 'stupid' from largest to smallest, we can see that its attention decreases as the sentence gets longer and as more important words appear alongside it.
sent_atten = []
for i in range(len(stupid_sents)):
    sent_atten.append((stupid_word_attention['stupid'][i],  " ".join(stupid_sents[i])))
    #print('attention:', stupid_word_attention['stupid'][i],  " ".join(stupid_sents[i]), '\n')
    
## sort by the attention on 'stupid', from largest to smallest
sorted(sent_atten, key=lambda x:x[0], reverse=True) 
## the attention decreases as the sentence gets longer and contains more important words
>   [(1.0, 'stupid'),
     (1.0, 'stupid'),
     (0.8240495920181274, 'stupid ,'),
     (0.8240495920181274, ', stupid'),
     (0.8237711787223816, 'stupid characters'),
     (0.8222578763961792, "'s stupid"),
     (0.8200607299804688, 'stupid and'),
     (0.8061841130256653, 'so stupid'),
     (0.800701379776001, 'profoundly stupid'),
     (0.7876837849617004, 'stupid americans'),
     (0.78346186876297, 'being stupid'),
     (0.7657994031906128, 'really stupid'),
     (0.7512186169624329, 'turn stupid'),
     (0.7088738679885864, 'insanely stupid'),
     (0.7068068981170654, 'pretty stupid'),
     (0.7050910592079163, 'stupid sequel'),
     (0.6936635375022888, "it 's stupid"),
     (0.6581787467002869, ', really stupid'),
     (0.6335198283195496, 'simply stupid ,'),
     (0.6308980584144592, 'be so stupid'),
     (0.6131293773651123, "'s pretty stupid"),
     (0.6056569814682007, 'so insanely stupid'),
     (0.587374746799469, 'you turn stupid'),
     (0.5786352157592773, 'on `` stupid'),
     (0.5478944182395935, 'really , really stupid'),
     (0.5426362156867981, "'s pretty stupid ."),
     (0.540003776550293, 'profoundly stupid affair'),
     (0.5363026261329651, 'so insanely stupid ,'),
     (0.5137991905212402, "on `` stupid ''"),
     (0.5077778100967407, 'unbelievably stupid'),
     (0.49585697054862976, 'too stupid'),
     (0.49035701155662537, 'is really , really stupid'),
     (0.4849787652492523, 'a profoundly stupid affair'),
     (0.4834858477115631, "it 's pretty stupid ."),
     (0.44486135244369507, 'stupid and annoying'),
     (0.4442058503627777, 'is really , really stupid .'),
     (0.44391199946403503, 'pointless , stupid'),
     (0.4427518844604492, 'stupid and pointless'),
     (0.4407324492931366, 'stupid , derivative horror film'),
     (0.43824437260627747, ", it 's pretty stupid ."),
     (0.40547922253608704, ', pointless , stupid'),
     (0.404511034488678, ', stupid and pointless'),
     (0.4033791422843933, 'a stupid , derivative horror film'),
     (0.40185871720314026, 'very stupid and annoying'),
     (0.3998126983642578, 'an unbelievably stupid film'),
     (0.3887804448604584, "'s also too stupid"),
     (0.38554927706718445, 'simply stupid , irrelevant'),
     (0.3703272044658661, 'very stupid and annoying .'),
     (0.3554767072200775, 'simply stupid , irrelevant and'),
     (0.3287999629974365, 'so insanely stupid , so awful'),
     (0.3262757360935211, 'silly , stupid and pointless'),
     (0.323175847530365, 'simply stupid , irrelevant and deeply'),
     (0.3210461437702179, "frankly , it 's pretty stupid ."),
     (0.305471271276474, "landing squarely on `` stupid ''"),
     (0.3020733594894409, "'s simply stupid , irrelevant and deeply"),
     (0.3016372621059418, '-lrb- a -rrb- soulless , stupid sequel'),
     (0.2869524657726288, 'ugly , pointless , stupid'),
     (0.2837705910205841, "'s simply stupid , irrelevant and deeply ,"),
     (0.2798386514186859, '-lrb- a -rrb- soulless , stupid sequel ...'),
     (0.2683190107345581, "do n't care about being stupid"),
     (0.2675773799419403, 'ugly , pointless , stupid movie'),
     (0.26182451844215393, "'s simply stupid , irrelevant and deeply , truly"),
     (0.2577085793018341, "before landing squarely on `` stupid ''"),
     (0.2561054229736328,
      'a stupid , derivative horror film that substitutes extreme'),
     (0.25393128395080566, "that do n't care about being stupid"),
     (0.25128045678138733, 'an ugly , pointless , stupid movie'),
     (0.24796229600906372, "'s simply stupid , irrelevant and deeply , truly ,"),
     (0.23981519043445587, 'loud , silly , stupid and pointless'),
     (0.23857834935188293, 'an ugly , pointless , stupid movie .'),
     (0.23349826037883759, 'so insanely stupid , so awful in so many ways'),
     (0.23173819482326508, 'stupid , infantile , redundant , sloppy'),
     (0.22821901738643646, 'loud , silly , stupid and pointless .'),
     (0.22414197027683258, 'far-flung , illogical , and plain stupid'),
     (0.2223779261112213, 'is so insanely stupid , so awful in so many ways'),
     (0.22081227600574493, 'stupid , infantile , redundant , sloppy ,'),
     (0.2157662957906723, 'would not likely be so stupid as to get'),
     (0.213875412940979, 'is far-flung , illogical , and plain stupid'),
     (0.2058711051940918, 'who would not likely be so stupid as to get'),
     (0.20460370182991028, 'is far-flung , illogical , and plain stupid .'),
     (0.2036629319190979,
      'the movie , as opposed to the manifesto , is really , really stupid .'),
     (0.20323993265628815,
      'an unbelievably stupid film , though occasionally fun enough to make you'),
     (0.18843136727809906,
      'the story is far-flung , illogical , and plain stupid .'),
     (0.18570968508720398,
      'absurd plot twists , idiotic court maneuvers and stupid characters'),
     (0.18550170958042145,
      "the top and movies that do n't care about being stupid"),
     (0.18181368708610535,
      'a stupid , derivative horror film that substitutes extreme gore for suspense .'),
     (0.17968744039535522,
      '... the story is far-flung , illogical , and plain stupid .'),
     (0.1783759444952011,
      'of absurd plot twists , idiotic court maneuvers and stupid characters'),
     (0.17126235365867615,
      "'s simply stupid , irrelevant and deeply , truly , bottomlessly cynical"),
     (0.16914644837379456,
      'equilibrium the movie , as opposed to the manifesto , is really , really stupid .'),
     (0.16876907646656036,
      'game of absurd plot twists , idiotic court maneuvers and stupid characters'),
     (0.1672833114862442,
      'one look at a girl in tight pants and big tits and you turn stupid'),
     (0.16526541113853455,
      "'s simply stupid , irrelevant and deeply , truly , bottomlessly cynical ."),
     (0.15932875871658325,
      "it 's simply stupid , irrelevant and deeply , truly , bottomlessly cynical ."),
     (0.15484601259231567,
      "go over the top and movies that do n't care about being stupid"),
     (0.15245948731899261,
      'is so insanely stupid , so awful in so many ways that watching it leaves you giddy'),
     (0.1520555019378662,
      'played game of absurd plot twists , idiotic court maneuvers and stupid characters'),
     (0.1501546949148178,
      'an unbelievably stupid film , though occasionally fun enough to make you forget its absurdity .'),
     (0.14970165491104126,
      'stupid , infantile , redundant , sloppy , over-the-top , and amateurish .'),
     (0.14770428836345673,
      'that is so insanely stupid , so awful in so many ways that watching it leaves you giddy'),
     (0.14486443996429443,
      'one look at a girl in tight pants and big tits and you turn stupid ?'),
     (0.13773369789123535,
      'an unsympathetic character and someone who would not likely be so stupid as to get'),
     (0.1337944120168686,
      'about an unsympathetic character and someone who would not likely be so stupid as to get'),
     (0.1334332823753357,
      'comes along that is so insanely stupid , so awful in so many ways that watching it leaves you giddy'),
     (0.1320648491382599,
      "`` one look at a girl in tight pants and big tits and you turn stupid ? ''"),
     (0.13114982843399048,
      "the courage to go over the top and movies that do n't care about being stupid"),
     (0.1297646164894104,
      'comes along that is so insanely stupid , so awful in so many ways that watching it leaves you giddy .'),
     (0.12267553061246872,
      "movies with the courage to go over the top and movies that do n't care about being stupid"),
     (0.11352398991584778,
      "'s also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.1106908842921257,
      "it 's also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.11034628748893738,
      'every so often a film comes along that is so insanely stupid , so awful in so many ways that watching it leaves you giddy .'),
     (0.10816267877817154,
      "that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.10788409411907196,
      'a lame romantic comedy about an unsympathetic character and someone who would not likely be so stupid as to get'),
     (0.10759475827217102,
      'stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks'),
     (0.10367858409881592,
      'if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks'),
     (0.1033768281340599,
      "there is a difference between movies with the courage to go over the top and movies that do n't care about being stupid"),
     (0.09753934293985367,
      'to see if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks'),
     (0.09664229303598404,
      "played game of absurd plot twists , idiotic court maneuvers and stupid characters that even freeman ca n't save it"),
     (0.09569261223077774,
      'a lame romantic comedy about an unsympathetic character and someone who would not likely be so stupid as to get involved with her .'),
     (0.09470311552286148,
      "played game of absurd plot twists , idiotic court maneuvers and stupid characters that even freeman ca n't save it ."),
     (0.09395401179790497,
      'a profoundly stupid affair , populating its hackneyed and meanspirited storyline with cardboard characters and performers who value cash above credibility .'),
     (0.08705271035432816,
      "see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid ''"),
     (0.08546370267868042,
      "to see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid ''"),
     (0.08074133098125458,
      "rare to see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid ''"),
     (0.07935629785060883,
      "'s rare to see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid ''"),
     (0.07804407179355621,
      "'s rare to see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid '' ."),
     (0.07669458538293839,
      "it 's rare to see a movie that takes such a speedy swan dive from `` promising '' to `` interesting '' to `` familiar '' before landing squarely on `` stupid '' ."),
     (0.0762612372636795,
      "i 'm not sure which half of dragonfly is worse : the part where nothing 's happening , or the part where something 's happening , but it 's stupid"),
     (0.07504859566688538,
      "i 'm not sure which half of dragonfly is worse : the part where nothing 's happening , or the part where something 's happening , but it 's stupid ."),
     (0.0744263157248497,
      'cranked out on an assembly line to see if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks'),
     (0.07058679312467575,
      "sinks so low in a poorly played game of absurd plot twists , idiotic court maneuvers and stupid characters that even freeman ca n't save it ."),
     (0.06713598966598511,
      "that not only would subtlety be lost on the target audience , but that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.06658100336790085,
      'more than a widget cranked out on an assembly line to see if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks'),
     (0.06565481424331665,
      'more than a widget cranked out on an assembly line to see if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks .'),
     (0.06278679519891739,
      "assumes that not only would subtlety be lost on the target audience , but that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.06211848556995392,
      'nothing more than a widget cranked out on an assembly line to see if stupid americans will get a kick out of goofy brits with cute accents performing ages-old slapstick and unfunny tricks .'),
     (0.060545653104782104,
      "the script assumes that not only would subtlety be lost on the target audience , but that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times"),
     (0.045822471380233765,
      "the plot is nothing but boilerplate clichés from start to finish , and the script assumes that not only would subtlety be lost on the target audience , but that it 's also too stupid to realize that they 've already seen this exact same movie a hundred times")]