1. Embedding Overview
An embedding turns a sparse one-hot encoded vector into a smaller, dense vector of real numbers. An important property of these word embeddings is that the more similar two words are, the closer together they lie, i.e. the smaller the Euclidean distance between them. For example:
“I purchased some items at the shop”
“I purchased some items at the store”
shop and store have similar meanings, so in the high-dimensional embedding space the distance between these two words is small.
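As a quick sanity check of this claim, here is a minimal sketch (it uses the GloVe vectors that are loaded in section 2 below; banana is just an arbitrary unrelated word):

import torch
import torchtext.vocab

glove = torchtext.vocab.GloVe(name = '6B', dim = 100)
shop = glove.vectors[glove.stoi['shop']]
store = glove.vectors[glove.stoi['store']]
banana = glove.vectors[glove.stoi['banana']]
print(torch.dist(shop, store).item())   # expected: relatively small
print(torch.dist(shop, banana).item())  # expected: noticeably larger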
In PyTorch we use the nn.Embedding layer to work with word embeddings.
The nn.Embedding layer takes an input tensor of shape [sentence length, batch size], e.g. (250, 64),
and transforms it into a tensor of shape [sentence length, batch size, embedding dimensions], e.g. (250, 64, 300).
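A minimal sketch of this shape transformation (the vocabulary size of 10000 is illustrative; the other sizes follow the example above):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings = 10000, embedding_dim = 300)
tokens = torch.randint(0, 10000, (250, 64))  # [sentence length, batch size]
vectors = embedding(tokens)                  # look up an embedding for every token index
print(vectors.shape)                         # torch.Size([250, 64, 300])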
2. Loading the GloVe Word Vectors
import torchtext.vocab
# name = '6B': the corpus contains 6 billion tokens; '42B' and '840B' variants are also available (300-dimensional vectors only)
# dim = 100: the trained word vectors are 100-dimensional
glove = torchtext.vocab.GloVe(name = '6B', dim = 100)
print(f'There are {len(glove.itos)} words in the vocabulary')
There are 400000 words in the vocabulary
As we can see, there are 400,000 unique words in the GloVe vocabulary. Note that every word in it is lowercase (a quick check of this follows the shape output below).
glove.vectors.shape
torch.Size([400000, 100])
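Since the vocabulary is lowercased, only the lowercase form of a word can be looked up. A quick check (sketch):

print('the' in glove.stoi)  # True
print('The' in glove.stoi)  # False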
2.1 Looking at the first 10 words in the vocabulary
glove.itos[:10]
['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]
2.2 Looking up a word's index
glove.stoi['the']
0
2.3 Getting a word's vector
glove.vectors[glove.stoi['the']]
Let's create a function to conveniently get a word's vector (it raises an error if the word is not in the vocabulary):
def get_vector(embeddings, word):
    assert word in embeddings.stoi, f'*{word}* is not in the vocab!'
    return embeddings.vectors[embeddings.stoi[word]]
get_vector(glove, 'the')
Check the shape:
get_vector(glove, 'the').shape
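This should return torch.Size([100]), since we loaded the 100-dimensional vectors.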
3. Finding the Words Closest to a Given Word (Finding Similar Words)
import torch
def closest_words(embeddings, vector, n = 10):
    distances = [(word, torch.dist(vector, get_vector(embeddings, word)).item())
                 for word in embeddings.itos]
    return sorted(distances, key = lambda w: w[1])[:n]
word_vector = get_vector(glove, 'korea')
closest_words(glove, word_vector, n=10) # glove here is the instance created above: glove = torchtext.vocab.GloVe(name = '6B', dim = 100)
[('korea', 0.0),
('korean', 5.22153377532959),
('pyongyang', 5.738825798034668),
('seoul', 5.8383941650390625),
('koreans', 5.877933025360107),
('south', 6.03314733505249),
('north', 6.1711201667785645),
('dprk', 6.172987937927246),
('japan', 6.599035739898682),
('kim', 6.714361667633057),
('pyahng', 7.071170806884766),
('china', 7.08419132232666),
(',', 7.363252639770508),
('yonhap', 7.434319019317627),
('noting', 7.435735702514648)]
Let's create another function to print the output above more nicely:
def print_tuples(tuples):
    for w, d in tuples:
        print(f'({d:02.04f}) {w}')
word_vector = get_vector(glove, 'sports')
print_tuples(closest_words(glove, word_vector, n=10))
(0.0000) sports
(5.1187) sport
(6.1492) sporting
(6.4893) basketball
(6.5769) baseball
(6.5860) soccer
(6.6995) additionally
(6.8040) espn
(6.8280) football
(6.8700) .
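Note that closest_words measures the distance to every word in the vocabulary one at a time in a Python loop, which is slow for a 400,000-word vocabulary. A vectorized sketch (my own alternative, not part of the original code) computes all distances in a single tensor operation:

def closest_words_fast(embeddings, vector, n = 10):
    # Euclidean distance from `vector` to every row of the embedding matrix at once
    distances = torch.norm(embeddings.vectors - vector, dim = 1)
    values, indices = distances.topk(n, largest = False)
    return [(embeddings.itos[i], d) for i, d in zip(indices.tolist(), values.tolist())]

print_tuples(closest_words_fast(glove, get_vector(glove, 'sports')))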
4. Analogies
def analogy(embeddings, word1, word2, word3, n=5):
    # get vectors for each word
    word1_vector = get_vector(embeddings, word1)
    word2_vector = get_vector(embeddings, word2)
    word3_vector = get_vector(embeddings, word3)
    # calculate analogy vector
    analogy_vector = word2_vector - word1_vector + word3_vector
    # find closest words to analogy vector
    candidate_words = closest_words(embeddings, analogy_vector, n+3)
    # filter out words already in analogy
    candidate_words = [(word, dist) for (word, dist) in candidate_words
                       if word not in [word1, word2, word3]][:n]
    print(f'{word1} is to {word2} as {word3} is to...')
    return candidate_words
print_tuples(analogy(glove, 'man', 'king', 'woman'))
man is to king as woman is to...
(4.0811) queen
(4.6429) monarch
(4.9055) throne
(4.9216) elizabeth
(4.9811) prince
analogy(glove, 'cat', 'kitten', 'dog')
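As before, wrapping the call in print_tuples displays the candidate words with their distances:

print_tuples(analogy(glove, 'cat', 'kitten', 'dog'))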
5. Correcting Spelling Mistakes
We need to load a larger set of word vectors here, because misspellings rarely appear in a small vocabulary.
Note that this set of vectors is large, about 2 GB to download.
glove = torchtext.vocab.GloVe(name = '840B', dim = 300)
glove.vectors.shape
torch.Size([2196017, 300])
word_vector = get_vector(glove, 'korea')
print_tuples(closest_words(glove, word_vector))
(0.0000) korea
(3.9857) taiwan
(4.4022) korean
(4.9016) asia
(4.9593) japan
(5.0721) seoul
(5.4058) thailand
(5.6025) singapore
(5.7010) russia
(5.7240) hong
Because these vectors were trained on a larger corpus, the nearest words are a little different from before.
First, let's look at which words are closest to a misspelled word:
word_vector = get_vector(glove, 'relieable') # correct spelling: reliable
print_tuples(closest_words(glove, word_vector))
(0.0000) relieable
(5.0366) relyable
(5.2610) realible
(5.4719) realiable
(5.5402) relable
(5.5917) relaible
(5.6412) reliabe
(5.8802) relaiable
(5.9593) stabel
(5.9981) consitant
Notice that the correct spelling, reliable, does not appear among the 10 closest words. Shouldn't it be nearby? Surprisingly, it is not.
Instead, the misspellings all seem to be shifted away from the correct word by a similar offset. Text containing misspellings is usually written informally, for example on Twitter, where correct spelling matters less, so misspellings tend to appear together in the corpus.
First we get the vector of the correctly spelled word, then take 8 misspellings of it,
and compute the average difference vector between the correct word's vector and the vectors of the 8 misspellings.
reliable_vector = get_vector(glove, 'reliable')
reliable_misspellings = ['relieable', 'relyable', 'realible', 'realiable',
                         'relable', 'relaible', 'reliabe', 'relaiable']
diff_reliable = [(reliable_vector - get_vector(glove, s)).unsqueeze(0)
                 for s in reliable_misspellings]
# average the difference vectors (correct vector minus misspelled vector)
misspelling_vector = torch.cat(diff_reliable, dim = 0).mean(dim = 0)
We can now correct a misspelling by adding this average difference vector to the misspelled word's vector, which should land close to the correctly spelled word:
word_vector = get_vector(glove, 'becuase')
print_tuples(closest_words(glove, word_vector + misspelling_vector))
(6.1090) because
(6.4250) even
(6.4358) fact
(6.4914) sure
(6.5094) though
(6.5601) obviously
(6.5682) reason
(6.5856) if
(6.6099) but
(6.6415) why
word_vector = get_vector(glove, 'defintiely')
print_tuples(closest_words(glove, word_vector + misspelling_vector))
(5.4070) definitely
(5.5643) certainly
(5.7192) sure
(5.8152) well
(5.8588) always
(5.8812) also
(5.9557) simply
(5.9667) consider
(5.9821) probably
(5.9948) definately