Overall structure of the paper:
I. Abstract
Proposes GloVe, a new word-vector learning method that combines global co-occurrence statistics with local context information.
1. Current word-vector models can capture syntactic and semantic regularities through vector arithmetic, but the origin of these regularities has remained opaque.
2. Careful analysis identifies a property of the co-occurrence statistics that gives rise to these word-vector regularities; on this basis, a new log-bilinear regression model is proposed that combines the advantages of global matrix factorization and local context windows.
3. The model trains efficiently by fitting only the nonzero entries of the co-occurrence matrix.
4. The model reaches 75% accuracy on the word analogy task and achieves state-of-the-art results on several other tasks.
II. Introduction
Reviews the two prior families of methods; matrix factorization and word2vec each have their own strengths.
Matrix Factorization Methods
Word co-occurrence matrix, window size = 1
1. I enjoy flying
2. I like NLP.
3. I like deep learning.
           I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0      0        0     0     0
like       2   0     0      1      0        1     0     0
enjoy      1   0     0      0      0        0     1     0
deep       0   1     0      0      1        0     0     0
learning   0   0     0      1      0        0     0     1
NLP        0   1     0      0      0        0     0     1
flying     0   0     1      0      0        0     0     1
.          0   0     0      0      1        1     1     0
Vocabulary size |V|; the co-occurrence matrix X therefore has size |V| × |V|.
Drawback: performs poorly on word analogy tasks; the vectors capture little semantic structure.
Shallow window-based methods (context-based word-vector learning)
Drawback of word2vec: it uses only the context within a window and ignores global co-occurrence statistics.
III. Related Work
Reviews matrix-factorization and word2vec-style methods.
IV. The GloVe Model
Derivation of GloVe, its relationship to other models, and complexity analysis.
Note: i and j index the co-occurrence matrix; the word vectors w_i and w_j are the parameters being learned, and b_i and b_j are their biases. The regression target is the log of the co-occurrence count at position (i, j), so this is a regression problem: the loss is the squared difference between the model's prediction and the target, and training is the process of driving this loss down.
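Written out, this regression is the weighted least-squares objective from the paper, with f the weighting function that caps the influence of very frequent pairs:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij})\left( w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x/x_{\max})^{\alpha} & x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```

Since f(0) = 0, only the nonzero entries of X contribute to the sum, which is what makes training efficient.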
V. Experiments
Experimental results and hyperparameter analysis
Word analogy results
Named entity recognition results
Hyperparameter analysis: vector dimension and window size
Hyperparameter analysis: corpus size
Comparison with word2vec
VI. Conclusion
Key points:
1. Matrix-factorization methods for learning word vectors
2. Context-window methods for learning word vectors
3. Pre-trained word vectors
Innovations:
1. A new word-vector training model, GloVe
2. State-of-the-art results on multiple tasks
3. A released set of pre-trained word vectors
Insights:
1. Compared with raw probabilities, ratios of co-occurrence probabilities are better at distinguishing relevant words from irrelevant ones, and at discriminating between two relevant words.
2. A new log-bilinear regression model that combines the advantages of global matrix factorization and local context windows.
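The ratio property in insight 1 is the starting point of the paper's derivation: for i = ice and j = steam, the ratio P_ik/P_jk is large when k relates to ice only (solid), small when k relates to steam only (gas), and close to 1 when k relates to both (water) or neither (fashion). GloVe then seeks a function of the word vectors that reproduces this ratio:

```latex
F\!\left( w_i - w_j,\; \tilde{w}_k \right) = \frac{P_{ik}}{P_{jk}},
\qquad
P_{ik} = P(k \mid i) = \frac{X_{ik}}{X_i}
```

Requiring F to be well-behaved under vocabulary symmetries leads to F = exp and, after taking logs, to the regression target log X_ij used in the loss.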
VII. Code Implementation
# ****** Data processing ******
# encoding = 'utf-8'
import numpy as np
import pickle

min_count = 50  # minimum word frequency
data = open("./data/text8.txt").read()
data = data.split()

# Build word frequencies and drop low-frequency words
word2freq = {}
for word in data:
    if word in word2freq:
        word2freq[word] += 1
    else:
        word2freq[word] = 1
word2id = {}
for word in word2freq:
    if word2freq[word] < min_count:
        continue
    if word not in word2id:
        word2id[word] = len(word2id)

# Build the co-occurrence matrix
vocab_size = len(word2id)
comat = np.zeros((vocab_size, vocab_size))
window_size = 2  # sliding-window size
for i in range(len(data)):
    if i % 1000000 == 0:
        print(i, len(data))
    if data[i] not in word2id:
        continue
    w_index = word2id[data[i]]
    for j in range(max(0, i - window_size), min(len(data), i + window_size + 1)):
        if data[j] not in word2id or i == j:
            continue
        u_index = word2id[data[j]]
        comat[w_index][u_index] += 1
coocs = np.transpose(np.nonzero(comat))  # indices of the nonzero entries

# Build the training set: one (i, j) pair per nonzero entry, labeled with its count
labels = []
for i in range(len(coocs)):
    if i % 1000000 == 0:
        print(i, len(coocs))
    labels.append(comat[coocs[i][0]][coocs[i][1]])
labels = np.array(labels)
np.save("./data/data.npy", coocs)
np.save('./data/label.npy', labels)
pickle.dump(word2id, open("./data/word2id", 'wb'))
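The dense |V| × |V| matrix above can exceed memory for large vocabularies. A common alternative (a sketch, not part of the original code) is to accumulate counts in a dictionary, so only the nonzero pairs are ever stored:

```python
from collections import defaultdict

def cooccurrence_pairs(tokens, word2id, window_size=2):
    """Count co-occurrences sparsely; returns {(w_index, u_index): count}."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w not in word2id:
            continue
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in word2id:
                counts[(word2id[w], word2id[tokens[j]])] += 1
    return counts

pairs = cooccurrence_pairs("a b a c".split(), {"a": 0, "b": 1, "c": 2}, window_size=1)
print(pairs[(0, 1)])  # 2.0: "a" and "b" are adjacent twice
```

The keys and values of this dictionary play the same roles as `coocs` and `labels` above.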
# ***** Model *****
import torch
import torch.nn as nn

class glove_model(nn.Module):
    def __init__(self, vocab_size, embed_size, x_max, alpha):
        super(glove_model, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.x_max = x_max
        self.alpha = alpha
        self.w_embed = nn.Embedding(self.vocab_size, self.embed_size)  # center-word vectors
        self.w_bias = nn.Embedding(self.vocab_size, 1)                 # center-word bias
        self.v_embed = nn.Embedding(self.vocab_size, self.embed_size)  # context-word vectors
        self.v_bias = nn.Embedding(self.vocab_size, 1)                 # context-word bias

    def forward(self, w_data, v_data, labels):
        w_data_embed = self.w_embed(w_data)
        w_data_bias = self.w_bias(w_data).squeeze(1)  # (batch, 1) -> (batch,)
        v_data_embed = self.v_embed(v_data)
        v_data_bias = self.v_bias(v_data).squeeze(1)
        weights = torch.pow(labels / self.x_max, self.alpha)  # f(X_ij) weighting
        weights[weights > 1] = 1
        loss = torch.mean(weights * torch.pow(
            torch.sum(w_data_embed * v_data_embed, 1)
            + w_data_bias + v_data_bias - torch.log(labels), 2))
        return loss

    def save_embedding(self, word2id, file_name):
        # Average the center and context vectors, as in the paper
        embedding_1 = self.w_embed.weight.data.cpu().numpy()
        embedding_2 = self.v_embed.weight.data.cpu().numpy()
        embedding = (embedding_1 + embedding_2) / 2
        fout = open(file_name, 'w')
        fout.write("%d %d\n" % (len(word2id), self.embed_size))
        for w, wid in word2id.items():
            e = ' '.join(str(x) for x in embedding[wid])
            fout.write('%s %s\n' % (w, e))
        fout.close()
# Quick smoke test on toy inputs
model = glove_model(100, 100, 100, 0.75)
word2id = dict()
for i in range(100):
    word2id[str(i)] = i
w_data = torch.Tensor([0, 0, 1, 1, 1]).long()
v_data = torch.Tensor([1, 2, 0, 2, 3]).long()
labels = torch.Tensor([1, 2, 3, 4, 5])
model.forward(w_data, v_data, labels)
embedding_1 = model.w_embed.weight.data.cpu().numpy()
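The weighting used in forward, f(x) = (x/x_max)^alpha capped at 1, can be checked in isolation (a NumPy stand-in for the torch version above, with x_max = 100 and alpha = 0.75 as in the paper):

```python
import numpy as np

x_max, alpha = 100.0, 0.75
counts = np.array([1.0, 50.0, 100.0, 400.0])
weights = np.power(counts / x_max, alpha)
weights[weights > 1] = 1  # counts at or above x_max all get weight 1
print(weights)  # [0.03162278 0.59460356 1.         1.        ]
```

Rare pairs are down-weighted smoothly, while very frequent pairs cannot dominate the loss.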
# ***** Training *****
from data import Wiki_Dataset
from model import glove_model
import torch
import numpy as np
import torch.optim as optim
from tqdm import tqdm
import config as argumentparser

config = argumentparser.ArgumentParser()
if config.cuda and torch.cuda.is_available():
    torch.cuda.set_device(config.gpu)
wiki_dataset = Wiki_Dataset(min_count=config.min_count, window_size=config.window_size)
training_iter = torch.utils.data.DataLoader(dataset=wiki_dataset, batch_size=config.batch_size,
                                            shuffle=True, num_workers=2)
model = glove_model(len(wiki_dataset.word2id), config.embed_size, config.x_max, config.alpha)
if config.cuda and torch.cuda.is_available():
    model.cuda()
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
loss = -1
for epoch in range(config.epoch):
    process_bar = tqdm(training_iter)
    for data, label in process_bar:
        w_data = torch.Tensor(np.array([sample[0] for sample in data])).long()
        v_data = torch.Tensor(np.array([sample[1] for sample in data])).long()
        if config.cuda and torch.cuda.is_available():
            w_data = w_data.cuda()
            v_data = v_data.cuda()
            label = label.cuda()
        loss_now = model(w_data, v_data, label)
        if loss == -1:
            loss = loss_now.data.item()
        else:
            loss = 0.95 * loss + 0.05 * loss_now.data.item()  # exponential moving average for display
        process_bar.set_postfix(loss=loss)
        process_bar.update()
        optimizer.zero_grad()
        loss_now.backward()
        optimizer.step()
model.save_embedding(wiki_dataset.word2id, "./data/glove_embedding.txt")  # output path chosen here
Full code: https://github.com/wangtao666666/NLP/tree/master/Glove