Word2vec
1. One-hot encoding
One-hot encoding
Suppose the vocabulary is:
$$V = (apple, going, I, home, machine, learning)$$

$$\begin{aligned} & apple = (1, 0, 0, 0, 0, 0) \\ & machine = (0, 0, 0, 0, 1, 0) \\ & learning = (0, 0, 0, 0, 0, 1) \\ & I, Going, Home = (0, 1, 1, 1, 0, 0) \end{aligned}$$
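Drawbacks of the sparse representation:
- Sparsity: each vector is as long as the vocabulary and almost entirely zeros;
- It cannot express similarity between words (the inner product of any two distinct words is 0);
- Weak expressive power.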
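The orthogonality problem is easy to see in a few lines of Python. This is a minimal sketch using the toy vocabulary above; the helper names `one_hot` and `dot` are introduced here for illustration only.

```python
# Toy vocabulary taken from the example above.
vocab = ["apple", "going", "I", "home", "machine", "learning"]


def one_hot(word, vocab):
    """Return the one-hot vector of `word` as a list of 0/1 ints."""
    return [1 if w == word else 0 for w in vocab]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


apple = one_hot("apple", vocab)
machine = one_hot("machine", vocab)
learning = one_hot("learning", vocab)

# Any two distinct words are orthogonal, so the inner product
# carries no similarity information at all.
print(dot(machine, learning))  # 0
print(dot(apple, apple))       # 1
```

Whatever pair of distinct words is chosen, the result is 0, which is why "machine" looks no closer to "learning" than to "apple".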
2. Distributed representation
Distributed representations are a core idea in machine learning.
$$\begin{aligned} & apple = (0.1, 0.3, 0.5, 0.1) \\ & machine = (0.2, 0.3, 0.1, 0.6) \\ & learning = (0.1, 0.2, 0.6, 0.1) \\ & I, Going, Home = (0.5, 1.1, 0.5, 0.2) \end{aligned}$$
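Advantages of the distributed representation:
- It can express how related words are to each other (semantic);
- Stronger expressive power (dense meaning): with real-valued components, even a low-dimensional dense vector can in principle represent infinitely many words;
- Strong generalization (global representation).
However, a distributed representation cannot be obtained by direct counting; it has to be produced by a learning algorithm.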
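With dense vectors, similarity becomes measurable. The sketch below computes cosine similarity on the illustrative vectors above; those numbers are made up for the example rather than learned, so only the mechanics matter, not which words come out closest.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative (not learned) vectors from the example above.
apple = [0.1, 0.3, 0.5, 0.1]
machine = [0.2, 0.3, 0.1, 0.6]
learning = [0.1, 0.2, 0.6, 0.1]

print(round(cosine(apple, learning), 3))
print(round(cosine(apple, machine), 3))
```

Unlike the one-hot case, different pairs now get different, non-zero scores.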
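3. How to Learn Word2Vec
The goal: words with high similarity should end up clustered together.
Motivation: the closer two words appear in text, the more similar they tend to be.
CBOW Model:
Predict the center word from the surrounding words.
Skip-Gram:
Predict the surrounding words from the current word.
Below we use Skip-Gram as the example; the derivation for CBOW is similar.
How many surrounding words to take on each side is a hyperparameter, window_size.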
$$sentence = (v_1, v_2, v_3, v_4, v_5, v_6)$$
Suppose $v_3$ is the center word and window_size=2. We want to maximize:
$$q(v_3) = p(v_1|v_3)\,p(v_2|v_3)\,p(v_4|v_3)\,p(v_5|v_3)$$
For the whole sentence, what we want to maximize is:
$$\arg\max_\theta \; q(v_1)\,q(v_2)\,q(v_3)\,q(v_4)\,q(v_5)\,q(v_6)$$
This can also be written as:
$$\arg\max_\theta \prod_{w \in sentence} \prod_{c \in neb(w)} p(c|w;\theta)$$
where $w$ is the center word and $c$ is a word in its context window $neb(w)$.
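The double product runs over every (center, context) pair in the sentence. As a rough sketch, the pairs can be enumerated like this; `skipgram_pairs` is a name made up for this example, and the window is simply clipped at the sentence boundaries.

```python
def skipgram_pairs(sentence, window_size):
    """Return all (center, context) pairs for every position in `sentence`."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window_size)                  # clip at the start
        hi = min(len(sentence), i + window_size + 1)  # clip at the end
        for j in range(lo, hi):
            if j != i:                                # skip the center itself
                pairs.append((center, sentence[j]))
    return pairs


sentence = ["v1", "v2", "v3", "v4", "v5", "v6"]
pairs = skipgram_pairs(sentence, window_size=2)

# Context of v3, matching the four factors of q(v3).
print([c for w, c in pairs if w == "v3"])  # ['v1', 'v2', 'v4', 'v5']
```

Words near the edges of the sentence get fewer context words, so their $q(\cdot)$ has fewer factors.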
Taking the logarithm gives:
$$\arg\max_\theta \sum_{w \in sentence} \sum_{c \in neb(w)} \log p(c|w;\theta)$$
The model parameters $\theta$ can be written as:
$$\theta = [U, V]$$
$U$ is a two-dimensional $N \times K$ matrix, where $N$ is the number of words in the vocabulary and $K$ is the dimension of each word vector; $V$ has the same shape as $U$. $V$ holds the center-word vectors and $U$ holds the context-word vectors.
Since $p(c|w;\theta)$ should be larger the more often $(c, w)$ appear together, we define it as a softmax over inner products:
$$p(c|w;\theta) = \frac{\exp(U_c \cdot V_w)}{\sum_{c'} \exp(U_{c'} \cdot V_w)}$$
where $c'$ ranges over all words in the vocabulary.
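To make the softmax concrete, here is a small plain-Python sketch. The parameter values are hypothetical toy numbers (3 words, 2 dimensions), and subtracting the maximum score is a standard numerical-stability trick that is not part of the formula itself.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_prob(c, w, U, V):
    """p(c | w) = exp(U_c . V_w) / sum over c' of exp(U_c' . V_w)."""
    scores = [dot(u_row, V[w]) for u_row in U]
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    return exps[c] / sum(exps)


# Hypothetical toy parameters: N = 3 words, K = 2 dimensions.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # context vectors
V = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # center vectors

probs = [softmax_prob(c, 0, U, V) for c in range(len(U))]
print(probs)
```

Note that every single probability needs a pass over all $N$ rows of $U$, which is what makes the exact objective expensive for a real vocabulary.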
So the final objective is:
$$\arg\max_\theta \sum_{w \in sentence} \sum_{c \in neb(w)} \left[ U_c \cdot V_w - \log \sum_{c'} \exp(U_{c'} \cdot V_w) \right]$$
Solving this directly with SGD is too expensive, because the normalization term sums over the entire vocabulary for every pair. The following methods make the optimization tractable:
- Negative Sampling
- Hierarchical Softmax
4. The Skip-Gram objective function
The objective can also be expressed in another form.
Using the same example as above:
$$sentence = (v_1, v_2, v_3, v_4, v_5, v_6)$$
With window_size=2, for the pair $(v_2, v_3)$ we can write:
$$p(y=1 \mid v_2, v_3) = \frac{1}{1 + \exp(-U_{v_2} \cdot V_{v_3})}$$
$$p(y=0 \mid v_2, v_3) = 1 - \frac{1}{1 + \exp(-U_{v_2} \cdot V_{v_3})}$$
Here $y=1$ means the pair is a true (context, center) pair and $y=0$ means it is not.
So the objective can be written as:
$$\arg\max_\theta \prod_{(w, c) \in D} p(y=1 \mid w, c; \theta) \prod_{(w, c) \in \hat D} p(y=0 \mid w, c; \theta)$$
where $w$ is the center word, $c$ is a context word, $D$ is the set of word pairs that are in a context relationship (the positive samples), and $\hat D$ is the set of word pairs that are not (the negative samples).
Substituting the sigmoid, the objective becomes:
$$\arg\max_\theta \prod_{(w, c) \in D} \frac{1}{1 + \exp(-U_c \cdot V_w)} \prod_{(w, c) \in \hat D} \left[1 - \frac{1}{1 + \exp(-U_c \cdot V_w)}\right]$$
Taking the logarithm:
$$\arg\max_\theta \sum_{(w, c) \in D} \log \frac{1}{1 + \exp(-U_c \cdot V_w)} + \sum_{(w, c) \in \hat D} \log \left[1 - \frac{1}{1 + \exp(-U_c \cdot V_w)}\right]$$
The number of pairs in $\hat D$ is far too large, so the negative samples are down-sampled (Negative Sampling).
Writing $\sigma(x) = 1/(1+e^{-x})$, the objective becomes:
$$\arg\max_\theta \sum_{(w, c) \in D} \left[ \log \sigma(U_c \cdot V_w) + \sum_{c' \notin neb(w)} \log \sigma(-U_{c'} \cdot V_w) \right]$$
where, for each positive pair, $c'$ ranges over a small number of sampled words outside $neb(w)$.
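For a single positive pair, the negative-sampling term can be evaluated as in the sketch below. It assumes randomly initialized toy parameters and draws the negatives uniformly for simplicity; practical implementations usually sample from a smoothed unigram distribution instead. All names are illustrative.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def neg_sampling_objective(w, c, negatives, U, V):
    """log sigma(U_c . V_w) + sum over sampled c' of log sigma(-U_c' . V_w)."""
    positive = log_sigmoid(dot(U[c], V[w]))
    negative = sum(log_sigmoid(-dot(U[n], V[w])) for n in negatives)
    return positive + negative


random.seed(0)
N, K = 6, 4  # hypothetical vocabulary size and embedding dimension
U = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(N)]
V = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(N)]

w, c = 2, 3                                       # one positive (center, context) pair
candidates = [i for i in range(N) if i not in (w, c)]
negatives = random.sample(candidates, 2)          # uniform sampling: an assumption
print(neg_sampling_objective(w, c, negatives, U, V))
```

The cost per pair now depends on the number of sampled negatives rather than on the vocabulary size.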
Tip:
$$\frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{x}}$$
(multiply numerator and denominator by $e^{x}$), that is, $1 - \sigma(x) = \sigma(-x)$.
Differentiating the per-pair objective $l(\theta) = \log \sigma(U_c \cdot V_w) + \sum_{c' \notin neb(w)} \log \sigma(-U_{c'} \cdot V_w)$ with respect to the parameters:
$$\frac{\partial l(\theta)}{\partial U_c} = \frac{\sigma(U_c \cdot V_w)[1-\sigma(U_c \cdot V_w)]\, V_w}{\sigma(U_c \cdot V_w)} = [1-\sigma(U_c \cdot V_w)]\, V_w$$
Similarly:
$$\frac{\partial l(\theta)}{\partial U_{c'}} = -[1-\sigma(-U_{c'} \cdot V_w)]\, V_w = -\sigma(U_{c'} \cdot V_w)\, V_w$$
$$\frac{\partial l(\theta)}{\partial V_w} = [1-\sigma(U_c \cdot V_w)]\, U_c - \sum_{c' \notin neb(w)} \sigma(U_{c'} \cdot V_w)\, U_{c'}$$
After that, the parameters are simply updated along these gradients.
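A quick way to sanity-check such derivatives is to compare them with finite differences. The sketch below does this for the per-pair objective $l = \log \sigma(U_c \cdot V_w) + \sum_{c'} \log \sigma(-U_{c'} \cdot V_w)$, using the closed forms $[1-\sigma(U_c \cdot V_w)]V_w$, $-\sigma(U_{c'} \cdot V_w)V_w$ and $[1-\sigma(U_c \cdot V_w)]U_c - \sum_{c'} \sigma(U_{c'} \cdot V_w)U_{c'}$. The vectors are arbitrary toy values, and the helper names are made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(U_c, U_negs, V_w):
    """l = log sigma(U_c . V_w) + sum over negatives of log sigma(-U_c' . V_w)."""
    pos = math.log(sigmoid(dot(U_c, V_w)))
    neg = sum(math.log(sigmoid(-dot(u, V_w))) for u in U_negs)
    return pos + neg


def analytic_grads(U_c, U_negs, V_w):
    """Closed-form gradients of l with respect to U_c, each U_c', and V_w."""
    s_pos = sigmoid(dot(U_c, V_w))
    g_Uc = [(1 - s_pos) * v for v in V_w]
    g_Unegs = [[-sigmoid(dot(u, V_w)) * v for v in V_w] for u in U_negs]
    g_Vw = [(1 - s_pos) * U_c[k]
            - sum(sigmoid(dot(u, V_w)) * u[k] for u in U_negs)
            for k in range(len(V_w))]
    return g_Uc, g_Unegs, g_Vw


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of scalar function f at list x."""
    grad = []
    for k in range(len(x)):
        hi = x[:k] + [x[k] + eps] + x[k + 1:]
        lo = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


# Arbitrary toy vectors: one positive context, two negatives.
U_c = [0.2, -0.4]
U_negs = [[0.5, 0.1], [-0.3, 0.7]]
V_w = [0.6, -0.1]

g_Uc, g_Unegs, g_Vw = analytic_grads(U_c, U_negs, V_w)
num_Uc = numeric_grad(lambda x: objective(x, U_negs, V_w), U_c)
num_Un0 = numeric_grad(lambda x: objective(U_c, [x, U_negs[1]], V_w), U_negs[0])
num_Vw = numeric_grad(lambda x: objective(U_c, U_negs, x), V_w)

max_err = max(abs(a - b) for a, b in
              zip(g_Uc + g_Unegs[0] + g_Vw, num_Uc + num_Un0 + num_Vw))
print(max_err)
```

If the analytic and numeric values disagree by more than rounding noise, a sign or a $\sigma(x)$ versus $\sigma(-x)$ has usually been mixed up.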
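5. Evaluating word vectors
- Reduce the word vectors to two dimensions (t-SNE) and inspect them visually, for example to check whether similar words lie close together;
- Sample word pairs and compute their similarity, for example cosine similarity;
- Analogy tests, for example woman : man, girl : ?, then check how close the predicted ? is to boy.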
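The analogy test can be sketched with vector arithmetic plus a nearest-neighbour search. The vectors below are hand-made toy values (dimensions loosely standing for gender, age and "fruitness"), not trained embeddings, and `analogy` is a helper name invented for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors):
    """Solve a : b = c : ? as the nearest cosine neighbour of b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors, for illustration only.
vectors = {
    "man": [1.0, 1.0, 0.0],
    "woman": [-1.0, 1.0, 0.0],
    "boy": [1.0, -1.0, 0.0],
    "girl": [-1.0, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

print(analogy("woman", "man", "girl", vectors))  # boy
```

On real embeddings the same procedure is run over a benchmark list of analogies and the fraction answered correctly is reported.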
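6. Word vectors in recommender systems
Words that are close together are strongly related. Content-based recommendation has the same need: the product representations should express how related products are.
Paper: Real-time Personalization using Embeddings for Search Ranking at Airbnb
It deals with listing recommendation on a short-term rental site.
The traditional approach is feature engineering,
for example: room size, number of rooms, address, and so on.
This kind of representation can be viewed as analogous to one-hot encoding.
With word vectors, listings can instead be encoded densely, where the individual dimensions have no specific meaning.
The core of the paper is the Skip-Gram model.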
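Data unit: the session
User 1: listing 001, listing 002, listing 003;
User 2: listing 212, listing 993, listing 889, listing 989;
…
Listings that a user browses within a short period are highly similar to one another, which is the same logic as words within one sentence being similar.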
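The paper makes a few modifications:
- If a booking takes place, the booked listing is treated as context for all the other listings the user viewed in that search session;
- Randomly sampled negatives come almost entirely from other regions, while positives are almost always from the same region, so the paper additionally samples negatives from the same region.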
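The first modification can be sketched as a small change to ordinary Skip-Gram pair generation. This is only an illustration of the idea, not the paper's implementation; the function name, the listing identifiers and the choice of window size are all assumptions.

```python
def session_pairs(session, window_size, booked=None):
    """Return (center, context) pairs for one browsing session.

    If `booked` is given, it is added as a context of every other listing
    in the session, regardless of how far away it is in the click sequence.
    """
    pairs = []
    for i, center in enumerate(session):
        lo = max(0, i - window_size)
        hi = min(len(session), i + window_size + 1)
        context = [session[j] for j in range(lo, hi) if j != i]
        # Global context: the booked listing, unless it is already included.
        if booked is not None and booked != center and booked not in context:
            context.append(booked)
        pairs.extend((center, c) for c in context)
    return pairs


# Hypothetical session that ends with a booking of the last listing.
session = ["listing_212", "listing_993", "listing_889", "listing_989"]
plain = session_pairs(session, window_size=1)
with_booking = session_pairs(session, window_size=1, booked="listing_989")

print(len(plain), len(with_booking))  # 6 8
```

With the booked listing as global context, even the first click of the session is pulled toward the listing that was eventually booked.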
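7. Drawbacks of Skip-Gram
Drawbacks:
- Each word gets a single static vector, so the sentence a word actually appears in is not taken into account;
- The window length is limited, so global information cannot be captured;
- Low-frequency words and out-of-vocabulary (OOV) words cannot be learned effectively;
- Word order is ignored;
- The different senses of a polysemous word cannot be told apart.
The context problem can be addressed with ELMo and BERT;
low-frequency and OOV words can be handled with subword embeddings.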
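The intuition behind subword embeddings is that an unseen word still shares character n-grams with known words. Below is a minimal sketch of n-gram extraction in the style used by fastText-like models; the boundary markers and the n-gram range chosen here are assumptions for illustration.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of `word`, with '<' and '>' marking word boundaries."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("home", 3, 3))  # ['<ho', 'hom', 'ome', 'me>']

# An OOV word still overlaps heavily with a word seen during training.
shared = set(char_ngrams("learning")) & set(char_ngrams("learnings"))
print(sorted(shared))
```

A word vector can then be composed from the vectors of its n-grams, so rare and unseen words inherit information from the pieces they share with frequent words.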
8. PyTorch implementation
import numpy as np
import torch
from torch import nn, optim
import random
from collections import Counter

"""
- Data preprocessing
- Build the loss function and the network
- Train the model
"""

"""
Read the file
"""
def get_data(file_name):
    with open(file_name, encoding='utf-8') as f:
        text = f.read()
    return text


text = get_data('test.txt')
print(text)

"""
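Preprocessing
"""
def preprocess(text, freq=0):
    # Lowercase everything
    text = text.lower()
    # Replace punctuation with a token; the surrounding spaces keep the token
    # separate from the neighbouring word (other symbols can be added here)
    text = text.replace('.', ' <PERIOD> ')
    # Split on whitespace (sufficient for English text)
    words = text.split()
    # Count the occurrences of each word
    word_count = Counter(words)
    # Drop words that occur `freq` times or fewer
    trimmed_words = [word for word in words if word_count[word] > freq]
    return trimmed_words
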
"""
Preparation: vocabulary, index mappings, and the training text
"""
def prepair_train_data(text):
    words = preprocess(text)
    # Sort so that the word <-> index mapping is reproducible across runs
    vocab = sorted(set(words))
    # word -> index
    vocab2index = {w: c for c, w in enumerate(vocab)}
    # index -> word
    index2vocab = {c: w for c, w in enumerate(vocab)}
    # Convert every word in the text to its index
    index_words = [vocab2index[w] for w in words]
    index_word_counts = Counter(index_words)
    # Total number of tokens
    total_count = len(index_words)
    # Relative frequency of each word
    word_freqs = {w: c / total_count for w, c in index_word_counts.items()}
    # Probability of dropping each word (subsampling of frequent words)
    t = 1e-5
    prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in index_word_counts}
    # Subsampling is disabled here; uncomment the next line to apply it
    # train_words = [w for w in index_words if random.random() < (1 - prob_drop[w])]
    train_words = index_words
    return train_words, index2vocab, vocab2index, word_freqs

"""
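Compute the word probability distribution used for negative sampling
"""
def cal_distribution(word_freqs):
    # Order the frequencies by word index so that position i of the
    # distribution corresponds to word index i (torch.multinomial returns positions)
    freqs = np.array([word_freqs[i] for i in sorted(word_freqs)])
    unigram_dist = freqs / freqs.sum()
    # Raise the unigram distribution to the 3/4 power and renormalize
    noise_dist = torch.from_numpy(unigram_dist ** 0.75 / np.sum(unigram_dist ** 0.75))
    return noise_dist
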
"""
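Get the surrounding words (targets) of a position
"""
def get_target(words, idx, window_size=5):
    # Sample the actual window radius at random, up to window_size
    # (lower bound 1 so that window_size=1 also works)
    target_window = np.random.randint(1, window_size + 1)
    # Start index, clipped at the beginning of the list
    start_index = idx - target_window if (idx - target_window) > 0 else 0
    # End index (slicing clips it at the end of the list)
    end_point = idx + target_window
    # Collect the distinct words on both sides of idx
    targets = set(words[start_index: idx] + words[idx + 1: end_point + 1])
    return list(targets)
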
"""
Batch iterator
"""
def get_batch(words, batch_size, window_size):
    # Number of full batches the text can be split into
    n_batches = len(words) // batch_size
    # Truncate so that the length is divisible by batch_size
    words = words[: n_batches * batch_size]
    for idx in range(0, len(words), batch_size):
        batch_x, batch_y = [], []
        # Take one batch of words
        batch = words[idx: idx + batch_size]
        # Use each word in the batch as the center word and collect its context
        for i in range(len(batch)):
            x = batch[i]
            y = get_target(batch, i, window_size)
            # Repeat x so that batch_x and batch_y have equal length
            batch_x.extend([x] * len(y))
            batch_y.extend(y)
        yield batch_x, batch_y

"""
Network definition
"""
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist=None):
        """
        :param n_vocab: vocabulary size
        :param n_embed: embedding dimension
        :param noise_dist: noise distribution used for negative sampling
        """
        super().__init__()
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        # Embeddings for the input (center) and output (context) words
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)
        # Initialize the parameters to help convergence
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)

    def forward_input(self, input_words):
        # Look up the center-word embeddings
        input_vectors = self.in_embed(input_words)
        return input_vectors

    def forward_output(self, output_words):
        # Look up the context-word embeddings
        output_vectors = self.out_embed(output_words)
        return output_vectors

    def forward_noise(self, batch_size, n_sample):
        """Generate noise vectors, shape (batch_size, n_sample, n_embed)."""
        if self.noise_dist is None:
            # Sample words uniformly
            noise_dist = torch.ones(self.n_vocab)
        else:
            noise_dist = self.noise_dist
        # Draw word indices from the (unnormalized) multinomial distribution
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_sample,
                                        replacement=True)
        noise_vect = self.out_embed(noise_words).view(batch_size, n_sample, self.n_embed)
        return noise_vect

"""
Loss function
"""
class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input_vector, output_vector, noise_vectors):
        batch_size, embed_size = input_vector.shape
        # Reshape so the vectors can be multiplied as batches of matrices
        input_vector = input_vector.view(batch_size, embed_size, 1)
        output_vector = output_vector.view(batch_size, 1, embed_size)
        # bmm = batch matrix multiplication
        # Log-sigmoid loss for the true (center, context) pairs;
        # logsigmoid is numerically safer than sigmoid().log()
        out_loss = nn.functional.logsigmoid(torch.bmm(output_vector, input_vector))
        out_loss = out_loss.view(batch_size)
        # Log-sigmoid loss for the negative samples, summed over the samples
        noise_loss = nn.functional.logsigmoid(torch.bmm(noise_vectors.neg(), input_vector))
        noise_loss = noise_loss.squeeze(2).sum(1)
        return -(out_loss + noise_loss).mean()

"""
Train the model
"""
def train_model():
    train_words, index2vocab, vocab2index, word_freqs = prepair_train_data(text)
    noise_dist = cal_distribution(word_freqs)
    # Initialize the model
    embedding_dim = 300
    model = SkipGramNeg(len(vocab2index), embedding_dim, noise_dist)
    # Loss function and optimizer
    criterion = NegativeSamplingLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.003)
    print_every = 1
    steps = 0
    epoch = 5
    batch_size = 50
    n_samples = 5
    for e in range(epoch):
        for input_words, target_words in get_batch(train_words, batch_size, window_size=5):
            steps += 1
            inputs, targets = torch.LongTensor(input_words), torch.LongTensor(target_words)
            input_vectors = model.forward_input(inputs)
            output_vectors = model.forward_output(targets)
            current_batch_size = len(inputs)
            noise_vectors = model.forward_noise(current_batch_size, n_samples)
            loss = criterion(input_vectors, output_vectors, noise_vectors)
            if steps % print_every == 0:
                print('loss', loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


if __name__ == '__main__':
    train_model()
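Two formulas in the listing are easier to inspect in isolation: the drop probability $1 - \sqrt{t / f(w)}$ used for subsampling frequent words, and the noise distribution proportional to $f(w)^{0.75}$. The plain-Python sketch below reproduces both on a made-up token list. The clamp to 0 and the large threshold `t=0.05` are choices made for this tiny example; they are not taken from the code above.

```python
import math
from collections import Counter


def drop_probabilities(words, t=1e-5):
    """P(drop w) = 1 - sqrt(t / f(w)), clamped at 0 for rare words."""
    counts = Counter(words)
    total = len(words)
    return {w: max(0.0, 1 - math.sqrt(t / (c / total)))
            for w, c in counts.items()}


def noise_distribution(words, power=0.75):
    """Unigram distribution raised to `power` and renormalized."""
    counts = Counter(words)
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}


# Made-up corpus: one very frequent word, one medium, one rare.
words = ["the"] * 90 + ["embedding"] * 9 + ["skipgram"]

drop = drop_probabilities(words, t=0.05)
noise = noise_distribution(words)

print(drop)
print(noise)
```

The power of 0.75 flattens the distribution: the frequent word is sampled as a negative less often than its raw frequency of 0.9 would suggest, and the rare word more often than 0.01.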