Advanced Optimization Algorithms
Momentum Algorithm
Principle
\begin{aligned} \boldsymbol{m}_t &\leftarrow \beta \boldsymbol{m}_{t-1} + \eta_t \boldsymbol{g}_t, \\ \boldsymbol{x}_t &\leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{m}_t, \end{aligned}
An equivalent formulation:
\begin{aligned} \boldsymbol{m}_t &\leftarrow \beta \boldsymbol{m}_{t-1} + (1-\beta) \boldsymbol{g}_t, \\ \boldsymbol{x}_t &\leftarrow \boldsymbol{x}_{t-1} - \alpha_t \boldsymbol{m}_t, \end{aligned}\qquad \alpha_t = \frac{\eta_t}{1-\beta}
Implementation
def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + eta * 0.2 * x1  # 0.2 * x1 and 4 * x2 are the gradients of f(x1, x2) = 0.1*x1**2 + 2*x2**2
    v2 = beta * v2 + eta * 4 * x2
    return x1 - v1, x2 - v2, v1, v2
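A hypothetical driver (not part of the original notes) that iterates momentum_2d on this 2-D objective and records the trajectory; the values of eta and beta are illustrative, and momentum_2d reads them as globals.

eta, beta = 0.6, 0.5

def trace_2d(trainer, steps=20):
    x1, x2, v1, v2 = -5.0, -2.0, 0.0, 0.0   # an arbitrary starting point and zero initial velocity
    results = [(x1, x2)]
    for _ in range(steps):
        x1, x2, v1, v2 = trainer(x1, x2, v1, v2)
        results.append((x1, x2))
    return results

print(trace_2d(momentum_2d)[-1])  # should move toward the minimum at (0, 0)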
Understanding momentum via exponentially weighted moving averages (EWMA)
\begin{aligned} y_t &= (1-\beta) x_t + \beta y_{t-1}\\ &= (1-\beta)x_t + (1-\beta) \cdot \beta x_{t-1} + \beta^2y_{t-2}\\ &= (1-\beta)x_t + (1-\beta) \cdot \beta x_{t-1} + (1-\beta) \cdot \beta^2x_{t-2} + \beta^3y_{t-3}\\ &= (1-\beta) \sum_{i=0}^{t} \beta^{i}x_{t-i} \end{aligned}
For $\beta = 0.95$, terms more than about $1/(1-\beta)=20$ steps back carry negligible weight, so
y_t \approx 0.05 \sum_{i=0}^{19} 0.95^i x_{t-i}.
Rewriting the momentum update in this form,
\boldsymbol{m}_t \leftarrow \beta \boldsymbol{m}_{t-1} + (1 - \beta) \left(\frac{\eta_t}{1 - \beta} \boldsymbol{g}_t\right),
i.e. the velocity $\boldsymbol{m}_t$ is an exponentially weighted moving average of the scaled gradients $\frac{\eta_t}{1-\beta}\boldsymbol{g}_t$.
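A small numerical illustration (not from the original notes) of why terms older than about $1/(1-\beta)$ steps can be ignored:

# With beta = 0.95, the EWMA weight (1 - beta) * beta**i on the observation from
# i steps ago decays geometrically, so most of the weight sits on the latest ~20 terms.
beta_ewma = 0.95
weights = [(1 - beta_ewma) * beta_ewma ** i for i in range(40)]
print(sum(weights[:20]))                      # ~0.64 of the total weight
print(weights[0], weights[19], weights[39])   # 0.05, ~0.019, ~0.0068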
def init_momentum_states():
    v_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
    v_b = torch.zeros(1, dtype=torch.float32)
    return (v_w, v_b)
def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        v.data = hyperparams['momentum'] * v.data + hyperparams['lr'] * p.grad.data
        p.data -= v.data
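All of the from-scratch optimizers in this section share this interface (a params list, a states tuple, a hyperparams dict). A minimal sketch of how they might be driven; `features` and `labels` are the same tensors the init functions already assume, and the linear model and squared loss below are illustrative stand-ins, not the original training utilities.

import torch

def train_from_scratch(optimizer_fn, init_states_fn, hyperparams,
                       features, labels, batch_size=10, num_epochs=2):
    # linear-regression parameters matching the shapes used by the state initializers
    w = torch.zeros((features.shape[1], 1), dtype=torch.float32, requires_grad=True)
    b = torch.zeros(1, dtype=torch.float32, requires_grad=True)
    states = init_states_fn()
    dataset = torch.utils.data.TensorDataset(features, labels)
    data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
    for _ in range(num_epochs):
        for X, y in data_iter:
            l = ((torch.mm(X, w) + b - y.view(-1, 1)) ** 2 / 2).mean()  # squared loss
            if w.grad is not None:
                w.grad.data.zero_()
                b.grad.data.zero_()
            l.backward()
            optimizer_fn([w, b], states, hyperparams)  # e.g. sgd_momentum
    return w, b

# hypothetical usage:
# train_from_scratch(sgd_momentum, init_momentum_states,
#                    {'lr': 0.02, 'momentum': 0.5}, features, labels)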
Concise implementation
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
AdaGrad
Principle
\boldsymbol{s}_t \leftarrow \boldsymbol{s}_{t-1} + \boldsymbol{g}_t \odot \boldsymbol{g}_t,\\ \boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \frac{\eta}{\sqrt{\boldsymbol{s}_t + \epsilon}} \odot \boldsymbol{g}_t,
Problems
1. If the partial derivative of the objective with respect to some element of the variable is consistently large, that element's learning rate decays quickly;
2. If the partial derivative with respect to some element is consistently small, that element's learning rate decays slowly. However, since $\boldsymbol{s}_t$ keeps accumulating the element-wise squared gradients, the learning rate of every element is non-increasing throughout the iterations. Consequently, if the learning rate drops quickly in the early iterations while the current solution is still poor, AdaGrad may struggle to make further progress later on because the learning rate has become too small.
Implementation
def init_adagrad_states():
    s_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
    s_b = torch.zeros(1, dtype=torch.float32)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        s.data += (p.grad.data**2)
        p.data -= hyperparams['lr'] * p.grad.data / torch.sqrt(s + eps)
Concise implementation
optimizer = torch.optim.Adagrad(net.parameters(), lr=1e-2)
RMSprop
Principle
\boldsymbol{v}_t \leftarrow \beta \boldsymbol{v}_{t-1} + (1 - \beta) \boldsymbol{g}_t \odot \boldsymbol{g}_t,\\ \boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \frac{\alpha}{\sqrt{\boldsymbol{v}_t + \epsilon}} \odot \boldsymbol{g}_t,
Implementation
def init_rmsprop_states():
    s_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
    s_b = torch.zeros(1, dtype=torch.float32)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['beta'], 1e-6
    for p, s in zip(params, states):
        s.data = gamma * s.data + (1 - gamma) * (p.grad.data)**2
        p.data -= hyperparams['lr'] * p.grad.data / torch.sqrt(s + eps)
Concise implementation
optimizer = torch.optim.RMSprop(net.parameters(), lr = LR, alpha = 0.9)
AdaDelta
Principle
\boldsymbol{s}_t \leftarrow \rho \boldsymbol{s}_{t-1} + (1 - \rho) \boldsymbol{g}_t \odot \boldsymbol{g}_t,\\ \boldsymbol{g}_t' \leftarrow \sqrt{\frac{\Delta\boldsymbol{x}_{t-1} + \epsilon}{\boldsymbol{s}_t + \epsilon}} \odot \boldsymbol{g}_t,\\ \boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}'_t,\\ \Delta\boldsymbol{x}_t \leftarrow \rho \Delta\boldsymbol{x}_{t-1} + (1 - \rho) \boldsymbol{g}'_t \odot \boldsymbol{g}'_t.
Ignoring the effect of $\epsilon$, AdaDelta differs from RMSProp in that it uses $\sqrt{\Delta\boldsymbol{x}_{t-1}}$ in place of the learning-rate hyperparameter $\eta$.
Implementation
def init_adadelta_states():
    s_w, s_b = torch.zeros((features.shape[1], 1), dtype=torch.float32), torch.zeros(1, dtype=torch.float32)
    delta_w, delta_b = torch.zeros((features.shape[1], 1), dtype=torch.float32), torch.zeros(1, dtype=torch.float32)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        s[:] = rho * s + (1 - rho) * (p.grad.data**2)
        g = p.grad.data * torch.sqrt((delta + eps) / (s + eps))
        p.data -= g
        delta[:] = rho * delta + (1 - rho) * g * g
Concise implementation
optimizer = torch.optim.Adadelta(net.parameters(), rho=0.9)
Adam
Principle
\boldsymbol{m}_t \leftarrow \beta_1 \boldsymbol{m}_{t-1} + (1 - \beta_1) \boldsymbol{g}_t,\\ \boldsymbol{v}_t \leftarrow \beta_2 \boldsymbol{v}_{t-1} + (1 - \beta_2) \boldsymbol{g}_t \odot \boldsymbol{g}_t,\\ \hat{\boldsymbol{m}}_t \leftarrow \frac{\boldsymbol{m}_t}{1 - \beta_1^t},\qquad \hat{\boldsymbol{v}}_t \leftarrow \frac{\boldsymbol{v}_t}{1 - \beta_2^t},\\ \boldsymbol{g}_t' \leftarrow \frac{\eta \hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon},\\ \boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \boldsymbol{g}_t'.
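The bias correction matters mainly in the first few steps (a worked step not spelled out in the original notes): with $\boldsymbol{m}_0 = \boldsymbol{0}$, at $t=1$ we have $\boldsymbol{m}_1 = (1-\beta_1)\boldsymbol{g}_1$, so
\hat{\boldsymbol{m}}_1 = \frac{(1-\beta_1)\boldsymbol{g}_1}{1-\beta_1^1} = \boldsymbol{g}_1,
i.e. dividing by $1-\beta_1^t$ undoes the shrinkage toward zero caused by the zero initialization; as $t$ grows, $1-\beta_1^t \to 1$ and the correction fades out. The same argument applies to $\hat{\boldsymbol{v}}_t$.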
Implementation
def init_adam_states():
    v_w, v_b = torch.zeros((features.shape[1], 1), dtype=torch.float32), torch.zeros(1, dtype=torch.float32)
    s_w, s_b = torch.zeros((features.shape[1], 1), dtype=torch.float32), torch.zeros(1, dtype=torch.float32)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        v[:] = beta1 * v + (1 - beta1) * p.grad.data
        s[:] = beta2 * s + (1 - beta2) * p.grad.data**2
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        p.data -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr) + eps)
    hyperparams['t'] += 1  # the time step t is shared by all parameters, so increment it once per call
Concise implementation
optimizer = torch.optim.Adam(net_Adam.parameters(), lr = LR, betas= (0.9, 0.99))
word2vec
One-hot word vectors cannot express the similarity between different words well, e.g. under the commonly used cosine similarity.
To address this, the Word2Vec word-embedding tool represents each word as a fixed-length vector and pre-trains these vectors on a corpus so that they capture similarity and analogy relations between words, thereby encoding some semantic information. Its two main models are the Skip-Gram model and the CBOW (continuous bag-of-words) model.
The PTB dataset
1. Build the word index
2. Subsampling
Each indexed word $w_i$ in the dataset is dropped with a certain probability, namely
P(w_i)=\max\left(1-\sqrt{\frac{t}{f(w_i)}},\,0\right)
where $f(w_i)$ is the ratio of the number of occurrences of word $w_i$ to the total number of tokens in the dataset, and the constant $t$ is a hyperparameter (set to $10^{-4}$ in the experiment). A word $w_i$ can only be dropped during subsampling when $f(w_i) > t$, and the more frequent the word, the larger its drop probability.
def discard(idx):
    # True/False: whether to drop this token under the subsampling rule above
    return random.uniform(0, 1) < 1 - math.sqrt(1e-4 / counter[idx_to_token[idx]] * num_tokens)
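A sketch of applying the rule, assuming `dataset` is the list of index-encoded sentences built in the indexing step:

# Hypothetical usage of discard(): keep only the tokens that survive subsampling.
subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]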
Extracting center words and context words
def get_centers_and_contexts(dataset, max_window_size):
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2:  # a sentence needs at least 2 tokens to form a "center word - context word" pair
            continue
        centers += st  # when len(st) >= 2, every token in the sentence can serve as a center word
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)  # randomly choose the context window size
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)  # exclude the center word from its own context
            contexts.append([st[idx] for idx in indices])
    return centers, contexts
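A quick illustrative check on a tiny hand-made dataset (the names and values below are made up for the example):

tiny_dataset = [list(range(7)), list(range(7, 10))]  # two "sentences" of token ids
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)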
The Skip-Gram Model
In the Skip-Gram model, each word is represented by two $d$-dimensional vectors, which are used to compute conditional probabilities. For a word with index $i$ in the vocabulary, its vector is $\boldsymbol{v}_i\in\mathbb{R}^d$ when it acts as a center word and $\boldsymbol{u}_i\in\mathbb{R}^d$ when it acts as a context word. Let the center word $w_c$ have index $c$ and the context word $w_o$ have index $o$ in the vocabulary; the conditional probability of generating a context word given the center word is assumed to satisfy
P(w_o\mid w_c)=\frac{\exp(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}
PyTorch's built-in Embedding layer and batch matrix multiplication
net = nn.Sequential(nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(idx_to_token), embedding_dim=embed_size))
torch.bmm(X, Y)
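torch.bmm(X, Y) performs a batched matrix product: the i-th matrix in X is multiplied by the i-th matrix in Y. A minimal shape check (illustrative only):

import torch
X = torch.ones((2, 1, 4))     # batch of 2 matrices of shape (1, 4)
Y = torch.ones((2, 4, 6))     # batch of 2 matrices of shape (4, 6)
print(torch.bmm(X, Y).shape)  # torch.Size([2, 1, 6])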
Forward computation of the Skip-Gram model
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)                    # embed_v: embedding layer for center words, output shape (n, 1, d)
    u = embed_u(contexts_and_negatives)    # embed_u: embedding layer for context words, output shape (n, m, d)
    pred = torch.bmm(v, u.permute(0, 2, 1))  # inner products of the center word with context/noise words, later used for p(w_o|w_c)
    return pred
Negative sampling approximation
Negative sampling approximates the conditional probability
P(w_o\mid w_c)=\frac{\exp(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}
with
P(w_o\mid w_c)=P(D=1\mid w_c,w_o)\prod_{k=1,\,w_k\sim P(w)}^K P(D=0\mid w_c,w_k)
where $P(D=1\mid w_c,w_o)=\sigma(\boldsymbol{u}_o^\top\boldsymbol{v}_c)$ and $\sigma(\cdot)$ is the sigmoid function. For each pair of center word and context word, we randomly sample $K$ noise words from the vocabulary ($K=5$ in the experiment). Following the suggestion in the Word2Vec paper, the noise-word sampling probability $P(w)$ is set proportional to the frequency of $w$ (relative to the total word count) raised to the power 0.75.
def get_negatives(all_contexts, sampling_weights, K):
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # The candidate noise words have been used up, so resample:
                # draw k word indices at random according to sampling_weights.
                # Drawing a large batch at once (k=1e5) is more efficient.
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            if neg not in set(contexts):  # a noise word must not be one of the context words
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives
sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
Reading the data in minibatches
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert len(centers) == len(contexts) == len(negatives)
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives

    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index], self.negatives[index])

    def __len__(self):
        return len(self.centers)

def batchify(data):
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)  # actual (unpadded) length
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]  # pad to the maximum length
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]  # the mask keeps padding out of the loss computation
        labels += [[1] * len(context) + [0] * (max_len - len(context))]  # distinguish context words from noise words
    batch = (torch.tensor(centers).view(-1, 1), torch.tensor(contexts_negatives),
             torch.tensor(masks), torch.tensor(labels))  # reshape the centers to 2-D
    return batch
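A sketch of wiring MyDataset and batchify together through a DataLoader; `all_centers`, `all_contexts`, and `all_negatives` are assumed to hold the outputs of the preceding steps, and the batch size is illustrative.

import torch.utils.data as Data

batch_size = 512
data_iter = Data.DataLoader(MyDataset(all_centers, all_contexts, all_negatives),
                            batch_size, shuffle=True, collate_fn=batchify)
for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks', 'labels'], batch):
        print(name, 'shape:', data.shape)
    break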
Training the model
Loss function
\sum_{t=1}^T\sum_{-m\le j\le m,\,j\ne 0} \left[-\log P(D=1\mid w^{(t)},w^{(t+j)})-\sum_{k=1,\,w_k\sim P(w)}^K\log P(D=0\mid w^{(t)},w_k)\right]
class SigmoidBinaryCrossEntropyLoss(nn.Module):
    def __init__(self):
        super(SigmoidBinaryCrossEntropyLoss, self).__init__()

    def forward(self, inputs, targets, mask=None):
        inputs, targets, mask = inputs.float(), targets.float(), mask.float()
        res = nn.functional.binary_cross_entropy_with_logits(inputs, targets, reduction="none", weight=mask)
        res = res.sum(dim=1) / mask.sum(dim=1)
        return res
loss = SigmoidBinaryCrossEntropyLoss()
pred = torch.tensor([[1.5, 0.3, -1, 2], [1.1, -0.6, 2.2, 0.4]])
label = torch.tensor([[1, 0, 0, 0], [1, 1, 0, 0]])  # in label, 1 marks a context word and 0 a noise word
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])   # mask variable: 0 marks a padding position
print(loss(pred, label, mask))
def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print('%.4f' % ((sigmd(1.5) + sigmd(-0.3) + sigmd(1) + sigmd(-2)) / 4))  # note that 1 - sigmoid(x) = sigmoid(-x)
print('%.4f' % ((sigmd(1.1) + sigmd(-0.6) + sigmd(-2.2)) / 3))
Training
net = net.to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
for epoch in range(num_epochs):
    start, l_sum, n = time.time(), 0.0, 0
    for batch in data_iter:
        center, context_negative, mask, label = [d.to(device) for d in batch]
        pred = skip_gram(center, context_negative, net[0], net[1])
        l = loss(pred.view(label.shape), label, mask).mean()  # average loss over the batch
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        l_sum += l.cpu().item()
        n += 1
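    # (Assumed logging line, not in the original snippet) report average loss and elapsed time per epoch
    print('epoch %d, loss %.2f, time %.2fs' % (epoch + 1, l_sum / n, time.time() - start))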
Testing the model
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[token_to_idx[query_token]]
    # 1e-9 is added for numerical stability
    cos = torch.matmul(W, x) / (torch.sum(W * W, dim=1) * torch.sum(x * x) + 1e-9).sqrt()
    _, topk = torch.topk(cos, k=k+1)
    topk = topk.cpu().numpy()
    for i in topk[1:]:  # skip the query word itself
        print('cosine sim=%.3f: %s' % (cos[i], (idx_to_token[i])))
get_similar_tokens('chip', 3, net[0])
Advanced Word Embeddings
GloVe: word embeddings from global vectors
The loss function of Word2Vec (taking the Skip-Gram model as an example and ignoring the negative-sampling approximation):
-\sum_{t=1}^T\sum_{-m\le j\le m,\,j\ne 0} \log P(w^{(t+j)}\mid w^{(t)})
where
P(w_j\mid w_i) = \frac{\exp(\boldsymbol{u}_j^\top\boldsymbol{v}_i)}{\sum_{k\in\mathcal{V}}\exp(\boldsymbol{u}_k^\top\boldsymbol{v}_i)}
Letting $x_{ij}$ be the number of times word $w_j$ appears in the context windows of center word $w_i$ over the whole dataset, $x_i=\sum_{j\in\mathcal{V}}x_{ij}$, $p_{ij}=x_{ij}/x_i$ the empirical conditional distribution, and $q_{ij}=P(w_j\mid w_i)$ the model's conditional distribution, this loss is equivalent to
-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij}\log q_{ij} = -\sum_{i\in\mathcal{V}}x_i\sum_{j\in\mathcal{V}}p_{ij} \log q_{ij}
The GloVe model's loss function:
\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} h(x_{ij}) \left(\boldsymbol{u}^\top_j\boldsymbol{v}_i+b_i+c_j-\log x_{ij}\right)^2
Improvements
Compared with Word2Vec, the later GloVe model makes the following changes:
1. It uses the non-probabilistic quantities $p'_{ij}=x_{ij}$ and $q'_{ij}=\exp(\boldsymbol{u}^\top_j\boldsymbol{v}_i)$ and takes their logarithms;
2. It adds two scalar model parameters for every word $w_i$: a center-word bias $b_i$ and a context-word bias $c_i$, relaxing the normalization constraint in the probability definition;
3. It replaces the weight $x_i$ of each loss term with the function $h(x_{ij})$, where the weight function $h(x)$ is monotonically increasing with range $[0,1]$, relaxing the implicit assumption that a center word's importance is linear in $x_i$ (a sketch of a typical choice of $h$ follows this list);
4. It replaces the cross-entropy loss with a squared loss.
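For reference, the specific weight function below is the one proposed in the GloVe paper (Pennington et al., 2014); it is not given in these notes and is shown only as a sketch of such an $h$:

# GloVe paper's weight function: h(x) = (x / x_max) ** alpha for x < x_max, else 1,
# with x_max = 100 and alpha = 0.75 in the paper's experiments.
def h(x, x_max=100, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

print(h(10), h(100), h(1000))  # ~0.178, 1.0, 1.0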
Loading pretrained GloVe vectors
import torch
import torchtext.vocab as vocab
print([key for key in vocab.pretrained_aliases.keys() if "glove" in key])
cache_dir = "/home/kesci/input/GloVe6B5429"
glove = vocab.GloVe(name='6B', dim=50, cache=cache_dir)  # has three attributes: stoi (word -> index), itos (index -> word), and vectors (the word vectors)
Finding synonyms
def knn(W, x, k):
    cos = torch.matmul(W, x.view((-1,))) / (
        (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, topk = torch.topk(cos, k=k)
    topk = topk.cpu().numpy()
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.vectors,
                    embed.vectors[embed.stoi[query_token]], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query word itself
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))
get_similar_tokens('chip', 3, glove)
Finding analogies
For an analogy relation among 4 words, "$a$ is to $b$ as $c$ is to $d$", we are given the first 3 words $a,b,c$ and want to find $d$. The idea is to search for the word whose vector is most similar to $\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$, where $\text{vec}(w)$ denotes the word vector of $w$.
def get_analogy(token_a, token_b, token_c, embed):
    vecs = [embed.vectors[embed.stoi[t]]
            for t in [token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.vectors, x, 1)  # i.e. find the nearest neighbor of c + b - a
    res = embed.itos[topk[0]]
    return res
get_analogy('man', 'woman', 'son', glove)