Word Embedding Vector Generation
-
Write out the one-hot encoding of each word.
import numpy as np

X = np.eye(5)  # 5x5 identity matrix: row i is the one-hot encoding of word i
words = ['quick','fox','dog','lazy','brown']
for i in range(5):
    print(words[i], "one-hot encoding:", X[i])
quick one-hot encoding: [1. 0. 0. 0. 0.]
fox one-hot encoding: [0. 1. 0. 0. 0.]
dog one-hot encoding: [0. 0. 1. 0. 0.]
lazy one-hot encoding: [0. 0. 0. 1. 0.]
brown one-hot encoding: [0. 0. 0. 0. 1.]
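Multiplying a one-hot row vector by an embedding matrix W simply selects the corresponding row of W, which is why the next step can treat the rows of W as the word embeddings. A minimal sketch, assuming a hypothetical random 5x2 matrix W (not the one gensim will initialize below):

import numpy as np

np.random.seed(0)
X = np.eye(5)               # one-hot encodings from above
W = np.random.randn(5, 2)   # hypothetical randomly initialized 5x2 matrix W
# The one-hot vector for 'fox' (index 1) picks out row 1 of W
print(X[1].dot(W))   # equals W[1]
print(W[1])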
-
Randomly initialize the trainable parameter matrix W, where W has dimensions 5x2, and compute the embedding vector of quick, fox, and dog in the embedding space.
from gensim.models import Word2Vec

sentences = [['quick','brown'],['quick','fox'],['lazy','dog']]
word_dim = 2   # embedding dimension
negative = 1   # number of negative samples
# Create the Word2Vec object and build the vocabulary;
# this randomly initializes the 5x2 parameter matrix W
model = Word2Vec(vector_size=word_dim, window=5, min_count=1, workers=8, negative=negative)
model.build_vocab(sentences)
# Print each word's embedding vector in the embedding space
for word in words:
    print(word + ': ' + str(model.wv.get_vector(word)))
quick: [-0.02681136 0.01182151]
fox: [0.32294357 0.4486494 ]
dog: [0.25516748 0.45046365]
lazy: [-0.4651475 -0.35584044]
brown: [-0.2507714 -0.18816864]
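To compare how close two embeddings are, gensim can compute the cosine similarity between word vectors directly. A quick sketch (the values will vary from run to run, since W here is only randomly initialized):

# Cosine similarity between embedding vectors
print(model.wv.similarity('fox', 'dog'))
print(model.wv.similarity('quick', 'lazy'))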
-
Using Python, plot the points for the three words quick, fox, and dog in a 2D coordinate system (the code below plots all five words for context).
import matplotlib.pyplot as plt
%matplotlib inline

words = ['quick','fox','dog','lazy','brown']
# Plot each word's 2-D embedding and label the point with the word
for word in words:
    vec = model.wv.get_vector(word)
    plt.scatter(vec[0], vec[1])
    plt.text(vec[0] + 0.01, vec[1], word)
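The scatter plot only gives a visual impression; pairwise Euclidean distances quantify it. A small sketch over the same five words:

from itertools import combinations

# Pairwise Euclidean distances between the 2-D embeddings shown above
for a, b in combinations(words, 2):
    d = np.linalg.norm(model.wv[a] - model.wv[b])
    print(a, '-', b, ':', round(float(d), 4))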
-
Using Python, compute the loss function for the sample pairs above.
Assume the target word is quick, the positive-sample word is fox, and the negative sample is lazy (the negative-sample word can be drawn at random from the vocabulary); write out the word-embedding loss function.
$$loss = \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

from math import log

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

words = ['quick','fox','dog','lazy','brown']
# Positive term: target word 'quick' with positive sample 'fox'
loss = log(sigmoid(model.wv['fox'].dot(model.wv['quick'])))
# Negative term: one negative sample drawn at random from the vocabulary,
# excluding the target and the positive word
negative_word = np.random.choice([w for w in words if w not in ('quick', 'fox')])
loss += log(sigmoid(-model.wv[negative_word].dot(model.wv['quick'])))
print(loss)
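The computation above uses a single negative sample (k = 1). A minimal sketch generalizing it to k negative samples drawn uniformly from the vocabulary (sgns_loss is a hypothetical helper name, and the uniform draw is a simplification: word2vec actually samples negatives from a smoothed unigram noise distribution P_n(w)):

def sgns_loss(model, target, positive, vocabulary, k=2):
    # log sigma(v_pos . v_target) + sum over k negatives of log sigma(-v_neg . v_target)
    v_target = model.wv[target]
    loss = log(sigmoid(model.wv[positive].dot(v_target)))
    candidates = [w for w in vocabulary if w not in (target, positive)]
    for neg in np.random.choice(candidates, size=k):
        loss += log(sigmoid(-model.wv[neg].dot(v_target)))
    return loss

print(sgns_loss(model, 'quick', 'fox', words, k=2))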