word2vec python实现_Word2Vec介绍：skip-gram模型的python实现

最新推荐文章于 2024-02-28 15:55:13 发布

LHZ5388015210

最新推荐文章于 2024-02-28 15:55:13 发布

阅读量551

点赞数

文章标签： word2vec python实现

建议看此文之前先看以下两篇文章：

1. 我们的任务是什么？

假设我们有由字母组成的以下句子：a e d b d c d e e c a

Skip-gram算法就是在给出中心字母(也就是c)的情况下，预测它的上下文字母(除中心字母外窗口内的其他字母，这里的窗口大小是5，也就是左右各5个字母)。

首先创建需要的变量：

inputVectors = np.random.randn(5, 3) # 输入矩阵，语料库中字母的数量是5，我们使用3维向量表示一个字母

outputVectors = np.random.randn(5, 3) # 输出矩阵

sentence = ['a', 'e', 'd', 'b', 'd', 'c','d', 'e', 'e', 'c', 'a'] # 句子

centerword = 'c' # 中心字母

context = ['a', 'e', 'd', 'd', 'd', 'd', 'e', 'e', 'c', 'a'] # 上下文字母

tokens = dict([("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)]) # 用于映射字母在输入输出矩阵中的索引

以下是测试，如何得到中心字母'c'的词向量。

>python

>inputVectors[tokens['c']]

array([ 1.08864485, -0.67509503, 0.51392943])

2. 加载必要的库和函数

python

import numpy as np

import random

softmax和sigmoid是我们自己写的函数。

python

from softmax import softmax

from sigmoid import sigmoid, sigmoid_grad

softmax用于计算向量或者矩阵每行的softmax。sigmoid用于计算sigmoid，sigmoid_grad用于计算sigmoid的导数。

>print(softmax(np.array([1,2]))) # 测试softmax

>x = np.array([[1, 2], [-1, -2]])

>print(sigmoid(x)) # 测试sigmoid

>print(sigmoid_grad(sigmoid(x))) # 测试sigmoid的梯度

[ 0.26894142 0.73105858]

[[ 0.73105858 0.88079708]

[ 0.26894142 0.11920292]]

[[ 0.19661193 0.10499359]

[ 0.19661193 0.10499359]]

3. 计算softmax代价和梯度

现在我们先考虑预测一个字母的情况，也就是说在给定中心字母‘c’的情况下，预测下一个字母是'd'。

先打造实现上述功能的单元模块softmaxCostAndGradient(predicted, target, outputVectors)如下

def softmaxCostAndGradient(predicted, target, outputVectors):

vhat = predicted # 中心词向量

z = np.dot(outputVectors, vhat) # 预测得分

yhat = softmax(z) # 预测输出yhat

cost = -np.log(y_hat[target]) # 计算代价

z = y_hat.copy()

z[target] -= 1.0

grad = np.outer(z, v_hat) # 计算中心词的梯度

gradPred = np.dot(outputVectors.T, z) # 计算输出词向量矩阵的梯度

return cost, gradPred, grad

参数解释：

predicted：输入词向量，也就是例子中'c'的词向量

target：目标词向量的索引，也就是真实值'd'的索引

outputVectors：输出向量矩阵

测试一下：

>python

>softmaxCostAndGradient(inputVectors[2], 3, outputVectors)

(2.7620705713684814,

array([ 0.46424045, -1.62684537, -0.251175 ]),

array([[ 0.32020414, -0.19856634, 0.15116255],

[ 0.15037396, -0.09325054, 0.07098881],

[ 0.04124188, -0.02557509, 0.01946954],

[-1.01988512, 0.63245545, -0.48146921],

[ 0.50806514, -0.31506349, 0.23984831]]))

4. skip-gram模型

有了单元模块，可以进一步打造skip-gram模型，相较于单元模块只能实现通过中心字母'c'来预测下一字母'd',下面要创建的skipgram模块可以实现通过中心字母'c'来预测窗口内的上下文字母context = ['a', 'e', 'd', 'd', 'd', 'd', 'e', 'e', 'c', 'a']

先给出代码如下：

def skipgram(currentWord, contextWords, tokens, inputVectors, outputVectors):

# 初始化变量

cost = 0.0

gradIn = np.zeros(inputVectors.shape)

gradOut = np.zeros(outputVectors.shape)

cword_idx = tokens[currentWord] # 得到中心单词的索引

v_hat = inputVectors[cword_idx] # 得到中心单词的词向量

# 循环预测上下文中每个字母

for j in contextWords:

u_idx = tokens[j] # 得到目标字母的索引

c_cost, c_grad_in, c_grad_out = softmaxCostAndGradient(v_hat, u_idx, outputVectors) #计算一个中心字母预测一个上下文字母的情况

cost += c_cost # 所有代价求和

gradIn[cword_idx] += c_grad_in # 中心词向量梯度求和

gradOut += c_grad_out # 输出词向量矩阵梯度求和

return cost, gradIn, gradOut

测试一下：

>python

>c, gin, gout = skipgram(centerword, context, tokens, inputVectors, outputVectors)

看一下计算后的代价是多少

>python

19.055215361667734

skip-gram得到的代价是之前单元模块代价的大约10倍，因为我们的窗口大小是5，相当于计算2*5次单元模块并求和。

再看一下输出的代价gin。

>python

>gin

array([[ 0. , 0. , 0. ],

[ 0. , 0. , 0. ],

[ 2.39436138, -7.37446751, -2.73334444],

[ 0. , 0. , 0. ],

[ 0. , 0. , 0. ]])

gin只有第三行有值，其他行全是0，因为我们只更新输入矩阵中中心字母‘c’的词向量。

>python

>gout

array([[ 1.02475167, -0.63547332, 0.48376662],

[ 1.50373964, -0.93250536, 0.70988812],

[-0.67622608, 0.41934416, -0.31923403],

[-3.66698203, 2.27398434, -1.7311155 ],

[ 1.8147168 , -1.12534983, 0.85669479]])

我们需要更新输出矩阵中的所有词向量。

5. 更新权重

得到了梯度，下一步就可以更新我们的词向量矩阵。

>step = 0.01 #更新步进

>inputVectors -= step * gin # 更行输入词向量矩阵

>outputVectors -= step * gout

>print(inputVectors)

>print(outputVectors)

[[ 0.0293318 3.20359468 0.02918116]

[-0.18855591 -0.95316493 -0.64365733]

[ 1.04075763 -0.52760568 0.56859632]

[ 0.32805586 0.44831171 -1.50020838]

[-0.62146793 2.25323721 0.2220386 ]]

[[ 0.22033549 0.3710705 0.34345891]

[-0.23107472 -0.43592947 -1.24740455]

[-1.24593039 1.44290408 0.92737814]

[-0.21438081 1.50037497 -0.0857829 ]

[ 0.49294149 -0.63268862 -0.68114989]]

6. 完整测试代码

import numpy as np

import random

def softmax(x):

orig_shape = x.shape

# 根据输入类型是矩阵还是向量分别计算softmax

if len(x.shape) > 1:

# 矩阵

tmp = np.max(x,axis=1) # 得到每行的最大值，用于缩放每行的元素，避免溢出

x-=tmp.reshape((x.shape[0],1)) # 使每行减去所在行的最大值(广播运算)

x = np.exp(x) # 第一步，计算所有值以e为底的x次幂

tmp = np.sum(x, axis = 1) # 将每行求和并保存

x /= tmp.reshape((x.shape[0], 1)) # 所有元素除以所在行的元素和(广播运算)

else:

# 向量

tmp = np.max(x) # 得到最大值

x -= tmp # 利用最大值缩放数据

x = np.exp(x) # 对所有元素求以e为底的x次幂

tmp = np.sum(x) # 求元素和

x /= tmp # 求somftmax

return x

def sigmoid(x):

s = np.true_divide(1, 1 + np.exp(-x)) # 使用np.true_divide进行加法运算

return s

def sigmoid_grad(s):

ds = s * (1 - s) # 可以证明：sigmoid函数关于输入x的导数等于`sigmoid(x)(1-sigmoid(x))`

return ds

def softmaxCostAndGradient(predicted, target, outputVectors):

v_hat = predicted # 中心词向量

z = np.dot(outputVectors, v_hat) # 预测得分

y_hat = softmax(z) # 预测输出y_hat

cost = -np.log(y_hat[target]) # 计算代价

z = y_hat.copy()

z[target] -= 1.0

grad = np.outer(z, v_hat) # 计算中心词的梯度

gradPred = np.dot(outputVectors.T, z) # 计算输出词向量矩阵的梯度

return cost, gradPred, grad

def skipgram(currentWord, contextWords, tokens, inputVectors, outputVectors):

# 初始化变量

cost = 0.0

gradIn = np.zeros(inputVectors.shape)

gradOut = np.zeros(outputVectors.shape)

cword_idx = tokens[currentWord] # 得到中心单词的索引

v_hat = inputVectors[cword_idx] # 得到中心单词的词向量

# 循环预测上下文中每个字母

for j in contextWords:

u_idx = tokens[j] # 得到目标字母的索引

c_cost, c_grad_in, c_grad_out = softmaxCostAndGradient(v_hat, u_idx, outputVectors) #计算一个中心字母预测一个上下文字母的情况

cost += c_cost # 所有代价求和

gradIn[cword_idx] += c_grad_in # 中心词向量梯度求和

gradOut += c_grad_out # 输出词向量矩阵梯度求和

return cost, gradIn, gradOut

inputVectors = np.random.randn(5, 3) # 输入矩阵，语料库中字母的数量是5，我们使用3维向量表示一个字母

outputVectors = np.random.randn(5, 3) # 输出矩阵

sentence = ['a', 'e', 'd', 'b', 'd', 'c','d', 'e', 'e', 'c', 'a'] # 句子

centerword = 'c' # 中心字母

context = ['a', 'e', 'd', 'd', 'd', 'd', 'e', 'e', 'c', 'a'] # 上下文字母

tokens = dict([("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)]) # 用于映射字母在输入输出矩阵中的索引

c, gin, gout = skipgram(centerword, context, tokens, inputVectors, outputVectors)

step = 0.01 #更新步进

print('原始输入矩阵:\n',inputVectors)

print('原始输出矩阵:\n',outputVectors)

inputVectors -= step * gin # 更行输入词向量矩阵

outputVectors -= step * gout

print('更新后的输入矩阵:\n',inputVectors)

print('更新后的输出矩阵:\n',outputVectors)

运行结果：

原始输入矩阵:

[[-2.00926589 -0.71395683 1.26236425]

[ 0.33655978 0.3235073 -0.82206833]

[-0.48877763 0.34254799 0.17305621]

[ 2.86310407 0.8945647 -0.74332816]

[ 1.78756342 0.32268516 -0.63464343]]

原始输出矩阵:

[[-1.65265982 -0.45342181 -1.56449817]

[ 1.26414084 -0.07519223 0.02234691]

[-1.28539302 -0.3520508 -0.80631146]

[-0.23932257 -1.61557932 1.14076112]

[-0.58790059 0.1913722 0.31944376]]

更新后的输入矩阵:

[[-2.00926589 -0.71395683 1.26236425]

[ 0.33655978 0.3235073 -0.82206833]

[-0.48012737 0.30942437 0.22498659]

[ 2.86310407 0.8945647 -0.74332816]

[ 1.78756342 0.32268516 -0.63464343]]

更新后的输出矩阵:

[[-1.64993767 -0.45532956 -1.56546197]

[ 1.26864072 -0.07834587 0.02075368]

[-1.27795164 -0.35726591 -0.80894614]

[-0.25215558 -1.60658561 1.14530477]

[-0.58973098 0.19265498 0.32009183]]

LHZ5388015210

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
word2vec python实现_Word2Vec介绍：skip-gram模型的python实现

建议看此文之前先看以下两篇文章：1. 我们的任务是什么？假设我们有由字母组成的以下句子：a e d b d c d e e c aSkip-gram算法就是在给出中心字母(也就是c)的情况下，预测它的上下文字母(除中心字母外窗口内的其他字母，这里的窗口大小是5，也就是左右各5个字母)。首先创建需要的变量：inputVectors = np.random.randn(5, 3) # 输入矩阵，语料库...
复制链接

扫一扫