word2vec模型评估_[Hulu百面机器学习]python实战系列(1.6)——Word2Vec

最新推荐文章于 2023-08-07 00:20:56 发布

原创最新推荐文章于 2023-08-07 00:20:56 发布 · 248 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#word2vec模型评估

本文介绍了Word2Vec的工作原理，对比了它与LDA的区别，并通过Python实现CBOW和skip-gram算法，探讨了模型评估和过拟合问题。

上一章在研究了词袋模型以后，这次的模型上下文的神经网络，就更有挑战性。

Q:Word2Vec 是如何工作的？它和LDA有什么区别与联系？(难度：3⭐)

CBOW是根据上下文出现的词语来预测当前词的生成概率，Skip-gram是根据当前词来预测上下文各个词的生成概率。

其中是当前关注的词，，其中前后滑动窗口大小均设为N。

它们都由输入层、映射层(其他神经网络叫隐藏层)和输出层构成构成的神经网络。

Word2Vec和LDA的区别和联系，首先LDA是按照主题来聚类的，即"文档-单词"矩阵进行分解，得到"文档-主题"和"主题-单词"两个概率分布。而Word2Vec其实是对"上下文-单词"的矩阵进行学习，其上下文由周围简单单词组成。主题模型通过"上下文-单词"主题矩阵进行推理。同理，词嵌入也可用"文档-单词"矩阵学习出词的隐含向量表示。

接下来就实战CBOW和skip-gram两种算法吧！

CBOW篇

import re
import numpy as np
import matplotlib.pyplot as plt

找个句子

#%%
sentences = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

# 去掉特殊字符
sentences = re.sub('[^A-Za-z0-9]+', ' ', sentences)

# 去除单字母单词
sentences = re.sub(r'(?:^| )\w(?:$| )', ' ', sentences).strip()

# 改成全部小写
sentences = sentences.lower()

整理成词汇：

words = sentences.split()
vocab = set(words)

vocab_size = len(vocab)
embed_dim = 10
context_size = 2

#%%
# 字典词汇
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}

整理成上下文单词和本单词构成的矩阵：

def create_windows(context_size):
    data = []
    for i in range(context_size, len(words) - context_size):
        context = [words[i - j] for j in range(-context_size,context_size+1)]
        context.pop(context_size)
        target = words[i]
        data.append((context, target))
    print(data[:10])
    return data

data = create_windows(context_size)
embeddings =  np.random.random_sample((vocab_size, embed_dim))

定义损失函数:

def linear(m, theta):
    return np.dot(m,theta)

# Log softmax + NLLloss = Cross Entropy
def log_softmax(x):
    e_x = np.exp(x - np.max(x))
    return np.log(e_x / (e_x.sum() + 0.001))

def NLLLoss(logs, targets):
    out = logs[range(len(targets)), targets]
    return -out.sum()/len(out)

def log_softmax_crossentropy_with_logits(logits,target):

    out = np.zeros_like(logits)
    out[np.arange(len(logits)),target] = 1
    
    softmax = np.exp(logits) / np.exp(logits).sum(axis=-1,keepdims=True)
    
    return (- out + softmax) / logits.shape[0]

设立正向和反向函数：

反向函数是用log_softmax_crossentropy_with_logits(logits,target)的：

def forward(context_idxs, theta):
    m = embeddings[context_idxs].reshape(1, -1)
    n = linear(m, theta)
    o = log_softmax(n)
    
    return m, n, o

# 通过反向函数去求梯度
def backward(preds, theta, target_idxs):
    m, n, o = preds
    
    dlog = log_softmax_crossentropy_with_logits(n, target_idxs)
    dw = m.T.dot(dlog)
    
    return dw

优化函数

# 优化函数
def optimize(theta, grad, lr=0.03):
    theta -= grad * lr
    return theta

# 训练函数
theta = np.random.uniform(-1, 1, (2 * context_size * embed_dim, vocab_size))
epoch_losses = {}

for epoch in range(100):

    losses =  []

    for context, target in data:
        context_idxs = np.array([word_to_ix[w] for w in context])
        preds = forward(context_idxs, theta)

        target_idxs = np.array([word_to_ix[target]])
        loss = NLLLoss(preds[-1], target_idxs)

        losses.append(loss)

        grad = backward(preds, theta, target_idxs)
        theta = optimize(theta, grad, lr=0.03)
        
     
    epoch_losses[epoch] = losses
    if(epoch % 5 == 0):
        print("Cost after epoch {}: {}".format(epoch, epoch_losses[epoch]))

作图一个新函数：

# Plot loss/epoch
ix = np.arange(0,80)

fig = plt.figure()
fig.suptitle('Epoch/Losses', fontsize=20)
plt.plot(ix,[epoch_losses[i][0] for i in ix])
plt.xlabel('Epochs', fontsize=12)
plt.ylabel('Losses', fontsize=12)

某个随机结果如下：

预测并算出精确度：

# 预测
def predict(words):
    context_idxs = np.array([word_to_ix[w] for w in words])
    preds = forward(context_idxs, theta)
    word = ix_to_word[np.argmax(preds[-1])]
    
    return word

# 评估模型的准确度
def accuracy():
    wrong = 0

    for context, target in data:
        if(predict(context) != target):
            wrong += 1
            
    return (1 - (wrong / len(data)))

发现竟然为1！是不是很神奇啊？其实不然，这其实因为生成的测试集里面都源于训练集，非常依赖经验主义。这很容易导致过拟合啊，一旦换篇文章就不行了啊。所以解决办法之一是，快速扩充文章数量和句子数量。至于生僻的语境譬如develop作为动词洗胶卷的意思的话，要靠机器学习判断还是自求多福了。

至于上一期有一位童鞋说到的Sentence2Vec和Paragraph2Vec，其实原理很简单，就是把

words = sentences.split()

改成

sentence = sentences.split(".")

这样就ok了。段落向量也是一样的规则。

还有一个skip-gram算法(从CBOW生成的词袋words开始)：

def generate_training_data(words, word_to_id, window_size):
    X, Y = [], []

    for i in range(len(words)):
        nbr_inds = list(range(max(0, i - window_size), i)) + \
                   list(range(i + 1, min(len(words), i + window_size + 1)))
        for j in nbr_inds:
            X.append(word_to_id[words[i]])
            Y.append(word_to_id[words[j]])
            
    return np.array(X), np.array(Y)

def expand_dims(x, y):
    x = np.expand_dims(x, axis=0)
    y = np.expand_dims(y, axis=0)
    return x,y

x, y = generate_training_data(words, word_to_ix, 3)
x, y = expand_dims(x, y)

然后就是前后向传播和softmax激活函数：

# 前向传播
def init_parameters(vocab_size, emb_size):
    wrd_emb = np.random.randn(vocab_size, emb_size) * 0.01
    w = np.random.randn(vocab_size, emb_size) * 0.01
    
    return wrd_emb, w

def softmax(z):
    return np.divide(np.exp(z), np.sum(np.exp(z), axis=0, keepdims=True) + 0.001)


def forward(inds, params):
    wrd_emb, w = params
    word_vec = wrd_emb[inds.flatten(), :].T
    z = np.dot(w, word_vec)
    out = softmax(z)
    
    cache = inds, word_vec, w, z
    
    return out, cache

# 垮熵损失函数：
def cross_entropy(y, y_hat):
    m = y.shape[1]
    cost = -(1 / m) * np.sum(np.sum(y_hat * np.log(y + 0.001), axis=0, keepdims=True), axis=1)
    return cost

# softmax的求导
def dsoftmax(y, out):
    return out - y

# 找到损失函数关于z,w,word_vec的偏导
def backward(y, out, cache):
    inds, word_vec, w, z = cache
    wrd_emb, w = params    
    dl_dz = dsoftmax(y, out)
    # deviding by the word_vec length to find the average
    dl_dw = (1/word_vec.shape[1]) * np.dot(dl_dz, word_vec.T)
    # 
    dl_dword_vec = np.dot(w.T, dl_dz)    
    return dl_dz, dl_dw, dl_dword_vec

def update(params, cache, grads, lr=0.03):
    inds, word_vec, w, z = cache
    wrd_emb, w = params
    dl_dz, dl_dw, dl_dword_vec = grads
    
    wrd_emb[inds.flatten(), :] -= dl_dword_vec.T * lr
    w -= dl_dw * lr
    
    return wrd_emb, w

训练测试词袋(根据单词猜出上下文)：

vocab_size = len(ix_to_word)

m = y.shape[1]
y_one_hot = np.zeros((vocab_size, m))
y_one_hot[y.flatten(), np.arange(m)] = 1

y = y_one_hot

batch_size = 256
embed_size = 50

params = init_parameters(vocab_size, 50)

costs = []

for epoch in range(5000):
    epoch_cost = 0
    
    batch_inds = list(range(0, x.shape[1], batch_size))
    np.random.shuffle(batch_inds)
    
    for i in batch_inds:
        x_batch = x[:, i:i+batch_size]
        y_batch = y[:, i:i+batch_size]
        
        pred, cache = forward(x_batch, params)
        grads = backward(y_batch, pred, cache)
        params = update(params, cache, grads, 0.03)
        cost = cross_entropy(pred, y_batch)
        
        epoch_cost += np.squeeze(cost)
        
    costs.append(epoch_cost)
    
    if(epoch % 250 == 0):
        print("Cost after epoch {}: {}".format(epoch, epoch_cost))

测试集：

#%% Test
x_test = np.arange(vocab_size)
x_test = np.expand_dims(x_test, axis=0)
softmax_test, _ = forward(x_test, params)
top_sorted_inds = np.argsort(softmax_test, axis=0)[-4:,:]
for input_ind in range(vocab_size):
    input_word = ix_to_word[input_ind]
    output_words = [ix_to_word[output_ind] for output_ind in top_sorted_inds[::-1, input_ind]]
    print("{}'s skip-grams: {}".format(input_word, output_words))

这里就没有展示它的准确率了，测试结果如下：

they's skip-grams: ['manipulate', 'evolve', 'inhabit', 'as']
programs's skip-grams: ['processes', 'to', 'that', 'direct']
things's skip-grams: ['the', 'we', 'conjure', 'in']
spells's skip-grams: ['computer', 'our', 'with', 'we']
people's skip-grams: ['to', 'programs', 'program', 'create']
about's skip-grams: ['the', 'we', 'conjure', 'in']
manipulate's skip-grams: ['things', 'they', 'evolve', 'other']
that's skip-grams: ['as', 'beings', 'computers', 'inhabit']
is's skip-grams: ['pattern', 'evolution', 'directed', 'by']
called's skip-grams: ['create', 'of', 'people', 'rules']
by's skip-grams: ['rules', 'of', 'directed', 'pattern']
evolution's skip-grams: ['is', 'data', 'processes', 'of']
program's skip-grams: ['programs', 'rules', 'people', 'create']
are's skip-grams: ['that', 'beings', 'computational', 'process']
pattern's skip-grams: ['called', 'by', 'rules', 'is']
effect's skip-grams: ['the', 'we', 'conjure', 'in']
spirits's skip-grams: ['computer', 'the', 'conjure', 'we']
computers's skip-grams: ['evolve', 'they', 'as', 'beings']
inhabit's skip-grams: ['they', 'as', 'computers', 'that']
directed's skip-grams: ['of', 'the', 'in', 'spirits']
in's skip-grams: ['conjure', 'processes', 'direct', 'we']
evolve's skip-grams: ['other', 'they', 'manipulate', 'computers']
other's skip-grams: ['called', 'by', 'rules', 'is']
study's skip-grams: ['of', 'the', 'in', 'spirits']
of's skip-grams: ['the', 'with', 'computer', 'conjure']
computer's skip-grams: ['spells', 'the', 'our', 'with']
beings's skip-grams: ['computers', 'are', 'abstract', 'that']
as's skip-grams: ['processes', 'to', 'that', 'direct']
to's skip-grams: ['in', 'direct', 'processes', 'programs']
process's skip-grams: ['by', 'directed', 'computational', 'evolution']
with's skip-grams: ['spells', 'the', 'our', 'with']
we's skip-grams: ['spirits', 'conjure', 'effect', 'to']
abstract's skip-grams: ['data', 'things', 'other', 'manipulate']
processes's skip-grams: ['we', 'to', 'direct', 'effect']
conjure's skip-grams: ['of', 'the', 'in', 'spirits']
our's skip-grams: ['spells', 'the', 'our', 'with']
the's skip-grams: ['spirits', 'of', 'the', 'our']
direct's skip-grams: ['effect', 'in', 'create', 'to']
idea's skip-grams: ['process', 'computational', 'study', 'effect']
rules's skip-grams: ['people', 'by', 'pattern', 'program']
data's skip-grams: ['of', 'the', 'in', 'spirits']
computational's skip-grams: ['abstract', 'process', 'are', 'processes']
create's skip-grams: ['direct', 'programs', 'called', 'people']

看样子这里的准确率不是100%，毕竟1推多比多推1难得多嘛！

fasttext是脸书推出来的源码，有兴趣的点击下方链接：https://github.com/facebookresearch/fastText/

我们的第一章很快就要结束了，现在冷不丁的来个神经网络，才是个特征工程就要搞到这种程度。。。于是我就借用外网的代码来充数了。大家看看第三章机器学习的线性回归的numpy源码就知道这个项目是有多难，而且fasttext肯定比CBOW和skip-gram都更加复杂。

百面机器学习系列，免费的时日已经不多了！感兴趣想要加粉丝群的欢迎来加微信号：xiaoshida002，非诚勿扰！