word2vec comes from Mikolov's 2013 papers and is an improvement on Bengio's NNLM (A Neural Probabilistic Language Model): it drops the hidden layer.
Paper link: Efficient Estimation of Word Representations in Vector Space
The concrete implementation is described in: Distributed Representations of Words and Phrases and their Compositionality.
Source code and documentation: word2vec
It also relies on tricks such as the Huffman tree (hierarchical softmax) and negative sampling; a rough sketch of the latter is given right below.
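To make the negative-sampling idea concrete, here is a minimal, illustrative sketch (not from the papers or from the script later in this post; the class name SkipGramNS, the skip-gram style pairing of one positive context word with k sampled negatives, and the batch shapes are my own assumptions): each (center, context) pair is scored with a sigmoid against k randomly drawn negative words instead of a softmax over the whole vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Hypothetical skip-gram-with-negative-sampling module, for illustration only."""
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramNS, self).__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, k) word indices
        v = self.in_embed(center)                                 # (B, d)
        u_pos = self.out_embed(context)                           # (B, d)
        u_neg = self.out_embed(negatives)                         # (B, k, d)
        pos_score = (v * u_pos).sum(dim=1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (B, k)
        # maximize log sigma(u_pos . v) + sum_k log sigma(-u_neg . v)
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
        return loss.mean()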
Leaving those tricks aside, a naive CBOW implementation is simply this: randomly initialize the word vectors, concatenate the context embeddings into one long vector (the original CBOW sums them; here we just flatten), and feed it through a single linear layer with input size context_size*embedding_dim and output size vocab_size:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # maps the concatenated context embeddings to a score for every word in the vocabulary
        self.linear1 = nn.Linear(context_size * embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of context_size word indices
        embeds = self.embeddings(inputs).view((1, -1))  # (1, context_size * embedding_dim)
        out = self.linear1(embeds)                      # (1, vocab_size)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
The complete script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@version: python3.7
@author: ‘v-enshi‘
@license: Apache Licence
@contact: 123@qq.com
@site:
@software: PyCharm
@file: BOW.py
@time: 2019/3/15 9:29
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)
CONTEXT_SIZE = 4  # two words on each side of the target word
EMBEDDING_DIM = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()
vocab = set(raw_text)
vocab_size = len(vocab)
print(vocab_size)
word_to_ix = {word:i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])
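# Expected output of the print above (follows directly from raw_text):
# [(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'),
#  (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'),
#  (['study', 'the', 'of', 'a'], 'idea')]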
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = self.linear1(embeds)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)
losses = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in data:
        model.zero_grad()
        log_probs = model(make_context_vector(context, word_to_ix))
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)  # summed loss for this epoch; should decrease across the 10 epochs
print(losses)
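Once training finishes, the learned vectors are just model.embeddings.weight. As a quick sanity check (this block is an illustrative follow-up, not part of the original script; the helper nearest and the query word are my own choices), one can look up cosine nearest neighbours:
# Inspect the trained embeddings with a cosine nearest-neighbour lookup.
embeddings = model.embeddings.weight.data           # (vocab_size, EMBEDDING_DIM)
ix_to_word = {i: w for w, i in word_to_ix.items()}

def nearest(word, k=5):
    v = embeddings[word_to_ix[word]]
    sims = F.cosine_similarity(v.unsqueeze(0), embeddings, dim=1)
    best = sims.argsort(descending=True)[1:k + 1]   # skip the word itself
    return [(ix_to_word[i.item()], sims[i].item()) for i in best]

print(nearest("processes"))
With only 10 epochs of SGD at lr = 0.001 on such a tiny corpus the neighbours will not be meaningful yet, but the call shows how to read the vectors back out of the model.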