The CBOW Model

word2vec was introduced in Mikolov's 2013 papers as an improvement on Bengio's NNLM (A Neural Probabilistic Language Model); it removes the hidden layer.
Paper: Efficient Estimation of Word Representations in Vector Space
The concrete implementation is described in: Distributed Representations of Words and Phrases and their Compositionality.
Source code and documentation: word2vec

It also makes use of Huffman trees (for hierarchical softmax), negative sampling, and so on.
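As a rough illustration of negative sampling (a minimal sketch, not code from the original papers): instead of a softmax over the whole vocabulary, each true (input, target) pair is scored against k sampled negative words with a logistic loss. In CBOW the input vector would be the sum of the context embeddings; the function name here is my own, and real word2vec draws negatives from a smoothed unigram distribution (frequency raised to the 0.75 power) rather than uniformly.

import torch
import torch.nn.functional as F

def negative_sampling_loss(input_vec, target_vec, negative_vecs):
    # input_vec: (dim,) e.g. the summed context vector in CBOW
    # target_vec: (dim,) embedding of the true center word
    # negative_vecs: (k, dim) embeddings of k sampled negative words
    pos_score = torch.dot(input_vec, target_vec)   # score of the true pair
    neg_scores = negative_vecs @ input_vec         # scores of the k negatives
    # maximize log sigmoid(pos) + sum log sigmoid(-neg), i.e. minimize its negation
    return -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()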

So a simple implementation is:
randomly initialize the word vectors, concatenate the context embeddings, and pass them through a linear layer with input size context_size*embedding_dim and output size vocab_size. (The classic CBOW sums the context vectors instead of concatenating them; a sketch of that variant follows the class below.)

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # one vector per word
        self.linear1 = nn.Linear(context_size * embedding_dim, vocab_size)

    def forward(self, input):
        # (context_size, embedding_dim) -> (1, context_size * embedding_dim)
        embeds = self.embeddings(input).view((1, -1))
        out = self.linear1(embeds)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
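For reference, here is the summation variant mentioned above (my own sketch, using the same imports as the full script below, not the post's code). Summing the context vectors makes the parameter count independent of the window size, which matches the original CBOW formulation:

class CBOWSum(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOWSum, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, vocab_size)

    def forward(self, input):
        # sum the context vectors: (context_size, embedding_dim) -> (1, embedding_dim)
        embeds = self.embeddings(input).sum(dim=0, keepdim=True)
        return F.log_softmax(self.linear1(embeds), dim=1)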

The full implementation:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@version: python3.7
@author: v-enshi
@license: Apache Licence 
@contact: 123@qq.com
@site: 
@software: PyCharm
@file: BOW.py
@time: 2019/3/15 9:29
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 4  # two context words on each side of the target word
EMBEDDING_DIM = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

vocab = set(raw_text)
vocab_size = len(vocab)
print(vocab_size)
word_to_ix = {word: i for i, word in enumerate(vocab)}

data = []

# build (context, target) pairs: two words on each side of the center word
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # one vector per word
        self.linear1 = nn.Linear(context_size * embedding_dim, vocab_size)

    def forward(self, input):
        # (context_size, embedding_dim) -> (1, context_size * embedding_dim)
        embeds = self.embeddings(input).view((1, -1))
        out = self.linear1(embeds)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

def make_context_vector(context, word_to_ix):
    # map the context words to a LongTensor of vocabulary indices
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


losses = []
loss_function = nn.NLLLoss()  # expects log-probabilities, hence log_softmax in the model
model = CBOW(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in data:
        model.zero_grad()
        log_probs = model(make_context_vector(context, word_to_ix))
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    losses.append(total_loss)
print(losses)
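After training, the model can be queried directly. A small usage sketch continuing from the script above (the ix_to_word helper and variable names are my own): it predicts the center word for one training context and reads a learned vector out of the embedding matrix ('computer' appears in raw_text, so it is in the vocabulary).

# predict the most likely center word for the first training example
context, target = data[0]
log_probs = model(make_context_vector(context, word_to_ix))
ix_to_word = {i: w for w, i in word_to_ix.items()}
print(context, '->', ix_to_word[log_probs.argmax(dim=1).item()], '; true:', target)

# the learned word vectors are the rows of the embedding matrix
computer_vec = model.embeddings.weight[word_to_ix['computer']]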
