Word Embeddings and Word Prediction

爱吃蛋炒饭的小老鼠

Category: Deep Learning Notes. Tags: deep learning, machine learning, python

Article link: https://blog.csdn.net/qq_29380039/article/details/107915692

Contents

I. Word Embeddings
II. Word Prediction

I. Word Embeddings

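Why introduce word embeddings?
Classification problems typically use one-hot encoding: with five classes, for example, the label for the second class is (0,1,0,0,0). The vocabulary of a document, however, can contain thousands or tens of thousands of distinct words. One-hot encoding each word would need a vector as long as the whole vocabulary, which wastes memory and computation.
Word embedding: map each word to a dense n-dimensional vector.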
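
To make the size difference concrete, here is a minimal sketch comparing the two representations. The vocabulary size of 10000 and the embedding dimension of 5 are illustrative assumptions, not values from the article. It also checks that an embedding lookup gives the same result as multiplying the one-hot vector by the embedding weight matrix, which is why the lookup can be seen as a cheap replacement for one-hot input.

```python
import torch
from torch import nn
from torch.nn import functional as F

# Illustrative sizes (assumptions, not taken from the article)
vocab_size, embedding_dim = 10000, 5

# One-hot: every word is a vector as long as the whole vocabulary
word_idx = torch.tensor([1])  # index of some word, dtype int64
one_hot = F.one_hot(word_idx, num_classes=vocab_size).float()
print(one_hot.shape)  # torch.Size([1, 10000])

# Embedding: every word is a short dense vector, stored as one row of a weight matrix
embeds = nn.Embedding(vocab_size, embedding_dim)
dense = embeds(word_idx)
print(dense.shape)  # torch.Size([1, 5])

# Looking up row 1 equals multiplying the one-hot vector by the weight matrix
same = torch.allclose(one_hot @ embeds.weight, dense)
print(same)
```

The lookup reads a single row instead of performing a full matrix product, so its cost does not grow with the vocabulary size.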
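torch.nn.Embedding(num_embeddings: int, embedding_dim: int) creates num_embeddings word vectors, each of dimension embedding_dim; set num_embeddings to the number of distinct words. Then define a dictionary that maps each word to an index idx, so that every word corresponds to exactly one word vector.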

import torch
from torch import nn

words = ['hello', 'world', 'my', 'name']
# Build the dictionary that maps each word to an integer index
word2idx = {word: i for i, word in enumerate(words)}
print(word2idx)
embeds = nn.Embedding(len(words), 5)   # 4 word vectors, each 5-dimensional
for word, idx in word2idx.items():
    print('*' * 30)
    # nn.Embedding expects a LongTensor of indices
    print('word:{}\nword embedding:{}'.format(word, embeds(torch.LongTensor([idx]))))

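The output shows the word-to-index mapping, followed by the embedding vector of each word. The vectors are randomly initialized, so the values differ from run to run.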

{'hello': 0, 'world': 1, 'my': 2, 'name': 3}
******************************
word:hello
word embedding:tensor([[0.0829, 0.2781, 0.7545, 0.5752, 0.4612]], grad_fn=<EmbeddingBackward>)
******************************
word:world
word embedding:tensor([[-1.0953, -1.2260, -0.9847,  1.1382, -1.1667]],
       grad_fn=<EmbeddingBackward>)
******************************
word:my
word embedding:tensor([[-1.4400,  0.2628,  0.0790,  0.0571, -2.1112]],
       grad_fn=<EmbeddingBackward>)
******************************
word:name
word embedding:tensor([[-0.6483, -0.9538, -0.3826, -0.7920,  0.5763]],
       grad_fn=<EmbeddingBackward>)
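
The loop above looks up one word at a time. As a minimal sketch, the same layer can also look up a whole sequence in one call. The example sentence and the variable names `sentence`, `indices`, and `vectors` are assumptions made for illustration. The `grad_fn` in the printed output indicates that the vectors come from a trainable weight matrix, which the last lines check directly.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the random initial vectors repeatable

words = ['hello', 'world', 'my', 'name']
word2idx = {word: i for i, word in enumerate(words)}
embeds = nn.Embedding(len(word2idx), 5)

# Look up several words in one call by passing all of their indices at once
sentence = ['my', 'name', 'hello']  # hypothetical input
indices = torch.tensor([word2idx[w] for w in sentence], dtype=torch.long)
vectors = embeds(indices)
print(vectors.shape)  # torch.Size([3, 5])

# Each returned vector is a row of the layer's trainable weight matrix
print(embeds.weight.shape)          # torch.Size([4, 5])
print(embeds.weight.requires_grad)  # True
```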

II. Word Prediction

The goal is to predict the current word from the words that come before and after it (its context).
Import the libraries

import torch
from torch import nn,optim
from torch.nn import functional as F

Hyperparameters and raw data

# Predict each word from CONTEXT_SIZE words on each side of it
CONTEXT_SIZE = 2
EMBEDDING_DIM = 100  # each word vector has 100 dimensions
use_gpu = torch.cuda.is_available()  # use the GPU when one is available
# Dataset
raw_text = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
print(raw_text)
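
Before the words can be passed to nn.Embedding they need integer indices, as in section I. The following is a minimal sketch of one way to build that mapping. The names `vocab`, `word_to_idx`, and `idx_to_word` are assumptions, and a two-line excerpt stands in for the full text so that the block runs on its own.

```python
# Short excerpt standing in for the full raw_text list (assumption for brevity)
raw_text = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,""".split()

# sorted() makes the index assignment the same on every run
vocab = sorted(set(raw_text))
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# 'thy' occurs twice, so there is one fewer distinct word than tokens
print(len(raw_text), len(vocab))  # 15 14
```

Because split() only splits on whitespace, punctuation stays attached to the words, so 'brow,' counts as its own vocabulary entry.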

Preparing the training data

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    # CONTEXT_SIZE words before position i, then CONTEXT_SIZE words after it
    context = (raw_text[i - CONTEXT_SIZE:i]
               + raw_text[i + 1:i + CONTEXT_SIZE + 1])
    target = raw_text[i]
    data.append((context, target))
print(data)

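Each sample is a tuple containing a list of the CONTEXT_SIZE words on each side of the current word (the input) and the current word itself (the label, target). The training objective is to feed the 2*CONTEXT_SIZE context word vectors into the network so that the output score is highest at the index of the target word.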

[(['When', 'forty', 'shall', 'besiege'], 'winters'), (['forty', 'winters', 'besiege', 'thy'], 'shall'), (['winters', 'shall', 'thy', 'brow,'], 'besiege'), (['shall', 'besiege', 'brow,', 'And'], 'thy'), (['besiege', 'thy', 'And', 'dig'], 'brow,'), (['thy', 'brow,', 'dig', 'deep'], 'And'), (['brow,', 'And', 'deep', 'trenches'], 'dig'), (<
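
The training objective described above can be sketched as a small network. This is a minimal sketch under stated assumptions, not necessarily the model the article goes on to build: the class name `CBOW`, the choice to sum the context vectors, and the single linear layer are all illustrative. It runs one forward and backward pass on the first sample from the output above.

```python
import torch
from torch import nn
from torch.nn import functional as F

CONTEXT_SIZE = 2
EMBEDDING_DIM = 100


class CBOW(nn.Module):  # hypothetical name and architecture
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_idxs):
        # context_idxs: 1-D LongTensor holding 2 * CONTEXT_SIZE word indices
        embedded = self.embeddings(context_idxs)     # (2 * CONTEXT_SIZE, embedding_dim)
        summed = embedded.sum(dim=0, keepdim=True)   # (1, embedding_dim)
        scores = self.linear(summed)                 # (1, vocab_size)
        return F.log_softmax(scores, dim=1)          # log-probability of each word


# One (context, target) sample in the same form as the data list
context, target = ['When', 'forty', 'shall', 'besiege'], 'winters'
vocab = sorted(set(context + [target]))  # tiny vocabulary, for the demo only
word_to_idx = {word: i for i, word in enumerate(vocab)}

model = CBOW(len(vocab), EMBEDDING_DIM)
context_idxs = torch.tensor([word_to_idx[w] for w in context], dtype=torch.long)
target_idx = torch.tensor([word_to_idx[target]], dtype=torch.long)

log_probs = model(context_idxs)
# The loss is smallest when the log-probability at the target index is largest
loss = F.nll_loss(log_probs, target_idx)
loss.backward()
print(log_probs.shape, loss.item())
```

Minimizing this loss over all samples raises the score at the target index, which is the objective stated above. Summing the context vectors keeps the input size independent of CONTEXT_SIZE; concatenating them instead is another common choice.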