😁 Hi everyone, I'm CuddleSabe, a senior undergraduate about to join a company in Shenzhen as an algorithm engineer. My main research interests are multimodal tasks (VQA, Image Captioning, etc.). Feel free to drop by and discuss!
🍭 I'm currently putting together a hands-on CV introduction series and a hands-on NLP introduction series. In these two columns I will walk you step by step through implementations of classic network algorithms. Subscriptions from all readers are welcome 🍀
PyTorch Text Processing and Embedding
Text Processing
In deep learning, text processing requires building two dictionaries:
word2idx and idx2word.
We use the sentence '所爱隔山海,山海皆可开' as the running example.
Code:
```python
def Word_Proccess(sentence):
    # Characters to skip: punctuation and other non-word symbols
    symbols = ',.?。,()()/*-+!!@#$¥%……^&-_ '
    word2idx = {}
    idx2word = {}
    i = 0
    for word in sentence:
        # Assign a new index only to unseen, non-symbol characters
        if word2idx.get(word) is None and word not in symbols:
            word2idx[word] = i
            idx2word[i] = word
            i += 1
    return word2idx, idx2word, i

word2idx, idx2word, vocab_num = Word_Proccess('所爱隔山海,山海皆可开。')
print(word2idx)
print(idx2word)
print(vocab_num)
```
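The function skips the two punctuation marks and indexes each distinct character exactly once (dictionaries preserve insertion order in Python 3.7+), so the expected output is:

```
{'所': 0, '爱': 1, '隔': 2, '山': 3, '海': 4, '皆': 5, '可': 6, '开': 7}
{0: '所', 1: '爱', 2: '隔', 3: '山', 4: '海', 5: '皆', 6: '可', 7: '开'}
8
```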
Embedding
torch.nn.Embedding(num_embeddings, embedding_dim) takes the vocabulary size and the embedding dimension.
Compared with one-hot encoding, an embedding is far more space-efficient and easier to train.
An embedding layer is essentially a lookup table of size vocabulary size × embedding dimension.
Taking embedding = torch.nn.Embedding(3, 5) as an example:

word / idx | feature 1 | feature 2 | feature 3 | feature 4 | feature 5
---|---|---|---|---|---
'我' / 0 | -0.3367 | -3.1418 | -1.2322 | 1.1305 | 1.1179
'爱' / 1 | -0.6050 | 2.6915 | 4.0444 | 1.3259 | 1.6389
'你' / 2 | -3.0094 | -2.0047 | 1.8739 | -2.0861 | -3.3471
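To make the lookup-table picture concrete, here is a minimal sketch (the weights are randomly initialized, so the values will differ from the table above) showing that indexing the layer simply returns rows of its weight matrix:

```python
import torch

embedding = torch.nn.Embedding(3, 5)  # a 3 x 5 weight table
print(embedding.weight)               # the whole table

# Looking up index 0 returns row 0 of the table
idx = torch.LongTensor([0])
print(embedding(idx))
print(torch.equal(embedding(idx)[0], embedding.weight[0]))  # True
```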
The full code is as follows:
```python
import torch
import torch.nn as nn
import math

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lookup table of size vocab x d_model
        self.d_model = d_model

    def forward(self, x):
        # Scaling by sqrt(d_model) follows the Transformer convention
        return self.lut(x) * math.sqrt(self.d_model)

def Word_Proccess(sentence):
    symbols = ',.?。,()()/*-+!!@#$¥%……^&-_ '
    word2idx = {}
    idx2word = {}
    i = 0
    for word in sentence:
        if word2idx.get(word) is None and word not in symbols:
            word2idx[word] = i
            idx2word[i] = word
            i += 1
    return word2idx, idx2word, i

sentence = '所爱隔山海,山海皆可开。'                      # the sentence to encode
word2idx, idx2word, vocab_num = Word_Proccess(sentence)  # the two tables and the vocabulary size
embedding_dim = 5                                        # number of features per word, i.e. the embedding dimension
Embedding_Layer = Embeddings(embedding_dim, vocab_num)   # (embedding dim, vocab size)
# i.e. create a table of size vocab_num x embedding_dim

idx_s = []
for word in sentence:
    if word in word2idx:
        idx_s.append(word2idx[word])

print('Sentence before word2idx: ' + sentence)
print('Sentence after word2idx: ' + str(idx_s))
idx_s = torch.LongTensor(idx_s)
print(Embedding_Layer(idx_s))
```
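Running the script prints the original sentence, the index sequence [0, 1, 2, 3, 4, 3, 4, 5, 6, 7] (punctuation is dropped, and the repeated 山/海 reuse their indices), and finally a 10 × 5 tensor of embedding vectors; the tensor values vary between runs because the weights are randomly initialized.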
Embedding vectors are trained along with the rest of the network, so if you reduce the trained embeddings to a low dimension and visualize them, you can see that words with similar meanings also have embedding vectors that lie close together.
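As a minimal sketch of such a visualization (assuming scikit-learn and matplotlib are installed; the layer here is untrained, so the points are random and this only illustrates the workflow):

```python
import torch.nn as nn
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

vocab = ['所', '爱', '隔', '山', '海', '皆', '可', '开']
embedding = nn.Embedding(len(vocab), 5)

# Project the 5-dimensional embeddings down to 2 dimensions with PCA
weights = embedding.weight.detach().numpy()
points = PCA(n_components=2).fit_transform(weights)

# Scatter-plot each word at its 2-D position
for (x, y), word in zip(points, vocab):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```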