Week N2: EmbeddingBag and Embedding Explained

千筱夜

Last modified: 2024-05-20 10:43:07

Tags: embedding

First published: 2024-05-20 10:34:23

Article link: https://blog.csdn.net/geo436872/article/details/139053059

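🍨 This post is a study log from the 🔗365-Day Deep Learning Training Camp
🍖 Original author: K同学啊 | tutoring and custom projects available
🚀 Source: K同学's study circle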

1. Embedding:

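Definition: An embedding maps discrete data (such as words) into a continuous vector space. In natural language processing, embeddings typically turn words or phrases into fixed-dimensional vectors that can capture the semantic relationships between them.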

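Applications: Embeddings are used widely in natural language processing, for tasks such as word vector representation, sentence vector representation, and semantic similarity computation. Because words or sentences are mapped to fixed-dimensional vectors in a continuous space, a model can process and analyze text while retaining its semantic information.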
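To make the semantic-similarity use case concrete, here is a minimal sketch that compares two embedding vectors with cosine similarity. The toy vocabulary and the helper name similarity are assumptions made for illustration. The layer is randomly initialized, so the score between different words carries no real meaning here; only trained or pretrained weights would make it meaningful.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy vocabulary (not from the original post)
word2idx = {"king": 0, "queen": 1, "apple": 2}
embedding = nn.Embedding(num_embeddings=len(word2idx), embedding_dim=8)

def similarity(word_a, word_b):
    """Cosine similarity between the embedding vectors of two words."""
    idx = torch.tensor([word2idx[word_a], word2idx[word_b]], dtype=torch.long)
    vec_a, vec_b = embedding(idx)  # each vector has shape (embedding_dim,)
    return F.cosine_similarity(vec_a, vec_b, dim=0).item()

sim_self = similarity("king", "king")    # a vector compared with itself
sim_other = similarity("king", "queen")  # arbitrary value with untrained weights
print(sim_self, sim_other)
```

A word compared with itself scores 1 (up to floating-point error), and every score falls in the range [-1, 1].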

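Implementation: Embedding is a basic operation in deep learning frameworks. In PyTorch, for example, the input to nn.Embedding is an integer tensor in which each integer is the index of a word in the vocabulary, and the output is a floating-point tensor in which each row is the embedding vector of the corresponding word.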

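Common models: Word2Vec and GloVe are two common embedding models. Word2Vec learns distributed word representations by training a neural network, whereas GloVe derives word vectors from global word co-occurrence statistics.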
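Vectors produced by models such as Word2Vec or GloVe are usually loaded into an embedding layer instead of being learned from scratch. The sketch below assumes the vectors have already been read into a tensor. The pretrained matrix is made-up placeholder data, not real Word2Vec or GloVe output.

```python
import torch
import torch.nn as nn

# Made-up 3-word, 4-dimensional "pretrained" matrix (placeholder data)
pretrained = torch.tensor([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])

# freeze=True keeps the loaded vectors fixed during training
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Looking up indices 2 and 0 returns rows 2 and 0 of the matrix
vectors = embedding(torch.tensor([2, 0], dtype=torch.long))
print(vectors)
```

Passing freeze=False instead would let the vectors be fine-tuned together with the rest of the model.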

Example:

import torch
import torch.nn as nn

vocab_size = 12     # vocabulary size
embedding_dim = 4   # dimension of each embedding vector

# Create an Embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)

# Two input sequences of word indices, with lengths 3 and 2
input_sequence1 = torch.tensor([1, 5, 8], dtype=torch.long)
input_sequence2 = torch.tensor([2, 4], dtype=torch.long)

# Use the Embedding layer to convert each input sequence into word embeddings
embedded_sequence1 = embedding(input_sequence1)
embedded_sequence2 = embedding(input_sequence2)

print(embedded_sequence1)
print(embedded_sequence2)

Output (the values differ between runs because the weights are randomly initialized):

tensor([[ 1.7796,  0.4485,  0.0318,  0.9585],
        [ 0.4706,  0.0850,  1.3851,  2.5884],
        [ 1.2294,  2.5925,  0.2432, -0.6803]], grad_fn=<EmbeddingBackward>)
tensor([[ 0.7696, -0.2784, -2.6275,  0.2936],
        [ 0.4663,  0.8806, -0.8406,  0.5400]], grad_fn=<EmbeddingBackward>)
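The Embedding layer is essentially a lookup table: each input index selects one row of the layer's weight matrix. A minimal sketch that checks this, reusing the sizes from the example above:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=12, embedding_dim=4)
indices = torch.tensor([1, 5, 8], dtype=torch.long)

looked_up = embedding(indices)        # forward pass, shape (3, 4)
indexed = embedding.weight[indices]   # the same rows read directly from the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

This is why the output has one row per input index and embedding_dim columns.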

2. EmbeddingBag:

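Definition: EmbeddingBag is a further optimization built on Embedding. It handles variable-length sentences directly and computes the mean or the sum of the embedding vectors of all words in a sentence, which is very useful when working with variable-length sequences such as sentences.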

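Applications: EmbeddingBag is commonly used on sentences or paragraphs in text data. Aggregating the embedding vectors of the words in a sentence (by averaging or summing them) yields a single vector that represents the whole sentence. That vector can then feed NLP tasks such as text classification and sentiment analysis.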

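Implementation: In PyTorch, the input to nn.EmbeddingBag is a 1-D integer tensor together with an offsets tensor. All sentences are concatenated into the integer tensor, where each integer is the index of a word, and the offsets tensor gives the starting position of each sentence (each "bag") within it. The output is a floating-point tensor in which each row is the mean or sum of the word embeddings of the corresponding sentence. A 2-D input of shape (B, N) is also accepted; each row is then treated as one bag and no offsets are needed.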
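The offsets follow directly from the sequence lengths: each bag starts where the previous one ended. A minimal sketch of building both tensors from a list of variable-length index lists (the helper name build_offsets and the sample indices are hypothetical):

```python
import torch

def build_offsets(sequences):
    """Flatten variable-length index lists and return (flat_indices, offsets)."""
    flat, offsets, start = [], [], 0
    for seq in sequences:
        offsets.append(start)  # this bag starts where the previous one ended
        flat.extend(seq)
        start += len(seq)
    return (torch.tensor(flat, dtype=torch.long),
            torch.tensor(offsets, dtype=torch.long))

flat_indices, offsets = build_offsets([[1, 5, 8], [2, 4], [7]])
print(flat_indices)  # tensor([1, 5, 8, 2, 4, 7])
print(offsets)       # tensor([0, 3, 5])
```

The two returned tensors can be passed to an EmbeddingBag layer as its input and offsets arguments.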

import torch
import torch.nn as nn

vocab_size = 12     # vocabulary size
embedding_dim = 4   # dimension of each embedding vector

# Create an EmbeddingBag layer
embedding_bag = nn.EmbeddingBag(vocab_size, embedding_dim, mode='mean')

# Two input sequences of different lengths
input_sequence1 = torch.tensor([1, 5, 8], dtype=torch.long)
input_sequence2 = torch.tensor([2, 4], dtype=torch.long)

# Concatenate the two sequences and record where each one starts
input_sequences = torch.cat([input_sequence1, input_sequence2])
offsets = torch.tensor([0, len(input_sequence1)], dtype=torch.long)

# Use the EmbeddingBag layer to aggregate each sequence (here, by mean)
embedded_bag = embedding_bag(input_sequences, offsets)

print(embedded_bag)

Output (one row per input sequence):

tensor([[-0.0077,  0.3324, -0.3279,  0.1561],
        [ 0.3227,  0.0503,  0.8153,  0.3354]], grad_fn=<EmbeddingBagBackward>)
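To see how the two layers relate, the following sketch checks that EmbeddingBag in mean mode gives the same result as running Embedding on each sequence and averaging the vectors by hand. Copying the weights with from_pretrained is just one convenient way to make both layers share the same values for this check.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(12, 4)
# Build an EmbeddingBag holding a copy of the same weight values
embedding_bag = nn.EmbeddingBag.from_pretrained(
    embedding.weight.detach().clone(), mode='mean')

seq1 = torch.tensor([1, 5, 8], dtype=torch.long)
seq2 = torch.tensor([2, 4], dtype=torch.long)

# EmbeddingBag: concatenated indices plus the start offset of each bag
bag_out = embedding_bag(torch.cat([seq1, seq2]),
                        torch.tensor([0, len(seq1)], dtype=torch.long))

# Manual equivalent: embed every token, then average within each sequence
manual = torch.stack([embedding(seq1).mean(dim=0),
                      embedding(seq2).mean(dim=0)])

print(torch.allclose(bag_out, manual, atol=1e-6))
```

EmbeddingBag reaches the same result without materializing the per-token embeddings, which is where its efficiency comes from.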

3. Task:

Load the attached .txt file and generate word embeddings with both EmbeddingBag and Embedding.

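Approach:
1. Clean the text first by removing symbols and punctuation, which makes tokenization and the later steps easier.
2. Tokenize the cleaned text.
3. Build the vocabulary.
4. Vectorize the text.
5. Compute the word embeddings.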
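Steps 1 to 4 above can be sketched in plain Python before any PyTorch layer is involved. The two sample lines are made up and stand in for the attached file. Reserving indices 0 and 1 for the `<pad>` and `<unk>` tokens is an assumption of this sketch, chosen so that padding and unknown words never share an index with a real word.

```python
import re
from collections import Counter

def clean_and_tokenize(text):
    # Steps 1-2: lowercase, strip punctuation, split on whitespace
    return re.sub(r'[^\w\s]', '', text.lower()).split()

# Made-up sample lines standing in for the attached .txt file
lines = ["Hello, world!", "Hello embedding bag."]

# Step 3: vocabulary with reserved <pad>/<unk> entries at indices 0 and 1
counter = Counter(tok for line in lines for tok in clean_and_tokenize(line))
vocab = ['<pad>', '<unk>'] + sorted(counter)
word2idx = {word: idx for idx, word in enumerate(vocab)}

# Step 4: map each line to a list of vocabulary indices
encoded = [[word2idx.get(tok, word2idx['<unk>']) for tok in clean_and_tokenize(line)]
           for line in lines]

print(vocab)    # ['<pad>', '<unk>', 'bag', 'embedding', 'hello', 'world']
print(encoded)  # [[4, 5], [4, 3, 2]]
```

The resulting index lists are what step 5 feeds into Embedding or EmbeddingBag.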

import os
import re
import torch
import torch.nn as nn
from collections import Counter

# Special tokens reserved at the start of the vocabulary
PAD_TOKEN, UNK_TOKEN = '<pad>', '<unk>'

# Text cleaning and tokenization
def clean_and_tokenize(text):
    # Lowercase, strip punctuation, then split on whitespace.
    # Note: whitespace splitting does not segment text written without spaces (e.g. Chinese).
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = text.split()
    return tokens

# Build the vocabulary
def build_vocab(tokens, min_freq=1):
    counter = Counter(tokens)
    # Reserve indices 0 and 1 for <pad> and <unk> so they never collide with a real word
    vocab = [PAD_TOKEN, UNK_TOKEN] + [token for token, freq in counter.items() if freq >= min_freq]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for idx, word in enumerate(vocab)}
    return word2idx, idx2word, vocab

# Vectorize one text: map tokens to indices, then truncate or pad to max_length
def text_to_tensor(text, word2idx, max_length=None):
    tokens = clean_and_tokenize(text)
    indices = [word2idx.get(token, word2idx[UNK_TOKEN]) for token in tokens]
    if max_length is not None and len(indices) > max_length:
        indices = indices[:max_length]
    elif max_length is not None:
        indices.extend([word2idx[PAD_TOKEN]] * (max_length - len(indices)))
    return torch.LongTensor(indices)

# Load the text file, build the vocabulary, and vectorize every line
def load_data(file_path, max_length=None):
    texts = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip empty lines
                texts.append(line.strip())

    all_tokens = []
    for text in texts:
        all_tokens.extend(clean_and_tokenize(text))

    word2idx, idx2word, vocab = build_vocab(all_tokens)

    # One index tensor per line of text
    tensors = [text_to_tensor(text, word2idx, max_length) for text in texts]

    return tensors, vocab, word2idx, idx2word

# Main program
def main():
    file_path = '任务文件.txt'
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File '{file_path}' does not exist")

    # Load the data (maximum length assumed to be 50)
    tensors, vocab, word2idx, idx2word = load_data(file_path, max_length=50)

    # Embedding dimension assumed to be 50
    embedding_dim = 50
    pad_idx = word2idx[PAD_TOKEN]

    # EmbeddingBag (default mode='mean'); padding_idx keeps padded positions out of the mean
    embedding_bag = nn.EmbeddingBag(num_embeddings=len(vocab), embedding_dim=embedding_dim, padding_idx=pad_idx)

    # Stack the per-line tensors into one 2-D batch of shape (num_texts, max_length)
    stacked_tensors = torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True, padding_value=pad_idx)
    embedded = embedding_bag(stacked_tensors)

    # Note: with a 2-D input each row is one bag, so the output holds one mean
    # embedding per text line, with shape (num_texts, embedding_dim)

    print(embedded)

if __name__ == "__main__":
    main()

Output (one 50-dimensional row per line of the input file; values vary between runs):

tensor([[-1.6118, -0.4554,  0.5279, -0.5009,  0.7052, -0.4306,  0.7289, -0.6750,
          0.4072,  1.0113, -0.8747,  0.7145, -0.5797, -0.4556,  0.9419, -0.1248,
          0.0888, -0.4001, -0.5126, -1.0407, -0.0327,  1.8303,  0.9100, -0.8426,
         -0.4496, -0.8193,  1.5006, -1.5324,  0.6471,  1.4066, -0.6371,  0.6539,
         -0.4501, -0.0812,  0.1221,  2.3178,  1.7144,  1.9879, -0.7364, -0.4359,
          0.6967,  0.3287, -0.4194,  1.3571, -0.7829,  0.2957,  0.1792,  0.4824,
         -0.0356,  1.2959],
        [-1.5826, -0.4662,  0.5125, -0.5000,  0.6775, -0.4096,  0.7248, -0.6835,
          0.4022,  0.9572, -0.8394,  0.6378, -0.5887, -0.4500,  0.8984, -0.1438,
          0.0985, -0.4026, -0.5085, -1.0061, -0.0082,  1.7877,  0.8991, -0.7848,
         -0.4094, -0.7691,  1.4907, -1.4910,  0.6322,  1.3926, -0.5950,  0.6120,
         -0.4088, -0.0629,  0.1178,  2.3096,  1.6717,  1.9310, -0.7448, -0.4001,
          0.6530,  0.3162, -0.4222,  1.3149, -0.7514,  0.2920,  0.1793,  0.4640,
         -0.0657,  1.2758]], grad_fn=<EmbeddingBagBackward0>)
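The task also asks for Embedding, which the program above does not use. As a rough sketch, assuming the texts have already been converted to index sequences (the two sequences below are made up, and index 0 is assumed to be the padding index), nn.Embedding returns one vector per token. A masked mean over the real tokens then gives a per-text vector comparable to what EmbeddingBag produces.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed padding index
# Made-up index sequences standing in for two vectorized text lines
sequences = [torch.tensor([4, 5, 2], dtype=torch.long),
             torch.tensor([3, 6], dtype=torch.long)]

# Pad to a common length: shape (2, 3)
batch = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)

embedding = nn.Embedding(num_embeddings=10, embedding_dim=50, padding_idx=PAD_IDX)
token_vectors = embedding(batch)  # shape (2, 3, 50): one vector per token

# Mean over real tokens only, ignoring padded positions
mask = (batch != PAD_IDX).unsqueeze(-1).float()  # shape (2, 3, 1)
text_vectors = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)

print(token_vectors.shape)  # torch.Size([2, 3, 50])
print(text_vectors.shape)   # torch.Size([2, 50])
```

Embedding keeps the per-token vectors, which sequence models need, while EmbeddingBag goes straight to the aggregated per-text vector.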