Week N4: Text Embeddings in NLP

lihuhelihu

Last modified: 2024-08-14 01:22:49

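Category: NLP小白入门 (NLP for Beginners)
Tags: natural language processing, artificial intelligence, machine translation, deep learning, machine learning, word2vec, language models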

First published: 2024-08-14 01:17:53

Article link: https://blog.csdn.net/lihuhelihu/article/details/141170783

This post is a learning log from the 365天深度学习训练营 (365-Day Deep Learning Training Camp).
Original author: K同学啊

Task: load the .txt file from Week N1 and build word embeddings for it with both EmbeddingBag and Embedding.

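The Week N1 .txt file is named "任务文件.txt". Its content is reproduced verbatim below (it stays in Chinese because the tokenization results later in this post depend on it):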

比较直观的编码方式是采用上面提到的字典序列。例如，对于一个有三个类别的问题，可以用1、2和3分别表示这三个类别。但是，这种编码方式存在一个问题，就是模型可能会错误地认为不同类别之间存在一些顺序或距离关系，而实际上这些关系可能是不存在的或者不具有实际意义的。
为了避免这种问题，引入了one-hot编码（也称独热编码）。one-hot编码的基本思想是将每个类别映射到一个向量，其中只有一个元素的值为1，其余元素的值为0。这样，每个类别之间就是相互独立的，不存在顺序或距离关系。例如，对于三个类别的情况，可以使用如下的one-hot编码：

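Word embedding is a natural language processing (NLP) technique that represents words as numbers so that a computer can process them. Put simply, it is a way of turning text into numeric input for a model.

The dictionary-index sequences and one-hot encoding covered in "Week N1: one-hot encoding example" are the earliest forms of word embedding.
Embedding and EmbeddingBag are the PyTorch tools for word embedding in text data. They map discrete vocabulary items into a low-dimensional continuous vector space, so that after training the semantic relationships between words can show up as geometric relationships between vectors.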

1. Embedding

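Embedding is the most basic word-embedding operation in PyTorch; TensorFlow provides an Embedding layer with the same purpose. It maps each discrete vocabulary index to a vector in a low-dimensional continuous space. The vectors start out random, and it is training that lets them capture semantic relationships between words. In PyTorch, the input to Embedding is an integer tensor in which every integer is the index of a word, and the output is a floating-point tensor in which every vector along the last dimension is the embedding of the corresponding word.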

●Input shape: [batch, seqSize] (seqSize is the length of a single text)
●Output shape: [batch, seqSize, embed_dim] (embed_dim is the embedding dimension)
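
The two shapes above can be checked with a few lines. This is only a minimal sketch: the vocabulary size of 10, the embedding dimension of 6, and the variable names are illustrative assumptions chosen to match the toy examples later in this post.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the random initial weights repeatable

# Assumed toy sizes: 10 vocabulary entries, 6-dimensional vectors
embedding = nn.Embedding(num_embeddings=10, embedding_dim=6)

# A batch of 2 sequences, each holding 4 word indices in [0, 10)
token_ids = torch.tensor([[1, 1, 1, 1],
                          [2, 2, 2, 0]], dtype=torch.long)

vectors = embedding(token_ids)
print(token_ids.shape)  # [batch, seqSize]
print(vectors.shape)    # [batch, seqSize, embed_dim]

# Each index simply selects one row of the weight matrix
same_row = torch.equal(vectors[0, 0], embedding.weight[1])
print(same_row)
```

The last check illustrates that the layer is a lookup table: index 1 returns row 1 of embedding.weight.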

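The embedding layer is initialized with random weights and learns an embedding for every word in the dataset. It is a flexible layer that can be used in several ways, for example:

●It can be part of a deep learning model, with the embeddings learned together with the model itself.
●It can be used to load a pretrained word-embedding model.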
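
As a rough illustration of the second use, nn.Embedding.from_pretrained builds the layer from an existing weight matrix. The 4 x 3 table below is made up for the example and is not a real pretrained model; the names frozen and tunable are likewise only illustrative.

```python
import torch
from torch import nn

# Hypothetical "pretrained" table: 4 words, 3 dimensions each
pretrained = torch.tensor([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3],
                           [0.2, 0.3, 0.4],
                           [0.3, 0.4, 0.5]])

# freeze=True (the default) keeps the vectors fixed during training
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
# freeze=False lets the optimizer fine-tune the loaded vectors
tunable = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([[1, 3]])
looked_up = frozen(ids)  # shape [1, 2, 3]
print(looked_up)
print(frozen.weight.requires_grad, tunable.weight.requires_grad)
```

Whether to freeze is a design choice: freezing keeps the pretrained geometry intact, while fine-tuning adapts it to the task at the risk of overfitting on small datasets.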

The embedding layer is normally placed as the first hidden layer of the network.
Function signature:

torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, 
                   max_norm=None,norm_type=2.0,scale_grad_by_freq=False, 
                   sparse=False,_weight=None,_freeze=False, device=None, 
                   dtype=None)

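Official API reference: the Embedding page of the PyTorch 2.0 documentation

Common parameters:

1.num_embeddings: the vocabulary size, i.e. the largest integer index + 1.
2.embedding_dim: the dimension of each word vector.

Below is a simple example that uses Embedding to convert three samples into word-embedding vectors: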

1.1. Custom dataset class

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts  = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        texts  = self.texts[idx]
        labels = self.labels[idx]
        
        return texts, labels

1.2. Defining the padding function

def collate_batch(batch):
    texts, labels = zip(*batch)
    max_len = max(len(text) for text in texts)
    padded_texts = [F.pad(text, (0, max_len - len(text)), value=0) for text in texts]
    padded_texts = torch.stack(padded_texts)
    labels = torch.tensor(labels, dtype=torch.float).unsqueeze(1)
    return padded_texts, labels
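
For comparison, the same padding can be written with torch.nn.utils.rnn.pad_sequence. This is an alternative sketch, not the approach used in this post; the function name collate_with_pad_sequence is an assumption made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_pad_sequence(batch):
    """Pad variable-length index tensors with 0 and stack the labels."""
    texts, labels = zip(*batch)
    # batch_first=True gives [batch, max_len] instead of [max_len, batch]
    padded_texts = pad_sequence(list(texts), batch_first=True, padding_value=0)
    labels = torch.stack(
        [torch.as_tensor(label, dtype=torch.float) for label in labels]
    ).unsqueeze(1)
    return padded_texts, labels

batch = [
    (torch.tensor([1, 1, 1, 1]), torch.tensor(4.0)),
    (torch.tensor([2, 2, 2]), torch.tensor(5.0)),
]
padded, y = collate_with_pad_sequence(batch)
print(padded)
print(y.shape)
```

Both versions pad on the right with 0, so they should be interchangeable as the collate_fn of a DataLoader.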

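1.3. Preparing the data and the data loader

# Suppose we have three samples, each made of a different number of word indices
text_data = [
    torch.tensor([1, 1, 1, 1], dtype=torch.long),  # sample 1
    torch.tensor([2, 2, 2], dtype=torch.long),     # sample 2
    torch.tensor([3, 3], dtype=torch.long)         # sample 3
]

# Corresponding labels (arbitrary demo values; BCEWithLogitsLoss normally expects targets in [0, 1])
labels = torch.tensor([4, 5, 6], dtype=torch.float)

# Create the dataset and the data loader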
my_dataset  = MyDataset(text_data, labels)
data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

for batch in data_loader:
    print(batch)

Output

(tensor([[1, 1, 1, 1],
        [2, 2, 2, 0]]), tensor([[4.],
        [5.]]))
(tensor([[3, 3]]), tensor([[6.]]))

1.4. Defining the model

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

    def forward(self, text):        
        print("embedding输入文本是：",text)
        print("embedding输入文本shape：",text.shape)
        embedding = self.embedding(text)
        embedding_mean = embedding.mean(dim=1)  # average the embedding vectors of each sample
        print("embedding输出文本shape：",embedding_mean.shape)
        return self.fc(embedding_mean)

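Important note:
The statement embedding_mean = embedding.mean(dim=1) averages the embedding vectors of each sample over the sequence dimension, so the result has shape [batch, embed_dim]. Without this averaging step, the embedding output keeps the shape [batch, seqSize, embed_dim]. Padded positions (index 0) are embedded like any other index, so they are included in this average.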
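
Since the plain mean over dim=1 also averages in the padded positions, one possible refinement is a masked mean. The sketch below assumes index 0 is reserved for padding only (which is not the case for the vocabulary built later in this post); PAD_IDX and the variable names are illustrative.

```python
import torch
from torch import nn

PAD_IDX = 0  # assumption: index 0 is reserved for padding only

# padding_idx makes row 0 an all-zero vector that receives no gradient
embedding = nn.Embedding(10, 6, padding_idx=PAD_IDX)

token_ids = torch.tensor([[1, 1, 1, 1],
                          [2, 2, 2, PAD_IDX]])

vectors = embedding(token_ids)                       # [batch, seq, embed_dim]
mask = (token_ids != PAD_IDX).unsqueeze(-1).float()  # [batch, seq, 1]

# Sum only the real tokens, then divide by the number of real tokens
lengths = mask.sum(dim=1).clamp(min=1.0)             # [batch, 1]
masked_mean = (vectors * mask).sum(dim=1) / lengths  # [batch, embed_dim]

# The plain mean divides by the padded length instead
naive_mean = vectors.mean(dim=1)
print(masked_mean.shape, naive_mean.shape)
```

For the second sample the masked mean equals the embedding of index 2, whereas the plain mean is scaled down by 3/4 because of the single padded position.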

1.5. Training the model

# Example vocabulary size and embedding dimension
vocab_size = 10
embed_dim = 6

# Create the model instance
model = EmbeddingModel(vocab_size, embed_dim)

# Define a simple loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(1):  # train for 1 epoch
    for batch in data_loader:
        texts, labels = batch

        # Forward pass
        outputs = model(texts)
        loss = criterion(outputs, labels)

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output

embedding输入文本是： tensor([[3, 3, 0],[2, 2, 2]])
embedding输入文本shape： torch.Size([2, 3])
embedding输出文本shape： torch.Size([2, 6])

embedding输入文本是： tensor([[1, 1, 1, 1]])
embedding输入文本shape： torch.Size([1, 4])
embedding输出文本shape： torch.Size([1, 6])

Epoch 1, Loss: 1.4471522569656372

2. EmbeddingBag

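EmbeddingBag builds on Embedding. Its core idea is to look up the embedding vectors of each input sequence (a "bag") and immediately pool them into a single vector by sum, mean, or max. Because it works on a flattened batch plus offsets, it handles variable-length sequences without padding, and because it does not materialize the intermediate [batch, seqSize, embed_dim] tensor, it reduces computation and memory use.
In PyTorch, the input to EmbeddingBag, in the form used here, is a 1-D integer tensor of word indices together with an offsets tensor. Every integer is the index of a word, and the offsets give the starting position of each sequence inside the flattened index tensor (not the position of each word). The output is a floating-point tensor with one row per sequence, holding the pooled embedding of that sequence.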

●Input shape: [seqsSize] (seqsSize is the total number of tokens in one batch)
●Output shape: [batch, embed_dim] (embed_dim is the embedding dimension)

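Suppose the original input data is [[1, 1, 1, 1],[2, 2, 2],[3, 3]] and one batch contains the samples [2, 2, 2] and [1, 1, 1, 1].
1.Input:
○The input is a flattened tensor of word indices (input), for example [2, 2, 2, 1, 1, 1, 1].
○The matching offsets, for example [0, 3], give the starting position of each sample in the flattened tensor.
2.Pooling:
○The embedding vectors are pooled per sample according to the offsets.
○Pooling can be sum, mean, or max; the default is mean.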
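
The input/offsets convention above can be checked against an ordinary Embedding that shares the same weights. This is a small sketch under assumed toy sizes (vocabulary 10, dimension 6); the names bag, plain, and expected are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

bag = nn.EmbeddingBag(10, 6, mode='mean')
# An ordinary Embedding with a copy of the same weight table, for comparison
plain = nn.Embedding.from_pretrained(bag.weight.detach().clone())

flat_ids = torch.tensor([2, 2, 2, 1, 1, 1, 1])
offsets = torch.tensor([0, 3])  # bag 0 = flat_ids[0:3], bag 1 = flat_ids[3:]

pooled = bag(flat_ids, offsets)  # [batch, embed_dim]

# Same result computed by hand: look up each slice, then average it
expected = torch.stack([
    plain(flat_ids[0:3]).mean(dim=0),
    plain(flat_ids[3:]).mean(dim=0),
])
print(pooled.shape)
print(torch.allclose(pooled, expected, atol=1e-6))
```

No padding is involved: each bag is averaged over exactly its own tokens.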

Function signature:

torch.nn.EmbeddingBag(num_embeddings, embedding_dim, max_norm=None, 
                      norm_type=2.0, scale_grad_by_freq=False, 
                      mode='mean', sparse=False, _weight=None, 
                      include_last_offset=False, padding_idx=None, 
                      device=None, dtype=None)

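Main parameters:

●num_embeddings (int): the size of the vocabulary.
●embedding_dim (int): the dimension of each word vector, i.e. the length of an embedding vector.
●mode (str): how the embedding vectors are aggregated. Possible values are 'sum', 'mean' and 'max'.
○(Suppose there is a sequence [2, 3, 1], where each number is the index of a discrete feature, and the corresponding embedding vectors are [[0.1, 0.2, 0.3],[0.2, 0.3, 0.4],[0.3, 0.4, 0.5]].)
○'sum': adds up all the embedding vectors; the pooled vector in 'sum' mode is [0.6, 0.9, 1.2].
○'mean': averages all the embedding vectors; the pooled vector in 'mean' mode is [0.2, 0.3, 0.4].
○'max': takes the element-wise maximum over all the embedding vectors; the pooled vector in 'max' mode is [0.3, 0.4, 0.5].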
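
The three modes can be reproduced with the numbers from the example above. The sketch assumes a 4-row weight table in which indices 2, 3 and 1 hold the three vectors in that order; row 0 is an unused filler added only so the indices are valid.

```python
import torch
from torch import nn

table = torch.tensor([[0.0, 0.0, 0.0],   # index 0 (unused filler row)
                      [0.3, 0.4, 0.5],   # index 1
                      [0.1, 0.2, 0.3],   # index 2
                      [0.2, 0.3, 0.4]])  # index 3

ids = torch.tensor([2, 3, 1])
offsets = torch.tensor([0])  # a single bag that starts at position 0

results = {}
for mode in ('sum', 'mean', 'max'):
    bag = nn.EmbeddingBag.from_pretrained(table, mode=mode)
    results[mode] = bag(ids, offsets)[0]  # pooled vector of the only bag
    print(mode, results[mode])
```

The printed vectors should agree with [0.6, 0.9, 1.2], [0.2, 0.3, 0.4] and [0.3, 0.4, 0.5] up to floating-point rounding.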

Below is a simple example that uses EmbeddingBag to convert three samples into word-embedding vectors and average them per sample.

2.1. Custom dataset class

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts  = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        texts  = self.texts[idx]
        labels = self.labels[idx]
        
        return texts, labels

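2.2. Preparing the data and the data loader

# Suppose we have three samples, each made of a different number of word indices
text_data = [
    torch.tensor([1, 1, 1, 1], dtype=torch.long),  # sample 1
    torch.tensor([2, 2, 2], dtype=torch.long),     # sample 2
    torch.tensor([3, 3], dtype=torch.long)         # sample 3
]

# Corresponding labels (arbitrary demo values; BCEWithLogitsLoss normally expects targets in [0, 1])
labels = torch.tensor([4, 5, 6], dtype=torch.float)

# Create the dataset and the data loader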
my_dataset     = MyDataset(text_data, labels)
data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=lambda x: x)

for batch in data_loader:
    print(batch)

Output

[(tensor([1, 1, 1, 1]), tensor(4.)), (tensor([2, 2, 2]), tensor(5.))]
[(tensor([3, 3]), tensor(6.))]

2.3. Defining the model

class EmbeddingBagModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingBagModel, self).__init__()
        self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

    def forward(self, text, offsets):
        print("embedding_bag输入文本是：",text)
        print("embedding_bag输入文本shape：",text.shape)
        embedded = self.embedding_bag(text, offsets)
        print("embedding_bag输出文本shape：",embedded.shape)
        return self.fc(embedded)

2.4. Training the model

# Example vocabulary size and embedding dimension
vocab_size = 10
embed_dim  = 6

# Create the model instance
model = EmbeddingBagModel(vocab_size, embed_dim)

# Define a simple loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(1):  # train for 1 epoch
    for batch in data_loader:
        # Flatten the batch and compute the offsets
        texts, labels = zip(*batch)

        # Each offset is the cumulative length of the preceding samples
        offsets = [0] + [len(text) for text in texts[:-1]]
        offsets = torch.tensor(offsets).cumsum(dim=0)
        texts   = torch.cat(texts)
        labels  = torch.tensor(labels).unsqueeze(1)

        # Forward pass
        outputs = model(texts, offsets)
        loss = criterion(outputs, labels)

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output

embedding_bag输入文本是： tensor([1, 1, 1, 1, 2, 2, 2])
embedding_bag输入文本shape： torch.Size([7])
embedding_bag输出文本shape： torch.Size([2, 6])

embedding_bag输入文本是： tensor([3, 3])
embedding_bag输入文本shape： torch.Size([2])
embedding_bag输出文本shape： torch.Size([1, 6])

Epoch 1, Loss: 11.76957893371582
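
The flattening and offset computation done inside the training loop above could also live in a collate function, so that the DataLoader yields ready-to-use tensors. The sketch below is one way to do that; collate_for_embedding_bag is a hypothetical name and is not part of the original code.

```python
import torch

def collate_for_embedding_bag(batch):
    """Flatten a list of (text, label) pairs and compute the bag offsets."""
    texts, labels = zip(*batch)
    lengths = torch.tensor([len(text) for text in texts])
    # Start of each bag = cumulative length of all previous bags
    offsets = torch.cat([torch.zeros(1, dtype=torch.long),
                         lengths.cumsum(dim=0)[:-1]])
    flat_texts = torch.cat(texts)
    labels = torch.stack(labels).unsqueeze(1)  # [batch, 1]
    return flat_texts, offsets, labels

batch = [
    (torch.tensor([1, 1, 1, 1]), torch.tensor(4.0)),
    (torch.tensor([2, 2, 2]), torch.tensor(5.0)),
]
flat_texts, offsets, batch_labels = collate_for_embedding_bag(batch)
print(flat_texts)
print(offsets)
print(batch_labels.shape)
```

Passed as collate_fn to a DataLoader, this would let the loop unpack texts, offsets, labels directly from each batch.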

3. Processing the txt file with Embedding

3.1. Custom dataset class

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts  = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        texts  = self.texts[idx]
        labels = self.labels[idx]
        
        return texts, labels

3.2. Defining the padding function

def collate_batch(batch):
    texts, labels = zip(*batch)
    max_len = max(len(text) for text in texts)
    padded_texts = [F.pad(text, (0, max_len - len(text)), value=0) for text in texts]
    padded_texts = torch.stack(padded_texts)
    labels = torch.tensor(labels, dtype=torch.float).unsqueeze(1)
    return padded_texts, labels

3.3. Preparing the data and the data loader

import torch
import torch.nn.functional as F
import jieba


# Open the txt file and split it into texts on whitespace
file_name = "./N1/任务文件.txt"
with open(file_name,"r",encoding = "utf-8") as file:
    context = file.read()
    texts = context.split()
texts


# Tokenize each text with jieba
tokenized_texts = [list(jieba.cut(text)) for text in texts]

# Build the vocabulary (set order is not fixed, so the indices can differ between runs)
word_index = {}
index_word = {}
for i, word in enumerate(set([word for text in tokenized_texts for word in text])):
    word_index[word] = i
    index_word[i] = word

# Convert each text into a sequence of integer indices
sequences = [[word_index[word] for word in text] for text in tokenized_texts]

# Vocabulary size
vocab_size = len(word_index)

# Convert each integer sequence into a one-hot style vector (1 for every word that occurs in the text)
one_hot_results = torch.zeros(len(texts), vocab_size)
for i, seq in enumerate(sequences):
    one_hot_results[i, seq] = 1

# Print the results
print("词汇表:")
print(word_index)
print("\n文本:")
print(texts)
print("\n分词结果")
print(tokenized_texts)
print("\n文本序列:")
print(sequences)
print("\nOne-Hot编码:")
print(one_hot_results)

Output

词汇表:
{'one': 0, '了': 1, '基本': 2, '类别': 3, '如下': 4, '实际上': 5, '-': 6, '其中': 7, '避免': 8, '情况': 9, '就是': 10, '存在': 11, '独立': 12, '）': 13, '实际意义': 14, '的': 15, '到': 16, '2': 17, '使用': 18, '提到': 19, '不同': 20, '一个': 21, '用': 22, '可以': 23, '例如': 24, '会': 25, '问题': 26, '相互': 27, '不': 28, '可能': 29, '之间': 30, '这些': 31, '编码方式': 32, '是': 33, '顺序': 34, '思想': 35, '将': 36, '分别': 37, '地': 38, '1': 39, '值': 40, '模型': 41, '有': 42, '比较': 43, '。': 44, '为': 45, '、': 46, '或': 47, '这种': 48, '0': 49, '映射': 50, '上面': 51, '只有': 52, '元素': 53, '独热': 54, '和': 55, '认为': 56, '距离': 57, '对于': 58, '称': 59, '其余': 60, '具有': 61, '编码': 62, '引入': 63, '关系': 64, '这样': 65, '为了': 66, '直观': 67, '也': 68, '字典': 69, '或者': 70, '，': 71, '三个': 72, '向量': 73, '错误': 74, '3': 75, 'hot': 76, '但是': 77, '采用': 78, '（': 79, '每个': 80, '：': 81, '序列': 82, '表示': 83, '而': 84, '一些': 85, '这': 86}

文本:
['比较直观的编码方式是采用上面提到的字典序列。例如，对于一个有三个类别的问题，可以用1、2和3分别表示这三个类别。但是，这种编码方式存在一个问题，就是模型可能会错误地认为不同类别之间存在一些顺序或距离关系，而实际上这些关系可能是不存在的或者不具有实际意义的。', '为了避免这种问题，引入了one-hot编码（也称独热编码）。one-hot编码的基本思想是将每个类别映射到一个向量，其中只有一个元素的值为1，其余元素的值为0。这样，每个类别之间就是相互独立的，不存在顺序或距离关系。例如，对于三个类别的情况，可以使用如下的one-hot编码：']

分词结果
[['比较', '直观', '的', '编码方式', '是', '采用', '上面', '提到', '的', '字典', '序列', '。', '例如', '，', '对于', '一个', '有', '三个', '类别', '的', '问题', '，', '可以', '用', '1', '、', '2', '和', '3', '分别', '表示', '这', '三个', '类别', '。', '但是', '，', '这种', '编码方式', '存在', '一个', '问题', '，', '就是', '模型', '可能', '会', '错误', '地', '认为', '不同', '类别', '之间', '存在', '一些', '顺序', '或', '距离', '关系', '，', '而', '实际上', '这些', '关系', '可能', '是', '不', '存在', '的', '或者', '不', '具有', '实际意义', '的', '。'], ['为了', '避免', '这种', '问题', '，', '引入', '了', 'one', '-', 'hot', '编码', '（', '也', '称', '独热', '编码', '）', '。', 'one', '-', 'hot', '编码', '的', '基本', '思想', '是', '将', '每个', '类别', '映射', '到', '一个', '向量', '，', '其中', '只有', '一个', '元素', '的', '值', '为', '1', '，', '其余', '元素', '的', '值', '为', '0', '。', '这样', '，', '每个', '类别', '之间', '就是', '相互', '独立', '的', '，', '不', '存在', '顺序', '或', '距离', '关系', '。', '例如', '，', '对于', '三个', '类别', '的', '情况', '，', '可以', '使用', '如下', '的', 'one', '-', 'hot', '编码', '：']]

文本序列:
[[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72, 3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77, 71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11, 85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61, 14, 15, 44], [66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44, 0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52, 21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3, 30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3, 15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]]

One-Hot编码:
tensor([[0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1.,
         0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0.,
         0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0.,
         0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1.,
         1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0.,
         1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 1., 1., 1.,
         1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1.,
         1., 0., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1.,
         1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.]])

# Convert the integer sequences into PyTorch tensors
text_data = [torch.tensor(seq, dtype=torch.long) for seq in sequences]

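# Assume the labels are floating-point values (define them according to the actual task; BCEWithLogitsLoss normally expects targets in [0, 1])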
labels = torch.tensor([1.0, 2.0], dtype=torch.float)

# Print the results
print("Text Data:", text_data)
print("Labels:", labels)

Output

Text Data: [tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
         3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72,  3, 44, 77,
        71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20,  3, 30, 11,
        85, 34, 47, 57, 64, 71, 84,  5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
        14, 15, 44]), tensor([66,  8, 48, 26, 71, 63,  1,  0,  6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
         0,  6, 76, 62, 15,  2, 35, 33, 36, 80,  3, 50, 16, 21, 73, 71,  7, 52,
        21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80,  3,
        30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72,  3,
        15,  9, 71, 23, 18,  4, 15,  0,  6, 76, 62, 81])]
Labels: tensor([1., 2.])

# Create the dataset and the data loader
my_dataset  = MyDataset(text_data, labels)
data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

for batch in data_loader:
    print(batch)

Output

(tensor([[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
          3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72,  3, 44, 77,
         71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20,  3, 30, 11,
         85, 34, 47, 57, 64, 71, 84,  5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
         14, 15, 44,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [66,  8, 48, 26, 71, 63,  1,  0,  6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
          0,  6, 76, 62, 15,  2, 35, 33, 36, 80,  3, 50, 16, 21, 73, 71,  7, 52,
         21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80,  3,
         30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72,  3,
         15,  9, 71, 23, 18,  4, 15,  0,  6, 76, 62, 81]]), tensor([[1.],
        [2.]]))
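
In the output above, the padding value 0 coincides with the index of a real word ('one': 0), and the set-based vocabulary can be numbered differently on every run. One possible way around both issues is sketched below. It uses a small pre-tokenized toy input instead of the jieba output, and the '<pad>' token is a hypothetical addition that does not appear in the original code.

```python
# Pre-tokenized toy input; in this post the tokens come from jieba.cut
tokenized_texts = [['one', '-', 'hot', '编码'], ['编码', '方式', 'one']]

PAD_TOKEN = '<pad>'  # hypothetical reserved token for padding

# sorted() makes the index assignment reproducible across runs;
# starting at 1 keeps index 0 free for padding
unique_words = sorted({word for text in tokenized_texts for word in text})
word_index = {PAD_TOKEN: 0}
word_index.update({word: i for i, word in enumerate(unique_words, start=1)})
index_word = {i: word for word, i in word_index.items()}

# Convert each text into a sequence of integer indices
sequences = [[word_index[word] for word in text] for text in tokenized_texts]
vocab_size = len(word_index)  # includes the padding slot

print(word_index)
print(sequences)
print(vocab_size)
```

With this layout, nn.Embedding(vocab_size, embed_dim, padding_idx=0) would keep the padding vector at zero, at the cost of one extra row in the weight table.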

3.4. Defining the model

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

    def forward(self, text):
        print("embedding输入文本是：",text)
        print("embedding输入文本shape：",text.shape)
        embedding = self.embedding(text)
        embedding_mean = embedding.mean(dim=1)  # average the embedding vectors of each sample
        print("embedding输出文本shape：",embedding_mean.shape)
        return self.fc(embedding_mean)

3.5. Training the model

# Vocabulary size (taken from the data) and embedding dimension
vocab_size = len(word_index)  # same value as computed in section 3.3
embed_dim = 10

# Create the model instance
model = EmbeddingModel(vocab_size, embed_dim)

# Define a simple loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(1):  # train for 1 epoch
    for batch in data_loader:
        texts, labels = batch

        # Forward pass
        outputs = model(texts)
        loss = criterion(outputs, labels)

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output

embedding输入文本是： tensor([[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
          3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72,  3, 44, 77,
         71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20,  3, 30, 11,
         85, 34, 47, 57, 64, 71, 84,  5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
         14, 15, 44,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [66,  8, 48, 26, 71, 63,  1,  0,  6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
          0,  6, 76, 62, 15,  2, 35, 33, 36, 80,  3, 50, 16, 21, 73, 71,  7, 52,
         21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80,  3,
         30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72,  3,
         15,  9, 71, 23, 18,  4, 15,  0,  6, 76, 62, 81]])
embedding输入文本shape： torch.Size([2, 84])
embedding输出文本shape： torch.Size([2, 10])
Epoch 1, Loss: 0.9843546152114868

4. Processing the txt file with EmbeddingBag

4.1. Custom dataset class

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts  = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        texts  = torch.tensor(self.texts[idx], dtype=torch.long)
        labels = torch.tensor(self.labels[idx], dtype=torch.float)
        
        return texts, labels

4.2. Preparing the data and the data loader

import torch
import torch.nn.functional as F
import jieba


# Open the txt file and split it into texts on whitespace
file_name = "./N1/任务文件.txt"
with open(file_name,"r",encoding = "utf-8") as file:
    context = file.read()
    texts = context.split()
texts


# Tokenize each text with jieba
tokenized_texts = [list(jieba.cut(text)) for text in texts]

# Build the vocabulary (set order is not fixed, so the indices can differ between runs)
word_index = {}
index_word = {}
for i, word in enumerate(set([word for text in tokenized_texts for word in text])):
    word_index[word] = i
    index_word[i] = word

# Convert each text into a sequence of integer indices
sequences = [[word_index[word] for word in text] for text in tokenized_texts]

# Vocabulary size
vocab_size = len(word_index)

# Convert each integer sequence into a one-hot style vector (1 for every word that occurs in the text)
one_hot_results = torch.zeros(len(texts), vocab_size)
for i, seq in enumerate(sequences):
    one_hot_results[i, seq] = 1

# Print the results
print("词汇表:")
print(word_index)
print("\n文本:")
print(texts)
print("\n分词结果")
print(tokenized_texts)
print("\n文本序列:")
print(sequences)
print("\nOne-Hot编码:")
print(one_hot_results)

Output

词汇表:
{'one': 0, '了': 1, '基本': 2, '类别': 3, '如下': 4, '实际上': 5, '-': 6, '其中': 7, '避免': 8, '情况': 9, '就是': 10, '存在': 11, '独立': 12, '）': 13, '实际意义': 14, '的': 15, '到': 16, '2': 17, '使用': 18, '提到': 19, '不同': 20, '一个': 21, '用': 22, '可以': 23, '例如': 24, '会': 25, '问题': 26, '相互': 27, '不': 28, '可能': 29, '之间': 30, '这些': 31, '编码方式': 32, '是': 33, '顺序': 34, '思想': 35, '将': 36, '分别': 37, '地': 38, '1': 39, '值': 40, '模型': 41, '有': 42, '比较': 43, '。': 44, '为': 45, '、': 46, '或': 47, '这种': 48, '0': 49, '映射': 50, '上面': 51, '只有': 52, '元素': 53, '独热': 54, '和': 55, '认为': 56, '距离': 57, '对于': 58, '称': 59, '其余': 60, '具有': 61, '编码': 62, '引入': 63, '关系': 64, '这样': 65, '为了': 66, '直观': 67, '也': 68, '字典': 69, '或者': 70, '，': 71, '三个': 72, '向量': 73, '错误': 74, '3': 75, 'hot': 76, '但是': 77, '采用': 78, '（': 79, '每个': 80, '：': 81, '序列': 82, '表示': 83, '而': 84, '一些': 85, '这': 86}

文本:
['比较直观的编码方式是采用上面提到的字典序列。例如，对于一个有三个类别的问题，可以用1、2和3分别表示这三个类别。但是，这种编码方式存在一个问题，就是模型可能会错误地认为不同类别之间存在一些顺序或距离关系，而实际上这些关系可能是不存在的或者不具有实际意义的。', '为了避免这种问题，引入了one-hot编码（也称独热编码）。one-hot编码的基本思想是将每个类别映射到一个向量，其中只有一个元素的值为1，其余元素的值为0。这样，每个类别之间就是相互独立的，不存在顺序或距离关系。例如，对于三个类别的情况，可以使用如下的one-hot编码：']

分词结果
[['比较', '直观', '的', '编码方式', '是', '采用', '上面', '提到', '的', '字典', '序列', '。', '例如', '，', '对于', '一个', '有', '三个', '类别', '的', '问题', '，', '可以', '用', '1', '、', '2', '和', '3', '分别', '表示', '这', '三个', '类别', '。', '但是', '，', '这种', '编码方式', '存在', '一个', '问题', '，', '就是', '模型', '可能', '会', '错误', '地', '认为', '不同', '类别', '之间', '存在', '一些', '顺序', '或', '距离', '关系', '，', '而', '实际上', '这些', '关系', '可能', '是', '不', '存在', '的', '或者', '不', '具有', '实际意义', '的', '。'], ['为了', '避免', '这种', '问题', '，', '引入', '了', 'one', '-', 'hot', '编码', '（', '也', '称', '独热', '编码', '）', '。', 'one', '-', 'hot', '编码', '的', '基本', '思想', '是', '将', '每个', '类别', '映射', '到', '一个', '向量', '，', '其中', '只有', '一个', '元素', '的', '值', '为', '1', '，', '其余', '元素', '的', '值', '为', '0', '。', '这样', '，', '每个', '类别', '之间', '就是', '相互', '独立', '的', '，', '不', '存在', '顺序', '或', '距离', '关系', '。', '例如', '，', '对于', '三个', '类别', '的', '情况', '，', '可以', '使用', '如下', '的', 'one', '-', 'hot', '编码', '：']]

文本序列:
[[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72, 3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77, 71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11, 85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61, 14, 15, 44], [66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44, 0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52, 21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3, 30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3, 15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]]

One-Hot编码:
tensor([[0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1.,
         0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0.,
         0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0.,
         0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1.,
         1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0.,
         1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 1., 1., 1.,
         1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1.,
         1., 0., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1.,
         1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.]])

# Index sequences of the tokenized texts
text_data = sequences

# Corresponding labels
# Assume there are two labels
labels = [0, 1]

# Create the dataset and the data loader
my_dataset     = MyDataset(text_data, labels)
data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=lambda x: x)

for batch in data_loader:
    print(batch)

Output

[(tensor([66,  8, 48, 26, 71, 63,  1,  0,  6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
         0,  6, 76, 62, 15,  2, 35, 33, 36, 80,  3, 50, 16, 21, 73, 71,  7, 52,
        21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80,  3,
        30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72,  3,
        15,  9, 71, 23, 18,  4, 15,  0,  6, 76, 62, 81]), tensor(1.)), (tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
         3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72,  3, 44, 77,
        71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20,  3, 30, 11,
        85, 34, 47, 57, 64, 71, 84,  5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
        14, 15, 44]), tensor(0.))]

4.3. Defining the model

class EmbeddingBagModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(EmbeddingBagModel, self).__init__()
        self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

    def forward(self, text, offsets):
        print("embedding_bag输入文本是：",text)
        print("embedding_bag输入文本shape：",text.shape)
        embedded = self.embedding_bag(text, offsets)
        print("embedding_bag输出文本shape：",embedded.shape)
        return self.fc(embedded)

4.4. Training the model

# Vocabulary size (taken from the data) and embedding dimension
vocab_size = len(word_index)  # same value as computed in section 4.2
embed_dim  = 6

# Create the model instance
model = EmbeddingBagModel(vocab_size, embed_dim)

# Define a simple loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(1):  # train for 1 epoch
    for batch in data_loader:
        # Flatten the batch and compute the offsets
        texts, labels = zip(*batch)

        # Each offset is the cumulative length of the preceding samples
        offsets = [0] + [len(text) for text in texts[:-1]]
        offsets = torch.tensor(offsets).cumsum(dim=0)
        texts   = torch.cat(texts)
        labels  = torch.tensor(labels).unsqueeze(1)

        # Forward pass
        outputs = model(texts, offsets)
        loss = criterion(outputs, labels)

        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output

embedding_bag输入文本是： tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
         3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72,  3, 44, 77,
        71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20,  3, 30, 11,
        85, 34, 47, 57, 64, 71, 84,  5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
        14, 15, 44, 66,  8, 48, 26, 71, 63,  1,  0,  6, 76, 62, 79, 68, 59, 54,
        62, 13, 44,  0,  6, 76, 62, 15,  2, 35, 33, 36, 80,  3, 50, 16, 21, 73,
        71,  7, 52, 21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65,
        71, 80,  3, 30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71,
        58, 72,  3, 15,  9, 71, 23, 18,  4, 15,  0,  6, 76, 62, 81])
embedding_bag输入文本shape： torch.Size([159])
embedding_bag输出文本shape： torch.Size([2, 6])
Epoch 1, Loss: 0.711330235004425
