Comparing the similarity of two texts with BERT embeddings

木下瞳

Last modified 2024-02-21 23:37:11

First published 2024-01-29 23:08:51

Article link: https://blog.csdn.net/zjkpy_5/article/details/135923005

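This article shows how to convert text to vectors with a BERT model and compute similarity on both CPU and GPU. It focuses on the advantage of GPU acceleration when comparing text at scale, on handling short texts, and on improving the accuracy of semantic similarity.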

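Task

Use the bert-base-chinese pretrained model to produce embeddings (convert text to vectors).

Model download

Model download: "BERT pretrained model download" (CSDN blog post)

Reference article: "Extracting word vectors with BERT" (使用bert提取词向量)

Example code: converting text to vectors

The function below converts an input sentence into per-token vectors.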

from transformers import BertTokenizer, BertModel
import torch

# Load the Chinese BERT model and tokenizer from a local directory
model_name = "../bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


def get_word_embedding(sentence):
    # Tokenize
    tokens = tokenizer.tokenize(sentence)
    # Add the special tokens [CLS] and [SEP]
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    # Map the tokens to their vocabulary ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Convert to a PyTorch tensor with a batch dimension of 1
    input_ids = torch.tensor([input_ids])

    # Run the model; gradients are not needed for inference
    with torch.no_grad():
        outputs = model(input_ids)

    # outputs[0] is the last hidden state: one contextual vector per token
    embedding = outputs[0]
    # Drop the vectors of the leading [CLS] and trailing [SEP]
    word_embedding = embedding[:, 1:-1, :]

    return word_embedding

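The line embedding[:, 1:-1, :] slices a tensor of shape

[batch_size, sequence_length, hidden_size], where:

batch_size is the number of text samples fed in at once.
sequence_length is the length of the input sequence, i.e. the number of tokens given to the encoder (including [CLS] and [SEP]).
hidden_size is the dimension of the hidden states, a hyperparameter of the BERT model: 768 for BERT-base and 1024 for BERT-large.

embedding[:, 1:-1, :] does give you the converted vectors, but going on to compute a similarity from them raises an error, typically because sentences of different lengths produce tensors whose sequence_length dimensions do not match. I therefore recommend using the code below directly.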
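
To make the shape issue concrete, here is a minimal pure-Python sketch. No BERT is involved: the tiny 3-dimensional vectors and the helper name mean_pool are made up for illustration. Two sentences with different token counts give matrices of different shapes, but averaging over the token axis yields one fixed-size vector per sentence, which is one common alternative to taking the [CLS] vector.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    hidden_size = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(hidden_size)]


# Hypothetical token embeddings: 2 tokens vs. 3 tokens, hidden_size = 3
sent_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sent_b = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0], [4.0, 1.0, 1.0]]

pooled_a = mean_pool(sent_a)
pooled_b = mean_pool(sent_b)

print(len(sent_a), len(sent_b))      # the token counts differ
print(len(pooled_a), len(pooled_b))  # the pooled vectors have the same length
```

Once both sentences are reduced to vectors of length hidden_size, any pairwise metric can be applied to them regardless of the original sentence lengths.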

Text to vectors and similarity calculation: CPU

This compares the similarity of two texts and can be used as is.

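from transformers import BertTokenizer, BertModel
import torch

# Load the tokenizer and model (same local path as in the previous example)
model_name = "../bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


def compare_sentence(sentence1, sentence2):
    # Tokenize
    tokens1 = tokenizer.tokenize(sentence1)
    tokens2 = tokenizer.tokenize(sentence2)
    # Add the special tokens [CLS] and [SEP]
    tokens1 = ['[CLS]'] + tokens1 + ['[SEP]']
    tokens2 = ['[CLS]'] + tokens2 + ['[SEP]']
    # Map the tokens to their indices in the vocabulary
    input_ids1 = tokenizer.convert_tokens_to_ids(tokens1)
    input_ids2 = tokenizer.convert_tokens_to_ids(tokens2)
    # Convert to PyTorch tensors with a batch dimension of 1
    input_ids1 = torch.tensor([input_ids1])
    input_ids2 = torch.tensor([input_ids2])

    # Run the model; gradients are not needed for inference
    with torch.no_grad():
        outputs1 = model(input_ids1)
        outputs2 = model(input_ids2)

    # outputs[0] is the last hidden state, shape [1, sequence_length, hidden_size]
    embedding1 = outputs1[0]
    embedding2 = outputs2[0]
    # Use the vector at the [CLS] position as the representation of the whole sentence
    sentence_embedding1 = embedding1[:, 0, :]
    sentence_embedding2 = embedding2[:, 0, :]

    # Euclidean distance between the two sentence vectors
    # PairwiseDistance computes the p-norm distance; p=2 gives the Euclidean (L2) distance
    euclidean_distance = torch.nn.PairwiseDistance(p=2)
    distance = euclidean_distance(sentence_embedding1, sentence_embedding2)
    # Cosine similarity
    # dim=1 is the feature (hidden_size) dimension of the [1, hidden_size] sentence vectors;
    # eps=1e-6 is a small positive number that keeps the denominator away from zero
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    similarity = cos(sentence_embedding1, sentence_embedding2)

    print("Sentence 1: ", sentence1)
    print("Sentence 2: ", sentence2)
    print("Similarity: ", similarity.item())
    print("Euclidean distance: ", distance.item())


compare_sentence("黄河南大街70号8门", "皇姑区黄河南大街70号8门")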
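
For reference, the two metrics computed above reduce to simple formulas: cosine similarity is the dot product divided by the product of the vector norms, and Euclidean distance is the norm of the difference. Below is a minimal pure-Python sketch on made-up 2-dimensional vectors. The helper names are illustrative and are not part of PyTorch, and the eps guard only loosely mirrors the eps argument used above.

```python
import math


def cosine_similarity(u, v, eps=1e-6):
    """Dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps is a small guard against division by zero
    return dot / max(norm_u * norm_v, eps)


def euclidean_distance(u, v):
    """L2 norm of the element-wise difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice as long

print(cosine_similarity(u, v))
print(euclidean_distance(u, v))
```

The two vectors point in the same direction, so the cosine similarity is at its maximum even though the Euclidean distance is not zero. Cosine similarity ignores vector length, which is one reason the two metrics can disagree.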

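Error: sequence too long. In BERT's config.json, max_position_embeddings is 512, so the model accepts at most 512 tokens, [CLS] and [SEP] included. In practice that leaves room for roughly 500 to 505 Chinese characters. If a text exceeds this length the similarity cannot be computed, so truncate the texts being compared as needed.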
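
One way to do that truncation by hand is sketched below in pure Python. It assumes the text has already been split into tokens (here each character stands in for one token, which only approximates what the real tokenizer does) and that the limit is the 512 positions mentioned above. The function name is hypothetical.

```python
def add_special_tokens_with_limit(tokens, max_positions=512):
    """Truncate so that [CLS] + tokens + [SEP] fits in max_positions."""
    budget = max_positions - 2          # reserve room for [CLS] and [SEP]
    return ['[CLS]'] + tokens[:budget] + ['[SEP]']


long_text = "黄河南大街70号8门" * 100      # 1000 characters, far over the limit
tokens = list(long_text)                 # stand-in for tokenizer.tokenize(...)
limited = add_special_tokens_with_limit(tokens)

print(len(tokens), len(limited))
```

With the transformers tokenizer the same effect comes from calling it with truncation=True and a max_length, as the batched example later in this article does.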

Two dissimilar texts can still get a similarity as high as 0.91:

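When BERT is used to compute text similarity, two texts that are not closely related in meaning may still get a high cosine similarity score (such as 0.91). Possible reasons include:

Generality of the pretrained model: BERT is pretrained on large-scale unlabeled data. Its embeddings capture language structure and context, but they do not necessarily discriminate semantic similarity precisely in a particular domain or task. Even sentences on entirely different topics may end up close together in vector space because they share common vocabulary or syntactic structure.
Mismatch between the training objective and the similarity task: during pretraining, BERT learns text representations mainly through masked language modeling (MLM) and next sentence prediction (NSP). Neither task directly optimizes a similarity measure between vectors.
No fine-tuning for the similarity task: powerful as BERT is, it usually needs task-specific fine-tuning before it performs well on text similarity. Vectors from a base model that has not been fine-tuned may not measure the semantic similarity of two texts accurately.
Noise and chance: because of how the vectors are distributed in the high-dimensional space, texts with different meanings may still overlap heavily along some dimensions, which raises the cosine similarity.
Short texts: for very short fragments, especially when both contain common words, the model may fail to capture the semantic features that distinguish them.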
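
A toy illustration of how unrelated texts can still score high: if every vector carries a large shared component, that component dominates the cosine similarity. The numbers below are invented for the sketch and are not measurements taken from BERT.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


common = [10.0, 10.0, 10.0, 10.0]     # component shared by every vector
specific_a = [1.0, -1.0, 0.0, 0.0]    # what is unique to text A
specific_b = [0.0, 0.0, 1.0, -1.0]    # what is unique to text B

vec_a = [c + s for c, s in zip(common, specific_a)]
vec_b = [c + s for c, s in zip(common, specific_b)]

# Close to 1 even though the text-specific parts have nothing in common
print(round(cosine(vec_a, vec_b), 3))

# Removing the shared component exposes the real difference
centered_a = [x - c for x, c in zip(vec_a, common)]
centered_b = [x - c for x, c in zip(vec_b, common)]
print(cosine(centered_a, centered_b))
```

In this constructed case the raw vectors look almost identical, while the centered vectors are orthogonal. Whether a similar effect explains a particular 0.91 score would have to be checked on real embeddings.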

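To improve the situation, consider the following strategies:

Use a BERT variant that has been fine-tuned specifically for text similarity.
Preprocess the input appropriately, for example by removing irrelevant words or extracting keywords.
Apply other methods suited to text similarity, such as further training the model in a Siamese network architecture, or using Sentence-BERT (SBERT), which is designed with an encoder structure and loss functions aimed at better sentence-level similarity representations.

Finally, in real applications choose the similarity method that fits the scenario and requirements, and expect to tune and validate the model parameters and configuration to reach the desired result.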

Text to vectors and similarity calculation: GPU, one pair at a time

import torch
from transformers import BertTokenizer, BertModel


# Check whether a GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    print("No GPU found; running on the CPU.")
    device = torch.device("cpu")

# Initialize the tokenizer and model, and move the model to the GPU if available
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese').to(device)

def compare_sentence(sentence1, sentence2):
    # Tokenize
    tokens1 = tokenizer.tokenize(sentence1)
    tokens2 = tokenizer.tokenize(sentence2)
    # Add the special tokens [CLS] and [SEP]
    tokens1 = ['[CLS]'] + tokens1 + ['[SEP]']
    tokens2 = ['[CLS]'] + tokens2 + ['[SEP]']
    # Map the tokens to their indices in the vocabulary
    input_ids1 = tokenizer.convert_tokens_to_ids(tokens1)
    input_ids2 = tokenizer.convert_tokens_to_ids(tokens2)

    # Convert the ids to tensors on the target device
    input_ids1 = torch.tensor([input_ids1]).to(device)
    input_ids2 = torch.tensor([input_ids2]).to(device)

    # Run the model on the device; gradients are not needed for inference
    with torch.no_grad():
        outputs1 = model(input_ids1)
        outputs2 = model(input_ids2)

    # Take the [CLS] vector as the sentence representation
    sentence_embedding1 = outputs1[0][:, 0, :]
    sentence_embedding2 = outputs2[0][:, 0, :]

    # Cosine similarity along dim=1, the feature dimension of the [1, hidden_size] vectors
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6).to(device)
    similarity = cos(sentence_embedding1, sentence_embedding2)

    # Copy the result to the CPU before printing
    similarity_cpu = similarity.detach().cpu()

    print("Sentence 1: ", sentence1)
    print("Sentence 2: ", sentence2)
    print("Similarity: ", similarity_cpu.item())


compare_sentence("黄河南大街70号8门", "皇姑区黄河南大街70号8门")

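The GPU does the computation, but the resulting tensor lives in GPU memory, so its value has to be copied back to host (CPU) memory before it can be printed. In general the GPU accelerates training and inference, while printing and any later display of the data happen on the CPU. Note that .item() performs this copy by itself, so the explicit .detach().cpu() step in the code is optional.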

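Text to vectors and similarity calculation: GPU, batched

Switching from one pair at a time to batched computation makes better use of the GPU's parallelism and can improve throughput significantly.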
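
Batching requires every sequence in a batch to have the same length, which is why the tokenizer call in the full example pads shorter inputs and returns an attention mask. A minimal pure-Python sketch of that idea follows. The token ids are made up, the helper name is hypothetical, and the pad id of 0 is an assumption for illustration.

```python
def pad_batch(id_lists, pad_id=0):
    """Pad every id list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in id_lists)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in id_lists]
    return input_ids, attention_mask


# Made-up token ids for two sentences of different lengths
batch = [[101, 7942, 3777, 102], [101, 4640, 102]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)
print(attention_mask)
```

Passing the attention mask to the model keeps the padded positions from influencing the real tokens.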

Below is example code with a batch size of 32.

import torch
import numpy as np
from transformers import BertTokenizer, BertModel


# Select the device: GPU if available, otherwise CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Running on the GPU.\n")
else:
    device = torch.device("cpu")
    print("No GPU available; running on the CPU.\n")

# Initialize the tokenizer and model, and move the model to the GPU if available
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese').to(device)


def batch_compare_sentences(sentences1, sentences2, batch_size=32):
    # sentences1 and sentences2 must have the same length
    assert len(sentences1) == len(sentences2)

    # Container for the results
    similarities = []

    # Split the sentences into batches
    for i in range(0, len(sentences1), batch_size):
        batch1 = sentences1[i:i+batch_size]
        batch2 = sentences2[i:i+batch_size]

        # Encode each batch with the tokenizer
        encoded_inputs1 = tokenizer(
            batch1,  # the batch of sentences to encode
            padding=True,  # pad every sentence to the longest one in the batch
            truncation=True,  # truncate inputs longer than max_length
            max_length=model.config.max_position_embeddings,
            return_tensors="pt"
        )
        encoded_inputs2 = tokenizer(
            batch2,
            padding=True,
            truncation=True,
            max_length=model.config.max_position_embeddings,
            return_tensors="pt"
        )

        input_ids1 = encoded_inputs1['input_ids'].to(device)
        attention_mask1 = encoded_inputs1['attention_mask'].to(device)
        input_ids2 = encoded_inputs2['input_ids'].to(device)
        attention_mask2 = encoded_inputs2['attention_mask'].to(device)

        # Run the model on the device without tracking gradients
        with torch.no_grad():
            outputs1 = model(input_ids1, attention_mask=attention_mask1)
            outputs2 = model(input_ids2, attention_mask=attention_mask2)

        # Take the [CLS] vector of each sentence as its sentence vector
        sentence_embeddings1 = outputs1[0][:, 0, :]
        sentence_embeddings2 = outputs2[0][:, 0, :]

        # Cosine similarity for each pair, computed along the feature dimension
        cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6).to(device)
        batch_similarities = cos(sentence_embeddings1, sentence_embeddings2)

        # Append the similarity of every pair in this batch to the results
        similarities.extend(batch_similarities.cpu().numpy())

    # Return the similarities from all batches
    return np.array(similarities)


# Usage example
sentences1 = ["黄河南大街70号8门", "另一个句子1", "更多句子..."]  # assume there are enough sentences
sentences2 = ["皇姑区黄河南大街70号8门", "另一个句子2", "更多句子..."]  # same length as sentences1

# Compute similarities in batches of 32 sentence pairs
similarities = batch_compare_sentences(sentences1, sentences2, batch_size=32)

# Print every sentence pair with its similarity.
# The returned array already covers all pairs, including a final batch
# smaller than batch_size, so no separate pass over the remainder is needed
# (a second pass would print the last pairs twice).
for s1, s2, sim in zip(sentences1, sentences2, similarities):
    print(f"Sentence 1: {s1}")
    print(f"Sentence 2: {s2}")
    print(f"Similarity: {sim:.4f}\n")  # four decimal places
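
To check how the range(0, n, batch_size) slicing used in batch_compare_sentences treats a final batch that is smaller than batch_size, here is a small pure-Python sketch. The helper name and the item count of 70 are arbitrary choices for illustration.

```python
def split_into_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


items = list(range(70))
batches = list(split_into_batches(items))

# Two full batches, then one batch with the 6 leftover items
print([len(b) for b in batches])
```

Because slicing past the end of a list simply returns the remaining items, the loop handles the leftover batch on its own.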