Getting Started with General Text Embedding (GTE) Models

Latest recommended article published on 2025-03-17 23:31:59

香辣脆脆鱼

Article tags: Artificial Intelligence

Article link: https://blog.csdn.net/weixin_43949898/article/details/141183467

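Text embedding is a technique that maps text (a word, a sentence, or a paragraph) to a vector in a continuous numerical space. Individual dimensions are generally not interpretable on their own; taken together they encode semantic and syntactic properties of the text, so that texts with similar meanings end up close to each other in the space.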
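To make "close to each other" concrete, here is a minimal sketch that compares vectors by cosine similarity. The three-dimensional vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not produced by any real model.
happy = [0.9, 0.1, 0.2]
joyful = [0.8, 0.2, 0.3]
invoice = [0.1, 0.9, 0.1]

print(cosine_similarity(happy, joyful))   # related texts: close to 1
print(cosine_similarity(happy, invoice))  # unrelated texts: much lower
```

A value near 1 means the two vectors point in almost the same direction, which is how embedding models signal that two texts are semantically similar.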

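General Text Embedding (GTE) models, introduced in "Towards General Text Embeddings with Multi-stage Contrastive Learning", are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently available in several sizes for both Chinese and English. GTE models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. This makes them applicable to a variety of downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.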
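As a rough illustration of what contrastive learning on text pairs optimizes, the sketch below computes a generic InfoNCE-style loss with in-batch negatives. This is a textbook formulation, not GTE's exact training objective; the function name, the temperature value, and the similarity matrices are all assumptions chosen for the example.

```python
import math

def info_nce_loss(sim, temperature=0.05):
    """Mean in-batch contrastive loss (generic InfoNCE, illustrative only).

    sim[i][j] is the similarity between query i and document j.
    Document i is the positive for query i; every other document
    in the batch serves as a negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive pair
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical similarity matrices for a batch of two pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]   # positives score highest
confused = [[0.1, 0.9], [0.8, 0.2]]  # negatives score highest

print(info_nce_loss(aligned))   # small loss
print(info_nce_loss(confused))  # large loss
```

The loss is low when each text is most similar to its own paired text and high otherwise, which is what pushes related pairs together in the embedding space during training.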

Evaluation benchmarks: MTEB and C-MTEB (the Chinese version)

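MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark for text embeddings. It evaluates embedding models across a collection of tasks and datasets, giving researchers and developers an objective and comprehensive basis for comparison.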

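C-MTEB, the Chinese version of MTEB, is an evaluation benchmark for Chinese text embedding models. It covers several classic task types, including classification, clustering, retrieval, reranking, and semantic textual similarity (STS), and provides a rich set of Chinese datasets.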
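To give a feel for how an STS-style task is scored, the sketch below computes the Spearman rank correlation between model similarities and human ratings, which is the usual metric for STS in MTEB-style benchmarks. It is a simplified stand-in rather than the benchmark's actual evaluation code, and the scores are invented for illustration.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: human similarity ratings and model cosine similarities
# for four sentence pairs.
gold_scores = [4.8, 1.0, 3.2, 0.5]
model_scores = [0.93, 0.41, 0.77, 0.45]

print(spearman(gold_scores, model_scores))  # approximately 0.8
```

Because only the ordering matters, a model is rewarded for ranking sentence pairs the same way humans do, regardless of the absolute scale of its similarity scores.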

Usage:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

input_texts = [
    'That is a happy person',
    'That is a very happy person',
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
print(batch_dict['input_ids'].shape)  # (batch, seq_len)
print(batch_dict['input_ids'])
#tensor([[  101,  9231,  8310,   143,  9200,  9205,  8640,  8224,   102,     0],
#        [  101,  9231,  8310,   143, 11785,  9200,  9205,  8640,  8224,   102]])
outputs = model(**batch_dict)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, 1024)

# Use the hidden state of the first token ([CLS]) as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]
print(embeddings.shape)  # (batch, 1024)
 
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=-1)
# Cosine similarity between the first text and the remaining ones
scores = embeddings[:1] @ embeddings[1:].T
print(scores.tolist())
# Similarity: [[0.9323416948318481]]
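The pairwise score above extends naturally to ranking several candidates against one query, which is the core of embedding-based retrieval. The sketch below assumes the embeddings have already been computed and converted to plain Python lists (for example with `embeddings.tolist()`); `rank_candidates` and the toy vectors are hypothetical and only stand in for real model output.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def rank_candidates(query, candidates):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for index, candidate in enumerate(candidates):
        c = l2_normalize(candidate)
        # For unit-length vectors the dot product equals the cosine similarity
        scored.append((index, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings standing in for real model output.
query_vec = [0.2, 0.9, 0.1]
candidate_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.3, 0.3, 0.9],
]

ranking = rank_candidates(query_vec, candidate_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```

With real GTE embeddings the same idea is usually expressed as a single matrix product over normalized tensors, as in the snippet above; the loop here just spells out what that product computes.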

Reference:

Text embedding evaluation with MTEB and C-MTEB (CSDN blog)