Sentence Transformers focuses on sentence and text embeddings and supports more than 100 languages. It leverages deep learning, in particular the Transformer architecture, to map text to points in a high-dimensional vector space, so that semantically similar texts end up geometrically close to each other. Typical applications include:
- Semantic search: build efficient semantic search systems that retrieve the most relevant results for a query.
- Information retrieval and re-ranking: find relevant documents in large collections and re-rank them.
- Clustering: automatically group texts to uncover hidden topics or patterns.
- Summary mining: identify and extract the main points of a text.
- Parallel sentence mining: find corresponding translated sentence pairs in multilingual data.
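The semantic-search use case boils down to comparing embedding vectors by cosine similarity. As a minimal sketch, the toy vectors below are made up for illustration; in a real system they would come from a sentence-embedding model:

```python
import numpy as np

# Toy 3-dimensional "embeddings" for a tiny corpus (made up for illustration;
# a real system would obtain these from a sentence-embedding model).
corpus = ["The weather is lovely today.", "He drove to the stadium."]
corpus_emb = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.8, 0.3]])

# Hypothetical embedding of the query "It's so sunny outside!"
query_emb = np.array([0.8, 0.2, 0.1])

def cosine_sim(a, b):
    # Cosine similarity: dot product of L2-normalized vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

scores = cosine_sim(query_emb[None, :], corpus_emb)[0]
best = int(np.argmax(scores))
print(corpus[best])  # the weather sentence scores highest
```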
💥Install with pip:
pip install -U sentence-transformers
💥Install with conda:
conda install -c conda-forge sentence-transformers
Quick start:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Load all-MiniLM-L6-v2, a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Compute the similarity between every pair of sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Output: `(3, 384)` (each sentence is embedded as a 384-dimensional vector), followed by a 3×3 similarity matrix.
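By default, `model.similarity` computes cosine similarity between the embeddings. The same matrix can be reproduced with plain NumPy; the small stand-in vectors below replace real model output just to show the computation:

```python
import numpy as np

# Stand-in embeddings (2 "sentences", 4 dims); real ones come from model.encode.
emb = np.array([[1.0, 0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0, 0.0]])

# Cosine similarity matrix: L2-normalize the rows, then take pairwise dot products.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = norm @ norm.T

print(sim)
# The diagonal is 1.0 (each sentence is identical to itself);
# off-diagonal entries are the pairwise cosine similarities.
```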
Cross Encoder
- Computes a similarity score for a given pair of texts.
- Usually slower than a Sentence Transformer model, because it must run a computation for every pair rather than for every individual text.
- A Cross Encoder is therefore often used to re-rank the top-k results of a Sentence Transformer model.
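That retrieve-then-rerank pattern can be sketched as follows. Here `fast_score` and `accurate_score` are hypothetical word-overlap stand-ins for a bi-encoder similarity and a cross-encoder pair score, used only to show the two-stage flow:

```python
def fast_score(query: str, doc: str) -> float:
    # Cheap stand-in for a bi-encoder: fraction of query words found in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def accurate_score(query: str, doc: str) -> float:
    # Stand-in for a cross encoder: slower, finer-grained pair score (Jaccard overlap).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def search(query, corpus, top_k=3):
    # Stage 1: score every document with the cheap model, keep the top_k candidates.
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:top_k]
    # Stage 2: re-rank only those candidates with the expensive model.
    return sorted(candidates, key=lambda doc: accurate_score(query, doc), reverse=True)

corpus = ["a man eats food", "a man eats bread", "a woman plays violin", "dogs bark loudly"]
print(search("a man eats pasta", corpus, top_k=2))
```

Only `top_k` documents ever reach the expensive scorer, which is exactly why pairing a fast bi-encoder with a slow cross encoder scales to large corpora.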
💯Cross Encoder (a.k.a. reranker) models are used much like Sentence Transformers:
from sentence_transformers.cross_encoder import CrossEncoder
# Choose the CrossEncoder model to load
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
# Define the query sentence and the corpus
query = "A man is eating pasta."
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A mo