How to Split Text by Semantic Similarity: An In-Depth Explanation and Practical Guide


llzwxh888


Tags: python

Article link: https://blog.csdn.net/ppoojjj/article/details/142698615


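In text processing and natural language processing (NLP), splitting text effectively is an important task. This article covers how to split text based on semantic similarity, a method described in Greg Kamradt's notebook 5_Levels_Of_Text_Splitting. The method uses text embeddings and places split points where the semantic similarity between adjacent pieces of text drops.

Introduction

This guide explains how to split text based on semantic similarity. If the embeddings of adjacent pieces of text are sufficiently far apart, the text is split between them. At a high level, the text is first split into sentences, the sentences are then grouped in threes, and groups that are similar in the embedding space are merged.

Main content

Installing dependencies

First, install the required libraries: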
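The first two steps (splitting into sentences, then grouping in threes) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names split_sentences and group_with_neighbors, the punctuation-based regular expression, and the buffer_size parameter are all assumptions made for this example.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends with ., ? or ! followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def group_with_neighbors(sentences, buffer_size=1):
    # Combine each sentence with up to buffer_size neighbors on each side,
    # so buffer_size=1 gives groups of up to three sentences.
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[start:end]))
    return groups


text = "Cats purr. Cats nap a lot. Stocks fell today. Bond yields rose."
sentences = split_sentences(text)
groups = group_with_neighbors(sentences)

for group in groups:
    print(group)
```

The first and last groups contain only two sentences because they have a neighbor on one side only. Each group would then be embedded, and adjacent groups compared in the embedding space.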

!pip install --quiet langchain_experimental langchain_openai

Loading example data

Let's load an example document:

# This is a long document that we will split into chunks
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

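Creating the text splitter

To instantiate SemanticChunker, we must specify an embedding model. Here we use OpenAIEmbeddings, which requires an OpenAI API key (typically supplied through the OPENAI_API_KEY environment variable):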

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

Splitting the text

We split the text by calling the .create_documents method, which returns a list of Document objects:

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

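Breakpoints

SemanticChunker decides where to split by looking at the difference between the embeddings of adjacent sentences. When that difference exceeds a threshold, the text is split at that point. The breakpoint_threshold_type parameter controls how the threshold is determined.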
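To make the idea concrete, here is a minimal sketch of threshold-based splitting. It assumes cosine distance as the measure of difference, uses hand-written two-dimensional vectors in place of real embeddings, and takes a fixed threshold. The function names cosine_distance and split_on_threshold are hypothetical helpers, not part of SemanticChunker.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity: near 0 for similar directions, larger otherwise.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_on_threshold(sentences, embeddings, threshold):
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    chunks = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bond yields rose."]
# Toy vectors standing in for embeddings (assumption for illustration only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = split_on_threshold(sentences, embeddings, threshold=0.5)
for chunk in chunks:
    print(chunk)
```

With these toy vectors, the only large distance is between the second and third sentences, so the text is split into two chunks there. The threshold types described next differ only in how the threshold value is derived from the observed distances.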

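Percentile

By default, splitting is based on percentiles: all differences between adjacent sentences are computed, and any difference above the X-th percentile becomes a split point.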

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
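The percentile rule can be illustrated with a small standard-library sketch. The distance values below are invented for the example, the percentile helper is a hypothetical function using linear interpolation, and the cutoff of 75 is chosen only so that the tiny sample produces visible splits.

```python
def percentile(values, q):
    # q-th percentile (0-100) with linear interpolation between ranks.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.11, 0.60, 0.09, 0.13, 0.55, 0.10]

threshold = percentile(distances, 75)
# A breakpoint at index i means: split between group i and group i + 1.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(threshold)
print(breakpoints)
```

Raising the percentile yields fewer, larger chunks, and lowering it yields more, smaller ones. In SemanticChunker the cutoff can be adjusted with the breakpoint_threshold_amount argument.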

Standard deviation

In this method, any difference more than X standard deviations above the mean becomes a split point:

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation")

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
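A minimal sketch of the standard deviation rule, using only the standard library. The distances are invented, std_threshold is a hypothetical helper, and the use of the population standard deviation is an assumption of this sketch.

```python
import statistics


def std_threshold(distances, k):
    # Threshold = mean + k population standard deviations (assumed form).
    return statistics.fmean(distances) + k * statistics.pstdev(distances)


# Made-up distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.11, 0.60, 0.09, 0.13, 0.55, 0.10]

for k in (1, 3):
    threshold = std_threshold(distances, k)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]
    print(k, round(threshold, 4), breakpoints)
```

On such a small sample, a multiplier of 3 puts the threshold above every observed distance, so no split happens at all. This shows why the multiplier has to be tuned to the document at hand.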

Interquartile range

In this method, the interquartile range of the differences is used to determine the split points:

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile")

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
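One plausible way to turn the interquartile range (IQR) into a threshold is sketched below: the mean distance plus a multiple of the IQR. That formula, the multiplier of 1.5, the percentile helper, and the distance values are all assumptions made for illustration, not a statement of what the library does internally.

```python
import statistics


def percentile(values, q):
    # q-th percentile (0-100) with linear interpolation between ranks.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.11, 0.60, 0.09, 0.13, 0.55, 0.10]

q1 = percentile(distances, 25)
q3 = percentile(distances, 75)
iqr = q3 - q1

# Assumed rule: mean + 1.5 * IQR marks an unusually large distance.
threshold = statistics.fmean(distances) + 1.5 * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

print(iqr)
print(breakpoints)
```

Because the IQR ignores the extreme values themselves, a few very large distances inflate this threshold less than they inflate one based on the standard deviation.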

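Gradient

This method applies the percentile method to the gradient of the distances rather than to the distances themselves. It is useful when chunks are highly correlated with each other or specific to a domain, such as legal or medical text.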

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="gradient")

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
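The gradient idea can be sketched as follows: compute how quickly the distance changes from one position to the next, then apply a percentile cutoff to those slopes. The gradient helper here uses central differences with one-sided differences at the ends. It, the percentile helper, the cutoff of 75, and the distance values are assumptions for this example.

```python
def gradient(values):
    # Central differences inside, one-sided differences at both ends.
    if len(values) < 2:
        raise ValueError("need at least two values")
    result = [values[1] - values[0]]
    for i in range(1, len(values) - 1):
        result.append((values[i + 1] - values[i - 1]) / 2.0)
    result.append(values[-1] - values[-2])
    return result


def percentile(values, q):
    # q-th percentile (0-100) with linear interpolation between ranks.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.11, 0.60, 0.09, 0.13, 0.55, 0.10]

slopes = gradient(distances)
threshold = percentile(slopes, 75)
# Positions where the distance is rising unusually fast.
breakpoints = [i for i, s in enumerate(slopes) if s > threshold]

print([round(s, 3) for s in slopes])
print(breakpoints)
```

The slopes peak just before each spike in the raw distances, so the detected positions differ slightly from those found by thresholding the distances directly. Working on slopes can help when all distances are uniformly small, as in tightly focused domain text, because what matters is a relative jump rather than an absolute value.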