Text Splitting Methods, Part 1

Part 1: Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character-sized chunks, regardless of their content or form.

This method isn't recommended for any real application - but it's a great starting point for understanding the basics.

  • Pros: easy and simple
  • Cons: very rigid and doesn't take into account the structure of your text

Concepts to know:

  • Chunk Size - the number of characters you would like in your chunks: 50, 100, 100,000, etc.
  • Chunk Overlap - the amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. It will create duplicate data across chunks.
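These two knobs can be illustrated in a few lines of plain Python (a minimal sketch, assuming the start of each new chunk advances by `chunk_size - chunk_overlap`):

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into chunk_size-character chunks, where consecutive
    chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# ['abcd', 'defg', 'ghij', 'j']
```

Notice the duplicated characters at each chunk boundary ('d', 'g', 'j') - that is the overlap at work.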

First let's get some sample text

 
text = "This is the text I would like to chunk up. It is the example text for this exercise"

Then let's split this text manually

# Create a list that will hold your chunks
chunks = []

chunk_size = 35 # Characters

# Run through a range over the length of your text, stepping by chunk_size
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
chunks
['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']

Congrats! You just split your first text.

When working with text in the world of language models, we don't deal with raw strings - it is more common to work with documents. Documents are objects that hold the text you care about, plus additional metadata that makes filtering and manipulation easier later.
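Conceptually, a document is just text paired with metadata. A minimal sketch of the idea (a hypothetical class for illustration, not LangChain's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str                             # the text you care about
    metadata: dict = field(default_factory=dict)  # e.g. source file, page number

doc = Document("This is the example text", metadata={"source": "example.txt"})
```

The metadata is what enables filtering later, e.g. `[d for d in docs if d.metadata["source"] == "example.txt"]`.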

Let's load up LangChain's CharacterTextSplitter to do this for us

 
from langchain.text_splitter import CharacterTextSplitter

Then let's load up this text splitter. I need to specify the chunk overlap and separator

text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)

Then we can actually split our text via create_documents. Note: create_documents expects a list of texts, so if you just have a string (like we do), you'll need to wrap it in []

text_splitter.create_documents([text])
[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]

Chunk Overlap & Separators

Chunk overlap blends our chunks together so that the tail of chunk #1 is the same as the head of chunk #2, and so on and so forth.

This time I'll load up my overlap with a value of 4, meaning 4 characters of overlap

 
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=4, separator='')
text_splitter.create_documents([text])
[Document(page_content='This is the text I would like to ch'),
 Document(page_content='o chunk up. It is the example text'),
 Document(page_content='ext for this exercise')]

Notice how we have the same chunks, but now there is overlap between 1 & 2 and 2 & 3. The 'o ch' on the tail of chunk #1 matches the 'o ch' on the head of chunk #2.

Separators are the character sequences you would like to split on. Say you wanted to chunk your data at ch - you can specify that.

text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='ch')

text_splitter.create_documents([text])
[Document(page_content='This is the text I would like to'),
 Document(page_content='unk up. It is the example text for this exercise')]
  

Llama Index is a great choice for flexibility in the chunking and indexing process. It provides node relationships out of the box, which can aid retrieval later.

Let's take a look at their sentence splitter. It's similar to the character splitter, but with its default settings it splits on sentences.

from llama_index.text_splitter import SentenceSplitter
from llama_index import SimpleDirectoryReader

# Load up your splitter
splitter = SentenceSplitter(
    chunk_size=200,
    chunk_overlap=15,
)

# Load up your document
documents = SimpleDirectoryReader(
    input_files=["../data/PaulGrahamEssayMedium/mit.txt"]
).load_data()


# Create your nodes. Nodes are similar to documents, but with more relationship data added.
nodes = splitter.get_nodes_from_documents(documents)

nodes[0]
TextNode(id_='6ef3c8f2-7330-42f2-b492-e2d76a0a01f5', embedding=None, metadata={'file_path': '../data/PaulGrahamEssayMedium/mit.txt', 'file_name': 'mit.txt', 'file_type': 'text/plain', 'file_size': 36045, 'creation_date': '2023-12-21', 'last_modified_date': '2023-12-21', 'last_accessed_date': '2023-12-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='b8b090da-5c4d-40cf-8246-44dcf3008aa8', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '../data/PaulGrahamEssayMedium/mit.txt', 'file_name': 'mit.txt', 'file_type': 'text/plain', 'file_size': 36045, 'creation_date': '2023-12-21', 'last_modified_date': '2023-12-21', 'last_accessed_date': '2023-12-21'}, hash='203cdcab32f6aac46e4d95044e5dce8c3e2a2052c2d172def021e0724f515e36'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='3157bde1-5e51-489e-b4b7-f80b74063ea9', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='5ebb6555924d31f20f1f5243ea3bfb18231fbb946cb76f497dbc73310fa36d3a')}, hash='fe82de145221729f15921a789c2923659746b7304aa2ce2952b923f800d2b85d', text="Want to start a startup?  Get funded by\nY Combinator.\n\n\n\n\nOctober 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.  I think there will increasingly be a third option:\nto start your own startup.  But how common will that be?I'm sure the default will always be to get a job, but starting a\nstartup could well become as popular as grad school.  
In the late\n90s my professor friends used to complain that they couldn't get\ngrad students, because all the undergrads were going to work for\nstartups.", start_char_idx=2, end_char_idx=576, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

As you can see, there is a lot more relationship data held within Llama Index's nodes. We'll talk about those later.

Basic character splitting is likely only useful for a handful of applications.

Part 2: Recursive Character Text Splitting

The problem with plain character splitting is that we don't take the structure of our document into account at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we specify a series of separators that will be used to split our docs.

You can see LangChain's default separators below. Let's take a look at them one by one.

  • "\n\n" - Double new line, or most commonly paragraph breaks
  • "\n" - New lines
  • " " - Spaces
  • "" - Characters

This is the Swiss Army knife of splitters and my first choice when mocking up a quick application. If you don't know which splitter to start with, this is a good first bet.

Let's try it out

 
from langchain.text_splitter import RecursiveCharacterTextSplitter

Then let's load up a larger piece of text

 
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

Now let's make our text splitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)
text_splitter.create_documents([text])
[Document(page_content="One of the most important things I didn't understand about the"),
 Document(page_content='world when I was a child is the degree to which the returns for'),
 Document(page_content='performance are superlinear.'),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(page_content='meant well, but this is rarely true. If your product is only'),
 Document(page_content="half as good as your competitor's, you don't get half as many"),
 Document(page_content='customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are"),
 Document(page_content='superlinear in business. Some think this is a flaw of'),
 Document(page_content='capitalism, and that if we changed the rules it would stop being'),
 Document(page_content='true. But superlinear returns for performance are a feature of'),
 Document(page_content="the world, not an artifact of rules we've invented. We see the"),
 Document(page_content='same pattern in fame, power, military victories, knowledge, and'),
 Document(page_content='even benefit to humanity. In all of these, the rich get richer.'),
 Document(page_content='[1]')]

Notice how more of the chunks now end with a period '.'. This is because those are likely the ends of paragraphs, and the splitter looks for double new lines (paragraph breaks) first.

Once the paragraphs are split, it looks at the chunk size. If a chunk is too big, it splits it by the next separator. If the chunk is still too big, it moves on to the next separator, and so forth.
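The fallback logic described above can be sketched in plain Python (a simplified illustration, not LangChain's actual implementation - it omits merging small pieces back up toward the chunk size):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse into any piece that is
    still too big using the remaining (finer) separators."""
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size or not rest:
            chunks.append(piece)          # small enough (or nothing finer left)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

recursive_split("para one here\n\npara two is much longer than limit", 20)
```

Here the first paragraph fits within the 20-character budget and stays whole, while the second falls through to the space separator and comes back as individual words. The real splitter additionally merges adjacent small pieces back together up to chunk_size and applies the overlap.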

For text of this size, let's split into bigger chunks.

 
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])
[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

For this text, 450 splits the paragraphs perfectly. You can even switch the chunk size to 469 and get the same splits. This is because the splitter has some built-in buffer and wiggle room that lets your chunks 'snap' to the nearest separator.
