LangChain provides several text splitters, including CharacterTextSplitter, MarkdownHeaderTextSplitter, and RecursiveCharacterTextSplitter. The figure below summarizes what each splitter does:
[Figure: TextSplitter]
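To make the idea behind recursive splitting concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` and its default separator list are hypothetical, chosen only for illustration, and the sketch leaves out overlap and the merging of small pieces back into larger chunks. It shows only the core idea of trying coarse separators first and falling back to finer ones for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration (hypothetical helper, not LangChain code).

    Split on the coarsest separator first, then recurse with finer
    separators on any piece that is still longer than chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty one): cut into fixed-width slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty strings produced by adjacent separators
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    return chunks


print(recursive_split("aaa bbb ccc", 5))  # ['aaa', 'bbb', 'ccc']
print(recursive_split("abcdefgh", 3))     # ['abc', 'def', 'gh']
```

Because this sketch never merges neighbouring pieces, it tends to produce more and smaller chunks than a real splitter would for the same `chunk_size`.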
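The code below uses RecursiveCharacterTextSplitter to split a piece of text; a CharacterTextSplitter with the same settings is created alongside it.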
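# Import the two splitters from LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 20     # target maximum number of characters per chunk
chunk_overlap = 4   # up to this many characters are shared by adjacent chunks

# Recursive splitter: tries a list of separators in order
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# Character splitter: splits on a single separator
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap)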
text = "hello world, how about you? thanks, I am fine. the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then