确定文本切割的最优策略
在使用基于检索的生成模型(RAG)处理长文本数据时,合理的文本切割策略是提高模型性能和效率的关键。
文本切割策略主要依赖于两个参数:chunk_size(块大小)和 chunk_overlap(重叠)。正确配置这些参数会显著影响模型的输出质量和处理速度。
- chunk_size 需基于模型的限制来选取(embedding model、LLM)
- 不同 Text splitter 的优劣,以及如何选取
- 可视化文本切分的效果,供大家切分文本时初步参考
基于模型选取chunk_size
- 首先是 embedding model:向量嵌入模型有 Max Tokens 的限制,设置的 chunk_size 不可以超过模型支持的最大长度,否则超出部分会被截断、丢失语义。不同 embedding model 支持的 Max Tokens 各不相同,具体可参考模型排行榜。
- 其次是 LLM:大语言模型有 Max sequence length 的限制,做知识增强时,prompt 中召回的文本不可以超出该最大长度。需要根据不同 LLM 支持的最大 token 长度,选取合适的参数(见下方示意)。
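下面是一个简单的示意(仅作草图):假设用 tiktoken 统计 OpenAI 系模型的 token 数,其中 max_context 为假设值,需按实际所用 LLM 的上下文长度调整。
import tiktoken
max_context = 8192  # 假设值:所用 LLM 支持的最大上下文 token 数,按实际模型填写
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
retrieved_chunks = ["召回的文本片段一……", "召回的文本片段二……"]
prompt = "请根据以下资料回答问题:\n" + "\n".join(retrieved_chunks)
n_tokens = len(enc.encode(prompt))
# 若超出上限,需要减少召回数量或减小 chunk_size
print(n_tokens, n_tokens <= max_context)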
不同的文本切分策略
- 1: CharacterTextSplitter - 这是最简单的方法。它按单个分隔符(separator,默认为 "\n\n")切分,并通过字符数来衡量块的长度。
- 2: RecursiveCharacterTextSplitter - 按一组分隔符列表递归地拆分文本。
- 3: Document Specific Splitting - 针对不同的文件类型使用不同的切分方式(PDF、Python、Markdown 等)
- 4: Semantic Splitting - 基于滑动窗口的语义切分
下面我们就实际看一下不同的 text splitter 的切分效果。
text = "大家好,我是果粒奶优有果粒,欢迎关注我,让我们一起探索AI。"
1 CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=5,
    chunk_overlap=1,
    length_function=len,
    is_separator_regex=False,
)
text_splitter.split_text(text)
['大家好,我', '我是果粒奶', '奶优有果粒', '粒,欢迎关', '关注我,让', '让我们一起', '起探索AI', 'I。']
切分原理
# 创建 chunks 列表,维护切分出的文本块
chunks = []
chunk_size = 5
chunk_overlap = 1
i = 0
while i < len(text):
    # 如果这不是第一块,就回溯 chunk_overlap 个字符以创建重叠
    if i > 0:
        start = max(i - chunk_overlap, 0)
    else:
        start = i
    # 确定这个块的结束位置
    end = min(start + chunk_size, len(text))
    # 提取块并添加到列表
    chunk = text[start:end]
    chunks.append(chunk)
    # 更新下一块的开始位置
    i = end
print(chunks)
['大家好,我', '我是果粒奶', '奶优有果粒', '粒,欢迎关', '关注我,让', '让我们一起', '起探索AI', 'I。']
2 RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter 的设计目的,是在不损失语义关联性的前提下,将文本有效分割成更小的单元:先尝试按段落分割,如果段落仍然过大,再尝试按句子分割,依此类推,直至按字符分割。这种分割方法尽量保留文本的原有结构和意义,使得处理后的文本单元在语义上保持连贯。默认的分隔符列表依次为:
- "\n\n" - 段落
- "\n" - 换行
- " " - 空格
- "" - 字符
text = '''
为什么文本切割在RAG中很重要?RAG(Retrieval-Augmented Generation)是一种将检索机制集成到生成式语言模型中的技术,目的是通过从大量文档或知识库中检索相关信息来增强模型的生成能力。这种方法特别适用于需要广泛背景知识的任务,如问答、文章撰写或详细解释特定主题。在RAG架构中,文本切割(即将长文本分割成较短片段的过程)非常重要,原因如下:
1. **提高检索效率:** 对于大规模的文档库,直接在整个库上执行检索任务既不切实际也不高效。通过将长文本切割成较短的片段,可以使检索过程更加高效,因为短文本片段更容易被比较和索引。这样可以加快检索速度,提高整体性能。
2. **提升结果相关性:** 当查询特定信息时,与查询最相关的内容往往只占据文档中的一小部分。通过文本切割,可以更精确地匹配查询和文档片段之间的相关性,从而提高检索到的信息的准确性和相关性。这对于生成高质量、相关性强的回答尤为重要。
3. **内存和处理限制:** 当代的语言模型,尽管强大,但处理长文本时仍受到内存和计算资源的限制。将长文本分割成较短的片段可以帮助减轻这些限制,因为模型可以分别处理这些较短的文本片段,而不是一次性处理整个长文档。
4. **提高生成质量:** 在RAG框架中,检索到的文本片段将直接影响生成模块的输出。通过确保检索到高质量和高相关性的文本片段,可以提高最终生成内容的质量和准确性。
5. **适应性和灵活性:** 文本切割允许模型根据需要处理不同长度的文本,增加了模型处理各种数据源的能力。这种灵活性对于处理多样化的查询和多种格式的文档特别重要。
总之,文本切割在RAG中非常重要,因为它直接影响到检索效率、结果的相关性、系统的处理能力,以及最终生成内容的质量和准确性。通过优化文本切割策略,可以显著提升RAG系统的整体性能和用户满意度。'''
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=50,
    chunk_overlap=1,
    length_function=len,
    is_separator_regex=False,
    separators=["\n\n", "\n", " ", ""],
)
chunk_doc = text_splitter.create_documents([text])
chunk_doc
[Document(page_content='为什么文本切割在RAG中很重要?RAG(Retrieval-Augmented'),
Document(page_content='Generation)是一种将检索机制集成到生成式语言模型中的技术,目的是通过从大量文档或知识库中'),
Document(page_content='中检索相关信息来增强模型的生成能力。这种方法特别适用于需要广泛背景知识的任务,如问答、文章撰写或详细'),
Document(page_content='细解释特定主题。在RAG架构中,文本切割(即将长文本分割成较短片段的过程)非常重要,原因如下:'),
Document(page_content='1. **提高检索效率:**'),
Document(page_content='对于大规模的文档库,直接在整个库上执行检索任务既不切实际也不高效。通过将长文本切割成较短的片段,可'),
Document(page_content='可以使检索过程更加高效,因为短文本片段更容易被比较和索引。这样可以加快检索速度,提高整体性能。'),
Document(page_content='2. **提升结果相关性:**'),
Document(page_content='当查询特定信息时,与查询最相关的内容往往只占据文档中的一小部分。通过文本切割,可以更精确地匹配查询'),
Document(page_content='询和文档片段之间的相关性,从而提高检索到的信息的准确性和相关性。这对于生成高质量、相关性强的回答尤为'),
Document(page_content='为重要。'),
Document(page_content='3. **内存和处理限制:**'),
Document(page_content='当代的语言模型,尽管强大,但处理长文本时仍受到内存和计算资源的限制。将长文本分割成较短的片段可以帮'),
Document(page_content='帮助减轻这些限制,因为模型可以分别处理这些较短的文本片段,而不是一次性处理整个长文档。'),
Document(page_content='4. **提高生成质量:**'),
Document(page_content='在RAG框架中,检索到的文本片段将直接影响生成模块的输出。通过确保检索到高质量和高相关性的文本片段'),
Document(page_content='段,可以提高最终生成内容的质量和准确性。'),
Document(page_content='5. **适应性和灵活性:**'),
Document(page_content='文本切割允许模型根据需要处理不同长度的文本,增加了模型处理各种数据源的能力。这种灵活性对于处理多样'),
Document(page_content='样化的查询和多种格式的文档特别重要。'),
Document(page_content='总之,文本切割在RAG中非常重要,因为它直接影响到检索效率、结果的相关性、系统的处理能力,以及最终'),
Document(page_content='终生成内容的质量和准确性。通过优化文本切割策略,可以显著提升RAG系统的整体性能和用户满意度。')]
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from transformers import AutoTokenizer
import pandas as pd
Embedding_name = 'BAAI/bge-large-zh-v1.5'
SentenceTransformer(Embedding_name).max_seq_length
512
len(chunk_doc[1].page_content)
49
chunk_doc[1].page_content
'Generation)是一种将检索机制集成到生成式语言模型中的技术,目的是通过从大量文档或知识库中'
tokenizer = AutoTokenizer.from_pretrained(Embedding_name)
len(tokenizer.encode(chunk_doc[1].page_content))
44
统计每个 chunk 的 token 数量分布
def plot_chunk(chunk_doc, Embedding_name):
    tokenizer = AutoTokenizer.from_pretrained(Embedding_name)
    length = [len(tokenizer.encode(doc.page_content)) for doc in chunk_doc]
    fig = pd.Series(length).hist()
    plt.show()
plot_chunk(chunk_doc, Embedding_name)
3 其他结构的文本切割
- python - RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)(见下方示意)
- json - RecursiveJsonSplitter
- Markdown - MarkdownTextSplitter
- Html - HTMLHeaderTextSplitter
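以 Python 代码为例,给出一个简单示意(参数仅作演示,基于 langchain_text_splitters 中按语言预设的分隔符):
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
python_code = """
def hello():
    print("hello")

class Foo:
    def bar(self):
        return 1
"""
# 查看 Python 语言预设的分隔符
print(RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON))
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
print(python_splitter.split_text(python_code))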
4 Semantic Chunking
我们在处理文本时通常使用固定的分块大小,而不考虑实际内容的语义。是否可以基于文本的语义实现一种更好的分块方法,即不再依赖固定参数(chunk_size),而是根据语义自行动态确定切分位置?
我们可以借助 embedding 技术来动态划分。
Embedding 是将文本转化为高维空间中向量的技术,这些向量能够反映出文本的语义内容。通过文本嵌入技术,可以捕捉到文本的深层次语义信息。当比较两段文本的嵌入向量时,可以根据它们在高维空间中的距离或者角度,来推断这两段文本在语义上的相似度或者差异。利用相似度,将语义上相似的文本自动分组在一起,形成聚类,这有助于更好地理解和组织大量的文本数据。
with open('dream.txt') as file:
    essay = file.read()
我们需要先将文本拆分成多个单句,可以按照标点符号进行切分
split_char = ['.', '?', '!']
import re
# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r'(?<=[.?!])\s+', essay)
print (f"{len(single_sentences_list)} senteneces were found")
77 senteneces were found
single_sentences_list
['I have a Dream\n\nby Martin Luther King, Jr.',
'Delivered on the steps at the Lincoln Memorial in Washington\nD.C.',
'on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation.',
'This momentous\ndecree came as a great beacon light of hope to millions of Negro\nslaves who had been seared in the flames of withering injustice.',
'It came as a joyous daybreak to end the long night of\ncaptivity.',
'But one hundred years later, we must face the tragic fact that\nthe Negro is still not free.',
'One hundred years later, the life\nof the Negro is still sadly crippled by the manacles of\nsegregation and the chains of discrimination.',
'One hundred years\nlater, the Negro lives on a lonely island of poverty in the\nmidst of a vast ocean of material prosperity.',
'One hundred years\nlater, the Negro is still languishing in the corners of American\nsociety and finds himself an exile in his own land.',
'So we have\ncome here today to dramatize an appalling condition.',
"In a sense we have come to our nation's capital to cash a check.",
'When the architects of our republic wrote the magnificent words\nof the Constitution and the declaration of Independence, they\nwere signing a promissory note to which every American was to\nfall heir.',
'This note was a promise that all men would be\nguarranteed the inalienable rights of life, liberty, and the\npursuit of happiness.',
'It is obvious today that America has defaulted on this\npromissory note insofar as her citizens of color are concerned.',
'Instead of honoring this sacred obligation, America has given\nthe Negro people a bad check which has come back marked\n"insufficient funds." But we refuse to believe that the bank of\njustice is bankrupt.',
'We refuse to believe that there are\ninsufficient funds in the great vaults of opportunity of this\nnation.',
'So we have come to cash this check -- a check that will\ngive us upon demand the riches of freedom and the security of\njustice.',
'We have also come to this hallowed spot to remind\nAmerica of the fierce urgency of now.',
'This is no time to engage\nin the luxury of cooling off or to take the tranquilizing drug\nof gradualism.',
'Now is the time to rise from the dark and\ndesolate valley of segregation to the sunlit path of racial\njustice.',
"Now is the time to open the doors of opportunity to all\nof God's children.",
'Now is the time to lift our nation from the\nquicksands of racial injustice to the solid rock of\nbrotherhood.',
'It would be fatal for the nation to overlook the urgency of the\nmoment and to underestimate the determination of the Negro.',
"This\nsweltering summer of the Negro's legitimate discontent will not\npass until there is an invigorating autumn of freedom and\nequality.",
'Nineteen sixty-three is not an end, but a beginning.',
'Those who hope that the Negro needed to blow off steam and will\nnow be content will have a rude awakening if the nation returns\nto business as usual.',
'There will be neither rest nor tranquility\nin America until the Negro is granted his citizenship rights.',
'The whirlwinds of revolt will continue to shake the foundations\nof our nation until the bright day of justice emerges.',
'But there is something that I must say to my people who stand on\nthe warm threshold which leads into the palace of justice.',
'In\nthe process of gaining our rightful place we must not be guilty\nof wrongful deeds.',
'Let us not seek to satisfy our thirst for\nfreedom by drinking from the cup of bitterness and hatred.',
'We must forever conduct our struggle on the high plane of\ndignity and discipline.',
'We must not allow our creative protest\nto degenerate into physical violence.',
'Again and again we must\nrise to the majestic heights of meeting physical force with soul\nforce.',
'The marvelous new militancy which has engulfed the Negro\ncommunity must not lead us to distrust of all white people, for\nmany of our white brothers, as evidenced by their presence here\ntoday, have come to realize that their destiny is tied up with\nour destiny and their freedom is inextricably bound to our\nfreedom.',
'We cannot walk alone.',
'And as we walk, we must make the pledge that we shall march\nahead.',
'We cannot turn back.',
'There are those who are asking the\ndevotees of civil rights, "When will you be satisfied?" We can\nnever be satisfied as long as our bodies, heavy with the fatigue\nof travel, cannot gain lodging in the motels of the highways and\nthe hotels of the cities.',
"We cannot be satisfied as long as the\nNegro's basic mobility is from a smaller ghetto to a larger one.",
'We can never be satisfied as long as a Negro in Mississippi\ncannot vote and a Negro in New York believes he has nothing for\nwhich to vote.',
'No, no, we are not satisfied, and we will not be\nsatisfied until justice rolls down like waters and righteousness\nlike a mighty stream.',
'I am not unmindful that some of you have come here out of great\ntrials and tribulations.',
'Some of you have come fresh from narrow\ncells.',
'Some of you have come from areas where your quest for\nfreedom left you battered by the storms of persecution and\nstaggered by the winds of police brutality.',
'You have been the\nveterans of creative suffering.',
'Continue to work with the faith\nthat unearned suffering is redemptive.',
'Go back to Mississippi, go back to Alabama, go back to Georgia,\ngo back to Louisiana, go back to the slums and ghettos of our\nnorthern cities, knowing that somehow this situation can and\nwill be changed.',
'Let us not wallow in the valley of despair.',
'I say to you today, my friends, that in spite of the\ndifficulties and frustrations of the moment, I still have a\ndream.',
'It is a dream deeply rooted in the American dream.',
'I have a dream that one day this nation will rise up and live\nout the true meaning of its creed: "We hold these truths to be\nself-evident: that all men are created equal."\n\nI have a dream that one day on the red hills of Georgia the sons\nof former slaves and the sons of former slaveowners will be able\nto sit down together at a table of brotherhood.',
'I have a dream that one day even the state of Mississippi, a\ndesert state, sweltering with the heat of injustice and\noppression, will be transformed into an oasis of freedom and\njustice.',
'I have a dream that my four children will one day live in a\nnation where they will not be judged by the color of their skin\nbut by the content of their character.',
'I have a dream today.',
"I have a dream that one day the state of Alabama, whose\ngovernor's lips are presently dripping with the words of\ninterposition and nullification, will be transformed into a\nsituation where little black boys and black girls will be able\nto join hands with little white boys and white girls and walk\ntogether as sisters and brothers.",
'I have a dream today.',
'I have a dream that one day every valley shall be exalted, every\nhill and mountain shall be made low, the rough places will be\nmade plain, and the crooked places will be made straight, and\nthe glory of the Lord shall be revealed, and all flesh shall see\nit together.',
'This is our hope.',
'This is the faith with which I return to the\nSouth.',
'With this faith we will be able to hew out of the\nmountain of despair a stone of hope.',
'With this faith we will be\nable to transform the jangling discords of our nation into a\nbeautiful symphony of brotherhood.',
'With this faith we will be\nable to work together, to pray together, to struggle together,\nto go to jail together, to stand up for freedom together,\nknowing that we will be free one day.',
'This will be the day when all of God\'s children will be able to\nsing with a new meaning, "My country, \'tis of thee, sweet land\nof liberty, of thee I sing.',
'Land where my fathers died, land of\nthe pilgrim\'s pride, from every mountainside, let freedom ring."\n\nAnd if America is to be a great nation this must become true.',
'So\nlet freedom ring from the prodigious hilltops of New Hampshire.',
'Let freedom ring from the mighty mountains of New York.',
'Let\nfreedom ring from the heightening Alleghenies of Pennsylvania!',
'Let freedom ring from the snowcapped Rockies of Colorado!',
'Let freedom ring from the curvaceous peaks of California!',
'But not only that; let freedom ring from Stone Mountain of\nGeorgia!',
'Let freedom ring from Lookout Mountain of Tennessee!',
'Let freedom ring from every hill and every molehill of\nMississippi.',
'From every mountainside, let freedom ring.',
'When we let freedom ring, whem we let it ring from every village\nand every hamlet, from every state and every city, we will be\nable to speed up that day when all of God\'s children, black men\nand white men, Jews and Gentiles, Protestants and Catholics,\nwill be able to join hands and sing in the words of the old\nNegro spiritual, "Free at last!',
'free at last!',
'thank God\nAlmighty, we are free at last!"\n']
我们需要为每个句子拼接上相邻的句子作为上下文,直接在 list 上操作比较麻烦,因此先将其转换为字典列表(List[dict]),每个元素形如:
{'sentence': xxx, 'index': 0}
sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(single_sentences_list)]
sentences[:3]
[{'sentence': 'I have a Dream\n\nby Martin Luther King, Jr.', 'index': 0},
{'sentence': 'Delivered on the steps at the Lincoln Memorial in Washington\nD.C.',
'index': 1},
{'sentence': 'on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation.',
'index': 2}]
def combine_sentences(sentences, buffer_size=1):
    # 将每个句子与其前后 buffer_size 个句子拼接,作为该句的上下文
    combined_sentences = [
        ' '.join(sentences[j]['sentence'] for j in range(max(i - buffer_size, 0), min(i + buffer_size + 1, len(sentences))))
        for i in range(len(sentences))
    ]
    # 更新原始字典列表,添加组合后的句子
    for i, combined_sentence in enumerate(combined_sentences):
        sentences[i]['combined_sentence'] = combined_sentence
    return sentences
sentences = combine_sentences(sentences)
sentences[:6]
[{'sentence': 'I have a Dream\n\nby Martin Luther King, Jr.',
'index': 0,
'combined_sentence': 'I have a Dream\n\nby Martin Luther King, Jr. Delivered on the steps at the Lincoln Memorial in Washington\nD.C.'},
{'sentence': 'Delivered on the steps at the Lincoln Memorial in Washington\nD.C.',
'index': 1,
'combined_sentence': 'I have a Dream\n\nby Martin Luther King, Jr. Delivered on the steps at the Lincoln Memorial in Washington\nD.C. on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation.'},
{'sentence': 'on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation.',
'index': 2,
'combined_sentence': 'Delivered on the steps at the Lincoln Memorial in Washington\nD.C. on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation. This momentous\ndecree came as a great beacon light of hope to millions of Negro\nslaves who had been seared in the flames of withering injustice.'},
{'sentence': 'This momentous\ndecree came as a great beacon light of hope to millions of Negro\nslaves who had been seared in the flames of withering injustice.',
'index': 3,
'combined_sentence': 'on August 28, 1963\n\nFive score years ago, a great American, in whose symbolic shadow\nwe stand signed the Emancipation Proclamation. This momentous\ndecree came as a great beacon light of hope to millions of Negro\nslaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of\ncaptivity.'},
{'sentence': 'It came as a joyous daybreak to end the long night of\ncaptivity.',
'index': 4,
'combined_sentence': 'This momentous\ndecree came as a great beacon light of hope to millions of Negro\nslaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of\ncaptivity. But one hundred years later, we must face the tragic fact that\nthe Negro is still not free.'},
{'sentence': 'But one hundred years later, we must face the tragic fact that\nthe Negro is still not free.',
'index': 5,
'combined_sentence': 'It came as a joyous daybreak to end the long night of\ncaptivity. But one hundred years later, we must face the tragic fact that\nthe Negro is still not free. One hundred years later, the life\nof the Negro is still sadly crippled by the manacles of\nsegregation and the chains of discrimination.'}]
接下来使用embedding model对sentences 进行编码
from langchain.embeddings import OpenAIEmbeddings
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass()
oaiembeds = OpenAIEmbeddings()
c:\Users\blackink\.conda\envs\llangchainhf\lib\site-packages\langchain_core\_api\deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.embeddings.openai.OpenAIEmbeddings` was deprecated in langchain-community 0.0.9 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.
warn_deprecated(
embeddings = oaiembeds.embed_documents([x['combined_sentence'] for x in sentences])
将embedding添加到sentence中
for i, sentence in enumerate(sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]
sentences[0]
{'sentence': 'I have a Dream\n\nby Martin Luther King, Jr.',
'index': 0,
'combined_sentence': 'I have a Dream\n\nby Martin Luther King, Jr. Delivered on the steps at the Lincoln Memorial in Washington\nD.C.',
'combined_sentence_embedding': [-0.02888452068466074,
  -0.027186900202331638,
  -0.03372773364944415,
  ...]}
接下来需要根据余弦相似度进行切分
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)
cosine_similarity(sentences[0]['combined_sentence_embedding'], sentences[1]['combined_sentence_embedding'])
0.9583807615923087
def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        # Calculate cosine similarity
        similarity = cosine_similarity(embedding_current, embedding_next)
        # Convert to cosine distance
        distance = 1 - similarity
        distances.append(distance)
        # Store distance in the dictionary
        sentences[i]['distance_to_next'] = distance
    return distances, sentences
distances, sentences = calculate_cosine_distances(sentences)
sentences[-2]['distance_to_next']
0.11893614462164948
import matplotlib.pyplot as plt
plt.plot(distances);
有很多方法可以基于这些距离来划分文本,这里我把任何超过距离第 95 百分位数的距离视为一个分割点。这是我们唯一需要配置的参数。
import numpy as np

plt.plot(distances)

y_upper_bound = 0.15
plt.ylim(0, y_upper_bound)
plt.xlim(0, len(distances))

# We need to get the distance threshold that we'll consider an outlier
# We'll use numpy .percentile() for this
breakpoint_percentile_threshold = 95
breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)  # If you want more chunks, lower the percentile cutoff
plt.axhline(y=breakpoint_distance_threshold, color='r', linestyle='-')

num_distances_above_threshold = len([x for x in distances if x > breakpoint_distance_threshold])  # The amount of distances above your threshold
plt.text(x=(len(distances) * .01), y=y_upper_bound / 50, s=f"{num_distances_above_threshold + 1} Chunks")

# Then we'll get the index of the distances that are above the threshold. This will tell us where we should split our text
indices_above_thresh = [i for i, x in enumerate(distances) if x > breakpoint_distance_threshold]  # The indices of those breakpoints on your list

# Start of the shading and text
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, breakpoint_index in enumerate(indices_above_thresh):
    start_index = 0 if i == 0 else indices_above_thresh[i - 1]
    end_index = breakpoint_index if i <= len(indices_above_thresh) - 1 else len(distances)

    plt.axvspan(start_index, end_index, facecolor=colors[i % len(colors)], alpha=0.25)
    plt.text(x=np.average([start_index, end_index]),
             y=breakpoint_distance_threshold + (y_upper_bound) / 20,
             s=f"Chunk #{i}", horizontalalignment='center',
             rotation='vertical')

# Additional step to shade from the last breakpoint to the end of the dataset
if indices_above_thresh:
    last_breakpoint = indices_above_thresh[-1]
    if last_breakpoint < len(distances):
        plt.axvspan(last_breakpoint, len(distances), facecolor=colors[len(indices_above_thresh) % len(colors)], alpha=0.25)
        plt.text(x=np.average([last_breakpoint, len(distances)]),
                 y=breakpoint_distance_threshold + (y_upper_bound) / 20,
                 s=f"Chunk #{i+1}",
                 rotation='vertical')

plt.title("Essay Chunks Based On Embedding Breakpoints")
plt.xlabel("Index of sentences in essay (Sentence Position)")
plt.ylabel("Cosine distance between sequential sentences")
plt.show()
# Initialize the start index
start_index = 0

# Create a list to hold the grouped sentences
chunks = []

# Iterate through the breakpoints to slice the sentences
for index in indices_above_thresh:
    # The end index is the current breakpoint
    end_index = index

    # Slice the sentence_dicts from the current start index to the end index
    group = sentences[start_index:end_index + 1]
    combined_text = ' '.join([d['sentence'] for d in group])
    chunks.append(combined_text)

    # Update the start index for the next group
    start_index = index + 1

# The last group, if any sentences remain
if start_index < len(sentences):
    combined_text = ' '.join([d['sentence'] for d in sentences[start_index:]])
    chunks.append(combined_text)

# grouped_sentences now contains the chunked sentences
for i, chunk in enumerate(chunks[:2]):
    buffer = 200
    print(f"Chunk #{i}")
    print(chunk[:buffer].strip())
    print("...")
    print(chunk[-buffer:].strip())
    print("\n")
Chunk #0
I have a Dream
by Martin Luther King, Jr. Delivered on the steps at the Lincoln Memorial in Washington
D.C. on August 28, 1963
Five score years ago, a great American, in whose symbolic shadow
we sta
...
. One hundred years
later, the Negro is still languishing in the corners of American
society and finds himself an exile in his own land. So we have
come here today to dramatize an appalling condition.
Chunk #1
In a sense we have come to our nation's capital to cash a check. When the architects of our republic wrote the magnificent words
of the Constitution and the declaration of Independence, they
were sign
...
fied, and we will not be
satisfied until justice rolls down like waters and righteousness
like a mighty stream. I am not unmindful that some of you have come here out of great
trials and tribulations.
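作为补充,langchain_experimental 中的 SemanticChunker 封装了与上面类似的、基于 embedding 距离断点的语义切分逻辑。下面是一个简单示意(仅作草图,接口以实际安装版本为准,并非本文手工实现本身):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # 与上文一致:按百分位数确定断点
)
docs = semantic_splitter.create_documents([essay])
print(len(docs))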