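9.6.6 Strategy Selection Functions
(1) Several strategies for generating document summaries have been implemented so far. To make it easy to choose among them, the code below defines a set of functions that summarize a document according to the chosen strategy.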
def check_inputs(summarization_strategy, chunking_strategy):
    # Validate the summarization strategy first
    if summarization_strategy not in ['stuffing', 'map_reduce', 'refine']:
        raise ValueError(f'Wrong parameter "summarization_strategy": select either "stuffing", "map_reduce" or "refine". "{summarization_strategy}" was chosen instead.')
    # The chunking strategy is only checked when the text is actually chunked (not for stuffing)
    if chunking_strategy not in ['html', 'character', 'html_character'] and summarization_strategy != 'stuffing':
        raise ValueError(f'Wrong parameter "chunking_strategy": select either "html", "character" or "html_character". "{chunking_strategy}" was chosen instead.')

def prepare_prompt(langchain_pipeline, writeup, verbose):
    if verbose:
        print('> Applying chat template to the prompt')
    # A single user message carrying the instruction and the full writeup
    messages = [
        {
            "role": "user",
            "content": "Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n{}".format(writeup)
        }
    ]
    # Let the tokenizer wrap the message in the model's chat format
    prompt = langchain_pipeline.hf_pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return prompt

def mapreduce_strategy(langchain_pipeline, chunks):
    # Map step: prompt applied to every chunk independently
    prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:
{text}<end_of_turn>
<start_of_turn>model"""
    prompt_init = PromptTemplate.from_template(prompt_template)
    # Reduce step: prompt that merges the per-chunk summaries into one
    combine_template = """<bos><start_of_turn>user
You are given a text containing summaries of different part of a document.
Create one single summary combining all the information of the chapters. Divide the summary in chapters, be impersonal and use bullet points:
{text}<end_of_turn>
<start_of_turn>model"""
    combine_prompt = PromptTemplate.from_template(combine_template)
    chain = load_summarize_chain(langchain_pipeline,
                                 chain_type='map_reduce',
                                 map_prompt=prompt_init,
                                 combine_prompt=combine_prompt)
    out_summary = chain.invoke(chunks)
    return out_summary

def refine_strategy(langchain_pipeline, chunks):
    # Initial prompt: summarizes the first chunk
    prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:
{text}<end_of_turn>
<start_of_turn>model"""
    prompt_init = PromptTemplate.from_template(prompt_template)
    # Refine prompt: updates the running summary with each following chunk
    refine_template = """<bos><start_of_turn>user
Your job is to produce a final document divided in chapters and bullet points.
You are given a text containing an existing summary to a certain point:
{existing_answer}
You can now refine it (if necessary) with more context below.
{text}
Given the new context, refine the original summary.<end_of_turn>
<start_of_turn>model"""
    prompt_refine = PromptTemplate.from_template(refine_template)
    chain = load_summarize_chain(langchain_pipeline, chain_type='refine',
                                 return_intermediate_steps=True,
                                 input_key='input_documents',
                                 output_key='output_text',
                                 question_prompt=prompt_init,
                                 refine_prompt=prompt_refine)
    out_summary = chain.invoke(chunks, return_only_outputs=True)
    return out_summary

def prepare_chunks(text_to_split, chunking_strategy, verbose):
    if verbose:
        print(f'> Preparing text chunking. Strategy: {chunking_strategy}')
    output_chunks = text_to_split  # To avoid local variable referenced before assignment
    if chunking_strategy in ('html', 'html_character'):
        if verbose:
            print('> Splitting at HTML level')
        # Split on HTML headers
        headers_to_split_on = [
            ("h1", "Header 1"),
            ("h2", "Header 2")
        ]
        text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_element=False)
        output_chunks = text_splitter.split_text(text_to_split)
        if verbose:
            print(f"""
Length writup: {len(text_to_split)}
Number of html splits: {len(output_chunks)}
Length of each split: {[len(i.page_content) for i in output_chunks]}
""")
        for doc in output_chunks:
            # Join the metadata (header titles) and the content together
            final_content = '\n'.join(doc.metadata.values()) + '\n' + doc.page_content
            # Replace the old content with the enriched one
            doc.page_content = final_content
    if chunking_strategy in ('character', 'html_character'):
        if verbose:
            print('> Splitting at character level (newline char)')
        text_splitter = CharacterTextSplitter(separator='\n', chunk_size=2000, chunk_overlap=100)
        if chunking_strategy == 'character':
            # The raw string must first be wrapped in Document objects when there was no HTML split
            docs = text_splitter.create_documents([output_chunks])
            output_chunks = text_splitter.split_documents(docs)
        else:
            # The HTML splitter already returned Documents that can be split further
            output_chunks = text_splitter.split_documents(output_chunks)
        if verbose:
            print(f"""
Number of final splits: {len(output_chunks)}
Length of each final split: {[len(i.page_content) for i in output_chunks]}
""")
    return output_chunks

def summarize(langchain_pipeline, writeup, summarization_strategy='stuffing', chunking_strategy='html', verbose=True):
    # Fail early on unsupported strategy names
    check_inputs(summarization_strategy, chunking_strategy)
    if verbose:
        print('> Begin summarization')
    if summarization_strategy == 'stuffing':
        if verbose:
            print('> Summarization strategy: Stuffing. Ignoring chunking_strategy')
        prompt = prepare_prompt(langchain_pipeline=langchain_pipeline, writeup=writeup, verbose=verbose)
        if verbose:
            print('> Invoking chain...')
        output = langchain_pipeline.invoke(prompt)
    elif summarization_strategy == 'map_reduce':
        if verbose:
            print('> Summarization strategy: MapReduce')
        chunks = prepare_chunks(writeup, chunking_strategy, verbose)
        if verbose:
            print('> Invoking chain...')
        output = mapreduce_strategy(langchain_pipeline=langchain_pipeline, chunks=chunks)['output_text'].replace('\n\n', '\n')
    else:
        if verbose:
            print('> Summarization strategy: Refine')
        chunks = prepare_chunks(writeup, chunking_strategy, verbose)
        if verbose:
            print('> Invoking chain...')
        output = refine_strategy(langchain_pipeline=langchain_pipeline, chunks=chunks)['output_text']
    if verbose:
        print('\n########## SUMMARY ##########\n')
    return output
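The functions above work as follows:
- check_inputs: checks that the summarization_strategy and chunking_strategy passed by the user are valid. The chunking strategy is not checked when the Stuffing strategy is selected.
- prepare_prompt: builds the prompt for the document writeup by applying the tokenizer's chat template, ready for summary generation.
- mapreduce_strategy: implements the MapReduce strategy. It receives the document already split into chunks, generates a summary for each chunk, and then merges those summaries into one.
- refine_strategy: implements the Refine strategy. It summarizes the first chunk, then refines that summary step by step with the context from each following chunk.
- prepare_chunks: prepares the document chunks according to chunking_strategy, splitting on HTML headers (h1 and h2), on newlines with a chunk size of 2000 characters, or both in sequence.
- summarize: the main function and entry point of the whole flow. It first validates the input parameters, then calls the function that matches the selected strategy. With Stuffing, it generates the summary directly and ignores the chunking strategy. With MapReduce or Refine, it prepares the chunks first and then calls the corresponding strategy function.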
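To make the difference between the three strategies concrete, the following minimal sketch reproduces only their control flow in plain Python. The function toy_summarize is a hypothetical stand-in for the LLM call (it simply keeps the first few words), and none of these helpers belong to LangChain or to the code above.

```python
def toy_summarize(text, max_words=5):
    """Hypothetical stand-in for an LLM call: keep the first few words."""
    return ' '.join(text.split()[:max_words])


def stuffing(document):
    # One call on the whole document, no chunking.
    return toy_summarize(document)


def map_reduce(chunks):
    # Map: summarize every chunk independently of the others.
    partial = [toy_summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenation of the partial summaries.
    return toy_summarize(' '.join(partial), max_words=8)


def refine(chunks):
    # Start from a summary of the first chunk...
    summary = toy_summarize(chunks[0])
    # ...then revise it once per remaining chunk, in document order.
    for chunk in chunks[1:]:
        summary = toy_summarize(summary + ' ' + chunk, max_words=8)
    return summary


chunks = ['alpha beta gamma delta epsilon zeta',
          'one two three four five six']
stuffing_result = stuffing(' '.join(chunks))
map_reduce_result = map_reduce(chunks)
refine_result = refine(chunks)
print(stuffing_result)
print(map_reduce_result)
print(refine_result)
```

In this sketch the map step treats each chunk in isolation, so those calls could run in parallel, whereas the refine loop is inherently sequential because every step depends on the summary produced by the previous one.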
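Together, these functions form a summarization pipeline that lets the user choose a summarization strategy (Stuffing, MapReduce, or Refine) and a chunking strategy (HTML-based, character-based, or both combined). The verbose parameter controls whether progress details are printed along the way, which helps with debugging and tracing the summarization process.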
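The idea behind HTML-based chunking can be illustrated with the standard library alone. The sketch below groups text under the most recent h1 or h2 header. It is only an approximation of the concept and does not reproduce how HTMLHeaderTextSplitter is implemented; HeaderSectionParser and split_on_headers are hypothetical names introduced for this example, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Group text under the most recent header tag (hypothetical helper)."""

    def __init__(self, header_tags=('h1', 'h2')):
        super().__init__()
        self.header_tags = header_tags
        # The first section collects any text that precedes the first header.
        self.sections = [{'header': '', 'text': ''}]
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # Every header opens a new section.
            self.sections.append({'header': '', 'text': ''})
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        # Text inside a header becomes the title, everything else the body.
        key = 'header' if self._in_header else 'text'
        self.sections[-1][key] += data


def split_on_headers(html):
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    # Drop sections that hold neither a title nor body text.
    return [s for s in parser.sections
            if s['header'].strip() or s['text'].strip()]


sample_html = ('<p>Intro</p>'
               '<h1><strong>Our notebooks:</strong></h1><p>Inference</p>'
               '<h2>Model</h2><p>LSTM</p>')
sections = split_on_headers(sample_html)
for section in sections:
    # Prepend the title to the body, as prepare_chunks does with the metadata.
    print(repr(section['header'] + '\n' + section['text']))
```

Keeping the header title together with the body mirrors the enrichment step in prepare_chunks, where the header metadata is joined to page_content so that each chunk still carries its section title.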
(2) From the writeups dataset, select the row at index 20 and the column at index 9 (the 10th column, because Python indexing starts at 0) as custom_writeup, then print the first 500 characters of this document.
custom_writeup = writeups.iloc[20, 9]
print(custom_writeup[:500])
Running this prints:
<p>Team members are junseonglee11 (@junseonglee11), Ayaan Jang(@ayaanjang). <br>
We ensembled 6 LSTM models (2 different versions).<br>
We modified Robin Smith's and Robert Hatch's notebooks. </p>
<h1><strong>Our notebooks:</strong></h1>
<p>Inference: <a href="https://www.kaggle.com/code/ayaanjang/20th-tensorflow-lstm-model-inference-merged" target="_blank">https://www.kaggle.com/code/ayaanjang/20th-tensorflow-lstm-model-inference-merged</a><br>
Train: <a href="https://www.kaggle.com/code/junse
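The code above shows a small part of the selected document, which makes it easy to check that the content was loaded correctly or to skim its beginning. This is handy when working with a large dataset and inspecting one specific document.
(3) Call the summarize function defined earlier to generate a summary of custom_writeup with the Stuffing strategy, as shown below.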
output = summarize(
    langchain_pipeline=langchain_hf,  # Select the LangChain wrapper around the HF pipeline we init before
    writeup=custom_writeup,  # Pass the writeup
    summarization_strategy='stuffing',
    chunking_strategy='html'  # Select the chunking strategy. If stuffing, this parameter is ignored
)
display(Markdown(output.replace('#', '')))
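The code above renders the summary as formatted Markdown in Jupyter Notebook or any other environment that supports Markdown rendering (display and Markdown come from IPython.display). This suits cases where the summary should be shown in a structured, readable form. Running it prints: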
> Begin summarization
> Summarization strategy: Stuffing. Ignoring chunking_strategy
> Applying chat template to the prompt
> Invoking chain...
########## SUMMARY ##########
Summary
Chapter 1: Introduction
Team members used 6 LSTM models for an RNN model.
They modified existing notebooks and created new ones for data preprocessing and training.
Chapter 2: Data and Preprocessing
They preprocessed 96 time-series features from the "icecube" dataset.
They experimented with different feature engineering techniques to improve prediction accuracy.
They converted features to TFRecord format for efficient training.
Chapter 3: Model Training and Inference
They trained two versions of the model (with and without feature square root).
They used Adam optimizer and ensemble techniques to combine models.
They split the dataset into 10 folds for training and validation.
Chapter 4: Data Postprocessing
They modified the code to perform weighted average of predicted probabilities and calculate the azimuth and zenith directions of the particle.
They experimented with different post-processing techniques to improve the model's performance.
Chapter 5: Results and Discussion
They presented the results of training and post-processing.
They discussed the importance of feature engineering and ensemble techniques for improving the model's accuracy.
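The call output.replace('#', '') in step (3) removes every # character before rendering, presumably so that heading lines produced by the model are shown as ordinary text rather than as large Markdown headings. The short sketch below illustrates the effect on a made-up string (not actual model output) and, as a hypothetical alternative named strip_heading_markers, removes only the markers at the start of a line.

```python
import re

sample = "## Chapter 1: Introduction\n* Team members used 6 LSTM models."

# Same call as in the text: every '#' is removed, wherever it appears.
cleaned = sample.replace('#', '')
print(repr(cleaned))


def strip_heading_markers(text):
    """Hypothetical alternative: drop only leading '#' heading markers."""
    return re.sub(r'^#+[ \t]*', '', text, flags=re.MULTILINE)


# A '#' inside the body (for example in "C#") survives this variant.
print(repr(strip_heading_markers("# Tools\nWritten in C# and Python")))
```

The plain replace is the simplest option and is sufficient when the summary is unlikely to contain # in its body. The regular-expression variant is a possible refinement if such characters must be preserved.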
(4) Call the summarize function again, this time with the Refine strategy, to generate a summary of custom_writeup, as shown below.
output = summarize(
    langchain_pipeline=langchain_hf,
    writeup=custom_writeup,
    summarization_strategy='refine',
    chunking_strategy='html'
)
display(Markdown(output.replace('#', '')))
In short, the code above generates a summary of custom_writeup with the selected strategy and renders the result as Markdown in Jupyter Notebook. Running it prints:
> Begin summarization
> Summarization strategy: Refine
> Preparing text chunking. Strategy: html
> Splitting at HTML level
Length writup: 7083
Number of html splits: 6
Length of each split: [179, 282, 586, 1818, 848, 1000]
> Invoking chain...
########## SUMMARY ##########
Chapter 1: Introduction
LSTM Models
LSTM models are a powerful type of recurrent neural network (RNN) known for their ability to process sequential data effectively. They possess a unique structure with feedback loops that enable them to capture long-term dependencies between consecutive data points.
Chapter 2: Ensemble Learning
Model Selection and Configuration
We select two distinct versions of LSTM models:
Robin Smith's notebook
Robert Hatch's notebook
We optimize these notebooks by adjusting hyperparameters and experimenting with different configurations.
Chapter 3: Experimentation and Evaluation
We conduct a comprehensive set of experiments to identify the optimal settings for the ensemble. We employ various metrics to evaluate the performance of the ensemble, including accuracy, precision, and recall.
Chapter 4: Results and Discussion
The results demonstrate that the ensemble of 2 LSTM models achieves a significant improvement in performance compared to the individual models. We analyze the insights gained from the analysis to understand the strengths and weaknesses of each model.
Chapter 5: Conclusion
In this project, we explored ensemble learning for LSTM models, showcasing its effectiveness in enhancing performance. The results provide valuable insights into the power of combining multiple models for improved accuracy and robustness.
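The log of this run reports six HTML-level chunks, each shorter than the chunk_size=2000 configured for the character splitter in prepare_chunks. For documents with longer sections, the character and html_character strategies split further on newlines. The sketch below shows the basic idea as a greedy packing of lines into chunks. It is a simplified approximation that ignores chunk_overlap and is not LangChain's implementation; pack_lines is a hypothetical helper.

```python
def pack_lines(text, chunk_size=2000):
    """Greedily pack newline-separated lines into chunks of at most
    chunk_size characters (hypothetical helper, no overlap handling)."""
    chunks, current = [], ''
    for line in text.split('\n'):
        candidate = line if not current else current + '\n' + line
        if len(candidate) <= chunk_size or not current:
            # The line still fits, or it starts a new chunk on its own
            # (a single line longer than chunk_size is kept whole).
            current = candidate
        else:
            # The line does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


demo_chunks = pack_lines('aaa\nbbb\nccc', chunk_size=7)
print(demo_chunks)
print([len(chunk) for chunk in demo_chunks])
```

A smaller chunk_size yields more, shorter chunks and therefore more model calls in the MapReduce and Refine strategies, while a larger value moves the behavior closer to Stuffing.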